Automatic Normalization through Morphological Disambiguation

This is a preliminary version of an automatic text normalizer for Southern Quechua (QIIC), which is written in a wide range of different orthographies. The normalizer rewrites text into Unified Southern Quechua, the standard orthography proposed by the Peruvian linguist Cerrón-Palomino [1]. For a detailed description of the orthography used for the normalization, please read the information on our spell checker website.

IMPORTANT: This is not a spell checker. The normalizer cannot analyse misspelled or grammatically incorrect words: disambiguation and normalization are limited to grammatically correct words. Words that cannot be analysed are printed out unchanged. If your text contains spelling or grammatical errors (e.g. wsai instead of wasi, or yawarp instead of yawarpa), please correct those with the spell checker first!

The procedure for the text normalization is as follows:

  1. sentence splitting with the Lingua::Sentence Perl module
  2. tokenization with the Uplug::PreProcess::Tokenizer Perl module
  3. morphological analysis with a cascade of finite state transducers, including a special transducer that handles Spanish loan words. This transducer contains the Spanish lexicon from FreeLing, see [2].
  4. morphological disambiguation with Wapiti, an open-source implementation of conditional random fields
  5. generation of the normalized word forms

Note that the morphological disambiguation relies on statistical models trained on a relatively small training corpus, so not all word forms will be disambiguated correctly. We evaluated the normalization procedure on two texts: Hanaq pachaman wichaq wayna [3] and the last 70 sentences of Gregorio Condori Mamani's biography [4].
The models correctly disambiguated 96.46% of the word forms in the first text and 93.01% in the second. Many morphological ambiguities are irrelevant for the conversion to the standard orthography, so the proportion of correctly normalized forms is higher than the proportion of correctly disambiguated forms:

Morphological Disambiguation

                                Hanaq pachaman wichaq wayna    biography Gregorio Condori Mamani
total words                     1298                           844
correctly disambiguated form    1252    96.46%                 785    93.01%
wrongly disambiguated form      33      2.54%                  10     2.49%
still ambiguous                 0       -                      7      5.51%
failed to analyse               9       0.48%                  10     2.46%

Normalization

                                Hanaq pachaman wichaq wayna    biography Gregorio Condori Mamani
total words                     1298                           844
correct normalized form         1295    99.77%                 834    98.82%
wrong normalized form           3       0.23%                  10     1.18%
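
The headline percentages are simply the number of correct forms divided by the total number of words in each text. As a quick check, the following Python sketch recomputes them from the counts reported above; the function name and the dictionary layout are illustrative choices and not part of the normalizer itself.

```python
def percentage(count, total):
    """Return count/total as a percentage, rounded to two decimals."""
    return round(100.0 * count / total, 2)


# Counts copied from the evaluation tables (total words, correctly
# disambiguated forms, correctly normalized forms).
RESULTS = {
    "Hanaq pachaman wichaq wayna": {
        "total": 1298, "disambiguated": 1252, "normalized": 1295,
    },
    "Gregorio Condori Mamani": {
        "total": 844, "disambiguated": 785, "normalized": 834,
    },
}

for text, counts in RESULTS.items():
    print(
        text,
        percentage(counts["disambiguated"], counts["total"]),
        percentage(counts["normalized"], counts["total"]),
    )
```

For both texts the normalization rate is higher than the disambiguation rate, because a wrongly disambiguated word still receives the correct normalized spelling whenever the competing analyses are written the same way.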

Nevertheless, you should probably proofread your text. If a word could not be disambiguated, all of its ambiguous forms are included in the output, separated by '/'.
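
When proofreading longer texts, it can help to list the tokens that were left ambiguous. The sketch below is one possible way to do this in Python. It assumes that the output is whitespace-separated and that '/' appears only as the separator between alternative forms; both are assumptions about the output format, and the function name is hypothetical.

```python
def ambiguous_tokens(normalized_text):
    """Return (position, alternatives) for every token that contains '/'.

    Assumption: tokens are separated by whitespace and '/' is used only
    to separate alternative normalized forms.
    """
    hits = []
    for position, token in enumerate(normalized_text.split()):
        if "/" in token:
            # Drop empty pieces so a stray leading or trailing '/' is ignored.
            alternatives = [alt for alt in token.split("/") if alt]
            if len(alternatives) > 1:
                hits.append((position, alternatives))
    return hits


# 'formA/formB' is a placeholder, not a real Quechua word pair.
sample = "wasi formA/formB yawarpa"
for position, alternatives in ambiguous_tokens(sample):
    print(position, " or ".join(alternatives))
```

A slash that was already present in the input (for example in a date) would also be reported, so the result is a list of candidates to review rather than a definitive list of ambiguities.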

Bibliography:

[1]
Unified Southern Quechua as described by Cerrón-Palomino (1994):
Cerrón-Palomino, R. (1994). Quechua sureño, diccionario unificado quechua-castellano, castellano-quechua. Lima: Biblioteca Nacional del Perú
[2]
Lluís Padró. Analizadores Multilingües en FreeLing. Linguamatica, vol. 3, n. 2, pp. 13-20. December 2011.
[3]
Jorge Lira. 1990. Cuentos del Alto Urubamba. Centro de Estudios Regionales Andinos 'Bartolomé de las Casas', Cuzco.
[4]
Ricardo Valderrama Fernandez and Carmen Escalante Gutierrez. 1977. Gregorio Condori Mamani - Autobiografía. Biblioteca de la Tradición Oral Andina. Centro de Estudios Rurales Andinos 'Bartolomé de las Casas', Cuzco.

Input should be encoded in UTF-8; otherwise, non-ASCII characters might not be handled correctly.
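
If you are unsure about the encoding of a file, you can check and convert it before submitting the text. The following Python sketch tries to decode the bytes as UTF-8 and falls back to Latin-1 if that fails. The fallback encoding is an assumption (a common legacy encoding for Spanish and Quechua texts), and the function name is hypothetical; adjust both to your data.

```python
def to_utf8_text(raw_bytes, fallback_encoding="latin-1"):
    """Decode bytes as UTF-8; if they are not valid UTF-8, use the fallback.

    Assumption: text that is not valid UTF-8 is in a legacy single-byte
    encoding such as Latin-1.
    """
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return raw_bytes.decode(fallback_encoding)


# The same sample word stored in two different encodings.
utf8_bytes = "ñawi".encode("utf-8")
latin1_bytes = "ñawi".encode("latin-1")

print(to_utf8_text(utf8_bytes))
print(to_utf8_text(latin1_bytes))
```

To convert a file, read it in binary mode, pass the bytes to the function, and write the result back with `open(path, "w", encoding="utf-8")`.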

The source code, the training and test files, and all the models are available from our GitHub repository.


For the normalization, please indicate whether the direct evidential suffix has the form -n in the text to normalize:
→ otherwise, all occurrences of -n will be assumed to be third-person markers

Ayacucho Quechua (evidential suffix -m)
Cuzco/Bolivian Quechua (evidential suffix -n)


Please indicate whether the additive suffix occurs as -pis in the text to normalize:
→ otherwise, all occurrences of -pis will be assumed to mean -pi -s (locative + indirect evidentiality)

additive occurs only as -pas
additive occurs as -pis

Note: This is only a web demo. If the normalization takes longer than 3 minutes, the process will be aborted.



your text in Unified Southern Quechua: