Quechua Normalization

Automatic Normalization through Morphological Disambiguation

This is a preliminary version of an automatic text normalizer for Southern Quechua. Southern Quechua (QIIC) is written in a wide range of orthographies. This normalizer rewrites text into the Unified Southern Quechua orthography proposed by the Peruvian linguist Cerrón-Palomino [1]. For a detailed description of the orthography used for the normalization, please read the information on our spell checker website.

IMPORTANT: This is not a spell checker. The normalizer cannot analyse misspelled or grammatically incorrect words: disambiguation and normalization are limited to grammatically correct words. Words that cannot be analysed are printed unchanged. If your text contains spelling or grammatical errors (e.g. wsai instead of wasi, or yawarp instead of yawarpa), please correct them with the spell checker first!

The procedure for the text normalization is as follows:

sentence splitting with the Lingua::Sentence perl module
tokenization with the Uplug::PreProcess::Tokenizer perl module
morphological analysis with a cascade of finite state transducers, including a special transducer that handles Spanish loan words. This transducer contains the Spanish lexicon from FreeLing, see [2].
morphological disambiguation with Wapiti, an open source implementation of conditional random fields.
generation of the normalized word forms
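
The five steps above can be sketched as a chain of small functions. This is only a toy illustration in Python, not the project's code: the real pipeline uses Perl modules, finite state transducers and a Wapiti model. The lexicon entries, the function names and the "keep every candidate" disambiguation stub are all assumptions made for the example.

```python
import re

# Hypothetical toy lexicon (illustration only): surface form -> candidate
# analyses, each given as a tuple of morphemes in the target orthography.
TOY_ANALYSES = {
    "huasi": [("wasi",)],
    "wasi": [("wasi",)],
    "yawarpa": [("yawar", "pa")],
}

def split_sentences(text):
    # Step 1: naive split after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    # Step 2: words and single punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", sentence)

def analyse(token):
    # Step 3: look up candidate analyses; unknown words get none.
    return TOY_ANALYSES.get(token.lower(), [])

def disambiguate(candidates):
    # Step 4: placeholder that keeps every candidate. A statistical
    # model would rank the candidates in context here.
    return candidates

def generate(token, candidates):
    # Step 5: words without an analysis are returned unchanged;
    # several remaining forms are joined with '/'.
    if not candidates:
        return token
    forms = []
    for morphemes in candidates:
        form = "".join(morphemes)
        if form not in forms:
            forms.append(form)
    return "/".join(forms)

def normalize(text):
    # Run all five steps and return one normalized string per sentence.
    result = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        result.append(" ".join(generate(t, disambiguate(analyse(t))) for t in tokens))
    return result
```

In this sketch a misspelled word such as wsai has no analysis and passes through unchanged, which mirrors the behaviour described above.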

Note that the morphological disambiguation relies on statistical models trained on a relatively small corpus, so not all word forms will be disambiguated correctly. We evaluated the normalization procedure on two texts: Hanaq pachaman wichaq wayna [3] and the last 70 sentences of Gregorio Condori Mamani's biography [4].
The models correctly disambiguated 96.46% of the word forms in the first text and 93.01% in the second. However, many morphological ambiguities do not affect the conversion to the standard orthography, so the proportion of correctly normalized forms is higher than the proportion of correctly disambiguated words:

Morphological Disambiguation
	Hanaq pachaman wichaq wayna		biography Gregorio Condori Mamani
total words	1298		844
correctly disambiguated form	1252	96.46%	785	93.01%
wrongly disambiguated form	33	2.54%	10	2.49%
still ambiguous	0	-	7	5.51%
failed to analyse	9	0.48%	10	2.46%

Normalization
	Hanaq pachaman wichaq wayna		biography Gregorio Condori Mamani
total words	1298		844
correct normalized form	1295	99.77%	834	98.82%
wrong normalized form	3	0.23%	10	1.18%
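
The percentages in the normalization table are each count divided by the total number of words in that text. A minimal Python check, using only the counts from the table (the helper name `share` is made up for this example):

```python
def share(count, total):
    # Percentage of `count` in `total`, rounded to two decimals.
    return round(100 * count / total, 2)

# (correct, wrong, total) counts copied from the normalization table.
normalization = {
    "Hanaq pachaman wichaq wayna": (1295, 3, 1298),
    "biography Gregorio Condori Mamani": (834, 10, 844),
}

for text, (correct, wrong, total) in normalization.items():
    print(text, share(correct, total), share(wrong, total))
```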

Nevertheless, you should probably proofread your text. If a word could not be disambiguated, the ambiguous forms are included in the output, separated by '/'.
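
When proofreading, it can help to list the tokens that still contain a '/'. The following Python sketch assumes the normalized output is plain text with whitespace-separated tokens; the function name is hypothetical, and a '/' that was already present in the original text would be reported as well.

```python
def unresolved_tokens(normalized_text):
    # Collect (line number, token) pairs for tokens with an inner '/'.
    hits = []
    for line_no, line in enumerate(normalized_text.splitlines(), start=1):
        for token in line.split():
            # strip("/") ignores slashes that only lead or trail a token.
            if "/" in token.strip("/"):
                hits.append((line_no, token))
    return hits
```

Each reported pair points to a place where one of the alternatives has to be chosen by hand.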

Bibliography:

[1] Unified Southern Quechua as described by Cerrón-Palomino (1994): Cerrón-Palomino, R. (1994). Quechua sureño, diccionario unificado quechua-castellano, castellano-quechua. Lima: Biblioteca Nacional del Perú.

[2] Lluís Padró. Analizadores Multilingües en FreeLing. Linguamatica, vol. 3, n. 2, pp. 13-20. December, 2011.

[3] Jorge Lira. 1990. Cuentos del Alto Urubamba. Centro de Estudios Regionales Andinos 'Bartolomé de las Casas', Cuzco.

[4] Ricardo Valderrama Fernandez and Carmen Escalante Gutierrez. 1977. Gregorio Condori Mamani - Autobiografía. Biblioteca de la Tradición Oral Andina. Centro de Estudios Rurales Andinos 'Bartolomé de las Casas', Cuzco.

Input should be encoded in UTF-8; otherwise non-ASCII characters might not be handled correctly.
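
If you are unsure about the encoding of a file, you can check it before submitting the text. The Python sketch below tries UTF-8 first and falls back to a second encoding. The fallback (Latin-1) and the function name are assumptions for this example; choose whatever legacy encoding your files actually use.

```python
def to_text(raw, fallback="latin-1"):
    # Decode bytes as UTF-8; if that fails, assume the fallback encoding.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)

# Re-encoding the result gives bytes that are valid UTF-8.
utf8_bytes = to_text("ñan".encode("latin-1")).encode("utf-8")
```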

The source code, training and test files, and all the models are available from our GitHub repository.

You can clone the repository or individual parts of it with git or Subversion.
The disambiguation pipeline is included in the released parsing pipeline under Releases.
The morphological analyzer (required for the normalization) is a separate package under Releases.
