Figure: Simplified illustration of the finite-state transducer for Quechua (also available as PDF). The morphological analyzer is available for download.
Comprehensive finite-state morphology systems have been developed for numerous languages. Nevertheless, the indigenous languages of the Americas have received far less attention from computational linguistics than the standard European languages. For my master's thesis, I implemented a complete morphology system for the Andean language Quechua, consisting of a morphological analyzer, a generation tool and a spell checker. Thanks to Richard Castro Mamani from the Universidad Nacional de San Antonio Abad del Cusco, a user-friendly text editor with a Quechua spell checker is now available (old version!). We also used our morphological analyzer to create a system for automatic text normalization, which you can test here.
Quechua is a group of closely related languages, spoken by 8-10 million people in Peru, Bolivia, Ecuador, southern Colombia and the north-west of Argentina. Ethnologue also lists some Quechua speakers for Chile. Quechua is one of the official languages of Peru and Bolivia. Peru in particular has increased its efforts to provide its citizens with official information not only in Spanish, but also in Quechua and (to a lesser extent) in some other indigenous languages such as Aymara and Asháninka.
Although Quechua is often referred to as a ’language’ and its local varieties as ’dialects’, Quechua is in fact a language family, comparable in depth to the Romance or Slavic languages (Adelaar & Muysken 2004). Mutual intelligibility, especially between speakers of distant ’dialects’, is not always guaranteed.
The Quechuan languages are divided into two main branches, Quechua I and Quechua II in the terminology of the Peruvian linguist Torero (1964). Quechua I is the more archaic group of dialects, spoken in Central Peru. It is a heavily fragmented dialect complex with limited mutual comprehension between the different local varieties, although they share a number of clear common features (Adelaar & Muysken 2004). The Quechuan languages probably originated in this area (Cerrón-Palomino 2003). Quechua II itself consists of three subgroups: QIIA, QIIB and QIIC.
The main focus of this project is on the dialects of the QIIC group, and within these,
especially on Cuzco and Ayacucho Quechua.
The analyzer can be tested below. It is designed specifically for QIIC input, so it will
not be able to analyze input from QI or from other QII dialects.
The output is given in trivocalic orthography, but the vowels e and o are also accepted in the input. Aspirated stops should be written as ph, th, kh, qh and chh; glottalized stops should be written with an apostrophe, e.g. q', k', etc.
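The input conventions above can be illustrated with a small Python sketch. It is not part of the analyzer and only assumes what is stated here: aspirated stops are digraphs or trigraphs ending in h, glottalized stops carry a plain apostrophe, and input must not contain whitespace. The function names `segment` and `to_trivocalic` are hypothetical, and mapping e to i and o to u is a naive assumption that ignores cases such as Spanish loanwords.

```python
import re

# Longer graphemes are listed first so that "chh" is not split into "ch" + "h".
# Aspirated stops: chh, ph, th, kh, qh; glottalized stops: ch', p', t', k', q'.
GRAPHEME = re.compile(r"chh|ch'|ch|[ptkq]h|[ptkq]'|.")


def segment(word):
    """Split a single word into graphemes, keeping multi-letter stops together."""
    if any(ch.isspace() for ch in word):
        raise ValueError("input must be a single word without whitespace")
    return GRAPHEME.findall(word.lower())


def to_trivocalic(word):
    """Naive assumption: rewrite pentavocalic e/o as trivocalic i/u."""
    return word.lower().translate(str.maketrans("eo", "iu"))


print(segment("qhapaq"))       # ['qh', 'a', 'p', 'a', 'q']
print(segment("k'anchay"))     # ["k'", 'a', 'n', 'ch', 'a', 'y']
print(to_trivocalic("qosqo"))  # qusqu
```

A pre-processing step of this kind would only check and normalize the spelling; the morphological analysis itself is done by the transducer.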
The finite-state transducer behind the analyzer was built with the Xerox Finite-State Tools; its size is about 7 MB.
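To illustrate the two directions in which such a transducer can be applied, analysis and generation, here is a toy sketch in plain Python. It does not use the Xerox tools and is not a real finite-state transducer: the lexicon holds a single root (wasi, 'house') and three suffixes, the tag names `+Pl`, `+Loc` and `+Acc` are hypothetical rather than the analyzer's actual tag set, and suffix order is not constrained.

```python
# Toy lexicon (assumption for illustration only).
ROOTS = {"wasi"}  # 'house'
SUFFIXES = {"kuna": "+Pl", "pi": "+Loc", "ta": "+Acc"}


def _suffix_paths(rest):
    """Yield every way to read `rest` as a sequence of known suffixes."""
    if not rest:
        yield []
        return
    for form, tag in SUFFIXES.items():
        if rest.startswith(form):
            for tail in _suffix_paths(rest[len(form):]):
                yield [tag] + tail


def analyze(word):
    """Surface form -> list of analyses such as 'wasi+Pl+Loc'."""
    results = []
    for root in ROOTS:
        if word.startswith(root):
            for tags in _suffix_paths(word[len(root):]):
                results.append(root + "".join(tags))
    return results


def generate(analysis):
    """Analysis string -> surface form (the inverse direction)."""
    root, *tags = analysis.split("+")
    form_for_tag = {tag: form for form, tag in SUFFIXES.items()}
    return root + "".join(form_for_tag["+" + t] for t in tags)


print(analyze("wasikunapi"))    # ['wasi+Pl+Loc']
print(generate("wasi+Pl+Loc"))  # wasikunapi
```

A real transducer encodes both directions in one network and also models morphotactics and spelling rules, which this lookup does not attempt.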
If you don't speak Quechua, feel free to try a word from the Declaration of Human Rights below. Make sure that the input does not contain any whitespace. For suggestions, please write to ariosATifi.uzh.ch