====== The Zurich Parallel Corpus Collection ======
The corpus files are available for download at [[https://pub.cl.uzh.ch/corpora/PaCoCo/]]
The Zurich Parallel Corpus Collection currently consists of seven publicly available text corpora. These corpora are largely parallel or multi-parallel and cover a diverse range of domains, from mountaineering reports to articles on international business and finance. The individual corpora are available at the following links:

  * [[Text+Berg|Text+Berg]]
  * [[Credit Suisse|Credit Suisse]]
  * [[Medi-Notice|Medi-Notice]]
  * [[Horizonte|Horizonte]]
  * [[Sparcling|Sparcling]]
  * [[Rumantsch Grischun|Rumantsch-Grischun]]
  * [[Swiss Legislation Corpus|Swiss Legislation Corpus]]
  * [[Swatchgroup|Swatchgroup «Geschäftsbericht»]]

To make these corpora publicly available, we have extended the popular [[https://universaldependencies.org/format.html|CoNLL-U]] format to efficiently accommodate our parallel texts.

===== How to cite =====

We presented the format at the [[http://corpora.ids-mannheim.de/cmlc-2019.html|7th Workshop on the Challenges in the Management of Large Corpora]] at [[http://www.cl2019.org/|CL 2019]]. The publication is available via the [[https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/8998|IDS publication server]] or [[http://www.zora.uzh.ch/175081|ZORA]].

  @inproceedings{GraenKewShaitarovaVolk2019,
    month = {July},
    author = {Gra\"{e}n, Johannes and Kew, Tannon and Shaitarova, Anastassia and Volk, Martin},
    booktitle = {Proceedings of the 7th Workshop on Challenges in the Management of Large Corpora (CMLC)},
    editor = {Bański, Piotr and Barbaresi, Adrien and Biber, Hanno and Breiteneder, Evelyn and Clematide, Simon and Kupietz, Marc and Lüngen, Harald and Iliadi, Caroline},
    title = {Modelling Large Parallel Corpora: The Zurich Parallel Corpus Collection},
    publisher = {Leibniz-Institut f\"{u}r Deutsche Sprache},
    pages = {1--8},
    year = {2019},
    url = {https://doi.org/10.5167/uzh-175081},
    doi = {10.14618/ids-pub-9020}
  }

===== The CoNLL-UPPa Format =====

==== Token ====

At present, these corpora are available in an extended CoNLL-U format, which we refer to as **CoNLL-UPPa** (**CoNLL-U P**lus **Pa**rallel). CoNLL-UPPa is a tabular 'one-token-per-line' file format that allows for the encoding of parallel texts by adding three extra columns to the standard CoNLL-U format. These columns contain numeric identifier values for corpus tokens, sentences and texts, acting as primary and foreign keys to which stand-off annotation can be attached.

Each corpus consists of multiple **token files** (one per language). The image below shows a sample of a typical token file.

{{ :public:pacoco:cmlc_format_token1.png?900 |CoNLL-UPPa Token File}}

The first 10 columns match the standard CoNLL-U format:

  - **ID**: Word index, integer starting at 1 for each new sentence; //may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).//((We do not support multiword tokens or empty nodes.))
  - **FORM**: Word form or punctuation symbol.
  - **LEMMA**: Lemma or stem of word form.
  - **UPOS**: Universal part-of-speech tag.
  - **XPOS**: Language-specific part-of-speech tag.
  - **FEATS**: List of morphological features from the universal feature inventory or from a defined language-specific extension.
  - **HEAD**: Head of the current word, which is either a value of ID or zero (0).
  - **DEPREL**: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
  - **DEPS**: Enhanced dependency graph in the form of a list of head-deprel pairs.
  - **MISC**: Any other annotation.
  - **TokenID**: Language- and corpus-specific integer //in a range of 100 million (German=1, English=2, French=4, etc.).//((We define full ranges for readability.))
  - **SentenceID**: Language- and corpus-specific integer //in a range of 10 million (German=1, English=2, French=4, etc.).//((We define full ranges for readability.))
  - **TextID**: Language- and corpus-specific integer //in a range of 1 million (German=1, English=2, French=4, etc.).//((We define full ranges for readability.))

The **MISC** column holds a list of token-level annotations as key-value pairs separated by a pipe character ('|'). Here we include the following annotation information if available:

  * **SpaceAfter**: Indicates whether the detokenised surface form is followed by a white space or not (can only have the value //No//; //Yes// is assumed if omitted).
  * **Language**: Language code for the token (e.g. de/en/fr) if it differs from the language code of the sentence (language of the sentence if omitted).

===== Stand-off Files =====

==== Sentence ====

Sentence-level annotation and metadata are stored in a **sentence file**, which contains two columns:

  - **SentenceID**: Language- and corpus-specific integer in a range of 10 million (German=1, English=2, French=4, etc.).
  - **MISC**: Any other annotation.

As in the token file, the **MISC** column holds a list of key-value pairs separated by a pipe character ('|'). Here we include the following annotation information if available:

  - **Language**: Language code for the sentence (e.g. de/en/fr) if it differs from the language code of the text (language of the text if omitted).

==== Text ====

Text-level annotation and metadata can be stored in a **text file**. Like the sentence file, a text file can contain two columns:

  - **TextID**: Language- and corpus-specific integer in a range of 1 million (German=1, English=2, French=4, etc.).
  - **MISC**: Any other annotation.

Again, the **MISC** column holds a list of text-level annotations as key-value pairs separated by a pipe character ('|'). Here we include the following annotation information:

  - **Language**: Language code for the text (e.g. de/en/fr) if it differs from that of the language-specific subcorpus; this value can also be derived from the filename if the corpus is split into language-specific token files.

At the sentence and text level, metadata can vary dramatically, so it is not possible to account for all potential types of metadata in dedicated columns. Therefore, the **MISC** column is used to store this information.

**Note:** Language-independent metadata (e.g. author attribution, date, etc.) can also be attached to the alignment unit IDs rather than the entity IDs themselves (see image below).

Some examples of metadata currently stored in sentence files are:

  * Type (i.e. heading, caption or footnote)
  * Div ID (for paragraph demarcation)

Examples of corpus-specific metadata currently stored in text files are:

  * author attribution (Text+Berg)
  * category/domain (Credit Suisse)
  * speaker attribution (Sparcling)
  * known substances (Medi-Notice)
  * etc.

{{ :public:pacoco:cmlc_format_overview.png | CoNLL-UPPa Format Overview}}

==== Alignment ====

We store hierarchical alignment (see [[https://www.zora.uzh.ch/id/eprint/111877/|Graën and Clematide 2015]], [[https://www.zora.uzh.ch/id/eprint/153213/|Graën 2018]]) information in a 2-column tabular format.
Alignments are modelled at all levels as an aggregation relation, which allows a unique entity ID to belong to a unique alignment unit ID. The picture below provides an example of a challenging case of hierarchical token alignment.

{{ :public:pacoco:menschenrechtsverletzung_1_.png?500 |Hierarchical Token Alignment for the German word 'Menschenrechtsverletzung'}}

The German word //Menschenrechtsverletzung// corresponds to the English phrase //violation of human rights// and the French //violation de droits de l'homme//. Together, these constitute a single alignment unit. However, below this level, we also have alignment units consisting of English and French parallel tokens: //violation// and //violation//; //of// and //de//; //rights// and //droits//; and //human// and //l'homme//.

==== Named Entities ====

In the Text+Berg corpus, we also have additional stand-off annotation for Named Entities (see [[https://www.zora.uzh.ch/id/eprint/50451/|Ebling et al. 2011]]).

The stand-off file for Named Entities contains the following 7 columns:

  * **TOKEN ID**: corresponds to CORPUS_TOK_ID in the MAIN CORPUS FILE.
  * **SENTENCE ID**: corresponds to CORPUS_SENT_ID in the MAIN CORPUS FILE.
  * **NAMED ENTITY ID**: unique named entity ID
    * for person p_000;
    * for geo g_000;
    * for time t_000.
  * **POSITION WITHIN ENTITY**: if single-word entity => 1/1; if multi-word entity => 1/n, 2/n, 3/n, etc.
  * **NAMED ENTITY TYPE**: person, geo (toponym), time
  * **NAMED ENTITY SUBTYPE**:
    * for person (first_name, last_name, address, title, profession);
    * for geo (city, lake, valley, mountain, mountain cabin, glacier);
    * for time (date, duration, time, set).
  * **ATTRIBUTES**:
    * for person {"gender" : "M"};
    * for geo {"stid" : "666"};
    * for time {"value" : "PX", "freq" : "3X", "mod" : "AFTER"}.
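
To make the token file layout described under //Token// more concrete, here is a minimal Python sketch of how one line might be read. It is not part of the official tooling, and it rests on several assumptions: cells are tab-separated and an underscore marks an empty cell (as in standard CoNLL-U), and MISC pairs use ''='' between key and value. The names ''COLUMNS'', ''parse_misc'', ''parse_token_line'' and ''detokenise'' are hypothetical helpers introduced only for illustration.

```python
from typing import Dict, Iterable, NamedTuple

# The 10 standard CoNLL-U columns followed by the 3 CoNLL-UPPa ID columns.
COLUMNS = [
    "ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD", "DEPREL",
    "DEPS", "MISC", "TokenID", "SentenceID", "TextID",
]


class Token(NamedTuple):
    fields: Dict[str, str]  # raw cell values keyed by column name
    misc: Dict[str, str]    # parsed key-value pairs from the MISC column


def parse_misc(value: str) -> Dict[str, str]:
    """Split a MISC cell such as 'SpaceAfter=No|Language=fr' into a dict.

    Assumption: '_' or an empty string means "no annotation".
    """
    if value in ("", "_"):
        return {}
    pairs = {}
    for item in value.split("|"):
        key, _, val = item.partition("=")
        pairs[key] = val
    return pairs


def parse_token_line(line: str) -> Token:
    """Parse one line of a token file (assumed to be tab-separated)."""
    cells = line.rstrip("\n").split("\t")
    if len(cells) != len(COLUMNS):
        raise ValueError(
            f"expected {len(COLUMNS)} columns, found {len(cells)}"
        )
    fields = dict(zip(COLUMNS, cells))
    return Token(fields, parse_misc(fields["MISC"]))


def detokenise(tokens: Iterable[Token]) -> str:
    """Rebuild the surface string; a space follows unless SpaceAfter=No."""
    parts = []
    for tok in tokens:
        parts.append(tok.fields["FORM"])
        if tok.misc.get("SpaceAfter") != "No":
            parts.append(" ")
    return "".join(parts).rstrip()
```

Because TokenID, SentenceID and TextID are kept as plain fields, the same dictionary keys could be used to look up rows in the sentence, text or alignment stand-off files, provided those files are read under the same tab-separated assumption.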