The corpus files are available for download at https://pub.cl.uzh.ch/corpora/PaCoCo/
The Zurich Parallel Corpus Collection currently consists of seven publicly available text corpora. These corpora are largely parallel or multi-parallel and cover a diverse range of domains, from mountaineering reports to articles on international business and finance.
Each corpus is available at the following links:
In order to make these corpora publicly available, we have extended the popular CoNLL-U format to efficiently accommodate our parallel texts.
We presented the format at the 7th Workshop on the Challenges in the Management of Large Corpora at CL 2019. The publication is available via the IDS publication server or ZORA.
@inproceedings{GraenKewShaitarovaVolk2019,
  month = {July},
  author = {Gra\"{e}n, Johannes and Kew, Tannon and Shaitarova, Anastassia and Volk, Martin},
  booktitle = {Proceedings of the 7th Workshop on Challenges in the Management of Large Corpora (CMLC)},
  editor = {Bański, Piotr and Barbaresi, Adrien and Biber, Hanno and Breiteneder, Evelyn and Clematide, Simon and Kupietz, Marc and Lüngen, Harald and Iliadi, Caroline},
  title = {Modelling Large Parallel Corpora: The Zurich Parallel Corpus Collection},
  publisher = {Leibniz-Institut f\"{u}r Deutsche Sprache},
  pages = {1--8},
  year = {2019},
  url = {https://doi.org/10.5167/uzh-175081},
  doi = {10.14618/ids-pub-9020}
}
At present, these corpora are available in an extended CoNLL-U format, which we refer to as CoNLL-UPPa (CoNLL-U Plus Parallel).
CoNLL-UPPa is a tabular 'one-token-per-line' file format that encodes parallel texts by adding 3 extra columns to the standard CoNLL-U format. These columns contain numeric identifiers for corpus tokens, sentences and texts, which act as primary and foreign keys to which stand-off annotation can be attached.
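As a rough illustration of this layout, the following Python sketch splits one tab-separated token line into the ten standard CoNLL-U fields plus the three identifier columns. The names `token_id`, `sentence_id` and `text_id`, the order of the three extra columns, and the sample line are assumptions made for illustration only; check the downloaded files for the actual layout.

```python
from typing import NamedTuple

# The ten standard CoNLL-U column names.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


class UppaToken(NamedTuple):
    conllu: dict      # the ten standard CoNLL-U fields, kept as strings
    token_id: int     # assumed to be column 11
    sentence_id: int  # assumed to be column 12
    text_id: int      # assumed to be column 13


def parse_token_line(line: str) -> UppaToken:
    """Split one token line into CoNLL-U fields and the three numeric IDs."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 13:
        raise ValueError(f"expected 13 columns, got {len(cols)}")
    conllu = dict(zip(CONLLU_FIELDS, cols[:10]))
    token_id, sentence_id, text_id = (int(c) for c in cols[10:])
    return UppaToken(conllu, token_id, sentence_id, text_id)


# A made-up sample line, not taken from any of the corpora.
sample = "1\tBerge\tBerg\tNOUN\tNN\t_\t0\troot\t_\t_\t1001\t57\t3"
token = parse_token_line(sample)
print(token.conllu["form"], token.token_id, token.sentence_id, token.text_id)
```

Keeping the identifiers as integers makes it straightforward to use them as lookup keys when joining token files with sentence, text or alignment files.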
Each corpus consists of multiple token files (one per language). The image below shows a sample of a typical token file.
The first 10 columns match the standard CoNLL-U format:
The MISC column allows a list of token-level annotation as key-value pairs separated by a pipe character ('|'). Here we include the following annotation information if available:
Sentence-level annotation and metadata are stored in a sentence file, which contains 2 columns:
Similar to the token file, the MISC column allows a list of key-value pairs separated by a pipe character ('|'). Here we include the following annotation information if available:
Text-level annotation and metadata can be stored in a text file. Similar to the sentence file, a text file can also contain 2 columns:
Again, the MISC column allows for a list of text-level annotation as key-value pairs separated by a pipe character ('|'). Here we include the following annotation information:
At the sentence and text level, metadata can vary dramatically, so it is not possible to account for all potential types of metadata in advance. The MISC column is therefore used to store this information. Note: Language-independent metadata (e.g. author attribution, date, etc.) can also be attached to the alignment unit IDs rather than the entity IDs themselves (see image below).
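Because the MISC column holds an open-ended list of key-value pairs, a small parser is usually all that is needed to read it. The sketch below assumes that keys and values are joined by `=` and that `_` marks an empty column, as in standard CoNLL-U; neither detail is stated here, and the keys in the example are hypothetical.

```python
def parse_misc(misc: str) -> dict:
    """Turn a pipe-separated MISC string into a dictionary."""
    # Assumption: "_" (or an empty string) means "no annotation".
    if misc in ("", "_"):
        return {}
    pairs = {}
    for item in misc.split("|"):
        # Assumption: key and value are joined by "=".
        key, sep, value = item.partition("=")
        # A bare key without "=" is kept with the value None.
        pairs[key] = value if sep else None
    return pairs


# Hypothetical keys, for illustration only.
meta = parse_misc("Author=Jane Doe|Year=1925")
print(meta)
```

Returning a plain dictionary keeps the parser agnostic about which metadata keys a given corpus actually uses.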
Some examples of metadata currently stored in sentence files are:
Examples of corpus-specific metadata currently stored in text files are:
We store hierarchical alignment information (see Graën and Clematide 2015, Graën 2018) in a 2-column tabular format. Alignments are modelled at all levels as an aggregation relation, which allows each unique entity ID to belong to a unique alignment unit ID. The picture below provides an example of a challenging case of hierarchical token alignment.
The German word Menschenrechtsverletzung corresponds to the English phrase violation of human rights and the French violation de droits de l’homme.
Together, these constitute a single alignment unit. However, below this level, we also have alignment units consisting of English and French parallel tokens: violation and violation; of and de; rights and droits; and human and l'homme.
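To make the 2-column alignment format more concrete, the following sketch groups entity IDs by the alignment unit they belong to. It assumes tab-separated rows with the entity ID in the first column and the alignment unit ID in the second; the column order and all ID values are invented for illustration. It only groups the rows of a single file and does not attempt to model how alignment units nest hierarchically.

```python
from collections import defaultdict


def group_alignments(lines):
    """Map each alignment unit ID to the list of entity IDs it aggregates."""
    units = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: column 1 = entity ID, column 2 = alignment unit ID.
        entity_id, unit_id = line.split("\t")
        units[int(unit_id)].append(int(entity_id))
    return dict(units)


# Invented rows: four tokens aggregated in unit 1, two tokens in unit 2.
rows = ["101\t1", "202\t1", "203\t1", "305\t1", "204\t2", "306\t2"]
units = group_alignments(rows)
print(units)
```

Because each row assigns one entity to one unit, the inverse lookup (entity ID to unit ID) can be built in the same way with an ordinary dictionary.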
In the Text+Berg corpus, we also have additional stand-off annotation for Named Entities (see Ebling et al. 2011).
The stand-off file for Named Entities contains the following 7 columns: