The Zurich Parallel Corpus Collection

The corpus files are available for download at https://pub.cl.uzh.ch/corpora/PaCoCo/

The Zurich Parallel Corpus Collection currently consists of seven publicly available text corpora. These corpora are largely parallel or multi-parallel and cover a diverse range of domains, from mountaineering reports to articles on international business and finance.

Each corpus is available at the following links:

In order to make these corpora publicly available, we have extended the popular CoNLL-U format to efficiently accommodate our parallel texts.

How to cite

We presented the format at the 7th Workshop on the Challenges in the Management of Large Corpora at CL 2019. The publication is available via the IDS publication server or ZORA.

GraenKewShaitarovaVolk2019.bib
@inproceedings{GraenKewShaitarovaVolk2019,
           month = {July},
          author = {Graën, Johannes and Kew, Tannon and Shaitarova, Anastassia and Volk, Martin},
       booktitle = {Proceedings of the 7th Workshop on Challenges in the Management of Large Corpora (CMLC)},
          editor = {Bański, Piotr and Barbaresi, Adrien and Biber, Hanno and Breiteneder, Evelyn and Clematide, Simon and Kupietz, Marc and Lüngen, Harald and Iliadi, Caroline},
           title = {Modelling Large Parallel Corpora: The Zurich Parallel Corpus Collection},
       publisher = {Leibniz-Institut für Deutsche Sprache},
           pages = {1--8},
            year = {2019},
             url = {https://doi.org/10.5167/uzh-175081},
             doi = {10.14618/ids-pub-9020}
}

The CoNLL-UPPa Format

Token

At present, these corpora are available in an extended CoNLL-U format, which we refer to as CoNLL-UPPa (CoNLL-U Plus Parallel).

CoNLL-UPPa is a tabular 'one-token-per-line' file format that allows for the encoding of parallel texts by adding 3 extra columns to the standard CoNLL-U. These columns contain numeric identifier values for corpus tokens, sentences and texts, acting as primary and foreign keys to which stand-off annotation can be attached.

Each corpus consists of multiple token files (one per language). The image below shows a sample of a typical token file.

CoNLL-UPPa Token File

The first 10 columns match the standard CoNLL-U format:

  1. ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).1)
  2. FORM: Word form or punctuation symbol.
  3. LEMMA: Lemma or stem of word form.
  4. UPOS: Universal part-of-speech tag.
  5. XPOS: Language-specific part-of-speech tag.
  6. FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension.
  7. HEAD: Head of the current word, which is either a value of ID or zero (0).
  8. DEPREL: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
  9. DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
  10. MISC: Any other annotation.
  11. TokenID: Language- and corpus-specific integer in a range of 100 million (German=1, English=2, French=4, etc.).2)
  12. SentenceID: Language- and corpus-specific integer in a range of 10 million (German=1, English=2, French=4, etc.).3)
  13. TextID: Language- and corpus-specific integer in a range of 1 million (German=1, English=2, French=4, etc.).4)
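As a minimal sketch of how a token line with these 13 columns could be read, the Python snippet below splits a line into named fields. It assumes tab-separated columns, as in standard CoNLL-U; the function name, the field names and the example line (including its ID values) are hypothetical and not taken from the corpus files.

```python
# Column names follow the list above; the tab separator is an assumption
# carried over from standard CoNLL-U.
COLUMNS = [
    "id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel",
    "deps", "misc", "token_id", "sentence_id", "text_id",
]


def parse_token_line(line: str) -> dict:
    """Split one token line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    # The three extra CoNLL-UPPa columns hold numeric identifiers.
    for key in ("token_id", "sentence_id", "text_id"):
        row[key] = int(row[key])
    return row


# Made-up example line; the ID values are purely illustrative.
example = "1\tBerge\tBerg\tNOUN\tNN\t_\t0\troot\t_\t_\t100000001\t10000001\t1000001"
row = parse_token_line(example)
print(row["form"], row["token_id"])
```

The first ten fields are kept as strings, since columns such as FEATS or DEPS may contain an underscore or structured values.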

The MISC column allows a list of token-level annotation as key-value pairs separated by a pipe character ('|'). Here we include the following annotation information if available:

  • SpaceAfter: Indicates whether the detokenised surface form is followed by a white space or not (the only permitted value is No; Yes is assumed if the key is omitted).
  • Language: Language code for token (e.g. de/en/fr) if it differs from the language code of the sentence (language of the sentence if omitted).
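The MISC conventions above can be illustrated with a short Python sketch that parses the key-value list and uses SpaceAfter to rebuild the surface text. It assumes that keys and values are joined by '=' and that an underscore marks an empty column, both as in standard CoNLL-U; the helper names and the sample tokens are hypothetical.

```python
def parse_misc(misc: str) -> dict:
    """Split a MISC value such as 'SpaceAfter=No|Language=fr' into a dict."""
    if misc in ("", "_"):  # assumed empty-value marker, as in CoNLL-U
        return {}
    pairs = {}
    for item in misc.split("|"):
        key, _, value = item.partition("=")
        pairs[key] = value
    return pairs


def detokenise(tokens) -> str:
    """Join (form, misc) pairs, adding a space unless SpaceAfter=No is set."""
    parts = []
    for form, misc in tokens:
        parts.append(form)
        if parse_misc(misc).get("SpaceAfter") != "No":
            parts.append(" ")
    return "".join(parts).rstrip()


# Made-up sentence for illustration.
sentence = [
    ("Der", "_"),
    ("Gipfel", "SpaceAfter=No"),
    (",", "_"),
    ("endlich", "SpaceAfter=No"),
    ("!", "_"),
]
text = detokenise(sentence)
print(text)
```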

Stand-off Files

Sentence

Sentence-level annotation and metadata are stored in a sentence file, which contains 2 columns:

  1. SentenceID: Language- and corpus-specific integer in a range of 10 million (German=1, English=2, French=4, etc.).
  2. MISC: Any other annotation.

Similar to the token file, the MISC column allows a list of key-value pairs separated by a pipe character ('|'). Here we include the following annotation information if available:

  1. Language: language code for the sentence (e.g. de/en/fr) if it differs from the language code of the text (language of the text if omitted).

Text

Text-level annotation and metadata can be stored in a text file. Similar to the sentence file, a text file can also contain 2 columns:

  1. TextID: Language- and corpus-specific integer in a range of 1 million (German=1, English=2, French=4, etc.).
  2. MISC: Any other annotation.

Again, the MISC column allows for a list of text-level annotation as key-value pairs separated by a pipe character ('|'). Here we include the following annotation information:

  1. Language: language code for the text (e.g. de/en/fr) if it differs from that of the language-specific subcorpus; this value can also be derived from the filename if the corpus is split into language-specific token files.
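Taken together, the Language keys at token, sentence and text level describe a fallback chain: a value is only written where it differs from the next level up. A minimal Python sketch of that lookup follows; the function name and its arguments are hypothetical, and None stands for an omitted key.

```python
from typing import Optional


def resolve_language(
    token_lang: Optional[str],
    sentence_lang: Optional[str],
    text_lang: Optional[str],
    file_lang: str,
) -> str:
    """Return the most specific language code that is actually set."""
    for candidate in (token_lang, sentence_lang, text_lang):
        if candidate:
            return candidate
    # Nothing set at any level: fall back to the language of the token file.
    return file_lang


# A French token inside a German text, and a token with no override at all.
print(resolve_language("fr", None, None, "de"))
print(resolve_language(None, None, None, "de"))
```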

At the sentence and text level, metadata can vary dramatically. As such it is not possible to account for all potential types of metadata. Therefore, the MISC column is used to store this information. Note: Language-independent metadata (e.g. author attribution, date, etc.) can also be attached to the alignment unit IDs rather than the entity IDs themselves (see image below).

Some examples of metadata currently stored in sentence files are:

  • Type (i.e. heading, caption or footnote)
  • Div ID (for paragraph demarcation)

Examples of metadata currently stored in text files are corpus-specific attributes such as:

  • author attribution (Text+Berg)
  • category/domain (Credit Suisse)
  • speaker attribution (Sparcling)
  • known substances (Medi-Notice)
  • etc.

CoNLL-UPPa Format Overview

Alignment

We store hierarchical alignment (see Graën and Clematide 2015, Graën 2018) information in a 2-column tabular format. Alignments are modelled at all levels as an aggregation relation, which allows each unique entity ID to belong to a unique alignment unit ID. The picture below provides an example of a challenging case of hierarchical token alignment.

Hierarchical Token Alignment for the German word 'Menschenrechtsverletzung'

The German word Menschenrechtsverletzung corresponds to the English phrase violation of human rights and the French violation de droits de l’homme.

Together, these constitute a single alignment unit. However, below this level, we also have alignment units consisting of English and French parallel tokens: violation and violation; of and de; rights and droits; and human and l'homme.
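To illustrate the aggregation relation, the Python sketch below groups rows of a 2-column alignment file by alignment unit. The column order (entity ID first, alignment unit ID second), the function name and all ID values are assumptions made for this example; the sketch only covers the lower-level token pairs and does not show how nested units are encoded.

```python
from collections import defaultdict


def group_alignments(rows):
    """Map each alignment unit ID to the sorted entity IDs it aggregates."""
    units = defaultdict(list)
    for entity_id, unit_id in rows:
        units[unit_id].append(entity_id)
    return {unit: sorted(members) for unit, members in units.items()}


# Illustrative (entity ID, alignment unit ID) pairs, not real corpus data.
rows = [
    (200000021, 501),  # en: violation
    (400000031, 501),  # fr: violation
    (200000022, 502),  # en: of
    (400000032, 502),  # fr: de
]
units = group_alignments(rows)
print(units)
```

Because each entity ID appears in only one row, a dictionary from entity ID to unit ID would serve equally well for lookups in the other direction.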

Named Entities

In the Text+Berg corpus, we also have additional stand-off annotation for Named Entities (see Ebling et al. 2011).

The stand-off file for Named Entities contains the following 7 columns:

  • TOKEN ID: corresponds to CORPUS_TOK_ID in the MAIN CORPUS FILE.
  • SENTENCE ID: corresponds to CORPUS_SENT_ID in the MAIN CORPUS FILE.
  • NAMED ENTITY ID: unique named entity ID
    • for person p_000;
    • for geo g_000;
    • for time t_000.
  • POSITION WITHIN ENTITY: if single-word entity ⇒ 1/1; if multi-word entity ⇒ 1/n, 2/n, 3/n, etc.
  • NAMED ENTITY TYPE: person, geo (toponym), time
  • NAMED ENTITY SUBTYPE:
    • for person (first_name, last_name, address, title, profession);
    • for geo (city, lake, valley, mountain, mountain cabin, glacier);
    • for time (date, duration, time, set).
  • ATTRIBUTES:
    • for person {“gender” : “M”};
    • for geo {“stid”: “666”};
    • for time {“value” : “PX”, “freq” : “3X”, “mod” : “AFTER”}.
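As a rough sketch of how multi-word entities could be reassembled from this stand-off file, the Python snippet below groups rows by named entity ID and orders their tokens by the position column. It assumes the columns appear in the order listed above; the function names, the sample rows and all ID values are hypothetical.

```python
from collections import defaultdict


def parse_position(position: str):
    """Turn a position such as '2/3' into the pair (2, 3)."""
    index, total = position.split("/")
    return int(index), int(total)


def group_entities(rows):
    """Return {entity ID: [token ID, ...]} with tokens in entity order."""
    grouped = defaultdict(list)
    for token_id, _sentence_id, entity_id, position, *_rest in rows:
        index, _total = parse_position(position)
        grouped[entity_id].append((index, token_id))
    return {
        entity_id: [token_id for _, token_id in sorted(members)]
        for entity_id, members in grouped.items()
    }


# Made-up rows: a two-token person name (listed out of order) and a mountain.
rows = [
    (100000102, 10000007, "p_001", "2/2", "person", "last_name", '{"gender": "M"}'),
    (100000101, 10000007, "p_001", "1/2", "person", "first_name", '{"gender": "M"}'),
    (100000105, 10000007, "g_001", "1/1", "geo", "mountain", "{}"),
]
entities = group_entities(rows)
print(entities)
```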
1) We do not support multiword tokens or empty nodes.
2), 3), 4) We define full ranges for readability.
