Differences

This shows you the differences between two versions of the page.

public:pacoco:start [2019-07-17 18:54] – [Alignment] tkew public:pacoco:start [2023-09-15 20:33] (current) – external edit 127.0.0.1
Line 1: Line 1:
 ====== The Zurich Parallel Corpus Collection ====== ====== The Zurich Parallel Corpus Collection ======
  
-<div center round tip 60%> +<div center round info 50%> 
-Data will be available on July 22nd.+The corpus files are available for download at [[https://pub.cl.uzh.ch/corpora/PaCoCo/]]
 </div> </div>
  
  
 The Zurich Parallel Corpus Collection currently consists of seven publicly available text corpora. These corpora are largely parallel or multi-parallel and cover a diverse range of domains, from mountaineering reports to articles on international business and finance. The Zurich Parallel Corpus Collection currently consists of seven publicly available text corpora. These corpora are largely parallel or multi-parallel and cover a diverse range of domains, from mountaineering reports to articles on international business and finance.
 +
 +Each corpus is available at the following links:
 +  * [[Text+Berg|Text+Berg]]
 +  * [[Credit Suisse|Credit Suisse]]
 +  * [[Medi-Notice|Medi-Notice]]
 +  * [[Horizonte|Horizonte]]
 +  * [[Sparcling|Sparcling]]
 +  * [[Rumantsch Grischun|Rumantsch-Grischun]]
 +  * [[Swiss Legislation Corpus|Swiss Legislation Corpus]]
 +  * [[Swatchgroup|Swatchgroup «Geschäftsbericht»]]
 +
  
 In order to make these corpora publicly available, we have extended the popular [[https://universaldependencies.org/format.html|CoNLL-U]] format to efficiently accommodate our parallel texts. In order to make these corpora publicly available, we have extended the popular [[https://universaldependencies.org/format.html|CoNLL-U]] format to efficiently accommodate our parallel texts.
 +
 +
 +===== How to cite =====
 +
 +We presented the format at the [[http://corpora.ids-mannheim.de/cmlc-2019.html|7th Workshop on Challenges in the Management of Large Corpora]] at [[http://www.cl2019.org/|CL 2019]]. The publication is available via the [[https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/8998|IDS publication server]] or [[http://www.zora.uzh.ch/175081|ZORA]].
 +
 +<file biblatex GraenKewShaitarovaVolk2019.bib>
 +@inproceedings{GraenKewShaitarovaVolk2019,
 +           month = {July},
 +          author = {Gra\"{e}n, Johannes and Kew, Tannon and Shaitarova, Anastassia and Volk, Martin},
 +       booktitle = {Proceedings of the 7th Workshop on Challenges in the Management of Large Corpora (CMLC)},
 +          editor = {Bański, Piotr and Barbaresi, Adrien and Biber, Hanno and Breiteneder, Evelyn and Clematide, Simon and Kupietz, Marc and Lüngen, Harald and Iliadi, Caroline},
 +           title = {Modelling Large Parallel Corpora: The Zurich Parallel Corpus Collection},
 +       publisher = {Leibniz-Institut f\"{u}r Deutsche Sprache},
 +           pages = {1--8},
 +            year = {2019},
 +             url = {https://doi.org/10.5167/uzh-175081},
 +             doi = {10.14618/ids-pub-9020}
 +}
 +</file>
 +
 +
  
 ===== The CoNLL-UPPa Format ===== ===== The CoNLL-UPPa Format =====
 +
 +==== Token ====
  
 At present, these corpora are available in an extended CoNLL-U format, which we refer to as **CoNLL-UPPa** ( **CoNLL-U P**lus **Pa**rallel).  At present, these corpora are available in an extended CoNLL-U format, which we refer to as **CoNLL-UPPa** ( **CoNLL-U P**lus **Pa**rallel). 
Line 67: Line 102:
  
  
-At the sentence and text level, metadata can vary dramatically. As such it is not possible to account for all potential types of metadata. Therfore the **MISC** column is used to store this information. **Note:** Language-independent metadata (e.g. author attribution, date, etc.) can also be attached to the alignment unit IDs rather than the entity IDs themselves (see image below).+At the sentence and text level, metadata can vary dramatically. As such it is not possible to account for all potential types of metadata. Therefore, the **MISC** column is used to store this information. **Note:** Language-independent metadata (e.g. author attribution, date, etc.) can also be attached to the alignment unit IDs rather than the entity IDs themselves (see image below).
  
 Some examples of metadata currently stored in sentence files are: Some examples of metadata currently stored in sentence files are:
Line 89: Line 124:
 The picture below provides an example of a challenging case of hierarchical token alignment. The picture below provides an example of a challenging case of hierarchical token alignment.
  
-{{ :public:pacoco:menschenrechtsverletzung.png | Hierarchical Token Alignment for the German word 'Menschenrechtsverletzung'}}+{{ :public:pacoco:menschenrechtsverletzung_1_.png?500 |Hierarchical Token Alignment for the German word 'Menschenrechtsverletzung'}}
  
  
Line 100: Line 135:
 In the Text+Berg corpus, we also have additional stand-off annotation for Named Entities (see [[https://www.zora.uzh.ch/id/eprint/50451/|Ebling et al. 2011]]). In the Text+Berg corpus, we also have additional stand-off annotation for Named Entities (see [[https://www.zora.uzh.ch/id/eprint/50451/|Ebling et al. 2011]]).
  
 +The stand-off file for Named Entities contains the following 7 columns:
 +
 +  * **TOKEN ID**: corresponds to CORPUS_TOK_ID in the MAIN CORPUS FILE.
 +  * **SENTENCE ID**: corresponds to CORPUS_SENT_ID in the MAIN CORPUS FILE.
 +  * **NAMED ENTITY ID**: unique named entity ID 
 +      * for person p_000; 
 +      * for geo g_000; 
 +      * for time t_000.
 +  * **POSITION WITHIN ENTITY**: if single-word entity => 1/1; if multi-word entity => 1/n, 2/n, 3/n, etc.
 +  * **NAMED ENTITY TYPE**: person, geo (toponym), time 
 +  * **NAMED ENTITY SUBTYPE**: 
 +      * for person (first_name, last_name, address<sup>1</sup>, title<sup>2</sup>, profession); 
 +      * for geo (city, lake, valley, mountain, mountain cabin, glacier); 
 +      * for time (date, duration, time, set).
 +  * **ATTRIBUTES**: 
 +      * for person {"gender" : "M"}; 
 +      * for geo {"stid"<sup>3</sup>: "666"}; 
 +      * for time {"value" : "PX", "freq" : "3X", "mod" : "AFTER"}.
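
The seven columns listed above could be read with a few lines of Python. The following is only a minimal sketch, not the project's own tooling: it assumes that the stand-off file is tab-separated, that the ATTRIBUTES column holds a JSON object, and that lines starting with ''#'' are comments. The names ''EntityToken'', ''parse_standoff'' and ''group_by_entity'' and all IDs in the sample rows are hypothetical and serve illustration only.

```python
import json
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List


@dataclass
class EntityToken:
    """One row of the stand-off file (field names are our own choice)."""
    token_id: str      # TOKEN ID, refers to CORPUS_TOK_ID
    sentence_id: str   # SENTENCE ID, refers to CORPUS_SENT_ID
    entity_id: str     # NAMED ENTITY ID, e.g. p_000, g_000, t_000
    position: int      # k in "k/n" (POSITION WITHIN ENTITY)
    length: int        # n in "k/n"
    entity_type: str   # person, geo or time
    subtype: str       # e.g. first_name, mountain, date
    attributes: dict   # ATTRIBUTES, assumed to be JSON


def parse_standoff(lines: Iterable[str]) -> Iterator[EntityToken]:
    """Yield one EntityToken per non-empty, non-comment line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")  # assumption: tab-separated columns
        if len(fields) != 7:
            raise ValueError(f"expected 7 columns, got {len(fields)}: {line!r}")
        tok, sent, ent, pos, etype, subtype, attrs = fields
        index, total = (int(part) for part in pos.split("/"))
        yield EntityToken(tok, sent, ent, index, total, etype, subtype,
                          json.loads(attrs) if attrs.strip() else {})


def group_by_entity(tokens: Iterable[EntityToken]) -> Dict[str, List[EntityToken]]:
    """Collect the tokens of each entity, ordered by position within the entity."""
    grouped: Dict[str, List[EntityToken]] = {}
    for token in tokens:
        grouped.setdefault(token.entity_id, []).append(token)
    for members in grouped.values():
        members.sort(key=lambda t: t.position)
    return grouped


# Hypothetical sample rows; the IDs are invented for illustration.
SAMPLE = (
    '101\t12\tp_001\t1/2\tperson\tfirst_name\t{"gender": "M"}\n'
    '102\t12\tp_001\t2/2\tperson\tlast_name\t{"gender": "M"}\n'
    '108\t12\tg_001\t1/1\tgeo\tmountain\t{"stid": "666"}\n'
)

if __name__ == "__main__":
    entities = group_by_entity(parse_standoff(SAMPLE.splitlines()))
    for entity_id, members in entities.items():
        print(entity_id, [m.token_id for m in members])
```

Grouping by NAMED ENTITY ID and sorting by the position value is one way to reassemble multi-word entities; the token and sentence IDs can then be used to look up the surface forms in the main corpus file.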

CL Wiki

Institute of Computational Linguistics – University of Zurich