Differences

This shows you the differences between two versions of the page.

public:pacoco:start [2019-07-17 21:44] tkewpublic:pacoco:start [2023-09-15 20:33] (current) – external edit 127.0.0.1
Line 1: Line 1:
 ====== The Zurich Parallel Corpus Collection ====== ====== The Zurich Parallel Corpus Collection ======
  
-<div center round tip 60%> +<div center round info 50%> 
-Data will be available on July 22nd.+The corpus files are available for download at [[https://pub.cl.uzh.ch/corpora/PaCoCo/]]
 </div> </div>
  
  
 The Zurich Parallel Corpus Collection currently consists of seven publicly available text corpora. These corpora are largely parallel or multi-parallel and cover a diverse range of domains, from mountaineering reports to articles on international business and finance. The Zurich Parallel Corpus Collection currently consists of seven publicly available text corpora. These corpora are largely parallel or multi-parallel and cover a diverse range of domains, from mountaineering reports to articles on international business and finance.
 +
 +Each corpus is available at the following links:
 +  * [[Text+Berg|Text+Berg]]
 +  * [[Credit Suisse|Credit Suisse]]
 +  * [[Medi-Notice|Medi-Notice]]
 +  * [[Horizonte|Horizonte]]
 +  * [[Sparcling|Sparcling]]
 +  * [[Rumantsch Grischun|Rumantsch-Grischun]]
 +  * [[Swiss Legislation Corpus|Swiss Legislation Corpus]]
 +  * [[Swatchgroup|Swatchgroup «Geschäftsbericht»]]
 +
  
 In order to make these corpora publicly available, we have extended the popular [[https://universaldependencies.org/format.html|CoNLL-U]] format to efficiently accommodate our parallel texts. In order to make these corpora publicly available, we have extended the popular [[https://universaldependencies.org/format.html|CoNLL-U]] format to efficiently accommodate our parallel texts.
 +
 +
 +===== How to cite =====
 +
 +We presented the format at the [[http://corpora.ids-mannheim.de/cmlc-2019.html|7th Workshop on the Challenges in the Management of Large Corpora]] at [[http://www.cl2019.org/|CL 2019]]. The publication is available via the [[https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/8998|IDS publication server]] or [[http://www.zora.uzh.ch/175081|ZORA]].
 +
 +<file biblatex GraenKewShaitarovaVolk2019.bib>
 +@inproceedings{GraenKewShaitarovaVolk2019,
 +           month = {July},
 +          author = {Gra\"{e}n, Johannes and Kew, Tannon and Shaitarova, Anastassia and Volk, Martin},
 +       booktitle = {Proceedings of the 7th Workshop on Challenges in the Management of Large Corpora (CMLC)},
 +          editor = {Bański, Piotr and Barbaresi, Adrien and Biber, Hanno and Breiteneder, Evelyn and Clematide, Simon and Kupietz, Marc and Lüngen, Harald and Iliadi, Caroline},
 +           title = {Modelling Large Parallel Corpora: The Zurich Parallel Corpus Collection},
 +       publisher = {Leibniz-Institut f\"{u}r Deutsche Sprache},
 +           pages = {1--8},
 +            year = {2019},
 +             url = {https://doi.org/10.5167/uzh-175081},
 +             doi = {10.14618/ids-pub-9020}
 +}
 +</file>
 +
 +
  
 ===== The CoNLL-UPPa Format ===== ===== The CoNLL-UPPa Format =====
Line 69: Line 102:
  
  
-At the sentence and text level, metadata can vary dramatically. As such it is not possible to account for all potential types of metadata. Therfore the **MISC** column is used to store this information. **Note:** Language-independent metadata (e.g. author attribution, date, etc.) can also be attached to the alignment unit IDs rather than the entity IDs themselves (see image below).+At the sentence and text level, metadata can vary dramatically. As such it is not possible to account for all potential types of metadata. Therefore, the **MISC** column is used to store this information. **Note:** Language-independent metadata (e.g. author attribution, date, etc.) can also be attached to the alignment unit IDs rather than the entity IDs themselves (see image below).
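
As a rough illustration of how metadata held in a **MISC** field might be read, the sketch below assumes the standard CoNLL-U convention of `Key=Value` pairs separated by `|`, with `_` marking an empty field. Whether CoNLL-UPPa sentence files follow exactly this encoding is an assumption, and the helper name `parse_misc` and the keys `author` and `date` are hypothetical placeholders rather than names taken from the corpora.

```python
def parse_misc(misc):
    """Parse a CoNLL-U style MISC field into a dictionary.

    Assumes 'Key=Value' pairs separated by '|' and '_' for an empty field
    (the plain CoNLL-U convention; not confirmed for CoNLL-UPPa).
    """
    misc = misc.strip()
    if misc in ("", "_"):
        return {}
    metadata = {}
    for item in misc.split("|"):
        # partition keeps any further '=' characters inside the value
        key, sep, value = item.partition("=")
        metadata[key] = value if sep else ""
    return metadata


# Hypothetical MISC value carrying language-independent metadata
example = "author=Jane Doe|date=1900-01-01"
print(parse_misc(example))
```

A dictionary keeps the lookup independent of the order in which keys appear, which fits metadata that varies from one sentence or text to the next.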
  
 Some examples of metadata currently stored in sentence files are: Some examples of metadata currently stored in sentence files are:

CL Wiki

Institute of Computational Linguistics – University of Zurich