This shows you the differences between two versions of the page.
public:pacoco:start [2019-07-17 18:54] – [Alignment] tkew | public:pacoco:start [2023-09-15 20:33] (current) – external edit 127.0.0.1 | ||
---|---|---|---|
Line 1: | Line 1: | ||
====== The Zurich Parallel Corpus Collection ====== | ====== The Zurich Parallel Corpus Collection ====== | ||
- | <div center round tip 60%> | + | <div center round info 50%> |
- | Data will be available | + | The corpus files are available |
</ | </ | ||
The Zurich Parallel Corpus Collection currently consists of seven publicly available text corpora. These corpora are largely parallel or multi-parallel and cover a diverse range of domains, from mountaineering reports to articles on international business and finance. | The Zurich Parallel Corpus Collection currently consists of seven publicly available text corpora. These corpora are largely parallel or multi-parallel and cover a diverse range of domains, from mountaineering reports to articles on international business and finance. | ||
+ | |||
+ | Each corpus is available at the following links: | ||
+ | * [[Text+Berg|Text+Berg]] | ||
+ | * [[Credit Suisse|Credit Suisse]] | ||
+ | * [[Medi-Notice|Medi-Notice]] | ||
+ | * [[Horizonte|Horizonte]] | ||
+ | * [[Sparcling|Sparcling]] | ||
+ | * [[Rumantsch Grischun|Rumantsch-Grischun]] | ||
+ | * [[Swiss Legislation Corpus|Swiss Legislation Corpus]] | ||
+ | * [[Swatchgroup|Swatchgroup «Geschäftsbericht»]] | ||
+ | |||
In order to make these corpora publicly available, we have extended the popular [[https:// | In order to make these corpora publicly available, we have extended the popular [[https:// | ||
+ | |||
+ | |||
+ | ===== How to cite ===== | ||
+ | |||
+ | We presented the format at the [[http:// | ||
+ | |||
+ | <file biblatex GraenKewShaitarovaVolk2019.bib> | ||
+ | @inproceedings{GraenKewShaitarovaVolk2019, | ||
+ | month = {July}, | ||
+ | author = {Gra\" | ||
+ | | ||
+ | editor = {Bański, Piotr and Barbaresi, Adrien and Biber, Hanno and Breiteneder, | ||
+ | title = {Modelling Large Parallel Corpora: The Zurich Parallel Corpus Collection}, | ||
+ | | ||
+ | pages = {1--8}, | ||
+ | year = {2019}, | ||
+ | url = {https:// | ||
+ | doi = {10.14618/ | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | |||
===== The CoNLL-UPPa Format ===== | ===== The CoNLL-UPPa Format ===== | ||
+ | |||
+ | ==== Token ==== | ||
At present, these corpora are available in an extended CoNLL-U format, which we refer to as **CoNLL-UPPa** ( **CoNLL-U P**lus **Pa**rallel). | At present, these corpora are available in an extended CoNLL-U format, which we refer to as **CoNLL-UPPa** ( **CoNLL-U P**lus **Pa**rallel). | ||
Line 67: | Line 102: | ||
- | At the sentence and text level, metadata can vary dramatically. As such it is not possible to account for all potential types of metadata. | + | At the sentence and text level, metadata can vary dramatically. As such it is not possible to account for all potential types of metadata. |
Some examples of metadata currently stored in sentence files are: | Some examples of metadata currently stored in sentence files are: | ||
Line 89: | Line 124: | ||
The picture below provides an example of a challenging case of hierarchical token alignment. | The picture below provides an example of a challenging case of hierarchical token alignment. | ||
- | {{ : | + | {{ : |
Line 100: | Line 135: | ||
In the Text+Berg corpus, we also have additional stand-off annotation for Named Entities (see [[https:// | In the Text+Berg corpus, we also have additional stand-off annotation for Named Entities (see [[https:// | ||
+ | The stand-off file for Named Entities contains the following 7 columns: | ||
+ | |||
+ | * **TOKEN ID**: corresponds to CORPUS_TOK_ID in the MAIN CORPUS FILE. | ||
+ | * **SENTENCE ID**: corresponds to CORPUS_SENT_ID in the MAIN CORPUS FILE. | ||
+ | * **NAMED ENTITY ID**: unique named entity ID | ||
+ | * for person p_000; | ||
+ | * for geo g_000; | ||
+ | * for time t_000. | ||
+ | * **POSITION WITHIN ENTITY**: if single-word entity => 1/1; if multi-word entity => 1/n, 2/n, 3/n, etc. | ||
+ | * **NAMED ENTITY TYPE**: person, geo (toponym), time | ||
+ | * **NAMED ENTITY SUBTYPE**: | ||
+ | * for person (first_name, | ||
+ | * for geo (city, lake, valley, mountain, mountain cabin, glacier); | ||
+ | * for time (date, duration, time, set). | ||
+ | * **ATTRIBUTES**: | ||
+ | * for person {" | ||
+ | * for geo {" | ||
+ | * for time {" |
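The seven columns listed above could be read with a short script along the following lines. This is only a sketch under stated assumptions: the page does not say how the columns are delimited, so tab separation is assumed here; the names `EntityToken`, `parse_row`, `group_by_entity` and `is_complete` are hypothetical helpers, not part of any corpus tooling; and the sample rows, including their IDs and the `{}` placeholder in the ATTRIBUTES column, are invented for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class EntityToken(NamedTuple):
    """One row of the Named Entity stand-off file (hypothetical container)."""
    token_id: str      # TOKEN ID, matches CORPUS_TOK_ID in the main corpus file
    sentence_id: str   # SENTENCE ID, matches CORPUS_SENT_ID in the main corpus file
    entity_id: str     # NAMED ENTITY ID, e.g. p_000, g_000, t_000
    position: int      # first number of POSITION WITHIN ENTITY ("2/3" -> 2)
    length: int        # second number of POSITION WITHIN ENTITY ("2/3" -> 3)
    entity_type: str   # NAMED ENTITY TYPE: person, geo, time
    subtype: str       # NAMED ENTITY SUBTYPE
    attributes: str    # ATTRIBUTES, kept as raw text (exact syntax not assumed)


def parse_row(line: str) -> EntityToken:
    """Split one line into its seven columns (assumption: tab-separated)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, got {len(fields)}")
    token_id, sentence_id, entity_id, pos, entity_type, subtype, attributes = fields
    position, length = (int(part) for part in pos.split("/"))
    return EntityToken(token_id, sentence_id, entity_id, position,
                       length, entity_type, subtype, attributes)


def group_by_entity(rows):
    """Collect the tokens of each entity, ordered by position within the entity."""
    entities = defaultdict(list)
    for row in rows:
        entities[row.entity_id].append(row)
    for tokens in entities.values():
        tokens.sort(key=lambda token: token.position)
    return dict(entities)


def is_complete(tokens) -> bool:
    """True if positions 1..n are all present for an n-word entity."""
    return [t.position for t in tokens] == list(range(1, tokens[0].length + 1))


# Invented sample rows; IDs and the "{}" attribute placeholder are illustrative only.
SAMPLE = (
    "tok_13\tsent_3\tg_001\t2/2\tgeo\tmountain\t{}\n"
    "tok_12\tsent_3\tg_001\t1/2\tgeo\tmountain\t{}\n"
    "tok_20\tsent_4\tp_001\t1/1\tperson\tfirst_name\t{}\n"
)

if __name__ == "__main__":
    entities = group_by_entity(parse_row(line) for line in SAMPLE.splitlines())
    for entity_id, tokens in entities.items():
        print(entity_id, [t.token_id for t in tokens], is_complete(tokens))
```

Grouping on the NAMED ENTITY ID and sorting on the position keeps multi-word entities together even if their rows are not adjacent in the file; the ATTRIBUTES column is deliberately left unparsed because its exact syntax is not shown here.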