====== Corrected & Structured Europarl Corpus (CoStEP) ======

This page provides information about the //CoStEP Corpus//((Graën, J., Batinic, D., and Volk, M. (2014). [[http://www.zora.uzh.ch/99005/|Cleaning the Europarl corpus for linguistic applications]]. In Konvens 2014. Stiftung Universität Hildesheim.)) which is based on the well-known //Europarl Corpus//((Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Machine Translation Summit, volume 5, pages 79–86. Asia-Pacific Association for Machine Translation (AAMT).)). CoStEP is a cleaned version of Europarl with respect to tokenization, encoding, and orthography. Additionally, it is structured and aligned on speaker turns. Thus CoStEP is much better suited for linguistic research than the original Europarl version.
L’ordre du jour appelle le rapport A4-0114/97 de M. Rack, au nom de la commission de la politique régionale, sur la communication de la Commission COM(96)0316 - C4-0533/96 relative à la mise en oeuvre de la politique régionale de l’Union européenne en Autriche, en Finlande et en Suède.
* URLs are enclosed in **url** tags:
Por desgracia, ahora me es imposible extenderme más sobre este punto, porque la administración del Parlamento Europeo ha decidido acortar implacablemente las explicaciones de voto que superen 200 palabras. Pero las personas que se interesen por la verdad en los asuntos europeos, y que estén cansadas de no encontrarla en los debates oficiales de nuestra Asamblea, pueden dirigirse a la página web de los diputados franceses del MPF en Internet, cuya dirección es la siguiente: http://www.autre-europe.org .
* **quote** encloses quotations and stores the language-specific quotation marks in the **start** and **end** attributes, so that the correct surface form can be recovered by applying a regular expression:
Og endelig, og det er ikke det mest uvæsentlige, beklager vi det upræcise ved ordet etik, der dog dukker op ret så ofte i denne betænkning. Det er det ordforråd, der anvendes i det nye politisk korrekte sprog for at undgå medicinens nødvendige underlæggelse under den naturlige og kristne morals principper.
* **gen** marks genitive clitics (in English, Finnish, Swedish):
Var därför på er vakt när ni analyserar OECD:s statistik beträffande de finansiella flödena, eftersom de finansiella flödena inte alltid motsvarar utvecklingsprocesserna eftersom det helt enkelt rör sig om ett flyktigt kapital som förändras med räntenivån.
* **n** and **ord** enclose cardinal and ordinal numbers, respectively.
* The empty **ellipsis** element replaces any form of ellipsis, thus flagging incomplete sentences.
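The regular-expression recovery mentioned for **quote** can be sketched as a small shell filter. This is only a minimal sketch under stated assumptions: the function name **restore_quotes** is hypothetical, the **start** attribute is assumed to be serialized before **end** with double-quoted values, and quotations are assumed not to contain nested markup. The sample sentence and its quotation marks are invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: put the quotation marks stored in the start/end
# attributes back around the quoted text and drop the <quote> element.
# Assumes start precedes end and the quotation contains no nested tags.
restore_quotes() {
  sed -E 's/<quote start="([^"]*)" end="([^"]*)">([^<]*)<\/quote>/\1\3\2/g'
}

# Invented sample input; the quotation marks are illustrative only.
printf '%s\n' 'ordet <quote start="»" end="«">etik</quote>, der dog dukker op' \
  | restore_quotes
```

If quotations in the corpus do contain further elements such as **n** or **gen**, those would have to be stripped first, since the pattern stops at the first nested tag.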
===== XPath queries =====
The XML corpus is best queried with XPath expressions. The following sample script extracts aligned **text**s in English and French, writes them as the two columns of a [[http://en.wikipedia.org/wiki/Tab-separated_values|tsv]] file, filters out texts containing quotations or ellipses, and keeps only short sentences (where the texts in both languages are fewer than 20 characters long).
To run the script, [[http://xmlstar.sourceforge.net/docs.php|XMLStarlet]] must be installed.
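  #!/bin/sh
  # Extract short, aligned English/French speaker turns as two-column TSV.
  # Requires XMLStarlet and GNU sed (for -r, \n, \t and \w).
  #
  # NOTE: the XML tags inside the sed and grep patterns were lost in this
  # copy of the page. The generic "<p[^>]*>" and "<(quote|ellipsis)"
  # patterns below are a reconstruction and therefore an assumption.
  for i in sessions/*.xml
  do
    # xmlstarlet emits "@@@<en paragraphs>@@@<fr paragraphs>@#@" per speaker
    # with speech in both languages. The sed stage then turns each "@#@"
    # into a line break, joins paragraphs of one language with "|", and
    # separates the two languages with a tab.
    cat "$i" \
    | xmlstarlet sel --encode utf-8 --template \
        --match "/session/chapter/turn/speaker[text[@language='en']/p[@type='speech'] and text[@language='fr']/p[@type='speech']]" \
        --output "@@@" \
        --copy-of "text[@language='en']/p" \
        --output "@@@" \
        --copy-of "text[@language='fr']/p" \
        --output "@#@" \
    | tr -d "\n" \
    | sed -r -e "s/@#@/\n/g" \
             -e "s/<\/p><p[^>]*>/|/g" \
             -e "s/<\/p>@@@<p[^>]*>/\t/g" \
             -e "s/@@@<p[^>]*>//g" \
             -e "s/<\/p>//g" \
    | grep -v -E "<(quote|ellipsis)" \
    | sed -r -e "s/<\/?\w+\/?>//g" -e "s/\|/ /g" \
    | awk 'BEGIN{FS="\t"}{if((length($1) < 20) && (length($2) < 20)) print}'
  done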
The tabular output looks like this:
^ English ^ French ^
| Thank you! | Merci beaucoup ! |
| The debate is closed | Le débat est clos. |
| That is noted. | C’est noté. |
| Thank you very much. | Merci beaucoup ! |
| No. | Non. |
| Why not? | Et pourquoi pas ? |
| I hope so. | Je l’espère. |
| Thank you Elvis! | Merci Elvis ! |
| Yes, I will. | Oui, je le ferai. |
| Yes! | Oui ! |
| My pleasure! | Je vous en prie ! |
| Surely not. | Sûrement pas. |
| I shall try. | Je vais essayer. |
| Yes, exactly. | Oui, tout à fait. |
| Yes, it is covered. | Oui, c’est couvert. |
| ... | ... |
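
The length filter in the last stage of the script above can also be isolated and applied to an already extracted tsv file with a different threshold. The following is a minimal sketch: the function name **filter_short** and its optional threshold argument are hypothetical and not part of the original script. Depending on the awk implementation and locale, **length** may count bytes rather than characters, so rows with accented letters can be measured as slightly longer than they appear.

```shell
#!/bin/sh
# Hypothetical helper: keep only the rows of a two-column TSV stream in
# which both columns are shorter than the given limit (default: 20).
filter_short() {
  awk -F '\t' -v max="${1:-20}" 'length($1) < max + 0 && length($2) < max + 0'
}

# Invented sample rows: only the first one passes a limit of 20.
printf 'No.\tNon.\nThis sentence is clearly longer than twenty characters.\tNon.\n' \
  | filter_short 20
```

Making the limit a parameter avoids editing the awk program when a different cut-off is wanted.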
===== Known errors =====
* When a **speaker**'s **president** attribute is set to yes, the **name** attribute should be left undefined, because the identity of the person acting as president is not known at all times.
===== Contributors =====
* Johannes Graën
* Dolores Batinic
* Martin Volk
* Simon Clematide
* Mathias Müller