Table of Contents

Corrected & Structured Europarl Corpus (CoStEP)

This page provides information about the CoStEP Corpus1) which is based on the well-known Europarl Corpus2). CoStEP is a cleaned version of Europarl with respect to tokenization, encoding, and orthography. Additionally, it is structured and aligned on speaker turns. Thus CoStEP is much better suited for linguistic research than the original Europarl version.

The latest version (1.0) is the last one to be released; further improvement requires manual correction.

Static URL to this page: http://pub.cl.uzh.ch/purl/costep

Download

The XML corpus

The CoStEP corpus consists of a list of xml files, described by this XML Schema file (up to 0.9.2) or this XML Schema file (from 0.9.4 on). Each file is named after the date on which the respective plenary session took place.

Structure

The structure that we identified in the minutes of the European Parliament and Koehn's original Europarl Corpus3) is depicted by the UML class diagram on the right.

Speaker information

The following speaker attributes have been added from a list of members of the European Parliament (as of version 0.9.4) to the speaker turns:

No existing attribute has been overwritten or deleted so that forename and surname may coexist, as well as affiliation and group.

Additional annotation

The text and headline content may contain the following elements to facilitate further linguistic processing:

<p type="speech">L’ordre du jour appelle le rapport <report>A4-0114/97</report> de M. Rack, au nom de la commission de la politique régionale, sur la communication de la Commission <ref>COM(96)0316 - C4-0533/96</ref> relative à la mise en oeuvre de la politique régionale de l’Union européenne en Autriche, en Finlande et en Suède.</p>
<p type="speech">Por desgracia, ahora me es imposible extenderme más sobre este punto, porque la administración del Parlamento Europeo ha decidido acortar implacablemente las explicaciones de voto que superen 200 palabras. Pero las personas que se interesen por la verdad en los asuntos europeos, y que estén cansadas de no encontrarla en los debates oficiales de nuestra Asamblea, pueden dirigirse a la página web de los diputados franceses del MPF en Internet, cuya dirección es la siguiente: <url>http://www.autre-europe.org</url>.</p>
<p type="speech">Og endelig, og det er ikke det mest uvæsentlige, beklager vi det upræcise ved ordet <quote start="»" end="«">etik</quote>, der dog dukker op ret så ofte i denne betænkning. Det er det ordforråd, der anvendes i det nye <quote start="»" end="«">politisk korrekte</quote> sprog for at undgå medicinens nødve ndige underlæggelse under den naturlige og kristne morals principper.</p>
<p type="speech">Var därför på er vakt när ni analyserar OECD<gen>:s</gen> statistik beträffande de finansiella flödena, eftersom de finansiella flödena inte alltid motsvarar utvecklingsprocesserna eftersom det helt enkelt rör sig om ett flyktigt kapital som förändras med räntenivån.</p>

XPath queries

The XML Corpus can be queried best by means of XPath expressions. The following sample script extracts aligned texts in English and French, converts them into columns of a two-column tsv file, filters out quotation and ellipsis and extracts short sentences (where both languages' texts possess less than 20 characters).

In order to run the script, you need to have the XMLStarlet installed.

short_sentences_english_french_tsv.sh
#!/bin/sh
 
for i in $(ls sessions/*.xml)
do
        cat $i \
        | xmlstarlet sel --encode utf-8 --template \
                --match "/session/chapter/turn/speaker[text[@language='en']/p[@type='speech'] and text[@language='fr']/p[@type='speech']]" \
                --output "@@@" \
                --copy-of "text[@language='en']/p" \
                --output "@@@" \
                --copy-of "text[@language='fr']/p" \
                --output "@#@" \
        | tr -d "\n" \
        | sed -r -e "s/@#@/\n/g" \
                -e "s/<\/p><p type=\"speech\">/|/g" \
                -e "s/<\/p>@@@<p type=\"speech\">/\t/g" \
                -e "s/@@@<p type=\"speech\">//g" \
                -e "s/<\/p>//g" \
        | grep -v "<quote " \
        | grep -v "<ellipsis/>" \
        | sed -r -e "s/<\/?\w+\/?>//g" -e "s/\|/ /g" \
        | awk 'BEGIN{FS="\t"}{if((length($1) < 20) && (length($2) < 20)) print}'
done

The tabular output looks like this:

English French
Thank you! Merci beaucoup !
The debate is closed Le débat est clos.
That is noted. C’est noté.
Thank you very much. Merci beaucoup !
No. Non.
Why not? Et pourquoi pas ?
I hope so. Je l’espère.
Thank you Elvis! Merci Elvis !
Yes, I will. Oui, je le ferai.
Yes! Oui !
My pleasure! Je vous en prie !
Surely not. Sûrement pas.
I shall try. Je vais essayer.
Yes, exactly. Oui, tout à fait.
Yes, it is covered. Oui, c’est couvert.

Known errors

Contributors

1)
Graën, J., Batinic, D., and Volk, M. (2014). Cleaning the Europarl corpus for linguistic applications. In Konvens 2014. Stiftung Universität Hildesheim.
2) , 3)
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Machine Translation Summit, volume 5, pages 79–86. Asia-Pacific Association for Machine Translation (AAMT).