Corrected & Structured Europarl Corpus (CoStEP)

Corrected & Structured Europarl Corpus (CoStEP)

This page provides information about the CoStEP Corpus¹⁾ which is based on the well-known Europarl Corpus²⁾. CoStEP is a cleaned version of Europarl with respect to tokenization, encoding, and orthography. Additionally, it is structured and aligned on speaker turns. Thus CoStEP is much better suited for linguistic research than the original Europarl version.

The latest version (1.0) is the last one to be released; further improvement requires manual correction.

Static URL to this page: http://pub.cl.uzh.ch/purl/costep

Download

Version 0.9 beta (1.5 GB, contains 827 out of 968 sessions)
Version 0.9.2 beta (some URLs corrected)
Version 0.9.4 beta (improved information about speakers (see new XML Schema file); removed monolingual and untranslated turns)
Version 0.9.6 beta (encoding and quotation errors fixed; schema unchanged)
Version 1.0 (character level errors (e.g. as accents and missing unicode characters) corrected; XML Schema)
Version 1.0.1 (encoding in two sessions corrected; XML Schema)

The XML corpus

The CoStEP corpus consists of a list of xml files, described by this XML Schema file (up to 0.9.2) or this XML Schema file (from 0.9.4 on). Each file is named after the date on which the respective plenary session took place.

Structure

The structure that we identified in the minutes of the European Parliament and Koehn's original Europarl Corpus³⁾ is depicted by the UML class diagram on the right.

Any session of the European Parliament is identified by its date.
A chapter is an agenda item of a particular session.
chapters may have a topic or title per language, here called headlines.
Every chapter consists of at least one speaker's contribution, called turn.
The speaker can by either the president of the European Parliament (or one of her vice-presidents) or any member or guest; in the latter case, at least her name is defined.
A text encapsulates the speech contribution of the speaker it belongs to for a particular language.
The concept of a paragraph is adopted from Koehn's data; its type is either text spoken by the speaker (speech) or a comment added by the transcribers of the debates.

Speaker information

The following speaker attributes have been added from a list of members of the European Parliament (as of version 0.9.4) to the speaker turns:

forename and surname (the Europarl corpus often only gives surnames),
which country a members comes from,
the political group she belongs to and
a unique id for each member.

No existing attribute has been overwritten or deleted so that forename and surname may coexist, as well as affiliation and group.

Additional annotation

The text and headline content may contain the following elements to facilitate further linguistic processing:

report, procedure and ref are used to mark reports, procedures and references of the European Parliament:

<p type="speech">L’ordre du jour appelle le rapport <report>A4-0114/97</report> de M. Rack, au nom de la commission de la politique régionale, sur la communication de la Commission <ref>COM(96)0316 - C4-0533/96</ref> relative à la mise en oeuvre de la politique régionale de l’Union européenne en Autriche, en Finlande et en Suède.</p>

URLs are enclosed in url tags:

<p type="speech">Por desgracia, ahora me es imposible extenderme más sobre este punto, porque la administración del Parlamento Europeo ha decidido acortar implacablemente las explicaciones de voto que superen 200 palabras. Pero las personas que se interesen por la verdad en los asuntos europeos, y que estén cansadas de no encontrarla en los debates oficiales de nuestra Asamblea, pueden dirigirse a la página web de los diputados franceses del MPF en Internet, cuya dirección es la siguiente: <url>http://www.autre-europe.org</url>.</p>

quote encloses quotations, comprising the language specific quotation marks in the start and end attribute, so that the correct surface form can be recovered by just applying a regular expression:

<p type="speech">Og endelig, og det er ikke det mest uvæsentlige, beklager vi det upræcise ved ordet <quote start="»" end="«">etik</quote>, der dog dukker op ret så ofte i denne betænkning. Det er det ordforråd, der anvendes i det nye <quote start="»" end="«">politisk korrekte</quote> sprog for at undgå medicinens nødve ndige underlæggelse under den naturlige og kristne morals principper.</p>

gen marks genitive clitics (in English, Finnish, Swedish):

<p type="speech">Var därför på er vakt när ni analyserar OECD<gen>:s</gen> statistik beträffande de finansiella flödena, eftersom de finansiella flödena inte alltid motsvarar utvecklingsprocesserna eftersom det helt enkelt rör sig om ett flyktigt kapital som förändras med räntenivån.</p>

n and ord inclose cardinal and ordinal numbers, respectively.
The empty ellipsis element replaces any form of ellipsis, thus flagging incomplete sentences.

XPath queries

The XML Corpus can be queried best by means of XPath expressions. The following sample script extracts aligned texts in English and French, converts them into columns of a two-column tsv file, filters out quotation and ellipsis and extracts short sentences (where both languages' texts possess less than 20 characters).

In order to run the script, you need to have the XMLStarlet installed.

short_sentences_english_french_tsv.sh

#!/bin/sh
 
for i in $(ls sessions/*.xml)
do
        cat $i \
        | xmlstarlet sel --encode utf-8 --template \
                --match "/session/chapter/turn/speaker[text[@language='en']/p[@type='speech'] and text[@language='fr']/p[@type='speech']]" \
                --output "@@@" \
                --copy-of "text[@language='en']/p" \
                --output "@@@" \
                --copy-of "text[@language='fr']/p" \
                --output "@#@" \
        | tr -d "\n" \
        | sed -r -e "s/@#@/\n/g" \
                -e "s/<\/p><p type=\"speech\">/|/g" \
                -e "s/<\/p>@@@<p type=\"speech\">/\t/g" \
                -e "s/@@@<p type=\"speech\">//g" \
                -e "s/<\/p>//g" \
        | grep -v "<quote " \
        | grep -v "<ellipsis/>" \
        | sed -r -e "s/<\/?\w+\/?>//g" -e "s/\|/ /g" \
        | awk 'BEGIN{FS="\t"}{if((length($1) < 20) && (length($2) < 20)) print}'
done

The tabular output looks like this:

English	French
Thank you!	Merci beaucoup !
The debate is closed	Le débat est clos.
That is noted.	C’est noté.
Thank you very much.	Merci beaucoup !
No.	Non.
Why not?	Et pourquoi pas ?
I hope so.	Je l’espère.
Thank you Elvis!	Merci Elvis !
Yes, I will.	Oui, je le ferai.
Yes!	Oui !
My pleasure!	Je vous en prie !
Surely not.	Sûrement pas.
I shall try.	Je vais essayer.
Yes, exactly.	Oui, tout à fait.
Yes, it is covered.	Oui, c’est couvert.
…	…

Known errors

When a speaker's attribute president is set to yes, the name attribute should be undefined since the name of the person acting as president is not known at any time.

Contributors

Johannes Graën
Dolores Batinic
Martin Volk
Simon Clematide
Mathias Müller

¹⁾

Graën, J., Batinic, D., and Volk, M. (2014). Cleaning the Europarl corpus for linguistic applications. In Konvens 2014. Stiftung Universität Hildesheim.

²⁾ , ³⁾

Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Machine Translation Summit, volume 5, pages 79–86. Asia-Pacific Association for Machine Translation (AAMT).

Table of Contents