====== Corrected & Structured Europarl Corpus (CoStEP) ======
This page provides information about the //CoStEP Corpus//((Graën, J., Batinic, D., and Volk, M. (2014). [[http://www.zora.uzh.ch/99005/|Cleaning the Europarl corpus for linguistic applications]]. In Konvens 2014. Stiftung Universität Hildesheim.)) which is based on the well-known //Europarl Corpus//((Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Machine Translation Summit, volume 5, pages 79–86. Asia-Pacific Association for Machine Translation (AAMT).)). CoStEP is a cleaned version of Europarl with respect to tokenization, encoding, and orthography. Additionally, it is structured and aligned on speaker turns. Thus CoStEP is much better suited for linguistic research than the original Europarl version.

<div center round tip 60%>
The latest version (1.0) is the last one to be released; further improvement requires manual correction.
</div>


<wrap info>
Static URL to this page: http://pub.cl.uzh.ch/purl/costep
</wrap>
===== Download =====
  * [[http://pub.cl.uzh.ch/corpora/costep/costep_0.9_beta.tar.gz|Version 0.9 beta]] (1.5 GB, contains 827 out of 968 sessions)
  * [[http://pub.cl.uzh.ch/corpora/costep/costep_0.9.2_beta.tar.gz|Version 0.9.2 beta]] (some URLs corrected)
  * [[http://pub.cl.uzh.ch/corpora/costep/costep_0.9.4_beta.tar.gz|Version 0.9.4 beta]] (improved information about speakers (see [[http://pub.cl.uzh.ch/corpora/costep/costep_0.9.4.xsd|new XML Schema file]]); removed monolingual and untranslated turns)
  * [[http://pub.cl.uzh.ch/corpora/costep/costep_0.9.6_beta.tar.gz|Version 0.9.6 beta]] (encoding and quotation errors fixed; schema unchanged)
  * [[http://pub.cl.uzh.ch/corpora/costep/costep_1.0.tar.gz|Version 1.0]] (character level errors (e.g. as accents and missing unicode characters) corrected; [[http://pub.cl.uzh.ch/corpora/costep/costep_1.0.xsd|XML Schema]])
  * **[[http://pub.cl.uzh.ch/corpora/costep/costep_1.0.1.tar.gz|Version 1.0.1]] (encoding in two sessions corrected; [[http://pub.cl.uzh.ch/corpora/costep/costep_1.0.xsd|XML Schema]])**
 

===== The XML corpus =====
The CoStEP corpus consists of a list of xml files, described by this [[http://pub.cl.uzh.ch/corpora/costep/costep.xsd|XML Schema file]] (up to 0.9.2) or this [[http://pub.cl.uzh.ch/corpora/costep/costep_0.9.4.xsd|XML Schema file]] (from 0.9.4 on). Each file is named after the date on which the respective plenary session took place.


==== Structure ====
The structure that we identified in the [[http://www.europarl.europa.eu/plenary/en/minutes.html|minutes of the European Parliament]] and [[http://www.statmt.org/europarl/|Koehn's original Europarl Corpus]]((Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Machine Translation Summit, volume 5, pages 79–86. Asia-Pacific Association for Machine Translation (AAMT).)) is depicted by the UML class diagram on the right.

{{ costep_xml_structure.png?nolink}}

  * Any **session** of the European Parliament is identified by its **date**.
  * A **chapter** is an agenda item of a particular **session**.
  * **chapter**s may have a topic or title per **language**, here called **headlines**.
  * Every **chapter** consists of at least one **speaker**'s contribution, called **turn**.
  * The **speaker** can by either the president of the European Parliament (or one of her vice-presidents) or any member or guest; in the latter case, at least her **name** is defined.
  * A **text** encapsulates the speech contribution of the **speaker** it belongs to for a particular **language**.
  * The concept of a **p**aragraph is adopted from Koehn's data; its **type** is either text spoken by the **speaker** (//speech//) or a //comment// added by the transcribers of the debates.


==== Speaker information ====
The following speaker attributes have been added from a list of members of the European Parliament (as of version 0.9.4) to the speaker turns:
  * **forename** and **surname** (the //Europarl// corpus often only gives surnames),
  * which **country** a members comes from,
  * the political **group** she belongs to and
  * a unique **id** for each member.

No existing attribute has been overwritten or deleted so that **forename** and **surname** may coexist, as well as **affiliation** and **group**.


==== Additional annotation ====
The **text** and **headline** content may contain the following elements to facilitate further linguistic processing:
  * **report**, **procedure** and **ref** are used to mark reports, procedures and references of the European Parliament:
<code xml>
<p type="speech">L’ordre du jour appelle le rapport <report>A4-0114/97</report> de M. Rack, au nom de la commission de la politique régionale, sur la communication de la Commission <ref>COM(96)0316 - C4-0533/96</ref> relative à la mise en oeuvre de la politique régionale de l’Union européenne en Autriche, en Finlande et en Suède.</p>
</code>
  * URLs are enclosed in **url** tags:
<code xml>
<p type="speech">Por desgracia, ahora me es imposible extenderme más sobre este punto, porque la administración del Parlamento Europeo ha decidido acortar implacablemente las explicaciones de voto que superen 200 palabras. Pero las personas que se interesen por la verdad en los asuntos europeos, y que estén cansadas de no encontrarla en los debates oficiales de nuestra Asamblea, pueden dirigirse a la página web de los diputados franceses del MPF en Internet, cuya dirección es la siguiente: <url>http://www.autre-europe.org</url>.</p>
</code>
  * **quote** encloses quotations, comprising the language specific quotation marks in the **start** and **end** attribute, so that the correct surface form can be recovered by just applying a regular expression:
<code xml>
<p type="speech">Og endelig, og det er ikke det mest uvæsentlige, beklager vi det upræcise ved ordet <quote start="»" end="«">etik</quote>, der dog dukker op ret så ofte i denne betænkning. Det er det ordforråd, der anvendes i det nye <quote start="»" end="«">politisk korrekte</quote> sprog for at undgå medicinens nødve ndige underlæggelse under den naturlige og kristne morals principper.</p>
</code>
  * **gen** marks genitive clitics (in English, Finnish, Swedish):
<code xml>
<p type="speech">Var därför på er vakt när ni analyserar OECD<gen>:s</gen> statistik beträffande de finansiella flödena, eftersom de finansiella flödena inte alltid motsvarar utvecklingsprocesserna eftersom det helt enkelt rör sig om ett flyktigt kapital som förändras med räntenivån.</p>
</code>
  * **n** and **ord** inclose cardinal and ordinal numbers, respectively.
  * The empty **ellipsis** element replaces any form of ellipsis, thus flagging incomplete sentences.


===== XPath queries =====
The XML Corpus can be queried best by means of XPath expressions. The following sample script extracts aligned **text**s in English and French, converts them into columns of a two-column [[http://en.wikipedia.org/wiki/Tab-separated_values|tsv]] file, filters out quotation and ellipsis and extracts short sentences (where both languages' texts possess less than 20 characters).

In order to run the script, you need to have the [[http://xmlstar.sourceforge.net/docs.php|XMLStarlet]] installed.
<code bash short_sentences_english_french_tsv.sh>
#!/bin/sh

for i in $(ls sessions/*.xml)
do
        cat $i \
        | xmlstarlet sel --encode utf-8 --template \
                --match "/session/chapter/turn/speaker[text[@language='en']/p[@type='speech'] and text[@language='fr']/p[@type='speech']]" \
                --output "@@@" \
                --copy-of "text[@language='en']/p" \
                --output "@@@" \
                --copy-of "text[@language='fr']/p" \
                --output "@#@" \
        | tr -d "\n" \
        | sed -r -e "s/@#@/\n/g" \
                -e "s/<\/p><p type=\"speech\">/|/g" \
                -e "s/<\/p>@@@<p type=\"speech\">/\t/g" \
                -e "s/@@@<p type=\"speech\">//g" \
                -e "s/<\/p>//g" \
        | grep -v "<quote " \
        | grep -v "<ellipsis/>" \
        | sed -r -e "s/<\/?\w+\/?>//g" -e "s/\|/ /g" \
        | awk 'BEGIN{FS="\t"}{if((length($1) < 20) && (length($2) < 20)) print}'
done
</code>

The tabular output looks like this:
^  English  ^  French  ^
| Thank you! | Merci beaucoup ! |
| The debate is closed | Le débat est clos. |
| That is noted. | C’est noté. |
| Thank you very much. | Merci beaucoup ! |
| No. | Non. |
| Why not? | Et pourquoi pas ? |
| I hope so. | Je l’espère. |
| Thank you Elvis! | Merci Elvis ! |
| Yes, I will. | Oui, je le ferai. |
| Yes! | Oui ! |
| My pleasure! | Je vous en prie ! |
| Surely not. | Sûrement pas. |
| I shall try. | Je vais essayer. |
| Yes, exactly. | Oui, tout à fait. |
| Yes, it is covered. | Oui, c’est couvert. |
| ... | ... |


===== Known errors =====
  * When a **speaker**'s attribute **president** is set to yes, the **name** attribute should be undefined since the name of the person acting as president is not known at any time.
===== Contributors =====
  * Johannes Graën
  * Dolores Batinic
  * Martin Volk
  * Simon Clematide
  * Mathias Müller