public:costep:start [2015-11-11 15:05] – Johannes Graën | public:costep:start [2023-09-15 20:33] (current) – external edit 127.0.0.1 | ||
---|---|---|---|
+ | ====== Corrected & Structured Europarl Corpus (CoStEP) ====== | ||
+ | This page provides information about the //CoStEP Corpus// | ||
+ | <div center round tip 60%> | ||
+ | The latest version (1.0) is the last one to be released; further improvement requires manual correction. | ||
+ | </ | ||
+ | |||
+ | |||
+ | <wrap info> | ||
+ | Static URL to this page: http:// | ||
+ | </ | ||
+ | ===== Download ===== | ||
+ | * [[http:// | ||
+ | * [[http:// | ||
+ | * [[http:// | ||
+ | * [[http:// | ||
+ | * [[http:// | ||
+ | * **[[http:// | ||
+ | |||
+ | |||
+ | ===== The XML corpus ===== | ||
+ | The CoStEP corpus consists of a set of XML files, described by this [[http:// | ||
+ | |||
+ | |||
+ | ==== Structure ==== | ||
+ | The structure that we identified in the [[http:// | ||
+ | |||
+ | {{ costep_xml_structure.png? | ||
+ | |||
+ | * Any **session** of the European Parliament is identified by its **date**. | ||
+ | * A **chapter** is an agenda item of a particular **session**. | ||
+ | * **chapter**s may have a topic or title per **language**, | ||
+ | * Every **chapter** consists of at least one **speaker**' | ||
+ | * The **speaker** can be either the president of the European Parliament (or one of her vice-presidents) or any member or guest; in the latter case, at least her **name** is defined. | ||
+ | * A **text** encapsulates the speech contribution of the **speaker** it belongs to for a particular **language**. | ||
+ | * The concept of a **p**aragraph is adopted from Koehn' | ||
+ | |||
+ | |||
+ | ==== Speaker information ==== | ||
+ | As of version 0.9.4, the following speaker attributes have been added to the speaker turns from a list of members of the European Parliament: | ||
+ | * **forename** and **surname** (the // | ||
+ | * which **country** a member comes from, | ||
+ | * the political **group** she belongs to and | ||
+ | * a unique **id** for each member. | ||
+ | |||
+ | No existing attribute has been overwritten or deleted, so **forename** and **surname** may coexist, as well as **affiliation** and **group**. | ||
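
The attributes listed above can be read back with a short XPath query. The following is a minimal sketch using xmlstarlet; it assumes only that **speaker** elements carry the attribute names given above. The miniature input file, its nesting, the ''date'' attribute and all attribute values are invented for illustration, and ''list_members'' is a hypothetical helper name.

```shell
#!/bin/sh
# Invented miniature input: only the element name "speaker" and the attribute
# names come from the description above; nesting and values are assumptions.
sample_file=$(mktemp)
cat > "$sample_file" <<'EOF'
<session date="2000-01-17">
  <speaker id="1" forename="Anna" surname="Example" country="XX" group="G1"/>
  <speaker name="A Guest"/>
</session>
EOF

# Print id, surname, country and group of every speaker that has an id,
# one speaker per line, fields separated by tabs.
list_members() {
  tab=$(printf '\t')
  xmlstarlet sel --template --match '//speaker[@id]' \
    --value-of '@id' --output "$tab" \
    --value-of '@surname' --output "$tab" \
    --value-of '@country' --output "$tab" \
    --value-of '@group' --nl "$1"
}

# Only run the demonstration if xmlstarlet is installed.
if command -v xmlstarlet >/dev/null 2>&1; then
  list_members "$sample_file"
fi
rm -f "$sample_file"
```

Speakers without an **id** (for instance guests that could not be matched against the member list) are skipped by the ''[@id]'' predicate; dropping the predicate would list them with empty fields instead.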
+ | |||
+ | |||
+ | ==== Additional annotation ==== | ||
+ | The **text** and **headline** content may contain the following elements to facilitate further linguistic processing: | ||
+ | * **report**, **procedure** and **ref** are used to mark reports, procedures and references of the European Parliament: | ||
+ | <code xml> | ||
+ | <p type=" | ||
+ | </ | ||
+ | * URLs are enclosed in **url** tags: | ||
+ | <code xml> | ||
+ | <p type=" | ||
+ | </ | ||
+ | * **quote** encloses quotations, comprising the language-specific quotation marks in the **start** and **end** attributes, so that the correct surface form can be recovered by applying a regular expression: | ||
+ | <code xml> | ||
+ | <p type=" | ||
+ | </ | ||
+ | * **gen** marks genitive clitics (in English, Finnish, Swedish): | ||
+ | <code xml> | ||
+ | <p type=" | ||
+ | </ | ||
+ | * **n** and **ord** enclose cardinal and ordinal numbers, respectively. | ||
+ | * The empty **ellipsis** element replaces any form of ellipsis, thus flagging incomplete sentences. | ||
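
The regular expression mentioned for **quote** could look like the following sketch. It assumes that the **start** attribute precedes the **end** attribute, that a quotation sits on a single line, and that it contains no nested elements; the sample sentence and the helper name ''restore_quotes'' are invented.

```shell
#!/bin/sh
# Hypothetical sample line; the attribute order (start before end) is an assumption.
sample='<p>He said <quote start="“" end="”">we agree</quote> and left.</p>'

# Replace each quote element by: start mark + content + end mark.
# [^<]* keeps the match inside a single quote element without nested markup.
restore_quotes() {
  sed -E 's/<quote start="([^"]*)" end="([^"]*)">([^<]*)<\/quote>/\1\3\2/g'
}

printf '%s\n' "$sample" | restore_quotes
```

For the sample line this prints ''<p>He said “we agree” and left.</p>''. Quotations that contain further annotation (such as **n** or **url**) are left untouched by this pattern and would need an XML-aware tool instead.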
+ | |||
+ | |||
+ | ===== XPath queries ===== | ||
+ | The XML corpus is best queried with XPath expressions. The following sample script extracts aligned **text**s in English and French, converts them into columns of a two-column [[http:// | ||
+ | |||
+ | In order to run the script, you need to have the [[http:// | ||
+ | <code bash short_sentences_english_french_tsv.sh> | ||
+ | #!/bin/sh | ||
+ | |||
+ | for i in $(ls sessions/ | ||
+ | do | ||
+ | cat $i \ | ||
+ | | xmlstarlet sel --encode utf-8 --template \ | ||
+ | --match "/ | ||
+ | --output " | ||
+ | --copy-of " | ||
+ | --output " | ||
+ | --copy-of " | ||
+ | --output " | ||
+ | | tr -d " | ||
+ | | sed -r -e " | ||
+ | -e " | ||
+ | -e " | ||
+ | -e " | ||
+ | -e " | ||
+ | | grep -v "< | ||
+ | | grep -v "< | ||
+ | | sed -r -e " | ||
+ | | awk ' | ||
+ | done | ||
+ | </ | ||
+ | |||
+ | The tabular output looks like this: | ||
+ | ^ English ^ French ^ | ||
+ | | Thank you! | Merci beaucoup ! | | ||
+ | | The debate is closed | Le débat est clos. | | ||
+ | | That is noted. | C’est noté. | | ||
+ | | Thank you very much. | Merci beaucoup ! | | ||
+ | | No. | Non. | | ||
+ | | Why not? | Et pourquoi pas ? | | ||
+ | | I hope so. | Je l’espère. | | ||
+ | | Thank you Elvis! | Merci Elvis ! | | ||
+ | | Yes, I will. | Oui, je le ferai. | | ||
+ | | Yes! | Oui ! | | ||
+ | | My pleasure! | Je vous en prie ! | | ||
+ | | Surely not. | Sûrement pas. | | ||
+ | | I shall try. | Je vais essayer. | | ||
+ | | Yes, exactly. | Oui, tout à fait. | | ||
+ | | Yes, it is covered. | Oui, c’est couvert. | | ||
+ | | ... | ... | | ||
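
Tabular output of this kind is easy to post-process with standard tools. As a sketch that is independent of the sample script's own filtering, the following keeps only those pairs in which both the English and the French side have at most a given number of whitespace-separated tokens. The helper name ''filter_short_pairs'' and the default threshold of 4 are assumptions.

```shell
#!/bin/sh
# Keep only TSV rows whose English and French side each have at most
# max_tokens whitespace-separated tokens (the threshold is an assumption).
filter_short_pairs() {
  awk -F '\t' -v max_tokens="${1:-4}" '
    NF == 2 {
      # split() with " " splits on runs of whitespace, regardless of FS
      en = split($1, en_tokens, " ")
      fr = split($2, fr_tokens, " ")
      if (en > 0 && fr > 0 && en <= max_tokens && fr <= max_tokens) print
    }'
}

# Hypothetical usage with two rows in the format shown above.
printf '%s\t%s\n' 'No.' 'Non.' 'Yes, it is covered.' 'Oui, c’est couvert.' \
  | filter_short_pairs 3
```

With a threshold of 3, only the first pair is printed, because the English side of the second pair has four tokens. Rows that do not have exactly two tab-separated columns are dropped.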
+ | |||
+ | |||
+ | |||
+ | |||
+ | ===== Known errors ===== | ||
+ | * When a **speaker**' | ||
+ | ===== Contributors ===== | ||
+ | * Johannes Graën | ||
+ | * Dolores Batinic | ||
+ | * Martin Volk | ||
+ | * Simon Clematide | ||
+ | * Mathias Müller |