Differences

This shows you the differences between two versions of the page.

public:costep:start [2016-02-23 10:09] – [Known errors] Johannes Graën
public:costep:start [2023-09-15 20:33] (current) – external edit 127.0.0.1
Line 1: Line 1:
 ====== Corrected & Structured Europarl Corpus (CoStEP) ======
 This page provides information about the //CoStEP Corpus//((Graën, J., Batinic, D., and Volk, M. (2014). [[http://www.zora.uzh.ch/99005/|Cleaning the Europarl corpus for linguistic applications]]. In Konvens 2014. Stiftung Universität Hildesheim.)) which is based on the well-known //Europarl Corpus//((Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Machine Translation Summit, volume 5, pages 79–86. Asia-Pacific Association for Machine Translation (AAMT).)). CoStEP is a cleaned version of Europarl with respect to tokenization, encoding, and orthography. Additionally, it is structured and aligned on speaker turns. Thus CoStEP is much better suited for linguistic research than the original Europarl version.
 +
 +<div center round tip 60%>
 +The latest version (1.0) is the last one to be released; further improvement requires manual correction.
 +</div>
 +
  
 <wrap info>
Line 10: Line 15:
   * [[http://pub.cl.uzh.ch/corpora/costep/costep_0.9.4_beta.tar.gz|Version 0.9.4 beta]] (improved information about speakers (see [[http://pub.cl.uzh.ch/corpora/costep/costep_0.9.4.xsd|new XML Schema file]]); removed monolingual and untranslated turns)
   * [[http://pub.cl.uzh.ch/corpora/costep/costep_0.9.6_beta.tar.gz|Version 0.9.6 beta]] (encoding and quotation errors fixed; schema unchanged)
 +  * [[http://pub.cl.uzh.ch/corpora/costep/costep_1.0.tar.gz|Version 1.0]] (character-level errors (e.g. accents and missing Unicode characters) corrected; [[http://pub.cl.uzh.ch/corpora/costep/costep_1.0.xsd|XML Schema]])
 +  * **[[http://pub.cl.uzh.ch/corpora/costep/costep_1.0.1.tar.gz|Version 1.0.1]] (encoding in two sessions corrected; [[http://pub.cl.uzh.ch/corpora/costep/costep_1.0.xsd|XML Schema]])** 
 + 
  
 ===== The XML corpus =====

CL Wiki

Institute of Computational Linguistics – University of Zurich