Differences

This shows you the differences between two versions of the page.

public:costep:start [2017-08-09 14:28] horat
public:costep:start [2023-09-15 20:33] (current) – external edit 127.0.0.1
Line 1:
-====== Corrected and Structured Europarl Corpus (CoStEP) ======
+====== Corrected Structured Europarl Corpus (CoStEP) ======
 This page provides information about the //CoStEP Corpus//((Graën, J., Batinic, D., and Volk, M. (2014). [[http://www.zora.uzh.ch/99005/|Cleaning the Europarl corpus for linguistic applications]]. In Konvens 2014. Stiftung Universität Hildesheim.)) which is based on the well-known //Europarl Corpus//((Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Machine Translation Summit, volume 5, pages 79–86. Asia-Pacific Association for Machine Translation (AAMT).)). CoStEP is a cleaned version of Europarl with respect to tokenization, encoding, and orthography. Additionally, it is structured and aligned on speaker turns. Thus CoStEP is much better suited for linguistic research than the original Europarl version.
  
Line 15:
   * [[http://pub.cl.uzh.ch/corpora/costep/costep_0.9.4_beta.tar.gz|Version 0.9.4 beta]] (improved information about speakers (see [[http://pub.cl.uzh.ch/corpora/costep/costep_0.9.4.xsd|new XML Schema file]]); removed monolingual and untranslated turns)
   * [[http://pub.cl.uzh.ch/corpora/costep/costep_0.9.6_beta.tar.gz|Version 0.9.6 beta]] (encoding and quotation errors fixed; schema unchanged)
-  * **[[http://pub.cl.uzh.ch/corpora/costep/costep_1.0.tar.gz|Version 1.0]] (character level errors (e.g. as accents and missing unicode characters) corrected; [[http://pub.cl.uzh.ch/corpora/costep/costep_1.0.xsd|XML Schema]])**
+  * [[http://pub.cl.uzh.ch/corpora/costep/costep_1.0.tar.gz|Version 1.0]] (character level errors (e.g. as accents and missing unicode characters) corrected; [[http://pub.cl.uzh.ch/corpora/costep/costep_1.0.xsd|XML Schema]])
+  * **[[http://pub.cl.uzh.ch/corpora/costep/costep_1.0.1.tar.gz|Version 1.0.1]] (encoding in two sessions corrected; [[http://pub.cl.uzh.ch/corpora/costep/costep_1.0.xsd|XML Schema]])**
+
  
 ===== The XML corpus =====

CL Wiki

Institute of Computational Linguistics – University of Zurich