public:costep:start [2016-02-23 10:09] – [Known errors] Johannes Graën | public:costep:start [2017-11-11 19:01] – old revision restored (2017-05-12 10:42) Johannes Graën |
---|
====== Corrected & Structured Europarl Corpus (CoStEP) ====== | ====== Corrected & Structured Europarl Corpus (CoStEP) ====== |
This page provides information about the //CoStEP Corpus//((Graën, J., Batinic, D., and Volk, M. (2014). [[http://www.zora.uzh.ch/99005/|Cleaning the Europarl corpus for linguistic applications]]. In Konvens 2014. Stiftung Universität Hildesheim.)) which is based on the well-known //Europarl Corpus//((Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Machine Translation Summit, volume 5, pages 79–86. Asia-Pacific Association for Machine Translation (AAMT).)). CoStEP is a cleaned version of Europarl with respect to tokenization, encoding, and orthography. Additionally, it is structured and aligned on speaker turns. Thus CoStEP is much better suited for linguistic research than the original Europarl version. | This page provides information about the //CoStEP Corpus//((Graën, J., Batinic, D., and Volk, M. (2014). [[http://www.zora.uzh.ch/99005/|Cleaning the Europarl corpus for linguistic applications]]. In Konvens 2014. Stiftung Universität Hildesheim.)) which is based on the well-known //Europarl Corpus//((Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Machine Translation Summit, volume 5, pages 79–86. Asia-Pacific Association for Machine Translation (AAMT).)). CoStEP is a cleaned version of Europarl with respect to tokenization, encoding, and orthography. Additionally, it is structured and aligned on speaker turns. Thus CoStEP is much better suited for linguistic research than the original Europarl version. |
| |
| <div center round tip 60%> |
| The latest version (1.0) is the last one to be released; further improvement requires manual correction. |
| </div> |
| |
| |
<wrap info> | <wrap info> |
* [[http://pub.cl.uzh.ch/corpora/costep/costep_0.9.4_beta.tar.gz|Version 0.9.4 beta]] (improved information about speakers (see [[http://pub.cl.uzh.ch/corpora/costep/costep_0.9.4.xsd|new XML Schema file]]); removed monolingual and untranslated turns) | * [[http://pub.cl.uzh.ch/corpora/costep/costep_0.9.4_beta.tar.gz|Version 0.9.4 beta]] (improved information about speakers (see [[http://pub.cl.uzh.ch/corpora/costep/costep_0.9.4.xsd|new XML Schema file]]); removed monolingual and untranslated turns) |
* [[http://pub.cl.uzh.ch/corpora/costep/costep_0.9.6_beta.tar.gz|Version 0.9.6 beta]] (encoding and quotation errors fixed; schema unchanged) | * [[http://pub.cl.uzh.ch/corpora/costep/costep_0.9.6_beta.tar.gz|Version 0.9.6 beta]] (encoding and quotation errors fixed; schema unchanged) |
| * **[[http://pub.cl.uzh.ch/corpora/costep/costep_1.0.tar.gz|Version 1.0]] (character-level errors (e.g. accents and missing Unicode characters) corrected; [[http://pub.cl.uzh.ch/corpora/costep/costep_1.0.xsd|XML Schema]])** |
| |
| |