CL Wiki

Institute of Computational Linguistics – University of Zurich

Sparcling

The Sparcling corpus is built on top of a cleaned version of the Europarl corpus. It provided a basis for alignment experiments and features multilingual alignment at the sentence and text level. The corpus served as a reference for the development of the Multilingwis search engine for exploring multilingual word-aligned corpora. Many other applications (see http://pub.cl.uzh.ch/purl/graen) make use of its combination of language-dependent annotation and interlingual alignment.

Language     Tokens     Types   Lemmas  Sentences     Texts
bg          7509902    100378      n/a     302703     33187
de         41107021    368437    97327    1754173    146544
el         32263532    244448      n/a    1245560    114950
en         43151584    129616    43981    1675807    146544
es         45232847    177200    93572    1667918    146544
et          8136702    251360    40374     447006     45126
fi         28363987    669416   134694    1587455    136299
fr         47270588    143977    82548    1692398    146544
it         42648100    181510    98312    1646744    146544
nl         42954617    263736    33630    1800838    145478
pl          9334433    162314    19495     455103     44371
pt         44029641    182020    26645    1642878    144408
ro          7963967     83339    19369     308289     33725
sk          9406142    161572    28317     435121     44613
sl          9208808    134342    16210     420850     43810
sv         36135818    337746   253731    1655134    137540
Total     454717689   3591411   988206   18737977   1656227
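
The figures in the table lend themselves to a few simple derived measures. The following is a minimal sketch in Python, not part of the corpus tooling: the names STATS, column_totals, type_token_ratio and mean_sentence_length are illustrative assumptions, and the numbers are copied from the table above (the lemma column is left out because it is not available for every language).

```python
# Per-language counts copied from the table: (tokens, types, sentences, texts).
# The dictionary name and the helper functions are hypothetical, for illustration only.
STATS = {
    "bg": (7509902, 100378, 302703, 33187),
    "de": (41107021, 368437, 1754173, 146544),
    "el": (32263532, 244448, 1245560, 114950),
    "en": (43151584, 129616, 1675807, 146544),
    "es": (45232847, 177200, 1667918, 146544),
    "et": (8136702, 251360, 447006, 45126),
    "fi": (28363987, 669416, 1587455, 136299),
    "fr": (47270588, 143977, 1692398, 146544),
    "it": (42648100, 181510, 1646744, 146544),
    "nl": (42954617, 263736, 1800838, 145478),
    "pl": (9334433, 162314, 455103, 44371),
    "pt": (44029641, 182020, 1642878, 144408),
    "ro": (7963967, 83339, 308289, 33725),
    "sk": (9406142, 161572, 435121, 44613),
    "sl": (9208808, 134342, 420850, 43810),
    "sv": (36135818, 337746, 1655134, 137540),
}


def column_totals(stats):
    """Sum each column (tokens, types, sentences, texts) over all languages."""
    return tuple(sum(column) for column in zip(*stats.values()))


def type_token_ratio(stats):
    """Types divided by tokens, per language."""
    return {lang: types / tokens for lang, (tokens, types, _, _) in stats.items()}


def mean_sentence_length(stats):
    """Average number of tokens per sentence, per language."""
    return {lang: tokens / sents for lang, (tokens, _, sents, _) in stats.items()}


if __name__ == "__main__":
    print("Totals (tokens, types, sentences, texts):", column_totals(STATS))
    ratios = type_token_ratio(STATS)
    lengths = mean_sentence_length(STATS)
    for lang in sorted(STATS):
        print(f"{lang}  TTR={ratios[lang]:.5f}  tokens/sentence={lengths[lang]:.2f}")
```

The column sums reproduce the Total row for tokens, types, sentences and texts, which shows that the type total is a sum over languages and not a count of distinct types across the whole corpus. Type/token ratios are only roughly comparable here, since the per-language subcorpora differ considerably in size.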

Alignment

The corpus has been aligned at the document, sentence, and word levels.

Publications

  • Exploiting alignment in multiparallel corpora for applications in linguistics and language learning (Graën 2018)
Last modified: 2023-09-15 20:33