The Sparcling corpus is built on top of a cleaned version of the Europarl corpus. It provided a basis for alignment experiments and features multilingual alignment on the sentence and text level. The corpus served as a reference for the development of the Multilingwis search engine, which supports the exploration of multilingual word-aligned corpora. Many other applications (see http://pub.cl.uzh.ch/purl/graen) make use of its combination of language-dependent annotation and interlingual alignment.
lang | tokens | types | lemmas | sents | texts |
---|---|---|---|---|---|
bg | 7509902 | 100378 | n/a | 302703 | 33187 |
de | 41107021 | 368437 | 97327 | 1754173 | 146544 |
el | 32263532 | 244448 | n/a | 1245560 | 114950 |
en | 43151584 | 129616 | 43981 | 1675807 | 146544 |
es | 45232847 | 177200 | 93572 | 1667918 | 146544 |
et | 8136702 | 251360 | 40374 | 447006 | 45126 |
fi | 28363987 | 669416 | 134694 | 1587455 | 136299 |
fr | 47270588 | 143977 | 82548 | 1692398 | 146544 |
it | 42648100 | 181510 | 98312 | 1646744 | 146544 |
nl | 42954617 | 263736 | 33630 | 1800838 | 145478 |
pl | 9334433 | 162314 | 19495 | 455103 | 44371 |
pt | 44029641 | 182020 | 26645 | 1642878 | 144408 |
ro | 7963967 | 83339 | 19369 | 308289 | 33725 |
sk | 9406142 | 161572 | 28317 | 435121 | 44613 |
sl | 9208808 | 134342 | 16210 | 420850 | 43810 |
sv | 36135818 | 337746 | 253731 | 1655134 | 137540 |
Total | 454717689 | 3591411 | 988206 | 18737977 | 1656227 |
The corpus has been aligned at the document, sentence and word levels.
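
The per-language figures in the table can be cross-checked against its Total row with a few lines of Python. The following is only an illustrative sketch: the token and sentence counts are copied by hand from the table, and the names `COUNTS`, `column_totals` and `mean_sentence_length` are hypothetical helpers, not part of any corpus tooling.

```python
# Per-language (tokens, sentences) counts, copied from the table above.
COUNTS = {
    "bg": (7509902, 302703),
    "de": (41107021, 1754173),
    "el": (32263532, 1245560),
    "en": (43151584, 1675807),
    "es": (45232847, 1667918),
    "et": (8136702, 447006),
    "fi": (28363987, 1587455),
    "fr": (47270588, 1692398),
    "it": (42648100, 1646744),
    "nl": (42954617, 1800838),
    "pl": (9334433, 455103),
    "pt": (44029641, 1642878),
    "ro": (7963967, 308289),
    "sk": (9406142, 435121),
    "sl": (9208808, 420850),
    "sv": (36135818, 1655134),
}


def column_totals(counts):
    """Return (total tokens, total sentences) summed over all languages."""
    tokens = sum(t for t, _ in counts.values())
    sents = sum(s for _, s in counts.values())
    return tokens, sents


def mean_sentence_length(counts, lang):
    """Return the average number of tokens per sentence for one language."""
    tokens, sents = counts[lang]
    return tokens / sents


if __name__ == "__main__":
    total_tokens, total_sents = column_totals(COUNTS)
    print(f"tokens: {total_tokens}, sentences: {total_sents}")
    for lang in sorted(COUNTS):
        print(f"{lang}: {mean_sentence_length(COUNTS, lang):.2f} tokens/sentence")
```

The summed token and sentence counts should match the Total row; the tokens-per-sentence figure is a derived quantity that the table itself does not report.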