

Horizonte


The Horizonte corpus is built upon the magazine of the same name, published by the Swiss National Science Foundation (SNSF).

This corpus consists of magazine articles in German, French and English related to popular science and research projects in and around Switzerland.

Horizonte Online

The Horizonte Online corpus consists of articles from the Horizons magazine website, collected in 2018. The articles span four years, from 2014 to 2018.

Language    Tokens    Types   Lemmas   Sentences   Texts
de          114084    19318    10609        8584     158
en          131146    13324     8404        8035     157
fr          126333    15010     7315        7583     158
Total       371563    47652    26328       24202     473
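
The figures in the table above can be checked and turned into simple derived measures. The following is a minimal sketch in Python, not part of the corpus tooling: the function and variable names are hypothetical, and the numbers are copied by hand from the table. It confirms that each value in the Total row is the sum of the per-language values (so the total type and lemma counts are sums over languages, not counts over a merged vocabulary), and it computes a type/token ratio and a mean sentence length per language.

```python
# Figures copied by hand from the Horizonte Online table above.
ONLINE = {
    "de": {"tokens": 114084, "types": 19318, "lemmas": 10609, "sents": 8584, "texts": 158},
    "en": {"tokens": 131146, "types": 13324, "lemmas": 8404, "sents": 8035, "texts": 157},
    "fr": {"tokens": 126333, "types": 15010, "lemmas": 7315, "sents": 7583, "texts": 158},
}
TOTAL = {"tokens": 371563, "types": 47652, "lemmas": 26328, "sents": 24202, "texts": 473}


def column_sums(table):
    """Sum every column over the per-language rows (hypothetical helper)."""
    columns = next(iter(table.values())).keys()
    return {col: sum(row[col] for row in table.values()) for col in columns}


def derived_measures(row):
    """Type/token ratio and mean sentence length in tokens (hypothetical helper)."""
    return {
        "ttr": row["types"] / row["tokens"],
        "tokens_per_sent": row["tokens"] / row["sents"],
    }


sums = column_sums(ONLINE)
measures = {lang: derived_measures(row) for lang, row in ONLINE.items()}

if __name__ == "__main__":
    print("totals match:", sums == TOTAL)
    for lang, m in measures.items():
        print(f"{lang}: TTR={m['ttr']:.3f}, tokens/sentence={m['tokens_per_sent']:.1f}")
```

Type/token ratios depend on corpus size and on tokenisation, so they are best read as a rough comparison between the three language parts rather than as a property of the languages themselves.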

Alignment

The corpus has been aligned on the document level.

Horizonte PDF

The Horizonte PDF corpus consists of articles extracted from PDF issues of the Horizonte magazine in the magazine's online archive. The articles span 12 years, from 2005 to 2017.

Language     Tokens    Types   Lemmas   Sentences   Texts
de          1025245    85221    35577       75014    1237
en           392975    24793    14209       23865     395
fr          1193874    51562    17557       71995    1237
Total       2612094   161576    67343      170874    2869

Alignment

The corpus has been aligned on the document level.
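
Document-level alignment links the language versions of an article as whole texts. The following is a minimal sketch in Python of how such an alignment could be represented and queried. It is not the corpus's actual file format: the class, the field names, the article identifiers and the toy texts are all hypothetical. Because the table lists 395 English texts against 1237 German and 1237 French texts, the sketch assumes that a language version may be missing from an alignment unit.

```python
from dataclasses import dataclass, field


@dataclass
class AlignedArticle:
    """One alignment unit: the versions of a single article, keyed by language code."""
    article_id: str                               # hypothetical identifier
    versions: dict = field(default_factory=dict)  # language code -> full article text

    def pair(self, src, tgt):
        """Return (source text, target text), or None if either version is missing."""
        if src in self.versions and tgt in self.versions:
            return self.versions[src], self.versions[tgt]
        return None


def parallel_documents(articles, src, tgt):
    """Collect all document pairs that exist for one language pair."""
    pairs = []
    for article in articles:
        p = article.pair(src, tgt)
        if p is not None:
            pairs.append(p)
    return pairs


# Toy data, invented for illustration only.
articles = [
    AlignedArticle("a1", {"de": "Ein Text.", "fr": "Un texte.", "en": "A text."}),
    AlignedArticle("a2", {"de": "Noch ein Text.", "fr": "Encore un texte."}),
]

if __name__ == "__main__":
    print(len(parallel_documents(articles, "de", "fr")), "de-fr document pairs")
    print(len(parallel_documents(articles, "de", "en")), "de-en document pairs")
```

Keeping the versions in a dictionary keyed by language code lets the same structure serve any language pair; sentence-level links, if needed, would have to be computed separately.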


CL Wiki

Institute of Computational Linguistics – University of Zurich