The Text+Berg Corpus

Page from the 1906 SAC yearbook

The Text+Berg corpus is a heritage corpus of alpine and mountaineering texts. The texts have been digitised from the yearbooks of the Swiss Alpine Club, Echo des Alpes and Die Alpen, as well as the British Alpine Club's Alpine Journal. The table below provides an overview of the source material, timespan and languages included in the corpus.

Source                       Timespan    Language(s)
Das Jahrbuch des SAC (SAC)   1864-1923   de, fr, it, rm, gsw, en (mixed)
Das Echo des Alpes (EdA)     1872-1924   fr
Die Alpen (SAC)              1925-1956   de, fr, it, rm, gsw, en (mixed)
Die Alpen (SAC)              1957-2011   de, fr (parallel)
The Alpine Journal (BAC)     1969-2008   en

As a diachronic heritage corpus, it has inspired numerous experiments aimed at semantically enriching it as both a historical and a linguistic resource (see below).

The corpus has been divided into language-specific subsections. The tables below provide an overview of corpus statistics for each subsection.

SAC

lang    tokens   types   lemmas   sents   texts
de      23.4m    769k    325k     1.3m    12k
fr      14.9m    317k    85k      787k    8k
it      324k     39k     18k      16k     162
rm      14k      4.5k    0.2k     786     18
gsw     3k       1.3k    0.2k     156     3
en      0.9k     0.4k    0.3k     41      1
Total   38.6m    1.1m    429k     2.1m    21k
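
As a quick sanity check on the SAC figures, the abbreviated counts can be expanded and summed. The following is a minimal Python sketch, not part of the corpus tooling: the names SAC_STATS, parse_count, column_total and type_token_ratio are hypothetical, and the figures are copied from the table above, so any result is only as precise as the rounded values (k = thousand, m = million).

```python
# Token and type counts copied from the SAC table (rounded, abbreviated values).
SAC_STATS = {
    "de": {"tokens": "23.4m", "types": "769k"},
    "fr": {"tokens": "14.9m", "types": "317k"},
    "it": {"tokens": "324k", "types": "39k"},
    "rm": {"tokens": "14k", "types": "4.5k"},
    "gsw": {"tokens": "3k", "types": "1.3k"},
    "en": {"tokens": "0.9k", "types": "0.4k"},
}

# Assumed meaning of the suffixes used in the table.
MULTIPLIERS = {"k": 1_000, "m": 1_000_000}


def parse_count(text: str) -> float:
    """Convert an abbreviated count such as '23.4m' or '769k' to a number."""
    suffix = text[-1].lower()
    if suffix in MULTIPLIERS:
        return float(text[:-1]) * MULTIPLIERS[suffix]
    return float(text)


def column_total(stats: dict, column: str) -> float:
    """Sum one column (e.g. 'tokens') over all language rows."""
    return sum(parse_count(row[column]) for row in stats.values())


def type_token_ratio(stats: dict, lang: str) -> float:
    """Distinct word forms (types) per running word (token) for one language."""
    row = stats[lang]
    return parse_count(row["types"]) / parse_count(row["tokens"])


if __name__ == "__main__":
    print(f"token total: {column_total(SAC_STATS, 'tokens') / 1e6:.1f}m")
    for lang in ("de", "fr"):
        print(f"{lang} type/token ratio: {type_token_ratio(SAC_STATS, lang):.4f}")
```

Summing the token column this way gives about 38.6 million, which agrees with the Total row. The type/token ratio is only a rough indicator here, since it depends strongly on subcorpus size and is computed from rounded figures.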

Alignment

The corpus has been aligned at the sentence level.

EdA

lang    tokens   types   lemmas   sents   texts
fr      7.4m     185k    40k      376k    4.5k

BAC

lang    tokens   types   lemmas   sents   texts
en      6.5m     181k    60k      289k    1.5k

Publications


Institute of Computational Linguistics – University of Zurich