The Text-Berg corpus is a heritage corpus of alpine and mountaineering texts. Texts have been digitised from the yearbooks of the Swiss Alpine Club, Echo des Alpes and Die Alpen, as well as the British Alpine Club's Alpine Journal. The table below provides an overview of the source material, timespan and languages included in the corpus.
Source | Timespan | Language(s) |
---|---|---|
Das Jahrbuch des SAC (SAC) | 1864-1923 | de, fr, it, rm, gsw, en (mixed) |
Das Echo des Alpes (EdA) | 1872-1924 | fr |
Die Alpen (SAC) | 1925-1956 | de, fr, it, rm, gsw, en (mixed) |
Die Alpen (SAC) | 1957-2011 | de, fr (parallel) |
The Alpine Journal (BAC) | 1969-2008 | en |
As a diachronic heritage corpus, it has inspired numerous experiments aimed at semantically enriching it as both a historical and a linguistic resource (see below).
The corpus has been divided into language-specific subsections. The table below provides an overview of corpus statistics for each subsection (k = thousand, m = million).
lang | tokens | types | lemmas | sents | texts |
---|---|---|---|---|---|
de | 23.4m | 769k | 325k | 1.3m | 12k |
fr | 14.9m | 317k | 85k | 787k | 8k |
it | 324k | 39k | 18k | 16k | 162 |
rm | 14k | 4.5k | 0.2k | 786 | 18 |
gsw | 3k | 1.3k | 0.2k | 156 | 3 |
en | 0.9k | 0.4k | 0.3k | 41 | 1 |
Total | 38.6m | 1.1m | 429k | 2.1m | 21k |
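The abbreviated figures in the table can be checked with a little arithmetic. The following is a minimal sketch, assuming the suffixes `k` and `m` stand for thousand and million. The names `STATS`, `parse_count` and `type_token_ratio` are hypothetical helpers introduced for illustration and are not part of any corpus tooling.

```python
# Token and type counts copied from the table above (rounded figures).
STATS = {
    "de": {"tokens": "23.4m", "types": "769k"},
    "fr": {"tokens": "14.9m", "types": "317k"},
    "it": {"tokens": "324k", "types": "39k"},
    "rm": {"tokens": "14k", "types": "4.5k"},
    "gsw": {"tokens": "3k", "types": "1.3k"},
    "en": {"tokens": "0.9k", "types": "0.4k"},
}

# Assumption: "k" means thousand and "m" means million.
MULTIPLIERS = {"k": 1_000, "m": 1_000_000}


def parse_count(text: str) -> int:
    """Convert an abbreviated count such as '23.4m' into an integer."""
    suffix = text[-1].lower()
    if suffix in MULTIPLIERS:
        return round(float(text[:-1]) * MULTIPLIERS[suffix])
    return round(float(text))


def type_token_ratio(lang: str) -> float:
    """Types divided by tokens for one language subsection."""
    row = STATS[lang]
    return parse_count(row["types"]) / parse_count(row["tokens"])


# Sum of the per-language token counts, to compare with the Total row.
total_tokens = sum(parse_count(row["tokens"]) for row in STATS.values())

if __name__ == "__main__":
    print(f"total tokens: {total_tokens:,}")
    for lang in STATS:
        print(f"{lang}: type/token ratio = {type_token_ratio(lang):.4f}")
```

Because the table values are rounded, the sum only approximates the Total row. Type/token ratios are also not directly comparable across subsections of very different sizes, since small samples tend to yield higher ratios.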
The corpus has been aligned on the sentence level.
lang | tokens | types | lemmas | sents | texts |
---|---|---|---|---|---|
fr | 7.4m | 185k | 40k | 376k | 4.5k |
lang | tokens | types | lemmas | sents | texts |
---|---|---|---|---|---|
en | 6.5m | 181k | 60k | 289k | 1.5k |