CL Wiki

The Text+Berg Corpus

The Text+Berg corpus is a heritage corpus of alpine and mountaineering texts. Texts have been digitised from the yearbooks of the Swiss Alpine Club, Echo des Alpes and Die Alpen, as well as the British Alpine Club's Alpine Journal. The table below provides an overview of the source material, timespan and languages included in the corpus.

Source	Timespan	Language(s)
Das Jahrbuch des SAC	1864-1923	de, fr, it, rm, en (mixed)
Echo des Alpes	1872-1924	fr
Die Alpen	1925-1956	de, fr, it, rm, en (mixed)
Die Alpen	1957-2011	de, fr (parallel)
The Alpine Journal	1969-2008	en

Relevant links: Text+Berg Project Website

As a diachronic heritage corpus spanning about 150 years of alpine texts, Text+Berg has posed a number of challenges regarding digitisation and linguistic annotation. Numerous experiments have been undertaken to semantically enrich this corpus as both a historic and a linguistic resource. These include, but are not limited to, a novel approach to correcting optical character recognition (OCR) errors [Clem16]; gazetteer- and rule-based named entity recognition (NER) for the annotation of personal names, toponyms, organisations and time expressions [Ebling2011]; improved lemmatisation for German separable prefix verbs and elliptical compound nouns [Volk2016]; innovative techniques for sentence alignment in parallel texts [Sennrich2010]; and the creation of a manually annotated parallel treebank with more than 1000 sentences in French and German for the purpose of assisting statistical machine translation [Goering2011].

https://www.zora.uzh.ch/id/eprint/50451/
