Differences

This shows you the differences between two versions of the page.

public:pacoco:text_berg [2019-07-17 20:45] – created tkew
public:pacoco:text_berg [2023-09-15 20:33] (current) – external edit 127.0.0.1
Line 1: Line 1:
-===== The Text+Berg Corpus =====
 +~~NOTOC~~
 +====== The Text+Berg Corpus ======
  
-{{ :public:pacoco:sac_page.png?400|}}
 +{{ :public:pacoco:sac_page.png?300|Page from the 1906 SAC yearbook}}
  
 The Text+Berg corpus is a heritage corpus of alpine and mountaineering texts. Texts have been digitised from the yearbooks of the Swiss Alpine Club, Echo des Alpes and Die Alpen, as well as the British Alpine Club's Alpine Journal. The table below provides an overview of the source material, timespan and languages included in the corpus.
  
-^ Source                ^ Timespan   ^ Language(s)                 ^
-| Das Jahrbuch des SAC  | 1864-1923  | de, fr, it, rm, en (mixed)  |
-| Das Echo des Alpes    | 1872-1924  | fr                          |
-| Die Alpen             | 1925-1956  | de, fr, it, rm, en (mixed)  |
-| Die Alpen             | 1957-2011  | de, fr (parallel)           |
-| The Alpine Journal    | 1969-2008  | en                          |
 +^ Source                      ^ Timespan   ^ Language(s)                      ^
 +| Das Jahrbuch des SAC (SAC)  | 1864-1923  | de, fr, it, rm, gsw, en (mixed)  |
 +| Das Echo des Alpes (EdA)    | 1872-1924  | fr                               |
 +| Die Alpen (SAC)             | 1925-1956  | de, fr, it, rm, gsw, en (mixed)  |
 +| Die Alpen (SAC)             | 1957-2011  | de, fr (parallel)                |
 +| The Alpine Journal (BAC)    | 1969-2008  | en                               |
  
  
 +As this is a diachronic heritage corpus, its development has inspired numerous experiments to semantically enrich it as both a historical and a linguistic resource (see below).
  
-Relevant links:
 +The corpus has been divided into language-specific subsections. The tables below provide corpus statistics for each subsection.
-[[http://textberg.ch/site/en/corpora/|Text+Berg Project Website]]+
  
-This corpus has posed a number of challenges regarding digitisation and linguistic annotation 
  
-inspired a number of experiments in linguistic annotation of   and spans 150 years of alpin
 +===== SAC =====
 +^ lang   ^ tokens  ^ types  ^ lemmas  ^ sents  ^ texts  ^
 +^ de     |   23.4m |   769k |    325k |   1.3m |    12k |
 +^ fr     |   14.9m |   317k |     85k |   787k |     8k |
 +^ it     |    324k |    39k |     18k |    16k |    162 |
 +^ rm     |     14k |   4.5k |    0.2k |    786 |     18 |
 +^ gsw    |      3k |   1.3k |    0.2k |    156 |      3 |
 +^ en     |    0.9k |   0.4k |    0.3k |     41 |      1 |
 +^ Total  ^   38.6m ^   1.1m ^    429k ^   2.1m ^    21k ^
  
-Being a diarchronic heritage corpus, its development . Numerous experiments have been undertaken to semantically enrich this corpus as both a historic and a linguistic resource. These include, but are not limited to, a novel approach to correcting optical character recognition (OCR) errors \anonref{\citep{Clem16}}; gazetteer and rule-based NER for the annotation of personal names, toponyms, organisations and time expressions \anonref{\citep{Ebling2011}}; improved lemmatisation for German separable prefix verbs and elliptical compound nouns \anon{\cite{Volk2016}}; innovative techniques for sentence alignment in parallel texts \anonref{\cite{Sennrich2010}}; and the creation of a manually annotated parallel treebank with more than 1000 sentences in French and German for the purpose of assisting statistical machine translation \anonref{\citep{Goering2011}}.
 +==== Alignment ====
 +The corpus has been aligned at the sentence level.
  
  
-[[https://www.zora.uzh.ch/id/eprint/50451/|]]
 +===== EdA =====
 +^ lang  ^ tokens  ^ types  ^ lemmas  ^ sents  ^ texts  ^ 
 +^ fr    |    7.4m |   185k |     40k |   376k |   4.5k | 
 + 
 + 
 +===== BAC ===== 
 +^ lang  ^ tokens  ^ types  ^ lemmas  ^ sents  ^ texts  ^ 
 +^ en    |    6.5m |   181k |     60k |   289k |   1.5k | 
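
The statistics tables above give rounded, abbreviated counts (for example 23.4m tokens or 769k types). A minimal sketch of how such figures could be turned into numbers for rough comparison follows. It assumes that `k` means thousands and `m` means millions; the helper names `parse_count` and `type_token_ratio` are hypothetical and not part of any Text+Berg tooling. The type-token ratio is used here only as a familiar illustration; it is not a measure reported on this page, and it is sensitive to corpus size, so values for subcorpora of different sizes are not directly comparable.

```python
def parse_count(text: str) -> float:
    """Convert an abbreviated count such as '23.4m' or '769k' to a number.

    Assumption: 'k' = 1,000 and 'm' = 1,000,000; plain digits are taken as-is.
    """
    multipliers = {"k": 1_000, "m": 1_000_000}
    text = text.strip().lower()
    if text and text[-1] in multipliers:
        return float(text[:-1]) * multipliers[text[-1]]
    return float(text)


def type_token_ratio(tokens: str, types: str) -> float:
    """Approximate type-token ratio from two abbreviated counts."""
    return parse_count(types) / parse_count(tokens)


# (tokens, types) copied from the tables above; these are rounded figures,
# so every derived value is approximate.
STATS = {
    ("SAC", "de"): ("23.4m", "769k"),
    ("SAC", "fr"): ("14.9m", "317k"),
    ("EdA", "fr"): ("7.4m", "185k"),
    ("BAC", "en"): ("6.5m", "181k"),
}

for (source, lang), (tokens, types) in STATS.items():
    ratio = type_token_ratio(tokens, types)
    print(f"{source} {lang}: approx. type-token ratio {ratio:.4f}")
```

Because the inputs are rounded to two or three significant digits, the resulting ratios should be read as order-of-magnitude indications only.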
 + 
 + 
 +===== Publications ===== 
 +  * Detection and annotation of code-switching [[https://www.zora.uzh.ch/id/eprint/100577/|Clematide and Volk 2014]] 
 +  * Crowdsourced correction of OCR errors [[https://www.zora.uzh.ch/id/eprint/124786/|Clematide et al. 2016]], [[https://www.zora.uzh.ch/id/eprint/162395/|Clematide et al. 2018]] 
 +  * Gazetteer and rule-based NER for the annotation of personal names, toponyms, organisations and time expressions [[https://www.zora.uzh.ch/id/eprint/50451/|Ebling et al. 2011]], [[https://www.zora.uzh.ch/id/eprint/20591/|Volk et al. 2009]] 
 +  * Development of a manually annotated parallel treebank [[https://www.zora.uzh.ch/id/eprint/33378/|Göhring et al. 2010]] 
 +  * Challenges in building a multilingual alpine heritage corpus [[https://www.zora.uzh.ch/id/eprint/34264/|Volk et al. 2010]]
 +  * Special handling of elliptical compound nouns and separable prefix verbs in German [[https://www.zora.uzh.ch/id/eprint/126372/|Volk et al. 2016]], [[https://www.zora.uzh.ch/id/eprint/85249/|Aepli and Volk 2013]]
 +  * See here for more [[http://textberg.ch/site/de/publi/|publications from the Text+Berg project]] 
 + 
 + 
 +===== Relevant links ===== 
 + 
 +  * [[http://textberg.ch/site/en/corpora/|Text+Berg Project Website]] 
 +  * [[https://www.sac-cas.ch/|Swiss Alpine Club]] 

CL Wiki

Institute of Computational Linguistics – University of Zurich