This shows you the differences between two versions of the page.
public:pacoco:text_berg [2019-07-17 22:09] – tkew | public:pacoco:text_berg [2023-09-15 20:33] (current) – external edit 127.0.0.1 | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ===== The Text+Berg Corpus ===== | + | ~~NOTOC~~ |
+ | ====== The Text+Berg Corpus | ||
{{ : | {{ : | ||
Line 5: | Line 6: | ||
The Text+Berg corpus is a heritage corpus of alpine and mountaineering texts. Texts have been digitised from the yearbooks of the Swiss Alpine Club, Echo des Alpes and Die Alpen, as well as the British Alpine Club's Alpine Journal. The table below provides an overview of the source material, timespan and languages included in the corpus. | The Text+Berg corpus is a heritage corpus of alpine and mountaineering texts. Texts have been digitised from the yearbooks of the Swiss Alpine Club, Echo des Alpes and Die Alpen, as well as the British Alpine Club's Alpine Journal. The table below provides an overview of the source material, timespan and languages included in the corpus. | ||
- | ^ Source | + | ^ Source |
- | | Das Jahrbuch des SAC | 1864-1923 | de, fr, it, rm, en (mixed) | | + | | Das Jahrbuch des SAC (SAC) | 1864-1923 |
- | | Das Echo des Alpes | 1872-1924 | fr | | + | | Das Echo des Alpes (EdA) | 1872-1924 |
- | | Die Alpen | 1925-1956 | de, fr, it, rm, en (mixed) | | + | | Die Alpen (SAC) |
- | | Die Alpen | 1957-2011 | de, fr (parallel) | | + | | Die Alpen (SAC) |
- | | The Alpine Journal | + | | The Alpine Journal |
Line 17: | Line 18: | ||
The corpus has been divided into language-specific subsections. The table below provides an overview of corpus statistics for each subsection. | The corpus has been divided into language-specific subsections. The table below provides an overview of corpus statistics for each subsection. | ||
- | === SAC === | ||
- | ^ lang ^ tokens ^ types ^ lemmas ^ sents ^ texts ^ | ||
- | | **de** | ||
- | | **fr** | ||
- | | **it** | ||
- | | **rm** | ||
- | | **gsw** | ||
- | | **en** | ||
- | ^ Total ^ 38.6m ^ 1.1m ^ 429k ^ 2.1m ^ 21k ^ | ||
- | ===EdA=== | + | ===== SAC ===== |
- | ^ lang ^ tokens ^ types ^ lemmas ^ sents ^ texts ^ | + | ^ lang |
- | | **fr** | + | ^ de | 23.4m | 769k | |
+ | ^ fr | ||
+ | ^ it | ||
+ | ^ rm | ||
+ | ^ gsw | 3k | 1.3k | 0.2k | 156 | 3 | | ||
+ | ^ en | ||
+ | ^ Total ^ 38.6m ^ 1.1m ^ 429k ^ 2.1m ^ 21k ^ | ||
- | ===BAC=== | + | ==== Alignment ==== |
- | ^ lang ^ tokens ^ types ^ lemmas ^ sents ^ texts ^ | + | The corpus has been aligned at the sentence level. |
- | | **en** | + | |
- | ------------------------------ | ||
- | Relevant links: | + | ===== EdA ===== |
+ | ^ lang ^ tokens | ||
+ | ^ fr | 7.4m | 185k | 40k | 376k | 4.5k | | ||
- | * [[http:// | ||
- | * [[https:// | ||
- | Publications: | + | ===== BAC ===== |
+ | ^ lang ^ tokens | ||
+ | ^ en | 6.5m | 181k | 60k | 289k | 1.5k | | ||
+ | |||
+ | |||
+ | ===== Publications | ||
* Detection and annotation of code-switching [[https:// | * Detection and annotation of code-switching [[https:// | ||
* Crowdsourced correction of OCR errors [[https:// | * Crowdsourced correction of OCR errors [[https:// | ||
Line 50: | Line 51: | ||
* Special handling of elliptical compound nouns and separable prefix verbs in German [[https:// | * Special handling of elliptical compound nouns and separable prefix verbs in German [[https:// | ||
* See here for more [[http:// | * See here for more [[http:// | ||
+ | |||
+ | |||
+ | ===== Relevant links ===== | ||
+ | |||
+ | * [[http:// | ||
+ | * [[https:// | ||
+ |