Release_3.0, 27. February 2019 by Institute of Computational Linguistics, University of Zurich
The Credit Suisse Bulletin In Print Corpus is a collection of magazine articles from the Credit Suisse Bulletin in five languages (English, French, German, Italian, Spanish). The articles date from 1895 to 1997. This corpus is part of a larger initiative to build a corpus of the Credit Suisse Bulletin, the world's oldest banking magazine (published from 1895 to today).
Tokens were counted after automatic tokenisation (i.e. punctuation marks count as separate tokens). Types (= unique words) were counted without modifications. This means that upper and lower case variants count as separate types.
| | English | French | German | Italian | Spanish |
|---|---|---|---|---|---|
| Magazines | 101 | 627 | 787 | 99 | 19 |
| Tokens | 4.6 million | 16.1 million | 14.2 million | 4.6 million | 0.9 million |
| Types | 112'646 | 262'097 | 467'354 | 129'632 | 46'421 |
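The counting convention described above (punctuation marks as separate tokens, case-sensitive types) can be illustrated with a minimal Python sketch. The function name and the sample sentence are hypothetical and only serve as an illustration; this is not the script that produced the figures in the table.

```python
from collections import Counter


def count_tokens_and_types(tokens):
    """Return (number of tokens, number of types) for a tokenised text."""
    # Types are counted without modification, so "The" and "the"
    # are two different types.
    type_frequencies = Counter(tokens)
    return len(tokens), len(type_frequencies)


# Hypothetical tokenised sentence; punctuation marks are separate tokens.
sample = ["The", "bank", "opened", "the", "branch", ",", "the", "first", "."]

n_tokens, n_types = count_tokens_and_types(sample)
print(n_tokens, n_types)
```

Lowercasing the tokens before counting would merge "The" and "the" and therefore yield a smaller type count than the one reported here.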
The magazines were collected as printed issues. In a first step, the magazines were scanned in cooperation with the Swiss National Library. Then, we extracted the text from the images using the Optical Character Recognition (OCR) tool provided by ABBYY. The output XML was then further processed to fit our own XML structure. We have not yet identified article boundaries in the OCRed magazines.
Tokenisation was done with Johannes Graën's "cutter". It uses abbreviation lists to distinguish sentence-final dots from abbreviation dots. It follows our tokenisation guidelines, which are included in this corpus package.
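The general idea of using an abbreviation list to decide whether a dot ends a sentence can be sketched as follows. This is a deliberately simplified illustration and not the implementation used by "cutter"; the abbreviation list, the function name and the sample text are assumptions made for this example.

```python
# Hypothetical toy abbreviation list; real lists are language-specific
# and much larger.
ABBREVIATIONS = {"Dr.", "Nr.", "z.B.", "etc."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split whitespace-separated text into sentences at sentence-final dots."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        # A trailing dot ends the sentence unless the word is a known
        # abbreviation.
        if word.endswith(".") and word not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep any trailing material that has no closing dot.
        sentences.append(" ".join(current))
    return sentences


text = "Dr. Meier visited branch Nr. 5 in Zurich. The bank opened in 1856."
for sentence in split_sentences(text):
    print(sentence)
```

A lookup of this kind cannot resolve an abbreviation that also happens to end a sentence, which is one reason why a production tokeniser needs more than a word list.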
We offer language identification at the sentence level. We used the Python module "langid" for the identification. The language of each sentence is stored in the XML in the attribute "lang". We restricted the allowed languages to German, English, French, Italian and Spanish.
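As a usage illustration, the "lang" attribute can be read with Python's standard XML library, for example to count sentences per language. The element names `article` and `s`, the language code values and the sample content below are assumptions made for this sketch; the DTD supplied with the corpus defines the actual structure.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment: only the attribute name "lang" is taken from the
# corpus description, everything else is invented for illustration.
SAMPLE = """<article>
  <s lang="de">Die Bank wurde 1856 gegründet.</s>
  <s lang="fr">La banque a été fondée en 1856.</s>
  <s lang="de">Sie hat ihren Sitz in Zürich.</s>
</article>"""


def sentences_per_language(xml_text, sentence_tag="s"):
    """Count sentence elements by the value of their "lang" attribute."""
    root = ET.fromstring(xml_text)
    return Counter(element.get("lang") for element in root.iter(sentence_tag))


counts = sentences_per_language(SAMPLE)
for language, n_sentences in sorted(counts.items()):
    print(language, n_sentences)
```

For a corpus file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`, with the tag name adjusted to match the DTD.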
So far, we do not offer article and sentence alignments for the Credit Suisse Bulletin In Print Corpus.
We supply an XML-DTD with the corpus files. All corpus files have been validated against this DTD.
The corpus is distributed free of charge and is freely available for non-commercial purposes (as granted by Credit Suisse).
To cite the corpus, we recommend:
@MISC{CS_OCR_Bulletin_Corpus_Release_v3.0_2019,
  editor = {Martin Volk, Alena Zwahlen and Chantal Amrhein},
  year = 2018,
  title = {Credit Suisse OCR Bulletin Corpus (Release 3.0)},
  note = {A collection of translated magazines in English, French, German, Italian and Spanish},
  howpublished = {XML-Format},
  school = {Institut für Computerlinguistik, Universität Zürich}
}
We gratefully acknowledge support by the Swiss National Library and Credit Suisse.
The following students have made special contributions to this corpus.
Martin Volk, Institute of Computational Linguistics, University of Zurich