The Credit Suisse Bulletin In Print Corpus

Release 3.0, 27 February 2019, by the Institute of Computational Linguistics, University of Zurich

The Credit Suisse Bulletin In Print Corpus is a collection of magazine articles from the Credit Suisse Bulletin in five languages (English, French, German, Italian, Spanish). The articles date from 1895 to 1997. This corpus is part of a larger initiative to build a corpus of the Credit Suisse Bulletin, the world's oldest banking magazine (published from 1895 to today).


Release 3.0 contains multiple minor fixes and improvements:

Language identification

Tokenization

OCR errors

POS tags


Size

Tokens were counted after automatic tokenisation (i.e. punctuation marks count as separate tokens). Types (= unique words) were counted without modifications. This means that upper and lower case variants count as separate types.

            English       French        German        Italian       Spanish
Magazines   101           627           787           99            19
Tokens      4.6 million   16.1 million  14.2 million  4.6 million   0.9 million
Types       112'646       262'097       467'354       129'632       46'421
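
The counting convention described above can be illustrated with a minimal Python sketch. It is not the script used to produce the figures in the table; the function name and the sample tokens are made up for illustration.

```python
from collections import Counter


def count_tokens_and_types(tokens):
    """Return (token_count, type_count) for a list of token strings.

    Types are counted on the unmodified surface forms, so "Bank" and
    "bank" are two different types. Punctuation marks are ordinary tokens.
    """
    counts = Counter(tokens)
    return sum(counts.values()), len(counts)


# Hypothetical, already tokenised input.
sample = ["The", "bank", "opened", ".", "the", "Bank", "closed", "."]
n_tokens, n_types = count_tokens_and_types(sample)
print(n_tokens, n_types)  # 8 tokens, 7 types (only "." occurs twice)
```

Lower-casing the tokens before counting would merge the case variants and yield a smaller type count than the one reported here.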

Processing

Cleaning and Conversion to XML

The magazines were collected as printed issues. In a first step, the magazines were scanned in cooperation with the Swiss National Library. Then, we extracted the text from the images using the Optical Character Recognition (OCR) tool provided by ABBYY. The output XML was then further processed to fit our own XML structure. We have not yet identified article boundaries in the OCRed magazines.

Tokenisation

Tokenisation was done with Johannes Graën's "cutter". It uses abbreviation lists to distinguish sentence-final dots from abbreviation dots. It follows our tokenisation guidelines, which are included in this corpus package.
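
To illustrate the idea of an abbreviation list, here is a deliberately simplified Python sketch. It is not the "cutter" tool and does not follow its rules. The abbreviation list and the function name are assumptions made for this example only.

```python
# Hypothetical sample list; real abbreviation lists are language-specific
# and much longer.
ABBREVIATIONS = {"dr.", "z.b.", "etc.", "e.g."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split whitespace-separated text into sentences of tokens.

    A trailing dot ends the sentence and becomes its own token, unless
    the whole chunk is a known abbreviation, in which case the dot stays
    attached and the sentence continues.
    """
    sentences, current = [], []
    for chunk in text.split():
        if chunk.endswith(".") and chunk.lower() not in abbreviations:
            word = chunk[:-1]
            if word:
                current.append(word)
            current.append(".")
            sentences.append(current)
            current = []
        else:
            current.append(chunk)
    if current:  # trailing material without a final dot
        sentences.append(current)
    return sentences


for sentence in split_sentences("Dr. Meier kam an. Er blieb z.B. zwei Tage."):
    print(sentence)
```

This sketch cannot handle an abbreviation that also ends a sentence, and it ignores all punctuation other than the dot.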

Part-of-Speech Tagging and Lemmatisation

  1. We used the TreeTagger for PoS Tagging and Lemmatisation of the five languages.
  2. Our French tag set is a slightly modified version of the set used in the Text+Berg corpus.
  3. For German, we extended the tagger lexicon with the help of the morphological analyser GerTwol. We also applied our program for the detection of multiword adverbs (e.g. auf und ab, nach und nach, nach wie vor) and the re-attachment of separated verb prefixes (e.g. fängt ... an --> anfangen).
  4. For Italian and English, we extended the tagger lexicon with the help of the morphological analyser TextPro.
  5. For Spanish, we have not yet extended the tagger lexicon since there are only a few magazines published in Spanish.
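
The re-attachment of separated verb prefixes mentioned for German can be pictured with the following Python sketch. It is not our actual program. It assumes that tokens carry STTS-style tags in which PTKVZ marks a separated verb prefix and VVFIN a finite full verb, and it uses a hypothetical token representation (dictionaries with the keys "word", "pos" and "lemma").

```python
def reattach_verb_prefixes(sentence):
    """Return a copy of the sentence with prefixed verb lemmas.

    For every separated prefix (PTKVZ), the nearest finite full verb
    (VVFIN) to its left receives the lemma prefix + verb lemma.
    The input list is left unchanged.
    """
    result = [dict(token) for token in sentence]
    for i, token in enumerate(result):
        if token["pos"] != "PTKVZ":
            continue
        # Search leftwards for the closest finite full verb.
        for j in range(i - 1, -1, -1):
            if result[j]["pos"] == "VVFIN":
                result[j]["lemma"] = token["word"].lower() + result[j]["lemma"]
                break
    return result


# Hypothetical tagged sentence: "Der Kurs fängt morgen an ."
tagged = [
    {"word": "Der", "pos": "ART", "lemma": "die"},
    {"word": "Kurs", "pos": "NN", "lemma": "Kurs"},
    {"word": "fängt", "pos": "VVFIN", "lemma": "fangen"},
    {"word": "morgen", "pos": "ADV", "lemma": "morgen"},
    {"word": "an", "pos": "PTKVZ", "lemma": "an"},
    {"word": ".", "pos": "$.", "lemma": "."},
]
print(reattach_verb_prefixes(tagged)[2]["lemma"])  # anfangen
```

A production system needs further checks, for example clause boundaries between the verb and the prefix, which this sketch leaves out.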

Language Identification

We offer language identification at the sentence level, using the Python module "langid". We restricted the allowed languages to German, English, French, Italian and Spanish. The language of each sentence is stored in the XML in the attribute "lang".
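
The following Python sketch shows how such an annotation step could look. It is an illustration, not our processing script. The element names "s" (sentence) and "w" (token) are assumptions, so please consult the supplied DTD for the actual names. The calls langid.set_languages and langid.classify belong to the "langid" module, which is imported lazily so that the rest of the sketch runs without it.

```python
import xml.etree.ElementTree as ET

ALLOWED_LANGUAGES = ("de", "en", "fr", "it", "es")


def build_langid_classifier(allowed=ALLOWED_LANGUAGES):
    """Return a function text -> language code backed by langid."""
    import langid  # third-party module, only needed for this function

    langid.set_languages(list(allowed))
    # classify() returns a (language, score) pair; keep the language only.
    return lambda text: langid.classify(text)[0]


def annotate_languages(root, classify, sentence_tag="s", token_tag="w"):
    """Set the "lang" attribute on every sentence element below root."""
    for sentence in root.iter(sentence_tag):
        text = " ".join(tok.text or "" for tok in sentence.iter(token_tag))
        sentence.set("lang", classify(text))
    return root


# Tiny hypothetical document and a stand-in classifier for demonstration.
demo = ET.fromstring(
    "<doc><s><w>Guten</w><w>Tag</w></s><s><w>Good</w><w>morning</w></s></doc>"
)
stub = lambda text: "de" if "Guten" in text else "en"
annotate_languages(demo, stub)
print([s.get("lang") for s in demo.iter("s")])
```

With langid installed, the stand-in can be replaced by the real classifier: annotate_languages(demo, build_langid_classifier()).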

Article and Sentence Alignment

So far, we do not offer article and sentence alignments for the Credit Suisse Bulletin In Print Corpus.

XML-Format

We supply an XML-DTD with the corpus files. All corpus files have been validated against this DTD.

Limitations

  1. Optical Layout Recognition (OLR) errors may have resulted in falsely split sentences and text fragments.
  2. OCR errors may have resulted in incorrectly recognized characters.
  3. The different language versions contain sentences and fragments in other languages. So far, we have applied only language identification, not code-switching detection.
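
Users who want to gauge the last limitation for a given file can compare each sentence's "lang" attribute with the language of the magazine. The Python sketch below is one possible way to do this. The element name "s" is an assumption (see the DTD for the actual name), and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def language_profile(root, sentence_tag="s"):
    """Count sentences per value of the "lang" attribute."""
    return Counter(s.get("lang", "unknown") for s in root.iter(sentence_tag))


def foreign_sentences(root, expected_lang, sentence_tag="s"):
    """Return sentence elements whose language differs from the expected one.

    Sentences without a "lang" attribute are not reported.
    """
    return [
        s
        for s in root.iter(sentence_tag)
        if s.get("lang") not in (None, expected_lang)
    ]


# Hypothetical excerpt of a German issue containing one French sentence.
excerpt = ET.fromstring(
    '<doc><s lang="de">a</s><s lang="fr">b</s><s lang="de">c</s></doc>'
)
print(language_profile(excerpt))
print(len(foreign_sentences(excerpt, "de")))
```

Since the language is identified per sentence, a sentence that mixes languages receives a single label and will not be found this way.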

License

The corpus is distributed free of charge and is freely available for non-commercial purposes (as granted by Credit Suisse).

Acknowledgement

To cite the corpus, we recommend:

@MISC{CS_OCR_Bulletin_Corpus_Release_v3.0_2019,
  editor = {Martin Volk and Alena Zwahlen and Chantal Amrhein},
  year = 2019,
  title = {Credit Suisse OCR Bulletin Corpus (Release 3.0)},
  note = {A collection of translated magazines in English, French, German, Italian and Spanish},
  howpublished = {XML-Format},
  school = {Institut für Computerlinguistik, Universität Zürich}
}

We gratefully acknowledge support by the Swiss National Library and Credit Suisse.

The following students have made special contributions to this corpus.

Martin Volk, Institute of Computational Linguistics, University of Zurich