The Credit Suisse Bulletin In Print Corpus

Release 3.0, 27 February 2019, by the Institute of Computational Linguistics, University of Zurich

The Credit Suisse Bulletin In Print Corpus is a collection of magazine articles from the Credit Suisse Bulletin in five languages (English, French, German, Italian, Spanish). The articles range from 1895 to 1997. This corpus is part of a larger initiative to build a corpus of the Credit Suisse Bulletin, the world's oldest banking magazine (published from 1895 to today).

Release 3.0 contains multiple minor fixes and improvements:

Language identification

Sentences written entirely in capital letters were not identified correctly by the Python module "langid". This issue is now fixed. However, sentences shorter than 32 characters are assigned the language of a previous sentence, which sometimes causes faulty language identification.
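The short-sentence rule can be illustrated with a minimal sketch. It assumes that "a previous sentence" means the immediately preceding one. The function names, the `default` label for a short first sentence, and the toy classifier are hypothetical and stand in for the real pipeline, which is not reproduced here.

```python
from typing import Callable, List

MIN_CHARS = 32  # threshold stated in the release notes


def identify_languages(
    sentences: List[str],
    classify: Callable[[str], str],
    default: str = "de",
) -> List[str]:
    """Label each sentence; sentences shorter than MIN_CHARS inherit the previous label."""
    labels: List[str] = []
    previous = default  # assumption: label used if the very first sentence is short
    for sentence in sentences:
        if len(sentence) < MIN_CHARS:
            label = previous  # too short to classify reliably
        else:
            label = classify(sentence)
        labels.append(label)
        previous = label
    return labels


def toy_classifier(text: str) -> str:
    """Stand-in for a real language identifier, for demonstration only."""
    return "fr" if " le " in f" {text.lower()} " else "en"


sentences = [
    "The bank published its annual report yesterday.",
    "Bonjour à tous",
    "Le conseil a approuvé le budget annuel hier.",
]
print(identify_languages(sentences, toy_classifier))
```

In this example the short French greeting inherits the label of the English sentence before it, which is the kind of faulty identification described above.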

Tokenization

tokenization of possessive 's in English has been corrected;
some abbreviations have been added for English and French;
symbol ‘|’ (pipe) has been replaced with ‘/’ to ensure correct tokenization;
problematic tokenization of ampersand has been fixed;
tokenization of hyphenated words like 'semble-t-il' has been corrected.

OCR errors

some consistent OCR errors have been corrected, for example i’été → l’été;
174 cases of corrupted spelling of 'CREDIT SUISSE' were found and corrected.

POS tags

multiple tags have been corrected for French in the tagger lexicon;
updated French tagset information sheet has been compiled and made available;
141 German function words written in capital letters have been added to the German tagger lexicon.

Size

Tokens were counted after automatic tokenisation (i.e. punctuation marks count as separate tokens). Types (= unique words) were counted without modifications. This means that upper and lower case variants count as separate types.

	English	French	German	Italian	Spanish
Magazines	101	627	787	99	19
Tokens	4.6 million	16.1 million	14.2 million	4.6 million	0.9 million
Types	112'646	262'097	467'354	129'632	46'421
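
The counting convention described above (punctuation marks are tokens, case variants are distinct types) can be sketched as follows. This is an illustration only, not the script used to produce the table; the function name and the sample tokens are hypothetical.

```python
from typing import Iterable, Tuple


def count_tokens_and_types(tokens: Iterable[str]) -> Tuple[int, int]:
    """Return (number of tokens, number of types) without any normalisation."""
    token_list = list(tokens)
    # A set keeps each distinct string once; "The" and "the" stay separate.
    return len(token_list), len(set(token_list))


sample = ["The", "bank", ",", "the", "bank", "."]
print(count_tokens_and_types(sample))  # (6, 5)
```

Here "The" and "the" count as two types, while the repeated "bank" counts once.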

Processing

Cleaning and Conversion to XML

The magazines were collected as printed issues. In a first step, the magazines were scanned in cooperation with the Swiss National Library. Then, we extracted the text from the images using the Optical Character Recognition (OCR) tool provided by ABBYY. The output XML was then further processed to fit our own XML structure. We have not yet identified article boundaries in the OCRed magazines.

Tokenisation

Tokenisation was done with Johannes Graën's "cutter". It uses abbreviation lists to distinguish sentence-final dots from abbreviation dots. It follows our tokenisation guidelines, which are included in this corpus package.
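
The idea of using an abbreviation list to decide whether a trailing dot is split off can be sketched in a few lines. This is a simplified illustration and not the "cutter" implementation; the function names and the one-entry abbreviation list are hypothetical.

```python
from typing import List, Set


def split_final_dot(word: str, abbreviations: Set[str]) -> List[str]:
    """Split a trailing dot off a word unless the word is a listed abbreviation."""
    if word in abbreviations or not word.endswith("."):
        return [word]
    stem = word[:-1]
    if not stem or stem.endswith("."):
        return [word]  # leave ".", "..." and similar strings untouched
    return [stem, "."]


def tokenize(text: str, abbreviations: Set[str]) -> List[str]:
    """Whitespace-split the text, then apply the dot rule to each word."""
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(split_final_dot(word, abbreviations))
    return tokens


print(tokenize("Dr. Meier left.", {"Dr."}))
```

With "Dr." in the list, its dot stays attached, while the dot after "left" becomes a separate token.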

Part-of-Speech Tagging and Lemmatisation

We used the TreeTagger for PoS Tagging and Lemmatisation of the five languages.
Our French tag set is a slightly modified version of the set used in the Text+Berg corpus.
For German, we extended the tagger lexicon with the help of the morphological analyser GerTwol. We also applied our program for the detection of multiword adverbs (e.g. auf und ab, nach und nach, nach wie vor) and the re-attachment of separated verb prefixes (e.g. fängt ... an --> anfangen).
For Italian and English, we extended the tagger lexicon with the help of the morphological analyser TextPro.
For Spanish, we have not yet extended the tagger lexicon since there are only a few magazines published in Spanish.
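
As an illustration of the multiword adverb detection mentioned above for German, a simple list lookup over the token sequence might look like this. It is a sketch under stated assumptions, not our program: the pattern list contains only the three examples given above, and the case-insensitive matching and the function name are hypothetical.

```python
from typing import List, Sequence, Tuple

# Only the examples named in the text; a real list would be longer.
MULTIWORD_ADVERBS = [
    ("auf", "und", "ab"),
    ("nach", "und", "nach"),
    ("nach", "wie", "vor"),
]


def find_multiword_adverbs(
    tokens: Sequence[str],
    patterns: Sequence[Tuple[str, ...]] = MULTIWORD_ADVERBS,
) -> List[Tuple[int, int]]:
    """Return (start, end) token spans that match a listed multiword adverb."""
    lowered = [t.lower() for t in tokens]  # assumption: match case-insensitively
    spans: List[Tuple[int, int]] = []
    i = 0
    while i < len(lowered):
        for pattern in patterns:
            n = len(pattern)
            if tuple(lowered[i:i + n]) == pattern:
                spans.append((i, i + n))
                i += n - 1  # skip the rest of the matched span
                break
        i += 1
    return spans


print(find_multiword_adverbs(["Er", "ging", "auf", "und", "ab", "."]))
```

The returned spans use Python slice conventions, so `tokens[start:end]` yields the matched adverb.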

Language Identification

We offer language identification at the sentence level, using the Python module "langid". The language of each sentence is stored in the XML in the attribute "lang". We restricted the allowed languages to German, English, French, Italian and Spanish.
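
The "lang" attribute can be read with any XML parser. The following sketch counts sentences per language using the Python standard library. The element names `doc` and `s`, the language code values, and the sample content are assumptions made for illustration; the actual structure is defined by the DTD supplied with the corpus.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample; only the attribute name "lang" is taken from the text.
SAMPLE = """<doc>
  <s lang="de">Guten Tag.</s>
  <s lang="fr">Bonjour.</s>
  <s lang="de">Auf Wiedersehen.</s>
</doc>"""


def count_sentence_languages(xml_text: str) -> Counter:
    """Count the values of the 'lang' attribute over all elements that carry it."""
    root = ET.fromstring(xml_text)
    return Counter(
        elem.get("lang") for elem in root.iter() if elem.get("lang") is not None
    )


print(count_sentence_languages(SAMPLE))
```

Because the function looks at every element that has a "lang" attribute, it does not depend on the assumed element names.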

Article and Sentence Alignment

So far, we do not offer article and sentence alignments for the Credit Suisse Bulletin In Print Corpus.

XML-Format

We supply an XML-DTD with the corpus files. All corpus files have been validated against this DTD.

Limitations

We are aware that Optical Layout Recognition (OLR) errors may have resulted in falsely split sentences and text fragments.
We are aware that OCR errors may have resulted in incorrectly recognized characters.
We are aware that the different language versions contain sentences and fragments in other languages. So far, we have applied only language identification, but not code-switching detection.

License

The corpus is distributed free of charge and is freely available for non-commercial purposes (as granted by Credit Suisse).

Acknowledgement

To cite the corpus, we recommend:

@MISC{CS_OCR_Bulletin_Corpus_Release_v3.0_2019,
  editor = {Martin Volk and Alena Zwahlen and Chantal Amrhein},
  year = 2018,
  title = {Credit Suisse OCR Bulletin Corpus (Release 3.0)},
  note = {A collection of translated magazines in English, French, German, Italian and Spanish},
  howpublished = {XML-Format},
  school = {Institut für Computerlinguistik, Universität Zürich}
}

We gratefully acknowledge support by the Swiss National Library and Credit Suisse.

The following students have made special contributions to this corpus.

Noëmi Aepli
Katrin Affolter
Chiara Baffelli
Mathias Müller
Michela Rossi
Till Salinger
Dominique Sandoz
Phillip Ströbel
Yvonne Zgraggen
Anastassia Shaitarova

Martin Volk, Institute of Computational Linguistics, University of Zurich