The Credit Suisse PDF Bulletin Corpus

The Credit Suisse PDF Bulletin Corpus is a collection of magazine articles from the Credit Suisse Bulletin in four languages (English, French, German, Italian). The articles date from 1998 to 2017. This corpus is part of a larger initiative to build a corpus of the Credit Suisse Bulletin, the world's oldest banking magazine (published since 1895).

Release_6.0, 19 July 2021, by the Institute of Computational Linguistics, University of Zurich

Release 6.0 contains the following improvements:

Lemmas

More than 200 entries in the English tagger lexicon contained problematic lemmas (cause: unknown). As a result, the English corpus files in previous releases contained a large number of incorrect lemmas (example: token="German", lemma="speaking"). The English tagger lexicon was corrected manually, and the corpus was parsed anew.

Tokenization

Phone numbers are now tokenized as a single token where groups of digits are separated by a space (example: token="044 403 41 10", lemma="@card@"). As a side effect, other sequences of space-separated numbers are also tokenized as one token (example: token="2 3 1 4", lemma="@card@").
The text originally extracted from PDF files was sometimes corrupted. Some words appeared to be glued together, either completely or with a soft hyphen ("BrightlyU+00ADcoloredU+00ADyetU+00ADstrangelyU+00ADserene"). This caused tokenization problems in the previous releases (example: token="Brightlycoloredyetstrangelyserene", lemma="unk"). In this release, the soft hyphen was replaced with a space prior to tokenization, which solved the issue.
A large number of glued-together words containing an apostrophe were split into separate tokens (example: Alzheimer'scare => Alzheimer's care).
Nevertheless, a number of word clusters resulting from PDF processing still remain in the corpus.
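
The soft-hyphen fix described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the script used for the release; the function name normalize_soft_hyphens is hypothetical.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, invisible in most viewers


def normalize_soft_hyphens(text: str) -> str:
    """Replace every soft hyphen with a space before tokenization."""
    return text.replace(SOFT_HYPHEN, " ")


glued = "Brightly\u00adcolored\u00adyet\u00adstrangely\u00adserene"
print(normalize_soft_hyphens(glued).split())
```

Words that were glued together without any separator are not recovered by this step, which is consistent with the remaining word clusters mentioned above.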

General information about the corpus

Size

Tokens were counted after automatic tokenisation (i.e. punctuation marks count as separate tokens). Types (= unique words) were counted without modifications. This means that upper and lower case variants count as separate types.

	English	French	German	Italian
Magazines	71	117	125	103
Articles	1'405	2'614	2'714	2'320
Tokens	2.2 million	4.0 million	3.6 million	3.4 million
Types	79'253	115'787	207'228	120'233
Unknown lemmas	49'683	154'110	109'758	147'453
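
The counting convention described above (punctuation marks as separate tokens, case-sensitive types) can be illustrated with a small sketch. The token list is invented for the example, and the function name count_tokens_and_types is hypothetical.

```python
from collections import Counter


def count_tokens_and_types(tokens):
    """Return (number of tokens, number of types) without case folding."""
    counts = Counter(tokens)
    return sum(counts.values()), len(counts)


# Invented, already tokenized input; punctuation marks are separate tokens.
tokens = ["The", "bank", ",", "the", "Bank", ",", "the", "bank", "."]
n_tokens, n_types = count_tokens_and_types(tokens)
print(n_tokens, n_types)
```

Because no case folding is applied, "The" and "the" as well as "bank" and "Bank" count as different types in this sketch.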

Processing

Cleaning and Conversion to XML

The articles were collected as PDF files. In a first step, the files were converted to TETML with PDFlib TET. At this stage, a tool for article boundary recognition was used to identify article headings. At the same time we also used the text coordinates given in the TETML-format to filter out headers and footers:

For filtering out headers, we used string matching to find the first header (often "Editorial" or "Content"). We stored its height and removed all segments at the same or above that height.
For filtering out footers, we identified the height of the last segment on the first twenty pages. We stored the most frequent height and removed all segments at the same or below that height.
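
The two filtering heuristics above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code used for the corpus: the Segment type, its fields, and the sample data are hypothetical, and the height is assumed to grow towards the top of the page.

```python
from collections import Counter
from typing import NamedTuple


class Segment(NamedTuple):
    page: int    # 1-based page number
    y: float     # height on the page (assumption: larger means higher up)
    text: str


HEADER_STRINGS = ("Editorial", "Content")


def header_height(segments):
    """Height of the first segment matching a known header string, or None."""
    for seg in segments:
        if seg.text.strip() in HEADER_STRINGS:
            return seg.y
    return None


def footer_height(segments, max_pages=20):
    """Most frequent height of the last segment on each of the first pages."""
    last_on_page = {}
    for seg in segments:  # segments are assumed to be in reading order
        if seg.page <= max_pages:
            last_on_page[seg.page] = seg.y
    if not last_on_page:
        return None
    return Counter(last_on_page.values()).most_common(1)[0][0]


def strip_headers_and_footers(segments):
    """Drop segments at or above the header height and at or below the footer height."""
    top = header_height(segments)
    bottom = footer_height(segments)
    kept = []
    for seg in segments:
        if top is not None and seg.y >= top:
            continue
        if bottom is not None and seg.y <= bottom:
            continue
        kept.append(seg)
    return kept


# Invented sample data for two pages.
segments = [
    Segment(1, 800.0, "Editorial"),
    Segment(1, 700.0, "First body paragraph"),
    Segment(1, 30.0, "footer text"),
    Segment(2, 810.0, "running header"),
    Segment(2, 650.0, "Second body paragraph"),
    Segment(2, 30.0, "footer text"),
]
kept = strip_headers_and_footers(segments)
print([seg.text for seg in kept])
```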

The magazines were then converted to an XML-format that preserves the article structure.

Tokenisation

Tokenisation was done with Johannes Graën's "cutter". It uses abbreviation lists to distinguish sentence-final dots from abbreviation dots. It follows our tokenisation guidelines, which are included in this corpus package.

Part-of-Speech Tagging and Lemmatisation

We used the TreeTagger for PoS Tagging and Lemmatisation of the four languages.
Our French tag set is a slightly modified version of the set used in the Text+Berg corpus.
For German, we extended the tagger lexicon with the help of the morphological analyser GerTwol. We also applied our program for detecting multiword adverbs (e.g. auf und ab, nach und nach, nach wie vor), completing elliptical compounds (e.g. Ein- und Ausfahrt) and re-attaching separated verb prefixes (e.g. fängt ... an --> anfangen).
For Italian and English, we extended the tagger lexicon with the help of the morphological analyser TextPro.

Language Identification

We offer language identification at the sentence level. We used the Python module "langid" for the identification. The language of each sentence is stored in the XML in the attribute "lang". We restricted the allowed languages to German, English, French, Italian and Spanish.
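
As a usage sketch, the "lang" attribute can be tallied with the Python standard library. The element name "s", the two-letter language codes and the sample document are assumptions made for this illustration; the DTD shipped with the corpus is the authoritative description of the format.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample; element names and attribute values are assumptions.
SAMPLE = """
<article id="a2">
  <s id="s1" lang="de">Die Weltkarte der Stabilität</s>
  <s id="s2" lang="de">Ein Blick auf die Weltkarte zeigt:</s>
  <s id="s3" lang="en">World Stability Map</s>
</article>
"""


def languages_per_sentence(xml_text, sentence_tag="s"):
    """Count sentences per value of the 'lang' attribute."""
    root = ET.fromstring(xml_text.strip())
    return Counter(s.get("lang", "unknown") for s in root.iter(sentence_tag))


lang_counts = languages_per_sentence(SAMPLE)
print(lang_counts)
```

A tally like this gives a quick impression of how many sentences in a given language version were identified as belonging to another language.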

Cross-language Alignment

Both stand-off article and sentence alignments are included in this release.

Article Alignment

Our approach for article alignment was inspired by Rico Sennrich's BLEU-Align tool (which we used for sentence alignment). For each language pair, one side is translated to the other language using an SMT system. We then compute their BLEU score and use it as a similarity metric to decide whether two articles are parallel or not.
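
To illustrate the idea of using BLEU as a similarity metric, here is a simplified, unsmoothed BLEU computed over token lists. It is a sketch only: the function names and the threshold value are hypothetical, the input is assumed to be machine-translated already, and the actual scoring used for the corpus may differ in its details.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty. Returns 0.0 if any n-gram order has no match."""
    if not candidate or not reference:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        overlap = sum((cand & ref).values())  # clipped matches
        total = sum(cand.values())
        if total == 0 or overlap == 0:
            return 0.0
        log_precision_sum += math.log(overlap / total)
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(log_precision_sum / max_n)


def is_parallel(translated_tokens, target_tokens, threshold=0.3):
    """Decide whether two articles are parallel; the threshold is a made-up value."""
    return bleu(translated_tokens, target_tokens) >= threshold


reference = "a glance at the world map shows that stability and prosperity are mutually dependent".split()
partial = "a glance at the world map shows stability".split()
print(bleu(reference, reference), bleu(partial, reference))
```

Without smoothing, very short texts often score zero, which is acceptable for whole articles but would be too strict for single sentences.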

Sentence Alignment

We computed the sentence alignments for all six language pairs with HunAlign. The result includes 1-1, 1-many, many-1, 0-many, many-0, and (few) many-many alignments. The 1-1 alignments account for more than 85% of all alignments; 1-2 and 2-1 alignments account for around another 10%. This indicates that the different language versions of the texts are close to each other.

	DE-EN	DE-FR	DE-IT	FR-IT	FR-EN	IT-EN
Aligned Articles	1'250	2'479	1'995	2'252	1'204	1'159
Sentence Alignments	112'816	214'905	182'691	190'077	103'452	99'574
1-1 Sentence Alignments	89.4%	87.2%	88.1%	87.7%	87.0%	87.3%
1-2 + 2-1 Sent. Alignments	9.4%	11.3%	10.5%	11.0%	11.6%	11.3%

Below is an example excerpt of the sentence alignment of the fourth magazine published in 2016:

<linkGrp targType="yearbook" xtargets="bulletin_2016_4_stability_de.xml;bulletin_2016_4_stability_en.xml" lang="de;en">
      ...
      <linkGrp targType="article" xtargets="a2;a2">
        <link targType="sentence" type="1-1" xtargets="s1;s1"/>
        <link targType="sentence" type="2-1" xtargets="s2 s3;s2"/>

The excerpt shows that the German article "a2" is aligned with the English article "a2". The first sentences of the two articles are aligned with each other. The second sentence of the English version, however, is aligned with both the second and the third sentence of the German version.

Here are the example sentences:

DE sentence 1: "Die Weltkarte der Stabilität"
EN sentence 1: "World Stability Map"

DE sentence 2: "Ein Blick auf die Weltkarte zeigt:"
DE sentence 3: "Stabilität und Wohlstand bedingen sich gegenseitig."
EN sentence 2: "A glance at the world map shows that stability and prosperity are mutually dependent."
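
The stand-off alignment format shown in the excerpt can be read with the Python standard library, for instance as sketched below. The closing tags in the sample were added to make it well-formed, and the function name read_sentence_links is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The excerpt from above, completed with closing tags for this illustration.
SAMPLE = """
<linkGrp targType="yearbook" xtargets="bulletin_2016_4_stability_de.xml;bulletin_2016_4_stability_en.xml" lang="de;en">
  <linkGrp targType="article" xtargets="a2;a2">
    <link targType="sentence" type="1-1" xtargets="s1;s1"/>
    <link targType="sentence" type="2-1" xtargets="s2 s3;s2"/>
  </linkGrp>
</linkGrp>
"""


def read_sentence_links(xml_text):
    """Yield (source article, target article, source ids, target ids, type)."""
    root = ET.fromstring(xml_text.strip())
    for group in root.iter("linkGrp"):
        if group.get("targType") != "article":
            continue
        src_article, tgt_article = group.get("xtargets").split(";")
        for link in group.findall("link"):
            # The two sides are separated by ';', ids within a side by spaces.
            src, tgt = link.get("xtargets").split(";")
            yield src_article, tgt_article, src.split(), tgt.split(), link.get("type")


links = list(read_sentence_links(SAMPLE))
type_counts = Counter(link[4] for link in links)
print(links)
print(type_counts)
```

Counting the "type" values in this way over a whole alignment file gives the kind of distribution reported in the table above.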

XML-Format

We supply an XML-DTD with the corpus files. All corpus files have been validated against this DTD.

Limitations

We are aware that the complex magazine layouts occasionally resulted in text fragments that appear at the wrong position in the text or that are split in half (especially article headings that run over two pages).
We are aware that PDFlib TET occasionally splits words at random positions. We use a script to concatenate tokens with unknown lemmas if their concatenated form is known to TreeTagger. However, it is possible that not all split errors are covered by this measure.
We are aware that the different language versions contain sentences and fragments in other languages. So far, we have applied only language identification, but not code-switching detection.
We are aware that our automatic alignment of articles and sentences may occasionally contain wrong alignments.
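
The concatenation of split tokens mentioned in the list above could work roughly as in the following sketch. It assumes that both fragments carry the unknown-lemma marker "unk" and uses a plain dictionary as a hypothetical stand-in for the TreeTagger lexicon; the actual script may apply different conditions.

```python
UNKNOWN = "unk"  # lemma value for unknown tokens, as in the examples above


def merge_split_tokens(tokens, known_forms):
    """Join two adjacent unknown tokens if their concatenation is a known form.

    tokens: list of (form, lemma) pairs
    known_forms: dict mapping known word forms to lemmas (hypothetical lexicon)
    """
    merged = []
    i = 0
    while i < len(tokens):
        form, lemma = tokens[i]
        if i + 1 < len(tokens):
            next_form, next_lemma = tokens[i + 1]
            joined = form + next_form
            if lemma == UNKNOWN and next_lemma == UNKNOWN and joined in known_forms:
                merged.append((joined, known_forms[joined]))
                i += 2
                continue
        merged.append((form, lemma))
        i += 1
    return merged


# Invented example of a word split at a random position.
tokens = [("The", "the"), ("sta", "unk"), ("bility", "unk"), ("map", "map")]
lexicon = {"stability": "stability"}
print(merge_split_tokens(tokens, lexicon))
```

A pairwise merge like this does not repair words split into three or more fragments, nor splits where one fragment happens to be a known word, which matches the caveat that not all split errors are covered.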

License

The corpus is distributed free of charge and is freely available for non-commercial purposes (as granted by Credit Suisse).

Acknowledgement

To cite the corpus, we recommend:

@MISC{CS_Bulletin_Corpus_Release_v5.0_2019,
  editor = {Martin Volk and Alena Zwahlen and Chantal Amrhein},
  year = 2018,
  title = {Credit Suisse Bulletin Corpus (Release 5.0)},
  note = {A collection of translated magazines in English, French, German and Italian},
  howpublished = {XML-Format},
  school = {Institut für Computerlinguistik, Universität Zürich}
}

We gratefully acknowledge support by the Swiss National Library and Credit Suisse.

The following students have made special contributions to this corpus.

Noëmi Aepli
Katrin Affolter
Chiara Baffelli
Mathias Müller
Michela Rossi
Till Salinger
Dominique Sandoz
Phillip Ströbel
Yvonne Zgraggen
Anastassia Shaitarova

Martin Volk, Institute of Computational Linguistics, University of Zurich