The Credit Suisse PDF Bulletin Corpus

Release 5.0, 27 February 2019, by the Institute of Computational Linguistics, University of Zurich

The Credit Suisse PDF Bulletin Corpus is a collection of magazine articles from the Credit Suisse Bulletin in four languages (English, French, German, Italian). The articles range from 1998 to 2017. This corpus is part of a larger initiative to build a corpus on the Credit Suisse Bulletin, the world's oldest banking magazine (since 1895).

Release 5.0 contains multiple minor fixes and improvements:

  1. Language identification
  2. Tokenisation
  3. POS tags


Size

Tokens were counted after automatic tokenisation (i.e. punctuation marks count as separate tokens). Types (= unique words) were counted without modifications. This means that upper and lower case variants count as separate types.

            English       French        German        Italian
Magazines   71            117           125           103
Articles    1'405         2'614         2'714         2'320
Tokens      2.2 million   4.0 million   3.6 million   3.4 million
Types       78'754        114'011       204'528       117'638
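
The counting convention described above can be illustrated with a minimal Python sketch. The token list is a made-up example, not corpus data, and the function name is hypothetical; it only shows that punctuation marks count as tokens and that types are case-sensitive.

```python
from collections import Counter

def count_tokens_and_types(tokens):
    """Return (number of tokens, number of types) for a tokenised text.

    Types are counted without modification, so "Bank" and "bank"
    are two different types.
    """
    counts = Counter(tokens)
    return sum(counts.values()), len(counts)

# Hypothetical tokeniser output; punctuation marks are separate tokens.
tokens = ["The", "bank", "opened", ".", "the", "Bank", "closed", "."]
n_tokens, n_types = count_tokens_and_types(tokens)
print(n_tokens, n_types)
```

In this example only the full stop occurs twice, so the eight tokens yield seven types.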

Processing

Cleaning and Conversion to XML

The articles were collected as PDF files. In a first step, the files were converted to TETML with PDFlib TET. At this stage, a tool for article boundary recognition was used to identify article headings. At the same time we also used the text coordinates given in the TETML-format to filter out headers and footers:

  1. For filtering out headers, we used string matching to find the first header (often "Editorial" or "Content"). We stored its height and removed all segments at the same or above that height.
  2. For filtering out footers, we identified the height of the last segment on the first twenty pages. We stored the most frequent height and removed all segments at the same or below that height.

The magazines were then converted to an XML format that preserves the article structure.
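
The footer filter from step 2 can be sketched roughly as follows. This is an illustration, not our actual script: it assumes that each page is a list of segments in document order, that "y" is the segment height measured from the bottom of the page, and that heights are rounded to whole units before counting. The segment dictionaries and the footer strings are invented.

```python
from collections import Counter

def footer_height(pages, sample_pages=20):
    """Return the most frequent height of the last segment on the first pages."""
    last_heights = [round(page[-1]["y"]) for page in pages[:sample_pages] if page]
    return Counter(last_heights).most_common(1)[0][0]

def remove_footers(pages, sample_pages=20):
    """Drop every segment at or below the estimated footer height."""
    limit = footer_height(pages, sample_pages)
    return [[seg for seg in page if round(seg["y"]) > limit] for page in pages]

# Hypothetical pages: the third page has no footer.
pages = [
    [{"y": 700.0, "text": "Body"}, {"y": 400.0, "text": "Body"}, {"y": 30.0, "text": "Bulletin 5"}],
    [{"y": 650.0, "text": "Body"}, {"y": 30.0, "text": "Bulletin 6"}],
    [{"y": 500.0, "text": "Body"}, {"y": 80.0, "text": "Body"}],
]
cleaned = remove_footers(pages)
print([len(page) for page in cleaned])
```

The header filter works analogously in the other direction, with the reference height taken from the first matched header string instead of a frequency count.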

Tokenisation

Tokenisation was done with Johannes Graën's "cutter". It uses abbreviation lists to distinguish sentence-final dots from abbreviation dots, and it follows our tokenisation guidelines, which are included in this corpus package.
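
The idea of an abbreviation list can be illustrated with a small Python sketch. It is not how "cutter" is implemented; the abbreviation set and the function name are hypothetical, and the input is assumed to be split on whitespace already.

```python
# Hypothetical abbreviation list; real lists are language-specific and much longer.
ABBREVIATIONS = {"Dr.", "No.", "e.g.", "approx."}

def split_dots(words, abbreviations=ABBREVIATIONS):
    """Separate a trailing dot from a word unless the word is a listed abbreviation."""
    tokens = []
    for word in words:
        if len(word) > 1 and word.endswith(".") and word not in abbreviations:
            tokens.extend([word[:-1], "."])  # dot becomes its own token
        else:
            tokens.append(word)  # abbreviation dot stays attached
    return tokens

print(split_dots("Dr. Meier joined the bank in 1998.".split()))
```

An abbreviation that also ends a sentence stays ambiguous in this sketch; a real tokeniser needs further rules for that case.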

Part-of-Speech Tagging and Lemmatisation

  1. We used the TreeTagger for PoS Tagging and Lemmatisation of the four languages.
  2. Our French tag set is a slightly modified version of the set used in the Text+Berg corpus.
  3. For German, we extended the tagger lexicon with the help of the morphological analyser GerTwol. We also applied our program for detecting multiword adverbs (e.g. auf und ab, nach und nach, nach wie vor), completing elliptical compounds (e.g. Ein- und Ausfahrt) and re-attaching separated verb prefixes (e.g. fängt ... an --> anfangen).
  4. For Italian and English, we extended the tagger lexicon with the help of the morphological analyser TextPro.

Language Identification

We offer language identification at sentence level. We used the Python module "langid" for the identification. The language of each sentence is stored in the XML attribute "lang". We restricted the allowed languages to German, English, French, Italian and Spanish.
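
A minimal sketch of this step is given below. The element name "s" for sentences and both function names are assumptions made for illustration; only the "lang" attribute and the restriction to five languages come from the description above. The classifier is passed in as a function so that the annotation logic can be tried out without langid installed.

```python
import xml.etree.ElementTree as ET

ALLOWED = ["de", "en", "fr", "it", "es"]

def langid_classifier(allowed=ALLOWED):
    """Build a classifier function backed by the third-party langid module."""
    import langid  # pip install langid
    langid.set_languages(allowed)  # restrict the candidate languages
    return lambda text: langid.classify(text)[0]  # classify returns (language, score)

def annotate_languages(article, classify):
    """Store the detected language of every sentence in its "lang" attribute."""
    for sentence in article.iter("s"):  # "s" is an assumed element name
        text = " ".join("".join(sentence.itertext()).split())
        sentence.set("lang", classify(text))
    return article
```

With langid installed, `annotate_languages(ET.parse(path).getroot(), langid_classifier())` would annotate one file, where `path` is a placeholder for a corpus file.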

Cross-language Alignment

Both stand-off article and sentence alignments are included in this release.

Article Alignment

Our approach for article alignment was inspired by Rico Sennrich's Bleualign tool (which we used for sentence alignment). For each language pair, one side is translated into the other language using an SMT system. We then compute the BLEU score between a translated article and an article in the other language and use it as a similarity metric to decide whether the two articles are parallel.
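
The following Python sketch illustrates the idea of using BLEU as an article similarity measure. It is a simplified, unsmoothed single-reference BLEU and not our production code; the function names, the threshold of 0.2 and the example articles are assumptions chosen for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified unsmoothed BLEU of a candidate against one reference."""
    if not candidate or not reference:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())  # clipped n-gram matches
        if overlap == 0:
            return 0.0
        log_precision += math.log(overlap / sum(cand.values())) / max_n
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(log_precision)

def best_match(translated, candidates, threshold=0.2):
    """Return the id of the most similar article, or None if below the threshold."""
    if not candidates:
        return None
    scores = {art_id: bleu(translated, toks) for art_id, toks in candidates.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

# Hypothetical machine-translated article and two candidate articles.
translated = "the bank opened a new branch in zurich".split()
candidates = {
    "a1": "the bank opened a new branch in geneva".split(),
    "a2": "interest rates remain low this year".split(),
}
print(best_match(translated, candidates))
```

Real articles are much longer than these toy sentences, so a usable threshold has to be tuned on actual data.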

Sentence Alignment

We computed the sentence alignments for all six language pairs with HunAlign. The alignments include 1-1, 1-many, many-1, 0-many, many-0, and (few) many-many alignments. The 1-1 alignments account for more than 85% of all alignments; 1-2 and 2-1 alignments account for around another 10%. This indicates that the different language versions of the articles are close to each other.

                             DE-EN     DE-FR     DE-IT     FR-IT     FR-EN     IT-EN
Aligned Articles             1'250     2'479     1'995     2'252     1'204     1'159
Sentence Alignments          114'100   215'998   183'179   190'561   105'347   101'163
1-1 Sentence Alignments      89.4%     87.2%     88.1%     87.8%     86.9%     87.2%
1-2 + 2-1 Sent. Alignments   9.4%      11.3%     10.5%     10.9%     11.6%     11.4%

Below is an example excerpt from the sentence alignment of the fourth magazine published in 2016:

<linkGrp targType="yearbook" xtargets="bulletin_2016_4_stability_de.xml;bulletin_2016_4_stability_en.xml" lang="de;en">
      ...
      <linkGrp targType="article" xtargets="a2;a2">
        <link targType="sentence" type="1-1" xtargets="s1;s1"/>
        <link targType="sentence" type="2-1" xtargets="s2 s3;s2"/>

The excerpt shows that the German article "a2" is aligned with the English article "a2". The first sentences of the two articles are aligned with each other. However, the second sentence in the English article is aligned with both the second and third sentences of the German version.
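
The stand-off alignment format can be read with a few lines of Python. The sketch below is an illustration only: the excerpt above is truncated, so the closing tags have been added here to make it well-formed, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The excerpt from above, with closing tags added so that it parses.
EXCERPT = """
<linkGrp targType="yearbook" xtargets="bulletin_2016_4_stability_de.xml;bulletin_2016_4_stability_en.xml" lang="de;en">
  <linkGrp targType="article" xtargets="a2;a2">
    <link targType="sentence" type="1-1" xtargets="s1;s1"/>
    <link targType="sentence" type="2-1" xtargets="s2 s3;s2"/>
  </linkGrp>
</linkGrp>
"""

def read_alignments(xml_text):
    """Yield (src_article, tgt_article, src_sentences, tgt_sentences, type)."""
    root = ET.fromstring(xml_text.strip())
    for group in root.iter("linkGrp"):
        if group.get("targType") != "article":
            continue  # skip the outer magazine-level group
        src_article, tgt_article = group.get("xtargets").split(";")
        for link in group.findall("link"):
            src, tgt = link.get("xtargets").split(";")
            # Sentence ids on one side are separated by spaces.
            yield src_article, tgt_article, src.split(), tgt.split(), link.get("type")

alignments = list(read_alignments(EXCERPT))
for alignment in alignments:
    print(alignment)
print(Counter(a[4] for a in alignments))
```

Counting the "type" values over a whole alignment file in this way gives the kind of distribution reported in the table above.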

Here are the example sentences:

XML-Format

We supply an XML-DTD with the corpus files. All corpus files have been validated against this DTD.

Limitations

  1. We are aware that the complex magazine layouts occasionally resulted in text fragments that appear at the wrong position in the text or that are split in half (especially article headings that run over two pages).
  2. We are aware that PDFlib TET occasionally splits words at random positions. We use a script to concatenate tokens with unknown lemmas if their concatenated form is known to TreeTagger. However, it is possible that not all split errors are covered by this measure.
  3. We are aware that the different language versions contain sentences and fragments in other languages. So far, we have applied only language identification, but not code-switching detection.
  4. We are aware that our automatic alignment of articles and sentences may occasionally contain wrong alignments.
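
The repair of split words mentioned in item 2 can be sketched as follows. This is an illustration, not our script: it assumes that tokens come as (form, lemma) pairs with the lemma "<unknown>" for unknown words, that both neighbouring tokens must be unknown, and that a plain set of known word forms stands in for the TreeTagger lexicon lookup.

```python
UNKNOWN = "<unknown>"

def rejoin_split_words(tagged, known_forms):
    """Merge two adjacent unknown tokens if their concatenation is a known form."""
    result, i = [], 0
    while i < len(tagged):
        form, lemma = tagged[i]
        if i + 1 < len(tagged) and lemma == UNKNOWN and tagged[i + 1][1] == UNKNOWN:
            joined = form + tagged[i + 1][0]
            if joined in known_forms:
                result.append(joined)
                i += 2  # both fragments are consumed
                continue
        result.append(form)
        i += 1
    return result

# Hypothetical tagger output with one word split by the PDF extraction.
tagged = [("The", "the"), ("inves", UNKNOWN), ("tment", UNKNOWN), ("grew", "grow")]
print(rejoin_split_words(tagged, known_forms={"investment"}))
```

Words split into more than two fragments, or fragments that happen to be known words themselves, are not repaired by this sketch.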

License

The corpus is distributed free of charge and is freely available for non-commercial purposes (as granted by Credit Suisse).

Acknowledgement

For citing the corpus, we recommend:

@MISC{CS_Bulletin_Corpus_Release_v5.0_2019,
  editor = {Martin Volk and Alena Zwahlen and Chantal Amrhein},
  year = 2019,
  title = {Credit Suisse Bulletin Corpus (Release 5.0)},
  note = {A collection of translated magazines in English, French, German and Italian},
  howpublished = {XML-Format},
  school = {Institut für Computerlinguistik, Universität Zürich}
}

We gratefully acknowledge support by the Swiss National Library and Credit Suisse.

The following students have made special contributions to this corpus.

Martin Volk, Institute of Computational Linguistics, University of Zurich