The Credit Suisse PDF Bulletin Corpus is a collection of magazine articles from the Credit Suisse Bulletin in four languages (English, French, German, Italian). They range from 1998 to 2017. This corpus is part of a larger initiative to build a corpus on the Credit Suisse Bulletin, the world's oldest banking magazine (since 1895).
Release_6.0, 19. July 2021 by Institute of Computational Linguistics, University of Zurich
Tokens were counted after automatic tokenisation (i.e. punctuation marks count as separate tokens). Types (= unique words) were counted without modifications. This means that upper and lower case variants count as separate types.
| | English | French | German | Italian |
|---|---|---|---|---|
| Magazines | 71 | 117 | 125 | 103 |
| Articles | 1'405 | 2'614 | 2'714 | 2'320 |
| Tokens | 2.2 million | 4.0 million | 3.6 million | 3.4 million |
| Types | 79'253 | 115'787 | 207'228 | 120'233 |
| Unknown lemmas | 49'683 | 154'110 | 109'758 | 147'453 |
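The counting convention described above can be illustrated with a minimal sketch. The regular expression is only a rough stand-in for the real tokeniser, and the function names and the sample sentence are made up for illustration.

```python
import re

def tokenise(text):
    # Rough stand-in for a real tokeniser: runs of word characters are tokens,
    # and every punctuation mark becomes a token of its own.
    return re.findall(r"\w+|[^\w\s]", text)

def count_tokens_and_types(text):
    tokens = tokenise(text)
    # Types are counted without modification, so "The" and "the" differ.
    types = set(tokens)
    return len(tokens), len(types)

sample = "The bank opened. the Bank closed."
print(count_tokens_and_types(sample))  # (8, 7)
```

The two full stops count as two tokens but one type, while "The"/"the" and "bank"/"Bank" each count as two types.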
The articles were collected as PDF files. In a first step, the files were converted to TETML with PDFlib TET. At this stage, a tool for article boundary recognition was used to identify article headings. At the same time, we used the text coordinates given in the TETML format to filter out headers and footers.
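A coordinate-based filter of this kind might look roughly like the sketch below. It does not read TETML; it assumes the text lines have already been extracted into simple records, that the vertical coordinate is measured from the bottom of the page, and that fixed margin bands are adequate. The `TextLine` class, the margin values and the sample page are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TextLine:
    text: str
    y: float  # assumed: baseline position measured from the page bottom

def strip_headers_and_footers(lines, page_height, top_margin=40.0, bottom_margin=40.0):
    """Keep only the lines whose baseline lies between the two margin bands."""
    upper_limit = page_height - top_margin
    return [ln for ln in lines if bottom_margin <= ln.y <= upper_limit]

# Made-up page: a running header, one body line and a page number.
page = [
    TextLine("RUNNING HEADER", 820.0),
    TextLine("Body text of the article.", 600.0),
    TextLine("17", 20.0),
]
body = strip_headers_and_footers(page, page_height=842.0)
print([ln.text for ln in body])  # ['Body text of the article.']
```

In practice the margin bands would have to be tuned per layout, since the magazine design changed over the years covered by the corpus.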
Tokenisation was done with Johannes Graën's "cutter". It uses abbreviation lists to distinguish sentence-final dots from abbreviation dots. It follows our tokenisation guidelines, which are included in this corpus package.
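The idea of using an abbreviation list to decide whether a dot is split off can be sketched as follows. This is not the "cutter" implementation; the abbreviation list, the function names and the sample sentence are illustrative assumptions.

```python
# Illustrative list only, not the one shipped with the corpus.
ABBREVIATIONS = {"Dr.", "e.g.", "etc.", "Inc."}

def split_final_dot(word, abbreviations=ABBREVIATIONS):
    """Split a trailing dot off a word unless the word is a known abbreviation."""
    if word.endswith(".") and len(word) > 1 and word not in abbreviations:
        return [word[:-1], "."]
    return [word]

def tokenise(text):
    tokens = []
    for word in text.split():
        tokens.extend(split_final_dot(word))
    return tokens

print(tokenise("Dr. Meier joined the bank."))
# ['Dr.', 'Meier', 'joined', 'the', 'bank', '.']
```

This sketch leaves one case unresolved: an abbreviation that also ends a sentence keeps its dot attached, so no separate sentence-final token is produced.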
We offer language identification at sentence level. We used the Python module "langid" for the identification. The language of each sentence is stored in the XML in the attribute "lang". We restricted the allowed languages to German, English, French, Italian and Spanish.
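Once the corpus files are at hand, the "lang" attribute can be used to find sentences whose language differs from that of the magazine. The sketch below assumes that sentences are `<s>` elements inside an `<article>` element; only the attribute name "lang" is taken from the description above, and the sample content is made up.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample; the element names "article" and "s" are assumptions.
SAMPLE = """<article id="a1">
  <s id="s1" lang="de">Das ist ein Satz.</s>
  <s id="s2" lang="en">This is a sentence.</s>
  <s id="s3" lang="de">Noch ein Satz.</s>
</article>"""

def language_counts(xml_text, sentence_tag="s"):
    """Count sentences per value of the 'lang' attribute."""
    root = ET.fromstring(xml_text)
    return Counter(s.get("lang", "unknown") for s in root.iter(sentence_tag))

counts = language_counts(SAMPLE)
print(counts)  # Counter({'de': 2, 'en': 1})
```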
Both stand-off article and sentence alignments are included in this release.
Our approach for article alignment was inspired by Rico Sennrich's BLEU-Align tool (which we used for sentence alignment). For each language pair, one side is translated to the other language using an SMT system. We then compute their BLEU score and use it as a similarity metric to decide whether two articles are parallel or not.
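The decision step can be sketched with a plain single-reference BLEU computed over whole articles. This is a simplified illustration, not the implementation used for the corpus: the translation step is assumed to have happened already, there is no smoothing, and the threshold of 0.2 is a hypothetical value.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU of one tokenised candidate against one tokenised reference."""
    if not candidate or not reference:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clipped counts: an n-gram is credited at most as often as it occurs in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(sum(log_precisions) / max_n)

def is_parallel(translated_source, target, threshold=0.2):
    # threshold is a hypothetical cut-off, chosen here only for illustration
    return bleu(translated_source, target) >= threshold

translated = "the bank was founded in zurich".split()
target = "the bank was founded in zurich".split()
unrelated = "completely different words appear here".split()
print(is_parallel(translated, target), is_parallel(translated, unrelated))  # True False
```

Because the score is computed on machine-translated text, a usable threshold is typically far below 1.0 and would need to be tuned on articles with known alignment.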
We computed the sentence alignments for all six language pairs with HunAlign. They include 1-1, 1-many, many-1, 0-many, many-0, and (a few) many-many alignments. The 1-1 alignments account for more than 85% of all alignments; 1-2 and 2-1 alignments account for around another 10%. This indicates that the different language versions of the texts are close to each other.
| | DE-EN | DE-FR | DE-IT | FR-IT | FR-EN | IT-EN |
|---|---|---|---|---|---|---|
| Aligned Articles | 1250 | 2479 | 1995 | 2252 | 1204 | 1159 |
| Sentence Alignments | 112'816 | 214'905 | 182'691 | 190'077 | 103'452 | 99'574 |
| 1-1 Sentence Alignments | 89.4% | 87.2% | 88.1% | 87.7% | 87.0% | 87.3% |
| 1-2 + 2-1 Sent. Alignments | 9.4% | 11.3% | 10.5% | 11.0% | 11.6% | 11.3% |
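As a quick check on the table, the share left for all other alignment types (0-many, many-0, many-many and larger 1-many groups) can be derived from the two percentage rows. The figures are copied from the table; since they are rounded, the results are approximate.

```python
one_to_one = {"DE-EN": 89.4, "DE-FR": 87.2, "DE-IT": 88.1,
              "FR-IT": 87.7, "FR-EN": 87.0, "IT-EN": 87.3}
one_two_or_two_one = {"DE-EN": 9.4, "DE-FR": 11.3, "DE-IT": 10.5,
                      "FR-IT": 11.0, "FR-EN": 11.6, "IT-EN": 11.3}

# Remaining share per language pair, rounded to one decimal place.
other = {pair: round(100 - one_to_one[pair] - one_two_or_two_one[pair], 1)
         for pair in one_to_one}
print(other)
```

For every language pair the remainder is between 1% and 2% of all alignments.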
Below is an example excerpt of the sentence alignment of the fourth magazine published in 2016:
    <linkGrp targType="yearbook" xtargets="bulletin_2016_4_stability_de.xml;bulletin_2016_4_stability_en.xml" lang="de;en">
    ...
      <linkGrp targType="article" xtargets="a2;a2">
        <link targType="sentence" type="1-1" xtargets="s1;s1"/>
        <link targType="sentence" type="2-1" xtargets="s2 s3;s2"/>
The excerpt shows that the German article "a2" is aligned with the English article "a2". The first sentences of both articles are aligned with each other. However, the second sentence of the English article is aligned with both the second and the third sentence of the German version.
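The stand-off format can be read with a few lines of Python. The sketch below uses the element and attribute names from the excerpt; the closing tags are added so that the sample is well-formed, and the function name is an assumption.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<linkGrp targType="yearbook" xtargets="bulletin_2016_4_stability_de.xml;bulletin_2016_4_stability_en.xml" lang="de;en">
  <linkGrp targType="article" xtargets="a2;a2">
    <link targType="sentence" type="1-1" xtargets="s1;s1"/>
    <link targType="sentence" type="2-1" xtargets="s2 s3;s2"/>
  </linkGrp>
</linkGrp>"""

def sentence_links(xml_text):
    """Return (source article, source sentences, target article, target sentences, type) tuples."""
    root = ET.fromstring(xml_text)
    links = []
    for group in root.iter("linkGrp"):
        if group.get("targType") != "article":
            continue  # skip the outer group that pairs the two files
        src_article, tgt_article = group.get("xtargets").split(";")
        for link in group.iter("link"):
            # The two sides are separated by ";", sentence ids within a side by spaces.
            src, tgt = link.get("xtargets").split(";")
            links.append((src_article, src.split(), tgt_article, tgt.split(), link.get("type")))
    return links

for entry in sentence_links(SAMPLE):
    print(entry)
# ('a2', ['s1'], 'a2', ['s1'], '1-1')
# ('a2', ['s2', 's3'], 'a2', ['s2'], '2-1')
```

An empty side, as in a 0-many alignment, yields an empty list of sentence ids, provided the ";" separator is still present.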
Here are the example sentences:
We supply an XML-DTD with the corpus files. All corpus files have been validated against this DTD.
The corpus is distributed free of charge and is freely available for non-commercial purposes (as granted by Credit Suisse).
To cite the corpus, we recommend:

    @MISC{CS_Bulletin_Corpus_Release_v5.0_2019,
      editor = {Martin Volk and Alena Zwahlen and Chantal Amrhein},
      year = 2018,
      title = {Credit Suisse Bulletin Corpus (Release 5.0)},
      note = {A collection of translated magazines in English, French, German and Italian},
      howpublished = {XML-Format},
      school = {Institut für Computerlinguistik, Universität Zürich}
    }
We gratefully acknowledge support by the Swiss National Library and Credit Suisse.
The following students have made special contributions to this corpus.
Martin Volk, Institute of Computational Linguistics, University of Zurich