Release 05, 27. February 2019 by Institute of Computational Linguistics, University of Zurich
The Credit Suisse News Corpus is a collection of news articles from the Credit Suisse web page in four languages (English, French, German, Italian). The articles date from 2001 to 2017. They were collected by students in the course "Introduction to Multilingual Text Analysis" (fall semesters 2014, 2015, 2016 and 2017). This corpus is part of a larger initiative to build a corpus on the Credit Suisse Bulletin, the world's oldest banking magazine (since 1895).
Tokens were counted after automatic tokenisation (i.e. punctuation marks count as separate tokens). Types (= unique words) were counted without modifications. This means that upper and lower case variants count as separate types.
Statistic | English | French | German | Italian |
---|---|---|---|---|
Articles | 1821 | 1596 | 1797 | 1542 |
Tokens | 2.09 million | 2.03 million | 1.91 million | 1.87 million |
Types | 53'534 | 56'333 | 105'560 | 64'029 |
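The counting convention described above (punctuation marks as separate tokens, case-sensitive types) can be illustrated with a minimal sketch. The function name `count_tokens_and_types` and the sample tokens are hypothetical and not part of the corpus package; the input is assumed to be already tokenised.

```python
from collections import Counter


def count_tokens_and_types(tokens):
    """Return (number of tokens, number of case-sensitive types)."""
    counts = Counter(tokens)
    # Every occurrence is a token; every distinct string is a type.
    return sum(counts.values()), len(counts)


# Hypothetical sample: "The" and "the" are two different types,
# and the final full stop counts as a token of its own.
sample = ["The", "bank", "said", "the", "bank", "grew", "."]
n_tokens, n_types = count_tokens_and_types(sample)
print(n_tokens, n_types)  # 7 6
```

Lower-casing the tokens before counting would merge "The" and "the" and therefore yield fewer types than the figures reported in the table.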
The articles have been classified by Credit Suisse. The corpus consists of articles in the following categories. All category names are mapped to English.
Category | # of DE articles |
---|---|
Article Archive | 13 |
Asia Pacific | 15 |
Banking | 207 |
Culture | 66 |
Economy | 581 |
Entrepreneurs | 184 |
Europe | 32 |
Global Trends | 22 |
Investing | 201 |
Middle East & Africa | 5 |
News & Stories | 117 |
Sectors and Companies | 3 |
Society | 96 |
Sport | 43 |
Switzerland | 209 |
The Americas | 3 |
The articles were collected as HTML files. In a first step, the files were stripped of boilerplate material and the text was saved as XML. All hyperlinks (HTML anchor tags) were removed.
Tokenisation was done with Johannes Graën's "cutter". It uses abbreviation lists to distinguish sentence-final dots from abbreviation dots. It follows our tokenisation guidelines, which are included in this corpus package.
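To illustrate the general idea of using an abbreviation list to decide whether a dot belongs to the preceding word, here is a toy sketch. It is not the "cutter" implementation and does not follow the tokenisation guidelines of the corpus; the function `tokenize` and the three-entry abbreviation list are assumptions made for this example only.

```python
import re

# Hypothetical, very small abbreviation list for illustration.
ABBREVIATIONS = {"Dr.", "e.g.", "etc."}


def tokenize(text, abbreviations=ABBREVIATIONS):
    """Split text into tokens, keeping listed abbreviations intact."""
    tokens = []
    for chunk in text.split():
        if chunk in abbreviations:
            # The dot is part of the abbreviation, not a sentence end.
            tokens.append(chunk)
        else:
            # Words stay together; each punctuation mark becomes a token.
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens


print(tokenize("Dr. Meier arrived."))
# ['Dr.', 'Meier', 'arrived', '.']
```

A real tokeniser has to handle many more cases (for instance an abbreviation followed directly by a comma), which this sketch deliberately ignores.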
We offer language identification on sentence level. We used the Python module "langid" for the identification. The language of each sentence is stored in the XML with the attribute "lang". We restricted the allowed languages to German, English, French, Italian and Spanish.
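As a rough sketch of how the "lang" attribute could be used, the following example tallies the sentence languages of one article. The element names `article` and `s` and the sample content are assumptions for illustration; the DTD supplied with the corpus is the authoritative description of the actual tags.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment; real corpus files may use different element names.
SAMPLE = """
<article id="a16">
  <s id="s1" lang="de">Die Bank investiert.</s>
  <s id="s2" lang="en">Credit Suisse Group</s>
  <s id="s3" lang="de">Das ist gut.</s>
</article>
"""


def count_sentence_languages(xml_text, sentence_tag="s"):
    """Count sentences per value of the 'lang' attribute."""
    root = ET.fromstring(xml_text.strip())
    return Counter(
        sentence.get("lang")
        for sentence in root.iter(sentence_tag)
        if sentence.get("lang") is not None
    )


print(count_sentence_languages(SAMPLE))
# Counter({'de': 2, 'en': 1})
```

Such a tally makes it easy to spot, for example, English sentences embedded in a German article.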
All articles have a unique identifier (cross_lang_id) that is consistent across languages. This id is used for article alignment.
We computed the sentence alignments for all six language pairs with HunAlign. The result includes 1-1, 1-many, many-1, 0-many, many-0, and (a few) many-many alignments. 1-1 alignments account for 90% to 92% of all alignments, and 1-2 and 2-1 alignments together account for a further 7.6% to 9.4%. This indicates that the different language versions of the news texts are close to each other.
Statistic | DE-EN | DE-FR | DE-IT | FR-IT | FR-EN | IT-EN |
---|---|---|---|---|---|---|
Aligned Articles | 1793 | 1591 | 1536 | 1531 | 1590 | 1533 |
Sentence Alignments | 105'464 | 94'609 | 90'575 | 89'175 | 92'968 | 89'169 |
1-1 Sentence Alignments | 91.9% | 90.0% | 90.0% | 90.6% | 91.3% | 91.6% |
1-2 + 2-1 Sent. Alignments | 7.6% | 9.4% | 9.3% | 8.9% | 8.2% | 7.9% |
Below is an example excerpt of the sentence alignment of a news article published in 2016:
    <linkGrp targType="yearbook" xtargets="cs_news_2016_de.xml;cs_news_2016_en.xml" lang="de;en">
      ...
      <linkGrp targType="article" xtargets="a16;a16">
        ...
        <link targType="sentence" type="1-1" xtargets="s9;s9"/>
        <link targType="sentence" type="1-2" xtargets="s10 s11;s10"/>
The excerpt shows that the German article "a16" is aligned with the English article "a16". Sentence 9 in the German article is aligned with sentence 9 in the English article. Sentence 10 in the English version, however, is aligned with sentences 10 and 11 in the German version.
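The alignment format shown above can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the sample closes the article-level linkGrp element (which the excerpt elides), and the function name `read_sentence_links` is hypothetical. In the "xtargets" attribute, a semicolon separates the two language sides and spaces separate sentence ids within one side.

```python
import xml.etree.ElementTree as ET

# Article-level group from the excerpt, closed here so that it parses.
SAMPLE = """
<linkGrp targType="article" xtargets="a16;a16">
  <link targType="sentence" type="1-1" xtargets="s9;s9"/>
  <link targType="sentence" type="1-2" xtargets="s10 s11;s10"/>
</linkGrp>
"""


def read_sentence_links(xml_text):
    """Return a list of (type, first-side ids, second-side ids)."""
    root = ET.fromstring(xml_text.strip())
    links = []
    for link in root.iter("link"):
        if link.get("targType") != "sentence":
            continue
        first, second = link.get("xtargets").split(";")
        # split() on an empty side yields [], which covers 0-many links.
        links.append((link.get("type"), first.split(), second.split()))
    return links


for link_type, first_ids, second_ids in read_sentence_links(SAMPLE):
    print(link_type, first_ids, second_ids)
# 1-1 ['s9'] ['s9']
# 1-2 ['s10', 's11'] ['s10']
```

The order of the two sides follows the "lang" attribute of the enclosing yearbook-level group ("de;en" in the excerpt), so the first list holds the German sentence ids.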
Here are the example sentences:
We supply an XML-DTD with the corpus files. All corpus files have been validated against this DTD. Please consult the DTD for explanations on the XML tags and attributes.
To cite the corpus, we recommend:
    @MISC{CS_News_Corpus_Release_v05_2019,
      editor = {Martin Volk and Alena Zwahlen and Chantal Amrhein},
      year = 2018,
      title = {Credit Suisse News Corpus (Release 05)},
      note = {A collection of translated news in English, French, German and Italian},
      howpublished = {XML-Format},
      school = {Institut für Computerlinguistik, Universität Zürich}
    }
We gratefully acknowledge support by the Swiss National Library and Credit Suisse.
The following students have made special contributions to this corpus.
Martin Volk, Institute of Computational Linguistics, University of Zurich