The Credit Suisse News Corpus

Release 05, 27 February 2019, by the Institute of Computational Linguistics, University of Zurich

The Credit Suisse News Corpus is a collection of news articles from the Credit Suisse web page in four languages (English, French, German, Italian), dating from 2001 to 2017. The articles were collected by students in the course "Introduction to Multilingual Text Analysis" (fall semesters 2014, 2015, 2016 and 2017). This corpus is part of a larger initiative to build a corpus on the Credit Suisse Bulletin, the world's oldest banking magazine (since 1895).

Release 05 contains multiple minor fixes and improvements:

Language identification

Tokenization

POS tags


Size

Tokens were counted after automatic tokenisation, so punctuation marks count as separate tokens. Types (= unique word forms) were counted without any normalisation, so upper- and lower-case variants count as separate types.

            English        French         German         Italian
Articles    1821           1596           1797           1542
Tokens      2.09 million   2.03 million   1.91 million   1.87 million
Types       53'534         56'333         105'560        64'029
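
The counting convention described above can be illustrated with a minimal Python sketch. It assumes the text is already tokenised into a list of strings; the function name count_tokens_and_types and the example tokens are hypothetical and not part of the corpus tools.

```python
def count_tokens_and_types(tokens):
    """Return (token count, type count) for an already tokenised text.

    Types are counted case-sensitively and without normalisation,
    so "Bank" and "bank" are two different types.
    """
    return len(tokens), len(set(tokens))


# Punctuation marks are separate tokens after tokenisation.
example = ["The", "bank", "is", "open", ".",
           "the", "Bank", "is", "closed", "."]

n_tokens, n_types = count_tokens_and_types(example)
print(n_tokens, n_types)  # 10 tokens, 8 types
```

In this example "is" and "." each occur twice but count as one type each, while "The"/"the" and "bank"/"Bank" remain distinct types.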

Categories

The articles have been classified by Credit Suisse. The corpus consists of articles in the following categories. All category names are mapped to English.

Category # of DE articles
Article Archive 13
Asia Pacific 15
Banking 207
Culture 66
Economy 581
Entrepreneurs 184
Europe 32
Global Trends 22
Investing 201
Middle East & Africa 5
News & Stories 117
Sectors and Companies 3
Society 96
Sport 43
Switzerland 209
The Americas 3

Processing

Cleaning and Conversion to XML

The articles were collected as HTML files. In a first step, the files were stripped of boilerplate material and the text was saved as XML. All hyperlinks (HTML anchor tags) were removed.

Tokenisation

Tokenisation was done with Johannes Graën's "cutter". It uses abbreviation lists to distinguish sentence-final dots from abbreviation dots. It follows our tokenisation guidelines, which are included in this corpus package.
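
The idea of using an abbreviation list to decide whether a dot belongs to a word can be shown with a toy Python sketch. This is not the implementation of "cutter"; the abbreviation list, the function names and the whitespace-based splitting are all simplifying assumptions made for illustration.

```python
# Illustrative abbreviation list (an assumption, not the list used by cutter).
ABBREVIATIONS = {"e.g.", "i.e.", "Dr.", "Mr.", "etc."}


def split_final_dot(word, abbreviations=ABBREVIATIONS):
    """Split a trailing dot off a word unless the word is a known abbreviation."""
    if word.endswith(".") and len(word) > 1 and word not in abbreviations:
        return [word[:-1], "."]
    return [word]


def tokenise(text):
    """Whitespace-split the text and separate dots that are not abbreviation dots."""
    tokens = []
    for word in text.split():
        tokens.extend(split_final_dot(word))
    return tokens


print(tokenise("Dr. Meier joined the bank."))
# ['Dr.', 'Meier', 'joined', 'the', 'bank', '.']
```

A list lookup alone cannot resolve an abbreviation that also ends a sentence; a real tokeniser needs further rules for such cases.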

Part-of-Speech Tagging and Lemmatisation

  1. We used the TreeTagger for PoS Tagging and Lemmatisation of the four languages.
  2. Our French tag set is a slightly modified version of the set used in the Text+Berg corpus.
  3. For German, we extended the tagger lexicon with the help of the morphological analyser GerTwol. We also applied our program for detecting multiword adverbs (e.g. auf und ab, nach und nach, nach wie vor), completing elliptical compounds (e.g. Ein- und Ausfahrt) and re-attaching separated verb prefixes (e.g. fängt ... an --> anfangen).
  4. For Italian and English, we extended the tagger lexicon with the help of the morphological analyser TextPro.

Language Identification

We offer language identification at the sentence level. We used the Python module "langid" for the identification. The language of each sentence is stored in the XML in the attribute "lang". We restricted the allowed languages to German, English, French, Italian and Spanish.
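
A minimal Python sketch of sentence-level identification with a restricted language set might look as follows. It uses the langid calls set_languages and classify; the helper names make_langid_classifier and tag_sentences are hypothetical, and the sketch does not reproduce how the results were written to the corpus XML.

```python
def make_langid_classifier(languages=("de", "en", "fr", "it", "es")):
    """Build a classifier backed by langid, restricted to the given languages.

    langid is imported lazily so that the rest of the sketch can be used
    without the third-party package installed (pip install langid).
    """
    import langid

    langid.set_languages(list(languages))
    # langid.classify returns a (language code, score) tuple.
    return lambda text: langid.classify(text)[0]


def tag_sentences(sentences, classify):
    """Pair each sentence with the language code returned by `classify`."""
    return [(classify(sentence), sentence) for sentence in sentences]
```

With langid installed, tag_sentences(["Das ist ein Satz.", "This is a sentence."], make_langid_classifier()) returns one (language code, sentence) pair per input sentence. Passing the classifier as a parameter keeps the tagging step independent of the identification library.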

Cross-language Alignment

Article Alignment

All articles have a unique identifier (cross_lang_id) that is consistent across languages. This id is used for article alignment.
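
Article alignment by a shared identifier amounts to a key join. The following Python sketch assumes the articles of each language have already been read into a dictionary keyed by cross_lang_id; the function name, the identifier values and the article labels are hypothetical.

```python
def align_articles(articles_a, articles_b):
    """Pair articles from two languages that share a cross_lang_id.

    Each argument maps cross_lang_id -> article. Identifiers that occur
    in only one of the two languages are skipped.
    """
    shared = sorted(articles_a.keys() & articles_b.keys())
    return [(cid, articles_a[cid], articles_b[cid]) for cid in shared]


# Hypothetical identifiers and article labels, for illustration only.
german = {"id1": "a16", "id2": "a17", "id3": "a18"}
english = {"id1": "a16", "id3": "a20"}

print(align_articles(german, english))
# [('id1', 'a16', 'a16'), ('id3', 'a18', 'a20')]
```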

Sentence Alignment

We computed the sentence alignments for all six language pairs with HunAlign. The result includes 1-1, 1-many, many-1, 0-many, many-0, and (few) many-many alignments. Depending on the language pair, 1-1 alignments account for 90% to 92% of all alignments, and 1-2 and 2-1 alignments together account for another 7.6% to 9.4%. This indicates that the different language versions of the news texts are close to each other.

                             DE-EN     DE-FR    DE-IT    FR-IT    FR-EN    IT-EN
Aligned Articles             1793      1591     1536     1531     1590     1533
Sentence Alignments          105'464   94'609   90'575   89'175   92'968   89'169
1-1 Sentence Alignments      91.9%     90.0%    90.0%    90.6%    91.3%    91.6%
1-2 + 2-1 Sent. Alignments   7.6%      9.4%     9.3%     8.9%     8.2%     7.9%
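
Percentages like those in the table can be derived from the list of alignment types of a language pair. The Python sketch below assumes the types are available as strings such as "1-1" or "1-2"; the function name and the sample data are hypothetical and do not reproduce the corpus figures.

```python
from collections import Counter


def alignment_shares(link_types):
    """Return the percentage of each alignment type, rounded to one decimal."""
    counts = Counter(link_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: round(100 * n / total, 1) for t, n in counts.items()}


# Invented sample: 18 one-to-one links, one 1-2 link and one 2-1 link.
sample = ["1-1"] * 18 + ["1-2", "2-1"]
print(alignment_shares(sample))
# {'1-1': 90.0, '1-2': 5.0, '2-1': 5.0}
```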

Below is an example excerpt of the sentence alignment of a news article published in 2016:

<linkGrp targType="yearbook" xtargets="cs_news_2016_de.xml;cs_news_2016_en.xml" lang="de;en">
      ...
      <linkGrp targType="article" xtargets="a16;a16">
      ...
    	<link targType="sentence" type="1-1" xtargets="s9;s9"/>
    	<link targType="sentence" type="1-2" xtargets="s10 s11;s10"/>

The excerpt shows that the German article "a16" is aligned with the English article "a16". Sentence 9 in the German article is aligned with sentence 9 in the English article. Sentence 10 in the English version, however, is aligned with sentences 10 and 11 in the German version.
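
Alignment files in this format can be read with a standard XML parser. The following Python sketch parses the excerpt above using xml.etree.ElementTree. The closing tags were added so that the snippet is well-formed, and the elided parts are left out; the function name read_sentence_links is hypothetical. The two sides of each link follow the order given in the "lang" attribute (here: German, then English).

```python
import xml.etree.ElementTree as ET

# The excerpt from above, with closing tags added as an assumption.
SNIPPET = """
<linkGrp targType="yearbook" xtargets="cs_news_2016_de.xml;cs_news_2016_en.xml" lang="de;en">
  <linkGrp targType="article" xtargets="a16;a16">
    <link targType="sentence" type="1-1" xtargets="s9;s9"/>
    <link targType="sentence" type="1-2" xtargets="s10 s11;s10"/>
  </linkGrp>
</linkGrp>
"""


def read_sentence_links(xml_text):
    """Return a list of (left ids, right ids) for every sentence link."""
    root = ET.fromstring(xml_text.strip())
    links = []
    for link in root.iter("link"):
        if link.get("targType") != "sentence":
            continue
        # The two sides are separated by ";", ids within a side by spaces.
        left, right = link.get("xtargets").split(";")
        links.append((left.split(), right.split()))
    return links


for de_ids, en_ids in read_sentence_links(SNIPPET):
    print(de_ids, "<->", en_ids)
# ['s9'] <-> ['s9']
# ['s10', 's11'] <-> ['s10']
```

An empty side, as in 0-many alignments, would yield an empty list for that side.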

Here are the example sentences:

XML-Format

We supply an XML-DTD with the corpus files. All corpus files have been validated against this DTD. Please consult the DTD for explanations on the XML tags and attributes.

Limitations

  1. We are aware that in corpus collection and cleaning we may have missed parts of a news text, or we may have some boilerplate text still left. Because of the heterogeneous formats of the original files, both errors are difficult to avoid.
  2. We are aware that the different language versions contain sentences and fragments in other languages. So far, we have applied only language identification, but not code-switching detection.
  3. We have not yet applied lemma disambiguation over parallel texts.
  4. We are aware that our automatic alignment of sentences may occasionally contain wrong alignments.

Acknowledgement

For citing the corpus, we recommend:

@MISC{CS_News_Corpus_Release_v05_2019,
  editor = {Martin Volk and Alena Zwahlen and Chantal Amrhein},
  year = 2019,
  title = {Credit Suisse News Corpus (Release 05)},
  note = {A collection of translated news in English, French, German and Italian},
  howpublished = {XML-Format},
  school = {Institut für Computerlinguistik, Universität Zürich}
}

We gratefully acknowledge support by the Swiss National Library and Credit Suisse.

The following students have made special contributions to this corpus.

Martin Volk, Institute of Computational Linguistics, University of Zurich