Informatiktage 2017

In June 2017, the bulletin4corpus project took part in the Informatiktage. Visitors could get to know the corpus and discover what linguistic treasures can be found in multilingual texts. Two web applications for exploring the corpus were presented. All the links are listed here:

Multilingwis² - a web application to discover differences and connections between multiple languages
Diachronvis CS Korpus - a web application to explore historical language development
Ideas for interesting queries for the two web applications above (in German)
The bulletin4corpus Informatiktage poster in PDF format (in German)

Steps towards a corpus

Conversion of HTML files

Autumn 2014: The news articles from the Credit Suisse website are collected. They are included in a first corpus (Credit Suisse News Corpus), which is continually extended.

Conversion of PDF documents

January to September 2016: The Bulletin issues from 1998 until now are available online as PDF files. We extract the texts from these PDF files, convert them to XML and include them in a corpus (Credit Suisse PDF Bulletin Corpus).

Scanning

September to November 2016: The Swiss National Library scans all the magazine issues for us.

OCR

December 2016: We use an OCR (optical character recognition) tool to extract the contents of the books and store them in XML format.