Simplified News in Many Languages

Simplified News in Many Languages (SNIML) is a collection of news articles published by six simplified-language news portals. We have prepared the collected articles as an XML corpus, as described in our paper.

The corpus can be downloaded from this page. It contains simplified news articles in Finnish, French, Italian, Swedish, English and German, ranging from November 2003 to March 2023. All articles in the corpus are available under an open license that permits academic research use.
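Since the corpus is distributed as XML, a quick first step after downloading is to look at which elements a file contains. The following is a minimal sketch that makes no assumption about the SNIML schema: it streams through any XML file and counts element tags. The function name `count_tags` and the tiny sample document are hypothetical and serve only as illustration; the sample's element names are not taken from the corpus.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source):
    """Stream through an XML file (path or file object) and count each element tag."""
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # release text and children of elements already processed
    return counts


# Made-up document for demonstration only; these element names are
# hypothetical and NOT the actual SNIML schema.
sample = io.BytesIO(
    b"<corpus>"
    b"<article><p>One.</p><p>Two.</p></article>"
    b"<article><p>Three.</p></article>"
    b"</corpus>"
)
sample_counts = count_tags(sample)
print(sample_counts.most_common())
```

To inspect a downloaded corpus file, pass its path to `count_tags` instead of the in-memory sample. Streaming with `iterparse` keeps memory use low, which can matter for files holding thousands of articles.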

This corpus was created as part of a project of the Department of Computational Linguistics at the University of Zurich.

Overview of Sources

| News Portal | URL | Language | License |
|---|---|---|---|
| The Times in Plain English | https://www.thetimesinplainenglish.com/ | en-US | “may be distributed and reproduced by all” |
| Informazione Facile | https://informazionefacile.it/ | it-IT | CC BY-SA 4.0 |
| Journal Essentiel | https://journalessentiel.be/ | fr-BE | CC BY-SA 4.0 |
| Infoeasy | https://infoeasy-news.ch/ | de-CH | CC BY-NC-ND 4.0 |
| Selkosanomat | https://selkosanomat.fi/ | fi | CC BY-NC-ND 4.0 |
| Lätta Bladet | https://lattabladet.fi/ | sv-SE | CC BY-NC-ND 4.0 |

We would like to thank the editors of these news portals for making their articles available.

Citation

@inproceedings{hauser-etal-2022-multilingual,
    title = "A Multilingual Simplified Language News Corpus",
    author = "Hauser, Renate  and
      Vamvas, Jannis  and
      Ebling, Sarah  and
      Volk, Martin",
    booktitle = "Proceedings of the 2nd Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI) within the 13th Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.readi-1.4",
    pages = "25--30",
    abstract = "Simplified language news articles are being offered by specialized web portals in several countries. The thousands of articles that have been published over the years are a valuable resource for natural language processing, especially for efforts towards automatic text simplification. In this paper, we present SNIML, a large multilingual corpus of news in simplified language. The corpus contains 13k simplified news articles written in one of six languages: Finnish, French, Italian, Swedish, English, and German. All articles are shared under open licenses that permit academic use. The level of text simplification varies depending on the news portal. We believe that even though SNIML is not a parallel corpus, it can be useful as a complement to the more homogeneous but often smaller corpora of news in the simplified variety of one language that are currently in use.",
}

Institute of Computational Linguistics – University of Zurich