Dr. Johannes Graën

Academic Associate

Tel.: +41 44 635 67 21
Room: AND-2-22

Current project

Since April 2019, I have been working on my mobility project "From parallel corpora to multilingual exercises - Making use of large text collections and crowdsourcing techniques for innovative autonomous language learning applications", funded by the SNF, partly at Språkbanken in Gothenburg and partly at GR@EL in Barcelona.

Short bio

I studied Computational Linguistics at the IMS in Stuttgart. During my studies I spent two years in Barcelona where I had the opportunity to participate in the PATExpert project at the TALN group and, later on, write my diploma thesis about dialogue interaction in role-playing video games.

Following my graduation, I led the development of digital cinema content logistics in an international firm (now Ymagis), resuming work that I had carried out for several years as a student. From 2013 to 2017, I worked as a PhD student in the SPARCLING project, investigating methods for assembling and querying large multiparallel corpora with a special focus on multilingual alignment.

Having learned programming in high school, I used to write small applications to support my own language acquisition (vocabulary and inflection) and later on continued developing language learning software with schoolmates, which brought us several prizes in the Jugend forscht competition. In 2017, I spent two months at Språkbanken in Gothenburg to learn about their approaches to corpus-based language learning applications.

Teaching and supervision

Starting in fall semester 2014 (HS14), I have been giving the last lecture in the programming techniques for computational linguistics series (PCL III), in HS15 together with Peter Makarov and in HS16 with Fabio Rinaldi. This lecture is centered around designing and implementing interactive applications (in contrast to processing pipelines) with complex data structures. Another key aspect of the lecture is software development in teams employing version control.

In spring semester 2014 (FS14), Martin Volk and I held a research seminar about methods and tools for large parallel corpora.

I (co-)supervised and worked with these students:

Stéphanie Lehner (2014) – Deutsche Substantivkomposita in parallelen Korpora: Erstellung und Evaluation eines multilingualen Goldstandards zur Optimierung der automatischen Übersetzungsbestimmung (Lizentiatsarbeit ≈ master thesis) [thesis] [publication] [gold standard]
Raphael Stöcklin (2015) – Implementation einer Mehrwortsuche in grossen parallelen Korpora anhand von Bilingwis (Facharbeit ≈ scientific project over at most six months)
Phillip Ströbel (2017) – The “Raison d’Être” of Word Embeddings in Identifying Multiword Expressions in Bilingual Corpora (master thesis) [thesis]
Selena Calleri & Barbara Pejkovic (2017) – Creation of a gold standard for hierarchical multilingual word alignment in six languages (English, French, German, Italian, Slovene, Spanish) [data]
Dominique Sandoz (2017) – web frontend and middleware for Multilingwis² (programming project) [publication] [git repo]
Christof Bless (contribution to the SPARCLING project as a student assistant from 2016 to 2017):
- Youquery – web frontend for the exploration of corpus association measures [publication]
- SentStructure.js – D3.js-based library for visualizing annotation and alignment of corpus examples [git repo]
- HAT (hierarchical alignment tool) – tool for efficient multilingual hierarchical word and sentence alignment [git repo] [example1] [example2]
Sarah Zurmühle (2018) – Erweiterung der Abfragesprache für das grosse multi-parallele Textkorpus des Multilingwis2 Projektes (bachelor thesis)
Tannon Kew & Anastassia Shaitarova (2018–2019) – definition of a flexible corpus format for multiparallel corpora, conversion of existing corpora into that format and export to Multilingwis [publication] [web page]
Jonathan Schaber (2019) – development of a search interface for the database in the DiFuPaRo research project

Interests

My main interests are parallel and multiparallel corpora, and their exploitation for multilingual phraseology and CALL applications. As regards the topic of language learning, I have mostly worked together with Gerold Schneider.

These applications require multilingual alignment on different levels, in particular alignment of units larger than single tokens (e.g. phrases), a problem I dealt with in my dissertation.

Other aspects include efficient corpus storage and query systems for multiparallel corpora and visualization of their results.

Other interests and proficiencies

I previously worked for several years as a SysAdmin for Linux servers and in DevOps for applications based on PostgreSQL. During my time at the Department of Computational Linguistics, I had the chance to build our new IT infrastructure from scratch, based on virtualization (Proxmox) and distributed services. The architecture of our infrastructure has proven useful for both development and providing services, that is, web applications and pure web services. Most of my skills in this area date back to my active time at Selfnet e.V. in Stuttgart.

During my studies, I used to regularly take language courses. Besides German and English, I speak Spanish (C1), French, Portuguese, Italian (B1), Catalan and Swedish (A2). I also took lessons in Russian, Polish, Czech, Turkish and Icelandic, but, up to now, I can merely read simple texts in these languages.

I enjoy garlic (like in Tzatziki or Gazpacho), volleyball (my previous team) and good movies (Uni-Film Stuttgart, Filmstelle Zürich, Texas cinema). My preferred red wines come from the Montsant region.

Demos

Multilingwis² – a web-based search engine for exploration of word-aligned parallel and multiparallel corpora
Youquery – a web interface to explore properties of interlingual and intralingual corpus association measures
Cutter – frontend for tokenization web service in several languages
Alignment Overlap – tool for exploring translations shared between multiple terms
Constellations – syntactic queries on word-aligned parallel corpora

Publications

ZORA Publication List

Schaber, Jonathan; Graën, Johannes; McDonald, Daniel; Mustac, Igor; Rajovic, Nikolina; Schneider, Gerold; Bubenhofer, Noah (2023). The LiRI Corpus Platform. In: CLARIN Annual Conference, Leuven, 16 October 2023 - 18 October 2023. CLARIN ERIC, 145-149.
Graën, Johannes; Mustac, Igor; Rajovic, Nikolina; Schaber, Jonathan; Schneider, Gerold; Bubenhofer, Noah (2023). Swissdox@LiRI – a large database of media articles made accessible to researchers. In: CLARIN Annual Conference 2023, Leuven, 16 October 2023 - 18 October 2023. CLARIN ERIC, 111-115.
Vamvas, Jannis; Graën, Johannes; Sennrich, Rico (2023). SwissBERT: The Multilingual Language Model for Switzerland. In: 8th Swiss Text Analytics Conference (SwissText), Neuchâtel, Switzerland, 12 June 2023 - 14 June 2023. Association for Computational Linguistics, 54-69.
Graën, Johannes; Bach, Carme; Cassany, Daniel (2023). Using a bilingual concordancer to promote metalinguistic reflection in the learning of an additional language: The case of B1 learners of Catalan. In: Santos Díaz, Inmaculada Clotilde; Torrado Cespón, Milagros; Díaz Lage, José María; López Pérez, Sidoní. Current Trends on Digital Technologies and Gaming for Teaching and Linguistics. Berlin: Peter Lang, 27-45.
Graën, Johannes (2022). Learning languages from parallel corpora. Slovenscina 2.0, 10(2):101-131.
Volk, Martin; Graën, Johannes (2022). Binomials in Swedish corpora – ‘Ordpar 1965’ revisited. In: Volodina, Elena; Dannélls, Dana; Berdicevskis, Aleksandrs; Forsberg, Markus; Virk, Shafqat. Live and Learn : Festschrift in honor of Lars Borin. Göteborg: Department of Swedish, Multilingualism and Language Technology, University of Gothenburg, 139-144.
Graën, Johannes; Volk, Martin (2021). Binomial adverbs in Germanic and Romance Languages : A corpus-based study. In: Lavid-López, Julia; Maíz-Arévalo, Carmen; Zamorano-Mansilla, Juan Rafael. Corpora in Translation and Contrastive Research in the Digital Age : Recent advances and explorations. Amsterdam: John Benjamins, 326-342.
Schaber, Jonathan; Graën, Johannes; Davatz, Jan Pavel; Ihsane, Tabea; Pinzin, Francesco; Poletto, Cecilia; Stark, Elisabeth (2021). The DiFuPaRo database. UZH.
Graën, Johannes (2021). Identifying phrasemes via interlingual association measures - A data-driven approach on dependency-parsed and word-aligned parallel corpora. In: Konecny, Christine; Autelli, Erica; Abel, Andrea; Zanasi, Lorenzo. Lexemkombinationen und typisierte Rede im mehrsprachigen Kontext. Tübingen: Stauffenburg Verlag, in press.
Graën, Johannes; Alfter, David; Schneider, Gerold (2020). Using Multilingual Resources to Evaluate CEFRLex for Learner Applications. In: 12th Conference on Language Resources and Evaluation (LREC 2020), Marseille, 11 May 2020 - 16 May 2020. European Language Resources Association, 346-355.
Graën, Johannes; Kew, Tannon; Shaitarova, Anastassia; Volk, Martin (2019). Modelling Large Parallel Corpora: The Zurich Parallel Corpus Collection. In: Challenges in the Management of Large Corpora (CMLC-7), Cardiff, Wales, 22 July 2019 - 22 July 2019.
Schneider, Gerold; Graën, Johannes (2018). NLP Corpus Observatory – Looking for Constellations in Parallel Corpora to Improve Learners’ Collocational Skills. In: 7th Workshop on NLP for Computer Assisted Language Learning at SLTC 2018 (NLP4CALL 2018), Stockholm, 7 November 2018 - 7 November 2018, 69-78.
Graën, Johannes; Bertamini, Mara; Volk, Martin (2018). Cutter – a Universal Multilingual Tokenizer. In: Swiss Text Analytics Conference, Winterthur, 12 June 2018 - 13 June 2018. CEUR-WS, 75-81.
Clematide, Simon; Lehner, Stéphanie; Graën, Johannes; Volk, Martin (2018). A multilingual gold standard for translation spotting of German compounds and their corresponding multiword units in English, French, Italian and Spanish. In: Mitkov, Ruslan; Monti, Johanna; Corpas Pastor, Gloria; Seretan, Violeta. Multiword Units in Machine Translation and Translation Technology. Amsterdam: John Benjamins, 125-145.
Graën, Johannes (2018). Exploiting alignment in multiparallel corpora for applications in linguistics and language learning. University of Zurich, Faculty of Arts.
Volk, Martin; Graën, Johannes (2017). Multi-word Adverbs – How well are they handled in Parsing and Machine Translation?. In: The 3rd Workshop on Multi-word Units in Machine Translation and Translation Technology (MUMTTT 2017), London, 14 November 2017 - 14 November 2017.
Graën, Johannes; Bless, Christof (2017). Exploring Properties of Intralingual and Interlingual Association Measures Visually. In: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, Gothenburg, Sweden, 22 May 2017 - 24 May 2017. Linköping University Electronic Press, Linköpings universitet, 314-317.
Graën, Johannes; Sandoz, Dominique; Volk, Martin (2017). Multilingwis2 – Explore Your Parallel Corpus. In: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, Gothenburg, Sweden, 22 May 2017 - 24 May 2017. Linköping University Electronic Press, Linköpings universitet, 247-250.
Graën, Johannes; Schneider, Gerold (2017). Crossing the Border Twice: Reimporting Prepositions to Alleviate L1-Specific Transfer Errors. In: Joint 6th Workshop on NLP for Computer Assisted Language Learning and 2nd Workshop on NLP for Research on Language Acquisition, Gothenburg, 22 May 2017 - 22 May 2017, 18-26.
Volk, Martin; Clematide, Simon; Graën, Johannes; Ströbel, Phillip (2016). Bi-particle adverbs, PoS-tagging and the recognition of german separable prefix verbs. In: KONVENS 2016, Bochum, 19 September 2016 - 21 September 2016.
Jancso, Anna; Rao, Xi; Graën, Johannes; Ebling, Sarah (2016). A Web Application for Geolocalized Signs in Synthesized Swiss German Sign Language. In: Proceedings of the International Conference of Computers Helping People with Special Needs (ICCHP), Linz, Austria, 13 July 2016 - 15 July 2016, Springer.
Graën, Johannes; Clematide, Simon; Volk, Martin (2016). Efficient Exploration of Translation Variants in Large Multiparallel Corpora Using a Relational Database. In: 4th Workshop on the Challenges in the Management of Large Corpora, Portorož, 28 May 2016 - 28 May 2016, 20-23.
Clematide, Simon; Graën, Johannes; Volk, Martin (2016). Multilingwis – A Multilingual Search Tool for Multi-Word Units in Multiparallel Corpora. In: Corpas Pastor, Gloria. Computerised and Corpus-based Approaches to Phraseology: Monolingual and Multilingual Perspectives/Fraseología computacional y basada en corpus: perspectivas monolingües y multilingües. Geneva: Tradulex, n/a.
Graën, Johannes; Clematide, Simon (2015). Challenges in the alignment, management and exploitation of large and richly annotated multi-parallel corpora. In: 3rd Workshop on the Challenges in the Management of Large Corpora, Lancaster, 20 July 2015 - 20 July 2015, 15-20.
Graën, Johannes; Batinić, Dolores; Volk, Martin (2014). Cleaning the Europarl Corpus for Linguistic Applications. In: Konvens 2014, Hildesheim, 8 October 2014 - 10 October 2014, Stiftung Universität Hildesheim.
Volk, Martin; Graën, Johannes; Callegaro, Elena (2014). Innovations in parallel corpus search tools. In: Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, 26 May 2014 - 31 May 2014, European Language Resources Association (ELRA).