at Nodalida 2015, Vilnius (Lithuania), May 11, 2015
Recent years have seen increased interest in, and availability of, many different kinds of corpora. These range from small but carefully annotated treebanks to large parallel corpora and very large monolingual corpora for big data research. It remains a challenge to query the multilayer annotations of small corpora, to access large corpora efficiently, and to visualize the query results.
When dealing with large corpora, query tools need to scale in terms of processing speed and reporting through statistical information and visualization options. This becomes evident, for example, when dealing with very large corpora (such as complete Wikipedia corpora) or multi-parallel corpora (such as Europarl or JRC Acquis). The goal of the workshop is to gather researchers who develop or evaluate new corpus query and visualization tools for linguistics, language technology or related disciplines.
The proceedings have been published online at Linköping University Electronic Press.
Time | Speaker(s) | Title |
---|---|---|
13:30h-13:45h | Martin Volk | Introduction to the Workshop |
Session 1 (Chair: Andrius Utka) | ||
13:45h-14:15h | Lucia Kocincová, Vít Baisa, Miloš Jakubíček and Vojtěch Kovář | Interactive Visualizations of Corpus Data in Sketch Engine |
14:15h-14:45h | Michał Kosek, Anders Nøklestad, Joel Priestley, Kristin Hagen and Janne Bondi Johannessen | Visualisation in speech corpora: maps and waves in the Glossa system |
30 min | Coffee break | |
Session 2 (Chair: Simon Clematide) | ||
15:15h-16:00h | Marc Kupietz (Institut für Deutsche Sprache, Mannheim) | Invited Talk: Scaling out corpus technology: the open source query and analysis engine KorAP |
16:00h-16:30h | Joachim Bingel and Nils Diewald | KoralQuery - A General Corpus Query Protocol |
15 min | Break | |
Session 3 (Chair: Johannes Graën) | ||
16:45h-17:15h | Ruprecht von Waldenfels | ParaViz: A visualization tool for crosslinguistic functional comparisons based on a parallel corpus |
17:15h-17:45h | Simon Clematide | Reflections and Proposals for a Query and Reporting Language for Richly Annotated Multiparallel Corpora |
17:45h-18:00h | Gintarė Grigonytė | Closing Session |
Abstract: With the growing importance of empiricism and a rapidly growing amount of research data, progress in linguistic research nowadays requires increasingly sophisticated and methodologically sound technical infrastructure, far beyond what typical university computing centres or typical research projects can deliver. Unfortunately, funding conditions in linguistics are still not as well adapted to this circumstance as in more established data-intensive research fields. Even large-scale e-infrastructure initiatives like CLARIN have provided a solid basis of standards and best practices, but nothing coming close to a sufficiently general tool for corpus-based research. The talk will introduce KorAP, an open-source corpus analysis platform, mainly developed at the Institut für Deutsche Sprache. It will sketch KorAP's background, how it deals with current and upcoming scientific and technological challenges, how it tries to achieve long-term sustainability despite the aforementioned constraints, and how it tries to contribute to progress in linguistic research.
Gintarė Grigonytė (Stockholm University)