Cutter - a Universal Multilingual Tokenizer

History

  • end of 2015 – concept and first development version (PHP)
  • April 2016 – first release
  • until May 2017 – continued development, with support for up to 17 languages
  • from January 2018 – reimplementation in Python
  • June 2018 – released as Python module

Demos

Source

Contributors

  • Johannes Graën
  • Martin Volk
  • Mara Bertamini
  • Chantal Amrhein
  • Phillip Ströbel
  • Anne Göhring
  • Natalia Korchagina
  • Simon Clematide
  • Daniel Wüest
  • Alex Flückiger

Citation

cutter.bib
@inproceedings{GraenBertaminiVolk2018,
          number = {2226},
           month = {June},
          author = {Johannes Gra{\"e}n and Mara Bertamini and Martin Volk},
          series = {CEUR Workshop Proceedings},
       booktitle = {Swiss Text Analytics Conference},
          editor = {Mark Cieliebak and Don Tuggener and Fernando Benites},
           title = {Cutter -- a Universal Multilingual Tokenizer},
       publisher = {CEUR-WS},
            year = {2018},
           pages = {75--81},
             url = {https://doi.org/10.5167/uzh-157243},
            issn = {1613-0073},
}

Institute of Computational Linguistics – University of Zurich