Differences

public:cutter:start [2018-11-18 02:03] Johannes Graën → public:cutter:start [2018-11-18 17:36] Johannes Graën
Line 1: Line 1:
 ====== Cutter – a Universal Multilingual Tokenizer ======
 Cutter is a rule-based tokenizer that is easily adaptable to other languages and text types. It currently supports Catalan, Dutch, English, French, German, Italian, Portuguese, Romansh, Spanish and Swedish, but can also be used without any language-specific rules.
 +
  
 ===== History =====
Line 9: Line 10:
   * June 2018 -- released as Python module
   * November 2018 -- released as [[https://pypi.org/project/cutter-ng/|PyPI package]]
 +
  
 ===== Demos =====
Line 21: Line 23:
   * [[https://pub.cl.uzh.ch/projects/sparcling/cutter/v2.2/|version 2.2]] (September 2018)
   * [[https://pub.cl.uzh.ch/projects/sparcling/cutter/v2.3/|version 2.3]] (November 2018)
 +
  
 ===== Source =====
   * [[gitlab>graen/cutter]] (version 1.x -- PHP)
   * [[gitlab>graen/cutter-ng]] and [[https://github.com/j0hannes/cutter-ng]] (version 2.x -- Python3)
 +
  
 ===== PyPI package =====
Line 61: Line 65:
 By means of the third column, the tokenization tree can be reconstructed:
 {{:public:cutter:tokenization_tree.png?nolink&500 |}}
 +
 +
 +===== Web service =====
 +We also provide a web service for tokenization using one of the pre-defined profiles:
 +<code>
 +echo "text=On nous dit qu’aujourd’hui c’est le cas, encore faudra-t-il l’évaluer." \
 + | curl --data @- https://pub.cl.uzh.ch/service/cutter-ng/current/fr/ \
 + | jq
 +</code>
 +
 +This call returns a JSON array with one object per token; each object holds the token text (''tok'') and its tag (''tag''):
 +<file json>
 +[
 +  {
 +    "tag": "frRtkA",
 +    "tok": "On"
 +  },
 +  {
 +    "tag": "frRtkA",
 +    "tok": "nous"
 +  },
 +  {
 +    "tag": "frRtkA",
 +    "tok": "dit"
 +  },
 +  {
 +    "tag": "frXel",
 +    "tok": "qu’"
 +  },
 +  {
 +    "tag": "frQlx",
 +    "tok": "aujourd’hui"
 +  },
 +  {
 +    "tag": "frXel",
 +    "tok": "c’"
 +  },
 +  {
 +    "tag": "frRtkA",
 +    "tok": "est"
 +  },
 +  {
 +    "tag": "frRtkA",
 +    "tok": "le"
 +  },
 +  {
 +    "tag": "frRtkB",
 +    "tok": "cas"
 +  },
 +  {
 +    "tag": "+punct",
 +    "tok": ","
 +  },
 +  {
 +    "tag": "frRtkA",
 +    "tok": "encore"
 +  },
 +  {
 +    "tag": "frRtkB",
 +    "tok": "faudra"
 +  },
 +  {
 +    "tag": "frXpr1",
 +    "tok": "-t-il"
 +  },
 +  {
 +    "tag": "frXel",
 +    "tok": "l’"
 +  },
 +  {
 +    "tag": "frRtkB",
 +    "tok": "évaluer"
 +  },
 +  {
 +    "tag": "+dot",
 +    "tok": "."
 +  },
 +  {
 +    "tag": "+EOS5",
 +    "tok": ""
 +  }
 +]
 +</file>
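
The same call can be sketched in Python using only the standard library. This is a minimal, hypothetical client and not part of the cutter-ng package: the helper names ''tokenize'' and ''split_sentences'' are made up for illustration, the request is sent as a URL-encoded form body (which is assumed to be what the service expects, as with ''curl --data''), and tags starting with ''+EOS'' are assumed to mark sentence ends, as the sample output above suggests.

```python
import json
import urllib.parse
import urllib.request

# URL pattern taken from the curl example; the last path segment is the profile.
SERVICE_URL = "https://pub.cl.uzh.ch/service/cutter-ng/current/{profile}/"


def tokenize(text, profile="fr"):
    """POST `text` to the web service and return the decoded JSON list.

    Hypothetical helper: assumes the service accepts a URL-encoded
    form field named `text`, as in the curl call.
    """
    body = urllib.parse.urlencode({"text": text}).encode("utf-8")
    url = SERVICE_URL.format(profile=profile)
    with urllib.request.urlopen(url, data=body) as response:
        return json.loads(response.read().decode("utf-8"))


def split_sentences(items):
    """Group token strings into sentences.

    Assumption: any tag starting with "+EOS" marks the end of a sentence
    and carries no token text of its own.
    """
    sentences, current = [], []
    for item in items:
        if item["tag"].startswith("+EOS"):
            if current:
                sentences.append(current)
            current = []
        else:
            current.append(item["tok"])
    if current:  # keep trailing tokens that lack a closing EOS marker
        sentences.append(current)
    return sentences
```

Calling ''split_sentences(tokenize("..."))'' would then yield one list of token strings per sentence; ''split_sentences'' also works offline on a response that has already been saved to disk and loaded with ''json.load''.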
  
  
Line 74: Line 161:
   * Daniel Wüest
   * Alex Flückiger
 +
  
 ===== Citation =====

CL Wiki

Institute of Computational Linguistics – University of Zurich