This shows you the differences between two versions of the page.
public:cutter:start [2018-06-13 08:07] – Johannes Graën | public:cutter:start [2019-10-25 14:35] – Johannes Graën | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Cutter | + | ====== Cutter |
+ | Cutter is a rule-based tokenizer that can easily be adapted to new languages and text types. | ||
===== History ===== | ===== History ===== | ||
Line 7: | Line 9: | ||
* from January 2018 -- reimplementation in Python | * from January 2018 -- reimplementation in Python | ||
* June 2018 -- released as Python module | * June 2018 -- released as Python module | ||
+ | * November 2018 -- released as [[https:// | ||
+ | |||
===== Demos ===== | ===== Demos ===== | ||
- | * [[https:// | + | The current version is always available at [[https:// |
- | * [[https:// | + | |
- | * [[https:// | + | * version 1.0 (Apr. 2016) |
- | * [[https:// | + | * version 1.2 (Jul. 2016) |
- | * [[https:// | + | * version 1.4 (Feb. 2017) |
+ | * version 1.6 (May 2017) | ||
+ | * version 2.0 (June 2018) | ||
+ | * version 2.1 (August 2018) | ||
+ | * version 2.2 (September 2018) | ||
+ | * version 2.3 (November 2018) | ||
+ | * version 2.4 (January 2019) | ||
+ | * version 2.5 (June 2019) | ||
===== Source ===== | ===== Source ===== | ||
* [[gitlab> | * [[gitlab> | ||
- | * [[gitlab> | + | * [[gitlab> |
+ | |||
+ | |||
+ | ===== PyPI package ===== | ||
+ | We provide the newer Python | ||
+ | |||
+ | It can simply be used with its pre-defined profiles like this: | ||
+ | <code python> | ||
+ | import Cutter | ||
+ | |||
+ | cutter = Cutter.Cutter(profile=' | ||
+ | text = "On nous dit qu’aujourd’hui c’est le cas, encore faudra-t-il l’évaluer." | ||
+ | for token in cutter.cut(text): | ||
+ |     print(token) | ||
+ | </ | ||
+ | |||
+ | The loop prints one tuple per token: | ||
+ | ^ Token ^ Tag ^ Tree ^ Start ^ End ^ | ||
+ | | On | frRtkA | ||
+ | | nous | frRtkA | ||
+ | | dit | frRtkA | ||
+ | | qu’ | frXel | 2 | 12 | 15 | | ||
+ | | aujourd’hui | ||
+ | | c’ | frXel | 2 | 27 | 29 | | ||
+ | | est | frRtkA | ||
+ | | le | frRtkA | ||
+ | | cas | frRtkB | ||
+ | | , | +punct | ||
+ | | encore | ||
+ | | faudra | ||
+ | | -t-il | frXpr1 | ||
+ | | l’ | frXel | 3 | 60 | 62 | | ||
+ | | évaluer | ||
+ | | . | +dot | 4 | 69 | 70 | | ||
+ | | | +EOS5 | 4 | 70 | 70 | | ||
+ | |||
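The Start and End columns appear to be character offsets into the input string, with End exclusive, so slicing the text should recover each token. A minimal sketch that checks this against the rows of the table above that are fully legible; the tuples are copied by hand rather than produced by Cutter, and the helper name `check_offsets` is an assumption made for illustration:

```python
# Input sentence from the example above.
text = "On nous dit qu’aujourd’hui c’est le cas, encore faudra-t-il l’évaluer."

# (token, tag, tree, start, end) rows copied by hand from the table above;
# only the rows that are fully legible there are included.
rows = [
    ("qu’", "frXel", 2, 12, 15),
    ("c’", "frXel", 2, 27, 29),
    ("l’", "frXel", 3, 60, 62),
    (".", "+dot", 4, 69, 70),
    ("", "+EOS5", 4, 70, 70),
]


def check_offsets(text, rows):
    """Return the rows whose token differs from text[start:end]."""
    return [row for row in rows if text[row[3]:row[4]] != row[0]]


mismatches = check_offsets(text, rows)
print(mismatches)
```

An empty list means every listed token can be recovered from the original text by its offsets, which is useful when token annotations have to be mapped back onto the untokenized input.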
+ | By means of the third column, the tokenization tree can be reconstructed: | ||
+ | {{: | ||
+ | |||
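A rough text rendering of the third column can be sketched as follows. This assumes that the Tree value is the depth at which a token was split off, which is an interpretation and not something this page confirms; it produces an indented listing only, not a full tree. `render_by_depth` is a hypothetical helper, and the rows are copied by hand from the legible rows of the table above:

```python
def render_by_depth(rows, indent="  "):
    """Indent each token by its tree value (assumed here to be a depth)."""
    lines = []
    for token, tag, tree, start, end in rows:
        # The end-of-sentence marker has an empty token string.
        label = token if token else "(empty)"
        lines.append(f"{indent * tree}{label} [{tag}]")
    return "\n".join(lines)


# (token, tag, tree, start, end), copied by hand from the table above.
rows = [
    ("qu’", "frXel", 2, 12, 15),
    ("c’", "frXel", 2, 27, 29),
    ("l’", "frXel", 3, 60, 62),
    (".", "+dot", 4, 69, 70),
    ("", "+EOS5", 4, 70, 70),
]

rendering = render_by_depth(rows)
print(rendering)
```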
+ | |||
+ | ===== Web service ===== | ||
+ | We also provide a web service for tokenization using one of the pre-defined profiles: | ||
+ | <code bash> | ||
+ | echo " | ||
+ | | curl --data @- https:// | ||
+ | | jq | ||
+ | </ | ||
+ | |||
+ | This call returns a JSON array with one object per token, each holding the token and its tag: | ||
+ | <file json> | ||
+ | [ | ||
+ | { | ||
+ | " | ||
+ | " | ||
+ | }, | ||
+ | { | ||
+ | " | ||
+ | " | ||
+ | }, | ||
+ | { | ||
+ | " | ||
+ | " | ||
+ | }, | ||
+ | { | ||
+ | " | ||
+ | " | ||
+ | }, | ||
+ | { | ||
+ | " | ||
+ | " | ||
+ | }, | ||
+ | { | ||
+ | " | ||
+ | " | ||
+ | }, | ||
+ | { | ||
+ | " | ||
+ | " | ||
+ | }, | ||
+ | { | ||
+ | " | ||
+ | " | ||
+ | }, | ||
+ | { | ||
+ | " | ||
+ | " | ||
+ | }, | ||
+ | { | ||
+ | " | ||
+ | " | ||
+ | }, | ||
+ | { | ||
+ | " | ||
+ | " | ||
+ | }, | ||
+ | { | ||
+ | " | ||
+ | " | ||
+ | }, | ||
+ | { | ||
+ | " | ||
+ | " | ||
+ | }, | ||
+ | { | ||
+ | " | ||
+ | " | ||
+ | }, | ||
+ | { | ||
+ | " | ||
+ | " | ||
+ | }, | ||
+ | { | ||
+ | " | ||
+ | " | ||
+ | }, | ||
+ | { | ||
+ | " | ||
+ | " | ||
+ | } | ||
+ | ] | ||
+ | </ | ||
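A response of this shape can be consumed with Python's standard `json` module. A minimal sketch: the key names `token` and `tag` and the sample payload are assumptions chosen for illustration (the actual field names should be taken from a real response), and `pairs_from_response` is a hypothetical helper, not part of Cutter:

```python
import json


def pairs_from_response(body, token_key="token", tag_key="tag"):
    """Parse a JSON array of objects into (token, tag) pairs.

    The default key names are assumptions; pass the names that the
    service actually uses.
    """
    return [(item[token_key], item[tag_key]) for item in json.loads(body)]


# Hypothetical payload that only mimics the structure shown above.
sample = '[{"token": "qu’", "tag": "frXel"}, {"token": ".", "tag": "+dot"}]'

pairs = pairs_from_response(sample)
print(pairs)
```

Keeping the key names as parameters means the helper does not need to change if the service labels its fields differently.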
===== Contributors ===== | ===== Contributors ===== | ||
Line 29: | Line 162: | ||
* Simon Clematide | * Simon Clematide | ||
* Daniel Wüest | * Daniel Wüest | ||
+ | * Alex Flückiger | ||
+ | |||
+ | |||
+ | ===== Citation ===== | ||
+ | See also [[https:// | ||
+ | <code biblatex cutter.bib> | ||
+ | @inproceedings{GraenBertaminiVolk2018, | ||
+ | number = {2226}, | ||
+ | month = {June}, | ||
+ | author = {Johannes Gra{\" | ||
+ | series = {CEUR Workshop Proceedings}, | ||
+ | | ||
+ | editor = {Mark Cieliebak and Don Tuggener and Fernando Benites}, | ||
+ | title = {Cutter -- a Universal Multilingual Tokenizer}, | ||
+ | | ||
+ | year = {2018}, | ||
+ | pages = {75--81}, | ||
+ | url = {https:// | ||
+ | issn = {1613-0073}, | ||
+ | } | ||
+ | </ |