====== How (not) to build publicly available NLP web services ======
Material:
  * CGI: [[wp>Common_Gateway_Interface]]
  * FastCGI: [[wp>FastCGI]]
  * WSGI: [[wp>Web_Server_Gateway_Interface]]
  * Repo: [[gitlab>graen/nlpservice]]
  * DB: [[https://pub.cl.uzh.ch/adminer/?pgsql=pollux&username=nlpserviceadmin&db=nlpservice&ns=public]]
  * Visual Parsing
    * Demo: [[https://pub.cl.uzh.ch/projects/sparcling/visualparsing/bootstrap/]]
    * Repo: [[gitlab>pcl-iii/visualparsing]]
===== Converting command line tools to services =====
==== TreeTaggerWrapper ====
  * File: ''/mnt/storage/clfiles/resources/lib/python2.7/dist-packages/treetaggerwrapper.py''

  # Send text to TreeTagger, get result.
  logger.debug("Tagging text.")
  t = threading.Thread(target=pipe_writer,
                       args=(self.taginput,
                             lines, self.dummysequence,
                             self.taginencoding,
                             self.taginencerr))
  t.start()
  result = []
  intext = False
  lastline_time = time.time()
  while True:
      line = self.tagoutput.readline()
      if DEBUG: logger.debug("Read from TreeTagger: %r", line)
      if not line:
          if (time.time() - lastline_time) > TAGGER_TIMEOUT:
              # We already wait some times, there may be a problem with tagging
              # process communication. This avoid infinite loop.
              logger.error("Time out for TreeTagger reply.")
              raise TreeTaggerError("Time out for TreeTagger reply, enable debug / see error logs")
          else:
              # We process too much quickly, leave time for tagger and writer
              # thread to work.
              time.sleep(0.1)
              continue  # read again.
      lastline_time = time.time()
      line = line.decode(self.tagoutencoding, self.tagoutencerr)
      line = line.strip()
      if line == STARTOFTEXT:
          intext = True
          continue
      if line == ENDOFTEXT:  # The flag we sent to identify texts.
          intext = False
          break
      if intext and line:
          if not (self.removesgml and is_sgml_tag(line)):
              result.append(line)

  def pipe_writer(pipe, text, flushsequence, encoding, errors):
      """Write a text to a pipe and manage pre-post data to ensure flushing.
      For internal use.
      If text is composed of str strings, they are written as-is (ie. assume
      ad-hoc encoding is providen by caller). If it is composed of unicode
      strings, then they are converted to the specified encoding.
      :param pipe: the Popen pipe on what to write the text.
      :type pipe: Popen object (file-like with write and flush methods)
      :param text: the text to write.
      :type text: string or list of strings
      :param flushsequence: lines of tokens to ensure flush by TreeTagger.
      :type flushsequence: string (with \\n between tokens)
      :param encoding: encoding of texts written on the pipe.
      :type encoding: str
      :param errors: how to manage encoding errors: strict/ignore/replace.
      :type errors: str
      """

  "de": {
      "encoding": "utf-8",
      "tagparfile": "german-utf8.par",
      "abbrevfile": "german-abbreviations-utf8",
      "pchar": ALONEMARKS + "'",
      "fchar": ALONEMARKS + "'",
      "pclictic": "",
      "fclictic": "'(s|re|ve|d|m|em|ll)|n't",
      "number": NUMBER_EXPRESSION,
      "dummysentence": "Das ist ein Testsatz um das Stossen der "
                       "daten sicherzustellen .",
      "replurlexp": 'replaced-url',
      "replemailexp": 'replaced-email',
      "replipexp": 'replaced-ip',
      "repldnsexp": 'replaced-dns'
  },
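
The first excerpt keeps one TreeTagger process alive and frames each request with start and end markers, so the reader knows when a reply is complete. A minimal sketch of that pattern for any line-based tool could look as follows. The child process is a stand-in written in Python, and ''LineTool'', ''START'', ''END'' and ''CHILD'' are hypothetical names, not part of treetaggerwrapper.

```python
import subprocess
import sys
import threading

# Hypothetical markers; the real wrapper defines its own STARTOFTEXT/ENDOFTEXT.
START = "<start-of-text/>"
END = "<end-of-text/>"

# Stand-in for a line-based tool: upper-cases tokens, passes markup through.
CHILD = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    line = line.rstrip('\\n')\n"
    "    print(line if line.startswith('<') else line.upper(), flush=True)\n"
)


class LineTool:
    """Keep one child process alive and frame each request with markers."""

    def __init__(self, argv):
        self.proc = subprocess.Popen(
            argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            text=True, encoding="utf-8", bufsize=1)
        self.lock = threading.Lock()  # one request at a time per process

    def _write(self, lines):
        # Runs in its own thread so a full pipe cannot block the reader.
        for line in [START, *lines, END]:
            self.proc.stdin.write(line + "\n")
        self.proc.stdin.flush()

    def process(self, lines):
        with self.lock:
            writer = threading.Thread(target=self._write, args=(lines,))
            writer.start()
            result, intext = [], False
            while True:
                line = self.proc.stdout.readline()
                if not line:  # EOF: the child died or closed its output
                    raise RuntimeError("tool closed its output")
                line = line.rstrip("\n")
                if line == START:
                    intext = True
                elif line == END:
                    break
                elif intext:
                    result.append(line)
            writer.join()
            return result

    def close(self):
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()


if __name__ == "__main__":
    tool = LineTool([sys.executable, "-c", CHILD])
    print(tool.process(["Das", "ist", "ein", "Test"]))
    tool.close()
```

Unlike the wrapper, this sketch has no timeout: the blocking ''readline()'' waits indefinitely if the tool never emits the end marker. The lock serialises callers, which matters as soon as several web requests share one process.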
===== Connection limit in Nginx config =====
[[https://nginx.org/en/docs/http/ngx_http_limit_conn_module.html#limit_conn_zone|"limit_conn_zone" directive]]
  # http context
  limit_conn_zone $binary_remote_addr zone=addr:10m;
  # server context
  location /demo/parzu/ {
      limit_conn addr 1;
      rewrite /demo/parzu/(.*)$ /$1 break;
      proxy_pass http://dutchy.cli/clfiles/projects/cl/webapp/parzu/$1$is_args$args;
  }
  location /demo/corzu/ {
      limit_conn addr 1;
      rewrite /demo/corzu/(.*)$ /$1 break;
      proxy_pass http://dutchy.cli/harlie/projects/clcoref/corzu_web_demo/$1$is_args$args;
  }
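
With ''limit_conn addr 1'', Nginx counts open connections per client address and rejects any additional one (by default with status 503). The sketch below only illustrates these semantics in Python. ''ConnLimiter'' is a hypothetical name, and this is not how Nginx implements the shared zone.

```python
import threading
from collections import Counter


class ConnLimiter:
    """Track concurrent requests per key, similar in spirit to limit_conn."""

    def __init__(self, limit=1):
        self.limit = limit
        self.active = Counter()
        self.lock = threading.Lock()

    def try_acquire(self, key):
        """Return True if the request may proceed, False if it is over the limit."""
        with self.lock:
            if self.active[key] >= self.limit:
                return False  # Nginx would answer 503 here by default
            self.active[key] += 1
            return True

    def release(self, key):
        """Call when the request for this key has finished."""
        with self.lock:
            self.active[key] -= 1
            if self.active[key] <= 0:
                del self.active[key]
```

The limit applies per address: requests from different clients still reach the backend in parallel.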
===== Self-locking application =====
==== ParZu ====
  * Demo: [[https://pub.cl.uzh.ch/demo/parzu/]]
  * File: ''/mnt/storage/clfiles/projects/cl/webapp/parzu/parzu.cgi''

  # Don spam/DDOS prevention
  # Check if last log entry is older than X seconds. If not, abort.
  if os.path.isfile(logfile):
      time_since_last_call = time.time() - os.stat(logfile).st_mtime
      if time_since_last_call < 5:
          additional_styles = "\nfont-family: Arial, Helvetica, sans-serif;\nfont-size: 12px;\n"
          print(html_text.format(additional_styles,'Demo already running. Wait 20 seconds and try again.' ))
          sys.exit()

Try it out:

  curl --data "output=conll&rawtext=Mein Luftkissenfahrzeug ist voller Aale." \
       https://pub.cl.uzh.ch/demo/parzu/parzu.cgi
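
The ParZu excerpt uses the modification time of the log file as a lock. Its threshold is 5 seconds, while the message asks the user to wait 20. A minimal sketch that derives both from one value might look like this; ''seconds_to_wait'' and ''MIN_INTERVAL'' are hypothetical names, not taken from ''parzu.cgi''.

```python
import os
import time

MIN_INTERVAL = 5  # seconds; the threshold used in the excerpt


def seconds_to_wait(logfile, min_interval=MIN_INTERVAL, now=None):
    """Return 0 if a new run may start, else the remaining wait in seconds."""
    if not os.path.isfile(logfile):
        return 0
    now = time.time() if now is None else now
    age = now - os.stat(logfile).st_mtime
    return max(0, min_interval - age)


if __name__ == "__main__":
    wait = seconds_to_wait("parzu.log")  # hypothetical log file name
    if wait > 0:
        print("Demo already running. Wait {:.0f} seconds and try again.".format(wait))
```

This is still a check-then-act scheme: two requests that arrive before the log file is next modified can both pass the check.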
==== CorZu ====
  * Demo: [[https://pub.cl.uzh.ch/demo/corzu/]]
  * File: ''/mnt/storage/harlie/projects/clcoref/corzu_web_demo/CorZu.cgi''

  # Spam / DDOS prevention: Check if the parsed file is at least 30 secs old before starting anew
  if [[ -f $tmp_dir/parsed.conll ]]
  then
      lastchange=$(($(date +%s) - $(date +%s -r $tmp_dir/parsed.conll)))
      if [[ "$lastchange" -lt 5 ]]
      then
          echo Content-type: text/html
          echo ""
          echo "Demo already running. Please wait 5 seconds and try again."
          exit
      fi
  fi

Try it out:

  curl --form "format=conll" --form "text=Mein Luftkissenfahrzeug ist voller Aale. Sie sind überall." \
       https://pub.cl.uzh.ch/demo/corzu/CorZu.cgi
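
Both demos infer "already running" from the age of a file. As a point of comparison, and not what either script does, a lock held with ''flock'' is granted by the kernel to one holder at a time and is released automatically when the process exits. A minimal sketch for Unix systems follows; ''try_lock'', ''unlock'' and the lock file path are hypothetical.

```python
import fcntl
import os


def try_lock(path):
    """Return a file descriptor holding an exclusive lock, or None if busy."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def unlock(fd):
    """Release the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)


if __name__ == "__main__":
    fd = try_lock("/tmp/demo.lock")  # hypothetical lock file
    if fd is None:
        print("Demo already running. Please try again later.")
    else:
        try:
            pass  # run the parser here
        finally:
            unlock(fd)
```

A crashed run cannot leave a stale lock behind, because the lock disappears with the process that held it.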
===== Pipelines =====

  echo "Mein Luftkissenfahrzeug ist voller Aale. Sie sind überall." \
    | maltparser-tokenizer-treetagger-german.bash
  echo "À cheval donné on ne regarde pas les dents." \
    | maltparser-tokenizer-MElt-french.bash
  echo "Det är viktigt att du aktiverar ditt studentkonto på Studentportalen." \
    | maltparser-stagger-swedish.bash

===== Service =====

  echo "Mein Luftkissenfahrzeug ist voller Aale." \
    | curl --data @- pub.cl.uzh.ch/service/nlpservice/parzu
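
The same call can be made from Python with the standard library. The sketch below mirrors what ''curl --data @-'' sends: a POST with content type ''application/x-www-form-urlencoded'' and the input with line breaks stripped. ''build_request'' and ''parse'' are hypothetical helper names, and decoding the reply as UTF-8 is an assumption, since the response format is not documented here.

```python
import urllib.request

# curl defaults to http:// when no scheme is given.
SERVICE_URL = "http://pub.cl.uzh.ch/service/nlpservice/parzu"


def build_request(text, url=SERVICE_URL):
    """Build the POST request that 'curl --data @-' would send for this text."""
    # curl --data strips carriage returns and newlines read from a file or stdin.
    body = text.replace("\r", "").replace("\n", "").encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def parse(text, timeout=30):
    """Send the text to the service and return the raw reply as a string."""
    with urllib.request.urlopen(build_request(text), timeout=timeout) as response:
        return response.read().decode("utf-8")  # assumption: UTF-8 reply


if __name__ == "__main__":
    print(parse("Mein Luftkissenfahrzeug ist voller Aale."))
```

The explicit timeout keeps a client from hanging if the service is busy or unreachable.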