Author: Natalia Korchagina. University of Zurich, 2020.
This folder contains resources produced during my PhD project "Temporal entity extraction from historical texts".

Temporal tagging.

1. The Gold Standard corpus used to train the models at the evaluation step in Chapter 4 (with historical sentences only):

GS_hist_only.csv

2. Artificially generated training set GEN (described in Section 4.2.3):

generated_corpus.csv

3. Manually curated historical corpus with temporal annotation used at the last evaluation step in Chapter 4:

SG_III_4_GS.csv

4. Models used to evaluate the performance of the temporal taggers on SG III/4_GS:

a) Wapiti CRF model trained on the Gold Standard (historical sentences only):

	- unzip model.zip in wapiti/model
	- to annotate an input file with this model:
	
	wapiti label -m model input.csv output.csv

b) Five randomly initialized biLSTM models trained on the Gold Standard (historical sentences only):
	
	- unzip models.zip in biLSTM/GS_models.zip
	- install the DyNet library from http://dynet.io
	- run the tagging script biLSTM_tagger.py (after modifying paths to the pre-trained word embeddings on lines 159-164):

homedir_io=/your/home/directory
i=model_number (e.g., 1, if you want to use model1.data and model1.meta)

python biLSTM_tagger.py --train=$homedir_io/GS_hist_only.csv --test=$homedir_io/input.csv --out=output.csv --models_folder=biLSTM_models/GS_models --include-FT-NORM-emb --include-FT-emb --emb-dim=100 --model=model$i --dynet-mem 1000

c) Five randomly initialized biLSTM models trained on generated_corpus.csv:

	- unzip models.zip in biLSTM/GEN_models.zip
	- run the tagging script, with the same variables as in b):

python biLSTM_tagger.py --train=$homedir_io/generated_corpus.csv --test=$homedir_io/input.csv --out=output.csv --models_folder=biLSTM_models/GEN_models --include-FT-NORM-emb --include-FT-emb --emb-dim=100 --model=model$i --dynet-mem 1000
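
The GS and GEN commands above differ only in the training file, the models folder and the model number, so running all five models of a set can be scripted. The following is a minimal sketch, not part of the original resources: the function name build_tagger_command, the directory /your/home/directory and the per-model output names (output_model1.csv, ...) are assumptions made for illustration. The flags are copied from the commands above. The sketch only prints the commands.

```python
import shlex
from typing import List


def build_tagger_command(model_number: int, io_dir: str,
                         train_file: str, models_folder: str) -> List[str]:
    """Build the argument list for one run of biLSTM_tagger.py.

    The flags mirror the commands in this README. The output file name
    is a hypothetical convention so that runs do not overwrite each other.
    """
    return [
        "python", "biLSTM_tagger.py",
        f"--train={io_dir}/{train_file}",
        f"--test={io_dir}/input.csv",
        f"--out=output_model{model_number}.csv",  # assumed naming scheme
        f"--models_folder={models_folder}",
        "--include-FT-NORM-emb",
        "--include-FT-emb",
        "--emb-dim=100",
        f"--model=model{model_number}",
        "--dynet-mem", "1000",
    ]


# Print the five GS commands. To execute one instead of printing it,
# pass the list to subprocess.run(cmd, check=True).
for n in range(1, 6):
    cmd = build_tagger_command(n, "/your/home/directory",
                               "GS_hist_only.csv", "biLSTM_models/GS_models")
    print(shlex.join(cmd))
```

For the GEN models, pass "generated_corpus.csv" and "biLSTM_models/GEN_models" instead.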

5. To produce the results described at the evaluation step, I ensembled the output of four GS models, one GEN model, and the Wapiti model. The ensembling was performed by majority voting on each token's label. Other combinations of models can of course be tested.
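
The per-token majority voting can be sketched as follows. This is a minimal illustration and not the script used in the thesis: the function name majority_vote, the label names in the toy example, and the tie-breaking rule (the label of the tagger listed first wins) are all assumptions. Reading the labels from the taggers' output files is left out, since it depends on the column layout of those files.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-token labels from several taggers by majority voting.

    predictions[k][t] is the label that tagger k assigned to token t.
    Ties go to the tagger listed first (an assumption of this sketch).
    """
    if not predictions:
        return []
    n_tokens = len(predictions[0])
    if any(len(p) != n_tokens for p in predictions):
        raise ValueError("all taggers must label the same number of tokens")

    ensembled = []
    for token_labels in zip(*predictions):
        counts = Counter(token_labels)
        best = max(counts.values())
        # Keep the first label (in tagger order) that reaches the top count.
        ensembled.append(next(lab for lab in token_labels if counts[lab] == best))
    return ensembled


# Toy labels for three tokens from six taggers (e.g. 4 GS, 1 GEN, 1 Wapiti).
# The label names are hypothetical.
outputs = [
    ["O", "B-DATE", "I-DATE"],
    ["O", "B-DATE", "O"],
    ["O", "B-DATE", "I-DATE"],
    ["O", "O", "I-DATE"],
    ["O", "B-DATE", "O"],
    ["O", "B-DATE", "I-DATE"],
]
print(majority_vote(outputs))  # ['O', 'B-DATE', 'I-DATE']
```

With an even number of taggers, ties are possible, so the order in which the outputs are listed matters under this tie-breaking rule.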