Author: Natalia Korchagina. University of Zurich, 2020.

This folder contains resources produced during my PhD project "Temporal entity extraction from historical texts".

Word embeddings

1. SSRQ corpora cleaned by Ismail Prada, used to pre-train historical word embeddings (Section 4.1.3):
   raw_ssrq.txt

2. In Fasttext_models.zip: fastText word embeddings in sizes 100 and 300.
   histDE_100.vec - word embeddings of size 100 pre-trained on the SSRQ corpora
   histDE_300.vec - word embeddings of size 300 pre-trained on the SSRQ corpora
   modDE_100.vec - word embeddings of size 100 pre-trained on the WMT14 news corpus and a corpus of Wikipedia articles
   modDE_300.vec - word embeddings of size 300 pre-trained on the WMT14 news corpus and a corpus of Wikipedia articles

The vectors in this folder cover the full vocabulary of the training corpora, with a minimum count of 5 tokens for the historical corpora and 50 tokens for the modern corpora. For optimization purposes during my experiments, I pruned histDE_300.vec, modDE_100.vec, and modDE_300.vec to the vocabulary of each input corpus. For example, if the input corpus of a biLSTM temporal tagger is a volume of historical articles, I pruned the historical and modern word embeddings so that they contain only the vocabulary of that volume.
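The pruning step described above can be sketched as follows. This is a minimal illustration, not the script used in the experiments: it assumes the standard fastText .vec text format (a header line with vector count and dimension, followed by one word and its vector per line), and the file names in the usage comment are hypothetical.

```python
def prune_vec_file(vec_path, vocab, out_path):
    """Keep only the vectors whose word appears in `vocab`,
    and rewrite the .vec header to match the new vector count."""
    kept = []
    with open(vec_path, encoding="utf-8") as f:
        header = f.readline().split()  # "<num_words> <dim>"
        dim = int(header[1])
        for line in f:
            word = line.split(" ", 1)[0]
            if word in vocab:
                kept.append(line)
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(f"{len(kept)} {dim}\n")
        out.writelines(kept)

# Example usage (hypothetical paths):
# vocab = {w for line in open("volume.txt", encoding="utf-8")
#          for w in line.split()}
# prune_vec_file("modDE_300.vec", vocab, "modDE_300_pruned.vec")
```

Pruning this way shrinks the embedding files to the input corpus vocabulary without changing any of the retained vectors.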