Download the corpus recursively with wget -r -np (-np keeps wget from ascending into the parent directory):

This function reads from compressed files directly:
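A minimal sketch of such a reader, assuming the corpus files are gzip-, bzip2- or xz-compressed text with tab-separated fields. The names `open_compressed` and `read_rows`, the suffix table, and the separator are illustrative assumptions, not taken from the corpus documentation.

```python
import bz2
import gzip
import lzma

# Map file suffixes to the stdlib opener that decompresses on the fly.
# Assumption: the corpus uses one of these three formats.
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_compressed(path, encoding="utf-8"):
    """Return a text-mode handle for a possibly compressed file."""
    for suffix, opener in _OPENERS.items():
        if str(path).endswith(suffix):
            return opener(path, mode="rt", encoding=encoding)
    # Fall back to a plain text file when no known suffix matches.
    return open(path, mode="r", encoding=encoding)


def read_rows(path, sep="\t"):
    """Yield every non-empty line of the file as a list of fields."""
    with open_compressed(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                yield line.split(sep)
```

Because `read_rows` is a generator, the file is decompressed line by line and never has to be unpacked to disk first.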

Read in sentence alignment units (relation sentenceID <=> sentenceAU):
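One way to hold this relation in memory is a pair of dictionaries, one per direction. The sketch below assumes the file yields `(sentence_id, au_id)` pairs; the function name and the IDs in the toy data are made up for illustration.

```python
from collections import defaultdict


def build_alignment_maps(rows):
    """Index (sentence_id, au_id) pairs in both directions."""
    sentence_to_au = {}
    au_to_sentences = defaultdict(list)
    for sentence_id, au_id in rows:
        sentence_to_au[sentence_id] = au_id
        au_to_sentences[au_id].append(sentence_id)
    return sentence_to_au, dict(au_to_sentences)


# Made-up IDs for illustration; the real corpus defines its own scheme.
toy_rows = [
    ("de-1", "au-1"), ("en-1", "au-1"),
    ("de-2", "au-2"), ("en-2", "au-2"), ("en-3", "au-2"),
]
sentence_to_au, au_to_sentences = build_alignment_maps(toy_rows)
print(au_to_sentences["au-2"])  # ['de-2', 'en-2', 'en-3']
```

An alignment unit may group more than one sentence per language, which is why the reverse map stores lists.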

Read in token lists (relation token <=> sentenceID):
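A sketch of this step, assuming the file yields `(token, sentence_id)` pairs in running-text order. The function name and the toy rows are hypothetical.

```python
from collections import defaultdict


def build_token_lists(rows):
    """Group (token, sentence_id) pairs into one token list per sentence."""
    tokens_by_sentence = defaultdict(list)
    for token, sentence_id in rows:
        # Appending preserves the order in which tokens appear in the file.
        tokens_by_sentence[sentence_id].append(token)
    return dict(tokens_by_sentence)


# Toy rows for illustration only.
toy_rows = [
    ("Das", "de-1"), ("Ergebnis", "de-1"),
    ("The", "en-1"), ("result", "en-1"),
]
tokens_by_sentence = build_token_lists(toy_rows)
print(tokens_by_sentence["de-1"])  # ['Das', 'Ergebnis']
```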

Let's have a look at an example:

We iterate over all tokens in both languages and compute token frequencies and a cooccurrence matrix:
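The counting loop could look roughly like this. It is a sketch under two assumptions that the text does not confirm: a pair counts as cooccurring once per aligned unit, no matter how often either token repeats in it, and the matrix is stored sparsely as a `Counter` keyed by `(source_token, target_token)` instead of as a dense array.

```python
from collections import Counter
from itertools import product


def count_cooccurrences(aligned_units):
    """Count token frequencies per language and cross-language cooccurrences.

    aligned_units: iterable of (source_tokens, target_tokens) pairs,
    one pair per alignment unit.
    """
    freq_src, freq_tgt, cooc = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in aligned_units:
        src = [t.lower() for t in src_tokens]
        tgt = [t.lower() for t in tgt_tokens]
        freq_src.update(src)
        freq_tgt.update(tgt)
        # Using sets counts each pair at most once per alignment unit.
        cooc.update(product(set(src), set(tgt)))
    return freq_src, freq_tgt, cooc


# Toy alignment units for illustration only.
toy_units = [
    (["Das", "Ergebnis"], ["The", "result"]),
    (["Das", "Haus"], ["The", "house"]),
]
freq_de, freq_en, cooc = count_cooccurrences(toy_units)
print(cooc[("das", "the")])  # 2
```

A sparse counter keeps memory proportional to the number of pairs actually observed, which matters because most word pairs never share an alignment unit.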

The total number of tokens per language:

The ten most frequent tokens per language (lowercased):
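Both the totals and the top-ten lists fall out of a `collections.Counter` directly. The sketch below uses made-up token lists; only `Counter.most_common` and `sum` are doing the work.

```python
from collections import Counter


def token_frequencies(tokens):
    """Count tokens after lowercasing them."""
    return Counter(token.lower() for token in tokens)


# Toy token lists for illustration only.
toy_tokens = {
    "de": ["Das", "Ergebnis", "das", "Haus", "das"],
    "en": ["The", "result", "the", "house"],
}
for lang, tokens in toy_tokens.items():
    freq = token_frequencies(tokens)
    total = sum(freq.values())      # total number of tokens
    top = freq.most_common(10)      # at most ten (token, count) pairs
    print(lang, total, top)
```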

Rank the cooccurrences by their mutual information (MI) score:
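The text does not spell out which MI variant is used. The sketch below assumes the common pointwise form, MI = log2(O / E), where O is the number of alignment units containing both tokens and E = f(x) * f(y) / N is the count expected if the two tokens were independent. All names and the toy counts are illustrative.

```python
import math


def mi_score(observed, freq_x, freq_y, n_units):
    """Pointwise mutual information: log2 of observed over expected count."""
    expected = freq_x * freq_y / n_units
    return math.log2(observed / expected)


def rank_by_mi(cooc, unit_freq_src, unit_freq_tgt, n_units):
    """Return (score, source_token, target_token) tuples, best first."""
    scored = [
        (mi_score(o, unit_freq_src[x], unit_freq_tgt[y], n_units), x, y)
        for (x, y), o in cooc.items()
    ]
    return sorted(scored, reverse=True)


# Toy counts for illustration: number of units containing each token or pair.
toy_cooc = {("ergebnis", "result"): 2, ("das", "the"): 4, ("ergebnis", "the"): 2}
toy_src = {"ergebnis": 2, "das": 4}
toy_tgt = {"result": 2, "the": 4}
ranking = rank_by_mi(toy_cooc, toy_src, toy_tgt, n_units=4)
print(ranking[0])  # (1.0, 'ergebnis', 'result')
```

In the toy data, «das» and «the» occur in every unit, so their cooccurrence carries no information and scores 0, while «ergebnis» and «result» appear together twice as often as independence would predict.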

Pairs with the highest mutual information:

Inspect the highest MI scores for «ergebnis»:
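A lookup of this kind can be sketched as a filter over the scored pairs. The sketch assumes the scores live in a dictionary keyed by `(source_token, target_token)`; the function name and the numbers are made up.

```python
def top_partners(scores, token, k=10):
    """Return the k best-scoring (score, target_token) pairs for one source token."""
    hits = [(mi, tgt) for (src, tgt), mi in scores.items() if src == token]
    return sorted(hits, reverse=True)[:k]


# Made-up scores for illustration only.
toy_scores = {
    ("ergebnis", "result"): 5.2,
    ("ergebnis", "outcome"): 4.7,
    ("ergebnis", "the"): 0.3,
    ("haus", "house"): 6.1,
}
print(top_partners(toy_scores, "ergebnis", k=2))
# [(5.2, 'result'), (4.7, 'outcome')]
```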

Inspect the highest MI scores for words starting with «viagiant»:
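Prefix lookups work the same way as exact lookups, with `str.startswith` in place of equality. The sketch below again assumes scores keyed by `(source_token, target_token)` and uses a toy prefix and made-up numbers, not values from the corpus.

```python
def top_by_prefix(scores, prefix, k=10):
    """Return the k best (score, source_token, target_token) tuples
    whose source token starts with the given prefix."""
    hits = [
        (mi, src, tgt)
        for (src, tgt), mi in scores.items()
        if src.startswith(prefix)
    ]
    return sorted(hits, reverse=True)[:k]


# Made-up scores for illustration only.
toy_scores = {
    ("ergebnis", "result"): 5.2,
    ("ergebnisse", "results"): 5.9,
    ("haus", "house"): 6.1,
}
print(top_by_prefix(toy_scores, "ergeb"))
# [(5.9, 'ergebnisse', 'results'), (5.2, 'ergebnis', 'result')]
```

Matching on a prefix groups inflected forms of one stem, which is a cheap substitute for lemmatization when none is available.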