Johannes Graën
2018-03-19
The term “alignment” denotes different concepts:
It is applicable to parallel corpora (two languages)
or multiparallel corpora (more than two languages).
Correspondence of one or more structural elements (e.g., documents, articles, paragraphs, sentences, phrases, tokens) in at least two languages:
Alignment units are sets of the elements in question:
{Katze, cat}
A list of alignment units of a given higher-level element:
Alignment sets are sets of alignment units
(i.e., sets of sets of the elements in question):
{ {Wir, Nous}, {kaufen, acheter}, {Katze, chat}, {im, en}, {nicht, ne, pas}, {Sack, poche}, {möchten, voulons} }
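The set-of-sets structure above can be sketched in Python as follows. This is only an illustration: it assumes tokens are plain strings (a real corpus would more likely use token identifiers), and `units_containing` is a hypothetical helper, not part of any existing tool.

```python
# An alignment unit is an immutable set of tokens;
# an alignment set is a set of such units.
# Assumption: tokens are plain strings, which only works while
# no surface form occurs twice in the aligned sentences.
alignment_set = {
    frozenset({"Wir", "Nous"}),
    frozenset({"kaufen", "acheter"}),
    frozenset({"Katze", "chat"}),
    frozenset({"im", "en"}),
    frozenset({"nicht", "ne", "pas"}),
    frozenset({"Sack", "poche"}),
    frozenset({"möchten", "voulons"}),
}


def units_containing(token, units):
    """Return every alignment unit that contains the given token."""
    return [unit for unit in units if token in unit]


# The French two-part negation shares one unit with German "nicht".
negation_units = units_containing("ne", alignment_set)
```

Using `frozenset` for the units is what allows them to be members of another set, which mirrors the "sets of sets" definition directly.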
A list of alignment units that constitute a hierarchy:
Any two hierarchical alignment units must be either nested (one is a proper subset of the other) or disjoint; partial overlap is not allowed: $A_1 \subset A_2 \vee A_1 \supset A_2 \vee A_1 \cap A_2 = \emptyset$
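The constraint can be checked pairwise. The sketch below assumes alignment units are given as Python sets of token labels; the function name `is_hierarchy` and the language-prefixed labels such as `de:Katze` are made up for this example.

```python
from itertools import combinations


def is_hierarchy(units):
    """Return True if every pair of distinct units is nested or disjoint."""
    distinct = {frozenset(unit) for unit in units}
    for a, b in combinations(distinct, 2):
        # a < b and a > b test for proper subset and proper superset.
        if not (a < b or a > b or a.isdisjoint(b)):
            return False
    return True


# Hypothetical units: a bilingual unit nested inside a trilingual one.
nested = [
    {"de:Katze", "en:cat"},
    {"de:Katze", "en:cat", "fr:chat"},
]

# Hypothetical units that share "en:cat" without one containing the other.
overlapping = [
    {"de:Katze", "en:cat"},
    {"en:cat", "fr:chat"},
]

nested_ok = is_hierarchy(nested)            # True
overlapping_ok = is_hierarchy(overlapping)  # False
```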
Because of variation!
Alignment of more than two languages (multilingual alignment) requires hierarchical alignment sets.
Same approach for sentence and word alignment, but with different features and different linkage criteria
Quality of multilingual sentence alignment similar to bilingual sentence alignment, but with two advantages:
Probability that a lemma $\lambda_s$ is aligned with lemma $\lambda_t$:
$p_a(\lambda_t|\lambda_s) = \frac{f_a(\lambda_s,\lambda_t)}{\sum_{\lambda_{t'}} f_a(\lambda_s,\lambda_{t'})}$
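As a rough illustration of this relative-frequency estimate, the following sketch computes $p_a(\lambda_t|\lambda_s)$ from a table of alignment counts $f_a$. The counts are invented for the example and the function name `alignment_probabilities` is an assumption, not taken from the original work.

```python
from collections import Counter


def alignment_probabilities(pair_counts, source_lemma):
    """Estimate p_a(target | source) as a relative frequency of f_a counts."""
    counts = {
        target: freq
        for (source, target), freq in pair_counts.items()
        if source == source_lemma
    }
    total = sum(counts.values())
    if total == 0:
        return {}  # the source lemma was never aligned
    return {target: freq / total for target, freq in counts.items()}


# Hypothetical alignment counts f_a(source lemma, target lemma).
pair_counts = Counter({
    ("hören", "hear"): 6,
    ("hören", "listen"): 3,
    ("hören", "sound"): 1,
    ("gehören", "belong"): 5,
})

probs = alignment_probabilities(pair_counts, "hören")
# probs["hear"] is 6 / 10 with these made-up counts.
```

By construction, the probabilities for one source lemma sum to 1 over all target lemmas it was aligned with.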
Lemma with the highest alignment probability over
all languages replaces the ambiguous lemma:
$\lambda_s^{final} = \mathit{argmax}_{\lambda_s^i} \sum\limits_n p_a(\lambda_s^i|\lambda_t^n)$
Language | Lemma | gehören | hören |
Dutch | horen | 0.2339 | 0.2909 |
English | hear | 0.1515 | 0.3782 |
Finnish | kuulla | 0.1143 | 0.3192 |
French | entendre | 0.0694 | 0.1482 |
Spanish | entender | 0.0043 | 0.0213 |
Swedish | höra | 0.3314 | 0.2779 |
$\sum p_a$ | | 0.9047 | 1.4357 |
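The selection step can be replayed with the values from the table. The sketch below sums the alignment probabilities per candidate lemma and picks the maximum; the function name `disambiguate` is an assumption. Note that summing the rounded table entries for gehören gives 0.9048 rather than the reported 0.9047, presumably because the reported sum was computed from unrounded values.

```python
# Alignment probabilities of each candidate lemma given the aligned
# lemma in the other languages, copied from the table.
p_a = {
    "gehören": {
        "horen": 0.2339, "hear": 0.1515, "kuulla": 0.1143,
        "entendre": 0.0694, "entender": 0.0043, "höra": 0.3314,
    },
    "hören": {
        "horen": 0.2909, "hear": 0.3782, "kuulla": 0.3192,
        "entendre": 0.1482, "entender": 0.0213, "höra": 0.2779,
    },
}


def disambiguate(candidates):
    """Return the candidate with the highest summed probability, plus all sums."""
    scores = {lemma: sum(probs.values()) for lemma, probs in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores


best, scores = disambiguate(p_a)
# best is "hören", the candidate with the larger sum.
```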
High values for the aligned nouns and low values
for the aligned verbs indicate idiomaticity.
1 | Gestalt annehmen | take shape | 39 |
2 | Präzedenzfall darstellen | set precedent | 10 |
3 | Armut bekämpfen | reduce poverty | 4 |
4 | Präzedenzfall schaffen | set precedent | 78 |
5 | Vorrang haben | take precedence | 47 |
6 | Illusion machen | have illusion | 10 |
7 | Hausaufgabe machen | do homework | 74 |
8 | Beileid aussprechen | send condolence | 2 |
9 | Vergleich ermöglichen | make comparison | 2 |
10 | Vergleich ziehen | make comparison | 2 |
11 | Vorschub leisten | give rise | 4 |
12 | Berücksichtigung finden | take account | 32 |
13 | Rekord halten | have record | 3 |
14 | Privileg genießen | have privilege | 4 |
15 | Beileid übermitteln | express condolence | 3 |
Our web application allows the user:
Alignment benefits
Multilingual hierarchical alignment
Нашата работа, разбира се, не е приключила.
Doch unsere Arbeit ist selbstverständlich noch nicht beendet.
But our work of course is not finished.
Pero nuestra labor todavía no ha acabado.
Loomulikult ei ole meie töö sellega veel lõppenud.
Työmme ei ole luonnollisesti valmis.
Cependant, notre travail n’est bien entendu pas terminé.
Il nostro lavoro non è però concluso.
Maar ons werk zit er uiteraard nog niet op.
Nasza praca oczywiście nie dobiegła końca.
Todavia, o nosso trabalho ainda não terminou.
Dar munca noastră nu s-a terminat, desigur.
Naša práca sa však, samozrejme, neskončila.
Seveda pa naše delo ni končano.
Men allt arbete är naturligtvis inte avslutat.