~~NOCACHE~~
~~REVEAL white~~

{{background>:public:bg19.jpg}}
====== Joint Approaches for Sentence Alignment on Multiparallel Texts ======
Johannes Graën\\
2016-11-29

==== Overview ====
  * Sentence Alignment on Multiparallel Texts
  * Approaches
    * Agreement of Bilingual Alignments
    * Sampling in Discrete Vector Space
    * Agglomerative Hierarchical Clustering
  * Outlook
    * Evaluation Metrics
    * Hierarchical Clustering for Word Alignment


===== Sentence Alignment on Multiparallel Texts =====
  * Like sentence alignment of two languages.
  * There may be alignments that only exist for a subset of languages.
    * This leads to hierarchical alignments.
  * Alignments can be partially correct.
    * Evaluation requires more than counting right and wrong alignments.

==== Example ====
| de | Es geht nicht um die Großzügigkeit des Präsidenten, es geht um die Zeit, die Sie sich selbst genehmigen; **[1+3]** ich habe Ihnen angezeigt, wann die Minute abgelaufen war. **[2+3]** |
| en | Mr Izquierdo Collado, it is not a question of the President’s generosity. **[3]** It is a question of the time you allow yourself, because I informed you when your minute was up. **[3]** |
| fr | Monsieur Izquierdo, il ne s’agit pas de la générosité du président, il s’agit du temps que vous vous attribuez. **[1+3]** Je vous ai fait signe quand vous avez atteint la minute. **[2+3]** |


===== Agreement of Bilingual Alignments =====
  * Perform sentence alignment on all language pairs.
  * Derive correct alignments by "voting".
    * Hypothesis: The majority of pairwise alignments is correct in most cases, i.e. any alignment error is due to the properties of its particular language pair.

==== Approach ====
  - Perform pairwise alignments with //hunalign//.
  - Join all these alignments in a graph.
  - Calculate "connectedness" by counting supporting languages for each edge.
    * How many languages align with both sentences of a particular language pair?
  - Continue deleting the least supported edge until small consistent clusters emerge.
  - An alignment hierarchy can be obtained by reversing the deletion process.
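
A minimal sketch of the support counting and edge deletion described above. Sentences are modelled as ''(language, index)'' tuples and edges as ordered pairs of sentences. The function names and the stopping test ''one_per_language'' are assumptions made for illustration; the slides do not specify when clusters count as consistent, and this particular test would also dissolve genuine 1:n alignments.

```python
from collections import defaultdict


def edge_support(edges):
    """For each edge, count the languages that align with both endpoints."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    support = {}
    for a, b in edges:
        common = neighbours[a] & neighbours[b]
        support[(a, b)] = len({lang for lang, _ in common})
    return support


def one_per_language(edges):
    """Assumed stopping test: no sentence links to two sentences of one language."""
    seen = set()
    for a, b in edges:
        for node, other in ((a, b), (b, a)):
            key = (node, other[0])
            if key in seen:
                return False
            seen.add(key)
    return True


def prune(edges, is_consistent=one_per_language):
    """Delete the least supported edge until the graph passes the test."""
    edges = set(edges)
    deleted = []
    while edges and not is_consistent(edges):
        support = edge_support(edges)
        # sorted() makes tie-breaking between equally supported edges deterministic
        weakest = min(sorted(edges), key=support.get)
        edges.remove(weakest)
        deleted.append(weakest)
    return edges, deleted


edges = {
    (("de", 1), ("en", 1)),
    (("de", 1), ("fr", 1)),
    (("en", 1), ("fr", 1)),
    (("de", 1), ("en", 2)),  # spurious link, not supported by fr
}
remaining, deleted = prune(edges)
print(deleted)
```

Replaying ''deleted'' in reverse order re-adds the edges from most to least controversial, which is one way to read the alignment hierarchy mentioned in the last step.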

==== Two Languages ====
{{https://pub.cl.uzh.ch/users/graen/img/sentalign1/align2.png?direct}}

==== Three Languages ====
{{https://pub.cl.uzh.ch/users/graen/img/sentalign1/align3.png?direct}}

==== Four Languages ====
{{https://pub.cl.uzh.ch/users/graen/img/sentalign1/align4.png?direct}}

==== Five Languages ====
{{https://pub.cl.uzh.ch/users/graen/img/sentalign1/align5.png?direct}}

==== Hierarchical Alignments ====
{{https://pub.cl.uzh.ch/users/graen/img/sentalign1/align5c.png?direct}}

==== Problems/Limitations ====
  * Short sentences in **1:n** pairs do not align in any language.
    * ... and we are merely removing (wrong) alignments.
  * Decisions not straightforward if more than one pair disagrees.
    * //hunalign//'s alignment score does not help in making the right decision.
  * Dictionaries for hunalign only allow binary entries.


===== Sampling in Discrete Vector Space =====
  * Discrete vector space spanned by languages as dimensions.
  * Alignments are pairs of location and direction vectors with all positive components.
    * E.g. $\left((1,2)^T,(2,1)^T\right)$ defines the alignment of the second and third sentence of the first language with the third sentence of the second language.
  * The location vector of the **nth** alignment equals the sum of the direction vectors **1..n-1**.
  * Alignment is obtained by simulated annealing.
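
The relation between location and direction vectors can be illustrated with a short sketch. It assumes that vectors are plain tuples with one component per language and that sentences are numbered from 1; the helper names ''locations'' and ''covered'' are hypothetical.

```python
from itertools import accumulate


def add(u, v):
    """Component-wise sum of two vectors."""
    return tuple(a + b for a, b in zip(u, v))


def locations(directions):
    """Location of the nth alignment = sum of direction vectors 1..n-1."""
    zero = (0,) * len(directions[0])
    return list(accumulate(directions[:-1], add, initial=zero))


def covered(location, direction):
    """Sentence numbers (1-based) that one alignment covers in each language."""
    return [list(range(l + 1, l + d + 1)) for l, d in zip(location, direction)]


directions = [(1, 2), (2, 1)]
locs = locations(directions)
second = covered(locs[1], directions[1])
print(locs, second)
```

With these two direction vectors, the second alignment is located at ''(1, 2)'' and covers sentences 2 and 3 of the first language and sentence 3 of the second, which matches the example above.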

==== Visual Representation ====
{{https://pub.cl.uzh.ch/users/graen/img/sentalign2/3d_example.png?500}}

==== Approach ====
  - Set initial alignment to a sequence of vectors approximating a diagonal line.
  - Calculate local (pairwise) and global alignment scores (and keep results in memory).
  - Find and evaluate all applicable sampling operations.
  - Sample by selecting one of those operations -- according to their respective evaluation scores.
  - Lower //temperature//, i.e. probability of picking an operation that leads to a worse sample.
  - Repeat from (3) until //temperature// reaches zero.
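
The loop above can be sketched generically. This is a simplified illustration, not the implementation behind these slides: the linear cooling schedule, the uniform choice while exploring, and the callbacks ''neighbours'' and ''score'' are all assumptions. In particular, the slides select operations according to their evaluation scores, whereas this sketch only distinguishes "any candidate" from "best candidate".

```python
import random


def anneal(state, neighbours, score, steps=100, seed=0):
    """Simulated annealing with a linearly decreasing temperature.

    neighbours(state) returns the states reachable by one sampling operation,
    score(state) returns a number where higher is better.
    """
    rng = random.Random(seed)
    for step in range(steps):
        # Temperature: probability of accepting a possibly worse sample.
        temperature = 1.0 - (step + 1) / steps
        candidates = neighbours(state)
        if not candidates:
            break
        if rng.random() < temperature:
            state = rng.choice(candidates)      # explore, may get worse
        else:
            best = max(candidates, key=score)   # exploit, only if better
            if score(best) > score(state):
                state = best
    return state


# Toy problem: walk upwards from 0 towards the optimum at 5.
result = anneal(0, lambda x: [x + 1] if x < 5 else [], lambda x: -abs(x - 5))
print(result)
```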

==== Sampling Operations ====
  - Change two consecutive vectors $\vec{x}$ and $\vec{y}$ such that $\vec{x}^\prime + \vec{y}^\prime = \vec{x} + \vec{y}$.
  - Replace two consecutive vectors $\vec{x}$ and $\vec{y}$ by vector $\vec{z}$ such that $\vec{z} = \vec{x} + \vec{y}$.
  - Split a vector $\vec{z}$ into vectors $\vec{x}$ and $\vec{y}$ such that $\vec{x} + \vec{y} = \vec{z}$.
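
A sketch of the three operations on tuples, assuming that every component of a direction vector must stay at least ''minimum'' (set to 1 here, following the "all positive components" requirement). The function names are hypothetical.

```python
from itertools import product


def merge(x, y):
    """Operation 2: replace x and y by z = x + y."""
    return tuple(a + b for a, b in zip(x, y))


def splits(z, minimum=1):
    """Operation 3: all pairs (x, y) with x + y = z and components >= minimum."""
    ranges = (range(minimum, c - minimum + 1) for c in z)
    return [(x, tuple(c - a for c, a in zip(z, x))) for x in product(*ranges)]


def shifts(x, y, minimum=1):
    """Operation 1: all other pairs (x', y') with x' + y' = x + y."""
    return [pair for pair in splits(merge(x, y), minimum) if pair != (x, y)]


print(merge((1, 1), (2, 1)))
print(splits((3, 2)))
print(shifts((1, 1), (2, 1)))
```

The number of candidate splits grows with the product of the component sizes, which gives an impression of why enumerating all applicable operations becomes expensive as languages are added.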

==== Problems/Limitations ====
  * The sampling often does not converge.
  * For higher dimensions (more languages), sampling did not include the correct (gold) alignment.
  * Finding and evaluating all applicable operations is expensive, even if scores are only calculated on language pairs.


===== Agglomerative Hierarchical Clustering =====
  * Collect evidence from different sources/heuristics in a graph.
  * Perform agglomerative clustering such that
    - a cluster cannot take more than one sentence of each language.
    - crossing clusters are prohibited.
  * Join incomplete clusters (not every language covered) with complete clusters.

==== Approach ====
  - Calculate scores for each a) language pair, b) source and c) sentence pair.
  - Map the (normal) distribution of each source's scores to one with $\mu = 1$ and $\sigma = 1$ and
  - multiply the values with a source-specific weight between 0 and 1.
  - Sum up these source-specific values to set the link weight between each two sentences.
  - Calculate the supported link weight based on the weight of a particular link and the link weights of all "triangles" with other languages.
  - Use those weights for clustering.
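
The rescaling, weighting, and summing steps could look roughly like the sketch below. It assumes that scores are rescaled per source over all sentence pairs and that a source with constant scores maps to the target mean; the source names and weights in the example are made up. The triangle-based support step is left out.

```python
from collections import defaultdict
from statistics import mean, pstdev


def rescale(scores, mu=1.0, sigma=1.0):
    """Shift and scale scores so that their mean is mu and their std is sigma."""
    values = list(scores.values())
    m, s = mean(values), pstdev(values)
    if s == 0:
        return {pair: mu for pair in scores}
    return {pair: mu + sigma * (v - m) / s for pair, v in scores.items()}


def link_weights(sources, weights):
    """Sum the weighted, rescaled scores of all sources for each sentence pair."""
    total = defaultdict(float)
    for name, scores in sources.items():
        for pair, value in rescale(scores).items():
            total[pair] += weights[name] * value
    return dict(total)


p = (("de", 1), ("en", 1))
q = (("de", 1), ("en", 2))
sources = {
    "length": {p: 1, q: 3},      # hypothetical raw scores
    "numbers": {p: 10, q: 10},
}
weights = {"length": 0.5, "numbers": 1.0}  # source-specific weights in [0, 1]
result = link_weights(sources, weights)
print(result)
```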

==== Approach (Clustering) ====
  - Perform first agglomerative clustering such that 
    - a cluster cannot take more than one sentence of each language.
    - crossing clusters are prohibited.
  - Let all remaining sentences be the only member of their own cluster.
  - Perform secondary agglomerative clustering for incomplete clusters such that
    - crossing clusters are prohibited.
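
A sketch of the first clustering pass under both constraints. It greedily merges along links in order of decreasing weight, which is a simplification chosen for brevity; the actual linkage criterion is not specified on these slides. Clusters are dictionaries that map a language to the index of its single sentence.

```python
def crosses(a, b):
    """Two clusters cross if a precedes b in one language and follows it in another."""
    signs = {(a[l] > b[l]) - (a[l] < b[l]) for l in a.keys() & b.keys()}
    return 1 in signs and -1 in signs


def cluster(links):
    """links maps pairs of (language, index) sentences to link weights."""
    clusters = {}  # sentence -> the cluster (dict) it currently belongs to
    for pair in links:
        for lang, idx in pair:
            clusters.setdefault((lang, idx), {lang: idx})
    for (a, b), _ in sorted(links.items(), key=lambda kv: -kv[1]):
        ca, cb = clusters[a], clusters[b]
        # Skip if already joined or if a language would occur twice.
        if ca is cb or ca.keys() & cb.keys():
            continue
        merged = {**ca, **cb}
        others = {id(c): c for c in clusters.values() if c is not ca and c is not cb}
        if any(crosses(merged, c) for c in others.values()):
            continue
        for lang, idx in merged.items():
            clusters[(lang, idx)] = merged
    return list({id(c): c for c in clusters.values()}.values())


links = {
    (("de", 1), ("en", 1)): 3.0,
    (("en", 1), ("fr", 1)): 2.5,
    (("de", 2), ("en", 2)): 2.0,
    (("en", 2), ("fr", 2)): 1.5,
    (("de", 1), ("fr", 2)): 1.0,  # rejected: would put two sentences per language in one cluster
}
result = sorted(cluster(links), key=lambda c: c["de"])
print(result)
```

The second pass would repeat the merging for incomplete clusters with the one-sentence-per-language check removed, so that only the crossing constraint remains.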

==== Sources for Pairwise Scores ====
  * Corresponding discourse markers
  * Phrase table matches
  * Identical punctuation
  * Relative cumulative length
  * Identical numbers
  * Identical acronyms

==== Demo ====
  * [[https://pub.cl.uzh.ch/projects/sparcling/msalign/?dir=dir8&stretch=3&render=0&firstt=0.0&firstl=0.0&last=1.0&anyt=1.0&anyl=1.0&len=1.0&sec_len=1.0&num=1.0&acr=1.0&langs=de,en,es,fr,it|Graph with weights (5 languages)]]
  * [[https://pub.cl.uzh.ch/projects/sparcling/msalign/?dir=dir8&stretch=3&render=1&firstt=0.0&firstl=0.0&last=1.0&anyt=1.0&anyl=1.0&len=1.0&sec_len=1.0&num=1.0&acr=1.0&langs=de,en,es,fr,it|Cluster (5 languages)]]
  * [[https://pub.cl.uzh.ch/projects/sparcling/msalign/?dir=dir8&stretch=3&render=0&firstt=0.0&firstl=0.0&last=1.0&anyt=1.0&anyl=1.0&len=1.0&sec_len=1.0&num=1.0&acr=1.0&langs=de,en,es,fr,it,fi,pl,sv,pt,ro,nl,bg,sl,sk,et,el|Graph with weights (16 languages)]]
  * [[https://pub.cl.uzh.ch/projects/sparcling/msalign/?dir=dir8&stretch=3&render=1&firstt=0.0&firstl=0.0&last=1.0&anyt=1.0&anyl=1.0&len=1.0&sec_len=1.0&num=1.0&acr=1.0&langs=de,en,es,fr,it,fi,pl,sv,pt,ro,nl,bg,sl,sk,et,el|Cluster (16 languages)]]


===== Evaluation Metrics =====
  * Gold standard for multiparallel hierarchical sentence alignment ([[https://pub.cl.uzh.ch/projects/sparcling/align_annotation_tool/sents/#32|Tool]])
  * Evaluation of bilingual sentence aligners:
    * Minimal bilingual alignments can be extracted from gold standard for each language pair
  * Evaluation of multiparallel clustering approach:
    * Minimal bilingual alignments of both data sets
    * Average and standard deviation of those bilingual scores per language
    * Distribution of errors by average sentence count


===== Hierarchical Clustering for Word Alignment =====
  * Almost the same approach
  * Preliminary results promising
    * [[https://pub.cl.uzh.ch/projects/sparcling/alignment_linkage/?id=2|Weights]]
    * [[https://pub.cl.uzh.ch/projects/sparcling/alignment_linkage/get.php?id=2|Clustering with heuristics]]
    * [[https://pub.cl.uzh.ch/projects/sparcling/alignment_linkage/cluster.php?aid=2&alw=9.7.6.2|Hierarchical clustering]]
  * Evaluation against gold standard ([[https://pub.cl.uzh.ch/projects/sparcling/align_annotation_tool/words/#13|Tool]])


{{background>:public:bg23.jpg}}
===== EOP =====