Slide 1 :public:bg19.jpg

Joint Approaches for Sentence Alignment on Multiparallel Texts

Johannes Graën
2016-11-29

↓ Slide 2

Overview

  • Sentence Alignment on Multiparallel Texts
  • Approaches
    • Agreement of Bilingual Alignments
    • Sampling in Discrete Vector Space
    • Agglomerative Hierarchical Clustering
  • Outlook
    • Evaluation Metrics
    • Hierarchical Clustering for Word Alignment
→ Slide 3

Sentence Alignment on Multiparallel Texts

  • Like sentence alignment between two languages.
  • There may be alignments that only exist for a subset of the languages.
    • This leads to hierarchical alignments.
  • Alignments can be partially correct.
    • Evaluation requires more than counting right and wrong alignments.
↓ Slide 4

Example

de Es geht nicht um die Großzügigkeit des Präsidenten, es geht um die Zeit, die Sie sich selbst genehmigen; [1+3] ich habe Ihnen angezeigt, wann die Minute abgelaufen war. [2+3]
en Mr Izquierdo Collado, it is not a question of the President’s generosity. [3] It is a question of the time you allow yourself, because I informed you when your minute was up. [3]
fr Monsieur Izquierdo, il ne s’agit pas de la générosité du président, il s’agit du temps que vous vous attribuez. [1+3] Je vous ai fait signe quand vous avez atteint la minute. [2+3]
→ Slide 5

Agreement of Bilingual Alignments

  • Perform sentence alignment on all language pairs.
  • Derive correct alignments by “voting”.
    • Hypothesis: The majority of pairwise alignments is correct in most cases, i.e. any alignment error is due to the properties of its particular language pair.
↓ Slide 6

Approach

  1. Perform pairwise alignments with hunalign.
  2. Join all these alignments in a graph.
  3. Calculate “connectedness” by counting supporting languages for each edge.
    • How many languages align with both sentences of a particular language pair?
  4. Iteratively delete the least-supported edge until small, consistent clusters emerge.
  5. An alignment hierarchy can be obtained by reversing the deletion process (see the sketch below).
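
A minimal sketch of steps 2-5, in Python with hypothetical data structures (edges are pairs of (language, sentence) nodes; the consistency check is a placeholder), not the original implementation:

<code python>
# Hypothetical sketch of steps 2-5: an edge's support is the number of
# third languages that align with both of its endpoints.
def neighbours(node, edges):
    return {b for (a, b) in edges if a == node} | {a for (a, b) in edges if b == node}

def support(edge, edges, languages):
    (la, sa), (lb, sb) = edge
    third = 0
    for lc in languages - {la, lb}:
        linked_a = {s for (l, s) in neighbours((la, sa), edges) if l == lc}
        linked_b = {s for (l, s) in neighbours((lb, sb), edges) if l == lc}
        if linked_a & linked_b:          # some sentence of lc links to both endpoints
            third += 1
    return third

def prune(edges, languages, is_consistent):
    """Delete the least-supported edge until the remaining clusters are
    consistent; re-adding the edges in reverse order yields the hierarchy."""
    deleted = []
    while not is_consistent(edges):
        worst = min(edges, key=lambda e: support(e, edges, languages))
        edges = edges - {worst}
        deleted.append(worst)
    return edges, list(reversed(deleted))
</code>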
↓ Slide 7

Two Languages

↓ Slide 8

Three Languages

↓ Slide 9

Four Languages

↓ Slide 10

Five Languages

↓ Slide 11

Hierarchical Alignments

↓ Slide 12

Problems/Limitations

  • Short sentences in 1:n pairs do not get aligned in any language pair.
    • … and we are merely removing (wrong) alignments, so such missing alignments cannot be recovered.
  • Decisions are not straightforward if more than one pair disagrees.
    • hunalign's alignment score does not help in making the right decision.
  • Dictionaries for hunalign only allow binary entries.
→ Slide 13

Sampling in Discrete Vector Space

  • Discrete vector space spanned by languages as dimensions.
  • Alignments are pairs of location and direction vectors with all positive components.
    • E.g. $\left((1,2)^T,(2,1)^T\right)$ defines the alignment of the second and third sentence of the first language with the third sentence of the second language.
  • The location vector of the $n$-th alignment equals the sum of the preceding direction vectors: $\vec{\ell}_n = \sum_{i=1}^{n-1} \vec{d}_i$ (illustrated below).
  • Alignment is obtained by simulated annealing.
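
To illustrate the representation above, a small sketch with a plain list of numpy direction vectors (variable names are made up); it reproduces the example given above:

<code python>
# Illustrative only: two direction vectors over two languages.
import numpy as np

directions = [np.array([1, 2]),   # sentence 1 of L1 aligned with sentences 1-2 of L2
              np.array([2, 1])]   # sentences 2-3 of L1 aligned with sentence 3 of L2

def locations(directions):
    """The location of the n-th alignment is the sum of direction vectors 1..n-1."""
    loc = np.zeros_like(directions[0])
    for d in directions:
        yield loc.copy()
        loc = loc + d

for loc, d in zip(locations(directions), directions):
    print(loc, d)   # second line: [1 2] [2 1], the example from the slide above
</code>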
↓ Slide 14

Visual Representation

↓ Slide 15

Approach

  1. Set the initial alignment to a sequence of vectors approximating a diagonal line.
  2. Calculate local (pairwise) and global alignment scores (and keep results in memory).
  3. Find and evaluate all applicable sampling operations.
  4. Sample by selecting one of those operations according to their respective evaluation scores.
  5. Lower the temperature, i.e. the probability of picking an operation that leads to a worse sample.
  6. Repeat from (3) until the temperature reaches zero (a skeleton of this loop is sketched below).
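
A hedged skeleton of this loop; `score`, `candidate_ops` and the geometric cooling schedule are placeholders, not the original implementation:

<code python>
# Sketch of steps 3-6 under assumed interfaces.
import math
import random

def anneal(initial, score, candidate_ops, t0=1.0, cooling=0.99, t_min=1e-3):
    current, t = initial, t0
    while t > t_min:                                     # (6) stop near temperature zero
        ops = candidate_ops(current)                     # (3) all applicable operations
        scored = [(op, score(op(current))) for op in ops]
        best = max(s for _, s in scored)
        # (4)+(5) select proportionally to a temperature-scaled score; as t drops,
        # operations that lead to worse samples become increasingly unlikely.
        weights = [math.exp((s - best) / t) for _, s in scored]
        op, _ = random.choices(scored, weights=weights)[0]
        current = op(current)
        t *= cooling                                     # (5) lower the temperature
    return current
</code>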
↓ Slide 16

Sampling Operations

  1. Change two consecutive vectors $\vec{x}$ and $\vec{y}$ to $\vec{x}^\prime$ and $\vec{y}^\prime$ such that $\vec{x}^\prime + \vec{y}^\prime = \vec{x} + \vec{y}$.
  2. Replace two consecutive vectors $\vec{x}$ and $\vec{y}$ by vector $\vec{z}$ such that $\vec{z} = \vec{x} + \vec{y}$.
  3. Split a vector $\vec{z}$ into vectors $\vec{x}$ and $\vec{y}$ such that $\vec{x} + \vec{y} = \vec{z}$ (all three operations are sketched below).
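
Possible implementations of the three operations on a list of numpy direction vectors; function names and indices are illustrative assumptions:

<code python>
# Illustrative only; `vectors` is a list of numpy arrays, `i` an index.
import numpy as np

def shift(vectors, i, delta):
    """(1) Change vectors i and i+1 while preserving their sum."""
    out = list(vectors)
    out[i], out[i + 1] = out[i] + delta, out[i + 1] - delta
    return out

def merge(vectors, i):
    """(2) Replace vectors i and i+1 by their sum."""
    return list(vectors[:i]) + [vectors[i] + vectors[i + 1]] + list(vectors[i + 2:])

def split(vectors, i, part):
    """(3) Split vector i into `part` and the remainder."""
    return list(vectors[:i]) + [part, vectors[i] - part] + list(vectors[i + 1:])

vs = [np.array([1, 1]), np.array([2, 1])]
print(merge(vs, 0))   # [array([3, 2])]
</code>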
↓ Slide 17

Problems/Limitations

  • The sampling often does not converge.
  • For higher dimensions (more languages), sampling did not include the correct (gold) alignment.
  • Finding and evaluating all applicable operations is expensive, even if scores are only calculated on language pairs.
→ Slide 18

Agglomerative Hierarchical Clustering

  • Collect evidence from different sources/heuristics in a graph.
  • Perform agglomerative clustering such that
    1. a cluster cannot take more than one sentence of each language.
    2. crossing clusters are prohibited.
  • Join incomplete clusters (not every language covered) with complete clusters.
↓ Slide 19

Approach

  1. Calculate scores for each a) language pair, b) source and c) sentence pair.
  2. Map the (normal) distribution of each source's scores to one with $\mu = 1$ and $\sigma = 1$ and
  3. multiply the values by a source-specific weight between 0 and 1.
  4. Sum up these source-specific values to obtain the link weight between each pair of sentences (steps 2-4 are sketched below).
  5. Calculate the supported link weight based on the weight of a particular link and the link weights of all “triangles” with other languages.
  6. Use those weights for clustering.
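
A minimal sketch of steps 2-4 under assumed data structures (`scores[source][(s1, s2)]` holds a raw score, `weights[source]` a value between 0 and 1); the supported link weight of step 5 is omitted:

<code python>
# Hypothetical sketch; not the original implementation.
import statistics
from collections import defaultdict

def link_weights(scores, weights):
    link = defaultdict(float)
    for source, pairs in scores.items():
        values = list(pairs.values())
        mu, sigma = statistics.mean(values), statistics.stdev(values)
        for pair, value in pairs.items():
            normalised = (value - mu) / sigma + 1        # (2) map to mu = 1, sigma = 1
            link[pair] += weights[source] * normalised   # (3)+(4) weight per source and sum
    return link
</code>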
↓ Slide 20

Approach (Clustering)

  1. Perform a first agglomerative clustering (sketched below) such that
    1. a cluster cannot take more than one sentence of each language.
    2. crossing clusters are prohibited.
  2. Let each remaining sentence form a singleton cluster of its own.
  3. Perform a secondary agglomerative clustering for incomplete clusters such that
    1. crossing clusters are prohibited.
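
A hedged sketch of the first clustering pass, greedily merging the strongest links; sentences are assumed to be (language, index) pairs and `crosses()` stands in for the actual no-crossing check:

<code python>
# Illustrative only; merges are attempted in order of descending link weight.
def first_pass(sentences, links, crosses):
    clusters = {s: frozenset([s]) for s in sentences}      # start with singletons
    for (a, b), _ in sorted(links.items(), key=lambda kv: kv[1], reverse=True):
        if clusters[a] == clusters[b]:
            continue                                        # already in the same cluster
        merged = clusters[a] | clusters[b]
        languages = [lang for (lang, _) in merged]
        if len(languages) != len(set(languages)):
            continue                                        # (1) one sentence per language
        if crosses(merged, set(clusters.values())):
            continue                                        # (2) no crossing clusters
        for s in merged:
            clusters[s] = merged
    return set(clusters.values())
</code>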
↓ Slide 21

Sources for Pairwise Scores

  • Corresponding discourse markers
  • Phrase table matches
  • Identical punctuation
  • Relative cumulative length
  • Identical numbers
  • Identical acronyms
↓ Slide 22
→ Slide 23

Evaluation Metrics

  • Gold standard for multiparallel hierarchical sentence alignment (Tool)
  • Evaluation of bilingual sentence aligners:
    • Minimal bilingual alignments can be extracted from the gold standard for each language pair (sketched below)
  • Evaluation of multiparallel clustering approach:
    • Minimal bilingual alignments of both data sets
    • Average and standard deviation of those bilingual scores per language
    • Distribution of errors by average sentence count
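
A hedged sketch of that extraction, assuming gold clusters are sets of (language, sentence_id) pairs (the data model is an assumption):

<code python>
# Illustrative only: project each gold cluster onto a language pair and keep
# the innermost (minimal) projection covering each sentence.
def project(cluster, lang_a, lang_b):
    a = tuple(sorted(s for (l, s) in cluster if l == lang_a))
    b = tuple(sorted(s for (l, s) in cluster if l == lang_b))
    return (a, b) if a and b else None

def minimal_bilingual(gold_clusters, lang_a, lang_b):
    seen_a, seen_b, result = set(), set(), []
    for cluster in sorted(gold_clusters, key=len):          # smallest clusters first
        p = project(cluster, lang_a, lang_b)
        if p and not (set(p[0]) & seen_a) and not (set(p[1]) & seen_b):
            result.append(p)
            seen_a |= set(p[0])
            seen_b |= set(p[1])
    return result
</code>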
→ Slide 24

Hierarchical Clustering for Word Alignment

Slide 25 :public:bg23.jpg

EOP

