Largescale discriminative training has become promising for statistical machine translation by leveraging the huge training corpus; for example the recent effort in phrasebased MT [21] significantly outperforms mainstream methods that only train on small tuning sets. However, phrasebased MT suffers from limited reorderings, and thus its training can only utilize a small portion of the bitext due to the distortion limit. To address this problem, we extend \newciteyu+:2013 to syntaxbased MT by generalizing their latent variable “violationfixing” perceptron from graphs to hypergraphs. Experiments confirm that our method leads to up to +1.2 Bleuimprovement over mainstream methods such as Mertand Pro.
1.5em
1em
algorithmAlgorithm
Many natural language processing problems including partofspeech tagging [7], parsing [18], and event extraction [16] have enjoyed great success using largescale discriminative training algorithms. However, a similar success on machine translation has been elusive, where the mainstream methods still tune on small datasets.
What makes largescale MT training so hard then? After numerous attempts by various researchers [17, 20, 1, 2, 5, 9, 10], the recent work of \newciteyu+:2013 finally reveals a major reason: it is the vast amount of (inevitable) search errors in MT decoding that astray learning. To alleviate this problem, their work adopts the theoreticallymotivated framework of violationfixing perceptron [12] tailed for inexact search, yielding great results on phrasebased MT (outperforming smallscale Mert/Proby a large margin for the first time). However, the underlying phrasebased model suffers from limited distortion and thus can only employ a small portion (about 1/3 in their ChEn experiments) of the bitext in training.
Collins (02)  $\text{stackbin}\left[\text{search}\right]\text{inexact}\u27f6$  Huang et al. (12)  $\text{stackbin}\left[\text{variable}\right]\text{latent}\u27f6$  Yu et al. (13) 
$\downarrow $ hypergraph  $\downarrow $  
Zhang et al. (13)  $\text{stackbin}\left[{\text{variable}}\right]\u27f6$  this work 
\arraybackslash

\arraybackslash  \arraybackslash
${}_{0}$ Bùshí${}_{1}$ yǔ${}_{2}$ Shālóng${}_{3}$ jǔxíng${}_{4}$ huìtán${}_{5}$ ${}_{0}$ Bush ${}_{1}$ with ${}_{2}$ Sharon ${}_{3}$ held ${}_{4}$ talks ${}_{5}$  
\arraybackslash(a) Hierorules  \arraybackslash(b) gold derivation  \arraybackslash(c) Viterbi derivation 
To better utilize the large training set, we propose to generalize from phrasebased MT to syntaxbased MT, in particular the hierarchical phrasebased translation model (Hiero) [6], in order to exploit sentence pairs beyond the expressive capacity of phrasebased MT.
The key challenge here is to extend the latent variable violationfixing perceptron of \newciteyu+:2013 to handle treestructured derivations and translation hypergraphs. Luckily, \newcitezhang+:2013 have recently generalized the underlying violationfixing perceptron of \newcitehuang+:2012 from graphs to hypergraphs for bottomup parsing, which resembles syntaxbased decoding. We just need to further extend it to handle latent variables. We make the following contributions:
We generalize the latent variable violationfixing perceptron framework to inexact search over hypergraphs, which subsumes previous algorithms for PBMT and bottomup parsing as special cases (see Fig. 1).
We show that syntaxbased MT, with its better handling of longdistance reordering, can exploit a larger portion of the training set, which facilitates sparse lexicalized features.
Experiments show that our training algorithm outperforms mainstream tuning methods (which optimize on small devsets) by +1.2 Bleuover Mertand Proon FBIS.
For clarity reasons we will describe Hierodecoding as a twopass process, first without a language model, and then integrating the LM. This section mostly follows \newcitehuang+chiang:2007.
In the first, $\mathrm{}\text{LM}$phase, the decoder parses the source sentence using the source projection of the synchronous grammar (see Fig. 2 (a) for an example), producing a $\mathrm{}\text{LM}$hypergraph where each node has a signature ${N}_{[i:j]}$, where $N$ is the nonterminal type (either X or S in Hiero) and $\left[i:j\right]$ is the span, and each hyperedge $e$ is an application of the translation rule $r\left(e\right)$ (see Figure 3).
To incorporate the language model, each node also needs to remember its target side boundary words. Thus a $\mathrm{}\text{LM}$node ${N}_{[i:j]}$ is split into multiple $\mathrm{+}\text{LM}$nodes of signature ${N}_{[i:j]}^{a\star b}$, where $a$ and $b$ are the boundary words. For example, with a bigram LM, ${\text{X}}_{[1:5]}^{\text{held}\star \text{Sharon}}$ is a node whose translation starts with “held” and ends with “Sharon”.
More formally, the whole decoding process can be cast as a deductive system. Take the partial translation of “held talks with Sharon” in Figure 2 (b) for example, the deduction is
$$\text{inferrule}\hspace{1em}{\text{X}}_{[2:3]}^{\text{Sharon}\star \text{Sharon}}:{s}_{1}\hspace{1em}\hspace{1em}{\text{X}}_{[4:5]}^{\text{talks}\star \text{talks}}:{s}_{2}\hspace{1em}{\text{X}}_{[1:5]}^{\text{held}\star \text{Sharon}}:{s}_{1}+{s}_{2}+s\left({r}_{5}\right)+\lambda \hspace{1em}{r}_{5},$$ 
where $s\left({r}_{5}\right)$ is the score of rule ${r}_{5}$, and the LM combo score $\lambda $ is $\mathrm{log}{\mathrm{P}}_{\mathrm{lm}}(\text{talks}\mid \text{held}){\mathrm{P}}_{\mathrm{lm}}(\text{with}\mid \text{talks}){\mathrm{P}}_{\mathrm{lm}}(\text{Sharon}\mid \text{with})$.
As mentioned in Section 1, the key to the success of \newciteyu+:2013 is the adoption of violationfixing perceptron of \newcitehuang+:2012 which is tailored for vastly inexact search. The general idea is to update somewhere in the middle of the search (where search error happens) rather than at the very end (standard update is often invalid). To adapt it to MT where many derivations can output the same translation (i.e., spurious ambiguity), \newciteyu+:2013 extends it to handle latent variables which correspond to phrasebased derivations. On the other hand, \newcitezhang+:2013 has generalized \newcitehuang+:2012 from graphs to hypergraphs for bottomup parsing, which resembles Hierodecoding. So we just need to combine the two generalizing directions (latent variable and hypergraph, see Fig. 1).
The key difference between bottomup parsing and MT decoding is that in parsing the gold tree for each input sentence is unique, while in MT many derivations can generate the same reference translation. In other words, the gold derivation to update towards is a latent variable.
Here we formally define the latent variable “maxviolation” perceptron over a hypergraph for MT training. For a given sentence pair $\u27e8x,y\u27e9$, we denote $H\left(x\right)$ as the decoding hypergraph of Hierowithout any pruning. We say $D\in H\left(x\right)$ if $D$ is a full derivation of decoding $x$, and $D$ can be derived from the hypergraph. Let $\mathrm{\mathit{g}\mathit{o}\mathit{o}\mathit{d}}\left(x,y\right)$ be the set of $y$goodderivations for $\u27e8x,y\u27e9$:
$$\mathrm{\mathit{g}\mathit{o}\mathit{o}\mathit{d}}\left(x,y\right)\stackrel{\mathrm{\Delta}}{=}\left\{D\in H\left(x\right)\mid e\left(D\right)=y\right\},$$ 
where $e\left(D\right)$ is the translation from derivation $D$. We then define the set of $y$goodpartial derivations that cover ${x}_{[i:j]}$ with root ${N}_{[i:j]}$ as
${\mathrm{\mathit{g}\mathit{o}\mathit{o}\mathit{d}}}_{{N}_{[i:j]}}(x,y)\stackrel{\mathrm{\Delta}}{=}\{$  $d\in D\mid D\in \mathrm{\mathit{g}\mathit{o}\mathit{o}\mathit{d}}(x,y),$  
$\mathrm{\mathit{r}\mathit{o}\mathit{o}\mathit{t}}\left(d\right)={N}_{[i:j]}\}$ 
We further denote the real decoding hypergraph with beampruning and cubepruning as ${H}^{\prime}\left(x\right)$. The set of $y$badderivations is defined as
${\mathrm{\mathit{b}\mathit{a}\mathit{d}}}_{{N}_{[i:j]}}(x,y)\stackrel{\mathrm{\Delta}}{=}\{$  $d\in D\mid D\in {H}^{\prime}(x,y),$  
$\mathrm{\mathit{r}\mathit{o}\mathit{o}\mathit{t}}\left(d\right)$  $={N}_{[i:j]},d\notin {\mathrm{\mathit{g}\mathit{o}\mathit{o}\mathit{d}}}_{{N}_{[i:j]}}(x,y)\}.$ 
Note that the $y$goodderivations are defined over the unpruned whole decoding hypergraph, while the $y$badderivations are defined over the real decoding hypergraph with pruning.
The maxviolation method performs the update where the model score difference between the incorrect Viterbi partial derivation and the best $y$goodpartial derivation is maximal, by penalizing the incorrect Viterbi partial derivation and rewarding the $y$goodpartial derivation.
More formally, we first find the Viterbi partial derivation ${d}^{}$ and the best $y$goodpartial derivation ${d}^{+}$ for each ${N}_{[i:j]}$ group in the pruned $\mathrm{+}\text{LM}$hypergraph:
$${d}_{{N}_{[i:j]}}^{+}\left(x,y\right)\stackrel{\mathrm{\Delta}}{=}\underset{d\in {\mathrm{\mathit{g}\mathit{o}\mathit{o}\mathit{d}}}_{{N}_{[i:j]}}\left(x,y\right)}{\mathbf{a}\mathbf{r}\mathbf{g}\mathbf{m}\mathbf{a}\mathbf{x}}\mathbf{w}\cdot \mathbf{\Phi}\left(x,d\right),$$ 
$${d}_{{N}_{[i:j]}}^{}\left(x,y\right)\stackrel{\mathrm{\Delta}}{=}\underset{d\in {\mathrm{\mathit{b}\mathit{a}\mathit{d}}}_{{N}_{[i:j]}}\left(x,y\right)}{\mathbf{a}\mathbf{r}\mathbf{g}\mathbf{m}\mathbf{a}\mathbf{x}}\mathbf{w}\cdot \mathbf{\Phi}\left(x,d\right),$$ 
where $\mathbf{\Phi}\left(x,d\right)$ is the feature vector for derivation $d$. Then it finds the group ${N}_{[{i}^{*}:{j}^{*}]}^{*}$ with the maximal score difference between the Viterbi derivation and the best $y$goodderivation:
${N}_{[{i}^{*}:{j}^{*}]}^{*}\stackrel{\mathrm{\Delta}}{=}$  $\underset{{N}_{[i:j]}}{\mathbf{a}\mathbf{r}\mathbf{g}\mathbf{m}\mathbf{a}\mathbf{x}}$  
$\mathbf{w}\cdot \mathrm{\Delta}\mathbf{\Phi}\left(x,{d}_{{N}_{[i:j]}}^{+}\left(x,y\right),{d}_{{N}_{[i:j]}}^{}\left(x,y\right)\right),$ 
and update as follows:
$$\mathbf{w}\leftarrow \mathbf{w}+\mathrm{\Delta}\mathbf{\Phi}\left(x,{d}_{{N}_{[{i}^{*}:{j}^{*}]}^{*}}^{+}\left(x,y\right),{d}_{{N}_{[{i}^{*}:{j}^{*}]}^{*}}^{}\left(x,y\right)\right),$$ 
where $\mathrm{\Delta}\mathbf{\Phi}\left(x,d,{d}^{\prime}\right)\stackrel{\mathrm{\Delta}}{=}\mathbf{\Phi}\left(x,d\right)\mathbf{\Phi}\left(x,{d}^{\prime}\right)$.
We now describe how to find the gold derivations.^{1}^{1}We only consider single reference in this paper. Such derivations can be generated in way similar to \newciteyu+:2013 by using a language model tailored for forced decoding:
$${\mathrm{P}}_{\mathrm{\mathit{f}\mathit{o}\mathit{r}\mathit{c}\mathit{e}\mathit{d}}}(q\mid p)=\{\begin{array}{cc}1\hfill & \text{if}q=p+1\hfill \\ 0\hfill & \text{otherwise}\hfill \end{array},$$ 
where $p$ and $q$ are the indices of the boundary words in the reference translation. The $\mathrm{+}\text{LM}$node now has signature ${N}_{[i:j]}^{p\star q}$, where $p$ and $q$ are the indexes of the boundary words. If a boundary word does not occur in the reference, its index is set to $\mathrm{\infty}$ so that its language model score will always be $\mathrm{\infty}$; if a boundary word occurs more than once in the reference, its $\mathrm{}\text{LM}$node is split into multiple $\mathrm{+}\text{LM}$nodes, one for each such index.^{2}^{2}Our formulation of indexbased language model fixes a bug in the wordbased LM of \newciteyu+:2013 when a substring appears more than once in the reference (e.g. “the man…the man…”); thanks to Dan Gildea for pointing it out.
We have a similar deductive system for forced decoding. For the previous example, rule ${r}_{5}$ in Figure 2 (a) is rewritten as
$$\text{X}\to \u27e8\mathit{\text{y\u01d4}}{\text{X}}_{\text{1}}\mathit{\text{j\u01d4x\xedng}}{\text{X}}_{\text{2}},1{\text{X}}_{\text{2}}\mathrm{\hspace{0.25em}4}{\text{X}}_{\text{1}}\u27e9,$$ 
where $1$ and $4$ are the indexes for reference words “held” and “with” respectively. The deduction for ${\text{X}}_{[1:5]}$ in Figure 2 (b) is
$$\text{inferrule}\hspace{1em}{\text{X}}_{[2:3]}^{5\star 5}:{s}_{1}\hspace{1em}\hspace{1em}{\text{X}}_{[4:5]}^{2\star 3}:{s}_{2}\hspace{1em}{\text{X}}_{[1:5]}^{1\star 5}:s\left({r}_{5}\right)+\lambda +{s}_{1}+{s}_{2}\hspace{1em}{r}_{5},$$ 
where $\lambda =\mathrm{log}{\prod}_{i\in \left\{1,3,4\right\}}{\mathrm{P}}_{\mathrm{\mathit{f}\mathit{o}\mathit{r}\mathit{c}\mathit{e}\mathit{d}}}(i+1\mid i)=0$.
Following \newciteyu+:2013, we call our maxviolation method MaxForce. Our implementation is mostly in Python on top of the cdecsystem [8] via the pycdec interface [3]. In addition, we use minibatch parallelization of [22] to speedup perceptron training. We evaluate MaxForcefor Hieroover two ChEncorpora, IWSLT09 and FBIS, and compare the performance with vanilla $n$best Mert[19] from Moses [14], Hypergraph Mert[15], and Pro[11] from cdec.
We use all the 18 dense features from cdec, including language model, direct translation probability $p\left(e\rightf)$, lexical translation probabilities ${p}_{l}\left(e\rightf)$ and ${p}_{l}\left(f\righte)$, length penalty, counts for the source and target sides in the training corpus, and flags for the glue rules and passthrough rules.
For sparse features we use WordEdges features [4, 13] which are shown to be extremely effective in both parsing and phrasebased MT [21]. We find that even simple WordEdges features boost the performance significantly, and adding complex WordEdges features from \newciteyu+:2013 brings limited improvement and slows down the decoding. So in the following experiments we only use WordEdges features consisting of combinations of English and Chinese words, and Chinese characters, and do not use word clusters nor word types. For simplicity and efficiency reasons, we also exclude all nonlocal features.
Our first corpus, IWSLT09, contains $\sim $30k short sentences collected from spoken language. IWSLT04 is used as development set in MaxForcetraining, and as tuning set for $n$best Mert, Hypergraph Mert, and Pro. IWSLT05 is used as test set. Both IWSLT04 and IWSLT05 contain 16 references.We mainly use this corpus to investigate the properties of MaxForce.
The second corpus, FBIS, contains $\sim $240k sentences. NIST06 newswire is used as development set for MaxForcetraining, and as tuning set for all other tuning methods. NIST08 newswire is used as test set. Both NIST06 newswire and NIST08 newswire contain 4 references. We mainly use this corpus to demonstrate the performance of MaxForcein largescale training.
For both corpora, we do standard tokenization, alignment and rule extraction using the cdectools. In rule extraction, we remove all 1count rules but keep the rules mapping from one Chinese word to one English word to help balancing between overfitting and coverage. We use a trigram language model trained from the target sides of the two corpora respectively.
sent.  words  

phrasebased MT  32%  12% 
Hiero  35%  30% 
Hiero (all rules)  65%  55% 
We first report the forced decoding reachability for Hieroon FBIS in Table 1. With the full rule set, 65% sentences and 55% words of the whole corpus are forced decodable in Hiero. After pruning 1count rules, our forced decoding covers significantly more words than phrasebased MT in \newciteyu+:2013. Furthermore, in phrasebased MT, most decodable sentences are very short, while in Hierothe lengths of decodable sentences are more evenly distributed.
However, in the following experiments, due to efficiency considerations, we use the “tight” rule extraction in cdecthat is more strict than the standard “loose” rule extraction, which generates a reduced rule set and, thus, a reduced reachability. We show the reachability distributions of both tight and loose rule extraction in Figure 4.
For IWSLT, we first compare the performance from various update methods in Figure 5. The maxviolation method is more than 15 Bleupoints better than the standard perceptron (also known as “boldupdate” in \newciteliang+:2006) which updates at the root of the derivation tree.^{3}^{3}We find that while MaxForcegenerates translations of length ratio close to 1 during training, the length ratios on dev/test sets are significantly lower, due to OOVs. So we run a binary search for the length penalty weight after each training iteration to tune the length ratio to $\sim $0.97 on dev set.${}^{,}$^{4}^{4} We report Bleuwith averaged reference lengths. This can be explained by the fact that in training $\sim $58% of the standard updates are invalid (i.e., they do not fix any violation). We also use the “skip” strategy of \newcitezhang+:2013 which updates at the root of the derivation only when it fixes a search error, avoiding all invalid updates. This achieves $\sim $10 Bleubetter than the standard update, but is still more than $\sim $5 Bleuworse than MaxViolation update. Finally we also try the “localupdate” method from \newciteliang+:2006 which updates towards the derivation with the best ${\mathrm{Bleu}}^{+1}$in the root group ${S}_{[0:\leftx\right]}$. This method is about 2 Bleupoints worse than maxviolation.
We further investigate the contribution of sparse features in Figure 6. On the development set, maxviolation update without WordEdges features achieves Bleusimilar to $n$best Mertand Pro, but lower than Hypergraph Mert. Adding simple WordEdges features improves Bleuby $\sim $2 points, outperforming the very strong Hypergraph Mertbaseline by $\sim $1 point. See Table 2 for details. The results of $n$best Mert, Hypergraph Mert, and Proare averages from 3 runs.
algorithm  # feats  dev  test 

$n$best Mert  18  44.9  47.9 
Hypergraph Mert  18  46.6  50.7 
Pro  18  45.0  49.5 
local update perc.  443K  45.6  49.1 
MaxForce  529K  47.4  51.5 
algorithm  # feats  dev  test 

Hypergraph Mert  18  27.3  23.0 
Pro  18  26.4  22.7 
MaxForce  4.5M  27.7  23.9 
Table 3 shows Bleuscores of Hypergraph Mert, Pro, and MaxForceon FBIS. MaxForceactives 4.5M features, and achieves +1.2 Bleuover Proand +0.9 Bleuover Hypergraph Mert. The training time (on 32 cores) for Hypergraph Mertand Prois about 30 min. on the dev set, and is about 5 hours for MaxForceon the training set.
We have presented a latentvariable violationfixing framework for general structured prediction problems with inexact search over hypergraphs. Its application on Hierobrings significant improvement in Bleu, compared to algorithms that are specially designed for MT tuning such as Mertand Pro.
Part of this work was done during K. Z.’s internship at IBM. We thank Martin Čmejrek and Lemao Liu for discussions, David Chiang for pointing us to pycdec, Dan Gildea for Footnote 3.2, and the anonymous reviewers for comments. This work is supported by DARPA FA87501320041 (DEFT), DARPA HR001112C0015 (BOLT), and a Google Faculty Research Award.