Hierarchical MT Training using Max-Violation Perceptron

 Kai Zhao     Liang Huang
Graduate Center & Queens College
City University of New York
{kzhao@gc,huang@cs.qc}.cuny.edu &      Haitao Mi     Abe Ittycheriah
     T. J. Watson Research Center

Large-scale discriminative training has become promising for statistical machine translation by leveraging the huge training corpus; for example the recent effort in phrase-based MT [21] significantly outperforms mainstream methods that only train on small tuning sets. However, phrase-based MT suffers from limited reorderings, and thus its training can only utilize a small portion of the bitext due to the distortion limit. To address this problem, we extend \newciteyu+:2013 to syntax-based MT by generalizing their latent variable “violation-fixing” perceptron from graphs to hypergraphs. Experiments confirm that our method leads to up to +1.2 Bleuimprovement over mainstream methods such as Mertand Pro.







1 Introduction

Many natural language processing problems including part-of-speech tagging [7], parsing [18], and event extraction [16] have enjoyed great success using large-scale discriminative training algorithms. However, a similar success on machine translation has been elusive, where the mainstream methods still tune on small datasets.

What makes large-scale MT training so hard then? After numerous attempts by various researchers [17, 20, 1, 2, 5, 9, 10], the recent work of \newciteyu+:2013 finally reveals a major reason: it is the vast amount of (inevitable) search errors in MT decoding that astray learning. To alleviate this problem, their work adopts the theoretically-motivated framework of violation-fixing perceptron [12] tailed for inexact search, yielding great results on phrase-based MT (outperforming small-scale Mert/Proby a large margin for the first time). However, the underlying phrase-based model suffers from limited distortion and thus can only employ a small portion (about 1/3 in their Ch-En experiments) of the bitext in training.

Collins (02) \stackbin[search]inexact Huang et al. (12) \stackbin[variable]latent Yu et al. (13)
Zhang et al. (13) \stackbin[variable] this work
Figure 1: Relationship with previous work.
id rule
r0 SX1,X1
r1 SS1X2,S1X2
r2 XBùshí,Bush
r3 XShālóng,Sharon
r4 Xhuìtán,talks
r5 XX1jǔxíngX2,            held X2 with X1
r6 XShālóng, with Sharon
r7 XX1jǔxíngX2,             X1held X2
\arraybackslash S[0:5]S[0:1]X[0:1]

0 Bùshí1








0 Bush 1

held 2

talks 3

with 4

Sharon 5


\arraybackslash [0:5][0:1][0:1]

0 Bùshí1







0 Bush 1

with 2

Sharon 3

held 4

talks 5

\arraybackslash(a) Hierorules \arraybackslash(b) gold derivation \arraybackslash(c) Viterbi derivation
Figure 2: An example of Hierotranslation.

To better utilize the large training set, we propose to generalize from phrase-based MT to syntax-based MT, in particular the hierarchical phrase-based translation model (Hiero) [6], in order to exploit sentence pairs beyond the expressive capacity of phrase-based MT.

The key challenge here is to extend the latent variable violation-fixing perceptron of \newciteyu+:2013 to handle tree-structured derivations and translation hypergraphs. Luckily, \newcitezhang+:2013 have recently generalized the underlying violation-fixing perceptron of \newcitehuang+:2012 from graphs to hypergraphs for bottom-up parsing, which resembles syntax-based decoding. We just need to further extend it to handle latent variables. We make the following contributions:

  1. 1.

    We generalize the latent variable violation-fixing perceptron framework to inexact search over hypergraphs, which subsumes previous algorithms for PBMT and bottom-up parsing as special cases (see Fig. 1).

  2. 2.

    We show that syntax-based MT, with its better handling of long-distance reordering, can exploit a larger portion of the training set, which facilitates sparse lexicalized features.

  3. 3.

    Experiments show that our training algorithm outperforms mainstream tuning methods (which optimize on small devsets) by +1.2 Bleuover Mertand Proon FBIS.

2 Review: Syntax-based MT Decoding

Figure 3: A -LMhypergraph with two derivations: the gold derivation (Fig. 2b) in solid lines, and the Viterbi derivation (Fig. 2c) in dashed lines.

For clarity reasons we will describe Hierodecoding as a two-pass process, first without a language model, and then integrating the LM. This section mostly follows \newcitehuang+chiang:2007.

In the first, -LMphase, the decoder parses the source sentence using the source projection of the synchronous grammar (see Fig. 2 (a) for an example), producing a -LMhypergraph where each node has a signature N[i:j], where N is the nonterminal type (either X or S in Hiero) and [i:j] is the span, and each hyperedge e is an application of the translation rule r(e) (see Figure 3).

To incorporate the language model, each node also needs to remember its target side boundary words. Thus a -LMnode N[i:j] is split into multiple +LMnodes of signature N[i:j]ab, where a and b are the boundary words. For example, with a bigram LM, X[1:5]heldSharon is a node whose translation starts with “held” and ends with “Sharon”.

More formally, the whole decoding process can be cast as a deductive system. Take the partial translation of “held talks with Sharon” in Figure 2 (b) for example, the deduction is

\inferruleX[2:3]SharonSharon:s1  X[4:5]talkstalks:s2X[1:5]heldSharon:s1+s2+s(r5)+λr5,

where s(r5) is the score of rule r5, and the LM combo score λ is logPlm(talksheld)Plm(withtalks)Plm(Sharonwith).

3 Violation-Fixing Perceptron for Hiero

As mentioned in Section 1, the key to the success of \newciteyu+:2013 is the adoption of violation-fixing perceptron of \newcitehuang+:2012 which is tailored for vastly inexact search. The general idea is to update somewhere in the middle of the search (where search error happens) rather than at the very end (standard update is often invalid). To adapt it to MT where many derivations can output the same translation (i.e., spurious ambiguity), \newciteyu+:2013 extends it to handle latent variables which correspond to phrase-based derivations. On the other hand, \newcitezhang+:2013 has generalized \newcitehuang+:2012 from graphs to hypergraphs for bottom-up parsing, which resembles Hierodecoding. So we just need to combine the two generalizing directions (latent variable and hypergraph, see Fig. 1).

3.1 Latent Variable Hypergraph Search

The key difference between bottom-up parsing and MT decoding is that in parsing the gold tree for each input sentence is unique, while in MT many derivations can generate the same reference translation. In other words, the gold derivation to update towards is a latent variable.

Here we formally define the latent variable “max-violation” perceptron over a hypergraph for MT training. For a given sentence pair x,y, we denote H(x) as the decoding hypergraph of Hierowithout any pruning. We say DH(x) if D is a full derivation of decoding x, and D can be derived from the hypergraph. Let 𝑔𝑜𝑜𝑑(x,y) be the set of y-goodderivations for x,y:


where e(D) is the translation from derivation D. We then define the set of y-goodpartial derivations that cover x[i:j] with root N[i:j] as

𝑔𝑜𝑜𝑑N[i:j](x,y)=Δ{ dDD𝑔𝑜𝑜𝑑(x,y),

We further denote the real decoding hypergraph with beam-pruning and cube-pruning as H(x). The set of y-badderivations is defined as

𝑏𝑎𝑑N[i:j](x,y)=Δ{ dDDH(x,y),
𝑟𝑜𝑜𝑡(d) =N[i:j],d𝑔𝑜𝑜𝑑N[i:j](x,y)}.

Note that the y-goodderivations are defined over the unpruned whole decoding hypergraph, while the y-badderivations are defined over the real decoding hypergraph with pruning.

The max-violation method performs the update where the model score difference between the incorrect Viterbi partial derivation and the best y-goodpartial derivation is maximal, by penalizing the incorrect Viterbi partial derivation and rewarding the y-goodpartial derivation.

More formally, we first find the Viterbi partial derivation d- and the best y-goodpartial derivation d+ for each N[i:j] group in the pruned +LMhypergraph:


where 𝚽(x,d) is the feature vector for derivation d. Then it finds the group N[i*:j*]* with the maximal score difference between the Viterbi derivation and the best y-goodderivation:

N[i*:j*]*=Δ 𝐚𝐫𝐠𝐦𝐚𝐱N[i:j]

and update as follows:


where Δ𝚽(x,d,d)=Δ𝚽(x,d)-𝚽(x,d).

3.2 Forced Decoding for Hiero

We now describe how to find the gold derivations.11We only consider single reference in this paper. Such derivations can be generated in way similar to \newciteyu+:2013 by using a language model tailored for forced decoding:

P𝑓𝑜𝑟𝑐𝑒𝑑(qp)={1if q=p+10otherwise,

where p and q are the indices of the boundary words in the reference translation. The +LMnode now has signature N[i:j]pq, where p and q are the indexes of the boundary words. If a boundary word does not occur in the reference, its index is set to so that its language model score will always be -; if a boundary word occurs more than once in the reference, its -LMnode is split into multiple +LMnodes, one for each such index.22Our formulation of index-based language model fixes a bug in the word-based LM of \newciteyu+:2013 when a substring appears more than once in the reference (e.g. “the man…the man…”); thanks to Dan Gildea for pointing it out.

We have a similar deductive system for forced decoding. For the previous example, rule r5 in Figure 2 (a) is rewritten as

XX1jǔxíngX2,1X2 4X1,

where 1 and 4 are the indexes for reference words “held” and “with” respectively. The deduction for X[1:5] in Figure 2 (b) is

\inferruleX[2:3]55:s1  X[4:5]23:s2X[1:5]15:s(r5)+λ+s1+s2r5,

where λ=logi{1,3,4}P𝑓𝑜𝑟𝑐𝑒𝑑(i+1i)=0.

4 Experiments

Following \newciteyu+:2013, we call our max-violation method MaxForce. Our implementation is mostly in Python on top of the cdecsystem [8] via the pycdec interface [3]. In addition, we use minibatch parallelization of [22] to speedup perceptron training. We evaluate MaxForcefor Hieroover two Ch-Encorpora, IWSLT09 and FBIS, and compare the performance with vanilla n-best Mert[19] from Moses [14], Hypergraph Mert[15], and Pro[11] from cdec.

4.1 Features Design

We use all the 18 dense features from cdec, including language model, direct translation probability p(e|f), lexical translation probabilities pl(e|f) and pl(f|e), length penalty, counts for the source and target sides in the training corpus, and flags for the glue rules and pass-through rules.

For sparse features we use Word-Edges features [4, 13] which are shown to be extremely effective in both parsing and phrase-based MT [21]. We find that even simple Word-Edges features boost the performance significantly, and adding complex Word-Edges features from \newciteyu+:2013 brings limited improvement and slows down the decoding. So in the following experiments we only use Word-Edges features consisting of combinations of English and Chinese words, and Chinese characters, and do not use word clusters nor word types. For simplicity and efficiency reasons, we also exclude all non-local features.

4.2 Datasets and Preprocessing

Our first corpus, IWSLT09, contains 30k short sentences collected from spoken language. IWSLT04 is used as development set in MaxForcetraining, and as tuning set for n-best Mert, Hypergraph Mert, and Pro. IWSLT05 is used as test set. Both IWSLT04 and IWSLT05 contain 16 references.We mainly use this corpus to investigate the properties of MaxForce.

The second corpus, FBIS, contains 240k sentences. NIST06 newswire is used as development set for MaxForcetraining, and as tuning set for all other tuning methods. NIST08 newswire is used as test set. Both NIST06 newswire and NIST08 newswire contain 4 references. We mainly use this corpus to demonstrate the performance of MaxForcein large-scale training.

For both corpora, we do standard tokenization, alignment and rule extraction using the cdectools. In rule extraction, we remove all 1-count rules but keep the rules mapping from one Chinese word to one English word to help balancing between overfitting and coverage. We use a trigram language model trained from the target sides of the two corpora respectively.

sent. words
phrase-based MT 32% 12%
Hiero 35% 30%
Hiero (all rules) 65% 55%
Table 1: Reachability comparison (on FBIS) between phrase-based MT reported in \newciteyu+:2013 (without 1-count rules) and Hiero(with and without 1-count rules).
Figure 4: Reachability vs. sent. length on FBIS. See text below for “loose” and “tight”.

4.3 Forced Decoding Reachability

We first report the forced decoding reachability for Hieroon FBIS in Table 1. With the full rule set, 65% sentences and 55% words of the whole corpus are forced decodable in Hiero. After pruning 1-count rules, our forced decoding covers significantly more words than phrase-based MT in \newciteyu+:2013. Furthermore, in phrase-based MT, most decodable sentences are very short, while in Hierothe lengths of decodable sentences are more evenly distributed.

However, in the following experiments, due to efficiency considerations, we use the “tight” rule extraction in cdecthat is more strict than the standard “loose” rule extraction, which generates a reduced rule set and, thus, a reduced reachability. We show the reachability distributions of both tight and loose rule extraction in Figure 4.

4.4 Evaluation on IWSLT

Figure 5: Comparison of various update methods.
Figure 6: Sparse features (Word-Edges) contribute 2 Bleupoints, outperforming Proand Mert.

For IWSLT, we first compare the performance from various update methods in Figure 5. The max-violation method is more than 15 Bleupoints better than the standard perceptron (also known as “bold-update” in \newciteliang+:2006) which updates at the root of the derivation tree.33We find that while MaxForcegenerates translations of length ratio close to 1 during training, the length ratios on dev/test sets are significantly lower, due to OOVs. So we run a binary search for the length penalty weight after each training iteration to tune the length ratio to 0.97 on dev set.,44 We report Bleuwith averaged reference lengths. This can be explained by the fact that in training 58% of the standard updates are invalid (i.e., they do not fix any violation). We also use the “skip” strategy of \newcitezhang+:2013 which updates at the root of the derivation only when it fixes a search error, avoiding all invalid updates. This achieves 10 Bleubetter than the standard update, but is still more than 5 Bleuworse than Max-Violation update. Finally we also try the “local-update” method from \newciteliang+:2006 which updates towards the derivation with the best Bleu+1in the root group S[0:|x|]. This method is about 2 Bleupoints worse than max-violation.

We further investigate the contribution of sparse features in Figure 6. On the development set, max-violation update without Word-Edges features achieves Bleusimilar to n-best Mertand Pro, but lower than Hypergraph Mert. Adding simple Word-Edges features improves Bleuby 2 points, outperforming the very strong Hypergraph Mertbaseline by 1 point. See Table 2 for details. The results of n-best Mert, Hypergraph Mert, and Proare averages from 3 runs.

algorithm # feats dev test
n-best Mert 18 44.9 47.9
Hypergraph Mert 18 46.6 50.7
Pro 18 45.0 49.5
local update perc. 443K 45.6 49.1
MaxForce 529K 47.4 51.5
Table 2: Bleuscores (with 16 references) of various training algorithms on IWSLT09.

4.5 Evaluation on FBIS

algorithm # feats dev test
Hypergraph Mert 18 27.3 23.0
Pro 18 26.4 22.7
MaxForce 4.5M 27.7 23.9
Table 3: Bleuscores (with 4 references) of various training algorithms on FBIS.

Table 3 shows Bleuscores of Hypergraph Mert, Pro, and MaxForceon FBIS. MaxForceactives 4.5M features, and achieves +1.2 Bleuover Proand +0.9 Bleuover Hypergraph Mert. The training time (on 32 cores) for Hypergraph Mertand Prois about 30 min. on the dev set, and is about 5 hours for MaxForceon the training set.

5 Conclusions

We have presented a latent-variable violation-fixing framework for general structured prediction problems with inexact search over hypergraphs. Its application on Hierobrings significant improvement in Bleu, compared to algorithms that are specially designed for MT tuning such as Mertand Pro.


Part of this work was done during K. Z.’s internship at IBM. We thank Martin Čmejrek and Lemao Liu for discussions, David Chiang for pointing us to pycdec, Dan Gildea for Footnote 3.2, and the anonymous reviewers for comments. This work is supported by DARPA FA8750-13-2-0041 (DEFT), DARPA HR0011-12-C-0015 (BOLT), and a Google Faculty Research Award.



  • [1] A. Arun and P. Koehn(2007) Online learning methods for discriminative training of phrase based statistical machine translation. Proc. of MT Summit XI 2 (5), pp. 29. Cited by: 1.
  • [2] P. Blunsom, T. Cohn and M. Osborne(2008) A discriminative latent variable model for statistical machine translation.. pp. 200–208. Cited by: 1.
  • [3] V. Chahuneau, N. Smith and C. Dyer(2012) Pycdec: a python interface to cdec. Prague Bulletin of Mathematical Linguistics (98). Cited by: 4.
  • [4] E. Charniak and M. Johnson(2005-06) Coarse-to-fine-grained n-best parsing and discriminative reranking. Ann Arbor, MI. Cited by: 4.1.
  • [5] D. Chiang, Y. Marton and P. Resnik(2008) Online large-margin training of syntactic and structural translation features. Cited by: 1.
  • [6] D. Chiang(2005) A hierarchical phrase-based model for statistical machine translation. Cited by: 1.
  • [7] M. Collins(2002) Discriminative training methods for hidden markov models: theory and experiments with perceptron algorithms. Cited by: 1, Acknowledgment.
  • [8] C. Dyer, A. Lopez, J. Ganitkevitch, J. Weese, F. Ture, P. Blunsom, H. Setiawan, V. Eidelman and P. Resnik(2010) Cdec: a decoder, alignment, and learning framework for finite-state and context-free translation models. Cited by: 4.
  • [9] J. Flanigan, C. Dyer and J. Carbonell(2013-06) Large-scale discriminative training for statistical machine translation using held-out line search. Atlanta, Georgia, pp. 248–258. External Links: Link Cited by: 1.
  • [10] S. Green, S. Wang, D. Cer and C. D. Manning(2013) Fast and adaptive online training of feature-rich translation models. to appear) ACL. Cited by: 1.
  • [11] M. Hopkins and J. May(2011) Tuning as ranking. Cited by: 4.
  • [12] L. Huang, S. Fayong and Y. Guo(2012) Structured perceptron with inexact search. Cited by: 1.
  • [13] L. Huang(2008-06) Forest reranking: discriminative parsing with non-local features. Columbus, OH. Cited by: 4.1.
  • [14] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin and E. Herbst(2007) Moses: open source toolkit for statistical machine translation. Cited by: 4.
  • [15] S. Kumar, W. Macherey, C. Dyer and F. Och(2009) Efficient minimum error rate training and minimum bayes-risk decoding for translation hypergraphs and lattices. Cited by: 4.
  • [16] Q. Li, H. Ji and L. Huang(2013) Joint event extraction via structured prediction with global features. Cited by: 1.
  • [17] P. Liang, A. Bouchard-Côté, D. Klein and B. Taskar(2006-07) An end-to-end discriminative approach to machine translation. Sydney, Australia. External Links: Link Cited by: 1.
  • [18] R. McDonald, K. Crammer and F. Pereira(2005-07) Online large-margin training of dependency parsers. Cited by: 1.
  • [19] F. J. Och(2003) Minimum error rate training in statistical machine translation. pp. 160–167. Cited by: 4.
  • [20] T. Watanabe, J. Suzuki, H. Tsukada and H. Isozaki(2007) Online large-margin training for statistical machine translation. Cited by: 1.
  • [21] H. Yu, L. Huang, H. Mi and K. Zhao(2013) Max-violation perceptron and forced decoding for scalable MT training. External Links: Link Cited by: Hierarchical MT Training using Max-Violation Perceptron, 4.1.
  • [22] K. Zhao and L. Huang(2013) Minibatch and parallelization for online large margin structured learning. Cited by: 4.