Automatic Detection of Machine Translated Text and Translation Quality Estimation

Roee Aharoni
Dept. of Computer Science
Bar Ilan University
Ramat-Gan, Israel 52900
Moshe Koppel
Dept. of Computer Science
Bar Ilan University
Ramat-Gan, Israel 52900
Yoav Goldberg
Dept. of Computer Science
Bar Ilan University
Ramat-Gan, Israel 52900

We show that it is possible to automatically detect machine translated text at sentence level from monolingual corpora, using text classification methods. We show further that the accuracy with which a learned classifier can detect text as machine translated is strongly correlated with the translation quality of the machine translation system that generated it. Finally, we offer a generic machine translation quality estimation technique based on this approach, which does not require reference sentences.

1 Introduction

The recent success and proliferation of statistical machine translation (MT) systems raise a number of important questions. Prominent among these are how to evaluate the quality of such a system efficiently and how to detect the output of such systems (for example, to avoid using it circularly as input for refining MT systems).

In this paper, we address both of these questions. First, we show that using style-related linguistic features, such as frequencies of part-of-speech n-grams and function words, it is possible to learn classifiers that distinguish machine-translated text from human-translated or native English text. While this is a straightforward and not entirely novel result, our main contribution is to relativize it: we will see that the success of such classifiers is strongly correlated with the quality of the underlying machine translation system. Specifically, given a corpus consisting of both machine-translated English text (English being the target language) and native English text (not necessarily the reference translation of the machine-translated text), we measure the accuracy of the system in classifying the sentences in the corpus as machine-translated or not. This accuracy will be shown to decrease as the quality of the underlying MT system increases. In fact, the correlation is strong enough that we propose that this accuracy measure itself can be used as a measure of MT system quality, obviating the need for a reference corpus, which is necessary, for example, for BLEU [].

The paper is structured as follows: In the next section, we review previous related work. In the third section, we describe experiments regarding the detection of machine translation and in the fourth section we discuss the use of detection techniques as a machine translation quality estimation method. In the final section we offer conclusions and suggestions for future work.

2 Previous Work

2.1 Translationese

The special features of translated texts have been studied widely for many years. Attempts to define their characteristics, often called “Translation Universals”, include []. The differences between native and translated texts found there go well beyond systematic translation errors and point to a distinct “Translationese” dialect.

Automatic text classification methods have been applied widely in translation studies in recent years, mainly as an empirical means of measuring, supporting or contradicting translation universals. Several works [] used text classification techniques to distinguish human-translated text from native-language text at document or paragraph level, using features such as word and POS n-grams and the proportions of grammatical words, nouns, finite verbs, auxiliary verbs, adjectives, adverbs, numerals, pronouns, prepositions, determiners, conjunctions, etc. Koppel and Ordan [] classified texts as original or translated, using a list of 300 function words taken from LIWC [] as features. Volansky et al. [] also tested various hypotheses regarding “Translationese”, using 32 different linguistically-informed features, to assess the degree to which different sets of features can distinguish between translated and original texts.

2.2 Machine Translation Detection

Regarding the detection of machine translated text, Carter and Inkpen [] translated the Hansards of the 36th Parliament of Canada using the Microsoft Bing MT web service, and conducted three detection experiments at document level, using unigrams, average token length, and type-token ratio as features. Arase and Zhou [] trained a sentence-level classifier to distinguish machine translated text from human generated text on English and Japanese web-page corpora, translated by Google Translate, Bing and an in-house SMT system. They achieved very high detection accuracy using application-specific feature sets for this purpose, including indicators of the “Phrase Salad” [] phenomenon or “Gappy-Phrases” [].

While Arase and Zhou [] considered MT detection at sentence level, as we do in this paper, they did not study the correlation between the translation quality of the machine translated text and the ability to detect it. We show below that such detection is possible with very high accuracy only on low-quality translations. We examine this detection accuracy vs. quality correlation, with various MT systems, such as rule-based and statistical MT, both commercial and in-house, using various feature sets.

3 Detection Experiments

3.1 Features

We wish to distinguish machine translated English sentences from either human-translated sentences or native English sentences. Due to the sparseness of the data at the sentence level, we use common content-independent linguistic features for the classification task. Our features are binary, denoting the presence or absence of each of a set of part-of-speech n-grams acquired using the Stanford POS tagger [], as well as the presence or absence of each of 467 function words taken from LIWC []. We consider only those entries that appear at least ten times in the entire corpus, in order to reduce sparsity in the data. As our learning algorithm we use SVM with sequential minimal optimization (SMO), taken from the WEKA machine learning toolkit [].
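The feature construction described above can be sketched roughly as follows. This is a minimal illustration rather than the setup actually used: it assumes sentences have already been POS-tagged by some tagger, the function-word list is a tiny stand-in for the 467 LIWC entries, the maximum n-gram order `max_n` is an assumption (it is not stated above), and all function names are hypothetical.

```python
from collections import Counter

# Tiny stand-in list (assumption); the experiments use 467 LIWC function words.
FUNCTION_WORDS = {"the", "a", "of", "and", "in", "with", "all", "but", "to"}


def extract_entries(tagged_sentence, max_n=3):
    """List every function-word and POS n-gram entry occurring in a sentence.

    tagged_sentence is a list of (token, pos_tag) pairs from any POS tagger.
    max_n is an assumed upper bound on the POS n-gram order.
    """
    tokens = [tok.lower() for tok, _ in tagged_sentence]
    tags = [tag for _, tag in tagged_sentence]
    entries = ["fw=" + tok for tok in tokens if tok in FUNCTION_WORDS]
    for n in range(1, max_n + 1):
        for i in range(len(tags) - n + 1):
            entries.append("pos=" + "_".join(tags[i:i + n]))
    return entries


def build_vocabulary(corpus, min_count=10, max_n=3):
    """Keep only entries that occur at least min_count times in the corpus."""
    counts = Counter()
    for sentence in corpus:
        counts.update(extract_entries(sentence, max_n))
    return sorted(entry for entry, c in counts.items() if c >= min_count)


def binary_vector(tagged_sentence, vocabulary, max_n=3):
    """Presence (1) or absence (0) of each vocabulary entry in the sentence."""
    present = set(extract_entries(tagged_sentence, max_n))
    return [1 if entry in present else 0 for entry in vocabulary]


# Toy usage with a lowered threshold so that the tiny corpus yields features.
corpus = [
    [("the", "DT"), ("cat", "NN"), ("sat", "VBD")],
    [("the", "DT"), ("dog", "NN")],
]
vocab = build_vocabulary(corpus, min_count=2, max_n=2)
vector = binary_vector([("a", "DT"), ("dog", "NN")], vocab, max_n=2)
print(vocab, vector)
```

The resulting binary vectors can then be passed to any SVM implementation; the threshold of ten occurrences corresponds to `min_count=10`.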

3.2 Detecting Different MT Systems

In the first experiment set, we explore the ability to detect the output of different MT systems in an environment containing both human-generated and machine-translated text. For this task, we use a portion of the Canadian Hansard corpus [], containing 48,914 parallel French-English sentences. We translate the French portion of the corpus using several MT systems: Google Translate, Systran, and five other commercial MT systems available at the website, which enables querying example MT systems built by several European MT companies. After translating the sentences, we take 20,000 sentences from each engine's output and label them as MT sentences, together with another 20,000 sentences, the human reference translations, labeled as reference sentences. We conduct a 10-fold cross-validation experiment on the entire 40,000-sentence corpus. We also conduct the same experiment using 20,000 random, non-reference sentences from the same corpus instead of the reference sentences. Using simple linear regression, we also obtain an R2 value (coefficient of determination) over the measurements of detection accuracy and BLEU score, for each of three feature set combinations (function words, POS tags and mixed) and the two data combinations (MT vs. reference and MT vs. non-reference sentences). The detection and R2 results are shown in Table 1.
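As a rough illustration of the regression step, the sketch below fits a least-squares line to the BLEU scores and the mixed-feature MT vs. non-reference detection accuracies reported for the seven systems, and computes R2 from the residuals. The function name `fit_line` is hypothetical, and the tool actually used for the regression is not specified here.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared), where r_squared is the
    coefficient of determination 1 - SS_res / SS_tot.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, 1.0 - ss_res / ss_tot


# BLEU score and detection accuracy (%) per system, mixed features, MT/non-ref.
bleu = [36.06, 29.0, 24.66, 18.25, 15.44, 12.36, 8.39]
accuracy = [63.34, 72.02, 72.36, 78.20, 79.57, 80.90, 89.36]

slope, intercept, r2 = fit_line(bleu, accuracy)
print(f"slope={slope:.3f} intercept={intercept:.2f} R2={r2:.3f}")
```

A negative slope means that detection accuracy falls as BLEU rises; the R2 value indicates how much of the variation in accuracy the line explains.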


[Figure 1 plot: x-axis BLEU, y-axis detection accuracy (%); six series: mix-nr, mix-r, fw-nr, fw-r, pos-nr, pos-r. BLEU score per system, in the column order of Table 1: Google 36.06, Moses 29, Systran 24.66, ProMT 18.25, Linguatec 15.44, Skycode 12.36, Trident 8.39. The plotted accuracies are the values listed in Table 1.]

Figure 1: Correlation between detection accuracy and BLEU score on commercial MT systems, using POS, function words and mixed features against reference and non-reference sentences.

As Table 1 shows, the best detection results are obtained using the full combined feature set. It can also be seen that, as might be expected, it is easier to distinguish machine-translated sentences from a non-reference set than from the reference set. In Figure 1, we show the relationship between the observed detection accuracy for each system and the BLEU score of that system. Regardless of the feature set or the non-MT sentences used, detection accuracy decreases as BLEU increases, and the correlation is high, as the R2 values in Table 1 (0.779 to 0.978) also show.

| Features | Data | Google | Moses | Systran | ProMT | Linguatec | Skycode | Trident | R2 |
|---|---|---|---|---|---|---|---|---|---|
| mixed | MT/non-ref | 63.34 | 72.02 | 72.36 | 78.2 | 79.57 | 80.9 | 89.36 | 0.946 |
| mixed | MT/ref | 59.51 | 69.47 | 69.77 | 75.86 | 78.11 | 79.24 | 88.85 | 0.944 |
| func. w. | MT/non-ref | 60.43 | 69.17 | 69.87 | 69.78 | 71.38 | 75.46 | 84.97 | 0.798 |
| func. w. | MT/ref | 57.27 | 66.05 | 67.48 | 67.06 | 68.58 | 73.37 | 84.79 | 0.779 |
| POS | MT/non-ref | 60.32 | 64.39 | 66.61 | 73 | 73.9 | 74.33 | 79.6 | 0.978 |
| POS | MT/ref | 57.21 | 65.55 | 64.12 | 70.29 | 73.06 | 73.04 | 78.84 | 0.948 |

Table 1: Classifier performance (detection accuracy, %), including the R2 coefficient describing the correlation with BLEU.

| MT Engine | Example |
|---|---|
| Google Translate | These days, all but one were subject to a vote, and all had a direct link to the post September 11th. |
| Moses | these days , except one were the subject of a vote , and all had a direct link with the after 11 September . |
| Systran | From these days, all except one were the object of a vote, and all were connected a direct link with after September 11th. |
| Linguatec | Of these days, all except one were making the object of a vote and all had a straightforward tie with after September 11. |
| ProMT | These days, very safe one all made object a vote, and had a direct link with after September 11th. |
| Trident | From these all days, except one operated object voting, and all had a direct rope with after 11 septembre. |
| Skycode | In these days, all safe one made the object in a vote and all had a direct connection with him after 11 of September. |

Table 2: Outputs from several MT systems for the same source sentence (function words marked in bold)

3.3 In-House SMT Systems

| System | Parallel | Monolingual | BLEU |
|---|---|---|---|
| SMT-1 | 2000k | 2000k | 28.54 |
| SMT-2 | 1000k | 1000k | 27.76 |
| SMT-3 | 500k | 500k | 29.18 |
| SMT-4 | 100k | 100k | 23.83 |
| SMT-5 | 50k | 50k | 24.34 |
| SMT-6 | 25k | 25k | 22.46 |
| SMT-7 | 10k | 10k | 20.72 |

Table 3: Details for Moses based SMT systems

In the second experiment set, we test our detection method on SMT systems we created, in which we have control over the training data and the expected overall relative translation quality. To do so, we use the Moses statistical machine translation toolkit []. To train the systems, we take a portion of the Europarl corpus [], creating 7 different SMT systems, each using a different amount of training data for both the translation model and the language model. We do this in order to create translation systems of differing quality, details of which are given in Table 3. For purposes of classification, we use the same content-independent features as in the previous experiment, based on function words and POS tags, again with SMO-based SVM as the classifier. For data, we use 20,000 random, non-reference sentences from the Hansard corpus against 20,000 sentences from one MT system per experiment, again resulting in 40,000 sentence instances per experiment. The relationship between the detection results for each MT system and the BLEU score for that system, resulting in R2=0.789, is shown in Figure 2.


[Figure 2 plot: R2=0.789; x-axis BLEU, y-axis detection accuracy (%), with a dashed linear regression line. Data points (BLEU, accuracy): (29.18, 72.33), (28.54, 73.10), (27.76, 73.90), (24.34, 74.12), (23.83, 73.59), (22.46, 74.78), (20.72, 75.98).]

Figure 2: Correlation between detection accuracy and BLEU score on in-house Moses-based SMT systems against non-reference sentences using content independent features.

[Figure 3 plot: R2=0.774; x-axis human evaluation score, y-axis detection accuracy (%), with a dashed linear regression line. Data points (human score, accuracy): (0.638, 58.58), (0.604, 58.61), (0.591, 57.63), (0.573, 58.78), (0.562, 59.38), (0.541, 58.63), (0.512, 59.86), (0.486, 58.53), (0.439, 60.46), (0.429, 61.65), (0.420, 62.66), (0.389, 60.83), (0.322, 64.10).]

Figure 3: Correlation between detection accuracy and human evaluation scores on systems from WMT13 against reference sentences.

[Figure 4 plot: R2=0.556; x-axis human evaluation score, y-axis detection accuracy (%), with a dashed linear regression line. Data points (human score, accuracy): (0.638, 74.05), (0.604, 73.56), (0.591, 73.36), (0.573, 73.63), (0.562, 73.83), (0.541, 74.10), (0.512, 74.36), (0.486, 73.80), (0.439, 74.55), (0.429, 75.10), (0.420, 75.65), (0.389, 73.71), (0.322, 76.68).]

Figure 4: Correlation between detection accuracy and human evaluation scores on systems from WMT13 against non-reference sentences.

[Figure 5 plot: R2=0.829; x-axis human evaluation score, y-axis detection accuracy (%), with a dashed linear regression line. Data points (human score, accuracy): (0.638, 60.92), (0.604, 61.54), (0.591, 62.41), (0.573, 62.87), (0.562, 62.46), (0.541, 62.97), (0.512, 63.81), (0.486, 62.87), (0.439, 64.09), (0.429, 64.72), (0.420, 66.81), (0.389, 65.24), (0.322, 65.87).]

Figure 5: Correlation between detection accuracy and human evaluation scores on systems from WMT13 against non-reference sentences, using the syntactic CFG features described in Section 4.2.

4 Machine Translation Evaluation

4.1 Human Evaluation Experiments

As can be seen in the above experiments, there is a strong correlation between the BLEU score and the MT detection accuracy of our method. In fact, detection accuracy is linearly and negatively correlated with BLEU, both on the commercial systems and on our in-house SMT systems. We also wish to consider the relationship between detection accuracy and a human quality estimation score. To do this, we use the French-English data from the 8th Workshop on Statistical Machine Translation (WMT13) [], containing outputs from 13 different MT systems and their human evaluations. We conduct the same classification experiment as above, with features based on function words and POS tags, and SMO-based SVM as the classifier. We first use 3000 reference sentences from the WMT13 English reference translations against the matching 3000 output sentences from one MT system at a time, resulting in 6000 sentence instances per experiment. As can be seen in Figure 3, the detection accuracy is strongly correlated with the human evaluation scores, yielding R2=0.774. To provide another measure of correlation, we compared every pair of data points in the experiment to get the proportion of pairs ordered identically by the human evaluators and by our method, with a result of 0.846 (66 of 78).

In the second experiment, we use 3000 random, non-reference sentences from the newstest 2011-2012 corpora published in WMT12 [] against 3000 output sentences from one MT system at a time, again resulting in 6000 sentence instances per experiment. With the same classification method as for the reference sentences, the detection accuracy rises, but the correlation with translation quality weakens to R2=0.556, as can be seen in Figure 4. Here, the proportion of identically ordered pairs is 0.782 (61 of 78).
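The pairwise ordering measure can be sketched as below. This is one reading of "ordered identically", stated as an assumption: since lower detection accuracy is taken to indicate higher quality, a pair of systems counts as agreeing when the system with the higher human score has the lower detection accuracy, and ties count as disagreement. The function name is hypothetical; the data are the 13 (human score, detection accuracy) points plotted in Figure 3.

```python
from itertools import combinations


def ordered_pair_agreement(human_scores, detection_accuracies):
    """Count system pairs ranked the same way by humans and by the detector.

    A pair agrees when the system with the higher human score has the
    strictly lower detection accuracy (assumed interpretation).
    Returns (number of agreeing pairs, total number of pairs).
    """
    points = list(zip(human_scores, detection_accuracies))
    agree = 0
    total = 0
    for (h1, a1), (h2, a2) in combinations(points, 2):
        total += 1
        if (h1 - h2) * (a1 - a2) < 0:
            agree += 1
    return agree, total


# WMT13 French-English systems against reference sentences (Figure 3).
human = [0.638, 0.604, 0.591, 0.573, 0.562, 0.541, 0.512,
         0.486, 0.439, 0.429, 0.420, 0.389, 0.322]
accuracy = [58.58, 58.61, 57.63, 58.78, 59.38, 58.63, 59.86,
            58.53, 60.46, 61.65, 62.66, 60.83, 64.10]

agree, total = ordered_pair_agreement(human, accuracy)
print(agree, total, round(agree / total, 3))  # 66 78 0.846
```

With 13 systems there are 13 * 12 / 2 = 78 pairs, and under this reading the Figure 3 data give 66 agreeing pairs, consistent with the proportion reported above.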

4.2 Syntactic Features

We note that the second leftmost point in Figures 3 and 4 is an outlier: our method has a hard time detecting sentences produced by this system, although it is not highly rated by human evaluators. This point represents the Joshua [] SMT system. This system is syntax-based, which apparently confounds our POS- and FW-based classifier despite its low human evaluation score. We hypothesize that the use of syntax-based features might improve results. To verify this intuition, we create parse trees using the Berkeley parser [] and extract the one-level CFG rules as features. Again, we represent each sentence as a boolean vector, in which each entry represents the presence or absence of a CFG rule in the parse tree of the sentence. Using these features alone, without the FW- and POS-based features presented above, we obtain R2=0.829 with a proportion of identically ordered pairs of 0.923 (72 of 78), as shown in Figure 5.
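To make the CFG-rule features concrete, the following sketch extracts one-level rules from a parse given as a Penn-Treebank-style bracketed string and turns them into a boolean vector. It rests on several assumptions: the input format with a labeled root node, the exclusion of preterminal-to-word rules (to keep the features content-independent), and all function names, which are hypothetical. It does not call the Berkeley parser itself.

```python
def parse_bracketed(text):
    """Parse a bracketed tree such as '(S (NP (DT the)) ...)'.

    Returns nested (label, children) tuples; leaf words are plain strings.
    Assumes every bracket, including the root, carries a label.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1  # consume ')'
        return (label, children)

    return read_node()


def cfg_rules(tree):
    """Collect one-level rules 'PARENT -> CHILD ...' over nonterminal children."""
    label, children = tree
    child_labels = [c[0] for c in children if isinstance(c, tuple)]
    rules = set()
    if child_labels:
        rules.add(label + " -> " + " ".join(child_labels))
    for child in children:
        if isinstance(child, tuple):
            rules |= cfg_rules(child)
    return rules


def rule_vector(rules, vocabulary):
    """Boolean presence vector over a fixed list of rules."""
    return [rule in rules for rule in vocabulary]


tree = parse_bracketed("(S (NP (DT the) (NN cat)) (VP (VBD sat)))")
rules = cfg_rules(tree)
vocabulary = ["NP -> DT NN", "NP -> NN", "S -> NP VP", "VP -> VBD"]
print(sorted(rules), rule_vector(rules, vocabulary))
```

In a full experiment the vocabulary would be the set of rules observed across the training corpus, rather than the hand-written list used here.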

5 Discussion and Future Work

We have shown that it is possible to detect machine translation from monolingual corpora containing both machine translated text and human generated text, at sentence level. There is a strong correlation between the detection accuracy that can be obtained and the BLEU score or the human evaluation score of the machine translation itself. This correlation holds whether or not a reference set is used. This suggests that our method might be used as an unsupervised quality estimation method when no reference sentences are available, such as for resource-poor source languages. Further work might include applying our methods to other language pairs and domains, acquiring word-level quality estimation or integrating our method in a machine translation system. Furthermore, additional features and feature selection techniques can be applied, both for improving detection accuracy and for strengthening the correlation with human quality estimation.


Acknowledgments

We would like to thank Noam Ordan and Shuly Wintner for their help and feedback on the early stages of this work. This research was funded in part by the Intel Collaborative Research Institute for Computational Intelligence.