Tuesday, 12 June 2012

Evaluating Error Analysis Results From MT Systems across Genres (using Arabic)


Developers, researchers and end-users of Machine Translation (MT) systems are often interested in analysing their efficacy to establish the relative benefits of their application to translation. In other words, users of MT are interested in the quality of a system’s output and whether it produces ‘good’ translations. When it comes to judging the quality of MT performance, the basic tenet is captured in the maxim adopted by Papineni et al.: ‘The closer a machine translation is to a professional human translation, the better it is’ (2002: 1).

Numerous evaluative methods of judging MT system performance have been developed, and they fall into two broad categories: human evaluation and automatic evaluation. The most widely recognized benchmark for assessing MT quality is professional human translators, who make judgements based on standards of accuracy, fidelity and fluency, usually against ‘gold standard’ human-translated reference texts (King, 1996 in Przybocki et al., 2006: 1).

For the purpose of this investigation, MT output quality is seen from the perspective of the typical non-commercial end-user, who may be one of the millions who use MT systems on a casual basis. These casual end-users are most likely to utilize MT for informational purposes or to ‘get the gist’ of a text, and in such instances it is accuracy of semantic content that becomes the major aspect of quality (Koponen, 2010: 2). This report will adopt an error analysis scheme to classify and count errors in texts machine translated by the popular MT systems Systran and Google Translate. The error scheme focuses on semantic errors found in the target texts. Two text genres have been used, general politics and technology, although both texts are designed for general rather than technical or specialist audiences. The results demonstrate how both MT systems perform and whether patterns across text genres may be suggested. Finally, the results will be compared to scores generated by the automated evaluation metric BLEU to establish whether the general results of my own human evaluation are corroborated by a popular automatic metric.
Evaluating MT output quality is an important task that may be of interest to individuals and larger-scale commercial users wishing to decide which MT system to use. Error analysis is one way to do this, and comparing results from different systems may serve as a basis for determining the types and frequency of particular errors, and thus how a hierarchical classification of translation errors might serve as the basis for improving fluency in MT systems.


The texts selected for this experiment have been chosen to reflect the needs of average MT users, and comprise two source texts in Arabic: a general political text, and a technological text on the development of communication technology suitable for a lay reader. Both texts are taken from the Project Syndicate website, which provides high-quality articles in several languages; the translations are done by professional volunteer translators, who render the texts paragraph by paragraph. The two genres were chosen to observe whether levels of translation accuracy vary with text type, and both texts contain a range of sentence types as well as complex nouns, pronouns and names of organisations.

Each source text was translated from Arabic into English using two different MT systems: Systran, a freely available rule-based (RBMT) system from one of the oldest commercial organizations in the field; and Google Translate, a free statistical system. Rule-based systems rely on thousands of lexical and syntactic rules coded into the software, while statistical systems rely on statistical learning methods applied to large corpora to create a database of bilingual phrase tables and language models for translation (Koponen, 2010: 3).


Many error analyses attempt to identify errors in correlation with parts of speech such as verbs, prepositions and determiners[1]. However, such an approach departs from my stated aim of evaluating the semantic, rather than purely linguistic, aspects of MT output. This evaluation will instead probe the correlation of semantic aspects between source and target text as a general but effective indicator of fluency.

Analysis of the MT output has been performed by me, a professional translator competent in the language pair. Items in the target text that seem intuitively ‘unnatural’, or that are simply not fluent or adequate translations of the corresponding source text item from a semantic point of view, have been counted as errors using the following scheme adapted from Koponen (2010). Items can be larger than single words, since compound nouns and idioms, for example, are regarded as single semantic entities in this scheme and count as one error if incorrect. Semantic items were compared between the source and target texts with the human translations as ‘gold standard’ references.

It is of course the case that items can be translated in different ways and still remain legitimate translations by virtue of retaining the semantic content albeit in another expression. In this evaluation, such occurrences are called ‘substituted items’ if the reader of the target text can derive the same informational data from them as he could from the source text or human reference, and for this reason they are not classified as errors.
The following error categories were adopted based on Koponen’s scheme:

  • Omitted concept: ST concept that is not conveyed by the TT.
  • Added concept: TT concept that is not present in the ST.
  • Untranslated concept: SL words that appear in the TT (with Arabic as SL, untranslated concepts appear transliterated).
  • Mistranslated concept: a TT concept has the wrong meaning for the context.
  • Substituted concept: TT concept is not a direct lexical equivalent for the ST concept but can be considered a valid replacement for the context (i.e. a synonym).
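The counting logic of this scheme can be sketched in a few lines of Python. This is a minimal illustration of my own, not tooling used in the study: each annotated item carries one category, and substituted concepts are tallied but excluded from the error total, as described above.

```python
from collections import Counter
from enum import Enum

class ErrorType(Enum):
    OMITTED = "omitted concept"
    ADDED = "added concept"
    UNTRANSLATED = "untranslated concept"
    MISTRANSLATED = "mistranslated concept"
    SUBSTITUTED = "substituted concept"  # tracked, but not an error

def tally(annotations):
    """Count annotated items per category; substitutions are shown
    in the counts but excluded from the error total."""
    counts = Counter(annotations)
    errors = sum(n for t, n in counts.items() if t is not ErrorType.SUBSTITUTED)
    return counts, errors
```

Feeding the function a list of per-item annotations for a text then yields both the per-category breakdown (as in Table 1) and the total error count.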


The number of errors found in the target texts is shown in Table 1, while error rates and percentage analyses are presented in Table 2. Substituted items are shown but not counted as errors. The rule-based system, Systran, made a total of 472 errors, with a greater proportion, though not significantly so, in the technology genre. Its overall mean error rate was 21.37%. The statistical system, Google Translate, produced far fewer errors, just 135 across both genres, though with a far greater proportion in the technology genre. Overall, Google Translate’s combined mean error rate was 6.4%.
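The error rate used in Table 2 is simply the error count divided by the word count of the text, expressed as a percentage. A one-line sketch makes the measure explicit (the word count in the example is hypothetical, chosen for illustration; the actual counts are in the tables):

```python
def error_rate(errors, words):
    """Table 2's measure: errors per word, as a percentage."""
    return 100.0 * errors / words

# e.g. a hypothetical text of 1,000 words with 50 errors:
rate = error_rate(50, 1000)  # 5.0%
```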

Table 1 Error count in MT target texts according to genre and MT system
[Table giving, for Systran and Google Translate in each genre, counts of untranslated items, mistranslated items, substituted items, misordered items, and total errors]

Table 2 Total error counts and averages
[Table giving, for Systran and Google Translate, the error rate (errors/words), error rate (%), and mean error rate]



The error count results show some patterns but also reveal marked differences in performance between the MT systems, with Google Translate vastly outperforming Systran in preserving source semantic content. While both systems scored higher error counts in the technology genre than in politics, Google Translate’s error rate in the political domain was only a tenth of Systran’s, and around half in the technological domain.

By far the most common error in both systems across the genres was mistranslated items (see Table 3), accounting for 72.2% of Systran’s errors and 58.5% of Google Translate’s.

Table 3 Rate of error type
[Table giving, for Systran and Google Translate, the proportion of total errors accounted for by untranslated, mistranslated, substituted, and misordered items]

That both systems recorded the highest number of errors in the ‘mistranslated’ category gives weight to the proposition that semantic mismatches in the form of mistranslations are the most common issue for MT systems in general. However, we must acknowledge that mistranslations occur on a cline from fluency to non-fluency: some mistranslations might still relate readably to the context, whilst others greatly impede comprehension. In the sample segment below we note that mistranslations in many sentences preclude adequate comprehension for informational purposes, whilst others merely degrade it.

Arabic source text-
وكان من بين المزايا التي يحصل عليها عضو البرلمان الحق في تخصيص خمسة عشر خطاً هاتفياً لمن يعتبره مستحقا.
Human reference translation-
‘Members of parliament had among their privileges the right to allocate 15 telephone connections to whomever they deemed worthy.’
Gloss translation-
‘Was among the privileges that acquired them member of the parliament the right to allocate fifteen line telephone to-who he considered deserving.’
Systran-
‘Thevirtues were among which collects raised hermember of the parliament the right forspecification of five ten lines is telephoneblamed considers him deserved.’
Google Translate-
‘One of the advantages obtained by the Member of Parliament the right to allocate fifteen telephone line for those he considers worthy.’

An exhaustive list of errors classified according to parts of speech is beyond the scope of this report, but the major errors of each system can be seen in this representative example taken from the technological text. The Systran TT is incomprehensible to the extent that users cannot derive from it the same semantic content they otherwise could from the human translated text (of course, in authentic MT scenarios the user does not have access to a human ‘gold standard’ translation for reference).

Among the major mistranslation issues frequently occurring in Systran are:
  • Words joined together across the grammatical spectrum: definite articles, prepositions, possessives, etc.
  • A very high occurrence of homography: erroneous assignment of part-of-speech categories, a common issue for MT systems and especially for direct transfer systems (Lehrberger & Bourbeau, 1988: 15).
  • Failure to translate numbers: خمسة عشر (fifteen) is rendered ‘five ten’.
  • Confusion of definite and indefinite articles.
As for Google Translate, the most common errors are:
  • Omission of the verb ‘to be’ as seen in the example above. Arabic does not usually use the present tense of this verb and this may be indicative of why Google has difficulty ‘detecting’ its semantic import in the source text.
  • Confusion of definite and indefinite articles.
  • Some cases of homography.
Automatic MT evaluation

Automatic MT evaluation methods build on the idea of proximity to professional human translation by developing metrics that can account for and replicate human evaluations of MT output quality. One such automatic metric is the Bilingual Evaluation Understudy (BLEU), developed by a research team at IBM. BLEU scores translations on a scale from 0 to 1, with 1 being a perfect match to the reference human translation.
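To give a rough sense of how BLEU arrives at its score, here is a minimal, unsmoothed sentence-level sketch of its core idea: clipped n-gram precision combined with a brevity penalty. The real metric defined by Papineni et al. is computed over a whole corpus and handles zero counts more gracefully, so this is illustrative only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU sketch against one reference."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        c_counts, r_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count at its count in the reference.
        clipped = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        if clipped == 0:
            return 0.0  # no overlap at this order; unsmoothed BLEU is 0
        log_precisions.append(math.log(clipped / sum(c_counts.values())))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

A candidate identical to the reference scores 1.0; one sharing no words with it scores 0.0, with partial overlaps falling in between.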

With the error rate scores previously produced by human evaluation in mind, we can form and test the following hypothesis:

Since the BLEU metric measures the precision of machine translations by analysing their closeness to the reference translation, we can predict that a text with more errors will achieve a lower BLEU score. On this premise, Google Translate should achieve the higher BLEU score, indicating that its output is a closer and more precise translation than the one offered by Systran.

This proves to be the case (see Chart 1 below), and the large gap in error rate between the two systems is reflected in the BLEU scores. However, whereas the error rate evaluation showed technology faring worse than politics in both systems, the ranking of the two genres appears inverted in the BLEU scores.

Chart 1 Automated BLEU Evaluation.


The results of this report reveal that Google Translate performs better, in fact significantly so, than Systran in translating from Arabic into English in two text genres, politics and technology. Not only did Systran, the rule-based system, make more errors; it is also evident from the error type analysis and the translation segment examples shown that its semantic content was far less adequate than that produced by the statistical system Google Translate. The automatic evaluation scores provided by BLEU substantiate this, although BLEU disagrees on the relative performance of the two genres. It is of course the case that human evaluation is inherently subjective and open to vagaries, yet we have seen that it was able to form the basis of a successful hypothesis affirmed by BLEU, an automated evaluation metric.

Although Systran made more errors overall and may be deemed less precise, the severity of semantic mismatch cannot be ascertained from these results, as certain errors will have a greater impact in misconstruing meaning than others. To account for this, further studies may wish to assign a weighting to each error type according to how critical it is in distorting the semantic content of the source text. Further studies may also wish to assess whether error rates and types correlate with the type of MT system used.

Based on the conclusions in this report, I would recommend that Arabic-English users of automated MT systems use Google Translate as their preferred option over Systran, due to the greater quality of its output as demonstrated in this report’s findings.

[1] See for example the taxonomy developed in Elliott et al. (2004).


Elliott, D., Hartley, A. & Atwell, E. (2004): ‘A fluency error categorization scheme to guide automated machine translation evaluation’. In: Machine Translation: From Real Users to Research. 6th Conference of the Association for Machine Translation in the Americas, AMTA 2004, Washington, DC, September 28 – October 2, 2004; ed. Robert E. Frederking and Kathryn B. Taylor. Berlin: Springer Verlag, pp. 64-73.

King, M. (1996). ‘Evaluating Natural Language Processing Systems’. In: Communications of the ACM (39) 1, pp.73–79.

Koponen, M. (2010): ‘Assessing Machine Translation Quality with Error Analysis’. In: Electronic Proceedings of the KäTu Symposium on Translation and Interpreting Studies 4 (2010).

Lehrberger, J. & Bourbeau, L. (1988): Machine Translation: Linguistic Characteristics of MT Systems and General Methodology of Evaluation. Lingvisticae Investigationes Supplementa 15. Amsterdam/Philadelphia: John Benjamins.

Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. (2002): ‘BLEU: A Method for Automatic Evaluation of Machine Translation’. In: ACL 2002: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, July 2002, pp. 311-318.

Przybocki, M., Sanders, G. & Le, A. (2006): ‘Edit Distance: A Metric for Machine Translation Evaluation’. In: LREC 2006.

