Introduction
Developers, researchers and end-users of Machine Translation
(MT) systems are often interested in analysing their efficacy to establish the
relative benefits of their application to translation. In other words, users of
MT are interested in the quality of a system’s output and whether it produces
‘good’ translations. When it comes to judging the quality of MT performance,
the basic tenet is captured in the maxim adopted by Papineni et al., ‘The
closer a machine translation is to a professional human translation, the better
it is’ (2001: 1).
Numerous methods of evaluating MT system performance have been developed, and
they fall into two broad categories: human evaluation and automatic evaluation.
The most widely recognized benchmark for assessing MT quality is the judgement
of professional human translators, who assess output against standards of
accuracy, fidelity and fluency, usually with reference to ‘gold standard’
human-translated texts (King, 1996 in Przybocki et al. 2006: 1).
For the purpose of this investigation, MT output quality is seen from the
perspective of the typical non-commercial end-user who may be one of the
millions who use MT systems on a casual basis. These casual end-users are most
likely to utilize MT for informational purposes or to ‘get the gist’ of a text,
and in such instances it is accuracy of semantic content that becomes the major
aspect of quality (Koponen, 2010: 2). This report adopts an error analysis
scheme to classify and count errors in texts machine translated by the popular
MT systems Systran and Google Translate. The error scheme focuses on semantic
errors found in the target texts. Two text genres have been used, general
politics and technology, although both texts are designed for general rather
than technical or specialist audiences. The results demonstrate how both MT
systems perform and whether patterns across text genres may be suggested.
Finally, the results are compared to scores generated by the automated
evaluation metric BLEU to establish whether my own human evaluation and a
popular automatic metric corroborate each other in their general results.
Evaluating MT output quality is an important task that may be of interest to
individuals and larger scale commercial users wishing to decide which MT system
to use. Error analysis is one way to do this: comparing results from different
systems can reveal the types and frequency of particular errors, and thus
suggest how a hierarchical classification of translation errors might serve as
a basis for improving fluency in MT systems.
Materials
The texts selected for this experiment were chosen to reflect the needs of
average MT users. They comprise two source texts in Arabic: a general political
text, and a technological text on the development of communication technology
that is suitable for a lay reader. Both texts are taken from the Project
Syndicate website, which provides high quality articles in several languages;
the translations are done by professional volunteer translators who render the
texts paragraph for paragraph. The two genres were chosen to observe whether
levels of translation accuracy vary with text type, and both texts contain a
range of sentence types as well as complex nouns, pronouns and names of
organisations.
Each source text was translated from Arabic into English using two different MT
systems: Systran, a freely available rule-based MT (RbMT) system from one of
the oldest commercial organizations in the field; and Google Translate, a free
statistical system. Rule-based systems rely on thousands of lexical and
syntactic rules coded into the software, while statistical systems rely on
statistical learning methods applied to large corpora to create a database of
bilingual phrase tables and language models for translation (Koponen, 2010: 3).
Methodology
Many error analyses attempt to identify errors in correlation with parts of
speech such as verbs, prepositions and determiners. However, such an approach
departs from my stated aim of evaluating the semantic, rather than purely
linguistic, aspects of MT output. This evaluation will instead probe the
correspondence of semantic content between source and target text as a general
but effective indicator of fluency.
Analysis of the MT output has been performed by myself, a professional
translator who is competent in the language pair. Items in the target text that
seem intuitively ‘unnatural’, or that are simply not fluent or adequate
translations of the corresponding source text item from a semantic point of
view, have been counted as errors using the following scheme adapted from
Koponen (2010). Items can be larger than single words, since compound nouns and
idioms, for example, are regarded as a single semantic entity in this scheme
and count as one error if incorrect. Semantic items were compared between the
source and target texts, with the human translations serving as ‘gold standard’
references.
It is of course the case that items can be translated in different ways and
still remain legitimate translations by virtue of retaining the semantic
content, albeit in another expression. In this evaluation, such occurrences are
called ‘substituted items’ if the reader of the target text can derive the same
information from them as they could from the source text or human reference,
and for this reason they are not classified as errors.
The following error categories were adopted based on Koponen’s scheme:
- Omitted concept: ST concept that is not conveyed by the TT.
- Added concept: TT concept that is not present in the ST.
- Untranslated concept: SL words that appear in the TT (with Arabic as SL,
untranslated concepts appear transliterated).
- Mistranslated concept: a TT concept has the wrong meaning for the context.
- Substituted concept: TT concept is not a direct lexical equivalent for the ST
concept but can be considered a valid replacement for the context (i.e. a
synonym).
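The counting procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the `Category` enum, the `tally` helper and the sample annotations are my own hypothetical names and values, not part of Koponen's scheme or of the data in this report. It shows one way to record one label per semantic item and to exclude substituted items from the error total.

```python
from collections import Counter
from enum import Enum


class Category(Enum):
    OMITTED = "omitted concept"
    ADDED = "added concept"
    UNTRANSLATED = "untranslated concept"
    MISTRANSLATED = "mistranslated concept"
    SUBSTITUTED = "substituted concept"  # valid replacement, not an error


def tally(annotations):
    """Return per-category counts and the total number of errors."""
    per_category = Counter(annotations)
    errors = sum(
        n for category, n in per_category.items()
        if category is not Category.SUBSTITUTED
    )
    return per_category, errors


# Hypothetical annotations for one short segment (not data from this study).
sample = [
    Category.MISTRANSLATED,
    Category.MISTRANSLATED,
    Category.OMITTED,
    Category.SUBSTITUTED,
]
per_category, errors = tally(sample)
print(errors)  # 3: the substituted item is recorded but not counted
```

Because a compound noun or idiom is treated as a single semantic item, it would receive a single label in such a list however many words it contains.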
Results
The number of errors found in the target texts is shown in Table 1, while error
rates and mean error rates are presented in Table 2. Substituted items are
shown but not counted as errors. The rule-based system, Systran, made a total
of 472 errors, with a slightly higher error rate in the technology genre than
in politics. Its overall mean error rate was 21.37%. The statistical system,
Google Translate, produced far fewer errors, with just 135 across both genres,
the great majority of them in the technology genre. Overall, the Google
Translate mean error rate was 6.4%.
Table 1 Error count in MT target texts according to genre and MT system
| MT system | Genre | Omissions | Additions | Untranslated items | Mistranslated items | Substituted items | Misordered items | Total errors |
|---|---|---|---|---|---|---|---|---|
| Systran | Politics | 8 | 12 | 5 | 189 | 12 | 30 | 244 |
| Systran | Technology | 20 | 6 | 5 | 152 | 0 | 45 | 228 |
| Google Translate | Politics | 2 | 5 | 2 | 13 | 9 | 6 | 28 |
| Google Translate | Technology | 20 | 1 | 2 | 66 | 0 | 18 | 107 |
Table 2 Total error counts and averages
| MT system | Genre | Errors/words | Error rate (%) | Mean error rate |
|---|---|---|---|---|
| Systran | Politics | 244/1183 | 20.6% | 21.3% |
| Systran | Technology | 228/1030 | 22.14% | |
| Google Translate | Politics | 28/1147 | 2.4% | 6.4% |
| Google Translate | Technology | 107/1028 | 10.4% | |
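The arithmetic behind Table 2 can be reproduced with a short sketch. The error and word counts are those reported in the tables; the function names are my own, and I assume that the mean error rate is the unweighted mean of the two per-genre rates. Recomputed values may differ from the reported figures in the last decimal place because of rounding.

```python
# Error counts and word counts as reported in Tables 1 and 2.
results = {
    ("Systran", "Politics"): (244, 1183),
    ("Systran", "Technology"): (228, 1030),
    ("Google Translate", "Politics"): (28, 1147),
    ("Google Translate", "Technology"): (107, 1028),
}


def error_rate(errors, words):
    """Errors per 100 words."""
    return 100 * errors / words


def mean_error_rate(system):
    """Unweighted mean of the per-genre error rates for one system."""
    rates = [
        error_rate(errors, words)
        for (name, _genre), (errors, words) in results.items()
        if name == system
    ]
    return sum(rates) / len(rates)


for (system, genre), (errors, words) in results.items():
    print(f"{system:16} {genre:10} {error_rate(errors, words):5.1f}%")
for system in ("Systran", "Google Translate"):
    print(f"{system:16} {'mean':10} {mean_error_rate(system):5.1f}%")
```

Pooling the two texts instead (total errors divided by total words) gives a slightly different figure for each system, so it is worth stating which averaging method is used when comparing systems.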
Discussion
The error count results show some patterns but also reveal widely divergent
performance across the MT systems, with Google Translate vastly outperforming
Systran in preserving source semantic content. While both systems recorded
higher error rates for the technology genre than for politics, Google
Translate’s error rate was only around a tenth of Systran’s in the political
domain, and around a half in the technological domain.
By far the most common error in both systems across the genres was
mistranslated items (see Table 3), which account for 72.2% of errors in Systran
and 58.5% in Google Translate.
Table 3 Rate of error type
| MT system | Omissions | Additions | Untranslated items | Mistranslated items | Substituted items | Misordered items | Total errors |
|---|---|---|---|---|---|---|---|
| Systran | 5.9% | 3.8% | 2.1% | 72.2% | 2.5% | 15.9% | 472 |
| Google Translate | 16.3% | 4.4% | 2.9% | 58.5% | 6.7% | 17.8% | 135 |
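The shares in Table 3 follow from the counts in Table 1. As a minimal sketch, with names of my own choosing, each category count is summed over both genres and divided by the system's total number of errors. Substituted items are left out here because they are not counted as errors, so the percentages computed this way sum to 100 for each system.

```python
# Error counts per category, summed over both genres (politics + technology)
# from Table 1. Substituted items are excluded as they are not errors.
counts = {
    "Systran": {
        "omissions": 8 + 20,
        "additions": 12 + 6,
        "untranslated": 5 + 5,
        "mistranslated": 189 + 152,
        "misordered": 30 + 45,
    },
    "Google Translate": {
        "omissions": 2 + 20,
        "additions": 5 + 1,
        "untranslated": 2 + 2,
        "mistranslated": 13 + 66,
        "misordered": 6 + 18,
    },
}


def type_shares(system):
    """Percentage of a system's errors that fall into each category."""
    total = sum(counts[system].values())
    return {category: 100 * n / total for category, n in counts[system].items()}


for system in counts:
    shares = type_shares(system)
    print(system, {category: round(share, 1) for category, share in shares.items()})
```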
That both systems recorded the highest number of errors in the ‘mistranslated’
category gives weight to the proposition that semantic mismatches in the form
of mistranslations are the most common issue for MT systems in general.
However, we must acknowledge that mistranslations occur on a cline from fluency
to non-fluency: some mistranslations still relate readily to the context,
whilst others greatly impede comprehension. In the sample segment below we note
that some mistranslations preclude adequate comprehension for informational
purposes, whilst others merely worsen comprehension.
Arabic source text:
وكان من بين المزايا التي يحصل عليها عضو البرلمان
الحق في تخصيص خمسة عشر خطاً هاتفياً لمن يعتبره مستحقا.
Human reference translation:
‘Members of parliament had among their privileges the right to allocate 15
telephone connections to whomever they deemed worthy.’
Gloss translation:
‘Was among the privileges that acquired them member of the parliament the right
to allocate fifteen line telephone to-who he considered deserving.’
Systran:
‘Thevirtues were among which collects raised hermember of the parliament the
right forspecification of five ten lines is telephoneblamed considers him
deserved.’
Google Translate:
‘One of the advantages obtained by the Member of Parliament the right to
allocate fifteen telephone line for those he considers worthy.’
An exhaustive list of errors classified according to parts
of speech is beyond the scope of this report, but the major errors of each
system can be seen in this representative example taken from the technological
text. The Systran TT is incomprehensible to the extent that users cannot derive
from it the same semantic content they otherwise could from the human
translated text (of course, in authentic MT scenarios the user does not have
access to a human ‘gold standard’ translation for reference).
Among the major mistranslation issues frequently occurring in Systran are:
- Words joined together across the grammatical spectrum: definite articles,
prepositions, possessives etc.
- A very high occurrence of homography, leading to erroneous assignment of
part-of-speech categories, which is a common issue for MT systems and
especially direct transfer systems (Lehrberger & Bourbeau, 1988: 15).
- Failure to translate numbers: خمسة عشر (fifteen) is rendered ‘five ten’.
- Confusion of definite and indefinite articles.
As for Google Translate, the most common errors are:
- Omission of the verb ‘to be’, as seen in the example above. Arabic does not
usually use the present tense of this verb, which may explain why Google
Translate has difficulty ‘detecting’ its semantic import in the source text.
- Confusion of definite and indefinite articles.
- Some cases of homophony.
Automatic MT evaluation
Automatic MT evaluation methods build on the idea of proximity to professional
human translation by developing metrics that can account for and replicate
human evaluations of MT output quality. One such automatic metric is the
Bilingual Evaluation Understudy (BLEU), which was developed by a research team
at IBM. BLEU scores translations on a scale from 0 to 1, with 1 indicating a
perfect match to the reference human translation.
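To make the idea concrete, the following is a simplified sketch of a BLEU-style score, not the tool used to produce the scores in this report. It assumes a single reference, whitespace tokenisation, n-grams up to length 4 and no smoothing, and it scores one sentence at a time, whereas BLEU proper is computed over a whole test corpus. The function names are my own; the three example strings are the reference, Google Translate and Systran renderings of the sample segment quoted in the Discussion.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified single-reference, sentence-level BLEU in [0, 1]."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # no smoothing: one empty n-gram order gives 0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalise candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


reference = ("Members of parliament had among their privileges the right to "
             "allocate 15 telephone connections to whomever they deemed worthy.")
google = ("One of the advantages obtained by the Member of Parliament the right "
          "to allocate fifteen telephone line for those he considers worthy.")
systran = ("Thevirtues were among which collects raised hermember of the "
           "parliament the right forspecification of five ten lines is "
           "telephoneblamed considers him deserved.")

print(bleu(google, reference), bleu(systran, reference))
```

On this single segment the Systran output shares no three-word sequence with the reference, so the unsmoothed score is 0, while the Google Translate output scores above 0 because it shares the sequence ‘the right to allocate’. Scores on one sentence are very coarse, which is one reason the metric is normally applied to whole texts.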
With the error rate scores previously produced by human evaluation in mind, we
can form and test the following hypothesis:
Since the BLEU metric measures the precision of a machine translation by
analysing its closeness to the reference translation, we can predict that a
text with more errors will achieve a lower BLEU score. On this premise, Google
Translate should achieve a higher BLEU score than Systran, indicating that it
offers a closer and more precise translation.
This proves to be the case (see Chart 1 below), and the large gap in error rate
between the two systems seems to be reflected in the BLEU scores. However,
whereas the error rate evaluation showed higher error rates for technology than
for politics in both systems, the BLEU scores appear to show the inverse
pattern.
Chart 1 Automated BLEU Evaluation.
Conclusion
The results of this report reveal that Google Translate performs better than
Systran, in fact markedly so, in translating from Arabic into English in two
text genres, politics and technology. Not only did Systran, the rule-based
system, make more errors, it is also evident from the error type analysis and
the translation segment examples shown that its output conveyed the semantic
content far less adequately than that produced by the statistical system,
Google Translate. The automatic evaluation scores provided by BLEU substantiate
this, although they disagree with the error analysis on the relative
performance across the two genres. It is of course the case that human
evaluation is inherently subjective and open to vagaries, yet we have seen that
it was able to form the basis of a successful hypothesis affirmed by BLEU, an
automated evaluation metric.
Although Systran made more errors overall and may be deemed less precise, the
degree and severity of semantic mismatch cannot be ascertained from these
results, as certain errors distort meaning more than others. To account for
this, further studies may wish to assign a weighting to each error type
according to how critically it distorts the semantic content of the source
text. Further studies might also assess whether error rates and error types
correlate with the type of MT system used.
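A weighted error rate of the kind suggested here could be sketched as follows. The severity weights are purely hypothetical values chosen for illustration; they are not derived from this study or from any published scheme, and a real study would need to justify them empirically. The counts are those of the politics text in Table 1 and the word counts those of Table 2.

```python
# Hypothetical severity weights: illustrative values only, NOT results of
# this study. A weight of 1.0 leaves a category at its raw count.
WEIGHTS = {
    "omissions": 1.0,
    "additions": 0.5,
    "untranslated": 1.0,
    "mistranslated": 1.0,
    "misordered": 0.5,
}


def weighted_error_rate(counts, words, weights=WEIGHTS):
    """Weighted errors per 100 words."""
    weighted_errors = sum(weights[category] * n for category, n in counts.items())
    return 100 * weighted_errors / words


# Politics text: counts from Table 1, word counts from Table 2.
systran_politics = {"omissions": 8, "additions": 12, "untranslated": 5,
                    "mistranslated": 189, "misordered": 30}
google_politics = {"omissions": 2, "additions": 5, "untranslated": 2,
                   "mistranslated": 13, "misordered": 6}

print(round(weighted_error_rate(systran_politics, 1183), 2))
print(round(weighted_error_rate(google_politics, 1147), 2))
```

With every weight set to 1.0 the function reduces to the unweighted error rate of Table 2, so the weighting can be introduced without changing how the baseline figures are computed.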
Based on the conclusions in this report, I would recommend that Arabic-English
users of automated MT systems use Google Translate as their preferred option
over Systran due to its higher-quality output, as demonstrated in this report’s
findings.
See, for example, the taxonomy developed by Elliott et al. (2004).
Bibliography
Elliott, D., Hartley, A. & Atwell, E. (2004): ‘A fluency error categorization
scheme to guide automated machine translation evaluation’. In: Machine
translation: from real users to research: 6th conference of the Association for
Machine Translation in the Americas, AMTA 2004, Washington, DC, September 28 –
October 2, 2004; ed. Robert E. Frederking and Kathryn B. Taylor (Berlin:
Springer Verlag, 2004); pp. 64-73.
King, M. (1996): ‘Evaluating Natural Language Processing Systems’. In:
Communications of the ACM (39) 1, pp. 73–79.
Koponen, M. (2010): ‘Assessing Machine Translation Quality
with Error Analysis’. In: Electronic Proceedings of the KäTu Symposium on Translation and
Interpreting Studies 4 (2010).
Lehrberger, J. & Bourbeau, L. (1988): Machine Translation: Linguistic
Characteristics of MT Systems and General Methodology of Evaluation,
Lingvisticae Investigationes Supplementa 15, Amsterdam/Philadelphia: John
Benjamins.
Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. (2002): ‘BLEU: A Method for
Automatic Evaluation of Machine Translation’. ACL 2002: Proceedings of the 40th
Annual Meeting of the Association for Computational Linguistics. Philadelphia,
July 2002, pp. 311-318.
Przybocki, M., Sanders, G. & Le, A. (2006): ‘Edit Distance: A Metric for
Machine Translation Evaluation’. In: LREC (2006).