Evaluating Machine Translation Providers

Dec 28, 2020
How to say good-bye in several languages. Photo by JACQUELINE BRANDWAYN on Unsplash

Anno.Ai’s data science team has continued evaluating online machine learning model providers by testing machine translation offerings. This evaluation follows our previous benchmarking studies of handwriting recognition, named entity recognition, and automatic license plate recognition providers.

Background and Use Case

For this exercise, we envisioned a customer who needs to quickly grasp the meaning, tone, and intent of short text blocks. The type of text might range from a formal document to more colloquial communications containing idioms, abbreviations, or social media-style hashtags. This customer may not need publication-ready translated output, but understanding nuance and context from a variety of speaker/writer styles is key.

With this customer in mind, we benchmarked the performance of three major cloud service providers on translating text from four key languages into English. We chose our test languages (Arabic, Mandarin Chinese, Persian, and Russian) to be both challenging and linguistically diverse. We tested the following providers, all of which offer language detection, the integration of domain-specific glossaries, and the ability to customize translation output:

  • Google Cloud Platform (GCP) Translation: Available in basic or advanced options depending on the level of customization required. Supports more than 100 language pairs.
  • Amazon (AWS) Translate: Available in batch or real-time options and supports translation between 71 languages.
  • Microsoft Azure Translator: Also offers transliteration, dictionaries, and integration with speech-to-text or text-to-speech modules. Supports more than 70 languages, although not all languages are available for all services.

Test Data

Given our potential use case, we looked for test datasets of sentence-aligned pairs (rather than paragraph- or document-aligned pairs) that incorporated a wide variety of text and writing styles. The sentence-aligned datasets provide a sentence in a source language followed by a reference translation in the target language (in our case, English). For Arabic, Persian, and Russian we used the Global Voices corpus. This dataset consists of news segments from the Global Voices website, which is available in more than 50 languages. For Chinese, however, we used the test dataset from the news translation task of the 2019 Association for Computational Linguistics Annual Meeting.
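As a rough illustration of what "sentence-aligned" means in practice, the sketch below pairs each source sentence with its reference translation. It assumes the data arrives as two line-aligned sequences (one sentence per line, same order on both sides); the `SentencePair` type and `pair_sentences` helper are hypothetical names for this example, not part of either dataset's tooling.

```python
from typing import Iterable, List, NamedTuple


class SentencePair(NamedTuple):
    source: str     # sentence in the source language
    reference: str  # human reference translation in English


def pair_sentences(source_lines: Iterable[str],
                   reference_lines: Iterable[str]) -> List[SentencePair]:
    """Zip two line-aligned sequences into sentence pairs."""
    sources = [line.strip() for line in source_lines]
    references = [line.strip() for line in reference_lines]
    if len(sources) != len(references):
        raise ValueError("inputs are not sentence-aligned: line counts differ")
    # Skip positions where either side is blank.
    return [SentencePair(s, r) for s, r in zip(sources, references) if s and r]


# Hypothetical in-memory example; real data would be read from files.
pairs = pair_sentences(["Привет, мир!", "До свидания."],
                       ["Hello, world!", "Goodbye."])
print(pairs[0].reference)
```

Checking the line counts up front is a cheap way to catch a misaligned file before any translations are scored against the wrong reference.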

Our test data was constructed using news-based datasets. Photo by Bianca Sbircea-Constantin on Unsplash


After running each of the source-language sentences through the three providers’ APIs and recording their translations, we conducted a two-phase review to evaluate the quality of each translation: one phase using automated metrics and one using human reviewers. The automated metrics assign a score to the providers’ translations by comparing their similarity to the reference translations. For this phase, we used three different metrics, each of which is optimized for different characteristics:

  • BLEU is one of the most popular metrics and is commonly used as a baseline;
  • METEOR is designed to correlate well with human judgment at the sentence level rather than only at the corpus level;
  • NIST is similar to BLEU, but is weighted to favor more meaningful (rarer) word combinations.
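To make the idea of scoring a translation against a reference concrete, here is a minimal, simplified sentence-level BLEU in plain Python. It is a sketch only: it assumes a single reference, whitespace tokenization, and add-one smoothing, so its scores will not match those of a standard BLEU implementation, and it is not necessarily how the scores in this evaluation were computed.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Simplified single-reference BLEU with add-one smoothing."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not hyp:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped matches: an n-gram is credited at most as often
        # as it appears in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        # Add-one smoothing keeps short sentences from scoring exactly zero.
        log_precision += math.log((matches + 1) / (total + 1)) / max_n
    # Brevity penalty discourages translations shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_precision)


reference = ("There was something on the books allegedly dealing with "
             "the issue, but it was totally ineffectual.")
hypothesis = ("There were some laws supposedly dealing with this issue, "
              "but they were absolutely ineffective.")
print(round(sentence_bleu(reference, hypothesis), 3))
```

The example pair illustrates a known limitation of overlap metrics: a translation can convey the meaning well while sharing few exact n-grams with the reference, which is one reason to pair automated scores with human review.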

After calculating the metrics for each translated sentence, we recorded the mean and standard deviation for each metric/language/provider.
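The aggregation step can be sketched as follows. The record layout `(language, provider, metric, score)` and the `summarize` helper are assumptions made for this example, as is the use of the sample standard deviation (the article does not say whether sample or population deviation was recorded).

```python
from collections import defaultdict
from statistics import mean, stdev
from typing import Dict, Iterable, Tuple

# One record per scored sentence: (language, provider, metric, score).
Record = Tuple[str, str, str, float]
Key = Tuple[str, str, str]


def summarize(records: Iterable[Record]) -> Dict[Key, Tuple[float, float]]:
    """Group scores by (language, provider, metric) and return (mean, std)."""
    groups = defaultdict(list)
    for language, provider, metric, score in records:
        groups[(language, provider, metric)].append(score)
    return {
        # stdev needs at least two values; report 0.0 for a single score.
        key: (mean(scores), stdev(scores) if len(scores) > 1 else 0.0)
        for key, scores in groups.items()
    }
```

A group with a low mean but also a low standard deviation points to consistently mediocre output, while a high standard deviation suggests quality that varies from sentence to sentence.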

In the second phase, we randomly selected 15 to 20 sentences from the test datasets and sent the corresponding translations from each provider to source-language linguists for human review. The linguists rated the translations on a scale of 1 to 5 on how well the translation captured the meaning of the original sentence. Finally, a separate human reviewer rated each of the translations on its fluency in English, and we recorded an average of the two human ratings.
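A minimal sketch of the sampling and averaging steps is shown below. The function names, the fixed random seed, and the assumption that fluency was rated on the same 1 to 5 scale as meaning are all illustrative choices, not details confirmed by the review process described above.

```python
import random
from typing import List


def sample_for_review(sentence_ids: List[int], k: int, seed: int = 0) -> List[int]:
    """Pick k distinct sentences at random; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return sorted(rng.sample(sentence_ids, k))


def combined_rating(meaning: float, fluency: float) -> float:
    """Average the linguist's meaning score and the reviewer's fluency score."""
    for score in (meaning, fluency):
        # Assumption: both ratings use the same 1-5 scale.
        if not 1 <= score <= 5:
            raise ValueError("ratings must be on the 1-5 scale")
    return (meaning + fluency) / 2


# Draw 15 sentences from a hypothetical test set of 1,000.
review_ids = sample_for_review(list(range(1000)), k=15)
print(len(review_ids), combined_rating(meaning=4, fluency=5))
```

Sending the same sampled sentences to every provider keeps the human ratings comparable across providers.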


The two tables below show the numeric results of our review, including the mean and standard deviation for each provider, each metric, and each language. The rows shaded in gray show the sum of scores for all three metrics, with the top score highlighted in bold text.

We were surprised to discover that the three providers showed comparable performance, and none emerged as a clear winner. Although AWS slightly edged out the others in three out of four languages as scored by the metric means, GCP received the highest human ratings in three out of four languages. Azure scored lower in metric means (except in Chinese) and human ratings, but showed the lowest standard deviation across languages, indicating more consistency in translation quality.
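The summed-metric comparison described above can be sketched along these lines. The provider labels and numbers are placeholders invented for the example and are not the results of this evaluation; `summed_scores` and `top_provider` are hypothetical helper names.

```python
from typing import Dict

MetricMeans = Dict[str, Dict[str, float]]


def summed_scores(metric_means: MetricMeans) -> Dict[str, float]:
    """Add up each provider's mean score across all metrics."""
    return {provider: sum(scores.values())
            for provider, scores in metric_means.items()}


def top_provider(metric_means: MetricMeans) -> str:
    """Return the provider with the highest summed score."""
    totals = summed_scores(metric_means)
    return max(totals, key=totals.get)


# Placeholder values for one language, NOT the article's results.
example = {
    "provider_a": {"BLEU": 0.30, "METEOR": 0.50, "NIST": 2.0},
    "provider_b": {"BLEU": 0.32, "METEOR": 0.52, "NIST": 2.1},
    "provider_c": {"BLEU": 0.28, "METEOR": 0.49, "NIST": 1.9},
}
print(top_provider(example))
```

One caveat with a plain sum is that the metrics are not on a common scale (NIST scores in particular are not bounded to the 0 to 1 range), so the metric with the largest values has the most influence on the total.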

Created by Anno.Ai


The following examples demonstrate the comparable performance of the three providers. The examples show how the providers sometimes choose incorrect words, but overall produce meaningful results.

1. Source language: Arabic. Reference translation: “Agadir Souss (fr) is concerned as well”

  • GCP: “SOS Agadir blogger also expressed concern.”
  • AWS: “Sous Agadir’s blog also expressed concern.”
  • Azure: “The blog of Souss Agadir also expressed concern.”

2. Source language: Chinese. Reference translation: “Previously, it was informed that the Russian navy would equip the “Kazan” submarine in 2019.”

  • GCP: “Earlier, it was reported that the Russian Navy would install the “Kazan” submarine in 2019.”
  • AWS: “It follows news that the Russian Navy will list the submarine USS “Kazan” in 2019.”
  • Azure: “It follows news that the Russian Navy will install the Kazan submarine in 2019.”

3. Source language: Persian. Reference translation: “@RanaGaza: HUGE explosion rocks #Gaza on first day of Eid!”

  • GCP: “@RanaGaza: The huge explosion of #Gaza on the first day of Eid.”
  • AWS: “@RanaGaza: Massive explosion #غزه on the first day of Eid.”
  • Azure: “@RanaGaza: Massive explosion of #غزه on the first day of Eid.”

4. Source language: Russian. Reference translation: “There was something on the books allegedly dealing with the issue, but it was totally ineffectual.”

  • GCP: “There were some laws supposedly dealing with this issue, but they were absolutely ineffective.”
  • AWS: “There were some laws allegedly dealing with this issue, but they were absolutely unedifective.”
  • Azure: “There were some laws supposedly dealing with this issue, but they were absolutely deductive.”


All three providers showed strong performance, for the most part producing intelligible, fluent English sentences from a variety of topics and contexts across the test languages. We also recorded output speed and found all three to be comparable here as well. While a given provider might outshine the others in a particular use case, ultimately factors such as price and ease of integration would likely play the deciding role in selecting a provider.

What’s Next

Stay tuned for part 2 of our machine translation evaluation, which covers our comparison between the online providers and Hugging Face’s open-source Transformers library.

About the Author

Kristi Flis is a Senior Data Scientist at Anno.Ai. Prior to joining Anno, Kristi spent 14 years serving in national security roles in the federal government, including several years overseas. Kristi holds degrees from Georgetown University and Colgate University.



