… and a comparison with cloud provider translation services
As we discussed in our previous post, the Anno.Ai data science team has continued evaluating machine learning model providers by testing machine translation offerings. In Part 1, we compared the Google Cloud Platform (GCP), Amazon Web Services (AWS), and Microsoft Azure APIs for translating Arabic, Chinese, Persian, and Russian into English.
While the commercial cloud services provide a great option for online use cases, some of our use cases require running models in an offline environment and/or the flexibility to re-train and tune these models to more specific data environments. For these cases, we turned to open source neural machine translation (NMT) models that can be tuned and deployed for offline environments. In the second part of this series, we’ll provide an overview of open source NMT models. We’ll also compare models available through the Hugging Face Transformers library with the cloud provider services we evaluated in Part 1 of the translation series.
Overview of Open Source NMT Models
Over the past five years, several robust open source NMT frameworks have emerged. Some of the most popular ones include:
- Marian: Marian (also called Marian NMT) is a fast translation framework written in C++ and primarily maintained by the Microsoft Translator team. It is also the NMT engine used under the hood for Microsoft’s Neural Machine Translation service.
- OpenNMT: The Harvard NLP team originally developed OpenNMT, and it is now primarily maintained by SYSTRAN.
- Sockeye: Sockeye is a sequence-to-sequence framework for neural machine translation; it’s used under the hood by Amazon Translate.
- Fairseq: Fairseq is Facebook’s sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks. It provides reference implementations and pre-trained models associated with many recent NMT research articles.
- Hugging Face Transformers: The Transformers library provides general-purpose architectures for translation as well as a range of other language modeling and text generation tasks. It also enables contributors to publish language datasets and share trained models.
In an initial review of these frameworks, we were looking for two things: the option to train and tune our own models for more specific data environments, and existing (or “pre-trained”) models that we could use right away for a broad set of languages.
A pre-trained model is a model that has been created by someone else, and that can be used as a starting point vs. training your own model from scratch. Pre-trained models can be helpful for rapid prototyping, transfer learning, or if you need broad language support across a range of different language pairs (and don’t have the time/resources to train all of these models yourself). Publishing and sharing large pre-trained language models also cuts down on energy costs and climate impacts associated with each end user training up their own models.
While all of these open source projects support training your own models from scratch, most provide few pre-trained models, typically only ones trained on the WMT 2017 bilingual data for English-German and a few other language pairs. The one exception is the Hugging Face Transformers library, which added a large set of pre-trained machine translation models earlier this year. We were excited to dive deeper into the Transformers library, and also test how these translation models would perform compared to the commercial services we’d evaluated in Part 1!
Translation with Transformers
The Transformers library enables contributors to publish language datasets and share pre-trained models. In May 2020, the Language Technology Research Group at the University of Helsinki (Helsinki-NLP) published a large set of translation models to the Transformers library. These models were trained using the Marian NMT framework and the Open Parallel Corpus (OPUS) dataset. The model set covers more than 1,000 language pairs, including translations into English from 169 source languages or language families.
Using these pre-trained models is straightforward. You can load them dynamically from the Hugging Face model hub, or download and save them locally for offline access (each set of model weights and tokenizer files is about 310 MB).
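As a minimal sketch of both options, the snippet below loads a Helsinki-NLP model through the Transformers `MarianMTModel` and `MarianTokenizer` classes and optionally caches it in a local directory. The helper names (`opus_mt_model_name`, `load_translator`, `translate`) and the `models/...` path are our own illustrative choices, not part of the library, and the sketch assumes a recent Transformers release with PyTorch installed.

```python
from pathlib import Path


def opus_mt_model_name(src: str, tgt: str = "en") -> str:
    """Build the model hub ID for a Helsinki-NLP OPUS-MT model (hypothetical helper)."""
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


def load_translator(src: str, tgt: str = "en", local_dir=None):
    """Load tokenizer and model from local_dir if it exists, else from the hub."""
    # Imported lazily so the naming helper works without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    have_local = local_dir is not None and Path(local_dir).is_dir()
    source = str(local_dir) if have_local else opus_mt_model_name(src, tgt)

    tokenizer = MarianTokenizer.from_pretrained(source)
    model = MarianMTModel.from_pretrained(source)

    # First run with a local_dir: save a copy for later offline use.
    if local_dir is not None and not have_local:
        tokenizer.save_pretrained(local_dir)
        model.save_pretrained(local_dir)
    return tokenizer, model


def translate(texts, tokenizer, model):
    """Translate a list of source sentences into the model's target language."""
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


# Example usage (downloads the model on first run; the path is an assumption):
#   tokenizer, model = load_translator("ru", local_dir="models/opus-mt-ru-en")
#   print(translate(["Привет, мир!"], tokenizer, model))
```

After the first run, pointing `local_dir` at the saved directory should let the same code run without network access.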
Hugging Face vs. Commercial Cloud Providers
After downloading the Hugging Face translation models for our languages of interest, we wanted to test how the models would perform compared to the commercial services we’d evaluated in our previous post. While the Hugging Face translation model set has models for Arabic, Russian, and Chinese to English translation, it does not include a Persian-specific model, so we used the Indo-Iranian family model instead. We ran the Hugging Face translation models through the same evaluation process described in our previous post. We conducted a two-phase evaluation to assess the quality of the translations for each language: one phase using automated metrics and one using human reviewers.
For automated metrics, we calculated BLEU, METEOR, and NIST scores for translations on the Global Voices dataset (for Arabic, Persian, and Russian) and the test dataset from the news translation task of the 2019 Association for Computational Linguistics Annual Meeting (for Chinese).
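To give a feel for what a metric like BLEU measures, here is a simplified sketch in plain Python: clipped n-gram precision up to 4-grams, combined by a geometric mean and scaled by a brevity penalty. It assumes one reference per translation, whitespace tokenization, and no smoothing. It is not the scoring code we used, and the function names are illustrative; for real evaluations an established implementation is the safer choice.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=4):
    """Simplified corpus-level BLEU with a single reference per hypothesis."""
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-gram counts per order
    ref_len = hyp_len = 0

    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp, n)
            ref_counts = ngram_counts(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty discourages translations shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


reference = "the cat sat on the mat".split()
hypothesis = "the cat sat on a mat".split()
score = corpus_bleu([reference], [hypothesis])
```

Scores range from 0 to 1 (often reported multiplied by 100), and an exact match with the reference scores 1.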
The Hugging Face models held their own and produced results that were on par with (or in some cases, better than) the other models for Arabic and Russian. For Chinese, the Hugging Face model metrics were slightly lower than, but still close to, the commercial models. The Hugging Face Indo-Iranian model underperformed on Persian translations and we would need to improve/retrain this model for our use cases, but this isn’t surprising given the range of languages that model covers.
Human Rater Review
We also had our linguists review a subset of the translations for correctness in both fluency and meaning, and averaged those scores to produce a score out of 100. A comparison of the human reviewer ratings on the commercial APIs vs. Hugging Face models is included below, and we’ve also included some sample translations in each language. The Hugging Face models were on par with the commercial models for Arabic, Chinese, and Russian translations. For Persian, the Indo-Iranian family model occasionally produced accurate translations, but in general its translations were completely inaccurate, so we would need to retrain the model for Persian.
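The averaging step can be sketched as follows. This assumes each reviewer rating is already on a 0 to 100 scale and that fluency and meaning are weighted equally; the function names and the sample ratings are hypothetical and are not our actual review data.

```python
from statistics import mean


def translation_score(fluency: float, meaning: float) -> float:
    """Average the fluency and meaning ratings for one translation."""
    return (fluency + meaning) / 2


def model_score(ratings) -> float:
    """Average the per-translation scores over (fluency, meaning) pairs."""
    return mean([translation_score(f, m) for f, m in ratings])


# Hypothetical ratings for two translations, as (fluency, meaning) pairs.
example_ratings = [(80, 60), (100, 90)]
overall = model_score(example_ratings)
```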
Overall, we were really impressed by the Hugging Face model performance and ease of use, especially for offline use cases involving Arabic, Chinese, or Russian. One final note on speed: the Hugging Face translation models aren’t particularly fast (several seconds per sentence on CPU), in part because they use a slower tokenizer. Hugging Face is upgrading its tokenizers as part of the Transformers 4.0.0 and subsequent releases, and hopefully more of the translation models will get a speed boost soon as well.
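If you want to check per-sentence latency on your own hardware, a small timing harness along these lines may help. Here `translate_fn` is a placeholder for any callable that takes a list of sentences and returns their translations; the name and the one-sentence-at-a-time approach are our assumptions, not a description of how we measured speed.

```python
import time


def seconds_per_sentence(translate_fn, sentences):
    """Return the mean wall-clock time to translate one sentence at a time."""
    if not sentences:
        raise ValueError("need at least one sentence to time")
    timings = []
    for sentence in sentences:
        start = time.perf_counter()
        translate_fn([sentence])  # one sentence per call, so no batching gains
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)
```

Timing one sentence per call gives a conservative figure; batching several sentences per call would likely lower the average.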
We’re planning to post more model evaluations soon, so please keep an eye out for our upcoming posts on models for speech-to-text transcription and automated polygon segmentation!