Facebook AI announces M2M-100: A Many-to-Many Language Translation Model
Adding to its several recent releases, Facebook AI has announced a new model that translates in multiple directions across a wide number of languages with accuracy never achieved before. The model handles even languages with fewer resources, making it a revolutionary development for language processing.

Up until now, translation has typically relied on bilingual systems: a single model is trained and used for translation between one desired pair of languages. This has its disadvantages. Such models are usually very English-centric, i.e., to translate between two non-English languages, the text is first converted to English and then to the target language, resulting in lower quality for non-English translations.
Although English-centric data dominates most training sources, pivoting through English does not represent the translation accurately. This created the need for a model like Facebook AI's M2M-100 language translation model, which restructures how language translation is done.
Facebook AI has released a number of revolutionary models in recent times, including wav2vec, KILT, and Dynabench; the M2M-100 model is the latest addition.
To attain this, a method called Multilingual Machine Translation (MMT) is used, which translates many languages with a single model and shares information between similar languages; this enables zero-shot translation and improves performance on low-resource directions.
The model aims to cover 100 languages by building a large-scale many-to-many dataset. This is kept manageable because the parallel corpus is constructed automatically, using a new mining strategy designed to reduce the mining effort across directions. In addition, back-translation is used to improve quality on low-resource pairs and in zero-shot settings.
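For a concrete sense of what many-to-many translation looks like in practice, here is a minimal sketch using the Hugging Face transformers port of the released model; it assumes the facebook/m2m100_418M checkpoint and translates Hindi to French directly, without pivoting through English.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

hindi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"

tokenizer.src_lang = "hi"                      # tell the tokenizer the source language
encoded = tokenizer(hindi_text, return_tensors="pt")
generated = model.generate(
    **encoded,
    forced_bos_token_id=tokenizer.get_lang_id("fr"),  # force French as the target language
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
# ['La vie est comme une boîte de chocolat.']
```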
Working
Contemporary neural translation models are built from blocks such as subword segmentation and an encoder-decoder architecture, commonly the Transformer. For subword segmentation, SentencePiece is used to split sentences into subword units because it is designed to work with languages that are not, or cannot be, segmented by whitespace. A multilingual dictionary is then built from these subword units.
The dictionary is built by calculating subword frequencies in the training dataset, which leads to an under-representation of low-resource languages. This problem is addressed by incorporating monolingual data for languages with few resources.
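As a rough illustration of the subword step, the sketch below trains a small SentencePiece model on a mixed multilingual file and tokenizes a sentence; the file name, vocabulary size, and other settings are placeholders, not the settings used for M2M-100.

```python
import sentencepiece as spm

# Train a shared subword model over mixed multilingual text (file name is hypothetical).
spm.SentencePieceTrainer.train(
    input="multilingual_corpus.txt",  # one sentence per line, many languages mixed together
    model_prefix="m2m_spm",
    vocab_size=32000,                 # illustrative; not the vocabulary size used by M2M-100
    character_coverage=0.9995,        # keep rare characters from non-Latin scripts
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="m2m_spm.model")
print(sp.encode("Namaste duniya", out_type=str))  # list of subword pieces
```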
The MMT model is chiefly based on the sequence-to-sequence Transformer architecture, which consists of two modules: the encoder and the decoder. The encoder embeds the source tokens, and the decoder uses these embeddings to produce the target sentence token by token.
Each layer in the Transformer takes a sequence of vectors as input and produces a sequence of vectors as output. Each encoder layer, in turn, contains two sub-layers, a self-attention sub-layer and a feed-forward sub-layer, applied sequentially and each followed by a residual connection.
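The following sketch shows what one such encoder layer looks like in PyTorch, with a self-attention sub-layer and a feed-forward sub-layer, each wrapped in a residual connection; the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer: self-attention + feed-forward, each with a residual."""

    def __init__(self, d_model=1024, n_heads=16, d_ff=4096, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)      # self-attention sub-layer
        x = self.norm1(x + attn_out)               # residual connection + layer norm
        x = self.norm2(x + self.feed_forward(x))   # feed-forward sub-layer + residual
        return x

x = torch.randn(2, 10, 1024)      # (batch, sequence length, embedding dimension)
print(EncoderLayer()(x).shape)    # torch.Size([2, 10, 1024])
```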
The next aspect of the model is the target language token. Because a single model serves many languages, the target language is not fixed, so the network must be told which language to generate; a token identifying the target language is therefore included so that the output comes out as a sentence in that language.
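One common way to do this, sketched below with hypothetical "__lang__" tokens, is to prepend a language-identifying token to the encoder input and to start the decoder from a token naming the target language.

```python
# The "__hi__"-style tokens are placeholders for whatever special tokens the model actually uses.
def add_language_tokens(source_tokens, target_tokens, src_lang, tgt_lang):
    """Prepend language-identifying tokens so one network can serve many directions."""
    encoder_input = [f"__{src_lang}__"] + source_tokens
    decoder_input = [f"__{tgt_lang}__"] + target_tokens  # the decoder starts from the target-language token
    return encoder_input, decoder_input

enc, dec = add_language_tokens(["नमस्ते", "दुनिया"], ["bonjour", "le", "monde"], "hi", "fr")
print(enc)  # ['__hi__', 'नमस्ते', 'दुनिया']
print(dec)  # ['__fr__', 'bonjour', 'le', 'monde']
```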
Moving further, we need to understand how this model is trained. The MMT model uses a large Transformer with 12 encoder layers and 12 decoder layers and an embedding dimension of 1024, optimized with Adam.
The training data for the languages is divided into varying numbers of shards, so that high-resource languages have more shards than low-resource ones.
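Putting the reported numbers together, a rough configuration sketch might look like the following; the attention heads, feed-forward size, learning rate, and shard size are placeholders, not the paper's exact values.

```python
import torch
import torch.nn as nn

# 12 encoder layers, 12 decoder layers, 1024-dim embeddings (as reported);
# the remaining hyperparameters below are illustrative placeholders.
model = nn.Transformer(
    d_model=1024,
    nhead=16,
    num_encoder_layers=12,
    num_decoder_layers=12,
    dim_feedforward=4096,
    batch_first=True,
)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, betas=(0.9, 0.98))

def num_shards(num_sentences, sentences_per_shard=1_000_000):
    """Illustrative sharding: high-resource languages simply end up with more shards."""
    return max(1, num_sentences // sentences_per_shard)

print(num_shards(250_000), num_shards(50_000_000))  # 1 50
```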
Building the Model
The first step towards building the model is to select 100 languages for which high-quality labeled data exists. To do so, geographically widely spoken languages are included first, then languages for which public evaluation data exists, and then those with monolingual data available.
Next comes the mining of parallel data. Most supervised translation systems are built on parallel sentences, generally referred to as bitext, which has traditionally been derived from human translations.
Mining for parallel data therefore means identifying sentences that could potentially be translations of each other, which requires a way to calculate similarity between sentences. Here, sentences are embedded with the LASER encoder, enabling semantic comparison across 94 different languages, and the parallel corpus is then retrieved using FAISS indexing.
The advantage of using LASER embeddings is that they generalize to unseen languages, which makes it possible to mine bitexts for about 100 languages.
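A stripped-down version of this embed, index, and search loop is sketched below; the embed function stands in for a multilingual sentence encoder such as LASER, and the simple cosine threshold is illustrative (the real pipeline uses a more careful margin-based score).

```python
import faiss
import numpy as np

def mine_pairs(src_sentences, tgt_sentences, embed, threshold=0.8):
    """Nearest-neighbour bitext mining sketch; `embed` maps a list of sentences to vectors."""
    src = np.ascontiguousarray(embed(src_sentences), dtype="float32")
    tgt = np.ascontiguousarray(embed(tgt_sentences), dtype="float32")
    faiss.normalize_L2(src)
    faiss.normalize_L2(tgt)

    index = faiss.IndexFlatIP(tgt.shape[1])   # inner product equals cosine after normalization
    index.add(tgt)
    scores, neighbors = index.search(src, 1)  # closest target sentence for each source sentence

    return [
        (src_sentences[i], tgt_sentences[neighbors[i, 0]], float(scores[i, 0]))
        for i in range(len(src_sentences))
        if scores[i, 0] >= threshold          # keep only high-similarity candidates
    ]
```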
The MMT model roughly follows a standard data mining pipeline: a large text corpus is pre-processed and split, the sentences are embedded and stored in an index, indexed sentences are compared to find potential pairs, and finally the results are filtered in post-processing. The mining itself follows the CCMatrix and CCAligned approaches.
CCMatrix is a global approach in which all possible sentences in one language are compared to those in another language. This is thorough, because every sentence in the corpus is scanned for an alignment with a sentence from the other language, but it scales poorly. CCAligned, on the other hand, is a local approach that avoids the global scaling problem by pre-selecting the documents to compare. After mining, the results are filtered to remove sentences that are more than 50% punctuation.
Finally, length and language-specific filtering is applied: very long sentences are removed, as are sentences in which more than 50% of the characters do not belong to the core script of the language.
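A minimal sketch of these post-processing filters is shown below; the maximum length and the core_chars helper (returning the expected character set for a language) are assumptions for illustration, not the paper's exact rules.

```python
import string

def keep_sentence(sentence, lang, core_chars, max_tokens=250):
    """Post-processing filter sketch: drop long, punctuation-heavy, or off-script sentences."""
    tokens = sentence.split()
    if not tokens or len(tokens) > max_tokens:        # remove empty or very long sentences
        return False

    punct = sum(ch in string.punctuation for ch in sentence)
    if punct / len(sentence) > 0.5:                   # more than 50% punctuation
        return False

    non_space = [ch for ch in sentence if not ch.isspace()]
    in_script = sum(ch in core_chars(lang) for ch in non_space)
    return in_script / len(non_space) > 0.5           # keep only mostly in-script sentences
```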
For mining, a new approach is designed for the MMT model: the bridge language strategy, which avoids exhaustively mining all possible pairs. The aim is to lower the number of mined bitext pairs while retaining the translation directions that have practical use. To do this, the model first groups the languages into 14 separate groups.
All languages belonging to the same group are mined against each other, because they are grouped according to linguistic similarity; languages are also grouped based on their cultural and geographic proximity.
To connect languages from different groups, bridge languages are chosen within each group, usually those with the most resources. For example, for the 12 languages of the Indo-Aryan family, the chosen bridge languages are Hindi, Tamil, and Bengali.
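The sketch below illustrates how the bridge strategy shrinks the set of directions to mine; the groups and bridge languages shown are a small illustrative subset, not the paper's full grouping.

```python
from itertools import combinations

# Illustrative subset only: two groups, a few languages each, and some bridge languages.
groups = {
    "indo_aryan": ["hi", "bn", "ta", "mr", "ur"],
    "romance":    ["fr", "es", "it", "pt", "ro"],
}
bridges = ["hi", "bn", "ta", "fr", "es", "en"]    # high-resource bridges plus English

pairs = set()
for langs in groups.values():                     # 1) mine all pairs within a group
    pairs.update(combinations(sorted(langs), 2))
pairs.update(combinations(sorted(bridges), 2))    # 2) mine bridge languages (and English) against each other

print(len(pairs))   # far fewer directions than mining every possible pair exhaustively
```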
A further addition to this model is back-translation, which strengthens translation in the reverse direction as well. Once the model can translate from a source language to a target language, back-translation generates additional data by translating monolingual target-language sentences into the source language and pairing them with the originals.
This addition almost always gives better results, irrespective of the BLEU score measured beforehand.
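A minimal sketch of back-translation as data augmentation, assuming an already-trained translate function (which is hypothetical here):

```python
def back_translate(monolingual_tgt, src_lang, tgt_lang, translate):
    """Create synthetic (source, target) pairs from monolingual target-language text."""
    # Translate target-language sentences back into the source language.
    synthetic_src = translate(monolingual_tgt, src=tgt_lang, tgt=src_lang)
    # The source side is synthetic and noisy, but the target side is genuine text,
    # which is what the source -> target model learns to produce.
    return list(zip(synthetic_src, monolingual_tgt))
```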
Comparing the result with the existing English-centric and random mining baselines shows that the bridge language strategy mines within language families and among high-resource languages, resulting in a large amount of bitext that covers a wide variety of languages.
It can also be seen that sparsely mining a well-chosen subset of directions is more helpful than densely mining everything: the model is still exposed to a wide variety of words and directions, while the mining effort stays under control.
Industry Benefits
The data pipeline behind this model creates the high-quality training data required to train translation models, a problem that has been researched for quite a while now.
Earlier models have explored aspects like removing noise from large datasets; this model instead provides large quantities of data for training multilingual models. That does come with a pre-processing overhead to produce a uniform dataset in terms of tokenization, labeling, simplified text, and so on.
Furthermore, this model works with low-resource languages, which has been a critical field of research; for many languages, even high-quality monolingual data is not readily available.
Usage
The tasks this model performs can be useful in almost every sector of day-to-day operations.
In medical and life sciences: translating manuals for doctors, patient records, prescriptions, and instructions. This can be used by pharmaceutical companies planning to expand to other countries.
In information technology: many IT companies work on projects for clients in other countries, where clear communication and job documentation are necessary. M2M-100 can be used to facilitate that communication.
Web translation for credibility and support: similar to the feature in Google Chrome that offers to translate a page when it detects a non-English one, M2M-100 can translate web content so that users understand all the decisions they are making, and it can offer assistance where required.
Other domains include banking and finance, travel and tourism, and more.
Closing thoughts
Using the bridge language strategy, the M2M-100 model can accurately translate across 9,900 directions for 100 languages, performing better than previously existing systems, as verified on benchmarks from evaluation campaigns such as WMT and TED. It is now up to creative minds to exploit the benefits of this model and push the technology even further.
For more information, read the research paper for the MMT model.
A step towards eliminating the language barrier. That's great 💯