Machine learning models/Production/Machine translation

From Meta, a Wikimedia project coordination wiki
Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s): Meta AI Research team, IndicTrans2 team, MADLAD-400 researchers, Softcatalà, and the OpusMT project
Model owner(s): WMF Language Team
Publications: See the links to whitepapers and publications for the respective models on this page.
Code: GitHub repository of the MinT service. See the source code links for individual models on this page.
Uses PII: No
In production?: Yes
Which projects?: Content Translation
Purpose: Machine-translates given text from a source language to a target language


Motivation

MinT (Machine in Translation) is a machine translation service based on open-source neural machine translation models. The service is hosted in the Wikimedia Foundation infrastructure, and it runs translation models that have been released by other organizations with an open-source license. An open machine translation service can be a key piece of the essential infrastructure of the ecosystem of free knowledge.

The MinT service is designed to provide translations from multiple machine translation models. These models are not trained or developed by the Wikimedia Foundation; WMF is a downstream user of these models.
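As an illustration of how a downstream client might talk to a hosted translation service of this kind, the sketch below builds an HTTP translation request. The host name, URL pattern, payload shape, and the `nllb200` model identifier are assumptions made for this example, not the documented MinT API; consult the MinT repository for the actual interface.

```python
# Minimal client-side sketch for a MinT-style translation endpoint.
# All endpoint details here are illustrative assumptions, not the real API.
import json
import urllib.request

MINT_BASE = "https://translate.example.org"  # hypothetical host


def build_translate_request(source_lang: str, target_lang: str,
                            model: str, text: str) -> urllib.request.Request:
    """Assemble a POST request asking `model` (e.g. a hypothetical
    'nllb200' identifier) to translate `text` between two languages."""
    url = f"{MINT_BASE}/api/translate/{source_lang}/{target_lang}/{model}"
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_translate_request("en", "es", "nllb200", "Hello, world")
# The request is only constructed here, not sent.
print(req.full_url)  # https://translate.example.org/api/translate/en/es/nllb200
```

Keeping the model identifier in the URL mirrors how a multi-model service can route one request format to several backends (NLLB-200, OpusMT, and so on) without the client needing model-specific code.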

NLLB-200

This model was released by the Meta AI research team as part of the No Language Left Behind (NLLB) project. The model supports translation across 200 languages, including many that are not supported by other providers.

Model card: See page 183 of the whitepaper published as part of the NLLB-200 project.

Citation
@misc{nllbteam2022languageleftbehindscaling,
      title={No Language Left Behind: Scaling Human-Centered Machine Translation}, 
      author={NLLB Team and Marta R. Costa-jussà and James Cross and Onur Çelebi and Maha Elbayad and Kenneth Heafield and Kevin Heffernan and Elahe Kalbassi and Janice Lam and Daniel Licht and Jean Maillard and Anna Sun and Skyler Wang and Guillaume Wenzek and Al Youngblood and Bapi Akula and Loic Barrault and Gabriel Mejia Gonzalez and Prangthip Hansanti and John Hoffman and Semarley Jarrett and Kaushik Ram Sadagopan and Dirk Rowe and Shannon Spruit and Chau Tran and Pierre Andrews and Necip Fazil Ayan and Shruti Bhosale and Sergey Edunov and Angela Fan and Cynthia Gao and Vedanuj Goswami and Francisco Guzmán and Philipp Koehn and Alexandre Mourachko and Christophe Ropers and Safiyyah Saleem and Holger Schwenk and Jeff Wang},
      year={2022},
      eprint={2207.04672},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2207.04672}, 
}

OpusMT

The OPUS (Open Parallel Corpus) project from the University of Helsinki compiles multilingual content with a free license to train the OpusMT translation models. Anyone can easily help improve the translation quality by participating in the different projects that contribute data to OPUS. For example, when using Content Translation to create translations of Wikipedia articles, the data on published translations will be incorporated as a new resource to improve the translation quality for the next version of the model. Another quick way to contribute is to provide sentence translations with Tatoeba.

Whitepaper: https://arxiv.org/pdf/2212.01936

Citation
@misc{tiedemann2023democratizingneuralmachinetranslation,
      title={Democratizing Neural Machine Translation with OPUS-MT}, 
      author={Jörg Tiedemann and Mikko Aulamo and Daria Bakshandaeva and Michele Boggia and Stig-Arne Grönroos and Tommi Nieminen and Alessandro Raganato and Yves Scherrer and Raul Vazquez and Sami Virpioja},
      year={2023},
      eprint={2212.01936},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2212.01936}, 
}

IndicTrans2

The IndicTrans2 project provides translation models that support the 22 scheduled Indian languages. These models were developed by AI4Bharat, a research group at the Indian Institute of Technology, Madras.

The training data, process, and ethical statement are provided in the paper IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages.

Citation
@article{gala2023indictrans,
title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=vfT4YuzAYA},
note={}
}

Softcatalà

Softcatalà is a non-profit organization whose goal is to improve the use of Catalan in digital products. As part of the Softcatalà translation project, the organization has released the translation models used in its translator service, which translates 10 languages to and from Catalan.

Paper: Community-driven machine translation for the Catalan language at Softcatalà

Citation
@inproceedings{ivars-ribes-etal-2024-community,
    title = "Community-driven machine translation for the {C}atalan language at Softcatal{\`a}",
    author = "Ivars-Ribes, Xavi  and
      Mas, Jordi  and
      Riera, Marc  and
      Ortol{\`a}, Jaume  and
      Forcada, Mikel  and
      C{\`a}novas, David",
    editor = "Scarton, Carolina  and
      Prescott, Charlotte  and
      Bayliss, Chris  and
      Oakley, Chris  and
      Wright, Joanna  and
      Wrigley, Stuart  and
      Song, Xingyi  and
      Gow-Smith, Edward  and
      Forcada, Mikel  and
      Moniz, Helena",
    booktitle = "Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)",
    month = jun,
    year = "2024",
    address = "Sheffield, UK",
    publisher = "European Association for Machine Translation (EAMT)",
    url = "https://aclanthology.org/2024.eamt-2.23/",
    pages = "45--46",
    abstract = "Among the services provided by Softcatal{\`a}, a non-profit 25-year-old grassroots organization that localizes software into Catalan and develops software to ease the generation of Catalan content, one of the most used is its machine translation (MT) service, which provides both rule-based MT and neural MT between Catalan and twelve other languages. Development occurs in a community-supported, transparent way by using free/open-source software and open language resources. This paper briefly describes the MT services at Softcatal{\`a}: the offered functionalities, the data, and the software used to provide them."
}

MADLAD-400

MADLAD-400 is a multilingual machine translation model by Google Research that supports 419 languages.

Citation
@misc{kudugunta2023madlad400multilingualdocumentlevellarge,
      title={MADLAD-400: A Multilingual And Document-Level Large Audited Dataset},
      author={Sneha Kudugunta and Isaac Caswell and Biao Zhang and Xavier Garcia and Christopher A. Choquette-Choo and Katherine Lee and Derrick Xin and Aditya Kusupati and Romi Stella and Ankur Bapna and Orhan Firat},
      year={2023},
      eprint={2309.04662},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2309.04662},
}

License

MinT is licensed under the MIT license. See License.txt.

Please refer to the following table for the license details of each project.

Project      Code/library license   Documentation/models/dataset license
NLLB-200     MIT                    CC-BY-NC 4.0
OpusMT       MIT                    CC-BY 4.0
IndicTrans2  MIT                    CC0 (No Rights Reserved)
Softcatalà   MIT                    MIT
MADLAD-400   Apache 2.0             Apache 2.0