Machine learning models/Production/Machine translation
| Model card | |
|---|---|
| This page is an on-wiki machine learning model card. A model card is a document that seeks to answer basic questions about a machine learning model. | |
| Model Information Hub | |
| Model creator(s) | Meta AI research team, IndicTrans2 team, MADLAD-400 researchers, Softcatalà, and the OpusMT project |
| Model owner(s) | WMF Language Team |
| Publications | See the links to whitepapers and publications for the respective models on this page. |
| Code | GitHub repository of the MinT service. See the source code links for the individual models on this page. |
| Uses PII | No |
| In production? | Yes |
| Which projects? | Content Translation |
| Machine-translates given text from a source language to a target language | |
Motivation
MinT (Machine in Translation) is a machine translation service based on open-source neural machine translation models. The service is hosted on Wikimedia Foundation infrastructure and runs translation models that have been released by other organizations under open-source licenses. An open machine translation service can be a key piece of the essential infrastructure of the free knowledge ecosystem.
The MinT service is designed to provide translations from multiple machine translation models. These models are not trained or developed by the Wikimedia Foundation; WMF is a downstream user of these models.
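As a rough illustration of how a client might talk to such a service, the sketch below builds a translation request for a MinT-style HTTP API. The base URL, endpoint path, and payload shape here are assumptions for illustration only; the real API is documented in the MinT GitHub repository.

```python
# Hypothetical client sketch for a MinT-style translation API.
# The endpoint path and payload shape below are assumptions,
# not MinT's documented API.

def build_translate_request(base_url: str, source: str, target: str, text: str):
    """Build the URL and JSON payload for one translation request."""
    url = f"{base_url}/api/translate/{source}/{target}"
    payload = {"text": text}
    return url, payload

url, payload = build_translate_request(
    "https://mint.example.org", "en", "es", "Hello, world"
)
# Sending the request would then look like:
#   import requests
#   translation = requests.post(url, json=payload).json()
```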
NLLB-200
This model was released by the Meta AI research team as part of the No Language Left Behind project. It supports translation across 200 languages, including many that are not supported by other vendors.
Model card: See page 183 of the NLLB-200 whitepaper.
Citation
@misc{nllbteam2022languageleftbehindscaling,
title={No Language Left Behind: Scaling Human-Centered Machine Translation},
author={NLLB Team and Marta R. Costa-jussà and James Cross and Onur Çelebi and Maha Elbayad and Kenneth Heafield and Kevin Heffernan and Elahe Kalbassi and Janice Lam and Daniel Licht and Jean Maillard and Anna Sun and Skyler Wang and Guillaume Wenzek and Al Youngblood and Bapi Akula and Loic Barrault and Gabriel Mejia Gonzalez and Prangthip Hansanti and John Hoffman and Semarley Jarrett and Kaushik Ram Sadagopan and Dirk Rowe and Shannon Spruit and Chau Tran and Pierre Andrews and Necip Fazil Ayan and Shruti Bhosale and Sergey Edunov and Angela Fan and Cynthia Gao and Vedanuj Goswami and Francisco Guzmán and Philipp Koehn and Alexandre Mourachko and Christophe Ropers and Safiyyah Saleem and Holger Schwenk and Jeff Wang},
year={2022},
eprint={2207.04672},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2207.04672},
}
OpusMT
[edit]The OPUS (Open Parallel Corpus) project from the University of Helsinki compiles multilingual content with a free license to train the OpusMT translation models. Anyone can easily help improve the translation quality by participating in the different projects that contribute data to OPUS. For example, when using Content Translation to create translations of Wikipedia articles, the data on published translations will be incorporated as a new resource to improve the translation quality for the next version of the model. Another quick way to contribute is to provide sentence translations with Tatoeba.
Whitepaper: https://arxiv.org/pdf/2212.01936
Citation
@misc{tiedemann2023democratizingneuralmachinetranslation,
title={Democratizing Neural Machine Translation with OPUS-MT},
author={Jörg Tiedemann and Mikko Aulamo and Daria Bakshandaeva and Michele Boggia and Stig-Arne Grönroos and Tommi Nieminen and Alessandro Raganato and Yves Scherrer and Raul Vazquez and Sami Virpioja},
year={2023},
eprint={2212.01936},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2212.01936},
}
IndicTrans2
The IndicTrans2 project provides translation models supporting the 22 scheduled Indian languages. These models were developed by AI4Bharat, a research group at the Indian Institute of Technology, Madras.
The training data, training process, and ethical statement are described in the paper IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages.
- Paper: https://openreview.net/forum?id=vfT4YuzAYA
Citation
@article{gala2023indictrans,
title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=vfT4YuzAYA},
note={}
}
Softcatalà
Softcatalà is a non-profit organization whose goal is to improve the use of Catalan in digital products. As part of the Softcatalà translation project, the translation models used in its translator service, which translates 10 languages to and from Catalan, have been released.
Paper: Community-driven machine translation for the Catalan language at Softcatalà (https://aclanthology.org/2024.eamt-2.23/)
Citation
@inproceedings{ivars-ribes-etal-2024-community,
title = "Community-driven machine translation for the {C}atalan language at Softcatal{\`a}",
author = "Ivars-Ribes, Xavi and
Mas, Jordi and
Riera, Marc and
Ortol{\`a}, Jaume and
Forcada, Mikel and
C{\`a}novas, David",
editor = "Scarton, Carolina and
Prescott, Charlotte and
Bayliss, Chris and
Oakley, Chris and
Wright, Joanna and
Wrigley, Stuart and
Song, Xingyi and
Gow-Smith, Edward and
Forcada, Mikel and
Moniz, Helena",
booktitle = "Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)",
month = jun,
year = "2024",
address = "Sheffield, UK",
publisher = "European Association for Machine Translation (EAMT)",
url = "https://aclanthology.org/2024.eamt-2.23/",
pages = "45--46",
abstract = "Among the services provided by Softcatal{\`a}, a non-profit 25-year-old grassroots organization that localizes software into Catalan and develops software to ease the generation of Catalan content, one of the most used is its machine translation (MT) service, which provides both rule-based MT and neural MT between Catalan and twelve other languages. Development occurs in a community-supported, transparent way by using free/open-source software and open language resources. This paper briefly describes the MT services at Softcatal{\`a}: the offered functionalities, the data, and the software used to provide them."
}
MADLAD-400
MADLAD-400 is a multilingual machine translation model by Google Research that supports 419 languages.
- Paper: MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
- Model card: See Page 57 of https://arxiv.org/pdf/2309.04662
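Since the models above cover different (and overlapping) language sets, a service hosting all of them must pick one model per requested language pair. The routing table below is a simplified, hypothetical sketch of that idea; MinT's actual selection logic lives in its GitHub repository.

```python
# Illustrative per-language-pair model routing for a multi-model
# translation service. The mapping is hypothetical, not MinT's real logic.

ROUTES = {
    ("en", "ca"): "softcatala",   # Catalan pairs -> Softcatalà models
    ("en", "hi"): "indictrans2",  # Indic languages -> IndicTrans2
    ("en", "ta"): "indictrans2",
}
FALLBACK = "nllb200"  # NLLB-200's 200-language coverage backstops the rest

def pick_model(source: str, target: str) -> str:
    """Return the model registered for this pair, or the broad-coverage fallback."""
    return ROUTES.get((source, target), FALLBACK)
```

Registering specialized models first and falling back to the broadest one keeps per-pair quality high without sacrificing coverage.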
Citation
@misc{kudugunta2023madlad400multilingualdocumentlevellarge,
title={MADLAD-400: A Multilingual And Document-Level Large Audited Dataset},
author={Sneha Kudugunta and Isaac Caswell and Biao Zhang and Xavier Garcia and Christopher A. Choquette-Choo and Katherine Lee and Derrick Xin and Aditya Kusupati and Romi Stella and Ankur Bapna and Orhan Firat},
year={2023},
eprint={2309.04662},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2309.04662},
}
License
MinT is licensed under the MIT license; see License.txt.
Please refer to the following table for the license details of each project.
| Project | Code/Library License | Models/Data/Documentation License |
|---|---|---|
| NLLB-200 | MIT | CC-BY-NC 4.0 |
| OpusMT | MIT | CC-BY 4.0 |
| IndicTrans2 | MIT | CC0 (No Rights Reserved) |
| Softcatala | MIT | MIT |
| MADLAD-400 | Apache 2.0 | Apache 2.0 |