Talk:Strategy/Wikimedia movement/2018-20/Working Groups/Partnerships/Recommendations/Q2 R3

From Meta, a Wikimedia project coordination wiki

I would strongly recommend partnering with folks working on machine translation in order to build good language models for small/underserved languages, as well as to continue to harness state-of-the-art methods for large/well-served languages. As I wrote in my 2018 developer summit statement:

Machine translation plays a key role in removing these barriers and enabling new content and collaborators. We should invest in our own engineers and infrastructure supporting machine translation, especially between minority languages and script variants. Our editing community will continually improve our training data and translation engines, both by explicitly authoring parallel texts (as with the Content Translation tool) and by micro-contributions such as clicking yes/no on a proposed translation or pair of parallel texts ("bandit learning"). Using "zero-shot translation" models, our training data from "big" wikis can improve the translation of "small" wikis. Every contribution further improves the ability of our tools to make additional articles from other languages available.
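The "bandit learning" micro-contribution idea above can be made concrete with a small sketch. This is my own illustration, not an existing Wikimedia API: a simple epsilon-greedy bandit that proposes candidate translations to readers and learns from their yes/no clicks which candidate to prefer.

```python
import random


class TranslationBandit:
    """Epsilon-greedy bandit over candidate translations (hypothetical sketch).

    Each yes/no click is a micro-contribution: it updates the acceptance
    rate of the candidate that was shown, so the best translation is
    proposed more and more often.
    """

    def __init__(self, candidates, epsilon=0.1, seed=0):
        self.candidates = list(candidates)
        self.epsilon = epsilon                           # exploration rate
        self.shows = {c: 0 for c in self.candidates}     # times proposed
        self.clicks = {c: 0 for c in self.candidates}    # "yes" votes
        self.rng = random.Random(seed)

    def rate(self, candidate):
        # Observed acceptance rate; 0.0 before any feedback.
        shown = self.shows[candidate]
        return self.clicks[candidate] / shown if shown else 0.0

    def propose(self):
        # With probability epsilon, explore a random candidate;
        # otherwise exploit the best-rated one so far.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.candidates)
        return max(self.candidates, key=self.rate)

    def feedback(self, candidate, accepted):
        # Record one yes/no micro-contribution.
        self.shows[candidate] += 1
        if accepted:
            self.clicks[candidate] += 1
```

In a real deployment the "candidates" would come from competing translation engines or model checkpoints, and the aggregated clicks would also feed back into training data, but the feedback loop is the same shape.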

[...]

We should build clusters specifically for training translation (and other) deep learning models. As a supplement to our relationships with statistical translation tools Moses and Apertium, we should partner with the OpenNMT project for modern neural machine translation research. We should investigate whether machine translation can replace LanguageConverter, our script conversion tool; conversely, our editing fluency in ANY language pair should approach what LanguageConverter provides for its supported languages.
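The zero-shot mechanism referenced above (training data from "big" wikis improving "small" wikis) usually works by training one multilingual model whose inputs carry a target-language token, as in Johnson et al.'s zero-shot NMT work. A minimal, framework-agnostic sketch of that data preparation step (function and tag format are my own illustration, not OpenNMT's API):

```python
def make_multilingual_examples(parallel_corpora):
    """Flatten several parallel corpora into one training set for a single
    multilingual NMT model.

    `parallel_corpora` maps (src_lang, tgt_lang) pairs to lists of
    (source, target) sentence pairs. Prepending a target-language token
    lets one shared model serve many directions, including language pairs
    never seen together in training (zero-shot translation).
    """
    examples = []
    for (src_lang, tgt_lang), pairs in parallel_corpora.items():
        for src, tgt in pairs:
            # The <2xx> token tells the model which language to produce.
            examples.append((f"<2{tgt_lang}> {src}", tgt))
    return examples
```

Because all directions share one model, abundant pairs such as en–ca can improve the representations that a scarce pair such as en–oc relies on.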

These technologies have the potential to supercharge our work by letting us erase barriers between "big" and "little" wikis and more effectively work together. Partnerships will help ensure that we are using state-of-the-art tools on the "big data" wikis, the latest research to maximize the effectiveness of our small training sets on the "little data" wikis, and the latest linguistics research to extend these technologies to cover the smaller languages of the world. Cscott (talk) 23:47, 15 August 2019 (UTC)

From Catalan Salon

Our community doesn't have much cross-pollination with other technical communities (...)