Collaborative Machine Translation for Wikipedia

From Meta, a Wikimedia project coordination wiki

This document describes a proposal for a long-term strategy, combining several technologies, to offer a machine translation system based on collaborative principles, for use on Wikipedia or other websites where text content may change.


  • Source language users can manually disambiguate surface forms
  • Target language users can add missing translations
  • These contributions are (as far as possible) kept even if the base text changes
  • Manual translations are identified and kept as the primary version


  • A semantic dictionary with surface forms (inflections, conjugations, etc.)
  • A reliable semantic annotation system (fuzzy anchoring, entity linking, etc.)
  • A complete machine translation system (morphological analyser, sense disambiguator, POS tagger, etc.)
  • Visual editing interfaces for: transfer rules, sense disambiguation, parallel text display.
  • An automatic multilingual content synchronizer
  • A repository to store parallel text

How it works

In general terms:

  • Source language n-grams (normally sentences or chunks) are stored as annotations. It is possible to process these source text fragments to add POS-tagging or semantic disambiguation.
  • These chunks are translated by open source MT software and improved by volunteers. Information about how that translation was generated can also be displayed (in order to select/correct rules, detect other wrong translations, etc).
  • The translated text can be merged into the target-language Wikipedia if desired. These translations are also used to compute statistics that further improve other translations made by the MT system.
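As an illustration, the per-chunk data implied by the steps above could be sketched as follows; the field names are assumptions rather than a defined schema, and the point is simply that manual translations, where present, take precedence over MT output:

```python
from dataclasses import dataclass, field

@dataclass
class ChunkAnnotation:
    source_text: str                                # the n-gram (sentence or chunk)
    pos_tags: list = field(default_factory=list)    # optional POS-tagging layer
    sense_tags: dict = field(default_factory=dict)  # manual sense disambiguations
    mt_output: str = ""                             # draft from the MT system
    human_translation: str = ""                     # volunteer-improved version

    def best_translation(self) -> str:
        # Manual translations are identified and kept as the primary version
        return self.human_translation or self.mt_output

chunk = ChunkAnnotation("the big dog", mt_output="el perro grande")
```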

At the annotation level, for monolingual users who want to help disambiguate:

  • The morphological analyzer scans the chunk and detects lemmas based on the semantic dictionary
  • The morphological disambiguator selects senses from the semantic dictionary for these lemmas and provides a degree of confidence
  • Source language users are presented with the source text using a color code (red = unrecognized, orange = unclear), and they can manually disambiguate (semantic tagging/annotation) or add words to the semantic dictionary. All the tags are stored in the semantic annotations.
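A minimal sketch of the color-coding step, assuming a hypothetical analyzer output of (lemma, confidence) pairs; the 0.8 threshold is illustrative, not part of the proposal:

```python
def color_for(lemma, confidence):
    """Map an analyzer result to the display color used in the source text."""
    if lemma is None:          # surface form not found in the semantic dictionary
        return "red"           # unrecognized: the user may add the word
    if confidence < 0.8:       # sense selection is uncertain
        return "orange"        # unclear: the user may disambiguate manually
    return "black"             # confidently analyzed

chunk = [("perro", 0.95), ("banco", 0.55), (None, 0.0)]
colors = [color_for(lemma, conf) for lemma, conf in chunk]
```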

At the annotation level, for translators:

  • The MT system offers a translation that can be improved by correcting the text, adding missing (source or target language) words to the dictionary, etc.
  • Translations can be merged into the target-language Wikipedia if desired (an article and section selector is provided)
  • Additionally, these translations are kept in a central repository and used as parallel corpora for statistical analysis, to further improve other translations.
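The central-repository step above could be sketched like this: each accepted translation is stored as a (source, target) pair, and simple counts let the system prefer the most frequent human translation of a chunk. All names and data here are illustrative:

```python
from collections import Counter, defaultdict

# source chunk -> Counter of target chunks chosen by translators
corpus = defaultdict(Counter)

def record(source, target):
    """Store one accepted translation pair in the repository."""
    corpus[source][target] += 1

def preferred(source):
    """Most frequently chosen human translation for a chunk, if any."""
    counts = corpus.get(source)
    return counts.most_common(1)[0][0] if counts else None

record("the big dog", "el perro grande")
record("the big dog", "el gran perro")
record("the big dog", "el perro grande")
```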

For readers: after selecting a different content language in the wiki being visited, the following happens:

  • The multilingual content synchronizer identifies which parts of the text have been translated manually and determines for which parts to use machine translation.
  • If the input text has changed, the annotations are reattached at the most probable position, or deleted when that is no longer possible (see this post on fuzzy anchoring).
  • The translation is presented in a way that the readers can tap/click/mouseover a fragment to display the original text or to be directed to the translation interface.
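A minimal sketch of the re-anchoring step, using difflib's SequenceMatcher as a stand-in for the fuzzy matching described in the fuzzy-anchoring post (a real implementation would also store prefix and suffix context around the quoted fragment):

```python
import difflib

def reanchor(fragment, new_text, threshold=0.75):
    """Return the best-matching (start, end) span of `fragment` in
    `new_text`, or None when nothing is similar enough, in which case
    the annotation is dropped."""
    matcher = difflib.SequenceMatcher(None, new_text, fragment, autojunk=False)
    match = matcher.find_longest_match(0, len(new_text), 0, len(fragment))
    if match.size / max(len(fragment), 1) < threshold:
        return None
    return (match.a, match.a + match.size)

old = "The big dog barked."
new = "Yesterday the big dog barked loudly."
span = reanchor("the big dog", new)   # annotation survives the edit
```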

Existing technology and to-do

Semantic multilingual dictionary

Currently in Wikidata there are two proposals for enabling Wiktionary support as linked data (1 and 2). If inflections are supported, Wikidata/Wiktionary could serve as the base lexicon for any machine translation engine.

Once the system is in place, it will be possible to create language pairs from the existing information, which could be used to create translation dictionaries for Apertium. Some examples of language pair files (XML) are available.
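As a rough sketch, generating an Apertium bilingual-dictionary (bidix) entry from such a dictionary pair could look like this; the nested `<e><p><l>/<r>` element structure follows the Apertium XML format, but the word pair and part-of-speech tag are illustrative, not extracted from Wikidata:

```python
from xml.sax.saxutils import escape

def bidix_entry(source_lemma, target_lemma, pos):
    """Build one Apertium bilingual-dictionary entry:
    <l> holds the source side, <r> the target side, and
    <s n="..."/> carries the part-of-speech symbol."""
    left = f'{escape(source_lemma)}<s n="{pos}"/>'
    right = f'{escape(target_lemma)}<s n="{pos}"/>'
    return f"<e><p><l>{left}</l><r>{right}</r></p></e>"

entry = bidix_entry("dog", "perro", "n")
```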

Semantic annotation

For semantic annotation of entities there is DBpedia Spotlight; however, a specific solution should be implemented to cater for the needs of machine translation, or those capabilities should be added to the Apertium toolbox. Basically, it would need monolingual morphology selection and sense selection. Later on, the manually selected translation rules should also be stored as annotations. DBpedia Spotlight uses KEA and Lucene/OpenNLP with UIMA.

Other software: Maui is another tool for annotating text with vocabulary concepts. It provides pre-trained concept schemes and can be trained for new concepts. It works quite well for phrase-based and even topic-based tagging/annotation, and can be integrated with Solr facets. The same can be achieved with Apache Stanbol using various different engines.

Machine translation platform

For this kind of project a rule-based machine translation system is preferred, because total control over the whole process is wanted and minority languages should be accounted for (not that easy with statistical MT, where parallel corpora may be non-existent). Once parallel corpora are developed, statistical methods would be implemented to improve the translations, either by fine-tuning the transfer rules or by feeding a statistical engine (it could be Moses, using a multi-engine translation synthesiser).

In the open source world, Apertium offers a reliable rule-based MT toolchain. It should be noted, however, that it currently works around XML dictionaries. On the plus side, it has a thriving community, with 11 GSoC projects running this year. Moses is supported mainly by the EuroMatrix project, amongst other international organisations, and is funded by the European Commission.

There is a current effort to convert the Apertium translation rules into Wikimedia templates, and then use DBpedia extraction templates to convert such templates into linked data. An example of these upcoming rule translation templates could be:

 | pair = en-es
 | phrase type = NP <!-- this could be determined from either
 source_head or target_head, but it's nicer to have it -->
 | source = determiner adjective noun
 | target = determiner noun adjective
 | alignment = 1 3 2 <!-- not necessary in this example, but would be
 if there were more than one of each PoS -->
 <!-- this could also be written as 1-1 3-2 2-3 -- it would be nice to
 be able to use that convention, to import statistically derived rules,
 but it's only necessary to know one set of positions when writing by
 hand -->
 | source head = 3
 | target head = 2 <!-- not necessary with alignment -->
 | source example = the big dog
 | target example = el perro grande
 | target attributes = {{attribs | definiteness = 1 | gender = 2, 1 |
 number = 2, 1}} <!-- the actual attributes would be those used in
 wiktionary -->
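The example rule above can be exercised with a small sketch: translate each source word with a toy lexicon, then reorder the target side using the `alignment` field (1-indexed source positions listed in target order). The lexicon and function names are illustrative only:

```python
# Toy word-for-word lexicon for the en-es NP example above.
LEXICON = {"the": "el", "big": "grande", "dog": "perro"}

def apply_rule(source_words, alignment):
    """Translate each word, then reorder per the rule's alignment:
    alignment = [1, 3, 2] means the target is built from source
    positions 1, 3, 2 (so determiner-adjective-noun becomes
    determiner-noun-adjective)."""
    translated = [LEXICON[w] for w in source_words]
    return [translated[i - 1] for i in alignment]

result = apply_rule(["the", "big", "dog"], [1, 3, 2])
# "the big dog" -> "el perro grande"
```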

An example of a DBpedia mapping template: Mapping_en:Elementbox

At some point it could become possible to work with these translation rules directly as linked data, storing them in a Wikibase repository or in another triple store.

Interface

A translation interface already exists that could eventually be expanded to support semantic tagging of source and translated text. As for the transfer rules, there is a GSoC project idea to build a visual interface for transfer rules for Apertium. Although this editor is going to be a Qt interface (i.e. not web-based), it will provide initial insight into how such an interface should be done.

Multilingual content synchronizer

CoSyne is a past EU project that devoted three years to the topic of "multilingual synchronisation of wikis".[1]

The prototype, run on top of MediaWiki, aimed at recognising which parts of a wiki article are already present in two languages and which are not; the latter are translated and introduced automatically into the other language. The main components are cross-lingual textual entailment (to recognise which parts of the article are already in both languages) and MT. The prototype used an SMT engine, but since the engine is called as a web service, any MT engine could be used.

The project ended in 2013. As of 2015, no outputs are known and the CoSyne website is down.

The opportunity of MT

  • It can make a more diverse body of knowledge available to a greater number of people
    • Challenge: requires high quality, fast load times and user interaction
  • Translations do not need to be kept up to date manually
    • Challenge: users must keep an eye on vandalism, because it would have a greater reach.
  • The translations might not be literary-grade, but they should be factually accurate
    • Challenge: that will depend on how many users participate in improving the translation rules and the dictionary
  • Disambiguation can be a source of micro-tasks for people accessing with mobile devices
    • Challenge: unclear if it will spark enough interest
  • Tagging text at such level will allow more NLP, AI, and answer engine applications
    • Challenge: it will require a PR campaign to involve researchers
  • It might catalyze the growth of translation tools for languages that now have none
    • Challenge: some user volume or researcher involvement is still required

The dangers of MT

  • Since content in other languages might be more readily available, the incentive to write content in one's own language might become less pressing
    • Machine translation is already available and has not had any negative impact on contributions; rather the opposite: it facilitates, in less time, human translations that would have been done anyway.[citation needed]
  • Errors or fake information can spread more easily
    • These challenges are already being addressed by Wikidata[unclear]
  • Users can complain about cultural bias in other language communities
    • Or it could foster a better cultural understanding
  • Writing transfer rules requires linguistic and technical expertise
    • A project like this could spark the social appropriation of MT technology (the social process by which technologies become mainstream, cfr. Arduino, 3d printing, etc)
  • There might not be enough eyes to control translation tagging
    • Even then, readers can still notice what is right or wrong[unclear]