Machine translation

From Meta, a Wikimedia project coordination wiki

The purpose of the Wiki(pedia) Machine Translation Project is to develop ideas, methods and tools that can help translate Wikipedia articles (and Wikimedia pages) from one language to another, particularly out of English and into languages with small numbers of fluent speakers.

Remember to read the current talk page and particularly what is stated on the Wikipedia Translation page:

Wikipedia is a multilingual project. Articles on the same subject in different languages can be edited independently; they do not have to be translations of one another or correspond closely in form, style or content. Still, translation is often useful to spread information between articles in different languages.

Translation takes work. Machine translation, especially between unrelated languages (e.g. English and Japanese), produces very low-quality results. Wikipedia consensus is that an unedited machine translation, left as a Wikipedia article, is worse than nothing (see, for example, here). The translation templates have links to machine translations built in automatically, so all readers should be able to access machine translations easily.

Remember that if the idea were simply to run a Wikipedia article through a fully automatic machine translation system (such as Google Translate), there would be no point in adding the results to the "foreign" Wikipedia: a user could just feed the system the desired URL.

Motivation[edit]

Wikimedia projects in small languages can't produce articles as fast as those in languages such as English, Japanese, German or Spanish, because the number of Wikipedians is too low and some prefer to contribute to bigger projects. One potential solution to this problem, in discussion since 2002, is the translation of Wikimedia projects. As some languages will not have enough translators, machine translation can improve the productivity of the community. Such automatic translation would be a first step, with manual corrections and additions made later as the local communities develop.

A second, equally important motivation is the development of free tools for computational linguistics and natural language processing. These fields are very important, but resources for small languages are usually nonexistent, low-quality, expensive and/or restricted in their usage. Even for "big" languages such as English, free resources are still scarce. We could develop...

Approaches[edit]

Interlingua approach[edit]

A different but related approach would be to translate articles into a machine translation interlingua such as UNL, then write software modules to translate automatically from that interlingua into each target language. The initial translation could be created fully by hand, or machine-translated with humans verifying the accuracy of the translation and choosing between multiple alternatives. This only saves work with respect to direct translation if there are several target languages whose modules are well enough developed, but such modules are much easier to write, and likely more accurate, than full real-language to real-language automatic translation systems.

Translating between closely related languages[edit]

I would imagine it would be an easier task to translate between similar languages than dissimilar ones. For example, we have Wikipedias in Catalan and Spanish, Macedonian and Bulgarian, and perhaps even Dutch and Afrikaans (further study would be needed to evaluate which pairs would be most appropriate). There is some free software being produced in Spain, called en:Apertium, that might be useful here.

Suggested statistical approach[edit]

  • In a first step, a series of language corpora and statistical models is generated from the various Wikipedias. The results are particularly interesting because, besides extracting the text needed for the project, they also allow us to publish, under a permissive license, a kind of data generally unavailable for smaller languages, or at best only available at considerable expense. Many of these have already been released and are hosted at SourceForge:
Language code | Dump date   | Native name | English name | Raw corpus      | Clean corpus    | Reduced corpus  | Language model
af            | 2010-03-10  | afrikaans   | Afrikaans    | link (7.8 MB)   | link (7.7 MB)   | (none)          | link (3.1 MB)
ca            | 2010-02-19  | català      | Catalan      | link (86.6 MB)  | link (85.0 MB)  | (none)          | link (7.1 MB)
en            | 2010-01-30  | English     | English      | link (1.6 GB)   | link (1.5 GB)   | link (266.9 MB) | link (18.4 MB)
eo            | 2010-03-11  | esperanto   | Esperanto    | link (33.0 MB)  | link (32.6 MB)  | (none)          | link (6.2 MB)
eu            | 2010-02-22  | euskara     | Basque       | link (12.9 MB)  | link (12.8 MB)  | (none)          | link (4.0 MB)
gl            | 2010-02-20  | galego      | Galician     | link (25.2 MB)  | link (24.7 MB)  | (none)          | link (3.8 MB)
is            | 2010-03-11  | íslenska    | Icelandic    | link (6.8 MB)   | link (6.7 MB)   | (none)          | link (3.8 MB)
it            | 2010-02-18  | italiano    | Italian      | link (329.9 MB) | link (323.3 MB) | link (305.7 MB) | link (15.7 MB)
nap           | 2010-02-21  | napulitano  | Neapolitan   | link (614.3 KB) | link (580.2 KB) | (none)          | link (1.4 MB)
pms           | 2010-03-07  | piemontèis  | Piedmontese  | link (1.7 MB)   | link (1.6 MB)   | (none)          | link (2.4 MB)
pt            | 2010-03-08  | português   | Portuguese   | link (185.0 MB) | link (180.9 MB) | (none)          | (not yet)
qu            | 2010-02-25  | runa simi   | Quechua      | link (685.5 KB) | link (639.6 KB) | (none)          | link (1.6 MB)
sl            | 2010-02-25  | slovenščina | Slovenian    | link (22.1 MB)  | link (21.8 MB)  | (none)          | link (4.6 MB)
sw            | 2010-02-24  | kiswahili   | Swahili      | link (2.9 MB)   | link (2.8 MB)   | (none)          | link (3.5 MB)
yo            | 2010-02-25  | yorùbá      | Yoruba       | link (433.6 KB) | link (375.5 KB) | (none)          | link (1.1 MB)

(for more information on these data, see my talk page at the English Wikipedia. Tresoldi 16:22, 13 March 2010 (UTC))

  • As most of the minor Wikipedias would likely be populated, at least at first, with translations of articles from the English one, the first pairs to be developed would be English/foreign-language pairs. Thus, for each language, an initial list of random sentences is drawn from the English corpus (as above) and, if such a system is available, translated with some existing machine translation software (such as Google Translate or Apertium). Hopefully, Wikipedians will start revising the translations collaboratively, just like normal Wikipedia articles.
  • After the small list of random sentences is adequately translated/revised, a number of other sentences will be selected to be gradually added to the developing parallel corpus. By using statistics collected with the language models built above, particularly with the English one, two different approaches will be followed:
    • The first is to gradually cover all common n-grams of all orders (descending from 5-grams), so that the most common structures in the English Wikipedia will be covered by the corpus; in other words, more pages should be translatable with fewer problems (think of very similar pages such as those about towns and cities, short biographies, etc.).
    • The second is to gradually cover the n-grams found in the parallel corpus itself, so that the different contexts in which those n-grams appear, particularly in the sentences already included, are also represented.
  • After the translation of about 1,000 sentences, an actual system will start to be built weekly. Anyone with some experience in statistical machine translation will agree that 1,000 sentences is a ridiculously low number for statistical translation, but the idea is to set up a baseline and gradually improve it. Besides that, while the addition of sentences as described above continues, from time to time the system would be used to translate some of the top articles of the English Wikipedia not covered by the foreign one; these would be corrected (with a lot of pain at first) and added back to the corpus. After some time, we should finally be able to eat our own dog food, i.e., do the first raw translations with our own statistical systems, no longer relying on non-free systems (however, using free software like Apertium for related languages is likely to remain a better alternative for the foreseeable future).
  • The corpora will be gradually tagged with part-of-speech information, lemmas and eventually syntactic information.
  • There will be a gradual integration with Wiktionary and Wikipedia's interlanguage links, covering not only the basic lemmas but, hopefully, the most common inflected forms for each language; the results could then be contributed back to Wiktionary with carefully configured bots.
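The n-gram-coverage step described above can be approximated as a greedy loop: score each candidate sentence by the total corpus frequency of the n-grams it would newly cover, and repeatedly take the best one. The following is a minimal Python sketch, not part of the original project; whitespace tokenization and the function names are assumptions for illustration:

```python
from collections import Counter

def ngrams(tokens, n):
    """All order-n n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def select_sentences(corpus, k, max_order=5):
    """Greedily pick up to k sentences that cover the most frequent,
    not-yet-covered n-grams (orders 1 through max_order)."""
    # Frequency of every n-gram in the corpus, all orders.
    freq = Counter()
    for sent in corpus:
        toks = sent.split()
        for n in range(1, max_order + 1):
            freq.update(ngrams(toks, n))

    covered = set()
    selected = []
    remaining = list(corpus)
    for _ in range(min(k, len(remaining))):
        # Score = total frequency of the n-grams a sentence would newly cover.
        def gain(sent):
            toks = sent.split()
            new = set()
            for n in range(1, max_order + 1):
                new.update(g for g in ngrams(toks, n) if g not in covered)
            return sum(freq[g] for g in new)

        best = max(remaining, key=gain)
        remaining.remove(best)
        selected.append(best)
        toks = best.split()
        for n in range(1, max_order + 1):
            covered.update(ngrams(toks, n))
    return selected
```

A real corpus would need proper tokenization and an incremental data structure rather than a full rescan per pick, but the greedy coverage idea is the same.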

Evaluating with Wikipedias[edit]

Originally found in the Apertium Wiki.

One of the ways of improving an MT system, and at the same time improving and adding content to Wikipedias, is to use Wikipedias as a test bed. You can translate text from one Wikipedia to another, then either post-edit it yourself, wait for other people to post-edit it, or ask them to. One of the nice things is that MediaWiki (the software Wikipedia is based on) allows you to view diffs between versions (see the 'history' tab).

This strategy is beneficial both to Wikipedia and to any machine translation system, whether Apertium or a statistical one built on Moses. Wikipedia gets new articles in languages which might not otherwise have them, and the machine translation system's developers get information on how to improve the software. It is important to note that Wikipedia is a community effort, and that people can rightly be concerned about machine translation. To get an idea why, put yourself in the place of people having to fix a lot of "hit and run" SYSTRAN (a.k.a. BabelFish) or Google Translate translations, with little time and not much patience.

Guidelines[edit]

  • Don't just start translating texts and wait for people to fix them. The first thing you should do is create an account on the target Wikipedia and find the "community notice board". Ask there how regular contributors would feel about you using the Wikipedia for tests. The community notice board should be linked from the front page; it might be called something like "La tavèrna" in Occitan or "Geselshoekie" in Afrikaans. When you are asking them, make the following clear:
  • This is free software / open source machine translation.
  • You would like to help the community and are doing these translations both to help their Wikipedia expand the range of articles, and to improve the translation software.
  • The translations will be added only with the consent of the community; you do not intend to flood them with poorly translated articles.
  • The translations will be added by a human, not by a bot.
  • Ask them if there are any subjects that they would prefer you to cover; perhaps they have a page of "requested translations".
  • One way of looking at it might be as a non-native speaker trying to learn the language. Point out that the initial translation will be done by machine, that you will then try to fix it, and that you would be grateful if other people fixed anything you miss.

An example of the kind of conversation you might have is found here.

How to translate[edit]

To make your work more useful, first paste in the unedited machine translation output when you create the page. Save the page with an edit summary saying that you're still working on it. Then proceed to post-edit the output. After you've finished, save the page again. If you go to the history tab at the top of the page and do "Compare selected versions", you will see the differences (diff) between the machine translation and the post-edited output. This gives a good indication of how good the original Apertium output was.

It's also helpful if you first paste in the input text. Then you can compare 1. the input, 2. the MT output, and 3. the post-edit. (Keeping the input text in the article history can be useful if you later want to compare old MT output with a newer version of the machine translator.)
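Because both the raw MT output and the post-edited version end up in the article history, the amount of post-editing can also be measured directly from the two texts. A minimal Python sketch using the standard library's difflib (the function names are illustrative, not from any of the tools named above):

```python
import difflib

def postedit_similarity(mt_output, post_edited):
    """Token-level similarity between raw MT output and the
    post-edited text; 1.0 means the post-editor changed nothing."""
    a, b = mt_output.split(), post_edited.split()
    return difflib.SequenceMatcher(None, a, b).ratio()

def show_edits(mt_output, post_edited):
    """List the post-editor's changes as (operation, before, after)."""
    a, b = mt_output.split(), post_edited.split()
    sm = difflib.SequenceMatcher(None, a, b)
    edits = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op != "equal":
            edits.append((op, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return edits
```

Aggregated over many post-edited articles, such scores give a rough picture of how much human work each raw translation still requires.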

Existing free software[edit]

  • Apertium
    • Apertium is an open-source platform (engine, tools) to build machine translation systems, mainly between related languages. Code, documentation, language pairs available from the Apertium website.
    • See Niklas Laxström, On course to machine translation, May 2013.
  • Ariane
    • Ariane (Ariane-H / Heloise version) is an online environment for developing machine translation systems. It is fully compatible with the original Ariane-G5 from the GETA research group of Grenoble University (France).
  • Moses (Moses is licensed under the LGPL.)
    • " Moses is a statistical machine translation system that allows you to automatically train translation models for any language pair. All you need is a collection of translated texts (parallel corpus)." (from their website Moses website.
  • Wikipedia translation -- defunct
    • Tool designed to help populate smaller Wikipedias, for example by quickly translating country templates.

Attempts[edit]

Several projects have been started to use computer-assisted translation on Wikimedia projects. An incomplete list of projects conducted on the Wikimedia projects themselves follows.

In other cases, tools were used outside the Wikimedia projects, or not programmatically:

Resources[edit]

General[edit]

  • Google Translate - Online gratis (statistical) machine translator.
  • Bing Translator - Online gratis (statistical) machine translator.
  • GramTrans - Online gratis (rule-based) machine translator, mostly covering Scandinavian languages.
  • VertaalMachine - Online translator that covers over 80 languages.
  • Promt - Online gratis (rule-based) machine translator.
  • WordLingo - Online translator, gratis for up to 500 words.
  • Apertium.org – Online free & open source (rule-based) machine translator.
  • Translate and Back - Online gratis Google based translation, which enables checking correctness by back translation.
  • Okchakko - Online gratis (rule-based) translator: French/Italian to Corsican.

Dictionaries[edit]

Corpora[edit]

  • Europarl - EU12 languages up to 44 million words per language (to be used only with English as source or target language, as many of the non-English sentences are translations of translations).
  • JRC-Acquis - EU22 languages.
  • Southeast European Times - English, Turkish, Bulgarian, Macedonian, Serbo-Croatian, Albanian, Greek, Romanian (approx. 200,000 aligned sentences, 4-5 million words).
  • South African Government Services - English and Afrikaans (approx. 2,500 aligned sentences, 49,375 words).
  • IJS-ELAN - English-Slovenian.
  • Open Source multilingual corpora - Despite the name, some resources might not be eligible for Wikipedia given their license.
  • OpenTran - single point of access to translations of open-source software in many languages (downloadable as SQLite databases).
  • Tatoeba Project - Database of example sentences translated into several languages.
  • ACL Wiki - List of resources by language (many corpus links etc. here)
  • FrazeIt - A search engine for sentences and phrases. Supports six languages, filtered by form, zone, context, and more.

Bibliography[edit]

See also[edit]

Generic English Wikipedia articles[edit]