

From Meta, a Wikimedia project coordination wiki

Machine translation: we can do it!


I believe that if we work together, we can soon have machine translations: rough at first but constantly improving. There is a lot of work to do, but each part is possible and not very difficult.

As I see it, we need these things:

A machine interlingua ("Mingua")


We need a language which can be read by computers. This language can be developed as we go along. It can be called "Mingua" for "Machine interlingua", and might look something like this:

  • "Sam eats apples" in English becomes "assert(eat('Sam','apples'),present)" in Mingua.
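As a rough illustration only, here is one way a program might hold such a statement in memory and write it out again. The class names `Predicate` and `Assertion` and the function `to_mingua` are my own assumptions for this sketch; nothing about Mingua's real form has been decided.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    name: str    # e.g. "eat"
    args: tuple  # e.g. ("Sam", "apples")


@dataclass(frozen=True)
class Assertion:
    predicate: Predicate
    tense: str   # e.g. "present"


def to_mingua(assertion: Assertion) -> str:
    """Write an assertion in the notation used in the example above."""
    quoted = ",".join(f"'{arg}'" for arg in assertion.predicate.args)
    return f"assert({assertion.predicate.name}({quoted}),{assertion.tense})"


statement = Assertion(Predicate("eat", ("Sam", "apples")), "present")
print(to_mingua(statement))  # assert(eat('Sam','apples'),present)
```

Keeping the statement as structured data rather than as a string means that a program can reorder or replace the parts before writing them out in another language.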

Grammar and vocabulary for each language




I'm impressed with the rich collection of information growing on Wiktionary. Some words have synonyms listed in many languages. This will be a great resource for machine translation.

We need a little more information on each Wiktionary page. For machine translation, the computer program needs to know not only which words are synonyms, but how to use each word.

For example, suppose a language has no verb quite like "eat", and its speakers express that concept with a verb like "nourish", as in "Apples nourish Sam". Then the Wiktionary page for that "nourish" word could give the translation into Mingua as:

  • "$A nourishes $B" means "eat($B,$A)"

A computer can parse this and get the words in the right order. Information about prepositions that usually go with verbs can be demonstrated in the same way, e.g. "look at", "give to".
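To show that a rule written this way really is easy for a computer to use, here is a minimal sketch. It assumes that placeholders are single words written as `$A`, `$B` and so on, and that everything else in the pattern is literal text; the helper name `compile_rule` is hypothetical. It does not handle inflection or multi-word phrases.

```python
import re
from string import Template


def compile_rule(pattern: str, meaning: str):
    """Build a function that maps a matching sentence to its Mingua form."""
    parts = []
    for token in pattern.split():
        if token.startswith("$"):
            # A placeholder such as $A matches one word and is captured by name.
            parts.append(f"(?P<{token[1:]}>\\w+)")
        else:
            # Any other token must appear literally.
            parts.append(re.escape(token))
    regex = re.compile(r"\s+".join(parts))
    template = Template(meaning)

    def translate(sentence: str):
        match = regex.fullmatch(sentence.strip())
        if match is None:
            return None  # the rule does not apply to this sentence
        return template.substitute(match.groupdict())

    return translate


nourish = compile_rule("$A nourishes $B", "eat($B,$A)")
print(nourish("Bread nourishes Sam"))  # eat(Sam,Bread)
```

The same function would serve for verbs with prepositions, for example a pattern such as "$A looks at $B", since "at" is simply treated as literal text.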

Verb forms also need to be shown:

  • English verb eat: eats, ate, eaten, eating.

For regular verbs, only one form needs to be shown. For French verbs, for example, enough forms can be listed that the other forms can be found by regular rules.
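A sketch of how a program might combine listed forms with regular rules follows. The table `IRREGULAR`, the form labels and the function `verb_forms` are assumptions made for illustration, and the regular rule is deliberately naive: it ignores consonant doubling ("stop", "stopped"), "-es" endings and similar details that real English rules would need.

```python
# Forms that must be listed explicitly, as on the "eat" entry above.
IRREGULAR = {
    "eat": {
        "third_singular": "eats",
        "past": "ate",
        "past_participle": "eaten",
        "present_participle": "eating",
    },
}


def verb_forms(base: str) -> dict:
    """Return the forms of an English verb, preferring listed forms."""
    if base in IRREGULAR:
        return IRREGULAR[base]
    # Naive regular rule: drop a final "e" before adding "-ed" or "-ing".
    stem = base[:-1] if base.endswith("e") else base
    return {
        "third_singular": base + "s",
        "past": stem + "ed",
        "past_participle": stem + "ed",
        "present_participle": stem + "ing",
    }


print(verb_forms("eat")["past"])   # ate
print(verb_forms("walk")["past"])  # walked
```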

Human users of Wiktionary can also benefit from this additional information.

If Wiktionary decides that this type of information is outside its scope, then a separate project will need to be started. Either way, use of a wiki allows huge numbers of people to easily cooperate in building the collection of information.



Another thing I would like to see is wiki pages setting out the grammar of each language in computer-readable form. The more crucial pages would likely be protected part of the time once they've been developed sufficiently.

These pages would demonstrate, step-by-step, how to translate from Mingua to other languages and vice versa.

Some of it might look like this:

  • English:

sentence := subsentence
subsentence := clause | conjunction(subsentence, subsentence)
clause := subject verb [object]

This means that a sentence is made of a subsentence. A subsentence can be made of either a clause, or of two subsentences joined by a conjunction. A clause in English has a subject, then a verb, then an object in that order; the object is in brackets because it can be left out.

It would be a lot more complicated than that. People could try translating complex sentences using machine translation and change or add to the description of the grammar to make their sentences translate nicely.
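To make the small English grammar above concrete, here is a sketch that produces English text from a nested structure following those three rules. The tuple layout, and the choice to store the conjunction word ("and", "but") inside the structure, are my assumptions; the grammar as written does not say where that word comes from.

```python
def render(node) -> str:
    """Turn a subsentence into English words, following the grammar above."""
    kind = node[0]
    if kind == "clause":
        # clause := subject verb [object]
        _, subject, verb, obj = node
        words = [subject, verb]
        if obj is not None:  # the object is optional
            words.append(obj)
        return " ".join(words)
    if kind == "conjunction":
        # subsentence := conjunction(subsentence, subsentence)
        _, word, left, right = node
        return f"{render(left)} {word} {render(right)}"
    raise ValueError(f"unknown node type: {kind!r}")


sentence = (
    "conjunction",
    "and",
    ("clause", "Sam", "eats", "apples"),
    ("clause", "Kim", "sleeps", None),
)
print(render(sentence))  # Sam eats apples and Kim sleeps
```

A language with a different word order would need only a different version of the "clause" branch, which is the kind of information the grammar pages would supply.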

A translation program


We need a computer program to do the translation. I think this is not very difficult: the program itself can be fairly simple. The complexity will be in the vocabulary and grammar to be built up on the wiki.

Disambiguated input


As user Sloyment suggests on the page Wikipedia Machine Translation Project, before an article is translated, people who speak the original language can mark it up for easier translation. For example:

  • I put(past) it into (the oven with the glass door).
  • I put(past) it (into the oven) with the oven mitts.
  • The sheep(singular) ate out of my hand.

At the same time, they might also edit awkward sentences to be easier for both humans and computers to read.
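Markup like the examples above could be read by a program along the following lines. This sketch assumes that a tag is written directly after a word with no space, as in "put(past)", while grouping brackets are preceded by a space; the names `ANNOTATION` and `extract_annotations` are hypothetical. It only collects the word tags and leaves the grouping brackets in place.

```python
import re

# A word immediately followed by a bracketed tag, e.g. "sheep(singular)".
ANNOTATION = re.compile(r"(\w+)\((\w+)\)")


def extract_annotations(sentence: str):
    """Return the sentence without word tags, plus the (word, tag) pairs."""
    tags = ANNOTATION.findall(sentence)
    plain = ANNOTATION.sub(r"\1", sentence)
    return plain, tags


plain, tags = extract_annotations("The sheep(singular) ate out of my hand.")
print(plain)  # The sheep ate out of my hand.
print(tags)   # [('sheep', 'singular')]
```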

Places to store various stages of translation


I think we will need places on the Wikimedia projects to store the following versions of articles:

Lists of articles to be translated


Lists or copies of good articles recommended for translation.

Disambiguated input


See above. It would be good to store this input permanently, so that if the original article is expanded, it will be easier to develop a disambiguated version of the expanded article.

Articles in Mingua


Articles that have been translated into Mingua, the machine interlingua. These articles can then be automatically translated into many other languages.

Machine-translated articles


We need a place to put articles which have been machine-translated but not yet checked and edited by a human. One idea is to put these in the regular article space, with a template indicating that they are automatic translations and therefore perhaps unreliable. Another is to name them as subpages or put them in a different namespace. A third is to keep them on a separate project.

See also


R. Morneau has a very interesting (though copyrighted) monograph online [1] about a machine translation interlingua. His concept is that people would learn the interlingua as a language and learn to write in it; it could then be translated automatically into many languages. I think it's easier to generate disambiguated input and automatically translate from that into a machine interlingua. With my concept, people don't have to be able to read or write the interlingua; only computers do.

See also Wikipedia Machine Translation Project. Please comment on the ideas either on the discussion page there or on my talk page.