Machine translation for small Wikipedias
This is a first draft of a page about machine translation for small projects. Cherokee already has this feature, and I expect it will keep improving. Such tools can help us a lot, since it seems that error rates of approximately 5% can be reached. Of course machine translation software is not perfect: it only knows what we tell it. But the more I see of it, the more I believe that we can really reach a good level.
Of course each language has its own problems ... we will find a way to get things where we want them.
A first discussion of this topic took place on the Foundation mailing list, and it is time to move over to Meta and create facts on the ground.
Have fun - believe me: it will be fun :-)
- 1 Where and how you can create the wordlist
- 2 Swahili and Bantu Language Machine Translations
- 3 German and French Machine Translation Status
- 4 Cherokee Machine Translation Status
- 5 Chickasaw Machine Translation Status
- 6 Uto-Aztecan Machine Translation Status
- 7 Creek and Muskogean Machine Translation Status
- 8 Hawaiian Machine Translations
- 9 Dine (Navajo) Machine Translation Status
Where and how you can create the wordlist
Thanks to Jeff, the wordlist is now online on OmegaWiki. The more people work on improving that wordlist, the faster we will progress and the better the pretranslations will become. You can access the list through this link.
The wordlist can be exported as soon as it is as complete as possible. Since the data is in a relational database, you will be able to use it not only for the machine translation project but also as an offline dictionary, as a spellchecker, and in many other ways.
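Because the wordlist sits in relational tables, an export is essentially a single query. The following is only a minimal sketch, not the actual OmegaWiki export: the `wordlist` table, its `source_word` and `target_word` columns, and the three sample English-Swahili pairs are assumptions made for illustration, and an in-memory SQLite database stands in for the real one.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in for the wordlist database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE wordlist (source_word TEXT, target_word TEXT)")
    conn.executemany(
        "INSERT INTO wordlist VALUES (?, ?)",
        # Illustrative sample entries only, not taken from the real wordlist.
        [("water", "maji"), ("book", "kitabu"), ("tree", "mti")],
    )
    conn.commit()
    return conn


def export_dictionary(conn):
    """Export the wordlist as a plain dict usable as an offline dictionary."""
    rows = conn.execute(
        "SELECT source_word, target_word FROM wordlist ORDER BY source_word"
    )
    return {source: target for source, target in rows}


def unknown_words(text, dictionary):
    """Very naive spellcheck: list words that are not known target-language words."""
    known = set(dictionary.values())
    return [word for word in text.lower().split() if word not in known]


conn = build_demo_db()
dictionary = export_dictionary(conn)
print(dictionary["water"])
print(unknown_words("maji kitabu xyz", dictionary))
```

The same exported mapping serves both uses mentioned above: looking up a word gives the dictionary entry, and checking a text against the known target words gives a crude spellchecker.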
Swahili and Bantu Language Machine Translations
I am dumping the public Swahili lexicon, as suggested by Dr. Benjamin, and relying solely on the Kamusi lexicon. The next translation pass will use Dr. Martin Benjamin's English-Swahili grammar parser and rule sets. Jmerkey 20:40, 29 August 2006 (UTC)
- I have purged the database from ] for the first run and removed the public Swahili lexicons and the thesaurus extractions which relied upon this lexicon, and have converted the translator to reconstruct the thesaurus from the Kamusi Swahili lexicons. I am running a second pass with the particles a/an/the removed from the translator lexicons. This pass also uses the link grammar parser to pair words and link word tenses to increase accuracy. Jmerkey 03:33, 30 August 2006 (UTC)
- Second run completed 8/30/06 and XML dumps posted at . Updated Kamusi lexicons and Thesaurus also completed. Jmerkey 16:25, 30 August 2006 (UTC)
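The idea of removing the particles a/an/the before a translation pass can be sketched as follows. This is not the translator described above: the function names, the tiny sample lexicon, and the word-for-word lookup are assumptions for illustration only, and the link grammar parsing step is not shown.

```python
# English articles that get no entry in the target-language lexicon.
ARTICLES = {"a", "an", "the"}


def strip_articles(lexicon):
    """Return a copy of the lexicon without entries for English articles."""
    return {src: tgt for src, tgt in lexicon.items() if src not in ARTICLES}


def pretranslate(sentence, lexicon):
    """Word-for-word pass: drop articles, translate known words, keep unknown ones."""
    output = []
    for word in sentence.lower().split():
        if word in ARTICLES:
            continue  # articles are dropped rather than translated
        output.append(lexicon.get(word, word))
    return " ".join(output)


# Hypothetical sample lexicon; real lexicons would come from the Kamusi data.
lexicon = strip_articles({"the": "", "book": "kitabu", "tree": "mti"})
print(pretranslate("The book", lexicon))
```

Unknown words are passed through unchanged in this sketch, so that a reviewer can spot the gaps in the lexicon.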
German and French Machine Translation Status
The German and French machine translation lexicon development is underway. I am hosting someone from Germany who is fluent in both languages (my German wife's older son from Schaag). His name is Florian, and he will be assisting in the creation of rule sets, conjugators, and lexicons for German and French machine translations from the English Wikipedia. Jmerkey 20:43, 29 August 2006 (UTC)
- Jeff, where are they doing it and who is doing it? If we all work together on OmegaWiki on these lists all of us will be faster and we will get better and better results in future. Thanks! --Sabine 16:08, 21 September 2006 (UTC)
Cherokee Machine Translation Status
The Cherokee translations are at over 98% at present, but still need additional lexicon and disambiguation work to correct subtle errors in verb and noun usage. We are also adding 200,000 additional phrases, words, and Otali dialect variants, as well as modern names and their Cherokee equivalents. The Cherokee project is the most developed to date, but there is still considerable work and review left. Jmerkey 20:45, 29 August 2006 (UTC)
Chickasaw Machine Translation Status
The Chickasaw Nation is now participating in the project. Robert Mayden of the Chickasaw Nation of Oklahoma is working on lexicons, grammar rule sets, and parsers for the Chickasaw language, a Muskogean language. Machine translation runs are slated to begin later this month for this language family. Jmerkey 22:33, 12 September 2006 (UTC)
Uto-Aztecan Machine Translation Status
The Ute Tribe has received the agreements, and we are awaiting funding approval from the ANA and the Federal Government for a 3TB system to be deployed at their Tribal Complex (a system has already been set up; they just have not paid for it yet). At present they are doing work on Windows-based systems. This is slated for mid-October, after the Bear Dance and Autumn Ceremonies. The language materials and the status of translations are governed by a highly confidential agreement, which the Foundation has a copy of and has reviewed. Due to religious and cultural issues among the Ute People related to their language and beliefs about their languages being written and distributed, the precise status and details of this project cannot be disclosed beyond general statements. The tribe is slated to set up and host a complete Wikipedia mirror in mid-fall, with access only for their tribal members. We are proceeding with work on the project in accordance with our agreements with them. We estimate completion of the activities discussed over the summer by the end of the year. Jmerkey 20:52, 29 August 2006 (UTC)
Creek and Muskogean Machine Translation Status
These groups have been contacted and native speakers identified. We anticipate these programs will begin mid to late fall. Jmerkey 20:53, 29 August 2006 (UTC)
Hawaiian Machine Translations
Discussions are in progress. It is currently not known whether any Hawaiian machine translation software, rule sets, or automated grammatical analysis algorithms exist for ʻŌlelo Hawaiʻi. The website http://wehewehe.org does contain an excellent online dictionary with an individual-word search engine.
Dine (Navajo) Machine Translation Status
One native Dine speaker is currently working on lexicons, rule sets, and grammar rules for the Dine language. This project is going much more slowly than we had anticipated. The Dine tribe has not been approached officially, as they are in an election year and the composition of their council and executive government will change after the elections. Our current plan is to complete an initial translation to 95% accuracy, then approach the tribe about hosting the translations at their complex. The Dine tribe already has an enormous language effort underway; however, Wikipedia participation has been nil to date. Jmerkey 21:04, 29 August 2006 (UTC)
- Ronnie Greymountain of the Dine Tribe is at 40% completion of the lexicons, thesaurus, and rule sets for machine translation for the Navajo language. An appliance has been procured for this effort, which is now underway. We will publish the Dine Wikipedia machine translation in the late November/December time frame, with the first application of the translation in the children's programs. We will approach the Tribal Leadership in an official capacity about full Wikipedia participation in early 2007. Jmerkey 02:53, 8 September 2006 (UTC)