Applicant's Wikimedia username. If one is not provided, then the applicant's name will be provided for community review.
- Machine Assisted Translation of Wikipedia Articles into Low Resource Languages
Entity Receiving Funds
Provide the name of the individual or organization that would receive the funds.
- Lesan AI UG (haftungsbeschränkt)
Description of the proposed project, including aims and approach. Be sure to clearly state the problem, why it is important, why previous approaches (if any) have been insufficient, and your methods to address it.
- Wikipedia is the largest encyclopedia ever assembled, with the vision of enabling every human being to freely share in the sum of all knowledge. Wikipedia currently has a total of more than six million articles and over 17 billion words in its English edition. Unfortunately, millions of people cannot access this resource because it’s not available in their language. For instance, at the moment there are only 218 articles on Tigrinya Wikipedia and 15,018 on Amharic Wikipedia.
One of the main challenges of translating Wikipedia into low resource languages is the lack of assistive linguistic tools to empower contributors. Examples include machine translation (MT) to cover the mundane pieces so humans can focus on the more creative aspects, language models that predict the next word for efficient typing, and terminology dictionaries and spell checkers that help keep translation outputs consistent.
In this project, we investigate the problem of translating Wikipedia articles from a high resource language into low resource languages using human-in-the-loop MT systems. In particular, we will investigate different approaches to translate a sample of English Wikipedia articles into Tigrinya and Amharic.
The availability of Wikipedia in Tigrinya and Amharic will have a transformative impact for the millions of people who speak these languages. We believe this effort will contribute towards democratizing access to the Web through machine translation. Besides the actual Wikipedia articles, the aligned source and translated texts will contribute towards the development of a parallel corpus for training machine translation systems and semantic annotation of text for named entity recognition and linking.
Approximate amount requested in USD.
Briefly describe what you expect to spend money on (specific budgets and details are not necessary at this time).
- Campaigning and rewarding volunteers, 10 months (continuous): 10,000
- Training contributors, 10 months (continuous): 2,000
- Translation of articles by dedicated contributors, 10 months (continuous): 25,000
- Dataset quality assurance, 2 months: 2,000
- Computing and other costs: 1,000
- Overhead costs: 7,500
Address the impact and relevance to the Wikimedia projects, including the degree to which the research will address the 2030 Wikimedia Strategic Direction and/or support the work of Wikimedia user groups, affiliates, and developer communities. If your work relates to knowledge gaps, please directly relate it to the knowledge gaps taxonomy.
- This work will contribute towards:
- Innovate in Free Knowledge
This research will contribute to the following Knowledge Gaps:
- Contributors
- Content
Plans for dissemination.
- We believe the findings of this experiment will be useful for other low resource languages that have very few Wikipedia articles. We will publish our findings as a paper in a suitable venue.
Prior contributions to related academic and/or research projects and/or the Wikimedia and free culture communities. If you do not have prior experience, please explain your planned contributions.
- We have built state-of-the-art machine translation (MT) systems to and from Amharic, Tigrinya and English. Our work has been featured as a demo at NeurIPS 2021.
Hadgu, Asmelash Teka, et al. "Lesan – Machine Translation for Low Resource Languages." Neural Information Processing Systems 2021 (NeurIPS 2021) demonstrations track. arXiv preprint arXiv:2112.08191 (2021).
I agree to license the information I entered in this form excluding the pronouns, countries of residence, and email addresses under the terms of Creative Commons Attribution-ShareAlike 4.0. I understand that the decision to fund this Research Fund application, the application itself along with all the information entered by me in this form excluding the pronouns, countries of residence, and email addresses of the personnel will be published on Wikimedia Foundation Funds pages on Meta-Wiki and will be made available to the public in perpetuity. To make the results of your research actionable and reusable by the Wikimedia volunteer communities, affiliates and Foundation, I agree that any output of my research will comply with the WMF Open Access Policy. I also confirm that I have read the privacy statement and agree to abide by the WMF Friendly Space Policy and Universal Code of Conduct.