Grants:Programs/Wikimedia Research Fund/Understanding how Editors Use Machine Translation in Wikipedia: A Case Study in African Languages
This is a stage 2 Wikimedia Research Fund application.
Review the full proposal.
- Eleftheria Briakou, Tajuddeen Gwadabe and Marine Carpuat
Affiliation or grant type
- University of Maryland; Masakhane Research Foundation
Description of the proposed project, including aims and approach. Be sure to clearly state the problem, why it is important, why previous approaches (if any) have been insufficient, and your methods to address it.
Our project aims to leverage state-of-the-art Natural Language Processing (NLP) technology to help Wikipedians develop best practices for using Machine Translation (MT) when creating content, particularly in under-resourced languages. We propose to study the use of the Content Translation tool. Since the recent integration of the NLLB-200 service, a strikingly small percentage of content is modified after machine translation (< 10% in low-resource languages such as Igbo and Zulu). However, the analysis so far has been limited to whether the content is modified or not, which lumps together cases where MT quality is good enough to be used as is and cases where human editors let translation errors go through.
Our project aims to close this gap by characterizing what human editors change when they edit MT outputs: do they fix factual errors, disfluencies, or stylistic issues? What kinds of translation errors do they let go through? We propose to answer these questions by following a participatory approach that brings together the scaling strengths of NLP and the evaluative strengths and knowledge of native speakers. As a first step, we will explore the use of MT for creating content in three African languages. We propose to study Igbo, Hausa, and Swahili, as our preliminary analysis shows that they all exhibit small but different percentages and types of modified content (the choice of languages is open to debate and will depend on how the Content Translation tool is being used once we launch the project). Our human evaluation will result in a collection of annotations of overall translation quality (for both MT and post-edited texts), along with fine-grained information characterizing the editing operations that editors apply when they use MT technology in those languages. Second, we propose to scale our analysis to more languages by leveraging NLP technology that we will evaluate on our annotated dataset. To that end, we propose to develop NLP models that automatically estimate the quality of Wikipedia translations and compare raw translations to human-edited content, building on our own work on characterizing cross-lingual semantic divergences and on state-of-the-art MT quality estimation tools.
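To make the comparison of raw MT output with human-edited content concrete, the following is a minimal sketch, not the method used by the Content Translation tool or by this project. It assumes whitespace tokenization and uses Python's standard `difflib` module; the function names `modified_fraction` and `edit_operations` and the example sentences are hypothetical and chosen for illustration only.

```python
from difflib import SequenceMatcher


def modified_fraction(mt_text: str, edited_text: str) -> float:
    """Share of MT tokens that the editor replaced or deleted.

    Assumption: whitespace tokenization. Tokens the editor only adds
    are not counted here; they show up in edit_operations() instead.
    """
    mt_tokens = mt_text.split()
    edited_tokens = edited_text.split()
    if not mt_tokens:
        return 0.0 if not edited_tokens else 1.0
    matcher = SequenceMatcher(a=mt_tokens, b=edited_tokens, autojunk=False)
    # Tokens that survive unchanged, summed over all matching blocks.
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - kept / len(mt_tokens)


def edit_operations(mt_text: str, edited_text: str) -> list[tuple[str, str, str]]:
    """List (operation, mt_span, edited_span) for every edited span."""
    mt_tokens = mt_text.split()
    edited_tokens = edited_text.split()
    matcher = SequenceMatcher(a=mt_tokens, b=edited_tokens, autojunk=False)
    operations = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged span, nothing to annotate
        operations.append(
            (tag, " ".join(mt_tokens[i1:i2]), " ".join(edited_tokens[j1:j2]))
        )
    return operations


if __name__ == "__main__":
    # Hypothetical English stand-ins for an MT output and its post-edit.
    mt = "The river flow through three country"
    edited = "The river flows through three countries"
    print(modified_fraction(mt, edited))
    for operation in edit_operations(mt, edited):
        print(operation)
```

A surface diff of this kind only locates what changed. Deciding whether a change fixes a factual error, a disfluency, or a stylistic issue is the part that requires native-speaker annotation or a learned model.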
We hope that conducting this study in collaboration with the Wikimedia Foundation will make it possible to interview and survey editors to understand their translation workflow and needs, and possibly to evaluate whether they find NLP feedback on translation quality useful.
Approximate amount requested in USD.
- 50,000 USD
Briefly describe what you expect to spend money on (specific budgets and details are not necessary at this time).
- Annotation Cost: $18,000. This will be used to cover (two rounds of) annotation costs for translation quality assessment in 3 languages.
- Salary or Stipend: $17,400 ($8,400 for Project Management/PI, $9,000 for language coordinators)
- Institutional overhead: $5,310. This will be used for Masakhane admin (15%).
- Travel to Wikimedia and NLP research conferences: $5,000.
- Human Subjects Incentives: $3,000
- Compute Resources: $1,290
Address the impact and relevance to the Wikimedia projects, including the degree to which the research will address the 2030 Wikimedia Strategic Direction and/or support the work of Wikimedia user groups, affiliates, and developer communities. If your work relates to knowledge gaps, please directly relate it to the knowledge gaps taxonomy.
Our project aims to help Wikipedia communities that serve under-resourced and under-studied languages develop best practices for using machine translation technology when they create content in their languages. This project represents a step towards addressing one of the key issues highlighted in the 2030 Wikimedia strategic direction: bridging Wikipedia's knowledge and content gaps across languages.
Plans for dissemination.
We will share the results of our work with Wikimedia and academic communities as detailed below:
1. We will present our findings at Wikimedia conferences (e.g., Wiki Workshop, Wikidata Workshop, Wiki-M3L, WikiIndaba).
2. We will release any artifacts from this project to the public and communicate any results from related surveys/interviews with Wikipedia communities.
3. We will summarize our findings in research papers submitted to top-tier Natural Language Processing conferences.
Prior contributions to related academic and/or research projects and/or the Wikimedia and free culture communities. If you do not have prior experience, please explain your planned contributions.
Eleftheria Briakou is a Ph.D. Candidate at the University of Maryland, College Park. Her work contributes models for understanding meaning differences across translations of Wikipedia pages.
Tajuddeen Rabiu Gwadabe has been a member of Masakhane for over two years and has worked on creating datasets and models for various African languages. He currently works part-time as a Project Manager at Masakhane Research Foundation, coordinating different Lacuna Funds.
Marine Carpuat is an Associate Professor in Computer Science at the University of Maryland, College Park. Marine is the recipient of multiple career and research awards. She was a program co-chair of the NAACL 2022 conference and a co-chair of the Evaluation Working Group for “Big Science”.
I agree to license the information I entered in this form, excluding the pronouns, countries of residence, and email addresses, under the terms of Creative Commons Attribution-ShareAlike 4.0. I understand that the decision to fund this Research Fund application and the application itself, along with all the information entered by me in this form, excluding the pronouns, countries of residence, and email addresses of the personnel, will be published on Wikimedia Foundation Funds pages on Meta-Wiki and will be made available to the public in perpetuity. To make the results of your research actionable and reusable by the Wikimedia volunteer communities, affiliates and Foundation, I agree that any output of my research will comply with the WMF Open Access Policy. I also confirm that I have read the privacy statement and agree to abide by the WMF Friendly Space Policy and Universal Code of Conduct.
Please add any feedback or endorsements to the grant discussion page only. Any feedback added here may be removed.