Grants:Programs/Wikimedia Research Fund/Wikipedia and Speech Technologies: Using multimodal AI for increasing content in low resource languages on Wikipedia

From Meta, a Wikimedia project coordination wiki
status: not funded
Wikipedia and Speech Technologies: Using multimodal AI for increasing content in low resource languages on Wikipedia
start and end dates: July 2023 - July 2024
budget (USD): 36,032.60 USD
fiscal year: 2022-23
applicant(s): Hellina Hailu Nigatu, Chris Chinenye Emezue and Bonaventure Dossou

Overview

Applicant(s)

Hellina Hailu Nigatu, Chris Chinenye Emezue and Bonaventure Dossou

Affiliation or grant type

Masakhane

Author(s)

Hellina Hailu Nigatu, Chris Chinenye Emezue and Bonaventure Dossou

Wikimedia username(s)

Project title

Wikipedia and Speech Technologies: Using multimodal AI for increasing content in low resource languages on Wikipedia

Research proposal

Description

Description of the proposed project, including aims and approach. Be sure to clearly state the problem, why it is important, why previous approaches (if any) have been insufficient, and your methods to address it.

Wikipedia, a multilingual resource available in 329 languages, has been a go-to data source for tasks like machine translation and speech recognition. While Wikipedia provides some language diversity, content in low-resource languages, including many African languages, suffers from quantity and quality issues. Another issue is the contextual relevance of articles: for instance, Amharic Wikipedia has an article about the Big Mac burger even though there is no McDonald's in Ethiopia, while the article about Awra Amba, a village in Ethiopia, is available in 5 languages, none of which are Ethiopian. This ties in with colonial legacies of erasing and rewriting African history[1].

Having contextually relevant content on Wikipedia in low-resource languages requires thinking about how different communities preserve knowledge. Almost half of the world's languages are unwritten[2], and many African communities preserve knowledge through oral traditions. Authoring articles in written form is therefore a challenge because (1) the people who recite and pass on the oral artifacts are elders and (2) it is hard to type in these languages, since most keyboards and writing tools are tailored to higher-resourced scripts.

Multimodal input systems and Automatic Speech Recognition (ASR) tools can empower communities to preserve their knowledge in their own language. Yet current ASR tools lag behind for African languages due to insufficient and out-of-context data. To realize our goal of expanding input modalities, we need to create relevant datasets and train ASR models. We propose a project with two end-products: (1) contextually relevant articles for communities, written by hired local researchers, and (2) ASR datasets created from the articles produced by the researchers. Our community researchers will interview community members and review literature to create the articles. Unlike prior work[3], we will bring the process closer to users by using bots on Telegram/WhatsApp to gather recordings for the articles. To begin, we will focus on 4 languages: Amharic, Ndonga, Igbo, and Swahili. Since Wikipedia uses citations for validation, this project will lay the groundwork for our future work on creating referencing systems for such content. We will build on tools like Scribe, WikiSpeech and Sawtpedia, but with a focus on the creation of new articles rather than the narration of existing ones.

[1] https://doi.org/10.1007/s10691-021-09470-6

[2] https://www.ethnologue.com/enterprise-faq/how-many-languages-world-are-unwritten-0

[3] https://commonvoice.mozilla.org/en
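As an illustration of the second end-product, the following is a minimal sketch of how an article could be split into sentence-level prompts for narrators. It is not the project's actual pipeline: the function name `article_to_prompts`, the prompt ID format, and the naive punctuation-based splitting rule are all assumptions made for this example. The splitter treats the Ethiopic full stop (።) used in Amharic as a sentence end alongside Latin punctuation, and it will mis-split text containing abbreviations.

```python
import re

# Assumed splitting rule: a sentence ends at ".", "!", "?" or the Ethiopic
# full stop (U+1362, used in Amharic), followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?\u1362])\s+")


def article_to_prompts(article_id, language, text):
    """Split one article into sentence prompts for narration.

    Hypothetical helper: returns a list of dicts, one per sentence, each
    carrying an ID that ties the recording back to its source article.
    """
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    return [
        {
            "prompt_id": f"{article_id}-{i:04d}",
            "language": language,
            "text": sentence,
        }
        for i, sentence in enumerate(sentences, start=1)
    ]


if __name__ == "__main__":
    sample = "Awra Amba is a village. It is in Ethiopia."
    for prompt in article_to_prompts("awra-amba", "en", sample):
        print(prompt["prompt_id"], prompt["text"])
```

Keeping the article ID inside each prompt ID means a recording gathered through a messaging bot can always be traced back to the sentence and article it narrates.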

Personnel

N/A

Budget

Approximate amount requested in USD.

36,032.60 USD

Budget Description

Briefly describe what you expect to spend money on (specific budgets and details are not necessary at this time).

- For voice recording Telegram/WhatsApp bots

Python Engineer $5,600

Data storage (1 year) $282.60

Cloud hosting (5 years) $150

- For wiki data curators: $100 per article

30 articles per language across 4 languages. Hence 30 x 4 x $100 = $12,000 to create the written articles.

- For narrators

Each prompt is a sentence that takes about 15 seconds on average to narrate. We aim to pay $0.30 per prompt. On average, one article will have 500 sentences. Hence the total is 500 sentences x 30 articles x 4 languages x $0.30 = $18,000.
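The line items above can be checked against the requested total with a short script. This is only a verification sketch: the variable names are chosen for illustration, and the estimate of recorded audio assumes every prompt takes exactly the stated 15-second average.

```python
from decimal import Decimal

# Figures taken from the budget description.
LANGUAGES = 4
ARTICLES_PER_LANGUAGE = 30
SENTENCES_PER_ARTICLE = 500
SECONDS_PER_PROMPT = 15  # stated average; real recordings will vary

bot_costs = {
    "Python engineer": Decimal("5600"),
    "Data storage (1 year)": Decimal("282.60"),
    "Cloud hosting (5 years)": Decimal("150"),
}

articles = LANGUAGES * ARTICLES_PER_LANGUAGE   # 120 articles
curation = articles * Decimal("100")           # $100 per article
prompts = articles * SENTENCES_PER_ARTICLE     # 60,000 prompts
narration = prompts * Decimal("0.30")          # $0.30 per prompt

total = sum(bot_costs.values(), Decimal("0")) + curation + narration

# Rough size of the resulting speech corpus, under the 15-second assumption.
audio_hours = prompts * SECONDS_PER_PROMPT / 3600

if __name__ == "__main__":
    print("Curation:", curation)
    print("Narration:", narration)
    print("Total:", total)
    print("Estimated audio hours:", audio_hours)
```

Under these figures the three bot-related items, curation and narration sum to the requested 36,032.60 USD, and the narration budget corresponds to roughly 250 hours of recorded speech across the four languages.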

Impact

Address the impact and relevance to the Wikimedia projects, including the degree to which the research will address the 2030 Wikimedia Strategic Direction and/or support the work of Wikimedia user groups, affiliates, and developer communities. If your work relates to knowledge gaps, please directly relate it to the knowledge gaps taxonomy.

This project will increase the diversity of content on Wikipedia in low-resource languages. It is aligned with several of the Movement Strategy Recommendations, particularly Improving User Experience and Identifying Topics for Impact. It is also aligned with challenges outlined in A Taxonomy of Knowledge Gaps for Wikimedia Projects. It will lower barriers for contributors and empower communities to inscribe their own knowledge, protecting that knowledge from systematic erasure. Lessons learned from our project can inform other projects such as “Make Wikipedia more accessible to the visually impaired.” It is also in line with the Sustainable Development Goals of the United Nations, particularly Quality Education and Reduced Inequalities.

Dissemination

Plans for dissemination.

Our work will mainly be disseminated through scholarly articles. The data curation and community engagement will be a research contribution for HCI, NLP and ICTD venues. The ASR system will be a speech processing contribution, highlighting the importance of local content for speech data. We are also hopeful that our work will encourage scholars in the social sciences to study the impact of increased content and representation on Wikipedia. We will also present our work at WikiIndaba and WikidataCon.

Past Contributions

Prior contributions to related academic and/or research projects and/or the Wikimedia and free culture communities. If you do not have prior experience, please explain your planned contributions.

We are a group of Masakhane and Lanfrica researchers dedicated to building and improving NLP tools for African languages through open-source projects. We have developed the African Stopword Project and MasakhaneNER, both of which are open source and developed through community efforts.

While building these tools, we have used Wikipedia as a data source, which gives us additional motivation to improve the quality and quantity of African content on Wikipedia. Masakhane also won the 2021 WMF Research Award and has an ongoing project with the fund. Collaborators on this project have worked as volunteer contributors to the project “Research: Adapting Wikidata to support clinical practice using Data Science, Semantic Web and Machine Learning”.


I agree to license the information I entered in this form, excluding the pronouns, countries of residence, and email addresses, under the terms of Creative Commons Attribution-ShareAlike 4.0. I understand that the decision to fund this Research Fund application, the application itself, and all the information entered by me in this form, excluding the pronouns, countries of residence, and email addresses of the personnel, will be published on Wikimedia Foundation Funds pages on Meta-Wiki and will be made available to the public in perpetuity. To make the results of my research actionable and reusable by the Wikimedia volunteer communities, affiliates and Foundation, I agree that any output of my research will comply with the WMF Open Access Policy. I also confirm that I have read the privacy statement and agree to abide by the WMF Friendly Space Policy and Universal Code of Conduct.

Yes