Community Resources and Partnerships/India Rapid Project/50K Malayalam Words:A Lingua Libre Audio Corpus Project
Applicant
- Main Wikimedia username. (required)
BhagyaMohan
- Organization
N/A
- If you are a group or organization leader, board member, president, executive director, or staff member at any Wikimedia group, affiliate, or Wikimedia Foundation, you are required to self-identify and present all roles. (required)
N/A
- Describe all relevant roles with the name of the group or organization and description of the role. (required)
N/A
Project
- 1. Please state the title of your proposal. This will also be the Meta-Wiki page title.
50K Malayalam Words: A Lingua Libre Audio Corpus Project
- 2. and 3. Proposed start and end dates for the proposal.
2025-09-01 - 2026-02-28
- 4. Where will this proposal be implemented? (required)
India
- 5. Are your activities part of a Wikimedia movement campaign, project, or event? If so, please select the relevant project or campaign. (required)
Not applicable
- 6. What is the change you are trying to bring? What are the main challenges or problems you are trying to solve? Describe this change or challenges, as well as main approaches to achieve it. (required)
This project aims to build a comprehensive, open, and accessible audio corpus of the Malayalam language by contributing 50 thousand high-quality audio recordings of Malayalam words, phrases, and expressions to Lingua Libre and Wikimedia Commons. The larger change I seek to bring is to enhance the digital presence, accessibility, and preservation of the Malayalam language in the global linguistic ecosystem.
Challenges or Problems Addressed:
- Lack of Free and Accessible Malayalam Audio Content: Most pronunciation resources for Malayalam are either unavailable, behind paywalls, or of poor quality. Malayalam learners and technology developers have limited access to accurate, diverse, and freely usable voice data.
- Digital Underrepresentation of Regional Languages: Malayalam, though spoken by millions, is underrepresented in free and open datasets, especially in audio format. This limits its visibility in digital tools such as language learning apps, speech recognition models, and accessibility technologies (like screen readers).
- Lack of Open Training Data: Open data is essential for training AI models and building educational tools, yet Malayalam lacks robust datasets for Natural Language Processing (NLP), especially in spoken form.
Approaches and Strategies to Achieve the Change:
1. Mass Recording and Systematic Data Organization:
- Approach: Use the Lingua Libre recording tool to systematically record 50,000 Malayalam words and phrases, organized by theme, frequency, or usefulness.
- Why this approach: Lingua Libre is already optimized for multilingual, scalable audio contributions with metadata and integration with Wikimedia Commons.
- Activity Example: Create lists of commonly used Malayalam words, government terms, academic words, and everyday vocabulary, then record and upload them.
2. Open Access Contribution to Wikimedia Projects:
- Approach: Make all audio files freely available under a Creative Commons license via Wikimedia Commons and Wiktionary.
- Why it works: These platforms ensure global reach, long-term preservation, and easy integration with other tools.
- Activity Example: Link audio files to Malayalam Wiktionary entries to enrich the dictionary with pronunciation data.
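The word lists mentioned under approach 1 could be prepared for recording along the following lines. This is only a minimal sketch, not part of the proposal's tooling: the function name `prepare_batches`, the theme names, and the batch size of 200 are all hypothetical. It deduplicates words across themes (after Unicode NFC normalization, so that visually identical Malayalam strings compare equal) and splits each theme into batches for recording sessions.

```python
import unicodedata


def prepare_batches(words_by_theme, batch_size=200):
    """Deduplicate words across themes and split each theme into batches.

    words_by_theme: dict mapping a theme name to a list of words.
    batch_size: hypothetical number of words per recording session.
    """
    seen = set()
    batches = {}
    for theme, words in words_by_theme.items():
        unique = []
        for word in words:
            # Normalize so that equivalent Unicode sequences are treated as one word.
            key = unicodedata.normalize("NFC", word.strip())
            if key and key not in seen:
                seen.add(key)
                unique.append(key)
        # Slice the deduplicated list into fixed-size batches.
        batches[theme] = [
            unique[i:i + batch_size] for i in range(0, len(unique), batch_size)
        ]
    return batches


if __name__ == "__main__":
    # Illustrative input only; real lists would come from the project's sources.
    sample = {
        "everyday": ["വീട്", "വെള്ളം", "വീട് "],
        "academic": ["വെള്ളം", "ശാസ്ത്രം"],
    }
    for theme, theme_batches in prepare_batches(sample, batch_size=2).items():
        print(theme, theme_batches)
```

A word that appears in more than one theme is kept only in the first theme that lists it, so no pronunciation is recorded twice.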
Why These Approaches?
- These strategies align with the Wikimedia movement’s values of openness, accessibility, and collaboration.
- Lingua Libre is already a proven, stable platform that supports high-volume recording and uploads.
- Community-driven projects in other languages (e.g., Basque, Catalan, Tamil) have successfully used these same approaches.
- By choosing scalable and open platforms, we ensure that this work contributes to language equality in the digital age.
- 7. What are the planned activities? (required) Please provide a list of main activities. You can also add a link to the public page for your project where details about your project can be found. Alternatively, you can upload a timeline document. When the activities include partnerships, include details about your partners and planned partnerships.
Below is a list of the main activities that will be carried out as part of the Malayalam Audio Corpus Project for Lingua Libre and Wikimedia Commons:
- Record audio pronunciations using the Lingua Libre tool. This will be done by a core team.
- Why: The Lingua Libre platform automates and streamlines mass recording and uploading.
- Goal: Reach the milestone of 50,000 recorded Malayalam words.
- Upload and categorize the recordings on Wikimedia Commons.
Timeline:
Months 1-4: Mass recording and Commons uploads
Month 5: Documentation and reporting
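As a rough feasibility check on this timeline, the recording pace implied by the 50,000-word goal can be worked out as below. The sketch assumes 30-day months and recording on every day of the four recording months; both are assumptions for illustration, not commitments made in the proposal, and `required_pace` is a hypothetical helper name.

```python
def required_pace(target, months, days_per_month=30):
    """Return (recordings per month, recordings per day) needed to hit the target.

    days_per_month=30 is an assumption; fewer working days raise the daily pace.
    """
    per_month = target / months
    per_day = target / (months * days_per_month)
    return per_month, per_day


if __name__ == "__main__":
    # 50,000 recordings over the four recording months in the timeline.
    per_month, per_day = required_pace(target=50_000, months=4)
    print(f"{per_month:.0f} recordings per month, about {per_day:.0f} per day")
```

Under these assumptions the goal works out to 12,500 recordings per month, or roughly 417 per day.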
- 8. Describe your team. Please provide their roles, Wikimedia Usernames and other details. (required) Include more details of the team, including their roles, usernames, Wikimedia group, and whether they are salaried, volunteers, consultants/contractors, etc.
Bhagya Mohan (Audio Contributor/Project Lead)
- Username: BhagyaMohan
- Role: Will record the corpus, ensure audio quality, and manage audio data organization.
[To be confirmed] Support Organizer
- Role: Will assist in batch uploading audio to Wikimedia Commons, add metadata, organize categories, and integrate files with Wiktionary.
- 9. Who are the target participants and from which community? How will you engage participants before and during the activities? How will you follow up with participants after the activities? (required)
Voice contributors
- 9.1. If your project includes in-person activities, are there any international participants travelling to India for them? (required)
- No
- 9.1.1. List all countries of participation. (required)
- 9.2. Will the project be transferring funds to any international participants? (required)
- No
- 9.2.1. List all international participants receiving funding and their countries. (required)
N/A
- 10. Does your project involve work with children or youth? (required)
- No
- 10.1. Please provide a link to your Youth Safety Policy. (required) If the proposal indicates direct contact with children or youth, you are required to outline compliance with international and local laws for working with children and youth, and provide a youth safety policy aligned with these laws. Read more here.
N/A
- 11. How did you discuss the idea of your project with your community members and/or any relevant groups? Please describe steps taken and provide links to any on-wiki community discussion(s) about the proposal. (required) You need to inform the community and/or group, discuss the project with them, and involve them in planning this proposal. You also need to align the activities with other projects happening in the planned area of implementation to ensure collaboration within the community.
As this is an individual-led project, initial planning and scoping have been done independently. However, I recognize the importance of community awareness and alignment.
I will be updating the talk page of the Wikimedians of Kerala User Group to inform the community about this project and invite feedback or suggestions. This will help ensure transparency, avoid duplication, and open up space for future collaboration or reuse of the audio content in other Malayalam Wikimedia projects.
- 12. Does your proposal aim to work to bridge any of the content knowledge gaps (Knowledge Inequity)? Select one option that most apply to your work. (required)
Language
- 13. Does your proposal include any of these areas or thematic focus? Select one option that most applies to your work. (required)
Gender and diversity
- 14. Will your work focus on involving participants from any underrepresented communities? Select one option that most apply to your work. (required)
Linguistic / Language
- 15. In what ways do you think your proposal most contributes to the Movement Strategy 2030 recommendations. Select one that most applies. (required)
Innovate in Free Knowledge
Metrics
- 17. What do you hope to learn from your work in this project or proposal? (required)
This project is designed and implemented as an individual effort, with the goal of contributing a large-scale open audio dataset in Malayalam to Lingua Libre and Wikimedia Commons. While it does not rely on community participation, it still seeks to generate valuable insights that can benefit similar future initiatives.
1. How feasible is it for an individual to carry out a high-volume audio contribution project?
- Can one person effectively manage recording, organizing, and uploading 50,000 audio files within a structured timeframe?
- What are the time, technical, and logistical challenges faced in executing such a project solo?
2. What tools and workflows best support efficient individual contributions to Lingua Libre and Wikimedia Commons?
- How reliable and user-friendly is the Lingua Libre platform for solo users doing bulk work?
- What technical improvements or workflow adaptations can support high-volume individual contributors?
3. What is the potential impact of individual-led projects on open knowledge platforms?
- Can large-scale solo contributions significantly enhance the utility of platforms like Wiktionary or Commons for learners and researchers?
- Does individual work attract reuse, visibility, or further contributions from others after completion?
4. What are the limits and sustainability challenges of working independently?
- How can individual projects be made more sustainable over time without burnout?
- What kind of support (technical, financial, moral) is most helpful for solo contributors?
- 18. What are your Wikimedia project targets in numbers (metrics)? (required)
| Other Metrics | Target | Optional description |
|---|---|---|
| Number of participants | 3 | Since this project is primarily designed as an individual-led initiative, the core work (recording, uploading, organizing) will be carried out by one person. However, there may still be a small number of additional participants involved in training, testing, or supporting roles. |
| Number of editors | 3 | Since this project is primarily led by an individual and focuses on audio contributions rather than large-scale editing campaigns, the number of active editors involved will be limited. |
| Number of organizers | 2 | Audio Contributor/Project Lead (1 person): Will handle the planning, coordination, recording, uploading, reporting, and documentation. Support Organizer (1 person): May assist with specific tasks like training setup, technical troubleshooting, or outreach design (volunteer or part-time support if needed). |
| Wikimedia project | Number of content created or improved |
|---|---|
| Wikipedia | |
| Wikimedia Commons | 50000 |
| Wikidata | |
| Wiktionary | |
| Wikisource | |
| Wikimedia Incubator | |
| Translatewiki | |
| MediaWiki | |
| Wikiquote | |
| Wikivoyage | |
| Wikibooks | |
| Wikiversity | |
| Wikinews | |
| Wikispecies | |
| Wikifunctions or Abstract Wikipedia | |
- Optional description for content contributions.
Wikimedia Commons will host the publicly accessible audio files, categorized and licensed under Creative Commons. Lingua Libre will serve as the recording and metadata platform. Recordings will be automatically linked to Commons.
- 19. Do you have any other project targets in numbers (metrics)? (optional)
No
| Main Open Metrics | Description | Target |
|---|---|---|
| N/A | N/A | N/A |
| N/A | N/A | N/A |
| N/A | N/A | N/A |
| N/A | N/A | N/A |
| N/A | N/A | N/A |
- 20. What tools would you use to measure each metrics? Please refer to the guide for a list of tools. You can also write that you are not sure and need support. (required)
- Number of content contributions: Lingua Libre Statistics Dashboard; Global Contributions tool (on Meta)
- Quality review of audio: manual spot-checking; playback review with checklists for clarity, volume, and pronunciation
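The manual spot-checking mentioned above could draw its sample in a reproducible way, for example as sketched below. This is an illustrative assumption rather than the project's actual procedure: the function name `pick_spot_check_sample`, the 2% sampling fraction, the fixed seed, and the file names are all hypothetical. Each selected file would then be played back and checked against the clarity, volume, and pronunciation checklist.

```python
import random


def pick_spot_check_sample(filenames, fraction=0.02, seed=0):
    """Pick a reproducible random subset of recordings for manual review.

    fraction and seed are hypothetical defaults; at least one file is
    always selected when the input is non-empty.
    """
    if not filenames:
        return []
    size = max(1, round(len(filenames) * fraction))
    rng = random.Random(seed)  # fixed seed so the same sample can be re-derived
    return sorted(rng.sample(list(filenames), size))


if __name__ == "__main__":
    # Hypothetical file names standing in for one batch of uploads.
    batch = [f"recording_{i:05d}.wav" for i in range(500)]
    for name in pick_spot_check_sample(batch):
        print(name, "-> check clarity, volume, pronunciation")
```

Keeping the seed with the review notes allows the same sample to be regenerated later when reporting on audio quality.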
Budget
- 21. Please upload your budget for this proposal or indicate the link to it. (required)
https://docs.google.com/spreadsheets/d/1tiq7yc3y186txyG-Ctskj3pJWUdpvLCwVG56F2hFN58/edit?usp=sharing
- 22. What is the amount you are requesting for this proposal? Please provide the amount in Indian Rupees. (required)
140500 INR
- 22.1. Convert the amount requested into USD using the Oanda converter. This is done to help you assess the USD equivalent of the requested amount. Your request should be between 500 - 5,000 USD. (required)
1642 USD
- By submitting this proposal request you agree with the Institutional Partner Privacy Policy, Application Privacy Statements, WMF Friendly Space Policy and Universal Code of Conduct.
Yes
This is an automatically generated Meta-Wiki page. The page was copied from Fluxx, the web service of Wikimedia Foundation Funds, where the user has submitted their application. Please do not make any changes to this page because all changes will be removed after the next update. Use the discussion page for your feedback. The page was created by CR-FluxxBot.