Grants:Programs/Wikimedia Community Fund/Rapid Fund/Language models for generating Wikidata descriptions in Southeast Asian languages (ID: 22459507)
This is an automatically generated Meta-Wiki page. The page was copied from Fluxx, the grantmaking web service of Wikimedia Foundation where the user has submitted their application. Please do not make any changes to this page because all changes will be removed after the next update. Use the discussion page for your feedback. The page was created by CR-FluxxBot.
Applicant Details
- Main Wikimedia username. (required)
Alphama
- Organization
N/A
- If you are a group or organization leader, board member, president, executive director, or staff member at any Wikimedia group, affiliate, or Wikimedia Foundation, you are required to self-identify and present all roles. (required)
I'm a group leader of a Wikimedia User Group (submitted to the Affiliation Committee).
- Describe all relevant roles with the name of the group or organization and description of the role. (required)
I am the president of Vietnam Wikimedians User Group, https://meta.wikimedia.org/wiki/Vietnam_Wikimedians_User_Group
Main Proposal
- 1. Please state the title of your proposal. This will also be the Meta-Wiki page title.
Language models for generating Wikidata descriptions in Southeast Asian languages
- 2. and 3. Proposed start and end dates for the proposal.
2024-06-01 - 2024-09-01
- 4. Where will this proposal be implemented? (required)
Vietnam
- 5. Are your activities part of a Wikimedia movement campaign, project, or event? If so, please select the relevant project or campaign. (required)
Not applicable Wikidata descriptions
- 6. What is the change you are trying to bring? What are the main challenges or problems you are trying to solve? Describe this change or challenges, as well as main approaches to achieve it. (required)
Descriptions enable readers to distinguish Wikidata items from others by identifying them with concise text. While many Wikidata items have descriptions in widely spoken languages like English, there is a notable absence of descriptions in other languages, particularly Southeast Asian languages such as Vietnamese, Bahasa Indonesia, Bahasa Melayu, and Tagalog.
Adding new short descriptions to Wikidata items can be tedious and repetitive. Hence, we propose utilizing language models trained by neural networks to automatically translate descriptions from English to other languages. These models will be integrated into a translation tool, serving as a description recommender that editors can review and modify before incorporating them into Wikidata using QuickStatements.
In addition to enhancing description contributions to Wikidata, our project aims to foster collaboration among User Groups in the ESEAP region by involving them in evaluating the translated descriptions.
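The review-then-upload step described above can be sketched as follows. This is a minimal illustration, not the project's tool: it assumes the QuickStatements V1 tab-separated syntax, in which a `D<lang>` column sets a description. The helper name `to_quickstatements`, the item ID, and the sample descriptions are hypothetical, and replacing double quotes with single quotes is an assumption made here to keep each command on one well-formed line.

```python
def to_quickstatements(item_id: str, lang: str, description: str) -> str:
    """Build one QuickStatements V1 command that sets a description."""
    # Collapse whitespace: tabs or newlines would break the tab-separated format.
    text = " ".join(description.split())
    # Assumption: double quotes are replaced rather than escaped.
    text = text.replace('"', "'")
    return f'{item_id}\tD{lang}\t"{text}"'


# Hypothetical, editor-approved rows: (item ID, language code, description).
approved = [
    ("Q4115189", "vi", "mô tả  ví dụ"),
    ("Q4115189", "id", "deskripsi contoh"),
]

batch = "\n".join(to_quickstatements(*row) for row in approved)
print(batch)
```

Each printed line is one command that an editor could paste into a QuickStatements batch after reviewing the suggested text.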
- 7. What are the planned activities? (required) Please provide a list of main activities. You can also add a link to the public page for your project where details about your project can be found. Alternatively, you can upload a timeline document. When the activities include partnerships, include details about your partners and planned partnerships.
1. Data Collection: We use a web crawler to randomly gather at least 50,000 Wikidata items that have descriptions in English, Bahasa Indonesia, Bahasa Melayu, and Tagalog through the Wikidata APIs. A larger dataset generally improves the performance of our language models.
2. Data Annotation: Because Wikidata contributions vary, descriptions of the same item may differ in content across languages. Our data annotation process therefore combines three techniques: pre-trained models, ChatGPT outputs, and human annotations. First, we use pre-trained models from Hugging Face to translate English descriptions into the other languages, and we compare the translations with the original reference descriptions using metrics such as METEOR before aligning description pairs. Next, we use the ChatGPT APIs to translate these descriptions again, aiming to improve quality; we expect ChatGPT to produce outputs close to human standards. Finally, a subset of the data is annotated by multiple participants. We merge the data from these three methods to construct the dataset for training language models.
3. Language Model Training: We fine-tune various pre-trained models (e.g., Helsinki-NLP/opus-mt-en-vi, Helsinki-NLP/opus-mt-en-id, Helsinki-NLP/opus-mt-en-tl, mBART, mT5) on our collected dataset. We experiment with two setups: a unified model for all languages and individual models for each language. We evaluate the models' performance and select those with the best results.
4. Development of Recommendation Tool: We then create a user-friendly GUI tool that presents pairs of descriptions (English and another language) for users to review and edit before uploading them to Wikidata via QuickStatements.
5. Human Evaluation: Participants evaluate a subset of generated descriptions on criteria such as accuracy, conciseness, and naturalness. Their ratings are used to measure inter-rater reliability and assess output quality. Insights from this evaluation enable us to refine the training data and retrain the models for better performance.
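The data collection step could be sketched as below. This is an illustration under stated assumptions, not the project's crawler: it uses the Wikidata `wbgetentities` API action with `props=descriptions`, and the helper names `build_url` and `split_items` as well as the sample response are hypothetical. No network request is made; the sample dictionary only mimics the shape of the JSON the API returns.

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"


def build_url(item_ids, languages):
    """Build a wbgetentities request URL that asks only for descriptions."""
    params = {
        "action": "wbgetentities",
        "ids": "|".join(item_ids),
        "props": "descriptions",
        "languages": "|".join(languages),
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


def split_items(response, source="en", target="vi"):
    """Split items into aligned description pairs and items missing the target."""
    pairs, missing = [], []
    for item_id, entity in response.get("entities", {}).items():
        descriptions = entity.get("descriptions", {})
        if source not in descriptions:
            continue  # nothing to translate from
        src_text = descriptions[source]["value"]
        if target in descriptions:
            pairs.append((item_id, src_text, descriptions[target]["value"]))
        else:
            missing.append((item_id, src_text))
    return pairs, missing


# Hand-written sample in the shape of an API response (not real data).
sample = {
    "entities": {
        "Q1": {"descriptions": {
            "en": {"language": "en", "value": "example description"},
            "vi": {"language": "vi", "value": "mô tả ví dụ"},
        }},
        "Q2": {"descriptions": {
            "en": {"language": "en", "value": "another example"},
        }},
    }
}

pairs, missing = split_items(sample)
print(build_url(["Q1", "Q2"], ["en", "vi"]))
print(pairs)
print(missing)
```

Items in `pairs` could feed the training set, while items in `missing` are candidates for a generated description.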
- 8. Describe your team. Please provide their roles, Wikimedia Usernames and other details. (required) Include more details of the team, including their roles, usernames, Wikimedia group, and whether they are salaried, volunteers, consultants/contractors, etc. Team members involved in the grant application need to be aware of their involvement in the project.
Our team members:
1. Alphama: project manager, programmer.
2. I will later invite 12 other participants across the 4 languages to join as data annotators and translation quality evaluators.
- 9. Who are the target participants and from which community? How will you engage participants before and during the activities? How will you follow up with participants after the activities? (required)
We invite participants from the ESEAP region who are proficient in Bahasa Indonesia, Bahasa Melayu, Tagalog, and Vietnamese. We plan to initiate email communications and engage in discussions with various communities and user groups in their respective languages.
Additionally, we will establish a dedicated Meta area within the Vietnamese Wikimedians User Group to monitor the project's progress. Following its completion, we will continue to solicit feedback from editors within communities to enhance the efficacy of language models.
- 10. Does your project involve work with children or youth? (required)
No
- 10.1. Please provide a link to your Youth Safety Policy. (required) If the proposal indicates direct contact with children or youth, you are required to outline compliance with international and local laws for working with children and youth, and provide a youth safety policy aligned with these laws. Read more here.
N/A
- 11. How did you discuss the idea of your project with your community members and/or any relevant groups? Please describe steps taken and provide links to any on-wiki community discussion(s) about the proposal. (required) You need to inform the community and/or group, discuss the project with them, and involve them in planning this proposal. You also need to align the activities with other projects happening in the planned area of implementation to ensure collaboration within the community.
So far, I have only opened a discussion in the Vietnamese Wikimedians User Group and sent emails to some Wikimedians, such as Butch (Tagalog) and Sakti (Indonesian). I will continue to open discussions with related projects and user groups to attract their participation.
- 12. Does your proposal aim to work to bridge any of the content knowledge gaps (Knowledge Inequity)? Select one option that most apply to your work. (required)
Language
- 13. Does your proposal include any of these areas or thematic focus? Select one option that most applies to your work. (required)
Education
- 14. Will your work focus on involving participants from any underrepresented communities? Select one option that most apply to your work. (required)
Linguistic / Language
- 15. In what ways do you think your proposal most contributes to the Movement Strategy 2030 recommendations. Select one that most applies. (required)
Innovate in Free Knowledge
Learning and metrics
- 17. What do you hope to learn from your work in this project or proposal? (required)
To start, my objective is to understand how modern neural network-based language models can be utilized to automate the creation of Wikidata descriptions, with a focus on delivering outputs that meet human standards. I want to underscore the critical role of language models in efficiently addressing content generation tasks in Wikimedia projects. Additionally, I am eager to gain insights into effectively bridging various communities for collaborative projects, particularly within the ESEAP region.
- 18. What are your Wikimedia project targets in numbers (metrics)? (required)
Other Metrics | Target | Optional description |
---|---|---|
Number of participants | 13 | Alphama: project manager, programmer; 12 annotators and evaluators in 4 languages. |
Number of editors | 12 | 12 annotators will be asked to import quality-generated descriptions into Wikidata in their language. |
Number of organizers | 1 | I, Alphama, am the project organizer. |
Wikimedia project | Number of content created or improved |
---|---|
Wikipedia | |
Wikimedia Commons | |
Wikidata | 6000 |
Wiktionary | |
Wikisource | |
Wikimedia Incubator | |
Translatewiki | |
MediaWiki | |
Wikiquote | |
Wikivoyage | |
Wikibooks | |
Wikiversity | |
Wikinews | |
Wikispecies | |
Wikifunctions or Abstract Wikipedia | |
- Optional description for content contributions.
The tool is set to produce 6,000 fresh Wikidata descriptions across four languages and will engage 12 evaluators to assess the translation quality before importing them to Wikidata using QuickStatements. Each evaluator will be tasked with reviewing 500 descriptions.
- 19. Do you have any other project targets in numbers (metrics)? (optional)
No
Main Open Metrics | Description | Target |
---|---|---|
N/A | N/A | N/A |
N/A | N/A | N/A |
N/A | N/A | N/A |
N/A | N/A | N/A |
N/A | N/A | N/A |
- 20. What tools would you use to measure each metrics? Please refer to the guide for a list of tools. You can also write that you are not sure and need support. (required)
To assess the effectiveness of language models, we employ various string metrics such as BLEU, METEOR, and ROUGE, which generate scores ranging from 0 to 1 by comparing predicted descriptions with reference descriptions across the test dataset.
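As a rough illustration of how such overlap metrics produce a score between 0 and 1, the sketch below computes a simplified unigram F1, in the spirit of ROUGE-1. It is not the official BLEU, METEOR, or ROUGE implementation: the function name `unigram_f1` is hypothetical, and lowercasing plus whitespace tokenization are simplifying assumptions.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1 over lowercased, whitespace-split tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical predicted and reference descriptions.
print(unigram_f1("city in Vietnam", "city in Vietnam"))
print(unigram_f1("large city in northern Vietnam", "city in Vietnam"))
```

Averaging this score over all prediction and reference pairs in the test set gives one number per model, which makes models comparable.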
To assess the quality of the outputs generated by the language models, we measure inter-rater reliability among evaluators using statistics such as Pearson's r, Cohen's kappa, Fleiss' kappa, and Krippendorff's alpha.
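For two evaluators, agreement can be illustrated with Cohen's kappa, which corrects the observed agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a plain-Python version for illustration; the function name `cohen_kappa` and the sample ratings are hypothetical, and measures for more than two raters (Fleiss' kappa, Krippendorff's alpha) need a different computation.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(ratings_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: from each rater's own label frequencies.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical accuracy judgments from two evaluators on four descriptions.
rater_1 = ["good", "good", "bad", "bad"]
rater_2 = ["good", "bad", "bad", "bad"]
print(cohen_kappa(rater_1, rater_2))
```

Here the raters agree on 3 of 4 items (p_o = 0.75) and chance agreement is 0.5, so kappa is 0.5.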
Financial proposal
- 21. Please upload your budget for this proposal or indicate the link to it. (required)
https://docs.google.com/spreadsheets/d/1jp6XbEMLiT0gVbgsi3MUMD7KDdVHYB8kHZ5UkRPpzQ8/edit?usp=sharing
- 22. and 22.1. What is the amount you are requesting for this proposal? Please provide the amount in your local currency. (required)
125948038 VND
- 22.2. Convert the amount requested into USD using the Oanda converter. This is done only to help you assess the USD equivalent of the requested amount. Your request should be between 500 - 5,000 USD.
4945.5 USD
- We/I have read the Application Privacy Statement, WMF Friendly Space Policy and Universal Code of Conduct.
Yes
Endorsements and Feedback
Please add endorsements and feedback to the grant discussion page only. Endorsements added here will be removed automatically.
Community members are invited to share meaningful feedback on the proposal and include reasons why they endorse the proposal. Consider the following:
- Stating why the proposal is important for the communities involved and why they think the strategies chosen will achieve the results that are expected.
- Highlighting any aspects they think are particularly well developed: for instance, the strategies and activities proposed, the levels of community engagement, outreach to underrepresented groups, addressing knowledge gaps, partnerships, the overall budget and learning and evaluation section of the proposal, etc.
- Highlighting if the proposal focuses on any interesting research, learning or innovation, etc. Also if it builds on learning from past proposals developed by the individual or organization, or other Wikimedia communities.
- Analyzing if the proposal is going to contribute in any way to important developments around specific Wikimedia projects or Movement Strategy.
- Analyzing if the proposal is coherent in terms of the objectives, strategies, budget, and expected results (metrics).