Grants:Programs/Wikimedia Community Fund/Rapid Fund/QA tools to improve the quality, reliability, and consistency of Wiktionary (ID: 22282613)

From Meta, a Wikimedia project coordination wiki
status: Funded
QA tools to improve the quality, reliability, and consistency of Wiktionary
proposed start date: 2023-10-26
proposed end date: 2023-12-31
grant start date: 2023-10-26T00:00:00Z
grant end date: 2024-03-31T00:00:00Z
budget (local currency): 5000 USD
budget (USD): 5000 USD
amount recommended (USD): 5000
grant type: Individual
funding region: ESEAP
decision fiscal year: 2023-24
applicant: Tbm
organization (if applicable): N/A

This is an automatically generated Meta-Wiki page. The page was copied from Fluxx, the grantmaking web service of Wikimedia Foundation where the user has submitted their application. Please do not make any changes to this page because all changes will be removed after the next update. Use the discussion page for your feedback. The page was created by CR-FluxxBot.

Applicant Details

Main Wikimedia username. (required)

Tbm

Organization

N/A

If you are a group or organization leader, board member, president, executive director, or staff member at any Wikimedia group, affiliate, or Wikimedia Foundation, you are required to self-identify and present all roles. (required)

N/A

Describe all relevant roles with the name of the group or organization and description of the role. (required)


Main Proposal

1. Please state the title of your proposal. This will also be the Meta-Wiki page title.

QA tools to improve the quality, reliability, and consistency of Wiktionary

2. and 3. Proposed start and end dates for the proposal.

2023-10-26 - 2023-12-31

4. Where will this proposal be implemented? (required)

Philippines

5. Are your activities part of a Wikimedia movement campaign, project, or event? If so, please select the relevant project or campaign. (required)

Other (please specify) Wiktionary

6. What is the change you are trying to bring? What are the main challenges or problems you are trying to solve? Describe this change or challenges, as well as main approaches to achieve it. (required)

The aim of this project is to create a set of tools that will assist editors in improving the quality, reliability, and consistency of Wiktionary.

A lot of information is duplicated within Wiktionary itself and between the various Wiktionary communities (English Wiktionary, French Wiktionnaire, German Wikiwörterbuch, etc.). Such factual information includes noun genders/classes, noun plural forms, hyphenation patterns, etymologies, and more. These tools will implement a number of consistency checks for such information on Wiktionary. The output can be used by editors to improve the quality and reliability of the information.

Furthermore, my hope is that this effort will lead to increased collaboration between the different Wiktionary communities, which operate fairly autonomously at the moment.

Finally, it can be argued that a lot of information from Wiktionary should be migrated to Wikidata and embedded in the various Wiktionary communities using Wikifunctions. These tools will help to find discrepancies in the information held by the different Wiktionary communities, which is an important first step towards migrating this information to Wikidata in the future.

Specifically, this project will implement tools to:

  • Cross-reference information between different Wiktionary communities
  • Cross-reference information internally within the English Wiktionary
  • Validate hyphenation patterns
  • Compare information from Wiktionary translation tables with Wikidata

This project will deliver a number of well-documented, open-source tools that can be used by volunteer editors to identify issues.

Issues identified by the tools will be shared with other editors, so they can help address them. Furthermore, the tools can be extended by others to implement more consistency checks and support for more languages.

7. What are the planned activities? (required) Please provide a list of main activities. You can also add a link to the public page for your project where details about your project can be found. Alternatively, you can upload a timeline document. When the activities include partnerships, include details about your partners and planned partnerships.

The planned activities are:

  • Creation of tools (in public, in Git)
  • Documentation of tools
  • Testing and improvement of tools
  • Discussion of tools with community and further refinement
  • Sharing of output with other editors

The following tools will be created:

  • Cross-reference information between different Wiktionary communities: compare factual information, such as noun genders/classes and plural forms, among different Wiktionary communities. For example, English Wiktionary might claim that the German word "Baum" is masculine whereas the French Wiktionnaire might claim that this word is feminine, which would be an inconsistency that needs to be resolved.
  • Cross-reference information internally within the English Wiktionary: implement a number of reliability checks. For example, I recently fixed the plural of the Yiddish word אינדזל (indzl) from אינדזלן (indzln) to אינדזלען (indzlen). However, both plural pages still exist on Wiktionary, each claiming to be the plural of the word. The wrong plural page needs to be deleted. This tool can identify discrepancies such as this. It will also compare references related to the etymology within Wiktionary to find inconsistencies.
  • Validate hyphenation patterns: Wiktionary contains information about the possible hyphenation of words, but this can be wrong and not match the actual word (often due to typos or copy-and-paste errors). For example, the hyphenation of German "Todesdatum" was given as "Ge|burts|da|tum", which is clearly a different word. (I wrote an initial prototype in the past, which led to a number of fixes; it needs to be rewritten and extended to other languages.)
  • Compare information from Wiktionary translation tables with Wikidata: English Wiktionary entries contain translation tables to other languages. In some cases, these tables refer to the Wikidata item. Wikidata contains interwiki links - we can compare these interwiki links with the translations provided on Wiktionary to find differences.
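
The hyphenation check described above can be illustrated with a minimal Python sketch. It is not the existing prototype; the function names and the set of separator characters are assumptions made for illustration, and real Wiktionary templates may use other separators. The idea is simply that a hyphenation, once its separators are removed, should spell the headword.

```python
import unicodedata

# Separator characters assumed for this sketch; actual templates may differ.
SEPARATORS = {"|", "‧", "·", "-", "\u00ad"}


def normalize(text: str) -> str:
    """Drop separator characters, then apply NFC normalization and case folding."""
    stripped = "".join(ch for ch in text if ch not in SEPARATORS)
    return unicodedata.normalize("NFC", stripped).casefold()


def hyphenation_matches(headword: str, hyphenation: str) -> bool:
    """Return True if the hyphenation spells the headword once separators are removed."""
    # The headword is normalized the same way, so hyphenated headwords
    # are not flagged merely because they contain a hyphen.
    return normalize(headword) == normalize(hyphenation)


def find_mismatches(entries):
    """Return the (headword, hyphenation) pairs that do not match."""
    return [(word, hyph) for word, hyph in entries if not hyphenation_matches(word, hyph)]


if __name__ == "__main__":
    sample = [
        ("Todesdatum", "To|des|da|tum"),
        ("Todesdatum", "Ge|burts|da|tum"),  # the error quoted in the proposal
    ]
    for word, hyph in find_mismatches(sample):
        print(f"{word}: hyphenation '{hyph}' does not spell the headword")
```

A check like this only catches hyphenations that spell a different word; it does not verify that the break points themselves are correct for the language.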

The code will be published under the GPL or another OSI-approved open source license.

Limitations: template usage varies widely, both among the different Wiktionary communities and even across languages within the same Wiktionary instance; this makes it difficult to write tools that work everywhere. Furthermore, some checks might need specific rules for different languages. We will address these limitations as much as possible by building on top of existing infrastructure/tools (such as wiktextract) and making the code generic so it can easily be adapted for other languages.
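
One way to keep a check generic, sketched here in Python under stated assumptions, is to reduce each edition's data to a small common record before comparing. The `GenderClaim` record and the `find_gender_conflicts` function are hypothetical names introduced for illustration; the sketch assumes an earlier extraction step (for example, one built on wiktextract output) has already produced normalized gender labels, and it does not rely on any particular wiktextract field names.

```python
from collections import defaultdict
from typing import NamedTuple


class GenderClaim(NamedTuple):
    """Hypothetical normalized record: one edition's claim about one word."""
    edition: str   # Wiktionary edition making the claim, e.g. "en" or "fr"
    language: str  # language of the word itself, e.g. "de"
    word: str
    gender: str    # normalized label, e.g. "m", "f", "n"


def find_gender_conflicts(claims):
    """Return words for which editions disagree on the set of genders.

    A word that several editions agree has two genders is not flagged;
    only differing gender sets between editions count as a conflict.
    """
    by_word = defaultdict(lambda: defaultdict(set))
    for claim in claims:
        by_word[(claim.language, claim.word)][claim.edition].add(claim.gender)

    conflicts = {}
    for key, per_edition in by_word.items():
        distinct_sets = {frozenset(genders) for genders in per_edition.values()}
        if len(distinct_sets) > 1:
            conflicts[key] = {ed: sorted(g) for ed, g in per_edition.items()}
    return conflicts


if __name__ == "__main__":
    claims = [
        GenderClaim("en", "de", "Baum", "m"),
        GenderClaim("fr", "de", "Baum", "f"),  # the hypothetical disagreement from the proposal
        GenderClaim("en", "de", "Haus", "n"),
        GenderClaim("fr", "de", "Haus", "n"),
    ]
    for (language, word), per_edition in find_gender_conflicts(claims).items():
        print(language, word, per_edition)
```

With this shape, language-specific or edition-specific template handling stays in the extraction step, and the comparison itself needs no changes when a new edition is added.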


8. Describe your team. Please provide their roles, Wikimedia Usernames and other details. (required) Include more details of the team, including their roles, usernames, Wikimedia group, and whether they are salaried, volunteers, consultants/contractors, etc. Team members involved in the grant application need to be aware of their involvement in the project.

Martin Michlmayr (User:tbm) - salaried. I will implement this project in close cooperation with the Wiktionary community. I have extensive experience with Quality Assurance (QA) efforts, both inside of Wiktionary and in the open-source community (particularly Debian, where I filed thousands of bug reports). I have fixed thousands of quality issues on Wiktionary; an example list of fixes I made on Wiktionary can be found here: https://en.wiktionary.org/wiki/User:Tbm/QA/Fixes

With the help of these tools, I will create lists of issues found in various languages, which volunteer editors of those languages can use to address these issues. I have worked with Wiktionary editors of other languages before, for example by creating a list of wrong hyphenation patterns for various languages. This has resulted in many fixes on Wiktionary for German, Polish, Dutch and Catalan. See https://en.wiktionary.org/wiki/User:Tbm/QA/Hyphenation

9. Who are the target participants and from which community? How will you engage participants before and during the activities? How will you follow up with participants after the activities? (required)

The target participants are editors working on different languages in the various Wiktionary communities. I am actively involved in the English Wiktionary and have been working with editors from other languages on solving a number of QA issues. I will create pages listing issues that were flagged by these tools, so editors can work on them. Furthermore, the code will be openly written and discussed with others in the Wiktionary community.

10. Does your project involve work with children or youth? (required)

No

10.1. Please provide a link to your Youth Safety Policy. (required) If the proposal indicates direct contact with children or youth, you are required to outline compliance with international and local laws for working with children and youth, and provide a youth safety policy aligned with these laws. Read more here.

N/A

11. How did you discuss the idea of your project with your community members and/or any relevant groups? Please describe steps taken and provide links to any on-wiki community discussion(s) about the proposal. (required) You need to inform the community and/or group, discuss the project with them, and involve them in planning this proposal. You also need to align the activities with other projects happening in the planned area of implementation to ensure collaboration within the community.

I have discussed these ideas and tools with a number of editors on the Wiktionary Discord.

12. Does your proposal aim to work to bridge any of the content knowledge gaps (Knowledge Inequity)? Select one option that most applies to your work. (required)

Language

13. Does your proposal include any of these areas or thematic focus? Select one option that most applies to your work. (required)

Culture, heritage or GLAM

14. Will your work focus on involving participants from any underrepresented communities? Select one option that most applies to your work. (required)

Linguistic / Language

15. In what ways do you think your proposal most contributes to the Movement Strategy 2030 recommendations. Select one that most applies. (required)

Innovate in Free Knowledge

Learning and metrics

17. What do you hope to learn from your work in this project or proposal? (required)

While there are a number of QA tools for Wiktionary, a lot of work is needed in this area. I'm curious if the creation of these tools will prompt the community to build more tooling.

Furthermore, I'd like to see if these tools will lead to more cooperation among the different Wiktionary communities.

Finally, we will see if this will prompt a discussion about moving some Wiktionary data to Wikidata in order to remove duplication among the different Wiktionary communities.

18. What are your Wikimedia project targets in numbers (metrics)? (required)
Number of participants, editors, and organizers
Other Metrics Target Optional description
Number of participants 20
Number of editors 20
Number of organizers 3
Number of content contributions to Wikimedia projects
Wikimedia project Number of content created or improved
Wikipedia
Wikimedia Commons
Wikidata
Wiktionary 500
Wikisource
Wikimedia Incubator
Translatewiki
MediaWiki
Wikiquote
Wikivoyage
Wikibooks
Wikiversity
Wikinews
Wikispecies
Wikifunctions or Abstract Wikipedia
Optional description for content contributions.

N/A

19. Do you have any other project targets in numbers (metrics)? (optional)

No

Main Open Metrics Data
Main Open Metrics Description Target
N/A N/A N/A
20. What tools would you use to measure each metric? Please refer to the guide for a list of tools. You can also write that you are not sure and need support. (required)

The tools will create lists of words that have issues. We can monitor these pages to see how many issues were fixed and by which editors. We can also count the number of outstanding issues at the beginning and at the end of the project and compare the two.
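
The before-and-after comparison can be sketched as a simple set difference in Python. The `summarize_progress` function and the idea of identifying each issue by a string key are assumptions made for illustration; attributing fixes to individual editors would additionally require page histories, which this sketch does not cover.

```python
def summarize_progress(issues_at_start, issues_at_end):
    """Compare two snapshots of flagged issues and count what changed.

    Each issue is identified by a hashable key, for example a string such as
    "de:Todesdatum:hyphenation" (a hypothetical key format).
    """
    start, end = set(issues_at_start), set(issues_at_end)
    return {
        "fixed": len(start - end),          # flagged at the start, gone at the end
        "outstanding": len(start & end),    # flagged in both snapshots
        "newly_flagged": len(end - start),  # only flagged at the end
    }


if __name__ == "__main__":
    before = ["de:Todesdatum:hyphenation", "yi:אינדזל:plural", "de:Baum:gender"]
    after = ["de:Baum:gender"]
    print(summarize_progress(before, after))
```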

Financial proposal

21. Please upload your budget for this proposal or indicate the link to it. (required)

https://docs.google.com/spreadsheets/d/1B6Dxg4Ile_-ylw-MpEL3Ls0EY0vepQfplgMeC44ukbQ/edit?usp=sharing


22. and 22.1. What is the amount you are requesting for this proposal? Please provide the amount in your local currency. (required)

5000 USD

22.2. Convert the amount requested into USD using the Oanda converter. This is done only to help you assess the USD equivalent of the requested amount. Your request should be between 500 - 5,000 USD.

5000 USD

We/I have read the Application Privacy Statement, WMF Friendly Space Policy and Universal Code of Conduct.

Yes

Endorsements and Feedback

Please add endorsements and feedback to the grant discussion page only. Endorsements added here will be removed automatically.

Community members are invited to share meaningful feedback on the proposal and include reasons why they endorse the proposal. Consider the following:

  • Stating why the proposal is important for the communities involved and why they think the strategies chosen will achieve the results that are expected.
  • Highlighting any aspects they think are particularly well developed: for instance, the strategies and activities proposed, the levels of community engagement, outreach to underrepresented groups, addressing knowledge gaps, partnerships, the overall budget and learning and evaluation section of the proposal, etc.
  • Highlighting if the proposal focuses on any interesting research, learning or innovation, etc. Also if it builds on learning from past proposals developed by the individual or organization, or other Wikimedia communities.
  • Analyzing if the proposal is going to contribute in any way to important developments around specific Wikimedia projects or Movement Strategy.
  • Analyzing if the proposal is coherent in terms of the objectives, strategies, budget, and expected results (metrics).
