WikiCred/2022 CFP/Tooling to improve the credibility and reliability of information on Wiktionary

From Meta, a Wikimedia project coordination wiki
Tooling to improve the credibility and reliability of information on Wiktionary
A WikiCred 2022 Grant Proposal
Project Type: Development
Author: Martin Michlmayr / Carles Pina i Estany (Tbm / Carlespina)
Contact: tbm@cyrius.com / carles@pina.cat
Requested amount: 8,000-10,000 USD
Award amount: Unknown
What is your idea?

The aim of this project is to create a set of tools that can be used to improve and ensure the veracity, reliability and integrity of information on Wiktionary.

Wiktionary is a multilingual dictionary which provides extensive information about words, such as their meanings, but also etymologies, pronunciations, sample quotations, and more. There are independent Wiktionary efforts in several languages.

The credibility of information on Wiktionary can be improved through tools in a number of ways:

  1. Making it easier for contributors to monitor changes to their language of interest. For example, we can extract data for one language from a data dump and compare it with a previous dump, allowing contributors to review changes periodically.
  2. Checks and links within Wiktionary: for example, etymologies refer to the origin of words, and the pages for those words may refer back to their descendants or derived words. (If word A refers to word B as its etymology, word B should refer to word A as a descendant.) Tools can identify and flag discrepancies that need to be investigated.
  3. Integrity checks between different Wiktionary communities: the different Wiktionary projects (English Wiktionary, French Wiktionnaire, German Wikiwörterbuch, etc.) cover the same words in many languages (English, French, German, but also Swahili, Polish, Swedish and more). While the definitions of words are language-specific, a lot of information about words is the same: plural forms, noun genders/classes, etc. We can write tooling that compares such information between different Wiktionary communities and flags differences so they can be reviewed manually. (For example, if one Wiktionary says that the gender of the German noun Baum is masculine whereas another says it’s feminine, only one, namely the former, is correct.) This also builds the foundation for automatically syncing information across different Wiktionary communities, such as audio pronunciation files, IPA pronunciation information, etc.
  4. Integrity checks between Wiktionary and other dictionaries: as with the previous task, information from Wiktionary (such as plural forms or noun genders/classes) can be compared to high-quality dictionaries that are not part of the Wikimedia family. (Note: the feasibility of this particular task requires an investigation of legal issues related to database and sui generis rights, which we will do before working on this particular task.)
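
As a rough illustration of the second item above (if word A names word B as its etymology, word B should list word A as a descendant), a back-link check might look like the sketch below. The data layout (dictionaries keyed by language code and word), the function name `find_missing_backlinks` and the placeholder entries are assumptions made for illustration only; real Wiktionary markup would first have to be parsed into such a structure.

```python
from typing import Dict, List, Set, Tuple

# An entry is identified by (language code, word). This is a simplified,
# hypothetical model, not Wiktionary's actual data format.
Key = Tuple[str, str]


def find_missing_backlinks(
    etymology: Dict[Key, Set[Key]],
    descendants: Dict[Key, Set[Key]],
) -> List[Tuple[Key, Key]]:
    """Return (word, source) pairs where `word` cites `source` as its
    etymology but `source` does not list `word` as a descendant."""
    missing = []
    for word, sources in etymology.items():
        for source in sources:
            if word not in descendants.get(source, set()):
                missing.append((word, source))
    return sorted(missing)


# Placeholder entries: word_a and word_c each name an origin word.
etymology = {
    ("sw", "word_a"): {("ar", "word_b")},
    ("sw", "word_c"): {("en", "word_d")},
}
# word_b links back to word_a, but word_d has no descendants listed.
descendants = {
    ("ar", "word_b"): {("sw", "word_a")},
    ("en", "word_d"): set(),
}

missing = find_missing_backlinks(etymology, descendants)
print(missing)  # [(('sw', 'word_c'), ('en', 'word_d'))]
```

The output is a list of pairs for a human to review; the sketch deliberately does not edit anything itself.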


Why is it important?

We live in a global world and words are the building blocks for effective communication.

Wiktionary is the primary freely available, open knowledge, multilingual dictionary. Improving the quality of Wiktionary will allow users to rely more on its data and to use Wiktionary for additional use cases (for example, as a corpus for academic research on languages).


Link(s) to your resume or anything else (CV, GitHub, etc.) that may be relevant


Is your project already in progress?

No.

Martin has made some contributions manually (such as adding some “Descendants” for Swahili words to Arabic and English words, e.g. msahafu and tomato), but the goal of this project is to write tooling to make this effort more scalable.


How is this project relevant to credibility and Wikipedia?

A dictionary is only credible if the information is correct. The tooling developed here will flag incorrect data and improve the quality of Wiktionary.


What is the ultimate impact of this project?

We hope to have an impact on a number of areas:

  • Improve the information on Wiktionary
  • Create tools to make it easier for others to review information periodically
  • Improve the perception of the credibility of Wiktionary among users
  • Increase collaboration between the different Wiktionary communities (English Wiktionary, French Wiktionnaire, German Wikiwörterbuch, etc)
  • Inspire others to work on similar initiatives


Can your project scale?

Yes. The whole idea of this project is to create tools that make it easier to improve the veracity, reliability and integrity of information on Wiktionary at scale.

(However, see the challenges below.)


Why are you the people to do it?

Martin is extremely detail-oriented and obsessed with high levels of quality. He has employed these skills in a number of different areas (such as thousands of bug reports on open source projects and various editing gigs). The proposed project is a perfect match. He is also good with documentation and community building, which are required for this project.

Carles is a proficient Python coder with a passion for languages and for solving problems in an open manner. He has been working on a translation tool that makes it easier to use Wiktionary translations, and since 2005 he has maintained qdacco, a front end for the open DACCO dictionary (English-Catalan).

Both of us are keen on getting more involved in Wikimedia projects, and this project would allow us to spend more time on Wiktionary.


What is the impact of your idea on diversity and inclusiveness of the Wikimedia movement?

Even though English is the lingua franca of our time, we believe that diversity in languages is important.

While some languages get a lot of attention on Wiktionary, others get less. Tooling can help improve all languages, even those with a less active user-base.

Improving Wiktionary will make it a better resource for language learning. Wiktionary can play an important part in the long-term success of diversity and inclusiveness of the Wikimedia movement.

Wiktionary is also a good way for future editors from many countries to get involved in the Wikimedia community.


What are the challenges associated with this project and how you will overcome them?

The biggest challenge is to create tools that are generic and work with many languages. Even within the English Wiktionary, there is great diversity in the way knowledge is expressed. The Wiktionary initiatives in other languages use completely different markup, which makes it difficult to write tools that support everything.

There are several ways to tackle these problems:

  • Design the scripts so that they are extensible
  • Create good documentation to allow contributors from other Wiktionary initiatives to contribute to and extend the tools
  • Provide showcases to inspire others to create similar efforts for their languages
  • Focus on particular languages for the language-specific parts (while keeping the code generic and well documented so others can do the same for other languages). Specifically, we will predominantly focus on Swahili (as well as German and Catalan to a smaller extent) for the parts that are language-specific.


How will you spend your funds?

The funds will primarily be spent on the development of tools.

Various tools will be written in Python and published under an open-source license. We will also write extensive documentation and showcases, work on community building, and run the tools ourselves to highlight or fix mistakes.

We will extend existing open-source efforts where possible (e.g. wikiextract).

A rough plan:

  • Tools to monitor changes (e.g. from database dumps)
  • Checks and links within Wiktionary
    • Python module to add Descendants, Etymology and Derived words
    • Tools to cross-reference etymologies, descendants and derived words
    • Tools to guess derived words based on language-specific rules (e.g. drive → driver in English; -amua → mwamuzi in Swahili) and update Etymology/Derived words (semi)automatically
    • Tools to cross-reference Arabic root pages with derived words to ensure the Arabic root matches
  • Integrity checks between different Wiktionary communities
    • Support for one or more languages for wikiextract (e.g. German)
    • Tools that compare information from different Wiktionary communities
      • Plural forms
      • Noun genders and noun classes
      • Other objective information (i.e. facts)
  • Integrity checks between Wiktionary and other dictionaries
    • Parser for TUKI (a popular Swahili dictionary)
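
To make the "compare information from different Wiktionary communities" item more concrete, the sketch below flags fields on which two editions disagree. It assumes the data has already been extracted into plain dictionaries; the edition labels, the field names (`gender`, `plural`), the function name `find_disagreements` and the sample values are hypothetical and only mirror the Baum example given earlier.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (language code, word)
# edition name -> entry key -> field name -> value (hypothetical layout)
Editions = Dict[str, Dict[Key, Dict[str, str]]]


def find_disagreements(editions: Editions) -> List[Tuple[Key, str, Dict[str, str]]]:
    """Return (entry, field, {edition: value}) for every field whose value
    differs between the editions that record it."""
    seen = defaultdict(dict)  # (entry key, field) -> {edition: value}
    for edition, entries in editions.items():
        for key, fields in entries.items():
            for field, value in fields.items():
                seen[(key, field)][edition] = value

    conflicts = []
    for key, field in sorted(seen):
        by_edition = seen[(key, field)]
        if len(set(by_edition.values())) > 1:
            conflicts.append((key, field, by_edition))
    return conflicts


# Made-up sample data: the two editions agree on the plural of the
# German noun Baum but disagree on its gender.
editions = {
    "edition_1": {("de", "Baum"): {"gender": "m", "plural": "Bäume"}},
    "edition_2": {("de", "Baum"): {"gender": "f", "plural": "Bäume"}},
}

conflicts = find_disagreements(editions)
print(conflicts)
```

The function only reports differences. Deciding which edition is correct is left to manual review, as described above.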


How long will your project take?

3 months


Have you worked on projects for previous grants before?

Martin has received a grant from Wikimedia UK to buy a microphone, which has resulted in thousands of pronunciations of Swahili words for Wikimedia Commons (and, by extension, Wiktionary) spoken by a woman from Kenya.

Carles has received grants from the Open Knowledge Foundation and Freexian to work on various open-source projects, but hasn’t received grants to work on Wikimedia.