Research:WQT (Wikidata Quality Toolkit): Assuring the world’s data commons

From Meta, a Wikimedia project coordination wiki
16:51, 22 March 2024 (UTC)
Elena Simperl
Albert Meroño Peñuela
Odinaldo Rodrigues
Duration: 2024-January – 2024-December
Quality, Recommendations, References


Wikidata is one of the world’s most precious data assets: launched in 2012 by the Wikimedia Foundation (the NGO that runs Wikipedia), it contains machine-readable factual information about 100+ million topics. It is used extensively in everything from web search engines, virtual assistants (e.g. Siri, Alexa), and fact-checkers to 800+ projects in the Wikimedia ecosystem, including Wikipedias in multiple languages. As a curated source of structured, machine-readable information, it is also a valuable training resource for numerous AI applications, including large language models (LLMs) and other foundational AI models.

Data quality therefore matters: incomplete, erroneous, biased, or otherwise inappropriate data has real consequences. Wikidata data is used in many Wikipedia articles, which are visited 24 billion times monthly. Poor data is especially damaging when used to train AI systems, which tend to reinforce existing biases and stereotypes. This situation is not going away anytime soon: Wikidata is growing faster than its community, and LLMs like ChatGPT are expected to make things worse, as they could unleash a huge tide of automatically generated content that requires additional human scrutiny. Furthermore, existing tools that help editors with these tasks are limited in scope or require specialist skills. The scale of the challenge is substantial: Wikidata receives 21 million edits monthly, made by ~24k active editors supported by ~330 bots.

We will build WQT (Wikidata Quality Toolkit), which will support a diverse set of editors in curating and validating Wikidata records at scale. It will leverage research findings and conceptual prototypes drawing on AI, data management, and social computing, which have been designed and evaluated by the Wikidata community in response to its data assurance needs. The topic is timely, not least because of the risks of misinformation and disinformation posed by LLMs.
The focus will be on:
• revisiting existing assumptions and requirements for data assurance in the age of LLMs;
• refactoring, improving, and integrating existing code, which originates from a series of research grants and PhD projects;
• evaluating the toolkit extensively with the Wikidata community; and
• developing a robust research software sustainability strategy.

The toolkit will be open source, and all data, software, and guidance will be available to the community, as well as to researchers and AI developers. Besides the direct impact on a community of 24k editors, there are substantial economic and societal implications from the downstream AI applications using Wikidata (e.g. search engines): according to Government sources, AI employs ~50k people in the UK and added £3.7 billion to the UK economy in 2022.





Policy, Ethics and Human Subjects Research




