Research:Community-centered Evaluation of AI Models on Wikipedia

From Meta, a Wikimedia project coordination wiki
01:25, 28 October 2022 (UTC)
Duration:  2022-11 – 2024-05
This project's code is open-source.

This project's data is available for download and reuse.

This page documents a completed research project.

In this project, we aim to empower the Wikipedia community to drive the intentional design and curation of evaluation datasets for the AI systems that affect them, such as ORES and LiftWing. We introduce Wikibench, a system that enables communities to collaboratively curate AI evaluation datasets while navigating ambiguities and differences in perspective through discussion.



As Wikimedia projects scale, Wikipedians increasingly rely on AI tools for governance. For example, AI-based content moderation tools such as ORES identify damaging edits in articles so that Wikipedians can review them and revert them as necessary. Despite the growth of AI tools, Wikipedia communities have limited means to evaluate a particular AI tool’s “fit for use” with respect to their collective norms and values.

Currently, the curation of ORES’ training and evaluation datasets relies on a system called Wikilabels, which is hosted on an external website outside Wikipedia. Wikipedians can join data labeling campaigns on Wikilabels and request a subset of data to label. However, unlike Wikipedia, where each article is editable by any Wikipedian, Wikilabels assigns each data point to only a single Wikipedian, who labels it in isolation. Wikilabels also does not let Wikipedians discuss labels collaboratively, whereas each Wikipedia article has an associated talk page for discussing its content. As a result, it is unclear to what extent Wikilabels’ datasets reflect the collective perspectives of the community rather than the perspectives of individual annotators. Beyond labeling preselected data in Wikilabels, Wikipedians have no way to proactively evaluate how different AI tools perform with respect to their collective norms and values.



This research comprises three main activities, described below.

Formative study


To better understand Wikipedians’ desires and challenges around data curation for AI evaluation, we conducted a formative study with eight Wikipedians who had experience using AI-based content moderation tools on Wikipedia, contributing to the development of these tools, or participating in grassroots efforts to identify areas for improvement in deployed tools. Through a reflexive thematic analysis and affinity diagramming by two of the authors, we derived the following four highest-level themes as design requirements for systems that aim to support the community-driven data curation process for AI evaluation on Wikipedia.

  1. The data curation process should be led by the community and follow their established norms.
  2. The data curation process should encourage deliberation to surface disagreements, build consensus, and promote shared understanding.
  3. The data curation process should embed within existing workflows.
  4. The data curation process should be public and transparent to community members.

Wikibench development


Based on these design requirements, we developed Wikibench, a system that enables community members to collaboratively curate AI evaluation datasets, while navigating disagreements and ambiguities through discussion.

Wikibench mainly supports three actions for community-driven data curation: select, label, and discuss. Wikipedians can use Wikibench to select new data points for inclusion in datasets and to label these data points during the course of their regular, daily activities on Wikipedia. Wikibench also supports Wikipedians in filtering through data points that have already been added to the dataset, to select ones to further label or discuss. Through Wikibench, Wikipedians are supported in either discussing the label of individual data points, or discussing higher-level topics related to the overall data curation process.
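The three actions above can be pictured with a small data model. The following is only an illustrative sketch, not Wikibench's actual implementation: the class names, fields, the `needs_attention` helper, and the label strings are all hypothetical and chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class DataPoint:
    """One edit in a hypothetical community-curated dataset."""
    edit_id: int                                     # hypothetical edit identifier
    labels: dict = field(default_factory=dict)       # labeler name -> label
    discussion: list = field(default_factory=list)   # (author, comment) pairs


class Dataset:
    """Hypothetical container supporting select, label, and discuss."""

    def __init__(self):
        self.points = {}

    def select(self, edit_id):
        """Add an edit to the dataset if it is absent, and return it."""
        return self.points.setdefault(edit_id, DataPoint(edit_id))

    def label(self, edit_id, labeler, label):
        """Record (or overwrite) one labeler's label for an edit."""
        self.select(edit_id).labels[labeler] = label

    def discuss(self, edit_id, author, comment):
        """Append a comment to the discussion of an edit."""
        self.select(edit_id).discussion.append((author, comment))

    def needs_attention(self, min_labels=2):
        """Filter: edits with too few labels or with conflicting labels."""
        flagged = []
        for point in self.points.values():
            too_few = len(point.labels) < min_labels
            conflicting = len(set(point.labels.values())) > 1
            if too_few or conflicting:
                flagged.append(point.edit_id)
        return sorted(flagged)


dataset = Dataset()
dataset.label(101, "UserA", "damaging")
dataset.label(101, "UserB", "damaging")
dataset.label(102, "UserA", "damaging")
dataset.label(102, "UserB", "not damaging")
dataset.discuss(102, "UserB", "This looks like a good-faith formatting change.")
dataset.label(103, "UserC", "not damaging")
print(dataset.needs_attention())
```

In this sketch, edit 102 is flagged because its two labels conflict, and edit 103 because it has only one label, which mirrors the idea of filtering for data points where more labels or discussion could be valuable.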

Community evaluation


We conducted a field study to observe how Wikipedians use Wikibench in the course of their regular activities on Wikipedia. In total, we recruited 12 Wikipedians with diverse experiences and backgrounds.

Quality of the community-curated dataset


We found evidence that the dataset collected through Wikibench is broadly reflective of Wikipedians’ perspectives, while also capturing ambiguity and disagreement among community members. We demonstrate how the resulting dataset can be used to investigate the relative strengths and limitations of two different AI models used on Wikipedia. Below are some highlighted findings:

  • Wikibench’s labels reflect consensus among a broader set of community members.
  • Wikibench captures ambiguity and differences of perspective.
  • Most data points are labeled by multiple community members.
  • Wikibench’s dataset can help in understanding AI models’ alignment with community perspectives.
  • Participants believe Wikibench’s dataset can help the community understand gaps between their collective values and AI models’ predictions.
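As a rough illustration of how a multiply-labeled dataset could be used to compare models, the sketch below computes how often a model's prediction matches the community's majority label, skipping edits where the labels tie. This is not the analysis method from the paper; the function names, the label strings, and the example data are assumptions made up for this example.

```python
from collections import Counter


def majority_label(labels):
    """Return the most common label, or None when the top labels tie."""
    counts = Counter(labels).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no clear majority: treat the edit as ambiguous
    return counts[0][0]


def agreement_rate(community_labels, predictions):
    """Fraction of edits with a clear majority label that the model matches."""
    matched = 0
    total = 0
    for edit_id, labels in community_labels.items():
        majority = majority_label(labels)
        if majority is None or edit_id not in predictions:
            continue
        total += 1
        if predictions[edit_id] == majority:
            matched += 1
    return matched / total if total else None


# Hypothetical community labels: edit id -> labels from several members.
community = {
    1: ["damaging", "damaging", "not damaging"],
    2: ["not damaging", "not damaging"],
    3: ["damaging", "not damaging"],  # tie, excluded from the comparison
}
# Hypothetical predictions from two models.
model_a = {1: "damaging", 2: "not damaging", 3: "damaging"}
model_b = {1: "not damaging", 2: "not damaging", 3: "damaging"}

print(agreement_rate(community, model_a))
print(agreement_rate(community, model_b))
```

Excluding tied edits is one design choice among several. Because the dataset keeps every individual label, ambiguous edits can also be examined separately rather than being forced into a single ground truth.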

How Wikipedians use Wikibench


In addition, we found that participants appreciated how Wikibench’s design embedded seamlessly into their workflow. They organically drove the overall data curation process, going beyond just labeling data, and believed that the collaborative approach supported by Wikibench was beneficial both for dataset quality and for community building. Below are some highlighted findings:

  • Participants find Wikibench much easier to use than Wikilabels because it is embedded into their existing workflow.
  • Participants revised label definitions to better capture emerging nuances and community norms.
  • Participants defined inclusion criteria to ensure data was accessible by the full community.
  • Participants authored a data statement to specify the appropriate usage of the evaluation dataset.
  • Participants prefer Wikibench’s collaborative approach to data labeling.
  • Participants find Wikibench helpful in quickly identifying edits where more labels or discussion could be valuable.
  • Participants find Wikibench effective in surfacing disagreements and facilitating consensus-building.
  • Participants see value in the openness of Wikibench’s datasets.

Taken together, our findings indicate potential for the approach embodied by Wikibench to support the curation of AI datasets that reflect community norms and values. For further details, please see our paper.


  • [done] 10.2022: create project page, finish interview protocol
  • [done] 11.2022: start recruitment for the formative study, finish interviews and analysis
  • [done] 12.2022: define concrete design objectives and start ideation
  • [done] 01.2023: wrap up ideation
  • [done] 02.2023: iterate on low-fidelity prototypes
  • [done] 03.2023: iterate on medium-fidelity prototypes
  • [done] 04.2023: iterate on high-fidelity prototypes
  • [done] 05.2023: build the actual system
  • [done] 06.2023: recruit Wikipedians for community evaluation
  • [done] 07.2023: run community evaluation
  • [done] 08.2023: analyze results, write paper, submit paper
  • [done] 02.2024: hear back from reviewers, submit the camera-ready version
  • [done] 05.2024: release the official link to paper on ACM's digital library

Policy, Ethics and Human Subjects Research


This research was approved by the Institutional Review Board (IRB) at Carnegie Mellon University on August 1st, 2022. Researchers are required to ask for participants' verbal consent before the study begins.





If you are referencing this project, please cite this paper:

Tzu-Sheng Kuo, Aaron Halfaker, Zirui Cheng, Jiwoo Kim, Meng-Hsin Wu, Tongshuang Wu, Kenneth Holstein, and Haiyi Zhu. 2024. Wikibench: Community-Driven Data Curation for AI Evaluation on Wikipedia. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI ’24), May 11–16, 2024, Honolulu, HI, USA. ACM, New York, NY, USA, 24 pages. https://doi.org/10.1145/3613904.3642278