From Meta, a Wikimedia project coordination wiki

Wikibench is a system that enables Wikipedians to collaboratively curate AI evaluation datasets, while navigating disagreements and ambiguities through discussion. The curated datasets can be used to evaluate artificial intelligence (AI) and machine learning (ML) models in wikis, such as ORES and LiftWing. Wikibench is developed as part of a research project.


As Wikimedia projects scale, Wikipedians increasingly rely on AI tools for governance. For example, recent changes patrollers use AI-based quality and intent filters to screen recent edits to articles for vandalism, and topical working groups within various WikiProjects use AI-based article topic models for article curation.

Despite the growth of AI tools, Wikipedians currently have limited means to evaluate whether particular AI tools are “fit for use” with respect to their collective norms and values. How accurate are these tools’ predictions? What are their false positive and false negative rates? The main challenge in answering such questions is the lack of suitable evaluation datasets. Wikibench aims to bridge this gap by enabling Wikipedians to collaboratively curate evaluation datasets, while navigating disagreements and ambiguities through discussion on talk pages.
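To make the questions above concrete, here is a minimal sketch of how the named metrics could be computed once a community-curated evaluation dataset exists. The function name, the boolean label encoding, and the sample data are all hypothetical illustrations, not part of Wikibench itself.

```python
# Illustrative only: computes accuracy, false positive rate, and false
# negative rate from a small, made-up evaluation set. True stands for
# "damaging edit" (the positive class in this hypothetical example).

def evaluation_metrics(labels, predictions):
    """Return (accuracy, false positive rate, false negative rate)."""
    assert len(labels) == len(predictions) and labels
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    tn = sum(1 for y, p in zip(labels, predictions) if not y and not p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    accuracy = (tp + tn) / len(labels)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # good edits flagged as damaging
    fnr = fn / (fn + tp) if (fn + tp) else 0.0  # damaging edits missed
    return accuracy, fpr, fnr

# Community-curated labels vs. a model's predictions for five edits.
labels      = [True, True, False, False, False]
predictions = [True, False, True, False, False]
print(evaluation_metrics(labels, predictions))
```

A false positive here means a good-faith edit the model would flag for patrollers, so a community may weigh the two error rates differently depending on the campaign.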

System development

To support data curation on Wikipedia, Wikibench is designed with the following four requirements in mind:

  1. The data curation process should be led by the community and follow their established norms.
  2. The data curation process should encourage deliberation to surface disagreements, build consensus, and promote shared understanding.
  3. The data curation process should embed within existing workflows.
  4. The data curation process should be public and transparent to community members.

User interface

Wikibench has three primary interfaces for data curation.


Plug-in

Wikipedians can use Wikibench’s plug-in (e.g., on diff pages) to select and label new data points during their regular activities on Wikipedia.

Entity page

Wikibench’s entity pages publicly show the labels of individual data points and facilitate discussions and (re-)labeling.

Campaign page

The campaign page publicly shows the entire dataset and surfaces data points that could benefit from additional attention, such as those with high label disagreement or too few labelers.
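One way to surface such data points is sketched below. This is an illustration of the idea, not Wikibench's actual ranking logic; the function, the `min_labels` threshold, and the sample data point IDs are all assumptions.

```python
from collections import Counter

# Hypothetical helper: ranks data points so that those with too few labels
# or high label disagreement surface first.

def needs_attention(labels_by_point, min_labels=3):
    """labels_by_point maps a data point id to its list of labels."""
    scored = []
    for point_id, labels in labels_by_point.items():
        counts = Counter(labels)
        # Agreement = share of labelers who chose the majority label.
        agreement = counts.most_common(1)[0][1] / len(labels)
        underlabeled = len(labels) < min_labels
        scored.append((point_id, agreement, underlabeled))
    # Under-labeled points first, then ascending agreement.
    scored.sort(key=lambda item: (not item[2], item[1]))
    return scored

points = {
    "diff/123": ["damaging", "damaging", "good"],  # disagreement
    "diff/456": ["good", "good", "good"],          # consensus
    "diff/789": ["damaging"],                      # too few labels
}
for point_id, agreement, underlabeled in needs_attention(points):
    print(point_id, round(agreement, 2), underlabeled)
```

In this sketch, `diff/789` ranks first because it lacks labelers, and `diff/123` ranks ahead of `diff/456` because only two of its three labelers agree.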

Try out Wikibench

Wikibench was deployed on English Wikipedia for the edit quality campaign. To import Wikibench and view the campaign results, please refer to the instructions on the campaign page.

Wikibench is currently on pause as the Wikimedia Foundation is transitioning the machine learning system from ORES to LiftWing. This transition requires a substantial redesign of Wikibench to integrate with the new system. Once the transition is complete, the developer of Wikibench will seek Wikipedians' feedback and explore opportunities to iterate on Wikibench if deemed desirable. Please sign up for future updates.

You can help

If you like Wikibench and want to help with its development, you can take any of the following actions, even if you don't have a programming background:

  • Sign your name on the contact list to receive updates about Wikibench and future study opportunities.
  • Improve this project page about Wikibench.
  • Suggest new features or campaigns on the discussion page.


If you are referencing this project, please cite this paper:

Tzu-Sheng Kuo, Aaron Halfaker, Zirui Cheng, Jiwoo Kim, Meng-Hsin Wu, Tongshuang Wu, Kenneth Holstein, and Haiyi Zhu. 2024. Wikibench: Community-Driven Data Curation for AI Evaluation on Wikipedia. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI ’24), May 11–16, 2024, Honolulu, HI, USA. ACM, New York, NY, USA, 24 pages.