From Meta, a Wikimedia project coordination wiki

Wikibench is a tool for evaluating artificial intelligence (AI) and machine learning (ML) models used on wikis, such as those served by ORES or Liftwing. It lets editors label edits and other wiki entities in the course of their normal work (e.g. patrolling for vandalism) and discuss with one another when they disagree about a label.


Wikimedia projects rely heavily on AI models for maintenance and improvement. For example, recent changes patrollers use AI-based quality and intent filters to check recent changes to articles for inappropriate edits, and topical working groups of various WikiProjects use AI-based article topic models for article curation. Although Wikipedians rely on these models to carry out critical work across Wikimedia projects, their effectiveness is unclear. How accurate are their predictions? What are their false positive and false negative rates?

The core belief underlying this project is that Wikipedians' voices are essential in evaluating AI models from their individual and collective perspectives. The fundamental obstacle to such evaluation is the lack of data available for the purpose. Specifically, to determine whether a model's prediction is correct, we need data labeled with the correct answer, and enough of it to calculate statistical measures such as accuracy.
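To make these measures concrete, the following is a minimal sketch of how accuracy and the false positive and false negative rates could be computed once labeled data is available. The function name `evaluate_binary`, the "damaging" interpretation of the labels, and the sample data are hypothetical illustrations; none of them is taken from Wikibench, ORES, or Liftwing.

```python
def evaluate_binary(labels, predictions):
    """Compare model predictions with community labels.

    Both arguments are equal-length sequences of booleans, where True
    means "damaging" (an assumed label scheme for this illustration).
    """
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")

    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)          # damaging, flagged
    tn = sum(1 for y, p in pairs if not y and not p)  # fine, not flagged
    fp = sum(1 for y, p in pairs if not y and p)      # fine, but flagged
    fn = sum(1 for y, p in pairs if y and not p)      # damaging, but missed
    total = tp + tn + fp + fn

    # Return None for a measure whose denominator would be zero.
    return {
        "accuracy": (tp + tn) / total if total else None,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else None,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else None,
    }


# Hypothetical labels from Wikipedians and predictions from a model.
labels = [True, False, False, True, False, False, True, False]
predictions = [True, False, True, False, False, False, True, False]
result = evaluate_binary(labels, predictions)
print(result)
```

With only eight labeled edits, a single disagreement shifts accuracy by 12.5 percentage points, which illustrates why a sufficient amount of labeled data is needed before such measures are meaningful.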

System development

Wikibench aims to support the curation of labeled, up-to-date data for evaluating AI models used across Wikimedia projects. Acknowledging that data curation requires effort on top of Wikipedians' existing work, and that individuals may disagree about what the correct label is, Wikibench is developed around the following five design considerations:

  1. The data curation process should be led by the community.
  2. The data curation process should be embedded in existing workflows to minimize extra effort.
  3. The data curation process should encourage deliberation for consensus building.
  4. The progress of the data curation process should be publicly available.
  5. The resulting evaluation dataset should surface both the strengths and limitations of AI models.

These five design considerations are derived from interviews with multiple Wikipedians, as documented on this research project page.

User interface

Based on the five design considerations above, Wikibench provides three primary interfaces to support curating labeled data for AI model evaluation:

Campaign page

The goal of a campaign is to curate enough labeled data to evaluate one or more targeted AI models. Each Wikibench campaign page documents the details of its campaign, including its goal, progress, participation instructions, and a link to the model evaluation results.

Entity page

Items on the wiki that are to be labeled, such as edits and articles, are called entities. Each Wikibench entity page documents the labels that Wikipedians have provided for that entity.
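As a rough illustration of what an entity page records, the sketch below summarizes the individual labels given to one entity and flags disagreement that may call for discussion. The function `summarize_entity`, the usernames, the label values, and the use of majority voting are all assumptions made for this example; this page does not specify how Wikibench combines individual labels.

```python
from collections import Counter


def summarize_entity(labels_by_user):
    """Summarize the labels that individual users gave to one entity.

    labels_by_user maps a (hypothetical) username to that user's label.
    Majority voting is an assumption used purely for illustration.
    """
    counts = Counter(labels_by_user.values())
    if not counts:
        return {"majority_label": None, "agreement": None, "needs_discussion": False}

    ranked = counts.most_common()
    top_label, top_count = ranked[0]
    # A tie for first place means there is no majority label.
    tied = len(ranked) > 1 and ranked[1][1] == top_count

    return {
        "majority_label": None if tied else top_label,
        "agreement": top_count / sum(counts.values()),
        # Any disagreement at all is surfaced so that users can discuss it.
        "needs_discussion": len(counts) > 1,
    }


# Hypothetical labels for a single edit.
summary = summarize_entity(
    {"UserA": "damaging", "UserB": "damaging", "UserC": "not damaging"}
)
print(summary)
```

Keeping the individual labels alongside any summary means disagreement stays visible instead of being averaged away.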

In-context (e.g. Special:Diff)

Because data labeling requires effort on top of Wikipedians' existing work, the Wikibench plug-in is a minimal interface embedded in Wikipedians' existing workflows that gathers labeled data while they perform that work. The plug-in needs to be customized for each campaign because the workflow associated with each campaign is likely to differ. Currently, the system supports labeling while reviewing Special:Diff, in order to support patrolling work.

Try out Wikibench

Wikibench is currently a research prototype undergoing pilot testing. Anyone can try out the prototype by following the instructions in the pilot campaign listed below:

English Wikipedia

You can help

If you like Wikibench and want to help with its development, you can take one or more of the following actions, even if you don't have a computer programming background:

  • We are recruiting participants for pilot testing now! More information is available on the recruitment page.
  • To receive updates about Wikibench and future study opportunities, please also sign your name on the contact list.
  • Feel free to try out Wikibench through the campaign page above, even if you're not interested in or available for the study.
  • Improve this project page about Wikibench.
  • Suggest new features or campaigns on the discussion page.