
Research:Understanding volunteers' motivation/experience of building/modifying tools, and sentiments regarding AI/LLM use cases and Edit Checks

Duration: August 2025 – October 2025
This page documents a completed research project.


Final report for a Wikimedia Foundation Research study on volunteers' perceptions of building/modifying tools, AI/LLM use cases, and Edit Checks.

The Edit Check feature is deployed to help newcomers as they attempt and make edits, informing them of relevant policies and offering correction suggestions. The ‘checks’ currently in place span a few discrete pillars such as references, copyright, and tone. Previous research has focused on the perspective of those receiving this guidance; this project instead invited experienced Wikipedians who have written or modified AbuseFilters (along with a couple of participants who have primarily written or edited gadgets, bots, and/or scripts) to better understand their motivations for this work, their current use of AI/LLM/machine learning generally and in wiki (specifically moderation) contexts, and their reactions to a few edit checks that are currently in the works (tone, reference, and paste checks).

Methods


A total of 11 interviews were conducted:

  • With 10 participants from English Wikipedia and 1 from French Wikipedia
  • Who have AbuseFilter-related rights, and
  • Who have worked on user scripts, bots, gadgets, and/or AbuseFilter extensions

Timeline


This project was initiated and completed in FY25-26 Q2.

Policy, Ethics and Human Subjects Research


This study followed the Wikimedia Foundation's data retention guidelines and data publication guidelines.

Respondents were able to review the project's privacy statement, and scheduled interviewees reviewed and signed a project-specific release form prior to participation.

Results


Key Insights

  • Tech and tooling gaps. A small number of technical gaps (some of which cannot be meaningfully addressed via P&T due to privacy-related restrictions) are detailed in the report (e.g. more tools for embedding tags on diffs, a one-stop shop for templates with Wikidata backends), but overall there are no glaring gaps.
    • The key ‘structural’ gap is “failing at the stage of updating”. The community relies on Foundation teams to build and maintain tooling, and even where the community can write its own tools, some level of Foundation support is still needed (but this support is not always available and/or provided).
      • There is a sentiment that the Foundation should dedicate more resourcing to addressing tech debt in this realm: assisting with the updating/maintenance of existing community-built tools rather than only building new features/tooling, and communicating more proactively about internal code/API changes so that the community can plan for the updates needed when changes break tools and/or the developer who wrote/maintained the tool in question is no longer active.
  • Level of interest in tooling usage and impact metrics. Generally, participants seemed relatively blasé about needing metrics or other measures of ‘impact’. However, we know from other research and anecdotally that experienced users often use XTools to review exactly that; some are interested in or get a kick out of seeing usage details, and for some types of tools metrics are important for checking ‘success’ rates and evaluating whether tweaks are necessary. Usage/impact stats are likely both useful to have (for practical reasons) and still socially appreciated to some extent. The key to supporting or highlighting quantitative metrics is to ensure that ‘glue’ work on the wikis is not simultaneously devalued or forgotten.
  • Policy/guideline/norm programming/enforcement as a motivation for modifying and creating technical tools. When asked about their journeys into building/editing tools, programming/enforcement sentiment and language were not especially salient in participants' narratives and descriptions of their work. Most simply wanted to get things done, stop doing a particular task or set of tasks repetitively, add a functionality to reduce friction, or similar. Certainly, much of this wiki work involves fighting vandalism and bad actors, and while fighting vandalism would at its core be classified as a policy/enforcement action, participants mostly did not describe it as such.
  • Conflicting feelings on AI/LLM/machine learning technologies and their inclusion in moderation contexts. While a couple of participants were staunchly against automated tooling/models in any capacity, most felt that these technologies can be used in various scenarios. However, all caveated their positivity and enthusiasm about the pros; they felt that, regardless, trusting any final output still necessitates moderate to significant human follow-up/correction/confirmation depending on the use case and its complexity. Another concern is that stepwise inclusion might eventually trip an unintended or unwanted step in the sequence of utilizing what is perceived as a double-edged sword.
  • Edit Checks. Overall, participants reacted positively to the checks as a useful nudge for newer editors, unconfirmed accounts, and IP editors before saving/publishing. A common refrain was for the Foundation to ensure that different contexts (language, region, culture, sister wiki, etc.) have a chance to weigh in at the outset and on an ongoing basis.
    • However, tone check elicited comparatively more concern in its implementation and application. To participants, this check – in contrast to the others (reference and paste checks) – will face more challenges at the content/language level (cultural, political, linguistic, and other usage nuances and differences).

Recommendations

  • Ensure that the community is well informed about changes that could affect their tooling ecosystem. With enough notice, they can adjust resourcing to address potential tooling downtime.
  • Continue to support XTools and other tooling metrics (and centralize as many as possible so they are not piecemeal/ad hoc around the wikis).
  • Consult the community in the rollout and implementation processes for the various checks, especially those that directly involve a ‘judgment’ of content/language usage.
  • For tone check in particular, ensure that the BERT model and other technology that underpins it can adequately distinguish common use cases that might otherwise be flagged incorrectly, such as quotations or academic documentation of slang (a minimal illustration of such a pre-filter follows below).
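To make the last recommendation concrete, the following is a minimal sketch (in Python) of what distinguishing quotations from an editor's own wording might look like before a tone classifier is consulted. The function names, threshold, and quotation-masking heuristic here are illustrative assumptions, not the actual Edit Check or tone check implementation.

    import re

    # Hypothetical sketch only: not the actual Edit Check / tone check code.
    # The idea is to mask directly quoted spans before a tone classifier
    # (e.g. a BERT-based model) scores the sentence, so that quoted slang or
    # cited wording is not treated as the editor's own tone.

    QUOTE_PATTERN = re.compile(r'"[^"]*"|\u201c[^\u201d]*\u201d')

    def mask_quotations(sentence: str) -> str:
        """Replace quoted spans with a neutral placeholder before scoring."""
        return QUOTE_PATTERN.sub("[QUOTED]", sentence)

    def should_flag_tone(sentence: str, classify, threshold: float = 0.8) -> bool:
        """Flag a sentence only if its non-quoted text scores above the threshold.

        `classify` is an assumed callable returning the probability that the
        text violates tone guidance; the name and threshold are illustrative.
        """
        masked = mask_quotations(sentence)
        if masked.strip() in ("", "[QUOTED]"):
            # The sentence is entirely quotation; defer to human judgment.
            return False
        return classify(masked) > threshold

    # Example usage with a stand-in classifier:
    fake_classifier = lambda text: 0.1  # placeholder for a real model call
    print(should_flag_tone('The critic wrote that the film was "utter garbage".', fake_classifier))

Even with heuristics of this kind, the recommendation below still applies: a human should remain able to confirm or overturn any flag the model produces.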

Generally, be as transparent as possible when using AI/LLM/machine learning technology on the wikis, always allow for human confirmation/corroboration of outputs/actions, and use these technologies where they are most useful/efficient while retaining the wikis’ identity as the source of human-sourced/curated/moderated knowledge.

Overall Conclusion

While many Wikipedians who write and modify tools appreciate having an additional potential tool with which to accomplish moderation tasks (and potentially catch issues before they are finalized), most participants have reservations about an opaque AI or AI-adjacent system making live suggestions, especially when those suggestions involve more nuanced content issues like checking tone. Additionally, many emphasize not only that the Foundation should take care when introducing a check that may erroneously flag ‘tone’ that simply reflects a quotation, an academic usage of ‘inappropriate’ or ‘slang’ terms, or other more esoteric but appropriate use cases, but also that AI/AI-adjacent usage on the wikis in general should be as transparent as possible, with a human community member still performing the final deciding ‘check’ in those circumstances. Finally, to zoom out further still, participants flag that they (community members and individual language/project wikis) would like more ownership/stewardship/sovereignty over whether and how tools are implemented, regardless of their technical origin (i.e. even if Foundation staff build, maintain, and roll them out).
