Grants talk:Programs/Wikimedia Research Fund/WikipedAI: Investigating AI Collaboration and Conflict in Open Knowledge Systems
Thank you for considering providing your input for the proposal. Although many applicants are not actively monitoring these pages and are unlikely to respond here, we will share your feedback with every applicant to help them improve their research. Please write your feedback in a constructive and supportive manner, and note that the Community Resources team's behavioral expectations as well as the UCoC's expected behaviors apply to this space.
If possible, please keep your input concise. We'd particularly like feedback on the following questions:
- From what perspective or in what role are you providing your input? (For example: a Wikipedia editor, a representative of a Wikimedia chapter, a member of a regional committee)
- In what ways do you think this research can support members of the Wikimedia communities in their work on Wikimedia projects?
- Is there a particular project or affiliate in your country/region/project that can benefit from the result of this research and/or you recommend that the applicants seek to coordinate or collaborate with? If so, please provide details including the name of the project or affiliate and a short description of the relevance of the proposal to the needs of the affiliate or project.
- Do you have any other comments or feedback you would like to share with the Research Fund chairs?
Proposal feedback
I do not think that this proposal should be funded. The research questions are ill-defined, and for any interpretation that would make them of practical relevance to Wikipedians, the proposed method is not suitable for answering them - at least not without unrealistic assumptions about the capabilities of the system that the author promises to construct as part of this project, while evidently being unaware of practices for building functioning LLM agents that are standard both in the research literature and in industry practice. If this project is funded and executed as specified in this proposal, I would proactively caution the Wikipedia community not to rely on its promised conclusions, because the envisaged research method is unsuitable for supporting them.
In more detail:
The proposal addresses an important topic and makes bold promises about the contributions to be expected from this project:
This work will provide critical insights into the emergent dynamics and potential risks associated with multi-agent AI editing, informing Wikipedia community guidelines and Wikimedia governance strategies.
It promises to do this by constructing and analyzing a controlled simulation of LLM agents editing a sandbox Wikipedia to answer the following research questions:
- RQ1: How do LLM bots interact when working to edit Wikipedia?
- RQ2: How does LLM editing activity affect Wikipedia article content (e.g., on quality, sourcing, neutrality)?
- RQ3: How does LLM editing behaviour compare to human / traditional bot editing patterns?
But it fails to specify what is actually being "simulated" (i.e. what the criteria for a successful simulation should be). The author seems to assume that "LLM bots" are a well-defined study subject (like, say, a biological species or a particular online community). But "LLM bots [that are designed to be capable of] working to edit Wikipedia" by and large do not exist yet, except for a few LLM-based systems focused on very specific tasks (disregarding simpler tasks that don't necessarily require modern LLMs, like spellchecking or vandalism detection). The few known examples include:
- STORM (by Shao et al., very briefly mentioned in the proposal) - for writing articles
- Wikicrow (by Skarlinski et al., not mentioned in the proposal) - for writing articles
- Edisum for generating edit summaries from a diff
- Ashkinaze et al. ("Seeing Like an AI", briefly mentioned in the proposal) for detecting and correcting NPOV violations
(Also, here is an ongoing effort by UIUC researchers to make LLMs perform yet another very specific task involved in editing Wikipedia.)
It is also worth noting that none of these four - even though they can reasonably be regarded as being the best (or among the best) efforts to make LLM-based systems perform editor tasks realistically well - has actually been officially used on Wikipedia yet. As for the LLM editing activity/LLM editing behaviour that RQ2 and RQ3 focus on, the proposal cites some existing evidence indicating that LLM editing is increasingly used, often without clear disclosure or detectability. But this is - for all we know - generally not done by autonomously editing "LLM Agents" but by human editors copy-pasting LLM-generated text into Wikipedia (as the proposal vaguely acknowledges later), a behavior that this proposed project is not designed to analyze.
And the tasks that have already been implemented reasonably successfully with LLMs - writing new articles from scratch, generating edit summaries, etc. - are just a small part of what editing Wikipedia involves. Any setup that aspires to yield critical insights into the emergent dynamics and potential risks associated with multi-agent AI editing on the actual Wikipedia will need to cover a much broader spectrum of activities, even if the simulated system envisaged in this proposal allows for some simplifications (despite working with actual data from English Wikipedia, in all its complexity - think templates, categories, images...).
The author seems very uninterested in thinking about what might be involved here concretely, i.e. what actions or goals his LLM Agents would need to perform or pursue for the simulation to be called useful. Oddly, he gives much more thought to how to categorize their behavior after the fact: "Analysing this editing behaviour [recorded in the simulation wiki's edit history] relies on effective operationalisation of key concepts. Geiger and Halfaker (2017) categorise bot interactions according to specific tasks (e.g. category work, fixing double redirects), yet there is likely to be greater diversity in LLM editing interactions. [...etc.]" (That comparison to non-LLM bots is especially weird, considering that they are only trusted to carry out a very small set of the tasks involved in editing Wikipedia, in contrast to the author's own LLM bots, which will supposedly run an entire wiki by themselves.)
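To make concrete how little such after-the-fact categorisation tells us about agent design: the Geiger and Halfaker style of analysis boils down to counting interaction types in an edit history, roughly along the lines of the following toy sketch (my own illustration, with made-up field names and data, not anything from the proposal):

```python
# Toy illustration: counting revert interactions between pairs of
# (hypothetical) bot accounts in a revision log, Geiger & Halfaker style.
from collections import Counter

revisions = [
    # (page, revision_id, editor, editor_reverted_or_None)
    ("Example_article", 101, "LLMBot_A", None),
    ("Example_article", 102, "LLMBot_B", "LLMBot_A"),  # B reverts A
    ("Example_article", 103, "LLMBot_A", "LLMBot_B"),  # A reverts B back
    ("Another_article", 201, "LLMBot_C", None),
]

revert_pairs = Counter(
    (reverter, reverted)
    for _page, _rev, reverter, reverted in revisions
    if reverted is not None
)

for (reverter, reverted), n in revert_pairs.most_common():
    print(f"{reverter} reverted {reverted}: {n} time(s)")
```

Descriptive statistics of this kind can be computed from any edit history whatsoever; they say nothing about whether the agents that produced the history resemble anything that could plausibly operate on the real Wikipedia.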
In other words, the author doesn't give us good reasons to assume that his simulation will correspond in any realistic way to what we might see from LLM bots capable of editing the actual Wikipedia (without immediately getting blocked for obviously nonsensical behavior).
What's more: while there are now some systems that might be at or near that capability for a small part of the many specific activities involved in editing Wikipedia, each of the existing projects mentioned (STORM, Wikicrow, Edisum, ...) was the result of multiple researchers devoting an entire research paper's worth of work to getting it to perform reasonably well at its specific task. For example, this involved designing complex agent workflows for STORM (see this figure) and Wikicrow (see Figure 1, "Schematic of PaperQA2's agentic toolset", in their paper). Both also involved LLM agents doing extensive information retrieval, just as finding external sources for articles is a very important part of human Wikipedia editors' work. (And IIRC search engine API fees have been a considerable cost factor for the STORM system.)
The WikipedAI proposal seems oblivious to such difficulties. The very short LLM agent specification section adds to the strong impression that the author assumes that merely prompting an LLM suffices to create a working agent. (Maybe something like "Hey, ChatGPT, act like a Wikipedia editor"? We don't know; the sketch after the quoted passage below illustrates how far even a minimal agent loop goes beyond that.) For example, judging from the section's opening paragraph, he seems to think that specifying the right number of agents and the frequency of edits is among the most important factors in making his agents behave sufficiently like real Wikipedia editors, and he appears more interested in comparing different LLMs on the same prompt than in how to actually find and test a well-working prompt:
The number of agents and frequency of edits (relative to number of articles) will be controlled so as to approximate real editor interaction dynamics on Wikipedia. We will also investigate the dynamics of inter- and intra-model editing, assessing how models from different providers (OpenAI, Anthropic, Google, DeepSeek, Meta etc.) interact with each other. Prompt instructions will be consistent between models for the main analysis, and tested for stability to ensure result robustness (Barrie et al., 2024).
(BTW, the cited Barrie et al. paper is about text classification, a very different task.)
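To illustrate the gap between "just prompt it" and a working agent: even the most minimal tool-using agent loop - far short of what STORM or Wikicrow implement - already requires decisions about tools, output parsing and failure handling. The following sketch is purely illustrative (the tool names, the ACTION/ARGS convention and the call_llm stub are all invented for this example; it is not the proposal's method or any particular framework's API):

```python
# Purely illustrative sketch of a minimal tool-using agent loop.
# Tool names, the ACTION/ARGS convention and call_llm() are invented here.

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned response here."""
    return "ACTION: finish\nARGS: no further changes needed"

def fetch_article(title: str) -> str:
    return f"(wikitext of {title} would be fetched here)"

def search_sources(query: str) -> str:
    return f"(search results for '{query}' would be retrieved here)"

TOOLS = {"fetch_article": fetch_article, "search_sources": search_sources}

def run_agent(task: str, max_steps: int = 5) -> None:
    history = [f"TASK: {task}"]
    for _ in range(max_steps):
        reply = call_llm("\n".join(history))
        action, _, args = reply.partition("\nARGS: ")
        action = action.removeprefix("ACTION: ").strip()
        if action == "finish":            # the agent decides it is done
            print("Agent finished:", args)
            return
        if action not in TOOLS:           # malformed output must be handled
            history.append("ERROR: unknown tool, try again")
            continue
        history.append(f"OBSERVATION: {TOOLS[action](args.strip())}")
    print("Gave up after max_steps")      # another failure mode to handle

run_agent("Improve sourcing of [[Example article]]")
```

Getting a real model to drive such a loop reliably - choosing sensible actions, producing parseable output, not hallucinating sources - is precisely the hard part that the STORM, Wikicrow and Edisum authors each spent a paper's worth of work on, and that this proposal does not engage with at all.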
Similarly, in the "Risks" section the author briefly starts to consider the possibility that his simulation may not be realistic - but not because his LLM bots might simply fail to perform standard editing tasks reliably enough (i.e. the feat that the STORM, Wikicrow and Edisum authors had to work hard to achieve for a single task), but because of (AI cliché alarm) the lack of human presence and its "nuance[s]":
The controlled sandbox may fail to capture crucial elements of real Wikipedia editing and we will not be easily able to validate our findings against real LLM editing cases. Human editors bring nuanced judgment, and social interaction on Wikipedia involves discussion, consensus-building processes, and power structures (e.g. admin interventions) that our simulation may not fully replicate.
And speaking generally: building functioning LLM agent systems remains a very non-trivial challenge as of mid-2025, even for tasks that are much more narrowly defined and better benchmarked than "Wikipedia editing". There is a reason why several widely used frameworks like LangChain, LlamaIndex or DSPy have emerged. (E.g. STORM uses DSPy.) And there now exists a vast amount of guidance, whitepapers and academic publications on how to get LLM agents to work in general. The proposal indicates very little awareness of them, and the author doesn't seem to have any prior experience in building such systems (or at least mentions none in the proposal, his CV or his website).
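For a sense of what using such a framework involves in practice: STORM's DSPy-based pipeline is built from declared signatures composed into modules, roughly along the lines of the simplified sketch below (this is my illustration, not STORM's actual code; the field names are mine, and the commented-out configuration call requires an API key and its exact form varies between DSPy versions):

```python
import dspy

class DraftSection(dspy.Signature):
    """Draft one section of an encyclopedic article from retrieved sources."""
    topic = dspy.InputField()
    sources = dspy.InputField(desc="retrieved source snippets")
    section = dspy.OutputField(desc="wikitext with inline citations")

draft_section = dspy.ChainOfThought(DraftSection)

# Configuring and calling the module requires a model and an API key, e.g.
# (exact call varies between DSPy versions):
# dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
# prediction = draft_section(topic="Example topic", sources="...")
# print(prediction.section)
```

Even this trivial fragment presupposes decisions - about retrieval, about what the inputs and outputs of each step are, about how steps are chained - that the proposal's LLM agent specification section never mentions.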
Overall, it remains entirely unclear why the project's expected findings should be considered generalizable to all forms of "LLM editing", as opposed to pertaining just to the author's own LLM bots. How and how much they shape knowledge, reinforce bias, or trigger feedback loops (to quote questions that the introduction section calls essential for safeguarding not just Wikipedia, but the wider information ecosystem) will likely depend very much on the specific prompts used and other particular design choices. (The author admits as much when, later in the proposal, he talks about the possibility of pitting bots with a pre-specified bias on a particular subject against “neutral” bot editors.)
Another standard part of building working LLM systems is evaluation - the aforementioned STORM, Wikicrow, Edisum and NPOV detection/fixing papers all featured it centrally in assessing how well their respective solutions worked at implementing a Wikipedia editing task. Not so the WikipedAI proposal, which oddly turns the tables in this regard and proclaims that, instead of having the proposed agent system pass benchmark evaluations, future benchmarks should be built on the basis of the WikipedAI "framework" as a "model organism":
one of the key outputs being the WikipedAI framework as a “model organism” for experimenting with different agent compositions and specifications. This will help with the design and evaluation of multi-agent LLM benchmarks oriented towards Wikimedia principles (Johnson et al., 2024) that may extend beyond Wikipedia.
(It's unclear to me what exactly that "framework" output will consist of, or what value it will provide beyond the many existing methods of analyzing user interactions on-wiki, several of which the project will rely on itself according to the "Data collection and analysis" section.)
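For contrast, here is what even a rudimentary, purely automatic evaluation check might look like - a toy of my own construction, not anything from the cited papers or the proposal, and far short of the human and LLM-judge evaluations that STORM and Wikicrow actually report:

```python
import re

def toy_article_metrics(generated: str, reference: str) -> dict:
    """Toy proxies only: citation density and section-heading overlap."""
    citations = len(re.findall(r"<ref[ >]", generated))      # <ref> tags
    sentences = max(1, len(re.findall(r"[.!?]", generated)))
    gen_headings = set(re.findall(r"^==+\s*(.*?)\s*==+\s*$", generated, re.M))
    ref_headings = set(re.findall(r"^==+\s*(.*?)\s*==+\s*$", reference, re.M))
    overlap = (len(gen_headings & ref_headings) / len(ref_headings)
               if ref_headings else 0.0)
    return {"citations_per_sentence": citations / sentences,
            "heading_overlap": overlap}

generated = "== History ==\nFounded in 1900.<ref>Source</ref> Grew quickly.\n"
reference = "== History ==\n...\n== Reception ==\n...\n"
print(toy_article_metrics(generated, reference))
```

Without even this level of evaluation, there is no way to tell whether the simulated agents' output comes anywhere near the quality of what the cited systems produce, let alone whether it can support conclusions about "LLM editing" in general.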
Regards, HaeB (talk) 10:36, 13 May 2025 (UTC)