Grants:Programs/Wikimedia Community Fund/Rapid Fund/Wikipedia's Factual GenAI Assistant Experiment (ID: 23544856)
Applicant details
- Main Wikimedia username. (required)
عباد ديرانية
- Organization
N/A
- If you are a group or organization leader, board member, president, executive director, or staff member at any Wikimedia group, affiliate, or Wikimedia Foundation, you are required to self-identify and present all roles. (required)
I'm a board member or president of a Wikimedia Affiliate or mission-allied organization.
- Describe all relevant roles with the name of the group or organization and description of the role. (required)
- Board member of Wikimedia NYC
- Previous board member of Wikimedia Levant
- Grantee of WikiTermBase (MS Strategy grants)
- Editor on Arabic Wikipedia
Main proposal
- 1. State the title of your proposal. This will also be the Meta-Wiki page title.
Wikipedia's Factual GenAI Assistant Experiment
- 2. and 3. Proposed start and end dates for the proposal.
2025-09-25 - 2025-12-31
- 4. What is your tech project about, and how do you plan to build the product?
Include the following points in your answer:
- Project goal and problem you solve
- Product strategy or project roadmap
- Technical approach (infrastructure, tech stack, key tools and services)
- Integrations or dependencies (if any)
*Project Goal*
The goal of the project is to test a novel approach: using open-source large language models (LLMs) as fact-checkers that can provide reliable paraphrasing of Wikipedia's content through a GenAI assistant. Through this experiment, we seek to explore a new way of enabling more reliable uses of LLMs and to collect data that can demonstrate the validity of LLMs as fact-checkers for future implementations.
*Problem*
This project is inspired by my personal experience as a Wikipedia editor (since 2009) and movement organizer who currently develops chatbots for a living. Unfortunately, most commercial generative AI chatbots have few guardrails for reliability and factuality, yet generative AI is quickly taking over Wikipedia's role as the primary source of information on the internet. Wikipedia's readership has been decreasing since 2013, and the trend is more alarming for smaller languages. Integrating GenAI more deeply into our work is probably inevitable, so it's crucial to start experimenting with ways to align it with our values and principles sooner rather than later.
*Solution*
We'll build an experimental AI assistant for readers that exclusively draws answers from Wikipedia pages and integrates an explicit, novel fact-checking step into its architecture, inspired by Wikipedia editors' own fact-checking process. This assistant is not intended for public use but only as a time-bound experiment, used for rigorous testing and evaluation of the model's reliability against Wikipedia's baseline of reliable information. We'll enlist the support of editors and collaborators in manually fact-checking ~500 responses, and will collect other qualitative feedback to learn about the viability of such an assistant and how it compares to non-fact-checked, off-the-shelf LLMs.
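The retrieve-generate-verify architecture described above can be sketched as follows. All components here are toy stand-ins for illustration: a real build would use a vector index for retrieval, an open-source LLM for generation, and an entailment/fact-checking model (such as MiniCheck) for the verification step.

```python
# Toy sketch of a RAG pipeline with an explicit fact-checking step.
# Every name below is a hypothetical stand-in, not the project's actual code.

ARTICLES = {
    "Damascus": "Damascus is the capital of Syria. It is one of the "
                "oldest continuously inhabited cities in the world.",
}

def retrieve(question):
    """Toy retrieval: return article texts whose title appears in the question.
    A real system would query a vector index over Wikipedia."""
    return [text for title, text in ARTICLES.items()
            if title.lower() in question.lower()]

def generate(question, passages):
    """Toy generator: returns the first retrieved sentence as the 'answer'.
    A real system would prompt an LLM, grounded on the passages."""
    if not passages:
        return None
    return passages[0].split(". ")[0] + "."

def fact_check(answer, passages):
    """Toy verifier: accept only answers supported verbatim by a source.
    A real system would use an entailment model here instead."""
    return any(answer.rstrip(".") in p for p in passages)

def answer(question):
    """Full pipeline: retrieve, draft, then refuse anything unverified."""
    passages = retrieve(question)
    draft = generate(question, passages)
    if draft is None or not fact_check(draft, passages):
        return "I could not verify an answer from Wikipedia."
    return draft
```

The key design choice is that the verifier sits between generation and the reader: an unverifiable draft is withheld rather than shown, mirroring how editors reject unsourced claims.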
*Project Roadmap*
Setup & Basic architecture (September):
- 5. What is the expected impact of your project, and how will you measure success?
Include the following points in your answer:
- Milestones and progress tracking
- Project impact and success metrics
The expected impact of this project is a better understanding of the validity of GenAI assistants as factual sources of information, achieved by comparing several approaches to using LLM assistants with a fact-checking and reliability step.
Milestones:
- September - October: Build a fully-functional RAG chatbot prototype for experimentation purposes.
- October: Test experimental pipelines (Plain LLM, MiniCheck, other LLMs) on benchmark datasets.
- November: Evaluate a subset of the model responses with editors (~500 responses) + error analysis.
- December: Consolidate results, insights on fact-checker efficacy, draft report prepared.
- Tracking: Monthly check-ins on deliverables; response counts and evaluator participation logged; Fleiss’ Kappa agreement tracked for evaluation reliability.
Success metrics:
- Factuality: Our main goal is to compare the factuality of the GenAI assistant to commercial LLMs and Wikipedia content by having human reviewers judge the factuality of a subset of generated responses against their sources. The main metric is the number of responses with factual errors. We considered a number of frameworks for the factuality evaluation, but have settled on a True / False boolean for whether an individual statement is factual.
- ≥500 responses evaluated: To validate our results, we aim to evaluate a subset of at least 500 responses, with ≥3 independent Wikipedians or experts evaluating each, so that evaluator agreement can be measured later on.
- Inter-rater agreement: We are looking for agreement between the evaluators of at least a moderate level (Fleiss’ Kappa ≥ 0.4–0.6).
- Acceptance by the Wikimedia community: We will share the results with the Wikimedia community and conduct a short survey to understand the community's attitude towards them. We'll consider this a success if more than two thirds of respondents support further experimentation in the future.
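For concreteness, Fleiss’ Kappa can be computed directly from the evaluators' factuality votes. The sketch below is a minimal from-scratch implementation of the standard formula, assuming each response receives the same number of ratings; the example data is invented, not from the project.

```python
def fleiss_kappa(counts):
    """Fleiss' Kappa for inter-rater agreement.

    counts: one row per rated response; each row gives how many evaluators
    chose each category, e.g. [votes_factual, votes_not_factual].
    Assumes every response was rated by the same number of evaluators.
    """
    n = len(counts)          # number of rated responses
    r = sum(counts[0])       # raters per response
    k = len(counts[0])       # number of categories

    # Observed per-response agreement P_i, averaged into P_bar
    P = [(sum(c * c for c in row) - r) / (r * (r - 1)) for row in counts]
    P_bar = sum(P) / n

    # Chance agreement P_e from overall category proportions
    p = [sum(row[j] for row in counts) / (n * r) for j in range(k)]
    P_e = sum(pj * pj for pj in p)

    return (P_bar - P_e) / (1 - P_e)
```

For example, with 3 evaluators and hypothetical votes of [3, 0], [0, 3], and [2, 1] across three responses, the function returns 0.55, inside the moderate 0.4–0.6 band targeted above; perfect agreement yields 1.0.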
- 6. Who is your target audience, and how have you confirmed there is demand for this project? How did you engage with the Wikimedia community?
Include the following points in your answer:
- Project demand and target audience description
- Links to interaction(s) with Wikimedia community
- Evidence from community consultation such as the [Community Wishlist]
Demand & Target Audience
Inside the movement:
- We are running an experiment whose results will impact both the Wikimedia contributor community and developers. We are addressing a crucial question that both of these groups are struggling with: How does GenAI impact our work, and where can we trust it?
- Down the line, depending on the results of our experiment, Wikimedia readers could be a hugely impacted group if we are able to develop trustworthy GenAI solutions that change how they interact with the internet.
Outside the movement:
- The GenAI research community, to which our project team is already strongly connected, may have great interest in the results of this experiment and the white paper we intend to share.
- GenAI developers will be greatly impacted by a demonstration of building a factual RAG chatbot, which is one of the biggest challenges in the industry right now, as experienced by our team members at Uber and Audible.
Community Engagement
During my participation at Wikimania, I had extended conversations about this project with Jonathan Fraine (CTO of Wikimedia Deutschland), Leila Zia (WMF Research Team), Asaf Bartov (WMF Community Development), and Liam Wyatt (Wikimedia Enterprise and WikiCred perspective), who all confirmed the project's importance while also providing nuance and guidance on how best to execute it. Additionally, I spoke individually with many Wikimedians at the conference who were very excited to probe the potential of GenAI.
On August 29th, I gave an online presentation, hosted by the Deoband Community Wikimedia and attended by dozens of Wikimedians, introducing GenAI in wiki use cases as well as this project. I received tremendous feedback afterwards and incredible enthusiasm for the simple demos I showed during the talk, which was an important encouragement to pull through with the project.
Although conducted in English, this project touches on the goals and aspirations I have heard from Wikimedians across global communities, many of whom wish to tackle the opportunities of GenAI but have been too overwhelmed by the hype and the fast-moving nature of the field to actually build the solutions we need to test and validate.
- 7. How will your team predict and manage potential user security and privacy risks, and what risks do you currently see?
Include the following points in your answer:
- The level of in-house or consulted security and privacy expertise you will have available to you during delivery of this project
- How your development, testing, and deployment processes mitigate the introduction of unnecessary security or privacy risks
- We will bring on a professional software developer to ensure that our tech stack is robust, secure, and ideally compliant with the requirements of hosting on Toolforge directly to ensure maximum security.
- We are not planning to collect any personal or user data from the tool's typical users; we will only use data available on Toolforge (or, if we host on HuggingFace, it collects only aggregated usage statistics).
- The only channel through which user data may be submitted is a completely voluntary feedback form for those who wish to be involved as evaluators of the project. Users will have the option to contact the project organizers directly, in case they wish to bypass providing contact information through the form. If they choose to fill it out, we will only ask for usernames and emails through a secure Google Form / LimeSurvey (AFAIK both approved).
- 8. Who is on your team, and what is your experience?
Include the following points in your answer:
- Your experience as a developer, relevant past projects
- Wikimedia SUL (developer), Gerrit, Github, Gitlab or other relevant public account handles
- Other team members, their roles and expertise
- Abbad Diraneyya: Project Lead and RAG expert. Wikimedian since 2009, co-founder of Wikimedia Levant, ex-Knowledge Manager at the Wikimedia Foundation, current board member of Wikimedia NYC. AI Conversational Designer at Uber specializing in developing and evaluating GenAI chatbots.
- Christoph Meinrenken: Project Advisor, Professor of Practice at Columbia University and Director of the Information & Knowledge Strategy Program. Expertise in research and data analysis.
- Amy Nguyen Pham: Machine Learning Engineer at Audible, responsible for data science analysis and helping build a secure and robust tech infrastructure.
- Olivia Long: AI Safety Intern at University of Chicago's XLabs, responsible for model testing, evaluation, and software development support.
- JunYang Ma: Math major at Columbia College, responsible for training algorithms and collecting testing datasets for factuality.
- 9. How will the project be maintained long-term?
Include the long-term maintenance plan with maintainer(s) in your answer. If you expect the long-term maintenance to incur expenses, please list those and the plan for long-term expense coverage.
The project is an early GenAI experiment at this stage, projected to run from September to December 2025. In case of promising results, long-term development and future iterations are likely to be provided through Columbia University (which is involved in the research aspect), as well as potentially the Wikimedia NYC chapter.
- 10. Under what license will your code be released, and how will you ensure the product is well documented?
Include the following points in your answer:
- Code license and compatibility with Wikimedia projects
- Documentation plan
Code License
All code developed under this grant will be released under an MIT License (or GPL v3, if preferred for closer alignment with Wikimedia’s norms). Both are open-source licenses that are compatible with Wikimedia projects, ensuring that what we work on can be freely reused, adapted, and distributed by the community without restriction. This choice supports interoperability and long-term sustainability within the Wikimedia technical ecosystem.
Documentation Plan
To ensure the product is well documented and usable by both developers and community members:
- We have already launched a work-in-progress GitHub repo to transparently share our code, datasets, and results not only with the Wikimedia community, but with open-source developers overall.
- A comprehensive README file will be provided in the project repo by our developer, including setup instructions, usage examples, and contribution guidelines.
- A dedicated documentation section (e.g., <code>docs/</code> folder or GitHub Pages) will describe the project’s architecture, dependencies, and integration with Wikimedia tools or workflows.
- Step-by-step tutorials or walkthroughs will be provided where relevant, with screenshots or example datasets to help users get started quickly.
- Documentation will follow Wikimedia’s developer documentation guidelines for consistency and accessibility.
- 11. Will your project depend on or contribute to third-party tools or services?
Two kinds of third party tools could be involved:
- Large language models: We anticipate mainly using open-source LLMs hosted on HuggingFace, including Llama, Mixtral, and others. For pure benchmarking, we will also run small-scale experiments with OpenAI and Anthropic models, because of their standard status in the industry (these will be used in tests rather than in any publicly accessible chatbot pipeline).
- HuggingFace (TENTATIVE): This is our backup hosting option, depending on any risks or issues identified by the WMF and our developer when working with Toolforge / Wikimedia Cloud Services.
- 12. Is there anything else you’d like to share about your project? (optional)
We would like to emphasize that we have put a lot of thought into this project and why it is needed. We also recognize that it is an experiment in a sensitive field, with many risks but also many opportunities involved. We fully anticipate concerns and would be very open to in-depth conversations to better understand and work around them as experts in our fields.
Budget
- 13. Upload your budget for this proposal or indicate the link to it. (required)
https://docs.google.com/spreadsheets/d/19PbzzC8-ytxOetiGpmC_yqdDw_C-ZfHGHnQe9myw0_E/edit
- 14. and 15. What is the amount you are requesting for this proposal? Please provide the amount in your local currency. (required)
2300 USD
- 16. Convert the amount requested into USD using the Oanda converter. This is done only to help you assess the USD equivalent of the requested amount. Your request should be between 500 - 5,000 USD.
2300 USD
- We/I have read the Application Privacy Statement, WMF Friendly Space Policy and Universal Code of Conduct.
Yes
Endorsements and Feedback
Please add endorsements and feedback to the grant discussion page only. Endorsements added here will be removed automatically.
Community members are invited to share meaningful feedback on the proposal and include reasons why they endorse the proposal. Consider the following:
- Stating why the proposal is important for the communities involved and why they think the strategies chosen will achieve the results that are expected.
- Highlighting any aspects they think are particularly well developed: for instance, the strategies and activities proposed, the levels of community engagement, outreach to underrepresented groups, addressing knowledge gaps, partnerships, the overall budget and learning and evaluation section of the proposal, etc.
- Highlighting if the proposal focuses on any interesting research, learning or innovation, etc. Also if it builds on learning from past proposals developed by the individual or organization, or other Wikimedia communities.
- Analyzing if the proposal is going to contribute in any way to important developments around specific Wikimedia projects or Movement Strategy.
- Analysing if the proposal is coherent in terms of the objectives, strategies, budget, and expected results (metrics).
This is an automatically generated Meta-Wiki page. The page was copied from Fluxx, the web service of Wikimedia Foundation Funds, where the user has submitted their application. Please do not make any changes to this page because all changes will be removed after the next update. Use the discussion page for your feedback. The page was created by CR-FluxxBot.
