Wikimedia Research Newsletter

Vol: 15 • Issue: 11 • November 2025 [contribute] [archives]

At least 80 million inconsistent facts on Wikipedia – can AI help find them?

By: Tilman Bayer and Alaexis

At least 80 million (3.3%) of Wikipedia's facts are inconsistent; LLMs may help find them

Reviewed by Tilman Bayer

A paper titled "Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large Language Models",[1] presented earlier this month at the EMNLP conference, examines

inconsistencies, a specific type of factual inaccuracy [on English Wikipedia], and introduce the task of corpus-level inconsistency detection. We present CLAIRE, an agentic system that combines LLM reasoning with retrieval to surface potentially inconsistent claims along with contextual evidence for human review. In a user study with experienced Wikipedia editors, 87.5% reported higher confidence when using CLAIRE, and participants identified 64.7% more inconsistencies in the same amount of time.
Combining CLAIRE with human annotation, we contribute WIKICOLLIDE, the first benchmark of real Wikipedia inconsistencies. Using random sampling with CLAIRE-assisted analysis, we find that at least 3.3% of English Wikipedia facts contradict another fact [...]

In a Twitter thread, the lead author shared his

Takeaways:
- Contradictions are measurable and fixable at scale.
- LLMs aren't ready to fully automate yet (best AUROC 75.1% on WikiCollide) but are effective copilots.
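For context on the AUROC figure: it is the probability that a scorer ranks a randomly chosen inconsistent fact above a randomly chosen consistent one, so 50% is chance and 100% is perfect ranking. A minimal pure-Python illustration (the labels and scores below are made up for demonstration, not drawn from the paper's data):

```python
# AUROC: the probability that a randomly chosen positive (inconsistent fact)
# receives a higher score than a randomly chosen negative (consistent fact).
# Ties count as half a win.
def auroc(labels, scores):
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0, 0, 0]                   # 1 = inconsistent
scores = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.5, 0.1]   # model confidence
print(f"AUROC: {auroc(labels, scores):.3f}")         # prints "AUROC: 0.800"
```

An AUROC of 75.1% thus means the best model ranks a true inconsistency above a consistent fact only about three times out of four, which is why the authors position it as a copilot rather than an autonomous fixer.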

The authors focus specifically on internal inconsistencies, which they define as

contradictory facts within Wikipedia that indicate errors requiring correction through consultation of original sources. In a crowdsourced repository, inconsistencies can arise from outdated information, limited awareness of related content during editing, or simple human error.

They illustrate this notion with an example (still uncorrected on-wiki at the time of writing) drawn from FEVEROUS, a Wikipedia-derived dataset published in 2021 whose rate of inconsistencies was found to be even higher (7.3%):

François de Bourbon-Montpensier was born in 1492 and received the title “duchy-peerage of Châtellerault” in 1515. However, the Wikipedia table [rather, infobox in] “Duke of Châtellerault” incorrectly states that the title was created 23 years earlier.

To support editors in finding such inconsistencies, the authors constructed the aforementioned LLM-based

CLAIRE (Corpus-Level Assistant for Inconsistency REcognition), a system for surfacing inconsistencies in large corpora. [...] CLAIRE finds and displays not only candidate contradictions but also disambiguating context and explanations of specialized terminology. It features an interactive interface implemented as a browser extension that surfaces potential inconsistencies to Wikipedia visitors.

(Unfortunately, that browser extension doesn't yet seem to have been released as part of the project's code repository or elsewhere.)

CLAIRE is then used to facilitate a (manually confirmed) lower-bound estimate of the overall frequency of inconsistent facts on Wikipedia:

Applying CLAIRE to 700 atomic facts uniformly sampled from Wikipedia articles, we identified 44 potentially inconsistent facts, of which 23 were manually confirmed inconsistent. With 99% confidence, we estimate that approximately 3.3% ± 1.7% [1.6%, 5.0%] of all facts in Wikipedia contradict other information in the corpus. This is a lower bound, as CLAIRE may miss inconsistencies [...] Extrapolated to the entire encyclopedia, this corresponds to between 37.6 million and 121.9 million inconsistent facts,[...] underscoring the need for systematic inconsistency detection.
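The quoted interval is a standard binomial proportion estimate and can be reproduced from the reported counts. A quick sketch using the normal approximation; note that the corpus total of roughly 2.4 billion atomic facts is back-calculated here from the paper's extrapolated range and is an assumption, not a figure stated in the excerpt:

```python
import math

# Reported counts: 23 of 700 uniformly sampled atomic facts were
# manually confirmed inconsistent.
confirmed, n = 23, 700
p_hat = confirmed / n  # point estimate, about 3.3%

# Normal-approximation 99% confidence interval (z ~ 2.576).
z = 2.576
se = math.sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - z * se, p_hat + z * se
print(f"point estimate: {p_hat:.1%}, 99% CI: [{lower:.1%}, {upper:.1%}]")

# Extrapolation: multiply the bounds by the total number of atomic facts.
# ~2.4e9 is an assumed total, back-solved from the paper's 37.6M-121.9M range.
total_facts = 2.4e9
print(f"extrapolated: {lower * total_facts / 1e6:.0f}M to "
      f"{upper * total_facts / 1e6:.0f}M inconsistent facts")
```

Running this reproduces the 3.3% ± 1.7% interval, matching the paper's [1.6%, 5.0%] bounds to rounding.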

The authors then present their own WIKICOLLIDE dataset, consisting of 955 atomic facts drawn from Wikipedia [using a snapshot from November 1, 2024], each manually labeled as either consistent or inconsistent with the corpus. This sample was drawn from a subset of articles (Level 5 Vital Articles) and deliberately biased to prioritize facts more likely to be inconsistent. It is thus not representative of Wikipedia as a whole. However, the paper's classification of the types of inconsistencies present in this corpus should still give an idea of which are most frequent on Wikipedia:

Breakdown of inconsistency types in WIKICOLLIDE validation and test sets (331 inconsistent facts)

Inconsistency type | Description | %
Numerical | Inconsistencies in numerical data, such as quantities, measurements, or percentages | 54.7
– Off-by-One Numerical | Small discrepancy involving a margin of one unit | 23.0
– Clear Numerical | Significant difference that cannot be explained by a margin of one unit | 31.7
Logical | The claim and evidence directly or indirectly contradict each other | 17.5
– Direct Logical | Clear negation or alternative to a unique fact | 14.8
– Indirect Logical | Contradiction inferred or indirectly implied | 2.7
Definition | Different definitions or interpretations for the same term or concept | 10.6
Temporal | Inconsistencies in dates, durations, or event sequences | 7.9
Named Entity | Inconsistencies identifying specific entities (people, organizations, locations) | 6.0
Categorical | Differences in categorizing entities, objects, or concepts | 2.1
Spatial | Inconsistencies in spatial descriptions or geographical information | 1.2


Taking stock of the 2024–2025 research grants

By Alaexis and Tilman Bayer

The Research Fund is a Wikimedia Foundation initiative that supports individuals, groups, and organizations with expertise and interest in conducting research on or about Wikimedia projects. The main funding criterion is whether the grant would result in high-quality and high-impact scholarship. Grant sizes range from 2,000 to 50,000 USD, and work must be completed within 12 months. Since the previous batch of grants was issued in summer 2024, those projects should now be finished, making this a good time to examine the results. The nine projects in this batch received over 270,000 USD[supp 1] in total funding.

Out of 9 projects in that batch, 5 have published their results on Meta Research pages. For the remaining 4 projects without published results, I reached out to the researchers directly and added their responses to the Notes column in the table below.

The research is supposed to

  • Contribute to generalizable knowledge that has the potential to improve and expand our understanding of the Wikimedia projects and their impact;
  • Identify and/or evaluate novel technical and socio-technical solutions that can enhance the technology or policy in support of the Wikimedia projects;
  • Inform important social or policy decisions that organized groups within the Wikimedia communities want to make.
  • [Create] datasets of importance for Wikimedia communities (including but not limited to Wikimedia research communities).

Notable findings


Daniel Baránek and Veronika Kršková compared the coverage of Wikidata with that of a Czech biographical dictionary. They found that more than a quarter of dictionary entries were missing from Wikidata (and likely from Wikipedia as well). Fascinatingly, further research showed that the gap reflected different notions of notability now and in the past. Many missing persons were principals and professors who played major roles during nationalist tensions in the late 19th and early 20th centuries.

Brett Buttliere, Matt Vetter and Sage Ross tried to solve the problem of low academic engagement on Wikipedia. They identified reasons why scholars do not edit Wikipedia: academic contributions to Wikipedia aren't measured and valued in the academic community and there is general skepticism about the reliability of Wikipedia. We all want more experts on Wikipedia, so it's good to have more data about the problem. See the Research Page for the solutions that the authors proposed and implemented.

Personally, I'd be very interested in the results of the AI tagging for Commons initiative, as well as in the two projects addressing the gender gap. Unfortunately, their results were unavailable as of October 18.

Gaps and concerns


While the Research Fund supports important work, several issues emerged from this batch:

  • Incomplete reporting: 4 out of 9 projects have not published results on Meta at the time of writing (October 2025), even though the grant period has ended.
  • Unpublished datasets: some projects that could benefit the community haven't shared their underlying data. For example, the biographical dictionaries comparison identified specific gaps in Wikidata coverage, but the dataset of missing entries hasn't been published (happy to be corrected).
  • Uncertain scholarly impact: the fund aims to support "high-quality and high-impact scholarship," but measuring impact is challenging, especially for research generating "generalizable knowledge" rather than artifacts that Wikipedians can use right away. As far as I can tell, none of these projects have yet resulted in peer-reviewed publications.

Table

Project name | Grant page | Research project page | Results available? (as of mid-October 2025) | Requested amount (USD)[supp 2] | Grant amount (USD)[supp 1] | Notes
Wikidata for the People of Africa | [1] | [2] | yes | 40,000 | 29,213.12 |
Development of a training program for teachers to use Wikipedia as a resource for collaborative learning and the development of skills for digital citizenship | [3] | [4] | no | 50,000 | 20,413.00 | Results expected in December 2025
Bridging the Gap Between Wikipedians and Scientists with Terminology-Aware Translation: A Case Study in Turkish | [5] | [6] | yes | 50,000 | 39,929.00 |
Wikimedia versus traditional biographical encyclopedias. Overlaps, gaps, quality and future possibilities | [7] | [8] | yes | 50,000 | 24,911.16 |
System Design for Increasing Adoption of AI-Assisted Image Tagging in Wikimedia Commons | [9] | [10] | no[supp 3] | 49,500 | 20,992.00 | Data collected by December 2024
Investigating Neurodivergent Wikimedian Experiences | [11] | [12] | yes | 22,000 | 24,250.00 | An open access publication is in the works
Developing Wikimedia Impact Metrics as a Sociotechnical Solution for Encouraging Funder/Academic Engagement | [13] | [14] | yes | 42,000 | 50,782.95 (?)[supp 4] |
Cover Women | [15] | [16] | no[supp 5] | 32,000 | 30,379.44 |
Addressing Wikipedia's Gender Gaps Through Social Media Ads | [17] | [18] | no | 30,000 | 29,816.33 | At the data collection phase as of October 2025

Other recent publications


Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

Compiled by Tilman Bayer

"WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia"


From the abstract:[2]

"Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs), such as hallucinations and outdated information.[...] we introduce WikiContradict, a benchmark consisting of 253 high-quality, human-annotated instances designed to assess LLM performance when augmented with retrieved passages containing real-world knowledge conflicts. We benchmark a diverse range of both closed and open-source LLMs [...] we also introduce an automated model that estimates LLM performance using a strong open-source language model, achieving an F-score of 0.8. Using this automated metric, we evaluate more than 1,500 answers from seven LLMs across all WikiContradict instances. To facilitate future work, we release WikiContradict on: [ibm.biz/wikicontradict]."

From the paper:

"[...] Wikipedia editors use a wide range of maintenance tags to flag problematic content for improvement. However, these maintenance tags are typically removed when creating Wikipedia datasets for LLM pre-training, which results in content with various quality issues being included in the pre-training process.
In this work, we focus on three tags that indicate content inconsistencies: inconsistent, self-contradictory, and contradict-other. The first two tags denote contradictory statements within the same article, whereas the third tag highlights instances where the content of one article contradicts that of another article. In total, we collect around 1,200 articles that contain these tags [...]


"Factual Inconsistencies in Multilingual Wikipedia Tables"


From the abstract:[3]

"Despite covering the same topics, the different [language] versions of Wikipedia are written and updated independently. This leads to factual inconsistencies that can impact the neutrality and reliability of the encyclopedia and AI systems, which often rely on Wikipedia as a main training source. This study investigates cross-lingual inconsistencies in Wikipedia's structured content, with a focus on tabular data. We developed a methodology to collect, align, and analyze tables from Wikipedia multilingual articles, defining categories of inconsistency. We apply various quantitative and qualitative metrics to assess multilingual alignment using a sample dataset. These insights have implications for factual verification, multilingual knowledge interaction, and design for reliable AI systems leveraging Wikipedia content."

From the paper:

"while English provides the most comprehensive coverage in terms of volume, German Wikipedia faces significant data quality challenges despite having substantial content"


"When Collaborative Maintenance Falls Short: The Persistence of Retracted Papers on Wikipedia"


From the abstract:[4]

"We construct a novel dataset that integrates Wikipedia revision histories with metadata from Retraction Watch, Crossref, Altmetric, and OpenAlex, identifying 1,181 citations of retracted papers. We find that 71.6% of all citations analyzed are problematic. These are citations added before a paper's retraction, as well as the citations introduced after retraction without any in-text mention of the paper's retracted status. Our analysis reveals that these citations persist for a median of over 3.68 years (1,344 days). Through survival analysis, we find that signals of human attention are associated with a faster correction process. Unfortunately, a paper's established scholarly authority, a higher academic citation count, is associated with a slower time to correction."

From the "Discussion" section:

"A key consideration is the role of automated tools, such as RetractionBot [25]. This bot exemplifies the specialized roles that automated agents play in Wikipedia’s quality control ecosystem [66]. It primarily serves an editorial audience. By systematically adding a prominent template to the reference section, the bot is highly effective at its specific task of signaling a source’s retracted status to editors engaged in verification and maintenance. [...] However, our work highlights a persistent gap between the effectiveness of automation for these specific, often editor-facing tasks and the challenges of repairing more nuanced, epistemic issues for a general reader. This distinction is key: while a bot can efficiently apply a “technical flag,” this action is distinct from the substantive, contextual repair required to update an article’s main text."

See also a related recent blog post by Egon Willighagen: "Retracted articles cited in Wikipedia"

"Automatically Estimating the Trustworthiness of Wikipedia Articles"


From the abstract:[5]

"We present a model to assess the trustworthiness of external sources based on manually annotated [English] Wikipedia articles. To do so, we analyze how often an external source was referenced in Wikipedia articles in which either a problem with reliability was identified or a previously identified problem was solved. From the frequency of the respective occurrences, we aim to draw conclusions about a positive or negative influence of the source on the trustworthiness of new Wikipedia articles. For this, we use the external sources referenced in a Wikipedia article to predict whether the article contains a reliability issue or not. First experiments show that our model is not able to reliably assess the trustworthiness of Wikipedia articles yet."

"Proper Noun Diacritization for Arabic Wikipedia: A Benchmark Dataset"


From the abstract:[6]

"Proper nouns in Arabic Wikipedia are frequently undiacritized, creating ambiguity in pronunciation and interpretation, especially for transliterated named entities of foreign origin. While transliteration and diacritization have been well-studied separately in Arabic NLP, their intersection remains underexplored. In this paper, we introduce a new manually diacritized dataset of Arabic proper nouns of various origins with their English Wikipedia equivalent glosses, and present the challenges and guidelines we followed to create it."

The authors evaluated GPT-4o on their benchmark, finding that it "performs reasonably well, especially on frequent names" in adding diacritics missing on Arabic Wikipedia, but "struggles with rarer entries and variant mappings."


"Reading between the lines with topic models and machine learning: Islam’s representation on Wikipedia"


From the abstract:[7]

"[...] we first construct a representative dataset on Islam using Wikipedia articles. Afterwards, we apply several topic modelling and machine learning based approaches on the newly constructed dataset to find representation of Islam on Wikipedia. Also, we design two algorithms based on word2vec to find the inter topic similarity and intra topic similarity for the topic models. The intra topic similarity algorithm agrees well with human judgment of topic resolution and coherence of topics. As topic models find the dominant topics prevailing in a natural language document corpus, the intra topic similarity algorithm can be used as a new metric to find the coherence of single topics within the topic model."

References

  1. Semnani, Sina; Burapacheep, Jirayu; Khatua, Arpandeep; Atchariyachanvanit, Thanawan; Wang, Zheng; Lam, Monica (November 2025). "Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large Language Models". In Christos Christodoulopoulos; Tanmoy Chakraborty; Carolyn Rose; et al. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. EMNLP 2025. Suzhou, China: Association for Computational Linguistics. pp. 34827–34854. ISBN 9798891763326. doi:10.18653/v1/2025.emnlp-main.1765.  / Data and code
  2. Hou, Yufang; Pascale, Alessandra; Carnerero-Cano, Javier; Tchrakian, Tigran; Marinescu, Radu; Daly, Elizabeth; Padhi, Inkit; Sattigeri, Prasanna (2024-06-19). "WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia". arXiv:2406.13805 [cs]. 
  3. Cappa, Silvia; Kong, Lingxiao; Peet, Pille-Riin; Wei, Fanfu; Zhou, Yuchen; Kalo, Jan-Christoph (2025-07-24). "Factual Inconsistencies in Multilingual Wikipedia Tables". arXiv:2507.18406 [cs]. 
  4. Shi, Haohan; Yu, Yulin; Romero, Daniel M.; Horvát, Emőke-Ágnes (2025-09-24). "When Collaborative Maintenance Falls Short: The Persistence of Retracted Papers on Wikipedia". arXiv:2509.18403 [cs]. 
  5. Grumbach, Luca-Philipp (2025-02-21). Automatically Estimating the Trustworthiness of Wikipedia Articles (PDF) (bachelor thesis). Friedrich-Schiller-Universität Jena.  / Presentation slides
  6. Bondok, Rawan; Nassar, Mayar; Khalifa, Salam; Micallef, Kurt; Habash, Nizar (2025-06-23). "Proper Noun Diacritization for Arabic Wikipedia: A Benchmark Dataset". arXiv:2505.02656 [cs]. 
  7. Khan, Sazid Zaman; As-ad, Jamil; Khaliluzzaman, Md; Anwar, Toni; Islam, Rashedul (2025-08-18). "Reading between the lines with topic models and machine learning: Islam’s representation on Wikipedia". Journal of Computational Social Science 8 (4): 89. ISSN 2432-2725. doi:10.1007/s42001-025-00415-6.  Closed access
Supplementary references and notes:
  1. The actually funded amount (as stated in a clarification provided by the WMF Research team)
  2. As per the "budget" number provided on the linked grant page. (According to a clarification from the WMF Research team, this is the amount requested in the grant proposal, rather than the actually funded budget.)
  3. Findings were published in late October (after the initial draft of the Signpost version of this report had been posted)
  4. Inferred
  5. Partial results were published in November (after the initial draft of the Signpost version of this report had been posted)
