Research:Reference Quality in English Wikipedia

Tracked in Phabricator:
Task T305888
00:03, 24 March 2022 (UTC)
Aitolkyn Baigutanova
Jaehyeon Myung
Changwook Jung
References, Knowledge Integrity, Disinformation
This page documents a completed research project.

Wikipedia plays a crucial role in the integrity of the Web. This work analyzes the reliability of this global encyclopedia through the lens of its references. We operationalize the notion of reference quality by defining reference need (RN), i.e., the percentage of sentences missing a citation, and reference risk (RR), i.e., the proportion of non-authoritative references. We release Citation Detective, a tool for automatically calculating the RN score, and find that the RN score has dropped by 20 percentage points in the last decade, with more than half of verifiable statements now accompanied by references. The RR score has remained below 1% over the years as a result of the community's efforts to eliminate unreliable references. We propose pairing novice and experienced editors on the same Wikipedia article as a strategy to enhance reference quality. Our quasi-experiment indicates that such a co-editing experience can result in a lasting advantage in identifying unreliable sources in future edits. As Wikipedia is frequently used as the ground truth for numerous Web applications, our findings and suggestions on its reliability can have a far-reaching impact. We discuss the possibility of other Web services adopting Wiki-style user collaboration to eliminate unreliable content.


Reference Need (RN)

We describe the Citation Detective tool, which uses machine learning to compute the RN score for a given article revision, in the full version of our paper (Baigutanova et al., 2023). The tool classifies all sentences in a revision and labels each sentence with a binary Citation Need (Redi et al., 2019) label 𝑦 according to the model output: 𝑦 = [ŷ], where [·] is the rounding function and ŷ is the output of the Citation Need model. When 𝑦 = 1, the sentence needs a citation; when 𝑦 = 0, it does not. We compute each revision's RN score by aggregating sentence-level Citation Need labels:

RN = 1 − (Σ_{𝑖 ∈ 𝑃} 𝑐_𝑖) / |𝑃|

where 𝑃 is the set of sentences needing citations in a given article revision, and 𝑐_𝑖 reflects the presence of a citation in the original text of sentence 𝑖: 𝑐_𝑖 = 1 if the sentence has an inline citation in the original text, and 𝑐_𝑖 = 0 otherwise.
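As a minimal sketch of this aggregation, the snippet below computes RN as the share of citation-needing sentences that lack an inline citation. It is not the Citation Detective implementation: the function name, the input format (one pair of model score and citation flag per sentence), and the 0.5 cut-off used to stand in for the rounding function are all assumptions made for illustration.

```python
from typing import Iterable, Tuple


def reference_need(sentences: Iterable[Tuple[float, bool]]) -> float:
    """Return the RN score of one revision.

    Each item is (y_hat, has_citation), where y_hat is an assumed model
    score in [0, 1] and has_citation tells whether the sentence carries
    an inline citation in the original text.
    """
    # y = 1 when the rounded model output says the sentence needs a citation.
    # Assumption: scores of exactly 0.5 round up.
    needing = [has_citation for y_hat, has_citation in sentences if y_hat >= 0.5]
    if not needing:
        raise ValueError("no sentence in this revision needs a citation")
    cited = sum(1 for has_citation in needing if has_citation)
    # RN is the fraction of citation-needing sentences without a citation.
    return 1.0 - cited / len(needing)


if __name__ == "__main__":
    revision = [(0.9, True), (0.8, False), (0.2, False), (0.7, False)]
    print(f"RN = {reference_need(revision):.3f}")
```

Sentences that the model labels as not needing a citation are ignored entirely, so adding uncited but non-verifiable sentences does not change the score.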

Reference Risk (RR)

The Wikipedia editing community maintains a classification of the reliability of sources that have been frequently questioned, known as the perennial sources list. Our research treats the blacklisted and deprecated categories of this classification as risky sources, as their use is generally prohibited. Using the public Wikipedia XML dumps, we ran a regular expression to extract risky references in article revisions. The revision's RR score is then computed as the proportion of unreliable references among all citations in that revision:

RR = 𝑥 / 𝑁

where 𝑁 is the total number of citations in a given revision and 𝑥 is the number of unreliable references. Revisions that do not include any reference are omitted from this analysis.
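The extraction step can be sketched as follows, assuming references appear as `<ref>...</ref>` tags in the revision wikitext and that RR is the number of unreliable references divided by the total number of citations. The regular expressions, the function names, and the placeholder domains are illustrative assumptions; they are not the patterns or the domain list used in the study.

```python
import re
from typing import Iterable, Optional
from urllib.parse import urlparse

# Hypothetical placeholder domains; a real run would load the deprecated and
# blacklisted entries of the perennial sources list instead.
RISKY_DOMAINS = {"risky-example.com", "deprecated-example.org"}

# Illustrative patterns: the body of a non-self-closing <ref> tag, and a bare URL.
REF_BODY = re.compile(r"<ref(?:\s[^>/]*)?>(.*?)</ref>", re.DOTALL | re.IGNORECASE)
URL = re.compile(r"https?://[^\s|\]<>}]+")


def domain_of(url: str) -> str:
    """Return the lower-cased host of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def reference_risk(wikitext: str,
                   risky_domains: Iterable[str] = RISKY_DOMAINS) -> Optional[float]:
    """Return x / N for one revision, or None if it has no reference."""
    risky = set(risky_domains)
    refs = REF_BODY.findall(wikitext)
    if not refs:
        return None  # revisions without references are omitted
    unreliable = 0
    for body in refs:
        domains = {domain_of(u) for u in URL.findall(body)}
        if domains & risky:
            unreliable += 1
    return unreliable / len(refs)


if __name__ == "__main__":
    text = (
        "A.<ref>{{cite web|url=https://www.risky-example.com/a|title=x}}</ref> "
        'B.<ref name="b">[http://example.org/p Page]</ref> '
        'C.<ref name="b" /> <references />'
    )
    print(reference_risk(text))
```

Matching on the host name rather than on the full URL keeps the check independent of the page path. This sketch only matches exact hosts, so subdomains other than `www.` would need additional handling.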


We built three datasets from the English edition of Wikipedia. (i) The Random dataset includes 3,177,963 revisions of 20K randomly sampled pages. (ii) The Top dataset includes 23,802,067 revisions of the 10K pages that received the highest total page views in the English Wikipedia within the analyzed period, as computed by Wikimedia's Pageviews API. Every revision is logged with the following metadata: revision id, timestamp, user id, prior revision count of the editing user, whether the user is anonymous, whether the user is a bot, page id, revision byte size difference compared to the prior revision, and whether the revision is minor. As the scope of this study is limited to understanding the role of human editors in maintaining the reference quality of Wikipedia articles, we filtered out edits made by bots in the further analysis.
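The per-revision metadata and the bot filter can be pictured with a small record type. This is only a sketch: the class name, the field names, and the helper function are hypothetical and do not reflect the actual schema of the released datasets.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Revision:
    """One logged edit; field names are illustrative assumptions."""
    revision_id: int
    timestamp: datetime
    user_id: Optional[int]      # assumed to be None for anonymous edits
    prior_revision_count: int   # edits made by this user before this one
    is_anonymous: bool
    is_bot: bool
    page_id: int
    byte_size_diff: int         # size change relative to the prior revision
    is_minor: bool


def human_revisions(revisions: Iterable[Revision]) -> List[Revision]:
    """Keep only revisions that were not made by bots."""
    return [r for r in revisions if not r.is_bot]


if __name__ == "__main__":
    sample = [
        Revision(1, datetime(2020, 1, 1), 10, 5, False, False, 100, 42, False),
        Revision(2, datetime(2020, 1, 2), 11, 900, False, True, 100, -3, True),
    ]
    print([r.revision_id for r in human_revisions(sample)])
```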

We built the third dataset to examine the lifespan of deprecated and blacklisted domains. We froze the date to January 2022 and obtained the history of all references to domains listed in the perennial sources list up to that point. (iii) The Reference History dataset consists of 4,203,467 occurrences of references, covering both references that are still present and references that have been removed. For each occurrence of a reference, the dataset records the page id, the timestamp when the reference was added, the timestamp when it was removed, the domain of the reference, the category of the domain, and, where applicable, the timestamp when the corresponding domain was classified as deprecated or blacklisted in the perennial sources list. The removal timestamp is blank if a reference was added but has not yet been removed.


Figure 1: Reference need (RN) scores gradually decreased over the last decade, indicating improved reference coverage of articles. The drop is nearly 20 percentage points over the decade.

Evolution of Reference Quality

Tracking the RN and RR scores allows us to examine how reference quality has evolved over the past decade. The evolution of the reference need score is shown in Figure 1.

First, the average reference need score per article went down gradually over the last ten years, dropping by around 20 percentage points in both the Top and Random datasets. This indicates that reference coverage has improved: more than 60% of citation-requiring sentences are now accompanied by a reference. The evolution of the reference risk score is shown in Figure 2.

Figure 2: Reference risk (RR) scores remain under 1% and show a decreasing trend in recent years, suggesting a reduction in the use of risky references after the introduction of the perennial sources list in 2018.

The risk score has remained below 1% throughout the analyzed period. While the score only started to decrease in 2018 for the Random dataset, the Top dataset saw a decline starting in 2016. The decrease in the RR score coincided with the introduction of the perennial sources list in 2018. This might suggest that the collaborative effort of Wikipedia editors enabled them to address newly registered non-authoritative sources, resulting in a decrease in the following years. We observe that the RR scores across the two datasets have increasingly diverged over the past few years.

Lifespan of Risky Sources

To explore the role of community-driven work in the evolution of reference quality, we examine whether classifying sources in the perennial sources list as "deprecated" or "blacklisted" motivates editors to remove existing risky references. We calculate the lifespan of risky references as the time elapsed between their addition and removal using the Reference History dataset. We analyze the lifespan of references within a year before and after their classification in the perennial sources list, as the list was established in 2018.

Figure 3 shows that the median lifespan of risky references (the number of days a reference survives before being removed by future edits) decreased by more than threefold once editors added them to the perennial sources list. Additionally, the lifespan of risky references at the 75th percentile decreased by approximately two months. These results indicate that the labeling of perennial sources encouraged editors to quickly remove references once they were labeled as unreliable. Among deprecated sources, there was no definite consensus regarding the domains of "Daily Mail" and "The Sun." Their status was the subject of multiple discussions, so they were excluded from our main analysis.
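The lifespan comparison can be sketched as below. The grouping rule is an assumption made for illustration: a reference counts as "before" if it was added within the year preceding its domain's classification and as "after" if it was added within the following year. References that have not yet been removed are skipped. The function names and the tuple layout are hypothetical and not taken from the Reference History dataset.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import Dict, Iterable, List, Optional, Tuple

# (added, removed, classified); removed is None while the reference still exists.
Occurrence = Tuple[datetime, Optional[datetime], datetime]


def lifespans_by_period(occurrences: Iterable[Occurrence],
                        window_days: int = 365) -> Dict[str, List[int]]:
    """Group lifespans (in days) by whether the reference was added in the
    window before or after its domain was classified as risky."""
    window = timedelta(days=window_days)
    groups: Dict[str, List[int]] = {"before": [], "after": []}
    for added, removed, classified in occurrences:
        if removed is None:
            continue  # still present, so no lifespan can be measured yet
        days = (removed - added).days
        if classified - window <= added < classified:
            groups["before"].append(days)
        elif classified <= added < classified + window:
            groups["after"].append(days)
    return groups


def median_lifespans(occurrences: Iterable[Occurrence]) -> Dict[str, float]:
    """Median lifespan per period, for periods with at least one removal."""
    groups = lifespans_by_period(occurrences)
    return {name: median(days) for name, days in groups.items() if days}


if __name__ == "__main__":
    classified = datetime(2018, 7, 1)
    sample = [
        (datetime(2018, 1, 1), datetime(2018, 4, 11), classified),
        (datetime(2018, 3, 1), datetime(2018, 3, 21), classified),
        (datetime(2018, 8, 1), datetime(2018, 8, 11), classified),
        (datetime(2018, 9, 1), None, classified),
    ]
    print(median_lifespans(sample))
```

Skipping references that are still present biases the measured lifespans downward, so a fuller analysis would need to account for such censored observations.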

Figure 3: The lifespan of unreliable sources a year before and after being added to the perennial sources list. Sources have a short lifespan on Wikipedia once marked as unreliable.


The RN index gradually decreased over the past decade, indicating that more citation-requiring sentences are now accompanied by references. This trend results from an increasing volume of community initiatives to improve citation coverage, including the exceptional work done by editors and the success of tools to ensure Wikipedia's verifiability. These efforts improve Wikipedia itself and, in turn, result in a higher-quality encyclopedia for humans and machines.

Our results may be considered a lower bound of the reference risk value because the perennial sources list only covers a small fraction of potentially unreliable sources. Unfortunately, using external reliability indexes and fact-checking systems is difficult in the Wikipedia context, given that existing lists are country-specific or not generic enough to cover the diversity of topics and sources. Creating a global index of source reliability would improve this estimate, support targeted interventions in specific content areas, and expose potential disinformation attacks from malicious users. Together with other efforts to build trust around the world, our scientific community could support such a global effort to improve and monitor the quality of Wikipedia's sources, which directly affects the services people use. Systems that automatically flag newly added unreliable sources could help editors monitor reference quality, and this paper provides a foundational methodology for building such support tools.


Find the full paper in


[Baigutanova et al., 2023] Aitolkyn Baigutanova, Jaehyeon Myung, Diego Saez-Trumper, Ai-Jou Chou, Miriam Redi, Changwook Jung, and Meeyoung Cha. 2023. Longitudinal assessment of reference quality on Wikipedia. In Proc. of the WWW.

[Redi et al., 2019] Miriam Redi, Besnik Fetahu, Jonathan Morgan, and Dario Taraborelli. 2019. Citation needed: A taxonomy and algorithmic assessment of Wikipedia's verifiability. In Proc. of the WWW.

Related Projects