Research:Identifying Controversial Content in Wikidata

From Meta, a Wikimedia project coordination wiki
Tracked in Phabricator:
Task T287946
This page documents a completed research project.

As part of our efforts to improve knowledge integrity, we are working to help the community and affiliate groups better understand their projects, and to provide tools that address knowledge integrity issues.

In this project we aim to support the Wikidata community by creating a framework to identify content within that project that could be controversial. Identifying such content will help the community react early to potential conflicts, and help guide newcomers by showing them the different mechanisms that Wikidata has for managing these situations.


In order to identify content that requires special attention from admins and the community in general, we operationalize the idea of “controversial content” by testing different definitions. More specifically, we study:

  • Claims that use the “disputed by” qualifier.
  • Edits that are reverted.
  • Activity on talk pages.

Using these definitions, we perform a data analysis to gain insight into each of these scopes, and we test ML approaches to explore how predictable such situations are.

Additionally, we run an analysis of edit spikes to understand which Items on Wikidata are the most edited at a given point in time, and then perform a quantitative and qualitative analysis to understand the reasons behind those spikes.


The data used in this study comes from two main sources:

  • The mediawiki_history table: We use this data to understand the evolution of claims over time. By parsing the automatically generated edit summary of Wikibase, we are able to understand - among other things - when a specific property or label was edited in a given Item.
  • The Wikidata JSON dump: To analyze the usage of the “disputed by” qualifier, we used the snapshot from 2022-02-28.
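As an illustration of the edit-summary parsing described above, here is a minimal sketch in Python. It is not the project's code: the regular expression, the `parse_summary` helper, and the example summaries are assumptions based on the general shape of auto-generated Wikibase summaries (an action name and arguments inside `/* ... */`, followed by a link to the edited Property).

```python
import re

# Assumed shape of an auto-generated Wikibase summary, for example:
#   /* wbsetclaim-update:2||1 */ [[Property:P17]]: [[Q30]]
#   /* wbsetlabel-set:1|en */ Douglas Adams
SUMMARY_RE = re.compile(
    r"/\*\s*(?P<action>[a-z]+)(?:-(?P<subaction>[a-z-]+))?:"
    r"(?P<args>[^*]*?)\s*\*/\s*(?P<rest>.*)",
    re.S,
)
PROPERTY_RE = re.compile(r"\[\[Property:(P\d+)")

# Actions whose second argument is assumed to be a language code.
TERM_ACTIONS = {"wbsetlabel", "wbsetdescription", "wbsetaliases"}


def parse_summary(comment):
    """Return the action, sub-action, language and Property of a summary.

    Returns None when the comment does not look auto-generated.
    """
    match = SUMMARY_RE.match(comment or "")
    if match is None:
        return None
    args = match.group("args").split("|")
    action = match.group("action")
    language = None
    if action in TERM_ACTIONS and len(args) > 1 and args[1]:
        language = args[1]
    prop = PROPERTY_RE.search(match.group("rest"))
    return {
        "action": action,
        "subaction": match.group("subaction"),
        "language": language,
        "property": prop.group(1) if prop else None,
    }
```

Applied to every revision comment, a helper like this yields one row per edit, saying which Property or label was touched and how.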

Note that in both cases our code uses the Parquet versions of these tables, which are stored in the Wikimedia Data Lake. However, our experiments could easily be replicated by converting the public dumps to Parquet format, or by adapting the code to use the CSV and JSON files from those dumps. All the data used in this study is public, so our results are fully reproducible.

For this study we also want to differentiate edits made manually by editors from those made using automated tools. This can be difficult in Wikidata, because users commonly run automated tools from their personal accounts rather than from dedicated bot accounts. Since October 2021, Wikidata has added a special tag that indicates when an edit was made using the Wikidata interface. To overcome this issue, we use that tag as a proxy for “manual edits”. Results using that filter are indicated as “Wikidata-ui tag”.


Disputed By

As of 2022-02-28, just 1886 claims contained the “disputed by” qualifier. The most common Property using that qualifier was “country”, and those claims usually correspond to disputed territories. We also found some specific uses of this qualifier in medical content.

Top-20 properties using "statement disputed by"
Property Label Count
P17 "country" 607
P3355 "negative therapeutic predictor" 184
P3354 "positive therapeutic predictor" 139
P131 "located in the administrative territorial ent... 128
P31 "instance of" 102
P460 "said to be the same as" 52
P6216 "copyright status" 51
P39 "position held" 31
P170 "creator" 30
P3359 "negative prognostic predictor" 29
P569 "date of birth" 23
P40 "child" 22
P570 "date of death" 18
P50 "author" 17
P361 "part of" 16
P1196 "manner of death" 15
P509 "cause of death" 14
P22 "father" 13
P106 "occupation" 11
P1376 "capital of" 11
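A count like the one in the table above could be reproduced from the JSON dump along the following lines. This is a sketch under assumptions: it takes "statement disputed by" to be qualifier `P1310`, assumes the one-entity-per-line layout of the public JSON dump, and uses the hypothetical helper names `iter_entities` and `count_disputed_properties`. The original analysis used the Parquet version of the dump instead.

```python
import json
from collections import Counter

# Assumption: "statement disputed by" is the qualifier with this ID.
DISPUTED_BY = "P1310"


def iter_entities(lines):
    """Yield entities from dump lines (a JSON array, one entity per line)."""
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue  # skip the array brackets and blank lines
        yield json.loads(line)


def count_disputed_properties(entities, qualifier=DISPUTED_BY):
    """Count, per Property, the statements that carry the given qualifier."""
    counts = Counter()
    for entity in entities:
        for prop, statements in entity.get("claims", {}).items():
            for statement in statements:
                if qualifier in statement.get("qualifiers", {}):
                    counts[prop] += 1
    return counts
```

With a decompressed dump opened as a file object `f` (hypothetical), `count_disputed_properties(iter_entities(f)).most_common(20)` would give a ranking in the same form as the table.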

We also analyze the topic of the Items where this qualifier appears, finding that most of them are related to geography, confirming that this qualifier is especially used around territorial conflicts.

Top-5 topics containing the "statement disputed by" (notice that one article could belong to more than one topic)
Topic Number of articles
Geography.Regions 1729
Culture.Biography 283
Geography.Geographical 208
Culture.Media 204

Examples of "Disputed by" claims
Item Label Property Label Value "Disputed by"
"Barack Obama" "country of citizenship" "United States of America" "Barack Obama citizenship conspiracy theories"
"Hans Island" "country" "Greenland" "Canada"
"Star Trek VI: The Undiscovered Country" "part of" "Star Trek canon" "Gene Roddenberry"
"EGFR Amplification" "positive therapeutic predictor" "gefitinib" "Epidermal Growth Factor Receptor Gene Amplifi...


Our approach to studying reverts on Wikidata was to create an ML classifier that predicts whether a given edit will be reverted. To do this, we consider only edits carrying the wikidata-ui tag, and build a balanced dataset with 50% reverted edits and 50% non-reverted edits. We used three categories of features:

  • Content related: The type of edit done in a given revision (adding, updating or removing content), as well as the Property edited in that claim (if any).
  • Page (Item) related: The “age of the page”, i.e., the time since the Item was created, and the number of revisions on that page at the moment of the revision being analyzed.
  • User related: Similarly to the page related features, the “age of the account” and the total number of revisions made by that user.
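The dataset construction and the three feature groups above can be sketched as follows. All field names (`tags`, `reverted`, `page_created`, `user_registered`, and so on) and the helper names are hypothetical, chosen for illustration; the downsampling to a 50/50 split is one simple way to balance the classes and is not necessarily how the original notebook does it.

```python
import random


def is_manual(rev):
    """Keep only edits carrying the wikidata-ui tag (proxy for manual edits)."""
    return "wikidata-ui" in rev.get("tags", ())


def build_features(rev):
    """Turn one revision record into content, page and user features."""
    ts = rev["timestamp"]
    return {
        # Content related
        "edit_action": rev["action"],  # e.g. "add", "update", "remove"
        "property": rev.get("property"),
        # Page (Item) related
        "page_age_days": (ts - rev["page_created"]).days,
        "page_revision_count": rev["page_revision_count"],
        # User related
        "account_age_days": (ts - rev["user_registered"]).days,
        "user_revision_count": rev["user_revision_count"],
    }


def balance(revisions, seed=0):
    """Downsample so that reverted and non-reverted edits are 50/50."""
    reverted = [r for r in revisions if r["reverted"]]
    kept = [r for r in revisions if not r["reverted"]]
    n = min(len(reverted), len(kept))
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = rng.sample(reverted, n) + rng.sample(kept, n)
    rng.shuffle(sample)
    return sample
```

The feature dictionaries and the `reverted` labels from the balanced sample can then be fed to any standard classifier.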

Distribution of account age (in days) and reverted and not reverted revisions.

On a balanced dataset we obtained a precision of 75% when considering all the features, with the top three features being “account age”, “page revision count” and “page age”. When considering only page and user related features we still obtain 70% precision. These results suggest that content related features add little information, and that the probability of an edit being reverted is strongly related to user characteristics: more experienced users are less likely to be reverted than new ones. Therefore, using reverts as a proxy for controversies does not seem to say much about controversial content. Instead, it reflects community dynamics and the learning curve of using Wikidata.

You can find how to build the ML-model in this notebook.

Talk pages

We also studied the usage of talk pages as a proxy for controversies. Our intuition here was that controversial content could drive long discussions on talk pages. However, we noticed low usage of talk pages on Wikidata, especially when compared to Wikipedia.

Ratio of edits corresponding to talk pages per Project

For example, in October 2021 we found just 1331 revisions on talk pages in Wikidata, proportionally far fewer than in Wikipedia projects. However, when making these comparisons it is important to take into account that discussions on Wikipedia are sometimes about the wording or style used to describe a given content, while in Wikidata - given its structured nature - those conversations will be different.

Edit Spikes

Finally, we studied edit spikes on Wikidata. The intuition behind this approach is inspired by previous work on event detection on Wikipedia. First, we counted the number of distinct users editing each Item in a given period of time. Next, we compared that number with the total pageviews of the sitelinks of those Items. Our goal was to understand whether co-editing is related to exogenous or endogenous triggers. In other words, if the Wikipedia articles (site)linked to a given Item are receiving a lot of pageviews, the edits on that Item are likely related to exogenous factors (for example a movie release, or an event covered by the mass media). We noticed that all of the top-100 Items by number of editors were in the top decile of pageviews, implying that those edits were in some way driven by external factors. We then manually analyzed those top-100 Items, and found a clear connection between ongoing events, such as people receiving the Nobel Prize or a terror attack, and the Items with the most editors.
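The two steps described above, counting distinct editors per Item and checking whether the most co-edited Items fall in the top decile of pageviews, can be sketched as below. The input shapes (pairs of Item ID and user name, and a mapping from Item ID to summed sitelink pageviews) and the function names are assumptions made for illustration. The pageview mapping is assumed to cover all Items under comparison, not only the most edited ones.

```python
from collections import defaultdict


def distinct_editors(edits):
    """Map each Item ID to its number of distinct editors.

    edits: iterable of (item_id, user) pairs for the period of interest.
    """
    users = defaultdict(set)
    for item_id, user in edits:
        users[item_id].add(user)
    return {item_id: len(names) for item_id, names in users.items()}


def top_decile(pageviews):
    """Return the Items ranked in the top 10% by sitelink pageviews."""
    ranked = sorted(pageviews, key=pageviews.get, reverse=True)
    k = max(1, len(ranked) // 10)
    return set(ranked[:k])


def share_in_top_decile(editor_counts, pageviews, top_n=100):
    """Share of the top_n most co-edited Items that are in the top decile."""
    top_items = sorted(editor_counts, key=editor_counts.get, reverse=True)[:top_n]
    if not top_items:
        return 0.0
    popular = top_decile(pageviews)
    return sum(item in popular for item in top_items) / len(top_items)
```

A share of 1.0 for `top_n=100` would correspond to the observation that all top-100 Items were in the top decile of pageviews.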

Top-20 Items with the most editors in October 2021
Item Label (en) Distinct Editors Sitelinks Pageviews Reason for edits
Q317877 Abdulrazak Gurnah 46 74566233 nobel prize
Q109051038 Halyna Hutchins 42 239646554 death
Q106582931 Squid Game 41 2957646852 popular tv show
Q108782773 Pandora Papers 40 159200344 media story
Q380 Meta Platforms 37 321207744 company rename
Q1823418 Dmitry Muratov 36 54780456 nobel prize
Q60322501 Ardem Patapoutian 36 17170984 nobel prize
Q3675789 Syukuro Manabe 35 91714612 nobel prize
Q105572 Benjamin List 34 19214284 nobel prize
Q109370 Klaus Hasselmann 33 19505832 nobel prize
Q1174906 David Julius 31 11448472 nobel prize
Q64168538 Alexander Schallenberg 29 32763156 became new chancellor of austria
Q259646 David Amess 28 117482267 death
Q5237001 David MacMillan 25 6906796 nobel prize
Q108886413 Kongsberg attack 25 14453230 terror attack
Q6761526 Maria Ressa 23 14561562 nobel prize
Q1235614 Giorgio Parisi 23 17624052 nobel prize
Q359480 Fumio Kishida 22 113946424 became prime minister of japan
Q108793477 Frances Haugen 21 13486014 blew the whistle on facebook
Q28699137 HoYeon Jung 20 163268365 stars in squid game

While these events are not (necessarily) controversial, the fact that they are receiving edits from multiple users in a short period of time, and they receive a lot of public attention from readers too, might be a good indicator of Items that would require attention from administrators.


When analyzing the number of different users editing the same Item, we noticed a large proportion of Items with just one editor per month. For example, among articles edited in October 2021 in English Wikipedia, over 20% had more than one user contributing to them. In Wikidata, however, this share is only around 5%, with most of those edits going to Items related to ongoing events. Smaller projects such as Catalan Wikipedia have around 10% of articles edited by more than one user in that period. Although - as we already mentioned - these comparisons between Wikipedia and Wikidata could be unfair, it is important to consider the risks associated with content that is edited by a small number of users. While it is unlikely to find controversies on those Items, finding ways to make Items more collaborative could help improve the quality of, and trust in, that content.


We found that well-established controversies, such as territorial disputes between countries, are well managed by the “disputed by” qualifier. We also noticed that reverted revisions are associated more with user characteristics than with content-related ones. Item talk pages are not commonly used on Wikidata. More investigation is needed to understand the reasons behind this, but we speculate that more of these discussions happen on Property talk pages and in WikiProjects. Finally, we noticed a correlation between the most edited Items on Wikidata and ongoing events receiving attention on Wikipedia. Focusing admins' efforts on checking that content seems a promising strategy for dealing with potential issues on impactful content. However, it would also be important to dedicate effort to reviewing less popular content that is usually edited by individual users, without much community participation.

Limitations and Future Work

Although there could be several other definitions of “controversial content”, we consider the ones above as a first proxy for controversy. In future work we will explore other possible approaches.

It is also important to note that comparisons between Wikidata and Wikipedia need to be taken with a grain of salt. While in Wikipedia one revision (edit) can contain several changes, in Wikidata every edited claim is logged as one revision. This needs to be considered when comparing the number of edits on both platforms. In future work we would like to consider other approaches to counting edits, for example counting edit sessions - a set of edits done on a given claim or Property within a certain span of time - instead of single revisions. More research is needed to define such parameters, and this is something we would like to explore.
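To make the idea of edit sessions concrete, here is one possible sketch. It groups revisions by Item and Property, and starts a new session whenever the gap between consecutive edits exceeds a threshold. The one-hour default, the tuple layout and the function name `count_sessions` are assumptions for illustration only; as noted above, the right parameters are still an open question.

```python
from datetime import timedelta
from itertools import groupby


def count_sessions(edits, max_gap=timedelta(hours=1)):
    """Count edit sessions instead of single revisions.

    edits: iterable of (item_id, property_id, timestamp) tuples.
    Consecutive edits to the same Item and Property belong to the same
    session when they are at most max_gap apart (an assumed threshold).
    """
    def claim_key(edit):
        # Use "" when no Property was parsed, so that keys stay sortable.
        return (edit[0], edit[1] or "")

    ordered = sorted(edits, key=lambda e: (claim_key(e), e[2]))
    sessions = 0
    for _, group in groupby(ordered, key=claim_key):
        previous = None
        for _item, _prop, timestamp in group:
            if previous is None or timestamp - previous > max_gap:
                sessions += 1  # first edit, or gap too long: new session
            previous = timestamp
    return sessions
```

Under this definition, three quick corrections to the same claim count as one session, which brings Wikidata edit counts closer to the Wikipedia notion of a single revision containing several changes.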