Research:Identifying Controversial Content in Wikidata
As part of our efforts to improve knowledge integrity, we are working to help the community and affiliate groups better understand their projects, and to provide tools that address knowledge integrity issues.
In this project we aim to support the Wikidata community by creating a framework to identify content within that project that could be controversial. Identifying such content will help the community react early to potential conflicts, and will also help guide newcomers by showing them the different mechanisms that Wikidata has for managing this kind of situation.
To identify content that requires special attention from admins and the community in general, we operationalize the idea of “controversial content” by testing different definitions. More specifically, we study:
- Claims containing the “statement disputed by” qualifier.
- Content affected by “edit wars” or with multiple reverts.
- Items with large discussion on their talk pages.
Considering these definitions, we perform a data analysis to gain insight into these different scopes, and we test ML approaches to explore how predictable such situations are.
Additionally, we run an analysis of edit spikes to understand which Items are the most edited on Wikidata at a given point in time, and then perform a quantitative and qualitative analysis to understand the reasons behind those spikes.
The data used in this study comes from two main sources:
- The mediawiki_history table: We use this data to understand the evolution of claims over time. By parsing the automatically generated edit summary of Wikibase, we are able to understand - among other things - when a specific property or label was edited in a given Item.
- The Wikidata JSON dump: To analyze the usage of the “disputed by” qualifier, we used the snapshot from 2022-02-28.
Note that in both cases our code uses the Parquet versions of these tables, which are stored in the Wikimedia Data Lake. However, our experiments could easily be replicated by converting the public dumps to Parquet format, or by adapting the code to use the CSV and JSON files from those dumps. All the data used in this study is public, so our results are fully reproducible.
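The edit-summary parsing mentioned for the mediawiki_history table can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes the commonly seen Wikibase summary shape `/* action-subaction:args */ [[Property:Pxxx]]: value`, and the names `SUMMARY_RE`, `PROPERTY_RE` and `parse_summary` are hypothetical.

```python
import re

# Assumed shape of an auto-generated Wikibase summary, for example:
#   /* wbsetclaim-create:2||1 */ [[Property:P569]]: 4 August 1961
#   /* wbsetlabel-add:1|en */ Douglas Adams
SUMMARY_RE = re.compile(
    r"/\*\s*(?P<action>[a-z]+)(?:-(?P<subaction>[a-z-]+))?:"
    r"(?P<args>[^*]*?)\s*\*/\s*(?P<rest>.*)",
    re.S,
)
PROPERTY_RE = re.compile(r"\[\[Property:(P\d+)\]\]")


def parse_summary(comment):
    """Return the action, sub-action, arguments and edited Property, or None."""
    match = SUMMARY_RE.match(comment or "")
    if match is None:
        return None  # free-text summary, not auto-generated
    prop = PROPERTY_RE.search(match.group("rest"))
    return {
        "action": match.group("action"),
        "subaction": match.group("subaction"),
        "args": match.group("args").split("|"),
        "property": prop.group(1) if prop else None,
    }
```

Summaries that do not match the assumed pattern return `None`, so free-text comments can be filtered out rather than misclassified.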
For this study we also want to differentiate edits made manually by editors from those made using automated tools. This can be difficult in Wikidata, because users commonly run automated tools from their personal accounts, as opposed to dedicated bot accounts. To overcome this issue, we rely on a special tag that Wikidata has added since October 2021, which indicates when an edit was made using the Wikidata interface. We use this tag as a proxy for “manual edits”. Results using that filter are indicated as “Wikidata-ui tag”.
As of 2022-02-28, just 1886 claims contained the “disputed by” qualifier. The most common Property using that qualifier was “country”, and those claims usually correspond to disputed territories. We also found some specific uses of this qualifier in medical content.
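A minimal sketch of the tag-based filter is shown here. It assumes each revision is a dictionary with a `revision_tags` list, and that the tag is spelled `wikidata-ui`; both the record layout and the function name `is_manual_edit` are illustrative assumptions.

```python
UI_TAG = "wikidata-ui"  # assumed spelling of the tag


def is_manual_edit(revision):
    """True when the revision carries the interface tag (proxy for a manual edit)."""
    tags = revision.get("revision_tags") or []  # tolerate a missing or null field
    return UI_TAG in tags


def manual_edits(revisions):
    """Keep only the revisions tagged as made through the Wikidata interface."""
    return [rev for rev in revisions if is_manual_edit(rev)]
```

Since the tag only exists from October 2021 onwards, earlier revisions would all be dropped by this filter.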
|P3355||"negative therapeutic predictor"||184|
|P3354||"positive therapeutic predictor"||139|
|P131||"located in the administrative territorial ent...||128|
|P460||"said to be the same as"||52|
|P3359||"negative prognostic predictor"||29|
|P569||"date of birth"||23|
|P570||"date of death"||18|
|P1196||"manner of death"||15|
|P509||"cause of death"||14|
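Counts like the ones above can be derived from the JSON dump along these lines. This is a sketch under assumptions: it takes already-decoded entity dictionaries in the standard Wikidata JSON layout (`claims` keyed by Property, each statement with an optional `qualifiers` map) and assumes `P1310` is the ID of “statement disputed by”. The function name is hypothetical.

```python
from collections import Counter

DISPUTED_BY = "P1310"  # assumed Property ID of "statement disputed by"


def count_disputed_claims(entities):
    """Count, per Property, the statements carrying the disputed-by qualifier."""
    counts = Counter()
    for entity in entities:
        for prop, statements in entity.get("claims", {}).items():
            for statement in statements:
                # Qualifiers are keyed by the qualifier's Property ID.
                if DISPUTED_BY in statement.get("qualifiers", {}):
                    counts[prop] += 1
    return counts
```

Calling `count_disputed_claims(entities).most_common(10)` would then give a ranking of Properties by number of disputed claims.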
We also analyze the topics of the Items where this qualifier appears, finding that most of them are related to geography. This confirms that the qualifier is used especially around territorial conflicts.
|Topic||Number of articles|
|Item Label||Property Label||Value||"Disputed by"|
|"Barack Obama"||"country of citizenship"||"United States of America"||"Barack Obama citizenship conspiracy theories"|
|"Star Trek VI: The Undiscovered Country"||"part of"||"Star Trek canon"||"Gene Roddenberry"|
|"EGFR Amplification"||"positive therapeutic predictor"||"gefitinib"||"Epidermal Growth Factor Receptor Gene Amplifi...|
Our approach to studying reverts on Wikidata was to create an ML classifier that predicts whether a given edit will be reverted. To do this, we consider only edits containing the wikidata-ui tag, and build a balanced dataset with 50% edits that were reverted and 50% that were not. We used three categories of features:
- Content related: This considers the type of edit made in a given revision, whether it added, updated or removed content, as well as the Property edited in that claim (if any).
- Page (Item) related: Here we consider the “age of the page”, i.e., the time since the Item was created, and the number of revisions on that page at the moment of the revision being analyzed.
- User related: Similarly to the page-related features, here we consider the “age of the account” and the total number of revisions made by that user.
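The three feature groups and the 50/50 balancing step could look roughly like this. It is a sketch only: the record fields (`edit_type`, `page_created`, `user_registered` and so on) are hypothetical names, and random undersampling of the larger class is an assumed way of balancing, not necessarily the one used in the study.

```python
import random
from datetime import datetime


def extract_features(rev):
    """Build content-, page- and user-related features for one revision."""
    ts = rev["timestamp"]
    return {
        # Content related
        "edit_type": rev["edit_type"],          # e.g. "add", "update", "remove"
        "property": rev.get("property"),        # None when no claim was edited
        # Page (Item) related
        "page_age_days": (ts - rev["page_created"]).days,
        "page_revision_count": rev["page_revision_count"],
        # User related
        "account_age_days": (ts - rev["user_registered"]).days,
        "user_revision_count": rev["user_revision_count"],
    }


def balance(revisions, seed=0):
    """Undersample the larger class so reverted and kept edits are 50/50."""
    reverted = [r for r in revisions if r["reverted"]]
    kept = [r for r in revisions if not r["reverted"]]
    n = min(len(reverted), len(kept))
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = rng.sample(reverted, n) + rng.sample(kept, n)
    rng.shuffle(sample)
    return sample
```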
On a balanced dataset we obtained a precision of 75% when considering all the features, with the top three features being “account age”, “page revision count” and “page age”. When considering only page- and user-related features we still obtain 70% precision. These results suggest that content-related features add little information, and that the probability of an edit being reverted is strongly related to user characteristics: more experienced users are less likely to be reverted than new ones. Therefore, using reverts as a proxy for controversies does not seem to tell us much about controversial content; rather, it reflects community dynamics and the learning curve for using Wikidata.
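For reference, precision here is the share of edits predicted as reverted that were in fact reverted. The small helper below is an illustrative sketch (the function name is hypothetical); it treats “reverted” as the positive class.

```python
def precision(y_true, y_pred):
    """Precision = true positives / (true positives + false positives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p and t)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p and not t)
    return tp / (tp + fp) if (tp + fp) else 0.0
```

On a 50/50 balanced dataset, labelling every edit as reverted yields a precision of 0.5, which is the baseline the reported figures should be read against.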
You can find how to build the ML-model in this notebook.
Finally we studied the usage of talk pages as a proxy for controversies. Our intuition here was that controversial content could drive long discussions on talk pages. However, we noticed low usage of talk pages on Wikidata, especially when compared to Wikipedia.
Ratio of edits corresponding to talk pages per Project
For example, in October 2021 we found just 1331 revisions on talk pages in Wikidata, proportionally far fewer than in Wikipedia projects. However, when making these comparisons it is important to take into account that discussions on Wikipedia are sometimes about the wording or style used to describe a given piece of content, while in Wikidata - given its structured nature - those conversations will be different.
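The per-project ratio of talk-page edits can be sketched as below. It relies on the MediaWiki convention that talk namespaces have odd numbers, and assumes each revision record exposes a `page_namespace` integer. Whether the study counted all talk namespaces or only Item talk pages is not stated, so the `namespaces` parameter is an assumption that allows either reading.

```python
def is_talk_namespace(ns):
    """MediaWiki convention: talk namespaces are the odd, non-negative ones."""
    return ns >= 0 and ns % 2 == 1


def talk_edit_ratio(revisions, namespaces=None):
    """Share of revisions made on talk pages.

    If `namespaces` is given (e.g. {1}), only those namespaces count as talk;
    otherwise every odd namespace does.
    """
    total = 0
    talk = 0
    for rev in revisions:
        total += 1
        ns = rev["page_namespace"]
        if (ns in namespaces) if namespaces is not None else is_talk_namespace(ns):
            talk += 1
    return talk / total if total else 0.0
```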
Finally, we studied edit spikes on Wikidata. The intuition behind this approach is inspired by previous work on event detection on Wikipedia. First, we counted the number of different users editing each Item in a given period of time. Next, we compared that number with the total pageviews of the sitelinks of those Items. Our goal was to understand whether co-editing is related to exogenous or endogenous triggers. In other words, if the Wikipedia articles (site)linked to a given Item are receiving a lot of pageviews, the edits on that Item are likely related to exogenous factors (for example a movie release, or an event covered by the mass media). We noticed that all of the top 100 Items by number of editors were in the top decile of pageviews, implying that those edits were somehow driven by external factors. We then manually analyzed those top 100 Items and found a clear connection between ongoing events, such as people receiving the Nobel Prize or a terror attack, and the Items with the most editors.
|Item||Label (en)||Distinct Editors||Sitelinks Pageviews||Reason for edits|
|Q317877||Abdulrazak Gurnah||46||74566233||nobel prize|
|Q106582931||Squid Game||41||2957646852||popular tv show|
|Q108782773||Pandora Papers||40||159200344||media story|
|Q380||Meta Platforms||37||321207744||company rename|
|Q1823418||Dmitry Muratov||36||54780456||nobel prize|
|Q60322501||Ardem Patapoutian||36||17170984||nobel prize|
|Q3675789||Syukuro Manabe||35||91714612||nobel prize|
|Q105572||Benjamin List||34||19214284||nobel prize|
|Q109370||Klaus Hasselmann||33||19505832||nobel prize|
|Q1174906||David Julius||31||11448472||nobel prize|
|Q64168538||Alexander Schallenberg||29||32763156||became new chancellor of austria|
|Q5237001||David MacMillan||25||6906796||nobel prize|
|Q108886413||Kongsberg attack||25||14453230||terror attack|
|Q6761526||Maria Ressa||23||14561562||nobel prize|
|Q1235614||Giorgio Parisi||23||17624052||nobel prize|
|Q359480||Fumio Kishida||22||113946424||became prime minister of japan|
|Q108793477||Frances Haugen||21||13486014||blew the whistle on facebook|
|Q28699137||HoYeon Jung||20||163268365||stars in squid game|
While these events are not (necessarily) controversial, the fact that they are receiving edits from multiple users in a short period of time, and they receive a lot of public attention from readers too, might be a good indicator of Items that would require attention from administrators.
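The spike analysis described above (distinct editors per Item, then a check against the top decile of sitelink pageviews) can be sketched as follows. The input shapes, the function names and the nearest-rank definition of the 90th percentile are assumptions made for illustration.

```python
from collections import defaultdict


def distinct_editors(edits):
    """edits: iterable of (item_id, user) pairs -> {item_id: number of distinct users}."""
    editors = defaultdict(set)
    for item, user in edits:
        editors[item].add(user)
    return {item: len(users) for item, users in editors.items()}


def top_decile_threshold(pageviews):
    """Nearest-rank 90th percentile of the pageview totals."""
    values = sorted(pageviews.values())
    if not values:
        raise ValueError("no pageview data")
    rank = (9 * len(values) + 9) // 10  # integer ceil(0.9 * n)
    return values[rank - 1]


def spike_candidates(edits, pageviews, top_n=100):
    """Top-N Items by distinct editors, flagged when in the top pageview decile."""
    counts = distinct_editors(edits)
    threshold = top_decile_threshold(pageviews)
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return [(item, n, pageviews.get(item, 0) >= threshold) for item, n in ranked]
```

Items flagged `True` are the ones whose editing activity coincides with high reader attention, which is the pattern reported in the table above.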
When analyzing the number of different users editing the same Item, we noticed a large proportion of Items with just one editor per month. For example, among articles edited in October 2021 in English Wikipedia, over 20% had more than one user contributing to them. In Wikidata, however, this share is only around 5%, with most of those edits going to Items related to ongoing events. Smaller projects such as Catalan Wikipedia have around 10% of articles edited by more than one user in that period. Although - as we already mentioned - these comparisons between Wikipedia and Wikidata could be unfair, it is important to consider the risks associated with content that is edited by a small number of users. While it is unlikely that controversies will be found on those Items, finding ways to make Items more collaborative could help to improve the quality of, and trust in, that content.
We found that well-established controversies, such as territorial disputes between countries, are well managed by the “disputed by” qualifier. We also noticed that reverted revisions are associated with user characteristics more than with content-related ones. Item talk pages are not commonly used on Wikidata. More investigation is needed to understand the reasons behind this, but we speculate that more of these discussions happen on Property talk pages and in WikiProjects. Finally, we noticed a correlation between the most edited Items on Wikidata and ongoing events receiving attention on Wikipedia. Focusing admins' efforts on checking that content seems a promising strategy for dealing with potential issues on impactful content. However, it would also be important to dedicate effort to reviewing less popular content that is usually edited by individual users, without much community participation.
Limitations and Future Work
Although there could be several other definitions of “controversial content”, we consider the ones above a first proxy for controversiality. In future work we will explore other possible approaches.
It is also important to say that comparisons between Wikidata and Wikipedia need to be taken with a grain of salt. While in Wikipedia one revision (edit) can contain several changes, in Wikidata every claim edited is logged as one revision. This needs to be considered when comparing the number of edits on the two platforms. In future work we would like to consider other approaches to counting edits, for example counting edit sessions - sets of edits made on a given claim or Property within a certain span of time - instead of single revisions. More research is needed to define such parameters, and this is something we would like to explore.
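One possible way to count edit sessions instead of single revisions is sketched below. The one-hour gap is a purely hypothetical parameter (the text above notes that such parameters still need to be defined), and grouping by Item and Property is one reading of “a given claim or Property”.

```python
from datetime import datetime, timedelta


def count_sessions(edits, gap=timedelta(hours=1)):
    """Count edit sessions.

    edits: iterable of (item_id, property_id, timestamp) tuples.
    Consecutive edits to the same (Item, Property) pair belong to one session
    as long as they are no more than `gap` apart (hypothetical threshold).
    """
    last_seen = {}
    sessions = 0
    for item, prop, ts in sorted(edits, key=lambda e: e[2]):
        key = (item, prop)
        previous = last_seen.get(key)
        if previous is None or ts - previous > gap:
            sessions += 1  # first edit, or the gap was exceeded: new session
        last_seen[key] = ts
    return sessions
```

With this counting, a burst of revisions that refine the same claim within the gap collapses into a single session, which brings Wikidata counts closer to how multi-change revisions are counted on Wikipedia.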