Research:Identifying Controversial Content in Wikidata

Tracked in Phabricator:
Task T287946

Created

01:59, 24 September 2021 (UTC)

Contact

Diego Sáez Trumper

Wikimedia Foundation

Collaborators

Lydia Pintscher

Wikimedia Deutschland

Wikidata

Open source
via WMF Research Gitlab

Open data
via Wikimedia Data Lake


This page documents a completed research project.

As part of our efforts to improve knowledge integrity, we are working to help the community and affiliate groups better understand their projects, and to provide tools for addressing knowledge integrity issues.

In this project we aim to support the Wikidata community by creating a framework for identifying content within that project that could be controversial. Identifying such content will help the community react early to potential conflicts, and guide newcomers by showing them the mechanisms Wikidata has for managing these situations.

Methods

To identify content that requires special attention from admins and the community in general, we operationalize the idea of “controversial content” by testing different definitions. More specifically, we study:

Claims containing the “statement disputed by” qualifier.
Content affected by “edit wars” or with multiple reverts.
Items with large discussion on their talk pages.

Using these definitions, we perform a data analysis to gain insight into each of these scopes, and we test ML approaches to explore how predictable such situations are.

Additionally, we run an analysis of edit spikes to understand which Items on Wikidata are the most edited at a given point in time, and then perform a quantitative and qualitative analysis to understand the reasons behind those spikes.

Data

The data used in this study comes from two main sources:

The mediawiki_history table: We use this data to understand the evolution of claims over time. By parsing the edit summaries that Wikibase generates automatically, we can determine, among other things, when a specific property or label was edited in a given Item.
The Wikidata JSON dump: To analyze the usage of the “disputed by” qualifier, we used the snapshot from 2022-02-28.

Note that in both cases our code uses the Parquet versions of these tables, which are stored in the Wikimedia Data Lake. However, our experiments could easily be replicated by converting the public dumps to Parquet or by adapting the code to read the CSV and JSON files from those dumps. All the data used in this study is public, so our results are fully reproducible.
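The edit-summary parsing mentioned above can be illustrated with a minimal sketch. It assumes that automatically generated summaries have the shape `/* action-subaction:count|language */ [[Property:Pxx]]: value`; the regular expression, the function name `parse_summary` and the returned keys are illustrative assumptions, not the code used in the project.

```python
import re
from typing import Optional

# Assumed layout of an autogenerated Wikibase summary:
#   /* wbsetclaim-create:2||1 */ [[Property:P31]]: [[Q5]]
#   /* wbsetlabel-set:1|en */ Some label
SUMMARY_RE = re.compile(
    r"^/\*\s*(?P<action>[a-z]+)(?:-(?P<sub>[a-z-]+))?:(?P<args>[^*]*?)\s*\*/\s*(?P<rest>.*)$",
    re.DOTALL,
)
PROPERTY_RE = re.compile(r"\[\[Property:(P\d+)")

# Actions whose second argument is assumed to be a language code.
LANGUAGE_ACTIONS = {"wbsetlabel", "wbsetdescription", "wbsetaliases"}


def parse_summary(comment: Optional[str]) -> Optional[dict]:
    """Return the action, subaction, language and Property of a summary.

    Returns None for free-text comments that do not follow the assumed layout.
    """
    match = SUMMARY_RE.match(comment or "")
    if match is None:
        return None
    action = match.group("action")
    args = match.group("args").split("|")
    language = None
    if action in LANGUAGE_ACTIONS and len(args) > 1 and args[1]:
        language = args[1]
    prop = PROPERTY_RE.search(match.group("rest"))
    return {
        "action": action,
        "subaction": match.group("sub"),
        "language": language,
        "property": prop.group(1) if prop else None,
    }
```

Applied to the revision comments in mediawiki_history, a function like this would yield one record per revision that can then be grouped by Item and Property.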

For this study we also want to distinguish edits made manually by editors from those made with automated tools. This can be difficult in Wikidata, because users commonly run automated tools from their personal accounts rather than from dedicated bot accounts. To address this, since October 2021 Wikidata has added a special tag that indicates when an edit was made through the Wikidata interface. We use this tag as a proxy for “manual edits”. Results using that filter are labeled “Wikidata-ui tag”.
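A minimal sketch of this filter follows. It assumes each revision is a dictionary whose tags are stored in a list under the key `revision_tags`; that key name and the helper names are assumptions made for illustration.

```python
MANUAL_TAG = "wikidata-ui"  # tag used in this study as a proxy for manual edits


def is_manual_edit(revision: dict, tag: str = MANUAL_TAG) -> bool:
    """True when the revision carries the tag; missing or empty tags count as not manual."""
    return tag in (revision.get("revision_tags") or [])


def split_by_manual(revisions):
    """Split revisions into (manual, other) lists according to the tag."""
    manual, other = [], []
    for revision in revisions:
        (manual if is_manual_edit(revision) else other).append(revision)
    return manual, other
```

Because the tag only exists since October 2021, earlier revisions would all fall into the second list, so a filter like this is only meaningful for later periods.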

Results

Disputed By

As of 2022-02-28, just 1886 claims contained the “disputed by” qualifier. The Property most commonly using that qualifier was “country”, and those claims usually correspond to disputed territories. We also found some specific uses of this qualifier on medical content.

Top-20 properties using "statement disputed by"
prop	label	count
P17	"country"	607
P3355	"negative therapeutic predictor"	184
P3354	"positive therapeutic predictor"	139
P131	"located in the administrative territorial ent...	128
P31	"instance of"	102
P460	"said to be the same as"	52
P6216	"copyright status"	51
P39	"position held"	31
P170	"creator"	30
P3359	"negative prognostic predictor"	29
P569	"date of birth"	23
P40	"child"	22
P570	"date of death"	18
P50	"author"	17
P361	"part of"	16
P1196	"manner of death"	15
P509	"cause of death"	14
P22	"father"	13
P106	"occupation"	11
P1376	"capital of"	11
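Counts like those in the table above could be produced from the JSON dump with a sketch along these lines. It assumes the standard Wikidata entity layout, in which `claims` maps each Property ID to a list of statements and each statement may carry a `qualifiers` mapping, and it assumes P1310 is the ID of “statement disputed by”. The function names are illustrative.

```python
from collections import Counter

DISPUTED_BY = "P1310"  # assumed ID of the "statement disputed by" qualifier


def disputed_claims(entity: dict):
    """Yield (property_id, statement) for statements qualified with DISPUTED_BY."""
    for property_id, statements in entity.get("claims", {}).items():
        for statement in statements:
            if DISPUTED_BY in statement.get("qualifiers", {}):
                yield property_id, statement


def count_disputed_properties(entities) -> Counter:
    """Count, per Property, the claims carrying the qualifier across all entities."""
    counts = Counter()
    for entity in entities:
        for property_id, _ in disputed_claims(entity):
            counts[property_id] += 1
    return counts
```

Calling `count_disputed_properties(entities).most_common(20)` on an iterator over the dump's entities would give a ranking in the same form as the table.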

We also analyzed the topics of the Items where this qualifier appears and found that most of them are related to geography, confirming that this qualifier is used mainly around territorial conflicts.

Top-5 topics containing the "statement disputed by" (notice that one article could belong to more than one topic)
Topic	Number of articles
Geography.Regions	1729
Culture.Biography	283
Geography.Geographical	208
Culture.Media	204
STEM.STEM*	139

Examples of "Disputed by" claims
Item Label	Property Label	Value	"Disputed by"
"Barack Obama"	"country of citizenship"	"United States of America"	"Barack Obama citizenship conspiracy theories"
"Hans Island"	"country"	"Greenland"	"Canada"
"Star Trek VI: The Undiscovered Country"	"part of"	"Star Trek canon"	"Gene Roddenberry"
"EGFR Amplification"	"positive therapeutic predictor"	"gefitinib"	"Epidermal Growth Factor Receptor Gene Amplifi...

Reverts

Our approach to studying reverts on Wikidata was to build an ML classifier that predicts whether a given edit will be reverted. To do this, we consider only edits carrying the wikidata-ui tag and build a balanced dataset in which 50% of the edits were reverted and 50% were not. We used three categories of features:

Content related: The type of edit made in a given revision (adding, updating or removing content), as well as the Property edited in that claim (if any).
Page (Item) related: The “age of the page”, i.e., the time since the Item was created, and the number of revisions of that page at the moment of the revision being analyzed.
User related: As with the page-related features, the “age of the account” and the total number of revisions made by that user.
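The three feature groups and the 50/50 balancing can be sketched as follows. This is a minimal illustration, not the notebook's code: the input keys (`timestamp`, `page_created`, `user_registered`, and so on), the function names, and the choice to balance by downsampling the larger class are all assumptions.

```python
import random


def extract_features(revision: dict) -> dict:
    """Build content, page and user features for one revision.

    Timestamps are expected to be datetime objects.
    """
    ts = revision["timestamp"]
    return {
        # Content related
        "edit_type": revision["edit_type"],  # e.g. "add", "update", "remove"
        "property": revision.get("property"),  # None when no claim was edited
        # Page (Item) related
        "page_age_days": (ts - revision["page_created"]).days,
        "page_revision_count": revision["page_revision_count"],
        # User related
        "account_age_days": (ts - revision["user_registered"]).days,
        "user_revision_count": revision["user_revision_count"],
    }


def balance(revisions, seed: int = 0):
    """Downsample the larger class so reverted and kept edits are 50/50."""
    reverted = [r for r in revisions if r["reverted"]]
    kept = [r for r in revisions if not r["reverted"]]
    n = min(len(reverted), len(kept))
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = rng.sample(reverted, n) + rng.sample(kept, n)
    rng.shuffle(sample)
    return sample
```

The resulting feature dictionaries could be fed to any standard classifier after encoding the categorical fields (`edit_type`, `property`).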

On a balanced dataset we obtained a precision of 75% when considering all the features, with the top three features being “account age”, “page revision count” and “page age”. When considering only page- and user-related features we still obtain 70% precision. These results suggest that content-related features add little information, and that the probability of an edit being reverted is closely related to user characteristics: more experienced users are less likely to be reverted than new ones. Therefore, using reverts as a proxy for controversies does not seem to tell us much about controversial content; rather, it reflects community dynamics and the learning curve of using Wikidata.

You can find how to build the ML-model in this notebook.

Talk pages

We also studied the usage of talk pages as a proxy for controversies. Our intuition here was that controversial content could drive long discussions on talk pages. However, we noticed low usage of talk pages on Wikidata, especially when compared to Wikipedia.

Ratio of edits corresponding to talk pages per Project

For example, in October 2021 we found just 1331 revisions on Talk pages in Wikidata, proportionally far fewer than in Wikipedia projects. However, when making these comparisons it is important to take into account that discussions on Wikipedia are sometimes about the wording or style used to describe a given piece of content, while in Wikidata, given its structured nature, those conversations will be different.
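The per-project ratio can be computed with a sketch like the one below. It assumes each revision record exposes the project under `wiki_db` and the namespace number under `page_namespace`, and that talk pages live in namespace 1; these names and the default namespace set are assumptions for illustration.

```python
from collections import Counter


def talk_ratio_per_project(revisions, talk_namespaces=frozenset({1})) -> dict:
    """Return, per project, the share of revisions made in talk namespaces."""
    total = Counter()
    talk = Counter()
    for revision in revisions:
        project = revision["wiki_db"]
        total[project] += 1
        if revision["page_namespace"] in talk_namespaces:
            talk[project] += 1
    return {project: talk[project] / total[project] for project in total}
```

Passing a larger set as `talk_namespaces` (for example, one that also includes Property talk pages) would change what counts as a discussion, which matters for the comparison discussed above.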

Edit Spikes

Finally, we studied edit spikes on Wikidata. This approach is inspired by previous work on event detection on Wikipedia. First, we counted the number of distinct users editing each Item in a given period of time. Next, we compared that number with the total pageviews of the sitelinks of those Items. Our goal was to understand whether co-editing is related to exogenous or endogenous triggers. In other words, if the Wikipedia articles (site)linked to a given Item are receiving a lot of pageviews, the edits on that Item are likely related to exogenous factors (for example a movie release, or an event covered by the mass media). We noticed that all of the top-100 Items by number of editors were in the top decile of pageviews, implying that those edits were somehow driven by external factors. We then manually analyzed those top-100 Items and found a clear connection between ongoing events, such as people receiving the Nobel Prize or a terror attack, and the Items with the most editors.

Top-20 Items with the most editors in October 2021
Item	Label (en)	Distinct Editors	Sitelinks Pageviews	Reason for edits
Q317877	Abdulrazak Gurnah	46	74566233	Nobel Prize
Q109051038	Halyna Hutchins	42	239646554	death
Q106582931	Squid Game	41	2957646852	popular TV show
Q108782773	Pandora Papers	40	159200344	media story
Q380	Meta Platforms	37	321207744	company rename
Q1823418	Dmitry Muratov	36	54780456	Nobel Prize
Q60322501	Ardem Patapoutian	36	17170984	Nobel Prize
Q3675789	Syukuro Manabe	35	91714612	Nobel Prize
Q105572	Benjamin List	34	19214284	Nobel Prize
Q109370	Klaus Hasselmann	33	19505832	Nobel Prize
Q1174906	David Julius	31	11448472	Nobel Prize
Q64168538	Alexander Schallenberg	29	32763156	became new chancellor of Austria
Q259646	David Amess	28	117482267	death
Q5237001	David MacMillan	25	6906796	Nobel Prize
Q108886413	Kongsberg attack	25	14453230	terror attack
Q6761526	Maria Ressa	23	14561562	Nobel Prize
Q1235614	Giorgio Parisi	23	17624052	Nobel Prize
Q359480	Fumio Kishida	22	113946424	became prime minister of Japan
Q108793477	Frances Haugen	21	13486014	blew the whistle on Facebook
Q28699137	HoYeon Jung	20	163268365	stars in Squid Game
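The counting step behind this table can be sketched as follows. The sketch assumes edits arrive as `(item_id, user)` pairs and that pageviews have already been summed over each Item's sitelinks into a dictionary; the function names and the definition of the top decile (the top 10% of Items ranked by pageviews) are illustrative assumptions.

```python
from collections import defaultdict


def distinct_editors(edits) -> dict:
    """Map each Item to its number of distinct editors."""
    editors = defaultdict(set)
    for item_id, user in edits:
        editors[item_id].add(user)
    return {item_id: len(users) for item_id, users in editors.items()}


def top_decile(pageviews: dict) -> set:
    """Return the Items in the top 10% by pageviews (at least one Item)."""
    ranked = sorted(pageviews, key=pageviews.get, reverse=True)
    return set(ranked[: max(1, len(ranked) // 10)])


def most_edited_in_top_decile(edits, pageviews: dict, k: int = 100) -> list:
    """For the k Items with the most distinct editors, report whether
    each one also falls in the top decile of sitelink pageviews."""
    counts = distinct_editors(edits)
    top_items = sorted(counts, key=counts.get, reverse=True)[:k]
    popular = top_decile(pageviews)
    return [(item, counts[item], item in popular) for item in top_items]
```

If every tuple returned for `k=100` ends in `True`, the most co-edited Items are all among the most viewed ones, which is the pattern reported above.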

While these events are not (necessarily) controversial, the fact that they are receiving edits from multiple users in a short period of time, and they receive a lot of public attention from readers too, might be a good indicator of Items that would require attention from administrators.

Discussion

When analyzing the number of distinct users editing the same Item, we noticed a large proportion of Items with just one editor per month. For example, among English Wikipedia articles edited in October 2021, over 20% had more than one contributing user. In Wikidata, however, this share is only around 5%, with most of those edits going to Items related to ongoing events. Smaller projects such as Catalan Wikipedia have around 10% of articles edited by more than one user in that period. Although, as already mentioned, these comparisons between Wikipedia and Wikidata could be unfair, it is important to consider the risks associated with content that is edited by a small number of users. While controversies are unlikely on those Items, finding ways to make Items more collaborative could help improve the quality of and trust in that content.

Conclusion

We found that well-established controversies, such as territorial disputes between countries, are well managed by the “disputed by” qualifier. We also noticed that reverted revisions are associated more with user characteristics than with content-related ones. Item talk pages are not commonly used on Wikidata. More investigation is needed to understand the reasons behind this, but we speculate that more of these discussions happen on Property talk pages and in WikiProjects. Finally, we noticed a correlation between the most edited Items on Wikidata and ongoing events receiving attention on Wikipedia. Focusing admins' efforts on checking that content seems a promising strategy for dealing with potential issues on impactful content. However, it would also be important to dedicate effort to reviewing less popular content, which is usually edited by individual users without much community participation.

Limitations and Future Work

Although there could be several other definitions of “controversial content”, we treat the ones studied here as a first proxy for controversy. In future work we will consider exploring other possible approaches.

Comparisons between Wikidata and Wikipedia also need to be taken with a grain of salt. While in Wikipedia one revision (edit) can contain several changes, in Wikidata every edited claim is logged as one revision; this needs to be considered when comparing the number of edits on the two platforms. In future work we would like to consider other approaches to counting edits, for example counting edit sessions (a set of edits made to a given claim or Property within a certain span of time) instead of single revisions. More research is needed to define such parameters, and this is something we would like to explore.
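One possible way to count edit sessions is sketched below. Since the parameters are still undefined, everything here is an assumption: edits are `(user, item_id, property_id, timestamp)` tuples, a session groups consecutive edits by the same user on the same Item and Property, and a new session starts whenever the gap to the previous edit exceeds a hypothetical `max_gap` of one hour.

```python
from datetime import timedelta


def count_sessions(edits, max_gap: timedelta = timedelta(hours=1)) -> int:
    """Count edit sessions instead of single revisions.

    edits: iterable of (user, item_id, property_id, timestamp) tuples,
    where timestamp is a datetime.
    """
    last_seen = {}  # (user, item_id, property_id) -> timestamp of the last edit
    sessions = 0
    for user, item_id, property_id, ts in sorted(edits, key=lambda e: e[3]):
        key = (user, item_id, property_id)
        previous = last_seen.get(key)
        # A first edit, or one arriving after a long pause, opens a new session.
        if previous is None or ts - previous > max_gap:
            sessions += 1
        last_seen[key] = ts
    return sessions
```

Varying `max_gap`, or dropping the Property from the grouping key, would show how sensitive the resulting counts are to the session definition.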