Research:Understanding Wikidata's Value

Created: February 9, 2017
Contact: Andrew Hall
Collaborators: Aaron Halfaker
Duration: 2017-04 to 2017-06
This page documents a completed research project.


Wikidata and OpenStreetMap represent new peer production movements aimed at producing structured data. Wikidata has become an increasingly popular structured data source for sister-project applications such as Wikipedia infoboxes, as well as for third-party applications such as the Google Knowledge Graph. Although Wikidata is widely used, we know very little about how applications use it. To our knowledge, no studies have analyzed the usage and value that Wikidata provides to these applications. This is what we sought to do. Our fundamental research question was:

RQ: What Wikidata is important? That is, what Wikidata is used and what Wikidata is valuable?

Among other things, exploring this question and the characteristics of important Wikidata can help contributors create useful, high-quality Wikidata. For example, tools can indicate when important Wikidata is low-quality and needs to be improved. This initial exploration takes place in the context of Wikimedia projects.

Related Work

While work studying Wikidata has been generally limited, research studying edits to Wikidata items, properties, and statements has been even more so. Work by Steiner et al. and Müller-Birn et al. looked at methods of data definition and application (e.g. which properties or statements are bot-produced or human-produced)[1][2]. They found that bots create most of the data in Wikidata[2]. We are not aware, however, of work that has focused on aspects of Wikidata usage or importance in applications.

Methods

Data Preparation

We used entity (item and property) usage and page view data to conduct our analyses.

Entity usage data. Wikidata usage is recorded at the entity level. We extracted entity usages for all Wikimedia projects using Wikidata from database table dumps from May 1, 2017. There were 650 projects (including 304 Wikipedia versions) containing usages.

Page view data. Page views are the dominant means of determining the value of Wikipedia content[3]. We applied page views of Wikimedia project pages to the entities those pages referenced in order to generate a measurement of entity value. To do so, we calculated yearly page views for project pages. We chose a year-long timespan (June 8, 2016 to June 7, 2017) since a shorter timespan might contain biases towards pages that are popular at certain points of the year (e.g. a Wikipedia article on Christmas might be more popular towards the end of each year). As a contribution, we've released this page view dataset (https://analytics.wikimedia.org/datasets/one-off/pageview_rate/20170607/).
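The yearly aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the study: the record layout (project, page, day, views) and the function name are assumptions, and the sample rows are toy values.

```python
from collections import defaultdict
from datetime import date

# Window stated in the text: June 8, 2016 to June 7, 2017 (inclusive).
WINDOW_START = date(2016, 6, 8)
WINDOW_END = date(2017, 6, 7)


def yearly_page_views(daily_counts):
    """Sum daily view counts per (project, page) inside the window.

    daily_counts: iterable of (project, page, day, views) tuples.
    This layout is hypothetical; the released dataset may differ.
    """
    totals = defaultdict(int)
    for project, page, day, views in daily_counts:
        if WINDOW_START <= day <= WINDOW_END:
            totals[(project, page)] += views
    return dict(totals)


# Toy rows: a seasonal page gets most of its views in December.
sample_rows = [
    ("enwiki", "Christmas", date(2016, 12, 25), 900),
    ("enwiki", "Christmas", date(2017, 3, 1), 40),
    ("enwiki", "Christmas", date(2017, 6, 8), 10),  # outside the window, ignored
]
print(yearly_page_views(sample_rows))
```

Summing over a full year means the December peak and the quiet months both contribute, which is the seasonal bias the year-long window is meant to avoid.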

We combined our entity usage and page view data (removing data for the project "testwikidata") to create a dataset of entity usage and "value" for 22,250,021 entities. We've released this entity importance dataset (https://analytics.wikimedia.org/datasets/one-off/entity_usage/20170501/).
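As a sketch of how usage and page view data could be combined, the snippet below counts the pages that use each entity and sums those pages' views. The tuple layouts and function name are assumptions made for illustration, and the numbers are toy values rather than figures from the released dataset.

```python
from collections import defaultdict


def entity_importance(usages, page_views, excluded_projects=("testwikidata",)):
    """Return {entity_id: (page_usage_count, total_views)}.

    usages: iterable of (project, page, entity_id) tuples (hypothetical layout).
    page_views: {(project, page): yearly_views}.
    """
    pages_by_entity = defaultdict(set)
    for project, page, entity_id in usages:
        if project in excluded_projects:
            continue  # the text removes data for "testwikidata"
        pages_by_entity[entity_id].add((project, page))
    return {
        entity_id: (len(pages), sum(page_views.get(page, 0) for page in pages))
        for entity_id, pages in pages_by_entity.items()
    }


# Toy data: one item used on two projects, one property used on one page.
toy_usages = [
    ("enwiki", "Douglas Adams", "Q42"),
    ("dewiki", "Douglas Adams", "Q42"),
    ("enwiki", "Douglas Adams", "P214"),
    ("testwikidata", "Sandbox", "Q42"),
]
toy_views = {("enwiki", "Douglas Adams"): 1000, ("dewiki", "Douglas Adams"): 200}
toy_importance = entity_importance(toy_usages, toy_views)
print(toy_importance)
```

Each entity ends up with both importance measures discussed later: how many pages transclude it (usage) and how many views those pages received (value).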

Measuring Contributor Investment

To better understand the characteristics of important Wikidata, we sought to compare how entity importance relates to measures of the effort contributors invest in entities. To do so, we compared entity value with both entity quality and entity editing activity (number of revisions). In both cases we used a random sample of 100,000 items from our entity importance dataset.
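Drawing such a sample could look roughly like the sketch below. It assumes item IDs start with "Q" and property IDs with "P", and it uses a fixed seed for reproducibility; neither detail is stated in the text, and the population here is a toy stand-in for the entity importance dataset.

```python
import random


def sample_items(entity_ids, k, seed=0):
    """Draw up to k distinct items uniformly at random, without replacement."""
    # Assumption: items are the entities whose ID starts with "Q".
    items = [entity_id for entity_id in entity_ids if entity_id.startswith("Q")]
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return rng.sample(items, min(k, len(items)))


# Toy population: 1,000 items and two properties.
population = [f"Q{n}" for n in range(1, 1001)] + ["P31", "P214"]
sampled = sample_items(population, k=100)
print(len(sampled), sampled[:5])
```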

To get quality ratings for each entity, we used the Objective Revision Evaluation Service (ORES). ORES produced a quality score for the current state of each entity. A class "A" item is the highest quality, while a class "E" item is the lowest.

Results

We next analyzed our data. We provide descriptive statistics and discussion from our exploration of entity usage and value.

Entity Importance

We operationalized entity importance in two ways: 1) entity usage/transclusion and 2) entity value/actualized use. We discuss results for both.

Entity Transclusion (Usage)

Most entities are used on only one page and project. Usages were highly skewed such that most entities were used by few pages and on few projects. We found that 54% of the entities used were used on only one project page and that 95% were used on nine or fewer pages. 58% of entities were used by only one project. This number dropped to 8% for entities used by multiple pages.
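Skew statistics of this kind reduce to the share of entities at or below a usage threshold. The sketch below shows the calculation on a toy list of per-entity page usage counts; the function name and the data are illustrative only.

```python
def share_at_most(usage_counts, threshold):
    """Fraction of entities whose usage count is <= threshold."""
    counts = list(usage_counts)
    if not counts:
        raise ValueError("usage_counts must not be empty")
    return sum(1 for count in counts if count <= threshold) / len(counts)


# Toy per-entity page usage counts with a long tail.
toy_page_usages = [1, 1, 1, 1, 1, 2, 3, 5, 9, 40]
print(f"used on one page: {share_at_most(toy_page_usages, 1):.0%}")
print(f"used on nine or fewer pages: {share_at_most(toy_page_usages, 9):.0%}")
```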

Identifiers are the most transcluded entities. The item VIAF identifier was the most transcluded entity, with over 2.4 million project page usages. Other identifiers such as the items International Standard Name Identifier and SUDOC were also widely transcluded. This result makes sense considering the relative ease with which bots such as VIAFbot can automatically propagate large quantities of identifiers without a need for large amounts of human labor.

Gender bias in transclusion. The item for human male was transcluded 722,871 times. The item for human female was transcluded 151,783 times. This means the concept of a "human male" occurred on almost 5 times the number of project pages as "human female" did.
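The "almost 5 times" figure follows directly from the two transclusion counts reported above, as this short check shows.

```python
# Transclusion counts as reported in the text.
male_usages = 722_871
female_usages = 151_783

ratio = male_usages / female_usages
print(f"male-to-female transclusion ratio: {ratio:.2f}")
```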

Entity Actualized Use (Value)

Many entities provide little value. Data was highly skewed such that most entities provided little value. 5% of entities did not have page views. Further, nearly half of entities (47%) had 100 or fewer page views.

Mild correlation between entity value and entity usages. We sought to compare how entity page usages were associated with the value that the entity provides in aggregate across pages. We found a mild positive correlation (rho=.22, p<.001). If an entity is used on more pages, it's reasonable to assume that it would also provide more value across those pages in aggregate. Identifiers like "VIAF identifier" were still among the most valuable entities, but slightly less so. A number of properties were among the most valuable entities. The item Wikimedia main page was the most valuable entity, with 12.5 billion views. The property Commons category was the second most valuable, while the item human was the third. The item human male was the sixth most valuable entity, with 3 billion views, approximately 3 times the number of views of the human female item. However, the average page using the "human female" item had more views than the average page using the "human male" item.
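A back-of-the-envelope check shows why the per-page average can favor "human female" even though "human male" has more total views. The figures below are rounded from the text, and the female total is an assumption obtained by taking "approximately 3 times" as exactly 3, so the results are only indicative.

```python
# Rounded figures from the text; not exact dataset values.
male_views = 3_000_000_000       # "3 billion views"
male_pages = 722_871             # transclusion count reported earlier
female_views = male_views / 3    # assumption: "approximately 3 times" read as exactly 3
female_pages = 151_783           # transclusion count reported earlier

male_avg = male_views / male_pages
female_avg = female_views / female_pages
print(f"average views per page, human male:   {male_avg:,.0f}")
print(f"average views per page, human female: {female_avg:,.0f}")
```

Because the male item appears on almost 5 times as many pages but collects only about 3 times the views, its views are spread more thinly per page.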

We also sought to compare how entity project page usages were associated with the average value that the entity provides to the pages that transclude it. We found no correlation (rho=.00, p<.001).
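The reported rho values suggest a rank correlation such as Spearman's, although the text does not name the test; that is an assumption here. The sketch below computes Spearman's rho as the Pearson correlation of average ranks and applies it to both total and average views on toy data. It does not compute p-values.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy data: page usages, total views, and views averaged over using pages.
page_usages = [1, 2, 3, 10, 50]
total_views = [100, 150, 90, 2000, 5000]
average_views = [t / u for t, u in zip(total_views, page_usages)]
print("usages vs total views:  ", spearman_rho(page_usages, total_views))
print("usages vs average views:", spearman_rho(page_usages, average_views))
```

Dividing total views by the number of using pages removes the mechanical link between usage and aggregate value, which is why the two correlations can differ.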

How Entity Importance Relates to Entity Investment

Entity Quality versus Importance

According to ORES, 4 entities were rated class A, 2,786 were class B, 18,631 were class C, 23,916 were class D, and 54,663 were class E. The 4 class A entities were Yangtze, Mats Hummels, Chennai, and Ronald Reagan.
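Expressed as shares of the 100,000-item sample, the class counts above work out as follows. The counts come from the text; only the percentage calculation is added.

```python
# Quality class counts as reported in the text.
class_counts = {"A": 4, "B": 2_786, "C": 18_631, "D": 23_916, "E": 54_663}

total = sum(class_counts.values())
shares = {label: 100 * count / total for label, count in class_counts.items()}

print(f"total items: {total:,}")
for label, share in shares.items():
    print(f"class {label}: {share:.3f}%")
print(f"classes D and E combined: {shares['D'] + shares['E']:.1f}%")
```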

We found a mild, marginally significant, positive relationship between item quality and importance (rho=.14, p=.06). This is interesting since one might expect that items that are more valuable and viewed more would be edited in order to improve quality. Given this intuition, we thought it would be interesting to look at low-quality items with high value as well as high-quality items with low value. For the former, we found that the most viewed class "E" item was abstract being, with approximately 6.4 million views. A variety of other class "E" items also had a large number of views. As for high-quality items with low value, we found that all 4 of the class "A" items had more than 1 million page views. Thus, there were no high-quality items with extremely low numbers of views.
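Surfacing low-quality, high-value items amounts to filtering by quality class and sorting by views. The sketch below assumes a hypothetical (item_id, quality_class, page_views) record layout and uses toy records, not items from the study.

```python
def high_value_low_quality(items, quality_class="E", top_n=3):
    """Return the top_n most viewed items in the given quality class.

    items: iterable of (item_id, quality_class, page_views) tuples
    (a hypothetical layout chosen for this example).
    """
    matching = [item for item in items if item[1] == quality_class]
    return sorted(matching, key=lambda item: item[2], reverse=True)[:top_n]


# Toy records only.
toy_items = [
    ("item-a", "E", 2_500_000),
    ("item-b", "A", 1_200_000),
    ("item-c", "E", 50),
    ("item-d", "C", 900_000),
    ("item-e", "E", 75_000),
]
for item_id, quality, views in high_value_low_quality(toy_items, top_n=2):
    print(f"{item_id}: class {quality}, {views:,} views")
```

A list like this is the kind of output a contributor tool could use to point editors at items where improvements would reach the most readers.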

Entity Activity versus Importance

To measure activity, we used the total number of revisions for an entity. We found a mild positive relationship between entity activity and importance (rho=.18, p<.001). As with entity quality versus importance, this result is perhaps unexpected. It seems reasonable to assume that valuable entities would get more attention from contributors (and also become higher quality due to that attention). However, as our results show, this relationship is only weak.

Discussion

Wikidata aims to be a globally applicable repository of data that is relevant across language editions of Wikipedia. Contrary to this aim, our results show that Wikidata entities are, in fact, generally not global. Rather, entities are often locally applied to a single page and to a single project (e.g. only English Wikipedia). Perhaps there should be an increased effort in the community towards producing entities that are global. Additionally, it's possible that a large number of existing entities are globally applicable but not recognized as such. Exploration of the "global-ness" of entities would be interesting future work.

Our entity quality versus importance results exemplify the importance of contributor tools that could highlight poor quality entities with high value. Improvements made to these entities could have a large impact.

Future work

There are a large number of additional analyses we would like to perform.

Characteristics of Valuable Data

We'd like to compare the value of data produced by bots to the value of data produced by humans. One hypothesis could be that humans create less data with higher value whereas bots apply less directed, less valuable data in general.

Furthermore, perhaps valuable Wikidata are used in articles corresponding to certain WikiProjects or other domains within Wikipedia. Knowing where valuable data is applied could help direct editors towards creating valuable Wikidata content.

Work by Priedhorsky et al. has shown that a very small proportion of all editors produce most of the value in Wikipedia[3]. Determining who produces value in Wikidata could have implications for design; for example, perhaps mechanisms to "keep [power editors] happy", as Priedhorsky et al. suggested for Wikipedia.

Finally, we could also identify other characteristics of valuable Wikidata. Perhaps valuable statements are associated with properties that have high-quality descriptions or multiple examples instructing their application on their Wikidata pages.

Supply Versus Demand

Work in Wikipedia has shown a misalignment between the “supply and demand” of contributions[4]. Perhaps curation work on Wikidata is misaligned with client wiki usage.

It's also unclear how often templates and Lua modules attempt to access Wikidata that does not exist. By tracking which Wikidata *would have been* used had it existed when client pages were rendered, a value can be put on this missing data. In effect, we could define a set of Wikidata that could immediately provide value if applied.

Third Party Usage

While understanding how valuable Wikidata is to Wikimedia projects is important, we would like to take steps towards understanding how valuable Wikidata is to third parties such as Google. This is a challenging problem since Wikidata data (unlike Wikipedia data) does not need to be cited when used elsewhere. We'd like to work with the Wikimedia Foundation to formally propose mechanisms that can be used to track third-party data usage. For example, one potential way to log third-party Wikidata usage is via an HTTP logging endpoint (https://wikidata.org/usage-tracking/Q42/P123) that developers would be required or at least strongly recommended to use. Third-party developers could then benefit from increased curation activities from Wikidata editors on the entities/statements that are most used, increasing the apparent quality of Wikidata to third-party users.
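To make the proposal concrete, the sketch below shows how hits on such an endpoint might be tallied on the server side. The endpoint is only a proposal in this document and does not exist; the path pattern is inferred from the single example URL, and the function names are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumed path shape, inferred from the example URL: /usage-tracking/<item>/<property>
USAGE_PATH = re.compile(r"^/usage-tracking/(Q\d+)/(P\d+)$")


def parse_usage_hit(url):
    """Return (item_id, property_id) for a tracking URL, or None if it does not match."""
    match = USAGE_PATH.match(urlparse(url).path)
    return match.groups() if match else None


def tally_usage(urls):
    """Count hits per (item_id, property_id), skipping malformed URLs."""
    hits = (parse_usage_hit(url) for url in urls)
    return Counter(hit for hit in hits if hit is not None)


# Toy log lines; only the first URL form appears in the text.
toy_log = [
    "https://wikidata.org/usage-tracking/Q42/P123",
    "https://wikidata.org/usage-tracking/Q42/P123",
    "https://wikidata.org/usage-tracking/not-an-entity",
]
toy_tally = tally_usage(toy_log)
print(toy_tally.most_common(1))
```

Counts like these would give editors a ranked list of the statements third parties rely on most.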

Summary

Wikidata's importance (usage and value) to applications has not been studied in detail. We perform initial explorations in this space while also comparing how Wikidata importance aligns with effort invested by contributors. We find that most Wikidata has limited use and value and that contributor investment into entities has little relationship with data importance. We conclude with a discussion followed by future work we would like to perform in this space.

References

  1. Steiner, Thomas. 2014. “Bots vs. Wikipedians, Anons vs. Logged-Ins (redux): A Global Study of Edit Activity on Wikipedia and Wikidata.” In Proceedings of The International Symposium on Open Collaboration, 25. ACM.
  2. Müller-Birn, Claudia, Benjamin Karran, Janette Lehmann, and Markus Luczak-Rösch. 2015. “Peer-Production System or Collaborative Ontology Engineering Effort: What Is Wikidata?” In Proceedings of the 11th International Symposium on Open Collaboration, 20. ACM.
  3. Priedhorsky, Reid, Jilin Chen, Shyong Tony K. Lam, Katherine Panciera, Loren Terveen, and John Riedl. 2007. “Creating, Destroying, and Restoring Value in Wikipedia.” In Proceedings of the 2007 International ACM Conference on Supporting Group Work, 259–68. ACM.
  4. Warncke-Wang, Morten, Vivek Ranjan, Loren Terveen, and Brent Hecht. 2015. “Misalignment Between Supply and Demand of Quality Content in Peer Production Communities.” In ICWSM 2015: Ninth International AAAI Conference on Web and Social Media.