User:Hall1467/Understanding Wikidata's Value (WMDE Proposal)
We propose making Wikidata usage tracking more granular. Current usage tracking works only at the entity level, not at the statement level. However, applications using Wikidata often use only specific statements of an entity. We'd like to work towards tracking statement usage to understand which statements are used and therefore provide value. In the near term, as part of this contract, we want to do two things:
- Analyze the database I/O load of tracking statements (tracking at this finer granularity will produce a larger load than the current entity-level tracking)
- Analyze current entity usage
There are a few details of the contract that we would like to specify:
- Aaron Halfaker, Principal Research Scientist at the Wikimedia Foundation, would manage Andrew's day-to-day work (as he is currently doing with Andrew's WMF contract).
- To work towards deploying statement tracking, we need time from the Wikidata product team to review patch sets and engineer software. We have already been working with Marius Hoch for the past several months.
- The contract would start the week of July 9 and continue full-time through the week of September 3. Andrew would be able to work part-time afterwards in order to finish up the contract if need be.
Deliverables
- Analysis of statement tracking database I/O load. We will complete a feasibility analysis for deployment of statement tracking to all or some wikis.
- Analysis of current entity usage. We will analyze characteristics of valuable entity data.
Specifics for each deliverable are listed below.
Analysis of Statement Tracking Database I/O Load
We will deliver a feasibility analysis for the deployment of statement tracking to all or some wikis. This would report what we would expect to happen in terms of database I/O load if statement tracking were to be deployed. Depending upon the results of intermediate stages of our analysis, if the I/O load were determined to be too large for the MySQL database where entity tracking currently occurs, we would investigate and recommend other options such as Hadoop and RESTBase.
We already have some of the feasibility analysis completed. We've proposed a table schema and have also begun explorations into Hadoop and RESTBase.
Analysis of Current Entity Usage
We have identified several research directions related to Wikidata entity value in the context of Wikipedia/other Wikimedia projects. The results of our explorations should have important implications for Wikidata and its contributors as well as to the design of their tools.
Page views are the dominant means of determining the value of Wikipedia content. We'll apply page views of Wikimedia project pages to the entities those pages reference in order to generate a measurement of entity value.
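The measurement described above can be pictured with the minimal sketch below. It assumes two hypothetical inputs that this proposal does not specify: a mapping from page titles to view counts, and a list of (page, entity) usage pairs. Each page's views are credited in full to every entity the page references; other weighting schemes are equally possible.

```python
from collections import defaultdict

def entity_value(page_views, entity_usage):
    """Sum page views over every page that references each entity.

    page_views: dict mapping page title -> view count (hypothetical input)
    entity_usage: iterable of (page title, entity ID) pairs (hypothetical input)
    """
    value = defaultdict(int)
    # De-duplicate so a page counts once per entity it references.
    for page, entity in set(entity_usage):
        value[entity] += page_views.get(page, 0)
    return dict(value)

# Illustrative numbers only, not real data.
views = {"Douglas Adams": 1000, "Towel Day": 200}
usage = [("Douglas Adams", "Q42"), ("Towel Day", "Q42"), ("Towel Day", "Q5")]
scores = entity_value(views, usage)
```

Crediting full page views to every referenced entity is the simplest choice; it overstates the value of entities that appear on pages referencing many entities, so a real analysis might divide or otherwise weight the views.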
Characteristics of valuable data
Past research has shown that bots create most of the data in Wikidata. By measuring value, we'll be able to compare the value of data produced by bots to the value of data produced by humans. One hypothesis is that humans create less data but with higher value, whereas bots add less directed, less valuable data in general.
Furthermore, perhaps valuable Wikidata entities are used in articles corresponding to certain WikiProjects or other domains within Wikipedia. Knowing where valuable data is applied could help direct editors towards creating valuable Wikidata content.
Work by Priedhorsky et al. has shown that a very small proportion of all editors produce most of the value in Wikipedia. Determining who produces value in Wikidata could have implications for design: for example, mechanisms to "keep [power editors] happy", as Priedhorsky et al. suggested for Wikipedia.
Finally, we could also identify other characteristics of valuable Wikidata. It would be interesting to see whether valuable entities are also high-quality entities, e.g. whether they have good statement coverage. When were valuable entities created? Have valuable entities been produced more frequently in recent history, or have they existed for a while?
Supply versus demand
Work by Warncke-Wang et al. on Wikipedia has shown a misalignment between the “supply and demand” of contributions. Perhaps curation work on Wikidata is similarly misaligned with client wiki usage. We'll investigate this.
Long Term Research Project
Research studying Wikidata item and statement/property edits has been generally quite limited. Work by Steiner et al. and Müller-Birn et al. looked at methods of data definition and application (e.g. which properties or statements are bot-produced or human-produced).
We are not aware of work that has focused on the importance of Wikidata items, statements, or properties. For example, no studies have analyzed the value that Wikidata provides to applications (such as Wikipedia infoboxes or Google Knowledge Graph) using its data. We aim to explore Wikidata’s value to these applications. Initially, this exploration will take place in the context of Wikimedia projects. Our ultimate long-term goal is to study third-party application usage of Wikidata.
We envision three high-level stages for this project more generally -- each is discussed below.
Stage 1: Preliminary Steps and Details for Studying Value in Wikipedia
Wikidata's data is intended to be used, so understanding usage is a prerequisite to understanding the value of Wikidata. Currently, usage is recorded only at the entity level. We propose modifying the Wikibase MediaWiki extension in order to log usage down to the statement level. This will require the creation of, for example, a new database table storing statement usage data.
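To make the idea of a statement usage table concrete, here is a minimal sketch using an in-memory SQLite database. The table name, column names, and the choice to identify a statement by an entity ID plus a property ID are all illustrative assumptions; this is not the schema proposed for Wikibase, and SQLite stands in for the production database only for demonstration.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# not the schema proposed for Wikibase.
SCHEMA = """
CREATE TABLE statement_usage (
    su_page_id   INTEGER NOT NULL,  -- client page that uses the statement
    su_entity_id TEXT    NOT NULL,  -- e.g. 'Q42'
    su_property  TEXT    NOT NULL,  -- e.g. 'P123'
    PRIMARY KEY (su_page_id, su_entity_id, su_property)
);
"""

def record_usage(conn, page_id, entity_id, property_id):
    """Record that a page used one statement; repeated calls are no-ops."""
    conn.execute(
        "INSERT OR IGNORE INTO statement_usage VALUES (?, ?, ?)",
        (page_id, entity_id, property_id),
    )

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
record_usage(conn, 1, "Q42", "P123")
record_usage(conn, 1, "Q42", "P123")  # duplicate, ignored
record_usage(conn, 2, "Q42", "P456")
row_count = conn.execute("SELECT COUNT(*) FROM statement_usage").fetchone()[0]
```

Compared with entity-level tracking, a table like this holds one row per page, entity, and property combination rather than one per page and entity, which is why the I/O load needs to be analyzed before deployment.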
Stage 2: Measuring Value in Wikimedia Projects
The types of analyses in this stage are largely discussed above in Analysis of Current Entity Usage. To fully complete this stage, we would like to perform analyses for both entities and statements.
As with entities, we could identify various characteristics of valuable statements. Additionally, it is currently unclear how often templates and Lua modules attempt to access Wikidata that does not exist. By tracking which statements *would have been* used if they existed when client pages were rendered, we can put a value on this missing data. In effect, we will define a set of Wikidata statements that could immediately provide value if added.
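A minimal sketch of how missing data could be valued follows. It assumes a hypothetical access log of (page, entity ID, property ID, existed) tuples recorded at render time, plus a hypothetical page view mapping; neither format is defined in this proposal. Statements that did not exist are ranked by the total views of the pages that requested them.

```python
from collections import Counter

def missing_statement_value(access_log, page_views):
    """Rank absent statements by the views of pages that requested them.

    access_log: iterable of (page, entity ID, property ID, existed) tuples
                (hypothetical input)
    page_views: dict mapping page -> view count (hypothetical input)
    """
    value = Counter()
    seen = set()
    for page, entity, prop, existed in access_log:
        key = (page, entity, prop)
        # Skip statements that exist, and count each page once per statement.
        if existed or key in seen:
            continue
        seen.add(key)
        value[(entity, prop)] += page_views.get(page, 0)
    return value.most_common()

# Illustrative numbers only, not real data.
log = [
    ("Page A", "Q1", "P10", False),
    ("Page B", "Q1", "P10", False),
    ("Page A", "Q2", "P20", True),
    ("Page B", "Q3", "P30", False),
]
views = {"Page A": 500, "Page B": 50}
ranked = missing_statement_value(log, views)
```

The top of such a ranking would be a candidate work list: statements whose addition would immediately reach the most readers.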
Stage 3: Third Party Usage
While understanding how valuable Wikidata is to Wikimedia projects is important, we would like to take steps towards understanding how valuable Wikidata is to third parties such as Google. This is a challenging problem since Wikidata data (unlike Wikipedia data) does not need to be cited when used elsewhere.
We'll work with the Wikimedia Foundation to formally propose mechanisms for tracking third-party data usage. For example, one potential way to log third-party Wikidata usage is an HTTP logging endpoint (https://wikidata.org/usage-tracking/Q42/P123) that developers would be required, or at least strongly encouraged, to use. Third-party developers could then benefit from increased curation by Wikidata editors of the most-used entities and statements, which would increase the apparent quality of Wikidata for third-party users.
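As a rough illustration of what the server side of such a logging endpoint might extract from a request, the sketch below parses the example URL into an entity ID and a property ID. The path layout is taken from the example above; the endpoint is only a proposal and does not exist, and the function name is hypothetical.

```python
import re
from urllib.parse import urlparse

# Path layout taken from the example URL in the text; the endpoint itself is
# only a proposal and does not exist.
_PATH = re.compile(r"^/usage-tracking/(Q[1-9]\d*)/(P[1-9]\d*)/?$")

def parse_usage_ping(url):
    """Return (entity ID, property ID) for a usage-tracking URL, or None."""
    match = _PATH.match(urlparse(url).path)
    return match.groups() if match else None

hit = parse_usage_ping("https://wikidata.org/usage-tracking/Q42/P123")
miss = parse_usage_ping("https://wikidata.org/wiki/Q42")
```

Each parsed pair could then be counted per entity and property to show editors which statements third parties use most.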
- Priedhorsky, Reid, Jilin Chen, Shyong Tony K. Lam, Katherine Panciera, Loren Terveen, and John Riedl. 2007. “Creating, Destroying, and Restoring Value in Wikipedia.” In Proceedings of the 2007 International ACM Conference on Supporting Group Work, 259–68. ACM.
- Müller-Birn, Claudia, Benjamin Karran, Janette Lehmann, and Markus Luczak-Rösch. 2015. “Peer-Production System or Collaborative Ontology Engineering Effort: What Is Wikidata?” In Proceedings of the 11th International Symposium on Open Collaboration, 20. ACM.
- Warncke-Wang, Morten, Vivek Ranjan, Loren Terveen, and Brent Hecht. 2015. “Misalignment Between Supply and Demand of Quality Content in Peer Production Communities.” In ICWSM 2015: Ninth International AAAI Conference on Web and Social Media.
- Steiner, Thomas. 2014. “Bots vs. Wikipedians, Anons vs. Logged-Ins (redux): A Global Study of Edit Activity on Wikipedia and Wikidata.” In Proceedings of The International Symposium on Open Collaboration, 25. ACM.