User:Hall1467/sandbox

Created: June 13, 2017
Contact: Andrew Hall
Collaborators: Aaron Halfaker
Duration: 2017-07 – 2017-09

This page documents a proposed research project.
Information may be incomplete and may change before the project starts.


Research on Wikidata items and on statement and property edits has been limited. Work by Steiner et al. and Müller-Birn et al. looked at how data is defined and applied (e.g. which properties or statements are bot-produced or human-produced)[1][2].

We are not aware of work that has focused on the importance of Wikidata statements or properties. For example, no studies have analyzed the value that Wikidata provides to the applications that use its data (such as Wikipedia infoboxes or Google Knowledge Graph). We aim to explore Wikidata’s value to these applications. Initially, this exploration will take place in the context of Wikimedia projects. Our ultimate goal is to study third-party application usage of Wikidata.


Project Details

We envision three stages for this project; each is discussed below.

Stage 1: Preliminary Steps and Details for Studying Value in Wikipedia

Wikidata's data is intended to be used, so understanding usage is a prerequisite to understanding the value of Wikidata. Currently, usage is recorded only at the item level. We propose modifying the Wikibase MediaWiki extension in order to log usage down to the statement level. This will require the creation of, for example, a new database table storing statement usage data.
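
As a rough illustration of what statement-level usage logging could store, the following sketch uses an in-memory SQLite database. The table name `statement_usage`, its columns, and the helper functions are hypothetical and chosen only for this example; they are not the Wikibase schema.

```python
import sqlite3


def create_usage_table(conn):
    """Create a hypothetical table with one row per (page, item, property) usage."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS statement_usage (
            page_id     INTEGER NOT NULL,  -- client wiki page that uses the data
            item_id     TEXT    NOT NULL,  -- e.g. 'Q42'
            property_id TEXT    NOT NULL,  -- e.g. 'P31'
            PRIMARY KEY (page_id, item_id, property_id)
        )
        """
    )


def record_usage(conn, page_id, item_id, property_id):
    """Record that a page used a statement; repeated usages are stored once."""
    conn.execute(
        "INSERT OR IGNORE INTO statement_usage VALUES (?, ?, ?)",
        (page_id, item_id, property_id),
    )


def statements_used_by(conn, page_id):
    """Return the (item, property) pairs a page uses, in a stable order."""
    rows = conn.execute(
        "SELECT item_id, property_id FROM statement_usage "
        "WHERE page_id = ? ORDER BY item_id, property_id",
        (page_id,),
    )
    return [tuple(row) for row in rows]


conn = sqlite3.connect(":memory:")
create_usage_table(conn)
record_usage(conn, 1, "Q42", "P31")
record_usage(conn, 1, "Q42", "P31")   # duplicate, ignored by the primary key
record_usage(conn, 1, "Q42", "P569")
```

Keying on the page, item, and property together keeps the table idempotent under repeated page renders, which is one plausible way to limit write load.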

Stage 2: Measuring value in Wikimedia projects

We have identified several research directions related to Wikidata value in the context of Wikipedia. The results of our explorations should have important implications for Wikidata and its contributors, as well as for the design of their tools.

Pageviews are the dominant means of determining the value of Wikipedia content[3]. We'll attribute the pageviews of Wikimedia project pages to the statements those pages reference in order to generate a fine-grained measurement of statement value.
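
A minimal sketch of this attribution follows. It assumes that every statement referenced by a page receives that page's full pageview count; other weightings, such as splitting views among statements, are equally possible. The function name and the input shapes are illustrative assumptions.

```python
from collections import defaultdict


def statement_value(page_views, page_statements):
    """Sum, for each statement, the views of every page that references it.

    page_views:      {page title: view count}
    page_statements: {page title: set of (item_id, property_id) pairs}
    Assumption: each referenced statement is credited with the full view count.
    """
    value = defaultdict(int)
    for page, views in page_views.items():
        for statement in page_statements.get(page, ()):
            value[statement] += views
    return dict(value)


page_views = {"Page A": 100, "Page B": 50}
page_statements = {
    "Page A": {("Q42", "P31"), ("Q42", "P569")},
    "Page B": {("Q42", "P31")},
}
values = statement_value(page_views, page_statements)
```

Here the statement used on both pages accumulates the views of both, while the statement used on only one page is credited with that page's views alone.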

Characteristics of valuable data

Past research has shown that bots create most of the data in Wikidata[2]. By measuring value at the statement level, we'll be able to compare the value of data produced by bots to the value of data produced by humans. One hypothesis could be that humans create less data with higher value whereas bots apply less directed, less valuable data in general.
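
The comparison could be sketched as below, assuming statement values have already been computed and that each statement can be labeled as bot-created or human-created. The function name, the labels, and the sample numbers are all made up for illustration.

```python
from statistics import mean


def value_by_creator_type(statement_values, created_by_bot):
    """Summarize statement value separately for bot- and human-created statements.

    statement_values: {statement id: value (e.g. attributed pageviews)}
    created_by_bot:   {statement id: True if a bot created the statement}
    """
    groups = {"bot": [], "human": []}
    for statement, value in statement_values.items():
        key = "bot" if created_by_bot[statement] else "human"
        groups[key].append(value)
    return {
        key: {
            "statements": len(vals),
            "total": sum(vals),
            "mean": mean(vals) if vals else 0.0,
        }
        for key, vals in groups.items()
    }


# Illustrative values only, not real measurements.
summary = value_by_creator_type(
    {"s1": 10, "s2": 20, "s3": 300},
    {"s1": True, "s2": True, "s3": False},
)
```

Reporting the mean next to the total separates the volume of data from its per-statement value, which is the distinction the hypothesis turns on.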

Furthermore, perhaps valuable Wikidata statements are used in articles corresponding to certain WikiProjects or other domains within Wikipedia. Knowing where valuable data is applied could help direct editors towards creating valuable Wikidata content.

Work by Priedhorsky et al. has shown that a very small proportion of all editors produce most of the value in Wikipedia[3]. Determining who produces value in Wikidata could have implications for design: for example, perhaps mechanisms to "keep [power editors] happy", as Priedhorsky et al. suggested for Wikipedia.

Finally, we could also identify other characteristics of valuable Wikidata. Perhaps valuable statements are associated with properties that have high-quality descriptions or multiple examples instructing their application on their Wikidata pages.

Supply versus demand

Work in Wikipedia has shown a misalignment between the “supply and demand” of contributions[4]. Perhaps curation work on Wikidata is misaligned with client wiki usage.

It's also unclear how often templates and Lua modules attempt to access Wikidata that does not exist. By tracking which statements *would have been* used if they had existed when client pages were rendered, we can put a value on this missing data. In effect, we will define a set of Wikidata statements that could immediately provide value if created.
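
One way to picture this tracking is a lookup wrapper that counts requests for statements that do not exist. The class below is a hypothetical stand-in for the data access that templates and Lua modules perform during rendering; it is not part of Wikibase.

```python
from collections import Counter


class TrackingStatementStore:
    """Wrap a statement lookup and count requests for missing statements."""

    def __init__(self, statements):
        # statements: {(item_id, property_id): value}
        self._statements = statements
        self.missing = Counter()

    def get(self, item_id, property_id):
        """Return the statement value, or None (and log the miss) if absent."""
        key = (item_id, property_id)
        if key in self._statements:
            return self._statements[key]
        self.missing[key] += 1
        return None


store = TrackingStatementStore({("Q42", "P31"): "Q5"})
found = store.get("Q42", "P31")
first_miss = store.get("Q42", "P569")
second_miss = store.get("Q42", "P569")
```

The `missing` counter then lists the statements that would have been used had they existed, and its counts could be weighted by pageviews to estimate their potential value.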

Stage 3: Third Party Usage

While understanding how valuable Wikidata is to Wikimedia projects is important, we would like to take steps towards understanding how valuable Wikidata is to third parties such as Google. This is a challenging problem since Wikidata data (unlike Wikipedia data) does not need to be cited when used elsewhere.

We'll work with the Wikimedia Foundation to formally propose mechanisms that can be used to track third-party data usage. For example, one potential way to log third-party Wikidata usage is via an HTTP logging endpoint (e.g. https://wikidata.org/usage-tracking/Q42/P123) that developers would be required, or at least strongly encouraged, to use. Third-party developers could then benefit from increased curation by Wikidata editors on the items and statements that are most used, which would increase the apparent quality of Wikidata for third-party users.
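
As a sketch of how requests to such an endpoint might be tallied, the code below parses URLs of the form given in the example above and counts usage per item and property pair. The endpoint itself is only a proposal, and the path pattern, function names, and sample log are assumptions made for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumed path shape, taken from the example URL: /usage-tracking/<item>/<property>
USAGE_PATH = re.compile(r"^/usage-tracking/(Q[1-9]\d*)/(P[1-9]\d*)$")


def parse_usage_url(url):
    """Return (item_id, property_id) for a usage-tracking URL, else None."""
    match = USAGE_PATH.match(urlparse(url).path)
    return match.groups() if match else None


def count_usage(urls):
    """Count usage-tracking hits per (item_id, property_id), skipping other URLs."""
    counts = Counter()
    for url in urls:
        parsed = parse_usage_url(url)
        if parsed is not None:
            counts[parsed] += 1
    return counts


# Hypothetical request log for illustration.
log = [
    "https://wikidata.org/usage-tracking/Q42/P123",
    "https://wikidata.org/usage-tracking/Q42/P123",
    "https://wikidata.org/wiki/Q42",
]
usage_counts = count_usage(log)
```

Counts of this kind could be surfaced to Wikidata editors to show which items and statements third parties use most.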

Statement of Deliverables

In this section, we extend our original proposal based on our discussion with Lydia on June 13. We provide specific deliverables:

  • We will deploy statement tracking code. This code will write to wbc_entity_usage tables, which will likely be on MySQL servers. We will test deployment of statement tracking on one client wiki (Greek Wikipedia). If the database load on the MySQL servers is determined to be too high, we will store the wbc_entity_usage tables in Hadoop clusters or use another storage option such as Cassandra.
  • We will produce a watchlist tool to indicate when Wikidata items that are used by a given client page are modified. Any contributor to a client page might like to know when the Wikidata properties that the page uses change, so that they can check those changes.
  • We will also create a tool that logs when client wikis (that use Wikidata) change. This tool can provide information about which properties are referenced in the client wiki but are missing. We could also produce a similar tool that shows properties that could be used by, and provide value to, large numbers of wikis.
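
The watchlist deliverable above amounts to a reverse lookup from changed statements to the client pages that use them. A minimal sketch, assuming statement-level usage data is available as a mapping from pages to statements (the function names and data are hypothetical):

```python
from collections import defaultdict


def build_reverse_index(page_usage):
    """Map each (item_id, property_id) pair to the set of pages that use it."""
    index = defaultdict(set)
    for page, statements in page_usage.items():
        for statement in statements:
            index[statement].add(page)
    return index


def pages_to_notify(index, changed_statements):
    """Return the sorted pages that use at least one changed statement."""
    pages = set()
    for statement in changed_statements:
        pages |= index.get(statement, set())
    return sorted(pages)


page_usage = {
    "Page A": {("Q42", "P31"), ("Q42", "P569")},
    "Page B": {("Q42", "P31")},
    "Page C": {("Q1", "P18")},
}
index = build_reverse_index(page_usage)
affected = pages_to_notify(index, [("Q42", "P31")])
```

With item-level usage only, every page that uses any part of an item would be notified; the statement-level index narrows notifications to pages that use the changed data.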

Expectations of the Contract

  • We would like a weekly touch-base meeting with member(s) of the Wikidata product team. This will allow us to take a more active part in code deployment, to ensure the process moves quickly, and to address any concerns or needs of the product team in a timely manner.
  • Aaron would manage Andrew's day-to-day work (as he is currently doing with Andrew's WMF contract).

References

  1. Steiner, Thomas. 2014. “Bots vs. Wikipedians, Anons vs. Logged-Ins (redux): A Global Study of Edit Activity on Wikipedia and Wikidata.” In Proceedings of The International Symposium on Open Collaboration, 25. ACM.
  2. Müller-Birn, Claudia, Benjamin Karran, Janette Lehmann, and Markus Luczak-Rösch. 2015. “Peer-Production System or Collaborative Ontology Engineering Effort: What Is Wikidata?” In Proceedings of the 11th International Symposium on Open Collaboration, 20. ACM.
  3. Priedhorsky, Reid, Jilin Chen, Shyong Tony K. Lam, Katherine Panciera, Loren Terveen, and John Riedl. 2007. “Creating, Destroying, and Restoring Value in Wikipedia.” In Proceedings of the 2007 International ACM Conference on Supporting Group Work, 259–68. ACM.
  4. Warncke-Wang, Morten, Vivek Ranjan, Loren Terveen, and Brent Hecht. 2015. “Misalignment Between Supply and Demand of Quality Content in Peer Production Communities.” In ICWSM 2015: Ninth International AAAI Conference on Web and Social Media.