Research:Wikiwho Provenance Api

From Meta, a Wikimedia project coordination wiki

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


Goal

We aim to provide a performant API that returns, for every token (words and functional characters) in any specified Wikipedia article, the revision in which that token originated and all changes ever applied to it, with high accuracy. From this, metadata such as the original author and the presence of each token over past revisions and time can be retrieved; the same data can also be used to extract disputes about content.
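
To make the idea of per-token provenance concrete, here is a minimal sketch of how such a record could be represented and queried. The class, its field names, and the helper `is_present` are hypothetical illustrations and not the actual API schema; the sketch also assumes that revision ids increase over time within one article.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    """One token with its provenance (hypothetical field names)."""
    text: str
    origin_rev: int       # revision in which the token first appeared
    origin_editor: str    # author of that revision
    removed_in: List[int] = field(default_factory=list)     # revisions that deleted it
    reinserted_in: List[int] = field(default_factory=list)  # revisions that restored it


def is_present(token: Token, rev_id: int) -> bool:
    """Return True if the token exists in the article as of revision rev_id.

    Assumption: revision ids are ordered chronologically.
    """
    if rev_id < token.origin_rev:
        return False  # the token had not been written yet
    # Latest deletion and latest reintroduction at or before rev_id.
    last_out = max((r for r in token.removed_in if r <= rev_id), default=None)
    last_in = max((r for r in token.reinserted_in if r <= rev_id), default=None)
    if last_out is None:
        return True   # never deleted up to this revision
    return last_in is not None and last_in > last_out


# Hypothetical example: added in revision 100, deleted in 120,
# restored in 130, deleted again in 150.
token = Token("provenance", 100, "Alice", removed_in=[120, 150], reinserted_in=[130])
for rev in (99, 110, 125, 135, 150):
    print(rev, is_present(token, rev))
```

The origin revision gives the original author, while the two lists carry the full deletion and reintroduction history from which presence at any point can be derived.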

We currently offer the service in English, German, Turkish and Basque, with more languages planned.

Current State

A beta version of the API is live and working for en.wikipedia.org; we are still improving its performance. The API and its documentation are available at https://api.wikiwho.net/api/


Methods

The algorithm used to mine the provenance of individual tokens is described in our corresponding paper, which includes runtime and precision evaluations for English.[1] Further information can be found at f-squared.org/wikiwho in the Internet Archive.

Regarding the "precision" of the method: prior research[1][2] has shown that identifying the "correct" original author of a piece of text in a Wikipedia article is not trivial. We therefore rely on an extraction method that has been shown in a scientific evaluation to perform at 95% precision,[1] higher than any other algorithm proposed for the task, as far as we can tell. We consider this crucial for use in production.

Use Cases

A wiki page annotated by whoCOLOR.

Apart from direct queries to the WikiWho API, some use cases already exist:

  • Use case 1: whoCOLOR: a userscript that highlights selected text pieces in an article and annotates them with their provenance (author). Other features currently being built include a conflict view that highlights the most deleted and reintroduced text pieces, as well as a word history view that shows, for each word/token, when it was originally introduced and its individual deletion/reintroduction history. Examples, a description, screenshots and a download link are available at f-squared.org/whovisual. The tool is described in an ICWSM workshop paper.[3]
  • Use case 2: whoVIS: a prototype of an editor-editor interaction network visualization for individual articles, based on the words/tokens deleted and reintroduced by editors. Also available at f-squared.org/whovisual. A WWW Conference demo paper describes the system.[4]
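
As a rough illustration of how per-token histories could feed both a conflict view and an editor-editor network, the sketch below counts deletion/reintroduction events per token and derives directed interaction counts between editors. This is a simplified assumption for illustration only, not the method implemented in whoCOLOR or whoVIS; the data layout and the function names `conflict_scores` and `interaction_edges` are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# One history per token: (token text, origin editor, ordered events).
# An event is (action, editor), with action "out" (deleted) or "in" (reintroduced).
History = Tuple[str, str, List[Tuple[str, str]]]


def conflict_scores(histories: List[History]) -> Counter:
    """Score each token (keyed by its position) by how often it was deleted or reintroduced."""
    scores = Counter()
    for position, (_text, _origin, events) in enumerate(histories):
        scores[position] += len(events)
    return scores


def interaction_edges(histories: List[History]) -> Counter:
    """Count directed edges (actor, target): actor undid the previous action of target."""
    edges = Counter()
    for _text, origin_editor, events in histories:
        last_actor = origin_editor  # the editor whose action would be undone next
        for _action, editor in events:
            if editor != last_actor:  # ignore editors reverting themselves
                edges[(editor, last_actor)] += 1
            last_actor = editor
    return edges


# Hypothetical example data.
histories = [
    ("disputed", "Alice", [("out", "Bob"), ("in", "Alice"), ("out", "Bob")]),
    ("stable", "Alice", []),
    ("minor", "Carol", [("out", "Alice")]),
]
print(conflict_scores(histories).most_common(1))
print(interaction_edges(histories))
```

Tokens with the highest scores are candidates for highlighting in a conflict view, and the edge counts can serve as weights in an interaction graph.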

References

  1. Flöck, Fabian, and Maribel Acosta. "WikiWho: Precise and efficient attribution of authorship of revisioned content." Proceedings of the 23rd International Conference on World Wide Web. ACM, 2014.
  2. de Alfaro, Luca, and Michael Shavlovsky. "Attributing authorship of revisioned content." Proceedings of the 22nd International Conference on World Wide Web, May 13-17, 2013, Rio de Janeiro, Brazil.
  3. Flöck, Fabian, et al. "Towards Better Visual Tools for Exploring Wikipedia Article Development–The Use Case of “Gamergate Controversy”." Ninth International AAAI Conference on Web and Social Media. 2015.
  4. Flöck, Fabian, and Maribel Acosta. "whoVIS: Visualizing editor interactions and dynamics in collaborative writing over time." Proceedings of the 24th International Conference on World Wide Web. ACM, 2015.