Research:Content persistence

Figure 1. Word persistence example: words persisting between revisions of a toy Wikipedia article about apples.

Content persistence is a measurement of how content persists through the history of revisions to a wiki page, based on the assumption that content that survives some amount of time or some number of subsequent revisions does so due to an inherent quality and relevance to the article. This assumption rests on the view of wikis' publish-first, edit-later model as a form of informal peer review[1] in which low-quality contributions should be quickly removed or overwritten by subsequent edits. In this way, content persistence can be viewed as a generalization of revert rate.

Construction[edit]

The persistence of content through revisions of an article is generally determined by performing textual diffs between revisions and tracking the content that does not change. Figure 1 depicts words persisting between revisions of a toy example of an article about apples. The information obtained by diffing adjacent revisions might look as follows (a sketch of code for producing such diffs appears after the list):

  • 0-1: (insert: "Apples are red.")
  • 1-2: (equal: "Apples are ") (remove: "red") (insert: "blue") (equal: ".")
  • 2-3: (equal: "Apples are ") (remove: "blue") (insert: "red") (equal: ".")
  • 3-4: (equal: "Apples are ") (insert: "tasty and ") (equal: "red.")
  • 4-5: (equal: "Apples are tasty and ") (remove: "red") (insert: "blue") (equal: ".")
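
Such segments can be recovered with any longest-common-subsequence diff over word tokens. Below is a minimal sketch using Python's standard difflib rather than the diff engines used by the production libraries cited later on this page; the function and variable names are illustrative:

  import difflib
  import re

  def tokenize(text):
      """Split text into word and non-word tokens, keeping whitespace."""
      return re.findall(r"\w+|\W", text)

  def diff_segments(old, new):
      """Yield (operation, text) segments: 'equal', 'remove' or 'insert'."""
      a, b = tokenize(old), tokenize(new)
      for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
          if op == "equal":
              yield ("equal", "".join(a[i1:i2]))
          elif op == "delete":
              yield ("remove", "".join(a[i1:i2]))
          elif op == "insert":
              yield ("insert", "".join(b[j1:j2]))
          else:  # a 'replace' is a removal followed by an insertion
              yield ("remove", "".join(a[i1:i2]))
              yield ("insert", "".join(b[j1:j2]))

  # The 1-2 diff from the list above:
  print(list(diff_segments("Apples are red.", "Apples are blue.")))
  # [('equal', 'Apples are '), ('remove', 'red'), ('insert', 'blue'), ('equal', '.')]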

By tracing this diff information, a data structure can be built that keeps track of discrete content items and attributes them to their original author. To turn text into discrete content items, a tokenizer is used to discover word boundaries. Once content is broken into tokens, identifiers can be associated with the tokens so that they can be tracked through the history of a page. For example (a sketch of this tracing appears after the list):

  1. (1, "Apples"), (1, " "), (1, "are"), (1, " "), (1, "red"), (1, ".")
  2. (1, "Apples"), (1, " "), (1, "are"), (1, " "), (2, "blue"), (1, ".")
  3. (1, "Apples"), (1, " "), (1, "are"), (1, " "), (1, "red"), (1, ".")
  4. (1, "Apples"), (1, " "), (1, "are"), (1, " "), (4, "tasty"), (4, " "), (4, "and"), (1, " "), (1, "red"), (1, ".")
  5. (1, "Apples"), (1, " "), (1, "are"), (1, " "), (4, "tasty"), (4, " "), (4, "and"), (1, " "), (5, "blue"), (1, ".")

In the last revision's list of tokens, it is clear that "Apples are " was contributed in the first revision, since each of those tokens carries the identifier "1". It may be surprising, however, that "blue" in revision #5 is given a new identifier rather than inheriting the (2, "blue") seen in revision #2. This is because a diff-based tracker cannot know whether newly inserted text "means" the same thing as text removed in an earlier revision; it can only match content that persists unchanged between revisions. In revision #3, by contrast, (1, "red") does persist. Revision #3 is an identity revert, a revision whose text exactly duplicates a previous revision; since the content is duplicated exactly, an algorithm can be sure the tokens are the very same ones and can restore their original identifiers.

Another way to view these tokens is to transform them into a token-major list that expresses which revisions contained each token. For example (a sketch of this transformation follows the list):

  • ("Apples", [1,2,3,4,5])
  • (" ", [1,2,3,4,5])
  • ("are", [1,2,3,4,5])
  • (" ", [1,2,3,4,5])
  • ("red", [1,3,4])
  • (".", [1,2,3,4,5])
  • ("blue", [2])
  • ("tasty", [4,5])
  • (" ", [4,5])
  • ("and", [4,5])
  • (" ", [4,5])
  • ("blue", [5])

From this list, it is easy to see how many revisions a given token persisted. For example, "Apples" was added in the first revision and appeared in 4 subsequent revisions. Under the assumption that subsequent revisions of the page represent informal review of its contents, one might conclude that the token "Apples" was a high-quality contribution to the article. However, this assumption breaks down for content that was only recently inserted into the article. This problem is commonly referred to as right censoring since, when time is plotted from left to right, the samples on the right side carry less information. Simply put, more revisions of the article are needed before we can know whether "tasty" was a good addition or not. We can, however, conclude quite confidently that the token "blue" added in revision #2 was not of high quality, since it did not persist for a single revision (it was immediately reverted).
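
Counting revisions persisted, and flagging right-censored tokens that are still alive in the newest revision, then falls out directly. A short sketch building on the revisions, trace_persistence() and token_major() examples above:

  latest = len(revisions)  # the newest revision in the toy example

  for text, revs in token_major(trace_persistence(revisions)):
      persisted = len(revs) - 1       # subsequent revisions survived
      censored = revs[-1] == latest   # still present: survival is only a lower bound
      print(repr(text), persisted, "(right-censored)" if censored else "")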

There is one additional issue to consider: how much does whitespace matter? And, for that matter, what about stop words (grayed out in Figure 1)? Recent research[2][3] has excluded whitespace, stop words and other wiki markup when computing the value, quality and productivity of editors' work. Such filtering amounts to a simple pass over the token-major list, as sketched below.
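
A minimal filter over the token-major list; the stop-word set here is a stand-in, not the list used in the cited papers, which also parse and strip wiki markup:

  STOP_WORDS = {"are", "and"}  # illustrative only

  def content_tokens(pairs):
      """Drop whitespace, punctuation and stop words before scoring."""
      return [(text, revs) for text, revs in pairs
              if any(c.isalnum() for c in text)
              and text.lower() not in STOP_WORDS]

  print(content_tokens(token_major(trace_persistence(revisions))))
  # [('Apples', [1, 2, 3, 4, 5]), ('red', [1, 3, 4]), ('blue', [2]),
  #  ('tasty', [4, 5]), ('blue', [5])]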

Metrics[edit]

  • Persistent Word Revisions (PWR): The total number of subsequent revisions persisted by the words in a revision. Halfaker et al. used this as a measure of productivity, a mixture of the quality and quantity of wiki work.[3]
  • PWR per Word (PWRpW): The average of the log number of revisions persisted by each word in a revision. Halfaker et al. used this as a measure of the quality of work performed by editors.[4] This indicator works under the assumption that content in Wikipedia is best thought of as randomly volatile and highly reviewed: under this assumption, there should be an exponential decay (i.e., a constant hazard function) in the probability of persistence even of the highest quality words, due to re-structuring and edit creep (en:Wikipedia:Edit_creep#E).
  • Persistent Word Views (PWV): The sum of the views an article receives while a word appears in it. Priedhorsky et al. used this metric as a measure of the value contributed by authors, assuming that an encyclopedia is meant to be read and that highly viewed articles are of high value.[2] A compact formalization of all three metrics is sketched after this list.
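
One plausible formalization of these three metrics, in notation that is ours rather than the cited papers': W_r is the multiset of words added in revision r, p(w) is the number of subsequent revisions word w persists, and v(w) is the page views accumulated while w is present. The +1 inside the logarithm is an assumed smoothing for words that never persist:

  % notation assumed for illustration, not taken from the cited papers
  \mathrm{PWR}(r)   = \sum_{w \in W_r} p(w)
  \mathrm{PWRpW}(r) = \frac{1}{|W_r|} \sum_{w \in W_r} \log\bigl(p(w) + 1\bigr)
  \mathrm{PWV}(r)   = \sum_{w \in W_r} v(w)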

Code[edit]

Openly licensed code for tracing content persistence through the history of revisions of an article is publicly available in the python-mwpersistence library (notes from February 2016 architectural discussion).

Services[edit]

The WikiWho API provides, for each element of the tokenized wikitext of an article at any given revision, the revision in which the token was originally added and all revisions in which the token was deleted or reinserted. This enables content persistence measurements of several kinds, e.g., aggregated per token or per editor in the article. For per-user aggregations of persisting content over all articles, see the secondary endpoint for edit persistence. Available language editions to date: EN, DE, ES, TR, EU.

Limitations[edit]

  • Only measures the quality of added content. Does not measure the quality of removals.
  • Content must be explicitly vetted by subsequent revisions in order to determine quality.

Usage[edit]

  • In their work on WikiTrust, Adler and de Alfaro use the implied review of words that persist over time and through revisions of articles to determine a "trustworthiness" score for content in Wikipedia.[5][6]
  • A 2006 First Monday paper described the implementation of a similar coloring algorithm in MediaWiki, based on both the number of edits and the amount of time that a part of Wikipedia article has survived.[7]
  • Priedhorsky et al. used the number of views articles receive while content persists to measure the value contributed by editors of Wikipedia.[2] Notably, they found that 0.1% of editors contributed ~40% of the value in the wiki (as of early 2007).
  • Halfaker et al. used the number of revisions that words added by an editor last in an article to approximate the quality of contributions.[4] They developed a metric for the average number of revisions that content added by an author lasts (PWRpW = Persistent Word Revisions per Word) and showed it to be highly related to revert rate.
  • To control for the scale of contributions made by editors, Halfaker et al. used the number of persistent word revisions contributed to measure the productivity of an editor in Wikipedia.[3] They found evidence that, despite a decrease in their rate of contributions after being reverted, editors generally became more productive, which, they argue, suggests a learning effect.
  • Research:Measuring edit productivity
  • A conference paper for CHI'13[8] reported on a project where 640 undergraduate and graduate students edited Wikipedia articles on scientific topics in 36 university courses. The authors found that the "students substantially improved the scientific content of over 800 articles, at a level of quality indistinguishable from content written by PhD experts", measured in a content persistence metric.
  • In 2017, Flöck et al. published the "TokTrack" dataset, described as containing "every instance of all tokens (≈ words) ever written in undeleted, non-redirect English Wikipedia articles until October 2016, in total 13,545,349,787 instances. Each token is annotated with (i) the article revision it was originally created in, and (ii) lists with all the revisions in which the token was ever deleted and (potentially) re-added and re-deleted from its article, enabling a complete and straightforward tracking of its history." The accompanying paper[9] presents various results about content persistence derived from this dataset. See also the summary in the Wikimedia Research Newsletter: "Who wrote this? A new dataset tracks the provenance of English Wikipedia text over 15 years"

References[edit]

  1. Stvilia, B., Twidale, M. B., Smith, L. C., & Gasser, L. (2008). Information quality work organization in Wikipedia. Journal of the American Society for Information Science and Technology, 59(6), 983-1001.
  2. Priedhorsky, R., Chen, J., Lam, S. K., Panciera, K., Terveen, L., & Riedl, J. (2007). Creating, destroying, and restoring value in Wikipedia. GROUP (pp. 259-268).
  3. Halfaker, A., Kittur, A., & Riedl, J. (2011). Don't bite the newbies: How reverts affect the quantity and quality of Wikipedia work. The 7th International Symposium on Wikis and Open Collaboration (pp. 163-172). doi:10.1145/2038558.2038585
  4. Halfaker, A., Kittur, A., Kraut, R. E., & Riedl, J. (2009). A jury of your peers: Quality, experience and ownership in Wikipedia. The 5th International Symposium on Wikis and Open Collaboration, Article 15, 10 pages. doi:10.1145/1641309.1641332
  5. Adler, B. T., & de Alfaro, L. (2006). A content-driven reputation system for the Wikipedia. Technical Report ucsc-crl-06-18, School of Engineering, University of California, Santa Cruz.
  6. Adler, B. T., Chatterjee, K., de Alfaro, L., Faella, M., Pye, I., & Raman, V. (2008). Assigning trust to Wikipedia content. WikiSym '08: Proceedings of the 2008 International Symposium on Wikis.
  7. Cross, T. (2006). "Puppy smoothies: Improving the reliability of open, collaborative wikis." First Monday, 11(9).
  8. Farzan, R., & Kraut, R. E. (2013). "Wikipedia classroom experiment: Bidirectional benefits of students' engagement in online production communities." CHI '13, April 27-May 2, 2013, Paris, France. PDF
  9. Flöck, F., Erdogan, K., & Acosta, M. (2017). TokTrack: A complete token provenance and change tracking dataset for the English Wikipedia. Eleventh International AAAI Conference on Web and Social Media.