Research:MDM - The Magical Difference Machine

From Meta, a Wikimedia project coordination wiki
This page documents a completed research project.


Topic

A number of research questions depend, in one form or another, on exactly what was changed, added, or removed in a revision. Unfortunately, this data is not easily accessible via the wiki dumps or databases, which contain only the full text of each revised article. The goal of this sprint is to create a system for quickly producing and querying a dataset containing the diffs of all revisions in the English Wikipedia. The idea is to define a broad data structure that can then be used to answer the research questions and generate datasets based on the diffs.
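To make the idea of a revision diff concrete, here is a minimal sketch using Python's standard `difflib` module. It is an illustration only, not the project's implementation: the function name `revision_diff`, the whitespace tokenization, and the shape of the returned dictionary are all assumptions.

```python
import difflib


def revision_diff(old_text, new_text):
    """Return the token runs added and removed between two revision texts.

    Hypothetical helper: tokens are whitespace-separated words, which is
    only one of several reasonable granularities (lines, sentences, ...).
    """
    old_tokens = old_text.split()
    new_tokens = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    added, removed = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" contributes to both lists; "equal" contributes to neither.
        if tag in ("replace", "delete"):
            removed.append(" ".join(old_tokens[i1:i2]))
        if tag in ("replace", "insert"):
            added.append(" ".join(new_tokens[j1:j2]))
    return {"added": added, "removed": removed}


before = "Paris is the capital of France."
after = "Paris is the capital of France. {{db-spam}}"
print(revision_diff(before, after))
```

Storing only the added and removed runs, rather than the full text of every revision, is what would let later queries look at the change itself.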

Below is a list of requirements/expectations of the capabilities of the revision diff (change enacted by an edit) database that the quants are building. You should think of this document as a wish list of things you would like to be able to search for that are related to the page text that is changed by edits.

Feature | Priority | Justification
Find which revisions inserted a template or set of templates, e.g. “{{ db-”, “<!-- db-” (regular expression support would be useful) | High | We need to count the frequency and location of the application of templates such as welcomes, warnings, etc.
Count of total content added/removed by an editor (by namespace) | High |
Simple association with revert information, e.g. was reverted, is reverting, for vandalism, etc. | High | Figuring out what is and isn’t a revert and how often it happens would be a huge boon. We also have to note when addition of content is actually a revert of a previous blanking of content.
References/citations, i.e. ref tags and citation templates | Moderate | This might suggest how successful editors are with using our complicated syntax for sourcing, and more importantly, is a clear measure of the quality of edits. It will also allow us to identify editors who are great at adding references, as a taxonomy of editing activity. Knowing the sourcing gurus could be useful.
Structural changes, such as addition or removal of sections | Low | Use of proper sections or removal of them is one measure useful for determining quality. Vandals often blank sections, and use of proper section syntax is often a sign of a quality addition.
External and internal links added or removed | Low | Interesting, but link use in general is unlikely to have a causal relationship with new editor retention, though heavy external link use is usually a sign of low quality.
Cleanup templates or citation needed templates added or removed | Moderate | This is generally interesting as a look at how often editors apply these tags to each other’s work, and it may be one of the factors that has led to lower retention rates.
Ratio of markup to content in the diff (i.e. complexity) | Moderate | Growth in complexity over time is interesting, as well as how good newbies really are at using complex markup, but it’s not vital.
Links to policy and guidelines (WP:___) | Low | We already know people frequently cite these.
Images added or removed | Low | This is probably a separate question from links though.
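Several of the features listed above (templates, ref tags, links, sections, policy links, images) can be approximated with regular expressions over the text an edit added. The sketch below is a rough illustration under stated assumptions: the function name `extract_features`, the output keys, and the patterns themselves are hypothetical, and the patterns deliberately ignore edge cases such as nested templates or wikitext inside `<nowiki>`.

```python
import re

# Simplified wikitext patterns (assumptions, not a full parser).
TEMPLATE = re.compile(r"\{\{\s*([^{}|]+)")          # template name after "{{"
REF_TAG = re.compile(r"<ref\b", re.IGNORECASE)      # opening <ref> tags only
INTERNAL_LINK = re.compile(r"\[\[([^\[\]|]+)")      # link target after "[["
EXTERNAL_LINK = re.compile(r"(?<!\[)\[(https?://[^\s\]]+)")
HEADING = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$", re.MULTILINE)

MEDIA_PREFIXES = ("File:", "Image:")


def extract_features(added_text):
    """Summarize the wikitext constructs found in the text added by an edit."""
    targets = [t.strip() for t in INTERNAL_LINK.findall(added_text)]
    return {
        "templates": [name.strip() for name in TEMPLATE.findall(added_text)],
        "ref_tags": len(REF_TAG.findall(added_text)),
        "policy_links": [t for t in targets if t.startswith("WP:")],
        "images": [t for t in targets if t.startswith(MEDIA_PREFIXES)],
        "internal_links": [
            t for t in targets
            if not t.startswith("WP:") and not t.startswith(MEDIA_PREFIXES)
        ],
        "external_links": EXTERNAL_LINK.findall(added_text),
        "sections": [title for _, title in HEADING.findall(added_text)],
    }


sample = (
    "== History ==\n"
    "Paris<ref>{{cite web|url=http://example.org}}</ref> is in [[France]]. "
    "See [[WP:NPOV]] and [http://example.com site]. "
    "[[File:Cat.jpg|thumb]] {{ db-spam }}"
)
print(extract_features(sample))
```

Running the same function over the removed text would give the corresponding "removed" counts for each feature.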

Process

The system will use MapReduce to process the dumps via Hadoop and store the data in MongoDB (i.e. Wikilytics).

A rough description of the system and process is as follows:

  • Parse dumps to send to the map function via the Hadoop streaming interface
  • Send each page to the map function
  • Map function will yield diffs to reduce function and store them in MongoDB
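The steps above could look roughly like the following Hadoop streaming mapper, which reads records from standard input and writes tab-separated key/value lines to standard output. This is a sketch under assumptions, not the project's code: it assumes the dump has already been parsed into one JSON object per line with hypothetical fields `page_id` and `revisions` (each revision having `rev_id` and `text`, in chronological order), and it leaves out the MongoDB write entirely.

```python
import difflib
import json
import sys


def line_diff(old_text, new_text):
    """Return (added_lines, removed_lines) between two revision texts."""
    added, removed = [], []
    for line in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
    return added, removed


def map_page(record_line):
    """Yield one 'rev_id<TAB>json' output line per revision of a page."""
    page = json.loads(record_line)
    previous_text = ""  # the first revision is diffed against an empty page
    for rev in page["revisions"]:
        added, removed = line_diff(previous_text, rev["text"])
        doc = {
            "page_id": page["page_id"],
            "rev_id": rev["rev_id"],
            "added": added,
            "removed": removed,
        }
        # json.dumps escapes tabs and newlines, so the value stays on one line.
        yield "{}\t{}".format(rev["rev_id"], json.dumps(doc))
        previous_text = rev["text"]


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point a streaming mapper script would call at module level."""
    for record_line in stdin:
        if record_line.strip():
            for output_line in map_page(record_line):
                stdout.write(output_line + "\n")
```

Processing a whole page inside one map call keeps each revision next to its predecessor, which is what the diff needs; a downstream step could then load the emitted JSON documents into the data store.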

The sprint will consist of two major milestones:

  • Generate the Diff Dataset

In the first week, we will build a system to quickly produce and store the diff dataset.

  • Searching Interface

In the second week, we will create the interface to perform searches and generate datasets from the diff dataset generated in the first week.
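As an illustration of the kind of search the interface might support, the sketch below runs two of the high-priority queries over a list of diff documents held in memory: finding revisions whose added text matches a regular expression, and totalling characters added and removed per editor and namespace. The document fields (`rev_id`, `editor`, `namespace`, `added`, `removed`) and both function names are assumptions made for this example; a real interface would query the stored dataset rather than a Python list.

```python
import re
from collections import defaultdict


def find_insertions(diff_docs, pattern):
    """Return rev_ids of revisions whose added text matches the pattern."""
    regex = re.compile(pattern)
    return [
        doc["rev_id"]
        for doc in diff_docs
        if any(regex.search(segment) for segment in doc["added"])
    ]


def content_by_editor(diff_docs):
    """Total characters added and removed per (editor, namespace) pair."""
    totals = defaultdict(lambda: {"added": 0, "removed": 0})
    for doc in diff_docs:
        key = (doc["editor"], doc["namespace"])
        totals[key]["added"] += sum(len(s) for s in doc["added"])
        totals[key]["removed"] += sum(len(s) for s in doc["removed"])
    return dict(totals)


# Hypothetical diff documents, for illustration only.
docs = [
    {"rev_id": 10, "editor": "Alice", "namespace": 0,
     "added": ["Hello world"], "removed": []},
    {"rev_id": 11, "editor": "Bob", "namespace": 0,
     "added": ["{{db-spam}}"], "removed": ["Hello world"]},
    {"rev_id": 12, "editor": "Bob", "namespace": 3,
     "added": ["<!-- db-attack -->"], "removed": []},
]

# Matches both "{{ db-" and "<!-- db-" style insertions.
print(find_insertions(docs, r"(\{\{|<!--)\s*db-"))
print(content_by_editor(docs))
```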

Results and discussion

Future work

See also