Research:MDM - The Magical Difference Machine
There are a number of research questions that depend, in some form, on exactly what was changed, added, or removed during a revision. Unfortunately, this data is not easily accessible: the Wikipedia dumps and databases only contain the full text of each revised article. The goal of this sprint is to create a system for quickly producing and querying a dataset containing the diffs of all revisions in the English Wikipedia. The idea is to define a broad data structure that can then be used to answer the research questions and to generate datasets based upon the diffs.
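One possible shape for such a "broad data structure" is sketched below. The field names are illustrative assumptions, not a settled schema:

```python
# A hedged sketch of one possible diff record; every field name here is
# an illustrative assumption, not a fixed schema.
def make_diff_record(rev_id, page_id, namespace, editor, timestamp,
                     added, removed):
    """Bundle the change enacted by one revision into a queryable record."""
    return {
        "rev_id": rev_id,          # revision being described
        "page_id": page_id,
        "namespace": namespace,    # lets queries aggregate per namespace
        "editor": editor,
        "timestamp": timestamp,
        "added": added,            # list of inserted text segments
        "removed": removed,        # list of deleted text segments
    }

record = make_diff_record(12345, 42, 0, "ExampleUser",
                          "2011-06-01T00:00:00Z",
                          ["{{welcome}}"], [])
```

Keeping the added and removed segments as raw text means later questions (templates, references, links) can be answered by reprocessing the records rather than re-diffing the dumps.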
Below is a list of requirements and expectations for the revision diff (the change enacted by an edit) database that the quants are building. Think of this document as a wish list of things you would like to be able to search for related to the page text changed by edits.
|Requirement||Priority|
|Find what revisions a template (or set of templates) was inserted by||High (We need to count the frequency and location of the application of templates such as welcomes, warnings, etc.)|
|Count of total content added/removed by an editor (by namespace)||High|
|Simple to associate with revert information e.g., was reverted, is reverting, for vandalism, etc.||High (Figuring out what is and isn’t a revert and how often it happens would be a huge boon. We also have to note when addition of content is actually a revert of a previous blanking of content.)|
|References/citations i.e. ref tags and citation templates||Moderate (This might suggest how successful editors are with using our complicated syntax for sourcing, and more importantly, is a clear measure of the quality of edits. It will also allow us to figure out those editors who are great at adding references as a taxonomy activity. Knowing the sourcing gurus could be useful.)|
|Structural changes, such as addition or removal of sections||Low (Use of proper sections or removal of them is one measure useful for determining quality. Vandals often blank sections, and use of proper section syntax is often a sign of a quality addition.)|
|External and internal links added or removed||Low (Interesting but link use in general is not really an issue that is likely to have a causal relationship with new editor retention, though heavy external link use is usually a sign of low quality.)|
|Were cleanup templates or citation needed templates added or removed||Moderate (This is generally interesting as a look at how often editors apply these tags to each other’s work, and it may be one of the factors that has led to lower retention rates.)|
|What is the ratio of markup to content in the diff (i.e. complexity)||Moderate (Growth in complexity over time is interesting, as well as how good newbies really are at using complex markup, but it’s not vital.)|
|Links to policy and guidelines (WP:___)||Low (We already know people frequently cite these.)|
|Images added or removed||Low (This is probably a separate question from links though.)|
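As an illustration of the first requirement, spotting template insertions could reduce to scanning the added text of a diff for `{{...}}` markup. The regex below is a rough sketch; it does not handle nested templates or parser functions:

```python
import re

# Rough sketch: find template names in a piece of added wikitext.
# Deliberately simple; nested templates and parser functions are ignored.
TEMPLATE_RE = re.compile(r"\{\{\s*([^|{}]+?)\s*[|}]")

def templates_in(text):
    """Return the template names that appear in the given wikitext."""
    return [m.group(1) for m in TEMPLATE_RE.finditer(text)]

names = templates_in("{{welcome}} and {{uw-vandalism1|Article}}")
# → ["welcome", "uw-vandalism1"]
```

Running this over the added segments of every diff record would yield the per-revision template counts the first two wish-list items ask for.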
A rough description of the system and process is as follows:
- Parse the dumps via the Hadoop streaming interface
- Send each page to the map function
- The map function yields diffs to the reduce function, which stores them in MongoDB
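A minimal sketch of what the map step could compute, using Python's difflib to diff two consecutive revisions of a page (the Hadoop streaming and MongoDB plumbing is omitted, and line-level diffing is only one possible granularity):

```python
import difflib

def diff_revisions(old_text, new_text):
    """Return (added, removed) lines between two revisions of a page."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    added, removed = [], []
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("replace", "delete"):
            removed.extend(old_lines[i1:i2])   # lines gone from the old text
        if op in ("replace", "insert"):
            added.extend(new_lines[j1:j2])     # lines new in the new text
    return added, removed

added, removed = diff_revisions("Intro.\nOld fact.",
                                "Intro.\nNew fact.\n{{fact}}")
```

In the real pipeline the map function would emit one such (added, removed) pair per revision, keyed by page, for the reduce step to persist.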
The sprint will consist of two major milestones:
- Generate the Diff Dataset
In the first week, we will build a system to quickly produce and store the diff dataset.
- Searching Interface
In the second week, we will create the interface to perform searches and generate datasets from the diff dataset generated in the first week.
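To make the search milestone concrete, a query such as "which revisions inserted a given template" might reduce to a filter like the following. This is a pure-Python stand-in for the eventual MongoDB query, and the `templates_added` field is an assumed precomputed field, not a settled schema:

```python
# Assumed record shape: each stored diff carries a precomputed
# "templates_added" list (an illustrative field, not a fixed schema).
diffs = [
    {"rev_id": 1, "editor": "A", "templates_added": ["welcome"]},
    {"rev_id": 2, "editor": "B", "templates_added": []},
    {"rev_id": 3, "editor": "A", "templates_added": ["uw-vandalism1"]},
]

def revisions_inserting(template, records):
    """Return the revision ids whose diff added the given template."""
    return [r["rev_id"] for r in records if template in r["templates_added"]]

revisions_inserting("welcome", diffs)  # → [1]
```

Precomputing such fields during the first-week map/reduce pass is what would let second-week searches stay fast.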