WSoR datasets/revision diff

Location

The diffdb can be downloaded from dumps.wikimedia.org.

Fields

hadoop21@beta:~/wikihadoop/diffs$ /usr/lib/hadoop-beta/bin/hdfs dfs -cat /usr/hadoop/out-10-bzip2/part-00000 | head -n 3
133350337	11406585	0	'National security and homeland security presidential directive'	1180070193	u'Begin'	False	308437	u'Badagnani'	0:1:u"The '''[[National Security and Homeland Security Presidential Directive]]''' (NSPD-51/HSPD-20), signed by President [[George W. Bush]] on May 9, 2007, is a [[Presidential Directive]] giving the [[President of the United States]] near-total control over the United States in the event of a catastrophic event, without the oversight of [[United States Congress|Congress]].\n\nThe signing of this Directive was generally unnoticed by the U.S. media as well as the U.S. Congress. It is unclear how the National Security and Homeland Security Presidential Directive will reconcile with the [[National Emergencies Act]], signed in 1976, which gives Congress oversight during such emergencies.\n\n==External links==\n*[http://www.whitehouse.gov/news/releases/2007/05/20070509-12.html National Security and Homeland Security Presidential Directive], from White House site\n\n==See also==\n*[[National Emergencies Act]]\n*[[George W. Bush]]\n\n{{US-stub}}"
133350707	11406585	0	'National security and homeland security presidential directive'	1180070344	None	False	308437	u'Badagnani'	906:1:u'National Security Directive]]\n*[['
133350794	11406585	0	'National security and homeland security presidential directive'	1180070386	None	False	308437	u'Badagnani'	613:-1:u'signed'	613:1:u'a U.S. federal law passed'

Each row represents a revision from the April 2011 XML dump of the English Wikipedia. There *should* be a row for every revision that had not been deleted when that dump was produced; however, some cleanup is still needed to remove duplicates and fill in missing revision diffs.

rev_id: The identifier of the revision being described (primary key)
page_id: The identifier of the page being revised
namespace: The identifier of the namespace of the page
title: The title of the page being revised
timestamp: The time the revision took place as a Unix epoch timestamp in seconds
comment: The edit summary left by the editor
minor: Minor status of the edit (boolean)
user_id: The identifier of the editor who saved the revision
user_text: The username of the editor who saved the revision
diffs: Tab-separated diff operations. Each diff operation has three parts (separated by colons):
- position: The position in the article text at which the operation took place
- action: Did the operation add or remove some text? ("1" for add, "-1" for remove)
- content: The text operated on. For added text, this is the content to add. For removed text, this is the content that was removed.
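
As an illustration of how the three parts of a diff operation fit together, here is a minimal Python sketch that applies a list of operations to a piece of text. It assumes that operations are applied in the order listed and that each position refers to the text as it stands after the earlier operations in the same row; the source does not confirm either point, so treat this as a hypothesis to check against real rows. The function name apply_ops and the sample text are made up for the example.

```python
def apply_ops(text, ops):
    """Apply (position, action, content) operations to text, in order.

    Assumption (not confirmed by the dataset description): positions refer
    to the text after all earlier operations in the list have been applied.
    """
    for position, action, content in ops:
        if action == 1:
            # Insert the content at the given position.
            text = text[:position] + content + text[position:]
        elif action == -1:
            # Check that the text being removed is really at that position.
            if text[position:position + len(content)] != content:
                raise ValueError("removed text does not match at %d" % position)
            text = text[:position] + text[position + len(content):]
        else:
            raise ValueError("unknown action %r" % (action,))
    return text


# Toy input shaped like the third sample row: a removal and an insertion
# at the same position.
before = "The act was signed in 1976."
ops = [(12, -1, "signed"), (12, 1, "passed")]
after = apply_ops(before, ops)
print(after)
```

The check in the removal branch is a cheap way to notice when the assumption about positions does not hold for a given row.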

Each row can have zero or more diff operations. Values in the result set have been encoded using Python's repr() function and can be decoded in Python with the eval() function.
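
The row layout described above can be decoded roughly as follows. This is a sketch rather than the tooling used to build the dataset: it swaps eval() for ast.literal_eval(), which only accepts literals and (in Python 3.3 and later) also understands the u'...' prefix seen in the sample rows. The names FIELDS and parse_row are hypothetical, and the sample line is the third row shown under Fields with tabs written as \t.

```python
import ast
from datetime import datetime, timezone

# Column order as listed under Fields; everything after these is a diff operation.
FIELDS = ["rev_id", "page_id", "namespace", "title", "timestamp",
          "comment", "minor", "user_id", "user_text"]


def parse_row(line):
    """Decode one tab-separated row into a dict, with diff operations as tuples."""
    columns = line.rstrip("\n").split("\t")
    # The leading columns are repr()-encoded scalars (ints, strings, None, booleans).
    row = {name: ast.literal_eval(value) for name, value in zip(FIELDS, columns)}
    ops = []
    for column in columns[len(FIELDS):]:
        # Split on the first two colons only: the content may itself contain colons.
        position, action, content = column.split(":", 2)
        ops.append((int(position), int(action), ast.literal_eval(content)))
    row["diffs"] = ops
    return row


sample = ("133350794\t11406585\t0\t"
          "'National security and homeland security presidential directive'\t"
          "1180070386\tNone\tFalse\t308437\tu'Badagnani'\t"
          "613:-1:u'signed'\t613:1:u'a U.S. federal law passed'")
row = parse_row(sample)
print(row["rev_id"], row["comment"], row["diffs"])
# The timestamp is a Unix epoch value in seconds, so it converts directly.
print(datetime.fromtimestamp(row["timestamp"], tz=timezone.utc).isoformat())
```

Splitting on tabs works because repr() writes any tab inside a value as the two characters \t, so literal tabs only occur between columns.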

Reproduction

Install Hadoop, WikiHadoop and the differ.
- beta.wikilytics.org, gamma.wikilytics.org and delta.wikilytics.org (managed by Diederik van Liere) have Hadoop 0.21, WikiHadoop 0.1 and the differ installed.
Log in to the Hadoop master node.
Download the Wikipedia dump files compressed in bz2 from the dump distribution site. Make sure to choose the dumps with full edit histories (pages-meta-historyN.xml.bz2).
- For the 20110405 dumps (the source of the dataset being generated): [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15]
Copy the dump files into HDFS using /usr/lib/hadoop-beta/bin/hdfs dfs -copyFromLocal enwiki*.xml

Launch a Hadoop job for each dump file using the command below.

screen -S j01diffs /usr/lib/hadoop-beta/bin/hadoop jar hadoop-0.22-streaming.jar -Dmapreduce.task.timeout=0 -Dmapred.reduce.tasks=0 -Dmapreduce.input.fileinputformat.split.minsize=290000000 -D mapreduce.map.output.compress=true -input /enwiki-20110405-pages-meta-history1.xml.bz2 -output /usr/hadoop/out-01 -mapper ~/wikihadoop/diffs/revision_differ.py -inputformat org.wikimedia.wikihadoop.StreamWikiDumpInputFormat
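
Since one job is launched per dump file, the command can be generated rather than typed by hand. The Python sketch below only prints the command lines. The flags are copied from the command above, while the numbering of the screen session and output directory for files other than the first (two-digit, zero-padded) is an assumption, as is the helper name build_job_command.

```python
HADOOP = "/usr/lib/hadoop-beta/bin/hadoop"


def build_job_command(n):
    """Return the argument list for the streaming job on dump file number n."""
    return [
        "screen", "-S", "j%02ddiffs" % n,  # session name pattern is assumed
        HADOOP, "jar", "hadoop-0.22-streaming.jar",
        "-Dmapreduce.task.timeout=0",
        "-Dmapred.reduce.tasks=0",
        "-Dmapreduce.input.fileinputformat.split.minsize=290000000",
        "-D", "mapreduce.map.output.compress=true",
        "-input", "/enwiki-20110405-pages-meta-history%d.xml.bz2" % n,
        "-output", "/usr/hadoop/out-%02d" % n,  # output pattern is assumed
        "-mapper", "~/wikihadoop/diffs/revision_differ.py",
        "-inputformat", "org.wikimedia.wikihadoop.StreamWikiDumpInputFormat",
    ]


# Print the commands for the first three dump files; adjust the range to the
# number of dump files actually downloaded.
for n in range(1, 4):
    print(" ".join(build_job_command(n)))
```

The arguments are joined without shell quoting so that ~ in the mapper path is still expanded when the printed line is pasted into a shell.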

With 3 nodes and 24 cores in total, one dump file of EN wiki takes approximately 20-24 hours to process.

If you want to extract the dataset as an ordinary file, concatenate the dataset rows into one file (diffs.tsv) using /usr/lib/hadoop-beta/bin/hdfs dfs -cat /usr/hadoop/out-*/part-* > diffs.tsv.
- There are some duplicates in the results [16]. To exclude them, use /usr/lib/hadoop-beta/bin/hdfs dfs -cat /usr/hadoop/out-*/part-* | sort -n -k2 -k1 -u -T ~/tmp/ > diffs.tsv instead. Note that ~/tmp must be a directory with enough free space to hold all the results; their size is shown by /usr/lib/hadoop-beta/bin/hdfs dfs -du /usr/hadoop/out-*/part-*.
- This may take from several hours to one day depending on the size. It will be more than 400 GB for EN wiki.
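
For checking the deduplication on a small sample, the effect of the sort command can be approximated in Python: order rows by page_id and then rev_id, and keep one row per pair. This is a sketch for samples that fit in memory, not a replacement for the external sort on the full output; row_key and dedupe are hypothetical names.

```python
def row_key(line):
    """Return (page_id, rev_id) as integers from one tab-separated row."""
    rev_id, page_id = line.split("\t", 2)[:2]
    return int(page_id), int(rev_id)


def dedupe(lines):
    """Yield rows ordered by (page_id, rev_id), keeping one row per key."""
    previous = None
    for line in sorted(lines, key=row_key):
        key = row_key(line)
        if key != previous:
            yield line
        previous = key


rows = [
    "30\t7\t0\t'B'",
    "10\t5\t0\t'A'",
    "30\t7\t0\t'B'",  # duplicate of the first row
    "20\t5\t0\t'A'",
]
unique = list(dedupe(rows))
print(len(rows), "rows in,", len(unique), "rows out")
```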

Notes

The dataset being generated has two known problems.

Fewer than 0.02% of revisions (estimated) have duplicated entries. [17]
Some revisions failed to be diffed and are marked with 'diff_fail'.