WSoR datasets/revision diff

From Meta, a Wikimedia project coordination wiki

Location

The diffdb can be downloaded from dumps.wikimedia.org.

Fields

hadoop21@beta:~/wikihadoop/diffs$ /usr/lib/hadoop-beta/bin/hdfs dfs -cat /usr/hadoop/out-10-bzip2/part-00000 | head -n 3
133350337       11406585        0       'National security and homeland security presidential directive'        1180070193      u'Begin'        False   308437  u'Badagnani'    0:1:u"The '''[[National Security and Homeland Security Presidential Directive]]''' (NSPD-51/HSPD-20), signed by President [[George W. Bush]] on May 9, 2007, is a [[Presidential Directive]] giving the [[President of the United States]] near-total control over the United States in the event of a catastrophic event, without the oversight of [[United States Congress|Congress]].\n\nThe signing of this Directive was generally unnoticed by the U.S. media as well as the U.S. Congress. It is unclear how the National Security and Homeland Security Presidential Directive will reconcile with the [[National Emergencies Act]], signed in 1976, which gives Congress oversight during such emergencies.\n\n==External links==\n*[http://www.whitehouse.gov/news/releases/2007/05/20070509-12.html National Security and Homeland Security Presidential Directive], from White House site\n\n==See also==\n*[[National Emergencies Act]]\n*[[George W. Bush]]\n\n{{US-stub}}"
133350707       11406585        0       'National security and homeland security presidential directive'        1180070344      None    False   308437  u'Badagnani'    906:1:u'National Security Directive]]\n*[['
133350794       11406585        0       'National security and homeland security presidential directive'        1180070386      None    False   308437  u'Badagnani'    613:-1:u'signed'        613:1:u'a U.S. federal law passed'

Each row represents a revision from the April 2011 XML dump of the English Wikipedia. There *should* be a row for every revision that had not been deleted when that dump was produced; however, the dataset still needs cleanup to remove duplicates and fill in missing revision diffs.

  • rev_id: The identifier of the revision being described (primary key)
  • page_id: The identifier of the page being revised
  • namespace: The identifier of the namespace of the page
  • title: The title of the page being revised
  • timestamp: The time the revision took place as a Unix epoch timestamp in seconds
  • comment: The edit summary left by the editor
  • minor: Minor status of the edit (boolean)
  • user_id: The identifier of the editor who saved the revision
  • user_text: The username of the editor who saved the revision
  • diffs: Tab-separated diff operations. Each diff operation has three colon-separated parts:
    • position: The position in the article text at which the operation took place
    • action: Whether the operation added or removed text ("1" for add, "-1" for remove)
    • content: The text operated on. For an add, this is the text that was inserted; for a remove, this is the text that was deleted.
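
To illustrate how the three parts of a diff operation fit together, here is a minimal Python sketch that replays a list of operations on a text. It rests on an assumption the page does not confirm: that operations are applied in order and that each position refers to the text as it stands after the preceding operations. The function name apply_diffs and the (position, action, content) tuple layout are hypothetical.

```python
def apply_diffs(text, ops):
    """Replay (position, action, content) operations on text.

    Assumption: operations are applied sequentially, and each position
    indexes into the text produced by the previous operations.
    """
    for position, action, content in ops:
        if action == 1:
            # Add: insert the content at the given position.
            text = text[:position] + content + text[position:]
        elif action == -1:
            # Remove: the content should be present at the given position.
            end = position + len(content)
            if text[position:end] != content:
                raise ValueError("text at %d does not match removed content" % position)
            text = text[:position] + text[end:]
        else:
            raise ValueError("unknown action: %r" % (action,))
    return text


# Same shape as the third sample row: a removal and an addition at one position.
print(apply_diffs("It was signed in 1976", [(7, -1, "signed"), (7, 1, "passed")]))
```

If the differ instead reports every position relative to the previous revision's text, the offsets would need to be adjusted before replaying.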

Each row can have zero or more diff operations. Values in the result set were encoded using Python's repr() function and can be decoded in Python with the eval() function.
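
As a sketch of how a row might be decoded, the following Python function splits a line into the fields listed above and parses each diff operation. It makes several assumptions: that all fields (not only the diff operations) are separated by tabs, that the field order matches the list above, and that ast.literal_eval is an acceptable substitute for eval(). literal_eval only accepts literals, which is safer on downloaded data, and it still accepts the u'...' prefix seen in the sample. The names FIELDS and parse_row are hypothetical, and rows marked 'diff_fail' are not handled because their layout is not shown here.

```python
import ast

# Field order as documented above; the diff operations follow these fields.
FIELDS = ["rev_id", "page_id", "namespace", "title", "timestamp",
          "comment", "minor", "user_id", "user_text"]


def parse_row(line):
    """Decode one tab-separated row into a dict of Python values."""
    parts = line.rstrip("\n").split("\t")
    # Each value is a repr() string, so literal_eval restores int/str/bool/None.
    row = {name: ast.literal_eval(value) for name, value in zip(FIELDS, parts)}
    ops = []
    for raw in parts[len(FIELDS):]:
        if not raw:
            continue  # tolerate a trailing tab
        # Split on the first two colons only; the content may contain colons.
        position, action, content = raw.split(":", 2)
        ops.append((int(position), int(action), ast.literal_eval(content)))
    row["diffs"] = ops
    return row
```

For the third sample row, parse_row would give a "diffs" list with two operations, both at position 613.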

Reproduction

  1. Install Hadoop, WikiHadoop and the differ.
    • beta.wikilytics.org, gamma.wikilytics.org and delta.wikilytics.org (managed by Diederik van Liere) have Hadoop 0.21, WikiHadoop 0.1 and the differ installed.
  2. Log in to the Hadoop master node.
  3. Download the bz2-compressed Wikipedia dump files from the dump distribution site. Make sure to choose the dumps with full edit histories (pages-meta-historyN.xml.bz2).
  4. Copy the dump files into HDFS using /usr/lib/hadoop-beta/bin/hdfs dfs -copyFromLocal enwiki*.xml
  5. Launch a Hadoop job for each dump file using the command below.
    • screen -S j01diffs /usr/lib/hadoop-beta/bin/hadoop jar hadoop-0.22-streaming.jar -Dmapreduce.task.timeout=0 -Dmapred.reduce.tasks=0 -Dmapreduce.input.fileinputformat.split.minsize=290000000 -D mapreduce.map.output.compress=true -input /enwiki-20110405-pages-meta-history1.xml.bz2 -output /usr/hadoop/out-01 -mapper ~/wikihadoop/diffs/revision_differ.py -inputformat org.wikimedia.wikihadoop.StreamWikiDumpInputFormat
      
    • With 3 nodes and 24 cores in total, one English Wikipedia dump file takes approximately 20-24 hours to process.
  6. If you want to extract the dataset as an ordinary file, concatenate the dataset rows into one file (diffs.tsv) using /usr/lib/hadoop-beta/bin/hdfs dfs -cat /usr/hadoop/out-*/part-* > diffs.tsv.
    • There are some duplicates in the results [16]. To exclude them, use /usr/lib/hadoop-beta/bin/hdfs dfs -cat /usr/hadoop/out-*/part-* | sort -n -k2 -k1 -u -T ~/tmp/ > diffs.tsv instead. Note that ~/tmp must be a directory with enough free space to hold all of the results; their size can be checked with /usr/lib/hadoop-beta/bin/hdfs dfs -du /usr/hadoop/out-*/part-*.
    • This may take from several hours to a day, depending on the size. The output is on the order of 400 GB for the English Wikipedia.
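
Because step 5 has to be repeated for every dump file, it may help to generate the commands rather than type them. The Python sketch below only builds and prints the argument list from step 5; it does not run anything. The function name build_job_command, the two-digit output directory numbering and the range of file numbers are assumptions for illustration, so adjust them to the dump files you actually downloaded.

```python
import os
import shlex

HADOOP = "/usr/lib/hadoop-beta/bin/hadoop"


def build_job_command(index, dump_date="20110405"):
    """Return the streaming-job argument list from step 5 for one dump file."""
    return [
        HADOOP, "jar", "hadoop-0.22-streaming.jar",
        "-Dmapreduce.task.timeout=0",
        "-Dmapred.reduce.tasks=0",
        "-Dmapreduce.input.fileinputformat.split.minsize=290000000",
        "-D", "mapreduce.map.output.compress=true",
        "-input", "/enwiki-%s-pages-meta-history%d.xml.bz2" % (dump_date, index),
        # Assumption: output directories are numbered like out-01 in step 5.
        "-output", "/usr/hadoop/out-%02d" % index,
        # Expand ~ here because no shell will do it for an argument list.
        "-mapper", os.path.expanduser("~/wikihadoop/diffs/revision_differ.py"),
        "-inputformat", "org.wikimedia.wikihadoop.StreamWikiDumpInputFormat",
    ]


# Print the commands for the first three dump files (an illustrative range).
for i in range(1, 4):
    print(" ".join(shlex.quote(arg) for arg in build_job_command(i)))
```

Each printed line can be pasted into a screen session as in step 5, or the list can be passed to subprocess.run once the paths have been checked.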

Notes

The generated dataset has two known defects.

  • Fewer than 0.02% of revisions (estimated) have duplicated entries. [17]
  • Some revisions could not be diffed and are marked with 'diff_fail'.
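
For checking the duplicate estimate on a local copy, a streaming pass is enough once duplicates are adjacent. The Python sketch below assumes the rows are tab-separated with rev_id as the first field and that the input has already been sorted so rows with the same rev_id are next to each other. The function names are hypothetical, and 'diff_fail' rows are not treated specially because their layout is not documented here.

```python
from itertools import groupby


def rev_id_of(line):
    """Return the rev_id, assumed to be the first tab-separated field."""
    return int(line.split("\t", 1)[0])


def drop_adjacent_duplicates(lines):
    """Yield the first row of each run of consecutive rows sharing a rev_id."""
    for _rev_id, run in groupby(lines, key=rev_id_of):
        yield next(run)


def duplicated_revision_fraction(lines):
    """Return the fraction of distinct revisions that appear more than once."""
    revisions = 0
    duplicated = 0
    for _rev_id, run in groupby(lines, key=rev_id_of):
        revisions += 1
        if sum(1 for _ in run) > 1:
            duplicated += 1
    return duplicated / revisions if revisions else 0.0
```

Both functions read their input lazily, so they can be given an open file object without loading the whole dataset into memory.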