Data dumps/Other tools

From Meta, a Wikimedia project coordination wiki

Other tools

WikiXRay Python parser

WikiXRay is a Python tool for automatically processing Wikipedia's XML dumps for research purposes.

It also includes a more complete parser that extracts metadata for all revisions and pages in a Wikipedia XML dump compressed with 7-Zip (or another compression format). See the WikiXRay page on Meta for more information.

WikiPrep Perl script

The Wikipedia preprocessor, WikiPrep, is a Perl script that preprocesses raw XML dumps: it builds link tables and category hierarchies, collects anchor text for each article, and so on. A newer SourceForge page with more up-to-date branches is also of interest.

The version described above works only with dumps that are a few years old and is no longer maintained; it WILL break on current dumps. The idea has not been abandoned, however: a version maintained by Tomaz Solc is available under the GPL. This version spawns multiple processes if required to speed up processing.

Wikipedia Dump Reader

This program provides a convenient user interface for reading the compressed text-only XML dumps.

No conversion is needed, only an initial index-construction step. It is written mostly in Python and Qt4, except for a small, very portable piece of bzip2-decompression code in C, so it should run on any PyQt4-enabled platform, although it has been tested only on desktop Linux. Wikicode is reinterpreted by the program, so pages may sometimes display differently than they do with the official PHP parser.

MediaWiki XML Processing

This Python library is a collection of utilities for efficiently processing MediaWiki's XML database dumps. It is intended to address two important concerns: performance and the complexity of streaming XML parsing.
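To illustrate what streaming XML parsing involves, here is a minimal sketch that uses only Python's standard library, not the library described above. The function names `local_name` and `iter_pages` and the tiny `SAMPLE` document are assumptions made for illustration. For pages with several revisions, this sketch keeps only the text of the last revision encountered.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace prefix: '{uri}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page> without loading the whole file."""
    title, text = None, None
    # "end" events fire once an element has been read completely.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's children to save memory


# Hypothetical two-page document standing in for a real dump.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Alpha</title><revision><text>first [[Beta]]</text></revision></page>
  <page><title>Beta</title><revision><text>second</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
```

Because `iter_pages` is a generator, a caller can process one page at a time and stop early; any binary file-like object can be passed in place of the `io.BytesIO` wrapper.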

BzReader (Windows offline reader)

This program allows Windows users to read Wikipedia offline using compressed dumps.

There is a fast built-in full-text search and the Wiki code is interpreted as HTML. You can also navigate between articles just like in online Wikipedia.


For the .bz2 files, use bzip2 to decompress. bzip2 comes standard with most Linux/Unix/Mac OS X systems these days. For Windows you may need to obtain it separately from the link below.

mwdumper can read the .bz2 files directly, but importDump.php requires piping like so: bzip2 -dc pages_current.xml.bz2 | php importDump.php
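A script can also read a .bz2 file without a separate decompression step. The sketch below uses the `bz2` module from Python's standard library; the helper name `iter_dump_lines` and the small demo file are assumptions made for illustration, not part of any tool mentioned here.

```python
import bz2
import os
import tempfile


def iter_dump_lines(path):
    """Yield decompressed text lines from a .bz2 file, one at a time."""
    # Mode "rt" decompresses on the fly and decodes to text.
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield line.rstrip("\n")


# Demo on a tiny stand-in file, since a real dump is very large.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.xml.bz2")
    with bz2.open(demo_path, "wt", encoding="utf-8") as out:
        out.write("<mediawiki>\n<page/>\n</mediawiki>\n")
    demo_lines = list(iter_dump_lines(demo_path))
```

Reading line by line keeps only a small buffer in memory, which matters because decompressed dumps are far larger than the compressed files.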


For the .7z files, you can use 7-Zip or p7zip to decompress. Both are available as free software.

For example:

7za e -so pages_current.xml.7z | php importDump.php

This expands the current pages and pipes them to the importDump.php PHP script.
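The same pipe can be driven from a script. The following is a rough sketch that assumes the `7za` executable is installed and on the PATH; the function names `build_7z_command` and `stream_7z` are hypothetical and exist only for this example.

```python
import subprocess


def build_7z_command(archive_path):
    """Return the argument list for extracting an archive to standard output."""
    # "e" extracts; "-so" writes the extracted data to standard output.
    return ["7za", "e", "-so", archive_path]


def stream_7z(archive_path, chunk_size=1 << 20):
    """Yield decompressed chunks of bytes read from 7za's standard output."""
    proc = subprocess.Popen(build_7z_command(archive_path), stdout=subprocess.PIPE)
    try:
        while True:
            chunk = proc.stdout.read(chunk_size)
            if not chunk:
                break
            yield chunk
    finally:
        proc.stdout.close()
        proc.wait()


command = build_7z_command("pages_current.xml.7z")
```

Passing the command as a list avoids shell quoting problems with unusual file names. The chunks from `stream_7z` can be fed to any consumer that accepts bytes incrementally.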

Even more tools

  • BigDump - A small PHP script for importing very large MySQL dumps (even through web servers with hard runtime limits or Safe mode!)
  • WikiFind - A small program for searching database dumps for user-specified keywords (using regexes). Output is a wiki-formatted list (in a text file) of articles containing the keyword. (Program still under development)
  • Wikihadoop, DiffDB and WikiPride
  • w:Wikipedia:Computer help desk/ParseMediaWikiDump - Perl module for parsing the XML dumps and finding articles in the file with certain properties, e.g. all pages in a given category
  • [1] - Perl script to parse the XML dumps and create files with lists of links, category hierarchies, related articles and other information possibly useful for researchers
  • [2] - a .NET library to parse MySQL dumps and make the resulting information available to calling programs
  • Dictionary Builder - a Java program for generating a list of words and definitions from the XML dumps for one of the Wiktionary projects
  • parse-mediawiki-sql – a Rust library for quickly parsing the SQL dump files with minimal memory allocation
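The kind of keyword search described in the WikiFind entry above can be sketched in a few lines. This is not WikiFind's actual implementation: the functions `find_articles` and `to_wiki_list` and the sample pages are assumptions made for illustration, and the input is taken to be (title, text) pairs already extracted from a dump.

```python
import re


def find_articles(pages, pattern):
    """Return the titles of pages whose text matches the regular expression."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [title for title, text in pages if rx.search(text)]


def to_wiki_list(titles):
    """Format titles as a wikitext bullet list of internal links."""
    return "\n".join(f"* [[{title}]]" for title in titles)


# Hypothetical sample data standing in for pages read from a dump.
sample_pages = [
    ("Alpha", "The quick brown fox"),
    ("Beta", "Lazy dog"),
    ("Gamma", "Foxes are quick"),
]

matches = find_articles(sample_pages, r"\bfox(es)?\b")
wiki_list = to_wiki_list(matches)
```

Since `find_articles` accepts any iterable of pairs, it can consume pages lazily from a streaming parser instead of a list held in memory.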

And still more...

A number of offline readers of Wikipedia have been developed.

A list of alternative parsers and related tools is available for perusal. Some of these are downloaders, some are parsers of the XML dumps and some are meant to convert wikitext for a single page into rendered HTML.

See also this list of data processing tools intended for use with the Wikimedia XML dumps.