Data dumps/Other tools

From Meta, a Wikimedia project coordination wiki

There are three ways to use the compressed data dumps: decompress them, which is time- and memory-consuming; read the compressed files with a general-purpose library, e.g. Python's bz2.BZ2File; or use one of the custom Wikipedia readers or libraries.
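
As a minimal sketch of the second option, the following Python reads a .xml.bz2 dump through the standard-library bz2 module and feeds it to xml.etree.ElementTree.iterparse, so the dump is never decompressed to disk. The function names local_name and iter_page_titles are hypothetical, and the sketch assumes the usual export layout in which each page element has a title child.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_page_titles(path):
    """Yield page titles from a .xml.bz2 dump, decompressing on the fly."""
    with bz2.open(path, "rb") as stream:
        # "end" events fire once an element has been read completely.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if local_name(elem.tag) == "page":
                for child in elem:
                    if local_name(child.tag) == "title":
                        yield child.text
                        break
                # Drop the finished page's children and text to limit memory use.
                elem.clear()
```

Calling list(iter_page_titles("dump.xml.bz2")) on a small dump would collect every title; for a full dump it is probably better to consume the generator lazily.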

Other tools

WikiXRay Python parser

WikiXRay was a Python tool from 2007 for automatically processing Wikipedia's XML dumps for research purposes. The source code seems to be unavailable.

It also included a more complete parser for extracting metadata on all revisions and pages in a Wikipedia XML dump compressed with 7-Zip (or any other format). See the WikiXRay page on Meta for more information.

WikiPrep Perl script

Wikipedia preprocessor (wikiprep.pl) is a Perl script that preprocesses raw XML dumps, builds link tables and category hierarchies, collects anchor text for each article, and so on.

The last update was in 2012, so it only works with very old dumps, and it is not maintained. It is available on SourceForge. This version spawns multiple processes if required to speed up processing.

Wikipedia Dump Reader

This program provides a convenient user interface for reading the text-only compressed XML dumps. It was last updated in 2012.

No conversion is needed, only an initial index-construction step. It is written mostly in Python and Qt4, except for a small, very portable piece of bzip2-decompression C code, so it should run on all PyQt4-enabled platforms, although it has been tested only on desktop Linux. Wikicode is reinterpreted, so pages may sometimes display differently than with the official PHP interpreter.

https://launchpad.net/wikipediadumpreader

MediaWiki XML Processing

This Python library is a collection of utilities for efficiently processing MediaWiki’s XML database dumps. There are two important concerns that this module intends to address: performance and the complexity of streaming XML parsing.

https://pythonhosted.org/mwxml/

MediaWiki SQL Processing

This Python library is a collection of utilities for efficiently processing MediaWiki’s SQL database dumps. It is built to be very similar to mwxml, but for the SQL dumps.

https://pypi.org/project/mwsql/
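
To illustrate the kind of work such a library takes care of, here is a standard-library sketch that pulls rows out of a gzip-compressed SQL dump. It does not use the mwsql API. The names parse_insert_values and iter_rows are hypothetical, and the sketch assumes well-formed mysqldump-style lines of the form INSERT INTO `table` VALUES (...),(...); with backslash-escaped strings. Unquoted values are returned as strings, and NULL becomes None.

```python
import gzip

# MySQL backslash escapes that stand for a different character.
_ESCAPES = {"0": "\0", "n": "\n", "r": "\r", "t": "\t", "Z": "\x1a"}


def parse_insert_values(statement):
    """Yield one tuple per row of an 'INSERT INTO ... VALUES (...),(...);' line."""
    i = statement.index("VALUES") + len("VALUES")
    n = len(statement)
    while i < n:
        if statement[i] != "(":  # skip spaces, commas between rows, final ';'
            i += 1
            continue
        i += 1
        row, buf, quoted = [], [], False
        while True:
            ch = statement[i]
            if ch == "'":
                # Quoted string: copy characters up to the closing quote.
                quoted = True
                i += 1
                while statement[i] != "'":
                    if statement[i] == "\\":
                        i += 1
                        buf.append(_ESCAPES.get(statement[i], statement[i]))
                    else:
                        buf.append(statement[i])
                    i += 1
                i += 1  # step over the closing quote
            elif ch in ",)":
                text = "".join(buf)
                if not quoted and text == "NULL":
                    text = None
                row.append(text)
                buf, quoted = [], False
                i += 1
                if ch == ")":
                    break
            else:
                buf.append(ch)
                i += 1
        yield tuple(row)


def iter_rows(path):
    """Yield row tuples from every INSERT statement in a .sql.gz dump."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as dump:
        for line in dump:
            if line.startswith("INSERT INTO"):
                yield from parse_insert_values(line)
```

Decoding with errors="replace" is a design choice for this sketch: it keeps the loop running if a dump contains byte sequences that are not valid UTF-8, at the cost of altering those values.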

Utilities for processing Wikipedia and Wikidata dumps in Go

gitlab.com/tozd/go/mediawiki is a Go package providing utilities for processing Wikipedia and Wikidata dumps. It supports various types of dumps, e.g., entities JSON dumps, Wikimedia Enterprise HTML dumps, and SQL dumps, and it provides idiomatic Go structs. It can download and process a dump at the same time.

bzip2

For the .bz2 files, use bzip2 to decompress. bzip2 comes standard with most Linux, Unix, and Mac OS X systems. On Windows you may need to obtain it separately.
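
Where the bzip2 program is not at hand, the same decompression can be sketched with Python's standard library. The function name decompress_bz2 and the chunk size are assumptions made for illustration; the copy is done in chunks so that only a small part of the dump is held in memory at any time.

```python
import bz2
import shutil


def decompress_bz2(src_path, dst_path, chunk_size=1024 * 1024):
    """Decompress src_path into dst_path, one chunk at a time."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
```

The decompressed file can be many times larger than the .bz2 file, so it is worth checking free disk space first.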

7-Zip

For the .7z files, you can use 7-Zip to decompress.

For example, the following command expands the current pages and pipes them to the importDump.php PHP script:

7za e -so pages_current.xml.7z | php importDump.php
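
The same piping idea can be sketched in Python when the decompressed stream should feed a custom program instead of importDump.php. The helper name iter_output_lines is hypothetical; it runs any command and yields its standard output line by line, so the expanded XML is never written to disk. The commented usage assumes that 7za is installed and on the PATH.

```python
import subprocess


def iter_output_lines(command):
    """Run command and yield its standard output line by line, as bytes."""
    with subprocess.Popen(command, stdout=subprocess.PIPE) as proc:
        for line in proc.stdout:
            yield line
    # Leaving the 'with' block waits for the process, so returncode is set.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, command)


# Usage sketch (not run here because it needs 7za and a dump file):
# cmd = ["7za", "e", "-so", "pages_current.xml.7z"]
# page_count = sum(1 for line in iter_output_lines(cmd) if b"<page>" in line)
```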

Even more tools

  • BigDump - A small PHP script for importing very large MySQL dumps (even through web servers with hard runtime limits or safe mode)
  • WikiFind - A small program for searching database dumps for user-specified keywords (using regexes). Output is a wiki-formatted list (in a text file) of articles containing the keyword. (Program still under development)
  • Wikihadoop, DiffDB and WikiPride
  • w:Wikipedia:Reference desk/Archives/Computing/Early/ParseMediaWikiDump - Perl module for parsing the XML dumps and finding articles in the file with certain properties, e.g. all pages in a given category
  • Wikipedia-SQl-dump-parser - a .NET library to parse MySQL dumps and make the resulting information available to calling programs
  • Dictionary Builder - a Rust program for generating a list of words and definitions from the XML dumps for one of the Wiktionary projects
  • parse-mediawiki-sql – a Rust library for quickly parsing the SQL dump files with minimal memory allocation
  • Awk and Nim source code examples for processing Wikipedia XML. The Nim example uses an optimized C XML library and was almost 10x faster than the Awk example in comparison tests.
  • wiktionary_dump_to_xml_1 - An attempt to parse some of the wikitext constructs of the German Wiktionary in order to convert them to XML. Written in Java, licensed under the GNU AGPL version 3 or any later version. Archived in 2018.
  • offline-wiki-reader - A shell script for searching Wikipedia index files and extracting single page content straight from the related compressed Wikipedia XML dumps.
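
The index-based lookup that the last item describes can be sketched in Python as follows. This is not the offline-wiki-reader code. It assumes the multistream dump layout, in which the bz2-compressed index file has lines of the form offset:page_id:title and each offset marks the start of an independent bz2 stream holding a batch of pages. The function names are hypothetical, and extract_page uses a plain string match that assumes the title contains no characters needing XML escaping.

```python
import bz2


def find_offset(index_path, title):
    """Return the byte offset of the bz2 stream holding title, or None."""
    with bz2.open(index_path, "rt", encoding="utf-8") as index:
        for line in index:
            # Titles may contain colons, so split on the first two only.
            offset, _page_id, page_title = line.rstrip("\n").split(":", 2)
            if page_title == title:
                return int(offset)
    return None


def read_stream(dump_path, offset, chunk_size=64 * 1024):
    """Decompress only the single bz2 stream that starts at offset."""
    decompressor = bz2.BZ2Decompressor()
    parts = []
    with open(dump_path, "rb") as dump:
        dump.seek(offset)
        while not decompressor.eof:  # stop at the end of this stream
            chunk = dump.read(chunk_size)
            if not chunk:
                break
            parts.append(decompressor.decompress(chunk))
    return b"".join(parts).decode("utf-8")


def extract_page(xml_fragment, title):
    """Return the <page>...</page> block for title, or None if absent."""
    pos = xml_fragment.find(f"<title>{title}</title>")
    if pos == -1:
        return None
    start = xml_fragment.rfind("<page>", 0, pos)
    end = xml_fragment.find("</page>", pos) + len("</page>")
    return xml_fragment[start:end]
```

Because only one stream is decompressed, a single page can be fetched without reading the rest of the dump; the linear scan of the index is the slow part and could be replaced by a prebuilt lookup table.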

And still more...

A number of offline readers of Wikipedia have been developed.

A list of alternative parsers and related tools is available for perusal. Some of these are downloaders, some are parsers of the XML dumps and some are meant to convert wikitext for a single page into rendered HTML.

See also this list of data processing tools from the year 2012 intended for use with the Wikimedia XML dumps.