For this sprint, I produced a python package for efficiently processing Wikipedia dumps. The software maps a function over the pages in a set of XML database dumps. It is...
- easy to work with, because the interface is an iterator over streaming page data that can be looped over,
- memory-efficient, because it takes advantage of stream-reading the XML with a SAX parser, and
- fast, because it allows symmetric multiprocessing of several dump files at a time.
The package provides both a command-line interface and a python library that can be imported and used as an iterator over dump data.
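To illustrate the memory-efficiency claim, here is a minimal sketch of stream-reading pages from a dump. The element names follow the MediaWiki XML export schema, but the function name and the iterator shape are illustrative assumptions, not the package's actual API; the sketch uses ElementTree's streaming `iterparse`, which gives the same constant-memory behavior as a SAX handler while being easier to turn into a generator.

```python
# Illustrative sketch only: stream pages from a MediaWiki-style XML dump
# without loading the whole file into memory. iter_pages is a hypothetical
# name, not the package's real interface.
import io
import xml.etree.ElementTree as ET

def iter_pages(xml_file):
    """Yield (title, [revision texts]) pairs one page at a time."""
    title, revisions = None, []
    for event, elem in ET.iterparse(xml_file, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # strip any XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            revisions.append(elem.text or "")
        elif tag == "page":
            yield title, revisions
            title, revisions = None, []
            elem.clear()  # discard the processed subtree to keep memory flat

# Tiny in-memory stand-in for a real dump file.
dump = io.StringIO(
    "<mediawiki><page><title>A</title>"
    "<revision><text>first</text></revision>"
    "<revision><text>second</text></revision>"
    "</page></mediawiki>"
)
pages = list(iter_pages(dump))  # [("A", ["first", "second"])]
```

Because each `<page>` subtree is cleared as soon as it has been yielded, memory use stays roughly constant regardless of dump size.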
Since python uses a Global Interpreter Lock, threading cannot take advantage of multiple cores on a processing machine. To circumvent this problem, the multiprocessing package mimics threads via a process-forking interface. Through this interface, primitive thread-safety mechanisms can be used to pass messages between the processes.
This package creates a "Processor" for each available core on the client machine and publishes a queue of dump files for the processors to consume. Each processor's output is then serialized via a central output queue into a generator that can be used by the main process.
The resulting system can be passed a function for processing each page (and its revisions).
Results and discussion
With this system, the revert graph of the January 2011 dump of the English Wikipedia was produced in 20 hours. A single-process system would have taken about a week.