For this sprint, I produced a python package for efficiently processing Wikipedia dumps. The software maps a function over the pages in a set of XML database dumps. It is...
- easy to work with, because the interface is an iterator over streaming page data that can be looped over,
- memory-efficient, because it takes advantage of stream-reading the XML with a SAX parser, and
- fast, because it allows symmetric multiprocessing of several dump files at a time.
The package provides both a command-line interface and a python library that can be imported and used as an iterator over dump data.
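To illustrate the memory-efficiency claim, here is a minimal sketch of stream-reading pages from a dump. The element names follow the MediaWiki XML export schema, but the function name and the iterator shape are illustrative assumptions, not the package's actual API; the sketch uses ElementTree's streaming `iterparse`, which gives the same constant-memory behavior as a SAX handler while being easier to turn into a generator.

```python
# Illustrative sketch only: stream pages from a MediaWiki-style XML dump
# without loading the whole file into memory. iter_pages is a hypothetical
# name, not the package's real interface.
import io
import xml.etree.ElementTree as ET

def iter_pages(xml_file):
    """Yield (title, [revision texts]) pairs one page at a time."""
    title, revisions = None, []
    for event, elem in ET.iterparse(xml_file, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # strip any XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            revisions.append(elem.text or "")
        elif tag == "page":
            yield title, revisions
            title, revisions = None, []
            elem.clear()  # discard the processed subtree to keep memory flat

# Tiny in-memory stand-in for a real dump file.
dump = io.StringIO(
    "<mediawiki><page><title>A</title>"
    "<revision><text>first</text></revision>"
    "<revision><text>second</text></revision>"
    "</page></mediawiki>"
)
pages = list(iter_pages(dump))  # [("A", ["first", "second"])]
```

Because each `<page>` subtree is cleared as soon as it has been yielded, memory use stays roughly constant regardless of dump size.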
Since python uses a Global Interpreter Lock, threading cannot take advantage of multiple cores on a processing machine. To circumvent this problem, the multiprocessing package mimics threads via a process-forking interface. Through this interface, primitive thread-safety mechanisms can be used to pass messages between the processes.
This package creates a "Processor" for each available core on the client machine and publishes a queue of dump files for the processors to consume. Each processor's output is then serialized via a central output queue into a generator that can be used by the main process.
The resulting system can be passed a function for processing each page (and its revisions).
Results and discussion
With this system, the revert graph of the January 2011 dump of the English Wikipedia was produced in 20 hours. A single-process system would have taken about a week.