WikiXRay/Python parser

WikiXRay includes a Python parser for processing the XML dumps of Wikipedia and extracting relevant information.

Currently, there are 2 different versions of the parser:

The first one is dump_sax.py. It can be used as an alternative to other importing tools, such as mwdumper. Right now, it can import the information required to load the page, revision and text database tables, which makes it possible to import any language edition.

The second version is dump_sax_research.py. It loads a special version of the page and revision database tables, including additional information such as the length, number of letters, number of words, etc. of every revision processed from the compressed XML dump file.
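To give an idea of the kind of per-revision metrics mentioned above, here is a minimal sketch. It is not taken from dump_sax_research.py: the function name revision_stats and the exact definitions (characters for length, alphabetic characters for letters, whitespace-separated tokens for words) are assumptions made for illustration only.

```python
def revision_stats(text):
    """Return simple size metrics for the text of one revision.

    Assumed definitions (the research parser may use different ones):
    - length:  total number of characters
    - letters: number of alphabetic characters
    - words:   number of whitespace-separated tokens
    """
    return {
        "length": len(text),
        "letters": sum(1 for ch in text if ch.isalpha()),
        "words": len(text.split()),
    }


# Example: metrics for a tiny revision text
stats = revision_stats("Hello world 42")
```

In the research tables, values like these would be stored in extra columns next to each revision.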

Please note that you can use either the complete time-lang-pages-meta-history.xml or the lighter time-lang-pages-meta-current.xml for the import process. You can also use either a compressed or a decompressed XML file, as long as the compression software can decompress directly to the standard output stream.

System Requirements

Initially, I thought that the parser was compatible with any Linux platform providing Python, MySQL and the mysqldb module (if you want to use the --monitor mode).

Unfortunately, I later discovered some incompatibilities regarding UTF-8 encoding in older versions of Python (2.4.1).

So, here are the system requirements that guarantee a correct execution of the parser:

MySQL 5.0.x or higher.
Python 2.5.1 or higher (for example, the one in Ubuntu 7.04 or higher).
mysqldb module (Debian package python-mysqldb) 1.2.1 or higher

MySQL tables layout

The current version of the parser has been developed using SAX.
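For readers unfamiliar with SAX, the following is a minimal sketch of an event-driven handler over a dump-like XML stream. It is not the parser's code: the class DumpCounter, the function count_dump and the tiny sample document are hypothetical, and it is written for a current Python 3 interpreter rather than the Python 2.5 mentioned above.

```python
import io
import xml.sax


class DumpCounter(xml.sax.ContentHandler):
    """Counts <page> and <revision> elements and collects page titles."""

    def __init__(self):
        super().__init__()
        self.pages = 0
        self.revisions = 0
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "page":
            self.pages += 1
        elif name == "revision":
            self.revisions += 1
        elif name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # SAX may deliver the text of one element in several chunks
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


def count_dump(stream):
    """Parse an XML byte stream and return the filled handler."""
    handler = DumpCounter()
    xml.sax.parse(stream, handler)
    return handler


# Hypothetical, heavily simplified dump fragment
SAMPLE = (
    b"<mediawiki><page><title>Udin</title>"
    b"<revision><id>1</id><text>a</text></revision>"
    b"<revision><id>2</id><text>b</text></revision>"
    b"</page></mediawiki>"
)
result = count_dump(io.BytesIO(SAMPLE))
```

Because SAX processes the document as a stream of events, the whole dump never has to fit in memory. To read from a pipe, the same function could be called with sys.stdin.buffer instead of the in-memory sample.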

If you want to use the standard version, dump_sax.py, to import an XML dump of any language edition of Wikipedia, you should be able to use the standard definitions of the page, revision and text tables in MediaWiki. However, as MediaWiki keeps adding features, you can find the most up-to-date version that this parser uses here.

In case you use the dump_sax_research.py version, the information it extracts follows this database layout.

As you can see, many indexes have been removed from the SQL definition of the tables.

I strongly recommend that you don't create more indexes. First decompress, parse and load the data into these simplified tables; then you can create additional indexes as needed. Too many indexes (especially on long varchar fields) can slow down the loading process a lot, because MySQL has to update every specified index for each new entry, which can take a long time.
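The "load first, index later" order can be illustrated with a small self-contained sketch. To keep it runnable without a MySQL server, it uses Python's built-in sqlite3 module as a stand-in. The simplified three-column revision table and the index name idx_rev_page are assumptions for illustration, not the layout used by the parser.

```python
import sqlite3


def load_then_index(rows):
    """Bulk-load rows into an index-free table, then add the index."""
    conn = sqlite3.connect(":memory:")
    # 1. Create the table without any secondary index
    conn.execute(
        "CREATE TABLE revision (rev_id INTEGER, rev_page INTEGER, rev_len INTEGER)"
    )
    # 2. Load all the data; no index has to be updated per inserted row
    conn.executemany("INSERT INTO revision VALUES (?, ?, ?)", rows)
    # 3. Only now build the index, in a single pass over the loaded data
    conn.execute("CREATE INDEX idx_rev_page ON revision (rev_page)")
    conn.commit()
    return conn


conn = load_then_index([(1, 10, 120), (2, 10, 150), (3, 11, 80)])
count = conn.execute(
    "SELECT COUNT(*) FROM revision WHERE rev_page = 10"
).fetchone()[0]
```

On MySQL, the third step would be a CREATE INDEX or ALTER TABLE ... ADD INDEX statement executed after the import has finished.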

Parser Usage

IMPORTANT NOTE: The following explanation applies to both the standard and research versions of the parser. The commands provided in this section always refer to the standard version, dump_sax.py, though you can use dump_sax_research.py in exactly the same way.

The parser source code for the standard version can be obtained here.
The parser source code for the research version can be downloaded from here.
You also need this module for database access, because the parser imports it for use with the --monitor mode.

You can execute it with this command, if you want the parser to create SQL files as a result of the parsing process. You can later import that SQL code into MySQL:

  7za e -so downloaded-dump-pages-meta-history.xml.7z | 
   python dump_sax.py

You can also ask the parser to create an encoded output SQL stream suitable to be directly imported into MySQL:

  7za e -so downloaded-dump-pages-meta-history.xml.7z | 
   python dump_sax.py --streamout | mysql -u myuser -pmypasswd mydb

However, this option is problematic with very big dumps (those among the top 20 dumps by database size). MySQL can easily give you a timeout error, because the server is too busy processing the parser's requests.

The safest way to import such big dumps is to use the monitor mode, enabled with the --monitor option. This way, the parser catches the exception produced by the server timeout and retries the insert:

  7za e -so downloaded-dump-pages-meta-history.xml.7z | 
   python dump_sax.py --monitor -u user -p passwd -d db_name
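The catch-and-retry behaviour of the monitor mode can be sketched roughly as follows. This is not the parser's actual implementation: insert_with_retry, its parameters and the simulated flaky_server are hypothetical, and a real version would catch the specific error class raised by the database module instead of a generic Exception.

```python
import time


def insert_with_retry(execute, statement, retries=5, delay=1.0, sleep=time.sleep):
    """Call execute(statement); on failure wait and retry.

    Makes at most `retries` attempts (retries must be >= 1) and re-raises
    the last error if every attempt fails.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return execute(statement)
        except Exception as err:  # assumption: any exception means "try again"
            last_error = err
            if attempt < retries:
                sleep(delay * attempt)  # wait a little longer after each failure
    raise last_error


# Simulated server that times out twice before accepting the insert
attempts = []


def flaky_server(statement):
    attempts.append(statement)
    if len(attempts) < 3:
        raise RuntimeError("server timeout")
    return "ok"


outcome = insert_with_retry(flaky_server, "INSERT ...", sleep=lambda seconds: None)
```

Passing the sleep function as a parameter is only a convenience that lets the example run instantly; in practice the default time.sleep would be used.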

Execution Options

Starting from this version, there are no mandatory arguments. Instead, there are many optional flags and arguments to tweak the parser's features.

For a detailed description of the current (and future) available options, type:

  python dump_sax.py -h

Or:

  python dump_sax.py --help

It should display an output similar to this one:

How to use the parser, step by step:

0. You will need to install the 7zip compression program if it is not already installed on your system.

1. Create a new MySQL database, then use the SQL code above to generate the page and revision tables.

2. Download the .7z version of the complete database dump of the language edition you want to process (pages-meta-history.xml.7z).

3. Execute the above commands, with the appropriate options if needed.

Example:

  7za e -so  furwiki-20070405-pages-meta-history.xml.7z | 
   python dump_sax.py --pagefile=furwiki_page.sql --revfile=furwiki_rev.sql

Example (with bzip2):

  bzip2 -d -c  furwiki-20070405-pages-meta-history.xml.bz2 | 
   python dump_sax.py --pagefile=furwiki_page.sql --revfile=furwiki_rev.sql

If verbose mode is active (it is enabled by default), the parser prints, every 1000 completed revisions, the average processing speed for pages and revisions.

4. If you have generated .sql files instead of an SQL stream for direct import into MySQL, you can import the .sql files into MySQL, for example:

  mysql -u phil -pmypassw test < rev_file.sql
  mysql -u phil -pmypassw test < page_file.sql

And you're done.

Possible problems

The current version builds extended inserts to speed up the process of importing the SQL code into MySQL. You can adjust the maximum length and maximum number of rows of each extended insert with --insertmaxsize=MAXSIZE and --insertmaxrows=MAXROWS.
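As an illustration of how two such limits can interact, here is a minimal sketch that groups rows into extended INSERT statements. It is not the parser's code: build_extended_inserts is a hypothetical function, the rows are assumed to be already escaped value tuples, and the size limit is assumed to be measured in characters of the final statement.

```python
def build_extended_inserts(table, rows, max_rows=3, max_size=1024):
    """Yield extended INSERT statements built from pre-escaped row tuples.

    rows: iterable of strings such as "(1,'a')".
    A statement is closed when adding one more row would exceed max_rows
    or would make the statement longer than max_size characters.
    A single row longer than max_size is still emitted on its own.
    """
    prefix = "INSERT INTO %s VALUES " % table
    batch = []
    size = len(prefix)
    for row in rows:
        extra = len(row) + 1  # +1 for the separating comma or final semicolon
        if batch and (len(batch) >= max_rows or size + extra > max_size):
            yield prefix + ",".join(batch) + ";"
            batch = []
            size = len(prefix)
        batch.append(row)
        size += extra
    if batch:
        yield prefix + ",".join(batch) + ";"


rows = ["(1,'a')", "(2,'b')", "(3,'c')", "(4,'d')"]
by_rows = list(build_extended_inserts("page", rows, max_rows=3))
by_size = list(build_extended_inserts("page", rows, max_rows=100, max_size=40))
```

Larger statements mean fewer round trips to the server, but a statement must still fit within the limits the server accepts, which is why both values are adjustable.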

Stopping and resuming the parsing task

For now, the parser doesn't support stopping and resuming from a certain point of the whole process, which may run for days on large wikis. If your parsing process has to be interrupted (by choice or by external causes), you can use Rxr_for_WikiXRay, which helps you resume the process in a short time without modifying the dump or the WikiXRay parser. See its page for source code and details.

Performance results

See WikiXRay/Performance for performance results of WikiXRay on different platforms.