Data dumps/Tools for importing

From Meta, a Wikimedia project coordination wiki

This list of tools for importing the XML dumps is not comprehensive. If you spot one in the wild that's not described here, please add it!

Converting to SQL first

If you are using MySQL and you are working with the dumps from a relatively large project, you will want to convert them to SQL first, and then import those into your database. This will require that you either load in the other SQL tables we provide, or that you rebuild that data using the rebuildall.php maintenance script provided with MediaWiki.

If you are working with only a small subset of pages, this solution is not ideal for you, because you will end up with a lot of extra page- and revision-related information (such as links) that won't be valid.

It's probably better to first read all rows into a table without any keys at all, and then create the needed keys afterwards. It is much faster to do the necessary sorting once than to update the keys for each inserted row. Running "ALTER TABLE page DISABLE KEYS" is not enough, because unique keys are still updated and checked for uniqueness on each inserted row.

  • One way would be to edit the first part of the SQL file, which creates the table, before using it.
  • Another way, which gives better control, would be to write your own program to parse the SQL file and insert the data into the database. See tools:~byrial/wikidata-programs/read_page_table.c for an example. That program inserts only some of the columns into the table, and only selected rows depending on the namespace.
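
The first approach can be sketched in Python. This is a minimal illustration, not an existing tool: the function names split_keys and add_keys_statement are hypothetical, and the code assumes the CREATE TABLE statement is laid out the way mysqldump writes it, with one column or key definition per line. It removes the key definitions so that rows can be loaded into a keyless table, and builds an ALTER TABLE statement that creates the keys afterwards. Note that MySQL requires an AUTO_INCREMENT column to be a key, so such a column would need extra handling that this sketch does not attempt.

```python
import re

# Matches the key definitions that mysqldump writes inside CREATE TABLE
# (assumption: one definition per line, as in mysqldump output).
KEY_LINE = re.compile(r"^\s*(PRIMARY KEY|UNIQUE KEY|KEY)\b")


def split_keys(create_stmt):
    """Return (CREATE TABLE text without keys, list of removed key definitions)."""
    kept, keys = [], []
    for line in create_stmt.splitlines():
        if KEY_LINE.match(line):
            keys.append(line.strip().rstrip(","))
        else:
            kept.append(line)
    # The last column definition must not end with a comma once the
    # key lines that used to follow it are gone.
    for i in range(len(kept) - 1, 0, -1):
        if kept[i].lstrip().startswith(")"):
            kept[i - 1] = kept[i - 1].rstrip().rstrip(",")
            break
    return "\n".join(kept), keys


def add_keys_statement(table, keys):
    """Build one ALTER TABLE statement that creates all removed keys at once."""
    clauses = ",\n  ".join("ADD " + key for key in keys)
    return "ALTER TABLE `%s`\n  %s;" % (table, clauses)


if __name__ == "__main__":
    # Shortened, illustrative table definition; the real one has more columns.
    sample = """CREATE TABLE `page` (
  `page_id` int unsigned NOT NULL,
  `page_namespace` int NOT NULL DEFAULT '0',
  `page_title` varbinary(255) NOT NULL DEFAULT '',
  PRIMARY KEY (`page_id`),
  UNIQUE KEY `name_title` (`page_namespace`,`page_title`)
) ENGINE=InnoDB DEFAULT CHARSET=binary;"""
    keyless, keys = split_keys(sample)
    print(keyless)                           # run this before loading the rows
    print(add_keys_statement("page", keys))  # run this after loading the rows
```

Creating all keys in a single ALTER TABLE statement lets the server sort the data once after the load instead of maintaining every index row by row.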

Converting XML files to SQL

These tools produce SQL files that can then be imported into your database, e.g. with the mysql command-line client.

  • mwdumper - Java tool, download it here and see also the manual.
  • mwxml2sql - C program for *nix platforms
  • mwimport - Perl script, needs editing by hand for non-English-language projects, source here
  • mwdum.py - Python tool with a low memory footprint and mediocre speed so far. Includes "parentid" (which mwdumper seems not to do) and has shown no Unicode problems so far.
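
To give an idea of what such a converter has to do on the input side, here is a minimal Python sketch of reading an XML dump as a stream, so that memory use stays low even for large files. It is not the code of any of the tools above. The function name iter_pages is hypothetical, and the sketch assumes the usual export layout of page elements containing title, ns, id and one or more revision elements; it covers only the reading half and leaves the SQL generation out.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield one dict per <page>, reading the dump incrementally."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        page = {"title": None, "ns": None, "id": None, "revisions": []}
        for child in elem:
            name = local(child.tag)
            if name in ("title", "ns", "id"):
                page[name] = child.text
            elif name == "revision":
                # Keep only a few revision fields, including parentid.
                wanted = ("id", "parentid", "timestamp")
                rev = {local(c.tag): c.text for c in child if local(c.tag) in wanted}
                page["revisions"].append(rev)
        yield page
        elem.clear()  # free the text of pages that were already handled


if __name__ == "__main__":
    # Tiny hand-written stand-in for a real dump file.
    demo = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
      <page><title>Example</title><ns>0</ns><id>7</id>
        <revision><id>42</id><parentid>41</parentid>
          <timestamp>2015-01-01T00:00:00Z</timestamp><text>Hello</text>
        </revision>
      </page>
    </mediawiki>"""
    for page in iter_pages(io.BytesIO(demo)):
        print(page)
```

For a compressed dump, a file object opened with bz2.open(path, "rb") from the standard library can be passed in place of the io.BytesIO object.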

Converting SQL files to tab-delimited files

The output of these tools is intended for use with LOAD DATA INFILE for MySQL databases.
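
As an illustration of this kind of conversion, the following Python sketch parses the rows of one INSERT statement and writes them out in the default format that LOAD DATA INFILE expects: tab-separated fields, backslash escapes, and \N for NULL. The function names parse_rows and to_tsv_line are hypothetical. The sketch assumes mysqldump-style statements of the form INSERT INTO `t` VALUES (...),(...); in which quotes inside strings are backslash-escaped, and it treats the dump as text.

```python
# Backslash escapes used inside quoted strings in mysqldump output.
UNESCAPE = {"0": "\0", "n": "\n", "r": "\r", "t": "\t", "Z": "\x1a"}


def parse_rows(insert_stmt):
    """Yield each row of an INSERT ... VALUES statement as a list.

    Fields are returned as strings; an unquoted NULL becomes None.
    """
    s = insert_stmt
    i = s.index("VALUES") + len("VALUES")
    row, field, in_str, quoted = None, [], False, False
    while i < len(s):
        ch = s[i]
        if in_str:
            if ch == "\\":
                nxt = s[i + 1]
                field.append(UNESCAPE.get(nxt, nxt))
                i += 1  # skip the escaped character as well
            elif ch == "'":
                in_str = False
            else:
                field.append(ch)
        elif ch == "'":
            in_str = quoted = True
        elif ch == "(" and row is None:
            row = []
        elif ch in ",)" and row is not None:
            text = "".join(field)
            if not quoted:
                text = text.strip()
                text = None if text.upper() == "NULL" else text
            row.append(text)
            field, quoted = [], False
            if ch == ")":
                yield row
                row = None
        elif row is not None:
            field.append(ch)
        i += 1


def to_tsv_line(row):
    """Format one row for LOAD DATA INFILE with its default options."""
    out = []
    for value in row:
        if value is None:
            out.append("\\N")
        else:
            value = value.replace("\\", "\\\\")
            out.append(value.replace("\t", "\\t").replace("\n", "\\n"))
    return "\t".join(out)


if __name__ == "__main__":
    stmt = "INSERT INTO `page` VALUES (1,0,'Main_Page','',0),(2,1,'It\\'s',NULL,1);"
    for row in parse_rows(stmt):
        # Assumption for this demo: the second column holds the namespace.
        if row[1] == "0":
            print(to_tsv_line(row))
```

The namespace test in the demo shows where a filter on selected rows could go; which column holds the namespace depends on the table being converted.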

Importing directly into your database

If time is not an issue or you are dealing with a very small project or a subset of pages for import, you can try importing directly into the database. This method generally means that rows in related tables will be populated as information for each revision is imported, but it is much slower than using the SQL files we provide for download.

Tools for importing directly from the XML files to your database:

  • importDump.php -- maintenance script that comes with MediaWiki and is therefore always current. See also the MediaWiki manual.

Importing into Elasticsearch

The Wikiparse tool can import the bz2-compressed dump directly into Elasticsearch, with a number of convenient analyzers set up for text searching.

Making the imported wiki functional

If you want not only to import the data but also to use the resulting wiki (e.g. for a backup restore or a wiki migration), you have to take several additional steps. (Please expand this list.)

  • If you don't have all the private user data (emails, passwords, preferences) you have to pull the scattered pieces together in some way and it's easy to make a mess.
    • mw:Extension:MediaWikiAuth imports accounts only when users actually need them, but should recover everything.
    • If you re-create the accounts en masse, ensure that the user_id values match.
    • ...
  • ...

Tools from the past

People have been at this for years now. Here are some of the tools that folks have written, for the historical record:

  • xml2sql - cross-platform tool in C for converting XML files to SQL, now several years out of date
  • Perl importing script - Perl script for importing XML files directly into the database, also years out of date