Data dumps/Dump format

Format of the SQL files

These are provided as dumps of entire tables, generated with mysqldump. They start with various commands that set up the character set correctly for the import; they also turn off certain index-related checks for speed. More importantly, however, they contain a DROP TABLE IF EXISTS statement before the INSERTs of the actual data. This means that if you import one of these files into an existing wiki, any data you already had in that table will be lost.

Each INSERT statement packs several thousand rows of data, which makes the import much faster than inserting one row per statement.
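
A typical dump therefore looks roughly like this (a hand-trimmed sketch of mysqldump output for the page table; the column list and version comments are abridged and vary between MediaWiki and mysqldump releases):

  -- Character-set setup and speed-related settings (abridged)
  /*!40101 SET NAMES utf8mb4 */;
  /*!40014 SET UNIQUE_CHECKS=0, FOREIGN_KEY_CHECKS=0 */;

  -- Any existing table of the same name is dropped before re-creation
  DROP TABLE IF EXISTS `page`;
  CREATE TABLE `page` (
    `page_id` int unsigned NOT NULL AUTO_INCREMENT,
    `page_namespace` int NOT NULL,
    `page_title` varbinary(255) NOT NULL,
    -- ... further columns omitted ...
    PRIMARY KEY (`page_id`)
  );

  -- Index maintenance is disabled around the bulk inserts for speed
  /*!40000 ALTER TABLE `page` DISABLE KEYS */;
  INSERT INTO `page` VALUES (10,0,'AccessibleComputing', ...),(12,0,'Anarchism', ...);
  /*!40000 ALTER TABLE `page` ENABLE KEYS */;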

Format of the XML files

The main page data is provided in the same XML wrapper format that Special:Export produces for individual pages. The format is fairly self-explanatory, but it is also documented at Help:Export#Export_format.
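
A heavily trimmed example of that wrapper (the title, ids, and timestamp here are invented, and the schema version in the xmlns attribute varies between dumps):

  <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
    <siteinfo> ... </siteinfo>
    <page>
      <title>Example</title>
      <ns>0</ns>
      <id>12345</id>
      <revision>
        <id>67890</id>
        <timestamp>2010-01-31T12:00:00Z</timestamp>
        <contributor><username>ExampleUser</username></contributor>
        <text xml:space="preserve">Wiki text of this revision...</text>
      </revision>
    </page>
  </mediawiki>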

Three sets of page data are produced for each dump; choose the one that fits your needs:

  • pages-articles.xml
    • Contains the current version of all articles, templates, and other content pages.
    • Excludes discussion pages ('Talk:') and user "home" pages ('User:').
    • Recommended for republishing content.
  • pages-meta-current.xml
    • Contains the current version of all pages, including discussion and user "home" pages.
  • pages-meta-history.xml
    • Contains the complete text of every revision of every page (can be very large!).
    • Recommended for research and archival use.

The XML itself contains the complete, raw text of every revision, so the full-history files in particular can be extremely large; the raw January 2010 dump of en.wikipedia.org is about 5.87e12 bytes (5.34 TiB). Currently we compress these XML streams with bzip2 (.bz2 files) and, for the full-history dump, additionally with 7-Zip (.7z files).
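
Files this size should be processed as a stream rather than decompressed or loaded into memory in full. A minimal Python sketch that reads a bzip2-compressed pages-articles file incrementally; the file name is a placeholder, and the schema version in the namespace must match your dump:

  import bz2
  import xml.etree.ElementTree as ET

  # Namespace of the export schema; the version suffix varies by dump.
  NS = "{http://www.mediawiki.org/xml/export-0.10/}"

  # Decompress and parse incrementally, never holding the whole dump in memory.
  with bz2.open("enwiki-pages-articles.xml.bz2", "rb") as f:
      for event, elem in ET.iterparse(f):
          if elem.tag == NS + "page":
              title = elem.findtext(NS + "title")
              text = elem.findtext(f"{NS}revision/{NS}text") or ""
              print(title, len(text))
              elem.clear()  # free the finished page's subtree as we go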

Several of the tables are also dumped with mysqldump (for the database definition, see the documentation at mw:Category:MediaWiki database tables); the gzip-compressed SQL dumps (.sql.gz) can be read directly into a MySQL database but may be less convenient for other database systems.
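
One way to load such a file without writing a decompressed copy to disk is to stream it into the mysql command-line client. A Python sketch, assuming credentials are configured (e.g. in ~/.my.cnf) and using placeholder database and file names; remember that the dump begins with DROP TABLE IF EXISTS, so any existing data in that table is replaced:

  import gzip
  import subprocess

  # Pipe the decompressed SQL straight into the mysql client.
  # "wikidb" and the file name below are placeholders.
  mysql = subprocess.Popen(["mysql", "wikidb"], stdin=subprocess.PIPE)
  with gzip.open("enwiki-latest-page.sql.gz", "rb") as dump:
      while chunk := dump.read(1 << 20):  # 1 MiB at a time
          mysql.stdin.write(chunk)
  mysql.stdin.close()
  mysql.wait()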

In addition, "stub" dumps with filenames like stub-meta-history.xml.gz, stub-meta-current.xml.gz, and stub-articles.xml.gz, contain header information only for pages and revisions, omitting the actual page content. These contain information like the sha1 sum of each revision text, the redirect target of a page if it has one, and other similar data that is not contained in the page content dumps.

Unicode

The dumps may contain text that is not valid Unicode (UTF-8) in older revisions, due to lenient charset validation in early MediaWiki releases (around 2004). For instance, zhwiki-20130102-langlinks.sql.gz contained some copy-and-pasted ISO 8859-1 "ö" characters; since the langlinks table is regenerated when a page is parsed, a null edit (or a purge with forcelinkupdate) on the affected page was enough to fix it.
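
A quick way to locate such bytes is to attempt a UTF-8 decode of the dump line by line; a minimal Python sketch, using the file mentioned above:

  import gzip

  # Report lines of a gzip-compressed SQL dump that are not valid UTF-8.
  # Note that the multi-row INSERT statements make for very long "lines".
  with gzip.open("zhwiki-20130102-langlinks.sql.gz", "rb") as f:
      for lineno, raw in enumerate(f, 1):
          try:
              raw.decode("utf-8")
          except UnicodeDecodeError as err:
              print(f"line {lineno}: {err}")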