Data dumps/What the dumps are not

From Meta, a Wikimedia project coordination wiki

A few words about what the xml/sql dumps are not.

The dumps are not backups.

  • They contain only public data. No user email addresses, certainly no passwords. No IP addresses of users. Rebuilding from a dump means every user re-registers from scratch with 0 edits. Deleted revisions are not included, so they could not be undeleted later if that were determined to be appropriate. AbuseFilter rules are not dumped, so those would need to be reconstructed from scratch. And so on.

The dumps are not consistent.

  • They do not reflect a consistent state of the database. It's perfectly possible for a dump process to pull metadata from a revision for the first phase of the dump ('stubs') and, when it later tries to get the revision text, to find that the text is no longer available (hidden by administrators). The page table is dumped earlier, and its contents may have changed by the time stubs are retrieved. Consistency would require working with a snapshot of the database itself, locking the database for as long as the entire dump process takes to complete (2 weeks!), or post-processing the generated dumps.
  • They do not necessarily reflect a single version of MediaWiki. In an effort to "fail fast" as the saying goes, we run batches of short jobs in order to produce a single file ('stubs') or batches of files (revision content). If MediaWiki has been deployed in the meantime, the new version will be used for the new files. A change in the code can mean that, for example, uids that were 0 in the revision table are now NULL in the actor table and are written out as <uid />; this change can take place in the middle of a run. (The particular bug is fixed, but there are countless potential examples.)
  • They do not reflect a consistent state of the data, above and beyond the database. MediaWiki does not do all actions atomically. A dump process may request revision content for a page that includes a template that has been altered so that it adds pages to some new category. Until every affected page is rerendered, the category links table may be updated for some of the pages and not for others. Once again there are countless examples.
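The two-phase inconsistency described above can be sketched in a few lines. This is a hypothetical in-memory model, not real dump code: the names (REVISIONS, fetch_stubs, fetch_text) are illustrative assumptions, and the "administrator hides a revision" step stands in for anything that changes the database between the stubs phase and the content phase.

```python
# Illustrative model of the stubs-then-content dump problem.
# All names here are assumptions for the sketch, not MediaWiki APIs.

REVISIONS = {
    101: {"page": "Example", "text": "first text"},
    102: {"page": "Example", "text": "second text"},
}

def fetch_stubs():
    """Phase one: record revision metadata only (no text)."""
    return [{"rev_id": rid, "page": rev["page"]}
            for rid, rev in sorted(REVISIONS.items())]

def fetch_text(rev_id):
    """Phase two: fetch the text, which may have vanished in the meantime."""
    rev = REVISIONS.get(rev_id)
    return rev["text"] if rev else None

stubs = fetch_stubs()

# Between the two phases, an administrator hides revision 102's text.
REVISIONS[102]["text"] = None

dumped = []
for stub in stubs:
    text = fetch_text(stub["rev_id"])
    # The stub exists but the text does not: the two files now disagree.
    dumped.append({**stub, "text": text, "missing": text is None})
```

After the run, the stubs file claims two revisions while the content file can only supply one; nothing in either file explains the mismatch.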

The dumps are not complete.

  • The databases have broken data in them from various problems over the years. There are text records which point to nonexistent storage clusters, half-converted Flow records, and all kinds of other lovely things accumulated in the 15 years the projects have been around. These errors are not accounted for in the dumps; problematic entries are simply skipped, because the xml schema does not contain a means for describing errors encountered with the data itself.
  • No media dumps are produced. Even if we were to produce tarballs of media in use on the projects, there would still be all of the rest of the media on Commons to account for, which is in fact the vast majority of it.
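The skip-on-error behaviour described under "not complete" can be sketched as follows. This is a simplified illustration under assumptions: the record shape and the broken-record check are invented for the example, and the XML fragment is a toy stand-in for the real export schema, which likewise has no element for reporting a damaged entry.

```python
# Illustrative sketch: a dump writer that must silently skip broken records,
# because the output schema has no way to say "this entry was damaged".
# Record shapes and the None-text check are assumptions, not real dump code.
from xml.sax.saxutils import escape

records = [
    {"rev_id": 1, "text": "hello"},
    {"rev_id": 2, "text": None},   # e.g. points at a nonexistent storage cluster
    {"rev_id": 3, "text": "world & more"},
]

def write_dump(records):
    out, skipped = [], []
    for rec in records:
        if rec["text"] is None:
            # Broken record: nothing valid to emit, so it is noted
            # (at best, in a log) and simply left out of the XML.
            skipped.append(rec["rev_id"])
            continue
        out.append("<revision><id>%d</id><text>%s</text></revision>"
                   % (rec["rev_id"], escape(rec["text"])))
    return "\n".join(out), skipped

xml_out, skipped = write_dump(records)
```

A consumer of the resulting XML sees two well-formed revisions and has no way to tell that a third existed but was dropped.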