Data dumps

The Wikimedia Foundation is requesting help to ensure that as many copies as possible are available of all Wikimedia database dumps. Please volunteer to host a mirror if you have access to sufficient storage and bandwidth.

Summary

Description

WMF publishes data dumps of Wikipedia and all WMF projects on a regular basis. English Wikipedia is dumped once a month, while smaller projects are often dumped twice a month.

Content

  • Text and metadata of current or all revisions of all pages as XML files
  • Most database tables as SQL files
    • Page-to-page link lists (pagelinks, categorylinks, imagelinks, templatelinks tables)
    • Lists of pages with links outside of the project (externallinks, iwlinks, langlinks tables)
    • Media metadata (image, oldimage tables)
    • Info about each page (page, page_props, page_restrictions tables)
    • Titles of all pages in the main namespace, i.e. all articles (*-all-titles-in-ns0.gz)
    • List of all pages that are redirects and their targets (redirect table)
    • Log data, including blocks, protection, deletion, uploads (logging table)
    • Misc bits (interwiki, site_stats, user_groups tables)
  • Experimental add/change dumps (no moves or deletes, plus some other limitations), described at https://wikitech.wikimedia.org/wiki/Dumps/Adds-changes_dumps and available from http://dumps.wikimedia.org/other/incr/
  • Stub-prefixed dumps for some projects, which contain only header info for pages and revisions without the actual content
  • Media bundles for each project, separated into files uploaded to the project and files from Commons (images: http://meta.wikimedia.org/wiki/Database_dump#Downloading_Images)
  • Static HTML dumps for 2007-2008: http://dumps.wikimedia.org/other/static_html_dumps/


Download

You can download the latest dumps (for the last year) from dumps.wikimedia.org: dumps.wikimedia.org/enwiki/ for English Wikipedia, dumps.wikimedia.org/dewiki/ for German Wikipedia, etc.

Archives: dumps.wikimedia.org/archive/

Current mirrors offer an alternative to the download page.

Due to large file sizes, using a download tool is recommended.
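As an illustration of what a download tool does for files this large, here is a minimal Python sketch of a resumable download using an HTTP Range request. The function names and the chunk size are hypothetical, and the sketch assumes the server honours Range headers; dedicated download tools handle retries and edge cases far better.

```python
import os
import urllib.request

CHUNK_SIZE = 1 << 20  # read 1 MiB at a time (arbitrary choice)


def build_resume_request(url, dest_path):
    """Return a Request that asks only for the bytes not yet in dest_path."""
    request = urllib.request.Request(url)
    already = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0
    if already:
        # Ask the server to start at the first byte we do not have yet.
        request.add_header("Range", f"bytes={already}-")
    return request


def download(url, dest_path):
    """Download url to dest_path, appending to any partial file."""
    request = build_resume_request(url, dest_path)
    with urllib.request.urlopen(request) as response:
        # 206 means the server honoured the Range header; otherwise start over.
        mode = "ab" if response.status == 206 else "wb"
        with open(dest_path, mode) as out:
            while chunk := response.read(CHUNK_SIZE):
                out.write(chunk)
```

If the local file is already complete, the server will typically answer with an error (416) rather than data, so a real tool should compare sizes or checksums before resuming.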

Data format

XML dumps since 2010 are in the wrapper format described at Export format (schema). Files are compressed in bzip2 (.bz2) and 7-Zip (.7z) formats.
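Because the XML files are large, it usually helps to read them as a stream instead of decompressing them first. The following Python sketch lists page titles from a .bz2 dump; it assumes the usual layout of page elements containing a title element, and it strips the XML namespace so that it does not depend on a particular schema version. The function names are hypothetical.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_page_titles(source):
    """Yield page titles from a bzip2-compressed XML dump, one page at a time.

    `source` may be a file path or a binary file-like object.
    """
    with bz2.open(source, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if local_name(elem.tag) == "page":
                for child in elem:
                    if local_name(child.tag) == "title":
                        yield child.text
                        break
                # Drop the finished page's children to keep memory use low.
                elem.clear()
```

A call such as `for title in iter_page_titles("dump.xml.bz2"): print(title)` would then walk the file page by page; the file name here is only a placeholder.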

SQL dumps are provided as dumps of entire tables, using mysqldump.

Some older dumps exist in various formats.

How to and examples

See the step-by-step examples of importing dumps into a MySQL database.

Existing tools

Available tools are listed in the following locations, but information is not always up-to-date:

Access

All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages.

Support

Maintainer: Ariel Glenn

Mailing list: xmldatadumps-l

Research projects using data from this source


What is this all about?

Wikimedia provides public dumps of our wikis' content:

  • for archival/backup purposes
  • for offline use
  • for academic research
  • for bot use
  • for republishing (don't forget to follow the license terms)
  • for fun!

For up-to-date news about the dumps, please follow the XML Data Dumps mailing list by reading the archives or subscribing; you can also make inquiries there. If you cannot download the dump you want because it no longer exists, or if you have other issues with the files, you can ping the developers on the list.

Warning on time and size

Before attempting to download any of the wikis or their components, PLEASE READ CAREFULLY the time and space information below! Because some file collections run to terabytes, downloads can take days or even weeks. (See also our FAQ on the size of the English language Wikipedia dumps.) Be sure you understand your storage capabilities before attempting downloads. Note (below) that a number of versions are "friendlier" in size and content, and you can tailor a download to your capacity by including or excluding images, talk pages, etc. Reading the information below carefully will save a lot of headaches compared to jumping right into downloads.

What's available and where

It's all explained here: what's available and where you can download it.

How often dumps are produced

All databases are dumped via three groups of processes that run simultaneously. The largest database, enwiki, takes 8 or 9 days for a full run to complete and is run once a month. A second set of 'large' wikis runs in a continuous loop, with the aim of producing dumps for those twice a month; for the rest, we aim for three times a month, also on a rolling basis. Failures in the dump process are generally dealt with by rerunning the portion of the dump that failed. See the wikitech page for more information about the processes and the dump architecture.

Larger databases such as jawiki, dewiki, and frwiki can take a long time to run, especially when compressing the full edit history or creating split stub dumps. If you see a dump seemingly stuck on one of these for a few hours, or days, it's likely not dead, but simply processing a lot of data. You can check that file sizes are increasing or that more revisions are being processed, by reloading the web page for the dump.

The download site shows the status of each dump: if it's in progress, when it was last dumped, etc. A compact version exists too: database dump service progress report.

Feeds for last dump produced

If you're interested in a file, you can subscribe to its RSS feed, so that you know when a new version is produced. No more time spent opening the web page, no more missed dumps, and no more hungry bots without their XML ration.

The URL can be found in the latest/ directory for the wiki (database name) in question: for instance

dumps.wikimedia.org/metawiki/latest/

contains the feed

dumps.wikimedia.org/metawiki/latest/metawiki-latest-pages-meta-history.xml.bz2-rss.xml

for the last *-pages-meta-history.xml.bz2 dump produced.
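The feed URL pattern above can be built and read programmatically. The Python sketch below is only an illustration: the function names are hypothetical, the https:// scheme is an assumption (the URLs above are given without one), and the parser assumes a conventional RSS 2.0 layout with item elements carrying title, link and pubDate.

```python
import xml.etree.ElementTree as ET

BASE = "https://dumps.wikimedia.org"  # scheme is an assumption


def feed_url(wiki, dump_name):
    """Build the feed URL for a wiki (database name) and a dump file suffix."""
    return f"{BASE}/{wiki}/latest/{wiki}-latest-{dump_name}-rss.xml"


def latest_item(rss_document):
    """Return (title, link, pubDate) of the first <item>, or None if absent.

    `rss_document` is the feed content as bytes, or as a string without an
    XML encoding declaration.
    """
    root = ET.fromstring(rss_document)
    item = root.find("./channel/item")
    if item is None:
        return None
    return tuple(item.findtext(tag) for tag in ("title", "link", "pubDate"))
```

For example, `feed_url("metawiki", "pages-meta-history.xml.bz2")` reproduces the feed address shown above, with the assumed scheme in front. A bot could fetch that URL periodically and compare the pubDate with the one it saw last.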

Format of the dump files

The format of the various files available for download is explained here.

Download tools

You can download the XML/SQL files and the media bundles using a web client of your choice, but there are also tools for bulk downloading you may wish to use.

Tools for import

Here's your basic list of tools for importing.

Other tools

Check out and/or add to this partial list of other tools for working with the dumps, including parsers and offline readers.

Producing your own dumps

MediaWiki 1.5 and above includes a command-line maintenance script dumpBackup.php [1] which can be used to produce XML dumps directly, with or without page history.
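As a rough sketch of how the script might be driven from Python, the helper below builds the command line and writes the XML to a file. The --full and --current options select all revisions or only the latest one; the function names, the default directory and the assumption that php is on the PATH are hypothetical, and the exact invocation can differ between MediaWiki versions, so check the script's own help output first.

```python
import subprocess


def dump_backup_command(full_history, mediawiki_dir="."):
    """Build the dumpBackup.php command line (the paths are assumptions)."""
    script = f"{mediawiki_dir}/maintenance/dumpBackup.php"
    # --full exports every revision; --current exports only the latest one.
    mode = "--full" if full_history else "--current"
    return ["php", script, mode]


def write_dump(output_path, full_history=False, mediawiki_dir="."):
    """Run dumpBackup.php and save the XML it prints to output_path."""
    command = dump_backup_command(full_history, mediawiki_dir)
    with open(output_path, "wb") as out:
        # The script writes the XML to standard output, so redirect it.
        subprocess.run(command, stdout=out, check=True)
```

Calling `write_dump("mywiki.xml", full_history=True, mediawiki_dir="/srv/mediawiki")` would then produce a dump with page history; both the output name and the directory are placeholders.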

The programs which manage our multi-database dump process are available in our source repository but would need some tweaking to be used outside of Wikimedia.

You can generate dumps from public wikis using WikiTeam tools.

Step by step importing

We documented the process to set up a small non-English-language wiki with not too many fancy extensions, using the standard MySQL database backend, on a Linux platform. Read the example or add your own.

See also the MediaWiki manual page on importing XML dumps.

Where to go for help

If you have trouble importing the files, or problems with the appearance of the pages after import, check our import issues list.

If you don't find the answer there or you have other problems with the dump files, you can:

  • Ask in #mediawiki on irc.freenode.net, although help is not always available
  • Ask on the xmldatadumps-l (quicker) or the wikitech-l mailing lists.

Alternatively, if you have a specific bug to report:

  • File a bug at Bugzilla under the Product "Datasets"

French speakers, see also fr:Wikipédia:Requêtes XML

FAQ

Some questions come up often enough that we have a FAQ for you to check out.

See also

On the dumps:

On related projects: