- 1 Summary
- 2 What is this all about?
- 3 Warning on time and size
- 4 What's available and where
- 5 How often dumps are produced
- 6 Format of the dump files
- 7 Download tools
- 8 Tools for import
- 9 Other tools
- 10 Producing your own dumps
- 11 Step by step importing
- 12 Where to go for help
- 13 FAQ
- 14 See also
- 15 References
WMF releases data dumps of Wikipedia and all WMF projects on a regular basis.
- Text and metadata of current or all revisions of all pages as XML files
- Most database tables as sql files
- Page-to-page link lists (pagelinks, categorylinks, imagelinks, templatelinks tables)
- Lists of pages with links outside of the project (externallinks, iwlinks, langlinks tables)
- Media metadata (image, oldimage tables)
- Info about each page (page, page_props, page_restrictions tables)
- Titles of all pages in the main namespace, i.e. all articles (*-all-titles-in-ns0.gz)
- List of all pages that are redirects and their targets (redirect table)
- Log data, including blocks, protection, deletion, uploads (logging table)
- Misc bits (interwiki, site_stats, user_groups tables)
- Experimental add/change dumps (no moves or deletes, plus some other limitations): https://wikitech.wikimedia.org/wiki/Dumps/Adds-changes_dumps
- Stub-prefixed dumps for some projects which only have header info for pages and revisions without actual content
- Media bundles for each project, separated into files uploaded to the project and files from Commons
Images: See here
- Static HTML dumps for 2007-2008
Archives: dumps.wikimedia.org/archive/
Current mirrors offer an alternative to the download page.
Due to large file sizes, using a download tool is recommended.
SQL dumps are provided as dumps of entire tables, using mysqldump.
Some older dumps exist in various formats.
How to and examples
See examples of importing dumps into a MySQL database, with step-by-step instructions, here.
Available tools are listed in the following locations, but information is not always up-to-date:
All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages.
- Maintainer: Ariel Glenn
- Mailing list: xmldatadumps-l
- Bug reports: Phabricator Dumps Generation project
- Design work on Dumps 2.0 replacement: Phabricator Dumps Rewrite project
Research projects using data from this source
- "A Breakdown of Quality Flaws in Wikipedia" examines cleanup tags on the English Wikipedia using a January 2011 dump
- "There is No Deadline – Time Evolution of Wikipedia Discussions" looks at the time evolution of Wikipedia discussions, and how it correlates to editing activity, based on 9.4 million comments from the March 12, 2010 dump
- "Understanding collaboration in Wikipedia" mines a complete dump of the English Wikipedia (225 million article edits) for insights into open collaboration
- "Dynamics of Conflicts in Wikipedia" takes the revision history from the dump to extract the reverts based on the text comparison to study the dynamics of editorial wars in multiple language editions
What is this all about?
Wikimedia provides public dumps of our wikis' content:
- for archival/backup purposes
- for offline use
- for academic research
- for bot use
- for republishing (don't forget to follow the license terms)
- for fun!
For up-to-date news about the dumps, please follow the XML Data Dumps mailing list by reading the archives or subscribing; you can also make inquiries there. If you cannot download the dump you want because it no longer exists, or if you have other issues with the files, you can ping the developers on the list.
Warning on time and size
Before attempting to download any of the wikis or their components, please read the time and space information below carefully. Because some file collections run to terabytes, downloads can take days or even weeks. (See also our FAQ on the size of the English language Wikipedia dumps.) Be sure you understand your storage capacity before attempting downloads. Note that a number of versions are "friendlier" in size and content; you can tailor a download to your capacity by including or excluding images, talk pages, and so on. A careful read of the information below will save a lot of headaches compared to jumping right into downloads.
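As a rough illustration of checking storage before a download, the sketch below compares an expected archive size against the free space on the target disk. The function name, the 1.2 headroom factor, and the 100 GiB figure are illustrative assumptions, not values published for any particular dump.

```python
import shutil


def fits_on_disk(required_bytes, target_dir=".", safety_factor=1.2):
    """Return True if target_dir has room for required_bytes plus some headroom."""
    free_bytes = shutil.disk_usage(target_dir).free
    return required_bytes * safety_factor <= free_bytes


# Hypothetical example: a 100 GiB compressed archive.
archive_size = 100 * 1024**3
print(fits_on_disk(archive_size))
```

Keep in mind that this only accounts for the compressed file; decompressed content can be many times larger.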
Faster archives and servers
Once you're sure you've selected the smallest dataset which fits your purpose, make sure to get it in the most efficient way:
- download them compressed in 7z format, which can be 10 times smaller than bz2 and decompresses faster (once the 7z download is complete, you can pipe the content with 7z e -so, see also full manual);
- download from one of the dumps mirrors, which can be much faster especially if they're near you network-wise (e.g. if you both are on a GÉANT/Internet2/etc. network);
- if the (de)compression takes more time than expected, make sure you've downloaded either 7z or multistream.xml.bz2 files and that your software supports multithreading (like pbzip2/lbzip2 for bz2 decompression and p7zip with lzma-utils 5.1.3+ or 5.2+ for compression only, or xargs/parallel for multiple 7z LZMA decompressions); if you write to disk, consider that LZMA decompression is likely to be faster than your disk can handle.
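The idea of reading compressed content directly instead of unpacking it to disk can be sketched in Python. This example uses the standard-library bz2 module rather than the 7z pipe mentioned above, since 7z support is not part of the standard library; the function name and the choice to count pages are illustrative assumptions.

```python
import bz2
import xml.etree.ElementTree as ET


def count_pages(dump_path):
    """Count <page> elements in a bz2-compressed XML dump without unpacking it to disk."""
    pages = 0
    with bz2.open(dump_path, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            # Tags carry the export namespace ("{...}page"); compare the local name only.
            if elem.tag.rsplit("}", 1)[-1] == "page":
                pages += 1
                # Drop the finished page's text and children to limit memory use.
                elem.clear()
    return pages
```

This is single-threaded, so for the largest files the multithreaded decompressors named above are likely to be faster. For very large dumps the emptied page elements still accumulate under the root, so a production reader would also need to detach them.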
What's available and where
It's all explained here: what's available and where you can download it.
How often dumps are produced
All databases are dumped on three hosts which generate dumps simultaneously. The largest database, enwiki, takes about 14 days for a full run to complete. Wikidata is not far behind.
We produce full dumps with all historical page content once a month; this dump run starts at the beginning of each month.
We produce partial dumps with current page content only, also once a month, starting about two-thirds of the way through the month.
Failures in the dump process are generally dealt with by cleaning up the underlying issue and letting the automated runner rerun the job.
See wikitech:Dumps/Current_Architecture for more information about the processes and the dump architecture.
- Larger databases such as jawiki, dewiki, and frwiki can take a long time to run, especially when compressing the full edit history or creating split stub dumps. If you see a dump seemingly stuck on one of these for a few hours, or days, it's likely not dead, but simply processing a lot of data. You can check that file sizes are increasing or that more revisions are being processed, by reloading the web page for the dump.
Monitoring dump generation
If you are interested in a particular wiki and run date (e.g. frwiki, the "full" run that starts on the 1st of the month), you can check the file dumpstatus.json in the corresponding directory, i.e. for April 1 2017's frwiki run you would look at https://dumps.wikimedia.org/frwiki/20170401/dumpstatus.json and so on. See Data dumps/Status format for more information on the format of these output files. If you are interested in getting information on all wikis, you can check the https://dumps.wikimedia.org/index.json file which aggregates the per-run json files for the most recent run across all wikis.
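A minimal sketch of checking a run's status programmatically follows. It assumes the JSON has a top-level "jobs" object mapping each job name to an object with a "status" field, where "done" marks completion; verify this against the Status format page before relying on it. The job names and statuses in the sample are hypothetical.

```python
import json
import urllib.request

STATUS_URL = "https://dumps.wikimedia.org/frwiki/20170401/dumpstatus.json"


def unfinished_jobs(status):
    """Return {job_name: status} for every job not marked "done".

    Assumes the layout {"jobs": {name: {"status": ...}}}.
    """
    jobs = status.get("jobs", {})
    return {
        name: info.get("status")
        for name, info in jobs.items()
        if info.get("status") != "done"
    }


def fetch_status(url=STATUS_URL):
    """Download and parse one run's dumpstatus.json (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


# Hypothetical sample data, not real output.
sample = {
    "jobs": {
        "xmlstubsdump": {"status": "done"},
        "metahistorybz2dump": {"status": "in-progress"},
    }
}
print(unfinished_jobs(sample))
```

Calling unfinished_jobs(fetch_status()) would apply the same check to a live file; an empty result would mean every listed job reported "done".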
Feeds for last dump produced
If you're interested in a file, you can subscribe to the RSS feed for it, so that you know when a new version is produced. No more time spent opening the web page, no more dumps missed and hungry bots without their XML ration.
The feed URL can be found in the latest/ directory for the wiki (database name) in question; for instance, that directory contains the feed for the last *-pages-meta-history.xml.bz2 dump produced.
You can use services that turn RSS feeds to email notifications (like Blogtrottr).
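If you would rather poll a feed yourself, something along these lines may work. It assumes the feed is ordinary RSS 2.0 with item elements under channel; the sample feed content is made up for illustration and does not reproduce a real feed.

```python
import xml.etree.ElementTree as ET


def latest_item(feed_xml):
    """Return (title, pubDate) of the first <item> in an RSS 2.0 feed, or None."""
    root = ET.fromstring(feed_xml)
    item = root.find("./channel/item")
    if item is None:
        return None
    return item.findtext("title"), item.findtext("pubDate")


# Hypothetical feed content for illustration only.
sample_feed = """<rss version="2.0"><channel>
  <title>example feed</title>
  <item><title>example dump file</title><pubDate>Sat, 01 Apr 2017 00:00:00 GMT</pubDate></item>
</channel></rss>"""
print(latest_item(sample_feed))
```

A script could store the last pubDate it saw and start a download only when the value changes.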
Format of the dump files
The format of the various files available for download is explained here.
Download tools
You can download the XML/SQL files and the media bundles using a web client of your choice, but there are also tools for bulk downloading you may wish to use.
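For readers scripting their own downloads, here is a hedged sketch of resuming an interrupted transfer with an HTTP Range request. It assumes the server honours Range headers, which this page does not confirm; the function names are illustrative, and a dedicated download tool is usually the more robust choice.

```python
import os
import shutil
import urllib.request


def resume_request(url, dest_path):
    """Build a request that asks only for the bytes dest_path is still missing."""
    request = urllib.request.Request(url)
    have = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0
    if have:
        request.add_header("Range", f"bytes={have}-")
    return request


def download(url, dest_path):
    """Fetch url into dest_path, appending if the server resumes (needs network access)."""
    with urllib.request.urlopen(resume_request(url, dest_path)) as response:
        # 206 means the server honoured the Range header; otherwise start over.
        mode = "ab" if response.status == 206 else "wb"
        with open(dest_path, mode) as out:
            shutil.copyfileobj(response, out)
```

If the local file is already complete, a server will typically answer the Range request with an error (416), which urllib raises as an HTTPError; a fuller script would catch that case.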
Tools for import
Here's your basic list of tools for importing.
Other tools
Check out and/or add to this partial list of other tools for working with the dumps, including parsers and offline readers.
Producing your own dumps
MediaWiki 1.5 and above includes a command-line maintenance script, dumpBackup.php, which can be used to produce XML dumps directly, with or without page history.
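As a sketch of driving the script from Python, the helper below builds the command line using the script's --full (all revisions) and --current (latest revision only) options. It assumes the script lives at maintenance/dumpBackup.php under the wiki root and that php is on the PATH; the helper names and default paths are illustrative, and the script's own help output is the authority on its options.

```python
import subprocess


def dump_backup_command(full_history=True, script="maintenance/dumpBackup.php"):
    """Build the argv for dumpBackup.php: --full for all revisions, --current for latest only."""
    return ["php", script, "--full" if full_history else "--current"]


def write_dump(output_path, full_history=True, wiki_root="."):
    """Run the script from wiki_root and write its XML output to output_path."""
    with open(output_path, "wb") as out:
        subprocess.run(
            dump_backup_command(full_history),
            cwd=wiki_root,
            stdout=out,
            check=True,
        )
```

For example, write_dump("dump.xml", full_history=False) would request a current-revisions-only dump from a wiki installed in the working directory.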
The programs which manage our multi-database dump process are available in our source repository but would need some tweaking to be used outside of Wikimedia.
You can generate dumps from public wikis using WikiTeam tools.
Step by step importing
We documented the process to set up a small non-English-language wiki with not too many fancy extensions, using the standard MySQL database backend, on a Linux platform. Read the example or add your own.
See also the MediaWiki manual page on importing XML dumps.
Where to go for help
If you have trouble importing the files, or problems with the appearance of the pages after import, check our import issues list.
If you don't find the answer there or you have other problems with the dump files, you can:
- Ask in #mediawiki on irc.freenode.net, although help is not always available.
- Ask on the xmldatadumps-l (quicker) or the wikitech-l mailing lists.
Alternatively, if you have a specific bug to report:
- File a bug at Phabricator under the Dumps Generation project.
For French speaking people, see also fr:Wikipédia:Requêtes XML
FAQ
Some questions come up often enough that we have a FAQ for you to check out.
See also
On the dumps:
- mw:Manual:Importing XML dumps
- mw:Research Data Proposals#Dump
- mw:WMF Projects/Data Dumps
On related projects:
- Datasets - a list of different data sources related to the Wikimedia projects and tools for working with them
- en:User:Emijrp/Wikipedia Archive
- WikiTeam (website) - a group of people who develop software for making backups and archive wikis
- Entropy-based analysis tool (Who Writes Wikipedia?)
- Wikimedia group on The Data Hub for many other data dumps
- mw:Backing up a wiki How to back up your wiki, with a database dumping tool for MySQL/PostgreSQL etc, or with dumpBackup.php
References
- For instance, it may take less than 2 hours of wall clock time to decompress the whole 100+ GiB of the compressed full dumps of the English Wikipedia: phabricator:P4751. Using 2 CPU cores, it takes less than a day to decompress and grep all the revisions for a string: phabricator:P4750.
- Many scripts can read directly from bz2/7z files, such as wikistats or Python scripts, or recommend to read from the piped decompressed content, such as wikiq and mwdiffs.