Data dumps/Dumps sizes and growth
If you are working with any medium-sized or large wiki, these files are HUGE.
Here are a few example sizes for the English-language Wikipedia revision history dump:
- November 2010: TBD (running). 104GB bz2 compressed.
- May 2016: 14 371 294 447 992 bytes (14 TB) uncompressed.
- October 2018: 17 959 415 517 241 bytes (18 TB) uncompressed.
- April 2019: 18 880 938 139 465 bytes (19 TB) uncompressed. 937GB bz2 compressed. 157GB 7z compressed.
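To put the figures in the list above in perspective, the following sketch works out the compression ratios for April 2019 and the growth between May 2016 and April 2019. It assumes that "GB" in the list means 10^9 bytes, which the page does not state, so the ratios are approximate. Under that assumption, bz2 comes out at about 20:1, 7z at about 120:1, and the uncompressed size grows by a little under 10% per year.

```python
# Uncompressed sizes in bytes, copied from the list above
# (English-language Wikipedia revision history dump).
UNCOMPRESSED_BYTES = {
    "2016-05": 14_371_294_447_992,
    "2018-10": 17_959_415_517_241,
    "2019-04": 18_880_938_139_465,
}

GB = 10**9  # assumption: "GB" means 10^9 bytes, not 2^30

# Compressed sizes for April 2019, from the same list.
bz2_bytes = 937 * GB
sevenzip_bytes = 157 * GB

latest = UNCOMPRESSED_BYTES["2019-04"]
bz2_ratio = latest / bz2_bytes
sevenzip_ratio = latest / sevenzip_bytes

# Growth from May 2016 to April 2019 (35 months), then annualized.
growth = latest / UNCOMPRESSED_BYTES["2016-05"]
months = (2019 - 2016) * 12 + (4 - 5)
annual_growth = growth ** (12 / months) - 1

print(f"bz2 ratio: {bz2_ratio:.1f}:1")
print(f"7z ratio: {sevenzip_ratio:.1f}:1")
print(f"growth over {months} months: {growth:.2f}x")
print(f"annualized growth: {annual_growth:.1%}")
```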
Wikidata is also large. In May 2019, 550GB bz2 compressed, 190GB 7z compressed. In March 2022, 1.4TB uncompressed.
At this point Wikidata is growing faster than the English-language Wikipedia.
The moral of this story: download only what you need, and never decompress a dump into a file on disk if you can avoid it (you don't have room anyway).
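One way to avoid decompressing into a file is to stream the compressed dump and process it as it is read. The sketch below is a minimal illustration using only the Python standard library; the function name `count_pages` and the file name in the usage note are hypothetical, and it assumes a bz2-compressed XML dump whose pages are `<page>` elements under a single root element.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def count_pages(path):
    """Count <page> elements in a bz2-compressed XML dump.

    The file is decompressed in memory as it is read; nothing
    uncompressed is ever written to disk.
    """
    count = 0
    with bz2.open(path, "rb") as stream:
        context = ET.iterparse(stream, events=("start", "end"))
        _event, root = next(context)  # first event is the root element's start
        for event, elem in context:
            if event == "end" and local_name(elem.tag) == "page":
                count += 1
                # Drop pages already processed so memory use stays flat.
                root.clear()
    return count
```

A call might look like `count_pages("somewiki-pages-articles.xml.bz2")`, where the file name is only a placeholder. The same loop can be the starting point for any per-page processing: replace the counter with whatever needs to happen for each page before `root.clear()` discards it.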
Downloading what you need
Types of files for download (see https://dumps.wikimedia.org/backup-index.html):
Metadata vs Content
- so-called stubs: these files do not contain any page or revision content. They only contain metadata (see Page metadata to get the idea).
- content files, i.e. pages-articles, meta-current, meta-history. These files contain some metadata and raw revision content.
Articles vs Everything
- articles: these files contain everything in the main namespace (so all articles), the project namespace, the Help and MediaWiki namespaces, the Template and Module namespaces, the Category and File namespaces, and a couple of others.
- meta: these files contain every page in every namespace, notably including all User pages and all Talk pages.
Current vs History
- articles, current: these files contain only the current revision of each page at the time of the dump.
- history: these files contain every revision of the page dumped; this could be thousands of revisions in some cases.
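The three distinctions above can be combined into a small lookup that tells you what a given file contains before you download it. This is only a sketch: the function name `describe_dump_file` is hypothetical, and the matching relies on the assumption that file names contain the markers `stub`, `articles`, `meta-current` or `meta-history`. Check the actual names on the download page before relying on it.

```python
def describe_dump_file(filename):
    """Summarize what a dump file contains, judging by its name alone.

    Returns None when the name matches none of the file types
    described above.
    """
    name = filename.lower()

    # Current vs history, and articles vs everything.
    if "meta-history" in name:
        scope, revisions = "all namespaces", "every revision"
    elif "meta-current" in name:
        scope, revisions = "all namespaces", "current revision only"
    elif "articles" in name:
        scope, revisions = "articles and related namespaces", "current revision only"
    else:
        return None

    # Metadata vs content: stubs carry no page or revision text.
    is_stub = "stub" in name
    content = "metadata only" if is_stub else "metadata and revision content"

    return {"content": content, "scope": scope, "revisions": revisions}


# Illustrative file names; the wiki name and date are placeholders.
for example in (
    "examplewiki-20190401-stub-meta-history.xml.gz",
    "examplewiki-20190401-pages-articles.xml.bz2",
    "examplewiki-20190401-pages-meta-current.xml.bz2",
):
    print(example, "->", describe_dump_file(example))
```

If all you need is, say, edit counts per page, the stub history file is enough and is far smaller than the corresponding content file.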