Data dumps/What's available for download
Available for download from XML/Sql dumps per project
For information about how or where to download these files, see the section 'Getting the dumps' at Data dumps.
Database tables
The format of these files is explained here.
See also database_field_prefixes and database layout.
The tables can be broken down into a few rough groups:
- Page-to-page link lists (pagelinks, categorylinks, imagelinks, templatelinks tables)
- Lists of pages with links outside of the project (externallinks, iwlinks, langlinks tables)
- Media metadata (image, oldimage tables)
- Info about each page (page, page_props, page_restrictions tables)
- Special purpose/misc (geo-tags, interwiki, redirect, and so on)
Table | Filename format | Schema documentation | Description |
---|---|---|---|
categorylinks | <wikiname>-YYYYMMDD-categorylinks.sql.gz | categorylinks table schema | Page ids and the categories to which they belong |
category | <wikiname>-YYYYMMDD-category.sql.gz | category table schema | All categories with number of pages, subcats, files in each |
change_tag | <wikiname>-YYYYMMDD-change_tag.sql.gz | change tag table schema | All tags and the log entry, rc or rev to which they were applied |
externallinks | <wikiname>-YYYYMMDD-externallinks.sql.gz | externallinks table schema | Page ids and the off-wiki links they contain |
flaggedpages * | <wikiname>-YYYYMMDD-flaggedpages.sql.gz | flaggedpages table schema | Page ids and info about their latest stable versions |
flaggedrevs * | <wikiname>-YYYYMMDD-flaggedrevs.sql.gz | flaggedrevs table schema | Revision ids and info about how they have been reviewed |
geo_tags | <wikiname>-YYYYMMDD-geo_tags.sql.gz | geo_tags table schema | Coordinate info contained in each page |
image | <wikiname>-YYYYMMDD-image.sql.gz | image table schema | Information about uploaded files |
imagelinks | <wikiname>-YYYYMMDD-imagelinks.sql.gz | image links table schema | Page ids and their links to media files |
iwlinks | <wikiname>-YYYYMMDD-iwlinks.sql.gz | iwlinks table schema | Page ids and their links to pages on other wikis |
langlinks | <wikiname>-YYYYMMDD-langlinks.sql.gz | langlinks table schema | Page ids and the equivalent pages on other wikis |
page | <wikiname>-YYYYMMDD-page.sql.gz | page table schema | Page info: namespace, title, current revision, etc. |
pagelinks | <wikiname>-YYYYMMDD-pagelinks.sql.gz | pagelinks table schema | Page ids and their links to other pages on this wiki |
page_props | <wikiname>-YYYYMMDD-page_props.sql.gz | page props table schema | Page ids and various properties of the page (default sortkey? in hidden categories?) |
page_restrictions | <wikiname>-YYYYMMDD-page_restrictions.sql.gz | page restrictions table schema | Info about pages protected from editing or moving |
protected_titles | <wikiname>-YYYYMMDD-protected_titles.sql.gz | protected titles table schema | Info about titles for which pages cannot be created |
redirect | <wikiname>-YYYYMMDD-redirect.sql.gz | redirect table schema | Pages that are redirects and their targets |
sites | <wikiname>-YYYYMMDD-sites.sql.gz | sites table schema | Info about all wikis: language code, wiki type, etc. |
site_stats | <wikiname>-YYYYMMDD-site_stats.sql.gz | site stats table schema | Sitewide statistics: page views, total edits, etc. |
templatelinks | <wikiname>-YYYYMMDD-templatelinks.sql.gz | templatelinks table schema | Page ids and the templates they contain |
user_groups | <wikiname>-YYYYMMDD-user_groups.sql.gz | user groups table schema | User ids and the groups to which they belong (bot, sysop, etc) |
wbc_entity_usage | <wikiname>-YYYYMMDD-wbc_entity_usage.sql.gz | wbc entity usage schema | Wikidata entity ids and the pages that use them |
* These tables may not be available on all wikis.
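The filename format in the table above is regular enough to generate programmatically. The following is a minimal sketch in Python; the helper name `table_dump_filename`, the constant `OPTIONAL_TABLES`, the wiki name `enwiki` and the date used are illustrative assumptions, not part of any official dumps tooling.

```python
from datetime import date

# Tables marked with * in the table above may not exist on every wiki.
OPTIONAL_TABLES = frozenset({"flaggedpages", "flaggedrevs"})


def table_dump_filename(wikiname: str, run_date: date, table: str) -> str:
    """Build <wikiname>-YYYYMMDD-<table>.sql.gz for one table dump.

    Hypothetical helper for illustration: it only formats the name and
    does not check that the file exists for the given wiki or run date.
    """
    return f"{wikiname}-{run_date:%Y%m%d}-{table}.sql.gz"


if __name__ == "__main__":
    # Example values only; substitute the wiki and dump run you need.
    print(table_dump_filename("enwiki", date(2013, 2, 1), "page_props"))
```

Callers that iterate over many wikis may want to treat a missing file as normal for any table listed in `OPTIONAL_TABLES`.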
XML files
The detailed format of these files is available at dump format.
The following are available:
- Log data, including blocks, protection, deletion, uploads
  - filename: <wikiname>-YYYYMMDD-pages-logging.xml.gz
- Metadata about each page and current or all revisions
  - filenames: <wikiname>-YYYYMMDD-stub-articles.xml.gz, <wikiname>-YYYYMMDD-stub-meta-current.xml.gz, <wikiname>-YYYYMMDD-stub-meta-history.xml.gz
  - The three types of "stub" (metadata) dump files have metadata for the following page revisions:
    - articles - all pages except for talk pages, current revision only
    - meta-current - all pages, current revision only
    - meta-history - all pages, all revisions
- Text of current or all revisions of all pages
  - filenames: <wikiname>-YYYYMMDD-pages-articles.xml.bz2, <wikiname>-YYYYMMDD-pages-meta-current.xml.bz2, <wikiname>-YYYYMMDD-pages-meta-history.xml.bz2 (and .7z)
  - The three types of page content dumps have text for the page revisions from the corresponding metadata file.
- Short plain text abstracts of each page in the main namespace (filename: <wikiname>-YYYYMMDD-abstract.xml.gz)
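Each page content dump corresponds to the stub file of the same variant, so the two names can be derived together. A minimal sketch, assuming the filename patterns listed above; the function name `stub_and_content_filenames` and the example wiki and date are hypothetical, and the sketch covers only the .bz2 content files.

```python
from typing import Tuple

# The three variants named in the list above.
DUMP_VARIANTS = ("articles", "meta-current", "meta-history")


def stub_and_content_filenames(
    wikiname: str, yyyymmdd: str, variant: str
) -> Tuple[str, str]:
    """Return (stub filename, page content filename) for one variant.

    Hypothetical helper for illustration; it formats names only and does
    not verify that either file exists.
    """
    if variant not in DUMP_VARIANTS:
        raise ValueError(f"unknown dump variant: {variant!r}")
    prefix = f"{wikiname}-{yyyymmdd}"
    stub = f"{prefix}-stub-{variant}.xml.gz"
    content = f"{prefix}-pages-{variant}.xml.bz2"
    return stub, content


if __name__ == "__main__":
    for variant in DUMP_VARIANTS:
        print(stub_and_content_filenames("enwiki", "20130201", variant))
```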
More about filenames
For large wikis, files containing partial dumps of log data, revision metadata, page/revision content, and article abstracts are available for download, as well as the full dump files. The partial files have the following naming scheme:
<wikiname>-YYYYMMDD-<dump-type><num>.xml-p<num>p<num>.<compression-type>
The dump type is the same as in the full dump file, i.e. "stub-meta-history", "abstract", and so on. The next number is the "part number", 1 through 27 for enwiki and wikidatawiki, and 1 through 6 for the other big wikis. These are produced in parallel, with part "1" containing the first entries and part "6" or "27" containing the last entries. The remaining part of the filename, p<num>p<num>, gives the first and last page ids covered in the specific file for the metadata, page content and abstract files, and the first and last log entry ids for the log data files. The compression type is one of .gz, .bz2 or .7z, according to the dump type and specific job.
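The naming scheme for partial files can be taken apart with a regular expression. The following is a minimal sketch in Python, not an official parser: the names `PartialDump` and `parse_partial_dump_filename` are hypothetical, the wiki name is assumed to consist of lowercase letters, digits and underscores, and the compression suffix is assumed to be one of gz, bz2 or 7z.

```python
import re
from typing import NamedTuple, Optional


class PartialDump(NamedTuple):
    wikiname: str
    date: str         # dump run date as YYYYMMDD
    dump_type: str    # e.g. "stub-meta-history" or "abstract"
    part: int         # part number
    first_id: int     # first page id (or log entry id) in the file
    last_id: int      # last page id (or log entry id) in the file
    compression: str  # "gz", "bz2" or "7z"


# <wikiname>-YYYYMMDD-<dump-type><num>.xml-p<num>p<num>.<compression-type>
_PARTIAL_RE = re.compile(
    r"(?P<wikiname>[a-z0-9_]+)-(?P<date>\d{8})-"
    r"(?P<dump_type>[a-z]+(?:-[a-z]+)*)(?P<part>\d+)"
    r"\.xml-p(?P<first_id>\d+)p(?P<last_id>\d+)"
    r"\.(?P<compression>gz|bz2|7z)"
)


def parse_partial_dump_filename(name: str) -> Optional[PartialDump]:
    """Split a partial dump filename into its fields.

    Returns None when the name does not follow the partial-file scheme,
    for example when it is the name of a full dump file.
    """
    m = _PARTIAL_RE.fullmatch(name)
    if m is None:
        return None
    return PartialDump(
        wikiname=m["wikiname"],
        date=m["date"],
        dump_type=m["dump_type"],
        part=int(m["part"]),
        first_id=int(m["first_id"]),
        last_id=int(m["last_id"]),
        compression=m["compression"],
    )
```

With the fields parsed, a set of partial files can be sorted by `part` or by `first_id` to process them in page id order; the id ranges in a real dump run should be read from the filenames rather than assumed.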
Everything else
The remaining per-project files that can be downloaded are the following:
- titles of all pages
- titles of all pages in the main namespace
- general site info: namespaces and their aliases, "magic words", mainpage url, etc.
Old (2013) sql files
SQL files for testing, generated from the page metadata and page content XML files, have been made available for the February 2013 dump run of the English language Wikipedia, for use with MediaWiki 1.20 [1]. Before using them, please note that they do not have the usual drop/create table stanzas at the beginning.
Tab-delimited files for use with MySQL's LOAD DATA INFILE, generated from the SQL files for the February 2013 dump run of the English language Wikipedia, are also available for testing with MediaWiki 1.20 [2].
Not available
Private data is not available for download. Some database tables are partially private, containing data such as passwords, e-mail addresses, preferences, and watchlists. Deleted or suppressed content or user information is also not published; it may have contained spam, personally identifying information, copyright violations or other sensitive material.
Downloading media
Old media bundles for each project are available from a mirror site, via http, ftp or rsync: see Media tarballs on our list of mirrors. If you want to browse or retrieve the original media as individual files, those are available too; see Media on our list of mirrors.
New media bundles are not currently produced.
The Wikimedia Foundation has permission to use certain images, and many of the fair use images are borderline as to whether they can be used outside Wikipedia. If you choose to download the image base, you do so at your own risk and assume all liability for the use of any images on the main Wikipedia site. The Wikipedia community vigorously polices the site and removes infringing images daily; however, it is always possible that some images escape this vigilance and remain on the site for a short time. As of February 2007, the entire collection of images produced a compressed tar.gz file of over 213 GB. As of November 2011, the image and other media files take up about 17 TB, most of it already compressed media.
Wish list
Some items that people have asked for are on a wish list. If you want other items to be dumped, it is highly recommended that you add a request on Phabricator; the dumps maintainers check Phabricator every day or two, which is more often than they check the wish list.