Mirroring Wikimedia project XML dumps

From Meta, a Wikimedia project coordination wiki

This page coordinates efforts to mirror Wikimedia project XML dumps around the globe on independent servers, much like the mirror sites that host GNU/Linux ISO images. See the list of current mirrors below.

Requirements

Space

  • Last 5 good dumps (the preferred option): 10.5 TB, as of March 2014.
  • Last 2 good dumps: 4.2 TB, as of March 2014.
  • Most recent good dumps only: 2.1 TB, as of March 2014.
  • Historical archives (2 dumps per year from 2002 through 2010): 1.6 TB as of August 2012, with some data still missing; expect 3-4 TB in total.
  • All dumps and other data currently hosted: about 34 TB and growing, as of March 2014.
    Expect slow growth: the number of dumps we keep will not grow substantially, but the number and size of projects will increase steadily.
    We are not very interested in mirrors that carry only selected projects or dumps.
  • "Other" (pageview and other statistics): 5.2 TB, as of March 2014.

Compare this to the estimates from 2012.
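The figures above work out to roughly 2.1 TB per dump run (10.5 TB / 5 and 4.2 TB / 2). As a rough planning aid, the following Python sketch picks the largest of the three "good dumps" options that fits in a given amount of disk space. The function name and the 20% headroom default are assumptions made for illustration; they are not figures or tooling from this page.

```python
# Sizes in TB, copied from the list above (March 2014 figures).
MIRROR_OPTIONS_TB = {
    "last 5 good dumps": 10.5,
    "last 2 good dumps": 4.2,
    "most recent good dumps only": 2.1,
}


def largest_option_that_fits(available_tb, headroom=0.2):
    """Return the name of the biggest option that fits, or None.

    headroom is a hypothetical safety margin, expressed as a fraction of
    the option size, to leave room for growth. It is an assumption, not
    a recommendation from this page.
    """
    fitting = [
        (size_tb, name)
        for name, size_tb in MIRROR_OPTIONS_TB.items()
        if size_tb * (1 + headroom) <= available_tb
    ]
    if not fitting:
        return None
    # Tuples compare by size first, so max() picks the largest option.
    return max(fitting)[1]


if __name__ == "__main__":
    for disk_tb in (2.0, 6.0, 13.0):
        print(disk_tb, "TB ->", largest_option_that_fits(disk_tb))
```

With the default headroom, a 6 TB volume is matched to the "last 2 good dumps" option, while a 2 TB volume fits none of the three.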

Bandwidth

Wikimedia serves XML dumps at about 33 MB/s from dataset1001.wikimedia.org, as of March 2014.
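To get a feel for what this rate means for an initial sync, the sketch below estimates transfer times for the sizes listed under Space. It assumes decimal units (1 TB = 1,000,000 MB) and treats 33 MB/s as a sustained rate available to a single mirror. That is an optimistic assumption, so read the results as lower bounds.

```python
def transfer_days(size_tb, rate_mb_per_s):
    """Days needed to move size_tb terabytes at rate_mb_per_s megabytes/second.

    Assumes decimal units (1 TB = 1,000,000 MB) and a constant rate.
    """
    seconds = size_tb * 1_000_000 / rate_mb_per_s
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    # Sizes in TB from the Space section; 33 MB/s from this section.
    for label, size_tb in [
        ("most recent good dumps", 2.1),
        ("last 5 good dumps", 10.5),
        ("all hosted data", 34.0),
    ]:
        print(f"{label}: {transfer_days(size_tb, 33):.1f} days")
```

Under these assumptions the most recent dumps take a little under a day, the last 5 dumps about 3.7 days, and the full 34 TB about 12 days.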

Current Mirrors

Dumps

Organisation | Contents | Location | HTTP access | FTP access | rsync URL
Wikimedia | All public data | Virginia, United States | http://dumps.wikimedia.org | none | none
C3SL | Last 5 good XML dumps | Curitiba, Paraná, Brazil | http://wikipedia.c3sl.ufpr.br | ftp://wikipedia.c3sl.ufpr.br/wikipedia/ | rsync://wikipedia.c3sl.ufpr.br/wikipedia/
Your.org | All public data | Illinois, United States | http://dumps.wikimedia.your.org/ | ftp://ftpmirror.your.org/pub/wikimedia/dumps/ | rsync://ftpmirror.your.org/wikimedia-dumps/
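For readers who want to pull from one of the rsync URLs above, here is a minimal Python sketch that assembles an rsync command line. The rsync URL comes from the Your.org row; the destination path, the function names and the choice of options are assumptions for illustration, not an official mirroring procedure. The command defaults to a dry run so nothing is transferred until you opt in.

```python
import subprocess

# rsync URL taken from the Your.org row of the table above.
SOURCE = "rsync://ftpmirror.your.org/wikimedia-dumps/"


def build_rsync_command(source, dest_dir, dry_run=True, bwlimit_kbps=None):
    """Build an rsync argument list; nothing is executed here.

    dest_dir and bwlimit_kbps are hypothetical local settings.
    """
    cmd = ["rsync", "--archive", "--verbose", "--partial"]
    if dry_run:
        cmd.append("--dry-run")  # list what would be copied, copy nothing
    if bwlimit_kbps is not None:
        cmd.append(f"--bwlimit={bwlimit_kbps}")  # cap the transfer rate
    cmd += [source, dest_dir]
    return cmd


def run(cmd):
    """Execute the command; requires rsync installed and network access."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # /srv/mirror/wikimedia-dumps/ is a placeholder destination.
    print(" ".join(build_rsync_command(SOURCE, "/srv/mirror/wikimedia-dumps/")))
```

Keeping command construction separate from execution makes it easy to inspect the exact invocation before committing to a multi-terabyte transfer.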

Media

Organisation | Contents | HTTP access | FTP access | rsync URL
Your.org | Media (current version only) | http://ftpmirror.your.org/pub/wikimedia/images/ | ftp://ftpmirror.your.org/pub/wikimedia/images/ | rsync://ftpmirror.your.org/wikimedia-images/

Media tarballs

Organisation | Contents | HTTP access | FTP access | rsync URL | Notes
Your.org | Media tarballs per project (except Commons) | http://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/ | ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/ | -- | --
Internet Archive | Media tarballs per project, per day (only Commons) | https://archive.org/details/wikimediacommons | -- | -- | See the list below

Notes on the Internet Archive tarballs:
  • All the Commons uploads (and their description pages in XML export format) for each day since 2004, with one zip file per day and one item per month. A text file listing various errors is available for each month, as well as a CSV file with metadata about every file of each day.
  • The archives are made by WikiTeam and are meant to be static. An embargo of about 6 months is observed, so that each month is uploaded only once it has been mostly cleaned up. Archives up to early 2013 were uploaded in August-October 2013, so they reflect the state of the files at that time. After logging in, you can see a table with details about all items.
  • See Downloading in bulk using wget for the official HTTP download instructions. Downloading via torrent is expected to be faster and is highly recommended; you need a client that supports webseeding in order to download from archive.org's 3 webseeds. There is one torrent per item, plus an (outdated) torrent file that fetches all the torrent files at once.
  • Please join our distributed effort: download and reseed one torrent.
  • Individual images can also be downloaded thanks to the on-the-fly unzipper, by looking up the specific filename in the specific zip file, e.g. [1] for File:Quail1.PNG.

For an unofficial listing of torrents, see data dump torrents.

Pageview stats, MediaWiki tarballs, other files

Organisation | Contents | HTTP access | FTP access | rsync URL
Wansecurity.com | MediaWiki releases, pageview and other stats, historical XML archive, mwdumper | http://wikimedia.wansec.com/ | -- | rsync://wikimedia.wansec.com/wikimedia/

Who can we contact for hosting a mirror of the XML dumps?

If you are a hosting organization and want to volunteer, please send an email to ariel -at- wikimedia.org with "XML dumps mirror" somewhere in the subject line.

If you are brainstorming organizations that might be interested, see the discussion page.
