Mirroring Wikimedia project XML dumps

This page coordinates efforts to mirror Wikimedia project XML dumps around the globe on independent servers, similar to GNU/Linux ISO mirror sites. See the list of mirrors below for where to get the dumps.

Requirements

Space

  • Last 5 good dumps (most desired option): 15 TB for 5 most recent dumps, as of October 2017. This would be 3 sets of full dumps and 2 sets of partial dumps.
  • Last 2 good dumps: 5.4 TB, as of October 2017. This would be one set of full dumps and one set of partial dumps.
  • Only the most recent good dumps: 4.1 TB, as of October 2017. This would be one set of full dumps.
  • Historical archives (2 dumps per year from 2002 through 2010): 1.6 TB as of October 2017; this probably won't grow any further.
  • All dumps and other data currently hosted: about 43 TB and growing, as of October 2017.
    Expect slow growth; the number of dumps we keep will not grow substantially, but the number and size of projects will increase steadily. Wikidata growth may accelerate, which could have a big impact.
    We are not very interested in selective mirroring of only some projects or dumps.
  • "Other" (pageview and other statistics): 17 TB, as of October 2017.

Compare this to the estimates from 2012.

Bandwidth

Wikimedia provides about 70 MB/s of bandwidth for XML dumps via dataset1001.wikimedia.org, as of January 2017.
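As a rough, unofficial estimate of what that means for an initial sync: the most recent good dumps (about 4.1 TB) at 70 MB/s would take about 4.1 × 10^12 B ÷ 7.0 × 10^7 B/s ≈ 59,000 s, i.e. roughly 16 hours, and in practice longer, since that bandwidth is shared among all downloaders.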

Current mirrors

Dumps

Organisation | Contents | Location | Access
Wikimedia | All public data | Virginia, United States |
Academic Computer Club, Umeå University | Last 5 good XML dumps | Umeå, Sweden |
C3SL | Last 5 good XML dumps | Curitiba, Paraná, Brazil |
Your.org | All public data | Illinois, United States |
Internet Archive | All public data (updated semi-manually) | California, United States |
Bytemark | Last 5 good XML dumps | York, United Kingdom |

Media

Note: The media files in the mirror may be outdated; please use them with care and check the last-modified date.
Organisation | Contents | Access
Your.org | Media (current version only) |

Media tarballs

Organisation | Contents | Access
Your.org | Media tarballs per project (except Commons) |
Internet Archive | Media tarballs per day for Wikimedia Commons |
Notes for the wikimediacommons collection
  • All the Commons uploads (and their description pages in XML export format) of each day since 2004, one zip file per day, one item per month. A text file listing various errors is available for each month, as well as a CSV file with metadata about every file of each day.
  • The archives are made by WikiTeam and are meant to be static; an embargo of about 6 months is observed, so that only months which have mostly been cleaned up are uploaded. Archives up to early 2013 were uploaded in August-October 2013, so they reflect the status at that time. After logging in, you can see a table with details about all items.
  • See Downloading in bulk using wget for official HTTP download instructions; a sketch follows this list. Downloading via torrent, however, is supposed to be faster and is highly recommended (you need a client that supports webseeding to download from archive.org's 3 webseeds): there is one torrent per item, and an (outdated) torrent file to download all torrent files at once.
  • Please join our distributed effort: download and reseed one torrent.
  • Individual images can also be downloaded, thanks to the on-the-fly unzipper, by looking for the specific filename in the specific zip file, e.g. [1] for File:Quail1.PNG.
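A minimal sketch of the bulk HTTP approach, assuming the usual archive.org wget pattern (the flags and the itemlist.txt file of item identifiers are illustrative; follow the guide linked above for the authoritative instructions):

    # itemlist.txt: one archive.org item identifier per line, chosen from the
    # wikimediacommons collection. The flags follow the commonly documented
    # archive.org bulk-download pattern; adjust to your needs.
    wget -r -H -nc -np -nH --cut-dirs=1 -e robots=off -l1 \
         -i itemlist.txt -B 'https://archive.org/download/'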

Pageview stats, MediaWiki tarballs, other files

The nd.edu site is restricted to certain institutions with Internet2/ESnet/Geant connectivity, but those with access (primarily academics and researchers) will get high-bandwidth downloads.

Organisation | Contents | Access
Academic Computer Club, Umeå University | 'Other' datasets |
Your.org | 'Other' datasets |
Center for Research Computing, University of Notre Dame | Wikidata entity dumps, pageview and other stats, Picture of the Year tarballs, Kiwix openzim files, other. Restricted ESnet/Geant/I2 access only! |

Potential mirrors

If you are a hosting organization and want to volunteer, please send email to ops-dumps@wikimedia.org with "XML dumps mirror" somewhere in the subject line.

Based on your space and bandwidth restrictions, decide how many dumps you want to mirror, and whether you want to mirror, in addition or instead, the archives (pre-2009 dumps) and/or the "other" datasets; let us know in the email. We will also need the hostname for our rsync config, the name for the IPv6 address if there is a separate name (or, if there is no IPv6 connectivity, a note to that effect), and a contact email address.

Once your information is added to our rsync config, you'll be able to pick up the desired dirs and files from the appropriate rsync module:

  • dumpslastone -- the last complete good dump for each wiki, as well as completed files from any run that is in progress
  • dumpslasttwo -- the last two complete runs, etc.
  • dumpslastthree -- the last three complete runs, etc.
  • dumpslastfour -- the last four complete runs, etc.
  • dumpslastfive -- the last five complete runs, etc.
  • dumpmirrorsother -- 'other' datasets (as seen at [2])
  • dumpmirrorsalldumps -- all dumps but no archives and no 'other' datasets
  • dumpmirrorseverything -- absolutely everything
  • dumpmirrorseverything/archives -- just the archival dumps of historical interest

We recommend a daily cron job for this.
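For example (a sketch only: the rsync endpoint shown, dumps.wikimedia.org, and the local target directory are assumptions, so use the host and path that apply to your mirror once it has been added to the config):

    # List the modules visible to your mirror (the hostname is an assumption).
    rsync dumps.wikimedia.org::

    # Example crontab entry: refresh the last five good dumps nightly at 03:00,
    # removing local files that have been deleted upstream.
    0 3 * * * rsync -av --delete dumps.wikimedia.org::dumpslastfive/ /srv/mirror/dumps/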

If you are brainstorming about organizations that might be interested, see the discussion page.
