Talk:Mirroring Wikimedia project XML dumps

Email template[edit]

This is a template for emailing organizations that may be interested in mirroring the dumps.

Subject: Request for a mirror of the Wikimedia public datasets

Dear Sir/Madam,

I am <YOUR NAME>, representing the Wikimedia Foundation as a volunteer for its projects. I recently visited your website, and I was hoping that your organization could kindly volunteer to mirror the Wikimedia public datasets on your site so that they can benefit researchers.

The Wikimedia public datasets contain database dumps of many of the Foundation's wikis, such as the English Wikipedia. These dumps are available for download, and researchers can use them to carry out research projects, often in ways that benefit the open source community and the many volunteer editors who help out on Wikipedia and its sister projects.

I sincerely hope that your organization can support free knowledge for a good cause and benefit everyone who uses the dumps. If you would like more details, please check out:

https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps

That page provides more details on what is being mirrored and the size needed. If you require more assistance, please feel free to contact me or the official liaison, Ariel T. Glenn, at ariel@wikimedia.org.

I look forward to a favorable reply. Have a nice day, and thank you!

Requirements[edit]

Availability[edit]

What type of availability is required for a mirror: is low availability allowed, such as 85% uptime?

Accepted and rejected organizations[edit]

Accepted organizations[edit]

Ongoing
  • Archive.org (through third-party volunteers)
Expressed interest

Rejected organizations[edit]

Ideas[edit]

Amazon[edit]

This is already supposed to be happening ;). We also shouldn't forget about all the media hosted at Commons - they seem to be rather vulnerable to software or human errors. Eug 15:39, 8 October 2011 (UTC)

They now have a cheap option ("Amazon Glacier") designed for backups - it costs $0.01/GB. If they make the base data a Public Data Set (free for Wikimedia, I suppose), Glacier can be used for incrementals. They also have a portable hard-disk import option for large amounts of data.
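
For illustration, here is a minimal sketch of what pushing one incremental dump file into Glacier could look like with the boto3 AWS SDK; the vault name and file name are hypothetical, and nothing like this is an existing Wikimedia setup:

  import boto3  # AWS SDK for Python; credentials are configured outside this script

  # Hypothetical names -- illustration only.
  VAULT = "wikimedia-dump-increments"
  INCREMENT = "enwiki-20121001-pages-meta-current.xml.bz2"

  glacier = boto3.client("glacier", region_name="us-east-1")
  glacier.create_vault(vaultName=VAULT)  # idempotent if the vault already exists

  # Files over ~100 MB should really use the multipart upload API;
  # a single upload_archive call keeps the sketch short.
  with open(INCREMENT, "rb") as f:
      response = glacier.upload_archive(
          vaultName=VAULT,
          archiveDescription=INCREMENT,
          body=f,
      )

  # Glacier has no cheap listing, so the returned archive ID must be
  # recorded locally in order to retrieve the increment later.
  print(response["archiveId"])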

Netfirms[edit]

I have hosting on Netfirms for $10/year with unlimited storage and 100000 GB/month bandwidth. Could Wikimedia try hosting on a commercial provider? —The preceding unsigned comment was added by 69.249.211.198 (talk) 15:04, 6 August 2011

Generally, you get what you pay for with cheap web hosting. I'm not saying it's a bad idea, but if we start storing terabytes of data and consuming terabytes of traffic, I'd bet that some sort of "acceptable use policy" will present itself. Maybe for some of the smaller projects, this could be a good idea. LobStoR 15:20, 6 August 2011 (UTC)

Library of Congress[edit]

Library of Congress is going to save every public tweet. Why don't they save a copy of Wikipedia? emijrp (talk) 16:47, 10 September 2010 (UTC)

iBiblio[edit]

I have contacted iBiblio about hosting a copy of the latest dumps, working as a mirror of download.wikimedia.org. No response yet. emijrp (talk) 13:12, 15 November 2010 (UTC)

Their response: "Unfortunately, we do not have the resources to provide a mirror of wikipedia. Best of luck!" See this message in the Xmldatadumps-l mailing list.

archive.org[edit]

The Internet Archive is a 501(c)(3) non-profit that was founded to build an Internet library (from http://www.archive.org/about/about.php). Phauly 11:20, 16 November 2010 (UTC)

BitTorrent[edit]

Are torrents useful? Does someone currently have enough disk space and willingness to actually seed the existing ones? I don't like torrents for this because:

  • torrents are useful to save bandwidth, which is not our problem,
  • it's impossible to find seeders for every dump,
  • even if you download a dump, have a torrent client, and some bandwidth to share it, I'm not sure you'll want to keep the compressed dumps in addition to the uncompressed ones.

--Nemo 12:44, 16 November 2010 (UTC)

I'm not sure if torrents are a good solution for this problem. Torrents depend on how many people seed them. These are _huge_ files. Also, the dumps change quickly (OK, the English Wikipedia dump is from January 2010), but other Wikipedias such as German or French get new dumps every month or so. I think we could release a compilation torrent every year containing all the 7z dumps (about 100 GB); perhaps, with the upcoming X anniversary of Wikipedia, it could be a success. Emijrp 11:32, 29 November 2010 (UTC)
Torrents can use Wikimedia's http server as a "web seed" (example torrent with web seed) so at the very least they'll stay seeded for as long as Wikimedia keeps a particular dump on the site... but could potentially stay seeded for much longer. Torrents definitely won't make anything *worse*... LobStoR 18:10, 21 December 2010 (UTC)
Torrents seem like a good idea to me. Rjwilmsi 14:58, 23 December 2010 (UTC)
I have created a Wikipedia Torrent so that other people can download the dump without wasting Wikipedia's bandwidth.
I will seed this for a while, but I really am hoping that Wikipedia could publish an official torrent itself every year containing validated articles (checked for vandalism and quality), and then seed that torrent indefinitely. This would reduce strain on the servers and make the process of obtaining a database dump much simpler and easier. It would also serve as a snapshot in time, so that users could browse the 2010 or 2011 Wikipedia.
Torrent Link —The preceding unsigned comment was added by 71.194.190.179 (talk) 01:52, 13 March 2011 (UTC)
This talk page is not really the most appropriate place to list out all of our wiki-dump torrents. We don't need Wikimedia to create torrents for us, anyway. See Data dumps#What about bittorrent? for a more complete list of existing enwiki (and more) data dumps - the torrent you just mentioned has already been listed there since January 2011 (again, see Data dumps). Perhaps we should move that list out of "Data dumps" and create an actual article specifically for listing wiki-dump torrents. LobStoR 23:10, 14 March 2011 (UTC)
I was mistaken, your torrent is different from our existing torrent for 2011-01-15. Our currently-existing torrent uses Wikimedia's http servers as a web seed, too, for an accelerated download. Also,
  1. Wikimedia cannot reasonably "validate" these dumps (checked for vandalism and "quality") in a large-scale fashion.
  2. Wikimedia is already helping seed unofficial torrents, via their HTTP web servers (web seed).
  3. Wikimedia is the organization that runs the website Wikipedia. Just to clarify terminology.
Just a few notes in response to your post. Please feel free to contribute to our listing of torrents, and help seed our existing torrents :-) LobStoR 23:36, 14 March 2011 (UTC)
Is it possible to use a service like BitTorrent Sync? This would allow torrent-like behaviour and setting the share to read-only.
This would cut strain on the main server and spread the sharing load to everyone sharing the dump.
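
To make the web-seed idea above concrete, here is a rough sketch (not an official tool) of building a single-file .torrent whose url-list field (BEP 19) points back at dumps.wikimedia.org, so ordinary HTTP acts as a permanent seed; the dump file name and date are placeholders:

  import hashlib
  import os

  # Minimal bencoding (BEP 3) -- just enough for a single-file torrent.
  def bencode(value):
      if isinstance(value, int):
          return b"i%de" % value
      if isinstance(value, bytes):
          return b"%d:%s" % (len(value), value)
      if isinstance(value, str):
          return bencode(value.encode("utf-8"))
      if isinstance(value, list):
          return b"l" + b"".join(bencode(v) for v in value) + b"e"
      if isinstance(value, dict):
          items = sorted(value.items())
          return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
      raise TypeError(type(value))

  def make_webseed_torrent(path, webseed_url, piece_length=4 * 1024 * 1024):
      pieces, length = b"", 0
      with open(path, "rb") as f:
          while True:
              chunk = f.read(piece_length)
              if not chunk:
                  break
              length += len(chunk)
              pieces += hashlib.sha1(chunk).digest()
      meta = {
          "info": {
              "name": os.path.basename(path),
              "length": length,
              "piece length": piece_length,
              "pieces": pieces,
          },
          # BEP 19 web seed: clients fetch missing pieces over plain HTTP.
          # No tracker is set here; DHT-capable clients plus the web seed
          # are enough for a sketch.
          "url-list": [webseed_url],
      }
      with open(path + ".torrent", "wb") as out:
          out.write(bencode(meta))

  # Placeholder file and dump date -- point at whichever dump you actually have.
  make_webseed_torrent(
      "enwiki-20110115-pages-articles.xml.bz2",
      "http://dumps.wikimedia.org/enwiki/20110115/enwiki-20110115-pages-articles.xml.bz2",
  )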

BitTorrent on Burnbit[edit]

I've been experimenting with creating .torrents of the database dump files using an automated web service called Burnbit. Please see the user sandbox I've created for testing, at User:LobStoR/data dump table/enwiki-20110115. Previously, I had been downloading and manually creating the web-seeded .torrents listed at data dump torrents, but this service simplifies the process to a single click.

Shortcomings:

  • When a dump is recreated (which happens occasionally), it is difficult to remove and recreate the .torrent on Burnbit (must "Report" a broken link and wait for Burnbit to respond)
  • Extremely large/small files cannot be used

I think this can encourage users to help create and seed torrents, by making it easy for everyone. Reference links:

I am hoping to talk to Ariel about possibly adding something like this as "official" Wikimedia torrents (since the md5sums are displayed on the Burnbit). Burnbit is great for this because additional http web seeds can also be added later. Please provide feedback here, and feel free to make changes to the sandbox if you see any problems or improvements. LobStoR 13:50, 26 June 2011 (UTC)
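
Since the md5sums are what make such torrents trustworthy, here is a small sketch of checking a downloaded file against the checksum list published alongside each dump run; the file names below are placeholders for whichever run you grab:

  import hashlib

  # Placeholder names -- the checksum list sits next to the dump files,
  # e.g. enwiki-20110115-md5sums.txt for the 2011-01-15 enwiki run.
  MD5SUMS_FILE = "enwiki-20110115-md5sums.txt"
  DUMP_FILE = "enwiki-20110115-pages-articles.xml.bz2"

  def published_md5(md5sums_path, filename):
      # Lines look like "<md5>  <filename>", as produced by md5sum.
      with open(md5sums_path) as f:
          for line in f:
              parts = line.split()
              if len(parts) >= 2 and parts[-1].lstrip("*") == filename:
                  return parts[0]
      raise KeyError(filename)

  def local_md5(path, chunk_size=1 << 20):
      h = hashlib.md5()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(chunk_size), b""):
              h.update(chunk)
      return h.hexdigest()

  expected = published_md5(MD5SUMS_FILE, DUMP_FILE)
  print("OK" if expected == local_md5(DUMP_FILE) else "MISMATCH", DUMP_FILE)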

RedIRIS[edit]

en:RedIRIS http://www.rediris.es/ Emijrp 11:33, 16 November 2010 (UTC)

We already contacted them last year on this topic, so we could help to ask them. --GlimmerPhoenix 23:13, 17 December 2010 (UTC)

Academic computer network organizations[edit]

en:Category:Academic computer network organizations. Emijrp 11:34, 16 November 2010 (UTC)

The Swedish University Network (SUNET), with its extensive file archive, is kindly asked to host. 14 September 2012.

Datto[edit]

Datto Inc. has graciously offered to host a mirror, but seems to have disappeared. On 2 January, they listed themselves with an ETA of 8 days, going live on 10 January; it is now 26 January (24 days elapsed), and I don't see anything listed at the link they provided (wikipedia.dattobackup.com). Would Austin McChord or anyone from Datto please drop a note here with some sort of status update? Thanks! LobStoR 23:24, 26 January 2011 (UTC)

There was a brief discussion on wikitech-l but it hasn't gotten anywhere since. You might try responding to that and asking for a status report. Cap'n Refsmmat 04:19, 28 January 2011 (UTC)
I removed this entry from the list, since there has been no further action. LobStoR 12:45, 14 May 2011 (UTC)

C3SL[edit]

I've just sent a message to C3SL, a Brazilian mirror for SourceForge. Cross your fingers! Lugusto 14:42, 3 June 2011 (UTC)

And we accepted. We're waiting for instructions. Contact carlos@fisica.ufpr.br —The preceding unsigned comment was added by 200.17.209.129 (talk) 00:36, 12 June 2011 (UTC)
This mirror is now live. We might have a few glitches as we try to coordinate the exact current contents to be mirrored. :-) -- ArielGlenn 17:52, 13 October 2011 (UTC)
The URL is wikipedia.c3sl.ufpr.br. Out of curiosity, how does C3SL get the files? Are they hashed during and/or after transfer to verify accuracy? LobStoR 19:38, 13 October 2011 (UTC)
They get the files from rsync directly from http://dumps.wikimedia.org. --Hydriz (talk) 14:44, 8 May 2012 (UTC)
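
For anyone curious what such an rsync pull looks like, a rough sketch of a periodic mirror job follows. The module path is a placeholder (the real source path for mirrors is arranged with the dumps maintainers); as for the hashing question, rsync itself verifies each transferred file with a whole-file checksum:

  import subprocess

  # Placeholder source path -- the actual rsync module offered to mirrors is
  # agreed with the dumps maintainers; this only shows the shape of the job.
  SOURCE = "rsync://dumps.wikimedia.org/EXAMPLE-MODULE/"
  DEST = "/data/wikimedia/"

  subprocess.run(
      [
          "rsync",
          "--archive",   # preserve mtimes and permissions
          "--verbose",
          "--delete",    # remove files that disappeared upstream
          "--partial",   # keep partial transfers across restarts
          SOURCE,
          DEST,
      ],
      check=True,  # raise if rsync exits non-zero
  )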

AARNet[edit]

I've sent an email to AARNet, a mirror that was set up in 1998. I can't guarantee that they will take the XML dump content (since they only focus on open source software), but cross your fingers! --Hydriz 10:55, 24 November 2011 (UTC)

And they rejected, sigh. --Hydriz 03:54, 30 November 2011 (UTC)

Sent another email to AARNet; let's hope for the best. Sha-256 (talk) 06:06, 9 January 2013 (UTC)

"I remember the last request for this, and unfortunately our reasoning this time will be the same. The archive is too big for the present available capacity, and we’ve had no requests from any researchers (identifying themselves as such) for the data to be mirrored. We may revisit the decision later this year once a new iteration of Mirror is deployed, but until then, sorry, we won’t mirror it." ,not happening it seems Sha-256 (talk) 06:57, 9 January 2013 (UTC)
I can certainly gather the names of some Australian researchers who would be interested in this. But the size would make it a better target for an RDSI node than AARNET. Researchers probably want more than just a mirrored dump; they would want it extracted and pre-processed in a number of ways for convenience in mining it. Most researchers who work with Wikipedia dumps have to do extensive preprocessing, so the desire to do it once and share is definitely there. I am in conversation with an RDSI node and the size doesn't seem to faze them, but we would need folks to volunteer to help with preprocessing it. Kerry Raymond (talk) 20:59, 10 January 2013 (UTC)

Incremental updates[edit]

Just spitballing here: would it make sense for en.wiki to have yearly full dumps, and more frequent incremental updates as compressed diffs? This might be a solution in search of a problem, but it would make it much easier to keep an up-to-date dump (if people would want that, provided they keep the compressed dump and consecutive dumps are fairly similar; I have no idea how either of those holds up). 80.56.9.191 21:57, 15 December 2010 (UTC)

I'm not so technical, but as I understand it the dumps are already "incremental", i.e. the previous dump is "prefetched" and used to build the next one, so your proposal wouldn't help to reduce times. Perhaps you were talking about bandwidth (you would have to download less); the answer there is that bandwidth is not a great issue, AFAIK. --Nemo 11:10, 16 December 2010 (UTC)
An incremental compressed format makes a lot of sense to me. SJ talk  23:45, 28 April 2012 (UTC)
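
As a toy illustration of the compressed-diff idea raised above (the file names are hypothetical, and a real implementation would need a streaming or binary-delta tool rather than an in-memory line diff):

  import bz2
  import difflib

  # Hypothetical names for two consecutive runs of the same dump.
  OLD = "enwiki-20101201-pages-meta-current.xml.bz2"
  NEW = "enwiki-20110115-pages-meta-current.xml.bz2"

  def read_lines(path):
      with bz2.open(path, "rt", encoding="utf-8") as f:
          return f.readlines()

  # A unified diff of the two XML files, itself bz2-compressed; applying it
  # to the old dump (e.g. with `patch`) reconstructs the new one, so only
  # the diff needs to be downloaded between full yearly dumps.
  diff = difflib.unified_diff(read_lines(OLD), read_lines(NEW),
                              fromfile=OLD, tofile=NEW)
  with bz2.open("pages-meta-current-20101201-to-20110115.diff.bz2", "wt",
                encoding="utf-8") as out:
      out.writelines(diff)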

Commons dumps[edit]

Can you also set up a page about mirroring Wikimedia Commons binary dumps? That's what I am most interested in. The Internet Archive (talk to raj) is quite willing to host such a dump, along with the ones listed here. SJ talk  23:45, 28 April 2012 (UTC)

By "binary dumps", do you mean the textual dumps or the image dumps of Wikimedia Commons? --Hydriz (talk) 14:45, 8 May 2012 (UTC)

WickedWay[edit]

Hello,

We want to host a mirror, but I'm wondering: how can we best do it? We currently host a mirror that rsyncs with the masters of CentOS and Ubuntu and some others, but is the only way to use rsync now to rsync from another mirror?

And secondly, how often do we need to update?

Best, Huib

Our mirror server is located in Dronten, the Netherlands, and has a 1 Gbit uplink. We have 50 TB of storage, with approximately 20 TB free. Huib talk Abigor 12:58, 21 May 2012 (UTC)
For help with setting up a mirror, get in touch with Ariel. See Mirroring Wikimedia project XML dumps#Who can we contact for hosting a mirror of the XML dumps? for contact info. 64.40.57.98 06:22, 22 May 2012 (UTC)

all Wikipedia on torrent[edit]

Is it possible to make a torrent with all of Wikipedia (images, articles)? Maybe in torrent parts.

How many TB? --CortexA9 (talk) 15:42, 1 November 2012 (UTC)

see: http://xowa.sourceforge.net/image_dbs.html

Masaryk University mirror[edit]

Fwiw, it looks like the Masaryk University mirror (http://ftp.fi.muni.cz/pub/wikimedia/) (1) stopped pulling updates in November 2012, and (2) is a partial mirror, excluding the 'enwiki' dumps. --Delirium (talk) 15:52, 28 July 2013 (UTC)

Volunteer for 2 TB[edit]

Hi,

I'm an administrator of the French digital library en:Les Classiques des sciences sociales, based in Quebec, Canada, and we want to volunteer to help mirror the Wikimedia projects. We have about 2 TB of space for this on our server. We would like to know if this can be useful and, if so, how much upload/download traffic it would represent. I think that there is about one dump per month, so I suppose it means at least 2 TB of upload per month. --Simon Villeneuve 21:45, 26 August 2015 (UTC)

  • If you want a lot of traffic with little disk space usage, you should mirror http://download.kiwix.org/ (the load is adjustable, just tell the Kiwix maintainer how much you want).
  • In 2 TB it's hard to include any meaningful subset of the XML dumps, but you could probably fit all the *-pages-meta-current.xml.bz2 files or perhaps even all the *-pages-meta-history*.7z files (a rough way to check the total size is sketched after this list): these would probably be mostly unused, but some users may happen to be closer to your server and prefer it to the main ones.
  • If you want something more stable that you don't need to update and babysit yourself, you can just seed the media tarballs; nobody else is doing that, so the contribution would be useful. --Nemo 12:52, 28 August 2015 (UTC)
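
As mentioned in the list above, a quick way to check whether a chosen subset would fit in 2 TB is to sum the Content-Length headers of the files you plan to carry; the wikis, date, and file names below are placeholders to be replaced with a real list built from the dump index pages:

  from urllib.request import Request, urlopen

  # Placeholder URLs -- build the real list from the per-wiki dump index pages.
  CANDIDATES = [
      "http://dumps.wikimedia.org/frwiki/20150820/frwiki-20150820-pages-meta-current.xml.bz2",
      "http://dumps.wikimedia.org/dewiki/20150820/dewiki-20150820-pages-meta-current.xml.bz2",
  ]

  total = 0
  for url in CANDIDATES:
      response = urlopen(Request(url, method="HEAD"))
      total += int(response.headers["Content-Length"])

  print("subset size: %.1f GB" % (total / 1e9))
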
Ok, thank you very much for your answer. I'll see what I can do. Simon Villeneuve 19:46, 18 September 2015 (UTC)