Talk:Mirroring Wikimedia project XML dumps

From Meta, a Wikimedia project coordination wiki

Ideas

Amazon

This is already supposed to be happening ;). We also shouldn't forget about all the media hosted at Commons - they seem to be rather vulnerable to software or human errors. Eug 15:39, 8 October 2011 (UTC)

Netfirms

I have hosting on Netfirms for $10/year with unlimited storage and 100000 GB/month bandwidth. Could Wikimedia try hosting on a commercial provider? The preceding unsigned comment was added by 69.249.211.198 (talk • contribs) 15:04, 6 August 2011.

Generally, you get what you pay for with cheap web hosting. I'm not saying it's a bad idea, but if we start storing terabytes of data and consuming terabytes of traffic, I'd bet that some sort of "acceptable use policy" will present itself. Maybe for some of the smaller projects, this could be a good idea. LobStoR 15:20, 6 August 2011 (UTC)

Library of Congress

The Library of Congress is going to save every public tweet. Why don't they save a copy of Wikipedia? emijrp (talk) 16:47, 10 September 2010 (UTC)

iBiblio

I have contacted iBiblio about hosting a copy of the latest dumps, working as a mirror of download.wikimedia.org. No response yet. emijrp (talk) 13:12, 15 November 2010 (UTC)

Their response: "Unfortunately, we do not have the resources to provide a mirror of wikipedia. Best of luck!" See this message in the Xmldatadumps-l mailing list.

archive.org

The Internet Archive is a 501(c)(3) non-profit that was founded to build an Internet library (from http://www.archive.org/about/about.php). Phauly 11:20, 16 November 2010 (UTC)

BitTorrent

Are torrents useful? Does anyone currently have enough disk space and willingness to actually seed the existing ones? I don't like torrents for this because:

  • torrents are useful to save bandwidth, which is not our problem,
  • it's impossible to find seeders for every dump,
  • even if you download a dump and have a torrent client and some bandwidth to share it, I'm not sure you'd want to keep the compressed dump in addition to the uncompressed one.

--Nemo 12:44, 16 November 2010 (UTC)

I'm not sure if torrents are a good solution for this problem. Torrents depend on how many people seed them, and these are _huge_ files. Also, the dumps change quickly: the English Wikipedia dump is from January 2010, but other Wikipedias, such as German or French, get new dumps every month or so. I think we could release a compilation torrent every year containing all the 7z dumps (about 100 GB); timed with an upcoming Wikipedia anniversary, it could be a success. Emijrp 11:32, 29 November 2010 (UTC)
Torrents can use Wikimedia's http server as a "web seed" (example torrent with web seed) so at the very least they'll stay seeded for as long as Wikimedia keeps a particular dump on the site... but could potentially stay seeded for much longer. Torrents definitely won't make anything *worse*... LobStoR 18:10, 21 December 2010 (UTC)
Torrents seem like a good idea to me. Rjwilmsi 14:58, 23 December 2010 (UTC)
I have created a Wikipedia Torrent so that other people can download the dump without wasting Wikipedia's bandwidth.
I will seed this for a while, but I am really hoping that Wikipedia could publish an official torrent itself every year containing validated articles (checked for vandalism and quality), and then seed that torrent indefinitely. This would reduce strain on the servers and make obtaining a database dump much simpler. It would also serve as a snapshot in time, so that users could browse the 2010 or 2011 Wikipedia.
Torrent Link The preceding unsigned comment was added by 71.194.190.179 (talk • contribs) 01:52, 13 March 2011 (UTC).
This talk page is not really the most appropriate place to list out all of our wiki-dump torrents. We don't need Wikimedia to create torrents for us, anyway. See Data dumps#What about bittorrent? for a more complete list of existing enwiki (and more) data dumps - the torrent you just mentioned has already been listed on here since January 2011 (again, see Data dumps). Perhaps we should move that list from "Data dumps" and create an actual article which is specifically for listing wiki-dump torrents. LobStoR 23:10, 14 March 2011 (UTC)
I was mistaken, your torrent is different from our existing torrent for 2011-01-15. Our currently-existing torrent uses Wikimedia's http servers as a web seed, too, for an accelerated download. Also,
  1. Wikimedia cannot reasonably "validate" these dumps (checked for vandalism and "quality") in a large-scale fashion.
  2. Wikimedia is already helping seed unofficial torrents, via their HTTP web servers (web seed).
  3. Wikimedia is the organization that runs the website Wikipedia. Just to clarify terminology.
Just a few notes in response to your post. Please feel free to contribute to our listing of torrents, and help seed our existing torrents :-) LobStoR 23:36, 14 March 2011 (UTC)
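For anyone wondering what a web seed looks like inside a .torrent: the following is a minimal sketch, not the tool used for the existing torrents. It assumes a single-file torrent whose web seed is listed under the `url-list` key (the BEP 19 convention); the function names and the piece length are hypothetical choices for illustration, and no tracker is included.

```python
import hashlib


def bencode(value):
    """Encode ints, str/bytes, lists and dicts in BitTorrent's bencode format."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in value.items()]
        items.sort(key=lambda kv: kv[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % type(value))


def make_web_seeded_torrent(path, name, web_seed_url, piece_length=262144):
    """Build single-file torrent metainfo that lists an HTTP URL as a web seed.

    The file is read piece by piece, so a large dump never has to fit in memory.
    """
    hashes = []
    total = 0
    with open(path, "rb") as f:
        while True:
            piece = f.read(piece_length)
            if not piece:
                break
            total += len(piece)
            hashes.append(hashlib.sha1(piece).digest())
    metainfo = {
        "info": {
            "name": name,
            "length": total,
            "piece length": piece_length,
            "pieces": b"".join(hashes),  # concatenated 20-byte SHA-1 digests
        },
        # Assumption: clients that support web seeding read this key and
        # fetch pieces over HTTP when no peers are available.
        "url-list": [web_seed_url],
    }
    return bencode(metainfo)
```

Writing the returned bytes to a `.torrent` file gives a torrent that stays downloadable for as long as the URL passed as `web_seed_url` keeps serving the same file, which is the property described in this thread.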

BitTorrent on Burnbit

I've been experimenting with creating .torrents of the database dump files using an automated web service called Burnbit. Please see the user sandbox I've created for testing, at User:LobStoR/data dump table/enwiki-20110115. Previously, I had been downloading and manually creating the web-seeded .torrents listed at data dump torrents, but this service simplifies the process to a single click.

Shortcomings:

  • When a dump is recreated (which happens occasionally), it is difficult to remove and recreate the .torrent on Burnbit (must "Report" a broken link and wait for Burnbit to respond)
  • Extremely large/small files cannot be used

I think this can encourage users to help create and seed torrents, by making it easy for everyone. Reference links:

I am hoping to talk to Ariel about possibly adding something like this as "official" Wikimedia torrents (since the md5sums are displayed on the Burnbit). Burnbit is great for this because additional http web seeds can also be added later. Please provide feedback here, and feel free to make changes to the sandbox if you see any problems or improvements. LobStoR 13:50, 26 June 2011 (UTC)

RedIRIS

en:RedIRIS http://www.rediris.es/ Emijrp 11:33, 16 November 2010 (UTC)

We already contacted them last year on this topic, so we could help to ask them. --GlimmerPhoenix 23:13, 17 December 2010 (UTC)

Academic computer network organizations

en:Category:Academic computer network organizations. Emijrp 11:34, 16 November 2010 (UTC)

Datto

Datto Inc. has graciously offered to host a mirror, but seems to have disappeared. On 2 January, they listed themselves with an ETA of 8 days, going live on 10 January; it is now 26 January (24 days elapsed), and I don't see anything listed at the link they provided (wikipedia.dattobackup.com). Would Austin McChord or anyone from Datto please drop a note here with some sort of status update? Thanks! LobStoR 23:24, 26 January 2011 (UTC)

There was a brief discussion on wikitech-l but it hasn't gotten anywhere since. You might try responding to that and asking for a status report. Cap'n Refsmmat 04:19, 28 January 2011 (UTC)
I removed this entry from the list, since there has been no further action. LobStoR 12:45, 14 May 2011 (UTC)

C3SL

I've just sent a message to C3SL, a Brazilian mirror for SourceForge. Cross your fingers! Lugusto 14:42, 3 June 2011 (UTC)

And we accepted. We're waiting for instructions. Contact carlos@fisica.ufpr.br The preceding unsigned comment was added by 200.17.209.129 (talk • contribs) 00:36, 12 June 2011 (UTC).
This mirror is now live. We might have a few glitches as we try to coordinate the exact current contents to be mirrored. :-) -- ArielGlenn 17:52, 13 October 2011 (UTC)
The URL is wikipedia.c3sl.ufpr.br. Out of curiosity, how does C3SL get the files? Are they hashed during and/or after transfer to verify accuracy? LobStoR 19:38, 13 October 2011 (UTC)
They get the files via rsync directly from http://dumps.wikimedia.org. --Hydriz (talk) 14:44, 8 May 2012 (UTC)
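On the hashing question, here is a minimal sketch of how a mirror could check a file after transfer. It is not a description of what C3SL actually does. It assumes the mirror has a checksum listing in the usual `md5sum` output format (hex digest, whitespace, filename); the function names are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so that multi-gigabyte dumps are never fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sums(text):
    """Parse lines of the form '<hex digest>  <filename>' into {filename: digest}."""
    sums = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            # md5sum marks binary mode with a leading '*' on the filename.
            sums[parts[1].strip().lstrip("*")] = parts[0].lower()
    return sums


def verify(path, name, md5sums_text):
    """Return True only if `name` is listed and the local file matches its digest."""
    expected = parse_md5sums(md5sums_text).get(name)
    return expected is not None and md5_of_file(path) == expected
```

A check like this runs after the transfer has finished; rsync performs its own integrity checks during transfer, so the two complement each other rather than overlap.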

AARNet

I've sent an email to AARNet, a mirror that was set up in 1998. I can't guarantee that they will agree to host the XML dumps (since they only focus on open source software), but cross your fingers! --Hydriz 10:55, 24 November 2011 (UTC)

And they rejected, sigh. --Hydriz 03:54, 30 November 2011 (UTC)

Incremental updates

Just spitballing here: would it make sense for en.wiki to have yearly full dumps plus more frequent incremental updates as compressed diffs? This might be a solution in search of a problem, but it would make it much easier to keep a dump up to date. That assumes people want that, that they keep the compressed dump, and that successive dumps are fairly similar; I have no idea whether any of those hold. 80.56.9.191 21:57, 15 December 2010 (UTC)

I'm not so technical, but as I understand it dumps are already "incremental", i.e. the previous dump is "prefetched" and used to build the next one, so your proposal wouldn't help to reduce times, but perhaps you were talking about bandwidth (you have to download less), and the answer here is that bandwidth is not a great issue AFAIK. --Nemo 11:10, 16 December 2010 (UTC)
An incremental compressed format makes a lot of sense to me. SJ talk  23:45, 28 April 2012 (UTC)
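To make the incremental idea concrete, here is a rough sketch under heavy simplifying assumptions. A dump is modelled as a plain mapping from page title to current text, whereas the real dumps are XML with revision histories. The delta is page-level (whole changed pages, not line-level diffs), and the function names are hypothetical.

```python
import bz2
import json


def make_delta(old_pages, new_pages):
    """Return a bz2-compressed delta: pages added or changed, plus deleted titles."""
    changed = {title: text for title, text in new_pages.items()
               if old_pages.get(title) != text}
    deleted = sorted(title for title in old_pages if title not in new_pages)
    payload = json.dumps({"changed": changed, "deleted": deleted}, sort_keys=True)
    return bz2.compress(payload.encode("utf-8"))


def apply_delta(old_pages, delta):
    """Rebuild the newer dump from the older one and a delta made by make_delta."""
    payload = json.loads(bz2.decompress(delta).decode("utf-8"))
    pages = dict(old_pages)  # copy, so the caller's mapping is left untouched
    for title in payload["deleted"]:
        pages.pop(title, None)
    pages.update(payload["changed"])
    return pages
```

Someone holding last year's full dump would then only download each delta and apply them in order. Whether that saves much depends on how similar successive dumps are, which is the open question raised above.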

Commons dumps

Can you also set up a page about mirroring Wikimedia Commons binary dumps? That's what I am most interested in. The Internet Archive (talk to raj) is quite willing to host such a dump, along with the ones listed here. SJ talk  23:45, 28 April 2012 (UTC)

By binary dumps, do you mean the textual dumps or the image dumps of Wikimedia Commons? --Hydriz (talk) 14:45, 8 May 2012 (UTC)

WickedWay

Hello,

We want to host a mirror, but I'm wondering: how can we best do it? We currently host a mirror that rsyncs with the masters of CentOS, Ubuntu and some more, but is the only way to use rsync right now to rsync from another mirror?

And secondly, how often do we need to update?

Best, Huib

Our mirror server is located in Dronten, the Netherlands and has a 1Gbit uplink. Secondly we have 50TB storage with approximately 20 TB free. Huib talk Abigor 12:58, 21 May 2012 (UTC)
For help with setting up a mirror, get in touch with Ariel. See Mirroring Wikimedia project XML dumps#Who can we contact for hosting a mirror of the XML dumps? for contact info. 64.40.57.98 06:22, 22 May 2012 (UTC)