Talk:Data dumps
From Meta, a Wikimedia project coordination wiki
Note that this page is not necessarily monitored by those who can resolve all such problems. Some such queries would be more usefully directed to the appropriate mailing lists, such as wikitech-l.
Download Ireland articles and all templates
Does anyone know how to download all English Wikipedia articles about Ireland, plus all templates? I just want something like the navbox and all the pages in the Ireland category.
enwiki-20080103-pages-meta-history.xml.bz2
Arrrgggghhhh!!! After 148 hours of downloading, I was 97% done with enwiki-20080103-pages-meta-history.xml.bz2 when someone 404'd it!!! Now we are back to having NO complete Wiki dumps available. Is this a secret policy, or what?
Frequent abort / fail
Dumps frequently fail and then it takes quite a long time until a new one is prepared.
Also, many dumps often fail one after another and a lot of red lines appear at http://download.wikimedia.org/backup-index.html . I don't know how the dumping works, but maybe there's one bug that causes them all to fail. If one dump fails, then maybe the problem that caused it to fail causes the subsequent ones to fail and they are not retried until the next cycle.
All these observations are very amateur, so feel free to correct me.
If it cannot be fixed right away, can it at least be explained here at the main page, Data dumps?
I don't know about other projects, but on the Hebrew Wikipedia we frequently use it for analyzing and improving interwiki links (see en:Wikipedia:WikiProject Interlanguage Links/Ideas from the Hebrew Wikipedia) and for other purposes.
Thanks in advance. --Amir E. Aharoni 15:37, 30 July 2008 (UTC)
Well, dumps failed on 2008-08-01, and it is now 2008-08-26. I think it's embarrassing for Wikimedia. :( --Ragimiri 16:10, 26 August 2008 (UTC)
What's worse (from my point of view at least) is that the "small" dumps work fine, but happen (or don't happen) at the mercy of the largest dumps, which as noted very often fail right away, run for a long time and then fail, or now and again, run for a very long time and actually succeed. It's a real shame we can't have these run every month (say), on a particular date, separately from the large dumps. However, it's probably entirely in vain to comment and complain here: I don't think the server admins/devs monitor this page. Whether they'd pay any attention to requests on wikitech-l remains to be seen. Alai 18:53, 4 September 2008 (UTC)
- What's even worse is that I offered years ago to take the dumps and run with them, as it were. Instead a whole load of dev time went into smartening them up, but they cannot be a high priority for the devs. Rich Farmbrough 22:04, 4 October 2008 (UTC)
en dump has "ETA 2009-07-25"?
Would I be going way out on a limb, were I to speculate that this might be yet another failure mode for the full en dump that we're currently in? Alai 07:06, 6 November 2008 (UTC)
- No, it's not a failure; it's just a bad estimate. The full history dump does take a really long time, but (assuming it's allowed to run to completion) it'll finish well before July. Already it's estimating completion in May, so that's something. --Sapphic 04:21, 1 December 2008 (UTC)
- Oh, that's all right, then. </sarcasm>. The "long time" the full history dump takes is typically around six weeks, not the thick end of a year. It's very clear that something is very badly broken here. Alai 19:14, 7 January 2009 (UTC)
ImportDump.php killed
First, I am sorry for my English.
I am trying to import the dump of the Ukrainian Wikipedia. After 5 minutes of importing I receive the message "Killed". I have changed the php.ini file and set the following parameters:
upload_max_filesize = 20M
post_max_size = 20M
max_execution_time = 1000
max_input_time = 1000
But I still receive the same message "Killed" after 5 minutes of importing (it imports 8000 pages at most). The webhost provider's support has no idea what is going wrong.
Please, help me! Thank you! --93.180.231.55 20:58, 26 December 2008 (UTC)
- This means that someone or something explicitly killed the process, probably because it consumed too many resources. Many shared hosting providers, universities, etc., kill processes automatically after a couple of minutes. Please talk to your local system admin. -- 81.163.107.36 10:31, 27 December 2008 (UTC)
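One way to check whether such an automatic limit applies to a shell account is to print the per-process resource limits from that same shell. This is a minimal sketch in Python, assuming a Unix-like host where the standard `resource` module is available. The function name `describe_limits` is made up for illustration, and a host may also kill processes through a separate watchdog that these limits do not reveal.

```python
import resource


def describe_limits():
    """Return the soft CPU-time and address-space limits for this process.

    A value of None means "no limit set" (RLIM_INFINITY).
    """
    wanted = {
        "cpu_seconds": resource.RLIMIT_CPU,
        "address_space_bytes": resource.RLIMIT_AS,
    }
    limits = {}
    for label, key in wanted.items():
        soft, _hard = resource.getrlimit(key)
        limits[label] = None if soft == resource.RLIM_INFINITY else soft
    return limits


if __name__ == "__main__":
    for label, value in describe_limits().items():
        print(label, "unlimited" if value is None else value)
```

A CPU limit of around 300 seconds would be consistent with an import that dies after about five minutes, but that is only one possible explanation.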
Stub dumps
I just added a mention of the stub dumps, which I believe contains correct information. The stub dumps are useful for research purposes-- and MUCH easier to work with size-wise-- so I hope they will continue to be generated. 209.137.177.15 06:40, 3 March 2009 (UTC)
Problem with split stub dumps ?
I don't know if this is the right forum for this request, but frwiki, dewiki and even enwiki seem to repeatedly fail or take too long and get killed, apparently as a result of the long delay required for dumping « split stubs ». Would it be possible to reorder dumps so that key dumps like pages-articles.xml.bz2 would be dumped before these split stub dumps ? --66.131.214.76 21:49, 11 March 2009 (UTC) Laddo talk
- Any updates? answers? 87.68.112.255 08:20, 29 March 2009 (UTC)
eswiki-pages-articles articles and templates
I downloaded eswiki-20090615-pages-articles and imported it with mwdumper.jar. Now, when I view an article, articles appear in place of the templates, and I see these same articles on many other pages. Last year I followed the same procedure and could see the main page; now I cannot. In place of my templates I have articles. I have run rebuildall.php and the problem continues. --Enriluis 18:43, 30 June 2009 (UTC)
'current' symlinks to the latest dumps
Could we get static symlinks to the latest dumps? Probably many tools' developers at the Toolserver (but not only them) would benefit from an address like http://download.wikimedia.org/plwiki/20090809/plwiki-current-pages-articles.xml.bz2 instead of http://download.wikimedia.org/plwiki/20090809/plwiki-20090809-pages-articles.xml.bz2. We could use it to run periodic jobs that download the newest dumps without guessing or looking at the download page. That would certainly help us automate these jobs. One example of such a job is http://toolserver.org/~holek/stats/bad-dates.php, which lists dates in articles that are written improperly according to Polish grammar. Hołek ҉ 16:21, 10 August 2009 (UTC)
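Until a static address exists, a periodic job could pick the newest dated directory itself. The sketch below is in Python and only illustrates the idea: the function name `latest_dump_url` is hypothetical, the list of directory names is assumed to have been obtained elsewhere (for example from the wiki's download page), and the URL pattern is copied from the dated example in the request above. It does not check whether the newest dump actually completed.

```python
import re

# Dump directories are named by date, e.g. "20090809".
DATE_DIR = re.compile(r"^\d{8}$")


def latest_dump_url(wiki, dir_names, base="http://download.wikimedia.org"):
    """Build the pages-articles URL for the newest dated directory.

    dir_names: iterable of directory names, with or without a trailing slash.
    """
    cleaned = (name.strip("/") for name in dir_names)
    dates = sorted(name for name in cleaned if DATE_DIR.match(name))
    if not dates:
        raise ValueError("no dated dump directories found")
    newest = dates[-1]  # YYYYMMDD strings sort chronologically
    return f"{base}/{wiki}/{newest}/{wiki}-{newest}-pages-articles.xml.bz2"


if __name__ == "__main__":
    print(latest_dump_url("plwiki", ["20090701/", "20090809/"]))
```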
Missing IDs in abstract.xml
Is there a reason why the article-ID isn't contained in the "abstract.xml"-files? - in contrast for example to the pages-articles.xml. Ecki 12:25, 16 December 2009 (UTC)
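As a possible workaround, the IDs could be looked up from pages-articles.xml by title. This is a rough sketch in Python, assuming the usual export layout in which each `page` element has `title` and `id` as direct children. The function name `title_to_id` is made up for illustration, and matching the result against titles in abstract.xml is left out and may need some normalisation.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix such as '{http://...}page'."""
    return tag.rsplit("}", 1)[-1]


def title_to_id(xml_stream):
    """Stream a pages-articles style dump and map page title -> page ID."""
    mapping = {}
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title = page_id = None
        # Only direct children are inspected, so revision IDs are ignored.
        for child in elem:
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "id" and page_id is None:
                page_id = int(child.text)
        if title is not None and page_id is not None:
            mapping[title] = page_id
        elem.clear()  # release the page's text once it has been read
    return mapping


# Tiny made-up sample in the same shape as a dump, for illustration only.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
<page><title>Dublin</title><id>42</id><revision><id>7</id><text>x</text></revision></page>
<page><title>Cork</title><id>43</id><revision><id>8</id><text>y</text></revision></page>
</mediawiki>"""

if __name__ == "__main__":
    print(title_to_id(io.BytesIO(SAMPLE)))
```

For a real dump the stream would be the decompressed file, for example one opened with `bz2.open(path, "rb")`.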