Talk:Data dumps

From Meta, a Wikimedia project coordination wiki

Note that this page is not monitored by those who can resolve all such problems. Please subscribe to and send mail about these issues to the appropriate mailing list, xmldatadumps-l.

Download entire wiki "Talk" pages[edit]

When will the entire wiki "Talk" pages be available for download?

Incorrect wiktionary dumps in Turkish[edit]

Turkish Wiktionary titles are converted to uppercase using incorrect collation in the MySQL dumps. Turkish has different casing for i and ı (dotless i), which convert to İ and I respectively. However, in the dumps the title "İngilizce" is converted to "İNGILIZCE" instead of "İNGİLİZCE". This makes almost all of the data useless.
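For anyone post-processing the dumps, the broken titles can at least be detected or regenerated client-side. A minimal sketch of Turkish-aware uppercasing (an illustration of the casing rule, not the MediaWiki/MySQL collation code):

```python
# Turkish has two distinct "i" letters: dotted i/İ and dotless ı/I.
# A locale-unaware uppercase maps i -> I, which is exactly the broken
# form seen in the dumps ("İNGILIZCE"). Correct Turkish uppercasing
# must map i -> İ (U+0130) and ı -> I before the generic pass.

def turkish_upper(s: str) -> str:
    """Uppercase a string using Turkish casing rules for i and ı."""
    return s.replace("i", "İ").replace("ı", "I").upper()

print(turkish_upper("İngilizce"))  # İNGİLİZCE -- not the broken İNGILIZCE
```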


Download Ireland articles and all templates[edit]

Does anyone know how to download all Ireland-related English Wikipedia articles and all templates? I just want something like the navbox and all the pages in the Ireland category.

enwiki-20080103-pages-meta-history.xml.bz2[edit]

Arrrgggghhhh!!! After 148 hours of downloading, I was 97% done with enwiki-20080103-pages-meta-history.xml.bz2 when someone 404'd it!!! Now we are back to having NO complete Wiki dumps available. Is this a secret policy, or what?

It is available in the Internet Archive, exactly here (133 GB). Emijrp 18:32, 12 August 2010 (UTC)
The .7z version of that file is listed at 17.3 GB. Surely that can't really be the same data—can it? MoreThings 00:18, 11 September 2010 (UTC)

Frequent abort / fail[edit]

Dumps frequently fail and then it takes quite a long time until a new one is prepared.

Also, many dumps often fail one after another and a lot of red lines appear at http://download.wikimedia.org/backup-index.html . I don't know how the dumping works, but maybe there's one bug that causes them all to fail. If one dump fails, then maybe the problem that caused it to fail causes the subsequent ones to fail and they are not retried until the next cycle.

All these observations are very amateur, so feel free to correct me.

If it cannot be fixed right away, can it at least be explained here at the main page, Data dumps?

I don't know about other projects, but on the Hebrew Wikipedia we frequently use it for analyzing and improving interwiki links (see en:Wikipedia:WikiProject Interlanguage Links/Ideas from the Hebrew Wikipedia) and for other purposes.

Thanks in advance. --Amir E. Aharoni 15:37, 30 July 2008 (UTC)

Well, dumps failed on 2008-08-01, and now it is 2008-08-26. I think it's embarrassing for Wikimedia. :( --Ragimiri 16:10, 26 August 2008 (UTC)

What's worse (from my point of view at least) is that the "small" dumps work fine, but happen (or don't happen) at the mercy of the largest dumps, which as noted very often fail right away, run for a long time and then fail, or now and again, run for a very long time and actually succeed. It's a real shame we can't have these run every month (say), on a particular date, separately from the large dumps. However, it's probably entirely in vain to comment and complain here: I don't think the server admins/devs monitor this page. Whether they'd pay any attention to requests on wikitech-l remains to be seen. Alai 18:53, 4 September 2008 (UTC)


What's even worse is that I offered, years ago, to take the dumps and run with them, as it were. Instead a whole load of dev time went into smartening them up, but they cannot be a high priority for them. Rich Farmbrough 22:04 4 October 2008 (GMT). 22:04, 4 October 2008 (UTC)

en dump has "ETA 2009-07-25"?[edit]

Would I be going way out on a limb, were I to speculate that this might be yet another failure mode for the full en dump that we're currently in? Alai 07:06, 6 November 2008 (UTC)

No, it's not a failure; it's just a bad estimate. The full history dump does take a really long time, but (assuming it's allowed to run to completion) it'll finish well before July. Already it's estimating completion in May, so that's something. --Sapphic 04:21, 1 December 2008 (UTC)
Oh, that's all right, then. </sarcasm>. The "long time" the full history dump typically takes is around six weeks, not the thick end of a year. It's very clear that something is very badly broken here. Alai 19:14, 7 January 2009 (UTC)

ImportDump.php killed[edit]

First, I am sorry for my English.

I am trying to import the dump of the Ukrainian Wikipedia. After 5 minutes of importing I receive the message "Killed". I have changed the file php.ini and set the following parameters:

upload_max_filesize = 20M

post_max_size = 20M

max_execution_time = 1000

max_input_time = 1000

But I still receive the same message "Killed" after 5 minutes of importing (it imports 8,000 pages at most). The web host's support has no idea what is going wrong.

Please, help me! Thank you! --93.180.231.55 20:58, 26 December 2008 (UTC)

This means that someone or something explicitly killed the process, probably because it consumed too many resources. Many shared hosting places, universities, etc., kill processes automatically after a couple of minutes. Please talk to your local system admin. -- 81.163.107.36 10:31, 27 December 2008 (UTC)

Stub dumps[edit]

I just added a mention of the stub dumps, which I believe contains correct information. The stub dumps are useful for research purposes-- and MUCH easier to work with size-wise-- so I hope they will continue to be generated. 209.137.177.15 06:40, 3 March 2009 (UTC)

Problem with split stub dumps ?[edit]

I don't know if this is the right forum for this request, but frwiki, dewiki and even enwiki seem to repeatedly fail or take too long and get killed, apparently as a result of the long delay required for dumping "split stubs". Would it be possible to reorder the dumps so that key dumps like pages-articles.xml.bz2 are produced before these split-stub dumps? --66.131.214.76 21:49, 11 March 2009 (UTC) Laddo talk

Any updates? answers? 87.68.112.255 08:20, 29 March 2009 (UTC)

eswiki-pages-articles articles and templates[edit]

I downloaded eswiki-20090615-pages-articles and imported it with mwdumper.jar. Now, when I go to see an article, there are articles in the place of the templates, and I see those articles on many other pages. Last year I followed the same procedure and the main page looked right; now it does not. In the place of my templates I have articles. I have run rebuildall.php and the problem continues --Enriluis 18:43, 30 June 2009 (UTC)

'current' symlinks to the latest dumps[edit]

Could we get static symlinks to the latest dumps? Many tool developers at the Toolserver (but not only them) would probably benefit from an address like http://download.wikimedia.org/plwiki/20090809/plwiki-current-pages-articles.xml.bz2 instead of http://download.wikimedia.org/plwiki/20090809/plwiki-20090809-pages-articles.xml.bz2. We could use it to perform periodic jobs by downloading the newest dumps without guessing or looking at the download page. That would certainly help us automate these jobs. An example of such a job is http://toolserver.org/~holek/stats/bad-dates.php, which presents dates in articles that are written improperly with respect to Polish grammar. Hołek ҉ 16:21, 10 August 2009 (UTC)
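In the meantime, the date juggling can be scripted away: fetch the list of available dump dates once (e.g. from the backup index) and build the URL from the newest one. A small sketch -- the wiki name, dates, and file suffix here are just examples:

```python
# Build the URL of the newest available dump from a list of dump
# dates, so periodic jobs need no hard-coded date. The URL pattern
# matches the dated links quoted above.

BASE = "http://download.wikimedia.org"

def latest_dump_url(wiki, dates, suffix="pages-articles.xml.bz2"):
    """dates are YYYYMMDD strings, so max() picks the newest."""
    newest = max(dates)
    return f"{BASE}/{wiki}/{newest}/{wiki}-{newest}-{suffix}"
```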

Missing IDs in abstract.xml[edit]

Is there a reason why the article-ID isn't contained in the "abstract.xml"-files? - in contrast for example to the pages-articles.xml. Ecki 12:25, 16 December 2009 (UTC)

mislinked md5sums file for latest wiki[edit]

I'm not sure this is the right place to point this out, but I noticed that the md5sums file for http://download.wikimedia.org/enwiki/latest/ is mislinked. It points to the md5sums from an older dump (2009-Oct-31) while the dump is from 2009-Nov-28. This is a problem for tools that fetch the latest dump and use the checksum to check for corruption.

More exact figures on uncompressed dump size?[edit]

It says that the dumps can uncompress to 20 times their size. Can anyone give a more exact figure on what the uncompressed size of the enwiki would be? (I need to know how big a drive to buy to hold it.) Thanks, Tisane 23:34, 3 March 2010 (UTC)

Depending on what you want to do with the data, there may not be any need to uncompress it. If you're writing a tool to scan through the file and extract metadata, you can simply uncompress the data in memory as you're reading the file, and only write the (presumably much smaller) metadata to disk. If you're looking to import the data into a database, you're going to run into bigger problems than storage space (the full history dump far exceeds the capacity of most database software.. the Wikimedia foundation itself splits the data across several MySQL databases.) --Oski Jr 18:17, 5 March 2010 (UTC)

How big would image bundles be if they existed[edit]

How large would commons and enwiki image bundles be uncompressed if they existed? 71.198.176.22 19:02, 16 May 2010 (UTC)

Enwiki: 200 GB, Commons: 6 TB -- Daniel Kinzler (WMDE) 11:28, 17 May 2010 (UTC)
And growing. It would be nice to make tarballs of thumbnails. Emijrp 18:24, 12 August 2010 (UTC)

Oldest Wikipedia dump available?[edit]

At, for example, http://dumps.wikimedia.org/itwiki/ I see there are 6 dumps available, the latest 2010-Jun-27, the oldest 2010-Mar-02.

My question is: are older dumps stored somewhere? Is it possible to get them? Even via a script on the database... Thanks! --Phauly 15:42, 29 June 2010 (UTC)

No, because new dumps contain all info of previous dumps, except things they shouldn't contain (deleted private data and so on). --Nemo 21:11, 11 November 2010 (UTC)
Actually that is only true of the full history dump. I found the older dumps extremely useful; however, space constraints mean that I can only keep a couple. Rich Farmbrough 18:52 7 January 2011 (GMT).

How could I download only Category Pages ?[edit]

I wish to utilise wikipedias category structure, could I just download the category pages? Chendy 22:23, 9 August 2010 (UTC)
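There is no category-only dump, but category pages can be pulled out of the regular pages-articles dump by title prefix -- "Category:" on the English Wikipedia; other languages use their own prefix. A rough sketch:

```python
# Extract only the Category: pages from a pages-articles XML stream.
# The dump's <title> carries the namespace prefix, so a title check
# is enough for a quick extraction.

import io
import xml.etree.ElementTree as ET

def category_pages(xml_stream):
    out = []
    for _, elem in ET.iterparse(xml_stream):
        tag = elem.tag.rsplit("}", 1)[-1]   # strip any xmlns prefix
        if tag == "title" and elem.text and elem.text.startswith("Category:"):
            out.append(elem.text)
        if tag == "page":
            elem.clear()                    # keep memory flat on big dumps
    return out
```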

Dump server still down?[edit]

--eom-- अभय नातू 09:45, 24 December 2010 (UTC)

Seems like the 'Dump process is idle.'. I'm waiting for it to start again. EmilS 19:02, 27 December 2010 (UTC)
My pet AI is starving. Two months without juicy xml dumps!
Yes, it is extremely unfortunate. I thought new hardware was being sourced. There's a bugzilla entry on the failure. Rich Farmbrough 10:26 4 January 2011 (GMT).
When the dump page was completely down there was a page with updates, does anyone have the link to that page? Rich, do you have a link to that bugzilla? EmilS 10:37, 4 January 2011 (UTC)
It was wikitech:Dataset1. Anyway there's a mailing list for updates, mail:xmldatadumps-l. --Nemo 19:08, 4 January 2011 (UTC)

The only entry in the mailing list for 2011 says

-- cut here --

Hello,

first a happy new year to everybody. Are there any news about generating new dumps?

Best regards Andreas

-- cut here--

Rich Farmbrough 18:38 7 January 2011 (GMT).
There's some hints at http://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#Dumps. Rich Farmbrough 18:39 7 January 2011 (GMT).
A 14 December posting says "We have also been working on deploying a server to run one round of dumps in the interim." I wonder what happened to that. Rich Farmbrough 18:42 7 January 2011 (GMT).
And now we are back in business! :) EmilS 13:58, 11 January 2011 (UTC)

queue?[edit]

Why are some wikis being dumped for the second time in January, while pl.wiki (and 5 others: it, ja, ru, nl and pt) have had no new copy since November? Malarz pl 19:14, 18 January 2011 (UTC)

Locked/closed project dumps?[edit]

Is there any reason that locked projects continue to get database dumps, when it's obvious that nothing has changed (since the database for a locked project is read-only)? The Simple English Wikiquote's most recent database dump was 04-04-2011, despite the fact that the project has been closed since at least February 2010. I don't know how resource-intensive it is to make a database dump of a wiki, but surely there's no point in releasing new dumps if nothing has changed. Avicennasis 18:41, 7 April 2011 (UTC)

I asked the people managing the dumps about this, and they said that the reason we keep dumps is because if someone wants a copy they have a right to it. We could change formats or add new bits of data, and most of the wikis still allow changes from users like stewards if necessary. They're happy to keep running new dumps, and it's not really a big issue.
They also pointed out that future feedback should probably be sent to xmldatadumps-l and not on wiki talk pages. :-) Cbrown1023 talk 20:45, 8 April 2011 (UTC)
Ah. I knew others still have a right to the data, since it's released under our favorite licenses, CC-BY-SA 3.0/GFDL. I just didn't see the point in producing a new dump when, assuming no changes could be made, the older dumps would be identical. It hadn't occurred to me that there might still be changes at a much higher level.
As a side note, I didn't think to ask on the mailing list, since it's mentioned on the content page for "trouble with the dump files", and I didn't think this was trouble. Thank you, however, for reaching out to the right people, and for the quick response! Avicennasis 11:01, 9 April 2011 (UTC)

Dumps page sorting[edit]

Hello, can someone tell me how the dumps are now sorted on this page: http://dumps.wikimedia.org/backup-index.html

Is it due to a problem? (Before, the page was sorted by last-modified date.) Thanks, Jona 07:22, 1 October 2011 (UTC)

The dumps are sorted by the date the dump was started. Sometimes one part of a dump will be re-run much later and so the date of the status file, which is the timestamp you see in the index.html page, reflects that. This is the reason you see an order that sometimes looks a bit odd. Typically the larger wikis (de fr it pt en) are the ones that might have a step redone. I would recommend you get on the xmldatadumps-l mailing list (see link at top of this page), where you can ask questions like this and get a much more timely response :-) -- ArielGlenn 09:03, 2 December 2011 (UTC)

rebuildImages.php fails[edit]

I copied all images to a new MediaWiki installation. I have done an importDump.php. Then I want all images to be recreated with rebuildImages.php --missing, but it fails!

Some of the copied files have strange characters in their filenames. I think the script tries to rename the file, but it cannot find the method ImageBuilder::renameFile().

Can somebody help? Thx a lot. --Rolze 13:30, 10 January 2012 (UTC)

C:\PROGRAMS\xampp\htdocs\wiki\maintenance>..\..\..\php\php.exe rebuildImages.php --missing
Fatal error: Call to undefined method ImageBuilder::renameFile() in C:\PROGRAMS\xampp\htdocs\wiki\maintenance\rebuildImages.php on line 203

Notes from a crazy person[edit]

I threw some notes down at mw:Laxative about ways to make scanning database dumps very fast. I'm not sure anything will ever come of it, but just so the page isn't completely orphaned, there's now a link from Meta-Wiki. :-) --MZMcBride (talk) 22:10, 12 March 2012 (UTC)

7-zip actually is better for non-full dumps[edit]

This page claims that "SevenZip's LZMA compression produces significantly smaller files for the full-history dumps, but doesn't do better than bzip2 for our other files." In an experiment I discovered that most of the gzip dumps can be made about 40% smaller by 7-zipping with maximum compression settings, and the bzip2 dumps made about 20% smaller. For example, enwiki-20120307-stub-meta-history.xml.gz is 20.2 GB, but compressed with 7z is 11.7 GB. I realise that 7-zipping everything with maximum compression settings would consume a lot of CPU and make it harder to turn out dumps on time, but I think the saved bandwidth would justify it - many overseas downloaders have to pay high rates per gigabyte for bandwidth. 7-zip is able to take advantage of multiple cores, which are rapidly becoming more plentiful. At the least, for downloaders who do need smaller dumps, we should let them know it's possible for them to recompress after downloading to save storage space. Dcoetzee (talk) 15:08, 1 April 2012 (UTC)

how to create "compatible" dumps[edit]

Hi, I run www.ameisenwiki.de and I want to create dumps for WikiTaxi. I need the pages-articles.xml.bz2 format. Currently I try this with php dumpBackup.php --full > /var/www/wiki/dump/pages-articles.xml and create the .bz2 file afterwards. WikiTaxi is not able to import it (parser error). If I use dumpgenerator.py I also get an incompatible XML file. Which tool is used to create these exports?

A 7z compression test[edit]

I tried to achieve better 7z compression of the dumps than the current one. As a test, I downloaded the English Wikipedia full dumps from 2013-06-04 and recompressed them with the following 7z options:

-mx9 -md=30 -mfb=270 -mlc=4

The original size of the full dumps was 70,187,341K. The size after the recompression was 57,441,684K - about 22% smaller.

The upsides are:

  • disk space economy
  • bandwidth economy

The downsides are:

  • significantly more CPU usage: the recompression took about 80 days on my AMD A4-4000
  • significantly more memory usage: with these options 7z requires about 10.5 GB of RAM.

Depending on the current CPU/RAM situation of the Wikimedia hardware and the cost analysis, it may be justified or not to use this stronger compression mode.

Also, grouping exported pages not by page IDs but by a category system may additionally decrease the compressed file size somewhat. However, I wasn't able to formulate an algorithm for choosing the best category hierarchy to follow. -- Григор Гачев (talk) 13:11, 28 September 2013 (UTC)
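For reference, the 7z switches above map directly onto the LZMA parameters exposed by Python's lzma module, which makes it easy to experiment on a sample before committing 80 days of CPU. The dictionary size below is deliberately reduced: 2^30 (as in -md=30) needs roughly 10.5 GB to compress, so this sketch uses 2^24 to stay runnable:

```python
# The 7z switches map onto LZMA filter parameters:
#   -md -> dict_size, -mfb -> nice_len, -mlc -> lc.

import lzma

FILTERS = [{
    "id": lzma.FILTER_LZMA2,
    "dict_size": 1 << 24,   # -md=30 would be 1 << 30 (~10.5 GB to compress)
    "nice_len": 270,        # -mfb=270
    "lc": 4,                # -mlc=4
}]

def recompress(data: bytes) -> bytes:
    return lzma.compress(data, format=lzma.FORMAT_XZ, filters=FILTERS)
```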

Would it help to use Efficient XML Interchange (EXI)?[edit]

XML files often have a large overhead due to the length of the tags, though of course good compression can alleviate that. This is addressed via EXI - see the Efficient XML Interchange Working Group public page. Has anyone looked at whether publishing dumps in EXI would help with the time or space issues? Nealmcb (talk) 17:17, 3 July 2015 (UTC)

Documentation of content of XML files[edit]

Where is there documentation of the content of the various dumps? The names are not fully self-explanatory. For example, what are "primary meta-pages"? Are we just supposed to know enough XML to figure it out by downloading and processing each type of dump file? Specifically, which dumps have pages from the Appendix namespace? Or enwikt's Citations namespace? Presumably that differs by project. Should we have specific documentation for each project that expresses an interest? DCDuring (talk) 15:08, 4 July 2015 (UTC)

Wikispecies species download[edit]

Hello everyone,

I already posted this question on the wikispecies discussion page, but maybe you could give me an answer to this too:

Is there a possibility to download the core species data behind the Wikispecies project? Or, rephrased: is there a dump that contains just the main pages (assuming they are species) and their respective links to successive species pages?

I'm aware that you can download the entire Wikispecies, which leaves you with a 5 GB file. Converting that is naturally something I'd like to avoid, as I am basically just looking for the core relational data.--Taeping5347 (talk) 13:17, 18 April 2016 (UTC)

If links are all you need, a pagelinks table dump or even database query may be easier. Extracting more information may require an alternative parser like mediawiki-utilities. Nemo 17:15, 18 April 2016 (UTC)
Thank you for your hint. I will give the pagelinks a try, but I suspect it'll consist of a lot of pages that are not actually of interest. It would be great if this specific kind of dump could be implemented for the Wikispecies project, where the actual relation between pages carries very real information.--Taeping5347 (talk) 17:49, 18 April 2016 (UTC)
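For what it's worth, the pagelinks dump can be filtered without loading it into a database: each batch in the .sql dump is one long INSERT statement, and the (pl_from, pl_namespace, pl_title) tuples can be pulled out with a regex and then restricted to the titles of interest. A rough sketch over a synthetic statement (the exact schema may differ between dump versions):

```python
# Pull (page_id, namespace, title) rows out of a pagelinks .sql dump
# without a database. Titles in the dump are single-quoted with
# backslash escapes; the regex below tolerates escaped quotes.

import re

ROW = re.compile(r"\((\d+),(\d+),'((?:[^'\\]|\\.)*)'\)")

def parse_pagelinks(sql):
    return [(int(a), int(b), t.replace("\\'", "'"))
            for a, b, t in ROW.findall(sql)]
```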

Talk pages?[edit]

Where can I download the talk pages for EN? Thanks. -- Green Cardamom (talk) 04:51, 12 August 2016 (UTC)

pages-meta-current (related: phabricator:T99483). Nemo 07:18, 12 August 2016 (UTC)

Download all files in one go.[edit]

The dump site should offer packages which provide a full download of everything -- pages, talk pages, images, etc. -- of a Wikimedia project in one go. This could even provide a total backup of an entire project in one folder, allowing users to download it more easily. Wetitpig0 (talk) 10:51, 5 September 2016 (UTC)

What encoding of sha1 in dump xml[edit]

Which part of the revision is used to calculate the sha1? And what encoding (base-what?) is used to convert the sha1 to text?

...
{{Внешние ссылки нежелательны}}</text>
      <sha1>7oe1255dhbgmwbyh4bla8qh0zlq7m2o</sha1>
    </revision>
  </page>
...

I found it in the ruwiki-20170501-pages-articles.xml.bz2 dump. This is the last revision of RU:Magnet-ссылка. Ivan386 (talk) 15:56, 12 May 2017 (UTC)
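As far as I can tell from the MediaWiki source, the value is the SHA-1 of the revision's text content (the wikitext between the text tags, UTF-8 encoded), re-encoded in base 36 and left-padded with zeros to 31 digits -- which is why it is 31 lowercase letters and digits rather than 40 hex characters. A sketch that reproduces the format:

```python
# Reproduce MediaWiki's <sha1> field: SHA-1 of the revision text,
# written in base 36 (digits 0-9 then a-z) and zero-padded to 31
# digits. 36^31 > 2^160, so 31 digits always suffice.

import hashlib
import string

DIGITS = string.digits + string.ascii_lowercase  # "0123456789abc...z"

def mediawiki_sha1(text: str) -> str:
    n = int(hashlib.sha1(text.encode("utf-8")).hexdigest(), 16)
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = DIGITS[r] + out
    return out.rjust(31, "0")
```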