Mirroring Wikimedia project XML dumps/estimates
March 2014 estimate
From dataset1001:/data/xmldatadumps/public, I ran:
cat rsync-filelist-last-1-good.txt | sed -e 's/^/./g;' | tr '\n' '\0' | du -sc --files0-from=- > /root/space-needs-precise-last1-2014.txt
- 2293020248 kbytes, about 2.1T
cat rsync-filelist-last-2-good.txt | sed -e 's/^/./g;' | tr '\n' '\0' | du -sc --files0-from=- > /root/space-needs-precise-last2-2014.txt
- 4566567252 kbytes, about 4.2T
cat rsync-filelist-last-4-good.txt | sed -e 's/^/./g;' | tr '\n' '\0' | du -sc --files0-from=- > /root/space-needs-precise-last4-2014.txt
- 9058836524 kbytes, about 8.4T
cat rsync-filelist-last-5-good.txt | sed -e 's/^/./g;' | tr '\n' '\0' | du -sc --files0-from=- > /root/space-needs-precise-last5-2014.txt
- 11275812556 kbytes, about 10.5T
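As a cross-check on the conversions above, here is a minimal sketch that turns the du totals (1K blocks) into binary terabytes. The helper name kb_to_tib is made up for illustration and is not part of the original commands. It prints two decimals, so its output is slightly finer than the "about" figures quoted above.

```shell
#!/usr/bin/env bash
# kb_to_tib: hypothetical helper, not part of the commands that were run.
# Converts a du total in 1K blocks to TiB (1 TiB = 1024^3 K), two decimals.
kb_to_tib() {
    awk -v kb="$1" 'BEGIN { printf "%.2f\n", kb / (1024 * 1024 * 1024) }'
}

# The four March 2014 totals: last 1, 2, 4 and 5 good dumps.
for kb in 2293020248 4566567252 9058836524 11275812556; do
    printf '%s kbytes = %sT\n' "$kb" "$(kb_to_tib "$kb")"
done
```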
August 5 2013 estimate
I wrote a tiny script that would give me the du across all projects for the first dump of each month, to track growth. Using that I have the following:
total for year 2012, month 07: 1,024,149,808
total for year 2012, month 08: 1,470,942,004
total for year 2012, month 09: 1,493,108,864
total for year 2012, month 10: 1,689,314,104
total for year 2012, month 11: 1,758,897,860
total for year 2012, month 12: 1,790,053,612
total for year 2013, month 01: 1,813,492,404
total for year 2013, month 02: 1,845,837,948
total for year 2013, month 03: 2,010,964,172
total for year 2013, month 04: 1,737,291,644
total for year 2013, month 05: 1,935,464,408
total for year 2013, month 07: 2,003,609,136
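To make the growth easier to read, the monthly totals can be turned into month-over-month percentage changes. This is only a sketch: monthly_growth is a hypothetical helper, and the "YYYY-MM total" input layout is an assumption chosen for the example, not the output format of the original script.

```shell
#!/usr/bin/env bash
# monthly_growth: hypothetical helper. Reads "YYYY-MM total" lines on stdin
# and prints the percentage change relative to the previous line.
monthly_growth() {
    awk '{
        gsub(/,/, "", $2)    # drop thousands separators
        if (NR > 1) printf "%s %+.1f%%\n", $1, ($2 - prev) / prev * 100
        prev = $2
    }'
}

# Totals copied from the list above. 2013-06 is absent from that list,
# so the 2013-07 line is compared against 2013-05.
monthly_growth <<'EOF'
2012-07 1,024,149,808
2012-08 1,470,942,004
2012-09 1,493,108,864
2012-10 1,689,314,104
2012-11 1,758,897,860
2012-12 1,790,053,612
2013-01 1,813,492,404
2013-02 1,845,837,948
2013-03 2,010,964,172
2013-04 1,737,291,644
2013-05 1,935,464,408
2013-07 2,003,609,136
EOF
```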
Using the method described under the Jan 21 2012 estimate below, across all wikis: 9 803 791 956 kbytes (9.2T) for the last 5 good dumps, 4 004 194 960 kbytes (3.8T) for the last 2 good dumps, and 2 031 829 840 kbytes (1.9T) for the last good dump.
Jan 21 2012 estimate
I have a list of the last 5 complete dumps for each project; we generate it for rsyncing to mirror sites. This list includes 5 full complete dumps of the beast, enwiki. Running
cat rsync-list.txt | sed -e 's/^/./g;' | tr '\n' '\0' | du -sc --files0-from=- > /data/xmldatadumps/atgtesting/space-needs-precise-2012.txt
gave me a total of 6347870404 kbytes or 6.0T in human-readable form.
As above, but starting with a list of the last 2 good dumps and running du over that file list, we got 2598845584 kbytes or 2.5T in human-readable form.
I generated a list of the files in the last good dump across all projects, using our rsync list generation script. From that and a similar du to the above, the space used is 1308697212 kbytes or 1.3T in human-readable form.
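All three measurements use the same pipeline, so it can be wrapped in a small function. The sketch below is not the script that was actually run: list_total_kb is a hypothetical name, and it assumes that each line of the file list is a path beginning with "/" relative to the dumps directory (which is what the sed step suggests) and that GNU du is available for --files0-from.

```shell
#!/usr/bin/env bash
# list_total_kb: hypothetical wrapper around the du pipeline.
# Usage: list_total_kb FILELIST [BASEDIR]
# Prints the grand total in 1K blocks for the files named in FILELIST.
list_total_kb() {
    local list=$1 base=${2:-.}
    (
        cd "$base" || exit 1
        sed -e 's/^/./' |              # "/enwiki/..." becomes "./enwiki/..."
            tr '\n' '\0' |             # NUL-separate the names for du
            du -sc --files0-from=- |   # one line per file, then a "total" line
            tail -n 1 |                # keep the "total" line
            cut -f1                    # keep only the number
    ) < "$list"                        # list is opened before the cd
}
```

For example, `list_total_kb rsync-list.txt /data/xmldatadumps/public` would print a single number comparable to the totals quoted above.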
Dec. 16 2010 estimate
The source of the 1.3T estimate is as follows:
I ran a simple du script on our copy of the dumps. It skipped "bad" and "archive" dumps (known to be incomplete or corrupt) and only looked at the dumps that completed. It might have counted the most recent dump for a project even if some items in it failed, but this shouldn't have cost us much in accuracy.
Original total: 648 235 316 K = 648 GB.
For enwiki it used the 20100730 dumps, which total 85 072 676 K = 85 GB. It should have used the most complete dumps, 20100904, which total 419 439 240 K = 420 GB. The difference in size is 334 366 564 K = 334 GB.
Adding that to our previous total we now get 982 601 880 K = 983 GB.
One more factor: we did not run the 7z compression on the 20100904 dumps, which would give us about 35 GB more. We also did not do the recombine of the page-meta-history bz2s; this would have given us 343 663 832 874 bytes = 344 GB more. Adding those to 983 GB we get a grand whopping total of 344+35+983 = 1362 GB.
(Someone can check my arithmetic, I suck.)
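For anyone who wants to take up that offer, the sums can be replayed with shell arithmetic. The variable names are chosen here for illustration; the values are the ones quoted in the text, in K unless noted, and the GB figures use decimal units as the text does.

```shell
#!/usr/bin/env bash
# Recompute the Dec. 2010 totals from the figures quoted above.
original=648235316       # first du total, in K
enwiki_used=85072676     # enwiki 20100730 dumps, in K
enwiki_full=419439240    # enwiki 20100904 dumps, in K

delta=$((enwiki_full - enwiki_used))   # extra space for the complete enwiki run
adjusted=$((original + delta))         # total with the complete enwiki run
grand_gb=$((344 + 35 + 983))           # recombined bz2s + 7z + adjusted, in GB

echo "difference:  $delta K"
echo "adjusted:    $adjusted K"
echo "grand total: $grand_gb GB"
```

The results match the numbers given in the text: 334366564 K, 982601880 K and 1362 GB.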