Mirroring Wikimedia project XML dumps/estimates

From Meta, a Wikimedia project coordination wiki

March 2014 estimate

From dataset1001:/data/xmldatadumps/public, I ran:

  • cat rsync-filelist-last-1-good.txt | sed -e 's/^/./g;' | tr '\n' '\0' | du -sc --files0-from=- > /root/space-needs-precise-last1-2014.txt
    2293020248 kbytes, about 2.1T
  • cat rsync-filelist-last-2-good.txt | sed -e 's/^/./g;' | tr '\n' '\0' | du -sc --files0-from=- > /root/space-needs-precise-last2-2014.txt
    4566567252 kbytes, about 4.2T
  • cat rsync-filelist-last-4-good.txt | sed -e 's/^/./g;' | tr '\n' '\0' | du -sc --files0-from=- > /root/space-needs-precise-last4-2014.txt
    9058836524 kbytes, about 8.4T
  • cat rsync-filelist-last-5-good.txt | sed -e 's/^/./g;' | tr '\n' '\0' | du -sc --files0-from=- > /root/space-needs-precise-last5-2014.txt
    11275812556 kbytes, about 10.5T
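The pipeline in these commands prefixes each slash-leading path in the file list with ".", making it relative to the current directory, NUL-terminates the list for `--files0-from`, and lets `du -sc` print per-entry sizes plus a grand total in 1K blocks. A self-contained demonstration on a throwaway directory (the file names here are made up, not real dump files):

```shell
set -e
cd "$(mktemp -d)"
mkdir -p enwiki/20140102
# two fake dump files, 1 MiB each, standing in for real dump output
dd if=/dev/zero of=enwiki/20140102/a.xml.bz2 bs=1024 count=1024 2>/dev/null
dd if=/dev/zero of=enwiki/20140102/b.xml.bz2 bs=1024 count=1024 2>/dev/null
# a file list shaped like rsync-filelist-last-1-good.txt:
# one path per line, with a leading slash
printf '%s\n' /enwiki/20140102/a.xml.bz2 /enwiki/20140102/b.xml.bz2 \
  > filelist.txt
# the pipeline from above: sed prefixes "." so "/enwiki/..." becomes
# "./enwiki/...", tr NUL-terminates the list for --files0-from, and
# du -sc prints each entry's size plus a grand total in 1K blocks
cat filelist.txt | sed -e 's/^/./g;' | tr '\n' '\0' \
  | du -sc --files0-from=- > space-needs.txt
tail -n1 space-needs.txt   # grand total line, e.g. "2048 total"
```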

August 5 2013 estimate

I wrote a tiny script that gives me the du total across all projects for the first dump of each month, to track growth. Using that I have the following (totals in kbytes):

total for year 2012, month 07: 1,024,149,808
total for year 2012, month 08: 1,470,942,004
total for year 2012, month 09: 1,493,108,864
total for year 2012, month 10: 1,689,314,104
total for year 2012, month 11: 1,758,897,860
total for year 2012, month 12: 1,790,053,612
total for year 2013, month 01: 1,813,492,404
total for year 2013, month 02: 1,845,837,948
total for year 2013, month 03: 2,010,964,172
total for year 2013, month 04: 1,737,291,644
total for year 2013, month 05: 1,935,464,408
total for year 2013, month 07: 2,003,609,136

Rather scary!
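The tiny script itself isn't shown; a hypothetical reconstruction might look like the following, assuming the dump-server layout `<root>/<wiki>/<YYYYMMDD>/` (the function name and details are my own, not the original script):

```shell
# Hypothetical reconstruction of the "tiny script" above.
# Assumes dump runs live under <root>/<wiki>/<YYYYMMDD>/.
monthly_total() {   # usage: monthly_total <root> <YYYY> <MM>
  local root=$1 y=$2 m=$3 total=0 wiki first kb
  for wiki in "$root"/*/; do
    # first dump of the month = lexically smallest matching date dir
    first=$(ls -d "$wiki$y$m"?? 2>/dev/null | head -n1)
    [ -n "$first" ] || continue
    kb=$(du -s "$first" | cut -f1)      # du output is in 1K blocks
    total=$((total + kb))
  done
  printf 'total for year %s, month %s: %s\n' "$y" "$m" "$total"
}
```

Run as, say, `monthly_total /data/xmldatadumps/public 2013 07` for each month of interest.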

9 803 791 956 kbytes (9.2T) via the method listed below for the last 5 good dumps, 4 004 194 960 kbytes (3.8T) for the last 2 good dumps, and 2 031 829 840 kbytes (1.9T) for the last 1 good dump across all wikis.

Jan 21 2012 estimate

last 5

I have a list of the last 5 complete dumps for each project; we generate it for rsyncing to mirror sites. This list includes 5 complete dumps of the beast, enwiki. Running

cat rsync-list.txt | sed -e 's/^/./g;' | tr '\n' '\0' | du -sc --files0-from=- > /data/xmldatadumps/atgtesting/space-needs-precise-2012.txt

gave me a total of 6347870404 kbytes, or 6.0T in human-readable form.
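For the record, du's totals are 1K blocks, so the human-readable figure is the kbyte total divided by 1024³; a quick helper (the function name is mine) lands close to the quoted figures, modulo rounding:

```shell
# Convert du's 1K-block totals to tebibytes (kbytes / 1024^3)
kb_to_tib() { awk -v kb="$1" 'BEGIN { printf "%.1fT\n", kb / 1024^3 }'; }
kb_to_tib 6347870404   # the "last 5" total above -> prints 5.9T
kb_to_tib 2293020248   # the 2014 "last 1" total  -> prints 2.1T
```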

last 2

As above, but starting from a list of the last 2 good dumps and running the same du over that file list, we got 2598845584 kbytes, or 2.5T in human-readable form.

last 1

I generated a list of the files in the last good dump across all projects, using our rsync list generation script. From that and a similar du to the above, the space used is 1308697212 kbytes, or 1.3T in human-readable form.

Dec. 16 2010 estimate

The source of the 1.3T estimate is as follows:

I ran a simple du script on our copy of the dumps. It skipped the "bad" and "archive" dumps (known to be incomplete or corrupt) and looked only at dumps that completed. It may have counted, for some projects, the most recent dump even if a few of its items failed, but this shouldn't have cost us much accuracy.

Original total: 648 235 316 K = 648 GB.

For enwiki it used the 20100730 dump, which totals 85 072 676 K = 85 GB. It should instead have used the most complete run, 20100904, which totals 419 439 240 K = 420 GB. The difference in size is 334 366 564 K = 334 GB.

Adding that to our previous total we now get 982 601 880 K = 983 GB.

One more factor: we did not run the 7z compression on the 20100904 dumps, which would add about 35 GB more. We also did not do the recombine of the page-meta-history bz2s; this would have given us 343 663 832 874 bytes = 344 GB more. Adding those to the 983 GB we get a grand whopping total of 344 + 35 + 983 = 1362 GB.

(Someone can check my arithmetic, I suck.)
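Taking the invitation above, the sums check out with plain shell arithmetic (the only wrinkle is that 343 663 832 874 bytes is strictly 343 decimal GB, rounded up to 344 in the text):

```shell
# Check the sums from the Dec. 16 2010 estimate above.
# original total + enwiki correction, in K:
echo $(( 648235316 + 334366564 ))      # prints 982601880, i.e. 983 GB
# page-meta-history recombine, bytes -> decimal GB (truncated):
echo $(( 343663832874 / 1000000000 )) # prints 343, quoted above as ~344
# grand total in GB:
echo $(( 344 + 35 + 983 ))             # prints 1362
```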