Research talk:Measuring edit productivity/Work log/2015-01-5

Monday, January 5, 2015

Just got back from the holiday and I'm picking up where I left off. It looks like the text trimming script that I ran worked as expected, so next I want to retry the revision stats job on simplewiki. First, let's check the change in file size.

[halfak@stat1002: ~/projects/persistence]
$ du -hs /mnt/hdfs/user/halfak/streaming/simplewiki-20141122/persistence-notext-snappy/
23G	/mnt/hdfs/user/halfak/streaming/simplewiki-20141122/persistence-notext-snappy/
[halfak@stat1002: ~/projects/persistence]
$ du -hs /mnt/hdfs/user/halfak/streaming/simplewiki-20141122/persistence-snappy
11T	/mnt/hdfs/user/halfak/streaming/simplewiki-20141122/persistence-snappy

Well... Hmm... That's a pretty massive difference: 11T with the text versus 23G without it.  :) It makes sense, since the diffs naturally compress the changes. This should substantially reduce the storage space needed to sort and partition the data. Time to try again. --Halfak (WMF) (talk) 17:58, 5 January 2015 (UTC)
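
To put a rough number on that difference, here is a minimal sketch that converts the two `du -hs` figures into bytes and computes the ratio. The helper `parse_du_size` is hypothetical (it is not part of the persistence project), and it assumes `du -h` is reporting powers of 1024. Since `du -hs` rounds its output, the result is only approximate.

```python
# Rough size comparison using the rounded figures reported by `du -hs`.
# Assumption: `du -h` suffixes are powers of 1024 (K, M, G, T).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_du_size(text):
    """Convert a `du -h` style size such as '23G' into bytes (hypothetical helper)."""
    number, unit = text[:-1], text[-1]
    return float(number) * UNITS[unit]


with_text = parse_du_size("11T")     # persistence-snappy
without_text = parse_du_size("23G")  # persistence-notext-snappy

ratio = with_text / without_text
print(f"reduction factor: ~{ratio:.0f}x")
print(f"remaining fraction: {without_text / with_text:.2%}")
```

With these rounded inputs, the trimmed dataset comes out at roughly 1/490th of the original, or about 0.2% of the size.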