Research talk:Measuring edit productivity/Work log/2014-11-24

Monday, November 24, 2014

I've been working for a while to get diffs in place for productivity analysis. Today I finished processing Simple English Wikipedia. I'd like to re-process the diffs to understand what's there and to estimate how much time it would take to generate persistence information on top of them. --Halfak (WMF) (talk) 16:18, 24 November 2014 (UTC)


So I have 4.5 million diffs. This corresponds very closely to the number of revisions in the database.

$ wc part-00000 
   4537725 1494056958 9033188242 part-00000

So, it looks like we're good to go here. Before I start digging into this data, I'm going to kick off a version of this diff processing for English Wikipedia.


Enwiki kicked off. Now, back to our diffs. The next thing I need to do is re-sort the dataset. Right now, the diffs are in shuffled order, and I'd like them sorted so that I can process a whole page at a time. So, my plan is to write a script that extracts key fields from the JSON records so that I can use Unix sort to group and order revision diffs by page (see the sketch below). --Halfak (WMF) (talk) 21:44, 24 November 2014 (UTC)
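
A minimal sketch of the kind of key-extraction script described above, assuming each diff is a JSON object on its own line with page_id and timestamp fields (the field names and the script name are placeholders, not the actual schema of the diff output):

import json
import sys

# Emit "page_id <tab> timestamp <tab> original JSON" for each diff record so
# that sort can group revisions by page and order them chronologically.
for line in sys.stdin:
    doc = json.loads(line)
    sys.stdout.write("{0}\t{1}\t{2}".format(
        doc['page_id'],    # assumed field name
        doc['timestamp'],  # assumed field name
        line))             # `line` still ends with "\n"

Used as something like:

$ python extract_sort_keys.py < part-00000 | sort -t$'\t' -k1,1n -k2,2 > diffs_sorted.tsv

where -k1,1n sorts numerically by page_id and -k2,2 sorts lexically by timestamp, which works as long as the timestamps are fixed-width strings such as ISO 8601.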


So, I just found that sorting in Hadoop Streaming seems to be pretty straightforward[1] -- at least it is easy to configure. I have no idea how fast it will actually run. Time to run some tests!

First, I need to confirm that this would even do what I want. So I need some sample data to play around with sorting. --Halfak (WMF) (talk) 21:44, 24 November 2014 (UTC)
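
For reference, a sketch of the kind of Hadoop Streaming invocation described above: it partitions the mapper output by page_id (field 1) and sorts within each partition by timestamp (field 2). The paths, jar location, and mapper script name are placeholders, the property names are the Hadoop 2.x forms from the streaming documentation, and none of this has been tested yet.

# Partition by page_id, sort each partition by timestamp, pass records through.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -D stream.num.map.output.key.fields=2 \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
    -D mapreduce.partition.keycomparator.options="-k1,1n -k2,2" \
    -D mapreduce.partition.keypartitioner.options="-k1,1" \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -input diffs \
    -output diffs-sorted \
    -mapper extract_sort_keys.py \
    -reducer cat \
    -file extract_sort_keys.py

The idea is the same as the Unix sort pipeline above, just distributed: the partitioner keeps all revisions of a page together, and the comparator orders them by timestamp within each reduce partition.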