Research talk:Measuring edit productivity/Work log/2015-04-15
Wednesday, April 15, 2015
The diff job finished! Here are the Hadoop stats:
File System Counters
FILE: Number of bytes read=11992158169291
FILE: Number of bytes written=11342016265314
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=634498881239
HDFS: Number of bytes written=337764574821
HDFS: Number of read operations=13317
HDFS: Number of large read operations=0
HDFS: Number of write operations=4000
Job Counters
Launched map tasks=2439
Launched reduce tasks=2000
Data-local map tasks=2438
Rack-local map tasks=1
Total time spent by all maps in occupied slots (ms)=446505506700
Total time spent by all reduces in occupied slots (ms)=26616033150
Total time spent by all map tasks (ms)=44650550670
Total time spent by all reduce tasks (ms)=2661603315
Total vcore-seconds taken by all map tasks=44650550670
Total vcore-seconds taken by all reduce tasks=2661603315
Total megabyte-seconds taken by all map tasks=228610819430400
Total megabyte-seconds taken by all reduce tasks=13627408972800
Map-Reduce Framework
Map input records=583741359
Map output records=415592383
Map output bytes=8579023624969
Map output materialized bytes=3778907328006
Input split bytes=508991
Combine input records=0
Combine output records=0
Reduce input groups=415592383
Reduce shuffle bytes=3778907328006
Reduce input records=415592383
Reduce output records=415592383
Spilled Records=1246338388
Shuffled Maps =4878000
Failed Shuffles=0
Merged Map outputs=4878000
GC time elapsed (ms)=183793619
CPU time spent (ms)=45173296420
Physical memory (bytes) snapshot=6270283120640
Virtual memory (bytes) snapshot=14864163971072
Total committed heap usage (bytes)=8861103685632
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=634498372248
File Output Format Counters
Bytes Written=337764574821
15/04/09 14:00:04 INFO streaming.StreamJob: Output directory: /user/halfak/streaming/enwiki-20141106/diffs-snappy
real 8408m36.464s
user 4m50.086s
sys 7m25.772s
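To put those counters in perspective, here is a rough back-of-the-envelope sketch in Python. The numbers are copied from the counters and the `time` output above; the constant and key names are made up for illustration. The "average concurrent tasks" figure is only an approximation, since the wall-clock time from `time` presumably also includes any time the job spent queued or idle.

```python
# Values copied from the Hadoop counters and the `time` output above.
WALL_CLOCK_S = 8408 * 60 + 36.464               # real 8408m36.464s
MAP_TASK_MS = 44_650_550_670                    # Total time spent by all map tasks (ms)
REDUCE_TASK_MS = 2_661_603_315                  # Total time spent by all reduce tasks (ms)
MAP_INPUT_RECORDS = 583_741_359
MAP_OUTPUT_RECORDS = 415_592_383
MAP_OUTPUT_BYTES = 8_579_023_624_969
MAP_OUTPUT_MATERIALIZED_BYTES = 3_778_907_328_006

SECONDS_PER_DAY = 86_400


def summarize():
    """Derive a few human-scale ratios from the raw counters."""
    task_s = (MAP_TASK_MS + REDUCE_TASK_MS) / 1000
    return {
        # How long the job took from start to finish.
        "wall_clock_days": WALL_CLOCK_S / SECONDS_PER_DAY,
        # Summed running time of all map and reduce tasks.
        "task_days": task_s / SECONDS_PER_DAY,
        # Rough average number of tasks running at once.
        "avg_concurrent_tasks": task_s / WALL_CLOCK_S,
        # Fraction of input records that produced a map output record.
        "map_output_per_input_record": MAP_OUTPUT_RECORDS / MAP_INPUT_RECORDS,
        # Size on disk of the map output relative to its raw size.
        "materialized_over_raw_map_output":
            MAP_OUTPUT_MATERIALIZED_BYTES / MAP_OUTPUT_BYTES,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.2f}")
```

The last ratio probably reflects compression of the intermediate map output, but that is an assumption; the counters alone don't say why the materialized size is smaller.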
I had to implement some diff timeouts in order to get it to finish. For that reason, there are some edits that have no diff. I wasn't able to find them with a simple grep for "diff: null", so I'm just going to kick off the persistence job and see how it goes while I prepare to analyze the diffs. --Halfak (WMF) (talk) 16:15, 15 April 2015 (UTC)
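For readers wondering what a diff timeout might look like, here is a minimal sketch in Python. It is not the implementation used in this job. It swaps in `difflib.SequenceMatcher` as a stand-in differ, and the function names and the `{"ops": ..., "time": ...}` record shape are hypothetical. It also only bounds how long the caller waits: the worker thread is not interrupted and keeps running in the background, so a real job would more likely need a signal or a separate process.

```python
import difflib
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def diff_ops(a_tokens, b_tokens):
    """Stand-in differ: returns difflib opcodes as plain lists."""
    matcher = difflib.SequenceMatcher(None, a_tokens, b_tokens, autojunk=False)
    return [list(op) for op in matcher.get_opcodes()]


def diff_with_timeout(a_tokens, b_tokens, timeout, differ=diff_ops):
    """Run `differ`, giving up after `timeout` seconds.

    Returns a hypothetical record with "ops" set to None when the
    diff did not finish in time, plus the time spent waiting.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    start = time.time()
    future = pool.submit(differ, a_tokens, b_tokens)
    try:
        ops = future.result(timeout=timeout)
    except TimeoutError:
        ops = None  # no diff for this edit
    finally:
        # Don't block on a worker that may still be running.
        pool.shutdown(wait=False)
    return {"ops": ops, "time": time.time() - start}
```

A call such as `diff_with_timeout(old.split(), new.split(), timeout=10)` would then produce either a list of operations or a record whose `ops` is `None`.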
I had to make some modifications, but the script is now started. In the meantime, I want to (1) confirm that all the diffs are in fact not "null" and (2) plot the diff timing data that I extracted. --Halfak (WMF) (talk) 16:28, 15 April 2015 (UTC)
Well... it looks like I should have been looking for "ops: null". Oh well. Let's grab a sample and start working with it.
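A grep for a literal string is sensitive to where the null sits in the record and how the JSON is serialized. As a sketch, and assuming each output line is a JSON document with a `diff` object that carries an `ops` field (that record shape is a guess based on the note above, not a documented format), the missing diffs could be counted by parsing instead:

```python
import json


def count_missing_diffs(lines):
    """Count records whose diff has no operations.

    Assumes one JSON document per line with a "diff" object holding
    an "ops" field; this shape is hypothetical.
    Returns (missing, total).
    """
    missing = 0
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        total += 1
        diff = doc.get("diff")
        # Treat both a null diff and a null "ops" as a missing diff.
        if diff is None or diff.get("ops") is None:
            missing += 1
    return missing, total
```

This catches both the "diff: null" and the "ops: null" cases regardless of spacing or key order.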
So, I randomly sampled 100k revisions from the first reducer. That might result in some bias; I'm not sure. So I'll do some analysis on this while I pull a larger sample.
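One way to avoid the single-reducer bias is to draw a uniform sample across the output of every reducer in one pass. The sketch below uses reservoir sampling (Algorithm R); it illustrates the general technique and is not necessarily how the larger sample here was pulled. The function name is hypothetical.

```python
import random


def reservoir_sample(items, k, rng=None):
    """Return up to k items chosen uniformly at random from an iterable.

    Works in one pass without knowing the stream length in advance,
    so it can be fed lines chained from all reducer output files.
    """
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample
```

For example, `reservoir_sample(all_lines, 100000)` would keep 100k records in memory no matter how many lines `all_lines` yields.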
OK.
Well, that looks fast to me.
Let's look at some stats.
quantile(diff_stats$diff.time)
#   0%  25%  50%  75% 100%
# 0.00 0.02 0.05 0.13 3.85
summary(diff_stats$truncated)
#  False
# 100000
Cool. It looks like we're performing about right for my expectations. Now I'm just waiting for the proper sample to finish. --Halfak (WMF) (talk) 18:40, 15 April 2015 (UTC)
Looks like the persistence generator failed. That was because I changed the format of diffs in order to track stats. I've released a new version of mwstreaming (0.5.5) to fix this and restarted the job. --Halfak (WMF) (talk) 18:41, 15 April 2015 (UTC)
Well, it's all running, but this is going to take at least a couple of hours, so I'm going to go work on other things. If the sample finishes today, I'll update here. If not, look for future worklogs. --Halfak (WMF) (talk) 20:18, 15 April 2015 (UTC)
Update from the FUTURE! The proper sample completed. It looks like the stats didn't change in any meaningful way, but I did update the "Diff time density" plot above. --Halfak (WMF) (talk) 17:38, 16 April 2015 (UTC)
