Research talk:Measuring edit productivity/Work log/2014-12-31

From Meta, a Wikimedia project coordination wiki

Wednesday, December 31, 2014

Post-holiday status updates.

So, enwiki's diff generation is *still running*! Hadoop reports that the completion percentage hasn't budged substantially. Also, simplewiki's persistence stats failed to generate. Since enwiki is still running, I'm going to focus on simplewiki's failure first. --20:07, 31 December 2014 (UTC)


I've got a nice confusing one here.

org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
        at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:221)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at org.apache.hadoop.mapred.IFileOutputStream.write(IFileOutputStream.java:88)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:250)
        at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:208)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1886)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1484)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:437)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

I'm not running out of space in HDFS, so my guess is that the massive shuffle/sort keyed on rev_id is exhausting local temp space on the individual machines.
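A quick way to check that hypothesis is to look at free space on the nodes' local scratch directories. The paths below are placeholders, not the cluster's actual configuration; the real list comes from `mapreduce.cluster.local.dir` / `yarn.nodemanager.local-dirs`.

```python
"""
Sketch: report free space on candidate local scratch directories.
The paths are assumptions -- substitute the directories configured in
mapreduce.cluster.local.dir / yarn.nodemanager.local-dirs.
"""
import os
import shutil

# Hypothetical local dirs; not taken from the actual cluster config.
LOCAL_DIRS = ["/tmp", "/var/tmp"]

for path in LOCAL_DIRS:
    if not os.path.isdir(path):
        continue
    usage = shutil.disk_usage(path)
    print("{0}: {1:.1f} GB free of {2:.1f} GB".format(
        path, usage.free / 1e9, usage.total / 1e9))
```

If free space on those directories is small relative to the map output being spilled, that would explain both the "No space left on device" and the "Could not find any valid local directory" errors above.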

2014-12-24 20:02:51,012 WARN [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:halfak (auth:SIMPLE) cause:org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/attempt_1415917009743_51745_m_000001_0/file.out
2014-12-24 20:02:51,013 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/attempt_1415917009743_51745_m_000001_0/file.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:402)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.YarnOutputFiles.getOutputFileForWrite(YarnOutputFiles.java:84)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1813)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1484)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:437)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

This looks relevant too. OK. Time to turn this into a replicable thing so that I can ask ottomata to take a look. --Halfak (WMF) (talk) 20:57, 31 December 2014 (UTC)


OK... So I thought about it a little bit more and I think that the input size might be the issue. I've decided to dramatically cut down on the size of the records by dropping the text field, which I think will be for the best regardless. I had actually already written this functionality into json2diffs, but I set the default to keep the text.

Here's the script for dropping the text.

$ cat drop_text.py
"""
Drops the text field from a RevisionDocument.  Dramatically saves space. 
"""
import json, sys


for line in sys.stdin:
	doc = json.loads(line)
	
	if 'text' in doc:
		del doc['text']
	if 'revision' in doc and 'text' in doc['revision']:
		del doc['revision']['text']
	
	json.dump(doc, sys.stdout)
	sys.stdout.write("\n")
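A quick sanity check of the per-line logic. The function below mirrors what drop_text.py does to each line; the sample record is made up for illustration and real RevisionDocuments carry many more fields.

```python
import json

def drop_text(line):
    """Mirror of drop_text.py's per-line transformation."""
    doc = json.loads(line)
    doc.pop('text', None)
    if 'revision' in doc:
        doc['revision'].pop('text', None)
    return json.dumps(doc)

# Hypothetical record, not taken from an actual dump.
sample = '{"id": 42, "text": "Hello!", "revision": {"id": 7, "text": "old"}}'
print(drop_text(sample))
```

Both text fields should be gone from the output while the rest of the record passes through untouched.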