Research talk:Measuring edit productivity/Work log/2015-09-16

Wednesday, September 16, 2015

It's been a while, but I haven't put this project down. I've spent most of my hours honing my utilities for processing content persistence; see pythonhosted.org/mwpersistence. I've also been working with other researchers who use similar content-tracking strategies, so that we can try to converge on a general approach.

Anyway, it's time to get some analysis done, so that's why I'm here today. See https://github.com/halfak/measuring-edit-productivity for code that I'll be referencing.

So, first things first: I'm updating the Makefile to let me use a set of Snappy files that I pulled from the Hadoop cluster to stat1003, so that I can try processing the data in single-server mode.

Before I can do that, I need to be able to process our Snappy-compressed files. See Phab:T112770. --Halfak (WMF) (talk) 17:31, 16 September 2015 (UTC)

Regretfully, this is a blocker for me, so I'm going to go back to Hadoop and re-compress these files as bz2. *sigh* --Halfak (WMF) (talk) 21:14, 16 September 2015 (UTC)

I've learned a few things:

  1. Hadoop's Snappy compression uses its own framing, so it will not work with snzip anyway.
  2. It's better if I just recompress the files as bz2 in Hadoop.
  3. In order to preserve page partitioning and chronological order, I have to make Hadoop re-sort the data -- even though it is already sorted. (See the partitioner and key comparator options in the script below.)

Basically, I'm done with Snappy. I'll be converting my whole workflow to bz2 asap.
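
Once everything is bz2, the single-server side gets much simpler, since standard tools can stream the part files. Here's a rough sketch of the kind of step the Makefile will run on stat1003 (the paths are placeholders, and it just reuses the mwstream utility from the Hadoop job below; this is not the actual rule from the repo):

# Sanity check: stream a few bz2-compressed revision documents through mwstream.
# The paths are placeholders for wherever the part files land on stat1003.
bzcat data/enwiki-revision-json-bz2/part-00000.bz2 | \
    ./mwstream json2tsv page.id timestamp id - | \
    head -n 5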

For now, I've kicked off a new job to do the recompression. --Halfak (WMF) (talk) 22:31, 16 September 2015 (UTC)


(Note: posting from the next morning)

Here's the script that I wrote:

#!/bin/bash
# Gather command line args
job_name=$1
input=$2
output=$3

echo "Zipping up virtualenv"
cd /home/halfak/venv/3.4/
zip -rq ../3.4.zip *
cd -
cp /home/halfak/venv/3.4.zip virtualenv.zip

echo "Moving virtualenv.zip to HDFS"
hdfs dfs -put -f virtualenv.zip /user/halfak/virtualenv.zip;

echo "Running hadoop job"
hadoop jar /opt/hadoop/share/hadoop/tools/lib/hadoop-*streaming*.jar \
    -D  mapreduce.job.name=$job_name \
    -D  mapreduce.output.fileoutputformat.compress=true \
    -D  mapreduce.output.fileoutputformat.compress.type=BLOCK \
    -D  mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
    -D  mapreduce.task.timeout=6000000 \
    -D  stream.num.map.output.key.fields=3 \
    -D  mapreduce.partition.keypartitioner.options='-k1,1n' \
    -D  mapreduce.job.output.key.comparator.class="org.apache.hadoop.mapred.lib.KeyFieldBasedComparator" \
    -D  mapreduce.partition.keycomparator.options='-k1,1n -k2,2 -k3,3n' \
    -D  mapreduce.reduce.speculative=false \
    -D  mapreduce.reduce.env="LD_LIBRARY_PATH=virtualenv/lib/" \
    -D  mapreduce.map.env="LD_LIBRARY_PATH=virtualenv/lib/" \
    -D  mapreduce.map.memory.mb=1024 \
    -D  mapreduce.reduce.memory.mb=1024 \
    -D  mapreduce.reduce.vcores=2 \
    -D  mapreduce.job.reduces=2000 \
    -files       hadoop/mwstream  \
    -archives    'hdfs:///user/halfak/virtualenv.zip#virtualenv' \
    -input       $input \
    -output      $output \
    -mapper      "bash -c './mwstream json2tsv page.id timestamp id -'" \
    -reducer     "bash -c 'cut -f4'" \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
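
Invoking the script looks something like this (the script name, job name, and HDFS paths here are just placeholders, not the real ones):

bash recompress_bz2.sh \
    "halfak: recompress revision json to bz2" \
    /user/halfak/enwiki-revision-json \
    /user/halfak/enwiki-revision-json-bz2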

Everything went as planned and I'm pulling the data down to stat1003 as I type. --Halfak (WMF) (talk) 14:36, 17 September 2015 (UTC)
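
For reference, pulling the job output down to the stat box is just an HDFS copy, something along these lines (placeholder paths again):

hdfs dfs -get /user/halfak/enwiki-revision-json-bz2 /srv/halfak/enwiki-revision-json-bz2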