Research talk:Measuring editor labor hours/Work log/2015-09-08

Tuesday, September 8, 2015

Well... I'm a little late to start this work log. I've been working on this project for a while, but it was mostly in the context of mwsessions -- a generalized set of utilities for processing session information in MediaWiki. I just now split up the generalized code and my working code for this project. For the working code, see github.com/halfak/labor-hours.

Recap

To recap what I've done up until now:

  1. I gathered datasets of sorted revision-saved events (aka revision rows in the database) from both the revision and archive tables.
  2. I ran the mwsessions sessionize utility on the sequence of revisions until it could complete the job successfully.

Note that I've completed passes on cawiki (because someone needed it) and enwiki (because it is the biggest and most difficult).

1. Sorted revisions

revision table:

SELECT
  DATABASE() AS wiki,
  rev_id AS id,
  rev_page AS page_id,
  rev_user AS user_id,
  rev_user_text AS user_text,
  rev_timestamp AS timestamp,
  rev_sha1 AS sha1,
  rev_len AS len,
  False AS archived
FROM revision
WHERE rev_deleted = 0 /* No deleted revisions */
ORDER BY rev_timestamp ASC, rev_id ASC;

archive table:

/* Column names assumed from the MediaWiki archive table schema */
SELECT
  DATABASE() AS wiki,
  ar_rev_id AS id,
  ar_page_id AS page_id,
  ar_user AS user_id,
  ar_user_text AS user_text,
  ar_timestamp AS timestamp,
  ar_sha1 AS sha1,
  ar_len AS len,
  True AS archived
FROM archive
WHERE ar_deleted = 0 /* No deleted revisions */
ORDER BY ar_timestamp ASC, ar_rev_id ASC;
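Since both queries return rows already sorted by timestamp and then by ID, the two outputs can be combined into a single ordered sequence without re-sorting everything. The following is only a minimal sketch of that idea, not the code actually used in this project. The function name `merge_sorted_revisions` and the dict-shaped rows are hypothetical, and it relies on MediaWiki timestamps being fixed-width strings, so that string order matches chronological order.

```python
import heapq
from operator import itemgetter


def merge_sorted_revisions(revision_rows, archive_rows):
    """Lazily merge two row streams that are each sorted by (timestamp, id).

    Hypothetical helper: rows are assumed to be dicts with the column
    aliases used in the queries above.
    """
    sort_key = itemgetter("timestamp", "id")
    # heapq.merge streams its inputs, so neither one is loaded fully in memory.
    return heapq.merge(revision_rows, archive_rows, key=sort_key)


# Tiny made-up example rows (not real data).
revision_rows = [
    {"id": 1, "timestamp": "20150908000000", "archived": False},
    {"id": 3, "timestamp": "20150908000200", "archived": False},
]
archive_rows = [
    {"id": 2, "timestamp": "20150908000100", "archived": True},
]

merged = list(merge_sorted_revisions(revision_rows, archive_rows))
print([row["id"] for row in merged])
```

Because the merge is lazy, the same approach should also work when the rows are read from two large dump files rather than from in-memory lists.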


2. mwsessions sessionize

See the docs for this utility here: pythonhosted.org/mwsessions/

Essentially, this utility will convert a sequence of events into sessions with stats and (optionally) events labeled by session. It takes a little more than 24 hours to run on the full history of English Wikipedia using a single core.
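To illustrate the general idea of sessionizing, here is a rough pure-Python sketch. It is not the mwsessions implementation and does not use its API. The function name, the `(user, timestamp)` event shape, and the one-hour inactivity cutoff are all assumptions made for the example.

```python
CUTOFF_SECONDS = 3600  # assumed inactivity threshold; not stated in this log


def sessionize(events, cutoff=CUTOFF_SECONDS):
    """Group time-ordered (user, unix_timestamp) events into sessions.

    A user's session ends once more than `cutoff` seconds pass between
    two of that user's consecutive events. Yields (user, [timestamps]).
    """
    active = {}  # user -> timestamps in that user's currently open session
    for user, timestamp in events:
        session = active.get(user)
        if session is not None and timestamp - session[-1] > cutoff:
            # Gap is too long: close the old session and start a new one.
            yield user, session
            session = None
        if session is None:
            session = []
            active[user] = session
        session.append(timestamp)
    # The input is exhausted, so flush whatever sessions are still open.
    for user, session in active.items():
        yield user, session


# Made-up events, already sorted by timestamp.
events = [("A", 0), ("B", 100), ("B", 200), ("A", 600), ("A", 10000)]
sessions = list(sessionize(events))
print(sessions)
```

Per-session stats (number of events, duration, and so on) could then be computed from each list of timestamps.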

Loading into the DB

To prepare for my analysis, I'll be loading the sessions into the database. I have 158 million sessions and 670 million session revisions. But now it is time for a meeting, so I'm going to save the log and come back. --Halfak (WMF) (talk) 21:00, 8 September 2015 (UTC)

j/k. Meetings took too long. I guess I'm coming back tomorrow. --Halfak (WMF) (talk) 23:43, 8 September 2015 (UTC)