Research talk:Measuring edit productivity/Work log/2015-09-29


Wednesday, September 30, 2015

So, I've been working with the output a little bit and I found a problem. It turns out that the diff algorithm behaves strangely when you re-use the abstract segment tree across two diffs. While I've already fixed the issue and added tests, the problem is in the diff algorithm -- the beginning of the pipeline -- so I'll need to re-run the diffs that I had previously run. *sigh*. So, regretfully, I won't be able to do much of an interesting analysis with this dataset in the short term. However, I did spend some time honing my analysis techniques on the data I do have, so I'll take a little bit of time to go over that here.

First, let's look at the survival of tokens by the number of seconds they remain visible. I can imagine two good strategies for looking at this: a density plot of the time of removal and a death hazard plot.

Time visible (density). The density of total time visible when removed is plotted for word-tokens. (bounded at 72 hours)
Time visible (hazard). The hazard of removal is plotted for word-tokens as the time visible increases. (bounded at 72 hours)

Well, that's interesting. I suppose it makes sense that a hazard plot would look similar to a density plot of the death times, but I didn't expect that they'd look nearly identical! One feature that jumps out is the cycle of peaks every 24 hours. I bet that's due to the cyclical patterns of activity (and watchlist views) that happen as the earth turns. It also looks like the vast majority of the density/hazard falls away after the first few hours, which corresponds to my previous observations about quality control in enwiki (see When the Levee Breaks: Without Bots, What Happens to Wikipedia’s Quality Control Processes?).
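(Side note on why the two plots track each other: the hazard at time t is just the density of removals at t divided by the fraction of tokens still surviving at t, so the curves will have nearly the same shape wherever the surviving fraction changes slowly over the plotted window. For reference, here's a minimal sketch of how both estimates can be computed from an array of per-token "seconds visible before removal" values -- the hourly bins and 72-hour bound are assumptions for illustration, not the pipeline's actual analysis code.)

```python
# Illustrative sketch only: estimate the removal-time density and the
# discrete-time hazard from per-token "seconds visible before removal"
# values.  Hourly bins and the 72-hour bound are assumptions.
import numpy as np

def density_and_hazard(seconds_visible, bin_width=3600, bound=72 * 3600):
    times = np.asarray(seconds_visible, dtype=float)
    times = times[times <= bound]
    bins = np.arange(0, bound + bin_width, bin_width)
    deaths, _ = np.histogram(times, bins=bins)

    # Density: fraction of observed removals that fall in each bin.
    density = deaths / deaths.sum()

    # Hazard: removals in a bin divided by the tokens still "at risk"
    # (i.e., not yet removed) when that bin begins.
    still_alive = len(times) - np.concatenate(([0], np.cumsum(deaths[:-1])))
    hazard = deaths / np.maximum(still_alive, 1)

    return bins[:-1], density, hazard
```

Plotting density and hazard against the bin starts should reproduce curves of the same shape as the figures above, modulo the binning choice.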

OK. On to the number of revisions that tokens persist before they are removed (just doing the hazard this time).

Revisions persisted (hazard). The hazard of removal is plotted for word-tokens at increasing counts of revisions persisted. (bounded at 45 revisions)

Here, we see a much more regular pattern. It looks like most of the hazard drops away after a few follow-up revisions, but the hazard continues to fall as more revisions accumulate. So a good threshold for flagging damage would sit somewhere around 3-5 revisions; but, assuming the hazard decay reflects some quality aspect of the contribution, more can be learned about that aspect by observing long-term revision persistence. It'll be fun to look at which revisions these tokens are part of to dig in further. It'll also be fun to see how this changes once I complete a new run of diffs against enwiki.
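(For completeness, the same discrete-hazard estimate works per follow-up revision, and a simple persistence threshold like the 3-5 revisions suggested above can serve as a crude damage proxy. The sketch below is illustrative only -- the 45-revision bound and 5-revision threshold mirror the discussion here but are otherwise arbitrary, and censoring of never-removed tokens is ignored for simplicity.)

```python
# Illustrative sketch only: per-revision removal hazard for word-tokens,
# given "revisions persisted before removal" counts, plus a simple
# persistence threshold as a rough damage proxy.
import numpy as np

def revision_hazard(revisions_persisted, max_revisions=45):
    removed_at = np.clip(np.asarray(revisions_persisted, dtype=int), 0, max_revisions)
    deaths = np.bincount(removed_at, minlength=max_revisions + 1)

    # Tokens "at risk" at revision r are those removed at revision r or later.
    at_risk = deaths[::-1].cumsum()[::-1]
    return deaths / np.maximum(at_risk, 1)

def short_lived(revisions_persisted, threshold=5):
    # Flag tokens removed within `threshold` follow-up revisions.
    return np.asarray(revisions_persisted) < threshold
```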

--Halfak (WMF) (talk) 00:34, 30 September 2015 (UTC)