Research:Vandal fighter work load/Journal

From Meta, a Wikimedia project coordination wiki

Monday, June 20th[edit]

The plan today is to gather my reverts with the vandal codings labeled. I have data from Jan. 2009, but it is pretty old so I'll be working to gather data up until Jan. 2011. I'll be using a script to perform the analysis using the Wikilytics machines.


Currently waiting on space to move the Jan. 2009 file to the slave database so I can join it up with the user and revision tables. --EpochFail 21:35, 20 June 2011 (UTC)


Permissions set up for storage space. Currently waiting for transfer. I also got a few problems solved today and helped get Fabian started with some resources. I'm still preparing to hand off the dump map script. --EpochFail 00:10, 21 June 2011 (UTC)

Tuesday, June 21st - Thursday, June 23rd[edit]

Time lost due to server downtime. New revert data was generated for 2011, but loading it into the database takes so long that I might not be able to load it before the end of the week. Starting now anyway.

Also, I published a SMP package for python. :) see [1]

Friday, June 24th[edit]

Finally got dataset in database and indexed.

Dataset not pulled out of database until 4:44PM.  :( Looks like I'm working through the weekend.

It looks like the "is_reverted" field didn't load properly. MySQL enforces a "tinyint" type of the field (boolean doesn't exist) and it appears that some values are set to 127 rather than 1 or 0. Not sure what is going on there.

It looks like the reverting revision values are not right. There is definitely some weirdness in this data. It looks like I might have to work from the newly generated stuff. EpochFail 00:16, 25 June 2011 (UTC)


Weekend, June 25&26th[edit]

Built revert table from reverted dataset. I'm doing D_LOOSE and D_STRICT in MySQL which makes me uneasy, but a series of quick checks confirms that I can do case insensitive matching (took over an hour to verify that one with the VARBINARY datatype). --EpochFail 18:54, 27 June 2011 (UTC)


Monday, June 27th[edit]

It's my half-birthday!

I loaded the reverter_months dataset into R today to find out that the number of reverted vandal revisions never exceeds 2. This is ridiculous, so I'm exploring what could have caused this. I suspect that there is some strangeness in MySQL's SUM() function. --EpochFail 19:20, 27 June 2011 (UTC)

Here is the query:

SELECT 
	SUBSTRING(rev_timestamp, 1,4)                           as year, 
	SUBSTRING(rev_timestamp, 5,2)                           as month, 
	rev_user                                                as user_id,
	u.user_name                                             as username,
	COUNT(*)                                                as revisions, 
	SUM(rvt.revision_id IS NOT NULL)                        as reverts, 
	SUM(rvt.revision_id IS NOT NULL AND rvt.is_vandalism)   as vandal_reverts
FROM revision r
LEFT JOIN halfak.revert_20100130 rvt
	ON r.rev_id = rvt.revision_id
INNER JOIN user u
	ON r.rev_user = u.user_id
WHERE rev_timestamp < "20110000000000"
GROUP BY SUBSTRING(rev_timestamp, 1,4), SUBSTRING(rev_timestamp, 5,2), rev_user, u.user_name

^^^^^^

(

I just realized I was pulling from the wrong dataset. I should be pulling from 20110115, but the table I pulled above was the broken one "20100130". I'm going to kill that table immediately. --EpochFail 21:43, 27 June 2011 (UTC)

Started processes to get individual years so that I don't have to wait so long. First up is 2010. Soon to follow are stats. --EpochFail 22:12, 27 June 2011 (UTC)

This is taking a long time to re-run so I am getting some total counts too. It should be interesting to throw out there:

SELECT
	SUBSTR(rev_timestamp, 1,4),
	SUBSTR(rev_timestamp, 5,2),
	count(*), 
	sum(revision_id IS NOT NULL), 
	sum(revision_id IS NOT NULL AND is_vandalism) 
FROM halfak.revert_20110115 rvt 
LEFT JOIN revision r ON rvt.revision_id = r.rev_id
GROUP BY 
	SUBSTR(rev_timestamp, 1,4),
	SUBSTR(rev_timestamp, 5,2);

I just wrote up the sprint for quick processing Wikidumps and updated the software to handle a use-case that Fabian brought up (processPage functions that have no output). --EpochFail 00:13, 28 June 2011 (UTC)

Somehow I've made it to the end of another day with a dataset. I've had a query running for the majority of today though so hopefully that spits out something tonight so I can get some stats done in the morning. --EpochFail 00:15, 28 June 2011 (UTC)

Tuesday, June 28th[edit]

I have data. The two sets of queries I was running have finished and everything looks good. First thing is to load the datasets into R and start producing graphics. I'm thinking of a time series over reverts and reverts for vandalism. --EpochFail 16:05, 28 June 2011 (UTC)

year reverted edits reverted vandalism
2001 519 0
2002 6785 118
2003 28897 9611
2004 149592 51848
2005 699808 238704
2006 2467566 988344
2007 3672055 1219215
2008 3865183 1521243
2009 3252186 892660
2010 2952958 596456
2011 114466 15203


Currently struggling to load in reverter_months dataset (so I can measure how the work load is distributed). I'm trying to load in ~7.5 million records, but I'm only getting about 3 million. --EpochFail 18:29, 28 June 2011 (UTC)


Formatting the following graph took a surprisingly long time. R gets upset when the X axis values aren't simply numbers. (The Year/Month conbination had to be coerced and then reconstructed. Ugh)


The following plots rank editors by the amount of vandal reverting activity they are involved in.

It looks like the regular expression I am using to catch reverts for vandalism is limited in that it doesn't catch Huggle activity, which I suspect comprises a large amount of vandal fighting work. So I'm reproducing the plots for total revert activity. I'm assuming that if editors revert more than 1000 edits in a year, the majority of those reverts are probably vandal-related. I think that those plots will be ready tomorrow. --EpochFail 00:59, 29 June 2011 (UTC)

Wednesday, June 29th[edit]

I'm starting to get things together on the actual project page so I'll skip logging my work here since I am just doing it there. --EpochFail 21:52, 29 June 2011 (UTC)

Actually... now it is time to look at editor workload. So I guess I will keep writing here. The plan is to generate average non-bot-editor workload for each month and plot the results. This is similar to the patroller workload plot, so I hope to just use that R code. If everything goes nicely, I expect to have this done by the end of today. That will allow me to get the history fellow data tomorrow and finish cleaning up the project page.  :) Then Friday I get to be a scientist and think/collaborate. *flexes muscles* --EpochFail 21:55, 29 June 2011 (UTC)

Just kidding. I helped set up Staeiou on internproxy and showed him how to use screen. Then meetings consumed the rest of the day. I did, however, plan the next steps for research with RyanF to understand how vandals get banned, how long it takes and what conversion to good editing looks like.

Signing off. --EpochFail 00:53, 30 June 2011 (UTC)

Thursday, June 30th[edit]

Back to editor workload. I'm going to plot the vandal reverters vs. active editors and then work load per editor month with regression lines. Hopefully I can squeeze that in around the RCom meeting...which is at lunch time. :( --EpochFail 16:50, 30 June 2011 (UTC)

So... It looks like vandal fighters represent a decreasing proportion of active editors. Interesting.

Super vandal vighters (reverting >=50 vandals / month) look similar to vandal fighters (reverting >= 5 vandals / month)


Just got lunch, had a discussion with WSOR about newbie thresholds and headed to the RCom meeting. Time (entropy) proves irreversible. --EpochFail 20:03, 30 June 2011 (UTC)

Finally back to my desk and working. It looks like there was a little work on the newbie classification variables that I was discussing earlier with WSOR. I Looked them over and made updates with Melanie. Now it is time to finish the vandals fighter work load analysis. ...the thing I have been working to this whole time! --EpochFail 21:19, 30 June 2011 (UTC)

I'm waiting for things to loa d, so I figured I'd do a quick brain dump:

  • It looks like the proportion of active editors who are working on vandal fighting is decreasing. This could make sense due to the increasing effectiveness of vandal bots and tools like Huggle that make discovering and reverting vandalism quick and easy.
  • The regular expression I am using to detect vandal reverts isn't good enough. I can get away with looking at raw reverts as anti-vandal work in high volume situations, but I'd like to be confident about the editors with a low amount of anti-vandal work too. Staeiou and I are planning to repair and extend that approach next week.
  • This is taking too long! I've written 3 emails + this while I've been waiting. I actually got my result a few minutes ago, but I had to restart the R aggregation because of an error in the trivial post processing. I won't make that mistake again.
  • I think I'll hop on the Research:Index to go look for RCom work while I wait.

--EpochFail 22:31, 30 June 2011 (UTC)

Friday, July 1st[edit]

It looks like vandal fighter work load is going down. Graphs to follow after meeting. --EpochFail 19:59, 1 July 2011 (UTC)