Research talk:Revision scoring as a service/Work log/2016-02-16
Wednesday, February 17, 2016
OK. I'm working on Urdu wiki. Urdu has this problem that it is a BotPedia(TM). Most of the edits are made by bots or other automated tools. Further, it doesn't get nearly as much attention as the big Wikipedias. These two things combined mean that the rate of vandalism on Urdu Wikipedia is very low. That makes it hard to get enough labels (a labor hour problem) to build a representative set of vandalism. You can't assume that all reverted edits are vandalism and you can't assume that any non-reverted edit is non-vandalism! I've been thinking about this problem a lot when building training/testing sets for Wikidata.
So, here's my plan. We're going to get a *really* big set of revisions (say, 500k) and then we're going to split these revisions by whether or not we can trust them. E.g. if an edit is reverted, you can never trust it was good. But if an edit is saved by a sysop and not reverted, it's probably good. We'll measure the proportions of these "potentially sketchy" edits and "trusted" edits and then subsample them into a balanced set for labeling. After labeling is complete, we can sample with replacement to re-achieve the original proportions of "potentially sketchy" and "trusted" edits. Then this forms our training and testing set.
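The last step of the plan (sampling with replacement to get back to the original proportions) could look roughly like the sketch below. This is only an illustration under stated assumptions: `resample_to_proportions` and its parameters are hypothetical names, and the labeled observations are assumed to be already split into a "potentially sketchy" list and a "trusted" list.

```python
import random


def resample_to_proportions(sketchy, trusted, sketchy_share, n, seed=None):
    """Draw n labeled observations with replacement so that roughly
    sketchy_share of them come from the "potentially sketchy" group.

    sketchy, trusted: lists of labeled observations (hypothetical inputs)
    sketchy_share:    proportion of sketchy edits measured on the big sample
    """
    rng = random.Random(seed)
    n_sketchy = round(n * sketchy_share)
    # random.choices samples WITH replacement, so the balanced labeled set
    # can be stretched back out to the original group proportions.
    resampled = rng.choices(sketchy, k=n_sketchy)
    resampled += rng.choices(trusted, k=n - n_sketchy)
    rng.shuffle(resampled)
    return resampled


# Toy usage: 10 labeled observations per group, 3% sketchy in the wild.
toy_sketchy = [("sketchy", i) for i in range(10)]
toy_trusted = [("trusted", i) for i in range(10)]
toy_set = resample_to_proportions(toy_sketchy, toy_trusted, 0.03, 1000, seed=0)
```

Because the trusted side is heavily upsampled, the same labeled edit shows up many times, so any train/test split is probably safer done before this resampling step.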
OK. 500k revisions. See http://quarry.wmflabs.org/query/6337
Now to run this pre-labeling script that looks up whether the user is in a trusted group or has made a trusted number of edits. Then it checks if an edit was reverted and flags any edit as "needs review" == "potentially sketchy".
$ cat datasets/urwiki.sampled_revisions.500k_2015.tsv | \
> editquality prelabel https://ur.wikipedia.org \
> --trusted-groups=bot,bureaucrat,sysop,rollbackers \
> --trusted-edits=1000 \
> --verbose > \
> datasets/urwiki.prelabeled_revisions.500k_2015.tsv
$ wc urwiki.prelabeled_revisions.500k_2015.tsv
  499906  1987829 13877876 urwiki.prelabeled_revisions.500k_2015.tsv
Cool. So, it looks like I wasn't able to look up 500,000 - 499,906 = 94 edits, i.e. 94/500,000 = 0.019% of the sample. That seems plausible.
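For reference, the pre-labeling rule described above can be sketched as a small function. This is not the actual `editquality prelabel` implementation: the function name, the order of the checks, the `>=` threshold and the "trusted edits" reason string are all assumptions made for illustration. Only the group list and the 1000-edit threshold come from the command above, and the "reverted" and "trusted group" reasons from the output further down.

```python
# Values taken from the command line above.
TRUSTED_GROUPS = {"bot", "bureaucrat", "sysop", "rollbackers"}
TRUSTED_EDITS = 1000


def prelabel(user_groups, user_editcount, reverted):
    """Return (needs_review, reason) for one revision.

    Hypothetical sketch: a reverted edit is never trusted; otherwise the
    edit is trusted if its author is in a trusted group or has made at
    least TRUSTED_EDITS edits. Everything else needs review.
    """
    if reverted:
        return True, "reverted"
    if TRUSTED_GROUPS & set(user_groups):
        return False, "trusted group"
    if user_editcount >= TRUSTED_EDITS:
        return False, "trusted edits"  # assumed reason string
    return True, None  # shows up as NULL in the TSV


print(prelabel(["sysop"], 12, reverted=False))
print(prelabel([], 12, reverted=False))
```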
Now, how many "need review"?
$ cat urwiki.prelabeled_revisions.500k_2015.tsv | grep "True" | wc
  12928   39918  242493
$ cat urwiki.prelabeled_revisions.500k_2015.tsv | grep "False" | wc
 486977 1947908 13635356
Let's see. 12,928/486,977 = 2.65%. Cool. So that means we can make a labeling set about 2.65% * 2 = 5.3% the size of the input set.
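The arithmetic can be checked with a few lines of Python. The counts are the `wc` line counts above; the variable names are just for illustration. Note that 2.65% is the ratio of "needs review" to "trusted" rows, so the balanced set comes out slightly under 5.3% of all rows when measured against the total.

```python
needs_review = 12_928   # rows matching "True"
trusted = 486_977       # rows matching "False"
total = needs_review + trusted

# Ratio of "needs review" rows to "trusted" rows (the 2.65% figure).
ratio = needs_review / trusted

# A balanced labeling set pairs every "needs review" row with one
# "trusted" row, so it is twice the size of the smaller group.
balanced_set_size = 2 * needs_review
balanced_share = balanced_set_size / total

print(f"ratio: {ratio:.2%}")
print(f"balanced set: {balanced_set_size} rows, {balanced_share:.2%} of input")
```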
So, now, we want to randomly sample a reasonably sized labeling set from this. Let's look at English Wikipedia.
$ cat enwiki.rev_damaging.20k_2015.tsv | grep True | wc
    807    1614   12105
$ cat enwiki.rev_damaging.20k_2015.tsv | grep False | wc
  19193   38386  307088
OK. So it looks like we have only 807 damaging observations. I think we'll want more than that. Let's look at how many of the edits in our original 500k were reverted.
$ cat urwiki.prelabeled_revisions.500k_2015.tsv | grep "reverted" | wc
    717    2868   19359
OK. Not so many. So a sizable proportion of those are probably some version of damage.
I was thinking that we'd sample something like 2.5k from both the trusted and "needs review" sides of the dataset and run with that. But we should expect only (717/19193)*2500 = ~93.4 reverted revisions to even get labeled. Unless Urdu Wikipedians don't revert a lot of damage, that's not going to be very many examples of vandalism to learn on. Further, we're going to need to split out some more for testing. Hmmm...
OK. I thought about it and I realized that we can always load more into Wiki labels if we need them. There's no way to get more observations other than hard work and we've managed to cut ~95% from the initial work. If we have to come back to review a bit more of the "needs review" group, that'll be a ~97.5% reduction (because we likely won't need any more "trusted" observations). --EpochFail (talk) 00:12, 17 February 2016 (UTC)
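A rough check of these numbers, as a sketch with two assumptions: that every reverted revision sits in the "needs review" group, and that labeling effort scales with the number of revisions. The ~93.4 estimate above divides by 19,193; dividing by the 12,928 "needs review" rows instead gives a somewhat larger expectation, though still a small number of examples either way.

```python
INPUT_SIZE = 500_000
NEEDS_REVIEW = 12_928   # rows flagged "True"
REVERTED = 717          # rows with reason "reverted"
PER_GROUP = 2_500       # planned sample size per group

# Expected reverted revisions in a random 2.5k draw from "needs review",
# assuming all reverted revisions are in that group.
expected_reverted = REVERTED / NEEDS_REVIEW * PER_GROUP

# Work saved by labeling a fully balanced set instead of everything.
balanced_reduction = 1 - (2 * NEEDS_REVIEW) / INPUT_SIZE

# Work saved if only the whole "needs review" group ever gets labeled
# (one reading of the ~97.5% figure, ignoring the trusted sample).
needs_review_only_reduction = 1 - NEEDS_REVIEW / INPUT_SIZE

print(f"expected reverted in sample: {expected_reverted:.1f}")
print(f"balanced-set reduction: {balanced_reduction:.1%}")
print(f"needs-review-only reduction: {needs_review_only_reduction:.1%}")
```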
Making the review set
Basically, what I need to do is randomly sample 2500 revisions each from the "trusted" and "needs review" edits. Here's a crazy-looking bit of bash to do it.
( echo "rev_id\tneeds_review\treason";
  ( cat datasets/urwiki.prelabeled_revisions.500k_2015.tsv | \
      grep "True" | \
      shuf -n 2500; \
    cat datasets/urwiki.prelabeled_revisions.500k_2015.tsv | \
      grep "False" | \
      shuf -n 2500 \
  ) | \
  shuf \
) > datasets/urwiki.revisions_for_review.5k_2015.tsv
$ wc datasets/urwiki.revisions_for_review.5k_2015.tsv
  5001  17709 116800 datasets/urwiki.revisions_for_review.5k_2015.tsv
$ head -n5 datasets/urwiki.revisions_for_review.5k_2015.tsv
rev_id	needs_review	reason
1239668	True	NULL
1337867	True	NULL
1466161	True	NULL
1160958	False	trusted group
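For comparison, the same balanced sample could be built in Python. This is a sketch rather than what was actually run: `build_review_set` and `write_tsv` are hypothetical helper names, and rows are assumed to be (rev_id, needs_review, reason) string tuples read from the prelabeled TSV. It filters on the needs_review column itself, whereas `grep "True"` matches the string anywhere on the line.

```python
import io
import random


def build_review_set(rows, n_per_group=2500, seed=None):
    """Sample up to n_per_group rows from each side and shuffle them together."""
    rng = random.Random(seed)
    needs_review = [r for r in rows if r[1] == "True"]
    trusted = [r for r in rows if r[1] == "False"]
    sample = rng.sample(needs_review, min(n_per_group, len(needs_review)))
    sample += rng.sample(trusted, min(n_per_group, len(trusted)))
    rng.shuffle(sample)  # mix the two groups, like the final `shuf`
    return sample


def write_tsv(rows, out):
    """Write a header line followed by tab-separated rows."""
    out.write("rev_id\tneeds_review\treason\n")
    for row in rows:
        out.write("\t".join(row) + "\n")


# Toy usage with made-up revision IDs.
toy_rows = [(str(i), "True", "NULL") for i in range(10)]
toy_rows += [(str(100 + i), "False", "trusted group") for i in range(50)]
toy_sample = build_review_set(toy_rows, n_per_group=5, seed=1)
buf = io.StringIO()
write_tsv(toy_sample, buf)
```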
First, to create the campaign.
u_wikilabels=> INSERT INTO campaign (name, wiki, form, view, created, labels_per_task, tasks_per_assignment, active) VALUES ('معیار ترمیم کریں ( 5K متوازن )', 'fawiki', 'damaging_and_goodfaith', 'DiffToPrevious', NOW(), 1, 50, True);
INSERT 0 1
Woops! That's not supposed to be "fawiki".
u_wikilabels=> select * from campaign where active and wiki = 'fawiki';
u_wikilabels=> select * from campaign where active and wiki = 'fawiki' and id = 23;
 id |              name              |  wiki  |          form          |      view      |          created           | labels_per_task | tasks_per_assignment | active
----+--------------------------------+--------+------------------------+----------------+----------------------------+-----------------+----------------------+--------
 23 | معیار ترمیم کریں ( 5K متوازن ) | fawiki | damaging_and_goodfaith | DiffToPrevious | 2016-02-17 00:43:04.861367 |               1 |                   50 | t
(1 row)

u_wikilabels=> update campaign set wiki = "urwiki" where active and wiki = 'fawiki' and id = 23;
ERROR:  column "urwiki" does not exist
LINE 1: update campaign set wiki = "urwiki" where active and wiki = ...
                                   ^
u_wikilabels=> update campaign set wiki = 'urwiki' where active and wiki = 'fawiki' and id = 23;
UPDATE 1
Now, to run the loading script.
halfak@wikilabels-01:~/backups$ bzcat ../datasets/urwiki.revisions_for_review.5k_2015.tsv.bz2 | /srv/wikilabels/venv/bin/wikilabels task_inserts 23 | psql -h wikilabels-database --user u_wikilabels u_wikilabels -W
Password for user u_wikilabels:
INSERT 0 5000