Research talk:Revision scoring as a service/Work log/2016-02-16

Wednesday, February 17, 2016


OK. I'm working on Urdu Wikipedia. Urdu has this problem that it is a BotPedia(TM): most of the edits are made by bots or other automated tools. Further, it doesn't get nearly as much attention as the big Wikipedias. These two things combined mean that the rate of vandalism on Urdu Wikipedia is very low. That makes it hard to get enough labels (a labor-hour problem) to build a representative set of vandalism. You can't assume that all reverted edits are vandalism, and you can't assume that every non-reverted edit is non-vandalism! I've been thinking about this problem a lot when building training/testing sets for Wikidata.

So, here's my plan. We're going to get a *really* big set of revisions (say, 500k) and then we're going to split these revisions by whether or not we can trust them. E.g. if an edit is reverted, you can never trust it was good. But if an edit is saved by a sysop and not reverted, it's probably good. We'll measure the proportions of these "potentially sketchy" edits and "trusted" edits and then subsample them into a balanced set for labeling. After labeling is complete, we can sample with replacement to re-achieve the proportions of "potentially sketchy" and "trusted" edits. Then this forms our training and testing set.
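The final resampling step of this plan might look roughly like the sketch below. It is only an illustration under stated assumptions: the labeled file is a TSV with the "needs review" flag in its second column, the function name `resample_stratum`, the tiny input file, and the 3-to-97 target mix are all hypothetical, and GNU `shuf -r` is used to sample with replacement.

```shell
# Sketch: rebuild a set with the original "needs review"/"trusted" mix by
# sampling labeled rows with replacement. Requires GNU shuf (-r = repeat).

# resample_stratum FILE FLAG N
# Draw N rows (with replacement) from the rows of FILE whose second
# tab-separated column equals FLAG. The column layout is an assumption.
resample_stratum() {
  awk -F'\t' -v flag="$2" '$2 == flag' "$1" | shuf -r -n "$3"
}

# Tiny hypothetical labeled file: rev_id, needs_review, damaging
labeled=$(mktemp)
printf '%s\t%s\t%s\n' \
  1 True True  2 True False  3 False False  4 False False > "$labeled"

# Hypothetical target mix: 3 "needs review" rows per 97 "trusted" rows.
{
  resample_stratum "$labeled" True 3
  resample_stratum "$labeled" False 97
} | shuf > "${labeled}.resampled"

# Number of rows in the resampled set
wc -l < "${labeled}.resampled"
```

Because the draw is with replacement, the small balanced set can be stretched back to whatever mix was measured in the full sample; the real target counts would come from the measured proportions rather than the placeholder values used here.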

OK. 500k revisions. See http://quarry.wmflabs.org/query/6337

Now to run this pre-labeling script that looks up whether the user is in a trusted group or has made a trusted number of edits. Then it checks if an edit was reverted and flags any edit as "needs review" == "potentially sketchy".

$ cat datasets/urwiki.sampled_revisions.500k_2015.tsv | \
> editquality prelabel https://ur.wikipedia.org \
>   --trusted-groups=bot,bureaucrat,sysop,rollbackers \
>   --trusted-edits=1000 \
>   --verbose > \
> datasets/urwiki.prelabeled_revisions.500k_2015.tsv

$ wc urwiki.prelabeled_revisions.500k_2015.tsv 
  499906  1987829 13877876 urwiki.prelabeled_revisions.500k_2015.tsv

Cool. So, it looks like I wasn't able to look up 94/500,000 = 0.019% of edits. That seems likely.

Now. How many "need review".

$ cat urwiki.prelabeled_revisions.500k_2015.tsv | grep "True" | wc
  12928   39918  242493
$ cat urwiki.prelabeled_revisions.500k_2015.tsv | grep "False" | wc
 486977 1947908 13635356

Let's see. 12,928/486,977 = 2.65%. Cool. So that means, we can make a labeling set about 2.65% * 2 = 5.3% the size of the input set.
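The counts above come from `grep`, which matches "True" or "False" anywhere on a line. A slightly stricter tally, sketched below, reads the flag column directly with awk. It assumes a header row and that the flag is the second tab-separated column (as in the `rev_id`, `needs_review`, `reason` header used for the review file in this log); the function name `count_flags` and the miniature input are hypothetical.

```shell
# count_flags FILE: tally the values of column 2 and print each value,
# its count, and its share of all data rows.
# Assumes a header row and tab-separated columns.
count_flags() {
  awk -F'\t' 'NR > 1 { n[$2]++; total++ }
    END { for (v in n) printf "%s\t%d\t%.2f%%\n", v, n[v], 100 * n[v] / total }' "$1" | sort
}

# Hypothetical miniature input with the same three columns.
demo=$(mktemp)
printf 'rev_id\tneeds_review\treason\n' > "$demo"
printf '%s\t%s\t%s\n' \
  1 True NULL  2 False 'trusted group' \
  3 False 'trusted edits'  4 False 'trusted group' >> "$demo"

count_flags "$demo"
```

Note that this prints each flag's share of all rows, which is a slightly different quantity from the True-to-False ratio computed above.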

So, now, we want to randomly sample a reasonably sized labeling set from this. Let's look at English Wikipedia.

$ cat enwiki.rev_damaging.20k_2015.tsv | grep True | wc
    807    1614   12105

$ cat enwiki.rev_damaging.20k_2015.tsv | grep False | wc
  19193   38386  307088

OK. So it looks like we have only 807 damaging observations. I think we'll want more than that. Let's look at how many of the edits in our original 500k were reverted.

$ cat urwiki.prelabeled_revisions.500k_2015.tsv | grep "reverted" | wc
    717    2868   19359

OK. Not so many. So a sizable proportion of those are probably some version of damage.

I was thinking that we'd sample something like 2.5k from both the trusted and "needs review" sides of the dataset and run with that. But 717 of the 12,928 "needs review" edits were reverted, so we should expect only (717/12,928)*2500 = ~139 reverted revisions to even get labeled. Unless Urdu Wikipedians don't revert a lot of damage, that's not going to be very many examples of vandalism to learn on. Further, we're going to need to split out some more for testing. Hmmm...

OK. I thought about it and I realized that we can always load more into Wiki labels if we need them. There's no way to get more observations other than hard work, and we've managed to cut ~95% from the initial work. If we have to come back to review a bit more of the "needs review" group, that'll be a ~97.5% reduction (because we likely won't need any more "trusted" observations). --EpochFail (talk) 00:12, 17 February 2016 (UTC)

Making the review set


Basically, what I need to do is randomly sample 2500 revisions of "trusted" and "needs review" edits. Here's a crazy looking bit of bash to do it.

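(
  # printf expands \t in any POSIX shell; bash's builtin echo prints it literally
  printf 'rev_id\tneeds_review\treason\n';
  (
    # 2500 random "needs review" rows
    cat datasets/urwiki.prelabeled_revisions.500k_2015.tsv | \
    grep "True" | \
    shuf -n 2500; \
    # 2500 random "trusted" rows
    cat datasets/urwiki.prelabeled_revisions.500k_2015.tsv | \
    grep "False" | \
    shuf -n 2500 \
  ) | \
  shuf \
) > datasets/urwiki.revisions_for_review.5k_2015.tsv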

$ wc datasets/urwiki.revisions_for_review.5k_2015.tsv 
  5001  17709 116800 datasets/urwiki.revisions_for_review.5k_2015.tsv

$ head -n5 datasets/urwiki.revisions_for_review.5k_2015.tsv 
rev_id	needs_review	reason
1239668	True	NULL
1337867	True	NULL
1466161	True	NULL
1160958	False	trusted group

OK. That looks like what I wanted. Time to load 'em up. --EpochFail (talk) 00:37, 17 February 2016 (UTC)
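Before loading, a quick sanity check on the review file could confirm that it is balanced and has no repeated revisions. The sketch below is one possible way to do that, not part of the actual workflow: the function name `check_review_set` and the small demo file are hypothetical, and it assumes the header row and column order shown in the `head` output above.

```shell
# check_review_set FILE N: succeed only if FILE (with a header row) holds
# exactly N rows flagged True, N rows flagged False, and no repeated rev_id.
check_review_set() {
  awk -F'\t' -v n="$2" '
    NR == 1 { next }            # skip the header row
    seen[$1]++ { dupes++ }      # rev_id was already encountered
    { count[$2]++ }             # tally the needs_review flag
    END {
      ok = (count["True"] == n && count["False"] == n && dupes == 0)
      printf "True=%d False=%d duplicates=%d\n", count["True"], count["False"], dupes
      exit !ok
    }' "$1"
}

# Hypothetical miniature review file: two rows per class.
review=$(mktemp)
printf 'rev_id\tneeds_review\treason\n' > "$review"
printf '%s\t%s\t%s\n' \
  1 True NULL  2 True NULL \
  3 False 'trusted group'  4 False 'trusted group' >> "$review"

check_review_set "$review" 2
```

For the real file the call would use 2500 as the expected per-class count.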

First, to create the campaign.

u_wikilabels=> INSERT INTO campaign (name, wiki, form, view, created, labels_per_task, tasks_per_assignment, active) VALUES ('معیار ترمیم کریں ( 5K متوازن )', 'fawiki', 'damaging_and_goodfaith', 'DiffToPrevious', NOW(), 1, 50, True);
INSERT 0 1

Woops! That's not supposed to be "fawiki".

u_wikilabels=> select * from campaign where active and wiki = 'fawiki';
u_wikilabels=> select * from campaign where active and wiki = 'fawiki' and id = 23;
 id |              name              |  wiki  |          form          |      view      |          created           | labels_per_task | tasks_per_assignment | active 
----+--------------------------------+--------+------------------------+----------------+----------------------------+-----------------+----------------------+--------
 23 | معیار ترمیم کریں ( 5K متوازن ) | fawiki | damaging_and_goodfaith | DiffToPrevious | 2016-02-17 00:43:04.861367 |               1 |                   50 | t
(1 row)

u_wikilabels=> update campaign set wiki = "urwiki" where active and wiki = 'fawiki' and id = 23;
ERROR:  column "urwiki" does not exist
LINE 1: update campaign set wiki = "urwiki" where active and wiki = ...
                                   ^
u_wikilabels=> update campaign set wiki = 'urwiki' where active and wiki = 'fawiki' and id = 23;
UPDATE 1

Ahh. There.

Now, to run the loading script.

halfak@wikilabels-01:~/backups$ bzcat ../datasets/urwiki.revisions_for_review.5k_2015.tsv.bz2 | /srv/wikilabels/venv/bin/wikilabels task_inserts 23 | psql -h wikilabels-database --user u_wikilabels u_wikilabels -W 
Password for user u_wikilabels: 
INSERT 0 5000

And it works! ur:ویکیپیڈیا:نشانات. Now to figure out why the interface isn't translated. --EpochFail (talk) 00:50, 17 February 2016 (UTC)