Research talk:Automated classification of edit quality/Work log/2016-04-13

From Meta, a Wikimedia project coordination wiki

Wednesday, April 13, 2016

Generating prelabeled data for Hungarian and Swedish today.

$ cat datasets/huwiki.revisions_for_review.5k_2016.tsv | grep True | wc
   1848    5879   38077
$ cat datasets/huwiki.revisions_for_review.5k_2016.tsv | grep reverted | wc
    285    1140    7980

Looks like we don't have enough True observations for Hungarian, so we'll need to boost the sample to ~40k observations. Here's the updated query: http://quarry.wmflabs.org/query/8811 Once that finishes, I'll try again. For now, let's check on Swedish.
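The counts above can be wrapped in a small helper so the same check is easy to rerun on each dataset. This is only a sketch: `count_label` is a hypothetical name, not part of any existing tooling, and like the `grep ... | wc` commands in this log it counts any line containing the pattern anywhere, not a match in a specific column.

```shell
# count_label FILE PATTERN
# Hypothetical helper: print the number of lines in FILE that contain PATTERN.
# This is a plain substring match over the whole line, like `grep PATTERN | wc -l`.
count_label() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are no matches, so `|| true` keeps the helper from failing then.
    grep -c -- "$2" "$1" || true
}

# Example usage (paths taken from this log):
#   count_label datasets/huwiki.revisions_for_review.5k_2016.tsv True
#   count_label datasets/huwiki.revisions_for_review.5k_2016.tsv reverted
```

Using `grep -c` instead of piping through `wc` prints just the line count, which is the only number of interest here.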

$ wc datasets/svwiki.revisions_for_review.20k_2016.tsv
  4024  14935 104654 datasets/svwiki.revisions_for_review.20k_2016.tsv
$ cat datasets/svwiki.revisions_for_review.20k_2016.tsv | grep True | wc
   1523    4932   32127
$ cat datasets/svwiki.revisions_for_review.20k_2016.tsv | grep reverted | wc
    286    1144    8008

Same story here. Let's boost the observations to 40k as well. Here's the updated query: http://quarry.wmflabs.org/query/8810

Now back to huwiki.

$ wc datasets/huwiki.revisions_for_review.5k_2016.tsv
  5001  17898 123527 datasets/huwiki.revisions_for_review.5k_2016.tsv
$ cat datasets/huwiki.revisions_for_review.5k_2016.tsv | grep True | wc
   2500    7895   51000
$ cat datasets/huwiki.revisions_for_review.5k_2016.tsv | grep reverted | wc
    340    1360    9520

Looks good.

Now for svwiki.

$ wc datasets/svwiki.revisions_for_review.5k_2016.tsv 
  5001  18097 125232 datasets/svwiki.revisions_for_review.5k_2016.tsv
$ cat datasets/svwiki.revisions_for_review.5k_2016.tsv | grep True | wc
   2500    8094   52705
$ cat datasets/svwiki.revisions_for_review.5k_2016.tsv | grep reverted | wc
    453    1812   12684

Cool! Ready to go. --EpochFail (talk) 17:13, 13 April 2016 (UTC)