Research talk:Automated classification of edit quality/Work log/2016-04-13
Wednesday, April 13, 2016
Generating prelabeled data for Hungarian and Swedish today.
$ cat datasets/huwiki.revisions_for_review.5k_2016.tsv | grep True | wc
1848 5879 38077
$ cat datasets/huwiki.revisions_for_review.5k_2016.tsv | grep reverted | wc
285 1140 7980
Looks like we don't have enough True observations for Hungarian. So, we'll need to boost that to ~40k observations. Here's the updated query: http://quarry.wmflabs.org/query/8811 Once that finishes, I'll try again. For now, let's check on Swedish.
$ wc datasets/svwiki.revisions_for_review.20k_2016.tsv
4024 14935 104654 datasets/svwiki.revisions_for_review.20k_2016.tsv
$ cat datasets/svwiki.revisions_for_review.20k_2016.tsv | grep True | wc
1523 4932 32127
$ cat datasets/svwiki.revisions_for_review.20k_2016.tsv | grep reverted | wc
286 1144 8008
Same story here. Let's boost the observations to 40k. Here's the updated query: http://quarry.wmflabs.org/query/8810
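The same three checks (total lines, lines containing True, lines containing reverted) get repeated for each wiki, so they could be bundled into one helper. This is a minimal sketch, not something that exists in the repo: the function name `summarize_sample` is made up for illustration, and it keeps the same loose matching as the `grep ... | wc` calls above (any line containing the string counts).

```shell
# Hypothetical helper (not part of the repo): summarize one sample file.
# Prints the total line count plus the number of lines containing
# "True" and "reverted", i.e. the first column of `grep ... | wc`.
summarize_sample() {
  local path="$1"
  local total flagged reverted
  # wc -l counts newline-terminated lines, header included.
  total=$(wc -l < "$path" | tr -d ' ')
  # grep -c prints 0 but exits non-zero when nothing matches,
  # so `|| true` keeps the function usable under `set -e`.
  flagged=$(grep -c True "$path" || true)
  reverted=$(grep -c reverted "$path" || true)
  printf '%s\ttotal=%s\tTrue=%s\treverted=%s\n' \
    "$path" "$total" "$flagged" "$reverted"
}
```

Usage would look like `summarize_sample datasets/svwiki.revisions_for_review.20k_2016.tsv`, which prints one tab-separated line per file and makes it easy to compare wikis side by side.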
Now to go back to huwiki.
$ wc datasets/huwiki.revisions_for_review.5k_2016.tsv
5001 17898 123527 datasets/huwiki.revisions_for_review.5k_2016.tsv
$ cat datasets/huwiki.revisions_for_review.5k_2016.tsv | grep True | wc
2500 7895 51000
$ cat datasets/huwiki.revisions_for_review.5k_2016.tsv | grep reverted | wc
340 1360 9520
Looks good.
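One caveat on these counts: `grep True` matches the string anywhere on a line, so a header or an unrelated column containing "True" would be counted too. A stricter check could compare a single tab-separated field exactly. The sketch below assumes the file is tab-separated; the helper name `count_field` and the column index passed to it are hypothetical, since the column layout of these TSVs isn't written down here.

```shell
# Hypothetical helper: count rows where one tab-separated field
# equals a value exactly.
#   $1 = path to the TSV
#   $2 = 1-based column index (an assumption; check the file's header)
#   $3 = value to match
count_field() {
  awk -F '\t' -v col="$2" -v val="$3" \
    '$col == val { n++ } END { print n + 0 }' "$1"
}
```

For example, `count_field datasets/huwiki.revisions_for_review.5k_2016.tsv 2 True` would count rows whose second field is exactly True, if that happens to be the right column.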
Now for svwiki.
$ wc datasets/svwiki.revisions_for_review.5k_2016.tsv
5001 18097 125232 datasets/svwiki.revisions_for_review.5k_2016.tsv
$ cat datasets/svwiki.revisions_for_review.5k_2016.tsv | grep True | wc
2500 8094 52705
$ cat datasets/svwiki.revisions_for_review.5k_2016.tsv | grep reverted | wc
453 1812 12684
Cool! Ready to go. --EpochFail (talk) 17:13, 13 April 2016 (UTC)