Research talk:Automated classification of edit quality/Work log/2016-02-28

Sunday, February 28, 2016

I should start keeping my edit quality work logs here, so I'll begin with my work to extract labeling sets for the Norwegian, Hebrew and Vietnamese Wikipedias. For each wiki, I'm trying to build a set of 2.5k "needs review" and 2.5k "trusted" edits for labeling.

OK. First of all, I need to pre-label enough edits that we get at least 2.5k that "need review". I usually start with a random sample of 20k edits, but some wikis are so dominated by bots and privileged users that I need bigger samples. That was the case for nowiki and viwiki, so those two got 100k samples. Here's how the pre-labeling process worked out:

$ cat datasets/nowiki.prelabeled_revisions.100k_2015.tsv | grep True | wc
   7351   23881  155890
$ cat datasets/nowiki.prelabeled_revisions.100k_2015.tsv | grep False | wc
  92642  370568 2686618

$ cat datasets/viwiki.prelabeled_revisions.100k_2015.tsv | grep True | wc
   8141   25911  167614
$ cat datasets/viwiki.prelabeled_revisions.100k_2015.tsv | grep False | wc
  91849  367396 2663621

$ cat datasets/hewiki.prelabeled_revisions.20k_2015.tsv | grep True | wc
   4166   13401   87151
$ cat datasets/hewiki.prelabeled_revisions.20k_2015.tsv | grep False | wc
  15798   63192  458142

wiki     need review     reverted       trusted
nowiki   7351 (7.4%)     1597 (1.6%)    92642 (92.6%)
viwiki   8141 (8.1%)     1031 (1.0%)    91849 (91.9%)
hewiki   4166 (20.9%)    773 (3.9%)     15798 (79.1%)
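
The "need review" and "trusted" percentages follow directly from the line counts above. Here is a minimal sketch of that arithmetic. It assumes the flag sits in the second tab-separated column, as the `rev_id`, `needs_review`, `reason` header used later in this log suggests. `label_share` is a hypothetical helper name, not part of any tool used here.

```shell
# Hypothetical helper: count True/False values in column 2 of a TSV on stdin
# and print each count with its share of all labeled rows.
label_share() {
  awk -F '\t' '
    $2 == "True"  { t++ }
    $2 == "False" { f++ }
    END {
      n = t + f
      if (n == 0) { print "no labeled rows"; exit 1 }
      printf "needs_review\t%d\t%.1f%%\n", t, 100 * t / n
      printf "trusted\t%d\t%.1f%%\n", f, 100 * f / n
    }'
}

# Example (path taken from the commands above):
#   label_share < datasets/nowiki.prelabeled_revisions.100k_2015.tsv
```

Matching on the column instead of grepping the whole line avoids counting a row twice if "True" or "False" also appears in another field.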

OK. Now to generate the 5k sets, each with 2.5k edits needing review and 2.5k trusted edits. Here's the basic pattern demonstrated on nowiki's prelabeled set:

# printf expands \t to real tabs; plain echo does not do so in bash.
(printf 'rev_id\tneeds_review\treason\n'; \
 (cat datasets/nowiki.prelabeled_revisions.100k_2015.tsv | grep True | \
 shuf -n 2500; \
 cat datasets/nowiki.prelabeled_revisions.100k_2015.tsv | grep False | \
 shuf -n 2500 \
 ) | shuf \
) > datasets/nowiki.revisions_to_review.5k_2015.tsv
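
The pattern above can be wrapped in a function so the same steps apply to each wiki. This is only a sketch: `balanced_sample` is a hypothetical name, and it filters on the second column with `awk` instead of using `grep` on the whole line, which assumes the needs-review flag is stored there.

```shell
# Hypothetical helper: build a shuffled, balanced review set from a prelabeled TSV.
# Usage: balanced_sample PRELABELED_TSV N_PER_CLASS > OUTPUT_TSV
balanced_sample() {
  in_file=$1
  n=$2
  # Header row; printf expands \t to real tabs.
  printf 'rev_id\tneeds_review\treason\n'
  {
    # Sample n rows per class, matching on the needs_review column only.
    awk -F '\t' '$2 == "True"' "$in_file" | shuf -n "$n"
    awk -F '\t' '$2 == "False"' "$in_file" | shuf -n "$n"
  } | shuf   # mix the two classes so labelers see them interleaved
}

# Example (paths taken from this log):
#   balanced_sample datasets/nowiki.prelabeled_revisions.100k_2015.tsv 2500 \
#     > datasets/nowiki.revisions_to_review.5k_2015.tsv
```

Note that `shuf -n` returns fewer rows without complaint when a class has fewer than the requested number, so the output line count is worth checking afterwards.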

And here are the three datasets:

$ wc *.revisions_to_review.*
  5001  18048 124856 hewiki.revisions_to_review.5k_2015.tsv
  5001  18107 125387 nowiki.revisions_to_review.5k_2015.tsv
  5001  17964 124035 viwiki.revisions_to_review.5k_2015.tsv
 15003  54119 374278 total
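
Each file has 5001 lines: one header row plus 5000 revisions. The line count alone doesn't confirm the 2.5k/2.5k split, so a quick check along these lines could help. `check_balance` is a hypothetical helper and again assumes the flag is in the second tab-separated column.

```shell
# Hypothetical check: report the True/False counts in a review set and
# return non-zero unless the two classes are the same size (and non-empty).
check_balance() {
  awk -F '\t' '
    NR == 1 { next }            # skip the header row
    $2 == "True"  { t++ }
    $2 == "False" { f++ }
    END {
      printf "%d needs_review, %d trusted\n", t, f
      exit !(t == f && t > 0)
    }' "$1"
}

# Example: check_balance nowiki.revisions_to_review.5k_2015.tsv
```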

To load them into Wiki labels, I first need to make sure all the language assets are in order. --EpochFail (talk) 20:35, 28 February 2016 (UTC)

Updating UI

Here's the pull request for updating the Wikilabels UI: https://github.com/wiki-ai/wikilabels/pull/95 --EpochFail (talk) 20:43, 28 February 2016 (UTC)

Here's the pull request for updating the damaging_and_goodfaith form: https://github.com/wiki-ai/wikilabels-wikimedia-config/pull/11 --EpochFail (talk) 20:48, 28 February 2016 (UTC)

Loading the campaigns

OK. Looks like we've deployed successfully. Now I'm back to loading the campaigns into the database.

u_wikilabels=> INSERT INTO campaign (name, wiki, form, view, created, labels_per_task, tasks_per_assignment, active) VALUES ('איכות ערוכה ( 5k מאוזן )', 'hewiki', 'damaging_and_goodfaith', 'DiffToPrevious', NOW(), 1, 50, True);
INSERT 0 1
u_wikilabels=> INSERT INTO campaign (name, wiki, form, view, created, labels_per_task, tasks_per_assignment, active) VALUES ('Sửa chất lượng ( 5k cân bằng)', 'viwiki', 'damaging_and_goodfaith', 'DiffToPrevious', NOW(), 1, 50, True);
INSERT 0 1
u_wikilabels=> INSERT INTO campaign (name, wiki, form, view, created, labels_per_task, tasks_per_assignment, active) VALUES ('Edit kvalitet ( 5k balansert)', 'nowiki', 'damaging_and_goodfaith', 'DiffToPrevious', NOW(), 1, 50, True);
INSERT 0 1
u_wikilabels=> SELECT id, name, wiki FROM campaign WHERE wiki IN ('hewiki', 'nowiki', 'viwiki');
 id |             name              |  wiki  
----+-------------------------------+--------
 25 | איכות ערוכה ( 5k מאוזן )      | hewiki
 26 | Sửa chất lượng ( 5k cân bằng) | viwiki
 27 | Edit kvalitet ( 5k balansert) | nowiki
(3 rows)

OK. Time to do some loading.

halfak@wikilabels-01:~/datasets$ cat hewiki.revisions_to_review.5k_2015.tsv | /srv/wikilabels/venv/bin/wikilabels task_inserts 25 | psql -h wikilabels-database --user u_wikilabels u_wikilabels -W 
Password for user u_wikilabels: 
INSERT 0 5000
halfak@wikilabels-01:~/datasets$ cat viwiki.revisions_to_review.5k_2015.tsv | /srv/wikilabels/venv/bin/wikilabels task_inserts 26 | psql -h wikilabels-database --user u_wikilabels u_wikilabels -W 
Password for user u_wikilabels: 
INSERT 0 5000
halfak@wikilabels-01:~/datasets$ cat nowiki.revisions_to_review.5k_2015.tsv | /srv/wikilabels/venv/bin/wikilabels task_inserts 27 | psql -h wikilabels-database --user u_wikilabels u_wikilabels -W 
Password for user u_wikilabels: 
INSERT 0 5000
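
The three loads repeat the same pipeline with a different file and campaign id. A small dry-run sketch shows that pairing. `load_command` is a hypothetical helper that only prints the command line and executes nothing against the database. The paths, host and user are copied from the commands above.

```shell
# Hypothetical helper: print the load pipeline for one wiki/campaign pair.
# It only echoes the command; nothing is run against the database.
load_command() {
  wiki=$1
  campaign_id=$2
  printf 'cat %s.revisions_to_review.5k_2015.tsv | /srv/wikilabels/venv/bin/wikilabels task_inserts %s | psql -h wikilabels-database --user u_wikilabels u_wikilabels -W\n' \
    "$wiki" "$campaign_id"
}

# Campaign ids as returned by the SELECT on the campaign table above.
for pair in hewiki:25 viwiki:26 nowiki:27; do
  load_command "${pair%%:*}" "${pair##*:}"
done
```

Printing the commands first makes it easy to confirm that each file is paired with the right campaign id before anything is inserted.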

OK. We should be good to go.

I'm declaring victory for today. --EpochFail (talk) 21:30, 28 February 2016 (UTC)