Research talk:Automated classification of edit quality/Work log/2017-07-26


Friday, July 28, 2017

Training using Flagged Revisions as a proxy for damaging

Tracked in Phabricator: Task T166235

In T166235, we ran an experiment to see whether a model trained on edits accepted through the Flagged Revisions interface would do any better at finding damaging edits than a model trained on the Wiki Labels damaging data. The results were not promising: ROC-AUC fell from 0.954 to 0.900.

Hypothesis

Is data from the Flagged Revisions system of higher quality and more relevant to the task of finding damaging edits than the data keyed through Wiki Labels? If so, training on the Flagged Revisions data should give us a fitness boost.

Methodology

Zache (talk · contribs) gave us a Quarry script to find revisions approved through the Flagged Revisions system. A simplified query was eventually used to generate a list of all approved revisions, about 50,000 rows in total. We labeled these as good-faith and not damaging, and gave them an approved=1 label for good measure (a sketch of this conversion appears below). These labeled revisions were union-merged (see below) with the 15,000 Wiki Labels observations that had been reserved as a training set, and the merged file became our training data. The remaining 5,000 Wiki Labels observations were used for testing model health. No cross-validation was performed.
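A minimal sketch of that conversion (the simplified query, the Quarry JSON export layout, and the file names here are assumptions, not the actual script):

# Sketch: turn Quarry's list of flaggedrevs-approved revisions into
# Wiki Labels-style JSON observations. The simplified query run on
# Quarry against the fiwiki replica is assumed to have been roughly:
#   SELECT fr_rev_id FROM flaggedrevs;
import json

with open("quarry-approved-revisions.json") as f:
    rows = json.load(f)["rows"]  # assumed Quarry JSON export layout

with open("fiwiki.approved_revisions.json", "w") as out:
    for (rev_id,) in rows:
        # Approved revisions get labeled good-faith, not damaging,
        # and approved=1 for good measure.
        obs = {"rev_id": rev_id, "damaging": False,
               "goodfaith": True, "approved": True}
        out.write(json.dumps(obs) + "\n")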

In hindsight, these flaggedrevs-approved revisions were not quite right, because an approved revision may have been only the final edit in a chain of edits awaiting review. This was an omission; if we end up repeating an experiment like this, we should query only for final revisions whose parent revision equals the starting revision of the reviewed chain, as in the sketch below.
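One plausible encoding of that fix (untested, and the replica schema details are assumptions): keep only approved revisions whose parent revision was itself approved, so that each observation covers exactly one edit rather than a whole reviewed chain.

SELECT fr.fr_rev_id
FROM flaggedrevs fr
JOIN revision r ON r.rev_id = fr.fr_rev_id
JOIN flaggedrevs parent ON parent.fr_rev_id = r.rev_parent_id;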

A model was trained using a Makefile[1] tweaked to build a second fiwiki.flaggedrevs.damaging model with the same parameters as the production fiwiki.damaging model, except that it was fed the merged labels, including the flaggedrevs-approved changes, as its source of true classifications.
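A sketch of the tweaked Makefile rule (hypothetical; the target and dataset file names are assumptions, and the recipe body is elided because it matches the production fiwiki.damaging rule):

models/fiwiki.flaggedrevs.damaging.gradient_boosting.model: \
		datasets/fiwiki.labeled_revisions.w_flaggedrevs.json
	# same revscoring training invocation and GradientBoosting
	# parameters as the fiwiki.damaging rule, but reading $< (the
	# union-merged labels) as the source of true classifications

Here are the test results from the two models: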

Current champion damaging model:

revscoring model_info models/fiwiki.damaging.gradient_boosting.model
ScikitLearnClassifier
 - type: GradientBoosting
 - params: loss="deviance", warm_start=false, balanced_sample=false, subsample=1.0, max_leaf_nodes=null, min_samples_leaf=1, center=true, balanced_sample_weight=true, min_samples_split=2, learning_rate=0.01, verbose=0, min_weight_fraction_leaf=0.0, presort="auto", max_features="log2", scale=true, random_state=null, max_depth=5, init=null, n_estimators=700
 - version: 0.3.0
 - trained: 2017-06-26T03:59:29.167423

Table:
	         ~False    ~True
	-----  --------  -------
	False     16727     2231
	True        113      904

Accuracy: 0.883
Precision:
	-----  -----
	False  0.993
	True   0.289
	-----  -----

Recall:
	-----  -----
	False  0.882
	True   0.89
	-----  -----

PR-AUC:
	-----  -----
	False  0.993
	True   0.548
	-----  -----

ROC-AUC:
	-----  -----
	False  0.95
	True   0.954
	-----  -----

Model trained on approved Flagged Revisions:

revscoring model_info models/fiwiki.damaging_w_flaggedrevs.gradient_boosting.model
ScikitLearnClassifier
 - type: GradientBoosting
 - params: random_state=null, verbose=0, init=null, learning_rate=0.01, min_samples_split=2, subsample=1.0, warm_start=false, center=true, min_samples_leaf=1, scale=true, loss="deviance", presort="auto", min_weight_fraction_leaf=0.0, balanced_sample=false, n_estimators=700, balanced_sample_weight=true, max_features="log2", max_leaf_nodes=null, max_depth=5
 - version: 0.0.1
 - trained: 2017-07-25T20:50:13.806134

Table:
	         ~False    ~True
	-----  --------  -------
	False      4589      138
	True        137      121

Accuracy: 0.945
Precision:
	-----  -----
	False  0.971
	True   0.467
	-----  -----

Recall:
	-----  -----
	False  0.971
	True   0.469
	-----  -----

PR-AUC:
	-----  -----
	False  0.993
	True   0.437
	-----  -----

ROC-AUC:
	-----  ---
	False  0.9
	True   0.9
	-----  ---

Two new utilities were introduced to facilitate this work:

union_merge_observations takes multiple observations files and performs a set union of observations that share the same record; for revision observations, this merges all of the labels applied to each revision. This tool is now available in the revscoring repo.[2]
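The core idea, as a minimal sketch (this is not the actual implementation from the repo, and it simplifies conflict handling so that later files win):

# union_merge sketch: merge JSON-lines observation files, unioning
# the label keys of observations that share a rev_id.
import json
import sys

merged = {}  # rev_id -> merged observation
for path in sys.argv[1:]:
    with open(path) as f:
        for line in f:
            obs = json.loads(line)
            # dict.update approximates the set union of labels; on a
            # key conflict, the later file's value wins here.
            merged.setdefault(obs["rev_id"], {}).update(obs)

for obs in merged.values():
    print(json.dumps(obs))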

normalize_column_types casts values to an expected type; in this case it was needed because Quarry outputs integer 0/1 for boolean values, while our tools expect a true JSON boolean. We threw away this version of the tool because it wasn't worth the work to canonicalize it. If we end up needing it again one day, we may want to combine it with a data validation step.
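A minimal sketch of what it did (the discarded script itself is gone; the column names here are assumptions):

# normalize_column_types sketch: cast Quarry's 0/1 integers to true
# JSON booleans in a stream of JSON-lines observations.
import json
import sys

BOOLEAN_COLUMNS = {"damaging", "goodfaith", "approved"}

for line in sys.stdin:
    obs = json.loads(line)
    for col in BOOLEAN_COLUMNS & obs.keys():
        obs[col] = bool(obs[col])  # 0/1 -> false/true
    print(json.dumps(obs))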

References