Research talk:Revision scoring as a service/Work log/2016-02-08

Monday, February 8, 2016

$ make models/wikidata.reverted.all.rf.model 
cut datasets/wikidata.features_reverted.all.nonbot.500k_2015.tsv -f2- | \
	revscoring train_test \
		revscoring.scorer_models.RF \
		wb_vandalism.feature_lists.experimental.all \
		--version 0.0.1 \
		-p 'max_features="log2"' \
		-p 'criterion="entropy"' \
		-p 'min_samples_leaf=1' \
		-p 'n_estimators=80' \
		-s 'pr' -s 'roc' \
		-s 'recall_at_fpr(max_fpr=0.10)' \
		-s 'filter_rate_at_recall(min_recall=0.90)' \
		-s 'filter_rate_at_recall(min_recall=0.75)' \
		--balance-sample-weight \
		--center --scale \
		--label-type=bool > \
	models/wikidata.reverted.all.rf.model
2016-02-08 18:44:01,006 INFO:revscoring.utilities.train_test -- Training model...
2016-02-08 18:44:50,121 INFO:revscoring.utilities.train_test -- Testing model...
ScikitLearnClassifier
 - type: RF
 - params: min_samples_leaf=1, warm_start=false, class_weight=null, min_weight_fraction_leaf=0.0, oob_score=false, verbose=0, min_samples_split=2, n_estimators=80, max_features="log2", bootstrap=true, center=true, criterion="entropy", max_depth=null, balanced_sample_weight=true, max_leaf_nodes=null, random_state=null, scale=true, n_jobs=1
 - version: 0.0.1
 - trained: 2016-02-08T18:44:50.110700

         ~False    ~True
-----  --------  -------
False     80971       11
True         83       17

Accuracy: 0.9988406798056289

ROC-AUC: 0.968
Filter rate @ 0.9 recall: threshold=0.013, filter_rate=0.982, recall=0.94
Recall @ 0.1 false-positive rate: threshold=0.713, recall=0.04, fpr=0.0
Filter rate @ 0.75 recall: threshold=0.075, filter_rate=0.996, recall=0.76
PR-AUC: 0.413

Well, that doesn't look bad. It looks like we can filter out 98.2% of edits and still expect a recall of 0.94. Our ROC-AUC looks pretty good, but the PR-AUC is low, which is hard to avoid given the extremely low prevalence of vandalism. It seems like we couldn't really do a ClueBot-like strategy and expect a very high recall: at the threshold where the false-positive rate is about 0 (0.713), recall is only 0.04. It'll be interesting to plot these results next to those of the other models. I'll kick off the next feature extraction and model generation. --EpochFail (talk) 19:34, 8 February 2016 (UTC)
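
For readers unfamiliar with the "filter rate @ recall" statistic, here is a minimal sketch of how such a number can be computed. It assumes that filter rate means the fraction of edits scoring below the threshold (that is, edits a patroller would not need to review); that definition, the function name, and the toy scores are illustrative assumptions, not revscoring's actual implementation or data.

```python
def filter_rate_at_recall(scores, labels, min_recall):
    """Find the highest threshold whose recall is at least min_recall.

    Returns (threshold, filter_rate, recall), or None if there are no
    positive labels or no threshold reaches min_recall.
    Assumption: filter_rate = share of items scoring below the threshold.
    """
    total_pos = sum(1 for y in labels if y)
    if total_pos == 0:
        return None
    n = len(scores)
    # Walk thresholds from highest to lowest; recall can only grow.
    for threshold in sorted(set(scores), reverse=True):
        flagged = [y for s, y in zip(scores, labels) if s >= threshold]
        recall = sum(1 for y in flagged if y) / total_pos
        if recall >= min_recall:
            return threshold, 1 - len(flagged) / n, recall
    return None


# Toy data (hypothetical): ten edits, three of them damaging.
toy_scores = [0.01, 0.02, 0.03, 0.05, 0.10, 0.20, 0.40, 0.60, 0.80, 0.90]
toy_labels = [False, False, False, False, False, False, True, False, True, True]

strict = filter_rate_at_recall(toy_scores, toy_labels, min_recall=0.90)
loose = filter_rate_at_recall(toy_scores, toy_labels, min_recall=0.50)
print(strict)
print(loose)
```

Asking for a higher minimum recall pushes the threshold down, so fewer edits get filtered out. The run above shows the same trade-off: a filter rate of 0.982 at 0.9 recall versus 0.996 at 0.75 recall.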

General and user features

Just finished training the model. Here's what I get:

$ make models/wikidata.reverted.general_and_user.rf.model
cut datasets/wikidata.features_reverted.general_and_user.nonbot.500k_2015.tsv -f2- | \
        revscoring train_test \
                revscoring.scorer_models.RF \
                wb_vandalism.feature_lists.experimental.general_and_user \
                --version 0.0.1 \
                -p 'max_features="log2"' \
                -p 'criterion="entropy"' \
                -p 'min_samples_leaf=1' \
                -p 'n_estimators=80' \
                -s 'pr' -s 'roc' \
                -s 'recall_at_fpr(max_fpr=0.10)' \
                -s 'filter_rate_at_recall(min_recall=0.90)' \
                -s 'filter_rate_at_recall(min_recall=0.75)' \
                --balance-sample-weight \
                --center --scale \
                --label-type=bool > \
        models/wikidata.reverted.general_and_user.rf.model
2016-02-09 15:40:32,005 INFO:revscoring.utilities.train_test -- Training model...
2016-02-09 15:41:25,816 INFO:revscoring.utilities.train_test -- Testing model...
ScikitLearnClassifier
 - type: RF
 - params: max_depth=null, n_estimators=80, balanced_sample_weight=true, class_weight=null, min_samples_split=2, warm_start=false, oob_score=false, verbose=0, random_state=null, max_features="log2", min_samples_leaf=1, criterion="entropy", n_jobs=1, center=true, max_leaf_nodes=null, min_weight_fraction_leaf=0.0, bootstrap=true, scale=true
 - version: 0.0.1
 - trained: 2016-02-09T15:41:25.813516

         ~False    ~True
-----  --------  -------
False     99003       14
True         93       30

Accuracy: 0.9989207181763163

ROC-AUC: 0.965
Filter rate @ 0.9 recall: threshold=0.025, filter_rate=0.992, recall=0.902
Recall @ 0.1 false-positive rate: threshold=None, recall=None, fpr=None
Filter rate @ 0.75 recall: threshold=0.087, filter_rate=0.996, recall=0.764
PR-AUC: 0.457

So, that's comparable to the "all" feature set (ROC-AUC 0.965 vs. 0.968, PR-AUC 0.457 vs. 0.413), which suggests that we get most of our signal beyond "general" with the user features. --EpochFail (talk) 17:33, 9 February 2016 (UTC)
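
To put the two confusion matrices side by side, here is a small sketch that derives prevalence, accuracy, precision, and recall from the counts printed in the two runs. It assumes the rows are the actual labels and the "~" columns are the predictions; the helper name `summarize` is made up for illustration.

```python
def summarize(tn, fp, fn, tp):
    """Derive basic statistics from confusion-matrix counts.

    Assumption: rows of the printed table are actual labels and the
    "~False"/"~True" columns are the model's predictions.
    """
    total = tn + fp + fn + tp
    return {
        "prevalence": (fn + tp) / total,  # share of edits that were reverted
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Counts copied from the two test runs in this log.
all_features = summarize(tn=80971, fp=11, fn=83, tp=17)
general_and_user = summarize(tn=99003, fp=14, fn=93, tp=30)

for name, stats in [("all", all_features),
                    ("general_and_user", general_and_user)]:
    print(name, {key: round(value, 4) for key, value in stats.items()})
```

Under that reading of the table, the accuracy values reproduce the reported ones, and both runs have a prevalence of roughly 0.12%. That is why accuracy is nearly uninformative here: recall at the default prediction threshold is only 17/100 and 30/123 respectively.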