Research talk:Automated classification of edit quality/Work log/2017-07-19

Wednesday, July 19, 2017

Today I am going to tell the story of how I decided to change the max_features parameter of the GradientBoostingClassifier (GBC) in an attempt to improve the accuracy of the editquality models, and what came out of it.

Problem

One of the issues with the current editquality models is their bias: they lean towards, or rather against, non-registered and new editors. To decrease this bias, it could help to increase the model's variance by engaging as many features as is reasonable. See the bias-variance tradeoff for some details.

Hypothesis

So, I hypothesized that an additional potential source of bias could be max_features. What does the scikit-learn library tell us about this parameter? Here you go: "choosing max_features < n_features leads to a reduction of variance and an increase in bias." We currently use max_features="log2", and log2(n_features) is smaller than n_features: with ~10 features, "log2" leaves us with ~3 randomly selected ones at each split. What if we bring max_features back to its default, None, so that all features are engaged in the calculation? It promises to be a safe experiment, because overfitting is unlikely to be a problem thanks to cross-validation (CV). Let's do this for ruwiki only.
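
To make the comparison concrete, here is a minimal sketch of the experiment in plain scikit-learn. The data is synthetic and stands in for the real editquality feature matrix and labels, and the hyperparameters are just one of the configurations from the tuning sweeps below; the actual models are built with the editquality/revscoring tooling, not this script.

```python
# Minimal sketch: compare max_features="log2" vs. the default (None) by
# cross-validated ROC-AUC. Synthetic data stands in for the real
# editquality feature matrix and labels.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=10,
                           n_informative=6, random_state=0)

for max_features in ("log2", None):
    gbc = GradientBoostingClassifier(learning_rate=0.01, n_estimators=700,
                                     max_depth=7, max_features=max_features)
    scores = cross_val_score(gbc, X, y, cv=5, scoring="roc_auc")
    print(max_features, round(scores.mean(), 3), round(scores.std(), 3))
```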

Results

My hypothesis proved wrong (at least for ruwiki): the ROC-AUC score of the damaging model with max_features=null is 0.934, while the score of the model with max_features="log2" was higher, 0.936. Similar results hold for the goodfaith model (0.932 vs. 0.935) and the reverted model (0.886 vs. 0.891).

Apparently, with all features considered at each split, the variance increases too much. A common practice with GBC is to consider up to 30-40% of the features, which is essentially what "log2" does (as does "sqrt", the most commonly recommended value for max_features in GBC).
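
As a back-of-the-envelope check, using the illustrative ~10-feature count from the hypothesis above (an assumption, not the exact ruwiki feature count), the rules of thumb land in roughly the same place:

```python
# How many features each max_features rule of thumb considers per split,
# using the illustrative ~10-feature count from above (an assumption, not
# the exact ruwiki feature count).
import math

n = 10
print(max(1, int(math.log2(n))))      # "log2"  -> 3 features per split
print(max(1, int(math.sqrt(n))))      # "sqrt"  -> 3 features per split
print(int(0.3 * n), int(0.4 * n))     # 30-40%  -> 3-4 features per split
```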

Below are excerpts from the ruwiki tuning reports: the max_features="log2" version vs. the max_features=null (None) version.

1. DAMAGING

Top scoring configurations

| model | mean(scores) | std(scores) | params |
| --- | --- | --- | --- |
| GradientBoostingClassifier | 0.936 | 0.006 | max_depth=7, n_estimators=700, learning_rate=0.01, max_features="log2" |
| GradientBoostingClassifier | 0.936 | 0.006 | max_depth=3, n_estimators=300, learning_rate=0.1, max_features="log2" |
| GradientBoostingClassifier | 0.935 | 0.007 | max_depth=5, n_estimators=700, learning_rate=0.01, max_features="log2" |

vs.

Top scoring configurations

| model | mean(scores) | std(scores) | params |
| --- | --- | --- | --- |
| GradientBoostingClassifier | 0.934 | 0.006 | n_estimators=700, learning_rate=0.1, max_depth=1, max_features=null |
| GradientBoostingClassifier | 0.934 | 0.006 | n_estimators=300, learning_rate=0.1, max_depth=3, max_features=null |
| GradientBoostingClassifier | 0.934 | 0.006 | n_estimators=500, learning_rate=0.1, max_depth=1, max_features=null |

2. GOODFAITH

RandomForestClassifier actually tops the list here with 0.935, but GBC with "log2" at least matches that score in some configurations.

GradientBoostingClassifier

| mean(scores) | std(scores) | params |
| --- | --- | --- |
| 0.935 | 0.008 | max_features="log2", max_depth=7, n_estimators=700, learning_rate=0.01 |
| 0.934 | 0.006 | max_features="log2", max_depth=7, n_estimators=500, learning_rate=0.01 |
| 0.934 | 0.007 | max_features="log2", max_depth=5, n_estimators=700, learning_rate=0.01 |

vs.

GradientBoostingClassifier

| mean(scores) | std(scores) | params |
| --- | --- | --- |
| 0.932 | 0.007 | learning_rate=0.01, max_depth=5, max_features=null, n_estimators=500 |
| 0.932 | 0.006 | learning_rate=0.01, max_depth=7, max_features=null, n_estimators=500 |
| 0.932 | 0.007 | learning_rate=0.01, max_depth=5, max_features=null, n_estimators=300 |

3. REVERTED

Top scoring configurations

| model | mean(scores) | std(scores) | params |
| --- | --- | --- | --- |
| GradientBoostingClassifier | 0.891 | 0.008 | learning_rate=0.01, max_depth=7, n_estimators=500, max_features="log2" |
| GradientBoostingClassifier | 0.891 | 0.007 | learning_rate=0.01, max_depth=7, n_estimators=700, max_features="log2" |
| RandomForestClassifier | 0.89 | 0.011 | criterion="entropy", max_features="log2", n_estimators=320, min_samples_leaf=5 |

vs.

[GBC shows up well below RandomForestClassifier here, not even in the top 10]

GradientBoostingClassifier

| mean(scores) | std(scores) | params |
| --- | --- | --- |
| 0.886 | 0.005 | learning_rate=0.01, n_estimators=700, max_depth=5, max_features=null |
| 0.884 | 0.004 | learning_rate=0.01, n_estimators=500, max_depth=5, max_features=null |
| 0.884 | 0.007 | learning_rate=0.01, n_estimators=500, max_depth=7, max_features=null |
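
For reference, a rough scikit-learn-only equivalent of the sweep behind these reports could look like the sketch below. The parameter grid is inferred from the configurations listed above and the data is synthetic; the real reports are produced by the editquality/revscoring tuning utilities, not this script.

```python
# Illustrative grid search over the hyperparameter ranges seen in the
# tuning reports above, printing the top configurations by mean CV ROC-AUC.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=10,
                           n_informative=6, random_state=0)

param_grid = {
    "learning_rate": [0.01, 0.1],
    "n_estimators": [300, 500, 700],
    "max_depth": [1, 3, 5, 7],
    "max_features": ["log2", None],
}
search = GridSearchCV(GradientBoostingClassifier(), param_grid,
                      scoring="roc_auc", cv=5)
search.fit(X, y)

# Rank configurations by mean cross-validated score, tuning-report style.
rows = sorted(
    zip(search.cv_results_["mean_test_score"],
        search.cv_results_["std_test_score"],
        search.cv_results_["params"]),
    key=lambda row: row[0], reverse=True,
)
for mean, std, params in rows[:3]:
    print(round(mean, 3), round(std, 3), params)
```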


Sources of inspiration:

   * http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
   * https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/