Research talk:Automated classification of article importance/Work log/2017-03-16


Thursday, March 16, 2017

Today I'll continue where I left off yesterday by training the GBM classifier and inspecting misclassified articles.

Gradient Boosting Model

We train a GBM in much the same way as we trained our Random Forests and SVMs yesterday. First we use 10-fold cross-validation on the 1,000-item training set with varying minimum node sizes (the GBM equivalent of the Random Forest's terminal node size) to identify the best-performing combination of minimum node size and forest size. Using an iterative approach we find that, for the initial case with number of views and number of links from all other Wikipedia articles as predictors, a minimum node size of 8 and a forest with 1,221 trees is preferred. We then use this model to predict the articles in the test set, where it reaches an overall accuracy of 53.75% with the following confusion matrix (rows are true ratings, columns are predicted ratings):

        Top  High  Mid  Low
Top      28    10    2    0
High     13    16    7    4
Mid       2     9   16   13
Low       0     1   13   26

This model performs well on Top- and Low-importance articles, but not so well on High- and Mid-importance articles. Given yesterday's models' confusion between Top- and High-importance articles, it is not surprising to see the GBM struggle with these as well.
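For illustration, here is a minimal sketch of the tuning step described above, using scikit-learn's GradientBoostingClassifier as a stand-in for the GBM (the actual analysis pipeline may use different tooling). The file and column names (training_set_1000.csv, log_views, log_inlinks, importance) are assumptions, and min_samples_leaf plays the role of the minimum node size while n_estimators is the forest size.

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical training table: one row per article with its predictors and
# its WPMED importance rating (Top/High/Mid/Low).
train = pd.read_csv("training_set_1000.csv")
X = train[["log_views", "log_inlinks"]]   # article views, global inlinks
y = train["importance"]

# Grid of minimum node sizes and forest sizes to evaluate with 10-fold CV.
param_grid = {
    "min_samples_leaf": [2, 4, 8, 16, 32, 64],
    "n_estimators": [500, 1000, 1500, 2000, 2500],
}
search = GridSearchCV(
    GradientBoostingClassifier(learning_rate=0.01),
    param_grid,
    cv=10,
    scoring="accuracy",
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)

The best-scoring parameter pair from the cross-validation is then used to fit the final model and predict the test set.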

Next we try using project-internal links as the second predictor, again tuning the minimum node size and forest size, and find that 16 and 1,435 respectively perform best. Applying those to the test set gives an overall accuracy of 59.38%. The improvement comes in the Mid-importance class, where 26 articles were correctly predicted compared to 16 in the previous model. Performance in the other classes is roughly the same.

Lastly we try all three predictors. Here we find that a minimum node size of 32 and a forest with 1,830 trees has the best cross-validation performance. Running this configuration on the test set gives an overall accuracy of 60%. The results are very similar to those using only project-internal links; having both link counts together does not seem to add much information.
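One way to probe the claim that the global link count adds little once project-internal links are included is to look at the fitted model's relative feature importances. The sketch below assumes the same hypothetical training table and column names as before; the parameter values (1,830 trees, minimum node size 32) are taken from the tuning above, mapped approximately onto scikit-learn's parameters.

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

train = pd.read_csv("training_set_1000.csv")   # hypothetical file name
features = ["log_views", "log_inlinks_global", "log_inlinks_project"]

model = GradientBoostingClassifier(
    n_estimators=1830, min_samples_leaf=32, learning_rate=0.01
)
model.fit(train[features], train["importance"])

# Relative contribution of each predictor to the ensemble's splits.
for name, imp in zip(features, model.feature_importances_):
    print(f"{name}: {imp:.3f}")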

2,000-item training set

We then repeat the process on the 2,000-item training set. First, using global links and article views, we find that a minimum node size of 64 and a forest with 2,278 trees has the best cross-validation performance. Applying this to the test set gives an overall accuracy of 54.38%, slightly higher than what was reported with the smaller training set.

Using project-internal wikilinks we find that a minimum node size of 4 and a forest with 2,486 trees has the best cross-validation performance. On the test set, this model achieves 51.88% accuracy, slightly below what we saw previously.

Lastly, using all three predictors we find that a minimum node size of 32 and a forest with 2,497 trees has the best cross-validation performance. On the test set, this model achieves the same overall accuracy as the previous model, 51.88%. This is much lower than what we saw with the smaller training dataset, similar to how the training/testing performance of the Random Forest classifier turned out. Thus, for this project-specific classifier too, we find that the SVM is the strongest performer.

Classification errors

We select the highest-performing SVM classifier (using all three predictors and trained on the 1,000-item training set) and examine the articles it misclassified. We are primarily interested in predictions that are far from the actual rating, so we disregard errors into a neighboring class (e.g. Top-importance articles predicted as High-importance). Below is a confusion matrix listing the interesting articles; columns are predicted ratings and rows are the true ratings.

(Confusion matrix of misclassified articles; rows are true ratings and columns are predicted ratings, ordered Top, High, Mid, Low.)
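The selection of these articles amounts to keeping only test-set predictions at least two classes away from the true rating. Below is a minimal sketch of that filter; the file and column names (test_set_predictions.csv, title, importance, predicted) are assumptions.

import pandas as pd

# Map the ordinal importance classes to integers so class distance is simple
# arithmetic.
rank = {"Top": 0, "High": 1, "Mid": 2, "Low": 3}

test = pd.read_csv("test_set_predictions.csv")   # hypothetical file name
distance = (test["importance"].map(rank) - test["predicted"].map(rank)).abs()

# Keep only articles misclassified by two or more classes.
far_off = test[distance >= 2]
for _, row in far_off.iterrows():
    print(row["title"], row["importance"], "->", row["predicted"])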

I suspect that the first thing WPMED is going to ask for is a list across all of their articles, since our test dataset contains only 160 articles. Across the whole dataset, the distribution of article predictions is as follows (rows are true ratings, columns are predicted ratings):

         Top   High    Mid     Low
Top       72     16      1       1
High     240    520    177      78
Mid      325  2,129  4,099   2,412
Low       70  1,339  4,632  13,251
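A table like the one above can be produced by cross-tabulating actual against predicted importance for every article in the project. The sketch below assumes a hypothetical predictions file (wpmed_all_predictions.csv) with importance and predicted columns.

import pandas as pd

order = ["Top", "High", "Mid", "Low"]
preds = pd.read_csv("wpmed_all_predictions.csv")   # hypothetical file name

# Rows: actual WPMED rating; columns: predicted rating.
matrix = pd.crosstab(preds["importance"], preds["predicted"])
print(matrix.reindex(index=order, columns=order))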

Some of these categories are clearly too large to list in full; for instance, with a distance of two classes as our threshold, we would need to list 1,339 Low-importance articles predicted to be High-importance, which would be counterproductive. Instead we focus on the somewhat smaller classes and list them individually: