Research talk:Automated classification of article importance/Work log/2017-04-11

Tuesday, April 11, 2017

Today I'll get in touch with WPMED regarding categories of Low-importance articles, and I'll look into building models using relative measures of views and inlinks.

Relative measures

We have so far used absolute measures of article views and inlinks. If we instead switch to using relative measures, we can potentially incorporate importance ratings from multiple projects into a single model, and through that build a classifier that predicts importance on a global scale reasonably well. The first questions to answer are: what measures do we use, and how well do they perform?

I investigated the distribution of the number of views and the number of inlinks. The former appears to be reasonably log-normal, while the latter is a mix of at least two distributions. This means we should not use the z-score to describe the inlink variable, so I start out by calculating the rank of all articles, separately for each variable, and turning it into a rank percentile. Later, I will test whether using the z-score for views produces better results. Because these variables are calculated across the entire project, I also created new synthetic samples for the 2,000-article training set, but reused the test set sampled yesterday. We use the relabelled ratings from yesterday as well; feedback from WPMED suggests that I got most of them right.
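
As a concrete illustration, here is a minimal sketch of the rank-percentile transformation, assuming a pandas DataFrame with raw view and inlink counts per article. The column names and the small example data are placeholders, not the actual pipeline (which may well have been written in a different language or library).

    import numpy as np
    import pandas as pd

    # Placeholder data: raw counts per article across the whole project.
    articles = pd.DataFrame({
        "views":   [1200, 45, 980000, 310, 7600],
        "inlinks": [15, 2, 4300, 40, 120],
    })

    # Rank each variable across all articles and turn the rank into a
    # percentile in (0, 1]; ties receive their average rank.
    for col in ["views", "inlinks"]:
        articles[col + "_rank_pct"] = articles[col].rank(method="average") / len(articles)

    # Alternative to test later for the (roughly log-normal) view counts:
    # a z-score of the log-transformed views instead of a rank percentile.
    log_views = np.log(articles["views"])
    articles["views_log_z"] = (log_views - log_views.mean()) / log_views.std()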

Since we do not know how the "proportion of links from WPMED" variable works in this case, we first train an SVM classifier that uses only the two rank variables as input. It results in the following confusion matrix:

            Top  High   Mid   Low  Accuracy
Top          32     7     1     0    80.00%
High          7    20     9     4    50.00%
Mid           3     7    21     9    52.50%
Low           0     3    14    23    57.50%
Average                              60.00%

Compared to our initial result from yesterday, this classifier performs much better on this test set (60% overall accuracy versus 50% yesterday). It performs better on all classes except Low-importance, where accuracy is 2.5 percentage points lower. Keep in mind that while we have sampled a new training set with new synthetic samples, the articles in the test set are all the same.
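
For reference, a minimal sketch of this kind of train-and-evaluate step in scikit-learn, including how the per-class accuracies and their average in the tables can be read off a confusion matrix. The stand-in data, the RBF kernel, and the balanced 40-articles-per-class test sample are assumptions; the log does not name the library or kernel actually used.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.metrics import confusion_matrix

    rng = np.random.default_rng(0)
    labels = ["Top", "High", "Mid", "Low"]

    # Stand-in data: two rank-percentile features per article plus a rating.
    # In the actual analysis these come from the training/test samples
    # described in the text, not from random numbers.
    def fake_sample(n_per_class):
        X, y = [], []
        for i, name in enumerate(labels):
            centre = 1.0 - i * 0.25   # higher importance, higher percentiles
            X.append(np.clip(rng.normal(centre, 0.15, (n_per_class, 2)), 0, 1))
            y += [name] * n_per_class
        return np.vstack(X), np.array(y)

    X_train, y_train = fake_sample(500)
    X_test, y_test = fake_sample(40)

    clf = SVC(kernel="rbf")           # kernel choice is an assumption
    clf.fit(X_train, y_train)
    cm = confusion_matrix(y_test, clf.predict(X_test), labels=labels)

    # Per-class accuracy (rightmost column in the tables) and its average.
    per_class = cm.diagonal() / cm.sum(axis=1)
    for name, acc in zip(labels, per_class):
        print(f"{name}: {acc:.2%}")
    print(f"Average: {per_class.mean():.2%}")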

Now that we are operating in a restricted two-dimensional space, does the GBM fare any better than it did previously? We use 10-fold cross-validation to determine the minimum node size (finding 32 to be best), then use it again to find the right number of trees (1,008), and use that model to predict articles in the test set, resulting in the following confusion matrix (a sketch of the tuning procedure follows the matrix):

            Top  High   Mid   Low  Accuracy
Top          30     9     1     0    75.00%
High          7    21    10     2    52.50%
Mid           3     8    21     8    52.50%
Low           0     2    11    27    67.50%
Average                              61.88%
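
Here is the tuning sketch referred to above, using scikit-learn's GradientBoostingClassifier and reusing the stand-in X_train/y_train/X_test from the earlier SVM sketch. The actual work appears to use an implementation where minimum node size and a tree count in the thousands are the natural knobs, so min_samples_leaf and n_estimators are only rough analogues, and the candidate grids below are made up.

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV, cross_val_score

    # Step 1: 10-fold cross-validation over candidate minimum node sizes.
    leaf_search = GridSearchCV(
        GradientBoostingClassifier(n_estimators=100),
        param_grid={"min_samples_leaf": [2, 4, 8, 16, 32, 64]},
        cv=10,
        scoring="accuracy",
    )
    leaf_search.fit(X_train, y_train)
    best_leaf = leaf_search.best_params_["min_samples_leaf"]

    # Step 2: with the node size fixed, cross-validate the number of trees.
    scores = {}
    for n_trees in [250, 500, 1000, 1500, 2000]:
        model = GradientBoostingClassifier(n_estimators=n_trees,
                                           min_samples_leaf=best_leaf)
        scores[n_trees] = cross_val_score(model, X_train, y_train, cv=10).mean()
    best_trees = max(scores, key=scores.get)

    # Final model: train on the full training set, then predict the test set
    # to produce a confusion matrix like the one above.
    final = GradientBoostingClassifier(n_estimators=best_trees,
                                       min_samples_leaf=best_leaf)
    y_pred = final.fit(X_train, y_train).predict(X_test)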

Slightly lower performance on Top-importance articles, slightly higher on High-importance articles, and much higher on Low-importance articles. We might want to look into this further later on, depending on how the SVM performs. We continue with the SVM classifier and add the "proportion of inlinks from WPMED" as a predictor, getting the following performance:

            Top  High   Mid   Low  Accuracy
Top          29    10     1     0    72.50%
High          7    23     7     3    57.50%
Mid           3    11    18     8    45.00%
Low           0     2    12    26    65.00%
Average                              60.00%

Accuracy is about the same for Top- and Low-importance articles, better for High-importance, lower for Mid-importance, and slightly lower overall. This might partly be due to the test set: cross-validation performance with this model is slightly higher than before. Thus, it looks like we are still gaining some accuracy by adding this variable.
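
Because the comparison between cross-validation accuracy on the training set and accuracy on the fixed test set comes up repeatedly, here is a small sketch of putting the two numbers side by side. It again reuses the stand-in variables from the sketches above and is not the original evaluation code.

    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    model = SVC(kernel="rbf")

    # Mean accuracy from 10-fold cross-validation on the training set
    # (the one augmented with synthetic samples).
    cv_accuracy = cross_val_score(model, X_train, y_train, cv=10).mean()

    # Accuracy on the fixed test set sampled earlier; with equally many test
    # articles per class (each row above sums to 40), this equals the average
    # of the per-class accuracies reported in the tables.
    test_accuracy = accuracy_score(
        y_test, model.fit(X_train, y_train).predict(X_test))

    print(f"cross-validation accuracy: {cv_accuracy:.2%}")
    print(f"test set accuracy:         {test_accuracy:.2%}")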

Next, we add the two global proportions from the clickstream dataset (the proportion of views coming from articles, and the proportion of inlinks used), again thinking that they will provide us with some information about how importance manifests itself in usage. This results in the following confusion matrix (a sketch of how these proportions might be derived follows the matrix):

            Top  High   Mid   Low  Accuracy
Top          26    13     1     0    65.00%
High          8    21     7     4    52.50%
Mid           2     8    23     7    57.50%
Low           0     3    13    24    60.00%
Average                              58.75%
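
The sketch referred to above: this log does not spell out how these two proportions are computed, so the following is only a guess at the kind of aggregation involved, based on the public clickstream dump's (prev, curr, type, n) layout. The tiny example frames, the total-inlink counts, and the exact definitions of the two proportions are all assumptions.

    import pandas as pd

    # Tiny stand-in for the clickstream dump: one row per (source, target)
    # pair with a click count; type is "link" for article-to-article clicks.
    clicks = pd.DataFrame({
        "prev": ["Fever", "other-search", "Fever", "other-external"],
        "curr": ["Influenza", "Influenza", "Cough", "Cough"],
        "type": ["link", "external", "link", "external"],
        "n":    [1200, 5400, 300, 80],
    })

    # Hypothetical total inlink counts per article (e.g. from the link tables).
    inlinks = pd.Series({"Influenza": 2500, "Cough": 900}, name="inlinks")

    def clickstream_proportions(group):
        from_articles = group.loc[group["type"] == "link", "n"].sum()
        return pd.Series({
            # Share of the article's incoming views arriving via internal links.
            "prop_views_from_articles": from_articles / group["n"].sum(),
            # Number of distinct articles that actually sent clicks.
            "active_inlinks": (group["type"] == "link").sum(),
        })

    per_article = clicks.groupby("curr").apply(clickstream_proportions).join(inlinks)

    # Share of an article's registered inlinks that are actually used.
    per_article["prop_inlinks_used"] = (
        per_article["active_inlinks"] / per_article["inlinks"]).clip(upper=1.0)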

We again see a drop in performance on the test set, but higher cross-validation performance on the training set. The classifier now predicts Mid-importance articles more accurately, at the expense of not performing as well on any of the three other classes. It is interesting that these additional variables do not appear to add accuracy, contrary to what we have seen previously.

We add the binary variable for articles that ought to be Low-importance and then get the following confusion matrix:

            Top  High   Mid   Low  Accuracy
Top          24    15     1     0    60.00%
High          3    29     5     3    72.50%
Mid           0    14    17     9    42.50%
Low           0     7    14    19    47.50%
Average                              55.62%

Overall accuracy again drops a bit. We see a huge gain in accuracy for High-importance articles, but lower accuracy for all the other classes, with a particularly large drop for Mid-importance articles. Cross-validation performance is in this case about the same as it was without the variable, suggesting that overall it might not provide much additional information.
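
The details of this binary indicator are not given in this log entry (it appears connected to the WPMED discussion of Low-importance categories mentioned at the top), so the following is only an illustration of how such a flag could be encoded as a feature; the category names are placeholders.

    # Placeholder set of categories whose members are expected to be
    # Low-importance; the real list would come out of the WPMED discussion.
    LOW_IMPORTANCE_CATEGORIES = {
        "Category:Placeholder low-importance topic A",
        "Category:Placeholder low-importance topic B",
    }

    def ought_to_be_low(article_categories):
        """Return 1 if the article is in any flagged category, else 0."""
        return int(bool(set(article_categories) & LOW_IMPORTANCE_CATEGORIES))

    # Example: the flag is added as one more column in the feature matrix.
    print(ought_to_be_low(["Category:Placeholder low-importance topic A"]))  # 1
    print(ought_to_be_low(["Category:Something else"]))                      # 0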

For reference, we also test the two additional proportional variables based specifically on WikiProject Medicine articles. Previously, this resulted in a further drop in performance. We get the following confusion matrix:

            Top  High   Mid   Low  Accuracy
Top          20    18     1     1    50.00%
High          2    24    12     2    60.00%
Mid           2    10    18    10    45.00%
Low           0     5    14    21    52.50%
Average                              51.88%

Overall accuracy drops a bit, and we can see that there is more confusion between the classes overall. In other words, these two variables still appear to provide little useful information, although the cross-validation accuracy is somewhat higher than before.

As the SVM results are somewhat disappointing, we switch to the GBM to see if it fares better. We already have a benchmark that performed reasonably well, and introduce the first proportional variable again to see how it affects performance. As before, we use cross-validation to tune the minimum node size (4) and the number of trees (1,563), after which we use the model to predict the test set and get the following confusion matrix:

            Top  High   Mid   Low  Accuracy
Top          29    11     0     0    72.50%
High          7    23     8     2    57.50%
Mid           3     8    20     9    50.00%
Low           0     2     9    29    72.50%
Average                              63.12%

Accuracy is roughly the same for Top- and Mid-importance articles, and 5 percentage points higher for High- and Low-importance articles. As was the case for the SVM, we also find that the cross-validation accuracy on the training set is improved by adding this variable. However, with the SVM we saw that test set performance decreased, whereas here it increases slightly. Why does the GBM not have an issue with this test set?

We then add the proportional variables from the clickstream and use cross-validation to tune the minimum node size (16) and the number of trees (2,262). When predicting the articles in the test set, we get the following confusion matrix:

            Top  High   Mid   Low  Accuracy
Top          26    14     0     0    65.00%
High          5    25    10     0    62.50%
Mid           1     8    25     6    62.50%
Low           0     2    13    25    62.50%
Average                              63.12%

Overall accuracy is the same as before, but performance is now consistent across all four classes, where previously we had higher accuracy for Top- and Low-importance articles. In other words, this appears to be a classifier that delivers reasonably accurate performance across the board. The question now is what happens when we add the binary variable for Low-importance articles. We tune the minimum node size (64) and the number of trees (3,030) and get the following confusion matrix on the test set:

            Top  High   Mid   Low  Accuracy
Top          24    16     0     0    60.00%
High          5    24    10     1    60.00%
Mid           0    10    23     7    57.50%
Low           0     2    13    25    62.50%
Average                              60.00%

A slight decrease in performance compared to what we saw earlier, although it is still performing well. Either way, the GBM performs well on this type of dataset, which is not that surprising given that the features are now a restricted set of proportions that can be sliced and diced, whereas earlier we had some Gaussian clouds that played to the SVM's strengths.