Research talk:Automated classification of article importance/Work log/2017-03-21


Tuesday, March 21, 2017

Today I will look into cross-evaluating the classifier we trained on the entire English edition and the WPMED classifier to see how they differ. I'll also start looking into the clickstream dataset, as we have some ideas about how to utilize that to increase performance.

Cross-evaluation

I start this by using the SVM classifier we trained on the whole dataset of English Wikipedia, and having it predict the importance ratings of our WPMED test dataset (160 articles). It scores an overall accuracy of 54.38% with the following confusion matrix where rows are true rating and columns are predicted rating:

True \ Predicted   Top   High   Mid   Low
Top                 36      4     0     0
High                16     20     4     0
Mid                  7     17    14     2
Low                  2      6    15    17
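For reference, here is a minimal sketch of how this cross-evaluation can be computed with scikit-learn. The file names and the "rating" column are illustrative assumptions, not the actual files or schema used in this analysis.

# Sketch of the cross-evaluation: a pickled scikit-learn SVM trained on the
# full English Wikipedia dataset is applied to the WPMED test set.
# File names and the "rating" column are hypothetical.
import pickle

import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

labels = ["Top", "High", "Mid", "Low"]

# Load the classifier trained on the whole English Wikipedia dataset.
with open("enwiki-svm.pickle", "rb") as infile:
    clf = pickle.load(infile)

# WPMED test set: 160 articles with the same features the classifier was trained on.
test = pd.read_table("wpmed-test-set.tsv")
X_test = test.drop(columns=["rating"])
y_true = test["rating"]

y_pred = clf.predict(X_test)

print("Overall accuracy: {:.2%}".format(accuracy_score(y_true, y_pred)))
# Rows are the true rating, columns the predicted rating, in Top/High/Mid/Low order.
print(confusion_matrix(y_true, y_pred, labels=labels))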

When we evaluated this classifier on the larger test dataset for English Wikipedia, it scored an overall accuracy of 50.56%. It performed similarly for the Top-, High-, and Low-importance articles, and somewhat worse (35%) on the Mid-importance articles.

On the WPMED test dataset, its accuracy for Top-importance articles is 90%; there are only 4 articles it misclassifies as High-importance. Two of these, Long-term effects of alcohol consumption and Hypercholesterolemia, have predictions that agree with our highest-performing WPMED classifier. One article, Urinary tract infection, is correctly predicted as Top-importance by our WPMED classifier. Lastly, there's Tooth decay, which, as we've previously seen, is predicted Mid-importance by the WPMED classifier.

The performance on the other classes is arguably not that great. It gets only half the High-importance articles right, and predicts most of the rest as Top-importance. This kind of confusion between Top- and High-importance was also present when evaluating the larger test set, although on the WPMED test set it's mostly one-way. It will be interesting to see if this continues when we predict the entire WPMED dataset. If not, it could be that the cut point between Top and High is quite different between this classifier and the WPMED one.

Performance on Mid-importance articles is poor (35% accuracy), and as we see the predictions are spread out across the higher classes. Performance on Low-importance is also quite poor, 42.5% accuracy, with many of them being predicted as Mid-importance. When we previously evaluated this on the global dataset, performance on Low-importance articles was quite high (71.5%). Combined with what we saw on Mid-importance, this suggests that boundaries can be quite different between these datasets.

We next predict importance ratings for the entire WPMED dataset. In this dataset the classes are not balanced, making overall accuracy no longer useful as a performance measure. We will therefore instead discuss performance for individual classes based on the following confusion matrix:

True \ Predicted    Top    High    Mid    Low
Top                  81       9      0      0
High                407     483     84     22
Mid               1,100   4,118  2,599  1,130
Low                 928   3,789  6,357  8,213

Accuracy for Top-importance is again very high (90%), and comparable to the WPMED classifier (84.4%) (note that accuracy and precision are equal). Given the imbalance between classes, recall becomes a useful measure since it is affected by how many false positives we have. That is where we see a huge difference between these classifiers: as the confusion matrix shows, a lot of High-, Mid-, and Low-importance articles are predicted to be Top-importance. The result is that recall for the Top-importance class is only 3.22%, whereas the WPMED classifier does much better with a recall of 11.2%.
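To make the per-class figures easy to reproduce, here is a small sketch that derives them from the confusion matrix above. It treats per-class accuracy as the within-row proportion (correct predictions among articles truly in that class) and recall as the within-column proportion (correct predictions among articles predicted to be in that class); that reading is an assumption on my part, but it reproduces the 90% and 3.22% reported for Top-importance.

# Per-class figures derived from the full-WPMED confusion matrix above.
# "Accuracy" is computed as the within-row proportion and "recall" as the
# within-column proportion, which reproduces the reported 90% and 3.22%
# for the Top-importance class.
import numpy as np

labels = ["Top", "High", "Mid", "Low"]
conf = np.array([
    [  81,    9,    0,    0],   # true Top
    [ 407,  483,   84,   22],   # true High
    [1100, 4118, 2599, 1130],   # true Mid
    [ 928, 3789, 6357, 8213],   # true Low
])

diag = np.diag(conf)
per_class_accuracy = diag / conf.sum(axis=1)  # correct / articles truly in the class
per_class_recall = diag / conf.sum(axis=0)    # correct / articles predicted as the class

for label, acc, rec in zip(labels, per_class_accuracy, per_class_recall):
    print("{}: accuracy {:.2%}, recall {:.2%}".format(label, acc, rec))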

Both classifiers have similar accuracy on High-importance articles, but again the large number of lower-class articles predicted to be in this class makes the recall score low (5.75%, compared to the WPMED classifier's 14.0%).

For the Mid- and Low-importance classes, the big difference between the classifiers shows up in the accuracy/precision, with the WPMED classifier performing much stronger. These results are somewhat surprising because these trends were not as easy to spot on the WPMED test set. The ratio of Low-importance articles predicted to be Top-importance versus Low-importance appears roughly similar to what we saw on the test set (about 1:8), but Mid-importance articles are much more likely to be predicted as Top-importance.

We next turn this around and use the WPMED classifier to predict importance in the test dataset of unanimously rated articles across the English edition (1,600 articles, 400 in each class). This does not go very well, with an overall accuracy of 29.06%, only slightly above the random baseline of 25%. The confusion matrix reveals why:

True \ Predicted   Top   High   Mid   Low
Top                 15    119    10   256
High                 5     36    15   344
Mid                  0      4    20   376
Low                  0      1     5   394
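The overall accuracy follows directly from the diagonal of this matrix; a quick sanity check:

# Quick check of the overall accuracy: correct predictions are on the diagonal.
import numpy as np

conf = np.array([
    [15, 119, 10, 256],   # true Top
    [ 5,  36, 15, 344],   # true High
    [ 0,   4, 20, 376],   # true Mid
    [ 0,   1,  5, 394],   # true Low
])

accuracy = np.trace(conf) / conf.sum()  # (15 + 36 + 20 + 394) / 1,600
print("{:.2%}".format(accuracy))        # 29.06%, versus a 25% random baseline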

Almost everything is predicted to be Low-importance! The only exception seems to be about a quarter of the Top-importance articles, which are instead predicted to be High-importance. I am unsure whether this suggests anything about how well a hybrid classifier might perform. It might be that WPMED's importance criteria are much stricter than everyone else's (e.g. they state that only 1% of WPMED articles should attain a Top-importance rating), meaning that an article needs to be much more popular and have a higher number of inlinks to be labelled that way by our classifier.