Research talk:Automated classification of article importance/Work log/2017-04-12

Wednesday, April 12, 2017

Today I'll generate a set of candidate articles for reassessment for WPMED, and continue my investigation of WikiProjects, aiming to discover others that we can approach. I'll also test whether Z-scores for views perform better than percentile ranks, for reference.

Z-scores for views

We train a GBM classifier using Z-scores for article views instead of rank percentiles, and compare it against the same classifier trained yesterday. Through cross-validation we identify the appropriate minimum node size (64) and number of trees (2,267), and get the following confusion matrix:

True \ Predicted    Top   High   Mid   Low   Accuracy
Top                  24     16     0     0     60.00%
High                  5     25     8     2     62.50%
Mid                   0     10    22     8     55.00%
Low                   0      3     9    28     70.00%
Average                                        61.88%
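
For reference, each row's accuracy is the proportion of articles with that true rating that are classified correctly, e.g. for Top-importance: 24 / (24 + 16 + 0 + 0) = 60.00%. The average is the proportion correct across the whole 160-article test set: (24 + 25 + 22 + 28) / 160 = 61.88%.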

Compared to yesterday's results, this model performs slightly worse, with an overall accuracy of 61.88% versus 63.12% for the same model trained on rank percentiles. The differences are: slightly lower performance on Top- and High-importance articles, lower performance on Mid-importance articles, but higher accuracy on Low-importance articles. This corresponds well with results we saw for earlier classifiers: Z-scores seem to behave much like raw view counts, in that Top- and Low-importance articles are reasonably easy to predict.
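
For reference, a minimal sketch of this setup in Python with scikit-learn (not necessarily the toolkit actually used); the file name, column names, and parameter grid are hypothetical, and the real feature set is broader than shown here:

```python
# Minimal sketch: Z-scored views as a feature, GBM tuned by cross-validation.
# File name, column names, and grid values are hypothetical assumptions.
import pandas as pd
from scipy.stats import zscore
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

articles = pd.read_csv("articles.csv")  # hypothetical input file

# Use Z-scores of raw view counts instead of rank percentiles.
articles["views_z"] = zscore(articles["views"])

X = articles[["views_z", "inlinks"]]
y = articles["importance"]  # Top / High / Mid / Low

# Cross-validate over minimum node size and number of trees, analogous to
# the values found above (minimum node size 64, 2,267 trees).
search = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid={
        "min_samples_leaf": [16, 32, 64, 128],
        "n_estimators": [500, 1000, 2000, 3000],
    },
    cv=10,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```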

Tuning an SVM on the same dataset results in slightly higher overall performance, with the following confusion matrix:

True \ Predicted    Top   High   Mid   Low   Accuracy
Top                  27     12     1     0     67.50%
High                  5     24     9     2     60.00%
Mid                   0      9    24     7     60.00%
Low                   0      3    11    26     65.00%
Average                                        63.12%

Overall accuracy is the same as the highest-performing GBM from yesterday. The individual differences are similar to what we described earlier: with Z-scores, Top- and Low-importance articles are easier to predict, while High- and Mid-importance predictions are less accurate. However, the differences are small.
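
A similar sketch for the SVM, again assuming scikit-learn; the RBF kernel and the parameter grid are my assumptions, since they aren't recorded above:

```python
# Minimal sketch: tuning an SVM on the same Z-scored features.
# Kernel choice and grid values are assumptions; the log only says
# the SVM was tuned.
import pandas as pd
from scipy.stats import zscore
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

articles = pd.read_csv("articles.csv")  # hypothetical input, as before
articles["views_z"] = zscore(articles["views"])
X, y = articles[["views_z", "inlinks"]], articles["importance"]

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
search = GridSearchCV(
    svm,
    param_grid={
        "svc__C": [0.1, 1, 10, 100],
        "svc__gamma": ["scale", 0.01, 0.1, 1],
    },
    cv=10,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```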

WPMED candidates

We create a dataset of WPMED articles, excluding the categories of articles that ought to be Low-importance. We create 810 synthetic samples of Top-importance articles and add them to the 90 existing Top-importance articles for a set of 900. The same number of articles is sampled from each of the other classes to create a training set of 3,600 articles. We use 10-fold cross-validation to tune the minimum node size (32) and number of trees (5,925) of a GBM classifier, then use the trained classifier to predict all WPMED articles (again excluding the aforementioned categories of Low-importance articles); a rough sketch of this setup follows the discussion below. This results in the following confusion matrix:

True \ Predicted    Top    High    Mid     Low   Accuracy
Top                  71      17      2       0     78.89%
High                257     470    128     116     48.40%
Mid                 371   2,297  1,922   4,127     22.05%
Low                 231   1,804  1,254   9,888     75.04%
Average                                            53.81%

Overall accuracy is quite a lot lower than we've seen previously, but we're still correct more than half the time on average. Accuracy for Top- and Low-importance articles is strong, with over three quarters of each predicted correctly. I interpret that to suggest that articles in those two classes are mainly where they're supposed to be. Accuracy for High-importance is much lower, and quite a few High-importance articles are predicted as Top-importance. This reflects something we've discussed elsewhere about these predictions: the distinction between Top- and High-importance is fuzzy. We also see that a large proportion of Mid-importance articles are predicted to be High-importance, suggesting again that the distinction is not clear. Lastly, almost half of the Mid-importance articles are predicted to be Low-importance. Since Low-importance articles were themselves predicted well, this suggests that the Mid-importance class is perhaps not rated the way it should be.
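
As promised above, a rough sketch of how the training set was built, using imbalanced-learn's SMOTE for the synthetic Top-importance samples (the oversampling method is an assumption; the log only says synthetic samples were created), with the same hypothetical columns as before:

```python
# Rough sketch: build the 3,600-article balanced training set.
# SMOTE is an assumption; file and column names are hypothetical.
import pandas as pd
from imblearn.over_sampling import SMOTE
from scipy.stats import zscore

articles = pd.read_csv("articles.csv")
articles["views_z"] = zscore(articles["views"])
X = articles[["views_z", "inlinks"]]
y = articles["importance"]

# Oversample only Top-importance: 90 real articles plus 810 synthetic
# samples, for a total of 900; the other classes are left untouched here.
X_res, y_res = SMOTE(sampling_strategy={"Top": 900}).fit_resample(X, y)
pool = X_res.assign(importance=y_res)  # imblearn keeps pandas types here

# Downsample High, Mid, and Low to 900 articles each (this assumes every
# class has at least 900 articles), giving 4 x 900 = 3,600 in total.
train = (pool.groupby("importance", group_keys=False)
             .apply(lambda g: g.sample(n=900, random_state=42)))
```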

We also see that quite a number of Low-importance articles are candidates for reassessment. If these predictions are somewhat on point, then about 3% of WPMED articles should be Top-importance. By WPMED's own definition, less than 1% should be, which again raises the question of how many articles ought to be Top-importance.

Manually inspecting some of the misclassified articles suggests that the predictions are reasonable. The number of article views and inlinks strongly affect the predictions, as one would expect. There are some interesting examples, e.g. West Midlands Ambulance Service, which is not filtered out because its Wikidata item has no statement about what it is, and which has 1,600 inlinks because it is linked in an infobox. In other words, our rankings appear quite sane, and we are identifying good examples of where there is still work to be done, either on Wikipedia in the form of reassessment, or on Wikidata in the form of additional information.

I generated a list of 96 candidate articles for reassessment, 24 from each of the four predicted rating classes, and will get in touch with WPMED about them tomorrow.
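
A minimal sketch of how such a candidate list could be drawn, reusing the balanced training set from the sketch above; the rule of picking articles whose prediction disagrees with their current rating is my assumption, as the log does not spell out the selection criteria:

```python
# Rough sketch: 24 reassessment candidates per predicted class.
# Reuses `train` from the SMOTE sketch above; the tuned parameters match
# the cross-validation result reported earlier in this section.
import pandas as pd
from scipy.stats import zscore
from sklearn.ensemble import GradientBoostingClassifier

articles = pd.read_csv("articles.csv")  # hypothetical input, as before
articles["views_z"] = zscore(articles["views"])
X = articles[["views_z", "inlinks"]]

# GBM with minimum node size 32 and 5,925 trees, fitted on the balanced set.
model = GradientBoostingClassifier(min_samples_leaf=32, n_estimators=5925)
model.fit(train[["views_z", "inlinks"]], train["importance"])

# Candidates: articles whose predicted class disagrees with the current
# rating (my assumed rule); sample 24 from each predicted class, which
# assumes at least 24 disagreements exist per class.
scored = articles.assign(predicted=model.predict(X))
disagreements = scored[scored["predicted"] != scored["importance"]]
candidates = (disagreements.groupby("predicted", group_keys=False)
                           .apply(lambda g: g.sample(n=24, random_state=42)))
candidates.to_csv("reassessment_candidates.csv", index=False)
```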