Research talk:Automated classification of article importance/Work log/2017-03-16


Thursday, March 16, 2017

Today I'll continue where I left off yesterday by training the GBM classifier and inspecting misclassified articles.

Gradient Boosting Model

We train a GBM in much the same way as we trained our Random Forests and SVMs yesterday. First we use 10-fold cross-validation on the 1,000-item training set with varying minimum node sizes (the GBM equivalent of the Random Forest's terminal node size) to identify the best-performing combination of minimum node size and forest size. Using an iterative approach we find that, for the initial case with number of views and number of links from all other Wikipedia articles as predictors, a minimum node size of 8 and a forest with 1,221 trees is preferred. We then use this model to predict the articles in the test set, where it reaches an overall accuracy of 53.75% with the following confusion matrix (rows are true ratings, columns are predicted ratings):

        Top  High  Mid  Low
Top      28    10    2    0
High     13    16    7    4
Mid       2     9   16   13
Low       0     1   13   26

This model performs well on Top- and Low-importance articles, but not so well on High- and Mid-importance articles. Given yesterday's models' confusion between Top- and High-importance articles, it is not surprising to see the GBM struggle with these as well.
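For illustration, here is a minimal sketch of the tuning step described above, using scikit-learn's GradientBoostingClassifier as a stand-in for the GBM (the actual analysis pipeline may use different tooling). The file and column names (training_set_1000.csv, log_views, log_inlinks, importance) are assumptions, and min_samples_leaf plays the role of the minimum node size while n_estimators is the forest size.

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical training table: one row per article with its predictors and
# its WPMED importance rating (Top/High/Mid/Low).
train = pd.read_csv("training_set_1000.csv")
X = train[["log_views", "log_inlinks"]]   # article views, global inlinks
y = train["importance"]

# Grid of minimum node sizes and forest sizes to evaluate with 10-fold CV.
param_grid = {
    "min_samples_leaf": [2, 4, 8, 16, 32, 64],
    "n_estimators": [500, 1000, 1500, 2000, 2500],
}
search = GridSearchCV(
    GradientBoostingClassifier(learning_rate=0.01),
    param_grid,
    cv=10,
    scoring="accuracy",
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)

The best-scoring parameter pair from the cross-validation is then used to fit the final model and predict the test set.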

Next we try using project-internal links as the second predictor, again tuning the minimum node size and forest size, and find that 16 and 1,435 respectively perform best. Applying those to the test set gives an overall accuracy of 59.38%. The improvement comes in the Mid-importance class, where 26 articles were correctly predicted compared to 16 in the previous model. Performance in the other classes is roughly the same.

Lastly we try all three predictors. Here we find that a minimum node size of 32 and a forest with 1,830 trees has the best cross-validation performance. Running this configuration on the test set gives an overall accuracy of 60%. The results are very similar to those using only project-internal links; having both link counts together does not seem to add much information.
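One way to probe the claim that the global link count adds little once project-internal links are included is to look at the fitted model's relative feature importances. The sketch below assumes the same hypothetical training table and column names as before; the parameter values (1,830 trees, minimum node size 32) are taken from the tuning above, mapped approximately onto scikit-learn's parameters.

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

train = pd.read_csv("training_set_1000.csv")   # hypothetical file name
features = ["log_views", "log_inlinks_global", "log_inlinks_project"]

model = GradientBoostingClassifier(
    n_estimators=1830, min_samples_leaf=32, learning_rate=0.01
)
model.fit(train[features], train["importance"])

# Relative contribution of each predictor to the ensemble's splits.
for name, imp in zip(features, model.feature_importances_):
    print(f"{name}: {imp:.3f}")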

2,000-item training set

We then repeat the process on the 2,000-item training set. First, using global links and article views, we find that a minimum node size of 64 and a forest with 2,278 trees has the best cross-validation performance. Applying this to the test set gives an overall accuracy of 54.38%, slightly higher than what was reported with the smaller training set.

Using project-internal wikilinks we find that a minimum node size of 4 and a forest with 2,486 trees has the best cross-validation performance. On the test set, this model achieves 51.88% accuracy, slightly below what we saw previously.

Lastly, using all three predictors we find that a minimum node size of 32 and a forest with 2,497 trees has the best cross-validation performance. On the test set, this model achieves the same overall accuracy as the previous model, 51.88%. This is much lower than what we saw with the smaller training dataset, similar to how the training/testing performance of the Random Forest classifier turned out. Thus, for this project-specific classifier too, we find that the SVM is the strongest performer.

Classification errors

We select the highest-performing SVM classifier (using all three predictors and trained on the 1,000-item training set) and examine the articles it misclassified. We are primarily interested in predictions that are far from the actual rating, so we disregard errors into a neighboring class (e.g. Top-importance articles predicted as High-importance). Below is a confusion matrix listing the interesting articles; columns are predicted ratings and rows are the true ratings.

(Confusion matrix of misclassified articles; rows are true ratings and columns are predicted ratings, ordered Top, High, Mid, Low.)
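The selection of these articles amounts to keeping only test-set predictions at least two classes away from the true rating. Below is a minimal sketch of that filter; the file and column names (test_set_predictions.csv, title, importance, predicted) are assumptions.

import pandas as pd

# Map the ordinal importance classes to integers so class distance is simple
# arithmetic.
rank = {"Top": 0, "High": 1, "Mid": 2, "Low": 3}

test = pd.read_csv("test_set_predictions.csv")   # hypothetical file name
distance = (test["importance"].map(rank) - test["predicted"].map(rank)).abs()

# Keep only articles misclassified by two or more classes.
far_off = test[distance >= 2]
for _, row in far_off.iterrows():
    print(row["title"], row["importance"], "->", row["predicted"])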

I suspect that the first thing WPMED is going to ask for is a list across all of their articles, since our test dataset contains only 160 articles. Across the whole dataset, the distribution of article predictions is as follows (rows are true ratings, columns are predicted ratings):

         Top   High    Mid     Low
Top       72     16      1       1
High     240    520    177      78
Mid      325  2,129  4,099   2,412
Low       70  1,339  4,632  13,251
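A table like the one above can be produced by cross-tabulating actual against predicted importance for every article in the project. The sketch below assumes a hypothetical predictions file (wpmed_all_predictions.csv) with importance and predicted columns.

import pandas as pd

order = ["Top", "High", "Mid", "Low"]
preds = pd.read_csv("wpmed_all_predictions.csv")   # hypothetical file name

# Rows: actual WPMED rating; columns: predicted rating.
matrix = pd.crosstab(preds["importance"], preds["predicted"])
print(matrix.reindex(index=order, columns=order))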

Some of these categories are clearly too large to list in full; for instance, with a distance of two classes as our threshold, we would need to list 1,339 Low-importance articles predicted to be High-importance, which would be counterproductive. Instead we focus on the somewhat smaller classes and list them individually: