Research talk:Automated classification of article importance/Work log/2017-03-15


Wednesday, March 15, 2017[edit]

Today I will continue where I left off yesterday, specifically training and evaluating machine learners on the WPMED dataset.

Class imbalance problem[edit]

The distribution of articles per importance class is skewed, both globally in Wikipedia and in WikiProject Medicine. In our previous work this was not a problem, because we identified 1,900 Top-importance articles, enough to create reasonably sized training and test datasets. In WikiProject Medicine the situation is different; here is the number of articles per importance rating:

Importance No. of articles
Top 90
High 997
Mid 8,965
Low 19,292

There are only 90 Top-importance articles in WikiProject Medicine, which is a challenge when we want to train a classifier. If we use a 10%/90% test/training split like we did with our Wikipedia Article Quality dataset, our test dataset contains only 36 articles, nine from each of the four importance classes. That seems too low to get a good evaluation of classifier performance. Our training dataset would also contain only 324 articles (81 from each class), which is lower than we would like. Can we get more data?

One approach to the problem is to generate synthetic samples using k-nearest neighbors, a technique known as SMOTE (Synthetic Minority Over-sampling Technique). This is an established approach in the scientific literature, so we choose to explore it first.

Datasets[edit]

We first create a test dataset with 160 articles, a random sample of 40 articles from each of the four importance categories. While this is not a particularly large dataset, it should provide us with some signal of performance, particularly in combination with cross-validation on training data.

Using 40 articles for testing leaves us with 50 Top-importance articles for training. From these 50, we generate two datasets of synthetic samples using SMOTE. The first dataset contains 200 synthetic samples, which combined with the 50 original articles gives us a total of 250 data points. The second contains 450 synthetic samples, for a total of 500. Generating both of these datasets requires a comparable number of non-Top articles, and in both cases we sampled randomly across the remainder of the training data to get the right number of non-Top-importance articles. We added these samples to ensure that the dataset contains representative data: SMOTE uses k-nearest neighbors to generate the synthetic samples, and our sampling strategy should ensure that the nearest neighbors are not biased.

Once the synthetic samples were generated, we combined them with the 50 Top-importance articles and a random sample from each of the three other categories to create balanced datasets of 1,000 and 2,000 articles.
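
As a reference, here is a minimal sketch of this kind of SMOTE oversampling in R using the DMwR package; the data frame and column names (wpmed_train, importance, n_inlinks, n_views) are illustrative, not our actual pipeline.

library(DMwR)  # provides an R implementation of SMOTE

## "wpmed_train" is an illustrative data frame with one row per article;
## "importance", "n_inlinks" and "n_views" are illustrative column names.
wpmed_train$is_top <- factor(ifelse(wpmed_train$importance == "Top",
                                    "Top", "Other"))

## Keep only the target and the predictors we want SMOTE to interpolate over.
smote_input <- wpmed_train[, c("is_top", "n_inlinks", "n_views")]

## SMOTE creates synthetic minority-class cases by interpolating between a
## minority case and its k nearest neighbours. perc.over = 400 creates four
## synthetic cases per original Top article (50 originals -> 200 synthetic);
## perc.under controls how many majority-class cases are kept alongside them.
set.seed(42)
smoted <- SMOTE(is_top ~ ., data = smote_input,
                perc.over = 400, perc.under = 100, k = 5)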

Classifier training and evaluation[edit]

As with our classifier trained on data across all of English Wikipedia, we test three types of classifiers: Random Forest (RF), Support Vector Machine (SVM), and Gradient Boost Modeling (GBM).

Random Forest[edit]

We first investigate performance when using two measures as predictors: the number of incoming links from other articles across Wikipedia, and the number of views. To tune performance we vary the forest size and the terminating node size, streamlining the process we used earlier. Using 10-fold cross-validation on the dataset with 1,000 items (200 synthetic samples and 800 articles), we get the following table of accuracy (in percent), where the columns are the number of trees in the forest and the rows are the terminating node size:

Node size \ Trees 101 201 301 401 501 601 701 801 901 1001
1 60.3 59.5 59.6 60.3 60.3 59.9 60.4 59.9 59.5 59.9
2 60.6 60.0 61.3 60.5 60.5 59.9 60.2 60.4 60.6 60.3
4 61.7 62.5 61.8 61.0 61.5 61.5 61.4 61.1 61.3 61.4
8 62.3 62.8 63.0 63.0 62.9 63.4 62.9 62.8 63.2 63.3
16 62.9 63.5 64.0 63.6 63.5 63.8 63.8 63.8 64.0 63.9
32 63.5 63.8 63.9 63.5 63.5 63.2 63.3 63.2 63.1 63.2
64 62.7 62.8 62.8 62.9 62.9 62.7 62.8 62.6 62.7 62.7
128 60.9 61.9 62.4 62.2 62.3 62.4 62.2 62.2 62.3 62.5

From this table we can see that performance often varies very little with the size of the forest. There is typically a slight increase in performance as the forest grows when the terminating node size is larger; this is expected, since in those cases we are creating a large forest of shallow trees. We can also see how performance degrades once the terminating node size gets large enough (e.g. 64 and 128), reflecting the limited size of the dataset.
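
For reference, here is a sketch of the kind of grid search behind a table like this, written with the randomForest package; the data frame train_set and its columns (imp_rating, n_inlinks, n_views) are illustrative names rather than our actual code.

library(randomForest)

## Illustrative training data: "imp_rating" is the (factor) importance class,
## "n_inlinks" and "n_views" are the two predictors.
ntrees    <- seq(101, 1001, by = 100)
nodesizes <- c(1, 2, 4, 8, 16, 32, 64, 128)
folds     <- sample(rep(1:10, length.out = nrow(train_set)))

## 10-fold cross-validation accuracy for one (ntree, nodesize) combination.
cv_accuracy <- function(ntree, nodesize) {
  correct <- 0
  for (f in 1:10) {
    fit  <- randomForest(imp_rating ~ n_inlinks + n_views,
                         data = train_set[folds != f, ],
                         ntree = ntree, nodesize = nodesize)
    pred <- predict(fit, train_set[folds == f, ])
    correct <- correct + sum(pred == train_set$imp_rating[folds == f])
  }
  correct / nrow(train_set)
}

## Accuracy grid with node sizes as rows and forest sizes as columns.
acc_grid <- outer(nodesizes, ntrees,
                  Vectorize(function(ns, nt) cv_accuracy(nt, ns)))
dimnames(acc_grid) <- list(as.character(nodesizes), as.character(ntrees))
round(100 * acc_grid, 1)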

Based on the performance in this table, we choose the simplest model with the highest performance, which is 301 trees and terminating node size 16, with a cross-validation accuracy of 64%. Testing this model on the test set gives us the following confusion matrix, where rows are true labels and columns are predicted labels:

Top High Mid Low
Top 30 9 1 0
High 5 20 11 4
Mid 2 4 31 3
Low 0 1 13 26

The overall accuracy of this model is 66.88%. We see high performance on Top- and Mid-importance articles, with accuracies of 75% and 77.5%, respectively. It performs less well on High-importance (50% accuracy) and Low-importance (65% accuracy) articles. In both of the latter cases, a fairly large proportion of articles is predicted to belong to a neighboring class. Another positive trend is that no Top-importance articles are predicted as Low-importance, and vice versa, which corresponds well to the graphs we plotted yesterday, where these classes appeared rather distinct from each other.
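
For completeness, here is a small sketch of how this test-set confusion matrix and the per-class accuracies can be computed in R; rf_fit, test_set, and imp_rating are illustrative names.

## "rf_fit" is the chosen random forest, "test_set" the 160-article test data.
pred <- predict(rf_fit, test_set)
conf <- table(true = test_set$imp_rating, predicted = pred)

conf                         # rows are true labels, columns predicted labels
sum(diag(conf)) / sum(conf)  # overall accuracy (0.669 for the matrix above)
diag(conf) / rowSums(conf)   # per-class accuracy (0.75, 0.50, 0.775, 0.65
                             # for Top, High, Mid, Low)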

Next we repeat this process using the number of inlinks from other articles within WikiProject Medicine and the number of article views as predictors. We again create a table showing cross-validation accuracy as we vary forest size and terminating node size:

Node size \ Trees 101 201 301 401 501 601 701 801 901 1001
1 59.5 58.7 59.5 59.0 59.4 59.1 59.6 59.4 59.7 59.2
2 59.9 59.5 59.4 59.5 59.2 59.9 59.1 59.5 59.7 59.3
4 58.9 59.5 59.0 59.1 59.7 59.6 59.4 59.3 59.6 60.0
8 59.7 59.4 59.0 58.4 58.7 58.8 59.0 59.4 58.5 58.6
16 59.5 58.8 60.1 59.6 60.7 60.1 60.2 60.1 59.9 59.7
32 59.8 60.1 60.3 59.9 60.3 60.4 59.6 60.0 60.1 60.0
64 61.0 60.3 60.9 60.9 60.4 60.5 60.2 60.6 60.9 60.4
128 60.1 59.9 59.9 59.8 60.1 59.6 59.8 59.8 59.7 59.7

A key takeaway from this table is that performance is lower than what we saw using global inlink counts. Here the highest accuracy is 61%, compared to 64% in the other model. We can also see that there is very little variation in performance as we change forest size and terminating node size, making it unclear which conditions produce the top-performing model. Given the lower accuracy reported here, I saw no reason to test this model further.

Lastly we try combining all three predictors, and repeat the same cross-validation test with varying forest size and terminating node size:

Node size \ Trees 101 201 301 401 501 601 701 801 901 1001
1 64.4 64.6 64.1 64.2 64.2 64.1 64.4 64.2 63.7 64.1
2 64.0 65.1 64.3 63.9 64.3 63.5 64.4 64.7 64.5 65.0
4 64.1 64.5 64.3 64.0 64.6 64.3 64.3 64.5 64.7 65.2
8 64.7 64.0 63.6 64.6 64.6 64.3 64.3 64.2 64.3 64.2
16 63.2 64.5 64.9 64.0 63.3 64.4 64.0 64.1 64.7 63.9
32 64.1 64.2 64.4 64.2 64.4 63.9 64.5 64.2 64.4 64.5
64 62.2 63.0 62.5 63.2 62.5 63.5 62.9 62.7 62.9 62.7
128 59.3 59.0 59.7 59.0 59.1 59.5 59.0 59.2 59.2 58.8

There are several patterns to be found in this table. First, it is similar to the previous one using project-internal wikilinks in that performance does not vary as much as it did in our first table, which makes it slightly more difficult to choose the appropriate model. Secondly, overall performance is close to what we reported for our first model, suggesting that project-internal wikilinks do not add much information: the best-performing model reports 65.2% accuracy, versus 64% for the first model. Lastly, the highest performance is found in a model with a fairly small terminating node size of 4, whereas the previous one used 16. This makes sense, in that we now have an additional variable that can be used to further divide the dataset, something a larger terminating node size would prevent.

We choose the highest performing model from the previous table, with a forest size of 1,001 trees and terminating node size of 4. Testing it on the test set then gives the following confusion matrix:

Top High Mid Low
Top 27 12 1 0
High 4 20 12 4
Mid 2 6 21 11
Low 0 3 13 24

On the test set, this model has an overall accuracy of 57.5%. That is considerably lower than what we saw for the first model, which used only a single measure of the number of wikilinks. This larger model performs worse on all classes, particularly on Mid-importance articles. While we would assume that adding another variable to our model improves performance, because that variable encodes information that would otherwise not be available, here we see a clear example of that idea backfiring. Given these results, the original model remains our primary candidate.

2,000 item training set[edit]

We now repeat the procedure on the larger training dataset. Because that dataset contains more samples from the larger classes (High-, Mid-, and Low-importance), we might see improvements in performance. However, it also contains more synthetic samples, so we are not sure. The approach we use will be the same as for the smaller dataset. For brevity, I will only report the main results.

Cross-validation accuracy using the number of wikilinks from all Wikipedia articles and the number of views is somewhat lower than earlier; the highest accuracy reported is 63.05%. We again choose the simplest of those models (201 trees, terminating node size 16) and test it on the test set. Overall accuracy is 58.75%.

Cross-validation accuracy using the number of wikilinks within WikiProject Medicine is higher than before, on par with what we saw for global wikilinks. The highest reported accuracy is 63.15%, for a forest with 101 trees and terminating node size 32. Some other forest sizes using the same terminating node size also perform well. However, performance on the test set is not great, with an overall accuracy of 54.38%. This matches what we saw earlier: global links provide more signal than project-internal links.

Lastly we test using all three predictors. Here we find a large improvement in reported cross-validation accuracy, with the highest accuracy coming in at 68.05%, using a forest with 301 trees and a terminating node size of 16. We then test this combination on the test set, where overall accuracy is 55.62%. Again we see the challenge of comparing performance across different datasets and conditions: cross-validation suggests promising results, but those results are not delivered on the test set.

SVM[edit]

We next do similar kinds of testing with the SVM classifier, using the same combinations of predictors. The SVM is tuned slightly differently, but the general approach is the same. I report only the results for brevity.

Using number of wikilinks from all Wikipedia articles and number of views, I tune the gamma and cost parameters using R's tuning functionality as seen previously, then run 10-fold cross-validation on the 1,000 item training dataset. Reported overall accuracy is 59.1%. I then run this on the test dataset and get a reported accuracy of 48.75% and this confusion matrix:

Top High Mid Low
Top 22 14 4 0
High 9 21 7 3
Mid 4 8 15 13
Low 0 4 16 20

This matrix shows that the classifier is performing about the same across all four classes, but only getting about half (or less) of the predictions correct for each of them.
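
For reference, here is a minimal sketch of this kind of SVM tuning in R with the e1071 package; the hyperparameter ranges and the object and column names (train_set, test_set, imp_rating) are illustrative, not the exact values we used.

library(e1071)

## Grid search over gamma and cost with 10-fold cross-validation; the ranges
## here are placeholders rather than the ones actually used.
tuned <- tune.svm(imp_rating ~ n_inlinks + n_views, data = train_set,
                  gamma = 10^(-3:1), cost = 10^(-1:2),
                  tunecontrol = tune.control(cross = 10))
summary(tuned)   # cross-validation error for each gamma/cost combination

## Evaluate the best model found by the tuning run on the held-out test set.
best_svm <- tuned$best.model
pred     <- predict(best_svm, test_set)
table(true = test_set$imp_rating, predicted = pred)
mean(pred == test_set$imp_rating)   # overall test-set accuracy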

We then repeat the procedure using project-internal links as our link-based predictor. After tuning we run 10-fold cross-validation on the training set and it reports a mean accuracy of 60.1%. Running this model on the test set reports an improvement in accuracy to 58.75%. The gain in performance comes in the Top- and Low-importance articles, where 5 more articles are predicted correctly. Performance in the other two classes is about the same as before.

Lastly we run this with all three predictors. 10-fold cross-validation on the training set after tuning reports a mean accuracy of 66%. Running the model on the test set reports overall accuracy of 62.5%, with this confusion matrix:

Top High Mid Low
Top 29 10 0 1
High 12 20 4 4
Mid 2 7 23 8
Low 0 3 9 28

From this confusion matrix it is clear that the gains in performance come from Top-, Mid-, and Low-importance articles, all of which are predicted much more accurately than by the first SVM model. While the performance on Mid-importance articles is not as strong as with the first Random Forest model, it is quite close.

2,000 item training set[edit]

We then repeat the procedure on the larger training dataset. Again the idea is that the larger set of samples from the bigger categories might provide us with better predictions, although the added synthetic samples might also result in lower accuracy. As before, reported results are condensed for space.

Using number of wikilinks from all other Wikipedia articles we get a 10-fold cross-validation mean accuracy of 61.95%. On the test set the model has an accuracy of 55%. This is significantly better than what we saw with the smaller dataset, and might mean that we can get further improvements with the additional predictor.

Changing the link-based predictor to the number of project-internal wikilinks, the 10-fold cross-validation mean accuracy is 59.95%, while on the test set this model has an accuracy of 50%. This is much lower than we previously saw using the smaller dataset, which could mean that overall performance using all three variables might not be as strong.

Using all three predictors, the mean cross-validation accuracy across 10 folds is 64.4%. This is slightly lower than what was reported using the smaller dataset. On the test set this model has an overall accuracy of 60%. The model performs slightly better on Low-importance articles, but does not perform better on any of the other classes.