Research talk:Automated classification of article importance/Work log/2017-03-28

Tuesday, March 28, 2017

Today I plan to wrap up training and evaluating classifiers based on the clickstream dataset, and start work on analyzing potential categories for WPMED articles of Low-importance.

Clickstream models

We have so far used number of views and number of inlinks as predictors in our models in order to keep them straightforward. In our WPMED classifier we also added "proportion of inlinks from within WikiProject Medicine" to combine local indegree with global indegree, based on the approach used by Kamps and Koolen in their 2008 paper. That approach worked well, boosting the performance of that classifier.
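As a rough illustration of that measure (not the exact code used in this project; the function and variable names are hypothetical), the project-internal inlink proportion can be computed from a set of (source, target) link pairs and the set of articles tagged by the WikiProject:

from collections import defaultdict

def inlink_proportions(link_pairs, project_pages):
    """For each target article, compute the proportion of its inlinks
    that originate from articles within the WikiProject."""
    total = defaultdict(int)     # all inlinks per target article
    internal = defaultdict(int)  # inlinks coming from project articles

    for source, target in link_pairs:
        total[target] += 1
        if source in project_pages:
            internal[target] += 1

    return {title: internal[title] / total[title] for title in total}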

The approach can also be used in combination with the clickstream dataset. As a first step, we use the clickstream dataset to calculate the proportion of views that come from other Wikipedia articles and the proportion of inlinks that are used to access the article. The first measure should provide us with a bit more information about how traffic is shaped; in other words, whether an article's traffic is mainly internal or external. The second measure should provide us with more information about the degree to which inlinks are actually used. If an article has many inlinks but very few of them are actually used, then it might not be as important as it otherwise appears to be.

We downloaded the 2017 clickstream dataset and wrote a Python script that processes it based on a given input dataset of articles we're interested in. This gives us the total number of views in the clickstream dataset, the number of views coming from other articles, and the number of distinct other articles that led to views. In addition to measuring this globally, it can also add two project-specific variants, where the views have to originate from an article within the project.
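A minimal sketch of that kind of processing is shown below. It assumes the four-column prev / curr / type / n TSV layout of the published clickstream releases, and the function is illustrative rather than the actual script:

import csv
from collections import defaultdict

def clickstream_stats(clickstream_path, articles, project_articles=None):
    """Summarise clickstream traffic for a given set of article titles.

    For each article we count total views in the clickstream, views
    coming from other articles, and the number of distinct articles
    that sent traffic. If project_articles is given, project-specific
    variants of the latter two counts are added as well.
    """
    stats = defaultdict(lambda: defaultdict(int))

    with open(clickstream_path, encoding="utf-8") as infile:
        for prev, curr, link_type, n in csv.reader(infile, delimiter="\t"):
            if not n.isdigit():  # skip a header row, if present
                continue
            if curr not in articles:
                continue
            n = int(n)
            stats[curr]["views"] += n
            if link_type == "link":  # referer is another article
                stats[curr]["views_from_articles"] += n
                stats[curr]["active_inlinks"] += 1
                if project_articles and prev in project_articles:
                    stats[curr]["proj_views_from_articles"] += n
                    stats[curr]["proj_active_inlinks"] += 1
    return stats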

After processing the clickstream dataset, we add these measures to our existing models and test whether they result in higher performance. We tested the global dataset with a Random Forest, an SVM, and a GBM. Then we tested the WPMED dataset using only an SVM, because that continues to be the best-performing model.
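The evaluation step is the same for every model: fit on the training set, predict the test set, and report overall accuracy together with a confusion matrix. A small helper along these lines (sketched here with scikit-learn and pandas, which may differ from the tooling actually used) is assumed in the snippets further down:

import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

RATINGS = ["Top", "High", "Mid", "Low"]

def evaluate(model, features, train, test, label_col="importance"):
    """Fit a classifier and report overall accuracy plus a confusion
    matrix with rows as true rating and columns as predicted rating."""
    model.fit(train[features], train[label_col])
    predictions = model.predict(test[features])
    accuracy = accuracy_score(test[label_col], predictions)
    matrix = pd.DataFrame(
        confusion_matrix(test[label_col], predictions, labels=RATINGS),
        index=RATINGS, columns=RATINGS)
    return accuracy, matrix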

Global dataset

Random Forest

We use our cross-validation testing approach to determine the right forest size and terminal node size for all our Random Forest models. First we create a benchmark model using just number of views and inlinks, with a forest size of 701 trees and a terminal node size of 512. This model has an overall accuracy of 49.88% on our 1,600-article test set, with the following confusion matrix (where rows are true rating and columns are predicted rating):

        Top  High   Mid   Low
Top     182   112    88    18
High     82   164   125    29
Mid      16    85   168   131
Low       2    12   102   284

This performance is on par with what we've seen earlier. We then add each of our two new variables one at a time, and then combine them at the end. First, we add "proportion of views from other articles". It scores an overall accuracy of 47.12% on our dataset, with this confusion matrix:

        Top  High   Mid   Low
Top     208   116    41    35
High    126   153    76    45
Mid      59    93    96   152
Low      12    26    65   297

We can see gains in accuracy for Top- and Low-importance articles, but the performance on the other two classes is poorer. Mid-importance articles suffer in particular, with only 24% accuracy. This variable therefore doesn't seem to provide us with a lot of useful information overall. Next we instead add "proportion of inlinks creating traffic". This version of the model also has 49.88% accuracy, with the following confusion matrix:

        Top  High   Mid   Low
Top     193   112    72    23
High     96   168    96    40
Mid      24    90   136   150
Low       5    11    83   301

This model has better accuracy than the benchmark for everything but Mid-importance articles. That class does better than in the previous version, but still only gets 34% of its articles correct. It looks like this variable to some extent shifts articles upwards in importance. Maybe adding both variables will provide us with useful information? Adding both variables to the model results in 49.75% overall accuracy, with this confusion matrix:

        Top  High   Mid   Low
Top     209   106    68    17
High    101   164   103    32
Mid      31    90   144   135
Low       3    14   104   279

Compared to the benchmark, this model scores much better on Top-importance articles. For High- and Mid-importance we can see a shift of articles towards higher importance classes, with accuracy for Mid-importance suffering compared to the benchmark while High-importance is unchanged. Low-importance articles score about the same as before. For the Random Forest classifier, this information does not appear to provide any useful clues about article importance.
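For reference, a rough scikit-learn version of the Random Forest benchmark described above could look like the snippet below. Mapping "terminal node size" to min_samples_leaf is an assumption about how that parameter translates, and the original models were not necessarily built with this library.

from sklearn.ensemble import RandomForestClassifier

# Benchmark configuration: 701 trees, terminal node size 512
# (mapped to min_samples_leaf here, which is an assumption).
rf_benchmark = RandomForestClassifier(n_estimators=701,
                                      min_samples_leaf=512,
                                      random_state=42)

# Using the hypothetical evaluate() helper sketched earlier:
# accuracy, matrix = evaluate(rf_benchmark,
#                             ["num_views", "num_inlinks"],
#                             train_set, test_set)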

SVM

We first confirm the benchmark performance of the SVM classifier that we found earlier. Using number of views and inlinks, we achieve an overall accuracy on the test set of 50.38% with this confusion matrix (where again, rows are true rating and columns are predicted rating):

        Top  High   Mid   Low
Top     203   114    66    17
High    103   181    85    31
Mid      33    93   138   136
Low       4    17    95   284

As before, we add each variable individually before combining the two. First the proportion of views coming from other articles, which results in an overall accuracy of 49.38% and this confusion matrix:

        Top  High   Mid   Low
Top     225    98    52    25
High    121   156    89    34
Mid      46    90   123   141
Low      12    12    90   286

Compared to the benchmark, this classifier performs better on Top- and Low-importance articles, but worse on the other two. For High- and Mid-importance articles, it appears to somewhat shift articles towards higher classes, similar to what we saw for the Random Forest classifier. Overall we don't get any improvement, so let's move on to the other variable, proportion of inlinks leading to clicks.

Adding proportion of active inlinks to the model again leads to a reduction in overall accuracy: this model scores 48.69%, with the following confusion matrix:

        Top  High   Mid   Low
Top     207   110    59    24
High    124   163    82    31
Mid      58    80   130   132
Low      11    18    92   279

Compared to the benchmark, this model performs slightly better on Top-importance, slightly worse on Mid- and Low-importance, and quite a lot worse on High-importance. Overall there's no benefit from adding this information to the model. We therefore move on to the last model, which combines the two. A model with all four variables has a slight improvement in overall accuracy compared to the benchmark, coming in at 51.44% accuracy on our test set, and the following confusion matrix:

        Top  High   Mid   Low
Top     235    88    55    22
High    122   171    74    33
Mid      51    88   130   131
Low       6    22    85   287

The confusion matrix shows that this model is a trade-off between increased accuracy for Top-importance articles, and decreased accuracy for High- and Mid-importance. Accuracy for Low-importance articles is mostly unchanged. We get an eight percentage point increase in accuracy for Top-importance (58.75% in this model). We lose about two percentage points of accuracy on High- and Mid-importance.
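A comparable sketch of the combined four-variable SVM, again with scikit-learn as a stand-in for whatever implementation was actually used; the RBF kernel, the scaling step, and the feature column names are assumptions:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scale the features, then fit an RBF-kernel SVM; cost and gamma would
# normally be tuned through cross-validation.
svm_combined = make_pipeline(StandardScaler(),
                             SVC(kernel="rbf", C=1.0, gamma="scale"))

svm_features = ["num_views", "num_inlinks",
                "prop_views_from_articles", "prop_active_inlinks"]
# accuracy, matrix = evaluate(svm_combined, svm_features,
#                             train_set, test_set)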

GBM

We again train a benchmark GBM using number of views and inlinks. Using manual cross-validation (GBM's built-in CV algorithm sometimes errors out because of bugs in the selection process), we set minimum node size to 8 and use 299 trees. This model has an overall accuracy of 49.62% with the following confusion matrix:

        Top  High   Mid   Low
Top     189   130    67    14
High    100   192    84    24
Mid      23   110   151   116
Low       5    23   110   262

The benchmark model gets about half the articles correct in Top- and High-importance, 37.75% correct in Mid-importance, and 65.5% correct in Low-importance. As with some of the other models, it appears that Low-importance articles are reasonably easy to predict, whereas the others are challenging.
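The manual cross-validation could be done along the lines sketched below, with scikit-learn's GradientBoostingClassifier standing in for the GBM implementation actually used; the parameter grid and number of folds are illustrative assumptions:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def tune_gbm(X, y, leaf_sizes=(4, 8, 16, 32, 64),
             tree_counts=range(100, 501, 100), folds=10):
    """Grid-search minimum node size and number of trees by manual
    cross-validation, returning the best-scoring combination."""
    best_params, best_score = None, -np.inf
    for leaf in leaf_sizes:
        for n_trees in tree_counts:
            model = GradientBoostingClassifier(n_estimators=n_trees,
                                               min_samples_leaf=leaf)
            score = cross_val_score(model, X, y, cv=folds).mean()
            if score > best_score:
                best_params, best_score = (leaf, n_trees), score
    return best_params, best_score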

Because we've previously seen that adding just a single variable does not lead to improved performance, we skip that step and only train a model with both variables. Through cross-validation we select minimum node size of 64, and use 417 trees for predictions. This model has an overall accuracy of 49.62% with the following confusion matrix:

        Top  High   Mid   Low
Top     197   123    65    15
High    102   191    81    26
Mid      30   106   145   119
Low       4    21   114   261

It's fairly easy to see that this model makes a slight improvement in predicting Top-importance articles, at the cost of a slight reduction in accuracy for Mid-importance articles. The other two classes are basically unchanged.

In summary, these models don't seem to perform much better than the benchmarks. It might be that our new variables are not good predictors, or it might be that our approach of using unanimous votes leads to a confusing dataset. Either way, before we go further with a global classifier, it might be fruitful to consider alternatives.

WPMED dataset

We had unfortunately failed to store the dataset used previously. I did a bit of data cleaning of our original WPMED dataset by removing all disambiguation pages (they should have an NA-rating), and by correcting the rating of 129 individuals that I had identified; all of them should have a Low-importance rating but did not. From this we then randomly select 40 articles from each rating class as a test set, and build a 1,000-article training set with 200 synthetic Top-importance examples using SMOTE. As we saw previously, creating more synthetic samples did not lead to better performance.

Given that the dataset is slightly different, performance isn't comparable with our earlier results, which is why we first train a benchmark classifier. A lot of the intermediate-stage models are not useful, so we skip those and focus on three: the benchmark, one with the two proportional variables used in the global dataset, and one with project-specific variations of those two. Because the SVM classifier has had the highest performance throughout our analysis, we only use that approach here.
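A sketch of how such a split and oversampling step could be done with pandas and imbalanced-learn is shown below. The exact construction of the 1,000-article training set is not reproduced, the column names are hypothetical, and the original work may well have used different tooling:

import pandas as pd
from imblearn.over_sampling import SMOTE

def build_wpmed_sets(data, label_col="importance", test_per_class=40,
                     synthetic_top=200, seed=42):
    """Hold out a balanced test set (40 articles per rating class) and
    oversample Top-importance in the remaining data with SMOTE.
    Assumes all non-label columns are numeric predictors."""
    # Balanced test set: 40 randomly chosen articles from each class.
    test = (data.groupby(label_col, group_keys=False)
                .apply(lambda grp: grp.sample(test_per_class,
                                              random_state=seed)))
    train = data.drop(test.index)

    features = train.drop(columns=[label_col])
    labels = train[label_col]

    # Add roughly 200 synthetic Top-importance examples to the real ones.
    target_top = int((labels == "Top").sum()) + synthetic_top
    smote = SMOTE(sampling_strategy={"Top": target_top}, random_state=seed)
    train_X, train_y = smote.fit_resample(features, labels)
    return train_X, train_y, test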

The first model is our benchmark from earlier, which uses three predictors: number of views, number of inlinks, and proportion of inlinks from articles within WikiProject Medicine. The benchmark model has an overall accuracy of 55.62% on the new test set, with the following confusion matrix:

        Top  High   Mid   Low
Top      24    16     0     0
High     10    18     8     4
Mid       2     7    20    11
Low       0     3    10    27

Top- and Low-importance articles are the ones that appear to be somewhat easier to predict. We can also clearly see the split in the dataset: Top- and High-importance articles look alike, and Mid- and Low-importance articles look alike.

The next model adds the two proportional variables from before: proportion of views from other articles, and proportion of inlinks that lead to traffic. This model is a big improvement over the benchmark, having an overall accuracy of 68.12% with the following confusion matrix:

        Top  High   Mid   Low
Top      36     4     0     0
High     12    20     7     1
Mid       1     7    26     6
Low       1     3     9    27

We see improvements across the board, except for Low-importance articles where accuracy stays at 67.5%. Accuracy for Top-importance articles comes in at 90%, suggesting that these types of measures work really well for identifying those in WPMED, maybe because the traffic is substantial enough for it to matter in distinguishing between Top- and High-importance. We do, however, see that a slightly higher proportion of High-importance articles are classified as Top-importance, but overall the classifier improves on High-importance articles because fewer of them are misclassified into the lower classes. We can also see an improvement in distinguishing between Mid- and Low-importance, although mainly in that the former class is more clearly separated from the latter.

The final model we test adds two project-specific variants of the proportional variables to the model: proportion of views coming from other articles in WikiProject Medicine, and proportion of all inlinks that were active and came from Medicine articles. This model has an overall accuracy of 63.75%, with the following confusion matrix:

        Top  High   Mid   Low
Top      35     5     0     0
High     10    19     7     4
Mid       1     9    23     7
Low       1     3    11    25

Because there are so few articles in the test set, small changes in accuracy show up as fairly large percentage differences. We see from the confusion matrix that the performance on Top- and High-importance articles is basically unchanged. The drop in performance comes on Mid- and Low-importance articles, but we can also see changes in how High-importance articles are classified: whereas before they were mainly mistaken for Top-importance, some of them are now also mistaken for Low-importance articles. The drop in performance mainly comes from larger confusion between Mid- and Low-importance articles; in both cases more of them look like they belong to other classes, mostly the neighboring ones.

In conclusion, we see indications that clickstream data allows for improved performance on the WPMED dataset as long as we're looking at Wikipedia-wide effects; adding project-specific variants of this data does not improve performance, regardless of whether they are combined with the global variables or replace them (we tried both approaches). At the same time, we know that certain categories of WPMED articles are defined as Low-importance, so it will be interesting to see whether the clickstream data still adds performance once we incorporate data on categories.