Research talk:Automated classification of article importance/Work log/2017-03-28

Tuesday, March 28, 2017

Today I plan to wrap up training and evaluating classifiers based on the clickstream dataset, and start work on analyzing potential categories for WPMED articles of Low-importance.

Clickstream models

We have so far used number of views and number of inlinks as predictors in our models in order to keep them straightforward. In our WPMED classifier we also added "proportion of inlinks from within WikiProject Medicine" to combine local indegree with global indegree, based on the approach used by Kamps and Koolen in their 2008 paper. That approach worked well, boosting the performance of that classifier.
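As a rough illustration of that measure (not the exact code used in this project; the function and variable names are hypothetical), the project-internal inlink proportion can be computed from a set of (source, target) link pairs and the set of articles tagged by the WikiProject:

from collections import defaultdict

def inlink_proportions(link_pairs, project_pages):
    """For each target article, compute the proportion of its inlinks
    that originate from articles within the WikiProject."""
    total = defaultdict(int)     # all inlinks per target article
    internal = defaultdict(int)  # inlinks coming from project articles

    for source, target in link_pairs:
        total[target] += 1
        if source in project_pages:
            internal[target] += 1

    return {title: internal[title] / total[title] for title in total}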

The approach can also be used in combination with the clickstream dataset. As a first step, we use the clickstream dataset to calculate the proportion of views that come from other Wikipedia articles and the proportion of inlinks that are used to access the article. The first measure should provide us with a bit more information about how traffic is shaped; in other words, whether an article's traffic is mainly internal or external. The second measure should provide us with more information about the degree to which inlinks are actually used. If an article has many inlinks but very few of them are actually used, then it might not be as important as it otherwise appears to be.

We downloaded the 2017 clickstream dataset and wrote a Python script that processes it based on a given input dataset of articles we're interested in. This gives us the total number of views in the clickstream dataset, the number of views coming from other articles, and the number of distinct other articles that led to views. In addition to measuring this globally, it can also add two project-specific variants, where the views have to originate from an article within the project.
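A minimal sketch of that kind of processing is shown below. It assumes the four-column prev / curr / type / n TSV layout of the published clickstream releases, and the function is illustrative rather than the actual script:

import csv
from collections import defaultdict

def clickstream_stats(clickstream_path, articles, project_articles=None):
    """Summarise clickstream traffic for a given set of article titles.

    For each article we count total views in the clickstream, views
    coming from other articles, and the number of distinct articles
    that sent traffic. If project_articles is given, project-specific
    variants of the latter two counts are added as well.
    """
    stats = defaultdict(lambda: defaultdict(int))

    with open(clickstream_path, encoding="utf-8") as infile:
        for prev, curr, link_type, n in csv.reader(infile, delimiter="\t"):
            if not n.isdigit():  # skip a header row, if present
                continue
            if curr not in articles:
                continue
            n = int(n)
            stats[curr]["views"] += n
            if link_type == "link":  # referer is another article
                stats[curr]["views_from_articles"] += n
                stats[curr]["active_inlinks"] += 1
                if project_articles and prev in project_articles:
                    stats[curr]["proj_views_from_articles"] += n
                    stats[curr]["proj_active_inlinks"] += 1
    return stats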

After processing the clickstream dataset, we add these measures to our existing models and test whether they result in higher performance. We tested the global dataset with a Random Forest, an SVM, and a GBM. Then we tested the WPMED dataset using only an SVM, because that continues to be the best-performing model.
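The evaluation step is the same for every model: fit on the training set, predict the test set, and report overall accuracy together with a confusion matrix. A small helper along these lines (sketched here with scikit-learn and pandas, which may differ from the tooling actually used) is assumed in the snippets further down:

import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

RATINGS = ["Top", "High", "Mid", "Low"]

def evaluate(model, features, train, test, label_col="importance"):
    """Fit a classifier and report overall accuracy plus a confusion
    matrix with rows as true rating and columns as predicted rating."""
    model.fit(train[features], train[label_col])
    predictions = model.predict(test[features])
    accuracy = accuracy_score(test[label_col], predictions)
    matrix = pd.DataFrame(
        confusion_matrix(test[label_col], predictions, labels=RATINGS),
        index=RATINGS, columns=RATINGS)
    return accuracy, matrix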

Global dataset

Random Forest

We use our cross-validation testing approach to determine the right forest size and terminal node size for all our Random Forest models. First we create a benchmark model using just number of views and inlinks, with a forest size of 701 trees and a terminal node size of 512. This model has an overall accuracy of 49.88% on our 1,600-article test set, with the following confusion matrix (where rows are true rating and columns are predicted rating):

        Top  High   Mid   Low
Top     182   112    88    18
High     82   164   125    29
Mid      16    85   168   131
Low       2    12   102   284

This performance is on par with what we've seen earlier. We then add each of our two new variables one at a time, and then combine them at the end. First, we add "proportion of views from other articles". It scores an overall accuracy of 47.12% on our dataset, with this confusion matrix:

        Top  High   Mid   Low
Top     208   116    41    35
High    126   153    76    45
Mid      59    93    96   152
Low      12    26    65   297

We can see gains in accuracy for Top- and Low-importance articles, but the performance on the other two classes is poorer. Mid-importance articles suffer in particular, with only 24% accuracy. This variable therefore doesn't seem to provide us with a lot of useful information overall. Next we instead add "proportion of inlinks creating traffic". This version of the model also has 49.88% accuracy, with the following confusion matrix:

        Top  High   Mid   Low
Top     193   112    72    23
High     96   168    96    40
Mid      24    90   136   150
Low       5    11    83   301

This model has better accuracy than the benchmark for everything but Mid-importance articles. That class does better than in the previous version, but still only gets 34% of its articles correct. It looks like this variable to some extent shifts articles upwards in importance. Maybe adding both variables will provide us with useful information? Adding both variables to the model results in 49.75% overall accuracy, with this confusion matrix:

        Top  High   Mid   Low
Top     209   106    68    17
High    101   164   103    32
Mid      31    90   144   135
Low       3    14   104   279

Compared to the benchmark, this model scores much better on Top-importance articles. For High- and Mid-importance we can see a shift of articles towards higher importance classes, with accuracy for Mid-importance suffering compared to the benchmark while High-importance is unchanged. Low-importance articles score about the same as before. For the Random Forest classifier, this information does not appear to provide any useful clues about article importance.
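For reference, a rough scikit-learn version of the Random Forest benchmark described above could look like the snippet below. Mapping "terminal node size" to min_samples_leaf is an assumption about how that parameter translates, and the original models were not necessarily built with this library.

from sklearn.ensemble import RandomForestClassifier

# Benchmark configuration: 701 trees, terminal node size 512
# (mapped to min_samples_leaf here, which is an assumption).
rf_benchmark = RandomForestClassifier(n_estimators=701,
                                      min_samples_leaf=512,
                                      random_state=42)

# Using the hypothetical evaluate() helper sketched earlier:
# accuracy, matrix = evaluate(rf_benchmark,
#                             ["num_views", "num_inlinks"],
#                             train_set, test_set)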

SVM

We first confirm the benchmark performance of the SVM classifier that we found earlier. Using number of views and inlinks, we achieve an overall accuracy on the test set of 50.38% with this confusion matrix (where again, rows are true rating and columns are predicted rating):

        Top  High   Mid   Low
Top     203   114    66    17
High    103   181    85    31
Mid      33    93   138   136
Low       4    17    95   284

As before, we add each variable individually before combining the two. First the proportion of views coming from other articles, which results in an overall accuracy of 49.38% and this confusion matrix:

        Top  High   Mid   Low
Top     225    98    52    25
High    121   156    89    34
Mid      46    90   123   141
Low      12    12    90   286

Compared to the benchmark, this classifier performs better on Top- and Low-importance articles, but worse on the other two. For High- and Mid-importance articles, it appears to somewhat shift articles towards higher classes, similar to what we saw for the Random Forest classifier. Overall we don't get any improvement, so let's move on to the other variable, proportion of inlinks leading to clicks.

Adding proportion of active inlinks to the model again leads to a reduction in overall accuracy: this model scores 48.69%, with the following confusion matrix:

        Top  High   Mid   Low
Top     207   110    59    24
High    124   163    82    31
Mid      58    80   130   132
Low      11    18    92   279

Compared to the benchmark, this model performs slightly better on Top-importance, slightly worse on Mid- and Low-importance, and quite a lot worse on High-importance. Overall there's no benefit from adding this information to the model. We therefore move on to the last model, which combines the two. A model with all four variables has a slight improvement in overall accuracy compared to the benchmark, coming in at 51.44% accuracy on our test set, and the following confusion matrix:

        Top  High   Mid   Low
Top     235    88    55    22
High    122   171    74    33
Mid      51    88   130   131
Low       6    22    85   287

The confusion matrix shows that this model is a trade-off between increased accuracy for Top-importance articles, and decreased accuracy for High- and Mid-importance. Accuracy for Low-importance articles is mostly unchanged. We get an eight percentage point increase in accuracy for Top-importance (58.75% in this model). We lose about two percentage points of accuracy on High- and Mid-importance.
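A comparable sketch of the combined four-variable SVM, again with scikit-learn as a stand-in for whatever implementation was actually used; the RBF kernel, the scaling step, and the feature column names are assumptions:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scale the features, then fit an RBF-kernel SVM; cost and gamma would
# normally be tuned through cross-validation.
svm_combined = make_pipeline(StandardScaler(),
                             SVC(kernel="rbf", C=1.0, gamma="scale"))

svm_features = ["num_views", "num_inlinks",
                "prop_views_from_articles", "prop_active_inlinks"]
# accuracy, matrix = evaluate(svm_combined, svm_features,
#                             train_set, test_set)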

GBM

We again train a benchmark GBM using number of views and inlinks. Using manual cross-validation (GBM's built-in CV algorithm sometimes errors out because of bugs in the selection process), we set minimum node size to 8 and use 299 trees. This model has an overall accuracy of 49.62% with the following confusion matrix:

        Top  High   Mid   Low
Top     189   130    67    14
High    100   192    84    24
Mid      23   110   151   116
Low       5    23   110   262

The benchmark model gets about half the articles correct in Top- and High-importance, 37.75% correct in Mid-importance, and 65.5% correct in Low-importance. As with some of the other models, it appears that Low-importance articles are reasonably easy to predict, whereas the others are challenging.
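The manual cross-validation could be done along the lines sketched below, with scikit-learn's GradientBoostingClassifier standing in for the GBM implementation actually used; the parameter grid and number of folds are illustrative assumptions:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def tune_gbm(X, y, leaf_sizes=(4, 8, 16, 32, 64),
             tree_counts=range(100, 501, 100), folds=10):
    """Grid-search minimum node size and number of trees by manual
    cross-validation, returning the best-scoring combination."""
    best_params, best_score = None, -np.inf
    for leaf in leaf_sizes:
        for n_trees in tree_counts:
            model = GradientBoostingClassifier(n_estimators=n_trees,
                                               min_samples_leaf=leaf)
            score = cross_val_score(model, X, y, cv=folds).mean()
            if score > best_score:
                best_params, best_score = (leaf, n_trees), score
    return best_params, best_score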

Because we've previously seen that adding just a single variable does not lead to improved performance, we skip that step and only train a model with both variables. Through cross-validation we select minimum node size of 64, and use 417 trees for predictions. This model has an overall accuracy of 49.62% with the following confusion matrix:

        Top  High   Mid   Low
Top     197   123    65    15
High    102   191    81    26
Mid      30   106   145   119
Low       4    21   114   261

It's fairly easy to see that this model makes a slight improvement in predicting Top-importance articles, at the cost of a slight reduction in accuracy for Mid-importance articles. The other two classes are basically unchanged.

In summary, these models don't seem to perform much better than the benchmarks. It might be that our new variables are not good predictors, or it might be that our approach of using unanimous votes leads to a confusing dataset. Either way, before we go further with a global classifier, it might be fruitful to consider alternatives.

WPMED dataset

We had unfortunately failed to store the dataset used previously. I did a bit of data cleaning of our original WPMED dataset by removing all disambiguation pages (they should have an NA-rating), and by correcting the rating of 129 individuals that I had identified; all of them should have a Low-importance rating but did not. From this we then randomly select 40 articles from each rating class as a test set, and build a 1,000-article training set with 200 synthetic Top-importance examples using SMOTE. As we saw previously, creating more synthetic samples did not lead to better performance.

Given that the dataset is slightly different, performance isn't comparable with our earlier results, which is why we first train a benchmark classifier. A lot of the intermediate-stage models are not useful, so we skip those and focus on three: the benchmark, one with the two proportional variables used in the global dataset, and one with project-specific variations of those two. Because the SVM classifier has had the highest performance throughout our analysis, we only use that approach here.
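A sketch of how such a split and oversampling step could be done with pandas and imbalanced-learn is shown below. The exact construction of the 1,000-article training set is not reproduced, the column names are hypothetical, and the original work may well have used different tooling:

import pandas as pd
from imblearn.over_sampling import SMOTE

def build_wpmed_sets(data, label_col="importance", test_per_class=40,
                     synthetic_top=200, seed=42):
    """Hold out a balanced test set (40 articles per rating class) and
    oversample Top-importance in the remaining data with SMOTE.
    Assumes all non-label columns are numeric predictors."""
    # Balanced test set: 40 randomly chosen articles from each class.
    test = (data.groupby(label_col, group_keys=False)
                .apply(lambda grp: grp.sample(test_per_class,
                                              random_state=seed)))
    train = data.drop(test.index)

    features = train.drop(columns=[label_col])
    labels = train[label_col]

    # Add roughly 200 synthetic Top-importance examples to the real ones.
    target_top = int((labels == "Top").sum()) + synthetic_top
    smote = SMOTE(sampling_strategy={"Top": target_top}, random_state=seed)
    train_X, train_y = smote.fit_resample(features, labels)
    return train_X, train_y, test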

The first model is our benchmark from earlier, which uses three predictors: number of views, number of inlinks, and proportion of inlinks from articles within WikiProject Medicine. The benchmark model has an overall accuracy of 55.62% on the new test set, with the following confusion matrix:

        Top  High   Mid   Low
Top      24    16     0     0
High     10    18     8     4
Mid       2     7    20    11
Low       0     3    10    27

Top- and Low-importance articles are the ones that appear to be somewhat easier to predict. We can also clearly see the split in the dataset: Top- and High-importance articles look alike, and Mid- and Low-importance articles look alike.

The next model adds the two proportional variables from before: proportion of views from other articles, and proportion of inlinks that lead to traffic. This model is a big improvement over the benchmark, having an overall accuracy of 68.12% with the following confusion matrix:

        Top  High   Mid   Low
Top      36     4     0     0
High     12    20     7     1
Mid       1     7    26     6
Low       1     3     9    27

We see improvements across the board, except for Low-importance articles where accuracy stays at 67.5%. Accuracy for Top-importance articles comes in at 90%, suggesting that these types of measures work really well for identifying those in WPMED, maybe because the traffic is substantial enough for it to matter in distinguishing between Top- and High-importance. We do, however, see that a slightly higher proportion of High-importance articles are classified as Top-importance, but overall the classifier improves on High-importance articles because fewer of them are misclassified into the lower classes. We can also see an improvement in distinguishing between Mid- and Low-importance, although mainly in that the former class is more clearly separated from the latter.

The final model we test adds two project-specific variants of the proportional variables to the model: proportion of views coming from other articles in WikiProject Medicine, and proportion of all inlinks that were active and came from Medicine articles. This model has an overall accuracy of 63.75%, with the following confusion matrix:

        Top  High   Mid   Low
Top      35     5     0     0
High     10    19     7     4
Mid       1     9    23     7
Low       1     3    11    25

Because there are so few articles in the test set, small changes in accuracy show up as fairly large percentage differences. We see from the confusion matrix that the performance on Top- and High-importance articles is basically unchanged. The drop in performance comes on Mid- and Low-importance articles, but we can also see changes in how High-importance articles are classified: whereas before they were mainly mistaken for Top-importance, some of them are now also mistaken for Low-importance articles. The drop in performance mainly comes from larger confusion between Mid- and Low-importance articles; in both cases more of them look like they belong to other classes, mostly the neighboring ones.

In conclusion, we see indications that clickstream data allows for improved performance on the WPMED dataset as long as we're looking at Wikipedia-wide effects; adding project-specific variants of this data does not improve performance, regardless of whether they are combined with the global variables or replace them (we tried both approaches). At the same time, we know that certain categories of WPMED articles are defined as Low-importance, so it will be interesting to see whether the clickstream data still adds performance once we incorporate data on categories.