Research talk:Automated classification of article importance/Work log/2017-03-08

Wednesday, March 8, 2017

Today I will follow up on yesterday's work by generating a dataset of articles with unanimous importance ratings. Using that dataset, I'll gather basic statistics about the articles and start training some models to see how they fare.

Dataset

Based on yesterday's analysis, I generated a dataset consisting of 7,600 articles with unanimous importance ratings from at least two WikiProjects. The dataset contains all 1,900 Top-importance articles, as well as 1,900 randomly sampled articles from each of the other three importance classes. Once all 7,600 articles were gathered, I randomly sampled 400 articles from each importance class and assigned them to a test set, marked by a binary column in the dataset.
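
As a rough illustration, the sampling and test-set assignment could be done along the following lines in R; the data frame pool and its rating column are hypothetical stand-ins, not the code that was actually run:

 set.seed(42)  # hypothetical seed, for reproducibility
  
 ## Keep all Top-importance articles, sample 1,900 from each other class
 sampled <- do.call(rbind, lapply(split(pool, pool$rating), function(grp) {
   if (grp$rating[1] == "Top") grp else grp[sample(nrow(grp), 1900), ]
 }))
  
 ## Flag 400 random articles per class as the test set (binary column)
 sampled$is_test <- 0
 for (cls in unique(sampled$rating)) {
   idx <- which(sampled$rating == cls)
   sampled$is_test[sample(idx, 400)] <- 1
 }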

I then wrote a Python script that I can run on Tool Labs to extend the dataset with our two initial measurements: number of inlinks and number of article views. I decided to average the number of views over 28 days in order to get a reasonable estimate without using too much data; future work should perhaps look at patterns in article views for these types of articles. The inlink count is restricted to links from other articles, again as a reasonable first step. After testing the code, I ran it on the full dataset, and the result is in our GitHub repository.

Primary analysis

In order to gain a basic understanding of our data, we first look at the distribution of the variables. We are particularly interested in how they differ between the various importance classes. The output from R's summary function is a good first step, here separated by importance rating:

Rating  Measure       Minimum  First quartile  Median  Mean     Third quartile  Maximum
Top     Num. inlinks     0.00          116.80  321.50  2209.00         1014.00  500100.00
Top     Num. views       2.62           53.25  250.70   885.10          937.40   50470.00
High    Num. inlinks     0.00           55.00  161.00   390.70          384.00   25530.00
High    Num. views       1.62           29.23  113.00   444.00          402.90   44510.00
Mid     Num. inlinks     0.00           16.00   58.00   167.30          150.20   52920.00
Mid     Num. views       1.53            5.93   19.20    98.91           66.90    4788.00
Low     Num. inlinks     0.00            4.00   13.00    51.51           54.25    2126.00
Low     Num. views       1.08            2.67    4.82    40.21           13.36   25230.00
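
Assuming the dataset is loaded into a data frame d with hypothetical column names num_inlinks, num_views, and rating, per-class summaries like those in the table can be produced with:

 ## Run summary() separately for each importance rating
 by(d[, c("num_inlinks", "num_views")], d$rating, summary)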

There are three important patterns to be found in the numbers in this table:

  1. There is some ability to distinguish between the classes. We can see this by examining the first quartile, median, and third quartile of each class. Top-importance rated articles tend to have more inlinks and views than High-importance articles, which have more than Mid-importance articles, which have more than Low-importance articles. This might be easier to see once we visualize the data using density plots.
  2. All importance classes have at least one article with no links pointing to it from other articles (i.e. an orphan article). Upon first seeing this, one might suspect a bug in the code that gathered the data. This turns out not to be the case: all 44 of these articles (listed below) actually have no links to them from other articles (a quick check is sketched after this list). Because of this pattern, our machine learners might misclassify these articles. It is also worth noting that if we used an algorithmic approach to measuring importance (e.g. PageRank), these articles would not rank highly, because they cannot accrue rank from other articles. Eight of the articles are Top-, High-, or Mid-importance, indicating a discrepancy between algorithmic and human assessments of importance for these.
  3. The mean and median are far apart for all classes and measures. When we also see that the maximum is orders of magnitude larger than the third quartile, it is clear that these are skewed distributions. We will therefore most likely transform them before we proceed with our machine learning.
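
A quick sketch of the orphan check from point 2, using the same hypothetical data frame d:

 ## Articles with zero inlinks, broken down by importance rating
 orphans <- d[d$num_inlinks == 0, ]
 nrow(orphans)          # 44 in our dataset; titles listed at the end of this log
 table(orphans$rating)  # eight fall in the Top/High/Mid classes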

Graphical analysis

[Figure: Density plot of number of inlinks]
[Figure: Density plot of number of views]
[Figure: Scatterplot of number of inlinks against number of views, faceted by rating]

We generate three graphs based on the previous numerical analysis. Given the skewed distribution of both number of inlinks and number of article views as described previously, we apply a log-10 transformation (log10(1 + x)) to these in order to reduce the skewness.
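
A sketch of the transformation and plots using ggplot2, again with the hypothetical data frame d rather than the exact plotting code:

 library(ggplot2)
  
 ## Log-transform both measures to reduce skewness
 d$log_inlinks <- log10(1 + d$num_inlinks)
 d$log_views   <- log10(1 + d$num_views)
  
 ## Density plots per importance rating
 ggplot(d, aes(x = log_inlinks, colour = rating)) + geom_density()
 ggplot(d, aes(x = log_views, colour = rating)) + geom_density()
  
 ## Scatterplot of inlinks against views, faceted by rating
 ggplot(d, aes(x = log_inlinks, y = log_views)) +
   geom_point(alpha = 0.3) +
   facet_wrap(~ rating)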

The plots on the right visualize the data behind the table seen earlier and provide more detail. From the density plot of number of inlinks it appears clear that Low-importance articles are rather distinct from the other three classes, as they are much more likely to have a low number of inlinks. Among the other three classes there is considerable overlap, perhaps particularly between Top- and High-importance articles, although we also see that many Top-importance articles have far more inlinks than High-importance articles.

The density plot of number of views further substantiates that Low-importance articles have different characteristics than the other classes, particularly Top- and High-importance articles. There is some overlap between Low- and Mid-importance articles, as there is a substantial number of Mid-importance articles with fewer than 10 daily views on average. We can also see overlap between Top- and High-importance articles, suggesting that these are not particularly distinct when it comes to the number of views they attract. In summary, this plot does indicate that importance largely follows popularity.

Lastly, we have a faceted scatterplot of inlinks and views. Here we can see that Low-importance articles to some degree cluster towards the bottom left. Mid-importance articles are similar, although their popularity extends further upwards. Top- and High-importance articles tend to have a much larger span in popularity, creating a drawn-out shape compared to the other two classes.

Classifier training and evaluation

Based on the results of our feasibility study, we are interested in studying the performance of three classifiers: Random Forest, SVM, and GBM. As mentioned previously, our dataset is set up for classifier evaluation: 1,600 of the articles are labelled as a test set, leaving 6,000 articles for training. We typically use cross-validation on the training set to evaluate classifier performance, for example when tuning model parameters.

Random Forest

We train a Random Forest classifier and use 10-fold cross-validation on the training set to tune the forest size and terminating node size parameters. The former adjusts the number of decision trees in the forest, and the latter determines the size of the trees by controlling how many items are required to be present in a terminating node (larger settings create shallower trees). Because the number of articles in each class is the same, we can use accuracy (proportion of correctly predicted labels) to measure performance. We find that a forest with 1,001 trees and a terminating node size of 512 has the best performance.
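
With the tuned parameters, training and evaluation might look like the following minimal sketch; the data frame, its column names, and the use of the log-transformed measures as features are assumptions, not the exact code that was run:

 library(randomForest)
  
 train <- d[d$is_test == 0, ]
 test  <- d[d$is_test == 1, ]
  
 ## rating must be a factor for classification; parameters as tuned above
 rf <- randomForest(rating ~ log_inlinks + log_views, data = train,
                    ntree = 1001, nodesize = 512)
  
 ## Confusion matrix (rows: true labels, columns: predicted) and accuracy
 pred <- predict(rf, test)
 conf <- table(true = test$rating, predicted = pred)
 sum(diag(conf)) / sum(conf)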

Using the Random Forest model to predict the test set labels gives us the following confusion matrix where rows are true labels and columns are predicted labels:

       Top  High  Mid  Low
Top    182   118   82   18
High    82   170  119   29
Mid     15    91  167  127
Low      2    13  102  283

Overall accuracy of this model is 50.12%. It performs strongly on Low-importance articles, correctly predicting 70.75% of them. Accuracy for the other three classes is in the range of 41.75–45.50%. As discussed previously, Low-importance articles appear more straightforward to classify using our two measures, while the other three classes are more difficult to distinguish from each other.

SVM

Based on our feasibility study, a radial kernel provides the best performance, so we use it here as well. Using R's tuning functionality, we investigate how the cost and gamma parameters should be set, and arrive at cost=16 and gamma=1.4. We then use the SVM to predict the labels in the test set and arrive at the following confusion matrix (rows and columns as in the Random Forest table):

       Top  High  Mid  Low
Top    206   110   66   18
High   107   177   84   32
Mid     32    90  140  138
Low      4    17   93  286

The overall accuracy of the SVM is 50.56%, about half a percentage point above the Random Forest classifier. It performs better on Top-importance articles (51.5% compared to the Random Forest's 45.5%), while delivering comparable performance on High- and Low-importance articles. It does not fare as well on Mid-importance articles, correctly labelling 35% of them compared to the Random Forest's 41.75%.
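
The tuning step can be sketched with R's e1071 package, reusing the train and test frames from the Random Forest sketch; the parameter grids are illustrative, chosen only to contain the reported optimum of cost=16 and gamma=1.4:

 library(e1071)
  
 ## Grid search over cost and gamma with a radial kernel
 tuned <- tune.svm(rating ~ log_inlinks + log_views, data = train,
                   kernel = "radial",
                   cost = 2^(0:6), gamma = seq(0.2, 2, by = 0.2))
 svm_fit <- tuned$best.model
 conf <- table(true = test$rating, predicted = predict(svm_fit, test))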

GBM

We use 10-fold cross-validation on the training set to identify how many trees the GBM should contain and how many articles should be required in a node (similar to the terminating node size in the Random Forest). Using a minimum of 16 observations in a node, we find that the GBM should contain 1,024 trees. We then use it to predict the labels in the test set, as for the other two models, and arrive at the following confusion matrix:

       Top  High  Mid  Low
Top    199   124   60   17
High   107   185   80   28
Mid     34   105  138  123
Low     10    21  109  260

The overall accuracy of the GBM is 48.88%, slightly more than one and a half percentage points lower than the SVM. From the confusion matrix we can see that it is slightly better at predicting High-importance articles (46.25% compared to 42.5% for the Random Forest and 44.25% for the SVM), but does not perform as well on the other classes. It also has a lower tendency than the other models to classify Top- and High-importance articles as Mid- or Low-importance.
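
A corresponding sketch with the gbm package, reusing the earlier train/test split; n.minobsinnode and the cross-validation setup follow the text, while the upper bound on the number of trees is an assumption:

 library(gbm)
  
 ## Multinomial GBM; 10-fold CV picks the number of trees (~1,024 here)
 gbm_fit <- gbm(rating ~ log_inlinks + log_views, data = train,
                distribution = "multinomial",
                n.trees = 2000, n.minobsinnode = 16, cv.folds = 10)
 best_iter <- gbm.perf(gbm_fit, method = "cv")
  
 ## Predicted class = class with the highest predicted probability
 probs <- predict(gbm_fit, test, n.trees = best_iter, type = "response")[, , 1]
 pred  <- colnames(probs)[apply(probs, 1, which.max)]
 conf  <- table(true = test$rating, predicted = pred)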

Conclusions

Based on the analysis of the data and the classifier performance so far, it seems clear that the number of inlinks and article views alone only get us so far. Overall we are about twice as good as a random draw (which would average 25% accuracy across four balanced classes), and the only class we can predict reasonably well is Low-importance. It therefore seems evident that we should look for additional data sources.

List of articles with no inlinks

There are 44 articles in the dataset that have no inlinks from other articles (i.e. they are orphans). We have verified that this is not an anomaly by manually checking "What links here" for each article. The 44 articles are: