Research talk:Automated classification of article importance/Work log/2017-03-17

From Meta, a Wikimedia project coordination wiki

Friday, March 17, 2017

Today I'll work on wrapping up our analysis of the misclassified WPMED articles, and hopefully get a conversation started with the project about them. Second, I'll work on our sources of signal, particularly seeking to add information from our literature review.

WPMED Articles

Looking through the lists from yesterday, I think there are three main lists of articles that the project might be interested in. Skimming them also suggests the articles can be further categorised, as discussed below.

Mid-importance articles predicted to be Top-importance:

These seem to fall into three categories:

  1. Medication (e.g. Ibuprofen)
  2. Illnesses/medical conditions (e.g. Motion sickness)
  3. Popular general topics within medicine (e.g. Psychotherapy)

Low-importance articles predicted to be Top-importance:

These seem to fall into several different categories, where some of the larger ones might be:

  1. People (e.g. Robert Koch)
  2. Companies, services, and legislation (e.g. Health insurance in the United States)
  3. Illnesses/medical conditions (e.g. Scar)
  4. Medication and chemical compounds (e.g. Tocopherol)

There might also be a fairly large "miscellaneous" category in this list.

Top/High-importance articles predicted to be Low-importance:

It's not obvious from skimming this list how it should be categorised. Some articles are about people (e.g. Chukwuedu Nwokolo), some cover topics whose importance has waned (e.g. Google Flu Trends), and some appear to be subtopics of a more general topic that might itself be of high importance (e.g. Hepatitis C and HIV coinfection).

Combining High- and Top-importance

Writing up some notes on a test I ran yesterday, where I combined High- and Top-importance articles into a single category. Given the classification results we saw for WPMED, particularly the confusion matrix, those two classes appear to be close and are often confused for each other. Combining them also simplifies training and testing, since we have almost 1,100 articles in the new "High-importance" class.

I made a new dataset with Top- and High-importance articles combined into a new "High-importance" class, then split it into a test dataset of 300 articles (100 drawn randomly from each class) and a training dataset with an equal number of articles from each class (approx. 3,000 in total). Based on the results from the earlier classification, I tuned and tested one SVM classifier using the number of global inlinks and views, and one using all three predictors. In both cases the results are not particularly good, with overall accuracy around 66%. While that is higher than we saw previously, we have also gone from four classes to three. The new "High-importance" class is fairly easy to predict (77% accuracy), whereas the other two are still confused with each other. So while this approach improves performance to some degree, the improvement is not large and does not suggest these classes should be combined.
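As a sketch, the merge-and-rebalance experiment above might look like the following in Python with scikit-learn (the actual analysis may have used different tooling; the column names and the toy stand-in data are hypothetical, not the real WPMED data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(42)

# Toy stand-in data so the sketch runs end to end; not the real WPMED data.
articles = pd.DataFrame({
    "importance": rng.choice(["Top", "High", "Mid", "Low"], size=1600),
    "views": rng.lognormal(5, 2, size=1600),
    "inlinks": rng.lognormal(3, 1.5, size=1600),
})

# Merge Top- into High-importance to form the new three-class scheme.
articles["importance"] = articles["importance"].replace({"Top": "High"})

# Hold out 100 test articles per class; balance the rest for training.
test = articles.groupby("importance", group_keys=False).sample(n=100, random_state=0)
train = articles.drop(test.index)
n_min = train["importance"].value_counts().min()
train = train.groupby("importance", group_keys=False).sample(n=n_min, random_state=0)

# Tune an RBF-kernel SVM over C and gamma on log-transformed predictors.
features = ["views", "inlinks"]
svm = make_pipeline(StandardScaler(), SVC())
grid = GridSearchCV(svm, {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(np.log1p(train[features]), train["importance"])
print(round(grid.score(np.log1p(test[features]), test["importance"]), 3))
```

With random labels as here, accuracy hovers around chance (one third); on the real predictors the pattern described above should emerge.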

Dampening local inlink count

While reading through our literature review, I noticed that Kamps and Koolen had used both local and global indegree in their 2008 paper, which is also what we have been doing in our WPMED models. They modified their local indegree by applying an approach similar to TF/IDF. Their indegree prior is, in essence, one plus the ratio of local to global indegree:

  LocalPrior(d) = 1 + indegree_local(d) / indegree_global(d)

By itself, local indegree is bound to be correlated with global indegree (if we log-transform both, the correlation is 0.89). This reminds me of my investigation into slicing and dicing the view statistics, where correlation always came into play. The dampened measure, however, is not strongly correlated with either the global number of inlinks or the project-specific number of inlinks. Instead, it captures what proportion of an article's inlinks come from within the project: if the measure is close to 1, the article has a low proportion of inlinks from within the project; if it is close to 2, most of its inlinks come from within the project.
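As a minimal sketch (with made-up counts and hypothetical column names), the dampened measure can be computed directly from the two inlink counts:

```python
import pandas as pd

# Made-up inlink counts for three hypothetical articles.
links = pd.DataFrame({
    "article": ["A", "B", "C"],
    "project_inlinks": [450, 120, 30],
    "global_inlinks": [900, 600, 3000],
})

# 1 + proportion of inlinks coming from within the project:
# close to 1 -> few project-internal inlinks, close to 2 -> mostly internal.
links["dampened"] = 1 + links["project_inlinks"] / links["global_inlinks"]
print(links["dampened"].tolist())  # [1.5, 1.2, 1.01]
```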

I trained an SVM on the 1,000-item WPMED training dataset using the number of article views, the number of inlinks from all of Wikipedia, and this measure of project-specific inlink proportion. I first tuned the SVM using the same approach as previously, then tested it on the test set. It scores an overall accuracy of 67.5% (compared to the other SVM's 62.5%), with the following confusion matrix (rows are true classes, columns are predictions):

        Top  High  Mid  Low
Top      32     5    3    0
High     10    21    5    4
Mid       2     7   24    7
Low       0     2    7   31
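The accuracy figures quoted above can be double-checked directly from the matrix:

```python
import numpy as np

# Confusion matrix from above: rows are true classes, columns are predictions.
labels = ["Top", "High", "Mid", "Low"]
cm = np.array([
    [32,  5,  3,  0],
    [10, 21,  5,  4],
    [ 2,  7, 24,  7],
    [ 0,  2,  7, 31],
])

overall = np.trace(cm) / cm.sum()         # (32 + 21 + 24 + 31) / 160
per_class = np.diag(cm) / cm.sum(axis=1)  # per-class recall
print(overall)  # 0.675
print(dict(zip(labels, per_class.round(3).tolist())))
```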

Compared to the previously trained SVM, this one does slightly better on High- and Mid-importance articles, and much better on both Top- and Low-importance articles, getting three more articles correct in each. Besides classifying more articles correctly, it also reduces the overlap between the extreme classes: no Top-importance article is predicted to be Low-importance, and vice versa.

In Kamps and Koolen's paper, they point to how this measure affects relevance in search queries. If their results transfer to our domain, this measure should be better at capturing local relevance. Whether that is driven by articles with a high or a low proportion of WPMED-specific inlinks remains to be seen; I will run the classifier on the entire WPMED dataset and see what patterns emerge.