Research talk:Automated classification of article importance/Work log/2017-05-18

Thursday, May 18, 2017

Today I'll follow up on communication with WikiProject NFL; one of their members was kind enough to review all our Top-importance predictions. After that, I'll start writing the view rate pipeline.

WikiProject NFL

Wikidata network of super-/subclass relationships between WikiProject National Football League articles with clusters labelled.

We got a great set of feedback from one of the members of WP:NFL, who went through all our Top-importance candidates for re-rating. Two of them got their rating updated, and a third turned out to be outside the scope of WP:NFL. That's 3 out of 24 articles, or a 12.5% update rate.

The feedback confirms a suspicion I had when I first looked at WP:NFL: they have rules for importance ratings based on a player's career, e.g. whether he's a Hall of Famer, only played a single game, etc. They also have specific ratings for certain types of articles, e.g. season articles are all High-importance (I found and fixed one of them based on the graph above). Some of this is visible in the network graph above, where many of the clusters are uniformly coloured.

When I was building the WP:NFL model I commented on how well it performed, so I decided to dig a bit further. First, the summary command in R gives the relative influence of the predictors in the GBM, and it is clear that this model is all about the inlinks: the percentile of article views has the least influence. Plotting the inlink and view percentiles for WP:NFL makes this obvious as well, as seen below.
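As a rough sketch of that step (the data frame and column names are assumptions, not the actual pipeline):

```r
# Hypothetical example: fit a GBM on per-article features and
# inspect predictor influence. 'nfl' is assumed to hold one row
# per WP:NFL article, with its importance rating and percentile
# features for inlinks and article views.
library(gbm)

model <- gbm(rating ~ inlink_percentile + views_percentile,
             data = nfl,
             distribution = "multinomial",
             n.trees = 500)

# summary() on a gbm object returns (and bar-plots) the relative
# influence of each predictor, scaled to sum to 100.
summary(model)
```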

View and link percentiles for WP:NFL, faceted by importance rating.
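A faceted plot like the one above could be produced along these lines (again with assumed column names):

```r
# One panel per importance rating: inlink percentile plotted
# against view percentile for each WP:NFL article.
library(ggplot2)

ggplot(nfl, aes(x = views_percentile, y = inlink_percentile)) +
  geom_point(alpha = 0.3) +
  facet_wrap(~ rating) +
  labs(x = "Article view percentile", y = "Inlink percentile")
```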

I looked through the articles we proposed should be Top-importance, and many of them sit around the 90th percentile on at least one of the two measures, which understandably puts them on the boundary between a Top- and a High-importance rating. Given WP:NFL's rules for judging the importance of players, and players accounting for most of their articles, it is not surprising that our classifier gets some of them wrong.

The question is what practical implications these findings have. First, do we need to develop a language for describing these types of constraints? If we do, it would allow us to encode rules such as humans being Low-importance in WP:MED and seasons being High-importance in WP:NFL, thereby sidestepping the model entirely for those articles. We might want to restrict such rules to "instance of" relationships, though, as they could otherwise become difficult to engineer (in other words, I don't see "played fewer than X games" as a meaningful constraint). A sketch of what such rules could look like follows below.

Secondly, we might want a good interface for feedback on these predictions. Having a user review a rating and then making sure it is not suggested again is useful (that can, for example, take care of the issue of mispredicting football seasons). Reporting a prediction as "incorrect" is also useful, as that provides us with training data.
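As a rough sketch of such a constraint language (the table layout, the function, and the season QID below are illustrative assumptions, not an existing interface):

```r
# Each rule pins a fixed rating for articles whose Wikidata item
# is an "instance of" a given class within a given project.
# Q5 is the Wikidata item for "human"; Q00000 is a placeholder
# standing in for the QID of the relevant season class.
rules <- data.frame(
  project      = c("WP:MED", "WP:NFL"),
  instance_of  = c("Q5",     "Q00000"),
  fixed_rating = c("Low",    "High"),
  stringsAsFactors = FALSE
)

# Apply the rules after classification: a matching rule overrides
# the model's predicted rating, everything else passes through.
apply_rules <- function(articles, rules) {
  merged <- merge(articles, rules,
                  by = c("project", "instance_of"), all.x = TRUE)
  hit <- !is.na(merged$fixed_rating)
  merged$predicted[hit] <- merged$fixed_rating[hit]
  merged
}
```

Keeping the rules to a single lookup on "instance of" matches the restriction suggested above, and avoids having to express career-based conditions in the language.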