Research talk:Automated classification of article importance/Work log/2017-03-09

Thursday, March 9, 2017

Today I'll start by following up on yesterday's classifier training by studying some of the misclassified articles. We are most interested in Top-importance articles that the classifier labelled as Mid- or Low-importance, and conversely Low-importance articles that it labelled as Top- or High-importance.

Incorrectly labelled articles

I split the test dataset into two groups, one for Top-importance articles and one for Low-importance articles. In both cases I only included correctly labelled instances as well as those incorrectly labelled towards the extreme (e.g. Low- and Mid-importance for Top-importance articles). To establish whether the incorrectly labelled articles appear distinctly different from the correctly labelled articles, I generated density plots for number of views and inlinks for all four cases:

[Figure grid: density plots of number of inlinks and number of views (columns) for Low-importance and Top-importance articles (rows).]
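
As an aside on how the two groups were formed, the filtering can be sketched as follows. This is a minimal illustration under assumptions, not the code used for this analysis: the record fields (`true`, `pred`) and the sample rows are hypothetical.

```python
# Hypothetical records: each holds an article's true importance rating and
# the classifier's prediction (field names are assumptions for illustration).
SAMPLE_TEST_SET = [
    {"title": "A", "true": "Top", "pred": "Top"},   # correct, kept
    {"title": "B", "true": "Top", "pred": "High"},  # near miss, dropped
    {"title": "C", "true": "Top", "pred": "Low"},   # extreme error, kept
    {"title": "D", "true": "Low", "pred": "Low"},   # correct, kept
    {"title": "E", "true": "Low", "pred": "Mid"},   # near miss, dropped
    {"title": "F", "true": "Low", "pred": "Top"},   # extreme error, kept
    {"title": "G", "true": "Mid", "pred": "Mid"},   # belongs to neither group
]

# For each group: the correct prediction plus errors towards the far end
# of the scale, as described in the text.
KEPT_PREDICTIONS = {
    "Top": {"Top", "Mid", "Low"},
    "Low": {"Low", "High", "Top"},
}


def split_extremes(records):
    """Return (top_group, low_group) with only the instances of interest."""
    groups = {rating: [] for rating in KEPT_PREDICTIONS}
    for rec in records:
        kept = KEPT_PREDICTIONS.get(rec["true"])
        if kept is not None and rec["pred"] in kept:
            groups[rec["true"]].append(rec)
    return groups["Top"], groups["Low"]


top_group, low_group = split_extremes(SAMPLE_TEST_SET)
print([rec["title"] for rec in top_group])
print([rec["title"] for rec in low_group])
```

Each group can then be passed to whatever plotting tool is at hand to compare the view and inlink distributions of correct and incorrect predictions.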

In both cases it appears clear that the incorrectly labelled articles are distinctly different from the correctly labelled ones. If we look at incorrectly labelled Top-importance articles, it seems that almost all of them have fewer than 100 views. There are 83 such articles in total. One of them is Paul McCartney's musical career, which we identified as an orphan article yesterday (it has no links pointing to it from other articles). Several list-type articles appear as well, for instance List of state highways in California. As we can see with that list, it is only claimed by a single WikiProject but is categorized into three different importance-based categories (which our approach interprets as three different projects). This might suggest that we can increase our precision by switching to parsing talk pages for all articles that appear to be claimed by multiple projects and have unanimous ratings.

For incorrectly labelled Low-importance articles, the cutoff seems to be about 31.6 views or inlinks (log_10(31.6) ≈ 1.5). There are 18 such articles, for instance Chowchilla, a city that includes two women's penitentiaries and is planned to be the place where the California High-Speed Rail line splits. The latter might explain some of the article's 120 daily average views. Another example is X2, which has 535 inlinks and averages 3,867 daily views. Both figures appear high for an article labelled as Low-importance. One might think that the number of views is due to a short-term spike, which it is, but before the spike the article averaged around 1,500 daily views, which is an order of magnitude higher than most other articles in our set. Labelling this movie as Low-importance therefore appears questionable.
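
The cutoff arithmetic is easy to check. The snippet below is a small sketch, not part of the original analysis: it converts the log10 value of 1.5 back to the raw scale and tests the figures quoted above against it. The helper name `exceeds_cutoff` is made up for illustration.

```python
import math

LOG10_CUTOFF = 1.5  # the cutoff on the log10 scale quoted in the text


def exceeds_cutoff(value, log10_cutoff=LOG10_CUTOFF):
    """Return True if a count lies above the cutoff on a log10 scale."""
    # log10 is undefined for zero, so non-positive counts never exceed it.
    return value > 0 and math.log10(value) > log10_cutoff


# 10 ** 1.5 converts the cutoff back to a raw count.
print(round(10 ** LOG10_CUTOFF, 1))  # 31.6

# Figures quoted in the text for the two example articles.
examples = [
    ("Chowchilla daily views", 120),
    ("X2 inlinks", 535),
    ("X2 daily views", 3867),
]
for name, value in examples:
    print(name, round(math.log10(value), 2), exceeds_cutoff(value))
```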

Prolific WikiProject participants

We are interested in discussing aspects of importance with Wikipedia contributors, and think that those who participate in WikiProjects are likely candidates. I wrote a SQL query to find the 100 editors with the most edits to WikiProject pages in the Wikipedia and Wikipedia talk namespaces since January 1, excluding bots (this assumes all bots are in the "bot" user group, which they should be). Examining some of the top-ranked editors, it seems clear that they have extensive experience and engagement with Wikipedia.
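
The query itself is not reproduced in this log. As a rough sketch of its shape, the example below runs a comparable query against an in-memory SQLite database. The table and column names are a simplified, hypothetical stand-in for the MediaWiki schema, the sample rows are invented, and the year in the cutoff timestamp is assumed from the date of this log. Namespaces 4 and 5 are Wikipedia and Wikipedia talk.

```python
import sqlite3

# Simplified, hypothetical schema; the real MediaWiki tables differ.
QUERY = """
SELECT r.user_name, COUNT(*) AS num_edits
FROM revision r
JOIN page p ON p.page_id = r.page_id
WHERE p.namespace IN (4, 5)            -- Wikipedia and Wikipedia talk
  AND p.title LIKE 'WikiProject%'
  AND r.timestamp >= :since
  AND r.user_id NOT IN (
      SELECT user_id FROM user_groups WHERE user_group = 'bot')
GROUP BY r.user_name
ORDER BY num_edits DESC, r.user_name
LIMIT :limit
"""


def top_wikiproject_editors(conn, since="20170101000000", limit=100):
    """Return (user_name, num_edits) rows, most active editors first."""
    return conn.execute(QUERY, {"since": since, "limit": limit}).fetchall()


def build_demo_db():
    """Create an in-memory database with a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE page (page_id INTEGER PRIMARY KEY,
                           namespace INTEGER, title TEXT);
        CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, page_id INTEGER,
                               user_id INTEGER, user_name TEXT,
                               timestamp TEXT);
        CREATE TABLE user_groups (user_id INTEGER, user_group TEXT);
    """)
    conn.executemany("INSERT INTO page VALUES (?, ?, ?)", [
        (1, 4, "WikiProject_California"),
        (2, 5, "WikiProject_California"),
        (3, 0, "Chowchilla"),      # article namespace, ignored
        (4, 4, "Village_pump"),    # project namespace, not a WikiProject
    ])
    conn.executemany("INSERT INTO revision VALUES (?, ?, ?, ?, ?)", [
        (1, 1, 1, "Alice", "20170105120000"),
        (2, 2, 1, "Alice", "20170210080000"),
        (3, 1, 1, "Alice", "20161231235959"),       # before the cutoff
        (4, 1, 2, "Bob", "20170301000000"),
        (5, 3, 2, "Bob", "20170302000000"),
        (6, 1, 3, "ExampleBot", "20170101000001"),  # bot, excluded
        (7, 1, 3, "ExampleBot", "20170102000001"),
        (8, 1, 3, "ExampleBot", "20170103000001"),
        (9, 4, 4, "Carol", "20170303000000"),
    ])
    conn.execute("INSERT INTO user_groups VALUES (3, 'bot')")
    return conn


print(top_wikiproject_editors(build_demo_db()))
```

Filtering on the "bot" user group rather than on user names mirrors the assumption stated above that all bots are flagged.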

Correctly counting ratings

As discussed above, some articles were identified as being unanimously rated by multiple WikiProjects (e.g. List of state highways in California), but a closer inspection of the article's talk page revealed that only a single project (WikiProject U.S. Roads) has rated it. The question is then how many other articles are similarly incorrectly labelled.

I wrote a Python script that goes through all 7,600 articles in the dataset and checks their talk pages for templates with importance ratings. Picking up importance ratings correctly turns out to be complicated because talk pages are messy:

  • The WikiProject-related templates are almost always at the top of the page, so my script truncates the talk page if it's larger than 8k. Sometimes this means it misses a template, for instance on Talk:Index of ethics articles, which I've since fixed by moving the templates back up to the top. I did not find other examples of missed templates in my data.
  • Some projects do not rate the article themselves, but instead have work groups or other related WikiProjects that do, or a combination of both. WikiProject Biography, Africa, and South America are all examples of this, as found on Talk:Prince Louis of Battenberg, Talk:Djibouti Armed Forces, and Talk:Dutch colonisation of the Guianas. Given these examples it seems that I should test for the presence of parameters ending with "-priority" or "-importance". I'll also test specifically for these known projects and make note of any others I come across that behave similarly.
  • Some projects use a "priority" parameter, perhaps instead of "importance". One example is Talk:Stephen Lynch, where Wikipedia:WikiProject Musical Theatre has rated it as Mid-priority. This rating does not categorize the article, however, and the WikiProject pages do not appear to use this priority rating in any way (the project seems to be rather stale?). Modifying the script to pick up this rating therefore does not appear to be a priority.
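
The checks listed above can be sketched roughly as follows. This is not the script used here, only a minimal regex-based illustration under assumptions: "8k" is read as 8,192 characters, only innermost templates (those containing no nested templates) are inspected, and the template and parameter names in the sample wikitext are hypothetical.

```python
import re

MAX_CHARS = 8 * 1024  # assumption: "8k" read as 8,192 characters
RATINGS = {"top", "high", "mid", "low"}

# Matches innermost templates only, i.e. {{...}} with no braces inside.
INNER_TEMPLATE_RE = re.compile(r"\{\{([^{}]*)\}\}")


def importance_ratings(talk_wikitext):
    """Return (template, parameter, rating) tuples from the top of a talk page."""
    found = []
    for match in INNER_TEMPLATE_RE.finditer(talk_wikitext[:MAX_CHARS]):
        name, *params = match.group(1).split("|")
        for param in params:
            key, sep, value = param.partition("=")
            key, value = key.strip().lower(), value.strip().lower()
            if not sep or value not in RATINGS:
                continue
            # Plain ratings plus work group style "-importance"/"-priority".
            if key in ("importance", "priority") or key.endswith(
                ("-importance", "-priority")
            ):
                found.append((name.strip(), key, value))
    return found


# Hypothetical talk page header for illustration.
SAMPLE = """{{WikiProjectBannerShell|1=
{{WikiProject Foo|class=B|importance=Top|bar-importance=Mid}}
{{WikiProject Baz|class=C|priority=Low}}
}}
== Some discussion ==
"""

for rating in importance_ratings(SAMPLE):
    print(rating)
```

Returning the parameter name alongside the template name keeps work group ratings distinguishable from a project's own rating, which matters when counting how many projects have actually rated an article.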

After rewriting the script with a new approach, we uncover several other interesting ways these WikiProject templates are used.

  • Some articles are incorrectly categorized as unanimous and incorrectly displayed as unanimous on the talk page, not reflecting the underlying ratings in the actual template. For example, Talk:Geography of Brazil is shown and categorized as Top-importance by both WikiProject Brazil and the "Geography of Brazil" task force, even though the template's geography rating is Mid-importance.
  • There might need to be a distinction between task forces and sub-projects. As we saw for Talk:Geography of Brazil, a task force might rate the article independently of the WikiProject, but the article gets categorized into two importance-related categories. However, there are also templates used by combinations of WikiProjects. See for instance Talk:Parkland College and Talk:Chinese painting, where we can see this happen with WikiProject Canada and WikiProject China respectively. Both have related projects, e.g. WikiProject Saskatchewan and WikiProject Chinese history, that may or may not also rate the importance of the article (in both of these cases, the overarching project rated it). In both cases, the article is categorized into multiple importance-related categories.
  • Some projects might use importance-related parameters that may or may not be reflected on the talk page, and may not be the same as the overall rating. Talk:Evangelicalism shows an example of this, with WikiProject Christianity being one of the projects that use this approach. The project rates it Top-importance, a rating which the theology work group also uses. However, there is a "core topics" work group which has it rated as Low-importance, and that rating is neither displayed nor reflected in the article's categorization.

Reflection from the next day (March 10): The talk page template parameters are no more fit to be defined as ground truth than our previous approach of using the categories. As we see there are a myriad of ways these templates are used, so we would have to implement a complex system to handle every case. We saw that some projects have workgroups or task forces, whereas others have associated projects, meaning we would have to survey all of them and encode their relationships. The importance ratings themselves can also be questioned since some projects have multiple ratings in their templates but do not expose this to readers or categorize the articles accordingly. It could therefore be argued that what the category structure reflects is just as close to ground truth as whatever we might find on the talk pages, rendering the whole process of surveying and encoding the WikiProject template system moot.