Research:Language-Agnostic Topic Classification/Countries

From Meta, a Wikimedia project coordination wiki

The ORES topic taxonomy contains several geographic topics such as Asia, Central Asia, East Asia, etc. While these topics are useful for high-level statistics about article or pageview distributions, they do not support other use-cases such as helping editors find content relevant to their region (which is generally a country or an even smaller subdivision) or more fine-grained analyses of content gaps. To support all of these use-cases, a classification model is being developed that assigns countries to articles (any aggregation to larger regions can then be easily applied).
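As a rough illustration of why country-level labels are sufficient for the coarser geographic topics, the aggregation step can be a simple lookup. The sketch below is only an assumption about how that might look: the mapping `COUNTRY_TO_REGIONS` and the function `aggregate_to_regions` are hypothetical names, and the three entries are hand-written examples rather than the project's actual table.

```python
from typing import Dict, Iterable, Set

# Hand-written, illustrative mapping (hypothetical; not the project's lookup table).
COUNTRY_TO_REGIONS: Dict[str, Set[str]] = {
    "Kazakhstan": {"Central Asia", "Asia"},
    "Japan": {"East Asia", "Asia"},
    "Kenya": {"Eastern Africa", "Africa"},
}


def aggregate_to_regions(countries: Iterable[str]) -> Set[str]:
    """Return every larger region implied by a set of predicted countries."""
    regions: Set[str] = set()
    for country in countries:
        # Countries missing from the mapping simply contribute no regions.
        regions |= COUNTRY_TO_REGIONS.get(country, set())
    return regions


print(sorted(aggregate_to_regions(["Japan", "Kazakhstan"])))
```

The reverse direction is not possible: a label of "Asia" cannot be split back into countries, which is why the finer-grained labels are predicted first.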

Initial Prototype

Given the success of the language-agnostic topic classification model based on an article's outlinks, that same approach was tried first, replacing the 64 topics with one or more of 193 countries. The country list was based on entities identified as sovereign states in Wikidata, with a small amount of manual cleaning. Ground-truth data was based purely on an article's associated Wikidata item and was a union of coordinate location (geolocated to the same set of 193 countries by checking simple containment in each country's borders), place of birth, country, country of citizenship, and country of origin. Note that only direct matches were used, so a place of birth property that references a city (e.g., Cambridge for Douglas Adams) would not be mapped to the United Kingdom (though in Douglas Adams' case, the country of citizenship property serves to include the United Kingdom).
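The direct-match rule described above can be sketched as follows. This is a minimal illustration, not the project's code: the function name `groundtruth_countries`, the simplified `claims` dictionary (property ID mapped to a list of item IDs), and the optional `locate` callable standing in for the border-containment check are all assumptions. The property and item IDs are the standard Wikidata identifiers for the properties and entities named in the text.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# Wikidata property IDs for the entity-valued properties named in the text.
PLACE_OF_BIRTH = "P19"
COUNTRY = "P17"
COUNTRY_OF_CITIZENSHIP = "P27"
COUNTRY_OF_ORIGIN = "P495"
ENTITY_PROPERTIES = (PLACE_OF_BIRTH, COUNTRY, COUNTRY_OF_CITIZENSHIP, COUNTRY_OF_ORIGIN)


def groundtruth_countries(
    claims: Dict[str, List[str]],
    countries: Set[str],
    coordinates: Optional[Tuple[float, float]] = None,
    locate: Optional[Callable[[float, float], Optional[str]]] = None,
) -> Set[str]:
    """Union of countries that an item's properties point to directly."""
    labels: Set[str] = set()
    for prop in ENTITY_PROPERTIES:
        for qid in claims.get(prop, []):
            # Direct matches only: a city value is ignored, not resolved to its country.
            if qid in countries:
                labels.add(qid)
    if coordinates is not None and locate is not None:
        # `locate` is a hypothetical stand-in for the point-in-border containment check.
        hit = locate(*coordinates)
        if hit in countries:
            labels.add(hit)
    return labels


# Q145 = United Kingdom, Q30 = United States, Q350 = Cambridge, Q42 = Douglas Adams.
countries = {"Q145", "Q30"}
douglas_adams = {PLACE_OF_BIRTH: ["Q350"], COUNTRY_OF_CITIZENSHIP: ["Q145"]}
print(groundtruth_countries(douglas_adams, countries))
```

Here the place of birth (Cambridge) contributes nothing because it is not itself in the country list, while the country of citizenship supplies the United Kingdom label.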

This model performed well statistically -- i.e. relatively high precision and recall for most countries -- but had a number of drawbacks that suggested it could still be improved substantially. Most notably, the model struggled to handle articles that relate to many geographies -- e.g., World War II -- and would spread its predictions across many countries, each with very low confidence, so that generally no country passed the prediction threshold. This could potentially be handled by lowering the prediction threshold but, in practice, I think this would result in other issues related to false positives. I think graph-based approaches might show more promise here than I would have expected for the broader topic taxonomy. For example, many topics do not show simple homophily-type relationships -- e.g., an article that links to many articles about people is not itself clearly an article about a person. I would expect clearer relationships with geography though -- i.e. an article that links repeatedly to content about a particular region is almost certainly relevant to that region -- and I would expect graph approaches to capture these relationships effectively where the fastText classifier (which is a simple classifier over a document embedding) would struggle.
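The threshold trade-off can be made concrete with a small sketch. The scores below are invented purely for illustration (they are not model outputs), and `predict_labels` is a hypothetical helper that keeps every country whose score meets the threshold.

```python
from typing import Dict, List


def predict_labels(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Keep every country whose score is at or above the threshold."""
    return sorted(country for country, score in scores.items() if score >= threshold)


# Invented scores: one article focused on a single country, one spanning many.
focused = {"Japan": 0.93, "China": 0.04, "Germany": 0.01}
diffuse = {"Germany": 0.22, "Japan": 0.19, "United Kingdom": 0.18, "Peru": 0.12}

print(predict_labels(focused))        # only the single confident country passes
print(predict_labels(diffuse))        # nothing passes the default threshold
print(predict_labels(diffuse, 0.1))   # a low threshold admits every low-scoring country
```

With a threshold low enough to recover the relevant countries for the diffuse article, any other country that happens to receive a similar low score is admitted as well, which is the false-positive concern raised above.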

What is a Country?

While many regions are clearly countries and have widely-recognized borders and sovereignty, other regions that we might think of as countries are disputed or are officially part of a larger region. The point of this classifier is not political but to support editors who wish to find and edit content relevant to their region as well as analyses of geographic trends. More details can be found in the README of the code repository,[1] but the goal was to build a maximally-inclusive list of regions that could be easily operationalized from Wikidata and pageview data and that would support most use-cases while still largely capturing countries and not subregions such as individual states in the United States.

Current Model

The current approach is based on network label propagation. More to come.
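Since the details of the current model are not yet described here, the following is only a generic sketch of label propagation over an article link graph, not the project's implementation. It assumes that labels flow from link targets back to the linking article, that articles with known labels are held fixed, and that an unlabeled article's score for a country is the mean score of the articles it links to. The names `propagate_labels`, `links`, and `seeds` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def propagate_labels(
    links: Dict[str, List[str]],
    seeds: Dict[str, Set[str]],
    iterations: int = 10,
) -> Dict[str, Dict[str, float]]:
    """Spread country scores from labeled articles to the articles that link to them."""
    nodes = set(links) | {target for targets in links.values() for target in targets}
    # Labeled (seed) articles start with full confidence in their known countries.
    scores: Dict[str, Dict[str, float]] = {
        node: {country: 1.0 for country in labels} for node, labels in seeds.items()
    }
    for _ in range(iterations):
        updated: Dict[str, Dict[str, float]] = {}
        for node in nodes:
            if node in seeds:
                updated[node] = scores[node]  # seed labels are clamped
                continue
            targets = links.get(node, [])
            totals: Dict[str, float] = defaultdict(float)
            for target in targets:
                for country, score in scores.get(target, {}).items():
                    totals[country] += score
            # Average over all outlinks, so unlabeled targets dilute the score.
            updated[node] = {c: total / len(targets) for c, total in totals.items()}
        scores = updated
    return scores


# Toy graph: article A links to two articles about Kenya and one about Uganda.
links = {"A": ["B", "C", "D"]}
seeds = {"B": {"Kenya"}, "C": {"Kenya"}, "D": {"Uganda"}}
result = propagate_labels(links, seeds)
print(result["A"])
```

In this toy graph, article A ends up with a score of 2/3 for Kenya and 1/3 for Uganda, reflecting the share of its outlinks that point to content about each country.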