Research:Characterizing Wikipedia Reader Behaviour/Demographics and Wikipedia use cases/Topic Analysis

Tracked in Phabricator: task T228319

This report summarizes what I have learned about identifying article topics on Wikipedia and how these different topics relate to readership.

Goals and Requirements

The goal is to map any given page view on Wikipedia to one or more of a relatively limited but descriptive set of topics -- i.e., probably more than 20 but ideally not more than 60 or 70. This has to be done in such a way that it applies across multiple language editions of Wikipedia. Ideally, it will also be relatively efficient, given that for this project, and others, it has to be applied to millions of unique pages. While gathering additional labeled data would have been within reason, the time constraints of this analysis led me to shy away from methods that would have required gathering new labels.

This report discusses what I learned in building a topic model that met these constraints, along with recommendations for how to improve upon it for analyzing reading behavior -- e.g., which topics are perhaps overly broad? which topics are too narrow? which topics are more or less important to certain groups?

Approaches

I started with a Wikidata instance-of-based taxonomy but ultimately abandoned it in favor of a more general Wikidata-based model. Both are described below. Additionally, the ORES drafttopic model was considered but not used because, at the time, there was no good way to extend it to articles that do not exist on English Wikipedia. See phab:T221891 for an example of how drafttopic was applied to English Wikipedia page views. Past work had also used language-specific LDA models with twenty dimensions that were then hand-labeled for each language. This was resource- and time-intensive and so was not considered for this work, where the goal was a model that could reasonably apply to any language edition of Wikipedia.

Wikidata Taxonomy

The first approach depended purely on the Wikidata instance-of classes. This is tempting for a number of reasons: it is inherently multilingual because it depends solely on Wikidata items and relationships; it is fast and appears relatively simple because it largely just requires traversing a predefined taxonomy; and it is interpretable -- i.e., the outputs are easily explained and can be easily improved by adding appropriate instance-of properties to an item or updating the taxonomy.

The model works as follows: build the Wikidata taxonomy based on the subclass-of property. For any given Wikidata item, you can then identify its instance-of value and where it appears in the taxonomy. You can either assign the item its instance-of class as its topic, or move up the taxonomy to progressively more general topics. Eventually, if you go the whole way up the taxonomy, you generally end up at Entity (Q35120). The main challenge is in automatically determining how far up the taxonomy to go for a given item. For my exploration, I used the proportion of page views to items that fall under a given class and defined a threshold -- e.g., stop moving up the taxonomy once you reach a class that has at least 1% of page views under it. There are several other, more minor details, such as whether you use just one of the instance-of properties associated with an item or all of them, and how to handle loops in the taxonomy.
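To make the roll-up concrete, here is a minimal sketch of the threshold-based traversal, assuming precomputed `parents` (subclass-of relationships) and `pageview_share` (proportion of all page views under each class) mappings; the class identifiers and shares in the toy example are illustrative placeholders rather than real Wikidata values.

```python
def generalize(class_qid, parents, pageview_share, threshold=0.01, seen=None):
    """Walk up the subclass-of taxonomy from an item's instance-of class
    until reaching a class that covers at least `threshold` of page views."""
    if seen is None:
        seen = set()                # guards against loops in the taxonomy
    if class_qid in seen:
        return None
    seen.add(class_qid)
    if pageview_share.get(class_qid, 0.0) >= threshold:
        return class_qid            # general enough: stop here
    for parent in parents.get(class_qid, []):
        topic = generalize(parent, parents, pageview_share, threshold, seen)
        if topic is not None:
            return topic
    return None                     # hit the top without meeting the threshold

# Toy example in the spirit of the White Sox chain discussed below
# (baseball team -> sports team -> team); identifiers and shares are made up.
parents = {"Q_baseball_team": ["Q_sports_team"],
           "Q_sports_team": ["Q_team"]}
share = {"Q_baseball_team": 0.002, "Q_sports_team": 0.012, "Q_team": 0.03}
print(generalize("Q_baseball_team", parents, share))  # -> "Q_sports_team"
```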

In practice, this approach broke down for three reasons:

It is very difficult to automatically determine how far to move up the taxonomy when defining the class for a given item.

The Chicago White Sox (Q335169) are a (great) major league baseball team. They are an instance-of baseball team, and if you follow that up the taxonomy, you get baseball team -> sports team -> team -> organization -> ... (and potentially a few other trees that also get you to organization). The challenge is where to stop on this tree. Sticking with "baseball team" makes sense but results in a massive number of topics across all Wikidata items when you apply this conservative logic elsewhere. "Sports team" is a bit better as a generalization, but automatically determining that moving up to "team" is a relatively non-useful generalization is quite challenging in practice -- "team" also includes the Wright Brothers (Q35820), who have very little in common with the White Sox. As a result, reducing all of Wikidata to 40-50 topics leaves you with a lot of topics with very generic names like artificial entity (Q16686448).

The taxonomy is not particularly intuitive for many items.

That is, the taxonomy tends to capture true but very abstract aspects of what defines a particular concept and often misses the more salient aspects of why it may or may not be popular or interesting to a particular group. With the White Sox, you can see that Wikidata represents them primarily as a collection of individuals who work together, as opposed to a concept related to sports. Other examples include items like List of Bollywood films of 2019 (Q48731762), which is classified as a list and completely misses the Bollywood aspect of the concept, or the landmark school desegregation case Brown v. Board of Education (Q875738), which is classified as a legal case and misses the incredibly important aspects of the case around civil rights and racial discrimination in the United States.

All people are lumped together as instance-of:human.

This is potentially a relatively easy fix in that you can use the Occupation (P106) property, but that reintroduces the same challenge of how far to traverse the occupation taxonomy. Extracting the raw P106 values themselves is straightforward, as sketched below.
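A minimal sketch of that extraction, assuming network access and the standard entity JSON layout returned by the public Special:EntityData endpoint:

```python
import requests

def occupations(qid):
    """Return the Occupation (P106) QIDs claimed on a Wikidata item."""
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
    entity = requests.get(url).json()["entities"][qid]
    return [claim["mainsnak"]["datavalue"]["value"]["id"]
            for claim in entity.get("claims", {}).get("P106", [])
            if "datavalue" in claim["mainsnak"]]  # skip novalue/somevalue

print(occupations("Q35820"))  # occupation QIDs for the Wright Brothers
```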

See this phab comment for a few more details / examples and wdtaxonomy for a tool that allows querying and visualizing the Wikidata taxonomy.

Wikidata -> WikiProject Topic Model

Ultimately, to address the limitations of the Wikidata instance-of taxonomy while still taking advantage of its inherent multilinguality, I repurposed the ORES drafttopic model to predict labels (drawn from English WikiProjects) for an article based on its Wikidata statements. The model can then use the instance-of property but also other statements related to geography or topic-specific databases that clue into other important components of the concept. Also see phab:T228319#5487781 for a bit more description of the particulars of building the dataset and adjusting the WikiProjects taxonomy. In the end, any given Wikipedia article can be probabilistically mapped to one or more topics based on its associated Wikidata statements. The topics fall under four high-level categories (Culture, Geography, History and Society, and STEM), and each high-level category has multiple mid-level categories such as Geography.Americas, Geography.Europe, etc.
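For intuition, here is a minimal sketch of this kind of statement-based multilabel setup, treating each article as a bag of property=value tokens; the learner, features, and topic labels here are illustrative and differ from the actual drafttopic implementation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Each article is represented by its Wikidata statements as tokens,
# e.g. P31=Q5 (instance-of: human), P106=Q937857 (occupation: footballer).
articles = [
    "P31=Q5 P106=Q937857 P27=Q30",        # a US association football player
    "P31=Q11424 P136=Q130232 P495=Q668",  # a drama film from India
]
labels = [["Culture.Sports"], ["Culture.Media", "Geography.Asia"]]

X = CountVectorizer(token_pattern=r"\S+", binary=True).fit_transform(articles)
Y = MultiLabelBinarizer().fit_transform(labels)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
probs = clf.predict_proba(X)   # per-topic probabilities for each article
```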

Recommendations for Improvement

The following recommendations are based on my exploration of the different options for topic models of Wikipedia articles and on the application of the final model to the reader demographic surveys. The recommendations are intended both to inform the development of better topic models and to aid interpretation of topic models as applied to page view data.

People/Human should not be its own category

The Wikidata instance-of taxonomy groups all people under instance-of:human. Other approaches, such as the category-based main topic classifications and the WikiProject directory, risk doing this as well -- "People" for main topic classifications and "Culture.Language and Literature" for WikiProject Biography. Around a quarter of page views go to articles about people, though, so lumping them all into a single topic can miss a lot of nuance -- often in favor of distinctions like that between Physics and Math, which are important from an academic perspective but not particularly useful when analyzing page view traffic.

Multilabel is essential

Very few articles, especially the most popular ones, reasonably fit into a single topic. Forcing articles to map to a single topic therefore misses a lot of nuance and, given the power-law dynamics of Wikipedia, can greatly skew results.
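The toy snippet below contrasts single-label (argmax) assignment with multilabel (threshold) assignment; the probabilities and the threshold are made up for illustration.

```python
import numpy as np

topics = ["Culture.Sports", "Geography.Americas", "History and Society"]
probs = np.array([0.48, 0.45, 0.07])   # one article's topic probabilities

single = topics[int(np.argmax(probs))]                  # drops Geography.Americas
multi = [t for t, p in zip(topics, probs) if p >= 0.4]  # keeps both salient topics
print(single, multi)
```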

Do not confuse model confidence with topic importance

Applying only the most confident label from a machine-learning model tends to overrepresent topics related to geography or people, as they are often the easiest to detect. That ease of detection makes models very confident in those predictions even when they are not the most salient labels (they often are not). During model training, this can also skew model statistics if you are not careful to look at both macro- and micro-averaged statistics -- e.g., biographies comprise a large proportion of Wikipedia articles, so a model that predicts biographies well will look very good even if it performs poorly on other topics.
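A toy illustration of that macro/micro divergence, with made-up labels where class 0 stands in for the dominant biography topic and class 1 for a rarer topic the model never predicts:

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.array([0] * 100)           # the rare topic is never predicted

print(f1_score(y_true, y_pred, average="micro"))                   # ~0.95, looks great
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.49, shows the miss
```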

Inspect the top-viewed articles for a given topic

A single false positive for a topic can have a large impact on analyses. For instance, my Wikidata->WikiProject model mapped an article about a porn site to STEM.Technology (presumably because of the presence of a website associated with the Wikidata item but little else). Though this was a single incorrect prediction, because STEM.Technology was not a large topic and the article for this porn site received a substantial number of page views, it had an outsized impact on analyses of who was reading articles about STEM.Technology.
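One quick way to run this check, assuming a hypothetical table of per-article page views with predicted topics:

```python
import pandas as pd

# Made-up data: one row per (article, topic) pair with page-view counts.
df = pd.DataFrame({
    "article": ["A", "B", "C", "D"],
    "topic": ["STEM.Technology", "STEM.Technology",
              "Culture.Sports", "Culture.Sports"],
    "views": [1_200_000, 30_000, 900_000, 850_000],
})

# Top 5 most-viewed articles per topic; eyeball these for false positives.
top = df.sort_values("views", ascending=False).groupby("topic").head(5)
print(top)
```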

Aim for uniform topics that are robust to outliers

Related to the recommendation above about inspecting results, a good way to keep outliers from driving false conclusions is to ensure that the topics are relatively balanced in terms of the number of articles or the proportion of page views that fall under them. This both forces the topic model to be responsive to what it is trying to describe (as opposed to someone's preconceived notions of what is or is not important) and improves the robustness of the model to outliers, given that no topic should be too small.
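A simple balance check along these lines, with made-up numbers and an arbitrary 5% cutoff:

```python
import pandas as pd

views_by_topic = pd.Series({
    "Culture.Sports": 4_000_000,
    "Geography.Europe": 3_200_000,
    "STEM.Technology": 150_000,
})
share = views_by_topic / views_by_topic.sum()
print(share[share < 0.05])   # topics small enough to be outlier-sensitive
```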

Supervised topic models have benefits

Relying on labels such as WikiProject templates or categories is messy, but it does have the benefit of being relatively easy to adjust and of empowering the editors who know these topics best. This is a large benefit over unsupervised models such as LDA that use article text or links, followed by a manual labeling stage at the end to interpret the results. Making changes to those models would literally require changing the text of articles or the link structure of Wikipedia, which is not only more difficult than changing some categories but would also introduce odd dynamics in content that is intended for readers (as opposed to machines).

Predictions from a trained model are sometimes better than the groundtruth data

WikiProject labels for articles have impressive coverage in English Wikipedia (the vast majority of articles have at least one WikiProject label), but most projects have not tagged all of the articles that might reasonably be considered part of their topic. This means that for a good model, many predictions that look like false positives are likely true positives, and a trained model might therefore, in some cases, be preferable to the groundtruth data. This is also likely true for models that use Wikipedia categories or any other crowdsourced labels that are likely to have gaps. This is a particularly exciting example of how machine learning can help augment the good work that people are already doing.

Some topics have very specific reader populations associated with them

For example, articles about sports tend to be read more by men. If men are overrepresented in the readership of a wiki, this can make sports articles seem much more important than they really are. This is perhaps less of a concern on large Wikipedias, where the readership is often more representative of the general populace, but on smaller wikis, the page views can represent a narrow slice of the world of readers who ideally would be reading content in that language.

See Also