From Meta, a Wikimedia project coordination wiki
Duration:  2019-03 – ??

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

The Long Tail Topic Ontology explores methodologies for programmatically identifying "knowledge gaps", or systematically underrepresented areas of Wikipedia. This project consists of three distinct parts: 1) developing a scalable algorithm that generates sufficiently specific categories, 2) mapping metrics to these categories to illustrate the "completeness" of topical areas, and 3) developing visualizations to effectively explore and navigate the resulting data set.



At present, Wikipedia does not have a comprehensive category structure that can be leveraged to understand the balance between overrepresented and underrepresented topics, or what information exists within the encyclopedia and what information does not. Existing approaches to categorizing the encyclopedia either do not provide sufficient depth or likely reflect and perpetuate biases toward existing topics. For instance, the ORES Draft Topic model can identify 64 different broad topics, but these topics are not specific enough to identify missing content. In this case, we may know that women are an underrepresented category, but we cannot tell whether the underrepresented women are predominantly scientists or athletes. Conversely, while the WikiProject taxonomy provides sufficient depth in some areas, it likely suffers from the same production biases that plague the encyclopedia as a whole. Branches of the taxonomy that receive more attention are likely overspecified, while underproduced topics may fall under overly broad categories.

In this project we first aim to develop a comprehensive category structure that provides enough breadth and depth to understand the distribution of information within the encyclopedia, and how volunteer labor is distributed over those categories. In other words, we aim to understand 1) what topics are covered within the encyclopedia, 2) the quantity and diversity of articles that belong to topics, 3) the relationships between topics, and 4) who contributes to given topics. This taxonomy should be:

Specific: Lowest-level categories are smaller and more precise than those of our existing models.

Cohesive: Each category should contain only articles about a particular topic.

Comprehensive: The taxonomy can be trivially applied to all Wikipedia articles in multiple language editions.

Metrics & Use Cases

Second, we develop use cases for the topical clustering methodology developed by EdCast for the Long Tail Topic Ontology project. These use cases focus on, but are not limited to, identifying systematic knowledge gaps. For each use case, integrating EdCast’s clustering strategy allows us to identify topical areas of the encyclopedia that correlate with some metric. Whereas the metric alone can pinpoint specific outlier articles, pairing it with EdCast’s clustering strategy allows us to explore topical areas that contain many outliers. For our purposes, these topical areas may represent different kinds of knowledge gaps.

Existing and more refined taxonomies could be leveraged for a similar purpose, but the depth, granularity, and scalability of EdCast’s clustering method provide a distinct advantage. While the ORES Draft Topic model could tell us at a high level which parts of the encyclopedia are systematically underrepresented, it does not provide the level of detail necessary to understand or begin to fix the issue.
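
The idea of pairing a per-article metric with topical clusters can be illustrated with a minimal sketch. It is not EdCast's clustering method or the project's actual pipeline: the function name, the cluster labels, the metric values, and the threshold are all hypothetical and exist only to show how outlier articles might be rolled up into topical areas that contain many outliers.

```python
from collections import defaultdict


def outlier_share_by_cluster(articles, threshold):
    """Return {cluster: fraction of its articles whose metric is below threshold}.

    `articles` is an iterable of (cluster_label, metric_value) pairs.
    Both the metric and the threshold are placeholders (assumptions).
    """
    totals = defaultdict(int)
    outliers = defaultdict(int)
    for cluster, metric in articles:
        totals[cluster] += 1
        if metric < threshold:
            outliers[cluster] += 1
    return {cluster: outliers[cluster] / totals[cluster] for cluster in totals}


# Toy data invented purely for illustration: (cluster label, article metric).
toy_articles = [
    ("cluster_a", 0.2),
    ("cluster_a", 0.3),
    ("cluster_a", 0.9),
    ("cluster_b", 0.8),
    ("cluster_b", 0.7),
    ("cluster_b", 0.1),
    ("cluster_b", 0.9),
]

shares = outlier_share_by_cluster(toy_articles, threshold=0.5)

# Rank clusters so that those with the largest share of outliers come first.
ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
for cluster, share in ranked:
    print(f"{cluster}: {share:.2f}")
```

In this sketch, a cluster that ranks near the top is a candidate knowledge gap worth inspecting by hand. The same aggregation would work with any per-article metric once the cluster assignments are available.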


Third, we develop and test a series of visualizations for exploring the clustered data set and highlighting underrepresented areas of Wikipedia. These visualizations are hosted on Toolforge at