Research:Developing Metrics for Content Gaps (Knowledge Gaps Taxonomy)

11:05, 8 February 2021 (UTC)
Duration:  2021-February – 2021-July
content gaps, culture gap, gender gap, diversity
This page documents a completed research project.

In response to the Wikimedia Movement’s 2030 strategic direction, and more specifically to the goal of knowledge equity, the Research team at the Wikimedia Foundation is developing a framework to understand and measure knowledge gaps, with the aim of informing long-term decision-making. The taxonomy of knowledge gaps groups and describes the different Wikimedia knowledge gaps. It defines a knowledge gap in the content dimension as a disparity in the coverage of content.

One gap stands out: the gender content gap (the observation that “more men than women are covered in the main space content of our wikis”) has received widespread attention through various initiatives that try to address the gap (such as Wiki Women in Red), public dashboards (such as WHGI or Denelezh), and academic research (such as Wagner et al. 2015). More generally, operationalizing a specific gap in the context of Wikimedia projects poses the following challenges.

First, for each gap there are many possible metrics. It is not clear a priori which metric should be preferred, and existing approaches often contain many implicit choices. The gender gap is an example: although it is most commonly thought of as the number of biographies on women, scholars have highlighted the importance of alternative metrics that capture extent, framing, or structural biases, or that capture content related to gender beyond biographies.

Second, there are major challenges in implementing metrics across languages so as to cover all (Wikipedia) projects. This requires language-agnostic approaches that do not rely on keyword-based methods (e.g. for identifying relevant article categories) or on language-specific models such as ORES used for measuring article quality, which despite its usefulness is currently deployed for only 11 out of more than 300 Wikipedia projects.

Third, and most importantly, in a community-driven environment we cannot simply impose metrics. The community, meaning the affiliates and initiatives that have been working to bridge these gaps, needs to be involved in the discussions and to explain its needs and goals. A coherent overview of community practices across the different gaps does not currently exist.

In this work, we develop a general framework to build metrics for five knowledge gaps in the content dimension (gender, geography, cultural context, time, sexual orientation).


This project aims to develop metrics for knowledge gaps in the content dimension of the taxonomy. The work was carried out from the end of January until the end of June 2021. During this period, we produced several deliverables to generate knowledge on how to tackle the different content gaps. The relevant projects are all Wikipedia language editions, Wikidata, Commons, and the remaining projects (Wiktionary, Wikisource, Wikiquote, Wikibooks, Wikinews, Wikiversity, Wikivoyage).

The goal is to develop an in-depth understanding of each gap’s metrics, such that they can be implemented at scale without further research. For example, if at some point we decide to build a dashboard for knowledge gaps (say, similar to Wikistats), all the ingredients should be in place for the implementation to begin.

Research process

In order to measure the content gaps, we structured the research in two phases (or sub-problems):

Phase 1. Mapping Wikipedia articles to a specific gap.

For this, we need to understand the framing of gaps by the communities and stakeholders. Based on this, we can operationalize a gap in the context of Wikipedia. Then we develop a mapping between gaps and content (content-gap mapping).
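As a minimal sketch of what a content-gap mapping could look like, the snippet below assigns articles to the categories of a gap using language-independent structured properties. It assumes, for illustration only, that each article is linked to a Wikidata item and that the properties P21 (sex or gender) and P17 (country) are used; the toy records, the `GAP_PROPERTIES` table, and the `map_to_gap` helper are hypothetical and are not taken from the project's scripts. Property values are shown as plain labels for readability.

```python
from collections import defaultdict

# Hypothetical table: which Wikidata property is assumed to encode each gap.
GAP_PROPERTIES = {
    "gender": "P21",     # sex or gender
    "geography": "P17",  # country
}

# Invented toy records: one per article, with claims of its linked Wikidata item.
ARTICLES = [
    {"title": "Article A", "claims": {"P21": "female", "P17": "Kenya"}},
    {"title": "Article B", "claims": {"P21": "male"}},
    {"title": "Article C", "claims": {"P17": "Peru"}},
    {"title": "Article D", "claims": {}},
]


def map_to_gap(articles, gap):
    """Group article titles by category of the given gap.

    Articles whose item lacks the relevant property are left unmapped.
    """
    prop = GAP_PROPERTIES[gap]
    mapping = defaultdict(list)
    for article in articles:
        value = article["claims"].get(prop)
        if value is not None:
            mapping[value].append(article["title"])
    return dict(mapping)


gender_mapping = map_to_gap(ARTICLES, "gender")
geography_mapping = map_to_gap(ARTICLES, "geography")
print(gender_mapping)
print(geography_mapping)
```

Because the mapping relies on property identifiers rather than on keywords or category names, the same function would apply unchanged to any language edition, which is the point of a language-agnostic approach.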

Phase 2. Quantifying the gap based on a selection of relevant metrics.

We review different models describing the various aspects in which gaps can be measured and conduct interviews with affiliates to capture the community’s interests. We obtain a set of metrics by taking into account aspects of the scientific maturity of the metric, as well as project constraints (choosing a set of metrics).

This is a mid-point presentation of the research process proposed to design a set of metrics for different content gaps (PDF, 33 slides).
Sketch of the workflow to develop metrics for knowledge gaps in the content dimension.

At the beginning and end of each phase, we engaged with communities in order to explain the project and understand their goals and needs. This is a research project led by researchers but aimed at supporting the communities. Their input is therefore essential for defining and closing the project.

Summary of contributions

Our main contributions in this work consist of:

  • Identification and classification of the stakeholders involved in each gap according to their practices and capacity; review of how they frame the gaps in their spaces and documentation, complemented by interviews.
  • Review and synthesis of (i) models for aspects of metrics for knowledge gaps and (ii) existing quantitative approaches to measuring knowledge gaps in Wikimedia projects.
  • Development of a strategy for measuring content gaps that goes beyond the most commonly discussed gaps: it is applicable to the five gaps described here, can be extended to other gaps in the future, and directs more attention to gaps that have so far received less of it.
  • Selection of a final set of metrics reflective of the content gaps.
  • Prototype implementation for at-scale monitoring, with results for 300 languages.


a) Set of Metrics

Taking your suggestions into account, we identified a small number of metrics that best capture the most essential aspects of content gaps. This compact set contains only three:

  • Selection-Score (number of articles for each category of the gap), e.g. the number of articles for each country for the geography gap.
  • Extent-Score (an indicator similar to WikiRank that describes the degree of completion/quality of articles based on length, number of sections, and number of images). The extent-score describes “how good” the articles in each category are.
  • Visibility-Score (percentage of articles for each category in spaces like the “Main Page” or the group of “Featured articles”).
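The three scores can be sketched on toy data as follows. This is an illustration under stated assumptions, not the project's implementation: the article records are invented, the extent-score is taken here to be an equal-weight average of length, section count, and image count (each capped at an arbitrary reference value in `CAPS`), and the visibility-score is interpreted as each category's share of the featured articles. The actual indicator, its weights, and the exact definition of visibility are not specified in this section.

```python
from collections import Counter, defaultdict

# Invented toy records; "category" is the gap category (here, a country).
ARTICLES = [
    {"category": "Kenya", "length": 12000, "sections": 8, "images": 4, "featured": True},
    {"category": "Kenya", "length": 3000, "sections": 2, "images": 0, "featured": False},
    {"category": "Peru", "length": 6000, "sections": 4, "images": 2, "featured": False},
    {"category": "Peru", "length": 24000, "sections": 10, "images": 5, "featured": True},
    {"category": "Peru", "length": 1500, "sections": 1, "images": 1, "featured": True},
]

# Assumed reference values at which a feature counts as "complete".
CAPS = {"length": 12000, "sections": 8, "images": 4}


def selection_score(articles):
    """Number of articles per category."""
    return dict(Counter(a["category"] for a in articles))


def article_extent(article):
    """Equal-weight average of the capped, normalized features (0 to 1)."""
    parts = [min(article[key] / cap, 1.0) for key, cap in CAPS.items()]
    return sum(parts) / len(parts)


def extent_score(articles):
    """Mean per-article extent for each category."""
    per_category = defaultdict(list)
    for a in articles:
        per_category[a["category"]].append(article_extent(a))
    return {c: sum(v) / len(v) for c, v in per_category.items()}


def visibility_score(articles):
    """Percentage of the featured articles that belongs to each category."""
    featured = [a for a in articles if a["featured"]]
    counts = Counter(a["category"] for a in featured)
    return {c: 100 * n / len(featured) for c, n in counts.items()}


selection = selection_score(ARTICLES)
extent = extent_score(ARTICLES)
visibility = visibility_score(ARTICLES)
print(selection, extent, visibility)
```

Keeping the three scores as separate functions over the same per-article records means that a new gap only requires a new `category` assignment, not new metric code.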

To better understand how we arrived at this set, you can read about the process of choosing a set of metrics.

b) Preliminary Visualizations

We generated preliminary visualizations for each of the gaps. The following is an example for the geography gap:

Geography Gap (Selection Metric): Number of Geolocated articles by continent in five Wikipedia language editions (ca, de, en, es, it).

c) Technical Documentation

We provide a technical document (PDF) for implementing the measures for content gaps, with a step-by-step guide on how to obtain the relevant metrics. The non-technical aspects (e.g. feedback from the community) will be covered in detail in another document.

The document is organized as follows:

Section 1 “Overview of resources and outputs”, where we give an overview of the different outputs (code, databases, visualization).

Section 2 “Main steps and annotation”, where we explain the different steps to map gaps to content and the annotation we have used.

Section 3 “Implementation”, where we explain the implementation of the code including the data sources, generated databases, and main scripts.

The documentation is also available on this page.

Scripts are available at this GitHub address:

Some scripts have dependencies: they need data generated by a previous script, or they fill a database.

This diagram contains all the main resources to do the mapping and computation of metrics.

Guiding principles

For this work to be successful, we considered it necessary to make explicit a short list of principles that define the researchers’ position and delimit their responsibilities and aims. The first principle is people-centeredness, that is, addressing the needs and challenges of the people who power the movement, because the resulting measurements will facilitate planning and encourage coordination in order to bridge the gaps. Whether they are delivered through a dashboard or another type of tool, it is essential that the measurements support the work done by the Wikimedians who bridge the gaps.

At the same time, we followed the principle of neutrality, so as not to favor one stakeholder over another. We have not seen any particular conflict regarding the process of bridging the gaps, and the measurements do not encourage any particular point of view. When studying the different content gaps, we devoted more attention to the gender gap because of the greater maturity of the academic research and the community organization around it.

When designing and implementing the measurements, we acknowledged the gaps in research as well as the aspects that have not been addressed by communities. In this sense, we want to highlight the flexibility of the project, which can be continually revised in light of stakeholders’ feedback and new scenarios. Both the methods to map content to gaps and the methods to measure the content were designed to be language-agnostic and topic-agnostic, considering that the list of language communities may grow in the future and that the specific relevant themes may vary, either as Wikimedia language communities progress towards the goal of filling the gaps or simply because of world events.


Main references

  • Konieczny, P., & Klein, M. (2018). Gender gap through time and space: A journey through Wikipedia biographies via the Wikidata Human Gender Indicator. New Media & Society, 20(12), 4608-4633.
  • Miquel-Ribé, M. (2019). The Sum of Human Knowledge? Not in One Wikipedia Language Edition. Wikipedia @ 20.
  • Miquel-Ribé, M., & Laniado, D. (2019, July). Wikipedia Cultural Diversity Dataset: A Complete Cartography for 300 Language Editions. In Proceedings of the International AAAI Conference on Web and Social Media (Vol. 13, pp. 620-629).
  • Miquel-Ribé, M., & Laniado, D. (2020, August). The Wikipedia Diversity Observatory: A Project to Identify and Bridge Content Gaps in Wikipedia. In Proceedings of the 16th International Symposium on Open Collaboration (pp. 1-4).
  • Redi, M., Gerlach, M., Johnson, I., Morgan, J., & Zia, L. (2020). A Taxonomy of Knowledge Gaps for Wikimedia Projects (First Draft). arXiv preprint arXiv:2008.12314.
  • Wagner, C., Garcia, D., Jadidi, M., & Strohmaier, M. (2015, April). It's a man's Wikipedia? Assessing gender inequality in an online encyclopedia. In Proceedings of the International AAAI Conference on Web and Social Media (Vol. 9, No. 1).