Research:Knowledge Gaps Index/Datasets


This is a technical design document for the data architecture and pipelines that generate the content gap metrics data. It describes the logic that the pipeline steps implement, the data schemas that connect the steps, the approach used to schedule the execution of the pipeline, and the serving layer of content gap metrics used by the content gap frontend tool.

This document does not cover the design of the frontend tool for the content gap, nor the exact details of the API required to serve the data used by the tool.


This document describes the data architecture that quantifies the various content gaps defined by the Knowledge Gaps Index project. The pipelines are implemented as Spark pipelines and executed on YARN in the WMF's data engineering infrastructure. The input data sources for the pipelines are maintained by data engineering with high availability. The pipelines are scheduled using Airflow, and execution failures result in an email alert. There is a serving layer for the output of the content gap pipelines; its details are to be determined pending a more concrete design for the frontend tool.


The content gap pipeline is implemented as a Spark application. By distributing the computation across a large number of processes running on YARN, we can process all content gaps for all languages/Wikipedias in parallel. The pipeline consists of a sequence of steps:

  • Compute features relevant for each content gap, for all Wikipedia articles across all projects. Example: the gender gap feature for Frida Kahlo is female based on the sex or gender property.
  • Compute a set of metric features that provide insight into Wikipedia articles across all projects. Example: the Frida Kahlo article on the French Wikipedia received 69636 pageviews in May 2022.
  • Combine the content gap features and metric features datasets, and aggregate statistics for the content gap categories of each content gap. Example: Articles on the French Wikipedia associated with the female category of the gender gap were edited 47421 times in May 2022.

The knowledge gap pipeline depends on a number of external data sources:

In addition, the pipeline makes use of language-agnostic models such as the article quality scores and the country geography model.

Content gap features

The content gap features are associated with a Wikidata entity rather than directly with a Wikipedia article. For example, the content gap features for Frida Kahlo are associated with QID Q5588. As a consequence, the features for that Wikidata entity apply to all Wikipedia articles about Frida Kahlo in the various languages.

The content gap features themselves are also based on Wikidata where possible. Each content gap consists of a set of categories that are associated with Wikidata properties. The knowledge gap pipeline processes the Wikidata dumps to annotate Wikidata entities that have Wikipedia articles with other Wikidata entities that are linked by Wikidata properties associated with content gaps. For example, the time gap is associated with a set of Wikidata properties that are used to extract time-specific values (e.g. a year). Where possible, the pipeline also uses Wikidata to define the categories for a content gap itself. For example, the gender gap categories are based on the allowed values of the Wikidata property for sex or gender.

The output of this step is a dataset where each row is identified by a QID that is associated with at least one Wikipedia article (i.e. it exists in at least one project), and there is a column for each content gap.
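
As a rough illustration of this step, the sketch below annotates toy Wikidata entities with a gender gap feature. The entity dictionaries, the `CONTENT_GAP_PROPERTIES` mapping, and the function name are hypothetical simplifications, not the pipeline's actual code. P21 is the Wikidata property for sex or gender; the real pipeline runs on Spark over the full Wikidata dumps and may structure the data differently.

```python
# Hypothetical mapping from content gap name to the Wikidata properties behind it.
CONTENT_GAP_PROPERTIES = {"gender": ["P21"]}  # P21: sex or gender

# Toy stand-ins for parsed Wikidata entities (simplified structure, assumed).
# "Q_NO_ARTICLE" is a placeholder for an entity without any Wikipedia article.
entities = [
    {"qid": "Q5588", "sitelinks": ["enwiki", "frwiki"], "claims": {"P21": ["Q6581072"]}},
    {"qid": "Q_NO_ARTICLE", "sitelinks": [], "claims": {"P21": ["Q6581097"]}},
]


def content_gap_features(entities, gap_properties):
    """Return one row per QID that has at least one Wikipedia article,
    with one column per content gap holding the linked Wikidata values."""
    rows = []
    for entity in entities:
        if not entity["sitelinks"]:
            continue  # skip entities that exist in no project
        row = {"qid": entity["qid"]}
        for gap, properties in gap_properties.items():
            values = []
            for prop in properties:
                values.extend(entity["claims"].get(prop, []))
            row[gap] = values
        rows.append(row)
    return rows


features = content_gap_features(entities, CONTENT_GAP_PROPERTIES)
print(features)
```

Keeping the row keyed by QID means the same feature values can later be attached to every language edition of the article without recomputing them per project.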

Metric features

To provide insights into content gaps over time, the knowledge gap pipeline aggregates commonly used metrics for analyzing Wikipedia content and editor activity into a metric features dataset, which includes:

  • The creation date of the article
  • The quality score of the article
  • The number of pageviews of the article
  • The number of times an article was edited

The metric features are associated with a particular Wikipedia article, i.e. the Frida Kahlo articles for the 152 projects it exists in are all associated with their own set of metric features. In addition, the metric features are represented as time series to allow analysis of trends over time.
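
The per-article time series can be pictured with the following minimal sketch. The event tuples, their values, and the monthly bucket size are assumptions made for illustration (the monthly bucket mirrors examples such as 2022-07 used elsewhere in this document); the actual pipeline computes these series with Spark from the upstream data sources.

```python
from collections import defaultdict

# Toy event records: (wiki_db, page_title, date, metric, value). Values are made up.
events = [
    ("frwiki", "Frida_Kahlo", "2022-05-03", "pageviews", 120),
    ("frwiki", "Frida_Kahlo", "2022-05-17", "pageviews", 80),
    ("frwiki", "Frida_Kahlo", "2022-05-17", "revision_count", 1),
    ("enwiki", "Frida_Kahlo", "2022-06-01", "pageviews", 300),
]


def metric_time_series(events):
    """Sum each metric per article (wiki_db, title) and per monthly time bucket."""
    series = defaultdict(lambda: defaultdict(int))
    for wiki_db, title, date, metric, value in events:
        time_bucket = date[:7]  # "YYYY-MM" prefix of an ISO date
        series[(wiki_db, title)][(time_bucket, metric)] += value
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {article: dict(buckets) for article, buckets in series.items()}


series = metric_time_series(events)
print(series[("frwiki", "Frida_Kahlo")])
```

Note that the French and English articles keep separate series even though they describe the same Wikidata entity.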

Content Gap Metrics

In order to compute metrics about the content gaps, the content gap features dataset and the metric features dataset are combined to form an intermediate dataset. The metric features are associated with Wikipedia articles in specific Wikipedia projects, while the content gap features are associated with Wikidata items shared across all Wikipedia projects. The content gap metrics are computed for each content gap individually, as the content gap features not only determine which content gap category an article belongs to, but they are also used to determine how to aggregate the intermediate dataset. For example, to compute the metrics for the gender gap we only want to consider articles about people, so any other articles are filtered out before aggregating.
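
The join, filter, and aggregate logic described above might look roughly like the sketch below for the gender gap. The `is_human` flag, the dictionary layouts, and all numbers are hypothetical; the real pipeline performs the equivalent operations on Spark datasets.

```python
from collections import defaultdict

# Content gap features keyed by QID (shared across projects). Toy values.
gap_features = {
    "Q5588": {"is_human": True, "gender": "female"},   # Frida Kahlo
    "Q7186": {"is_human": True, "gender": "female"},   # Marie Curie
    "Q90": {"is_human": False, "gender": None},        # Paris, not a person
}

# Metric features keyed by article in a specific project. Toy values.
metric_features = [
    {"wiki_db": "frwiki", "qid": "Q5588", "time_bucket": "2022-05", "pageviews": 100, "revision_count": 3},
    {"wiki_db": "frwiki", "qid": "Q7186", "time_bucket": "2022-05", "pageviews": 50, "revision_count": 2},
    {"wiki_db": "frwiki", "qid": "Q90", "time_bucket": "2022-05", "pageviews": 500, "revision_count": 9},
    {"wiki_db": "enwiki", "qid": "Q5588", "time_bucket": "2022-05", "pageviews": 400, "revision_count": 7},
]


def gender_gap_metrics(metric_features, gap_features):
    """Join metric rows to gap features by QID, keep only articles about
    people, and aggregate per (wiki_db, content_gap, category, time_bucket)."""
    totals = defaultdict(lambda: {"pageviews_sum": 0, "revision_count": 0})
    for row in metric_features:
        features = gap_features.get(row["qid"])
        # Gap-specific filter: the gender gap only considers articles about people.
        if features is None or not features["is_human"]:
            continue
        key = (row["wiki_db"], "gender", features["gender"], row["time_bucket"])
        totals[key]["pageviews_sum"] += row["pageviews"]
        totals[key]["revision_count"] += row["revision_count"]
    return dict(totals)


result = gender_gap_metrics(metric_features, gap_features)
print(result)
```

Other content gaps would plug in a different filter and category column, which is why the metrics are computed for each gap individually.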

The output of the content gap metrics step is a dataset with columns:

  • wiki_db: the Wikipedia project, e.g. frwiki
  • content_gap: the content gap, e.g. gender
  • category: the content gap category, e.g. female
  • time_bucket: the time bucket, e.g. 2022-07
  • by_category: for each content gap category
    • article_created: number of articles created
    • pageviews_sum: total number of pageviews
    • pageviews_mean: mean number of pageviews
    • quality_score: average article quality score
    • revision_count: number of edits
  • totals: same nested columns as by_category, but totals for the content gap
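
One way to picture a single row of this schema is the sketch below. The field names come from the list above, while the Python types and the example values are assumptions for illustration only, not the types used by the actual dataset.

```python
from dataclasses import dataclass, asdict


@dataclass
class Metrics:
    """Nested metric columns shared by by_category and totals (types assumed)."""
    article_created: int
    pageviews_sum: int
    pageviews_mean: float
    quality_score: float
    revision_count: int


@dataclass
class ContentGapMetricsRow:
    wiki_db: str
    content_gap: str
    category: str
    time_bucket: str
    by_category: Metrics  # metrics for this content gap category
    totals: Metrics       # metrics for the content gap as a whole


# Made-up values, chosen only to show the shape of a row.
row = ContentGapMetricsRow(
    wiki_db="frwiki",
    content_gap="gender",
    category="female",
    time_bucket="2022-07",
    by_category=Metrics(2, 300, 150.0, 0.5, 10),
    totals=Metrics(8, 1600, 200.0, 0.6, 50),
)
print(asdict(row))
```

Carrying the totals next to the per-category values lets a consumer compute shares (for example, the fraction of edits going to one category) from a single row.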

There are various output formats of the content gap metrics that are derived from the schema described above and serve different use cases. A fully denormalized version in CSV format is hosted here.
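
To give a sense of what denormalizing the nested schema can involve, the sketch below flattens nested columns into a flat CSV row. The dotted column names (such as `by_category.revision_count`) and the sample values are assumptions; the hosted CSV may use different column names and contain more columns.

```python
import csv
import io

# One toy row in the nested schema, with only two nested metrics for brevity.
rows = [
    {
        "wiki_db": "frwiki",
        "content_gap": "gender",
        "category": "female",
        "time_bucket": "2022-07",
        "by_category": {"article_created": 2, "revision_count": 10},
        "totals": {"article_created": 8, "revision_count": 50},
    },
]


def flatten(row):
    """Expand one level of nested dicts into 'parent.child' columns."""
    flat = {}
    for key, value in row.items():
        if isinstance(value, dict):
            for nested_key, nested_value in value.items():
                flat[f"{key}.{nested_key}"] = nested_value
        else:
            flat[key] = value
    return flat


def to_csv(rows):
    """Serialize flattened rows to CSV text with a header line."""
    flat_rows = [flatten(r) for r in rows]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(flat_rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(flat_rows)
    return buffer.getvalue()


csv_text = to_csv(rows)
print(csv_text)
```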


All the code can be found in the Knowledge Gaps repository on GitLab.