Research:Knowledge Gaps Index/Measurement/Architecture


This document provides technical background for the data pipelines that generate the content gap metrics datasets.

Overview

This document describes the data architecture that quantifies the various content gaps defined by the Knowledge Gaps Index project. The pipelines are implemented as Spark applications and executed on YARN in the WMF's data engineering infrastructure. The input data sources for the pipelines are maintained by data engineering with high availability. The pipelines are scheduled using Airflow, and execution failures result in an email alert. A serving layer for the output of the content gap pipelines is planned, with the details to be determined pending a more concrete design for the frontend tool.

Approach

The content gap pipeline is implemented as a Spark application. By distributing the computation across a large number of processes running on YARN, we can process all content gaps for all languages/Wikipedias in parallel. The pipeline consists of a sequence of steps (a toy sketch follows the list):

  • Compute features relevant for each content gap, for all Wikipedia articles across all projects. Example: the gender gap feature for Frida Kahlo is female, based on the sex or gender property.
  • Compute a set of features that provide insight into Wikipedia articles across all projects. Example: the Frida Kahlo article on the French Wikipedia received 69,636 pageviews in May 2022.
  • Combine the content gap features and metric features datasets, and aggregate statistics for the content gap categories of each content gap. Example: articles on the French Wikipedia associated with the female category of the gender gap were edited 47,421 times in May 2022.
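The toy PySpark sketch below walks through these three steps on the Frida Kahlo example. The schemas, column names, and hard-coded rows are simplified assumptions for illustration, not the actual pipeline code.

<syntaxhighlight lang="python">
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("content-gap-steps-sketch").getOrCreate()

# Step 1 output (illustrative): one content gap feature row per Wikidata item.
gap_features = spark.createDataFrame(
    [("Q5588", "female")],  # Frida Kahlo, gender gap category
    ["qid", "gender_gap"],
)

# Step 2 output (illustrative): one metric feature row per article, project and month.
metric_features = spark.createDataFrame(
    [("frwiki", "Frida Kahlo", "Q5588", "2022-05", 69636)],
    ["wiki_db", "page_title", "qid", "month", "pageviews"],
)

# Step 3: join the two datasets on the Wikidata item and aggregate statistics
# per project, month and content gap category.
gap_metrics = (
    metric_features.join(gap_features, "qid")
    .groupBy("wiki_db", "month", "gender_gap")
    .agg(F.sum("pageviews").alias("pageviews"))
)
gap_metrics.show()
</syntaxhighlight>

Because the content gap features are keyed by QID, the join in step 3 propagates them automatically to the article in every language edition.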

The data dependencies of the knowledge gap pipeline are the Wikidata dumps, from which the content gap features are derived, and the article datasets underlying the metric features described below, such as pageview counts and edit history.

In addition, the pipeline makes use of language-agnostic models such as the article quality scores and the country geography model.

Content gap features

The content gap features are associated with a Wikidata entity rather than directly with a Wikipedia article. For example, the content gap features for Frida Kahlo are associated with the QID Q5588. As a consequence, the features for that Wikidata entity apply to all Wikipedia articles about Frida Kahlo across the various languages.

The content gap features themselves are also based on Wikidata where possible. Each content gap consists of a set of categories that are associated with Wikidata properties. The knowledge gap pipeline processes the Wikidata dumps to annotate Wikidata entities that have Wikipedia articles with the other Wikidata entities linked to them via the Wikidata properties associated with content gaps. For example, the time gap is associated with a set of Wikidata properties that are used to extract time-specific values (e.g. a year). Where possible, the pipeline also uses Wikidata to define the categories of a content gap itself. For example, the gender gap categories are based on the allowed values of the Wikidata property for sex or gender.

The output of this step is a dataset where each row is identified by a QID that is associated with at least one Wikipedia article (i.e. it exists in at least one project), and there is a column for each content gap.
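A minimal PySpark sketch of this step, assuming a hypothetical flattened claims table with one row per (item, property, value) and showing only the gender gap column; Q6581072 and Q6581097 are the Wikidata items for female and male.

<syntaxhighlight lang="python">
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("gap-features-sketch").getOrCreate()

# Hypothetical flattened view of the Wikidata dumps: qid, property, value_qid.
claims = spark.read.parquet("wikidata_claims")

# Map the allowed values of the sex or gender property (P21) to gender gap categories.
gender = (
    claims.where(F.col("property") == "P21")
    .withColumn(
        "gender_gap",
        F.when(F.col("value_qid") == "Q6581072", "female")
         .when(F.col("value_qid") == "Q6581097", "male")
         .otherwise("other"),
    )
    .select("qid", "gender_gap")
)

# One row per QID with at least one Wikipedia article, one column per content gap
# (only the gender gap is shown here; other gaps would be joined in the same way).
items_with_articles = spark.read.parquet("items_with_articles")  # hypothetical input
content_gap_features = items_with_articles.join(gender, "qid", "left")
</syntaxhighlight>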

Metric features

In order to provide insights into content gaps over time, the knowledge gap pipeline aggregates commonly used metrics for analyzing Wikipedia content and editor activity into a metric features dataset. These metrics include:

  • The creation date of the article
  • The quality score of the article
  • The number of pageviews of the article
  • The number of times an article was edited

The metric features are associated with a particular Wikipedia article, i.e. each of the 152 language versions of the Frida Kahlo article is associated with its own set of metric features. In addition, the metric features are represented as time series to allow analysis of trends over time.
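As an illustration, a monthly time series of edit counts per article could be derived along the following lines; the input table and column names are assumptions, not the actual pipeline code.

<syntaxhighlight lang="python">
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("metric-features-sketch").getOrCreate()

# Hypothetical per-revision edit history: wiki_db, page_id, qid, rev_timestamp.
history = spark.read.parquet("edit_history")

# Aggregate to monthly edit counts, i.e. one time series per article and project.
edit_counts = (
    history
    .withColumn("month", F.date_format("rev_timestamp", "yyyy-MM"))
    .groupBy("wiki_db", "page_id", "qid", "month")
    .agg(F.count("*").alias("edits"))
)
</syntaxhighlight>

Pageviews and quality scores would contribute analogous time series columns, keyed by the same (project, article, month) combination.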

Content gap metrics

In order to compute metrics about the content gaps, the content gap features dataset and the metric features dataset are combined into an intermediate dataset. The metric features are associated with Wikipedia articles in specific Wikipedia projects, while the content gap features are associated with Wikidata items shared across all Wikipedia projects. The content gap metrics are computed for each content gap individually, as the content gap features not only determine which content gap category an article belongs to, but also how the intermediate dataset is aggregated. For example, to compute the metrics for the gender gap we only want to consider articles about people, so all other articles are filtered out before aggregating.
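A sketch of this combine-filter-aggregate step for the gender gap, assuming hypothetical dataset and column names (e.g. an is_human flag derived from the content gap features):

<syntaxhighlight lang="python">
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("gap-metrics-sketch").getOrCreate()

# Hypothetical inputs: per-article metric features and per-QID gap features.
metric_features = spark.read.parquet("metric_features")    # wiki_db, page_id, qid, month, edits
gap_features = spark.read.parquet("content_gap_features")  # qid, gender_gap, is_human

gender_gap_metrics = (
    metric_features.join(gap_features, "qid")  # the intermediate dataset
    .where(F.col("is_human"))                  # gender gap: articles about people only
    .groupBy("wiki_db", "month", "gender_gap")
    .agg(F.sum("edits").alias("edits"))
)
</syntaxhighlight>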

The output datasets and schemas are documented here.

Code

The code can be found in the Knowledge Gaps repository on GitLab.