Research:Knowledge Gaps Index/Measurement

After finalizing the Taxonomy of Knowledge Gaps, which contains a structured grouping and description of all the potential gaps in Wikimedia projects, our next milestone is to provide insights and tools that help measure such gaps.

Scope

The taxonomy of knowledge gaps identified 3 macro-dimensions across which inequalities exist in Wikimedia projects: Readers, the set of individuals who access Wikimedia sites to consume content; Contributors, the community of editors of Wikimedia projects; and Content, the knowledge contained in Wikimedia projects.

The aim of the Knowledge Gap Measurement project is to generate a set of metrics to quantify the gaps we identified in the 3 dimensions. We want to map each gap to one or a few numbers (a "metric") reflecting the extent to which the gap is present in Wikimedia projects.

Methods

Mapping readers, contributors, and content to specific gaps.

We interviewed community members and other stakeholders to learn in more depth how they understand and frame knowledge gaps. Based on these insights, we operationalized each gap: we identified its underlying categories and developed methods to assign readers, contributors, and pieces of content to those categories. For example, we used Wikidata to associate Wikipedia biographies with the gender identity of their subject. Depending on the knowledge gap dimension, we use two methods for mapping:

Survey based: we design survey questions specifically tailored to categorize readers and contributors into groups that are relevant for knowledge gaps (for example, gender groups). Based on the answers to these questions, we can estimate the distribution of Readers and Contributors across the categories that are relevant for measuring inequalities in Wikimedia projects. A complete list of mappings for readers and contributors can be found here.
Observation based: we quantify knowledge gaps in Content by estimating the distribution of pieces of content (e.g., Wikipedia articles, Wikidata items) across different categories (e.g., gender, geographic distribution, cultural background). A complete list of mappings for content can be found here. More details about the research behind content measurements are in the Developing Metrics for Content Gaps (Knowledge Gaps Taxonomy) page.
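Both mapping methods end in the same step: counting how many readers, contributors, or pieces of content fall into each category and normalizing the counts. A minimal sketch of that step follows. The function name `category_distribution` and all sample values are hypothetical, chosen only for illustration, and are not taken from the project's pipeline.

```python
from collections import Counter


def category_distribution(labels):
    """Return the share of each category among the given labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: count / total for category, count in counts.items()}


# Hypothetical survey answers to a gender question (illustrative values only).
survey_answers = ["woman", "man", "man", "non-binary", "man"]
reader_distribution = category_distribution(survey_answers)

# Hypothetical article-to-category mappings, e.g. derived from Wikidata.
article_genders = ["male", "male", "female", "male"]
content_distribution = category_distribution(article_genders)

print(reader_distribution)
print(content_distribution)
```

A real survey analysis would likely also need to handle non-response and sampling weights, which this sketch ignores.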

Quantifying the gap based on a selection of relevant metrics.

We reviewed different models describing the various ways in which gaps can be measured and conducted interviews with affiliates to capture the community's interests. From these, we selected a set of metrics quantifying the coverage of each category, taking into account the scientific maturity of each metric as well as project constraints. For survey-based measurements, the metric is generally a version of "distribution of answers to the gap-specific question". For content-based measurements, we aggregate mappings according to two different sets of metrics:

Selection-Score (e.g., number of articles for each category of the gap), which reflects how much content exists for each category on a wiki.
Extent-Score (e.g., quality of articles based on length, # sections, # images), which reflects "how good" the articles in each category are.
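The two kinds of scores can be illustrated with a small sketch. It assumes the Selection-Score is a per-category article count and uses the per-category mean of a single feature (such as length) as a stand-in for the Extent-Score. The actual quality model is not specified here, and the records and function names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical article records: a gap category plus simple quality features.
articles = [
    {"category": "female", "length": 1200, "sections": 4, "images": 1},
    {"category": "male", "length": 3000, "sections": 8, "images": 3},
    {"category": "male", "length": 1000, "sections": 2, "images": 0},
]


def selection_score(records):
    """Count how many articles exist for each category."""
    counts = defaultdict(int)
    for record in records:
        counts[record["category"]] += 1
    return dict(counts)


def extent_score(records, feature="length"):
    """Average one quality feature per category (a simplified proxy)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        totals[record["category"]] += record[feature]
        counts[record["category"]] += 1
    return {category: totals[category] / counts[category] for category in totals}


print(selection_score(articles))
print(extent_score(articles, feature="length"))
```

Averaging a single feature is only one possible choice; a combined quality score over length, sections, and images would follow the same per-category aggregation pattern.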

Results

So far, we have developed metrics for 5 content gaps, and for most Reader and Contributor gaps:

5 out of the 11 Gaps in Content, with 2 metrics under development
11 out of the 12 Gaps in Readership
10 out of the 11 Contributorship Gaps

For readers and contributors, the unmapped gap is the "Tech Skills" gap in the "Interaction" facet. While there exist surveys to test individuals' Wikipedia Editing Skills, or more generic Internet Skills ^[1], more research is needed to understand the types of skills we want to test for both readers and contributors, and then implement a questionnaire accordingly.

Readership Metrics

FACET	GAP	Metric
The Dimensions' Facet.	The Knowledge gap. e.g. Gender	How do we measure the gap?
Representation	Gender	Distribution of survey responses to the gender question. Dataset
	Age	Distribution of survey responses to the age question.
	Geography	Distribution of pageviews and unique devices by geographic categories inferred from readers' IP addresses, and distribution of survey responses to the urban/rural question
	Language	Distribution of survey responses to the language questions
	Socio-economic Status	Distribution of survey responses to the socio-economic status questions
	Cultural Background	Distribution of survey responses to the cultural background questions (ethnicity and discrimination)
	Sexual Orientation	Distribution of survey responses to the sexual orientation question
Interaction	Motivation	Distribution of survey responses to the motivation question
	Information Depth	Distribution of survey responses to the information depth question
	Familiarity	Distribution of survey responses to the familiarity question
	Tech Skills	Not yet developed
	Disabilities	Distribution of survey responses to the disability question

See Readers Main Page for a complete list of metrics and their current status.

Contributorship Metrics

FACET	GAP	Metric
The Dimensions' Facet.	The Knowledge gap. e.g. Gender	How do we measure the gap?
Representation	Gender	Distribution of survey responses to the gender question.
	Age	Distribution of survey responses to the age question.
	Geography	Distribution of edits and (active) editors by geographic categories inferred from contributors' IP addresses, and distribution of survey responses to the urban/rural question.
	Language	Distribution of survey responses to the language questions
	Socio-economic Status	Distribution of survey responses to the socio-economic status questions
	Cultural Background	Distribution of survey responses to the cultural background questions (ethnicity and discrimination)
	Sexual Orientation	Distribution of survey responses to the sexual orientation question
Interaction	Motivation	Distribution of survey responses to the motivation question
	Role	Distribution of survey responses to the role questions (Experience and Role on Wiki)
	Disabilities	Distribution of survey responses to the disability question

See Contributors Main Page for a complete list of gaps and their current status.

Content Gap Metrics

FACET	GAP	Metric
The Dimensions' Facet.	The Knowledge gap. e.g. Gender	How do we measure the gap?
Representation	Gender	Time series of content gap metrics over gender mappings
	Age	Time series of content gap metrics over time mappings
	Geography	Time series of content gap metrics over geographic mappings
	Language	We will be planning more research to measure this gap.
	Socio-economic Status	We will be planning more research to measure this gap.
	Cultural Background	We will be planning more research to measure this gap.
	Topics for Impact	We will be planning more research to measure this gap.
	Sexual Orientation	Time series of content gap metrics over sexual orientation mappings
Interaction	Readability	Currently in progress: follow along with our research on multilingual readability
	Structured Data	Currently in progress: follow along with our research on Wikidata item quality
	Multimedia	Time series of content gap metrics over multimedia mappings

Information about how articles are mapped to specific content gap categories can be found here. A complete list of content gap metrics and their current status can be found here, and technical background about the data pipeline architecture can be found here.

Ideas for Summarizing Metrics

The final output of the metrics generation process, for both survey-based and observation-based measurements, is an estimate of the coverage/representation of readers, contributors, or pieces of content across different categories. While the raw distribution remains the most informative output reflecting the extent of a gap, different stakeholders (C-level, affiliates, community members) will need to look at knowledge gap values at different depths.

To this end, we started putting together some ideas about how to summarize the distribution-based metrics into a few numbers reflecting the questions people might want to ask of this data.

What is the representation of each category for this gap in this project?

Probability distribution for a gap in a language edition for a specific year $P=(P(gap_{year}^{lan}))$

What is the most represented category for this gap in this project? $Max(P)$
What is the least represented category for this gap in this project? $Min(P)$
How dominant is the most represented category with respect to the least represented one? $Max(P)/Min(P)$
How dominant is the most represented category with respect to the second most represented one? $Max(P)/2ndMax(P)$
How unbalanced is the representation of different categories? $Gini(P)$
How diverse is this project with respect to this gap? $Normalized-Entropy(P)$
How are gaps evolving over time? Cumulative distribution for a gap in a language edition over all years.
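
The summary questions above can be sketched in a few lines. This is an illustrative sketch only: it assumes the Gini coefficient is computed as the mean absolute difference between all pairs of category shares divided by twice the mean share, and that normalized entropy is the Shannon entropy divided by the logarithm of the number of categories. The function name `summarize` and the sample distribution are hypothetical.

```python
import math


def summarize(p):
    """Summarize a distribution p (dict: category -> share, at least 2 categories)."""
    ranked = sorted(p.items(), key=lambda item: item[1], reverse=True)
    values = [share for _, share in ranked]
    n = len(values)
    top, second, bottom = values[0], values[1], values[-1]

    # Gini: mean absolute pairwise difference, normalized by twice the mean.
    mean = sum(values) / n
    gini = sum(abs(a - b) for a in values for b in values) / (2 * n * n * mean)

    # Shannon entropy normalized by its maximum, log(n); zero shares add nothing.
    entropy = -sum(v * math.log(v) for v in values if v > 0)

    return {
        "most_represented": ranked[0][0],
        "least_represented": ranked[-1][0],
        "max_over_min": top / bottom if bottom > 0 else math.inf,
        "max_over_second": top / second if second > 0 else math.inf,
        "gini": gini,
        "normalized_entropy": entropy / math.log(n),
    }


# Hypothetical distribution for one gap in one language edition and year.
example = {"a": 0.5, "b": 0.25, "c": 0.25}
print(summarize(example))
```

Under these definitions, a perfectly balanced distribution has a Gini value of 0 and a normalized entropy of 1, so the two numbers move in opposite directions as a gap widens.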