Research:Knowledge Gaps Index/Measurement

From Meta, a Wikimedia project coordination wiki

After finalizing the Taxonomy of Knowledge Gaps, which contains a structured grouping and description of all the potential gaps in Wikimedia projects, our next milestone is to provide insights and tools to help the measurement of such gaps.

Readers, Contributors, and Content gaps from the Taxonomy of Wikimedia Knowledge Gaps

Scope

The taxonomy of knowledge gaps identified 3 macro-dimensions across which inequalities exist in Wikimedia projects: Readers, the individuals who access Wikimedia sites to consume content; Contributors, the community of editors of Wikimedia projects; and Content, the knowledge contained in Wikimedia projects.

The aim of the Knowledge Gap Measurement project is to generate a set of metrics to quantify the gaps we identified in the 3 dimensions. We want to map each gap to one or a few numbers (a "metric") reflecting the extent to which the gap is present in Wikimedia projects.

Methods

Mapping readers, contributors, and content to specific gaps.

We interviewed community members and other stakeholders to understand in more depth how they perceive and frame knowledge gaps. Based on these insights, we operationalized each gap: we identified its underlying categories and developed methods to assign readers, contributors, and pieces of content to those categories. For example, we used Wikidata to associate Wikipedia biographies with the gender identity of their subject. Depending on the knowledge gap dimension, we use one of two methods for mapping:

  • Survey-based: we design survey questions specifically tailored to categorize readers and contributors into groups that are relevant for knowledge gaps (for example, gender groups). Based on the answers to these questions, we can estimate the distribution of Readers and Contributors across the categories that are relevant to measuring inequalities in Wikimedia projects. A complete list of mappings for readers and contributors can be found here.
  • Observation-based: we quantify knowledge gaps in Content by estimating the distribution of pieces of content (e.g., Wikipedia articles, Wikidata items) across different categories (e.g., gender, geographic distribution, cultural background). A complete list of mappings for content can be found here. More details about the research behind content measurements are in the Developing Metrics for Content Gaps (Knowledge Gaps Taxonomy) page.
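As a rough illustration of the observation-based approach, the sketch below maps a handful of articles to gender categories and estimates how content is distributed across them. The article records, the `gender` field, and the function name `category_distribution` are hypothetical and only serve the example; in practice the mapping would come from a source such as Wikidata, as described above.

```python
from collections import Counter

def category_distribution(items, key):
    """Return the share of items that fall into each category of `key`.

    Items without a value for `key` are counted under "unknown".
    """
    counts = Counter(item.get(key, "unknown") for item in items)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: count / total for category, count in counts.items()}

# Hypothetical mapping of biographies to gender categories.
articles = [
    {"title": "A", "gender": "female"},
    {"title": "B", "gender": "male"},
    {"title": "C", "gender": "male"},
    {"title": "D", "gender": "non-binary"},
]

distribution = category_distribution(articles, "gender")
```

The same function could be applied to survey answers (one record per respondent) to obtain the survey-based distribution for readers or contributors.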

Quantifying the gap based on a selection of relevant metrics.

Sketch of the workflow to develop metrics for knowledge gaps in the content dimension.

We reviewed different models describing the various aspects in which gaps can be measured and conducted interviews with affiliates to capture the community’s interests. We obtained a set of metrics quantifying the content coverage for each category, taking into account the scientific maturity of each metric as well as project constraints. For survey-based measurements, the metric is generally a version of "distribution of answers to the gap-specific question". For content-based measurements, we aggregate mappings according to two different sets of metrics:

  • Selection-Score (e.g., number of articles for each category of the gap), which reflects how much content exists for each category on a wiki.
  • Extent-Score (e.g., quality of articles based on length, number of sections, number of images), which reflects “how good” the articles in each category are.
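The two scores can be sketched as follows, assuming the Selection-Score is the number of articles per category and the Extent-Score is the average of a per-article quality value. The `quality` field (a number between 0 and 1) and the function name `selection_and_extent` are assumptions made for this example; they are not the project's actual quality model.

```python
from collections import defaultdict

def selection_and_extent(articles, key):
    """Aggregate articles per category of `key`.

    selection: number of articles in the category.
    extent: mean of the (hypothetical) per-article quality value.
    """
    counts = defaultdict(int)
    quality_sums = defaultdict(float)
    for article in articles:
        category = article[key]
        counts[category] += 1
        quality_sums[category] += article["quality"]
    return {
        category: {
            "selection": counts[category],
            "extent": quality_sums[category] / counts[category],
        }
        for category in counts
    }

# Hypothetical articles with a precomputed quality value in [0, 1].
sample = [
    {"title": "A", "gender": "female", "quality": 0.5},
    {"title": "B", "gender": "male", "quality": 0.25},
    {"title": "C", "gender": "male", "quality": 0.75},
]

scores = selection_and_extent(sample, "gender")
```

Keeping the two scores separate makes it possible to tell a category with few but well-developed articles apart from one with many short articles.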

More about content metrics here

Results

So far, we have developed metrics for 5 content gaps and for most Readers and Contributors gaps:

  • 5 out of the 11 Gaps in Content, with 2 metrics under development
  • 11 out of the 12 Gaps in Readership
  • 10 out of the 11 Contributorship Gaps

For readers and contributors, the unmapped gap is the "Tech Skills" gap in the "Interaction" facet. While there exist surveys to test individuals' Wikipedia Editing Skills, or more generic Internet Skills [1], more research is needed to understand the types of skills we want to test for both readers and contributors, and then implement a questionnaire accordingly.

Readership Metrics

| Facet (the dimension's facet) | Gap (the knowledge gap, e.g. Gender) | Metric (how do we measure the gap?) |
|---|---|---|
| Representation | Gender | Distribution of survey responses to the gender question |
| Representation | Age | Distribution of survey responses to the age question |
| Representation | Geography | Distribution of pageviews and unique devices by geographic categories inferred from readers' IP addresses, and distribution of survey responses to the urban/rural question |
| Representation | Language | Distribution of survey responses to the language questions |
| Representation | Socio-economic Status | Distribution of survey responses to the socio-economic status questions |
| Representation | Cultural Background | Distribution of survey responses to the cultural background questions (ethnicity and discrimination) |
| Representation | Sexual Orientation | Distribution of survey responses to the sexual orientation question |
| Interaction | Motivation | Distribution of survey responses to the motivation question |
| Interaction | Information Depth | Distribution of survey responses to the information depth question |
| Interaction | Familiarity | Distribution of survey responses to the familiarity question |
| Interaction | Tech Skills | Not yet developed |
| Interaction | Disabilities | Distribution of survey responses to the disability question |

See Readers Main Page for a complete list of metrics and their current status.

Contributorship Metrics

| Facet (the dimension's facet) | Gap (the knowledge gap, e.g. Gender) | Metric (how do we measure the gap?) |
|---|---|---|
| Representation | Gender | Distribution of survey responses to the gender question |
| Representation | Age | Distribution of survey responses to the age question |
| Representation | Geography | Distribution of edits and (active) editors by geographic categories inferred from IP addresses, and distribution of survey responses to the urban/rural question |
| Representation | Language | Distribution of survey responses to the language questions |
| Representation | Socio-economic Status | Distribution of survey responses to the socio-economic status questions |
| Representation | Cultural Background | Distribution of survey responses to the cultural background questions (ethnicity and discrimination) |
| Representation | Sexual Orientation | Distribution of survey responses to the sexual orientation question |
| Interaction | Motivation | Distribution of survey responses to the motivation question |
| Interaction | Role | Distribution of survey responses to the role questions (Experience and Role on Wiki) |
| Interaction | Disabilities | Distribution of survey responses to the disability question |

See Contributors Main Page for a complete list of gaps and their current status.

Content Gap Metrics

| Facet (the dimension's facet) | Gap (the knowledge gap, e.g. Gender) | Metric (how do we measure the gap?) |
|---|---|---|
| Representation | Gender | Time series of content gap metrics over gender mappings |
| Representation | Age | Time series of content gap metrics over time mappings |
| Representation | Geography | Time series of content gap metrics over geographic mappings |
| Representation | Language | More research is planned to measure this gap |
| Representation | Socio-economic Status | More research is planned to measure this gap |
| Representation | Cultural Background | More research is planned to measure this gap |
| Representation | Topics for Impact | More research is planned to measure this gap |
| Representation | Sexual Orientation | Time series of content gap metrics over sexual orientation mappings |
| Interaction | Readability | In progress: follow along with our research on multilingual readability |
| Interaction | Structured Data | In progress: follow along with our research on Wikidata item quality |
| Interaction | Multimedia | Time series of content gap metrics over multimedia mappings |

Information about how articles are mapped to specific content gap categories can be found here; a complete list of content gap metrics and their current status can be found here; and technical background about the data pipeline architecture can be found here.

Ideas for Summarizing Metrics

The final output of the metrics generation process, for both survey-based and observation-based measurements, is an estimate of the coverage or representation of readers, contributors, or pieces of content across different categories. While the raw distribution remains the most informative output reflecting the extent of a gap, different stakeholders (C-level, affiliates, community members) will need to look at knowledge gap values at different depths.

To this end, we started putting together some ideas about how to summarize the distribution-based metrics into a few numbers reflecting the questions people might want to ask of this data.

  • What is the representation of each category for this gap in this project? (Figure: probability distribution for a gap in a language edition for a specific year.)
  • What is the most represented category for this gap in this project?
  • What is the least represented category for this gap in this project?
  • How dominant is the most represented category with respect to the least represented one?
  • How dominant is the most represented category with respect to the second most represented one?
  • How unbalanced is the representation of different categories?
  • How diverse is this project with respect to this gap?
  • How are gaps evolving over time? (Figure: cumulative distribution for a gap in a language edition over all years.)
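Several of the questions above can be answered from a single distribution. The following is a minimal sketch, not the project's chosen method: it assumes the distribution is given as a mapping from category to share, uses simple ratios for dominance, and uses normalized Shannon entropy as one possible measure of balance and diversity (1 means all categories are equally represented, 0 means a single category holds everything). The function name `summarize` and the example shares are hypothetical.

```python
import math

def summarize(distribution):
    """Summarize a {category: share} distribution into a few numbers."""
    if len(distribution) < 2:
        raise ValueError("at least two categories are required")
    # Rank categories from most to least represented.
    ranked = sorted(distribution.items(), key=lambda kv: kv[1], reverse=True)
    top_category, top = ranked[0]
    second = ranked[1][1]
    bottom_category, bottom = ranked[-1]
    # Shannon entropy; categories with zero share contribute nothing.
    entropy = -sum(p * math.log(p) for p in distribution.values() if p > 0)
    return {
        "most_represented": top_category,
        "least_represented": bottom_category,
        "top_to_bottom_ratio": top / bottom if bottom > 0 else math.inf,
        "top_to_second_ratio": top / second if second > 0 else math.inf,
        "normalized_entropy": entropy / math.log(len(distribution)),
    }

# Hypothetical shares for one gap in one project and one year.
summary = summarize({"male": 0.6, "female": 0.3, "non-binary": 0.1})
```

Other summary statistics (for example a Gini coefficient) could replace the entropy term; which one communicates best to each audience is an open design choice.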

Visualizations

We generated preliminary visualizations for each of the gaps. We are now working on a set of tools to expose and visualize gaps.

See also

  • Early research on measuring content gaps
  • Early research on measuring the gender content gap in particular

  1. "The Pipeline of Online Participation Inequalities: The Case of Wikipedia Editing". Journal of Communication. 2018-02-28. doi:10.1093/joc/jqx003.