Research:Knowledge Gaps Index/Measurement/Language gap content metrics

From Meta, a Wikimedia project coordination wiki
Tracked in Phabricator:
Task T348246
Created
18:00, 16 Jun 2025 (UTC)
Duration: October 2023 – March 2025
This page documents a completed research project.


This project aimed to develop knowledge gaps metrics for measuring language gaps. Within content gaps, the language gap refers to “the difference in content coverage across different languages.”[1]

Metrics developed

Within the Knowledge Gaps taxonomy for content gaps, the “language gap” is somewhat of a unique case: because all content gap measurements compare content coverage across Wikipedia language editions, the examination of any of the content gaps (gender, geography, topic, etc.) is, in a way, an examination of language gaps (that is, an examination of content gaps across language editions).

Based on the data I explored and the brainstorming with my colleagues as part of this project, my conceptualization of the language gap expanded to multiple types of measurements, including content availability (the difference in content availability across different languages) and content coverage (the difference in content coverage across different languages).

Content availability

This dataset helps track which languages have which Wikimedia projects, as well as their level of representation according to third-party data. It is similar to the canonical wikis dataset, but also includes test projects (e.g., Incubator projects) and third-party data.

Metric dataset schema:

  • language_code: Wikimedia language code (e.g., ja)
  • language_name: Wikimedia language name (e.g., Japanese)
  • tech_literate_population: Approximate figures for the literate, functional population for each language in each territory (i.e., population able to read and write in the language and comfortable enough to use it with computers) based on Unicode’s Territory-Language Information
  • language_official: List of countries (country codes) where the language has official status, based on Unicode’s Territory-Language Information
  • web_support: Website and mobile app support (level)
  • unesco_status: The language’s endangerment status per UNESCO: (Ex) Extinct, (CR) Critically endangered, (SE) Severely endangered, (DE) Definitely endangered, (VU) Vulnerable, (NE) Not endangered / Safe
  • language_is_indigenous: Whether or not the language is an indigenous language, per UNESCO
  • wp_status: Wikipedia status of the language, including "hosted" (Wikipedia edition in this language is hosted by the Foundation), "closed" (Wikipedia edition in this language was previously hosted but is now closed), and "test" (Wikipedia edition in the language exists in Wikimedia Incubator)
  • ws_status: Wikisource status of the language, including "hosted" (the Wikisource edition in this language is hosted by the Foundation), "closed" (the Wikisource edition in this language was previously hosted but is now closed), and "test" (a Wikisource edition in the language exists in Multilingual Wikisource)
  • wb_status: Wikibooks status of the language, including "hosted", "closed", and "test"
  • wq_status: Wikiquote status of the language, including "hosted", "closed", and "test"
  • wn_status: Wikinews status of the language, including "hosted", "closed", and "test"
  • wt_status: Wiktionary status of the language, including "hosted", "closed", and "test"
  • wy_status: Wikivoyage status of the language, including "hosted", "closed", and "test"
  • wv_status: Wikiversity status of the language, including "hosted" (the Wikiversity edition in this language is hosted by the Foundation), "closed" (the Wikiversity edition in this language was previously hosted but is now closed), and "test" (a Wikiversity edition in the language exists in Wikiversity Beta)
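To make the schema above concrete, here is a minimal sketch of one record and a consistency check on the per-project status fields. The field names come from the schema; the example values for Japanese are illustrative only, and the use of None for "no project in any state" is my assumption, not part of the documented schema.

```python
# Sketch of one content-availability record (illustrative values, not real data).
# Assumption: a project status field may also be None when the language has no
# edition in any state; the documented values are "hosted", "closed", "test".
ALLOWED_STATUSES = {"hosted", "closed", "test", None}

sample_record = {
    "language_code": "ja",
    "language_name": "Japanese",
    "tech_literate_population": 119_000_000,  # illustrative figure
    "language_official": ["JP"],
    "web_support": "full",                    # illustrative level
    "unesco_status": "NE",
    "language_is_indigenous": False,
    "wp_status": "hosted",
    "ws_status": "hosted",
    "wb_status": "hosted",
    "wq_status": "hosted",
    "wn_status": "hosted",
    "wt_status": "hosted",
    "wy_status": "hosted",
    "wv_status": "hosted",
}

def validate_statuses(record):
    """Return the project status fields whose values fall outside the allowed set.

    unesco_status is excluded because it uses UNESCO endangerment codes,
    not project hosting states.
    """
    status_fields = [k for k in record
                     if k.endswith("_status") and k != "unesco_status"]
    return [k for k in status_fields if record[k] not in ALLOWED_STATUSES]

print(validate_statuses(sample_record))  # → []
```

A check like this is useful when merging the Wikimedia project statuses with third-party sources (Unicode, UNESCO), since each source uses its own code lists.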

Content coverage

This dataset helps track coverage of the "1000 articles every Wikipedia should have" list in each Wikipedia language edition, including metrics (e.g., quality) for those articles. It follows the Knowledge Gaps schema for content gaps.

Metric dataset schema:

  • wiki_db: Wikimedia database name (e.g., “enwiki”)
  • time_bucket: the time bucket, with monthly granularity (e.g. “2020-02”)
  • content_gap: the content gap (e.g., “topic”)
  • category: the underlying categories for the gap; there will only be one category in this schema, which will be called “vital-articles” or “1000-articles-every-wp-should-have”
  • articles_created: number of articles (from the list of 1000) which have been created, at the time of the time bucket
  • pageviews_sum: total number of pageviews for the vital articles that Wikipedia has, at the time of the time bucket
  • pageviews_mean: mean number of pageviews for the vital articles that Wikipedia has, at the time of the time bucket
  • revision_count: total number of edits for the vital articles that Wikipedia has, at the time of the time bucket
  • quality_score: average article quality score for the vital articles that Wikipedia has, at the time of the time bucket
  • standard_quality: percentage of vital articles that Wikipedia has (at the time of the time bucket) that satisfy the Standard Quality Criteria
  • standard_quality_count: number of vital articles that Wikipedia has (at the time of the time bucket) that satisfy the Standard Quality Criteria
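The derived fields in this schema relate straightforwardly to the count fields. As a hedged sketch: I assume pageviews_mean is pageviews_sum divided by the number of created vital articles, and standard_quality is standard_quality_count as a share of articles_created; the project's exact definitions may differ (e.g., in how empty wikis or partial months are handled).

```python
# Sketch of how the derived coverage metrics could be computed from the count
# fields, under the assumptions stated above. Example values are illustrative.
def derive_coverage_metrics(row):
    """Compute mean pageviews and standard-quality share for one wiki/month row."""
    articles = row["articles_created"]
    return {
        "pageviews_mean": row["pageviews_sum"] / articles if articles else 0.0,
        "standard_quality": row["standard_quality_count"] / articles if articles else 0.0,
    }

row = {
    "wiki_db": "enwiki",
    "time_bucket": "2020-02",
    "content_gap": "language",
    "category": "vital-articles",
    "articles_created": 1000,
    "pageviews_sum": 250_000,
    "standard_quality_count": 640,
}
print(derive_coverage_metrics(row))
# → {'pageviews_mean': 250.0, 'standard_quality': 0.64}
```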

Timeline

FY 2023-24: laying the foundation

Fiscal year 2023-24 tasks focused on assessing data needs (T341881), compiling primary and secondary data related to language (T341881, T348241), developing calculations and visualizations (T348241), and sharing progress with the volunteer community (T364999). I was also able to use the language metrics work to support Language Engineering (T361640).

Blockers/challenges: acquiring and compiling secondary data; unable to finalize institutional access to UNESCO WAL (T348249).

Successful deliverables:

Impact:

FY 2024-25: metrics development

Fiscal year 2024-25 tasks focused on continued communication with the volunteer community about the language metrics project (T371931) as well as continued brainstorming, exploration, and development of actual metrics/datasets (T376728). I incorporated additional primary data to explore language gaps across Commons (T372641) and wrote scripts (T372066). I was also able to use my work on language metrics to support WE 2.2.3 (T369081).

Blockers/challenges: unable to finalize institutional access to UNESCO WAL[2] (T348249) or Ethnologue data[3]; as a result, I was unable to acquire and incorporate third-party data into the metrics.

Successful deliverables:

Impact:

  • Development of a GUI meant that volunteers and staff members without stat machine access can explore the language metrics
  • Stats from the GUI have been used for staff and community-facing presentations and reports, including
  • Professional development: building/improving Python skills;  building/improving RMarkdown skills; sharing learnings with community members; adapting presentation of insights to different audiences
  • Stats from dataset proposal #2 (section above) were used to present baseline “vital article” coverage across wikis to the LPL team (Knowledge Equity Offsite 2025: Content Metrics), to help them with annual planning

Unanswered research questions

  • Regarding multilingual readership:
    • What is the role of reader translation in assessing content availability?
    • When and how do multilingual readers read content in their L1 vs their L2?
  • Regarding knowledge gaps:
    • Where do we see gaps in readership, based on global language population numbers?
    • Where do we see gaps in contributorship, based on global language population numbers?
    • Where do we see gaps in vital article coverage and quality (when assessing content by, e.g., vital articles)?
    • What is the quality of articles created using the Cx tool with MT, vs. using the Cx tool without MT, vs. without using the Cx tool? (Likely to be answered by the Cx Dashboard)
  • Regarding AI language coverage:
    • How many readers/speakers would we reach if we work with the top X languages? (T371457)

Potential next steps

  • Contextual/foundational information exists for anyone who would like to get started developing any of the remaining datasets or metrics
    • Coverage of articles about languages: my recommendation would be to use a SPARQL query similar to the one developed for the vital-articles metrics, looking at which languages (e.g., Tagalog (Q34057), Quechua (Q5218)) have Wikipedia articles written about them; specifically, in which Wikipedia editions, at what article quality, etc.
    • Language gaps on Commons: my recommendation is to develop metrics that track language representation across file descriptions and captions (T374279); a Python script for generating caption language counts exists here; due to the known presence of linguistically mislabeled captions, an exploration of the uncertainty threshold(s) for Wikimedia Commons file captions labeled as 'English' is needed (T374281)
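The SPARQL approach suggested above for coverage of articles about languages could start from a query like the draft below. This is an untested sketch, not the query the vital-articles metrics actually use: the class Q34770 (language) and the P31/P279 (instance of / subclass of) path are real Wikidata identifiers, but the exact modeling choices (e.g., whether to include dialects or language families) would need review before running it against the Wikidata Query Service.

```python
# Draft SPARQL query (assumption: to be run against https://query.wikidata.org)
# listing Wikidata language items together with their Wikipedia sitelinks.
QUERY = """
SELECT ?language ?languageLabel ?sitelink WHERE {
  ?language wdt:P31/wdt:P279* wd:Q34770 .   # instance of (a subclass of) language
  ?sitelink schema:about ?language ;
            schema:isPartOf/wikibase:wikiGroup "wikipedia" .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

print(QUERY)
```

From the result set, counting sitelinks per language item would give a first coverage figure per Wikipedia edition; article quality would then need to be joined in from the quality-score data used elsewhere in this project.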

References
