Research:Knowledge Gaps Index/Measurement/Language gap content metrics
This project aimed to develop knowledge gaps metrics for measuring language gaps. Within content gaps, the language gap refers to “the difference in content coverage across different languages.”[1]
Metrics developed
Within the Knowledge Gaps taxonomy for content gaps, the "language gap" is somewhat of a unique case. Because all content gap measurements compare content coverage across Wikipedia language editions, the examination of any of the content gaps (gender, geography, topic, etc.) is, in a way, an examination of language gaps (that is, an examination of content gaps across language editions).
Based on the data I explored and the brainstorming with my colleagues as part of this project, my conceptualization of the language gap expanded to multiple types of measurements, including content availability (the difference in content availability across different languages) and content coverage (the difference in content coverage across different languages).
Content availability
This dataset helps track which languages have which Wikimedia projects, as well as each language's level of representation according to third-party data. It is similar to the canonical wikis dataset, but also includes test projects (e.g., Incubator projects) and third-party data.
Metric dataset schema:
- language_code: Wikimedia language code (e.g., ja)
- language_name: Wikimedia language name (e.g., Japanese)
- tech_literate_population: approximate figures for the literate, functional population for each language in each territory (i.e., the population able to read and write in the language and comfortable enough to use it with computers), based on Unicode's Territory-Language Information
- language_official: list of countries (country codes) where the language has official status, based on Unicode's Territory-Language Information
- web_support: website and mobile app support (level)
- unesco_status: the language's endangerment status per UNESCO: (Ex) Extinct, (CR) Critically endangered, (SE) Severely endangered, (DE) Definitely endangered, (VU) Vulnerable, (NE) Not endangered/Safe
- language_is_indigenous: whether or not the language is an indigenous language, per UNESCO
- wp_status: Wikipedia status of the language: "hosted" (the Wikipedia edition in this language is hosted by the Foundation), "closed" (the edition was previously hosted but is now closed), or "test" (a test edition exists in Wikimedia Incubator)
- ws_status: Wikisource status of the language: "hosted", "closed", or "test" (a test edition exists in Multilingual Wikisource)
- wb_status: Wikibooks status of the language: "hosted", "closed", or "test"
- wq_status: Wikiquote status of the language: "hosted", "closed", or "test"
- wn_status: Wikinews status of the language: "hosted", "closed", or "test"
- wt_status: Wiktionary status of the language: "hosted", "closed", or "test"
- wy_status: Wikivoyage status of the language: "hosted", "closed", or "test"
- wv_status: Wikiversity status of the language: "hosted", "closed", or "test" (a test edition exists in Wikiversity Beta)
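To make the schema concrete, below is a minimal Python sketch of what one row of this dataset could look like, with a simple sanity check on the per-project status fields. The column names come from the schema above; the sample values and the validation step are hypothetical illustrations, not the production pipeline.

```python
import pandas as pd

# Columns from the content availability schema above.
COLUMNS = [
    "language_code", "language_name", "tech_literate_population",
    "language_official", "web_support", "unesco_status",
    "language_is_indigenous", "wp_status", "ws_status", "wb_status",
    "wq_status", "wn_status", "wt_status", "wy_status", "wv_status",
]

# A hypothetical sample row (values are illustrative, not real data).
row = {
    "language_code": "ja",
    "language_name": "Japanese",
    "tech_literate_population": 119_000_000,  # invented figure
    "language_official": ["JP"],
    "web_support": "full",                    # invented level label
    "unesco_status": "NE",
    "language_is_indigenous": False,
    "wp_status": "hosted", "ws_status": "hosted", "wb_status": "hosted",
    "wq_status": "hosted", "wn_status": "hosted", "wt_status": "hosted",
    "wy_status": "hosted", "wv_status": "hosted",
}

df = pd.DataFrame([row], columns=COLUMNS)

# Sanity check: every per-project status must be one of the known values
# (None marks a language with no project of that type at all).
VALID_PROJECT_STATUSES = {"hosted", "closed", "test", None}
project_status_cols = [
    c for c in COLUMNS if c.endswith("_status") and c != "unesco_status"
]
for col in project_status_cols:
    bad = df.loc[~df[col].isin(VALID_PROJECT_STATUSES), col]
    assert bad.empty, f"unexpected value(s) in {col}: {bad.tolist()}"
```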
Content coverage
This dataset helps track the coverage of the "1000 articles every Wikipedia should have" list in each Wikipedia language edition, including metrics (e.g., quality) for those articles. It follows the Knowledge Gaps schema for content gaps.
Metric dataset schema:
- wiki_db: Wikimedia database name (e.g., "enwiki")
- time_bucket: the time bucket, with monthly granularity (e.g., "2020-02")
- content_gap: the content gap (e.g., "topic")
- category: the underlying categories for the gap; there will only be one category in this schema, which will be called "vital-articles" or "1000-articles-every-wp-should-have"
- articles_created: number of articles (from the list of 1000) that have been created, at the time of the time bucket
- pageviews_sum: total number of pageviews for the vital articles that the Wikipedia has, at the time of the time bucket
- pageviews_mean: mean number of pageviews for the vital articles that the Wikipedia has, at the time of the time bucket
- revision_count: total number of edits for the vital articles that the Wikipedia has, at the time of the time bucket
- quality_score: average article quality score for the vital articles that the Wikipedia has, at the time of the time bucket
- standard_quality: percentage of the vital articles that the Wikipedia has (at the time of the time bucket) that satisfy the Standard Quality Criteria
- standard_quality_count: number of the vital articles that the Wikipedia has (at the time of the time bucket) that satisfy the Standard Quality Criteria
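As an illustration of how the derived fields relate to one another, here is a minimal pandas sketch that aggregates hypothetical per-article data into this schema for one wiki and one time bucket. The input columns and values are invented for the example; the real dataset would be produced by the Knowledge Gaps pipelines.

```python
import pandas as pd

# Hypothetical per-article snapshot for one wiki and one monthly time bucket.
articles = pd.DataFrame({
    "wiki_db": ["arwiki"] * 3,
    "time_bucket": ["2020-02"] * 3,
    "pageviews": [1200, 300, 4500],
    "revisions": [45, 12, 210],
    "quality_score": [0.41, 0.18, 0.77],
    "meets_standard_quality": [True, False, True],
})

metrics = (
    articles.groupby(["wiki_db", "time_bucket"])
    .agg(
        articles_created=("quality_score", "size"),
        pageviews_sum=("pageviews", "sum"),
        pageviews_mean=("pageviews", "mean"),
        revision_count=("revisions", "sum"),
        quality_score=("quality_score", "mean"),
        standard_quality_count=("meets_standard_quality", "sum"),
    )
    .reset_index()
)
metrics["content_gap"] = "topic"
metrics["category"] = "vital-articles"
# Reading of the schema above: standard_quality is the share of the vital
# articles this wiki has that satisfy the Standard Quality Criteria.
metrics["standard_quality"] = (
    metrics["standard_quality_count"] / metrics["articles_created"]
)
print(metrics)
```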
Timeline
FY 2023-24: laying the foundation
Fiscal year 2023-24 tasks focused on assessing data needs (T341881), compiling primary and secondary data related to language (T341881, T348241), developing calculations and visualizations (T348241), and sharing progress with the volunteer community (T364999). I was also able to use the language metrics work to support Language Engineering (T361640).
Blockers/challenges: acquiring/compiling secondary data; I was unable to finalize institutional access to the UNESCO World Atlas of Languages (WAL) (T348249).
Successful deliverables:
- Compilation (i.e., querying, scraping) of all necessary primary data in GitLab (see 01_source_data, 02_wrangling_scripts, and 03_wrangled_data)
- State of languages analysis: state_of_languages.ipynb
- State of top 20 languages analyses: state_of_languages-top20.ipynb, state_of_languages-top20-editors.ipynb, state_of_languages-top20_project_size.ipynb, state_of_languages-top20_unique_devices.ipynb
- Incubator analyses: current_incubator_history_visualizations-active_substantial.ipynb and both_current_and_grad_incubator_history_visualizations-active_substantial.ipynb
- Presentation of the current state of languages and language metrics at the May 2024 Language Community Meeting.
Impact:
- Developed and documented methods for wrangling Incubator data (methods that had not previously been formally developed or documented internally by any team), which can now be used by RDS teams and LPL
- The first baseline reports about the state of languages across Wikimedia content projects, broken down by language edition; the stats from these reports have been used for many community-facing presentations and reports, including:
- Research:Incubator_and_language_representation_across_Wikimedia_projects#Insights
- The state of languages: Language community meeting presentation (31 May 2024)
- Wikimania Katowice: State of Language Technology and Onboarding at Wikimedia (next FY)
- Celtic_Knot_2024_Future_of_Language_Incubation (next FY)
- Celtic_Knot_2024_Wikimedia’s_new_language_metrics (next FY)
- Professional development: building/improving Python skills; interfacing with community members and sharing learnings with them; adapting the presentation of insights to different audiences
FY 2024-25: metrics development
Fiscal year 2024-25 tasks focused on continued communication with the volunteer community about the language metrics project (T371931), as well as continued brainstorming, exploration, and development of actual metrics/datasets (T376728); I also incorporated additional primary data to explore language gaps across Commons (T372641) and across scripts (T372066). I was also able to use my work on language metrics to support WE 2.2.3 (T369081).
Blockers/challenges: I was unable to finalize institutional access to UNESCO WAL[2] (T348249) or Ethnologue data[3]. As a result, I was unable to acquire and incorporate this third-party data into the metrics.
Successful deliverables:
- State of Commons analysis: state_of_commons.ipynb
- State of MinT analysis: state_of_mint.ipynb
- State of Scripts analysis: state_of_scripts.ipynb
- Creation of a PAWS GUI for exploration of language metrics at the Celtic Knot 2024 conference
- Presentation of language metrics at Celtic Knot 2024 conference (see abstract)
- Proposal of three potential datasets for measuring and tracking content coverage across different languages (see T376728)
Impact:
- Development of the GUI means that volunteers and staff members who lack stat machine access are able to explore the language metrics
- Stats from the GUI have been used for staff and community-facing presentations and reports, including
- Celtic_Knot_2024_Wikimedia’s_new_language_metrics (2024)
- Knowledge Equity Offsite 2025: Content Metrics presentation (2025)
- Professional development: building/improving Python skills; building/improving RMarkdown skills; sharing learnings with community members; adapting presentation of insights to different audiences
- Stats from dataset proposal #2 (see the section above) were used to present baseline "vital article" coverage across wikis to the LPL team (Knowledge Equity Offsite 2025: Content Metrics), to help them with annual planning
Unanswered research questions
- Regarding multilingual readership:
- What is the role of reader translation in assessing content availability?
- When and how do multilingual readers read content in their L1 vs their L2?
- Regarding knowledge gaps:
- Where do we see gaps in readership, based on global language population numbers?
- Where do we see gaps in contributorship, based on global language population numbers?
- Where do we see gaps in vital article coverage and quality (when assessing content by, e.g., vital articles)?
- What is the quality of articles created using the CX tool with MT, vs. using the CX tool without MT, vs. without using the CX tool? (Likely to be answered by the CX Dashboard)
- Regarding AI language coverage:
- How many readers/speakers would we reach if we work with the top X languages? (T371457; a sketch of the underlying calculation follows below)
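On the last question, the underlying arithmetic is a cumulative-coverage calculation. Below is a minimal sketch with invented speaker counts; real figures would come from third-party sources such as Unicode CLDR, and a serious version would have to account for multilingual overlap, which this naive sum ignores.

```python
# Hypothetical speaker counts in millions (illustrative only). Multilingual
# speakers are counted once per language, so the naive sum overstates the
# number of unique people reached.
speakers = {
    "English": 1500, "Mandarin": 1100, "Hindi": 600,
    "Spanish": 550, "Swahili": 200,
}

def coverage_of_top(counts: dict[str, float], x: int) -> float:
    """Share of the summed speaker population covered by the top-x languages."""
    ordered = sorted(counts.values(), reverse=True)
    return sum(ordered[:x]) / sum(ordered)

for x in range(1, len(speakers) + 1):
    print(f"top {x} languages: {coverage_of_top(speakers, x):.0%}")
```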
Potential next steps
- Productionize the vital articles dataset as part of the Knowledge Gaps data (T390104)
- Metrics can be used by teams doing work related to vital knowledge (WE 2) in FY 25-26.
- Steps should be taken to expand or complement the vital articles metrics with metrics tracking the creation and improvement of articles that the community deems important/vital (e.g., metrics for topic-based & community-defined lists)
- Formalize language representation dataset (T3882001)
- Add ISO 639 codes to the canonical data language dataset (T346855)
- Incorporate data from the Unicode CLDR territory-language dataset.
- Contextual/foundational information exists for anyone who would like to get started developing any of the remaining datasets or metrics:
- Coverage of articles about languages: my recommendation would be to use a SPARQL query similar to the one developed for the vital articles metrics, looking at which languages (e.g., Tagalog (Q34057), Quechua (Q5218)) have Wikipedia articles written about them; specifically, in which Wikipedia editions, at what article quality, etc. (a sketch follows at the end of this list)
- Language gaps on Commons: my recommendation is to develop metrics that track language representation across file descriptions and captions (T374279); a Python script for generating caption language counts exists here. Due to the known presence of linguistically mislabeled captions, an exploration of the uncertainty threshold(s) for Commons file captions labeled as 'English' is needed (T374281)
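To make the articles-about-languages recommendation concrete, the sketch below queries the Wikidata Query Service for the Wikipedia editions that hold an article about each of the two example languages. The sitelink pattern (schema:about / schema:isPartOf) is standard WDQS usage, but this is not the query developed for the vital articles metrics, and article quality would still need to be joined in from a separate source.

```python
import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Which Wikipedia editions have an article about each example language?
# Q34057 = Tagalog, Q5218 = Quechua (the examples mentioned above).
QUERY = """
SELECT ?lang ?langLabel ?article ?wiki WHERE {
  VALUES ?lang { wd:Q34057 wd:Q5218 }
  ?article schema:about ?lang ;
           schema:isPartOf ?wiki .
  FILTER(CONTAINS(STR(?wiki), "wikipedia.org"))
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

resp = requests.get(
    WDQS_ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "language-gap-metrics-sketch/0.1"},  # hypothetical UA
    timeout=60,
)
resp.raise_for_status()
for binding in resp.json()["results"]["bindings"]:
    print(binding["langLabel"]["value"], binding["wiki"]["value"])
```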