Research:Knowledge Gaps Index/Wikimania2023


This page contains supplementary material to the Wikimania 2023 video presentation Make the Knowledge Gap Index Your Own. Please use the talk page of this article to give us your feedback about the talk!

Context

The video of the Wikimania 2023 presentation

Since 2019, the Wikimedia Foundation has been investing in developing the Knowledge Gap Index. We have identified more than 30 knowledge gaps on Wikipedia and have been methodically building metrics to measure each of them. Since March 2023, the first batch of public data has been available for our communities to start interacting with.

Our Wikimania 2023 presentation provides context about the knowledge gap index projects, looks in depth at measurements for the gender gap, and gives an overview of the tools we built to help you make the knowledge gaps data your own.

Data

Are you interested in looking at the raw knowledge gaps data? Public versions of the data are available for you to download. Simply choose the gap you are interested in and download the corresponding CSV file.

Every row in the CSV data contains the following columns:

  • wiki_db: the wiki database name, for example enwiki, itwiki, etc.
  • category: the underlying categories for each gap, for example "men", "women", "Europe", etc.
  • time_bucket: the snapshot at which the metric is recorded, with monthly granularity (e.g. 2020-02)
  • [metric value columns] which contain the measurements for the following: article_count_value; article_created_value; pageviews_sum_value; pageviews_mean_value; standard_quality_value; standard_quality_count_value; quality_score_value; revision_count_value. An explanation of relevant metrics:
    • article_created_value: number of articles created for each category in the time bucket
    • pageviews_sum_value: total number of pageviews for each category in the time bucket
    • pageviews_mean_value: mean number of pageviews for each category in the time bucket
    • revision_count_value: total number of edits for each category in the time bucket
    • quality_score_value: average article quality score for each category in the time bucket
    • standard_quality_value: percentage of articles in each category that are above a standard quality threshold in the time bucket
  • [total columns] which contain the totals across all categories for the following: article_count_total; article_created_total; pageviews_sum_total; pageviews_mean_total; standard_quality_total; standard_quality_count_total; quality_score_total; revision_count_total. An explanation of relevant metrics:
    • article_created_total: number of articles created across all categories in the time bucket
    • pageviews_sum_total: total number of pageviews across all categories in the time bucket
    • pageviews_mean_total: mean number of pageviews across all categories in the time bucket
    • revision_count_total: total number of edits across all categories in the time bucket
    • quality_score_total: average article quality score across all categories in the time bucket
    • standard_quality_total: percentage of articles that are above a standard quality threshold across all categories in the time bucket
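
Because every row carries both a per-category value and the matching total, one natural way to read the data is as a share of the total. The sketch below is a minimal illustration in Python, not the method used in the presentation: the sample rows and their numbers are made up, the helper name category_share is hypothetical, and it assumes that dividing a count-like *_value column by its *_total column gives a meaningful share for that metric. Only the column names come from the list above.

```python
import csv
import io

# Hypothetical sample rows; the numbers are made up for illustration only.
# A real file would be one of the downloaded CSVs and contain more columns.
SAMPLE_CSV = """wiki_db,category,time_bucket,article_created_value,article_created_total
enwiki,men,2020-02,300,400
enwiki,women,2020-02,100,400
itwiki,men,2020-02,60,80
itwiki,women,2020-02,20,80
"""


def category_share(rows, metric):
    """Return {(wiki_db, time_bucket, category): value / total} for one metric.

    `metric` is the column prefix, e.g. "article_created" for the
    article_created_value and article_created_total columns.
    """
    shares = {}
    for row in rows:
        value = float(row[f"{metric}_value"])
        total = float(row[f"{metric}_total"])
        if total == 0:
            continue  # skip buckets with a zero total to avoid dividing by zero
        key = (row["wiki_db"], row["time_bucket"], row["category"])
        shares[key] = value / total
    return shares


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
shares = category_share(rows, "article_created")
for key, share in sorted(shares.items()):
    print(key, f"{share:.0%}")
```

To try this on real data, replace io.StringIO(SAMPLE_CSV) with an open file handle for a downloaded CSV. A ratio of this kind only makes sense for count-like metrics such as article_created or revision_count; averages such as quality_score are better compared side by side than divided.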

Plots in the Presentation

Plots and raw data for the slides in the presentation can be found in this spreadsheet, together with more interesting data about the gender gap.

Knowledge Gaps API

The documentation for the upcoming knowledge gaps API can be found in the corresponding meta page.

Visualization Notebooks

The R notebooks for navigating the knowledge gaps data are available in PAWS.

  • Gender gap notebook here
  • Geography gap notebook here


Feedback

Do you have any feedback on our presentation, the knowledge gap project, or the data? Please let us know! Use our talk page to share your comments and suggestions. Thank you!