Research:Knowledge Gaps Index/Measurement/Content
Content Gap Metrics
FACET | GAP | METRIC | CURRENTLY MEASURED? | FREQUENCY |
The Dimensions' Facet. | The Knowledge gap. e.g. Gender | How do we measure the gap? | Actual implementation of the measurements | How often do we make the measurements? |
Representation | Gender | Time series of content gap metrics over gender mappings | Yes, datasets / notebook available | Monthly |
Representation | Age | Time series of content gap metrics over time mappings | Yes, datasets available | Monthly |
Representation | Geography | Time series of content gap metrics over geographic mappings | Yes, datasets / notebook available | Monthly |
Representation | Language | We will be planning more research to measure this gap. | In progress, repo | Monthly (ideal) |
Representation | Socio-economic Status | We will be planning more research to measure this gap. | No | Monthly (ideal) |
Representation | Cultural Background | We will be planning more research to measure this gap. | No | Monthly (ideal) |
Representation | Topics for Impact | We will be planning more research to measure this gap. | No | Monthly (ideal) |
Representation | Sexual Orientation | Time series of content gap metrics over sexual orientation mappings | Yes, datasets available | Monthly |
Interaction | Readability | Currently working on this: follow along our research on multilingual readability | No | Monthly (ideal) |
Interaction | Structured Data | Currently working on this: follow along our research on Wikidata item quality | No | Monthly (ideal) |
Interaction | Multimedia | Time series of content gap metrics over multimedia mappings | Yes, datasets available | Monthly |
Metrics for Aggregation
As shown by previous research, content coverage, namely how well Wikimedia project content addresses a particular topic, can be described in different ways. In the content gap metrics, we operationalize two dimensions of content coverage:
- Selection: whether the content is present or not
- Extent: how much content the topic has overall, i.e. its quality
The gaps in each wiki are described according to the following metrics:
article_created
: number of articles created for each category, which reflects the selection of the content gap
pageviews_sum
: total number of pageviews for each category, which reflects the selection of the content gap from the readers' perspective
pageviews_mean
: mean number of pageviews for each category, see pageviews_sum above
revision_count
: total number of edits for each category, which reflects the selection of the content gap from the editors' perspective
quality_score
: average article quality score for each category, which reflects the extent of the content gap
standard_quality_count
: number of articles in the category that satisfy the Standard Quality Criteria
standard_quality
: the average of the standard quality score. As the standard quality is binary, this is the ratio of articles that satisfy the Standard Quality Criteria
For the article quality metrics, which are content based, the last revision to an article in a given month is used to calculate the quality. If an article was not edited in a given month, the score from the previous month is used for aggregation.
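To make the relationship between these metrics concrete, here is a minimal Python sketch that aggregates a few article records for a single category and carries a quality score forward over months without edits. The record fields (`pageviews`, `revisions`, `quality_score`, `standard_quality`) and both helper names are hypothetical and are not taken from the actual pipeline. Counting `article_created` as the number of records passed in is also an assumption.

```python
from statistics import mean


def carry_forward(monthly_scores):
    """Fill months without a revision (None) with the previous month's score."""
    filled, last = [], None
    for score in monthly_scores:
        if score is not None:
            last = score
        filled.append(last)
    return filled


def aggregate_category(articles):
    """Compute the content gap metrics for the articles of one category.

    Each article is a dict with hypothetical keys: pageviews, revisions,
    quality_score (float) and standard_quality (0 or 1).
    """
    return {
        "article_created": len(articles),  # assumption: one record per created article
        "pageviews_sum": sum(a["pageviews"] for a in articles),
        "pageviews_mean": mean(a["pageviews"] for a in articles),
        "revision_count": sum(a["revisions"] for a in articles),
        "quality_score": mean(a["quality_score"] for a in articles),
        "standard_quality_count": sum(a["standard_quality"] for a in articles),
        # Mean of a binary flag = share of articles meeting the criteria.
        "standard_quality": mean(a["standard_quality"] for a in articles),
    }


sample_articles = [
    {"pageviews": 100, "revisions": 5, "quality_score": 0.8, "standard_quality": 1},
    {"pageviews": 300, "revisions": 1, "quality_score": 0.4, "standard_quality": 0},
    {"pageviews": 200, "revisions": 0, "quality_score": 0.3, "standard_quality": 0},
]
sample_metrics = aggregate_category(sample_articles)
```

In this sketch, `standard_quality` is the mean of a 0/1 flag, so it can be read directly as the share of articles in the category that meet the criteria.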
Standard Quality Criteria
An article is of standard+ quality if it meets at least 5 of the following 6 criteria:
- It is at least 8kB long in size
- It has at least 1 category
- It has at least 7 sections
- It is illustrated with 1 or more images
- It has at least 4 references
- It has 2 or more intra-wiki links
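The "at least 5 of 6" rule can be sketched as a simple count over boolean checks. The function and parameter names below are hypothetical, and reading 8kB as 8,000 bytes is an assumption (the source does not say whether 8,000 or 8,192 bytes is meant).

```python
MIN_SIZE_BYTES = 8_000  # assumption: 8kB interpreted as 8,000 bytes


def meets_standard_quality(size_bytes, categories, sections, images,
                           references, wikilinks):
    """Return True if an article satisfies at least 5 of the 6 criteria."""
    criteria = [
        size_bytes >= MIN_SIZE_BYTES,  # at least 8kB long
        categories >= 1,               # at least 1 category
        sections >= 7,                 # at least 7 sections
        images >= 1,                   # illustrated with 1 or more images
        references >= 4,               # at least 4 references
        wikilinks >= 2,                # 2 or more intra-wiki links
    ]
    # True counts as 1, so the sum is the number of criteria met.
    return sum(criteria) >= 5
```

For example, an article with no images that meets the other five criteria still qualifies, while one that is both too short and has too few sections does not.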
Aggregation levels
The metrics are computed at multiple aggregation levels.
Category level
The most granular dataset is at the level of the content gap category and per wiki. The all_wikis version is aggregated across all wikis.
- by category: [ wiki_db, content_gap, category, time_bucket ], datasets: see the individual content gap datasets table above.
- by category across all wikis: [ content_gap, category, time_bucket ], dataset
Content gap level
The metrics are aggregated across all categories of the content gaps and per wiki. The all_wikis version is aggregated across all wikis.
- by content gap: [ wiki_db, content_gap, time_bucket ], dataset
- by content gap across all wikis: [ content_gap, time_bucket ], dataset
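The aggregation levels differ only in which key columns are kept. As a rough illustration, the sketch below sums a count metric over category-level rows for different key sets. The helper `sum_by`, the sample rows, and the category names are hypothetical; only the column names come from the key lists above.

```python
from collections import defaultdict


def sum_by(rows, keys, metric):
    """Sum `metric` over all rows that share the same values for `keys`."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row[metric]
    return dict(totals)


# Hypothetical category-level rows: [ wiki_db, content_gap, category, time_bucket ]
category_rows = [
    {"wiki_db": "enwiki", "content_gap": "gender", "category": "female",
     "time_bucket": "2023-01", "article_created": 10},
    {"wiki_db": "enwiki", "content_gap": "gender", "category": "male",
     "time_bucket": "2023-01", "article_created": 30},
    {"wiki_db": "dewiki", "content_gap": "gender", "category": "female",
     "time_bucket": "2023-01", "article_created": 4},
    {"wiki_db": "dewiki", "content_gap": "gender", "category": "male",
     "time_bucket": "2023-01", "article_created": 6},
]

# Content gap level, per wiki: drop `category` from the keys.
by_gap = sum_by(category_rows, ["wiki_db", "content_gap", "time_bucket"],
                "article_created")

# Category level across all wikis: drop `wiki_db` from the keys.
by_category_all_wikis = sum_by(category_rows,
                               ["content_gap", "category", "time_bucket"],
                               "article_created")

# Content gap level across all wikis: drop both.
by_gap_all_wikis = sum_by(category_rows, ["content_gap", "time_bucket"],
                          "article_created")
```

This simple summing is only valid for count/sum based metrics such as `article_created`.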
Note that for metrics that are mean values (e.g. pageviews_mean, quality_score, standard_quality), 're-aggregating' a dataset (e.g. from [ content_gap, category, time_bucket ] to [ content_gap, time_bucket ]) will not yield the same results as the published coarser dataset, because a mean of per-category means is not, in general, the mean over all underlying articles. For the count/sum based metrics the numbers are identical.
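A small numeric sketch shows why re-aggregating a mean differs from the true mean while a sum does not. The rows and the `n_articles` field are hypothetical. The sketch assumes the mean is taken over the articles in each category, which the datasets may or may not expose as a separate column.

```python
# Hypothetical category-level rows for one content gap and time bucket.
rows = [
    {"category": "A", "n_articles": 1, "pageviews_sum": 1000, "pageviews_mean": 1000.0},
    {"category": "B", "n_articles": 3, "pageviews_sum": 300, "pageviews_mean": 100.0},
]


def reaggregate(rows):
    """Compare a naive mean of means with the mean over all articles."""
    total_articles = sum(r["n_articles"] for r in rows)
    total_views = sum(r["pageviews_sum"] for r in rows)
    return {
        # Sums can be re-aggregated safely.
        "pageviews_sum": total_views,
        # Averaging the category means gives every category equal weight.
        "naive_mean_of_means": sum(r["pageviews_mean"] for r in rows) / len(rows),
        # The mean over all articles weights each category by its size.
        "pageviews_mean": total_views / total_articles,
    }


result = reaggregate(rows)
```

Here the naive mean of means is 550.0, while the mean over all four articles is 325.0. The sum, 1300, is the same whichever way it is aggregated.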