Research:Map of Visual Knowledge Gaps
Wikimedia Commons contains 65 million images, but in many Wikipedias, over 50% of articles have no images. In Wikidata, only 5% of the items have images. But how exactly can we quantify the areas of content where we are missing more images, versus areas with disproportionate amounts of visual content? What about the quality of the existing content? And how can we evaluate whether the missing content is already somewhere in Wikimedia projects? The map of visual knowledge gaps can help with this.
We identified two main metrics to reflect the presence of images in Wikimedia projects.
- Number of articles/items missing images - this reflects the amount of missing content
- Number of existing images per article/item - this reflects the proportion of visual content
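As a minimal sketch of how these two metrics could be computed, assume each article or item has been reduced to a count of its images. The function name `image_metrics` and the input format are illustrative assumptions, not the pipeline used for this analysis. The second metric is averaged over illustrated articles only, which matches how it is reported in the sections below.

```python
from statistics import mean


def image_metrics(image_counts):
    """Compute the two metrics from per-article (or per-item) image counts.

    image_counts: hypothetical input, one non-negative integer per article/item.
    """
    total = len(image_counts)
    illustrated = [n for n in image_counts if n > 0]
    missing = total - len(illustrated)
    return {
        # Metric 1: how many articles/items have no image at all.
        "missing": missing,
        "pct_missing": 100.0 * missing / total if total else 0.0,
        # Metric 2: average number of images among illustrated articles/items.
        "images_per_illustrated": mean(illustrated) if illustrated else 0.0,
    }


# Toy data for illustration only, not real measurements.
print(image_metrics([0, 0, 3, 1, 0, 2]))
```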
Dimensions of analysis
We will break down the analysis of the metrics above by the following dimensions:
- Selection Gap: By Wikipedia project (languages divided by size in number of articles) or for the Wikidata project as a whole
- Topical Gap: By topic of the article or the corresponding Wikidata item
- Extension Gap: By article length or Wikidata item coverage on Wikipedia
Wikipedia Visual Knowledge Gaps
Selection Gap: Image Distribution Across Wikis
Below are two static plots: Wikipedia size vs % of illustrated articles, and Wikipedia size vs number of images in illustrated articles. The interactive version of these plots can be found at this link.
- Unillustrated Articles: The distribution of unillustrated articles varies a lot across Wikipedia editions. For English Wikipedia, the percentage of articles having an image is around 50%. By contrast, Venetian Wikipedia's articles are almost all illustrated (~86%), while Cebuano Wikipedia is largely unillustrated (only 7% of articles have an image). There is a minor correlation between Wikipedia size and percentage of articles without images: smaller Wikipedias tend to be more illustrated.
- Number of Images per Article: The number of images per illustrated article is much more uniform across Wikipedia editions. Most Wikis have 2-4 images per article, with outliers such as Karachay and Gagauz (both Turkic languages), which have up to 7 images on average per illustrated article.
Extension Gap: Image Distribution by Article Length
We divided Wikis by size: very small (<1'000 articles), small (<10'000 articles), medium (<100'000 articles), large (<1M articles), very large (>1M articles). We also partitioned articles by length, according to the number of articles across all wikis having a given length: very short (bottom 20%), short (mid-low 20%), medium (mid 20%), long (mid-top 20%), very long (top 20%). We computed the percentage of illustrated articles and the number of images for all combinations of wiki size and article length.
- For medium to large Wikipedias, shorter articles are less likely to have an image. In smaller Wikis, the probability of having an image does not seem to depend on article length.
- Across all wikis, the shorter the article, the smaller the number of images.
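The partitioning described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code used for the analysis: the function names, the tuple layout of `articles`, and the handling of exact boundary values (for example, a wiki with exactly 1M articles is treated as very large) are all assumptions.

```python
from bisect import bisect_right
from collections import defaultdict

# Size classes from the text; behaviour at the exact boundaries is an assumption.
SIZE_BOUNDS = [1_000, 10_000, 100_000, 1_000_000]
SIZE_LABELS = ["very small", "small", "medium", "large", "very large"]
LENGTH_LABELS = ["very short", "short", "medium", "long", "very long"]


def wiki_size_class(n_articles):
    """Map a wiki's article count to one of the five size classes."""
    return SIZE_LABELS[bisect_right(SIZE_BOUNDS, n_articles)]


def quintile_cuts(all_lengths):
    """Return 4 cut points splitting the pooled article lengths into 5 groups."""
    ordered = sorted(all_lengths)
    n = len(ordered)
    return [ordered[(n * k) // 5] for k in range(1, 5)]


def length_class(length, cuts):
    """Map an article length to its quintile label, given pooled cut points."""
    return LENGTH_LABELS[bisect_right(cuts, length)]


def illustrated_share(articles, cuts):
    """Percentage of illustrated articles per (wiki size, article length) cell.

    articles: hypothetical iterable of (wiki_n_articles, article_length, n_images).
    """
    totals = defaultdict(lambda: [0, 0])  # cell -> [illustrated, all articles]
    for wiki_n, length, n_images in articles:
        key = (wiki_size_class(wiki_n), length_class(length, cuts))
        totals[key][0] += n_images > 0
        totals[key][1] += 1
    return {key: 100.0 * ill / n for key, (ill, n) in totals.items()}
```

The cut points are computed once over the lengths of all articles in all wikis, so that "very short" means the same thing in every wiki, which is how the text describes the partition.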
Topic Gap: Image Distribution By Article Topics
I computed a topic for each article in each Wiki based on Isaac's topic classifier, then computed the distribution of illustrated articles and number of images per article across different topics. Below you can find the resulting plots. The bars in the bottom quadrant reflect the number of articles in a given topic.
- The most widely illustrated articles are about Architecture, Food, Fashion, Transportation, and Visual Arts, with more than 50% of articles illustrated.
- Articles about Europe are far more likely to be illustrated than articles about Africa, Asia and South America, with only about 30% of articles illustrated for Africa vs around 50% for Europe. Once they are illustrated, however, they have a comparable number of images.
- Articles about women are slightly more likely to have an image than an average article about a person!
- Articles about Movies, Books, TV, and Videogames generally have fewer images than articles about Architecture, Food, Fashion, Transportation and History, possibly because the relevant material is copyrighted.
Wikidata Visual Knowledge Gaps
I sampled 5 million Wikidata items, and extracted, for each item, the following:
- has_image: whether the item has an image or not
- coverage: the number of Wikipedia articles linking to the item. This is quantized into 4 values:
  - none if 0 articles link to the item
  - small if 1-10 articles link to the item
  - medium if 10-100 articles link to the item
  - large if >100 articles link to the item
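The quantization of coverage can be sketched as below. Note that the ranges as listed overlap at 10 and leave 100 ambiguous; this sketch assumes that 10 belongs to medium (consistent with the later "10 or more" grouping) and that 100 also belongs to medium, since large is defined as >100. The function name `coverage_class` is illustrative.

```python
def coverage_class(n_links):
    """Quantize the number of Wikipedia articles linking to an item.

    Boundary handling is an assumption: 1-9 -> small, 10-100 -> medium.
    """
    if n_links < 0:
        raise ValueError("number of links cannot be negative")
    if n_links == 0:
        return "none"
    if n_links < 10:
        return "small"
    if n_links <= 100:
        return "medium"
    return "large"


# Toy values for illustration only.
for n in (0, 3, 10, 100, 250):
    print(n, coverage_class(n))
```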
Selection and Extension Gap: Image Distribution by Item Coverage
Here are the major insights:
- Less than 4% of Wikidata items (more precisely, 3.93%) have an image associated with them.
- Fewer than 1/4 of Wikidata items are linked to 10 or more Wikipedia articles; around 38% have 0 links, and around 40% have between 1 and 10 links.
- Items with larger coverage are more likely to have an image: on average, more than 70% of items with large coverage have an image associated with them, against only around 1% of items with no coverage.
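Figures like the ones above come from grouping sampled items by coverage class and taking the share with an image in each group. A minimal sketch, assuming each sampled item is a `(has_image, coverage)` pair (a hypothetical record layout, with the function name also chosen for illustration):

```python
from collections import Counter


def illustration_rate_by_coverage(items):
    """Percentage of items with an image, per coverage class.

    items: hypothetical iterable of (has_image, coverage) pairs.
    """
    totals, illustrated = Counter(), Counter()
    for has_image, coverage in items:
        totals[coverage] += 1
        if has_image:
            illustrated[coverage] += 1
    return {c: 100.0 * illustrated[c] / totals[c] for c in totals}


# Toy sample for illustration only, not real measurements.
sample = [
    (True, "large"), (False, "large"),
    (False, "none"), (False, "none"),
    (True, "small"),
]
print(illustration_rate_by_coverage(sample))
```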
Topic Gap: Image Distribution by Item Topic
For each Wikidata item with at least 1 link to Wikipedia (i.e. small, medium, and large coverage), I associated one or more topics using the method based on @Isaac's Wikidata topic classifier.
Results are reported below and are similar to what was seen for Wikipedia:
- The most widely illustrated items are about Food, Transport, Fashion, Engineering, and Visual Arts: between 55 and 75 percent of items in these categories are illustrated.
- Items about Europe and America are far more likely to be illustrated than items about Africa, Asia and South America, with only about 20% of items illustrated for Africa vs 50% for Europe.
- Items about Movies, Books, TV, and Videogames are rarely illustrated, possibly because the relevant material is copyrighted.
- Items about women are more likely to have an image than an average item about a person: 53% of items about women are illustrated, vs 50% of items about people overall.
- For some areas, such as Biology, Chemistry and others, there is a mismatch between Wikipedia and Wikidata image coverage. To investigate this further, we should understand the proportion of unillustrated Wikipedia articles for which we have Wikidata images, and partition it by topic.
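The follow-up analysis proposed in the last point could look roughly like the sketch below. It has not been run on real data; the record layout `(topics, article_has_image, item_has_image)` and the function name are assumptions made for illustration. Because an item can carry more than one topic, each record is counted once under every topic it belongs to.

```python
from collections import defaultdict


def recoverable_share_by_topic(pairs):
    """Per topic, the percentage of unillustrated articles whose item has an image.

    pairs: hypothetical iterable of (topics, article_has_image, item_has_image),
    one record per Wikipedia article and its corresponding Wikidata item.
    """
    counts = defaultdict(lambda: [0, 0])  # topic -> [item has image, unillustrated]
    for topics, article_has_image, item_has_image in pairs:
        if article_has_image:
            continue  # only unillustrated articles are of interest here
        for topic in topics:
            counts[topic][1] += 1
            counts[topic][0] += bool(item_has_image)
    return {t: 100.0 * hit / n for t, (hit, n) in counts.items()}


# Toy records for illustration only, not real measurements.
records = [
    (["Biology"], False, True),
    (["Biology", "Chemistry"], False, False),
    (["Chemistry"], True, True),
]
print(recoverable_share_by_topic(records))
```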