Research:Visual Knowledge Gaps
Wikimedia Research’s Knowledge Gaps white paper underlines the need for methods to quantify knowledge gaps and the global usage of multimedia content in Wikimedia spaces. As the final product of this strategic direction, we envision a map of visual knowledge gaps that characterizes and quantifies biases and gaps in multimedia content in Wikimedia spaces. With this project, we start by quantifying and qualifying visual gaps in biographies. Biographies make up a large percentage of articles in Wikipedia (27% in English Wikipedia) and are among the most visited pages. Knowledge gaps in biographies are therefore more visible than others, and visual content plays a crucial role in the visibility of such gaps: articles without images might be perceived as less relevant, important, or visible. Moreover, images in Wikipedia are highly visible on the broader web: Commons images attached to Wikipedia articles spread easily, as image search platforms surface them at the top of search results and they can be re-used in many contexts. To help reduce these gaps, initiatives such as Whose Knowledge’s VisibleWikiWomen have been running editing campaigns focused on adding images of famous women to Wikipedia. While gender and geographic gaps in Wikipedia content have been quantified and, to some extent, qualified from a general content production perspective, the role images play in such gaps remains unclear. This project aims to address this question.
We will focus on images used in biographies in Wikipedia. With this research, we try to characterize gaps in two types of visual content.
- Missing Content: visual material that is not there to begin with.
- Existing Content: visual content that is there already but might be biased in some way.
[In progress] Research Questions
To examine the two types of content gaps, we are asking the following questions:
- [Qualitative] What are visual knowledge gaps, and why are they important for knowledge and culture dissemination? The outcome is a taxonomy of visual knowledge gaps in biographies: we will identify a set of potential content gaps that can emerge when representing people (e.g. gender, profession, location).
- [Quantitative] Can we quantify visual content gaps related to biographic content? Here, we focus on the missing content and on the types of omissions that frame the subject matter through its absence, i.e. material that should be there but for some reason is not. We will perform a large-scale analysis of the amount of visual content associated with Wikipedia articles (and Wikidata items?) about people. We will quantify the proportion of visual content across the dimensions identified in the taxonomy.
- [Mixed] Can we qualify visual content gaps related to biographic content? How does the existing visual content frame the subject matter, and what are the problems in this framing? Here, we want to go deeper into the analysis, beyond mere quantification of the gap. We want to understand whether different pictorial techniques are used to represent different segments of people across different language communities. We will map similarities and differences across articles about people of different genders, locations, and languages, and investigate why certain groups of articles tend to be more similar or dissimilar.
- [Longer Term] Debiasing Visual Gaps. How can we use computational tools to surface images in ways that are less "biased" than existing practices? How do we conceptualize a normative standpoint that determines what counts as a better and more inclusive visual framing of biographies, and what criteria do we use for this?
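The quantitative question above comes down to computing, for each value of a taxonomy dimension, the share of biographies that have at least one image. The following is a minimal sketch of that measurement, not the project's actual pipeline: the record layout (`gender`, `region`, `image_count`) and the function name `image_coverage` are assumptions made for illustration, and the sample records are invented toy data.

```python
from collections import defaultdict

def image_coverage(biographies, dimension):
    """Return the share of biographies with at least one image,
    for each value of the given dimension (hypothetical record layout)."""
    totals = defaultdict(int)
    with_image = defaultdict(int)
    for bio in biographies:
        group = bio.get(dimension, "unknown")
        totals[group] += 1
        if bio["image_count"] > 0:
            with_image[group] += 1
    return {group: with_image[group] / totals[group] for group in totals}

# Toy records, invented for illustration only (not real Wikipedia data).
sample = [
    {"gender": "female", "region": "Africa", "image_count": 0},
    {"gender": "female", "region": "Europe", "image_count": 2},
    {"gender": "male", "region": "Europe", "image_count": 1},
    {"gender": "male", "region": "Africa", "image_count": 1},
]

coverage_by_gender = image_coverage(sample, "gender")
coverage_by_region = image_coverage(sample, "region")
```

The same function can be applied to any dimension the taxonomy ends up containing, as long as each biography record carries a value for it.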
Under-represented people in Wikipedia
This project aims to analyze and understand the visual representation of some groups of people usually scarcely documented in Wikipedia. The aim is to assess whether these social categories are being systemically excluded or disadvantaged in their visual representation, either through a relative scarcity or an unfavorable use of images in the biographies of the representatives of these groups.
The focus is on information gaps that have already been extensively documented at the written level in Wikipedia, such as gender and geo-referenced information gaps, or that, although little researched so far, are of high historical importance and have been the subject of organized campaigns to improve their encyclopedic documentation, such as the racial gap. For all these information gaps, we will determine whether, in addition to the smaller number of articles associated with under-represented social groups, there are visual biases that aggravate the inequality in the record and discourage the development of a social memory of these groups.
The information gaps investigated in this project were chosen because of 1) their importance in previous research or in campaigns to improve their record in Wikipedia, and 2) their strong links to social or cultural movements with a high global impact. These gaps are the following:
1. Gender gap: feminist movements since the end of the 18th century have historically raised the social relevance of this gap. Within Wikipedia, it has been debated mainly because of the under-representation of women's biographies, a phenomenon that has been quite well documented and has proved controversial in recent years. This information gap brings together several related dimensions of exclusion: women have fewer biographies than men, are less involved in editing articles, and use Wikipedia less as a source of information. This has led to several campaigns focused on improving encyclopedic documentation on notable women (e.g. #VisibleWikiWomen 2019).
2. Geographical gap: the importance of unequal geographical structures of knowledge has been widely investigated since the second half of the 20th century, mainly in post-colonial and decolonial studies. In the case of Wikipedia, uneven coverage of geo-referenced information has been widely diagnosed, implying that this medium documents some dominant territories better and excludes others. Although the intensity varies by language version, there is a broad tendency for information to be concentrated on North America and Western Europe, along with a relative lack of coverage of some regions of Africa, the Middle East, Latin America and Asia.
3. Racial gap: this gap has been strongly contested since the end of the 18th century, for example by abolitionist movements or those promoting equal civil rights. Recently, research on the digital society has given new relevance to this gap, for example by showing that some algorithms tend to structure information in a way that is unfavorable to some racial groups. In the case of Wikipedia, no research has so far established the relevance of this racial gap in biographies. However, there is a fairly widespread intuition that this gap exists (for example, given the absence of several references in "black history"), and campaigns have even been organized to reduce this inequality of information.
Main control variables
These gaps are likely to vary greatly depending on the language version of Wikipedia taken as a reference and the occupations of the people analyzed. Some research shows that language versions of Wikipedia structure information in many different ways, and other studies show, for example, that among those born in the last century there is an extremely high number of biographies of artists and sportspeople. Therefore, the gender, geographical and racial gaps will be investigated by controlling for these two variables: a) the language in which the Wikipedia record is produced and b) the main occupation of the people described in the biographies. This will also help focus future campaigns that seek to combat Wikipedia's systemic biases on specific languages and occupations.
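One simple way to control for language and occupation is to compare groups only within the same (language, occupation) stratum, so that a difference in image coverage is not just a side effect of, say, one group being concentrated in a heavily illustrated occupation. The sketch below illustrates that idea under stated assumptions: the field names (`language`, `occupation`, `gender`, `has_image`) and the function `coverage_gap_by_stratum` are hypothetical, and the records are invented toy data. It is one possible form of stratification, not the method the project commits to.

```python
from collections import defaultdict

def coverage_gap_by_stratum(biographies, group_key, group_a, group_b,
                            strata_keys=("language", "occupation")):
    """Within each stratum, return image coverage of group_a minus
    image coverage of group_b. Strata missing either group are skipped."""
    # stratum -> group -> [biographies with image, total biographies]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for bio in biographies:
        stratum = tuple(bio[key] for key in strata_keys)
        cell = counts[stratum][bio[group_key]]
        cell[0] += int(bio["has_image"])
        cell[1] += 1
    gaps = {}
    for stratum, groups in counts.items():
        if group_a in groups and group_b in groups:
            a_with, a_total = groups[group_a]
            b_with, b_total = groups[group_b]
            gaps[stratum] = a_with / a_total - b_with / b_total
    return gaps

# Toy records, invented for illustration only (not real Wikipedia data).
records = [
    {"language": "en", "occupation": "athlete", "gender": "female", "has_image": True},
    {"language": "en", "occupation": "athlete", "gender": "female", "has_image": False},
    {"language": "en", "occupation": "athlete", "gender": "male", "has_image": True},
    {"language": "es", "occupation": "artist", "gender": "female", "has_image": True},
    {"language": "es", "occupation": "artist", "gender": "male", "has_image": True},
    {"language": "es", "occupation": "scientist", "gender": "male", "has_image": False},
]

gaps = coverage_gap_by_stratum(records, "gender", "female", "male")
```

A negative value means the first group has lower image coverage than the second within that stratum. Strata with very few biographies give noisy estimates, so a real analysis would likely need a minimum stratum size.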
Towards the study of “multidimensional gaps”
Studies to date have investigated information gaps separately, for example gender or geographic location biases. However, these gaps are not mutually exclusive in biographies; they compound, and this could generate important differences in representation levels within excluded groups. For example, several investigations point out that women have a low biographical presence in Wikipedia, but the representation of white European women and black African women is likely to be quite different. This implies that the study of biographical gaps should adopt a multidimensional perspective that allows different representation gaps to be combined when describing specific people.
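To make the multidimensional argument concrete, the sketch below computes image coverage for combinations of dimensions rather than one dimension at a time. In the invented toy data, coverage by gender alone looks identical, while the gender-by-region breakdown reveals a group with no images at all. The field names and the function `joint_coverage` are assumptions for illustration, and the numbers are not real measurements.

```python
from collections import Counter

def joint_coverage(biographies, dimensions):
    """Return image coverage for every observed combination of the
    given dimensions, keyed by a tuple of dimension values."""
    totals, with_image = Counter(), Counter()
    for bio in biographies:
        cell = tuple(bio[dim] for dim in dimensions)
        totals[cell] += 1
        with_image[cell] += int(bio["has_image"])
    return {cell: with_image[cell] / totals[cell] for cell in totals}

# Toy records, invented for illustration only (not real Wikipedia data).
toy = [
    {"gender": "female", "region": "Europe", "has_image": True},
    {"gender": "female", "region": "Europe", "has_image": True},
    {"gender": "female", "region": "Africa", "has_image": False},
    {"gender": "female", "region": "Africa", "has_image": False},
    {"gender": "male", "region": "Europe", "has_image": True},
    {"gender": "male", "region": "Europe", "has_image": False},
    {"gender": "male", "region": "Africa", "has_image": True},
    {"gender": "male", "region": "Africa", "has_image": False},
]

by_gender = joint_coverage(toy, ["gender"])
by_gender_and_region = joint_coverage(toy, ["gender", "region"])
```

Here `by_gender` reports 0.5 for both genders, whereas `by_gender_and_region` separates European women (1.0) from African women (0.0), which is the kind of difference a single-dimension analysis would miss.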
Leigh Gruwell, «Wikipedia’s politics of exclusion: Gender, epistemology, and feminist rhetorical (in) action», Computers and Composition 37 (2015): 117–131; Christina Shane-Simpson and Kristen Gillespie-Lynch, «Examining potential mechanisms underlying the Wikipedia gender gap through a collaborative editing task», Computers in Human Behavior 66 (2017): 312–328; Marit Hinnosaar, «Gender inequality in new media: Evidence from Wikipedia», Journal of Economic Behavior & Organization 163 (2019): 262–276.
Hinnosaar, «Gender inequality in new media».
Mark Graham et al., «Uneven geographies of user-generated information: patterns of increasing informational poverty», Annals of the Association of American Geographers 104, no. 4 (2014): 746–764; Mark Graham, «Information geographies and geographies of information», New geographies, 2015; Uri Roll et al., «Using Wikipedia page views to explore the cultural importance of global reptiles», Biological conservation 204 (2016): 42–50.
Simon E. Overell and Stefan Rüger, «View of the world according to Wikipedia: Are we all little Steinbergs?», Journal of Computational Science 2, no. 3 (2011): 193–197.
Mark Graham, S. A. Hale, and M. Stephens, «Geographies of the World’s Knowledge» (Ed. Flick, C. M., London, Convoco! Edition., 2011); Mark Graham, Stefano De Sabbata, and Matthew A. Zook, «Towards a study of information geographies:(im) mutable augmentations and a mapping of the geographies of information», Geo: Geography and environment 2, no. 1 (2015): 88–105.
Joy Buolamwini and Timnit Gebru, «Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification», in Proceedings of the 1st Conference on Fairness, Accountability and Transparency, 2018, 15.
Overell and Rüger, «View of the world according to Wikipedia»; Pablo Aragón et al., «Biographical social networks on Wikipedia: a cross-cultural study of links that made history», in Proceedings of the eighth annual international symposium on Wikis and open collaboration (ACM, 2012), 19; Young-Ho Eom et al., «Interactions of cultures and top people of Wikipedia from ranking of 24 language editions», PloS one 10, no. 3 (2015): e0114825; Roll et al., «Using Wikipedia page views to explore the cultural importance of global reptiles».
Ilia Reznik and Vladimir Shatalov, «Hidden revolution of human priorities: An analysis of biographical data from Wikipedia», Journal of informetrics 10, no. 1 (2016): 124–131; C. Jara-Figueroa, Amy Z. Yu, and César A. Hidalgo, «How the medium shapes the message: Printing and the rise of the arts and sciences», PloS one 14, no. 2 (2019): e0205771.
Aragón, Pablo, David Laniado, Andreas Kaltenbrunner, and Yana Volkovich. 2012. «Biographical social networks on Wikipedia: a cross-cultural study of links that made history». In Proceedings of the eighth annual international symposium on Wikis and open collaboration, 19. ACM.
Buolamwini, Joy, and Timnit Gebru. 2018. «Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification». In Proceedings of the 1st Conference on Fairness, Accountability and Transparency, 15.
Eom, Young-Ho, Pablo Aragón, David Laniado, Andreas Kaltenbrunner, Sebastiano Vigna, and Dima L. Shepelyansky. 2015. «Interactions of cultures and top people of Wikipedia from ranking of 24 language editions». PloS one 10 (3): e0114825.
Graham, Mark. 2015. «Information geographies and geographies of information». New geographies.
Graham, Mark, Stefano De Sabbata, and Matthew A. Zook. 2015. «Towards a study of information geographies:(im) mutable augmentations and a mapping of the geographies of information». Geo: Geography and environment 2 (1): 88–105.
Graham, Mark, S. A. Hale, and M. Stephens. 2011. «Geographies of the World’s Knowledge». Ed. Flick, C. M., London, Convoco! Edition.
Graham, Mark, Bernie Hogan, Ralph K. Straumann, and Ahmed Medhat. 2014. «Uneven geographies of user-generated information: patterns of increasing informational poverty». Annals of the Association of American Geographers 104 (4): 746–764.
Gruwell, Leigh. 2015. «Wikipedia’s politics of exclusion: Gender, epistemology, and feminist rhetorical (in) action». Computers and Composition 37: 117–131.
Hinnosaar, Marit. 2019. «Gender inequality in new media: Evidence from Wikipedia». Journal of Economic Behavior & Organization 163: 262–276.
Jara-Figueroa, C., Amy Z. Yu, and César A. Hidalgo. 2019. «How the medium shapes the message: Printing and the rise of the arts and sciences». PloS one 14 (2): e0205771.
Overell, Simon E., and Stefan Rüger. 2011. «View of the world according to Wikipedia: Are we all little Steinbergs?» Journal of Computational Science 2 (3): 193–197.
Reznik, Ilia, and Vladimir Shatalov. 2016. «Hidden revolution of human priorities: An analysis of biographical data from Wikipedia». Journal of informetrics 10 (1): 124–131.
Roll, Uri, John C. Mittermeier, Gonzalo I. Diaz, Maria Novosolov, Anat Feldman, Yuval Itescu, Shai Meiri, and Richard Grenyer. 2016. «Using Wikipedia page views to explore the cultural importance of global reptiles». Biological conservation 204: 42–50.
Shane-Simpson, Christina, and Kristen Gillespie-Lynch. 2017. «Examining potential mechanisms underlying the Wikipedia gender gap through a collaborative editing task». Computers in Human Behavior 66: 312–328.