Research:Gender gap in Wikipedia's content

From Meta, a Wikimedia project coordination wiki
no affiliation
Duration:  2017-06 — 2017-09
gender gap, content, Wikipedia

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.

Only 15% of all biographies on English Wikipedia are about women. Women and men are portrayed differently on Wikipedia in terms of article structure, the use of infoboxes, network properties, notability, etc. This research project aims to map the gender gap in Wikipedia's content. The work is a follow-up report to Netha's presentation at WikiWomenCamp 2017. The aim is to create a systematic review of peer-reviewed published research on the gender gap in Wikipedia's content.


Slides 15-19 contain key points regarding the content bias on Wikipedia
  • Find all relevant articles for the analysis using Google Scholar. The keywords used are 'Wikipedia', 'gender', 'content', 'women', 'bias' and relevant combinations of these words.
  • Screen the title and abstract to include only those studies that fit the inclusion criteria, then screen the content further to include only studies about the gender gap in Wikipedia's content.
  • Assess the validity of the results.
  • Present the findings systematically.


  • Article search and screening (done)
  • Assessment of articles (done)
  • Presentation of findings (done)
  • Write a research report (ongoing) - due September 15, 2017


The results were analysed under four categories:

  • Coverage bias: Coverage bias occurs when men and women are covered differently on Wikipedia. For example, coverage bias may manifest as differences in the number of notable women and men portrayed on Wikipedia.
Research | Data | Methods | Findings
Wagner et al [1] | Wikipedia in 6 language editions | Compared Wikipedia in 6 languages to several reference datasets (Freebase, Pantheon, Human Accomplishment); crawled the content of articles about people in the reference datasets using Wikipedia’s API (November 2014). | Men and women are covered equally well on Wikipedia, and articles about women tend to be longer than articles about men on Wikipedia, when compared to those from the reference datasets.
Graells-Garrido et al [2] | The DBPedia 2014 dataset; the English Wikipedia dump of October 2014 | The DBPedia and Wikipedia data dumps were analysed for metadata properties. The gender of a biography, whenever not mentioned, was determined by 'inferred gender for Wikipedia biographies' (Bamman and Smith). | 15% of articles in the 'Person' class were about women. In comparison to the global proportion of women, the categories that over-represent women are Artist, Royalty, FictionalCharacter, Noble, BeautyQueen, and Model.
Reagle & Rhue [3] | Biographical subjects from several sources (100 Most Influential Figures in American History, TIME magazine's list of 2008's most influential people, Chambers Biographical Dictionary, American National Biography Online) compared to English Wikipedia and Britannica. | A Python program was used to compare web pages related to the subjects targeted in the reference sources. The Google API was queried for the top four results. Gender was guessed by the balance of gendered pronouns (she, her, he, his). The length of an article is determined by the words of article content and does not include citations and other miscellany. | Wikipedia provides better coverage and longer articles on women than Britannica. Wikipedia has more articles about women than Britannica in absolute terms, but articles about women on Wikipedia are more likely to be missing than articles about men compared to Britannica.
Wagner et al [4] | DBPedia 2014 dataset, inferred gender for Wikipedia bios | Calculated the number of language editions in which each biography is represented and the Google search volume for women's biographies, and compared them with Wikipedia articles. | Women in Wikipedia are more notable than men, which the authors interpret as the outcome of a subtle glass ceiling effect.
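The pronoun-balance heuristic described for Reagle & Rhue [3] can be illustrated with a minimal sketch. This is not the authors' program: the function name `guess_gender`, the tokenisation and the handling of ties are assumptions made for illustration; only the four pronouns come from the description of the method.

```python
import re

# Pronouns named in the method description; the grouping is an assumption.
FEMALE_PRONOUNS = {"she", "her"}
MALE_PRONOUNS = {"he", "his"}


def guess_gender(text):
    """Guess the gender of a biography from the balance of gendered pronouns."""
    # Lower-case the text and split it into alphabetic word tokens.
    words = re.findall(r"[a-z]+", text.lower())
    female = sum(1 for w in words if w in FEMALE_PRONOUNS)
    male = sum(1 for w in words if w in MALE_PRONOUNS)
    if female > male:
        return "female"
    if male > female:
        return "male"
    # Hypothetical fallback for ties or texts without pronouns.
    return "unknown"


sample = "She published her first paper in 1903. He reviewed it."
print(guess_gender(sample))  # prints "female"
```

A heuristic of this kind is cheap to run over many pages, but it can misclassify biographies that mention a spouse or colleague more often than the subject.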
  • Structural bias: Structural bias refers to gender-specific tendencies in how articles about notable people are connected. For example, there may be more links to men's biographies on articles related to women.
Research | Data | Methods | Findings
Wagner et al [1] | Wikipedia in 6 language editions | Used Wikipedia’s API (November 2014); analysed the probability that a link from an article with gender g1 ends in an article with gender g2. | Articles about women connect less to articles about men via interlinks. Articles about people with the same gender tend to link to each other. Articles about women tend to link more to articles about men than the opposite. Men are more central than women in English, Russian and German language Wikipedia.
Graells-Garrido et al [2] | The DBPedia 2014 dataset; the English Wikipedia dump of October 2014 | The proportion of links from gender to gender was calculated and tested against expected proportions. Analysed the distribution of PageRank by gender to understand centrality. | Women's biographies tend to link more to other women than to men. The articles with the highest centrality tend to be predominantly about men, beyond what one could expect from the structure of the network.
Wagner et al [4] | DBPedia 2014 dataset, inferred gender for Wikipedia bios, attributes, PageRank | Explored to what extent the connectivity between people is influenced by gender. Investigated the relation between the centrality of people and their gender using PageRank. | The top-ranked women according to PageRank are slightly less central than men, and the centrality of women decreases faster than that of men with decreasing rank. There exists a bias in the generation of links by Wikipedia editors, favoring articles about men.
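The gender-to-gender link proportions used in these studies can be sketched roughly as below. The function names, the toy data and the choice of baseline (the share of each gender among all biographies) are assumptions for illustration; the cited papers may define the expected proportions differently.

```python
from collections import Counter


def link_proportions(gender, links):
    """Estimate P(target gender | source gender) from a list of links.

    gender: dict mapping article title -> gender label
    links:  iterable of (source, target) article pairs
    """
    pair_counts = Counter()
    source_totals = Counter()
    for src, dst in links:
        # Only count links where both ends are biographies with a known gender.
        if src in gender and dst in gender:
            pair_counts[(gender[src], gender[dst])] += 1
            source_totals[gender[src]] += 1
    return {pair: n / source_totals[pair[0]] for pair, n in pair_counts.items()}


def expected_proportions(gender):
    """Assumed baseline: the share of each gender among all biographies."""
    counts = Counter(gender.values())
    total = len(gender)
    return {g: n / total for g, n in counts.items()}


# Toy data (hypothetical article names).
toy_gender = {"A": "F", "B": "F", "C": "M", "D": "M", "E": "M"}
toy_links = [("A", "B"), ("A", "C"), ("B", "A"),
             ("C", "D"), ("D", "E"), ("E", "C"), ("C", "A")]

observed = link_proportions(toy_gender, toy_links)
baseline = expected_proportions(toy_gender)
for (src, dst), p in sorted(observed.items()):
    print(f"{src} -> {dst}: observed {p:.2f}, expected {baseline[dst]:.2f}")
```

Comparing each observed proportion with the baseline for the target gender shows whether links to that gender are over- or under-represented; a real analysis would also need a significance test.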
  • Lexical bias: Lexical bias refers to inequalities in the terms used to describe men and women on Wikipedia. For example, articles about women are more likely to include details about their family life.
Research | Data | Methods | Findings
Wagner et al [1] | Wikipedia in 6 language editions | Open vocabulary approach in which a classifier determines which words are most effective in distinguishing the gender of the person an article is about. Log likelihood ratios are used for comparing different feature-outcome relationships. | There is lower salience of male-related words in articles about men, which can be related to the idea of male as the null gender (there is a social bias to assume male as the standard gender in certain social situations). Words like “married”, “divorced”, “children” or “family” are much more frequently used in articles about women. This study confirms that men and women are presented differently on Wikipedia and that those differences go beyond what we would expect due to the history of gender inequalities.
Graells-Garrido et al [2] | The DBPedia 2014 dataset; the English Wikipedia dump of October 2014; the Linguistic Inquiry and Word Count (LIWC) dictionary | To explore which words are more strongly associated with each gender, Pointwise Mutual Information is measured over the vocabulary of both genders. Also considered burstiness, a measure of word importance in a single document according to the number of times the word appears within the document, under the assumption that important words appear more than once (they appear in bursts) when they are relevant in a given document. | Marriage- and sex-related content is more frequent in women's biographies, and cognition-related content is highlighted in men's biographies. Words most associated with men are mostly about sports, while the words most associated with women relate to arts, gender and family. Of particular interest are two concepts strongly associated with women: "her husband" and "first woman".
Wagner et al [4] | Overview of English Wikipedia biographies, inferred gender for Wikipedia bios | Analysed gender topic, relationship topic and family topic in Wikipedia's biographies. Quantified the tendency of expressing positive and negative aspects of biographies with adjectives, as a measure of the degree of abstraction of positive and negative content. | Family-, gender-, and relationship-related topics are more present in biographies about women. Linguistic bias manifests in Wikipedia, since abstract terms tend to be used to describe positive aspects in the biographies of men and negative aspects in the biographies of women.
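Pointwise Mutual Information between a word and a gender, as mentioned for Graells-Garrido et al [2], can be sketched as follows. This is a document-level simplification under assumed definitions: probabilities are estimated as the fraction of biographies that contain the word, and the function name `pmi_by_gender` and the toy biographies are hypothetical.

```python
import math
from collections import Counter


def pmi_by_gender(docs):
    """Compute PMI(word, gender) = log( P(word, gender) / (P(word) * P(gender)) ).

    docs: list of (gender, words) pairs, one per biography.
    Probabilities are estimated as fractions of biographies.
    """
    n = len(docs)
    gender_count = Counter(g for g, _ in docs)
    word_count = Counter()
    joint_count = Counter()
    for g, words in docs:
        # Count each word at most once per biography.
        for w in set(words):
            word_count[w] += 1
            joint_count[(w, g)] += 1
    return {
        (w, g): math.log((c / n) / ((word_count[w] / n) * (gender_count[g] / n)))
        for (w, g), c in joint_count.items()
    }


# Toy biographies (hypothetical data).
toy_docs = [
    ("F", ["married", "scientist"]),
    ("F", ["married", "artist"]),
    ("M", ["scientist", "footballer"]),
    ("M", ["footballer"]),
]

scores = pmi_by_gender(toy_docs)
for (word, g), value in sorted(scores.items()):
    print(f"{word:<10} {g}  PMI = {value:.3f}")
```

A positive score means the word co-occurs with that gender more often than independence would predict; a score of zero means no association in the sample.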

  • Visibility bias: Visibility bias occurs when articles related to men and women are promoted differently within Wikipedia. For example, men's biographies are potentially more likely to be featured articles than women's biographies, although the difference is not significant.
Research | Data | Methods | Findings
Wagner et al [1] | Wikipedia in 6 language editions | Proportion of women's biographies that make it to the main page of Wikipedia. | The Wikipedia community's selection procedure for featured articles does not suffer from gender bias.


  1. Wagner, Claudia; Garcia, David; Jadidi, Mohsen; Strohmaier, Markus (May 2015). "It's a man's Wikipedia? Assessing Gender Inequality in an online Encyclopedia". Proceedings of the Ninth International AAAI Conference on Web and Social Media. Retrieved 28 July 2017.
  2. Graells-Garrido, Eduardo; Lalmas, Mounia; Menczer, Filippo (September 2015). "First Women, Second Sex: Gender Bias in Wikipedia". Social and Information Networks. Retrieved 28 July 2017.
  3. Reagle, Joseph; Rhue, Lauren (2011). "Gender bias in Wikipedia and Britannica". International Journal of Communication 5: 1138–1158. Retrieved 28 July 2017.
  4. Wagner, Claudia; Graells-Garrido, Eduardo; Garcia, David; Menczer, Filippo (2016). "Women through the glass ceiling: gender asymmetries in Wikipedia" (PDF). EPJ Data Science. Retrieved 30 July 2017.