Only 15% of all biographies on English Wikipedia are about women. Women and men are portrayed differently on Wikipedia in terms of article structure, the use of infoboxes, network properties, notability, etc. This research project aims to map the gender gap in Wikipedia's content. The work is a follow-up report to Netha's presentation at WikiWomenCamp 2017. The aim is to create a systematic review of peer-reviewed published research papers on the gender gap in Wikipedia's content.
Coverage bias: Coverage bias occurs when men and women are covered differently on Wikipedia. For example, coverage bias may manifest as differences in the number of notable women and men portrayed on Wikipedia.
Wikipedia in 6 languages was compared to several reference datasets (Freebase, Pantheon, Human Accomplishment). The content of articles about people in the reference datasets was crawled using Wikipedia’s API (November 2014).
Men and women are covered equally well on Wikipedia and articles about women tend to be longer than articles about men on Wikipedia, when compared to those from the reference datasets.
The DBPedia 2014 dataset, The Wikipedia English Dump of October 2014
The DBPedia dataset and the Wikipedia data dump were analysed for metadata properties. Where the gender of a biography was not stated, it was determined using 'inferred gender for Wikipedia biographies' (Bamman and Smith).
15% of articles in 'Person class' were about women. In comparison to the global proportion of women, the categories that over-represent women are Artist, Royalty, FictionalCharacter, Noble, BeautyQueen, and Model.
Biographical subjects from several sources (100 Most Influential Figures in American History, TIME magazine's list of 2008's most influential people, Chambers Biographical Dictionary, American National Biography Online) were compared against English Wikipedia and Britannica.
A Python program was used to compare web pages related to the subjects targeted in the reference sources. The Google API was queried for the top four results. Gender was guessed from the balance of gendered pronouns (she, her, he, his). The length of an article was measured as the number of words in the article content, excluding citations and other miscellany.
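The pronoun-balance heuristic and the word-count length measure described above can be sketched as follows. This is a minimal illustration, not the program used in the study: the function names, the tokenisation rule and the handling of ties (returned as "unknown") are assumptions made for the example.

```python
import re

# Pronoun sets taken from the description above (she, her, he, his).
FEMALE_PRONOUNS = {"she", "her"}
MALE_PRONOUNS = {"he", "his"}


def guess_gender(text):
    """Guess gender from the balance of gendered pronouns in the text.

    Ties, including texts with no gendered pronouns, return "unknown";
    that tie rule is an assumption of this sketch.
    """
    words = re.findall(r"[a-z']+", text.lower())
    female = sum(word in FEMALE_PRONOUNS for word in words)
    male = sum(word in MALE_PRONOUNS for word in words)
    if female > male:
        return "female"
    if male > female:
        return "male"
    return "unknown"


def article_length(text):
    """Article length as a whitespace-separated word count.

    The caller is assumed to have already stripped citations and
    other non-content material.
    """
    return len(text.split())
```

Matching whole tokens rather than substrings keeps words such as "the" or "here" from being counted as pronouns.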
Wikipedia provides better coverage and longer articles on women than Britannica. Wikipedia has more articles about women than Britannica in absolute terms but, relative to Britannica, articles about women on Wikipedia are more likely to be missing than articles about men.
DBPedia 2014 dataset, inferred gender for Wikipedia bios
Calculated the number of language editions in which each biography is represented and the Google search volume for women's biographies, and compared them with Wikipedia articles.
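As a rough illustration of the language-edition comparison, the sketch below summarises edition counts per gender. The input format, the function name and the choice of the median as the summary statistic are assumptions for the example; the study's own statistic is not given here.

```python
from statistics import median


def median_editions_by_gender(biographies):
    """Median number of language editions per gender.

    `biographies` is a hypothetical list of (gender, edition_count) pairs,
    one pair per biography.
    """
    by_gender = {}
    for gender, edition_count in biographies:
        by_gender.setdefault(gender, []).append(edition_count)
    return {gender: median(counts) for gender, counts in by_gender.items()}
```

A higher value for one gender would indicate that its biographies tend to appear in more language editions, which is the notability proxy described above.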
Women in Wikipedia are more notable than men, which the authors interpret as the outcome of a subtle glass ceiling effect.
Structural bias: Structural bias refers to gender-specific tendencies in how articles about notable people are linked to one another. For example, there may be more links to men's biographies on articles related to women.
Wikipedia’s API (November 2014), analysed for the probability that a link from an article with gender g1 ends in an article with gender g2.
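The link probability described above can be estimated from a list of links once each linked biography has a gender label. The sketch below is one straightforward way to do this; the input format and function name are assumptions for illustration.

```python
from collections import Counter


def link_probabilities(links):
    """Estimate P(target gender = g2 | source gender = g1).

    `links` is a hypothetical list of (source_gender, target_gender) pairs,
    one pair per link between two biographies.
    """
    pair_counts = Counter(links)
    source_counts = Counter(source for source, _ in links)
    return {
        (g1, g2): count / source_counts[g1]
        for (g1, g2), count in pair_counts.items()
    }
```

Comparing these conditional probabilities with the overall share of each gender among biographies shows whether links favour same-gender or cross-gender targets more than chance would suggest.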
Articles about women connect less to articles about men via interlinks. Articles about people with the same gender tend to link to each other. Articles about women tend to link more to articles about men than the opposite. Men are more central than women in English, Russian and German language Wikipedia.
DBPedia 2014 dataset, inferred gender for Wikipedia bios, attributes, PageRank
Explored to what extent the connectivity between people is influenced by gender. Investigated the relation between the centrality of people and their gender using PageRank.
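To make the centrality analysis concrete, the sketch below computes PageRank by power iteration on a small link graph and averages the scores per gender. It is a simplified illustration: the damping factor of 0.85, the fixed iteration count, the treatment of articles without outgoing links and all names are assumptions, not details taken from the study.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {article: [linked articles]}."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this article's rank evenly among the articles it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # No outgoing links: spread the rank evenly over all articles.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


def mean_rank_by_gender(rank, gender_of):
    """Average PageRank per gender; `gender_of` maps article -> gender label."""
    totals, counts = {}, {}
    for node, score in rank.items():
        gender = gender_of[node]
        totals[gender] = totals.get(gender, 0.0) + score
        counts[gender] = counts.get(gender, 0) + 1
    return {gender: totals[gender] / counts[gender] for gender in totals}
```

On real data, a graph library would be the more practical choice; the hand-written loop is used here only to keep the example self-contained.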
The top-ranked women according to PageRank are slightly less central than men, and the centrality of women decreases faster than that of men with decreasing rank. There exists a bias in the generation of links by Wikipedia editors, favoring articles about men.
Lexical bias: Lexical bias refers to inequalities in the terms used to describe men and women on Wikipedia. For example, articles about women are more likely to include details about their family life.
An open-vocabulary approach in which a classifier determines which words are most effective in distinguishing the gender of the person an article is about. Log likelihood ratios are used for comparing different feature-outcome relationships.
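The exact log-likelihood formulation is not given above. As one common possibility, the sketch below computes a G-squared log-likelihood ratio for a single word from its counts in two corpora, for example articles about women and articles about men. The two-term form used here and all parameter names are assumptions of this example.

```python
import math


def log_likelihood_ratio(count_a, total_a, count_b, total_b):
    """G-squared statistic for one word across two corpora.

    count_a, count_b: occurrences of the word in corpus A and corpus B.
    total_a, total_b: total number of words in corpus A and corpus B.
    A value of 0 means the word is used at the same rate in both corpora;
    larger values mean a stronger association with one of them.
    """
    combined_rate = (count_a + count_b) / (total_a + total_b)
    expected_a = total_a * combined_rate
    expected_b = total_b * combined_rate
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2
```

The statistic is symmetric, so the direction of the association has to be read from which corpus has the higher relative frequency.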
There is lower salience of male-related words in articles about men, which can be related to the idea of male as the null gender (there is a social bias to assume male as the standard gender in certain social situations). Words like “married”, “divorced”, “children” or “family” are much more frequently used in articles about women. This study confirms that men and women are presented differently on Wikipedia and that those differences go beyond what we would expect due to the history of gender inequalities.
The DBPedia 2014 dataset, The Wikipedia English Dump of October 2014, Linguistic Inquiry and Word count (LIWC) dictionary
To explore which words are more strongly associated with each gender, Pointwise Mutual Information is measured over the vocabulary of both genders. The study also considered burstiness, a measure of word importance in a single document based on the number of times the word appears within the document, under the assumption that important words appear more than once (they appear in bursts) when they are relevant in a given document.
Marriage- and sex-related content is more frequent in women's biographies, and cognition-related content is highlighted in men's biographies. Words most associated with men are mostly about sports, while the words most associated with women relate to arts, gender and family. Of particular interest are two concepts strongly associated with women: her husband and first woman.
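A minimal sketch of both measures follows. Pointwise Mutual Information is computed at the document level as log(P(w, g) / (P(w) P(g))). Burstiness is approximated as the average number of occurrences of a word in the documents that contain it. The input format, the function names and this particular burstiness formula are assumptions for illustration rather than the study's definitions.

```python
import math


def pmi(docs, word, gender):
    """Document-level PMI between a word and a gender.

    `docs` is a hypothetical list of (gender, tokens) pairs, one per biography.
    Returns -inf when the word never occurs in biographies of that gender.
    """
    n = len(docs)
    n_gender = sum(1 for g, _ in docs if g == gender)
    n_word = sum(1 for _, tokens in docs if word in tokens)
    n_both = sum(1 for g, tokens in docs if g == gender and word in tokens)
    if n_both == 0:
        return float("-inf")
    return math.log((n_both / n) / ((n_word / n) * (n_gender / n)))


def burstiness(docs, word):
    """Average occurrences of `word` among the documents that contain it."""
    counts = [tokens.count(word) for _, tokens in docs if word in tokens]
    return sum(counts) / len(counts) if counts else 0.0
```

A positive PMI means the word appears in biographies of that gender more often than independence would predict, and a burstiness above 1 means the word tends to be repeated within the documents where it occurs.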
Overview of English Wikipedia biographies, inferred gender for Wikipedia bios
Analysed gender topic, relationship topic and family topic in Wikipedia's biographies. Quantified the tendency of expressing positive and negative aspects of biographies with adjectives, as a measure of the degree of abstraction of positive and negative content.
Family-, gender-, and relationship-related topics are more present in biographies about women. Linguistic bias manifests in Wikipedia in that abstract terms tend to be used to describe positive aspects in the biographies of men and negative aspects in the biographies of women.
Visibility bias: Visibility bias occurs when articles related to men and women are differently promoted within Wikipedia. For example, men's biographies are potentially more likely to be featured articles than women's biographies, although the difference is not significant.