Mind the Gap

From Meta, a Wikimedia project coordination wiki

Mind the Gap(s)! Writing Styles of Female Editors on Wikipedia

This essay was first drafted by LauraHale on Etherpad. She designed the methodologies used in the article, wrote most of the lit review and processed much of the data.[note 1] Hawkeye7 created five programs to process different data sets used in this study. Pine did the major peer review and collaborative editing that made the text much clearer, helped with data visualisation, did a significant collaborative rewrite of the baseline comparison, added some possibilities to the conclusions, checked the math, and made many miscellaneous contributions. Several participants on #wikimedia-gendergapconnect provided feedback on the methodology, lit review and results, and corrected typos. Photographs were borrowed from the existing Wikimedia collection. As with much of Wikipedia, this essay is the result of collaboration among diverse editors with unique talents.


Wikipedia's gender gap is well known. A study estimates the total percentage of female contributors on Wikipedia as under 9%.[1][note 2]

People in the Wikimedia Foundation, such as Sue Gardner, have stepped up to address the problem of the low number of women.[2][3] There is even a foundation-created mailing list for discussing the problem. At least two studies have looked at gender participation issues: a 2010 survey that looked at demographics for the whole of Wikipedia, and a 2011 survey that looked at characteristics of female English Wikipedia contributors. The GenderGap list has discussed the issues brought up in these studies. Several participants have worked to address on-wiki problems brought up by female contributors on the list. The list has included discussions of what sort of outreach should be done and how best to address the needs of women. As someone with a background in educational research who currently does population studies as part of her PhD work, I was concerned that these strategies often appeared to focus too broadly on all women and often failed to address distinct female population groups within the Wikipedia community. This analysis focuses on characteristics of female participants on English Wikipedia. It looks at whether these participants are representative of the female English-speaking population. It also explores, through existing literature and in the conclusion, whether any such differences could matter when planning strategy to target the gender gap.

Review of Literature

One of the points that came out of the 2011 survey was that American, Australian, and British lesbians may be over-represented among Wikipedia's female contributors. That study found that 8% of all female contributors were lesbians. In total, 24% of the female respondents were non-heterosexual, compared to 69% who identified as straight.[note 3] By comparison, a 2006 study puts the percentage of GLB people in the United States at 4.1%.[4] In the United Kingdom, this percentage is estimated at 1.5%.[5] This suggests that Wikipedia is not attracting a representative population of female contributors.

Location Straight Non-Straight
Wikipedia 69.0% 24.0%
United States 95.9% 4.1%[4]
United Kingdom 98.5% 1.5%[5]

Wikipedia is not attracting a representative population of female contributors: non-heterosexual women account for 24% of female survey respondents, while GLB people make up an estimated 1.5% to 4.1% of the general population. This leads to questions about why and how else Wikipedia's population may be unrepresentative, the impacts that non-representativeness may have on Wikipedia content, and how Wikipedia's female population could be made more representative of the general population.

Female participation in predominantly male volunteer activities

On the subject of female participation in predominantly male volunteer activities, Dayle Jackson, writing in "Women who Manage Sport in New Zealand" (1996)[6] notes that there are certain types of females who are disproportionately involved with New Zealand sport on a volunteer level. They are predominantly childless single women. The Wikipedia 2011 study similarly found that 57% of Wikipedia's female contributors are single. In the New Zealand study, some of the non-single females with children felt guilty for volunteering their time on sport. This characteristic fits with what the Wikipedia 2011 study found: 22% of Wikipedia female respondents cited family and home life as why they did not contribute more.

Female participation in the predominantly male field of sports administration

Barbara Levido, in the same New Zealand book, compiled survey results from Australia, New Zealand and Canada that asked why females were under-represented as sport administrators. Some of the reasons that help explain gender imbalances in sport include: the success of informal male networks (85% of respondents in Canada cited this as a cause for gender imbalance); lack of females with sufficient experience (67% of respondents in New Zealand cited this as a cause for gender imbalance); and weakness of informal female networks (77% of respondents in Australia, 67% of respondents in New Zealand).

Janice Cranch in "The Precarious Trail", also in Women who Manage Sport in New Zealand, discusses extensively one of the most problematic barriers for female participation outside of demographic types and family life: other women. Some women, once in power, will strive to maintain the status quo of sport by excluding other women. At the same time, some of the females in positions of power prefer working with males over females and will limit the opportunities for their fellow females inside an organization because they see them as potential threats to their own advancement. Cranch discusses multiple instances of national sport organizations, like Sport Canada and the Hillary Commission, trying to decrease the gender gap for sport administration and sport participation, and failing. Cranch cites Hall, Cullen and Slack (1989) saying "organizations are a man's world, and females who choose to enter that world must learn the language, symbols, myths, beliefs and values of male culture." (p.34)

Cranch goes on to discuss surveys done in Canada and New Zealand that looked at the demographic characteristics of female sport administrators, who share several important characteristics with female Wikipedia contributors: single, highly educated, and having a first language of English. Cranch cites a study by Korabik (1990) which found that female sport administrators in the male-dominated world of sport administration "do not conform to the typical female stereotype." Females who succeed on this level are the ones who can, and choose to, conform to the male model. This goes back to the issue on Wikipedia: is it possible that the female participation base on Wikipedia is self-selecting in terms of attracting non-stereotypical females who can, and choose to, conform to male culture? Cameron's "Women who Manage New Zealand Sport" and Collins' "Sport in New Zealand Society" indicate that the existing systems can work against females because females inside these organizations prefer working with males over working with women, and some female administrators have male characteristics that make it easier for them to be part of and support the existing system at the expense of their counterparts.

In this essay, we cannot answer the question of whether Wikipedia attracts female participants who intentionally work against the interests of other female contributors when they follow stereotypical male models of behavior, but we do study whether female Wikipedia editors use language in ways that are more typical of males, and we suggest possible implications of these results for recruitment and retention of female editors.

Background information on gendered language

"Gender, Genre, and Writing Style in Formal Written Texts" is a 2003 paper published by the Illinois Institute of Technology and Bar-Ilan University in Israel (Shlomo Argamon, Moshe Koppel, Jonathan Fine, and Anat Rachel Shimoni). The paper looked for differences between male and female writing by examining text in the British National Corpus. The paper found differences between the genders did exist. Specifically, they found that females were more likely to use "involved" words, and males were more likely to use words that are "informational".

Introduction to Methodology

Survey research and data mining both have problems as research methods, but there are few alternatives, and any methodology will have to rely on them.[note 4] The method I chose to answer the question "Do female contributors to Wikipedia generally use language that is more typical of males?" involves scoring the informal gendered language on a contributor's user page and checking whether the resulting gender guess matches the user's stated gender. In later sections, a similar methodology is used to examine formal gendered language in main namespace articles. The methodology is detailed in those sections.

Gendered language on user pages

The first step was to develop a formula based on Gender, Genre, and Writing Style in Formal Written Texts to identify male and female users. This was done using the formulas from Gender Genie and Hacker Factor's Gender Guesser, both of which were based on the paper. The point totals for informal usage from each site were used. The decision to use informal language was based on the assumption that user pages are informal spaces, comparable to blogs, and are distinct from article pages, which use more formal writing. When a word appeared on both sites' lists, the higher point total was used. The point totals assigned to each word can be found below.

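| Feminine keyword | Points | Masculine keyword | Points |
|---|---|---|---|
| actually | 49 | a | 6 |
| am | 42 | above | 4 |
| and | 4 | are | 28 |
| be | 17 | around | 42 |
| because | 55 | as | 37 |
| but | 43 | at | 6 |
| everything | 44 | below | 8 |
| has | 33 | ever | 21 |
| her | 9 | good | 31 |
| hers | 3 | if | 25 |
| herself | 9 | in | 10 |
| him | 73 | is | 19 |
| if | 47 | it | 6 |
| like | 43 | many | 6 |
| me | 4 | more | 34 |
| more | 41 | now | 33 |
| myself | 4 | said | 5 |
| not | 27 | some | 58 |
| out | 39 | something | 26 |
| she | 6 | the | 17 |
| should | 7 | these | 8 |
| since | 25 | this | 44 |
| so | 64 | to | 2 |
| too | 38 | well | 15 |
| was | 1 | what | 35 |
| we | 8 | who | 19 |
| when | 17 | | |
| where | 18 | | |
| with | 52 | | |
| your | 17 | | |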

The second step was to create a list of Wikipedia users by gender binary of male or female. This was done by identifying userboxes that stated a user's gender. Some of the userboxes identified for males included {{user male}} {{User:UBX/male}}. For women, some of these userboxes included {{user female}} {{User:UBX/female4}} {{User:UBX/female}}.[note 5] Once these were identified on 22 October 2011, users who transcluded these userboxes on their user pages were found and put in a table, with column headers for username and for gender based on userbox. When this was done, 1,231 self-identified female users and 798 self-identified male users were found. Given the constraints of the software being used, user names with special characters were removed from the list. The totals were then 1,119 females and 722 males.[note 6]

The third step was to create a program that would pull the contents of a user page from Wikipedia, count the total number of occurrences of words from the female list and male list, multiply each scored word by its point multiplier, and add the points together to get total points for female gendered words and male gendered words for each user. Each user was assigned a female point score and a male point score.

The fourth step was to apply an if-then statement to each pair of scores, labelling the user as scored male or scored female based on the content of their user page, so that the label could be compared with the user's self-identified gender.
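The third and fourth steps can be sketched in a few lines of Python. This is a minimal illustration, not the program used in the study: the tokenisation rule, the function names, and the handling of ties are all assumptions, and only a handful of the words and point values from the table above are included.

```python
import re

# A small subset of the informal point values from the table above.
# The full lists would be filled in the same way.
FEMININE = {"actually": 49, "am": 42, "because": 55, "but": 43, "so": 64, "with": 52}
MASCULINE = {"a": 6, "around": 42, "some": 58, "the": 17, "this": 44, "what": 35}


def score_text(text):
    """Return (female_points, male_points) for a block of text.

    Each occurrence of a listed word adds that word's points to the total.
    Splitting on runs of letters and apostrophes is an assumption.
    """
    words = re.findall(r"[a-z']+", text.lower())
    female = sum(FEMININE.get(word, 0) for word in words)
    male = sum(MASCULINE.get(word, 0) for word in words)
    return female, male


def classify(female, male):
    """If-then step: label by the larger total. Ties are left undecided (assumption)."""
    if female > male:
        return "female"
    if male > female:
        return "male"
    return "undecided"


sample = "I am so pleased with this page because the draft actually works"
female_points, male_points = score_text(sample)
print(female_points, male_points, classify(female_points, male_points))
```

For the sample sentence, the feminine words (am, so, with, because, actually) total 262 points and the masculine words (this, the) total 61, so the sketch labels the text as female.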

For each user, the female total points, male total points, and self-identified gender were then put into a spreadsheet along with the user's username on 30 October 2011.[note 7] Once the data was compiled, all contributors who scored zero for both female word usage and male word usage were removed. This brought the total number of self-identified females to 867 and self-identified males to 568. These numbers are what the results are based on.[note 8] It is important to note that the equation-generated gender score does not determine a person's actual gender. It provides an educated guess, based on the research results found in Gender, Genre, and Writing Style in Formal Written Texts, as to what the user's gender may be given their language usage. The results assume that the sample included only people who accurately stated their gender.

Results for Gendered language on user pages

In the sample, 60.4% of the users were self-identified females and 39.6% were self-identified males.

Category Count Percentage
Self identified female 867 60.4%
Self identified male 568 39.6%

When the data was processed, the results showed that a large portion, 61.7%, of self-identified female contributors use language that codes their writing as male. This is large compared to the 29.0% of self-identified males whose language codes their writing as female. In the results, "Yes - Scored Male is actually Male" and "Yes - Scored Female is actually Female" indicate that the program's result agreed with the user's stated gender. "No - Scored Female is actually Male" and "No - Scored Male is actually Female" mean that the program disagreed with the user's stated gender. These disagreements indicate that the users are using language that is more common for the other gender.[note 9]

Category Count Percentage
Self-identified male, scored male 403 71.0%
Self-identified female, scored female 332 38.3%
Self-identified male, scored female 165 29.0%
Self-identified female, scored male 535 61.7%
Score identified females 497 34.6%
Score identified males 938 65.4%
Correlation 0.095
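The percentages in the table above follow directly from the four counts. A minimal sketch of the arithmetic, using the counts from the table (the variable and function names are illustrative, and the correlation figure is not reproduced because it depends on the raw scores):

```python
# Counts from the results table above, keyed by (self-identified, scored).
counts = {
    ("male", "male"): 403,
    ("female", "female"): 332,
    ("male", "female"): 165,
    ("female", "male"): 535,
}


def pct(part, whole):
    """Percentage rounded to one decimal place, as in the table."""
    return round(100 * part / whole, 1)


total = sum(counts.values())
total_by_self = {}
total_by_score = {}
for (self_id, scored), n in counts.items():
    total_by_self[self_id] = total_by_self.get(self_id, 0) + n
    total_by_score[scored] = total_by_score.get(scored, 0) + n

# Share of each self-identified group that received each score.
for (self_id, scored), n in counts.items():
    print(f"self-identified {self_id}, scored {scored}: {pct(n, total_by_self[self_id])}%")

# Share of the whole sample that received each score.
for scored, n in total_by_score.items():
    print(f"scored {scored}: {n} users, {pct(n, total)}%")
```

Note that the first four percentages are taken within each self-identified group (568 males, 867 females), while the last two are taken over the whole sample of 1,435 users.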

The graph below shows male scores along the horizontal axis, and female scores along the vertical axis. Data points with scores on either axis above 2,000 were removed from this graph, but are included in the data in the results tables.

The graph shows that (1) self-identified males that the computer algorithm scored as using female language, and (2) females that the computer program identified as using female language, are often relatively close to the dividing diagonal line for gender-even scores; while (3) females that scored as using male language, and (4) males identified as using male language, are more likely to be further away from the diagonal.

| Statistic | Yes - Scored Male is actually Male | Yes - Scored Female is actually Female | No - Scored Female is actually Male | No - Scored Male is actually Female |
|---|---|---|---|---|
| Male range | 6899 | 4013 | 1238 | 4985 |
| Female range | 2273 | 5906 | 1288 | 3056 |
| Range difference (absolute difference between female and male range) | 4626 | 1893 | 50 | 1929 |
| Male score mean | 330.11 | 204.37 | 142.28 | 271.62 |
| Female score mean | 153.78 | 303.08 | 205.72 | 148.25 |
| Mean difference (absolute difference between female and male mean) | 176.33 | 98.70 | 63.44 | 123.36 |
| Male median | 135 | 85 | 68 | 132 |
| Female median | 55 | 162.5 | 110 | 55 |
| Median difference (absolute difference between female and male median) | 80 | 77.5 | 42 | 77 |
| Male mode | 10 | 0 | 0 | 69 |
| Female mode | 0 | 4 | 4 | 0 |
| Mode difference (absolute difference between female and male mode) | 10 | 4 | 4 | 69 |
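The difference rows in the table above can be checked against the rows they are derived from. A minimal sketch, assuming the differences are absolute values (which is what the published figures match); the list and function names are illustrative:

```python
# Values copied from the range and median rows of the table above,
# in column order.
male_range = [6899, 4013, 1238, 4985]
female_range = [2273, 5906, 1288, 3056]
male_median = [135, 85, 68, 132]
female_median = [55, 162.5, 110, 55]


def absolute_differences(female, male):
    """Column-by-column absolute difference between two rows."""
    return [abs(f - m) for f, m in zip(female, male)]


range_diff = absolute_differences(female_range, male_range)
median_diff = absolute_differences(female_median, male_median)
print(range_diff)   # [4626, 1893, 50, 1929]
print(median_diff)  # [80, 77.5, 42, 77]
```

A signed "female minus male" difference would be negative in the first and last columns, so the table's all-positive values are consistent with absolute differences.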

| Category | Self-identified females, median | Self-identified males, median | Self-identified females, mean | Self-identified males, mean | Self-identified females, mode | Self-identified males, mode |
|---|---|---|---|---|---|---|
| Male score | 117 | 121.5 | 245.87 | 276.05 | 0 | 0 |
| Female score | 90 | 70.5 | 207.55 | 169.42 | 0 | 0 |

Comparing medians

The graph of medians, above, shows female scores on the vertical axis, and male scores on the horizontal axis. The graph shows that both types of coded males, true males and incorrectly identified females, share more gendered language characteristics than their coded female counterparts, where the actual females and misidentified males have medians that are far apart.

The medians for all true males and all true females both fall on the male side of the diagonal line. This supports other data that suggests that Wikipedia's women tend to use male language characteristics.

Some initial conclusions

The median for all males and all females is on the male side of the graph, which suggests that both groups tend to score as male writing style. However, of the subgroups of each type of score, women who correctly identified as women scored as having a more female writing style than the males who scored on the female side of the axis. Regarding the other two subgroups, men who scored as men, and women who scored as men, both had very similar median scores.

Wikipedia Main namespace Article Language

To provide additional information on the gendered language of articles in Wikipedia's main namespace, a list of 3,431 Wikipedia articles was created, each identified by the gender focus of its topic.[note 10] Articles were tagged as either Male or Female. Male articles were compiled from Category:Men, male national association football team seasons, male cricket players and male sport competitors. Female articles were compiled from Category:Women, and female sport teams and competitors. The article lists were then run through the same program that scored the language of Wikipedia users' pages, using the point values in the table below, which originate from Hacker Factor's Gender Guesser formal totals and Gender Genie's non-fiction totals.[note 11]

Feminine word Points Masculine word Points
and 4 a 6
be 17 above 4
her 9 are 28
hers 3 around 42
if 47 as 23
me 4 at 6
myself 4 below 8
not 27 is 8
she 6 it 6
should 7 many 6
to 2 more 34
was 1 said 5
we 8 the 7
when 17 these 8
where 18 to 2
with 52 what 35
your 17 who 19

Articles with non-UTF8 characters in the article title and articles with equal male and female scores were removed. The results give an idea of the prevalence of gender-coded language in Wikipedia article space by the gender focus of the article topic.

Category Count Per cent
Total Female Topics In Sample 2045 66.05%
Total Male Topics in Sample 1051 33.95%
Total Scored as Female Writing 476 15.37%
Total Scored as Male Writing 2620 84.63%
Total Scored as Female Writing For Female Articles 349 17.07%
Total Scored as Male Writing for Female Articles 1696 82.93%
Total Scored as Female Writing for Male Articles 127 12.08%
Total Scored as Male Writing for Male Articles 924 87.92%

The table above shows that the gendered nature of the topic has little impact on whether Wikipedia articles are written using male or female coded language. In both female article topics and male article topics, more than 80% scored as male writing.
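The comparison behind this observation is a row-wise percentage over a two-by-two table. A minimal sketch using the counts reported above (the dictionary layout and function name are illustrative):

```python
# Counts from the table above: articles by topic gender and scored writing style.
scored = {
    "female topic": {"female writing": 349, "male writing": 1696},
    "male topic": {"female writing": 127, "male writing": 924},
}


def row_percentages(row):
    """Share of each writing style within one topic group, to two decimals."""
    total = sum(row.values())
    return {label: round(100 * n / total, 2) for label, n in row.items()}


shares = {topic: row_percentages(row) for topic, row in scored.items()}
for topic, row in shares.items():
    print(topic, row)

# Gap in the male-writing share between male-topic and female-topic articles.
gap = round(shares["male topic"]["male writing"] - shares["female topic"]["male writing"], 2)
print(f"gap: {gap} percentage points")
```

The male-writing share differs by about five percentage points between the two topic groups (87.92% against 82.93%), which is the basis for saying the topic's gender has little impact.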

Comparison of Articles on Wikipedia and Other Wikis

Information from ten randomly chosen Wikipedia articles was compared to (1) ten articles on English Wikihow, and (2) ten articles on the Geek Feminism wiki using its longest articles that were not lists or bibliographies. These pages were run against Hacker Factor's Gender Guesser. The tool's "formal" and "informal" writing style analysis settings were used to score each of these articles, with the results shown in the table below. (Wikihow was chosen because it is known for having a large number of female administrators and having a large group of female contributors. Geek Feminism Wiki was chosen because the content is very female centric and its contributors are almost exclusively female). This helps to put English Wikipedia into context against a female friendly wiki, and a formal, female topic centric wiki with a large number of contributors. Worthy of note is that 43% of Wikihow's contributor base identifies as female.[note 12] Geek Feminism Wiki was founded by an Australian female and has been publicised on and by women's organisations such as Oceania Women of Open Tech, LinuxChix and Drexel University's Libraries. [7][8][9]

| English Wikipedia | Informal | Formal | WikiHow | Informal | Formal | Geek Feminism | Informal | Formal |
|---|---|---|---|---|---|---|---|---|
| GRB2 | Weak male | Weak male | Remember Dreams | Male | Weak female | Who is harmed by a Real Names policy? | Male | Weak male |
| Ryojun Guard District | Male | Weak female | Choose a Lowchen Puppy | Male | Weak female | Flashbelt slide show | Male | Weak male |
| Commander-in-Chief's Trophy | Male | Male | Quit Your Job and Go on a Road Trip | Male | Weak female | Male Programmer Privilege Checklist | Male | Weak female |
| International Association for Military Pedagogy | Male | Male | Configure a Router to Use DHCP | Male | Female | Conference anti-harassment/Policy resources | Male | Weak female |
| The New People | Male | Male | Have a Good Morning | Weak male | Female | EMACS virgins joke | Male | Weak male |
| Gizzard | Male | Weak male | Escape from Handcuffs | Male | Weak male | FLOSS | Male | Male |
| RAF Holmsley South | Male | Weak male | Sign up for Google One Pass | Male | Weak female | Gaming | Male | Weak male |
| Margaret Hughes | Male | Weak male | Fix a Leaky Faucet | Male | Weak female | Where are the females bloggers? | Male | Male |
| Andrew Balfour | Male | Weak male | Volunteer | Male | Female | Women-friendly events | Male | Weak male |
| The Man Who Wasn't There | Male | Weak male | Find a Quiet Beach in Oahu, Hawaii | Male | Weak male | Outing | Weak male | Weak male |

The total occurrences of male, female, weak male, weak female were counted for all three wikis, Wikipedia, WikiHow and Geek Feminism, within the above samples. This results in the following:

Wiki and setting Female Male Weak female Weak male
Wikipedia Informal 0 9 0 1
Wikipedia Formal 0 3 1 6
Wikihow Informal 0 9 0 1
Wikihow Formal 3 0 5 2
Geek Feminism Informal 0 9 0 1
Geek Feminism Formal 0 2 2 6
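The tallies above can be reproduced from the per-article results in the comparison table. A minimal sketch using the "formal" results for English Wikipedia and WikiHow, copied in table order (the list names and the `tally_row` helper are illustrative):

```python
from collections import Counter

LABELS = ("Female", "Male", "Weak female", "Weak male")

# "Formal" setting results for the ten articles from each wiki, in table order.
wikipedia_formal = [
    "Weak male", "Weak female", "Male", "Male", "Male",
    "Weak male", "Weak male", "Weak male", "Weak male", "Weak male",
]
wikihow_formal = [
    "Weak female", "Weak female", "Weak female", "Female", "Female",
    "Weak male", "Weak female", "Weak female", "Female", "Weak male",
]


def tally_row(results):
    """Count each label, returning the counts in the column order of the tally table."""
    counts = Counter(results)
    return [counts[label] for label in LABELS]


print("Wikipedia Formal", tally_row(wikipedia_formal))  # [0, 3, 1, 6]
print("Wikihow Formal", tally_row(wikihow_formal))      # [3, 0, 5, 2]
```

A `Counter` returns zero for labels that never occur, which is why the "Female" column for Wikipedia comes out as 0 without any special handling.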

Using the "informal" analysis setting in the tool, all three wikis have most of their articles scored as male or weak male. Using the "formal" setting, both Geek Feminism and wikiHow have a greater percentage of articles coded as female or weak female than Wikipedia does. Of the three websites chosen, only Wikipedia appears to have a problem attracting and retaining a female contributor base.


While the data is insufficient to reach the conclusion that Wikipedia attracts females who code their language usage as male in all circumstances on-wiki and off-wiki, we have shown that females use a more male style of writing when writing for Wikipedia. Wikipedia has a smaller percentage of female contributors than other wikis: The problems are not solely caused by Wikipedia's content being an informational style of writing that is more typical of males, as shown by the fact that other wikis manage to attract larger proportions of female contributors when also using a similar informational style of writing.

Opportunities for Action and Further Research

An opportunity for further research, which would be relevant to efforts for female editor retention and recruitment, would be a comparison of Wikipedia user page gendered language percentage, to the percentage of gendered language on posts on social networks that allow users to contribute at least 300 words of text.[note 13] If these sites show similar percentages to Wikipedia, it could suggest that the original study is flawed in terms of gendered specific word usage, but if women on Wikipedia use similar word patterns on social websites as they do on Wikipedia, then this suggests that Wikipedia attracts women whose word choices are more typical of men, which in turn suggests considerations that Wikipedia could take into account when attempting to recruit and retain female editors. For example, Wikipedia could focus one type of recruitment and retention effort at women whose communication styles are more typical of males, and another type of recruitment and retention effort at women whose communication styles are more typical of females. Since Wikipedia's female population of editors is estimated at less than 15%, even a modest level of success might lead to a meaningful increase in the percentage of female editors. Also, such targeted recruitment and retention of female editors could help to increase the number of female editors who communicate in language that is more typically female.

Another opportunity for further research concerns the writing samples used in the study. Because the text examined to guess a user's gender for comparison against their stated gender was limited to their user page, one way to increase both the amount of writing available and the number of users included in the sample would be to include their contributions to talk pages. The language used there could be more involved and informal, providing better insight into the language patterns that indicate modes of communication. Related to this, an analysis could be made of the words used in contributors' edit summaries.

Another follow-up could examine the contribution history of four groups: self-identified and language-identified males, self-identified and language-identified females, self-identified females who are language-identified as male, and self-identified males who are language-identified as female. Editing patterns, such as when these users were most active and the namespaces in which they make most of their contributions on English Wikipedia, might explain how certain types of males and females use Wikipedia and allow for more tailored outreach for their recruitment and retention.

Further reading


  1. The use of Etherpad is why a true edit history for this essay is not available.
  2. An older survey found 15% of Wikipedia's contributors were female[1], which suggests that the percentage of female contributors has declined since the prior survey. One source puts the total percentage of women editing Wikipedia across all languages at 12.64%.[2] This needs to be taken with a grain of salt as the surveys used different methodologies for participant recruitment.
  3. 8% declined to identify their sexual orientation.
  4. In doing research about Wikipedia's population, there are problems. First, the true demographic characteristics of Wikipedia contributors are unknown. This makes survey research difficult, as it is not possible to know if a survey sample is representative of the whole population. Survey results can be biased by the self selection of users to participate or not. Surveys also take a long time to get responses. Outside of survey work, another possible method of examining the population on Wikipedia is to use user pages. This method has problems because user pages do not have standardised fields that can be used to mine demographic information; contributors are not required to put information on their user pages; and certain things that could be used for demographic data mining, such as userboxes, are not standardised or lack widespread usage.
  5. Users could not be identified by category as the category Wikipedians by gender was deleted in June 2007. Reasons for deletion included too broad, and serving no collaborative purpose.
  6. Roughly 60% of the people who self-identified their gender were female. This suggests that females are more likely than their male counterparts to identify their gender on Wikipedia, given that an estimated 91% of Wikipedia's total population is male according to the 2011 survey.
  7. A copy of the program and the raw results are available upon request.
  8. The raw data is available upon request.
  9. A baseline sample of Wikipedians was drawn from several lists, including people who edited sport articles, people who created the most articles, people who had the most edits on Wikipedia, and people listed on RecentChanges. It was run to determine whether the gender language scores of self-identified male and female users identified through userboxes were representative of Wikipedia's larger population. This was done by identifying the percentage of users whose language in user space was scored by the computer program as male or female, regardless of their self-identified gender or whether they self-identified at all. The population sample size was 10,731 users. The results appear in the table below. Remember that the 2011 survey data showed that Wikipedia's editors are 91% male and 9% female, but the self-identified editors studied in the main portion of this study are approximately 60% female. To check that the users in the study are reasonably similar to the baseline, we compare the male users' scores in the study to the overall score in the baseline, on the assumption that 91% of the baseline are male. Of the baseline users, 73.18% had user pages that scored as male, while 70.95% of the male users in the study had user pages that scored as male. We considered these two numbers close enough for the purposes of this study.
    Group Count Percentage
    Baseline scored as male 5195 73.18%
    Baseline scored as female 1904 26.82%
    Male and Female self identified users, who scored as using male language* 938 65.37%
    Male and Female self identified users, who scored as using female language* 497 34.63%
    Self identified male users who scored as using male language* 403 70.95%
    Self identified male users who scored as using female language* 165 29.05%
    * In the sample of self-identified users, the percentage of female users is much higher than on Wikipedia as a whole, and likely to be much higher than in the baseline. The slightly lower 65.37% of userspaces scoring as male for all male and female self-identified users, relative to the 73.18% of the baseline who scored as male, is therefore not surprising.
  10. Creating an extensive list of articles written by men and women is hard, because main namespace articles are often edited by many editors. A member of #wikimedia-gendergapconnect argued that such an analysis was necessary to actually understand what was going on. As a non-representative test, articles created by Hawkeye7 (self-identified male, language-coded male) were compared against articles created by LauraHale (self-identified female, language-coded female). The following table is based on the formal-language gender coding of the main namespace articles they created:
    Category Count Percentage
    Total female-written articles in sample 495 91.50%
    Total male-written articles in sample 46 8.50%
    Total scored as female writing 37 6.84%
    Total scored as male writing 504 93.16%
    Female-written articles scored as female writing 34 6.87%
    Female-written articles scored as male writing 461 93.13%
    Male-written articles scored as female writing 3 6.52%
    Male-written articles scored as male writing 43 93.48%

    Despite both editors having user pages that code as their stated gender, there is almost no difference in the gendered language of the main namespace articles they created.

  11. Both lists had the same words and point totals, but the lists are smaller than the ones for informal writing.
  12. Mayo Fuster Morell says: "A total of 43% of registered participants are females."[3]
  13. Hacker Factor suggests a minimum length of 300 words to get the best possible match.[4]