Research:Gender asymmetry in English Wikipedia's victimization coverage
Summary
This project investigates gender asymmetry in English Wikipedia's victimization articles: entries titled "Murder of...", "Disappearance of...", "Kidnapping of...", and similar, where the subject is notable primarily because of their victimization, not for prior achievements. Using a three-layer classification pipeline (pronoun extraction, name-based gender inference, and manual classification), we classified the gender of the victim in 2,683 of 3,048 such articles. The central finding is that 54.6% of classified "Murder of..." articles are about female victims, although women make up only about 20% of real-world homicide victims globally (UNODC). This is an approximately 2.7-fold overrepresentation of female victims relative to real-world statistics.
Background
At WikiWomenCamp in New Delhi in 2023, a discussion point was raised that women are often framed through victimization on Wikipedia, while men are more commonly associated with achievements. This hypothesis is difficult to test empirically, and there has been limited research in this area.
English Wikipedia contains a distinctive class of articles where the subject person is notable solely because of their victimization. For example, the article "Murder of Vandana Das" exists because her death was the notable event — she was not otherwise notable. By contrast, the murder of Gauri Lankesh, an already notable journalist, is covered as a section in her biography. These victimization-framed articles therefore offer a signal of which victims Wikipedia's editorial community found noteworthy enough for a standalone article.
This project was motivated by the question: Is there a gender imbalance in whose victimhood Wikipedia considers notable?
Research questions
- What is the gender ratio among Wikipedia articles where the subject is notable primarily due to their victimization (e.g., "Murder of...", "Disappearance of..." articles)?
- How does this ratio compare with real-world victimization statistics, particularly global homicide data from the UNODC?
- Does the gender asymmetry vary across different types of victimization (murder, disappearance, kidnapping, assassination)?
- Is the gender skew consistent across time periods, or is it concentrated in particular decades?
Data
Source
Article titles were extracted from the English Wikipedia database using Quarry, filtering for titles beginning with the following prefixes:
- "Murder of..."
- "Assassination of..."
- "Disappearance of..."
- "Kidnapping of..."
- "Assault of..."
- "Abuse of..."
- "Beating of..."
- "Murders of..." (plural, indicating multiple victims)
This yielded 3,387 articles across eight categories.
Filtering
Articles were filtered into three groups:
- Individual victim articles (3,048): Articles about a single identifiable victim — the primary analysis set.
- Multi-victim articles (281): Articles with "and" in the title or "Murders of..." prefix — listed separately.
- Excluded non-person articles (57): Entries like "Abuse of power" or "Murder of Crows" (a film) that matched the title pattern but do not describe victimization of an individual.
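The grouping rules above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function name `group_title` and the `NON_PERSON_TITLES` set are assumptions introduced here, and the source does not say how non-person titles were identified (the set below only holds the two examples it mentions).

```python
import re

# Title prefixes listed in the Source section, with a trailing space so that
# only whole-word prefixes match.
PREFIXES = tuple(
    p + " "
    for p in (
        "Murder of", "Assassination of", "Disappearance of", "Kidnapping of",
        "Assault of", "Abuse of", "Beating of", "Murders of",
    )
)

# Hypothetical exclusion list: only the two examples named in the text.
NON_PERSON_TITLES = {"Abuse of power", "Murder of Crows"}


def group_title(title: str) -> str:
    """Return 'individual', 'multi', 'excluded', or 'no_match' for a title."""
    title = title.replace("_", " ").strip()
    if not title.startswith(PREFIXES):
        return "no_match"
    if title in NON_PERSON_TITLES:
        return "excluded"
    # Multi-victim: plural prefix, or the standalone word "and" in the title.
    if title.startswith("Murders of ") or re.search(r"\band\b", title):
        return "multi"
    return "individual"
```

The word-boundary match (`\band\b`) matters here: a plain substring test would wrongly flag "Murder of Vandana Das" as multi-victim because "Vandana" contains "and".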
Methodology
Gender classification pipeline
Most victims in these articles lack Wikidata entries with a sex/gender property (P21), as they are ordinary people, not public figures. A three-layer classification pipeline was developed:
Layer 1: Pronoun extraction (highest confidence). Article introductions were fetched via the Wikipedia API. Gendered pronouns (she/her/hers vs. he/him/his) were counted in the first 1,500 characters of each article. Where one gender's pronouns dominated, the classification was assigned directly. This layer classified 1,045 articles.
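A minimal sketch of the pronoun-counting step might look like the following. It assumes the introduction text has already been fetched, and it treats "dominated" as "at least `min_ratio` times as many pronouns as the other gender". That threshold, like the function name `classify_by_pronouns`, is an assumption made for illustration; the source does not state the rule actually used.

```python
import re
from typing import Optional

FEMALE_PRONOUNS = {"she", "her", "hers"}
MALE_PRONOUNS = {"he", "him", "his"}


def classify_by_pronouns(
    intro: str, window: int = 1500, min_ratio: float = 2.0
) -> Optional[str]:
    """Classify by gendered pronoun counts in the first `window` characters.

    Returns 'female', 'male', or None when neither gender dominates.
    `min_ratio` is an assumed dominance threshold, not taken from the source.
    """
    words = re.findall(r"[a-z]+", intro[:window].lower())
    female = sum(word in FEMALE_PRONOUNS for word in words)
    male = sum(word in MALE_PRONOUNS for word in words)
    if female > 0 and female >= min_ratio * male:
        return "female"
    if male > 0 and male >= min_ratio * female:
        return "male"
    return None  # no pronouns found, or counts too close to call
```

Returning `None` for ambiguous introductions lets such articles fall through to the next layer.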
Layer 2: Name-based inference. The victim's first name was extracted from the article title and classified using the gender-guesser Python library, which covers given names from a wide range of cultural backgrounds. This layer classified an additional 1,456 articles. Estimated error rate: 3–5%, particularly for East Asian names where surname precedes given name.
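The title-parsing part of this layer can be sketched as below. The helper names (`first_name_from_title`, `classify_by_name`) and the handling of parenthetical disambiguators are assumptions for illustration. The name lookup is passed in as a plain callable so the sketch stays self-contained; in practice it could wrap a name-gender library such as gender-guesser, whose exact use in the project is not documented here.

```python
import re
from typing import Callable, Optional

PREFIX_RE = re.compile(
    r"^(Murder|Assassination|Disappearance|Kidnapping|Assault|Abuse|Beating) of "
)


def first_name_from_title(title: str) -> Optional[str]:
    """Return the first token after the prefix, or None if no prefix matches."""
    match = PREFIX_RE.match(title)
    if not match:
        return None
    rest = title[match.end():]
    rest = re.sub(r"\s*\(.*\)$", "", rest)  # drop a trailing "(...)" disambiguator
    tokens = rest.split()
    return tokens[0] if tokens else None


def classify_by_name(title: str, lookup: Callable[[str], str]) -> Optional[str]:
    """Apply `lookup` to the first name; keep only confident binary labels."""
    name = first_name_from_title(title)
    if name is None:
        return None
    label = lookup(name)
    return label if label in ("male", "female") else None
```

Taking the first token is exactly what produces the surname-first failure mode mentioned above: for a hypothetical title such as "Murder of Zhang Wei", the sketch would look up the surname "Zhang" instead of the given name.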
Layer 3: Manual classification. Remaining articles — many with encoding issues in the Quarry export — were classified manually using cultural and contextual name knowledge. This resolved 182 additional cases.
In total, 2,683 of 3,048 individual-victim articles (88.0%) were classified. The remaining 365 (12.0%) could not be reliably assigned.
Year extraction
Incident years were extracted from article text using regex pattern matching for four-digit years in the range 1800–2029, taking the first year mentioned in the opening 500 characters. This yielded dates for 42.1% of articles.
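One way to express this rule is sketched below. The exact pattern used in the project is not given, so `YEAR_RE` and `extract_year` are illustrative assumptions that implement the description as stated: the first four-digit year between 1800 and 2029 within the opening 500 characters.

```python
import re
from typing import Optional

# 1800-1999 via "1[89]dd", 2000-2029 via "20[0-2]d"; \b avoids matching
# inside longer digit runs.
YEAR_RE = re.compile(r"\b(1[89]\d{2}|20[0-2]\d)\b")


def extract_year(text: str, window: int = 500) -> Optional[int]:
    """Return the first year in 1800-2029 found in the first `window` chars."""
    match = YEAR_RE.search(text[:window])
    return int(match.group(1)) if match else None
```

A rule of this kind returns whichever year appears first, so an introduction that opens with a birth year would yield that year and not the incident year.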
Results
Overall gender distribution
Among 2,683 classified victimization articles:
- Male: 1,366 (50.9%)
- Female: 1,317 (49.1%)
By category
| Article type | Male | Female | Female % | N |
|---|---|---|---|---|
| Murder of | 837 | 1,007 | 54.6% | 1,844 |
| Disappearance of | 135 | 231 | 63.1% | 366 |
| Kidnapping of | 44 | 50 | 53.2% | 94 |
| Assassination of | 325 | 22 | 6.3% | 347 |
| Assault of | 12 | 6 | 33.3% | 18 |
| Beating of | 10 | 2 | 16.7% | 12 |
Comparison with real-world data
According to the UNODC Global Study on Homicide and data compiled by Our World in Data, approximately 80% of homicide victims globally are male and 20% are female. This ratio has been relatively stable across the period 1900–2022.
Among Wikipedia's "Murder of..." articles, female victims are instead the majority: 54.6% of classified articles (1,007 of 1,844). Relative to the roughly 20% female share of real-world homicides, this is a roughly 2.7-fold overrepresentation of female murder victims (54.6 / 20 ≈ 2.7).
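The arithmetic behind this comparison can be reproduced from the "Murder of" row of the table. The helper names below are illustrative, and the 20% baseline is the approximate global figure cited above, not a value matched to Wikipedia's geographic coverage.

```python
def female_share(male: int, female: int) -> float:
    """Fraction of classified articles that are about female victims."""
    return female / (male + female)


def overrepresentation(observed_share: float, baseline_share: float) -> float:
    """Ratio of the observed share to the real-world baseline share."""
    return observed_share / baseline_share


# "Murder of" row: 837 male, 1,007 female; baseline of about 20% female.
share = female_share(837, 1007)
ratio = overrepresentation(share, 0.20)
print(f"{share:.1%} female, {ratio:.1f}-fold overrepresentation")
# prints "54.6% female, 2.7-fold overrepresentation"
```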
Temporal trends
Among articles with extractable dates, the female overrepresentation in murder articles is consistent across most decades from the 1940s onward. The 1980s show the strongest skew (66.1% female), likely reflecting the serial-killer media coverage of that era.
Robustness
The female majority holds across both classification methods independently:
- Pronoun-classified "Murder of..." articles: 55.0% female
- Name-classified "Murder of..." articles: 54.0% female
The consistency across methods suggests the finding is not an artefact of classification error.
Discussion
Several mechanisms may contribute to the overrepresentation of female victims:
- Media pipeline: Murders of women may receive more sustained media coverage, generating more citable sources that meet Wikipedia's notability standards. This is consistent with the well-documented "missing white woman syndrome" in media studies.[1]
- True-crime culture: The growth of true-crime media (podcasts, documentaries, books) has disproportionately focused on female victims, influencing which cases editors are aware of and motivated to document.
- Editorial interest: Wikipedia's editor base, estimated to be 80–90% male, may find cases involving female victims more compelling or notable, leading to more article creation.
- Article survival: Articles about male murder victims may be more frequently nominated for deletion on notability grounds. This hypothesis could be tested through analysis of Articles for Deletion logs.
Relation to prior work
This finding complements existing research on Wikipedia's gender gap. Wagner et al. (2015) found that articles about women are more likely to mention gender, relationships, and family than articles about men.[2] The present finding extends this: women are not only underrepresented in achievement-based biographical articles (18–20% of all biographies), but are overrepresented in articles defined by victimhood.
Limitations
- The classification pipeline has an estimated 3–5% error rate on name-based gender inference, and 12% of articles remain unclassified.
- Analysis is limited to English Wikipedia; other language editions may differ.
- Binary gender classification does not account for non-binary individuals.
- The type of murder (domestic violence, serial killing, terrorism) was not controlled for.
- The comparison with UNODC data is approximate: UNODC figures are global aggregates, whereas Wikipedia's coverage is geographically skewed.
- This analysis shows that Wikipedia disproportionately documents female victims, not that it intentionally victimises women. The bias likely originates upstream in the media ecosystem.
Future work
- Analyse whether articles about female victims tend to be longer and more detailed than those about male victims.
- Study Wikipedia's Articles for Deletion logs to determine if male-victim articles are deleted at higher rates.
- Extend the analysis to other language editions of Wikipedia.
- Control for murder type (domestic violence, serial killing, gang violence, etc.) to disentangle contributing factors.
Code and data
All code, data, and the classified dataset are available under open licence:
- Diff blog: A blog post about the research
References
1. Liebler, C. M. (2010). Me(di)a Culpa?: The "Missing White Woman Syndrome" and Media Self-Critique. Communication, Culture & Critique, 3(4), 549–565.
2. Wagner, C., Garcia, D., Jadidi, M., & Strohmaier, M. (2015). It's a Man's Wikipedia? Assessing Gender Inequality in an Online Encyclopedia. Proceedings of the International AAAI Conference on Web and Social Media, 9(1), 454–463.