Grants:Programs/Wikimedia Research Fund/Measuring the Gender Gap: Attribute-based Class Completeness Estimation

Measuring the Gender Gap: Attribute-based Class Completeness Estimation
start and end dates: July 2023 - July 2024
budget (USD): 50,000 USD
fiscal year: 2022-23
applicant(s): Gianluca Demartini



Gianluca Demartini

Affiliation or grant type

The University of Queensland


Gianluca Demartini

Wikimedia username(s)


Project title

Measuring the Gender Gap: Attribute-based Class Completeness Estimation

Research proposal


Description of the proposed project, including aims and approach. Be sure to clearly state the problem, why it is important, why previous approaches (if any) have been insufficient, and your methods to address it.

The problem. Successful crowdsourcing projects like Wikipedia and Wikidata naturally grow and evolve over time, with editors focussing on certain parts of the project rather than others. While the ability of editors to decide what to contribute to brings flexibility, it may result in biased content where, for example, one gender is better represented than others. An example of this is the number of male astronauts as compared to the number of female astronauts (73 out of 574) in Wikipedia.

The solution. There are several viable approaches to address this issue. For example, the editor community may decide to stop adding new male astronauts to the project to allow content about female astronauts to catch up. Alternatively, the community may decide to represent the real distribution in the profession. In any case, this remains a community decision.

Research Contribution. Rather than deciding how to deal with gender-unbalanced content in Wikimedia projects, the aim of this research is to automatically identify underrepresented classes by estimating the expected size of a class, in order to empower the community in making decisions and setting editorial priorities. This is possible by making use of the edit history of a Wikimedia project.

Approach. Our method estimates the completeness of a class of entities, and hence can be used to answer questions such as “Does the knowledge base have a complete list of all female astronauts?”. Our techniques are derived from species estimation and data management and are applied to the case of collaborative editing. We use entities observed in a project’s edit history as a proxy for observations in a capture/recapture study setup. This allows us to use estimators for species population (e.g., Jackknife Estimators [1]) to predict class cardinality.
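To make the idea concrete, the following is a minimal sketch of a first-order jackknife estimate computed from a list of entity observations. It is an illustration under stated assumptions, not the project's implementation: the function names are hypothetical, each element of the input list is assumed to be one observation of an entity in the edit history, and the sample size n is assumed to be the total number of observations.

```python
from collections import Counter


def jackknife1_estimate(observations):
    """First-order jackknife estimate of the true class size.

    observations: list of entity identifiers, one per observation
    (assumption: one entry per mention of the entity in the edit history).
    """
    n = len(observations)
    if n == 0:
        return 0.0
    counts = Counter(observations)
    distinct = len(counts)  # entities seen at least once
    # f1: entities seen exactly once ("singletons")
    f1 = sum(1 for c in counts.values() if c == 1)
    return distinct + f1 * (n - 1) / n


def completeness_ratio(observations):
    """Observed distinct entities divided by the estimated class size."""
    estimate = jackknife1_estimate(observations)
    if estimate == 0:
        return 1.0  # nothing observed, nothing estimated
    return len(set(observations)) / estimate


# Hypothetical identifiers, not real Wikidata items.
example = ["E1", "E1", "E2", "E3", "E3", "E3", "E4", "E5"]
print(jackknife1_estimate(example))
print(completeness_ratio(example))
```

Many singletons relative to the sample size push the estimate well above the observed count, which signals that the class is likely still incomplete.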

Generalization. This approach can also be applied to attributes with more than two values (e.g., non-binary genders, or age groups, such as counting how many astronauts in the age ranges 20-30, 30-40, and 40-50 there should be in Wikipedia) to estimate attribute-based class cardinality.
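One way to picture attribute-based estimation is to partition the observations by attribute value and apply a species estimator to each partition separately. The sketch below assumes this partitioning strategy, which the proposal does not spell out. It uses a first-order jackknife as the per-group estimator, and all names and identifiers are hypothetical.

```python
from collections import Counter, defaultdict


def jackknife1(observations):
    """First-order jackknife estimate from a list of entity observations."""
    n = len(observations)
    if n == 0:
        return 0.0
    counts = Counter(observations)
    f1 = sum(1 for c in counts.values() if c == 1)  # entities seen once
    return len(counts) + f1 * (n - 1) / n


def estimate_by_attribute(mentions):
    """Estimate class size separately for each attribute value.

    mentions: iterable of (entity_id, attribute_value) pairs, one pair
    per observation of the entity (assumed input shape).
    """
    groups = defaultdict(list)
    for entity_id, value in mentions:
        groups[value].append(entity_id)
    return {
        value: {"observed": len(set(ids)), "estimated": jackknife1(ids)}
        for value, ids in groups.items()
    }


# Hypothetical observations; attribute values could equally be age ranges.
mentions = [
    ("E1", "female"), ("E1", "female"), ("E2", "female"),
    ("E3", "male"), ("E3", "male"), ("E4", "male"), ("E4", "male"),
]
for value, result in estimate_by_attribute(mentions).items():
    print(value, result)
```

Comparing the observed and estimated counts per attribute value indicates which groups appear least complete.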

[1] Heltshe, J.F., Forrester, N.E.: Estimating species richness using the jackknife procedure. Biometrics pp. 1–11 (1983)


  • Dr Lei Han, The University of Queensland, Australia


Approximate amount requested in USD.

50,000 USD

Budget Description

Briefly describe what you expect to spend money on (specific budgets and details are not necessary at this time).

$37,198.18: 6 months post-doc at UQ (A6 salary level, 2023; AUD/USD rate as of 21Nov2022). They are the main staff member working on the project under the supervision of the PI and with help from the casual RA.

$2,343: Laptop for post-doc

$3,937.01: 100 casual hours of a research assistant (RA) at UQ (HEW5.1 salary level, 2023; AUD/USD rate as of 21Nov2022). They will support the post-doc and PI in preparing the Wikimedia datasets for analysis.

$6,522: Institutional overhead


Address the impact and relevance to the Wikimedia projects, including the degree to which the research will address the 2030 Wikimedia Strategic Direction and/or support the work of Wikimedia user groups, affiliates, and developer communities. If your work relates to knowledge gaps, please directly relate it to the knowledge gaps taxonomy.

To close the gap (the gender gap in Wikimedia project content, in our case) it is first critical to be able to measure it. While it is up to the editor community to make decisions on how to best balance content by prioritising editorial focus, we instead aim at empowering editors in their decisions by providing them with relevant information about the current gender balance across classes in Wikidata and categories in Wikipedia. This aligns with the 2030 Wikimedia Strategic Direction as our contribution enables the platform and the community to collect “knowledge that fully represents human diversity”.


Plans for dissemination.

Upon completion of the research, we plan to disseminate our findings through several channels. First, we plan to describe our research approach as well as our experimental findings (e.g., a list of classes with related gender balance information) on the relevant Wikimedia project. We also plan to disseminate our approach and results to the academic research community through peer-reviewed scientific publications in computer science conferences and journals with a topical focus on fairness.

Past Contributions

Prior contributions to related academic and/or research projects and/or the Wikimedia and free culture communities. If you do not have prior experience, please explain your planned contributions.

Our previous research [1] looked at how to use statistical estimators to measure class cardinality in Wikidata. We used the knowledge graph edit history as evidence for the estimators and to measure class completeness. In this project we plan to extend our approach to attribute-specific cardinality estimations (e.g., How many female astronauts should there be? Do we have them all?) and beyond the single Wikidata project.

[1] Michael Luggen, Djellel Difallah, Cristina Sarasua, Gianluca Demartini, and Philippe Cudré-Mauroux. Non-Parametric Class Completeness Estimators for Collaborative Knowledge Graphs. In: The International Semantic Web Conference (ISWC 2019 - Research Track). Auckland, New Zealand, October 2019.

I agree to license the information I entered in this form excluding the pronouns, countries of residence, and email addresses under the terms of Creative Commons Attribution-ShareAlike 4.0. I understand that the decision to fund this Research Fund application, the application itself along with all the information entered by me in this form excluding the pronouns, countries of residence, and email addresses of the personnel will be published on Wikimedia Foundation Funds pages on Meta-Wiki and will be made available to the public in perpetuity. To make the results of your research actionable and reusable by the Wikimedia volunteer communities, affiliates and Foundation, I agree that any output of my research will comply with the WMF Open Access Policy. I also confirm that I have read the privacy statement and agree to abide by the WMF Friendly Space Policy and Universal Code of Conduct.