Wikipedia's gender and cultural gaps are well documented, but how different are these information imbalances on Wikimedia Commons? Wikimedia Commons hosts over 65M files and has the second-highest number of active users among Wikimedia projects, after English Wikipedia. This project will provide an overview of the image files uploaded to Wikimedia Commons with respect to gender diversity and the coverage of cultural representations such as food and dress. We will attempt to quantify existing imbalances and recommend improvements where necessary.
Research Questions & Scope
We have four specific questions that we hope to answer through this research:
- 1. What genders do image files of humans uploaded to Wikimedia Commons depict?
- We will not examine other media types.
- 2. What is the breakdown of Wikimedia Commons images depicting food (categorized by region)?
- 3. What is the breakdown of Wikimedia Commons images depicting dress (categorized by region)?
- 4. How do the depictions of food and dress in Wikimedia Commons images compare to the cultural diversity of the global population?
Additionally, we are curious to explore whether the gender/food/dress depicted in the images is accurately represented by the file name or description.
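One possible way to approach the fourth question is to compare each region's share of labeled images with its share of the global population. The sketch below is illustrative only: the function name `representation_ratio` is hypothetical, and the numbers in the usage example are placeholders, not project results or real population figures.

```python
def representation_ratio(image_counts, population):
    """Map each region to (share of images) / (share of population).

    A value above 1 suggests the region is over-represented in the
    images relative to its population; a value below 1 suggests it
    is under-represented.
    """
    total_images = sum(image_counts.values())
    total_population = sum(population.values())
    ratios = {}
    for region, people in population.items():
        image_share = image_counts.get(region, 0) / total_images
        population_share = people / total_population
        ratios[region] = image_share / population_share
    return ratios


# Placeholder numbers for illustration only (not real data).
image_counts = {"Region A": 60, "Region B": 30, "Region C": 10}
population = {"Region A": 20, "Region B": 30, "Region C": 50}
ratios = representation_ratio(image_counts, population)
```

A ratio is only one of several possible measures; it is used here because it is easy to read and needs only the two sets of counts.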
We adopt a systematic approach to answering the above questions. Below are the key components of our approach.
- A sample of 1 million random images from Commons was taken, along with their corresponding categories on the Commons page.
  USE commonswiki_p;
  SELECT page_title, GROUP_CONCAT(cl_to SEPARATOR ';')
  FROM page, categorylinks
  WHERE page.page_id = categorylinks.cl_from AND page_namespace = 6
  GROUP BY page_id
  ORDER BY RAND()
  LIMIT 1000000;
- The query was run on Quarry (Quarry.wmcloud.org/). The CSV file of the query result can be found here.
- We used an existing third-party tool for weak labeling of Commons images (see Categorization of Wikipedia Images). Given the Commons categories of each image in the random sample, the tool returned a label for that image: Humans, Food, Clothes, or None. This produced a new labeled sample.
- To facilitate the human annotation process, we built a custom tool called WMCLabeler (Wikimedia Commons Labeler). Volunteers were invited to participate in a label-a-thon, where they were presented with a series of questions and asked to label the images as accurately as possible. See subject recruitment announcement. The questions are as follows:
- 1. Which gender(s) do you think the person in the image could possibly identify as?
- 2. Which region/country do you think the food in the image is associated with?
- 3. Which region/country do you think the cloth/dress in the image is associated with?
- 4.* Do you think the gender/food/dress depicted in the image is accurately represented by the file name or description?
- Once the volunteers completed the annotation process using WMCLabeler, we downloaded the generated labels for further analysis. The labeled images, along with the corresponding annotations, were collected and processed to derive insights. This analysis formed the basis for our findings and recommendations.
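The sampling and weak-labeling steps above can be sketched in Python as follows. This is a minimal illustration, not the tool the project used: the keyword lists in `KEYWORDS`, the function names `parse_rows` and `weak_label`, and the assumed CSV layout (a header row, then a title column and a `;`-joined category column) are all hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical keyword lists; the actual weak-labeling tool has its own rules.
KEYWORDS = {
    "Humans": ("people", "men", "women", "portraits"),
    "Food": ("food", "cuisine", "dishes"),
    "Clothes": ("clothing", "dresses", "costumes"),
}


def parse_rows(csv_text):
    """Yield (title, [categories]) from CSV text that has a header row."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    for title, joined in reader:
        # The query joins category names with ';'.
        yield title, [c for c in joined.split(";") if c]


def weak_label(categories):
    """Return the first label with a keyword matching a whole word
    in any category name, or "None" if nothing matches."""
    words = set()
    for category in categories:
        # Category titles use underscores instead of spaces.
        words.update(category.lower().split("_"))
    for label, keywords in KEYWORDS.items():
        if words.intersection(keywords):
            return label
    return "None"


# Made-up rows in the assumed layout, for illustration only.
sample = (
    "page_title,categories\n"
    "Pad_thai.jpg,Thai_cuisine;Noodle_dishes\n"
    "Portrait_1900.jpg,Portraits_of_women\n"
    "Bridge.jpg,Bridges_in_Italy\n"
)
labels = {title: weak_label(cats) for title, cats in parse_rows(sample)}
counts = Counter(labels.values())
```

Matching on whole words rather than substrings avoids accidental hits such as "men" inside "monuments". Real category names are noisier than this, which is one reason the weak labels were followed by human annotation.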
- "Computational Linguistics Reveals How Wikipedia Articles Are Biased Against Women". MIT Technology Review. Retrieved 2021-01-12.
- Miquel-Ribé, Marc; Laniado, David (2018). "Wikipedia Culture Gap: Quantifying Content Imbalances Across 40 Language Editions". Frontiers in Physics 6. ISSN 2296-424X. doi:10.3389/fphy.2018.00054.
- "WikiStats - All Wikimedia Projects by Size". wikistats.wmcloud.org. Retrieved 2021-01-12.