What is the problem you're trying to solve?
Wikimedia projects hold several million biographies, which form an incredible dataset about WHO the project finds important. However, without further analysis and tools, it's difficult to answer some prominent questions about those biographies, such as:
- how diverse are they?
- which diversity gaps would be easy to fill?
- how are these statistics changing over time?
Two actively used tools from the proposers of this grant, WHGI (Wikidata Human Gender Indicators, a previous WMF grantee) and Denelezh, provide some of this data to evaluate aspects of the diversity in biographies. The data that they create has been used to measure success of Wikiprojects, and has been a barometer of Wikimedia's gender gap when discussed in the BBC and Bloomberg. The Wikimedia Grant Team's Annual Report has said of them:
"Grants for research and tools (such as WHGI) - which minimally contribute to the targets of people or articles - have been extremely valuable in improving our understanding of the gender gap and how or why it manifests."
Yet these tools each have their own limitations:
- they only provide some statistics and don't directly help editors to close gaps,
- they are mostly dedicated to the gender gap, and other biases could be explored,
- some of their features overlap, and two tools means double the maintenance, siphoning energy from updates and new features.
We want to take this community data effort to the next level!
What is your solution to this problem?
Our solution is to merge WHGI and Denelezh into a new tool, called humaniki. Completing the difficult task of merging the codebases of the previous tools will do two things: 1. reduce the maintenance effort of the tools and 2. provide a much-needed architecture refresh that unlocks new community features. For instance, an "evolution view" showing how a Wikipedia's diversity is trending is highly desired by our current users. Denelezh can compute and display this report, but has no automation to do it on a schedule. WHGI runs an automated pipeline but cannot compute the evolution view. By marrying these strengths, along with many other new features generated by interviewing our current users, we can make the diversity data tool that Wikimedians deserve.
What are your goals for this project? Your goals should describe the top two or three benefits that will come out of your project. These should be benefits to the Wikimedia projects or Wikimedia communities. They should not be benefits to you individually.
Remember to review the tutorial for tips on how to answer this question.
- 1) Engineer reliable, polished diversity data collection for the future.
- Merging two overlapping single-maintainer tools into one comprehensive multi-maintainer tool will mature these tools out of their prototype phases. This will benefit Wikimedia's WikiProjects and communities by solidifying diversity data collection for the years to come.
- In addition, the result will be a more polished, more capable portal for diversity data that will raise the public profile of this corner of feminist data enthusiasts.
- 2) Create more detailed, actionable statistical outputs for editors.
- The data produced by WHGI and Denelezh is interesting and enabled the first views into Wikipedia's biography composition, but not all of it is actionable. Communities have long pointed out that, beyond reporting statistics, the tools aren't delivering on their potential to help users focus their editing.
- We would like to enable comparison features that can answer questions like:
- "Which women are represented in French Wikipedia but not English Wikipedia?"
- "Which occupations have seen the fewest biographies created in the last 2 years?"
- "Which biographies are being created in my historical interest areas, so that I can find similar minded editors?"
- "Have there been any Wikipedias that have seen systematic anti-diversity editing? When was it?"
- "In what non-gender dimensions are the humans of two wikis similar/different?"
- 3) Enable future community features through an API.
- The data that is produced through WHGI and Denelezh is not easily re-usable outside of their own websites. We want to make sure the data is usable in the full tools/data ecosystem. For instance we want to:
- Enable data ingest for the Gender Campaigns Tool being built.
- Easily enable bot-updating of stats or lists on wiki pages, or embeds into any website. E.g. weekly reports delivered to WikiProjects.
- There are even moon-shot ideas that could be unlocked with an open API:
- A/B testing of editing efforts between languages.
- AI training data to know what predicts biased editing patterns.
How will you know if you have met your goals?
For each of your goals, we’d like you to answer the following questions:
- During your project, what will you do to achieve this goal? (These are your outputs.)
- Once your project is over, how will it continue to positively impact the Wikimedia community or projects? (These are your outcomes.)
For each of your answers, think about how you will capture this information. Will you capture it with a survey? With a story? Will you measure it with a number? Remember, if you plan to measure a number, you will need to set a numeric target in your proposal (i.e. 45 people, 10 articles, 100 scanned documents).
Remember to review the tutorial for tips on how to answer this question.
- 1. Reliability
- 1 new repository, with 2+ contributors. This project aims to consolidate projects, so having just one active software repository is better.
- 1 year successful running. The tool can run for 1 year without breaking.
- 2. Usage
- 1,000 unique visitors to our web frontend site within the first month, to know that people are using our tool.
- 5 prominent Wikipedians trained in how to use our tool, who are active in projects dedicated to editing content about humans.
- 3. Actionable statistics
- Elicitation outputs. Written summary of the elicitation process, including sketches and mocks of what the community requested features will look like.
- 5+ features delivered. From our pool of newly solicited features and our known backlog we aim to deliver at least 5 desired features (evolution, comparison, etc.).
- Software acceptance outputs. We will follow a "software acceptance" process. This will let us know whether our community features are satisfactory to our users, and we will record the process.
- 1,000 clicks+ back to Wikimedia projects per month, as part of an actionable-statistics feature.
- 4. Extensibility.
- 1 new WikiProject user. Our tools are currently in use by WikiProjects, and it would be fantastic to support new projects.
- 1 new API user. We aim to have 1 API user using our data (Gender Campaign Tools or other).
Do you have any goals around participation or content?
Are any of your goals related to increasing participation within the Wikimedia movement, or increasing/improving the content on Wikimedia projects? If so, we ask that you look through these three metrics, and include any that are relevant to your project. Please set a numeric target against the metrics, if applicable.
Our project does not aim to create any new content itself, but rather to support new content creation. In fact, one of the very things we can enable is precisely "the measuring" of new biographies. With our new concept, a few clicks on a dashboard would show how many new biographies were created in a Wikipedia during a time range, and compare that to the same period a year earlier.
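As a rough illustration of that year-over-year comparison, the sketch below counts biographies created in a date range and in the same range one year earlier. It assumes the tool has a list of creation dates available; the function names and the half-open range convention are illustrative assumptions, not part of the planned design.

```python
from datetime import date
from typing import Iterable, Tuple


def count_created(creation_dates: Iterable[date], start: date, end: date) -> int:
    """Count dates falling in the half-open range [start, end)."""
    return sum(1 for d in creation_dates if start <= d < end)


def shift_one_year_back(d: date) -> date:
    """Return the same calendar day one year earlier."""
    try:
        return d.replace(year=d.year - 1)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; fall back to 28 February.
        return d.replace(year=d.year - 1, day=28)


def year_over_year(creation_dates: Iterable[date], start: date, end: date) -> Tuple[int, int]:
    """Return (count in [start, end), count in the same range a year earlier)."""
    dates = list(creation_dates)  # allow two passes over the input
    current = count_created(dates, start, end)
    previous = count_created(dates, shift_one_year_back(start), shift_one_year_back(end))
    return current, previous
```

A dashboard could call `year_over_year` once per wiki and display the two counts side by side.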
- A) Review WHGI and Denelezh with Community
- We will synthesize the past discussions and bug reports defining our "backlog features".
- Host a virtual focus group with previous users of WHGI and Denelezh (asynchronously and synchronously), creating "community features".
- B) Core application
This step creates the underlying tech stack. This part has two sub-goals:
- Define the schema of the intermediate database (using section 2.2 from preliminary work).
- Design and implement the core of the application, with the ability to:
- download Wikidata JSON dumps,
- create intermediate statistics and reports in a local database,
- serve statistics via a generic API, to our own front-end and other services.
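The dump-processing step could look roughly like the following sketch. It relies on two facts about Wikidata JSON dumps: the file is a single JSON array with one entity per line, and humans are items whose "instance of" (P31) claim points to Q5, with "sex or gender" stored in P21. The function names, the `"unknown"` bucket, and counting per sitelink are assumptions made for illustration; the real intermediate schema is still to be defined.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Iterator, List

HUMAN = "Q5"           # Wikidata item "human"
INSTANCE_OF = "P31"    # property "instance of"
SEX_OR_GENDER = "P21"  # property "sex or gender"


def iter_entities(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield entities from dump lines, skipping the enclosing array brackets."""
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


def item_ids(entity: Dict, prop: str) -> List[str]:
    """Collect the item IDs that an entity's claims for `prop` point to."""
    ids = []
    for claim in entity.get("claims", {}).get(prop, []):
        # "novalue"/"somevalue" snaks carry no datavalue, hence the defaults.
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
        if isinstance(value, dict) and "id" in value:
            ids.append(value["id"])
    return ids


def gender_by_site(lines: Iterable[str]) -> Counter:
    """Count humans per (site, gender) pair, e.g. ("frwiki", "Q6581072")."""
    counts: Counter = Counter()
    for entity in iter_entities(lines):
        if HUMAN not in item_ids(entity, INSTANCE_OF):
            continue
        genders = item_ids(entity, SEX_OR_GENDER) or ["unknown"]
        for site in entity.get("sitelinks", {}):
            for gender in genders:
                counts[(site, gender)] += 1
    return counts
```

Because the function only needs an iterable of lines, it can be fed directly from a compressed dump opened in text mode (for example with `gzip.open(path, "rt")`) without loading the whole file into memory.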
- C) Re-implement the features of WHGI and Denelezh
- "Backfill" snapshot data from WHGI into the new architecture.
- Re-implement the statistical reports of WHGI and Denelezh in the new architecture
- Create a skeleton dynamic web front-end
- D) Implement new reports from Community Input
Here we will implement our "backlog" and "community" features:
- We will prototype and wireframe features first, allowing for community input before building.
- We will conduct user-testing midway, and software acceptance testing at the end.
- Review previously suggested features:
- Taskforce and list-making support:
- Evolution view.
- External ID support.
- Comparison of two subsets (with the ability to explore subsets by external ID, in addition to date, gender, year of birth, country of citizenship, occupation, Wikimedia project).
- Hierarchy of occupations.
- Data quality:
- Show data availability of specific Wikidata properties describing humans (with filtering by date, gender, year of birth, country of citizenship, occupation, Wikimedia project, external ID).
- List data that are probably mistakes that can be fixed on Wikidata (e.g. a nationality like French instead of France).
- Taskforce and list-making support:
See Mockup for some ideas.
- E) API & Exports
Statistics and data generated by the project will be reusable and published under the CC0 license (public domain).
- Create an API that allows 3rd party applications (like the Gender Campaigns Tool) to access any of the data available to the front-end via HTTPS.
- Export data from reports in CSV files.
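A CSV export of a report can be as small as the sketch below, which uses Python's standard `csv` module. The column names (`wiki`, `gender`, `total`) and the function name are hypothetical placeholders; the actual report fields will depend on the schema defined in stage B.

```python
import csv
import io
from typing import Iterable, List, Mapping


def report_to_csv(rows: Iterable[Mapping[str, object]], fieldnames: List[str]) -> str:
    """Serialize report rows to CSV text with a header line."""
    buffer = io.StringIO()
    # Use "\n" line endings so the output is the same on every platform.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

For example, `report_to_csv([{"wiki": "frwiki", "gender": "female", "total": 10}], ["wiki", "gender", "total"])` produces a header line followed by one data line. The same rows could back a JSON API response, so the front-end and the export share one data source.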
- F) Internationalization and localization
As the application should be available in as many languages as possible, two things will be internationalized:
- the user interface, relying on translatewiki.net,
- the content, relying on Wikidata translations.
The application will be available in English and in French. Other translations are out of scope and will be made by the community.
- G) Documentation
The project will provide:
- end-user documentation:
- directly on the tool,
- slides for hands-on workshops presenting the tool and its features to end-users.
- technical documentation, at least:
- architecture description (on Meta),
- database schema,
- code under AGPLv3 license.
- H) Future Maintenance Plan
- Have a plan for how errors will be reported.
- Have a way to monitor data processing.
- Agree on a way to collect bug reports.
- I) Project Management
- We will conduct weekly meetings to keep the project going between our engineering and community liaison teams.
- We will conduct monthly meetings with our Advisor to make sure the project is ideologically on-track.
- J) Communication plan
- Create a written work-log, reporting monthly on progress. (Making sure the platform allows comment and interaction).
- Host an online training session with prominent Wikipedians instructing them in the usage of the tool, with Q&A.
- Create a demonstration video for launch of tool.
- Human resources
- Software engineering
- Community Liaison
- 140 hours Participatory-design, user-testing, and communications at $50/hour
- Project Management
- 60 hours Project Management including technical management at $50/hour
- Travel (if COVID-19 allows, otherwise online attendance).
- 2 x $2,000 stipends
- Wikimedia Hackathon
- 1 x $1,500 stipends
- Server hosting
WHGI was funded by an IEG grant in 2015 and has been maintained by volunteer effort since. The Wikimedia Foundation still provides a server to WHGI, which will be reused for this project. Denelezh relies on a server provided by Wikimédia France; this server will be decommissioned once goals A and B are achieved, saving Wikimédia France some money.
Online workshops will be organized with end-users at milestones:
- User feedback and feature wishlist making (at stage A).
- User prototype testing (at stage C).
- User acceptance testing (at stage E) until the end of the project.
COVID-19 contingency planning section
This project relies on in-person events only in "nice-to-have" situations like Wikimedia Hackathon, and Wikimania. Therefore the project could be easily made remote-only, just by attending the virtual replacements for those events (if they happen), or by forgoing those events entirely. This would reduce the total cost of the project by $5,500.
Note: We were already planning to conduct user-research such as interviews with current users of our tools through video-chat. COVID-19 does not affect those plans.
Please use this section to tell us more about who is working on this project. For each member of the team, please describe any project-related skills, experience, or other background you have that might help contribute to making this idea a success.
- Maximilian Klein, Data scientist, founder of WHGI
- Envel Le Hir, Data engineer, founder of Denelezh
- Potentially a third engineer, not yet found, whose identity makes them a better ideological fit to work on a diversity project than two people who benefit from white and male privilege.
- A community liaison, not yet confirmed, perhaps a WikiProject Women in Red volunteer, who has experience in product design and user-testing. This liaison will be in charge of the communication plan as well.
You are responsible for notifying relevant communities of your proposal, so that they can help you! Depending on your project, notification may be most appropriate on a Village Pump, talk page, mailing list, etc. Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions. Need notification tips?
The project was presented on several occasions:
- August 2019: Phabricator task T230184
- September 2019: lightning talk at WikiConvention francophone (slides in French, in English)
- October 2019: poster at WikidataCon (poster)
- November 2019: poster at WikiConference North America (poster)
- TODO: on relevant communities (on-wiki: Women in Red, Wiki Loves Women, WikiDonne, Wikidata project chat, on mailing-lists: wiki-research-l, analytics, wikidata) at milestones.
- March 2020: Les sans pagEs here on the francophone Wikipedia. I notified the Telegram group as well (Nattes à chat)
Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).
- Makes sense, and project lead has demonstrated ability to deliver in past. Jtmorgan (talk) 17:59, 4 March 2020 (UTC)
- The tools are a very useful way of exploring the gender gap, Denelezh's ability to make old datasets searchable is particularly useful for showing the impact of particular initiatives. Merging the two tools sounds sensible. Richard Nevell (talk) 21:37, 10 March 2020 (UTC)
- Total support, many people from the sans pagEs project have been asking continuously for such a tool, especially one that would enable to see the evolution of the gender gap per language over the time. Thank you for proposing this, and we hope to be able to use the tool very very soon being users of the existing tools. I would also love to participate as an advisor and / or community liaison on the project that I find really exciting. I hope this will not remain a Women in Red project only on the anglophone wikipedia, but that other language communities will be involved. Nattes à chat (talk) 08:55, 11 March 2020 (UTC)
- Support: The current tools provide a useful basis for monitoring progress but have suffered from a lack of computer resources and problems resulting from variations in the formats of Wikidata dumps. The additional features proposed will be useful not only in providing an improved source of statistics but in assisting editors to address areas liable to reduce subject or language-based gaps, especially in relation to gender. The two experts behind the proposal are highly competent in the areas addressed.--Ipigott (talk) 11:35, 11 March 2020 (UTC)
- Support per Ipigott. --Rosiestep (talk) 10:27, 12 March 2020 (UTC)
- Support well, we (wikipedians working on gender gap) do need such tools. Both project leads have already shown their ability to deliver with initial tools. They now want to team to do an even better tool. Willing to help in whatever capacity so that this happen Anthere (talk)
- Support Go for it. GoranSM (talk) 13:29, 19 March 2020 (UTC)
- Support it make sense and it would be way more easier to only have one tool. Cheers, VIGNERON * discut. 11:43, 24 March 2020 (UTC)
- Oppose if spending money on software, then do it for general SPARQL/RDF tools. This can then be used for any kind of data not only for the area of interest of some. MrProperLawAndOrder (talk) 01:09, 23 May 2020 (UTC)
- Oppose Such data is already accessible via Wikidata SPARQL queries. Creating a new parallel tool is not the way to go. Money should rather be spent to improve the existing tooling per MrProperLawAndOrder. ChristianKl ❪✉❫ 06:18, 23 May 2020 (UTC)
- Support good user in need of support. there is not one right way. failing to fund an initiative will not fund technical debt. you can not compel users to work your task, by blackballing all alternate approaches, Slowking4 (talk) 10:48, 25 May 2020 (UTC)
- Support for such a tool. Thank you for proposing this, and we hope to be able to use it in other language communities will be involved. Wikimujeres UG would also love to participate and willing to help in whatever capacity.Tiputini (talk) 23:54, 25 July 2020 (UTC)