Art+Feminism User Group/Reporting/All Pages/workplan draft

From Meta, a Wikimedia project coordination wiki

Art+Feminism Wikidata plans

March 2018

Art+Feminism plans to use Wikidata to link a collection of items from which to generate task lists. These lists will be used to improve Wikipedia articles as part of our outreach efforts. They will also be used to improve the Wikidata items themselves.

Over the past four years Art+Feminism participants have created or edited over 10,000 articles, and we forecast 10,000 more this March. These articles include artists, artworks, writers, filmmakers, and other subjects that clearly fall within the rubric of Art+Feminism, as well as a smaller subset of articles that fall outside that rubric: doctors, scientists, politicians, etc. We welcome people at our events to edit all articles about women and gender, but we seek to narrow our focus for the items/articles we seek to improve long term. In order to do so, our first task is to add profession data to each item, so we can filter our items by profession.

Overview of code

We (Danara Sarioglu and Michael Mandiberg) have written a Python script, provisionally called Wikidata QuickSheets, that queries the Wikipedia and Wikidata APIs to return QIDs, and then queries Wikidata for P21 (sex or gender) and P106 (occupation). It turns out that only half of these items have P106 data, so we have written code to add P106 data to items missing it. This script uses category data, and it pulls the first sentence of the Wikipedia page for each QID and parses it against the list of all professions. The script outputs the profession and the first sentence in a keyword-in-context (KWIC) style, so that a human can verify the validity of the script's work. Only data verified by a human will be converted by the script into QuickStatements-ready tuplets.
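The lookup step can be sketched with only the standard library, in keeping with the script's no-dependencies constraint. This is an illustrative reconstruction, not the script itself; the function names are our own, and only the API parameters and response shapes are standard.

```python
import urllib.parse

WIKI_API = "https://{lang}.wikipedia.org/w/api.php"

def qid_lookup_url(lang, title):
    """Build the Wikipedia API URL that returns the Wikidata QID
    (the 'wikibase_item' page prop) for a given article title."""
    params = {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "titles": title,
        "format": "json",
    }
    return WIKI_API.format(lang=lang) + "?" + urllib.parse.urlencode(params)

def extract_claims(entity_json, qid, prop):
    """Pull the target QIDs of a property (e.g. P21, P106) out of an
    already-parsed wbgetentities response from the Wikidata API."""
    claims = entity_json["entities"][qid].get("claims", {})
    return [
        c["mainsnak"]["datavalue"]["value"]["id"]
        for c in claims.get(prop, [])
        if c["mainsnak"].get("datavalue")
    ]
```

An item with no statement for the requested property simply yields an empty list, which is what feeds the "missing P106" work files described below.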

Pairing our list of Wikipedia page names with QIDs produces several error states (malformed page names, pages that have been moved since creation, etc.), and we are still refining the script to handle them.

Future steps

We hope to unify these items via P972, or whatever new Wikimedia project focus list property is agreed upon. Once we have added all profession data, we will filter out the items whose P106 values are out of scope for the project. We will then add the Art+Feminism-specific P972/Wikidata focus list metadata to the final set of items.

At base, we will use this unified set of items to generate lists of items and articles for improvement on Wikipedia and on Wikidata via the semi-automated approach described above.

We also think this script-based approach will be of interest to others. It is a process specifically designed to be accessible to those without programming experience. It uses simple article lists (which can be generated via SPARQL) to generate spreadsheets for human evaluation. These sheets are then transformed back into QS-ready data. The script requires no special libraries or dependencies beyond what is available by default in a basic Python installation.

In this regard it is similar to the Wikidata Game, but differs in three key ways: 1) you can start with your own very focused list; 2) it does more of the work for you; 3) because the data is laid out in a spreadsheet format, you can scan and approve it faster and at scale. In a way, it is like taking the Wikidata Game and transforming it to pair with QuickStatements.

How Wikidata QuickSheets works

Step 0: move category data over from Wikipedia in bulk

I moved category data from enwiki to Wikidata P106 for about 14,000 items that fell into the categories we commonly find our articles in (artists, writers, academics, etc.). I did this by generating QID lists with PetScan, working with as shallow a depth as was reasonably reflected in the potential P106 values on Wikidata. I generated QuickStatements tuplets based on these lists.
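The bulk move reduces to a one-line transformation from a PetScan QID list to QuickStatements rows. A minimal sketch (the function name is ours, and the example occupation value, Q483501 for "artist", is illustrative):

```python
def quickstatements_rows(qids, occupation_qid):
    """Emit one tab-separated QuickStatements row per QID, adding the
    P106 value that matches the PetScan category the list came from."""
    return ["%s\tP106\t%s" % (qid, occupation_qid) for qid in qids]

# e.g. a PetScan export of items in an "artists" category tree
rows = quickstatements_rows(["Q123", "Q456"], "Q483501")  # Q483501 = artist
```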

Step 1: generate lists to start from

The script takes a CSV of articles in the following format:

  Language, name
  es, Deborah Ahenkorah
  en, Caroline Woolard
  en, Charlotte Cotton

The script could be modified to also accept a list of QIDs, as output from a SPARQL query.

The script takes a list of all values for P106 from a SPARQL query and generates a working list of occupations, sorted by total count on Wikidata, that includes their descriptions, so as to differentiate Q3455803 ("director of a creative work") from Q1162163 ("in business or institutions, person in charge of realizing an objective"). In our work files, this is occupations-withDescriptions.csv.
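The exact query is not given on this page; the following is an assumed reconstruction of what such an occupations query could look like, wrapped in a stdlib helper that builds a Wikidata Query Service URL returning CSV:

```python
import urllib.parse

# Assumed reconstruction: every value used for P106, its English
# description, and how many items use it, most-used first.
OCCUPATIONS_QUERY = """
SELECT ?occ ?occDescription (COUNT(?person) AS ?uses) WHERE {
  ?person wdt:P106 ?occ .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?occ ?occDescription
ORDER BY DESC(?uses)
"""

def wdqs_csv_url(query):
    """URL that runs a SPARQL query against the Wikidata Query Service
    and asks for CSV output, suitable for saving as a work file."""
    return ("https://query.wikidata.org/sparql?format=csv&query="
            + urllib.parse.quote(query))
```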

Step 2: generate work files

With the list of articles and occupations, the script uses the Wikipedia and Wikidata APIs to look up the QID, gender (P21), and occupation (P106) of each item. It outputs them to several different CSVs:

  • Has QID, P21 = female, has P106
  • Has QID, P21 = female, P106 found via Category
  • Has QID, P21 = female, P106 found via Searching first line
  • Has QID, P21 = female, no P106 found (these are mostly items that do not have en.wiki articles)
  • Has QID, P21 = male OR doesn’t exist (these are mostly non-human items, like paintings)
  • No QID, Error -1 (redirects, and other things we haven’t sorted through)
  • No QID, Error -2 (malformed text)
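The routing into those CSVs can be sketched as a single decision function. The field names here are assumptions about the script's internal row format, not its actual code:

```python
# Q6581072 is the Wikidata item for "female"
def bucket(row):
    """Pick the work file a looked-up article belongs in; `row` is a
    dict assembled from the Wikipedia/Wikidata API lookups."""
    if row.get("error") == -2:
        return "no QID, error -2 (malformed text)"
    if not row.get("qid"):
        return "no QID, error -1 (redirects, other)"
    if row.get("p21") != "Q6581072":
        return "P21 = male or missing (mostly non-human items)"
    if row.get("p106"):
        return "has P106"
    if row.get("p106_category"):
        return "P106 found via category"
    if row.get("p106_first_line"):
        return "P106 found via first line"
    return "no P106 found"
```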

For the QID items missing P106, the script tries two routes to establish P106 data:

  1. It reads the category data from en.wiki and compares it to a matrix we generated by hand, which contains correlations between Wikipedia categories and QIDs for ~100 common occupations.
  2. It pulls the first sentence from the Wikipedia article and searches it for P106 values from the occupations-withDescriptions.csv list. The script has a tolerance setting, so you can say that you only want to search for values with more than x uses on Wikidata (I have settled on 5 for testing, which reduces the 9,207 values for P106 to 3,100 values).
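The two routes above can be sketched as follows. Function names, data shapes, and the sample category mapping are our own assumptions; the second function also shows the keyword-in-context idea by returning the match inside a slice of the sentence:

```python
def p106_from_categories(categories, category_matrix):
    """Route 1: map en.wiki categories to occupation QIDs via the
    hand-made matrix, e.g. {"American women painters": "Q1028181"}."""
    return sorted({category_matrix[c] for c in categories if c in category_matrix})

def p106_from_first_sentence(sentence, occupations, min_uses=5):
    """Route 2: search the lede sentence for known occupation labels.
    `occupations` maps label -> (qid, uses); values below the tolerance
    setting `min_uses` are skipped."""
    text = sentence.lower()
    hits = []
    for label, (qid, uses) in occupations.items():
        if uses >= min_uses and label in text:
            i = text.index(label)
            # keyword in context: the match plus surrounding text
            hits.append((qid, label, sentence[max(0, i - 30):i + len(label) + 30]))
    return hits
```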

Step 3: notate work file

We then take the CSV of QID items missing P106 and parse it manually. We look at each lede sentence and make decisions about each entry in the sheet, checking the occupation that the script found against the first sentence. If we accept the script's suggestion, we put a Y in that column (actually, any value will work). If a different value needs to be inserted, we put it in the alt occupation column. The Popular column indicates the relative frequency of that particular value: an * indicates it has 1,000 or more uses on Wikidata; a . indicates it has more than 100 uses; a space (which produces an empty cell) means it has fewer than 100 uses.
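The Popular column convention reduces to a small threshold function (the name is ours; the empty string produces the empty cell described above):

```python
def popularity_marker(uses):
    """Marker for the Popular column: '*' for 1,000 or more uses on
    Wikidata, '.' for more than 100, and an empty cell otherwise."""
    if uses >= 1000:
        return "*"
    if uses > 100:
        return "."
    return ""
```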

You can see a detail of a sample work set here:

Step 4: generate QuickStatements-ready tuplets

The script reads the notated work file and outputs QuickStatements-ready tuplets for each row that is marked with a Y. Additionally, it creates a new work file for the lines where I entered an alt occupation; the code does a reverse lookup to see if it can find the correct P106 value.
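A minimal sketch of this step, folding the reverse lookup in directly; the column names ('qid', 'p106', 'approved', 'alt_occupation') are assumptions about the sheet layout:

```python
def process_notated(rows, label_to_qid):
    """Split a notated work file into QuickStatements rows (anything in
    the approval column counts as a Y) plus a new work file of rows
    whose alt occupation had no match in the reverse lookup table."""
    statements, needs_lookup = [], []
    for row in rows:
        if row.get("approved"):
            statements.append("%s\tP106\t%s" % (row["qid"], row["p106"]))
        elif row.get("alt_occupation"):
            qid = label_to_qid.get(row["alt_occupation"].strip().lower())
            if qid:
                statements.append("%s\tP106\t%s" % (row["qid"], qid))
            else:
                needs_lookup.append(row)
    return statements, needs_lookup
```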

Proof of Concept

We have also produced a proof of concept with a reduced data set from the new articles created during the 2017 campaign. You can find the full folder of data here: https://drive.google.com/drive/folders/1f1kyQGZ15Ry3nbs1khN3-ZaQcBVbwWbQ?usp=sharing

  • needs human review contains the files that need to be acted on
  • Category Outputs CSV contains human readable P106 data found via enwiki categories
  • Category Outputs QS contains the QS formatted version of that data found via categories

Step 5: Add Art+Feminism specific metadata

Once all items in our long list have professions, we can sort them and include the ones we want to keep in our focus list. For example, we will likely remove politicians, scientists, etc. Having trimmed those, we will add the focus list property to all of the remaining QIDs.
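The trim-and-tag step can be sketched as a filter over the unified set. The out-of-scope occupations here (Q82955 "politician", Q901 "scientist") are the page's own examples, but the final scope list and the focus-list value are still to be decided:

```python
# Illustrative out-of-scope P106 values: politician, scientist
OUT_OF_SCOPE = {"Q82955", "Q901"}

def focus_list_statements(items, focus_list_qid):
    """Keep items whose occupations are all in scope, and emit
    QuickStatements rows tagging them with the focus list property
    (P972 here, pending the new focus list property)."""
    kept = [qid for qid, occs in items.items() if not (set(occs) & OUT_OF_SCOPE)]
    return ["%s\tP972\t%s" % (qid, focus_list_qid) for qid in sorted(kept)]
```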

Step 6: generate worklists, etc.

From here, we can do work like generating worklists that encourage participants to expand articles started at Art+Feminism events.