FindingGLAMs/Documentation/Tools and workflows

Tools and workflows we use for working with data

Introduction

Within the FindingGLAMs project, we upload datasets describing GLAM institutions in various countries. Depending on the type of dataset and how well the topic is already covered on Wikidata, an upload can involve either of the following operations, or a combination of both:

  • Creating new Wikidata items (and filling them with data).
  • Adding information to existing Wikidata items.

Every upload batch is different, but in general the following steps are carried out:

  • Ingesting and cleaning up source data
  • Reconciling source data with Wikidata
  • Uploading source data to Wikidata

Various tools and workflows can be used to carry out each of these steps. We currently use OpenRefine, which is sufficient for most of them. Historically, the developers at Wikimedia Sverige have used different tools and methods; for Connected Open Heritage, for example, we developed COH-tools, which builds on André Costa's wikidata-stuff library.

However, writing specialized tools in Python works best when dealing with very large amounts of similarly structured data, as was the case with COH. In FindingGLAMs, our data comes from many different sources, and each dataset presents its own unpredictable challenges. Thanks to its visual workflow, OpenRefine lets us quickly get an overview of a new dataset.

In general, there is no tool that suits all types of uploads, and everyone involved in Wikidata develops their own preferences and workflows.

OpenRefine is an open-source application for data cleaning and transformation. It has been around for several years, but only gained integrated Wikidata support fairly recently. It can be downloaded from http://openrefine.org/download.html

What a data upload process can look like

Ingesting and cleaning up source data

Source data comes in different formats, most commonly CSV/TSV, JSON or XML. All of these can be loaded into OpenRefine and displayed in a spreadsheet-like format.

OpenRefine allows you to edit data by performing operations on cells matching certain criteria. For example, sometimes you have to change the case of strings (Göteborgs Stadsbibliotek → Göteborgs stadsbibliotek), or flip a name around (Johansson, Fatima → Fatima Johansson).

You can do transformations using either Python (executed via Jython) or a language called GREL (General Refine Expression Language). The Python support makes it possible to write ad-hoc scripts to edit a particular column, and the scripts can be re-used across columns and projects.
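For illustration, here is a minimal Jython expression of the kind that can be entered in OpenRefine's Edit cells → Transform… dialog to flip the name example above; the variable value is supplied by OpenRefine and holds the current cell's content:

    # Flip "Lastname, Firstname" into "Firstname Lastname".
    # `value` is the current cell; OpenRefine stores whatever we return.
    parts = value.split(", ")
    if len(parts) == 2:
        return parts[1] + " " + parts[0]
    return value  # leave cells that don't match the pattern untouched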

Reconciling source data with Wikidata

This part is both important and tricky. If we want to add data to existing Wikidata items, we first have to identify them, and align our dataset with them to make sure we are targeting the right items. If we want to create new Wikidata items, we want to make sure this is needed in the first place, i.e. that the items do not already exist.

A lesson we have learned is that, because Wikidata is large and messy, no matter how careful you are you WILL end up creating duplicate items, especially when working with datasets in foreign languages. The question is how many. Duplicate items can be merged when they are discovered.

There are generally two ways to identify matching items:

Unique identifiers

This is the ideal way, assuming, of course, that the type of data you are handling has unique identifiers of some sort, AND that both the Wikidata items and your dataset contain them. OpenRefine can then do a lookup on Wikidata using the values from the relevant columns. However, unless we are dealing with a small and well-defined area, there will be Wikidata items that should have the identifier but do not. That is why we cannot rely on this method exclusively.
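Outside OpenRefine, the same kind of lookup can be sketched in a few lines of Python against the public Wikidata SPARQL endpoint. This is an illustration only, not part of our tooling; the property ID and identifier value in the usage comment are made up:

    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"

    def find_items_by_identifier(property_id, identifier):
        # Find items whose statement for `property_id` equals `identifier`.
        query = 'SELECT ?item WHERE { ?item wdt:%s "%s" . }' % (property_id, identifier)
        response = requests.get(
            ENDPOINT,
            params={"query": query, "format": "json"},
            headers={"User-Agent": "reconciliation-sketch/0.1 (example)"},
        )
        response.raise_for_status()
        return [row["item"]["value"] for row in response.json()["results"]["bindings"]]

    # Hypothetical usage -- both the property and the value are invented:
    # find_items_by_identifier("P1234", "SE-000123")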

Labels

Good old text search using item labels, which OpenRefine can also do. Again, the usefulness of this method depends on the nature of the data. For example, Göteborgs stadsbibliotek (Gothenburg City Library (Q9656601)) could have the label Stadsbiblioteket i Göteborg or Stadsbiblioteket (Göteborg) on Wikidata.
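For comparison, here is a rough Python sketch of a label lookup using the wbsearchentities module of the Wikidata API. OpenRefine's reconciliation service is more sophisticated than this, so treat it as an illustration only:

    import requests

    API = "https://www.wikidata.org/w/api.php"

    def search_items_by_label(label, language="sv"):
        # Ask the Wikidata API for items whose label or alias matches `label`.
        params = {
            "action": "wbsearchentities",
            "search": label,
            "language": language,
            "type": "item",
            "format": "json",
        }
        response = requests.get(API, params=params)
        response.raise_for_status()
        return [(hit["id"], hit.get("label", "")) for hit in response.json()["search"]]

    # search_items_by_label("Göteborgs stadsbibliotek") should include
    # Q9656601 among the candidates it returns.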

When OpenRefine thinks an item could be a match but is not 100% sure, it will give us suggestions to manually review.

If we're left with unreconciled data rows, we can either tell OpenRefine to create new Wikidata items for them, review them manually, or simply ignore them (and not upload that data). Which option we choose depends on how many there are, how we feel about the overall quality of the dataset, and so on.

Uploading source data to Wikidata

After editing and reconciling the data, we can create a Wikidata schema directly in OpenRefine, i.e. specify how the fields in the source data should be used to build Wikidata items. We can then upload the changes using either OpenRefine's built-in uploader, or export them as QuickStatements commands and use QuickStatements to upload them.
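As an illustration, a small QuickStatements batch could look like this (columns are tab-separated; Q9656601 = Gothenburg City Library, P17 = country, Q34 = Sweden, P31 = instance of, Q7075 = library; "Exempelbiblioteket" is an invented label):

    Q9656601	P17	Q34
    CREATE
    LAST	Lsv	"Exempelbiblioteket"
    LAST	P31	Q7075

The first line adds a country statement to an existing item; CREATE makes a new item, and the following LAST lines set its Swedish label and an instance-of statement.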

Reference material – QuickStatements

Reference material – OpenRefine

What does GLAM data look like?

Read Information for GLAM partners to get a feel for the type of data we're working with in this project.

For more detailed overviews of what properties are in use, see: