Connected Open Heritage/Wikidata migration

From Meta, a Wikimedia project coordination wiki
This report has been created by Wikimedia Sverige as part of the Connected Open Heritage project. For questions, please email andre.costa@wikimedia.se.
There's also a workflow outline of how the mapping and processing of a WLM dataset can be done.

Introduction

Background

The Monument Database has been the backbone on which the Wiki Loves Monuments competition was built. This database, however, lives outside the Wikimedia infrastructure, and since the introduction of Wikidata, Wikidata is the more natural place for the information to live.

As part of the Connected Open Heritage project, Wikimedia Sverige is investigating how this information can be migrated to Wikidata. This involves both setting up a workflow for migrating individual datasets and trying to identify any issues which might hinder a dataset, or parts of it, from being migrated. This report is part of this work.

Report context

The following report relies on a framework where you would:

  1. Ingest all of the data for a country in the WLM database into Wikidata
  2. Change the lists so that they pull that info from Wikidata (instead of pushing it to the database)
  3. (Make the database update this info from Wikidata)
  4. Ingest other suitable fields from the lists into Wikidata, and make them pull that information from Wikidata instead

The focus of the report is on the first of these steps, ingesting data from the database.

Known problems

It is known that the framework described in this report causes problems if:

  • Some of the fields in the database cannot be ingested into Wikidata.
  • The existing tooling for updating Wikidata from (the lists in) Wikipedia does not meet editor requirements.
  • The list data was not derived from official sources (original research or through weak sources).
  • The licensing of the original dataset is unclear (see Wikidata:Data donation for more info).

What you are expected to know for each dataset

For each dataset you will need some knowledge about how it was created so that you can answer questions such as:

  • What is the original source of the data?
  • Are the IDs from the external source or some local construct?
  • Did the IDs get some special prefix added/removed?
  • Was there any data massaging done before the initial ingestion?

Open questions

  • Is the license of the source data relevant (i.e. it must be CC0) or are the Wikipedia lists considered to have “washed” the data from its source?
  • Are there (always) clear sources for the data in the Wikipedia lists, including which fields were sourced and which weren’t?
  • How to deal with information loss due to format incompatibilities?
    • E.g. address="[[Link street]] 52" → P969: “Link street 52”
    • Place location="Unlinked town" → P276: <no Q-id exists>

Ingesting data from the database(s)

It is important to keep in mind that the database consists of two parts: monuments_all, which is uniform across countries and accessible through the API, and the (sub-)country-specific tables (e.g. monuments_se-ship_(sv)), which can contain more detailed info. The lists on Wikipedia populate the specific tables through the mappings described in monuments_config.py, and the specific tables are merged into the general one through fill_table_monuments_all.sql.

Depending on how much info is lost going from the specific table to the general one (compared to the ease of use of the general one) you can choose which you would rather use as a source for your Wikidata import.

More information on the database (how the harvesting works, the api, and how new datasets are added) is available at Commons:Monuments database. Below I try to go through a step-by-step process for migrating the data to Wikidata.

Match fields to Properties

For each field in the database (of your choice) you need to figure out which Property to match it to, or alternatively whether it should be excluded from the ingestion. Look at how any existing monuments on Wikidata have been set up to ensure that you use the same, or at least a similar, structure. In addition to the database fields there might also be some implicit properties such as P31/instance of or P17/country. You also need to ensure that each field and its matched Property are of the same type, i.e. that wikilinks map to Q-ids and plain text maps to strings. Non-matching formats (or mixed formats) need to be handled or at the very least logged.

Examples:

The minimum requirement is that the following fields be present:

  • P17: The country in which the monument is found
  • P131: The administrative area within which the monument is found (use the smallest possible administrative area)

Note:

  • Some properties combine multiple fields
    • e.g. lat + lon together make P625/Coordinate
  • Some fields combine multiple properties
    • e.g. [[Ekolsunds f.d. värdshus]] (Ekolsund 3:7) → [[<article>]] (<P1262>)
  • Some fields might not be suitable for Wikidata. This can be either internal data (such as timestamps) or unstructured text/descriptions. The latter could possibly be reused as a Description in Wikidata.
  • Some items (such as the value for P31) might also have to be created first in order to be used for matching.
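
The matching described above could be recorded as a small table in code. The following is only a minimal sketch: the field names (address, municipality, lat, lon, changed) and the helper plan_claims are hypothetical and would need to be adapted to the dataset at hand. Fields that are deliberately excluded map to None, and any field that was never considered is logged.

```python
import logging

logger = logging.getLogger("wlm_mapping")

# Hypothetical field -> (Property, expected type) mapping.
# A tuple key means several fields combine into one Property.
# None means the field is deliberately excluded from ingestion.
FIELD_MAPPING = {
    "address": ("P969", "string"),
    "municipality": ("P131", "item"),
    ("lat", "lon"): ("P625", "coordinate"),
    "changed": None,  # internal timestamp, not suitable for Wikidata
}


def plan_claims(row):
    """Return (property, raw_value, expected_type) tuples for one row."""
    planned = []
    handled = set()
    for fields, target in FIELD_MAPPING.items():
        names = fields if isinstance(fields, tuple) else (fields,)
        handled.update(names)
        if target is None:
            continue
        values = [row.get(name) for name in names]
        if any(value in (None, "") for value in values):
            continue  # missing or incomplete data, nothing to plan
        raw = values[0] if len(values) == 1 else tuple(values)
        planned.append((target[0], raw, target[1]))
    # Log fields that the mapping does not mention at all.
    for name in row:
        if name not in handled:
            logger.warning("Unmapped field %r skipped", name)
    return planned
```

Implicit properties such as P31 or P17 are not covered by this sketch and would be added separately for every item in the dataset.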

Parse field values

In the dataset each value is likely either a string, wikitext or a number. By contrast most values in Wikidata will be Q-ids.

From wikitext we can extract page names from which we can get Q-ids (if it isn’t a red link). For plain strings we need to use a lookup table. This can either be constructed manually (e.g. a dictionary of all region-iso to Q-ids) or through querying/searching Wikidata. If the latter is used then care needs to be taken that the found object is of the right type, e.g. that we are matching Sweden to a state and not to a music album.
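
The type check on search results can be sketched as follows. This is an illustration only: it assumes the candidates have already been fetched and are represented as dictionaries with a qid and a list of P31 values, and the function name pick_candidate and the placeholder identifiers in the usage example are made up.

```python
def pick_candidate(candidates, expected_types):
    """Return the Q-id of the only candidate of an expected type, else None.

    candidates: list of dicts such as {"qid": ..., "P31": [...]}
    expected_types: set of acceptable P31 values
    """
    expected = set(expected_types)
    matches = [
        candidate["qid"]
        for candidate in candidates
        if expected & set(candidate.get("P31", []))
    ]
    # Zero matches or several matches are both treated as "no safe match".
    return matches[0] if len(matches) == 1 else None
```

For example, with placeholder identifiers:

```python
found = [
    {"qid": "Q-country", "P31": ["Q-state"]},
    {"qid": "Q-album", "P31": ["Q-music-album"]},
]
pick_candidate(found, {"Q-state"})  # selects "Q-country"
```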

Dates and coordinates need to be parsed differently so as to create the right sort of object to submit to Wikidata. In any field where you would expect a plain string you should be prepared for wikitext to appear. The opposite also applies: if you are expecting a link (from which to extract a Q-id) you need to be able to handle a plain string. Often the wikitext will affect how you need to interpret the string:

e.g.

  • [[A|B]] → article:A, label:B
  • [[A]] → article:A, label:A
  • A → label:A
  • [[A|B]]C → article:A, label:BC
  • [[A|B]], C → article:A, label:?
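
The cases listed above could be handled along these lines. This is a sketch under stated assumptions rather than a full wikitext parser: it assumes at most one link per value, treats any run of letters directly after the link as part of the label, and returns None as the label in the ambiguous last case so that the caller can log and skip it. The name parse_link_value is hypothetical.

```python
import re

# [[target]] or [[target|shown text]]
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")
# Letters only (no digits, underscores, spaces or punctuation)
TRAIL_RE = re.compile(r"[^\W\d_]+")


def parse_link_value(text):
    """Return (article, label); either may be None if it cannot be determined."""
    text = text.strip()
    matches = list(LINK_RE.finditer(text))
    if not matches:
        return None, text  # plain string: label only
    if len(matches) > 1:
        return None, None  # several links: not handled here
    match = matches[0]
    article = match.group(1).replace("_", " ").strip()
    shown = (match.group(2) or article).strip()
    before, after = text[:match.start()], text[match.end():]
    if not before and not after:
        return article, shown
    if not before and TRAIL_RE.fullmatch(after):
        return article, shown + after  # [[A|B]]C -> label "BC"
    return article, None  # e.g. "[[A|B]], C": label is ambiguous
```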

Note:

  • A single string may contain multiple values. e.g.:
    • "[[steel]], [[concrete]]" → P186:Q11427 AND P186:Q22657
    • "[[Ekolsunds värdshus]] (A 2:2, 11:1 och B 8:2)" → article:Ekolsunds värdshus, P1262:"A 2:2" AND P1262:"A 11:1" AND P1262:"B 8:2"
  • You need to be prepared to deal with non standard data:
    • e.g. year = approx 1900, 1900s, 1890 to 1981, 1988 and 1999 etc
    • coordinates in different systems etc.
  • You need to be prepared to deal with implicit inaccuracies in numbers/dates/coordinates.
  • As always be wary of underscores (_) which might appear in links or page names.

The best way to deal with edge cases is to test that each value behaves as expected and, if it doesn’t, to skip it and log that this was done.
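
A strict "accept only what is understood, otherwise skip and log" approach might look like the sketch below. The function names parse_year and split_linked_values are hypothetical, and the rules are deliberately conservative assumptions: only a bare four-digit year is accepted, and a multi-value string is only split if it consists of nothing but links and commas.

```python
import logging
import re

logger = logging.getLogger("wlm_parse")

YEAR_RE = re.compile(r"\d{4}")
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def parse_year(raw):
    """Return the year as an int, or None (and log) if not a bare year."""
    value = raw.strip()
    if YEAR_RE.fullmatch(value):
        return int(value)
    logger.warning("Skipping unparseable year %r", raw)
    return None


def split_linked_values(raw):
    """Return linked page names, or None (and log) if other text remains."""
    articles = LINK_RE.findall(raw)
    leftover = LINK_RE.sub("", raw).strip(" ,")
    if not articles or leftover:
        logger.warning("Skipping unparseable value list %r", raw)
        return None
    return [article.strip() for article in articles]
```

With these rules "1900" is accepted, whereas "approx 1900", "1900s" and "1890 to 1981" are all logged and skipped for later manual handling.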

Check for existence

Many of the objects you are working on may already exist. If so it is important that you find and identify the matching object so that the parsed information is added to it, rather than creating a new item.

For WLM items (with a unique identifier) the unique identifier is often a good starting point which is fairly easy to search for. The second alternative is to look for any article linked to from the "name" field or identified in the monument_article field.

Pitfalls and recommendations:

  • What if the linked object already has a different id?
    • This likely means a mismatch. Log and skip.
  • What if the linked object is not of the expected type?
    • This likely means a mismatch. Log and skip.
  • What if the linked page does not (yet) have a Wikidata item?
    • Create it as an empty item
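
The recommendations above can be summarised as a small decision function. This is a sketch only: the lookups themselves (searching Wikidata for the identifier, resolving the linked article) are assumed to have been done already, and the name decide_target and the structure of linked_item are hypothetical.

```python
def decide_target(monument_id, id_match, linked_item, expected_types):
    """Decide which item to edit for one monument.

    id_match: Q-id found by searching for monument_id, or None.
    linked_item: dict with keys "qid", "monument_id" and "types" for the
        item of the linked article, or None if there is no such item yet.
    Returns ("use", qid), ("create", None) or ("skip", reason).
    """
    if id_match:
        return ("use", id_match)
    if linked_item is None:
        return ("create", None)
    existing_id = linked_item.get("monument_id")
    if existing_id and existing_id != monument_id:
        return ("skip", "linked item already has a different id")
    types = set(linked_item.get("types", []))
    if types and not types & set(expected_types):
        return ("skip", "linked item is not of the expected type")
    return ("use", linked_item["qid"])
```

Every "skip" result should be written to a log together with its reason so that the mismatches can be reviewed manually.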

Clean up existing data (on Wikidata)

Any items which already exist on Wikidata should preferably be cleaned up before new items are ingested. This is in part because these are easier to identify before the ingestion, in part because this makes it easier to eliminate the accidental inclusion of duplicates. The common steps include:

  • Ensuring that each item has been tagged with the relevant ID properties
  • Ensuring constraint violations for the property/properties have been resolved
  • Ensuring that these follow a schema similar to the one defined for the new items.

Determine a label (and description)

You should match a field (or a combination of fields) against Label. Note that this field is set per language and that it (in combination with a Description) must be unique on Wikidata. Additionally you can set multiple names, or name/id no. combinations, as Aliases if people are likely to search for the object using these.
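
Clashes within the batch to be uploaded can be detected before anything is sent to Wikidata. The sketch below only compares the planned items with each other, not with items already on Wikidata; the item structure and the name find_label_clashes are assumptions made for illustration.

```python
from collections import defaultdict


def find_label_clashes(items, lang):
    """Return {(label, description): [item ids]} for clashing pairs in lang."""
    seen = defaultdict(list)
    for item in items:
        label = item.get("labels", {}).get(lang)
        description = item.get("descriptions", {}).get(lang)
        # Only a label combined with a description has to be unique.
        if label and description:
            seen[(label, description)].append(item["id"])
    return {key: ids for key, ids in seen.items() if len(ids) > 1}
```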

Determine how to handle existing info

In the case where an item already exists special care needs to be taken to ensure that the information on Wikidata is not simply overwritten by that from the database. The recommended way to proceed is:

  1. If the claim doesn’t exist, simply add the new claim.
  2. If the claim exists:
    1. But the value is different, add the new claim as a separate value.
    2. And the value is the same:
      1. But the reference is different, add an additional reference to the claim.
      2. And the reference is the same[1], there is nothing more to add.

The case which requires additional care is 2.1. While different values don’t always indicate that there is a problem (e.g. a building could first have been a church and then a casino) you need to determine if:

  • The values are really different. E.g. dates, numbers and coordinates all have built-in uncertainties, meaning a broad statement might cover a more exact one. Even if they overlap it might still be desirable to keep both, depending on their sources.
  • The differing values are indicative of an erroneous matching of items (although this should primarily have been taken care of by Check for existence).

Ideally conflicting values should be logged since they can, at the very least, highlight the need to mark one of the statements as preferred.
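
The recommended procedure can be expressed as a short decision function. This is a sketch with simplifying assumptions: claims and references are represented as plain dictionaries, values are compared with simple equality (so the uncertainty of dates, numbers and coordinates is not taken into account), and references are compared while ignoring P813, in line with footnote 1. The names plan_update and same_reference are hypothetical.

```python
def same_reference(ref_a, ref_b, ignore=("P813",)):
    """Compare two references while ignoring time-dependent properties."""
    def strip(ref):
        return {prop: value for prop, value in ref.items() if prop not in ignore}
    return strip(ref_a) == strip(ref_b)


def plan_update(existing_claims, new_value, new_ref):
    """Return the action to take for one property of one item.

    existing_claims: list of {"value": ..., "references": [...]} dicts.
    """
    if not existing_claims:
        return "add_claim"  # case 1
    for claim in existing_claims:
        if claim["value"] == new_value:
            if any(same_reference(ref, new_ref) for ref in claim["references"]):
                return "nothing"  # case 2.2.2
            return "add_reference"  # case 2.2.1
    # Case 2.1: keep the old value, add the new one and log the conflict.
    return "add_claim_and_log"
```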

Determine how to reference the statements

When doing mass imports of statements you are also expected to add a reference to each of the statements added from the database. There are two types of references you could add depending on the circumstances around the data.

A reference along the lines of "imported from the WLM database" mainly serves the purpose of showing other editors where the data is being ingested from, rather than actually sourcing the statement. On the other hand it has the benefit of always being true.

If the original source of the data is known then it would be better to use that as an actual reference for the statements. The problem, however, is that information originally imported into the database (straight from the source) cannot easily[2] be distinguished from data which was later submitted by a user.

Whichever reference is decided upon it should contain the following:

  • A claim as to the source of the data (e.g. P248)
  • P577: The date for when the data was published (if known)
  • P813: The date for when the data was confirmed at the source
  • P854: A URL to the source of the data (if online)
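
A reference with these parts could be assembled as in the sketch below. The plain dictionary is just an illustrative structure, not the format expected by any particular bot framework, and the function name build_reference and the placeholder source identifier are assumptions.

```python
from datetime import date


def build_reference(source_qid, retrieved, published=None, url=None):
    """Return a reference as a {property: value} dict.

    source_qid: item describing the source (used with P248)
    retrieved: date on which the data was confirmed at the source
    published: publication date of the data, if known
    url: URL of the source, if it is online
    """
    reference = {"P248": source_qid, "P813": retrieved.isoformat()}
    if published is not None:
        reference["P577"] = published.isoformat()
    if url:
        reference["P854"] = url
    return reference
```

Optional parts are simply left out when unknown, so that no empty or guessed values end up in the reference.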

Documentation

Make sure to document all of the choices made during mapping since this will make it easier for others to follow the process at a later stage, as well as being useful as a basis for tooling, both existing WLM bots/tools and any new tools.[3]

Changing the lists (post ingestion)

The final step is to completely deprecate the database by having the Wikipedia lists fetch the information from Wikidata and, in turn, having any updates pushed to Wikidata. The remaining issues preventing this from happening are described in more detail over at Commons:Monuments database/Wikidata.

Footnotes

  1. Or the same at least up to time dependent factors such as fetch date (P813) or similar.
  2. It might be possible if you know what the first changed timestamp should be in which case you could detect which items haven’t been updated since initial ingestion.
  3. Similar documentation was done by Multichill before creating the first Wikipedia lists.