Learning patterns/Wikidata mass imports
What problem does this solve?
You have a large amount of data that you would like to integrate into Wikidata, perhaps as part of a content donation from an institution or some other project. However, because there is so much of it, importing it by hand is impractical. You're concerned about overwhelming Wikidata with too much information, as well as about introducing errors at scale.
What is the solution?
First: ensure acceptable copyright status
Basic facts, not creatively expressed, are not copyrightable per se. However, collections of data may be subject to database rights, particularly in the European Union. To avoid this issue, as well as to forestall future disputes, Wikidata proactively waives copyright and neighboring rights on its data corpus through the CC0 waiver. Accordingly, any data you want to incorporate into Wikidata should also be proactively released under CC0 (or an equivalent public-domain dedication).
Modeling the data
Data modeling is an exercise in taking the data from its source, however it is expressed, and figuring out how to express it in the language of Wikidata. Wikidata has items, referring to an identifiable person, place, thing, or idea. Items contain statements, consisting of properties ("questions"), claims ("answers," or "purported answers"), occasionally qualifiers that add nuance to the claim, and ideally references as well. (And sometimes, claims have ranks.)
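To make the vocabulary above concrete, a single statement can be pictured as a nested structure. This is a simplified, hypothetical sketch loosely modeled on Wikidata's data model, not the exact JSON serialization; the property and item IDs shown are real, but the shape is illustrative only:

```python
# Simplified, hypothetical sketch of one Wikidata statement.
# The real Wikidata JSON format is considerably more verbose.
statement = {
    "property": "P69",       # "educated at" -- the "question"
    "value": "Q691283",      # the "answer" (an item ID)
    "qualifiers": [
        # qualifiers add nuance, e.g. an end date for the claim
        {"property": "P582", "value": "+1952-00-00T00:00:00Z/9"},
    ],
    "references": [
        # ideally every claim carries a reference
        {"property": "P248", "value": "Q17939676"},  # "stated in" a dataset item
    ],
    "rank": "normal",        # ranks: preferred / normal / deprecated
}

print(statement["property"], "->", statement["value"])
```

Thinking of your source data in these terms (which field becomes a property, which becomes a value, which becomes a qualifier or reference) is the core of the modeling exercise.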
While Wikidata is very flexible in terms of how data can be expressed, one of the main constraints is that you must use properties that already exist; you cannot arbitrarily create your own. It is recommended that you first try to use existing properties; where that is not possible, you can request that one (or more) be created via the property proposal process.
If there is a WikiProject relevant to your project, it is suggested you work with them. You may even work with them on developing an informal schema, representing a consensus of what properties should appear on what items and how they should be used. (Wikidata does not require formal schemas, and there is not much appetite for them. However, WikiProjects like WikiProject Source MetaData have put together informal recommendations.)
If, as part of this project, you want to associate items with other items, a mapping between concepts in your dataset and items in Wikidata may be in order. You can create as many items as you need to express your dataset, provided the items meet notability requirements, but you should try to avoid creating duplicate entries where possible. A tool like Mix'n'match can help facilitate this, especially in combination with an authority control property linking your dataset with Wikidata.
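The matching step can be sketched in code. Here is a minimal, hypothetical example: given a mapping from your dataset's identifiers to existing Wikidata QIDs (as produced by Mix'n'match or a query), partition records into those that should enrich an existing item and those that are candidates for new items. The record shape and function name are assumptions for illustration:

```python
def split_by_existing(records, id_to_qid):
    """Partition dataset records into (matched, unmatched) against Wikidata.

    records:   iterable of dicts with a "dataset_id" key (hypothetical shape)
    id_to_qid: dict mapping dataset identifiers to existing Wikidata QIDs
    """
    matched, unmatched = [], []
    for rec in records:
        qid = id_to_qid.get(rec["dataset_id"])
        if qid:
            matched.append({**rec, "qid": qid})   # add to the existing item
        else:
            unmatched.append(rec)                 # candidate for a new item
    return matched, unmatched

records = [{"dataset_id": "A1"}, {"dataset_id": "B2"}]
matched, unmatched = split_by_existing(records, {"A1": "Q42"})
```

Reviewing the `unmatched` list by hand (or via Mix'n'match) before creating items is what keeps duplicates out.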
Be mindful of provenance
"Provenance" refers to the origin of the data. Much like Wikipedia, Wikidata cares about the source of its statements, and even allows references to be associated with individual claims.
Part of this equation is having the original data source be available publicly, readily located somewhere with a clear connection to the organization publishing the dataset. Think of this as the "authoritative" copy that Wikidata can be verified against. (And if the data in the original dataset comes from somewhere else as well, try to represent that provenance too!) The ideal would be something like an API on an official website where people can query against the original record. Failing that, a dataset published somewhere like Figshare should work too.
The other part is citing this authoritative reference from within Wikidata claims. There are a few patterns for this:
- If you are citing a dataset release, it is precedented to create a Wikidata item on the dataset release itself, and then as your reference, state that the claim is "stated in" the item corresponding to your dataset. For example: Q17939676, which is used as a reference on Q17861746. The item about the dataset release should include as much metadata about the dataset as possible, including the date of its release.
- If you are citing something like an API where there are individual URLs for accessing different aspects of the data, the reference should have three parts: a "stated in" claim referring to the general name of the database, a URL linking directly to an API call that presents the data in the claim (or a corresponding identifier via an identifier property), and the date of retrieval. The benefit of this approach is that people can programmatically check Wikidata against the source material.
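For the second pattern, QuickStatements V1 syntax expresses reference parts with S-prefixed property IDs on the same tab-separated line as the claim. A small, hypothetical helper that formats one claim with the three-part reference described above (stated in P248, reference URL P854, retrieved P813) might look like this; the item and property IDs in the usage example are illustrative:

```python
def qs_claim_with_reference(item, prop, value, stated_in, url, retrieved):
    """Format one QuickStatements V1 line: a claim plus a three-part reference.

    Dates use Wikidata's time format, e.g. "+2019-01-17T00:00:00Z/11",
    where the "/11" suffix indicates day precision.
    """
    fields = [
        item, prop, value,
        "S248", stated_in,     # stated in: the database's Wikidata item
        "S854", f'"{url}"',    # reference URL: direct link to the API call
        "S813", retrieved,     # retrieved: date of access
    ]
    return "\t".join(fields)

line = qs_claim_with_reference(
    "Q42", "P69", "Q691283",
    "Q17939676", "https://example.org/api/record/123",
    "+2019-01-17T00:00:00Z/11",
)
```

Generating reference fields programmatically like this keeps them consistent across thousands of claims.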
The mass import itself
In terms of tools you can use, Wikidata Integrator is a Python library that is relatively easy to use. A slightly lower-tech solution is QuickStatements, which takes specially formatted spreadsheets and makes edits to Wikidata accordingly. Alternatively, if you are familiar with RDF and SPARQL, you can use the Wikibase loader in LinkedPipes ETL. For large numbers of edits, it is recommended that you set up a bot account or get a flood flag for your own account.
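As a taste of the QuickStatements input format, the following hypothetical V1 batch creates one item, gives it an English label, and adds one sourced claim (fields are tab-separated; the item and source IDs are illustrative):

```
CREATE
LAST	Len	"Example painting"
LAST	P31	Q3305213	S248	Q17939676	S813	+2019-01-17T00:00:00Z/11
```

Here `CREATE` makes a new item, `LAST` refers back to it, `Len` sets the English label, and `P31`/`Q3305213` states "instance of: painting" with a "stated in" plus "retrieved" reference.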
Start small and scale up. By starting small, you can see if there are any issues in your data, or if there are problems translating the original dataset into Wikidata. Also be sure to keep in touch with other Wikidata editors, especially those that may have to clean up in case things go wrong. d:Wikidata:Project chat is a good place to reach out to other editors.
If you create new properties as part of this, it is recommended that you set up constraint violation reports for them. This way, if there are deviations from expected use of the properties, they are documented.
If you know that there are other identifiers associated for items in your dataset, you should proactively find those identifiers and map them to Wikidata items. This helps prevent the creation of duplicate items and can lead to richer Wikidata entries.
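One way to gather existing mappings is the Wikidata Query Service. The sketch below only constructs the SPARQL query string for retrieving every item that holds a given external-identifier property; the property ID in the usage example is a stand-in to replace with the identifier property relevant to your dataset:

```python
def mapping_query(property_id):
    """Build a SPARQL query for the Wikidata Query Service that returns
    every (item, external ID) pair for one identifier property."""
    return f"""
    SELECT ?item ?externalId WHERE {{
      ?item wdt:{property_id} ?externalId .
    }}
    """

# e.g. P227 (GND ID) -- substitute the identifier property you care about
query = mapping_query("P227")
```

Running such a query before the import gives you a lookup table from external IDs to QIDs, which feeds directly into the duplicate-avoidance step.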
The soweego project aims to link Wikidata items with external identifiers from large-scale catalogs. Its Wikidata bot started very small, and early community feedback has been invaluable for scaling up safely. --Hjfocs (talk) 18:01, 17 January 2019 (UTC)