Wikidata/Development/RDF

From Meta, a Wikimedia project coordination wiki
This page represents the current status of discussions within the Wikidata project team. It is used as a base for development. To avoid confusion please only update it to reflect updates of this status. If you want to discuss it, please use the Wikidata mailing list or the talk page. Thank you!

This document is a draft, and should not be assumed to represent the ultimate structure.

This historic note explains part of the RDF export of Wikidata and some of the considerations that led to its design. A more recent description of the RDF format used in Wikidata is found in the report Introducing Wikidata to the Linked Data Web, which supersedes most of the material on this page. Nevertheless, the newer work is based on the older one and is compatible with it in most aspects. There is also a web page with periodic RDF exports of Wikidata based on the most recent format.

General notes

The more familiar you are with the following topics, the easier it will be to understand this note. If you cannot follow a part, feel free to use the references given here.

Furthermore, throughout this note we will not use the URIs that will be used in the system later on, but rather suggestive and convenient URIs that make the argument easier to follow. There exists a separate note on the Wikidata URI scheme. The namespaces used in this note are listed at the bottom of the page.

Motivation

Example

For the following discussion, we introduce two statements with the same item and property, the first with two qualifiers, the second with one:

Berlin
  Population
    3,499,879 [no sources]
      as of: November 30, 2011
      Method: Extrapolation
    8,000 [1 source]
      as of: 15th century

Ground triples

The simplest possible way to represent the first statement would be the following triple:

w:Berlin p:Population "3499879"^^xsd:integer .

We will call this triple the ground triple of the first example statement. The ground triple omits the qualifiers completely, which does not seem too bad for the first statement. The second statement, however, gives the population of Berlin as 8,000 in the 15th century. This would lead to the following ground triple:

w:Berlin p:Population "8000"^^xsd:integer .

If this ground triple were published too, Berlin would be in the result set of a query for cities with fewer than 10,000 inhabitants, which is not desirable.
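The problem can be made concrete with a small sketch. It is illustrative only: the triples are held as plain Python tuples rather than in an RDF store, the prefixed names are the suggestive ones used in this note, and small_cities is a hypothetical helper standing in for a real query.

```python
# Ground triples as (subject, predicate, object) tuples. The prefixed names
# are the illustrative ones from this note, not real Wikidata URIs.
ground_triples = [
    ("w:Berlin", "p:Population", 3499879),  # first statement, qualifiers lost
    ("w:Berlin", "p:Population", 8000),     # second statement, qualifier lost
]


def small_cities(triples, limit):
    """Return subjects with at least one population value below `limit`."""
    return sorted(
        {s for s, p, o in triples if p == "p:Population" and o < limit}
    )


# Berlin matches because the 15th-century value is indistinguishable
# from a current one once the qualifiers are dropped.
print(small_cities(ground_triples, 10_000))  # ['w:Berlin']
```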

Statements with qualifiers

Instead, we need to represent the statement including its qualifiers. Recommendations on how to represent this kind of data are given by the Semantic Web Best Practices note Defining N-ary Relations on the Semantic Web. In the following we follow its recommendation (pattern 1, use case 1).

We have to introduce an intermediary node to represent the statement as such, and connect it to the item and the value. The following triples represent the claim from the first statement.

w:Berlin s:Population Berlin:Statement1 .
Berlin:Statement1 rdf:type o:Statement .
Berlin:Statement1 v:Population "3499879"^^xsd:integer .
Berlin:Statement1 q:As_of "2011-11-30"^^xsd:date .
Berlin:Statement1 q:Method w:Extrapolation .
Berlin:Statement1 rdfs:label "3,499,879 (As of Nov 30, 2011, Method Extrapolation)"@en .

Note that we introduced two new properties, s:Population (s: like statement) and v:Population (v: like value) instead of the original p:Population property (mind the namespaces). The original property, in the p: namespace, is a datatype property that connects items directly with an integer value (i.e. the domain is item). The s: property, on the other hand, is an object property connecting the item with the statement about it.

For the same reason we introduced the q: properties for qualifiers, connecting a statement node with the value of the qualifier.

A user of the data could easily get the ground triple, either by adding an OWL2 axiom stating that p:Population is derived from the property chain of s:Population and v:Population, or by using a SPARQL construct query with the same effect.
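As a rough sketch of that derivation, the following joins the s: and v: triples of the example back into a ground triple. In practice a SPARQL CONSTRUCT query or an OWL2 reasoner would do this; the tuple representation and the function name derive_ground_triples are assumptions made for illustration only.

```python
# Statement-level triples from the example above, as plain tuples.
triples = [
    ("w:Berlin", "s:Population", "Berlin:Statement1"),
    ("Berlin:Statement1", "rdf:type", "o:Statement"),
    ("Berlin:Statement1", "v:Population", 3499879),
    ("Berlin:Statement1", "q:As_of", "2011-11-30"),
    ("Berlin:Statement1", "q:Method", "w:Extrapolation"),
]


def derive_ground_triples(triples):
    """Join item -s:P-> statement -v:P-> value into item -p:P-> value."""
    derived = []
    for item, s_prop, stmt in triples:
        if not s_prop.startswith("s:"):
            continue
        local = s_prop[2:]  # local name shared by the s:, v: and p: variants
        for subj, v_prop, value in triples:
            if subj == stmt and v_prop == "v:" + local:
                derived.append((item, "p:" + local, value))
    return derived


print(derive_ground_triples(triples))
# [('w:Berlin', 'p:Population', 3499879)]
```

Note that the qualifiers play no part in the join, which is exactly why the derived triple carries less information than the statement it came from.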

An additional feature, intended to support Semantic Web browsers, is to give the statement a label that corresponds to a human-readable form of the value including the qualifiers. The label would be exported in all available languages. This enables Semantic Web browsers to display the first triple in the above serialization in a way that is useful to the viewer of the data. We tested it with Marbles and the OpenLink Data Explorer.

In order to see how none and unknown are represented, please refer to the specification below.

Statements with references

Now that we have a node for the statement, it becomes trivial to add a reference.

Berlin:Statement1 prov:wasDerivedFrom Berlin:Statement1-Reference1 .

The provenance ontology is still a moving target, and it might be necessary to adapt this property later. We will also expand on how references are modeled as soon as that work has progressed further.

Representation of values

Note that the values may be, and usually are, further structured themselves. We will detail how the values are exported as we progress with specifying them within Wikidata. Since values often have a unit or an accuracy associated with them, they will often be represented as yet another intermediate node with the respective values attached to it. For our examples we only deal with the datatype entity.

Specification

The specification is defined as a mapping of Wikidata statements written in the Wikidata Object Notation to OWL2 axioms written in Functional-style syntax. The transformation of OWL2 axioms to RDF triples is in turn defined by the OWL2 standard; the result is given here for convenience as well. The current specification is slightly simplified for now, as it omits PropertyIntervalSnak, PropertySomeIntervalSnak, PropertyInstanceOfSnak, PropertySubclassOfSnak, and Rank.

The following is a reiteration of the relevant part of the Wikidata Object Notation.

ItemDescription :=  'ItemDescription(' Item {Statement} ')'
Statement :=  'Statement(' MainSnak {Qualifier} {ReferenceRecord} ')'

Every statement is translated into a number of OWL2 axioms as described below. Every statement is identified by a StatementID, which is an IRI.

MainSnak a PropertyValueSnak

If the MainSnak is a PropertyValueSnak, then it is translated as follows:

ObjectPropertyAssertion( s:Property Item StatementID )
ClassAssertion( o:Statement StatementID )
ObjectPropertyAssertion( v:Property StatementID Value )
AnnotationAssertion( rdfs:label StatementID ValueLabel(Statement) )

s:Property is an IRI that has the same local name as Property but replaces the p: namespace with the s: namespace. The same is defined for v:Property and q:Property respectively. Whenever one of the properties is used, an annotation property to connect them to the base property p:Property will also be given. The function ValueLabel returns an appropriate label describing the value of the statement including the qualifiers (not defined here).

MainSnak a PropertySomeValueSnak

If the MainSnak is a PropertySomeValueSnak, then it is translated as follows:

ObjectPropertyAssertion( s:Property Item StatementID )
ClassAssertion( o:Statement StatementID )
ClassAssertion( ObjectSomeValuesFrom( v:Property owl:Thing ) StatementID )
AnnotationAssertion( rdfs:label StatementID ValueLabel(Statement) )

MainSnak a PropertyNoValueSnak

If the MainSnak is a PropertyNoValueSnak, then it is translated as follows:

ObjectPropertyAssertion( s:Property Item StatementID )
ClassAssertion( o:Statement StatementID )
ClassAssertion( ObjectAllValuesFrom( v:Property owl:Nothing ) StatementID )
AnnotationAssertion( rdfs:label StatementID ValueLabel(Statement) )
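
The three MainSnak cases can be sketched as one small generator of axiom strings. This is only an illustration under stated assumptions: the function name, the kind strings "value", "somevalue" and "novalue", and the Located_in example are hypothetical and not part of the Wikidata Object Notation. ValueLabel is not defined in this note, so the label is passed in by the caller, and the label axiom is written in the AnnotationAssertion form that OWL2 Functional-style syntax uses to annotate an IRI.

```python
def translate_main_snak(kind, prop, item, statement_id, value=None, label=None):
    """Return OWL2 functional-style axioms for the MainSnak of one statement.

    `kind` is "value", "somevalue" or "novalue" (hypothetical names for
    PropertyValueSnak, PropertySomeValueSnak and PropertyNoValueSnak).
    `prop` is the local property name shared by the s: and v: namespaces.
    """
    axioms = [
        f"ObjectPropertyAssertion( s:{prop} {item} {statement_id} )",
        f"ClassAssertion( o:Statement {statement_id} )",
    ]
    if kind == "value":
        axioms.append(
            f"ObjectPropertyAssertion( v:{prop} {statement_id} {value} )"
        )
    elif kind == "somevalue":
        axioms.append(
            f"ClassAssertion( ObjectSomeValuesFrom( v:{prop} owl:Thing ) {statement_id} )"
        )
    elif kind == "novalue":
        axioms.append(
            f"ClassAssertion( ObjectAllValuesFrom( v:{prop} owl:Nothing ) {statement_id} )"
        )
    else:
        raise ValueError(f"unknown snak kind: {kind}")
    if label is not None:
        # Stand-in for ValueLabel(Statement), which this note leaves undefined.
        axioms.append(
            f'AnnotationAssertion( rdfs:label {statement_id} "{label}" )'
        )
    return axioms


# Hypothetical entity-valued statement, used only to exercise the function.
for axiom in translate_main_snak(
    "value", "Located_in", "w:Berlin", "Berlin:Statement2", value="w:Germany"
):
    print(axiom)
```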

Qualifier

Each Qualifier is translated according to its Snak type. If the Qualifier is a PropertyValueSnak, then it is translated as follows:

ObjectPropertyAssertion( q:Property StatementID Value )

If the Qualifier is a PropertySomeValueSnak, then it is translated as follows:

ClassAssertion( ObjectSomeValuesFrom( q:Property owl:Thing ) StatementID )

If the Qualifier is a PropertyNoValueSnak, then it is translated as follows:

ClassAssertion( ObjectAllValuesFrom( q:Property owl:Nothing ) StatementID )

ReferenceRecord

Every ReferenceRecord is given a ReferenceID which is an IRI. Every ReferenceRecord is translated as follows:

ObjectPropertyAssertion( prov:wasDerivedFrom StatementID ReferenceID )
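
The qualifier and reference rules lend themselves to a table-driven sketch. The template table, the function name and the kind strings below are assumptions made for illustration; the example reuses the q:Method qualifier and the reference node of Berlin:Statement1 from earlier in this note.

```python
# One template per qualifier Snak type, mirroring the three rules above.
QUALIFIER_TEMPLATES = {
    "value": "ObjectPropertyAssertion( q:{prop} {stmt} {value} )",
    "somevalue": "ClassAssertion( ObjectSomeValuesFrom( q:{prop} owl:Thing ) {stmt} )",
    "novalue": "ClassAssertion( ObjectAllValuesFrom( q:{prop} owl:Nothing ) {stmt} )",
}


def translate_qualifiers_and_references(stmt, qualifiers, reference_ids):
    """Return axioms for the qualifiers and references of one statement.

    `qualifiers` is a list of (kind, property local name, value) tuples;
    `value` is ignored unless kind is "value".
    """
    axioms = [
        QUALIFIER_TEMPLATES[kind].format(prop=prop, stmt=stmt, value=value)
        for kind, prop, value in qualifiers
    ]
    axioms += [
        f"ObjectPropertyAssertion( prov:wasDerivedFrom {stmt} {ref} )"
        for ref in reference_ids
    ]
    return axioms


for axiom in translate_qualifiers_and_references(
    "Berlin:Statement1",
    [("value", "Method", "w:Extrapolation")],
    ["Berlin:Statement1-Reference1"],
):
    print(axiom)
```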

Discussion of alternatives

The following discussion gives the rationale for the design decisions in this note and can be skipped by readers who are not interested in it.

Punning for the property names

Instead of having different properties in the p:, s:, v: and q: namespaces, which turns every property in Wikidata into four in the RDF export, we could have used a single property for all four use cases. In almost all cases it would still be clear which role the property plays: p: connects the item with the value, s: the item with the statement, v: the statement with the value, and q: is used for qualifiers, connecting the statement with the qualifier value. The only case of ambiguity would be between v: and q:, as they both connect the statement with a value. This becomes a real ambiguity when a qualifier uses the same property as the main value of the statement.
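
A minimal sketch of that ambiguity, using a made-up statement whose main value and qualifier share the property Date (the node ex:Statement1, the property and both values are hypothetical):

```python
# Hypothetical statement whose main value and qualifier use the same property.
separate = [
    ("ex:Statement1", "v:Date", "1990"),  # main value
    ("ex:Statement1", "q:Date", "2000"),  # qualifier
]

# Punning: drop the namespace so one property name serves every role.
punned = [(s, p.split(":", 1)[1], o) for s, p, o in separate]


def values_for(triples, prop):
    """Return all objects attached to a statement node via `prop`."""
    return [o for s, p, o in triples if p == prop]


print(values_for(separate, "v:Date"))  # ['1990']
print(values_for(punned, "Date"))      # ['1990', '2000']
```

With separate namespaces the main value can be selected unambiguously; with a single punned property the two roles can no longer be told apart from the triples alone.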

Punning is generally frowned upon on the Semantic Web, but it is not unheard of: in OWL2, for example, individuals and classes can be punned. A recent proposal in the notorious httpRange-14 discussion also suggests punning as a solution.

We decided not to use punning in our case and to accept the proliferation of properties instead. This is not only best practice but also necessary for the OWL2 DL serialization to be valid OWL, since we need to distinguish clearly between object and datatype properties.

We also wanted to ensure that the vocabulary developed within Wikidata will be reusable outside of Wikidata. In most cases we expect external data publishers to use the direct representation of data with triples (i.e. just the ground triple), so we also wanted to ensure that such a property is available for external reuse. Ironically, this property is not the one we use in the Wikidata export itself, but there are well-defined relationships between them, as described above.

Named graphs

Adding a reference to a claim can be done in three ways:

  1. put every claim in one file and then add provenance metadata about that file
  2. put every claim in a named graph and then add provenance metadata about that graph
  3. reify every claim so that we can add provenance metadata directly to the claim

Ad 1: Having one file for every claim would lead to many files. Even with a conservative estimate of only about ten statements per item, resolving that item would require ten or more HTTP requests. This is prohibitively slow, especially considering that all the requested files are very small. It would also put an unreasonable amount of load on the server, something MediaWiki is not very well equipped for (a server based on Node.js or Twisted would probably be much better equipped for that, though).

Ad 2: The file holding all statements about an item could also contain a named graph for every single claim and then add metadata about these graphs, such as the references. While this could be a solution, there is, at the time of writing, no standard for the serialization of named graphs in files. There are a number of contenders (TriX, TriG, NQuads, etc.), but none of them is even on its way to becoming a standard.

Ad 3: See the section on reification below.

Quads

Quads are a serialization for RDF that adds a name (IRI) to every triple (the fourth value) or, alternatively, a context (i.e. it groups triples into sets). Again, there is neither a standard serialization nor a standard semantics for quads.

Reification

Reification as per RDF standard is widely regarded as bloated and disliked. RDF introduces its own reification syntax, which has never really caught on. Due to its widely negative reputation, and due to discussions about deprecating reification from RDF, we decided against using this mechanism.

Publish the ground triple

One alternative decision concerns the publication of the ground triple. We decided not to publish it, in order to be more consistent throughout the RDF serialization of our data model. This avoids publishing a triple such as the one stating the population of Berlin as 8,000 in the example above.

One could argue for publishing the ground triple at least for unqualified statements, and not for qualified ones. Again we decided against it. First, we would need to represent the statement anyhow in order to publish the reference. Second, we would potentially publish conflicting ground triples in the same file if there are two different sources for two different statements. By publishing everything on the level of statements only, we remain consistent throughout the dataset and take the role of Wikidata as a secondary database seriously.

To do

  • Representation of labels, descriptions and sitelinks
  • Representation of data values
  • Representation of references
  • Representation of rank
  • Representation of the following Snaks: InstanceOf, SubclassOf, PropertyInterval, and PropertySomeInterval

Namespaces

  • w: for Wikidata items
  • o: for the Wikidata ontology (a fixed and small set of terms)
  • p: for Wikidata properties
  • q: for properties used as qualifiers
  • s: for properties used to connect items and statements
  • v: for properties used to connect statements and values
  • Berlin: used as a shortcut for w:Berlin but defined as a prefix
  • xsd:, rdf:, rdfs:, owl: and prov: with their usual meanings

Acknowledgements

The data model and its serialization in RDF has been created with input from the whole Wikidata team, especially Markus Krötzsch and Daniel Kinzler. It further was deeply informed by the RENDER project and ongoing discussions there on how to represent a diversity of knowledge, special thanks to Elena Simperl, Andreas Thalhammer, and Ioan Toma. It was also strongly influenced by discussions during the ESWC Summer School 2012, special thanks go to Aidan Hogan, Barry Norton, and Dan Brickley. Discussions during the time at KIT helped further sharpen it, special thanks to Andreas Harth, Steffen Stadtmüller, Daniel Herzig and Günter Ladwig. It was also based on discussions concerning Shortipedia, special thanks to Varun Rathnakar and Yolanda Gil. --Denny Vrandečić (WMDE) (talk) 09:50, 6 August 2012 (UTC)

See also