Wikidata/Notes/Inclusion syntax v0.2

From Meta, a Wikimedia project coordination wiki


This draft is obsolete and only kept for reference. Please refer to Wikidata/Notes/Inclusion_syntax for the current version.
New Version (January 2013) This is a completely reworked proposal with little resemblance to the previous draft!

This page describes the data inclusion syntax for the Wikibase client, by which the properties of data items can be included and rendered on a wiki page using templates.

Note: Please read Wikidata/Data model before reading this document. Understanding the underlying data model is very useful when discussing the way the data can be accessed and rendered on wiki pages.

Editorial Note: The name is misleading: this is a functional spec for client-side data access and formatting. It's not about syntax.

Editorial Note: This should be turned into an object-oriented functional spec, with the syntax factored out into a parser function binding, Lua binding, etc

Accessing Item Data

Properties of a Wikidata Item can be used via the #property parser function:


This will provide the value of the population property of the page's default item. The default item is the Wikidata item that is associated with this page by virtue of its wikilink property.

To access properties of a different item, the item has to be specified explicitly, either by id or using the associated Wiki page:




Item data is cached in memory, so accessing several properties of the same item is efficient.
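The caching behaviour can be pictured with a minimal Python sketch. This is only an illustration, not the client's implementation: `load_item`, `get_property`, the in-memory `_REPOSITORY` stand-in, and the sample values are all hypothetical.

```python
from functools import lru_cache

# Hypothetical stand-in for the repository; the values are made up for illustration.
_REPOSITORY = {"q1234": {"population": 1234567, "area": 357000}}
load_count = 0  # counts how often an item is actually loaded


@lru_cache(maxsize=128)
def load_item(item_id):
    """Load an item once; later calls for the same id are served from memory."""
    global load_count
    load_count += 1
    return _REPOSITORY[item_id]


def get_property(item_id, name):
    """Read one property of an item, loading the item only if it is not cached."""
    return load_item(item_id)[name]


population = get_property("q1234", "population")
area = get_property("q1234", "area")  # same item, so no second load
```

Under these assumptions, reading two properties of the same item triggers a single load.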

Editorial Note: The exact form of the different types of item ids is not yet decided. However, the form used for the item parameter here should be the same as the form used in (inter)wiki links when linking to items (resp. item pages) on Wikidata. For title-based links, we could use colon syntax ("en:Germany"), or just "Germany" to use the local site's language. For IDs, we could use "/q1234" or "qid/q1234" or "::q1234" or whatever.

Editorial Note: How to access values from complex data types, like geo-coordinates? Use parts? or use dots, ""? or slashes, "coord/lat"?

Value Formatting

The item value is returned as wikitext by default, with some formatting applied where appropriate. For instance, dates and numbers are formatted using the formatting rules of the page's content language, item references are turned into local wiki links, etc. The concrete formatting rules depend on the property's data type.

To access the raw value, e.g. to apply custom formatting, use the format option:


This would return something like 1234567 instead of using e.g. the en-US formatted version 1,234,567.
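The difference between the raw and the formatted value can be sketched in Python. This is a minimal illustration that assumes en-US digit grouping only; a real client would apply the rules of the page's content language. The function name `format_number` and its `raw` flag are hypothetical.

```python
def format_number(value, raw=False):
    """Return the bare digits, or the value with en-US style thousands separators."""
    if raw:
        return str(value)
    return f"{value:,}"


formatted = format_number(1234567)       # "1,234,567"
unformatted = format_number(1234567, raw=True)  # "1234567"
```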

Unit conversion can also be done on the fly:


This would return the area in square miles, to one digit after the decimal point.
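As a rough illustration of such a conversion, the following Python sketch converts an area from square kilometres to square miles and renders it with one digit after the decimal point. The function name and the choice of source unit are assumptions made for the example; the spec does not say in which unit values are stored.

```python
# One international mile is exactly 1.609344 km, so a square mile is that value squared.
KM2_PER_SQUARE_MILE = 1.609344 ** 2


def area_in_square_miles(area_km2, decimals=1):
    """Convert square kilometres to square miles, formatted to a fixed number of decimals."""
    return f"{area_km2 / KM2_PER_SQUARE_MILE:.{decimals}f}"


converted = area_in_square_miles(100.0)  # "38.6"
```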

The following parameters can be used to control the function's output:

the precision for the output of scalar values, such as "1km" or "1 ounce". If not given, it may be derived from the unit used for display and from the value's precision qualifier.
the unit of measurement for display (implies a precision of 1/10th (TBD?!) of that unit, if the precision isn't given explicitly).
the language to prefer for the value, as a fall-back list of language codes. By default, the page's content language is used.
the source reference identifier of a specific statement to use (see #Explicit Selection of Statements).
which part of the property's statement(s) to show (see #Statement Parts). By default, the value part is shown (if one exists for the respective data type).

Statement Parts

In Wikidata, properties don't simply have one value. Instead, they have statements consisting of several parts (roughly equivalent to the "snaks" in the data model), one of which is (usually) "the" value. Some types of properties may not have a single value - e.g. a geo-location would have a latitude and longitude part, but no "value".

All parts of the statement can be accessed using the #property function:


These would provide the source(s), the accuracy, and the timestamp of the statement about the population. The timestamp would be given as a year, even if provided in more detail. The source references are themselves complex objects that require templates for rendering (see below).

If the part is not given explicitly, the default (value) part of the statement is used (as in the examples in the previous section). If the statement has no default value part (e.g. for geo-coordinates), and no part is explicitly specified, a warning message is returned instead of any property value.
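The fallback rule can be sketched in Python, modelling a statement as a plain dictionary of parts. This is only an assumption-laden illustration: the function name `get_part`, the warning text, the sample coordinates, and returning `None` for an explicitly requested but missing part are all hypothetical choices that the spec does not prescribe.

```python
def get_part(statement, part=None):
    """Return the requested part, falling back to the default "value" part."""
    name = part if part is not None else "value"
    if name in statement:
        return statement[name]
    if part is None:
        # No default value part exists (e.g. geo-coordinates): return a warning instead.
        return "Warning: this statement has no default value part; specify a part explicitly."
    return None  # assumption: an explicitly requested but missing part yields nothing


population = {"value": 263455, "accuracy": "+/-200", "timestamp": "2010", "source": "Foo"}
coordinates = {"latitude": 52.5, "longitude": 13.4}  # illustrative values only

default_part = get_part(population)
source_part = get_part(population, "source")
no_default = get_part(coordinates)
latitude = get_part(coordinates, "latitude")
```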

The following parts are well known and/or have a special meaning:

value: the "main" value part (if one exists - some data types like geo-coordinates don't have "a" value).
source: any source references
accuracy: the margin of error (for scalar values)
timestamp: the point in time the value applies to
preference: preference value(s) for the present statement(s), such as "preferred", "other", "unsourced" or "deprecated".

Editorial Note: (most) qualifiers are user defined. Qualifiers have data types.

Part names given here are essentially the names of qualifiers (that is: of snak types), in the data model. Additional qualifiers may be added and used on the repository, though.

There are also some "virtual parts" that have no correspondence in the data model but may be useful when showing properties:

link/button to expand the display of the property (e.g. to show a pop-up with all values/statements). (TBD: how to define/customize the appearance?)
link/button to trigger an edit interface for the given property (should show all values!)
any warnings associated with this value, using the appropriate icons and links. This is an automatic ("virtual") part, not based on an actual qualifier/snak in the data. Typical warnings are
  • disputed if there are multiple default values, but the property is not marked as inherently multi-value.
  • unsourced if an unsourced value is included, but the property is not marked as inherently unsourced.
  • deprecated if a deprecated value is included
  • unreviewed (for future use with flagged revisions) the value was not yet reviewed by an established editor.
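How the first three warnings could be derived is sketched below in Python. The sketch rests on assumptions: "multiple default values" is interpreted as more than one distinct value among the statements being shown, the flags `multi_value` and `no_source_needed` stand for the corresponding property declaration flags, and the function name `warnings_for` is hypothetical. The "unreviewed" warning is left out because it is reserved for future use.

```python
def warnings_for(shown, multi_value=False, no_source_needed=False):
    """Derive the virtual warnings for the statements that are being displayed."""
    found = []
    distinct_values = {statement["value"] for statement in shown}
    if len(distinct_values) > 1 and not multi_value:
        found.append("disputed")
    if any(s["preference"] == "unsourced" for s in shown) and not no_source_needed:
        found.append("unsourced")
    if any(s["preference"] == "deprecated" for s in shown):
        found.append("deprecated")
    return found


conflicting = [
    {"value": 263455, "preference": "preferred"},
    {"value": 251104, "preference": "preferred"},
]
without_source = [{"value": 250000, "preference": "unsourced"}]

disputed = warnings_for(conflicting)
accepted = warnings_for(conflicting, multi_value=True)
unsourced = warnings_for(without_source)
```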

In addition to the above, some parts are taken from the property declaration, not the item data itself:

the property's label (in the page's content language)
the property's data type identifier.

Coalescing Values

Editorial Note: "Aggregating" may be a better term than "coalescing"

Editorial Note: Perhaps "list of values" should be the only type of aggregation.

Editorial Note: Perhaps parts should not be aggregated individually?

Properties don't necessarily have a single value. They may have multiple values (more precisely, multiple statements), either because they are inherently multivalued, or because conflicting opinions exist about the value, or because the value changes over time.

Each statement has a "preference" value, which may be "preferred", "other", "unsourced" or "deprecated". When determining the statement to use for a call to the #property function, "preferred" statements are picked over "other" statements, "other" statements are preferred over "unsourced" ones, etc.
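A minimal Python sketch of this selection rule follows. The preference names come from this page; the function name `strongest_statements` and the dictionary representation of statements are assumptions made for the example.

```python
# Preference levels from strongest to weakest.
PREFERENCE_ORDER = ["preferred", "other", "unsourced", "deprecated"]


def strongest_statements(statements):
    """Return all statements of the strongest preference level that is present."""
    for level in PREFERENCE_ORDER:
        tier = [s for s in statements if s["preference"] == level]
        if tier:
            return tier
    return []


sample = [
    {"value": 261108, "source": "Foo", "preference": "other"},
    {"value": 251104, "source": "Bar", "preference": "preferred"},
    {"value": 268122, "source": "Quux", "preference": "preferred"},
]
chosen = strongest_statements(sample)  # the two "preferred" statements
```

Note that the result is a list: the rule narrows the candidates down to one preference level, not necessarily to one statement.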

However, there may still be multiple statements that are equally "strong": There may be multiple "preferred" values, or multiple "other" values (and no preferred values), etc.

In that case, these statements are combined or "coalesced" to form a single virtual statement: all parts of the statements are combined in a way appropriate for their respective data types. The lists of sources for the statements are concatenated, the worst accuracy is chosen, the values are combined into ranges or lists, depending on their type (note that while there may be several statements with different values for a given property, they all have the same data type, namely, the type specified in the property declaration). For example, if there are three values for the population given by different sources, and there's no agreement on which source should be authoritative, this would be represented by three statements for a single property:

property   | value  | accuracy | timestamp    | source | preference
population | 263455 | +/-200   | 2010         | Foo    | preferred
population | 251104 | +/-100   | 2011         | Bar    | preferred
population | 268122 | +/-1%    | January 2010 | Quux   | preferred
There may be more non-preferred (e.g. older) values:

property   | value  | accuracy | timestamp | source | preference
population | 261108 | +/-200   | 2009      | Foo    | other
population | 250104 | +/-80    | 2009      | Acme   | other
In this case, the default values would automatically be coalesced into a range:


Would evaluate to something like this (depending on the property definition):

251,104 – 268,122

Other parts are also coalesced, essentially forming a single statement:

property   | value           | accuracy | timestamp   | source         | preference
population | 251104 - 268122 | +/-1%    | 2010 - 2011 | Foo, Bar, Quux | preferred

The way different parts are combined depends on their type and semantics: scalar values are combined into ranges; texts, sources and item references are combined into lists; the accuracy uses the maximum margin of error (i.e. the worst accuracy); the timestamp uses a range cut to the coarsest precision given (in this case, the "January" part is dropped because the other timestamps didn't provide a month); etc.
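The combination rules can be sketched in Python using the three preferred population statements from this section. This is an illustration under stated assumptions, not a specification: accuracies are modelled as ("abs", amount) or ("pct", percent) pairs and compared by their absolute size, timestamps are modelled as (year,) or (year, month) tuples, and the name `coalesce` is hypothetical.

```python
def coalesce(statements):
    """Merge equally strong statements into one virtual statement."""

    def absolute_margin(statement):
        kind, amount = statement["accuracy"]
        if kind == "abs":
            return amount
        return statement["value"] * amount / 100  # percentage of the value

    values = [s["value"] for s in statements]
    # Cut every timestamp to the coarsest precision present (year only vs. year and month).
    depth = min(len(s["timestamp"]) for s in statements)
    stamps = [s["timestamp"][:depth] for s in statements]
    return {
        "value": (min(values), max(values)),           # scalars become a range
        "accuracy": max(statements, key=absolute_margin)["accuracy"],  # worst accuracy
        "timestamp": (min(stamps), max(stamps)),       # range at the coarsest precision
        "source": [s["source"] for s in statements],   # sources become a list
    }


preferred = [
    {"value": 263455, "accuracy": ("abs", 200), "timestamp": (2010,), "source": "Foo"},
    {"value": 251104, "accuracy": ("abs", 100), "timestamp": (2011,), "source": "Bar"},
    {"value": 268122, "accuracy": ("pct", 1), "timestamp": (2010, 1), "source": "Quux"},
]
merged = coalesce(preferred)
```

With this input, the 1% margin wins as the worst accuracy because 1% of 268122 is larger than either absolute margin.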

Editorial Note: More details about aggregation for different data types (and qualifier types).

Explicit Selection of Statements

To avoid automatic coalescing even if there are several equally strong statements, the desired statement can be selected explicitly, using its source as an identifier. (Note: this assumes that a single source does not make multiple contradicting claims; if it does, the source should be made more precise, e.g. by giving the page and paragraph.)

So, in the above example, this could be used:


This would pick the value given by source "Bar":


Editorial Note: TBD: still unclear how source/reference identifier are maintained.
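Selection by source, as described in this section, could be modelled along the following lines in Python. Since the handling of source identifiers is still undecided, this sketch simply assumes that the source name is the identifier; the function name `select_by_source` and the choice to raise an error for zero or several matches are hypothetical.

```python
def select_by_source(statements, source):
    """Pick the single statement attributed to the given source."""
    matches = [s for s in statements if s["source"] == source]
    if len(matches) != 1:
        # The mechanism assumes each source makes exactly one claim about the property.
        raise LookupError(f"expected one statement from {source!r}, found {len(matches)}")
    return matches[0]


preferred = [
    {"value": 263455, "source": "Foo"},
    {"value": 251104, "source": "Bar"},
    {"value": 268122, "source": "Quux"},
]
from_bar = select_by_source(preferred, "Bar")
```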

Multiple Statements

If a property has multiple statements, it is sometimes desirable to simply list all or some of them in detail. With the help of a scripting language like Lua, this is easy enough using a for-loop. However, we need some help to achieve the same using traditional MediaWiki template syntax (i.e. parser functions).

To this end, the {{#property-values}} function can be used:


This would call the template country-area-info for each statement of the property area. Inside the template, the respective property (e.g. area) would have only a single value. Other parameters (e.g. unit) are passed to the template as simple parameters.

Editorial Note: This means template output depends on context beyond template parameters, which is bound to screw up caching.

If Template:Country-area-info was supposed to generate a single table row, its implementation could look something like this:

     |-
     | {{#property:area|unit={{{unit|km^2}}}|precision={{{precision|1}}}}} {{#property:area|part=accuracy}} {{{unit|km^2}}}
     | {{#property:area|part=timestamp|precision=year}} <ref>{{#property:area|part=source}}</ref>

No coalescing would take place here, since the additional statements for the property would be masked by the {{#property-values}} function.

So, using this mechanism, a table of different opinions about the area could be constructed like this:

{| class="wikitable" |
! value !! accuracy !! timestamp !! source

which would result in something like:

Area                 | Time
23,455 +/-20 miles^2 | 2011[Foo]
25,104 +/-15 miles^2 | 2010[Bar]
26,822 +/-2% miles^2 | May 2010[Quux]

Editorial Note: the above could of course be formatted more nicely, the example is deliberately kept simple.

Editorial Note: Pending: options for controlling the maximum number of items, minimum required strength, etc.
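The per-statement expansion described in this section corresponds to a simple loop. The following Python sketch models it with the area figures from the example output; it is only an illustration, and the function names `property_values` and `country_area_row` as well as the row layout are assumptions.

```python
def country_area_row(statement, unit):
    """Render one statement as a two-cell row, as the row template would."""
    area = f"{statement['value']:,} {statement['accuracy']} {unit}"
    time = f"{statement['timestamp']}[{statement['source']}]"
    return f"{area} | {time}"


def property_values(statements, template, **params):
    """Call the row template once per statement; no coalescing takes place."""
    return [template(statement, **params) for statement in statements]


area_statements = [
    {"value": 23455, "accuracy": "+/-20", "timestamp": "2011", "source": "Foo"},
    {"value": 25104, "accuracy": "+/-15", "timestamp": "2010", "source": "Bar"},
    {"value": 26822, "accuracy": "+/-2%", "timestamp": "May 2010", "source": "Quux"},
]
rows = property_values(area_statements, country_area_row, unit="miles^2")
```

Here each statement is passed to the template explicitly, so the output depends only on the arguments. The parser function variant relies on context instead, which is the caching concern raised in the editorial note above.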

Changing the Default Item

Editorial Note: This feature is syntactic sugar and should be marked as optional

If a page or template wants to make a different item the default item, this can be done using the {{#data-item}} function. For instance, on the page Germany on the English language Wikipedia, the default item would (by definition) be en/Germany. So


would be shorthand for


To override this, the following syntax can then be used:




would be shorthand for


This may be of limited use directly inside an article page (the need to do this would indicate that there's something wrong with the language links or the article's scope). But this mechanism is expected to be quite useful inside templates. For instance, there may be a separate article (and data item) about the German economy, with some overview data like GNP, etc. It may then be useful to show an infobox about the economy directly on the page about Germany itself. With the help of {{#data-item}}, we could do the following on the Germany page:

 {{national economy box|item=en/Economy of Germany}}

Inside Template:National_economy_box, the {{{item}}} parameter could now be used directly to fetch data properties from the desired item:


But this is cumbersome (item={{{item}}} all over the place) and also error prone (missing item={{{item}}} in some places). It's nicer to just set the default item for the template:


This sets the data item for the present scope (implemented using a preprocessor frame), so the item doesn't have to be given explicitly:


Note that {{#data-item}} sets the default item for the present scope (e.g. the template), not the entire page (which would be confusing)! The present "scope" is basically the wikitext stored on a single wiki page - "calling" a template creates a new scope.
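The scoping rule can be modelled with a small Python sketch. It assumes that every new scope starts out with the page's default item and that a changed default is not inherited by deeper template calls; the spec leaves the inheritance question open, so this is one possible reading. The class `Scope` and its method names are hypothetical.

```python
class Scope:
    """One preprocessor frame: a page, or a single template call."""

    def __init__(self, page_item):
        self.page_item = page_item  # the item associated with the page itself
        self.item = page_item       # default item for this scope

    def data_item(self, item):
        """Model of {{#data-item}}: change the default item for this scope only."""
        self.item = item

    def call_template(self):
        """Calling a template creates a fresh scope (assumption: nothing is inherited)."""
        return Scope(self.page_item)


page = Scope("en/Germany")
box = page.call_template()
box.data_item("en/Economy of Germany")
nested = box.call_template()
```

Under these assumptions, the page keeps en/Germany as its default item, the infobox scope uses en/Economy of Germany, and a template called from within the infobox starts again from the page's item.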

Editorial Note: If "deeper" template calls should inherit the changed default item from "further up", this would mess with chunk caching in Parsoid, since it depends on context, not on the parameters.

Editorial Note: Parsoid evaluates parser functions in parallel, but #data-item would have to be evaluated before all calls to #property, otherwise the results become random.

Special properties

TBD: Describe special properties like edit-link, id, alias, etc.

Formatting Sources

TBD: mechanisms for formatting sources and source references in compliance with the current citation system and existing citation templates.

Property Definitions

TBD: Describe what properties properties have, how they are maintained, etc

Item properties reference the Property declaration. Property declarations are heavily cached on repository and client. They contain:

  • Data type (also a reference, also heavily cached)
  • Labels in several languages
  • "no source needed" flag (don't assign "unsourced" status, even if no source is given)
  • "multi-value" flag (multiple default values don't constitute a conflict/dispute)