Wikidata/Development/Queries

From Meta, a Wikimedia project coordination wiki

This note describes how queries in Wikidata will be implemented. Since queries are a rather big topic, this document is incomplete and contains several places that need to be further refined.

Overview

The version presented here ignores qualifiers and references for querying, but keeps them intact in the result set. In the first iteration it is not possible to limit by them, sort by them, or do anything with them other than display them. Also, all queries are conducted only on statements that are ranked as preferred; all other statements are ignored for querying so far.

Wikidata will be able to take a Query and return a QueryResult. A client (e.g. a Wikipedia) can in turn take a QueryResult and turn it into a QueryResultVisualization.

A Query consists of

  • a QueryConcept (or query description, not to be confused with an entity description)
  • an array of SelectionRequests, and
  • a few QueryOptions: Sort, Offset and Limit

A QueryConcept determines which entities can belong to the QueryResult that answers a Query. A QueryConcept can be

  • a conjunction of concepts
  • a disjunction of concepts
  • a property with any value
  • a property with a specific value or value range
  • a property with a value fulfilling a concept
  • (further query concept elements can be added later to deal with references and qualifiers)

Examples of query concepts are "everything with a 'population' of more than 1,000,000 that 'is a' 'city'", "everything that is 'born in' something that is in 'country' 'Japan', and that was 'born' before 1500", or simply "everything with the 'ISBN' 3-237-233-443" (the last of which will usually yield just a single result).
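The concept types listed above can be sketched as a small data model with a recursive matcher. This is only an illustration under stated assumptions: the class names (AnyValue, ValueEquals, ValueRange, ValueFulfils, Conjunction, Disjunction), the dictionary representation of entities, the treatment of range bounds as exclusive, and the toy data are all hypothetical and not part of the design.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union

# Toy entity representation (an assumption): property label -> list of values.
Entity = Dict[str, list]


@dataclass
class AnyValue:
    """A property with any value."""
    prop: str


@dataclass
class ValueEquals:
    """A property with a specific value."""
    prop: str
    value: object


@dataclass
class ValueRange:
    """A property with a value in a range; bounds are exclusive in this sketch."""
    prop: str
    minimum: Optional[float] = None
    maximum: Optional[float] = None


@dataclass
class ValueFulfils:
    """A property whose value is an entity that fulfils another concept."""
    prop: str
    concept: "QueryConcept"


@dataclass
class Conjunction:
    parts: List["QueryConcept"]


@dataclass
class Disjunction:
    parts: List["QueryConcept"]


QueryConcept = Union[AnyValue, ValueEquals, ValueRange, ValueFulfils,
                     Conjunction, Disjunction]


def matches(concept: QueryConcept, entity: Entity, entities: Dict[str, Entity]) -> bool:
    """Return True if the entity fulfils the concept."""
    if isinstance(concept, Conjunction):
        return all(matches(p, entity, entities) for p in concept.parts)
    if isinstance(concept, Disjunction):
        return any(matches(p, entity, entities) for p in concept.parts)
    values = entity.get(concept.prop, [])
    if isinstance(concept, AnyValue):
        return bool(values)
    if isinstance(concept, ValueEquals):
        return concept.value in values
    if isinstance(concept, ValueRange):
        return any(
            (concept.minimum is None or v > concept.minimum)
            and (concept.maximum is None or v < concept.maximum)
            for v in values
        )
    if isinstance(concept, ValueFulfils):
        # The value must name another entity, which is checked recursively.
        return any(
            v in entities and matches(concept.concept, entities[v], entities)
            for v in values
        )
    raise TypeError(f"unknown concept type: {type(concept).__name__}")


def select(concept: QueryConcept, entities: Dict[str, Entity]) -> List[str]:
    """Return the ids of all entities that fulfil the concept."""
    return [eid for eid, e in entities.items() if matches(concept, e, entities)]


# Toy data, not real Wikidata content.
ENTITIES = {
    "city-a": {"is a": ["city"], "population": [2_500_000], "country": ["Japan"]},
    "city-b": {"is a": ["city"], "population": [40_000], "country": ["Japan"]},
    "person-x": {"born in": ["city-b"], "born": [1420]},
}

big_cities = Conjunction([
    ValueRange("population", minimum=1_000_000),
    ValueEquals("is a", "city"),
])
born_in_japan_before_1500 = Conjunction([
    ValueFulfils("born in", ValueEquals("country", "Japan")),
    ValueRange("born", maximum=1500),
])
```

Calling select(big_cities, ENTITIES) evaluates the first example concept against the toy data. A real implementation would evaluate concepts against an index rather than by scanning every entity.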

A SelectionRequest describes a "column" in the QueryResult. In the simplest case this is just a property. Later we can add certain claim patterns that further filter the selected results, or that describe a computation such as the population divided by the area.

The QueryOptions have an array of Sorts, each of which applies a Sorter to one of the existing SelectionRequests, and a Limit, which cuts off the result set at some point, based on the given sorts. A simple case would be to sort the result set by population in decreasing order and take only the first 5000 results. (In general, many queries are expected not to need QueryOptions; instead, the complete QueryResult can be held and sorted and limited further when visualizing the results on the client.)
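As a rough illustration of how Sort, Offset and Limit could interact, the following sketch applies them to a list of rows. The function name apply_options, the row representation as dictionaries, and the encoding of a Sort as a (column, descending) pair are assumptions made for this example only.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, object]


def apply_options(
    rows: List[Row],
    sorts: List[Tuple[str, bool]],
    offset: int = 0,
    limit: Optional[int] = None,
) -> List[Row]:
    """Sort rows by the given (column, descending) pairs, then cut a window."""
    ordered = list(rows)
    # Python's sort is stable, so sorting by the least significant key first
    # and the most significant key last yields a correct multi-key ordering.
    for column, descending in reversed(sorts):
        ordered.sort(key=lambda row: row[column], reverse=descending)
    end = None if limit is None else offset + limit
    return ordered[offset:end]
```

The example from the text, sorting by population in decreasing order and keeping the first 5000 results, would then read apply_options(rows, [("population", True)], limit=5000).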

As a side note: it probably will not be possible to sort by label or by values of type multilingual text.

A QueryResult is an array of entities, each with claims. The claims are selected or created based on the SelectionRequests. (Basically, a query result is a tabular structure, where the rows are the entities, the columns are the selection requests, and every field contains the respective claims, including qualifiers and references.)
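The tabular structure can be sketched as follows: one row per entity, one column per selection request, and each field holding the full claims. The function name build_query_result and the claim layout (dictionaries with "property", "value", "qualifiers" and "references" keys) are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def build_query_result(
    entities: Dict[str, dict],
    entity_ids: List[str],
    selection_requests: List[str],
) -> List[dict]:
    """Build one row per entity and one column per selection request."""
    result = []
    for entity_id in entity_ids:
        claims = entities[entity_id]["claims"]
        row = {
            "entity": entity_id,
            "columns": {
                # Each field keeps the complete claims, so qualifiers and
                # references stay available for display on the client.
                prop: [claim for claim in claims if claim["property"] == prop]
                for prop in selection_requests
            },
        }
        result.append(row)
    return result
```

A field is an empty list when the entity has no claim for the requested property, which keeps the table rectangular.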

Wikidata will provide a module that, given a Query, returns a QueryResult. Wikidata decides whether the query can be answered ad hoc, and if so, answers it. If allowed, it will check whether the query results have been cached and return the cached results. Wikidata may at any time simply decline to answer a query.
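One possible reading of this decision flow is sketched below. The order of the checks (cache first, then ad hoc evaluation, then declining) is an assumption, as are all names: answer_query, QueryDeclined, and the callbacks evaluate and can_answer_adhoc.

```python
from typing import Callable, Dict, List


class QueryDeclined(Exception):
    """Raised when the service chooses not to answer a query."""


def answer_query(
    query_key: str,
    evaluate: Callable[[str], List[str]],
    cache: Dict[str, List[str]],
    can_answer_adhoc: Callable[[str], bool],
    allow_cached: bool = True,
) -> List[str]:
    """Return a result from the cache or by ad hoc evaluation, or decline."""
    # Use a cached result if the caller allows it and one exists.
    if allow_cached and query_key in cache:
        return cache[query_key]
    # Otherwise answer ad hoc, but only if the service is willing to.
    if can_answer_adhoc(query_key):
        result = evaluate(query_key)
        cache[query_key] = result
        return result
    raise QueryDeclined(f"query {query_key!r} will not be answered now")
```

Here the query is identified by a plain string key for brevity; a real module would take a structured Query.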

A query can be saved in a query entity, which is a page in the wiki. The page displays the query concept, the selections, options, etc., and also, if already cached, the latest query results. The page, like all entities, has labels, descriptions, and aliases. A Wikipedia will, in general, access such a query by its name, and decide locally what kind of formatter to use to display the query results (e.g. a bar chart, a map, a table, an interactive widget), and also whether further limits and sorting should be applied.

The query results are cached and recalculated from time to time. Wikidata users can put a query into a (priority?) queue to get it recalculated. Such a (priority?) queue may also be used while a query is being "built". This would also make it possible to recalculate a query result only on explicit request, or to have a "sighted version" of a query result.

Wikidata will also eventually provide a module that, given a QueryConcept and an entity, returns an explanation of why the entity does or does not belong to the concept.

A first implementation step: Property-Value queries (SQL)

The first iteration of the implementation will only support ad hoc queries (i.e. they might still be cached, but there is no notion of query entities yet) and only one QueryConcept type, a property with a specific value.

Technically, this will be implemented by keeping all property-values in a table with an index over the property and value. This can be done in SQL. The queries are basically of the type "located in Germany" or "has ISBN 1-393-2334-X". These queries do not require any joins.

There will be one table per datatype (i.e. one that points to strings, one that points to items, one that points to numbers, etc.; this is a short list). Each table will have the following structure:

  • row-id (if needed for housekeeping)
  • entity-id
  • statement-id
  • property-id
  • value (the type depends on the datatype of this table, e.g. item-id for items, string for strings, etc.)
  • statement-blob (the whole structured statement in JSON for access, since otherwise we would need a lookup on the entity content; we are unsure whether to have the statement-blob or not)

These tables will be updated whenever a statement is changed, which happens often. Together they will have as many rows as we have statements, i.e. around 12.5 M currently, and this is expected to grow quite a bit (around 50 M by the end of the year?). Because of frequent updates, the row-id is expected to grow beyond the 32-bit range soon.

For queries, the table will be indexed on property-id / value pairs (the entity id may be added to this index if that helps the performance of joins via the entity id later); for updates when an entity is changed, there is a second (unique) index on entity-id and statement-id.
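The table layout and the two indexes can be tried out with a small sketch. It uses Python's built-in sqlite3 module with an in-memory database purely for convenience; the choice of SQLite, the table and index names, and the column types are assumptions and not part of the design. Only the table for string values is shown.

```python
import sqlite3

SCHEMA = """
CREATE TABLE statements_string (
    row_id         INTEGER PRIMARY KEY,  -- housekeeping id (64-bit in SQLite)
    entity_id      TEXT NOT NULL,
    statement_id   TEXT NOT NULL,
    property_id    TEXT NOT NULL,
    value          TEXT NOT NULL,        -- string datatype for this table
    statement_blob TEXT                  -- optional JSON of the whole statement
);
-- Index used for answering property/value queries.
CREATE INDEX idx_string_property_value
    ON statements_string (property_id, value);
-- Unique index used when an entity's statements are updated.
CREATE UNIQUE INDEX idx_string_entity_statement
    ON statements_string (entity_id, statement_id);
"""


def create_store() -> sqlite3.Connection:
    """Create an in-memory database with the sketched table and indexes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def save_statement(conn, entity_id, statement_id, property_id, value, blob=None):
    """Insert a statement, replacing any earlier version of the same statement."""
    conn.execute(
        "INSERT OR REPLACE INTO statements_string "
        "(entity_id, statement_id, property_id, value, statement_blob) "
        "VALUES (?, ?, ?, ?, ?)",
        (entity_id, statement_id, property_id, value, blob),
    )


def find_entities(conn, property_id, value):
    """Return the ids of entities having a statement with this property and value."""
    rows = conn.execute(
        "SELECT DISTINCT entity_id FROM statements_string "
        "WHERE property_id = ? AND value = ? ORDER BY entity_id",
        (property_id, value),
    )
    return [row[0] for row in rows]
```

The lookup in find_entities needs no join, which matches the kind of query this first step is meant to support.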

No support for distance queries on the Geodata will be implemented in the SQL approach.

EntitiesByPropertyValues (first iteration)

Parameters: property (required), value (type depends on property, required), entity type (required), offset (optional), limit (optional)

Returns: a list of all entity IDs that have a statement where the main snak has the given property and the value is exactly the given value. Optionally, if needed, the offset for further results.

(The value will be loosened later to maybe encompass range queries, i.e. bigger than, smaller than, etc.)
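The contract of this module can be illustrated with a sketch that works on an in-memory list of statements. The function name entities_by_property_value, the statement layout (dictionaries with "entity", "entity_type", "property" and "value" keys describing the main snak), and the way the next offset is reported are assumptions made for this example.

```python
from typing import List, Optional, Tuple


def entities_by_property_value(
    statements: List[dict],
    property_id: str,
    value: object,
    entity_type: str,
    offset: int = 0,
    limit: Optional[int] = None,
) -> Tuple[List[str], Optional[int]]:
    """Return matching entity ids and, if more results exist, the next offset."""
    matching: List[str] = []
    seen = set()
    for statement in statements:
        if (
            statement["entity_type"] == entity_type
            and statement["property"] == property_id
            and statement["value"] == value  # exact match only in this iteration
            and statement["entity"] not in seen
        ):
            seen.add(statement["entity"])
            matching.append(statement["entity"])
    if limit is None:
        return matching[offset:], None
    end = offset + limit
    # Report the offset for further results only when something is left.
    next_offset = end if end < len(matching) else None
    return matching[offset:end], next_offset
```

An entity with several matching statements is returned only once, since the result is a list of entity ids rather than of statements.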

A second implementation step: Ranges, coordinates, and conjunctive queries (Solr / ElasticSearch)

For the next step, we would like to move on to querying conjunctions (a city and located in Germany) and geographical data (close to this coordinate). This will not work well with SQL. Our current thinking and tests point to Solr or ElasticSearch as good technologies to support this kind of query.

In Solr or ElasticSearch, every item would be represented as a document, and each statement would lead to one or more features of the document. Solr and ElasticSearch also support geocoordinates, ranges, and sorting.
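The idea of one document per item can be sketched without committing to either search engine. The code below builds plain Python dictionaries and filters them in memory; it does not use any Solr or ElasticSearch API, and the function names item_to_document and matches_all, the field layout, and the toy data are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def item_to_document(item_id: str, statements: List[Tuple[str, object]]) -> Dict[str, object]:
    """Turn an item's (property, value) statements into one flat document."""
    document: Dict[str, object] = {"id": item_id}
    for prop, value in statements:
        # Each statement contributes a value to the field named after its property.
        document.setdefault(prop, []).append(value)
    return document


def matches_all(
    document: Dict[str, object],
    conditions: List[Tuple[str, Callable[[object], bool]]],
) -> bool:
    """Conjunctive filter: every condition must hold for some value of its field."""
    return all(
        any(predicate(v) for v in document.get(prop, []))
        for prop, predicate in conditions
    )


# Toy data, not real Wikidata content.
documents = [
    item_to_document("item-1", [("is a", "city"), ("located in", "Germany"), ("population", 3_000_000)]),
    item_to_document("item-2", [("is a", "city"), ("located in", "France"), ("population", 2_000_000)]),
    item_to_document("item-3", [("is a", "river"), ("located in", "Germany")]),
]

# "A city and located in Germany", expressed as a conjunction of conditions.
cities_in_germany = [
    doc["id"]
    for doc in documents
    if matches_all(doc, [
        ("is a", lambda v: v == "city"),
        ("located in", lambda v: v == "Germany"),
    ])
]
```

In an actual deployment the search engine would evaluate such conjunctions and range conditions against its own index; the in-memory filter here only stands in for that step.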