Wikidata/Technical proposal

From Meta, a Wikimedia project coordination wiki
This page represents the current status of discussions within the Wikidata project team. It is used as a base for development. To avoid confusion please only update it to reflect updates of this status. If you want to discuss it, please use the Wikidata mailing list or the talk page. Thank you!

Wikidata is a project to create a free knowledge base about the world that can be read and edited by humans and machines alike. It will provide data in all of the languages of the Wikimedia projects, and will enable central access to the data in a way similar to what Wikimedia Commons allows for multimedia files.

Wikidata is proposed as a Wikimedia-hosted and -maintained project. This document concerns the initial requirements for starting and developing the project, and discusses long-term ramifications.

This document suggests a three-phase approach to the initial development:

  • The first phase (interwiki links) will create an entity base for the Wikimedia projects. This will provide an alternative to the current interlanguage link system, integrated into Wikipedia from the start.
  • The second phase (infoboxes) will gather infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from data.wikimedia.org.
  • The third phase (lists) will expand the set of properties beyond those related to infoboxes, and will provide ways of exploiting this data within and outside the Wikimedia projects.

The following sections will discuss each phase in detail, giving an overview of the phase, the technical requirements that need to be implemented by a given phase, and why; the main challenge of the given phase with respect to technical, organizational, and community challenges; the "milestone" for the satisfactory completion of each phase; a list of what will not be done by the project, and why; and a list of optional requirements that can be added by decision of the project.

The three phases do not mean that there will be only three deployments of code, or version updates. We will rather follow the credo of release early, release often, which will enable us to swiftly react to user feedback and continuously improve the system.

Phase 1: Interwiki links

Already in phase 1, Wikidata will be launched at its final address. The wiki will provide a page for every entity described by an article in one of the Wikipedia language editions. This page will contain the following information:

  • Links to the articles in the different Wikipedia language editions describing the entity of the given page. Every Wikipedia will contain only one such article.
  • Labels and short descriptions for each such entity. There will be one label and short description for each supported language.
  • Aliases for each entity. It will be possible to have several aliases for each entity in each supported language.

The data will be exported in different formats, especially RDF, SKOS, and JSON. Wikidata will also provide an API to edit the content. The wiki source text will not be editable as free text, but only through specific API calls like “Add source” or “Remove label”. Wikidata will provide one or more user interfaces to the Wikidata API. Since Wikidata will not provide full Wikipedia articles, but merely the data content, the full capabilities of wiki text are not needed here; the source format will be redefined for data pages using a saner syntax, with JSON identified as the most promising candidate. The information will be stored directly in a storage backend suitable for structured data. In the long run, we will investigate whether the text-based storage can be removed completely.
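As an illustration of such a JSON data page and the narrow API-style edit operations, consider the following minimal sketch. Every field name and the shape of the edit operation are assumptions for illustration, not the planned format:

```python
import json

# Hypothetical phase-1 entity record: sitelinks, labels, descriptions,
# and aliases keyed by language code. All field names are illustrative.
entity = {
    "id": "q42",
    "sitelinks": {"en": "Berlin", "de": "Berlin", "fr": "Berlin"},
    "labels": {"en": "Berlin", "de": "Berlin"},
    "descriptions": {"en": "capital city of Germany",
                     "de": "Hauptstadt von Deutschland"},
    "aliases": {"en": ["Berlin, Germany"], "de": ["Berlin (Stadt)"]},
}

# Editing happens through narrow operations rather than raw text edits,
# e.g. an "add alias" call that validates its input before storing.
def add_alias(entity, lang, alias):
    entity.setdefault("aliases", {}).setdefault(lang, [])
    if alias not in entity["aliases"][lang]:
        entity["aliases"][lang].append(alias)

add_alias(entity, "en", "German capital")
serialized = json.dumps(entity, ensure_ascii=False)  # stored as JSON text
```

Because each edit is a structured operation on a known field rather than a free-text change, the backend can validate and store the data directly.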

The project will implement a MediaWiki extension to be used in the Wikipedias that will allow them to query Wikidata for interwiki links instead of using locally defined interwiki links. This will require a back end that scales to the requirements of Wikipedia, and will need careful consideration of how to handle caching, i.e. within each Wikipedia or on Wikidata, or both.

Technical requirements and rationales

The following requirements will need to be implemented in this phase:

  • P1.1. Extend MediaWiki to allow for content types aside from MediaWiki wiki syntax – especially JSON, to serialize the data for internal use. This will allow us to continue using the MediaWiki-based tools used in the Wikimedia Foundation’s technical setting, especially for backup, restoring, and maintaining services. This requirement will include parsing the JSON content and saving it to the backend provided by P1.2.
  • P1.2. Implement a backend where the data is stored. Even though the standard SMW-based backend can be used for prototyping, a deployment will require new implementations for the kind of data collected in phase 1.
  • P1.3. Define and implement the Wikidata API for editing and accessing the data. The API will need to be documented well enough that third parties can build applications on top of it, although such applications are not expected within phase 1. Shortipedia provides a workable example of how the Wikidata API may look.
  • P1.4. Implement and test user interfaces on top of the Wikidata API. Note that the user interfaces will all be fully internationalized and partly localized. This will be achieved by cooperating with Translatewiki for localization. The user interfaces (UIs) will need to run on all browsers Wikipedia works on, be accessible, and consider the need to rename, merge and split entities. The Wikidata website should provide the following user interfaces:
    • A simple HTML front-end that will require no JavaScript to run.
    • A rich client based on HTML5 and JavaScript for modern browsers.
  • P1.5. Implement a diff algorithm and renderer that is optimized for Wikidata. Whereas a diff simply based on the serialization of each entity in the chosen JSON format would not be wrong, we can exploit the semantics of the data format and provide much smarter diffs.
  • P1.6. Implement an API and a reusable widget that will enable the selection of an entity from the entity base. The widget should allow for autocomplete where useful, and should work seamlessly on mobile devices. Similar widgets can be seen with Freebase suggest and the Shortipedia entity selector.
  • P1.7. Implement a MediaWiki extension that queries Wikidata for the relevant Interwiki links and displays them on the article page. To avoid disrupting the current usage of interwiki links, and to handle some peculiarities and special cases, a magic word will be used to decide whether a given article uses the current locally defined, or the new globally defined, interwiki links. The display will contain an edit-link that allows editing the interwiki data on Wikidata directly.
  • P1.8. Implement, set up and test an appropriate caching mechanism and data-flow infrastructure that scales to the number of Wikipedia requests without sacrificing freshness in the interwiki links.
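The structure-aware diff of P1.5 can be sketched as follows, assuming the phase-1 entity layout of labels, descriptions, aliases, and sitelinks keyed by language. The function and field names are illustrative, not a specification:

```python
def entity_diff(old, new):
    """Compare two entity records field by field instead of diffing
    their serialized text, yielding (operation, path, value) triples."""
    ops = []
    for section in set(old) | set(new):
        before, after = old.get(section, {}), new.get(section, {})
        for lang in set(before) | set(after):
            if lang not in after:
                ops.append(("remove", (section, lang), before[lang]))
            elif lang not in before:
                ops.append(("add", (section, lang), after[lang]))
            elif before[lang] != after[lang]:
                ops.append(("change", (section, lang), after[lang]))
    return sorted(ops)

old = {"labels": {"en": "Berlin"}, "sitelinks": {"en": "Berlin"}}
new = {"labels": {"en": "Berlin", "de": "Berlin"},
       "sitelinks": {"en": "Berlin"}}
# one semantic edit ("a German label was added") instead of a text diff
```

A renderer on top of such triples can then say “German label added” in the page history, which is far more readable than a line-based diff of the JSON serialization.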

Milestone

The infrastructure for replacing local interwiki links with global links is set up, and several Wikipedias are using the extension that allows their editors to switch, via a magic word, from local interwiki links to the Wikidata-based system. Wikidata is launched at its final URL.

This will considerably reduce the maintenance effort for interwiki links in the Wikipedias and will provide a Wikimedia-backed entity database. Note that the milestone is non-blocking, as phase 2 can be started before the launch of Wikidata itself.

Not to be done by the project

The project doesn't aim to involve a huge number of editors when launching Phase 1; rather, we expect a small, active and communicative group. Therefore wide awareness within the Wikipedia communities is unnecessary at this stage.

The project will not automatically create alignments between articles; nor will it automatically collect them from the Wikipedias; this task will be up to the communities. The communities are already active in using bots to align interwiki links, and we will offer them help to make their bot frameworks work seamlessly with Wikidata as well. The project doesn't aim to provide the content, but merely to provide the technical infrastructure for the communities to create the desired content.

Optional extensions to phase 1

  • O1.1. A third party develops a user interface based on the documentation of the Wikidata API. The user interface is a demonstrator that the documentation is sufficient, and that a multitude of interfaces can be built on top of Wikidata.
  • O1.2. Provide an external interwiki linker tool where editors can easily make the alignments. The system collects the interwiki links from the Wikipedias, checks their alignments, and provides an interface to simply (i) remove the interwiki links from the individual article, (ii) replace them with the magic word, and (iii) add the links to Wikidata. Even though the editing is done automatically, every single edit still needs to be initiated by a human editor.
  • O1.3. Extend the system to store external IDs to other data collections, like Linked Open Data Cloud URIs, ISBNs, IMDB identifiers, Eurostat identifiers, UN standards, and PIDs. This is expected to dramatically increase support from third-party data providers.
  • O1.4. [Depends on O1.3] Provide an external tool to find alignments that can be automatically saved through O1.3; the actual alignment needs to be confirmed by a human editor.
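O1.3's external-identifier store might look like the following minimal sketch. The scheme names and the in-memory layout are assumptions for illustration only:

```python
# Illustrative sketch of O1.3: one record per entity holding its
# identifiers in external data collections, plus a reverse lookup.
external_ids = {}

def set_external_id(entity_id, scheme, value):
    """Attach an external identifier (e.g. an ISBN) to an entity."""
    external_ids.setdefault(entity_id, {})[scheme] = value

def find_by_external_id(scheme, value):
    """Reverse lookup: which entities carry a given external identifier?"""
    return [e for e, ids in external_ids.items()
            if ids.get(scheme) == value]

set_external_id("q42", "imdb", "tt0120737")
set_external_id("q42", "isbn", "978-3-16-148410-0")
```

The reverse lookup is what an alignment tool such as the one in O1.4 would use to check whether a proposed identifier is already claimed by another entity.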

Phase 2: Infoboxes

In Phase 2, entities in Wikidata can be enriched with facts (property–value pairs), along with their sources and other qualifiers. Facts can either be typed links between entities, or property–value pairs with typed values.

The entities can be interconnected in a multilingual fashion thanks to the labels available in the entity store. This will increase the incentive to name entities and to create entities without a directly corresponding Wikipedia article (thus circumventing the current notability rules and requiring new ones).

The page of an entity can now contain arbitrary information about the entity, instantiating newly created, user-defined properties. Every fact can be substantiated by a reference, thus reducing the likelihood of disputes, which can be especially problematic in a multilingual setting. The data will be fully exported in different formats, especially RDF and JSON. Wikidata will provide an API to edit the extended content.
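A qualified, sourced fact of this kind could be serialized along the following lines; all field names are illustrative assumptions rather than a fixed schema:

```python
# Illustrative phase-2 fact: a property-value pair carrying a source
# and a further qualifier. Field names are assumptions, not a schema.
fact = {
    "entity": "q64",            # hypothetical ID, e.g. Berlin
    "property": "population",
    "value": {"amount": 3460725, "unit": "1"},
    "qualifiers": {"as of": "2010-12-31"},
    "sources": [{
        "type": "offline",      # sources go beyond mere URLs
        "title": "Amt für Statistik Berlin-Brandenburg",
    }],
}

def is_sourced(fact):
    """Curation helper: does the fact cite at least one source?"""
    return bool(fact.get("sources"))
```

Keeping qualifiers and sources attached to the individual fact, rather than to the whole page, is what later allows filtering and prioritizing by source (see P2.7).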

The project will implement a MediaWiki extension that will allow Wikipedia editors to augment infobox templates with data from Wikidata. The caching issues are the same as for phase 1, so in this phase we can fully concentrate on the new main technical challenge: creating an augmentation syntax that can deal with the diversity and power of Wikidata.

Technical requirements and rationales

The following requirements will need to be developed in phase 2:

  • P2.1. Customize the Semantic MediaWiki type system to fit Wikidata. This includes the ability to provide (i) linear transformation support for values with units (which requires values to carry a precision, as otherwise the transformed value might suggest spurious accuracy) and (ii) support for the internationalization of the values. Some data types that are basic but do not fit into linear transformations (like time, space, temperature, and currency) must be implemented appropriately.
  • P2.2. Develop and implement a system that allows editors to add sources and other qualifiers to facts. Sources go beyond mere URLs (as in Shortipedia): Wikidata will also allow for offline sources, the structural description of sources (within Wikidata itself), and – if available – the actual text snippet from the source that supports the fact.
  • P2.3. Select and set up a back-end store for the data from the infoboxes. This should be done with P3.2 in mind, but not as a requirement. It is possible that the benchmarking will result in having two different systems for document-like access (Phase 2) and the graph-based querying (Phase 3).
  • P2.4. Extend the Wikidata API to enable editing of custom properties, especially sources. These should be described sufficiently to allow external third parties to develop user interfaces on top of Wikidata – especially game-like interfaces as described in the optional requirements.
  • P2.5. Develop and implement user interfaces for editing and browsing Wikidata. The required user interfaces are continuations of those developed in P1.4 and will use the widget developed in P1.6.
  • P2.6. Implement the export of the data in relevant formats, especially JSON and RDF; and ensure that there are timely dumps of the whole data set in RDF and JSON, besides the XML dump of the wiki. Some of the types – like values with units, time, etc. – will require some further specification.
  • P2.7. Develop and implement a MediaWiki extension that allows data from Wikidata to be used in a MediaWiki, especially within the infobox templates in the Wikipedias. This needs to consider the possibility of overriding Wikidata values in a given Wikipedia entry, and prioritizing and filtering by sources. The extension will allow the creation of stub infoboxes for entities without articles in a given language. This must consider the issues of protection and protection propagation from the Wikipedias (see also O2.5).
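The linear transformations with precision in P2.1 can be sketched as follows. This is a hedged illustration; the factor table and the significant-digit rounding rule are assumptions, not the planned type system:

```python
# Minimal sketch of P2.1's linear unit transformation. The conversion
# factors and rounding rule are illustrative assumptions.
FACTORS_TO_METRE = {"m": 1.0, "km": 1000.0, "mi": 1609.344, "ft": 0.3048}

def convert(amount, from_unit, to_unit, significant_digits=3):
    """Linearly convert between length units, rounding the result to a
    number of significant digits so the output does not pretend to more
    precision than the input (the over-precision problem in P2.1)."""
    metres = amount * FACTORS_TO_METRE[from_unit]
    result = metres / FACTORS_TO_METRE[to_unit]
    return float(f"{result:.{significant_digits}g}")

# e.g. a marathon: 26.2 mi is reported as 42.2 km, not 42.1648128 km
```

Types like temperature or currency do not fit this scheme: temperature conversions involve offsets, and currency rates change over time, which is why P2.1 calls for implementing them separately.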

Milestone

Several Wikipedias allow their editors to augment the infoboxes with data from Wikidata. This will increase consistency over the Wikipedias, providing useful stubs, especially for smaller language editions. It will considerably decrease the maintenance effort in the Wikipedias. It can display data from Wikidata for many entities in smaller languages, even when they do not have an article.

Not to be done by the project

The project will not automatically fill the data with knowledge extracted or provided from other sources. It is up to the Wikidata community to decide which sources to select and how to import the data.

The project will not:

  • create a system that will automatically discover, integrate, and upload linked open data from the Web of Data, as Shortipedia does, for example;
  • provide automatic mappings to external vocabularies and ontologies, or use internal reasoning engines;
  • provide interfaces to bulk upload diverse data sources – the API-based architecture of Wikidata will enable the easy creation of such interfaces (and we do expect such interfaces to be created), but it is not a task of this project to preselect and support certain data sources, which would necessarily occur whenever an import for any single data format is implemented;
  • define the available properties, or the processes by which the community decides which properties to use; or
  • undertake by itself to provide mappings to the Web of Data; however, it will provide an infrastructure for others to build these.

Optional extensions to phase 2

  • O2.1. Provide an infrastructure to collect automatic suggestions for facts and sources. Bots, learning systems, knowledge extraction tools, etc. can provide these suggestions. Since these suggestions have not yet been confirmed by human editors, they should be kept separate from the facts that are already checked. Provide a simple user interface to let human editors confirm or reject the extracted facts.
  • O2.2. [Depends on O2.1] Create game-like interfaces and incentive schemes for checking data and sources, supporting casual contributions to the encyclopedia.
  • O2.3. [Depends on O2.1] Extend an automatic knowledge extraction system so that it loads its data directly into O2.1. This could be organized as a challenge.
  • O2.4. Improve the autocompletion widget developed in P1.6 based on further information about the property and entity being filled, e.g. their domain and range, the actual entity being annotated, etc.
  • O2.5. Add a finer-grained approach to protection, allowing single facts to be protected instead of merely the whole entity.
  • O2.6. Export trust and provenance information about the facts in Wikidata. Since the relevant standards are not defined yet, this should be done by closely monitoring the W3C Provenance WG.
  • O2.7. Develop an exporter that provides on-the-fly transformation to a specific RDF vocabulary or to a specific JSON projection.
  • O2.8. A user interface optimized for mobile devices. Mobile devices pose specific challenges. Whereas the UI in O1.1 will be deployed externally, the mobile device UI will actually be part of Wikidata.
  • O2.9. Develop a rich and engaging user interface for a specific domain, e.g. about the 2012 London Olympics.

Phase 3: Lists

The vision of a Semantic Wikipedia, as originally suggested, is already fulfilled with phase 2. Phase 3 enables more complex queries that provide aggregate views of the data and can further reduce maintenance effort in Wikipedia drastically. Phase 3 also allows the project to be finished properly, ensuring high maintainability of both the software and the data and their surrounding processes. Phase 3 will closely monitor the arbitrary creation of properties and its impact on the performance of the system, as well as the usage of the new possibilities to query the data. This provides a scalable way to use inline queries, one of Semantic MediaWiki’s most appealing and useful features.

Technical requirements and rationales

Phase 3 will allow for creating lists and aggregated views out of Wikidata.

  • P3.1. Develop and implement an extension that queries Wikidata and renders the results within a given wiki. This extends current SMW querying in two ways: first, it allows querying another knowledge base; second, it deals with the diversity within Wikidata, i.e. the possibility of having several qualified values for an entity-attribute pair. Both issues have been tackled previously within the project (specifically in P1.8 and P2.7). Note that the expressiveness of the query language might be very low.
  • P3.2. Select and set up a back end that allows the efficient execution of the queries developed in P3.1. Closely monitor the performance of the queries and iterate on their expressivity. This task will already start during the other phases, evaluating the possible implementations that could be used. Several backends may end up being used simultaneously for the different uses in Wikidata, especially infobox augmentation and list generation.
  • P3.3. Provide, in close cooperation with the community, a set of relevant formatters for the query results. Reuse results from Semantic Result Formats and Spark, if possible; otherwise, new formatters need to be developed based on feedback from the community. The formatters will be developed on top of a well-documented API, so that third parties can add further formatters, like lists, charts, maps, etc.
  • P3.4. Extend and adjust current curation tools so that they work for Wikidata. This includes the discovery of untyped properties, invalid values, duplicate properties, and unsourced statements.
  • P3.5. Clean up the code so that it can be used outside of Wikidata. Merge changes back with Semantic MediaWiki, so that a common code base is retained. Publish the extensions that are newly developed for Wikidata.
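The kind of low-expressiveness query P3.1 envisages (one selection condition, one projected property, with several values allowed per entity-attribute pair) can be sketched as follows; the flat fact layout is an illustrative assumption:

```python
# Illustrative fact store: an entity-property pair may carry several
# qualified values, so the query must return all of them.
facts = [
    {"entity": "q64",   "property": "country",    "value": "Germany"},
    {"entity": "q64",   "property": "population", "value": 3460725},
    {"entity": "q1055", "property": "country",    "value": "Germany"},
    {"entity": "q1055", "property": "population", "value": 1786448},
    # a second qualified value for the same entity-property pair:
    {"entity": "q1055", "property": "population", "value": 1774224},
]

def select(facts, where_property, where_value, project):
    """One selection condition plus one projection, deliberately far
    below full query-language expressiveness."""
    matching = {f["entity"] for f in facts
                if f["property"] == where_property
                and f["value"] == where_value}
    rows = {}
    for f in facts:
        if f["entity"] in matching and f["property"] == project:
            rows.setdefault(f["entity"], []).append(f["value"])
    return rows

# "population of entities whose country is Germany", keeping all values
result = select(facts, "country", "Germany", "population")
```

A formatter from P3.3 would then decide how to render the multiple values per row, e.g. as a list, a chart, or a map.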

Milestone

Wikipedia allows its editors to integrate lists from Wikidata. The maintenance of Wikidata is transferred to the Wikimedia Foundation.

Not to be done by the project

The project will not do research and development on complex trust calculations. Since all data will be exported, third parties can develop such mechanisms; if these prove implementable at scale, they can be considered for Wikidata in the future, but the project itself will not provide them.

Optional extensions to phase 3

  • O3.1. Develop and prepare a SPARQL endpoint to the data. Even though a full-fledged SPARQL endpoint to the data will likely be impossible, we can provide a SPARQL endpoint that allows certain patterns of queries depending on the expressivity supported by the back end.
  • O3.2. Develop an intuitive and rich query editor for the data. A simple text-based query editor is available automatically, but a rich editor that supports intelligent autocompletion and perhaps enables the user to query the data visually would make the knowledge accessible to a much wider audience.
  • O3.3. [Depends on O3.2] Develop a tool to easily copy queries from the query editor to the Wikipedias, thus enabling editors to enrich articles easily.
  • O3.4. Develop a mobile application to query and possibly edit the data in Wikidata.
  • O3.5. Implement a keyword-based intelligent search, where the keyword query is interpreted as a structural query over the data and the query result is displayed.
  • O3.6. Based on a query and the query result, select an appropriate visualization automatically.

Resources

See also