Wikidata/Technical proposal
Wikidata is a project to create a free knowledge base about the world that can be read and edited by humans and machines alike. It will provide data in all the languages of the Wikimedia projects, and allow for central access to the data in a way similar to what Wikimedia Commons does for multimedia files.
Wikidata is proposed as a Wikimedia hosted and maintained project. This document covers the initial development required to start the project and discusses its long-term ramifications.
This document suggests an approach in three phases:
- The first phase (Interwiki links) consists of creating an entity base for the Wikimedia projects. This will provide an alternative to the current interlanguage link system, and that alternative will already be integrated into Wikipedia.
- The second phase (Infoboxes) aims at gathering first infobox-related data for a subset of the entities, with the explicit goal to augment the currently widely used infoboxes with data from data.wikimedia.org.
- The third and last phase (Lists) of the initial development is to expand the set of properties beyond those related to infoboxes, and to provide ways of exploiting this data within and outside of the Wikimedia projects.
The following sections discuss each phase in detail. Each section gives an overview of the phase; describes the technical requirements that need to be implemented in that phase and why; identifies the main challenge of the phase in each of three areas (technical, organizational, and community); defines the milestone that completes the phase; lists explicitly what the project will not do and why; and lists optional requirements that can be added by project decision.
The three phases do not mean that there will be only three deployments of code, or version updates. We will rather follow the credo of release early, release often, which enables us to swiftly react to user feedback and continuously improve the system.
Phase 1: Interwiki links
Already in phase 1, Wikidata will be launched on its final address. The wiki will provide a page for every entity described by an article in one of the Wikipedia language editions. This page will contain the following information:
- Links to the articles in the different Wikipedia language editions describing the entity of the given page. Each Wikipedia can contain at most one such article.
- Labels and short descriptions for each such entity. There is one label and short description for each supported language.
- Aliases for each entity. There can be several aliases for each entity in each supported language.
The data will be exported in different formats, especially RDF, SKOS, and JSON. Wikidata will also provide an API to edit the content. The wiki source text will not be editable as text, but only through specific API calls like “Add source” or “Remove label”. Wikidata will provide one or more user interfaces to the Wikidata API. Since Wikidata does not provide full Wikipedia articles but merely the data content, the full capabilities of wiki text are not needed here. Instead, the page source format will be redefined for data pages using a saner syntax, with JSON identified as the most promising candidate. Additionally, the information will be stored directly using a storage backend suitable for structured data. In the long run we will investigate whether the text-based storage can be removed completely.
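As a rough illustration of the page content listed above, a phase 1 entity could be modeled and serialized to JSON along the following lines. This is a minimal sketch, not the proposed format: the class name `Entity`, all field names, the wiki codes and the identifier `e1` are assumptions made for the example.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Entity:
    """Hypothetical phase 1 entity record; every field name is illustrative."""
    entity_id: str
    sitelinks: dict = field(default_factory=dict)     # wiki code -> article title
    labels: dict = field(default_factory=dict)        # language -> one label
    descriptions: dict = field(default_factory=dict)  # language -> one description
    aliases: dict = field(default_factory=dict)       # language -> list of aliases

    def set_sitelink(self, wiki: str, title: str) -> None:
        # Keying by wiki enforces "at most one article per Wikipedia".
        self.sitelinks[wiki] = title

    def add_alias(self, language: str, alias: str) -> None:
        # Several aliases per language are allowed, but no duplicates.
        names = self.aliases.setdefault(language, [])
        if alias not in names:
            names.append(alias)

    def to_json(self) -> str:
        return json.dumps(asdict(self), ensure_ascii=False, sort_keys=True)


berlin = Entity("e1")  # made-up identifier
berlin.set_sitelink("enwiki", "Berlin")
berlin.set_sitelink("dewiki", "Berlin")
berlin.labels["en"] = "Berlin"
berlin.descriptions["en"] = "capital of Germany"
berlin.add_alias("en", "Berlin, Germany")
serialized = berlin.to_json()
```

Keying links, labels and descriptions by wiki or language code makes the "one per Wikipedia" and "one per language" constraints structural rather than something an editor has to check.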
The project will implement a MediaWiki extension to be used in the Wikipedias that will allow them to query Wikidata for interwiki links instead of using locally defined interwiki links. This will require a back end that scales to the requirements of Wikipedia, and will need careful consideration of how to handle caching, i.e. within each Wikipedia or on Wikidata, or both.
Technical requirements and rationales
The following requirements will need to be implemented in this phase:
- P1.1. Extend MediaWiki to allow for content types other than MediaWiki wiki syntax, especially JSON, in order to serialize the data for internal use. This makes it possible to continue using the MediaWiki based tools used in the Wikimedia Foundation’s technical setting, especially for backup, restoring, and maintaining the services. This requirement also includes parsing the JSON content and saving it to the backend provided by P1.2.
- P1.2. Implement a backend where the data is stored. Note that even though the standard SMW based backend can be used for prototyping, a deployment will require new implementations for the kind of data collected in phase 1.
- P1.3. Define and implement the Wikidata API for editing and accessing the data. The API needs to be sufficiently documented so that third parties can build applications against it. Note that such third-party applications are not expected in phase 1 itself. Shortipedia provides a workable example of what the Wikidata API could look like.
- P1.4. Implement and test user interfaces on top of the Wikidata API. Note that the user interfaces are all fully internationalized and partly localized. This will be achieved by cooperating with Translatewiki for localization. The UI needs to run on all browsers Wikipedia works on. The UI needs to be accessible. The user interfaces should consider the need to rename, merge and split entities. The Wikidata web site should provide the following user interfaces:
- A simple HTML front end that requires no JavaScript in order to run.
- A rich client based on HTML5 and JavaScript for modern browsers.
- P1.5. Implement a Diff algorithm and renderer that is optimized to be used for Wikidata. Whereas a diff simply based on the serialization of each entity in the chosen JSON format would not be wrong, we can exploit the semantics of the data format and provide much smarter diffs.
- P1.6. Implement an API and a reusable widget that allows for the selection of an entity from the entity base. The widget should allow for autocomplete where useful, and also work seamlessly on mobile devices. Similar widgets can be seen with Freebase suggest and the Shortipedia entity selector.
- P1.7. Implement a MediaWiki extension that queries Wikidata for the relevant interwiki links and displays them on the article page. In order not to disrupt the current usage of interwiki links, and to handle some peculiarities and special cases, a magic word is used to decide whether a given article uses the current, locally defined interwiki links or the new, globally defined ones. The display also contains an edit link that allows editing the interwiki data on Wikidata directly.
- P1.8. Implement, set up and test an appropriate caching mechanism and data flow infrastructure that scales to the number of Wikipedia requests without sacrificing the freshness of the interwiki links.
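To make the idea behind P1.5 more concrete, the following sketch compares two entity records section by section instead of comparing their serialized JSON text. It is only an illustration under assumed names: `diff_entities`, the section names and the tuple layout of the result are not part of the proposal.

```python
def diff_entities(old: dict, new: dict) -> list:
    """Compare two entity records; return (operation, section, key, value) tuples."""
    changes = []
    # Sections that hold exactly one value per key (wiki or language code).
    for section in ("sitelinks", "labels", "descriptions"):
        before = old.get(section, {})
        after = new.get(section, {})
        for key in sorted(set(before) | set(after)):
            if key not in before:
                changes.append(("add", section, key, after[key]))
            elif key not in after:
                changes.append(("remove", section, key, before[key]))
            elif before[key] != after[key]:
                changes.append(("change", section, key, after[key]))
    # Aliases are unordered within a language, so reordering is not a change.
    before = old.get("aliases", {})
    after = new.get("aliases", {})
    for language in sorted(set(before) | set(after)):
        old_names = set(before.get(language, []))
        new_names = set(after.get(language, []))
        for alias in sorted(new_names - old_names):
            changes.append(("add", "aliases", language, alias))
        for alias in sorted(old_names - new_names):
            changes.append(("remove", "aliases", language, alias))
    return changes


old = {
    "labels": {"en": "Berlin"},
    "aliases": {"en": ["Berlin, Germany", "City of Berlin"]},
}
new = {
    "labels": {"en": "Berlin", "de": "Berlin"},
    "aliases": {"en": ["City of Berlin", "Berlin, Germany"]},
}
changes = diff_entities(old, new)
```

A purely textual diff of the two serializations would also report the reordered alias list, whereas this comparison reports only the added German label.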
Milestone
The infrastructure for replacing local interwiki links with global ones is set up and several Wikipedia editions are using the extension that allows their editors to replace the local interwiki links via a magic word with the Wikidata based system. Wikidata is launched on its final URL.
This will considerably reduce the maintenance effort for interwiki links in Wikipedia and provide a Wikimedia-backed entity database. Note that the milestone is non-blocking, as phase 2 can be started before the launch of Wikidata itself.
Not to be done by the project
The project will not aim to generate a huge number of editors when launching Phase 1. We rather expect a small but vocal group. Therefore it does not make sense yet to aim for wide awareness within the Wikipedia community.
The project will also not automatically create the alignments between articles, nor will it automatically collect them from the Wikipedias. This task is up to the community. The community is already very active in using bots to align the interwiki links, and we will offer them help in order to make their bot frameworks work seamlessly with Wikidata as well. It is not the aim of the project to provide the content, but merely to provide the technical infrastructure so that the community can create the desired content.
Optional extensions to phase 1
- O1.1. A third party develops a user interface based on the documentation of the Wikidata API. The user interface is a demonstrator that the documentation is sufficient, and that a multitude of interfaces can be built on top of Wikidata.
- O1.2. Provide an external interwiki linker tool where editors can easily make the alignments. The system collects the interwiki links from the Wikipedias, checks their alignments, and provides an interface to simply remove the interwiki links from the individual article, replace them with the magic word, and add the links to Wikidata. Even though the editing is done automatically, every single edit still needs to be initiated by a human editor.
- O1.3. Extend the system to store external IDs to other data collections, like the Linked Open Data Cloud URIs, ISBNs, IMDB identifiers, Eurostat identifiers, UN standards, PID, etc. This is expected to increase the support from third-party data providers dramatically.
- O1.4. [Depends on O1.3] Provide an external tool to find alignments that can be saved through O1.3 automatically. The actual alignment needs to be confirmed by a human editor.
Phase 2: Infoboxes
In Phase 2, entities in Wikidata can be enriched with facts, i.e. with property / value pairs, along with their sources and other qualifiers. Facts can either be typed links between entities, or property-value pairs with typed values.
The interconnection between the entities can be facilitated in a multi-lingual fashion due to the labels available in the entity store. This also increases the incentives to name the entities and create entities without a directly corresponding Wikipedia article (thus circumventing the current notability rules and requiring new ones).
The page of an entity can now contain arbitrary information about the entity, instantiating new user-defined properties. Every fact can be substantiated by a reference, thus lowering the probability of discussions, which would be problematic in a multi-lingual setting. The data will be fully exported in different formats, especially RDF and JSON. Wikidata will also provide an API to edit the extended content. The project will implement a MediaWiki extension to be used in the Wikipedias that will allow editors to augment Infobox templates with data from Wikidata. The caching issues are the same as for Phase 1, so in this phase we can fully concentrate on the new main technical challenge of creating an augmentation syntax that can deal with the diversity and power of Wikidata.
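The notion of a fact used in this phase (a property with either a link to another entity or a typed value, plus qualifiers and references) could be represented roughly as follows. This is a sketch under assumptions: the class names, field names, entity identifier, quantity and the offline source string are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass(frozen=True)
class EntityRef:
    """Value of a typed link between two entities."""
    entity_id: str


@dataclass(frozen=True)
class Quantity:
    """Typed value consisting of an amount and a unit."""
    amount: float
    unit: str


Value = Union[EntityRef, Quantity, str]


@dataclass
class Fact:
    prop: str
    value: Value
    qualifiers: dict = field(default_factory=dict)  # e.g. {"as of": "2011"}
    references: list = field(default_factory=list)  # URLs or offline sources

    def is_link(self) -> bool:
        return isinstance(self.value, EntityRef)

    def is_sourced(self) -> bool:
        return bool(self.references)


# Both facts below use made-up values and identifiers.
population = Fact(
    "population",
    Quantity(3500000, "inhabitants"),
    qualifiers={"as of": "2011"},
    references=["Example Statistical Yearbook, p. 12 (hypothetical offline source)"],
)
capital_of = Fact("capital of", EntityRef("e2"))
```

Keeping references on each fact rather than on the whole entity is what lets a consumer later filter or prioritize values by source.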
Technical requirements and rationales
The following requirements will need to be developed in phase 2:
- P2.1. Customize the Semantic MediaWiki type system to fit with Wikidata. This includes the ability to provide linear transformation support for values with units (which requires tracking the precision of the value, as otherwise the converted value might appear overly precise) and support internationalization for the values. Some data types that are basic but do not fit into linear transformations (like time, space, temperature, and money) must be appropriately implemented.
- P2.2. Develop and implement a system that allows adding sources to facts as well as other qualifiers. Sources go beyond a mere URL (as in Shortipedia): they also allow for offline sources, structural descriptions of sources (within Wikidata itself), and, if available, the actual text snippet supporting a source.
- P2.3. Select and set up a back end store for the data from the info boxes. This should be done with P3.2 in mind but not as a requirement. It is possible that the benchmarking results in having two different systems for the document-like access in Phase 2 and the graph-based querying in Phase 3.
- P2.4. Extend the Wikidata API to enable editing of custom properties, especially sources. These should be described sufficiently to allow external third parties to develop user interfaces on top of Wikidata, especially game-like interfaces as described in the optional requirements.
- P2.5. Develop and implement user interfaces for editing and browsing Wikidata. The required user interfaces are continuations of those developed in P1.4 and will use the widget developed in P1.6.
- P2.6. Implement the export of the data in relevant formats, especially JSON and RDF. Also ensure that there are timely dumps of the whole data set in RDF and JSON, besides the XML dump of the wiki. Some of the types – like values with units, time, etc. – will require some further specification.
- P2.7. Develop and implement a MediaWiki extension that allows using the data from Wikidata within a MediaWiki, especially within the Infobox templates in Wikipedia. This needs to consider the possibility to override Wikidata values in a given Wikipedia entry, and to prioritize and filter by sources. The extension also allows creating stub infoboxes for entities without articles in a given language. This must consider the issues of protection and protection propagation from the Wikipedias (see also O2.5).
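The precision concern in P2.1 can be illustrated with a small worked example: converting "5 miles" by plain multiplication yields 8.04672 km, which suggests far more precision than the input had. The sketch below keeps the number of significant digits of the input. The function name `convert` and the choice to derive precision from the way the input is written are assumptions, and only the linear case is covered.

```python
from decimal import Context, Decimal

MILE_TO_KM = Decimal("1.609344")  # exact by definition of the international mile


def convert(amount: str, factor: Decimal) -> Decimal:
    """Scale `amount` by `factor`, keeping the input's number of significant digits."""
    value = Decimal(amount)
    # Decimal keeps trailing zeros, so "5.0" has two significant digits and "5" one.
    digits = len(value.as_tuple().digits)
    # A context with that precision rounds the product accordingly.
    return Context(prec=digits).multiply(value, factor)


rough = convert("5", MILE_TO_KM)          # one significant digit
finer = convert("5.0", MILE_TO_KM)        # two significant digits
marathon = convert("26.2", MILE_TO_KM)    # three significant digits
```

Here `rough` is 8, `finer` is 8.0 and `marathon` is 42.2. Conversions with an offset, such as between temperature scales, do not fit this pattern, which is why P2.1 treats them separately.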
Milestone
Wikipedia allows its editors to augment the infoboxes with data from Wikidata. This will increase the consistency across the Wikipedias, and provide useful stubs especially for smaller language editions. It will considerably decrease the maintenance effort in Wikipedia. Wikipedia can display data from Wikidata for many entities in smaller languages, even when they do not have an article.
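The augmentation behavior required by P2.7 (local overrides take precedence, and Wikidata values can be filtered by source) might be resolved along these lines. This is a hedged sketch, not the planned augmentation syntax: `resolve_infobox`, its parameters, the data layout, and all values and source names are hypothetical.

```python
def resolve_infobox(fields, local_values, wikidata_facts, trusted_sources=None):
    """Pick one value per infobox field; a locally defined value always wins."""
    resolved = {}
    for name in fields:
        if name in local_values:
            # The article overrides Wikidata for this field.
            resolved[name] = local_values[name]
            continue
        candidates = wikidata_facts.get(name, [])
        if trusted_sources is not None:
            # Keep only values backed by a source this wiki accepts.
            candidates = [c for c in candidates if c["source"] in trusted_sources]
        if candidates:
            resolved[name] = candidates[0]["value"]
    return resolved


# Made-up example data; values and source names are placeholders.
wikidata_facts = {
    "population": [
        {"value": "3,400,000", "source": "source A"},
        {"value": "3,500,000", "source": "source B"},
    ],
    "area": [{"value": "892 km2", "source": "source A"}],
}
local_values = {"area": "891.8 km2"}
infobox = resolve_infobox(
    ["population", "area", "mayor"],
    local_values,
    wikidata_facts,
    trusted_sources={"source B"},
)
```

Fields without a local value or an acceptable Wikidata value, such as `mayor` here, are simply left out, so the template can fall back to its usual empty-parameter handling.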
Not to be done by the project
The project will not automatically fill the data with knowledge extracted or provided from other sources. It is up to the community to decide which sources to select and how to import the data.
We will not create a system that will automatically discover, integrate, and upload Linked Open Data from the Web of Data, as e.g. Shortipedia does. We will not provide automatic mappings to external vocabularies and ontologies or use internal reasoning engines. We will also not provide interfaces to bulk upload diverse data sources. The API-based architecture of Wikidata will enable the easy creation of such interfaces, and we expect such interfaces to be created, but it is not a task of this project to preselect and support certain data sources, which will necessarily happen whenever we implement an import for any single data format.
The project will not define the available properties and the processes for the community on how to decide on which properties to use. The project will not undertake by itself to provide mappings to the Web of Data. It will, however, provide the infrastructure for others to build them.
Optional extensions to phase 2
- O2.1. Provide an infrastructure to collect automatic suggestions for facts and sources. Bots, learning systems, knowledge extraction tools, etc, can provide these suggestions. Since these suggestions are not confirmed by human editors yet, they should be kept separate from the facts that are already checked. Provide a simple user interface to let human editors confirm or reject the extracted facts.
- O2.2. [Depends on O2.1] Create game-like interfaces and incentive schemes for checking data and sources, supporting casual encyclopeding.
- O2.3. [Depends on O2.1] Extend an automatic knowledge extraction system so that it loads its data directly into O2.1. This could be organized as a challenge.
- O2.4. Improve the autocompletion widget developed in P1.6 based on further information about the property and entity being filled, e.g. their domain and range, the actual entity being annotated, etc.
- O2.5. Add a more fine-grained approach to protection that covers single facts instead of merely the whole entity.
- O2.6. Export trust and provenance information about the facts in Wikidata. Since the relevant standards are not defined yet, this should be done by closely monitoring the W3C Provenance WG.
- O2.7. Develop an exporter that provides on-the-fly transformation to a specific RDF vocabulary or to a specific JSON projection.
- O2.8. A user interface optimized for mobile devices. Mobile devices pose specific challenges. Whereas the UI in O1.1 will be deployed externally, the mobile device UI will actually be part of Wikidata.
- O2.9. Develop a rich and engaging user interface for a specific domain, e.g. about the 2012 London Olympics.
Phase 3: Lists
The vision of a Semantic Wikipedia as it was originally suggested is already fulfilled with Phase 2. Phase 3 makes it possible to pose more complex queries that provide aggregate views on the data and can drastically reduce maintenance in Wikipedia further. Phase 3 also allows finishing the project properly in order to ensure high maintainability of the software, the data, and the surrounding processes. Phase 3 will closely monitor the arbitrary creation of properties and its impact on the performance of the system, as well as the usage of the new possibilities to query the data. This provides a scalable way to use inline queries, one of Semantic MediaWiki’s most appealing and useful features.
Technical requirements and rationales
Phase 3 will allow for creating lists and aggregated views out of Wikidata.
- P3.1. Develop and implement an extension that queries Wikidata and renders the results within a given wiki. This extends current SMW querying in two ways: first, it allows querying another knowledge base and second, it deals with the diversity within Wikidata, i.e. the possibility to have several qualified values for an entity-attribute pair. Both issues have been tackled previously within the project (specifically in P1.8 and P2.7). Note that the expressiveness of the query language might be very low.
- P3.2. Select and set up a back end that allows the efficient execution of the queries developed in P3.1. Monitor closely performance of the queries and iterate on their expressivity. This task will already start during the other phases evaluating the possible implementations that could be used. It could be that several backends will be used simultaneously for the different uses in Wikidata, especially for infobox augmentation and list generation.
- P3.3. Provide, in close cooperation with the community, a set of relevant formatters for the query results. Reuse results from Semantic Result Formats and Spark, if possible. Otherwise, new formatters need to be developed based on feedback from the community. The formatters will be developed on top of a well-documented API, so that third parties can add further formatters, like lists, charts, maps, etc.
- P3.4. Extend and adjust current curation tools so that they work for Wikidata. This includes the discovery of untyped properties, invalid values, duplicate properties, and unsourced statements.
- P3.5. Clean up the code so that it can be used outside of Wikidata. Merge changes back with Semantic MediaWiki, so that a common code base is retained. Publish the extensions that are newly developed for Wikidata.
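A query of low expressiveness, as anticipated in P3.1, together with a simple list formatter in the spirit of P3.3, could look like the sketch below. The functions `run_query` and `render_list`, the in-memory data layout and the property names are assumptions for illustration; a real deployment would run against the back end selected in P3.2.

```python
def run_query(entities, conditions, select):
    """Return (label, values) rows for entities matching every condition.

    A condition holds if the expected value is among the possibly several
    values recorded for that property.
    """
    rows = []
    for entity in entities:
        facts = entity["facts"]
        matches = all(
            expected in facts.get(prop, [])
            for prop, expected in conditions.items()
        )
        if matches:
            rows.append((entity["label"], facts.get(select, [])))
    return sorted(rows)


def render_list(rows):
    """Format query rows as a wiki-style bullet list."""
    return "\n".join(f"* {label}: {', '.join(values)}" for label, values in rows)


entities = [
    {"label": "Switzerland", "facts": {
        "instance of": ["country"],
        "official language": ["German", "French", "Italian", "Romansh"]}},
    {"label": "Belgium", "facts": {
        "instance of": ["country"],
        "official language": ["Dutch", "French", "German"]}},
    {"label": "Berlin", "facts": {"instance of": ["city"]}},
]
rows = run_query(
    entities,
    {"instance of": "country", "official language": "French"},
    "official language",
)
listing = render_list(rows)
```

Because each property maps to a list of values, the same query logic handles the several values per entity-attribute pair that P3.1 mentions.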
Milestone
Wikipedia allows its editors to integrate lists from Wikidata. The maintenance of Wikidata is transferred to the Wikimedia Foundation.
Not to be done by the project
The project will not do research and development on complex trust calculations. Since all data will be exported, third parties can develop such mechanisms, and if they can be implemented at scale they can be considered for Wikidata in the future, but the project itself will not provide them.
Optional extensions to phase 3
- O3.1. Develop and prepare a SPARQL endpoint to the data. Even though a full-fledged SPARQL endpoint to the data will likely be impossible, we can provide a SPARQL endpoint that allows certain patterns of queries depending on the expressivity supported by the back end.
- O3.2. Develop an intuitive and rich query editor for the data. A simple text based query editor is available automatically, but a rich editor that supports an intelligent autocomplete and probably enables the user to visually query the data would make the knowledge accessible to a much wider audience.
- O3.3. [Depends on O3.2] Develop a tool to easily copy queries from the query editor to the Wikipedias, making it easy to enrich articles.
- O3.4. Develop a mobile application to query and possibly edit the data in Wikidata.
- O3.5. Implement a keyword-based intelligent search, where the keyword query is interpreted as structural queries over the data and the query result is displayed.
- O3.6. Based on a query and the query result, select an appropriate visualization automatically.
Resources
See also
- w:de:Wikipedia:Projektdiskussion/Wikidata
- strategy:Proposal:Structured Data
- Bug 4547 - Support crosswiki template inclusion (transclusion => interwiki templates, etc.)