Grants:IEG/Wikidata Toolkit

This project is funded by an Individual Engagement Grant

status: selected

project:

Wikidata Toolkit

project contact:

markussemantic-mediawiki.org

participants:

grantees:

Markus Krötzsch is the creator of Semantic MediaWiki and data architect of Wikidata. He is a Departmental Lecturer at the University of Oxford and will be leading a research group at TU Dresden starting Nov 2013.
Research assistant tbd (another person from Markus's research group at TU Dresden)
Student assistant tbd (a secondary goal of the project is to involve students in Wikipedia-related development and research)

summary:

The project will develop a toolkit and web service to query and analyse information exported from Wikidata, providing a feature-rich query API based on a robust and scalable backend.

engagement target:

Wikidata

strategic priority:

Encouraging Innovation, Improving Quality

total amount requested:

30000 USD (22550 EUR)

2013 round 2

This page describes the original project proposal (Dec 2013). It will not be updated. Current information and usage instructions are found at the Wikidata Toolkit homepage.

Project idea

Problem: The Wikidata project collects large amounts of data, but understanding this data requires technical means for querying and analysis that are not currently available. Even skilled developers have hardly any basis for working with Wikidata.

Solution: A modular toolkit for loading, querying, and analysing Wikidata data will make it easy for developers to use Wikidata in their applications. A web service built on top of this toolkit will offer live query capabilities to a wider range of users. The work will draw heavily on prior experience and existing tools, the goal being to unify and improve existing partial solutions.

Motivation

Wikidata collects large amounts of data across all Wikipedia languages. The data comprises names, dates, coordinates, relationships, URLs, but also references for many statements. In contrast to Wikipedia, where the main way of accessing information is to read single pages, the information in Wikidata is most interesting when viewing facts in a wider context, combining information across many subjects. For example, we can now answer the question of how the sex distribution of people with Wikipedia articles varies across languages. For Wikidata editors, complex questions are interesting for yet another reason: they use them to check data quality by looking for patterns that should not normally occur. For instance, the mother of a person should normally be female, which is not always the case now. This and many other interesting insights about Wikidata can be gained by querying the data set for certain patterns, thus revealing the true potential of the project.

Unfortunately, Wikidata does not support any advanced form of query. The basic API provided by the project is limited to retrieving elements by their label (or alias). It is not even possible to find pages that refer to another page, e.g., to find the albums recorded by a certain artist – MediaWiki's what links here is sometimes (ab)used as a workaround in cases where it is enough to know that another page is mentioned somewhere in the data. In all cases mentioned above, custom-made software is used to analyse the data from dumps. This is a time-consuming offline process in each case, which often takes hours to complete. Even worse, the lack of technical support excludes the vast majority of users from analysing Wikidata. Even technically trained users who would be able to formulate, say, an SQL query are discouraged by the immense technological barrier of creating their own query answering system.

The goal of this project is to develop necessary technical components to simplify query answering over up-to-date Wikidata data. The heart of this project is a robust and flexible query backend that provides an API for running a variety of queries. A web service to showcase the functionality will be created and set up to use current (or very recent) data. The main approach for achieving this is to develop a set of modular, re-usable, client-side components for in-memory query answering. While the size of Wikidata is large (and growing quickly), it is certainly in the range of modern main memory sizes, and the added flexibility of a memory-based model is essential to support a wider range of queries. Moreover, components for loading and updating data selectively can help to filter information so that querying is possible even on machines with commodity memory sizes.

Project goals

The project has two main technical outcomes:

(1) Wikidata toolkit. A set of modular components for in-memory processing of information from Wikidata in a programmatic way

(2) Query web service. A web service to run queries against current Wikidata content that is built on top of the toolkit

In addition, the project aims at a soft outcome to ensure sustainability beyond the initial grant:

(3) Community engagement. Active involvement of volunteer developers and interested users

Outcome (1) is the heart of the project. Outcome (2) is a first application that will make (1) more tangible and help to evaluate project progress. Outcome (3) aims at increasing the long-term impact of the project. In view of (3), a particular focus of toolkit development will be maintainable code and an extensible architecture.

The general goals that these outcomes should help to achieve are:

Significantly lower barriers for using and analysing Wikidata content
Improved quality control mechanisms for Wikidata editors
Higher utility and visibility of Wikidata content, beyond direct use in Wikimedia projects
Increase in content-driven applications based on Wikidata content

The following are not goals of the project:

To develop new database management software (the project is read-only)
To replace future Wikidata query features developed in Wikibase (they address different needs and requirements)
To develop innovative user interfaces for queries/analysis (this might be a follow-up project)
To improve MediaWiki API access for programs (API access and bot frameworks are different types of toolkits; the problems addressed in the present project are not addressed by Wikidata's current web API)

Part 2: The Project Plan

Project plan

Scope:

Scope and activities

Development will proceed in three phases with main outcomes as follows:

(1) Initial toolkit design: requirements and performance targets, review of existing technologies, initial architecture and key technology choices
(2) Toolkit implementation: fully functional core components, refined architecture, evaluation of key functionality
(3) Web service realisation: query interface implementation, deployment on web server

The phases will overlap, especially phases (2) and (3), and feedback between these activities is desired to improve the practical utility of the work. An orthogonal activity is (4) community engagement, which is required during all phases.

Work will be organised in tasks as follows. The relative effort refers to estimated part of overall working time. Management effort is not included.

Task#	Month	Relative effort	Title	Description
T1	1	5%	Requirements gathering	Define exact tasks and requirements. What queries should be supported? How current does data need to be? Define concrete performance goals.
T2	1–2	10%	Technology review	Review existing implementations for re-usability (DBMS, Wikidata query facilities, other components). Basic performance testing/benchmarking. Review existing code for relevant approaches.
T3	2	5%	Key architectural choices	Decide on main language to use (Java, C++, something else). Decide on re-usable components (esp. DBMS). Select general utility components to re-use (logging, testing, messaging, etc.). Set up repository/bugtracker/reporting.
T4	2	10%	Overall architecture	Plan structure and interplay of toolkit components. Lay out extension points and interfaces.
T5	2–3	10%	Toolkit skeleton	Design and implement core component interfaces and (boilerplate) functionality.
T6	3–5	25%	Implement main components	Work will be split by component. Details depending on architecture; see below.
T7	5–6	15%	Query Web service	Implement Web interface. Deploy web service and backend on a dedicated machine. Publish.
T8	6	10%	Evaluate prototype	Assess progress and evaluate performance. Identify open issues. Compile experiences and insights to obtain recommendations/targets for future development.
T9	1–6	10%	Community outreach & training	Inform users about project results. Provide developer support/documentation. Gather feedback on design choices.

The bulk of development work is in tasks T5, T6, and T7. These are understood to include design activities and architectural refinements based on practical experiences with the code. Component development in T6 will be further split by components, depending on earlier architectural choices; potentially relevant components are:

Data model: an object representation of Wikidata content
Loading: read input from Wikidata dump files
Index: data structures and mechanisms for faster data access
Query: interfaces to describe queries; algorithms to compute query answers using index structures
Update: mechanisms for updating previously loaded data
Wikidata API: fetch live data from Wikidata
Durability: components to persist indexes and other loaded data structures for faster re-loading across sessions
Language bindings: allow data access from additional programming languages or platforms

Of these, Data model, Loading, Index and Query are essential to realise the web service. Other components may or may not be considered essential based on the initial requirements analysis; components might be part of the general architecture (T4) even if their implementation is not in the scope of the project.
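The interplay of the four essential components can be illustrated with a minimal sketch. It is written in Python purely for illustration, since the proposal leaves the choice of implementation language to task T3. All class and function names, the one-JSON-object-per-line input format, and the identifiers such as "mother" and "sex" are hypothetical placeholders, not the real Wikidata dump format or property IDs. The query shown is the constraint check mentioned in the motivation (a mother should normally be female).

```python
import json
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """Data model: one simplified (subject, property, value) statement."""
    subject: str
    prop: str
    value: str


def load_statements(lines):
    """Loading: parse a hypothetical one-JSON-object-per-line dump format."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield Statement(record["subject"], record["property"], record["value"])


class StatementIndex:
    """Index: in-memory lookup tables for access by subject and by property."""

    def __init__(self, statements):
        self._by_subject_prop = defaultdict(list)
        self._by_prop = defaultdict(list)
        for st in statements:
            self._by_subject_prop[(st.subject, st.prop)].append(st.value)
            self._by_prop[st.prop].append(st)

    def values(self, subject, prop):
        return self._by_subject_prop.get((subject, prop), [])

    def with_property(self, prop):
        return self._by_prop.get(prop, [])


def mothers_not_female(index):
    """Query: (person, mother) pairs where the mother is not recorded as female."""
    violations = []
    for st in index.with_property("mother"):
        if "female" not in index.values(st.value, "sex"):
            violations.append((st.subject, st.value))
    return violations


# Made-up sample data; identifiers are placeholders, not real Wikidata items.
SAMPLE_DUMP = """
{"subject": "Q_child1", "property": "mother", "value": "Q_anna"}
{"subject": "Q_child2", "property": "mother", "value": "Q_bob"}
{"subject": "Q_anna", "property": "sex", "value": "female"}
{"subject": "Q_bob", "property": "sex", "value": "male"}
"""

index = StatementIndex(load_statements(SAMPLE_DUMP.splitlines()))
violations = mothers_not_female(index)
print(violations)
```

In this sketch the index is built once and the query only touches statements that use the "mother" property, which is the kind of access pattern an in-memory design is meant to make cheap. A real data model would also need to cover qualifiers, references, and typed values, which are omitted here.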

Tools, technologies, and techniques

A significant number of technologies are relevant to this project, and part of the work is to assess the available approaches and systems regarding their utility for Wikidata. Task T2 allocates effort to study existing work. The main relevant technologies are:

Wikidata Query (WQ): Currently the only available query system for Wikidata, developed by Magnus Manske. In contrast to the project proposed here, WQ is not meant for programmers but as a service to the editor community. The C++ code is not written for reuse. Also WQ's specific assumptions about the form of queries (tree-like with specific path expressions) and the considered data (only main properties of Wikidata statements) distinguish it from what is proposed here. Yet, it is an invaluable input for building query functions in the proposed toolkit.
Wikidata Analytics (WDA): Currently the only available thing that is close to a "programmer's library" for Wikidata. Python-based. Provides functions for downloading and parsing Wikidata dumps, as well as for analysing their contents. Used to create regular statistics. Relevant input for creating similar functionality in the toolkit.
Wikibase query services: The Wikibase software underlying Wikidata will provide certain query services in the future, currently under development. Related extensions: Ask, WikibaseQueryEngine, WikibaseQuery, WikibaseDatabase, and AskJS; initial deployment planned for late 2013. The goals, capabilities, and intended use of these extensions are mostly different from what is planned for the Toolkit; main overlap in common use case of providing a query service of some form. Potential for useful co-operations: exchange use cases/requirements/typical loads (Task T1) and exchange experiences with implementation techniques and backend components (Tasks T2, T3, T6).
Database Management Systems (DBMS): Wikidata's highly normalised, statement-based data model shares similarities with that of modern graph, semantic web, and document databases. A wide range of DBMS exists to maintain and query such data. These include established DBMS and query systems for documents (e.g., Lucene & friends), RDF (e.g., Virtuoso, 4Store, Sesame), and graphs (Neo4j). Moreover, there are various research prototypes that implement more specialised and lightweight approaches to query answering, including diverse systems like RDF3X, AMIE, RDFox. Finally, "fact-based" data models as in Datomic, DLV, or LogicBlox may also be relevant, though of course non-free software can only be used to gather experiences and performance metrics.
Query languages: It is actually unclear which query language(s) is (are) suitable for the given use cases. Wikidata Query uses an ad hoc formalism. Basic pattern matching queries (conjunctive queries) are not powerful enough to express most current constraints. Full-fledged languages like SQL or SPARQL are too complex for full re-implementation in this small project. At the same time, it is not clear that their mechanisms for path queries, grouping, and aggregates are appropriate here.
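To make the notion of a basic pattern-matching (conjunctive) query concrete, the following Python sketch evaluates a list of triple patterns against a small set of facts using a naive nested-loop join. It is an illustration under stated assumptions, not a proposed design: the function names, the "?variable" convention, and the sample triples are all hypothetical. The example query asks for albums recorded by artists from a given country.

```python
def is_variable(term):
    """Terms starting with '?' are variables; everything else is a constant."""
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if is_variable(term):
            # Bind the variable, or fail if it is already bound to another value.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def answer(patterns, triples):
    """Evaluate a conjunctive query: every pattern must match under one binding."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Made-up facts; identifiers are placeholders, not real Wikidata items.
TRIPLES = [
    ("album_a", "performer", "artist_x"),
    ("album_b", "performer", "artist_x"),
    ("album_c", "performer", "artist_y"),
    ("artist_x", "country", "country_1"),
    ("artist_y", "country", "country_2"),
]

query = [
    ("?album", "performer", "?artist"),
    ("?artist", "country", "country_1"),
]

results = answer(query, TRIPLES)
albums = sorted(b["?album"] for b in results)
print(albums)
```

A query of this form can only require that certain statements exist. A check that depends on a statement being absent (for instance, an item with no recorded value for some property) cannot be written as a plain conjunction of patterns, which is one way to see why such queries fall short for constraint validation.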

Hosting plan

The whole project, and especially the query service, is conceived as an in-memory application, which will require adequate resources to run. This is a relevant concern for hosting a public query service, since standard web servers are often not laid out for such load. While hosting on Wikimedia Labs is desirable for close Wikimedia integration, it might not be practical if memory requirements are too high. Supported by TU Dresden, the applicant thus intends to provide a dedicated server for hosting the query service from his own grant money. This includes university-grade bandwidth. Hosting of the query service demonstrator could be provided for the foreseeable future beyond the running time of the project (the applicant's current grant runs until 2018, pending a mid-term evaluation). The query service is intended for casual use by editors and interested readers: the kind of data centre infrastructure needed to scale to the number of read requests on Wikipedia is of course beyond the project, and other solutions would have to be sought if demand increases dramatically. This hosting plan does not exclude co-operations with Labs, which would also be welcome (resources permitting).

Budget:

Total amount requested

30,000 USD (22,550 EUR)

Budget breakdown

It is expected that the project manager, a research assistant and a student assistant work on the project in part of their time. As discussed on the talk page, the project cannot be executed by an organisation, but only by individual participants. Nevertheless, the regular salaries of these project participants of different qualifications provide a good estimate for a realistic and sensible value of their labour. Of course, short-term freelance work in IT is usually much more expensive than the salary of a long-term employee or student, but this is compensated by the fact that freelancers do not charge any institutional overhead.

Example calculation

The following budget breakdown provides a rough indication of personnel cost. It is intended as an example to show that the requested amount is commensurate with the planned work. Figures are simplified to show approximate values. A total overhead of 31% is included. This is the standard institutional overhead for industry-funded R&D activities at TU Dresden, but it is also a realistic estimate for the overhead of hiring a freelancer instead of an employee on a long-term contract. The expected personnel costs are therefore as follows:

Employee	PM⁽¹⁾	Total salary cost⁽²⁾	Breakdown of cost	Total⁽³⁾
Group leader	1	5,600 EUR	Salary grade TV-L 15 Ost, 25% non-wage labour costs	7,350 EUR
Research assistant	1.5	7,150 EUR	Salary grade TV-L 13 Ost, 26% non-wage labour costs	9,350 EUR
Student assistant	2.5⁽⁴⁾	4,600 EUR	5 months at 82 hours/month, 8.79 EUR/hour, 28% non-wage labour costs	6,050 EUR
Total				22,550 EUR

Remarks:

(1) Person months
(2) This is the gross salary plus any additional cost of labour of the employer
(3) Total cost including 31% overhead
(4) Students in Germany can work at most 82 hours/month on a contract without losing their legal status as students, hence 2.5 PM correspond to a 5 month employment of one student assistant (or to a 6 month employment at a rate below the maximum hours)

The calculated total of 22,550 EUR thus roughly corresponds to the requested funding of 30,000 USD (subject to current exchange rates).
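As a worked example of how one table row is derived, the student assistant figures can be reproduced in a few lines of Python. This is a sketch under one assumption that the proposal does not state explicitly: the 28% non-wage labour costs are treated as a surcharge on the gross wage, and the 31% overhead is then applied to the resulting salary cost.

```python
MONTHS = 5
HOURS_PER_MONTH = 82
HOURLY_WAGE_EUR = 8.79
NON_WAGE_RATE = 0.28  # assumption: applied as a surcharge on the gross wage
OVERHEAD_RATE = 0.31  # overhead rate given in the proposal

gross_wage = MONTHS * HOURS_PER_MONTH * HOURLY_WAGE_EUR
salary_cost = gross_wage * (1 + NON_WAGE_RATE)
total_cost = salary_cost * (1 + OVERHEAD_RATE)

print(f"gross wage:  {gross_wage:.2f} EUR")
print(f"salary cost: {salary_cost:.2f} EUR")
print(f"total cost:  {total_cost:.2f} EUR")
```

Under this assumption the salary cost comes to roughly 4,613 EUR and the total to roughly 6,043 EUR, which matches the simplified table values of 4,600 EUR and 6,050 EUR.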

Practical considerations

The main applicant will act as the main individual grant holder, and distribute funds to other team members or advise WMF to do so (if preferred). TU Dresden will thus not be involved organisationally. Nevertheless, the project is closely related to research activities in the applicant's research group at TU Dresden, and it is hoped that members of this research group will participate (individually) in the effort. To the best of the applicant's knowledge, the above efforts are compatible with TU Dresden's regulations concerning admissible time commitments for side jobs. The participation of group members is desirable to ensure work on the project beyond the initial funding (sustainability).

Since the publication of this proposal, several community members have indicated their interest in participating in the project as developers. Since IE Grants fund individuals rather than organisations, it would be possible for those community members to join the development team even without any affiliation to TU Dresden. This increased flexibility is an advantage. To allow for this flexibility, the above example calculation should not be considered to specify the final distribution of funds.

Intended impact:

Target audience

Developers of Wikidata-based applications: the toolkit that is the heart of the project will make it easier for developers inside and outside the MediaWiki cosmos to take full advantage of Wikidata content in their projects.
Wikidata "readers": Wikidata does not really have "readers." The project will help to create technologies that change this, allowing people to access content in more flexible and useful ways.
Wikidata editors: The project will simplify the computation of content-related statistics and the validation of constraints, which is of much interest to the community of Wikidata editors.
Researchers: Wikidata is a treasure trove for researchers who want to study, e.g., the topic interests of Wikimedia contributors in different languages, or the conceptual approaches taken to organise content across cultures. The project will provide them with a toolbox to get at this data.

Fit with strategy

The project affects all strategic objectives to a greater or lesser extent. The primary objective of the project is to encourage innovation. The demonstrator web service will also contribute to improving quality. Indirect effects of the project will help to increase reach and participation, and to stabilize infrastructure.

Encourage Innovation

Wikidata is one of the most innovative projects in the Wikimedia universe, and the community is only starting to explore the possibilities of this new resource. Additional investment is needed to support a wider community of developers to join the creative process to make the most of Wikidata.

Potential applications range from data browsers and search tools, over tools for data analysis and research, to quality control and editing support tools. Most of these applications require basic ways for obtaining, loading, representing, searching, and analysing Wikidata contents. The project will provide a common basis for such tools, so that innovators can focus on their ideas rather than on technical hassles.

Improve Quality

Currently, one of the main uses for complex queries over Wikidata is constraint validation. The project will develop software that allows complex questions to be answered more immediately, thus improving the existing quality control infrastructure. Analytical tools can also contribute to discovering editorial biases and other large-scale trends that are relevant for Wikimedia's quality goals.

Another current concern, or even risk, regarding the development of Wikidata is that many decisions about its content organisation are taken without knowing how the data will be used. Import into Wikipedias is still rare (since the system is lacking some features, notably ranks), and thus there are hardly any sources of requirements on what the data should look like. The existence of more applications that use the data will help to guide these efforts to make sure that they are driven by practical needs.

Increase Reach

Wikidata integrates data across all Wikipedia languages and several Wikimedia projects. Applications based on this data can easily be localized to reach wider audiences, including speakers of smaller language groups (see, e.g., this data-based map). Data-driven applications remove language and culture boundaries between user communities, thus directly contributing to the strategic goal of increasing reach.

Stabilize Infrastructure

Activities related to Wikimedia's infrastructure objective had the goal "to enable developers to create applications that work easily with MediaWiki." The present project extends this goal to the new requirements of Wikidata, enabling developers to work easily with the content gathered therein. A secondary effect is that the project thus empowers community members to address tasks that would otherwise bind WMF resources, in terms of both staff and infrastructure.

Increase Participation

In Wikidata, increased reach is closely tied to increased participation. The project is not as widely known as Wikipedia, and every interesting direct use of its content will also help to win new editors who otherwise would not care about the data at all. This way of increasing participation is more promising for Wikidata than for most other Wikimedia projects, since there is hardly any gap between displaying data and modifying it. Indeed, it would be very easy for users to edit data they are viewing, even without ever visiting the Wikidata site. The powerful editing API of Wikidata makes it very easy for developers to add such features. The proposed project will further the creation of applications into which such in-place editing features can be integrated.

Sustainability

It is planned to continue work on the toolkit both in the research group of the main applicant and in the wider Wikidata community.

The applicant's own interest in the project is based on his interest in working with Wikidata content. This is also relevant for the applicant's research in data management, which is funded for the foreseeable future. The applicant is already involved in maintaining programming utilities for Wikidata (Wikidata Analytics) and plans to continue work on the toolkit beyond the end of the project.

The community involvement is an explicit task of the project, and this will also help to ensure future maintenance. Development will of course be completely open, thus inviting participation. Code will be developed with re-usability and maintainability in mind. If successful, the project will be of interest to a wider community, so future maintenance should be secured.

The availability of the query service demonstrator can be assured beyond the end of the project, as explained in the hosting plan.

Measures of success

End of project:

Fully functional query web service:
- implemented as a lightweight client on top of the toolkit
- able to answer typical constraint validation queries
- able to cope with small query loads using one reasonably-dimensioned server
Essential toolkit functionality available:
- Loading, re-loading, indexing, querying Wikidata content
Performance goals:
- Concrete performance targets as developed in Task T1
- Load performance significantly above existing Wikidata Analytics scripts
- Query performance comparable to Wikidata Query on basic queries
- Sustainable memory usage, based on current main memory sizes and Wikidata growth
Community building goals:
- Involvement of at least one student contributor
- Ongoing/planned projects using the toolkit
- Community participation in mailing list discussions, bug reporting, patch creation

Middle of project:

Insights about memory usage and query performance of several existing alternative systems
Concrete performance targets and resource requirements
Concrete implementation plans and architecture
Implementation of some key components (first functional prototypes)

Participant(s)

Markus Krötzsch is a Departmental Lecturer at the Department of Computer Science of the University of Oxford. Starting November 2013, Markus will move to TU Dresden where he will be leading a new research group. As of September 2013, the group still has two open positions for fully funded researchers (or doctoral students), hence the names of other participants are not given in this proposal.

Markus is the long-term lead of the Semantic MediaWiki (SMW) project, which he co-founded in 2005. Markus has contributed to the conception and current design of Wikidata, especially regarding the data model, and he is in close contact with the Wikibase development team. He is also a co-developer of the Wikidata Analytics toolkit, and of ELK (currently the fastest in-memory reasoner for the lightweight ontology language OWL EL).

Markus's research interest is in the areas of information systems, knowledge representation, databases, query languages, and (semantic) web technologies, where he has published many articles in leading journals and conferences. More information on his academic activities can be found at his personal homepage.

A research assistant from Markus's research group at TU Dresden will participate in the project. This person will hold either a PhD or MSc in computer science or a related area.

A student assistant will be involved in the project as early as possible. He or she will pursue a degree in computer science or a related area, most likely at TU Dresden. The student will be guided by the senior team members.

Discussion

Community Notification:

Links to where the relevant communities have been notified of this proposal, and to any other relevant community discussions:

Wikidata-l: http://lists.wikimedia.org/pipermail/wikidata-l/2013-September/002928.html
Wikidata-tech: http://lists.wikimedia.org/pipermail/wikidata-tech/2013-September/000321.html
semediawiki-user: http://sourceforge.net/mailarchive/message.php?msg_id=31463421
semediawiki-devel: http://sourceforge.net/mailarchive/message.php?msg_id=31463422

Endorsements:

Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project in the list below. Other feedback, questions or concerns from community members are also highly valued, but please post them on the talk page of this proposal.

Yes, I endorse the project since it will be very necessary for the use of Wikidata. --Projekt ANA (talk) 21:39, 29 September 2013 (UTC)
I think this proposal is timely, well-scoped, reasonably budgeted, doable in a six-month timeframe, and clearly aligned with the strategic objectives. It strikes a chord with me in terms of several of the envisaged target audiences:
1. I am "reading" Wikidata on a regular basis in order to develop a sense of how it could be useful in scientific research contexts. That sense is rather rudimentary at the moment, and I think such a toolkit would be of great value here.
2. Most of my edits to Wikidata are of the drive-by kind, but I expect to become more involved once tools are available to browse and edit existing content within a given subject area more systematically, and to identify highlights, inconsistencies (e.g. in terms of zoological taxonomy across Wikipedia languages) and gaps therein. I also think that such tools - especially the data browsers - would facilitate and enhance Wikimedia outreach efforts to the scientific community, where the battle for open data (let alone linked to controlled vocabularies) is still far from being won.
3. I run a bot on Commons that proposes categories to the files it uploads, but this preliminary categorization has a lot of room for improvement. I think some intelligent querying of Wikidata may help here, and building the tools for that would certainly be easier if a toolkit were available.
4. A follow-up project I would like to see would be how to fill some kinds of gaps systematically with the help of scientific resources, particularly from natural history collections and the biodiversity literature. I hope that the toolkit proposed here will provide GLAMs and other potential Wikidata content partners with means to assess the potential impact before embarking on a collaboration, and to measure it once they have started.
To sum up, I fully endorse the funding of this proposal. -- Daniel Mietchen (talk) 02:17, 30 September 2013 (UTC)

Our customers are excited to use free knowledge in the space of there MediaWiki-driven business circumference. Meanwhile web services are the de facto standard for automated knowledge transfers. Several years after implementing my first web service client/server into MediaWiki with the good old NuSOAP - SOAP Toolkit for PHP, I can't wait to see/test the first data centric endpoint. I hope it will be a little bit easier to handle then other components :-). And I would guess the budget is currently to low.

--Steviex2 (talk) 16:29, 30 September 2013 (UTC)

As a member of the Gene Wiki project, I fully endorse this application. The Gene Wiki team could use the query services described here to improve the operation of the bot that maintains the gene-related entries. Further, our team would likely be among the first to make use of the query service for scientific applications. We have previously created semantic applications using similar data and look forward to bringing similar things to life based directly on wikidata. If this work can make the wikidata system advance towards its potential as a gateway to knowledge more quickly, and I think it can, then I am all for it! Genewiki123 (talk)
Proposal sounds very prospective and is vital for the future of Wikidata. --Tobias1984 (talk) 17:28, 30 September 2013 (UTC)
As the researcher that completed the first example of comparing the Sex ratio of person articles across Wikipedias, I will attest we need this. I had the idea of that research on the bus one morning, and then it took me 2 weeks to code the solution. If this was in place, it would have been done in the same afternoon. Maximilianklein (talk) 19:21, 30 September 2013 (UTC)
I think that the proposal is timely and very important for the future of Wikidata. The author is clearly capable to implement the project. Katkov Yury (talk) 08:50, 1 October 2013 (UTC)
The Wikidata development team supports this proposal. Markus is well suited to undertake this effort. It will be useful to have this to grow the 3rd-party ecosystem around Wikidata. --Lydia Pintscher (WMDE) (talk) 08:50, 2 October 2013 (UTC)
Appears to be very useful, if not essential. --Danrok (talk) 17:33, 3 October 2013 (UTC)
I support the proposal. It is urgently needed to improve the query capabilities of Wikidata, and Markus and his team can deliver it. --Longbow4u (talk) 11:31, 5 October 2013 (UTC)
Support the proposal. Wikidata query surely needs an uplift. Chinmay26 (talk)
I think that this project would be an amazing boon to Wikidata, and that Markus is uniquely positioned to create this. The realization of this project can provide for a crucial ingredient for the overall success of Wikidata, by supporting its usage outside as well as helping with the reuse inside of the WIkimedia universe. --denny (talk) 21:28, 11 November 2013 (UTC)