Grants:IEG/Wikidata Toolkit/Midpoint

Welcome to this project's midpoint report! This report shares progress and learnings from the Individual Engagement Grantee's first 3 months.

Summary

In the first three months, the Wikidata Toolkit IEG project has released the first two versions of the Wikidata Toolkit Java library, set up regular RDF exports of Wikidata content on Wikimedia Labs, and carried out a number of outreach activities to make the project known to users and developers. Details on methods, outcomes, and lessons learned are given below.

Methods and activities

In this section, we focus on how things were done; what was achieved is detailed in the section Midpoint outcomes below.

Work environment

The project was conceived as a model free software project, and a lot of technical setup work was done before the project started so that the necessary infrastructure was in place.

This provided us with a range of powerful tools from the very start of the project. This environment has helped both technically (e.g., catching errors found by Travis CI) and socially (e.g., allowing other developers to engage with us through GitHub).

Methods and processes

  • Weekly team meeting:
The team meets every week for about one hour. This time is used to report on recent progress, discuss open questions, distribute tasks, and review what has been done. The goal is to organize work in short sprints, although tasks need to be very small for part-time developers to complete them within a week. Even so, the meeting is important for staying in touch.
  • Online discussions:
Many technical discussions happen online so that they are publicly visible. The main venues are the discussion threads on GitHub that exist for every open issue and every piece of code. We have also used the Wikidata mailing lists for general discussions that are not specific to our code.
  • Code review:
Code is rarely committed directly to the master branch. Usually, we follow the branch-submit-review-merge process that is common on GitHub, so all code is reviewed by a second pair of eyes before being merged. Integration tests are run automatically.

Key development focuses and insights

Development has focussed on the outcomes detailed below: loading and representing Wikidata content in Java, and exporting this data to JSON and RDF.

Background research: significant time has been spent researching which third-party libraries to use, e.g., for JSON parsing or RDF serialization. Some of these insights will still lead to changes in our code (e.g., we plan to change the JSON library we are using). Community feedback was helpful here (e.g., on JSON).

Other activities

Besides actual coding, we have also undertaken a number of outreach activities, detailed below.

Midpoint outcomes

What are the results of your project or any experiments you’ve worked on so far?

Please discuss anything you have created or changed (organized, built, grown, etc) as a result of your project to date.

There are two main tangible outcomes of the project to date:

  • Wikidata Toolkit v0.2.0 is a Java-based software library that greatly simplifies processing and accessing Wikidata data exports
  • The Wikidata RDF exports published at Wikimedia Labs are a useful stand-alone resource for third-party users

In addition, there are some intangible yet relevant outcomes of the first phase:

  • The project team (project lead, assistant, two students) has acquired many important skills
  • Outreach activities have made Wikidata Toolkit (and Wikidata as such!) more visible to others; external users and developers have engaged with the project

Each point will be discussed briefly below.

Wikidata Toolkit v0.2.0

Key facts about Wikidata Toolkit as of version 0.2.0:

  • 14,146 lines of Java code (statistics by Ohloh)
  • 467 commits (statistics by GitHub)
  • Well-commented source code (online JavaDocs)
    • 10,663 lines of comments in Java (43% of all file contents)
    • ranking among the top third of Java projects in Ohloh
    • according to Ohloh, this "might indicate that the code is well-documented and organized, and could be a sign of a helpful and disciplined development team."
  • 5 developers have made commits to master (statistics by Ohloh):
    • Markus Krötzsch: 243
    • Fredo Erxleben: 91
    • Julian Mendez: 70
    • Michael Günther: 61
    • Denny Vrandečić: 2
Note that commit numbers are only a rough indication of effort, since the size of the commits varies greatly.

This Java library is the main technical component developed in the project. Version 0.2.0 is its second release. Some basic facts that characterise the effort spent on this activity are shown in the box above. These figures show that Wikidata Toolkit is a very active project that puts an emphasis on documentation and testing. Both are very important for the development, since the code must be reliable and maintainable to be useful as a library.

The main features of Wikidata Toolkit v0.2.0 are as follows:

  • Complete implementation of the Wikibase data model in Java, so that all Wikidata content can be faithfully represented in programs
  • Processing MediaWiki XML exports, in particular the data content exported by Wikibase wikis such as Wikidata
  • Fetching and managing Wikimedia dumps as published at http://dumps.wikimedia.org/
  • Serializing Wikibase data into RDF, using a choice of encoding approaches
  • Serializing Wikibase data into the official JSON format used in the API of Wikibase

A simple example for how to use this functionality is given in the file DumpProcessingExample.java. A few dozen lines are enough to compute statistics about the current content of Wikidata: the program automatically downloads the most recent dumps, decompresses and parses the MediaWiki page dumps, and converts the content of every data page into data objects; the user only writes the code that inspects these objects to compute the desired statistics. A similar workflow can be used to build more complex analyses, such as Max Klein's recent analysis of gender ratios in Wikipedias, which is also based on Wikidata Toolkit.
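
To give a flavour of this, the sketch below shows roughly what such user code looks like. It is only an illustration, not code from the project: the interface and method names (EntityDocumentProcessor, processItemDocument, processPropertyDocument) follow the library's public data model interfaces, but the exact signatures, and the surrounding boilerplate that registers the processor and triggers the dump download, should be taken from DumpProcessingExample.java for the library version in use.

  // Illustrative sketch only; see DumpProcessingExample.java in the
  // Wikidata Toolkit sources for the complete, authoritative example.
  import org.wikidata.wdtk.datamodel.interfaces.EntityDocumentProcessor;
  import org.wikidata.wdtk.datamodel.interfaces.ItemDocument;
  import org.wikidata.wdtk.datamodel.interfaces.PropertyDocument;

  /**
   * Counts the items and properties found in a dump. The code that downloads,
   * decompresses and parses the dump (and calls this processor for every
   * entity document) is omitted here; it is the setup shown in
   * DumpProcessingExample.java.
   */
  public class EntityCountProcessor implements EntityDocumentProcessor {

      private long itemCount = 0;
      private long propertyCount = 0;

      @Override
      public void processItemDocument(ItemDocument itemDocument) {
          this.itemCount++; // inspect itemDocument here to gather real statistics
      }

      @Override
      public void processPropertyDocument(PropertyDocument propertyDocument) {
          this.propertyCount++;
      }

      public void printStatistics() {
          System.out.println("Items: " + this.itemCount
                  + ", properties: " + this.propertyCount);
      }
  }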

In its current form, Wikidata Toolkit thus replaces the earlier Python implementation wda and its precursor wikidata analysis in almost every respect.

Wikidata RDF exports

As a specific application of Wikidata Toolkit, we are creating regular exports of the content of Wikidata in the Resource Description Framework (RDF) of the W3C. RDF was conceived as a data format for the Web and is used by a significant community of practitioners and researchers to represent the data they work with. Wikidata already offers some basic RDF exports through a Web API, but most information is not available in RDF at all. While based on Wikidata Toolkit, the RDF exports are therefore an outcome in their own right, which might well be used by others without considering Wikidata Toolkit at all. Indeed, there are a number of existing tools for processing RDF that can be applied to this data.
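
As a hedged illustration of such third-party reuse (this is not part of the project's own code), the following sketch uses the Apache Jena library to load one of the exported files locally and run a simple SPARQL query over it. The file name is a placeholder; the actual file names are listed on the export pages, and current Jena package names are assumed.

  // Illustrative sketch using Apache Jena (not part of Wikidata Toolkit).
  // The file name below is a placeholder for one of the published export files.
  import org.apache.jena.query.QueryExecution;
  import org.apache.jena.query.QueryExecutionFactory;
  import org.apache.jena.query.ResultSet;
  import org.apache.jena.rdf.model.Model;
  import org.apache.jena.riot.RDFDataMgr;

  public class WikidataRdfExportDemo {
      public static void main(String[] args) {
          // Load a previously downloaded N-Triples export into an in-memory model.
          Model model = RDFDataMgr.loadModel("wikidata-export-example.nt");

          // Count the distinct properties used in this part of the export.
          String query = "SELECT (COUNT(DISTINCT ?p) AS ?props) WHERE { ?s ?p ?o }";
          try (QueryExecution qe = QueryExecutionFactory.create(query, model)) {
              ResultSet results = qe.execSelect();
              while (results.hasNext()) {
                  System.out.println("Distinct properties: "
                          + results.next().get("props"));
              }
          }
      }
  }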

We have created a new project at Wikimedia Labs to publish RDF exports regularly: http://tools.wmflabs.org/wikidata-exports/rdf/. Here is an example page for the exports for May 2014. As can be seen from this page, we generate a number of RDF files each time. On the one hand, we split the data into several parts that cover distinct aspects; RDF files can easily be combined, so this allows users to mix and match the data they care about. On the other hand, we provide some simplified exports that contain only a specific view of the data, which we think is useful for certain applications.

The encoding of Wikidata in RDF is not straightforward, and indeed there are many design decisions on how best to accomplish this. We have documented our work and the resulting files in a recent research paper, which is currently under submission to a conference (PDF). It provides extensive documentation of the respective functionality in Wikidata Toolkit, and it explains the output in a way that should be useful for those who want to use the data without considering Wikidata Toolkit at all. If accepted, the paper will also add to the visibility of Wikidata in the academic community.

Team skills

The development team had to learn many things. Every member of the team had Java development experience before the project, yet there were many technical skills that some or all team members had to acquire first. Important skills of this kind include:

  • Working with git (in Eclipse) for version control
  • Using GitHub and related services (continuous integration via Travis CI, code coverage reports via Coveralls)
  • Build management with Apache Maven
  • Deployment of Java releases on Sonatype
  • Using Wikimedia Labs
  • Working with a range of specific Java libraries (XML parsers, JSON libraries, RDF libraries, ...)

Overall, these aspects are an important part of the complex environment of modern software development. Becoming familiar with this environment requires a lot of effort that does not directly translate into lines of code. Yet, it is essential for productive software development and deployment.

Besides these rather generic skills, most team members also had to become familiar with Wikidata and Wikibase, and with some aspects of MediaWiki and Wikimedia.

Learning these things is essential to contribute to the development work in the project, but we also consider it as a project outcome in itself. Two of the team members are student assistants (funded by a research grant), and learning relevant skills is an important aspect of their engagement. The joint authoring of a research paper also contributed to this general goal.

Outreach and community engagement

For a new project like Wikidata Toolkit, it is important to engage users who apply the software in practice and provide feedback to guide future development. Of course, one can announce new versions on mailing lists and maintain online documentation, but more is needed to spread the word in the first few months. We have therefore engaged in a number of specific outreach activities beyond development:

  • On May 11, Markus Kroetzsch gave a tutorial on using Wikidata Toolkit at the Zurich Hackathon 2014.
  • On May 22, Markus Kroetzsch gave an invited keynote talk at the 9th Semantic MediaWiki Conference (SMWCon Spring 2014) in Montreal, Canada.
  • The aforementioned research paper and RDF export project have already been helpful to some users (we have received email feedback and comments on the content of the paper). If accepted for publication, the paper will further increase visibility.

Further future activities of this kind have been planned and prepared:

  • Two Wikidata-related talk submissions to Wikimania 2014 in London have been accepted. Markus Kroetzsch will be a Featured Speaker in the Open Data track.
  • Three of the team members plan to participate in Wikimania London.
  • Markus Kroetzsch will give an invited talk and tutorial about research with Wikidata (and Wikidata Toolkit) at the Web Intelligence Summer School 2014 in Saint-Étienne, France, in August 2014.
  • Markus Kroetzsch has been invited to give a keynote at the Semantic Web in Libraries Conference (SWIB14) in Bonn, Germany, in December 2014.

Activity on GitHub also shows some initial interest from developers. We have received several comments and reports from people who are interested in the project. We hope that in some cases this will lead to concrete development contributions, but we also welcome feedback in general.

Some of the above activities clearly focus on academic communities, for which Wikidata should be of specific interest. Being based at a university, the project is in an ideal position to reach out to these users. This also involves local activities: two ongoing student projects at TU Dresden are using Wikidata Toolkit to access Wikidata for further research. One is related to the benchmarking of query answering, the other is focussed on visualisation of query results. If successful, both can lead to valuable contributions.

Finances

Have you spent your funds according to plan so far?

Yes, but the finance page had not been updated since the initial proposal, as it should have been. The proposal planned for the work of a research assistant from the project leader's group. However, this position could not be filled, and the project started with a different team composition and a different distribution of work. This was communicated by email at the project start, but the finance table still showed the old team. A change has now been requested.

This change does not affect the overall amount of work or the overall cost of the project, but only the way in which both are distributed among people. Therefore, we consider this spending to be "according to plan".

Please briefly describe any major changes to budget or expenditures that you anticipate for the second half of your project.

No further changes are anticipated. In particular, there are no plans to add another person to the project now. The travel cost for Wikimania might be listed separately in the final report; it would then be deducted from the personnel costs.

Learning

The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you are taking enough risks to learn something really interesting! Please use the below sections to describe what is working and what you plan to change for the second half of your project.

What are the challenges

What challenges or obstacles have you encountered?

  • Limited work-force:
    • three of four team members restricted to one or two working days/week
    • conflicting commitments may temporarily reduce this (examination periods, day job)
  • Proper software development is very time-consuming
    • Test creation and configuration (e.g., for deployment) takes more time than "real" development
  • Little time for error
    • Difficult to test several solutions when each takes weeks to complete
  • Varying levels of productivity
    • Some tasks take much longer than expected (even when expecting this)

What will you do differently going forward?

  • Smaller task sizes
    • Make sure that everyone can complete their tasks each week
    • Introduce new features in small chunks over time
  • (Even) stronger separation of tasks
    • Decouple the work of team members as much as possible for flexibility
  • Encourage specialization of team members
    • Allow everybody to focus on "their" part of the code (easier now that the project is bigger)
  • Stronger focus on "visible" features
    • Create more applications that one can make screenshots of

What is working well

What have you found works best so far? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your findings in the form of a link to a learning pattern.

Next steps and opportunities

What are the next steps and opportunities you’ll be focusing on for the second half of your project? Please list these as short bullet points. If you're considering applying for a 6-month renewal of this IEG at the end of your project, please also mention this here.

There are two major areas of work for the next three months:

  1. Support for storage and query
  2. Show-case applications on Wikimedia Labs

The first is at the heart of the technical proposal, and has been deferred to the second half in favour of finishing RDF serialization first. It is important work to support applications, but it is also "back-end work" that does not immediately translate into visible applications.

The second area is important for visibility and practical utility. In addition to the regular RDF exports, we plan to set up useful reports that are generated regularly on Wikimedia Labs. One particular opportunity that came up recently is the use of the Miga Data Viewer, developed in a previous IEG grant, as a front-end to display these reports in an appealing way. Wikidata Toolkit and Miga complement each other nicely, and Miga illustrates how one can provide dynamic Web applications for data visualisation without putting any additional load on the server. However, other visualisation options will also be considered, including query services as discussed in the proposal.

Visible applications offer the opportunity to engage more users, which is the ultimate goal of our work, and will therefore have a higher priority in the second half. This will also help to prepare our Wikimania presentations.

Grantee reflection

We’d love to hear any thoughts you have on how the experience of being an IEGrantee has been so far. What is one thing that surprised you, or that you particularly enjoyed from the past 3 months?

  • Reporting was mostly lightweight, which helps to focus on the work, not on the management.
  • Recruiting is a challenge, but the flexible project planning helps to react to changes.
  • The midpoint report took some time to create, especially to produce a (hopefully) useful learning pattern that makes sense beyond the context of one project.
  • The combination of an Open Source Wikimedia project and university work/teaching requires some extra flexibility, but also has its own benefits (wider perspective, additional application areas).
  • As usual in software projects, one has to be prepared for some changes of plan.