Grants:IEG/Wikidata Toolkit/Timeline

Timeline for Wikidata Toolkit

Milestone | Date
Assemble project team and get everyone up to speed | 17 February 2014 (project launch)
Initial data representation and import: Java data model implementation, dump file parsing | 17 March 2014
Primary data index and basic retrieval | 31 March 2014
Example applications, basic documentation, command line tool | 17 April 2014
Initial support for advanced queries | 17 May 2014
Demonstrator Web site | 17 June 2014
Extended support for advanced queries | 17 July 2014
Final user documentation | 17 August 2014


Monthly updates

The project starts on Feb 17, 2014. Monthly updates will appear after each completed month of work. For current information and usage instructions, please see the Wikidata Toolkit homepage.

February/March 2014

The focus of the work in the first month was to get access to Wikidata content from Java in the first place. To this end, the data model was reimplemented in Java, and new components were created to download and process Wikimedia dump files. The first release, Wikidata Toolkit 0.1.0, published at the end of March, already allows users to access this data in a streaming fashion: no advanced querying yet, but convenient access at a speed well above that of the old Python scripts. The documentation has been extended accordingly.
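
For illustration, streaming access works by implementing a callback interface that receives one entity document at a time while the dump is read. The following minimal sketch uses the EntityDocumentProcessor interface of the toolkit; the surrounding setup code changed between early releases, so treat the details as indicative rather than exact.

    import org.wikidata.wdtk.datamodel.interfaces.EntityDocumentProcessor;
    import org.wikidata.wdtk.datamodel.interfaces.ItemDocument;
    import org.wikidata.wdtk.datamodel.interfaces.PropertyDocument;

    // Counts item and property documents while streaming through a dump.
    public class EntityCounter implements EntityDocumentProcessor {

        long itemCount = 0;
        long propertyCount = 0;

        @Override
        public void processItemDocument(ItemDocument itemDocument) {
            itemCount++; // called once for each item document in the dump
        }

        @Override
        public void processPropertyDocument(PropertyDocument propertyDocument) {
            propertyCount++; // called once for each property document
        }
    }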

The project counts more than 9,000 lines of code, is well documented (according to Ohloh), and is well tested (93% test coverage). Four developers are working on the code part time. Part of the first month was spent introducing everybody to the project and setting up basic processes. The plan for the coming weeks is to focus on performance improvements and on export to other formats, since these promise the greatest short-term benefits for users. This is a slight change from the original plan to focus on storage and querying first; that work will still proceed in parallel, but no longer exclusively.

The initial work leading up to the release took until the end of March, so the subsequent phases are aligned with the months that they label.

April 2014

Work in April focused on three main topics:

  1. Adding missing functions for processing Wikidata.org exports
  2. Simplifying the code required for using the library
  3. Adding support for JSON serialization

The main addition in the first area was full support for MediaWiki site links. These play an important role in Wikidata, since every item can be linked to pages on Wikipedias in many languages. To interpret these links, it is necessary to read the SQL export of the MediaWiki sites table and integrate this information into the code. This is now functional: users of the toolkit can access all relevant site information from Java and resolve site links appropriately. In addition, several bugs in export processing were fixed; in particular, dump files were sometimes not downloaded correctly, and property pages could not be processed.
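
A sketch of how resolving site information might look from user code; the class and method names used here (DumpProcessingController, getSitesInformation, Sites.getPageUrl) are taken from later releases of the toolkit and are an assumption, not necessarily the exact API of this release.

    import org.wikidata.wdtk.datamodel.interfaces.Sites;
    import org.wikidata.wdtk.dumpfiles.DumpProcessingController;

    public class SiteLinkExample {
        public static void main(String[] args) throws Exception {
            DumpProcessingController controller = new DumpProcessingController("wikidatawiki");
            // Reads the SQL dump of the MediaWiki sites table into a lookup object:
            Sites sites = controller.getSitesInformation();
            // Resolve a site key and page title to a full URL:
            System.out.println(sites.getPageUrl("enwiki", "Douglas Adams"));
        }
    }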

The second strand of work has led to significant simplifications in the code that users need to write in order to process dumps. A new class has been introduced for this purpose, which saves about 100 lines of boilerplate code for processing Wikidata exports.
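
The update does not name the new class; as an illustration of the kind of simplified entry point meant here, the sketch below uses the DumpProcessingController helper known from later releases (an assumption, not necessarily this release's API), together with the EntityCounter processor from the February/March sketch above.

    import org.wikidata.wdtk.dumpfiles.DumpProcessingController;

    public class ProcessDumpExample {
        public static void main(String[] args) throws Exception {
            DumpProcessingController controller = new DumpProcessingController("wikidatawiki");
            // Register the streaming callback; "true" asks for current revisions only:
            EntityCounter counter = new EntityCounter();
            controller.registerEntityDocumentProcessor(counter, null, true);
            // Download (if necessary) and process the most recent dump in one call:
            controller.processMostRecentMainDump();
            System.out.println("Items seen: " + counter.itemCount);
        }
    }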

Finally, the third strand of work led to a new serialization class that can convert data into the official JSON format used by Wikidata. This format is not the same as the internal dump format (though both are based on JSON). The conversion code will be useful in the future for interacting with the Wikidata Web API, which uses this format to communicate.
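
To make the difference concrete, the sketch below assembles a label in the structure of the official format, using the Jackson library directly; the toolkit's own serialization class is not named in this update, so this illustrates the target format rather than the new class itself. In the official format a label is an object with explicit "language" and "value" keys, whereas the internal dump format of the time used plain strings.

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.databind.node.ObjectNode;

    public class OfficialJsonExample {
        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            // Official format: a label is an object with "language" and "value" keys.
            ObjectNode label = mapper.createObjectNode();
            label.put("language", "en");
            label.put("value", "Douglas Adams");
            ObjectNode labels = mapper.createObjectNode();
            labels.set("en", label);
            ObjectNode entity = mapper.createObjectNode();
            entity.put("id", "Q42");
            entity.put("type", "item");
            entity.set("labels", labels);
            // Prints: {"id":"Q42","type":"item","labels":{"en":{"language":"en","value":"Douglas Adams"}}}
            System.out.println(mapper.writeValueAsString(entity));
        }
    }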

May 2014

The focus of the work in May was to design and implement an export of Wikidata using the W3C Resource Description Framework (RDF). RDF was conceived as a data format for the Web and is used by a significant community of practitioners and researchers to represent the data they work with. Wikidata already has some basic RDF exports through a Web API, but most information is not available in RDF at all.

The outcome of this work is a significant amount of code (more than 5,000 lines) that can create a variety of custom RDF exports from Wikidata dumps. The resulting exports are published on a dedicated Wikimedia Labs project page: http://tools.wmflabs.org/wikidata-exports/rdf/. An example is the page for the exports of May 2014. This work and the underlying design have been documented in a dedicated report, which has also been submitted to a research conference. If accepted, this will further improve the visibility of Wikidata in the academic community.
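
As a rough illustration of what such an export contains, the label of an item can be written as a single RDF triple in N-Triples syntax. The sketch below writes such a line by hand; the actual export code additionally covers statements, site links, and OWL-related declarations.

    import java.io.PrintWriter;

    public class NTriplesSketch {
        public static void main(String[] args) {
            PrintWriter out = new PrintWriter(System.out);
            String entity = "http://www.wikidata.org/entity/Q42";
            String rdfsLabel = "http://www.w3.org/2000/01/rdf-schema#label";
            // One N-Triples line: <subject> <predicate> "literal"@language .
            out.println("<" + entity + "> <" + rdfsLabel + "> \"Douglas Adams\"@en .");
            out.flush();
        }
    }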

Several other activities in May were related to community engagement more than to actual development. Markus gave a tutorial on Wikidata Toolkit at the Wikimedia Hackathon in Zurich and a keynote about Wikidata at the 9th Semantic MediaWiki Conference in Montreal (SMWCon Spring 2014). These activities are meant to help community members make use of Wikidata Toolkit in their projects; a first example is Max Klein's recent analysis of gender ratios in Wikipedias.

June 2014

A second release, Wikidata Toolkit 0.2.0, was published in June. It provides the new RDF serialization code and several other improvements.

The main development work this month was the start of an implementation for interpreting the "external" JSON format of Wikidata. All earlier work was based on the "internal" JSON format used in the data exports, while the "external" format, used only by the Web API, had reduced priority. Since the Wikidata team announced that all exports would switch to the external format, this work became a priority.

Otherwise the team was travelling a lot in June (conferences), so only about half of the month was used for development.

The paper about the Wikidata RDF exports has been accepted at the 13th International Semantic Web Conference (ISWC 2014), where it will be presented in October 2014, and a final version was submitted. The paper (see link above) was updated accordingly.

July 2014

The two main activities in this month were the implementation of the new JSON format and the development of a new binary persistency format:

The new JSON format is implemented using a new object-model-based parsing library (Jackson) that promises significant performance gains. However, trying to support the format completely also brought up several issues. Indeed, it is not clear whether any other application currently parses all data from this format, so it is no surprise that some issues turn up. Nevertheless, there was good progress on parsing the JSON provided by the Web API of Wikidata.
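
To give an idea of what parsing the external format involves, the sketch below reads a minimal entity record with Jackson's tree model. The JSON fragment follows the external format's structure; the toolkit's actual implementation maps such documents onto its Java data model classes rather than using the tree model.

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class ParseExternalJson {
        public static void main(String[] args) throws Exception {
            // A minimal entity record in the external format of the Web API:
            String json = "{\"id\":\"Q42\",\"type\":\"item\","
                    + "\"labels\":{\"en\":{\"language\":\"en\",\"value\":\"Douglas Adams\"}}}";
            JsonNode entity = new ObjectMapper().readTree(json);
            String id = entity.get("id").asText();
            String enLabel = entity.get("labels").get("en").get("value").asText();
            System.out.println(id + ": " + enLabel); // prints "Q42: Douglas Adams"
        }
    }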

The new binary persistency format is meant to encode Wikidata content in a file format that supports fast random access and iteration without requiring a full-fledged database management system. An existing Java database engine (MapDB) is used for managing data on a lower level; Wikidata Toolkit then provides the code to serialize and deserialize data in binary form. By the end of July/beginning of August, the binary format was completed and could be used to store all of Wikidata (including all labels) in less than 5 GB (uncompressed; 2 GB gzip-compressed) while allowing fast random access. In comparison, the compressed data exports are already over 2 GB in size (and on the order of 10 GB when decompressed) without allowing fast access of any kind. Preliminary experiments with Neo4j as a graph database solution required more than 18 GB of space when storing only the English labels and no other text data. Overall, this progress is very promising; however, the code is not ready for release yet.
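
A minimal sketch of the storage idea, using the MapDB 1.x API that was current at the time: serialized entity records are kept in a file-backed map keyed by entity id, and MapDB takes care of paging and lookup. The toolkit's own binary serialization sits on top of such a map; the database file and map names here are made up for the example.

    import java.io.File;
    import java.util.Map;

    import org.mapdb.DB;
    import org.mapdb.DBMaker;

    public class MapDbSketch {
        public static void main(String[] args) {
            // Open (or create) a file-backed MapDB database:
            DB db = DBMaker.newFileDB(new File("wikidata.db"))
                    .closeOnJvmShutdown()
                    .make();
            // Map from entity id to binary-serialized entity data:
            Map<String, byte[]> entities = db.getTreeMap("entities");
            entities.put("Q42", new byte[] {}); // would hold serialized document bytes
            db.commit(); // persist changes to disk
            byte[] data = entities.get("Q42"); // fast random access by key
            System.out.println("Stored " + data.length + " bytes for Q42");
            db.close();
        }
    }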

August 2014

In August 2014, the project was presented at Wikimania, where Markus gave several Wikidata-related talks.

Three members of the project team attended Wikimania (Julian, Markus, Michael). Overall, Wikimania has been a huge success for Wikidata, as witnessed by the increased activity on the mailing lists.

In the second half of August, Markus presented Wikidata and Wikidata Toolkit at the Web Intelligence Summer School in Saint-Etienne, France. This involved a keynote talk and a hands-on session with Wikidata Toolkit. Materials can be found online. Students formed project teams, and several teams chose to use Wikidata, including the winning team, which built a nice demo application.

On the technical side, work continued on the new JSON support. The code is almost fully functional, but some to-dos remain. The regular dumps started to switch to the new format in August, but it turned out that this format differs from the one generated by the Web API (not intentionally in all cases; bug reports are being discussed with the project team in Berlin). Preliminary performance figures indicate that the code can parse and process the whole Wikidata dump in about 15 minutes, a huge improvement over the more than 90 minutes required previously.

Extension request

New end date

15 October 2014

Rationale

There was a lot of unplanned extra work in the project, since Wikidata changed the file format of its data exports in August/September (this had been announced, but not early enough to affect plans for the first half of the IEG project). This is a major change that required a complete rewrite of our parsing code in order to process any data at all, so it had to be prioritized over other tasks. In fact, several further changes to the export format are expected in the near future. The progress of our implementation effort can be seen online: https://github.com/Wikidata/Wikidata-Toolkit/pull/91

This work is almost done now, and a new release will follow soon, but the change has delayed our project plans by at least three weeks. We would therefore like to defer the submission of the final report (currently scheduled for 15 Sept) by one month, provided this does not cause problems on the WMF side. This will have no impact on the budget of the project.

--Markus Krötzsch (talk) 16:00, 9 September 2014 (UTC)

Hi Markus. I'm happy to approve this request - your final report is now due on October 15th. Good luck getting the new release out! Best wishes, Siko (WMF) (talk) 21:46, 15 September 2014 (UTC)

Extension request

New end date

3 Sept 2015

Rationale

We would like to complete the release of the next version, WDTK 0.5, before the final report; this will require two more weeks. It will be better for the report to be able to refer to a specific release.