Grants:IEG/Wikidata Toolkit/Final

Welcome to this project's final report! This report shares the outcomes, impact and learnings from the Individual Engagement Grantee's 6-month project. In fact, this report was completed well after the end of the initial six-month period, and the project has involved people not funded by the grant from day one. The report therefore also covers outcomes that were not directly funded by the IEG grant, but none of this work would have happened without the initial push that the availability of an IEG grant provided.

Part 1: The Project

Summary

As of February 2016, the Wikidata Toolkit IEG project has:

Methods and activities

The work environment has been set up as detailed in the Midpoint report, including a project homepage, GitHub project, OpenHub (formerly Ohloh) project page, quality assurance via Travis.ci and Coveralls.io, and a Sonatype project for distributing releases.

Development follows the common practices of distributed open source development, with public issue tracking, pull requests, code reviews, and (eventually) code merges into the master branch. The developers based in Dresden meet once a week for about an hour to discuss progress and tasks.

Outcomes and impact

Outcomes

The main outcomes of the project are:

  • Wikidata Toolkit, a software library for accessing Wikidata in Java
  • An RDF encoding for Wikidata that became the basis of the Wikidata query service
  • Several spin-off tools and contributions to the Wikidata community


Wikidata Toolkit v0.6.0

Key facts about Wikidata Toolkit as of version 0.6.0

Note that commit numbers are only a rough indication of effort, since the size of the commits varies greatly.

This Java library is the main technical component developed in the project. Version 0.6.0 is the sixth release of the project. Some basic facts that characterise the effort spent on this activity are shown in the box on the left. These figures show that Wikidata Toolkit is a very active project that puts an emphasis on documentation and testing. The latter are very important aspects for the development, since the code itself must be reliable and maintainable to be useful as a library.

The main features of Wikidata Toolkit v0.6.0 are as follows:

  • Complete implementation of the Wikibase data model in Java, so that all Wikidata content can be faithfully represented in programs
  • Automated downloading, parsing, and processing of Wikidata content dumps found at http://dumps.wikimedia.org/
  • Wikibase API connectivity for reading and writing data to Wikidata.org (useful for building bots)
  • Serializing Wikibase data into RDF and into the JSON format used by Wikibase, with several encoding and filtering options
  • A stand-alone client to perform dump transcoding (used to build the RDF export page)

A simple example of how to use this functionality is given in the file DataExtractionProcessor.java. A few dozen lines are enough to find and extract custom information for hundreds of thousands of items from Wikidata: this program automatically downloads the most recent dumps, decompresses and parses the MediaWiki page dumps, parses the content of every data page, and extracts relevant data (in this case the GND identifiers of all humans), which is written to a CSV file. The user only writes code for the actual custom task (finding humans and their GND identifiers) and can leave all other aspects to the library. A similar workflow can be used to compute complex statistics, as in Max Klein's analysis of gender ratios in Wikipedias, which is also based on Wikidata Toolkit.
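
The division of labour described above, where the library drives the dump and the user supplies only the task-specific logic, can be sketched roughly as follows. This is a self-contained illustration and not the code of DataExtractionProcessor.java: the classes SimpleItem and GndExtractor are hypothetical stand-ins for the library's data model and processor callback, the sample data is invented, and we assume the Wikidata identifiers P31 ("instance of"), Q5 ("human") and P227 ("GND ID") for this task.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for an item: an ID plus property values.
final class SimpleItem {
    final String id;
    final Map<String, List<String>> claims = new HashMap<>();

    SimpleItem(String id) {
        this.id = id;
    }

    // Adds one value for a property and returns this item for chaining.
    SimpleItem with(String property, String value) {
        claims.computeIfAbsent(property, k -> new ArrayList<>()).add(value);
        return this;
    }

    List<String> values(String property) {
        return claims.getOrDefault(property, Collections.<String>emptyList());
    }
}

// The only task-specific code: keep humans (P31 = Q5), emit their GND IDs (P227).
final class GndExtractor {
    private final List<String> csvLines = new ArrayList<>();

    GndExtractor() {
        csvLines.add("item,gnd"); // CSV header
    }

    // Called once per item while the dump is being read.
    void processItem(SimpleItem item) {
        if (!item.values("P31").contains("Q5")) {
            return; // not a human, nothing to extract
        }
        for (String gnd : item.values("P227")) {
            csvLines.add(item.id + "," + gnd);
        }
    }

    List<String> csvLines() {
        return csvLines;
    }

    public static void main(String[] args) {
        // Invented sample data standing in for the parsed dump.
        List<SimpleItem> dump = Arrays.asList(
                new SimpleItem("Q1").with("P31", "Q5").with("P227", "gnd-a"),
                new SimpleItem("Q2").with("P31", "Q515").with("P227", "gnd-b"),
                new SimpleItem("Q3").with("P31", "Q5"));
        GndExtractor extractor = new GndExtractor();
        for (SimpleItem item : dump) {
            extractor.processItem(item);
        }
        for (String line : extractor.csvLines()) {
            System.out.println(line);
        }
    }
}
```

In this sketch the loop in main plays the role of the library: everything that is specific to the task lives in processItem.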

The analysis of offline Wikidata dumps can also be combined with online interaction to create bots. For example, FixIntegerQuantityPrecisionsBot.java finds all numbers on Wikidata that have a precision of "+/- 1", which is likely to be an input error, and replaces these numbers with exact versions ("+/- 0"). This problem occurs frequently since "+/- 1" is assumed as a default in Wikidata when users enter an integer number, which makes little sense for figures such as population numbers. The example bot searches for such problems in offline dumps, checks if they are still found in live data, and corrects them. It has been used to make thousands of corrections on Wikidata.
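
The core check of such a bot can be sketched as follows. This is a minimal illustration, not the code of FixIntegerQuantityPrecisionsBot.java: the classes Quantity and PrecisionFixer are hypothetical, and we assume that a quantity is stored as an amount with a lower and an upper bound, so that "+/- 1" means bounds of amount - 1 and amount + 1. The re-check against live data and the actual write to Wikidata are omitted.

```java
import java.math.BigDecimal;

// Hypothetical, simplified quantity value: an amount with lower and upper bounds.
final class Quantity {
    final BigDecimal amount;
    final BigDecimal lowerBound;
    final BigDecimal upperBound;

    Quantity(BigDecimal amount, BigDecimal lowerBound, BigDecimal upperBound) {
        this.amount = amount;
        this.lowerBound = lowerBound;
        this.upperBound = upperBound;
    }
}

final class PrecisionFixer {
    // True if the amount is an integer and the bounds are exactly amount -/+ 1.
    static boolean hasSuspiciousPrecision(Quantity q) {
        boolean isInteger = q.amount.signum() == 0
                || q.amount.stripTrailingZeros().scale() <= 0;
        return isInteger
                && q.lowerBound.compareTo(q.amount.subtract(BigDecimal.ONE)) == 0
                && q.upperBound.compareTo(q.amount.add(BigDecimal.ONE)) == 0;
    }

    // Returns an exact version ("+/- 0") of a suspicious value, else the value itself.
    static Quantity fix(Quantity q) {
        if (!hasSuspiciousPrecision(q)) {
            return q;
        }
        return new Quantity(q.amount, q.amount, q.amount);
    }

    public static void main(String[] args) {
        // Invented example: a population figure entered as a plain integer.
        Quantity population = new Quantity(
                new BigDecimal("1234"), new BigDecimal("1233"), new BigDecimal("1235"));
        Quantity fixed = fix(population);
        System.out.println(fixed.lowerBound + " .. " + fixed.upperBound);
    }
}
```

Values with non-integer amounts or with other bounds are returned unchanged, since for them a margin may well be intended.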

Wikidata Toolkit replaces the earlier Python implementations wda and its precursor wikidata analysis in almost every way.

RDF-based query answering

The original project plan was to build a new query answering engine from scratch. As an application for the mid-term release of Wikidata Toolkit, we built a tool for creating exports of the content of Wikidata in the Resource Description Framework (RDF) of the W3C. RDF was conceived as a data format for the Web and is used by a significant community of practitioners and researchers to represent the data they work with. We have created and published RDF exports on Wikimedia Labs for more than a year: http://tools.wmflabs.org/wikidata-exports/rdf/. The encoding of Wikidata in RDF is not straightforward, and many design decisions have to be made. We have documented our work and the resulting files in the research paper Introducing Wikidata to the Linked Data Web, presented at the 13th International Semantic Web Conference in 2014 (PDF).
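
To illustrate what such an encoding involves at its simplest, the following sketch writes one statement and one label as N-Triples lines. It is only an illustration under stated assumptions: the class TripleWriter is hypothetical, and the property namespace is the "direct" namespace that we assume from the vocabulary of the present query service, which is not necessarily the layout used in the exports described above. Statements with qualifiers and references need additional nodes and are not covered here; choices of that kind are among the design decisions mentioned.

```java
// Hypothetical helper that writes simple facts as N-Triples lines.
final class TripleWriter {
    static final String ENTITY = "http://www.wikidata.org/entity/";
    // Assumption: "direct" property namespace as used by the current query service.
    static final String DIRECT = "http://www.wikidata.org/prop/direct/";
    static final String RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label";

    // One line linking two entities by a property, ignoring qualifiers and references.
    static String entityTriple(String subjectId, String propertyId, String objectId) {
        return "<" + ENTITY + subjectId + "> <" + DIRECT + propertyId
                + "> <" + ENTITY + objectId + "> .";
    }

    // One line with a language-tagged label.
    static String labelTriple(String subjectId, String label, String language) {
        return "<" + ENTITY + subjectId + "> <" + RDFS_LABEL + "> \""
                + escape(label) + "\"@" + language + " .";
    }

    // Escapes the characters that must not appear raw in an N-Triples literal.
    static String escape(String text) {
        return text.replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
    }

    public static void main(String[] args) {
        System.out.println(entityTriple("Q42", "P31", "Q5"));
        System.out.println(labelTriple("Q42", "Douglas Adams", "en"));
    }
}
```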

Meanwhile, Wikimedia's own plans to build a query service encountered some difficulties, including the unexpected discontinuation of development of the Titan graph database, which had initially been selected as the query engine. In search of an alternative, Wikimedia developers decided to use the graph database BlazeGraph, which is based on RDF. Because of this, the work done in the Wikidata Toolkit project on defining an RDF export became important. Markus Krötzsch engaged in many email exchanges and telephone calls with WMF developers to discuss this layout. While this did not produce code, it eventually contributed to what is now the official Wikidata query service, which has had a huge positive effect on the use and impact of the whole project.

The main contribution of the Wikidata Toolkit project was to help shape the official RDF model underlying this service. The implementation and deployment of this approach on Wikimedia servers is due to the WMF development team of Stas Malyshev.

Spin-off tools

In addition to the actual Wikidata Toolkit library, several spin-off tools have been created by the project team. We give a brief list. This is not a list of third-party tools, but a list of tools created by people closely affiliated with the project team.

  • ViziData is an interactive online map that displays all Wikidata items. The tool is developed by Georg Wild, a student of Markus Krötzsch, and recently won the Wikidata Visualization Challenge 2015 of Wikimedia Sweden. ViziData has been used in edit-a-thons to identify local Wikidata items for editing.
  • Wikidata world maps is a collection of static, but nonetheless beautiful maps that show the location and density of Wikidata entries on Earth, and highlight geographic biases of Wikipedias in several languages. The code that creates these maps is part of the examples that come with Wikidata Toolkit. The maps have been used in presentations and posters about Wikidata.
  • Makrobot is the first Wikidata bot built with Wikidata Toolkit. It is operated by Markus Krötzsch and so far has been fixing integer precisions (as described above) and adding missing labels to certain items, amounting to tens of thousands of edits.
  • Wikidata Class and Property Browser is an application to browse properties and classes used on Wikidata based on their usage statistics (how often are they used? what other properties/classes are used together with them? etc.). It is used by Wikidata editors to inspect current data and is linked from many Wikidata pages. There is ongoing work to create a new, completely overhauled tool that further improves this service.
  • Wikidata Taxonomy Browser is an application developed by Serge Stratan. It enables users to explore the whole of Wikidata's (large and complex) class hierarchy in the browser. For example, one can display and explore the taxonomy of "waffle", clicking nodes to expand their neighbours. This tool, too, is linked from many Wikidata pages and used by editors to inspect the data.

Progress towards stated goals

Please use the below table to:

  1. List each of your original measures of success (your targets) from your project plan.
  2. List the actual outcome that was achieved.
  3. Explain how your outcome compares with the original target. Did you reach your targets? Why or why not?

Planned measure of success (include numeric target, if applicable): Fully functional query web service
Actual result:
  • Query service available at https://query.wikidata.org/
  • Supported by creating initial RDF format and exports
  • Contribution to final design of RDF model through extended discussions with the WMF developers
Explanation: The plan for achieving this goal has changed significantly. Rather than building a new tool from scratch, it was decided to use an RDF database. A lot of effort went into the necessary format conversions (see above). The final outcome surpasses original goals (fully integrated query service rather than external query prototype).

Planned measure of success: Essential toolkit functionality
Actual result:
  • Loading/processing as planned
  • Indexing/querying obsoleted by use of RDF database; additional RDF export functionality that was not part of the initial plan
  • Additional Web API read/write access that had not been planned
Explanation: Changed focus to support large-scale dump processing and bot functionality; database query features dropped as they are now provided by the RDF database; new bot-building features added to support further application types. Original targets partially reached (changed goals to fit actual need).

Planned measure of success: Performance goals
  • Load performance significantly above existing Wikidata Analytics scripts
  • Excellent RDF export performance
  • Sustainable memory usage, based on current main memory sizes and Wikidata growth
Actual result:
  • Parsing all data from a full dump can be achieved in 15min on a laptop; very good pay-as-you-go behaviour (the more data read from each entry, the more time it takes); much faster than previous scripts (which took hours for one pass)
  • Performance goals for query answering not applicable due to changed plan
  • Performance goals for RDF export were added: these files are generated regularly on Labs and do not put undue stress on those servers
  • Memory usage is very low since basic processing is done in a streaming fashion (less than 500MB RAM suffice)
Explanation: Original goals reached, as pertaining to implemented functionality.

Planned measure of success: Community building goals
Actual result:
  • Involved 3 student contributors, and 2 more students working on spin-off applications so far (original goal was 1 student contributor)
  • Popular on GitHub (in terms of forks/follows/stars)
  • Large development team compared to other OSS projects (10 contributors so far)
  • Several spin-off applications and third-party uses
Explanation: See "Impact" below for details and "Spin-off tools" above. Final outcome surpasses original goals.
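
The low memory usage reported for the performance goals above follows from processing entries one at a time. A rough sketch of this idea, assuming a JSON dump that is one large array with one entity per line (an assumption about the dump layout that is not stated in this report), is given below. The class DumpLineStreamer is hypothetical; real processing would also decompress the file and parse each line into a data object.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

// Hypothetical helper: hands entities to a handler one by one, retaining none.
final class DumpLineStreamer {
    // Returns the number of entity lines passed to the handler.
    static long stream(Reader source, Consumer<String> handler) {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.equals("[") || line.equals("]")) {
                    continue; // array brackets and blank lines carry no entity
                }
                if (line.endsWith(",")) {
                    line = line.substring(0, line.length() - 1); // drop separator
                }
                handler.accept(line);
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // Invented two-entity dump used in place of a real file.
        String dump = "[\n{\"id\":\"Q1\"},\n{\"id\":\"Q2\"}\n]\n";
        long n = stream(new StringReader(dump), System.out::println);
        System.out.println(n + " entities processed");
    }
}
```

Since only the current line is held in memory, the memory needed does not grow with the size of the dump.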


Think back to your overall project goals. Do you feel you achieved your goals? Why or why not?

The overall goals of the project have been achieved. The project created a sustainable software project that is used in a variety of applications, and it has helped to get the official Wikidata query service into production.

Global Metrics

We are trying to understand the overall outcomes of the work being funded across all grantees. In addition to the measures of success for your specific program (in above section), please use the table below to let us know how your project contributed to the "Global Metrics." We know that not all projects will have results for each type of metric, so feel free to put "0" as often as necessary.

  1. Next to each metric, list the actual numerical outcome achieved through this project.
  2. Where necessary, explain the context behind your outcome. For example, if you were funded for a research project which resulted in 0 new images, your explanation might be "This project focused solely on participation and articles written/improved, the goal was not to collect images."

For more information and a sample, see Global Metrics.

1. Number of active editors involved. Achieved outcome: 0. Explanation: The project is not related to editing.
2. Number of new editors. Achieved outcome: 0. Explanation: The project is not related to editing.
3. Number of individuals involved. Achieved outcome: 10. Explanation: Number of direct software contributors/developers (is this meant by "involved" here?).
4. Number of new images/media added to Wikimedia articles/pages. Achieved outcome: 0. Explanation: The project is not related to editing.
5. Number of articles added or improved on Wikimedia projects. Achieved outcome: 26,000. Explanation: Recent activity of Makrobot, which is based on Wikidata Toolkit; the number of articles improved by people using other tools, e.g., at edit-a-thons, is unknown.
6. Absolute value of bytes added to or deleted from Wikimedia projects. Achieved outcome: unknown. Explanation: The edits of Makrobot have been creating labels in one language, which leads to a few bytes per edit, but exact numbers are unknown; other activities of Makrobot have corrected errors, which keeps the total number of bytes roughly the same.


Learning question
Did your work increase the motivation of contributors, and how do you know?

It did in some cases, but in general this is hard to track. Especially the spin-off tools have been used directly by editors:

  • Cristina Sarasua wrote "Wow, this is very useful. We will definitely use it." when pointed to ViziData as a possible tool for supporting the Donostia-San Sebastián Wikidata Editathon (an event aimed at motivating editors).
  • The visualisations of world maps usually have a motivating effect in presentations. Markus Krötzsch has used them in talks and posters, usually leading to many interested questions.
  • Users on Wikidata.org have linked to the Classes and Properties browser and the Wikidata Taxonomy Browser on talk pages and in discussions, which suggests that these tools were useful to them.
  • Joachim Neubert wrote "thank you very much, your code will be extremely helpful for solving my current need" when pointed to the DataExtractionProcessor (explained above) as a way to extract large datasets.

Indicators of impact

Do you see any indication that your project has had impact towards Wikimedia's strategic priorities? We've provided 3 options below for the strategic priorities that IEG projects are mostly likely to impact. Select one or more that you think are relevant and share any measures of success you have that point to this impact. You might also consider any other kinds of impact you had not anticipated when you planned this project.

Priority 1: Encourage Innovation

A primary goal of this project was to encourage innovation. The project has contributed to this in several ways.

  • The work on RDF translation has been highly innovative as such, and it has encouraged the development of the query service we have now. This is a highly innovative use of our data.
  • Wikidata Toolkit has enabled ViziData, a highly innovative spin-off project. This has been recognized by Wikimedia Sweden in making it the winner of the Wikidata Visualization Challenge.
  • Wikidata Toolkit is used in the IEG project WIGI: Wikipedia Gender Index for preparing data for further processing.
  • Wikidata Toolkit has been mentioned directly in several research papers by people who are not affiliated with the project team (e.g., [1] and [2]). The paper on the Wikidata RDF export has been cited in 16 other scientific works since its publication in late 2014; of these publications, 13 do not have any connection with the IEG project team [3]. This shows that the project is particularly useful for enabling researchers to work on Wikidata content.
  • Another 66 papers cite "Wikidata: A Free Collaborative Knowledge Base" (by Vrandečić and Krötzsch), which also appeared in late 2014 [4]. This shows that researchers in general are becoming more aware of this new data source. The publishing and dissemination activity in this IEG project has contributed to this.

Priority 2: Improve Quality

Most uses of Wikidata Toolkit either directly improve quality (e.g., the activity of Makrobot), or lead to insights that help human editors to do so. For example, ViziData has been used to visualize the birthplaces of all humans born around year 1 CE. This led to several errors and cases of vandalism being detected (roughly, all people born in America or Australia during this time are errors, since we do not have historic records on this). The Wikidata Taxonomy Browser is also specifically focussed on helping editors to detect errors and improve quality. Also the IEG project WIGI that analyses gender inequality contributes to this goal, since it informs editors about more fundamental quality issues that are not visible when considering only individual data points (articles/items).

On another level, the development of Wikidata Toolkit has revealed several problems in the Wikibase software that is used to create exports. In particular, recurring issues like the wrong encoding of empty maps in many data dumps have been discovered and pointed out. Another example is the recent discussion of the real number of items in Wikidata: Wikidata Toolkit counts many more items in the dumps than the official Wikidata.org website reports. In general, a second, independent implementation that processes Wikidata can help to detect problems like this, which may not be noticed otherwise. This improves quality on the software level.

Priority 3: Increase Reach

Outreach activities were not a central part of this project, but some positive effects have still been achieved. Wikidata in general is language-neutral and thus has the potential to reach global audiences even in language communities with only few active Wikipedia editors, and Wikidata Toolkit contributes to this approach. Wikidata Toolkit has enabled users to take advantage of Wikidata content ("to read" would be too narrow a term in this project, obviously) in new ways, e.g., see the comment of Joachim Neubert above. The presentation of Wikidata Toolkit and related outputs (e.g., world maps) at several events (e.g., during invited talks given by Markus Krötzsch) has also made more people aware of the data and its potential uses.

Priority 4: Stabilize Infrastructure

One goal of this priority is "to enable developers to create applications that work easily with MediaWiki" and Wikidata Toolkit has directly contributed to this. In particular, most functionality of Wikidata Toolkit is not in fact specific to Wikidata, but can be used with any project using the Wikibase extension of MediaWiki to manage data. Our work on RDF export and query service has also made direct contributions to some of the most vital new parts of technical infrastructure around Wikidata.

Priority 5: Increase Participation

Option A: How did you increase participation in one or more Wikimedia projects?

Spin-off tools like ViziData have been useful to show Wikidata to newcomers, and to help current editors understand the current content and find things that need to be done. The use of such spin-off tools in edit-a-thons (see Cristina Sarasua's comment above) has also contributed directly to increasing participation.

Option B: How did you improve quality on one or more Wikimedia projects?

Work on Wikidata Toolkit has revealed bugs in Wikimedia software as well as issues in the content. Both have been described above already, so we do not repeat the details here.

Option C: How did you increase the reach (readership) of one or more Wikimedia projects?

See "Increase Reach" above.

Project resources

Please provide links to all public, online documents and other artifacts that you created during the course of this project. Examples include: meeting notes, participant lists, photos or graphics uploaded to Wikimedia Commons, template messages sent to participants, wiki pages, social media (Facebook groups, Twitter accounts), datasets, surveys, questionnaires, code repositories... If possible, include a brief summary with each link.

Learning

The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you took enough risks in your project to have learned something really interesting! Think about what recommendations you have for others who may follow in your footsteps, and use the below sections to describe what worked and what didn’t.

What worked well

What did you try that was successful and you'd recommend others do? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your finding in the form of a link to a learning pattern.

What didn’t work

What did you try that you learned didn't work? What would you think about doing differently in the future? Please list these as short bullet points.

  • Reporting. The reporting overhead is very large for a project this size. I am running far larger projects based on other research grants that require less than one report per year. Moreover, the reports asked for are rather complex, with a lot of reading required to find out what is expected. Several things that need to be reported are either completely unrelated to the project (e.g., number of images uploaded), or extremely hard to find out (e.g., number of bytes added to Wikimedia projects). Reporting for a small-scale funding scheme like IEG should be more lightweight and less like filling out a tax declaration.
  • The other main challenge for the project was co-evolving with a young and partly unstable Wikimedia project like Wikidata. We had to do a lot of unplanned extra work to stay compatible in the light of continued changes and recurring bugs, which are unavoidable for a new project like Wikidata. Up to the present day, we are receiving announcements of important technical changes only a week in advance, leaving us a few days (usually our weekends) to prepare a new release to stay compatible. It is difficult to maintain a professional relationship with Wikimedia. The continued working of our tool depends on our willingness and ability to monitor general purpose mailing lists for relevant technical information, which is disseminated at random and often very late, and our further commitment to react to those changes on a short timeline.

Other recommendations

If you have additional recommendations or reflections that don’t fit into the above sections, please list them here.

None.

Next steps and opportunities

Are there opportunities for future growth of this project, or new areas you have uncovered in the course of this grant that could be fruitful for more exploration (either by yourself, or others)? What ideas or suggestions do you have for future projects based on the work you’ve completed? Please list these as short bullet points.

The project is under continued development, but a major goal now is not growth but maintenance, since we must react to all API and data format changes. Some new features are also considered:

  • Live update monitoring. We already have preliminary support for scanning the recent changes API, but more infrastructure for keeping up-to-date with changes is necessary.
  • SPARQL query integration. We could support SPARQL-based queries as an addition to our API. The main task there is to translate the SPARQL RDF model back into Java objects to simplify programming.
  • Further bot programming features. Editing Wikidata is very complex, since one can, e.g., create duplicate or otherwise redundant data in many ways. Bots need to be very careful or have builtin intelligence to avoid errors. More functions could be added to support this.
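
As a rough idea of what the SPARQL integration mentioned in the list above would have to do, the following sketch builds a request URL for a query and maps an entity URI from a result back to a Wikidata ID. All names are hypothetical, and the sketch rests on assumptions that are not part of this report: that the service accepts queries at https://query.wikidata.org/sparql with query and format parameters, and that entities are identified by URIs starting with http://www.wikidata.org/entity/. Sending the request and parsing the results are left out.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for talking to a SPARQL endpoint.
final class SparqlSketch {
    // Assumed endpoint and entity namespace (not taken from this report).
    static final String ENDPOINT = "https://query.wikidata.org/sparql";
    static final String ENTITY_PREFIX = "http://www.wikidata.org/entity/";

    // Builds a GET URL that asks for JSON results of the given query.
    static String requestUrl(String sparql) {
        try {
            return ENDPOINT + "?format=json&query=" + URLEncoder.encode(sparql, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // Returns the entity ID for an entity URI, or null for any other URI.
    static String entityIdFromUri(String uri) {
        if (uri.startsWith(ENTITY_PREFIX) && uri.length() > ENTITY_PREFIX.length()) {
            return uri.substring(ENTITY_PREFIX.length());
        }
        return null;
    }

    public static void main(String[] args) {
        String query = "SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 3";
        System.out.println(requestUrl(query));
        System.out.println(entityIdFromUri("http://www.wikidata.org/entity/Q42"));
    }
}
```

The IDs recovered in this way could then be used to fetch full data objects, which is the step that would simplify programming against query results.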

Part 2: The Grant

Finances

Actual spending

Please copy and paste the completed table from your project finances page. Check that you’ve listed the actual expenditures compared with what was originally planned. If there are differences between the planned and actual use of funds, please use the column provided to explain them.

TODO (pending EUR to USD conversion)

Remaining funds

Do you have any unspent funds from the grant?

No.

Documentation

Did you send documentation of all expenses paid with grant funds to grantsadmin(_AT_)wikimedia.org, according to the guidelines here?

Yes.

Confirmation of project status

Did you comply with the requirements specified by WMF in the grant agreement?

Please answer yes or no.

  • Yes.

Is your project completed?

  • Yes.

Grantee reflection

We’d love to hear any thoughts you have on what this project has meant to you, or how the experience of being an IEGrantee has gone overall. Is there something that surprised you, or that you particularly enjoyed, or that you’ll do differently going forward as a result of the IEG experience? Please share it here!

The above comment on reporting would fit here. On the other hand, WMF staff have been very liberal in practice regarding timeline extensions, and direct support in case of questions has been very good.