Grants:Project/MFFUK/Wikidata & ETL/Final
Welcome to this project's final report! This report shares the outcomes, impact and learnings from the grantee's project.
- 1 Part 1: The Project
- 1.1 Summary
- 1.2 Project Goals
- 1.3 Project Impact
- 1.4 Methods and activities
- 1.5 Project resources
- 1.6 Learning
- 1.7 Next steps and opportunities
- 2 Part 2: The Grant
- 3 Grantee reflection
Part 1: The Project
In this project, we created a solution for bulk-loading data to Wikidata and other Wikibases in a repeatable, reusable and maintainable fashion. It allows volunteers to do this without writing code. They only need knowledge of their source dataset, RDF, the Wikidata RDF dump format and SPARQL, which they need to know anyway to query Wikidata. We also created a tutorial illustrating the process in LinkedPipes ETL, an open-source extract, transform, load (ETL) tool used to publish and consume data in the Linked Data environment via data transformation processes called pipelines. The tutorial covers the individual required steps and is accompanied by directly importable and reusable pipeline fragments. There are also 5 proof-of-concept pipelines, which are currently used to load data to Wikidata and can serve as inspiration for creating more.
Goal 1: Proof-of-concept use cases
The first goal of the project is to demonstrate the repeatable Wikidata data ingestion process on several proof-of-concept use cases for different types of data, possibly improving the LinkedPipes ETL tool in the process.
To demonstrate the approach, we first had to analyse the current status of the Wikibase API, choose a suitable library (Wikidata Toolkit) and develop the Wikibase loader component for LinkedPipes ETL. During the development, we also contributed to Wikidata Toolkit itself. The approach was then demonstrated on several proof-of-concept pipelines listed in the tutorial, which was created in Goal 2.
Goal 2: Methodology
The second goal of the project is to create a methodology (guide, tutorial) describing how volunteers can contribute to Wikidata content using this tool in a systematic and repeatable way, illustrated by the proof-of-concept transformations.
The tutorial, which also describes the methodology, was created based on one of the proof-of-concept pipelines. It was seen by multiple volunteers interested in the approach, who responded positively. One of the volunteers then also authored one of the proof-of-concept pipelines, demonstrating that the tutorial can actually be used to create additional data loading pipelines.
Important: The Wikimedia Foundation is no longer collecting Global Metrics for Project Grants. We are currently updating our pages to remove legacy references, but please ignore any that you encounter until we finish.
- In the first column of the table below, please copy and paste the measures you selected to help you evaluate your project's success (see the Project Impact section of your proposal). Please use one row for each measure. If you set a numeric target for the measure, please include the number.
- In the second column, describe your project's actual results. If you set a numeric target for the measure, please report numerically in this column. Otherwise, write a brief sentence summarizing your output or outcome for this measure.
- In the third column, you have the option to provide further explanation as needed. You may also add additional explanation below this table.
|Planned measure of success (include numeric target, if applicable)||Actual result||Explanation|
|1.1.1 We will identify 3 different types of data to be ingested into Wikidata.||4||Remarkable trees, language mappings, Czech theatres, Czech streets|
|1.1.2 For each type, we will identify at least 1 data source (at least 5 in total), for which a repeatable data transformation pipeline will be created, documented and published.||5||Pipelines listed in tutorial|
|1.1.3 The community will be consulted for feedback and at least 2 new volunteers will be working with the tool.||2||Wikidata:User:Martin_Nečaský, Wikidata:User:Frettie|
|2.1 We will produce a set of documents describing how to use LinkedPipes ETL to enrich Wikidata with transformations of additional data sources||A tutorial and component documentation were created.|
|2.2 Once the project is over, volunteers will be able to follow the tutorial to add more and more content to Wikidata in a systematic way - using LinkedPipes ETL pipelines.||The tutorial was validated by our volunteers who were able to create pipelines.|
Looking back over your whole project, what did you achieve? Tell us the story of your achievements, your results, your outcomes. Focus on inspiring moments, tough challenges, interesting anecdotes or anything that highlights the outcomes of your project. Imagine that you are sharing with a friend about the achievements that matter most to you in your project.
- This should not be a list of what you did. You will be asked to provide that later in the Methods and Activities section.
- Consider your original goals as you write your project's story, but don't let them limit you. Your project may have important outcomes you weren't expecting. Please focus on the impact that you believe matters most.
The initial motivation for this project came from the fact that there were no guidelines or examples showing how to load data to Wikidata in bulk, from external data sources, in a more or less uniform way. At the same time, we were working on a similar task, publishing data from various sources to the Linked Open Data (LOD) cloud using our tool, LinkedPipes ETL. The LOD cloud, similarly to the Wikidata Query Service, uses RDF as a data model and SPARQL as a query language. The tool addressed the same issues that were present with Wikidata: its data transformation processes (pipelines) use a library of reusable components, which helps with their uniformity and therefore their maintainability.
It was only during the work on this project that we realized one important side effect of our approach. While every user who has ever queried Wikidata already knows RDF, the Wikidata RDF dump format and SPARQL, those users were not able to use this knowledge to contribute data to Wikidata. This was because data can only enter Wikidata via the Wikibase API, which is very focused on the manual work of volunteers inserting data through web forms and is quite unsuitable for bulk loading. Moreover, this API is JSON-based, which means that its users need to learn another serialization of the data to be loaded to Wikidata, with the available libraries only helping with handling the API itself. Users were still required to write code to create a bot loading data in bulk. Using our approach, users can prepare the data to be added to Wikidata with only the knowledge they need anyway to query Wikidata: RDF, SPARQL and the Wikidata RDF dump format. There is no need to write code, and sample pipelines and a tutorial get them started. Users of our tool can also utilize its library of reusable components to prepare the data before loading it into Wikidata or another Wikibase instance.
In addition, once the data transformation pipelines are created, they can be scheduled to run repeatedly, keeping Wikidata up to date with the source dataset. The transformations, or their parts, can also be shared and reused, as they are in fact JSON-LD (RDF) files.
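To give a rough idea of what data expressed in the Wikidata RDF dump format looks like, the following minimal Python sketch emits the two triples that represent one item-valued statement. It is only an illustration and not part of the Wikibase loader component. The function name and the `statement_key` parameter are hypothetical, and this report does not specify exactly which subset of the dump format the loader accepts as input.

```python
# Namespaces used by the Wikidata RDF dump format.
WD = "http://www.wikidata.org/entity/"
WDS = "http://www.wikidata.org/entity/statement/"
P = "http://www.wikidata.org/prop/"
PS = "http://www.wikidata.org/prop/statement/"


def statement_triples(item_id, property_id, value_item_id, statement_key):
    """Return N-Triples lines for one item-valued statement.

    The item points to a statement node via the p: namespace, and the
    statement node points to its value via the ps: namespace.
    statement_key is a hypothetical local identifier for the statement node.
    """
    item = f"<{WD}{item_id}>"
    statement = f"<{WDS}{item_id}-{statement_key}>"
    value = f"<{WD}{value_item_id}>"
    return [
        f"{item} <{P}{property_id}> {statement} .",
        f"{statement} <{PS}{property_id}> {value} .",
    ]


if __name__ == "__main__":
    # Q42 (Douglas Adams), P31 (instance of), Q5 (human)
    for line in statement_triples("Q42", "P31", "Q5", "example"):
        print(line)
```

In a pipeline, triples of this shape would typically be produced by a SPARQL CONSTRUCT query over the source data rather than by hand-written code.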
At the Wikimedia Hackathon 2019 in Prague, where we discussed our idea with the community, we received positive feedback. When we presented our prototype at our Wikimania 2019 workshop and poster, we were further encouraged by the positive feedback and interest in the approach.
In the end, we achieved our goals: 5 pipelines, a tutorial, and 2 new volunteers working with the tool, who were satisfied with what it provides.
If you used surveys to evaluate the success of your project, please provide a link(s) in this section, then briefly summarize your survey results in your own words. Include three interesting outputs or outcomes that the survey revealed.
- We did not use surveys.
Is there another way you would prefer to communicate the actual results of your project, as you understand them? You can do that here!
Methods and activities
Please provide a list of the main methods and activities through which you completed your project.
- We deployed a test Wikibase instance on a virtual machine
- We analyzed the flow of data through a Wikidata instance, starting with the Wikibase API, ending with querying the Wikidata Query Service
- We tried logging into the Wikibase API, creating a first item in our Wikibase using existing LinkedPipes ETL components
- We selected Wikidata Toolkit as an existing library for handling the Wikibase API that could be reused to create a Wikibase loader component for LinkedPipes ETL, and we implemented a prototype
- We organized a Wikimania 2019 workshop, showcasing the prototype by playing with the Wikidata sandbox item and our Wikibase instance
- We created and presented a Wikimania 2019 poster about the project and collected positive feedback
- We then developed the Wikibase loader component iteratively, together with the proof-of-concept pipelines, ironing out bugs and other issues and contributing to the Wikidata Toolkit library in the process
- The tutorial was written based on one of the proof-of-concept pipelines
- The Wikidata GLAM Facebook group was notified about the tutorial
- Volunteers are using the tool; one of them authored one of the proof-of-concept pipelines, which loads data about Czech theatres from the Theatre institute
Please provide links to all public, online documents and other artifacts that you created during the course of this project. Even if you have linked to them elsewhere in this report, this section serves as a centralized archive for everything you created during your project. Examples include: meeting notes, participant lists, photos or graphics uploaded to Wikimedia Commons, template messages sent to participants, wiki pages, social media (Facebook groups, Twitter accounts), datasets, surveys, questionnaires, code repositories... If possible, include a brief summary with each link.
- Project page at Wikidata
- Document containing the analysis of the problem and proposed approach
- The tutorial describes the approach to bulk loading data into Wikidata on a sample pipeline, with links to other sample pipelines
- The developed Wikibase loader component in LinkedPipes ETL GitHub
- Documentation of the developed Wikibase loader component in LinkedPipes ETL
- Pull request to Wikidata Toolkit providing compatibility with newer Wikibase versions
- Poster for Wikimania 2019
- Wikimania 2019 Workshop
- Bot task request for remarkable trees and groves
- Bot task request for Czech streets
- Bot task request for mapping Wikidata languages to EU vocabularies Language controlled vocabulary
- Bot task request of one of our volunteers for loading data about Czech Theatres
- Wikidata GLAM Facebook group notification about our approach
The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you took enough risks in your project to have learned something really interesting! Think about what recommendations you have for others who may follow in your footsteps, and use the below sections to describe what worked and what didn’t.
What worked well
What did you try that was successful and you'd recommend others do? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your finding in the form of a link to a learning pattern.
What didn’t work
What did you try that you learned didn't work? What would you think about doing differently in the future? Please list these as short bullet points.
- In the project proposal, we set the number of volunteers as a measure of project success. This turned out to be problematic, as it was not completely under our control. In our case especially, the volunteers needed to be people with some technical knowledge (RDF, SPARQL, Wikidata dump format) and a non-trivial amount of free time, as their task was not trivial (installing a new tool, getting to know the process, producing actual data). Nowadays, such skilled people tend to be very busy, and it is therefore challenging to get them working with the tools developed. Next time, I would either replace this project goal with another one that does not depend on external volunteers, or plan considerably more time for their involvement in the project plan.
The biggest issues we had with developing our approach were with testing. Here are some insights:
- Since our approach is based on both the Wikibase API and the Query Service, we lacked a testing environment for Wikidata that would actually behave like production Wikidata. The https://test.wikidata.org/ Wikibase instance does not include the Query Service and is therefore not usable for testing our approach. Our own Wikibase instance, on the other hand, did not have the various constraints and rules present in Wikidata, so some of the error handling functionality could only be tested in production.
- Even the Wikidata sandbox item has some of the rules and constraints that normally apply to Wikidata Items disabled, so testing with it is also not sufficient.
- The fact that each Wikibase instance has a different set of properties (i.e. P numbers) does not help with testing either, as the data loaded into each Wikibase must be tailored to that particular instance. It would help a lot if one could easily create an exact clone of the Wikidata instance, including all the constraints, to test against. This could also have real use cases: I can imagine someone wanting to run their own Wikibase with the properties defined in Wikidata, but with their own data.
- Some Wikidata Items are broken in a way that prevents them from being changed via Wikidata Toolkit. They can (and must) be corrected manually. One example is, again, the Wikidata sandbox item. One can reset it (as is recommended on the talk page), but that state is, unfortunately, broken, because it contains a sitelink to Wikivoyage that leads to an empty page. The API views this as an error, preventing the Item from being saved. Since the toolkit edits an item by replacing it with a modified copy, the edit is rejected. This is the case for many Wikidata Items.
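To illustrate the issue with differing P numbers, here is a minimal Python sketch of remapping property identifiers before loading the same data into a different Wikibase instance. The function and the mapping are hypothetical and not part of LinkedPipes ETL. Item identifiers (Q numbers) would need the same treatment.

```python
def remap_properties(statements, property_map):
    """Rewrite property IDs for a target Wikibase instance.

    statements: list of (item_id, property_id, value) tuples.
    property_map: dict from source property ID to target property ID.
    Raises KeyError when a property has no counterpart in the target.
    """
    remapped = []
    for item_id, property_id, value in statements:
        if property_id not in property_map:
            raise KeyError(f"No mapping for property {property_id}")
        remapped.append((item_id, property_map[property_id], value))
    return remapped


if __name__ == "__main__":
    # Hypothetical mapping: P31 in Wikidata corresponds to P1 in a test Wikibase.
    wikidata_to_test = {"P31": "P1"}
    print(remap_properties([("Q42", "P31", "Q5")], wikidata_to_test))
```

Failing loudly on an unmapped property is a deliberate choice in this sketch, since silently loading a statement under the wrong property would be harder to detect afterwards.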
If you have additional recommendations or reflections that don’t fit into the above sections, please list them here.
Next steps and opportunities
Are there opportunities for future growth of this project, or new areas you have uncovered in the course of this grant that could be fruitful for more exploration (either by yourself, or others)? What ideas or suggestions do you have for future projects based on the work you’ve completed? Please list these as short bullet points.
- Not all features of the Wikidata Dump Format were implemented during the project. These could be further added, and include Ranks, Lexemes, Sitelinks and Redirects.
- The inverse mapping, i.e. from the Wikidata Dump Format to the JSON-based Wikibase API used by Wikidata Toolkit, could be externalized from the Wikibase loader LinkedPipes ETL component and offered as an alternative, RESTful Wikibase API. Users of other tools could then also communicate with a Wikibase using the Wikidata Dump Format, which they need to know anyway, as it is used in the Wikidata Query Service. This would allow users to stay in the RDF and SPARQL world even if they did not want to use our component.
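As a rough illustration of what such an inverse mapping involves, the following Python sketch converts a single item-valued statement triple from the dump format into the statement structure used by the Wikibase JSON format. It is not the implementation used in the Wikibase loader component; the function name is hypothetical, and only item values with normal rank are handled.

```python
import json

PS = "http://www.wikidata.org/prop/statement/"
WD = "http://www.wikidata.org/entity/"


def claim_from_statement_triple(predicate_iri, object_iri):
    """Map a ps: triple with an item value to a Wikibase JSON statement."""
    if not predicate_iri.startswith(PS):
        raise ValueError("predicate is not in the ps: namespace")
    if not object_iri.startswith(WD + "Q"):
        raise ValueError("only item values are handled in this sketch")
    property_id = predicate_iri[len(PS):]
    value_id = object_iri[len(WD):]
    return {
        "mainsnak": {
            "snaktype": "value",
            "property": property_id,
            "datavalue": {
                "type": "wikibase-entityid",
                "value": {
                    "entity-type": "item",
                    "numeric-id": int(value_id[1:]),
                    "id": value_id,
                },
            },
        },
        "type": "statement",
        "rank": "normal",
    }


if __name__ == "__main__":
    claim = claim_from_statement_triple(PS + "P31", WD + "Q5")
    print(json.dumps(claim, indent=2))
```

A complete mapping would also have to cover the other value types, qualifiers, references and the features listed above as not yet implemented.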
Part 2: The Grant
Please copy and paste the completed table from your project finances page. Check that you’ve listed the actual expenditures compared with what was originally planned. If there are differences between the planned and actual use of funds, please use the column provided to explain them.
|Expense||Approved amount||Actual funds spent||Difference|
|Senior SW Developer (15 hours per week for 8 months)||$11,578.16||$11,578.16||$0|
|Senior Data scientist (15 hours per week for 8 months)||$14,408.38||$14,408.38||$0|
|Project manager (3 hours per week for 8 months)||$2,500.00||$2,500.00||$0|
|Travel (Wikimania 2019, 2 people)||$3,266.10||$3,266.10||$0|
|University overhead 20%||$7,934.86||$7,934.86||$0|
Do you have any unspent funds from the grant?
Please answer yes or no. If no, include an explanation.
Confirmation of project status
Did you comply with the requirements specified by WMF in the grant agreement?
Is your project completed?
We’d love to hear any thoughts you have on what this project has meant to you, or how the experience of being a grantee has gone overall. Is there something that surprised you, or that you particularly enjoyed, or that you’ll do differently going forward as a result of the Project Grant experience? Please share it here!
Despite some initial hiccups with the grant, caused by our Faculty not having previously received a grant from the WMF, the overall experience of being a grantee was very pleasant. The reporting requirements seem well balanced and do not overwhelm the grantees with bureaucracy, which is a common issue with both the European and the local Czech grant agencies. We particularly enjoyed being able to attend Wikimania for the first time, which gave us a new perspective on how large events for a very diverse community can be organized. Being able to focus these 8 months on Wikidata and the surrounding technologies and cultural habits also gave us important insight, and going forward, we will advocate for contributing data to Wikidata more often.