Grants:Project/MFFUK/Wikidata & ETL/Timeline
Timeline for MFFUK
|Milestone||Date|
|Analysis done||30 June 2019|
|Wikimania Demo||18 August 2019|
|Proof of concept transformations done||31 October 2019|
|Documentation done||30 November 2019|
Please prepare a brief project update each month, in a format of your choice, to share progress and learnings with the community along the way. Submit the link below as you complete each update.
The agreement has been signed. The actual start of the project is scheduled for April because contracts first need to be signed with the University.
- We have our own Wikibase instance set up for testing our bulk loading processes.
- We have successfully created the first few Wikibase items via a pipeline in LinkedPipes ETL (LP-ETL), which showed where optimization work is still needed.
- We had a technical meeting with Czech Wikidata contributors, discussing possible approaches, pitfalls and potential new data sources for the project.
- We have identified improvements needed in LP-ETL to provide a better user experience during setup and debugging, and we have started implementing them. Specifically, it is now possible to browse debug data via HTTP (previously only via FTP), which will be useful to pipeline developers.
- We have further analysed the Wikibase API and its token handling, resulting in a more complex LinkedPipes ETL pipeline (screenshot attached). It works like this:
  - Get the data from its source.
  - Query the Wikibase Blazegraph instance for existing items.
  - Create the items that do not exist yet.
  - Update the items (both pre-existing and newly created).
- The pipeline seems rather complex, but this is due to the nature of the Wikibase API, which is primarily designed for manual, webpage-based edits rather than machine-to-machine interaction.
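The four steps above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the actual LP-ETL pipeline: `FakeWikibase`, `run_pipeline`, and the `key`/`label` record fields are hypothetical names, and the Wikibase instance is replaced by an in-memory stand-in so that the sketch runs without a network connection.

```python
from dataclasses import dataclass, field


@dataclass
class FakeWikibase:
    """In-memory stand-in for a Wikibase instance (hypothetical, illustration only)."""
    items: dict = field(default_factory=dict)  # item ID -> {"key": ..., "label": ...}
    next_id: int = 1

    def query_existing(self, keys):
        """Step 2: map source keys to the IDs of items that already exist."""
        wanted = set(keys)
        return {item["key"]: qid
                for qid, item in self.items.items()
                if item["key"] in wanted}

    def create_item(self, key):
        """Step 3: create an empty item for a source key and return its new ID."""
        qid = f"Q{self.next_id}"
        self.next_id += 1
        self.items[qid] = {"key": key, "label": None}
        return qid

    def update_item(self, qid, record):
        """Step 4: write the source data onto an item."""
        self.items[qid]["label"] = record["label"]


def run_pipeline(source_records, wikibase):
    """Run the four steps; returns (existing, created) as key -> item ID maps."""
    # Step 1: the source data arrives as a list of dicts.
    by_key = {record["key"]: record for record in source_records}
    # Step 2: find out which records already have an item.
    existing = wikibase.query_existing(by_key.keys())
    # Step 3: create items only for the records that have none.
    created = {key: wikibase.create_item(key)
               for key in by_key if key not in existing}
    # Step 4: update every item, pre-existing and newly created alike.
    item_ids = {**existing, **created}
    for key, record in by_key.items():
        wikibase.update_item(item_ids[key], record)
    return existing, created
```

Against a real Wikibase, steps 3 and 4 would go through the `wbeditentity` API action, and each write would need a CSRF token obtained beforehand via `action=query&meta=tokens`. That token round trip is part of what makes the real pipeline more involved than this sketch.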
- We attended the Wikimedia Hackathon 2019, where we met with developers of the Wikibase API and Wikidata to discuss our approach.
  - They confirmed that our strategy is correct and showed interest in LinkedPipes ETL.
  - They also confirmed that the API and token issues we identified are intentional. Wikidata prefers manual curation of items over bot edits, so the inconvenience of the bulk load (mass import) process is left for libraries and bots to overcome, and it acts as a barrier against mass edits by non-experts.
  - They indicated interest in becoming users of our proof of concept.
- We analysed Wikidata Toolkit. So far, it seems that we will use it as the library that deals with the Wikibase API issues.
- The analysis and requirements document, the output of work package 1, was created and published.
- Initial work has begun on implementing the new Wikibase loader component in LinkedPipes ETL.
  - The original pipeline can be simplified significantly using the new component, as can be seen in the attached screenshot.
- We are registered for Wikimania 2019, where we have a workshop accepted. In addition, we will present a poster about the project. See you in Stockholm!
- We have created a poster for Wikimania 2019 describing the process of loading RDF data into Wikibase instances such as Wikidata.
- We are continuing the implementation of the Wikibase loader component. Specifically, we now have support for complex data types (quantity, geo, timevalue), somevalue, novalue, and initial support for references and qualifiers.
- There is a teaser for our presentation at Wikimania in the LinkedPipes ETL news feed.
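To illustrate the value kinds listed above for the Wikibase loader (quantity, geo, timevalue, somevalue, novalue), here is a minimal Python sketch of how they appear as snaks in the Wikibase JSON data model. It is not the loader's input format, which is RDF. The helper names and the property IDs `P1` and `P2` in the usage example are hypothetical.

```python
WIKIDATA_ENTITY = "http://www.wikidata.org/entity/"


def value_snak(prop, datavalue_type, value):
    """Build a snak that carries a concrete value."""
    return {
        "snaktype": "value",
        "property": prop,
        "datavalue": {"type": datavalue_type, "value": value},
    }


def special_snak(prop, snaktype):
    """Build a 'somevalue' (unknown value) or 'novalue' snak; neither has a datavalue."""
    if snaktype not in ("somevalue", "novalue"):
        raise ValueError(f"unsupported snaktype: {snaktype}")
    return {"snaktype": snaktype, "property": prop}


def quantity(amount, unit="1"):
    """Quantity value: the amount is a signed decimal string, unit '1' means unitless."""
    return {"amount": f"{amount:+}", "unit": unit}


def globe_coordinate(latitude, longitude, precision=0.0001):
    """Coordinate value on Earth (Q2)."""
    return {
        "latitude": latitude,
        "longitude": longitude,
        "precision": precision,
        "globe": WIKIDATA_ENTITY + "Q2",
    }


def day_time(year, month, day):
    """Time value with day precision (11) in the proleptic Gregorian calendar."""
    return {
        "time": f"+{year:04d}-{month:02d}-{day:02d}T00:00:00Z",
        "timezone": 0,
        "before": 0,
        "after": 0,
        "precision": 11,
        "calendarmodel": WIKIDATA_ENTITY + "Q1985727",
    }
```

For example, `value_snak("P1", "quantity", quantity(10))` yields a unitless quantity snak, while `special_snak("P2", "novalue")` states explicitly that the property has no value.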
- We prepared for the Wikimania 2019 demo workshop
- We attended Wikimania 2019. Feedback from the poster session and the demo workshop was positive.
- We implemented most of the Wikidata RDF data format in the Wikibase loader LP-ETL component, so it is now ready to be used in actual pipelines.
- We also contributed to Wikidata Toolkit, the library our component uses, by fixing a bug introduced with the recent MediaWiki version.
- LP-ETL has been dockerized, so now it can be deployed easily
- During our work on the proof of concept data loading pipelines, we identified a usability problem when working with Wikidata statements that have multiple references. We therefore added a new loading mode to the component, which merges statements with references instead of replacing them (and possibly losing references).
- We are now in the process of obtaining bot permission for production loading of data about veteran trees in the Czech Republic into Wikidata.
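The difference between the replacing and merging load modes described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the component's actual implementation: a statement is modelled as a dict with `property`, `value`, and a `references` list, and two statements are treated as the same claim when property and value match. The function names are hypothetical.

```python
def replace_statements(existing, incoming):
    """Replace mode: incoming statements overwrite all existing ones for the same property."""
    incoming_props = {s["property"] for s in incoming}
    kept = [s for s in existing if s["property"] not in incoming_props]
    return kept + [dict(s, references=list(s["references"])) for s in incoming]


def merge_statements(existing, incoming):
    """Merge mode: keep existing statements and add only the references that are missing."""
    # Copy the existing statements so the input lists are not mutated.
    merged = [dict(s, references=list(s["references"])) for s in existing]
    index = {(s["property"], s["value"]): s for s in merged}
    for statement in incoming:
        key = (statement["property"], statement["value"])
        target = index.get(key)
        if target is None:
            # A claim not seen before is appended as a new statement.
            target = dict(statement, references=[])
            merged.append(target)
            index[key] = target
        for reference in statement["references"]:
            if reference not in target["references"]:
                target["references"].append(reference)
    return merged
```

With an existing statement referenced by `"ref-a"` and an incoming statement of the same property and value referenced by `"ref-b"`, replace mode keeps only `"ref-b"`, whereas merge mode keeps both references on a single statement.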
- The LP-ETL dockerization was improved: it no longer needs to run as the root user.
- Most of the work on resuming long-running loads has been done.
- We came into contact with the Theatre Institute, which expressed interest in loading its data into Wikidata, and we will present our results to them in November.
- We are a bit behind on the proof of concept transformations, which were originally planned to be done in October, mainly because we had to attend multiple conferences that month. We will catch up in November.
- Proof of concept pipelines
  - The November Wikidata Query Service lag complicates the development of proof of concept pipelines.
  - The proof of concept pipeline loading data about Czech Remarkable Trees from the authoritative source has been approved.
  - The proof of concept pipeline loading data about Czech streets has been approved and has run successfully.
  - The proof of concept pipeline linking languages in Wikidata to languages in the Language EU Vocabulary was developed, has been approved, and has run several times.
  - A volunteer, Martin Nečaský, created a pipeline based on the tutorial that loads data about theatres from the Arts and Theatre Institute. Its approval is pending.
- Documentation and Communication
  - The documentation of the LP-ETL component has been significantly updated.
  - Based on the Remarkable Trees pipeline, a tutorial documenting our approach was created.
  - A blog post about our experiences during the development of the transformation pipelines was included in the tutorial on the LP-ETL website.
  - The Wikidata GLAM Facebook group was notified about the tutorial.