Grants:Project/MFFUK/Wikidata & ETL/Timeline

Timeline for MFFUK

  Milestone                               Date
  Analysis done                           30 June 2019
  Wikimania demo                          18 August 2019
  Proof-of-concept transformations done   31 October 2019
  Documentation done                      30 November 2019


Monthly updates

Please prepare a brief project update each month, in a format of your choice, to share progress and learnings with the community along the way. Submit the link below as you complete each update.

March

Agreement signed; the actual start of the project is scheduled for April due to the need to sign contracts with the University.

April

  • We have our own Wikibase instance set up for testing our bulk loading processes.
  • We have successfully created the first few Wikibase items via a pipeline in LinkedPipes ETL (LP-ETL), showing where further development and optimization are needed.
  • We had a technical meeting with Czech Wikidata contributors, discussing possible approaches, pitfalls and potential new data sources for the project.
  • We have identified improvements needed in LP-ETL to provide a better user experience during setup and debugging, and we have started implementing them. Specifically, it is now possible to browse debug data via HTTP (previously only via FTP), which will be useful to pipeline developers.
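
For illustration, the HTTP access means a pipeline developer can inspect intermediate data with any HTTP client. A minimal Python sketch, with a purely hypothetical URL (the real debug endpoint layout depends on the LP-ETL deployment):

    import requests

    # Hypothetical debug URL for one component's output in one execution;
    # consult the LP-ETL documentation for the actual endpoint layout.
    url = "http://localhost:8080/debug/execution-1/component-2/output.ttl"
    print(requests.get(url).text[:500])  # peek at intermediate pipeline data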

May

Screenshot: LinkedPipes ETL pipeline loading data into Wikidata (and dealing with the Wikibase API)
  • We have further analysed the Wikibase API and its token handling, resulting in a more complex LinkedPipes ETL pipeline (screenshot attached). It works like this:
    • Get data from its source
    • Query the Wikibase Blazegraph instance for existing items
    • Create non-existent items
    • Update items (both pre-existing and newly created)
  • The pipeline may seem rather complex, but this is due to the nature of the Wikibase API, which is primarily designed for manual, webpage-based edits rather than machine-to-machine interaction. A code sketch of these API interactions follows this list.
  • We attended the Wikimedia Hackathon 2019, where we met with developers of the Wikibase API and Wikidata to discuss our approach.
    • They confirmed that our strategy is correct and showed interest in LinkedPipes ETL.
    • They also confirmed that the identified API/token issues are intentional and by design: manual curation of Wikidata items is preferred over bot edits, so the rather inconvenient bulk-load (mass import) process is deliberately left to libraries and bots to handle, as a barrier against mass edits by non-experts.
    • They indicated interest in becoming users of our proof of concept.
  • We analysed Wikidata Toolkit and, so far, it seems we will use it as a library to deal with the Wikibase API issues.
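
A minimal sketch of the four steps above, in Python with the requests library against a generic Wikibase instance. The endpoint URLs, the label-based lookup and the session handling are illustrative assumptions, not LP-ETL internals; the underlying API calls (a SPARQL query against Blazegraph, then wbeditentity with a CSRF token) are the ones the pipeline has to make:

    import json
    import requests

    API = "https://wikibase.example.org/w/api.php"       # assumed endpoints,
    SPARQL = "https://query.example.org/bigdata/sparql"  # not real instances

    session = requests.Session()  # assumed to be logged in with a bot account

    def find_item(label):
        # Step 2: ask the Blazegraph instance whether the item already exists.
        query = ('PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> '
                 f'SELECT ?item WHERE {{ ?item rdfs:label "{label}"@en }} LIMIT 1')
        r = session.get(SPARQL, params={"query": query},
                        headers={"Accept": "application/sparql-results+json"})
        bindings = r.json()["results"]["bindings"]
        return bindings[0]["item"]["value"].rsplit("/", 1)[-1] if bindings else None

    def csrf_token():
        # Every Wikibase edit first needs a CSRF token fetched through the API.
        r = session.get(API, params={"action": "query", "meta": "tokens",
                                     "type": "csrf", "format": "json"})
        return r.json()["query"]["tokens"]["csrftoken"]

    def upsert_item(label, entity_data):
        # Steps 3 and 4: create the item if it does not exist, then update it.
        qid = find_item(label)
        params = {"action": "wbeditentity", "format": "json",
                  "token": csrf_token(), "data": json.dumps(entity_data)}
        if qid is None:
            params["new"] = "item"  # create a non-existent item
        else:
            params["id"] = qid      # update a pre-existing item
        return session.post(API, data=params).json()

    upsert_item("Example item", {"labels": {"en": {"language": "en",
                                                   "value": "Example item"}}})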

June

Screenshot: pipeline in LinkedPipes ETL simplified using the Wikibase uploader component
  • The analysis and requirements document, the output of work package 1, was created and published.
  • Initial work has begun on the implementation of the new Wikibase loader component in LinkedPipes ETL.
  • The original pipeline can be simplified significantly using the new component, as can be seen in the attached screenshot.
  • We are registered for Wikimania 2019, where our workshop has been accepted. In addition, we will present a poster about the project. See you in Stockholm!

July

Poster: the process of loading RDF data into Wikibase instances such as Wikidata

August

  • We prepared for the Wikimania 2019 demo workshop.
  • We attended Wikimania 2019; feedback from the poster session and the demo workshop was positive.
  • We implemented most of the Wikidata RDF data format in the Wikibase loader LP-ETL component; it is now ready to be used in actual pipelines. A sketch of the statement shape follows this list.
  • We also contributed to Wikidata Toolkit, the library our component uses, fixing a bug introduced with a recent MediaWiki version.
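
As context for the data format, here is a minimal rdflib sketch of the statement shape in the Wikidata RDF model: the item points to a statement node (p:), which carries the value (ps:) and any references (prov:wasDerivedFrom and pr:). The concrete item and property IDs are just examples, and blank nodes stand in for the IRI-named statement nodes of the real dumps:

    from rdflib import BNode, Graph, Namespace

    WD = Namespace("http://www.wikidata.org/entity/")
    P = Namespace("http://www.wikidata.org/prop/")
    PS = Namespace("http://www.wikidata.org/prop/statement/")
    PR = Namespace("http://www.wikidata.org/prop/reference/")
    PROV = Namespace("http://www.w3.org/ns/prov#")

    g = Graph()
    statement, reference = BNode(), BNode()
    g.add((WD.Q42, P.P31, statement))                   # item -> statement node
    g.add((statement, PS.P31, WD.Q5))                   # statement node -> value
    g.add((statement, PROV.wasDerivedFrom, reference))  # -> reference node
    g.add((reference, PR.P248, WD.Q328))                # reference: "stated in"
    print(g.serialize(format="turtle"))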

September

  • LP-ETL has been dockerized, so it can now be deployed easily.
  • During our work on the proof-of-concept data loading pipelines, we identified a usability problem when working with Wikidata statements that have multiple references. We therefore added a new loading mode to the component, which merges statements with references instead of replacing them (and possibly losing references). A small sketch of the merge idea follows this list.
  • We are now in the process of obtaining bot permission for production loading of data about veteran trees in the Czech Republic into Wikidata.
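
The difference between the two loading modes can be shown with a small Python sketch; the dictionary representation (statements keyed by property and value, each carrying a set of references) is a deliberate simplification of the Wikibase data model, not the component's actual code:

    def merge_statements(existing, incoming):
        """Union the references of matching statements instead of
        overwriting them, so no previously loaded reference is lost."""
        merged = {key: set(refs) for key, refs in existing.items()}
        for key, refs in incoming.items():
            merged.setdefault(key, set()).update(refs)
        return merged

    old = {("P31", "Q5"): {"reference A"}}
    new = {("P31", "Q5"): {"reference B"}}
    # Replace mode would keep only reference B; merge mode keeps both:
    assert merge_statements(old, new)[("P31", "Q5")] == {"reference A",
                                                         "reference B"}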

October

  • LP-ETL dockerization improved: the container no longer needs to run as the root user.
  • Most of the work on resuming long-running loads has been done; a sketch of the resume idea follows this list.
  • We came into contact with the Theatre Institute, which expressed interest in loading its data into Wikidata; we will present our results to them in November.
  • We are a bit behind on the proof-of-concept transformations originally planned for October, mainly because we had to attend multiple conferences that month. This will be fixed in November.
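
A minimal sketch of the resume idea in Python: record which items have already been loaded so that an interrupted bulk load can skip them on restart. The checkpoint file and the load_item callback are hypothetical stand-ins, not the component's internals:

    import json
    import os

    CHECKPOINT = "load-progress.json"

    def load_all(item_ids, load_item):
        done = set()
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                done = set(json.load(f))  # resume: skip already loaded items
        for item_id in item_ids:
            if item_id in done:
                continue
            load_item(item_id)  # push one item through the Wikibase API
            done.add(item_id)
            with open(CHECKPOINT, "w") as f:
                json.dump(sorted(done), f)  # persist progress after each item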

November
