Grants:Project/CS&S/Structured Data on Wikimedia Commons functionalities in OpenRefine/Timeline
|First release of a reconciliation service for Wikimedia Commons (task T289803)||15 November 2021|
|First release of a version of OpenRefine that allows editing structured data of Wikimedia Commons files||31 January 2022|
|First release of a version of OpenRefine that allows uploading Wikimedia Commons files with structured data||30 June 2022|
|Category of task||Jul||Aug||Sep||Oct||Nov||Dec||Jan||Feb||Mar||Apr||May||Jun||Jul||Aug|
|Software development for OpenRefine functionalities||Entity types I||Federation||Entity types II||Export as a file||Direct upload|
|Software development for Wikimedia functionalities||Commons reconciliation service||Batch upload tool|
|Community engagement||Bring together a test panel + continuous user feedback and testing|
|Documentation and training||Documentation + webinar for batch editing structured data||Documentation + webinar for all functionalities|
Features and architecture
|Workflow step||If the user has ... installed, they are able to ...||OpenRefine without Commons extension (via this grant)||OpenRefine with Commons extension - beta version (via this grant; mid 2022)||OpenRefine with Commons extension - v1.0 release (2023-ish?)||Remarks|
|01 🌟 Project creation||Let OpenRefine import a data sheet (any OR-supported format) with file paths from Wikimedia Commons||Yes||Yes||Yes|
|02 🚿 Data cleaning||Clean and improve a spreadsheet or dataset with file metadata||Yes||Yes||Yes|
|03 🔃 Reconciliation / recon||Reconcile data columns with Wikidata||Yes||Yes||Yes||Using the Wikimedia Commons Reconciliation Service.|
|03 🔃 Reconciliation / recon||Reconcile Commons file names with M-ids||Yes||Yes||Yes||Using the Wikimedia Commons Reconciliation Service.|
|03 🔃 Reconciliation / recon||Reconcile M-ids with Commons file names||Yes||Yes||Yes||Using the Wikimedia Commons Reconciliation Service.|
|04 ➡️ Reconciliation / data extension||Retrieve Wikitext (as one big blob) from existing Commons files which are reconciled in OpenRefine||Yes||Yes||Yes|
|04 ➡️ Reconciliation / data extension||Retrieve structured data (including captions) from existing Commons files which are reconciled in OpenRefine||Yes||Yes||Yes|
|06 ✍️ Editing schema preparation||Create an editing schema in OpenRefine that structures edits to files on Wikimedia Commons||Yes||Yes||Yes|
|06 ✍️ Editing schema preparation||Create an editing schema in OpenRefine that structures edits to Wikidata items||Yes||Yes||Yes||Is currently already supported in OpenRefine; user will be able to switch between (batch) editing Wikidata and Commons.|
|07 👀 Check/preview/test upload||See an example preview of the structured data that will be (batch) edited to existing Commons files||Yes||Yes||Yes|
|07 👀 Check/preview/test upload||See an example preview of the Wikitext (generated) infobox of files edited in batch||Yes||Yes||Yes|
|08 💿 Batch SDC edit||Add structured data to existing files on Wikimedia Commons||Yes||Yes||Yes|
|11 🤦‍♂️ Fix errors||Undo batch SDC edits to existing files||Yes||Yes||Yes||Using the EditGroups tool.|
|11 🤦‍♂️ Fix errors||Delete batch uploads||Yes||Yes||Yes||Using the EditGroups tool.|
|12 ✅ Report after upload||Download dataset with modified or uploaded file links from Commons, with their file names, M-ids, and metadata||Yes||Yes||Yes|
|01 🌟 Project creation||Provide OpenRefine with one or multiple Wikimedia Commons categories; OpenRefine then loads the (reconciled) file paths of all the files in these categories||No||Yes||Yes|
|10 🔺 Upload files||Upload new files to Wikimedia Commons||No||Yes||Yes||We may also decide to implement file upload in OpenRefine itself. It will certainly be possible as part of the Commons extension.|
|04 ➡️ Reconciliation / data extension||Retrieve Wikitext, split and cleaned per parameter and template, from existing Commons files which are reconciled in OpenRefine||No||Possibly?||Yes|
|07 👀 Check/preview/test upload||See an example preview of the structured data that will be (batch) added to newly uploaded Commons files||No||Possibly?||Yes|
|07 👀 Check/preview/test upload||See an example preview of the Wikitext (generated) infobox of files uploaded in batch||No||Possibly?||Yes|
|02 🚿 Data cleaning||Directly see thumbnails of media files while cleaning and editing their metadata||No||No||Yes||OpenRefine is very data-centric and does not natively support (direct) preview of thumbnails of files in its data operation screen. In a next version of the Commons extension we can, for instance, introduce a media-centric editing and viewing screen that allows end users to batch edit metadata of media files in a more visually oriented way.|
|05 📌 Metadata mapping||Map the user's dataset with a preset template / checklist of fields from Wikimedia Commons (e.g. Artwork, Book, Map...)||No||No||Yes|
|01 🌟 Project creation||IIIF support (retrieve and process images/files hosted on a IIIF service)||No||No||Possibly?||We're aware of this potential use case; it can be developed after deployment of core functionalities and after more research.|
|09 📚 Batch Wikitext edit||Edit Wikitext of existing Commons files||No||No||Possibly?||Wikimedia Commons editing and upload functionalities in OpenRefine focus on Structured Data first and foremost.|
|10 🔺 Upload files||Upload new files to an arbitrary Wikibase||No||No||Possibly?||We're aware of this potential use case; it can be developed after deployment of core functionalities and after more research. Members of the Wikibase stakeholder group have already indicated interest.|
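The reconciliation rows in the table above pair Commons file names with M-ids, and the updates below mention support for various formats of Commons file names. As a rough illustration only, the following Python sketch shows one way such inputs could be normalized before matching. The function names are hypothetical and this is not the code of the Wikimedia Commons Reconciliation Service. It assumes that a MediaInfo entity ID is the letter "M" followed by the page ID of the file page, and that inputs arrive either as bare names, as `File:`-prefixed titles, or as `/wiki/` URLs.

```python
import re
from urllib.parse import unquote, urlparse

# Hypothetical helpers for illustration; not the reconciliation service's code.
_MID_PATTERN = re.compile(r"^M(\d+)$")


def normalize_commons_file_name(raw: str) -> str:
    """Return a canonical 'File:Name.ext' title for a few common input formats."""
    text = raw.strip()
    # Full page URL: keep only the last path segment and decode %-escapes.
    if text.startswith(("http://", "https://")):
        text = unquote(urlparse(text).path.rsplit("/", 1)[-1])
    # MediaWiki treats underscores and spaces in titles as equivalent.
    text = text.replace("_", " ")
    # Drop the namespace prefix ("Image:" is a legacy alias of "File:").
    for prefix in ("file:", "image:"):
        if text.lower().startswith(prefix):
            text = text[len(prefix):]
            break
    # Collapse repeated whitespace left over from the steps above.
    text = " ".join(text.split())
    if not text:
        raise ValueError(f"no file name found in {raw!r}")
    # The first letter of a title is capitalized on Wikimedia Commons.
    return "File:" + text[0].upper() + text[1:]


def page_id_to_mid(page_id: int) -> str:
    """Build an M-id from a file page ID (assumed scheme: 'M' + page ID)."""
    return f"M{page_id}"


def mid_to_page_id(mid: str) -> int:
    """Extract the numeric page ID from an M-id such as 'M12345'."""
    match = _MID_PATTERN.match(mid.strip())
    if match is None:
        raise ValueError(f"not an M-id: {mid!r}")
    return int(match.group(1))
```

For example, `normalize_commons_file_name("https://commons.wikimedia.org/wiki/File:Example_image.jpg")` and `normalize_commons_file_name("example image.jpg")` both yield `File:Example image.jpg`. URLs of the `index.php?title=...` form are deliberately not handled in this sketch.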
Links; info about current and planned development
- Workboards related to development for this project on Phabricator:
- OpenRefine https://phabricator.wikimedia.org/tag/openrefine/
- Reconciliation https://phabricator.wikimedia.org/tag/reconciliation/
- Workboard related to development for this project on GitHub: https://github.com/orgs/OpenRefine/projects/2
- Public notes of our (regular, at least weekly) development team meetings: https://etherpad.wikimedia.org/p/OpenRefine-SDC-team-meetings
Please prepare a brief project update each month, in a format of your choice, to share progress and learnings with the community along the way. Submit the link below as you complete each update.
Started internal preparations for the project:
- Set up contract for Sandra Fauconnier, who will start working in July 2021 as Product Manager.
- We drafted job openings for the two developers to be hired by this project.
Started the hiring process for our two developers:
- Two job openings were posted on OpenRefine's blog (1) (2). They were announced and promoted in the Wikiverse, on various international open-source job boards, and in the Outreachy community.
- We received a considerable number of applications for both positions, with people from 14 countries and 4 continents applying. (Most applicants discovered our vacancies either via the Outreachy or Wikimedia-tech networks.) We held interviews to get to know the most promising candidates.
Next steps on the hiring process for our two software developers:
- We selected the two top candidates - one developer for Wikimedia-specific features and one developer for OpenRefine-specific features.
- Contracts for both candidates are being finalized this month. As soon as the candidates have signed their contracts, we'll announce their names :-)
Preparations on community engagement, testing and piloting of the features we'll develop
- Sandra has met with Fiona and Giovanna of the Wikimedia Foundation's GLAM and Culture team, which advises this project. Together, we are assembling a longlist of active Wikimedians and GLAM staff whom we will approach to ask whether and how they want to provide feedback and run tests during our development process.
- Sandra is writing down a few example workflows in detail this month (e.g. for adding structured data to artwork-related files on Wikimedia Commons).
Product management preparations
- We are preparing various platforms and structures for transparent management of this project: a GitHub project and two Phabricator projects/workboards (one about OpenRefine, one about reconciliation). We included the links to these on this page.
- Our weekly development team meetings will take off in September. We will keep public notes here.
- Eugene Egbe started working as a junior developer (contractor) for Wikimedia-specific features in this project. Eugene is experienced in Structured Data on Commons development, as he was also the developer behind the popular ISA Tool. For OpenRefine, Eugene develops the Wikimedia Commons reconciliation service and a batch upload tool.
- First code has been written for the Wikimedia Commons Reconciliation Service. Code is available on Gerrit and the service itself will be available at https://commonsreconcile.toolforge.org/
- Antonin has ported the EditGroups tool (which is already quite popular on Wikidata) to Wikimedia Commons: https://editgroups-commons.toolforge.org/. This makes it possible for Commons contributors to undo certain batch edits on Wikimedia Commons, including future 'faulty' batch edits by OpenRefine.
- There is now a landing page for OpenRefine on Wikimedia Commons: https://commons.wikimedia.org/wiki/Commons:OpenRefine. For now, this page will point to information about the development process. As features are deployed, the page will point to general information and documentation.
- Community members interested in this project are signing up at Global message delivery/Targets/OpenRefine and SDC.
- We have written down first thoughts on batch editing and upload workflows and have invited various community members to review these.
- We have continued working on the Wikimedia Commons Reconciliation Service. By the end of October, we had started testing the service in OpenRefine itself and were adding and improving features, including support for various formats of Commons file names and data extension with support for all datatypes. The Wikimedia Commons Reconciliation Service is also available for testing at the Reconciliation service test bench.
- We are preparing work on upload functionalities in OpenRefine and are further researching the viability and necessary functionalities of a new, external (and generic) upload tool (which we included as optional/possible in our grant application). This external tool, if we decide to develop it, should be able to serve as a (more advanced) substitute for QuickStatements. From November 2021, we plan to do some design research (including user interviews) to explore its UX and functionalities in more depth. Lozana Rossenova will work on this.
- We presented our ongoing work to the Wikidata community at WikidataCon 2021 (see presentation on the right). We also gave a general OpenRefine tutorial and participated in a panel discussion about Wikimedia tool sustainability. Slides (where relevant) of our sessions can be found at https://www.wikidata.org/wiki/Wikidata:WikidataCon_2021/Documentation/List_of_sessions.
- Joey Salazar started working as an OpenRefine developer on this project and was welcomed and onboarded to the team early this month. The SDC team for OpenRefine is now complete. Joey has proceeded to work on a foundational task in OpenRefine's codebase which is needed to make SDC editing and uploading possible: making OpenRefine's Wikibase extension compatible with entity types other than Wikidata items (including the MediaInfo entities from Wikimedia Commons). As part of this (very large) task, a Wikibase manifest specifically for Wikimedia Commons has also been deployed.
- Antonin has requested the creation of an OpenRefine 3.6 tag on Wikimedia Commons, which will be used to indicate any edits on Wikimedia Commons that will be done through OpenRefine. In this way, future edits done through the tool can be counted and measured.
- We are preparing for development of an external batch upload tool, which will be more powerful and flexible than QuickStatements and which will allow OpenRefine users to edit and upload large batches of media files and metadata without the need to keep OpenRefine running on their computers. We are running a request for feedback for a representation format for Wikibase edits which will be processed by this new tool.
- This month, Lozana Rossenova has started design research for file upload via OpenRefine, by doing a series of user interviews with prospective users of OpenRefine's batch upload functionalities. She has interviewed various GLAM staff, Wikimedia community members and Wikimedia chapter staff, to assess their basic needs related to batch uploading. Based on insights from these interviews, in December Lozana will design workflows and wireframes for batch uploading through OpenRefine.
- We resumed development on the Wikimedia Commons reconciliation service this month. Eugene has worked on code cleanup, on improvements to the data extension service, and on support for more diverse formats of Commons file names. The data extension functionality in OpenRefine randomly produces error messages, possibly due to Toolforge instabilities, which we are investigating further.
- On the OpenRefine side, Joey has implemented better error messages inside OpenRefine when adding invalid manifests, and has continued work on the major task of supporting entity types other than just Items (including MediaInfo entities!) in OpenRefine. As a result, Joey performed the first official structured data edit to Wikimedia Commons using OpenRefine on December 20! 🎉
- Lozana analyzed and presented her design research to our team, which provided us with valuable insights on what the user's workflow will look like and how the needed features can be integrated step by step into OpenRefine.
- Thanks to the increased clarity introduced via this design research, our team has taken some major architectural decisions:
- We will deploy various Wikimedia Commons-specific features for OpenRefine (such as loading a dataset from Wikimedia Commons categories, parsing infobox Wikitext, and showing Commons-specific previews of files before edit or upload) in a separate new piece of software: a Wikimedia Commons / media file extension for OpenRefine. This decision will make it possible for us to create dedicated user interfaces for such workflow steps. We want to build this extension in such a way that, in a future iteration, it will also be able to support batch file editing and batch file upload for arbitrary Wikibases. In this table (also included on the top of this page) you can see an overview of features which will be enabled by OpenRefine itself and/or by the Wikimedia Commons extension.
- In our grant application, we (optionally) mentioned the possible development of an external, generic batch upload tool for Wikimedia Commons files, which would provide an alternative to QuickStatements. As a major benefit, such a tool would make it possible for end users to perform very large batch edits and uploads ‘in the background / in the cloud’ rather than via OpenRefine directly (for which they would need to keep their computer and OpenRefine running for very long periods of time). However, handling very large batches of files to be uploaded (potentially several gigabytes at once, with varying file sizes and formats) poses a new challenge for such a tool. Therefore, we have decided to dedicate all our currently available development resources to deploying batch upload properly in OpenRefine itself. Based on what we learn from this process, we will decide on the creation of a generic, separate batch upload tool after OpenRefine’s own file upload functionalities have been properly tested.
- After conversations with the Wikimedia Foundation's GLAM team, we have decided to ask for a project extension with some additional funding, for which the rationale can be found in this subpage of our grant.
- We started writing our midpoint report (which is due on January 14).
- And our team took a short break to recharge and celebrate the end of the year. We wish everyone a happy and inspirational 2022!
- We published our midpoint report.
- We have continued to improve the Wikimedia Commons Reconciliation Service. Eugene has (among other things) added support for retrieving file captions, made the multilingual nature of the reconciliation service more visible, and finished work on support for all datatypes. Work has started on preview cards for media files in OpenRefine.
- We are now ready to test OpenRefine's new editing and upload features on Beta Commons. A Beta Commons copy of the reconciliation service has been created for this purpose too.
- Antonin has created a code repository for the Commons Extension of OpenRefine. We'll continue adding Wikimedia Commons-specific functionalities here in the upcoming months.
- On the OpenRefine side, further work was done on: adding a MediaInfo Entity update testcase; refactoring Items to Entities; and on allowing federation between Wikibase manifests. The Wikimedia Commons manifest was updated accordingly. Joey and Antonin also engaged with the developers of Wikidata Toolkit (a Java library for accessing Wikidata and other Wikibase installations, which is used by OpenRefine's Wikidata extension) to address issues around statement duplication.
- We are organizing a community meetup on February 22, in which we will showcase how to batch edit Wikimedia Commons files using OpenRefine, and we'll talk about next steps.
- Eugene made finishing touches to the Wikimedia Commons Reconciliation Service: more work on preview cards, implementation of an entity suggest service, some general code cleanup. The info page of the Reconciliation Service now also includes documentation and a short tutorial video.
- On the OpenRefine side, Joey completed the refactoring of OpenRefine's codebase from Items to Entities and did some further cleanup work. Joey also began work on the Commons Extension, building and improving the first functions / syntax to help OpenRefine users parse Wikitext of Wikimedia Commons files.
- OpenRefine's Wikibase extension has been improved by Antonin, adding more refined functionalities to avoid creating duplicate Wikidata/Wikibase/SDC statements, and to allow the deletion of (Wikidata/Wikibase/SDC) statements as well.
- We held a community meetup on February 22, 2022, which was attended by more than 40 people. Recording, slides and Etherpad with links and notes are available. We presented our work done so far, demonstrated SDC editing with OpenRefine, and talked about future development and the project extension that has been awarded for this grant.
- Work on OpenRefine's Commons extension has continued. Joey has worked on enabling this extension, on various GREL functions that make it easier to process Wikitext, and on the start screen, where users can enter Wikimedia Commons categories. This then loads an OpenRefine project in which users can add structured data to the files in these categories.
- In general, we are preparing a new official OpenRefine version release (version 3.6) which will allow structured data editing on Wikimedia Commons (this is currently only possible with unstable snapshot releases; OpenRefine's current stable version is 3.5.2).
- UX design: Lozana has gathered feedback from our original interviewees on version 2 of our design wireframes, and is processing this feedback. (Interviewees were a mix of GLAM and volunteer contributors to Wikimedia Commons.)
- We also prepared a short user survey for April 2022, asking Commons uploaders (1) how they would like to start batch editing and upload projects in OpenRefine and (2) how they want information templates (showing e.g. artwork, book or map focused metadata) to work on Wikimedia Commons.
- As our February 22 community meetup was so well attended, we have started hosting monthly office hours. The first one took place on March 22. Next ones are listed on our info page on Wikimedia Commons.
- On the development side, we continued working on support for starting an OpenRefine project by entering one or more Wikimedia Commons categories (GitHub issue #3), fixed bugs in the GREL functions for parsing Wikitext (Pull Request #18), and continued backend work inside Wikidata Toolkit and OpenRefine's code to make uploading files possible.
- We summarized a user survey (32 respondents) that asked in more depth about users' expectations on how they would start editing and uploading files via OpenRefine, and how they would like to deal with typical templates for media files (e.g. for photographs, artworks, books...). OpenRefine now also has a brand new project focused on UI/UX design, which includes (but is not limited to) design tasks focused on SDC features: https://github.com/orgs/OpenRefine/projects/1/views/1
- The OpenRefine/SDC team spent three full days (April 25-27) together in person in Ghent, Belgium, for an intensive work sprint. We discussed the upcoming planning, design and technical tasks for this project in depth and made decisions about our timeline and priorities until the end of October. As a result, you can follow progress on the project on a new workboard on GitHub, which also contains many detailed new tasks: https://github.com/orgs/OpenRefine/projects/2
- During our work sprint in Ghent, we also held a workshop for Flemish GLAM staff on April 26, hosted by meemoo, Flemish Institute for Archives (see this tweet for a few photos). More than 20 people attended this session. After a two-hour introduction to OpenRefine, we demonstrated OpenRefine's new SDC editing features, and Lozana presented designs for the upcoming upload functionalities. The audience was enthusiastic, and we received great feedback, including the request to look more deeply into IIIF support (GitHub issue). One participant suggested generalizing OpenRefine's upcoming media/image functionalities (including thumbnail views) to work with any image / media file database or DAMS (not only Wikimedia Commons), a request that we would certainly like to investigate in more depth.