Grants:Project/CS&S/Structured Data on Wikimedia Commons functionalities in OpenRefine/Timeline

From Meta, a Wikimedia project coordination wiki

This project is funded by a Project Grant.

Tracked in Phabricator:
task T289971


Timeline | Date
First release of a reconciliation service for Wikimedia Commons (task T289803) | 15 November 2021
First release of a version of OpenRefine that allows editing structured data of Wikimedia Commons files | 31 January 2022
First release of a version of OpenRefine that allows uploading Wikimedia Commons files with structured data | 30 June 2022


Planning, July 2021 to August 2022
Category of task | Phases (in order)
Software development for OpenRefine functionalities | Entity types I; Federation; Entity types II; Export as a file; Direct upload
Software development for Wikimedia functionalities | Commons reconciliation service; Batch upload tool
Community engagement | Bring together a test panel + continuous user feedback and testing
Documentation and training | Documentation + webinar for batch editing structured data; Documentation + webinar for all functionalities

Features and architecture

Workflow step | If the user has ... installed, they are able to ... | OpenRefine without Commons extension (via this grant) | OpenRefine with Commons extension - beta version (via this grant; mid 2022) | OpenRefine with Commons extension - v1.0 release (2023-ish?) | Remarks
01 🌟 Project creation | Let OpenRefine import a data sheet (any OR-supported format) with file paths from Wikimedia Commons | Yes | Yes | Yes |
02 🚿 Data cleaning | Clean and improve a spreadsheet or dataset with file metadata | Yes | Yes | Yes |
03 🔃 Reconciliation / recon | Reconcile data columns with Wikidata | Yes | Yes | Yes | Using the Wikimedia Commons Reconciliation Service.
03 🔃 Reconciliation / recon | Reconcile Commons file names with M-ids | Yes | Yes | Yes | Using the Wikimedia Commons Reconciliation Service.
03 🔃 Reconciliation / recon | Reconcile M-ids with Commons file names | Yes | Yes | Yes | Using the Wikimedia Commons Reconciliation Service.
04 ➡️ Reconciliation / data extension | Retrieve Wikitext (as one big blob) from existing Commons files which are reconciled in OpenRefine | Yes | Yes | Yes |
04 ➡️ Reconciliation / data extension | Retrieve structured data (including captions) from existing Commons files which are reconciled in OpenRefine | Yes | Yes | Yes |
06 ✏️ Editing schema preparation | Create an editing schema in OpenRefine that structures edits to files on Wikimedia Commons | Yes | Yes | Yes |
06 ✏️ Editing schema preparation | Create an editing schema in OpenRefine that structures edits to Wikidata items | Yes | Yes | Yes | Is currently already supported in OpenRefine; user will be able to switch between (batch) editing Wikidata and Commons.
07 👀 Check/preview/test upload | See an example preview of the structured data that will be (batch) edited to existing Commons files | Yes | Yes | Yes |
07 👀 Check/preview/test upload | See an example preview of the Wikitext (generated) infobox of files edited in batch | Yes | Yes | Yes |
08 💿 Batch SDC edit | Add structured data to existing files on Wikimedia Commons | Yes | Yes | Yes |
11 🤦‍♂️ Fix errors | Undo batch SDC edits to existing files | Yes | Yes | Yes | Using the EditGroups tool.
11 🤦‍♂️ Fix errors | Delete batch uploads | Yes | Yes | Yes | Using the EditGroups tool.
12 ✅ Report after upload | Download dataset with modified or uploaded file links from Commons, with their file names, M-ids, and metadata | Yes | Yes | Yes |
01 🌟 Project creation | Provide OpenRefine with one or multiple Wikimedia Commons categories; OpenRefine then loads the (reconciled) file paths of all the files in these categories | No | Yes | Yes |
10 🔺 Upload files | Upload new files to Wikimedia Commons | No | Yes | Yes | We may also decide to implement file upload in OpenRefine itself. It will certainly be possible as part of the Commons extension.
04 ➡️ Reconciliation / data extension | Retrieve Wikitext, split and cleaned per parameter and template, from existing Commons files which are reconciled in OpenRefine | No | Possibly? | Yes |
07 👀 Check/preview/test upload | See an example preview of the structured data that will be (batch) added to newly uploaded Commons files | No | Possibly? | Yes |
07 👀 Check/preview/test upload | See an example preview of the Wikitext (generated) infobox of files uploaded in batch | No | Possibly? | Yes |
02 🚿 Data cleaning | Directly see thumbnails of media files while cleaning and editing their metadata | No | No | Yes | OpenRefine is very data-centric and does not natively support (direct) preview of thumbnails of files in its data operation screen. In a next version of the Commons extension we can, for instance, introduce a media-centric editing and viewing screen that allows end users to batch edit metadata of media files in a more visually oriented way.
05 📌 Metadata mapping | Map the user's dataset with a preset template / checklist of fields from Wikimedia Commons (e.g. Artwork, Book, Map...) | No | No | Yes |
01 🌟 Project creation | IIIF support (retrieve and process images/files hosted on a IIIF service) | No | No | Possibly? | We're aware of this potential use case; it can be developed after deployment of core functionalities and after more research.
09 📚 Batch Wikitext edit | Edit Wikitext of existing Commons files | No | No | Possibly? | Wikimedia Commons editing and upload functionalities in OpenRefine focus on Structured Data first and foremost.
10 🔺 Upload files | Upload new files to an arbitrary Wikibase | No | No | Possibly? | We're aware of this potential use case; it can be developed after deployment of core functionalities and after more research. Members of the Wikibase stakeholder group have already indicated interest.

Links; info about current and planned development

Monthly updates

Please prepare a brief project update each month, in a format of your choice, to share progress and learnings with the community along the way. Submit the link below as you complete each update.

June 2021

Started internal preparations for the project:

  • Set up a contract for Sandra Fauconnier, who will start working in July 2021 as Product Manager.
  • Drafted job openings for the two developers to be hired by this project.

July 2021

Started the hiring process for our two developers:

  • Two job openings were posted on OpenRefine's blog (1) (2). They were announced and promoted in the Wikiverse, on various international open-source job boards, and in the Outreachy community.
  • We received a considerable number of applications for both positions, with people from 14 countries and 4 continents applying. (Most applicants discovered our vacancies either via the Outreachy or Wikimedia-tech networks.) We held interviews to get to know the most promising candidates.

August 2021

Next steps on the hiring process for our two software developers:

  • We selected the two top candidates - one developer for Wikimedia-specific features and one developer for OpenRefine-specific features.
  • Contracts for both candidates are being finalized this month. As soon as the candidates have signed their contracts, we'll announce their names :-)

Preparations for community engagement, testing and piloting of the features we'll develop:

  • Sandra has met with Fiona and Giovanna of the Wikimedia Foundation's GLAM and Culture team, which advises this project. Together, we are assembling a longlist of active Wikimedians and GLAM staff, whom we will approach to ask whether and how they would like to provide feedback and run tests during our development process.
  • Sandra is writing down a few example workflows in detail this month (e.g. for adding structured data to artwork-related files on Wikimedia Commons).

Product management preparations:

  • We are preparing various platforms and structures for transparent management of this project: a GitHub project and two Phabricator projects/workboards (one about OpenRefine, one about reconciliation). We included the links to these on this page.
  • Our weekly development team meetings will take off in September. We will keep public notes here.

September 2021

Software development:

  • Eugene Egbe started working as a junior developer (contractor) for Wikimedia-specific features in this project. Eugene is experienced in Structured Data on Commons development, as he was also the developer behind the popular ISA Tool. For OpenRefine, Eugene develops the Wikimedia Commons reconciliation service and a batch upload tool.
  • First code has been written for the Wikimedia Commons Reconciliation Service. Code is available on Gerrit and the service itself will be available at
  • Antonin has ported the EditGroups tool (which is already quite popular on Wikidata) to Wikimedia Commons. This makes it possible for Commons contributors to undo certain batch edits on Wikimedia Commons, including future 'faulty' batch edits by OpenRefine.

Community outreach:

October 2021

Presentation at WikidataCon 2021 (October 30) about our ongoing work on SDC support in OpenRefine.

Software development:

  • We have continued working on the Wikimedia Commons Reconciliation Service. By the end of October, we had started testing the service in OpenRefine itself and were adding and improving features such as support for various formats of Commons file names and data extension (with support for all datatypes). The Wikimedia Commons Reconciliation Service is also available for testing at the Reconciliation service test bench.
  • We are preparing work on upload functionalities in OpenRefine and are further researching the viability and necessary functionalities of a new, external (and generic) upload tool (which we included as optional/possible in our grant application). This external tool, if we decide to develop it, should be able to serve as a (more advanced) substitute for QuickStatements. From November 2021, we plan to do some design research (including user interviews) to explore such UX and functionalities in more depth. Lozana Rossenova will work on this.
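To illustrate what reconciling Commons file names in "various formats" can involve, here is a minimal sketch in Python. It is not the service's actual code. The helper names (normalize_commons_title, build_reconciliation_queries) are hypothetical, and the normalization rules are assumptions based on general MediaWiki title conventions. The only protocol detail relied on is that reconciliation services accept a batch of queries as a JSON string in a form field named "queries". The service URL is deliberately left out.

```python
import json
from urllib.parse import unquote, urlencode


def normalize_commons_title(raw: str) -> str:
    """Bring a few common spellings of a Commons file name to 'File:Name.ext'.

    Hypothetical helper: the rules below are illustrative assumptions,
    not the behaviour of the Wikimedia Commons Reconciliation Service.
    """
    text = raw.strip()
    # Drop a page URL prefix such as https://commons.wikimedia.org/wiki/
    if "/wiki/" in text:
        text = text.split("/wiki/", 1)[1]
    # Decode percent-escapes and treat underscores as spaces
    text = unquote(text).replace("_", " ").strip()
    # Remove an existing namespace prefix ("File:" or its alias "Image:")
    prefix, sep, rest = text.partition(":")
    if sep and prefix.strip().lower() in ("file", "image"):
        text = rest.strip()
    # Assumption: the first letter of the title is capitalized
    if text:
        text = text[0].upper() + text[1:]
    return "File:" + text


def build_reconciliation_queries(file_names, limit=3):
    """Build the form-encoded body for one batch of reconciliation queries."""
    queries = {
        f"q{i}": {"query": normalize_commons_title(name), "limit": limit}
        for i, name in enumerate(file_names)
    }
    # The batch travels as a JSON string in the "queries" form field
    return urlencode({"queries": json.dumps(queries)})


if __name__ == "__main__":
    print(normalize_commons_title("https://commons.wikimedia.org/wiki/File:Example_image.jpg"))
    print(build_reconciliation_queries(["example.jpg", "Image:Other_file.png"]))
```

Normalizing before querying keeps the matching logic in one place, so a column that mixes URLs, bare names and prefixed names can be sent to the service in a single batch.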

Community outreach:

November 2021

Software development:

Community outreach:

  • This month, Lozana Rossenova has started design research for file upload via OpenRefine, by doing a series of user interviews with prospective users of OpenRefine's batch upload functionalities. She has interviewed various GLAM staff, Wikimedia community members and Wikimedia chapter staff, to assess their basic needs related to batch uploading. Based on insights from these interviews, in December Lozana will design workflows and wireframes for batch uploading through OpenRefine.

December 2021

  • We resumed development on the Wikimedia Commons reconciliation service this month. Eugene has worked on code cleanup, on improvements to the data extension service, and on support for more diverse formats of Commons file names. The data extension functionality in OpenRefine intermittently produces error messages, possibly due to Toolforge instabilities, which we are investigating further.
  • On the OpenRefine side, Joey has implemented better error messages inside OpenRefine when adding invalid manifests, and has continued work on the major task of supporting entity types other than Items (including MediaInfo entities!) in OpenRefine. As a result, Joey performed the first official structured data edit to Wikimedia Commons using OpenRefine on December 20! 🎉
  • Lozana analyzed and presented her design research to our team, which provided us with valuable insights on what the user's workflow will look like and how the needed features can be integrated step by step into OpenRefine.
  • Thanks to the increased clarity introduced via this design research, our team has taken some major architectural decisions:
    • We will deploy various Wikimedia Commons-specific features for OpenRefine (such as loading a dataset from Wikimedia Commons categories, parsing infobox Wikitext, and showing Commons-specific previews of files before edit or upload) in a separate new piece of software: a Wikimedia Commons / media file extension for OpenRefine. This decision will make it possible for us to create dedicated user interfaces for such workflow steps. We want to build this extension in such a way that, in a future iteration, it will also be able to support batch file editing and batch file upload for arbitrary Wikibases. In this table (also included at the top of this page) you can see an overview of features which will be enabled by OpenRefine itself and/or by the Wikimedia Commons extension.
    • In our grant application, we mentioned (as optional) the possible development of an external, generic batch upload tool for Wikimedia Commons files, which would provide an alternative to QuickStatements. As a major benefit, such a tool would make it possible for end users to perform very large batch edits and uploads ‘in the background / in the cloud’ rather than via OpenRefine directly (for which they would need to keep their computer and OpenRefine running for very long periods of time). However, handling very large batches of files to be uploaded (potentially several gigabytes at once, with varying file sizes and formats) poses a new challenge for such a tool. Therefore, we have decided to dedicate all our currently available development resources to deploying batch upload properly in OpenRefine itself. Based on what we learn from this process, we will decide on the creation of a generic, separate batch upload tool after OpenRefine’s own file upload functionalities have been properly tested.
  • After conversations with the Wikimedia Foundation's GLAM team, we have decided to ask for a project extension with some additional funding, for which the rationale can be found in this subpage of our grant.
  • We started writing our midpoint report (which is due on January 14).
  • And our team took a short break to recharge and celebrate the end of the year. We wish everyone a happy and inspirational 2022!
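As a rough illustration of the data extension step worked on this month, the sketch below builds a data extension request for a few reconciled files and reads values back out of a response. It follows the generic shape used by reconciliation services (an object with "ids" and "properties", and a response keyed by "rows"). Whether the Commons service accepts exactly these property identifiers is an assumption. The function names are hypothetical, and P180 (depicts) is used only as an example property.

```python
def to_m_id(page_id: int) -> str:
    """An M-id is the letter M followed by the page ID of the file page."""
    return f"M{page_id}"


def build_extend_request(page_ids, property_ids):
    """Build a data extension request body (hypothetical helper)."""
    return {
        "ids": [to_m_id(page_id) for page_id in page_ids],
        "properties": [{"id": property_id} for property_id in property_ids],
    }


def extract_values(response, entity_id, property_id):
    """Collect entity ids or plain strings for one entity/property pair.

    Missing entities or properties yield an empty list rather than an error,
    which keeps a batch usable when only some rows come back.
    """
    cells = response.get("rows", {}).get(entity_id, {}).get(property_id, [])
    return [cell.get("id") or cell.get("str") for cell in cells]


if __name__ == "__main__":
    request = build_extend_request([12345], ["P180"])
    print(request)
    # Made-up sample response, for illustration only
    sample = {"rows": {"M12345": {"P180": [{"id": "Q42", "name": "example"}]}}}
    print(extract_values(sample, "M12345", "P180"))
```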

January 2022

February 2022

Wireframes for the Wikimedia Commons upload workflow in OpenRefine (version 2; under feedback as of March 2022).

March 2022

April 2022

  • On the development side, we continued working on support for starting an OpenRefine project by entering one or more Wikimedia Commons categories (GitHub issue #3), fixed bugs in the GREL functionalities for parsing Wikitext (Pull Request #18), and continued backend work inside Wikidata Toolkit and OpenRefine's code to make uploading files possible.
  • Analysis of user survey held in March-April 2022
    On the design side, Lozana processed feedback from various end users into a newer version of our upload-focused wireframes. She also organized and summarized a user survey (32 respondents) that asked in more depth about users' expectations of how they would start editing and uploading files via OpenRefine, and how they would like to deal with typical templates for media files (e.g. for photographs, artworks, books...). OpenRefine now also has a brand new project focused on UI/UX design, which includes (but is not limited to) design tasks focused on SDC features:
  • The OpenRefine/SDC team spent three full days (April 25-27) together in person in Ghent, Belgium, for an intensive work sprint. We discussed the upcoming planning, design and technical tasks for this project in depth and made decisions about our timeline and priorities until the end of October. As a result, you can follow progress on the project on a new workboard on GitHub, which also contains a lot of detailed new tasks:
    • During our work sprint in Ghent, we also held a workshop for Flemish GLAM staff on April 26, hosted by meemoo, Flemish Institute for Archives (see this tweet for a few photos). More than 20 people attended this session. After a two-hour introduction to OpenRefine, we demonstrated OpenRefine's new SDC editing features; Lozana presented designs for the upcoming upload functionalities. The audience was enthusiastic, and we received great feedback, including the request to look more deeply into IIIF support (GitHub issue). One participant suggested generalizing OpenRefine's upcoming media/image functionalities (including thumbnail views) to work with any image / media file database or DAMS (not only Wikimedia Commons), which is a request that we certainly would like to investigate in more depth.
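For the category-based project creation mentioned in this month's development notes, the sketch below shows one way the files in a Commons category could be listed: through the MediaWiki Action API's categorymembers list. This is an illustration under stated assumptions, not the extension's implementation. The function names are hypothetical and no network call is made here; the code only builds the request URL and reshapes a response into rows.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def category_files_url(category, limit=50, continue_token=None):
    """Build an Action API URL listing the files in one Commons category."""
    title = category if category.lower().startswith("category:") else "Category:" + category
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": title,
        "cmtype": "file",   # only file pages, no subcategories or galleries
        "cmlimit": limit,
        "format": "json",
    }
    if continue_token:
        # Token returned by a previous response when more results exist
        params["cmcontinue"] = continue_token
    return COMMONS_API + "?" + urlencode(params)


def file_rows(api_response):
    """Turn a categorymembers response into rows with a file name and an M-id."""
    members = api_response.get("query", {}).get("categorymembers", [])
    return [{"file": m["title"], "mid": f"M{m['pageid']}"} for m in members]


if __name__ == "__main__":
    print(category_files_url("Paintings by Example", limit=10))
    # Made-up sample response, for illustration only
    sample = {"query": {"categorymembers": [{"pageid": 77, "ns": 6, "title": "File:A.jpg"}]}}
    print(file_rows(sample))
```

Because the M-id is derived from the page ID in the same response, rows produced this way would already carry the identifier that the reconciliation step otherwise has to look up.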

May 2022

June 2022

July 2022

August 2022

September 2022

October 2022
