Grants:Project/CS&S/Structured Data on Wikimedia Commons functionalities in OpenRefine/Timeline

From Meta, a Wikimedia project coordination wiki
Tracked in Phabricator:
Task T289971

Milestones

Milestone | Date
First release of a reconciliation service for Wikimedia Commons (task T289803) | 15 November 2021
First release of a version of OpenRefine that allows editing structured data of Wikimedia Commons files | 31 January 2022
First release of a version of OpenRefine that allows uploading Wikimedia Commons files with structured data | 30 June 2022

Planning

Planned work from July 2021 to August 2022, by category of task:

  • Software development for OpenRefine functionalities: Entity types I; Federation; Entity types II; Export as a file; Direct upload
  • Software development for Wikimedia functionalities: Commons reconciliation service; Batch upload tool
  • Community engagement: Bring together a test panel + continuous user feedback and testing
  • Documentation and training: Documentation + webinar for batch editing structured data; Documentation + webinar for all functionalities

Features and architecture

Workflow step | If the user has ... installed, they are able to ... | OpenRefine without Commons extension (via this grant) | OpenRefine with Commons extension - beta version (via this grant; mid 2022) | OpenRefine with Commons extension - v1.0 release (2023-ish?) | Remarks
01 🌟 Project creation | Let OpenRefine import a data sheet (any OR-supported format) with file paths from Wikimedia Commons | Yes | Yes | Yes |
02 🚿 Data cleaning | Clean and improve a spreadsheet or dataset with file metadata | Yes | Yes | Yes |
03 🔃 Reconciliation / recon | Reconcile data columns with Wikidata | Yes | Yes | Yes | Using the Wikimedia Commons Reconciliation Service.
03 🔃 Reconciliation / recon | Reconcile Commons file names with M-ids | Yes | Yes | Yes | Using the Wikimedia Commons Reconciliation Service.
03 🔃 Reconciliation / recon | Reconcile M-ids with Commons file names | Yes | Yes | Yes | Using the Wikimedia Commons Reconciliation Service.
04 ➡️ Reconciliation / data extension | Retrieve Wikitext (as one big blob) from existing Commons files which are reconciled in OpenRefine | Yes | Yes | Yes |
04 ➡️ Reconciliation / data extension | Retrieve structured data (including captions) from existing Commons files which are reconciled in OpenRefine | Yes | Yes | Yes |
06 ✍️ Editing schema preparation | Create an editing schema in OpenRefine that structures edits to files on Wikimedia Commons | Yes | Yes | Yes |
06 ✍️ Editing schema preparation | Create an editing schema in OpenRefine that structures edits to Wikidata items | Yes | Yes | Yes | Is currently already supported in OpenRefine; user will be able to switch between (batch) editing Wikidata and Commons.
07 👀 Check/preview/test upload | See an example preview of the structured data that will be (batch) edited to existing Commons files | Yes | Yes | Yes |
07 👀 Check/preview/test upload | See an example preview of the Wikitext (generated) infobox of files edited in batch | Yes | Yes | Yes |
08 💿 Batch SDC edit | Add structured data to existing files on Wikimedia Commons | Yes | Yes | Yes |
11 🤦‍♂️ Fix errors | Undo batch SDC edits to existing files | Yes | Yes | Yes | Using the EditGroups tool.
11 🤦‍♂️ Fix errors | Delete batch uploads | Yes | Yes | Yes | Using the EditGroups tool.
12 ✅ Report after upload | Download dataset with modified or uploaded file links from Commons, with their file names, M-ids, and metadata | Yes | Yes | Yes |
01 🌟 Project creation | Provide OpenRefine with one or multiple Wikimedia Commons categories; OpenRefine then loads the (reconciled) file paths of all the files in these categories | No | Yes | Yes |
10 🔺 Upload files | Upload new files to Wikimedia Commons | No | Yes | Yes | We may also decide to implement file upload in OpenRefine itself. It will certainly be possible as part of the Commons extension.
04 ➡️ Reconciliation / data extension | Retrieve Wikitext, split and cleaned per parameter and template, from existing Commons files which are reconciled in OpenRefine | No | Possibly? | Yes |
07 👀 Check/preview/test upload | See an example preview of the structured data that will be (batch) added to newly uploaded Commons files | No | Possibly? | Yes |
07 👀 Check/preview/test upload | See an example preview of the Wikitext (generated) infobox of files uploaded in batch | No | Possibly? | Yes |
02 🚿 Data cleaning | Directly see thumbnails of media files while cleaning and editing their metadata | No | No | Yes | OpenRefine is very data-centric and does not natively support (direct) preview of thumbnails of files in its data operation screen. In a next version of the Commons extension we can, for instance, introduce a media-centric editing and viewing screen that allows end users to batch edit metadata of media files in a more visually oriented way.
05 📌 Metadata mapping | Map the user's dataset with a preset template / checklist of fields from Wikimedia Commons (e.g. Artwork, Book, Map...) | No | No | Yes |
01 🌟 Project creation | IIIF support (retrieve and process images/files hosted on a IIIF service) | No | No | Possibly? | We're aware of this potential use case; it can be developed after deployment of core functionalities and after more research.
09 📚 Batch Wikitext edit | Edit Wikitext of existing Commons files | No | No | Possibly? | Wikimedia Commons editing and upload functionalities in OpenRefine focus on Structured Data first and foremost.
10 🔺 Upload files | Upload new files to an arbitrary Wikibase | No | No | Possibly? | We're aware of this potential use case; it can be developed after deployment of core functionalities and after more research. Members of the Wikibase stakeholder group have already indicated interest.
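
The M-ids mentioned in the table above are the identifiers of MediaInfo entities on Wikimedia Commons: the letter M followed by the numeric page ID of the file page. A minimal Python sketch of that relationship follows. The function names are hypothetical and are not part of OpenRefine or the reconciliation service.

```python
def mid_from_page_id(page_id: int) -> str:
    """Return the M-id for the Commons file page with this page ID."""
    if page_id <= 0:
        raise ValueError("page ID must be a positive integer")
    return f"M{page_id}"


def page_id_from_mid(mid: str) -> int:
    """Return the page ID encoded in an M-id such as 'M12345'."""
    digits = mid[1:]
    if not (mid.startswith("M") and digits.isascii() and digits.isdigit()):
        raise ValueError(f"not a valid M-id: {mid!r}")
    return int(digits)
```

Because the mapping is purely syntactic, converting between an M-id and a page ID needs no network request; only the step from a file name to its page ID requires a lookup.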

Links; info about current and planned development

Monthly updates

Please prepare a brief project update each month, in a format of your choice, to share progress and learnings with the community along the way. Submit the link below as you complete each update.

June 2021

Started internal preparations for the project:

  • We set up a contract for Sandra Fauconnier, who will start working in July 2021 as Product Manager.
  • We drafted job openings for the two developers to be hired by this project.

July 2021

Started the hiring process for our two developers:

  • Two job openings were posted on OpenRefine's blog (1) (2). They were announced and promoted in the Wikiverse, on various international open-source job boards, and in the Outreachy community.
  • We received a considerable number of applications for both positions, with people from 14 countries and 4 continents applying. (Most applicants discovered our vacancies via either the Outreachy or the Wikimedia-tech networks.) We held interviews to get to know the most promising candidates.

August 2021

Next steps on the hiring process for our two software developers:

  • We selected the two top candidates - one developer for Wikimedia-specific features and one developer for OpenRefine-specific features.
  • Contracts for both candidates are being finalized this month. As soon as the candidates have signed their contracts, we'll announce their names :-)

Preparations on community engagement, testing and piloting of the features we'll develop:

  • Sandra has met with Fiona and Giovanna of the Wikimedia Foundation's GLAM and Culture team, which advises this project. Together, we are assembling a longlist of active Wikimedians and GLAM staff whom we will approach to ask whether and how they want to provide feedback and run tests during our development process.
  • Sandra is writing down a few example workflows in detail this month (e.g. for adding structured data to artwork-related files on Wikimedia Commons).

Product management preparations:

  • We are preparing various platforms and structures for transparent management of this project: a GitHub project and two Phabricator projects/workboards (one about OpenRefine, one about reconciliation). We included the links to these on this page.
  • Our weekly development team meetings will start in September. We will keep public notes here.

September 2021

Software development:

  • Eugene Egbe started working as a junior developer (contractor) for Wikimedia-specific features in this project. Eugene is experienced in Structured Data on Commons development, as he was also the developer behind the popular ISA Tool. For OpenRefine, Eugene develops the Wikimedia Commons reconciliation service and a batch upload tool.
  • First code has been written for the Wikimedia Commons Reconciliation Service. Code is available on Gerrit and the service itself will be available at https://commonsreconcile.toolforge.org/
  • Antonin has ported the EditGroups tool (which is already quite popular on Wikidata) to Wikimedia Commons: https://editgroups-commons.toolforge.org/. This makes it possible for Commons contributors to undo certain batch edits on Wikimedia Commons, including future 'faulty' batch edits by OpenRefine.
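
Reconciliation services of this kind generally follow the Reconciliation Service API, in which a client posts a batch of queries as a JSON object in a form field named `queries`. The sketch below only builds such a request body for a list of Commons file names. The helper name is hypothetical, and the exact endpoint path and any extra options of the Commons service are not described in this report, so none are assumed here.

```python
import json


def build_reconcile_form(file_names):
    """Build the form fields for one batch reconciliation request.

    Each file name becomes one query, keyed q0, q1, ... so that the
    results in the response can be matched back to the inputs.
    """
    queries = {f"q{i}": {"query": name} for i, name in enumerate(file_names)}
    return {"queries": json.dumps(queries)}
```

The returned dictionary can be sent as form data in a POST request with any HTTP client; the response is expected to contain one result list per query key.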

Community outreach:

October 2021

Presentation at WikidataCon 2021 (October 30) about our ongoing work on SDC support in OpenRefine.

Software development:

  • We have continued working on the Wikimedia Commons Reconciliation Service. By the end of October, we had started testing the service in OpenRefine itself, and we are adding and improving features, including support for various formats of Commons file names and data extension with support for all datatypes. The Wikimedia Commons Reconciliation Service is also available for testing at the Reconciliation service test bench.
  • We are preparing work on upload functionalities in OpenRefine and are further researching the viability and necessary functionalities of a new, external (and generic) upload tool (which we included as optional/possible in our grant application). This external tool, if we decide to develop it, should be able to serve as a (more advanced) substitute for QuickStatements. From November 2021, we plan to do some design research (including user interviews) to explore such UX and functionalities in more depth. Lozana Rossenova will work on this.
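
To illustrate what supporting various formats of Commons file names can involve, here is a minimal Python sketch that maps a few common spellings of the same file onto one canonical title. It is a hypothetical helper, not the service's actual code, and the report does not list which formats the service accepts. The sketch assumes three inputs: a bare name, a name with the `File:` prefix, and a full URL containing `/wiki/`.

```python
from urllib.parse import unquote, urlparse


def normalize_commons_file_name(value: str) -> str:
    """Return a canonical 'File:Name.ext' title for several input spellings."""
    text = value.strip()
    # Full URL: keep only the page title that follows /wiki/ and decode it.
    if text.startswith(("http://", "https://")):
        path = urlparse(text).path
        text = unquote(path.rsplit("/wiki/", 1)[-1])
    # MediaWiki treats underscores and spaces in titles as equivalent.
    text = text.replace("_", " ").strip()
    # Drop an existing namespace prefix so it can be re-added uniformly.
    if text.lower().startswith("file:"):
        text = text[len("file:"):].strip()
    if not text:
        raise ValueError("empty file name")
    # Commons capitalizes the first character of page titles.
    return "File:" + text[0].upper() + text[1:]
```

For example, `foo_bar.jpg`, `File:Foo bar.jpg` and `https://commons.wikimedia.org/wiki/File:Foo_bar.jpg` all normalize to `File:Foo bar.jpg`.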

Community outreach:

November 2021

Software development:

Community outreach:

  • This month, Lozana Rossenova has started design research for file upload via OpenRefine, by doing a series of user interviews with prospective users of OpenRefine's batch upload functionalities. She has interviewed various GLAM staff, Wikimedia community members and Wikimedia chapter staff, to assess their basic needs related to batch uploading. Based on insights from these interviews, in December Lozana will design workflows and wireframes for batch uploading through OpenRefine.

December 2021

  • We resumed development on the Wikimedia Commons reconciliation service this month. Eugene has worked on code cleanup, on improvements to the data extension service, and on support for more diverse formats of Commons file names. The data extension functionality in OpenRefine intermittently produces error messages, possibly due to Toolforge instabilities, which we are investigating further.
  • On the OpenRefine side, Joey has implemented better error messages inside OpenRefine when adding invalid manifests, and has continued work on the major task of supporting entity types other than Items (including MediaInfo entities!) in OpenRefine. As a result, Joey performed the first official structured data edit to Wikimedia Commons using OpenRefine on December 20! 🎉
  • Lozana analyzed and presented her design research to our team, which provided us with valuable insights on what the user's workflow will look like and how the needed features can be integrated step by step into OpenRefine.
  • Thanks to the increased clarity introduced via this design research, our team has taken some major architectural decisions:
    • We will deploy various Wikimedia Commons-specific features for OpenRefine (such as loading a dataset from Wikimedia Commons categories, parsing infobox Wikitext, and showing Commons-specific previews of files before edit or upload) in a separate new piece of software: a Wikimedia Commons / media file extension for OpenRefine. This decision will make it possible for us to create dedicated user interfaces for such workflow steps. We want to build this extension in such a way that, in a future iteration, it will also be able to support batch file editing and batch file upload for arbitrary Wikibases. In this table (also included on the top of this page) you can see an overview of features which will be enabled by OpenRefine itself and/or by the Wikimedia Commons extension.
    • In our grant application, we (optionally) mentioned the possible development of an external, generic batch upload tool for Wikimedia Commons files, which would provide an alternative to QuickStatements. As a major benefit, such a tool would make it possible for end users to perform very large batch edits and uploads ‘in the background / in the cloud’ rather than via OpenRefine directly (for which they would need to keep their computer and OpenRefine running for very long periods of time). However, handling very large batches of files to be uploaded (potentially several gigabytes at once, with varying file sizes and formats) poses a new challenge for such a tool. Therefore, we have decided to currently dedicate all our available development resources to deploying batch upload properly in OpenRefine itself. Building on what we learn from this process, we will decide whether to create a generic, separate batch upload tool after OpenRefine’s own file upload functionalities have been properly tested.
  • After conversations with the Wikimedia Foundation's GLAM team, we have decided to ask for a project extension with some additional funding, for which the rationale can be found in this subpage of our grant.
  • We started writing our midpoint report (which is due on January 14).
  • And our team took a short break to recharge and celebrate the end of the year. We wish everyone a happy and inspirational 2022!

January 2022

February 2022

Wireframes for the Wikimedia Commons upload workflow in OpenRefine (version 2; as of March 2022, open for feedback).

March 2022

April 2022

Analysis of user survey held in March-April 2022.

  • On the development side, we continued working on support for starting an OpenRefine project by entering one or more Wikimedia Commons categories (GitHub issue #3), fixed bugs in the GREL functionalities for parsing Wikitext (Pull Request #18), and continued backend work inside Wikidata Toolkit and OpenRefine's code to make uploading files possible.
  • On the design side, Lozana processed feedback from various end users into a newer version of our upload-focused wireframes. She also organized and summarized a user survey (32 respondents) that asked in more depth about users' expectations on how they would start editing and uploading files via OpenRefine, and how they would like to deal with typical templates for media files (e.g. for photographs, artworks, books...). OpenRefine now also has a brand new project focused on UI/UX design, which includes (but is not limited to) design tasks focused on SDC features: https://github.com/orgs/OpenRefine/projects/1/views/1
  • The OpenRefine/SDC team spent three full days (April 25-27) together in person in Ghent, Belgium, for an intensive work sprint. We discussed the upcoming planning, design and technical tasks for this project in depth and made decisions about our timeline and priorities until the end of October. As a result, you can follow progress on the project on a new workboard on GitHub, which also contains many detailed new tasks: https://github.com/orgs/OpenRefine/projects/2
    • During our work sprint in Ghent, we also held a workshop for Flemish GLAM staff on April 26, hosted by meemoo, Flemish Institute for Archives (see this tweet for a few photos). More than 20 people attended this session. After a two-hour introduction to OpenRefine, we demonstrated OpenRefine's new SDC editing features; Lozana presented designs for the upcoming upload functionalities. The audience was enthusiastic, and we received great feedback, including the request to look more deeply into IIIF support (GitHub issue). One participant suggested generalizing OpenRefine's upcoming media/image functionalities (including thumbnail views) to work with any image / media file database or DAMS (not only Wikimedia Commons), which is a request that we certainly would like to investigate in more depth.
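
Starting a project from one or more Commons categories, as mentioned above, requires listing the files in each category. One way to do this is the MediaWiki Action API's `categorymembers` list, sketched below in Python. This is an illustration only, not necessarily how the extension does it, and the function name is hypothetical. The sketch builds the request URL without sending it.

```python
from urllib.parse import urlencode

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def category_files_url(category: str, limit: int = 500) -> str:
    """Build an Action API URL that lists the files in one Commons category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "file",   # only files, no subcategories or other pages
        "cmlimit": limit,   # results per request
        "format": "json",
    }
    return f"{COMMONS_API}?{urlencode(params)}"
```

Categories larger than one page of results need follow-up requests using the `cmcontinue` value returned by the API, which this sketch leaves out.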

May 2022

June 2022

Video demo of Wikimedia Commons (structured data) editing with OpenRefine during Wikidata Lab XXXIV, 9 June 2022 (approx. 1 hour 20 minutes).

  • We released OpenRefine 3.6-beta2, which supports batch editing structured data of Wikimedia Commons files. We continue working on file upload functionalities in the upcoming version 3.7.
  • Work on the Wikimedia Commons extension has continued, finalizing a suggestion service when typing Wikimedia Commons categories to start an OpenRefine project. We also continued work on selecting multiple Commons categories and on parsing of the retrieved files when the new OpenRefine project is created.
  • As basic upload functionalities (although still rough) are available, we launched a call for test uploads, and immediately received several enthusiastic responses.
  • Lozana also organized and held five structured user tests in interview form, to help us discover the major UX hurdles in the current (still very rough) upload workflow, and to prioritize the upcoming work for file uploads in July-October.
  • Batch SDC editing in OpenRefine was demonstrated during Wikidata Lab XXXIV, organized by Wiki Movimento Brasil.

July 2022

Summary of user tests in June 2022.

  • In OpenRefine's Wikimedia Commons extension, we implemented a feature that displays each fetched file's M-id, along with a column listing each file's related categories.
  • Lozana published a report of last month's user tests, and created various follow-up issues on GitHub. As a team, we also discussed prioritization of the last work before the end of this grant. We will focus on the following work, which we see as most impactful within the current grant period:
    • Finishing the Wikimedia Commons extension with its support for starting projects from Commons categories
    • Improving schema building, implementing the notion of schema templates - making sure that OpenRefine users can work with pre-defined data models (such as artwork, book, art photo) and build their own for sharing with others
    • Including thumbnail previews of files in OpenRefine's grid (project) view
  • OpenRefine 3.6.0 was released: the first official release which natively supports batch editing Wikimedia Commons files!
  • Following this 3.6.0 release, OpenRefine's info page on Wikimedia Commons was overhauled to become a landing page for documentation. It now also features a subpage with a step-by-step tutorial for batch editing Wikimedia Commons files with OpenRefine, and links to temporary instructions for file uploading (in a Google doc).
  • We started work on schema templates (mentioned above).
  • Sandra followed up with several test uploaders, and held info sessions with a group of American art librarians (ARLIS; focused on data modeling in SDC) and with New Zealand Wikimedians and GLAM staff (demo of file uploading in OpenRefine).
  • At the time of writing this monthly report, more than 17,000 files have already been uploaded to Commons with OpenRefine. In July 2022, files uploaded with OpenRefine have received over 200,000 views on Wikimedia projects.

August 2022

September 2022

  • The OpenRefine team is preparing for the last sprint of work before this grant ends. During a small team meeting in Berlin, we discussed prioritization for the last two months.
  • We made great strides in including first support for an often-requested feature: schema templates, or the ability to select 'templates' for popular data models, such as digitized artworks, photos of 3D artworks, and books.
  • In OpenRefine's Wikimedia Commons extension, we continued work on code restructuring and on retrieval of categories of requested files, and started work on 'category depth' fetching. We have also started working on some last user interface improvements there.
  • Sandra attended the European GLAMwiki coordinators meeting in Prague, and gave an OpenRefine workshop there. During a (different) session on tool prioritization, OpenRefine emerged as a tool that receives high priority from the GLAMwiki coordinators.
  • Lozana has started conducting a final round of user tests, with a focus on testing our new features for schema templates, and general functionality of the Commons extension.
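
The 'category depth' fetching mentioned above can be read as collecting files from a category and from its subcategories down to a chosen number of levels. The Python sketch below shows one way to do that with a breadth-first traversal. It is an illustration under stated assumptions, not the extension's implementation: the function name is hypothetical, and the two dictionaries stand in for lookups that would normally come from the Commons API.

```python
from collections import deque


def files_up_to_depth(root, depth, subcategories, files):
    """Collect files in `root` and its subcategories, `depth` levels down.

    depth=0 returns only the files directly in `root`.
    `subcategories` maps a category to its child categories;
    `files` maps a category to the files directly in it.
    """
    seen = {root}
    queue = deque([(root, 0)])
    collected = []
    while queue:
        category, level = queue.popleft()
        collected.extend(files.get(category, []))
        if level == depth:
            continue  # do not descend below the requested depth
        for child in subcategories.get(category, []):
            if child not in seen:  # category graphs can contain cycles
                seen.add(child)
                queue.append((child, level + 1))
    # A file can sit in several categories; keep the first occurrence only.
    return list(dict.fromkeys(collected))
```

Tracking visited categories matters because Commons categories do not form a strict tree: a category can be reachable along several paths, or even be its own descendant.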

October 2022


Extension request (15 extra days for final report)

New final report due date

December 15, 2022 (15 days extension for submitting the project's final report)

Rationale

The work on this project is finished! But I would like to ask for a two-week extension for submitting our final report. We are still in the process of collecting all expenses. Because it is the busy end of the year, OpenRefine's fiscal sponsor hasn't been able to collect a full financial report yet.

Feel free to contact me with any questions you may have! With kind regards, SFauconnier (talk) 17:55, 23 November 2022 (UTC)

Approval

Hello SFauconnier, Please note that we have updated your final report due date to 15 December 2022. Thank you! -- JTud (WMF), Grants Administrator (talk) 19:11, 23 November 2022 (UTC)

Clarification and additional question

Hello JTud (WMF), I wanted to ask an additional question. Two final invoices for this project have been submitted in November 2022 (as we have been working until the very end of October 2022 and a few invoices have arrived afterwards). I silently assumed that our 15-day reporting extension would allow this, but I now realize that that assumption is premature. Is it OK for us to include these invoices still, and hence submit a financial report that runs until November 30, 2022? Many thanks for your consideration, and with kind regards! SFauconnier (talk) 16:39, 12 December 2022 (UTC)

Hi SFauconnier, yes, it is fine to include the final invoices in your upcoming report. I have also amended your grant's end date to allow for the bills received/paid later than original October grant end date. Thanks for checking. -- JTud (WMF), Grants Administrator (talk) 17:34, 12 December 2022 (UTC)