Grants:Project/DPLA/Extending the DPLA digital asset pipeline to improve quality and discoverability/Midpoint

This project is funded by a Project Grant

Report accepted

This midpoint report for a Project Grant approved in FY 2020-21 has been reviewed and accepted by the Wikimedia Foundation.

To read the approved grant submission describing the plan for this project, please visit Grants:Project/DPLA/Extending the DPLA digital asset pipeline to improve quality and discoverability.
You may still review or add to the discussion about this report on its talk page.
You are welcome to email projectgrants@wikimedia.org at any time if you have questions or concerns about this report.

Welcome to this project's midpoint report! This report shares progress and learning from the first half of the grant period.

Summary

We conducted extensive Structured Data Statements modeling and circulated our plans within the community for feedback.
We wrote Structured Data Statements for all existing files uploaded by DPLA currently in Wikimedia Commons.
We integrated readiness statistics into our partner analytics dashboard to guide partners' planning for uploads.
We've built a prototype for image citations based on structured data statements and shared it with the community.

Methods and activities

Our project activities have mainly been accomplished by DPLA Data Fellow Dominic Byrd-McDevitt and DPLA Senior Software Engineer Scott Williams.
Most of the work in the first half of the grant period has been planning and implementation for Structured Data Statements and image citations.
The second half of the grant period will be devoted to putting data synchronization into production.

Midpoint outcomes

We circulated our modeling plans with the community: https://commons.wikimedia.org/wiki/Commons:Digital_Public_Library_of_America/Modeling
We’ve created over 27 million structured data statements across 2,334,599 items, including statements about copyright status, copyright license, RightsStatements.org statement, creators, subjects, identifiers, contributing institutions, description, title, and collection. Each statement has been designed with predicates and references that identify the originating institution as its source and link to the DPLA item page. This identifies which properties were added by DPLA Bot and helps us interoperate peacefully with properties contributed by other means.
We helped design the “original catalog description” and “Commons media contributed by” properties on Wikidata.
We redesigned the file info box for DPLA items in Wikimedia Commons to draw from Structured Data Statements rather than duplicative wikitext.
We shared a prototype of an image citation that drew from structured data statements with the community.
We worked with community members to design a new metadata template using structured data rather than wikitext.
We identified macro functionality, similar to functions already available for Wikidata, that will need to be created for structured data statements to support the image citations, and we shared that information with the WMF.
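
The role of references described above can be illustrated with a minimal Python sketch. It is not DPLA Bot's code: the `Statement` and `Reference` classes, the `added_by_dpla` helper, and the `https://dp.la/item/` prefix check are all assumptions made for illustration. The idea is only that a statement whose reference links to a DPLA item page can be told apart from statements contributed by other means.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumption for illustration: DPLA item pages share this URL prefix.
DPLA_ITEM_PREFIX = "https://dp.la/item/"


@dataclass(frozen=True)
class Reference:
    institution: str    # originating institution named in the reference
    dpla_item_url: str  # link back to the DPLA item page


@dataclass(frozen=True)
class Statement:
    prop: str                              # e.g. "title", "creator"
    value: str
    reference: Optional[Reference] = None  # None for statements without a reference


def added_by_dpla(statement: Statement) -> bool:
    """True if the statement's reference links to a DPLA item page."""
    ref = statement.reference
    return ref is not None and ref.dpla_item_url.startswith(DPLA_ITEM_PREFIX)


def split_by_origin(statements: List[Statement]) -> Tuple[List[Statement], List[Statement]]:
    """Separate DPLA-sourced statements from those contributed by other means."""
    ours = [s for s in statements if added_by_dpla(s)]
    others = [s for s in statements if not added_by_dpla(s)]
    return ours, others
```

A bot following this pattern would only ever modify the first list, leaving statements added by other editors untouched.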

Finances

DPLA has spent funds as indicated in the proposal, with the exception of the modification that we notified Chris Schilling of on September 14, 2021, and for which we subsequently received formal approval.

Learning

The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you are taking enough risks to learn something really interesting! Please use the below sections to describe what is working and what you plan to change for the second half of your project.

What are the challenges

A major challenge for DPLA is attempting to impose some level of permanence on data that is outside of our domain. DPLA does not edit or exercise editorial control over the records we handle, but instead harvests them from third parties. As a result, we have no direct visibility into when data are modified or moved. This creates challenges throughout our pipeline, which we handle where possible.
Another challenge is managing the reaction to the contributions the DPLA community makes to Wikipedia by embedding uploaded images in articles. We’ve received pushback from one community member in particular that does not seem helpful, given that our community members are new to editing articles and, in many cases, are putting images in articles where there wasn’t one before.

What is working well

DPLA has proceeded with a discussion-based approach to modeling, which has allowed us to add structured data on Commons (SDC) iteratively. The community could have frowned upon this approach, because we did not propose a settled data model before requesting bot approval. However, although we have not received extensive commentary on our data model so far, we have also not received complaints, and we have provided a clear forum, linked in the summary of every edit we make, where editors can discuss any issues with the modeling.
DPLA’s iterative approach to adding and modifying SDC statements over time is novel: most Wikimedia Commons uploads are not changed by their GLAM institution after upload. This has been a useful approach that could be replicated by others, and SDC makes it easier in many ways to maintain one’s data in Commons.

Next steps and opportunities

The main focus for the remainder of the work is creating the synchronization pipeline, so that updated metadata records produce updated metadata on Commons.
We have had early success with creating a synchronization process that reads the current state of the object in Commons, queries the DPLA API for the new state, and deletes and recreates the properties that need modification.
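
The read, compare, delete, and recreate cycle described above can be sketched as a small planning function in Python. This is an illustrative sketch, not the production pipeline: `plan_sync` and the property-to-values dictionaries are hypothetical, and fetching the current state from Commons or the new state from the DPLA API is left out. It assumes both states have already been reduced to a mapping from property name to a set of values.

```python
from typing import Dict, Set, Tuple

PropertyMap = Dict[str, Set[str]]


def plan_sync(current: PropertyMap, desired: PropertyMap) -> Tuple[PropertyMap, PropertyMap]:
    """Return (to_delete, to_create) for properties whose values differ.

    current: property -> values currently on Commons (DPLA-added statements only)
    desired: property -> values in the latest DPLA record
    """
    to_delete: PropertyMap = {}
    to_create: PropertyMap = {}
    for prop in sorted(set(current) | set(desired)):
        old = current.get(prop, set())
        new = desired.get(prop, set())
        if old == new:
            continue  # property is already in sync; leave it alone
        if old:
            to_delete[prop] = set(old)  # remove the stale values
        if new:
            to_create[prop] = set(new)  # recreate from the new record
    return to_delete, to_create


current = {"title": {"Old title"}, "creator": {"A"}, "subject": {"x"}}
desired = {"title": {"New title"}, "creator": {"A"}, "description": {"d"}}
to_delete, to_create = plan_sync(current, desired)
```

Here the unchanged `creator` property is skipped, `title` is deleted and recreated, `subject` is only deleted, and `description` is only created. Deleting and recreating a whole property keeps the logic simple at the cost of a few extra edits when only one of several values changes.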

Grantee reflection

We’d love to hear any thoughts you have on how the experience of being a grantee has been so far. What is one thing that surprised you, or that you particularly enjoyed from the past 3 months?