GLAM Wiki 2023/Program/DPLA's Digital Asset Pipeline: How we uploaded 4 million images of cultural heritage to Commons (so far)

ID: 2306
Facilitators/Speakers: Dominic Byrd-McDevitt

This talk will be an in-depth technical treatment of the Digital Public Library of America's digital asset pipeline—which is responsible for uploading an estimated 4 million images to Wikimedia Commons by November, as well as adding nearly 100 million structured data statements. DPLA is a national aggregator of cultural heritage metadata in the United States. This project has allowed DPLA to become the largest overall contributor to Wikimedia Commons and to generate hundreds of millions of pageviews for its participating institutions. This presentation is a companion to DPLA's other proposal, which primarily discusses issues of strategy and movement capacity relating to the program—this proposal is specifically about how the technology actually works.

I will provide an overview of DPLA's organizational structure and its aggregation initiative, which makes all this possible. I will give a walkthrough of the DPLA Wikimedia Commons project and how it works. I will then spend the bulk of the presentation discussing the actual operation of our Wikimedia account and how we have accomplished what we have. Our bot is a set of Python scripts built on Pywikibot. We run the bot on an AWS server, with a script that uses aggregated metadata from our partners to determine which items are eligible for Commons and downloads their assets to S3. We must also transform the data from DPLA's data model to wikitext for upload, using a crosswalk. This wikitext is becoming increasingly ephemeral (and hopefully someday unnecessary) as we transition to Structured Data on Commons. A separate data synchronization script runs periodically across all of DPLA's uploads, adding or updating metadata from the source in the form of structured data statements, so that the data can be displayed in Lua-powered templates.
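The crosswalk and structured-data steps described above can be sketched roughly as follows. This is a minimal, hypothetical illustration: the record fields, the `{{Photograph}}` template, and the property mapping are assumptions for the sketch, not DPLA's actual crosswalk code.

```python
# Hypothetical sketch of two pipeline steps:
# (1) crosswalking a simplified DPLA-style metadata record to Commons wikitext;
# (2) building a Structured Data on Commons (SDC) claims payload.
# All field names and the template choice are illustrative assumptions.

def record_to_wikitext(record: dict) -> str:
    """Crosswalk a simplified record to a {{Photograph}}-style file description."""
    src = record["sourceResource"]
    return "\n".join([
        "=={{int:filedesc}}==",
        "{{Photograph",
        f" |title = {src.get('title', '')}",
        f" |description = {src.get('description', '')}",
        f" |source = {record.get('isShownAt', '')}",
        f" |institution = {record.get('dataProvider', '')}",
        "}}",
    ])

def record_to_sdc_claims(record: dict, prop_map: dict) -> dict:
    """Build a claims payload (wbeditentity-style JSON) from mapped string values."""
    claims = {}
    for field, prop in prop_map.items():
        value = record.get(field)
        if value:
            claims.setdefault(prop, []).append({
                "mainsnak": {
                    "snaktype": "value",
                    "property": prop,
                    "datavalue": {"value": value, "type": "string"},
                },
                "type": "statement",
                "rank": "normal",
            })
    return {"claims": claims}

# Example record (invented for illustration)
record = {
    "sourceResource": {"title": "Example photograph", "description": "A test record"},
    "isShownAt": "https://example.org/item/123",
    "dataProvider": "Example Library",
}
wikitext = record_to_wikitext(record)
# P973 = "described at URL" on Wikidata
payload = record_to_sdc_claims(record, {"isShownAt": "P973"})
```

In a real pipeline the wikitext and claims payload would then be submitted through Pywikibot (file upload plus a `wbeditentity` edit on the file's MediaInfo entity); authentication, asset download from S3, and error handling are omitted here.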

I hope this case study will provide insights for others trying to replicate any piece of this workflow in their own projects.

Participants will leave the session with:

1. An understanding of the technical aspects of bulk Wikimedia Commons uploads from GLAM collections
2. Knowledge of how to add cultural heritage metadata as Structured Data on Commons and run continuous updates
3. A sense of how iterative approaches allow technical projects to scale up over time

Experience level: Advanced
Keywords: Content uploads & workflows, Free, Libre & Open Source Software (FLOSS) for cultural heritage, Tech, platforms & tools
Notes: #GLAMWiki232306
Next session: Wikisource workshop