Grants:DPLA 2022 Structured Data Across Wikimedia project

From Meta, a Wikimedia project coordination wiki


Welcome to the final report for DPLA's 2022 SDAW-funded project! This report shares the outcomes, impact and learnings from the grantee's project. This project was not a Project Grant, or part of the Wikimedia grants program. It was funded by the Wikimedia Foundation's Structured Data Across Wikimedia (SDAW) project. This page reuses the format of Project Grant reports, so some of the prompts may use irrelevant language referencing grants.

Summary

As a result of this project, DPLA improved its approach to structured data by adding large numbers of statements that use Wikidata entities and by refining its process for adding subject and depicts statements. We also created a tool to help users quickly add depicts statements; it can be applied to all Commons media with subject statements (and could be extended to other recommendation logic). DPLA has improved the documentation of its upload pipeline and conducted outreach at many professional conferences and with peer institutions.

Project Goals

Goals
  1. DPLA will develop a path for one or more contributing institutions or hubs to share more of their descriptive metadata with DPLA and Wikimedia that could be reconciled with Wikidata entities. This will yield more descriptive structured data, so that more items appear in search results on Commons.
    DPLA introduced these changes to the DPLA data model last fall, for the first time allowing institutions to provide subject term URIs (using SKOS exactMatch) rather than simple text strings. We now ingest all of our subjects from the US National Archives as URIs for the NARA authority file, which allowed us to add these as SDC statements for the terms which have already been matched to their items on Wikidata.
  2. To address visibility for records that have irreconcilable descriptive metadata, DPLA will develop a tool for suggesting depicts statements based on metadata subjects and evangelize that the tool be used by DPLA’s community to improve search on Commons.
    DPLA developed DepictAssist, a tool based on Special:SuggestedTags which uses subjects to suggest potential depicts statements. The initial iteration searched Wikidata items using the subject string. After gathering user feedback, DPLA also began work on an additional approach that suggests pre-reconciled subject terms (matched with Wikidata items using OpenRefine) to users. This reconciliation work will also help our aggregation, as we plan to add it to the metadata just like the institution-provided subject URIs.
  3. To leverage DPLA’s detailed modeling of sources in structured data records, DPLA will coordinate with the community to create a Wikipedia citation gadget. While this gadget will follow the data modeling used by DPLA for their contributing institutions, DPLA will document the gadget sufficiently to allow for customisation for use by other institutions should they desire to undertake similar work.
    DPLA's Data Fellow, Dominic Byrd-McDevitt, was involved in the creation of View it!. While a citation gadget has not been created, per se, we are closer to that goal, and also hopeful that if the Wikimedia Foundation implements SDC access across all wikis, this will become a more enticing project for the community to work on. DPLA plans to complete an initial batch of adding citations to images in the near future.
  4. DPLA will create process documentation and conduct outreach to other national-scale aggregation projects to share DPLA’s successes and learnings and to advocate for more contributions to Commons globally.
    DPLA's digital asset pipeline is now documented in detail on GitHub. In addition, DPLA has conducted outreach at conferences such as Wikimania, the Wikimedia & Libraries conference, and the American Library Association, as well as in many meetings with key peer institutions, such as the US National Archives, the Smithsonian Institution, and the Biodiversity Heritage Library. We also formed the Wikimedia Working Group, which brings together key players from the DPLA network to help direct efforts and improve documentation.

Project Impact

Important: The Wikimedia Foundation is no longer collecting Global Metrics for Project Grants. We are currently updating our pages to remove legacy references, but please ignore any that you encounter until we finish.

Targets

  1. In the first column of the table below, please copy and paste the measures you selected to help you evaluate your project's success (see the Project Impact section of your proposal). Please use one row for each measure. If you set a numeric target for the measure, please include the number.
  2. In the second column, describe your project's actual results. If you set a numeric target for the measure, please report numerically in this column. Otherwise, write a brief sentence summarizing your output or outcome for this measure.
  3. In the third column, you have the option to provide further explanation as needed. You may also add additional explanation below this table.
Planned measure of success (include numeric target, if applicable) | Actual result | Explanation
At least 1 million DPLA-contributed images updated with subject, creator, or other reconciled entities. | At latest count, 3,618,323 subject statements have been added. | In this respect, we greatly exceeded the expected amount. Moreover, we built a lot of infrastructure that allows us to continue this work, having integrated it into the regular data synchronization process. We maintain a public JSON file that can be edited to add or modify subject reconciliations. Additionally, the Wikidata URIs for these subject matches will eventually be added into the DPLA aggregation data itself.
DPLA will launch a tool for adding "depicts" statements and evangelize it to our network. | DPLA launched DepictAssist and is continuing to develop it. | We consider this part of DPLA’s services, and are continuing to maintain and develop it as we anticipate pushing it out to broader audiences. The tool, in various iterations, was shared in many forums, including a meeting of the LD4 Wikidata Affinity Group, the DPLA Wikimedia Working Group, and an in-person meeting of 8 IUPUI University librarians. This feedback led to improvements and to documentation of other design needs, which will be addressed in the following months.
35K images on Wikipedia with SD image citations. | This item was not done, but we are hopeful to see progress in the coming months. |
Pipeline, subject-depicts tool, and citation gadget fully documented on wiki. | Documentation was created for all of our services except the citation gadget (see elsewhere). | DPLA's digital asset pipeline is fully documented on our GitHub account, including the DPLA ingestion repo, which documents how Wikimedia markup is generated from item records, and ingest-wikimedia, which documents the upload and metadata synchronization.
Citation template and/or gadget shared at 3 Wikimedia conferences or meetings, reaching >100 people. | We evangelized the concept to over 100 people. | This was discussed at Wikimania, WikiConference North America, LD4, DPLA Wikimedia Working Group and other network meetings, and elsewhere. DPLA has done considerable work to create energy around the idea of SDC-powered image citation, but it is not yet a reality in actual practice.
DPLA will conduct outreach to other regional and national aggregators to share our success and learnings and to encourage them to engage in similar programs. | This was done continuously over the course of the funded project. | DPLA has conducted outreach at conferences such as Wikimania, the Wikimedia & Libraries conference, LD4, and the American Library Association, as well as in many meetings with key peer institutions, such as the US National Archives, the Smithsonian Institution, and the Biodiversity Heritage Library. We also formed the Wikimedia Working Group, which brings together key players from the DPLA network to help direct efforts and improve documentation.


Story

Looking back over your whole project, what did you achieve? Tell us the story of your achievements, your results, your outcomes. Focus on inspiring moments, tough challenges, interesting anecdotes or anything that highlights the outcomes of your project. Imagine that you are sharing with a friend about the achievements that matter most to you in your project.

  • This should not be a list of what you did. You will be asked to provide that later in the Methods and Activities section.
  • Consider your original goals as you write your project's story, but don't let them limit you. Your project may have important outcomes you weren't expecting. Please focus on the impact that you believe matters most.

One of the best stories is the way in which this work benefits the broader ecosystem of linked open cultural heritage data.

We often find that the value of DPLA's Wikimedia work for its contributors helps other stalled initiatives progress as well. For example, for 10 years, DPLA's aggregation has collected institution names and subject terms without any entities. This data is often created on the provider's end with the use of authority files and stored with URIs. But DPLA never previously invested in implementing entities in its data model, and aggregated simply by ingesting all of these types of terms as string values. This was always a problem without an obvious enough rationale to spend valuable engineering resources on it.

In this project, we took advantage of the SDAW funding to meet a goal by solving a long-standing problem for DPLA. We have been addressing the goal of adding subjects to Commons media not simply by matching strings in DPLA's data to Wikidata items, but by working with our actual providers to retrieve their URIs. DPLA has implemented SKOS exactMatch for URIs in our data model so that these can be ingested and then matched exactly with Wikidata items. These entities can also have any number of other benefits, such as allowing DPLA users to facet on subject terms in the future. These changes have already affected millions of DPLA's items. In this way, work that was spurred and justified by the impact of Wikimedia engagement is having effects on the data and DPLA's users even outside of this immediate project.

Survey(s)

If you used surveys to evaluate the success of your project, please provide a link(s) in this section, then briefly summarize your survey results in your own words. Include three interesting outputs or outcomes that the survey revealed.

Learning

The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you took enough risks in your project to have learned something really interesting! Think about what recommendations you have for others who may follow in your footsteps, and use the below sections to describe what worked and what didn’t.

What worked well

What did you try that was successful and you'd recommend others do? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your finding in the form of a link to a learning pattern.

DPLA has learned over the last year that aggregation works very well with Commons and Structured Data, and our approach to data synchronization has allowed us to make iterative changes that can be implemented quickly and have high impact. For example, at the start of this project, there were no DPLA-added subject (P921) statements in our uploads, and by the end there were millions. We accomplished this by utilizing our data synchronization script, previously developed, to make updates to past uploads. We implemented new logic in the script to add the statements when the subject is one that we have already identified. As part of this project, we began keeping a database of subject mappings for Wikidata, which the synchronization script utilizes to add subjects to Wikimedia Commons files.

This approach means a publicly editable mapping document is continually checked as data is regularly updated, and these new subjects were added rapidly as passes were made for the bot to update all other metadata for the media files as well. Changes or additions made to the mapping will be automatically reflected in future synchronizations, as well. This document is not only open on GitHub, but was primarily developed by Evan Robb, a librarian who does not work for DPLA, but works for one of our main providers. This also shows how aggregation gives everyone in the network a reason to spend time on efforts that benefit the collective, and not just their own institution—and that Wikimedia projects can foster that as well.
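The mapping-driven step described above can be sketched roughly as follows. This is a minimal illustration, not DPLA's synchronization script: the mapping format (subject string to Wikidata QID), the function names, and the sample entries are assumptions made for the example. P921 is the subject property named in this report, and the statement dictionary follows the standard Wikibase JSON shape; actually writing to Commons is left out.

```python
import json

MAIN_SUBJECT = "P921"  # subject property named in this report


def load_mapping(text):
    """Parse a JSON object of {subject string: Wikidata QID} (assumed format)."""
    mapping = json.loads(text)
    return {term.strip(): qid for term, qid in mapping.items()}


def build_statement(qid, prop=MAIN_SUBJECT):
    """Build one statement in Wikibase JSON form for an item-valued property."""
    return {
        "mainsnak": {
            "snaktype": "value",
            "property": prop,
            "datavalue": {
                "value": {
                    "entity-type": "item",
                    "id": qid,
                    "numeric-id": int(qid[1:]),
                },
                "type": "wikibase-entityid",
            },
        },
        "type": "statement",
        "rank": "normal",
    }


def plan_subject_statements(subjects, mapping, existing_qids=()):
    """Return statements for mapped subjects that the file does not have yet.

    Unmapped subjects are skipped, so adding a mapping entry later is enough
    for the next synchronization pass to pick the subject up.
    """
    seen = set(existing_qids)
    statements = []
    for subject in subjects:
        qid = mapping.get(subject.strip())
        if qid is None or qid in seen:
            continue
        seen.add(qid)
        statements.append(build_statement(qid))
    return statements


# Illustrative mapping only: Q146 (house cat) and Q144 (dog) on Wikidata.
EXAMPLE_MAPPING = load_mapping('{"Cats": "Q146", "Dogs": "Q144"}')

planned = plan_subject_statements(
    ["Cats", "Dogs", "Unmapped term"],
    EXAMPLE_MAPPING,
    existing_qids=["Q144"],  # pretend the file already has this subject
)
print([s["mainsnak"]["datavalue"]["value"]["id"] for s in planned])  # ['Q146']
```

Because the plan is computed from the current mapping and the statements already on the file, rerunning it is safe: nothing is duplicated, and new mapping entries take effect on the next pass.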

What didn’t work

What did you try that you learned didn't work? What would you think about doing differently in the future? Please list these as short bullet points.

The main work yet to be done is the SDC-powered image citations. We believed that through this project we could generate enough interest in the idea that the community might take up the cause, if we did not do it ourselves. This has not been successful, though we do believe the work of View it! brings this very close to reality, since it allows in theory for images to be added by a user script or gadget that populates the caption using structured data statements; it just doesn’t currently format these with a citation template. View it! itself, or code derived from it, could be used to allow users to add Commons media with templated citations. The fact that SDC data cannot be accessed from other wikis also makes solutions more difficult: SDC cannot be added by the template itself, so a user script needs to be developed to import it, and the result is then only a snapshot rather than dynamic. We evangelized around this issue in particular. It was raised at T238798 and in the Community Wishlist (where it came in #8 in the "Multimedia and Commons" category). We will continue to work on automated ways to add the citation template from SDC, but would like to roll it out to the community in a thoughtful manner. We aim to make changes to a smaller number of pages, such as 100 or fewer, by the end of June, and then to further gauge the community's temperature on this proposal before adding more in the next 2-3 months.

We also had more challenges in developing DepictAssist than we originally hoped, and got started on that work later than was planned. While the actual stated goal was met, we know this is something we will continue working on before promoting it more widely. There were several causes for this. One was that we discovered it was more challenging than initially realized to provide good recommendations via Wikidata search. Using the MediaWiki API with keywords is difficult because Wikidata does not have any search stemming, while most cultural heritage subject terms are capitalized, plural, and often have broader and narrower terms in the same subject. We tried matching URIs with the OpenRefine endpoint, but this is also difficult, since Wikidata does not store the full URI, only the identifier, for most properties. Finally, these problems that make it difficult to have 100% accuracy are compounded by the experiences we’ve had with the Wikimedia Commons community. There are still unsettled issues around depicts statements and whether to use broader terms when a subclass is already used, for example. We have already received reverts and (unfounded) warnings from Commons editors with strong opinions, for edits we know to be valid. And while the goal of the project is non-controversial (simply to improve descriptive metadata for discoverability), the attitudes of many Commons editors related to SDC issues mean we now worry about sending users into a contested space where they will have a hostile interaction.
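To make the two matching difficulties above concrete, here is a small sketch of the kind of preprocessing they imply. It is an assumption-laden illustration rather than DepictAssist's code: the function names are hypothetical, the "--" separator for compound subject headings is an assumed convention, the singularization rule is deliberately crude, and the example.org URI is a placeholder.

```python
from urllib.parse import urlparse


def naive_singular(word):
    """Very rough singularization; a real tool would need a proper stemmer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def search_variants(term):
    """List keyword variants to try when the search backend does no stemming.

    The full term comes first, then each "--"-separated part (assumed
    convention for compound headings) as written, lowercased, and singularized.
    """
    term = term.strip()
    variants = [term]
    for part in term.split("--"):
        part = part.strip()
        if not part:
            continue
        lowered = part.lower()
        words = lowered.split()
        singular = " ".join(words[:-1] + [naive_singular(words[-1])])
        for variant in (part, lowered, singular):
            if variant not in variants:
                variants.append(variant)
    return variants


def identifier_from_uri(uri):
    """Reduce an authority URI to its bare identifier (last path segment).

    Wikidata external-ID values hold the identifier rather than the full URI,
    so a URI has to be reduced before it can be compared. Identifiers that
    themselves contain dots would need different handling.
    """
    path = urlparse(uri).path.rstrip("/")
    last_segment = path.rsplit("/", 1)[-1]
    return last_segment.split(".")[0]


print(search_variants("Lighthouses"))
# ['Lighthouses', 'lighthouses', 'lighthouse']
print(identifier_from_uri("https://example.org/authority/subjects/ab12345.html"))
# ab12345
```

Generating variants widens recall at the cost of more candidate items to review, which is one reason pre-reconciled mappings are attractive for high-volume subjects.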

Additionally, both of these parts of the project would have had better outcomes with development work from someone better versed in front-end development, or with more developer time in general. However, DPLA lost both its Director of Technology and one software developer (out of two) during the project, which left fewer technical resources available than at the start.

Other recommendations

If you have additional recommendations or reflections that don’t fit into the above sections, please list them here.

Next steps and opportunities

Are there opportunities for future growth of this project, or new areas you have uncovered in the course of this funded project that could be fruitful for more exploration (either by yourself, or others)? What ideas or suggestions do you have for future projects based on the work you’ve completed? Please list these as short bullet points.

DPLA will continue and deepen its relationship with the Wikimedia community. As a direct result of the work of this project and previous Wikimedia Foundation grants, DPLA was able to secure further funding from the Sloan Foundation for more Wikimedia programs over the next three years, as announced in our recent press release. We see that work as a continuation of the work started with the SDAW funding (and we recognize both are Sloan-funded), but with a broader scope than structured data. As such, we plan to maintain and continue to develop DepictAssist, continue to reconcile and add subjects to DPLA uploads, work on the SDC-powered citation concept, and continue to be active in global outreach with peer institutions.

Finances

Actual spending

Please copy and paste the completed table from your project finances page. Check that you’ve listed the actual expenditures compared with what was originally planned. If there are differences between the planned and actual use of funds, please use the column provided to explain them.

Expense Approved amount Actual funds spent Difference
Software Development 20,000 20,000 0
Data Processing 10,000 10,000 0
Community Coordination & Outreach 10,000 10,000 0
Indirect 10,000 10,000 0
Total 50,000 50,000 0


Remaining funds

Do you have any unspent funds from the grant? Please answer yes or no. If yes, list the amount you did not use and explain why.

No.

If you have unspent funds, they must be returned to WMF. Please see the instructions for returning unspent funds and indicate here if this is still in progress, or if this is already completed:

N/A

Documentation

Did you send documentation of all expenses paid with grant funds to grantsadmin(_AT_)wikimedia.org, according to the guidelines here? Please answer yes or no. If no, include an explanation.

Confirmation of project status

Did you comply with the requirements specified by WMF in the grant agreement? Please answer yes or no.


Is your project completed? Please answer yes or no.