Grants:Project/MHz Curationist/Building a sustainable system that unlocks museum metadata for Wikidata use



status: not selected
summary: We will unlock open access museum metadata by aggregating the data of multiple museum data sources into a single dataset, then building a bot that performs regular bulk data imports to new or existing Wikidata items.
target: Wikidata
amount: $81,000
nonprofit: yes
advisor: Dominic, Lori, Lee
contact: dominic@1909digital.com
organization: MHz Foundation
created on: 04:19, 17 March 2021 (UTC)


Project idea

What is the problem you're trying to solve?

What problem are you trying to solve by doing this project? This problem should be small enough that you expect it to be completely or mostly resolved by the end of this project. Remember to review the tutorial for tips on how to answer this question.


Museums are increasingly embracing open access, but not all of this open access metadata is making its way into Wikidata. In 2016, a study identified over 50 major cultural institutions that had adopted open access practices,[1] and this trend has only accelerated more recently, with major open access launches from the Metropolitan Museum of Art in 2017,[2] the Art Institute of Chicago in 2018,[3] the Cleveland Museum of Art in 2019,[4] and Paris Musées[5] and the Smithsonian Institution[6] in 2020, just to name a few. These projects typically involve not just a change in the institution's copyright policies or terms of use; they are also accompanied by a corresponding release of collections metadata, via an open dataset publication or a new API.

In principle, all this data is ripe for harvesting into Wikidata, and doing so aligns with work the Wikidata community is already pursuing (e.g. Wikidata:WikiProject sum of all paintings). But the process is difficult, requiring either immense volunteer effort on the part of the Wikimedia community or, for institutions attempting to work on Wikidata themselves, a steep learning curve in a community and technical landscape they poorly understand. The labor cost is high because importing large datasets into Wikidata involves (1) a significant amount of intellectual work, including data mapping by someone familiar enough with Wikidata properties; (2) technical resources to develop the actual data import workflow (e.g., a Pywikibot script to import or edit Wikidata items via the API); and (3) sufficient understanding of the Wikimedia community to navigate the processes and discussions necessary for bulk imports and running bots. Because each open access museum is starting from a different data model, this process must be recreated each time, which makes the work increasingly inefficient as it scales up and creates a high barrier to entry to Wikidata for institutions. In an ideal world, making collections open access should be enough to get them into Wikidata, but the technical tools and workflows necessary to make this possible do not yet exist.
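To illustrate the technical part of this labor, the following is a minimal sketch, not any institution's actual code, of the kind of one-off Pywikibot script such an import typically requires; the CSV layout, column names, and choice of property are illustrative assumptions:

# Minimal sketch of a one-off institutional import script using Pywikibot.
# The CSV layout, column names, and the chosen property are illustrative
# assumptions, not any institution's actual workflow.
import csv
import pywikibot

site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

with open("museum_export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        if not row.get("wikidata_qid"):
            continue  # unmatched works need a separate reconciliation step
        item = pywikibot.ItemPage(repo, row["wikidata_qid"])
        item.get()
        if "P217" in item.claims:  # P217 = inventory number
            continue  # avoid duplicating an existing statement
        claim = pywikibot.Claim(repo, "P217")
        claim.setTarget(row["accession_number"])
        item.addClaim(claim, summary="Adding inventory number from open access museum data")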

Notes

What is your solution to this problem?

For the problem you identified in the previous section, briefly describe how you would like to address this problem. We recognize that there are many ways to solve a problem. We’d like to understand why you chose this particular solution, and why you think it is worth pursuing. Remember to review the tutorial for tips on how to answer this question.


Open access cultural heritage metadata is vital to Wikidata.

We are proposing a project to bridge the gap between museums and Wikidata by developing a single, centralized pipeline from varied open access museum data sources to Wikidata. As a result, the data sources will be able to contribute their data about artworks, books, and other museum objects to Wikidata simply by making their datasets open, without each institution having to develop the technical solutions to do so on its own. This approach addresses the skills gap in the GLAM sector and removes barriers by introducing an intermediary aggregation service. This aggregation service will handle the ingestion of cultural heritage metadata into a single database, and will then operate an account on Wikidata to import or update items for those works.

This aggregation service will be built and maintained by the non-profit MHz Foundation for its flagship project, Curationist. This curated content platform for cultural heritage, like the Wikimedia community, is engaged in building open access resources that increase knowledge equity and cultural understanding. The idea behind Curationist is to fulfill the promise of open access by combining the efforts of many stakeholders across open access communities to do the most good. Because of this mission, contributing to Wikimedia projects is a core goal, and something we are building into our V2 web platform (currently under development) from the start.

The software and infrastructure development this grant would fund is not a standalone project, but a Wikimedia-facing component of a larger product, with other development already underway or planned. The site, when it launches in late 2021, will draw upon CC0 and CC BY content shared by GLAMs, as well as third-party aggregators and community content hubs, thereby providing users with centralized access to a wide range of cultural objects in one platform (curationist.org) for editorial, learning, and research purposes. Curationist has already performed, or is currently performing, much of the necessary work, such as developing a single data standard (using Wikidata as its controlled taxonomy), crosswalking data from multiple museums, and harvesting data and assets from museums into a single database. Because of this aggregation work, Curationist is uniquely positioned to benefit Wikidata as a single source that can provide data from many institutions.

The specific concept is to develop technology to connect the Curationist database (AWS DocumentDB, with Elasticsearch for indexing and search) to Wikidata. The major components of this project will be mapping the Curationist data standard to Wikidata properties, developing and implementing workflows that allow Curationist to reconcile entities from data sources with Wikidata IDs, creating bot code that can bulk import this data as new items and statements to Wikidata, and conducting the Wikidata outreach and communications necessary to receive buy-in and approval from the community.
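As a purely conceptual outline of how these components fit together (every function name and the record structure are placeholders, not Curationist's actual interfaces):

# Conceptual outline only; every name here is a placeholder, not Curationist's
# actual interface. Each step is elaborated in the Activities section below.
from typing import Optional

def map_to_wikidata(record: dict) -> dict:
    """Translate a Curationist-standard record into Wikidata property/value pairs."""
    raise NotImplementedError

def reconcile(record: dict) -> Optional[str]:
    """Return the QID of the matching Wikidata item, or None if no match is found yet."""
    raise NotImplementedError

def push_to_wikidata(qid: Optional[str], statements: dict) -> None:
    """Create a new item (if qid is None) or add the missing statements to an existing one."""
    raise NotImplementedError

def sync_record(record: dict) -> None:
    push_to_wikidata(reconcile(record), map_to_wikidata(record))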

With this model, Wikimedia Foundation funding will help set up an ongoing pipeline that will be sustained beyond the grant period, with the vision that new data sources will be added over time, which we believe makes this solution more impactful than a one-off bulk data import. This not only generates value for Wikimedia projects, but also improves the value proposition for museums going open access—by providing them easier access to an important platform for their metadata.


Project goals

What are your goals for this project? Your goals should describe the top two or three benefits that will come out of your project. These should be benefits to the Wikimedia projects or Wikimedia communities. They should not be benefits to you individually. Remember to review the tutorial for tips on how to answer this question.

  1. Increasing cultural heritage metadata on Wikidata by adding new statements and items.
  2. Building value for Wikidata/Wikimedia (and open access generally) in the cultural sector.

Project impact

How will you know if you have met your goals?

For each of your goals, we’d like you to answer the following questions:

  1. During your project, what will you do to achieve this goal? (These are your outputs.)
  2. Once your project is over, how will it continue to positively impact the Wikimedia community or projects? (These are your outcomes.)

For each of your answers, think about how you will capture this information. Will you capture it with a survey? With a story? Will you measure it with a number? Remember, if you plan to measure a number, you will need to set a numeric target in your proposal (i.e. 45 people, 10 articles, 100 scanned documents). Remember to review the tutorial for tips on how to answer this question.


  1. Goal 1: Increasing cultural heritage metadata on Wikidata [aligned directly to "Invest in Skills and Leadership Development"]
    • Output: This goal is all about producing new, high-quality content for Wikidata. Each institution that Curationist adopts as a data source can have its authoritative metadata about cultural heritage works imported to Wikidata. As a result of this development, Curationist will import museum metadata to Wikidata items for at least 200,000 metadata records.
    • Outcome: The data itself contributed during the grant period will greatly improve Wikidata's coverage of cultural heritage. This project's software development will set up an ongoing pipeline of cultural heritage metadata to Wikidata. Importantly, Wikimedia funding will be used not just to facilitate a one-off import of a single institution's data, but to develop a system that Curationist will continue to operate and maintain, and for which it will adopt additional data sources over time.

      In addition, the software development planned for this phase will set up the project for future enhancements that will have even more impact for Wikidata. If we are funded and complete this data pipeline to Wikidata, our next step will be to develop a tagging system (using, for example, "depicts" statements), in which user-generated content on the Curationist platform could be added to Wikidata, or as structured data (SDC) statements on Wikimedia Commons, using the same technology and Curationist bot.

  2. Goal 2: Building value for Wikidata/Wikimedia (and open access generally) in the cultural sector ["Manage Internal Knowledge"]
    • Output: An important aspect of this project is its use of aggregation, which allows metadata from any open access cultural heritage source to be imported to Wikidata once the data has been mapped. In the initial phase of the project completed with grant funding, this system will result in contributions of metadata from at least 3 data sources during the grant period—the Cleveland Museum of Art, the Smithsonian, and the Rijksmuseum at first—with additional contributions of metadata from Statens Museum for Kunst (SMK), the Walters Art Museum, the Brooklyn Museum of Art, Tāmaki Paenga Hira Auckland War Memorial Museum, Paris Musées, The Met, Sketchfab, the Digital Public Library of America, Wikimedia Commons, the Science Museum Group, and the Art Institute of Chicago.
    • Outcome: This project seeks to unlock open access collections that might not otherwise have been imported to Wikidata. We believe that the most important and unique aspect of the plan is that it is not being completed by any individual organization, but by an outside foundation collecting data from many sources. As the intermediary, Curationist will also serve as a third-party platform that curates and adds context to both museum content and user-generated data on Wikidata.

      Because we are undertaking technical work that can lower any cultural institution's barrier to entry to Wikidata, provided it publishes its data openly, this project is not just about providing content for Wikidata: we believe an important outcome will also be an improved value proposition for institutions considering releasing their data under open access terms and contributing to Wikidata. In all of these ways, we hope to pave the way for continued open access content releases and Wikidata contributions.

Do you have any goals around participation or content?

Are any of your goals related to increasing participation within the Wikimedia movement, or increasing/improving the content on Wikimedia projects? If so, we ask that you look through these three metrics, and include any that are relevant to your project. Please set a numeric target against the metrics, if applicable.

Our first goal's output is directly related to the third shared metric of improving content pages on Wikidata, and we have set a numeric target for that goal.


Project plan

Activities

Tell us how you'll carry out your project. What will you and other organizers spend your time doing? What will you have done at the end of your project? How will you follow up with people who are involved with your project?

The end goal of this project is a system that continuously provides regular updates of GLAM-sourced data for works in the Curationist database to the corresponding Wikidata items. This will require development work on the Curationist backend, development of a bot script for editing Wikidata, and community engagement.

Prep Work on Curationist Environment

Linking top-level works or authority records with Wikidata IDs will allow for an exchange of data between Curationist's museum sources and the broader information commons on Wikidata. The tasks involved in this prep work include:

Ensure Curationist content standard meets data-linking needs

  • A major selling point of the Curationist platform is that it is already doing the work of mapping museum metadata to a common standard and aggregating it into a single database. Grant funding would be put towards any additional data modeling necessary to use this standard for Wikidata linking, by storing QIDs for both top-level works and all controlled vocabulary terms.
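As a hypothetical illustration of this data modeling (field names and values are assumptions, not Curationist's actual content standard), a stored work record might carry QIDs at both levels like this:

# Hypothetical work record carrying Wikidata QIDs both for the top-level work
# and for controlled vocabulary terms. Field names and values are assumptions,
# not Curationist's actual content standard.
work_record = {
    "curationist_id": "example-0001",
    "title": "The Starry Night",
    "wikidata_qid": "Q45585",  # top-level link to the work's Wikidata item
    "source": {
        "institution": "Example Museum",
        "accession_number": "1941.123",  # illustrative value
    },
    "terms": [
        {"label": "painting", "wikidata_qid": "Q3305213"},
        {"label": "Vincent van Gogh", "wikidata_qid": "Q5582"},
    ],
}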

Develop technical workflows to accommodate data-linking needs

  • In addition to the ability to store Wikidata identifiers, the technology also needs to be developed in such a way that source data from museums can be ingested without a Wikidata identifier and then reconciled with Wikidata asynchronously. This will be done with a tool similar to OpenRefine and/or Mix'n'match, and as entities are identified, the identifiers will be batch-updated in the stored data record for the work.
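One possible automated complement to those tools is a SPARQL lookup against the Wikidata Query Service; the sketch below assumes the source record carries an accession number, and uses the collection (P195) and inventory number (P217) properties. The museum QID and accession number in the example call are illustrative.

# Sketch of one possible automated reconciliation step, complementing
# OpenRefine/Mix'n'match: look up an existing item by collection (P195) and
# inventory number (P217).
from typing import Optional
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def find_qid(collection_qid: str, accession_number: str) -> Optional[str]:
    query = (
        "SELECT ?item WHERE { "
        f"?item wdt:P195 wd:{collection_qid} ; "
        f'wdt:P217 "{accession_number}" . '
        "} LIMIT 1"
    )
    response = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "CurationistReconciliationSketch/0.1 (example)"},
    )
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    if not bindings:
        return None  # no match yet; the record stays unreconciled for a later pass
    return bindings[0]["item"]["value"].rsplit("/", 1)[-1]  # e.g. "Q12345"

# find_qid("Q657415", "1953.155")  # e.g. a work in the Cleveland Museum of Art (Q657415)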

Creation of Wikidata mapping document

  • Just as Curationist has developed mappings from the source institutions to its common metadata standard, a mapping will be created from that standard to Wikidata properties. This interoperability is what will allow for the data exchange with Wikidata, including both pushing out data that is not yet represented on Wikidata and pulling in new data from Wikidata.
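An illustrative fragment of what such a mapping could contain (the left-hand field names are assumptions about the Curationist standard; the Wikidata properties on the right are real):

# Illustrative fragment of a Curationist-to-Wikidata mapping. The left-hand
# field names are assumptions about the Curationist standard; the Wikidata
# properties on the right are real.
FIELD_TO_PROPERTY = {
    "title": "label",              # becomes the item label rather than a statement
    "creator": "P170",             # creator
    "date_created": "P571",        # inception
    "medium": "P186",              # made from material
    "object_type": "P31",          # instance of (e.g. Q3305213, painting)
    "holding_institution": "P195", # collection
    "accession_number": "P217",    # inventory number
}
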
Bot Development

Development of bot code

  • Using the Curationist–Wikidata mapping, Curationist will develop bot code (using a tool such as Pywikibot or WikidataIntegrator) that will add new items to Wikidata when appropriate, or add or update statements on existing Wikidata items for open access collections. The bot will be designed to add statements based on differences detected between a work's Wikidata item and the metadata aggregated from the source museum, and to meet any requirements set by the Wikidata community.
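A hedged sketch of that diff-based behaviour using Pywikibot (the statements structure, helper name, and edit summary are illustrative assumptions, not the bot's actual code):

# Hedged sketch of the diff-based bot behaviour using Pywikibot: only
# statements missing from the existing item are added, never overwritten.
import pywikibot

site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

def add_missing_statements(qid: str, statements: dict) -> None:
    """statements maps Wikidata property IDs to item QIDs or string values."""
    item = pywikibot.ItemPage(repo, qid)
    item.get()
    for prop, value in statements.items():
        if prop in item.claims:
            continue  # the community's existing data takes precedence
        claim = pywikibot.Claim(repo, prop)
        if isinstance(value, str) and value.startswith("Q") and value[1:].isdigit():
            claim.setTarget(pywikibot.ItemPage(repo, value))  # item-valued property
        else:
            claim.setTarget(value)  # string-valued property
        item.addClaim(claim, summary="Adding open access museum metadata via Curationist (sketch)")

# e.g. add_missing_statements("Q45585", {"P170": "Q5582"})  # creator: Vincent van Gogh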

Integration of bot with Curationist workflows

  • Logic will be developed to determine when the Curationist Wikidata bot is triggered and run. Once a record in the Curationist database for an artwork sourced from a museum has been ingested into the content standard, it will be linked to Wikidata using the workflow outlined above. Only then can the data in the Curationist database be checked against Wikidata, and any information available to Curationist that is not yet in Wikidata can be synced to Wikidata using the bot. The bot code itself will perform the edits to Wikidata, but there will also be code that runs during the ingestion/reconciliation process and makes the Wikidata pipeline a continuous operation.
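Reusing the placeholder names from the sketches above, the triggering logic could look roughly like this; how the pipeline is actually wired inside Curationist is an open implementation question, not settled by this sketch.

# Sketch of the trigger logic, reusing FIELD_TO_PROPERTY and
# add_missing_statements from the sketches above.
def on_record_updated(record: dict) -> None:
    """Run after a record is ingested into the content standard and the
    asynchronous reconciliation step has had a chance to assign a QID."""
    qid = record.get("wikidata_qid")
    if qid is None:
        return  # not yet reconciled; the sync will run on a later pass
    statements = {
        FIELD_TO_PROPERTY[field]: value
        for field, value in record.items()
        if FIELD_TO_PROPERTY.get(field, "").startswith("P")
    }
    add_missing_statements(qid, statements)
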
Community Engagement

In addition to the software development, the project will require several types of on-Wiki outreach and communications, including:

  • Publication of the Curationist–Wikidata mapping and consultation with the community to solicit feedback
  • Facilitation of the bot permission process, for community approval
  • Continued monitoring of, and engagement with, the Wikidata community as the bot operates, to ensure that feedback and concerns are addressed


Budget

How will you use the funds you are requesting? List bullet points for each expense. (You can create a table later if needed.) Don’t forget to include a total amount, and update this amount in the Probox at the top of your page too!


Prep Work on Curationist Environment
  • Upfront strategy and research for mapping the Curationist data standard to Wikidata properties
  Personnel cost: $28,000.00

Bot Development
  • Developing and implementing workflows to allow Curationist to reconcile entities from data sources with Wikidata IDs, and creating bot code that can bulk import this data as new items/statements to Wikidata
  Personnel cost: $37,500.00

Community Engagement
  • Conducting the Wikidata outreach and communications necessary to receive buy-in and approval from the community, and incorporating that research and those findings into the workflows and outputs
  Personnel cost: $15,500.00

Total: $81,000.00


Community engagement

How will you let others in your community know about your project? Why are you targeting a specific audience? How will you engage the community you’re aiming to serve at various points during your project? Community input and participation helps make projects successful.

For this project to be successful, we fully understand that the Wikidata community needs to be aware of, and be part of, our approach. We have experienced Wikidata editors on the project team to ensure we have this capability. We plan to make high-volume edits by bot, which requires approval from the community. Additionally, the project demands that we conform to community standards in modeling the types of data we import, especially since we intend to import diverse types of museum collections to both new and existing Wikidata items.

This will be an ongoing conversation with the community as we work through our initial data modeling and solicit feedback in appropriate forums, such as Project chat, the Wikidata and GLAM Facebook pages, the Wikidata Telegram channel, individual property talk pages, or relevant WikiProjects. As part of the mapping activity, we will publish the mapping in a prominent location, such as the bot user page, so it can be reviewed by the community both before and after approval. We also fully expect to receive talk page messages with feedback related to our work, even after the bot request, some of which may lead to additional action or changes to the mapping.

Get involved

Participants

Please use this section to tell us more about who is working on this project. For each member of the team, please describe any project-related skills, experience, or other background you have that might help contribute to making this idea a success.

Dominic Byrd-McDevitt

Advisor

Dominic has advised MHz Curationist in Wikimedia engagement, and in its development of this project proposal.

As a longtime Wikimedian and GLAM-Wiki practitioner, Dominic has produced scripts and bots that have performed over one million Wikidata edits and over 2.5 million Wikimedia Commons uploads. Having previously worked on Wikimedia projects for such institutions as the Digital Public Library of America, the US National Archives, the Smithsonian Institution, and the Cleveland Museum of Art, Dominic specializes in issues relating to cultural heritage metadata, aggregation, rights, and partnerships. He has previously developed the User:US National Archives bot, User:DPLA bot, and User:Openaccess_cma bot.

Virginia Poundstone

Director of Product and Content, MHz Curationist, MHz Foundation

Virginia Poundstone works on the product and content strategy for the open access art and cultural heritage project MHz Curationist as its Director of Product and Content. Prior to joining the MHz Foundation, she was an art educator at Parsons, MICA, and Columbia University, where she taught courses about making things by breaking down systems to build improved structures. She is an artist, a Pollock-Krasner grantee, and a member of the cooperatively artist-run gallery Essex Flowers in New York City.
Thomas Guignard

Technical Project Manager

Thomas Guignard is an independent consultant specializing in managing technology projects for libraries and cultural actors. As the Technical Project Manager for the MHz Curationist project, he coordinates two development teams and a network of subject matter experts to realize the project’s vision. A librarian with a background in engineering, Thomas has over 15 years of experience aligning technology with user requirements and helping people organize and navigate large datasets. Strongly believing that information wants to be free, he is a longtime supporter of open, collaborative ways to share and reuse content and has been a Wikipedia contributor since 2006.
Datacrafted
A distributed team of systems and software developers focused on the platform build and on creating points of integration with other systems.

Community notification

You are responsible for notifying relevant communities of your proposal, so that they can help you! Depending on your project, notification may be most appropriate on a Village Pump, talk page, mailing list, etc. Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions. Need notification tips?

Endorsements

Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).

  • Support I think the budget needs a bit more work/clarification, but overall the proposal is relevant; curious to see where this lands. The project definitely has the right advisors. Scann (talk) 19:03, 9 April 2021 (UTC)
  • Support There is a critical need for easier infrastructure that supports mutual collaboration between cultural institutions and Wikimedia platforms. The MHz Curationist aggregation service can be an important hub that helps to improve the sharing and use of open access content between Wikimedia collaborators, platforms, and partners. Supporting this project is a crucial step for the long-term viability and sustainability of institutions' and Wikimedia platforms' ongoing work together. The project has the right leadership with Dominic and the MHz Curationist team. Nealstimler (talk) 14:31, April 11, 2021 (UTC)
  • Support Creative Commons (CC) is committed to supporting museums across the world as they open up their collections to enable access and participation in culture in meaningful, sustainable and equitable ways, including on public-minded sharing platforms like Wikipedia and its sister project Wikidata. CC's Public Domain Mark 1.0 (PDM) and Public Domain Dedication tool (CC0 1.0) are the keys to opening up cultural heritage and can be used to encourage its sharing, use and reuse. Catherine Stihler, CEO, Creative Commons, and Brigitte Vézina, Director of Policy, Creative Commons, are thus glad to endorse this grant proposal and applaud the project's goals of increasing cultural heritage metadata on Wikidata by adding new statements and items, and building value for Wikidata/Wikimedia (and open access generally) in the cultural sector. 77.161.91.32 15:14, 13 April 2021 (UTC)
  • This proposal has also been endorsed by the Cleveland Museum of Art itself, as conveyed in an email from its Chief Digital Information Officer, Jane Alexander. CMA is one of the initial data sources mentioned above. Dominic (talk) 20:38, 13 April 2021 (UTC)
  • Support I see this as an important step in bringing the Wikimedia projects closer to GLAMs. It is challenging for individual organizations to contribute content to Wikimedia, and a stop-gap in the exchange of information between Wikimedia and the original collections can serve many purposes. It could have the ability to consolidate metadata, identify versions of the same item/media, and harmonize copyright/licensing expressions. It could be used for metadata roundtripping or detecting Open Access violations. It will be important to develop these in collaboration with the numerous actors to whom this openness and interoperability is significant. I can see that the Wikimedia technologies and content have even more to offer, especially in terms of multilinguality. I think that the MHz Curationist can bring this topic further fast, and it is important to prepare for collaborative next steps. – Susanna Ånäs (Susannaanas) (talk) 15:21, 14 April 2021 (UTC)