Community Wishlist Survey 2023/Multimedia and Commons/Tool to copy images from archives to Commons with metadata

  • Problem: Many GLAM institutions make images available on their websites which could be copied to Commons. These currently have to be downloaded and re-uploaded manually, one at a time, with the metadata copied across by hand.
  • Proposed solution: A way for Wikimedia Commons to take a URL and copy the image across with its description and relevant metadata, including a link back to the source (a rough sketch of such an upload step follows this list).
  • Who would benefit: Everyone.
  • More comments: GLAM institutions like Libraries Tasmania / Tasmanian Archives have thousands of public domain images on their website (example). Adding each one manually to Commons would take forever. A tool like this would help users of Wikimedia projects add more media, help GLAM institutions quickly share their own content, and make sharing images more accessible to newcomers during training events.
  • Phabricator tickets: T193526
  • Proposer: Jimmyjrg (talk) 03:59, 24 January 2023 (UTC)
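
A rough sketch of the upload step such a tool would automate, using Pywikibot's UploadRobot (a library commonly used for Commons batch work). The source URL, target filename, date, and licence tag below are hypothetical placeholders; a real tool would fill them in from the institution's catalogue record.

```python
import pywikibot
from pywikibot.specialbots import UploadRobot

# Hypothetical source record; a real tool would scrape or query the
# institution's catalogue to obtain the image URL and these field values.
source_url = "https://example.org/archives/item/12345/image.jpg"
description = """{{Information
|description = {{en|1=Description text copied from the source record}}
|source      = [https://example.org/archives/item/12345 Example Archive, item 12345]
|date        = 1910
|author      = Unknown
}}
{{PD-old-70}}"""

site = pywikibot.Site("commons", "commons")
bot = UploadRobot(
    [source_url],                 # UploadRobot fetches the file from the URL
    description=description,      # wikitext for the new file page
    use_filename="Example Archive item 12345.jpg",
    keep_filename=True,           # don't prompt to change the name
    verify_description=False,     # don't prompt to confirm the description
    target_site=site,
)
bot.run()
```

The hard part, as the discussion below notes, is not the upload itself but reliably extracting the metadata from each institution's website.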

Discussion

  • It looks like the example above is using a catalogue product from SirsiDynix; I've not been able to find any API info. I think one aspect of this proposal is likely to be whether we can build a general-purpose tool that works with many libraries, or a single-purpose tool. For example, many archival catalogue systems support OAI-PMH, so if we built something that worked with that it'd perhaps be more widely used (a minimal harvesting sketch follows this thread). For site-specific scraping requests, there's a register of them at commons:Commons:Batch uploading. SWilson (WMF) (talk) 07:09, 24 January 2023 (UTC)
    Yes, I'd like something that adapts to whichever website/database is being looked at. Some libraries use Spydus (example: Stonnington), which I think has a public API. Ideally there would be some way for it to learn how a website works: the first time you visit, you manually copy and paste all the information, but after that it knows how to do it itself. --Jimmyjrg (talk) 22:05, 24 January 2023 (UTC)
    @Jimmyjrg Double-checking I understand the problem correctly: the proposal is to create a workaround for resources that are available online from institutions that do not have APIs or data dumps that can facilitate sharing data in bulk. Is that correct? --VPoundstone-WMF (talk) 16:55, 26 January 2023 (UTC)
    Yes @VPoundstone-WMF: That’s a good explanation. Basically I’d like something quicker than downloading and uploading everything myself (and copying/inputting metadata) when there’s a few images to move to Commons. Jimmyjrg (talk) 08:00, 27 January 2023 (UTC)
    Without commenting on the specific example above, in my experience, creating a generic tool to reliably scrape random websites with the sort of detail required for Commons is probably technically infeasible. c:Commons:Batch uploading exists for a reason. -FASTILY 22:33, 28 January 2023 (UTC)[reply]
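
As a concrete illustration of the OAI-PMH idea raised at the top of this thread, here is a minimal harvesting sketch. The endpoint URL is a hypothetical placeholder (each institution exposes its own), and oai_dc is used because Dublin Core support is mandatory for every OAI-PMH repository; resumption tokens for paging are omitted for brevity.

```python
import xml.etree.ElementTree as ET
import requests

# Hypothetical OAI-PMH endpoint; real endpoints vary per institution.
ENDPOINT = "https://example.org/oai"
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

# ListRecords with the mandatory Dublin Core metadata format.
resp = requests.get(
    ENDPOINT,
    params={"verb": "ListRecords", "metadataPrefix": "oai_dc"},
    timeout=30,
)
resp.raise_for_status()
root = ET.fromstring(resp.content)

for record in root.iter(f"{OAI}record"):
    identifier = record.findtext(f"{OAI}header/{OAI}identifier")
    # Dublin Core fields that could be mapped onto a Commons
    # {{Information}} template.
    titles = [e.text for e in record.iter(f"{DC}title")]
    rights = [e.text for e in record.iter(f"{DC}rights")]
    print(identifier, titles, rights)
```

Records harvested this way would still need a per-institution mapping onto Commons templates, which is where the general-purpose versus single-purpose question above really bites.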
  • I think there is one big problem: the source data always come in different formats, so you have to change your program every time. That's why there is a service on Commons which helps with such mass transfers; just now, I cannot find the link. Juandev (talk) 19:14, 9 February 2023 (UTC)
    I was inspired by the Web2Cit project which can learn to add citations using different formats. But you're right, it's likely more difficult for catalogues of images. Jimmyjrg (talk) 23:24, 21 February 2023 (UTC)[reply]
  • en:GLAM (cultural heritage) is an acronym for galleries, libraries, archives, and museums: the cultural heritage institutions. --Error (talk) 15:56, 13 February 2023 (UTC)

Voting