Community Wishlist Survey 2023/Multimedia and Commons/Prevent Flickr2Commons from uploading duplicate files

From Meta, a Wikimedia project coordination wiki

Prevent Flickr2Commons from uploading duplicate files

  • Problem: Flickr2Commons is the main source of duplicate files.
  • Proposed solution: Fix Flickr2Commons so that it checks whether a file with the same SHA-1 hash (message digest) as the Flickr file is already available on Commons.
  • Who would benefit: Everyone
  • More comments:
  • Phabricator tickets:
  • Proposer: C.Suthorn (talk) 09:30, 30 January 2023 (UTC)

Discussion

  • I would have guessed the SHA-1 hash check happens server-side. I've definitely run into the warning myself using the native Special:Upload interface. That tells me the hash values probably don't match, and whatever image Flickr2Commons is trying to upload must have subtle differences from the duplicates already on Commons. I'm not sure how this could be resolved without machine learning. MusikAnimal (WMF) (talk) 16:48, 30 January 2023 (UTC)
    Yes, there is a server-side check. It produces a "warning" (not an "error"). In case of an error the file will not be published. In case of a warning, the upload tool can decide to stop the upload or to go on and publish. It is possible to compute the SHA-1 hash and check it against the server, then not upload if the file is already there. Or it is possible to upload a file, then check for the warning, then stop the upload if there is a duplicate. You could even do both: check before uploading and check again before publishing the upload. F2C only checks for an identical filename (not an identical Flickr ID, but an identical filename including the Flickr ID). If the file gets renamed, or was uploaded with another filename scheme (as UploadWizard does), the check by F2C will fail. C.Suthorn (talk) 22:57, 30 January 2023 (UTC)
  • What are the benefits of Flickr2Commons compared to the Flickr-import functionality in UploadWizard? It seems like it might be better to incorporate anything missing there, rather than having multiple things do the same work. SWilson (WMF) (talk) 01:01, 8 February 2023 (UTC)
    @SWilson (WMF): It can bulk-upload groups, photosets, etc. It would be great to have this in UW. Strainu (talk) 21:35, 10 February 2023 (UTC)
    @Strainu: Yes, good idea. UploadWizard does support photosets/albums though. I'm not sure how it handles really large ones, and I don't think it does groups. SWilson (WMF) (talk) 07:43, 16 February 2023 (UTC)
    When I was still trying to use UW, it allowed uploading 150/500 files in one go, but I never found any way to upload more than 100 files in one go without the web browser crashing. Also, in the second go of uploads it always said that the first file of the second go was already uploading, and it falsely claimed that there were more than 150/500 files whenever the sum of the first and second go was larger than 150/500 files. C.Suthorn (talk) 17:55, 16 February 2023 (UTC)
  • Flickr has announced that they're taking over maintenance of Flickr2Commons, so this bug might be fixed as part of that. Discussion thread here: commons:Commons:Village pump#Flickr_Foundation_adopts_Flickr2Commons Sam Wilson 04:49, 2 June 2023 (UTC)
    Yes! Please see our project page here. Jessamyn (talk) 17:20, 7 July 2023 (UTC)
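
The pre-upload check described in the discussion above (compute the SHA-1 hash of the Flickr file, ask Commons whether a file with that hash exists, and skip the upload if so) could be sketched roughly as follows. This is only an illustration, not how Flickr2Commons is actually implemented. It assumes the MediaWiki API query `list=allimages` with its `aisha1` parameter; the function names, the `fetch` parameter, and the User-Agent string are made up for this example. As noted in the discussion, a hash comparison only catches byte-identical files, so it would not help if the Flickr copy differs even slightly from the file already on Commons.

```python
import hashlib
import json
import urllib.parse
import urllib.request

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of the file content as a hex string."""
    return hashlib.sha1(data).hexdigest()


def duplicate_query_url(sha1: str, api: str = COMMONS_API) -> str:
    """Build an API URL asking for all files with the given SHA-1 hash."""
    params = {
        "action": "query",
        "list": "allimages",
        "aisha1": sha1,  # hex-encoded SHA-1 of the file content
        "format": "json",
    }
    return api + "?" + urllib.parse.urlencode(params)


def titles_from_response(response: dict) -> list[str]:
    """Extract file titles from an allimages response (empty if none)."""
    images = response.get("query", {}).get("allimages", [])
    return [image["title"] for image in images]


def _http_get_json(url: str) -> dict:
    """Fetch a URL and decode the JSON body (hypothetical User-Agent)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "duplicate-check-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def find_duplicates(data: bytes, fetch=None) -> list[str]:
    """Return titles of existing files whose content is byte-identical.

    `fetch` is an injectable function (URL -> decoded JSON) so the
    logic can be exercised without network access.
    """
    if fetch is None:
        fetch = _http_get_json
    return titles_from_response(fetch(duplicate_query_url(sha1_hex(data))))
```

An upload tool could call `find_duplicates(file_bytes)` before starting the upload and skip the file when the returned list is not empty. The second check mentioned in the discussion (looking at the duplicate warning after uploading but before publishing) would happen separately and is not covered here.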

Voting