Community Wishlist Survey 2022/Bots and gadgets/Tool that reviews new uploads for potential copyright violations

From Meta, a Wikimedia project coordination wiki

Tool that reviews new uploads for potential copyright violations

  • Problem: A large number of uploads to Wikimedia Commons are copyright violations. Some are screenshots of a computer screen or reproductions of another image, such as a poster. I estimate that about 3-6% of all files on Commons (most of them probably images) are copyright violations. Around the time Commons started, I already had some files that were copyright violations, and nobody ever noticed.
  • Proposed solution: A bot or tool that checks uploaded files and flags them if they are potential copyright violations.
  • Who would benefit: Wikimedia Commons as a whole, as well as its admins
  • More comments:
  • Phabricator tickets: phab:T120453
  • Proposer: --D-Kuru (talk) 14:00, 18 January 2022 (UTC)

Discussion

In my opinion copyright violations share one or more of these features:

  • EXIF data: Copyvios usually have no EXIF info. That is not to say they cannot include it.
  • Size: Copyvios are usually small (typically web size, around 800 px on the long side). That is not to say they cannot be larger.
  • Account edits: Copyvios are often uploaded by what I call hit-and-run accounts: the account is created, uploads one copyvio and is never used again.
  • Can be found on the web: Copyvios can often be found on other websites. That is not to say the file cannot have been freely shared by other people on sites like Flickr.
  • Author: No matter how implausible it is, the file is tagged with "own work" and the uploader is given as the author.
  • Percentage of user uploads: When a hit-and-run account uploads three images and two of them are speedy deleted as copyright violations, the third one should probably be checked as well.
  • Content: Copyvios usually look better than the average image on Wikimedia Commons. All the other points are probably fairly easy to check, but this one would need a highly specialised AI that can tell good quality from bad. For a project like Wikimedia I guess this is next to impossible unless somebody like Google steps in to help.

There are also hints that a file is legitimate: GPS data (not possible for every file), upload by a user with many edits over many years, a user whose global account is active on several different projects, etc.

My suggestion is NOT to have a bot run wild and delete files it deems copyvios, but to have a bot that assigns a (public?) score to each image, indicating how likely it is to be a copyvio, so that fewer of them slip by.
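As a rough illustration of such a score, the signals listed above could be combined as in the sketch below. This is not an existing tool: the field names, weights and thresholds are all made up for illustration and would have to be tuned against files that humans have already reviewed. The output is only meant for flagging, never for deletion.

```python
from dataclasses import dataclass


@dataclass
class UploadSignals:
    """Signals from the discussion above. All field names are hypothetical."""
    has_exif: bool
    long_side_px: int
    uploader_edit_count: int
    found_elsewhere_on_web: bool
    claimed_own_work: bool
    uploader_uploads: int
    uploader_deleted_copyvios: int
    has_gps: bool = False
    uploader_account_age_days: int = 0


def copyvio_score(s: UploadSignals) -> float:
    """Return a value in [0, 1]; higher means 'worth a human look'.

    The weights are illustrative assumptions, not measured values.
    """
    score = 0.0
    if not s.has_exif:
        score += 0.15  # missing EXIF data
    if s.long_side_px <= 800:
        score += 0.15  # web-sized image
    if s.uploader_edit_count <= 1:
        score += 0.20  # hit-and-run account
    if s.found_elsewhere_on_web:
        score += 0.25  # same image exists on other sites
    if s.claimed_own_work:
        score += 0.10  # "own work" claim, weak on its own
    if s.uploader_uploads > 0:
        # Share of this uploader's files already deleted as copyvios.
        score += 0.25 * s.uploader_deleted_copyvios / s.uploader_uploads

    # Hints that a file is legitimate lower the score.
    if s.has_gps:
        score -= 0.20
    if s.uploader_edit_count >= 1000 and s.uploader_account_age_days >= 730:
        score -= 0.30  # long-standing, active user

    return max(0.0, min(1.0, score))


suspicious = UploadSignals(
    has_exif=False, long_side_px=800, uploader_edit_count=1,
    found_elsewhere_on_web=True, claimed_own_work=True,
    uploader_uploads=3, uploader_deleted_copyvios=2,
)
trusted = UploadSignals(
    has_exif=True, long_side_px=4000, uploader_edit_count=5000,
    found_elsewhere_on_web=False, claimed_own_work=True,
    uploader_uploads=200, uploader_deleted_copyvios=0,
    has_gps=True, uploader_account_age_days=3000,
)
print(copyvio_score(suspicious), copyvio_score(trusted))
```

A simple additive score like this is easy to explain publicly, which matters if the score is shown next to the file.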

I thought about creating something like a Trusted Author status, where a user has to meet certain criteria so that people can rely on uploads by this user. There is a short story behind this: somebody asked me by email whether they could use my image in a school book and what they would have to do to use it. I said of course, and told them all they would have to do is credit me and state the licence. They ended up not using my image (even though I own the copyright and have licensed it under a free licence) because they could not be 100% sure, and they could potentially get into legal trouble if they used the image without real permission. This might not be a big deal for a website, but it can be a huge deal for a (school) book that is printed and sold over many years. --D-Kuru (talk) 20:21, 20 January 2022 (UTC)

  • @D-Kuru: This is a valid proposal, but it's a bit wordy. phab:T120453 is basically what you're asking for (a bot to flag Commons uploads that are possible copyvios). Is that correct? If so, with your permission, may I simplify the wording of your proposal? It will eventually be marked for translation, so putting in fewer words will make it easier on the translators. For instance, the symptoms you mention in "More comments" such as looking at EXIF data, size, etc., are helpful but not really necessary to understanding the wish. We could move that here to the discussion section. Thanks, MusikAnimal (WMF) (talk) 19:06, 20 January 2022 (UTC)
@MusikAnimal (WMF): I did not have the time to check the ticket, but from your description this sounds about right. I moved the More comments section as suggested. --D-Kuru (talk) 20:21, 20 January 2022 (UTC)
Ok thanks. I have done some slight rewording of your proposal for better translatability, which I hope is okay :) Best, MusikAnimal (WMF) (talk) 22:07, 20 January 2022 (UTC)
This had a lot of support, so I have made an expandable bot solution that should fulfil some of the needs of this wish. You can read the bot request here: https://commons.wikimedia.org/wiki/Commons:Bots/Requests/CommCheck Ed6767 (talk) 19:21, 9 February 2022 (UTC)

I wonder if calculating hashes of uploaded files could help... there is probably no external database we could check against, but we could track the hashes of uploaded files over time, and if a file with the same hash is uploaded again, that could be one factor in such a score. It could also help indicate whether the file already exists under a different name. = paul2520 (talk) 19:37, 5 February 2022 (UTC)
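A minimal sketch of the hash-tracking idea, assuming an in-memory registry (the class and method names here are hypothetical, and a real bot would need persistent storage). It uses SHA-256 from Python's standard hashlib, so it only catches byte-identical files; a resized or re-encoded copy produces a different hash and would need some kind of perceptual hashing instead.

```python
import hashlib


class HashRegistry:
    """Remembers content hashes of uploads (hypothetical helper)."""

    def __init__(self) -> None:
        # Maps hex digest -> names of files uploaded with that content.
        self._seen: dict[str, list[str]] = {}

    def check_and_record(self, filename: str, data: bytes) -> list[str]:
        """Return earlier filenames with identical bytes, then record this one."""
        digest = hashlib.sha256(data).hexdigest()
        earlier = list(self._seen.get(digest, []))
        self._seen.setdefault(digest, []).append(filename)
        return earlier


registry = HashRegistry()
first = registry.check_and_record("Sunset.jpg", b"example image bytes")
second = registry.check_and_record("My holiday.jpg", b"example image bytes")
print(first)   # no earlier upload with these bytes
print(second)  # the earlier upload under a different name
```

A non-empty result could feed into the score as one factor, or simply tell the uploader that the file already exists under another name.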

I would suggest always using the "This file is not my own work." upload form. If it is their own work, uploaders should declare that they are the author. Many people just use the simpler form, which leads to copyvios. If uploaders declare author and source, copyvios are easier to detect than with images wrongly tagged as own work. -- Johannes Kalliauer - Talk | Contributions 18:41, 16 February 2022 (UTC)

Voting