
User:Zache/Wikimedia Hackathon 2024

From Meta, a Wikimedia project coordination wiki

Improving Wikimedia Commons image hashing


The project idea is to calculate perceptual hashes for Wikimedia Commons images so that we can reliably detect whether a photo is already in Wikimedia Commons and match Commons photos to photos in other image repositories (Finna, Europeana, Flickr, ...). This makes it possible to update image metadata and image files, and it also helps prevent uploads of duplicate images.
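To illustrate why a perceptual hash can match the same photo across repositories, here is a minimal sketch of a difference hash (dHash) in plain Python. This is only a stand-in: this page does not say which hashing algorithm the project uses, and the helper names (`shrink`, `dhash`, `hamming`) and the synthetic test image are assumptions made for the example.

```python
def shrink(pixels, width, height):
    """Nearest-neighbour downsample of a 2D grayscale grid (list of rows)."""
    src_h, src_w = len(pixels), len(pixels[0])
    return [
        [pixels[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def dhash(pixels, size=8):
    """Difference hash: one bit per pair of horizontally adjacent pixels."""
    small = shrink(pixels, size + 1, size)  # (size + 1) columns -> size comparisons per row
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits  # size * size bits, 64 by default


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


# Synthetic 90x80 grayscale "photo" (hypothetical data, values 0..199).
original = [[(x * 37 + y * 11) % 200 for x in range(90)] for y in range(80)]
# Same picture, uniformly brightened: pixel ordering is unchanged.
brighter = [[v + 10 for v in row] for row in original]
# Tonally inverted copy: every brightness comparison flips.
inverted = [[255 - v for v in row] for row in original]

print(hamming(dhash(original), dhash(brighter)))  # 0
print(hamming(dhash(original), dhash(inverted)))
```

A small Hamming distance suggests the two files show the same picture even if the bytes differ (re-encoding, resizing, brightness changes), which is what makes this kind of hash useful for duplicate detection, unlike a cryptographic checksum.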

Speed improvement


Before the hackathon, the indexing speed was about 15,000 images per hour, i.e. roughly 10 million per month. At that speed, indexing all 100 million Wikimedia Commons photos would take close to a year. So, in this hackathon, I moved the indexing code from Toolforge to a virtual server in wmlabs, which tripled the indexing speed to 30M+ photos per month. Indexing is expected to be finished in the summer.
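The figures above can be checked with a few lines of arithmetic. This is a rough sketch that assumes indexing runs continuously and that a month has 30 days; the 45,000 images per hour rate is simply three times the earlier rate, not a measured value.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: continuous indexing, 30-day month


def images_per_month(images_per_hour):
    return images_per_hour * HOURS_PER_MONTH


def months_to_index(total_images, images_per_hour):
    return total_images / images_per_month(images_per_hour)


TOTAL = 100_000_000           # approximate number of Commons photos
RATE_BEFORE = 15_000          # images per hour on Toolforge
RATE_AFTER = 3 * RATE_BEFORE  # assumed rate after the move (tripled)

print(images_per_month(RATE_BEFORE))                   # 10800000
print(images_per_month(RATE_AFTER))                    # 32400000
print(round(months_to_index(TOTAL, RATE_BEFORE), 1))   # 9.3
print(round(months_to_index(TOTAL, RATE_AFTER), 1))    # 3.1
```

At the old rate the full index would take a bit over nine months of uninterrupted running, so "a year" is a realistic estimate once downtime is included; at the tripled rate it is closer to three months.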

Ontop SPARQL


We also installed an Ontop server for querying hashes and duplicate images using SPARQL. This work is still ongoing, but we can currently query hashes stored in a PostgreSQL database using SPARQL.
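As a rough idea of what such a query could look like, the sketch below builds a SPARQL query that groups images by hash and prepares a standard SPARQL Protocol POST request for it. Everything specific here is hypothetical: the endpoint URL, the `ex:` prefix and the `ex:phash` property are placeholders, since the actual Ontop mapping and vocabulary are not described on this page. The request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint; the real server address is not given on this page.
ENDPOINT = "http://localhost:8080/sparql"


def build_duplicate_query(limit=10):
    """Return a SPARQL query listing hashes shared by more than one image.

    The ex: vocabulary is a placeholder, not the project's real mapping.
    """
    return f"""
PREFIX ex: <http://example.org/hash#>
SELECT ?phash (COUNT(?image) AS ?copies)
WHERE {{ ?image ex:phash ?phash . }}
GROUP BY ?phash
HAVING (COUNT(?image) > 1)
ORDER BY DESC(?copies)
LIMIT {int(limit)}
""".strip()


def build_request(query, endpoint=ENDPOINT):
    """Prepare a SPARQL Protocol POST request (URL-encoded 'query' parameter)."""
    data = urlencode({"query": query}).encode("utf-8")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "Accept": "application/sparql-results+json",
    }
    return Request(endpoint, data=data, headers=headers)


request = build_request(build_duplicate_query(limit=5))
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would send it; whether a query like this runs depends on the SPARQL -> SQL mapping being complete.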

STATUS: There are still missing pieces in our Ontop SPARQL -> SQL translation configuration, and the setup is still far from practical.