Grants:IdeaLab/Use a Siamese network to propose categories for images at Commons/Technical notes

From Meta, a Wikimedia project coordination wiki

A w:siamese network is a kind of w:artificial neural network in which the same network, with shared weights, is applied to each input. The output is taken from a layer "inside" a larger network rather than from a final classification layer, so the output vector acts somewhat like w:locality-sensitive hashing and can be extracted and used for further processing. That vector is like a fingerprint: the distance between two fingerprints measures how similar two images are, and an image's fingerprint can likewise be compared to the fingerprint of a whole category of images. In effect, an image is processed by the Siamese network into a fingerprint, and that fingerprint is then compared to the fingerprints of individual categories. The closest categories can then be proposed as candidates for categorizing the image.
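The comparison step described above can be sketched as follows. This is only a minimal illustration, assuming that each category is represented by a single stored fingerprint and that cosine distance is used as the metric; the function names, the toy vectors, and the category names are hypothetical and not part of the proposal.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def propose_categories(image_fp, category_fps, top_k=3):
    """Return the top_k category names whose fingerprints are closest."""
    ranked = sorted(
        category_fps,
        key=lambda name: cosine_distance(image_fp, category_fps[name]),
    )
    return ranked[:top_k]

# Hypothetical three-element fingerprints; real ones would be much longer.
category_fps = {
    "Bridges": [0.9, 0.1, 0.0],
    "Cats": [0.0, 0.8, 0.6],
    "Lighthouses": [0.7, 0.0, 0.7],
}
image_fp = [0.8, 0.2, 0.1]
proposals = propose_categories(image_fp, category_fps, top_k=2)
print(proposals)
```

A real system would probably keep several fingerprints per category, or an averaged one; that choice is left open here.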

It is important that the network does not output a final category (or multiple categories); it outputs a kind of fingerprint. That fingerprint is then compared to other stored fingerprints, which in turn makes it possible to have a huge repository of possible fingerprints. A fingerprint is typically a vector with 100 or more elements.

Fingerprints can typically be learned with the w:triplet loss function, which uses an anchor image, a positive image from the same class as the anchor, and a negative image from outside that class. In this case the positive image pulls the weights back, that is, it acts as a regularizer. An alternative is the w:contrastive loss (missing) function. In that case there must be a weight decay to regularize the weights, or some similar operation such as a normalization.
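The two loss functions can be written out for a single example as below. This is a sketch of one common formulation of each, assuming Euclidean distance between fingerprints and a margin of 1.0; other variants (for example with squared distances in the triplet loss) exist, and the function names are illustrative only.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther away than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def contrastive_loss(a, b, same_class, margin=1.0):
    """Pull same-class pairs together; push other pairs apart up to `margin`."""
    d = euclidean(a, b)
    if same_class:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

# Toy fingerprints (hypothetical values).
anchor, positive = [0.0, 0.0], [0.0, 1.0]
print(triplet_loss(anchor, positive, [3.0, 4.0]))  # negative is far away
print(triplet_loss(anchor, positive, [1.0, 0.0]))  # negative is too close
```

In training, these per-example values would be averaged over a minibatch and minimized with respect to the network weights.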

It is, however, likely that a better result can be achieved by instead using binary classifiers that take the fingerprints as input. This would be a huge number of small networks; typically each such classifier network would be used in a minibatch, and each one would only classify an image for a specific category. One possible regularizer for the fingerprint would be a simple normalization.
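A per-category classifier of this kind could look roughly like the sketch below. It assumes the smallest possible "network" (a single logistic unit), L2 normalization of the fingerprint as the regularizing step, and plain stochastic gradient descent; the class name, the hyperparameters, the category name, and the toy data are all hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class CategoryClassifier:
    """One tiny logistic classifier answering: does this image belong here?"""

    def __init__(self, dim):
        self.w = [0.0] * dim
        self.b = 0.0

    def predict(self, fingerprint):
        x = l2_normalize(fingerprint)
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)) + self.b)

    def train(self, fingerprints, labels, epochs=200, lr=0.5):
        for _ in range(epochs):
            for fp, y in zip(fingerprints, labels):
                x = l2_normalize(fp)
                err = self.predict(fp) - y  # gradient of the log loss w.r.t. the logit
                self.w = [wi - lr * err * xi for wi, xi in zip(self.w, x)]
                self.b -= lr * err

# One classifier per category, keyed by category name (toy data).
classifiers = {"Bridges": CategoryClassifier(dim=2)}
classifiers["Bridges"].train(
    [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]],
    [1, 1, 0, 0],
)
print(classifiers["Bridges"].predict([1.0, 0.0]))
```

Because every classifier only sees fingerprints, new categories can be added by training one more small classifier without touching the Siamese network itself.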

Note that storing vectors of real-valued numbers, and searching through them, is non-trivial in an ordinary relational database. There are a few tricks that can be used, like using a w:string grammar to include candidates when large parts of the fingerprints are (or should be) similar, or excluding fingerprints with too large single entries. The final filtering would be pretty slow anyhow, but perhaps it is fast enough with some in-memory database scheme.
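The two-stage idea (a cheap exclusion step followed by a slower exact filter) might be sketched as below. This reads "too large single entries" as a single element differing from the query by more than the search radius, which is an assumption; such a fingerprint cannot lie within that Euclidean radius, so it can be dropped early. The in-memory store is a plain dictionary with hypothetical names and values.

```python
import math

def search(query, store, radius):
    """Return names of stored fingerprints within `radius` of `query`, nearest first."""
    hits = []
    for name, fp in store.items():
        # Stage 1: cheap exclusion. One element differing by more than
        # `radius` already forces the full distance above `radius`.
        if any(abs(q - s) > radius for q, s in zip(query, fp)):
            continue
        # Stage 2: exact (slower) distance check on the survivors.
        d = math.dist(query, fp)
        if d <= radius:
            hits.append((d, name))
    return [name for d, name in sorted(hits)]

# Toy in-memory store (hypothetical values).
store = {
    "A": [0.1, 0.1, 0.1],
    "B": [0.9, 0.1, 0.1],  # dropped in stage 1
    "C": [0.3, 0.3, 0.4],  # passes stage 1, dropped in stage 2
    "D": [0.2, 0.1, 0.3],
}
print(search([0.1, 0.1, 0.2], store, radius=0.3))
```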

A string grammar in this context means turning the vector elements into characters and merging them into short string segments, each with a leading character. That makes them searchable, so a category can be identified by a number of such strings.
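One possible encoding is sketched below. The details are assumptions made for illustration: elements are taken to lie in the range -1 to 1, each is quantized to one of 26 lowercase letters, segments hold four elements, and the leading character is an uppercase letter marking the segment's position in the vector (it wraps around after 26 segments).

```python
import string

ALPHABET = string.ascii_lowercase  # 26 quantization levels (assumption)

def quantize(value, lo=-1.0, hi=1.0):
    """Map a real value, clipped to [lo, hi], to one character of ALPHABET."""
    clipped = min(max(value, lo), hi)
    index = int((clipped - lo) / (hi - lo) * (len(ALPHABET) - 1))
    return ALPHABET[index]

def to_segments(fingerprint, segment_len=4):
    """Turn a fingerprint into short strings, each led by a position character."""
    chars = "".join(quantize(v) for v in fingerprint)
    segments = []
    for pos in range(0, len(chars), segment_len):
        lead = string.ascii_uppercase[(pos // segment_len) % 26]
        segments.append(lead + chars[pos:pos + segment_len])
    return segments

def shared_segments(fp_a, fp_b, segment_len=4):
    """Segments that two fingerprints have in common (exact string match)."""
    return set(to_segments(fp_a, segment_len)) & set(to_segments(fp_b, segment_len))

fp_image = [-1.0, 0.0, 1.0, 1.0, -1.0, -1.0, 0.0, 0.0]
fp_category = [-1.0, 0.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5]
print(to_segments(fp_image))
print(shared_segments(fp_image, fp_category))
```

Segments like these are ordinary short strings, so they can be stored in an indexed text column and matched exactly; fingerprints sharing many segments become candidates for the slower final comparison.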