Research:A System for Large-scale Image Similarity

Created: 13:41, 4 August 2021 (UTC)
Collaborators: no affiliation
Duration: 2022-01 – 2022-06

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


After several requests from different parts of the movement, the Research team is working on an image similarity tool. The tool takes an image as input and returns the most similar images on Wikimedia Commons.

Methods[edit]

This project is done in two parts:

  1. Calculate the embeddings of the images, which are stored as base64 strings in a CSV file.
  2. Create an index tree from the embeddings generated in the previous step, to serve recommendations. For this step we use AnnoyIndex from the annoy library.

Step 1: Calculating embeddings from the image base64 strings[edit]


Pipeline used for embedding generation


Since the image strings are stored in CSV format, we decided to go with pandas and loaded the CSV file into a pandas DataFrame. We then extracted the column containing the base64 image representations. Each base64 string is passed to a generate_image function, which decodes it into an image array. As the final step we computed the image embeddings, which are again stored in a CSV file with one additional column for the image URLs.
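A minimal sketch of this pipeline, assuming hypothetical column names image_b64 and image_url and a fixed input size (the actual schema and preprocessing may differ):

import base64
import io

import numpy as np
import pandas as pd
from PIL import Image
from tensorflow.keras.applications import EfficientNetV2B3

def generate_image(b64_string, size=(300, 300)):
    # decode the base64 string into a fixed-size RGB image array
    img = Image.open(io.BytesIO(base64.b64decode(b64_string))).convert('RGB')
    return np.asarray(img.resize(size))

# headless model: global average pooling yields one embedding vector per image
model = EfficientNetV2B3(include_top=False, pooling='avg')

df = pd.read_csv('images.csv')
images = np.stack([generate_image(s) for s in df['image_b64']])
embeddings = model.predict(images)

# store the embeddings with one additional column for the image URLs
out = pd.DataFrame(embeddings)
out['image_url'] = df['image_url']
out.to_csv('embeddings.csv', index=False)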

Step 2: Creating Index tree using embeddings[edit]

After computing the image embeddings, we created an index tree using AnnoyIndex from annoy. This index tree is used to retrieve similar images when the API is called.
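As a minimal sketch of this step (the full Spark-based pipeline is described below), building and querying an annoy index might look like the following, where embeddings and query_vec are assumed to be 256-dimensional vectors as in the later sections:

from annoy import AnnoyIndex

# build an index over 256-dimensional embeddings using angular distance
tree = AnnoyIndex(256, 'angular')
for i, vec in enumerate(embeddings):
    tree.add_item(i, vec)
tree.build(100)  # more trees give higher precision when querying

# query: indices of the 10 most similar images to a query vector
similar_ids = tree.get_nns_by_vector(query_vec, n=10)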

Timeline[edit]

  1. Investigate the best tools and methods to efficiently compute image similarity at scale
  2. Implement a first prototype for large-scale image similarity
  3. Iterate on the prototype and implement a public-facing tool

Experiments[edit]

Experiment 1: Model selection for embedding generation[edit]

We mainly wanted to test EfficientNet against ResNet50, which is why we only took variants of these two models.
Observation: while running the experiment we noticed something strange: the EfficientNet variants take much longer to load into memory, but are otherwise faster at model inference.

Main result of Experiment 1
name              parameters  images   time_took(sec)  testing_images  overall_acc
EfficientNetV2B3  12930622    101733   298.13          1180            48.42%
EfficientNetB4    17673816    101733   543.25          1180            42.33%
ResNet50V1        23561152    101733   187.58          1180            40.51%
EfficientNetB7    64097680    101733   867.18          1180            41.60%
ResNet50V2        23564800    101733   365.39          1180            40.14%
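The timings above come from generating embeddings for the same image set with each candidate model. A hedged sketch of such a comparison loop (the model constructors and the images array are assumptions, not the exact benchmark code):

import time

from tensorflow.keras.applications import EfficientNetV2B3, ResNet50

candidates = {
    'EfficientNetV2B3': EfficientNetV2B3(include_top=False, pooling='avg'),
    'ResNet50V1': ResNet50(include_top=False, pooling='avg'),
}

for name, model in candidates.items():
    start = time.time()
    model.predict(images)  # the same images for every model
    print(name, f'time_took(sec): {time.time() - start:.2f}')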

Experiment 2: Finding the scalability of the AnnoyIndex tree[edit]

Experiment 3: Selecting the n_components of PCA for embeddings[edit]

To select the best PCA n_components, we experimented with different models while varying n_components. To compare the models across different numbers of PCA components we used two metrics:
1. class_accuracy and
2. overlap_acc.

Columns of the tables[edit]

1. name: The name of the model used in the experiment.
2. n_components: The number of principal components used for PCA; -1 means no PCA was applied.
3. variance_ratio: The fraction of the total variance captured by the principal components of the PCA on the image embeddings.
4. class_accuracy: The fraction of the top-10 predicted images that belong to the same class as the source image, computed over all images in the testing dataset.
5. overlap_acc: This one is quite interesting. Start with the assumption that the images returned by the annoy tree built on the original embeddings, without any PCA, are ideal (call this the ideal_list). For images returned after applying PCA with a given n_components, a returned image counts as a Hit if it is present in the ideal_list and a Miss otherwise; the sum of all hits divided by the total number of returned images (we took 100) gives the accuracy for that n_components. A short sketch follows this list.
6. tree_size: We also keep track of the size of the annoy tree created after applying PCA.
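A short sketch of the overlap_acc computation described in item 5, where tree_full and tree_pca are hypothetical names for the trees built without and with PCA:

def overlap_acc(ideal_ids, pca_ids):
    # fraction of neighbours from the PCA tree that also appear
    # among the neighbours returned by the full-dimensional tree
    ideal = set(ideal_ids)
    return sum(1 for i in pca_ids if i in ideal) / len(pca_ids)

# with the 100 neighbours used in the experiment:
ideal_ids = tree_full.get_nns_by_vector(embedding, n=100)
pca_ids = tree_pca.get_nns_by_vector(pca.transform(embedding.reshape(1, -1)).reshape(-1), n=100)
print(overlap_acc(ideal_ids, pca_ids))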

EfficientNetV2B3 PCA Experiment
name              n_components  variance_ratio  class_accuracy  overlap_acc  tree_size
EfficientNetV2B3  -1            100.00%         48.42%          100.00%      726.94MB
EfficientNetV2B3  1024          94.47%          48.25%          81.85%       525.67MB
EfficientNetV2B3  512           83.62%          47.96%          80.22%       334.65MB
EfficientNetV2B3  256           73.22%          47.50%          77.42%       240.83MB
EfficientNetV2B3  128           60.28%          46.32%          69.26%       194.31MB
EfficientNetV2B3  64            46.31%          40.97%          52.70%       166.54MB

EfficientNetB4 PCA Experiment
name            n_components  variance_ratio  class_accuracy  overlap_acc  tree_size
EfficientNetB4  -1            100.00%         42.33%          100.00%      825.95MB
EfficientNetB4  1024          93.45%          42.32%          85.10%       526.46MB
EfficientNetB4  512           85.37%          42.25%          84.09%       336.73MB
EfficientNetB4  256           74.33%          42.08%          76.84%       241.28MB
EfficientNetB4  128           58.49%          40.48%          67.35%       194.53MB
EfficientNetB4  64            43.50%          34.98%          50.88%       168.76MB

EfficientNetB7 PCA Experiment
name            n_components  variance_ratio  class_accuracy  overlap_acc  tree_size
EfficientNetB7  -1            100.00%         41.60%          100.00%      1132.17MB
EfficientNetB7  1024          92.78%          41.69%          85.67%       532.97MB
EfficientNetB7  512           87.51%          41.90%          85.05%       343.51MB
EfficientNetB7  256           76.62%          40.76%          75.54%       250.43MB
EfficientNetB7  128           60.11%          38.07%          63.64%       201.94MB
EfficientNetB7  64            44.96%          36.08%          53.85%       174.78MB

ResNet50V1 PCA Experiment
name        n_components  variance_ratio  class_accuracy  overlap_acc  tree_size
ResNet50V1  -1            100.00%         40.51%          100.00%      917.29MB
ResNet50V1  1024          93.57%          40.31%          83.25%       523.87MB
ResNet50V1  512           84.63%          40.10%          80.77%       331.79MB
ResNet50V1  256           73.91%          39.24%          74.97%       238.33MB
ResNet50V1  128           61.83%          38.88%          69.67%       190.20MB
ResNet50V1  64            49.22%          35.58%          55.98%       162.61MB

ResNet50V2 PCA Experiment
name        n_components  variance_ratio  class_accuracy  overlap_acc  tree_size
ResNet50V2  -1            100.00%         40.14%          100.00%      916.58MB
ResNet50V2  1024          95.73%          40.25%          82.71%       526.60MB
ResNet50V2  512           86.23%          39.92%          80.06%       334.51MB
ResNet50V2  256           74.25%          39.01%          73.46%       239.71MB
ResNet50V2  128           60.57%          38.21%          66.84%       192.16MB
ResNet50V2  64            46.84%          35.47%          55.34%       164.16MB

Experiment 4: End-to-end timing of the models[edit]

In this experiment we mainly wanted to measure the models' timing: how long each takes to produce a result. Time is measured from passing the URL to receiving the recommendations of similar images.

Pipeline followed[edit]

1. Download the image from the URL and load it into a numpy array.
2. Load the PCA.
3. Load the annoy tree.
4. Compute the embeddings of the image.
5. Apply the PCA to the embeddings.
6. Get the ids of the recommended images.
7. Fetch the URLs of the recommended images from the CSV file.
NOTE: We preloaded the model into RAM.
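A hedged sketch of how this end-to-end time can be measured, reusing the helpers from the sections below (model, pca, tree and the idx2url map d are preloaded; generate_image here decodes raw bytes, as in the testing section):

import time
import urllib.request

import numpy as np

start = time.time()

# 1. download the image and decode it into an array
url = 'https://www.malaysia-today.net/wp-content/uploads/2019/11/Taj-Mahal-Agra-India.jpg'
img_data = urllib.request.urlopen(url).read()
image = generate_image(img_data)

# 4.-6. embed, reduce with PCA, query the annoy tree
features = model.predict(np.expand_dims(image, axis=0))
reduced = pca.transform(features).reshape(-1)
ids = tree.get_nns_by_vector(reduced, n=10)

# 7. map indices to URLs via the idx2url map
urls = [d[i] for i in ids]

print(f'time_took(sec): {time.time() - start}')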

EfficientNetV2B3 End-to-End
name              n_components  already_loaded  iter  time_took(sec)
EfficientNetV2B3  -1            True            1     2.993
EfficientNetV2B3  -1            True            2     0.807
EfficientNetV2B3  -1            True            3     0.831
EfficientNetV2B3  1024          True            1     2.901
EfficientNetV2B3  1024          True            2     0.832
EfficientNetV2B3  1024          True            3     0.828
EfficientNetV2B3  512           True            1     2.459
EfficientNetV2B3  512           True            2     0.817
EfficientNetV2B3  512           True            3     0.801
EfficientNetV2B3  256           True            1     1.858
EfficientNetV2B3  256           True            2     0.822
EfficientNetV2B3  256           True            3     0.844
EfficientNetV2B3  128           True            1     0.802
EfficientNetV2B3  128           True            2     0.756
EfficientNetV2B3  128           True            3     0.769
EfficientNetV2B3  64            True            1     0.825
EfficientNetV2B3  64            True            2     0.831
EfficientNetV2B3  64            True            3     0.783

EfficientNetB4 End-to-End
name            n_components  already_loaded  iter  time_took(sec)
EfficientNetB4  -1            True            1     31.089
EfficientNetB4  -1            True            2     0.826
EfficientNetB4  -1            True            3     0.764
EfficientNetB4  1024          True            1     0.817
EfficientNetB4  1024          True            2     0.836
EfficientNetB4  1024          True            3     0.831
EfficientNetB4  512           True            1     0.789
EfficientNetB4  512           True            2     0.827
EfficientNetB4  512           True            3     0.796
EfficientNetB4  256           True            1     0.791
EfficientNetB4  256           True            2     0.796
EfficientNetB4  256           True            3     0.797
EfficientNetB4  128           True            1     0.788
EfficientNetB4  128           True            2     0.838
EfficientNetB4  128           True            3     0.830
EfficientNetB4  64            True            1     0.788
EfficientNetB4  64            True            2     0.790
EfficientNetB4  64            True            3     0.785

EfficientNetB7 End-to-End
name            n_components  already_loaded  iter  time_took(sec)
EfficientNetB7  -1            True            1     7.931
EfficientNetB7  -1            True            2     0.836
EfficientNetB7  -1            True            3     0.815
EfficientNetB7  1024          True            1     0.813
EfficientNetB7  1024          True            2     0.826
EfficientNetB7  1024          True            3     0.842
EfficientNetB7  512           True            1     0.817
EfficientNetB7  512           True            2     0.823
EfficientNetB7  512           True            3     0.868
EfficientNetB7  256           True            1     0.836
EfficientNetB7  256           True            2     0.822
EfficientNetB7  256           True            3     0.814
EfficientNetB7  128           True            1     0.788
EfficientNetB7  128           True            2     0.886
EfficientNetB7  128           True            3     0.861
EfficientNetB7  64            True            1     0.850
EfficientNetB7  64            True            2     1.693
EfficientNetB7  64            True            3     0.891

ResNet50V1 End-to-End
name        n_components  already_loaded  iter  time_took(sec)
ResNet50V1  -1            True            1     1.840
ResNet50V1  -1            True            2     0.806
ResNet50V1  -1            True            3     0.780
ResNet50V1  1024          True            1     0.833
ResNet50V1  1024          True            2     0.756
ResNet50V1  1024          True            3     0.728
ResNet50V1  512           True            1     0.769
ResNet50V1  512           True            2     0.798
ResNet50V1  512           True            3     0.784
ResNet50V1  256           True            1     0.808
ResNet50V1  256           True            2     0.779
ResNet50V1  256           True            3     0.774
ResNet50V1  128           True            1     0.820
ResNet50V1  128           True            2     0.802
ResNet50V1  128           True            3     0.797
ResNet50V1  64            True            1     0.796
ResNet50V1  64            True            2     0.808
ResNet50V1  64            True            3     0.876

ResNet50V2 End-to-End
name        n_components  already_loaded  iter  time_took(sec)
ResNet50V2  -1            True            1     1.919
ResNet50V2  -1            True            2     0.807
ResNet50V2  -1            True            3     0.835
ResNet50V2  1024          True            1     0.816
ResNet50V2  1024          True            2     0.810
ResNet50V2  1024          True            3     0.826
ResNet50V2  512           True            1     0.817
ResNet50V2  512           True            2     0.782
ResNet50V2  512           True            3     0.772
ResNet50V2  256           True            1     0.774
ResNet50V2  256           True            2     0.801
ResNet50V2  256           True            3     0.762
ResNet50V2  128           True            1     0.765
ResNet50V2  128           True            2     0.775
ResNet50V2  128           True            3     0.784
ResNet50V2  64            True            1     0.773
ResNet50V2  64            True            2     0.795
ResNet50V2  64            True            3     0.826

We also observed the same loading behaviour discussed in Experiment 1.
RESULT: We decided to go with EfficientNetV2B3 with PCA (n_components=256).

Experiment 5: Accuracy of the selected model on different classes[edit]

After selecting the model, we decided to run a per-class accuracy check with it.
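A hedged sketch of how such a per-class check can be computed; test_set (pairs of embedding vector and category) and idx2cat (a map from annoy index to category) are hypothetical names:

from collections import defaultdict

hits = defaultdict(int)
totals = defaultdict(int)

for vec, category in test_set:
    # top-10 neighbours of the test image
    neighbours = tree.get_nns_by_vector(vec, n=10)
    hits[category] += sum(idx2cat[i] == category for i in neighbours)
    totals[category] += 10

for category in sorted(totals):
    print(category, f'{hits[category] / totals[category]:.1%}')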

EfficientNetV2B3: Per-class accuracy
category accuracy
Architecture 12.40%
Drawings 32.80%
Flags 42.50%
Fossils 61.10%
Paintings 32.20%
Politician 48.10%
Sculptures 21.80%
aircraft 69.20%
apple 70.30%
banner 27.10%
bed 49.10%
bicycle 56.90%
bird 54.50%
boat 45.70%
book 39.50%
bottle 62.90%
brick 25.40%
bridge 49.80%
building 13.80%
bus 67.60%
cake 46.90%
car 73.90%
cat 73.60%
chair 43.50%
clothes 34.90%
cow 54.30%
dog 51.30%
door 43.50%
fence 18.10%
fire_hydrant 53.00%
flower 37.90%
fog 26.50%
food 20.80%
fruit 40.00%
furniture 37.10%
hill 39.80%
horse 56.30%
house 24.20%
leaves 37.80%
light 60.10%
motorcycle 82.80%
mountain 33.60%
mud 38.20%
paper 83.30%
person 13.80%
plastic 35.80%
potted_plant 69.70%
river 31.60%
road 48.30%
rock 28.40%
roof 25.00%
rug 57.70%
sand 31.40%
sea 29.70%
sheep 65.00%
shoe 62.60%
sky 15.50%
snow 30.80%
stairs 23.80%
stone 28.60%
street_sign 74.50%
table 34.20%
textile 50.90%
tile 41.10%
toilet 74.40%
train 73.50%
tree 22.10%
truck 55.30%
vase 62.10%
vegetable 49.30%
wall 19.20%
water 21.80%
water_droplets 53.10%
window 31.30%
wood 30.20%
Overall Accuracy 43.80%

After seeing the low accuracy of these classes, we were confident that something was wrong with the class labels. To confirm this, we performed one more experiment, Experiment 6.

Experiment 6: Testing the model on critical classes (those with low accuracy)[edit]

These are some sample images from this experiment, where we can easily observe that although the images are similar to each other content-wise, they appear under different categories.

Food
Similar images to the Food category using EfficientNetV2B3

The above (source) image belongs to the Food category. Notice that the last three images contain food but belong to different categories, which was the cause of the low score in the previous experiment.

Similar images to the House category using EfficientNetV2B3

This is a second sample of recommendations; the (source) image belongs to the House category.

Spark: Recommendation Tree Generation[edit]

1. PCA generation for mapping embeddings to a [256,] vector[edit]

When we started this step there were mainly two choices available:
1. pyspark.ml.feature.PCA
2. sklearn.decomposition.IncrementalPCA
and both had their problems.
The problem with pyspark.ml.feature.PCA was that it requires a Spark context every time, even at inference, which was not possible in our case. The problem with sklearn.decomposition.IncrementalPCA was that it cannot be distributed with Spark.
We chose sklearn.decomposition.IncrementalPCA because we could feed it batches of data. This reduces Spark's efficiency, but it does not require a Spark context at inference time, which was what we needed.

import numpy as np
from sklearn.decomposition import IncrementalPCA
from tqdm import tqdm

# Create the PCA instance with 256 components
pca = IncrementalPCA(n_components=256, batch_size=50000)

# Feed one batch of data (a Spark DataFrame) to the PCA
def fill_pca(df):
    global pca
    df_list = df.collect()
    features = []
    for row in tqdm(df_list):
        features.append(list(row['features']))
    arr = np.array(features)
    pca = pca.partial_fit(arr)

This is the part of the code used to feed the PCA with the required data, batch by batch.

import pickle

with open('pca256.pkl', 'wb') as f:
    pickle.dump(pca, f)

At the end we simply pickled the PCA instance so it can be used later to apply PCA to the embeddings.

2. Index to URL map file generation[edit]

We had a small problem: annoy returns the index of the matched image embedding, but its limitation is that only integers can be used as indices, while we need the URL of the matched image.
To solve this, we came up with a map that accepts the index of an image and returns its URL. To generate this map we used the following code:

from annoy import AnnoyIndex

tree = AnnoyIndex(256, 'angular')

# will serve as the index -> URL map
d = dict()

ite = 0
def fill_tree(df, ite):
    df_list = df.collect()
    # iterate over the batch row by row
    for row in tqdm(df_list):
        data = np.array(list(row['features'])).reshape(1, -1)

        # map the integer index to the image URL
        d[ite] = row['image_url']

        tree.add_item(ite, pca.transform(data).reshape(-1))
        ite += 1
    return ite

Finally, we dumped the map file to use when serving the API.

with open('./idx2url.pkl', 'wb') as f:
    pickle.dump(d, f)

3. Annoy tree generation for recommendation[edit]

After generating the PCA and the idx2url map, it was time to complete the final step: generating the annoy tree that makes the recommendations of similar images. There is not much left to say about annoy, since we have discussed everything about it in this document, so we will jump straight to the code used to fill the tree with embeddings (note that this is the same fill_tree function shown above: filling the map and the tree happens in a single pass):

# an instance of the annoy tree
tree = AnnoyIndex(256, 'angular')
d = dict()
ite = 0

# this function fills the tree with embeddings row by row
def fill_tree(df, ite):
    df_list = df.collect()
    for row in tqdm(df_list):
        data = np.array(list(row['features'])).reshape(1, -1)
        d[ite] = row['image_url']

        # note that embeddings are added to the tree after applying PCA
        tree.add_item(ite, pca.transform(data).reshape(-1))

        ite += 1
    return ite

To understand why 100 is passed to the build method below, refer to the following paragraph, taken from the annoy GitHub page:
a.build(n_trees, n_jobs=-1) builds a forest of n_trees trees. More trees gives higher precision when querying. After calling build, no more items can be added. n_jobs specifies the number of threads used to build the trees. n_jobs=-1 uses all available CPU cores.

We saved the tree so it can be used in the API:

tree.build(100)
tree.save('./tree.ann')
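At API time the saved index can be loaded back; annoy memory-maps the file, so loading is fast:

tree = AnnoyIndex(256, 'angular')
tree.load('./tree.ann')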

4. Testing everything so far, before moving further[edit]

import urllib.request

import numpy as np

# take the URL of a test image
url = 'https://www.malaysia-today.net/wp-content/uploads/2019/11/Taj-Mahal-Agra-India.jpg'
# download the image data from the URL
img_data = urllib.request.urlopen(url).read()
# convert the raw data into an image array
image = generate_image(img_data)

# first pass the image to the model (EfficientNetV2B3)
features = model.predict(np.expand_dims(image, axis=0))
# apply the PCA to the embeddings
feature_pca = loaded_pca.transform(features)

# use the tree to get indices of similar images, then map them to image URLs
res = [d[i] for i in tree.get_nns_by_vector(feature_pca.reshape(-1), n=10)]
Similar Images

Codes[edit]

API code: please refer to this page.
UI code: check this out.

Results[edit]

You can test the results yourself: [1]

On Phabricator[edit]

Follow this task
