Cosine similarity — measuring similarity between multiple images

Onyeka Okonji

Introduction:

If you’ve used an iPhone from the X-series onwards, you’ll be familiar with the Face ID feature, which lets you control access to different features of your phone, such as unlocking it, approving downloads and making online payments. If that picture is too vague, you have surely still wondered at some point, “how can I get my computer to detect the similarity between two images?”

This article aims to give you a good understanding of a very useful and popular machine learning technique for measuring the similarity between data, most commonly unstructured data such as images and text. It is subdivided into 3 sections:

  1. Representational Learning — Embeddings.
  2. Understanding the intuition behind cosine similarity.
  3. Implementation of cosine similarity in PyTorch.

Learning representations from images:

Core to the function of any deep learning model is the learning of representations that help it recognize its input data. Representation learning refers to the extraction of the key features of an input which, taken together, help identify it. Through representation learning, we obtain a vector that captures the relevant features of the input and aids in its classification and recognition. The idea behind representation learning is to obtain a low-dimensional feature vector that is a good estimate (or summary) of the relevant features of a given input. Note that representation learning is used in many areas of machine learning, including computer vision, natural language processing and dimensionality reduction.

To provide a better understanding, let’s use an example. Suppose our input is an image of a 4-legged table. When it is passed through a DNN (Deep Neural Network), the model should be able to extract the relevant features of this table, enough to aid its future recognition and classification. Now suppose our input batch contains images of the table, a lion and a human being; the model will learn representations for each of these images that help recognize and classify them. For the table, it should learn features relating to the 4 vertical edges and 1 horizontal surface; for the lion, features relating to its 4 legs, its mane and its tail; and for the human being, features relating to the 2 legs, the 2 arms and whatever else it needs to form a good representation of the image.

Coming back to the point about obtaining a low-dimensional feature vector: whereas the input (in this case an image) is usually a 3-D array of pixel values, the output of the representation learning process is a 1-D feature vector that encodes the relevant features characterizing the image. Should you want to learn more about representation learning, I highly recommend this paper by Yoshua Bengio et al., which provides an excellent explanation of the concept; if you prefer videos, there is also this video where he explains the intuition behind it.

Having explained what representation learning is about, with the goal being to generate feature vectors (otherwise called embeddings) that capture the relevant features needed to recognize and classify a given input image, let’s move on to the next item: measuring similarity.
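To make the idea of an embedding concrete, here is a minimal sketch that maps an image to a 1-D feature vector using a small pre-trained model from torchvision; the file name is just a placeholder, and later in the article we will do the same thing with a much larger Vision Transformer.

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as tr
from PIL import Image

# small pre-trained backbone used purely for illustration
backbone = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)
backbone.fc = nn.Identity()  # drop the classification layer so the output is the feature vector
backbone.eval()

preprocess = tr.Compose([tr.Resize((224, 224)), tr.ToTensor(),
                         tr.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))])

img = preprocess(Image.open('table.jpg').convert('RGB')).unsqueeze(0)  # shape: (1, 3, 224, 224)
with torch.no_grad():
    embedding = backbone(img)
print(embedding.shape)  # (1, 512): a low-dimensional summary of the image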

Cosine Similarity:

While there are a number of similarity measurements, otherwise called distance measurements, this article focuses on using cosine similarity to measure how similar (or dissimilar) two images are. However, if you want to learn more about the other distance measurements, please visit this link:

As you may have noticed, I use similarity measurement and distance measurement interchangeably. This is because, to calculate the similarity between two or more inputs in machine learning, we calculate the distance between their feature vectors, i.e. their learned representations. The intuition behind cosine similarity is that, for any given number of feature vectors in space, similar vectors will lie close together while dissimilar vectors will lie farther apart from each other.

Given 2 vectors A and B projected into a feature space, A and B will lie close to each other (the distance between them is small) if they are similar; otherwise, they will lie far apart.

Cosine similarity measurement intuition for two vectors in space

From the image above, the idea behind cosine similarity is to take the angle between two vectors in a vector space and compute its cosine. The result always falls within the range -1 to +1: a score of -1 means the 2 vectors point in opposite directions (180 degrees apart), a score of 0 means the vectors are orthogonal to each other, and scores close to +1 mean the vectors point in the same (or nearly the same) direction.

The formula for calculating the cosine similarity goes thus:

cosine_similarity(A, B) = cos(θ) = (A · B) / (||A|| ||B||)

The numerator of the equation is the dot product of the two vectors, while the denominator is the product of the sizes (otherwise called the magnitudes) of the two vectors.

The magnitude of a vector is the square root of the sum of the squares of its components. Therefore, the equation above can be re-written as:

cosine_similarity(A, B) = (Σᵢ AᵢBᵢ) / ( √(Σᵢ Aᵢ²) · √(Σᵢ Bᵢ²) )

We can implement this in Python using plain NumPy or the math library, but this can prove tedious for the large feature vectors you will most often see in practical cases. Thankfully, PyTorch provides a built-in function that calculates the cosine similarity between two feature vectors for us.
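As a quick sanity check of the formula, here is a minimal NumPy sketch with made-up vectors, compared against PyTorch’s built-in torch.nn.functional.cosine_similarity:

import numpy as np
import torch
import torch.nn.functional as F

def cosine_similarity(a, b):
    # dot product of the two vectors divided by the product of their magnitudes
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 4.0, 4.0])   # same direction as a
c = np.array([-1.0, 0.0, 0.0])  # points away from a

print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # about -0.33

# the same result using PyTorch's built-in function (it expects batched tensors)
print(F.cosine_similarity(torch.tensor(a).unsqueeze(0), torch.tensor(c).unsqueeze(0)))  # tensor([-0.3333])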

Having come this far, let’s attempt an example to make all that we’ve learnt concrete. To do this, I will find out how similar 2 images are, using an image of a table, a lion and a human being, as exemplified earlier on. To follow along, you can download the sample images, which I obtained from Unsplash: person, lion and table. When you download the images, it should be obvious that the 3 are entirely different, so let’s add a fourth image, another image of a lion (I love lions a lot), which you can find here. For added complexity, I chose for the fourth image a lion of a different colour, in a different orientation and in a different environment from the first. If you’ve come this far, well done; take a deep breath and let’s dive into calculating the similarity each image has with the others.

In line with good Python practice, I will write a Python class tasked with performing all the necessary sub-processes and eventually returning the cosine similarity.

The class, whose methods are shown one by one below, has a number of responsibilities: instantiating a DNN model, preprocessing the input images, learning the representations of the input images and finally calculating the cosine similarity score. Let’s take each method separately for a complete understanding.

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as tr
from torchvision.models import vit_h_14
from PIL import Image


class ImageSimilarity:  # class name assumed here; the original class declaration isn't shown
    def __init__(self, image_path_1, image_path_2, device=None):
        # use the requested device, otherwise fall back to CUDA if available
        self.device = device if device else ('cuda' if torch.cuda.is_available() else 'cpu')
        self.image_path_1 = image_path_1
        self.image_path_2 = image_path_2

    def model(self):
        """Instantiates the feature-extracting model

        Returns
        -------
        Vision Transformer model object
        """
        wt = torchvision.models.ViT_H_14_Weights.DEFAULT
        model = vit_h_14(weights=wt)
        # drop the classification head so the model outputs the feature embedding
        model.heads = nn.Sequential(*list(model.heads.children())[:-1])
        model = model.to(self.device)

        return model

As with all Python classes, we first create an __init__() method, which initializes the class with the parameters it needs to function. In this case, we initialize the class with the paths to the two input images whose similarity we will be calculating (and, optionally, the device to run on). The second method loads a pre-trained model. There are several ways to learn representations of input data: one is via probabilistic models, which aim to discover the probability distribution of the features in the input data; another is via a deep learning model, which, as you must already know, contains a sequential series of layers that extract features from the input they receive. In our case, we will use the second approach.

As you may also know, the layers of a DNN are tasked with learning features, with the shallow layers learning simple features like edges and contours while the deeper layers learn more class-specific features. This means that if we feed the DNN an image of a lion, the deeper layers will learn features specific to recognizing and classifying a lion, and their output is usually sent to a classification head made of dense layers. As such, for our task we will extract as output the features returned by the last layer before that classification head. And because we are using everyday objects which form part of most image recognition datasets used in computer vision, rather than training a model from scratch we can use a pre-trained model.

For this task, I opted for a Vision Transformer pre-trained model from torchvision. You can always choose a different pre-trained model, but keep in mind the factors that affect the quality of the learned representations, some of which include the depth of the model, its width, its patch or filter size, etc. Having loaded the pre-trained weights, we proceed to remove the final classification head, as its output isn’t of concern to us; we need the feature embedding produced just before it. Finally, we send the model to a GPU if one is available, otherwise it runs on the CPU. I should mention that if you use the pre-trained model above, it’s best run on a GPU: the weights are a little over 2 gigabytes and inference would be very slow on a CPU.
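If you want to verify the size claim yourself, a quick way is to count the parameters of the instantiated model (the exact count may vary slightly with your torchvision version):

import torchvision
from torchvision.models import vit_h_14

model = vit_h_14(weights=torchvision.models.ViT_H_14_Weights.DEFAULT)
num_params = sum(p.numel() for p in model.parameters())
# each float32 parameter takes 4 bytes
print(f"{num_params / 1e6:.0f}M parameters, roughly {num_params * 4 / 1e9:.1f} GB in float32")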

    def process_test_image(self, image_path):
        """Loads and preprocesses an input image

        Parameters
        ----------
        image_path : str

        Returns
        -------
        Processed image : torch.Tensor
        """
        img = Image.open(image_path).convert('RGB')  # make sure the image has 3 channels
        transformations = tr.Compose([tr.ToTensor(),
                                      tr.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
                                      tr.Resize((518, 518))])
        img = transformations(img).float()
        img = img.unsqueeze_(0)  # add a batch dimension

        img = img.to(self.device)

        return img

Having loaded the pre-trained model and cut off the classification head, so that the model now outputs features, we need to pre-process the input images to match the input the pre-trained model expects. This involves converting the image to a tensor as expected in PyTorch, normalizing it, resizing it to the size the model expects (in this case 518px by 518px), adding a batch dimension because the model expects a batch of images (even if the batch contains only one image) and, finally, sending the transformed input to the device chosen in the previous step, i.e. CPU or GPU.
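As a quick check that the preprocessing behaves as expected (the class name and file names here are placeholders matching the skeleton above), the processed tensor should have a batch dimension and the 518 by 518 spatial size the model expects:

sim = ImageSimilarity('lion.jpg', 'table.jpg')
img = sim.process_test_image('lion.jpg')
print(img.shape)  # expected: torch.Size([1, 3, 518, 518])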

    def get_embeddings(self):
        """Computes the embeddings of the two input images

        Returns
        -------
        emb_one, emb_two : torch.Tensor
        """
        img1 = self.process_test_image(self.image_path_1)
        img2 = self.process_test_image(self.image_path_2)
        model = self.model()
        model.eval()  # inference mode

        with torch.no_grad():  # no gradients are needed for feature extraction
            emb_one = model(img1).detach().cpu()
            emb_two = model(img2).detach().cpu()

        return emb_one, emb_two

After writing the necessary instance methods for loading a pre-trained model and processing the input data, the next phase, as expected, is to obtain the representation features, which we will call embeddings. From the code block immediately above, we can see that we first pre-process the input images before passing them to the pre-trained model. Because we will need to do some further computations on the output of the model, we detach it from the autograd graph it was created in, so that no gradients build up, and finally we move it to CPU memory.

    def compute_scores(self):
        """Computes cosine similarity between two vectors."""
        emb_one, emb_two = self.get_embeddings()
        scores = torch.nn.functional.cosine_similarity(emb_one, emb_two)

        return scores.numpy().tolist()

The final step in the process involves calculating the cosine similarity between the two embeddings generated in the previous step. For ease, we use the torch.nn.functional.cosine_similarity function provided by PyTorch; it gives us a score between -1 and +1 depending on how similar the two embeddings are, based on the angle between them in the vector space.
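Putting it all together, a minimal usage sketch looks like this (again, the class name and image file names are placeholders matching the skeleton above):

# compare the table image with the first lion image
similarity = ImageSimilarity('table.jpg', 'lion_1.jpg')
print(similarity.compute_scores())  # expect a low score for dissimilar images

# compare the two lion images
similarity = ImageSimilarity('lion_1.jpg', 'lion_2.jpg')
print(similarity.compute_scores())  # expect a noticeably higher score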

cosine similarity scores for 3 pairs of dissimilar images

Looking at the results of the 3 different tests above, using dissimilar images, we can see that we get very low scores, because the images in each pair are indeed dissimilar. You’ll notice that even when we compare the brown table with a brown lion, we still get a low score. This can be attributed to the quality of the representation learning done by the pre-trained model: it didn’t use colour as a key factor in recognizing tables and lions, as both can come in any number of colours.

cosine similarity check for similar pairs of images

As we can see above, even though the two lion images differ in colour, orientation and background, the model has learned the features required for classification, and so we get a reasonably high similarity score. This shows that a key factor in checking the similarity between two images is obtaining a very good embedding (feature vector) for each image, and that is a function of how good your DNN model is.

Side note:

Although this article focuses on image similarity, cosine similarity as a measure of similarity between inputs isn’t limited to images: it can also be used with text (e.g. document similarity, recommendation systems) and with videos (e.g. pose estimation, such as checking whether an athlete’s pose during training matches the required pose).

In summary;

  1. Representation learning refers to the extraction of meaningful features from input data which aid in its recognition and classification.
  2. Good representation learning is a key factor in accurately calculating the similarity (or dissimilarity) between two or more inputs.
  3. Cosine similarity measures how similar two or more vectors in a vector space are, based on the angle between them. Similar vectors have a score close to +1, while dissimilar vectors score closer to 0 or -1.
