Fahim Rustamy, PhD

Towards Data Science

Dec 11, 2023

CLIP, which stands for Contrastive Language-Image Pretraining, is a deep learning model developed by OpenAI in 2021. CLIP’s embeddings for images and text share the same space, enabling direct comparisons between the two modalities. This is accomplished by training the model to bring related images and texts closer together while pushing unrelated ones apart.

Some applications of CLIP include:

  1. Image Classification and Retrieval: CLIP can be used for image classification tasks by associating images with natural language descriptions. It allows for more versatile and flexible image retrieval systems where users can search for images using textual queries.
  2. Content Moderation: CLIP can be used to moderate content on online platforms by analyzing images and accompanying text to identify and filter out inappropriate or harmful content.
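Text-to-image retrieval in a shared embedding space reduces to ranking images by cosine similarity with the query embedding. The sketch below uses hand-written 3-dimensional vectors as stand-ins for real CLIP embeddings (the numbers and the helper name `l2_normalize` are illustrative assumptions, not outputs of any model).

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical embeddings for three images (one row per image)
image_embeds = l2_normalize(np.array([
    [0.9, 0.1, 0.0],   # image 0
    [0.1, 0.9, 0.1],   # image 1
    [0.0, 0.2, 0.9],   # image 2
]))

# Hypothetical embedding of a text query
query_embed = l2_normalize(np.array([0.0, 0.1, 1.0]))

# For unit vectors, the dot product is the cosine similarity
scores = image_embeds @ query_embed      # shape [3]
ranking = np.argsort(-scores)            # best match first
best = int(ranking[0])
print("ranking:", ranking.tolist())
```

Zero-shot classification works the same way with the roles swapped: one image embedding is compared against the embeddings of several candidate text descriptions.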

The original CLIP model aimed to unite image and text modalities within a shared embedding space. This concept, along with its techniques, extends beyond images and text to embrace other modalities. Netflix, in this blog post, trained a model by combining video and text modalities in the common embedding space to enhance search within video applications. Contrastive Language-Audio Pretraining (CLAP) is another model that integrates text and audio modalities within the same embedding space, making it valuable for improving search functionalities within audio applications.

The underlying technology for CLIP is simple but powerful, opening the door for many multimodal machine learning techniques. Meta AI recently released ImageBind, which learns a joint embedding across six modalities: images, text, audio, depth, thermal, and IMU data. CLIP, one of the first large-scale AI models to accept two modalities, is a prerequisite to understanding ImageBind and other multimodal AI systems.

CLIP Model and The Importance of Multimodal Embeddings | by Fahim Rustamy, PhD | Dec, 2023 - image  on https://aiquantumintelligence.com
ImageBind from Meta AI accepts six different modalities as input (taken from ImageBind’s official GitHub page).

What is CLIP

CLIP is designed to predict which of the N × N possible (image, text) pairings within a batch are actual matches. To achieve this, CLIP learns a multimodal embedding space by jointly training an image encoder and a text encoder. The CLIP loss aims to maximize the cosine similarity between the image and text embeddings for the N genuine pairs in the batch while minimizing the cosine similarity for the N² − N incorrect pairings. The optimization uses a symmetric cross-entropy loss over these similarity scores. The following pseudocode (taken from the original paper) outlines the core implementation of CLIP.

# image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l] - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t - learned temperature parameter
# extract feature representations of each modality
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T) #[n, d_t]
# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)
# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)
# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2
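The objective described above can also be written out explicitly. The notation below is my own shorthand rather than the paper's: s_ij is the scaled similarity between image i and text j, and the two terms are the cross-entropy losses taken over the rows and over the columns of the N × N similarity matrix.

```latex
\begin{aligned}
s_{ij} &= e^{t}\,\langle I_{e,i},\, T_{e,j} \rangle \\
\mathcal{L}_{I \to T} &= -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ij})} \\
\mathcal{L}_{T \to I} &= -\frac{1}{N} \sum_{j=1}^{N} \log \frac{\exp(s_{jj})}{\sum_{i=1}^{N} \exp(s_{ij})} \\
\mathcal{L} &= \tfrac{1}{2}\left(\mathcal{L}_{I \to T} + \mathcal{L}_{T \to I}\right)
\end{aligned}
```

The first term asks each image to pick out its own caption among the N captions in the batch; the second asks each caption to pick out its own image.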

Here’s a step-by-step description of each line in the pseudo code and its implementation using PyTorch:

Model Architecture:

CLIP uses two separate architectures as backbones for encoding images and text:

  • image_encoder: Represents the neural network architecture (e.g., ResNet or Vision Transformer) responsible for encoding images.
  • text_encoder: Represents the neural network architecture (e.g., CBOW, BERT, or Text Transformer) responsible for encoding textual information.

The original CLIP model was trained from scratch, without initializing the image and text encoders with pre-trained weights, because the training dataset was large (400 million image-text pairs). In the example in this blog post, we’ll do things a bit differently: we’ll initialize the two encoders with pre-trained weights from ResNet (for images) and DistilBERT (for text).

Architecture of CLIP model (taken from the original paper)

Input Data:

The model takes a batch of n pairs of images and texts as input where:

  • I[n, h, w, c]: Represents a minibatch of aligned images, where n is the batch size, h is the image height, w is the image width, and c is the number of channels.
  • T[n, l]: Represents a minibatch of aligned texts, where n is the batch size, and l is the length of the textual sequence.
One batch of image and caption pairs for a batch size of 128
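One practical detail about these shapes: the pseudocode lists images as [n, h, w, c], whereas torchvision models expect channels first, [n, c, h, w]. A minimal sketch with random stand-in data (the batch size of 4 and the vocabulary size of 1000 are arbitrary assumptions) shows the conversion and the shape of a tokenized text batch:

```python
import torch

n, h, w, c = 4, 224, 224, 3
l = 32  # maximum text length, matching max_len used later in this post

# Layout used in the pseudocode: I[n, h, w, c]
images_nhwc = torch.rand(n, h, w, c)

# Channels-first layout expected by torchvision models: [n, c, h, w]
images_nchw = images_nhwc.permute(0, 3, 1, 2)

# Stand-in for tokenizer output T[n, l]: one row of token ids per caption
token_ids = torch.randint(0, 1000, (n, l))

print(images_nchw.shape, token_ids.shape)
```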

Feature Extraction:

  • I_f = image_encoder(I): Extracts feature representations (I_f) from the image encoder. The shape of I_f is [n, d_i], where d_i is the dimensionality of the image features.
  • T_f = text_encoder(T): Extracts feature representations (T_f) from the text encoder. The shape of T_f is [n, d_t], where d_t is the dimensionality of the text features.
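from torchvision import models
from transformers import AutoModel
image_encoder = models.resnet34(pretrained=True)  # backbone for encoding images
text_encoder = AutoModel.from_pretrained("distilbert-base-multilingual-cased")  # backbone for encoding captions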

Learned Projections:

  • W_i[d_i, d_e]: Represents the learned projection matrix for mapping image features (I_f) to an embedding space (I_e). The shape of W_i is [d_i, d_e], where d_e is the desired dimensionality of the joint embedding space.
  • W_t[d_t, d_e]: Represents the learned projection matrix for mapping text features (T_f) to the same embedding space (T_e). The shape of W_t is [d_t, d_e].

The projection operation can be coded using a neural network with two linear layers, whose weights are the learned projection matrix. In most cases, the projection weights are the only weights with active gradients that can be trained on new datasets. Additionally, the projection layer plays a crucial role in aligning the dimensions of image and text embeddings, ensuring that they have the same size.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Projection(nn.Module):
    def __init__(self, d_in: int, d_out: int, p: float = 0.5) -> None:
        super().__init__()  # must run before submodules are registered
        self.linear1 = nn.Linear(d_in, d_out, bias=False)
        self.linear2 = nn.Linear(d_out, d_out, bias=False)
        self.layer_norm = nn.LayerNorm(d_out)
        self.drop = nn.Dropout(p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        embed1 = self.linear1(x)
        # Residual branch: GELU -> linear -> dropout
        embed2 = self.drop(self.linear2(F.gelu(embed1)))
        embeds = self.layer_norm(embed1 + embed2)
        return embeds
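To see the dimension-aligning role of the projection in isolation, here is a stripped-down sketch that uses a single bias-free linear layer per modality, which corresponds to the plain W_i and W_t matrices in the pseudocode rather than the two-layer module above. The feature sizes 512 and 768 match ResNet-34 and DistilBERT; the joint size of 256 and the batch size of 8 are arbitrary assumptions.

```python
import torch
import torch.nn as nn

d_i, d_t, d_e = 512, 768, 256   # image dim, text dim, joint dim (d_e is hypothetical)

W_i = nn.Linear(d_i, d_e, bias=False)   # plays the role of W_i[d_i, d_e]
W_t = nn.Linear(d_t, d_e, bias=False)   # plays the role of W_t[d_t, d_e]

I_f = torch.randn(8, d_i)   # stand-in image features [n, d_i]
T_f = torch.randn(8, d_t)   # stand-in text features  [n, d_t]

I_p = W_i(I_f)   # [n, d_e]
T_p = W_t(T_f)   # [n, d_e]

# Both modalities now live in a space of the same size and can be compared
print(I_p.shape, T_p.shape)
```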

Embedding and Normalization:

  • I_e = l2_normalize(np.dot(I_f, W_i), axis=1): Embeds and normalizes image features in the joint embedding space (I_e).
  • T_e = l2_normalize(np.dot(T_f, W_t), axis=1): Embeds and normalizes text features in the joint embedding space (T_e).

The code below illustrates the sequential processing of image and text data. The data first passes through the base encoder and then through the projection layer. Finally, normalized embeddings are generated for both modalities and returned.

class VisionEncoder(nn.Module):
    def __init__(self, d_out: int) -> None:
        super().__init__()
        base = models.resnet34(pretrained=True)
        d_in = base.fc.in_features
        base.fc = nn.Identity()  # drop the classification head, keep the features
        self.base = base
        self.projection = Projection(d_in, d_out)
        # Freeze the pre-trained backbone; only the projection is trained
        for p in self.base.parameters():
            p.requires_grad = False

    def forward(self, x):
        projected_vec = self.projection(self.base(x))
        projection_len = torch.norm(projected_vec, dim=-1, keepdim=True)
        return projected_vec / projection_len  # L2-normalized embedding

class TextEncoder(nn.Module):
    def __init__(self, d_out: int) -> None:
        super().__init__()
        self.base = AutoModel.from_pretrained(Config.text_model)
        self.projection = Projection(Config.transformer_embed_dim, d_out)
        # Freeze the pre-trained backbone; only the projection is trained
        for p in self.base.parameters():
            p.requires_grad = False

    def forward(self, x):
        out = self.base(x)[0]
        out = out[:, 0, :]  # get CLS token output
        projected_vec = self.projection(out)
        projection_len = torch.norm(projected_vec, dim=-1, keepdim=True)
        return projected_vec / projection_len  # L2-normalized embedding

vision_encoder = VisionEncoder(Config.embed_dim)
I_e = vision_encoder(images)
caption_encoder = TextEncoder(Config.embed_dim)
T_e = caption_encoder(text["input_ids"])
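The reason for dividing by the norm is that the dot product of two unit-length vectors equals their cosine similarity, so the later matrix product directly yields cosine similarities. A quick check on random vectors (the sizes are arbitrary) illustrates this:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
a = torch.randn(5, 16)
b = torch.randn(5, 16)

# Same normalization as in the encoders above
a_n = a / torch.norm(a, dim=-1, keepdim=True)
b_n = b / torch.norm(b, dim=-1, keepdim=True)

# Row-wise dot product of the normalized vectors
dot_of_normalized = (a_n * b_n).sum(dim=-1)

# Cosine similarity computed directly on the unnormalized vectors
cosine = F.cosine_similarity(a, b, dim=-1)

same = torch.allclose(dot_of_normalized, cosine, atol=1e-6)
print(same)
```

`F.normalize(x, dim=-1)` is a built-in alternative to the manual division and also guards against division by zero.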

Cosine Similarities:

  • logits = np.dot(I_e, T_e.T) * np.exp(t): Computes pairwise cosine similarities between image and text embeddings, scaled by a learned temperature parameter t.

In this example, we use the terms similarity and logits interchangeably, as the original paper does. We will not include the temperature parameter t in this blog post.

logits = I_e @ T_e.T
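Although the temperature is left out of this post's model, its effect is easy to see in isolation. Cosine similarities are confined to [-1, 1], so a softmax over them is fairly flat; multiplying by exp(t) sharpens it. The sketch below uses made-up similarity values and the initial temperature of 0.07 reported in the CLIP paper; the `softmax` helper is my own.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical cosine similarities of one image against four captions
sims = np.array([0.9, 0.2, 0.1, -0.3])

t = np.log(1 / 0.07)   # so that exp(t) = 1 / 0.07, roughly 14.3

p_plain = softmax(sims)               # no temperature
p_scaled = softmax(sims * np.exp(t))  # scaled as in the pseudocode

print("without temperature:", p_plain.round(3))
print("with temperature:   ", p_scaled.round(3))
```

Without scaling, the best match receives well under half of the probability mass; with scaling, it receives almost all of it.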

Symmetric Loss Function:

CLIP uses contrastive loss (first introduced in Representation Learning with Contrastive Predictive Coding) to bring related images and texts closer together while pushing unrelated ones apart.

  • labels = np.arange(n): Generates labels representing the indices of the batch.
  • loss_i = cross_entropy_loss(logits, labels, axis=0): Computes the cross-entropy loss along the image axis.
  • loss_t = cross_entropy_loss(logits, labels, axis=1): Computes the cross-entropy loss along the text axis.
  • loss = (loss_i + loss_t)/2: Computes the symmetric average of the image and text losses.
def CLIP_loss(logits: torch.Tensor) -> torch.Tensor:
    n = logits.shape[1]  # number of samples
    # Matching pairs sit on the diagonal, so sample k has label k
    labels = torch.arange(n, device=logits.device)
    # Calculate cross entropy losses along axis 0 and 1
    loss_i = F.cross_entropy(logits.transpose(0, 1), labels, reduction="mean")
    loss_t = F.cross_entropy(logits, labels, reduction="mean")
    # Calculate the final loss
    loss = (loss_i + loss_t) / 2

    return loss
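To make the behaviour of this loss concrete, the following NumPy sketch reimplements it from scratch (the function names are my own) and evaluates it on two hand-built 4 × 4 similarity matrices: one where every pair looks equally similar, and one where the matching pairs on the diagonal score +1 and everything else scores -1.

```python
import numpy as np

def cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Mean cross-entropy where each row of `logits` is one sample."""
    z = logits - logits.max(axis=1, keepdims=True)            # stabilize
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def clip_loss_np(logits: np.ndarray) -> float:
    """Symmetric loss: average of row-wise and column-wise cross-entropy."""
    labels = np.arange(logits.shape[0])
    loss_rows = cross_entropy(logits, labels)      # softmax over each row
    loss_cols = cross_entropy(logits.T, labels)    # softmax over each column
    return (loss_rows + loss_cols) / 2

n = 4
uniform = np.zeros((n, n))       # no pair stands out
aligned = 2 * np.eye(n) - 1      # +1 on the diagonal, -1 elsewhere

loss_uniform = clip_loss_np(uniform)
loss_aligned = clip_loss_np(aligned)
print(loss_uniform, loss_aligned)
```

With uninformative similarities the loss equals ln(n), and it drops as the diagonal entries pull ahead of the rest.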

Final Custom CLIP Model

Combining all the different pieces together, the final custom CLIP model looks like the following:

class CustomModel(nn.Module):
    def __init__(self, lr: float = 1e-3) -> None:
        super().__init__()
        self.vision_encoder = VisionEncoder(Config.embed_dim)
        self.caption_encoder = TextEncoder(Config.embed_dim)
        # Tokenizer (a wrapper around the Hugging Face tokenizer) and metrics
        # are helpers that are not shown in this post
        self.tokenizer = Tokenizer(AutoTokenizer.from_pretrained(Config.text_model))
        self.lr = lr
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

    def forward(self, images, text):
        text = self.tokenizer(text).to(self.device)

        image_embed = self.vision_encoder(images)
        caption_embed = self.caption_encoder(text["input_ids"])
        # Rows index captions, columns index images
        similarity = caption_embed @ image_embed.T

        loss = CLIP_loss(similarity)
        img_acc, cap_acc = metrics(similarity)
        return loss, img_acc, cap_acc
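The `metrics` helper is not shown in the post. One plausible way to compute two accuracies from the similarity matrix is top-1 retrieval accuracy in each direction; the function below is a hypothetical sketch under that assumption (its name and return order are mine), not the author's implementation.

```python
import torch

def retrieval_accuracy(similarity: torch.Tensor):
    """Top-1 accuracy in both directions for a [captions x images] matrix.

    Assumes caption k and image k form the true pair, i.e. the correct
    matches lie on the diagonal.
    """
    n = similarity.shape[0]
    target = torch.arange(n, device=similarity.device)
    # For each caption (row), is the highest-scoring image the right one?
    cap2img_acc = (similarity.argmax(dim=1) == target).float().mean()
    # For each image (column), is the highest-scoring caption the right one?
    img2cap_acc = (similarity.argmax(dim=0) == target).float().mean()
    return cap2img_acc, img2cap_acc

# Toy example: caption 1 wrongly prefers image 0
toy = torch.tensor([[0.9, 0.1],
                    [0.8, 0.2]])
cap2img, img2cap = retrieval_accuracy(toy)
print(cap2img.item(), img2cap.item())
```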


This example demonstrates the process of creating image caption datasets and training a custom CLIP model. The aim is to train a vision encoder and a text encoder jointly to project the representation of images and their captions into the same embedding space, such that the caption embeddings are located near the embeddings of the images they describe. The code for this project is in my GitHub repository.

Dataset and Dataloader

Our custom CLIP model will be trained using the flickr30k dataset. This dataset comprises more than 31,000 images, each with a minimum of 5 independent human-generated captions. We will use two captions for each image in this example to have a total of 62,000 image and text pairs for training. Although traditionally employed for image captioning tasks, we intend to adapt the image-caption pairs to train our dual encoder model specifically for image search purposes. The GitHub repository also includes the code to train the model on the MS-COCO dataset with 164,000 image and text pairs.

import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
from torchvision import transforms
from PIL import Image

# Define a custom dataset class for Flickr30k
class Flickr30kDataset(torch.utils.data.Dataset):
    def __init__(self):
        self.dataset = load_dataset("nlphuji/flickr30k", cache_dir="./huggingface_data")
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)),
            # Assumed: the transform list is cut off in the source; a tensor
            # conversion is needed because the batch is later used as a tensor
            transforms.ToTensor(),
        ])
        self.cap_per_image = 2

    def __len__(self):
        return self.dataset.num_rows["test"] * self.cap_per_image

    def __getitem__(self, idx):
        # Each image appears cap_per_image times, once per caption
        original_idx = idx // self.cap_per_image
        image = self.dataset["test"][original_idx]["image"].convert("RGB")
        image = self.transform(image)

        # labels
        caption = self.dataset["test"][original_idx]["caption"][idx % self.cap_per_image]

        return {"image": image, "caption": caption}

# Create an instance of the custom dataset
flickr30k_custom_dataset = Flickr30kDataset()
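The index arithmetic in `__getitem__` is what turns each image into two training pairs. A small standalone sketch (the helper name `locate` is hypothetical) shows how a flat dataset index maps to an image index and a caption index:

```python
cap_per_image = 2

def locate(idx: int) -> tuple:
    """Map a flat dataset index to (image index, caption index)."""
    return idx // cap_per_image, idx % cap_per_image

# The first six dataset items cover three images with two captions each
pairs = [locate(i) for i in range(6)]
print(pairs)
```

Consecutive indices therefore share an image and differ only in the caption, which is why the dataset length is the number of images times `cap_per_image`.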

Key model constants include embed_dim for learned representations, transformer_embed_dim for transformer layer features, and max_len for text input length. The chosen text_model is “distilbert-base-multilingual-cased.” Training spans 3 epochs with a batch_size of 128. These constants feed into model building and training.

from dataclasses import dataclass

@dataclass
class Config:
    """Configuration class for the CLIP training script."""

embed_dim: int = 512 # Embedding dimension
transformer_embed_dim: int = 768 # Transformer embedding dimension
max_len: int = 32 # Maximum text length
text_model: str = "distilbert-base-multilingual-cased" # Text model name
epochs: int = 3 # Number of training epochs
batch_size: int = 128 # Batch size

The DataLoader is set up for efficient iteration during training, providing organized access to image-caption pairs.

# Create the DataLoader
clip_dataloader = DataLoader(flickr30k_custom_dataset, batch_size=Config.batch_size, shuffle=True, num_workers=4)

Here is an example of an image caption pair in one of the batches in the dataset.

import numpy as np
import matplotlib.pyplot as plt
# Create an iterator from the dataloader
data_iter = iter(clip_dataloader)

# Get one batch
batch = next(data_iter)

image = batch["image"][0] # get one image from the batch
caption = batch["caption"][0] # get one text from the batch

# Convert the image tensor to a NumPy array and permute dimensions
image_np = np.transpose(image.numpy(), (1, 2, 0))

# Display the image and caption
plt.imshow(image_np)
plt.title(f"Caption: {caption}")
plt.show()


Here, we instantiate our CustomModel and send it to the device (CPU or GPU). Additionally, we specify the parameters to be optimized throughout the training process. Because the base layers of both the text and image encoders are frozen, only the parameters of the projection layers are trained on the new dataset.

# Create an instance of your model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = CustomModel().to(device)

# Define optimizer
optimizer = torch.optim.Adam([
{'params': model.vision_encoder.parameters()},
{'params': model.caption_encoder.parameters()}
], lr=model.lr)
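Passing frozen parameters to the optimizer is harmless, since parameters without gradients are skipped during the update, but it can be clearer to hand over only the trainable ones. The sketch below shows that pattern on a toy model; the two `nn.Linear` layers are stand-ins for a frozen backbone and a trainable projection, not the actual encoders.

```python
import torch
import torch.nn as nn

base = nn.Linear(16, 8)   # stand-in for a frozen pre-trained encoder
head = nn.Linear(8, 4)    # stand-in for a trainable projection layer

# Freeze the backbone, as done in VisionEncoder and TextEncoder
for p in base.parameters():
    p.requires_grad = False

toy_model = nn.Sequential(base, head)

# Keep only the parameters that will actually receive gradients
trainable = [p for p in toy_model.parameters() if p.requires_grad]
toy_optimizer = torch.optim.Adam(trainable, lr=1e-3)

n_trainable = sum(p.numel() for p in trainable)
n_total = sum(p.numel() for p in toy_model.parameters())
print(f"trainable parameters: {n_trainable} of {n_total}")
```

Counting parameters this way is also a quick check that the freezing worked as intended.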

Model training

The training was performed with a Tesla T4 (g4dn-xlarge) GPU machine for 3 training epochs. The Jupyter Notebook is available in the project’s GitHub repository and contains the code for the training loop.

batch_zero = True
start_epoch, num_epochs = 0, Config.epochs
for epoch in range(start_epoch, num_epochs):
    for batch in clip_dataloader:
        image = batch["image"].to(device)
        text = batch["caption"]
        # Forward pass through CustomModel.forward defined above
        loss, img_acc, cap_acc = model(image, text)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch_zero:
            print(f"Epoch [{0}/{num_epochs}], Batch Loss: {loss.item()}")
            batch_zero = False

    # Print training statistics
    print(f"Epoch [{epoch+1}/{num_epochs}], Batch Loss: {loss.item()}")

print("Training complete.")

The following are the results of the training loop for each epoch using the Flickr30k dataset. For additional details, please refer to this notebook.

Epoch [0/3], Batch Loss: 4.854558944702148
Epoch [1/3], Batch Loss: 3.187166690826416
Epoch [2/3], Batch Loss: 3.0981950759887695
Epoch [3/3], Batch Loss: 3.164858818054199
Training complete.

Here are the results of the training loop for each epoch using the COCO2017 dataset. The model converges faster on the COCO dataset, which is attributed to the availability of over 160,000 image-text pairs, in contrast to the 62,000 image-text pairs in the Flickr30k dataset. For additional details, please refer to this notebook.

Epoch [0/3], Batch Loss: 4.852224349975586
Epoch [1/3], Batch Loss: 2.7819151878356934
Epoch [2/3], Batch Loss: 2.727229118347168
Epoch [3/3], Batch Loss: 2.717097759246826
Training complete.
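A quick sanity check on the first logged value in both runs: if the similarities carry no information yet, each softmax spreads its probability evenly over the 128 candidates in the batch, and the cross-entropy equals ln(128). The snippet below simply evaluates that number for comparison with the Epoch [0/3] lines above.

```python
import math

batch_size = 128

# Cross-entropy when the correct pair gets probability 1 / batch_size
chance_loss = math.log(batch_size)
print(f"{chance_loss:.4f}")  # prints 4.8520
```

Both runs start within a few thousandths of this value, which is consistent with the freshly initialized projection layers producing nearly uniform similarities.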


In conclusion, this blog post has explored the CLIP model, from its contrastive training objective to a custom implementation trained on image-caption pairs. Its approach of placing different modalities in one shared embedding space extends well beyond images and text, as models for video, audio, and other modalities show. CLIP was among the first models to successfully bridge the gap between modalities at scale, and it opened avenues for cross-disciplinary innovation.
