Retrieval Optimized CLIP Models

Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment (arXiv)

Konstantin Schall, Kai Uwe Barthel, Nico Hezel, Klaus Jung

Visual Computing Group HTW Berlin


Abstract:

Contrastive Language and Image Pairing (CLIP), a transformative method in multimedia retrieval, typically trains two neural networks concurrently to generate joint embeddings for text and image pairs. However, when applied directly, these models often struggle to differentiate between visually distinct images that have similar captions, resulting in suboptimal performance for image-based similarity searches. This paper addresses the challenge of optimizing CLIP models for various image-based similarity search scenarios, while maintaining their effectiveness in text-based search tasks such as text-to-image retrieval and zero-shot classification. We propose and evaluate two novel methods aimed at refining the retrieval capabilities of CLIP without compromising the alignment between text and image embeddings. The first method involves a sequential fine-tuning process: initially optimizing the image encoder for more precise image retrieval and subsequently realigning the text encoder to these optimized image embeddings. The second approach integrates pseudo-captions during the retrieval-optimization phase to foster direct alignment within the embedding space. Through comprehensive experiments, we demonstrate that these methods enhance CLIP's performance on various benchmarks, including image retrieval, k-NN classification, and zero-shot text-based classification, while maintaining robustness in text-to-image retrieval. Our optimized models permit maintaining a single embedding per image, significantly simplifying the infrastructure needed for large-scale multi-modal similarity search systems.

Method:

We propose two methods that significantly improve pre-trained CLIP models for image-to-image retrieval while preserving the joint-embedding alignment and their performance on text-based tasks.

The second method, Multi-Caption-Image-Pairing (MCIP), leads to the best results across all models:

[Figure: Multi-Caption-Image-Pairing]
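
For illustration only, here is a minimal sketch (not the authors' reference implementation) of what pairing a single image with several captions in a CLIP-style contrastive objective could look like. The function name, the caption ordering, the number of captions per image, and the temperature value are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def multi_caption_contrastive_loss(image_emb, text_emb, captions_per_image, temperature=0.07):
    """Hypothetical multi-caption contrastive loss.

    image_emb: (N, D) image embeddings.
    text_emb:  (N * captions_per_image, D) caption embeddings, ordered so that
               captions i*k .. i*k + k-1 belong to image i (an assumption).
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity of every image to every caption in the batch.
    logits = image_emb @ text_emb.t() / temperature  # (N, N*k)

    # Multi-hot soft targets: all k captions of an image count as positives.
    n, k = image_emb.shape[0], captions_per_image
    targets = torch.zeros_like(logits)
    for i in range(n):
        targets[i, i * k:(i + 1) * k] = 1.0 / k

    # Cross-entropy against the soft targets (image-to-text direction only).
    return torch.sum(-targets * F.log_softmax(logits, dim=-1), dim=-1).mean()
```
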

Results:

[Figure: results table]

Model Checkpoints:

| open_clip Name | open_clip pretrained | Optimized Checkpoint |
| --- | --- | --- |
| ViT-L-14-336 | openai | checkpoint |
| ViT-SO400M-14-SigLIP-384 | webli | checkpoint |

Using our models

If you want to try out our models, simply install open_clip, download one of the checkpoints above, create the corresponding open_clip model instance, and load our weights. That's it!

import torch
import open_clip

# Create the base open_clip model and its image preprocessing transform
model, _, transform = open_clip.create_model_and_transforms("ViT-SO400M-14-SigLIP-384", pretrained="webli")

# Load the retrieval-optimized (MCIP) weights from the downloaded checkpoint
checkpoint_path = '/path/to/checkpoint.pth'
mcip_state_dict = torch.load(checkpoint_path)
model.load_state_dict(mcip_state_dict, strict=True)
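
As a quick follow-up, the sketch below shows how the loaded model might be used to embed an image and a few captions and rank them by cosine similarity. It reuses `model` and `transform` from the snippet above; the image path and the example captions are placeholders, while `open_clip.get_tokenizer`, `encode_image`, and `encode_text` are part of open_clip's standard API.

```python
import torch
from PIL import Image
import open_clip

# Tokenizer matching the chosen model
tokenizer = open_clip.get_tokenizer("ViT-SO400M-14-SigLIP-384")

# Placeholder inputs
image = transform(Image.open("/path/to/image.jpg")).unsqueeze(0)
texts = tokenizer(["a photo of a dog", "a photo of a cat"])

model.eval()
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)

    # Normalize and compute cosine similarities between the image and each caption
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    similarity = image_features @ text_features.t()

print(similarity)
```
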