CLIP#
The `Contrastive Language-Image Pre-training (CLIP) paper <https://arxiv.org/pdf/2103.00020.pdf>`_ offers an efficient method for learning image representations with natural language supervision. In essence, CLIP trains both an image encoder and a text encoder from scratch, jointly optimizing them to predict the correct pairings of a batch of (image, text) training examples.
During pre-training, the model is designed to predict which images and texts form a semantically coherent pair by maximizing the similarity between the correct (image, text) pairs while minimizing the similarity between incorrect pairs. This contrastive learning approach ensures that CLIP learns meaningful and contextually rich representations of both visual and textual data.
Upon completion of the pre-training phase, CLIP models can be fine-tuned for specialized downstream tasks or directly employed for zero-shot learning. This approach facilitates seamless image and text representation learning and has demonstrated exceptional effectiveness across a diverse range of applications.
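The symmetric contrastive objective described above can be sketched with a toy NumPy example. This is an illustration of the idea only, not NeMo's or OpenAI's implementation: the embeddings are random stand-ins for encoder outputs, and ``temperature`` plays the role of CLIP's learned temperature parameter.

```python
import numpy as np

def cross_entropy(logits):
    """Mean negative log-softmax of the diagonal (the correct pairings)."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n = logits.shape[0]
    return -log_probs[np.arange(n), np.arange(n)].mean()

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired embeddings."""
    # L2-normalize so dot products become cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # logits[i, j] = similarity between image i and text j; the correct
    # pairing for image i is text i, i.e. the diagonal of the matrix
    logits = image_emb @ text_emb.T / temperature
    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
images = rng.normal(size=(8, 32))
matched_loss = clip_contrastive_loss(images, images)  # perfectly aligned pairs
mismatched_loss = clip_contrastive_loss(images, rng.normal(size=(8, 32)))
```

Aligned pairs yield a much lower loss than unrelated ones, which is exactly the gradient signal that pulls matching image and text embeddings together.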
To get started with CLIP, follow these steps.
Import from Hugging Face to NeMo 2.0#
The following script downloads the checkpoint for CLIP and converts it to NeMo format. The converted checkpoint is then stored in the NeMo cache folder located at ``~/.cache/nemo``. For example, when used with the NeMo container, the full path is ``/root/.cache/nemo/models/openai/clip-vit-large-patch14``. The checkpoint can be used to initialize the Vision Language Model (VLM) and fine-tune the CLIP model for Supervised Fine-Tuning (SFT).
.. code-block:: python

    from nemo.collections.llm import import_ckpt
    from nemo.collections import vlm
    from nemo.collections.vlm import ClipConfigL14

    if __name__ == '__main__':
        # Specify the Hugging Face model ID
        hf_model_id = "hf://openai/clip-vit-large-patch14"

        # Import the model and convert to NeMo 2.0 format
        import_ckpt(
            model=vlm.CLIPModel(ClipConfigL14()),  # Model configuration
            source=f"{hf_model_id}",  # Hugging Face model source
        )
NeMo 2.0 Pretraining Recipes#
We provide some default recipes for pretraining CLIP, e.g. ``clip_b32``.
.. code-block:: python

    from nemo.collections import vlm

    pretrain = vlm.clip_b32.pretrain_recipe(
        name="clip_pretrain",
        dir="/path/to/checkpoints",
        num_nodes=1,
        num_gpus_per_node=8,
    )
Note
The configuration in the recipes is done using the NeMo-Run ``run.Config`` and ``run.Partial`` configuration objects. Please review the NeMo-Run documentation to learn more about its configuration and execution system.
Note
The recipes use the ``MockDataModule`` for the ``data`` argument. You are expected to replace the ``MockDataModule`` with your custom dataset.
Once you have your final configuration ready, you can execute it using any of the NeMo-Run supported executors. The simplest option is the local executor, which runs the pretraining locally in a separate process. You can use it as follows:

.. code-block:: python

    import nemo_run as run

    run.run(pretrain, executor=run.LocalExecutor())
Alternatively, you can run it directly in the same Python process as follows:

.. code-block:: python

    run.run(pretrain, direct=True)
Use the Energon Dataloader#
Given a dataset in WebDataset format, you can use the Energon data loader to prepare the data for use with CLIP. You can run the following command from the ``<data_root>`` directory to convert the WebDataset format to the Energon format:
.. code-block:: bash

    energon prepare .
Use ``CrudeSample`` as your sample class.
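If your raw (image, text) pairs are not yet in WebDataset format, note that a WebDataset shard is simply a tar archive in which all files belonging to one sample share a basename (for example, ``000000.jpg`` with its caption in ``000000.txt``). Below is a minimal, hypothetical sketch using only the Python standard library; the byte strings are placeholders standing in for real JPEG data.

```python
import io
import tarfile

# Placeholder samples: (image bytes, caption). Real data would hold JPEG bytes.
samples = [
    (b"<jpeg bytes for image 0>", "a photo of a cat"),
    (b"<jpeg bytes for image 1>", "a photo of a dog"),
]

def add_member(tar, name, payload):
    """Add one file to the tar archive from an in-memory payload."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

with tarfile.open("shard-000000.tar", "w") as tar:
    for idx, (image_bytes, caption) in enumerate(samples):
        basename = f"{idx:06d}"
        # Files sharing a basename are grouped into one (image, text) sample
        add_member(tar, basename + ".jpg", image_bytes)
        add_member(tar, basename + ".txt", caption.encode("utf-8"))
```

A directory of such shards can then be converted with ``energon prepare`` as described above.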
Below is an example of how to set up the Energon data module for CLIP training:
.. code-block:: python

    from nemo.collections.multimodal.data.energon import EnergonMultiModalDataModule
    from nemo.collections.vlm.clip.data.clip_data_module import ClipTaskEncoder

    # Paths and configuration
    data_path = "<path_to_dataset>"
    text_seq_length = 80
    mbs = 500
    gbs = 4000
    num_workers = 16

    # Load the task encoder for train and validation
    train_task_encoder = ClipTaskEncoder(max_length=text_seq_length)
    valid_task_encoder = ClipTaskEncoder(max_length=text_seq_length, is_train=False)

    data = EnergonMultiModalDataModule(
        data_path,
        seq_length=text_seq_length,
        image_processor=None,
        micro_batch_size=mbs,
        global_batch_size=gbs,
        num_workers=num_workers,
        task_encoder=train_task_encoder,
        tokenizer=train_task_encoder.tokenizer,
        validation_task_encoder=valid_task_encoder,
        image_decode="pil",
        ignore_decoder_errors=True,
    )
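As a sanity check on the batch-size settings: assuming pure data parallelism (one model replica per GPU), the global batch size must be a multiple of the micro batch size times the data-parallel size, and the quotient is the number of gradient-accumulation steps per optimizer step. The helper below is our own illustration of that arithmetic, not a NeMo API:

```python
def grad_accumulation_steps(global_batch_size, micro_batch_size, data_parallel_size):
    """Micro-batches each rank accumulates before one optimizer step."""
    per_step = micro_batch_size * data_parallel_size
    if global_batch_size % per_step != 0:
        raise ValueError("global batch size must be a multiple of mbs * DP size")
    return global_batch_size // per_step

# With the example values (mbs=500, gbs=4000) on 8 data-parallel GPUs,
# each GPU processes exactly one micro-batch per optimizer step:
steps = grad_accumulation_steps(global_batch_size=4000, micro_batch_size=500,
                                data_parallel_size=8)
# steps == 1
```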
Replace the ``MockDataModule`` in the default recipes with the data module defined above.
.. code-block:: python

    from nemo.collections import vlm

    # Define the pretraining recipe
    pretrain = vlm.clip_b32.pretrain_recipe(
        name="clip_pretrain",
        dir="/path/to/checkpoints",
        num_nodes=1,
        num_gpus_per_node=8,
    )

    # Assign the data module defined above to the recipe
    pretrain.data = data
We have also included additional example scripts to further customize CLIP training and inference:
- Inference with Hugging Face and NeMo: ``clip_infer.py``
- Pretraining: ``clip_pretrain.py``
These scripts allow for flexible and comprehensive training workflows tailored to your requirements. For example, if you want to do SFT, you can use ``clip_pretrain.py`` and pass ``restore_path`` as the checkpoint path obtained after the Hugging Face conversion to NeMo 2.0.