Image captioning is the task of generating a text description of an input image. It involves both Computer Vision (such as Vision Transformers or CNNs) and Natural Language Processing (NLP), in the form of language models.
Image captioning has many real-world applications, such as image search and describing visual content to users with visual impairments so they can better understand it.
In this tutorial, you will learn how to perform image captioning using pre-trained models, as well as train your own model using PyTorch with the help of the transformers library in Python.
As you may already know, transformer architectures have recently dominated the NLP field, with models like GPT and BERT outperforming previous recurrent neural network architectures.
For computer vision, that's now also the case! When the paper "An Image is Worth 16x16 Words" was released, transformers proved to be powerful in vision as well. Models like the Vision Transformer (ViT) and DeiT have demonstrated state-of-the-art results in various computer vision tasks, such as image classification, object detection, image segmentation, and many more.
The figure below shows the ViT architecture, taken from the original paper:

Figure 1: The Vision Transformer (ViT) architecture
The idea behind the Vision Transformer architecture is to split the image into fixed-size patches, flatten them, and create lower-dimensional linear embeddings from these patches. This way, the image behaves as if it were a sequence of text tokens.
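To make the patch idea concrete, here is a minimal sketch (not part of the tutorial's code) of how a 224x224 image turns into a sequence of patch embeddings; the dimensions are the ones used by ViT-Base:

```python
import torch
import torch.nn as nn

# a toy patch embedding: a 224x224 RGB image becomes 196 "tokens" of 16x16 pixels each
image = torch.randn(1, 3, 224, 224)  # (batch, channels, height, width)
patch_size, embed_dim = 16, 768      # ViT-Base values

# a Conv2d with kernel_size=stride=patch_size is equivalent to cutting the image
# into non-overlapping patches and applying a linear projection to each of them
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
patches = patch_embed(image)                 # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 768): a "sentence" of 196 patch tokens
print(tokens.shape)                          # torch.Size([1, 196, 768])
```

In the real model, a learnable class token and positional embeddings are added before the transformer encoder attends over this sequence.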
Another vision transformer is the Swin Transformer, which adds the idea of Shifted Windows: it brings greater efficiency by limiting self-attention computation to non-overlapping local windows while still permitting cross-window connections. Here is the main difference between ViT and Swin, in a figure taken from the Swin paper:

Figure 2: Swin Transformer vs ViT
Many research papers, such as the TrOCR paper, have shown that initializing image-to-text sequence models with pre-trained checkpoints is effective.
Therefore, in this tutorial, we will use the Vision Encoder-Decoder architecture, where the encoder is a ViT or Swin (or any other vision transformer) and the decoder is a language model such as GPT2 or BERT, something like this:

Figure 3: The Vision Encoder-Decoder architecture we'll use for image captioning
The most common dataset for image captioning is Common Objects in Context (COCO). We'll be using the 2014 version of it, which contains more than 500,000 image-caption pairs.
There is also the 2017 version of the COCO dataset, as well as Flickr30k, which contains 31,000 images collected from Flickr. You're free to choose any dataset you want, or you can combine them if you know what you're doing.
In this tutorial, we will start by using models that are already trained, so we can get a sense of how easy it is to get started with 🤗 Transformers.
Next, we'll train our own model on the COCO dataset using the Trainer class, and also using a regular PyTorch training loop, so you can pick the one that suits you best.
After that, we'll see how image captioning models are evaluated and which metrics are used to compare them.
Finally, we'll use our model to generate captions of any image we find on the Internet. Let's get started!
We will use the transformers library, as well as the 🤗 evaluate and datasets libraries, for proper model evaluation and for downloading the dataset.
You can use either PyTorch or TensorFlow under transformers. I'll choose PyTorch for this:
```bash
$ pip install torch transformers rouge_score evaluate datasets
```
We need the rouge_score library because it's a native implementation of the ROUGE score in Python; we'll see why it's needed in the next sections.
Of course, it's suggested that you use GPU for deep learning, as it'll be much faster, even during inference you'll notice a lot of improvements in terms of inference time. Head tothis link to install PyTorch for your CUDA version. If you're on Google Colab, just ensure you're picking "GPU" in the notebook settings.
Open up a new Jupyter or Colab notebook, and import the following:
```python
import requests
import torch
from PIL import Image
from transformers import *
from tqdm import tqdm

# set device to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
```
It's not recommended to import everything from transformers, as a lot of classes and methods will be imported. Feel free to change this and only import what you need.
Throughout the tutorial, we'll be passing our model and data inputs to the device specified above. If CUDA is installed and available, then it'll be "cuda", and "cpu" otherwise.
Next, let's download a fine-tuned image captioning model:
```python
# load a fine-tuned image captioning model and corresponding tokenizer and image processor
finetuned_model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning").to(device)
finetuned_tokenizer = GPT2TokenizerFast.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
finetuned_image_processor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
```
This model is a PyTorch version of the FLAX one, fine-tuned on the COCO2017 dataset for a single epoch; you can see the training metrics here.
Let's try the model:
```python
import urllib.parse as parse
import os

# a function to determine whether a string is a URL or not
def is_url(string):
    try:
        result = parse.urlparse(string)
        return all([result.scheme, result.netloc, result.path])
    except:
        return False

# a function to load an image
def load_image(image_path):
    if is_url(image_path):
        return Image.open(requests.get(image_path, stream=True).raw)
    elif os.path.exists(image_path):
        return Image.open(image_path)

# a function to perform inference
def get_caption(model, image_processor, tokenizer, image_path):
    image = load_image(image_path)
    # preprocess the image
    img = image_processor(image, return_tensors="pt").to(device)
    # generate the caption (using greedy decoding by default)
    output = model.generate(**img)
    # decode the output
    caption = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
    return caption
```
The get_caption() function takes the model, the image processor, the tokenizer, and the image's path, and performs inference on the model.
We simply call the model.generate() method and pass the outputs of the image processor, which are the pixel values of the image. Let's use it:
```python
# load displayer
from IPython.display import display

url = "http://images.cocodataset.org/test-stuff2017/000000009384.jpg"
# display the image
display(load_image(url))
# get the caption
get_caption(finetuned_model, finetuned_image_processor, finetuned_tokenizer, url)
```
Output:

```text
a person walking down a street with a snow covered sidewalk
```
Excellent. You can pass any image you want, whether it's in your local environment or a URL, just like we did here. You can check this XML file containing some test images from the COCO2017 dataset.
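By the way, get_caption() relies on model.generate()'s default greedy decoding. If you want to experiment, generate() also accepts the usual decoding parameters, such as num_beams and max_length; the values below are just an illustration:

```python
# optional: beam search instead of greedy decoding (illustrative settings)
image = load_image(url)
pixel_values = finetuned_image_processor(image, return_tensors="pt").pixel_values.to(device)
output = finetuned_model.generate(
    pixel_values,
    max_length=32,        # cap the caption length
    num_beams=4,          # beam search with 4 beams
    early_stopping=True,  # stop when all beams reach the end token
)
print(finetuned_tokenizer.batch_decode(output, skip_special_tokens=True)[0])
```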
Now that we're familiar with image captioning, let's fine-tune our model from pre-trained encoder and decoder models:
```python
# the encoder model that processes the image and returns the image features
# encoder_model = "WinKawaks/vit-small-patch16-224"
# encoder_model = "google/vit-base-patch16-224"
# encoder_model = "google/vit-base-patch16-224-in21k"
encoder_model = "microsoft/swin-base-patch4-window7-224-in22k"
# the decoder model that takes the image features and generates the caption text
# decoder_model = "bert-base-uncased"
# decoder_model = "prajjwal1/bert-tiny"
decoder_model = "gpt2"
# load the model
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    encoder_model, decoder_model
).to(device)
```
As demonstrated in Figure 3, the encoder is a vision transformer that encodes the image into hidden vectors. The decoder is a regular language model that takes these hidden vectors and decodes them into human text.
For this demo, as mentioned earlier, we're going with Microsoft's Swin vision transformer, which was pre-trained on ImageNet-21k (14 million images) at a resolution of 224x224. Check the Swin Transformer paper for more details. As for the decoder, I'm choosing the gpt2 language model.
If you have limited computing resources, make sure you use smaller models, such as WinKawaks/vit-small-patch16-224 for the encoder and prajjwal1/bert-tiny for the decoder; you can uncomment them above.
To load the pre-trained weights of both models and combine them into a single model, we use the VisionEncoderDecoderModel class and its from_encoder_decoder_pretrained() method, which expects the names of both models. You can browse all the models in the Hugging Face hub.
Next, we have to load our image_processor and tokenizer:
```python
# initialize the tokenizer
# tokenizer = AutoTokenizer.from_pretrained(decoder_model)
tokenizer = GPT2TokenizerFast.from_pretrained(decoder_model)
# tokenizer = BertTokenizerFast.from_pretrained(decoder_model)
# load the image processor
image_processor = ViTImageProcessor.from_pretrained(encoder_model)
```
We need the tokenizer to turn our captions into sequences of integers, using GPT2TokenizerFast. If you're using a different decoder, make sure to comment this out and use the AutoTokenizer class instead. The reason I'm using GPT2TokenizerFast is that it's way faster than AutoTokenizer when the decoder is GPT2.
The ViTImageProcessor is responsible for processing our images before training/inference, such as normalizing them, resizing them to the appropriate resolution, and scaling the pixel values.
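If you're curious about what the image processor actually does, you can inspect its configuration; this is just a quick sanity check, not required for training, and the exact values come from the checkpoint we loaded:

```python
# inspect the preprocessing configuration and the resulting tensor shape
print(image_processor.size)        # target resolution, e.g. {'height': 224, 'width': 224}
print(image_processor.image_mean)  # per-channel normalization mean
print(image_processor.image_std)   # per-channel normalization std

# a single image becomes a (1, 3, 224, 224) float tensor of pixel values
sample_image = load_image("http://images.cocodataset.org/test-stuff2017/000000009384.jpg")
print(image_processor(sample_image, return_tensors="pt").pixel_values.shape)
```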
Before proceeding, we have to make sure that the decoder_start_token_id and pad_token_id are present in our model configuration. Therefore, we have to set them manually using the tokenizer config:
if "gpt2" in decoder_model: # gpt2 does not have decoder_start_token_id and pad_token_id # but has bos_token_id and eos_token_id tokenizer.pad_token = tokenizer.eos_token # pad_token_id as eos_token_id model.config.eos_token_id = tokenizer.eos_token_id model.config.pad_token_id = tokenizer.pad_token_id # set decoder_start_token_id as bos_token_id model.config.decoder_start_token_id = tokenizer.bos_token_idelse: # set the decoder start token id to the CLS token id of the tokenizer model.config.decoder_start_token_id = tokenizer.cls_token_id # set the pad token id to the pad token id of the tokenizer model.config.pad_token_id = tokenizer.pad_token_idHere is a definition of each special token defined above:
- bos_token_id is the ID of the token that represents the beginning of the sentence.
- eos_token_id is the ID of the token that represents the end of the sentence.
- decoder_start_token_id indicates the starting point for the decoder to begin generating the target sequence (in our case, the caption).
- pad_token_id is used to pad short sequences of text to a fixed length.
- cls_token_id represents the classification token and is typically used by BERT and similar tokenizers as the first token in a sequence, before the actual sentence starts.

The GPT2 tokenizer does not have a pad_token_id or a decoder_start_token_id, but it has a bos_token_id and an eos_token_id. Therefore, we can simply set the pad_token as the eos_token and the decoder_start_token_id as the bos_token_id.
For other language models such as BERT, we set the decoder_start_token_id to the cls_token_id.
The reason we're setting all of these is that when we assemble our model, these token ids are not loaded by default. If we do not set them now, we'll get weird errors later in training.
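As an optional sanity check, you can print the relevant token IDs to confirm the configuration took effect; for GPT2, bos, eos, and (after the setup above) pad all map to the same <|endoftext|> token, so expect repeated values:

```python
# verify the special token IDs we just configured
print("decoder_start_token_id:", model.config.decoder_start_token_id)
print("pad_token_id:", model.config.pad_token_id)
print("eos_token_id:", model.config.eos_token_id)
print("tokenizer bos/eos:", tokenizer.bos_token_id, tokenizer.eos_token_id)
```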
Now that we've constructed our model, let's get into our dataset. As mentioned at the beginning, we will use COCO2014 (Karpathy's annotations & splits):
```python
from datasets import load_dataset

max_length = 32          # max length of the captions in tokens
coco_dataset_ratio = 50  # 50% of the COCO2014 dataset
train_ds = load_dataset("HuggingFaceM4/COCO", split=f"train[:{coco_dataset_ratio}%]")
valid_ds = load_dataset("HuggingFaceM4/COCO", split=f"validation[:{coco_dataset_ratio}%]")
test_ds = load_dataset("HuggingFaceM4/COCO", split="test")
len(train_ds), len(valid_ds), len(test_ds)
```
This will take more than 20 minutes to download if you're on Colab, as it's more than 20GB in total size. COCO2017 is much bigger than that, by the way.
Since we have limited computing resources, I'm only taking 50% of the total dataset. Nevertheless, here are the total samples of each set:
```text
(283374, 12505, 25010)
```
I'm taking the complete test set so we can reliably compare models. That's over 280K training samples, and it's only 50% of the dataset. Feel free to change this ratio to a lower number if you just want to get going with the training, or to a higher one (possibly 100%) if you have a good GPU and enough time.
max_length is the maximum length of the caption, so our captions will have at most 32 tokens. If a caption is longer than that, it will be truncated; if it's shorter, it will be padded with the pad_token_id.
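To see what this padding and truncation looks like in practice, here's a small illustration on a made-up caption (the exact IDs depend on the GPT2 vocabulary):

```python
# tokenize a sample caption the same way we'll tokenize the dataset
sample_caption = "A cat is sitting on a wooden bench in the park."
encoded = tokenizer(sample_caption, max_length=max_length,
                    padding="max_length", truncation=True, return_tensors="pt")
print(encoded["input_ids"].shape)    # torch.Size([1, 32]): always max_length tokens
print(encoded["input_ids"][0][:12])  # the first few token IDs; the rest is padding
```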
Next, during my initial training, I ran into some errors because some samples do not have 3 dimensions. Therefore, I'm filtering them out here:
```python
import numpy as np

# remove the images with less than 3 dimensions (possibly grayscale images)
train_ds = train_ds.filter(lambda item: np.array(item["image"]).ndim in [3, 4], num_proc=2)
valid_ds = valid_ds.filter(lambda item: np.array(item["image"]).ndim in [3, 4], num_proc=2)
test_ds = test_ds.filter(lambda item: np.array(item["image"]).ndim in [3, 4], num_proc=2)
```
I'm using the .filter() method to keep only the images with the expected dimensions. Setting num_proc to 2 speeds up the processing by running it on two CPU cores. If you have more CPU cores, increase this number to speed things up.
Now that we have valid samples, let's preprocess our inputs:
```python
def preprocess(items):
    # preprocess the image
    pixel_values = image_processor(items["image"], return_tensors="pt").pixel_values.to(device)
    # tokenize the caption with truncation and padding
    targets = tokenizer([ sentence["raw"] for sentence in items["sentences"] ],
                        max_length=max_length, padding="max_length",
                        truncation=True, return_tensors="pt").to(device)
    return {'pixel_values': pixel_values, 'labels': targets["input_ids"]}

# using with_transform to preprocess the dataset during training
train_dataset = train_ds.with_transform(preprocess)
valid_dataset = valid_ds.with_transform(preprocess)
test_dataset = test_ds.with_transform(preprocess)
```
The preprocess() function expects the samples as its parameter (items). We preprocess the image using our image processor and tokenize the captions with truncation and padding using our tokenizer. At the end, we move both to the device using the to() method.
We could use the map() function to process our dataset. However, it may take too long and consume too much memory and storage in our case, as it's a large dataset, so we'll use with_transform() instead, so that the preprocess() function runs only during training. In other words, the preprocessing happens on the fly as we pass the batches to the model.
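You can see the difference for yourself: the raw dataset still exposes the original columns, while the transformed view runs preprocess() only when a sample is accessed (a quick check, nothing more):

```python
# the raw dataset keeps the original columns (the PIL image and its caption text)...
print(train_ds[0]["sentences"]["raw"])
# ...while the transformed view computes model-ready tensors on access
print(train_dataset[0].keys())  # should show just 'pixel_values' and 'labels'
```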
Next, we define our function that collates the batches:
```python
# a function we'll use to collate the batches
def collate_fn(batch):
    return {
        'pixel_values': torch.stack([x['pixel_values'] for x in batch]),
        'labels': torch.stack([x['labels'] for x in batch])
    }
```
We will pass the collate_fn() callback to our data loader before we start training.
There are a lot of metrics for image captioning; to mention a few: BLEU, ROUGE, METEOR, and CIDEr.
For this tutorial, we're going to stick with the ROUGE-L and BLEU scores. The code below loads these metrics so we can compute them for a model:
```python
import evaluate

# load the rouge and bleu metrics
rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

def compute_metrics(eval_pred):
    preds = eval_pred.label_ids
    labels = eval_pred.predictions
    # decode the predictions and labels
    pred_str = tokenizer.batch_decode(preds, skip_special_tokens=True)
    labels_str = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # compute the rouge score
    rouge_result = rouge.compute(predictions=pred_str, references=labels_str)
    # multiply by 100 to get percentages
    rouge_result = {k: round(v * 100, 4) for k, v in rouge_result.items()}
    # compute the bleu score
    bleu_result = bleu.compute(predictions=pred_str, references=labels_str)
    # get the length of the generated captions
    generation_length = bleu_result["translation_length"]
    return {
        **rouge_result,
        "bleu": round(bleu_result["bleu"] * 100, 4),
        "gen_len": bleu_result["translation_length"] / len(preds)
    }
```
The compute_metrics() function takes an EvalPrediction object, decodes the predictions and labels using the tokenizer, and computes the ROUGE and BLEU scores; we also multiply the scores by 100.
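If you'd like to get a feel for these metrics before training, you can call them directly on a couple of toy captions (illustrative strings only):

```python
# a quick, standalone sanity check of the two metrics
toy_preds = ["a cat sitting on a couch"]
toy_refs = ["a cat is sitting on the couch"]
print(rouge.compute(predictions=toy_preds, references=toy_refs))
print(bleu.compute(predictions=toy_preds, references=[[r] for r in toy_refs]))
```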
Let's define some basic training parameters:
```python
num_epochs = 2   # number of epochs
batch_size = 16  # the size of batches
```
We're going through the dataset twice. Again, if you have more compute, increase num_epochs to, say, 10. At the time of writing, the free version of Colab provides an NVIDIA Tesla T4, which handles a batch_size of 16 very well without raising any Out of Memory errors.
If you have a GPU with more VRAM, you should increase the batch_size to take full advantage of your GPU and speed up the training.
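If you're not sure what GPU and how much VRAM you're working with, a quick check like this can help you pick a batch_size (purely informational):

```python
# print the GPU name and total memory to help choose a suitable batch_size
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No GPU available; training on the CPU will be very slow.")
```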
Before we proceed with training, let's print a single sample to see whether the shapes are as expected:
```python
for item in train_dataset:
    print(item["labels"].shape)
    print(item["pixel_values"].shape)
    break
```
I'm iterating over the training dataset and printing the shapes of labels and pixel_values. Here's the output:
```text
torch.Size([32])
torch.Size([3, 224, 224])
```
These are PyTorch tensors with the expected shapes: labels is the tokenized caption with a length of max_length=32, and pixel_values is the actual image at a (3, 224, 224) resolution.
For training, we have two choices. The first is using the Trainer class provided by the transformers library, which is convenient and very simple to use. Alternatively, you can use a regular PyTorch training loop. I will show you how to do both, and you're free to pick whichever suits you best.
Let's define the training arguments:
```python
# define the training arguments
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,             # use generate() for evaluation predictions (needed for ROUGE/BLEU)
    num_train_epochs=num_epochs,            # number of epochs
    evaluation_strategy="steps",            # evaluate after each eval_steps
    eval_steps=2000,                        # evaluate after each 2000 steps
    logging_steps=2000,                     # log after each 2000 steps
    save_steps=2000,                        # save after each 2000 steps
    per_device_train_batch_size=batch_size, # batch size for training
    per_device_eval_batch_size=batch_size,  # batch size for evaluation
    output_dir="vit-swin-base-224-gpt2-image-captioning",  # output directory
    # push_to_hub=True  # whether you want to push the model to the hub,
    # check this guide for more details: https://huggingface.co/transformers/model_sharing.html
)
```
We will evaluate, log, and save the model checkpoint every 2000 steps; you're encouraged to change this value depending on your batch_size, num_epochs, and coco_dataset_ratio.
There are about 100 parameters you can pass to Seq2SeqTrainingArguments; check the doc reference if you're curious.
Next, we pass the training arguments to our actual trainer, along with the collation and compute_metrics() functions, the model, and the datasets:
```python
# instantiate trainer
trainer = Seq2SeqTrainer(
    model=model,                     # the instantiated 🤗 Transformers model to be trained
    tokenizer=image_processor,       # we use the image processor as the tokenizer
    args=training_args,              # pass the training arguments
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    data_collator=collate_fn,
)
```
The documentation encourages us to subclass the Trainer when we want custom behavior. Since I'm too lazy for that, I'm simply overriding the get_train_dataloader(), get_eval_dataloader(), and get_test_dataloader() methods to return regular PyTorch DataLoaders:
```python
from torch.utils.data import DataLoader

def get_eval_loader(eval_dataset=None):
    return DataLoader(valid_dataset, collate_fn=collate_fn, batch_size=batch_size)

def get_test_loader(eval_dataset=None):
    return DataLoader(test_dataset, collate_fn=collate_fn, batch_size=batch_size)

# override the get_train_dataloader, get_eval_dataloader and
# get_test_dataloader methods of the trainer
# so that we can properly load the data
trainer.get_train_dataloader = lambda: DataLoader(train_dataset, collate_fn=collate_fn, batch_size=batch_size)
trainer.get_eval_dataloader = get_eval_loader
trainer.get_test_dataloader = get_test_loader
```
Let's fine-tune the model now:
```python
# train the model
trainer.train()
```
This will take several hours to train. Here's the output during my training of Abdou/vit-swin-base-224-gpt2-image-captioning:
```text
[10602/10602 5:08:53, Epoch 2/2]
Step   Training Loss  Validation Loss  Rouge1   Rouge2   RougeL   RougeLsum  Bleu     Gen Len
2000   1.0018         0.8859           38.6537  13.8145  35.3932  35.3935    8.2448   11.294636
4000   0.8827         0.8394           40.0458  14.8829  36.5321  36.5366    9.1169   11.294636
6000   0.8378         0.8139           41.2736  15.9576  37.5504  37.5512    9.8710   11.294636
8000   0.7913         0.8011           41.6642  16.1987  37.8786  37.8891    10.0786  11.294636
10000  0.7794         0.7933           41.9119  16.3738  38.1062  38.1292    10.2880  11.294636

TrainOutput(global_step=10602, training_loss=0.8540051526291104, metrics={'train_runtime': 18543.3546, 'train_samples_per_second': 36.59, 'train_steps_per_second': 0.572, 'total_flos': 1.2314333621526567e+20, 'train_loss': 0.8540051526291104, 'epoch': 2.0})
```
The above output was using a batch_size of 64. The training ended in approximately 5 hours on an NVIDIA A100 GPU. You can further increase the coco_dataset_ratio and num_epochs to increase the scores.
These scores (ROUGE-1, ROUGE-2, ROUGE-L, BLEU) are calculated on the validation set. Let's evaluate our model on the test set:
```python
# evaluate on the test_dataset
trainer.evaluate(test_dataset)
```
Output:
```text
{'eval_loss': 0.7923195362091064, 'eval_rouge1': 41.8451, 'eval_rouge2': 16.3493, 'eval_rougeL': 38.0288, 'eval_rougeLsum': 38.049, 'eval_bleu': 10.2776, 'eval_gen_len': 11.294636296840558, 'eval_runtime': 386.5944, 'eval_samples_per_second': 38.725, 'eval_steps_per_second': 0.605, 'epoch': 2.0}
```
Amazing, we got a BLEU score of ~10.28 and a ROUGE-L of ~38.03 with the predict_with_generate parameter set to True.
For those who prefer more flexibility in training, here's the regular PyTorch way of doing it. Let's wrap our training, validation, and testing sets in data loaders:
```python
# alternative way of training: pytorch loop
from torch.utils.data import DataLoader

# define our data loaders
train_dataset_loader = DataLoader(train_dataset, collate_fn=collate_fn, batch_size=batch_size, shuffle=True)
valid_dataset_loader = DataLoader(valid_dataset, collate_fn=collate_fn, batch_size=8, shuffle=True)
test_dataset_loader = DataLoader(test_dataset, collate_fn=collate_fn, batch_size=8, shuffle=True)
```
Defining the optimizer:
```python
from torch.optim import AdamW

# define the optimizer
optimizer = AdamW(model.parameters(), lr=1e-5)
```
I'm choosing the AdamW optimizer, as in the Trainer API. A learning rate of 1e-5 is nowhere near optimal, so feel free to play around with it.
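If you want to go beyond a fixed learning rate, transformers also ships scheduler helpers; here's an optional sketch using a linear schedule with warmup (the warmup value is just an illustration):

```python
from transformers import get_linear_schedule_with_warmup

# optional: linearly decay the learning rate after a short warmup phase
total_steps = num_epochs * len(train_dataset_loader)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,  # illustrative warmup value
    num_training_steps=total_steps,
)
# if you use it, call scheduler.step() right after optimizer.step() in the training loop below
```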
Optionally, load TensorBoard in the notebook:
```python
# start tensorboard
%load_ext tensorboard
%tensorboard --logdir ./image-captioning/tensorboard
```
Next, we define some variables and the tensorboard summary writer:
```python
from torch.utils.tensorboard import SummaryWriter

summary_writer = SummaryWriter(log_dir="./image-captioning/tensorboard")
# print some statistics before training
# number of training steps
n_train_steps = num_epochs * len(train_dataset_loader)
# number of validation steps
n_valid_steps = len(valid_dataset_loader)
# current training step
current_step = 0
# logging, eval & save steps
save_steps = 1000
```
Here's the training loop now:
```python
for epoch in range(num_epochs):
    # set the model to training mode
    model.train()
    # initialize the training loss
    train_loss = 0
    for batch in tqdm(train_dataset_loader, "Training", total=len(train_dataset_loader), leave=False):
        if current_step % save_steps == 0:
            ### evaluation code ###
            # evaluate on the validation set
            # if the current step is a multiple of the save steps
            print(f"\nValidation at step {current_step}...\n")
            # set the model to evaluation mode
            model.eval()
            # initialize our lists that store the predictions and the labels
            predictions, labels = [], []
            # initialize the validation loss
            valid_loss = 0
            # note: use a different variable name than the outer loop's `batch`,
            # so the training batch below isn't overwritten during validation
            for valid_batch in valid_dataset_loader:
                # get the batch
                pixel_values = valid_batch["pixel_values"]
                label_ids = valid_batch["labels"]
                # forward pass
                outputs = model(pixel_values=pixel_values, labels=label_ids)
                # get the loss
                loss = outputs.loss
                valid_loss += loss.item()
                # free the GPU memory
                logits = outputs.logits.detach().cpu()
                # add the predictions to the list
                predictions.extend(logits.argmax(dim=-1).tolist())
                # add the labels to the list
                labels.extend(label_ids.tolist())
            # make the EvalPrediction object that the compute_metrics function expects
            eval_prediction = EvalPrediction(predictions=predictions, label_ids=labels)
            # compute the metrics
            metrics = compute_metrics(eval_prediction)
            # print the stats
            print(f"\nEpoch: {epoch}, Step: {current_step}, Train Loss: {train_loss / save_steps:.4f}, " +
                  f"Valid Loss: {valid_loss / n_valid_steps:.4f}, BLEU: {metrics['bleu']:.4f}, " +
                  f"ROUGE-1: {metrics['rouge1']:.4f}, ROUGE-2: {metrics['rouge2']:.4f}, ROUGE-L: {metrics['rougeL']:.4f}\n")
            # log the metrics
            summary_writer.add_scalar("valid_loss", valid_loss / n_valid_steps, global_step=current_step)
            summary_writer.add_scalar("bleu", metrics["bleu"], global_step=current_step)
            summary_writer.add_scalar("rouge1", metrics["rouge1"], global_step=current_step)
            summary_writer.add_scalar("rouge2", metrics["rouge2"], global_step=current_step)
            summary_writer.add_scalar("rougeL", metrics["rougeL"], global_step=current_step)
            # save the model
            model.save_pretrained(f"./image-captioning/checkpoint-{current_step}")
            tokenizer.save_pretrained(f"./image-captioning/checkpoint-{current_step}")
            image_processor.save_pretrained(f"./image-captioning/checkpoint-{current_step}")
            # get the model back to train mode
            model.train()
            # reset the train and valid loss
            train_loss, valid_loss = 0, 0
        ### training code below ###
        # get the batch & convert to tensor
        pixel_values = batch["pixel_values"]
        labels = batch["labels"]
        # forward pass
        outputs = model(pixel_values=pixel_values, labels=labels)
        # get the loss
        loss = outputs.loss
        # backward pass
        loss.backward()
        # update the weights
        optimizer.step()
        # zero the gradients
        optimizer.zero_grad()
        # log the loss
        loss_v = loss.item()
        train_loss += loss_v
        # increment the step
        current_step += 1
        # log the training loss
        summary_writer.add_scalar("train_loss", loss_v, global_step=current_step)
```
In the training loop, we're doing the forward and backward passes, updating the weights using optimizer.step(), and zeroing the gradients.
If the current_step is a multiple of save_steps, we perform an evaluation on the validation set, print out the metrics, and add them to the tensorboard. I ran the training, and here's what the output looks like:
```text
Training:   0%|          | 0/17669 [00:00<?, ?it/s]
Validation at step 1000...
Epoch: 0, Step: 1000, Train Loss: 0.0000, Valid Loss: 1.0927, BLEU: 8.1102, ROUGE-1: 42.6778, ROUGE-2: 13.0396, ROUGE-L: 40.6797
Training:   6%|▌         | 1000/17669 [24:51<4:38:49, 1.00s/it]
Validation at step 2000...
Epoch: 0, Step: 2000, Train Loss: 1.0966, Valid Loss: 0.9991, BLEU: 10.8885, ROUGE-1: 46.1669, ROUGE-2: 16.6826, ROUGE-L: 44.4348
Training:  11%|█▏        | 2000/17669 [49:38<4:33:30, 1.05s/it]
Validation at step 3000...
Epoch: 0, Step: 3000, Train Loss: 1.0323, Valid Loss: 0.9679, BLEU: 11.6235, ROUGE-1: 47.1454, ROUGE-2: 17.6634, ROUGE-L: 45.5163
Model weights saved in ./image-captioning/checkpoint-3000/pytorch_model.bin
```
I stopped the training after 3000 steps, so it's working!
I see that the metrics are best after 3000 steps (I'm sure you can get better results if you continue training). Let's load that model:
```python
# load the best model, change the checkpoint number to the best checkpoint
# if the last checkpoint is the best, then ignore this cell
best_checkpoint = 3000
best_model = VisionEncoderDecoderModel.from_pretrained(f"./image-captioning/checkpoint-{best_checkpoint}").to(device)
```
In this section, we are going to evaluate three different models:
- The best_model we just fine-tuned with the PyTorch training loop.
- The fine-tuned model we loaded at the beginning of the tutorial, nlpconnect/vit-gpt2-image-captioning.
- The model that was fine-tuned using the Trainer class and is pushed to the hub, Abdou/vit-swin-base-224-gpt2-image-captioning.

First, let's make a function that takes the model and dataset as input and returns the metrics for that model:
```python
def get_evaluation_metrics(model, dataset):
    model.eval()
    # define our dataloader
    dataloader = DataLoader(dataset, collate_fn=collate_fn, batch_size=batch_size)
    # number of testing steps
    n_test_steps = len(dataloader)
    # initialize our lists that store the predictions and the labels
    predictions, labels = [], []
    # initialize the test loss
    test_loss = 0.0
    for batch in tqdm(dataloader, "Evaluating"):
        # get the batch
        pixel_values = batch["pixel_values"]
        label_ids = batch["labels"]
        # forward pass
        outputs = model(pixel_values=pixel_values, labels=label_ids)
        # outputs = model.generate(pixel_values=pixel_values, max_length=max_length)
        # get the loss
        loss = outputs.loss
        test_loss += loss.item()
        # free the GPU memory
        logits = outputs.logits.detach().cpu()
        # add the predictions to the list
        predictions.extend(logits.argmax(dim=-1).tolist())
        # add the labels to the list
        labels.extend(label_ids.tolist())
    # make the EvalPrediction object that the compute_metrics function expects
    eval_prediction = EvalPrediction(predictions=predictions, label_ids=labels)
    # compute the metrics
    metrics = compute_metrics(eval_prediction)
    # add the test_loss to the metrics
    metrics["test_loss"] = test_loss / n_test_steps
    return metrics
```
It's quite similar (maybe even identical) to the evaluation code in the training loop we wrote earlier. Let's use this function to evaluate the best_model we just trained:
```python
metrics = get_evaluation_metrics(best_model, test_dataset)
metrics
```
Output:
```text
Evaluating: 100%|██████████| 6230/6230 [17:10<00:00,  6.04it/s]
{'rouge1': 46.9427, 'rouge2': 17.659, 'rougeL': 45.2971, 'rougeLsum': 45.2916, 'bleu': 11.7049, 'gen_len': 11.262560192616373, 'test_loss': 0.9731424459819809}
```
Next, remember that we loaded the nlpconnect/vit-gpt2-image-captioning model at the beginning of this tutorial. Let's evaluate it on the COCO2014 test set:
```python
get_evaluation_metrics(finetuned_model, test_dataset)
```
```text
{'rouge1': 48.624, 'rouge2': 20.5349, 'rougeL': 47.0933, 'rougeLsum': 47.0975, 'bleu': 11.7336, 'gen_len': 11.262560192616373, 'test_loss': 9.437558887552106}
```
This one is slightly better on almost all the metrics; the reason is that it was fine-tuned for a whole epoch (and not only 3000 steps) on the COCO2017 dataset.
Let's now load the Abdou/vit-swin-base-224-gpt2-image-captioning model using the simple pipeline API and do the evaluation:
```python
# using the pipeline API
image_captioner = pipeline("image-to-text", model="Abdou/vit-swin-base-224-gpt2-image-captioning")
image_captioner.model = image_captioner.model.to(device)
get_evaluation_metrics(image_captioner.model, test_dataset)
```
```text
{'rouge1': 53.1153, 'rouge2': 24.2307, 'rougeL': 51.5002, 'rougeLsum': 51.4983, 'bleu': 17.7765, 'gen_len': 11.262560192616373, 'test_loss': 0.7988893618313879}
```
That's much better than the previous two. In the next section, we'll have fun with these models and predict some images.
In the end, let's predict the captions of some sample images grabbed from the COCO2017 testing set:
```python
def show_image_and_captions(url):
    # get the image and display it
    display(load_image(url))
    # get the captions on various models
    our_caption = get_caption(best_model, image_processor, tokenizer, url)
    finetuned_caption = get_caption(finetuned_model, finetuned_image_processor, finetuned_tokenizer, url)
    pipeline_caption = get_caption(image_captioner.model, image_processor, tokenizer, url)
    # print the captions
    print(f"Our caption: {our_caption}")
    print(f"nlpconnect/vit-gpt2-image-captioning caption: {finetuned_caption}")
    print(f"Abdou/vit-swin-base-224-gpt2-image-captioning caption: {pipeline_caption}")
```
Below are some examples:
show_image_and_captions("http://images.cocodataset.org/test-stuff2017/000000000001.jpg")Output:

```text
Our caption: A truck parked in a parking lot with a man on the back.
nlpconnect/vit-gpt2-image-captioning caption: a green truck parked next to a curb
Abdou/vit-swin-base-224-gpt2-image-captioning caption: A police car parked next to a fence.
```
A second example:
show_image_and_captions("http://images.cocodataset.org/test-stuff2017/000000000019.jpg")
```text
Our caption: A cow standing in a field with a bunch of grass.
nlpconnect/vit-gpt2-image-captioning caption: a cow is standing in a field of grass
Abdou/vit-swin-base-224-gpt2-image-captioning caption: Two cows laying in a field with a sky background.
```
The first two models didn't see the second cow in the back! Here's a third example:
show_image_and_captions("http://images.cocodataset.org/test-stuff2017/000000000128.jpg")
```text
Our caption: A large elephant standing in a dirt field.
nlpconnect/vit-gpt2-image-captioning caption: an elephant with a large trunk standing on a dirt ground
Abdou/vit-swin-base-224-gpt2-image-captioning caption: An elephant standing next to a box on a cement ground.
```
Here's a final example:
show_image_and_captions("http://images.cocodataset.org/test-stuff2017/000000003720.jpg")
```text
Our caption: A woman standing on a sidewalk with a umbrella.
nlpconnect/vit-gpt2-image-captioning caption: a person walking down a street holding an umbrella
Abdou/vit-swin-base-224-gpt2-image-captioning caption: A woman holding an umbrella walking down a sidewalk.
```
Alright! We have covered a lot in this article:
- Performing image captioning with an already fine-tuned model.
- Building a Vision Encoder-Decoder model from pre-trained vision (Swin/ViT) and language (GPT2/BERT) checkpoints.
- Fine-tuning the model on the COCO2014 dataset using either the transformers' Trainer API or a regular PyTorch training loop.
- Evaluating and comparing image captioning models using the ROUGE and BLEU metrics.

Here are some suggestions to further improve the results:
- Try other models, such as the Salesforce/blip-image-captioning-base model; it can be used for both conditional and unconditional image captioning (see the sketch below).
- Increase the coco_dataset_ratio to 100 and train for more hours. If you obtain improved results, feel free to share your weights on the HuggingFace hub; this link should help you.
- Add more evaluation metrics to the compute_metrics() method; this link will help.
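As a starting point for the BLIP suggestion, here's a minimal sketch of how Salesforce/blip-image-captioning-base is typically used with its own processor; treat it as a hint rather than a drop-in replacement for the code above:

```python
from transformers import BlipProcessor, BlipForConditionalGeneration

# load the BLIP base captioning model and its processor
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to(device)

image = load_image("http://images.cocodataset.org/test-stuff2017/000000000019.jpg")

# unconditional captioning: the model generates a caption from the image alone
inputs = blip_processor(image, return_tensors="pt").to(device)
out = blip_model.generate(**inputs, max_length=32)
print(blip_processor.decode(out[0], skip_special_tokens=True))

# conditional captioning: the caption is forced to start with the given text prompt
inputs = blip_processor(image, "a photograph of", return_tensors="pt").to(device)
out = blip_model.generate(**inputs, max_length=32)
print(blip_processor.decode(out[0], skip_special_tokens=True))
```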
References and useful links:
- nlpconnect/vit-gpt2-image-captioning

You can get the complete code here. Alternatively, follow this link for the Colab version.
Happy learning ♥
