
Fine-Tuning of CLIP in Few-Shot Scenarios via Supervised Contrastive Learning

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15033)


Abstract

Large-scale pretrained vision-language models such as CLIP have proven highly effective at learning universal representations and have achieved significant success across various downstream tasks. Recently, there has been growing interest in fine-tuning these large models with limited data. To alleviate overfitting during fine-tuning, existing methods usually freeze the parameters of CLIP pretrained on large-scale datasets and use the features extracted by the frozen CLIP model for downstream tasks. However, such a fine-tuning strategy may limit performance, because semantic visual features specific to a downstream task may not be well extracted by the frozen feature extractor of CLIP. In this study, we propose an effective framework that fine-tunes CLIP with few-shot samples while alleviating overfitting. In this framework, a visual adapter is appended to the end of CLIP's visual encoder to encourage the model to extract semantic features relevant to the downstream task, and a supervised contrastive loss is introduced to alleviate overfitting by guiding the optimization to focus on the adapter. In addition, the multimodal feature alignment capability of CLIP is exploited to direct the adapted visual encoder toward class-relevant image features through textual prompts. Experimental evaluations on 11 datasets confirm the superior performance of the proposed approach.
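
The following is a minimal PyTorch sketch of the training objective outlined in the abstract: a small trainable adapter on top of frozen CLIP image features, a supervised contrastive loss over the adapted features, and a cross-entropy alignment term against per-class text-prompt embeddings. The adapter architecture, residual blending ratio, temperatures, and equal weighting of the two losses are illustrative assumptions rather than the authors' reported configuration; the contrastive term follows the standard supervised contrastive formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualAdapter(nn.Module):
    """Small residual MLP appended to the frozen visual encoder (assumed design)."""

    def __init__(self, dim: int = 512, hidden: int = 128, ratio: float = 0.2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, dim)
        )
        self.ratio = ratio  # blend adapted and original CLIP features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ratio * self.mlp(x) + (1.0 - self.ratio) * x


def supervised_contrastive_loss(feats, labels, temperature=0.07):
    """Supervised contrastive loss over L2-normalised features."""
    feats = F.normalize(feats, dim=-1)
    sim = feats @ feats.t() / temperature                    # pairwise similarities
    n = feats.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=feats.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, -1e9)                   # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    return -(log_prob * pos_mask).sum(dim=1).div(pos_count).mean()


def text_alignment_loss(img_feats, text_feats, labels, temperature=0.01):
    """Cross-entropy over cosine similarities to per-class text-prompt features."""
    logits = F.normalize(img_feats, dim=-1) @ F.normalize(text_feats, dim=-1).t()
    return F.cross_entropy(logits / temperature, labels)


if __name__ == "__main__":
    # Toy few-shot batch: 8 images from 4 classes, 512-d CLIP features.
    torch.manual_seed(0)
    frozen_image_feats = torch.randn(8, 512)   # from the frozen visual encoder
    frozen_text_feats = torch.randn(4, 512)    # one prompt embedding per class
    labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])

    adapter = VisualAdapter()                  # the only trainable component
    adapted = adapter(frozen_image_feats)

    loss = (supervised_contrastive_loss(adapted, labels)
            + text_alignment_loss(adapted, frozen_text_feats, labels))
    loss.backward()
    print(f"combined loss: {loss.item():.4f}")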



Acknowledgement

This work is supported in part by the National Natural Science Foundation of China (grant No. 62071502), the Major Key Project of PCL (grant No. PCL2023A09), and Guangdong Excellent Youth Team Program (grant No. 2023B1515040025).

Author information

Authors and Affiliations

  1. School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China

    Jing Luo, Guangxing Wu, Hongmei Liu & Ruixuan Wang

  2. Peng Cheng Laboratory, Shenzhen, China

    Ruixuan Wang

  3. Key Laboratory of Machine Intelligence and Advanced Computing, MOE, Guangzhou, China

    Jing Luo, Guangxing Wu & Ruixuan Wang

Authors
  1. Jing Luo
  2. Guangxing Wu
  3. Hongmei Liu
  4. Ruixuan Wang

Corresponding author

Correspondence to Ruixuan Wang.

Editor information

Editors and Affiliations

  1. Peking University, Beijing, China

    Zhouchen Lin

  2. Nankai University, Tianjin, China

    Ming-Ming Cheng

  3. Chinese Academy of Sciences, Beijing, China

    Ran He

  4. Xinjiang University, Ürümqi, Xinjiang, China

    Kurban Ubul

  5. Xinjiang University, Ürümqi, China

    Wushouer Silamu

  6. Peking University, Beijing, China

    Hongbin Zha

  7. Tsinghua University, Beijing, China

    Jie Zhou

  8. Chinese Academy of Sciences, Beijing, China

    Cheng-Lin Liu

Rights and permissions

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Luo, J., Wu, G., Liu, H., Wang, R. (2025). Fine-Tuning of CLIP in Few-Shot Scenarios via Supervised Contrastive Learning. In: Lin, Z., et al. (eds) Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15033. Springer, Singapore. https://doi.org/10.1007/978-981-97-8502-5_8
