Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15033)
Abstract
Large-scale pretrained vision-language models such as CLIP have proven highly effective at learning universal representations and have achieved significant success across various downstream tasks. Recently, there has been increased interest in fine-tuning these large models with limited data. To alleviate overfitting during fine-tuning, existing methods usually freeze the parameters of CLIP pretrained on large-scale datasets and use the features extracted from the frozen CLIP model for downstream tasks. However, such a fine-tuning strategy may limit performance, because semantic visual features specific to a downstream task may not be well extracted by CLIP's frozen feature extractor. In this study, we propose an effective framework to fine-tune CLIP with few-shot samples while alleviating overfitting. In this framework, a visual adapter is embedded at the end of CLIP's visual encoder to encourage the model to extract semantic features relevant to the downstream task, and a supervised contrastive loss is introduced to alleviate overfitting by focusing the optimization on the adapter. In addition, the multimodal feature alignment capability of CLIP is utilized to direct the adapted visual encoder toward class-relevant image features through textual prompts. Experimental evaluations on 11 datasets confirm the superior performance of the proposed approach.
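The abstract outlines three components: a visual adapter on top of CLIP's frozen visual encoder, a supervised contrastive loss over the adapted image features, and alignment with textual class prompts. The following PyTorch-style sketch illustrates how such components could look; the adapter dimensions, residual blending ratio, temperatures, and exact loss forms are illustrative assumptions, not the authors' reported design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualAdapter(nn.Module):
    """Lightweight bottleneck adapter appended to the frozen CLIP visual encoder
    (hypothetical widths; the paper's exact architecture may differ)."""

    def __init__(self, dim: int = 512, hidden: int = 64, residual_ratio: float = 0.2):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)
        self.residual_ratio = residual_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        adapted = self.up(F.relu(self.down(x)))
        # Blend adapted features with the original CLIP features via a residual mix,
        # so the adapter refines rather than replaces the pretrained representation.
        return self.residual_ratio * adapted + (1.0 - self.residual_ratio) * x


def supervised_contrastive_loss(features: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Supervised contrastive loss (Khosla et al., 2020): same-class samples are
    pulled together and different-class samples pushed apart in the adapted space."""
    features = F.normalize(features, dim=-1)
    logits = features @ features.t() / temperature              # pairwise similarities
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    self_mask = torch.ones_like(pos_mask).fill_diagonal_(0.0)   # exclude self-pairs
    pos_mask = pos_mask * self_mask
    exp_logits = torch.exp(logits) * self_mask
    log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True) + 1e-12)
    pos_count = pos_mask.sum(dim=1).clamp(min=1.0)              # avoid division by zero
    return -(pos_mask * log_prob).sum(dim=1).div(pos_count).mean()


def prompt_alignment_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          labels: torch.Tensor,
                          temperature: float = 0.01) -> torch.Tensor:
    """Cross-entropy over cosine similarities between adapted image features and the
    frozen CLIP text embeddings of class prompts (e.g. "a photo of a <class>")."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = image_features @ text_features.t() / temperature
    return F.cross_entropy(logits, labels)
```

In such a setup only the adapter parameters would receive gradients, with the CLIP encoders kept frozen; the overall training objective could then be a weighted sum of the supervised contrastive loss and the prompt-alignment loss, with the weighting treated as a hyperparameter.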
Acknowledgement
This work is supported in part by the National Natural Science Foundation of China (grant No. 62071502), the Major Key Project of PCL (grant No. PCL2023A09), and the Guangdong Excellent Youth Team Program (grant No. 2023B1515040025).
Author information
Authors and Affiliations
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
Jing Luo, Guangxing Wu, Hongmei Liu & Ruixuan Wang
Peng Cheng Laboratory, Shenzhen, China
Ruixuan Wang
Key Laboratory of Machine Intelligence and Advanced Computing, MOE, Guangzhou, China
Jing Luo, Guangxing Wu & Ruixuan Wang
Corresponding author
Correspondence to Ruixuan Wang.
Editor information
Editors and Affiliations
Peking University, Beijing, China
Zhouchen Lin
Nankai University, Tianjin, China
Ming-Ming Cheng
Chinese Academy of Sciences, Beijing, China
Ran He
Xinjiang University, Ürümqi, Xinjiang, China
Kurban Ubul
Xinjiang University, Ürümqi, China
Wushouer Silamu
Peking University, Beijing, China
Hongbin Zha
Tsinghua University, Beijing, China
Jie Zhou
Chinese Academy of Sciences, Beijing, China
Cheng-Lin Liu
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Luo, J., Wu, G., Liu, H., Wang, R. (2025). Fine-Tuning of CLIP in Few-Shot Scenarios via Supervised Contrastive Learning. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15033. Springer, Singapore. https://doi.org/10.1007/978-981-97-8502-5_8
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8501-8
Online ISBN: 978-981-97-8502-5
eBook Packages: Computer Science, Computer Science (R0)