Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15033)
Abstract
Large-scale pretrained vision-language models such as CLIP have proven highly effective at learning universal representations and have achieved significant success across various downstream tasks. Recently, there has been increased interest in fine-tuning these large models with limited data. To alleviate overfitting during fine-tuning, existing methods usually freeze the parameters of CLIP pretrained on large-scale datasets and use the features extracted from the frozen CLIP model for downstream tasks. However, such a fine-tuning strategy may limit performance, because semantic visual features specific to a downstream task may not be well extracted by CLIP's frozen feature extractor. In this study, we propose an effective framework to fine-tune CLIP with few-shot samples while alleviating overfitting. In this framework, a visual adapter is embedded at the end of CLIP's visual encoder to encourage the model to extract semantic features relevant to the downstream task, and a supervised contrastive loss is introduced to alleviate overfitting by focusing the optimization on the adapter. In addition, the multimodal feature alignment capability of CLIP is utilized to direct the adapted visual encoder toward class-relevant image features through textual prompts. Experimental evaluations on 11 datasets confirm the superior performance of the proposed approach.
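The abstract outlines three components: a visual adapter on top of CLIP's frozen visual encoder, a supervised contrastive loss over the adapted image features, and alignment with textual class prompts. The following PyTorch-style sketch illustrates how such components could look; the adapter dimensions, residual blending ratio, temperatures, and exact loss forms are illustrative assumptions, not the authors' reported design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualAdapter(nn.Module):
    """Lightweight bottleneck adapter appended to the frozen CLIP visual encoder
    (hypothetical widths; the paper's exact architecture may differ)."""

    def __init__(self, dim: int = 512, hidden: int = 64, residual_ratio: float = 0.2):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)
        self.residual_ratio = residual_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        adapted = self.up(F.relu(self.down(x)))
        # Blend adapted features with the original CLIP features via a residual mix,
        # so the adapter refines rather than replaces the pretrained representation.
        return self.residual_ratio * adapted + (1.0 - self.residual_ratio) * x


def supervised_contrastive_loss(features: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Supervised contrastive loss (Khosla et al., 2020): same-class samples are
    pulled together and different-class samples pushed apart in the adapted space."""
    features = F.normalize(features, dim=-1)
    logits = features @ features.t() / temperature              # pairwise similarities
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    self_mask = torch.ones_like(pos_mask).fill_diagonal_(0.0)   # exclude self-pairs
    pos_mask = pos_mask * self_mask
    exp_logits = torch.exp(logits) * self_mask
    log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True) + 1e-12)
    pos_count = pos_mask.sum(dim=1).clamp(min=1.0)              # avoid division by zero
    return -(pos_mask * log_prob).sum(dim=1).div(pos_count).mean()


def prompt_alignment_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          labels: torch.Tensor,
                          temperature: float = 0.01) -> torch.Tensor:
    """Cross-entropy over cosine similarities between adapted image features and the
    frozen CLIP text embeddings of class prompts (e.g. "a photo of a <class>")."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = image_features @ text_features.t() / temperature
    return F.cross_entropy(logits, labels)
```

In such a setup only the adapter parameters would receive gradients, with the CLIP encoders kept frozen; the overall training objective could then be a weighted sum of the supervised contrastive loss and the prompt-alignment loss, with the weighting treated as a hyperparameter.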
Acknowledgement
This work is supported in part by the National Natural Science Foundation of China (grant No. 62071502), the Major Key Project of PCL (grant No. PCL2023A09), and the Guangdong Excellent Youth Team Program (grant No. 2023B1515040025).
Author information
Authors and Affiliations
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
Jing Luo, Guangxing Wu, Hongmei Liu & Ruixuan Wang
Peng Cheng Laboratory, Shenzhen, China
Ruixuan Wang
Key Laboratory of Machine Intelligence and Advanced Computing, MOE, Guangzhou, China
Jing Luo, Guangxing Wu & Ruixuan Wang
Corresponding author
Correspondence to Ruixuan Wang.
Editor information
Editors and Affiliations
Peking University, Beijing, China
Zhouchen Lin
Nankai University, Tianjin, China
Ming-Ming Cheng
Chinese Academy of Sciences, Beijing, China
Ran He
Xinjiang University, Ürümqi, Xinjiang, China
Kurban Ubul
Xinjiang University, Ürümqi, China
Wushouer Silamu
Peking University, Beijing, China
Hongbin Zha
Tsinghua University, Beijing, China
Jie Zhou
Chinese Academy of Sciences, Beijing, China
Cheng-Lin Liu
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Luo, J., Wu, G., Liu, H., Wang, R. (2025). Fine-Tuning of CLIP in Few-Shot Scenarios via Supervised Contrastive Learning. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15033. Springer, Singapore. https://doi.org/10.1007/978-981-97-8502-5_8
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8501-8
Online ISBN: 978-981-97-8502-5
eBook Packages: Computer Science, Computer Science (R0)