Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15060)
Abstract
Current visual foundation models are trained purely on unstructured 2D data, limiting their understanding of the 3D structure of objects and scenes. In this work, we show that fine-tuning on 3D-aware data improves the quality of emerging semantic features. We design a method to lift semantic 2D features into an efficient 3D Gaussian representation, which allows us to re-render them for arbitrary views. Using the rendered 3D-aware features, we design a fine-tuning strategy to transfer such 3D awareness into a 2D foundation model. We demonstrate that models fine-tuned in this way produce features that readily improve downstream task performance in semantic segmentation and depth estimation through simple linear probing. Notably, though fine-tuned on a single indoor dataset, the improvement is transferable to a variety of indoor datasets and out-of-domain datasets. We hope our study encourages the community to consider injecting 3D awareness when training 2D foundation models. Project page: https://ywyue.github.io/FiT3D.
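The abstract evaluates features through linear probing but does not spell out the setup on this page. As a minimal sketch of the general technique only, assuming per-patch features have already been extracted from a frozen backbone, the following fits a softmax linear classifier on top of them. The function names, hyperparameters, and the synthetic 2D features are all hypothetical stand-ins and are not taken from the paper.

```python
import numpy as np

def fit_linear_probe(features, labels, num_classes, lr=0.5, steps=200):
    """Fit a softmax linear classifier on fixed features.

    Only W and b are trained; the features are treated as constants,
    which is what makes this a probe rather than fine-tuning.
    """
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(features, W, b):
    """Return the highest-scoring class index for each feature vector."""
    return np.argmax(features @ W + b, axis=1)

# Synthetic stand-in for frozen per-patch features (assumption: 3 classes, 2 dims).
rng = np.random.default_rng(0)
centers = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 3.0]])
labels = rng.integers(0, 3, size=300)
features = centers[labels] + 0.3 * rng.normal(size=(300, 2))

W, b = fit_linear_probe(features, labels, num_classes=3)
accuracy = float((predict(features, W, b) == labels).mean())
print(f"probe accuracy on synthetic features: {accuracy:.3f}")
```

Because the classifier is linear, any accuracy gain between two backbones under this protocol can be attributed to the features themselves rather than to the capacity of the probe.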
Acknowledgements
Francis Engelmann is partially supported by an ETH AI Center postdoctoral research fellowship and an ETH Zurich Career Seed Award.
Author information
Authors and Affiliations
ETH Zurich, Zurich, Switzerland
Yuanwen Yue, Francis Engelmann & Siyu Tang
Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
Anurag Das & Jan Eric Lenssen
Google, Zurich, Switzerland
Francis Engelmann
Corresponding author
Correspondence to Yuanwen Yue.
Editor information
Editors and Affiliations
University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yue, Y., Das, A., Engelmann, F., Tang, S., Lenssen, J.E. (2025). Improving 2D Feature Representations by 3D-Aware Fine-Tuning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15060. Springer, Cham. https://doi.org/10.1007/978-3-031-72627-9_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72626-2
Online ISBN: 978-3-031-72627-9
eBook Packages: Computer Science, Computer Science (R0)