Improving 2D Feature Representations by 3D-Aware Fine-Tuning

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15060)

Included in the following conference series: European Conference on Computer Vision

Abstract

Current visual foundation models are trained purely on unstructured 2D data, limiting their understanding of the 3D structure of objects and scenes. In this work, we show that fine-tuning on 3D-aware data improves the quality of emerging semantic features. We design a method to lift semantic 2D features into an efficient 3D Gaussian representation, which allows us to re-render them for arbitrary views. Using the rendered 3D-aware features, we design a fine-tuning strategy to transfer such 3D awareness into a 2D foundation model. We demonstrate that models fine-tuned in this way produce features that readily improve downstream task performance in semantic segmentation and depth estimation through simple linear probing. Notably, though fine-tuned on a single indoor dataset, the improvement transfers to a variety of indoor datasets and out-of-domain datasets. We hope our study encourages the community to consider injecting 3D awareness when training 2D foundation models. Project page: https://ywyue.github.io/FiT3D.
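
The lift, re-render, and fine-tune idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the function names, the tuple layout, and the L1 loss are assumptions made for illustration. The only part taken from standard practice is the front-to-back alpha compositing used in 3D Gaussian splatting, applied here to feature vectors rather than colors.

```python
# Illustrative sketch only. Names, shapes, and the L1 loss are assumptions,
# not the paper's code.

def composite_features(gaussians):
    """Front-to-back alpha compositing of per-Gaussian features along one ray.

    `gaussians` is a non-empty list of (depth, alpha, feature) tuples, where
    `alpha` is the opacity that Gaussian contributes at this pixel and
    `feature` is a list of floats.
    """
    ordered = sorted(gaussians, key=lambda g: g[0])  # nearest Gaussian first
    dim = len(ordered[0][2])
    pixel = [0.0] * dim
    transmittance = 1.0  # fraction of the ray not yet absorbed
    for _, alpha, feature in ordered:
        weight = transmittance * alpha
        for k in range(dim):
            pixel[k] += weight * feature[k]
        transmittance *= 1.0 - alpha
    return pixel


def l1_feature_loss(predicted, target):
    """Mean absolute difference between two feature vectors of equal length."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


# Two hypothetical Gaussians covering one pixel, each carrying a 2-D feature.
ray = [
    (2.0, 0.5, [0.0, 1.0]),  # farther Gaussian
    (1.0, 0.5, [1.0, 0.0]),  # nearer Gaussian
]
rendered = composite_features(ray)   # stands in for the 3D-aware target feature
model_feature = [0.25, 0.25]         # hypothetical 2D model output at this pixel
loss = l1_feature_loss(model_feature, rendered)
```

In a real pipeline the per-pixel loss would be averaged over rendered views and backpropagated into the 2D model. The sketch only shows how a rendered feature can act as a fine-tuning target.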

Acknowledgements

Francis Engelmann is partially supported by an ETH AI Center postdoctoral research fellowship and an ETH Zurich Career Seed Award.

Author information

Authors and Affiliations

ETH Zurich, Zurich, Switzerland
Yuanwen Yue, Francis Engelmann & Siyu Tang
Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrucken, Germany
Anurag Das & Jan Eric Lenssen
Google, Zurich, Switzerland
Francis Engelmann

Corresponding author

Correspondence to Yuanwen Yue.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2992 KB)

About this paper

Cite this paper

Yue, Y., Das, A., Engelmann, F., Tang, S., Lenssen, J.E. (2025). Improving 2D Feature Representations by 3D-Aware Fine-Tuning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15060. Springer, Cham. https://doi.org/10.1007/978-3-031-72627-9_4

DOI: https://doi.org/10.1007/978-3-031-72627-9_4
Published: 20 October 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72626-2
Online ISBN: 978-3-031-72627-9
eBook Packages: Computer Science, Computer Science (R0)
