Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15093)
Abstract
Sparse 3D detectors have received significant attention since the query-based paradigm achieves low latency without explicit dense BEV feature construction. However, these detectors perform worse than their dense counterparts. In this paper, we find the key to bridging the performance gap is to enhance the awareness of rich representations in both modalities. Here, we present a high-performance fully sparse detector for end-to-end multi-modality 3D object detection. The detector, termed SparseLIF, contains three key designs: (1) Perspective-Aware Query Generation (PAQG) to generate high-quality 3D queries with perspective priors, (2) RoI-Aware Sampling (RIAS) to further refine prior queries by sampling RoI features from each modality, and (3) Uncertainty-Aware Fusion (UAF) to precisely quantify the uncertainty of each sensor modality and adaptively conduct the final multi-modality fusion, thus achieving great robustness against sensor noise. At the time of paper submission, SparseLIF achieves state-of-the-art performance on the nuScenes dataset, ranking 1st on both the validation set and the test benchmark, outperforming all state-of-the-art 3D object detectors by a notable margin.
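The UAF idea described above — down-weighting a modality as its predicted uncertainty grows — can be illustrated with a minimal sketch. This is not the paper's actual formulation; the function name, scalar uncertainties, and softmax-over-negative-uncertainty weighting are illustrative assumptions.

```python
import math

def uncertainty_weighted_fusion(lidar_feat, camera_feat, lidar_unc, camera_unc):
    """Hypothetical sketch of uncertainty-aware fusion: each modality
    predicts a scalar uncertainty, and its feature contribution decays
    as uncertainty grows (softmax over negative uncertainties)."""
    w_lidar = math.exp(-lidar_unc)
    w_cam = math.exp(-camera_unc)
    total = w_lidar + w_cam
    w_lidar, w_cam = w_lidar / total, w_cam / total
    # Fuse feature vectors as a convex combination of the two modalities.
    fused = [w_lidar * l + w_cam * c for l, c in zip(lidar_feat, camera_feat)]
    return fused, (w_lidar, w_cam)
```

Under this sketch, a modality degraded by sensor noise (high predicted uncertainty) contributes little to the fused feature, which is the robustness property the abstract attributes to UAF.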
H. Zhang, L. Liang and P. Zeng—Equal Contribution.
Author information
Authors and Affiliations
SenseTime Research, Hong Kong, China
Hongcheng Zhang, Liu Liang, Pengxin Zeng, Xiao Song & Zhe Wang
College of Computer Science, Sichuan University, Chengdu, China
Pengxin Zeng
Corresponding author
Correspondence to Xiao Song.
Editor information
Editors and Affiliations
University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Hessen, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, H., Liang, L., Zeng, P., Song, X., Wang, Z. (2025). SparseLIF: High-Performance Sparse LiDAR-Camera Fusion for 3D Object Detection. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15093. Springer, Cham. https://doi.org/10.1007/978-3-031-72761-0_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72760-3
Online ISBN: 978-3-031-72761-0
eBook Packages: Computer Science, Computer Science (R0)