- Zheng Li (ORCID: orcid.org/0000-0003-4160-0849)1,
- Houjin Chen1,
- Jupeng Li1,
- Song Peng2,
- Zhenhao Zhang1,
- Baozheng Wang3 &
- Changyong Wang3
Abstract
Improving the segmentation accuracy of small objects is essential for tasks such as autonomous driving and remote sensing. However, current mainstream semantic segmentation methods handle small objects poorly. Accurate small-object segmentation requires both long-range global information and fine local details, and neither pure Convolutional Neural Networks (CNNs) nor Vision Transformers (ViTs) can effectively provide these two different types of information simultaneously. In this paper, we introduce FusFormer, a novel model that contains a global branch and a detailed branch to fully capture long-range features and spatial detail features from the input image. The global branch builds on MiT-B2 to efficiently acquire global context, while the detailed branch obtains rich local detail information through a Spatial Prior Module (SPM) and a Multi-scale Module (MSM). A Feature Interaction Module (FIM) is proposed to fuse information across the two branches at dual scales. In addition, a Multi-scale Edge Extraction Module (MSEEM) supplies edge information that would otherwise be missing during training, helping the model enhance the intra-class consistency of small objects. Extensive experiments on Cityscapes, ADE20K and PASCAL VOC 2012 show that our model achieves competitive overall segmentation accuracy, especially on small objects. FusFormer achieves 82.6%, 47.3% and 82.4% mIoU on the Cityscapes, ADE20K and PASCAL VOC 2012 validation sets, respectively; compared with other state-of-the-art methods, it improves IoU on small objects by 2%-4%.
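Since the full text is behind the paywall, the sketch below only illustrates the dual-branch idea summarized in the abstract: a coarse global encoder and a shallow detail path whose features are fused before prediction. All names (`DetailBranch`, `FusionSeg`), channel sizes, and the concatenation-plus-1x1-conv fusion are illustrative assumptions, not the authors' FusFormer implementation (the actual SPM, MSM, FIM and MSEEM designs appear only in the full article).

```python
# Minimal dual-branch segmentation sketch (PyTorch).
# ASSUMPTION: module names, channel sizes and the fusion scheme are
# illustrative stand-ins, NOT the FusFormer architecture from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetailBranch(nn.Module):
    """Shallow conv stack standing in for the SPM/MSM detail path."""
    def __init__(self, in_ch=3, ch=64):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):  # (B, 3, H, W) -> (B, ch, H/4, W/4)
        return self.convs(x)

class FusionSeg(nn.Module):
    """Dual-branch model: global transformer-style features plus local
    detail features, fused by concatenation and a 1x1 conv (a crude
    stand-in for the paper's FIM)."""
    def __init__(self, global_backbone, g_ch=512, d_ch=64, num_classes=19):
        super().__init__()
        self.global_branch = global_backbone  # e.g. a MiT-B2-like encoder
        self.detail_branch = DetailBranch(ch=d_ch)
        self.fuse = nn.Conv2d(g_ch + d_ch, 256, 1)
        self.head = nn.Conv2d(256, num_classes, 1)

    def forward(self, x):
        g = self.global_branch(x)   # coarse, semantically rich features
        d = self.detail_branch(x)   # fine, spatially detailed features
        g = F.interpolate(g, size=d.shape[2:], mode='bilinear',
                          align_corners=False)
        y = self.head(self.fuse(torch.cat([g, d], dim=1)))
        # upsample logits back to input resolution
        return F.interpolate(y, size=x.shape[2:], mode='bilinear',
                             align_corners=False)

# Usage with a hypothetical stand-in global encoder (a single strided conv):
backbone = nn.Conv2d(3, 512, 7, stride=16, padding=3)
model = FusionSeg(backbone)
logits = model(torch.randn(1, 3, 256, 512))  # -> (1, 19, 256, 512)
```

The point of the sketch is the data flow: the low-resolution global features are upsampled to the detail branch's resolution before fusion, so the prediction head sees both context and spatial detail, which is the property the abstract credits for the small-object gains.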
Data Availability
All data included in this study are available from the corresponding author upon reasonable request.
Author information
Authors and Affiliations
School of Electronic and Information Engineering, Beijing Jiaotong University, Beijing, 100044, China
Zheng Li, Houjin Chen, Jupeng Li & Zhenhao Zhang
School of Computer Science and Technology, University of Science and Technology of China, Hefei, 230026, China
Song Peng
Beijing Institute of Basic Medical Sciences, Beijing, 100850, China
Baozheng Wang & Changyong Wang
Contributions
Conceptualization, Z.L. and J.L.; Methodology, Software, Validation, Writing - original draft, Formal analysis, Z.L.; Writing - review and editing, J.L.; Investigation, Z.Z.; Data curation, S.P.; Resources, B.W. and C.W.; Supervision, H.C.; Project Administration, H.C.; Funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.
Corresponding author
Correspondence to Jupeng Li.
Ethics declarations
Conflicts of interest
The authors declare no conflict of interest.
Ethical and Informed Consent
Informed consent was obtained from all individual participants included in the study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, Z., Chen, H., Li, J. et al. FusFormer: global and detail feature fusion transformer for semantic segmentation of small objects. Multimed Tools Appl 83, 88717–88744 (2024). https://doi.org/10.1007/s11042-024-18911-8