
FusFormer: global and detail feature fusion transformer for semantic segmentation of small objects

Published in: Multimedia Tools and Applications

Abstract

Improving the segmentation accuracy of small objects is essential for tasks such as autonomous driving and remote sensing, yet current mainstream semantic segmentation methods handle small objects poorly. Accurate small-object segmentation requires both long-range global context and fine local detail, and neither pure Convolutional Neural Networks (CNNs) nor Vision Transformers (ViTs) can effectively provide both types of information simultaneously. In this paper, we introduce FusFormer, a novel model that addresses this gap with a global branch and a detail branch, which together capture long-range features and spatial detail features from the input image. The global branch builds on MiT-B2 to efficiently acquire global context, while the detail branch extracts rich local detail through a Spatial Prior Module (SPM) and a Multi-scale Module (MSM). A Feature Interaction Module (FIM) is proposed to fuse information across the two branches at dual scales. In addition, a Multi-scale Edge Extraction Module (MSEEM) supplements edge information that is otherwise lost during training, helping the model enhance the intra-class consistency of small objects. Extensive experiments on Cityscapes, ADE20K, and PASCAL VOC 2012 show that our model achieves competitive overall segmentation accuracy, especially on small objects. FusFormer achieves 82.6%, 47.3%, and 82.4% mIoU on the Cityscapes, ADE20K, and PASCAL VOC 2012 validation sets, respectively; compared with other state-of-the-art methods, it improves IoU on small objects by 2%-4%.
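The dual-branch design can be summarized structurally. Below is a minimal PyTorch sketch of the layout described in the abstract, not the authors' implementation: the internals of SPM (a shallow convolutional stem), MSM (parallel dilated convolutions), and FIM (upsampling plus a gated fusion) are assumptions chosen to match the abstract's description, the MiT-B2 encoder is abstracted as any backbone yielding a coarse feature map, and MSEEM's auxiliary edge supervision is omitted.

```python
# Minimal sketch of the dual-branch layout described in the abstract.
# SPM/MSM/FIM internals here are illustrative assumptions, not the paper's
# actual modules; the MiT-B2 encoder is abstracted as any coarse backbone,
# and the MSEEM edge-supervision branch is omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialPriorModule(nn.Module):
    """Assumed stand-in for SPM: a shallow conv stem that keeps spatial detail."""

    def __init__(self, out_ch=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, out_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.stem(x)  # stride-4 feature map rich in local detail


class MultiScaleModule(nn.Module):
    """Assumed stand-in for MSM: parallel dilated convs merged by a 1x1 conv."""

    def __init__(self, ch=64):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(ch, ch, 3, padding=d, dilation=d) for d in (1, 2, 4)])
        self.fuse = nn.Conv2d(3 * ch, ch, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))


class FeatureInteractionModule(nn.Module):
    """Assumed stand-in for FIM: upsample the coarse global feature to the
    detail resolution, then mix the two streams with a gated 1x1 projection."""

    def __init__(self, global_ch, detail_ch, out_ch=128):
        super().__init__()
        self.proj = nn.Conv2d(global_ch + detail_ch, out_ch, 1)
        self.gate = nn.Sequential(nn.Conv2d(out_ch, out_ch, 1), nn.Sigmoid())

    def forward(self, g, d):
        g = F.interpolate(g, size=d.shape[-2:], mode="bilinear",
                          align_corners=False)
        fused = self.proj(torch.cat([g, d], dim=1))
        return fused * self.gate(fused)


class FusFormerSketch(nn.Module):
    def __init__(self, global_backbone, global_ch, num_classes):
        super().__init__()
        self.global_branch = global_backbone   # e.g. an MiT-B2 encoder
        self.detail_branch = nn.Sequential(SpatialPriorModule(64),
                                           MultiScaleModule(64))
        self.fim = FeatureInteractionModule(global_ch, detail_ch=64)
        self.head = nn.Conv2d(128, num_classes, 1)

    def forward(self, x):
        g = self.global_branch(x)   # coarse, long-range global context
        d = self.detail_branch(x)   # high-resolution local detail
        logits = self.head(self.fim(g, d))
        return F.interpolate(logits, size=x.shape[-2:], mode="bilinear",
                             align_corners=False)


if __name__ == "__main__":
    # Stand-in backbone: any module mapping (B,3,H,W) -> (B,C,H/32,W/32) works.
    backbone = nn.Conv2d(3, 256, kernel_size=32, stride=32)
    model = FusFormerSketch(backbone, global_ch=256, num_classes=19)
    out = model(torch.randn(1, 3, 512, 512))
    print(out.shape)  # torch.Size([1, 19, 512, 512])
```

The property this sketch preserves from the abstract is that fusion and prediction happen at the detail branch's higher resolution, with the gated projection standing in for the paper's dual-scale interaction; this is what makes the long-range context usable for small objects without discarding their fine spatial extent.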




Data Availability

All data included in this study are available from the corresponding author on reasonable request.


Author information

Authors and Affiliations

  1. School of Electronic and Information Engineering, Beijing Jiaotong University, Beijing, 100044, China

    Zheng Li, Houjin Chen, Jupeng Li & Zhenhao Zhang

  2. School of Computer Science and Technology, University of Science and Technology of China, Hefei, 230026, China

    Song Peng

  3. Beijing Institute of Basic Medical Sciences, Beijing, 100850, China

    Baozheng Wang & Changyong Wang

Authors

  1. Zheng Li
  2. Houjin Chen
  3. Jupeng Li
  4. Song Peng
  5. Zhenhao Zhang
  6. Baozheng Wang
  7. Changyong Wang

Contributions

Conceptualization, Z.L. and J.L.; Methodology, Software, Validation, Writing - original draft, Formal analysis, Z.L.; Writing - review and editing, J.L.; Investigation, Z.Z.; Data curation, S.P.; Resources, B.W. and C.W.; Supervision, H.C.; Project Administration, H.C.; Funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Jupeng Li.

Ethics declarations

Conflicts of interest

The authors declare no conflict of interest.

Ethical and Informed Consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, Z., Chen, H., Li, J. et al. FusFormer: global and detail feature fusion transformer for semantic segmentation of small objects. Multimed Tools Appl 83, 88717–88744 (2024). https://doi.org/10.1007/s11042-024-18911-8


Associated Content

Part of a collection:

Track 6: Computer Vision for Multimedia Applications

