
A topic-based multi-channel attention model under hybrid mode for image caption

  • Original Article
  • Published in: Neural Computing and Applications

Abstract

Automatic caption generation for an image does not depend equally on every spatial region of the visual input; it is instead guided by the topic the image expresses. To address the decoupling between visual spatial attention and the semantic decoder, a topic-based multi-channel attention model (TMA) under hybrid mode for image captioning is proposed. First, natural language processing (NLP) techniques are used to preprocess the caption references, including filtering stop words, analyzing word frequency and constructing a semantic network graph with node labels. Then, combined with the image features extracted by a convolutional neural network (CNN), a semantic perception network is designed to achieve cross-domain prediction from image to topic. Next, a topic-based multi-channel attention fusion mechanism is proposed to produce a joint image-text attention representation under the combined action of the global spatial features of the image, the local semantic features of the graph nodes and the hidden-layer features of the long short-term memory (LSTM) decoder. Finally, a multi-task loss function is used to train the TMA. Experimental results show that, with topic-focused attention, the proposed model achieves better evaluation performance than state-of-the-art (SOTA) methods.
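To make the fusion mechanism concrete, the following is a minimal Python/PyTorch sketch of one topic-based multi-channel attention step as the abstract describes it: one additive-attention channel over the global spatial CNN features, one over the local semantic graph-node features, both queried by the LSTM decoder's hidden state and then mixed by a learned gate. This is an illustrative reconstruction, not the authors' released implementation; all layer names, dimensions and the sigmoid-gate fusion scheme are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopicMultiChannelAttention(nn.Module):
    """Gated fusion of a spatial-attention channel and a topic-attention channel."""

    def __init__(self, feat_dim, topic_dim, hidden_dim, attn_dim):
        super().__init__()
        # Channel 1: additive attention over global spatial CNN features.
        self.v_proj = nn.Linear(feat_dim, attn_dim)
        self.v_score = nn.Linear(attn_dim, 1)
        # Channel 2: additive attention over local semantic graph-node features.
        self.t_proj = nn.Linear(topic_dim, attn_dim)
        self.t_score = nn.Linear(attn_dim, 1)
        # Both channels are conditioned on the LSTM decoder hidden state.
        self.h_proj = nn.Linear(hidden_dim, attn_dim)
        # Scalar gate mixing the two context vectors (an assumed fusion scheme).
        self.gate = nn.Linear(hidden_dim + feat_dim + topic_dim, 1)

    def forward(self, spatial_feats, topic_feats, h):
        # spatial_feats: (B, K, feat_dim)  -- K image regions from the CNN
        # topic_feats:   (B, M, topic_dim) -- M nodes of the semantic graph
        # h:             (B, hidden_dim)   -- current LSTM decoder state
        hq = self.h_proj(h).unsqueeze(1)  # (B, 1, attn_dim), broadcast query
        w_v = F.softmax(self.v_score(torch.tanh(self.v_proj(spatial_feats) + hq)), dim=1)
        w_t = F.softmax(self.t_score(torch.tanh(self.t_proj(topic_feats) + hq)), dim=1)
        ctx_v = (w_v * spatial_feats).sum(dim=1)  # (B, feat_dim) visual context
        ctx_t = (w_t * topic_feats).sum(dim=1)    # (B, topic_dim) topic context
        g = torch.sigmoid(self.gate(torch.cat([h, ctx_v, ctx_t], dim=-1)))  # (B, 1)
        # Fused image-text representation handed back to the decoder.
        return torch.cat([g * ctx_v, (1.0 - g) * ctx_t], dim=-1)

In a full captioning model, a decoder would call this module at every time step and concatenate the returned context with the current word embedding before the next LSTM update; the multi-task loss mentioned in the abstract (caption likelihood plus topic prediction) would be applied on top of such a backbone.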



Acknowledgements

This work is supported by the Nanjing Institute of Technology High-level Scientific Research Foundation for the Introduction of Talent (No. YKJ201918), the Natural Science Foundation-Youth Fund of Jiangsu Province of China (No. BK20210931), and the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (No. 20KJB510049), and is partially supported by the National Natural Science Foundation of China (No. 61902179).

Author information

Authors and Affiliations

  1. School of Automation, Nanjing Institute of Technology, Nanjing, China

    Kui Qian & Lei Tian

Authors
  1. Kui Qian
  2. Lei Tian

Corresponding author

Correspondence to Kui Qian.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Qian, K., Tian, L. A topic-based multi-channel attention model under hybrid mode for image caption. Neural Comput & Applic 34, 2207–2216 (2022). https://doi.org/10.1007/s00521-021-06557-8
