Multi-round Dialogue State Tracking by Object-Entity Alignment in Visual Dialog

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14473)

Included in the conference series: CAAI International Conference on Artificial Intelligence (CICAI 2023)

Abstract

Visual Dialog (VD) is a task in which an agent answers a series of image-related questions based on a multi-round dialog history. However, previous VD methods often treat the entire dialog history as a single flat text input, disregarding the conversational information flow at the round level. In this paper, we introduce the Multi-round Dialogue State Tracking model (MDST), a framework that addresses this limitation by leveraging a dialogue state learned from the dialog history to answer questions. MDST processes each round of dialog history in turn, constructing internal dialogue state representations defined as 2-tuples of vision-language representations. These representations effectively ground the current question, enabling the generation of accurate answers. Experimental results on the VisDial v1.0 dataset show that MDST achieves new state-of-the-art performance in the generative setting. Furthermore, through a series of human studies, we validate that MDST generates long, consistent, and human-like answers while answering a series of questions correctly.
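To make the state-tracking idea concrete, below is a minimal PyTorch sketch of the round-level loop the abstract describes: the dialogue state is kept as a list of per-round (vision, language) 2-tuples, the current question is grounded in the image, and the state is updated once per round. All module choices, dimensions, and the update rule here are illustrative assumptions, not the paper's actual architecture.

import torch
import torch.nn as nn

class DialogueStateTracker(nn.Module):
    """Round-level dialogue state tracking, in the spirit of MDST.
    Hypothetical sketch: modules and shapes are assumptions, not the paper's design."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.question_enc = nn.GRU(dim, dim, batch_first=True)
        self.ground = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.state_update = nn.GRUCell(2 * dim, dim)

    def forward(self, image_feats, question_emb, states):
        # image_feats: (B, R, dim) region features from a visual backbone.
        # question_emb: (B, T, dim) token embeddings of the current question.
        # states: list of (vision_state, language_state) 2-tuples, one per past round.
        _, q_h = self.question_enc(question_emb)    # final hidden state: (1, B, dim)
        q = q_h.squeeze(0)                          # (B, dim) question summary
        # Ground the current question in the image regions via attention.
        v, _ = self.ground(q.unsqueeze(1), image_feats, image_feats)
        v = v.squeeze(1)                            # (B, dim) grounded vision state
        # Fold the grounded (vision, language) pair into the running dialogue state.
        prev = states[-1][1] if states else torch.zeros_like(q)
        l = self.state_update(torch.cat([v, q], dim=-1), prev)
        states.append((v, l))                       # this round's 2-tuple
        return l, states                            # feed `l` to an answer decoder

# Example of the per-round loop over a dialogue:
tracker = DialogueStateTracker()
states = []
img = torch.randn(2, 36, 512)                       # batch of 2, 36 regions
for _ in range(3):                                  # three dialogue rounds
    question = torch.randn(2, 12, 512)              # 12 question tokens
    answer_ctx, states = tracker(img, question, states)

The key design point this sketch illustrates is that history is consumed one round at a time and compressed into a structured state, rather than concatenated into a single flat text input.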



Acknowledgements

We thank the reviewers for their comments and suggestions. This work was partially supported by the National Natural Science Foundation of China (NSFC 62076032), Huawei Noah’s Ark Lab, the MoE-CMCC “Artificial Intelligence” Project (No. MCM20190701), the Beijing Natural Science Foundation (Grant No. 4204100), and the BUPT Excellent Ph.D. Students Foundation (No. CX2020309).

Author information

Authors and Affiliations

  1. Beijing Information Science and Technology University, Beijing, China

    Wei Pang


Corresponding author

Correspondence to Wei Pang.

Editor information

Editors and Affiliations

  1. Tsinghua University, Beijing, China

    Lu Fang

  2. Duke University, Durham, NC, USA

    Jian Pei

  3. Shanghai Jiao Tong University, Shanghai, China

    Guangtao Zhai

  4. Chinese Academy of Sciences, Beijing, China

    Ruiping Wang


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Pang, W. (2024). Multi-round Dialogue State Tracking by Object-Entity Alignment in Visual Dialog. In: Fang, L., Pei, J., Zhai, G., Wang, R. (eds) Artificial Intelligence. CICAI 2023. Lecture Notes in Computer Science, vol 14473. Springer, Singapore. https://doi.org/10.1007/978-981-99-8850-1_44
