Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14473)
Included in the following conference series: CAAI International Conference on Artificial Intelligence (CICAI)
Abstract
Visual Dialog (VD) is a task in which an agent answers a series of image-related questions based on a multi-round dialog history. However, previous VD methods often treat the entire dialog history as a simple text input, disregarding the inherent conversational information flow at the round level. In this paper, we introduce the Multi-round Dialogue State Tracking model (MDST), a framework that addresses this limitation by leveraging the dialogue state learned from the dialog history to answer questions. MDST captures each round of dialog history, constructing internal dialogue state representations defined as 2-tuples of vision-language representations. These representations effectively ground the current question, enabling the generation of accurate answers. Experimental results on the VisDial v1.0 dataset demonstrate that MDST achieves new state-of-the-art performance in the generative setting. Furthermore, through a series of human studies, we validate that MDST generates long, consistent, and human-like answers while answering a series of questions correctly.
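To make the abstract's description of round-level state tracking more concrete, the following PyTorch sketch illustrates one way a 2-tuple (vision, language) dialogue state could be maintained across rounds and used to ground the current question. This is a minimal illustrative sketch only: the module choices, update rules, class and variable names, and dimensions are assumptions, not the paper's actual architecture.

import torch
import torch.nn as nn


class RoundStateTracker(nn.Module):
    """Hypothetical round-level tracker: keeps a 2-tuple dialogue state
    (vision state, language state) and updates it once per QA round."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.dim = dim
        # GRU cells as stand-in update functions for the two state components.
        self.vision_update = nn.GRUCell(dim, dim)
        self.language_update = nn.GRUCell(dim, dim)
        # Simple cross-attention to ground the current question in the image.
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def init_state(self, batch: int):
        # Zero states before the first round.
        return torch.zeros(batch, self.dim), torch.zeros(batch, self.dim)

    def forward(self, state, question_emb, image_feats):
        v_state, l_state = state
        # Ground the current question on region features, conditioned on the
        # running vision state (an illustrative choice of attention query).
        query = (question_emb + v_state).unsqueeze(1)
        grounded, _ = self.attn(query, image_feats, image_feats)
        grounded = grounded.squeeze(1)
        # Update the 2-tuple dialogue state with this round's information.
        v_state = self.vision_update(grounded, v_state)
        l_state = self.language_update(question_emb, l_state)
        # An answer decoder (not shown) would condition on (v_state, l_state).
        return (v_state, l_state), grounded


if __name__ == "__main__":
    tracker = RoundStateTracker(dim=512)
    state = tracker.init_state(batch=2)
    image_feats = torch.randn(2, 36, 512)    # e.g. 36 region features per image
    for _ in range(10):                      # ten dialog rounds, as in VisDial
        question_emb = torch.randn(2, 512)   # placeholder question encoding
        state, grounded = tracker(state, question_emb, image_feats)
    print(grounded.shape)                    # torch.Size([2, 512])

In this sketch, the loop stands in for the ten question-answer rounds of a VisDial dialog; the answer decoder that would condition on the updated state tuple is omitted.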
Acknowledgements
We thank the reviewers for their comments and suggestions. This paper was partially supported by the National Natural Science Foundation of China (NSFC 62076032), Huawei Noah’s Ark Lab, the MoE-CMCC “Artificial Intelligence” Project (No. MCM20190701), the Beijing Natural Science Foundation (Grant No. 4204100), and the BUPT Excellent Ph.D. Students Foundation (No. CX2020309).
Author information
Authors and Affiliations
Beijing Information Science and Technology University, Beijing, China
Wei Pang
Corresponding author
Correspondence to Wei Pang.
Editor information
Editors and Affiliations
Tsinghua University, Beijing, China
Lu Fang
Duke University, Durham, NC, USA
Jian Pei
Shanghai Jiao Tong University, Shanghai, China
Guangtao Zhai
Chinese Academy of Sciences, Beijing, China
Ruiping Wang
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Pang, W. (2024). Multi-round Dialogue State Tracking by Object-Entity Alignment in Visual Dialog. In: Fang, L., Pei, J., Zhai, G., Wang, R. (eds) Artificial Intelligence. CICAI 2023. Lecture Notes in Computer Science, vol 14473. Springer, Singapore. https://doi.org/10.1007/978-981-99-8850-1_44
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8849-5
Online ISBN: 978-981-99-8850-1
eBook Packages: Computer Science, Computer Science (R0)