- Thierry Deruyttere,
- Simon Vandenhende,
- Dusan Grujicic,
- Yu Liu,
- Luc Van Gool,
- Matthew Blaschko,
- Tinne Tuytelaars &
- Marie-Francine Moens
Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12536)
Included in the following conference series: ECCV: European Conference on Computer Vision
Abstract
The task of visual grounding requires locating the most relevant region or object in an image, given a natural language query. So far, progress on this task has mostly been measured on curated datasets, which are not always representative of human spoken language. In this work, we deviate from recent, popular task settings and consider the problem under an autonomous vehicle scenario. In particular, we consider a situation where passengers can give free-form natural language commands to a vehicle, each of which can be associated with an object in the street scene. To stimulate research on this topic, we have organized the Commands for Autonomous Vehicles (C4AV) challenge based on the recent Talk2Car dataset. This paper presents the results of the challenge. First, we compare the benchmark used in the challenge against existing datasets for visual grounding. Second, we identify the aspects that render top-performing models successful, relate them to existing state-of-the-art models for visual grounding, and detect potential failure cases by evaluating on carefully selected subsets. Finally, we discuss several possibilities for future work.
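To make the task setup concrete, the sketch below illustrates the region-proposal formulation shared by most challenge entries: embed the command, embed a set of pre-extracted region proposals, and select the proposal that best matches the command. This is a minimal illustration under assumed names and feature sizes (the `CommandGrounder` module and the 768/2048-dimensional inputs are ours, not the API of any particular submission).

```python
# Minimal sketch of the region-proposal formulation: embed the command,
# embed each region proposal, and pick the proposal that matches best.
# All module names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class CommandGrounder(nn.Module):
    def __init__(self, text_dim=768, region_dim=2048, joint_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, joint_dim)     # e.g. a sentence embedding
        self.region_proj = nn.Linear(region_dim, joint_dim)  # e.g. CNN RoI features

    def forward(self, command_emb, region_feats):
        # command_emb: (B, text_dim); region_feats: (B, N, region_dim)
        q = self.text_proj(command_emb)                      # (B, joint_dim)
        r = self.region_proj(region_feats)                   # (B, N, joint_dim)
        scores = torch.einsum('bd,bnd->bn', q, r)            # dot-product matching
        return scores.argmax(dim=1), scores                  # best proposal per command

# Usage: one command scored against 32 region proposals from a detector.
model = CommandGrounder()
idx, scores = model(torch.randn(1, 768), torch.randn(1, 32, 2048))
```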
T. Deruyttere, S. Vandenhende, and D. Grujicic contributed equally.
References
Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086 (2018)
Anderson, P., et al.: Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3674–3683 (2018)
Antol, S., et al.: VQA: visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433 (2015)
Caesar, H., et al.: nuScenes: a multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11621–11631 (2020)
Chen, C., Liu, M.-Y., Tuzel, O., Xiao, J.: R-CNN for small object detection. In: Lai, S.-H., Lepetit, V., Nishino, K., Sato, Y. (eds.) ACCV 2016. LNCS, vol. 10115, pp. 214–230. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-54193-8_14
Chen, H., Suhr, A., Misra, D., Snavely, N., Artzi, Y.: Touchdown: natural language navigation and spatial reasoning in visual street environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12538–12547 (2019)
Chen, K., Kovvuri, R., Nevatia, R.: Query-guided regression network with context policy for phrase grounding. In: The IEEE International Conference on Computer Vision (ICCV) (2017)
Dai, H., Luo, S., Ding, Y., Shao, L.: Commands for autonomous vehicles by progressively stacking visual-linguistic representations. In: Proceedings of the 16th European Conference on Computer Vision, 2020. Commands for Autonomous Vehicles (C4AV) ECCV Workshop (2020)
Deruyttere, T., Collell, G., Moens, M.F.: Giving commands to a self-driving car: a multimodal reasoner for visual grounding. In: Reasoning for Complex QA Workshop, AAAI (2020)
Deruyttere, T., Vandenhende, S., Grujicic, D., Van Gool, L., Moens, M.F.: Talk2car: taking control of your self-driving car. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2088–2098 (2019)
Gupta, S., Girshick, R., Arbeláez, P., Malik, J.: Learning rich features from RGB-D images for object detection and segmentation. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8695, pp. 345–360. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10584-0_23
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR abs/1512.03385 (2015). http://arxiv.org/abs/1512.03385
Hodosh, M., Young, P., Hockenmaier, J.: Framing image description as a ranking task: data, models and evaluation metrics. J. Artif. Intell. Res. 47, 853–899 (2013)
Hu, R., Andreas, J., Darrell, T., Saenko, K.: Explainable neural computation via stack neural module networks. CoRR abs/1807.08556 (2018). http://arxiv.org/abs/1807.08556
Hu, R., Andreas, J., Darrell, T., Saenko, K.: Explainable neural computation via stack neural module networks. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 53–69 (2018)
Hu, R., Xu, H., Rohrbach, M., Feng, J., Saenko, K., Darrell, T.: Natural language object retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4555–4564 (2016)
Hudson, D.A., Manning, C.D.: Compositional attention networks for machine reasoning. CoRR abs/1803.03067 (2018). http://arxiv.org/abs/1803.03067
Hudson, D.A., Manning, C.D.: Compositional attention networks for machine reasoning. In: International Conference on Learning Representations (2018)
Johnson, J., et al.: Inferring and executing programs for visual reasoning. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3008–3017 (2017)
Karpathy, A., Joulin, A., Fei-Fei, L.: Deep fragment embeddings for bidirectional image sentence mapping. In: Advances in Neural Information Processing Systems, pp. 1889–1897 (2014)
Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: ReferItGame: referring to objects in photographs of natural scenes. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 787–798 (2014)
Kovvuri, R., Nevatia, R.: PIRC Net: using proposal indexing, relationships and context for phrase grounding. In: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11364, pp. 451–467. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20870-7_28
Li, J., Liang, X., Wei, Y., Xu, T., Feng, J., Yan, S.: Perceptual generative adversarial networks for small object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1222–1230 (2017)
Lin, T., et al.: Microsoft COCO: common objects in context. CoRR abs/1405.0312 (2014). http://arxiv.org/abs/1405.0312
Luo, S., Dai, H., Shao, L., Ding, Y.: Cross-modal representations from transformer. In: Proceedings of the 16th European Conference on Computer Vision, 2020. Commands for Autonomous Vehicles (C4AV) Workshop (2020)
Ma, E.: NLP augmentation (2019). https://github.com/makcedward/nlpaug
Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11–20 (2016)
Mittal, V.: Attngrounder: talking to cars with attention. In: Proceedings of the 16th European Conference on Computer Vision, 2020. Commands for Autonomous Vehicles (C4AV) Workshop (2020)
Ou, J., Zhang, X.: Attention enhanced single stage multi-modal reasoner. In: Proceedings of the 16th European Conference on Computer Vision, 2020. Commands for Autonomous Vehicles (C4AV) ECCV Workshop (2020)
Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp. 8026–8037 (2019)
Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2019). http://arxiv.org/abs/1908.10084
Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. CoRR abs/1506.01497 (2015). http://arxiv.org/abs/1506.01497
Rohrbach, A., Rohrbach, M., Hu, R., Darrell, T., Schiele, B.: Grounding of textual phrases in images by reconstruction. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 817–834. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_49
Rufus, N., Nair, U., Krishnam, M., Gandhi, V.: Cosine meets softmax: a tough-to-beat baseline for visual grounding. In: Proceedings of the 16th European Conference on Computer Vision, 2020. Commands for Autonomous Vehicles (C4AV) Workshop (2020)
Sadhu, A., Chen, K., Nevatia, R.: Zero-shot grounding of objects from natural language queries. In: Proceedings of the IEEE International Conference on Computer Vision (2019)
Savva, M., et al.: Habitat: a platform for embodied AI research. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 9339–9347 (2019)
Sohn, K., et al.: FixMatch: simplifying semi-supervised learning with consistency and confidence. In: Advances in Neural Information Processing Systems (2020)
Su, W., et al.: VL-BERT: pre-training of generic visual-linguistic representations. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=SygXPaEYvH
Suarez, J., Johnson, J., Fei-Fei, L.: DDRprog: a CLEVR differentiable dynamic reasoning programmer (2018). http://arxiv.org/abs/1803.11361
Tan, M., Le, Q.V.: EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946 (2019)
Thomason, J., Murray, M., Cakmak, M., Zettlemoyer, L.: Vision-and-dialog navigation. In: Conference on Robot Learning, pp. 394–406 (2020)
Van Gansbeke, W., Vandenhende, S., Georgoulis, S., Proesmans, M., Van Gool, L.: SCAN: learning to classify images without labels. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12355, pp. 268–285. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58607-2_16
Vandenhende, S., Deruyttere, T., Grujicic, D.: A baseline for the commands for autonomous vehicles challenge. arXiv preprint arXiv:2004.13822 (2020)
Vandenhende, S., Georgoulis, S., Proesmans, M., Dai, D., Van Gool, L.: Revisiting multi-task learning in the deep learning era. arXiv preprint arXiv:2004.13379 (2020)
Vasudevan, A.B., Dai, D., Van Gool, L.: Talk2Nav: long-range vision-and-language navigation in cities. arXiv preprint arXiv:1910.02029 (2019)
Vasudevan, A.B., Dai, D., Van Gool, L.: Object referring in videos with language and human gaze. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
Vaswani, A., et al.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)
Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164 (2015)
Wang, L., Li, Y., Huang, J., Lazebnik, S.: Learning two-branch neural networks for image-text matching tasks. IEEE Trans. Pattern Anal. Mach. Intell. 41(2), 394–407 (2018)
Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057 (2015)
Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2, 67–78 (2014)
Yu, L., et al.: MAttNet: modular attention network for referring expression comprehension. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1307–1315 (2018)
Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 69–85. Springer, Cham (2016).https://doi.org/10.1007/978-3-319-46475-6_5
Zhou, X., Wang, D., Krähenbühl, P.: Objects as points. arXiv preprint arXiv:1904.07850 (2019)
Acknowledgements
This project is sponsored by the MACCHINA project from KU Leuven with grant number C14/18/065. Additionally, we acknowledge support by the Flemish Government under the Artificial Intelligence (AI) Flanders programme. Finally, we thank Huawei for sponsoring the workshop and AIcrowd for hosting our challenge.
Author information
Authors and Affiliations
Department of Computer Science (CS), KU Leuven, Leuven, Belgium
Thierry Deruyttere & Marie-Francine Moens
Department of Electrical Engineering (ESAT), KU Leuven, Leuven, Belgium
Simon Vandenhende, Dusan Grujicic, Yu Liu, Luc Van Gool, Matthew Blaschko & Tinne Tuytelaars
Corresponding author
Correspondence to Thierry Deruyttere.
Editor information
Editors and Affiliations
University of Clermont Auvergne, Clermont Ferrand, France
Adrien Bartoli
Università degli Studi di Udine, Udine, Italy
Andrea Fusiello
Electronic supplementary material
Appendix A
A.1 Multi-step Reasoning (MSRR)
This section discusses the influence of the number of reasoning steps and showcases an example where the MSRR [9] finds the correct answer for a command by using multiple reasoning steps.
First, we look at the influence of the number of reasoning steps. Assume an MSRR model that uses 10 reasoning steps; Fig. 4 shows in which of these 10 steps the model makes its final prediction. Most final predictions are made in the very first reasoning step. For instance, if we only considered the answers from the first step and ignored any change of decision in the following steps, we would achieve \(\approx 55\%\) \(IoU_{.5}\). Yet, by including more reasoning steps, this can be further improved to \(\approx 60\%\) \(IoU_{.5}\). This shows that multiple reasoning steps can be beneficial for this kind of task.
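To make this analysis concrete, the sketch below shows one way to compute the quantity plotted in Fig. 4, together with the \(IoU_{.5}\) criterion used above. The per-step score matrix is an assumed interface for illustration; the actual MSRR update rule is described in [9].

```python
# Hedged sketch of the Fig. 4 analysis: given the scores an iterative model
# assigns to N region proposals after each of T reasoning steps, find the
# step at which the final prediction is reached, and check IoU@0.5.
# The (T, N) score matrix is an assumed interface; MSRR itself is in [9].
import torch

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def final_decision_step(step_scores):
    """step_scores: (T, N) tensor of per-step proposal scores.
    Returns the proposal predicted after the last step and the first
    step (1-indexed) from which that prediction no longer changes."""
    preds = step_scores.argmax(dim=1)          # prediction after each step
    final = preds[-1].item()
    step = step_scores.shape[0] - 1
    while step > 0 and preds[step - 1].item() == final:
        step -= 1
    return final, step + 1

# A prediction counts as correct under IoU@0.5 if its box overlaps the
# ground-truth box with iou(...) >= 0.5.
scores = torch.randn(10, 32)                   # 10 steps, 32 proposals
pred, step = final_decision_step(scores)
```

Truncating this loop after the first step versus running all 10 steps is what the \(\approx 55\%\) versus \(\approx 60\%\) \(IoU_{.5}\) comparison above corresponds to.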
This plot shows in which step an MSRR with 10 reasoning steps makes its final decision. MSRR Correct (blue) indicates that the model's final answer is also the correct answer, while MSRR Wrong (orange) indicates that the final answer is wrong. (Color figure online)
Explaining the visualisation of the reasoning process (Part 1). Figure from [9].
The example in this section uses a specific visualisation that first needs to be introduced; Figs. 5 and 6 explain it in detail. Fig. 7 then shows the starting state of the MSRR. Figure 8 shows that the model at first makes a wrong decision, but in Fig. 9, after six reasoning steps, we see that the model selects the correct answer. Finally, Fig. 10 shows that the object selected after six reasoning steps is the final output of the model.
Explaining the visualisation of the reasoning process (Part 2). Figure from [9].
Example 3 - The state of the model before the reasoning process starts for the given command, regions and image. Figure from [9].
Example 3 - Visualization of reasoning process. Step 1. Figure from [9].
Example 3 - Visualization of reasoning process. Step 6. Figure from [9].
Example 3 - Visualization of reasoning process. Final step. Figure from [9].
Rights and permissions
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Deruyttere, T. et al. (2020). Commands 4 Autonomous Vehicles (C4AV) Workshop Summary. In: Bartoli, A., Fusiello, A. (eds.) Computer Vision – ECCV 2020 Workshops. ECCV 2020. Lecture Notes in Computer Science, vol. 12536. Springer, Cham. https://doi.org/10.1007/978-3-030-66096-3_1
Download citation
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66095-6
Online ISBN: 978-3-030-66096-3
eBook Packages: Computer Science, Computer Science (R0)