- Thierry Deruyttere,
- Simon Vandenhende,
- Dusan Grujicic,
- Yu Liu,
- Luc Van Gool,
- Matthew Blaschko,
- Tinne Tuytelaars &
- Marie-Francine Moens
Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12536)
Included in the following conference series: ECCV: European Conference on Computer Vision
Abstract
The task of visual grounding requires locating the most relevant region or object in an image, given a natural language query. So far, progress on this task has mostly been measured on curated datasets, which are not always representative of human spoken language. In this work, we deviate from recent, popular task settings and consider the problem under an autonomous vehicle scenario. In particular, we consider a situation where passengers can give free-form natural language commands to a vehicle, each of which can be associated with an object in the street scene. To stimulate research on this topic, we have organized the Commands for Autonomous Vehicles (C4AV) challenge based on the recent Talk2Car dataset. This paper presents the results of the challenge. First, we compare the benchmark used in the challenge against existing datasets for visual grounding. Second, we identify the aspects that render top-performing models successful, relate them to existing state-of-the-art models for visual grounding, and detect potential failure cases by evaluating on carefully selected subsets. Finally, we discuss several possibilities for future work.
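To make the task setup concrete, the sketch below illustrates the region-proposal formulation shared by most challenge entries: embed the command, embed a set of pre-extracted region proposals, and select the proposal that best matches the command. This is a minimal illustration under assumed names and feature sizes (the `CommandGrounder` module and the 768/2048-dimensional inputs are ours, not the API of any particular submission).

```python
# Minimal sketch of the region-proposal formulation: embed the command,
# embed each region proposal, and pick the proposal that matches best.
# All module names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class CommandGrounder(nn.Module):
    def __init__(self, text_dim=768, region_dim=2048, joint_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, joint_dim)     # e.g. a sentence embedding
        self.region_proj = nn.Linear(region_dim, joint_dim)  # e.g. CNN RoI features

    def forward(self, command_emb, region_feats):
        # command_emb: (B, text_dim); region_feats: (B, N, region_dim)
        q = self.text_proj(command_emb)                      # (B, joint_dim)
        r = self.region_proj(region_feats)                   # (B, N, joint_dim)
        scores = torch.einsum('bd,bnd->bn', q, r)            # dot-product matching
        return scores.argmax(dim=1), scores                  # best proposal per command

# Usage: one command scored against 32 region proposals from a detector.
model = CommandGrounder()
idx, scores = model(torch.randn(1, 768), torch.randn(1, 32, 2048))
```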
T. Deruyttere, S. Vandenhende, and D. Grujicic contributed equally.
References
Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086 (2018)
Anderson, P., et al.: Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3674–3683 (2018)
Antol, S., et al.: VQA: visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433 (2015)
Caesar, H., et al.: nuScenes: a multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11621–11631 (2020)
Chen, C., Liu, M.-Y., Tuzel, O., Xiao, J.: R-CNN for small object detection. In: Lai, S.-H., Lepetit, V., Nishino, K., Sato, Y. (eds.) ACCV 2016. LNCS, vol. 10115, pp. 214–230. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-54193-8_14
Chen, H., Suhr, A., Misra, D., Snavely, N., Artzi, Y.: Touchdown: natural language navigation and spatial reasoning in visual street environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12538–12547 (2019)
Chen, K., Kovvuri, R., Nevatia, R.: Query-guided regression network with context policy for phrase grounding. In: The IEEE International Conference on Computer Vision (ICCV) (2017)
Dai, H., Luo, S., Ding, Y., Shao, L.: Commands for autonomous vehicles by progressively stacking visual-linguistic representations. In: Proceedings of the 16th European Conference on Computer Vision, 2020. Commands for Autonomous Vehicles (C4AV) ECCV Workshop (2020)
Deruyttere, T., Collell, G., Moens, M.F.: Giving commands to a self-driving car: a multimodal reasoner for visual grounding. In: Reasoning for Complex QA Workshop, AAAI (2020)
Deruyttere, T., Vandenhende, S., Grujicic, D., Van Gool, L., Moens, M.F.: Talk2car: taking control of your self-driving car. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2088–2098 (2019)
Gupta, S., Girshick, R., Arbeláez, P., Malik, J.: Learning rich features from RGB-D images for object detection and segmentation. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8695, pp. 345–360. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10584-0_23
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR abs/1512.03385 (2015). http://arxiv.org/abs/1512.03385
Hodosh, M., Young, P., Hockenmaier, J.: Framing image description as a ranking task: data, models and evaluation metrics. J. Artif. Intell. Res. 47, 853–899 (2013)
Hu, R., Andreas, J., Darrell, T., Saenko, K.: Explainable neural computation via stack neural module networks. CoRR abs/1807.08556 (2018). http://arxiv.org/abs/1807.08556
Hu, R., Andreas, J., Darrell, T., Saenko, K.: Explainable neural computation via stack neural module networks. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 53–69 (2018)
Hu, R., Xu, H., Rohrbach, M., Feng, J., Saenko, K., Darrell, T.: Natural language object retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4555–4564 (2016)
Hudson, D.A., Manning, C.D.: Compositional attention networks for machine reasoning. CoRR abs/1803.03067 (2018). http://arxiv.org/abs/1803.03067
Hudson, D.A., Manning, C.D.: Compositional attention networks for machine reasoning. In: International Conference on Learning Representations (2018)
Johnson, J., et al.: Inferring and executing programs for visual reasoning. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3008–3017 (2017)
Karpathy, A., Joulin, A., Fei-Fei, L.: Deep fragment embeddings for bidirectional image sentence mapping. In: Advances in Neural Information Processing Systems, pp. 1889–1897 (2014)
Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: ReferItGame: referring to objects in photographs of natural scenes. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 787–798 (2014)
Kovvuri, R., Nevatia, R.: PIRC Net: using proposal indexing, relationships and context for phrase grounding. In: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11364, pp. 451–467. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20870-7_28
Li, J., Liang, X., Wei, Y., Xu, T., Feng, J., Yan, S.: Perceptual generative adversarial networks for small object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1222–1230 (2017)
Lin, T., et al.: Microsoft COCO: common objects in context. CoRR abs/1405.0312 (2014). http://arxiv.org/abs/1405.0312
Luo, S., Dai, H., Shao, L., Ding, Y.: Cross-modal representations from transformer. In: Proceedings of the 16th European Conference on Computer Vision, 2020. Commands for Autonomous Vehicles (C4AV) Workshop (2020)
Ma, E.: NLP augmentation (2019). https://github.com/makcedward/nlpaug
Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11–20 (2016)
Mittal, V.: Attngrounder: talking to cars with attention. In: Proceedings of the 16th European Conference on Computer Vision, 2020. Commands for Autonomous Vehicles (C4AV) Workshop (2020)
Ou, J., Zhang, X.: Attention enhanced single stage multi-modal reasoner. In: Proceedings of the 16th European Conference on Computer Vision, 2020. Commands for Autonomous Vehicles (C4AV) ECCV Workshop (2020)
Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp. 8026–8037 (2019)
Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2019). http://arxiv.org/abs/1908.10084
Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. CoRR abs/1506.01497 (2015). http://arxiv.org/abs/1506.01497
Rohrbach, A., Rohrbach, M., Hu, R., Darrell, T., Schiele, B.: Grounding of textual phrases in images by reconstruction. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 817–834. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_49
Rufus, N., Nair, U., Krishnam, M., Gandhi, V.: Cosine meets softmax: a tough-to-beat baseline for visual grounding. In: Proceedings of the 16th European Conference on Computer Vision, 2020. Commands for Autonomous Vehicles (C4AV) Workshop (2020)
Sadhu, A., Chen, K., Nevatia, R.: Zero-shot grounding of objects from natural language queries. In: Proceedings of the IEEE International Conference on Computer Vision (2019)
Savva, M., et al.: Habitat: a platform for embodied AI research. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 9339–9347 (2019)
Sohn, K., et al.: FixMatch: simplifying semi-supervised learning with consistency and confidence. In: Advances in Neural Information Processing Systems (2020)
Su, W., et al.: VL-BERT: pre-training of generic visual-linguistic representations. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=SygXPaEYvH
Suarez, J., Johnson, J., Fei-Fei, L.: DDRprog: a CLEVR differentiable dynamic reasoning programmer (2018). http://arxiv.org/abs/1803.11361
Tan, M., Le, Q.V.: EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946 (2019)
Thomason, J., Murray, M., Cakmak, M., Zettlemoyer, L.: Vision-and-dialog navigation. In: Conference on Robot Learning, pp. 394–406 (2020)
Van Gansbeke, W., Vandenhende, S., Georgoulis, S., Proesmans, M., Van Gool, L.: SCAN: learning to classify images without labels. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12355, pp. 268–285. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58607-2_16
Vandenhende, S., Deruyttere, T., Grujicic, D.: A baseline for the commands for autonomous vehicles challenge. arXiv preprint arXiv:2004.13822 (2020)
Vandenhende, S., Georgoulis, S., Proesmans, M., Dai, D., Van Gool, L.: Revisiting multi-task learning in the deep learning era. arXiv preprint arXiv:2004.13379 (2020)
Vasudevan, A.B., Dai, D., Van Gool, L.: Talk2Nav: long-range vision-and-language navigation in cities. arXiv preprint arXiv:1910.02029 (2019)
Vasudevan, A.B., Dai, D., Van Gool, L.: Object referring in videos with language and human gaze. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
Vaswani, A., et al.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)
Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164 (2015)
Wang, L., Li, Y., Huang, J., Lazebnik, S.: Learning two-branch neural networks for image-text matching tasks. IEEE Trans. Pattern Anal. Mach. Intell. 41(2), 394–407 (2018)
Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057 (2015)
Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2, 67–78 (2014)
Yu, L., et al.: MAttNet: modular attention network for referring expression comprehension. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1307–1315 (2018)
Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 69–85. Springer, Cham (2016).https://doi.org/10.1007/978-3-319-46475-6_5
Zhou, X., Wang, D., Krähenbühl, P.: Objects as points. arXiv preprint arXiv:1904.07850 (2019)
Acknowledgements
This project is sponsored by the MACCHINA project from KU Leuven with grant number C14/18/065. Additionally, we acknowledge support by the Flemish Government under the Artificial Intelligence (AI) Flanders programme. Finally, we thank Huawei for sponsoring the workshop and AIcrowd for hosting our challenge.
Author information
Authors and Affiliations
Department of Computer Science (CS), KU Leuven, Leuven, Belgium
Thierry Deruyttere & Marie-Francine Moens
Department of Electrical Engineering (ESAT), KU Leuven, Leuven, Belgium
Simon Vandenhende, Dusan Grujicic, Yu Liu, Luc Van Gool, Matthew Blaschko & Tinne Tuytelaars
Corresponding author
Correspondence to Thierry Deruyttere.
Editor information
Editors and Affiliations
University of Clermont Auvergne, Clermont Ferrand, France
Adrien Bartoli
Università degli Studi di Udine, Udine, Italy
Andrea Fusiello
Electronic supplementary material
Appendix A
A.1 Multi-step Reasoning (MSRR)
This section discusses the influence of the number of reasoning steps and showcases an example where the MSRR [9] finds the correct answer for a command by using multiple reasoning steps.
First, we look at the influence of the number of reasoning steps. Assume an MSRR model that uses 10 reasoning steps; Fig. 4 shows in which of these 10 steps the model makes its final prediction. Most final predictions are made in the very first reasoning step. For instance, if we only considered the answers from the first step and ignored any change of decision in the following steps, we would achieve \(\approx 55\%\) \(IoU_{.5}\). Yet, by including more reasoning steps, this can be further improved to \(\approx 60\%\) \(IoU_{.5}\). This shows that multiple reasoning steps can be beneficial for this kind of task.
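To make this analysis concrete, the sketch below shows one way to compute the quantity plotted in Fig. 4, together with the \(IoU_{.5}\) criterion used above. The per-step score matrix is an assumed interface for illustration; the actual MSRR update rule is described in [9].

```python
# Hedged sketch of the Fig. 4 analysis: given the scores an iterative model
# assigns to N region proposals after each of T reasoning steps, find the
# step at which the final prediction is reached, and check IoU@0.5.
# The (T, N) score matrix is an assumed interface; MSRR itself is in [9].
import torch

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def final_decision_step(step_scores):
    """step_scores: (T, N) tensor of per-step proposal scores.
    Returns the proposal predicted after the last step and the first
    step (1-indexed) from which that prediction no longer changes."""
    preds = step_scores.argmax(dim=1)          # prediction after each step
    final = preds[-1].item()
    step = step_scores.shape[0] - 1
    while step > 0 and preds[step - 1].item() == final:
        step -= 1
    return final, step + 1

# A prediction counts as correct under IoU@0.5 if its box overlaps the
# ground-truth box with iou(...) >= 0.5.
scores = torch.randn(10, 32)                   # 10 steps, 32 proposals
pred, step = final_decision_step(scores)
```

Truncating this loop after the first step versus running all 10 steps is what the \(\approx 55\%\) versus \(\approx 60\%\) \(IoU_{.5}\) comparison above corresponds to.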
This plot shows in which step an MSRR with 10 reasoning steps makes its final decision. MSRR Correct (blue) indicates that the model's final answer is also the correct answer, while MSRR Wrong (orange) indicates that the final answer is wrong. (Color figure online)
Explaining the visualisation of the reasoning process (Part 1). Figure from [9].
The example in this section uses a specific visualisation that first needs to be introduced; Figs. 5 and 6 explain it in detail. Fig. 7 then shows the starting state of the MSRR. Figure 8 shows that the model at first makes a wrong decision, but in Fig. 9, after six reasoning steps, we see that the model selects the correct answer. Finally, Fig. 10 shows that the object selected after six reasoning steps is the final output of the model.
Explaining the visualisation of the reasoning process (Part 2). Figure from [9].
Example 3 - The state of the model before the reasoning process starts for the given command, regions and image. Figure from [9].
Example 3 - Visualization of reasoning process. Step 1. Figure from [9].
Example 3 - Visualization of reasoning process. Step 6. Figure from [9].
Example 3 - Visualization of reasoning process. Final step. Figure from [9].
Rights and permissions
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Deruyttere, T. et al. (2020). Commands 4 Autonomous Vehicles (C4AV) Workshop Summary. In: Bartoli, A., Fusiello, A. (eds.) Computer Vision – ECCV 2020 Workshops. ECCV 2020. Lecture Notes in Computer Science, vol. 12536. Springer, Cham. https://doi.org/10.1007/978-3-030-66096-3_1
Download citation
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66095-6
Online ISBN: 978-3-030-66096-3
eBook Packages: Computer Science, Computer Science (R0)