
Commands 4 Autonomous Vehicles (C4AV) Workshop Summary

  • Conference paper

Abstract

The task of visual grounding requires locating the most relevant region or object in an image, given a natural language query. So far, progress on this task has mostly been measured on curated datasets, which are not always representative of human spoken language. In this work, we deviate from recent, popular task settings and consider the problem under an autonomous vehicle scenario. In particular, we consider a situation where passengers can give free-form natural language commands to a vehicle, which can be associated with an object in the street scene. To stimulate research on this topic, we have organized the Commands for Autonomous Vehicles (C4AV) challenge based on the recent Talk2Car dataset. This paper presents the results of the challenge. First, we compare the used benchmark against existing datasets for visual grounding. Second, we identify the aspects that render top-performing models successful, and relate them to existing state-of-the-art models for visual grounding, in addition to detecting potential failure cases by evaluating on carefully selected subsets. Finally, we discuss several possibilities for future work.
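To make the task setup concrete: entries in this formulation typically receive a set of candidate regions (e.g. from an object detector) and must select the one referred to by the command. The following is only a rough sketch of such a region-ranking formulation, not the approach of any particular challenge participant; the encoders, feature dimension and tensor shapes are placeholders.

```python
import torch
import torch.nn.functional as F

def ground_command(region_feats: torch.Tensor, command_feat: torch.Tensor) -> int:
    """Return the index of the candidate region that best matches the command.

    region_feats: (N, D) features of N candidate regions, e.g. crops encoded by a
                  pretrained image encoder (placeholder, not a specific model).
    command_feat: (D,) embedding of the natural-language command, e.g. from a
                  sentence encoder (also a placeholder).
    """
    # Cosine similarity between each region feature and the command embedding.
    scores = F.cosine_similarity(region_feats, command_feat.unsqueeze(0), dim=1)
    return int(scores.argmax())

# Toy usage with random features; D = 256 is an arbitrary choice.
regions = torch.randn(8, 256)   # 8 candidate regions from a detector
command = torch.randn(256)      # encoded command, e.g. "park behind the red car"
print(ground_command(regions, command))
```

Several challenge entries, for instance the baseline [44] and the cosine-similarity approach [35], build on a broadly similar region-ranking idea with learned encoders and task-specific training objectives.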

T. Deruyttere, S. Vandenhende and D. Grujicic contributed equally.


References

  1. Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086 (2018)

  2. Anderson, P., et al.: Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3674–3683 (2018)

  3. Antol, S., et al.: VQA: visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433 (2015)

  4. Caesar, H., et al.: nuScenes: a multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11621–11631 (2020)

  5. Chen, C., Liu, M.-Y., Tuzel, O., Xiao, J.: R-CNN for small object detection. In: Lai, S.-H., Lepetit, V., Nishino, K., Sato, Y. (eds.) ACCV 2016. LNCS, vol. 10115, pp. 214–230. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-54193-8_14

  6. Chen, H., Suhr, A., Misra, D., Snavely, N., Artzi, Y.: Touchdown: natural language navigation and spatial reasoning in visual street environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12538–12547 (2019)

  7. Chen, K., Kovvuri, R., Nevatia, R.: Query-guided regression network with context policy for phrase grounding. In: The IEEE International Conference on Computer Vision (ICCV) (2017)

  8. Dai, H., Luo, S., Ding, Y., Shao, L.: Commands for autonomous vehicles by progressively stacking visual-linguistic representations. In: Proceedings of the 16th European Conference on Computer Vision, 2020. Commands for Autonomous Vehicles (C4AV) ECCV Workshop (2020)

  9. Deruyttere, T., Collell, G., Moens, M.F.: Giving commands to a self-driving car: a multimodal reasoner for visual grounding. In: Reasoning for Complex QA Workshop, AAAI (2020)

  10. Deruyttere, T., Vandenhende, S., Grujicic, D., Van Gool, L., Moens, M.F.: Talk2Car: taking control of your self-driving car. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2088–2098 (2019)

  11. Gupta, S., Girshick, R., Arbeláez, P., Malik, J.: Learning rich features from RGB-D images for object detection and segmentation. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8695, pp. 345–360. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10584-0_23

  12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR abs/1512.03385 (2015). http://arxiv.org/abs/1512.03385

  13. Hodosh, M., Young, P., Hockenmaier, J.: Framing image description as a ranking task: data, models and evaluation metrics. J. Artif. Intell. Res. 47, 853–899 (2013)

  14. Hu, R., Andreas, J., Darrell, T., Saenko, K.: Explainable neural computation via stack neural module networks. CoRR abs/1807.08556 (2018). http://arxiv.org/abs/1807.08556

  15. Hu, R., Andreas, J., Darrell, T., Saenko, K.: Explainable neural computation via stack neural module networks. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 53–69 (2018)

  16. Hu, R., Xu, H., Rohrbach, M., Feng, J., Saenko, K., Darrell, T.: Natural language object retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4555–4564 (2016)

  17. Hudson, D.A., Manning, C.D.: Compositional attention networks for machine reasoning. CoRR abs/1803.03067 (2018). http://arxiv.org/abs/1803.03067

  18. Hudson, D.A., Manning, C.D.: Compositional attention networks for machine reasoning (2018)

  19. Johnson, J., et al.: Inferring and executing programs for visual reasoning. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3008–3017 (2017)

  20. Karpathy, A., Joulin, A., Fei-Fei, L.: Deep fragment embeddings for bidirectional image sentence mapping. In: Advances in Neural Information Processing Systems, pp. 1889–1897 (2014)

  21. Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: ReferItGame: referring to objects in photographs of natural scenes. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 787–798 (2014)

  22. Kovvuri, R., Nevatia, R.: PIRC Net: using proposal indexing, relationships and context for phrase grounding. In: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11364, pp. 451–467. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20870-7_28

  23. Li, J., Liang, X., Wei, Y., Xu, T., Feng, J., Yan, S.: Perceptual generative adversarial networks for small object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1222–1230 (2017)

  24. Lin, T., et al.: Microsoft COCO: common objects in context. CoRR abs/1405.0312 (2014). http://arxiv.org/abs/1405.0312

  25. Luo, S., Dai, H., Shao, L., Ding, Y.: Cross-modal representations from transformer. In: Proceedings of the 16th European Conference on Computer Vision, 2020. Commands for Autonomous Vehicles (C4AV) Workshop (2020)

  26. Ma, E.: NLP augmentation (2019). https://github.com/makcedward/nlpaug

  27. Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11–20 (2016)

  28. Mittal, V.: AttnGrounder: talking to cars with attention. In: Proceedings of the 16th European Conference on Computer Vision, 2020. Commands for Autonomous Vehicles (C4AV) Workshop (2020)

  29. Ou, J., Zhang, X.: Attention enhanced single stage multi-modal reasoner. In: Proceedings of the 16th European Conference on Computer Vision, 2020. Commands for Autonomous Vehicles (C4AV) ECCV Workshop (2020)

  30. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp. 8026–8037 (2019)

  31. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)

  32. Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2019). http://arxiv.org/abs/1908.10084

  33. Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. CoRR abs/1506.01497 (2015). http://arxiv.org/abs/1506.01497

  34. Rohrbach, A., Rohrbach, M., Hu, R., Darrell, T., Schiele, B.: Grounding of textual phrases in images by reconstruction. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 817–834. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_49

  35. Rufus, N., Nair, U., Krishnam, M., Gandhi, V.: Cosine meets softmax: a tough-to-beat baseline for visual grounding. In: Proceedings of the 16th European Conference on Computer Vision, 2020. Commands for Autonomous Vehicles (C4AV) Workshop (2020)

  36. Sadhu, A., Chen, K., Nevatia, R.: Zero-shot grounding of objects from natural language queries (2019)

  37. Savva, M., et al.: Habitat: a platform for embodied AI research. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 9339–9347 (2019)

  38. Sohn, K., et al.: FixMatch: simplifying semi-supervised learning with consistency and confidence (2020)

  39. Su, W., et al.: VL-BERT: pre-training of generic visual-linguistic representations. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=SygXPaEYvH

  40. Suarez, J., Johnson, J., Li, F.F.: DDRprog: a CLEVR differentiable dynamic reasoning programmer (2018). http://arxiv.org/abs/1803.11361

  41. Tan, M., Le, Q.V.: EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946 (2019)

  42. Thomason, J., Murray, M., Cakmak, M., Zettlemoyer, L.: Vision-and-dialog navigation. In: Conference on Robot Learning, pp. 394–406 (2020)

  43. Van Gansbeke, W., Vandenhende, S., Georgoulis, S., Proesmans, M., Van Gool, L.: SCAN: learning to classify images without labels. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12355, pp. 268–285. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58607-2_16

  44. Vandenhende, S., Deruyttere, T., Grujicic, D.: A baseline for the commands for autonomous vehicles challenge. arXiv preprint arXiv:2004.13822 (2020)

  45. Vandenhende, S., Georgoulis, S., Proesmans, M., Dai, D., Van Gool, L.: Revisiting multi-task learning in the deep learning era. arXiv preprint arXiv:2004.13379 (2020)

  46. Vasudevan, A.B., Dai, D., Van Gool, L.: Talk2Nav: long-range vision-and-language navigation in cities. arXiv preprint arXiv:1910.02029 (2019)

  47. Vasudevan, A.B., Dai, D., Van Gool, L., Zurich, E.: Object referring in videos with language and human gaze (2018)

  48. Vaswani, A., et al.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)

  49. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164 (2015)

  50. Wang, L., Li, Y., Huang, J., Lazebnik, S.: Learning two-branch neural networks for image-text matching tasks. IEEE Trans. Pattern Anal. Mach. Intell. 41(2), 394–407 (2018)

  51. Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057 (2015)

  52. Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2, 67–78 (2014)

  53. Yu, L., et al.: MAttNet: modular attention network for referring expression comprehension. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1307–1315 (2018)

  54. Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 69–85. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_5

  55. Zhou, X., Wang, D., Krähenbühl, P.: Objects as points. arXiv preprint arXiv:1904.07850 (2019)


Acknowledgements

This project is sponsored by the MACCHINA project from KU Leuven (grant number C14/18/065). Additionally, we acknowledge support from the Flemish Government under the Artificial Intelligence (AI) Flanders programme. Finally, we thank Huawei for sponsoring the workshop and AICrowd for hosting our challenge.

Author information

Authors and Affiliations

  1. Department of Computer Science (CS), KU Leuven, Leuven, Belgium

    Thierry Deruyttere & Marie-Francine Moens

  2. Department of Electrical Engineering (ESAT), KU Leuven, Leuven, Belgium

    Simon Vandenhende, Dusan Grujicic, Yu Liu, Luc Van Gool, Matthew Blaschko & Tinne Tuytelaars

Authors
  1. Thierry Deruyttere
  2. Simon Vandenhende
  3. Dusan Grujicic
  4. Yu Liu
  5. Luc Van Gool
  6. Matthew Blaschko
  7. Tinne Tuytelaars
  8. Marie-Francine Moens

Corresponding author

Correspondence to Thierry Deruyttere.

Editor information

Editors and Affiliations

  1. University of Clermont Auvergne, Clermont Ferrand, France

    Adrien Bartoli

  2. Università degli Studi di Udine, Udine, Italy

    Andrea Fusiello

Electronic supplementary material

Below is the link to the electronic supplementary material.

Appendix A

A.1 Multi-step Reasoning (MSRR)

This section discusses the influence of the number of reasoning steps and showcases an example where the MSRR [9] finds the correct answer for a command by using multiple reasoning steps.

First, we look at the influence of the number of reasoning steps. Consider an MSRR model that uses 10 reasoning steps. Figure 4 shows in which of these 10 steps the model makes its final prediction. It is clear that most final predictions are made in the very first reasoning step. For instance, if we only considered the answers from the first step and ignored any change of decision in the following steps, we would achieve \(\approx 55\%\) \(IoU_{.5}\). Yet, by including more reasoning steps, we can further improve this to \(\approx 60\%\) \(IoU_{.5}\). This shows that multiple reasoning steps can be beneficial for this kind of task.
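For concreteness, the following is a minimal sketch (not the challenge evaluation code) of how the \(IoU_{.5}\) metric and the step-wise comparison above could be computed. It assumes axis-aligned boxes in (x1, y1, x2, y2) format and a hypothetical `step_predictions` structure holding one predicted box per reasoning step for each sample.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def iou_at_05(pred_boxes, gt_boxes):
    """IoU_.5 accuracy: fraction of samples whose predicted box reaches IoU >= 0.5."""
    hits = sum(iou(p, g) >= 0.5 for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)

def compare_first_vs_final(step_predictions, gt_boxes):
    """step_predictions[i] holds one predicted box per reasoning step for sample i
    (a hypothetical layout; the actual MSRR code may store this differently)."""
    first_step = [steps[0] for steps in step_predictions]    # stop after step 1
    final_step = [steps[-1] for steps in step_predictions]   # use all reasoning steps
    return iou_at_05(first_step, gt_boxes), iou_at_05(final_step, gt_boxes)
```

Evaluated this way, the first-step predictions would correspond to the \(\approx 55\%\) figure and the final-step predictions to the \(\approx 60\%\) figure reported above.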

Fig. 4. This plot shows in which step a 10-step MSRR makes its final decision. MSRR Correct (blue) indicates that the model's final answer is also the correct answer, while MSRR Wrong (orange) indicates that the final answer is wrong. (Color figure online)

Fig. 5. Explaining the visualisation of the reasoning process (Part 1). Figure from [9].

The example in this section relies on a specific visualisation that first needs to be introduced. Figures 5 and 6 explain this visualisation in detail. Figure 7 then shows the state of the MSRR before the reasoning process starts. Figure 8 shows that the model initially makes a wrong decision, but in Fig. 9, after six reasoning steps, it selects the correct answer. Finally, Fig. 10 shows that the object selected after six reasoning steps is the final output of the model.

Fig. 6. Explaining the visualisation of the reasoning process (Part 2). Figure from [9].

Fig. 7. Example 3 - The state of the model before the reasoning process starts for the given command, regions and image. Figure from [9].

Fig. 8. Example 3 - Visualization of the reasoning process, step 1. Figure from [9].

Fig. 9. Example 3 - Visualization of the reasoning process, step 6. Figure from [9].

Fig. 10. Example 3 - Visualization of the reasoning process, final step. Figure from [9].


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Deruyttere, T. et al. (2020). Commands 4 Autonomous Vehicles (C4AV) Workshop Summary. In: Bartoli, A., Fusiello, A. (eds) Computer Vision – ECCV 2020 Workshops. ECCV 2020. Lecture Notes in Computer Science, vol 12536. Springer, Cham. https://doi.org/10.1007/978-3-030-66096-3_1
