
Is Cross-Modal Information Retrieval Possible Without Training?

  • Conference paper
  • First Online:

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13981)

Included in the following conference series:

  • European Conference on Information Retrieval (ECIR)

Abstract

Encoded representations from a pretrained deep learning model (e.g., BERT text embeddings, or the penultimate CNN layer activations of an image) convey a rich set of features beneficial for information retrieval. The embeddings of each data modality occupy a high-dimensional space of their own, but one space can be semantically aligned to another by a simple mapping, without training a deep neural net. In this paper, we take simple mappings, computed by least squares and by the singular value decomposition (SVD) solution to the orthogonal Procrustes problem, as a means of cross-modal information retrieval. That is, given information in one modality such as text, the mapping helps us locate a semantically equivalent data item in another modality such as image. Using off-the-shelf pretrained deep learning models, we have experimented with these simple cross-modal mappings on text-to-image and image-to-text retrieval tasks. Despite their simplicity, our mappings perform reasonably well, reaching a best recall@10 of 77%, which is comparable to methods that require costly neural net training and fine-tuning. We have improved the simple mappings by applying contrastive learning to the pretrained models; contrastive learning can be thought of as properly biasing the pretrained encoders to enhance the quality of the cross-modal mapping. We have further improved performance with a gated multilayer perceptron (gMLP), a simple neural architecture.
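To make the closed-form approach concrete, below is a minimal NumPy sketch of the two training-free mappings the abstract describes: an ordinary least-squares map and the SVD solution to the orthogonal Procrustes problem. The variable names (X for the matrix of paired text embeddings, Y for the paired image embeddings, both assumed to share a common dimension d) and the retrieval helper are illustrative assumptions, not the authors' released code.

```python
import numpy as np

# Illustrative setup (assumed, not from the paper's code): X is an n x d
# matrix of text embeddings and Y the n x d matrix of their paired image
# embeddings, both already projected to a common dimension d.

def least_squares_map(X, Y):
    """Linear map W minimizing ||XW - Y||_F^2, via the pseudoinverse."""
    return np.linalg.pinv(X) @ Y

def procrustes_map(X, Y):
    """Orthogonal map W minimizing ||XW - Y||_F (Schönemann, 1966).

    With the SVD X^T Y = U S V^T, the minimizer is W = U V^T.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    return U @ Vt

def retrieve_top_k(query, W, gallery, k=10):
    """Map one query embedding into the other modality's space and rank
    the gallery by cosine similarity; recall@k checks these top-k hits."""
    q = query @ W
    sims = (gallery @ q) / (np.linalg.norm(gallery, axis=1) * np.linalg.norm(q) + 1e-9)
    return np.argsort(-sims)[:k]
```

Either mapping reduces to a single d×d matrix obtained in closed form from paired embeddings, which is why no gradient-based training is required for the retrieval step itself.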


References

  1. Brown, T., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901 (2020)

  2. Choi, H., Kim, J., Joe, S., Gwon, Y.: Evaluation of BERT and ALBERT sentence embedding performance on downstream NLP tasks. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 5482–5487 (2021). https://doi.org/10.1109/ICPR48806.2021.9412102

  3. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT (1), pp. 4171–4186. Association for Computational Linguistics (2019)

  4. Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=YicbFdNTTy

  5. Gao, T., Yao, X., Chen, D.: SimCSE: simple contrastive learning of sentence embeddings. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6894–6910 (2021)

  6. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778. IEEE Computer Society (2016)

  7. Huang, Z., Zeng, Z., Liu, B., Fu, D., Fu, J.: Pixel-BERT: aligning image pixels with text by deep multi-modal transformers. CoRR abs/2004.00849 (2020)

  8. Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, ICML 2021, Virtual Event. Proceedings of Machine Learning Research, 18–24 July 2021, vol. 139, pp. 4904–4916. PMLR (2021)

  9. Li, X., et al.: Oscar: object-semantics aligned pre-training for vision-language tasks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 121–137. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_8

  10. Liu, H., Dai, Z., So, D., Le, Q.V.: Pay attention to MLPs. In: Thirty-Fifth Conference on Neural Information Processing Systems (2021). https://openreview.net/forum?id=KBnXrODoBW

  11. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. CoRR abs/1907.11692 (2019). http://arxiv.org/abs/1907.11692

  12. Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc. (2019). https://proceedings.neurips.cc/paper/2019/hash/c74d97b01eae257e44aa9d5bade97baf-Abstract.html

  13. Qi, D., Su, L., Song, J., Cui, E., Bharti, T., Sacheti, A.: ImageBERT: cross-modal pre-training with large-scale weak-supervised image-text data. CoRR abs/2001.07966 (2020)

  14. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, ICML 2021, Virtual Event. Proceedings of Machine Learning Research, 18–24 July 2021, vol. 139, pp. 8748–8763. PMLR (2021)

  15. Sariyildiz, M.B., Perez, J., Larlus, D.: Learning visual representations with caption annotations. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12353, pp. 153–170. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58598-3_10

  16. Schönemann, P.: A generalized solution of the orthogonal Procrustes problem. Psychometrika 31(1), 1–10 (1966). https://doi.org/10.1007/BF02289451

  17. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015, Conference Track Proceedings (2015). http://arxiv.org/abs/1409.1556

  18. Su, W., et al.: VL-BERT: pre-training of generic visual-linguistic representations. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020. OpenReview.net (2020)

  19. Sun, C., Shrivastava, A., Singh, S., Gupta, A.: Revisiting unreasonable effectiveness of data in deep learning era. CoRR abs/1707.02968 (2017). http://arxiv.org/abs/1707.02968

  20. Tan, H., Bansal, M.: LXMERT: learning cross-modality encoder representations from transformers. In: Inui, K., Jiang, J., Ng, V., Wan, X. (eds.) Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, 3–7 November 2019, pp. 5099–5110. Association for Computational Linguistics (2019)

  21. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)

  22. Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2, 67–78 (2014)

  23. Zhang, Y., Jiang, H., Miura, Y., Manning, C.D., Langlotz, C.P.: Contrastive learning of medical visual representations from paired images and text. CoRR abs/2010.00747 (2020)

Author information

Authors and Affiliations

  1. Samsung SDS, Seoul, Korea

    Hyunjin Choi, Hyunjae Lee, Seongho Joe & Youngjune Gwon

Corresponding author

Correspondence to Hyunjin Choi.

Editor information

Editors and Affiliations

  1. University of Amsterdam, Amsterdam, The Netherlands

    Jaap Kamps

  2. Université Grenoble-Alpes, Saint-Martin-d’Hères, France

    Lorraine Goeuriot

  3. Università della Svizzera Italiana, Lugano, Switzerland

    Fabio Crestani

  4. University of Copenhagen, Copenhagen, Denmark

    Maria Maistro

  5. University of Tsukuba, Ibaraki, Japan

    Hideo Joho

  6. Dublin City University, Dublin, Ireland

    Brian Davis

  7. Dublin City University, Dublin, Ireland

    Cathal Gurrin

  8. Universität Regensburg, Regensburg, Germany

    Udo Kruschwitz

  9. Dublin City University, Dublin, Ireland

    Annalina Caputo

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Choi, H., Lee, H., Joe, S., Gwon, Y. (2023). Is Cross-Modal Information Retrieval Possible Without Training? In: Kamps, J., et al. (eds.) Advances in Information Retrieval. ECIR 2023. Lecture Notes in Computer Science, vol 13981. Springer, Cham. https://doi.org/10.1007/978-3-031-28238-6_27
