
Is Cross-Modal Information Retrieval Possible Without Training?

  • Conference paper
  • First Online:

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13981)

Included in the following conference series:

  • European Conference on Information Retrieval (ECIR)

Abstract

Encoded representations from a pretrained deep learning model (e.g., BERT text embeddings, or the penultimate CNN layer activations of an image) convey a rich set of features beneficial for information retrieval. The embeddings of each data modality occupy a high-dimensional space of their own, but one space can be semantically aligned to another by a simple mapping, without training a deep neural net. In this paper, we take simple mappings, computed by least squares and by the singular value decomposition (SVD) solution to the orthogonal Procrustes problem, as a means of cross-modal information retrieval. That is, given information in one modality such as text, the mapping helps us locate a semantically equivalent data item in another modality such as image. Using off-the-shelf pretrained deep learning models, we have experimented with these simple cross-modal mappings on text-to-image and image-to-text retrieval tasks. Despite their simplicity, our mappings perform reasonably well, reaching a best recall@10 of 77%, which is comparable to methods that require costly neural net training and fine-tuning. We have improved the simple mappings by applying contrastive learning to the pretrained models; contrastive learning can be thought of as properly biasing the pretrained encoders to enhance the quality of the cross-modal mapping. We have further improved performance with a gated multilayer perceptron (gMLP), a simple neural architecture.
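To make the closed-form approach concrete, below is a minimal NumPy sketch of the two training-free mappings the abstract describes: an ordinary least-squares map and the SVD solution to the orthogonal Procrustes problem. The variable names (X for the matrix of paired text embeddings, Y for the paired image embeddings, both assumed to share a common dimension d) and the retrieval helper are illustrative assumptions, not the authors' released code.

```python
import numpy as np

# Illustrative setup (assumed, not from the paper's code): X is an n x d
# matrix of text embeddings and Y the n x d matrix of their paired image
# embeddings, both already projected to a common dimension d.

def least_squares_map(X, Y):
    """Linear map W minimizing ||XW - Y||_F^2, via the pseudoinverse."""
    return np.linalg.pinv(X) @ Y

def procrustes_map(X, Y):
    """Orthogonal map W minimizing ||XW - Y||_F (Schönemann, 1966).

    With the SVD X^T Y = U S V^T, the minimizer is W = U V^T.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    return U @ Vt

def retrieve_top_k(query, W, gallery, k=10):
    """Map one query embedding into the other modality's space and rank
    the gallery by cosine similarity; recall@k checks these top-k hits."""
    q = query @ W
    sims = (gallery @ q) / (np.linalg.norm(gallery, axis=1) * np.linalg.norm(q) + 1e-9)
    return np.argsort(-sims)[:k]
```

Either mapping reduces to a single d×d matrix obtained in closed form from paired embeddings, which is why no gradient-based training is required for the retrieval step itself.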


References

  1. Brown, T., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901 (2020)

  2. Choi, H., Kim, J., Joe, S., Gwon, Y.: Evaluation of BERT and ALBERT sentence embedding performance on downstream NLP tasks. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 5482–5487 (2021). https://doi.org/10.1109/ICPR48806.2021.9412102

  3. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT (1), pp. 4171–4186. Association for Computational Linguistics (2019)

  4. Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=YicbFdNTTy

  5. Gao, T., Yao, X., Chen, D.: SimCSE: simple contrastive learning of sentence embeddings. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6894–6910 (2021)

  6. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778. IEEE Computer Society (2016)

  7. Huang, Z., Zeng, Z., Liu, B., Fu, D., Fu, J.: Pixel-BERT: aligning image pixels with text by deep multi-modal transformers. CoRR abs/2004.00849 (2020)

  8. Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, ICML 2021, Virtual Event. Proceedings of Machine Learning Research, 18–24 July 2021, vol. 139, pp. 4904–4916. PMLR (2021)

  9. Li, X., et al.: Oscar: object-semantics aligned pre-training for vision-language tasks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 121–137. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_8

  10. Liu, H., Dai, Z., So, D., Le, Q.V.: Pay attention to MLPs. In: Thirty-Fifth Conference on Neural Information Processing Systems (2021). https://openreview.net/forum?id=KBnXrODoBW

  11. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. CoRR abs/1907.11692 (2019). http://arxiv.org/abs/1907.11692

  12. Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc. (2019). https://proceedings.neurips.cc/paper/2019/hash/c74d97b01eae257e44aa9d5bade97baf-Abstract.html

  13. Qi, D., Su, L., Song, J., Cui, E., Bharti, T., Sacheti, A.: ImageBERT: cross-modal pre-training with large-scale weak-supervised image-text data. CoRR abs/2001.07966 (2020)

  14. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, ICML 2021, Virtual Event. Proceedings of Machine Learning Research, 18–24 July 2021, vol. 139, pp. 8748–8763. PMLR (2021)

  15. Sariyildiz, M.B., Perez, J., Larlus, D.: Learning visual representations with caption annotations. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12353, pp. 153–170. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58598-3_10

  16. Schönemann, P.: A generalized solution of the orthogonal Procrustes problem. Psychometrika 31(1), 1–10 (1966). https://doi.org/10.1007/BF02289451

  17. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015, Conference Track Proceedings (2015). http://arxiv.org/abs/1409.1556

  18. Su, W., et al.: VL-BERT: pre-training of generic visual-linguistic representations. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020. OpenReview.net (2020)

  19. Sun, C., Shrivastava, A., Singh, S., Gupta, A.: Revisiting unreasonable effectiveness of data in deep learning era. CoRR abs/1707.02968 (2017). http://arxiv.org/abs/1707.02968

  20. Tan, H., Bansal, M.: LXMERT: learning cross-modality encoder representations from transformers. In: Inui, K., Jiang, J., Ng, V., Wan, X. (eds.) Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, 3–7 November 2019, pp. 5099–5110. Association for Computational Linguistics (2019)

  21. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)

  22. Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2, 67–78 (2014)

  23. Zhang, Y., Jiang, H., Miura, Y., Manning, C.D., Langlotz, C.P.: Contrastive learning of medical visual representations from paired images and text. CoRR abs/2010.00747 (2020)

Author information

Authors and Affiliations

  1. Samsung SDS, Seoul, Korea

    Hyunjin Choi, Hyunjae Lee, Seongho Joe & Youngjune Gwon

Corresponding author

Correspondence to Hyunjin Choi.

Editor information

Editors and Affiliations

  1. University of Amsterdam, Amsterdam, The Netherlands

    Jaap Kamps

  2. Université Grenoble-Alpes, Saint-Martin-d’Hères, France

    Lorraine Goeuriot

  3. Università della Svizzera Italiana, Lugano, Switzerland

    Fabio Crestani

  4. University of Copenhagen, Copenhagen, Denmark

    Maria Maistro

  5. University of Tsukuba, Ibaraki, Japan

    Hideo Joho

  6. Dublin City University, Dublin, Ireland

    Brian Davis

  7. Dublin City University, Dublin, Ireland

    Cathal Gurrin

  8. Universität Regensburg, Regensburg, Germany

    Udo Kruschwitz

  9. Dublin City University, Dublin, Ireland

    Annalina Caputo

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Choi, H., Lee, H., Joe, S., Gwon, Y. (2023). Is Cross-Modal Information Retrieval Possible Without Training? In: Kamps, J., et al. (eds.) Advances in Information Retrieval. ECIR 2023. Lecture Notes in Computer Science, vol 13981. Springer, Cham. https://doi.org/10.1007/978-3-031-28238-6_27
