- Taihong Xiao,
- Sifei Liu,
- Shalini De Mello,
- Zhiding Yu,
- Jan Kautz &
- Ming-Hsuan Yang (ORCID: 0000-0003-4848-2304)
A Correction to this article was published on 13 April 2022.
Abstract
Dense correspondence across semantically related images has been extensively studied, but still faces two challenges: (1) large variations in appearance, scale, and pose exist even for objects from the same category, and (2) labeling pixel-level dense correspondences is labor-intensive and infeasible to scale. Most existing methods focus on designing various matching modules on top of fully-supervised ImageNet-pretrained networks. On the other hand, while a variety of self-supervised approaches have been proposed to explicitly measure image-level similarities, correspondence matching at the pixel level remains under-explored. In this work, we propose a multi-level contrastive learning approach for semantic matching that does not rely on any ImageNet-pretrained model. We show that image-level contrastive learning is a key component for encouraging convolutional features to find correspondences between similar objects, and that performance can be further enhanced by regularizing cross-instance cycle-consistency at intermediate feature levels. Experimental results on the PF-PASCAL, PF-WILLOW, and SPair-71k benchmark datasets demonstrate that our method performs favorably against state-of-the-art approaches.
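The abstract names two ingredients: an image-level contrastive (InfoNCE-style) objective and a cross-instance cycle-consistency regularizer on intermediate features. As a rough illustration of these two ideas only (not the authors' actual losses, which the full text defines), the following NumPy sketch computes an InfoNCE loss over image embeddings and a simple forward-backward matching check between two feature maps; all function names and the temperature value are illustrative assumptions:

```python
import numpy as np

def info_nce(z_a, z_b, tau=0.07):
    """Image-level InfoNCE: row i of z_a should match row i of z_b,
    with all other rows serving as negatives (tau is illustrative)."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau                       # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_prob)))        # targets are the diagonal

def cycle_inconsistency(f_a, f_b):
    """Fraction of positions in feature map f_a (P, C) whose nearest
    neighbour in f_b (P, C) does not map back to them; a regularizer
    could penalize this forward-backward disagreement."""
    f_a = f_a / np.linalg.norm(f_a, axis=1, keepdims=True)
    f_b = f_b / np.linalg.norm(f_b, axis=1, keepdims=True)
    fwd = (f_a @ f_b.T).argmax(axis=1)               # a -> b matches
    bwd = (f_b @ f_a.T).argmax(axis=1)               # b -> a matches
    idx = np.arange(f_a.shape[0])
    return float(np.mean(bwd[fwd] != idx))
```

A feature map matched against itself is perfectly cycle-consistent (inconsistency of zero); the paper instead enforces such consistency across different instances of the same category at intermediate feature levels.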
Change history
13 April 2022
A Correction to this paper has been published: https://doi.org/10.1007/s11263-022-01614-8
References
Bristow, H., Valmadre, J., & Lucey, S. (2015). Dense semantic correspondence where every pixel is a classifier. IEEE International Conference on Computer Vision (ICCV) pp 4024–4031
Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. International Conference on Machine Learning (ICML)
Chen, Y. C., Huang, P. H., Yu, L. Y., Huang, J. B., Yang, M. H., & Lin, Y. Y. (2018). Deep semantic matching with foreground detection and cycle-consistency. In: Asian Conference on Computer Vision (ACCV), Springer, pp 347–362
Choy, C. B., Gwak, J., Savarese, S., & Chandraker, M. (2016). Universal correspondence network. Neural Information Processing Systems (NeurIPS)
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Dale, K., Johnson, M. K., Sunkavalli, K., Matusik, W., & Pfister, H. (2009). Image restoration using online photo collections. IEEE International Conference on Computer Vision (ICCV) pp 2217–2224
Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 248–255
Doersch, C., Gupta, A., & Efros, A. A. (2015). Unsupervised visual representation learning by context prediction. IEEE International Conference on Computer Vision (ICCV) pp 1422–1430
Duchenne, O., Joulin, A., & Ponce, J. (2011). A graph-matching kernel for object categorization. In: IEEE International Conference on Computer Vision (ICCV), pp 1792–1799
Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2014). The Pascal visual object classes challenge: A retrospective. International Journal of Computer Vision (IJCV), 111, 98–136.
Gidaris, S., Singh, P., & Komodakis, N. (2018). Unsupervised representation learning by predicting image rotations. International Conference on Learning Representations (ICLR)
Grill, J. B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., Doersch, C., Pires, B. A., Guo, Z. D., & Azar, M. G., et al. (2020). Bootstrap your own latent: A new approach to self-supervised learning. Neural Information Processing Systems (NeurIPS)
Ham, B., Cho, M., Schmid, C., & Ponce, J. (2016). Proposal flow. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 3475–3484
Ham, B., Cho, M., Schmid, C., & Ponce, J. (2018). Proposal flow: Semantic correspondences from object proposals. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 40, 1711–1725.
Han, K., Rezende, R. S., Ham, B., Wong, K. Y. K., Cho, M., Schmid, C., & Ponce, J. (2017). SCNet: Learning semantic correspondence. In: IEEE International Conference on Computer Vision (ICCV)
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) pp 770–778
He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 9729–9738
Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., & Bengio, Y. (2019). Learning deep representations by mutual information estimation and maximization. International Conference on Learning Representations (ICLR)
Huang, S., Wang, Q., Zhang, S., Yan, S., & He, X. (2019). Dynamic context correspondence network for semantic alignment. IEEE International Conference on Computer Vision (ICCV) pp 2010–2019
Hur, J., Lim, H., Park, C., & Chul Ahn, S. (2015). Generalized deformable spatial pyramid: Geometry-preserving dense correspondence estimation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) pp 1392–1400
Jabri, A., Owens, A., & Efros, A.A. (2020). Space-time correspondence as a contrastive random walk. Neural Information Processing Systems (NeurIPS)
Jeon, S., Kim, S., Min, D., & Sohn, K. (2018). PARN: Pyramidal affine regression networks for dense semantic correspondence. In: European Conference on Computer Vision (ECCV)
Kanazawa, A., Jacobs, D. W., & Chandraker, M. (2016). WarpNet: Weakly supervised matching for single-view reconstruction. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) pp 3253–3261
Kang, G., Wei, Y., Yang, Y., Zhuang, Y., & Hauptmann, A. G. (2020). Pixel-level cycle association: A new perspective for domain adaptive semantic segmentation. In: Neural Information Processing Systems (NeurIPS)
Kim, J., Liu, C., Sha, F., & Grauman, K. (2013). Deformable spatial pyramid matching for fast dense correspondences. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) pp 2307–2314
Kim, S., Min, D., Lin, S., & Sohn, K. (2017). DCTM: Discrete-continuous transformation matching for semantic flow. In: IEEE International Conference on Computer Vision (ICCV), pp 4529–4538
Kim, S., Lin, S., Jeon, S. R., Min, D., & Sohn, K. (2018). Recurrent transformer networks for semantic correspondence. Neural Information Processing Systems (NeurIPS), 31, 6126–6136.
Kim, S., Min, D., Ham, B., Jeon, S., Lin, S., & Sohn, K. (2019). FCSS: Fully convolutional self-similarity for dense semantic correspondence. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 41, 581–595.
Lee, J., Kim, D., Ponce, J., & Ham, B. (2019). SFNet: Learning object-aware semantic correspondence. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 2278–2287
Li, X., Liu, S., Mello, S. D., Wang, X., Kautz, J., & Yang, M. H. (2019). Joint-task self-supervised learning for temporal correspondence. Neural Information Processing Systems (NeurIPS)
Liu, C., Yuen, J., & Torralba, A. (2011). SIFT Flow: Dense correspondence across scenes and its applications. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 33, 978–994.
Liu, P., King, I., Lyu, M. R., & Xu, J. (2019). DDFlow: Learning optical flow with unlabeled data distillation. Association for the Advancement of Artificial Intelligence (AAAI), 33, 8770–8777.
Liu, Y., Zhu, L., Yamada, M., & Yang, Y. (2020). Semantic correspondence as an optimal transport problem. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 4463–4472
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal on Computer Vision (IJCV)
Meister, S., Hur, J., & Roth, S. (2018). UnFlow: Unsupervised learning of optical flow with a bidirectional census loss. In: Association for the Advancement of Artificial Intelligence (AAAI)
Min, J., Lee, J., Ponce, J., & Cho, M. (2019a). Hyperpixel flow: Semantic correspondence with multi-layer neural features. IEEE International Conference on Computer Vision (ICCV) pp 3394–3403
Min, J., Lee, J., Ponce, J., & Cho, M. (2019b). SPair-71k: A large-scale benchmark for semantic correspondence. arXiv:1908.10543
Min, J., Lee, J., Ponce, J., & Cho, M. (2020). Learning to compose hypercolumns for visual correspondence. In: European Conference on Computer Vision (ECCV)
Misra, I., Zitnick, C. L., & Hebert, M. (2016). Shuffle and learn: Unsupervised learning using temporal order verification. In: European Conference on Computer Vision (ECCV)
Munkres, J. (1957). Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics, 10, 196–210.
Noroozi, M., & Favaro, P. (2016). Unsupervised learning of visual representations by solving jigsaw puzzles. In: European Conference on Computer Vision (ECCV)
Novotny, D., Larlus, D., & Vedaldi, A. (2017). Anchornet: A weakly supervised network to learn geometry-sensitive features for semantic matching. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 5277–5286
van den Oord, A., Kalchbrenner, N., Vinyals, O., Espeholt, L., Graves, A., & Kavukcuoglu, K. (2016). Conditional image generation with PixelCNN decoders. Neural Information Processing Systems (NeurIPS)
van den Oord, A., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv:1807.03748
Pathak, D., Girshick, R. B., Dollár, P., Darrell, T., & Hariharan, B. (2017). Learning features by watching objects move. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) pp 6024–6033
Pinheiro, P. O., Almahairi, A., Benmaleck, R. Y., Golemo, F., & Courville, A. (2020). Unsupervised learning of dense visual representations. Neural Information Processing Systems (NeurIPS)
Rocco, I., Arandjelovic, R., & Sivic, J. (2017). Convolutional neural network architecture for geometric matching. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 6148–6157
Rocco, I., Arandjelović, R., & Sivic, J. (2018a). End-to-end weakly-supervised semantic alignment. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 6917–6925
Rocco, I., Cimpoi, M., Arandjelović, R., Torii, A., Pajdla, T., & Sivic, J. (2018b). Neighbourhood consensus networks. Neural Information Processing Systems (NeurIPS)
Seo, P. H., Lee, J., Jung, D., Han, B., & Cho, M. (2018). Attentive semantic alignment with offset-aware correlation kernels. In: European Conference on Computer Vision (ECCV), pp 349–364
Sinkhorn, R. (1967). Diagonal equivalence to matrices with prescribed row and column sums. American Mathematical Monthly, 74, 402.
Taniai, T., Sinha, S. N., & Sato, Y. (2016). Joint recovery of dense correspondence and cosegmentation in two images. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) pp 4246–4255
Tola, E., Lepetit, V., & Fua, P. (2010). DAISY: An efficient dense descriptor applied to wide-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 32, 815–830.
van den Oord, A., Kalchbrenner, N., & Kavukcuoglu, K. (2016). Pixel recurrent neural networks. International Conference on Machine Learning (ICML)
Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P. A. (2008). Extracting and composing robust features with denoising autoencoders. In: International Conference on Machine Learning (ICML)
Vondrick, C., Shrivastava, A., Fathi, A., Guadarrama, S., & Murphy, K. (2018). Tracking emerges by colorizing videos. In: European Conference on Computer Vision (ECCV)
Wang, X., & Gupta, A. (2015). Unsupervised learning of visual representations using videos. In: IEEE International Conference on Computer Vision (ICCV), pp 2794–2802
Wang, X., Jabri, A., & Efros, A. A. (2019). Learning correspondence from the cycle-consistency of time. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) pp 2561–2571
Wang, X., Zhang, R., Shen, C., Kong, T., & Li, L. (2020). Dense contrastive learning for self-supervised visual pre-training. arXiv preprint
Xiao, T., Hong, J., & Ma, J. (2018a). DNA-GAN: Learning disentangled representations from multi-attribute images. International Conference on Learning Representations Workshop (ICLRW)
Xiao, T., Hong, J., & Ma, J. (2018b). ELEGANT: Exchanging latent encodings with GAN for transferring multiple face attributes. In: European Conference on Computer Vision (ECCV), pp 172–187
Xie, Z., Lin, Y., Zhang, Z., Cao, Y., Lin, S., & Hu, H. (2021). Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 16684–16693
Yang, H., Lin, W. Y., & Lu, J. (2014). DAISY filter flow: A generalized discrete approach to dense correspondences. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) pp 3406–3413
Zhang, R., Isola, P., & Efros, A. A. (2016). Colorful image colorization. In: European Conference on Computer Vision (ECCV)
Zhou, S., Xiao, T., Yang, Y., Feng, D., He, Q., & He, W. (2017). GeneGAN: Learning object transfiguration and attribute subspace from unpaired data. In: British Machine Vision Conference (BMVC)
Zhou, T., Lee, Y. J., Yu, S. X., & Efros, A. A. (2015a). FlowWeb: Joint image set alignment by weaving consistent, pixel-wise correspondences. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) pp 1191–1200
Zhou, T., Krähenbühl, P., Aubry, M., Huang, Q., & Efros, A. A. (2016). Learning dense correspondence via 3d-guided cycle consistency. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) pp 117–126
Zhou, X., Zhu, M., & Daniilidis, K. (2015b). Multi-image matching via fast alternating minimization. In: IEEE International Conference on Computer Vision (ICCV), pp 4032–4040
Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In: IEEE International Conference on Computer Vision (ICCV), pp 2223–2232
Acknowledgements
T. Xiao and M.-H. Yang are supported in part by NSF CAREER grant 1149783.
Author information
Authors and Affiliations
University of California, Merced, CA, USA
Taihong Xiao & Ming-Hsuan Yang
Nvidia, Santa Clara, CA, USA
Sifei Liu, Shalini De Mello, Zhiding Yu & Jan Kautz
Yonsei University, Seoul, Korea
Ming-Hsuan Yang
Corresponding author
Correspondence to Ming-Hsuan Yang.
Additional information
Communicated by Bumsub Ham.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Xiao, T., Liu, S., De Mello, S., et al. Learning Contrastive Representation for Semantic Correspondence. International Journal of Computer Vision, 130, 1293–1309 (2022). https://doi.org/10.1007/s11263-022-01602-y