Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15144)
Abstract
This paper considers image-based virtual try-on, which renders an image of a person wearing a curated garment, given a pair of images depicting the person and the garment, respectively. Previous works adapt existing exemplar-based inpainting diffusion models for virtual try-on to improve the naturalness of the generated visuals compared to other methods (e.g., GAN-based), but they fail to preserve the identity of the garments. To overcome this limitation, we propose a novel diffusion model that improves garment fidelity and generates authentic virtual try-on images. Our method, coined IDM–VTON, uses two different modules to encode the semantics of the garment image; given the base UNet of the diffusion model, 1) the high-level semantics extracted from a visual encoder are fused into the cross-attention layer, and then 2) the low-level features extracted from a parallel UNet are fused into the self-attention layer. In addition, we provide detailed textual prompts for both garment and person images to enhance the authenticity of the generated visuals. Finally, we present a customization method using a pair of person-garment images, which significantly improves fidelity and authenticity. Our experimental results show that our method outperforms previous approaches (both diffusion-based and GAN-based) in preserving garment details and generating authentic virtual try-on images, both qualitatively and quantitatively. Furthermore, the proposed customization method demonstrates its effectiveness in a real-world scenario. More visualizations are available on our project page.
Notes
- 1. We do not compare with GP-VTON as it uses private parsing modules. Comparison with HR-VITON is in the appendix.
- 2.
- 3.
- 4.
References
Avrahami, O., Hayes, T., Gafni, O., Gupta, S., Taigman, Y., Parikh, D., Lischinski, D., Fried, O., Yin, X.: Spatext: Spatio-textual representation for controllable image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18370–18380 (2023)
Chari, P., Ma, S., Ostashev, D., Kadambi, A., Krishnan, G., Wang, J., Aberman, K.: Personalized restoration via dual-pivot tuning. arXiv preprint arXiv:2312.17234 (2023)
Choi, S., Park, S., Lee, M., Choo, J.: Viton-hd: High-resolution virtual try-on via misalignment-aware normalization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14131–14140 (2021)
Cui, A., Mahajan, J., Shah, V., Gomathinayagam, P., Lazebnik, S.: Street tryon: Learning in-the-wild virtual try-on from unpaired person images. arXiv preprint arXiv:2311.16094 (2023)
Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., Cohen-Or, D.: An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 (2022)
Ge, C., Song, Y., Ge, Y., Yang, H., Liu, W., Luo, P.: Disentangled cycle consistency for highly-realistic virtual try-on. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16928–16937 (2021)
Ge, Y., Song, Y., Zhang, R., Ge, C., Liu, W., Luo, P.: Parser-free virtual try-on via distilling appearance flows. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8485–8493 (2021)
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Commun. ACM 63(11), 139–144 (2020)
Gou, J., Sun, S., Zhang, J., Si, J., Qian, C., Zhang, L.: Taming the power of diffusion models for high-quality virtual try-on with appearance flow. In: Proceedings of the 31st ACM International Conference on Multimedia. pp. 7599–7607 (2023)
Güler, R.A., Neverova, N., Kokkinos, I.: Densepose: Dense human pose estimation in the wild. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7297–7306 (2018)
Han, L., Li, Y., Zhang, H., Milanfar, P., Metaxas, D., Yang, F.: Svdiff: Compact parameter space for diffusion fine-tuning. arXiv preprint arXiv:2303.11305 (2023)
Han, X., Wu, Z., Wu, Z., Yu, R., Davis, L.S.: Viton: An image-based virtual try-on network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7543–7552 (2018)
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30 (2017)
Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Adv. Neural. Inf. Process. Syst. 33, 6840–6851 (2020)
Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)
Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., Gelly, S.: Parameter-efficient transfer learning for nlp. In: International Conference on Machine Learning. pp. 2790–2799. PMLR (2019)
Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)
Hu, L., Gao, X., Zhang, P., Sun, K., Zhang, B., Bo, L.: Animate anyone: Consistent and controllable image-to-video synthesis for character animation. arXiv preprint arXiv:2311.17117 (2023)
Hyvärinen, A., Dayan, P.: Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6(4) (2005)
Ilharco, G., Wortsman, M., Wightman, R., Gordon, C., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A., Schmidt, L.: OpenCLIP (Jul 2021)
Issenhuth, T., Mary, J., Calauzenes, C.: Do not mask what you do not need to mask: a parser-free virtual try-on. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16. pp. 619–635. Springer (2020)
Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4401–4410 (2019)
Kim, J., Gu, G., Park, M., Park, S., Choo, J.: Stableviton: Learning semantic correspondence with latent diffusion model for virtual try-on. arXiv preprint arXiv:2312.01725 (2023)
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y.: Multi-concept customization of text-to-image diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1931–1941 (2023)
Lee, K., Kwak, S., Sohn, K., Shin, J.: Direct consistency optimization for compositional text-to-image personalization. arXiv preprint arXiv:2402.12004 (2024)
Lee, S., Gu, G., Park, S., Choi, S., Choo, J.: High-resolution virtual try-on with misalignment and occlusion-handled conditions. In: European Conference on Computer Vision. pp. 204–219. Springer (2022)
Li, N., Liu, Q., Singh, K.K., Wang, Y., Zhang, J., Plummer, B.A., Lin, Z.: Unihuman: A unified model for editing human images in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2039–2048 (2024)
Men, Y., Mao, Y., Jiang, Y., Ma, W.Y., Lian, Z.: Controllable person image synthesis with attribute-decomposed gan. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5084–5093 (2020)
Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., Ermon, S.: Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073 (2021)
Morelli, D., Baldrati, A., Cartella, G., Cornia, M., Bertini, M., Cucchiara, R.: Ladi-vton: Latent diffusion textual-inversion enhanced virtual try-on. arXiv preprint arXiv:2305.13501 (2023)
Morelli, D., Fincato, M., Cornia, M., Landi, F., Cesari, F., Cucchiara, R.: Dress code: High-resolution multi-category virtual try-on. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2231–2235 (2022)
Mou, C., Wang, X., Xie, L., Wu, Y., Zhang, J., Qi, Z., Shan, Y., Qie, X.: T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453 (2023)
Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., Chen, M.: Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021)
Ning, S., Wang, D., Qin, Y., Jin, Z., Wang, B., Han, X.: Picture: Photorealistic virtual try-on from unconstrained designs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6976–6985 (2024)
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21(1), 5485–5551 (2020)
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 1(2), 3 (2022)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. pp. 234–241. Springer (2015)
Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22500–22510 (2023)
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural. Inf. Process. Syst. 35, 36479–36494 (2022)
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: International conference on machine learning. pp. 2256–2265. PMLR (2015)
Sohn, K., Ruiz, N., Lee, K., Chin, D.C., Blok, I., Chang, H., Barber, J., Jiang, L., Entis, G., Li, Y., et al.: Styledrop: Text-to-image generation in any style. arXiv preprint arXiv:2306.00983 (2023)
Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)
Tang, L., Ruiz, N., Chu, Q., Li, Y., Holynski, A., Jacobs, D.E., Hariharan, B., Pritch, Y., Wadhwa, N., Aberman, K., et al.: Realfill: Reference-driven generation for authentic image completion. arXiv preprint arXiv:2309.16668 (2023)
team, D.: Stable diffusion xl inpainting (2023)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
Wang, B., Zheng, H., Liang, X., Chen, Y., Lin, L., Yang, M.: Toward characteristic-preserving image-based virtual try-on network. In: Proceedings of the European conference on computer vision (ECCV). pp. 589–604 (2018)
Wang, X., Xie, L., Dong, C., Shan, Y.: Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1905–1914 (2021)
Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
Xie, Z., Huang, Z., Dong, X., Zhao, F., Dong, H., Zhang, X., Zhu, F., Liang, X.: Gp-vton: Towards general purpose virtual try-on via collaborative local-flow global-parsing learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 23550–23559 (2023)
Xu, Y., Gu, T., Chen, W., Chen, C.: Ootdiffusion: Outfitting fusion based latent diffusion for controllable virtual try-on. arXiv preprint arXiv:2403.01779 (2024)
Yang, B., Gu, S., Zhang, B., Zhang, T., Chen, X., Sun, X., Chen, D., Wen, F.: Paint by example: Exemplar-based image editing with diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18381–18391 (2023)
Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023)
Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3836–3847 (2023)
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 586–595 (2018)
Zhao, S., Chen, D., Chen, Y.C., Bao, J., Hao, S., Yuan, L., Wong, K.Y.K.: Uni-controlnet: All-in-one control to text-to-image diffusion models. Advances in Neural Information Processing Systems 36 (2024)
Zhu, L., Yang, D., Zhu, T., Reda, F., Chan, W., Saharia, C., Norouzi, M., Kemelmacher-Shlizerman, I.: Tryondiffusion: A tale of two unets. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4606–4615 (2023)
Acknowledgement
This work was supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075 Artificial Intelligence Graduate School Program (KAIST); No. RS-2021-II212068, Artificial Intelligence Innovation Hub).
Author information
Authors and Affiliations
Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea
Yisol Choi, Sangkyung Kwak, Kyungmin Lee & Jinwoo Shin
OMNIOUS.AI, Seoul, South Korea
Hyungwon Choi
Corresponding author
Correspondence to Jinwoo Shin.
Editor information
Editors and Affiliations
University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Hessen, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol
Appendix
A Implementation Details
A.1 In-the-Wild Dataset
The In-the-Wild dataset comprises multiple human images wearing each target garment. Garment images are collected from the MLB online shopping mall (Footnote 2), and images of humans wearing each garment are gathered from social media platforms such as Instagram. As shown in Fig. 8, the human images exhibit diverse backgrounds, ranging from parks and buildings to snowy landscapes. For preprocessing, we apply center cropping to the human images to obtain a resolution of \(1024\times 768\), while the garment images are resized to the same dimensions for a compatible setting.
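For reference, a minimal preprocessing sketch under these settings is given below; the crop-then-resize order, the use of Pillow, and the LANCZOS filter are our assumptions rather than details taken from the paper.

```python
# Hedged sketch of the In-the-Wild preprocessing: center-crop person images to the
# 3:4 aspect ratio and resize to 1024x768; resize garment images to the same size.
from PIL import Image

TARGET_H, TARGET_W = 1024, 768  # height x width used throughout the paper

def preprocess_person(img: Image.Image) -> Image.Image:
    """Center-crop to a 3:4 aspect ratio, then resize to 1024x768 (assumed order)."""
    w, h = img.size
    target_ratio = TARGET_W / TARGET_H
    if w / h > target_ratio:            # too wide -> crop width
        new_w = int(h * target_ratio)
        left = (w - new_w) // 2
        img = img.crop((left, 0, left + new_w, h))
    else:                               # too tall -> crop height
        new_h = int(w / target_ratio)
        top = (h - new_h) // 2
        img = img.crop((0, top, w, top + new_h))
    return img.resize((TARGET_W, TARGET_H), Image.LANCZOS)

def preprocess_garment(img: Image.Image) -> Image.Image:
    """Resize the garment image to the same 1024x768 resolution."""
    return img.resize((TARGET_W, TARGET_H), Image.LANCZOS)
```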
A.2 Training and Inference
We train the model using the Adam [24] optimizer with a fixed learning rate of 1e−5 over 130 epochs (63k iterations with a batch size of 24). Training takes around 40 h on 4\(\times \)A800 GPUs. We apply data augmentations following StableVITON [23]: horizontal flip (with probability 0.5) and random affine shifting and scaling (limit of 0.2, with probability 0.5) are applied to the inputs of TryonNet, i.e., \(\textbf{x}_p, \textbf{x}_{\text {pose}}, \textbf{x}_m\) and \(\textbf{m}\). For customization, we fine-tune our model using the Adam optimizer with a fixed learning rate of 1e−6 for 100 steps, which takes around 2 min on a single A800 GPU.
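The sketch below illustrates the joint augmentation described above, assuming the flip and affine parameters are sampled once and applied to all aligned TryonNet inputs; the torchvision functional API and all function and parameter names here are our own choices, not the authors' code.

```python
# Hedged sketch: horizontal flip (p=0.5) and random affine shift/scale (limit 0.2,
# p=0.5) applied jointly to the aligned inputs x_p, x_pose, x_m and m so that they
# stay spatially consistent.
import random
import torchvision.transforms.functional as TF

def augment_tryonnet_inputs(tensors, p_flip=0.5, p_affine=0.5, limit=0.2):
    """tensors: list of aligned CxHxW tensors (person, pose, masked person, mask)."""
    if random.random() < p_flip:
        tensors = [TF.hflip(t) for t in tensors]
    if random.random() < p_affine:
        _, h, w = tensors[0].shape
        # Sample one shift/scale and reuse it for every input to keep them aligned.
        dx = int(random.uniform(-limit, limit) * w)
        dy = int(random.uniform(-limit, limit) * h)
        scale = 1.0 + random.uniform(-limit, limit)
        tensors = [TF.affine(t, angle=0.0, translate=[dx, dy], scale=scale, shear=0.0)
                   for t in tensors]
    return tensors
```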
During inference, we generate images using the DDPM scheduler with 30 steps. We set the strength value to 1.0, i.e., denoising begins from random Gaussian noise, to ignore the masked portion of the input image. For classifier-free guidance [15] (CFG), we merge both conditions, i.e., the low-level features \(\textbf{L}_g\) from GarmentNet and the high-level semantics \(\textbf{H}_g\) from IP-Adapter, as these conditions contain features of the same garment image, following SpaText [1]. Specifically, the forward pass is given as follows:
$$\hat{\boldsymbol{\epsilon }}_\theta (\textbf{x}_t;\textbf{L}_g,\textbf{H}_g,t) = \boldsymbol{\epsilon }_\theta (\textbf{x}_t;t) + s\cdot \big (\boldsymbol{\epsilon }_\theta (\textbf{x}_t;\textbf{L}_g,\textbf{H}_g,t) - \boldsymbol{\epsilon }_\theta (\textbf{x}_t;t)\big ),$$
where \(\boldsymbol{\epsilon }_\theta (\textbf{x}_t;\textbf{L}_g,\textbf{H}_g,t)\) denotes the noise prediction with conditions \(\textbf{L}_g\) and \(\textbf{H}_g\), and \(\boldsymbol{\epsilon }_\theta (\textbf{x}_t;t)\) denotes the unconditional noise prediction. We use a guidance scale \(s = 2.0\), which works well in practice.
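To make the merged guidance concrete, here is a small sketch of the combination above; `unet` and its keyword arguments are placeholders for illustration and do not reflect the actual IDM–VTON interface.

```python
# Hedged sketch of merged classifier-free guidance: the garment conditions L_g
# (GarmentNet features) and H_g (IP-Adapter semantics) are dropped together, so a
# single guidance term with scale s is used.
def cfg_noise_prediction(unet, x_t, t, L_g, H_g, s=2.0):
    eps_cond = unet(x_t, t, garment_features=L_g, garment_semantics=H_g)      # eps(x_t; L_g, H_g, t)
    eps_uncond = unet(x_t, t, garment_features=None, garment_semantics=None)  # eps(x_t; t)
    return eps_uncond + s * (eps_cond - eps_uncond)
```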
Fig. 9. Illustration of generating detailed captions of garments. We utilize a pretrained fashion attribute tagging annotator to extract garment information and generate detailed captions of the garment based on the extracted information.
A.3 Detailed Captioning of Garments
We generate detailed captions for each garment to leverage the prior knowledge of T2I diffusion models. We employ OMNIOUS.AI's commercial fashion attribute tagging annotator (Footnote 3), which has been trained with over 1,000 different fashion attributes. The annotator provides various feature categories present in a given image, such as sleeve length and neckline type. We extract three feature categories: sleeve length, neckline type, and item name, as illustrated in Fig. 9. Subsequently, we generate captions based on this feature information, for example, "short sleeve off shoulder t-shirts".
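As a sketch, composing the caption from the three extracted categories amounts to simple string templating; the dictionary keys below are hypothetical, since the commercial annotator and its output schema are not public.

```python
# Illustrative sketch of caption composition from the three extracted categories
# (sleeve length, neckline type, item name). Keys are assumed, not the real schema.
def compose_garment_caption(attrs: dict) -> str:
    parts = [attrs.get("sleeve_length"), attrs.get("neckline"), attrs.get("item")]
    return " ".join(p for p in parts if p)

# Example: {"sleeve_length": "short sleeve", "neckline": "off shoulder",
#           "item": "t-shirts"} -> "short sleeve off shoulder t-shirts"
```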
Fig. 10. Comparison between IDM–VTON and OOTDiffusion [55] on the VITON-HD and DressCode datasets. Both methods are trained on the VITON-HD training dataset. Best viewed zoomed in on a color monitor.
Fig. 11. Comparison between IDM–VTON and OOTDiffusion [55] on the In-the-Wild dataset. Both methods are trained on the VITON-HD training dataset. Best viewed zoomed in on a color monitor.
B Comparison with Concurrent Work
We additionally compare IDM–VTON with OOTDiffusion [55], a concurrent work on the virtual try-on task. We compare each model trained on the VITON-HD and DressCode datasets, and use the publicly available model checkpoints of OOTDiffusion (Footnote 4) to generate its try-on images.
Figures 10 and 11 present qualitative comparisons between OOTDiffusion and our method on the VITON-HD, DressCode, and In-the-Wild datasets. As shown in Fig. 10, IDM–VTON outperforms OOTDiffusion in capturing both high-level semantics and low-level details, and generates more authentic images. In particular, we observe notable improvements of IDM–VTON over OOTDiffusion on the In-the-Wild dataset, which demonstrates the generalization ability of IDM–VTON.
Tables 5, 6 and 7 show quantitative comparisons between IDM–VTON and OOTDiffusion on the VITON-HD, DressCode, and In-the-Wild datasets. One can see that IDM–VTON outperforms OOTDiffusion on all metrics, including image fidelity (FID) and reconstruction of garments (LPIPS, SSIM and CLIP-I), which verifies our claim (Figs. 12, 13, 14, 15, 16, 17 and 18).
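Among these metrics, CLIP-I (the similarity between CLIP image embeddings of the generated try-on and the reference) is the least standardized; the sketch below shows one way to compute it, assuming the Hugging Face transformers CLIP implementation and the openai/clip-vit-base-patch32 checkpoint, which may differ from the evaluation code actually used.

```python
# Hedged sketch of the CLIP-I metric: mean cosine similarity between CLIP image
# embeddings of generated and reference images. Checkpoint and library are assumed.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_i_score(generated_images, reference_images):
    """Both arguments are lists of PIL images of equal length."""
    gen = processor(images=generated_images, return_tensors="pt")
    ref = processor(images=reference_images, return_tensors="pt")
    gen_emb = model.get_image_features(**gen)
    ref_emb = model.get_image_features(**ref)
    return torch.nn.functional.cosine_similarity(gen_emb, ref_emb, dim=-1).mean().item()
```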
Fig. 12. Qualitative comparison on the VITON-HD and DressCode datasets. As observed in our quantitative analysis, GAN-based methods generally struggle to generate high-fidelity images, introducing undesirable distortions (e.g., unrealistic bodies and arms), while diffusion-based methods fail to capture low-level features or high-level semantics. All methods are trained on the VITON-HD training data.
Fig. 13. Qualitative comparison on the In-the-Wild dataset. While baselines fail to generate natural images or capture the details of clothing, IDM–VTON produces realistic images and preserves fine details effectively. All methods are trained on the VITON-HD training data.
Fig. 14. Try-on results on the VITON-HD test data by IDM–VTON trained on the VITON-HD training data. Best viewed zoomed in on a color monitor.
Fig. 15. Try-on results on the DressCode test data by IDM–VTON trained on the VITON-HD training data. Best viewed zoomed in on a color monitor.
Figs. 16–18. Try-on results on the In-the-Wild dataset by IDM–VTON trained on the VITON-HD training data. Best viewed zoomed in on a color monitor.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Choi, Y., Kwak, S., Lee, K., Choi, H., Shin, J. (2025). Improving Diffusion Models for Authentic Virtual Try-on in the Wild. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15144. Springer, Cham. https://doi.org/10.1007/978-3-031-73016-0_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73015-3
Online ISBN: 978-3-031-73016-0