
Improving Diffusion Models for Authentic Virtual Try-on in the Wild

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15144)

Included in the conference series: European Conference on Computer Vision (ECCV)

Abstract

This paper considers image-based virtual try-on, which renders an image of a person wearing a curated garment, given a pair of images depicting the person and the garment, respectively. Previous works adapt existing exemplar-based inpainting diffusion models for virtual try-on to improve the naturalness of the generated visuals compared to other methods (e.g., GAN-based), but they fail to preserve the identity of the garments. To overcome this limitation, we propose a novel diffusion model that improves garment fidelity and generates authentic virtual try-on images. Our method, coined IDM–VTON, uses two different modules to encode the semantics of the garment image; given the base UNet of the diffusion model, 1) the high-level semantics extracted from a visual encoder are fused to the cross-attention layer, and then 2) the low-level features extracted from a parallel UNet are fused to the self-attention layer. In addition, we provide detailed textual prompts for both garment and person images to enhance the authenticity of the generated visuals. Finally, we present a customization method using a pair of person-garment images, which significantly improves fidelity and authenticity. Our experimental results show that our method outperforms previous approaches (both diffusion-based and GAN-based) in preserving garment details and generating authentic virtual try-on images, both qualitatively and quantitatively. Furthermore, the proposed customization method demonstrates its effectiveness in a real-world scenario. More visualizations are available on our project page.


Notes

  1. We do not compare with GP-VTON as it uses private parsing modules. Comparison with HR-VITON is in the appendix.
  2.
  3.
  4.

References

  1. Avrahami, O., Hayes, T., Gafni, O., Gupta, S., Taigman, Y., Parikh, D., Lischinski, D., Fried, O., Yin, X.: Spatext: Spatio-textual representation for controllable image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18370–18380 (2023)

  2. Chari, P., Ma, S., Ostashev, D., Kadambi, A., Krishnan, G., Wang, J., Aberman, K.: Personalized restoration via dual-pivot tuning. arXiv preprint arXiv:2312.17234 (2023)

  3. Choi, S., Park, S., Lee, M., Choo, J.: Viton-hd: High-resolution virtual try-on via misalignment-aware normalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14131–14140 (2021)

  4. Cui, A., Mahajan, J., Shah, V., Gomathinayagam, P., Lazebnik, S.: Street tryon: Learning in-the-wild virtual try-on from unpaired person images. arXiv preprint arXiv:2311.16094 (2023)

  5. Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., Cohen-Or, D.: An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 (2022)

  6. Ge, C., Song, Y., Ge, Y., Yang, H., Liu, W., Luo, P.: Disentangled cycle consistency for highly-realistic virtual try-on. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16928–16937 (2021)

  7. Ge, Y., Song, Y., Zhang, R., Ge, C., Liu, W., Luo, P.: Parser-free virtual try-on via distilling appearance flows. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8485–8493 (2021)

  8. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Commun. ACM 63(11), 139–144 (2020)

  9. Gou, J., Sun, S., Zhang, J., Si, J., Qian, C., Zhang, L.: Taming the power of diffusion models for high-quality virtual try-on with appearance flow. In: Proceedings of the 31st ACM International Conference on Multimedia. pp. 7599–7607 (2023)

  10. Güler, R.A., Neverova, N., Kokkinos, I.: Densepose: Dense human pose estimation in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7297–7306 (2018)

  11. Han, L., Li, Y., Zhang, H., Milanfar, P., Metaxas, D., Yang, F.: Svdiff: Compact parameter space for diffusion fine-tuning. arXiv preprint arXiv:2303.11305 (2023)

  12. Han, X., Wu, Z., Wu, Z., Yu, R., Davis, L.S.: Viton: An image-based virtual try-on network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7543–7552 (2018)

  13. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems 30 (2017)

  14. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Adv. Neural. Inf. Process. Syst. 33, 6840–6851 (2020)

  15. Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)

  16. Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., Gelly, S.: Parameter-efficient transfer learning for nlp. In: International Conference on Machine Learning. pp. 2790–2799. PMLR (2019)

  17. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)

  18. Hu, L., Gao, X., Zhang, P., Sun, K., Zhang, B., Bo, L.: Animate anyone: Consistent and controllable image-to-video synthesis for character animation. arXiv preprint arXiv:2311.17117 (2023)

  19. Hyvärinen, A., Dayan, P.: Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6(4) (2005)

  20. Ilharco, G., Wortsman, M., Wightman, R., Gordon, C., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A., Schmidt, L.: Openclip (July 2021)

  21. Issenhuth, T., Mary, J., Calauzenes, C.: Do not mask what you do not need to mask: a parser-free virtual try-on. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16. pp. 619–635. Springer (2020)

  22. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4401–4410 (2019)

  23. Kim, J., Gu, G., Park, M., Park, S., Choo, J.: Stableviton: Learning semantic correspondence with latent diffusion model for virtual try-on. arXiv preprint arXiv:2312.01725 (2023)

  24. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  25. Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y.: Multi-concept customization of text-to-image diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1931–1941 (2023)

  26. Lee, K., Kwak, S., Sohn, K., Shin, J.: Direct consistency optimization for compositional text-to-image personalization. arXiv preprint arXiv:2402.12004 (2024)

  27. Lee, S., Gu, G., Park, S., Choi, S., Choo, J.: High-resolution virtual try-on with misalignment and occlusion-handled conditions. In: European Conference on Computer Vision. pp. 204–219. Springer (2022)

  28. Li, N., Liu, Q., Singh, K.K., Wang, Y., Zhang, J., Plummer, B.A., Lin, Z.: Unihuman: A unified model for editing human images in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2039–2048 (2024)

  29. Men, Y., Mao, Y., Jiang, Y., Ma, W.Y., Lian, Z.: Controllable person image synthesis with attribute-decomposed gan. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5084–5093 (2020)

  30. Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., Ermon, S.: Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073 (2021)

  31. Morelli, D., Baldrati, A., Cartella, G., Cornia, M., Bertini, M., Cucchiara, R.: Ladi-vton: Latent diffusion textual-inversion enhanced virtual try-on. arXiv preprint arXiv:2305.13501 (2023)

  32. Morelli, D., Fincato, M., Cornia, M., Landi, F., Cesari, F., Cucchiara, R.: Dress code: High-resolution multi-category virtual try-on. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2231–2235 (2022)

  33. Mou, C., Wang, X., Xie, L., Wu, Y., Zhang, J., Qi, Z., Shan, Y., Qie, X.: T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453 (2023)

  34. Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., Chen, M.: Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021)

  35. Ning, S., Wang, D., Qin, Y., Jin, Z., Wang, B., Han, X.: Picture: Photorealistic virtual try-on from unconstrained designs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6976–6985 (2024)

  36. Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)

  37. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021)

  38. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21(1), 5485–5551 (2020)

  39. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 (2022)

  40. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10684–10695 (2022)

  41. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III 18. pp. 234–241. Springer (2015)

  42. Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22500–22510 (2023)

  43. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural. Inf. Process. Syst. 35, 36479–36494 (2022)

  44. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: International Conference on Machine Learning. pp. 2256–2265. PMLR (2015)

  45. Sohn, K., Ruiz, N., Lee, K., Chin, D.C., Blok, I., Chang, H., Barber, J., Jiang, L., Entis, G., Li, Y., et al.: Styledrop: Text-to-image generation in any style. arXiv preprint arXiv:2306.00983 (2023)

  46. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)

  47. Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)

  48. Tang, L., Ruiz, N., Chu, Q., Li, Y., Holynski, A., Jacobs, D.E., Hariharan, B., Pritch, Y., Wadhwa, N., Aberman, K., et al.: Realfill: Reference-driven generation for authentic image completion. arXiv preprint arXiv:2309.16668 (2023)

  49. team, D.: Stable diffusion xl inpainting (2023)

  50. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)

  51. Wang, B., Zheng, H., Liang, X., Chen, Y., Lin, L., Yang, M.: Toward characteristic-preserving image-based virtual try-on network. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 589–604 (2018)

  52. Wang, X., Xie, L., Dong, C., Shan, Y.: Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1905–1914 (2021)

  53. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)

  54. Xie, Z., Huang, Z., Dong, X., Zhao, F., Dong, H., Zhang, X., Zhu, F., Liang, X.: Gp-vton: Towards general purpose virtual try-on via collaborative local-flow global-parsing learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 23550–23559 (2023)

  55. Xu, Y., Gu, T., Chen, W., Chen, C.: Ootdiffusion: Outfitting fusion based latent diffusion for controllable virtual try-on. arXiv preprint arXiv:2403.01779 (2024)

  56. Yang, B., Gu, S., Zhang, B., Zhang, T., Chen, X., Sun, X., Chen, D., Wen, F.: Paint by example: Exemplar-based image editing with diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18381–18391 (2023)

  57. Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023)

  58. Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3836–3847 (2023)

  59. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 586–595 (2018)

  60. Zhao, S., Chen, D., Chen, Y.C., Bao, J., Hao, S., Yuan, L., Wong, K.Y.K.: Uni-controlnet: All-in-one control to text-to-image diffusion models. Advances in Neural Information Processing Systems 36 (2024)

  61. Zhu, L., Yang, D., Zhu, T., Reda, F., Chan, W., Saharia, C., Norouzi, M., Kemelmacher-Shlizerman, I.: Tryondiffusion: A tale of two unets. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4606–4615 (2023)


Acknowledgement

This work was supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075 Artificial Intelligence Graduate School Program (KAIST); No. RS-2021-II212068, Artificial Intelligence Innovation Hub).

Author information

Authors and Affiliations

  1. Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea

    Yisol Choi, Sangkyung Kwak, Kyungmin Lee & Jinwoo Shin

  2. OMNIOUS.AI, Seoul, South Korea

    Hyungwon Choi


Corresponding author

Correspondence to Jinwoo Shin.

Editor information

Editors and Affiliations

  1. University of Birmingham, Birmingham, UK

    Aleš Leonardis

  2. University of Trento, Trento, Italy

    Elisa Ricci

  3. Technical University of Darmstadt, Darmstadt, Hessen, Germany

    Stefan Roth

  4. Princeton University, Princeton, NJ, USA

    Olga Russakovsky

  5. Czech Technical University in Prague, Prague, Czech Republic

    Torsten Sattler

  6. École des Ponts ParisTech, Marne-la-Vallée, France

    Gül Varol

Appendix

A Implementation Details

A.1 In-the-Wild Dataset

Fig. 8. Examples of the In-the-Wild dataset. We collect pairs of garment images and images of humans wearing the garment.

The In-the-Wild dataset comprises multiple human images wearing each target garment. Images of garments are collected from the MLB online shopping mall (Footnote 2), and images of humans wearing each garment are gathered from social media platforms such as Instagram. As shown in Fig. 8, the human images exhibit diverse backgrounds, ranging from parks and buildings to snowy landscapes. For preprocessing, we employ center cropping on the human images to achieve a resolution of \(1024\times 768\), while the garment images are resized to the same dimensions for a compatible setting.
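For illustration, a minimal sketch of this preprocessing step is given below, assuming PIL inputs and torchvision transforms; the library choice and file names are our assumptions, as the paper only specifies the crop/resize operations and the target resolution.

```python
# Hypothetical preprocessing sketch (torchvision is an assumption; only the
# 1024x768 resolution and the crop/resize operations come from the paper).
from PIL import Image
from torchvision import transforms

TARGET_SIZE = (1024, 768)  # (height, width)

# Human images: center crop to the target resolution.
human_tf = transforms.CenterCrop(TARGET_SIZE)
# Garment images: resize directly to the same dimensions.
garment_tf = transforms.Resize(TARGET_SIZE)

person = human_tf(Image.open("person.jpg").convert("RGB"))      # illustrative path
garment = garment_tf(Image.open("garment.jpg").convert("RGB"))  # illustrative path
```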

A.2 Training and Inference

We train the model using the Adam [24] optimizer with a fixed learning rate of 1e−5 over 130 epochs (63k iterations with a batch size of 24). Training takes around 40 hours on 4 A800 GPUs. We apply data augmentations following StableVITON [23]: horizontal flip (with probability 0.5) and random affine shifting and scaling (limit of 0.2, with probability 0.5) applied to the inputs of TryonNet, i.e., \(\textbf{x}_p, \textbf{x}_{\text{pose}}, \textbf{x}_m\) and \(\textbf{m}\). For customization, we fine-tune our model using the Adam optimizer with a fixed learning rate of 1e−6 for 100 steps, which takes around 2 minutes on a single A800 GPU.
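A minimal sketch of these augmentations follows, using the albumentations library; the library choice and variable names are our assumptions, while the operations, limits, and probabilities follow the text above.

```python
# Hypothetical augmentation sketch; flip/shift/scale settings follow the paper,
# the albumentations API usage is an assumption about the implementation.
import numpy as np
import albumentations as A

h, w = 1024, 768
x_p = np.zeros((h, w, 3), dtype=np.uint8)     # person image (dummy data for illustration)
x_pose = np.zeros((h, w, 3), dtype=np.uint8)  # pose map
x_m = np.zeros((h, w, 3), dtype=np.uint8)     # masked person image
m = np.zeros((h, w), dtype=np.uint8)          # inpainting mask

augment = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        # Random affine shift and scale with limit 0.2; rotation disabled.
        A.ShiftScaleRotate(shift_limit=0.2, scale_limit=0.2, rotate_limit=0, p=0.5),
    ],
    additional_targets={
        # Apply the same spatial transform to all TryonNet inputs.
        "pose": "image",
        "masked_person": "image",
        "mask": "mask",
    },
)

out = augment(image=x_p, pose=x_pose, masked_person=x_m, mask=m)
x_p_aug, x_pose_aug = out["image"], out["pose"]
x_m_aug, m_aug = out["masked_person"], out["mask"]
```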

During inference, we generate images using the DDPM scheduler with 30 steps. We set the strength value to 1.0, i.e., denoising begins from random Gaussian noise, to ignore the masked portion of the input image. For classifier-free guidance [15] (CFG), we merge both conditions, i.e., the low-level features \(\textbf{L}_g\) from GarmentNet and the high-level semantics \(\textbf{H}_g\) from IP-Adapter, as these conditions contain features of the same garment image, following SpaText [1]. Specifically, the noise prediction is given as follows:

$$\begin{aligned} \hat{\boldsymbol{\epsilon }}_\theta (\textbf{x}_t;\textbf{L}_g,\textbf{H}_g,t) = s\cdot (\boldsymbol{\epsilon }_\theta (\textbf{x}_t;\textbf{L}_g,\textbf{H}_g,t) - \boldsymbol{\epsilon }_\theta (\textbf{x}_t;t)) + \boldsymbol{\epsilon }_\theta (\textbf{x}_t;t)\text {,} \end{aligned}$$
(4)

where \(\boldsymbol{\epsilon }_\theta (\textbf{x}_t;\textbf{L}_g,\textbf{H}_g,t)\) denotes the noise prediction with conditions \(\textbf{L}_g\) and \(\textbf{H}_g\), and \(\boldsymbol{\epsilon }_\theta (\textbf{x}_t;t)\) denotes the unconditional noise prediction. We use a guidance scale of \(s = 2.0\), which works well in practice.
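A minimal sketch of Eq. (4) in code is shown below; the `eps_model` callable and its keyword arguments are illustrative placeholders, not the released interface.

```python
# Hypothetical sketch of classifier-free guidance as in Eq. (4). The model
# interface (keyword arguments for the garment conditions) is an assumption.
def cfg_noise_prediction(eps_model, x_t, t, L_g, H_g, s=2.0):
    """Merge conditional and unconditional noise predictions with guidance scale s."""
    eps_cond = eps_model(x_t, t, low_level=L_g, high_level=H_g)      # conditioned on L_g, H_g
    eps_uncond = eps_model(x_t, t, low_level=None, high_level=None)  # unconditional
    # Eq. (4): eps_hat = eps_uncond + s * (eps_cond - eps_uncond)
    return eps_uncond + s * (eps_cond - eps_uncond)
```

Note that setting \(s = 1\) reduces this to the plain conditional prediction, while \(s = 2.0\) amplifies the garment conditioning as stated above.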

Fig. 9. Illustration of generating detailed captions of garments. We utilize a pretrained fashion attribute tagging annotator to extract garment attributes and generate detailed captions of the garment based on the extracted information.

A.3 Detailed Captioning of Garments

We generate detailed captions for each garment to leverage the prior knowledge of T2I diffusion models. We employ OMNIOUS.AI's commercial fashion attribute tagging annotator (Footnote 3), which has been trained with over 1,000 different fashion attributes. The annotator provides various feature categories present in a given image, such as sleeve length and neckline type. We extract three feature categories: sleeve length, neckline type, and item name, as illustrated in Fig. 9. Subsequently, we generate captions based on this feature information, for example, “short sleeve off shoulder t-shirts”.
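A minimal sketch of this caption construction is given below; the attribute keys and the template are illustrative, since the commercial annotator and its output schema are not public.

```python
# Hypothetical caption builder from tagged attributes (keys are illustrative).
def build_garment_caption(attributes: dict) -> str:
    """Compose a caption from the three extracted feature categories."""
    parts = [
        attributes.get("sleeve_length", ""),  # e.g., "short sleeve"
        attributes.get("neckline", ""),       # e.g., "off shoulder"
        attributes.get("item_name", ""),      # e.g., "t-shirts"
    ]
    return " ".join(p for p in parts if p)

# Reproduces the example caption from the text.
caption = build_garment_caption(
    {"sleeve_length": "short sleeve", "neckline": "off shoulder", "item_name": "t-shirts"}
)
print(caption)  # -> "short sleeve off shoulder t-shirts"
```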

Fig. 10. Comparison between IDM–VTON and OOTDiffusion [55] on the VITON-HD and DressCode datasets. Both methods are trained on the VITON-HD training dataset. Best viewed zoomed in on a color monitor.

Fig. 11. Comparison between IDM–VTON and OOTDiffusion [55] on the In-the-Wild dataset. We provide visual comparison results on the In-the-Wild dataset. Both methods are trained on the VITON-HD training dataset. Best viewed zoomed in on a color monitor.

B Comparison with Concurrent Work

We additionally compare IDM–VTON with OOTDiffusion [55], a concurrent work on the virtual try-on task. We compare models trained on the VITON-HD and DressCode datasets, and we use publicly available model checkpoints to generate try-on images with OOTDiffusion (Footnote 4).

Table 5. Quantitative results of models trained on the VITON-HD training dataset and evaluated on the VITON-HD and DressCode (upper body) test datasets. We additionally compare the metric scores of IDM–VTON (ours) with the concurrent work OOTDiffusion [55].
Table 6. Quantitative results of models trained on the DressCode training dataset and evaluated on the VITON-HD and DressCode test datasets. We additionally compare the metric scores of IDM–VTON (ours) with the concurrent work OOTDiffusion [55].
Table 7. Quantitative results on the In-the-Wild dataset. We compare IDM–VTON (ours) with the concurrent work OOTDiffusion [55] on the In-the-Wild dataset to assess generalization capabilities. We report LPIPS, SSIM and CLIP image similarity scores.

Figure 10 and Fig. 11 present qualitative comparisons between OOTDiffusion and our method on the VITON-HD, DressCode, and In-the-Wild datasets. As shown in Fig. 10, IDM–VTON outperforms OOTDiffusion in capturing both high-level semantics and low-level details, and it generates more authentic images. In particular, we observe notable improvements of IDM–VTON over OOTDiffusion on the In-the-Wild dataset, which demonstrates the generalization ability of IDM–VTON.

Table 5, Table 6 and Table 7 show quantitative comparisons between IDM–VTON and OOTDiffusion on the VITON-HD, DressCode, and In-the-Wild datasets. IDM–VTON outperforms OOTDiffusion on all metrics, including image fidelity (FID) and reconstruction of garments (LPIPS, SSIM and CLIP-I), which verifies our claim (Figs. 12, 13, 14, 15, 16, 17 and 18).
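As a rough illustration of how such reconstruction metrics can be computed with common open-source packages (lpips, torchmetrics, transformers), a sketch is given below; the backbone choices and preprocessing are our assumptions and may differ from the paper's exact evaluation protocol.

```python
# Hypothetical metric computation sketch; backbone choices (AlexNet-based LPIPS,
# CLIP ViT-B/32 for CLIP-I) are assumptions, not the paper's exact setup.
import torch
import lpips
from torchmetrics.image import StructuralSimilarityIndexMeasure
from transformers import CLIPModel, CLIPProcessor

lpips_fn = lpips.LPIPS(net="alex")                          # expects tensors in [-1, 1]
ssim_fn = StructuralSimilarityIndexMeasure(data_range=1.0)  # expects tensors in [0, 1]
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def evaluate_pair(gen, ref, gen_pil, ref_pil):
    """gen/ref: (1, 3, H, W) float tensors in [0, 1]; gen_pil/ref_pil: PIL images."""
    lpips_score = lpips_fn(gen * 2 - 1, ref * 2 - 1).item()
    ssim_score = ssim_fn(gen, ref).item()
    inputs = clip_proc(images=[gen_pil, ref_pil], return_tensors="pt")
    feats = clip.get_image_features(**inputs)
    clip_i = torch.nn.functional.cosine_similarity(feats[0:1], feats[1:2]).item()
    return {"LPIPS": lpips_score, "SSIM": ssim_score, "CLIP-I": clip_i}
```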

Fig. 12. Qualitative comparison on the VITON-HD and DressCode datasets. As observed in our quantitative analysis, GAN-based methods generally struggle to generate high-fidelity images, introducing undesirable distortions (e.g., unrealistic bodies and arms), while diffusion-based methods fail to capture low-level features or high-level semantics. All methods are trained on the VITON-HD training data.

Fig. 13. Qualitative comparison on the In-the-Wild dataset. While baselines fail to generate natural images or capture the details of clothing, IDM–VTON produces realistic images and preserves fine details effectively. All methods are trained on the VITON-HD training data.

Fig. 14. Try-on results on the VITON-HD test data by IDM–VTON trained on the VITON-HD training data. Best viewed zoomed in on a color monitor.

Fig. 15. Try-on results on the DressCode test data by IDM–VTON trained on the VITON-HD training data. Best viewed zoomed in on a color monitor.

Fig. 16. Try-on results on the In-the-Wild dataset by IDM–VTON trained on the VITON-HD training data. Best viewed zoomed in on a color monitor.

Fig. 17. Try-on results on the In-the-Wild dataset by IDM–VTON trained on the VITON-HD training data. Best viewed zoomed in on a color monitor.

Fig. 18. Try-on results on the In-the-Wild dataset by IDM–VTON trained on the VITON-HD training data. Best viewed zoomed in on a color monitor.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Choi, Y., Kwak, S., Lee, K., Choi, H., Shin, J. (2025). Improving Diffusion Models for Authentic Virtual Try-on in the Wild. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15144. Springer, Cham. https://doi.org/10.1007/978-3-031-73016-0_13



