Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15144)
Abstract
This paper considers image-based virtual try-on, which renders an image of a person wearing a curated garment, given a pair of images depicting the person and the garment, respectively. Previous works adapt existing exemplar-based inpainting diffusion models for virtual try-on to improve the naturalness of the generated visuals compared to other methods (e.g., GAN-based), but they fail to preserve the identity of the garments. To overcome this limitation, we propose a novel diffusion model that improves garment fidelity and generates authentic virtual try-on images. Our method, coined IDM–VTON, uses two different modules to encode the semantics of the garment image; given the base UNet of the diffusion model, 1) the high-level semantics extracted from a visual encoder are fused into the cross-attention layer, and then 2) the low-level features extracted from a parallel UNet are fused into the self-attention layer. In addition, we provide detailed textual prompts for both garment and person images to enhance the authenticity of the generated visuals. Finally, we present a customization method using a pair of person-garment images, which significantly improves fidelity and authenticity. Our experimental results show that our method outperforms previous approaches (both diffusion-based and GAN-based) in preserving garment details and generating authentic virtual try-on images, both qualitatively and quantitatively. Furthermore, the proposed customization method demonstrates its effectiveness in a real-world scenario. More visualizations are available on our project page.
Notes
- 1. We do not compare with GP-VTON as it uses private parsing modules. Comparison with HR-VITON is in the appendix.
- 2.
- 3.
- 4.
References
Avrahami, O., Hayes, T., Gafni, O., Gupta, S., Taigman, Y., Parikh, D., Lischinski, D., Fried, O., Yin, X.: Spatext: Spatio-textual representation for controllable image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18370–18380 (2023)
Chari, P., Ma, S., Ostashev, D., Kadambi, A., Krishnan, G., Wang, J., Aberman, K.: Personalized restoration via dual-pivot tuning. arXiv preprint arXiv:2312.17234 (2023)
Choi, S., Park, S., Lee, M., Choo, J.: Viton-hd: High-resolution virtual try-on via misalignment-aware normalization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14131–14140 (2021)
Cui, A., Mahajan, J., Shah, V., Gomathinayagam, P., Lazebnik, S.: Street tryon: Learning in-the-wild virtual try-on from unpaired person images. arXiv preprint arXiv:2311.16094 (2023)
Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., Cohen-Or, D.: An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 (2022)
Ge, C., Song, Y., Ge, Y., Yang, H., Liu, W., Luo, P.: Disentangled cycle consistency for highly-realistic virtual try-on. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16928–16937 (2021)
Ge, Y., Song, Y., Zhang, R., Ge, C., Liu, W., Luo, P.: Parser-free virtual try-on via distilling appearance flows. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8485–8493 (2021)
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Commun. ACM 63(11), 139–144 (2020)
Gou, J., Sun, S., Zhang, J., Si, J., Qian, C., Zhang, L.: Taming the power of diffusion models for high-quality virtual try-on with appearance flow. In: Proceedings of the 31st ACM International Conference on Multimedia. pp. 7599–7607 (2023)
Güler, R.A., Neverova, N., Kokkinos, I.: Densepose: Dense human pose estimation in the wild. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7297–7306 (2018)
Han, L., Li, Y., Zhang, H., Milanfar, P., Metaxas, D., Yang, F.: Svdiff: Compact parameter space for diffusion fine-tuning. arXiv preprint arXiv:2303.11305 (2023)
Han, X., Wu, Z., Wu, Z., Yu, R., Davis, L.S.: Viton: An image-based virtual try-on network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7543–7552 (2018)
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30 (2017)
Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Adv. Neural. Inf. Process. Syst. 33, 6840–6851 (2020)
Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)
Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., Gelly, S.: Parameter-efficient transfer learning for nlp. In: International Conference on Machine Learning. pp. 2790–2799. PMLR (2019)
Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)
Hu, L., Gao, X., Zhang, P., Sun, K., Zhang, B., Bo, L.: Animate anyone: Consistent and controllable image-to-video synthesis for character animation. arXiv preprint arXiv:2311.17117 (2023)
Hyvärinen, A., Dayan, P.: Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6(4) (2005)
Ilharco, G., Wortsman, M., Wightman, R., Gordon, C., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A., Schmidt, L.: OpenCLIP (Jul 2021)
Issenhuth, T., Mary, J., Calauzenes, C.: Do not mask what you do not need to mask: a parser-free virtual try-on. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16. pp. 619–635. Springer (2020)
Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4401–4410 (2019)
Kim, J., Gu, G., Park, M., Park, S., Choo, J.: Stableviton: Learning semantic correspondence with latent diffusion model for virtual try-on. arXiv preprint arXiv:2312.01725 (2023)
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y.: Multi-concept customization of text-to-image diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1931–1941 (2023)
Lee, K., Kwak, S., Sohn, K., Shin, J.: Direct consistency optimization for compositional text-to-image personalization. arXiv preprint arXiv:2402.12004 (2024)
Lee, S., Gu, G., Park, S., Choi, S., Choo, J.: High-resolution virtual try-on with misalignment and occlusion-handled conditions. In: European Conference on Computer Vision. pp. 204–219. Springer (2022)
Li, N., Liu, Q., Singh, K.K., Wang, Y., Zhang, J., Plummer, B.A., Lin, Z.: Unihuman: A unified model for editing human images in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2039–2048 (2024)
Men, Y., Mao, Y., Jiang, Y., Ma, W.Y., Lian, Z.: Controllable person image synthesis with attribute-decomposed gan. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5084–5093 (2020)
Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., Ermon, S.: Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073 (2021)
Morelli, D., Baldrati, A., Cartella, G., Cornia, M., Bertini, M., Cucchiara, R.: Ladi-vton: Latent diffusion textual-inversion enhanced virtual try-on. arXiv preprint arXiv:2305.13501 (2023)
Morelli, D., Fincato, M., Cornia, M., Landi, F., Cesari, F., Cucchiara, R.: Dress code: High-resolution multi-category virtual try-on. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2231–2235 (2022)
Mou, C., Wang, X., Xie, L., Wu, Y., Zhang, J., Qi, Z., Shan, Y., Qie, X.: T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453 (2023)
Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., Chen, M.: Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021)
Ning, S., Wang, D., Qin, Y., Jin, Z., Wang, B., Han, X.: Picture: Photorealistic virtual try-on from unconstrained designs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6976–6985 (2024)
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21(1), 5485–5551 (2020)
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 1(2), 3 (2022)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. pp. 234–241. Springer (2015)
Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22500–22510 (2023)
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural. Inf. Process. Syst. 35, 36479–36494 (2022)
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: International conference on machine learning. pp. 2256–2265. PMLR (2015)
Sohn, K., Ruiz, N., Lee, K., Chin, D.C., Blok, I., Chang, H., Barber, J., Jiang, L., Entis, G., Li, Y., et al.: Styledrop: Text-to-image generation in any style. arXiv preprint arXiv:2306.00983 (2023)
Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)
Tang, L., Ruiz, N., Chu, Q., Li, Y., Holynski, A., Jacobs, D.E., Hariharan, B., Pritch, Y., Wadhwa, N., Aberman, K., et al.: Realfill: Reference-driven generation for authentic image completion. arXiv preprint arXiv:2309.16668 (2023)
team, D.: Stable diffusion xl inpainting (2023)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
Wang, B., Zheng, H., Liang, X., Chen, Y., Lin, L., Yang, M.: Toward characteristic-preserving image-based virtual try-on network. In: Proceedings of the European conference on computer vision (ECCV). pp. 589–604 (2018)
Wang, X., Xie, L., Dong, C., Shan, Y.: Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1905–1914 (2021)
Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
Xie, Z., Huang, Z., Dong, X., Zhao, F., Dong, H., Zhang, X., Zhu, F., Liang, X.: Gp-vton: Towards general purpose virtual try-on via collaborative local-flow global-parsing learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 23550–23559 (2023)
Xu, Y., Gu, T., Chen, W., Chen, C.: Ootdiffusion: Outfitting fusion based latent diffusion for controllable virtual try-on. arXiv preprint arXiv:2403.01779 (2024)
Yang, B., Gu, S., Zhang, B., Zhang, T., Chen, X., Sun, X., Chen, D., Wen, F.: Paint by example: Exemplar-based image editing with diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18381–18391 (2023)
Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023)
Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3836–3847 (2023)
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 586–595 (2018)
Zhao, S., Chen, D., Chen, Y.C., Bao, J., Hao, S., Yuan, L., Wong, K.Y.K.: Uni-controlnet: All-in-one control to text-to-image diffusion models. Advances in Neural Information Processing Systems 36 (2024)
Zhu, L., Yang, D., Zhu, T., Reda, F., Chan, W., Saharia, C., Norouzi, M., Kemelmacher-Shlizerman, I.: Tryondiffusion: A tale of two unets. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4606–4615 (2023)
Acknowledgement
This work was supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075 Artificial Intelligence Graduate School Program (KAIST); No. RS-2021-II212068, Artificial Intelligence Innovation Hub).
Author information
Authors and Affiliations
Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea
Yisol Choi, Sangkyung Kwak, Kyungmin Lee & Jinwoo Shin
OMNIOUS.AI, Seoul, South Korea
Hyungwon Choi
Corresponding author
Correspondence to Jinwoo Shin.
Editor information
Editors and Affiliations
University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Hessen, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol
Appendix
A Implementation Details
A.1 In-the-Wild Dataset
The In-the-Wild dataset comprises multiple human images wearing each target garment. Garment images are collected from the MLB online shopping mall (Footnote 2), and images of humans wearing each garment are gathered from social media platforms such as Instagram. As shown in Fig. 8, the human images exhibit diverse backgrounds, ranging from parks and buildings to snowy landscapes. For preprocessing, we apply center cropping to the human images to obtain a resolution of \(1024\times 768\), while the garment images are resized to the same dimensions for a compatible setting.
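For reference, a minimal preprocessing sketch under these settings is given below; the crop-then-resize order, the use of Pillow, and the LANCZOS filter are our assumptions rather than details taken from the paper.

```python
# Hedged sketch of the In-the-Wild preprocessing: center-crop person images to the
# 3:4 aspect ratio and resize to 1024x768; resize garment images to the same size.
from PIL import Image

TARGET_H, TARGET_W = 1024, 768  # height x width used throughout the paper

def preprocess_person(img: Image.Image) -> Image.Image:
    """Center-crop to a 3:4 aspect ratio, then resize to 1024x768 (assumed order)."""
    w, h = img.size
    target_ratio = TARGET_W / TARGET_H
    if w / h > target_ratio:            # too wide -> crop width
        new_w = int(h * target_ratio)
        left = (w - new_w) // 2
        img = img.crop((left, 0, left + new_w, h))
    else:                               # too tall -> crop height
        new_h = int(w / target_ratio)
        top = (h - new_h) // 2
        img = img.crop((0, top, w, top + new_h))
    return img.resize((TARGET_W, TARGET_H), Image.LANCZOS)

def preprocess_garment(img: Image.Image) -> Image.Image:
    """Resize the garment image to the same 1024x768 resolution."""
    return img.resize((TARGET_W, TARGET_H), Image.LANCZOS)
```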
A.2 Training and Inference
We train the model using the Adam [24] optimizer with a fixed learning rate of 1e−5 over 130 epochs (63k iterations with a batch size of 24). Training takes around 40 h on 4\(\times \)A800 GPUs. We apply data augmentations following StableVITON [23]: horizontal flip (with probability 0.5) and random affine shifting and scaling (limit of 0.2, with probability 0.5) are applied to the inputs of TryonNet, i.e., \(\textbf{x}_p, \textbf{x}_{\text {pose}}, \textbf{x}_m\) and \(\textbf{m}\). For customization, we fine-tune our model using the Adam optimizer with a fixed learning rate of 1e−6 for 100 steps, which takes around 2 min on a single A800 GPU.
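The sketch below illustrates the joint augmentation described above, assuming the flip and affine parameters are sampled once and applied to all aligned TryonNet inputs; the torchvision functional API and all function and parameter names here are our own choices, not the authors' code.

```python
# Hedged sketch: horizontal flip (p=0.5) and random affine shift/scale (limit 0.2,
# p=0.5) applied jointly to the aligned inputs x_p, x_pose, x_m and m so that they
# stay spatially consistent.
import random
import torchvision.transforms.functional as TF

def augment_tryonnet_inputs(tensors, p_flip=0.5, p_affine=0.5, limit=0.2):
    """tensors: list of aligned CxHxW tensors (person, pose, masked person, mask)."""
    if random.random() < p_flip:
        tensors = [TF.hflip(t) for t in tensors]
    if random.random() < p_affine:
        _, h, w = tensors[0].shape
        # Sample one shift/scale and reuse it for every input to keep them aligned.
        dx = int(random.uniform(-limit, limit) * w)
        dy = int(random.uniform(-limit, limit) * h)
        scale = 1.0 + random.uniform(-limit, limit)
        tensors = [TF.affine(t, angle=0.0, translate=[dx, dy], scale=scale, shear=0.0)
                   for t in tensors]
    return tensors
```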
During inference, we generate images using the DDPM scheduler with 30 steps. We set the strength value to 1.0, i.e., denoising begins from random Gaussian noise, to ignore the masked portion of the input image. For classifier-free guidance [15] (CFG), we merge both conditions, i.e., the low-level features \(\textbf{L}_g\) from GarmentNet and the high-level semantics \(\textbf{H}_g\) from IP-Adapter, as these conditions contain features of the same garment image, following SpaText [1]. Specifically, the forward pass is given as follows:
$$\hat{\boldsymbol{\epsilon }}_\theta (\textbf{x}_t;\textbf{L}_g,\textbf{H}_g,t) = \boldsymbol{\epsilon }_\theta (\textbf{x}_t;t) + s\cdot \big (\boldsymbol{\epsilon }_\theta (\textbf{x}_t;\textbf{L}_g,\textbf{H}_g,t) - \boldsymbol{\epsilon }_\theta (\textbf{x}_t;t)\big ),$$
where \(\boldsymbol{\epsilon }_\theta (\textbf{x}_t;\textbf{L}_g,\textbf{H}_g,t)\) denotes the noise prediction with conditions \(\textbf{L}_g\) and \(\textbf{H}_g\), and \(\boldsymbol{\epsilon }_\theta (\textbf{x}_t;t)\) denotes the unconditional noise prediction. We use a guidance scale \(s = 2.0\), which works well in practice.
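To make the merged guidance concrete, here is a small sketch of the combination above; `unet` and its keyword arguments are placeholders for illustration and do not reflect the actual IDM–VTON interface.

```python
# Hedged sketch of merged classifier-free guidance: the garment conditions L_g
# (GarmentNet features) and H_g (IP-Adapter semantics) are dropped together, so a
# single guidance term with scale s is used.
def cfg_noise_prediction(unet, x_t, t, L_g, H_g, s=2.0):
    eps_cond = unet(x_t, t, garment_features=L_g, garment_semantics=H_g)      # eps(x_t; L_g, H_g, t)
    eps_uncond = unet(x_t, t, garment_features=None, garment_semantics=None)  # eps(x_t; t)
    return eps_uncond + s * (eps_cond - eps_uncond)
```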
Fig. 9. Illustration of generating detailed captions of garments. We utilize a pretrained fashion attribute tagging annotator to extract garment information and generate detailed captions of the garment based on the extracted information.
A.3 Detailed Captioning of Garments
We generate detailed captions for each garment to leverage the prior knowledge of T2I diffusion models. We employ OMNIOUS.AI's commercial fashion attribute tagging annotator (Footnote 3), which has been trained with over 1,000 different fashion attributes. The annotator provides various feature categories present in a given image, such as sleeve length and neckline type. We extract three feature categories: sleeve length, neckline type, and item name, as illustrated in Fig. 9. Subsequently, we generate captions based on this feature information, for example, "short sleeve off shoulder t-shirts".
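As a sketch, composing the caption from the three extracted categories amounts to simple string templating; the dictionary keys below are hypothetical, since the commercial annotator and its output schema are not public.

```python
# Illustrative sketch of caption composition from the three extracted categories
# (sleeve length, neckline type, item name). Keys are assumed, not the real schema.
def compose_garment_caption(attrs: dict) -> str:
    parts = [attrs.get("sleeve_length"), attrs.get("neckline"), attrs.get("item")]
    return " ".join(p for p in parts if p)

# Example: {"sleeve_length": "short sleeve", "neckline": "off shoulder",
#           "item": "t-shirts"} -> "short sleeve off shoulder t-shirts"
```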
Fig. 10. Comparison between IDM–VTON and OOTDiffusion [55] on the VITON-HD and DressCode datasets. Both methods are trained on the VITON-HD training dataset. Best viewed zoomed in on a color monitor.
Fig. 11. Comparison between IDM–VTON and OOTDiffusion [55] on the In-the-Wild dataset. Both methods are trained on the VITON-HD training dataset. Best viewed zoomed in on a color monitor.
B Comparison with Concurrent Work
We additionally compare IDM–VTON with OOTDiffusion [55], a concurrent work on the virtual try-on task. We compare each model trained on the VITON-HD and DressCode datasets, and use the publicly available model checkpoints of OOTDiffusion (Footnote 4) to generate its try-on images.
Figures 10 and 11 present qualitative comparisons between OOTDiffusion and our method on the VITON-HD, DressCode, and In-the-Wild datasets. As shown in Fig. 10, IDM–VTON outperforms OOTDiffusion in capturing both high-level semantics and low-level details, and generates more authentic images. In particular, we observe notable improvements of IDM–VTON over OOTDiffusion on the In-the-Wild dataset, which demonstrates the generalization ability of IDM–VTON.
Tables 5, 6 and 7 show quantitative comparisons between IDM–VTON and OOTDiffusion on the VITON-HD, DressCode, and In-the-Wild datasets. One can see that IDM–VTON outperforms OOTDiffusion on all metrics, including image fidelity (FID) and reconstruction of garments (LPIPS, SSIM and CLIP-I), which verifies our claim (Figs. 12, 13, 14, 15, 16, 17 and 18).
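Among these metrics, CLIP-I (the similarity between CLIP image embeddings of the generated try-on and the reference) is the least standardized; the sketch below shows one way to compute it, assuming the Hugging Face transformers CLIP implementation and the openai/clip-vit-base-patch32 checkpoint, which may differ from the evaluation code actually used.

```python
# Hedged sketch of the CLIP-I metric: mean cosine similarity between CLIP image
# embeddings of generated and reference images. Checkpoint and library are assumed.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_i_score(generated_images, reference_images):
    """Both arguments are lists of PIL images of equal length."""
    gen = processor(images=generated_images, return_tensors="pt")
    ref = processor(images=reference_images, return_tensors="pt")
    gen_emb = model.get_image_features(**gen)
    ref_emb = model.get_image_features(**ref)
    return torch.nn.functional.cosine_similarity(gen_emb, ref_emb, dim=-1).mean().item()
```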
Fig. 12. Qualitative comparison on the VITON-HD and DressCode datasets. As observed in our quantitative analysis, GAN-based methods generally struggle to generate high-fidelity images, introducing undesirable distortions (e.g., unrealistic bodies and arms), while diffusion-based methods fail to capture low-level features or high-level semantics. All methods are trained on the VITON-HD training data.
Fig. 13. Qualitative comparison on the In-the-Wild dataset. While baselines fail to generate natural images or capture the details of clothing, IDM–VTON produces realistic images and preserves fine details effectively. All methods are trained on the VITON-HD training data.
Fig. 14. Try-on results on the VITON-HD test data by IDM–VTON trained on the VITON-HD training data. Best viewed zoomed in on a color monitor.
Fig. 15. Try-on results on the DressCode test data by IDM–VTON trained on the VITON-HD training data. Best viewed zoomed in on a color monitor.
Figs. 16–18. Try-on results on the In-the-Wild dataset by IDM–VTON trained on the VITON-HD training data. Best viewed zoomed in on a color monitor.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Choi, Y., Kwak, S., Lee, K., Choi, H., Shin, J. (2025). Improving Diffusion Models for Authentic Virtual Try-on in the Wild. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15144. Springer, Cham. https://doi.org/10.1007/978-3-031-73016-0_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73015-3
Online ISBN: 978-3-031-73016-0