
NR-CION: Non-rigid Consistent Image Composition Via Diffusion Model

  • Conference paper

Abstract

Text-guided image diffusion models have demonstrated a remarkable ability in consistent image generation. In this paper, we introduce a training-free image composition framework that realizes non-rigid object composition based on a pair of source and target prompts. Specifically, we aim to blend a user-provided object reference image into a background image in a non-rigid manner while keeping the balance between fidelity and editability. For example, we can make a standing dog jump while preserving its shape and appearance under the guidance of the target prompt. Our proposed method has three key components. First, the reference image and the background are inverted into latent noises with different image inversion methods. Second, we guarantee consistent generation of the reference object's attributes by injecting the self-attention key and value features from the original pipeline during the sampling steps. Third, we iteratively optimize the object mask in the target pipeline and progressively compose the image across different regions. Experiments show that our method achieves non-rigid object image editing with seamless composition, and the results are impressive in terms of consistent and editable image composition.
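
The chapter itself is paywalled and no reference implementation is given here, so the following is only a minimal PyTorch sketch of the second component described above: recording self-attention key/value features in a source (reference) pass and injecting them into a target (editing) pass so the edited object keeps the source appearance. All names in it (InjectableSelfAttention, stored_kv, the inject flag) are hypothetical, and the module stands in for a single self-attention block inside a diffusion U-Net, not the authors' actual pipeline.

```python
import torch
import torch.nn.functional as F
from torch import nn

class InjectableSelfAttention(nn.Module):
    """Self-attention block whose key/value features can be overridden
    with features recorded during a reference (source) pipeline pass."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)
        self.stored_kv = None  # (k, v) captured during the source pass

    def forward(self, x: torch.Tensor, inject: bool = False) -> torch.Tensor:
        b, n, d = x.shape
        h = self.heads
        q = self.to_q(x)
        if inject and self.stored_kv is not None:
            # Target pass: reuse the source K/V so the target queries
            # attend to the reference object's appearance features.
            k, v = self.stored_kv
        else:
            # Source pass: compute K/V normally and record them.
            k, v = self.to_k(x), self.to_v(x)
            self.stored_kv = (k.detach(), v.detach())
        # Reshape to (b, heads, n, d/heads) and run attention.
        q, k, v = (t.view(b, -1, h, d // h).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)

# Usage: run the source pass first to record K/V, then the target pass
# with injection enabled.
attn = InjectableSelfAttention(dim=64)
src_feats = torch.randn(1, 16, 64)     # features from the reference pipeline
tgt_feats = torch.randn(1, 16, 64)     # features from the target pipeline
_ = attn(src_feats)                    # source pass: stores K/V
edited = attn(tgt_feats, inject=True)  # target pass: queries source K/V
```

In the actual method this recording and injection would presumably be applied across the U-Net's self-attention layers at selected sampling steps, together with the inversion of both images and the iterative mask-guided latent blending, none of which is shown in this sketch.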



Author information

Authors and Affiliations

  1. Fujitsu R&D Center, Co., LTD., Beijing, China

    Wei Liu, Liuan Wang & Jun Sun


Corresponding author

Correspondence to Wei Liu.

Editor information

Editors and Affiliations

  1. University of Salford, Salford, Lancashire, UK

    Apostolos Antonacopoulos

  2. Indian Institute of Technology Bombay, Mumbai, Maharashtra, India

    Subhasis Chaudhuri

  3. Johns Hopkins University, Baltimore, MD, USA

    Rama Chellappa

  4. Chinese Academy of Sciences, Beijing, China

    Cheng-Lin Liu

  5. Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, India

    Saumik Bhattacharya

  6. ISI Kolkata, Kolkata, West Bengal, India

    Umapada Pal


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Liu, W., Wang, L., Sun, J. (2025). NR-CION: Non-rigid Consistent Image Composition Via Diffusion Model. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15325. Springer, Cham. https://doi.org/10.1007/978-3-031-78389-0_22
