- Wei Liu (ORCID: orcid.org/0000-0001-5744-0574),
- Liuan Wang (ORCID: orcid.org/0000-0002-5627-7522) &
- Jun Sun (ORCID: orcid.org/0000-0002-0967-4859)
Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15325)
Abstract
Text-guided image diffusion models have demonstrated a remarkable ability to generate consistent images. In this paper, we introduce a training-free image composition framework that realizes non-rigid object composition based on a pair of source and target prompts. Specifically, we aim to blend a user-provided object reference image into a background image in a non-rigid manner while keeping a balance between fidelity and editability. For example, we can make a standing dog jump while preserving its shape and appearance under the guidance of the target prompt. Our proposed method has three key components. First, the reference image and the background are inverted into latent noises with different image inversion methods. Second, we guarantee consistent generation of the reference object's attributes by injecting the self-attention key and value features from the original pipeline during the sampling steps. Third, we iteratively optimize the object mask in the target pipeline and progressively compose the image region by region. Experiments show that our proposed method achieves non-rigid object image editing and seamless composition, producing impressive results in consistent and editable image composition.
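To make the three-stage procedure in the abstract concrete, the following is a minimal, hypothetical PyTorch-style sketch of how such a training-free composition loop could be organized. The helper names (invert_to_latent, denoise_step, refine_mask), the dummy update rules, and the latent blending formula are our own illustrative assumptions for exposition, not the authors' released implementation; in a real pipeline the denoising step would be a text-conditioned U-Net and the key/value injection would happen inside its self-attention layers.

```python
# Hypothetical sketch of a training-free non-rigid composition loop (not the authors' code).
import torch

T = 50          # number of sampling steps (assumed)
H = W = 64      # latent spatial size for a 512x512 image in an SD-style latent space

def invert_to_latent(image_latent, steps=T):
    """Placeholder for an image inversion method (e.g. DDIM or null-text style inversion)."""
    return image_latent + 0.1 * torch.randn_like(image_latent)

def denoise_step(latent, t, prompt, kv_override=None):
    """Placeholder for one denoising step; kv_override would carry the self-attention
    key/value features injected from the reference (source-prompt) pipeline."""
    return latent - 0.01 * latent  # dummy update standing in for the diffusion model

def refine_mask(latent, prev_mask):
    """Placeholder for the iterative object-mask optimization in the target pipeline."""
    return (prev_mask > 0.5).float()

# Stage 1: invert the reference object and the background into latent noises.
ref_latent = invert_to_latent(torch.randn(1, 4, H, W))
bg_latent = invert_to_latent(torch.randn(1, 4, H, W))

mask = torch.zeros(1, 1, H, W)
mask[..., 16:48, 16:48] = 1.0  # initial object region (assumed)

z_ref, z_tgt, z_bg = ref_latent.clone(), ref_latent.clone(), bg_latent.clone()
for t in reversed(range(T)):
    # Stage 2: run the source pipeline and reuse its self-attention K/V features
    # in the target pipeline so the object's appearance stays consistent.
    z_ref = denoise_step(z_ref, t, prompt="a standing dog")
    kv = {"keys": z_ref, "values": z_ref}  # stand-in for real attention features
    z_tgt = denoise_step(z_tgt, t, prompt="a jumping dog", kv_override=kv)

    # Stage 3: progressively compose foreground and background by region,
    # refining the object mask as sampling proceeds.
    z_bg = denoise_step(z_bg, t, prompt="")
    mask = refine_mask(z_tgt, mask)
    z_tgt = mask * z_tgt + (1 - mask) * z_bg

print("composed latent:", z_tgt.shape)  # a real pipeline would decode this with the VAE
```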
Author information
Authors and Affiliations
Fujitsu R&D Center, Co., LTD., Beijing, China
Wei Liu, Liuan Wang & Jun Sun
Corresponding author
Correspondence to Wei Liu.
Editor information
Editors and Affiliations
University of Salford, Salford, Lancashire, UK
Apostolos Antonacopoulos
Indian Institute of Technology Bombay, Mumbai, Maharashtra, India
Subhasis Chaudhuri
Johns Hopkins University, Baltimore, MD, USA
Rama Chellappa
Chinese Academy of Sciences, Beijing, China
Cheng-Lin Liu
Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, India
Saumik Bhattacharya
ISI Kolkata, Kolkata, West Bengal, India
Umapada Pal
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, W., Wang, L., Sun, J. (2025). NR-CION: Non-rigid Consistent Image Composition Via Diffusion Model. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15325. Springer, Cham. https://doi.org/10.1007/978-3-031-78389-0_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-78388-3
Online ISBN: 978-3-031-78389-0
eBook Packages: Computer Science, Computer Science (R0)