- Qi Wang (ORCID: orcid.org/0000-0001-7774-978X),
- Ruijie Lu (ORCID: orcid.org/0009-0007-2786-2937),
- Xudong Xu (ORCID: orcid.org/0009-0003-8858-0918),
- Jingbo Wang (ORCID: orcid.org/0009-0005-0740-8548),
- Michael Yu Wang (ORCID: orcid.org/0000-0002-6524-5741),
- Bo Dai (ORCID: orcid.org/0000-0003-0777-9232),
- Gang Zeng (ORCID: orcid.org/0000-0002-9575-4651) &
- Dan Xu (ORCID: orcid.org/0000-0003-0136-9603)
Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15126)
Included in the following conference series: European Conference on Computer Vision (ECCV)
Abstract
The advancement of diffusion models has pushed the boundary of text-to-3D object generation. While it is straightforward to composite objects into a scene with reasonable geometry, it is nontrivial to texture such a scene perfectly due to style inconsistency and occlusions between objects. To tackle these problems, we propose a coarse-to-fine 3D scene texturing framework, referred to as RoomTex, to generate high-fidelity and style-consistent textures for untextured compositional scene meshes. In the coarse stage, RoomTex first unwraps the scene mesh to a panoramic depth map and leverages ControlNet to generate a room panorama, which serves as the coarse reference that ensures global texture consistency. In the fine stage, based on the panoramic image and perspective depth maps, RoomTex iteratively refines and textures every single object in the room along a series of selected camera views, until the object is completely painted. Moreover, we propose to maintain superior alignment between the RGB and depth spaces via subtle edge detection methods. Extensive experiments show that our method is capable of generating high-quality and diverse room textures and, more importantly, supports interactive fine-grained texture control and flexible scene editing thanks to our inpainting-based framework and compositional mesh input. Our project page is available at https://qwang666.github.io/RoomTex/.
Q. Wang and R. Lu: Equal contribution; work done during an internship at Shanghai AI Laboratory.
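To make the coarse stage concrete, the sketch below shows how a depth-conditioned ControlNet can turn a depth render of an untextured scene into an RGB style reference, in the spirit of the pipeline described in the abstract. This is a minimal illustration, not the authors' released code: the model IDs, the prompt, and the `scene_depth.npy` input are assumptions, and RoomTex conditions on a panoramic depth map rather than the single perspective render used here.

```python
# Sketch of a depth-conditioned generation step (assumptions noted above).
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Load a depth ControlNet and attach it to a Stable Diffusion pipeline.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Depth rendered from the compositional scene mesh (assumed precomputed
# and saved as a 2D float array).
depth = np.load("scene_depth.npy").astype(np.float32)
depth_u8 = (
    255.0 * (depth - depth.min()) / (np.ptp(depth) + 1e-8)
).astype(np.uint8)
depth_image = Image.fromarray(depth_u8).convert("RGB")

# Generate the RGB reference conditioned on the depth render.
result = pipe(
    prompt="a cozy living room in Scandinavian style, photorealistic",
    image=depth_image,
    num_inference_steps=30,
).images[0]
result.save("coarse_reference.png")

# Rough consistency check in the spirit of the paper's edge-based
# RGB-depth alignment: compare Canny edges of the generated image
# against edges of the depth render.
rgb_edges = cv2.Canny(np.array(result.convert("L")), 100, 200)
depth_edges = cv2.Canny(depth_u8, 100, 200)
overlap = (rgb_edges > 0) & (depth_edges > 0)
print("edge overlap ratio:", overlap.sum() / max(1, (depth_edges > 0).sum()))
```

The final lines only gesture at the paper's edge-detection-based alignment; the method described in the abstract operates inside the iterative inpainting loop and is more involved than this single overlap score.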
Acknowledgements
This research is supported in part by the Early Career Scheme of the Research Grants Council (RGC) of the Hong Kong SAR under grant No. 26202321, SAIL Research Project, HKUST-Zeekr Collaborative Research Fund, HKUST-WeBank Joint Lab Project, Tencent Rhino-Bird Focused Research Program, Sichuan Science and Technology Program (2023YFSY0008), China Tower-Peking University Joint Laboratory of Intelligent Society and Space Governance, National Natural Science Foundation of China (61632003, 61375022, 61403005), Grant SCITLAB-20017 of Intelligent Terminal Key Laboratory of SiChuan Province, Beijing Advanced Innovation Center for Intelligent Robots and Systems (2018IRS11), and PEK-SenseTime Joint Laboratory of Machine Vision. This research is also supported by Shanghai Artificial Intelligence Laboratory.
Author information
Authors and Affiliations
The Hong Kong University of Science and Technology, Hong Kong, Hong Kong SAR, China
Qi Wang, Michael Yu Wang & Dan Xu
National Key Laboratory of General AI, School of IST, Peking University, Beijing, China
Ruijie Lu & Gang Zeng
Shanghai AI Laboratory, Shanghai, China
Qi Wang, Ruijie Lu, Xudong Xu, Jingbo Wang & Bo Dai
Corresponding author
Correspondence to Xudong Xu.
Editor information
Editors and Affiliations
University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol
Electronic supplementary material
Below is the link to the electronic supplementary material.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, Q. et al. (2025). RoomTex: Texturing Compositional Indoor Scenes via Iterative Inpainting. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15126. Springer, Cham. https://doi.org/10.1007/978-3-031-73113-6_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73112-9
Online ISBN: 978-3-031-73113-6
eBook Packages: Computer Science, Computer Science (R0)