
1 The Hong Kong University of Science and Technology    2 Peking University    3 Shanghai AI Laboratory

RoomTex: Texturing Compositional Indoor Scenes via Iterative Inpainting

Qi Wang 1*    Ruijie Lu 2*    Xudong Xu 3    **gbo Wang 3    Michael Yu Wang 1    Bo Dai 3†    Gang Zeng 2    Dan Xu 1
Abstract

The advancement of diffusion models has pushed the boundary of text-to-3D object generation. While it is straightforward to composite objects into a scene with reasonable geometry, it is nontrivial to texture such a scene perfectly due to style inconsistency and occlusions between objects. To tackle these problems, we propose a coarse-to-fine 3D scene texturing framework, referred to as RoomTex, to generate high-fidelity and style-consistent textures for untextured compositional scene meshes. In the coarse stage, RoomTex first unwraps the scene mesh to a panoramic depth map and leverages ControlNet to generate a room panorama, which is regarded as the coarse reference to ensure the global texture consistency. In the fine stage, based on the panoramic image and perspective depth maps, RoomTex will refine and texture every single object in the room iteratively along a series of selected camera views, until this object is completely painted. Moreover, we propose to maintain superior alignment between RGB and depth spaces via subtle edge detection methods. Extensive experiments show our method is capable of generating high-quality and diverse room textures, and more importantly, supporting interactive fine-grained texture control and flexible scene editing thanks to our inpainting-based framework and compositional mesh input. Our project page is available at https://qwang666.github.io/RoomTex/.

Keywords: Scene Texturing · Scene Generation · Texture Synthesis
* Equal contribution, work done during the internship at Shanghai AI Laboratory.
† Corresponding author: [email protected]

1 Introduction

Figure 1: We propose RoomTex to synthesize high-quality and style-consistent textures for given scene meshes. Our method supports generating multiple styles.

Generating high-quality textured 3D models, especially indoor scenes, is imperative for various industrial applications, ranging from gaming and filming to AR/VR. Current delicate 3D indoor scenes, however, are mostly designed by professional artists with specialized expertise, which makes their creation expensive and time-consuming. Recently, significant progress has been made in the realm of 3D object generation [37, 8, 23, 38, 46, 17, 18, 26], especially in terms of geometry quality. Despite satisfactory scene geometry, achieving captivating and style-consistent textures still demands painstaking efforts from artists equipped with specialized knowledge and aesthetic training. Hence, automatic scene texturing, i.e., generating textures for untextured scene-level meshes, remains a valuable but challenging problem.

Existing texturing methods mostly focus on synthesizing textures for 3D objects, and most of them are either limited to several specific training categories [34, 47, 2] or rely on the corresponding UV maps [10, 60]. Thanks to the powerful CLIP model [39] connecting text and images, some subsequent approaches [28, 9, 60, 7, 4, 61] are capable of painting general 3D objects by leveraging the CLIP model or more advanced text-to-image diffusion models. Conditioned on given text descriptions, they typically apply an iterative scheme to texture an object from different viewpoints. Despite remarkable results on 3D objects, these methods cannot be naively extended to 3D indoor scenes owing to the complicated occlusion problem in the scene. Although an indoor scene can be viewed as the composition of various textured objects, it is nontrivial to ensure global style consistency of texture for all the individual objects inside.

In this work, we propose a novel coarse-to-fine framework, dubbed RoomTex, to synthesize high-fidelity and style-consistent texture for a compositional 3D indoor scene under the guidance of a text prompt, while simultaneously enabling flexible scene editing and fine-grained texture control. Contrary to directly using perfect indoor scene meshes [16] designed by professional artists, we opt to leverage off-the-shelf 3D object generative models along with a given 2D room layout to form a compositional untextured scene mesh with imperfect geometry, which alleviates the time-consuming model-making process and better aligns with the rapid development of 3D generative models. Afterward, this indoor scene mesh is first unwrapped to obtain a global panoramic depth map. Based on the depth map and text prompt, RoomTex leverages ControlNet [63] to synthesize a panoramic image of the entire scene in the coarse stage. It is noteworthy that this room panorama serves as the coarse reference in the subsequent fine stage to maintain global style consistency. To cope with the occlusion problem between objects and interior surfaces, we remove all the objects inside and inpaint the occluded areas to acquire complete room interior surfaces, including walls, floor, and ceilings.

Figure 2: RoomTex simultaneously enables interactive fine-grained texture control and flexible scene editing of individual objects inside.

In the fine stage, RoomTex textures every single object in the room based on the panoramic image acquired in the coarse stage, leading to a complete 3D room that can be perceived from any viewpoint. Specifically, for a particular 3D object to be painted, we first re-project the panorama from an appropriate view targeting this object to obtain a perspective image of the object. However, this perspective image inevitably contains unacceptable distortions owing to the equirectangular projection and thus will be refined to a better and more detailed image via depth-guided ControlNet [63]. Apart from this initial view, we additionally select a series of camera positions around this 3D object, along which this 3D object will be textured iteratively under the guidance of the text prompt. Regarding the refined perspective image as a starting point, we warp this partially painted 3D object to other viewpoints under the depth guidance and inpaint the missing texture in each new viewpoint, which repeats until the 3D object is completely painted. Unfortunately, the generated texture cannot perfectly align with the guiding depth map. In particular, the texture in the foreground of an object will dilate into the background area, which typically occurs around depth edges and might be imperceptible from the current view. However, the dilated texture from the previous iteration leads to a messy area when the object is warped to a new viewpoint, and the problem is exacerbated as the iterative inpainting goes on. To mitigate this problem, we propose to detect these misalignment areas with Canny [3] edges of RGB images and Laplacian edges of the corresponding depth maps. Afterward, these areas are not taken into consideration during the iterative inpainting, which avoids corrupted object textures.

Extensive experiments demonstrate that our method can synthesize high-fidelity and style-consistent texture for a compositional room mesh conditioned on the given text prompts as shown in Fig. 1. More importantly, thanks to the powerful control capabilities of ControlNet [63] and our inpainting-based texturing framework, RoomTex supports flexible scene editing of all individual objects and fine-grained texture control such as aligning the texture of a specific 3D object with provided sketches or text descriptions as illustrated in Fig. 2. Our contributions can be summarized as follows:

  • We propose a novel coarse-to-fine texture generation framework that first generates a room panorama as a coarse reference and then paints each component in the scene to achieve global style consistency.

  • Thanks to our subtly designed alignment between RGB and depth spaces, our method can take imperfect geometry from 3D object generative models as input and generate holistic and high-fidelity scene textures.

  • Users can not only flexibly edit the indoor scene where they can add, remove, replace, move, and rescale any furniture item, but also realize fine-grained texture control over any object with given sketches or text prompts.

2 Related Work

Diffusion-Based Text-to-Image Generation. In recent years, diffusion probabilistic models [19, 48] have achieved unprecedented success in text-to-image generation. By training on large-scale text-image paired datasets [44, 43], diffusion models manage to learn an implicit connection between semantic concepts and corresponding text embeddings, thus generating diverse and complex images of objects and scenes from given text prompts [33, 39, 42, 1]. Different from pixel-based diffusion approaches, latent diffusion models [41] (LDMs) apply the diffusion model on the latent space of pretrained autoencoders, significantly reducing the demands for massive computational resources. Moreover, several works [63, 31, 51] have explored utilizing additional conditions like sketches, Canny edges, depth maps, etc., to control the image generation of large pretrained text-to-image models. In this work, we leverage ControlNet [63] both to synthesize an indoor panorama conditioned on the corresponding indoor depth map and in the subsequent inpainting and editing procedures.

Text-Driven 3D Object Generation and Texturing. The great success of text-to-image synthesis empowers booming development in the domain of text-to-3D generation. Based on powerful 2D text-to-image diffusion models, DreamFusion [37] first proposes an effective Score Distillation Sampling (SDS) loss to guide the generation of 3D models. Later, a vast body of following works leverages the SDS loss to synthesize various 3D objects with higher quality and better 3D consistency [52, 24, 8, 55, 45, 23, 25, 38]. By training on massive 3D synthetic data, prior attempts [32, 22] on direct generation of 3D point clouds or meshes are shown to significantly accelerate the generation process. It is noteworthy that our method can capitalize on these generated 3D objects for room composition as input, despite their imperfect geometry. Given untextured 3D meshes and the conditioning text descriptions or reference images, some approaches [40, 7, 29, 53, 61, 58, 60, 4, 59, 27] also exploit the text-to-image diffusion models for 3D object texturing by using an iterative painting scheme or relying on the corresponding UV map. Unlike them, we aim to paint the entire room, including each independent 3D object inside, with high-quality and consistent textures.

Indoor Scene Generation with Panorama. Recently, MVDiffusion [50] and Ctrl-Room [14] subtly design their specific diffusion models to synthesize multi-view consistent images or 3D layouts of indoor scenes. However, they are both constrained to generating a panoramic image to represent the whole 3D room and thus wandering around the room is far beyond the capability of these methods. Although Ctrl-Room further combines the estimated depth map with the panoramic image to obtain a complete textured room, the occlusion area cannot be covered with a single panorama and the potential panoramic distortion remains out of reach. In contrast, our method aims to texture a compositional room mesh and leverages a panorama to ensure style consistency in the scene.

Text-Driven 3D Indoor Scene Generation and Texturing. Several prior works [11, 35] explore generating a 3D indoor scene by using 3D bounding boxes as layouts and optimizing the entire scene with SDS loss [37]. Yet, the texture quality is still far from satisfactory since the generated scene often looks unrealistic and over-saturated. Alternative approaches [20, 62, 15] adopt an incremental framework, where they mainly leverage image warping to obtain renderings from new viewpoints and then inpaint the missing areas based on the estimated depth map. However, inaccurate depth estimation leads to severe geometry distortion, significantly affecting the generation results. Moreover, RoomDreamer [49] jointly refines the geometry and texture of an existing indoor mesh via pretrained text-to-image diffusion models, but still cannot cope with the unobserved regions. Parallel to scene generation, a line of research works [57, 6, 21, 64], including ours, has started paying attention to 3D scene texturing, i.e., generating high-quality textures for given 3D scene-level meshes. Despite their compelling results, DreamSpace [57] and Text2Scene [21] have to rely on an initial room texture for the succeeding stylization, while concurrent works SceneTex [6] and SceneWiz3D [64] cannot support fine-grained texture controls due to their adopted optimization-based framework. Unlike any of the above, our model RoomTex targets texturing indoor scenes that consist of untextured 3D object meshes and simultaneously enables fine-grained controls over the scene.

3 Method

In this section, we present our coarse-to-fine generation framework, RoomTex, for synthesizing high-fidelity and style-consistent texture for a compositional room. We utilize off-the-shelf 3D shape generative models along with a given room layout to assemble the room mesh. In the coarse stage, the 3D room mesh is unwrapped to a panorama depth map, based on which we generate a panoramic image of the room as a coarse reference (Sec. 3.1). Then in the fine stage, the empty room will be further refined in perspective views (Sec. 3.2). Afterward, we employ an iterative inpainting pipeline to refine and paint every independent 3D object in the room (Sec. 3.3). To better align the generated texture and the guidance depth map, we introduce an edge-detection module to identify and then remove the misalignment areas between them (Sec. 3.4). Finally, our framework also supports interactive fine-grained texture control (Sec. 3.5). An overview of our framework is illustrated in Fig. 3.

Figure 3: Framework of RoomTex. We first generate a panoramic reference image of the indoor scene based on a depth map rendered from a compositional untextured room mesh. Based on the panorama, we refine and paint every object to obtain textured 3D objects. By integrating the objects and the empty room, we finally get a completely textured 3D indoor scene.

3.1 Panoramic Image Generation

Given the room layout, it is relatively straightforward to assemble an untextured mesh of the room as input by leveraging off-the-shelf 3D object generative models. Subsequently, a virtual depth camera is put at the center of the room, leading to a panoramic depth map $\mathbf{D}_{p}$ of the room via the equirectangular projection. Under the depth guidance $\mathbf{D}_{p}$ and text prompts $\mathbf{T}$, RoomTex utilizes powerful ControlNet [63] $\mathcal{F}_{i}(\cdot)$ to synthesize a panoramic image $\mathbf{I}_{p}$ as a coarse reference to maintain the style consistency of the generated texture:

$$\mathbf{I}_{p}=\mathcal{F}_{i}(\mathbf{D}_{p},\mathbf{T}). \tag{1}$$
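To make the panoramic depth input $\mathbf{D}_{p}$ concrete, the following is a minimal NumPy sketch of an equirectangular depth map seen from the room center. It is only an illustration under stated assumptions: an empty axis-aligned box room stands in for the rendered scene mesh, and the function names, the axis convention, and the room size are hypothetical rather than taken from the method. The ControlNet call $\mathcal{F}_{i}$ itself is not reproduced.

```python
import numpy as np

def equirect_directions(height, width):
    """Unit ray direction for every pixel of an equirectangular panorama.

    Assumed convention: x right, y down, z forward; longitude in [-pi, pi),
    latitude from +pi/2 (top row) to -pi/2 (bottom row).
    """
    lon = ((np.arange(width) + 0.5) / width - 0.5) * 2.0 * np.pi
    lat = (0.5 - (np.arange(height) + 0.5) / height) * np.pi
    lon, lat = np.meshgrid(lon, lat)  # both (H, W)
    return np.stack([np.cos(lat) * np.sin(lon),
                     -np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)

def box_room_depth(dirs, half_extents):
    """Distance from the room center to the walls of an axis-aligned box room."""
    half = np.asarray(half_extents, dtype=float)
    with np.errstate(divide="ignore"):
        # Distance at which each ray reaches the wall planes of each axis.
        t = half / np.abs(dirs)
    # The ray leaves the room at the nearest of those planes.
    return t.min(axis=-1)

dirs = equirect_directions(256, 512)
# Hypothetical 4 x 3 x 6 room; a real pipeline would render the scene mesh instead.
D_p = box_room_depth(dirs, half_extents=(2.0, 1.5, 3.0))
```

In a full pipeline the depth map would come from rendering the compositional mesh, objects included, and would then be passed together with the text prompt to a depth-conditioned generator as in Eq. 1.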

3.2 Empty Room Refinement

To enable more flexible editing of the generated scene, such as moving furniture items, we further generate a complete texture of the bare interior surfaces using a depth-aware inpainting model, so that the areas occluded by objects in the initial panorama $\mathbf{I}_{p}$ are filled. We first remove all the object meshes inside and obtain the panoramic depth of an empty room $\mathbf{D}_{r}$. All the occluded areas are denoted with a binary mask $\mathbf{M}_{r}$, and the inpainting can be represented with:

$$\mathbf{I}_{r}=\mathcal{F}_{\text{inp}}(\mathbf{I}_{p},\mathbf{M}_{r},\mathbf{D}_{r},\mathbf{T}) \tag{2}$$

where $\mathcal{F}_{\text{inp}}$ is the depth-aware inpainting model where ControlNet [63] is used, and the occluded areas in the input panorama $\mathbf{I}_{p}$ are assigned zero values during the inpainting. It is noteworthy that a more complete empty room texture is shown to be beneficial for the subsequent object generation process.
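The inputs of Eq. 2 can be sketched in a few lines of NumPy. The text does not state how $\mathbf{M}_{r}$ is computed, so the depth comparison below (a pixel is occluded where the full-scene depth is smaller than the empty-room depth) is an assumption made for illustration, as are the function names and the tolerance `eps`. The inpainting model $\mathcal{F}_{\text{inp}}$ is not reproduced.

```python
import numpy as np

def occlusion_mask(depth_full, depth_empty, eps=1e-3):
    """Assumed construction of M_r: 1 where an object hides the room surface."""
    return (depth_full < depth_empty - eps).astype(np.float32)

def prepare_inpaint_input(pano_rgb, mask):
    """Assign zero values to the occluded pixels of I_p before inpainting."""
    return pano_rgb * (1.0 - mask[..., None])
```

The zeroed panorama, the mask, the empty-room depth $\mathbf{D}_{r}$, and the prompt $\mathbf{T}$ would then be handed to the depth-aware inpainting model.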

Moreover, to cope with the distortion brought by the panoramic image $\mathbf{I}_{p}$, we carefully choose an overhead view targeting the floor and an upward view targeting the ceiling to refine these two important areas. The panoramic image $\mathbf{I}_{p}$ will be re-projected to these two perspective views $\mathbf{v}_{\text{floor}}$ and $\mathbf{v}_{\text{ceiling}}$ for the corresponding images $\mathbf{I}_{\text{floor}}$ and $\mathbf{I}_{\text{ceiling}}$ using

$$\mathbf{I}_{\cdot}=\mathcal{P}(\mathcal{T}_{\text{pano}\to\text{world}}\circ\mathbf{I}_{p},\mathbf{v}_{\cdot}), \tag{3}$$

where $\mathcal{T}_{\text{pano}\to\text{world}}$ is a transformation function that maps every single pixel in the panoramic image to spherical coordinates and then to world coordinates with the help of a panoramic depth map, and $\mathcal{P}(\cdot,\mathbf{v})$ is a projection function that projects the point cloud in world coordinates to a specific view $\mathbf{v}$. Regarding $\mathbf{I}_{\text{floor}}$ and $\mathbf{I}_{\text{ceiling}}$ as the initialization, we employ the aforementioned inpainting Eq. 2 to refine the floor and ceiling images under the guidance of the corresponding mask and the depth map.
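The two operators in Eq. 3 can be illustrated with a short NumPy sketch. It is a simplified point-splatting version under assumptions not stated in the text: a pinhole camera with intrinsics `K`, world-to-camera pose `(R, t)` with $\mathbf{x}_{\text{cam}} = R\,\mathbf{x}_{\text{world}} + t$, an x-right, y-down, z-forward axis convention, and a per-pixel z-buffer to resolve occlusion. All names are hypothetical.

```python
import numpy as np

def pano_to_world(pano_rgb, pano_depth, cam_center=None):
    """T_{pano->world}: lift every panorama pixel to a colored 3D point."""
    H, W = pano_depth.shape
    center = np.zeros(3) if cam_center is None else np.asarray(cam_center, float)
    lon = ((np.arange(W) + 0.5) / W - 0.5) * 2.0 * np.pi
    lat = (0.5 - (np.arange(H) + 0.5) / H) * np.pi
    lon, lat = np.meshgrid(lon, lat)
    dirs = np.stack([np.cos(lat) * np.sin(lon),
                     -np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)
    points = center + dirs * pano_depth[..., None]
    return points.reshape(-1, 3), pano_rgb.reshape(-1, 3)

def project_to_view(points, colors, K, R, t, height, width):
    """P(., v): splat colored world points into a pinhole view with a z-buffer."""
    cam = points @ R.T + t
    in_front = cam[:, 2] > 1e-6
    cam, colors = cam[in_front], colors[in_front]
    u = np.floor(K[0, 0] * cam[:, 0] / cam[:, 2] + K[0, 2]).astype(int)
    v = np.floor(K[1, 1] * cam[:, 1] / cam[:, 2] + K[1, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, colors = u[inside], v[inside], cam[inside, 2], colors[inside]
    # Keep only the nearest point per pixel.
    zbuf = np.full((height, width), np.inf)
    np.minimum.at(zbuf, (v, u), z)
    front = z <= zbuf[v, u]
    image = np.zeros((height, width, 3), dtype=float)
    image[v[front], u[front]] = colors[front]
    hit = np.isfinite(zbuf)  # pixels that received at least one point
    return image, hit
```

Pixels where `hit` is false received no point and would be the ones left to the subsequent inpainting step.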

3.3 Iterative Object Texturing

Notably, generating a panoramic image is not equivalent to generating a complete room texture that supports free novel view rendering inside the room, mainly due to the lack of information in the occluded areas. Therefore, it is necessary to apply an inpainting process to fill in the ‘other’ side of every single object, i.e., the missing areas in the panorama. In this stage, the global panorama $\mathbf{I}_{p}$ and the empty room panorama $\mathbf{I}_{r}$ are further used as references for the texturing of each object in the scene. Our method RoomTex aims to generate texture for a compositional 3D scene and thus conducts the inpainting for each independent object separately.

For a specific 3D object to be painted, we will select an initial perspective view $\mathbf{v}_{0}$ targeting it and re-project $\mathbf{I}_{p}$ and $\mathbf{I}_{r}$ to obtain foreground and background images $\mathbf{I}^{\text{fg}}_{\text{obj}}$ and $\mathbf{I}^{\text{bg}}_{\text{obj}}$ of this object via Eq. 3. In particular, the perspective image resolution is set as a constant, and the focal length of the initial view $\mathbf{v}_{0}$ is related to the size of the specific object and the relative distance between the camera and the object. With perspective images $\mathbf{I}_{\text{bg}}$ and $\mathbf{I}_{\text{fg}}$, we further integrate them into one image $\hat{\mathbf{I}_{\text{obj}}}$ using the object mask $\mathbf{M}_{\text{obj}}$ as follows:

$$\hat{\mathbf{I}_{\text{obj}}}=\mathbf{I}_{\text{fg}}\odot\mathbf{M}_{\text{obj}}+\mathbf{I}_{\text{bg}}\odot(1-\mathbf{M}_{\text{obj}}). \tag{4}$$

Afterward, this fused image $\hat{\mathbf{I}_{\text{obj}}}$ is refined to a new image $\mathbf{I}_{\text{obj}}$ with less distortion and higher resolution via Eq. 2.

Figure 4: Iterative inpainting. We leverage the object depth to unproject only object areas of the initial image to the world coordinates. Then, we choose a group of suitable views and iteratively warp the 3D object to these views, under which the untextured area will be filled with diffusion-based inpainting (dense areas) and interpolation-based inpainting (sparse areas).

After obtaining the refined initial view $\mathbf{I}_{\text{obj}}$ of the object, we leverage iterative inpainting to completely texture this object as illustrated in Fig. 4. To this end, we additionally select a series of camera viewpoints $\{\mathbf{v}_{i}\},i=1,2,\dots,N$, which cover the object as comprehensively as possible. The first eight views place the camera on a sphere centered on the object, looking at the center of the object. To be specific, the radius of the sphere is set to be slightly larger than the object and the polar angle is set to $\pi/4$ and $3\pi/4$. Moreover, some additional views will be picked if this 3D object is out of the view range of the eight selected camera poses. Once the group of views $\left\{\mathbf{v}_{i}\right\}$ is acquired, we can apply an iterative warping and inpainting process to texture this 3D object. First of all, the initial image is unprojected to a partial point cloud $\mathbf{P}_{\text{obj}}$ in the world coordinates under the depth guidance:

$$\mathbf{P}_{\text{obj}}=\mathcal{P}^{-1}(\mathbf{I}_{\text{obj}}\odot\mathbf{M}_{\text{obj}},\mathbf{v}_{0}), \tag{5}$$

where $\mathcal{P}^{-1}(\cdot,\mathbf{v})$ aims to unproject pixels in the perspective view $\mathbf{v}$ back to the world coordinate.
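The placement of the first eight views described above can be sketched as follows. Several details are assumptions made for illustration rather than facts from the text: four equally spaced azimuths per polar angle (which gives eight views in total), a y-up world with the polar angle measured from the up axis, a caller-supplied radius, and a camera frame with x right, y down, z forward. The function names are hypothetical.

```python
import numpy as np

def look_at(eye, target, up=(0.0, 1.0, 0.0)):
    """World-to-camera pose (R, t) with x_cam = R @ x_world + t.

    Camera axes: x right, y down, z forward (toward the target).
    """
    forward = np.asarray(target, float) - np.asarray(eye, float)
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, np.asarray(up, float))
    right /= np.linalg.norm(right)
    down = np.cross(forward, right)
    R = np.stack([right, down, forward])
    t = -R @ np.asarray(eye, float)
    return R, t

def sphere_views(center, radius, polar_angles=(np.pi / 4, 3 * np.pi / 4), n_azimuth=4):
    """Cameras on a sphere around `center`, all looking at the center."""
    center = np.asarray(center, float)
    views = []
    for theta in polar_angles:          # polar angle measured from the +y axis
        for k in range(n_azimuth):      # assumed: equally spaced azimuths
            phi = 2.0 * np.pi * k / n_azimuth
            offset = radius * np.array([np.sin(theta) * np.cos(phi),
                                        np.cos(theta),
                                        np.sin(theta) * np.sin(phi)])
            eye = center + offset
            views.append((eye, *look_at(eye, center)))
    return views
```

Because the polar angles $\pi/4$ and $3\pi/4$ never align the viewing direction with the up axis, the cross product in `look_at` stays well defined for these eight views.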

For a novel view $\mathbf{v}_{i}$, the partial point cloud $\mathbf{P}_{\text{obj}}$ will be warped to a novel view image $\hat{\mathbf{I}_{\text{obj, i}}}$ to be inpainted via Eq. 12, and the corresponding inpainting mask $\mathbf{M}_{\text{obj, i}}$ is obtained simultaneously. During the inpainting, we initialize the missing areas with the nearest neighbor color to guide the inpainting model toward a style-consistent texture, and then inpaint the image $\hat{\mathbf{I}_{\text{obj, i}}}$ to $\mathbf{I}_{\text{obj, i}}$ via the aforementioned Eq. 2. However, this inpainted image $\mathbf{I}_{\text{obj, i}}$ inevitably contains numerous sparse and small holes due to the sparsity of the partial point cloud $\mathbf{P}_{\text{obj}}$, which cannot be filled well with diffusion-based inpainting models like ControlNet [63]. Therefore, we additionally leverage an interpolation-based inpainting method $\mathcal{F}_{\text{intp}}$ to fill these sparse areas indicated by a binary mask $\mathbf{M}^{s}_{\text{obj, i}}$. Combining with the room background, the final image under this novel view $\mathbf{I}^{\text{final}}_{\text{obj, i}}$ can be represented as follows:

$$\begin{split}\mathbf{I}^{\text{final}}_{\text{obj, i}}&=\hat{\mathbf{I}_{\text{obj, i}}}\odot(1-\mathbf{M}_{\text{obj, i}})+\mathcal{F}_{\text{intp}}(\hat{\mathbf{I}_{\text{obj, i}}})\odot\mathbf{M}^{s}_{\text{obj, i}}\\&\quad+\mathbf{I}_{\text{obj, i}}\odot(\mathbf{M}_{\text{obj, i}}-\mathbf{M}^{s}_{\text{obj, i}}).\end{split} \tag{6}$$
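The three-way blend of Eq. 6 can be written directly in NumPy. The sketch below assumes that the sparse mask is a subset of the inpainting mask and that images are floating-point arrays. The interpolation method $\mathcal{F}_{\text{intp}}$ is not specified in the text, so a simple mean over valid 4-neighbors is swapped in as a stand-in; the function names are hypothetical.

```python
import numpy as np

def neighbor_mean_fill(image, hole_mask):
    """Stand-in for F_intp: fill holes with the mean of valid 4-neighbors."""
    H, W = hole_mask.shape
    valid = ~hole_mask
    img = np.where(valid[..., None], image, 0.0)
    pad_img = np.pad(img, ((1, 1), (1, 1), (0, 0)))
    pad_valid = np.pad(valid.astype(float), 1)
    acc = np.zeros_like(img)
    cnt = np.zeros((H, W))
    for dy, dx in ((0, 1), (2, 1), (1, 0), (1, 2)):  # up, down, left, right
        acc += pad_img[dy:dy + H, dx:dx + W]
        cnt += pad_valid[dy:dy + H, dx:dx + W]
    filled = acc / np.maximum(cnt, 1.0)[..., None]
    return np.where(hole_mask[..., None], filled, image)

def compose_final(warped, inpainted, mask, sparse_mask):
    """Eq. 6: warped pixels outside M, interpolation on M^s, diffusion on M - M^s."""
    m = mask.astype(float)[..., None]
    ms = sparse_mask.astype(float)[..., None]
    interp = neighbor_mean_fill(warped, mask.astype(bool))
    return warped * (1.0 - m) + interp * ms + inpainted * (m - ms)
```

Because the three weights $1-\mathbf{M}$, $\mathbf{M}^{s}$, and $\mathbf{M}-\mathbf{M}^{s}$ sum to one at every pixel, each output pixel comes from exactly one of the three sources.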

Afterward, the final image in this novel view will be unprojected to the world coordinates under the depth guidance and then merged with the previous partial point cloud $\mathbf{P}_{\text{obj}}$ following:

$$\mathbf{P}_{\text{obj}}:=\mathbf{P}_{\text{obj}}\cup(\mathcal{P}^{-1}(\mathbf{I}^{\text{final}}_{\text{obj, i}}\odot\mathbf{M}_{\text{obj, i}},\mathbf{v}_{i})), \tag{7}$$

where the updated point cloud will engage in the next iteration of object texturing, and $\cup$ denotes the set union operation. The iterative inpainting will be conducted following our selected camera poses $\{\mathbf{v}_{i}\},i=1,2,\dots,N$, until the 3D object is completely textured.
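The unprojection $\mathcal{P}^{-1}$ of Eq. 5 and the merge of Eq. 7 can be sketched as below. The sketch assumes a pinhole camera with intrinsics `K`, a world-to-camera pose `(R, t)` with $\mathbf{x}_{\text{cam}} = R\,\mathbf{x}_{\text{world}} + t$, rays through pixel centers, and a point cloud stored as a pair of arrays (positions, colors). The set union is approximated by plain concatenation without de-duplication; all names are hypothetical.

```python
import numpy as np

def unproject(image, depth, mask, K, R, t):
    """P^{-1}: lift the masked pixels of a view to colored world points."""
    v, u = np.nonzero(mask)
    z = depth[v, u]
    # Homogeneous pixel centers, back-projected through the intrinsics.
    pix = np.stack([u + 0.5, v + 0.5, np.ones_like(z)], axis=-1)
    cam = (pix @ np.linalg.inv(K).T) * z[:, None]
    world = (cam - t) @ R  # row-vector form of R^T (x_cam - t)
    return world, image[v, u]

def merge_clouds(cloud, new_cloud):
    """Eq. 7 as concatenation: P_obj := P_obj U new points."""
    return (np.concatenate([cloud[0], new_cloud[0]]),
            np.concatenate([cloud[1], new_cloud[1]]))
```

In the iterative loop, each new view would warp the current cloud, inpaint the missing pixels, and call `unproject` on the newly painted region before merging, so the cloud grows until the object is covered.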

Figure 5: Misalignment removal. We first get the Canny edges of RGB images and Laplacian edges of depth maps as shown in (a) and (b). (c) shows the misalignment areas between texture and depth, which will be removed during the unprojection.
Figure 6: Fine-grained texture control. Users could interactively point out the specific area they would like to edit together with a sketch illustrating how they would like to edit this area. We take the sketch as an additional input to ControlNet to achieve fine-grained control.

3.4 Misalignment Removal

As illustrated in Fig. 5, the generated texture cannot perfectly align with the depth map, especially in the areas around the depth edge. To mitigate this problem, a carefully designed edge detection module is proposed to identify the misalignment areas. We denote a perspective image from diffusion models as $\mathbf{I}$, its Canny edges as $\mathcal{E}_{C}(\mathbf{I})$, depth map as $\mathbf{D}$, and its Laplacian edges as $\mathcal{E}_{L}(\mathbf{D})$. Then, our method will leverage traditional erosion and dilation operations to determine the misalignment areas. Specifically, we filter out all the irrelevant areas by dilating the Laplacian edges and only keep the overlapping part:

$$\hat{\mathcal{E}_{C}(\mathbf{I})}=\mathcal{E}_{C}(\mathbf{I})\cap\text{Dilate}(\mathcal{E}_{L}(\mathbf{D})). \tag{8}$$

Afterward, the mask of the misaligned area is obtained by a dilation followed by an erosion:

$$\mathbf{M}_{\text{mis}} = \text{Erode}\big(\text{Dilate}(\hat{\mathcal{E}_{C}(\mathbf{I})} \cup \mathcal{E}_{L}(\mathbf{D}))\big). \tag{9}$$

Misalignment areas will not be taken into consideration during the unprojection.
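The two morphological steps in Eqs. (8) and (9) can be sketched in a few lines. The following is a minimal NumPy illustration rather than the authors' implementation: the square structuring element, the kernel radii `k_filter` and `k_close`, the Laplacian threshold, and all function names are assumptions made for this example. The Canny edge map is taken as a precomputed boolean input (for instance from an image processing library).

```python
import numpy as np


def dilate(mask: np.ndarray, k: int = 1) -> np.ndarray:
    """Binary dilation with a (2k+1) x (2k+1) square structuring element."""
    h, w = mask.shape
    padded = np.pad(mask, k, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    for dy in range(2 * k + 1):
        for dx in range(2 * k + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def erode(mask: np.ndarray, k: int = 1) -> np.ndarray:
    """Binary erosion, computed as the dual of dilation."""
    return ~dilate(~mask, k)


def laplacian_edges(depth: np.ndarray, thresh: float) -> np.ndarray:
    """Threshold the absolute 4-neighbor Laplacian of a depth map."""
    d = np.pad(depth.astype(np.float64), 1, mode="edge")
    lap = (d[:-2, 1:-1] + d[2:, 1:-1] + d[1:-1, :-2] + d[1:-1, 2:]
           - 4.0 * d[1:-1, 1:-1])
    return np.abs(lap) > thresh


def misalignment_mask(canny_edges, depth_edges, k_filter=2, k_close=1):
    """Boolean mask of pixels to discard during unprojection."""
    canny_edges = np.asarray(canny_edges, dtype=bool)
    depth_edges = np.asarray(depth_edges, dtype=bool)
    # Eq. (8): keep only RGB edges that lie near a depth edge.
    e_c_hat = canny_edges & dilate(depth_edges, k_filter)
    # Eq. (9): dilate the union of both edge sets, then erode it.
    return erode(dilate(e_c_hat | depth_edges, k_close), k_close)
```

With these assumed radii, an RGB edge lying two pixels away from a depth edge is kept by Eq. (8), and the dilation-then-erosion of Eq. (9) fills the gap between the two edges, so the whole band between them is excluded from unprojection. RGB edges far from any depth edge (ordinary texture detail) are ignored.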

3.5 Fine-grained Texture Control

In practice, users may not be completely satisfied with the generated textures and may want to rectify specific areas, or even provide more detailed controls to ensure the generated textures meet their expectations, as illustrated in Fig. 6. We denote the image from the specific viewpoint $\mathbf{v}$ that the user wishes to edit as $\mathbf{I}$, the area they want to edit as $\mathbf{M}$, the corresponding depth map as $\mathbf{D}$, and a sketch illustrating how they would like to edit this area as $\mathbf{S}$. We then repaint the masked-out area according to the additional sketch condition:

$$\mathbf{I}^{\prime} = \mathcal{F}_{s}(\mathbf{I}, \mathbf{M}, \mathbf{D}, \mathbf{S}, \mathbf{T}). \tag{10}$$

Afterward, the newly painted area will be projected back to the world coordinates, updating the point cloud of the scene $\mathbf{P}_{\text{scene}}$:

$$\mathbf{P}_{\text{scene}} := \mathbf{P}_{\text{scene}} \setminus \mathbf{P}_{\text{orig}} \cup \big(\mathcal{P}^{-1}(\mathbf{I}^{\prime} \odot \mathbf{M}, \mathbf{v})\big), \tag{11}$$

where $\mathbf{P}_{\text{orig}}$ denotes the original point cloud of the masked area and $\setminus$ stands for the set subtraction operation. It is worth noting that users may generate their desired sketches using text-to-image models. Moreover, if users just want to repaint several unsatisfactory objects, our method also supports object-level texture control, such as changing the color with mere text guidance, as shown in Fig. 2.
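The update in Eq. (11) amounts to replacing the points of the edited region with the newly unprojected ones. A minimal sketch is given below, assuming the scene is stored as an array with one row per point (for example position followed by color) and that a boolean flag per point marks membership in the original masked area; the function name and this storage layout are hypothetical choices for illustration only.

```python
import numpy as np


def update_scene_points(scene_points: np.ndarray,
                        in_masked_area: np.ndarray,
                        new_points: np.ndarray) -> np.ndarray:
    """Drop the points of the edited area, then append the repainted ones."""
    # Set subtraction: keep every point that is not part of the masked area.
    kept = scene_points[~np.asarray(in_masked_area, dtype=bool)]
    # Set union: add the points unprojected from the repainted image.
    return np.concatenate([kept, new_points], axis=0)
```

The expression is read left to right here, i.e., the old points are removed before the new ones are added, which matches the description of the repainted area replacing the original texture.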

4 Experiment

In this section, we present qualitative and quantitative results as well as our ablation study. We first briefly introduce the baseline methods; for implementation details, please refer to the supplemental material.

Figure 7: We compare our generated textured scene with TEXTure[40] and SceneTex[6], where the figures include an overhead view and several views rendered from inside of the scene. Our reference panorama is also shown. (Zoom in for best view)
Figure 8: Scene editing. Here we show how the compositional design empowers flexible editing. Rearranging, removing, adding, deleting, duplicating, and rescaling objects in different scenes are naturally supported.
Figure 9: Fine-grained texture control results. We show two examples of interactive fine-grained texture control. On the left, we show a top view comparison of the floor before and after re-patterning the floor texture. On the right, we show an upward view comparison of the ceiling before and after repainting the ceiling texture.

Baselines. We compare our method against two recent texture synthesis methods, TEXTure [40] and SceneTex [6], where TEXTure is evaluated in both a holistic and a compositional variant. As for MVDiffusion [50], we found that the pre-trained depth-guided holistic generation model only works well for ScanNet [12] and only generates blurry images that do not align well with the depth guidance in our comparison cases.

  • TEXTure-H [40]: We compare with TEXTure, whose input is an untextured mesh and text prompts. We implement this baseline holistically by removing several walls and the ceiling in our scenes so that the texture of interior surfaces can be generated.
  • TEXTure-C [40]: We implement this baseline by compositionally generating the texture of every object mesh and the room interior surfaces, whose several walls and ceilings are also removed. Afterward, we integrate all components into a holistic mesh.
  • SceneTex [6]: We compare with SceneTex, whose input is also an untextured mesh and text prompts. We utilize the exact same text description as ours to guide the generation process.

Evaluation Metrics. The generated 3D textured room is evaluated both quantitatively and qualitatively. We leverage the Aesthetic Score (AS) introduced by LAION [43], CLIP Score (CS) [39], and BRISQUE (BQ) [30] to reflect the image quality of the generated scenes.

4.1 Qualitative Results

Comparison to baselines. For comparison with baseline methods, we generate three scenes including a bathroom, a bedroom, and a living room. We show top-down views into the scene and several perspective images for our method and baselines in Fig. 7. Neither TEXTure-H nor TEXTure-C can generate satisfactory interior surfaces, and unnatural spots and stripes may appear on the individual objects. Furthermore, the texture generated by TEXTure-C is inconsistent for lack of a holistic constraint and the texture generated by TEXTure-H is also of low quality due to the inaccurate description. SceneTex [6] can generate a relatively consistent texture for the whole scene including interior surfaces. However, the generated texture contains lots of unsatisfactory noise, unreasonable lighting, and severe misalignment between texture and geometry. For example, strange circle-like objects appear on the walls of the bathroom and a strange box-like object appears in front of the bed in the bedroom. More crucially, SceneTex cannot thoroughly settle the occlusion problem in the scene, resulting in blurry texture in the occluded areas. In contrast, our approach creates a highly detailed and compelling texture for all the 3D objects in the scene. Thanks to the panorama reference, the texture style between different objects and the context (mostly walls and the floor) is coherent. Moreover, the style of the room can be easily modified as shown in Fig. 1.

Scene editing and Fine-grained texture control. Our texture is generated for a compositional scene including objects and surroundings, and the representation of every object is essentially a colored point cloud. Therefore, editing like adding, duplicating, removing, rotating, moving, and rescaling objects can be simply supported. In Fig. 8, we include several editing results of rendered images under various views in different scenes. Moreover, as illustrated in Sec. 3.5, interactive fine-grained control over texture can be easily achieved and two additional examples are provided in Fig. 9.

4.2 Quantitative Results

We show quantitative results averaged over multiple scenes, including a bathroom, a bedroom, and a living room, in Tab. 1. We render about 100 images from novel views for each scene to calculate the three metrics. Blurry and messy textures lead to lower image-based scores for the baselines. As for the computation of CS, it is hard to offer an accurate description for every rendered image, so the general text input is used instead. TEXTure-C leverages a text prompt for every object inside the scene, thus leading to a higher CS than ours. SceneTex is prone to generating evident lighting, leading to a better (lower) BQ value under specific viewpoints, since the BQ metric favors images with evident lighting.

Table 1: Quantitative comparison. We report image quality metrics including Aesthetic Score (AS), CLIP Score (CS), and BRISQUE (BQ). Our method outperforms baselines on AS, slightly worse than TEXTure-C in terms of CS and slightly worse than SceneTex in terms of BQ.

| Method | AS ($\uparrow$) | CS ($\uparrow$) | BQ ($\downarrow$) |
|---|---|---|---|
| TEXTure-C [40] | 5.11 | 30.17 | 39.78 |
| TEXTure-H [40] | 4.33 | 27.86 | 47.54 |
| SceneTex [6] | 4.77 | 26.43 | 26.91 |
| Ours | 5.20 | 29.16 | 30.91 |

4.3 Ablation Study

Panorama distortion elimination. It is noteworthy that, in addition to the object distortion brought by the equirectangular projection, the texture of the ceilings, floor, and baseboards may also suffer from distortion, as shown in Fig. 10. Such texture is unacceptable and may further degrade the iterative object inpainting process, since the room context (background texture) is unrealistic. After choosing an overhead and an upward view to repaint the floors and ceilings, we obtain a textured empty room with less distortion.

Initialization of untextured areas. As described in Sec. 3.3, initializing the untextured areas may contribute to a higher inpainting quality if the missing texture is of a similar color to its nearby area. As shown in Fig. 11, whether or not we fill the untextured areas with the nearest color makes a huge difference in the generation quality.
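To make the nearest-color initialization concrete, a minimal brute-force sketch follows. It assumes a rendered image together with a boolean map of already textured pixels, and uses the Euclidean pixel distance to pick the nearest textured pixel; the function name, the distance measure, and the tie-breaking behavior are assumptions for illustration, and a practical implementation would likely use a distance transform for speed.

```python
import numpy as np


def fill_with_nearest_color(image: np.ndarray, textured: np.ndarray) -> np.ndarray:
    """Copy to every untextured pixel the color of its nearest textured pixel."""
    textured = np.asarray(textured, dtype=bool)
    ys, xs = np.nonzero(textured)
    out = image.copy()
    if ys.size == 0:
        return out  # nothing to copy from
    for y, x in zip(*np.nonzero(~textured)):
        # Squared Euclidean distance to every textured pixel (brute force).
        nearest = np.argmin((ys - y) ** 2 + (xs - x) ** 2)
        out[y, x] = image[ys[nearest], xs[nearest]]
    return out
```

The filled image then serves only as an initialization: the inpainting model still repaints the untextured area, but starts from colors that resemble the surrounding texture.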

Figure 10: Ablation study on distortion elimination. Floors and ceilings look significantly better after repainting than without it.
Figure 11: Ablation study on initialization of untextured areas. We show generated texture without initializing the untextured areas, where messy texture and a large inharmonious white area appear. In contrast, the initialization can mitigate this problem.
Figure 12: Ablation study on misalignment detection. We show results of using different misalignment detection techniques, including using no guidance, mere RGB guidance, mere depth guidance, and both. Only using both, as in our method, prevents the gray texture of the pillow from bleeding onto the headboard.

Misalignment detection. As mentioned in Sec. 3.4, misalignment between the generated texture and the guiding depth map often occurs, which causes the foreground texture to bleed into the background, as shown in the top-left image in Fig. 12. For example, the texture of the pillows may spill into the area of the headboard, which interferes with the inpainting model in later iterations, since the texture warped from previous frames is not convincing. These areas are detected and then discarded during the unprojection. Reprojection results using different kinds of projection masks are visualized from a new viewpoint.

5 Conclusion and Limitation

We have proposed a novel text-driven indoor scene texture generation framework, which is capable of generating high-fidelity and coherent texture that aligns well with geometry. The crux of our approach is to first synthesize a panoramic image as a holistic reference for style consistency and then inpaint every object iteratively to support a compositional 3D scene with complete and harmonious textures. Experimental results demonstrate the superiority of our approach concerning generation quality and editing flexibility.
Limitations and future work. Our iterative inpainting strategy is incapable of capturing all views of a 3D object in one run, potentially leading to inconsistent texture despite all the refinement we have applied. We believe multi-view diffusion models [45, 50] may mitigate this problem.

References

  • [1] Balaji, Y., Nah, S., Huang, X., Vahdat, A., Song, J., Kreis, K., Aittala, M., Aila, T., Laine, S., Catanzaro, B., et al.: ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324 (2022)
  • [2] Bokhovkin, A., Tulsiani, S., Dai, A.: Mesh2tex: Generating mesh textures from image queries. arXiv preprint arXiv:2304.05868 (2023)
  • [3] Canny, J.: A computational approach to edge detection. IEEE Transactions on pattern analysis and machine intelligence (6), 679–698 (1986)
  • [4] Cao, T., Kreis, K., Fidler, S., Sharp, N., Yin, K.: Texfusion: Synthesizing 3d textures with text-guided image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4169–4181 (2023)
  • [5] Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., Yu, F.: ShapeNet: An Information-Rich 3D Model Repository. Tech. Rep. arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago (2015)
  • [6] Chen, D.Z., Li, H., Lee, H.Y., Tulyakov, S., Nießner, M.: Scenetex: High-quality texture synthesis for indoor scenes via diffusion priors. arXiv preprint arXiv:2311.17261 (2023)
  • [7] Chen, D.Z., Siddiqui, Y., Lee, H.Y., Tulyakov, S., Nießner, M.: Text2tex: Text-driven texture synthesis via diffusion models. arXiv preprint arXiv:2303.11396 (2023)
  • [8] Chen, R., Chen, Y., Jiao, N., Jia, K.: Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. arXiv preprint arXiv:2303.13873 (2023)
  • [9] Chen, Y., Chen, R., Lei, J., Zhang, Y., Jia, K.: Tango: Text-driven photorealistic and robust 3d stylization via lighting decomposition. NeurIPS (2022)
  • [10] Chen, Z., Yin, K., Fidler, S.: Auv-net: Learning aligned uv maps for texture transfer and synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1465–1474 (2022)
  • [11] Cohen-Bar, D., Richardson, E., Metzer, G., Giryes, R., Cohen-Or, D.: Set-the-scene: Global-local training for generating controllable nerf scenes. arXiv preprint arXiv:2303.13450 (2023)
  • [12] Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: Scannet: Richly-annotated 3d reconstructions of indoor scenes. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5828–5839 (2017)
  • [13] Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3d objects. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13142–13153 (2023)
  • [14] Fang, C., Hu, X., Luo, K., Tan, P.: Ctrl-room: Controllable text-to-3d room meshes generation with layout constraints. arXiv preprint arXiv:2310.03602 (2023)
  • [15] Fridman, R., Abecasis, A., Kasten, Y., Dekel, T.: Scenescape: Text-driven consistent scene generation. arXiv preprint arXiv:2302.01133 (2023)
  • [16] Fu, H., Cai, B., Gao, L., Zhang, L.X., Wang, J., Li, C., Zeng, Q., Sun, C., Jia, R., Zhao, B., et al.: 3d-front: 3d furnished rooms with layouts and semantics. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10933–10942 (2021)
  • [17] Gao, J., Shen, T., Wang, Z., Chen, W., Yin, K., Li, D., Litany, O., Gojcic, Z., Fidler, S.: Get3d: A generative model of high quality 3d textured shapes learned from images. Advances In Neural Information Processing Systems 35, 31841–31854 (2022)
  • [18] Gupta, A., Xiong, W., Nie, Y., Jones, I., Oğuz, B.: 3dgen: Triplane latent diffusion for textured mesh generation. arXiv preprint arXiv:2303.05371 (2023)
  • [19] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. NeurIPS 33, 6840–6851 (2020)
  • [20] Höllein, L., Cao, A., Owens, A., Johnson, J., Nießner, M.: Text2room: Extracting textured 3d meshes from 2d text-to-image models. arXiv preprint arXiv:2303.11989 (2023)
  • [21] Hwang, I., Kim, H., Kim, Y.M.: Text2scene: Text-driven indoor scene stylization with part-aware details. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1890–1899 (2023)
  • [22] Jun, H., Nichol, A.: Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463 (2023)
  • [23] Li, W., Chen, R., Chen, X., Tan, P.: Sweetdreamer: Aligning geometric priors in 2d diffusion for consistent text-to-3d. arXiv preprint arXiv:2310.02596 (2023)
  • [24] Lin, C.H., Gao, J., Tang, L., Takikawa, T., Zeng, X., Huang, X., Kreis, K., Fidler, S., Liu, M.Y., Lin, T.Y.: Magic3d: High-resolution text-to-3d content creation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 300–309 (2023)
  • [25] Liu, Z., Li, Y., Lin, Y., Yu, X., Peng, S., Cao, Y.P., Qi, X., Huang, X., Liang, D., Ouyang, W.: Unidream: Unifying diffusion priors for relightable text-to-3d generation. arXiv preprint arXiv:2312.08754 (2023)
  • [26] Liu, Z., Feng, Y., Black, M.J., Nowrouzezahrai, D., Paull, L., Liu, W.: Meshdiffusion: Score-based generative 3d mesh modeling. arXiv preprint arXiv:2303.08133 (2023)
  • [27] Metzer, G., Richardson, E., Patashnik, O., Giryes, R., Cohen-Or, D.: Latent-nerf for shape-guided generation of 3d shapes and textures. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12663–12673 (2023)
  • [28] Michel, O., Bar-On, R., Liu, R., Benaim, S., Hanocka, R.: Text2mesh: Text-driven neural stylization for meshes. CVPR (2022)
  • [29] Michel, O., Bar-On, R., Liu, R., Benaim, S., Hanocka, R.: Text2mesh: Text-driven neural stylization for meshes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13492–13502 (2022)
  • [30] Mittal, A., Moorthy, A.K., Bovik, A.C.: No-reference image quality assessment in the spatial domain. IEEE Transactions on image processing 21(12), 4695–4708 (2012)
  • [31] Mou, C., Wang, X., Xie, L., Zhang, J., Qi, Z., Shan, Y., Qie, X.: T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453 (2023)
  • [32] Nichol, A., Jun, H., Dhariwal, P., Mishkin, P., Chen, M.: Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751 (2022)
  • [33] Nichol, A.Q., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., Mcgrew, B., Sutskever, I., Chen, M.: Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In: International Conference on Machine Learning. pp. 16784–16804. PMLR (2022)
  • [34] Oechsle, M., Mescheder, L., Niemeyer, M., Strauss, T., Geiger, A.: Texture fields: Learning texture representations in function space. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4531–4540 (2019)
  • [35] Po, R., Wetzstein, G.: Compositional 3d scene generation using locally conditioned diffusion. arXiv preprint arXiv:2303.12218 (2023)
  • [36] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)
  • [37] Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3d using 2d diffusion. ICLR (2023)
  • [38] Qiu, L., Chen, G., Gu, X., Zuo, Q., Xu, M., Wu, Y., Yuan, W., Dong, Z., Bo, L., Han, X.: Richdreamer: A generalizable normal-depth diffusion model for detail richness in text-to-3d. arXiv preprint arXiv:2311.16918 (2023)
  • [39] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
  • [40] Richardson, E., Metzer, G., Alaluf, Y., Giryes, R., Cohen-Or, D.: Texture: Text-guided texturing of 3d shapes. arXiv preprint arXiv:2302.01721 (2023)
  • [41] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
  • [42] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems 35, 36479–36494 (2022)
  • [43] Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., Komatsuzaki, A.: Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021)
  • [44] Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2556–2565 (2018)
  • [45] Shi, Y., Wang, P., Ye, J., Long, M., Li, K., Yang, X.: Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512 (2023)
  • [46] Siddiqui, Y., Alliegro, A., Artemov, A., Tommasi, T., Sirigatti, D., Rosov, V., Dai, A., Nießner, M.: Meshgpt: Generating triangle meshes with decoder-only transformers. arXiv preprint arXiv:2311.15475 (2023)
  • [47] Siddiqui, Y., Thies, J., Ma, F., Shan, Q., Nießner, M., Dai, A.: Texturify: Generating textures on 3d shape surfaces. In: European Conference on Computer Vision. pp. 72–88. Springer (2022)
  • [48] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: International conference on machine learning. pp. 2256–2265. PMLR (2015)
  • [49] Song, L., Cao, L., Xu, H., Kang, K., Tang, F., Yuan, J., Zhao, Y.: Roomdreamer: Text-driven 3d indoor scene synthesis with coherent geometry and texture. arXiv preprint arXiv:2305.11337 (2023)
  • [50] Tang, S., Zhang, F., Chen, J., Wang, P., Furukawa, Y.: Mvdiffusion: Enabling holistic multi-view image generation with correspondence-aware diffusion. arXiv preprint arXiv:2307.01097 (2023)
  • [51] Voynov, A., Aberman, K., Cohen-Or, D.: Sketch-guided text-to-image diffusion models. In: ACM SIGGRAPH 2023 Conference Proceedings. pp. 1–11 (2023)
  • [52] Wang, H., Du, X., Li, J., Yeh, R.A., Shakhnarovich, G.: Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12619–12629 (2023)
  • [53] Wang, T., Kanakis, M., Schindler, K., Van Gool, L., Obukhov, A.: Breathing new life into 3d assets with generative repainting. arXiv preprint arXiv:2309.08523 (2023)
  • [54] Wang, X., Xie, L., Dong, C., Shan, Y.: Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1905–1914 (2021)
  • [55] Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. arXiv preprint arXiv:2305.16213 (2023)
  • [56] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13(4), 600–612 (2004)
  • [57] Yang, B., Dong, W., Ma, L., Hu, W., Liu, X., Cui, Z., Ma, Y.: Dreamspace: Dreaming your room space with text-driven panoramic texture propagation. arXiv preprint arXiv:2310.13119 (2023)
  • [58] Yeh, Y.Y., Huang, J.B., Kim, C., Xiao, L., Nguyen-Phuoc, T., Khan, N., Zhang, C., Chandraker, M., Marshall, C.S., Dong, Z., et al.: Texturedreamer: Image-guided texture synthesis through geometry-aware diffusion. arXiv preprint arXiv:2401.09416 (2024)
  • [59] Youwang, K., Oh, T.H., Pons-Moll, G.: Paint-it: Text-to-texture synthesis via deep convolutional texture map optimization and physically-based rendering. arXiv preprint arXiv:2312.11360 (2023)
  • [60] Yu, X., Dai, P., Li, W., Ma, L., Liu, Z., Qi, X.: Texture generation on 3d meshes with point-uv diffusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4206–4216 (2023)
  • [61] Zeng, X., Chen, X., Qi, Z., Liu, W., Zhao, Z., Wang, Z., FU, B., Liu, Y., Yu, G.: Paint3d: Paint anything 3d with lighting-less texture diffusion models (2023)
  • [62] Zhang, J., Li, X., Wan, Z., Wang, C., Liao, J.: Text2nerf: Text-driven 3d scene generation with neural radiance fields. arXiv preprint arXiv:2305.11588 (2023)
  • [63] Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3836–3847 (2023)
  • [64] Zhang, Q., Wang, C., Siarohin, A., Zhuang, P., Xu, Y., Yang, C., Lin, D., Zhou, B., Tulyakov, S., Lee, H.Y.: Scenewiz3d: Towards text-guided 3d scene composition. arXiv preprint arXiv:2312.08885 (2023)

Appendix 0.A More Implementation Details

We provide additional implementation details in the following subsections. All of our experiments are conducted on 4 NVIDIA A100 GPUs, and it takes about 90 minutes to generate a scene.

0.A.1 Text Prompt

Our method takes a room text prompt along with a compositional mesh based on a given room layout as input and aims to synthesize a complete 3D room texture enabling free novel view rendering inside. Each text prompt is composed of two parts, the style and the description of all the objects in the scene. During the experiments, we utilize the ‘Emauromin style’ as our default style and also use other styles including ‘Misc Kawaii’, ‘Anime’, ‘Game Pokemon’, ‘Artstyle Impressionist’, and so on, which can be found on this website (https://stable-diffusion-art.com/sdxl-styles/). For better stylization results, we also use the corresponding negative prompt for each style. For example, here is one of the text prompts used to generate a bedroom:
Prompt: Emauromin style, a bedroom with oil paintings on the wall, a single-size bed, brown cotton pillows, a wooden bedside table, a wooden wardrobe, an empty bookshelf, a white desk, a chair, a square and flat ceiling lamp hanging on the ceiling. finely detailed, purism, computer rendering, minimalism, minimal product design.
Negative prompt: blurry, blur, text, watermark, render, 3D, NSFW, nude, CGl, monochrome, B&W. cartoon, painting, smooth, plasticblurry, low-resolution, deep-fried, oversaturated.
All generated textures of the 3D rooms presented in this paper and their corresponding text prompts with one specific style are shown in Fig. 18 and Fig. 23. As for the style prompt and the corresponding negative prompt, please refer to the aforementioned website for details. During the iterative object texturing, the text prompt of every single object consists of its own text description together with the style, instead of the whole text prompt shown above.
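The prompt composition described above can be sketched as plain string assembly. The helper names below and the way the style is split into a leading part and a trailing part are assumptions for illustration, loosely following the layout of the bedroom example.

```python
def build_prompt(style: str, description: str, suffix: str = "") -> str:
    """Join the style, the description, and optional style keywords."""
    parts = [p.strip() for p in (style, description, suffix) if p.strip()]
    return ", ".join(parts)


def object_prompts(style: str, object_descriptions: list, suffix: str = "") -> dict:
    """One prompt per object: its own description plus the shared style."""
    return {d: build_prompt(style, d, suffix) for d in object_descriptions}


# Room-level prompt vs. per-object prompts used during iterative texturing.
room_prompt = build_prompt(
    "Emauromin style",
    "a bedroom with a wooden bedside table, a white desk",
    "finely detailed, minimalism",
)
per_object = object_prompts("Emauromin style",
                            ["a wooden bedside table", "a white desk"])
```

Sharing the same style string across the room prompt and every object prompt is what keeps the per-object generations stylistically tied to the panorama.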

0.A.2 Room Geometry Generation

Though the core of our method lies in generating the texture of a 3D room, it is quite straightforward to combine our method with some 3D shape generators instead of merely utilizing datasets like 3D-FRONT [16] designed by professional artists for better convenience and flexibility. Despite the fact that the geometry generated by these methods is not perfect, they can still provide the texturing process with strong geometry priors. Generally, we break down the room geometry generation into two parts: object generation and empty room generation.

3D shape generation. As for the furniture items in the scene, we generate most of our object meshes by leveraging an off-the-shelf text-to-3D object generative model, Shap-E[22]. We cut out object descriptions like ‘a wooden bedside table’ or ‘a white desk’ from the text prompt above, and then send them to the 3D object generative models to generate the corresponding 3D shapes. Delicate decorations like ceiling lamps and chandeliers are borrowed from Objaverse[13] since we found that current object generators are still incapable of generating such fine-grained decorations. The scarcity of such data in 3D object datasets like ShapeNet[5] makes it hard for a 3D generative model to learn. However, we believe that the quality gap between experts and 3D generators, especially for fine-grained models, will be closed with the rapid development of large-scale 3D generative models.

Empty room generation. A procedural generation process is applied to get an empty room mesh. Based on our observations of indoor scene datasets like 3D-FRONT[16], we provide users with various options for diverse room meshes. Specifically, they can decide whether to include baseboards, where to position doors and windows, which ceiling style to choose, and the size of the room. Under the guidance of these choices, an empty room mesh can be generated automatically. For example, the available ceiling styles are illustrated in Fig. 13. Moreover, the generated 3D shapes can also be included in the room according to the provided room layout, and thus a complete room mesh is obtained.

0.A.3 Panorama Generation Details

Initial panorama generation. We leverage the SDXL 1.0 base and refiner models [36] for image generation, where the sampler is selected as ‘Euler a’. The number of sampling steps is 50, and we switch from the base model to the refiner at fraction 0.8, i.e., after 40 steps. To generate a panoramic image with better visual fidelity and less distortion, we additionally add ‘720 degrees panorama photo view of’ to the beginning of the text prompt. To employ the depth guidance, a depth-based ControlNet model [63] is also applied, and the control weight is 1.5. Besides, we also use the SDXL VAE, and the CFG scale is set to 6.5.
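For reference, the settings listed above can be collected in a single configuration. The dictionary below is only a summary sketch: the key names are hypothetical and do not correspond to the options of any particular inference toolkit.

```python
# Hypothetical key names; the values are the ones stated in the text.
PANORAMA_SETTINGS = {
    "model": "SDXL 1.0 base + refiner",
    "vae": "SDXL VAE",
    "sampler": "Euler a",
    "num_steps": 50,
    "refiner_switch_fraction": 0.8,
    "cfg_scale": 6.5,
    "controlnet_type": "depth",
    "controlnet_weight": 1.5,
    "prompt_prefix": "720 degrees panorama photo view of",
}


def refiner_switch_step(settings: dict) -> int:
    """Step index at which sampling moves from the base model to the refiner."""
    return round(settings["num_steps"] * settings["refiner_switch_fraction"])


def panorama_prompt(settings: dict, room_prompt: str) -> str:
    """Prepend the panorama phrase to the room-level text prompt."""
    return f'{settings["prompt_prefix"]} {room_prompt}'
```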

Empty room refinement. When refining the ceilings and floors of an empty room, we will select an upward view and an overhead view to capture the corresponding areas. The virtual camera is put at the center of the room and towards the center of the ceiling or floor. The focal length and the mask will be adjusted according to the width and height of the room. As shown in Fig. 13, ceilings with star-like or diamond-like decorations and some other styles are all supported in our method.

Super-resolution and its limitation. Due to the limitation of memory and inference speed, the generated image from the SDXL model has a resolution of $2{,}048 \times 1{,}024$. To enrich the texture details, we leverage an off-the-shelf super-resolution method [54] to upscale these panorama images to $4{,}096 \times 2{,}048$. Unfortunately, some weird artifacts may appear after using the super-resolution method as illustrated in Fig. 14.

Figure 13: Different ceiling styles. We show different designs of ceiling styles with an upward view.
Figure 14: Limitation of super-resolution modules. Some weird artifacts (circled) may appear, as shown in the image at the bottom.

0.A.4 Iterative Object Texturing Details

Settings of initial perspective view. When re-projecting $\mathbf{I}_{p}$ to the initial perspective view $\mathbf{v}_{0}$, the default focal length of the virtual camera is set to 500. However, if the default setting leads to a situation where the object occupies less than half of the image or extends beyond the image boundary, we adjust the camera’s focal length accordingly. To be specific, we gradually increase the focal length until the object occupies half the width of the image, while guaranteeing that it does not exceed the image. The resolution of perspective images is $1{,}024 \times 1{,}024$, and these images are also upscaled to $4{,}096 \times 4{,}096$ via super-resolution modules. The SDXL settings are the same as those for initial panorama generation.
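The focal length adjustment can be illustrated with a simple pinhole model, in which an object of extent $w$ at distance $z$ projects to roughly $f \cdot w / z$ pixels. The sketch below makes several assumptions not stated in the text: the focal length is measured in pixels, the object is centered in the view, the step size is 10, and an object that already exceeds the image is handled by decreasing the focal length.

```python
def adjust_focal_length(extent_w: float, extent_h: float, distance: float,
                        image_size: int = 1024, focal: float = 500.0,
                        step: float = 10.0) -> float:
    """Return a focal length (in pixels) for a centered object."""

    def projected(f: float):
        # Pinhole projection of the object's width and height, in pixels.
        return f * extent_w / distance, f * extent_h / distance

    # Assumed handling of oversized objects: shrink until the object fits.
    while max(projected(focal)) > image_size and focal > step:
        focal -= step
    # Grow until the object spans half the image width, but never past the frame.
    while projected(focal)[0] < 0.5 * image_size:
        if max(projected(focal + step)) > image_size:
            break
        focal += step
    return focal
```

For a tall, narrow object the second loop stops early, since the height would leave the frame before the width reaches half of the image.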

View selection. We divide the views used for iterative object texturing into two groups: basic views and additional views. First, eight basic views are selected, all of which target the center of the object. These cameras are located roughly at the eight corners of the bounding box enclosing the whole 3D object. For the additional views, different strategies are applied according to the length-width ratio of the object. If this ratio is less than $1.5$, eight additional cameras are used, again targeting the center of the object. They are placed on a sphere centered on the object with a radius of $0.7$ times the diagonal length of the object's bounding box; the elevation angle is set to a random value between $\pi/6$ and $\pi/3$, and the azimuth angles are set to $0$, $\pi/2$, $\pi$, and $3\pi/2$, respectively. If the ratio is larger than $1.5$, we select two groups of eight additional cameras, i.e., $16$ cameras in total. Each group is likewise placed on a sphere, but one centered at one-third of the length of the 3D object, ensuring that the whole object can be viewed with these $16$ cameras. Note that virtual cameras are strictly placed within the room boundary, and cameras that are too close to the object are discarded.
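As a rough illustration of the camera placement described above, the sketch below builds the corner views and one spherical camera per listed azimuth. It assumes a z-up coordinate system and an axis-aligned bounding box. The function names are hypothetical, and the sketch does not cover how eight cameras are distributed over the four azimuths, the two-sphere case for elongated objects, or the closeness threshold, since the text does not specify them.

```python
import math
import random


def bbox_center(bbox_min, bbox_max):
    return tuple((lo + hi) / 2.0 for lo, hi in zip(bbox_min, bbox_max))


def basic_views(bbox_min, bbox_max):
    """Eight (position, target) pairs at the bounding-box corners."""
    center = bbox_center(bbox_min, bbox_max)
    corners = [(x, y, z)
               for x in (bbox_min[0], bbox_max[0])
               for y in (bbox_min[1], bbox_max[1])
               for z in (bbox_min[2], bbox_max[2])]
    return [(corner, center) for corner in corners]


def sphere_views(bbox_min, bbox_max, rng=None):
    """One camera per azimuth on a sphere of radius 0.7 * bbox diagonal."""
    rng = rng or random.Random(0)  # fixed seed only for reproducibility
    cx, cy, cz = bbox_center(bbox_min, bbox_max)
    radius = 0.7 * math.dist(bbox_min, bbox_max)
    views = []
    for azimuth in (0.0, math.pi / 2, math.pi, 3 * math.pi / 2):
        elevation = rng.uniform(math.pi / 6, math.pi / 3)
        position = (cx + radius * math.cos(elevation) * math.cos(azimuth),
                    cy + radius * math.cos(elevation) * math.sin(azimuth),
                    cz + radius * math.sin(elevation))
        views.append((position, (cx, cy, cz)))
    return views


def inside_room(position, room_min, room_max):
    """Keep only cameras that lie within the axis-aligned room bounds."""
    return all(lo <= p <= hi for p, lo, hi in zip(position, room_min, room_max))
```

In practice the corner cameras would be pushed slightly outward from the exact corners, which the text hints at with "roughly"; the amount is not given, so the sketch leaves them at the corners.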

Mask of untextured area. When we warp images to a novel view, parts that were originally occluded may become visible because of the sparsity of the point clouds. For example, the front-side texture of a wardrobe may appear when we inpaint its back side. To eliminate such erroneous pixels, we identify the areas where the depth of a warped pixel is larger than the ground-truth depth and remove those pixels.
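The depth test can be sketched in a few lines of NumPy. This is only an illustration: it assumes the warped point cloud yields a per-pixel depth map (with `inf` where no point lands) and that a ground-truth depth map is rendered from the mesh at the same view. The function name and the tolerance `tol` are hypothetical.

```python
import numpy as np


def valid_warp_mask(warped_depth: np.ndarray, gt_depth: np.ndarray,
                    tol: float = 1e-2) -> np.ndarray:
    """True where a warped pixel lies on the visible surface.

    Pixels whose warped depth exceeds the rendered (ground-truth) depth
    come from surfaces that should be occluded in the novel view, so
    they are rejected. `tol` is an assumed slack for numerical noise.
    """
    has_point = np.isfinite(warped_depth)
    visible = warped_depth <= gt_depth + tol
    return has_point & visible
```

Pixels rejected by this mask would then be cleared and treated as untextured, so that they are filled by the subsequent inpainting step.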

Inpainting strategy in sparse mask area. We use an interpolation-based method, namely Telea's inpainting algorithm as implemented in OpenCV, to inpaint areas with relatively sparse masks. Fig. 15 compares it with using the diffusion model to inpaint such areas. ControlNet does not perform well in sparse areas, whereas our method generates consistent and natural results.

Figure 15: Ablation study on inpainting methods. We compare inpainting with ControlNet only against our method. (a) is the image rendered from a novel view and (b) is the inpainting mask (white area). (c) shows the inpainting result using ControlNet only and (d) shows the inpainting result from our method. Clearly messy regions are visible in (c), since the diffusion-based inpainting model is insensitive to sparse masks.

Selecting satisfying images. Images generated by diffusion models exhibit a high degree of diversity, which makes it necessary to select one satisfactory image from multiple generated candidates. When selecting the initial perspective view, we already have the text prompt $\mathbf{T}$ and the image $\mathbf{I}_{\text{ref}}$ warped from $\mathbf{I}_{p}$. To make sure the generated image aligns well with $\mathbf{T}$ and is similar to $\mathbf{I}_{\text{ref}}$, we compute the SSIM score [56] and the CLIP score [39] for the candidate images and select the one with the highest combined score:

$$\mathbf{I}_{\text{obj}}=\mathop{\arg\max}_{j}\left(\mathbf{S}(\mathbf{I}_{\text{obj}}^{j},\mathbf{I}_{\text{ref}})+\mathbf{C}(\mathbf{I}_{\text{obj}}^{j},\mathbf{T})\right) \tag{12}$$

where $\mathbf{S}(\cdot)$ is the function that calculates the SSIM score, $\mathbf{C}(\cdot)$ is the function that calculates the CLIP score, and $\left\{\mathbf{I}_{\text{obj}}^{j}\right\}_{j=0}^{5}$ represents the 5 candidate images used here.
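The selection rule in Eq. (12) amounts to taking the candidate with the largest summed score. The sketch below shows only that step; the two scoring callables are placeholders (assumptions) standing in for an SSIM implementation against $\mathbf{I}_{\text{ref}}$ and a CLIP-score implementation against $\mathbf{T}$, neither of which is reproduced here.

```python
from typing import Callable, Sequence, TypeVar

Image = TypeVar("Image")


def select_best(candidates: Sequence[Image],
                similarity: Callable[[Image], float],
                text_score: Callable[[Image], float]) -> Image:
    """Return the candidate maximizing similarity(img) + text_score(img).

    `similarity` plays the role of S(., I_ref) and `text_score` the role
    of C(., T); both are hypothetical callables supplied by the caller.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda img: similarity(img) + text_score(img))
```

Because the two terms are simply added, their relative scales determine how much each one influences the choice.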

During the iterative object texturing, we notice that the inpainted areas cannot strictly align with the other regions of the perspective images, leading to inconsistent styles and unnatural patterns. Hence, we dilate the inpainting mask and use the dilated areas to judge style consistency. Specifically, we additionally compute a PSNR score in the dilated area to encourage good alignment:

$$\mathbf{I}_{\text{obj, i}}=\mathop{\arg\max}_{j}\left(\mathbf{P}(\mathbf{I}_{\text{obj, i}}^{j},\mathbf{I}_{\text{ref, i}})+\mathbf{C}(\mathbf{I}_{\text{obj, i}}^{j},\mathbf{T})\right) \tag{13}$$

where $\mathbf{P}(\cdot)$ represents the function that calculates the PSNR score, $\mathbf{I}_{\text{ref, i}}$ is the image to be inpainted under view $\mathbf{v}_{i}$, and the CLIP score is also used to select the most suitable inpainting result from 5 candidates.
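One possible reading of the PSNR term is sketched below. It assumes that the score is evaluated on the band added by dilation (dilated mask minus original mask), where both the candidate and the image to be inpainted contain valid pixels. The square structuring element, the radius, and the function names are assumptions, not details given in the text.

```python
import numpy as np


def dilate(mask: np.ndarray, radius: int = 1) -> np.ndarray:
    """Binary dilation of a boolean mask with a (2r+1) x (2r+1) square."""
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    h, w = mask.shape
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def band_psnr(candidate: np.ndarray, reference: np.ndarray,
              inpaint_mask: np.ndarray, radius: int = 8,
              peak: float = 255.0) -> float:
    """PSNR between candidate and reference on the dilation band only."""
    band = dilate(inpaint_mask, radius) & ~inpaint_mask
    if not band.any():
        raise ValueError("dilation band is empty")
    diff = candidate[band].astype(np.float64) - reference[band].astype(np.float64)
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))
```

A candidate that leaves the band around the hole nearly unchanged receives a high PSNR, which favors inpainting results that blend with the already textured surroundings.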

A.5 Fine-grained Texture Control

Since our method aims to generate harmonious textures across the whole scene, it naturally ignores some semantics in the object-level text if they would significantly break the overall consistency (such as a blue stool among brown furniture in the last example of Fig. 18). This misalignment is mainly due to the SDXL model, which is trained on real-world scene images with globally consistent textures. However, our method can easily align with all the object-level prompts at the cost of some harmony, and it is up to the users to decide how they would like the room textured. This fine-grained control is achieved by simply ignoring the reference textures of these objects in the panorama during the object texturing process. Such a result is shown in the teaser image as well as in Fig. 16. Apart from aligning object textures with text prompts, other fine-grained texture controls, including controlling the textures of floors, walls, ceilings, and objects using scribbles, are also integrated into one scene in the demo video.

Figure 16: Object-level alignment results. We show an object-level alignment result in the living room by simply ignoring the reference texture from the initial panorama. Every object in this scene aligns with its corresponding text prompt.

Appendix B User Study

We use a Flask-based web application for the user study to compare our method with the baselines from a human perspective. Fig. 19 shows the interface of our questionnaire: the text description is placed at the top, the room overview (a top view of the room) and two random perspective images are in the middle, and a video showing free roaming in the room is provided below. The questionnaire contains 6 groups of scenes in total, and each group includes 3 results from the baselines and 1 from ours. We invited $61$ volunteers to take part in the user study. Each participant was randomly shown 2 groups of scenes, i.e., 8 generated scenes, and was asked to judge each presented scene along three dimensions: 3D consistency (3DC), texture quality (TQ), and perceptual quality (PQ). Specifically, participants gave a score from 1 to 5 for each of the three aspects, where higher is better. In the end, we gathered $488$ responses from the $61$ participants and calculated the overall preferences shown in Tab. 2.

| Method | 3DC (↑) | TQ (↑) | PQ (↑) |
|---|---|---|---|
| TEXTure-C [40] | 3.06 (±0.85) | 2.75 (±0.80) | 2.88 (±0.77) |
| TEXTure-H [40] | 2.83 (±0.86) | 2.63 (±0.84) | 2.64 (±0.83) |
| SceneTex [6] | 3.98 (±0.86) | 3.73 (±1.02) | 3.56 (±0.95) |
| Ours | 4.51 (±0.71) | 4.29 (±0.81) | 4.26 (±0.76) |

Table 2: Quantitative comparison of the user study. Mean opinion scores are in the range of 1 to 5. Our method outperforms TEXTure-C and TEXTure-H by a large margin.
Figure 17: Limitations of optimization-based framework. We show several obvious artifacts of the optimization-based framework including misalignment between geometry and texture, the existence of unnatural light and noise, and evident blurry areas and seams.

Appendix C Detailed Comparison with SceneTex

As the concurrent work most closely related to ours, SceneTex [6] formulates the whole texturing process as an optimization problem, using a multi-resolution texture field and the VSD objective [55]. Although it can generate compelling textures for a given room geometry, the optimization-based framework may suffer from some underlying problems. First, the texture generated via optimization may not align well with the given geometry, since the texture prior distilled from text-to-image diffusion models tends to make images look as realistic as possible from certain viewpoints. For example, as shown in Fig. 17 (a), handle-like objects appear on the walls of the bathroom even though the given meshes contain no handles at all. Similarly, SceneTex is prone to generating indoor textures that contain unnatural lighting and noise, as shown in Fig. 17 (b) and (c). Moreover, the choice of viewpoints leads to some blurry areas because of severe occlusion in indoor scenes, as shown in Fig. 17 (d), although we believe this blurriness may be mitigated by a more careful viewpoint selection strategy. Clear seams can also be observed in Fig. 17 (e) due to the use of a UV map. On the other hand, under an explicit inpainting-based framework such as ours, textures of objects with complex topological structures may easily be affected by accumulated errors, even though we have designed a module to detect misalignment between depth space and RGB space. Optimization-based methods naturally possess some degree of continuity and are not significantly affected by any particular viewpoint. Nevertheless, objects with complex topological structures, such as multi-layer lamps, still pose a challenge for both approaches because of severe self-occlusion. In the future, we believe a well-designed strategy could combine the merits of inpainting-based and optimization-based methods for more harmonious and consistent texture generation.

Appendix D Additional Results

More qualitative results, including a kitchen, a bedroom, and a living-dining room compared with baseline methods, are shown in Fig. 20, and more stylized room results are shown in Fig. 21. Although it is more flexible for users to assemble a room themselves using 3D shape generators together with our empty room generator, our method is also capable of texturing rooms from professional datasets such as 3D-FRONT [16]. We choose five rooms from the dataset, including a bedroom, three living rooms, and a living-dining room; the overhead view and several interior perspective views are shown in Fig. 22. The overview images of these rooms and their corresponding text prompts are shown in Fig. 23. We render room tour videos of different scenes in different styles and combine them into a single video in the supplementary material. We also present a demo video demonstrating the effectiveness of our misalignment detection technique. Another demo video shows how our method supports interactive fine-grained texture control, followed by a room tour of the new room after these controls are applied.

Figure 18: Generated rooms and their corresponding text prompts. Six compositional rooms with the default style are shown, each with an overview image on the left and its corresponding text prompt on the right.
Figure 19: The interface of our questionnaire used in the user study. The text prompt is shown on the top, the room overview and two randomly selected perspective images are in the middle, and a room tour video is put at the bottom.
Figure 20: More qualitative comparison. We show more qualitative results compared with baselines.
Figure 21: Results of stylized rooms. We show some stylized rooms, each with images rendered from several perspective views.
Figure 22: Our scene texturing results on the 3D-FRONT dataset. We show our texturing results on the 3D-FRONT dataset with an overhead view on the left and three perspective views from inside on the right. The text prompts here are abbreviated; please refer to Fig. 23 for details.
Figure 23: Our scene texturing results and corresponding text prompts on the 3D-FRONT dataset. Five 3D-FRONT rooms with different styles are shown, each with an overview image on the left and its corresponding text prompt on the right.