(eccv) Package eccv Warning: Package ‘hyperref’ is loaded with option ‘pagebackref’, which is *not* recommended for camera-ready version

¹¹institutetext: The Hong Kong University of Science and Technology ²²institutetext: Peking University ³³institutetext: Shanghai AI Laboratory

RoomTex: Texturing Compositional Indoor Scenes via Iterative Inpainting

Qi Wang 1*1* Ruijie Lu 2*2* Xudong Xu 33 **gbo Wang 33 Michael Yu Wang 11
Bo Dai 3†3† Gang Zeng 22 Dan Xu 11

Abstract

The advancement of diffusion models has pushed the boundary of text-to-3D object generation. While it is straightforward to composite objects into a scene with reasonable geometry, it is nontrivial to texture such a scene perfectly due to style inconsistency and occlusions between objects. To tackle these problems, we propose a coarse-to-fine 3D scene texturing framework, referred to as RoomTex, to generate high-fidelity and style-consistent textures for untextured compositional scene meshes. In the coarse stage, RoomTex first unwraps the scene mesh to a panoramic depth map and leverages ControlNet to generate a room panorama, which is regarded as the coarse reference to ensure the global texture consistency. In the fine stage, based on the panoramic image and perspective depth maps, RoomTex will refine and texture every single object in the room iteratively along a series of selected camera views, until this object is completely painted. Moreover, we propose to maintain superior alignment between RGB and depth spaces via subtle edge detection methods. Extensive experiments show our method is capable of generating high-quality and diverse room textures, and more importantly, supporting interactive fine-grained texture control and flexible scene editing thanks to our inpainting-based framework and compositional mesh input. Our project page is available at https://qwang666.github.io/RoomTex/.

Keywords:

Scene Texturing Scene Generation Texture Synthesis

^†^†

*

Equal contribution, work done during the internship at Shanghai AI Laboratory.

{\dagger}

Corresponding author: [email protected]

1 Introduction

Refer to caption — Figure 1: We propose RoomTex to synthesize high-quality and style-consistent textures for given scene meshes. Our method supports generating multiple styles.

Generating high-quality textured 3D models, especially indoor scenes, is imperative for various industrial applications, ranging from gaming and filming to AR/VR. Current delicate 3D indoor scenes, however, are mostly carefully designed by professional artists with expertise and thus is an expensive and time-consuming process. Recently, significant progress has been made in the realm of 3D object generation [37, 8, 23, 38, 46, 17, 18, 26], especially in terms of geometry quality. Despite satisfactory scene geometry, achieving captivating and style-consistent textures still demands painstaking efforts from artists equipped with specialized knowledge and aesthetic training. Hence, automatic scene texturing, i.e., generating textures for untextured scene-level meshes, remains a valuable but challenging problem.

Existing texturing methods mostly focus on synthesizing textures for 3D objects, most of which are either limited to several specific training categories [34, 47, 2] or relying on the corresponding UV maps [10, 60]. Thanks to powerful CLIP model [39] connecting text and images, some subsequent approaches [28, 9, 60, 7, 4, 61] is capable of painting general 3D objects by leveraging CLIP model or more advanced text-to-image diffusion models. Conditioned on given text descriptions, they typically apply an iterative scheme to texture an object from different viewpoints. Despite remarkable results on 3D objects, these methods cannot be naively extended to 3D indoor scenes owing to the complicated occlusion problem in the scene. Although an indoor scene can be viewed as the composition of various textured objects, it is nontrivial to ensure global style consistency of texture for all the individual objects inside.

In this work, we propose a novel coarse-to-fine framework, dubbed RoomTex, to synthesize high-fidelity and style-consistent texture for a compositional 3D indoor scene under the guidance of text prompt that simultaneously enables flexible scene editing and fine-grained texture control. Contrary to directly using perfect indoor scene meshes [16] designed by professional artists, we opt to leverage off-the-shelf 3D object generative models along with a given 2D room layout to form a compositional untextured scene mesh with imperfect geometry, which alleviates the time-consuming model-making process and better aligns with the great development of 3D generative models. Afterward, this indoor scene mesh is firstly unwrapped to obtain a global panoramic depth map. Based on the depth map and text prompt, RoomTex leverages ControlNet [63] to synthesize a panoramic image of the entire scene in the coarse stage. It is noteworthy that such a room panorama is regarded as the coarse reference in the subsequent fine stage to maintain global style consistency. To cope with the occlusion problem between objects and interior surfaces, we will remove all the objects inside to inpaint the occluded areas to acquire complete room interior surfaces, including walls, floor, and ceilings.

In the fine stage, RoomTex will texture every single object in the room based on the panoramic image acquired in the coarse stage, leading to a complete 3D room that can be perceived from any viewpoint. Specifically, for a particular 3D object to be painted, we first re-project the panorama from an appropriate view targeting this object to obtain a perspective image of the object. However, this perspective image inevitably contains unacceptable distortions owing to the equirectangular projection and thus will be refined to a better and more detailed image via depth-guided ControlNet [63]. Apart from this initial view, we additionally select a series of camera positions around this 3D object, along which this 3D object will be textured iteratively under the guidance of text prompt. Regarding the refined perspective image as a starting point, we warp this partially painted 3D object to other viewpoints under the depth guidance and inpaint the missing texture in each new viewpoint, which repeats until the 3D object is completely painted. Unfortunately, the generated texture cannot perfectly align with the guiding depth map. In particular, the texture in the foreground of an object will dilate to the background area, which typically occurs in the depth edge areas and might be imperceptible from the current view. However, the dilated texture from the previous iteration leads to a messy area while the object is warped to a new viewpoint, which will be exacerbated as the iterative inpainting goes on. To mitigate this problem, we propose to detect these misalignment areas with Canny [3] edges of RGB images and Laplacian edges of the corresponding depth maps. Afterward, these areas will not taken into consideration during the iterative inpainting to avoid awful object textures.

Extensive experiments demonstrate that our method can synthesize high-fidelity and style-consistent texture for a compositional room mesh conditioned on the given text prompts as shown in Fig. 1. More importantly, thanks to the powerful control capabilities of ControlNet [63] and our inpainting-based texturing framework, RoomTex supports flexible scene editing of all individual objects and fine-grained texture control such as aligning the texture of a specific 3D object with provided sketches or text descriptions as illustrated in Fig. 2. Our contributions can be summarized as follows:

•

We propose a novel coarse-to-fine texture generation framework that first generates a room panorama as a coarse reference and then paints each component in the scene to achieve global style consistency.
•

Thanks to our subtly designed alignment between RGB and depth spaces, our method can take imperfect geometry from 3D object generative models as input and generate holistic and high-fidelity scene textures.
•

Users can not only flexibly edit the indoor scene where they can add, remove, replace, move, and rescale any furniture item, but also realize fine-grained texture control over any object with given sketches or text prompts.

2 Related Work

Diffusion-Based Text-to-Image Generation. In recent years, diffusion probabilistic models [19, 48] have achieved unprecedented success in text-to-image generation. By training on large-scale text-image paired datasets [44, 43], diffusion models manage to learn an implicit connection between semantic concepts and corresponding text embeddings, thus generating diverse and complex images of objects and scenes from given text prompts [33, 39, 42, 1]. Different from pixel-based diffusion approaches, latent diffusion models [41] (LDMs) apply the diffusion model on the latent space of pretrained autoencoders, significantly reducing the demands for massive computational resources. Moreover, several fantastic works [63, 31, 51] have explored utilizing additional conditions like sketches, Canny edges, depth maps, etc., to control the image generation of large pretrained text-to-image models. In this work, we leverage ControlNet [63] to synthesize indoor panorama conditioned on the corresponding indoor depth map and subsequent inpainting or editing procedures.

Text-Driven 3D Object Generation and Texturing. The great success of text-to-image synthesis empowers booming development in the domain of text-to-3D generation. Based on powerful 2D text-to-image diffusion models, DreamFusion [37] first proposes an effective Score Distillation Sampling (SDS) loss to guide the generation of 3D models. Later, a vast body of following works leverages the SDS loss to synthesize various 3D objects with higher quality and better 3D consistency [52, 24, 8, 55, 45, 23, 25, 38]. By training on massive 3D synthetic data, prior attempts [32, 22] on direct generation of 3D point clouds or meshes are shown to significantly accelerate the generation process. It is noteworthy that our method can capitalize on these generated 3D objects for room composition as input, despite their imperfect geometry. Given untextured 3D meshes and the conditioning text descriptions or reference images, some approaches [40, 7, 29, 53, 61, 58, 60, 4, 59, 27] also exploit the text-to-image diffusion models for 3D object texturing by using an iterative painting scheme or relying on the corresponding UV map. Unlike them, we aim to paint the entire room, including each independent 3D object inside, with high-quality and consistent textures.

Indoor Scene Generation with Panorama. Recently, MVDiffusion [50] and Ctrl-Room [14] subtly design their specific diffusion models to synthesize multi-view consistent images or 3D layouts of indoor scenes. However, they are both constrained to generating a panoramic image to represent the whole 3D room and thus wandering around the room is far beyond the capability of these methods. Although Ctrl-Room further combines the estimated depth map with the panoramic image to obtain a complete textured room, the occlusion area cannot be covered with a single panorama and the potential panoramic distortion remains out of reach. In contrast, our method aims to texture a compositional room mesh and leverages a panorama to ensure style consistency in the scene.

Text-Driven 3D Scene Indoor Generation and Texturing. Several prior works [11, 35] explore generating a 3D indoor scene by using 3D bounding boxes as layouts and optimizing the entire scene with SDS loss [37]. Yet, the texure quality is still far from satisfactory since the generated scene often looks unrealistic and over-saturated. Alternative approaches [20, 62, 15] adopt an incremental framework, where they mainly leverage image war** to obtain renderings from new viewpoints and then inpaint the missing areas based on the estimated depth map. However, inaccurate depth estimation leads to severe geometry distortion, significantly affecting the generation results. Moreover, RoomDreamer [49] will jointly refine the geometry and texture of an existing indoor mesh via pretrained text-to-image diffusion models, but still cannot cope with the unobserved regions. Parallel to scene generation, a line of research works [57, 6, 21, 64], including ours, start paying attention to 3D scene texturing, i.e., generating high-quality textures for given 3D scene-level meshes. Despite their compelling results, DreamSpace [57] and Text2Scene [21] have to rely on an initial room texture for the succeeding stylization, while concurrent works SceneTex [6] and SceneWiz3D [64] cannot support fine-grained texture controls due to their adopted optimization-based framework. Unlike any of the above, our model RoomTex targets at texturing indoor scenes that consist of untextured 3D object meshes and simultaneously enables fine-grained controls over the scene.

3 Method

In this section, we present our coarse-to-fine generation framework, RoomTex, for synthesizing high-fidelity and style-consistent texture for a compositional room. We utilize off-the-shelf 3D shape generative models along with a given room layout to assemble the room mesh. In the coarse stage, the 3D room mesh is unwrapped to a panorama depth map, based on which we generate a panoramic image of the room as a coarse reference (Sec. 3.1). Then in the fine stage, the empty room will be further refined in perspective views (Sec. 3.2). Afterward, we employ an iterative inpainting pipeline to refine and paint every independent 3D object in the room (Sec. 3.3). To better align the generated texture and the guidance depth map, we introduce an edge-detection module to identify and then remove the misalignment areas between them (Sec. 3.4). Finally, our framework also supports interactive fine-grained texture control (Sec. 3.5). An overview of our framework is illustrated in Fig. 3.

3.1 Panoramic Image Generation

Given the room layout, it is relatively straightforward to assemble an untextured mesh of the room as input by leveraging off-the-shelf 3D object generative models. Subsequently, a virtual depth camera is put at the center of the room, leading to a panoramic depth map $\mathbf{D}_{p}$ of the room via the equirectangular projection. Under the depth guidance $\mathbf{D}_{p}$ and text prompts $\mathbf{T}$ , RoomTex utilizes powerful ControlNet [63] $\mathcal{F}_{i}(\cdot)$ to synthesize a panoramic image $\mathbf{I}_{p}$ as a coarse reference to maintain the style consistency of the generated texture:

\mathbf{I}_{p}=\mathcal{F}_{i}(\mathbf{D}_{p},\mathbf{T}).

(1)

3.2 Empty Room Refinement

To enable a more flexible editing to the generated scene like moving furniture items in the scene, we further generate a complete texture of mere interior surfaces using a depth-aware inpainting model so that the missing areas occluded by objects in the initial panorama $\mathbf{I}_{p}$ will be filled. We first remove all the object meshes inside and obtain the panoramic depth of an empty room $\mathbf{D}_{r}$ . All the occluded areas are denoted with a binary mask $\mathbf{M}_{r}$ , and the inpainting can be represented with:

\mathbf{I}_{r}=\mathcal{F}_{\text{inp}}(\mathbf{I}_{p},\mathbf{M}_{r},\mathbf{% D}_{r},\mathbf{T})

(2)

where $\mathcal{F}_{\text{inp}}$ is the depth-aware inpainting model where ControlNet [63] is used, and the occluded areas in the input pananora $\mathbf{I}_{p}$ are assigned zero values during the inpainting. It is noteworthy that a more complete empty room texture is shown to be beneficial for the subsequent object generation process.

Moreover, to cope with the distortion brought by the panoramic image $\mathbf{I}_{p}$ , we carefully choose an overhead view targeting the floor and an upward view targeting the ceiling to refine these two important areas. The panoramic image $\mathbf{I}_{p}$ will be re-projected to these two perspective views $\mathbf{v}_{\text{floor}}$ and $\mathbf{v}_{\text{ceiling}}$ for the corresponding images $\mathbf{I}_{\text{floor}}$ and $\mathbf{I}_{\text{ceiling}}$ using

\displaystyle\mathbf{I}_{\cdot}

\displaystyle=\mathcal{P}(\mathcal{T}_{\text{pano}\to\text{world}}\circ\mathbf% {I}_{p},\mathbf{v}_{\cdot}),

(3)

where $\mathcal{T}_{\text{pano}\to\text{world}}$ is a transformation function that projects every single pixel in the panoramic image to the spherical coordinates and then projects to the world coordinates with the help of a panoramic depth map, $\mathcal{P}(\cdot,\mathbf{v})$ is a projection function that projects the point cloud in the world coordinates to a specific view $\mathbf{v}$ . Regarding $\mathbf{I}_{\text{floor}}$ and $\mathbf{I}_{\text{ceiling}}$ as the initialization, we employ the aforementioned inpainting Eq. 2 to refine the floor and ceiling images under the guidance of the corresponding mask and the depth map.

3.3 Iterative Object Texturing

Notably, generating a panoramic image is not essentially equal to generating a complete room texture supporting free novel view rendering inside mainly due to the lack of information in the occluded area. Therefore, it is necessary to apply an inpainting process to fill in the ‘other’ side of every single object, i.e., the missing areas in the panorama. In this stage, the global panorama $\mathbf{I}_{p}$ and the empty room panorama $\mathbf{I}_{r}$ are further used as references for the texturing of each object in the scene. Our method RoomTex aims to generate texture for a compositional 3D scene and thus conduct the inpainting for each independent object separately.

For a specific 3D object to be painted, we will select an initial perspective view $\mathbf{v}_{0}$ targeting it and re-project $\mathbf{I}_{p}$ and $\mathbf{I}_{r}$ to obtain foreground and background images $\mathbf{I}^{\text{fg}}_{\text{obj}}$ and $\mathbf{I}^{\text{bg}}_{\text{obj}}$ of this object via Eq. 3. In particular, the perspective image resolution is set as a constant, and the focal length of the initial view $\mathbf{v}_{0}$ is related to the size of the specific object and the relative distance between the camera and the object. With perspective images $\mathbf{I}_{\text{bg}}$ and $\mathbf{I}_{\text{fg}}$ , we further integrate them into one image $\hat{\mathbf{I}_{\text{obj}}}$ using the object mask $\mathbf{M}_{\text{obj}}$ as follows:

\hat{\mathbf{I}_{\text{obj}}}=\mathbf{I}_{\text{fg}}\odot\mathbf{M}_{\text{obj% }}+\mathbf{I}_{\text{bg}}\odot(1-\mathbf{M}_{\text{obj}}).

(4)

Afterward, this fused image $\hat{\mathbf{I}_{\text{obj}}}$ is refined to a new image $\mathbf{I}_{\text{obj}}$ with less distortion and higher resolution via Eq. 2.

After obtaining the refined initial view $\mathbf{I}_{\text{obj}}$ of the object, we leverage iterative inpainting to completely texture this object as illustrated in Fig. 4. To this end, we extra select a series of camera viewpoints $\{\mathbf{v}_{i}\},i=1,2,...,N$ , which will cover every aspect of the object as comprehensively as possible. The first eight views place the camera on a sphere centered around the object looking at the center of the object. To be specific, the radius of the sphere is set to be slightly larger than the object and the polar angle is set to $\pi/4$ and $3\pi/4$ . Moreover, some additional views will be picked if this 3D object is out of the view range of eight selected camera poses. Once the group of views $\left\{\mathbf{v}_{i}\right\}$ is acquired, we can apply an iterative war** and inpainting process to texture this 3D object. First of all, the initial image is unprojected to partial point cloud $\mathbf{P}_{\text{obj}}$ in the world coordinate under the depth guidance:

\mathbf{P}_{\text{obj}}=\mathcal{P}^{-1}(\mathbf{I}_{\text{obj}}\odot\mathbf{M% }_{\text{obj}},\mathbf{v}_{0}),

(5)

where $\mathcal{P}^{-1}(\cdot,\mathbf{v})$ aims to unproject pixels in the perspective view $\mathbf{v}$ back to the world coordinate.

For a novel view $\mathbf{v}_{i}$ , the partial point cloud $\mathbf{P}_{\text{obj}}$ will be warped to a novel view image $\hat{\mathbf{I}_{\text{obj, i}}}$ to be inpainted via Eq. 12, and the corresponding inpainting mask $\mathbf{M}_{\text{obj, i}}$ can be also obtained simultaneously. During the inpainting, we will initialize the missing areas with the nearest neighbor color to guide the inpainting model for a style-consistent texture, and then inpaint the image $\hat{\mathbf{I}_{\text{obj, i}}}$ to $\mathbf{I}_{\text{obj, i}}$ via aforementioned Eq. 2. However, this inpainted image $\mathbf{I}_{\text{obj, i}}$ inevitably contains numerous sparse and small holes due to the sparsity of partial point cloud $\mathbf{P}_{\text{obj}}$ , which cannot be filled well with diffusion-based inpainting models like ControlNet [63]. Therefore, we extra leverage an interpolation-based inpainting method $\mathcal{F}_{\text{intp}}$ to fill these sparse areas indicated by a binary mask $\mathbf{M}^{s}_{\text{obj, i}}$ . Combining with the room background, the final image under this novel view $\mathbf{I}^{\text{final}}_{\text{obj, i}}$ can be represented as follows:

\begin{split}\mathbf{I}^{\text{final}}_{\text{obj, i}}&=\hat{\mathbf{I}_{\text% {obj, i}}}\odot(1-\mathbf{M}_{\text{obj, i}})+\mathcal{F}_{\text{intp}}(\hat{% \mathbf{I}_{\text{obj, i}}})\odot\mathbf{M}^{s}_{\text{obj, i}}\\ &\quad+\mathbf{I}_{\text{obj, i}}\odot(\mathbf{M}_{\text{obj, i}}-\mathbf{M}^{% s}_{\text{obj, i}}).\end{split}

(6)

Afterward, the final image in this novel view will be unprojected to the world coordinate under the depth guidance and then merge with the previous partial point cloud $\mathbf{P}_{\text{obj}}$ following:

\mathbf{P}_{\text{obj}}:=\mathbf{P}_{\text{obj}}\cup(\mathcal{P}^{-1}(\mathbf{% I}^{\text{final}}_{\text{obj, i}}\odot\mathbf{M}_{\text{obj, i}},\mathbf{v}_{i% })),

(7)

where the updated point cloud will engage in the next iteration of object texturing, and $\cup$ denotes the set union operation. The iterative inpainting will be conducted following our selected camera poses $\{\mathbf{v}_{i}\},i=1,2,...,N$ , until the 3D object is completely textured.

3.4 Misalignment Removal

As illustrated in Fig. 6, the generated texture cannot perfectly align with the depth map, especially in the areas around the depth edge. To mitigate this problem, a carefully designed edge detection module is proposed to identify the misalignment areas. We denote a perspective image from diffusion models as $\mathbf{I}$ , its Canny edges as $\mathcal{E}_{C}(\mathbf{I})$ , depth map as $\mathbf{D}$ , and its Laplacian edges as $\mathcal{E}_{L}(\mathbf{D})$ . Then, our method will leverage traditional erosion and dilatation operations to determine the misalignment areas. Specifically, we filter out all the irrelevant areas by dilating the Laplacian edges and only keep the overlap** part:

\hat{\mathcal{E}_{C}(\mathbf{I})}=\mathcal{E}_{C}(\mathbf{I})\cap\text{Dilate}% (\mathcal{E}_{L}(\mathbf{D})).

(8)

Afterward, the mask of the misaligned area can be obtained via erosion and dilatation operations:

\mathbf{M}_{\text{mis}}=\text{Erode}\big{(}\text{Dilate}(\hat{\mathcal{E}_{C}(% \mathbf{I})}\cup\mathcal{E}_{L}(\mathbf{D}))\big{)}.

(9)

Misalignment areas will not be taken into consideration during the unprojection.

3.5 Fine-grained Texture Control

In practice, users may not be completely satisfied with the generated textures and would like to rectify specific areas, or even provide more detailed controls to ensure the generated textures meet their expectations as illustrated in Fig. 6. We denote the image from the specific viewpoint $\mathbf{v}$ that the user wishes to edit as $\mathbf{I}$ , the area they want to interact as $\mathbf{M}$ , the corresponding depth map as $\mathbf{D}$ , and a sketch illustrating how they would like to edit this area as $\mathbf{S}$ . Then, we will repaint the masked-out area according to the additional sketch conditions:

\mathbf{I}^{\prime}=\mathcal{F}_{s}(\mathbf{I},\mathbf{M},\mathbf{D},\mathbf{S% },\mathbf{T})

(10)

Afterward, the newly painted area will be projected back to the world coordinates, updating the point cloud of the scene $\mathbf{P}_{\text{scene}}$ :

\mathbf{P}_{\text{scene}}:=\mathbf{P}_{\text{scene}}\setminus\mathbf{P}_{\text% {orig}}\cup(\mathcal{P}^{-1}(\mathbf{I}^{\prime}\odot\mathbf{M},\mathbf{v})),

(11)

where $\mathbf{P}_{\text{orig}}$ denotes the original point cloud of the masked area and $\setminus$ stands for set subtraction operation. It is worth noting that users may generate their desirable sketches using text-to-image models. Moreover, if users just want to repaint several unsatisfactory objects, our methods also support object-level texture control like changing the color with mere text guidance as shown in Fig.2.

4 Experiment

In this section, we will present qualitative and quantitative results as well as our ablation study. For implementation details, please refer to the supplemental material, and we will first give a brief introduction to the baseline methods.

Baselines. We compare our method against two recent texture synthesis methods. As for MVDiffusion [50], we found the pre-trained depth-guided holistic generation model only works well for ScanNet [12] and can only generate blur images that do not align well with the depth guidance in cases for comparisons.

$\bullet$

TEXTure-H [40]: We compare with TEXTure whose input is an untextured mesh and text prompts. We holistically implement this baseline by removing several walls and the ceiling in our scenes so that the texture of interior surfaces can be generated.
$\bullet$

TEXTure-C[40]: We implement this baseline by compositionally generating the texture of every object mesh and the room interior surfaces, whose several walls and ceilings will also be removed. Afterward, we would integrate all components into a holistic mesh.
$\bullet$

SceneTex [6]: We compare with SceneTex whose input is also an untextured mesh and text prompts. We utilize the exact same text description as ours to guide the generation process.

Evaluation Metrics. The generated 3D textured room is evaluated both quantitatively and qualitatively. We leverage the Aesthetic Score(AS) introduced by LAION [43], CLIP Score(CS) [39] and BRISQUE(BQ) [30] to reflect the image quality of the generated scenes.

4.1 Qualitative Results

Comparison to baselines. For comparison with baseline methods, we generate three scenes including a bathroom, a bedroom, and a living room. We show top-down views into the scene and several perspective images for our method and baselines in Fig. 7. Neither TEXTure-H nor TEXTure-C can generate satisfactory interior surfaces, and unnatural spots and stripes may appear on the individual objects. Furthermore, the texture generated by TEXTure-C is inconsistent for lack of a holistic constraint and the texture generated by TEXTure-H is also of low quality due to the inaccurate description. SceneTex [6] can generate a relatively consistent texture for the whole scene including interior surfaces. However, the generated texture contains lots of unsatisfactory noise, unreasonable lighting, and severe misalignment between texture and geometry. For example, strange circle-like objects appear on the walls of the bathroom and a strange box-like object appears in front of the bed in the bedroom. More crucially, SceneTex cannot thoroughly settle the occlusion problem in the scene, resulting in blurry texture in the occluded areas. In contrast, our approach creates a highly detailed and compelling texture for all the 3D objects in the scene. Thanks to the panorama reference, the texture style between different objects and the context (mostly walls and the floor) is coherent. Moreover, the style of the room can be easily modified as shown in Fig. 1.

Scene editing and Fine-grained texture control. Our texture is generated for a compositional scene including objects and surroundings, and the representation of every object is essentially a colored point cloud. Therefore, editing like adding, duplicating, removing, rotating, moving, and rescaling objects can be simply supported. In Fig. 8, we include several editing results of rendered images under various views in different scenes. Moreover, as illustrated in Sec. 3.5, interactive fine-grained control over texture can be easily achieved and two additional examples are provided in Fig. 9.

4.2 Quantitative Results

We show quantitative results averaged over multiple scenes including a bathroom, a bedroom, and a living room in Tab.1. We render about 100 images from novel views for each scene to calculate these three metrics. Blurry and messy texture lead to lower scores for the baselines in image-based scores. As for the computing of CS, it is hard to offer an accurate description for all rendered images, so the general text input is used instead. TEXTure-C leverages the text prompt for every object inside, thus leading to a higher CS score than ours. SceneTex is prone to generating evident lighting, leading to a higher BQ value under specific viewpoints since BQ metric favors images with evident lighting.

Table 1: Quantitative comparison. We report image quality metrics including Aesthetic Score(AS), Clip Score(CS), BRISQUE(BQ). Our method outperforms baselines on AS, slightly worse than TEXTure-C in terms of CS and slightly worse than SceneTex in terms of BQ.

Method	AS( $\uparrow$ )	CS( $\uparrow$ )	BQ( $\downarrow$ )
TEXTure-C [40]	5.11	30.17	39.78
TEXTure-H [40]	4.33	27.86	47.54
SceneTex [6]	4.77	26.43	26.91
Ours	5.20	29.16	30.91

4.3 Ablation Study

Panorama distortion elimination. It is noteworthy that except for the object distortion brought by the equirectangular projection, the texture of the ceilings, floor, and baseboards may also suffer from distortion as shown in Fig. 10. This kind of texture is unacceptable and may further influence the iterative object inpainting process since the room context (background texture) is unrealistic. After choosing an overhead and an upward view to repaint floors and ceilings, we can get a textured empty room with less distortion.

Initialization of untextured areas. In Sec. 3.3, initialization of the untextured areas may contribute to a higher quality of inpainting if the missing texture is of a similar color to its nearby area. As shown in Fig. 12, whether or not we fill the untextured areas with the nearest color will make a huge difference in the generation quality.

Misalignment detection. As mentioned in Sec. 3.4, misalignment between the generated texture and the guiding depth map often occurs, which will dilate the foreground texture to the background as shown in the top-left image in Fig. 12. For example, the texture of the pillows may dilate into the area of the headboard, which will interfere with the inpainting model in the later iteration since the texture warped from previous frames is not convincing. These areas will be detected and then discarded during the unprojection. Visualized reprojection results of using different kinds of projection masks are shown from a new viewpoint.

5 Conclusion and Limitation

We have proposed a novel text-driven indoor scene texture generation framework, which is capable of generating high-fidelity and coherent texture that aligns well with geometry. The crux of our approach is to first synthesize a panoramic image as a holistic reference for style consistency and then inpaint every object iteratively to support a compositional 3D scene with complete and harmonious textures. Experimental results demonstrate the superiority of our approach concerning generation quality and editing flexibility.
Limitations and future work. Our iterative inpainting strategy is incapable of capturing all views of a 3D object in one run, potentially leading to inconsistent texture despite all the refinement we have applied. We believe multi-view diffusion models [45, 50] may mitigate this problem.

References

[1] Balaji, Y., Nah, S., Huang, X., Vahdat, A., Song, J., Kreis, K., Aittala, M., Aila, T., Laine, S., Catanzaro, B., et al.: ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324 (2022)
[2] Bokhovkin, A., Tulsiani, S., Dai, A.: Mesh2tex: Generating mesh textures from image queries. arXiv preprint arXiv:2304.05868 (2023)
[3] Canny, J.: A computational approach to edge detection. IEEE Transactions on pattern analysis and machine intelligence (6), 679–698 (1986)
[4] Cao, T., Kreis, K., Fidler, S., Sharp, N., Yin, K.: Texfusion: Synthesizing 3d textures with text-guided image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4169–4181 (2023)
[5] Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., Yu, F.: ShapeNet: An Information-Rich 3D Model Repository. Tech. Rep. arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago (2015)
[6] Chen, D.Z., Li, H., Lee, H.Y., Tulyakov, S., Nießner, M.: Scenetex: High-quality texture synthesis for indoor scenes via diffusion priors. arXiv preprint arXiv:2311.17261 (2023)
[7] Chen, D.Z., Siddiqui, Y., Lee, H.Y., Tulyakov, S., Nießner, M.: Text2tex: Text-driven texture synthesis via diffusion models. arXiv preprint arXiv:2303.11396 (2023)
[8] Chen, R., Chen, Y., Jiao, N., Jia, K.: Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. arXiv preprint arXiv:2303.13873 (2023)
[9] Chen, Y., Chen, R., Lei, J., Zhang, Y., Jia, K.: Tango: Text-driven photorealistic and robust 3d stylization via lighting decomposition. NeurIPS (2022)
[10] Chen, Z., Yin, K., Fidler, S.: Auv-net: Learning aligned uv maps for texture transfer and synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1465–1474 (2022)
[11] Cohen-Bar, D., Richardson, E., Metzer, G., Giryes, R., Cohen-Or, D.: Set-the-scene: Global-local training for generating controllable nerf scenes. arXiv preprint arXiv:2303.13450 (2023)
[12] Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: Scannet: Richly-annotated 3d reconstructions of indoor scenes. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5828–5839 (2017)
[13] Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3d objects. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13142–13153 (2023)
[14] Fang, C., Hu, X., Luo, K., Tan, P.: Ctrl-room: Controllable text-to-3d room meshes generation with layout constraints. arXiv preprint arXiv:2310.03602 (2023)
[15] Fridman, R., Abecasis, A., Kasten, Y., Dekel, T.: Scenescape: Text-driven consistent scene generation. arXiv preprint arXiv:2302.01133 (2023)
[16] Fu, H., Cai, B., Gao, L., Zhang, L.X., Wang, J., Li, C., Zeng, Q., Sun, C., Jia, R., Zhao, B., et al.: 3d-front: 3d furnished rooms with layouts and semantics. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10933–10942 (2021)
[17] Gao, J., Shen, T., Wang, Z., Chen, W., Yin, K., Li, D., Litany, O., Gojcic, Z., Fidler, S.: Get3d: A generative model of high quality 3d textured shapes learned from images. Advances In Neural Information Processing Systems 35, 31841–31854 (2022)
[18] Gupta, A., Xiong, W., Nie, Y., Jones, I., Oğuz, B.: 3dgen: Triplane latent diffusion for textured mesh generation. arXiv preprint arXiv:2303.05371 (2023)
[19] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. NeurIPS 33, 6840–6851 (2020)
[20] Höllein, L., Cao, A., Owens, A., Johnson, J., Nießner, M.: Text2room: Extracting textured 3d meshes from 2d text-to-image models. arXiv preprint arXiv:2303.11989 (2023)
[21] Hwang, I., Kim, H., Kim, Y.M.: Text2scene: Text-driven indoor scene stylization with part-aware details. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1890–1899 (2023)
[22] Jun, H., Nichol, A.: Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463 (2023)
[23] Li, W., Chen, R., Chen, X., Tan, P.: Sweetdreamer: Aligning geometric priors in 2d diffusion for consistent text-to-3d. arXiv preprint arXiv:2310.02596 (2023)
[24] Lin, C.H., Gao, J., Tang, L., Takikawa, T., Zeng, X., Huang, X., Kreis, K., Fidler, S., Liu, M.Y., Lin, T.Y.: Magic3d: High-resolution text-to-3d content creation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 300–309 (2023)
[25] Liu, Z., Li, Y., Lin, Y., Yu, X., Peng, S., Cao, Y.P., Qi, X., Huang, X., Liang, D., Ouyang, W.: Unidream: Unifying diffusion priors for relightable text-to-3d generation. arXiv preprint arXiv:2312.08754 (2023)
[26] Liu, Z., Feng, Y., Black, M.J., Nowrouzezahrai, D., Paull, L., Liu, W.: Meshdiffusion: Score-based generative 3d mesh modeling. arXiv preprint arXiv:2303.08133 (2023)
[27] Metzer, G., Richardson, E., Patashnik, O., Giryes, R., Cohen-Or, D.: Latent-nerf for shape-guided generation of 3d shapes and textures. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12663–12673 (2023)
[28] Michel, O., Bar-On, R., Liu, R., Benaim, S., Hanocka, R.: Text2mesh: Text-driven neural stylization for meshes. CVPR (2022)
[29] Michel, O., Bar-On, R., Liu, R., Benaim, S., Hanocka, R.: Text2mesh: Text-driven neural stylization for meshes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13492–13502 (2022)
[30] Mittal, A., Moorthy, A.K., Bovik, A.C.: No-reference image quality assessment in the spatial domain. IEEE Transactions on image processing 21(12), 4695–4708 (2012)
[31] Mou, C., Wang, X., Xie, L., Zhang, J., Qi, Z., Shan, Y., Qie, X.: T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453 (2023)
[32] Nichol, A., Jun, H., Dhariwal, P., Mishkin, P., Chen, M.: Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751 (2022)
[33] Nichol, A.Q., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., Mcgrew, B., Sutskever, I., Chen, M.: Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In: International Conference on Machine Learning. pp. 16784–16804. PMLR (2022)
[34] Oechsle, M., Mescheder, L., Niemeyer, M., Strauss, T., Geiger, A.: Texture fields: Learning texture representations in function space. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4531–4540 (2019)
[35] Po, R., Wetzstein, G.: Compositional 3d scene generation using locally conditioned diffusion. arXiv preprint arXiv:2303.12218 (2023)
[36] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)
[37] Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3d using 2d diffusion. ICLR (2023)
[38] Qiu, L., Chen, G., Gu, X., Zuo, Q., Xu, M., Wu, Y., Yuan, W., Dong, Z., Bo, L., Han, X.: Richdreamer: A generalizable normal-depth diffusion model for detail richness in text-to-3d. arXiv preprint arXiv:2311.16918 (2023)
[39] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
[40] Richardson, E., Metzer, G., Alaluf, Y., Giryes, R., Cohen-Or, D.: Texture: Text-guided texturing of 3d shapes. arXiv preprint arXiv:2302.01721 (2023)
[41] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
[42] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems 35, 36479–36494 (2022)
[43] Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., Komatsuzaki, A.: Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021)
[44] Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2556–2565 (2018)
[45] Shi, Y., Wang, P., Ye, J., Long, M., Li, K., Yang, X.: Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512 (2023)
[46] Siddiqui, Y., Alliegro, A., Artemov, A., Tommasi, T., Sirigatti, D., Rosov, V., Dai, A., Nießner, M.: Meshgpt: Generating triangle meshes with decoder-only transformers. arXiv preprint arXiv:2311.15475 (2023)
[47] Siddiqui, Y., Thies, J., Ma, F., Shan, Q., Nießner, M., Dai, A.: Texturify: Generating textures on 3d shape surfaces. In: European Conference on Computer Vision. pp. 72–88. Springer (2022)
[48] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: International conference on machine learning. pp. 2256–2265. PMLR (2015)
[49] Song, L., Cao, L., Xu, H., Kang, K., Tang, F., Yuan, J., Zhao, Y.: Roomdreamer: Text-driven 3d indoor scene synthesis with coherent geometry and texture. arXiv preprint arXiv:2305.11337 (2023)
[50] Tang, S., Zhang, F., Chen, J., Wang, P., Furukawa, Y.: Mvdiffusion: Enabling holistic multi-view image generation with correspondence-aware diffusion. arXiv preprint arXiv:2307.01097 (2023)
[51] Voynov, A., Aberman, K., Cohen-Or, D.: Sketch-guided text-to-image diffusion models. In: ACM SIGGRAPH 2023 Conference Proceedings. pp. 1–11 (2023)
[52] Wang, H., Du, X., Li, J., Yeh, R.A., Shakhnarovich, G.: Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12619–12629 (2023)
[53] Wang, T., Kanakis, M., Schindler, K., Van Gool, L., Obukhov, A.: Breathing new life into 3d assets with generative repainting. arXiv preprint arXiv:2309.08523 (2023)
[54] Wang, X., Xie, L., Dong, C., Shan, Y.: Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1905–1914 (2021)
[55] Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. arXiv preprint arXiv:2305.16213 (2023)
[56] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13(4), 600–612 (2004)
[57] Yang, B., Dong, W., Ma, L., Hu, W., Liu, X., Cui, Z., Ma, Y.: Dreamspace: Dreaming your room space with text-driven panoramic texture propagation. arXiv preprint arXiv:2310.13119 (2023)
[58] Yeh, Y.Y., Huang, J.B., Kim, C., Xiao, L., Nguyen-Phuoc, T., Khan, N., Zhang, C., Chandraker, M., Marshall, C.S., Dong, Z., et al.: Texturedreamer: Image-guided texture synthesis through geometry-aware diffusion. arXiv preprint arXiv:2401.09416 (2024)
[59] Youwang, K., Oh, T.H., Pons-Moll, G.: Paint-it: Text-to-texture synthesis via deep convolutional texture map optimization and physically-based rendering. arXiv preprint arXiv:2312.11360 (2023)
[60] Yu, X., Dai, P., Li, W., Ma, L., Liu, Z., Qi, X.: Texture generation on 3d meshes with point-uv diffusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4206–4216 (2023)
[61] Zeng, X., Chen, X., Qi, Z., Liu, W., Zhao, Z., Wang, Z., FU, B., Liu, Y., Yu, G.: Paint3d: Paint anything 3d with lighting-less texture diffusion models (2023)
[62] Zhang, J., Li, X., Wan, Z., Wang, C., Liao, J.: Text2nerf: Text-driven 3d scene generation with neural radiance fields. arXiv preprint arXiv:2305.11588 (2023)
[63] Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3836–3847 (2023)
[64] Zhang, Q., Wang, C., Siarohin, A., Zhuang, P., Xu, Y., Yang, C., Lin, D., Zhou, B., Tulyakov, S., Lee, H.Y.: Scenewiz3d: Towards text-guided 3d scene composition. arXiv preprint arXiv:2312.08885 (2023)

Appendix 0.A More Implementation Details

We provide additional implementation details in the following subsections. All of our experiments are conducted on 4 NVIDIA A100 GPUs, and it takes about $90$ minutes to generate a scene.

0.A.1 Text Prompt

Our method takes a room text prompt along with a compositional mesh based on a given room layout as input and aims to synthesize a complete 3D room texture enabling free novel view rendering inside. Each text prompt is composed of two parts, the style and the description of all the objects in the scene. During the experiments, we utilize the ‘Emauromin style’ as our default style and also use other styles including ‘Misc Kawaii’, ‘Anime’, ‘Game Pokemon’, ‘Artstyle Impressionist’, and so on, which can be found in this website¹¹1https://stable-diffusion-art.com/sdxl-styles/. For better stylization results, we also use the corresponding negative prompt for each style. For example, here is one of the text prompts used to generate a bedroom:
Prompt: Emauromin style, a bedroom with oil paintings on the wall, a single-size bed, brown cotton pillows, a wooden bedside table, a wooden wardrobe, an empty bookshelf, a white desk, a chair, a square and flat ceiling lamp hanging on the ceiling. finely detailed, purism, computer rendering, minimalism, minimal product design.
Negative prompt: blurry, blur, text, watermark, render, 3D, NSFW, nude, CGl, monochrome, B&W. cartoon, painting, smooth, plasticblurry, low-resolution, deep-fried, oversaturated.
All generated texture of 3D rooms presented in this paper and their corresponding text prompts with one specific style are shown in Fig. 18 and Fig. 23. As for the style prompt and the corresponding negative prompt, please refer to the aforementioned website for details. During the iterative object texturing, the text prompt of every single object is its text description as well as the style instead of the whole text prompt as shown above.

0.A.2 Room Geometry Generation

Though the core of our method lies in generating the texture of a 3D room, it is quite straightforward to combine our method with some 3D shape generators instead of merely utilizing datasets like 3D-FRONT [16] designed by professional artists for better convenience and flexibility. Despite the fact that the geometry generated by these methods is not perfect, they can still provide the texturing process with strong geometry priors. Generally, we break down the room geometry generation into two parts: object generation and empty room generation.

3D shape generation. As for the furniture items in the scene, we generate most of our object meshes by leveraging an off-the-shelf text-to-3D object generative model, Shap-E[22]. We cut out object descriptions like ‘a wooden bedside table’ or ‘a white desk’ from the text prompt above, and then send them to the 3D object generative models to generate the corresponding 3D shapes. Delicate decorations like ceiling lamps and chandeliers are borrowed from Objaverse[13] since we found that current object generators are still incapable of generating such fine-grained decorations. The scarcity of such data in 3D object datasets like ShapeNet[5] makes it hard for a 3D generative model to learn. However, we believe that the quality gap between experts and 3D generators, especially for fine-grained models, will be closed with the rapid development of large-scale 3D generative models.

Empty room generation. A procedural generation process is applied to get an empty room mesh. Based on our observations of indoor scene datasets like 3D-FRONT[16], we provide users with various options for diverse room meshes. Specifically, they can decide whether to include baseboards, where to position doors and windows, which ceiling style to choose, and the size of the room. Under the guidance of these choices, an empty room mesh can be generated automatically. For example, the available ceiling styles are illustrated in Fig. 13. Moreover, the generated 3D shapes can also be included in the room according to the provided room layout, and thus a complete room mesh is obtained.

0.A.3 Panorama Generation Details

Initial panorama generation. We leverage the SDXL $1.0$ base and refiner models [36] for image generation, where the sampler is selected as ‘Euler a’. The sampling step is $50$ , and we switch from the base model to the refiner at fraction $0.8$ , i.e., $40$ steps. To generate a panoramic image with better visual fidelity and less distortion, we additionally add ‘720 degrees panorama photo view of’ to the beginning of text prompt. To employ the depth guidance, a depth-based ControlNet model [63] is also applied, and the control weight is $1.5$ . Besides, we also use SDXL VAE, and the CFG scale is set to $6.5$ .

Empty room refinement. When refining the ceilings and floors of an empty room, we will select an upward view and an overhead view to capture the corresponding areas. The virtual camera is put at the center of the room and towards the center of the ceiling or floor. The focal length and the mask will be adjusted according to the width and height of the room. As shown in Fig. 13, ceilings with star-like or diamond-like decorations and some other styles are all supported in our method.

Super-resolution and its limitation. Due to the limitation of memory and inference speed, the generated image from the SDXL model has a resolution of $2,048\times 1,024$ . To enrich the texture details, we leverage an off-the-shelf super-resolution method [54] to upscale these panorama images to $4,096\times 2,048$ . Unfortunately, some weird artifacts may appear after using the super-resolution method as illustrated in Fig. 14.

0.A.4 Iterative Object Texturing Details

Settings of initial perspective view. When re-projecting $\mathbf{I}_{p}$ to the initial perspective view $\mathbf{v}_{0}$ , the default focal length of the virtual camera is set to $500$ . However, if the default setting leads to a bad situation where the object occupies less than half of the image or extends beyond the image boundary, we will adjust the camera’s focal length accordingly. To be specific, we would gradually increase the focal length until an object occupies half the width of the image while simultaneously guaranteeing it does not exceed the image. The resolution of perspective images is $1,024\times 1,024$ , and these images will also be upscaled to $4,096\times 4,096$ via super-resolution modules. The setting of SDXL is the same as that for initial panorama generation.

View selection. We divide the views used for iterative object texturing into two groups: basic views and additional views. First, eight basic views are selected and all of them target at the center of the object. These cameras are roughly located in eight corners of the bounding box covering the whole 3D object. For the additional views, different strategies will be applied according to the length-width ratio of the object. If this ratio is less than $1.5$ , eight additional cameras will be used and still target the center of the object. Their positions are located on a sphere centered around the object with a radius of $0.7$ times the diagonal length of the object bounding box, the elevation angle is set to be a random value between $\pi/6$ and $\pi/3$ , and the azimuth angles are set to $0$ , $\pi/2$ , $\pi$ and $3\pi/2$ , respectively. Besides, if the aspect ratio is larger than $1.5$ , we will select 2 groups of eight additional cameras, i.e., $16$ cameras in total. In particular, each group of cameras will be also located on a sphere but centered at one-third of the length of this 3D object, ensuring all the objects can be completely viewed with such $16$ cameras. It’s noteworthy that virtual cameras will be strictly placed within the room boundary, and cameras that are too close to the object will be deleted too.

Mask of untextured area. As we warp our images to a novel view, those originally occluded parts may be observed due to the sparsity of point clouds. For example, the front-side texture of a wardrobe may appear when we inpaint its back side. To eliminate such unreasonable pixels, we identify these areas where the depth is larger than the ground-truth depth and then remove these pixels thereby.

Inpainting strategy in sparse mask area. We use an interpolation-based method to inpaint areas with relatively sparse masks. Specifically, the interpolation-based method means Telea’s inpaint algorithm in OpenCV. A comparison with using the diffusion model to inpaint these kinds of areas is shown in Fig. 15. It can be seen that ControlNet does not perform well in sparse areas while our method can generate consistent and natural results.

Selecting satisfying images. It is known that images generated by diffusion models exhibit a high degree of diversity, which makes it necessary to select one satisfying image from multiple generation candidates. While selecting the initial perspective view, we already have the text prompt $\mathbf{T}$ and the warped image $\mathbf{I}_{\text{ref}}$ from $\mathbf{I}_{p}$ . To make sure the generated image aligns with $\mathbf{T}$ well and is similar to $\mathbf{I}_{\text{ref}}$ , we compute SSIM Score [56] and CLIP Score [39] among the candidate images and select the one with the highest score:

\mathbf{I}_{\text{obj}}=\mathop{\arg\max}_{j}(\mathbf{S}(\mathbf{I}_{\text{obj% }}^{j},\mathbf{I}_{\text{ref}})+\mathbf{C}(\mathbf{I}_{\text{obj}}^{j},\mathbf% {T}))

(12)

where $\mathbf{S}(\cdot)$ is the function to calculate SSIM Score, $\mathbf{C}(\cdot)$ stands for the function to calculate CLIP Score and $\left\{\mathbf{I}_{\text{obj}}^{j}\right\}_{j=0}^{5}$ represent 5 candidate images used here.

During the iterative object texturing, we notice that the inpainted areas can’t strictly align with other regions on the perspective images, leading to inconsistent styles and weird patterns. Hence, we will dilate the inpainting mask and leverage the dilation areas to judge the style consistency. Specifically, we additionally compute a PSNR score in the dilated area to encourage good alignment following:

\mathbf{I}_{\text{obj, i}}=\mathop{\arg\max}_{j}(\mathbf{P}(\mathbf{I}_{\text{% obj, i}}^{j},\mathbf{I}_{\text{ref, i}})+\mathbf{C}(\mathbf{I}_{\text{obj, i}}% ^{j},\mathbf{T}))

(13)

where $\mathbf{P}(\cdot)$ represents the function to calculate the PSNR score, $\mathbf{I}_{\text{ref, i}}$ is the image to be inpainted under view $\mathbf{v}_{i}$ , and the CLIP score is also used to select the most suitable inpainting results from 5 candidates.

0.A.5 Fine-grained Texture Control

Since our method aims to generate harmonious texture across the whole scene, it is natural for our method to ignore some semantics in the object-level text if they break the overall consistency significantly(like a blue stool among a bunch of brown furniture in the last example of Fig. 18). This misalignment is mainly due to the SDXL model, which is trained on real-world scene images with globally consistent textures. However, it is easy for our method to align with all the object-level prompts by sacrificing some extent of harmoniousness. It is up to the users themselves to decide how they would like the room texture. This fine-grained control can be achieved by simply ignoring the reference textures of these objects in the panorama during the object texturing process. Such a compromised result is shown in the teaser image as well as in Fig. 16. Apart from aligning object textures perfectly with text prompts, other fine-grained texture controls including controlling the texture of floors, walls, ceilings, and objects using scribbles are also integrated into one scene in the demo video.

Appendix 0.B User Study

We leverage a flask-based web application for the user study to compare our method with baselines from the human perspective. Fig. 19 shows the interface of our questionnaire, where the text description is put on the top, the room overview(a top view of the room) and two random perspective images are in the middle, and a video showing free roaming in the room is also provided below. In the questionnaires, we have 6 groups of scenes in total, where 3 results from baselines and 1 from ours are included in each group. We invite $61$ volunteers to conduct the user study and each participant will be randomly shown 2 groups of scenes, i.e., 8 generated scenes, and be asked to judge each presented scene from three different dimensions, 3D consistency(3DC), texture quality(TQ), and perceptual quality(PQ). Specifically, they have to give a score ranging from 1 to 5 for such three aspects. The higher, the better. In the end, we gather $488$ responses from the $61$ participants and calculate the overall preferences as shown in Tab. 2.

Method	3DC( $\uparrow$ )	TQ( $\uparrow$ )	PQ( $\uparrow$ )
TEXTure-C [40]	3.06( $\pm$ 0.85)	2.75( $\pm$ 0.80)	2.88( $\pm$ 0.77)
TEXTure-H [40]	2.83( $\pm$ 0.86)	2.63( $\pm$ 0.84)	2.64( $\pm$ 0.83)
SceneTex [6]	3.98( $\pm$ 0.86)	3.73( $\pm$ 1.02)	3.56( $\pm$ 0.95)
Ours	4.51( $\pm$ 0.71)	4.29( $\pm$ 0.81)	4.26( $\pm$ 0.76)

Table 2: Quantitative comparison of the user study. Mean opinion scores are in the range of 1

\sim

5. Our method outperforms TEXTure-C and TEXTure-H by a large margin.

Appendix 0.C Detailed Comparison with SceneTex

As the most closely related and concurrent work as ours, SceneTex [6] formulates the whole texturing process as an optimization problem by using a multi-resolution texture field and the VSD objective [55]. Though being able to generate compelling textures for a given room geometry, the optimization-based framework may suffer some underlying problems. First of all, the texture generated via optimization may not align well with the given geometry as the texture prior distilled from text-to-image diffusion models tends to make images look as realistic as possible from certain viewpoints. For example, as shown in Fig. 17 (a), there should not exist handle-like objects on the walls of the bathroom since there are no handles at all in the given meshes. Similarly, SceneTex is prone to generate indoor textures containing unnatural lights and weird noises as shown in Fig. 17 (b) and (c). Moreover, the choice of viewpoints leads to some blurry areas due to the severe occlusion problem in the indoor scene as shown in Fig. 17 (d). However, we believe the blurry problem may be mitigated via a more delicate viewpoint selection strategy. Some clear seams can be observed in Fig. 17 (e) due to the usage of UV map. On the other hand, textures of objects with complex topological structures may easily be affected by accumulative errors under an explicit inpainting-based framework, even though we have designed a module to detect the misalignment between depth space and rgb space. Optimization-based methods naturally possess some extent of continuity and will not be significantly impacted by a particular viewpoint. But objects with complex topological structures like multi-layer lamps still pose a challenge for both approaches due to the severe self-occlusion. In the future, we believe a well-designed strategy could marry the merits of the inpainting-based method and the optimization-based method for more harmonious and consistent texture generation.

Appendix 0.D Additional Results

More qualitative results including a kitchen, a bedroom, and a living-dining room compared with baseline methods are shown in Fig. 20 and more stylized room results are shown in Fig. 21. Though it is more flexible to assemble a room using 3D shape generators along with our provided empty room generator by users themselves, our method is also capable of texturing a room from professional datasets like 3D-FRONT [16]. We choose five rooms including a bedroom, three living rooms, and a living-dining room from the dataset, and the results of the overhead view and several perspective views from inside are shown in Fig. 22. The overview images of these rooms as well as their corresponding text prompts are shown in Fig. 23. We render some room tour videos of different scenes with different styles, which are integrated into a unified video put in the supplementary. Besides, we also present a demo video to demonstrate the effectiveness of using our misalignment detection technique. Another demo video shows how our method supports interactive fine-grained texture controls as well as a room tour video in the new room after applying these controls.