HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Models

Hieu T. Nguyen1,∗, Yiwen Chen1,∗, Vikram Voleti2, Varun Jampani2, Huaizu Jiang1
1 Northeastern University, 2 Stability AI (∗ Equal contribution)
Abstract

We introduce HouseCrafter, a novel approach that can lift a floorplan into a complete large 3D indoor scene (e.g., a house). Our key insight is to adapt a 2D diffusion model, which is trained on web-scale images, to generate consistent multi-view color (RGB) and depth (D) images across different locations of the scene. Specifically, the RGB-D images are generated autoregressively in a batch-wise manner along sampled locations based on the floorplan, where previously generated images are used as conditions for the diffusion model to produce images at nearby locations. The global floorplan and attention design in the diffusion model ensure the consistency of the generated images, from which a 3D scene can be reconstructed. Through extensive evaluation on the 3D-Front dataset, we demonstrate that HouseCrafter can generate high-quality house-scale 3D scenes. Ablation studies also validate the effectiveness of different design choices. We will release our code and model weights. Project page: https://neu-vi.github.io/houseCrafter/

Figure 1: HouseCrafter can lift floorplans to 3D scenes. Top: It adapts a 2D diffusion model to generate multi-view RGB-D images across different locations of the scene in an autoregressive manner. The RGB-D images are then fused into a 3D mesh. Bottom: HouseCrafter can generate high-quality 3D meshes of the scene that are faithful to the input floorplan.

1 Introduction

High-fidelity 3D environments are crucial for delivering truly immersive user experiences in AR, VR, gaming, and beyond. Traditionally, this process has been labor-intensive, demanding meticulous effort from skilled artists and designers, especially for intricate indoor settings with numerous furniture pieces and decorative objects. The development of automated tools for generating realistic 3D scenes can significantly improve this process, streamlining the creation of complex virtual environments, which enables faster iteration cycles and empowers novice users to bring their creative visions to life. Such tools hold immense potential across industries like architecture, interior design, and real estate, facilitating rapid visualization, iteration, and collaborative design.

Recent advances in denoising diffusion models (Ren et al., 2023; Ju et al., 2023) show great promise toward developing 3D generative models using 3D data. However, in contrast to the abundant availability of 2D imagery (Schuhmann et al., 2022), 3D data requires intensive labor to create or acquire (Dai et al., 2017; Chang et al., 2017; Fu et al., 2021; Ge et al., 2024; Behley et al., 2019; Yeshwanth et al., 2023). Thus, using 2D generative models (Rombach et al., 2022; Saharia et al., 2022) is a promising direction for 3D generation. In Song et al. (2023) and Tang et al. (2023), 2D diffusion models are used to texturize a given raw 3D scene. But obtaining the untextured 3D scene, either in the format of mesh or point cloud, is not trivial. Alternatively, a 3D scene can be estimated based on generated multi-view observations (Liu et al., 2023b; Ye et al., 2023; Weng et al., 2023; Liu et al., 2023c; Shi et al., 2023b; a; Long et al., 2023; Liu et al., 2024; 2023a; Kant et al., 2023; Szymanowicz et al., 2023; Kant et al., 2024; Wang et al., 2024; Zheng & Vedaldi, 2023; Hu et al., 2024; Huang et al., 2023; Voleti et al., 2024). However, these works only investigate object-centric generation with relatively simple camera positions.

For 3D scene generation, text-to-image diffusion models are employed to create room panoramas (Song et al., 2023; Tang et al., 2023), offering visually appealing results. But converting these panoramas into 3D representations without additional input is challenging. Other works (Höllein et al., 2023; Chung et al., 2023; Shriram et al., 2024) obtain a 3D representation of the scene by continuously generating 2D images of the scene and projecting them to 3D space using depth provided by monocular depth estimation models (Piccinelli et al., 2024; Ke et al., 2024). While achieving good results on a small scale, these methods struggle to scale up to bigger scenes, where they tend to produce repeated content and distorted geometry. Instead of using textual descriptions, layout maps can better convey the global guidance for scene generation. Several studies have explored this approach at the room-scale level, demonstrating the benefits of incorporating layout information (Schult et al., 2023; Fang et al., 2023; Bahmani et al., 2023). However, extending this method to house-scale generation poses challenges, as the current strategy of generating all scene content in one batch becomes impractical for larger, more complex scenes.

In this paper, we present HouseCrafter, an autoregressive pipeline for house-scale 3D scene generation guided by 2D floorplans, as shown in Fig. 1. Our key insight is to adapt a powerful pre-trained 2D diffusion model (Rombach et al., 2021) to generate multi-view consistent RGB and depth (RGB-D) images, decoupling color and geometry, across different places of the scene to reconstruct the 3D house. Specifically, we sample a set of camera poses within the scene based on the given floorplan. A novel view synthesis model is developed to generate RGB-D images at these poses in a batch-wise manner. In each batch, the model takes the already generated RGB-D images at neighboring poses (initially empty) as references and simultaneously generates RGB-D images at nearby target poses, guided by the local view of the floorplan. With all the generated RGB-D images inside the house, we use TSDF fusion (Zeng et al., 2017) to reconstruct the scene, providing explicit meshes for downstream applications (e.g., in an AR/VR application).

Our novel-view synthesis model is inspired by the object-centric generation model EscherNet (Kong et al., 2024). Although EscherNet shows promising results in ensuring view consistency with camera position encoding, it is not designed to handle the complexity of geometry and appearance at the scene level, resulting in failure when the camera moves away from the initial object. We make two important modifications to adapt it for 3D house generation. First, we extend it to take the 2D floorplan as an additional input, leading to globally consistent scene generation. Second, we incorporate depth information into both the input and output of novel view synthesis, decoupling geometry and appearance of the scene and yielding better generation results. The idea of RGB-D novel view synthesis is also investigated in MVDFusion (Hu et al., 2024). However, MVDFusion is not designed for scene generation either and produces low-resolution depth images due to concatenation with the downsampled features of the RGB images in the denoising process. Instead, our model produces high-resolution depth images, leading to high-quality 3D scene reconstruction.

We evaluate our model on the 3D-Front dataset (Fu et al., 2021). Through our experiments, we demonstrate the effectiveness of our RGB-D novel view synthesis model in generating images at novel views that are consistent not only with the input reference views and floorplan but also among the generated images themselves. Moreover, we demonstrate the model’s efficacy in generating 3D scenes that are more compelling and globally coherent than those of existing methods.

In summary, our key contributions are as follows.

  • We introduce a novel method, HouseCrafter, which can lift a 2D floorplan into a 3D house. Compared with existing room-scale methods (Höllein et al., 2023; Bahmani et al., 2023), our approach can generate globally consistent house-scale scenes.

  • We present an RGB-D novel view synthesis method, which takes nearby RGB-D images as references to generate a set of RGB-D images at novel views, guided by the floorplan. Compared to existing methods (Kong et al., 2024; Hu et al., 2024), our approach generates semantically and geometrically consistent multi-view RGB-D images, enabling high-quality and efficient 3D scene reconstruction.

  • Through both quantitative and qualitative evaluation, we demonstrate the effectiveness of our model in producing images that are faithful to both reference images and floorplan. We also show that our approach can generate globally coherent house-scale indoor scenes.

2 Related Work

3D Object Generation. Recent advancements in 2D image generation (Rombach et al., 2021; Blattmann et al., 2023) have inspired attempts to use diffusion models for 3D generation. Some works (Poole et al., 2022; Lin et al., 2023; Yi et al., 2024) optimize 3D representations (Mildenhall et al., 2021; Kerbl et al., 2023) by leveraging the denoising capabilities of diffusion models. However, these models struggle to maintain a single object instance across denoising updates and are unaware of camera poses, limiting the quality of the optimized 3D representations.

Alternatively, some works convert generated images into 3D models (Liu et al., 2023b; Ye et al., 2023; Weng et al., 2023; Liu et al., 2023c; Shi et al., 2023b; a; Long et al., 2023; Liu et al., 2024; 2023a; Kant et al., 2023; Szymanowicz et al., 2023; Kant et al., 2024; Wang et al., 2024; Tochilkin et al., 2024; Zheng & Vedaldi, 2023; Hu et al., 2024; Huang et al., 2023). Liu et al. (2023b) demonstrated that diffusion models (Rombach et al., 2021) fine-tuned on large-scale object datasets (Deitke et al., 2023; 2024) can generate consistent multi-view RGB images, enabling 3D model reconstruction. Building on this, subsequent research has focused on enhancing multi-view image quality by integrating 3D representations (Yang et al., 2023; Liu et al., 2023c; Kant et al., 2023; Weng et al., 2023; Shi et al., 2023b; Liu et al., 2024; 2023a; Hu et al., 2024) or using cross-view attention (Zheng & Vedaldi, 2023; Blattmann et al., 2023; Kong et al., 2024; Shi et al., 2023b; Voleti et al., 2024).

Inspired by these approaches, we aim to generate multi-view images at the scene level. Our model uses multi-view RGB-D images and 2D layout as conditions to generate new multi-view RGB-D images. Integrating depth enhances multi-view consistency and provides explicit scene geometry for 3D reconstruction. Unlike Kong et al. (2024), which only outputs multi-view RGB images, and Hu et al. (2024), which denoises depth images with RGB latents, our model denoises both RGB and depth images in the latent space. This maintains geometry awareness and produces high-resolution depth images and high-quality 3D reconstructions, ensuring geometric and semantic consistency across views.

Text-guided 3D Scene Generation. Text-to-image models can also be utilized for 3D scene generation. Some works (Rockwell et al., 2021; Zhang et al., 2023; Yu et al., 2023; Chung et al., 2023; Ouyang et al., 2023; Höllein et al., 2023; Shriram et al., 2024) continuously aggregate frames with existing scenes, using monocular depth estimators to project 2D images into 3D space, but face challenges like scale ambiguity and depth inconsistencies. Recent work improves geometry by training depth-completion models (Engstler et al., 2024). However, most of these methods focus on forward-facing scenes and struggle with larger or more complex scenes like rooms or houses, since global plausibility is not guaranteed (Höllein et al., 2023).

To enhance global plausibility, MVDiffusion (Tang et al., 2023) and RoomDreamer (Song et al., 2023) generate multiple images in a single batch to form a panorama, though without geometry generation. Gaudi (Bautista et al., 2022) directly generates a global 3D scene representation, producing 3D scenes with globally plausible content, but the quality is limited by the scarcity of 3D data with text.

Our pipeline generates views of the scene autoregressively but in batches. Compared to image-by-image generation pipelines (Höllein et al., 2023; Chung et al., 2023; Shriram et al., 2024), batch generation scales better and benefits from the built-in cross-view consistency of multi-view models. Additionally, by including depth images, HouseCrafter addresses scale ambiguity and leverages geometry from previous steps to generate novel views.

Layout-guided 3D Scene Generation. Complementing text, a layout provides the detailed positions of objects in the scene. Early work (Vidanapathirana et al., 2021) is able to lift a 2D floorplan to a 3D house model but focuses only on the architectural structure, i.e., floor, wall, and ceiling. Also conditioned on a 2D layout, BlockFusion (Wu et al., 2024) achieves commendable results in geometry generation but does not generate texture.

For both geometry and texture generation, Ctrl-Room (Fang et al., 2023) and ControlRoom3D (Schult et al., 2023) show that 3D layout guidance improves geometry and object arrangement compared to text-only methods (Höllein et al., 2023). However, these methods generate a single panorama, limited to room-scale scenes. CC3D (Bahmani et al., 2023), closest to our work, uses 2D layout guidance to produce a 3D neural radiance field, enabling textured mesh but still limited to single-room scenes. Our method effectively uses 2D layout guidance to scale to larger scenes, such as entire houses.

Other works. Other approaches treat indoor scene generation as an object layout problem (Wen et al., 2023; Feng et al., 2024; Yang et al., 2024). These works focus on predicting floor layouts and furniture placement using language models and retrieving suitable objects from a database. Alternatively, Ge et al. (2024) create augmented layouts from templates, while others use procedural generation (Deitke et al., 2022; Raistrick et al., 2024). These approaches complement our pipeline, as we can use predicted floorplans to generate the scene’s texture and geometry accordingly.

3 Proposed Method: HouseCrafter

3.1 Overview

Our goal is to lift a 2D floorplan to a 3D scene that we can interact with, where explicit scene representation is desired, e.g., in terms of meshes. If we had enough 3D data, training a generative model that outputs the desired 3D asset would be the most straightforward solution. In practice, however, 3D data is harder to acquire and thus far more scarce than 2D imagery. Therefore, in this paper, we resort to generating multi-view 2D observations of the scene first and then reconstructing it in 3D. It allows us to harness the powerful generative prior of recent advances in diffusion-based models that are trained using a large set of 2D images.

Figure 2: Layout-guided novel view RGB-D generation model. Adapted from EscherNet, our model has three important design changes for 3D scene generation. First, our model simultaneously denoises the latents of RGB images $\{\mathbf{x}_i^r\}_{i=1}^{N_r}$ and depth images $\{D_i^r\}_{i=1}^{N_r}$, enabling geometry and texture consistency. Second, the introduced layout-attention block allows the target images to condition on the given floorplan $L$. Lastly, DeCaPE is proposed to leverage the depth images of the reference views, allowing the attention between the point cloud features of reference views and target image features with only camera poses.

As shown in Fig. 1, we sample many locations inside the house based on the 2D floorplan and then visit these locations autoregressively in a batch-wise manner. In each batch, our novel view synthesis model takes already generated RGB-D images from nearby locations as references and, conditioned on the floorplan, simultaneously generates a set of semantically and geometrically consistent RGB-D images at novel neighboring locations. After exhausting these locations, we use an off-the-shelf TSDF fusion method (Zeng et al., 2017) to reconstruct a detailed 3D vertex-colored mesh from the generated RGB-D images.
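The batch-wise autoregressive visiting order described above can be illustrated with a minimal, dependency-free sketch. Everything concrete here is an assumption made for illustration: the rectangular floorplan, the regular grid used by `sample_poses`, the nearest-neighbor rule for grouping targets and choosing references, and the stand-in `fake_synthesize`, which only records how many references each target view received in place of the diffusion model. TSDF fusion is left out.

```python
import math

def sample_poses(width, depth, step):
    """Sample camera positions on a regular grid inside a rectangular floorplan
    (grid sampling is an assumption of this sketch)."""
    xs = [step / 2 + i * step for i in range(int(width // step))]
    ys = [step / 2 + j * step for j in range(int(depth // step))]
    return [(x, y) for y in ys for x in xs]

def nearest(pool, target, k):
    """Return up to k positions from pool that are closest to target."""
    return sorted(pool, key=lambda p: math.dist(p, target))[:k]

def generate_scene(poses, batch_size, num_refs, synthesize):
    """Visit poses batch by batch; each batch is conditioned on nearby finished views."""
    done = {}                                   # pose -> generated RGB-D placeholder
    remaining = list(poses)
    while remaining:
        # Group the first unvisited pose with its nearest unvisited neighbors.
        batch = nearest(remaining, remaining[0], batch_size)
        # References are the closest already generated views (none for the first batch).
        refs = nearest(list(done), batch[0], num_refs)
        outputs = synthesize(batch, [(p, done[p]) for p in refs])
        for pose, rgbd in zip(batch, outputs):
            done[pose] = rgbd
        remaining = [p for p in remaining if p not in done]
    return done

def fake_synthesize(targets, references):
    """Hypothetical stand-in for the novel view synthesis model."""
    return [{"pose": t, "num_refs": len(references)} for t in targets]

poses = sample_poses(width=4.0, depth=2.0, step=1.0)
views = generate_scene(poses, batch_size=3, num_refs=2, synthesize=fake_synthesize)
print(len(views), "views generated")
```

In this toy run only the first batch is generated without references; every later batch is conditioned on previously generated neighbors, which is the property the autoregressive design relies on.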

3.2 Layout-guided Novel View RGB-D Image Generation

We fine-tune the UNet of Stable Diffusion v1.5 (Rombach et al., 2021) to leverage its powerful generation capacity obtained from training on web-scale data. Specifically, given the 2D floorplan $L$, the already generated depth images $\{D_i^r\}_{i=1}^{N_r}$ and the latent features of already generated RGB images $\{\mathbf{x}_i^r\}_{i=1}^{N_r}$ at certain poses $\{\mathbf{P}_i^r\}_{i=1}^{N_r}$ as references, the goal of our novel view synthesis model is to denoise the latents of RGB-D images $\{(\mathbf{x}_j^n, \mathbf{d}_j^n)\}_{j=1}^{N_n}$ at the novel poses $\{\mathbf{P}_j^n\}_{j=1}^{N_n}$. $N_r$ and $N_n$ denote the number of reference and novel images, respectively. Both the reference pose $\mathbf{P}_i^r$ and novel pose $\mathbf{P}_j^n$ are in the $SE(3)$ space.

For the reference RGB image $I_i^r$, we use a lightweight image encoder (Woo et al., 2023) to get its latent $\mathbf{x}_i^r$. The latent $\mathbf{d}_j^n$ is obtained by replicating the single-channel target depth image into 3 channels, normalizing the depth values to $[-1, 1]$, and then passing the result through the VAE encoder. In this way, we can use the pre-trained VAE to encode it just like an RGB image. At the novel pose $\mathbf{P}_j^n$, the RGB and depth latents $\mathbf{x}_j^n$ and $\mathbf{d}_j^n$ are gradually denoised in $T$ steps starting from pure Gaussian noise. Finally, the latents are fed into the VAE decoder to get the RGB and depth images. In this section, we omit the denoising step for brevity.
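The depth preprocessing described above (replicate to three channels, normalize to $[-1, 1]$) can be sketched as follows. The clipping range `max_depth` and the channel-averaging rule used to map a decoded image back to depth are assumptions of this sketch; the text does not specify either, and the VAE itself is not modeled here.

```python
def depth_to_vae_input(depth, max_depth):
    """Map an H x W depth map to a 3 x H x W nested list with values in [-1, 1].

    max_depth is an assumed clipping range, not a value taken from the paper.
    """
    def normalize(d):
        d = min(max(d, 0.0), max_depth)        # clip to the assumed valid range
        return 2.0 * d / max_depth - 1.0       # [0, max_depth] -> [-1, 1]
    channel = [[normalize(d) for d in row] for row in depth]
    # Replicate the single channel three times so it has the layout of an RGB image.
    return [[row[:] for row in channel] for _ in range(3)]

def vae_output_to_depth(image, max_depth):
    """Invert the mapping: average the three channels (assumed), then undo the scaling."""
    h, w = len(image[0]), len(image[0][0])
    return [[(sum(image[c][i][j] for c in range(3)) / 3.0 + 1.0) * max_depth / 2.0
             for j in range(w)] for i in range(h)]

depth = [[0.0, 2.5], [5.0, 10.0]]
image = depth_to_vae_input(depth, max_depth=10.0)
restored = vae_output_to_depth(image, max_depth=10.0)
print(image[0])
print(restored)
```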

An illustration of the model is shown in Fig. 2. Our model architecture is inspired by designs of SOTA object-centric novel view synthesis models (Zheng & Vedaldi, 2023; Kong et al., 2024), but re-designed for the geometric and semantic complexity of scene-level contents. First, we extend both the reference conditioning and image generation to the RGB-D setting instead of RGB only, as RGB-D images provide strong cues for 3D scene reconstruction. Second, we insert a layout attention layer at the beginning of each UNet block to encourage the generated images to be faithful to the floorplan, ensuring global consistency in generating a house-scale scene. Moreover, the cross-attention layer, which is introduced in prior works for reference-novel view attention, is updated to leverage geometry from the reference depth, leading to higher-quality image generation.

Multi-novel-view RGB-D Image Generation. Given RGB and depth latents $\mathbf{x}_j^n$ and $\mathbf{d}_j^n$, instead of denoising them separately, we concatenate them along the channel dimension as $\mathbf{z}_j^n = [\mathbf{x}_j^n, \mathbf{d}_j^n]$. In this way, the model can effectively fuse the information of RGB and depth images into a single representation to ensure the semantic consistency between them at a single view. We double the input and output channels of the UNet to accommodate $\mathbf{z}_j^n$. When we denoise a set of latents $\{\mathbf{z}_j^n\}_{j=1}^{N_n}$ simultaneously, it ensures consistency across RGB and depth images both semantically and geometrically across different views and thus leads to higher-quality generation as shown in the experiments.

Floorplan Conditioning. We use a vectorized representation $L$ for the floorplan (Zheng et al., 2023), which describes the structure and furniture arrangement of the house from a bird's-eye view. $L = \{o_i\}_{i=1}^{N}$ consists of $N$ components, where each component $o_i = \{c_i, p_i\}$ is specified by its category $c_i$ and geometry information $p_i$. If the component $o_i$ represents furniture (e.g., a chair), $p_i$ defines the 2D bounding box enclosing the object. For other components, including walls, doors, and windows, it specifies the start and end points of the corresponding line segment.

To use it as a condition for the diffusion model, we encode the floorplan with respect to each target view. Fig. 3 illustrates the encoding process for a novel view. For every pixel of its latent $\mathbf{z}_j^n \in \mathbb{R}^{C \times H \times W}$, we shoot a ray $\mathbf{r}$ originating at the camera center of $\mathbf{z}_j^n$ and going through the pixel center, which is orthogonally projected down as $\mathbf{r}'$ onto the floor plane. Along the projected ray $\mathbf{r}'$, we take at most $M$ points that intersect with the 2D object bounding boxes or other floorplan components (e.g., walls). With each intersection point, we obtain its position and the object category. Gathering across all the pixels of $\mathbf{z}_j^n$, we get $\mathbf{c}_j \in \mathbb{N}^{M \times H \times W}$ for the semantic category and $\mathbf{p}_j \in \mathbb{R}^{M \times 2 \times H \times W}$ for the point position, where the dimension of size 2 consists of the depth along the ray and the height from the floor.

Figure 3: Floorplan Encoding.

Note that we exclude the intersections after the ray first hits the wall to take the occlusion into effect, and use zero-padding to ensure the same number of intersection points per ray for batching.
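The per-ray floorplan encoding described above can be sketched for a single pixel ray. Several choices below are assumptions made for illustration and are not stated in the text: category id 0 is reserved for padding, a bounding box contributes both the point where the ray enters it and the point where it leaves, "depth" is measured as horizontal distance along the projected ray, and "height" is the height of the original 3D ray above the floor at that distance.

```python
import math

PAD, WALL, DOOR, CHAIR = 0, 1, 2, 3            # hypothetical category ids

def ray_segment_hit(origin, direction, a, b):
    """Distance t >= 0 along a 2D ray at which it crosses segment a-b, else None."""
    ox, oy = origin
    dx, dy = direction
    ex, ey = b[0] - a[0], b[1] - a[1]
    denom = dx * ey - dy * ex
    if abs(denom) < 1e-9:                      # parallel: treated as no hit
        return None
    t = ((a[0] - ox) * ey - (a[1] - oy) * ex) / denom    # position along the ray
    u = ((a[0] - ox) * dy - (a[1] - oy) * dx) / denom    # position along the segment
    return t if t >= 0.0 and 0.0 <= u <= 1.0 else None

def box_edges(box):
    """Four edges of an axis-aligned box given as (xmin, ymin, xmax, ymax)."""
    x0, y0, x1, y1 = box
    c = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    return [(c[i], c[(i + 1) % 4]) for i in range(4)]

def encode_ray(cam, ray_dir, floorplan, max_points):
    """Return (categories, positions), each padded to exactly max_points entries.

    cam is the 3D camera center (x, y, z) with z up; ray_dir is a 3D direction.
    """
    horiz = math.hypot(ray_dir[0], ray_dir[1])
    if horiz < 1e-9:                           # vertical ray: no floor-plane direction
        return [PAD] * max_points, [(0.0, 0.0)] * max_points
    d2 = (ray_dir[0] / horiz, ray_dir[1] / horiz)        # projected ray r'
    slope = ray_dir[2] / horiz                           # height change per unit depth
    hits = []
    for item in floorplan:
        edges = box_edges(item["box"]) if "box" in item else [item["segment"]]
        for a, b in edges:
            t = ray_segment_hit(cam[:2], d2, a, b)
            if t is not None:
                hits.append((t, item["category"]))
    hits.sort(key=lambda h: h[0])
    cats, pos = [], []
    for t, cat in hits[:max_points]:
        cats.append(cat)
        pos.append((t, cam[2] + slope * t))    # (depth along r', height above floor)
        if cat == WALL:                        # occlusion: drop everything behind the wall
            break
    pad = max_points - len(cats)
    return cats + [PAD] * pad, pos + [(0.0, 0.0)] * pad

floorplan = [
    {"category": CHAIR, "box": (1.0, -0.5, 2.0, 0.5)},
    {"category": WALL, "segment": ((3.0, -2.0), (3.0, 2.0))},
    {"category": CHAIR, "box": (4.0, -0.5, 5.0, 0.5)},   # hidden behind the wall
]
cats, pos = encode_ray(cam=(0.0, 0.0, 1.5), ray_dir=(1.0, 0.0, -0.5),
                       floorplan=floorplan, max_points=4)
print(cats)
print(pos)
```

In this example the second chair lies behind the wall, so it contributes no points and the last slot is filled with padding. Repeating the procedure for every latent pixel would give arrays with the shapes of $\mathbf{c}_j$ and $\mathbf{p}_j$.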

To inject the floorplan information $\mathbf{c}_j, \mathbf{p}_j$ into the latent $\mathbf{z}_j^n$, we first embed them into a latent space,

$$\mathbf{l}_j = \texttt{Embed}(\mathbf{c}_j) + \texttt{PosEnc}(\mathbf{p}_j), \tag{1}$$

where Embed() maps each semantic class to a latent vector and PosEnc() is the sinusoidal position embedding, to obtain $\mathbf{l}_j \in \mathbb{R}^{M \times C \times H \times W}$, which encodes both the geometry and the semantics of the floorplan.

Subsequently, the layout-attention block modulates the RGB-D latents using cross-attention between the input latents and $\mathbf{l}_j$ at the pixel level: each latent feature in $\mathbf{z}_j^n$ is the query, and the floorplan features along the corresponding ray are the keys and values, so the attention for each pixel is performed independently. We provide more technical details in the appendix (Section A).
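Eq. (1) and the per-pixel cross-attention can be sketched for a single pixel as follows. This is an illustration under stated assumptions, not the layer used in the model: the learned query, key, and value projections are omitted (treated as identity), the embedding table is random, the channel budget is split evenly between the depth and height encodings, and padded tokens (category 0) are masked out of the softmax.

```python
import math
import random

def pos_enc(value, dim):
    """Sinusoidal encoding of a scalar into `dim` numbers (dim must be even)."""
    out = []
    for k in range(dim // 2):
        freq = 1.0 / (10000.0 ** (2 * k / dim))
        out += [math.sin(value * freq), math.cos(value * freq)]
    return out

def layout_tokens(cats, positions, embed_table):
    """Eq. (1) for one ray: Embed(c) + PosEnc(p), one token per intersection point."""
    dim = len(embed_table[0])
    tokens = []
    for c, (depth, height) in zip(cats, positions):
        # Half of the channels encode depth, the other half height (assumed split).
        pe = pos_enc(depth, dim // 2) + pos_enc(height, dim // 2)
        tokens.append([e + p for e, p in zip(embed_table[c], pe)])
    return tokens

def pixel_attention(query, tokens, valid):
    """Cross-attention for one pixel: the query attends only to tokens on its own ray."""
    if not any(valid):                         # ray with no intersections
        return [0.0] * len(query), [0.0] * len(tokens)
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * t for q, t in zip(query, tok)) if ok else -math.inf
              for tok, ok in zip(tokens, valid)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]        # exp(-inf) == 0 for padding
    total = sum(weights)
    weights = [w / total for w in weights]
    out = [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(len(query))]
    return out, weights

rng = random.Random(0)
embed_table = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(4)]  # 4 classes, C = 8
cats = [3, 3, 1, 0]                            # last entry is padding
positions = [(1.0, 1.0), (2.0, 0.5), (3.0, 0.0), (0.0, 0.0)]
tokens = layout_tokens(cats, positions, embed_table)
query = [rng.uniform(-1, 1) for _ in range(8)]
out, weights = pixel_attention(query, tokens, valid=[c != 0 for c in cats])
print(weights)
```

Because each pixel only attends to the $M$ tokens on its own ray, the cost grows with $M$ per pixel and does not depend on the number of pixels in other views.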

Multi-view Reference RGB-D Image Conditioning. In addition to being faithful to the input floorplan, the generated RGB-D images should be consistent with the reference images as well. Our multi-view RGB-D conditioning design is inspired by EscherNet (Kong et al., 2024), an object-centric novel-view synthesis method. By cross-attending from the target RGB latents (as queries) to the reference RGB images (as keys and values), where the image features are simply treated as tokens in sequences, EscherNet can take multiple RGB reference images as conditions to ensure the quality of its output. To encode the camera poses and thus capture the relative transformation of the target and reference view, Camera Positional Encoding (CaPE) was introduced to augment the visual tokens. In contrast, we have not only RGB references but also depth, which provides a geometry reference. Hence, we introduce Depth-enhanced Camera Positional Encoding (DeCaPE) to better enhance the visual tokens to capture their similarity in 3D and thus improve the generation quality.

We first revisit CaPE and then describe DeCaPE. To avoid notation clutter, we denote $\mathbf{P}_Q = \mathbf{P}_j^n$ and $\mathbf{P}_K = \mathbf{P}_i^r$. Further, we have $\mathbf{v}_Q$ and $\mathbf{v}_K$, which are two tokens from $\mathbf{z}_j^n$ and $\mathbf{x}_i^r$, respectively. In CaPE, $\phi(\mathbf{P})$ is defined in analogy to the camera extrinsic $\mathbf{P}$ so that the visual tokens in high-dimensional space can be transformed via $\phi(\mathbf{P})$ in a similar way to how a point cloud in 3D space is transformed via $\mathbf{P}$. The similarity between $\mathbf{v}_Q$ and $\mathbf{v}_K$ is then computed as

$$s_{QK} = \langle \phi(\mathbf{P}_Q^{-\intercal})\,\mathbf{v}_Q,\ \phi(\mathbf{P}_K)\,\mathbf{v}_K \rangle = \mathbf{v}_Q^{\intercal}\,\phi(\mathbf{P}_Q^{-1})\,\phi(\mathbf{P}_K)\,\mathbf{v}_K = \mathbf{v}_Q^{\intercal}\,\phi(\mathbf{P}_Q^{-1}\mathbf{P}_K)\,\mathbf{v}_K. \tag{2}$$
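The chain of equalities in Eq. (2) can be checked numerically. The sketch below is illustrative only: it assumes that $\phi$ replicates the 4×4 pose along a block diagonal (one simple choice satisfying $\phi(\mathbf{A})\phi(\mathbf{B}) = \phi(\mathbf{A}\mathbf{B})$), and the helpers `make_pose`, `phi`, and `cape_score` are hypothetical names, not the released implementation.

```python
import numpy as np

def make_pose(yaw, t):
    """Rigid 4x4 pose: rotation by `yaw` about the z-axis, then translation t."""
    c, s = np.cos(yaw), np.sin(yaw)
    P = np.eye(4)
    P[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    P[:3, 3] = t
    return P

def phi(P, d):
    """Assumed lifting: copy P along a d x d block diagonal (d % 4 == 0)."""
    return np.kron(np.eye(d // 4), P)

def cape_score(P_q, P_k, v_q, v_k):
    """Left-hand side of Eq. (2): <phi(P_Q^{-T}) v_Q, phi(P_K) v_K>."""
    d = v_q.shape[0]
    q = phi(np.linalg.inv(P_q).T, d) @ v_q
    k = phi(P_k, d) @ v_k
    return float(q @ k)

rng = np.random.default_rng(0)
v_q, v_k = rng.normal(size=8), rng.normal(size=8)
P_q = make_pose(0.3, [1.0, 2.0, 0.5])
P_k = make_pose(-1.1, [0.0, -1.0, 0.5])
G = make_pose(2.0, [5.0, -3.0, 1.0])  # arbitrary change of world frame

s_left = cape_score(P_q, P_k, v_q, v_k)
# Right-hand side of Eq. (2): v_Q^T phi(P_Q^{-1} P_K) v_K
s_right = float(v_q @ phi(np.linalg.inv(P_q) @ P_k, 8) @ v_k)
# Re-expressing both poses in another world frame leaves the score unchanged.
s_moved = cape_score(G @ P_q, G @ P_k, v_q, v_k)
print(np.isclose(s_left, s_right), np.isclose(s_left, s_moved))
```

The second check holds because $(\mathbf{G}\mathbf{P}_Q)^{-1}(\mathbf{G}\mathbf{P}_K) = \mathbf{P}_Q^{-1}\mathbf{P}_K$ for any invertible $\mathbf{G}$.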

The key property of CaPE is that $\mathbf{P}_Q^{-1}\mathbf{P}_K$ encodes the relative transformation between the camera poses while being invariant to the choice of the world coordinate system. Eq. (2) can be interpreted as follows: the feature of the reference view (key), expressed in its camera coordinate system, is transformed to the coordinate system of the novel view (query) before taking the dot product with the query feature. Since we have the explicit 3D positions of the reference tokens from the reference depth image, DeCaPE uses these positions to augment $\mathbf{v}_K$ in its camera coordinate system before applying the camera transformation,

$$s_{QK} = \mathbf{v}_Q^{\intercal}\,\phi(\underbrace{\mathbf{P}_Q^{-1}\mathbf{P}_K}_{\text{camera poses}})\,(\mathbf{v}_K + \underbrace{\texttt{PosEnc}(\mathbf{p}_K)}_{\text{3D position from depth}}), \tag{3}$$

where $\mathbf{p}_K$ is the 3D position of $\mathbf{v}_K$ in the camera coordinate system of the key (reference view), which is obtained from the depth image, and PosEnc() is a learnable positional encoding. While preserving the invariance to the choice of the world coordinate system, Eq. (3) enhances the similarity (attention score) computation of CaPE for the cross-attention and therefore leads to better generation, as we will show in the experiments.
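A minimal sketch of Eq. (3) follows, under several assumptions that are not taken from the paper: a pinhole intrinsic matrix `K` is used to back-project depth, the learnable PosEnc is stood in for by a fixed random linear map `W_pos`, and $\phi$ is the block-diagonal lifting. All function names are hypothetical.

```python
import numpy as np

def phi(P, d):
    """Assumed lifting: copy the 4x4 pose along a d x d block diagonal."""
    return np.kron(np.eye(d // 4), P)

def unproject(depth, K):
    """Back-project a depth map (H, W) to camera-frame points (H, W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    x = (u - K[0, 2]) / K[0, 0] * depth
    y = (v - K[1, 2]) / K[1, 1] * depth
    return np.stack([x, y, depth], axis=-1)

def decape_score(P_q, P_k, v_q, v_k, p_k, W_pos):
    """Eq. (3): v_Q^T phi(P_Q^{-1} P_K) (v_K + PosEnc(p_K)).

    PosEnc is approximated by the linear map W_pos (d x 3); in the paper it
    is a learnable encoding whose exact form is not specified here.
    """
    d = v_q.shape[0]
    rel = np.linalg.inv(P_q) @ P_k
    return float(v_q @ phi(rel, d) @ (v_k + W_pos @ p_k))

rng = np.random.default_rng(0)
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
points = unproject(np.full((4, 4), 2.0), K)   # reference depth of 2 everywhere
p_k = points[1, 3]                            # 3D position of one key token
v_q, v_k = rng.normal(size=8), rng.normal(size=8)
W_pos = rng.normal(size=(8, 3))
P_q, P_k = np.eye(4), np.eye(4)
P_k[:3, 3] = [0.5, 0.0, 0.0]                  # key camera shifted along x
score = decape_score(P_q, P_k, v_q, v_k, p_k, W_pos)
```

Because $\mathbf{p}_K$ lives in the key camera's own frame, it does not change when the world frame changes, so the score keeps the invariance of CaPE.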

Figure 4: Refinement example for a bed. To improve the quality of the noisy bed mesh (left), we sample cameras surrounding the bed (highlighted in blue) and generate RGB-D images in one batch, yielding a more complete and smoother mesh (right).

3.3 3D Scene Reconstruction and Post Refinement

Autoregressive RGB-D Image Generation. The images are generated in a batch-wise manner, in an order determined by a connected pose graph. The camera poses are placed uniformly across the scene space, with randomly chosen rotations. We construct the pose graph by linking every two camera poses whose relative distance and rotation angle fall within a threshold. To generate the first batch, thanks to classifier-free guidance, we use only the layout as the condition to generate RGB-D images. Next, we generate RGB-D images at the neighboring locations, conditioned on the images already generated in the first batch. We then traverse the entire graph in a batch-wise manner following this procedure. When traversing the graph and encountering a pose $v$ whose images have not been generated, we choose generated views within $\delta_r$ hops from $v$ as reference views and poses within $\delta_n$ hops from $v$ that have not been generated as novel views (we provide the detailed procedure and an ablation of $\delta_r$ and $\delta_n$ in Appendix B). With all the generated RGB-D images covering the entire house, we fuse them into a 3D mesh using TSDF fusion (Zeng et al., 2017).
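The traversal described above can be sketched as a scheduling routine that decides, for each batch, which views serve as references and which are generated. This is a simplified illustration, not the procedure of Appendix B: poses are reduced to a 2D position plus yaw, the traversal order is assumed to be breadth-first, and `build_pose_graph`, `within_hops`, and `schedule_batches` are hypothetical names.

```python
import math
from collections import deque

def build_pose_graph(positions, yaws, max_dist, max_angle):
    """Link two poses when both their distance and yaw difference are small."""
    n = len(positions)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.dist(positions[i], positions[j])
            dyaw = abs((yaws[i] - yaws[j] + math.pi) % (2 * math.pi) - math.pi)
            if dist <= max_dist and dyaw <= max_angle:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def within_hops(adj, start, hops):
    """All nodes reachable from `start` in at most `hops` edges (incl. start)."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if depth[u] == hops:
            continue
        for w in sorted(adj[u]):
            if w not in depth:
                depth[w] = depth[u] + 1
                queue.append(w)
    return set(depth)

def schedule_batches(adj, start, delta_r, delta_n):
    """Return a list of (reference_views, novel_views) batches."""
    order = sorted(within_hops(adj, start, len(adj)))  # every reachable pose
    order.sort(key=lambda v: (v != start, v))          # visit the start pose first
    generated, batches = set(), []
    for v in order:
        if v in generated:
            continue
        refs = sorted(within_hops(adj, v, delta_r) & generated)
        novel = sorted(within_hops(adj, v, delta_n) - generated)
        batches.append((refs, novel))  # first batch has no refs: layout only
        generated.update(novel)
    return batches

# Five poses on a line, one unit apart, all facing the same direction.
positions = [(float(i), 0.0) for i in range(5)]
adj = build_pose_graph(positions, [0.0] * 5, max_dist=1.1, max_angle=0.5)
batches = schedule_batches(adj, start=0, delta_r=2, delta_n=1)
```

Each entry of `batches` would correspond to one call of the diffusion model, with the first entry conditioned on the layout alone.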

Post Refinement for Scene Reconstruction. While the pose graph provides good coverage of the scene, holes can still exist in the reconstructed mesh, typically in regions with clustered objects. In addition, for some objects (e.g., chairs, sofas, and beds), denser RGB-D images are needed to obtain detailed geometry and texture. An example is shown in Fig. 4 (left). To address both issues, we densely sample more camera poses looking at each object in the scene and then generate all RGB-D images around the same object in a single batch. In this way, the dense, object-centric poses allow complete and detailed observations of the object, and the single-step generation ensures cross-view consistency, leading to higher reconstruction quality, as shown in Fig. 4 (right). We provide more details in Appendix B.
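One simple way to obtain object-centric poses is to place cameras on a ring around an object and point each at the object's center. The sketch below assumes this ring layout, a fixed camera height, and a yaw-only orientation. The actual sampling strategy is described in Appendix B and may differ. `ring_cameras` is a hypothetical helper.

```python
import math

def ring_cameras(center, radius, height, n):
    """Sample n cameras on a circle around `center` (x, y), all looking inward.

    Returns a list of ((x, y, z), yaw) where yaw is the heading angle in the
    ground plane that points from the camera toward the object center.
    """
    cams = []
    for k in range(n):
        theta = 2.0 * math.pi * k / n
        x = center[0] + radius * math.cos(theta)
        y = center[1] + radius * math.sin(theta)
        yaw = math.atan2(center[1] - y, center[0] - x)
        cams.append(((x, y, height), yaw))
    return cams

# Eight views around a bed centered at (2.0, 3.0), generated as one batch.
bed_views = ring_cameras(center=(2.0, 3.0), radius=1.5, height=1.4, n=8)
```

All poses returned for one object would then be passed to the model together, matching the single-batch generation described above.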

Figure 5: Qualitative comparison. We show two random viewpoints for each scene as well as a top-down view. We compare our model with CC3D (Bahmani et al., 2023) and Text2Room (Höllein et al., 2023). HouseCrafter generates results with better geometry and textures. More examples are provided in Fig. 9.

4 Experiment

4.1 Experimental setup

Dataset. We conduct experiments on 3D-FRONT (Fu et al., 2021), a synthetic indoor scene dataset that contains rich house-scale layouts and is populated with detailed 3D furniture models. Compared with other indoor scene datasets (Dai et al., 2017; Chang et al., 2017), it allows us to render high-quality images of the scene at any selected pose, which is essential for training our novel-view RGB-D image diffusion model. For each house in the dataset, we obtain the floorplan from the furniture bounding boxes and wall mesh and generate the training images by rendering from sampled poses. Nearly 2000 houses with 2 million rendered images are used for training, while 300 houses are used for evaluation.

Figure 6: User study. Participants significantly favor our method over the baselines for both overall quality and coherence to the floorplan.

Evaluation. We evaluate the multi-view RGB-D image generation and the quality of the reconstructed 3D scene meshes. Regarding the multi-view RGB-D generation, we evaluate the consistency among the multi-view images and their visual quality. For consistency, we consider two aspects: reference-novel (R-N) and novel-novel (N-N) view consistency. While the open-ended nature of the generation task makes the evaluation challenging due to the absence of ground-truth information, we can measure the consistency of two views within their overlapping region, which can be estimated from the depth and poses. Given the estimated overlap region, we evaluate RGB consistency using PSNR, and depth consistency using Absolute Mean Relative Error (AbsRel) and the percentage of pixel inliers $\delta_i$ with threshold $1.25^i$. We also report Fréchet Inception Distance (FID) (Heusel et al., 2017) and Inception Score (IS) (Salimans et al., 2016) for the visual quality.
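The consistency metrics can be written compactly once the overlap region is known. The sketch below uses the commonly used definitions of PSNR, AbsRel, and the inlier ratio, and assumes that the second view has already been warped into the first and that `mask` marks the overlap. Which view serves as the denominator in AbsRel is an assumption here, and the function names are hypothetical.

```python
import numpy as np

def psnr(img_a, img_b, mask, max_val=1.0):
    """PSNR between two RGB images (H, W, 3) restricted to mask (H, W)."""
    diff = (img_a - img_b)[mask]
    mse = float(np.mean(diff ** 2))
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_val ** 2 / mse))

def abs_rel(depth_a, depth_b, mask):
    """Mean of |d_a - d_b| / d_b over the overlap (d_b assumed positive)."""
    a, b = depth_a[mask], depth_b[mask]
    return float(np.mean(np.abs(a - b) / b))

def delta_inlier(depth_a, depth_b, mask, i=0.5):
    """Fraction of pixels with max(d_a/d_b, d_b/d_a) below 1.25**i."""
    a, b = depth_a[mask], depth_b[mask]
    ratio = np.maximum(a / b, b / a)
    return float(np.mean(ratio < 1.25 ** i))

# Toy overlap of three pixels; only the first depth value disagrees.
d_ref = np.array([[1.0, 2.0, 4.0]])
d_new = np.array([[1.2, 2.0, 4.0]])
mask = np.ones_like(d_ref, dtype=bool)
rel = abs_rel(d_new, d_ref, mask)            # (0.2 / 1.0) / 3
inl = delta_inlier(d_new, d_ref, mask, 0.5)  # 1.2 exceeds 1.25**0.5, so 2 of 3 pass
```

With $i = 0.5$ the threshold is about 1.118, which is stricter than the $1.25$ threshold often used in depth estimation.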

To evaluate the faithfulness to the input floorplan, we rely on the state-of-the-art 3D instance segmentation method ODIN (Jain et al., 2024). We extract top-down 2D boxes from the 3D segmentation results and compare them with the floorplan's boxes using mAP@25 (Lin et al., 2014). While the absolute value of mAP does not directly reflect the layout compliance of the generated results due to segmentation errors, we assume that mAP is positively correlated with layout compliance, i.e., better generation results lead to higher mAP. We also report the mAP of ground-truth images as a reference.
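The core ingredient of mAP@25 is matching predicted top-down boxes to floorplan boxes at an IoU threshold of 0.25. The sketch below shows only that matching step, with greedy score-ordered assignment and a single precision/recall pair. It is a simplification and does not compute the full averaged precision. The box format `(x_min, y_min, x_max, y_max)` and the function names are assumptions.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_boxes(pred, gt, thr=0.25):
    """Greedy one-to-one matching of predictions to ground-truth boxes.

    pred: list of (score, label, box); gt: list of (label, box).
    Returns (precision, recall) at the given IoU threshold.
    """
    used, tp = set(), 0
    for score, label, box in sorted(pred, key=lambda p: -p[0]):
        best, best_iou = None, thr
        for j, (g_label, g_box) in enumerate(gt):
            if j in used or g_label != label:
                continue
            iou = box_iou(box, g_box)
            if iou >= best_iou:
                best, best_iou = j, iou
        if best is not None:
            used.add(best)
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    return precision, recall

floorplan = [("bed", (0.0, 0.0, 2.0, 2.0)), ("chair", (3.0, 3.0, 4.0, 4.0))]
detected = [(0.9, "bed", (0.5, 0.0, 2.5, 2.0)), (0.8, "chair", (10.0, 10.0, 11.0, 11.0))]
precision, recall = match_boxes(detected, floorplan)
```

In this toy case the bed is matched (IoU 0.6) and the chair is not, which mirrors how a misplaced object lowers the layout score.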

Table 1: Quantitative comparison in terms of visual quality (IS) and compliance with layout guidance (mAP@25).

| Method | Visual: IS ↑ | Layout: mAP@25 ↑ |
|---|---|---|
| Text2Room | 5.35 | - |
| CC3D | 4.02 | 25.60 |
| HouseCrafter | 4.24 | 46.48 |
| GT-3DFront | 4.50 | 54.51 |

Regarding 3D scenes, we conduct a user study involving 12 participants to compare our results with the baseline methods in terms of perceptual quality and coherence to the given floorplan. For each baseline, 8 pairs of meshes (ours vs. baseline) are shown to each participant. We also add 3 pairs with ground-truth meshes, resulting in a total of 12 × (2 × 8 + 3) = 228 data points. In addition, we report IS calculated from RGB images rendered at random poses for each scene. For methods that have layout guidance, the mAP of instance segmentation is also reported. We provide more details about the evaluation in Appendix C.3.

4.2 Comparison with state of the art

Baselines. No existing method directly generates 3D houses from floorplans. Closest to our work is CC3D (Bahmani et al., 2023), which produces a room-scale indoor scene from a 2D layout. CC3D represents the scene as a feature volume that can be rendered with a neural renderer. We also compare against Text2Room (Höllein et al., 2023), which generates an indoor scene from a series of text prompts. Since Text2Room does not receive any layout guidance, we only compare to it in terms of visual quality.

Results. We provide a detailed quantitative analysis in Fig. 6 and Table 1 and qualitative comparisons in Fig. 5. Both the human (Fig. 6) and automated (Table 1) evaluations show that our method generates results that are more faithful to the layout guidance. However, IS greatly favors Text2Room over our method and CC3D, whereas users significantly prefer our results regarding the visual quality of the generated mesh. The higher IS of Text2Room is due to the more diverse scenes generated by the text-to-image model (Rombach et al., 2022) trained on a web-scale dataset. Although our results are less diverse due to fine-tuning on a smaller dataset, our method can produce more realistic rooms using information from the floorplan, as recognized by users.

Table 2: Ablation studies of different design choices for novel view RGB-D image generation. The best results are highlighted with bold and the second best with underline.

| Variant | Output Depth | Input Depth | Layout Cond. | FID ↓ | IS ↑ | PSNR ↑ (R-N) | PSNR ↑ (N-N) | AbsRel ↓ (R-N) | AbsRel ↓ (N-N) | $\delta_{0.5}$ ↑ (R-N) | $\delta_{0.5}$ ↑ (N-N) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ① | ✗ | ✗ | ✗ | 49.35 | 5.00 | - | - | - | - | - | - |
| ② | ✓ | ✗ | ✗ | 33.39 | 5.23 | 20.99 | 22.60 | 23.56 | 11.48 | 79.14 | 88.79 |
| ③ | ✓ | ✓ | ✗ | 35.77 | 5.16 | 20.91 | 21.98 | 22.28 | 12.05 | 81.78 | 88.23 |
| ④ | ✓ | ✗ | ✓ | 15.64 | 4.70 | 25.36 | 24.79 | 7.65 | 7.85 | 90.44 | 91.77 |
| ⑤ | ✓ | ✓ | ✓ | 16.70 | 4.74 | 25.31 | 24.69 | 6.79 | 7.37 | 92.20 | 92.65 |

4.3 Ablation studies

We perform ablation studies for various design choices of the generation model on a set of 300 houses from the 3D-FRONT dataset. We sample camera poses in groups of 6: 3 reference and 3 novel views. In each group, reference-novel consistency is measured using the correspondence of each novel view with all reference views, while novel-novel consistency is measured on the 3 pairs formed within each group of 3 novel views. Regarding layout evaluation, we use images generated in the autoregressive pipeline.

Table 3: Layout compliance evaluation.

| Variant | Input Depth | mAP@25 ↑ |
|---|---|---|
| ④ | ✗ | 48.46 |
| ⑤ | ✓ | 52.26 |
| GT | | 52.56 |

Generating depth improves visual appearance. Variant pair (①, ②) in Table 2 demonstrates that by forcing the model to learn to generate depth, the FID and IS of RGB output are both improved, indicating better performance of the RGB generation.

Depth conditioning enhances geometry consistency. As shown by variant pairs (②, ③) and (④, ⑤) in Table 2, reference depth images improve the depth consistency, with a stronger effect on R-N than on N-N, while having mixed influence on the RGB metrics. The geometry improvement also benefits layout compliance (Table 3), demonstrating the effectiveness of the depth condition.

Layout guidance is critical for both appearance and geometry quality. Variant pairs (②, ④) and (③, ⑤) show that layout conditioning brings strong improvements in FID and in all consistency metrics, especially the depth metrics. The results reinforce the finding from previous works (Schult et al., 2023; Fang et al., 2023) that coarse depth and high-level semantic information from the layout have a significant impact on the generation results.

5 Conclusion

In this work, we present HouseCrafter, a pipeline that transforms 2D floorplans into detailed 3D spaces. We generate dense RGB-D images autoregressively and fuse them into a 3D mesh. Our key innovation is an image-based diffusion model that produces multiview-consistent RGB-D images guided by the floorplan and reference RGB-D images. This capability enables the generation of house-scale 3D scenes with high-quality geometry and texture, surpassing previous approaches, which could only generate scenes at the room scale.

References

  • Bahmani et al. (2023) Sherwin Bahmani, Jeong Joon Park, Despoina Paschalidou, Xingguang Yan, Gordon Wetzstein, Leonidas Guibas, and Andrea Tagliasacchi. Cc3d: Layout-conditioned generation of compositional 3d scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  7171–7181, 2023.
  • Bautista et al. (2022) Miguel Angel Bautista, Pengsheng Guo, Samira Abnar, Walter Talbott, Alexander Toshev, Zhuoyuan Chen, Laurent Dinh, Shuangfei Zhai, Hanlin Goh, Daniel Ulbricht, et al. Gaudi: A neural architect for immersive 3d scene generation. Advances in Neural Information Processing Systems, 35:25102–25116, 2022.
  • Behley et al. (2019) Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE/CVF international conference on computer vision, pp.  9297–9307, 2019.
  • Blattmann et al. (2023) Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023.
  • Chang et al. (2017) Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158, 2017.
  • Chung et al. (2023) Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam, Jaerin Lee, and Kyoung Mu Lee. Luciddreamer: Domain-free generation of 3d gaussian splatting scenes. arXiv preprint arXiv:2311.13384, 2023.
  • Dai et al. (2017) Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  5828–5839, 2017.
  • Deitke et al. (2022) Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. Procthor: Large-scale embodied ai using procedural generation. Advances in Neural Information Processing Systems, 35:5982–5994, 2022.
  • Deitke et al. (2023) Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  13142–13153, 2023.
  • Deitke et al. (2024) Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects. Advances in Neural Information Processing Systems, 36, 2024.
  • Engstler et al. (2024) Paul Engstler, Andrea Vedaldi, Iro Laina, and Christian Rupprecht. Invisible stitch: Generating smooth 3d scenes with depth inpainting, 2024.
  • Fang et al. (2023) Chuan Fang, Xiaotao Hu, Kunming Luo, and Ping Tan. Ctrl-room: Controllable text-to-3d room meshes generation with layout constraints, 2023.
  • Feng et al. (2024) Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and William Yang Wang. Layoutgpt: Compositional visual planning and generation with large language models. Advances in Neural Information Processing Systems, 36, 2024.
  • Fu et al. (2021) Huan Fu, Bowen Cai, Lin Gao, Ling-Xiao Zhang, Jiaming Wang, Cao Li, Qixun Zeng, Chengyue Sun, Rongfei Jia, Binqiang Zhao, et al. 3d-front: 3d furnished rooms with layouts and semantics. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  10933–10942, 2021.
  • Ge et al. (2024) Yunhao Ge, Yihe Tang, Jiashu Xu, Cem Gokmen, Chengshu Li, Wensi Ai, Benjamin Jose Martinez, Arman Aydin, Mona Anvari, Ayush K Chakravarthy, et al. Behavior vision suite: Customizable dataset generation via simulation. arXiv preprint arXiv:2405.09546, 2024.
  • Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  • Höllein et al. (2023) Lukas Höllein, Ang Cao, Andrew Owens, Justin Johnson, and Matthias Nießner. Text2room: Extracting textured 3d meshes from 2d text-to-image models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.  7909–7920, October 2023.
  • Hu et al. (2024) Hanzhe Hu, Zhizhuo Zhou, Varun Jampani, and Shubham Tulsiani. Mvd-fusion: Single-view 3d via depth-consistent multi-view generation. In CVPR, 2024.
  • Huang et al. (2023) Zehuan Huang, Hao Wen, Junting Dong, Yaohui Wang, Yangguang Li, Xinyuan Chen, Yan-Pei Cao, Ding Liang, Yu Qiao, Bo Dai, et al. Epidiff: Enhancing multi-view synthesis via localized epipolar-constrained diffusion. arXiv preprint arXiv:2312.06725, 2023.
  • Jain et al. (2024) Ayush Jain, Pushkal Katara, Nikolaos Gkanatsios, Adam W Harley, Gabriel Sarch, Kriti Aggarwal, Vishrav Chaudhary, and Katerina Fragkiadaki. Odin: A single model for 2d and 3d perception. arXiv preprint arXiv:2401.02416, 2024.
  • Ju et al. (2023) Xiaoliang Ju, Zhaoyang Huang, Yijin Li, Guofeng Zhang, Yu Qiao, and Hongsheng Li. Diffindscene: Diffusion-based high-quality 3d indoor scene generation. 2023.
  • Kant et al. (2023) Yash Kant, Aliaksandr Siarohin, Michael Vasilkovsky, Riza Alp Guler, Jian Ren, Sergey Tulyakov, and Igor Gilitschenski. invs: Repurposing diffusion inpainters for novel view synthesis. In SIGGRAPH Asia 2023 Conference Papers, pp.  1–12, 2023.
  • Kant et al. (2024) Yash Kant, Ziyi Wu, Michael Vasilkovsky, Guocheng Qian, Jian Ren, Riza Alp Guler, Bernard Ghanem, Sergey Tulyakov, Igor Gilitschenski, and Aliaksandr Siarohin. Spad: Spatially aware multiview diffusers. arXiv preprint arXiv:2402.05235, 2024.
  • Ke et al. (2024) Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  • Kerbl et al. (2023) Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), July 2023. URL https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/.
  • Kong et al. (2024) Xin Kong, Shikun Liu, Xiaoyang Lyu, Marwan Taher, Xiaojuan Qi, and Andrew J Davison. Eschernet: A generative model for scalable view synthesis. arXiv preprint arXiv:2402.03908, 2024.
  • Lin et al. (2023) Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  300–309, 2023.
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp.  740–755. Springer, 2014.
  • Liu et al. (2023a) Minghua Liu, Ruoxi Shi, Linghao Chen, Zhuoyang Zhang, Chao Xu, Xinyue Wei, Hansheng Chen, Chong Zeng, Jiayuan Gu, and Hao Su. One-2-3-45++: Fast single image to 3d objects with consistent multi-view generation and 3d diffusion. arXiv preprint arXiv:2311.07885, 2023a.
  • Liu et al. (2024) Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. Advances in Neural Information Processing Systems, 36, 2024.
  • Liu et al. (2023b) Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object, 2023b.
  • Liu et al. (2023c) Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. Syncdreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023c.
  • Long et al. (2023) Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Single image to 3d using cross-domain diffusion. arXiv preprint arXiv:2310.15008, 2023.
  • Mildenhall et al. (2021) Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  • Ouyang et al. (2023) Hao Ouyang, Kathryn Heal, Stephen Lombardi, and Tiancheng Sun. Text2immersion: Generative immersive scene with 3d gaussians. arXiv preprint arXiv:2312.09242, 2023.
  • Piccinelli et al. (2024) Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. Unidepth: Universal monocular metric depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  • Poole et al. (2022) Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
  • Raistrick et al. (2024) Alexander Raistrick, Lingjie Mei, Karhan Kayan, David Yan, Yiming Zuo, Beining Han, Hongyu Wen, Meenal Parakh, Stamatis Alexandropoulos, Lahav Lipson, et al. Infinigen indoors: Photorealistic indoor scenes using procedural generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  21783–21794, 2024.
  • Ren et al. (2023) Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, and Francis Williams. Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies. arXiv preprint, 2023.
  • Rockwell et al. (2021) Chris Rockwell, David F. Fouhey, and Justin Johnson. Pixelsynth: Generating a 3d-consistent experience from a single image. In ICCV, 2021.
  • Rogozhnikov (2022) Alex Rogozhnikov. Einops: Clear and reliable tensor manipulations with einstein-like notation. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=oapKSVM2bcj.
  • Rombach et al. (2021) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  10684–10695, 2022.
  • Saharia et al. (2022) Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022.
  • Salimans et al. (2016) Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.
  • Schuhmann et al. (2022) Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  • Schult et al. (2023) Jonas Schult, Sam Tsai, Lukas Höllein, Bichen Wu, Jialiang Wang, Chih-Yao Ma, Kunpeng Li, Xiaofang Wang, Felix Wimbauer, Zijian He, et al. Controlroom3d: Room generation using semantic proxy rooms. arXiv preprint arXiv:2312.05208, 2023.
  • Shi et al. (2023a) Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei, Linghao Chen, Chong Zeng, and Hao Su. Zero123++: a single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110, 2023a.
  • Shi et al. (2023b) Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023b.
  • Shriram et al. (2024) Jaidev Shriram, Alex Trevithick, Lingjie Liu, and Ravi Ramamoorthi. Realmdreamer: Text-driven 3d scene generation with inpainting and depth diffusion, 2024.
  • Song et al. (2023) Liangchen Song, Liangliang Cao, Hongyu Xu, Kai Kang, Feng Tang, Junsong Yuan, and Yang Zhao. Roomdreamer: Text-driven 3d indoor scene synthesis with coherent geometry and texture. arXiv preprint arXiv:2305.11337, 2023.
  • Szymanowicz et al. (2023) Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Viewset diffusion:(0-) image-conditioned 3d generative models from 2d data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  8863–8873, 2023.
  • Tang et al. (2023) Shitao Tang, Fuyang Zhang, Jiacheng Chen, Peng Wang, and Yasutaka Furukawa. Mvdiffusion: Enabling holistic multi-view image generation with correspondence-aware diffusion. arXiv preprint arXiv:2307.01097, 2023.
  • Tochilkin et al. (2024) Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, and Yan-Pei Cao. Triposr: Fast 3d object reconstruction from a single image. arXiv preprint arXiv:2403.02151, 2024.
  • Vidanapathirana et al. (2021) Madhawa Vidanapathirana, Qirui Wu, Yasutaka Furukawa, Angel X Chang, and Manolis Savva. Plan2scene: Converting floorplans to 3d scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10733–10742, 2021.
  • Voleti et al. (2024) Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitrii Tochilkin, Christian Laforte, Robin Rombach, and Varun Jampani. SV3D: Novel multi-view synthesis and 3D generation from a single image using latent video diffusion. arXiv preprint 2403.12008, 2024.
  • Wang et al. (2024) Zhengyi Wang, Yikai Wang, Yifei Chen, Chendong Xiang, Shuo Chen, Dajiang Yu, Chongxuan Li, Hang Su, and Jun Zhu. Crm: Single image to 3d textured mesh with convolutional reconstruction model. arXiv preprint arXiv:2403.05034, 2024.
  • Wen et al. (2023) Zehao Wen, Zichen Liu, Srinath Sridhar, and Rao Fu. Anyhome: Open-vocabulary generation of structured and textured 3d homes. arXiv preprint arXiv:2312.06644, 2023.
  • Weng et al. (2023) Haohan Weng, Tianyu Yang, Jianan Wang, Yu Li, Tong Zhang, CL Chen, and Lei Zhang. Consistent123: Improve consistency for one image to 3d object synthesis. arXiv preprint arXiv:2310.08092, 2023.
  • Woo et al. (2023) Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie. Convnext v2: Co-designing and scaling convnets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  16133–16142, 2023.
  • Wu et al. (2024) Zhennan Wu, Yang Li, Han Yan, Taizhang Shang, Weixuan Sun, Senbo Wang, Ruikai Cui, Weizhe Liu, Hiroyuki Sato, Hongdong Li, and Pan Ji. Blockfusion: Expandable 3d scene generation using latent tri-plane extrapolation, 2024.
  • Yang et al. (2023) Jiayu Yang, Ziang Cheng, Yunfei Duan, Pan Ji, and Hongdong Li. Consistnet: Enforcing 3d consistency for multi-view images diffusion, 2023.
  • Yang et al. (2024) Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Chris Callison-Burch, Mark Yatskar, Aniruddha Kembhavi, and Christopher Clark. Holodeck: Language guided generation of 3d embodied ai environments, 2024.
  • Ye et al. (2023) Jianglong Ye, Peng Wang, Kejie Li, Yichun Shi, and Heng Wang. Consistent-1-to-3: Consistent image to 3d view synthesis via geometry-aware diffusion models. arXiv preprint arXiv:2310.03020, 2023.
  • Yeshwanth et al. (2023) Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In Proceedings of the International Conference on Computer Vision (ICCV), 2023.
  • Yi et al. (2024) Taoran Yi, Jiemin Fang, Junjie Wang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, and Xinggang Wang. Gaussiandreamer: Fast generation from text to 3d gaussians by bridging 2d and 3d diffusion models. In CVPR, 2024.
  • Yu et al. (2023) Hong-Xing Yu, Haoyi Duan, Junhwa Hur, Kyle Sargent, Michael Rubinstein, William T Freeman, Forrester Cole, Deqing Sun, Noah Snavely, Jiajun Wu, et al. Wonderjourney: Going from anywhere to everywhere. arXiv preprint arXiv:2312.03884, 2023.
  • Zeng et al. (2017) Andy Zeng, Shuran Song, Matthias Nießner, Matthew Fisher, Jianxiong Xiao, and Thomas Funkhouser. 3dmatch: Learning local geometric descriptors from rgb-d reconstructions. In CVPR, 2017.
  • Zhang et al. (2023) Jingbo Zhang, Xiaoyu Li, Ziyu Wan, Can Wang, and Jing Liao. Text2nerf: Text-driven 3d scene generation with neural radiance fields. arXiv preprint arXiv:2305.11588, 2023.
  • Zheng & Vedaldi (2023) Chuanxia Zheng and Andrea Vedaldi. Free3d: Consistent novel view synthesis without 3d representation. arXiv, 2023.
  • Zheng et al. (2023) Guangcong Zheng, Xianpan Zhou, Xuewei Li, Zhongang Qi, Ying Shan, and Xi Li. Layoutdiffusion: Controllable diffusion model for layout-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  22490–22499, 2023.

Appendix A Details of Floorplan Conditioning

For a novel view with the latent feature $\mathbf{z}_j^n \in \mathbb{R}^{C\times H\times W}$ (where $C$ is the feature dimension and $H\times W$ the spatial dimensions), we obtain the layout information $\mathbf{l}_j \in \mathbb{R}^{M\times C\times H\times W}$ at the (latent) pixel level by casting a ray through each pixel and encoding the semantic and geometric information at every intersection point between the ray and the floorplan components, so the leading dimension $M$ indexes the intersection points along each ray.

Subsequently, we apply cross-attention at the ray level: each pixel feature in $\mathbf{z}_j^n$ is the query, and the layout features along its ray are the keys and values, so the attention for each ray is performed independently. To illustrate the operation, we add the batch dimension $B$ and use einops Rogozhnikov (2022) notation:

$$
\begin{aligned}
\mathbf{z}_j^n &\leftarrow \mathrm{rearrange}(\mathbf{z}_j^n,\ \text{B C H W} \rightarrow \text{(B H W) 1 C}) \\
\mathbf{l}_j &\leftarrow \mathrm{rearrange}(\mathbf{l}_j,\ \text{B M C H W} \rightarrow \text{(B H W) M C}) \\
\mathbf{z}_j^n &\leftarrow \mathrm{MHA}(q=\mathbf{z}_j^n,\ k=\mathbf{l}_j,\ v=\mathbf{l}_j) \\
\mathbf{z}_j^n &\leftarrow \mathrm{rearrange}(\mathbf{z}_j^n,\ \text{(B H W) 1 C} \rightarrow \text{B C H W}),
\end{aligned}
$$

where $\mathrm{MHA}(\cdot)$ is a multi-head attention layer. The layout information is injected in the first block of each feature level of the Unet of the base diffusion model. Note that each level operates at a different resolution, so this amounts to injecting the encoded floorplan at different scales.
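The per-ray attention above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: it uses a single attention head instead of multi-head attention, replaces the einops calls with equivalent `transpose`/`reshape` operations, and the projection matrices `w_q`, `w_k`, `w_v` and the function name are hypothetical.

```python
import numpy as np

def ray_cross_attention(z, l, w_q, w_k, w_v):
    """Single-head cross-attention performed independently for every ray.

    z: (B, C, H, W) latent features of the novel view (queries).
    l: (B, M, C, H, W) layout features at M intersection points per ray.
    w_q, w_k, w_v: (C, C) projection matrices (hypothetical parameters).
    """
    B, C, H, W = z.shape
    M = l.shape[1]
    # "B C H W -> (B H W) 1 C": one length-1 query sequence per pixel/ray.
    q = z.transpose(0, 2, 3, 1).reshape(B * H * W, 1, C)
    # "B M C H W -> (B H W) M C": M keys/values per ray.
    kv = l.transpose(0, 3, 4, 1, 2).reshape(B * H * W, M, C)

    q, k, v = q @ w_q, kv @ w_k, kv @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(C)   # (B*H*W, 1, M)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)         # softmax over the M points
    out = attn @ v                                   # (B*H*W, 1, C)
    # "(B H W) 1 C -> B C H W": back to the feature-map layout.
    return out.reshape(B, H, W, C).transpose(0, 3, 1, 2)
```

Because every pixel forms its own length-1 query sequence, changing the layout features of one ray only affects the output at that pixel; any mixing between neighbouring pixels is left to the surrounding convolution layers.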

In the design described above, each pixel receives information from a single ray. Alternatively, a receptive field with kernel size $K>1$ would provide more spatial information. We argue, however, that the resulting $O(K^{2})$ growth in the sequence length of the keys and values is expensive for the attention operation, while the local information exchange between pixels can already be handled by the convolution layers of the network. Furthermore, attending to intersection points from a single ray removes the need for the 3D positions of these points, which depend on an arbitrary world coordinate frame, since the depth along the ray is enough to differentiate them. We also use the height above the floor as part of the position, because the up direction is a well-defined canonical direction for indoor scenes and the height may help the model decide which object is visible along the ray.

Appendix B Details of 3D Scene Reconstruction and Post Refinement

B.1 Autoregressive RGB-D Image Generation.

To generate RGB-D images for scene reconstruction, we first create a connected pose graph $G(V,E)$, where the vertices $V$ are camera poses placed uniformly across the free space obtained from the layout, with randomly chosen rotations. Two poses are linked if their relative distance and rotation angle fall within a threshold.

The reference and novel poses are selected while traversing the graph. The procedure is described in Algorithm 1. To control the number of poses in each generation step, we use two parameters $\delta_r, \delta_n$, which are the hop distances with respect to the current pose for the reference and novel poses, respectively. When visiting a pose $v$ whose images have not been generated, we choose the generated views within $\delta_r$ hops of $v$ as reference views, and the novel poses are those that have not been generated and lie within $\delta_n$ hops of $v$.

Figure 7: Influence of $\delta_r$ and $\delta_n$. In the left chart we vary $\delta_n$ from $1$ to $4$ while keeping $\delta_r = 4$; in the right chart we vary $\delta_r$ while keeping $\delta_n = 4$. Increasing $\delta_r$ (right) and $\delta_n$ (left) improves both the visual quality (FID) and the layout compliance (mAP) of the generated image sequences. $\delta_n$ has a more significant effect than $\delta_r$.

We examine the influence of $\delta_r$ and $\delta_n$ on the generated image sequence using FID and the layout evaluation (Fig. 7). Specifically, we vary $\delta_n$ in the range $[1,4]$ while keeping $\delta_r = 4$, and vice versa. With a hop distance of $4$, the number of views can be as high as $80$; due to memory constraints, we cap the number of novel/reference views at $60$. As shown in Fig. 7, a larger hop distance leads to better FID and mAP.

Algorithm 1 Autoregressive generation via graph traversal
$G(V,E)$: Pose graph
$\delta_n$: Hop distance for novel views
$\delta_r$: Hop distance for reference views
$X \leftarrow \emptyset$ $\triangleright$ Initialize the set of generated poses.
for $v$ in $\mathrm{DFS}(G)$ do $\triangleright$ Traverse the graph via depth-first search.
     if $v \notin X$ then
         $X_r \leftarrow X \cap N(v, G, \delta_r)$ $\triangleright$ Get reference poses. $N(v, G, d)$: nodes within $d$ hops of $v$.
         $X_n \leftarrow N(v, G, \delta_n) \backslash X$ $\triangleright$ Get novel poses.
         if $X_n \neq \emptyset$ then
              $\mathrm{Generate}(X_r, X_n)$ $\triangleright$ Generate novel views.
              $X \leftarrow X \cup X_n$
         end if
     end if
end for
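
Algorithm 1 can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: the pose graph is an adjacency dictionary, `generate` is a hypothetical callback standing in for the diffusion model, the function names are ours, and the cap on the number of views per step is not modelled.

```python
from collections import deque

def within_hops(graph, v, d):
    """N(v, G, d): nodes at most d hops from v (v itself included)."""
    seen = {v}
    frontier = deque([(v, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == d:
            continue  # do not expand beyond d hops
        for nb in graph[node]:
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    return seen

def dfs_order(graph, start):
    """Depth-first visiting order over a connected graph."""
    order, seen, stack = [], set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        stack.extend(reversed(graph[node]))
    return order

def autoregressive_generation(graph, start, delta_r, delta_n, generate):
    """Traverse the pose graph and generate views batch by batch."""
    generated = set()                                          # X
    for v in dfs_order(graph, start):
        if v in generated:
            continue
        refs = generated & within_hops(graph, v, delta_r)      # X_r
        novel = within_hops(graph, v, delta_n) - generated     # X_n
        if novel:
            generate(sorted(refs), sorted(novel))
            generated |= novel
    return generated
```

On a chain of six poses with $\delta_r = 2$ and $\delta_n = 1$, this produces three batches: the first has no reference views, and each later batch is conditioned on the two previously generated poses nearest to the current one.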

B.2 Post Refinement for Scene Reconstruction.

After generating images for all poses in the graph, we further generate object-centric views of the furniture in the scene to reduce missing observations. To sample the camera locations, we use a heuristic based on the 2D floorplan and the statistics of object heights in the dataset, which avoids positions that may lie inside an object. In particular, for each object we derive a 3D bounding box from its 2D box in the floorplan and the maximum height of the objects of the same category in the dataset. Treating the derived bounding boxes as occupied regions, we sample 20 poses per object within $2$ meters, each looking at the object center. These views are generated in a single batch, using nearby, previously generated views as references.
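
The sampling heuristic can be illustrated with a small rejection sampler. This is only a sketch under several assumptions that the text does not specify: the up direction is the z axis, the camera sits at a fixed hypothetical height `cam_height`, "within 2 meters" is read as horizontal distance from the object center, and free-space checks against walls are omitted.

```python
import math
import random

def inside_any_box(p, boxes):
    """True if 3D point p lies inside any axis-aligned box (min_xyz, max_xyz)."""
    return any(all(lo[i] <= p[i] <= hi[i] for i in range(3)) for lo, hi in boxes)

def sample_object_views(target, boxes, n_views=20, max_dist=2.0,
                        cam_height=1.5, seed=0, max_tries=10000):
    """Rejection-sample camera poses around `target` that avoid `boxes`.

    target: (min_xyz, max_xyz) derived 3D box of the object to look at.
    boxes:  derived 3D boxes of all objects, treated as occupied regions.
    Returns a list of (position, unit view direction) pairs.
    """
    rng = random.Random(seed)
    lo, hi = target
    center = tuple((a + b) / 2 for a, b in zip(lo, hi))
    views = []
    for _ in range(max_tries):
        if len(views) == n_views:
            break
        # Uniform sample in a top-down disc of radius max_dist around the object.
        r = max_dist * math.sqrt(rng.random())
        theta = rng.uniform(0.0, 2.0 * math.pi)
        pos = (center[0] + r * math.cos(theta),
               center[1] + r * math.sin(theta),
               cam_height)
        if inside_any_box(pos, boxes):
            continue  # the camera would sit inside an occupied region
        d = tuple(c - p for c, p in zip(center, pos))
        norm = math.sqrt(sum(x * x for x in d))
        if norm < 1e-6:
            continue  # degenerate: camera coincides with the object center
        views.append((pos, tuple(x / norm for x in d)))
    return views
```

Using the per-category maximum height for the boxes makes the occupied regions conservative, so a rejected position may in fact be free, but an accepted one is unlikely to lie inside an object.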

Appendix C Details of Evaluation

C.1 Consistency Evaluation

In this section, we describe the correspondence estimation for a pair of posed RGB-D images. Then we provide the details of the evaluation metrics.

Correspondence Estimation. Given a pair of views, each with RGB and depth, $(I_1, D_1)$ and $(I_2, D_2)$, we warp the images $I_1, D_1$ of the first view to the second view, obtaining $I_{1\rightarrow 2}, D_{1\rightarrow 2}$. If the pair of views is perfectly consistent, the correspondence region $\mathcal{M}$ is the region where the warped depth $D_{1\rightarrow 2}$ matches $D_2$ exactly,

$$\mathcal{M} \coloneqq \mathbbm{1}(D_{1\rightarrow 2} = D_2), \tag{4}$$

where $\mathbbm{1}(\cdot)$ is the indicator function. To account for the potential inconsistency of the generated images, we introduce a tolerance threshold $\tau$ to estimate the correspondence,

$$\hat{\mathcal{M}} \coloneqq \mathbbm{1}(|D_{1\rightarrow 2} - D_2| < \tau). \tag{5}$$

Given the estimated correspondence $\hat{\mathcal{M}}$, the level of consistency is computed for the depth image pair $(D_{1\rightarrow 2}, D_2)$ and the RGB image pair $(I_{1\rightarrow 2}, I_2)$.

RGB Metrics. Given the image pair $(I_{1\rightarrow 2}, I_2)$ and the correspondence $\hat{\mathcal{M}}$, we compute the peak signal-to-noise ratio (PSNR) for color consistency,

$$\mathrm{PSNR} \coloneqq 20 \cdot \log_{10}(255) - 10 \cdot \log_{10}(\mathrm{MSE}), \tag{6}$$

where

$$\mathrm{MSE} \coloneqq \frac{1}{\sum_{k} \hat{\mathcal{M}}(k)} \sum_{k} \hat{\mathcal{M}}(k) \cdot [I_{1\rightarrow 2}(k) - I_2(k)]^{2}, \tag{7}$$

and $k$ is the pixel index. Note that we omit averaging over the color channels to simplify the equation.

Depth Metrics. Given the image pair $(D_{1\rightarrow 2}, D_2)$ and the correspondence $\hat{\mathcal{M}}$, we compute the Absolute Mean Relative Error (AbsRel) and the percentage of pixel inliers $\delta_i$ for depth consistency. AbsRel is calculated as:

$$\mathrm{AbsRel} \coloneqq \frac{1}{\sum_{k} \hat{\mathcal{M}}(k)} \sum_{k} \hat{\mathcal{M}}(k) \cdot \frac{|D_{1\rightarrow 2}(k) - D_2(k)|}{D_2(k)}. \tag{8}$$

The percentage of pixel inliers $\delta_i$ is calculated as:

$$\delta_i \coloneqq \frac{1}{\sum_{k} \hat{\mathcal{M}}(k)} \sum_{k} \hat{\mathcal{M}}(k) \cdot \mathbbm{1}\left(\max\left(\frac{D_{1\rightarrow 2}(k)}{D_2(k)}, \frac{D_2(k)}{D_{1\rightarrow 2}(k)}\right) < 1.25^{i}\right). \tag{9}$$

We choose $i = 0.5$ to have a tight threshold ($1.25^{0.5} \approx 1.118$).
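
The metrics of Eqs. (5)-(9) can be computed for one view pair as sketched below. The sketch assumes the warping has already been done, that pixels receiving no warped sample are marked as NaN in the warped depth, and that RGB values lie in $[0, 255]$; the function name and the returned dictionary keys are hypothetical, and the value of $\tau$ must be supplied by the caller.

```python
import numpy as np

def consistency_metrics(I_warp, D_warp, I2, D2, tau, i=0.5):
    """Consistency metrics of Eqs. (5)-(9) for one pair of views.

    I_warp, I2: (H, W, 3) RGB images in [0, 255].
    D_warp, D2: (H, W) depth maps; NaN in D_warp marks pixels with no sample.
    Returns None if no pixel falls inside the estimated correspondence.
    """
    D1 = np.where(np.isfinite(D_warp), D_warp, 0.0)
    valid = (D1 > 0) & (D2 > 0)
    mask = valid & (np.abs(D1 - D2) < tau)                 # Eq. (5)
    n = int(mask.sum())
    if n == 0:
        return None

    diff = I_warp[mask].astype(np.float64) - I2[mask].astype(np.float64)
    mse = np.mean(diff ** 2)                               # Eq. (7), also averaged over channels
    psnr = 20 * np.log10(255.0) - 10 * np.log10(mse) if mse > 0 else np.inf  # Eq. (6)

    d1, d2 = D1[mask], D2[mask]
    abs_rel = np.mean(np.abs(d1 - d2) / d2)                # Eq. (8)
    ratio = np.maximum(d1 / d2, d2 / d1)
    delta = np.mean(ratio < 1.25 ** i)                     # Eq. (9)
    return {"psnr": psnr, "abs_rel": abs_rel, "delta": delta, "num_pixels": n}
```

All averages run only over pixels inside the estimated correspondence, so occluded regions and regions that fall outside the second view do not penalize the scores.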

C.2 Layout Evaluation

The layout evaluation protocol is the “inverse” of HouseCrafter: we predict the top-down 2D bounding boxes of the objects in the generated scene. The predicted 2D bounding boxes are then compared with the 2D boxes from the given floorplan using the mean Average Precision at an intersection-over-union threshold of 0.25, mAP@25. Specifically, we use ODIN Jain et al. (2024), a 3D instance segmentation method that takes multi-view posed RGB-D images as input and predicts the instance segmentation of the point cloud accumulated from the input images. Then, top-down 2D boxes are extracted from the segmented instances. Since a scene may contain up to 2000 images depending on its size, we cannot pass all of them to ODIN at once. Instead, we partition the images by room and run the segmentation per room. This strategy does not affect the evaluation results, since an object in the scene does not span more than one room. We finetune ODIN on the 3D-Front dataset to make the segmentation results more reliable, since both HouseCrafter and CC3D are trained on this dataset.
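
The box extraction and the IoU test at the 0.25 threshold can be sketched as follows. The sketch assumes that the segmented instance points are already available (the call to ODIN is not shown), that the up direction is the z axis by default, and that boxes are axis-aligned; the function names are hypothetical. It flags each prediction as a true or false positive, which is the input needed to compute average precision.

```python
def top_down_box(points, up_axis=2):
    """Axis-aligned 2D box (umin, vmin, umax, vmax) of 3D points seen from above."""
    ground = [i for i in range(3) if i != up_axis]
    us = [p[ground[0]] for p in points]
    vs = [p[ground[1]] for p in points]
    return (min(us), min(vs), max(us), max(vs))

def box_iou(a, b):
    """Intersection-over-union of two axis-aligned 2D boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def match_predictions(preds, gts, iou_thr=0.25):
    """Flag each prediction as a true/false positive at the IoU threshold.

    preds: list of (score, category, box); gts: list of (category, box).
    Predictions are visited by decreasing score and each floorplan box can be
    matched at most once, as in the usual average-precision protocol.
    """
    used, flags = set(), []
    for score, cat, box in sorted(preds, key=lambda p: -p[0]):
        best_iou, best_j = 0.0, None
        for j, (gcat, gbox) in enumerate(gts):
            if gcat != cat or j in used:
                continue
            iou = box_iou(box, gbox)
            if iou > best_iou:
                best_iou, best_j = iou, j
        hit = best_j is not None and best_iou >= iou_thr
        if hit:
            used.add(best_j)
        flags.append((score, cat, hit))
    return flags
```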

C.3 User Study

Figure 8: User Study Interface. We show users two meshes at a time: one produced by our model and the other by a baseline method. We then ask users to choose the mesh that appears “better looking in general” and the mesh that appears to “align better” with the given floorplan.

We conduct a user study to evaluate the results produced by Text2Room, CC3D, and our method. In the study, we ask 12 participants to rate the results in a pair-wise manner. Specifically, we present the participants with two meshes at a time and ask them to choose: i) the one that appears more visually appealing; and ii) the one that is more coherent with the provided floorplan. The interface is shown in Fig. 8. Since Text2Room does not take layout as a form of guidance, we do not report participants’ answers to the second question if one of the meshes is produced by it. However, we still ask the question to prevent unconscious bias. Given that CC3D generates results at the room level rather than for entire houses, we clip our results and floorplan to the specific room CC3D produces when making comparisons.

Appendix D Implementation Details

D.1 Training

We initialize our model from StableDiffusion v1.5 Rombach et al. (2021). For the first layer of the Unet, we duplicate the pre-trained weights and divide them by two, which accommodates the depth latent and limits the change in the output scale. For the last layer of the Unet, we only duplicate the pre-trained weights. The model is trained for $15{,}000$ iterations over 2 days with an effective batch size of 256 ($4$ samples per GPU $\times\ 8$ GPUs $\times\ 8$ gradient accumulation steps). Each data sample contains 3 reference views and 3 novel views at a resolution of 256. We use the Adam optimizer with a learning rate of $10^{-4}$. All training is conducted on a machine with $8$ A6000 48GB GPUs.
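
The weight initialization can be sketched in NumPy as below. It assumes that the RGB and depth latents are concatenated along the channel axis and that the convolution weights are stored as (out channels, in channels, kernel, kernel); the function names are hypothetical, and how the bias of the last layer is handled is our assumption.

```python
import numpy as np

def expand_first_conv(weight):
    """(C_out, C_in, k, k) -> (C_out, 2*C_in, k, k): duplicate, then halve.

    Halving keeps the output scale unchanged when the RGB and depth halves
    of the input carry similar values.
    """
    return np.concatenate([weight, weight], axis=1) / 2.0

def expand_last_conv(weight, bias):
    """(C_out, C_in, k, k), (C_out,) -> (2*C_out, ...): duplicate only.

    Each output half (RGB latent, depth latent) starts from the pre-trained
    prediction, so no rescaling is needed.
    """
    return np.concatenate([weight, weight], axis=0), np.concatenate([bias, bias])
```

With this initialization, feeding the same latent as both the RGB and the depth input reproduces the response of the original first layer exactly, which is one way to see why the division by two limits the change in scale.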

Appendix E Limitations and Future Directions

Our work is the first that can generate textured meshes of 3D scenes at the house scale. It is not without limitations, however, and these suggest intriguing future directions.

First, the employed TSDF fusion method produces reasonable results when fusing the generated RGB-D images and is robust to their inconsistency. However, it cannot model view-dependent color: lighting effects are baked into the mesh texture, which gives unsatisfactory results. To address this issue, a reconstruction method is required that is both robust to the inconsistency of the generated multi-view images and able to model view-dependent color.

Second, while using image generation models offers the advantage of large-scale image data as a prior for 3D generation, the current pipeline has substantial redundancy due to the high overlap between multi-view images. Thus, an effective pose-sampling strategy that balances view overlap for consistency against efficiency is a promising direction.

Lastly, in our proposed method of injecting layout guidance into the generation process, only the geometry and semantics of the objects are leveraged, while information about object instances is omitted. We believe that instance awareness can provide better scene understanding and thus yield scenes that are more faithful to the floorplan.

Appendix F Additional Results

Figure 9: Additional comparisons with CC3D and Text2Room