Human-Aware 3D Scene Generation with Spatially-constrained Diffusion Models

Xiaolin Hong1,3, Hongwei Yi2, Fazhi He1, Qiong Cao3
Abstract

Generating 3D scenes from human motion sequences supports numerous applications, including virtual reality and architectural design. However, previous auto-regression-based human-aware 3D scene generation methods have struggled to accurately capture the joint distribution of multiple objects and input humans, often resulting in overlapping objects generated in the same space. To address this limitation, we explore the potential of diffusion models that simultaneously consider all input humans and the floor plan to generate plausible 3D scenes. Our approach not only satisfies all input human interactions but also adheres to spatial constraints with the floor plan. Furthermore, we introduce two spatial collision guidance mechanisms: human-object collision avoidance and object-room boundary constraints. These mechanisms help avoid generating scenes that conflict with human motions while respecting layout constraints. To enhance the diversity and accuracy of human-guided scene generation, we have developed an automated pipeline that improves the variety and plausibility of human-object interactions in the existing 3D FRONT HUMAN dataset. Extensive experiments on both synthetic and real-world datasets demonstrate that our framework can generate more natural and plausible 3D scenes with precise human-scene interactions, while significantly reducing human-object collisions compared to previous state-of-the-art methods. Our code and data will be made publicly available upon publication of this work.

Figure 1: Our method generates more plausible 3D scenes given input human motions and floor plans. It excels in two key aspects: (1) avoiding collisions between humans and objects, as well as between objects, a significant improvement over MIME (Yi et al. 2023), and (2) providing better support for human-object interactions compared to DiffuScene (Tang et al. 2023).

Introduction

Creating diverse and realistic 3D environments inhabited by humans is essential for numerous applications, such as virtual reality (VR), interior design and training embodied artificial intelligence (AI) agents (Zhang et al. 2019). For example, VR game designers aim to construct immersive environments that enable participants to engage seamlessly and naturally with objects. These demands have driven researchers to explore diverse scene generation methods, propelling the rapid advancement of 3D scene synthesis (Luo et al. 2020; Paschalidou et al. 2021; Liu et al. 2023; Gao et al. 2023). Despite recent progress, there remain significant challenges in generating visually plausible scenes that are populated with appropriate furniture and adhere to various human motions.

There has been a recent surge in research dedicated to the problem of human-aware scene generation (Ye et al. 2022; Yi et al. 2023). A common approach in these studies is to apply autoregressive models to sequentially place objects based on input humans and already generated objects. However, these methods often yield implausible scenes with object-object collisions. This limitation primarily arises from the inherent inability of autoregressive models to capture the joint distribution of multiple objects and multiple humans fully. Consequently, exploring a generative method that can effectively model and capture these complex distributions is crucial for generating realistic 3D scenes.

More recently, diffusion-based approaches for scene synthesis (Tang et al. 2023; Yang et al. 2024) have shown an impressive ability to simplify the approximation of the joint distribution of objects, generating entire scenes at once and thereby enhancing their realism. Meanwhile, many studies in image generation (Saharia et al. 2022; Zhang, Rao, and Agrawala 2023; Kawar et al. 2023) and human motion synthesis (Huang et al. 2023; Karunratanakul et al. 2023) have demonstrated that diffusion models can effectively incorporate inference guidance to meet user-defined goals. Despite these advancements, there is still no standard solution for generating plausible 3D scenes that both support various human interactions and adhere to spatial constraints, such as avoiding motion collisions and respecting room boundaries.

To tackle these challenges, we propose SHADE, a Spatially-constrained Human-Aware Diffusion-based 3D Environment synthesis method. As shown in Figure 1, our method can generate plausible scene layouts that avoid collisions with humans and between objects, while supporting various human activities such as sitting and lying. Our key insight is to harness diffusion models that take all humans and the floor plan as input simultaneously to generate holistic object configurations. Specifically, we input the contact bounding boxes and free space mask extracted from input human motions and the floor plan, following (Yi et al. 2023). We then learn a diffusion model to capture the joint distribution of objects, enabling the simultaneous generation of plausible object placements and an understanding of the relationships between their attributes. To further enhance the plausibility of generated scenes, we design two spatial collision guidance functions: 1) motion collision avoidance, which calculates the 2D distances between objects and moving humans to prevent implausible penetrations, and 2) a boundary constraint, which penalizes the distance by which objects extend beyond the floor plan so that object placement respects room boundaries. During inference, we combine these two guidance functions with an object-object collision function (Yang et al. 2024). This allows our diffusion model to generate collision-free scenes that respect human movements and room boundaries and avoid object overlap.

Beyond model design, we enhance human-aware scene generation by addressing key data challenges in datasets like 3D FRONT HUMAN (Yi et al. 2023). Specifically, we tackle two main issues: 1) incorrect penetrations in human-object interactions hinder accurate spatial relationship modeling, and 2) limited interaction diversity restricts model generalization. Our approach includes automated techniques for adjusting translations to correct penetrations and augmenting categories and orientations to diversify human-object interactions. We use our calibrated dataset to train SHADE and evaluate it on synthetic and real datasets. Our results, both quantitative and qualitative, show that our method generates more plausible 3D scenes with realistic human-scene interactions and effectively reduces human-object collisions. These findings underscore the superiority of SHADE compared to previous state-of-the-art methods.

Our main contributions are: (i) We propose a human-aware diffusion-based model for generating realistic 3D scene layouts with plausible human-object interactions in a single step. (ii) We design two spatial collision inference strategies, motion collision avoidance and boundary constraint guidance, to enhance scene fidelity by preventing collisions with human motions and respecting room layouts. (iii) We devise an automated calibration pipeline to enhance accuracy and diversity in human-object interactions within the original dataset, thereby improving the generative quality and diversity of human-aware scene synthesis.

Related Works

Human-agnostic Scene Synthesis. The objective of human-agnostic scene generative methods is to generate plausible and diverse scene layouts without taking into account human activities (Wang et al. 2018; Zhou, While, and Kalogerakis 2019; Ritchie, Wang, and Lin 2019; Wang, Yeshwanth, and Nießner 2020; Zhang et al. 2020a, b; Paschalidou et al. 2021; Tang et al. 2023; Liu et al. 2023; Patil et al. 2023). Early works explore procedural modeling with grammars (Müller et al. 2006) to place objects in the scenes, heavily relying on manually designed rules and producing scenes with limited diversity. Subsequently, graph-based approaches (Zhou, While, and Kalogerakis 2019; Li et al. 2019; Luo et al. 2020; Gao et al. 2023) have been extensively developed to represent 3D scenes as scene graphs and capture the underlying structure of indoor scenes using graph neural networks. Unlike these studies, recent autoregressive generation methods represent 3D scenes as object sequences and learn autoregressive models to sequentially predict the next object conditioned on already generated objects. The representative methods include SceneFormer (Wang, Yeshwanth, and Nießner 2020), ATISS (Paschalidou et al. 2021), and CLIP-Layout (Liu et al. 2023). However, these autoregressive methods fail to accurately capture the distribution of the entire object sequence, which often results in overlapping objects generated in the same place. More recently, some diffusion-based approaches (Tang et al. 2023; Yang et al. 2024) have been proposed to ease the approximation of the joint distribution of multiple objects, benefiting the generation of high-fidelity scenes. Inspired by these works, we develop a spatially-constrained diffusion method, SHADE. Unlike previous diffusion-based methods (Tang et al. 2023; Yang et al. 2024), SHADE learns a diffusion model to capture the joint distribution of objects conditioned on both human motions and the floor plan. Furthermore, SHADE integrates multiple spatial constraints as inference guidance to improve scene plausibility.

Human-aware Scene Synthesis. This branch of scene synthesis focuses on producing plausible scenes in which the input human motions can naturally take place (Qi et al. 2018; Nie et al. 2022; Ye et al. 2022; Yi et al. 2023). To this end, Pose2Room (Nie et al. 2022) proposes a pose-conditioned generative model to predict object configurations from human pose trajectories. However, Pose2Room can only predict contact objects rather than an entire scene. In contrast, SUMMON (Ye et al. 2022) learns a ContactFormer to generate objects that are in contact with humans and employs an autoregressive model to complete the scene. Similar to (Paschalidou et al. 2021), MIME (Yi et al. 2023) learns an autoregressive model to sequentially predict object placements based on the contact information and free space extracted from human motions and the floor plan. Unlike these approaches, SHADE is a diffusion-based, non-autoregressive method that inherently explores the relationships between all object attributes to generate more plausible furniture layouts. Moreover, our method incorporates spatially-constrained guidance into the generation process, thereby enhancing scene plausibility.

Method

Figure 2: Overview of our method. SHADE learns a diffusion model to gradually clean the noisy scene $\mathbf{x}_T$ by simultaneously considering the contact bounding boxes, free-space mask, floor plan, and time step. During inference, SHADE applies three spatial collision guidance functions to ensure the generation of plausible scenes that avoid conflicts with human motions, room boundaries, as well as prevent object overlap.

An overview of SHADE is presented in Fig. 2. Our method first takes as input the contact bounding boxes and free space mask extracted from the given human motions, together with the floor plan, and extracts their embeddings with respective encoders. Conditioned on these embeddings, we learn a scene diffusion model to capture the joint distribution of multiple objects. This enables us to generate plausible object configurations by exploring the relationships between their attributes. During inference, we apply three spatial collision guidance functions to guide the diffusion model in generating plausible 3D scenes that avoid conflicts with human motions and adhere to layout constraints. In the following, we first describe the problem formulation of our task and then detail our human-aware diffusion model and inference guidance.

Problem Formulation

Given input human motions $\mathcal{H}$ and an empty floor plan $\mathcal{F}$, our goal is to generate a 3D scene $\mathcal{S}$ that can support various human interactions and movements. Following (Paschalidou et al. 2021; Yi et al. 2023; Tang et al. 2023), the scene $\mathcal{S}$ is represented as an unordered set of $N$ objects, denoted as $\mathbf{x}=\{o_i\}_{i=1}^{N}$. Each object $o_i=\{k_i,\ell_i,s_i,r_i\}$ consists of a semantic label $k_i\in\mathbb{R}^{K}$ out of $K$ categories, location $\ell_i\in\mathbb{R}^{3}$, size $s_i\in\mathbb{R}^{3}$ and orientation $r_i\in\mathbb{R}^{1}$.
Based on human-object interaction, these objects $\mathcal{O}$ can be categorized into two kinds: 1) contact objects $\mathcal{Q}=\{q_i\}_{i=1}^{L}$, which are in contact with humans, and 2) non-contact objects $\bar{\mathcal{Q}}=\{\bar{q}_i\}_{i=1}^{N-L}$ without any interaction with humans. To condition the 3D scene generation with human motions, we extract both contact humans and free-space humans from the input motions, following (Yi et al. 2023). Specifically, contact humans indicate the location and category of contact objects, represented as the collection of contact boxes $\mathcal{C}=\{c_i\}_{i=1}^{L}$. Free-space humans define the walkable area of a room, specifying regions where objects cannot be placed. This information is represented as a binary free-space mask $\mathcal{FS}$ by projecting all foot contact points on the floor plan.
Formally, conditioned on the floor plan $\mathcal{F}$, the free-space mask $\mathcal{FS}$ and all contact humans $\mathcal{C}$, we learn a generative model to predict the contact objects $\mathcal{Q}$ and the non-contact objects $\bar{\mathcal{Q}}$, such that they can support all human interactions while adhering to the constraints imposed by human motions $\mathcal{H}$ and the floor plan $\mathcal{F}$.
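To make the scene representation concrete, the following minimal sketch packs each object into a flat vector of length K + 7 (label, location, size, orientation) and stacks N such vectors into a scene. The class name `SceneObject`, the one-hot label encoding, and the choice K = 4 are assumptions made for illustration; the paper does not specify them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SceneObject:
    """One object o_i = {k_i, l_i, s_i, r_i}; field names are illustrative."""
    label: List[float]       # k_i in R^K (one-hot here, an assumption)
    location: List[float]    # l_i in R^3
    size: List[float]        # s_i in R^3
    orientation: float       # r_i in R^1

    def to_vector(self) -> List[float]:
        # Concatenate all attributes into a single row of length K + 7.
        return [*self.label, *self.location, *self.size, self.orientation]


def one_hot(index: int, num_categories: int) -> List[float]:
    return [1.0 if i == index else 0.0 for i in range(num_categories)]


K = 4  # hypothetical number of categories
chair = SceneObject(one_hot(1, K), [0.5, 0.0, 1.2], [0.3, 0.45, 0.3], 1.57)
table = SceneObject(one_hot(2, K), [1.5, 0.0, 1.2], [0.8, 0.4, 0.6], 0.0)

# The scene x is an unordered set of N objects; here N = 2 rows of K + 7 values.
scene = [chair.to_vector(), table.to_vector()]
```

Because the set is unordered, any permutation of the rows describes the same scene.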

Human-aware Scene Synthesis

The diffusion model provides a robust framework for scene generation by learning the joint distribution of multiple objects, which is essential for creating plausible scenes without overlapping objects. In the following sections, we detail how we incorporate the diffusion model with all input humans and the floor plan to generate realistic 3D scenes.

Diffusion and Generative Process. Let $\mathbf{x}_0\sim q(\mathbf{x}_0)$ denote a clean scene sampled from the training data. We gradually add Gaussian noise to $\mathbf{x}_0$ with a forward diffusion process $q(\mathbf{x}_{t+1}\mid\mathbf{x}_t)$ of length $T$. After $T$ diffusion steps, the scene approximates Gaussian noise $\mathbf{x}_T\sim\mathcal{N}(\mathbf{0},\mathbf{I})$. To generate scenes conditioned on human motions and the floor plan, our diffusion model learns a reverse denoising process $p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t,C)$ to convert $\mathbf{x}_T$ back to $\mathbf{x}_0$:

$$p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t,C)=\mathcal{N}\left(\mathbf{x}_{t-1};\mu_\theta(\mathbf{x}_t,t,C),(1-\alpha_t)\mathbf{I}\right), \tag{1}$$

where $\theta$ are the model parameters, $\alpha_t$ depends on a pre-defined variance schedule, and $C$ denotes the conditioning signals. As illustrated in Fig. 2, we select $C=\{\mathcal{C},\mathcal{F},\mathcal{FS}\}$ as the conditioning signals, enabling the diffusion model to generate 3D scenes by leveraging the information from all contact humans $\mathcal{C}$, the floor plan $\mathcal{F}$, and the free-space mask $\mathcal{FS}$. According to $\epsilon$-prediction (Ho, Jain, and Abbeel 2020a), $\mu_\theta(\mathbf{x}_t,t,C)$ can be re-parameterized as:

$$\mu_\theta(\mathbf{x}_t,t,C)=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\bm{\epsilon}_\theta(\mathbf{x}_t,t,C)\right), \tag{2}$$

where $\bar{\alpha}_t=\prod_{i=1}^{t}\alpha_i$ and $\bm{\epsilon}_\theta(\mathbf{x}_t,t,C)$ is a denoising network that predicts the noise applied to $\mathbf{x}_0$ from a noisy scene $\mathbf{x}_t$ and the conditions $C$. Finally, we can reconstruct a clean scene $\mathbf{x}_0$ as follows:

$$p_\theta(\mathbf{x}_0\mid C)=p(\mathbf{x}_T)\prod_{t=1}^{T}p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t,C), \tag{3}$$

where $p_\theta(\mathbf{x}_0\mid C)$ denotes the probability of scene $\mathbf{x}_0$ conditioned on $C$. Mathematically, we can maximize the conditional probability $p_\theta(\mathbf{x}_0\mid C)$ by training the denoising network $\bm{\epsilon}_\theta(\mathbf{x}_t,t,C)$ with a simplified objective (Ho, Jain, and Abbeel 2020a):

$$\begin{aligned}
\mathcal{L}_{sim} &= \mathbb{E}_{\bm{\epsilon},t,\mathbf{x}_0}\left[\left\|\bm{\epsilon}-\bm{\epsilon}_\theta(\mathbf{x}_t,t,C)\right\|^2\right] \\
&= \mathbb{E}_{\bm{\epsilon},t,\mathbf{x}_0}\left[\left\|\bm{\epsilon}-\bm{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0+\sqrt{1-\bar{\alpha}_t}\,\bm{\epsilon},t,C\right)\right\|^2\right].
\end{aligned} \tag{4}$$
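The simplified objective in Eq. 4 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the linear variance schedule with endpoints 1e-4 and 0.02, the function names, and the placeholder `zero_model` are hypothetical stand-ins, not details taken from the paper. The scene is treated as a flat list of numbers.

```python
import math
import random


def linear_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Return (alphas, alpha_bars) indexed by t = 1..T; index 0 is a placeholder.

    The linear beta schedule and its endpoints are assumptions for this sketch.
    """
    alphas, alpha_bars = [1.0], [1.0]
    for t in range(1, T + 1):
        beta = beta_start + (beta_end - beta_start) * (t - 1) / max(T - 1, 1)
        alphas.append(1.0 - beta)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alphas, alpha_bars


def q_sample(x0, t, alpha_bars, eps):
    """Closed-form noising: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    a = math.sqrt(alpha_bars[t])
    b = math.sqrt(1.0 - alpha_bars[t])
    return [a * x + b * e for x, e in zip(x0, eps)]


def simple_loss(eps_model, x0, t, alpha_bars, cond, rng):
    """One-sample estimate of ||eps - eps_theta(x_t, t, C)||^2 from Eq. 4."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = q_sample(x0, t, alpha_bars, eps)
    eps_hat = eps_model(x_t, t, cond)
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))


# Usage with a placeholder network that always predicts zero noise.
alphas, alpha_bars = linear_schedule(T=1000)
x0 = [0.5, -0.2, 1.0]  # a flattened toy scene
zero_model = lambda x_t, t, cond: [0.0] * len(x_t)
loss = simple_loss(zero_model, x0, t=500, alpha_bars=alpha_bars,
                   cond=None, rng=random.Random(0))
```

In practice the expectation is approximated by averaging this quantity over mini-batches of scenes, time steps, and noise samples, and `eps_model` would be a trainable network that also consumes the conditioning signals.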

During scene generation, our diffusion model starts from Gaussian noise $\mathbf{x}_T$ and gradually predicts a cleaner scene with Eqs. 1 and 2. In each denoising step, the diffusion model generates all objects within the scene simultaneously, which inherently captures the relationships between all object attributes to enhance scene plausibility. After $T$ steps, we generate a plausible scene $\mathbf{x}_0$ that can afford various human interactions, such as touching, sitting, and lying.
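The sampling procedure described here can be sketched as a short loop that applies Eq. 2 to obtain the mean and Eq. 1 to draw the next, less noisy scene. The function names, the constant toy schedule, and the zero-noise placeholder model are assumptions for illustration; omitting the noise term on the final step follows the usual DDPM convention rather than anything stated in the text.

```python
import math
import random


def denoise_mean(x_t, eps_hat, alpha_t, alpha_bar_t):
    """Eq. 2: mu = (x_t - (1 - alpha_t) / sqrt(1 - abar_t) * eps_hat) / sqrt(alpha_t)."""
    scale = (1.0 - alpha_t) / math.sqrt(1.0 - alpha_bar_t)
    return [(x - scale * e) / math.sqrt(alpha_t) for x, e in zip(x_t, eps_hat)]


def sample_scene(eps_model, cond, dim, alphas, rng):
    """Run the reverse process from x_T ~ N(0, I) down to x_0.

    alphas[t] holds alpha_t for t = 1..T (index 0 is unused).
    """
    T = len(alphas) - 1
    alpha_bars = [1.0]
    for t in range(1, T + 1):
        alpha_bars.append(alpha_bars[-1] * alphas[t])

    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # x_T
    for t in range(T, 0, -1):
        eps_hat = eps_model(x, t, cond)
        mu = denoise_mean(x, eps_hat, alphas[t], alpha_bars[t])
        # Eq. 1 uses variance (1 - alpha_t); no noise is added on the last step.
        sigma = math.sqrt(1.0 - alphas[t]) if t > 1 else 0.0
        x = [m + sigma * rng.gauss(0.0, 1.0) for m in mu]
    return x


# Usage: a hypothetical 50-step constant schedule and a placeholder network.
alphas = [1.0] + [0.98] * 50
zero_model = lambda x_t, t, cond: [0.0] * len(x_t)
scene = sample_scene(zero_model, cond=None, dim=11, alphas=alphas,
                     rng=random.Random(0))
```

Every entry of the scene vector is updated in each step, which is what lets all objects be denoised jointly rather than placed one after another.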

Inference Guidance for Spatial Constraints

In human-aware scene generation, the generated scenes must avoid colliding with human motions and respect room boundary constraints. To enforce these requirements, we incorporate inference guidance into the generation process. Specifically, we introduce two spatial collision guidance functions to further improve the plausibility of 3D scenes generated by the diffusion model: 1) the motion collision avoidance function pushes objects back when they collide with human motions, and 2) the boundary constraint function repositions objects that are outside the floor plan.

Motion Collision Avoidance. Without directly calculating the mesh interpenetration between moving humans and objects, we estimate motion collision scores using the predicted bounding boxes of objects $\{\hat{o}_i\}_{i=1}^{N}$ in the scene and free-space mask $\mathcal{FS}$ induced by human movements. Specifically, we project the free-space mask $\mathcal{FS}$ onto a bird's-eye view to produce a 2D human motion map, from which we calculate the 2D transform distance map $\mathcal{M}$. Using this representation, we measure the motion collision scores by:

$$\mathcal{J}_m=\sum_{i=1}^{N}\text{SDF}(\hat{o}_i,\mathcal{M}), \tag{5}$$

where $\text{SDF}(\hat{o}_i,\mathcal{M})$ queries the 2D distance value between each object and the human motion map. A positive distance indicates a collision between the object and human motions.
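One way to picture this score is the grid-based sketch below. It builds a brute-force signed distance map from a binary free-space mask and sums the positive values under each object footprint. Several choices here are assumptions, since the text does not fix them: footprints are axis-aligned cell ranges, distances are in grid cells, only positive values are accumulated, and the function names are hypothetical.

```python
import math


def signed_distance_map(mask):
    """Brute-force 2D signed distance map of a binary grid.

    Cells inside the mask get +distance to the nearest outside cell; cells
    outside get -distance to the nearest inside cell (in cell units).
    """
    rows, cols = len(mask), len(mask[0])
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    inside = [p for p in cells if mask[p[0]][p[1]]]
    outside = [p for p in cells if not mask[p[0]][p[1]]]
    sdf = [[0.0] * cols for _ in range(rows)]
    for r, c in cells:
        targets = outside if mask[r][c] else inside
        if not targets:  # mask is entirely True or entirely False
            continue
        dist = min(math.hypot(r - tr, c - tc) for tr, tc in targets)
        sdf[r][c] = dist if mask[r][c] else -dist
    return sdf


def collision_score(boxes, sdf):
    """Sum positive signed distances over every cell covered by each box.

    Each box is (row_min, col_min, row_max, col_max), inclusive, in grid cells.
    """
    score = 0.0
    for r0, c0, r1, c1 in boxes:
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                score += max(sdf[r][c], 0.0)
    return score


# A 5x5 room where humans walk along the middle column.
free_space = [[c == 2 for c in range(5)] for _ in range(5)]
motion_map = signed_distance_map(free_space)
clear_box = (0, 0, 1, 1)     # stays left of the walkway
blocking_box = (1, 1, 2, 2)  # covers two walkway cells
```

With this sign convention, `clear_box` contributes nothing while `blocking_box` is penalized, matching the statement that positive distances indicate collisions. A differentiable implementation would interpolate the map at continuous box coordinates so that gradients can flow back to object locations and sizes.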

Boundary Constraint. Besides satisfying motion collision constraints, the generated objects must also meet boundary constraints, meaning all objects should be located within the given floor plan. To this end, we introduce an additional boundary constraint function to penalize objects that exceed the boundaries. Similar to motion collision avoidance, we extract the out-of-bounds region from the 2D projection of the floor plan $\mathcal{F}$ and calculate its 2D transform distance map $\mathcal{B}$. The distance that objects extend beyond the boundaries is then measured by calculating their 2D distance from $\mathcal{B}$:

$$\mathcal{J}_b=\sum_{i=1}^{N}\text{SDF}(\hat{o}_i,\mathcal{B}). \tag{6}$$
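For intuition, the boundary penalty has a simple closed form in the special case of an axis-aligned rectangular room, sketched below. This is a deliberate simplification and an assumption of the example: the method described above uses a distance map computed from the floor-plan projection, which also handles non-rectangular rooms. The function names and the metre-based coordinates are hypothetical.

```python
def boundary_overshoot(box, room):
    """Total distance by which an axis-aligned footprint exceeds a rectangular room.

    box and room are (x_min, z_min, x_max, z_max) on the floor plane.
    """
    bx0, bz0, bx1, bz1 = box
    rx0, rz0, rx1, rz1 = room
    return (max(rx0 - bx0, 0.0) + max(bx1 - rx1, 0.0)
            + max(rz0 - bz0, 0.0) + max(bz1 - rz1, 0.0))


def boundary_score(boxes, room):
    """Sum of per-object overshoot, playing the role of J_b for this toy room."""
    return sum(boundary_overshoot(box, room) for box in boxes)


room = (0.0, 0.0, 4.0, 3.0)
inside_box = (0.5, 0.5, 1.5, 1.5)     # fully inside the room
leaking_box = (3.5, 2.0, 4.5, 3.25)   # sticks out on two sides
```

Here `inside_box` incurs no penalty, while `leaking_box` is penalized by the sum of its overshoot along each axis, so a gradient step on the score moves it back toward the room interior.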

Overall Inference Guidance. To generate a more plausible scene that respects spatial constraints from human motions and room layout, we integrate motion collision avoidance $\mathcal{J}_m$ and boundary constraint $\mathcal{J}_b$ into the inference guidance of our diffusion model. Inspired by (Yang et al. 2024), we also employ the object collision avoidance $\mathcal{J}_o$ to penalize the collision between objects. Finally, our overall inference guidance can be formulated as $\mathcal{J}=\mathcal{J}_m+\mathcal{J}_b+\mathcal{J}_o$, where $\mathcal{J}_o=\sum_{i,j,i\neq j}\mathbf{IoU}(o_i,o_j)$ measures the object collision score by calculating the 3D IoU of each object pair.
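The object collision term can be illustrated with axis-aligned boxes. The sketch below assumes each box is given as a (center, size) pair with full extents and ignores orientation, which is a simplification of the oriented boxes used in the scene representation; the function names are hypothetical.

```python
from itertools import permutations


def aabb_iou_3d(box_a, box_b):
    """3D IoU of two axis-aligned boxes given as (center, size) with full extents."""
    (ca, sa), (cb, sb) = box_a, box_b
    inter = 1.0
    for axis in range(3):
        lo = max(ca[axis] - sa[axis] / 2.0, cb[axis] - sb[axis] / 2.0)
        hi = min(ca[axis] + sa[axis] / 2.0, cb[axis] + sb[axis] / 2.0)
        inter *= max(hi - lo, 0.0)
    vol_a = sa[0] * sa[1] * sa[2]
    vol_b = sb[0] * sb[1] * sb[2]
    union = vol_a + vol_b - inter
    return inter / union if union > 0.0 else 0.0


def object_collision_score(boxes):
    """J_o: sum of IoU over all ordered pairs (i, j) with i != j."""
    return sum(aabb_iou_3d(a, b) for a, b in permutations(boxes, 2))


boxes = [
    ((0.0, 0.0, 0.0), (2.0, 2.0, 2.0)),
    ((1.0, 0.0, 0.0), (2.0, 2.0, 2.0)),   # overlaps the first box by half
    ((10.0, 0.0, 0.0), (1.0, 1.0, 1.0)),  # far away from both
]
score = object_collision_score(boxes)
```

Summing over ordered pairs counts each overlapping pair twice, as in the double sum over $i \neq j$; the total guidance cost would then add this term to the motion and boundary scores.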

During the generative process, the gradient of $\mathcal{J}$ perturbs the predicted $\hat{\mathbf{x}}_0$ at each denoising step as $\tilde{\mathbf{x}}_0=\hat{\mathbf{x}}_0-\gamma\nabla\mathcal{J}(\hat{\mathbf{x}}_0)$, where $\gamma$ controls the guidance strength. Then, the predicted mean $\mu_\theta$ can be computed with the updated clean scene $\tilde{\mathbf{x}}_0$, following (Ho et al. 2022; Rempe et al. 2023). This approach pushes the generated scenes towards the desired spatial constraints, so that they avoid colliding with human motions and room boundaries and prevent object collisions.
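A single guided denoising step can be sketched as follows. The sketch assumes the standard DDPM relations for recovering $\hat{\mathbf{x}}_0$ from the predicted noise and for the posterior mean of $\mathbf{x}_{t-1}$ given a clean-scene estimate; the text defers these details to the cited works, so this is one plausible reading rather than the confirmed implementation. The finite-difference gradient stands in for automatic differentiation, and all function names are hypothetical.

```python
import math


def predict_x0(x_t, eps_hat, alpha_bar_t):
    """Clean-scene estimate implied by the noise prediction."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [(x - b * e) / a for x, e in zip(x_t, eps_hat)]


def numerical_grad(fn, x, h=1e-5):
    """Central-difference gradient; stands in for autograd in this sketch."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((fn(up) - fn(down)) / (2.0 * h))
    return grad


def guide_x0(x0_hat, cost_fn, gamma):
    """x0_tilde = x0_hat - gamma * grad J(x0_hat)."""
    grad = numerical_grad(cost_fn, x0_hat)
    return [x - gamma * g for x, g in zip(x0_hat, grad)]


def mean_from_x0(x0, x_t, alpha_t, alpha_bar_prev):
    """DDPM posterior mean of x_{t-1} given x_t and a clean-scene estimate."""
    alpha_bar_t = alpha_bar_prev * alpha_t
    c0 = math.sqrt(alpha_bar_prev) * (1.0 - alpha_t) / (1.0 - alpha_bar_t)
    ct = math.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    return [c0 * a + ct * b for a, b in zip(x0, x_t)]


# Usage with a toy quadratic cost in place of J = J_m + J_b + J_o.
toy_cost = lambda x: sum(v * v for v in x)
guided = guide_x0([2.0, -4.0], toy_cost, gamma=0.25)
```

When no guidance is applied ($\gamma = 0$), `mean_from_x0(predict_x0(...), ...)` reduces algebraically to the mean in Eq. 2, so the guided update changes only where the cost gradient is non-zero.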

Dataset Calibration of 3D FRONT HUMAN

In this section, we introduce the calibration pipeline designed to tackle two major data challenges in the 3D FRONT HUMAN dataset (Yi et al. 2023). Our pipeline begins with translation modification to correct penetrations in human-object interactions, thereby improving spatial accuracy. Next, it employs category and orientation augmentation to increase the diversity of interactions, which enhances model generalization. These techniques are explained in greater detail below.

Translation Modification. To address incorrect penetrations in human-object interactions, we adjust the translation parameters of humans to avoid implausible contact with objects. Specifically, we modify these parameters within a range of $[-2m,2m]$ centered on the contact object. We then evaluate the plausibility of the interactions using two indicators: the 3D IoU score (Zhou et al. 2019) between the bounding boxes of the human and object, and the human-scene interpenetration error $\mathbf{E_{pen}}$ (Shen et al. 2023):

$$\mathbf{E_{pen}}=\sum_{i=1}^{V}\mathds{1}_{x<0}\left[sdf\left(v_{i},\mathcal{S}\right)\right]\cdot\left|sdf\left(v_{i},\mathcal{S}\right)\right|, \tag{7}$$

where $V$ is the number of human mesh vertices, $sdf\left(v_{i},\mathcal{S}\right)$ is the signed distance from vertex $v_{i}$ to scene $\mathcal{S}$, and $\mathds{1}_{x<0}[\cdot]$ is an indicator function that returns 1 when the condition is met, and 0 otherwise. The translation modification is considered complete when $\mathbf{E_{pen}}$ falls below the safe bound $\sigma_{1}$ and the 3D IoU is higher than $\sigma_{2}$. Empirically, we set $\sigma_{1}=20$, $\sigma_{2}=90\%$ for sitting and lying humans, and $\sigma_{2}=50\%$ for touching humans.
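The interpenetration error and the stopping criterion can be illustrated with a short Python sketch. It assumes the signed distances of the human mesh vertices to the scene have already been computed (negative values meaning the vertex lies inside the scene geometry); the function names and the interaction labels are hypothetical, and the units of $\mathbf{E_{pen}}$ are whatever the signed distance field uses.

```python
from typing import Iterable


def interpenetration_error(signed_distances: Iterable[float]) -> float:
    """E_pen: sum of |sdf| over vertices whose signed distance is negative."""
    return sum(-d for d in signed_distances if d < 0.0)


def calibration_done(e_pen: float, iou_3d: float, interaction: str, sigma_1: float = 20.0) -> bool:
    """Stopping test: E_pen below sigma_1 and 3D IoU above the interaction-specific sigma_2."""
    if interaction in ("sitting", "lying"):
        sigma_2 = 0.90
    elif interaction == "touching":
        sigma_2 = 0.50
    else:
        raise ValueError(f"unknown interaction type: {interaction}")
    return e_pen < sigma_1 and iou_3d > sigma_2


# Made-up signed distances for four vertices; only the two negative ones contribute.
print(interpenetration_error([0.3, -0.1, -0.25, 0.0]))
print(calibration_done(e_pen=16.01, iou_3d=0.94, interaction="lying"))
```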

Category Augmentation. After correcting the implausible human-object interpenetration via translation modification, we perform category augmentation to enhance the diversity of human-object interactions. The motivation is that, in real human activities, people interact with the same object in various ways, such as lying on a bed or sitting on it. To capture these behaviors, we augment the dataset by randomly replacing half of the contact humans with humans in alternate interaction modes, thereby creating a new set of human-object interactions. For instance, we might transform a scene from “lying on the bed” to “sitting on the bed”, as shown in Fig. 3b.

Orientation Augmentation. Since people might interact with objects from various angles, we also introduce orientation augmentation to enrich the dataset. To achieve this, we first add random noise to the orientation parameters of contact humans. Next, we adjust the translation parameters of contact humans until their interactions with objects meet the same criterion used in the translation modification. This keeps the interactions of contact humans with objects both varied and realistic.

Figure 3: Comparison between 3D-FRONT HUMAN (Yi et al. 2023) and our calibrated dataset. We correct human-object penetrations through translation modification to improve spatial accuracy. Additionally, we apply category and orientation augmentation to enhance the diversity in interactions.

To verify the effectiveness of our calibration pipeline, we report $\mathbf{E_{pen}}$ and the 3D IoU score for 3D-FRONT HUMAN and our calibrated dataset in Tab. 1. The calibration pipeline reduces human-scene interpenetration errors and improves human-object interactions of 3D-FRONT HUMAN across all room types; in bedrooms, for example, $\mathbf{E_{pen}}$ drops from 226.97 to 16.01 while the 3D IoU rises from 0.75 to 0.94. The qualitative comparisons in Fig. 3 also show that our calibrated dataset provides more plausible and diverse human-object interactions.

Table 1: Quantitative comparisons between the 3D-FRONT HUMAN (Yi et al. 2023) and our calibrated dataset on human-scene interaction metrics.

| Room Type | Dataset | $\mathbf{E_{pen}}$ $\downarrow$ | 3D IoU $\uparrow$ |
| --- | --- | --- | --- |
| Bedroom | 3D-FRONT HUMAN (Yi et al. 2023) | 226.97 | 0.75 |
| Bedroom | Our calibrated dataset | 16.01 | 0.94 |
| Living | 3D-FRONT HUMAN (Yi et al. 2023) | 125.49 | 0.77 |
| Living | Our calibrated dataset | 19.50 | 0.94 |
| Dining | 3D-FRONT HUMAN (Yi et al. 2023) | 133.34 | 0.78 |
| Dining | Our calibrated dataset | 20.72 | 0.95 |

Experiments

Implementations. Following the default settings in (Ho, Jain, and Abbeel 2020b), the forward process variances are set to constants increasing linearly from $\beta_{1}=10^{-4}$ to $\beta_{T}=0.02$. Unlike previous studies (Tang et al. 2023; Yang et al. 2024), we reduce the number of diffusion steps from 1000 to $T=100$, thereby substantially speeding up the sampling process. During training, we sample the diffusion step $t$ from a uniform distribution at each iteration and use the Adam (Kingma and Ba 2014) optimizer with a learning rate of $1\times 10^{-4}$ and no weight decay. Moreover, we apply random global rotation augmentation between [0, 360] degrees to the entire scene, including the floor plan, all objects, all contact humans, and the free space. Finally, we train SHADE on an NVIDIA Tesla A100 GPU for 625K iterations with a batch size of 128. During inference, we use the DDPM sampler (Ho, Jain, and Abbeel 2020b) to obtain the object properties and apply inference guidance in the last five steps.
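The schedule and sampling settings above can be sketched as follows. This is a plain-Python illustration, not the training code: the function names are hypothetical, and the convention that the "last five steps" of sampling correspond to $t \in \{5,\dots,1\}$ is an assumption about indexing.

```python
import random
from typing import List


def linear_beta_schedule(num_steps: int = 100, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Variances beta_1..beta_T increasing linearly from beta_start to beta_end."""
    if num_steps < 2:
        raise ValueError("need at least two diffusion steps")
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas: List[float]) -> List[float]:
    """alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def sample_timestep(num_steps: int, rng: random.Random) -> int:
    """Draw a training step t uniformly from {1, ..., T}."""
    return rng.randint(1, num_steps)


def guidance_active(t: int, guided_steps: int = 5) -> bool:
    """Whether guidance is applied at sampling step t (assumed: only the final steps t <= 5)."""
    return 1 <= t <= guided_steps


betas = linear_beta_schedule()
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
print(len(betas), sample_timestep(len(betas), rng), guidance_active(3))
```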

Network Architecture. As depicted in Fig. 2, we implement SHADE with a transformer-based architecture composed of 4 DiT blocks with adaLN-Zero (Peebles and Xie 2022). Each block has a latent dimension of 512, and each attention layer consists of 8 heads. As input, we extract the layout embedding from the 3D point clouds of the floor plan $\mathcal{F}$ and the free-space mask $\mathcal{FS}$ using a PointNet (Qi et al. 2017). Meanwhile, all objects and contact humans are encoded by a fully connected network (Tang et al. 2023), and the time step $t$ is encoded using a positional embedding (Tancik et al. 2020). Each object embedding is then treated as an input token and fed into the denoising network. To make the scene generation aware of the different contact humans and the room layout, we sum the embedding vectors of each contact human, the floor plan, and $t$ as the condition for the corresponding contact object, from which we regress the transformation parameters of each adaLN-Zero block. These parameters are then applied in the adaptive layer normalization of the embedding vector of the corresponding contact object. This strategy guides the diffusion model to generate plausible contact objects for different contact humans. Note that the generation of non-contact objects is conditioned only on the sum of the embedding vectors of the layout and $t$. Additionally, the self-attention (Vaswani et al. 2017) layer in each adaLN-Zero block models the spatial relationships between all objects within the scene, thereby enhancing scene plausibility.
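The per-token conditioning described above can be illustrated with a small, dependency-free Python sketch. It is an assumption-laden toy, not the SHADE network: embeddings are short lists, a single linear map stands in for the regression of the modulation parameters, and the modulation follows the common adaLN form (normalize, then scale by $1+\text{scale}$ and add shift). All names are hypothetical.

```python
import math
from typing import List, Optional

Vector = List[float]


def add_vectors(*vectors: Vector) -> Vector:
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def condition_vector(layout_emb: Vector, time_emb: Vector, human_emb: Optional[Vector] = None) -> Vector:
    """Contact objects: human + layout + t; non-contact objects: layout + t only."""
    if human_emb is None:
        return add_vectors(layout_emb, time_emb)
    return add_vectors(human_emb, layout_emb, time_emb)


def linear(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Toy linear layer standing in for the regression of shift/scale from the condition."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def layer_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Layer normalization without learned affine parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln_modulate(token: Vector, shift: Vector, scale: Vector) -> Vector:
    """Adaptive layer norm: normalize the token, then apply condition-dependent scale and shift."""
    return [n * (1.0 + s) + b for n, s, b in zip(layer_norm(token), scale, shift)]


dim = 3
cond = condition_vector([1.0, 2.0, 0.0], [0.5, 0.5, 0.5], human_emb=[0.1, 0.2, 0.3])
zero_weight = [[0.0] * dim for _ in range(dim)]  # zero initialization, as the "Zero" in adaLN-Zero suggests
shift = linear(cond, zero_weight, [0.0] * dim)
scale = linear(cond, zero_weight, [0.0] * dim)
print(adaln_modulate([1.0, 2.0, 3.0], shift, scale))
```

With zero-initialized regression weights the modulation reduces to a plain layer normalization, so the condition only starts to influence a token once those weights are trained.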

Dataset. We conduct experiments on the calibrated 3D FRONT HUMAN dataset, which contains a total of 5,689 bedrooms, 2,987 living rooms, and 2,549 dining rooms. We use 21 object categories for the bedrooms, and 24 for the living rooms and dining rooms. Following (Yi et al. 2023), for each kind of room, we split the data into 80% for training, 10% for validation, and 10% for testing. We train and validate our model on the training and validation sets respectively, and evaluate it on the test set for each room type. During inference, we use the 3D-FUTURE dataset (Fu et al. 2021) for object retrieval, as suggested in (Paschalidou et al. 2021).

Figure 4: Qualitative comparison on the test split of the calibrated 3D FRONT HUMAN dataset. Compared with existing state-of-the-art methods, our method generates more plausible scenes that avoid conflicts with free-space humans and room boundaries, and present fewer overlapping objects.

Table 2: Quantitative comparison on the test split of the Calibrated 3D FRONT HUMAN dataset. We compare SHADE with MIME and DiffuScene on human-scene interaction score 3D IoU, scene plausibility metrics $\mathbf{Col_{mot}}$, $\mathbf{R_{out}}$, $\mathbf{Col_{obj}}$, and standard perceptual quality scores FID, CKL. The best scores are highlighted in bold, and the second best scores are underlined.

| Room Type | Method | 3D IoU $\uparrow$ | $\mathbf{Col_{mot}}$ $\downarrow$ | $\mathbf{R_{out}}$ $\downarrow$ | $\mathbf{Col_{obj}}$ $\downarrow$ | FID $\downarrow$ | CKL $\downarrow$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Bedroom | MIME (Yi et al. 2023) | 0.905 | 0.068 | 0.003 | 0.020 | 37.32 | 0.021 |
| Bedroom | DiffuScene (Tang et al. 2023) | 0.550 | 0.169 | 0.005 | 0.049 | 42.83 | 0.026 |
| Bedroom | SHADE (Ours) | 0.915 | 0.051 | 0.006 | 0.016 | 41.59 | 0.023 |
| Living | MIME (Yi et al. 2023) | 0.899 | 0.041 | 0.002 | 0.060 | 34.25 | 0.013 |
| Living | DiffuScene (Tang et al. 2023) | 0.301 | 0.184 | 0.003 | 0.117 | 38.15 | 0.031 |
| Living | SHADE (Ours) | 0.913 | 0.029 | 0.002 | 0.023 | 35.71 | 0.008 |
| Dining | MIME (Yi et al. 2023) | 0.924 | 0.031 | 0.002 | 0.067 | 34.61 | 0.015 |
| Dining | DiffuScene (Tang et al. 2023) | 0.280 | 0.153 | 0.003 | 0.115 | 41.35 | 0.040 |
| Dining | SHADE (Ours) | 0.903 | 0.023 | 0.003 | 0.050 | 39.99 | 0.012 |

Table 3: Quantitative comparison on the PROXD qualitative dataset (Hassan et al. 2019).

| Method | 3D IoU $\uparrow$ | $\mathbf{Col_{mot}}$ $\downarrow$ | $\mathbf{R_{out}}$ $\downarrow$ | $\mathbf{Col_{obj}}$ $\downarrow$ |
| --- | --- | --- | --- | --- |
| MIME (Yi et al. 2023) | 0.883 | 0.224 | 0.002 | 0.144 |
| DiffuScene (Tang et al. 2023) | 0.046 | 0.202 | 0.003 | 0.096 |
| SHADE (Ours) | 0.881 | 0.171 | 0.003 | 0.070 |

Baselines. We compare our method with MIME (Yi et al. 2023) and DiffuScene (Tang et al. 2023) using their official implementations. MIME is a transformer-based autoregressive method for human-aware scene generation, while DiffuScene is a diffusion-based method that learns 3D scene distributions without conditioning on the floor plan. For a fairer comparison, we adapt DiffuScene to condition on the 2D floor plan and the free-space mask, which enhances its perception of human motions. All methods are trained on our calibrated 3D FRONT HUMAN dataset.

Evaluation Metrics. Following (Yi et al. 2023), we evaluate all methods on the plausibility of human-object interaction and the realism of the generated scenes. As suggested in (Yi et al. 2023), we use the 3D IoU score to measure the plausibility of human-object interaction, by calculating the intersection ratio between input contact bounding boxes and generated objects. Additionally, we employ $\mathbf{Col_{mot}}$ (Yi et al. 2023) to evaluate collision between generated objects and free-space humans, and $\mathbf{Col_{obj}}$ (Yang et al. 2024) to represent the collision rate between objects in the generated scene. To measure the violation of the room layout, we compute the collision rate between generated objects and the areas outside the floor plan, denoted as $\mathbf{R_{out}}$. Furthermore, following (Paschalidou et al. 2021), we calculate the Fréchet inception distance (FID) score and the category KL divergence (CKL) between generated scenes and real scenes to measure the realism of generated scenes. All these evaluation experiments are conducted on the test split of the calibrated 3D FRONT HUMAN dataset.
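To make the category KL divergence concrete, the sketch below compares the object-category frequencies of real and generated scenes. It is an illustration under assumptions: the direction KL(real || generated), the additive smoothing constant, and the function names are choices made here and are not specified in the text above.

```python
import math
from typing import Dict, List


def category_distribution(counts: Dict[str, int], categories: List[str], eps: float) -> List[float]:
    """Normalized category frequencies with a small additive smoothing term."""
    smoothed = [counts.get(category, 0) + eps for category in categories]
    total = sum(smoothed)
    return [value / total for value in smoothed]


def category_kl(real_counts: Dict[str, int], generated_counts: Dict[str, int], eps: float = 1e-9) -> float:
    """KL(P_real || P_generated) over object categories (direction is an assumption)."""
    categories = sorted(set(real_counts) | set(generated_counts))
    p = category_distribution(real_counts, categories, eps)
    q = category_distribution(generated_counts, categories, eps)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Made-up category counts, for illustration only.
real = {"bed": 10, "nightstand": 10}
generated = {"bed": 5, "nightstand": 15}
print(category_kl(real, real), category_kl(real, generated))
```

A value of zero means the generated scenes reproduce the real category frequencies exactly; larger values indicate that some categories are over- or under-generated.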

Figure 5: Ablation on spatial collision guidance functions. The top row shows the scenes generated without guidance (w/o), with red boxes indicating constraint violations, while the bottom row shows scenes generated with guidance (w), where green boxes highlight the improvements achieved by the proposed guidance.

Table 4: Ablation study on different spatial collision guidance functions. The best scores are highlighted in bold, and the second best scores are underlined.

| Motion Collision | Room Boundary | Object Collision | $\mathbf{Col_{mot}}$ $\downarrow$ | $\mathbf{R_{out}}$ $\downarrow$ | $\mathbf{Col_{obj}}$ $\downarrow$ |
| --- | --- | --- | --- | --- | --- |
|  |  |  | 0.082 | 0.003 | 0.020 |
| ✓ |  |  | 0.034 | 0.008 | 0.023 |
|  | ✓ |  | 0.100 | 0.002 | 0.021 |
|  |  | ✓ | 0.084 | 0.004 | 0.008 |
| ✓ | ✓ | ✓ | 0.051 | 0.006 | 0.016 |

Figure 6: Ablation study on varying numbers of free-space and contact humans. (a) As the number of free-space humans increases, SHADE generates fewer objects in the scene. (b) Given more contact humans as input, SHADE generates more occupied objects to interact with them.

Human-aware Scene Synthesis

Figure 4 compares the scenes generated by our method and the baselines for different room types. Since DiffuScene only considers floor plan and free-space information, it fails to generate appropriate objects to interact with contact humans, and it also generates objects that collide with free-space humans. Both MIME and our method can generate reasonable objects to support various contact humans, such as placing a bed under a sitting human or a sofa under a lying human. However, MIME still places objects where free-space humans move or outside the floor plan (see the second column of Fig. 4). In contrast, our method generates more plausible scenes that avoid collisions with free-space humans and room boundaries and reduce object overlap.

These observations are supported by the quantitative comparisons in Table 2. Our method outperforms MIME in 3D IoU for the bedroom and living room (0.915 vs. 0.905 and 0.913 vs. 0.899) and performs comparably in the dining room (0.903 vs. 0.924). Furthermore, our method achieves the lowest motion collision ($\mathbf{Col_{mot}}$) and object collision ($\mathbf{Col_{obj}}$) scores across all room types. Although MIME achieves better FID scores, our method obtains better CKL scores in the living room and dining room (MIME is slightly better in the bedroom, 0.021 vs. 0.023), indicating better alignment with the categorical distribution of real data in those room types. Overall, these results demonstrate the effectiveness of our framework.
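To put the collision numbers in perspective, the relative reduction of our scores with respect to MIME can be computed directly from Table 2. The snippet below is a small worked calculation; the values are copied from the table, and the helper name `relative_reduction` is hypothetical.

```python
from typing import Dict, Tuple


def relative_reduction(baseline: float, ours: float) -> float:
    """Fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline


# (MIME, SHADE) pairs copied from Table 2.
col_mot: Dict[str, Tuple[float, float]] = {
    "Bedroom": (0.068, 0.051),
    "Living": (0.041, 0.029),
    "Dining": (0.031, 0.023),
}
col_obj: Dict[str, Tuple[float, float]] = {
    "Bedroom": (0.020, 0.016),
    "Living": (0.060, 0.023),
    "Dining": (0.067, 0.050),
}

for room in col_mot:
    mot = relative_reduction(*col_mot[room])
    obj = relative_reduction(*col_obj[room])
    print(f"{room}: Col_mot reduced by {mot:.1%}, Col_obj reduced by {obj:.1%}")
```

For example, the bedroom motion collision score drops from 0.068 to 0.051, a relative reduction of 25%.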

Following MIME (Yi et al. 2023), we test SHADE on a real dataset of human motion to evaluate the generalization of our method. We use 5 input motions from the PROXD (Hassan et al. 2019) dataset and generate 10 scenes for each motion sequence. The quantitative results of our method and the baselines are reported in Table 3. Note that none of the methods is fine-tuned on the PROXD dataset. For human-object interaction plausibility, our method achieves a 3D IoU score of 0.881, outperforming DiffuScene and being comparable to MIME. In terms of scene plausibility, our method surpasses the other methods on the motion collision metric $\mathbf{Col_{mot}}$ and the object collision metric $\mathbf{Col_{obj}}$, indicating that our method can generate more realistic scenes with fewer human-object and object-object collisions. Figure 1 presents the qualitative visualization of all models’ generations.

Ablation Studies

Inference Guidance. We investigate the impact of each spatial collision guidance function on the bedroom and present the results in Tab. 4. Compared to scene generation without any guidance function, incorporating the motion collision avoidance function reduces the $\mathbf{Col_{mot}}$ metric to 0.034, demonstrating its effectiveness. Similar conclusions can be drawn from the results of the other two spatial collision guidance functions. It is noteworthy that these collision-based guidance functions can negatively affect each other. For example, while the room boundary constraint function improves the $\mathbf{R_{out}}$ metric, it can degrade the other metrics. This is reasonable because room boundary guidance pushes objects into the floor plan, potentially leading to increased collisions with free-space humans ($\mathbf{Col_{mot}}$) and other objects ($\mathbf{Col_{obj}}$). To balance these effects, we integrate all spatial collision functions, thereby achieving better overall performance in scene plausibility. Fig. 5 provides a qualitative visualization of the effect of each spatial collision guidance function. The improvements shown in the second row of Fig. 5 confirm that our collision-based guidance functions can enhance the plausibility of 3D scenes.

Number of Input Humans. In Fig. 6, we further investigate the impact of input humans on scene generation by varying the number of free-space humans and contact humans provided as input. Our qualitative results indicate that as the density of free-space humans increases, our method generates fewer objects in the scenes. Additionally, when given more contact humans, we produce more occupied objects to interact with them. These findings demonstrate the flexible scene generation capabilities of our method, influenced by the varying numbers of input humans.

Conclusions

We introduced SHADE, a spatially-constrained diffusion model for human-aware 3D scene synthesis. SHADE learns a scene diffusion model that simultaneously considers all input humans and the floor plan to generate plausible furniture layouts. During scene generation, SHADE integrates motion collision avoidance, boundary constraint, and object collision avoidance as guidance to further enhance scene plausibility. Additionally, we devise an automated calibration pipeline to improve the spatial accuracy and diversity of human-object interactions in an existing human-aware 3D scene dataset, thereby enhancing the generation ability of SHADE. The quantitative and qualitative results show promising improvements on synthetic and real-world HSI datasets, demonstrating the effectiveness of our framework and calibration pipeline.

References

  • Fu et al. (2021) Fu, H.; Jia, R.; Gao, L.; Gong, M.; Zhao, B.; Maybank, S.; and Tao, D. 2021. 3d-future: 3d furniture shape with texture. International Journal of Computer Vision, 1–25.
  • Gao et al. (2023) Gao, L.; Sun, J.-M.; Mo, K.; Lai, Y.-K.; Guibas, L. J.; and Yang, J. 2023. SceneHGN: Hierarchical Graph Networks for 3D Indoor Scene Generation With Fine-Grained Geometry. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(7): 8902–8919.
  • Hassan et al. (2019) Hassan, M.; Choutas, V.; Tzionas, D.; and Black, M. J. 2019. Resolving 3D human pose ambiguities with 3D scene constraints. In Proceedings of the IEEE/CVF international conference on computer vision, 2282–2292.
  • Ho, Jain, and Abbeel (2020a) Ho, J.; Jain, A.; and Abbeel, P. 2020a. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33: 6840–6851.
  • Ho, Jain, and Abbeel (2020b) Ho, J.; Jain, A.; and Abbeel, P. 2020b. Denoising Diffusion Probabilistic Models. In Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.; and Lin, H., eds., Advances in Neural Information Processing Systems, volume 33, 6840–6851. Curran Associates, Inc.
  • Ho et al. (2022) Ho, J.; Salimans, T.; Gritsenko, A.; Chan, W.; Norouzi, M.; and Fleet, D. J. 2022. Video diffusion models. Advances in Neural Information Processing Systems, 35: 8633–8646.
  • Huang et al. (2023) Huang, S.; Wang, Z.; Li, P.; Jia, B.; Liu, T.; Zhu, Y.; Liang, W.; and Zhu, S.-C. 2023. Diffusion-based generation, optimization, and planning in 3d scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16750–16761.
  • Karunratanakul et al. (2023) Karunratanakul, K.; Preechakul, K.; Suwajanakorn, S.; and Tang, S. 2023. Guided motion diffusion for controllable human motion synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2151–2162.
  • Kawar et al. (2023) Kawar, B.; Zada, S.; Lang, O.; Tov, O.; Chang, H.; Dekel, T.; Mosseri, I.; and Irani, M. 2023. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6007–6017.
  • Kingma and Ba (2014) Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Li et al. (2019) Li, M.; Patil, A. G.; Xu, K.; Chaudhuri, S.; Khan, O.; Shamir, A.; Tu, C.; Chen, B.; Cohen-Or, D.; and Zhang, H. 2019. Grains: Generative recursive autoencoders for indoor scenes. ACM Transactions on Graphics (TOG), 38(2): 1–16.
  • Liu et al. (2023) Liu, J.; Xiong, W.; Jones, I.; Nie, Y.; Gupta, A.; and Oğuz, B. 2023. CLIP-Layout: Style-Consistent Indoor Scene Synthesis with Semantic Furniture Embedding. ArXiv, abs/2303.03565.
  • Luo et al. (2020) Luo, A.; Zhang, Z.; Wu, J.; and Tenenbaum, J. B. 2020. End-to-end optimization of scene layout. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3754–3763.
  • Müller et al. (2006) Müller, P.; Wonka, P.; Haegler, S.; Ulmer, A.; and Van Gool, L. 2006. Procedural modeling of buildings. In ACM SIGGRAPH 2006 Papers, 614–623.
  • Nie et al. (2022) Nie, Y.; Dai, A.; Han, X.; and Nießner, M. 2022. Pose2room: understanding 3d scenes from human activities. In European Conference on Computer Vision, 425–443. Springer.
  • Paschalidou et al. (2021) Paschalidou, D.; Kar, A.; Shugrina, M.; Kreis, K.; Geiger, A.; and Fidler, S. 2021. ATISS: Autoregressive Transformers for Indoor Scene Synthesis. In Advances in Neural Information Processing Systems (NeurIPS).
  • Patil et al. (2023) Patil, A. G.; Patil, S. G.; Li, M.; Fisher, M.; Savva, M.; and Zhang, H. 2023. Advances in Data-Driven Analysis and Synthesis of 3D Indoor Scenes. ArXiv, abs/2304.03188.
  • Peebles and Xie (2022) Peebles, W.; and Xie, S. 2022. Scalable Diffusion Models with Transformers. CoRR, abs/2212.09748.
  • Qi et al. (2017) Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2017. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 652–660.
  • Qi et al. (2018) Qi, S.; Zhu, Y.; Huang, S.; Jiang, C.; and Zhu, S.-C. 2018. Human-centric indoor scene synthesis using stochastic grammar. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5899–5908.
  • Rempe et al. (2023) Rempe, D.; Luo, Z.; Bin Peng, X.; Yuan, Y.; Kitani, K.; Kreis, K.; Fidler, S.; and Litany, O. 2023. Trace and pace: Controllable pedestrian animation via guided trajectory diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13756–13766.
  • Ritchie, Wang, and Lin (2019) Ritchie, D.; Wang, K.; and Lin, Y.-A. 2019. Fast and Flexible Indoor Scene Synthesis via Deep Convolutional Generative Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Saharia et al. (2022) Saharia, C.; Chan, W.; Chang, H.; Lee, C.; Ho, J.; Salimans, T.; Fleet, D.; and Norouzi, M. 2022. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 conference proceedings, 1–10.
  • Shen et al. (2023) Shen, Z.; Cen, Z.; Peng, S.; Shuai, Q.; Bao, H.; and Zhou, X. 2023. Learning human mesh recovery in 3D scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 17038–17047.
  • Tancik et al. (2020) Tancik, M.; Srinivasan, P.; Mildenhall, B.; Fridovich-Keil, S.; Raghavan, N.; Singhal, U.; Ramamoorthi, R.; Barron, J.; and Ng, R. 2020. Fourier features let networks learn high frequency functions in low dimensional domains. Advances in neural information processing systems, 33: 7537–7547.
  • Tang et al. (2023) Tang, J.; Nie, Y.; Markhasin, L.; Dai, A.; Thies, J.; and Nießner, M. 2023. DiffuScene: Scene Graph Denoising Diffusion Probabilistic Model for Generative Indoor Scene Synthesis. arXiv preprint.
  • Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. Advances in neural information processing systems, 30.
  • Wang et al. (2018) Wang, K.; Savva, M.; Chang, A. X.; and Ritchie, D. 2018. Deep convolutional priors for indoor scene synthesis. ACM Transactions on Graphics (TOG), 37(4): 1–14.
  • Wang, Yeshwanth, and Nießner (2020) Wang, X.; Yeshwanth, C.; and Nießner, M. 2020. SceneFormer: Indoor Scene Generation with Transformers. arXiv preprint arXiv:2012.09793.
  • Yang et al. (2024) Yang, Y.; Jia, B.; Zhi, P.; and Huang, S. 2024. PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI. arXiv preprint arXiv:2404.09465.
  • Ye et al. (2022) Ye, S.; Wang, Y.; Li, J.; Park, D.; Liu, C. K.; Xu, H.; and Wu, J. 2022. Scene Synthesis from Human Motion. In SIGGRAPH Asia 2022 Conference Papers, SA ’22. New York, NY, USA: Association for Computing Machinery. ISBN 9781450394703.
  • Yi et al. (2023) Yi, H.; Huang, C.-H. P.; Tripathi, S.; Hering, L.; Thies, J.; and Black, M. J. 2023. MIME: Human-Aware 3D Scene Generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 12965–12976.
  • Zhang, Rao, and Agrawala (2023) Zhang, L.; Rao, A.; and Agrawala, M. 2023. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3836–3847.
  • Zhang et al. (2019) Zhang, S.-H.; Zhang, S.-K.; Liang, Y.; and Hall, P. 2019. A survey of 3d indoor scene synthesis. Journal of Computer Science and Technology, 34: 594–608.
  • Zhang et al. (2020a) Zhang, S.-H.; Zhang, S.-K.; Xie, W.-Y.; Luo, C.-Y.; and Fu, H.-B. 2020a. Fast 3d indoor scene synthesis with discrete and exact layout pattern extraction. arXiv preprint arXiv:2002.00328.
  • Zhang et al. (2020b) Zhang, Z.; Yang, Z.; Ma, C.; Luo, L.; Huth, A.; Vouga, E.; and Huang, Q. 2020b. Deep generative modeling for scene synthesis via hybrid representations. ACM Transactions on Graphics (TOG), 39(2): 1–21.
  • Zhou et al. (2019) Zhou, D.; Fang, J.; Song, X.; Guan, C.; Yin, J.; Dai, Y.; and Yang, R. 2019. Iou loss for 2d/3d object detection. In 2019 international conference on 3D vision (3DV), 85–94. IEEE.
  • Zhou, While, and Kalogerakis (2019) Zhou, Y.; While, Z.; and Kalogerakis, E. 2019. SceneGraphNet: Neural Message Passing for 3D Indoor Scene Augmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).