Human-Aware 3D Scene Generation with Spatially-constrained Diffusion Models

Xiaolin Hong^1,3, Hongwei Yi^2, Fazhi He^1, Qiong Cao^3

Abstract

Generating 3D scenes from human motion sequences supports numerous applications, including virtual reality and architectural design. However, previous autoregressive human-aware 3D scene generation methods have struggled to accurately capture the joint distribution of multiple objects and input humans, often resulting in overlapping objects generated in the same space. To address this limitation, we explore the potential of diffusion models that simultaneously consider all input humans and the floor plan to generate plausible 3D scenes. Our approach not only satisfies all input human interactions but also adheres to the spatial constraints imposed by the floor plan. Furthermore, we introduce two spatial collision guidance mechanisms: human-object collision avoidance and object-room boundary constraints. These mechanisms help avoid generating scenes that conflict with human motions while respecting layout constraints. To enhance the diversity and accuracy of human-guided scene generation, we have developed an automated pipeline that improves the variety and plausibility of human-object interactions in the existing 3D FRONT HUMAN dataset. Extensive experiments on both synthetic and real-world datasets demonstrate that our framework can generate more natural and plausible 3D scenes with precise human-scene interactions, while significantly reducing human-object collisions compared to previous state-of-the-art methods. Our code and data will be made publicly available upon publication of this work.

Figure 1: Our method generates more plausible 3D scenes given input human motions and floor plans. It excels in two key aspects: (1) avoiding collisions between humans and objects, as well as between objects, a significant improvement over MIME (Yi et al. 2023), and (2) providing better support for human-object interactions compared to DiffuScene (Tang et al. 2023).

Introduction

Creating diverse and realistic 3D environments inhabited by humans is essential for numerous applications, such as virtual reality (VR), interior design and training embodied artificial intelligence (AI) agents (Zhang et al. 2019). For example, VR game designers aim to construct immersive environments that enable participants to engage seamlessly and naturally with objects. These demands have driven researchers to explore diverse scene generation methods, propelling the rapid advancement of 3D scene synthesis (Luo et al. 2020; Paschalidou et al. 2021; Liu et al. 2023; Gao et al. 2023). Despite recent progress, there remain significant challenges in generating visually plausible scenes that are populated with appropriate furniture and adhere to various human motions.

There has been a recent surge in research dedicated to the problem of human-aware scene generation (Ye et al. 2022; Yi et al. 2023). A common approach in these studies is to apply autoregressive models to sequentially place objects based on input humans and already generated objects. However, these methods often yield implausible scenes with object-object collisions. This limitation primarily arises from the inherent inability of autoregressive models to capture the joint distribution of multiple objects and multiple humans fully. Consequently, exploring a generative method that can effectively model and capture these complex distributions is crucial for generating realistic 3D scenes.

More recently, diffusion-based approaches for scene synthesis (Tang et al. 2023; Yang et al. 2024) have shown an impressive ability to simplify the approximation of the joint distribution of objects, thereby generating the entire scenes at once and thus enhancing the realism of generated scenes. Meanwhile, many studies in image generation (Saharia et al. 2022; Zhang, Rao, and Agrawala 2023; Kawar et al. 2023) and human motion synthesis (Huang et al. 2023; Karunratanakul et al. 2023) have demonstrated that diffusion models can effectively incorporate inference guidance to meet user-defined goals. Despite these advancements, there is still no standard solution for generating plausible 3D scenes that both support various human interactions and adhere to spatial constraints, such as avoiding motion collisions and respecting room boundaries.

To tackle these challenges, we propose SHADE, a Spatially-constrained Human-Aware Diffusion-based 3D Environment synthesis method. As shown in Figure 1, our method can generate plausible scene layouts that avoid collisions with humans and between objects, while supporting various human activities such as sitting and lying. Our key insight is to harness diffusion models that take all humans and the floor plan as input simultaneously to generate holistic object configurations. Specifically, we input the contact bounding boxes and free-space mask extracted from the input human motions and the floor plan, following (Yi et al. 2023). We then learn a diffusion model to capture the joint distribution of objects, enabling the simultaneous generation of plausible object placements and an understanding of the relationships between their attributes. To further enhance the plausibility of generated scenes, we design two spatial collision guidance functions: 1) motion collision avoidance, which calculates the 2D distances between objects and moving humans to prevent implausible penetrations, and 2) a boundary constraint, which penalizes the distance by which objects extend beyond the floor plan, ensuring that object placement respects room boundaries. During inference, we combine these two guidance functions with an object-object collision function (Yang et al. 2024). This allows our diffusion model to generate collision-free scenes that respect human movements and room boundaries and prevent object overlap.

Beyond model design, we enhance human-aware scene generation by addressing key data challenges in datasets like 3D FRONT HUMAN (Yi et al. 2023). Specifically, we tackle two main issues: 1) incorrect penetrations in human-object interactions hinder accurate spatial relationship modeling, and 2) limited interaction diversity restricts model generalization. Our approach includes automated techniques for adjusting translations to correct penetrations and augmenting categories and orientations to diversify human-object interactions. We use our calibrated dataset to train SHADE and evaluate it on synthetic and real datasets. Our results, both quantitative and qualitative, show that our method generates more plausible 3D scenes with realistic human-scene interactions and effectively reduces human-object collisions. These findings underscore the superiority of SHADE compared to previous state-of-the-art methods.

Our main contributions are: (i) We propose a human-aware diffusion-based model for generating realistic 3D scene layouts with plausible human-object interactions in a single step. (ii) We design two spatial collision inference strategies, motion collision avoidance and boundary constraint guidance, to enhance scene fidelity by preventing collisions with human motions and respecting room layouts. (iii) We devise an automated calibration pipeline to enhance accuracy and diversity in human-object interactions within the original dataset, thereby improving the generative quality and diversity of human-aware scene synthesis.

Related Works

Human-agnostic Scene Synthesis. The objective of human-agnostic scene generative methods is to generate plausible and diverse scene layouts without taking into account human activities (Wang et al. 2018; Zhou, While, and Kalogerakis 2019; Ritchie, Wang, and Lin 2019; Wang, Yeshwanth, and Nießner 2020; Zhang et al. 2020a, b; Paschalidou et al. 2021; Tang et al. 2023; Liu et al. 2023; Patil et al. 2023). Early works explore procedural modeling with grammars (Müller et al. 2006) to place objects in the scenes, heavily relying on manually designed rules and producing scenes with limited diversity. Subsequently, graph-based approaches (Zhou, While, and Kalogerakis 2019; Li et al. 2019; Luo et al. 2020; Gao et al. 2023) have been extensively developed to represent 3D scenes as scene graphs and capture the underlying structure of indoor scenes using graph neural networks. Unlike these studies, recent autoregressive generation methods represent 3D scenes as object sequences and learn autoregressive models to sequentially predict the next object conditioned on already generated objects. The representative methods include SceneFormer (Wang, Yeshwanth, and Nießner 2020), ATISS (Paschalidou et al. 2021), and CLIP-Layout (Liu et al. 2023). However, these autoregressive methods fail to accurately capture the distribution of the entire object sequence, which often results in overlapping objects generated in the same place. More recently, some diffusion-based approaches (Tang et al. 2023; Yang et al. 2024) have been proposed to ease the approximation of the joint distribution of multiple objects, benefiting the generation of high-fidelity scenes. Inspired by these works, we develop a spatially-constrained diffusion method, SHADE. Unlike previous diffusion-based methods (Tang et al. 2023; Yang et al. 2024), SHADE learns a diffusion model to capture the joint distribution of objects conditioned on both human motions and the floor plan. Furthermore, SHADE integrates multiple spatial constraints as inference guidance to improve scene plausibility.

Human-aware Scene Synthesis. This branch of scene synthesis focuses on producing plausible scenes in which the input human motions can naturally take place (Qi et al. 2018; Nie et al. 2022; Ye et al. 2022; Yi et al. 2023). To this end, Pose2Room (Nie et al. 2022) proposes a pose-conditioned generative model to predict object configurations from human pose trajectories. However, Pose2Room can only predict contact objects rather than an entire scene. In contrast, SUMMON (Ye et al. 2022) learns a ContactFormer to generate affordable objects that are in contact with humans and employs an autoregressive model to complete the scene. Similar to (Paschalidou et al. 2021), MIME (Yi et al. 2023) learns an autoregressive model to sequentially predict object placements based on the contact information and free space extracted from human motions and the floor plan. Unlike these approaches, SHADE is a diffusion-based, non-autoregressive method that inherently explores the relationships between all object attributes to generate more plausible furniture layouts. Moreover, our method incorporates spatially-constrained guidance into the generation process, thereby enhancing scene plausibility.

Method

Figure 2: Overview of our method. SHADE learns a diffusion model to gradually clean the noisy scene $\mathbf{x}_{T}$ by simultaneously considering the contact bounding boxes, free-space mask, floor plan, and time step. During inference, SHADE applies three spatial collision guidance functions to ensure the generation of plausible scenes that avoid conflicts with human motions and room boundaries and prevent object overlap.

An overview of SHADE is presented in Fig. 2. Specifically, our method first takes as input the contact bounding boxes and free-space mask extracted from the given human motions, together with the floor plan, and extracts their embeddings with respective encoders. Conditioned on these embeddings, we learn a scene diffusion model to capture the joint distribution of multiple objects. This enables us to generate plausible object configurations by exploring the relationships between their attributes. During inference, we apply three spatial collision guidance functions to guide the diffusion model in generating plausible 3D scenes that avoid conflicts with human motions and adhere to layout constraints. In the following, we first describe the problem formulation of our task and then detail our human-aware diffusion model and inference guidance.

Problem Formulation

Given input human motions $\mathcal{H}$ and an empty floor plan $\mathcal{F}$, our goal is to generate a 3D scene $\mathcal{S}$ that can support various human interactions and movements. Following (Paschalidou et al. 2021; Yi et al. 2023; Tang et al. 2023), the scene $\mathcal{S}$ is represented as an unordered set of $N$ objects, denoted as $\mathbf{x}=\{o_{i}\}^{N}_{i=1}$. Each object $o_{i}=\{k_{i},\ell_{i},s_{i},r_{i}\}$ consists of a semantic label $k_{i}\in\mathbb{R}^{K}$ out of $K$ categories, location $\ell_{i}\in\mathbb{R}^{3}$, size $s_{i}\in\mathbb{R}^{3}$ and orientation $r_{i}\in\mathbb{R}^{1}$. Based on human-object interaction, these objects $\mathcal{O}$ can be categorized into two kinds: 1) contact objects $\mathcal{Q}=\{q_{i}\}^{L}_{i=1}$, which are in contact with humans, and 2) non-contact objects $\bar{\mathcal{Q}}=\{\bar{q}_{i}\}^{N-L}_{i=1}$, which have no interaction with humans. To condition the 3D scene generation on human motions, we extract both contact humans and free-space humans from the input motions, following (Yi et al. 2023). Specifically, contact humans indicate the location and category of contact objects, represented as the collection of contact boxes $\mathcal{C}=\{c_{i}\}^{L}_{i=1}$. Free-space humans define the walkable area of a room, specifying regions where objects cannot be placed. This information is represented as a binary free-space mask $\mathcal{FS}$ obtained by projecting all foot contact points onto the floor plan. Formally, conditioned on the floor plan $\mathcal{F}$, the free-space mask $\mathcal{FS}$ and all contact humans $\mathcal{C}$, we learn a generative model to predict the contact objects $\mathcal{Q}$ and the non-contact objects $\bar{\mathcal{Q}}$, such that they can support all human interactions while adhering to the constraints imposed by human motions $\mathcal{H}$ and the floor plan $\mathcal{F}$.
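To make the representation concrete, the following is a minimal sketch of how one object $o_{i}$ could be stored and flattened into a vector of length $K+7$. The class name `SceneObject`, the `in_contact` flag, the one-hot label encoding, and the reading of $r_{i}$ as a rotation about the vertical axis are illustrative assumptions, not details given in the text.

```python
from dataclasses import dataclass
from typing import List, Tuple

K = 21  # number of categories; 21 is the bedroom count reported in the Dataset paragraph


@dataclass
class SceneObject:
    label: int                            # category index in [0, K)
    location: Tuple[float, float, float]  # l_i in R^3
    size: Tuple[float, float, float]      # s_i in R^3
    orientation: float                    # r_i in R^1 (assumed: angle about the vertical axis)
    in_contact: bool = False              # hypothetical flag: True for contact objects

    def to_vector(self) -> List[float]:
        """Concatenate [one-hot label, location, size, orientation] into K + 7 values."""
        one_hot = [0.0] * K
        one_hot[self.label] = 1.0
        return one_hot + list(self.location) + list(self.size) + [self.orientation]


def split_scene(objects):
    """Partition a scene into contact objects and non-contact objects."""
    contact = [o for o in objects if o.in_contact]
    non_contact = [o for o in objects if not o.in_contact]
    return contact, non_contact


scene = [
    SceneObject(0, (1.0, 0.0, 2.0), (2.0, 0.5, 1.6), 0.0, in_contact=True),
    SceneObject(5, (3.2, 0.0, 0.4), (0.5, 0.6, 0.5), 1.57),
]
contact, non_contact = split_scene(scene)
```

Stacking the `to_vector()` outputs of all $N$ objects gives one possible layout for the scene tensor $\mathbf{x}$ that the diffusion model denoises.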

Human-aware Scene Synthesis

The diffusion model provides a robust framework for scene generation by learning the joint distribution of multiple objects, which is essential for creating plausible scenes without overlapping objects. In the following sections, we detail how we incorporate the diffusion model with all input humans and the floor plan to generate realistic 3D scenes.

Diffusion and Generative Process. Let $\mathbf{x}_{0}\sim q(\mathbf{x}_{0})$ denote a clean scene sampled from the training data. We gradually add Gaussian noise to $\mathbf{x}_{0}$ with a forward diffusion process $q(\mathbf{x}_{t+1}|\mathbf{x}_{t})$ of length $T$. After $T$ diffusion steps, the noisy scene approximates Gaussian noise $\mathbf{x}_{T}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$. To generate scenes conditioned on human motions and the floor plan, our diffusion model learns a reverse denoising process $p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t},C)$ to convert $\mathbf{x}_{T}$ back to $\mathbf{x}_{0}$:

$$p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t},C)=\mathcal{N}\left(\mathbf{x}_{t-1};\mu_{\theta}(\mathbf{x}_{t},t,C),(1-\alpha_{t})\mathbf{I}\right), \tag{1}$$

where $\theta$ are the model parameters, $\alpha_{t}$ depends on a pre-defined variance schedule, and $C$ denotes the conditioning signals. As illustrated in Fig. 2, we select $C=\{\mathcal{C},\mathcal{F},\mathcal{FS}\}$ as the conditioning signals, enabling the diffusion model to generate 3D scenes by leveraging the information from all contact humans $\mathcal{C}$, the floor plan $\mathcal{F}$, and the free-space mask $\mathcal{FS}$. According to $\epsilon$-prediction (Ho, Jain, and Abbeel 2020a), $\mu_{\theta}(\mathbf{x}_{t},t,C)$ can be re-parameterized as:

$$\mu_{\theta}(\mathbf{x}_{t},t,C)=\frac{1}{\sqrt{\alpha_{t}}}\left(\mathbf{x}_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\bm{\epsilon}_{\theta}(\mathbf{x}_{t},t,C)\right), \tag{2}$$

where $\bar{\alpha}_{t}=\prod_{i=1}^{t}\alpha_{i}$ and $\bm{\epsilon}_{\theta}(\mathbf{x}_{t},t,C)$ is a denoising network that predicts the noise applied to $\mathbf{x}_{0}$ from a noisy scene $\mathbf{x}_{t}$ and the conditions $C$. Finally, we can reconstruct a clean scene $\mathbf{x}_{0}$ as follows:

$$p_{\theta}(\mathbf{x}_{0}|C)=p(\mathbf{x}_{T})\prod_{t=1}^{T}p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t},C), \tag{3}$$

where $p_{\theta}(\mathbf{x}_{0}|C)$ denotes the probability of scene $\mathbf{x}_{0}$ conditioned on $C$ . Mathematically, we can maximize the conditional probability $p_{\theta}(\mathbf{x}_{0}|C)$ by training the denoising network $\bm{\epsilon}_{\theta}(\mathbf{x}_{t},t,C)$ with a simplified objective (Ho, Jain, and Abbeel 2020a):

$$\begin{aligned}\mathcal{L}_{sim}&=\mathbb{E}_{\bm{\epsilon},t,\mathbf{x}_{0}}\left[\left\|\bm{\epsilon}-\bm{\epsilon}_{\theta}(\mathbf{x}_{t},t,C)\right\|^{2}\right]\\&=\mathbb{E}_{\bm{\epsilon},t,\mathbf{x}_{0}}\left[\left\|\bm{\epsilon}-\bm{\epsilon}_{\theta}(\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\bm{\epsilon},t,C)\right\|^{2}\right].\end{aligned} \tag{4}$$
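The objective in Eq. 4 can be illustrated with a small, dependency-free sketch. It draws one Monte Carlo sample of the loss using the closed-form forward sample $\mathbf{x}_{t}=\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\bm{\epsilon}$ and the linear variance schedule reported in the Implementations paragraph. The function names and the `zero_model` stand-in for $\bm{\epsilon}_{\theta}$ are hypothetical; a real implementation would use a trained network and batched tensors.

```python
import math
import random

T = 100
# Linear schedule beta_1..beta_T from 1e-4 to 0.02 (values from the Implementations paragraph).
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alphas = [1.0 - b for b in betas]
alpha_bars, running = [], 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)  # alpha_bar_t = prod_{i <= t} alpha_i


def q_sample(x0, t, eps):
    """Forward sample x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps (t is 1-indexed)."""
    ab = alpha_bars[t - 1]
    return [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]


def simple_loss(eps_model, x0, cond, rng):
    """One Monte Carlo sample of || eps - eps_theta(x_t, t, C) ||^2."""
    t = rng.randint(1, T)                    # t drawn uniformly from {1, ..., T}
    eps = [rng.gauss(0.0, 1.0) for _ in x0]  # eps ~ N(0, I)
    x_t = q_sample(x0, t, eps)
    pred = eps_model(x_t, t, cond)
    return sum((e - p) ** 2 for e, p in zip(eps, pred))


def zero_model(x_t, t, cond):
    """Hypothetical stand-in for the denoising network: always predicts zero noise."""
    return [0.0] * len(x_t)


rng = random.Random(0)
loss = simple_loss(zero_model, x0=[0.5, -1.0, 2.0], cond=None, rng=rng)
```

Averaging `simple_loss` over many scenes, time steps, and noise draws approximates the expectation in Eq. 4.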

During scene generation, our diffusion model starts from Gaussian noise $\mathbf{x}_{T}$ and gradually predicts a cleaner scene with Eqs. 1 and 2. In each denoising step, the diffusion model generates all objects within the scene simultaneously, which inherently captures the relationships between all object attributes to enhance scene plausibility. After $T$ steps, we generate a plausible scene $\mathbf{x}_{0}$ that can afford various human interactions, such as touching, sitting, and lying.
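The sampling procedure described above can be sketched as a plain ancestral sampling loop built from Eqs. 1 and 2. This is a minimal illustration under stated assumptions: `zero_model` is a hypothetical placeholder for the trained $\bm{\epsilon}_{\theta}$, the scene is a flat list of floats, and no noise is added at the final step, which is the usual DDPM convention rather than something stated in the text.

```python
import math
import random

T = 100
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alphas = [1.0 - b for b in betas]
alpha_bars, running = [], 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)


def predicted_mean(x_t, t, eps_pred):
    """Eq. 2: mu = (x_t - (1 - alpha_t) / sqrt(1 - abar_t) * eps_theta) / sqrt(alpha_t)."""
    a, ab = alphas[t - 1], alpha_bars[t - 1]
    coef = (1.0 - a) / math.sqrt(1.0 - ab)
    return [(x - coef * e) / math.sqrt(a) for x, e in zip(x_t, eps_pred)]


def sample(eps_model, dim, cond, rng):
    """Start from x_T ~ N(0, I) and apply the reverse step of Eq. 1 for t = T, ..., 1."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in range(T, 0, -1):
        mu = predicted_mean(x, t, eps_model(x, t, cond))
        # Standard deviation sqrt(1 - alpha_t) from Eq. 1; assumed zero at the last step.
        sigma = math.sqrt(1.0 - alphas[t - 1]) if t > 1 else 0.0
        x = [m + sigma * rng.gauss(0.0, 1.0) for m in mu]
    return x


def zero_model(x_t, t, cond):
    """Hypothetical stand-in for eps_theta; a trained network would go here."""
    return [0.0] * len(x_t)


x0_hat = sample(zero_model, dim=4, cond=None, rng=random.Random(0))
```

Because every entry of `x` is updated in the same step, all object attributes are denoised jointly rather than one object at a time.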

Inference Guidance for Spatial Constraints

In human-aware scene generation, the generated scenes must avoid colliding with human motions and respect room boundary constraints. To enforce these requirements, we incorporate inference guidance into the generation process. Specifically, we introduce two spatial collision guidance functions to further improve the plausibility of 3D scenes generated by the diffusion model: 1) the motion collision avoidance function pushes objects back when they collide with human motions, and 2) the boundary constraint function repositions objects that are outside the floor plan.

Motion Collision Avoidance. Rather than directly calculating the mesh interpenetration between moving humans and objects, we estimate motion collision scores using the predicted bounding boxes of objects $\{\hat{o}_{i}\}^{N}_{i=1}$ in the scene and the free-space mask $\mathcal{FS}$ induced by human movements. Specifically, we project the free-space mask $\mathcal{FS}$ onto a bird's-eye view to produce a 2D human motion map, from which we calculate the 2D distance transform map $\mathcal{M}$. Using this representation, we measure the motion collision scores by:

$$\mathcal{J}_{m}=\sum_{i=1}^{N}\text{SDF}(\hat{o}_{i},\mathcal{M}), \tag{5}$$

where $\text{SDF}(\hat{o}_{i},\mathcal{M})$ queries the 2D distance value between each object and the human motion map. A positive distance indicates a collision between the object and human motions.

Boundary Constraint. Besides satisfying motion collision constraints, the generated objects must also meet boundary constraints, meaning all objects should be located within the given floor plan. To this end, we introduce an additional boundary constraint function to penalize objects that exceed the boundaries. Similar to motion collision avoidance, we extract the out-of-bounds region from the 2D projection of the floor plan $\mathcal{F}$ and calculate its 2D distance transform map $\mathcal{B}$. The distance that objects extend beyond the boundaries is then measured by calculating their 2D distance from $\mathcal{B}$:

$$\mathcal{J}_{b}=\sum_{i=1}^{N}\text{SDF}(\hat{o}_{i},\mathcal{B}). \tag{6}$$
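Since Eqs. 5 and 6 share the same form, both scores can be illustrated with one small routine applied to two different masks. The sketch below makes several assumptions that the text does not specify: the map stores, for each cell inside the forbidden region, the distance to the nearest cell outside it (so positive values mean penetration), each object is an axis-aligned footprint on the grid, and the per-object value is the largest map entry under that footprint. The names `distance_map` and `guidance_score` are hypothetical.

```python
import math


def distance_map(mask):
    """Distance from each cell inside the region (mask == 1) to the nearest cell
    outside it; cells outside get 0. Brute force, adequate for small grids."""
    h, w = len(mask), len(mask[0])
    outside = [(r, c) for r in range(h) for c in range(w) if mask[r][c] == 0]
    dist = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c] == 1:
                dist[r][c] = min(
                    (math.hypot(r - rr, c - cc) for rr, cc in outside),
                    default=float(h + w),
                )
    return dist


def guidance_score(boxes, dist):
    """Sum over objects of the largest map value under each footprint.
    A box is (row0, col0, row1, col1) in grid cells, end-exclusive."""
    return sum(
        max(dist[r][c] for r in range(r0, r1) for c in range(c0, c1))
        for r0, c0, r1, c1 in boxes
    )


N = 6
# The floor plan occupies the inner 4x4 cells; everything else is out of bounds.
out_of_bounds = [[0 if 1 <= r <= 4 and 1 <= c <= 4 else 1 for c in range(N)] for r in range(N)]
# The human walks along row 3 inside the room.
free_space = [[1 if r == 3 and 1 <= c <= 4 else 0 for c in range(N)] for r in range(N)]

boxes = [
    (1, 1, 3, 3),  # inside the room, off the walking path
    (3, 3, 5, 5),  # inside the room, blocks the walking path
    (0, 4, 2, 6),  # sticks out of the room corner
]
J_m = guidance_score(boxes, distance_map(free_space))     # motion collision score
J_b = guidance_score(boxes, distance_map(out_of_bounds))  # boundary violation score
```

Only the second box contributes to `J_m` and only the third to `J_b`. A gradient-based guidance step would additionally need a differentiable lookup (for example, interpolated sampling of the map at box corners), which this grid-based sketch does not provide.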

Overall Inference Guidance. To generate a more plausible scene that respects spatial constraints from human motions and room layout, we integrate motion collision avoidance $\mathcal{J}_{m}$ and boundary constraint $\mathcal{J}_{b}$ into the inference guidance of our diffusion model. Inspired by (Yang et al. 2024), we also employ an object collision avoidance term $\mathcal{J}_{o}$ to penalize collisions between objects. Finally, our overall inference guidance can be formulated as $\mathcal{J}=\mathcal{J}_{m}+\mathcal{J}_{b}+\mathcal{J}_{o}$, where $\mathcal{J}_{o}=\sum_{i,j,i\neq j}\mathbf{IoU}(o_{i},o_{j})$ measures the object collision score by calculating the 3D IoU of each object pair.
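The object collision term $\mathcal{J}_{o}$ can be sketched as a sum of pairwise 3D IoU values. As a simplification, the example below ignores object orientation and treats each box as axis-aligned, with the size interpreted as full extents; both are assumptions made for brevity, and the function names are hypothetical.

```python
from itertools import combinations


def aabb_iou_3d(a, b):
    """3D IoU of two axis-aligned boxes, each given as (center_xyz, size_xyz)."""
    (ca, sa), (cb, sb) = a, b
    inter = 1.0
    for k in range(3):
        lo = max(ca[k] - sa[k] / 2, cb[k] - sb[k] / 2)
        hi = min(ca[k] + sa[k] / 2, cb[k] + sb[k] / 2)
        inter *= max(0.0, hi - lo)  # zero overlap on any axis gives zero intersection
    vol_a = sa[0] * sa[1] * sa[2]
    vol_b = sb[0] * sb[1] * sb[2]
    return inter / (vol_a + vol_b - inter)


def object_collision(boxes):
    """J_o over ordered pairs (i, j) with i != j, so each unordered pair counts twice."""
    return 2.0 * sum(aabb_iou_3d(a, b) for a, b in combinations(boxes, 2))


unit = (1.0, 1.0, 1.0)
boxes = [
    ((0.0, 0.0, 0.0), unit),
    ((0.5, 0.0, 0.0), unit),  # overlaps the first cube by half its volume
    ((5.0, 0.0, 0.0), unit),  # far away from both
]
J_o = object_collision(boxes)
```

Here the only overlapping pair has intersection 0.5 and union 1.5, so its IoU is 1/3 and `J_o` is 2/3 once both orderings are counted.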

During the generative process, the gradient of $\mathcal{J}$ perturbs the predicted $\hat{\mathbf{x}}_{0}$ at each guided denoising step as $\tilde{\mathbf{x}}_{0}=\hat{\mathbf{x}}_{0}-\gamma\nabla\mathcal{J}(\hat{\mathbf{x}}_{0})$, where $\gamma$ controls the guidance strength. Then, the predicted mean $\mathbf{\mu}_{\theta}$ can be computed with the updated clean scene $\tilde{\mathbf{x}}_{0}$ as in (Ho et al. 2022; Rempe et al. 2023). This approach effectively pushes the generated scenes towards the desired spatial constraints, so that they avoid colliding with human motions and room boundaries and prevent object collisions.
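A minimal sketch of one guided step follows. It perturbs $\hat{\mathbf{x}}_{0}$ with the gradient of a cost and then recomputes the mean from the perturbed clean scene using the standard DDPM posterior mean of $q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})$ from Ho, Jain, and Abbeel (2020). Using that posterior formula here is one common choice and an assumption about the exact procedure; the finite-difference gradient and the toy one-dimensional cost are purely illustrative.

```python
import math


def numerical_grad(J, x, h=1e-5):
    """Central finite-difference gradient of a scalar function J at x."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((J(xp) - J(xm)) / (2 * h))
    return grad


def guide(x0_hat, J, gamma):
    """x0_tilde = x0_hat - gamma * grad J(x0_hat)."""
    g = numerical_grad(J, x0_hat)
    return [x - gamma * gi for x, gi in zip(x0_hat, g)]


def posterior_mean(x0, x_t, alpha_t, abar_t, abar_prev):
    """Mean of q(x_{t-1} | x_t, x_0) for a DDPM with beta_t = 1 - alpha_t."""
    beta_t = 1.0 - alpha_t
    c0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    ct = math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    return [c0 * a + ct * b for a, b in zip(x0, x_t)]


def toy_cost(x):
    """Toy cost: squared overshoot of a 1D object position beyond a wall at 1.0."""
    return max(0.0, x[0] - 1.0) ** 2


x0_tilde = guide([1.5], toy_cost, gamma=0.5)  # gradient is 1.0, so the object moves to the wall
mu = posterior_mean(x0_tilde, x_t=[1.4], alpha_t=0.98, abar_t=0.5, abar_prev=0.5 / 0.98)
```

In practice the gradient would come from automatic differentiation of $\mathcal{J}$, and the step would be applied only at the guided time steps.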

Dataset Calibration of 3D FRONT HUMAN

In this section, we introduce the calibration pipeline designed to tackle two major data challenges in the 3D FRONT HUMAN dataset (Yi et al. 2023). Our pipeline begins with translation modification to correct penetrations in human-object interactions, thereby improving spatial accuracy. Next, it employs category and orientation augmentation to increase the diversity of interactions, which enhances model generalization. These techniques are explained in greater detail below.

Translation Modification. To address incorrect penetrations in human-object interactions, we adjust the translation parameters of humans to avoid implausible contact with objects. Specifically, we modify these parameters within a range of $[-2\,\mathrm{m},2\,\mathrm{m}]$ centered on the contact object. We then evaluate the plausibility of the interactions using two indicators: the 3D IoU score (Zhou et al. 2019) between the bounding boxes of the human and object, and the human-scene interpenetration error $\mathbf{{E}_{pen}}$ (Shen et al. 2023):

$$\mathbf{{E}_{pen}}=\sum_{i=1}^{V}\mathds{1}_{x<0}\left[sdf\left(v_{i},\mathcal{S}\right)\right]\cdot\left|sdf\left(v_{i},\mathcal{S}\right)\right|, \tag{7}$$

where $V$ is the number of human mesh vertices, $sdf\left(v_{i},\mathcal{S}\right)$ is the signed distance from vertex $v_{i}$ to scene $\mathcal{S}$ , and $\mathds{1}_{x<0}[\cdot]$ is an indicator function that returns 1 when the condition is met, and 0 otherwise. The translation modification is considered complete when $\mathbf{{E}_{pen}}$ falls below the safe bound $\sigma_{1}$ and the 3D IoU is higher than $\sigma_{2}$ . Empirically, we set $\sigma_{1}=20$ , $\sigma_{2}=90\%$ for sitting and lying humans, and $\sigma_{2}=50\%$ for touching humans.
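The stopping rule for translation modification can be written down directly from Eq. 7 and the thresholds above. The sketch assumes the signed distances of the human vertices and the 3D IoU have already been computed elsewhere; the function names and the string labels for interaction types are hypothetical.

```python
def penetration_error(signed_distances):
    """E_pen: sum of |sdf| over vertices whose signed distance is negative."""
    return sum(-d for d in signed_distances if d < 0)


def translation_accepted(signed_distances, iou, interaction, sigma1=20.0):
    """Accept when E_pen < sigma1 and the 3D IoU exceeds sigma2.
    sigma2 is 0.9 for sitting and lying humans and 0.5 for touching humans."""
    sigma2 = 0.5 if interaction == "touching" else 0.9
    return penetration_error(signed_distances) < sigma1 and iou > sigma2


sdf_values = [0.3, -0.1, -0.25, 0.0]  # made-up signed distances for four vertices
e_pen = penetration_error(sdf_values)
ok_sitting = translation_accepted(sdf_values, iou=0.94, interaction="sitting")
rejected_sitting = translation_accepted(sdf_values, iou=0.75, interaction="sitting")
ok_touching = translation_accepted(sdf_values, iou=0.75, interaction="touching")
```

A calibration loop would try candidate translations within the search range and keep the first one for which `translation_accepted` returns `True`.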

Category Augmentation. After refining the implausible human-object interpenetration via translation modification, we perform category augmentation to enhance the diversity of human-object interactions. The motivation behind category augmentation is that people interact with objects in various ways in real human activities, such as lying on a bed or sitting on it. To capture these behaviors, we augment the dataset by randomly replacing half of the contact humans with alternate modes of interaction, thereby producing a new set of human-object interactions. For instance, we might transform a scene from “lying on the bed” to “sitting on the bed”, as shown in Fig. 3b.

Orientation Augmentation. Considering that people might interact with objects from various angles, we also introduce orientation augmentation to enrich the dataset. To achieve this, we first add random noise to the orientation parameters of contact humans. Next, we adjust the translation parameters of contact humans until their interactions with objects meet the same criteria used in the translation modification. This ensures that contact humans can interact with objects in a varied and realistic manner.

To verify the effectiveness of our calibration pipeline, we report $\mathbf{E_{pen}}$ and the 3D IoU score for 3D-FRONT HUMAN and our calibrated dataset in Tab. 1. The results show that the calibration pipeline effectively reduces human-scene interpenetration errors and improves human-object interactions of 3D-FRONT HUMAN across all room types. The qualitative comparisons presented in Figure 3 also demonstrate that our calibrated dataset provides more plausible and diverse human-object interactions.

Table 1: Quantitative comparisons between the 3D-FRONT HUMAN (Yi et al. 2023) and our calibrated dataset on human-scene interaction metrics.

Room Type	Dataset	$\mathbf{E_{pen}}$ $\downarrow$	3D IoU $\uparrow$
Bedroom	3D-FRONT HUMAN (Yi et al. 2023)	226.97	0.75
Bedroom	Our calibrated dataset	16.01	0.94
Living	3D-FRONT HUMAN (Yi et al. 2023)	125.49	0.77
Living	Our calibrated dataset	19.50	0.94
Dining	3D-FRONT HUMAN (Yi et al. 2023)	133.34	0.78
Dining	Our calibrated dataset	20.72	0.95

Experiments

Implementations. Following the default settings in (Ho, Jain, and Abbeel 2020b), the forward process variances are set to constants increasing linearly from $\beta_{1}=10^{-4}$ to $\beta_{T}=0.02$. Unlike previous studies (Tang et al. 2023; Yang et al. 2024), we reduce the number of diffusion steps from 1000 to $T=100$, thereby substantially speeding up the sampling process. During training, we sample the diffusion step $t$ from a uniform distribution at each iteration and use the Adam (Kingma and Ba 2014) optimizer with a learning rate of $1\times 10^{-4}$ and no weight decay. Moreover, we apply random global rotation augmentation between 0 and 360 degrees to the entire scene, including the floor plan, all objects, all contact humans, and the free space. Finally, we train SHADE on an NVIDIA Tesla A100 GPU for 625K iterations with a batch size of 128. During inference, we use the DDPM sampler (Ho, Jain, and Abbeel 2020b) to obtain the object properties and apply inference guidance in the last five steps.
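The global rotation augmentation can be sketched as follows. The example rotates object centres and headings by a single random angle about the vertical axis; it assumes that $y$ is the up axis and that headings are measured in the same rotational sense, neither of which is stated in the text. The helper name `rotate_scene` is hypothetical, and the same rotation would also have to be applied to the floor plan, contact boxes, and free-space mask.

```python
import math
import random


def rotate_scene(objects, angle):
    """Rotate object centres and headings by `angle` radians about the vertical axis.
    Each object is (location_xyz, size_xyz, heading)."""
    c, s = math.cos(angle), math.sin(angle)
    rotated = []
    for (x, y, z), size, heading in objects:
        # Assumption: y is up, so the rotation acts in the x-z plane.
        new_loc = (c * x + s * z, y, -s * x + c * z)
        new_heading = (heading + angle) % (2 * math.pi)
        rotated.append((new_loc, size, new_heading))  # sizes are unchanged
    return rotated


objects = [((1.0, 0.0, 0.0), (2.0, 0.5, 1.6), 0.0)]
quarter_turn = rotate_scene(objects, math.pi / 2)

# One random global angle in [0, 360) degrees per training sample.
angle = random.Random(0).uniform(0.0, 2 * math.pi)
augmented = rotate_scene(objects, angle)
```

Because a single angle is shared by the whole scene, relative placements between objects, humans, and walls are preserved.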

Network Architecture. As depicted in Fig. 2, we implement SHADE with a transformer-based architecture composed of 4 DiT blocks with adaLN-Zero (Peebles and Xie 2022). Each block has a latent dimension of 512, and each attention layer consists of 8 heads. As input, we extract the layout embedding from the 3D point clouds of the floor plan $\mathcal{F}$ and the free-space mask $\mathcal{FS}$ using a PointNet (Qi et al. 2017). Meanwhile, all objects and contact humans are encoded by a fully connected network (Tang et al. 2023), and the time step $t$ is encoded using a positional embedding (Tancik et al. 2020). Each object embedding is then treated as an input token and fed into the denoising network. To ensure the scene generation is aware of different contact humans and the room layout, we sum the embedding vectors of each contact human, the floor plan, and $t$ as the condition for each contact object, from which we regress the transformation parameters of each adaLN-Zero block. These parameters are then applied to normalize the embedding vector of the corresponding contact object. This strategy effectively guides the diffusion model to generate plausible contact objects for different contact humans. Note that the generation of non-contact objects is conditioned only on the sum of the embedding vectors of the layout and $t$. Additionally, the self-attention (Vaswani et al. 2017) layer in each adaLN-Zero block models the spatial relationships between all objects within the scene, thereby enhancing scene plausibility.
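The per-object conditioning can be illustrated with a toy adaLN-Zero modulation in plain Python. The sketch follows the general DiT recipe (shift, scale, and gate regressed from the condition by a zero-initialised linear layer) and is not the authors' implementation: the dimension, the embedding values, and the identity `sublayer` standing in for self-attention are all assumptions.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalise a token to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(W, b, x):
    """Dense layer y = W x + b with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]


def adaln_zero(token, cond, W, b, sublayer):
    """token + gate * sublayer(LN(token) * (1 + scale) + shift), where
    (shift, scale, gate) are regressed from the condition vector."""
    d = len(token)
    params = linear(W, b, cond)
    shift, scale, gate = params[:d], params[d:2 * d], params[2 * d:]
    h = [n * (1.0 + sc) + sh for n, sc, sh in zip(layer_norm(token), scale, shift)]
    out = sublayer(h)
    return [t + g * o for t, g, o in zip(token, gate, out)]


def add(*vecs):
    """Element-wise sum of embedding vectors."""
    return [sum(vals) for vals in zip(*vecs)]


d = 4
human_emb, layout_emb, t_emb = [0.1] * d, [0.2] * d, [0.3] * d
cond_contact = add(human_emb, layout_emb, t_emb)  # contact object: human + layout + t
cond_free = add(layout_emb, t_emb)                # non-contact object: layout + t

# Zero initialisation: the regression layer starts at zero, so the block is the identity.
W0 = [[0.0] * d for _ in range(3 * d)]
b0 = [0.0] * (3 * d)
token = [1.0, -2.0, 0.5, 3.0]
out = adaln_zero(token, cond_contact, W0, b0, sublayer=lambda h: h)
```

With zero-initialised regression weights the gate is zero and `out` equals `token`; as training moves the weights away from zero, contact and non-contact tokens receive different modulations because their condition vectors differ.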

Dataset. We conduct experiments on the calibrated 3D FRONT HUMAN dataset, which contains a total of 5,689 bedrooms, 2,987 living rooms, and 2,549 dining rooms. We use 21 object categories for the bedrooms, and 24 for the living rooms and dining rooms. Following (Yi et al. 2023), for each kind of room, we split the data into 80% for training, 10% for validation, and 10% for testing. We train and validate our model on the training and validation sets respectively, and evaluate it on the test set for each room type. During inference, we use the 3D-FUTURE dataset (Fu et al. 2021) for object retrieval, as suggested in (Paschalidou et al. 2021).

Table 2: Quantitative comparison on the test split of the calibrated 3D FRONT HUMAN dataset. We compare SHADE with MIME and DiffuScene on the human-scene interaction score 3D IoU, the scene plausibility metrics $\mathbf{Col_{mot}}$, $\mathbf{R_{out}}$, and $\mathbf{Col_{obj}}$, and the standard perceptual quality scores FID and CKL. The best scores are highlighted in bold, and the second best scores are underlined.

Room Type	Method	3D IoU $\uparrow$	$\mathbf{Col_{mot}}$ $\downarrow$	$\mathbf{R_{out}}$ $\downarrow$	$\mathbf{Col_{obj}}$ $\downarrow$	FID $\downarrow$	CKL $\downarrow$
Bedroom	MIME (Yi et al. 2023)	0.905	0.068	0.003	0.020	37.32	0.021
	DiffuScene (Tang et al. 2023)	0.550	0.169	0.005	0.049	42.83	0.026
	SHADE (Ours)	0.915	0.051	0.006	0.016	41.59	0.023
Living	MIME (Yi et al. 2023)	0.899	0.041	0.002	0.060	34.25	0.013
	DiffuScene (Tang et al. 2023)	0.301	0.184	0.003	0.117	38.15	0.031
	SHADE (Ours)	0.913	0.029	0.002	0.023	35.71	0.008
Dining	MIME (Yi et al. 2023)	0.924	0.031	0.002	0.067	34.61	0.015
	DiffuScene (Tang et al. 2023)	0.280	0.153	0.003	0.115	41.35	0.040
	SHADE (Ours)	0.903	0.023	0.003	0.050	39.99	0.012

Table 3: Quantitative comparison on the PROXD qualitative dataset (Hassan et al. 2019).

Method	3D IoU $\uparrow$	$\mathbf{Col_{mot}}$ $\downarrow$	$\mathbf{R_{out}}$ $\downarrow$	$\mathbf{Col_{obj}}$ $\downarrow$
MIME (Yi et al. 2023)	0.883	0.224	0.002	0.144
DiffuScene (Tang et al. 2023)	0.046	0.202	0.003	0.096
SHADE (Ours)	0.881	0.171	0.003	0.070

Baselines. We compare our method with MIME (Yi et al. 2023) and DiffuScene (Tang et al. 2023) using their official implementations. MIME is a transformer-based autoregressive method for human-aware scene generation, while DiffuScene is a diffusion-based method that learns 3D scene distributions without conditioning on the floor plan. For a fairer comparison, we adapt DiffuScene to condition on the 2D floor plan together with the free-space mask to enhance its perception of human motions. All methods are trained on our calibrated 3D FRONT HUMAN dataset.

Evaluation Metrics. Following (Yi et al. 2023), we evaluate all methods on the plausibility of human-object interaction and the realism of the generated scenes. As suggested in (Yi et al. 2023), we use the 3D IoU score to measure the plausibility of human-object interaction, by calculating the intersection ratio between input contact bounding boxes and generated objects. Additionally, we employ $\mathbf{Col_{mot}}$ (Yi et al. 2023) to evaluate collision between generated objects and free-space humans, and $\mathbf{Col_{obj}}$ (Yang et al. 2024) to represent the collision rate between objects in the generated scene. To measure the violation of the room layout, we compute the collision rate between generated objects and the areas outside the floor plan, denoted as $\mathbf{R_{out}}$ . Furthermore, following (Paschalidou et al. 2021), we calculate the Fréchet inception distance (FID) score and the category KL divergence (CKL) between generated scenes and real scenes to measure the realism of generated scenes. All these evaluation experiments are conducted on the test split of the calibrated 3D FRONT HUMAN dataset.
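As an illustration of the category KL divergence, the sketch below compares the empirical category frequencies of real and generated objects. The direction of the divergence (real relative to generated) and the clipping constant `eps` are assumptions; the text does not fix either, and the function name is hypothetical.

```python
import math
from collections import Counter


def category_kl(real_labels, generated_labels, eps=1e-8):
    """KL(p_real || p_generated) over object categories, with q clipped at eps."""
    real, gen = Counter(real_labels), Counter(generated_labels)
    kl = 0.0
    for c in sorted(set(real_labels) | set(generated_labels)):
        p = real[c] / len(real_labels)
        q = gen[c] / len(generated_labels)
        if p > 0:  # terms with p = 0 contribute nothing
            kl += p * math.log(p / max(q, eps))
    return kl


real = ["bed", "bed", "lamp", "chair"]
same = category_kl(real, ["bed", "bed", "lamp", "chair"])
shifted = category_kl(real, ["bed", "lamp", "lamp", "chair"])
```

Identical category frequencies give a divergence of zero, and the score grows as the generated scenes over- or under-produce particular categories.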

Table 4: Ablation study on different spatial collision guidance functions. The best scores are highlighted in bold, and the second best scores are underlined.

Motion Collision	Room Boundary	Object Collision	$\mathbf{Col_{mot}}$ ( $\downarrow$ )	$\mathbf{R_{out}}$ ( $\downarrow$ )	$\mathbf{Col_{obj}}$ ( $\downarrow$ )
✗	✗	✗	0.082	0.003	0.020
✓	✗	✗	0.034	0.008	0.023
✗	✓	✗	0.100	0.002	0.021
✗	✗	✓	0.084	0.004	0.008
✓	✓	✓	0.051	0.006	0.016

Human-aware Scene Synthesis

Figure 4 shows scenes generated by our method and the baselines for different room types. Since DiffuScene only considers floor plan and free-space information, it fails to generate appropriate objects to interact with contact humans. Additionally, DiffuScene generates objects that collide with free-space humans. Both MIME and our method can generate reasonable objects to support various contact humans, such as placing a bed under a sitting human or a sofa under a lying human. However, MIME still generates objects that overlap free-space humans or lie outside the floor plan (see the second column of Fig. 4). In contrast, our method generates more plausible scenes that avoid colliding with free-space humans and room boundaries and prevent object overlap.

These observations are validated by the quantitative comparisons of various evaluation metrics presented in Table 2. Our method outperforms MIME in 3D IoU scores for both the bedroom and living room and performs comparably in the dining room. Furthermore, our method significantly reduces motion collisions ( $\mathbf{Col_{mot}}$ ) and object collisions ( $\mathbf{Col_{obj}}$ ) across all room types when compared to the baselines. Although MIME achieves better FID scores, our method consistently shows better CKL scores, indicating better alignment with the categorical distribution of real data. Overall, these results demonstrate the effectiveness of our framework.
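To make the CKL comparison concrete, the following Python sketch computes a KL divergence between the object-category frequencies of real and generated scenes. The direction of the divergence (real relative to generated), the smoothing constant `eps`, and the function name are assumptions made for illustration; the exact protocol of (Paschalidou et al. 2021) may differ.

```python
import math
from collections import Counter


def category_kl(real_labels, gen_labels, eps=1e-6):
    """KL(P_real || P_gen) over object categories, with additive smoothing.

    `real_labels` and `gen_labels` are flat lists of category names collected
    over all objects in the real and generated scenes, respectively.
    """
    categories = sorted(set(real_labels) | set(gen_labels))
    real_counts, gen_counts = Counter(real_labels), Counter(gen_labels)
    # Smoothing keeps the ratio finite when a category is missing on one side.
    real_total = len(real_labels) + eps * len(categories)
    gen_total = len(gen_labels) + eps * len(categories)
    kl = 0.0
    for c in categories:
        p = (real_counts[c] + eps) / real_total
        q = (gen_counts[c] + eps) / gen_total
        kl += p * math.log(p / q)
    return kl


# Hypothetical labels: the generated set over-represents beds.
real = ["bed", "chair"]
generated = ["bed", "bed", "bed", "chair"]
ckl = category_kl(real, generated)
```

A lower value means the generated scenes use object categories in proportions closer to those of the real scenes, which is how the CKL column in Table 2 is read.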

Following MIME (Yi et al. 2023), we test SHADE on a real dataset of human motion to evaluate the generalization of our method. We use 5 input motions from the PROXD (Hassan et al. 2019) dataset and generate 10 scenes for each motion sequence. The quantitative results of our method and the baselines are reported in Table 3. Note that none of the methods is fine-tuned on the PROXD dataset. For human-object interaction plausibility, our method achieves a 3D IoU score of 0.881, outperforming DiffuScene and remaining comparable to MIME. In terms of scene plausibility, our method surpasses the other methods on the motion collision metric $\mathbf{Col_{mot}}$ and the object collision metric $\mathbf{Col_{obj}}$, indicating that our method can generate more realistic scenes with fewer human-object and object-object collisions. Figure 1 presents the qualitative visualization of all models’ generations.

Ablation Studies

Inference Guidance. We investigate the impact of each spatial collision guidance function on the bedroom and present the results in Tab. 4. Compared to scene generation without any guidance function, incorporating the motion collision avoidance function reduces the $\mathbf{Col_{mot}}$ metric from 0.082 to 0.034, demonstrating its effectiveness. Similar conclusions can be drawn for the other two spatial collision guidance functions: room boundary guidance lowers $\mathbf{R_{out}}$ from 0.003 to 0.002, and object collision guidance lowers $\mathbf{Col_{obj}}$ from 0.020 to 0.008. It is noteworthy that these collision-based guidance functions can negatively affect each other. For example, while the room boundary constraint function improves the $\mathbf{R_{out}}$ metric, it degrades the other metrics ($\mathbf{Col_{mot}}$ rises from 0.082 to 0.100). This is reasonable because room boundary guidance pushes objects into the floor plan, potentially leading to increased collisions with free-space humans ($\mathbf{Col_{mot}}$) and other objects ($\mathbf{Col_{obj}}$). To balance these effects, we integrate all spatial collision functions, thereby achieving better overall performance in scene plausibility. Fig. 5 provides a qualitative visualization of the effect of each spatial collision guidance function. The improvements shown in the second row of Fig. 5 confirm that our collision-based guidance functions can significantly enhance the plausibility of 3D scenes.
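The interplay between the guidance terms can be illustrated with a small Python sketch. It is not the guidance used by SHADE: it assumes a top-down 2D layout, a rectangular room, axis-aligned object boxes, and free-space humans modeled as points inflated by a hypothetical `human_radius`. Each term is a squared-penetration cost, and one guided update moves every object center along the negative weighted sum of the cost gradients; all names, weights, and the step size are illustrative.

```python
def push_apart_grad(center, reach, other):
    """Gradient w.r.t. `center` of (least-axis penetration depth)^2.

    `reach[k]` is the center distance along axis k below which the shapes overlap.
    """
    depth = [reach[k] - abs(center[k] - other[k]) for k in range(2)]
    if min(depth) <= 0.0:
        return [0.0, 0.0]  # separated along some axis: no penalty
    k = 0 if depth[0] <= depth[1] else 1  # axis of least penetration
    sign = 1.0 if center[k] >= other[k] else -1.0
    grad = [0.0, 0.0]
    grad[k] = -2.0 * depth[k] * sign  # depth shrinks as the centers move apart
    return grad


def boundary_grad(center, half, room_min, room_max):
    """Gradient of the squared protrusion of a box outside a rectangular room."""
    grad = [0.0, 0.0]
    for k in range(2):
        below = room_min[k] - (center[k] - half[k])  # > 0 if sticking out low side
        above = (center[k] + half[k]) - room_max[k]  # > 0 if sticking out high side
        if below > 0.0:
            grad[k] -= 2.0 * below
        if above > 0.0:
            grad[k] += 2.0 * above
    return grad


def guidance_step(centers, halves, humans, room_min, room_max,
                  w_motion=1.0, w_room=1.0, w_object=1.0,
                  human_radius=0.3, step=0.25):
    """One gradient step on the weighted sum of the three collision costs."""
    updated = []
    for i, (c, h) in enumerate(zip(centers, halves)):
        g = [w_room * v for v in boundary_grad(c, h, room_min, room_max)]
        for p in humans:  # motion collision term
            gm = push_apart_grad(c, [h[0] + human_radius, h[1] + human_radius], p)
            g = [g[k] + w_motion * gm[k] for k in range(2)]
        for j, (c2, h2) in enumerate(zip(centers, halves)):  # object collision term
            if j != i:
                go = push_apart_grad(c, [h[0] + h2[0], h[1] + h2[1]], c2)
                g = [g[k] + w_object * go[k] for k in range(2)]
        updated.append([c[k] - step * g[k] for k in range(2)])
    return updated


# Hypothetical 4 m x 4 m room: one object protrudes past the left wall,
# another overlaps a free-space human standing at (2.0, 2.0).
room_min, room_max = [0.0, 0.0], [4.0, 4.0]
wall_case = guidance_step([[0.2, 2.0]], [[0.5, 0.5]], [], room_min, room_max)
human_case = guidance_step([[2.2, 2.0]], [[0.5, 0.5]], [[2.0, 2.0]], room_min, room_max)
```

In this toy setting the boundary term moves the first object inward while the motion term moves the second object away from the human. When an object sits between a wall and a human, the two gradients point in opposite directions, which mirrors the trade-off between $\mathbf{R_{out}}$ and $\mathbf{Col_{mot}}$ reported in Tab. 4 and motivates weighting the terms jointly.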

Number of Input Humans. In Fig. 6, we further investigate the impact of input humans on scene generation by varying the number of free-space humans and contact humans provided as input. Our qualitative results indicate that as the density of free-space humans increases, our method generates fewer objects in the scenes. Conversely, when given more contact humans, our method produces more objects for them to interact with. These findings show that our method adapts the generated scene to the number and type of input humans.

Conclusions

We introduced SHADE, a spatially-constrained diffusion model for human-aware 3D scene synthesis. SHADE learns a scene diffusion model that simultaneously considers all input humans and the floor plan to generate plausible furniture layouts. During scene generation, SHADE integrates motion collision avoidance, room boundary constraints, and object collision avoidance as guidance to further enhance scene plausibility. Additionally, we devise an automated calibration pipeline to improve the spatial accuracy and diversity of human-object interactions in the existing human-aware 3D scene dataset, thereby enhancing the generation ability of SHADE. The quantitative and qualitative results show promising improvements on synthetic and real-world HSI datasets, demonstrating the effectiveness of our framework and calibration pipeline.

References

Fu et al. (2021) Fu, H.; Jia, R.; Gao, L.; Gong, M.; Zhao, B.; Maybank, S.; and Tao, D. 2021. 3d-future: 3d furniture shape with texture. International Journal of Computer Vision, 1–25.
Gao et al. (2023) Gao, L.; Sun, J.-M.; Mo, K.; Lai, Y.-K.; Guibas, L. J.; and Yang, J. 2023. SceneHGN: Hierarchical Graph Networks for 3D Indoor Scene Generation With Fine-Grained Geometry. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(7): 8902–8919.
Hassan et al. (2019) Hassan, M.; Choutas, V.; Tzionas, D.; and Black, M. J. 2019. Resolving 3D human pose ambiguities with 3D scene constraints. In Proceedings of the IEEE/CVF international conference on computer vision, 2282–2292.
Ho, Jain, and Abbeel (2020a) Ho, J.; Jain, A.; and Abbeel, P. 2020a. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33: 6840–6851.
Ho, Jain, and Abbeel (2020b) Ho, J.; Jain, A.; and Abbeel, P. 2020b. Denoising Diffusion Probabilistic Models. In Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.; and Lin, H., eds., Advances in Neural Information Processing Systems, volume 33, 6840–6851. Curran Associates, Inc.
Ho et al. (2022) Ho, J.; Salimans, T.; Gritsenko, A.; Chan, W.; Norouzi, M.; and Fleet, D. J. 2022. Video diffusion models. Advances in Neural Information Processing Systems, 35: 8633–8646.
Huang et al. (2023) Huang, S.; Wang, Z.; Li, P.; Jia, B.; Liu, T.; Zhu, Y.; Liang, W.; and Zhu, S.-C. 2023. Diffusion-based generation, optimization, and planning in 3d scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16750–16761.
Karunratanakul et al. (2023) Karunratanakul, K.; Preechakul, K.; Suwajanakorn, S.; and Tang, S. 2023. Guided motion diffusion for controllable human motion synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2151–2162.
Kawar et al. (2023) Kawar, B.; Zada, S.; Lang, O.; Tov, O.; Chang, H.; Dekel, T.; Mosseri, I.; and Irani, M. 2023. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6007–6017.
Kingma and Ba (2014) Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Li et al. (2019) Li, M.; Patil, A. G.; Xu, K.; Chaudhuri, S.; Khan, O.; Shamir, A.; Tu, C.; Chen, B.; Cohen-Or, D.; and Zhang, H. 2019. Grains: Generative recursive autoencoders for indoor scenes. ACM Transactions on Graphics (TOG), 38(2): 1–16.
Liu et al. (2023) Liu, J.; Xiong, W.; Jones, I.; Nie, Y.; Gupta, A.; and Oğuz, B. 2023. CLIP-Layout: Style-Consistent Indoor Scene Synthesis with Semantic Furniture Embedding. ArXiv, abs/2303.03565.
Luo et al. (2020) Luo, A.; Zhang, Z.; Wu, J.; and Tenenbaum, J. B. 2020. End-to-end optimization of scene layout. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3754–3763.
Müller et al. (2006) Müller, P.; Wonka, P.; Haegler, S.; Ulmer, A.; and Van Gool, L. 2006. Procedural modeling of buildings. In ACM SIGGRAPH 2006 Papers, 614–623.
Nie et al. (2022) Nie, Y.; Dai, A.; Han, X.; and Nießner, M. 2022. Pose2room: understanding 3d scenes from human activities. In European Conference on Computer Vision, 425–443. Springer.
Paschalidou et al. (2021) Paschalidou, D.; Kar, A.; Shugrina, M.; Kreis, K.; Geiger, A.; and Fidler, S. 2021. ATISS: Autoregressive Transformers for Indoor Scene Synthesis. In Advances in Neural Information Processing Systems (NeurIPS).
Patil et al. (2023) Patil, A. G.; Patil, S. G.; Li, M.; Fisher, M.; Savva, M.; and Zhang, H. 2023. Advances in Data-Driven Analysis and Synthesis of 3D Indoor Scenes. ArXiv, abs/2304.03188.
Peebles and Xie (2022) Peebles, W.; and Xie, S. 2022. Scalable Diffusion Models with Transformers. CoRR, abs/2212.09748.
Qi et al. (2017) Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2017. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 652–660.
Qi et al. (2018) Qi, S.; Zhu, Y.; Huang, S.; Jiang, C.; and Zhu, S.-C. 2018. Human-centric indoor scene synthesis using stochastic grammar. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5899–5908.
Rempe et al. (2023) Rempe, D.; Luo, Z.; Bin Peng, X.; Yuan, Y.; Kitani, K.; Kreis, K.; Fidler, S.; and Litany, O. 2023. Trace and pace: Controllable pedestrian animation via guided trajectory diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13756–13766.
Ritchie, Wang, and Lin (2019) Ritchie, D.; Wang, K.; and Lin, Y.-A. 2019. Fast and Flexible Indoor Scene Synthesis via Deep Convolutional Generative Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Saharia et al. (2022) Saharia, C.; Chan, W.; Chang, H.; Lee, C.; Ho, J.; Salimans, T.; Fleet, D.; and Norouzi, M. 2022. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 conference proceedings, 1–10.
Shen et al. (2023) Shen, Z.; Cen, Z.; Peng, S.; Shuai, Q.; Bao, H.; and Zhou, X. 2023. Learning human mesh recovery in 3D scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 17038–17047.
Tancik et al. (2020) Tancik, M.; Srinivasan, P.; Mildenhall, B.; Fridovich-Keil, S.; Raghavan, N.; Singhal, U.; Ramamoorthi, R.; Barron, J.; and Ng, R. 2020. Fourier features let networks learn high frequency functions in low dimensional domains. Advances in neural information processing systems, 33: 7537–7547.
Tang et al. (2023) Tang, J.; Nie, Y.; Markhasin, L.; Dai, A.; Thies, J.; and Nießner, M. 2023. DiffuScene: Scene Graph Denoising Diffusion Probabilistic Model for Generative Indoor Scene Synthesis. arXiv preprint.
Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. Advances in neural information processing systems, 30.
Wang et al. (2018) Wang, K.; Savva, M.; Chang, A. X.; and Ritchie, D. 2018. Deep convolutional priors for indoor scene synthesis. ACM Transactions on Graphics (TOG), 37(4): 1–14.
Wang, Yeshwanth, and Nießner (2020) Wang, X.; Yeshwanth, C.; and Nießner, M. 2020. SceneFormer: Indoor Scene Generation with Transformers. arXiv preprint arXiv:2012.09793.
Yang et al. (2024) Yang, Y.; Jia, B.; Zhi, P.; and Huang, S. 2024. PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI. arXiv preprint arXiv:2404.09465.
Ye et al. (2022) Ye, S.; Wang, Y.; Li, J.; Park, D.; Liu, C. K.; Xu, H.; and Wu, J. 2022. Scene Synthesis from Human Motion. In SIGGRAPH Asia 2022 Conference Papers, SA ’22. New York, NY, USA: Association for Computing Machinery. ISBN 9781450394703.
Yi et al. (2023) Yi, H.; Huang, C.-H. P.; Tripathi, S.; Hering, L.; Thies, J.; and Black, M. J. 2023. MIME: Human-Aware 3D Scene Generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 12965–12976.
Zhang, Rao, and Agrawala (2023) Zhang, L.; Rao, A.; and Agrawala, M. 2023. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3836–3847.
Zhang et al. (2019) Zhang, S.-H.; Zhang, S.-K.; Liang, Y.; and Hall, P. 2019. A survey of 3d indoor scene synthesis. Journal of Computer Science and Technology, 34: 594–608.
Zhang et al. (2020a) Zhang, S.-H.; Zhang, S.-K.; Xie, W.-Y.; Luo, C.-Y.; and Fu, H.-B. 2020a. Fast 3d indoor scene synthesis with discrete and exact layout pattern extraction. arXiv preprint arXiv:2002.00328.
Zhang et al. (2020b) Zhang, Z.; Yang, Z.; Ma, C.; Luo, L.; Huth, A.; Vouga, E.; and Huang, Q. 2020b. Deep generative modeling for scene synthesis via hybrid representations. ACM Transactions on Graphics (TOG), 39(2): 1–21.
Zhou et al. (2019) Zhou, D.; Fang, J.; Song, X.; Guan, C.; Yin, J.; Dai, Y.; and Yang, R. 2019. Iou loss for 2d/3d object detection. In 2019 international conference on 3D vision (3DV), 85–94. IEEE.
Zhou, While, and Kalogerakis (2019) Zhou, Y.; While, Z.; and Kalogerakis, E. 2019. SceneGraphNet: Neural Message Passing for 3D Indoor Scene Augmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).