License: CC BY-NC-ND 4.0
arXiv:2311.16918v2 [cs.CV] 24 Dec 2023

RichDreamer: A Generalizable Normal-Depth Diffusion Model for
Detail Richness in Text-to-3D

Lingteng Qiu^{1,3,*,†}  Guanying Chen^{2,1,*}  Xiaodong Gu^{3,*}
Qi Zuo^{3}  Mutian Xu^{1}  Yushuang Wu^{2,1,3}  Weihao Yuan^{3}  Zilong Dong^{3}
Liefeng Bo^{3}  Xiaoguang Han^{1,2,‡}
^{1}SSE, CUHKSZ  ^{2}FNii, CUHKSZ  ^{3}Alibaba Group
*Equal contribution. †Work done during internship at Alibaba. ‡Corresponding author: [email protected].
Abstract

Lifting 2D diffusion for 3D generation is a challenging problem due to the lack of a geometric prior and the complex entanglement of materials and lighting in natural images. Existing methods have shown promise by first creating the geometry through score-distillation sampling (SDS) applied to rendered surface normals, followed by appearance modeling. However, relying on a 2D RGB diffusion model to optimize surface normals is suboptimal due to the distribution discrepancy between natural images and normal maps, leading to instability in optimization. In this paper, recognizing that normal and depth information effectively describes scene geometry and can be automatically estimated from images, we propose to learn a generalizable Normal-Depth diffusion model for 3D generation. We achieve this by training on the large-scale LAION dataset together with generalizable image-to-depth and normal prior models. In an attempt to alleviate the mixed illumination effects in the generated materials, we introduce an albedo diffusion model to impose data-driven constraints on the albedo component. Our experiments show that when integrated into existing text-to-3D pipelines, our models significantly enhance the detail richness, achieving state-of-the-art results. Our project page is https://aigc3d.github.io/richdreamer/.

Figure 1: 3D Generation Results and Applications of RichDreamer. RichDreamer can generate highly detailed and diverse 3D content from free-form user prompts. Our method achieves this by first generating the object geometry based on a generalizable Normal-Depth diffusion model, followed by modeling the physically-based rendering (PBR) materials. Notably, the diverse crocodile-themed objects at the bottom highlight the generalization ability of our method. Abbreviated text prompts are shown beside the corresponding objects (full prompts can be found in the supplementary materials).

1 Introduction

Image generation models have witnessed notable advancements in controllable image synthesis [57, 62]. This remarkable progress can be attributed to the scalability of generative models [21, 69] and the use of large-scale training datasets consisting of billions of image-caption pairs scraped from the internet [64]. Conversely, due to the limited scale of the publicly available 3D datasets, existing 3D generative models are primarily evaluated for category-specific generation and face challenges when attempting to generate novel, unseen categories [15, 18, 94]. It remains an open problem to create a comprehensive 3D dataset to facilitate generalizable 3D generation.

Recent developments in the field of text-to-3D, such as DreamFusion [53], have demonstrated impressive capabilities in zero-shot generation. This is achieved by optimizing a neural radiance field [46] through score distillation sampling (SDS) [53, 77] using a 2D diffusion model [63]. Subsequently, several methods have been proposed to improve the quality of the generated objects [35, 79, 7, 45]. However, the approach of lifting from 2D to 3D presents two primary challenges. Firstly, 2D diffusion models tend to lack multi-view constraints, often leading to the emergence of multi-face objects, a phenomenon referred to as the “Janus problem” [53, 77]. To address this issue, recent advancements in multi-view-based diffusion models have shown success in mitigating these multi-face artifacts [68, 96].

Secondly, given the inherent coupling of surface geometry, texture, and lighting in natural images, the direct use of 2D diffusion models for the simultaneous inference of geometry and texture is considered suboptimal [7]. This suggests a two-stage decoupled approach: first, the generation of geometry, followed by the generation of texture. The recent Fantasia3D [7] method has shown promise in this decoupling strategy, yielding notably improved geometric reconstructions. However, Fantasia3D relies on 2D RGB diffusion models to optimize normal maps, leading to data distribution discrepancies that compromise the quality of geometric generation and introduce instability in optimization. This limitation underscores the pressing need for a robust prior model to provide an effective geometric foundation.

In this work, we aim to develop a robust 3D prior model to push forward the decoupled text-to-3D generation approach. Creating a model that offers 3D geometric priors typically requires access to 3D data for training supervision. However, amassing a large-scale dataset containing high-quality 3D models across diverse scenes is a challenging endeavor due to the time-consuming and costly process of 3D object scanning and model design [60, 84, 92, 11]. The limited scale of the publicly available 3D datasets presents a critical challenge: how to learn a generalizable 3D prior model with limited 3D data?

Recognizing that normal and depth information can effectively describe scene geometry and can be automatically estimated from images [58], we propose to learn a generalizable Normal-Depth diffusion model for 3D generation (see Fig. 1). This is achieved by training on the large-scale LAION dataset [64] to learn diverse distributions of normal and depth of real-world scenes with the help of generalizable image-to-depth and normal prior models [59, 3]. To improve the capability in modeling a more accurate and sharp distribution of normal and depth, we fine-tune the proposed diffusion model on the synthetic Objaverse dataset [11]. Our results demonstrate that by pre-training on a large-scale real-world dataset, the proposed Normal-Depth diffusion model can retain its generalization ability after fine-tuning on the synthetic dataset, indicating that our model learns a good distribution of diverse normal and depth in real-world scenes.

Given the inherent ambiguity in the decomposition of surface reflectance and illumination effects, textures generated by existing methods often retain shadows and specular highlights [61]. In an attempt to address this problem, we introduce an albedo diffusion model to provide data-driven constraints on the albedo component, enhancing the disentanglement of reflectance and illumination effects.

In summary, the key contributions of this paper are as follows:

  • We propose a novel Normal-Depth diffusion model to provide a strong geometric prior for high-fidelity text-to-3D geometry generation. By training on the extensive LAION dataset, our method exhibits remarkable generalization abilities.

  • We introduce an albedo diffusion model that acts as a data-driven regularization for albedo, resulting in a more accurate separation of reflectance and illumination effects.

  • Experiments demonstrate that integrating our models into existing text-to-3D pipelines yields state-of-the-art results in both geometry and appearance modeling.

Figure 2: Overview of the proposed RichDreamer. We introduce a generalizable Normal-Depth diffusion model that is trained on the LAION-2B dataset with normal and depth predicted by Midas [59], followed by fine-tuning on the synthetic dataset. Our model can be combined with the DMTet and NeRF representations to enhance geometry generation. To alleviate the ambiguity in appearance modeling, we propose an albedo diffusion model to impose a data-driven prior on the albedo component.

2 Related Work

3D Generative Model

The creation of high-quality 3D content is gaining increasing importance in various applications. Generative models directly learn the data distributions of the 3D data, enabling data sampling. Existing methods have yielded promising results by representing scenes using various 3D representations, including voxels [83, 20], point clouds [51, 93, 42], meshes [39, 15], and implicit fields [78, 10, 29, 1, 48, 28, 94, 18, 13, 6, 91]. However, these methods primarily demonstrate their generative capabilities within limited categories of objects due to the restricted scale of 3D training datasets. In contrast, our approach addresses the text-to-3D problem by extending 2D diffusion to the 3D domain, allowing for better generalization across diverse scenes specified by user prompts.

2D Diffusion for 3D Generation

Recent research has demonstrated the generation of 3D objects from user prompts, leveraging pre-trained models such as CLIP [55, 27, 86, 47] or 2D diffusion models [63, 62]. Notably, DreamFusion [53] achieves zero-shot text-to-3D generation by optimizing a neural radiance field (NeRF) [46] through score distillation sampling (SDS) with a 2D diffusion model. Concurrently, SJC [77] employs score Jacobian chaining for 3D generation.

Encouraged by these promising results, numerous works have focused on improving the quality of generated objects through approaches such as coarse-to-fine optimization [35, 9], decoupled generation [7], new score distillation [79], improved optimization strategies [82, 8, 45, 74, 25, 2, 22, 87, 65], and incorporating parametric shape models [26, 19, 34, 95]. As generating a 3D model typically involves hours of optimization, some methods explore efficient 3D representations (e.g., hashgrid [49] and 3D Gaussian splatting [30]) or improved training strategies to accelerate the optimization [89, 72, 41, 17]. Alongside the rapid development of the text-to-3D field, several methods have also adopted 2D diffusion models for the problem of 3D generation from image conditions [98, 16, 12, 44, 85, 73, 54, 36, 56, 1, 97]. However, as 2D diffusion models tend to lack multi-view constraints, these methods often suffer from multi-view inconsistency, resulting in less desirable 3D generation outcomes.

Geometry Prior for Diffusion Models

To enhance diffusion models for generative novel-view synthesis [37, 80, 5], Zero-1-to-3 [37] fine-tunes the Stable Diffusion model [62] to synthesize novel views conditioned on relative poses. Recent approaches have significantly improved the consistency of the generated multi-view images by performing multi-view diffusion [68, 38, 88, 81, 67, 75, 96, 71]. For example, MVDream [68] fine-tunes a pre-trained diffusion model on the synthetic Objaverse dataset [11] to simultaneously generate a set of 4-view images of the same object, conditioned on the camera poses. While these methods effectively address the multi-view inconsistency problem, they perform diffusion in RGB image space, making them less suitable for the decoupled generation approach where the geometry is generated before appearance.

Some methods incorporate more explicit geometric constraints into diffusion models. LDM3D [70] introduces an RGB-D diffusion model on the LAION-400M dataset. However, this model is not validated for text-to-3D generation. Concurrent to our work, SweetDreamer [33] proposes to align the geometric prior in 2D diffusion using a canonical coordinate map (CCM) representation. However, CCM implicitly requires the training objects to be aligned and can only be obtained from synthetic 3D datasets, potentially limiting its generalization and scalability. Wonder3D [40] introduces an RGB-Normal diffusion model on the Objaverse dataset. HumanNorm [24] proposes two disjoint diffusion models, one for normal and the other for depth, on a 3D human dataset of 2952 body models. In contrast, our model jointly learns the distribution of normal and depth, and is pre-trained on the extensive LAION-2B dataset to improve the generalization ability. In addition, we introduce an albedo diffusion model to better model appearance.

3 Method

In this section, we introduce a Normal-Depth diffusion model and an albedo diffusion model to push forward the decoupled 3D generation pipeline, where the geometry is first generated followed by the appearance modeling (see Fig. 2).

Overview

The existing approach for decoupled generation [7] adopts a text-to-image diffusion model (i.e., Stable-Diffusion [62]) to optimize the rendered surface normals of the object for geometry generation. However, this direct application is suboptimal due to the discrepancy in data distribution between natural images and normal maps, often leading to unstable optimization and compromised geometric fidelity [7]. As a result, appropriate geometry initialization (e.g., 3D ellipsoids with different shapes and orientations, or coarse 3D models) is often needed to achieve good results for different prompts. In response, we propose to learn a 3D-aware diffusion model tailored for 3D geometry generation. Specifically, we introduce a Normal-Depth diffusion model pre-trained on an extensive real-world dataset and further fine-tuned on a synthetic dataset, to offer consistent guidance for geometry generation.

Another critical issue in text-to-3D generation is the inaccurate appearance modeling, where materials intermingle with the lighting effects like shadows and specular highlights, often resulting in imprecise relighting. In an attempt to address this, our method integrates an albedo diffusion model to regularize the albedo component of the materials, effectively separating the albedo from the influence of lighting artifacts.

3.1 Normal-Depth Diffusion Model

To endow the diffusion model with 3D geometric priors for 3D generation tasks, we introduce a novel Normal-Depth diffusion model. Different from existing methods that learn either a normal or a depth diffusion model [24, 70], our model captures the joint distribution of normal and depth, leveraging their intrinsic complementary nature: depth describes the macrostructure of the scene, while normals provide local surface details.

Model Architecture

We adapted the architecture of the publicly available text-to-image diffusion model, Stable Diffusion (SD) [62], with minor modifications. SD incorporates a variational auto-encoder (VAE) with KL-regularization [31], and a latent diffusion model (LDM). The VAE maps an image of size $512 \times 512$ to and from a latent space of size $64 \times 64$, and the LDM is a UNet denoiser that learns to denoise the latent feature guided by the text prompt.

For our purpose, we extended the input and output of SD’s VAE from three to four channels, three for normals and one for depth, keeping other components unchanged.
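The four-channel layout can be illustrated with a short NumPy sketch. It only shows how a normal map and a depth map could be stacked into a single VAE input and split again at the output; the function names and the min-max normalization of depth to $[-1, 1]$ are assumptions made for illustration, not details specified in this paper.

```python
import numpy as np


def pack_normal_depth(normal, depth):
    """Stack a normal map (H, W, 3) and a depth map (H, W) into (H, W, 4)."""
    normal = np.asarray(normal, dtype=np.float32)
    depth = np.asarray(depth, dtype=np.float32)
    if normal.ndim != 3 or normal.shape[2] != 3 or normal.shape[:2] != depth.shape:
        raise ValueError("expected normal of shape (H, W, 3) and depth of shape (H, W)")
    # Assumption: depth is min-max normalised to [-1, 1] so that it shares the
    # value range of the unit-normal components.
    d_min, d_max = depth.min(), depth.max()
    scale = max(d_max - d_min, 1e-8)
    depth_norm = 2.0 * (depth - d_min) / scale - 1.0
    return np.concatenate([normal, depth_norm[..., None]], axis=-1)


def unpack_normal_depth(x):
    """Split a four-channel map into re-normalised normals and depth."""
    normal = x[..., :3]
    length = np.linalg.norm(normal, axis=-1, keepdims=True)
    normal = normal / np.maximum(length, 1e-8)  # project back to unit length
    return normal, x[..., 3]
```

Channels 0 to 2 carry the normal and channel 3 carries the depth, so the rest of the auto-encoder can stay unchanged apart from its first and last layers.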

Pre-training on Real-world Data

The LAION-2B dataset, comprising billions of correlated image and text pairs, served as our foundational training resource [64]. We prepared our training set with text prompts paired with corresponding normal and depth maps, utilizing NormalBae [3] and Midas-3.1 [59], which are leading methods for monocular normal and depth estimation.

We first trained the Normal-Depth VAE to learn the joint distribution of normal and depth with the MSE reconstruction loss, adversarial loss, and the KL-regularization loss [62, 14]. We then trained the LDM to enable text to Normal-Depth generation. Denoting $x$ as the normal and depth data, $\mathcal{E}$ the encoder, $z$ the latent feature, and $\epsilon_{\theta}$ the UNet denoiser, the objective for training LDM can be written as:

$$\mathcal{L}_{\text{LDM}} = \mathbb{E}_{z \sim \mathcal{E}(x),\, y,\, t,\, \epsilon \sim \mathcal{N}(0,1)}\left[\|\epsilon_{\theta}(z_{t}, y, t) - \epsilon\|_{2}^{2}\right], \tag{1}$$

where $z_{t}$ is the noised latent variable at a specific denoising timestep $t$, and $y$ the text embedding obtained from the CLIP model [55]. Our results show that the pre-training on the real-world dataset is crucial to maintain the generalization ability on diverse prompts.
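To make the objective in Eq. (1) concrete, the following NumPy sketch computes a single-sample estimate of the loss. It is a minimal illustration under stated assumptions: the linear noise schedule, the helper names, and the `denoiser` callable standing in for the UNet are hypothetical and are not taken from this paper.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(z, eps, t, alpha_bar):
    """Forward diffusion: z_t = sqrt(a_bar_t) * z + sqrt(1 - a_bar_t) * eps."""
    a = alpha_bar[t]
    return np.sqrt(a) * z + np.sqrt(1.0 - a) * eps


def ldm_loss(denoiser, z, y, t, eps, alpha_bar):
    """Single-sample estimate of Eq. (1): ||eps_theta(z_t, y, t) - eps||_2^2."""
    z_t = add_noise(z, eps, t, alpha_bar)
    residual = denoiser(z_t, y, t) - eps
    return float(np.sum(residual ** 2))
```

In training, the expectation in Eq. (1) would be approximated by averaging this quantity over a batch of latents, prompts, timesteps, and noise samples.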

Fine-tuning on Synthetic 3D Data

To enhance object-level 3D generation, we fine-tuned our Normal-Depth LDM on the Objaverse dataset [11], which features ground-truth 3D models. We rendered the ground-truth normal and depth maps from the provided objects. In the fine-tuning stage, we employed the four-view diffusion technique proposed by MVDream [68]. The camera poses are mapped by a simple multilayer perceptron (MLP) to camera embeddings, which are added to the time embedding so that the diffusion model can access them. The training objective in the fine-tuning stage is:

$$\mathcal{L}_{\text{LDM}}^{\prime} = \mathbb{E}_{z,\, y,\, c,\, t,\, \epsilon \sim \mathcal{N}(0,1)}\left[\|\epsilon_{\theta}(z_{t}, y, c, t) - \epsilon\|_{2}^{2}\right], \tag{2}$$

where $c$ is the camera condition.
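The camera conditioning can be sketched as follows. This is an illustrative example only: representing each pose as a flattened $4 \times 4$ matrix, the two-layer MLP with a SiLU activation, and the embedding widths are all assumptions, since the text only states that an MLP maps camera poses to embeddings that are added to the time embedding.

```python
import numpy as np


def init_mlp(rng, in_dim, hidden_dim, out_dim):
    """Randomly initialised two-layer MLP parameters (hypothetical sizes)."""
    return {
        "w1": rng.standard_normal((in_dim, hidden_dim)) * 0.02,
        "b1": np.zeros(hidden_dim),
        "w2": rng.standard_normal((hidden_dim, out_dim)) * 0.02,
        "b2": np.zeros(out_dim),
    }


def camera_embedding(params, camera_pose):
    """Map flattened camera poses of shape (N, 16) to embeddings of shape (N, D)."""
    h = camera_pose @ params["w1"] + params["b1"]
    h = h * (1.0 / (1.0 + np.exp(-h)))  # SiLU activation (assumed)
    return h @ params["w2"] + params["b2"]


def condition_embedding(time_emb, cam_emb):
    """Add the camera embedding to the time embedding, one row per view."""
    return time_emb + cam_emb
```

With four views per object, `camera_pose` would have shape `(4, 16)` and each view receives its own conditioning vector while sharing the same text embedding $y$.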

Implementation Details

For training on the LAION dataset, we follow the data filtering strategy used in the training of SD v2.1 to ensure high-quality data selection. The image is resized to $384 \times 384$ as input to Midas for estimating normal and depth. The computational expense for training the Normal-Depth VAE and LDM amounted to 1,344 and 11,520 GPU hours, respectively, on A100-80G GPUs. More details can be found in the supplementary materials.

For fine-tuning on the Objaverse dataset, for each 3D object, we established camera positions within a radial distance of 1.4 to 2.0 units and an elevation angle spanning 5 to 30 degrees. We rendered 24 views per object, distributed uniformly across azimuth angles. To enhance the dataset quality, we discarded objects whose rendered images scored low in relevance to the object names as determined by CLIP scores, resulting in a pool of 270,000 training objects. The text prompts used in training were obtained by a hybrid approach: 30% stemmed from object tags and names, and the remaining 70% were from Cap3D [43]. The fine-tuning utilized a batch size of 512 with gradient accumulation performed every 8 batches. This was conducted on 8 GPUs for one week, reaching a total of 20,000 iterations.
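The camera setup for rendering can be sketched in a few lines of NumPy. The sketch assumes that one radius and one elevation are drawn uniformly at random per object and shared by all 24 views, that the object sits at the origin, and that the z-axis points up; these choices and the function name are illustrative assumptions, as the text only gives the ranges and the uniform azimuth spacing.

```python
import numpy as np


def sample_camera_positions(rng, num_views=24,
                            radius_range=(1.4, 2.0),
                            elevation_range_deg=(5.0, 30.0)):
    """Return (num_views, 3) camera positions on a ring around the origin."""
    radius = rng.uniform(*radius_range)
    elevation = np.deg2rad(rng.uniform(*elevation_range_deg))
    # Azimuth angles are spaced uniformly over the full circle.
    azimuth = np.deg2rad(np.arange(num_views) * 360.0 / num_views)
    x = radius * np.cos(elevation) * np.cos(azimuth)
    y = radius * np.cos(elevation) * np.sin(azimuth)
    z = np.full(num_views, radius * np.sin(elevation))
    return np.stack([x, y, z], axis=-1)
```

With 24 views, consecutive cameras are 15 degrees apart in azimuth, so a four-view training sample could, for example, be formed from views that are 90 degrees apart.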

3.2 Geometry Generation

Score Distillation Sampling (SDS)

Existing 2D lifting approaches for text-to-3D typically employ either a NeRF representation [46, 53] or the hybrid DMTet representation [66, 35] to represent the 3D content. Denoting $\phi$ as the parameters of a 3D representation and $g$ as the differentiable rendering function, the rendered image can be expressed as $x = g(\phi)$. DreamFusion [53] introduces a Score Distillation Sampling (SDS) process that leverages gradient-based score functions to guide the optimization of parameters in 3D representation for object generation:

$$\nabla_{\phi}\mathcal{L}_{\text{SDS}}(\phi, x = g(\phi)) = \mathbb{E}_{t,\epsilon}\left[w(t)\left(\epsilon_{\theta}(z_{t}; y, t) - \epsilon\right)\frac{\partial x}{\partial \phi}\right], \tag{3}$$

where the expression $(\epsilon_{\theta}(z_{t}; y, t) - \epsilon)$ represents the difference between the noise estimated by the UNet $\epsilon_{\theta}$ and the actual noise $\epsilon$. The term $w(t)$ is a weighting term that depends on the timestep $t$, and $y$ indicates the text embedding.
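The gradient in Eq. (3) can be illustrated with a toy NumPy sketch. It is a simplified example under explicit assumptions: the renderer and its Jacobian are supplied as callables, the rendered image is diffused directly without a VAE encoder, and the `denoiser` callable is a stand-in for the UNet; none of these helpers come from this paper.

```python
import numpy as np


def sds_gradient(phi, render, render_jacobian, denoiser, y, t, eps,
                 alpha_bar_t, weight):
    """Single-sample estimate of the SDS gradient in Eq. (3)."""
    x = render(phi)  # x = g(phi), flattened to shape (dim_x,)
    # Assumption: x is noised directly; a latent model would encode it first.
    z_t = np.sqrt(alpha_bar_t) * x + np.sqrt(1.0 - alpha_bar_t) * eps
    residual = denoiser(z_t, y, t) - eps  # eps_theta(z_t; y, t) - eps
    jac = render_jacobian(phi)  # dx/dphi with shape (dim_x, dim_phi)
    # Chain rule through the renderer only; the denoiser is not differentiated.
    return weight * (jac.T @ residual)
```

For a linear toy renderer $x = A\phi$, the Jacobian is simply $A$, and the gradient vanishes whenever the denoiser predicts the injected noise exactly, which is the fixed point that SDS drives the 3D parameters towards.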

Fantasia3D [7] shows that the SDS loss derived from the image diffusion model (e.g., Stable Diffusion [62]) can be applied to the rendered normal maps from a DMTet for geometry generation.

Normal-Depth Diffusion for 3D Generation

Compared to the image diffusion model, our Normal-Depth diffusion model is specifically designed for modeling the joint distribution of normal and depth maps. It provides effective supervision for geometry optimization.

We utilize a DMTet representation and integrate our Normal-Depth diffusion model into the coarse-to-fine geometry generation pipeline of Fantasia3D [7]. The normal and depth of the object can be efficiently rendered using differentiable rasterization [32]. The geometry generation loss function is defined as:

$$\mathcal{L}_{\text{Geo}} = \lambda_{\text{SD}}\,\mathcal{L}_{\text{SDS-Normal}}^{\text{SD}} + \lambda_{\text{ND}}\,\mathcal{L}_{\text{SDS-ND}}^{\text{ND}}, \tag{4}$$

where the first loss term is employed in Fantasia3D [7] to enforce SDS on the rendered normal maps using Stable Diffusion. The second loss term is enabled by our Normal-Depth diffusion model to impose SDS on the composite of the rendered normal and depth maps. By default, we initialize the DMTet as a Sphere.

Integration with NeRF

Since normal and depth maps can be derived from the NeRF representation using volume rendering, our Normal-Depth diffusion model can also be utilized to optimize NeRF using the loss defined in Eq. (4). Given that the normal and depth maps derived from NeRF can be noisy at the start of the optimization, we impose SDS loss on the rendered RGB images with SD during the first 1,000 iterations to warm up the optimization.
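The warm-up schedule can be written as a small Python sketch. It assumes that only the RGB SDS term is active during warm-up and that the geometry loss takes over afterwards; the text does not state whether the Normal-Depth term is also applied during the first 1,000 iterations, so this switch and the function name are illustrative assumptions.

```python
def nerf_geometry_loss(step, sds_rgb_sd, sds_normal_sd, sds_nd,
                       lambda_sd=1.0, lambda_nd=1.0, warmup_steps=1000):
    """Combine per-step SDS loss values for NeRF optimization."""
    if step < warmup_steps:
        # Warm-up: SDS on rendered RGB images with Stable Diffusion only (assumed).
        return lambda_sd * sds_rgb_sd
    # Afterwards: weighted sum of the normal SDS term and the Normal-Depth SDS term.
    return lambda_sd * sds_normal_sd + lambda_nd * sds_nd
```

The three loss arguments are placeholders for scalar values computed elsewhere in the training loop.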

Recognizing that the NeRF representation is more flexible in modeling complex structures during optimization, we also investigate the idea of converting the optimized NeRF to DMTet as the initialization for geometry refinement.

Optimization

For geometry generation, we accelerate the SDF function in DMTet with an efficient hash encoding [49]. The loss weights $\lambda_{\text{SD}}$ and $\lambda_{\text{ND}}$ are both set to 1. The optimization process takes approximately 1.5 hours on a single GPU with 30 GB of memory.

3.3 Appearance Modeling

Physically-based Rendering

For DMTet representation, in line with prior studies [7, 50], we employ the Physically Based Rendering (PBR) Disney material model [4], which integrates a diffuse term with a specular GGX lobe [76]. The material property of a surface point is determined by the diffuse color $k_{d} \in \mathbb{R}^{3}$, roughness $k_{r} \in \mathbb{R}$, metallic term $k_{m} \in \mathbb{R}$, and normal variation in tangent space $k_{n} \in \mathbb{R}^{3}$. We parameterize the spatially-varying materials of the surface by a learnable MLP $f_{\psi}$ with parameters $\psi$ to predict material parameters for input 3D point $p$ as:

$$(k_{d}, k_{r}, k_{m}, k_{n}) = f_{\psi}(p). \tag{5}$$

By specifying the environment lighting and the camera viewpoint, the image color can be computed using a differentiable renderer based on the surface geometry and materials [50].
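The material parameterization in Eq. (5) can be sketched as a small MLP with an eight-channel output that is split into the four material terms. The layer sizes, the ReLU and sigmoid activations, the output ranges, and the absence of any positional or hash encoding are assumptions chosen for illustration and are not specified in this paper.

```python
import numpy as np


def init_material_mlp(rng, hidden_dim=32):
    """Randomly initialised parameters psi of a two-layer MLP (hypothetical sizes)."""
    return {
        "w1": rng.standard_normal((3, hidden_dim)) * 0.5,
        "b1": np.zeros(hidden_dim),
        "w2": rng.standard_normal((hidden_dim, 8)) * 0.5,
        "b2": np.zeros(8),
    }


def material_mlp(params, p):
    """Evaluate f_psi at surface points p of shape (N, 3)."""
    h = np.maximum(p @ params["w1"] + params["b1"], 0.0)  # ReLU hidden layer
    out = h @ params["w2"] + params["b2"]
    sig = 1.0 / (1.0 + np.exp(-out))  # squash all outputs to (0, 1)
    k_d = sig[:, 0:3]              # diffuse color
    k_r = sig[:, 3:4]              # roughness
    k_m = sig[:, 4:5]              # metallic term
    k_n = 2.0 * sig[:, 5:8] - 1.0  # tangent-space normal variation in (-1, 1), assumed
    return k_d, k_r, k_m, k_n
```

Because every output is a smooth function of $\psi$, gradients from a rendering loss can flow back to the material parameters through the differentiable renderer.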

The existing method optimizes materials by imposing SDS loss on the final rendered RGB images [7]. However, this approach may lead to inaccuracies in material decomposition due to the inherent challenges in disentangling material components based solely on color.

To regularize the material generation, an ideal prior model should effectively regularize both the diffuse and specular components. However, due to the varied creation methods of existing 3D models and the lack of standardization [11], it is challenging to acquire a comprehensive dataset with consistent and accurate ground truth for the specular component. In light of this difficulty, we introduce an albedo diffusion model to decouple the albedo from complex lighting effects, serving as a preliminary approach to mitigate the challenge of mixed illumination.

Depth-Conditioned Albedo Generation

A direct solution for the albedo diffusion model involves fine-tuning a text-to-image diffusion model using paired data of text prompts and albedo maps. While this method is effective for sampling, it falls short for 3D generation due to potential misalignments between the generated albedo and the geometry. To ensure the alignment of generated albedo maps with geometry, we employ the depth map from the corresponding viewpoint as a condition within the albedo latent diffusion model (LDM). Specifically, we concatenate the depth map with the latent features to serve as input for the UNet denoiser [57]. We also employed the four-view diffusion strategy proposed by MVDream [68] for the albedo diffusion model. We fine-tune SD 2.1 on the Objaverse dataset [11] to capture the albedo distribution with the following training objective:

$$\mathcal{L}_{\text{Albedo}} = \mathbb{E}_{z^{a},\, y,\, c,\, t,\, \epsilon \sim \mathcal{N}(0,1)}\left[\|\epsilon_{\theta_{a}}(z^{a}_{t}, y, c, d, t) - \epsilon\|_{2}^{2}\right], \tag{6}$$

where $z^{a}$ represents the latent feature of the albedo map, and $d$ is the depth condition.
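The depth conditioning can be illustrated by a NumPy sketch that resizes a depth map to the latent resolution and concatenates it with the noised albedo latent along the channel axis. The average pooling, the downsampling factor of 8 (matching the $512 \to 64$ reduction described in Sec. 3.1), the four-channel latent, and the function names are assumptions for illustration only.

```python
import numpy as np


def downsample_depth(depth, factor):
    """Average-pool a depth map of shape (H, W) by an integer factor."""
    h, w = depth.shape
    if h % factor or w % factor:
        raise ValueError("depth size must be divisible by the pooling factor")
    return depth.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


def depth_conditioned_input(z_t, depth, factor=8):
    """Concatenate a noised latent (C, h, w) with depth resized to (1, h, w)."""
    d = downsample_depth(depth, factor)
    if d.shape != z_t.shape[1:]:
        raise ValueError("resized depth does not match the latent resolution")
    return np.concatenate([z_t, d[None]], axis=0)
```

Under these assumptions, a latent of shape `(4, 64, 64)` and a depth map of shape `(512, 512)` give a five-channel UNet input, so only the first convolution of the denoiser needs to accept the extra channel.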


Figure 3: Visual comparison between our method and existing methods.

Loss Function

The loss function for appearance modeling is expressed as:

$$\mathcal{L}_{\text{App}} = \lambda_{\text{SD}}\,\mathcal{L}_{\text{SDS-RGB}}^{\text{SD}} + \lambda_{\text{Albedo}}\,\mathcal{L}_{\text{SDS-Albedo}}^{\text{Albedo}}, \tag{7}$$

where the first term reflects the SDS on the rendered RGB images using SD, and the latter term is the SDS imposed on the albedo component by our albedo diffusion model.

Optimization

For appearance modeling, the loss weights $\lambda_{\text{SD}}$ and $\lambda_{\text{Albedo}}$ in Eq. (7) are also set to 1. The optimization takes around 20 minutes on a GPU. After optimization, the material properties can be sampled at surface points and compiled into a 2D texture map [7, 50, 90], which can be directly imported into existing graphics engines for applications.

Table 1: Quantitative comparison with existing methods. The geometry CLIP score is measured on the shading images rendered with uniform albedo, and the appearance CLIP score is measured on the images rendered with textured models (higher is better).

| | DreamFusion-IF | Magic3D-IF | TextMesh-IF | ProlificDreamer | Fantasia3D (Sphere) | MVDream | Ours (NeRF) | Ours (Sphere) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Geometry CLIP Score | 17.4548 | 20.1157 | 18.2222 | 23.3818 | 17.5398 | 24.8003 | 26.0570 | 25.8820 |
| Appearance CLIP Score | 24.1091 | 27.8231 | 25.1218 | 31.8022 | 26.4055 | 28.7331 | 31.3551 | 31.7099 |

4 Experiments

In this section, we thoroughly evaluate the effectiveness of our proposed text-to-3D method by conducting a comprehensive comparison with state-of-the-art approaches.

Model Variants

As discussed in Section 3.2, our Normal-Depth diffusion model can be applied to optimize DMTet and NeRF. To better verify the effectiveness of our Normal-Depth Diffusion model, we have designed two model variants for text-to-3D. The first model, denoted as Ours (Sphere), initializes the DMTet with a Sphere for geometry generation. The second model, denoted as Ours (NeRF), first optimizes a NeRF with our Normal-Depth diffusion model and then converts the NeRF to DMTet as an initialization for geometry generation.

Baselines

We conducted extensive comparisons with a variety of baseline methods, including both DMTet-based and NeRF-based methods. For the DMTet-based methods, we compared our approach against the state-of-the-art Fantasia3D method [7], utilizing its publicly available official code with DMTet initialized as a Sphere. In the case of NeRF-based methods, we evaluated our approach against multiple competitors, including DreamFusion-IF [53], Magic3D-IF [35], TextMesh-IF [74], and ProlificDreamer [79]. As there is no publicly available code for these four methods, we used the implementations from threestudio [17]; IF indicates the DeepFloyd IF diffusion model (https://github.com/deep-floyd/IF). We also compared our method with the state-of-the-art NeRF-based method, MVDream [68], using its publicly available official code.

SweetDreamer [33] is a contemporaneous work that is compatible with DMTet and NeRF. Given the absence of a public implementation, we conducted a fair comparison by evaluating our results against those presented on its website using identical prompts in the supplementary materials.

4.1 Evaluation on Text-to-3D

We conducted evaluations in two key aspects: geometry generation and textured model generation.

Evaluation on Geometry Generation

Evaluating the quality of generated geometry is a complex problem due to the lack of standard metrics. To objectively evaluate the geometry’s quality, we employed a rendering-based approach. Specifically, to isolate the geometric attributes from the influence of texture, we rendered the generated geometry with a uniform albedo and then calculated the CLIP score [55] using the provided text prompts and CLIP model (vit-g-14). This process involved generating 16 different views for each object (a total of 113 objects) and computing the average score after removing the highest and lowest scores.
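
The aggregation step described above can be sketched as a trimmed mean. The function names are hypothetical, and the sketch assumes that trimming is applied per object across its 16 views before averaging over objects; the text does not state this explicitly.

```python
def trimmed_mean_score(view_scores):
    """Average per-view scores after dropping one highest and one lowest value."""
    if len(view_scores) < 3:
        raise ValueError("need at least three views to trim both extremes")
    ordered = sorted(view_scores)
    kept = ordered[1:-1]  # drop the single lowest and the single highest score
    return sum(kept) / len(kept)


def dataset_score(per_object_scores):
    """Mean of the per-object trimmed means (assumed aggregation order)."""
    trimmed = [trimmed_mean_score(scores) for scores in per_object_scores]
    return sum(trimmed) / len(trimmed)
```

Dropping the extremes makes the reported number less sensitive to a single degenerate viewpoint, such as a view dominated by the back of an object.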

Table 1 (the first row) shows that two variants of our method achieve the top two values in the average CLIP scores in uniform rendering, demonstrating that our method outperforms existing methods in geometry generation. Visual results in Fig. 3 clearly show that our method can produce 3D content with exceptionally detailed geometry aligned with the text prompts, indicating the effectiveness of the proposed Normal-Depth diffusion model.

Evaluation on Textured Model Generation

In parallel with the geometry evaluation, we assessed the quality of the generated textured models. To accomplish this, we computed CLIP scores for the rendered images of the textured models. As in the geometry evaluation, we rendered 16 distinct views and computed the average scores. Table 1 (the second row) shows that the two variants of our method achieve the second and third highest scores, outperforming most of the existing methods. Our result is slightly lower than that of ProlificDreamer (31.7099 vs. 31.8022). The reason might be that ProlificDreamer additionally fine-tunes the diffusion model with LoRA [23] during optimization, which might lead to rendered images that better fit the text prompts. However, the visual comparison of textured model generation in Fig. 3 shows that our method generates much more accurate and detailed models. These results verify the design of our decoupled text-to-3D generation approach.

Figure 4: User study for text-to-3D.

User Study

To further assess the visual quality of the generated 3D models, we conducted a comprehensive user study. We separately compared the two variants of our method with existing methods. We collected a set of 87 prompts from DreamFusion, SweetDreamer, and MVDream. In total, 119 and 192 participants took part in the comparisons “Ours (NeRF) vs. existing methods” and “Ours (Sphere) vs. existing methods”, completing 40 and 47 test cases each, respectively.

In each test case, the text prompt, textured models, and normal maps generated by various methods were simultaneously displayed. Participants were then asked to give two votes, one for the best textured model and the other for the best geometry model. Figure 4 presents the results of our user study. Our method with NeRF representations received 75% and 70% of votes for “the best textured model” and “the best geometry”, respectively. Our method with Sphere initialization received more than 59% and 58% of votes for the two comparisons. These results show that our method clearly outperforms existing methods on geometry generation and textured model generation, demonstrating the effectiveness of our method.

Table 2: Ablation study for geometry generation.

| | ND Only | ND (w/o LAION) + SD | ND + SD |
| --- | --- | --- | --- |
| Geometry CLIP Score | 24.1070 | 24.2601 | 25.8820 |
| Appearance CLIP Score | 29.5379 | 29.7522 | 31.7099 |

Figure 5: Ablation for the Normal-Depth (ND) diffusion model for geometry generation. Panels, from left to right: ND only, ND (w/o LAION) + SD, ND + SD. Prompt: “A fox plays a cello”.

4.2 Method Analysis

Effects of the Normal-Depth Diffusion Model

To explore the impact of the proposed Normal-Depth diffusion model in the context of text-to-3D, we show results of geometry generation without using the SDS loss from the SD model. Figure 5 and Table 2 show that using only our Normal-Depth model can robustly generate geometry with a coherent structure. When the SD model is incorporated alongside our Normal-Depth model, the resulting geometry exhibits finer details and an improved shape. These findings suggest that our Normal-Depth model serves as a valuable 3D geometric prior for the overall structure, while the SD model excels in generating surface details.

Effects of Pre-training on LAION dataset

To evaluate the impact of pre-training on the LAION-2B dataset using normal and depth generated by existing methods [3, 59], we conducted a comparison with a baseline model that is directly fine-tuned on the synthetic Objaverse dataset [11]. The resulting baseline text-to-3D model is denoted as ND (w/o LAION) + SD. Figure 5 illustrates that when the Normal-Depth model is fine-tuned solely on the synthetic dataset, its generalization ability significantly deteriorates. It struggles to generate content that aligns with the user prompts, and the quality of the generated geometry is notably inferior. In contrast, our method, which involves pre-training on the expansive LAION dataset, successfully preserves its generalization ability and produces superior results, which is also evidenced in Table 2.

Effects of Albedo Diffusion Model

Figure 6 shows that the albedo diffusion model can effectively improve the generated texture and lead to a more accurate appearance. Without the depth condition, the generated texture fails to align with the underlying geometry, highlighting the importance of the depth condition in the albedo diffusion model. With the inclusion of the albedo diffusion model, the generated albedo exhibits reduced shadows and specular highlights, leading to more realistic relighting results (see Fig. 7).

Figure 6: Ablation results for the albedo diffusion model.
Figure 7: Relighting results. From left to right: results of model w/o albedo diffusion, shading, and model w/ albedo diffusion.

5 Conclusion

In this work, we presented a generalizable approach to 3D generation through a Normal-Depth diffusion model, trained extensively on real-world data before undergoing fine-tuning with synthetic datasets. We also introduced a depth-conditioned albedo diffusion model that facilitates the separation of material attributes and lighting effects. Our models seamlessly integrate into current text-to-3D pipelines and demonstrate compatibility with the NeRF and DMTet representations. Extensive experiments show that our method achieves state-of-the-art text-to-3D results in both geometry and appearance modeling.

Future Work

Our current method predominantly focuses on object-level 3D generation. Moving forward, we aim to extend our Normal-Depth diffusion model to address text-to-scene generation challenges. Additionally, an interesting direction for future research is the development of an appearance prior model that regularizes the specular component of 3D content.

Acknowledgement

We would like to express our special gratitude to Rui Chen for the valuable discussion in training Fantasia3D and PBR modeling. Additionally, we extend our heartfelt thanks to Chao Xu for his assistance in conducting relighting experiments.

References

  • [1] Titas Anciukevičius, Zexiang Xu, Matthew Fisher, Paul Henderson, Hakan Bilen, Niloy J Mitra, and Paul Guerrero. Renderdiffusion: Image diffusion for 3d reconstruction, inpainting and generation. In CVPR, 2023.
  • [2] Mohammadreza Armandpour, Huangjie Zheng, Ali Sadeghian, Amir Sadeghian, and Mingyuan Zhou. Re-imagine the negative prompt algorithm: Transform 2d diffusion into 3d, alleviate janus problem and beyond. arXiv preprint arXiv:2304.04968, 2023.
  • [3] Gwangbin Bae, Ignas Budvytis, and Roberto Cipolla. Estimating and exploiting the aleatoric uncertainty in surface normal estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
  • [4] Brent Burley and Walt Disney Animation Studios. Physically-based shading at disney. In Proceedings of the Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH). vol. 2012, 2012.
  • [5] Eric R Chan, Koki Nagano, Matthew A Chan, Alexander W Bergman, Jeong Joon Park, Axel Levy, Miika Aittala, Shalini De Mello, Tero Karras, and Gordon Wetzstein. Generative novel view synthesis with 3d-aware diffusion models. In ICCV, 2023.
  • [6] Hansheng Chen, Jiatao Gu, Anpei Chen, Wei Tian, Zhuowen Tu, Lingjie Liu, and Hao Su. Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction. In ICCV, 2023.
  • [7] Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
  • [8] Yiwen Chen, Chi Zhang, Xiaofeng Yang, Zhongang Cai, Gang Yu, Lei Yang, and Guosheng Lin. It3d: Improved text-to-3d generation with explicit view synthesis. arXiv preprint arXiv:2308.11473, 2023.
  • [9] Xinhua Cheng, Tianyu Yang, Jianan Wang, Yu Li, Lei Zhang, Jian Zhang, and Li Yuan. Progressive3d: Progressively local editing for text-to-3d content creation with complex semantic prompts. arXiv preprint arXiv:2310.11784, 2023.
  • [10] Yen-Chi Cheng, Hsin-Ying Lee, Sergey Tulyakov, Alexander G Schwing, and Liang-Yan Gui. Sdfusion: Multimodal 3d shape completion, reconstruction, and generation. In CVPR, 2023.
  • [11] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In CVPR, 2023.
  • [12] Congyue Deng, Chiyu Jiang, Charles R Qi, Xinchen Yan, Yin Zhou, Leonidas Guibas, Dragomir Anguelov, et al. Nerdi: Single-view nerf synthesis with language-guided diffusion as general image priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • [13] Ziya Erkoç, Fangchang Ma, Qi Shan, Matthias Nießner, and Angela Dai. Hyperdiffusion: Generating implicit neural fields with weight-space diffusion. arXiv preprint arXiv:2303.17015, 2023.
  • [14] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • [15] Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3d: A generative model of high quality 3d textured shapes learned from images. NeurIPS, 2022.
  • [16] Jiatao Gu, Alex Trevithick, Kai-En Lin, Joshua M Susskind, Christian Theobalt, Lingjie Liu, and Ravi Ramamoorthi. Nerfdiff: Single-image view synthesis with nerf-guided distillation from 3d-aware diffusion. In Proceedings of the ACM International Conference on Machine Learning (ICML), 2023.
  • [17] Yuan-Chen Guo, Ying-Tian Liu, Ruizhi Shao, Christian Laforte, Vikram Voleti, Guan Luo, Chia-Hao Chen, Zi-Xin Zou, Chen Wang, Yan-Pei Cao, and Song-Hai Zhang. threestudio: A unified framework for 3d content generation. https://github.com/threestudio-project/threestudio, 2023.
  • [18] Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, and Barlas Oğuz. 3dgen: Triplane latent diffusion for textured mesh generation. arXiv preprint arXiv:2303.05371, 2023.
  • [19] Xiao Han, Yukang Cao, Kai Han, Xiatian Zhu, Jiankang Deng, Yi-Zhe Song, Tao Xiang, and Kwan-Yee K Wong. Headsculpt: Crafting 3d head avatars with text. arXiv preprint arXiv:2306.03038, 2023.
  • [20] Philipp Henzler, Niloy J Mitra, and Tobias Ritschel. Escaping plato’s cave: 3d shape from adversarial rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
  • [21] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 2020.
  • [22] Susung Hong, Donghoon Ahn, and Seungryong Kim. Debiasing scores and prompts of 2d diffusion for robust text-to-3d generation. arXiv preprint arXiv:2303.15413, 2023.
  • [23] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  • [24] Xin Huang, Ruizhi Shao, Qi Zhang, Hongwen Zhang, Ying Feng, Yebin Liu, and Qing Wang. Humannorm: Learning normal diffusion model for high-quality and realistic 3d human generation. arXiv preprint arXiv:2310.01406, 2023.
  • [25] Yukun Huang, Jianan Wang, Yukai Shi, Xianbiao Qi, Zheng-Jun Zha, and Lei Zhang. Dreamtime: An improved optimization strategy for text-to-3d content creation. arXiv preprint arXiv:2306.12422, 2023.
  • [26] Yukun Huang, Jianan Wang, Ailing Zeng, He Cao, Xianbiao Qi, Yukai Shi, Zheng-Jun Zha, and Lei Zhang. Dreamwaltz: Make a scene with complex 3d animatable avatars. arXiv preprint arXiv:2305.12529, 2023.
  • [27] Ajay Jain, Ben Mildenhall, Jonathan T Barron, Pieter Abbeel, and Ben Poole. Zero-shot text-guided object generation with dream fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • [28] Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463, 2023.
  • [29] Animesh Karnewar, Niloy J Mitra, Andrea Vedaldi, and David Novotny. Holofusion: Towards photo-realistic 3d generative modeling. In ICCV, 2023.
  • [30] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (TOG), 2023.
  • [31] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [32] Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, and Timo Aila. Modular primitives for high-performance differentiable rendering. ACM Transactions on Graphics (TOG), 2020.
  • [33] Weiyu Li, Rui Chen, Xuelin Chen, and Ping Tan. Sweetdreamer: Aligning geometric priors in 2d diffusion for consistent text-to-3d. arXiv preprint arXiv:2310.02596, 2023.
  • [34] Tingting Liao, Hongwei Yi, Yuliang Xiu, Jiaxiang Tang, Yangyi Huang, Justus Thies, and Michael J Black. Tada! text to animatable digital avatars. arXiv preprint arXiv:2308.10899, 2023.
  • [35] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • [36] Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Zexiang Xu, and Hao Su. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. arXiv preprint arXiv:2306.16928, 2023.
  • [37] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
  • [38] Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. Syncdreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023.
  • [39] Zhen Liu, Yao Feng, Michael J Black, Derek Nowrouzezahrai, Liam Paull, and Weiyang Liu. Meshdiffusion: Score-based generative 3d mesh modeling. In ICLR, 2023.
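  • [40] Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, and Wenping Wang. Wonder3d: Single image to 3d using cross-domain diffusion. arXiv preprint arXiv:2310.15008, 2023.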
  • [41] Jonathan Lorraine, Kevin Xie, Xiaohui Zeng, Chen-Hsuan Lin, Towaki Takikawa, Nicholas Sharp, Tsung-Yi Lin, Ming-Yu Liu, Sanja Fidler, and James Lucas. Att3d: Amortized text-to-3d object synthesis. arXiv preprint arXiv:2306.07349, 2023.
  • [42] Shitong Luo and Wei Hu. Diffusion probabilistic models for 3d point cloud generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • [43] Tiange Luo, Chris Rockwell, Honglak Lee, and Justin Johnson. Scalable 3d captioning with pretrained models. arXiv preprint arXiv:2306.07279, 2023.
  • [44] Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, and Andrea Vedaldi. Realfusion: 360deg reconstruction of any object from a single image. In CVPR, 2023.
  • [45] Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, and Daniel Cohen-Or. Latent-nerf for shape-guided generation of 3d shapes and textures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • [46] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
  • [47] Nasir Mohammad Khalid, Tianhao Xie, Eugene Belilovsky, and Tiberiu Popa. Clip-mesh: Generating textured meshes from text using pretrained image-text models. In SIGGRAPH Asia Conference Papers, 2022.
  • [48] Norman Müller, Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulo, Peter Kontschieder, and Matthias Nießner. Diffrf: Rendering-guided 3d radiance field diffusion. In CVPR, 2023.
  • [49] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. arXiv preprint arXiv:2201.05989, 2022.
  • [50] Jacob Munkberg, Jon Hasselgren, Tianchang Shen, Jun Gao, Wenzheng Chen, Alex Evans, Thomas Müller, and Sanja Fidler. Extracting Triangular 3D Models, Materials, and Lighting From Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • [51] Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022.
  • [52] Songyou Peng, Björn Häfner, Yvain Quéau, and Daniel Cremers. Depth super-resolution meets uncalibrated photometric stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2017.
  • [53] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In Proceedings of the International Conference on Learning Representations (ICLR), 2023.
  • [54] Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, et al. Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors. arXiv preprint arXiv:2306.17843, 2023.
  • [55] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In Proceedings of the ACM International Conference on Machine Learning (ICML), 2021.
  • [56] Amit Raj, Srinivas Kaza, Ben Poole, Michael Niemeyer, Nataniel Ruiz, Ben Mildenhall, Shiran Zada, Kfir Aberman, Michael Rubinstein, Jonathan Barron, et al. Dreambooth3d: Subject-driven text-to-3d generation. arXiv preprint arXiv:2303.13508, 2023.
  • [57] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning. PMLR, 2021.
  • [58] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • [59] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), 2020.
  • [60] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In CVPR, 2021.
  • [61] Elad Richardson, Gal Metzer, Yuval Alaluf, Raja Giryes, and Daniel Cohen-Or. Texture: Text-guided texturing of 3d shapes. arXiv preprint arXiv:2302.01721, 2023.
  • [62] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • [63] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 2022.
  • [64] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. NeurIPS, 2022.
  • [65] Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Jaehoon Ko, Hyeonsu Kim, Junho Kim, Jin-Hwa Kim, Jiyoung Lee, and Seungryong Kim. Let 2d diffusion model know 3d-consistency for robust text-to-3d generation. arXiv preprint arXiv:2303.07937, 2023.
  • [66] Tianchang Shen, Jun Gao, Kangxue Yin, Ming-Yu Liu, and Sanja Fidler. Deep marching tetrahedra: a hybrid representation for high-resolution 3d shape synthesis. NeurIPS, 2021.
  • [67] Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei, Linghao Chen, Chong Zeng, and Hao Su. Zero123++: a single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110, 2023.
  • [68] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023.
  • [69] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  • [70] Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, et al. Ldm3d: Latent diffusion model for 3d. arXiv preprint arXiv:2305.10853, 2023.
  • [71] Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Viewset diffusion:(0-) image-conditioned 3d generative models from 2d data. arXiv preprint arXiv:2306.07881, 2023.
  • [72] Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653, 2023.
  • [73] Junshu Tang, Tengfei Wang, Bo Zhang, Ting Zhang, Ran Yi, Lizhuang Ma, and Dong Chen. Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
  • [74] Christina Tsalicoglou, Fabian Manhardt, Alessio Tonioni, Michael Niemeyer, and Federico Tombari. Textmesh: Generation of realistic 3d meshes from text prompts. arXiv preprint arXiv:2304.12439, 2023.
  • [75] Hung-Yu Tseng, Qinbo Li, Changil Kim, Suhib Alsisan, Jia-Bin Huang, and Johannes Kopf. Consistent view synthesis with pose-guided diffusion models. In CVPR, 2023.
  • [76] Bruce Walter, Stephen R Marschner, Hongsong Li, and Kenneth E Torrance. Microfacet models for refraction through rough surfaces. In Proceedings of the 18th Eurographics conference on Rendering Techniques, 2007.
  • [77] Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A Yeh, and Greg Shakhnarovich. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • [78] Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, et al. Rodin: A generative model for sculpting 3d digital avatars using diffusion. In CVPR, 2023.
  • [79] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. arXiv preprint arXiv:2305.16213, 2023.
  • [80] Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. Novel view synthesis with diffusion models. arXiv preprint arXiv:2210.04628, 2022.
  • [81] Haohan Weng, Tianyu Yang, Jianan Wang, Yu Li, Tong Zhang, CL Chen, and Lei Zhang. Consistent123: Improve consistency for one image to 3d object synthesis. arXiv preprint arXiv:2310.08092, 2023.
  • [82] Jinbo Wu, Xiaobo Gao, Xing Liu, Zhengyang Shen, Chen Zhao, Haocheng Feng, Jingtuo Liu, and Errui Ding. Hd-fusion: Detailed text-to-3d generation leveraging multiple noise estimation. arXiv preprint arXiv:2307.16183, 2023.
  • [83] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. NeurIPS, 2016.
  • [84] Tong Wu, Jiarui Zhang, Xiao Fu, Yuxin Wang, Jiawei Ren, Liang Pan, Wayne Wu, Lei Yang, Jiaqi Wang, Chen Qian, et al. Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • [85] Dejia Xu, Yifan Jiang, Peihao Wang, Zhiwen Fan, Yi Wang, and Zhangyang Wang. Neurallift-360: Lifting an in-the-wild 2d photo to a 3d object with 360 views. arXiv e-prints, 2022.
  • [86] Jiale Xu, Xintao Wang, Weihao Cheng, Yan-Pei Cao, Ying Shan, Xiaohu Qie, and Shenghua Gao. Dream3d: Zero-shot text-to-3d synthesis using 3d shape prior and text-to-image diffusion models. In CVPR, 2023.
  • [87] Xudong Xu, Zhaoyang Lyu, Xingang Pan, and Bo Dai. Matlaber: Material-aware text-to-3d via latent brdf auto-encoder. arXiv preprint arXiv:2308.09278, 2023.
  • [88] Jianglong Ye, Peng Wang, Kejie Li, Yichun Shi, and Heng Wang. Consistent-1-to-3: Consistent image to 3d view synthesis via geometry-aware diffusion models. arXiv preprint arXiv:2310.03020, 2023.
  • [89] Taoran Yi, Jiemin Fang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, and Xinggang Wang. Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud priors. arXiv preprint arXiv:2310.08529, 2023.
  • [90] Jonathan Young. xatlas. https://github.com/jpcy/xatlas, 2021.
  • [91] Wang Yu, Xuelin Qian, Jingyang Huo, Tiejun Huang, Bo Zhao, and Yanwei Fu. Pushing the limits of 3d shape generation at scale. arXiv preprint arXiv:2306.11510, 2023.
  • [92] Xianggang Yu, Mutian Xu, Yidan Zhang, Haolin Liu, Chongjie Ye, Yushuang Wu, Zizheng Yan, Chenming Zhu, Zhangyang Xiong, Tianyou Liang, Guanying Chen, Shuguang Cui, and Xiaoguang Han. Mvimgnet: A large-scale dataset of multi-view images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • [93] Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis. Lion: Latent point diffusion models for 3d shape generation. In NeurIPS, 2022.
  • [94] Biao Zhang, Jiapeng Tang, Matthias Niessner, and Peter Wonka. 3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models. In SIGGRAPH, 2023.
  • [95] Longwen Zhang, Qiwei Qiu, Hongyang Lin, Qixuan Zhang, Cheng Shi, Wei Yang, Ye Shi, Sibei Yang, Lan Xu, and Jingyi Yu. Dreamface: Progressive generation of animatable 3d faces under text guidance. arXiv preprint arXiv:2304.03117, 2023.
  • [96] Minda Zhao, Chaoyi Zhao, Xinyue Liang, Lincheng Li, Zeng Zhao, Zhipeng Hu, Changjie Fan, and Xin Yu. Efficientdreamer: High-fidelity and robust 3d creation via orthogonal-view diffusion prior. arXiv preprint arXiv:2308.13223, 2023.
  • [97] Zibo Zhao, Wen Liu, Xin Chen, Xianfang Zeng, Rui Wang, Pei Cheng, Bin Fu, Tao Chen, Gang Yu, and Shenghua Gao. Michelangelo: Conditional 3d shape generation based on shape-image-text aligned latent representation. arXiv preprint arXiv:2306.17115, 2023.
  • [98] Zhizhuo Zhou and Shubham Tulsiani. Sparsefusion: Distilling view-conditioned diffusion for 3d reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

Appendix A More Details for Normal-Depth Diffusion

A.1 Pre-training on the LAION Dataset

Figure 8: Text to Normal-Depth sampling results.

Training of VAE

We initialize the parameters of our model with the pre-trained weights of SD 2.1. Specifically, we modify the input channels of the VAE from 3 to 4, and the weights of the newly added channel are initialized as the average of the original weights. We do not modify the number of channels in the latent space of our VAE, which remains at 4 channels.
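
The channel expansion described above can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the authors' code: the kernel layout `(out_channels, in_channels, kh, kw)` and the function name `expand_input_channels` are assumptions, and "average of the original weights" is interpreted as the mean over the three original input channels.

```python
import numpy as np

def expand_input_channels(weight, new_in_channels=4):
    """Expand a conv kernel of shape (out, in, kh, kw) along the input axis.

    Original channels are copied unchanged; each added channel is set to the
    mean of the original input-channel slices (an assumed reading of the text).
    """
    in_channels = weight.shape[1]
    extra = new_in_channels - in_channels
    if extra < 0:
        raise ValueError("new_in_channels must be >= the existing channel count")
    mean_slice = weight.mean(axis=1, keepdims=True)
    return np.concatenate([weight] + [mean_slice] * extra, axis=1)
```

With this initialization, feeding the new channel the mean of the three original channels makes the first layer respond as if all four channels carried that mean, which keeps early fine-tuning close to the pre-trained behavior.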

For VAE fine-tuning, we randomly sample the training data from LAION-Aesthetics V1 [64], selecting data with aesthetics scores larger than 8.0, considering the need for high-quality training data. The training dataset consists of 8 million training samples.

We apply center cropping on the training images and resize them to 384×384. We employ random rotation and flipping of the images to augment our training dataset. Subsequently, we use the augmented images as inputs to the monocular prior models (NormalBae [3] and Midas-3.1 [59]) to obtain corresponding estimates of the normal and depth. Notably, we perform depth normalization on the predicted depth values, scaling them to the range [-1, 1]. This normalization step is necessary since the original depth predictions are unbounded real values. The estimated normal and depth maps are then resized to a resolution of 256×256 and serve as the input for the VAE.
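
A minimal sketch of the depth normalization step is given below. The text only states that depth is scaled to the range from -1 to 1; per-image min-max scaling and the function name `normalize_depth` are assumptions made for illustration.

```python
import numpy as np

def normalize_depth(depth, eps=1e-8):
    """Map a real-valued depth map to [-1, 1] with per-image min-max scaling."""
    d_min = float(depth.min())
    d_max = float(depth.max())
    # Guard against a constant map, where max - min would be zero.
    scaled = (depth - d_min) / max(d_max - d_min, eps)  # now in [0, 1]
    return scaled * 2.0 - 1.0
```

Matching the value range of the normal map, whose components already lie in [-1, 1], lets both modalities share one VAE input without one dominating the reconstruction loss.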

During the training phase, following the latent diffusion model [62], we employ the Adam optimizer with a learning rate of 5e-5 to optimize our VAE model. To ensure a well-behaved latent space, we incorporate KL regularization loss. Moreover, to further improve the quality of the generated images, we train an auxiliary discriminator on the output of the VAE. The weights of the KL regularization and discriminator are set to 1e-6 and 0.5, respectively. For each iteration, the batch size is set to 1024. The training process is conducted on 8 A100-80G GPUs for two weeks, reaching a total of 100K iterations.

Training of UNet Diffuser

Notably, we maintain the original structure of the UNet model as the channels of the latent remain unchanged. During the training phase, we randomly sample data from the Laion-2B-en dataset. We perform a center crop on the sampled images and resize them to a resolution of 512 pixels. Subsequently, these images are passed through the monocular prior models and the encoder of the trained VAE. The output of this process serves as the input of the UNet model.

We follow the strategies utilized in the latent-diffusion model [62] to train our Normal-Depth diffusion model. Specifically, we first train our Normal-Depth diffusion model using the entire Laion-2B-en dataset. After 121,000 iterations of training, we proceed to sample our data from the “Laion-Aesthetics v2 5+” subset. This subset contains images with aesthetics scores greater than 5, and we only select images with an unsafety probability higher than 0.1. The fine-tuning process continues for about 167,000 iterations at a resolution of $512\times 512$. To enable classifier-free guidance sampling, we drop the text conditioning with a probability of 10%. We utilize the Adam optimizer with a learning rate of 1e-4 to optimize our Normal-Depth diffusion model. For each iteration, the batch size is set to 1024. This process costs 11,520 GPU hours.
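The 10% text-conditioning drop used for classifier-free guidance can be sketched as below. This is only an illustration: representing the dropped condition by an empty string and the helper name `maybe_drop_caption` are assumptions, not details given in the text.

```python
import random


def maybe_drop_caption(caption, rng, drop_prob=0.1):
    """Return an empty caption with probability drop_prob, else the caption."""
    return "" if rng.random() < drop_prob else caption


rng = random.Random(0)  # seeded generator so the run is repeatable
captions = [maybe_drop_caption("a ceramic teapot", rng) for _ in range(10000)]
drop_rate = captions.count("") / len(captions)
print(round(drop_rate, 2))
```

Over many samples the empirical drop rate should be close to the configured 10%, so the model sees both conditional and unconditional training examples.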

Text to Normal-Depth Sampling

Figure 8 shows the sampling results of our Normal-Depth diffusion model trained on the Laion-2B dataset. As shown in the figure, the sampled normal and depth maps are highly consistent with the textual description and of high quality. We set the classifier-free guidance scale to 7.5 and use 50 DDIM [69] steps.
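For reference, classifier-free guidance is commonly implemented by extrapolating from the unconditional noise prediction towards the conditional one. The sketch below shows this standard combination with the guidance scale of 7.5 mentioned above. The paper does not spell out its exact formulation, so this form and the name `cfg_combine` are assumptions.

```python
def cfg_combine(eps_uncond, eps_cond, scale=7.5):
    """Standard classifier-free guidance: eps_u + scale * (eps_c - eps_u)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy two-element "noise predictions".
print(cfg_combine([0.0, 1.0], [1.0, 1.0]))  # [7.5, 1.0]
```

A scale of 1 recovers the conditional prediction, while larger scales push the sample further towards the text condition.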

A.2 Fine-tuning on Synthetic Dataset

Multi-view Normal-Depth Diffusion Fine-tune

Thanks to the open-source large-scale Objaverse dataset [11], we fine-tune our Normal-Depth diffusion model on Objaverse to improve its performance for 3D generation tasks. To avoid the Janus problem, following MVDream [68], we fine-tune the pre-trained Normal-Depth diffusion model as a multi-view diffusion model. In particular, we use 4 images of orthogonal camera views as the input for the diffusion model and apply a two-layer MLP to embed the extrinsic camera matrix, which is added as a residual to the time embedding.

Figure 9 illustrates the sampling results of our multi-view Normal-Depth Diffusion model, where the classifier-free guidance scale is set to 10 and the negative prompt is set to null.

Figure 9: Sampling results of our Normal-Depth diffusion model fine-tuned on the Objaverse dataset.

Depth Normalization

For a synthetic dataset, we can directly obtain its depth values. Since our Normal-Depth diffusion model is trained on normalized depth, it is necessary to normalize the range of the synthetic depth. To do so, we introduce a near plane and a far plane to restrict the depth range to [-1, 1]. By defining these planes, we can map the actual depth values to the normalized range. Considering our synthetic data is confined within a 0.5-unit cubic volume, we define the distance from the object’s center point to the near plane and far plane as $0.5\sqrt{3}$. After defining the near and far planes, we can normalize the depth values of our synthetic data.

However, the process of normalizing depth is not trivial. Midas [52] estimates depth in the form of disparity, which is a relative measure of the difference in pixel coordinates and is inversely related to depth.

One straightforward approach is to normalize the disparity of synthetic data directly. For simplicity, we assume that the synthetic object is located at the coordinate origin. The equation for normalizing the disparity is as follows:

\begin{align}
\textrm{Disp}(z) &= \frac{\frac{1}{d_{\text{cam}}-z}-\frac{1}{d_{\text{cam}}+\sqrt{3}l}}{\frac{1}{d_{\text{cam}}-\sqrt{3}l}-\frac{1}{d_{\text{cam}}+\sqrt{3}l}} \tag{8}\\
&= \frac{(\sqrt{3}l+z)\cdot(d_{\text{cam}}-\sqrt{3}l)}{2\sqrt{3}l\cdot(d_{\text{cam}}-z)},
\end{align}

where $z$ represents the depth value between the given point and the origin plane, $d_{\text{cam}}$ denotes the depth value from the origin plane to the camera, and $l$ represents the size of the cube that confines the object, as illustrated in Fig. 10.

Figure 10: Depth normalization figure.

From the equation, it is evident that the normalized disparity lacks the desirable property of scale invariance. Consider an example: for a fixed camera distance $d_{\text{cam}}$, if we double the values of $l$ and $z$, the resulting normalized disparity differs from the one obtained with the original $l$ and $z$. Similarly, for the same values of $l$ and $z$, different camera distances yield different normalized disparities.
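The lack of scale invariance can be checked numerically. The sketch below evaluates the simplified form of Eq. (8); the function name `normalized_disparity` and the specific values of $z$, $d_{\text{cam}}$, and $l$ are illustrative assumptions, chosen only so that the camera lies outside the bounding volume.

```python
import math


def normalized_disparity(z, d_cam, l):
    """Simplified form of Eq. (8); requires d_cam > sqrt(3) * l."""
    a = math.sqrt(3.0) * l  # distance from the origin plane to near/far plane
    return ((a + z) * (d_cam - a)) / (2.0 * a * (d_cam - z))


base = normalized_disparity(z=0.1, d_cam=1.5, l=0.25)
doubled = normalized_disparity(z=0.2, d_cam=1.5, l=0.5)    # l and z doubled
farther = normalized_disparity(z=0.1, d_cam=1.9, l=0.25)   # camera moved back
print(base, doubled, farther)  # three different values
```

The value is still 1 at the near plane and 0 at the far plane in every case, but points in between map to different values when the scale or the camera distance changes.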

During the optimization process of text to 3D, we randomly sample different camera distances. This randomness further exacerbates the lack of scale invariance, introducing significant noise into the optimization process. As a consequence, achieving accurate results becomes more challenging due to the varying scales, which can adversely affect the convergence and stability of the optimization algorithm.

To address the aforementioned issue, we propose the use of reverse depth as an alternative to disparity normalization. The reverse depth is defined as follows:

\begin{align}
\textrm{RevDepth}(z) &= \frac{(d_{\text{cam}}+\sqrt{3}l)-(d_{\text{cam}}-z)}{(d_{\text{cam}}+\sqrt{3}l)-(d_{\text{cam}}-\sqrt{3}l)} \tag{9}\\
&= \frac{\sqrt{3}l+z}{2\sqrt{3}l}.
\end{align}

From the equation, it is clear that the normalized value is independent of $d_{\text{cam}}$ and remains unchanged when $l$ and $z$ are scaled proportionally. For simplicity, in the following, depth refers to the normalized reverse depth.
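Both properties can be verified with a short numerical check. The sketch below implements the unsimplified first line of Eq. (9), with far plane at $d_{\text{cam}}+\sqrt{3}l$ and near plane at $d_{\text{cam}}-\sqrt{3}l$; the function name `reverse_depth` and the test values are illustrative assumptions.

```python
import math


def reverse_depth(z, d_cam, l):
    """Unsimplified form of Eq. (9): (far - depth) / (far - near)."""
    a = math.sqrt(3.0) * l
    far, near = d_cam + a, d_cam - a
    depth = d_cam - z  # depth of the point as seen from the camera
    return (far - depth) / (far - near)


base = reverse_depth(z=0.1, d_cam=1.5, l=0.25)
farther = reverse_depth(z=0.1, d_cam=1.9, l=0.25)   # different camera distance
doubled = reverse_depth(z=0.2, d_cam=1.5, l=0.5)    # l and z doubled
print(base, farther, doubled)  # equal up to floating-point rounding
```

Because the camera distance cancels and only the ratio of $z$ to $l$ remains, randomly sampled camera distances during optimization no longer change the normalized value of a given surface point.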

Figure 11: Depth-conditioned albedo diffusion model.

Appendix B More Details for Albedo Diffusion Model

Depth-Conditioned Albedo Diffusion Fine-tuning

For training the albedo model under the depth condition, we resize the normalized depth image and concatenate it with the VAE latent. This expands the number of input channels of the UNet from 4 to 5. We initialize the parameters of the albedo diffusion model with the pre-trained weights of SD 2.1 and set the weights of the additional input channel to zero. To alleviate the multi-face problem in the generated texture, we also employ the same multi-view strategy used in the multi-view Normal-Depth diffusion model to train our albedo diffusion model (see Fig. 11 for sampling results).
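A common motivation for zero-initializing the weights of a newly added input channel is that the network's output is initially identical to that of the pre-trained model. The toy sketch below illustrates this for a single $1\times 1$ filter response at one pixel; the helper names and numbers are hypothetical and only meant to show the effect.

```python
def add_zero_channel(weight_row):
    """Extend a per-channel weight vector with one zero-initialized entry."""
    return list(weight_row) + [0.0]


def pixel_response(weight_row, pixel):
    """Response of a 1x1 filter at a single pixel (dot product over channels)."""
    return sum(w * x for w, x in zip(weight_row, pixel))


w4 = [0.2, -0.1, 0.4, 0.3]       # stand-in for pre-trained 4-channel weights
w5 = add_zero_channel(w4)        # 5-channel weights, depth channel set to zero
latent = [1.0, 2.0, 3.0, 4.0]    # toy latent value at one pixel
depth = 0.7                      # toy normalized depth at the same pixel
print(pixel_response(w4, latent) == pixel_response(w5, latent + [depth]))  # True
```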

Appendix C More Details for Geometry Generation

Our Normal-Depth diffusion model can be applied to optimize both DMTet and NeRF. To better verify its effectiveness, we design two model variants for text-to-3D. The first model, denoted as Ours (Sphere), initializes the DMTet with a sphere for geometry generation. The second model, denoted as Ours (NeRF), first optimizes a NeRF with our Normal-Depth diffusion model and then converts the NeRF to DMTet as an initialization for geometry generation.

In the paper, the geometry generation loss function is defined as:

\begin{equation}
\mathcal{L}_{\text{Geo}}=\lambda_{\text{SD}}\mathcal{L}_{\text{SDS-Normal}}^{\text{SD}}+\lambda_{\text{ND}}\mathcal{L}_{\text{SDS-ND}}^{\text{ND}}. \tag{10}
\end{equation}

More Details for Ours (Sphere)

The geometry optimization in Ours (Sphere) consists of two stages: a coarse shape optimization stage and a refinement stage. For the coarse shape optimization stage, we follow the approach of Fantasia3D, where we directly resize the rendered normal and depth maps as the latent space features to quickly optimize to obtain a coarse shape. In the refinement stage, we utilize the latent features obtained from the VAE to enhance the details of geometry.

Specifically, our DMTet is initialized from a sphere. For coarse shape optimization, we set $\lambda_{\text{SD}}$ and $\lambda_{\text{ND}}$ to 0.5 and 1.0, respectively. For refinement optimization, both $\lambda_{\text{ND}}$ and $\lambda_{\text{SD}}$ are set to 1.0. The negative prompt “low quality” is used in both diffusion models. The classifier-free guidance scale is set to 100 and 50 in SD 2.1 and the Normal-Depth diffusion model, respectively. For the time sampling schedule, we sample timesteps uniformly, annealing the range from [0.5, 0.98] to [0.05, 0.5] when we switch from the coarse to the refinement stage.

Ours (Sphere) is optimized on a single Nvidia A100-80G GPU for 3000 iterations (1500 iterations for the coarse stage and 1500 iterations for the fine stage), where we use the AdamW optimizer with a learning rate of 1e-3. The batch size is set to 8, and the entire geometry optimization process takes about 40 minutes.
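The two-stage settings of Ours (Sphere) can be summarized in a small configuration sketch. It assumes that the loss weights and the timestep range switch abruptly at iteration 1500, which the text does not state explicitly; all function and key names are hypothetical.

```python
import random

COARSE_ITERS = 1500  # length of the coarse stage, from the text


def sphere_stage_config(iteration):
    """Loss weights and timestep range for Ours (Sphere) at a given iteration."""
    if iteration < COARSE_ITERS:
        return {"lambda_sd": 0.5, "lambda_nd": 1.0, "t_range": (0.5, 0.98)}
    return {"lambda_sd": 1.0, "lambda_nd": 1.0, "t_range": (0.05, 0.5)}


def geometry_loss(loss_sd, loss_nd, cfg):
    """Weighted sum of the two SDS terms, following Eq. (10)."""
    return cfg["lambda_sd"] * loss_sd + cfg["lambda_nd"] * loss_nd


def sample_timestep(cfg, rng):
    """Draw a diffusion time uniformly from the stage's range."""
    lo, hi = cfg["t_range"]
    return rng.uniform(lo, hi)


cfg = sphere_stage_config(0)
print(geometry_loss(2.0, 3.0, cfg))  # 4.0
```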

More Details for Ours (NeRF)

The geometry optimization in Ours (NeRF) also consists of two stages. In the coarse shape optimization stage, we adopt the rendered normal and depth maps as latent space input of SD 2.1 and the Normal-Depth diffusion model without VAE encoding to quickly optimize to obtain the coarse shape. In the fine detail refinement stage, we adopt encoding features obtained from VAE as latent space input of SD 2.1 to get a detailed shape.

Figure 12: Visual comparison of the geometry with and without the DMTet refinement stage in Ours (NeRF).

Ours (NeRF) is optimized on a single Nvidia A100-80G GPU for 5000 iterations (1500 iterations for the coarse stage and 3500 iterations for the fine stage), where we use the AdamW optimizer with a learning rate of 1e-3, except for the hash encoding module, which uses 1e-2. For the time sampling schedule, we sample timesteps uniformly, annealing the range from [0.5, 0.98] to [0.05, 0.5] at the 3000th iteration. We set $\lambda_{\text{SD}}$ to 1.0 and anneal $\lambda_{\text{ND}}$ from 10 to 2 at the 3500th iteration. The classifier-free guidance scale is set to 50 in both SD 2.1 and the Normal-Depth diffusion model. We utilize a multi-resolution strategy to train NeRF efficiently: the rendering resolution increases from $64\times 64$ to $256\times 256$ at the 3000th iteration, and the batch size decreases from 8 to 4. The entire geometry optimization takes about 40 minutes.
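The iteration-dependent settings of Ours (NeRF) can be collected in one schedule function. The sketch below treats every change as a hard switch at the stated iteration, which is an assumption (the text says "annealing" without giving the curve); the function and key names are hypothetical.

```python
def nerf_schedule(iteration):
    """Settings for Ours (NeRF) at a given iteration, using hard switches."""
    late = iteration >= 3000  # resolution, batch size, and t-range change here
    return {
        "resolution": 256 if late else 64,
        "batch_size": 4 if late else 8,
        "t_range": (0.05, 0.5) if late else (0.5, 0.98),
        "lambda_sd": 1.0,
        "lambda_nd": 2.0 if iteration >= 3500 else 10.0,
    }


for it in (0, 3000, 3500):
    print(it, nerf_schedule(it))
```

Keeping all switches in one place makes it easy to see that the timestep range, resolution, and batch size change together at iteration 3000, while the Normal-Depth weight changes later.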

To reduce geometric artifacts when converting NeRF to DMTet representations, we optimize the DMTet with an additional 3000 iterations to refine the geometry. This is done using the same strategy adopted in the fine detail stage, except the rendering resolution is increased to 512. The refinement optimization takes about 20 minutes.

As illustrated in Figure 12, the optimized geometry of NeRF using our Normal-Depth diffusion model is already of very high quality, clearly demonstrating the effectiveness of our Normal-Depth diffusion model in optimizing both the DMTet and NeRF representations. After undergoing DMTet conversion and subsequent refinement, the surface details are further enhanced. However, for geometry types that are more suitably represented by a density volume (e.g., hair and smoke), the geometry shape tends to be better without the DMTet conversion.

Camera Sampling

During the training process, we randomly sample the elevation angle between 5 degrees and 30 degrees and uniformly sample the camera distance between 1.5 and 1.9. In addition, for the sampling of azimuth angles, we follow the sampling approach from MVDream [68], where we consecutively sample four orthogonal viewpoints.
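A minimal sketch of this camera sampling is given below. It assumes that the four views share one elevation and one camera distance, that the elevation is drawn uniformly, and that the first azimuth is drawn uniformly from [0, 360) degrees; none of these details are stated in the text, and `sample_cameras` is a hypothetical helper.

```python
import random


def sample_cameras(rng):
    """Sample four orthogonal views with shared elevation and distance."""
    elevation = rng.uniform(5.0, 30.0)   # degrees
    distance = rng.uniform(1.5, 1.9)
    start = rng.uniform(0.0, 360.0)      # azimuth of the first view (assumed)
    return [
        {
            "elevation_deg": elevation,
            "azimuth_deg": (start + 90.0 * k) % 360.0,
            "distance": distance,
        }
        for k in range(4)
    ]


views = sample_cameras(random.Random(0))
for v in views:
    print(v)
```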

Appendix D More Details for Appearance Modeling

The classifier-free guidance scale is set to 10 for the depth-conditioned albedo diffusion model and to 100 for the original SD 2.1. For appearance modeling, we use the same camera sampling strategy as in geometry generation. For the time sampling schedule, we sample timesteps uniformly from [0.02, 0.98].

Our appearance model is optimized on a single Nvidia A100-80G GPU for 3000 iterations. The batch size is set to 8, and the AdamW optimizer with a learning rate of 1e-2 is used to update the model parameters.

Appendix E More Details for the User Study

In the user study comparing our method with the baselines, our main focus is to evaluate the quality of the generated geometry and the textured models. Regarding geometry, we primarily compare whether the geometry is complete, whether the fine details appear natural, and whether there are significant artifacts. For textured models, our comparison is based on the naturalness of the generated textures and the alignment between the textured model and the textual descriptions. At the beginning of each questionnaire, we provide an illustrative example to explain what is meant by “visual-textual matching” and how to evaluate the quality of the generated geometry.

The interface of our user study is shown in Fig. 13, where we display the results of different methods in each row and randomly shuffle the order of the methods to avoid introducing bias. For each row, we display four images, which consist of color and normal maps captured from two camera views that are 180 degrees apart.

Figure 13: Example of the interactive interface for the user study.

We distribute our questionnaire to graduate students and professionals working in the field of 3D, such as model designers and engineers from technology companies. All the prompts used in our user study are presented in the attached txt file.

Figure 14: The user study for the comparison with SweetDreamer. Panels: Ours (NeRF) vs. NeRF-based SweetDreamer; Ours (Sphere) vs. DMTet-based SweetDreamer.
Figure 15: Comparison between SweetDreamer (NeRF-based) and Ours (NeRF). Each row shows, from left to right, the rendered image and normal map of SweetDreamer, followed by the rendered image and normal map of our method.
Figure 16: Comparison between SweetDreamer (DMTet-based) and Ours (Sphere). Each row shows, from left to right, the rendered image and normal map of SweetDreamer, followed by the rendered image and normal map of our method.

Appendix F More Results

F.1 Comparison with SweetDreamer

In comparison to SweetDreamer [33], a concurrent work that does not have publicly available code, we selected 40 text prompts from their project page or paper (20 prompts from their NeRF-based approach and 20 from their DMTet-based approach). Since the DMTet-based method from SweetDreamer visualizes the shape with the shading normal instead of the geometry normal, it is not feasible to directly compare the geometry quality. Therefore, we focus solely on evaluating the overall quality of the textured models.

Figure 14 illustrates the results of the user study comparing our method with SweetDreamer. A clear majority of participants preferred our method. Specifically, when compared with SweetDreamer’s NeRF-based approach, 68% of users selected Ours (NeRF) as their preferred choice. When compared with SweetDreamer’s DMTet-based approach, 64% of users chose Ours (Sphere) as their preferred option. This outcome indicates that participants favored our method over SweetDreamer in 3D generation.

Figure 15 and Figure 16 present the comparisons with SweetDreamer (NeRF-based) and SweetDreamer (DMTet-based), respectively. We can observe that our method generates better geometry compared to SweetDreamer.

Figure 17: Comparison between Ours (NeRF) and Ours (Sphere). Columns: Ours (NeRF), Ours (Sphere).

F.2 Discussion for Ours (Sphere) and Ours (NeRF)

To better understand the behavior of our method, we discuss the differences between Ours (Sphere) and Ours (NeRF). In our experiments, we found that initializing DMTet from a NeRF and from a sphere each has advantages in different cases.

With NeRF initialization, it is easier to generate scenes with multiple objects, e.g., “a group of dogs playing poker”. Figure 17 (a) shows the comparison between Ours (NeRF) and Ours (Sphere) in this case. However, compared to optimization from a sphere, the geometry from NeRF initialization tends to be smoother, making it difficult to generate surfaces with highly detailed structures, as shown in Figure 17 (b).

In the future, we aim to devise a novel hybrid representation that combines both initialization methods. This approach will allow us to leverage the strengths of each initialization and potentially yield improved results.

F.3 More Visual Results

We present more visual results for Ours (Sphere) in Figures 18-21 and Ours (NeRF) in Figures 22-25.

Figure 18: Visual results of Ours (Sphere) (Part I).
Figure 19: Visual results of Ours (Sphere) (Part II).
Figure 20: Visual results of Ours (Sphere) (Part III).
Figure 21: Visual results of Ours (Sphere) (Part IV).
Figure 22: Visual results of Ours (NeRF) (Part I).
Figure 23: Visual results of Ours (NeRF) (Part II).
Figure 24: Visual results of Ours (NeRF) (Part III).
Figure 25: Visual results of Ours (NeRF) (Part IV).