DreamPBR: Text-driven Generation of High-resolution SVBRDF with Multi-modal Guidance

Linxuan Xin (Peking University, Shenzhen, China), Zheng Zhang (Huawei Cloud Computing Technologies Co., Ltd., Hangzhou, China), **fu Wei (Tsinghua University, Shenzhen, China), Wei Gao (School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, Shenzhen, China), and Duan Gao (Huawei Cloud Computing Technologies Co., Ltd., Shenzhen, China)
Abstract.

Prior material creation methods are limited in the diversity of their results, mainly because reconstruction-based methods rely on real-world measurements and generation-based methods are trained on relatively small material datasets. To address these challenges, we propose DreamPBR, a novel diffusion-based generative framework designed to create spatially-varying appearance properties guided by text and multi-modal controls, providing high controllability and diversity in material generation. The key to achieving diverse and high-quality PBR material generation lies in integrating the capabilities of recent large-scale vision-language models trained on billions of text-image pairs with material priors derived from hundreds of PBR material samples. We utilize a novel material Latent Diffusion Model (LDM) to establish the mapping between albedo maps and the corresponding latent space. The latent representation is then decoded into full SVBRDF parameter maps using a rendering-aware PBR decoder. Our method supports tileable generation through convolution with circular padding. Furthermore, we introduce a multi-modal guidance module, which includes pixel-aligned guidance, style image guidance, and 3D shape guidance, to enhance the control capabilities of the material LDM. We demonstrate the effectiveness of DreamPBR in material creation, showcasing its versatility and user-friendliness on a wide range of controllable generation and editing applications.

Physically-based Rendering, Spatially Varying Bidirectional Reflectance Distribution Function, Multimodal Deep Generative Model, Deep Learning
CCS Concepts: Computing methodologies → Rendering; Computing methodologies → Artificial intelligence
Figure 1. DreamPBR, an innovative material generation framework, enables personalized creation with multi-modal controls. We present various controls such as text descriptions (4, 5, 6, 7), binary images (2, 9, 10), RGB images (3, 11, 12, 13), segmented geometry (8), and their combination (1) in this figure. The high-quality and tileable textures from DreamPBR show high applicability in different objects.

1. Introduction

High-quality materials are crucial for achieving photorealistic rendering. Despite advancements in appearance modeling over the past few decades, material creation remains a challenging research area. Material creation approaches can be categorized into reconstruction-based methods and generation-based methods. Reconstruction-based methods use one or many input photographs to estimate surface reflectance properties either through optimization-based inverse rendering (Gao et al., 2019; Guo et al., 2020; Hu et al., 2019) or deep neural network inference (Deschaintre et al., 2018a; Guo et al., 2023). However, the scope of these methods is constrained to real-world photographs, limiting their ability to create imaginative and creative materials.

Recent approaches have explored material generation (Guo et al., 2020; Zhou et al., 2022) using Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). However, these methods are typically trained on hundreds to thousands of materials, which pales in comparison to the billions of images used in large-scale language-image generative models. This limited dataset size restricts the diversity of the generated materials. Furthermore, GAN-based methods suffer from training challenges, including unstable training, mode collapse, and scalability issues with large datasets. On the other hand, diffusion models (Ho et al., 2020; Rombach et al., 2022) have shown significant advancements, exhibiting advantages in scalability and diversity. Recent advances (Poole et al., 2022; Wang et al., 2023) leverage 2D diffusion models as priors for generating 3D content. However, these methods mainly focus on implicit representations or textured meshes, and lack the capability to disentangle physically based material from illumination.

To address these challenges, we introduce DreamPBR, a novel generative framework for creating high-resolution spatially-varying bidirectional reflectance distribution functions (SVBRDFs) conditioned with text inputs and a variety of multi-modal guidance. The main advantages of our method lie in generating diversity and controllability. Our method can generate semantically correct and detailed materials based on various textual prompts, ranging from highly structured materials with stationary patterns to imaginative materials with flexible content, such as a Hello Kitty carpet (as shown in Figure 1).

The key idea of our method is to integrate pretrained 2D text-to-image diffusion models (Rombach et al., 2022) with material priors to generate high-fidelity and diverse materials. While 2D text-to-image Latent Diffusion Models (LDMs) excel in generating natural images, they struggle to produce spatially-varying physically-based material maps due to the large domain gap between natural images and materials. Consequently, adapting pretrained 2D diffusion models to the material domain, while preserving both quality and diversity, is a non-trivial research task. We introduce a novel material LDM, learned with a two-stage strategy, to address this challenge. In the first stage, we observe that an albedo map is a specialized RGB image that stores spatially-varying surface reflectance in its RGB pixel values. We transfer the pretrained LDM from the text-to-image domain to the text-to-albedo domain using fine-tuning, which can be regarded as distillation from a large source domain (natural images) to a relatively small target domain (albedo texture maps) by leveraging the target domain priors. In the second stage, we leverage a PBR decoder to reconstruct SVBRDFs from the latent space of albedo maps learned in the first stage. We employ a decoder-only architecture for SVBRDF generation for two reasons: 1. The generated SVBRDF parameter maps exhibit strong correlations, since they share a common latent representation as the starting point for decoding. 2. The decoder module does not compromise generation diversity, as we keep the denoising UNet frozen during the training of the PBR decoder. Additionally, we introduce a highlight-aware decoder for the albedo map to further enhance regularization.

We introduce a multi-modal guidance module designed to serve as the conditioning mechanism for our material LDM, enabling a wide variety of controls for user-friendly material creation. Specifically, this guidance module includes three key components: Pixel Control allows pixel-aligned guidance from inputs like sketches or inpainting masks. Style Control extracts style features from reference images and employs them to guide the generation process. Shape Control enables automatic material generation for a given 3D object with segmentations with an optional 2D exemplar image for reference. Importantly, our framework supports the concurrent use of multiple guidances seamlessly.

We have trained our DreamPBR method on a publicly available SVBRDF dataset, comprising over 700 high-resolution (2K) SVBRDFs. Thanks to the convolutional backbone of LDM, seamless tileable material generation can be supported by utilizing circular padding in all convolutional operators.

To summarize, our main contributions are as follows:

  • We introduce a novel generative framework for high-quality material generation under text and multi-modal guidances that combine pretrained 2D diffusion model and material domain priors efficiently;

  • We present a rendering-aware decoder module that learns the mapping from a shared latent space to SVBRDFs;

  • Our multi-modal guidance module offers rich user-friendly controllability, enabling users to manipulate the generation process effectively;

  • We propose an image-to-image editing scheme that facilitates material editing tasks such as stylization, inpainting, and seamless texture synthesis.

2. Related Work

2.1. Material estimation

Material estimation approaches aim to acquire material data from real-world measurements under varying viewpoints and lighting conditions. We specifically focus on recent material estimation methods that utilize lightweight capture setups using consumer cameras. For a more comprehensive overview of general appearance modeling, please refer to surveys (Dong, 2019; Weinmann and Klein, 2015; Guarnera et al., 2016).

Methods have been developed to leverage multiple images or video sequences captured by a handheld camera to estimate appearance properties. Due to the limitations of lightweight setups, most approaches still rely on regularization such as handcrafted heuristics for diffuse/specular separation (Riviere et al., 2016; Palma et al., 2012), linear combinations of basis BRDFs (Hui et al., 2017), and sparsity assumption for incident lighting (Dong et al., 2014). Another class of methods focuses on reducing the number of input images by leveraging material priors such as stationary materials  (Aittala et al., 2015, 2016), homogeneous or piece-wise materials  (Xu et al., 2016), and spatially sparse materials  (Zhou et al., 2016).

In recent years, deep learning-based methods have shown significant progress in recovering SVBRDFs from a single image (Li et al., 2017; Deschaintre et al., 2018a; Li et al., 2018; Guo et al., 2021, 2023; Henzler et al., 2021). These methods employ deep convolutional neural networks to predict plausible SVBRDFs from in-the-wild input images in a feed-forward manner. Deschaintre et al. (2019) extended a single-image-based solution to multiple images by latent space max-pooling. More recent work by Gao et al. (2019) introduced a deep inverse rendering pipeline that enables appearance estimation from an arbitrary number of input images. In procedural material modeling, Hu et al. (2019), Shi et al. (2020), and Hu et al. (2022a) proposed to optimize material parameters with fixed node graphs to match input images. Hu et al. (2022b) introduced a new pipeline that eliminates the need for predefined node graphs. Most recently, Sartor and Peers (2023) proposed a diffusion-based model to estimate the material properties from a single photograph.

The methods mentioned above rely on captured photographs to reconstruct material and cannot produce non-real-world materials. In contrast, our approach can generate diverse and creative SVBRDFs using natural language inputs.

2.2. Generative models

Image generation

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have demonstrated remarkable capabilities in producing high-fidelity images. Subsequent research has focused on GAN improvements such as training stability (Kodali et al., 2017; Karras et al., 2018), attribute disentanglement (Karras et al., 2019), conditional controllability (Li et al., 2021; Park et al., 2019), and generation quality (Karras et al., 2020, 2021). GAN can be used in various applications including text-to-image synthesis (Reed et al., 2016b, a; Zhu et al., 2019), image-to-image translation (Isola et al., 2018; Zhu et al., 2020), video generation (Tulyakov et al., 2017), and even 3D shape generation (Li et al., 2019).

Recent advancements in text-to-image generation have been mainly driven by diffusion models (DMs) (Sohl-Dickstein et al., 2015; Ho et al., 2020; Ramesh et al., 2022). Later advancements (Song et al., 2020, 2021; Liu et al., 2022) have explored efficient sampling strategies to significantly reduce the number of required sampling steps, thereby improving image generation performance.  Rombach et al. (2022) proposed Latent Diffusion Model (LDM) to perform the denoising process in learned compact latent space, enabling high-resolution image synthesis and efficient image manipulation.

Controllable generation

Integrating multi-modal controllability into a text-to-image diffusion model is crucial for creation applications. Recent research (Denis Zavadski and Rother, 2023; Zhang et al., 2023; Ye et al., 2023; Hu et al., 2022c; Mou et al., 2023) has focused on lightweight multi-modal controllability without the requirements of extensive data and high computational power. Hu et al. (2022c) introduces a fine-tuning strategy using low-rank matrices, enabling domain-specific adaptation. Zhang et al. (2023) proposed ControlNet, adding spatial conditioning to diffusion models for precise generation control. Ye et al. (2023) presented a lightweight framework enhancing diffusion models with image prompts using a decoupled cross-attention mechanism.

Material generation

Guo et al. (2020) proposed an unconditional MaterialGAN for synthesizing SVBRDFs from random noise. The learned latent space facilitates efficient material estimation in inverse rendering. Zhou et al. (2022) developed a StyleGAN2-based model, conditioned on spatial structure and material category, for tileable material synthesis. These GAN-based methods show advantages in generating high-resolution and visually compelling materials. However, their diversity is constrained by the training instability of GANs and the limited range of training datasets. In procedural material generation, Guerrero et al. (2022) first introduced a transformer-based autoregressive model. Later work by Hu et al. (2023) proposed a multi-modal node graph generation architecture for creating high-quality procedural materials, guided by both text and image inputs. While procedural representations are compact and resolution-independent, they are limited to stationary patterns and cannot create arbitrary styles.

In concurrent work, Vecchio et al. (2023) introduced ControlMat, a diffusion-based material generative model capable of generating tileable materials using text and a single photograph as input. This model was trained on a synthetic material dataset comprising 126,000 samples, derived from 8,615 raw material graphs. While quite large for the material domain, this dataset is relatively small compared to the billions of text-image pairs used in text-to-image diffusion model training. This scale discrepancy leads to constrained diversity. Furthermore, this work only supports guidance from text and a single photograph, which limits the range of scenarios it can address.

In contrast, our method significantly enhances material generation diversity through the efficient integration of pretrained diffusion models with material priors. We also provide a variety of user-friendly controls for guiding the generation process, expanding the scope and flexibility of applications.

2.3. Text-to-3D Generation

Transitioning 2D text-to-image approaches to 3D generation presents significant challenges, mainly due to the lack of large-scale labeled 3D datasets. Recent approaches (Poole et al., 2022; Wang et al., 2023; Lin et al., 2023; Tang et al., 2023) have explored text-to-3D generation without depending on 3D data. Poole et al. (2022) integrated Score Distillation Sampling (SDS) with text-to-image diffusion models. Wang et al. (2023) further improved the quality and diversity by introducing Variational Score Distillation (VSD). The development of large-scale 3D datasets (Deitke et al., 2023) enabled direct learning from 3D data (Liu et al., 2023; Shi et al., 2023). However, current 3D generation methods mainly focus on geometry modeling and fail to produce high-quality, disentangled materials.

Park et al. (2018) introduced a neural method to assign materials from a predefined set to different parts of a 3D shape. Extending this, Hu et al. (2022d) employs a translation network to establish the correspondence between 2D exemplar image and 3D shape. This allows for extracting material cues from 2D images and selecting optimal materials from a candidate pool using a perceptual metric. However, these methods are constrained by the variety of their predefined material assets and lack the ability to transfer complex spatial patterns from 2D exemplars to 3D shapes. In contrast, our generative model can produce diverse materials and effectively transfer spatial structures from 2D exemplar images to 3D models, showcasing a significant advancement in material assignment.

3. Method

3.1. Overview

Figure 2. Overview of DreamPBR: The denoising UNet in our Material LDM is trained with only albedo textures (upper left) and a PBR Decoder with Highlight Aware Decoder is used to transform albedo textures to other physically-based textures (middle right). In the blue box on the left, we present three individual control modules: Pixel Control, Style Control, and Shape Control, whose results under controls are shown on the lower right. Besides, an additional Rendering-aware-super-resolution module is given for higher-quality textures (upper right).
Preliminaries

The goal of our method is to generate spatially-varying materials which are represented by the Cook-Torrance microfacet BRDF model with the GGX normal distribution function (Walter et al., 2007). Specifically, we use the metallic-based PBR workflow and represent surface reflectance properties as an albedo map $\mathcal{P}$, a normal map $\mathcal{N}$, a roughness map $\mathcal{R}$, and a metallic map $\mathcal{M}$.
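For readers unfamiliar with the GGX term named above, the following minimal sketch evaluates the GGX (Trowbridge-Reitz) normal distribution function in its standard form, $D(h) = \alpha^2 / (\pi ((n \cdot h)^2 (\alpha^2 - 1) + 1)^2)$. The remapping $\alpha = \text{roughness}^2$ is an assumption (a common convention that the text does not specify), and the function name `ggx_ndf` is hypothetical.

```python
import math


def ggx_ndf(n_dot_h, roughness):
    """GGX (Trowbridge-Reitz) normal distribution term D(h).

    n_dot_h:   cosine between the surface normal and the half vector, in [0, 1].
    roughness: perceptual roughness in (0, 1].
    Assumption: alpha = roughness**2 (a common remapping, not stated in the text).
    """
    alpha = roughness * roughness
    alpha2 = alpha * alpha
    denom = n_dot_h * n_dot_h * (alpha2 - 1.0) + 1.0
    return alpha2 / (math.pi * denom * denom)


# At roughness 1 the distribution is uniform: D = 1 / pi for every direction.
print(ggx_ndf(1.0, 1.0), ggx_ndf(0.3, 1.0))
```

Lower roughness concentrates the distribution around the normal, which is what produces sharper highlights in the rendered material.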

DreamPBR is a Latent Diffusion Model (LDM)-based generative framework capable of producing diverse, high-quality SVBRDF maps under text and multi-modal guidance, as illustrated in Figure 2.

The core generative module of our framework is the material Latent Diffusion Model (material LDM), which takes a textual description $T$ as input to encode high-dimensional surface reflectance properties into a compact latent representation $z$. This representation effectively compresses complex material data and guides the SVBRDF decoder in reconstructing detailed SVBRDF maps (i.e., albedo, normal, roughness, and metallic) $S = \{\mathcal{P}, \mathcal{N}, \mathcal{R}, \mathcal{M}\}$. Our critical observation is that while pre-trained text-to-image diffusion models can capture a wide range of natural images that fulfill the diversity needs of material generation, their flexibility often leads to less plausible materials due to the absence of material priors. Instead of training the material LDM from scratch with limited material data, we opted to fine-tune a pre-trained text-to-image diffusion model with target material data. This strategy effectively tailors the model from a broad image domain to a specific material domain, ensuring both diversity and authenticity of output.

Our text-to-material framework seamlessly integrates three types of control modules to enhance material generation capabilities. First, we introduce the Pixel Control module $G_P$, which takes pixel-aligned inputs (e.g., sketches, masks) and utilizes the ControlNet architecture (Zhang et al., 2023). It adds conditional controls to diffusion models, providing spatial guidance for material generation. Second, we use the Style Control module $G_I$ to extract image features from the input image prompt, which are then used to adapt the material LDM via cross-attention. Third, we propose a Shape Control module $G_S$ to generate SVBRDF maps automatically for a given segmented 3D shape. This module can leverage large language models to generate text prompts corresponding to different parts of the input shape. It also supports taking a 2D photo exemplar as additional input, enabling the generation of material maps for each segmented part, guided by the segmented 2D image. In the rest of this section, we describe the key components of our framework. Section 3.2 introduces our core text-to-material module, which enables tileable, diverse material generation. Next, Section 3.3 describes the SVBRDF decoder, responsible for reconstructing high-resolution SVBRDF maps from a unified latent space. Finally, Section 3.4 discusses the multi-modal control module, which provides image and 3D control capabilities to the diffusion model.

3.2. Physically-based material diffusion

Our material LDM transforms text features $\tau(T)$, extracted by CLIP's text encoder $\tau(\cdot)$ (Radford et al., 2021) from user prompts $T$, into a latent representation $z$ of SVBRDF maps $S$. The latent space is characterized by a Variational Autoencoder (VAE) architecture $\mathcal{E}$ (Kingma and Welling, 2014; Rezende et al., 2014). Specifically, for an albedo map $\mathcal{P} \in \mathbb{R}^{H \times W \times C}, C = 3$, the map is compressed into latent space $z = \mathcal{E}(\mathcal{P}) \in \mathbb{R}^{h \times w \times c}$. Consistent with Rombach et al. (2022), we adopt the parameters $c = 4, h = H/8, w = W/8$.
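As a quick check of the latent dimensions stated above, the following sketch computes $(h, w, c)$ from the albedo resolution using the downsampling factor 8 and $c = 4$ given in the text. The helper name `latent_shape` and the divisibility check are illustrative assumptions.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Return (h, w, c) of the latent z for an H x W albedo map.

    The factor 8 and c = 4 are the values given in the text; requiring the
    resolution to be a multiple of the factor is an assumption of this sketch.
    """
    if height % downsample or width % downsample:
        raise ValueError("height and width must be multiples of the downsampling factor")
    return height // downsample, width // downsample, latent_channels


# A 512 x 512 x 3 albedo map is compressed to a 64 x 64 x 4 latent.
print(latent_shape(512, 512))
```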

The core component of the diffusion model is the denoising U-Net module (Ronneberger et al., 2015), which is conditioned on the timestep $t$. Following Denoising Diffusion Probabilistic Models (DDPM) (Ho et al., 2020), our model employs a fixed (non-learned) forward diffusion process $q(z_t | z_{t-1})$ that gradually transforms latent vectors $z$ towards an isotropic Gaussian distribution. The U-Net is trained to reverse the diffusion process, $q(z_{t-1} | z_t)$, iteratively denoising Gaussian noise back into latent vectors. Adopting the strategy proposed by Rombach et al. (2022), we incorporate the text feature $\tau(T) \in \mathbb{R}^{M \times d_\tau}$ into the intermediate layers of the U-Net through a cross-attention mechanism $\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) \cdot V$, where $Q = W_Q^i \cdot \varphi_i(z_t)$, $K = W_K^i \cdot \tau(T)$, $V = W_V^i \cdot \tau(T)$, $\varphi_i(z_t) \in \mathbb{R}^{N \times d_\epsilon^i}$ represents an intermediate representation of the U-Net $\epsilon_\theta$, and $W_V^i \in \mathbb{R}^{d \times d_\epsilon^i}$, $W_Q^i \in \mathbb{R}^{d \times d_\tau^i}$, $W_K^i \in \mathbb{R}^{d \times d_\tau^i}$ are learnable projection matrices.
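The cross-attention step can be illustrated with a minimal single-head sketch in plain Python. It is not the authors' implementation: it assumes a row-vector convention (tokens are rows), and projection shapes chosen so that every product is well defined (`w_q` maps image tokens of width $d_\epsilon$ to $d$, while `w_k` and `w_v` map text tokens of width $d_\tau$ to $d$). All helper names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(scores):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_attention(phi, text, w_q, w_k, w_v):
    """Single-head cross-attention.

    phi:  N x d_eps image tokens (flattened U-Net features).
    text: M x d_tau text tokens (e.g. CLIP text features).
    w_q:  d_eps x d, w_k: d_tau x d, w_v: d_tau x d.
    Returns the N x d attended features and the N x M attention weights.
    """
    q = matmul(phi, w_q)   # queries come from the image tokens
    k = matmul(text, w_k)  # keys come from the text tokens
    v = matmul(text, w_v)  # values come from the text tokens
    d = len(q[0])
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, transpose(k))]
    weights = softmax_rows(scores)
    return matmul(weights, v), weights


phi = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # N = 3, d_eps = 2
text = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]    # M = 2, d_tau = 3
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
w_v = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
out, weights = cross_attention(phi, text, w_q, w_k, w_v)
print(len(out), len(out[0]))  # one d-dimensional feature per image token
```

Each image token receives a convex combination of the text value vectors, which is how the prompt steers every spatial location of the latent.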

Our material LDM is fine-tuned on text-material pairs via:

(1) $\mathcal{L}_{ldm} := \mathbb{E}_{\mathcal{E}(\mathcal{P}), T, \epsilon \sim N(0,1), t}\left[\left\| \epsilon - \epsilon_\theta\left(z_t, t, \tau_\theta(T)\right) \right\|_2^2\right].$
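To make the objective in Eq. (1) concrete, the sketch below evaluates a single-sample estimate of the noise-prediction loss on a flattened latent. It assumes the closed-form DDPM noising $z_t = \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$ with the linear $\beta$ schedule of Ho et al. (2020); the schedule actually used for fine-tuning is not stated in the text. `predict_noise` is a hypothetical stand-in for the text-conditioned U-Net $\epsilon_\theta$.

```python
import math
import random


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t under a linear schedule."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def add_noise(z0, eps, t):
    """Sample z_t from q(z_t | z_0) in closed form."""
    a = alpha_bar(t)
    return [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]


def noise_prediction_loss(z0, t, predict_noise, rng):
    """Single-sample estimate of ||eps - eps_theta(z_t, t)||_2^2."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    z_t = add_noise(z0, eps, t)
    eps_hat = predict_noise(z_t, t)  # text conditioning is omitted in this sketch
    return sum((e - p) ** 2 for e, p in zip(eps, eps_hat))


rng = random.Random(0)
z0 = [0.5, -1.0, 2.0, 0.0]
# An untrained predictor that always outputs zero incurs a positive loss.
print(noise_prediction_loss(z0, 500, lambda z_t, t: [0.0] * len(z_t), rng))
```

In training, the expectation in Eq. (1) is approximated by averaging this quantity over random latents, prompts, timesteps, and noise samples.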

Seamless tileable texture synthesis

Creating tileable texture maps is critical in material generation and must meet two requirements: a) maintenance of consistent spatial patterns and visual appearance, and b) the ability to tile textures without visible artifacts such as seams and blocks.

While zero padding is the standard practice in CNNs, we found that circular padding is particularly effective for seamless content generation. We employ circular padding in all convolutional layers of our generative model for two main reasons:

  1. Continuity across boundaries. Unlike classic padding methods such as zero padding, which may introduce artificial edges, circular padding ensures boundary continuity. It wraps image content around both horizontal and vertical boundaries, providing a seamless transition when tiling.

  2. Pattern preservation. Circular padding mainly affects the boundary area of the image, leaving the central area and overall texture patterns unchanged.

Our tileable generation algorithm can serve two purposes: firstly, it can inherently produce tileable material maps without additional post-processing. Secondly, it can transform a non-tileable texture into a tileable version through an image-to-image generation pipeline, maintaining visual similarity with the original.
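The effect of circular padding can be seen in a small plain-Python sketch (not the authors' code). It pads a 2D grid by wrapping content around both axes and applies a same-size convolution on top of it; the function names are hypothetical.

```python
def circular_pad(image, pad):
    """Pad a 2D list by wrapping content around both axes."""
    h, w = len(image), len(image[0])
    return [[image[(r - pad) % h][(c - pad) % w] for c in range(w + 2 * pad)]
            for r in range(h + 2 * pad)]


def conv2d_circular(image, kernel):
    """Same-size 2D correlation with an odd square kernel and circular padding."""
    k = len(kernel)
    pad = k // 2
    padded = circular_pad(image, pad)
    h, w = len(image), len(image[0])
    return [[sum(kernel[i][j] * padded[r + i][c + j]
                 for i in range(k) for j in range(k))
             for c in range(w)]
            for r in range(h)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
# The top padded row is a wrapped copy of the bottom image row.
print(circular_pad(image, 1)[0])  # [9, 7, 8, 9, 7]
```

Because every pixel sees its wrapped neighbours, the layer commutes with cyclic shifts of the input, so the left/right and top/bottom edges are treated like any interior location. In PyTorch, `torch.nn.Conv2d` exposes the same behaviour through `padding_mode="circular"`; whether the authors used that option is an assumption.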

3.3. Render-aware SVBRDF decoder

The SVBRDF decoder, denoted as $\mathcal{D} = \{\mathcal{D}_P, \mathcal{D}_S\}$, decodes the unified latent representation $z$ into SVBRDFs $S \coloneqq \{\mathcal{P}, \mathcal{N}, \mathcal{R}, \mathcal{M}\} = \mathcal{D}(z)$. Here, $\mathcal{P}, \mathcal{N} \in \mathbb{R}^{H \times W \times 3}$ and $\mathcal{M}, \mathcal{R} \in \mathbb{R}^{H \times W \times 1}$. In our implementation, we set $H = W = 512$. Specifically, we utilize separate decoder networks: $\mathcal{D}_P(z)$ for the albedo map $\mathcal{P}$, and $\mathcal{D}_S(z)$ for the other property maps $\{\mathcal{N}, \mathcal{R}, \mathcal{M}\}$. These decoder networks follow the VAE decoder architecture (Kingma and Welling, 2014; Rezende et al., 2014) and are initialized with the weights of a pre-trained VAE decoder.

Training of PBR decoder

The training loss function for our PBR decoder $\mathcal{D}_S$ comprises the following terms:

(2) $\mathcal{L}_{PBR} = \mathcal{L}_{map} + \mathcal{L}_{perp} + \mathcal{L}_{gan} + \mathcal{L}_{reg} + \mathcal{L}_{render},$
(3) $\mathcal{L}_{render}(x, y) = \lVert \log(x + 0.01) - \log(y + 0.01) \rVert_1,$

where $\mathcal{L}_{map}$ is the $L_1$ loss on the material property maps, $\mathcal{L}_{perp}$ is the perceptual loss based on LPIPS (Zhang et al., 2018), $\mathcal{L}_{gan}$ is the generative adversarial loss, $\mathcal{L}_{reg}$ is the Kullback-Leibler divergence penalty, and $\mathcal{L}_{render}$ is the $L_1$ log rendering loss applied to the rendered images.

For the rendering loss, we adopt the sampling scheme proposed by Deschaintre et al. (2018b) to render nine images per material map: three images rendered with independently sampled distant light and view directions, and six images rendered with near-field mirrored view and light directions. The rendering loss yields desirable SVBRDF reconstructions by encouraging training to focus on the errors in the material parameters that matter most for appearance, rather than weighting all parameters equally.

Highlight-aware albedo decoder

As previously mentioned in Section 3.2, our material LDM training utilizes the standard VAE decoder to map the latent space to the albedo map. While effective in generating plausible RGB images, this decoder tends to produce images with strong highlights, especially for shiny materials such as leather and metal.

To address this, we introduce a highlight-aware albedo decoder $\mathcal{D}_P$, which is fine-tuned on a synthetic shaded-to-albedo dataset; this provides strong regularization that effectively suppresses highlight artifacts in albedo maps. For each material sample in our SVBRDF dataset, we simulate various lighting conditions and viewpoints by randomly positioning point lights and cameras parallel to the material plane, and then render the SVBRDFs into reference shaded images with a physically-based renderer.

During training, the default VAE image encoder maps the shaded images into latent space, and the resulting latents are then decoded back to image space by our specialized albedo decoder. The training process for this decoder follows the original VAE loss function (Kingma and Welling, 2014).

Material super-resolution

High-resolution material maps are essential for achieving photorealistic renderings. However, due to memory and performance constraints, current diffusion models typically generate images at a resolution of $512 \times 512$, which falls short of high-quality production rendering.

We introduce a material super-resolution module comprising four super-resolution networks $SR$, each following the Real-ESRGAN architecture (Wang et al., 2021). These super-resolution networks, denoted as $SR_P, SR_N, SR_R, SR_M$, are designed to increase the resolution of the different SVBRDF property maps to $2{,}048 \times 2{,}048$.

We fine-tune Real-ESRGAN, which was originally trained on purely synthetic data, on our material data so that it captures the high-frequency details of materials more effectively. We incorporate a rendering loss (similar to Equation 3) into the training of the super-resolution module to ensure that the generated details contribute to high-frequency shading effects rather than visual artifacts. Note that special care must be taken with normal maps during augmentation involving flipping and rotation: the directions stored in a normal map must be adjusted consistently with the map orientation so that they remain valid surface normals.
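The normal-map adjustment can be illustrated with a small sketch. It assumes normals are already decoded to tangent-space vectors in [-1, 1], stored as rows of (nx, ny, nz) tuples with x along the image's horizontal axis and y along its vertical axis; the function names are hypothetical. Only flips and a 180-degree rotation are shown, since other rotation angles depend on the axis convention of the normal map.

```python
def hflip_normal_map(nmap):
    """Mirror left-right: reverse each row and negate the x component."""
    return [[(-nx, ny, nz) for (nx, ny, nz) in reversed(row)] for row in nmap]

def vflip_normal_map(nmap):
    """Mirror top-bottom: reverse the row order and negate the y component."""
    return [[(nx, -ny, nz) for (nx, ny, nz) in row] for row in reversed(nmap)]

def rot180_normal_map(nmap):
    """A 180-degree rotation equals a horizontal flip plus a vertical flip."""
    return hflip_normal_map(vflip_normal_map(nmap))

# A 1x2 map: the left texel tilts toward +x, the right texel is flat.
example = [[(0.6, 0.0, 0.8), (0.0, 0.0, 1.0)]]
flipped = hflip_normal_map(example)
# After mirroring, the tilted texel sits on the right and tilts toward -x.
```

Moving the texels without negating the matching component would leave slopes pointing the wrong way, so bumps would appear lit from the opposite side after augmentation.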

3.4. Multi-modal control

We propose three control modules for DreamPBR: Pixel Control, Style Control, and Shape Control. These modules are designed to be decoupled, allowing for flexible combinations of multiple controls.

3.4.1. Pixel Control

Spatial property guidance is widely used in material creation by artists. Our Pixel Control module $G_P$ takes spatial control maps $I_P$ as input and uses the ControlNet architecture (Zhang et al., 2023) to guide the generation of spatially-consistent SVBRDFs. It supports controlled generation under sketch guidance and allows for image-to-image material inpainting with a binary mask.

Our material LDM, as described in Section 3.2, is adapted to the material domain, which enables plausible material generation controlled by pre-trained ControlNet checkpoints that were trained with 2D supervision. However, we found that fine-tuning the pre-trained ControlNet with material data significantly improves both the controllability and the quality of generated materials. Specifically, we initialize our ControlNet using the ControlNet 1.1 Scribble checkpoint and fine-tune it on our SVBRDF dataset. To generate the sketch guidance, we employ Pidinet (Su et al., 2021) for extracting sketches from albedo maps.

3.4.2. Style Control

The Style Control module $G_I$ takes an image prompt $I_S$ as input and extracts its style characteristics to guide material generation. Inspired by Ye et al. (2023), image prompts $I_S$ are first encoded into image features by CLIP’s image encoder and then embedded into the material LDM using a decoupled cross-attention adaptation module. Multimodal material generation can be achieved by accompanying the image prompt with a text prompt.
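To make the decoupled cross-attention idea concrete, the following sketch adds a second attention branch over image features to the usual text branch, in the spirit of Ye et al. (2023). It is a plain-Python illustration under simplifying assumptions (a single head, keys and values already projected, a hypothetical `scale` weight for the image branch), not the implementation used in DreamPBR.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for lists of row vectors."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        w = softmax(scores)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v))
                    for c in range(len(v[0]))])
    return out

def decoupled_cross_attention(q, k_text, v_text, k_img, v_img, scale=1.0):
    """Sum of a text branch and a separately weighted image branch."""
    text_out = attention(q, k_text, v_text)
    img_out = attention(q, k_img, v_img)
    return [[t + scale * i for t, i in zip(tr, ir)]
            for tr, ir in zip(text_out, img_out)]
```

Because the two branches share the query but keep separate keys and values, setting `scale` to zero recovers text-only conditioning, which is one way such a module can be combined with a text prompt.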

The Style Control module can effectively capture the appearance properties and structural information of the input images to generate realistic and coherent material maps. This functionality is particularly useful in scenarios where materials need to be created based on specific exemplar images, which is a frequent requirement in the material design industry. The interaction of the Style Control module with the Shape Control module is detailed in Section 3.4.3.

3.4.3. Shape Control

The Shape Control module $G_S$ takes a segmented 3D model $O_s = \{O, s\}$ ($s$ denotes the geometry segmentation) and an optional photo exemplar $I_o$ as input and automatically generates high-quality material maps for each segment. When provided with only a segmented 3D model and a basic text prompt, we leverage large language models (LLMs) such as ChatGPT (Achiam et al., 2023) to enrich the text descriptions for each segment. For instance, given a 3D chair model, the language model can generate diverse text descriptions for each part, such as the seat, legs, and armrests, each featuring varied design styles. Furthermore, integration with the existing Pixel Control and Style Control modules supports enhanced SVBRDF generation, ensuring superior quality and detailed material characteristics.

Our model integrates the material transfer pipeline TMT (Hu et al., 2022d) to automatically assign diverse generated materials to 3D shapes based on an image exemplar. The TMT pipeline involves two stages: firstly, translating color from exemplar image to the projection of 3D shape and vice versa for segmentation results; secondly, assigning materials to projected parts using a material classifier network, based on the translated image. Unlike Hu et al. (2022d), we do not rely on predefined material collections in material assignment. Instead, we use predicted material labels of TMT directly as text prompts and translated images as image prompts in the Style Control module, allowing high-quality SVBRDF generation for each part. The proposed algorithm offers two significant advantages over traditional material transfer models: it expands material diversity beyond limited predefined material collections and transfers not only color and category information but also comprehensive material attributes including styles and spatial structures from 2D exemplar to 3D shapes, leveraging the capabilities of our Style Control module.

4. Results

4.1. Implementation Details

[Figure 3 image grid: for each material type, five generated samples, each shown as a Render and SVBRDF pair. The tags of the five samples in each row are listed in column order.]
Brick: snow-covered bricks, winter, outdoor, house coastal barrier bricks, sea-salt resistant, outdoor, barrier stenciled brick floor, paving, terracotta, scratched narrow bricks, walls blackened fireplace bricks, charred
Fabric: tablecloth, delicate denim jacket texture, clothing hand woven carpet, artisan, carpet floral cotton dress, clothing backpack fabric, sturdy
Ground: ice glazed slippery, outdoor, winter aerial mud, road, tracks dry rocky ground marble floor, polished, indoor stone ground
Leather: perforated leather, breathable black leather decoration, indoor leather white, smooth reptile skin leather, textured
Metal: space cruiser panels, scifi wrought iron gate, ornate, outdoor golden metal wall, old anodized metal surface, industrial nickel plated hardware, smooth
Organic: alien slime forest leaves, natural, autumn, dirt dragon scales stylized animal fur honeycomb structure, geometric, natural, beehive
Plastic: plastic pattern, synthetic yoga mat synthetic plastic, rough reflective safety vest, clothing childrens playground slide, colorful
Tile: elegant, interior decoration art deco style tiles, vintage, indoor, decorative vintage ceiling tiles, indoor patterned bw vinyl, floors encaustic cement tiles, colorful, indoor, floor
Wall: dry stone wall, natural, outdoor street art graffiti, colorful, urban victorian wallpaper, patterned, indoor, historic stucco finish, mediterranean cliff, outdoor
Wood: blue, worn painted wood siding, walls parquet wood flooring, geometric charcoal varnished walnut, glossy, indoor bamboo wall covering, eco-friendly
Figure 3. Generation results of DreamPBR under text-only conditions. We randomly sampled materials of various types with a wide range of tags, using prompts of the form “a PBR material of [type], [tags]”. DreamPBR not only generates materials that match the descriptions but also creates out-of-domain materials, such as snow-covered bricks (brick), a children’s playground slide (plastic), and street art graffiti (wall).

[Figure 4 image grid: each sketch pattern in the first column spans two rows and is paired with six prompts, each shown with a Render and an SVBRDF. Every prompt begins with “a PBR material of”; the remainder of each prompt is listed below.]
Pattern 1: brick, narrow bricks, walls; leather, smooth, white, clean; metal, ornate celtic gold; fabric, plush toy fur, soft, indoor; plastic, synthetic turf blades, green, sports; tile, glass mosaic art, translucent, decorative
Pattern 2: ground, marble floor tiles, polished, indoor, luxury; fabric, dirty carpet, carpet, textile, faded, floor; wood, oak flooring, classic, indoor; leather, fabric leather, clean, seat, chair, couch; plastic, yoga mat; fabric, hand woven carpet, artisan, indoor
Pattern 3: tile, bathroom floor tiles, non-slip, indoor; tile, slate walkway tiles, rugged, outdoor; tile, art deco style tiles, vintage, indoor, decorative; wall, tiled bathroom wall, moisture-resistant, indoor; metal, scratched scuffed metal; brick, sewer brick, walls
Pattern 4: brick, brick floor, outdoor, clean, man made; metal, chrome car detailing, reflective, car trim; wood, burnt wood finish, charred, artistic, decor; tile, patterned bw vinyl, floors; wall, victorian wallpaper, patterned, indoor, historic; fabric, hand woven carpet, artisan, indoor, carpet
Pattern 5: wall, Hello Kitty sticker wallpaper, colorful, indoor, nursery, easy-apply; wall, street art graffiti, colorful, outdoor, urban; brick, multi-colored street bricks, vibrant, outdoor; fabric, embroidered linen, delicate, indoor, tablecloth; metal, metal plate, scifi; fabric, carpet, Hello Kitty outdoor picnic mat, durable, foldable
Pattern 6: ground, marble floor tiles, polished, indoor, luxury; leather, black motorcycle jacket, tough, clothing, jacket; ground, forest leaves, natural, leaves, autumn; metal, colored metal plate, scifi; brick, stenciled brick floor, man made, worn, paving, dry, terracotta; fabric, loose tablecloth
Figure 4. Pixel Control results with the same pattern but different materials. The binary images in the first column are the sketch control conditions. The generated materials to their right follow the given patterns while respecting material-specific properties, such as the edges of bricks and the growth rings of wood.

[Figure 5 image grid: one row per prompt, each with four Render and SVBRDF pairs generated under different control patterns.]
Row 1 prompt: a PBR material of metal, space cruiser panels
Row 2 prompt: a PBR material of wall, street art graffiti, colorful, outdoor, urban
Row 3 prompt: a PBR material of tiles, encaustic cement tiles, colorful, indoor, floor
Figure 5. Pixel Control results with the same material but different patterns. The first column shows the material descriptions, and the control patterns are in the lower right corner of the rendered images. As in Figure 4, the results show strong consistency between materials and patterns.

[Figure 6 image grid: one style image per row (first column), followed by four Render and SVBRDF pairs. Every prompt begins with “a PBR material of”; the remainder of each prompt is listed below in column order.]
Style 1: fabric, carpet; ground, stone, outdoor; wood; tile, marble
Style 2: tile, encaustic cement; wall, concrete wall, outdoor, cracked, man made, rough, painted; brick, street brick, outdoor; wood, varnished walnut, painted, artistic
Style 3: brick, street brick, outdoor, art; leather; ground, sidewalk; tile, encaustic cement
Style 4: fabric, hand woven carpet; tile, marble; wall, wallpaper; wood, synthetic wood, painted
Figure 6. Style Control results with the same style but different materials. The style images are given in the first column, and each material description is shown below its image. This provides users with more artistic ways to design textures.

[Figure 7 image grids: two 3×3 grids of samples. Top prompt: “a PBR material of wood”. Bottom prompt: “a PBR material of tile, encaustic cement tiles, indoor, floor”.]
Figure 7. Diverse sampling results under the same prompts. We evaluate diversity using the same basic description (top) and the same detailed description (bottom) with different random seeds. In both cases, the samples show clearly different patterns and textures even though the prompt is identical.

[Figure 8 image grid: four rows, each with two pairs of panels labeled Output and Expansion.]
Figure 8. Splicing results of our tileable textures. The first three rows show tileable results generated from text descriptions only, and the last row shows results with Pixel Control and Style Control, whose control conditions are attached to the edge of the images. All textures in this figure tile without visible seam artifacts.

[Figure 9 image grid, panel titles in reading order: Input, Yellow flower, Red flower, Blue flower, Cyan flower, Purple flower, Pink flower, Leaf, Grass.]
Figure 9. Inpainting results. The original texture, shown in the upper left corner, is generated by the prompt “a PBR material of the fabric, floral cotton fabric”. By choosing different tags (above each image) and regions to be inpainted (bright areas in each image), users can edit specified areas of the texture according to their preferences, for example changing a leaf into colorful flowers.

[Figure 10 image grid: columns are Prompt, Style, Pixel, Render, and SVBRDF, with one row per prompt.]
Row 1 prompt: a PBR material of tiles, marble
Row 2 prompt: a PBR material of wood, indoor
Row 3 prompt: a PBR material of tiles, art deco style tiles, vintage, indoor, decorative
Row 4 prompt: a PBR material of fabric, patchwork quilt, colorful, indoor, bedding
Row 5 prompt: a PBR material of fabric, hand woven carpet, cute bunny, artisan, indoor
Figure 10. Results of combined control. We combine material descriptions (first column), style images (second column), and binary images (third column) to control the generated textures. The results follow the material descriptions while incorporating both the given pattern and the given style.

[Figure 11 image.]
Figure 11. Results of Shape Control. We leverage a large language model (LLM) to describe the segmented chair legs and chair back, and the descriptions are then used to generate their textures with DreamPBR. Besides text-only descriptions (left two figures), additional user-specified controls such as RGB images (upper right) and binary images (lower right) are supported for personalized design.

[Figure 12 image grid: columns are MaterialGAN, TileGen, and Ours; the first four rows are Stone and the last four rows are Metal.]
Figure 12. Qualitative comparison on texture generation. We randomly sample several stone and metal materials from MaterialGAN (Guo et al., 2020) and TileGen (Zhou et al., 2022) and compare them with ours. Importantly, we can generate out-of-domain textures, as shown in the fourth and last rows of our results, which is beyond the capabilities of the GAN-based methods.

[Figure 13 image grid: eight rows of three columns, alternating between Ours (odd rows) and TileGen (even rows).]
Figure 13. Qualitative comparison on pixel guidance. We compare our Pixel Control against TileGen (Zhou et al., 2022) on leather (first 4 rows) and tile (last 4 rows). In some cases, TileGen fails to generate textures that match the given patterns, such as (column 1, row 4), (column 2, row 4), and (column 1, row 8). In others, its results contain artifacts introduced to fit the given patterns, such as (column 1, row 2), (column 2, row 6), (column 3, row 6), and (column 3, row 8). In contrast, our results follow the patterns more consistently while remaining natural for their material.

[Figure 14 image grid: rows are Reference, w/o $\mathcal{L}_{\text{render}}$, and Ours; each row shows two Render and SVBRDF pairs.]
Method | LPIPS (Render) | RMSE (Albedo) | RMSE (Metallic) | RMSE (Normal) | RMSE (Roughness)
w/o $\mathcal{L}_{\text{render}}$ | 0.107 | 0.0361 | 0.0126 | 0.0542 | 0.0406
Ours (w/ $\mathcal{L}_{\text{render}}$) | 0.101 | 0.0357 | 0.0086 | 0.0531 | 0.0365
Figure 14. Qualitative and quantitative comparison of the PBR decoder with and without rendering loss. We trained two PBR decoders, one with and one without the rendering loss. With the rendering loss, the decoded textures are visually more consistent with the reference in both the rendered images and the SVBRDF maps (especially the normal maps), and they achieve lower LPIPS on rendered images and lower RMSE on SVBRDF maps, as listed in the table.

[Figure 15 images: a full Reference rendering, followed by five zoomed-in panels labeled Reference, Low Res., Pretrained, w/o $\mathcal{L}_{\text{render}}$, and Ours (w/ $\mathcal{L}_{\text{render}}$).]
Method | LPIPS (Render) | RMSE (Albedo) | RMSE (Metal.) | RMSE (Normal) | RMSE (Rough.)
Pretrained | 0.450 | 0.0272 | 0.0816 | 0.0598 | 0.0588
w/o $\mathcal{L}_{\text{render}}$ | 0.342 | 0.0248 | 0.0652 | 0.0474 | 0.0451
Ours (w/ $\mathcal{L}_{\text{render}}$) | 0.321 | 0.0211 | 0.0643 | 0.0398 | 0.0445
Figure 15. Qualitative and quantitative comparison of the super-resolution module. We show zoomed-in renderings with high-resolution textures ($2048 \times 2048$), low-resolution textures ($512 \times 512$), and textures from three super-resolution modules (Wang et al., 2021) with different training strategies: the pre-trained model, the model fine-tuned without $\mathcal{L}_{\text{render}}$, and the model fine-tuned with $\mathcal{L}_{\text{render}}$ (ours). Our final method produces details and textures more consistent with the ground truth than the incomplete variants, as shown in the five images at the bottom of the figure. With fine-tuning on our dataset and $\mathcal{L}_{\text{render}}$, our final model gives the lowest LPIPS between renderings with super-resolved textures and renderings with real high-resolution textures, and the lowest RMSE between super-resolved SVBRDF maps and the ground truth, as listed in the table.

[Figure 16 images. (a) Generated cases: two rows of three pairs, each pair labeled w/o HA and w/ HA. (b) Test cases in dataset: one row with panels labeled Input, w/o HA, w/ HA (twice), and one row with panels labeled Input, Refer., w/o HA, w/ HA (twice).]
Method | Highlight inputs: L1 | PSNR | LPIPS | Non-highlight inputs: L1 | PSNR | LPIPS
w/o HA | 0.0409 | 25.7460 | 0.1928 | 0.0201 | 33.2621 | 0.1220
w/ HA | 0.0211 | 32.6578 | 0.1452 | 0.0202 | 33.2904 | 0.1241
Figure 16. Qualitative and quantitative comparison of the highlight-aware module (HA). As shown in 16(a), the original VAE albedo decoder may generate albedo maps with highlights, and the highlight-aware module removes them to give better generation results. As shown in 16(b), for inputs without highlights, the difference between using and not using the highlight-aware module is tiny, whereas for inputs with highlights, the module removes them. The table supports the same conclusion numerically for images with and without highlights by comparing the L1, PSNR, and LPIPS between real albedo maps and de-highlighted albedo maps.

[Figure 17 image grid: two rows of six images, labeled w/o ft and Ours.]
Figure 17. Qualitative comparison on fine-tuning ControlNet. We compare pixel-guided generation using the pre-trained ControlNet and our version fine-tuned from it. The textures from the pre-trained ControlNet (w/o ft) look more like natural images than material textures.

4.1.1. Dataset Generation

Our dataset comprises a total of 711 PBR materials, each including four 2k texture maps (albedo, normal, metallic, and roughness) along with corresponding textual labels. The data are sourced from PolyHaven (https://polyhaven.com/) and freePBR (https://freepbr.com/). We manually categorized the data into ten types: Brick (58), Fabric (60), Ground (99), Leather (45), Metal (130), Organic (45), Plastic (40), Tile (75), Wall (69), and Wood (90).

The input text prompt $\mathcal{T}$ follows the format “a PBR material of [type], [name], [tags]” during the fine-tuning of the material LDM, where ‘type’ refers to the type of material, and ‘name’ (title name) and ‘tags’ for each material are given by the source website. The tags are randomly retained at a ratio of 30% to 100% during training. To address the uneven distribution of the original data, we selected high-quality, representative data within the large categories and randomly duplicated existing data in the small categories, which balances the sample sizes across categories and gives a more uniform training distribution.
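The prompt format and tag retention can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes the retention ratio is drawn uniformly from [0.3, 1.0] for every prompt, that the retained count is rounded up, and that retained tags keep their original order; `build_prompt` and its parameters are hypothetical names.

```python
import math
import random

def build_prompt(material_type, name, tags, rng=random, min_keep=0.3):
    """Build 'a PBR material of [type], [name], [tags]' with random tag dropout."""
    parts = [material_type, name]
    if tags:
        # Assumed: a fresh retention ratio in [min_keep, 1.0] per prompt.
        keep_ratio = rng.uniform(min_keep, 1.0)
        n_keep = min(len(tags), math.ceil(keep_ratio * len(tags)))
        # Sample positions, then sort so the retained tags keep their order.
        kept_idx = sorted(rng.sample(range(len(tags)), n_keep))
        parts += [tags[i] for i in kept_idx]
    return "a PBR material of " + ", ".join(parts)

rng = random.Random(0)  # seeded generator for a reproducible example
prompt = build_prompt("brick", "narrow bricks",
                      ["walls", "outdoor", "red", "worn"], rng)
```

Dropping a random subset of tags on each pass exposes the model to both short and detailed descriptions of the same material, which plausibly helps it respond to prompts of varying specificity at inference time.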

For the 2k textures we obtained, we perform horizontal flipping, vertical flipping, random rotation, and multi-scale cropping, adjusting the directions of the normal maps accordingly, and finally resize them to $512 \times 512$ pixels as our training data. After augmenting the textures, we render each of them under randomly sampled viewpoints and lighting using the renderer of Laine et al. (2020). The rendered images are also used to train our highlight-aware albedo decoder.

For the paired data used to train ControlNet, we use Pidinet (Su et al., 2021) to extract sketches from the albedo maps, as mentioned in Section 3.4.1.

4.1.2. Other Details

DreamPBR was trained on four NVIDIA RTX 3090 GPUs. For the material LDM, we use Adam as the optimizer with a base learning rate of $1.6 \times 10^{-3}$ and learning rate scaling disabled. Starting from the stable-diffusion-v1-5 checkpoint, we fine-tuned it for 9000 epochs, which took approximately 10 days. For the PBR decoder, we set the base learning rate to $4.5 \times 10^{-6}$ and enabled scale_lr; training took 4 days in total. The decoder has 8 output channels: three each for albedo and normal, and one each for metallic and roughness. For the highlight-aware albedo decoder, we set the base learning rate to $4.5 \times 10^{-6}$ and enabled scale_lr; training took 2 days in total, and the decoder has 3 output channels. We incorporate the rendering loss detailed in Section 3.3 during the training processes above.
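As a small illustration of the 8-channel decoder output, the sketch below splits a list of channel planes into the four SVBRDF maps. The channel order (albedo, normal, metallic, roughness) is an assumption based on the order in which the maps are listed in the text, and `split_svbrdf_channels` is a hypothetical helper.

```python
# Assumed channel order and widths: 3 + 3 + 1 + 1 = 8.
CHANNEL_LAYOUT = (("albedo", 3), ("normal", 3), ("metallic", 1), ("roughness", 1))

def split_svbrdf_channels(channels):
    """Split a list of 8 channel planes into named SVBRDF maps."""
    expected = sum(width for _, width in CHANNEL_LAYOUT)
    if len(channels) != expected:
        raise ValueError(f"expected {expected} channels, got {len(channels)}")
    maps, start = {}, 0
    for name, width in CHANNEL_LAYOUT:
        maps[name] = channels[start:start + width]
        start += width
    return maps

# Eight 1x1 channel planes holding their own index, for demonstration.
decoded = [[[float(i)]] for i in range(8)]
svbrdf = split_svbrdf_channels(decoded)
```

Predicting all maps from one decoder head keeps them spatially aligned by construction, which matters because the rendering loss evaluates the maps jointly.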

For the rendering-aware super-resolution module, we start from the pre-trained weights of Real-ESRGAN (Wang et al., 2021) and finetune four super-resolution modules, one each for the albedo, normal, metallic, and roughness textures. These modules were finetuned with a learning rate of $1\times 10^{-4}$ for 10000 iterations in total. Furthermore, we train all four modules jointly in a single model, so that their outputs can be rendered during training and the rendering loss can be applied.

To enhance image control performance, we train ControlNet with a learning rate of $1\times 10^{-5}$, which takes about 2 days. For Style Control, we directly use the ip-adapter_sd15 checkpoint together with our finetuned checkpoint, as we observed satisfactory results.

4.2. Generation Results

DreamPBR is capable of generating realistic or imaginative materials from text descriptions alone. To demonstrate its ability to synthesize a wide range of materials, we obtain a number of material descriptions from an LLM for each type and use them to sample materials with DreamPBR. The generated textures are enhanced by the super-resolution module and then rendered, as shown in Figure 3. The 400 textures we sampled are highly consistent with the text: the mean CLIP Score between the rendered images and the given prompts is 30.198. Besides text-image consistency, the diversity of results is also important for text-driven generative models. As demonstrated in Figure 7, when we sample several textures with the same prompt but different random seeds, DreamPBR produces diverse textures that follow the specified descriptions.
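For readers unfamiliar with the metric, the sketch below shows how a mean CLIP Score can be computed from precomputed image and text embeddings. It assumes the common convention of 100 times the cosine similarity, clamped at zero; the text does not state which variant was used, and the toy vectors stand in for real CLIP embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def clip_score(image_emb, text_emb, scale=100.0):
    """Assumed convention: scale * max(cosine similarity, 0)."""
    return scale * max(cosine_similarity(image_emb, text_emb), 0.0)

def mean_clip_score(pairs):
    """Average the score over (image embedding, text embedding) pairs."""
    return sum(clip_score(img, txt) for img, txt in pairs) / len(pairs)

# Toy embeddings standing in for CLIP features of renderings and prompts.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
print(mean_clip_score(pairs))
```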

4.2.1. Tileable texture generation

Regardless of the controls that users introduce, DreamPBR generates seamless, tileable textures, which allows users to apply the generated textures at different scales and in different scenes. In Figure 8, we present several tileable textures from direct and guided generation together with their tiled results, showing the effectiveness of the circular padding described in Section 3.2.
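The link between circular padding and tileability can be illustrated with a small, dependency-free sketch. With wrap-around indexing, a convolution treats the image as a torus, so no border is special: shifting the input cyclically shifts the output by the same amount. The helper names below are hypothetical, and the code illustrates the principle rather than the network layers used in DreamPBR.

```python
def circular_conv2d(image, kernel):
    """2D cross-correlation with circular padding; output has the input size."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    # Indices wrap around the image instead of reading zeros.
                    ii = (i + di - kh // 2) % h
                    jj = (j + dj - kw // 2) % w
                    acc += kernel[di][dj] * image[ii][jj]
            out[i][j] = acc
    return out

def roll(image, dy, dx):
    """Cyclically shift an image by dy rows and dx columns."""
    h, w = len(image), len(image[0])
    return [[image[(i - dy) % h][(j - dx) % w] for j in range(w)] for i in range(h)]

image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
laplacian = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
# Shifting before or after the convolution gives the same result.
print(circular_conv2d(roll(image, 1, 2), laplacian)
      == roll(circular_conv2d(image, laplacian), 1, 2))
```

With zero padding, the same check fails near the borders, which is one way to see why border pixels would otherwise be treated differently from interior pixels.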

4.2.2. Results of Pixel Control

By finetuning an additional ControlNet, DreamPBR is able to generate textures that follow given patterns. In practice, a designer could decide on a pattern in advance and then try different materials, or the other way around. In both situations, DreamPBR produces reasonable textures for the given patterns or materials, as demonstrated in Figure 4 and Figure 5.

With the additional control of binary images, inpainting is another common way for users to obtain specific results. We present several inpainting results in Figure 9, in which a region of a texture is replaced with content that the user describes.

4.2.3. Results of Style Control

A style image often conveys the desired appearance more easily than text alone, which is the motivation behind IP-Adapter (Ye et al., 2023). We therefore evaluate the adaptation of IP-Adapter for our Style Control. Specifically, we collect several style images online and present the generation results under the different styles in Figure 6. Figure 10 illustrates the case in which users combine Style Control with Pixel Control, which gives them more freedom to generate the results they want.

4.2.4. Results of Shape Control

With its ability to generate various textures, DreamPBR can be extended to non-planar objects such as chairs. Specifically, given a segmented object, we can use dialogue with a large language model to obtain a description for each region. For more specific objects, a more direct approach is to use cropped areas from exemplar images in conjunction with Pixel Control and Style Control. The results of our Shape Control pipeline, which benefits from the tileability of the generated textures, are shown in Figure 11.

4.3. Comparative Experiments

Leveraging the state-of-the-art generative model Stable Diffusion, DreamPBR is highly competitive with previous material generation methods. We compare materials generated by DreamPBR against MaterialGAN (Guo et al., 2020) and TileGen (Zhou et al., 2022) in Figure 12. Notably, the competing methods provide only two categories, so our results are generated with the prompts “a PBR material of ground, stone” and “a PBR material of metal”. The comparison shows that DreamPBR can generate textures that follow the distribution of realistic data in the datasets, as GAN-based methods do, as well as imaginative textures that draw on prior information from 2D images.

Moreover, we compare our Pixel Control with that of TileGen for sketch-guided generation. The comparison results are shown in Figure 13, which presents the outputs of TileGen and of our method for the same binary masks. In sketch-driven generation, DreamPBR shows fewer artifacts and more precise control than TileGen.

4.4. Ablation Study

The training of DreamPBR involves several optional modules and additional loss functions. In this section, we evaluate the effect of each of these designs. For evaluation, we randomly selected 100 textures from our collected data that were not used in any training stage.

4.4.1. PBR Decoder

When training the PBR decoder, we introduce $\mathcal{L}_{\text{render}}$, a regression loss on images rendered under random lights and viewpoints, which enforces that the decoded textures remain realistic after rendering. Compared with training without rendered images, it reduces the search space of output values. We trained two PBR decoders, with and without $\mathcal{L}_{\text{render}}$, and evaluated their effectiveness by comparing the outputs with reference textures. Figure 14 presents the comparison, in which our rendering-aware decoder achieves more realistic renderings and texture maps that are more consistent with the references.
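The idea behind a rendering loss can be sketched in plain Python: shade the predicted and the reference SVBRDF under the same randomly sampled light and view directions, then compare the resulting images. The exact definition of $\mathcal{L}_{\text{render}}$ is given in Section 3.3 and is not reproduced here. The simplified Cook-Torrance GGX shading, the L1 distance, the hemisphere sampling range, and all function names below are assumptions of this illustration.

```python
import math
import random

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def shade(albedo, normal, metallic, roughness, light_dir, view_dir):
    """Shade one pixel with a simplified GGX microfacet model, unit light."""
    n, l, v = normalize(normal), normalize(light_dir), normalize(view_dir)
    h = normalize(tuple(a + b for a, b in zip(l, v)))
    n_l, n_v = max(dot(n, l), 0.0), max(dot(n, v), 1e-4)
    n_h, v_h = max(dot(n, h), 0.0), max(dot(v, h), 0.0)
    alpha = max(roughness, 0.04) ** 2
    # GGX normal distribution term.
    d = alpha ** 2 / (math.pi * (n_h ** 2 * (alpha ** 2 - 1.0) + 1.0) ** 2)
    # Schlick-style approximation of the Smith geometry term.
    k = alpha / 2.0
    g = (n_l / (n_l * (1.0 - k) + k)) * (n_v / (n_v * (1.0 - k) + k))
    rgb = []
    for a in albedo:
        f0 = 0.04 * (1.0 - metallic) + a * metallic
        f = f0 + (1.0 - f0) * (1.0 - v_h) ** 5  # Schlick Fresnel
        spec = d * g * f / (4.0 * max(n_l, 1e-4) * n_v)
        diff = (1.0 - metallic) * a / math.pi
        rgb.append((diff + spec) * n_l)
    return rgb

def sample_hemisphere(rng):
    """Random unit direction in the upper hemisphere (z >= 0.2)."""
    z = rng.uniform(0.2, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    r = math.sqrt(1.0 - z * z)
    return (r * math.cos(phi), r * math.sin(phi), z)

def render_loss(pred_pixels, ref_pixels, n_renders=4, seed=0):
    """Mean L1 distance between renderings under shared random lights and views."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(n_renders):
        light, view = sample_hemisphere(rng), sample_hemisphere(rng)
        for p, r in zip(pred_pixels, ref_pixels):
            for cp, cr in zip(shade(*p, light, view), shade(*r, light, view)):
                total += abs(cp - cr)
                count += 1
    return total / count

# Each pixel is (albedo, normal, metallic, roughness).
reference = [((0.8, 0.2, 0.2), (0.0, 0.0, 1.0), 0.0, 0.5)]
predicted = [((0.2, 0.2, 0.2), (0.0, 0.0, 1.0), 0.0, 0.5)]
print(render_loss(predicted, reference))
```

Because the four maps interact in the rendered image, an error in one map is penalized in proportion to its visible effect, which a per-map loss alone does not capture.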

4.4.2. Super-Resolution Module

Although the super-resolution model already performs well on natural images, we finetune it on our material data and employ a rendering loss $\mathcal{L}_{\text{render}}$ at the perceptual level. In practice, we take the pre-trained Real-ESRGAN as our baseline and finetune a super-resolution module for each texture component. We then jointly finetune the four single modules (albedo, metallic, normal, and roughness) and introduce $\mathcal{L}_{\text{render}}$ by rendering the four super-resolved textures into image space. The comparison results are shown in Figure 15. As with the PBR decoder, finetuning the super-resolution modules with $\mathcal{L}_{\text{render}}$ leads to better results.

4.4.3. Highlight-aware decoder

As mentioned in Section 3.3, we introduce a highlight-aware albedo decoder to remove potential highlights in generated RGB images. A good de-highlighting module must satisfy two requirements: 1) it effectively removes highlights from images, and 2) it leaves images without highlights unchanged. In practice, training only on rendered images can degrade the decoding of albedo maps that contain no highlights, so we finetune the highlight-aware decoder on inputs chosen randomly between images rendered under different lights and pure albedo maps. Furthermore, we compare the outputs of the highlight-aware decoder with those of the initial pre-trained decoder in Figure 16; the results suggest that our decoder meets both requirements.
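The input selection for finetuning the highlight-aware decoder can be sketched as follows. The mixing probability `p_albedo` and the function name are hypothetical, since the text only states that the choice is random; the target is the clean albedo in both cases.

```python
import random

def pick_decoder_input(albedo, rendered_views, rng, p_albedo=0.5):
    """Return (input_image, target_albedo) for one training step."""
    if not rendered_views or rng.random() < p_albedo:
        # Clean input: the decoder should leave it unchanged.
        return albedo, albedo
    # Highlighted input: the decoder should recover the clean albedo.
    return rng.choice(rendered_views), albedo

rng = random.Random(0)
# Strings stand in for image tensors in this illustration.
batch = [pick_decoder_input("albedo", ["render_a", "render_b"], rng) for _ in range(4)]
print(batch)
```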

4.4.4. Pixel Control

To realize sketch-guided control, we embed a pre-trained ControlNet in DreamPBR. However, unlike the IP-Adapter used for Style Control, which incorporates image semantics in CLIP space independently of the training data, the original ControlNet causes a domain shift in our experiments, from the albedo domain back to the image domain. To address this problem, we finetuned ControlNet on our sketch-albedo pairs as mentioned above. ControlNet before and after finetuning is compared in Figure 17.

4.5. Limitations

Despite the promising capabilities of DreamPBR in generating high-quality and diverse material textures, our method has certain limitations that merit further exploration and improvement. We employ normal maps to represent surface details in textures. However, using normal maps without displacement maps ignores self-occlusion during rendering, which makes the rendered results less realistic. In addition, although a longer description yields a more detailed texture that is closer to what the user wants, writing such a detailed description, for example “a PBR material of the wall, concrete wall, outdoor, cracked, man-made, rough, painted…”, is laborious for users.

5. Conclusions and Future work

In this paper, we propose DreamPBR, a novel diffusion-based generative framework for creating physically-based material textures. Our method does not rely on large datasets as image generation does, but instead transfers the prior information of image generation models to the desired textures. Given text descriptions and other optional multi-modal conditions, we can generate textures that are highly consistent with the text and with the other conditions, such as the styles of RGB images and the patterns of binary images. With DreamPBR, one can freely create planar textures according to one's imagination. Specifically, we start by finetuning a diffusion model for albedo generation and then decompose the albedo into the other SVBRDF maps (normal, metallic, and roughness) with our highlight-aware decoder and PBR decoder. For higher-resolution textures, we introduce an additional rendering loss into our super-resolution module, which brings a significant visual improvement. With the properties above, DreamPBR can also produce textures for simple geometries through dialogue with an LLM.

For future work, although DreamPBR currently targets planar textures, it could be extended to complex geometries with further development of retopology. Additionally, thanks to our effective PBR decoder and highlight-aware decoder, DreamPBR has the potential to be used for SVBRDF estimation. Lastly, diffusion models inevitably suffer from limited resolution and time-consuming inference, which remain challenging problems for future work.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
  • Aittala et al. (2016) Miika Aittala, Timo Aila, and Jaakko Lehtinen. 2016. Reflectance Modeling by Neural Texture Synthesis. ACM Trans. Graph. 35, 4, Article 65 (jul 2016), 13 pages. https://doi.org/10.1145/2897824.2925917
  • Aittala et al. (2015) Miika Aittala, Tim Weyrich, and Jaakko Lehtinen. 2015. Two-Shot SVBRDF Capture for Stationary Materials. ACM Trans. Graph. 34, 4, Article 110 (jul 2015), 13 pages. https://doi.org/10.1145/2766967
  • Deitke et al. (2023) Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. 2023. Objaverse-XL: A Universe of 10M+ 3D Objects. arXiv:2307.05663 [cs.CV]
  • Denis Zavadski and Rother (2023) Denis Zavadski, Johann-Friedrich Feiden, and Carsten Rother. 2023. ControlNet-XS: Designing an Efficient and Effective Architecture for Controlling Text-to-Image Diffusion Models. (2023).
  • Deschaintre et al. (2018a) Valentin Deschaintre, Miika Aittala, Fredo Durand, George Drettakis, and Adrien Bousseau. 2018a. Single-image svbrdf capture with a rendering-aware deep network. ACM Transactions on Graphics (ToG) 37, 4 (2018), 1–15.
  • Deschaintre et al. (2018b) Valentin Deschaintre, Miika Aittala, Fredo Durand, George Drettakis, and Adrien Bousseau. 2018b. Single-image svbrdf capture with a rendering-aware deep network. ACM Transactions on Graphics (ToG) 37, 4 (2018), 1–15.
  • Deschaintre et al. (2019) Valentin Deschaintre, Miika Aittala, Frédo Durand, George Drettakis, and Adrien Bousseau. 2019. Flexible SVBRDF Capture with a Multi-Image Deep Network. Computer Graphics Forum (Proceedings of the Eurographics Symposium on Rendering) 38, 4 (July 2019). http://www-sop.inria.fr/reves/Basilic/2019/DADDB19
  • Dong (2019) Yue Dong. 2019. Deep appearance modeling: A survey. Visual Informatics 3, 2 (2019), 59–68. https://doi.org/10.1016/j.visinf.2019.07.003
  • Dong et al. (2014) Yue Dong, Guojun Chen, Pieter Peers, Jiawan Zhang, and Xin Tong. 2014. Appearance-from-Motion: Recovering Spatially Varying Surface Reflectance under Unknown Lighting. ACM Trans. Graph. 33, 6, Article 193 (nov 2014), 12 pages. https://doi.org/10.1145/2661229.2661283
  • Gao et al. (2019) Duan Gao, Xiao Li, Yue Dong, Pieter Peers, Kun Xu, and Xin Tong. 2019. Deep inverse rendering for high-resolution svbrdf estimation from an arbitrary number of images. ACM Transactions on Graphics (TOG) 38, 4 (2019), 1–15.
  • Goodfellow et al. (2014) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Networks. (2014). arXiv:1406.2661 [stat.ML]
  • Guarnera et al. (2016) D. Guarnera, G.C. Guarnera, A. Ghosh, C. Denk, and M. Glencross. 2016. BRDF Representation and Acquisition. Computer Graphics Forum 35, 2 (2016), 625–650. https://doi.org/10.1111/cgf.12867
  • Guerrero et al. (2022) Paul Guerrero, Milos Hasan, Kalyan Sunkavalli, Radomir Mech, Tamy Boubekeur, and Niloy Mitra. 2022. MatFormer: A Generative Model for Procedural Materials. ACM Trans. Graph. 41, 4, Article 46 (2022). https://doi.org/10.1145/3528223.3530173
  • Guo et al. (2021) Jie Guo, Shuichang Lai, Chengzhi Tao, Yuelong Cai, Lei Wang, Yanwen Guo, and Ling-Qi Yan. 2021. Highlight-Aware Two-Stream Network for Single-Image SVBRDF Acquisition. ACM Trans. Graph. 40, 4, Article 123 (jul 2021), 14 pages. https://doi.org/10.1145/3450626.3459854
  • Guo et al. (2023) Jie Guo, Shuichang Lai, Qinghao Tu, Chengzhi Tao, Changqing Zou, and Yanwen Guo. 2023. Ultra-High Resolution SVBRDF Recovery from a Single Image. ACM Trans. Graph. 42, 3, Article 33 (jun 2023), 14 pages. https://doi.org/10.1145/3593798
  • Guo et al. (2020) Yu Guo, Cameron Smith, Miloš Hašan, Kalyan Sunkavalli, and Shuang Zhao. 2020. MaterialGAN: Reflectance Capture Using a Generative SVBRDF Model. ACM Trans. Graph. 39, 6, Article 254 (nov 2020), 13 pages. https://doi.org/10.1145/3414685.3417779
  • Henzler et al. (2021) Philipp Henzler, Valentin Deschaintre, Niloy J. Mitra, and Tobias Ritschel. 2021. Generative Modelling of BRDF Textures from Flash Images. ACM Trans. Graph. 40, 6, Article 284 (dec 2021), 13 pages. https://doi.org/10.1145/3478513.3480507
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic Models. arXiv:2006.11239 [cs.LG]
  • Hu et al. (2022c) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022c. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations. https://openreview.net/forum?id=nZeVKeeFYf9
  • Hu et al. (2022d) Ruizhen Hu, Xiangyu Su, Xiangkai Chen, Oliver van Kaick, and Hui Huang. 2022d. Photo-to-Shape Material Transfer for Diverse Structures. ACM Transactions on Graphics (Proceedings of SIGGRAPH) 39, 6 (2022), 113:1–113:14.
  • Hu et al. (2019) Yiwei Hu, Julie Dorsey, and Holly Rushmeier. 2019. A Novel Framework for Inverse Procedural Texture Modeling. ACM Trans. Graph. 38, 6, Article 186 (nov 2019), 14 pages. https://doi.org/10.1145/3355089.3356516
  • Hu et al. (2022a) Yiwei Hu, Paul Guerrero, Milos Hasan, Holly Rushmeier, and Valentin Deschaintre. 2022a. Node Graph Optimization Using Differentiable Proxies. In ACM SIGGRAPH 2022 Conference Proceedings (Vancouver, BC, Canada) (SIGGRAPH ’22). Association for Computing Machinery, New York, NY, USA, Article 5, 9 pages. https://doi.org/10.1145/3528233.3530733
  • Hu et al. (2023) Yiwei Hu, Paul Guerrero, Milos Hasan, Holly Rushmeier, and Valentin Deschaintre. 2023. Generating Procedural Materials from Text or Image Prompts. In Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Proceedings (SIGGRAPH ’23). ACM. https://doi.org/10.1145/3588432.3591520
  • Hu et al. (2022b) Yiwei Hu, Chengan He, Valentin Deschaintre, Julie Dorsey, and Holly Rushmeier. 2022b. An Inverse Procedural Modeling Pipeline for SVBRDF Maps. ACM Trans. Graph. 41, 2, Article 18 (jan 2022), 17 pages. https://doi.org/10.1145/3502431
  • Hui et al. (2017) Zhuo Hui, Kalyan Sunkavalli, Joon-Young Lee, Sunil Hadap, Jian Wang, and Aswin C. Sankaranarayanan. 2017. Reflectance Capture Using Univariate Sampling of BRDFs. In 2017 IEEE International Conference on Computer Vision (ICCV). 5372–5380. https://doi.org/10.1109/ICCV.2017.573
  • Isola et al. (2018) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2018. Image-to-Image Translation with Conditional Adversarial Networks. arXiv:1611.07004 [cs.CV]
  • Karras et al. (2018) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018. Progressive Growing of GANs for Improved Quality, Stability, and Variation. arXiv:1710.10196 [cs.NE]
  • Karras et al. (2021) Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2021. Alias-Free Generative Adversarial Networks. In Proc. NeurIPS.
  • Karras et al. (2019) Tero Karras, Samuli Laine, and Timo Aila. 2019. A Style-Based Generator Architecture for Generative Adversarial Networks. (2019). arXiv:1812.04948 [cs.NE]
  • Karras et al. (2020) Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. Analyzing and Improving the Image Quality of StyleGAN. In Proc. CVPR.
  • Kingma and Welling (2014) Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings.
  • Kodali et al. (2017) Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. 2017. On Convergence and Stability of GANs. arXiv:1705.07215 [cs.AI]
  • Laine et al. (2020) Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, and Timo Aila. 2020. Modular Primitives for High-Performance Differentiable Rendering. ACM Transactions on Graphics 39, 6 (2020).
  • Li et al. (2017) Xiao Li, Yue Dong, Pieter Peers, and Xin Tong. 2017. Modeling surface appearance from a single photograph using self-augmented convolutional neural networks. ACM Transactions on Graphics (ToG) 36, 4 (2017), 1–11.
  • Li et al. (2019) Xiao Li, Yue Dong, Pieter Peers, and Xin Tong. 2019. Synthesizing 3D Shapes from Silhouette Image Collections using Multi-projection Generative Adversarial Networks. arXiv:1906.03841 [cs.CV]
  • Li et al. (2021) Yuheng Li, Yijun Li, Jingwan Lu, Eli Shechtman, Yong Jae Lee, and Krishna Kumar Singh. 2021. Collaging class-specific gans for semantic image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 14418–14427.
  • Li et al. (2018) Zhengqin Li, Kalyan Sunkavalli, and Manmohan Chandraker. 2018. Materials for Masses: SVBRDF Acquisition with a Single Mobile Phone Image. In Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part III (Munich, Germany). Springer-Verlag, Berlin, Heidelberg, 74–90. https://doi.org/10.1007/978-3-030-01219-9_5
  • Lin et al. (2023) Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. 2023. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 300–309.
  • Liu et al. (2022) Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. 2022. Pseudo Numerical Methods for Diffusion Models on Manifolds. arXiv:2202.09778 [cs.CV]
  • Liu et al. (2023) Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. 2023. Zero-1-to-3: Zero-shot One Image to 3D Object. arXiv:2303.11328 [cs.CV]
  • Mou et al. (2023) Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. 2023. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453 (2023).
  • Palma et al. (2012) Gianpaolo Palma, Marco Callieri, Matteo Dellepiane, and Roberto Scopigno. 2012. A Statistical Method for SVBRDF Approximation from Video Sequences in General Lighting Conditions. Computer Graphics Forum (2012). https://doi.org/10.1111/j.1467-8659.2012.03145.x
  • Park et al. (2018) Keunhong Park, Konstantinos Rematas, Ali Farhadi, and Steven M. Seitz. 2018. PhotoShape: Photorealistic Materials for Large-Scale Shape Collections. ACM Trans. Graph. 37, 6, Article 192 (Nov. 2018).
  • Park et al. (2019) Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. 2019. Semantic Image Synthesis with Spatially-Adaptive Normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Poole et al. (2022) Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. 2022. DreamFusion: Text-to-3D using 2D Diffusion. arXiv (2022).
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.
  • Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 1, 2 (2022), 3.
  • Reed et al. (2016a) Scott Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. 2016a. Learning What and Where to Draw. arXiv:1610.02454 [cs.CV]
  • Reed et al. (2016b) Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016b. Generative Adversarial Text to Image Synthesis. arXiv:1605.05396 [cs.NE]
  • Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning. PMLR, 1278–1286.
  • Riviere et al. (2016) J. Riviere, P. Peers, and A. Ghosh. 2016. Mobile Surface Reflectometry. Computer Graphics Forum 35, 1 (2016), 191–202. https://doi.org/10.1111/cgf.12719
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752 [cs.CV]
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv:1505.04597 [cs.CV]
  • Sartor and Peers (2023) Sam Sartor and Pieter Peers. 2023. Matfusion: a generative diffusion model for svbrdf capture. In SIGGRAPH Asia 2023 Conference Papers. 1–10.
  • Shi et al. (2020) Liang Shi, Beichen Li, Miloš Hašan, Kalyan Sunkavalli, Tamy Boubekeur, Radomir Mech, and Wojciech Matusik. 2020. MATch: Differentiable Material Graphs for Procedural Material Capture. ACM Trans. Graph. 39, 6 (Dec. 2020), 1–15.
  • Shi et al. (2023) Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei, Linghao Chen, Chong Zeng, and Hao Su. 2023. Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model. arXiv:2310.15110 [cs.CV]
  • Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning. PMLR, 2256–2265.
  • Song et al. (2020) Jiaming Song, Chenlin Meng, and Stefano Ermon. 2020. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020).
  • Song et al. (2021) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2021. Score-Based Generative Modeling through Stochastic Differential Equations. In International Conference on Learning Representations. https://openreview.net/forum?id=PxTIG12RRHS
  • Su et al. (2021) Zhuo Su, Wenzhe Liu, Zitong Yu, Dewen Hu, Qing Liao, Qi Tian, Matti Pietikäinen, and Li Liu. 2021. Pixel difference networks for efficient edge detection. In Proceedings of the IEEE/CVF international conference on computer vision. 5117–5127.
  • Tang et al. (2023) Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. 2023. DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation. arXiv preprint arXiv:2309.16653 (2023).
  • Tulyakov et al. (2017) Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. 2017. MoCoGAN: Decomposing Motion and Content for Video Generation. arXiv:1707.04993 [cs.CV]
  • Vecchio et al. (2023) Giuseppe Vecchio, Rosalie Martin, Arthur Roullier, Adrien Kaiser, Romain Rouffet, Valentin Deschaintre, and Tamy Boubekeur. 2023. ControlMat: Controlled Generative Approach to Material Capture. arXiv preprint arXiv:2309.01700 (2023).
  • Walter et al. (2007) Bruce Walter, Stephen R. Marschner, Hongsong Li, and Kenneth E. Torrance. 2007. Microfacet Models for Refraction through Rough Surfaces. In Proceedings of the 18th Eurographics Conference on Rendering Techniques (Grenoble, France) (EGSR’07). Eurographics Association, Goslar, DEU, 195–206.
  • Wang et al. (2021) Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. 2021. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF international conference on computer vision. 1905–1914.
  • Wang et al. (2023) Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. 2023. ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation. arXiv preprint arXiv:2305.16213 (2023).
  • Weinmann and Klein (2015) Michael Weinmann and Reinhard Klein. 2015. Advances in Geometry and Reflectance Acquisition (Course Notes). In SIGGRAPH Asia 2015 Courses (Kobe, Japan) (SA ’15). Association for Computing Machinery, New York, NY, USA, Article 1, 71 pages. https://doi.org/10.1145/2818143.2818165
  • Xu et al. (2016) Zexiang Xu, Jannik Boll Nielsen, Jiyang Yu, Henrik Wann Jensen, and Ravi Ramamoorthi. 2016. Minimal BRDF Sampling for Two-Shot near-Field Reflectance Acquisition. ACM Trans. Graph. 35, 6, Article 188 (dec 2016), 12 pages. https://doi.org/10.1145/2980179.2982396
  • Ye et al. (2023) Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. 2023. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. (2023).
  • Zhang et al. (2023) Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. 2023. Adding Conditional Control to Text-to-Image Diffusion Models.
  • Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In CVPR.
  • Zhou et al. (2022) Xilong Zhou, Miloš Hašan, Valentin Deschaintre, Paul Guerrero, Kalyan Sunkavalli, and Nima Kalantari. 2022. TileGen: Tileable, Controllable Material Generation and Capture. (2022). arXiv:2206.05649 [cs.GR]
  • Zhou et al. (2016) Zhiming Zhou, Guojun Chen, Yue Dong, David Wipf, Yong Yu, John Snyder, and Xin Tong. 2016. Sparse-as-Possible SVBRDF Acquisition. ACM Trans. Graph. 35, 6, Article 189 (dec 2016), 12 pages. https://doi.org/10.1145/2980179.2980247
  • Zhu et al. (2020) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2020. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. arXiv:1703.10593 [cs.CV]
  • Zhu et al. (2019) Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. 2019. DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis. arXiv:1904.01310 [cs.CV]