DreamPBR: Text-driven Generation of High-resolution SVBRDF with Multi-modal Guidance

1. Introduction

2. Related Work

3. Method

4. Results

5. Conclusions and Future work

References

2.1. Material estimation

2.2. Generative models

2.3. Text-to-3D Generation

3.1. Overview

3.2. Physically-based material diffusion

3.3. Render-aware SVBRDF decoder

3.4. Multi-modal control

4.1. Implementation Details

4.2. Generation Results

4.3. Comparative Experiments

4.4. Ablation Study

4.5. Limitations

Image generation

Controllable generation

Material generation

Seamless tileable texture synthesis

Training of PBR decoder

Highlight-aware albedo decoder

Material super-resolution

3.4.1. Pixel Control

3.4.2. Style Control

3.4.3. Shape Control

4.1.1. Dataset Generation

4.1.2. Other Details

4.2.1. Tileable texture generation

4.2.2. Results of Pixel Control

4.2.3. Results of Style Control

4.2.4. Results of Shape Control

4.4.1. PBR Decoder

4.4.2. Super-Resolution Module

4.4.3. Highlight-aware decoder

4.4.4. Pixel Control

Preliminaries

Linxuan Xin (Peking University, Shenzhen, China), Zheng Zhang (Huawei Cloud Computing Technologies Co., Ltd., Hangzhou, China), Jinfu Wei (Tsinghua University, Shenzhen, China), Wei Gao (School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, Shenzhen, China), and Duan Gao (Huawei Cloud Computing Technologies Co., Ltd., Shenzhen, China)

Abstract.

Prior material creation methods had limitations in producing diverse results mainly because reconstruction-based methods relied on real-world measurements and generation-based methods were trained on relatively small material datasets. To address these challenges, we propose DreamPBR, a novel diffusion-based generative framework designed to create spatially-varying appearance properties guided by text and multi-modal controls, providing high controllability and diversity in material generation. The key to achieving diverse and high-quality PBR material generation lies in integrating the capabilities of recent large-scale vision-language models trained on billions of text-image pairs, along with material priors derived from hundreds of PBR material samples. We utilize a novel material Latent Diffusion Model (LDM) to establish the mapping between albedo maps and the corresponding latent space. The latent representation is then decoded into full SVBRDF parameter maps using a rendering-aware PBR decoder. Our method supports tileable generation through convolution with circular padding. Furthermore, we introduce a multi-modal guidance module, which includes pixel-aligned guidance, style image guidance, and 3D shape guidance, to enhance the control capabilities of the material LDM. We demonstrate the effectiveness of DreamPBR in material creation, showcasing its versatility and user-friendliness on a wide range of controllable generation and editing applications.

Physically-based Rendering, Spatially Varying Bidirectional Reflectance Distribution Function, Multimodal Deep Generative Model, Deep Learning

Figure 1. DreamPBR, an innovative material generation framework, enables personalized creation with multi-modal controls. We present various controls such as text descriptions (4, 5, 6, 7), binary images (2, 9, 10), RGB images (3, 11, 12, 13), segmented geometry (8), and their combination (1) in this figure. The high-quality and tileable textures from DreamPBR show high applicability in different objects.

High-quality materials are crucial for achieving photorealistic rendering. Despite advancements in appearance modeling over the past few decades, material creation remains a challenging research area. Material creation approaches can be categorized into reconstruction-based methods and generation-based methods. Reconstruction-based methods use one or many input photographs to estimate surface reflectance properties either through optimization-based inverse rendering (Gao et al., 2019; Guo et al., 2020; Hu et al., 2019) or deep neural network inference (Deschaintre et al., 2018a; Guo et al., 2023). However, the scope of these methods is constrained to real-world photographs, limiting their ability to create imaginative and creative materials.

Recent approaches have explored material generation (Guo et al., 2020; Zhou et al., 2022) using Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). However, these methods are typically trained on hundreds to thousands of materials, which pales in comparison to the billions of images used in large-scale language-image generative models. This limited dataset scale restricts their generation diversity. Furthermore, GAN-based methods also face training challenges, including unstable training, mode collapse, and scalability issues with large datasets. On the other hand, diffusion models (Ho et al., 2020; Rombach et al., 2022) have shown significant advancements, exhibiting advantages in scalability and diversity. Recent advances (Poole et al., 2022; Wang et al., 2023) leverage 2D diffusion models as priors for generating 3D content. However, these methods mainly focus on implicit representations or textured meshes, lacking the capability to disentangle physically based material and illumination.

To address these challenges, we introduce DreamPBR, a novel generative framework for creating high-resolution spatially-varying bidirectional reflectance distribution functions (SVBRDFs) conditioned with text inputs and a variety of multi-modal guidance. The main advantages of our method lie in generating diversity and controllability. Our method can generate semantically correct and detailed materials based on various textual prompts, ranging from highly structured materials with stationary patterns to imaginative materials with flexible content, such as a Hello Kitty carpet (as shown in Figure 1).

The key idea of our method is to integrate pretrained 2D text-to-image diffusion models (Rombach et al., 2022) with material priors to generate high-fidelity and diverse materials. While 2D text-to-image Latent Diffusion Models (LDMs) excel in generating natural images, they struggle to produce spatially-varying physically-based material maps due to the large domain gap between natural images and materials. Consequently, adapting pretrained 2D diffusion models to the material domain, while preserving both quality and diversity, is a non-trivial research task. We introduce a novel material LDM, trained with a two-stage strategy, to address this challenge. In the first stage, we observe that an albedo map is a specialized RGB image that stores spatially-varying surface reflectance as RGB pixel values. We therefore transfer the pretrained LDM from the text-to-image domain to the text-to-albedo domain through fine-tuning, which can be regarded as distillation from a large source domain (natural images) to a relatively small target domain (albedo texture maps) by leveraging the target domain priors. In the second stage, we leverage a PBR decoder to reconstruct SVBRDFs from the latent space of albedo maps learned in the first stage. We employ a decoder-only architecture for SVBRDF generation for two reasons: (1) the generated SVBRDF parameter maps exhibit strong correlations, since they share a common latent representation as the starting point for decoding; and (2) the decoder module does not compromise generation diversity, as we keep the denoising U-Net frozen during the training of the PBR decoder. Additionally, we introduce a highlight-aware decoder for the albedo map to further enhance regularization.

We introduce a multi-modal guidance module designed to serve as the conditioning mechanism for our material LDM, enabling a wide variety of controls for user-friendly material creation. Specifically, this guidance module includes three key components: Pixel Control allows pixel-aligned guidance from inputs like sketches or inpainting masks. Style Control extracts style features from reference images and employs them to guide the generation process. Shape Control enables automatic material generation for a given 3D object with segmentations with an optional 2D exemplar image for reference. Importantly, our framework supports the concurrent use of multiple guidances seamlessly.

We have trained our DreamPBR method on a publicly available SVBRDF dataset, comprising over 700 high-resolution (2K) SVBRDFs. Thanks to the convolutional backbone of the LDM, seamless tileable material generation is supported by using circular padding in all convolutional operators.

To summarize, our main contributions are as follows:

• We introduce a novel generative framework for high-quality material generation under text and multi-modal guidance that efficiently combines a pretrained 2D diffusion model with material domain priors;

• We present a rendering-aware decoder module that learns the mapping from a shared latent space to SVBRDFs;

• Our multi-modal guidance module offers rich, user-friendly controllability, enabling users to manipulate the generation process effectively;

• We propose an image-to-image editing scheme that facilitates material editing tasks such as stylization, inpainting, and seamless texture synthesis.

Material estimation approaches aim to acquire material data from real-world measurements under varying viewpoints and lighting conditions. We specifically focus on recent material estimation methods that utilize lightweight capture setups using consumer cameras. For a more comprehensive overview of general appearance modeling, please refer to surveys (Dong, 2019; Weinmann and Klein, 2015; Guarnera et al., 2016).

Methods have been developed to leverage multiple images or video sequences captured by a handheld camera to estimate appearance properties. Due to the limitations of lightweight setups, most approaches still rely on regularization such as handcrafted heuristics for diffuse/specular separation (Riviere et al., 2016; Palma et al., 2012), linear combinations of basis BRDFs (Hui et al., 2017), and sparsity assumption for incident lighting (Dong et al., 2014). Another class of methods focuses on reducing the number of input images by leveraging material priors such as stationary materials (Aittala et al., 2015, 2016), homogeneous or piece-wise materials (Xu et al., 2016), and spatially sparse materials (Zhou et al., 2016).

In recent years, deep learning-based methods have shown significant progress in recovering SVBRDFs from a single image (Li et al., 2017; Deschaintre et al., 2018a; Li et al., 2018; Guo et al., 2021, 2023; Henzler et al., 2021). These methods employ deep convolutional neural networks to predict plausible SVBRDFs from in-the-wild input images in a feed-forward manner. Deschaintre et al. (2019) extended a single-image-based solution to multiple images by latent space max-pooling. More recent work by Gao et al. (2019) introduced a deep inverse rendering pipeline that enables appearance estimation from an arbitrary number of input images. In procedural material modeling, Hu et al. (2019), Shi et al. (2020), and Hu et al. (2022a) proposed to optimize material parameters with fixed node graphs to match input images. Hu et al. (2022b) introduced a new pipeline that eliminates the need for predefined node graphs. Most recently, Sartor and Peers (2023) proposed a diffusion-based model to estimate the material properties from a single photograph.

The methods mentioned above rely on captured photographs to reconstruct material and cannot produce non-real-world materials. In contrast, our approach can generate diverse and creative SVBRDFs using natural language inputs.

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have demonstrated remarkable capabilities in producing high-fidelity images. Subsequent research has focused on GAN improvements such as training stability (Kodali et al., 2017; Karras et al., 2018), attribute disentanglement (Karras et al., 2019), conditional controllability (Li et al., 2021; Park et al., 2019), and generation quality (Karras et al., 2020, 2021). GAN can be used in various applications including text-to-image synthesis (Reed et al., 2016b, a; Zhu et al., 2019), image-to-image translation (Isola et al., 2018; Zhu et al., 2020), video generation (Tulyakov et al., 2017), and even 3D shape generation (Li et al., 2019).

Recent advancements in text-to-image generation have been mainly driven by diffusion models (DMs) (Sohl-Dickstein et al., 2015; Ho et al., 2020; Ramesh et al., 2022). Later advancements (Song et al., 2020, 2021; Liu et al., 2022) have explored efficient sampling strategies to significantly reduce the number of required sampling steps, thereby improving image generation performance. Rombach et al. (2022) proposed Latent Diffusion Model (LDM) to perform the denoising process in learned compact latent space, enabling high-resolution image synthesis and efficient image manipulation.

Integrating multi-modal controllability into a text-to-image diffusion model is crucial for creation applications. Recent research (Denis Zavadski and Rother, 2023; Zhang et al., 2023; Ye et al., 2023; Hu et al., 2022c; Mou et al., 2023) has focused on lightweight multi-modal controllability without requiring extensive data or high computational power. Hu et al. (2022c) introduced a fine-tuning strategy using low-rank matrices, enabling domain-specific adaptation. Zhang et al. (2023) proposed ControlNet, adding spatial conditioning to diffusion models for precise generation control. Ye et al. (2023) presented a lightweight framework enhancing diffusion models with image prompts using a decoupled cross-attention mechanism.

Guo et al. (2020) proposed an unconditional MaterialGAN for synthesizing SVBRDFs from random noise. The learned latent space facilitates efficient material estimation in inverse rendering. Zhou et al. (2022) developed a StyleGAN2-based model, conditioned by spatial structure and material category, for tileable material synthesis. These GAN-based methods show advantages in generating high-resolution and visually compelling materials. However, their diversity is constrained by the training instability of GANs and the limited range of training datasets. In procedural material generation, Guerrero et al. (2022) first introduced a transformer-based autoregressive model. Later work by Hu et al. (2023) proposed a multi-modal node graph generation architecture for creating high-quality procedural materials, guided by both text and image inputs. While procedural representations are compact and resolution-independent, they are limited to stationary patterns and cannot create arbitrary styles.

In concurrent work, Vecchio et al. (2023) introduced ControlMat, a diffusion-based material generative model capable of generating tileable materials using text and a single photograph as input. This model was trained on a synthetic material dataset comprising 126,000 samples, derived from 8,615 raw material graphs. While quite large for the material domain, this dataset is relatively small compared to the billions of text-image pairs used in text-to-image diffusion model training. This scale discrepancy leads to constrained diversity. Furthermore, this work only supports guidance from text and a single photograph, limiting its range of application scenarios.

In contrast, our method significantly enhances material generation diversity through the efficient integration of pretrained diffusion models with material priors. We also provide a variety of user-friendly controls for guiding the generation process, expanding the scope and flexibility of applications.

Transitioning 2D text-to-image approaches to 3D generation presents significant challenges, mainly due to the lack of large-scale labeled 3D datasets. Recent approaches (Poole et al., 2022; Wang et al., 2023; Lin et al., 2023; Tang et al., 2023) have explored text-to-3D generation without depending on 3D data. Poole et al. (2022) integrated Score Distillation Sampling (SDS) with text-to-image diffusion models. Wang et al. (2023) further improved the quality and diversity by introducing Variational Score Distillation (VSD). The development of large-scale 3D datasets (Deitke et al., 2023) enabled direct learning from 3D data (Liu et al., 2023; Shi et al., 2023). However, current 3D generation methods mainly focus on geometry modeling and fail to produce high-quality, disentangled materials.

Park et al. (2018) introduced a neural method to assign materials from a predefined set to different parts of a 3D shape. Extending this, Hu et al. (2022d) employs a translation network to establish the correspondence between 2D exemplar image and 3D shape. This allows for extracting material cues from 2D images and selecting optimal materials from a candidate pool using a perceptual metric. However, these methods are constrained by the variety of their predefined material assets and lack the ability to transfer complex spatial patterns from 2D exemplars to 3D shapes. In contrast, our generative model can produce diverse materials and effectively transfer spatial structures from 2D exemplar images to 3D models, showcasing a significant advancement in material assignment.

The goal of our method is to generate spatially-varying materials which are represented by the Cook-Torrance microfacet BRDF model with GGX normal distribution function (Walter et al., 2007). Specifically, we use metallic-based PBR workflow and represent surface reflectance properties as albedo map $\mathcal{P}$ , normal map $\mathcal{N}$ , roughness map $\mathcal{R}$ , and metallic map $\mathcal{M}$ .
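As a point of reference for the reflectance model named above, the GGX (Trowbridge-Reitz) normal distribution function can be sketched in a few lines. This is only an illustrative sketch: the remapping `alpha = roughness ** 2` is a common convention assumed here, not something the text specifies, and the function name is hypothetical.

```python
import math


def ggx_ndf(n_dot_h: float, roughness: float) -> float:
    """GGX normal distribution D(h) for a given cosine between normal and half vector."""
    alpha = roughness * roughness  # assumed remapping: alpha = roughness^2
    a2 = alpha * alpha
    denom = n_dot_h * n_dot_h * (a2 - 1.0) + 1.0
    return a2 / (math.pi * denom * denom)


# A rough surface (roughness = 1) gives a constant distribution of 1/pi,
# while a smoother surface concentrates the distribution around n.h = 1.
peak_smooth = ggx_ndf(1.0, 0.5)
off_peak_smooth = ggx_ndf(0.5, 0.5)
```

Lower roughness values produce a sharper peak around the surface normal, which is what makes the roughness map $\mathcal{R}$ control the size of specular highlights.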

DreamPBR is a Latent Diffusion Model (LDM)-based generative framework capable of producing diverse, high-quality SVBRDF maps under text and multi-modal guidance, as illustrated in Figure 2.

The core generative module of our framework is the material Latent Diffusion Model (material LDM), which takes textual description $T$ as inputs to encode high-dimensional surface reflectance properties into a compact latent representation $z$ . This representation effectively compresses complex material data and guides the SVBRDF decoder in reconstructing detailed SVBRDF maps (i.e. albedo, normal, roughness, and metallic) $S=\{\mathcal{P},\mathcal{N},\mathcal{R},\mathcal{M}\}$ . Our critical observation is that while pre-trained text-to-image diffusion models can capture a wide range of natural images that fulfill the diversity needs of material generation, their flexibility often leads to less plausible materials due to the absence of material priors. Instead of training material LDM from scratch with limited material data, we opted to fine-tune a pre-trained text-to-image diffusion model with target material data. This strategy effectively tailors the model from a broad image domain to a specific material domain, ensuring both diversity and authenticity of output.

Our text-to-material framework seamlessly integrates three types of control modules to enhance material generation capabilities. First, we introduce the Pixel Control module $G_{P}$ that takes pixel-aligned inputs (e.g. sketches, masks), utilizing the ControlNet architecture (Zhang et al., 2023). It adds conditional controls into diffusion models, providing spatial guidance for material generation. Second, we use the Style Control module $G_{I}$ to extract image features from the input image prompt, which are then utilized to adapt the material LDM via cross-attention. Third, we propose a Shape Control module $G_{S}$ to generate SVBRDF maps automatically for a given segmented 3D shape. This module can leverage large language models to generate text prompts corresponding to different parts of the input shape. It also supports taking a 2D photo exemplar as additional input, enabling the generation of material maps for each segmented part, guided by the segmented 2D image.

In the rest of this section, we describe the key components of our framework. Section 3.2 introduces our core text-to-material module that enables tileable, diverse material generation. Next, Section 3.3 describes the SVBRDF decoder, responsible for reconstructing high-resolution SVBRDF maps from a unified latent space. Finally, Section 3.4 discusses the multi-modal control module, which provides image and 3D control capabilities to the diffusion model.

Our material LDM transforms text features $\tau(T)$ , extracted by CLIP’s text encoder $\tau(\cdot)$ (Radford et al., 2021) from user prompts $T$ , into a latent representation $z$ of SVBRDF maps $S$ . The latent space is characterized by a Variational Autoencoder (VAE) architecture $\mathcal{E}$ (Kingma and Welling, 2014; Rezende et al., 2014). Specifically, for an albedo map $\mathcal{P}\in\mathbb{R}^{H\times W\times C},C=3$ , the map is compressed into latent space $z=\mathcal{E}(\mathcal{P})\in\mathbb{R}^{h\times w\times c}$ . Consistent with Rombach et al. (2022), we adopt the parameters $c=4,h=H/8,w=W/8$ .
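To make the latent dimensions concrete, the following small sketch computes the latent shape for a given albedo resolution using the stated parameters $c=4$, $h=H/8$, $w=W/8$. The helper names are hypothetical and introduced only for illustration.

```python
def latent_shape(height: int, width: int, downsample: int = 8, channels: int = 4):
    """Return (h, w, c) of the latent for an H x W albedo map."""
    if height % downsample or width % downsample:
        raise ValueError("height and width must be divisible by the downsampling factor")
    return (height // downsample, width // downsample, channels)


def compression_ratio(height: int, width: int, image_channels: int = 3) -> float:
    """Ratio between the number of values in the image and in its latent."""
    h, w, c = latent_shape(height, width)
    return (height * width * image_channels) / (h * w * c)


shape_512 = latent_shape(512, 512)        # (64, 64, 4)
ratio_512 = compression_ratio(512, 512)   # 48.0
```

For a $512\times 512$ albedo map the diffusion process therefore operates on a $64\times 64\times 4$ tensor, which holds 48 times fewer values than the RGB image.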

The core component of the diffusion model is the denoising U-Net module (Ronneberger et al., 2015), which is conditioned on timestep $t$. Following Denoising Diffusion Probabilistic Models (DDPM) (Ho et al., 2020), our model employs a fixed forward diffusion process $q(z_{t}|z_{t-1})$ that gradually adds Gaussian noise to the latent vectors $z$ until they approach an isotropic Gaussian distribution. The U-Net is trained to approximate the reverse process $q(z_{t-1}|z_{t})$, iteratively denoising Gaussian noise back into latent vectors. Adopting the strategy proposed by Rombach et al. (2022), we incorporate the text feature $\tau(T)\in\mathbb{R}^{M\times d_{\tau}}$ into the intermediate layers of the U-Net through a cross-attention mechanism $\operatorname{Attention}(Q,K,V)=\operatorname{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)\cdot V$, where $Q=W_{Q}^{i}\cdot\varphi_{i}\left(z_{t}\right),K=W_{K}^{i}\cdot\tau(T),V=W_{V}^{i}\cdot\tau(T)$, $\varphi_{i}\left(z_{t}\right)\in\mathbb{R}^{N\times d_{\epsilon}^{i}}$ represents an intermediate representation of the U-Net $\epsilon_{\theta}$, and $W_{Q}^{i}\in\mathbb{R}^{d\times d_{\epsilon}^{i}}$, $W_{K}^{i}\in\mathbb{R}^{d\times d_{\tau}}$, $W_{V}^{i}\in\mathbb{R}^{d\times d_{\tau}}$ are learnable projection matrices.
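The cross-attention step can be illustrated with a minimal NumPy sketch. All sizes and the random weights below are placeholders, and the projection shapes are chosen so that the matrix products are well defined (the query projection acts on U-Net features, the key and value projections act on text features); this is not the trained model.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(phi, tau, W_q, W_k, W_v):
    """phi: (N, d_eps) U-Net features; tau: (M, d_tau) text features."""
    Q = phi @ W_q.T  # (N, d)
    K = tau @ W_k.T  # (M, d)
    V = tau @ W_v.T  # (M, d)
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))  # (N, M): each latent token attends over text tokens
    return weights @ V, weights


rng = np.random.default_rng(0)
N, M, d_eps, d_tau, d = 16, 77, 32, 24, 8  # placeholder sizes
phi = rng.standard_normal((N, d_eps))
tau = rng.standard_normal((M, d_tau))
W_q = rng.standard_normal((d, d_eps))
W_k = rng.standard_normal((d, d_tau))
W_v = rng.standard_normal((d, d_tau))
out, weights = cross_attention(phi, tau, W_q, W_k, W_v)
```

Each of the $N$ spatial tokens receives a convex combination of the projected text tokens, which is how the prompt influences every location of the latent.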

Our material LDM is fine-tuned on text-material pairs via:

$$\mathcal{L}_{ldm}:=\mathbb{E}_{\mathcal{E}(\mathcal{P}),T,\epsilon\sim\mathcal{N}(0,1),t}\left[\left\|\epsilon-\epsilon_{\theta}\left(z_{t},t,\tau(T)\right)\right\|_{2}^{2}\right]. \tag{1}$$
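The noise-prediction objective can be sketched as follows. The linear beta schedule and its endpoints are assumptions taken from the standard DDPM setup rather than from this text, text conditioning is omitted for brevity, and `oracle_eps_theta` is a hypothetical stand-in for the U-Net that simply inverts the noising step to show that a perfect prediction yields zero loss.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(z0, eps, alpha_bar_t):
    """Closed-form forward diffusion: z_t = sqrt(a_bar) z_0 + sqrt(1 - a_bar) eps."""
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps


def ldm_loss(eps, eps_pred):
    """Batch mean of the squared L2 norm between true and predicted noise."""
    diff = (eps - eps_pred).reshape(eps.shape[0], -1)
    return float(np.mean(np.sum(diff ** 2, axis=1)))


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
z0 = rng.standard_normal((2, 64, 64, 4))  # a batch of albedo latents
eps = rng.standard_normal(z0.shape)
t = 500
z_t = add_noise(z0, eps, alpha_bar[t])


def oracle_eps_theta(z_t, t):
    # Hypothetical perfect denoiser: recovers eps because it is given z0.
    return (z_t - np.sqrt(alpha_bar[t]) * z0) / np.sqrt(1.0 - alpha_bar[t])


loss_oracle = ldm_loss(eps, oracle_eps_theta(z_t, t))
loss_zero_guess = ldm_loss(eps, np.zeros_like(eps))
```

In actual training the oracle is replaced by the U-Net $\epsilon_{\theta}$, and $t$ is sampled uniformly for every batch element.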

Creating tileable texture maps is critical in material generation and involves two requirements: a) maintaining consistent spatial patterns and visual appearance, and b) tiling without visible artifacts such as seams and blocks.

While zero padding is the standard practice in CNNs, we found that circular padding is particularly effective for seamless content generation. We employ circular padding in all convolutional layers of our generative model for two main reasons:

(1) Continuity across boundaries. Unlike classic padding methods such as zero padding, which may introduce artificial edges, circular padding ensures boundary continuity. It wraps image content around both horizontal and vertical boundaries, providing a seamless transition when tiling.

(2) Pattern preservation. Circular padding mainly affects the boundary area of the image, leaving the central area and overall texture patterns unchanged.

Our tileable generation algorithm can serve two purposes: firstly, it can inherently produce tileable material maps without additional post-processing. Secondly, it can transform a non-tileable texture into a tileable version through an image-to-image generation pipeline, maintaining visual similarity with the original.
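The effect of circular padding can be checked with a small NumPy experiment. The sketch below implements a same-size convolution with either wrap-around or zero padding and tests whether the operator commutes with a cyclic shift of the input. With circular padding the image border is treated like any interior location, which is the property that prevents seams; the texture and kernel are random placeholders.

```python
import numpy as np


def conv2d_same(image, kernel, pad_mode):
    """Same-size 2D cross-correlation; pad_mode is "wrap" (circular) or "constant" (zero)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode=pad_mode)
    h, w = image.shape
    out = np.zeros((h, w), dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out


rng = np.random.default_rng(0)
texture = rng.random((16, 16))
kernel = rng.standard_normal((3, 3))
shift = (5, 7)


def commutes_with_shift(pad_mode):
    """True if filtering a cyclically shifted texture equals shifting the filtered texture."""
    a = conv2d_same(np.roll(texture, shift, axis=(0, 1)), kernel, pad_mode)
    b = np.roll(conv2d_same(texture, kernel, pad_mode), shift, axis=(0, 1))
    return bool(np.allclose(a, b))


circular_ok = commutes_with_shift("wrap")      # True
zero_pad_ok = commutes_with_shift("constant")  # False: zero padding marks the border
```

In a deep-learning framework the same behavior is usually obtained by switching the padding mode of each convolution layer, for example `padding_mode='circular'` in PyTorch's `torch.nn.Conv2d`.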

The SVBRDF decoder, denoted as $\mathcal{D}=\{\mathcal{D}_{P},\mathcal{D}_{S}\}$, decodes the unified latent representation $z$ into SVBRDFs $S\coloneqq\{\mathcal{P},\mathcal{N},\mathcal{R},\mathcal{M}\}=\mathcal{D}(z)$. Here, $\mathcal{P},\mathcal{N}\in\mathbb{R}^{H\times W\times 3}$ and $\mathcal{M},\mathcal{R}\in\mathbb{R}^{H\times W\times 1}$. In our implementation, we set $H=W=512$. Specifically, we utilize separate decoder networks: $\mathcal{D}_{P}(z)$ for the albedo map $\mathcal{P}$, and $\mathcal{D}_{S}(z)$ for the other property maps $\{\mathcal{N},\mathcal{R},\mathcal{M}\}$. These decoder networks follow the VAE decoder architecture (Kingma and Welling, 2014; Rezende et al., 2014) and are initialized with the weights from a pre-trained VAE decoder.

The training loss function for our PBR decoder $\mathcal{D}_{S}$ comprises the following terms:

$$\mathcal{L}_{\text{PBR}}=\mathcal{L}_{map}+\mathcal{L}_{perp}+\mathcal{L}_{gan}+\mathcal{L}_{reg}+\mathcal{L}_{render}, \tag{2}$$

$$\mathcal{L}_{render}(x,y)=\lVert\log(x+0.01)-\log(y+0.01)\rVert_{1}, \tag{3}$$

where $\mathcal{L}_{map}$ is the $L_{1}$ loss on the material property maps, $\mathcal{L}_{perp}$ is the perceptual loss based on LPIPS (Zhang et al., 2018), $\mathcal{L}_{gan}$ is the generative adversarial loss, $\mathcal{L}_{reg}$ is the Kullback-Leibler divergence penalty, and $\mathcal{L}_{render}$ is the $L_{1}$ log rendering loss applied to the rendered images $x$ and $y$ of the predicted and reference SVBRDFs.
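The log rendering loss is simple to state in code. The sketch below takes two already rendered images and compares them in log space with the 0.01 offset from Equation 3; averaging over pixels instead of summing is an assumption made here for scale invariance, and the function name is hypothetical.

```python
import numpy as np


def log_render_loss(rendered_pred, rendered_ref, offset=0.01):
    """L1 distance between log-compressed renderings (mean over pixels)."""
    diff = np.log(rendered_pred + offset) - np.log(rendered_ref + offset)
    return float(np.mean(np.abs(diff)))


# Identical renderings give zero loss.
image = np.full((4, 4, 3), 0.5)
loss_same = log_render_loss(image, image)

# A bright pixel (0.99) against a black pixel (0.0) gives |log(1.0) - log(0.01)| = log(100).
loss_extreme = log_render_loss(np.full((2, 2), 0.99), np.zeros((2, 2)))
```

The logarithm compresses the large dynamic range of specular highlights, so that errors in dark regions are not drowned out by a few very bright pixels.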

For the rendering loss, we adopt the sampling scheme proposed by Deschaintre et al. (2018b) to render nine images per material map. These comprise three images rendered with independently sampled distant light and view directions, and six images rendered with near-field mirrored view and lighting directions. The rendering loss yields desirable SVBRDF reconstructions by encouraging the training process to focus on minimizing errors in the material parameters that matter most for appearance, rather than treating all parameters with equal importance.
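A rough sketch of how nine light/view configurations of this kind might be drawn is given below. It only handles directions: the cosine-weighted hemisphere sampling and the omission of near-field light and camera positions are simplifying assumptions, so this should not be read as the exact scheme of Deschaintre et al. (2018b). The surface normal is assumed to be $(0,0,1)$.

```python
import numpy as np


def sample_hemisphere(rng):
    """Cosine-weighted random unit direction on the upper hemisphere (z > 0)."""
    u1, u2 = rng.random(), rng.random()
    r, phi = np.sqrt(u1), 2.0 * np.pi * u2
    return np.array([r * np.cos(phi), r * np.sin(phi), np.sqrt(1.0 - u1)])


def sample_render_configs(rng, n_independent=3, n_mirrored=6):
    """Return a list of (light_dir, view_dir) pairs."""
    configs = []
    for _ in range(n_independent):
        # Light and view are drawn independently of each other.
        configs.append((sample_hemisphere(rng), sample_hemisphere(rng)))
    for _ in range(n_mirrored):
        light = sample_hemisphere(rng)
        # Mirror the light about the surface normal so the view sees the specular peak.
        view = light * np.array([-1.0, -1.0, 1.0])
        configs.append((light, view))
    return configs


configs = sample_render_configs(np.random.default_rng(0))
```

The mirrored pairs place the half vector exactly on the surface normal, so those renderings are dominated by specular highlights and give the loss a strong signal about roughness and metallic values.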

As previously mentioned in Section 3.2, our material LDM training utilizes the standard VAE decoder to map the latent space to the albedo map. While effective in generating plausible RGB images, this decoder tends to produce images with strong highlights, especially for shiny materials such as leather and metal.

To address this, we introduce a highlight-aware albedo decoder $\mathcal{D}_{P}$ , which is finetuned on a synthetic shaded-to-albedo dataset, ensuring robust regularization to effectively minimize highlight artifacts in albedo maps. For each material sample in our SVBRDF dataset, we simulate various lighting conditions and viewpoints by randomly positioning point lights and cameras parallel to the material plane and then rendering SVBRDFs to reference shaded images by a physically-based renderer.

During training, the default VAE image encoder maps the shaded images into latent space, which are then decoded back to image space by our specialized albedo decoder. The training process for this decoder follows the original VAE loss function (Kingma and Welling, 2014).

High-resolution material maps are essential for achieving photorealistic renderings. However, due to the memory and performance constraints, current diffusion models typically generate images at a resolution of $512\times 512$ , which falls short of high-quality production rendering.

We introduce a material super-resolution module comprising four super-resolution networks $SR$, each following the Real-ESRGAN architecture (Wang et al., 2021). These super-resolution networks, denoted as $SR_{P},SR_{N},SR_{R},SR_{M}$, are designed to increase the resolution of the different SVBRDF property maps to $2048\times 2048$.

We fine-tune Real-ESRGAN, which was originally trained on purely synthetic data, with material data so that it more effectively captures the high-frequency details of materials. We incorporate a rendering loss (similar to Equation 3) into the training of the super-resolution module to ensure that the generated details contribute to high-frequency shading effects rather than visual artifacts. Note that special care must be taken with normal maps during data augmentation involving flipping and rotation: the directions stored in a normal map must be adjusted consistently with the map orientation so that they still describe the correct surface normals.
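The normal-map caveat can be illustrated with flipping. In the sketch below, normals are stored as an `(H, W, 3)` array of unit vectors in $[-1, 1]$, with the x component aligned with image columns and the y component with image rows; this storage convention and the function name are assumptions. Flipping the image along an axis must also negate the vector component along that axis.

```python
import numpy as np


def flip_normal_map(normals, axis):
    """Flip an (H, W, 3) normal map along axis 0 (rows) or 1 (columns)."""
    if axis not in (0, 1):
        raise ValueError("axis must be 0 (vertical flip) or 1 (horizontal flip)")
    flipped = np.flip(normals, axis=axis).copy()
    component = 0 if axis == 1 else 1  # x follows columns, y follows rows (assumed)
    flipped[..., component] *= -1.0    # mirror the stored direction as well
    return flipped


# A flat map whose left column tilts toward +x.
normals = np.zeros((2, 3, 3))
normals[..., 2] = 1.0
normals[:, 0] = [0.6, 0.0, 0.8]

flipped = flip_normal_map(normals, axis=1)
# The tilted column is now on the right and tilts toward -x, as a mirrored surface would.
```

Rotations need the analogous treatment: after rotating the image grid, the x and y components of each normal have to be rotated by the same angle.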

We propose three control modules for DreamPBR: Pixel Control, Style Control, and Shape Control. These modules are designed to be decoupled, allowing for flexible combinations of multiple controls.

Spatial property guidance is widely used in material creation by artists. Our Pixel Control module $G_{P}$ takes spatial control maps $I_{P}$ as input, utilizing the ControlNet architecture (Zhang et al., 2023), to guide the generation of spatially-consistent SVBRDFs. It supports controlled generation under sketch guidance and allows for image-to-image material inpainting with a binary mask.

Our material LDM, as described in Section 3.2, is adapted to the material domain, which already enables plausible material generation controlled by pre-trained ControlNet checkpoints trained with 2D supervision. However, we found that fine-tuning a pre-trained ControlNet with material data significantly improves both the controllability and the quality of generated materials. Specifically, we initialize our ControlNet using the ControlNet 1.1 Scribble checkpoint and fine-tune it on our SVBRDF dataset. To generate the sketch guidance, we employ PiDiNet (Su et al., 2021) to extract sketches from albedo maps.

The Style Control module $G_{I}$ takes an image prompt $I_{S}$ as input and extracts its style characteristics to guide material generation. Inspired by Ye et al. (2023), image prompts $I_{S}$ are first encoded into image features by CLIP’s image encoder, and then embedded into the material LDM using a decoupled cross-attention adaptation module. Multi-modal material generation can be achieved by accompanying the image prompt with a text prompt.
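Decoupled cross-attention, in the sense of Ye et al. (2023), can be sketched as two attention branches that share the query but use separate keys and values for text and image features, with the results summed. The `scale` weight on the image branch, the sizes, and the random inputs are placeholders assumed for illustration; this is not the trained adapter.

```python
import numpy as np


def attention(Q, K, V):
    """Scaled dot-product attention with a numerically stable softmax."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V


def decoupled_cross_attention(Q, K_text, V_text, K_img, V_img, scale=1.0):
    """Text branch plus a separately projected image branch, weighted by `scale`."""
    return attention(Q, K_text, V_text) + scale * attention(Q, K_img, V_img)


rng = np.random.default_rng(0)
N, M_text, M_img, d = 16, 77, 4, 8  # placeholder sizes
Q = rng.standard_normal((N, d))
K_text, V_text = rng.standard_normal((M_text, d)), rng.standard_normal((M_text, d))
K_img, V_img = rng.standard_normal((M_img, d)), rng.standard_normal((M_img, d))

text_only = decoupled_cross_attention(Q, K_text, V_text, K_img, V_img, scale=0.0)
text_and_image = decoupled_cross_attention(Q, K_text, V_text, K_img, V_img, scale=1.0)
```

Because the image branch is additive, setting `scale` to zero recovers the text-only behavior, which is what allows the image prompt to be combined with, or removed from, a text prompt without retraining the base model.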

Style Control module can effectively capture the appearance properties and structural information from the input images, to generate realistic and coherent material maps. This functionality is particularly useful in scenarios where materials need to be created based on specific exemplar images, which is a frequent requirement in the material design industry. The interaction of the Style Control module with the Shape Control module will be detailed in Section 3.4.3.

The Shape Control module $G_{S}$ takes a segmented 3D model $O_{s}=\{O,s\}$ ($s$ denotes the geometry segmentation) and an optional photo exemplar $I_{o}$ as input and automatically generates high-quality material maps for each segment. When provided with only a segmented 3D model and a basic text prompt, we leverage large language models (LLMs), such as ChatGPT (Achiam et al., 2023), to enrich the text descriptions for each segment. For instance, given a 3D chair model, the language model can generate diverse text descriptions for each part, such as the seat, legs, and armrests, each featuring varied design styles. Furthermore, integration with the existing Pixel Control and Style Control modules supports enhanced SVBRDF generation, ensuring superior quality and detailed material characteristics.

Our model integrates the material transfer pipeline TMT (Hu et al., 2022d) to automatically assign diverse generated materials to 3D shapes based on an image exemplar. The TMT pipeline involves two stages: firstly, translating color from exemplar image to the projection of 3D shape and vice versa for segmentation results; secondly, assigning materials to projected parts using a material classifier network, based on the translated image. Unlike Hu et al. (2022d), we do not rely on predefined material collections in material assignment. Instead, we use predicted material labels of TMT directly as text prompts and translated images as image prompts in the Style Control module, allowing high-quality SVBRDF generation for each part. The proposed algorithm offers two significant advantages over traditional material transfer models: it expands material diversity beyond limited predefined material collections and transfers not only color and category information but also comprehensive material attributes including styles and spatial structures from 2D exemplar to 3D shapes, leveraging the capabilities of our Style Control module.

Type	Tags (sample 1)	Tags (sample 2)	Tags (sample 3)	Tags (sample 4)	Tags (sample 5)
Brick	snow-covered bricks, winter, outdoor, house	coastal barrier bricks, sea-salt resistant, outdoor, barrier	stenciled brick floor, paving, terracotta, scratched	narrow bricks, walls	blackened fireplace bricks, charred
Fabric	tablecloth, delicate	denim jacket texture, clothing	hand woven carpet, artisan, carpet	floral cotton dress, clothing	backpack fabric, sturdy
Ground	ice glazed slippery, outdoor, winter	aerial mud, road, tracks	dry rocky ground	marble floor, polished, indoor	stone ground
Leather	perforated leather, breathable	black leather	decoration, indoor	leather white, smooth	reptile skin leather, textured
Metal	space cruiser panels, scifi	wrought iron gate, ornate, outdoor	golden metal wall, old	anodized metal surface, industrial	nickel plated hardware, smooth
Organic	alien slime	forest leaves, natural, autumn, dirt	dragon scales	stylized animal fur	honeycomb structure, geometric, natural, beehive
Plastic	plastic pattern, synthetic	yoga mat	synthetic plastic, rough	reflective safety vest, clothing	childrens playground slide, colorful
Tile	elegant, interior decoration	art deco style tiles, vintage, indoor, decorative	vintage ceiling tiles, indoor	patterned bw vinyl, floors	encaustic cement tiles, colorful, indoor, floor
Wall	dry stone wall, natural, outdoor	street art graffiti, colorful, urban	victorian wallpaper, patterned, indoor, historic	stucco finish, mediterranean	cliff, outdoor
Wood	blue, worn painted wood siding, walls	parquet wood flooring, geometric	charcoal	varnished walnut, glossy, indoor	bamboo wall covering, eco-friendly

Figure 3. Generation results of DreamPBR under text-only conditions. We randomly sampled numerous materials of various types with a wide range of tags, using prompts of the form “a PBR material of [type], [tags]”. DreamPBR generates not only materials that match the descriptions but also out-of-domain materials, such as snow-covered bricks (brick), a children’s playground slide (plastic), and street art graffiti (wall).

Prompt	Prompt	Prompt
a PBR material of brick, narrow bricks, walls	a PBR material of leather, smooth, white, clean	a PBR material of metal, ornate celtic gold
a PBR material of fabric, plush toy fur, soft, indoor	a PBR material of plastic, synthetic turf blades, green, sports	a PBR material of tile, glass mosaic art, translucent, decorative
a PBR material of ground, marble floor tiles, polished, indoor, luxury	a PBR material of fabric, dirty carpet, carpet, textile, faded, floor	a PBR material of wood, oak flooring, classic, indoor
a PBR material of leather, fabric leather, clean, seat, chair, couch	a PBR material of plastic, yoga mat	a PBR material of fabric, hand woven carpet, artisan, indoor
a PBR material of tile, bathroom floor tiles, non-slip, indoor	a PBR material of tile, slate walkway tiles, rugged, outdoor	a PBR material of tile, art deco style tiles, vintage, indoor, decorative
a PBR material of wall, tiled bathroom wall, moisture-resistant, indoor	a PBR material of metal, scratched scuffed metal	a PBR material of brick, sewer brick, walls
a PBR material of brick, brick floor, outdoor, clean, man made	a PBR material of metal, chrome car detailing, reflective, car trim	a PBR material of wood, burnt wood finish, charred, artistic, decor
a PBR material of tile, patterned bw vinyl, floors	a PBR material of wall, victorian wallpaper, patterned, indoor, historic	a PBR material of fabric, hand woven carpet, artisan, indoor, carpet
a PBR material of wall, Hello Kitty sticker wallpaper, colorful, indoor, nursery, easy-apply	a PBR material of wall, street art graffiti, colorful, outdoor, urban	a PBR material of brick, multi-colored street bricks, vibrant, outdoor
a PBR material of fabric, embroidered linen, delicate, indoor, tablecloth	a PBR material of metal, metal plate, scifi	a PBR material of fabric, carpet, Hello Kitty outdoor picnic mat, durable, foldable
a PBR material of ground, marble floor tiles, polished, indoor, luxury	a PBR material of leather, black motorcycle jacket, tough, clothing, jacket	a PBR material of ground, forest leaves, natural, leaves, autumn
a PBR material of metal, colored metal plate, scifi	a PBR material of brick, stenciled brick floor, man made, worn, paving, dry, terracotta	a PBR material of fabric, loose tablecloth

Figure 4. Pixel Control results with the same pattern but different materials. The binary images in the first column are the sketch control conditions. The generated materials to their right follow the given patterns while respecting material-specific properties, such as the edges of bricks and the growth rings of wood.

Method	LPIPS (Render)	RMSE (Albedo)	RMSE (Metallic)	RMSE (Normal)	RMSE (Roughness)
w/o $\mathcal{L}_{\text{render}}$	0.107	0.0361	0.0126	0.0542	0.0406
Ours (w/ $\mathcal{L}_{\text{render}}$)	0.101	0.0357	0.0086	0.0531	0.0365

Figure 14. Qualitative and quantitative comparison on PBR decoder with rendering loss. We trained two PBR decoders with and without rendering loss. With rendering loss, the decoded textures show better consistency in both rendering images and SVBRDF (especially for normal maps) visually and achieve the lowest LPIPS of rendering images and RMSE of SVBRDF maps as illustrated in the table.

	LPIPS	RMSE
Pretrained	0.450	0.0272	0.0816	0.0598	0.0588
w/o $\mathcal{L}_{\text{render}}$	0.342	0.0248	0.0652	0.0474	0.0451
Ours(w/ $\mathcal{L}_{\text{render}}$ )	0.321	0.0211	0.0643	0.0398	0.0445

(a) Generated cases. Columns: w/o HA, w/ HA.

(b) Test cases in dataset. Columns: Input, w/o HA, w/ HA (first row); Input, Refer., w/o HA, w/ HA (second row).

Method	L1 (highlight inputs)	PSNR (highlight inputs)	LPIPS (highlight inputs)	L1 (non-highlight inputs)	PSNR (non-highlight inputs)	LPIPS (non-highlight inputs)
w/o HA	0.0409	25.7460	0.1928	0.0201	33.2621	0.1220
w/ HA	0.0211	32.6578	0.1452	0.0202	33.2904	0.1241

Figure 16. Qualitative and quantitative comparison of the highlight-aware module (HA). As shown in 16(a), the original VAE albedo decoder may produce albedo maps that contain highlights, which the highlight-aware module removes to improve the generated results. As shown in 16(b), for inputs without highlights, the outputs differ only slightly whether or not the highlight-aware module is used, whereas for inputs with highlights, the module removes them. The table reaches the same conclusion numerically by comparing L1, PSNR, and LPIPS between the real albedo maps and the de-highlighted albedo maps, for inputs with and without highlights.

Figure 17. Qualitative comparison of ControlNet with and without fine-tuning. We evaluate the pre-trained ControlNet and our version fine-tuned from it for pixel-guided generation. The textures from the pre-trained ControlNet (w/o ft) resemble natural images rather than textures.

Our dataset comprises a total of 711 PBR materials, each including four 2k texture maps (albedo, normal, metallic, and roughness) along with corresponding textual labels. The data are sourced from PolyHaven (https://polyhaven.com/) and freePBR (https://freepbr.com/). We manually categorized the data into ten types: Brick (58), Fabric (60), Ground (99), Leather (45), Metal (130), Organic (45), Plastic (40), Tile (75), Wall (69), and Wood (90).

During the finetuning of the material LDM, the input text prompt $\mathcal{T}$ follows the format “a PBR material of [type], [name], [tags]”, where ‘type’ is the material category, and ‘name’ (the title) and ‘tags’ for each material are provided by the source website. During training, a random $30\%$ to $100\%$ of the tags are retained. To address the uneven category distribution in the original data, we selected high-quality, representative samples from the large categories and randomly duplicated samples in the small categories, which balances the sample sizes and yields a more uniform training distribution.
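
The prompt format and random tag retention described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `build_prompt`, the uniform sampling of the retention ratio, keeping at least one tag, and preserving the original tag order are all assumptions.

```python
import random


def build_prompt(material_type, name, tags, rng, min_keep=0.3, max_keep=1.0):
    """Build "a PBR material of [type], [name], [tags]" with a random tag subset."""
    # Draw the fraction of tags to retain uniformly from [min_keep, max_keep].
    ratio = rng.uniform(min_keep, max_keep)
    # Assumption: keep at least one tag whenever tags are available.
    n_keep = max(1, round(ratio * len(tags))) if tags else 0
    # Sample which tags survive, then sort indices to preserve the original order.
    kept_idx = sorted(rng.sample(range(len(tags)), n_keep))
    kept = [tags[i] for i in kept_idx]
    return "a PBR material of " + ", ".join([material_type, name] + kept)


rng = random.Random(0)  # fixed seed so the example is reproducible
all_tags = ["walls", "outdoor", "clean", "man made"]
prompt = build_prompt("brick", "narrow bricks", all_tags, rng)
print(prompt)
```

Dropping a random subset of tags at each step exposes the model to both short and long prompts for the same material.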

For the 2k textures, we apply horizontal flipping, vertical flipping, random rotation, and multi-scale cropping, adjust the directions of the normal maps accordingly, and finally resize the results to $512\times 512$ pixels as our training data. After augmentation, we render each texture under randomly sampled viewpoints and lighting using the differentiable renderer of Laine et al. (2020). The rendered images are also used to train our highlight-aware albedo decoder.
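
To illustrate why normal maps need a direction adjustment under flipping, the sketch below mirrors a tiny normal map stored as nested lists of $(x, y, z)$ vectors. The storage convention is an assumption (components in $[-1, 1]$, with $x$ along image columns and $y$ along image rows), and the helper names are hypothetical. Color maps such as albedo only need the pixel reordering.

```python
def hflip_color(image):
    """Mirror a color map left-right: only the pixel order changes."""
    return [list(reversed(row)) for row in image]


def hflip_normal(normal):
    """Mirror a normal map left-right: reverse each row and negate x."""
    return [[(-x, y, z) for (x, y, z) in reversed(row)] for row in normal]


def vflip_normal(normal):
    """Mirror a normal map top-bottom: reverse the row order and negate y."""
    return [[(x, -y, z) for (x, y, z) in row] for row in reversed(normal)]


# A 1x2 normal map: the left pixel tilts toward +x, the right pixel is flat.
normal_map = [[(0.6, 0.0, 0.8), (0.0, 0.0, 1.0)]]
flipped = hflip_normal(normal_map)
print(flipped)
```

After the horizontal flip, the tilted pixel sits on the right and leans toward $-x$, so the surface it describes is the mirror image of the original. Without the sign change the bump would appear lit from the wrong side.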

For the paired data used to train ControlNet, we use PiDiNet (Su et al., 2021) to extract sketches from the albedo maps, as mentioned in Section 3.4.1.

DreamPBR was trained on four Nvidia RTX 3090 GPUs. For the material LDM, we use Adam as the optimizer with a base learning rate of $1.6\times 10^{-3}$ and learning-rate scaling disabled. Starting from the stable-diffusion-v1-5 checkpoint, we finetuned the model for 9000 epochs, which took approximately 10 days. For the PBR decoder, we set the base learning rate to $4.5\times 10^{-6}$ and enabled scale_lr; training took 4 days in total. The decoder has 8 output channels: three each for albedo and normal, and one each for metallic and roughness. For the highlight-aware albedo decoder, we set the base learning rate to $4.5\times 10^{-6}$ and enabled scale_lr; training took 2 days in total, and the decoder has 3 output channels. We incorporate the rendering loss detailed in Section 3.3 during the training processes above.
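
As a rough illustration of what enabling scale_lr implies, the sketch below assumes the convention of the public latent-diffusion training script, where the effective learning rate is the base rate multiplied by the number of GPUs, the per-GPU batch size, and the gradient accumulation steps. The batch size is not stated in the text, so `HYPOTHETICAL_BATCH_SIZE` is a placeholder, and the material LDM rate is taken as unscaled under the assumption that scaling is off for it.

```python
def effective_learning_rate(base_lr, n_gpus, batch_size,
                            accumulate_grad_batches=1, scale_lr=True):
    """Return the learning rate actually passed to the optimizer."""
    if not scale_lr:
        # Without scaling, the base learning rate is used unchanged.
        return base_lr
    # Assumed convention: lr = accumulation steps * GPUs * batch size * base lr.
    return accumulate_grad_batches * n_gpus * batch_size * base_lr


N_GPUS = 4                   # four RTX 3090 GPUs, as stated in the text
HYPOTHETICAL_BATCH_SIZE = 4  # placeholder; the paper does not report this value

ldm_lr = effective_learning_rate(1.6e-3, N_GPUS, HYPOTHETICAL_BATCH_SIZE, scale_lr=False)
decoder_lr = effective_learning_rate(4.5e-6, N_GPUS, HYPOTHETICAL_BATCH_SIZE, scale_lr=True)
print(ldm_lr, decoder_lr)
```

Under this convention the decoder's effective rate grows linearly with the total batch size, which is why the reported base rates for the decoders are much smaller than the one for the material LDM.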

For the rendering-aware super-resolution module, we initialized from the pretrained Real-ESRGAN weights (Wang et al., 2021) and finetuned four super-resolution modules, one each for the albedo, normal, metallic, and roughness textures. These modules were finetuned with a learning rate of $1\times 10^{-4}$ for 10,000 iterations in total. Furthermore, we trained all four modules jointly in a single model so that their outputs could be rendered during training and the rendering loss incorporated.

To enhance image control performance, we train ControlNet with a learning rate of $1\times 10^{-5}$, which takes about 2 days. For Style Control, we directly use the ip-adapter_sd15 checkpoint together with our finetuned checkpoint, as we observed satisfactory results with this combination.

DreamPBR can generate realistic or imaginative materials from text descriptions alone. To demonstrate its ability to synthesize a wide range of materials, we use an LLM to produce a large number of material descriptions for each type and sample materials from DreamPBR with them. The generated textures are enhanced by the super-resolution module and then rendered, as shown in Figure 3. The 400 sampled textures are highly consistent with the text: the mean CLIP Score between the rendered images and the given prompts is 30.198. Besides text-image consistency, the diversity of results is also important for text-driven generative models. As demonstrated in Figure 7, when we sample several textures with the same prompt but different random seeds, DreamPBR produces diverse textures that all follow the specified descriptions.
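
For reference, a CLIP Score of this kind is commonly computed as a scaled cosine similarity between image and text embeddings, clipped at zero. The sketch below uses that common definition with a scale of 100, which is an assumption, since the exact formula used for the reported value is not given here. The embeddings are toy vectors; in practice they would come from a CLIP image encoder and text encoder.

```python
import math


def clip_score(image_emb, text_emb, scale=100.0):
    """Scaled cosine similarity between two embeddings, clipped at zero."""
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norm_i = math.sqrt(sum(a * a for a in image_emb))
    norm_t = math.sqrt(sum(b * b for b in text_emb))
    return max(scale * dot / (norm_i * norm_t), 0.0)


def mean_clip_score(pairs):
    """Average the score over (image_embedding, text_embedding) pairs."""
    return sum(clip_score(i, t) for i, t in pairs) / len(pairs)


# Toy embeddings standing in for encoder outputs of rendered images and prompts.
toy_pairs = [
    ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),  # identical directions
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # orthogonal directions
]
print(mean_clip_score(toy_pairs))
```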

Regardless of the controls users introduce, DreamPBR always generates seamless tileable textures, which lets users apply them at different scales and in different scenes. In Figure 8, we present several tileable textures from direct and guided generation together with their tiled results, showing the effectiveness of the circular padding described in Section 3.2.
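
The effect of circular padding can be illustrated with a small pure-Python sketch, which is not the network code itself. A 3x3 filter is applied with wrap-around indexing, so border pixels take their neighbors from the opposite edge. The operation then commutes with circular shifts of the image, meaning no position in the texture acts as a special seam. The helper names are hypothetical, and the filter is applied as a cross-correlation, as is usual in deep learning libraries.

```python
def circular_conv3x3(image, kernel):
    """Apply a 3x3 filter with wrap-around (circular) padding."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    # Modulo indexing wraps neighbors around the image borders.
                    acc += kernel[di + 1][dj + 1] * image[(i + di) % h][(j + dj) % w]
            out[i][j] = acc
    return out


def roll(image, dy, dx):
    """Circularly shift an image by dy rows and dx columns."""
    h, w = len(image), len(image[0])
    return [[image[(i - dy) % h][(j - dx) % w] for j in range(w)] for i in range(h)]


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
kernel = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]

shift_then_filter = circular_conv3x3(roll(image, 1, 2), kernel)
filter_then_shift = roll(circular_conv3x3(image, kernel), 1, 2)
print(shift_then_filter == filter_then_shift)
```

With zero padding instead, the two results would differ near the borders, which is the source of visible seams when a texture is tiled.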

By finetuning an additional ControlNet, DreamPBR can generate textures that follow given patterns. In practice, a designer may fix a pattern in advance and then try different materials, or fix a material and then try different patterns. In both situations, DreamPBR produces plausible textures, as demonstrated in Figure 4 and Figure 5.

With the additional control of binary images, inpainting is another common way for users to obtain specific results. We present several inpainting results in Figure 9, in which a region of a texture is replaced with content described by the user.

A style image often conveys a desired appearance more easily than text alone, as in Su et al. (2021). We therefore evaluate the adaptation of Su et al. (2021) for our Style Control. Specifically, we collect several style images online and present the results generated under these styles in Figure 6. Figure 10 illustrates the case where users combine Style Control with Pixel Control, which gives them more freedom to generate the results they want.

With its ability to generate diverse textures, DreamPBR can be extended to non-planar objects such as chairs. Specifically, given a segmented object, we query a large language model to obtain different descriptions for each region. For more specific objects, a more direct approach is to use cropped areas from exemplar images in conjunction with Pixel Control and Style Control. The results of our Shape Control pipeline, which benefit from the tileability of the generated textures, are shown in Figure 11.

Leveraging the state-of-the-art generative model Stable Diffusion, DreamPBR is highly competitive with previous material generation methods. We compare materials generated by DreamPBR against MaterialGAN (Guo et al., 2020) and TileGen (Zhou et al., 2022) in Figure 12. Notably, the competing methods provide only two categories, so our results are generated with the prompts “a PBR material of ground, stone” and “a PBR material of metal”. The comparison shows that DreamPBR can generate textures that follow the distribution of realistic dataset materials, as GAN-based methods do, as well as imaginative textures that draw on the prior learned from 2D images.

Moreover, we compare our Pixel Control with TileGen on sketch-guided generation. The results are shown in Figure 13, where TileGen and our method are given the same binary masks. DreamPBR shows fewer artifacts and more precise control than TileGen in sketch-driven generation.

The training of DreamPBR involves several optional modules and additional loss functions. In this section, we evaluate the effect of each design choice. For evaluation, we randomly selected 100 textures from our collected data that were held out from all training stages.

When training the PBR decoder, we introduce $\mathcal{L}_{\text{render}}$, computed on images rendered with random lights and viewpoints, which enforces that the decoded textures remain realistic after rendering. Compared with training without rendered images, it reduces the search space of output values. We trained two PBR decoders, with and without $\mathcal{L}_{\text{render}}$, and evaluated them by comparing their outputs with reference textures. Figure 14 presents the comparison: our rendering-aware decoder achieves more realistic renderings and more consistent generated textures.
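
The per-map RMSE used in this comparison can be sketched as below, assuming each map is given as rows of scalar values in a common range and that the error is averaged over all pixels. The function name and the data layout are illustrative assumptions. The LPIPS metric on rendered images relies on a pretrained network and is not reproduced here.

```python
import math


def rmse(pred, ref):
    """Root-mean-square error between two maps given as rows of scalars."""
    flat_pred = [v for row in pred for v in row]
    flat_ref = [v for row in ref for v in row]
    if len(flat_pred) != len(flat_ref):
        raise ValueError("maps must contain the same number of values")
    squared = [(p - r) ** 2 for p, r in zip(flat_pred, flat_ref)]
    return math.sqrt(sum(squared) / len(squared))


# Toy 1x2 roughness maps standing in for a decoded map and its reference.
decoded = [[0.0, 0.0]]
reference = [[0.6, 0.8]]
print(rmse(decoded, reference))
```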

Although super-resolution models already perform well on natural images, we further finetune them on our material data and employ a rendering loss $\mathcal{L}_{\text{render}}$ at the perceptual level. In practice, we finetune a super-resolution module for each texture component, starting from the pre-trained Real-ESRGAN as our baseline. We jointly finetune the four individual modules (albedo, metallic, normal, and roughness) and introduce $\mathcal{L}_{\text{render}}$ by rendering the four super-resolved textures to image space. The comparison results are shown in Figure 15. As with the PBR decoder, finetuning the super-resolution modules with $\mathcal{L}_{\text{render}}$ yields better results.

As mentioned in Section 3.3, we introduce a highlight-aware albedo decoder to remove potential highlights in generated RGB images. A good de-highlighting module must satisfy two requirements: 1) it effectively removes highlights from images, and 2) it leaves images without highlights unchanged. In practice, training only on rendered images can degrade the decoding of highlight-free albedo, so we finetune the highlight-aware decoder on inputs randomly chosen from either images rendered under different lights or pure albedo maps. Furthermore, we compare the outputs of the highlight-aware decoder with those of the initial pre-trained decoder in Figure 16, which suggests that our decoder satisfies both requirements.
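
The mixed-input finetuning strategy can be sketched as a sampling step that pairs either a rendered image or the clean albedo with the same albedo target. This is a minimal sketch: the function name and the mixing probability `p_albedo` are hypothetical, as the text does not report the actual ratio.

```python
import random


def sample_decoder_input(albedo, rendered_views, rng, p_albedo=0.5):
    """Return an (input, target) pair for highlight-aware decoder finetuning."""
    # With probability p_albedo, feed the clean albedo so that the decoder
    # learns to leave highlight-free inputs unchanged.
    if not rendered_views or rng.random() < p_albedo:
        return albedo, albedo
    # Otherwise feed one image rendered under a random light; the target is
    # still the clean albedo, so highlights must be removed.
    return rng.choice(rendered_views), albedo


rng = random.Random(0)
views = ["render_light_0", "render_light_1", "render_light_2"]  # placeholders
pairs = [sample_decoder_input("albedo_map", views, rng) for _ in range(8)]
print(pairs)
```

The target is always the clean albedo; only the input varies, which covers both requirements listed above within one training objective.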

To realize sketch-guided control, we embed a pre-trained ControlNet in DreamPBR. However, unlike the IP-Adapter used for Style Control, which incorporates image semantics in CLIP space independently of the training data, the initial ControlNet leads to a domain shift from the albedo domain back to the natural-image domain in our experiments. To address this problem, we finetuned the ControlNet with our sketch-albedo pairs as mentioned above. The comparison of ControlNet before and after finetuning is shown in Figure 17.

Despite the promising capabilities of DreamPBR in generating high-quality and diverse material textures, our method has limitations that merit further exploration and improvement. We employ normal maps to reveal surface details in textures. However, using normal maps without displacement maps ignores self-occlusion during rendering, which makes the rendered results less realistic. In addition, although a longer description yields a more detailed texture that better matches the user's intent, writing such a description, for example “a PBR material of the wall, concrete wall, outdoor, cracked, man-made, rough, painted…”, is burdensome for users.

In this paper, we propose DreamPBR, a novel diffusion-based generative framework for creating physically-based material textures. Our method does not rely on large datasets as image generation does; instead, it transfers the prior of pretrained image generation models to the desired textures. Given text descriptions and other optional multi-modal conditions, we can generate textures that are highly consistent with the text and with the other conditions, such as the styles of RGB images and the patterns of binary images. With DreamPBR, users can freely create planar textures according to their imagination. Specifically, we first finetune a diffusion model for albedo generation and then decode the remaining SVBRDF maps (normal, metallic, and roughness) with our highlight-aware decoder and PBR decoder. For higher-resolution textures, we introduce an additional loss on rendered images into our super-resolution module, which brings a significant visual improvement. With the properties above, DreamPBR can also produce textures for simple geometries through dialogue with an LLM.

For future work, although DreamPBR currently targets planar textures, it could be extended to complex geometries with further development of retopology. Additionally, thanks to our effective PBR decoder and highlight-aware decoder, DreamPBR has the potential to be used for SVBRDF estimation. Lastly, diffusion models inevitably suffer from limited resolution and time-consuming inference, which remain challenging problems for future work.

Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
Aittala et al. (2016) Miika Aittala, Timo Aila, and Jaakko Lehtinen. 2016. Reflectance Modeling by Neural Texture Synthesis. ACM Trans. Graph. 35, 4, Article 65 (jul 2016), 13 pages. https://doi.org/10.1145/2897824.2925917
Aittala et al. (2015) Miika Aittala, Tim Weyrich, and Jaakko Lehtinen. 2015. Two-Shot SVBRDF Capture for Stationary Materials. ACM Trans. Graph. 34, 4, Article 110 (jul 2015), 13 pages. https://doi.org/10.1145/2766967
Deitke et al. (2023) Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. 2023. Objaverse-XL: A Universe of 10M+ 3D Objects. arXiv:2307.05663 [cs.CV]
Denis Zavadski and Rother (2023) Denis Zavadski, Johann-Friedrich Feiden, and Carsten Rother. 2023. ControlNet-XS: Designing an Efficient and Effective Architecture for Controlling Text-to-Image Diffusion Models. (2023).
Deschaintre et al. (2018a) Valentin Deschaintre, Miika Aittala, Fredo Durand, George Drettakis, and Adrien Bousseau. 2018a. Single-image svbrdf capture with a rendering-aware deep network. ACM Transactions on Graphics (ToG) 37, 4 (2018), 1–15.
Deschaintre et al. (2018b) Valentin Deschaintre, Miika Aittala, Fredo Durand, George Drettakis, and Adrien Bousseau. 2018b. Single-image svbrdf capture with a rendering-aware deep network. ACM Transactions on Graphics (ToG) 37, 4 (2018), 1–15.
Deschaintre et al. (2019) Valentin Deschaintre, Miika Aittala, Frédo Durand, George Drettakis, and Adrien Bousseau. 2019. Flexible SVBRDF Capture with a Multi-Image Deep Network. Computer Graphics Forum (Proceedings of the Eurographics Symposium on Rendering) 38, 4 (July 2019). http://www-sop.inria.fr/reves/Basilic/2019/DADDB19
Dong (2019) Yue Dong. 2019. Deep appearance modeling: A survey. Visual Informatics 3, 2 (2019), 59–68. https://doi.org/10.1016/j.visinf.2019.07.003
Dong et al. (2014) Yue Dong, Guojun Chen, Pieter Peers, Jiawan Zhang, and Xin Tong. 2014. Appearance-from-Motion: Recovering Spatially Varying Surface Reflectance under Unknown Lighting. ACM Trans. Graph. 33, 6, Article 193 (nov 2014), 12 pages. https://doi.org/10.1145/2661229.2661283
Gao et al. (2019) Duan Gao, Xiao Li, Yue Dong, Pieter Peers, Kun Xu, and Xin Tong. 2019. Deep inverse rendering for high-resolution svbrdf estimation from an arbitrary number of images. ACM Transactions on Graphics (TOG) 38, 4 (2019), 1–15.
Goodfellow et al. (2014) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Networks. (2014). arXiv:1406.2661 [stat.ML]
Guarnera et al. (2016) D. Guarnera, G.C. Guarnera, A. Ghosh, C. Denk, and M. Glencross. 2016. BRDF Representation and Acquisition. Computer Graphics Forum 35, 2 (2016), 625–650. https://doi.org/10.1111/cgf.12867
Guerrero et al. (2022) Paul Guerrero, Milos Hasan, Kalyan Sunkavalli, Radomir Mech, Tamy Boubekeur, and Niloy Mitra. 2022. MatFormer: A Generative Model for Procedural Materials. ACM Trans. Graph. 41, 4, Article 46 (2022). https://doi.org/10.1145/3528223.3530173
Guo et al. (2021) Jie Guo, Shuichang Lai, Chengzhi Tao, Yuelong Cai, Lei Wang, Yanwen Guo, and Ling-Qi Yan. 2021. Highlight-Aware Two-Stream Network for Single-Image SVBRDF Acquisition. ACM Trans. Graph. 40, 4, Article 123 (jul 2021), 14 pages. https://doi.org/10.1145/3450626.3459854
Guo et al. (2023) Jie Guo, Shuichang Lai, Qinghao Tu, Chengzhi Tao, Changqing Zou, and Yanwen Guo. 2023. Ultra-High Resolution SVBRDF Recovery from a Single Image. ACM Trans. Graph. 42, 3, Article 33 (jun 2023), 14 pages. https://doi.org/10.1145/3593798
Guo et al. (2020) Yu Guo, Cameron Smith, Miloš Hašan, Kalyan Sunkavalli, and Shuang Zhao. 2020. MaterialGAN: Reflectance Capture Using a Generative SVBRDF Model. ACM Trans. Graph. 39, 6, Article 254 (nov 2020), 13 pages. https://doi.org/10.1145/3414685.3417779
Henzler et al. (2021) Philipp Henzler, Valentin Deschaintre, Niloy J. Mitra, and Tobias Ritschel. 2021. Generative Modelling of BRDF Textures from Flash Images. ACM Trans. Graph. 40, 6, Article 284 (dec 2021), 13 pages. https://doi.org/10.1145/3478513.3480507
Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic Models. arXiv:2006.11239 [cs.LG]
Hu et al. (2022c) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022c. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations. https://openreview.net/forum?id=nZeVKeeFYf9
Hu et al. (2022d) Ruizhen Hu, Xiangyu Su, Xiangkai Chen, Oliver van Kaick, and Hui Huang. 2022d. Photo-to-Shape Material Transfer for Diverse Structures. ACM Transactions on Graphics (Proceedings of SIGGRAPH) 39, 6 (2022), 113:1–113:14.
Hu et al. (2019) Yiwei Hu, Julie Dorsey, and Holly Rushmeier. 2019. A Novel Framework for Inverse Procedural Texture Modeling. ACM Trans. Graph. 38, 6, Article 186 (nov 2019), 14 pages. https://doi.org/10.1145/3355089.3356516
Hu et al. (2022a) Yiwei Hu, Paul Guerrero, Milos Hasan, Holly Rushmeier, and Valentin Deschaintre. 2022a. Node Graph Optimization Using Differentiable Proxies. In ACM SIGGRAPH 2022 Conference Proceedings (Vancouver, BC, Canada) (SIGGRAPH ’22). Association for Computing Machinery, New York, NY, USA, Article 5, 9 pages. https://doi.org/10.1145/3528233.3530733
Hu et al. (2023) Yiwei Hu, Paul Guerrero, Milos Hasan, Holly Rushmeier, and Valentin Deschaintre. 2023. Generating Procedural Materials from Text or Image Prompts. In Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Proceedings (SIGGRAPH ’23). ACM. https://doi.org/10.1145/3588432.3591520
Hu et al. (2022b) Yiwei Hu, Chengan He, Valentin Deschaintre, Julie Dorsey, and Holly Rushmeier. 2022b. An Inverse Procedural Modeling Pipeline for SVBRDF Maps. ACM Trans. Graph. 41, 2, Article 18 (jan 2022), 17 pages. https://doi.org/10.1145/3502431
Hui et al. (2017) Zhuo Hui, Kalyan Sunkavalli, Joon-Young Lee, Sunil Hadap, Jian Wang, and Aswin C. Sankaranarayanan. 2017. Reflectance Capture Using Univariate Sampling of BRDFs. In 2017 IEEE International Conference on Computer Vision (ICCV). 5372–5380. https://doi.org/10.1109/ICCV.2017.573
Isola et al. (2018) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2018. Image-to-Image Translation with Conditional Adversarial Networks. arXiv:1611.07004 [cs.CV]
Karras et al. (2018) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018. Progressive Growing of GANs for Improved Quality, Stability, and Variation. arXiv:1710.10196 [cs.NE]
Karras et al. (2021) Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2021. Alias-Free Generative Adversarial Networks. In Proc. NeurIPS.
Karras et al. (2019) Tero Karras, Samuli Laine, and Timo Aila. 2019. A Style-Based Generator Architecture for Generative Adversarial Networks. (2019). arXiv:1812.04948 [cs.NE]
Karras et al. (2020) Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. Analyzing and Improving the Image Quality of StyleGAN. In Proc. CVPR.
Kingma and Welling (2014) Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings.
Kodali et al. (2017) Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. 2017. On Convergence and Stability of GANs. arXiv:1705.07215 [cs.AI]
Laine et al. (2020) Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, and Timo Aila. 2020. Modular Primitives for High-Performance Differentiable Rendering. ACM Transactions on Graphics 39, 6 (2020).
Li et al. (2017) Xiao Li, Yue Dong, Pieter Peers, and Xin Tong. 2017. Modeling surface appearance from a single photograph using self-augmented convolutional neural networks. ACM Transactions on Graphics (ToG) 36, 4 (2017), 1–11.
Li et al. (2019) Xiao Li, Yue Dong, Pieter Peers, and Xin Tong. 2019. Synthesizing 3D Shapes from Silhouette Image Collections using Multi-projection Generative Adversarial Networks. arXiv:1906.03841 [cs.CV]
Li et al. (2021) Yuheng Li, Yijun Li, Jingwan Lu, Eli Shechtman, Yong Jae Lee, and Krishna Kumar Singh. 2021. Collaging class-specific gans for semantic image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 14418–14427.
Li et al. (2018) Zhengqin Li, Kalyan Sunkavalli, and Manmohan Chandraker. 2018. Materials for Masses: SVBRDF Acquisition with a Single Mobile Phone Image. In Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part III (Munich, Germany). Springer-Verlag, Berlin, Heidelberg, 74–90. https://doi.org/10.1007/978-3-030-01219-9_5
Lin et al. (2023) Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. 2023. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 300–309.
Liu et al. (2022) Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. 2022. Pseudo Numerical Methods for Diffusion Models on Manifolds. arXiv:2202.09778 [cs.CV]
Liu et al. (2023) Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. 2023. Zero-1-to-3: Zero-shot One Image to 3D Object. arXiv:2303.11328 [cs.CV]
Mou et al. (2023) Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. 2023. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453 (2023).
Palma et al. (2012) Gianpaolo Palma, Marco Callieri, Matteo Dellepiane, and Roberto Scopigno. 2012. A Statistical Method for SVBRDF Approximation from Video Sequences in General Lighting Conditions. Computer Graphics Forum (2012). https://doi.org/10.1111/j.1467-8659.2012.03145.x
Park et al. (2018) Keunhong Park, Konstantinos Rematas, Ali Farhadi, and Steven M. Seitz. 2018. PhotoShape: Photorealistic Materials for Large-Scale Shape Collections. ACM Trans. Graph. 37, 6, Article 192 (Nov. 2018).
Park et al. (2019) Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. 2019. Semantic Image Synthesis with Spatially-Adaptive Normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Poole et al. (2022) Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. 2022. DreamFusion: Text-to-3D using 2D Diffusion. arXiv (2022).
Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.
Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 1, 2 (2022), 3.
Reed et al. (2016a) Scott Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. 2016a. Learning What and Where to Draw. arXiv:1610.02454 [cs.CV]
Reed et al. (2016b) Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016b. Generative Adversarial Text to Image Synthesis. arXiv:1605.05396 [cs.NE]
Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning. PMLR, 1278–1286.
Riviere et al. (2016) J. Riviere, P. Peers, and A. Ghosh. 2016. Mobile Surface Reflectometry. Computer Graphics Forum 35, 1 (2016), 191–202. https://doi.org/10.1111/cgf.12719
Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752 [cs.CV]
Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv:1505.04597 [cs.CV]
Sartor and Peers (2023) Sam Sartor and Pieter Peers. 2023. MatFusion: A Generative Diffusion Model for SVBRDF Capture. In SIGGRAPH Asia 2023 Conference Papers. 1–10.
Shi et al. (2020) Liang Shi, Beichen Li, Miloš Hašan, Kalyan Sunkavalli, Tamy Boubekeur, Radomir Mech, and Wojciech Matusik. 2020. MATch: Differentiable Material Graphs for Procedural Material Capture. ACM Trans. Graph. 39, 6 (Dec. 2020), 1–15.
Shi et al. (2023) Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei, Linghao Chen, Chong Zeng, and Hao Su. 2023. Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model. arXiv:2310.15110 [cs.CV]
Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning. PMLR, 2256–2265.
Song et al. (2020) Jiaming Song, Chenlin Meng, and Stefano Ermon. 2020. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020).
Song et al. (2021) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2021. Score-Based Generative Modeling through Stochastic Differential Equations. In International Conference on Learning Representations. https://openreview.net/forum?id=PxTIG12RRHS
Su et al. (2021) Zhuo Su, Wenzhe Liu, Zitong Yu, Dewen Hu, Qing Liao, Qi Tian, Matti Pietikäinen, and Li Liu. 2021. Pixel difference networks for efficient edge detection. In Proceedings of the IEEE/CVF international conference on computer vision. 5117–5127.
Tang et al. (2023) Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. 2023. DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation. arXiv preprint arXiv:2309.16653 (2023).
Tulyakov et al. (2017) Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. 2017. MoCoGAN: Decomposing Motion and Content for Video Generation. arXiv:1707.04993 [cs.CV]
Vecchio et al. (2023) Giuseppe Vecchio, Rosalie Martin, Arthur Roullier, Adrien Kaiser, Romain Rouffet, Valentin Deschaintre, and Tamy Boubekeur. 2023. ControlMat: Controlled Generative Approach to Material Capture. arXiv preprint arXiv:2309.01700 (2023).
Walter et al. (2007) Bruce Walter, Stephen R. Marschner, Hongsong Li, and Kenneth E. Torrance. 2007. Microfacet Models for Refraction through Rough Surfaces. In Proceedings of the 18th Eurographics Conference on Rendering Techniques (Grenoble, France) (EGSR’07). Eurographics Association, Goslar, DEU, 195–206.
Wang et al. (2021) Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. 2021. Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data. In Proceedings of the IEEE/CVF international conference on computer vision. 1905–1914.
Wang et al. (2023) Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. 2023. ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation. arXiv preprint arXiv:2305.16213 (2023).
Weinmann and Klein (2015) Michael Weinmann and Reinhard Klein. 2015. Advances in Geometry and Reflectance Acquisition (Course Notes). In SIGGRAPH Asia 2015 Courses (Kobe, Japan) (SA ’15). Association for Computing Machinery, New York, NY, USA, Article 1, 71 pages. https://doi.org/10.1145/2818143.2818165
Xu et al. (2016) Zexiang Xu, Jannik Boll Nielsen, Jiyang Yu, Henrik Wann Jensen, and Ravi Ramamoorthi. 2016. Minimal BRDF Sampling for Two-Shot near-Field Reflectance Acquisition. ACM Trans. Graph. 35, 6, Article 188 (dec 2016), 12 pages. https://doi.org/10.1145/2980179.2982396
Ye et al. (2023) Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. 2023. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. (2023).
Zhang et al. (2023) Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. 2023. Adding Conditional Control to Text-to-Image Diffusion Models.
Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In CVPR.
Zhou et al. (2022) Xilong Zhou, Miloš Hašan, Valentin Deschaintre, Paul Guerrero, Kalyan Sunkavalli, and Nima Kalantari. 2022. TileGen: Tileable, Controllable Material Generation and Capture. (2022). arXiv:2206.05649 [cs.GR]
Zhou et al. (2016) Zhiming Zhou, Guojun Chen, Yue Dong, David Wipf, Yong Yu, John Snyder, and Xin Tong. 2016. Sparse-as-Possible SVBRDF Acquisition. ACM Trans. Graph. 35, 6, Article 189 (dec 2016), 12 pages. https://doi.org/10.1145/2980179.2980247
Zhu et al. (2020) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2020. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. arXiv:1703.10593 [cs.CV]
Zhu et al. (2019) Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. 2019. DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis. arXiv:1904.01310 [cs.CV]

[Figure: two pairs of panels, each labeled Output and Expansion.]

[Figure: panels labeled Input, Yellow flower, Red flower, Blue flower, Cyan flower, Purple flower, Pink flower, Leaf, and Grass.]

[Figure: a Prompt column followed by four Render/SVBRDF pairs per row. The text prompts, one per row, are:]
a PBR material of metal, space cruiser panels
a PBR material of wall, street art graffiti, colorful, outdoor, urban
a PBR material of tiles, encaustic cement tiles, colorful, indoor, floor

[Figure: a Style column followed by four Render/SVBRDF pairs per row. The text prompts under each pair, listed row by row, are:]
Row 1: a PBR material of fabric, carpet; a PBR material of ground, stone, outdoor; a PBR material of wood; a PBR material of tile, marble
Row 2: a PBR material of tile, encaustic cement; a PBR material of wall, concrete wall, outdoor, cracked, man made, rough, painted; a PBR material of brick, street brick, outdoor; a PBR material of wood, varnished walnut, painted, artistic
Row 3: a PBR material of brick, street brick, outdoor, art; a PBR material of leather; a PBR material of ground, sidewalk; a PBR material of tile, encaustic cement
Row 4: a PBR material of fabric, hand woven carpet; a PBR material of tile, marble; a PBR material of wall, wallpaper; a PBR material of wood, synthetic wood, painted

[Figure: columns labeled Prompt, Style, Pixel, Render, and SVBRDF. The text prompts, one per row, are:]
a PBR material of tiles, marble
a PBR material of wood, indoor
a PBR material of tiles, art deco style tiles, vintage, indoor, decorative
a PBR material of fabric, patchwork quilt, colorful, indoor, bedding
a PBR material of fabric, hand woven carpet, cute bunny, artisan, indoor

[Figure: panels labeled Reference, Low Res., Pretrained, w/o $\mathcal{L}_{\text{render}}$, and Ours (w/ $\mathcal{L}_{\text{render}}$).]

Method	LPIPS (Render)	RMSE (Albedo)	RMSE (Metal.)	RMSE (Normal)	RMSE (Rough.)
Pretrained	0.450	0.0272	0.0816	0.0598	0.0588
w/o $\mathcal{L}_{\text{render}}$	0.342	0.0248	0.0652	0.0474	0.0451
Ours (w/ $\mathcal{L}_{\text{render}}$)	0.321	0.0211	0.0643	0.0398	0.0445
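
To make the gaps in the table above easier to read, the short sketch below converts the reported values into relative reductions against the Pretrained row. It is only an arithmetic illustration, not part of the method. It assumes lower is better in every column (the usual convention for LPIPS and RMSE), and the names `metrics`, `pretrained`, `ours`, and `relative_reduction` are hypothetical helpers introduced here.

```python
# Values copied from the table above; lower is assumed better in every column.
metrics = [
    "LPIPS (Render)",
    "RMSE (Albedo)",
    "RMSE (Metal.)",
    "RMSE (Normal)",
    "RMSE (Rough.)",
]
pretrained = [0.450, 0.0272, 0.0816, 0.0598, 0.0588]
ours = [0.321, 0.0211, 0.0643, 0.0398, 0.0445]


def relative_reduction(baseline, value):
    """Return the percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


# Map each metric name to its relative reduction versus the Pretrained row.
reductions = {
    name: relative_reduction(base, new)
    for name, base, new in zip(metrics, pretrained, ours)
}

for name, reduction in reductions.items():
    print(f"{name}: {reduction:.1f}% lower than Pretrained")
```

The same helper can be applied to the w/o $\mathcal{L}_{\text{render}}$ row to isolate the contribution of the rendering loss.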