¹ Department of Artificial Intelligence, Sungkyunkwan University
² Department of Electrical and Computer Engineering, Sungkyunkwan University
https://yhyun225.github.io

DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance

Younghyun Kim¹* Geunmin Hwang¹* Eunbyung Park¹,²†
* Equal contribution. † Corresponding author.

Abstract

The recent surge in large-scale generative models has spurred development across many fields of computer vision. In particular, text-to-image diffusion models have been widely adopted across diverse domains due to their potential for high-fidelity image generation. Nonetheless, existing large-scale diffusion models are confined to generating images of up to 1K resolution, which is far from meeting the demands of contemporary commercial applications. Directly sampling higher-resolution images often yields results marred by artifacts such as object repetition and distorted shapes. Addressing these issues typically requires training or fine-tuning models on higher-resolution datasets. However, this poses a formidable challenge due to the difficulty of collecting large-scale high-resolution content and the substantial computational resources required. While several preceding works have proposed alternatives, they often fail to produce convincing results. In this work, we probe the generative ability of diffusion models at resolutions beyond their original capability and propose a novel progressive approach that fully utilizes a generated low-resolution image to guide the generation of a higher-resolution image. Our method obviates the need for additional training or fine-tuning, which significantly lowers the computational burden. Extensive experiments and results validate the efficiency and efficacy of our method.

Keywords: Diffusion · High-resolution · Training-free

1 Introduction

Following the establishment of diffusion models as a cornerstone of generative modeling, there have been rapid advancements across various domains and modalities of machine learning. These advancements span areas such as audio synthesis [32, 13, 33, 26, 37], image synthesis [23, 62, 15, 50, 54, 48, 16, 4, 45], video generation [20, 22, 8, 61, 71, 7, 12], and 3D generation [46, 72, 36, 14, 59, 68, 75]. Notably, text-to-image diffusion models [4, 45, 50, 54, 48] have attracted considerable attention due to their ability to generate visually captivating images using intuitive, human-friendly natural language descriptions. Stable Diffusion (SD), an open-source text-to-image diffusion model trained on a large-scale online dataset [57], has emerged as a prominent choice for a diverse range of generative tasks and inverse problems. These tasks include but are not limited to image editing [1, 2, 21, 69, 30], inpainting [50, 53, 40], super-resolution [50, 55, 17], and image-to-image translation [10, 42, 76, 77].

Figure 1: Various baselines. Each image has a size of $2048^{2}$ and is generated with SDXL 1.0. We used ‘A group of playful monkeys swinging through the branches of a dense jungle.’ and ‘A line of taxis queued up outside a busy train station.’ as the textual prompts for the two rows, respectively.

Despite the promising performance exhibited by SD, it encounters limitations when generating images at higher resolutions beyond its training resolution. The direct inference of unseen high-resolution samples often reveals repetitive patterns and irregular structures, particularly noticeable in object-centric samples, as discussed in prior works [19, 79] (see Fig. 1). While a straightforward approach might involve training or fine-tuning diffusion models on higher-resolution images, several challenges impede this approach. First, collecting text-image pairs of higher resolution is not readily feasible. Second, training on large-resolution images demands substantial computational resources due to the increased size of the intermediate features. Furthermore, capturing and learning the features from high-dimensional data often requires a greater model capacity (more model parameters), leading to further computational strain on the training process.

Several tuning-free methods [6, 34, 19] have proposed various approaches to adapt pre-trained SD to resolutions beyond its original settings. MultiDiffusion [6] and SyncDiffusion [34] employ multiple diffusion processes with overlapping windows, each corresponding to a different region of the image being generated. These joint diffusion models can produce images of arbitrary shape, but the resulting image suffers from object repetition since the same textual prompt is fed into each window. Attn-SF [27] associates inference resolution with attention entropy and introduces a scaling factor to alleviate entropy fluctuations when sampling variable-sized images. However, their work does not consider adapting SD to much higher resolutions, e.g., 2K and 4K. ScaleCrafter [19], on the other hand, extends the receptive field of the diffusion model by dilating the pre-trained convolution weights of the denoising UNet [51]. While it effectively addresses repetition issues in certain instances, its success heavily depends on an extensive hyperparameter search.

In this work, we investigate SD’s capability of generating previously unseen high-resolution images and introduce a novel approach that involves neither training (or fine-tuning) nor additional modules. We posit that SD innately possesses the potential to generate images at resolutions higher than its training resolution thanks to its convolutional architecture [50] and broad data distribution coverage. To substantiate our claim, we generate 2K images using SD from noisy images at different intermediate diffusion timesteps. Note that the Gaussian noise is added in the latent space. Fig. 2 demonstrates that, starting from noisy images whose global structures are preserved, SD seamlessly restores clean, highly detailed images.

Building upon this observation, we introduce a novel progressive high-resolution image generation pipeline, dubbed DiffuseHigh, where a relatively low-resolution image (sampled from SD) serves as a guide for generating higher-resolution images. Inspired by the recent literature [45, 41], we suggest the noising-denoising technique to synthesize higher-resolution images. First, we generate the low-resolution image using SD and upsample it by bilinear interpolation. Then, we add sufficient noise to obfuscate the fine details of the interpolated images. Finally, we perform the reverse diffusion process to denoise those images to infuse the high-frequency details to synthesize higher-resolution images, and we can repeat this process until we obtain the desired resolution images. This approach leverages the overall structure from the low-resolution image, effectively addressing repetition issues observed in the prior methods.

However, the ‘adding noise to damage the images’ approach poses several challenges. If we add too much noise, then we lose most of the structure in the low-resolution images, resulting in repetitive outcomes similar to those we generate from scratch. On the other hand, if we introduce a minimal amount of noise, the generated higher-resolution images do not show notable differences from the interpolated images, losing the opportunity to synthesize high-frequency details. In addition, finding adequate noise relies on both the content of the image and the pre-trained models, which makes it challenging to offer precise suggestions to users.

To resolve the issues above, we propose a principled way of preserving the overall structure from the low-resolution image for the suggested progressive pipeline. We employ a frequency-domain representation to extract the global structure as well as detailed contents from the low-resolution images. More specifically, we adopt the Discrete Wavelet Transform (DWT) to obtain essential contents, e.g., the $LL$ component, which we then incorporate into the denoising procedure to ensure that the resulting image remains consistent and does not deviate excessively.

Fig. 3 provides an overview of the overall pipeline of our method. We validate the proposed pipeline on the LAION-400M dataset [58] and demonstrate the superior performance of DiffuseHigh compared to other baseline methods. Additionally, we extend our method to diffusion-based video generation [71] to showcase the versatility of DiffuseHigh. The contributions of our work are summarized as follows:

• Our observation indicates that SD has the innate ability to synthesize images with higher resolution than those it was trained on.

• We suggest a novel training-free progressive high-resolution image synthesis pipeline called DiffuseHigh, in which a lower-resolution image acts as a guide for generating higher-resolution images.

• We further propose Discrete Wavelet Transform (DWT)-based structure guidance during the denoising process, which enhances the structural properties and fine details of the generated samples.

• We conduct comprehensive experiments on both image and video synthesis, demonstrating the superiority and versatility of our method.

2 Related Work

2.0.1 Diffusion Models

Diffusion models (DMs) [23, 63] represent a novel paradigm within the generative modeling framework, employing numerical methods [39, 78, 5] to solve reverse-time stochastic differential equations (SDEs) for simulating the generative trajectories [64]. Under this rigorous theoretical framework, DMs achieve state-of-the-art (SoTA) image quality [28, 44] and comparable model likelihood [31, 38].

2.0.2 Text-to-Image Generation

Text-driven image generation can be traced back to the use of GANs [9, 29, 56], often combined with image-text representations such as CLIP [47], achieving significant performance. However, generating semantically consistent images with text guidance remains challenging for GANs [52]. Recently, DMs have gained popularity for their ability to produce high-quality images [44], showcasing great potential in text-to-image generation [15, 25]. A pioneering work in this direction is Stable Diffusion [50], which iteratively introduces text representations in the latent space, and further advancements have followed rapidly. Moreover, thanks to its large-scale training, Stable Diffusion has been applied to various text-to-image tasks [35, 43, 11] through fine-tuning [52] or training-free [49] methods. While significant progress has been made in the field of text-to-image generation, one limitation of DMs is that they can generate images only at fixed resolutions [52, 35, 54], which is attributed to their training on specific image sizes. To remedy this, in this paper, we use text prompts to generate images at much higher resolutions than those present in the training datasets, in a training-free manner.

2.0.3 Noising-Denoising

Based on the stochastic differential equation (SDE) reflecting the generative diffusion process, SDEdit [41] proposed a unified framework for image editing and image synthesis. Given images with low-level details, e.g., stroke painting, they add an adequate amount of noise to the image. Subsequently, they restore a clean, natural image from the noisy image through an iterative reverse SDE.

This ‘noising-denoising’ strategy, which performs a reverse diffusion process from an intermediate noised image, has been widely adopted in various domains. AnoDDPM [73] proposed reconstruction-based anomaly detection with a partial Markov chain, where the data sample is slightly noised with small timesteps and then reconstructed. SDXL [45] employs an optional refinement network, which refines the low-quality parts of the image through a noising-denoising process. Similarly, this algorithm has been adopted in a post-processing stage to rectify imperfect video frames.

2.0.4 High-resolution Image Synthesis

Despite the progress made by current diffusion model-based synthesis methods, achieving high-resolution image generation remains elusive. Previous studies have tackled these challenges through methods such as training from scratch and fine-tuning [74, 80]. However, training from scratch and fine-tuning often require significant computational resources and a substantial amount of high-resolution training data. Consequently, there has been a recent trend towards training-free methods [79, 27] for generating arbitrary-size or high-resolution images. ScaleCrafter [19] utilized dilated convolution to adjust the convolutional receptive field, enabling adaptation to high-resolution image generation without any training.

Recently, Make-a-Cheap-Scaling [18] has also adopted the noising-denoising technique to synthesize higher-resolution images. To further boost the image quality, they propose to tune a lightweight upsampler module, which can provide proper semantic guidance during the generation process. In contrast, we propose to obtain explicit structural guidance from the low-resolution image, which can effectively address the object repetition and irregular structure issues. Our proposed DiffuseHigh can be directly applied to any prevalent pretrained diffusion model in a completely training-free manner, providing both efficacy and efficiency.

3 Preliminary

In this section, we briefly present preliminaries relevant to our method, including Stable Diffusion (SD) [50] and Discrete Wavelet Transform (DWT).

3.1 Stable Diffusion

Stable Diffusion is a text-to-image latent diffusion model in which the diffusion process is performed in a low-dimensional latent space. Given a data sample $x$ from the unknown data distribution $p_{\text{data}}(x)$, Stable Diffusion encodes $x$ into a latent representation $z_{0}=\mathcal{E}(x)$, where $\mathcal{E}(\cdot)$ is the encoder of an autoencoder that compresses the high-dimensional data into a compact latent space. Then, the model gradually adds isotropic Gaussian noise $\epsilon\sim\mathcal{N}(0,I)$ to a clean sample $z_{0}$ with a pre-defined noise schedule $\alpha_{t}\in(0,1)$:

$$z_{t}=\sqrt{\bar{\alpha}_{t}}\,z_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon, \tag{1}$$

where $t\in\{1,\dots,T\}$ denotes the timesteps of the diffusion process and $\bar{\alpha}_{t}=\prod_{s=1}^{t}\alpha_{s}$. The denoising network $\epsilon_{\phi}(z_{t};t,y)$, parametrized by $\phi$, learns to predict the added noise given the noisy latent $z_{t}$ and text prompt $y$, with the following denoising score matching objective:

$$\mathcal{L}:=\mathbb{E}_{t,\epsilon}\left[\left\|w(t)\left(\epsilon_{\phi}(z_{t};t,y)-\epsilon\right)\right\|_{2}^{2}\right]. \tag{2}$$

$w(t)$ is a weighting function applied to each loss term at timestep $t$ .

Initiating from $z_{T}\sim\mathcal{N}(0,I)$, the reverse process is formulated as $q_{\phi}(z_{t-1}|z_{t},z_{0})$, with $q_{\phi}(\cdot|\cdot)$ parametrized as a Gaussian distribution. For efficiency, the DDIM [62] sampling strategy is generally adopted, where the unknown $z_{0}$ is replaced with the predicted clean latent $\hat{z}_{0,t}$ at timestep $t$:

$$\hat{z}_{0,t}=\frac{z_{t}-\sqrt{1-\bar{\alpha}_{t}}\,\epsilon_{\phi}(z_{t};t,y)}{\sqrt{\bar{\alpha}_{t}}}. \tag{3}$$

Finally, the clean image $\hat{x}$ is reconstructed by the decoder $\mathcal{D}(\cdot)$ of Stable Diffusion, i.e., $\hat{x}=\mathcal{D}(z_{0})$.
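As a minimal numerical sketch of Eq. 1 and Eq. 3, the snippet below noises a toy latent and then recovers it from the noisy latent using the true noise in place of the network prediction. The function names, the latent shape, and the value of $\bar{\alpha}_{t}$ are illustrative assumptions and are not taken from Stable Diffusion.

```python
import numpy as np

def add_noise(z0, eps, alpha_bar_t):
    """Eq. 1: forward-noise a clean latent to the level given by alpha_bar_t."""
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps

def predict_clean_latent(z_t, eps_pred, alpha_bar_t):
    """Eq. 3: estimate z_0 from z_t and a noise prediction."""
    return (z_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))   # toy latent; the shape is arbitrary
eps = rng.standard_normal(z0.shape)
alpha_bar_t = 0.3                     # illustrative value, not SD's schedule

z_t = add_noise(z0, eps, alpha_bar_t)
# Using the true noise as an oracle prediction inverts Eq. 1 exactly.
z0_hat = predict_clean_latent(z_t, eps, alpha_bar_t)
```

In practice the noise prediction comes from $\epsilon_{\phi}(z_{t};t,y)$, so $\hat{z}_{0,t}$ is only an estimate whose accuracy improves as $t$ decreases.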

3.2 Discrete Wavelet Transform

Frequency-based methods, including the Discrete Fourier Transform (DFT), Discrete Cosine Transform (DCT), and Discrete Wavelet Transform (DWT) play a pivotal role in discrete signal processing. Such frequency-based approaches transform the given signal into the frequency domain, enabling the analysis and manipulation of the individual frequency bands.

Among them, the DWT uses wavelets to decompose images into components that are localized in both space and frequency. Specifically, at each DWT level, the decomposed components consist of an approximation coefficient denoted as $LL_{l}$ and detail coefficients denoted as $LH_{l},HL_{l},HH_{l}$, where $l$ represents the level of the DWT. By applying a low-pass filter and a high-pass filter in both the vertical and horizontal directions, $LL_{l}$ captures the low-frequency content of the image, encompassing global structures, uniformly-colored regions, and smooth textures. On the other hand, $LH_{l},HL_{l},HH_{l}$ encapsulate the high-frequency details, such as edges, boundaries, and rough textures.

We adopt DWT as a tool for the guidance of overall structures and contents of the low-resolution image for generating a higher-resolution image. The details of applying DWT-based guidance on our pipeline are described in Sec. 4.3.
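To make the decomposition concrete, consider a worked example under the assumption of a single-level orthonormal Haar wavelet (the text above does not state which wavelet is used, and the naming of the $LH$ and $HL$ bands varies between libraries). For one $2\times 2$ pixel block with top row $(a,b)$ and bottom row $(c,d)$:

$$
\begin{aligned}
LL &= \tfrac{1}{2}(a+b+c+d), & LH &= \tfrac{1}{2}(a+b-c-d),\\
HL &= \tfrac{1}{2}(a-b+c-d), & HH &= \tfrac{1}{2}(a-b-c+d).
\end{aligned}
$$

The transform is invertible, e.g., $a=\tfrac{1}{2}(LL+LH+HL+HH)$ and $d=\tfrac{1}{2}(LL-LH-HL+HH)$. For a flat block with $a=b=c=d$, all three detail coefficients vanish and $LL=2a$, which illustrates why $LL$ carries the coarse structure while the other bands carry edges and texture.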

4 Method

4.1 Problem Formulation

Our work aims to generate images at resolutions higher than the training size, given textual prompts, with a text-to-image diffusion model (Stable Diffusion) in a training-free manner. More formally, given a text description $y$ and a Stable Diffusion model $\epsilon_{\phi}(\cdot)$ pretrained on fixed-size images $(h,w,3)$, our objective is to generate a higher-resolution image $(H,W,3)$ without training $\phi$, where $h\ll H,w\ll W$.

4.2 Progressive High-Resolution Diffusion Pipeline

We present a progressive approach for generating high-resolution images using a pretrained Stable Diffusion model. Initially, our method generates a clean sample based on a given text description through Stable Diffusion. Assuming alignment between the generated image and the provided text, we then employ bilinear interpolation to upscale the image, thereby guiding the high-resolution image generation. Our method incorporates a noising-denoising technique [41], which gradually projects the sample onto the manifold of natural, highly detailed images that the diffusion model has learned. This iterative procedure can also be interpreted as a refinement stage [45], wherein the denoising process restores the high-frequency details missing from the low-resolution sample.

Let $x_{0}\in\mathbb{R}^{h\times w\times 3}$ be the sample generated by Stable Diffusion, aligned with the user-provided textual prompt $y$, and let $p_{\phi}(x|y)$ be the text-conditional data distribution the model has learned. In other words, $x_{0}\sim p_{\phi}(x|y)$ and $p_{\phi}(x|y)\approx p_{\text{data}}(x|y)$. We use the relatively low-resolution generated image $x_{0}$ as a guide for the higher-resolution image generation and apply bilinear interpolation to $x_{0}$ to obtain an image of the desired size, $\tilde{x}_{0}\in\mathbb{R}^{H\times W\times 3}$. Note that the details of the resulting image $\tilde{x}_{0}$ lack clarity due to the nature of the interpolation, which averages neighboring pixel values to compose newly introduced pixels.

In order to infuse the appropriate details into the interpolated high-resolution image, we first add noise corresponding to the diffusion timestep $N<T$ to its latent code $\tilde{z}_{0}=\mathcal{E}(\tilde{x}_{0})$ according to Eq. 1:

$$\hat{z}_{N}=\sqrt{\bar{\alpha}_{N}}\,\tilde{z}_{0}+\sqrt{1-\bar{\alpha}_{N}}\,\epsilon,\quad\epsilon\sim\mathcal{N}(0,I). \tag{4}$$

Then the denoising network $\epsilon_{\phi}(\cdot)$ performs the reverse process on the noisy latent representation $\hat{z}_{N}$ to recover the clean latent $\hat{z}_{0}$ . By employing the latent decoder $\mathcal{D}(\cdot)$ , we finally obtain the desired high-resolution image.
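The noising-denoising stage described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: `encode`, `decode`, and `eps_model` are hypothetical callables standing in for $\mathcal{E}$, $\mathcal{D}$, and $\epsilon_{\phi}$; `alpha_bar` is assumed to be an array indexed by step with `alpha_bar[0] = 1`; the update is assumed to be the deterministic DDIM rule; and the align-corners convention in the bilinear resize is a choice made here for simplicity.

```python
import numpy as np

def bilinear_upsample(img, out_h, out_w):
    """Bilinear resize of an (h, w, c) array (align-corners sampling)."""
    h, w, _ = img.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :, None]   # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def noise_then_denoise(x_up, encode, decode, eps_model, alpha_bar, n_steps, rng):
    """Noise the latent of x_up to step n_steps (Eq. 4), then denoise back to 0."""
    z = encode(x_up)
    eps = rng.standard_normal(z.shape)
    z_t = np.sqrt(alpha_bar[n_steps]) * z + np.sqrt(1 - alpha_bar[n_steps]) * eps
    for t in range(n_steps, 0, -1):
        eps_hat = eps_model(z_t, t)
        # Eq. 3: predicted clean latent at step t.
        z0_hat = (z_t - np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
        # Deterministic DDIM update to step t - 1 (assumed, eta = 0).
        z_t = np.sqrt(alpha_bar[t - 1]) * z0_hat + np.sqrt(1 - alpha_bar[t - 1]) * eps_hat
    return decode(z_t)
```

Repeating `bilinear_upsample` followed by `noise_then_denoise` at successively larger sizes gives the progressive scheme; with a real model, `eps_model` would also receive the text prompt $y$.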

This progressive pipeline bears a resemblance to cascade diffusion models [24, 54], albeit with inherent differences. Cascade diffusion models employ separate networks for low-resolution image generation and for each super-resolution stage. In contrast, the described method relies solely on a single pretrained diffusion model, obviating the need to train separate models.

Noising timestep $N$ is an important factor for the overall quality of the resulting image (see Fig. 4). Large $N$ significantly destroys the critical structural properties in the image, leading to object repetition problems and undesirable object shapes. However, small $N$ does not provide the diffusion model enough timesteps to perform the denoising process to restore the fine, high-frequency details of the image. Determining appropriate noise levels is contingent upon both the content of the image and the characteristics of pretrained diffusion models, thereby presenting a challenge in practical usage. Furthermore, we observed numerous instances where this approach degraded image quality across all noise levels. This leads us to develop a more principled way to uphold the overall structure and maintain the quality of the generated higher-resolution images.

4.3 Structure Guidance through Discrete Wavelet Transform

The progressive approach demonstrates proficiency in generating high-fidelity images; however, it frequently struggles to capture certain structural properties and nuanced details from the low-resolution inputs. Consequently, this can lead to discrepancies between the generated image and the actual data distribution (see Fig. 5). This phenomenon is expected, as structures and intricate details are susceptible to damage and distortion once a certain amount of noise is added.

We hereby introduce DiffuseHigh (Fig. 3), in which we incorporate a Discrete Wavelet Transform (DWT)-based structure guide into the proposed progressive pipeline. This method aims to enhance the fidelity of generated images by encouraging the preservation of crucial features from the low-resolution input. Given an interpolated image $\tilde{x}_{0}\in\mathbb{R}^{H\times W\times 3}$, we extract its low-frequency component using the DWT, which encapsulates the overall structure and coarse details of the image. More formally, let $\texttt{DWT}(\cdot):\mathbb{R}^{H\times W\times 3}\rightarrow\mathbb{R}^{4\times\frac{H}{2}\times\frac{W}{2}\times 3}$, and let $\texttt{DWT}(\tilde{x}_{0})_{LL},\texttt{DWT}(\tilde{x}_{0})_{LH},\texttt{DWT}(\tilde{x}_{0})_{HL},\texttt{DWT}(\tilde{x}_{0})_{HH}\in\mathbb{R}^{\frac{H}{2}\times\frac{W}{2}\times 3}$ be the four decomposed components of the interpolated high-resolution image. Then, we define a DWT-guided denoising step at timestep $t$ as follows.

$$\begin{aligned}\hat{z}_{t-1}=&\sqrt{\bar{\alpha}_{t-1}}\,\mathcal{E}\big(\mathtt{iDWT}(\mathtt{DWT}(\tilde{x}_{0})_{LL},\mathtt{DWT}(\hat{x}_{0,t})_{LH},\mathtt{DWT}(\hat{x}_{0,t})_{HL},\mathtt{DWT}(\hat{x}_{0,t})_{HH})\big)\\&+\sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon,\end{aligned} \tag{5}$$

where $\hat{x}_{0,t}=\mathcal{D}(\hat{z}_{0,t})$ is the predicted clean image at timestep $t$, obtained using Eq. 3 and the decoder $\mathcal{D}(\cdot)$, $\texttt{iDWT}(\cdot)$ is the inverse DWT, and $\mathcal{E}(\cdot)$ is the encoder. Finally, we recover the denoised image $\hat{x}_{t-1}=\mathcal{D}(\hat{z}_{t-1})$ at timestep $t-1$.
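The band replacement inside Eq. 5 can be sketched with a hand-written single-level Haar transform. The choice of the Haar wavelet, the function names, and the `encode` callable are assumptions made for illustration; the text does not specify the wavelet, nor whether $\epsilon$ in Eq. 5 is fresh or predicted noise, so `eps` is simply passed in by the caller.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level orthonormal Haar DWT of an (H, W, C) array with even H, W."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a + b - c - d) / 2
    hl = (a - b + c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2; returns an array of shape (2h, 2w, C)."""
    h, w, ch = ll.shape
    x = np.empty((2 * h, 2 * w, ch), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x

def structure_guided_x0(x_interp, x0_pred):
    """Keep LL from the interpolated image and the detail bands from the prediction."""
    ll_ref, _, _, _ = haar_dwt2(x_interp)
    _, lh, hl, hh = haar_dwt2(x0_pred)
    return haar_idwt2(ll_ref, lh, hl, hh)

def guided_latent(x_interp, x0_pred, encode, alpha_bar_prev, eps):
    """Eq. 5: re-encode the guided image and re-noise it to step t - 1."""
    x_guided = structure_guided_x0(x_interp, x0_pred)
    return np.sqrt(alpha_bar_prev) * encode(x_guided) + np.sqrt(1 - alpha_bar_prev) * eps
```

Because only the $LL$ band is overwritten, the model remains free to synthesize the high-frequency bands, while the coarse layout stays pinned to the interpolated image.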

Since SD performs the diffusion process in the latent space, frequent transitions between the latent and pixel spaces pose a considerable computational burden. This is particularly pronounced when the image’s resolution is high. In our empirical observations, we found that restricting the DWT-guided denoising procedure to the initial 5 steps out of a total of 15 achieves a favorable balance between image fidelity and computational efficiency.

5 Experiments

5.1 Implementation Details

For high-resolution image generation, we conducted extensive experiments on two text-to-image diffusion models, Stable Diffusion 2.1 [65] and Stable Diffusion XL [45]. To ensure a fair comparison with baseline methods, we validate our method at inference resolutions of 4 $\times$ and 16 $\times$ the model’s original training resolution. In detail, we generate images at resolutions of $1024^{2}$ and $2048^{2}$ for Stable Diffusion 2.1, and $2048^{2}$ and $4096^{2}$ for Stable Diffusion XL. In the case of video generation, we conducted experiments on ModelScope [71]. We used 50 DDIM steps to generate both images and videos. As mentioned in Sec. 4.2 and Sec. 4.3, we fixed our hyperparameters to $N=15$ and applied DWT guidance for the first 5 denoising steps.

5.2 Evaluation

We utilized the LAION-400M [57] dataset, which comprises 400 million image-text pairs, as a benchmark for image generation experiments¹. We randomly sampled captions from the benchmark dataset and generated images corresponding to the sampled captions. Due to the substantial computational cost, we generated 20K images for $1024^{2}$, 5K images for $2048^{2}$, and 1K images for $4096^{2}$, and compared the performance of our method against baselines. We selected Fréchet Inception Distance (FID) and Kernel Inception Distance (KID), denoted as $FID_{r}$ and $KID_{r}$, as our evaluation metrics. Following previous work [19], we additionally report the metrics between the generated samples at the increased resolutions and those at the base resolutions in order to estimate the degree to which each method preserves the original model’s generation capability, denoted as $FID_{b}$ and $KID_{b}$.

¹ Access to the LAION-5B dataset was revoked due to concerns regarding potentially illegal content, specifically Child Sexual Abuse Material (CSAM). We therefore evaluate our methods on the LAION-400M dataset.
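For readers unfamiliar with the metric, the following is a minimal sketch of the Fréchet distance that underlies FID, computed between Gaussians fitted to two sets of feature vectors. It assumes the features (Inception activations in the standard FID setup) have already been extracted; the function name and the eigenvalue-based trace computation are choices made here and do not reflect the evaluation code used in the experiments.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two (n, d) feature arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs (tiny numerical noise is clipped).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

Under this reading, $FID_{r}$ would compare features of generated images against those of real images, and $FID_{b}$ would compare them against features of the model's base-resolution samples.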

Similar to the image experiment settings, for video generation, we sampled 2048 random captions from the WebVid-10M dataset [3] and measured the Fréchet Video Distance (FVD) [70] with 16-frame videos. The resolution of the videos generated with our method is $1024^{2}$, since the model is capable of generating $512^{2}$ resolution videos.

5.3 Image generation

We compare both of our methods, the progressive-only approach, denoted as ‘DiffuseHigh (w/o DWT)’ (Sec. 4.2), and ‘DiffuseHigh’ (Sec. 4.3), which uses DWT-based guidance, against existing training-free methods, i.e., direct inference with the Stable Diffusion models (D.I) and ScaleCrafter [19]. We excluded from the baselines the methods [6, 34] that can generate higher-resolution images but suffer from the object repetition problem. Moreover, we empirically found that utilizing multiple resolutions during generation yields better results. We therefore added one more intermediate resolution in DiffuseHigh, i.e., [ $512^{2}$ , $768^{2}$ , $1024^{2}$ ] for Stable Diffusion and [ $1024^{2}$ , $1536^{2}$ , $2048^{2}$ ] for Stable Diffusion XL, in the $4\times$ experiments.

	SD 2.1 (1K)				SDXL 1.0 (2K)
Method	$FID_{r}$	$KID_{r}$	$FID_{b}$	$KID_{b}$	$FID_{r}$	$KID_{r}$	$FID_{b}$	$KID_{b}$
D.I	58.77	0.0176	43.21	0.0094	69.37	0.025	47.01	0.128
ScaleCrafter	32.60	0.0117	18.43	0.0047	62.84	0.020	44.84	0.0104
DiffuseHigh (w/o DWT)	18.11	0.0066	4.09	0.0006	30.74	0.0081	17.21	0.0015
DiffuseHigh	18.72	0.0069	3.56	0.0003	26.08	0.0077	12.46	0.0001

Table 1: Evaluation of $4\times$ experiments on LAION-400M. We compare our proposed method with training-free baseline methods. The table shows the metric scores of each method.

We report the evaluation results of the $4\times$ resolution inference experiment in Tab. 1. As observed, both of our approaches surpassed the given baselines by a large margin in terms of both FID and KID. In the SD 2.1 setting, DiffuseHigh (w/o DWT) slightly outperformed DiffuseHigh in $FID_{r}$ and $KID_{r}$, while DiffuseHigh showed superior results compared to DiffuseHigh (w/o DWT) in the SDXL setting. Qualitative samples are shown in Fig. 6.

We also report the quantitative evaluation metrics for the $16\times$ setting in Tab. 2, mainly comparing DiffuseHigh (w/o DWT) with DiffuseHigh. Similar to the $4\times$ experiment, the DiffuseHigh (w/o DWT) approach showed better $FID_{r}$ and $KID_{r}$ with SD 2.1, while DiffuseHigh achieved better scores with SDXL.

We conjecture that this consistent observation stems from the capacity of the diffusion model to generate correct shapes, structures, and details of the object. We observed that SD 2.1 is more likely than SDXL to generate objects with undesirable appearance. Since DiffuseHigh generates a high-resolution image guided by the low-resolution generated image, the object in the image is highly likely to preserve such eccentric structures. The progressive-only approach, by contrast, has the opportunity to amend the flawed shapes since it has a more flexible denoising process. The outcomes diverge significantly with SDXL. Due to its proficiency in generating natural-looking images, leveraging guidance from the low-resolution image proves advantageous. This process ensures the accurate incorporation of structural attributes, leading to highly convincing and satisfactory results with DiffuseHigh. Qualitative samples produced by DiffuseHigh are shown in Fig. 8.

	SD 2.1 (2K)				SDXL 1.0 (4K)
Method	$FID_{r}$	$KID_{r}$	$FID_{b}$	$KID_{b}$	$FID_{r}$	$KID_{r}$	$FID_{b}$	$KID_{b}$
DiffuseHigh (w/o DWT)	27.42	0.0065	14.05	0.0011	60.27	0.0081	44.06	0.0003
DiffuseHigh	30.96	0.0083	12.95	0.0005	59.62	0.0077	43.35	0.0002

Table 2: Evaluation of $16\times$ experiments on LAION-400M.

5.4 Video generation

To further validate the versatility of DiffuseHigh, we apply our proposed method to ModelScope [71] to generate videos at resolutions higher than its original resolution. As observed with images, direct inference with the pretrained video model also resulted in a severe object repetition problem (see Fig. 9). We report the quantitative results of the video experiments in Tab. 3. As observed, our proposed DiffuseHigh achieved a lower FVD [70] score than direct inference by a large margin. This demonstrates the versatility of our method, which shows superior performance in adapting the video diffusion model (ModelScope [71]) to higher-resolution settings. Additional qualitative examples are provided in the supplementary materials.

	ModelScope (1K)
Method	FVD
D.I	785.16
DiffuseHigh	607.99

Table 3: Video experiment results. We used Fréchet Video Distance (FVD) with 16 frames as an evaluation metric on 2048 generated videos. Captions are randomly sampled from the WebVid-10M dataset [3].

6 Limitation and Discussion

Since DiffuseHigh leverages generated low-resolution images as structural guidance, the generation ability of the diffusion model at its original resolution heavily affects the overall performance of our method. That is, structural defects or flaws in the low-resolution images are also likely to be carried over to the resulting higher-resolution image. However, we believe that leveraging tuning-free enhancement methods such as FreeU [60], which improve the quality and fidelity of the sampled low-resolution image, would further improve the quality and fidelity of the resulting high-resolution image. We leave this as future work.

7 Conclusion

We present a training-free progressive high-resolution image synthesis pipeline using a pretrained diffusion model on low-resolution images. Inspired by the recent noising-denoising technique, our proposal involves leveraging generated low-resolution images as a guiding mechanism to effectively preserve the overall structure and intricate details of the contents. We also propose a principled way of incorporating structure information into the denoising process through frequency domain representation. This allows us to retain the essential information presented in the low-resolution images. The extensive experiments with the pretrained SD models have shown that the proposed DiffuseHigh generates higher-resolution images without commonly known issues in the existing approaches, such as repetitive patterns and irregular structures.

References

[1] Avrahami, O., Fried, O., Lischinski, D.: Blended latent diffusion. ACM Transactions on Graphics (TOG) 42(4), 1–11 (2023)
[2] Avrahami, O., Lischinski, D., Fried, O.: Blended diffusion for text-driven editing of natural images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18208–18218 (2022)
[3] Bain, M., Nagrani, A., Varol, G., Zisserman, A.: Frozen in time: A joint video and image encoder for end-to-end retrieval. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1728–1738 (2021)
[4] Balaji, Y., Nah, S., Huang, X., Vahdat, A., Song, J., Kreis, K., Aittala, M., Aila, T., Laine, S., Catanzaro, B., et al.: ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324 (2022)
[5] Bao, F., Li, C., Zhu, J., Zhang, B.: Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. arXiv preprint arXiv:2201.06503 (2022)
[6] Bar-Tal, O., Yariv, L., Lipman, Y., Dekel, T.: Multidiffusion: Fusing diffusion paths for controlled image generation. In: International Conference on Machine Learning. pp. 1737–1752. PMLR (2023)
[7] Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)
[8] Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S., Kreis, K.: Align your latents: High-resolution video synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22563–22575 (2023)
[9] Brock, A., Donahue, J., Simonyan, K.: Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096 (2018)
[10] Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18392–18402 (2023)
[11] Chang, H., Zhang, H., Barber, J., Maschinot, A., Lezama, J., Jiang, L., Yang, M.H., Murphy, K., Freeman, W.T., Rubinstein, M., et al.: Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704 (2023)
[12] Chen, H., Xia, M., He, Y., Zhang, Y., Cun, X., Yang, S., Xing, J., Liu, Y., Chen, Q., Wang, X., et al.: Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512 (2023)
[13] Chen, N., Zhang, Y., Zen, H., Weiss, R.J., Norouzi, M., Chan, W.: Wavegrad: Estimating gradients for waveform generation. arXiv preprint arXiv:2009.00713 (2020)
[14] Chen, R., Chen, Y., Jiao, N., Jia, K.: Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. arXiv preprint arXiv:2303.13873 (2023)
[15] Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. NeurIPS 34, 8780–8794 (2021)
[16] Gao, S., Zhou, P., Cheng, M.M., Yan, S.: Masked diffusion transformer is a strong image synthesizer. arXiv preprint arXiv:2303.14389 (2023)
[17] Gao, S., Liu, X., Zeng, B., Xu, S., Li, Y., Luo, X., Liu, J., Zhen, X., Zhang, B.: Implicit diffusion models for continuous super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10021–10030 (2023)
[18] Guo, L., He, Y., Chen, H., Xia, M., Cun, X., Wang, Y., Huang, S., Zhang, Y., Wang, X., Chen, Q., et al.: Make a cheap scaling: A self-cascade diffusion model for higher-resolution adaptation. arXiv preprint arXiv:2402.10491 (2024)
[19] He, Y., Yang, S., Chen, H., Cun, X., Xia, M., Zhang, Y., Wang, X., He, R., Chen, Q., Shan, Y.: Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models. In: The Twelfth International Conference on Learning Representations (2023)
[20] He, Y., Yang, T., Zhang, Y., Shan, Y., Chen, Q.: Latent video diffusion models for high-fidelity video generation with arbitrary lengths. arXiv preprint arXiv:2211.13221 (2022)
[21] Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022)
[22] Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., et al.: Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022)
[23] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. NeurIPS 33, 6840–6851 (2020)
[24] Ho, J., Saharia, C., Chan, W., Fleet, D.J., Norouzi, M., Salimans, T.: Cascaded diffusion models for high fidelity image generation. The Journal of Machine Learning Research 23(1), 2249–2281 (2022)
[25] Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)
[26] Huang, R., Zhao, Z., Liu, H., Liu, J., Cui, C., Ren, Y.: Prodiff: Progressive fast diffusion model for high-quality text-to-speech. In: Proceedings of the 30th ACM International Conference on Multimedia. pp. 2595–2605 (2022)
[27] Jin, Z., Shen, X., Li, B., Xue, X.: Training-free diffusion model adaptation for variable-sized text-to-image synthesis. Advances in Neural Information Processing Systems 36 (2024)
[28] Karras, T., Aittala, M., Aila, T., Laine, S.: Elucidating the design space of diffusion-based generative models. NeurIPS 35, 26565–26577 (2022)
[29] Karras, T., Aittala, M., Laine, S., Härkönen, E., Hellsten, J., Lehtinen, J., Aila, T.: Alias-free generative adversarial networks. NeurIPS 34, 852–863 (2021)
[30] Kawar, B., Zada, S., Lang, O., Tov, O., Chang, H., Dekel, T., Mosseri, I., Irani, M.: Imagic: Text-based real image editing with diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6007–6017 (2023)
[31] Kim, D., Shin, S., Song, K., Kang, W., Moon, I.C.: Soft truncation: A universal training technique of score-based diffusion model for high precision score estimation. arXiv preprint arXiv:2106.05527 (2021)
[32] Kong, Z., Ping, W., Huang, J., Zhao, K., Catanzaro, B.: Diffwave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761 (2020)
[33] Lam, M.W., Wang, J., Huang, R., Su, D., Yu, D.: Bilateral denoising diffusion models. arXiv preprint arXiv:2108.11514 (2021)
[34] Lee, Y., Kim, K., Kim, H., Sung, M.: Syncdiffusion: Coherent montage via synchronized joint diffusions. Advances in Neural Information Processing Systems 36 (2024)
[35] Li, Y., Wang, H., Jin, Q., Hu, J., Chemerys, P., Fu, Y., Wang, Y., Tulyakov, S., Ren, J.: Snapfusion: Text-to-image diffusion model on mobile devices within two seconds. NeurIPS 36 (2024)
[36] Lin, C.H., Gao, J., Tang, L., Takikawa, T., Zeng, X., Huang, X., Kreis, K., Fidler, S., Liu, M.Y., Lin, T.Y.: Magic3d: High-resolution text-to-3d content creation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 300–309 (2023)
[37] Liu, H., Chen, Z., Yuan, Y., Mei, X., Liu, X., Mandic, D., Wang, W., Plumbley, M.D.: Audioldm: Text-to-audio generation with latent diffusion models. arXiv preprint arXiv:2301.12503 (2023)
[38] Lu, C., Zheng, K., Bao, F., Chen, J., Li, C., Zhu, J.: Maximum likelihood training for score-based diffusion odes by high order denoising score matching. In: ICML. pp. 14429–14460. PMLR (2022)
[39] Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. NeurIPS 35, 5775–5787 (2022)
[40] Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., Van Gool, L.: Repaint: Inpainting using denoising diffusion probabilistic models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11461–11471 (2022)
[41] Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., Ermon, S.: Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073 (2021)
[42] Mou, C., Wang, X., Xie, L., Wu, Y., Zhang, J., Qi, Z., Shan, Y., Qie, X.: T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453 (2023)
[43] Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., Chen, M.: GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021)
[44] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: ICCV. pp. 4195–4205 (2023)
[45] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)
[46] Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988 (2022)
[47] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML. pp. 8748–8763. PMLR (2021)
[48] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 (2022)
[49] Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., Sutskever, I.: Zero-shot text-to-image generation. In: ICML. pp. 8821–8831. PMLR (2021)
[50] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR. pp. 10684–10695 (2022)
[51] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. pp. 234–241. Springer (2015)
[52] Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: CVPR. pp. 22500–22510 (2023)
[53] Saharia, C., Chan, W., Chang, H., Lee, C., Ho, J., Salimans, T., Fleet, D., Norouzi, M.: Palette: Image-to-image diffusion models. In: ACM SIGGRAPH 2022 Conference Proceedings. pp. 1–10 (2022)
[54] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS 35, 36479–36494 (2022)
[55] Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D.J., Norouzi, M.: Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(4), 4713–4726 (2022)
[56] Sauer, A., Schwarz, K., Geiger, A.: StyleGAN-XL: Scaling StyleGAN to large diverse datasets. In: ACM SIGGRAPH 2022 conference proceedings. pp. 1–10 (2022)
[57] Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems 35, 25278–25294 (2022)
[58] Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., Komatsuzaki, A.: Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021)
[59] Shi, Y., Wang, P., Ye, J., Long, M., Li, K., Yang, X.: Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512 (2023)
[60] Si, C., Huang, Z., Jiang, Y., Liu, Z.: Freeu: Free lunch in diffusion u-net. arXiv preprint arXiv:2309.11497 (2023)
[61] Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al.: Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022)
[62] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
[63] Song, Y., Ermon, S.: Generative modeling by estimating gradients of the data distribution. NeurIPS 32 (2019)
[64] Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)
[65] stabilityai: Stable Diffusion 2-1 base (2022), https://huggingface.co/stabilityai/stable-diffusion-2-1
[66] stabilityai: Stable Diffusion Latent Upscaler (2023), https://huggingface.co/stabilityai/sd-x2-latent-upscaler
[67] stabilityai: Stable Diffusion Upscaler (2023), https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler
[68] Tang, J., Ren, J., Zhou, H., Liu, Z., Zeng, G.: Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653 (2023)
[69] Tumanyan, N., Geyer, M., Bagon, S., Dekel, T.: Plug-and-play diffusion features for text-driven image-to-image translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1921–1930 (2023)
[70] Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S.: Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717 (2018)
[71] Wang, J., Yuan, H., Chen, D., Zhang, Y., Wang, X., Zhang, S.: Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571 (2023)
[72] Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. Advances in Neural Information Processing Systems 36 (2024)
[73] Wyatt, J., Leach, A., Schmon, S.M., Willcocks, C.G.: Anoddpm: Anomaly detection with denoising diffusion probabilistic models using simplex noise. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 650–656 (2022)
[74] Xie, E., Yao, L., Shi, H., Liu, Z., Zhou, D., Liu, Z., Li, J., Li, Z.: Difffit: Unlocking transferability of large diffusion models via simple parameter-efficient fine-tuning. arXiv preprint arXiv:2304.06648 (2023)
[75] Yi, T., Fang, J., Wu, G., Xie, L., Zhang, X., Liu, W., Tian, Q., Wang, X.: Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud priors. arXiv preprint arXiv:2310.08529 (2023)
[76] Yu, J., Wang, Y., Zhao, C., Ghanem, B., Zhang, J.: Freedom: Training-free energy-guided conditional diffusion model. arXiv preprint arXiv:2303.09833 (2023)
[77] Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3836–3847 (2023)
[78] Zhang, Q., Chen, Y.: Fast sampling of diffusion models with exponential integrator. arXiv preprint arXiv:2204.13902 (2022)
[79] Zhang, S., Chen, Z., Zhao, Z., Chen, Z., Tang, Y., Chen, Y., Cao, W., Liang, J.: Hidiffusion: Unlocking high-resolution creativity and efficiency in low-resolution trained diffusion models. arXiv preprint arXiv:2311.17528 (2023)
[80] Zheng, Q., Guo, Y., Deng, J., Han, J., Li, Y., Xu, S., Xu, H.: Any-size-diffusion: Toward efficient text-driven synthesis for any-size hd images. arXiv preprint arXiv:2308.16582 (2023)

Supplementary material for DiffuseHigh

A Comparison with Stable Diffusion Upscaler

Leveraging a Super-Resolution (SR) model is also an appealing approach for generating high-resolution images. However, obtaining such SR models is often challenging because it requires collecting higher-resolution image datasets and incurs the computational cost of training a separate super-resolution model on higher-resolution images.

To evaluate our method comprehensively, we compare DiffuseHigh against pretrained SR models, namely Stable Diffusion Latent Upscaler [66] and Stable Diffusion Upscaler [67], which can increase the resolution of a given image by up to $4\times$ and $16\times$, respectively. We conducted experiments with Stable Diffusion 2.1 [65] and generated $1024^{2}$ and $2048^{2}$ resolution images with each method.

Method	SD 2.1 (1K) FID	SD 2.1 (1K) KID	SD 2.1 (2K) FID	SD 2.1 (2K) KID
SD + SR	18.66	0.0070	30.05	0.0087
DiffuseHigh	18.72	0.0069	27.80	0.0069

Table 4: Quantitative comparison between SD + SR and DiffuseHigh on SD 2.1 (FID and KID).

As shown in Tab. 4, in the $4\times$ experiment, SD + SR achieved a slightly lower FID and a slightly higher KID than DiffuseHigh, but the differences are negligible. In the $16\times$ setting, DiffuseHigh outperformed SD + SR on both FID and KID. Also, as mentioned in [19], SD + SR often fails to produce reliable textures and details, as shown in Fig. 10. These quantitative and qualitative results highlight the efficiency and efficacy of DiffuseHigh, which employs only the pretrained Stable Diffusion model.
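To make the size of these gaps concrete, the short script below computes the relative change of DiffuseHigh with respect to SD + SR from the values reported in Tab. 4. The helper names `relative_change` and `summarize` are illustrative only; the numbers are copied from the table, and lower is better for both FID and KID.

```python
# (FID, KID) values copied from Tab. 4; lower is better for both metrics.
table4 = {
    "1K": {"SD + SR": (18.66, 0.0070), "DiffuseHigh": (18.72, 0.0069)},
    "2K": {"SD + SR": (30.05, 0.0087), "DiffuseHigh": (27.80, 0.0069)},
}

def relative_change(baseline, ours):
    """Signed change of `ours` relative to `baseline`, in percent.

    Negative values mean `ours` is lower (better) than the baseline.
    """
    return 100.0 * (ours - baseline) / baseline

def summarize(table):
    """Return {setting: (FID change %, KID change %)} for DiffuseHigh vs. SD + SR."""
    rows = {}
    for setting, methods in table.items():
        fid_sr, kid_sr = methods["SD + SR"]
        fid_dh, kid_dh = methods["DiffuseHigh"]
        rows[setting] = (
            relative_change(fid_sr, fid_dh),
            relative_change(kid_sr, kid_dh),
        )
    return rows

for setting, (d_fid, d_kid) in summarize(table4).items():
    print(f"{setting}: FID {d_fid:+.2f}%, KID {d_kid:+.2f}%")
```

At 1K the two methods differ by roughly 0.3% in FID and 1.4% in KID, whereas at 2K DiffuseHigh is lower by about 7.5% in FID and 20.7% in KID.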