1 Department of Artificial Intelligence, Sungkyunkwan University
2 Department of Electrical and Computer Engineering, Sungkyunkwan University
https://yhyun225.github.io

DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance

Younghyun Kim1 (equal contribution), Geunmin Hwang1 (equal contribution), Eunbyung Park1,2 (corresponding author)
Abstract

The recent surge in large-scale generative models has spurred development across many fields of computer vision. In particular, text-to-image diffusion models have been widely adopted across diverse domains due to their potential for high-fidelity image generation. Nonetheless, existing large-scale diffusion models are confined to generating images of up to 1K resolution, which falls far short of the demands of contemporary commercial applications. Directly sampling higher-resolution images often yields results marred by artifacts such as object repetition and distorted shapes. Addressing these issues typically necessitates training or fine-tuning models on higher-resolution datasets. However, this poses a formidable challenge because large-scale high-resolution content is difficult to collect and the required computational resources are substantial. While several preceding works have proposed alternatives, they often fail to produce convincing results. In this work, we probe the generative ability of diffusion models at resolutions beyond their original capability and propose a novel progressive approach that fully utilizes a generated low-resolution image to guide the generation of a higher-resolution image. Our method obviates the need for additional training or fine-tuning, which significantly lowers the computational cost. Extensive experiments and results validate the efficiency and efficacy of our method.

Keywords: Diffusion · High-resolution · Training-free

1 Introduction

Following the establishment of diffusion models as a cornerstone of generative modeling, there have been rapid advancements across various modalities within machine learning. These advancements span areas such as audio synthesis [32, 13, 33, 26, 37], image synthesis [23, 62, 15, 50, 54, 48, 16, 4, 45], video generation [20, 22, 8, 61, 71, 7, 12], and 3D generation [46, 72, 36, 14, 59, 68, 75]. Notably, text-to-image diffusion models [4, 45, 50, 54, 48] have attracted considerable attention due to their ability to generate visually captivating images using intuitive, human-friendly natural language descriptions. Stable Diffusion (SD), an open-source text-to-image diffusion model trained on a large-scale online dataset [57], has emerged as a prominent choice for a diverse range of generative tasks and inverse problems. These tasks include but are not limited to image editing [1, 2, 21, 69, 30], inpainting [50, 53, 40], super-resolution [50, 55, 17], and image-to-image translation [10, 42, 76, 77].

Figure 1: Various baselines. Each image has a resolution of $2048^2$ and is generated with SDXL 1.0. We used ‘A group of playful monkeys swinging through the branches of a dense jungle.’ and ‘A line of taxis queued up outside a busy train station.’ as the textual prompts for the two rows, respectively.

Despite the promising performance exhibited by SD, it encounters limitations when generating images at higher resolutions beyond its training resolution. The direct inference of unseen high-resolution samples often reveals repetitive patterns and irregular structures, particularly noticeable in object-centric samples, as discussed in prior works [19, 79] (see Fig. 1). While a straightforward approach might involve training or fine-tuning diffusion models on higher-resolution images, several challenges impede this approach. First, collecting text-image pairs of higher resolution is not readily feasible. Second, training on large-resolution images demands substantial computational resources due to the increased size of the intermediate features. Furthermore, capturing and learning the features from high-dimensional data often requires a greater model capacity (more model parameters), leading to further computational strain on the training process.

Several tuning-free methods [6, 34, 19] have proposed various approaches to adapt pre-trained SD to resolutions beyond its original settings. MultiDiffusion [6] and SyncDiffusion [34] employ multiple diffusion processes with overlapping windows, each corresponding to a different region within the generated image. These joint diffusion models can produce images of arbitrary shape, but the resulting image suffers from object repetition since the same textual prompt is fed into each window. Attn-SF [27] associates inference resolution with attention entropy and introduces a scaling factor to alleviate entropy fluctuations during the sampling of variable-sized images. However, their work does not consider adapting SD to much higher resolutions, e.g., 2K and 4K. ScaleCrafter [19], on the other hand, extends the receptive field of the diffusion model by dilating the pre-trained convolution weights of the denoising UNet [51]. While it effectively addresses repetition issues in certain instances, its success heavily depends on an extensive hyperparameter search.

Figure 2: Toy experiment on a 2K image. (a) Original 2K image. (b) We add a small amount of noise that does not severely perturb the global structure. (c) We denoise the image with SD 2.1 to restore the original contents.

In this work, we investigate SD’s capability to generate previously unseen high-resolution images and introduce a novel approach that involves neither training (or fine-tuning) nor additional modules. We posit that SD innately possesses the potential to generate images at resolutions higher than its training resolution thanks to its convolutional architecture [50] and broad data distribution coverage. To substantiate our claim, we generate 2K images using SD from noisy images at different intermediate diffusion timesteps. Note that the Gaussian noise is added in the latent space. Fig. 2 demonstrates that, from noisy images whose global structures are preserved, SD seamlessly restores clean, highly detailed images.

Building upon this observation, we introduce a novel progressive high-resolution image generation pipeline, dubbed DiffuseHigh, where a relatively low-resolution image (sampled from SD) serves as a guide for generating higher-resolution images. Inspired by the recent literature [45, 41], we suggest the noising-denoising technique to synthesize higher-resolution images. First, we generate the low-resolution image using SD and upsample it by bilinear interpolation. Then, we add sufficient noise to obfuscate the fine details of the interpolated images. Finally, we perform the reverse diffusion process to denoise those images to infuse the high-frequency details to synthesize higher-resolution images, and we can repeat this process until we obtain the desired resolution images. This approach leverages the overall structure from the low-resolution image, effectively addressing repetition issues observed in the prior methods.

However, the ‘adding noise to damage the images’ approach poses several challenges. If we add too much noise, then we lose most of the structure in the low-resolution images, resulting in repetitive outcomes similar to those we generate from scratch. On the other hand, if we introduce a minimal amount of noise, the generated higher-resolution images do not show notable differences from the interpolated images, losing the opportunity to synthesize high-frequency details. In addition, finding adequate noise relies on both the content of the image and the pre-trained models, which makes it challenging to offer precise suggestions to users.

To resolve the issues above, we propose a principled way of preserving the overall structure from the low-resolution image for the suggested progressive pipeline. We employ a frequency-domain representation to extract the global structure as well as detailed contents from the low-resolution images. More specifically, we adopt the Discrete Wavelet Transform (DWT) to obtain essential contents, e.g., the $LL$ component, which we then incorporate into the denoising procedure to ensure that the resulting image remains consistent and does not deviate excessively.

Fig. 3 provides an overview of the overall pipeline of our method. We validate the proposed pipeline on the LAION-400M dataset [58] and demonstrate the superior performance of DiffuseHigh compared to other baseline methods. Additionally, we extend our method to diffusion-based video generation [71] to showcase the versatility of DiffuseHigh. The contributions of our work are summarized as follows:

  • Our observation indicates that SD has the innate ability to synthesize images with higher resolution than those it was trained on.

  • We suggest a novel training-free progressive high-resolution image synthesis pipeline called DiffuseHigh, in which a lower-resolution image acts as a guide for generating higher-resolution images.

  • We further propose Discrete Wavelet Transform (DWT)-based structure guidance during the denoising process, which enhances the structural properties and fine details of the generated samples.

  • We conduct comprehensive experiments both on image and video synthesis, demonstrating the superiority and versatility of our method.

2 Related Work

2.0.1 Diffusion Models

Diffusion models (DMs) [23, 63] represent a novel paradigm within the generative modeling framework, employing numerical methods [39, 78, 5] to solve reverse-time stochastic differential equations (SDEs) for simulating the generative trajectories [64]. Under this rigorous theoretical framework, DMs achieve state-of-the-art (SoTA) [28, 44] image quality and comparable model likelihood [31, 38].

2.0.2 Text-to-Image Generation

Text-driven image generation can be traced back to the use of GANs [9, 29, 56], often combined with image-text representations such as CLIP [47], achieving significant performance. However, generating semantically consistent images with text guidance remains challenging for GANs [52]. Recently, DMs have gained popularity for their ability to produce high-quality images [44], showcasing great potential in text-to-image generation [15, 25]. A pioneering work is Stable Diffusion [50], which iteratively incorporates text representations in latent space, and further advancements have followed rapidly. Moreover, thanks to its large-scale training, Stable Diffusion has been applied to various text-to-image tasks [35, 43, 11] through fine-tuning [52] or training-free [49] methods. While significant progress has been made in the field of text-to-image generation, one limitation of DMs is that they can generate images only at fixed resolutions [52, 35, 54], which is attributed to their training on specific image sizes. To remedy this, in this paper, we employ text prompts to generate images at much higher resolutions than those present in the training datasets, in a training-free manner.

2.0.3 Noising-Denoising

Based on the stochastic differential equation (SDE) reflecting the generative diffusion process, SDEdit [41] proposed a unified framework for image editing and image synthesis. Given images with low-level details, e.g., stroke painting, they add an adequate amount of noise to the image. Subsequently, they restore a clean, natural image from the noisy image through an iterative reverse SDE.

This ‘noising-denoising’ strategy, which performs a reverse diffusion process from the intermediate noised image, has been widely adopted in various domains. AnoDDPM [73] proposed reconstruction-based anomaly detection with a partial Markov chain, where the data sample is slightly noised with small timesteps and then reconstructed. SDXL [45] employs an optional refinement network, which refines the low-quality parts of the image through a noising-denoising process. Similarly,   adopts this algorithm in a post-processing stage in order to rectify the imperfect video frames.

2.0.4 High-resolution Image Synthesis

Despite the progress made by current diffusion model-based synthesis methods, achieving high-resolution image generation remains elusive. Previous studies have tackled these challenges through methods such as training from scratch and fine-tuning [74, 80]. However, training from scratch and fine-tuning often require significant computational resources and a substantial amount of high-resolution training data. Consequently, there has been a recent trend towards training-free methods [79, 27] for generating arbitrary-size or high-resolution images. ScaleCrafter [19] utilized dilated convolution to adjust the convolutional receptive field, enabling adaptation to high-resolution image generation without any training.

Recently, Make-a-Cheap-Scaling [18] has also adopted the noising-denoising technique to synthesize higher-resolution images. To further boost the image quality, they propose to tune a lightweight upsampler module, which can provide proper semantic guidance during the generation process. In contrast, we propose to obtain explicit structural guidance from the low-resolution image, which can effectively address the object repetition problem and irregular structure issues. Our proposed DiffuseHigh can be directly applied to any prevalent pretrained diffusion model in a completely training-free manner, providing both efficacy and efficiency.

3 Preliminary

In this section, we briefly present preliminaries relevant to our method, including Stable Diffusion (SD) [50] and Discrete Wavelet Transform (DWT).

3.1 Stable Diffusion

Stable Diffusion is a text-to-image latent diffusion model where the diffusion process is performed on a low-dimensional latent space. Given a data sample $x$ from the unknown data distribution $p_{\text{data}}(x)$, Stable Diffusion encodes $x$ into a latent representation $z_0 = \mathcal{E}(x)$, where $\mathcal{E}(\cdot)$ is the encoder of an autoencoder that compresses the high-dimensional data into a compact latent space. Then, the model gradually adds isotropic Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$ to a clean sample $z_0$ with a pre-defined noise schedule $\alpha_t \in (0, 1)$:

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad (1)$$

where $t \in \{1, \ldots, T\}$ denotes the timesteps of the diffusion process and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. The denoising network $\epsilon_\phi(z_t; t, y)$ parametrized by $\phi$ learns to predict the noise added, given the noisy latent $z_t$ and text prompt $y$, with the following denoising score matching objective:

$$\mathcal{L} := \mathbb{E}_{t,\epsilon}\left[ \left\| w(t) \left( \epsilon_\phi(z_t; t, y) - \epsilon \right) \right\|_2^2 \right]. \qquad (2)$$

Here, $w(t)$ is a weighting function applied to each loss term at timestep $t$.
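As a rough illustration of Eq. 1 and Eq. 2, the following NumPy sketch builds $\bar{\alpha}_t$ as a cumulative product and applies the forward noising to a latent. The linear $\beta$ schedule, its endpoints, and all function names are assumptions made for this example only; they are not taken from the paper, and Stable Diffusion itself may use a different schedule.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Return alpha_bar_t = prod_{s<=t} alpha_s for t = 1..num_steps.

    A linear beta schedule with alpha_t = 1 - beta_t is an assumption here.
    """
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z0, t, alpha_bar, rng):
    """Eq. 1: z_t = sqrt(alpha_bar_t) z_0 + sqrt(1 - alpha_bar_t) eps (t is 1-indexed)."""
    eps = rng.standard_normal(z0.shape)
    a = alpha_bar[t - 1]
    z_t = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps
    return z_t, eps

def denoising_loss(eps_pred, eps, weight=1.0):
    """Eq. 2 for a single (t, eps) sample: || w(t) (eps_pred - eps) ||_2^2."""
    return float(np.sum((weight * (eps_pred - eps)) ** 2))
```

In training, the expectation in Eq. 2 would be approximated by averaging this quantity over randomly drawn timesteps and noise samples.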

Initiating from $z_T \sim \mathcal{N}(0, I)$, the reverse process is formulated as $q_\phi(z_{t-1} | z_t, z_0)$ with $q_\phi(\cdot|\cdot)$ parametrized as a Gaussian distribution. For efficiency, the DDIM [62] sampling strategy is generally adopted, where the unknown $z_0$ is replaced with the predicted clean latent $\hat{z}_{0,t}$ at timestep $t$:

$$\hat{z}_{0,t} = \frac{z_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\phi(z_t; t, y)}{\sqrt{\bar{\alpha}_t}}. \qquad (3)$$

Finally, the clean image $\hat{x}$ is reconstructed by the decoder $\mathcal{D}(\cdot)$ of Stable Diffusion, i.e., $\hat{x} = \mathcal{D}(z_0)$.
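To make Eq. 3 concrete, the sketch below computes the predicted clean latent and then takes one deterministic DDIM update. The deterministic ($\eta = 0$) form of the update and the function names are assumptions for illustration; the noise prediction is passed in as an array rather than produced by a real network.

```python
import numpy as np

def predict_z0(z_t, eps_pred, alpha_bar_t):
    """Eq. 3: predicted clean latent from the noisy latent and the predicted noise."""
    return (z_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

def ddim_step(z_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0, assumed).

    The predicted clean latent is re-noised to the previous noise level
    using the same predicted noise direction.
    """
    z0_hat = predict_z0(z_t, eps_pred, alpha_bar_t)
    return np.sqrt(alpha_bar_prev) * z0_hat + np.sqrt(1.0 - alpha_bar_prev) * eps_pred
```

If the noise prediction were exact, `predict_z0` would return the original latent, and a step to `alpha_bar_prev = 1.0` would land on it directly.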

3.2 Discrete Wavelet Transform

Frequency-based methods, including the Discrete Fourier Transform (DFT), Discrete Cosine Transform (DCT), and Discrete Wavelet Transform (DWT) play a pivotal role in discrete signal processing. Such frequency-based approaches transform the given signal into the frequency domain, enabling the analysis and manipulation of the individual frequency bands.

Among them, utilizing wavelets, DWT decomposes images into different components that are localized both in time and frequency. Specifically, at each DWT level, the decomposed components consist of an approximation coefficient denoted as $LL_l$ and detail coefficients denoted as $LH_l, HL_l, HH_l$, where $l$ represents the level of the DWT. Leveraging the low-pass filter and high-pass filter in both vertical and horizontal directions, $LL_l$ represents the low-frequency content of the image, encompassing global structures, uniformly-colored regions, and smooth textures. On the other hand, $LH_l, HL_l, HH_l$ encapsulate the high-frequency details, such as edges, boundaries, and rough textures.

We adopt DWT as a tool for the guidance of overall structures and contents of the low-resolution image for generating a higher-resolution image. The details of applying DWT-based guidance on our pipeline are described in Sec. 4.3.
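For intuition, a single-level 2D DWT and its inverse can be written in a few lines of NumPy. The sketch below uses the orthonormal Haar wavelet, which is an assumption: the text above does not state which wavelet is used. The function names are hypothetical, and the assignment of the names LH and HL to the two mixed sub-bands varies between libraries.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT of an (H, W, ...) array with even H and W.

    Returns (LL, LH, HL, HH), each of shape (H/2, W/2, ...).
    """
    if x.shape[0] % 2 or x.shape[1] % 2:
        raise ValueError("height and width must be even")
    a = x[0::2, 0::2]  # top-left pixel of every 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # local average: coarse structure
    lh = (a + b - c - d) / 2.0  # difference between rows
    hl = (a - b + c - d) / 2.0  # difference between columns
    hh = (a - b - c + d) / 2.0  # diagonal difference
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the (H, W, ...) array from the four sub-bands."""
    h, w = ll.shape[:2]
    out = np.empty((2 * h, 2 * w) + ll.shape[2:], dtype=float)
    out[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    out[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    out[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    out[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return out
```

A uniformly colored image yields zero detail coefficients, which matches the description of $LL_l$ as carrying smooth, low-frequency content.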

4 Method

Figure 3: Progressive High-Resolution Diffusion Pipeline. For simplicity, we did not depict the transformation between latent space and pixel space. The text prompt: ‘A group of playful dolphins leaping gracefully out of the sparkling ocean waves.’.

4.1 Problem Formulation

Our work aims to generate images at resolutions higher than the training size, given textual prompts, with a text-to-image diffusion model (Stable Diffusion) in a training-free manner. More formally, given a text description $y$ and Stable Diffusion $\epsilon_\phi(\cdot)$ pretrained on fixed-size images $(h, w, 3)$, our objective is to generate a higher-resolution image $(H, W, 3)$ without training $\phi$, where $h \ll H$ and $w \ll W$.

4.2 Progressive High-Resolution Diffusion Pipeline

We present a progressive approach for generating high-resolution images using a pretrained Stable Diffusion model. Initially, our method generates a clean sample based on a given text description through Stable Diffusion. Assuming alignment between the generated image and the provided text, we then employ bilinear interpolation to upscale the image, thereby guiding the high-resolution image generation. Our method incorporates a noising-denoising technique [41], which gradually projects the sample onto the manifold of natural, highly detailed images that the diffusion model has learned. This iterative procedure can also be interpreted as a refinement stage [45], wherein the denoising process restores the high-frequency details missing from the low-resolution sample.

Let $x_0 \in \mathbb{R}^{h \times w \times 3}$ be the generated sample of Stable Diffusion aligned with the user-provided textual prompt $y$, and let $p_\phi(x|y)$ be the text-conditional data distribution the model has learned. In other words, $x_0 \sim p_\phi(x|y)$ and $p_\phi(x|y) \approx p_{\text{data}}(x|y)$. We use the relatively low-resolution generated image $x_0$ as a guide for higher-resolution image generation and apply bilinear interpolation to $x_0$ to obtain an image of the desired size, $\tilde{x}_0 \in \mathbb{R}^{H \times W \times 3}$. Note that the details of the resulting image $\tilde{x}_0$ lack clarity due to the nature of the interpolation, which entails averaging neighboring pixel values to compose newly introduced pixels.

In order to infuse the appropriate details into the current high-resolution image, we first add noise corresponding to the diffusion timestep $N < T$ to its latent code $\tilde{z}_0 = \mathcal{E}(\tilde{x}_0)$ according to Eq. 1:

$$\hat{z}_N = \sqrt{\bar{\alpha}_N}\, \tilde{z}_0 + \sqrt{1 - \bar{\alpha}_N}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I). \qquad (4)$$

Then the denoising network $\epsilon_\phi(\cdot)$ performs the reverse process on the noisy latent representation $\hat{z}_N$ to recover the clean latent $\hat{z}_0$. By employing the latent decoder $\mathcal{D}(\cdot)$, we finally obtain the desired high-resolution image.
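The upsample, noise, and denoise procedure described above can be sketched as follows. This is a toy illustration under several assumptions rather than the actual implementation: the encoder and decoder are taken to be the identity, `eps_model` is a hypothetical callable standing in for the denoising network $\epsilon_\phi$, the reverse process uses deterministic DDIM updates, and `timesteps` holds 0-indexed positions in `alpha_bar` in decreasing order.

```python
import numpy as np

def bilinear_upsample(img, out_h, out_w):
    """Bilinear resize of an (h, w, c) array, sampling on a corner-aligned grid."""
    h, w = img.shape[:2]
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]  # vertical interpolation weights
    wx = (xs - x0)[None, :, None]  # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def noise_then_denoise(x_up, eps_model, alpha_bar, timesteps, rng):
    """Noise the upsampled image to level timesteps[0] (Eq. 4), then denoise it.

    Identity encoder/decoder and deterministic DDIM updates are assumptions.
    """
    a_n = alpha_bar[timesteps[0]]
    z = np.sqrt(a_n) * x_up + np.sqrt(1.0 - a_n) * rng.standard_normal(x_up.shape)
    for i, t in enumerate(timesteps):
        a_t = alpha_bar[t]
        # After the last step the latent is fully clean (alpha_bar = 1).
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        eps = eps_model(z, t)
        z0_hat = (z - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)       # Eq. 3
        z = np.sqrt(a_prev) * z0_hat + np.sqrt(1.0 - a_prev) * eps  # DDIM update
    return z
```

Repeating the two calls with a larger target size at each round would give the progressive variant, with the output of one round serving as the low-resolution input of the next.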

This progressive pipeline bears a resemblance to cascade diffusion models [24, 54], albeit with inherent differences. Cascade diffusion models employ multiple networks, one for low-resolution image generation and others for super-resolution. In contrast, the described method relies solely on a single pretrained diffusion model, obviating the need to train separate models.

Figure 4: Experiment with varying noising step $N$. The first row shows the noisy images, where noise corresponding to different timesteps is added. The second row shows the denoised clean images. A small $N$ fails to generate detailed features, resulting in a blurry image. On the other hand, a large $N$ introduces the object repetition problem: repeated objects on the road for $N=25$, bikes for $N=45$. The textual prompt ‘a bear riding a bike in New York City’ is used in this experiment.

Noising timestep $N$ is an important factor for the overall quality of the resulting image (see Fig. 4). Large $N$ significantly destroys the critical structural properties in the image, leading to object repetition problems and undesirable object shapes. However, small $N$ does not provide the diffusion model enough timesteps to perform the denoising process to restore the fine, high-frequency details of the image. Determining appropriate noise levels is contingent upon both the content of the image and the characteristics of pretrained diffusion models, thereby presenting a challenge in practical usage. Furthermore, we observed numerous instances where this approach degraded image quality across all noise levels. This leads us to develop a more principled way to uphold the overall structure and maintain the quality of the generated higher-resolution images.

4.3 Structure Guidance through Discrete Wavelet Transform

The progressive approach demonstrates proficiency in generating high-fidelity images; however, it frequently encounters challenges in effectively capturing certain structural properties and nuanced details from low-resolution inputs. Consequently, this can lead to discrepancies between the generated image and the actual data distribution (see Fig. 5). This phenomenon is expected, as structures and intricate details are susceptible to damage and distortion once a certain amount of noise is added.

Figure 5: Comparison between DiffuseHigh without DWT and DiffuseHigh. As shown in the figure, (a) DiffuseHigh without DWT fails to generate a realistic image and shows artifacts, as observed in the orange box. (b) DiffuseHigh without DWT generates black dots on the body of a teddy and also fails to generate its mouth. In contrast, with DiffuseHigh, the structural properties of the low-resolution image successfully guide the object toward clean shapes and textures.

We hereby introduce the method DiffuseHigh (Fig. 3), in which we incorporate a Discrete Wavelet Transform (DWT)-based structure guide into the proposed progressive pipeline. This method aims to enhance the fidelity of generated images by encouraging the preservation of crucial features from the low-resolution input. Given an interpolated image x~0H×W×3subscript~𝑥0superscript𝐻𝑊3\tilde{x}_{0}\in\mathbb{R}^{H\times W\times 3}over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_H × italic_W × 3 end_POSTSUPERSCRIPT, we extract its low-frequency component utilizing the DWT, which encapsulates the overall structure and coarse details of the image. More formally, let us define DWT():H×W×34×H2×W2×3:DWTsuperscript𝐻𝑊3superscript4𝐻2𝑊23\texttt{DWT}(\cdot):\mathbb{R}^{H\times W\times 3}\rightarrow\mathbb{R}^{4% \times\frac{H}{2}\times\frac{W}{2}\times 3}DWT ( ⋅ ) : blackboard_R start_POSTSUPERSCRIPT italic_H × italic_W × 3 end_POSTSUPERSCRIPT → blackboard_R start_POSTSUPERSCRIPT 4 × divide start_ARG italic_H end_ARG start_ARG 2 end_ARG × divide start_ARG italic_W end_ARG start_ARG 2 end_ARG × 3 end_POSTSUPERSCRIPT and DWT(x~0)LL,DWT(x~0)LH,DWT(x~0)HL,DWT(x~0)HHH2×W2×3DWTsubscriptsubscript~𝑥0𝐿𝐿DWTsubscriptsubscript~𝑥0𝐿𝐻DWTsubscriptsubscript~𝑥0𝐻𝐿DWTsubscriptsubscript~𝑥0𝐻𝐻superscript𝐻2𝑊23\texttt{DWT}(\tilde{x}_{0})_{LL},\texttt{DWT}(\tilde{x}_{0})_{LH},\texttt{DWT}% (\tilde{x}_{0})_{HL},\texttt{DWT}(\tilde{x}_{0})_{HH}\in\mathbb{R}^{\frac{H}{2% }\times\frac{W}{2}\times 3}DWT ( over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_L italic_L end_POSTSUBSCRIPT , DWT ( over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_L italic_H end_POSTSUBSCRIPT , DWT ( over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_H italic_L end_POSTSUBSCRIPT , DWT ( over~ start_ARG italic_x end_ARG 
start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_H italic_H end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT divide start_ARG italic_H end_ARG start_ARG 2 end_ARG × divide start_ARG italic_W end_ARG start_ARG 2 end_ARG × 3 end_POSTSUPERSCRIPT are four decomposed components of the interpolated high-resolution image. Then, we define a DWT-guided denoising step at ‘t𝑡titalic_t’ as follows.

z^t1=α¯t1(𝚒𝙳𝚆𝚃(𝙳𝚆𝚃(x~0)LL,𝙳𝚆𝚃(x^0,t)LH,𝙳𝚆𝚃(x^0,t)HL,𝙳𝚆𝚃(x^0,t)HH))+1α¯t1ϵ,subscript^𝑧𝑡1subscript¯𝛼𝑡1𝚒𝙳𝚆𝚃𝙳𝚆𝚃subscriptsubscript~𝑥0𝐿𝐿𝙳𝚆𝚃subscriptsubscript^𝑥0𝑡𝐿𝐻𝙳𝚆𝚃subscriptsubscript^𝑥0𝑡𝐻𝐿𝙳𝚆𝚃subscriptsubscript^𝑥0𝑡𝐻𝐻1subscript¯𝛼𝑡1italic-ϵ\displaystyle\begin{split}\hat{z}_{t-1}=&\sqrt{\bar{\alpha}_{t-1}}\mathcal{E}(% \mathtt{iDWT}(\mathtt{DWT}(\tilde{x}_{0})_{LL},\mathtt{DWT}(\hat{x}_{0,t})_{LH% },\mathtt{DWT}(\hat{x}_{0,t})_{HL},\mathtt{DWT}(\hat{x}_{0,t})_{HH}))\\ &+\sqrt{1-\bar{\alpha}_{t-1}}\epsilon,\end{split}start_ROW start_CELL over^ start_ARG italic_z end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT = end_CELL start_CELL square-root start_ARG over¯ start_ARG italic_α end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT end_ARG caligraphic_E ( typewriter_iDWT ( typewriter_DWT ( over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_L italic_L end_POSTSUBSCRIPT , typewriter_DWT ( over^ start_ARG italic_x end_ARG start_POSTSUBSCRIPT 0 , italic_t end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_L italic_H end_POSTSUBSCRIPT , typewriter_DWT ( over^ start_ARG italic_x end_ARG start_POSTSUBSCRIPT 0 , italic_t end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_H italic_L end_POSTSUBSCRIPT , typewriter_DWT ( over^ start_ARG italic_x end_ARG start_POSTSUBSCRIPT 0 , italic_t end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_H italic_H end_POSTSUBSCRIPT ) ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + square-root start_ARG 1 - over¯ start_ARG italic_α end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT end_ARG italic_ϵ , end_CELL end_ROW (5)

where $\hat{x}_{0,t} = \mathcal{D}(\hat{z}_{0,t})$ is the predicted clean image at timestep $t$, obtained using Eq. 3 and the decoder $\mathcal{D}(\cdot)$, $\mathtt{iDWT}(\cdot)$ is the inverse DWT, and $\mathcal{E}(\cdot)$ is the encoder. Finally, we recover the denoised image $\hat{x}_{t-1} = \mathcal{D}(\hat{z}_{t-1})$ at timestep $t-1$.
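
The guided step in Eq. 5 can be illustrated with a minimal NumPy sketch. Several things here are assumptions rather than details from the paper: the single-level Haar wavelet, the naming of the detail sub-bands, the function names `haar_dwt2`, `haar_idwt2`, and `dwt_guided_step`, and the `encode` callable standing in for the encoder $\mathcal{E}(\cdot)$. The noise term `eps` is passed in by the caller because this section does not state how $\epsilon$ is obtained.

```python
import numpy as np


def haar_dwt2(x):
    """Single-level 2D Haar DWT of an (H, W, C) array with even H and W.

    Returns (LL, LH, HL, HH), each of shape (H/2, W/2, C).
    The Haar wavelet and the sub-band naming are assumptions of this sketch.
    """
    a, b = x[0::2, 0::2], x[0::2, 1::2]  # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]  # bottom-left, bottom-right
    ll = (a + b + c + d) / 2.0
    lh = (a - b + c - d) / 2.0
    hl = (a + b - c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds the (H, W, C) array from four sub-bands."""
    h, w = ll.shape[:2]
    x = np.empty((2 * h, 2 * w) + ll.shape[2:], dtype=np.result_type(ll, lh, hl, hh))
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x


def dwt_guided_step(x_tilde0, x_hat0_t, alpha_bar_prev, eps, encode):
    """One guided update following the structure of Eq. 5.

    x_tilde0:       interpolated high-resolution image (structure guidance)
    x_hat0_t:       predicted clean image at timestep t (already decoded)
    alpha_bar_prev: cumulative noise-schedule value at timestep t-1
    eps:            noise term with the same shape as the encoded latent
    encode:         hypothetical stand-in for the encoder E(.)
    """
    ll_ref, _, _, _ = haar_dwt2(x_tilde0)    # low-frequency band from the guide
    _, lh, hl, hh = haar_dwt2(x_hat0_t)      # high-frequency bands from the prediction
    x_mixed = haar_idwt2(ll_ref, lh, hl, hh)
    z0 = encode(x_mixed)
    return np.sqrt(alpha_bar_prev) * z0 + np.sqrt(1.0 - alpha_bar_prev) * eps
```

In a real pipeline, `encode` would wrap the VAE encoder of the latent diffusion model; passing an identity function is enough to check that the low-frequency band of the result comes from the guide image while the detail bands come from the prediction.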

Since SD performs the diffusion process in the latent space, frequent transitions between the latent and pixel spaces impose a considerable computational burden. This is particularly pronounced when the image resolution is high. Empirically, we found that restricting the DWT-guided denoising procedure to the initial 5 steps out of a total of 15 achieves a favorable balance between image fidelity and computational efficiency.
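
The saving from limiting guidance to the early steps can be sketched by counting latent-to-pixel round trips. This is only an illustration under the assumption that each guided step costs one decode and one encode on top of the usual denoising; the helper names `guided_step_mask` and `count_extra_vae_calls` are hypothetical.

```python
def guided_step_mask(num_steps: int, num_guided: int) -> list:
    """Return one boolean per denoising step: True if DWT guidance is applied.

    Guidance is applied only in the first `num_guided` steps.
    """
    if not 0 <= num_guided <= num_steps:
        raise ValueError("num_guided must lie in [0, num_steps]")
    return [i < num_guided for i in range(num_steps)]


def count_extra_vae_calls(num_steps: int, num_guided: int) -> int:
    """Count extra VAE calls, assuming one decode plus one encode per guided step."""
    return 2 * sum(guided_step_mask(num_steps, num_guided))


every_step = count_extra_vae_calls(15, 15)  # guidance at all 15 steps
first_five = count_extra_vae_calls(15, 5)   # guidance only in the first 5 steps
print(every_step, first_five)  # 30 10
```

Under this assumption, guiding only 5 of the 15 steps cuts the extra encoder and decoder calls to one third.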

5 Experiments

5.1 Implementation Details

For high-resolution image generation, we conducted extensive experiments on two text-to-image diffusion models, Stable Diffusion 2.1 [65] and Stable Diffusion XL [45]. To ensure a fair comparison with baseline methods, we validate our method at inference resolutions of $4\times$ and $16\times$ the model's original training resolution. Specifically, we generate images at $1024^2$ and $2048^2$ for Stable Diffusion 2.1, and at $2048^2$ and $4096^2$ for Stable Diffusion XL. For video generation, we conducted experiments on ModelScope [71]. We used 50 DDIM steps to generate both images and videos. As mentioned in Sec. 4.2 and Sec. 4.3, we fixed our hyperparameters to $N=15$ and applied DWT guidance for 5 steps.

5.2 Evaluation

We utilized the LAION-400M [58] dataset, which comprises 400 million image-text pairs, as the benchmark for image generation experiments. (Access to the LAION-5B dataset was revoked due to concerns regarding potentially illegal content, specifically Child Sexual Abuse Material (CSAM); we therefore evaluate our method on the LAION-400M dataset instead.) We randomly sampled captions from the benchmark dataset and generated images corresponding to the sampled captions. Due to the substantial computational cost, we generated 20K images for $1024^2$, 5K images for $2048^2$, and 1K images for $4096^2$, and compared the performance of our method against the baselines. We selected Fréchet Inception Distance (FID) and Kernel Inception Distance (KID), denoted as $FID_r$ and $KID_r$, as our evaluation metrics. Following previous work [19], we additionally report the same metrics between the samples generated at the increased resolutions and those generated at the base resolutions, denoted as $FID_b$ and $KID_b$, to estimate how well each method preserves the original model's generation capability.
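
For readers unfamiliar with the metric, the following sketch shows the standard Fréchet distance between two Gaussians fitted to feature sets, which is the quantity underlying FID. It assumes the features have already been extracted (for FID, typically from an Inception network), and the function name `frechet_distance` is hypothetical. It is not the evaluation code used in the paper, and KID is not covered.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (n, d) feature arrays.

    d^2 = ||mu_a - mu_b||^2 + Tr(C_a) + Tr(C_b) - 2 Tr((C_a C_b)^{1/2})
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((C_a C_b)^{1/2}) equals the sum of the square roots of the eigenvalues
    # of C_a @ C_b, which are real and non-negative for PSD covariances.
    # Clipping guards against tiny negative values from round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

With this function, $FID_r$ would compare generated features against real-image features, while $FID_b$ would compare features of high-resolution samples against features of base-resolution samples from the same model.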

Similar to the image experiment settings, for video generation we sampled 2048 random captions from the WebVid-10M dataset [3] and measured the Fréchet Video Distance (FVD) [70] on 16-frame videos. We generate videos at $1024^2$ resolution with our method, since the base model is capable of generating $512^2$ resolution videos.

Figure 6: $4\times$ larger resolution results obtained with DiffuseHigh.

5.3 Image generation

We compare both of our methods, the progressive-only approach, denoted as ‘DiffuseHigh (w/o DWT)’ (Sec. 4.2), and ‘DiffuseHigh’ with DWT-based guidance (Sec. 4.3), against existing training-free methods, e.g., direct inference with the Stable Diffusion models (D.I) and ScaleCrafter [19]. We excluded from the baselines methods [6, 34] that can generate higher-resolution images but suffer from the object repetition problem. Moreover, we empirically found that utilizing multiple resolutions during generation yields better results. We therefore added one intermediate resolution in DiffuseHigh for the $4\times$ experiments, i.e., [$512^2$, $768^2$, $1024^2$] for Stable Diffusion 2.1 and [$1024^2$, $1536^2$, $2048^2$] for Stable Diffusion XL.
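
The two resolution lists above are consistent with evenly spaced side lengths between the base and target resolutions. The sketch below reproduces them under that assumption; the even spacing rule and the helper name `resolution_schedule` are illustrative and not stated in the paper.

```python
def resolution_schedule(base: int, target: int, num_intermediate: int = 1) -> list:
    """Evenly spaced side lengths from `base` to `target`, both inclusive.

    `num_intermediate` is the number of resolutions inserted between them.
    """
    if target <= base:
        raise ValueError("target must be larger than base")
    if num_intermediate < 0:
        raise ValueError("num_intermediate must be non-negative")
    segments = num_intermediate + 1
    return [base + (target - base) * i // segments for i in range(segments + 1)]


print(resolution_schedule(512, 1024))   # [512, 768, 1024]
print(resolution_schedule(1024, 2048))  # [1024, 1536, 2048]
```

Other spacing rules would need an extra check that each side length is valid for the model in use, for example divisibility by the VAE downsampling factor.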

| Method | SD 2.1 (1K) $FID_r$ | SD 2.1 (1K) $KID_r$ | SD 2.1 (1K) $FID_b$ | SD 2.1 (1K) $KID_b$ | SDXL 1.0 (2K) $FID_r$ | SDXL 1.0 (2K) $KID_r$ | SDXL 1.0 (2K) $FID_b$ | SDXL 1.0 (2K) $KID_b$ |
|---|---|---|---|---|---|---|---|---|
| D.I | 58.77 | 0.0176 | 43.21 | 0.0094 | 69.37 | 0.025 | 47.01 | 0.128 |
| ScaleCrafter | 32.60 | 0.0117 | 18.43 | 0.0047 | 62.84 | 0.020 | 44.84 | 0.0104 |
| DiffuseHigh (w/o DWT) | 18.11 | 0.0066 | 4.09 | 0.0006 | 30.74 | 0.0081 | 17.21 | 0.0015 |
| DiffuseHigh | 18.72 | 0.0069 | 3.56 | 0.0003 | 26.08 | 0.0077 | 12.46 | 0.0001 |

Table 1: Evaluation of $4\times$ experiments on LAION-400M. We compare our proposed method with training-free baseline methods. The table shows the metric scores of each method.
Figure 7: Visualization of DWT components corresponding to the $S$-th DDIM timestep within DiffuseHigh. Since we replace the LL component of the estimated clean image during several DDIM reverse steps, a large portion of the LL component is preserved throughout the denoising process.

We report the evaluation results of the $4\times$ resolution inference experiment in Tab. 1. Both of our approaches surpassed the given baselines by a large margin in terms of both FID and KID. With SD 2.1, DiffuseHigh (w/o DWT) slightly outperformed DiffuseHigh on $FID_r$ and $KID_r$, while with SDXL, DiffuseHigh achieved better scores than DiffuseHigh (w/o DWT) on all metrics. Qualitative samples are shown in Fig. 6.

We also report the quantitative evaluation metrics for the $16\times$ setting in Tab. 2, mainly comparing DiffuseHigh (w/o DWT) with DiffuseHigh. Similar to the $4\times$ experiment, DiffuseHigh (w/o DWT) showed better $FID_r$ and $KID_r$ with SD 2.1, while DiffuseHigh achieved better scores with SDXL.

Figure 8: $16\times$ larger resolution results obtained with DiffuseHigh.

We conjecture that this consistent observation stems from the capacity of the diffusion model to generate correct shapes, structures, and details of objects. We observed that SD 2.1 is more likely than SDXL to generate objects with undesirable appearances. Since DiffuseHigh generates a high-resolution image guided by the generated low-resolution image, objects in the result are highly likely to inherit these flawed structures. The progressive-only approach, by contrast, has the opportunity to amend the flawed shapes since its denoising process is more flexible. The outcomes differ significantly with SDXL. Because SDXL is proficient at generating natural-looking images, leveraging guidance from the low-resolution image proves advantageous: it ensures the accurate incorporation of structural attributes, leading to convincing results with DiffuseHigh. Qualitative results produced by DiffuseHigh are shown in Fig. 8.

| Method | SD 2.1 (2K) $FID_r$ | SD 2.1 (2K) $KID_r$ | SD 2.1 (2K) $FID_b$ | SD 2.1 (2K) $KID_b$ | SDXL 1.0 (4K) $FID_r$ | SDXL 1.0 (4K) $KID_r$ | SDXL 1.0 (4K) $FID_b$ | SDXL 1.0 (4K) $KID_b$ |
|---|---|---|---|---|---|---|---|---|
| DiffuseHigh (w/o DWT) | 27.42 | 0.0065 | 14.05 | 0.0011 | 60.27 | 0.0081 | 44.06 | 0.0003 |
| DiffuseHigh | 30.96 | 0.0083 | 12.95 | 0.0005 | 59.62 | 0.0077 | 43.35 | 0.0002 |

Table 2: Evaluation of $16\times$ experiments on LAION-400M.

5.4 Video generation

To further validate the versatility of DiffuseHigh, we apply our proposed method to ModelScope [71] to generate videos at a higher resolution than its original one. As observed with images, direct inference with the pretrained video model also results in a severe object repetition problem (see Fig. 9). We report the quantitative results of the video experiments in Tab. 3. DiffuseHigh achieved a substantially lower FVD [70] score than direct inference (607.99 vs. 785.16, a reduction of roughly 23%). This demonstrates the versatility of our method, which adapts the video diffusion model (ModelScope [71]) to higher-resolution settings more effectively than direct inference. Additional qualitative examples are provided in the supplementary materials.

| Method | ModelScope (1K) FVD |
|---|---|
| D.I | 785.16 |
| DiffuseHigh | 607.99 |

Table 3: Video experiment results. We used Fréchet Video Distance (FVD) with 16 frames as the evaluation metric on 2048 generated videos. Captions are randomly sampled from the WebVid-10M dataset [3].
Figure 9: Video adaptation experiment. We used the ModelScope [71] model as a baseline. (a) Text prompt: ‘Darth Vader is surfing on waves.’ (b) Text prompt: ‘A teddy bear is dancing in front of the building.’

6 Limitation and Discussion

Since DiffuseHigh leverages generated low-resolution images as structural guidance, the generation ability of the diffusion model at its original resolution heavily affects the overall performance of our method. That is, structural defects or flaws in the low-resolution images are likely to be carried over to the resulting higher-resolution images. However, we believe that tuning-free enhancement methods such as FreeU [60], which improve the quality and fidelity of the sampled low-resolution image, would in turn improve the quality and fidelity of the resulting high-resolution image. We leave this as future work.

7 Conclusion

We present a training-free progressive high-resolution image synthesis pipeline that uses a diffusion model pretrained on low-resolution images. Inspired by the recent noising-denoising technique, our proposal leverages generated low-resolution images as guidance to effectively preserve the overall structure and intricate details of the contents. We also propose a principled way of incorporating structure information into the denoising process through a frequency-domain representation. This allows us to retain the essential information present in the low-resolution images. Extensive experiments with the pretrained SD models show that the proposed DiffuseHigh generates higher-resolution images without the commonly known issues of existing approaches, such as repetitive patterns and irregular structures.

References

  • [1] Avrahami, O., Fried, O., Lischinski, D.: Blended latent diffusion. ACM Transactions on Graphics (TOG) 42(4), 1–11 (2023)
  • [2] Avrahami, O., Lischinski, D., Fried, O.: Blended diffusion for text-driven editing of natural images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18208–18218 (2022)
  • [3] Bain, M., Nagrani, A., Varol, G., Zisserman, A.: Frozen in time: A joint video and image encoder for end-to-end retrieval. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1728–1738 (2021)
  • [4] Balaji, Y., Nah, S., Huang, X., Vahdat, A., Song, J., Kreis, K., Aittala, M., Aila, T., Laine, S., Catanzaro, B., et al.: ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324 (2022)
  • [5] Bao, F., Li, C., Zhu, J., Zhang, B.: Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. arXiv preprint arXiv:2201.06503 (2022)
  • [6] Bar-Tal, O., Yariv, L., Lipman, Y., Dekel, T.: Multidiffusion: Fusing diffusion paths for controlled image generation. In: International Conference on Machine Learning. pp. 1737–1752. PMLR (2023)
  • [7] Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)
  • [8] Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S., Kreis, K.: Align your latents: High-resolution video synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22563–22575 (2023)
  • [9] Brock, A., Donahue, J., Simonyan, K.: Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096 (2018)
  • [10] Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18392–18402 (2023)
  • [11] Chang, H., Zhang, H., Barber, J., Maschinot, A., Lezama, J., Jiang, L., Yang, M.H., Murphy, K., Freeman, W.T., Rubinstein, M., et al.: Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704 (2023)
  • [12] Chen, H., Xia, M., He, Y., Zhang, Y., Cun, X., Yang, S., Xing, J., Liu, Y., Chen, Q., Wang, X., et al.: Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512 (2023)
  • [13] Chen, N., Zhang, Y., Zen, H., Weiss, R.J., Norouzi, M., Chan, W.: Wavegrad: Estimating gradients for waveform generation. arXiv preprint arXiv:2009.00713 (2020)
  • [14] Chen, R., Chen, Y., Jiao, N., Jia, K.: Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. arXiv preprint arXiv:2303.13873 (2023)
  • [15] Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. NeurIPS 34, 8780–8794 (2021)
  • [16] Gao, S., Zhou, P., Cheng, M.M., Yan, S.: Masked diffusion transformer is a strong image synthesizer. arXiv preprint arXiv:2303.14389 (2023)
  • [17] Gao, S., Liu, X., Zeng, B., Xu, S., Li, Y., Luo, X., Liu, J., Zhen, X., Zhang, B.: Implicit diffusion models for continuous super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10021–10030 (2023)
  • [18] Guo, L., He, Y., Chen, H., Xia, M., Cun, X., Wang, Y., Huang, S., Zhang, Y., Wang, X., Chen, Q., et al.: Make a cheap scaling: A self-cascade diffusion model for higher-resolution adaptation. arXiv preprint arXiv:2402.10491 (2024)
  • [19] He, Y., Yang, S., Chen, H., Cun, X., Xia, M., Zhang, Y., Wang, X., He, R., Chen, Q., Shan, Y.: Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models. In: The Twelfth International Conference on Learning Representations (2023)
  • [20] He, Y., Yang, T., Zhang, Y., Shan, Y., Chen, Q.: Latent video diffusion models for high-fidelity video generation with arbitrary lengths. arXiv preprint arXiv:2211.13221 (2022)
  • [21] Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022)
  • [22] Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., et al.: Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022)
  • [23] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. NeurIPS 33, 6840–6851 (2020)
  • [24] Ho, J., Saharia, C., Chan, W., Fleet, D.J., Norouzi, M., Salimans, T.: Cascaded diffusion models for high fidelity image generation. The Journal of Machine Learning Research 23(1), 2249–2281 (2022)
  • [25] Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)
  • [26] Huang, R., Zhao, Z., Liu, H., Liu, J., Cui, C., Ren, Y.: Prodiff: Progressive fast diffusion model for high-quality text-to-speech. In: Proceedings of the 30th ACM International Conference on Multimedia. pp. 2595–2605 (2022)
  • [27] Jin, Z., Shen, X., Li, B., Xue, X.: Training-free diffusion model adaptation for variable-sized text-to-image synthesis. Advances in Neural Information Processing Systems 36 (2024)
  • [28] Karras, T., Aittala, M., Aila, T., Laine, S.: Elucidating the design space of diffusion-based generative models. NeurIPS 35, 26565–26577 (2022)
  • [29] Karras, T., Aittala, M., Laine, S., Härkönen, E., Hellsten, J., Lehtinen, J., Aila, T.: Alias-free generative adversarial networks. NeurIPS 34, 852–863 (2021)
  • [30] Kawar, B., Zada, S., Lang, O., Tov, O., Chang, H., Dekel, T., Mosseri, I., Irani, M.: Imagic: Text-based real image editing with diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6007–6017 (2023)
  • [31] Kim, D., Shin, S., Song, K., Kang, W., Moon, I.C.: Soft truncation: A universal training technique of score-based diffusion model for high precision score estimation. arXiv preprint arXiv:2106.05527 (2021)
  • [32] Kong, Z., Ping, W., Huang, J., Zhao, K., Catanzaro, B.: Diffwave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761 (2020)
  • [33] Lam, M.W., Wang, J., Huang, R., Su, D., Yu, D.: Bilateral denoising diffusion models. arXiv preprint arXiv:2108.11514 (2021)
  • [34] Lee, Y., Kim, K., Kim, H., Sung, M.: Syncdiffusion: Coherent montage via synchronized joint diffusions. Advances in Neural Information Processing Systems 36 (2024)
  • [35] Li, Y., Wang, H., Jin, Q., Hu, J., Chemerys, P., Fu, Y., Wang, Y., Tulyakov, S., Ren, J.: Snapfusion: Text-to-image diffusion model on mobile devices within two seconds. NeurIPS 36 (2024)
  • [36] Lin, C.H., Gao, J., Tang, L., Takikawa, T., Zeng, X., Huang, X., Kreis, K., Fidler, S., Liu, M.Y., Lin, T.Y.: Magic3d: High-resolution text-to-3d content creation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 300–309 (2023)
  • [37] Liu, H., Chen, Z., Yuan, Y., Mei, X., Liu, X., Mandic, D., Wang, W., Plumbley, M.D.: Audioldm: Text-to-audio generation with latent diffusion models. arXiv preprint arXiv:2301.12503 (2023)
  • [38] Lu, C., Zheng, K., Bao, F., Chen, J., Li, C., Zhu, J.: Maximum likelihood training for score-based diffusion odes by high order denoising score matching. In: ICML. pp. 14429–14460. PMLR (2022)
  • [39] Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. NeurIPS 35, 5775–5787 (2022)
  • [40] Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., Van Gool, L.: Repaint: Inpainting using denoising diffusion probabilistic models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11461–11471 (2022)
  • [41] Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., Ermon, S.: Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073 (2021)
  • [42] Mou, C., Wang, X., Xie, L., Wu, Y., Zhang, J., Qi, Z., Shan, Y., Qie, X.: T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453 (2023)
  • [43] Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., Chen, M.: GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021)
  • [44] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: ICCV. pp. 4195–4205 (2023)
  • [45] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)
  • [46] Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988 (2022)
  • [47] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML. pp. 8748–8763. PMLR (2021)
  • [48] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 (2022)
  • [49] Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., Sutskever, I.: Zero-shot text-to-image generation. In: ICML. pp. 8821–8831. PMLR (2021)
  • [50] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR. pp. 10684–10695 (2022)
  • [51] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. pp. 234–241. Springer (2015)
  • [52] Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: CVPR. pp. 22500–22510 (2023)
  • [53] Saharia, C., Chan, W., Chang, H., Lee, C., Ho, J., Salimans, T., Fleet, D., Norouzi, M.: Palette: Image-to-image diffusion models. In: ACM SIGGRAPH 2022 Conference Proceedings. pp. 1–10 (2022)
  • [54] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS 35, 36479–36494 (2022)
  • [55] Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D.J., Norouzi, M.: Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(4), 4713–4726 (2022)
  • [56] Sauer, A., Schwarz, K., Geiger, A.: StyleGAN-XL: Scaling StyleGAN to large diverse datasets. In: ACM SIGGRAPH 2022 conference proceedings. pp. 1–10 (2022)
  • [57] Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems 35, 25278–25294 (2022)
  • [58] Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., Komatsuzaki, A.: Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021)
  • [59] Shi, Y., Wang, P., Ye, J., Long, M., Li, K., Yang, X.: Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512 (2023)
  • [60] Si, C., Huang, Z., Jiang, Y., Liu, Z.: Freeu: Free lunch in diffusion u-net. arXiv preprint arXiv:2309.11497 (2023)
  • [61] Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al.: Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022)
  • [62] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
  • [63] Song, Y., Ermon, S.: Generative modeling by estimating gradients of the data distribution. NeurIPS 32 (2019)
  • [64] Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)
  • [65] stabilityai: Stable Diffusion 2-1 base (2022), https://huggingface.co/stabilityai/stable-diffusion-2-1
  • [66] stabilityai: Stable Diffusion Latent Upscaler (2023), https://huggingface.co/stabilityai/sd-x2-latent-upscaler
  • [67] stabilityai: Stable Diffusion Upscaler (2023), https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler
  • [68] Tang, J., Ren, J., Zhou, H., Liu, Z., Zeng, G.: Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653 (2023)
  • [69] Tumanyan, N., Geyer, M., Bagon, S., Dekel, T.: Plug-and-play diffusion features for text-driven image-to-image translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1921–1930 (2023)
  • [70] Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S.: Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717 (2018)
  • [71] Wang, J., Yuan, H., Chen, D., Zhang, Y., Wang, X., Zhang, S.: Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571 (2023)
  • [72] Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. Advances in Neural Information Processing Systems 36 (2024)
  • [73] Wyatt, J., Leach, A., Schmon, S.M., Willcocks, C.G.: Anoddpm: Anomaly detection with denoising diffusion probabilistic models using simplex noise. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 650–656 (2022)
  • [74] Xie, E., Yao, L., Shi, H., Liu, Z., Zhou, D., Liu, Z., Li, J., Li, Z.: Difffit: Unlocking transferability of large diffusion models via simple parameter-efficient fine-tuning. arXiv preprint arXiv:2304.06648 (2023)
  • [75] Yi, T., Fang, J., Wu, G., Xie, L., Zhang, X., Liu, W., Tian, Q., Wang, X.: Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud priors. arXiv preprint arXiv:2310.08529 (2023)
  • [76] Yu, J., Wang, Y., Zhao, C., Ghanem, B., Zhang, J.: Freedom: Training-free energy-guided conditional diffusion model. arXiv preprint arXiv:2303.09833 (2023)
  • [77] Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3836–3847 (2023)
  • [78] Zhang, Q., Chen, Y.: Fast sampling of diffusion models with exponential integrator. arXiv preprint arXiv:2204.13902 (2022)
  • [79] Zhang, S., Chen, Z., Zhao, Z., Chen, Z., Tang, Y., Chen, Y., Cao, W., Liang, J.: Hidiffusion: Unlocking high-resolution creativity and efficiency in low-resolution trained diffusion models. arXiv preprint arXiv:2311.17528 (2023)
  • [80] Zheng, Q., Guo, Y., Deng, J., Han, J., Li, Y., Xu, S., Xu, H.: Any-size-diffusion: Toward efficient text-driven synthesis for any-size hd images. arXiv preprint arXiv:2308.16582 (2023)

Supplementary material for DiffuseHigh

A Comparison with Stable Diffusion Upscaler

Figure 10: Comparison to SD + SR in the $1024^2$ image generation setting. (a) Results of SD + SR, (b) results of DiffuseHigh. We used SD 2.1 for both methods. Our method composes more reliable textures and high-frequency details.

Leveraging a Super-Resolution (SR) model is also an appealing approach for generating images with large resolution. However, utilizing SR models is often challenging due to the difficulty of collecting higher-resolution image datasets and the computational cost of training the separate super-resolution model on higher-resolution images.

To comprehensively evaluate our method, we compare DiffuseHigh against pretrained SR models, namely Stable Diffusion Latent Upscaler [66] and Stable Diffusion Upscaler [67], which can increase the resolution of a given image by up to $4\times$ and $16\times$, respectively. We conducted experiments with Stable Diffusion 2.1 [65] and generated $1024^2$ and $2048^2$ resolution images for each method.

| Method | SD 2.1 (1K) FID | SD 2.1 (1K) KID | SD 2.1 (2K) FID | SD 2.1 (2K) KID |
|---|---|---|---|---|
| SD + SR | 18.66 | 0.0070 | 30.05 | 0.0087 |
| DiffuseHigh | 18.72 | 0.0069 | 27.80 | 0.0069 |

Table 4: SD + SR experiment results.

As observed in Tab. 4, in the $4\times$ experiment, the SD + SR method showed a slightly lower FID score and a slightly higher KID score than DiffuseHigh, but the differences are negligible. In the $16\times$ setting, DiffuseHigh surpassed SD + SR on both FID and KID. Also, as mentioned in [19], SD + SR often fails to compose reliable textures and details, as shown in Fig. 10. These quantitative and qualitative results highlight the efficiency and efficacy of DiffuseHigh, which employs only the pretrained Stable Diffusion model.

B Additional qualitative image examples

Figure 11: Additional generated samples with SD 2.1.
Figure 12: Additional generated samples with SD 2.1.
Figure 13: Additional generated samples with SD 2.1.
Figure 14: Additional generated samples with SDXL.
Figure 15: Additional generated samples with SDXL.
Figure 16: Additional generated samples with SDXL.
Figure 17: Additional generated samples with SDXL.

C Additional qualitative video examples

Figure 18: Text prompt: ‘A dolphin jumping out of the water’
Figure 19: Text prompt: ‘A deer is walking in the forest’
Figure 20: Text prompt: ‘A flickering big campfire in the woods’