Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin C.K. Chan, Chen Change Loy (Corresponding author)
S-Lab, Nanyang Technological University, Singapore

Exploiting Diffusion Prior for Real-World Image Super-Resolution

Jianyi Wang    Zongsheng Yue    Shangchen Zhou    Kelvin C.K. Chan    Chen Change Loy
Abstract

We present a novel approach to leverage prior knowledge encapsulated in pre-trained text-to-image diffusion models for blind super-resolution (SR). Specifically, by employing our time-aware encoder, we can achieve promising restoration results without altering the pre-trained synthesis model, thereby preserving the generative prior and minimizing training cost. To remedy the loss of fidelity caused by the inherent stochasticity of diffusion models, we employ a controllable feature wrapping module that allows users to balance quality and fidelity by simply adjusting a scalar value during the inference process. Moreover, we develop a progressive aggregation sampling strategy to overcome the fixed-size constraints of pre-trained diffusion models, enabling adaptation to resolutions of any size. A comprehensive evaluation of our method using both synthetic and real-world benchmarks demonstrates its superiority over current state-of-the-art approaches. Code and models are available at https://github.com/IceClear/StableSR.

Keywords:
Super-resolution · image restoration · diffusion models · generative prior
Figure 1: Qualitative comparisons of BSRGAN (Zhang et al., 2021b), Real-ESRGAN+ (Wang et al., 2021c), FeMaSR (Chen et al., 2022), LDM (Rombach et al., 2022), and our StableSR on real-world examples. (Zoom in for details)

1 Introduction

We have seen significant advancements in diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Yang et al., 2021a; Nichol et al., 2022) for the task of image synthesis. Existing studies demonstrate that the diffusion prior, embedded in synthesis models like Stable Diffusion (Rombach et al., 2022), can be applied to various downstream content creation tasks, including image (Choi et al., 2021; Avrahami et al., 2022; Hertz et al., 2022; Gu et al., 2022; Mou et al., 2024; Zhang et al., 2023; Gal et al., 2023) and video (Wu et al., 2022; Molad et al., 2023; Qi et al., 2023) editing. In this study, we extend the exploration beyond the realm of content creation and examine the potential benefits of using diffusion prior for super-resolution (SR). This low-level vision task presents an additional non-trivial challenge, as it requires high image fidelity in its generated content, which stands in contrast to the stochastic nature of diffusion models.

A common solution to the challenge above involves training an SR model from scratch (Saharia et al., 2022b; Rombach et al., 2022; Sahak et al., 2023; Li et al., 2022). To preserve fidelity, these methods use the low-resolution (LR) image as an additional input to constrain the output space. While these methods have achieved notable success, they often demand significant computational resources to train the diffusion model. Moreover, training a network from scratch can potentially jeopardize the generative priors captured in synthesis models, leading to suboptimal performance in the final network. These limitations have inspired an alternative approach (Choi et al., 2021; Wang et al., 2022; Chung et al., 2022; Song et al., 2023a; Meng and Kabashima, 2022), which involves incorporating constraints into the reverse diffusion process of a pre-trained synthesis model. This paradigm avoids the need for model training while leveraging the diffusion prior. However, designing these constraints assumes that the image degradations are known a priori, whereas in practice they are typically unknown and complex. Consequently, such methods exhibit limited generalizability.

In this study, we present StableSR, an approach that preserves pre-trained diffusion priors without making explicit assumptions about the degradations. Specifically, unlike previous works (Saharia et al., 2022b; Rombach et al., 2022; Sahak et al., 2023; Li et al., 2022) that concatenate the LR image to intermediate outputs, which requires one to train a diffusion model from scratch, our method only needs to fine-tune a lightweight time-aware encoder and a few feature modulation layers for the SR task. When applying diffusion models for SR, the LR condition should provide adaptive guidance for each diffusion step during the restoration process, i.e., stronger guidance at earlier iterations to maintain fidelity and weaker guidance later to avoid introducing degradations. To this end, our encoder incorporates a time embedding layer to generate time-aware features, allowing the features in the diffusion model to be adaptively modulated at different iterations. Besides gaining improved training efficiency, keeping the original diffusion model frozen helps preserve the generative prior, which grants StableSR the capability of generating visually pleasant SR details and avoids overfitting to high-frequency degradations. Our experiments show that both the time-aware property of our encoder and the diffusion prior are crucial for achieving SR performance improvements.

To suppress the randomness inherited from the diffusion model as well as the information loss due to the encoding process of the autoencoder (Rombach et al., 2022), and inspired by CodeFormer (Zhou et al., 2022), we apply a controllable feature wrapping (CFW) module with an adjustable coefficient to refine the outputs of the diffusion model during the decoding process of the autoencoder. Unlike CodeFormer, the multiple-step sampling nature of diffusion models makes it hard to finetune the CFW module directly. We overcome this issue by first generating synthetic LR-HR pairs with the same degradation pipeline as the diffusion training stage. Then, we obtain the corresponding latent codes using our finetuned diffusion model given the LR images as conditions. In this way, CFW can be trained using the generated data.

Applying diffusion models to arbitrary resolutions has remained a persistent challenge, especially for the SR task. A simple solution would be to split the image into patches and process each independently. However, this method often leads to boundary discontinuity in the output. To address this issue, we introduce a progressive aggregation sampling strategy. Inspired by Jiménez (Jiménez, 2023), our approach involves dividing the image into overlapping patches and fusing these patches using a Gaussian kernel at each diffusion iteration. This process smooths out the boundaries, resulting in a more coherent output. To avoid altering the output resolution of SR images, the overlapping sizes at the right and bottom boundaries are dynamically adjusted to fit the target resolution.

Adapting generative priors for real-world image super-resolution presents an intriguing yet challenging problem, and in this work, we offer a novel approach as a solution. We introduce a fine-tuning method that leverages pre-trained diffusion models without making explicit assumptions about degradations. We address key challenges, such as fidelity and arbitrary resolution, by proposing simple yet effective modules. With our time-aware encoder, controllable feature wrapping module, and progressive aggregation sampling strategy, our StableSR serves as a strong baseline that inspires future research in adopting diffusion priors for restoration tasks.

2 Related Work

Image Super-Resolution. Image Super-Resolution (SR) aims to restore an HR image from its degraded LR observation. Early SR approaches (Dai et al., 2019; Dong et al., 2014, 2015, 2016; He et al., 2019; Xu et al., 2019; Zhang et al., 2018b; Chen et al., 2021; Liang et al., 2021; Wang et al., 2018b; Ledig et al., 2017; Sajjadi et al., 2017; Xu et al., 2017; Zhou et al., 2020) assume a pre-defined degradation process, e.g., bicubic downsampling and blurring with known parameters. While these methods can achieve appealing performance on the synthetic data with the same degradation, their performance deteriorates significantly in real-world scenarios due to the limited generalizability.

Recent works have moved their focus from synthetic settings to blind SR, where the degradation is unknown and similar to real-world scenarios. Due to the lack of real-world paired data for training, some methods (Fritsche et al., 2019; Maeda, 2020; Wan et al., 2020; Wang et al., 2021a; Wei et al., 2021; Zhang et al., 2021a) propose to implicitly learn a degradation model from LR images in an unsupervised manner, such as Cycle-GAN (Zhu et al., 2017) and contrastive learning (Oord et al., 2018). In addition to unsupervised learning, other approaches aim to explicitly synthesize LR-HR image pairs that resemble real-world data. Specifically, BSRGAN (Zhang et al., 2021b) and Real-ESRGAN (Wang et al., 2021c) present effective degradation pipelines for blind SR in the real world. Building upon such degradation pipelines, recent works based on diffusion models (Saharia et al., 2022b; Sahak et al., 2023) further show competitive performance on real-world image SR. In this work, we consider an orthogonal direction of fine-tuning a diffusion model for SR. In this way, the computational cost of network training can be reduced. Moreover, our method allows the exploitation of the generative prior encapsulated in the synthesis model, leading to better performance.

Prior for Image Super-Resolution. To further enhance performance in complex real-world SR scenarios, numerous prior-based approaches have been proposed. These techniques deploy additional image priors to bolster the generation of faithful textures. A straightforward method is reference-based SR (Zheng et al., 2018; Zhang et al., 2019; Yang et al., 2020; Jiang et al., 2021; Zhou et al., 2020). This involves using one or several reference high-resolution (HR) images, which share similar textures with the input low-resolution (LR) image, as an explicit prior to aid in generating the corresponding HR output. However, aligning features of the reference with the LR input can be challenging in real-world cases, and such explicit priors are not always readily available. Recent works have moved away from relying on explicit priors, finding more promising performance with implicit priors instead. Wang et al. (Wang et al., 2018a) were the first to propose the use of semantic segmentation probability maps for guiding SR in the feature space. Subsequent works (Menon et al., 2020; Gu et al., 2020; Wang et al., 2021b; Pan et al., 2021; Chan et al., 2021, 2022a; Yang et al., 2021b) employed pre-trained GANs by exploring the corresponding high-resolution latent space of the low-resolution input. While effective, the implicit priors used in these approaches are often tailored for specific scenarios, such as limited categories (Wang et al., 2018a; Gu et al., 2020; Pan et al., 2021; Chan et al., 2021) and faces (Menon et al., 2020; Wang et al., 2021b; Yang et al., 2021b), and therefore lack generalizability for complex real-world SR tasks. Other implicit priors for image SR include mixtures of degradation experts (Yu et al., 2018; Liang et al., 2022) and VQGAN (Zhao et al., 2022; Chen et al., 2022; Zhou et al., 2022). 
However, these methods fall short, either due to insufficient prior expressiveness (Yu et al., 2018; Zhao et al., 2022; Liang et al., 2022) or inaccurate feature matching (Chen et al., 2022), resulting in output quality that remains less than satisfactory.

In contrast to existing strategies, we set our sights on exploring the robust and extensive generative prior found in pre-trained diffusion models (Nichol et al., 2022; Rombach et al., 2022; Ramesh et al., 2021; Saharia et al., 2022a; Ramesh et al., 2022). While recent studies (Choi et al., 2021; Avrahami et al., 2022; Hu et al., 2022; Zhang et al., 2023; Mou et al., 2024) have highlighted the remarkable generative abilities of pre-trained diffusion models, the high-fidelity requirement inherent in super-resolution (SR) makes it unfeasible to directly adopt these methods for this task. Our proposed StableSR, unlike LDM (Rombach et al., 2022), does not necessitate training from scratch. Instead, it shares a similar idea to concurrent works (Zhang et al., 2023; Mou et al., 2024) by fine-tuning directly on a frozen pre-trained diffusion model with only a small number of trainable parameters, leading to superior performance in a more efficient way. In practice, our approach further shows comparable performance with follow-up works (Lin et al., 2023; Yu et al., 2024), which also exploit the diffusion prior but follow the ControlNet-like (Zhang et al., 2023) framework. We provide a detailed comparison with these works in a following section.


Figure 2: Framework of StableSR. We first finetune the time-aware encoder that is attached to a fixed pre-trained Stable Diffusion model. Features are combined with trainable spatial feature transform (SFT) layers. Such a simple yet effective design is capable of leveraging rich diffusion prior for image SR. Then, the diffusion model is fixed. Inspired by CodeFormer (Zhou et al., 2022), we introduce a controllable feature wrapping (CFW) module to obtain a tuned feature $\bm{F}_{m}$ in a residual manner, given the additional information $\bm{F}_{e}$ from LR features and features $\bm{F}_{d}$ from the fixed decoder. With an adjustable coefficient $w$, CFW can trade between quality and fidelity.

3 Methodology

Our method employs diffusion prior for SR. Inspired by the generative capabilities of Stable Diffusion (Rombach et al., 2022), we use it as the diffusion prior in our work, hence the name StableSR for our method. The main component of StableSR is a time-aware encoder, which is trained along with a frozen Stable Diffusion model to allow for conditioning based on the input image. To further facilitate a trade-off between realism and fidelity, depending on user preference, we follow CodeFormer (Zhou et al., 2022) to introduce an optional controllable feature wrapping module. The overall framework of StableSR is depicted in Fig. 2.

3.1 Guided Finetuning with Time Awareness

To exploit the prior knowledge of Stable Diffusion for SR, we establish the following constraints when designing our model: 1) The resulting model must have the ability to generate a plausible HR image, conditioned on the observed LR input. This is vital because the LR image is the only source of structural information, which is crucial for maintaining high fidelity. 2) The model should introduce only minimal alterations to the original Stable Diffusion model to prevent disrupting the prior encapsulated within it.

Feature Modulation. While several existing approaches (Nichol et al., 2022; Rombach et al., 2022; Hertz et al., 2022; Feng et al., 2023; Balaji et al., 2022) have successfully controlled the generated semantic structure of a diffusion model via cross-attention, such a strategy can hardly provide detailed and high-frequency guidance due to insufficient inductive bias (Liu et al., 2021). To more accurately guide the generation process, we adopt an additional encoder to extract multi-scale features $\{\bm{F}^{n}\}^{N}_{n=1}$ from the degraded LR image features, and use them to modulate the intermediate feature maps $\{\bm{F}^{n}_{\rm dif}\}^{N}_{n=1}$ of the residual blocks in Stable Diffusion via spatial feature transformations (SFT) (Wang et al., 2018a):

$$\hat{\bm{F}}^{n}_{\rm dif}=(1+\bm{\alpha}^{n})\odot\bm{F}^{n}_{\rm dif}+\bm{\beta}^{n};\quad \bm{\alpha}^{n},\bm{\beta}^{n}=\mathcal{M}^{n}_{\theta}(\bm{F}^{n}), \tag{1}$$

where $\bm{\alpha}^{n}$ and $\bm{\beta}^{n}$ denote the affine parameters in SFT and $\mathcal{M}^{n}_{\theta}$ denotes a small network consisting of several convolutional layers. Here $n$ indexes the spatial scale of the UNet (Ronneberger et al., 2015) architecture in Stable Diffusion.

During finetuning, we freeze the weights of Stable Diffusion and train only the encoder and SFT layers. This strategy allows us to insert structural information extracted from the LR image without destroying the generative prior captured by Stable Diffusion.
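As a minimal illustration of the modulation in Eq. (1), the sketch below applies the affine transform element-wise to a flattened toy feature vector. It is only a sketch under stated assumptions: in StableSR the parameters $\bm{\alpha}^{n}$ and $\bm{\beta}^{n}$ are predicted by the small network $\mathcal{M}^{n}_{\theta}$ from the encoder features, whereas here they are supplied by hand, and the function name `sft_modulate` is hypothetical.

```python
def sft_modulate(f_dif, alpha, beta):
    """Apply Eq. (1) element-wise: (1 + alpha) * f_dif + beta.

    f_dif: flattened features of one frozen residual block.
    alpha, beta: affine parameters (hand-written here; predicted by a
    small trainable network in the paper).
    """
    if not (len(f_dif) == len(alpha) == len(beta)):
        raise ValueError("features and affine parameters must share a shape")
    return [(1.0 + a) * f + b for f, a, b in zip(f_dif, alpha, beta)]


# Toy features standing in for one scale of the frozen UNet.
f_dif = [0.5, -1.0, 2.0]

# Zero affine parameters leave the frozen features untouched.
unchanged = sft_modulate(f_dif, [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])

# Non-zero parameters rescale and shift each position independently.
modulated = sft_modulate(f_dif, [1.0, 0.0, -0.5], [0.0, 0.25, 1.0])
```

The `(1 + alpha)` form means that vanishing affine parameters reduce the layer to an identity mapping, which is consistent with the goal of inserting LR structure without disturbing the frozen prior.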


Figure 3: In contrast to a conditional encoder without time embedding, the one equipped with time embedding can adaptively supply guidance to the pre-trained diffusion model. In (a), we measure the cosine similarity between the diffusion model’s features before and after SFT at various timesteps, which reflects the strength of the condition originating from the encoder. In (b), we further visualize the features of the conditional encoder extracted from the LR image. As shown, the encoder is inclined to provide sharp features when the SNR hovers around $5\text{e}^{-2}$. This is precisely when the diffusion model requires substantial guidance to generate the desired high-resolution image content. Interestingly, this observation aligns with the findings in (Choi et al., 2022).

Time-aware Guidance. We find that incorporating temporal information through a time-embedding layer in our encoder considerably enhances both the quality of generation and the fidelity to the ground truth, since it can adaptively adjust the condition strength derived from the LR features. Here, we analyze this phenomenon from a signal-to-noise ratio (SNR) standpoint and later quantitatively and qualitatively validate it in the ablation study.

During the generation process, the SNR of the produced image progressively increases as noise is incrementally removed. A recent study (Choi et al., 2022) indicates that image content is rapidly populated when the SNR approaches $5\text{e}^{-2}$. In line with this observation, we notice that the time embedding enables the conditional encoder to provide stronger guidance in the range where the SNR is around $5\text{e}^{-2}$. This is essential because the content generated at this stage significantly influences the super-resolution performance of our method. To further substantiate this, since the conditional features are inserted into the diffusion prior via SFT layers, we employ the cosine similarity between the features of Stable Diffusion before and after the SFT to measure the condition strength provided by the encoder. The cosine similarity values at different timesteps are plotted in Fig. 3-(a). As can be observed, the cosine similarity reaches its minimum value around an SNR of $5\text{e}^{-2}$, indicative of the strongest conditions imposed by the encoder. In addition, we also depict the feature maps extracted from our specially designed encoder in Fig. 3-(b). It is noticeable that the features around the SNR point of $5\text{e}^{-2}$ are sharper and contain more detailed image structures. We hypothesize that these adaptive feature conditions can furnish more comprehensive guidance for SR.
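The condition-strength measurement described above can be sketched as a plain cosine similarity between a feature vector before and after SFT. The snippet is illustrative only: the feature values are made up, and how the features are flattened or averaged across layers is not specified in the text, so a single flattened vector is assumed.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally sized feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Made-up features of a diffusion block before SFT.
before = [1.0, 2.0, 3.0]

# After a weak condition the features barely move (similarity near 1).
weak_after = [1.1, 2.0, 3.1]

# After a strong condition the features change direction (lower similarity).
strong_after = [3.0, -1.0, 0.5]

weak = cosine_similarity(before, weak_after)
strong = cosine_similarity(before, strong_after)
```

Under this reading, a lower similarity at a given timestep indicates that the encoder is altering the frozen features more strongly at that step.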

Color Correction. Diffusion models can occasionally exhibit color shifts, as noted in (Choi et al., 2022). To address this issue, we perform color normalization on the generated image to align its mean and variance with those of the LR input. In particular, if we let $\bm{x}$ denote the LR input and $\hat{\bm{y}}$ represent the generated HR image, the color-corrected output, $\bm{y}$, is calculated as follows:

$$\bm{y}^{c}=\frac{\hat{\bm{y}}^{c}-\bm{\mu}_{\hat{\bm{y}}}^{c}}{\bm{\sigma}_{\hat{\bm{y}}}^{c}}\cdot\bm{\sigma}_{x}^{c}+\bm{\mu}_{x}^{c}, \tag{2}$$

where $c\in\{r,g,b\}$ denotes the color channel, $\bm{\mu}^{c}_{\hat{\bm{y}}}$ and $\bm{\sigma}^{c}_{\hat{\bm{y}}}$ (or $\bm{\mu}^{c}_{x}$ and $\bm{\sigma}^{c}_{x}$) are the mean and standard deviation estimated from the $c$-th channel of $\hat{\bm{y}}$ (or $\bm{x}$), respectively. We find that this simple correction suffices to remedy the color difference.
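A minimal sketch of the channel-wise correction in Eq. (2) follows, with each channel stored as a flat list of pixel values. The small constant `eps` guarding against division by zero and the use of the population standard deviation are assumptions of this sketch, not details given in the text; the function names are likewise hypothetical.

```python
from statistics import fmean, pstdev


def match_channel_stats(y_hat_c, x_c, eps=1e-8):
    """Eq. (2) for one channel: normalize y_hat_c, then re-color with x_c."""
    mu_y, sigma_y = fmean(y_hat_c), pstdev(y_hat_c)
    mu_x, sigma_x = fmean(x_c), pstdev(x_c)
    scale = sigma_x / (sigma_y + eps)  # eps is an added safeguard (assumption)
    return [(v - mu_y) * scale + mu_x for v in y_hat_c]


def color_correct(y_hat, x):
    """Apply the correction independently to the r, g and b channels."""
    return {c: match_channel_stats(y_hat[c], x[c]) for c in ("r", "g", "b")}


# Toy generated HR image (4 pixels) and LR input (2 pixels) per channel.
# Only global statistics are used, so the two sizes may differ.
y_hat = {"r": [0.2, 0.4, 0.6, 0.8], "g": [0.1, 0.1, 0.3, 0.3], "b": [0.9, 0.7, 0.5, 0.3]}
x = {"r": [0.1, 0.3], "g": [0.4, 0.6], "b": [0.2, 0.8]}

y = color_correct(y_hat, x)
```

After correction, each channel of `y` carries the mean and standard deviation of the corresponding LR channel while keeping the spatial pattern of the generated image.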

Though pixel color correction via channel matching can improve color fidelity, we notice that it may suffer from limited color correction ability due to the lack of pixel-wise controllability. The main reason is that it only introduces global statistics, i.e., channel-wise mean and variance of the input for color correction, ignoring pixel-wise semantics. Besides adopting color correction in the pixel domain, we further propose wavelet-based color correction for better visual performance in some cases. Wavelet color correction directly introduces the low-frequency part from the input since the color information belongs to the low-frequency components, while the degradations are mostly high-frequency components. In this way, we can improve the color fidelity of the results without perceptibly affecting the generated quality. Given any image $\bm{I}$, we extract its high-frequency component $\bm{H}^{i}$ and low-frequency component $\bm{L}^{i}$ at the $i$-th ($1\leq i\leq l$) scale via the wavelet decomposition, i.e.,

$$\bm{L}^{i}=\mathcal{C}_{i}(\bm{L}^{i-1},\bm{k}),\quad \bm{H}^{i}=\bm{L}^{i-1}-\bm{L}^{i}, \tag{3}$$

where $\bm{L}^{0}=\bm{I}$, $\mathcal{C}_{i}$ denotes the convolutional operator with a dilation of $2^{i}$, and $\bm{k}$ is the convolutional kernel defined as:

$$\bm{k}=\begin{bmatrix}\nicefrac{1}{16}&\nicefrac{1}{8}&\nicefrac{1}{16}\\ \nicefrac{1}{8}&\nicefrac{1}{4}&\nicefrac{1}{8}\\ \nicefrac{1}{16}&\nicefrac{1}{8}&\nicefrac{1}{16}\end{bmatrix}. \tag{4}$$

By denoting the $l$-th low-frequency and high-frequency components of $\bm{x}$ (or $\hat{\bm{y}}$) as $\bm{L}^{l}_{x}$ and $\bm{H}^{l}_{x}$ (or $\bm{L}^{l}_{y}$ and $\bm{H}^{l}_{y}$), the desired HR output $\bm{y}$ is formulated as follows:

$$\bm{y}=\bm{H}^{l}_{y}+\bm{L}^{l}_{x}. \tag{5}$$

Intuitively, we replace the low-frequency component $\bm{L}^{l}_{y}$ of $\hat{\bm{y}}$ with $\bm{L}^{l}_{x}$ to correct the color bias. By default, we adopt color correction in the pixel domain for simplicity.
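The wavelet-based correction of Eqs. (3)-(5) can be sketched for a single-channel image stored as nested lists. Several points are assumptions of this sketch rather than statements from the text: borders are handled by replicate padding, the number of scales defaults to `levels=2`, the LR input is taken to be already resized to the resolution of the generated image, and the high-frequency term in Eq. (5) is read as everything that remains after removing $\bm{L}^{l}_{y}$ from $\hat{\bm{y}}$ (i.e., the sum of the residuals $\bm{H}^{i}$ for $i=1,\dots,l$), in line with the "replace the low-frequency component" description.

```python
# Kernel k from Eq. (4); its entries sum to 1.
KERNEL = (
    (1 / 16, 1 / 8, 1 / 16),
    (1 / 8, 1 / 4, 1 / 8),
    (1 / 16, 1 / 8, 1 / 16),
)


def dilated_blur(img, dilation):
    """3x3 convolution with KERNEL at the given dilation.

    Out-of-range taps are clamped to the border (replicate padding,
    an assumption of this sketch).
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in (-1, 0, 1):
                rr = min(max(r + dr * dilation, 0), h - 1)
                for dc in (-1, 0, 1):
                    cc = min(max(c + dc * dilation, 0), w - 1)
                    acc += KERNEL[dr + 1][dc + 1] * img[rr][cc]
            out[r][c] = acc
    return out


def low_frequency(img, levels):
    """Iterate Eq. (3) for i = 1..levels and return L^l."""
    low = img
    for i in range(1, levels + 1):
        low = dilated_blur(low, 2 ** i)
    return low


def wavelet_color_fix(y_hat, x, levels=2):
    """Keep the high frequencies of y_hat, take the low frequencies of x."""
    low_y = low_frequency(y_hat, levels)
    low_x = low_frequency(x, levels)
    return [
        [yv - ly + lx for yv, ly, lx in zip(row_y, row_ly, row_lx)]
        for row_y, row_ly, row_lx in zip(y_hat, low_y, low_x)
    ]


# Toy 6x6 generated image with a brightness ramp, and a flat gray input.
y_hat = [[(r + c) / 10.0 for c in range(6)] for r in range(6)]
x = [[0.5] * 6 for _ in range(6)]
y = wavelet_color_fix(y_hat, x)
```

Because the kernel sums to one, a flat image is unchanged by the blur, so a uniform input contributes a uniform low-frequency base onto which the generated detail is added.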

3.2 Fidelity-Realism Trade-off

Although the output of the proposed approach is visually compelling, it often deviates from the ground truth due to the inherent stochasticity of the diffusion model. Drawing inspiration from CodeFormer (Zhou et al., 2022), we introduce a Controllable Feature Wrapping (CFW) module to flexibly manage the balance between realism and fidelity. Unlike CodeFormer, there are multiple sampling steps for generating a sample during inference and we cannot finetune the CFW module directly. To overcome this problem, we first generate synthetic LR-HR pairs following the same degradation pipeline as the diffusion training stage. Then, the latent codes $\bm{Z}_{0}$ can be obtained using our finetuned diffusion model given the LR images as conditions. Finally, CFW can be trained using the generated data.

Since Stable Diffusion is implemented in the latent space of an autoencoder, it is natural to leverage the encoder features of the autoencoder to modulate the corresponding decoder features for further fidelity improvement. Let $\bm{F}_{e}$ and $\bm{F}_{d}$ be the encoder and decoder features, respectively. We introduce an adjustable coefficient $w\in[0,1]$ to control the extent of modulation:

$$\bm{F}_{m}=\bm{F}_{d}+\mathcal{C}(\bm{F}_{e},\bm{F}_{d};\bm{\theta})\times w, \tag{6}$$

where $\mathcal{C}(\cdot;\bm{\theta})$ represents convolution layers with trainable parameter $\bm{\theta}$. The overall framework is shown in Fig. 2.

In this design, a small $w$ exploits the generation capability of Stable Diffusion, leading to outputs with high realism under severe degradations. In contrast, a large $w$ allows stronger structural guidance from the LR image, enhancing fidelity. We observe that $w=0.5$ achieves a good balance between quality and fidelity. Note that we only train CFW in this particular stage. In practice, we notice that CFW involves additional GPU memory and the improvement can be subtle in some cases. Thus, we make it optional for different real-world applications.
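The role of the coefficient $w$ in Eq. (6) can be illustrated with a toy residual fusion. The sketch is not the trained module: `toy_residual` is a hypothetical stand-in for the learned convolution layers $\mathcal{C}(\cdot;\bm{\theta})$, chosen only so that the effect of $w$ is easy to read off, and features are flattened to plain lists.

```python
def cfw_fuse(f_d, f_e, w, residual_fn):
    """Eq. (6): F_m = F_d + residual_fn(F_e, F_d) * w, with w in [0, 1]."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    residual = residual_fn(f_e, f_d)
    return [d + w * r for d, r in zip(f_d, residual)]


def toy_residual(f_e, f_d):
    """Hypothetical stand-in for the trainable convolutions C(.; theta).

    It simply returns the gap between encoder and decoder features,
    so w interpolates between the two; the real module is learned.
    """
    return [e - d for e, d in zip(f_e, f_d)]


f_d = [1.0, 2.0]  # decoder features (driven by the diffusion output)
f_e = [3.0, 0.0]  # encoder features (driven by the LR input)

generative = cfw_fuse(f_d, f_e, 0.0, toy_residual)  # decoder features only
balanced = cfw_fuse(f_d, f_e, 0.5, toy_residual)    # halfway, the suggested default
faithful = cfw_fuse(f_d, f_e, 1.0, toy_residual)    # full residual applied
```

With $w=0$ the decoder features pass through unchanged, which matches the description of CFW as an optional refinement on top of the diffusion output.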


Figure 4: When dealing with images beyond $512\times 512$, StableSR (w/o aggregation sampling) suffers from obvious block inconsistency by chopping the image into several tiles, processing them separately, and stitching them together. With our proposed aggregation sampling, StableSR can achieve consistent results on large images. The resolution of the shown figure is $1024\times 1024$.

3.3 Aggregation Sampling

Due to the heightened sensitivity of the attention layers in Stable Diffusion with respect to the image resolution, it tends to produce inferior outputs for resolutions differing from its training settings, specifically $512\times 512$. This, in effect, constrains the practicality of StableSR.

A common workaround involves splitting the larger image into several overlapping smaller patches and processing each individually. While this strategy often yields good results for conventional CNN-based SR methods, it is not directly applicable to the diffusion paradigm. This is because discrepancies between patches are compounded and magnified over the course of diffusion iterations. A typical failure case is illustrated in Fig. 4.

Inspired by Jiménez (Jiménez, 2023), we apply a progressive patch aggregation sampling algorithm to handle images of arbitrary resolutions. Specifically, we begin by encoding the LR image into a latent feature map $\bm{F}\in\mathcal{R}^{h\times w}$, which is then subdivided into $M$ overlapping small patches $\{\bm{F}_{\Omega_n}\}_{n=1}^{M}$, each with a resolution of $64\times 64$, matching the training resolution (the downsampling scale factor of the autoencoder in Stable Diffusion is $8\times$). Here, $\Omega_n$ is the coordinate set of the $n$th patch in $\bm{F}$. During each timestep in the reverse sampling, each patch is individually processed through StableSR, and the processed patches are subsequently aggregated. To integrate overlapping patches, a weight map $\bm{w}_{\Omega_n}\in\mathcal{R}^{h\times w}$, whose entries follow a Gaussian filter in $\Omega_n$ and are 0 elsewhere, is generated for each patch $\bm{F}_{\Omega_n}$. Overlapping pixels are then weighted in accordance with their respective Gaussian weight maps. In particular, we follow Jiménez (Jiménez, 2023) to define a padding function $f(\cdot)$ that expands any patch of size $64\times 64$ to the resolution of $h\times w$ by filling zeros outside the region $\Omega_n$. This procedure is repeated until reaching the final iteration.

Given the output of each patch as $\epsilon_{\bm{\theta}}(\bm{Z}^{(t)}_{\Omega_n},\bm{F}_{\Omega_n},t)$, where $\bm{Z}^{(t)}_{\Omega_n}$ is the $n$th patch of the noisy input $\bm{Z}^{(t)}$ and $\bm{\theta}$ denotes the parameters of the diffusion model, the results of all the patches aggregated together can be formulated as follows:

$$\epsilon_{\bm{\theta}}(\bm{Z}^{(t)},\bm{F},t)=\sum_{n=1}^{M}\frac{\bm{w}_{\Omega_n}}{\hat{\bm{w}}}\odot f\left(\epsilon_{\bm{\theta}}\left(\bm{Z}^{(t)}_{\Omega_n},\bm{F}_{\Omega_n},t\right)\right), \tag{7}$$

where $\hat{\bm{w}}=\sum_{n}\bm{w}_{\Omega_n}$. Based on $\epsilon_{\bm{\theta}}(\bm{Z}^{(t)},\bm{F},t)$, we can obtain $\bm{Z}^{(t-1)}$ according to the sampling procedure in the diffusion model, denoted as ${\rm Sampler}(\bm{Z}^{(t)},\epsilon_{\bm{\theta}}(\bm{Z}^{(t)},\bm{F},t))$. Subsequently, we re-split $\bm{Z}^{(t-1)}$ into overlapped patches and repeat the above steps until $t=1$. The whole process is summed up in Algorithm 1. Our experiments suggest that this progressive aggregation method substantially mitigates discrepancies in the overlapped regions, as depicted in Fig. 4. More details can be found in the supplementary material.
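The aggregation in Eq. (7) can be sketched for a single timestep as follows. This is a simplified single-channel NumPy illustration under stated assumptions: the Gaussian width `sigma`, the helper names, and the two-patch layout are hypothetical choices, and adding into a slice of a zero-initialized array plays the role of the zero-padding function $f(\cdot)$.

```python
import numpy as np

def gaussian_weight(patch=64, sigma=16.0):
    """2D Gaussian weight for one patch (sigma is an assumed value)."""
    ax = np.arange(patch) - (patch - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    return np.outer(g, g)

def aggregate(patch_outputs, corners, full_shape, w_patch):
    """Eq. (7): sum_n (w_n / w_hat) * f(eps_n) on an h x w latent map."""
    numerator = np.zeros(full_shape)
    w_hat = np.zeros(full_shape)   # w_hat = sum_n w_n
    for eps_n, (top, left) in zip(patch_outputs, corners):
        ph, pw = eps_n.shape
        region = (slice(top, top + ph), slice(left, left + pw))
        # Writing into the slice is equivalent to zero-padding outside Omega_n.
        numerator[region] += w_patch * eps_n
        w_hat[region] += w_patch
    return numerator / w_hat       # assumes every position is covered

w_patch = gaussian_weight()
corners = [(0, 0), (0, 32)]        # two 64x64 patches overlapping by 32 columns
outputs = [np.full((64, 64), 1.0), np.full((64, 64), 3.0)]
merged = aggregate(outputs, corners, (64, 96), w_patch)
```

In regions covered by a single patch the normalized weight is exactly one, so the patch output passes through unchanged; in the overlap, the result is a smooth convex combination of the two patch outputs, which is what suppresses visible seams.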

Algorithm 1 Progressive Patch Aggregation
Require: Cropped regions $\{\Omega_n\}_{n=1}^{M}$, diffusion steps $T$, LR latent features $\bm{F}$.
Initialize $\bm{w}_{\Omega_n}$ and $\hat{\bm{w}}$
$\bm{Z}^{(T)}\sim\mathcal{N}(0,\mathbb{I})$
for $t\in[T,\ldots,1]$ do
    for $n\in[1,\ldots,M]$ do
        Compute $\epsilon_{\bm{\theta}}\left(\bm{Z}^{(t)}_{\Omega_n},\bm{F}_{\Omega_n},t\right)$
    end for
    Compute $\epsilon_{\bm{\theta}}(\bm{Z}^{(t)},\bm{F},t)$ following Eq. (7)
    $\bm{Z}^{(t-1)}={\rm Sampler}(\bm{Z}^{(t)},\epsilon_{\bm{\theta}}(\bm{Z}^{(t)},\bm{F},t))$
end for
return $\bm{Z}_0$
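The control flow of Algorithm 1 can be sketched in NumPy as below. This is a toy illustration, not the released implementation: `toy_eps` and `toy_sampler` are hypothetical placeholders for the conditioned denoiser and the diffusion sampler, the stride and Gaussian width are assumed values, and a single-channel latent is used for brevity.

```python
import numpy as np

def split_regions(h, w, patch=64, stride=32):
    """Top-left corners of overlapping patches covering an h x w map
    (assumes h >= patch and w >= patch)."""
    def starts(length):
        last = length - patch
        s = list(range(0, last + 1, stride))
        if s[-1] != last:
            s.append(last)  # extra patch so the border is always covered
        return s
    return [(t, l) for t in starts(h) for l in starts(w)]

def progressive_aggregation(f_lr, eps_fn, sampler_fn, num_steps,
                            patch=64, stride=32, seed=0):
    h, w = f_lr.shape
    regions = split_regions(h, w, patch, stride)
    # Gaussian weight per patch and its accumulated normalizer w_hat.
    ax = np.arange(patch) - (patch - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * (patch / 4.0) ** 2))
    w_patch = np.outer(g, g)
    w_hat = np.zeros((h, w))
    for top, left in regions:
        w_hat[top:top + patch, left:left + patch] += w_patch
    z = np.random.default_rng(seed).standard_normal((h, w))  # Z^(T)
    for t in range(num_steps, 0, -1):                        # t = T, ..., 1
        eps = np.zeros((h, w))
        for top, left in regions:
            sl = (slice(top, top + patch), slice(left, left + patch))
            eps[sl] += w_patch * eps_fn(z[sl], f_lr[sl], t)  # per-patch output
        eps /= w_hat                                         # Eq. (7)
        z = sampler_fn(z, eps, t)                            # Z^(t-1)
    return z

def toy_eps(z_patch, f_patch, t):
    # Placeholder "noise prediction": distance from the condition.
    return z_patch - f_patch

def toy_sampler(z, eps, t):
    # Placeholder update that removes a 1/t fraction of the predicted noise.
    return z - eps / t

f_lr = np.random.default_rng(1).standard_normal((64, 100))
restored = progressive_aggregation(f_lr, toy_eps, toy_sampler, num_steps=5)
```

With these placeholders the final step ($t=1$) removes the full predicted residual, so `restored` coincides with `f_lr`; the point of the toy is only to show that patches are re-split from the aggregated latent at every timestep rather than being denoised independently to the end.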

4 Experiments

4.1 Implementation Details

StableSR is built on Stable Diffusion 2.1-base (https://huggingface.co/stabilityai/stable-diffusion-2-1-base). Our time-aware encoder is similar to the contracting path of the denoising U-Net in Stable Diffusion but is much more lightweight ($\sim$105M, including SFT layers). SFT layers are inserted in each residual block of Stable Diffusion for effective control. We finetune the diffusion model of StableSR for $117$ epochs with a batch size of $192$, and the prompt is fixed as null. We follow Stable Diffusion to use the Adam (Kingma and Ba, 2014) optimizer, and the learning rate is set to $5\times 10^{-5}$. The training process is conducted at $512\times 512$ resolution with 8 NVIDIA Tesla 32G-V100 GPUs. For inference, we adopt DDPM sampling (Ho et al., 2020) with 200 timesteps. To handle images with arbitrary sizes, we adopt the proposed aggregation sampling strategy for images beyond $512\times 512$. As for images under $512\times 512$, we first enlarge the LR images such that the shorter side has a length of $512$ and rescale the results back to the target resolutions after generation.
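The size handling for small inputs can be illustrated with a short helper. This is a sketch under assumptions: the function name is hypothetical, the aspect ratio is assumed to be preserved, and the interpolation method used for the actual resizing is not specified here.

```python
def upscale_to_min_side(width, height, min_side=512):
    """Return the (width, height) at which an image would be processed so
    that its shorter side is at least `min_side`, keeping the aspect ratio."""
    shorter = min(width, height)
    if shorter >= min_side:
        return width, height  # large enough; no enlargement needed
    # Scale both sides by min_side / shorter; the shorter side becomes min_side.
    return (round(width * min_side / shorter),
            round(height * min_side / shorter))

small = upscale_to_min_side(128, 128)     # square LR input
wide = upscale_to_min_side(300, 200)      # shorter side is the height
large = upscale_to_min_side(1024, 768)    # already beyond 512 on both sides
```

The generated result would then be rescaled back to the requested output resolution, as described above.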

To train CFW, we first generate 100k synthetic LR-HR pairs with $512\times 512$ resolution following the degradation pipeline in Real-ESRGAN (Wang et al., 2021c). Then, we adopt the finetuned diffusion model to generate the corresponding latent codes $\bm{Z}_0$ given the above LR images as conditions. The training losses are almost the same as those of the autoencoder used in LDM (Rombach et al., 2022), except that we use a fixed adversarial loss weight of $0.025$ rather than a self-adjustable one.

Table 1: Quantitative comparison with state-of-the-art methods on both synthetic and real-world benchmarks. Red and blue colors represent the best and second best performance, respectively.

| Datasets | Metrics | RealSR | BSRGAN | DASR | Real-ESRGAN+ | FeMaSR | LDM | SwinIR-GAN | IF_III | StableSR |
|---|---|---|---|---|---|---|---|---|---|---|
| DIV2K Valid | PSNR ↑ | 24.62 | 24.58 | 24.47 | 24.29 | 23.06 | 23.32 | 23.93 | 23.36 | 23.26 |
| | SSIM ↑ | 0.5970 | 0.6269 | 0.6304 | 0.6372 | 0.5887 | 0.5762 | 0.6285 | 0.5636 | 0.5726 |
| | LPIPS ↓ | 0.5276 | 0.3351 | 0.3543 | 0.3112 | 0.3126 | 0.3199 | 0.3160 | 0.4641 | 0.3114 |
| | FID ↓ | 49.49 | 44.22 | 49.16 | 37.64 | 35.87 | 26.47 | 36.34 | 37.54 | 24.44 |
| | CLIP-IQA ↑ | 0.3534 | 0.5246 | 0.5036 | 0.5276 | 0.5998 | 0.6245 | 0.5338 | 0.3980 | 0.6771 |
| | MUSIQ ↑ | 28.57 | 61.19 | 55.19 | 61.05 | 60.83 | 62.27 | 60.22 | 43.71 | 65.92 |
| RealSR | PSNR ↑ | 27.30 | 26.38 | 27.02 | 25.69 | 25.06 | 25.46 | 26.31 | 25.47 | 24.65 |
| | SSIM ↑ | 0.7579 | 0.7651 | 0.7707 | 0.7614 | 0.7356 | 0.7145 | 0.7729 | 0.7067 | 0.7080 |
| | LPIPS ↓ | 0.3570 | 0.2656 | 0.3134 | 0.2709 | 0.2937 | 0.3159 | 0.2539 | 0.3462 | 0.3002 |
| | CLIP-IQA ↑ | 0.3687 | 0.5114 | 0.3198 | 0.4495 | 0.5406 | 0.5688 | 0.4360 | 0.3482 | 0.6234 |
| | MUSIQ ↑ | 38.26 | 63.28 | 41.21 | 60.36 | 59.06 | 58.90 | 58.70 | 41.71 | 65.88 |
| DRealSR | PSNR ↑ | 30.19 | 28.70 | 29.75 | 28.62 | 26.87 | 27.88 | 28.50 | 28.66 | 28.03 |
| | SSIM ↑ | 0.8148 | 0.8028 | 0.8262 | 0.8052 | 0.7569 | 0.7448 | 0.8043 | 0.7860 | 0.7536 |
| | LPIPS ↓ | 0.3938 | 0.2858 | 0.3099 | 0.2818 | 0.3157 | 0.3379 | 0.2743 | 0.3853 | 0.3284 |
| | CLIP-IQA ↑ | 0.3744 | 0.5091 | 0.3813 | 0.4515 | 0.5634 | 0.5756 | 0.4447 | 0.2925 | 0.6357 |
| | MUSIQ ↑ | 26.93 | 57.16 | 42.41 | 54.26 | 53.71 | 53.72 | 52.74 | 30.71 | 58.51 |
| DPED-iphone | CLIP-IQA ↑ | 0.4496 | 0.4021 | 0.2826 | 0.3389 | 0.5306 | 0.4482 | 0.3373 | 0.2962 | 0.4799 |
| | MUSIQ ↑ | 45.60 | 45.89 | 32.68 | 42.42 | 49.95 | 44.23 | 43.30 | 37.49 | 50.48 |

4.2 Experimental Settings

Training Datasets. We adopt the degradation pipeline of Real-ESRGAN (Wang et al., 2021c) to synthesize LR/HR pairs on DIV2K (Agustsson and Timofte, 2017), DIV8K (Gu et al., 2019), Flickr2K (Timofte et al., 2017) and OutdoorSceneTraining (Wang et al., 2018a) datasets. We additionally add 5000 face images from the FFHQ dataset (Karras et al., 2019) for general cases.

Testing Datasets. We evaluate our approach on both synthetic and real-world datasets. For synthetic data, we follow the degradation pipeline of Real-ESRGAN (Wang et al., 2021c) and generate 3k LR-HR pairs from the DIV2K validation set (Agustsson and Timofte, 2017). The resolution of LR is $128\times 128$ and that of the corresponding HR is $512\times 512$. Note that for StableSR, the inputs are first upsampled to the same size as the outputs before inference. For real-world datasets, we follow common settings to conduct comparisons on RealSR (Cai et al., 2019), DRealSR (Wei et al., 2020) and DPED-iPhone (Ignatov et al., 2017). We further collect 40 images from the Internet for comparison.


Figure 5: Qualitative comparisons on several representative real-world samples ($128\rightarrow 512$). Our StableSR is capable of removing artifacts and generating realistic details. (Zoom in for details)

Compared Methods. To verify the effectiveness of our approach, we compare our StableSR with several state-of-the-art methods (SR3 (Saharia et al., 2022b) is not included since its official code is unavailable), i.e., RealSR (Ji et al., 2020; we use the latest official model, DF2K-JPEG), BSRGAN (Zhang et al., 2021b), Real-ESRGAN+ (Wang et al., 2021c), DASR (Liang et al., 2022), FeMaSR (Chen et al., 2022), latent diffusion model (LDM) (Rombach et al., 2022), SwinIR-GAN (Liang et al., 2021; we use the latest official SwinIR-GAN model, i.e., 003_realSR_BSRGAN_DFOWMFC_s64w8_SwinIR-L_x4_GAN.pth), and DeepFloyd IF_III (Deep-floyd, 2023). Since LDM is officially trained on images with $256\times 256$ resolution, we finetune it following the same training settings as StableSR for a fair comparison. For other methods, we directly use the official code and models for testing. Note that the results in this section are obtained at the same resolution as training, i.e., $128\times 128$ LR inputs that are super-resolved to $512\times 512$. Specifically, for images from (Cai et al., 2019; Wei et al., 2020; Ignatov et al., 2017), we crop them at the center to obtain patches with $128\times 128$ resolution. For other real-world images, we first resize them such that the shorter sides are $128$ and then apply center cropping. As for other resolutions, one example of StableSR on real-world images at $1024\times 1024$ resolution is shown in Fig. 4. More results are provided in the supplementary material.

Evaluation Metrics. For benchmarks with paired data, i.e., DIV2K Valid, RealSR and DRealSR, we employ various perceptual metrics including LPIPS (Zhang et al., 2018a; we use LPIPS-ALEX by default), FID (Heusel et al., 2017), CLIP-IQA (Wang et al., 2023) and MUSIQ (Ke et al., 2021) to evaluate the perceptual quality of generated images. PSNR and SSIM scores (evaluated on the luminance channel in YCbCr color space) are also reported for reference. Since ground-truth images are unavailable in DPED-iPhone (Ignatov et al., 2017), we follow existing methods (Wang et al., 2021c; Chen et al., 2022) to report results on no-reference metrics, i.e., CLIP-IQA and MUSIQ, for perceptual quality evaluation. Besides, we further conduct a user study on $40$ real-world images to verify the effectiveness of our approach against existing methods.
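For reference, PSNR on the luminance channel can be sketched as follows. The exact conversion is not specified in this section, so the snippet assumes the ITU-R BT.601 "studio swing" coefficients (the convention of MATLAB's `rgb2ycbcr`), which is a common choice in SR evaluation; function names are illustrative.

```python
import numpy as np

def rgb_to_y(img):
    """Luminance (Y of YCbCr, BT.601) of an RGB image.
    img: float array in [0, 1] with shape (H, W, 3); returns Y in [16, 235]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + 65.481 * r + 128.553 * g + 24.966 * b

def psnr_y(pred, target):
    """PSNR in dB computed on the Y channel, with a peak value of 255."""
    mse = np.mean((rgb_to_y(pred) - rgb_to_y(target)) ** 2)
    if mse == 0:
        return float("inf")  # identical luminance
    return 10.0 * np.log10(255.0 ** 2 / mse)

white = np.ones((4, 4, 3))
black = np.zeros((4, 4, 3))
score = psnr_y(white, black)  # worst case between the two extremes of Y
```

Evaluating on Y only ignores chrominance errors, which is one reason PSNR and SSIM are reported here for reference rather than as the primary indicators of perceptual quality.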

4.3 Comparison with Existing Methods

Quantitative Comparisons. We first show the quantitative comparison on the synthetic DIV2K validation set and three real-world benchmarks. As shown in Table 1, our approach outperforms state-of-the-art SR methods in terms of multiple perceptual metrics, including FID, CLIP-IQA and MUSIQ. Specifically, on the synthetic benchmark DIV2K Valid, our StableSR ($w=0.5$) achieves a $24.44$ FID score, which is $7.7\%$ lower than LDM and at least $32.9\%$ lower than other GAN-based methods. Besides, our StableSR ($w=0.5$) achieves the highest CLIP-IQA scores on the two commonly used real-world benchmarks (Cai et al., 2019; Wei et al., 2020), suggesting the superiority of StableSR. While we notice that StableSR achieves inferior performance on metrics including PSNR, SSIM and LPIPS compared with non-diffusion methods, these metrics only reflect certain aspects of performance (Ledig et al., 2017; Wang et al., 2018b; Blau and Michaeli, 2018). Besides, the previous non-diffusion methods tend to directly use $\ell_2$ losses and perceptual loss between the predictions and the corresponding ground truths for training, which are closely related to the calculation of PSNR and LPIPS, respectively. Different from previous methods, diffusion models (Ho et al., 2020; Rombach et al., 2022) only apply an $\ell_2$ loss between the predicted and the ground-truth noise. We conjecture this is an important factor that makes diffusion models less competitive on these metrics, as observed by the recent work (Yue and Loy, 2022). Moreover, previous methods usually fail to restore faithful textures and generate blurry results, as shown in Fig. 5. In contrast, our StableSR is capable of generating sharp images with realistic details.

Qualitative Comparisons. To demonstrate the effectiveness of our method, we present visual results on real-world images from both real-world benchmarks (Cai et al., 2019; Wei et al., 2020) and the Internet in Fig. 5 and Fig. 6. It is observed that StableSR outperforms previous methods in both artifact removal and detail generation. Specifically, StableSR is able to generate faithful details, as shown in the first row of Fig. 5, while other methods either show blurry results (DASR, BSRGAN, Real-ESRGAN+, LDM) or unnatural details (RealSR, FeMaSR). Moreover, as shown in the fourth row of Fig. 5, StableSR generates sharp edges without obvious degradations, whereas other state-of-the-art methods generate blurry results. Figure 6 further demonstrates the superiority of StableSR on images beyond $512\times 512$.


Figure 6: Qualitative comparisons on real-world images with diverse resolutions beyond $512\times 512$. Our StableSR still outperforms other methods with more vivid details and less annoying artifacts. (Zoom in for details)

User Study. To further examine the effectiveness of StableSR, we conduct a user study on 40 real-world LR images collected from the Internet. To alleviate potential bias, the collected real-world images contain diverse content, e.g., natural images with and without objects, and photos with texts and faces. The order of the images as well as the options are also randomly shuffled. We further provide the link to our user study (https://forms.gle/gsLyVr6pSkAEbW8J9) for reference. We compare our approach with three commonly used SR methods with competitive performance, i.e., Real-ESRGAN+, SwinIR-GAN and LDM. Given an LR image as reference, the subject is asked to choose the best HR image generated by the four methods, i.e., StableSR, Real-ESRGAN+, SwinIR-GAN and LDM. Each of the 35 subjects evaluates all 40 LR images, resulting in $40\times 35=1400$ votes in total. As depicted in Fig. 7, by gaining over 80% of the votes, StableSR shows its potential capability for real-world SR applications. However, we also notice that StableSR may struggle in dealing with small texts, faces and patterns, indicating there is still room for improvement.


Figure 7: User study on 40 real-world images evaluated by 35 subjects. Given one LR image, the subjects are asked to choose the best HR image generated from the methods including StableSR, LDM, Real-ESRGAN+ and SwinIR w/ GAN. The large number of votes gained by StableSR indicates its potential capability for real-world SR applications.

Comparison with Concurrent Diffusion Applications. We notice that recent concurrent works (Zhang et al., 2023; Deep-floyd, 2023) can also be adopted for image SR. While the IF_III upscaler (Deep-floyd, 2023) is a super-resolution model trained from scratch, ControlNet-tile (Zhang et al., 2023) also adopts a diffusion prior. The key technical differences regarding the use of diffusion prior between our StableSR and ControlNet-tile lie in the different adaptor designs, i.e., ControlNet-tile adopts a trainable copy of the encoding layers in Stable Diffusion (Rombach et al., 2022), whilst StableSR does not rely on any layer copies of the fixed diffusion prior and thus can be more flexible. Specifically, we introduce a time-aware encoder to modulate the feature maps of the fixed diffusion prior. This time-aware encoder is more lightweight than the copied layers in ControlNet-tile, i.e., 105M vs. 364M. As a result, StableSR is also faster than ControlNet-tile in terms of inference speed, i.e., 10.37s vs. 14.47s for 50 sampling steps. Here, we further conduct comparisons with these methods on real-world images. For fair comparisons, we use DDIM sampling with $\eta=1.0$ and $200$ timesteps for all the methods, and the seed is fixed to $42$. We further set $w=0.0$ in StableSR to avoid additional improvement due to CFW. For ControlNet-tile (Zhang et al., 2023), we generate additional prompts using stable-diffusion-webui (https://github.com/AUTOMATIC1111/stable-diffusion-webui) for better performance. For the IF_III upscaler (Deep-floyd, 2023), we follow official examples to set the noise level to $100$ w/o prompts. As shown in Fig. 8, ControlNet-tile shows poor fidelity due to the lack of specific designs for SR. Compared with the IF_III upscaler, the proposed StableSR is capable of generating more faithful details with sharper edges, e.g., the text in the first row, the tiger's nose in the third row and the wing of the butterfly in the last row of Fig. 8. Note that the IF_III upscaler is trained from scratch, which requires significant computational resources. The visual comparisons suggest the superiority of StableSR.


Figure 8: Qualitative comparisons on real-world images ($128\rightarrow 512$). Our StableSR outperforms ControlNet-tile (Zhang et al., 2023) with higher fidelity and has more realistic and sharper details compared with IF_III upscaler (Deep-floyd, 2023). (Zoom in for details)

Comparison with Follow-up Approaches. During the submission of our work, we notice that several follow-up methods (Lin et al., 2023; Yu et al., 2024) are further proposed for image super-resolution by exploiting the diffusion prior with a ControlNet-like (Zhang et al., 2023) framework. We therefore conduct a further comparison with these works here. The key technical differences regarding the use of diffusion prior between our StableSR and DiffBIR lie in the different adaptor designs, i.e., DiffBIR follows ControlNet (Zhang et al., 2023) to adopt a trainable copy of the encoding layers in Stable Diffusion (Rombach et al., 2022), while StableSR does not rely on any layer copies of the fixed diffusion prior and thus can be more flexible. Specifically, the generation module of DiffBIR is the same as ControlNet, leading to more trainable parameters (364M vs. 105M) and longer inference time (14.47s vs. 10.37s). Besides, DiffBIR requires an additional pre-clean model during both training and inference, as inspired by our earlier work DifFace (Yue and Loy, 2022), whilst our StableSR does not require such a pre-clean model during training. In the testing phase, this pre-clean model is also optional and can be removed (we do not use it by default, unless clarified). Details of the pre-clean model for StableSR can be found in the supplementary material. Similar to DiffBIR, another recent work SUPIR (Yu et al., 2024) proposes to adopt SDXL (Podell et al., 2023), a much larger diffusion model (2.6B vs. 865M), as the diffusion prior and develops a trimmed ControlNet to reduce the model size. Although both follow ControlNet (Zhang et al., 2023), SUPIR has many more trainable parameters (1.3B) than DiffBIR, leading to almost $2\times$ the inference time of StableSR. We further conduct comparisons on real-world test data. As shown in Table 2 and Fig. 9, StableSR is comparable with DiffBIR. We further notice that DiffBIR tends to over-generate patterns, as shown in the last row of Fig. 9, while StableSR does not suffer from such a problem. As for SUPIR, we observe that it does not perform well on images with small resolutions, i.e., lower than 512 after upsampling. We conjecture this is because small cropped images lack semantic content and the prior adopted by SUPIR is trained at a $1024\times 1024$ resolution. However, we do observe that SUPIR outperforms our method on large resolutions beyond $1024$, which should be mostly due to the huge model size and the large training set with detailed prompts. Improving StableSR with a larger diffusion prior and training datasets with prompts can be regarded as a future direction.


Figure 9: Qualitative comparisons on real-world images ($128\rightarrow 512$) with DiffBIR (Lin et al., 2023) and SUPIR (Yu et al., 2024). (Zoom in for details)

Table 2: Quantitative comparison with follow-up works, i.e., DiffBIR (Lin et al., 2023) and SUPIR (Yu et al., 2024) on RealSR (Cai et al., 2019) and DRealSR (Wei et al., 2020) benchmarks. SUPIR does not perform well due to the resolution gap between test data ($512\times 512$) and SDXL prior ($1024\times 1024$).

| Datasets | Metrics | DiffBIR | SUPIR | StableSR |
|---|---|---|---|---|
| RealSR | PSNR ↑ | 25.02 | 23.70 | 24.65 |
| | SSIM ↑ | 0.6711 | 0.6647 | 0.7080 |
| | LPIPS ↓ | 0.3568 | 0.3559 | 0.3002 |
| | CLIP-IQA ↑ | 0.6568 | 0.6619 | 0.6234 |
| | MUSIQ ↑ | 64.07 | 61.97 | 65.88 |
| DRealSR | PSNR ↑ | 27.20 | 24.86 | 28.03 |
| | SSIM ↑ | 0.6721 | 0.6441 | 0.7536 |
| | LPIPS ↓ | 0.4274 | 0.4229 | 0.3284 |
| | CLIP-IQA ↑ | 0.6293 | 0.6891 | 0.6357 |
| | MUSIQ ↑ | 59.87 | 59.70 | 58.51 |

4.4 Ablation Study


Figure 10: Training comparisons between w/ and w/o diffusion prior (DP). Adopting DP significantly speeds up the training process with better LPIPS scores at early epochs. The visualization results on validation sets at different epochs also indicate the superiority of using DP. (Zoom in for details)

Effectiveness of Diffusion Prior. We first verify the effectiveness of adopting diffusion prior for super-resolution. We train a baseline from scratch without loading a pretrained diffusion model as diffusion prior. The architecture is kept the same as our StableSR for fair comparison. As shown in Fig. 10, benefiting from the diffusion prior, StableSR achieves better LPIPS scores on both validation datasets during training. The visual comparisons at different epochs also indicate the significance of adopting diffusion prior. Moreover, we observe that training from scratch requires 2.06 times more GPU memory on average compared to StableSR on NVIDIA Tesla 32G-V100 GPUs.


Figure 11: Training process and qualitative comparisons with ControlNet (CNet) baseline (Zhang et al., 2023). As shown in the validation curve during training, StableSR converges faster with better LPIPS scores on both validation sets. The visual comparisons after training 117 epochs also indicate the effectiveness of our StableSR. (Zoom in for details)


Figure 12: Training comparisons between our StableSR and the baseline w/o SFT layers. SFT layers slightly improve the training performance in terms of lower LPIPS scores of validation sets.

Effectiveness of Network Design. In StableSR, a time-aware encoder and SFT layers are adopted to harness the diffusion prior. While concurrent works ControlNet (Zhang et al., 2023) and T2I-Adapter (Mou et al., 2024) propose to exploit diffusion prior for image generation, their effectiveness for image super-resolution is underexplored. Here, we further compare our design with theirs. Specifically, we first retrain a ControlNet for image super-resolution using the same diffusion prior and training pipelines as ours. Recall that we have shown the superiority of StableSR compared with ControlNet-tile in Fig. 8. With retraining, the performance of ControlNet for super-resolution can be improved, but it is still inferior to ours, as shown in Fig. 11. To compare with T2I-Adapter, while we have already verified the effectiveness of time-aware guidance, we further add a baseline w/o SFT layers by first mapping the features to the same shape as the prior features and then adding them together. Note that such a strategy can be regarded as a special case of SFT layers with $\bm{\alpha}^{n}=0,\bm{\beta}^{n}=0$ in Eq. (1). As shown in Fig. 12, SFT layers slightly improve the LPIPS scores on the validation sets during training.


Figure 13: Visual comparisons of time-aware guidance and color correction. Exp. (a) does not apply time-aware guidance, leading to blurry textures. Exp. (b) applies time-aware guidance and can generate sharper details, but obvious color shifts can be observed. With both strategies, StableSR generates sharp textures and avoids color shifts.

Importance of Time-aware Guidance and Color Correction. We then investigate the significance of time-aware guidance and color correction. Recall that in Fig. 3, we already show that the time-aware guidance allows the encoder to adaptively adjust the condition strength. Here, we further verify its effectiveness on real-world benchmarks (Cai et al., 2019; Wei et al., 2020). As shown in Table 3, removing either time-aware guidance (i.e., removing the time-embedding layer) or color correction leads to worse SSIM and LPIPS. Moreover, the comparisons in Fig. 13 also indicate inferior performance without the above two components, suggesting the effectiveness of time-aware guidance and color correction. In addition to directly adopting color correction in the pixel domain, our proposed wavelet color correction can further boost the visual quality, as shown in Fig. 14, which may further facilitate practical use. Note that, technically, the wavelet transform may introduce halo effects (Thorndike et al., 1920), though we do not observe this phenomenon in our experiments.
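To make the pixel-domain color correction concrete, the following is a minimal sketch. Eq. (2) is not reproduced in this section, so the snippet assumes that "channel matching" means aligning the per-channel mean and standard deviation of the generated image with those of the LR input; the function name and the stabilizing constant `eps` are illustrative.

```python
import numpy as np

def match_channel_stats(pred, ref, eps=1e-8):
    """Shift and scale each channel of `pred` so that its mean and standard
    deviation match those of `ref` (both arrays have shape (H, W, C))."""
    mu_p = pred.mean(axis=(0, 1), keepdims=True)
    sd_p = pred.std(axis=(0, 1), keepdims=True)
    mu_r = ref.mean(axis=(0, 1), keepdims=True)
    sd_r = ref.std(axis=(0, 1), keepdims=True)
    # Normalize pred per channel, then re-color it with the reference statistics.
    return (pred - mu_p) / (sd_p + eps) * sd_r + mu_r

rng = np.random.default_rng(0)
sr = rng.random((16, 16, 3)) * 0.5 + 0.4   # generated image with a color shift
lr = rng.random((8, 8, 3))                 # LR reference (resolution may differ)
corrected = match_channel_stats(sr, lr)
```

Since only global per-channel statistics are used, the reference and the generated image do not need to share the same resolution.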

Table 3: Ablation studies of time-aware guidance and color correction on RealSR (Cai et al., 2019) and DRealSR (Wei et al., 2020) benchmarks. Metric values are reported as RealSR / DRealSR.
| Exp. | Time aware | Pixel color cor. | Wavelet color cor. | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|---|---|
| (a) | | | | 24.65 / 27.68 | 0.7040 / 0.7280 | 0.3157 / 0.3456 |
| (b) | | | | 22.24 / 23.86 | 0.6840 / 0.7179 | 0.3180 / 0.3544 |
| (c) | | | | 23.38 / 26.80 | 0.6870 / 0.7235 | 0.3157 / 0.3475 |
| Default | | | | 24.65 / 28.03 | 0.7080 / 0.7536 | 0.3002 / 0.3284 |

Figure 14: Visual comparisons of different color correction strategies. With no color correction, obvious color shifts can be observed in Exp. (b). Our color correction via channel matching in Eq. (2) can alleviate the color shift problem, while the wavelet color correction of Eq. (5) can further improve the visual quality in these cases.

Figure 15: Visual comparisons with different coefficients $w$ for the CFW module. A small $w$ tends to generate a realistic result while a larger $w$ improves the fidelity.

Flexibility of Fidelity-realism Trade-off. Our CFW module, inspired by CodeFormer (Zhou et al., 2022), allows a flexible realism-fidelity trade-off. In particular, given a controllable coefficient $w$ in the range $[0,1]$, CFW with a small $w$ tends to generate a realistic result, especially for large degradations, while CFW with a larger $w$ improves the fidelity. As shown in Table 4, compared with StableSR ($w=0.0$), StableSR with larger values of $w$ (e.g., 0.75) achieves higher PSNR and SSIM on all three paired benchmarks, indicating better fidelity. In contrast, StableSR ($w=0.0$) achieves better perceptual quality with higher CLIP-IQA and MUSIQ scores. Similar phenomena can also be observed in Fig. 15. We further observe that a proper $w$ can improve both fidelity and perceptual quality. Specifically, StableSR ($w=0.5$) shows PSNR and SSIM comparable to StableSR ($w=1.0$) but achieves better perceptual metric scores in Table 4. Hence, we set the coefficient $w$ to 0.5 by default to balance quality and fidelity. We observe that CFW requires extra GPU memory. Consequently, we designate it as an optional feature for varying applications.
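The role of the coefficient can be illustrated with a minimal sketch. It assumes a CodeFormer-style residual formulation in which a fidelity-oriented correction is scaled by $w$ and added to the decoder features; the CFW equation is not restated in this section, so this form, the name `cfw_fuse`, and the precomputed `residual` (standing in for the output of a learned module) are assumptions for illustration only.

```python
def cfw_fuse(decoder_feat, residual, w):
    """Blend decoder features with a fidelity-oriented residual.

    decoder_feat: decoder features (flat list, for illustration)
    residual:     correction predicted from LR-derived encoder features by a
                  learned module (hypothetical here; passed in precomputed)
    w:            controllable coefficient in [0, 1]
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return [d + w * r for d, r in zip(decoder_feat, residual)]


decoder_feat = [0.4, -1.2, 0.8]
residual = [0.2, 0.6, -0.4]
realistic = cfw_fuse(decoder_feat, residual, w=0.0)  # decoder features unchanged
balanced = cfw_fuse(decoder_feat, residual, w=0.5)   # default value used in the text
faithful = cfw_fuse(decoder_feat, residual, w=1.0)   # full residual applied
```

Under this assumption, $w=0$ leaves the generated features untouched, while increasing $w$ moves the output toward the LR-derived information.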

Table 4: Ablation studies of the controllable coefficient $w$ on both synthetic (DIV2K Valid (Agustsson and Timofte, 2017)) and real-world (RealSR (Cai et al., 2019), DRealSR (Wei et al., 2020), and DPED-iPhone (Ignatov et al., 2017)) benchmarks.
| Datasets | Metrics | StableSR ($w=0.0$) | StableSR ($w=0.5$) | StableSR ($w=0.75$) | StableSR ($w=1.0$) |
|---|---|---|---|---|---|
| DIV2K Valid | PSNR ↑ | 22.68 | 23.26 | 24.17 | 23.14 |
| DIV2K Valid | SSIM ↑ | 0.5546 | 0.5726 | 0.6209 | 0.5681 |
| DIV2K Valid | LPIPS ↓ | 0.3393 | 0.3114 | 0.3003 | 0.3077 |
| DIV2K Valid | FID ↓ | 25.83 | 24.44 | 24.05 | 26.14 |
| DIV2K Valid | CLIP-IQA ↑ | 0.6529 | 0.6771 | 0.5519 | 0.6197 |
| DIV2K Valid | MUSIQ ↑ | 65.72 | 65.92 | 59.46 | 64.31 |
| RealSR | PSNR ↑ | 24.07 | 24.65 | 25.37 | 24.70 |
| RealSR | SSIM ↑ | 0.6829 | 0.7080 | 0.7435 | 0.7157 |
| RealSR | LPIPS ↓ | 0.3190 | 0.3002 | 0.2672 | 0.2892 |
| RealSR | CLIP-IQA ↑ | 0.6127 | 0.6234 | 0.5341 | 0.5847 |
| RealSR | MUSIQ ↑ | 65.81 | 65.88 | 62.36 | 64.05 |
| DRealSR | PSNR ↑ | 27.43 | 28.03 | 29.00 | 27.97 |
| DRealSR | SSIM ↑ | 0.7341 | 0.7536 | 0.7985 | 0.7540 |
| DRealSR | LPIPS ↓ | 0.3595 | 0.3284 | 0.2721 | 0.3080 |
| DRealSR | CLIP-IQA ↑ | 0.6340 | 0.6357 | 0.5070 | 0.5893 |
| DRealSR | MUSIQ ↑ | 58.98 | 58.51 | 53.12 | 56.77 |
| DPED-iPhone | CLIP-IQA ↑ | 0.5015 | 0.4799 | 0.3405 | 0.4250 |
| DPED-iPhone | MUSIQ ↑ | 51.90 | 50.48 | 41.81 | 47.96 |
Table 5: Comparison of model complexity. All methods are evaluated on $128 \times 128$ input images for 4x SR using an NVIDIA Tesla 32G-V100 GPU. The runtime is averaged over ten runs with a batch size of 1.
| | Real-ESRGAN+ | FeMaSR | SwinIR-GAN | LDM | IF_III | StableSR | StableSR-Turbo |
|---|---|---|---|---|---|---|---|
| Model type | GAN | GAN | GAN | Diffusion | Diffusion | Diffusion | Diffusion |
| Number of inference steps | 1 | 1 | 1 | 200 | 200 | 200 | 4 |
| Runtime | 0.08s | 0.12s | 0.31s | 5.25s | 17.78s | 15.16s | 0.83s |
| Trainable params | 16.70M | 28.29M | 28.01M | 113.62M | 473.40M | 149.91M | 149.91M |

4.5 Complexity Comparison

StableSR is a diffusion-based approach and requires multi-step sampling for image generation. As shown in Table 5, when the number of sampling steps is set to 200, StableSR needs 15.16 seconds to generate a $512 \times 512$ image on one NVIDIA Tesla 32G-V100 GPU. This is comparable to the IF_III upscaler but slower than GAN-based SR methods such as Real-ESRGAN+ and SwinIR-GAN, which require only a single forward pass. Fast sampling strategies (Song et al., 2020; Lu et al., 2022; Karras et al., 2022) and model distillation (Salimans and Ho, 2021; Song et al., 2023b; Luo et al., 2023) are two promising solutions to improve efficiency. Another viable remedy is to shorten the chain of the diffusion process (Yue et al., 2023). As for trainable parameters, StableSR has 149.91M trainable parameters, which is only 11.50% of the full model and fewer than IF_III, i.e., 473.40M. The number of trainable parameters can be further decreased with more careful design, e.g., adopting lightweight architectures (Chollet, 2017; Howard et al., 2019) or network pruning (Fang et al., 2023). Such exploration is beyond the scope of this paper.
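The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the runtimes and parameter counts reported in Table 5 and the 11.50% share stated in the text; the variable names are illustrative, and the implied full-model size is a derived estimate rather than a number reported in the paper.

```python
# Runtimes in seconds, copied from Table 5.
runtime_s = {
    "Real-ESRGAN+": 0.08,
    "FeMaSR": 0.12,
    "SwinIR-GAN": 0.31,
    "LDM": 5.25,
    "IF_III": 17.78,
    "StableSR": 15.16,
    "StableSR-Turbo": 0.83,
}

# Speed-up of the 4-step StableSR-Turbo over each 200-step diffusion baseline.
speedup_vs_turbo = {
    name: runtime_s[name] / runtime_s["StableSR-Turbo"]
    for name in ("LDM", "IF_III", "StableSR")
}

# Trainable parameters (in millions) and their stated share of the full model.
trainable_m = 149.91
trainable_share = 0.1150
implied_full_model_m = trainable_m / trainable_share  # derived estimate

for name, ratio in speedup_vs_turbo.items():
    print(f"{name}: {ratio:.1f}x slower than StableSR-Turbo")
print(f"Implied full model size: {implied_full_model_m:.1f}M parameters")
```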

5 Inference Strategies

The proposed StableSR already demonstrates superior performance quantitatively and qualitatively on both synthetic and real-world benchmarks, as shown in Sec. 4. Here, we discuss several effective strategies during the sampling process that can further boost the inference performance without additional finetuning.

Figure 16: Qualitative comparisons on classifier-free guidance with negative prompts. A higher guidance scale $s$ leads to sharper edges. (Zoom in for details)

5.1 Classifier-free Guidance with Negative Prompts

The default StableSR is trained with null prompts. Interestingly, we observe that StableSR can react to prompts, especially negative prompts. We examine the use of classifier-free guidance (Ho and Salimans, 2021) with negative prompts to further improve the visual quality during sampling. Given two StableSR models conditioned on null prompts $\epsilon_{\bm{\theta}}(\bm{Z}^{(t)},\bm{F},[],t)$ and negative prompts $\epsilon_{\bm{\theta}}(\bm{Z}^{(t)},\bm{F},\bm{c},t)$, respectively, the new sampling process can be performed using a linear combination of the estimations with a guidance scale $s$:

$$\tilde{\epsilon}_{\bm{\theta}} = \epsilon_{\bm{\theta}}(\bm{Z}^{(t)},\bm{F},\bm{c},t) + s\left(\epsilon_{\bm{\theta}}(\bm{Z}^{(t)},\bm{F},[],t) - \epsilon_{\bm{\theta}}(\bm{Z}^{(t)},\bm{F},\bm{c},t)\right), \tag{8}$$

where $\bm{c}$ is the negative prompt for guidance. According to Eq. (8), it is worth noting that $s=0$ is equivalent to directly using negative prompts without guidance, and $s=1$ is equivalent to our default settings with the null prompt.
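The linear combination in Eq. (8) can be sketched in a few lines. The function name `guided_noise` and the use of flat Python lists in place of latent tensors are assumptions for illustration; in practice, the two inputs would be the noise estimates of the same denoiser under the negative prompt and the null prompt.

```python
def guided_noise(eps_neg, eps_null, s):
    """Combine two noise estimates following Eq. (8).

    eps_neg:  estimate conditioned on the negative prompt c
    eps_null: estimate conditioned on the null prompt []
    s:        guidance scale
    """
    # Start from the negative-prompt estimate and move toward (s <= 1)
    # or beyond (s > 1) the null-prompt estimate.
    return [n + s * (u - n) for n, u in zip(eps_neg, eps_null)]


eps_neg = [0.20, -0.10, 0.40]
eps_null = [0.50, 0.10, 0.30]

only_negative = guided_noise(eps_neg, eps_null, s=0.0)  # equals eps_neg
default = guided_noise(eps_neg, eps_null, s=1.0)        # equals eps_null
pushed_away = guided_noise(eps_neg, eps_null, s=2.5)    # extrapolates past eps_null
```

For $s>1$, the result is extrapolated past the null-prompt estimate in the direction pointing away from the negative-prompt estimate, which is how the negative prompt steers sampling.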

We compare the performance of StableSR with various positive prompts, i.e., (1) “(masterpiece:2), (best quality:2), (realistic:2), (very clear:2)”, and (2) “Good photo.”, and negative prompts, i.e., (a) “3d, cartoon, anime, sketches, (worst quality:2), (low quality:2)”, and (b) “Bad photo.”. As shown in Table 6, different prompts lead to diverse metric scores. Specifically, the classifier-free guidance with negative prompts shows a significant influence on the metrics, i.e., higher guidance scales lead to higher CLIP-IQA and MUSIQ scores, indicating sharper results. Similar phenomena can also be observed in Fig. 16. However, overly strong guidance, e.g., $s=7.5$, can result in oversharpening.

Figure 17: Qualitative comparisons on real-world images ($128 \rightarrow 512$). Our StableSR-Turbo w/o further finetuning is capable of generating high-quality images in only 4 steps, while still significantly outperforming existing approaches.
Table 6: Comparison of different prompts and guidance strengths. Note that $s=0$ is equivalent to using negative prompts w/o guidance. Positive prompts are (1) “(masterpiece:2), (best quality:2), (realistic:2), (very clear:2)”, and (2) “Good photo.”. Negative prompts are (a) “3d, cartoon, anime, sketches, (worst quality:2), (low quality:2)”, and (b) “Bad photo.”. The first row is the default settings for StableSR. Metric values are reported as RealSR / DRealSR.
| Pos. Prompts | Neg. Prompts | Guidance Scale | PSNR ↑ | SSIM ↑ | LPIPS ↓ | CLIP-IQA ↑ | MUSIQ ↑ |
|---|---|---|---|---|---|---|---|
| [] | - | - | 24.65 / 28.03 | 0.7080 / 0.7536 | 0.3002 / 0.3284 | 0.6234 / 0.6357 | 65.88 / 58.51 |
| (1) | - | - | 24.68 / 28.03 | 0.7025 / 0.7461 | 0.3151 / 0.3378 | 0.6251 / 0.6370 | 65.34 / 58.07 |
| (2) | - | - | 24.71 / 28.07 | 0.7049 / 0.7500 | 0.3118 / 0.3333 | 0.6219 / 0.6291 | 65.22 / 57.75 |
| [] | (a) | $s=0.0$ | 24.80 / 28.18 | 0.7097 / 0.7562 | 0.3105 / 0.3316 | 0.6176 / 0.6224 | 64.86 / 57.31 |
| [] | (a) | $s=2.5$ | 24.41 / 27.76 | 0.6972 / 0.7383 | 0.3168 / 0.3417 | 0.6306 / 0.6422 | 66.02 / 59.21 |
| [] | (a) | $s=5.0$ | 23.96 / 27.21 | 0.6829 / 0.7188 | 0.3267 / 0.3583 | 0.6356 / 0.6558 | 66.84 / 61.07 |
| [] | (a) | $s=7.5$ | 23.53 / 26.68 | 0.6673 / 0.7003 | 0.3399 / 0.3774 | 0.6323 / 0.6621 | 67.26 / 62.41 |
| [] | (b) | $s=0.0$ | 24.77 / 28.13 | 0.7067 / 0.7520 | 0.3100 / 0.3317 | 0.6184 / 0.6239 | 64.81 / 57.27 |
| [] | (b) | $s=2.5$ | 24.46 / 27.90 | 0.7017 / 0.7467 | 0.3170 / 0.3371 | 0.6303 / 0.6409 | 66.29 / 58.97 |
| [] | (b) | $s=5.0$ | 24.13 / 27.61 | 0.6958 / 0.7391 | 0.3240 / 0.3467 | 0.6377 / 0.6490 | 67.43 / 60.69 |
| [] | (b) | $s=7.5$ | 23.78 / 27.30 | 0.6894 / 0.7310 | 0.3320 / 0.3578 | 0.6421 / 0.6583 | 68.13 / 62.12 |

Figure 18: StableSR shares the same limitations as the diffusion prior, i.e., Stable Diffusion (Rombach et al., 2022), and thus may fail to handle texts, very small patterns, and small faces. While these cases are very challenging for existing generic SR methods, we believe a more powerful diffusion prior and training on more data could help.

5.2 StableSR with SD-Turbo

The default sampler of StableSR is DDPM (Ho et al., 2020) with 200 sampling steps. Though effective, the sampling process can be time-consuming compared with non-diffusion approaches, as shown in Table 5. In practice, we observe that StableSR can generate high-quality results much faster by using advanced samplers with fewer sampling steps. Specifically, DDIM (Song et al., 2020) enables StableSR to generate results with faithful details in 20 steps. Moreover, StableSR can be directly applied to SD-Turbo (Sauer et al., 2023) w/o further finetuning. As shown in Fig. 17, StableSR equipped with SD-Turbo can generate high-quality results with only 4 steps, reducing the inference time to 0.83s (Table 5), which is 6.3 times faster than LDM with 200 sampling steps, while still remarkably outperforming popular GAN-based methods (Wang et al., 2021c; Liang et al., 2021) and LDM (Rombach et al., 2022). Notably, directly speeding up LDM using existing fast sampling approaches, i.e., DDIM, leads to a severe performance drop, as shown in Fig. 17.
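For readers unfamiliar with how DDIM reduces the number of steps, the sketch below shows the two ingredients of the deterministic variant (zero added noise): a shortened timestep schedule and the update rule that first predicts the clean sample and then re-noises it to the previous timestep. It is not the StableSR sampler; the choice of 1000 training timesteps, the function names, and the scalar inputs are assumptions for illustration.

```python
import math


def ddim_timesteps(num_train_steps, num_sample_steps):
    """Evenly spaced subsequence of training timesteps, in descending order."""
    stride = num_train_steps // num_sample_steps
    return list(range(0, num_train_steps, stride))[:num_sample_steps][::-1]


def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update.

    x_t:            current noisy sample (a scalar here, for illustration)
    eps:            noise predicted by the denoiser at the current timestep
    alpha_bar_t:    cumulative signal level at the current timestep
    alpha_bar_prev: cumulative signal level at the previous (less noisy) timestep
    """
    # Predict the clean sample implied by the current noise estimate.
    x0_pred = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps) / math.sqrt(alpha_bar_t)
    # Re-noise the prediction to the noise level of the previous timestep.
    return math.sqrt(alpha_bar_prev) * x0_pred + math.sqrt(1.0 - alpha_bar_prev) * eps


# A 20-step schedule, assuming 1000 training timesteps.
schedule = ddim_timesteps(1000, 20)

# Sanity check with a known clean value and a perfect noise prediction.
x0, noise = 0.7, -0.3
x_t = math.sqrt(0.5) * x0 + math.sqrt(0.5) * noise
x_prev = ddim_step(x_t, noise, alpha_bar_t=0.5, alpha_bar_prev=0.8)
```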

6 Limitations

Though benefiting from the diffusion prior, StableSR also shares similar limitations with it. Specifically, StableSR may struggle to handle small texts, faces, and patterns, as shown in Fig. 18. While these cases are challenging for existing generic super-resolution approaches including StableSR, we believe adopting a more powerful diffusion prior and training on more high-quality data can help. We leave these as future work.

7 Conclusion

Motivated by the rapid development of diffusion models and their wide applications to downstream tasks, this work discusses an important yet underexplored problem of how diffusion prior can be adopted for super-resolution. In this paper, we present StableSR, a new way to exploit diffusion prior for real-world SR while avoiding resource-intensive training from scratch. We devote our efforts to tackling well-known problems, such as high computational cost and fixed resolution, and propose respective solutions, including the time-aware encoder, controllable feature wrapping module, and progressive aggregation sampling scheme. Extensive experiments are conducted for evaluation, and effective inference strategies are further provided to facilitate practical applications. We believe that our exploration would lay a good foundation in this direction, and our proposed StableSR could provide useful insights for future works.

Acknowledgement: This study is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG2-PhD-2022-01-033[T]), RIE2020 Industry Alignment Fund Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s). We sincerely thank Yi Li for providing valuable advice and building the WebUI implementation (https://github.com/pkuliyi2015/sd-webui-stablesr) of our work. We also thank the community for its continuous interest and contributions.

Appendix

Appendix A Details of Time-aware Encoder

As mentioned in the main paper, the architecture of the time-aware encoder is similar to the contracting path of the denoising U-Net in Stable Diffusion (Rombach et al., 2022) with much fewer parameters ($\sim$105M, including SFT layers) by reducing the number of channels. The detailed settings are listed in Table 7.

Table 7: Settings of the time-aware encoder in StableSR.
| Settings | Value |
|---|---|
| in_channels | 4 |
| model_channels | 256 |
| out_channels | 256 |
| num_res_blocks | 2 |
| dropout | 0 |
| channel_mult | [1, 1, 2, 2] |
| attention_resolutions | [4, 2, 1] |
| conv_resample | True |
| dims | 2 |
| use_fp16 | False |
| num_heads | 4 |
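As a reading aid, the settings of Table 7 can be written as a plain configuration dictionary. The per-level channel widths derived below assume the common U-Net convention in which `channel_mult` scales `model_channels` at each resolution level; this convention and the variable names are assumptions, not a description of the released code.

```python
# Settings copied from Table 7.
encoder_config = {
    "in_channels": 4,
    "model_channels": 256,
    "out_channels": 256,
    "num_res_blocks": 2,
    "dropout": 0,
    "channel_mult": [1, 1, 2, 2],
    "attention_resolutions": [4, 2, 1],
    "conv_resample": True,
    "dims": 2,
    "use_fp16": False,
    "num_heads": 4,
}

# Assumed convention: width at level i = model_channels * channel_mult[i].
level_channels = [
    encoder_config["model_channels"] * mult
    for mult in encoder_config["channel_mult"]
]
print(level_channels)
```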

Figure 19: Illustration of our aggregation sampling algorithm. We divide the noisy latent codes into overlapping patches and fuse these patches using a Gaussian kernel at each diffusion iteration. To avoid altering the output resolution, the overlapping size (region C) at the right and bottom boundaries is dynamically adjusted to fit the target resolution.

Appendix B Aggregation Sampling

Here, we provide more details about our aggregation sampling strategy, which is an effective and practical solution that enables arbitrary-size image generation without a perceptible performance drop for diffusion-based restoration. Our aggregation sampling strategy is mainly inspired by Jiménez (2023), and we further enable more flexible resolutions by dynamically adjusting the overlapping size at the right and bottom boundaries, as shown in Fig. 19.
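The idea can be sketched in one dimension; the two-dimensional case applies the same logic along each spatial axis of the latent. In this hedged sketch, `process` stands in for one denoising step applied to a patch, the last tile is shifted back so that it ends exactly at the boundary (which enlarges its overlap), and overlapping outputs are averaged with Gaussian weights. The function names and the choice of the Gaussian width (`sigma_frac`) are assumptions for illustration.

```python
import math


def tile_starts(length, tile, stride):
    """Start offsets of overlapping tiles covering [0, length).

    The last tile is shifted back to end at the boundary, so its overlap
    with the previous tile adapts to the target length.
    """
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)
    return starts


def gaussian_weights(tile, sigma_frac=0.25):
    """Gaussian weights that favor the center of a tile (assumed width)."""
    center = (tile - 1) / 2
    sigma = tile * sigma_frac
    return [math.exp(-((i - center) ** 2) / (2 * sigma ** 2)) for i in range(tile)]


def aggregate(signal, tile, stride, process):
    """Process overlapping tiles and fuse them by Gaussian-weighted averaging."""
    n = len(signal)
    acc = [0.0] * n
    norm = [0.0] * n
    weights = gaussian_weights(tile)
    for start in tile_starts(n, tile, stride):
        out = process(signal[start:start + tile])
        for i, (value, w) in enumerate(zip(out, weights)):
            acc[start + i] += w * value
            norm[start + i] += w
    return [a / m for a, m in zip(acc, norm)]


# Identity "denoiser": the fused output should reproduce the input.
latent_row = [float(i) for i in range(11)]
fused = aggregate(latent_row, tile=4, stride=3, process=lambda patch: patch)
```

With a length of 11, a tile size of 4, and a stride of 3, the tiles start at 0, 3, 6, and 7, so only the final overlap grows to fit the boundary.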

Figure 20: StableSR may generate suboptimal results when the inputs have severe degradations. Adopting a simple pre-cleaning with a pre-trained SR model during sampling can effectively improve the performance of StableSR under such circumstances.

Appendix C Pre-cleaning for Severe Degradations

It is observed that StableSR may yield suboptimal results when LR images are severely degraded with pronounced levels of blur or noise, as shown in the first column of Fig. 20. Drawing inspiration from RealBasicVSR (Chan et al., 2022b), we incorporate an auxiliary pre-cleaning phase preceding StableSR to address scenarios under severe degradations. Specifically, we first adopt an existing SR approach, e.g., Real-ESRGAN+ (Wang et al., 2021c) for general SR and CodeFormer (Zhou et al., 2022) for face SR, to mitigate the aforementioned severe degradations. (For face SR, we further finetune our StableSR model for 50 epochs on FFHQ (Karras et al., 2019) using the same degradations as CodeFormer (Zhou et al., 2022).) To suppress the amplification of artifacts originating from the pre-cleaning phase, a subsequent $2\times$ bicubic downsampling operation is further adopted after pre-cleaning. Subsequently, StableSR is used to generate the final outputs. As shown in Fig. 20, such a pre-cleaning stage substantially improves the robustness of StableSR.

Appendix D Additional Visual Results

Figure 21: More qualitative comparisons on real-world images ($128 \rightarrow 512$). While existing methods typically fail to restore realistic textures under complicated degradations, our StableSR outperforms these methods by a large margin. (Zoom in for details)

D.1 Visual Results on Fixed Resolution

In this section, we provide additional qualitative comparisons on real-world images w/o ground truths under the resolution of $512 \times 512$. We obtain LR images with $128 \times 128$ resolution. As shown in Fig. 21, StableSR successfully produces outputs with finer details and sharper edges, significantly outperforming state-of-the-art methods.

D.2 Visual Results on Arbitrary Resolution

In this section, we provide additional qualitative comparisons on the original resolution of real-world images w/o ground truths. As shown in Fig. 22, StableSR is capable of generating high-quality SR images beyond 4K resolution, indicating its practical use in real-world applications. Moreover, the results in Fig. 23 indicate that StableSR can generate realistic textures under diverse and complicated real-world scenarios such as buildings and texts, while existing methods either lead to blurry results or introduce unpleasant artifacts.

Figure 22: A 4x StableSR result on AIGC content beyond 4K resolution. (Zoom in for details)

Figure 23: More qualitative comparisons on original real-world images with diverse resolutions. Our StableSR is capable of generating vivid details without annoying artifacts. (Zoom in for details)

References

  • Agustsson and Timofte (2017) Agustsson E, Timofte R (2017) NTIRE 2017 challenge on single image super-resolution: Dataset and study. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR-W)
  • Avrahami et al. (2022) Avrahami O, Lischinski D, Fried O (2022) Blended diffusion for text-driven editing of natural images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Balaji et al. (2022) Balaji Y, Nah S, Huang X, Vahdat A, Song J, Kreis K, Aittala M, Aila T, Laine S, Catanzaro B, Karras T, Liu MY (2022) ediff-i: Text-to-image diffusion models with ensemble of expert denoisers. arXiv preprint arXiv:2211.01324
  • Blau and Michaeli (2018) Blau Y, Michaeli T (2018) The perception-distortion tradeoff. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Cai et al. (2019) Cai J, Zeng H, Yong H, Cao Z, Zhang L (2019) Toward real-world single image super-resolution: A new benchmark and a new model. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
  • Chan et al. (2021) Chan KC, Wang X, Xu X, Gu J, Loy CC (2021) GLEAN: Generative latent bank for large-factor image super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Chan et al. (2022a) Chan KC, Wang X, Xu X, Gu J, Loy CC (2022a) GLEAN: Generative latent bank for large-factor image super-resolution and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
  • Chan et al. (2022b) Chan KC, Zhou S, Xu X, Loy CC (2022b) Investigating tradeoffs in real-world video super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Chen et al. (2022) Chen C, Shi X, Qin Y, Li X, Han X, Yang T, Guo S (2022) Real-world blind super-resolution via feature matching with implicit high-resolution priors. In: Proceedings of the ACM International Conference on Multimedia (ACM MM)
  • Chen et al. (2021) Chen H, Wang Y, Guo T, Xu C, Deng Y, Liu Z, Ma S, Xu C, Xu C, Gao W (2021) Pre-trained image processing transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Choi et al. (2021) Choi J, Kim S, Jeong Y, Gwon Y, Yoon S (2021) Ilvr: Conditioning method for denoising diffusion probabilistic models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
  • Choi et al. (2022) Choi J, Lee J, Shin C, Kim S, Kim H, Yoon S (2022) Perception prioritized training of diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Chollet (2017) Chollet F (2017) Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Chung et al. (2022) Chung H, Sim B, Ryu D, Ye JC (2022) Improving diffusion models for inverse problems using manifold constraints. In: Proceedings of Advances in Neural Information Processing Systems (NeurIPS)
  • Dai et al. (2019) Dai T, Cai J, Zhang Y, Xia ST, Zhang L (2019) Second-order attention network for single image super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Deep-floyd (2023) Deep-floyd (2023) If. https://github.com/deep-floyd/IF
  • Dong et al. (2014) Dong C, Loy CC, He K, Tang X (2014) Learning a deep convolutional network for image super-resolution. In: Proceedings of the European Conference on Computer Vision (ECCV)
  • Dong et al. (2015) Dong C, Loy CC, He K, Tang X (2015) Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
  • Dong et al. (2016) Dong C, Loy CC, Tang X (2016) Accelerating the super-resolution convolutional neural network. In: Proceedings of the European Conference on Computer Vision (ECCV)
  • Fang et al. (2023) Fang G, Ma X, Wang X (2023) Structural pruning for diffusion models. In: Proceedings of Advances in Neural Information Processing Systems (NeurIPS)
  • Feng et al. (2023) Feng W, He X, Fu TJ, Jampani V, Akula A, Narayana P, Basu S, Wang XE, Wang WY (2023) Training-free structured diffusion guidance for compositional text-to-image synthesis. Proceedings of International Conference on Learning Representations (ICLR)
  • Fritsche et al. (2019) Fritsche M, Gu S, Timofte R (2019) Frequency separation for real-world super-resolution. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCV-W)
  • Gal et al. (2023) Gal R, Arar M, Atzmon Y, Bermano AH, Chechik G, Cohen-Or D (2023) Designing an encoder for fast personalization of text-to-image models. arXiv preprint arXiv:2302.12228
  • Gu et al. (2020) Gu J, Shen Y, Zhou B (2020) Image processing using multi-code gan prior. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Gu et al. (2019) Gu S, Lugmayr A, Danelljan M, Fritsche M, Lamour J, Timofte R (2019) Div8k: Diverse 8k resolution image dataset. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCV-W)
  • Gu et al. (2022) Gu S, Chen D, Bao J, Wen F, Zhang B, Chen D, Yuan L, Guo B (2022) Vector quantized diffusion model for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • He et al. (2019) He X, Mo Z, Wang P, Liu Y, Yang M, Cheng J (2019) Ode-inspired network design for single image super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Hertz et al. (2022) Hertz A, Mokady R, Tenenbaum J, Aberman K, Pritch Y, Cohen-Or D (2022) Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626
  • Heusel et al. (2017) Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. Proceedings of Advances in Neural Information Processing Systems (NeurIPS)
  • Ho and Salimans (2021) Ho J, Salimans T (2021) Classifier-free diffusion guidance. In: Proceedings of Advances in Neural Information Processing Systems (NeurIPS)
  • Ho et al. (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. In: Proceedings of Advances in Neural Information Processing Systems (NeurIPS), vol 33
  • Howard et al. (2019) Howard A, Sandler M, Chu G, Chen LC, Chen B, Tan M, Wang W, Zhu Y, Pang R, Vasudevan V, et al. (2019) Searching for mobilenetv3. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
  • Hu et al. (2022) Hu EJ, Wallis P, Allen-Zhu Z, Li Y, Wang S, Wang L, Chen W, et al. (2022) Lora: Low-rank adaptation of large language models. In: Proceedings of International Conference on Learning Representations (ICLR)
  • Ignatov et al. (2017) Ignatov A, Kobyshev N, Timofte R, Vanhoey K, Van Gool L (2017) Dslr-quality photos on mobile devices with deep convolutional networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
  • Ji et al. (2020) Ji X, Cao Y, Tai Y, Wang C, Li J, Huang F (2020) Real-world super-resolution via kernel estimation and noise injection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR-W)
  • Jiang et al. (2021) Jiang Y, Chan KC, Wang X, Loy CC, Liu Z (2021) Robust reference-based super-resolution via c2-matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Jiménez (2023) Jiménez ÁB (2023) Mixture of diffusers for scene composition and high resolution image generation. arXiv preprint arXiv:2302.02412
  • Karras et al. (2019) Karras T, Laine S, Aila T (2019) A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Karras et al. (2022) Karras T, Aittala M, Aila T, Laine S (2022) Elucidating the design space of diffusion-based generative models. In: Proceedings of Advances in Neural Information Processing Systems (NeurIPS)
  • Ke et al. (2021) Ke J, Wang Q, Wang Y, Milanfar P, Yang F (2021) Musiq: Multi-scale image quality transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
  • Kingma and Ba (2014) Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980
  • Ledig et al. (2017) Ledig C, Theis L, Huszár F, Caballero J, Cunningham A, Acosta A, Aitken A, Tejani A, Totz J, Wang Z, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Li et al. (2022) Li H, Yang Y, Chang M, Chen S, Feng H, Xu Z, Li Q, Chen Y (2022) SRDiff: Single image super-resolution with diffusion probabilistic models. Neurocomputing
  • Liang et al. (2021) Liang J, Cao J, Sun G, Zhang K, Van Gool L, Timofte R (2021) SwinIR: Image restoration using swin transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCV-W)
  • Liang et al. (2022) Liang J, Zeng H, Zhang L (2022) Efficient and degradation-adaptive network for real-world image super-resolution. In: Proceedings of the European Conference on Computer Vision (ECCV)
  • Lin et al. (2023) Lin X, He J, Chen Z, Lyu Z, Fei B, Dai B, Ouyang W, Qiao Y, Dong C (2023) Diffbir: Towards blind image restoration with generative diffusion prior. arXiv preprint arXiv:2308.15070
  • Liu et al. (2021) Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
  • Lu et al. (2022) Lu C, Zhou Y, Bao F, Chen J, Li C, Zhu J (2022) Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. In: Proceedings of Advances in Neural Information Processing Systems (NeurIPS)
  • Luo et al. (2023) Luo S, Tan Y, Huang L, Li J, Zhao H (2023) Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378
  • Maeda (2020) Maeda S (2020) Unpaired image super-resolution using pseudo-supervision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Meng and Kabashima (2022) Meng X, Kabashima Y (2022) Diffusion model based posterior sampling for noisy linear inverse problems. arXiv preprint arXiv:2211.12343
  • Menon et al. (2020) Menon S, Damian A, Hu S, Ravi N, Rudin C (2020) Pulse: Self-supervised photo upsampling via latent space exploration of generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Molad et al. (2023) Molad E, Horwitz E, Valevski D, Acha AR, Matias Y, Pritch Y, Leviathan Y, Hoshen Y (2023) Dreamix: Video diffusion models are general video editors. arXiv preprint arXiv:2302.01329
  • Mou et al. (2024) Mou C, Wang X, Xie L, Wu Y, Zhang J, Qi Z, Shan Y (2024) T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In: Proceedings of the AAAI Conference on Artificial Intelligence
  • Nichol et al. (2022) Nichol AQ, Dhariwal P, Ramesh A, Shyam P, Mishkin P, Mcgrew B, Sutskever I, Chen M (2022) Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In: Proceedings of International Conference on Machine Learning (ICML)
  • Oord et al. (2018) Oord Avd, Li Y, Vinyals O (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748
  • Pan et al. (2021) Pan X, Zhan X, Dai B, Lin D, Loy CC, Luo P (2021) Exploiting deep generative prior for versatile image restoration and manipulation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
  • Podell et al. (2023) Podell D, English Z, Lacey K, Blattmann A, Dockhorn T, Müller J, Penna J, Rombach R (2023) Sdxl: Improving latent diffusion models for high-resolution image synthesis. In: Proceedings of International Conference on Learning Representations (ICLR)
  • Qi et al. (2023) Qi C, Cun X, Zhang Y, Lei C, Wang X, Shan Y, Chen Q (2023) Fatezero: Fusing attentions for zero-shot text-based video editing. arXiv preprint arXiv:2303.09535
  • Ramesh et al. (2021) Ramesh A, Pavlov M, Goh G, Gray S, Voss C, Radford A, Chen M, Sutskever I (2021) Zero-shot text-to-image generation. In: Proceedings of International Conference on Machine Learning (ICML)
  • Ramesh et al. (2022) Ramesh A, Dhariwal P, Nichol A, Chu C, Chen M (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125
  • Rombach et al. (2022) Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B (2022) High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Ronneberger et al. (2015) Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer, pp 234–241
  • Sahak et al. (2023) Sahak H, Watson D, Saharia C, Fleet D (2023) Denoising diffusion probabilistic models for robust image super-resolution in the wild. arXiv preprint arXiv:2302.07864
  • Saharia et al. (2022a) Saharia C, Chan W, Saxena S, Li L, Whang J, Denton E, Ghasemipour SKS, Gontijo-Lopes R, Ayan BK, Salimans T, et al. (2022a) Photorealistic text-to-image diffusion models with deep language understanding. In: Proceedings of Advances in Neural Information Processing Systems (NeurIPS)
  • Saharia et al. (2022b) Saharia C, Ho J, Chan W, Salimans T, Fleet DJ, Norouzi M (2022b) Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
  • Sajjadi et al. (2017) Sajjadi MS, Schölkopf B, Hirsch M (2017) Enhancenet: Single image super-resolution through automated texture synthesis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
  • Salimans and Ho (2021) Salimans T, Ho J (2021) Progressive distillation for fast sampling of diffusion models. In: Proceedings of International Conference on Learning Representations (ICLR)
  • Sauer et al. (2023) Sauer A, Lorenz D, Blattmann A, Rombach R (2023) Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042
  • Sohl-Dickstein et al. (2015) Sohl-Dickstein J, Weiss E, Maheswaranathan N, Ganguli S (2015) Deep unsupervised learning using nonequilibrium thermodynamics. In: Proceedings of International Conference on Machine Learning (ICML)
  • Song et al. (2020) Song J, Meng C, Ermon S (2020) Denoising diffusion implicit models. In: Proceedings of International Conference on Learning Representations (ICLR)
  • Song et al. (2023a) Song J, Vahdat A, Mardani M, Kautz J (2023a) Pseudoinverse-guided diffusion models for inverse problems. In: Proceedings of International Conference on Learning Representations (ICLR)
  • Song et al. (2023b) Song Y, Dhariwal P, Chen M, Sutskever I (2023b) Consistency models. arXiv preprint arXiv:2303.01469
  • Thorndike et al. (1920) Thorndike EL, et al. (1920) A constant error in psychological ratings. Journal of applied psychology
  • Timofte et al. (2017) Timofte R, Agustsson E, Van Gool L, Yang MH, Zhang L (2017) Ntire 2017 challenge on single image super-resolution: Methods and results. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR-W)
  • Wan et al. (2020) Wan Z, Zhang B, Chen D, Zhang P, Chen D, Liao J, Wen F (2020) Bringing old photos back to life. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Wang et al. (2023) Wang J, Chan KC, Loy CC (2023) Exploring clip for assessing the look and feel of images. In: Proceedings of the AAAI Conference on Artificial Intelligence
  • Wang et al. (2021a) Wang L, Wang Y, Dong X, Xu Q, Yang J, An W, Guo Y (2021a) Unsupervised degradation representation learning for blind super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Wang et al. (2018a) Wang X, Yu K, Dong C, Loy CC (2018a) Recovering realistic texture in image super-resolution by deep spatial feature transform. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Wang et al. (2018b) Wang X, Yu K, Wu S, Gu J, Liu Y, Dong C, Qiao Y, Loy CC (2018b) Esrgan: Enhanced super-resolution generative adversarial networks. In: Proceedings of the European Conference on Computer Vision Workshops (ECCV-W)
  • Wang et al. (2021b) Wang X, Li Y, Zhang H, Shan Y (2021b) Towards real-world blind face restoration with generative facial prior. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Wang et al. (2021c) Wang X, Xie L, Dong C, Shan Y (2021c) Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCV-W)
  • Wang et al. (2022) Wang Y, Yu J, Zhang J (2022) Zero-shot image restoration using denoising diffusion null-space model. In: Proceedings of International Conference on Learning Representations (ICLR)
  • Wei et al. (2020) Wei P, Xie Z, Lu H, Zhan Z, Ye Q, Zuo W, Lin L (2020) Component divide-and-conquer for real-world image super-resolution. In: Proceedings of the European Conference on Computer Vision (ECCV)
  • Wei et al. (2021) Wei Y, Gu S, Li Y, Timofte R, Jin L, Song H (2021) Unsupervised real-world image super resolution via domain-distance aware training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Wu et al. (2022) Wu JZ, Ge Y, Wang X, Lei SW, Gu Y, Hsu W, Shan Y, Qie X, Shou MZ (2022) Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation. arXiv preprint arXiv:2212.11565
  • Xu et al. (2017) Xu X, Sun D, Pan J, Zhang Y, Pfister H, Yang MH (2017) Learning to super-resolve blurry face and text images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
  • Xu et al. (2019) Xu X, Ma Y, Sun W (2019) Towards real scene super-resolution with raw images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Yang et al. (2020) Yang F, Yang H, Fu J, Lu H, Guo B (2020) Learning texture transformer network for image super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Yang et al. (2021a) Yang S, Sohl-Dickstein J, Kingma DP, Kumar A, Ermon S, Poole B (2021a) Score-based generative modeling through stochastic differential equations. In: Proceedings of International Conference on Learning Representations (ICLR)
  • Yang et al. (2021b) Yang T, Ren P, Xie X, Zhang L (2021b) Gan prior embedded network for blind face restoration in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Yu et al. (2024) Yu F, Gu J, Li Z, Hu J, Kong X, Wang X, He J, Qiao Y, Dong C (2024) Scaling up to excellence: Practicing model scaling for photo-realistic image restoration in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Yu et al. (2018) Yu K, Dong C, Lin L, Loy CC (2018) Crafting a toolchain for image restoration by deep reinforcement learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Yue and Loy (2022) Yue Z, Loy CC (2022) Difface: Blind face restoration with diffused error contraction. arXiv preprint arXiv:2212.06512
  • Yue et al. (2023) Yue Z, Wang J, Loy CC (2023) Resshift: Efficient diffusion model for image super-resolution by residual shifting. In: Proceedings of Advances in Neural Information Processing Systems (NeurIPS)
  • Zhang et al. (2021a) Zhang J, Lu S, Zhan F, Yu Y (2021a) Blind image super-resolution via contrastive representation learning. arXiv preprint arXiv:2107.00708
  • Zhang et al. (2021b) Zhang K, Liang J, Van Gool L, Timofte R (2021b) Designing a practical degradation model for deep blind image super-resolution. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
  • Zhang et al. (2023) Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
  • Zhang et al. (2018a) Zhang R, Isola P, Efros AA, Shechtman E, Wang O (2018a) The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Zhang et al. (2018b) Zhang Y, Li K, Li K, Wang L, Zhong B, Fu Y (2018b) Image super-resolution using very deep residual channel attention networks. In: Proceedings of the European Conference on Computer Vision (ECCV)
  • Zhang et al. (2019) Zhang Z, Wang Z, Lin Z, Qi H (2019) Image super-resolution by neural texture transfer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Zhao et al. (2022) Zhao Y, Su YC, Chu CT, Li Y, Renn M, Zhu Y, Chen C, Jia X (2022) Rethinking deep face restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Zheng et al. (2018) Zheng H, Ji M, Wang H, Liu Y, Fang L (2018) Crossnet: An end-to-end reference-based super resolution network using cross-scale warping. In: Proceedings of the European Conference on Computer Vision (ECCV)
  • Zhou et al. (2020) Zhou S, Zhang J, Zuo W, Loy CC (2020) Cross-scale internal graph neural network for image super-resolution. In: Proceedings of Advances in Neural Information Processing Systems (NeurIPS)
  • Zhou et al. (2022) Zhou S, Chan KC, Li C, Loy CC (2022) Towards robust blind face restoration with codebook lookup transformer. In: Proceedings of Advances in Neural Information Processing Systems (NeurIPS)
  • Zhu et al. (2017) Zhu JY, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)