Affiliation (all authors): S-Lab, Nanyang Technological University, Singapore. Corresponding author: Chen Change Loy.

Exploiting Diffusion Prior for Real-World Image Super-Resolution

Jianyi Wang Zongsheng Yue Shangchen Zhou Kelvin C.K. Chan Chen Change Loy

(Received: date / Accepted: date)

Abstract

We present a novel approach to leverage prior knowledge encapsulated in pre-trained text-to-image diffusion models for blind super-resolution (SR). Specifically, by employing our time-aware encoder, we can achieve promising restoration results without altering the pre-trained synthesis model, thereby preserving the generative prior and minimizing training cost. To remedy the loss of fidelity caused by the inherent stochasticity of diffusion models, we employ a controllable feature wrapping module that allows users to balance quality and fidelity by simply adjusting a scalar value during the inference process. Moreover, we develop a progressive aggregation sampling strategy to overcome the fixed-size constraints of pre-trained diffusion models, enabling adaptation to resolutions of any size. A comprehensive evaluation of our method using both synthetic and real-world benchmarks demonstrates its superiority over current state-of-the-art approaches. Code and models are available at https://github.com/IceClear/StableSR.

Keywords: Super-resolution, image restoration, diffusion models, generative prior

Figure 1: Qualitative comparisons of BSRGAN (Zhang et al., 2021b), Real-ESRGAN+ (Wang et al., 2021c), FeMaSR (Chen et al., 2022), LDM (Rombach et al., 2022), and our StableSR on real-world examples. (Zoom in for details)

1 Introduction

We have seen significant advancements in diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Yang et al., 2021a; Nichol et al., 2022) for the task of image synthesis. Existing studies demonstrate that the diffusion prior, embedded in synthesis models like Stable Diffusion (Rombach et al., 2022), can be applied to various downstream content creation tasks, including image (Choi et al., 2021; Avrahami et al., 2022; Hertz et al., 2022; Gu et al., 2022; Mou et al., 2024; Zhang et al., 2023; Gal et al., 2023) and video (Wu et al., 2022; Molad et al., 2023; Qi et al., 2023) editing. In this study, we extend the exploration beyond the realm of content creation and examine the potential benefits of using diffusion prior for super-resolution (SR). This low-level vision task presents an additional non-trivial challenge, as it requires high image fidelity in its generated content, which stands in contrast to the stochastic nature of diffusion models.

A common solution to the challenge above involves training an SR model from scratch (Saharia et al., 2022b; Rombach et al., 2022; Sahak et al., 2023; Li et al., 2022). To preserve fidelity, these methods use the low-resolution (LR) image as an additional input to constrain the output space. While these methods have achieved notable success, they often demand significant computational resources to train the diffusion model. Moreover, training a network from scratch can potentially jeopardize the generative priors captured in synthesis models, leading to suboptimal performance in the final network. These limitations have inspired an alternative approach (Choi et al., 2021; Wang et al., 2022; Chung et al., 2022; Song et al., 2023a; Meng and Kabashima, 2022), which involves incorporating constraints into the reverse diffusion process of a pre-trained synthesis model. This paradigm avoids the need for model training while leveraging the diffusion prior. However, designing these constraints assumes that the image degradations are known a priori, whereas in practice they are typically unknown and complex. Consequently, such methods exhibit limited generalizability.

In this study, we present StableSR, an approach that preserves pre-trained diffusion priors without making explicit assumptions about the degradations. Specifically, unlike previous works (Saharia et al., 2022b; Rombach et al., 2022; Sahak et al., 2023; Li et al., 2022) that concatenate the LR image to intermediate outputs, which requires one to train a diffusion model from scratch, our method only needs to fine-tune a lightweight time-aware encoder and a few feature modulation layers for the SR task. When applying diffusion models for SR, the LR condition should provide adaptive guidance for each diffusion step during the restoration process, i.e., stronger guidance at earlier iterations to maintain fidelity and weaker guidance later to avoid introducing degradations. To this end, our encoder incorporates a time embedding layer to generate time-aware features, allowing the features in the diffusion model to be adaptively modulated at different iterations. Besides gaining improved training efficiency, keeping the original diffusion model frozen helps preserve the generative prior, which grants StableSR the capability of generating visually pleasant SR details and avoids overfitting to high-frequency degradations. Our experiments show that both the time-aware property of our encoder and the diffusion prior are crucial for achieving SR performance improvements.

To suppress randomness inherited from the diffusion model as well as the information loss due to the encoding process of the autoencoder (Rombach et al., 2022), inspired by CodeFormer (Zhou et al., 2022), we apply a controllable feature wrapping module (CFW) with an adjustable coefficient to refine the outputs of the diffusion model during the decoding process of the autoencoder. Unlike CodeFormer, the multiple-step sampling nature of diffusion models makes it hard to finetune the CFW module directly. We overcome this issue by first generating synthetic LR-HR pairs following the same degradation pipeline as the diffusion training stage. Then, we obtain the corresponding latent codes using our finetuned diffusion model given the LR images as conditions. In this way, CFW can be trained using the generated data.

Applying diffusion models to arbitrary resolutions has remained a persistent challenge, especially for the SR task. A simple solution would be to split the image into patches and process each independently. However, this method often leads to boundary discontinuity in the output. To address this issue, we introduce a progressive aggregation sampling strategy. Inspired by Jiménez (Jiménez, 2023), our approach involves dividing the image into overlapping patches and fusing these patches using a Gaussian kernel at each diffusion iteration. This process smooths out the boundaries, resulting in a more coherent output. To avoid altering the output resolution of SR images, the overlapping sizes at the right and bottom boundaries are dynamically adjusted to fit the target resolution.

Adapting generative priors for real-world image super-resolution presents an intriguing yet challenging problem, and in this work, we offer a novel approach as a solution. We introduce a fine-tuning method that leverages pre-trained diffusion models without making explicit assumptions about degradations. We address key challenges, such as fidelity and arbitrary resolution, by proposing simple yet effective modules. With our time-aware encoder, controllable feature wrapping module, and progressive aggregation sampling strategy, our StableSR serves as a strong baseline that inspires future research in adopting diffusion priors for restoration tasks.

2 Related Work

Image Super-Resolution. Image Super-Resolution (SR) aims to restore an HR image from its degraded LR observation. Early SR approaches (Dai et al., 2019; Dong et al., 2014, 2015, 2016; He et al., 2019; Xu et al., 2019; Zhang et al., 2018b; Chen et al., 2021; Liang et al., 2021; Wang et al., 2018b; Ledig et al., 2017; Sajjadi et al., 2017; Xu et al., 2017; Zhou et al., 2020) assume a pre-defined degradation process, e.g., bicubic downsampling and blurring with known parameters. While these methods can achieve appealing performance on the synthetic data with the same degradation, their performance deteriorates significantly in real-world scenarios due to the limited generalizability.

Recent works have moved their focus from synthetic settings to blind SR, where the degradation is unknown and similar to real-world scenarios. Due to the lack of real-world paired data for training, some methods (Fritsche et al., 2019; Maeda, 2020; Wan et al., 2020; Wang et al., 2021a; Wei et al., 2021; Zhang et al., 2021a) propose to implicitly learn a degradation model from LR images in an unsupervised manner such as Cycle-GAN (Zhu et al., 2017) and contrastive learning (Oord et al., 2018). In addition to unsupervised learning, other approaches aim to explicitly synthesize LR-HR image pairs that resemble real-world data. Specifically, BSRGAN (Zhang et al., 2021b) and Real-ESRGAN (Wang et al., 2021c) present effective degradation pipelines for blind SR in real world. Building upon such degradation pipelines, recent works based on diffusion models (Saharia et al., 2022b; Sahak et al., 2023) further show competitive performance on real-world image SR. In this work, we consider an orthogonal direction of fine-tuning a diffusion model for SR. In this way, the computational cost of network training could be reduced. Moreover, our method allows the exploitation of generative prior encapsulated in the synthesis model, leading to better performance.

Prior for Image Super-Resolution. To further enhance performance in complex real-world SR scenarios, numerous prior-based approaches have been proposed. These techniques deploy additional image priors to bolster the generation of faithful textures. A straightforward method is reference-based SR (Zheng et al., 2018; Zhang et al., 2019; Yang et al., 2020; Jiang et al., 2021; Zhou et al., 2020). This involves using one or several reference high-resolution (HR) images, which share similar textures with the input low-resolution (LR) image, as an explicit prior to aid in generating the corresponding HR output. However, aligning features of the reference with the LR input can be challenging in real-world cases, and such explicit priors are not always readily available. Recent works have moved away from relying on explicit priors, finding more promising performance with implicit priors instead. Wang et al. (Wang et al., 2018a) were the first to propose the use of semantic segmentation probability maps for guiding SR in the feature space. Subsequent works (Menon et al., 2020; Gu et al., 2020; Wang et al., 2021b; Pan et al., 2021; Chan et al., 2021, 2022a; Yang et al., 2021b) employed pre-trained GANs by exploring the corresponding high-resolution latent space of the low-resolution input. While effective, the implicit priors used in these approaches are often tailored for specific scenarios, such as limited categories (Wang et al., 2018a; Gu et al., 2020; Pan et al., 2021; Chan et al., 2021) and faces (Menon et al., 2020; Wang et al., 2021b; Yang et al., 2021b), and therefore lack generalizability for complex real-world SR tasks. Other implicit priors for image SR include mixtures of degradation experts (Yu et al., 2018; Liang et al., 2022) and VQGAN (Zhao et al., 2022; Chen et al., 2022; Zhou et al., 2022). 
However, these methods fall short, either due to insufficient prior expressiveness (Yu et al., 2018; Zhao et al., 2022; Liang et al., 2022) or inaccurate feature matching (Chen et al., 2022), resulting in output quality that remains less than satisfactory.

In contrast to existing strategies, we set our sights on exploring the robust and extensive generative prior found in pre-trained diffusion models (Nichol et al., 2022; Rombach et al., 2022; Ramesh et al., 2021; Saharia et al., 2022a; Ramesh et al., 2022). While recent studies (Choi et al., 2021; Avrahami et al., 2022; Hu et al., 2022; Zhang et al., 2023; Mou et al., 2024) have highlighted the remarkable generative abilities of pre-trained diffusion models, the high-fidelity requirement inherent in super-resolution (SR) makes it unfeasible to directly adopt these methods for this task. Our proposed StableSR, unlike LDM (Rombach et al., 2022), does not necessitate training from scratch. Instead, it shares a similar idea to concurrent works (Zhang et al., 2023; Mou et al., 2024) by fine-tuning directly on a frozen pre-trained diffusion model with only a small number of trainable parameters, leading to superior performance in a more efficient way. In practice, our approach further shows comparable performance with follow-up works (Lin et al., 2023; Yu et al., 2024), which also exploit the diffusion prior but follow the ControlNet-like (Zhang et al., 2023) framework. We provide a detailed comparison with these works in a following section.

3 Methodology

Our method employs diffusion prior for SR. Inspired by the generative capabilities of Stable Diffusion (Rombach et al., 2022), we use it as the diffusion prior in our work, hence the name StableSR for our method. The main component of StableSR is a time-aware encoder, which is trained along with a frozen Stable Diffusion model to allow for conditioning based on the input image. To further facilitate a trade-off between realism and fidelity, depending on user preference, we follow CodeFormer (Zhou et al., 2022) to introduce an optional controllable feature wrapping module. The overall framework of StableSR is depicted in Fig. 2.

3.1 Guided Finetuning with Time Awareness

To exploit the prior knowledge of Stable Diffusion for SR, we establish the following constraints when designing our model: 1) The resulting model must have the ability to generate a plausible HR image, conditioned on the observed LR input. This is vital because the LR image is the only source of structural information, which is crucial for maintaining high fidelity. 2) The model should introduce only minimal alterations to the original Stable Diffusion model to prevent disrupting the prior encapsulated within it.

Feature Modulation. While several existing approaches (Nichol et al., 2022; Rombach et al., 2022; Hertz et al., 2022; Feng et al., 2023; Balaji et al., 2022) have successfully controlled the generated semantic structure of a diffusion model via cross-attention, such a strategy can hardly provide detailed and high-frequency guidance due to insufficient inductive bias (Liu et al., 2021). To more accurately guide the generation process, we adopt an additional encoder to extract multi-scale features $\{\bm{F}^{n}\}^{N}_{n=1}$ from the degraded LR image features, and use them to modulate the intermediate feature maps $\{\bm{F}^{n}_{\rm dif}\}^{N}_{n=1}$ of the residual blocks in Stable Diffusion via spatial feature transformations (SFT) (Wang et al., 2018a):

$$\hat{\bm{F}}^{n}_{\rm dif}=(1+\bm{\alpha}^{n})\odot\bm{F}^{n}_{\rm dif}+\bm{\beta}^{n};\quad \bm{\alpha}^{n},\bm{\beta}^{n}=\mathcal{M}^{n}_{\theta}(\bm{F}^{n}), \tag{1}$$

where $\bm{\alpha}^{n}$ and $\bm{\beta}^{n}$ denote the affine parameters in SFT and $\mathcal{M}^{n}_{\theta}$ denotes a small network consisting of several convolutional layers. Here $n$ indexes the spatial scale of the UNet (Ronneberger et al., 2015) architecture in Stable Diffusion.

During finetuning, we freeze the weights of Stable Diffusion and train only the encoder and SFT layers. This strategy allows us to insert structural information extracted from the LR image without destroying the generative prior captured by Stable Diffusion.
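As a minimal sketch of the modulation in Eq. (1), the snippet below applies the affine transform element-wise to a flattened feature vector. It assumes that $\bm{\alpha}^{n}$ and $\bm{\beta}^{n}$ have already been predicted; the small network $\mathcal{M}^{n}_{\theta}$ is not modeled, and the function name `sft_modulate` is hypothetical.

```python
def sft_modulate(feature, alpha, beta):
    """Element-wise form of Eq. (1): (1 + alpha) * feature + beta.

    `feature`, `alpha`, and `beta` are flat lists of equal length that
    stand in for one feature map and its predicted affine parameters.
    """
    return [(1.0 + a) * f + b for f, a, b in zip(feature, alpha, beta)]


feature = [0.5, -1.0, 2.0]

# With alpha = beta = 0 the frozen diffusion features pass through unchanged.
identity = sft_modulate(feature, [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])

# Non-zero parameters rescale and shift each position independently.
modulated = sft_modulate(feature, [1.0, 0.0, -0.5], [0.1, 0.2, 0.0])
```

The `1 + alpha` form means that a zero prediction leaves the pre-trained features untouched, which fits the goal of altering the original model as little as possible.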

Time-aware Guidance. We find that incorporating temporal information through a time-embedding layer in our encoder considerably enhances both the quality of generation and the fidelity to the ground truth, since it can adaptively adjust the condition strength derived from the LR features. Here, we analyze this phenomenon from a signal-to-noise ratio (SNR) standpoint and later quantitatively and qualitatively validate it in the ablation study.

During the generation process, the SNR of the produced image progressively increases as noise is incrementally removed. A recent study (Choi et al., 2022) indicates that image content is rapidly populated when the SNR approaches $5\times 10^{-2}$. In line with this observation, we notice that the time embedding enables the conditional encoder to provide stronger guidance within the range where the signal-to-noise ratio (SNR) reaches $5\times 10^{-2}$. This is essential because the content generated at this stage significantly influences the super-resolution performance of our method. To further substantiate this, since the conditional features are inserted into the diffusion prior via SFT layers, we employ the cosine similarity between the features of Stable Diffusion before and after the SFT to measure the condition strength provided by the encoder. The cosine similarity values at different timesteps are plotted in Fig. 3-(a). As can be observed, the cosine similarity reaches its minimum value around an SNR of $5\times 10^{-2}$, indicative of the strongest conditions imposed by the encoder. In addition, we also depict the feature maps extracted from our specially designed encoder in Fig. 3-(b). It is noticeable that the features around the SNR point of $5\times 10^{-2}$ are sharper and contain more detailed image structures. We hypothesize that these adaptive feature conditions can furnish more comprehensive guidance for SR.
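The condition-strength measure described above can be sketched as a plain cosine similarity between a feature vector before and after SFT. This is only an illustration on flattened lists; the function name and the toy vectors are hypothetical, and a lower similarity is read as stronger conditioning.

```python
import math


def cosine_similarity(before, after):
    """Cosine similarity between two flattened feature vectors."""
    dot = sum(a * b for a, b in zip(before, after))
    norm_before = math.sqrt(sum(a * a for a in before))
    norm_after = math.sqrt(sum(b * b for b in after))
    return dot / (norm_before * norm_after)


features_before = [1.0, 2.0, 3.0]

# Pure rescaling keeps the direction, so the similarity stays at 1
# (weak conditioning in the sense used above).
weak = cosine_similarity(features_before, [2.0, 4.0, 6.0])

# A change of direction lowers the similarity (stronger conditioning).
strong = cosine_similarity(features_before, [3.0, 0.0, -1.0])
```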

Color Correction. Diffusion models can occasionally exhibit color shifts, as noted in (Choi et al., 2022). To address this issue, we perform color normalization on the generated image to align its mean and variance with those of the LR input. In particular, if we let $\bm{x}$ denote the LR input and $\hat{\bm{y}}$ represent the generated HR image, the color-corrected output, $\bm{y}$ , is calculated as follows:

$$\bm{y}^{c}=\frac{\hat{\bm{y}}^{c}-\bm{\mu}_{\hat{\bm{y}}}^{c}}{\bm{\sigma}_{\hat{\bm{y}}}^{c}}\cdot\bm{\sigma}_{x}^{c}+\bm{\mu}_{x}^{c}, \tag{2}$$

where $c\in\{r,g,b\}$ denotes the color channel, $\bm{\mu}^{c}_{\hat{\bm{y}}}$ and $\bm{\sigma}^{c}_{\hat{\bm{y}}}$ (or $\bm{\mu}^{c}_{x}$ and $\bm{\sigma}^{c}_{x}$) are the mean and standard deviation estimated from the $c$-th channel of $\hat{\bm{y}}$ (or $\bm{x}$), respectively. We find that this simple correction suffices to remedy the color difference.
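A minimal sketch of Eq. (2) follows, using only the Python standard library. Each channel is a flat list of pixel values; the use of the population standard deviation and the small `eps` guard against division by zero are assumptions of this sketch, and the function names are hypothetical.

```python
from statistics import fmean, pstdev


def match_channel(y_hat, x, eps=1e-8):
    """Eq. (2) for one channel: align mean and std of y_hat with those of x."""
    mu_y, sigma_y = fmean(y_hat), pstdev(y_hat)
    mu_x, sigma_x = fmean(x), pstdev(x)
    # eps is an added safeguard for flat channels (not part of Eq. (2)).
    return [(v - mu_y) / (sigma_y + eps) * sigma_x + mu_x for v in y_hat]


def color_correct(y_hat_rgb, x_rgb):
    """Apply the channel-wise correction to dicts keyed by 'r', 'g', 'b'."""
    return {c: match_channel(y_hat_rgb[c], x_rgb[c]) for c in "rgb"}


generated = {"r": [0.2, 0.4, 0.9, 0.5], "g": [0.1, 0.1, 0.3, 0.7], "b": [0.6, 0.2, 0.4, 0.8]}
lr_input = {"r": [0.1, 0.3], "g": [0.4, 0.6], "b": [0.0, 1.0]}
corrected = color_correct(generated, lr_input)
```

Because only per-channel statistics are used, the LR input and the generated image do not need to share a resolution in this sketch.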

Though pixel color correction via channel matching can improve color fidelity, we notice that it may suffer from limited color correction ability due to the lack of pixel-wise controllability. The main reason is that it only introduces global statistics, i.e., channel-wise mean and variance of the input for color correction, ignoring pixel-wise semantics. Besides adopting color correction in the pixel domain, we further propose wavelet-based color correction for better visual performance in some cases. Wavelet color correction directly introduces the low-frequency part from the input since the color information belongs to the low-frequency components, while the degradations are mostly high-frequency components. In this way, we can improve the color fidelity of the results without perceptibly affecting the generated quality. Given any image $\bm{I}$ , we extract its high-frequency component $\bm{H}^{i}$ and low-frequency component $\bm{L}^{i}$ at the $i$ -th ( $1\leq i\leq l$ ) scale via the wavelet decomposition, i.e.,

$$\bm{L}^{i}=\mathcal{C}_{i}(\bm{L}^{i-1},\bm{k}),\quad \bm{H}^{i}=\bm{L}^{i-1}-\bm{L}^{i}, \tag{3}$$

where $\bm{L}^{0}=\bm{I}$, $\mathcal{C}_{i}$ denotes the convolutional operator with a dilation of $2^{i}$, and $\bm{k}$ is the convolutional kernel defined as:

$$\bm{k}=\begin{bmatrix}\nicefrac{1}{16}&\nicefrac{1}{8}&\nicefrac{1}{16}\\ \nicefrac{1}{8}&\nicefrac{1}{4}&\nicefrac{1}{8}\\ \nicefrac{1}{16}&\nicefrac{1}{8}&\nicefrac{1}{16}\end{bmatrix}. \tag{4}$$

By denoting the $l$ -th low-frequency and high-frequency components of $\bm{x}$ (or $\hat{\bm{y}}$ ) as $\bm{L}^{l}_{x}$ and $\bm{H}^{l}_{x}$ (or $\bm{L}^{l}_{y}$ and $\bm{H}^{l}_{y}$ ), the desired HR output $\bm{y}$ is formulated as follows:

$$\bm{y}=\bm{H}^{l}_{y}+\bm{L}^{l}_{x}. \tag{5}$$

Intuitively, we replace the low-frequency component $\bm{L}^{l}_{y}$ of $\hat{\bm{y}}$ with $\bm{L}^{l}_{x}$ to correct the color bias. By default, we adopt color correction in the pixel domain for simplicity.
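The low-frequency swap can be sketched on a single-channel image stored as a list of rows. This sketch makes several assumptions that the text does not state: the LR input has already been resized to the size of the generated image, borders use replicate padding, and the high-frequency part is taken as everything above the $l$-th low-frequency band, i.e., $\hat{\bm{y}}-\bm{L}^{l}_{y}$. All function names are hypothetical.

```python
KERNEL = [[1 / 16, 1 / 8, 1 / 16],
          [1 / 8, 1 / 4, 1 / 8],
          [1 / 16, 1 / 8, 1 / 16]]  # kernel k from Eq. (4)


def blur(img, dilation):
    """Dilated 3x3 convolution with replicate padding (padding is an assumption)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr = min(max(r + dr * dilation, 0), h - 1)
                    cc = min(max(c + dc * dilation, 0), w - 1)
                    acc += KERNEL[dr + 1][dc + 1] * img[rr][cc]
            out[r][c] = acc
    return out


def low_frequency(img, levels):
    """L^l from Eq. (3): repeated blurring with dilation 2^i for i = 1..l."""
    low = img
    for i in range(1, levels + 1):
        low = blur(low, 2 ** i)
    return low


def wavelet_color_fix(y_hat, x, levels=3):
    """Keep the detail of y_hat and take the low-frequency band from x."""
    low_y = low_frequency(y_hat, levels)
    low_x = low_frequency(x, levels)
    h, w = len(y_hat), len(y_hat[0])
    return [[y_hat[r][c] - low_y[r][c] + low_x[r][c] for c in range(w)]
            for r in range(h)]


generated = [[(r * 5 + c) / 25 for c in range(5)] for r in range(5)]
# A reference that differs from `generated` only by a global color offset.
reference = [[v + 0.1 for v in row] for row in generated]
fixed = wavelet_color_fix(generated, reference, levels=2)
```

In this toy case the reference differs from the generated image only by a constant offset, so the correction recovers the reference exactly up to rounding.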

3.2 Fidelity-Realism Trade-off

Although the output of the proposed approach is visually compelling, it often deviates from the ground truth due to the inherent stochasticity of the diffusion model. Drawing inspiration from CodeFormer (Zhou et al., 2022), we introduce a Controllable Feature Wrapping (CFW) module to flexibly manage the balance between realism and fidelity. Unlike CodeFormer, there are multiple sampling steps for generating a sample during inference and we cannot finetune the CFW module directly. To overcome this problem, we first generate synthetic LR-HR pairs following the same degradation pipeline as the diffusion training stage. Then, the latent codes $\bm{Z}_{0}$ can be obtained using our finetuned diffusion model given the LR images as conditions. Finally, CFW can be trained using the generated data.

Since Stable Diffusion is implemented in the latent space of an autoencoder, it is natural to leverage the encoder features of the autoencoder to modulate the corresponding decoder features for further fidelity improvement. Let $\bm{F}_{e}$ and $\bm{F}_{d}$ be the encoder and decoder features, respectively. We introduce an adjustable coefficient $w\in[0,1]$ to control the extent of modulation:

$$\bm{F}_{m}=\bm{F}_{d}+\mathcal{C}(\bm{F}_{e},\bm{F}_{d};\bm{\theta})\times w, \tag{6}$$

where $\mathcal{C}(\cdot;\bm{\theta})$ represents convolution layers with trainable parameter $\bm{\theta}$ . The overall framework is shown in Fig. 2.

In this design, a small $w$ exploits the generation capability of Stable Diffusion, leading to outputs with high realism under severe degradations. In contrast, a large $w$ allows stronger structural guidance from the LR image, enhancing fidelity. We observe that $w\,{=}\,0.5$ achieves a good balance between quality and fidelity. Note that we only train CFW in this particular stage. In practice, we notice that CFW involves additional GPU memory and the improvement can be subtle in some cases. Thus, we make it optional for different real-world applications.
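The role of the coefficient $w$ in Eq. (6) can be illustrated with a short sketch. The trainable convolution $\mathcal{C}(\bm{F}_{e},\bm{F}_{d};\bm{\theta})$ is not modeled here; its output is passed in as a precomputed list named `residual`, which, like the function name, is a hypothetical stand-in.

```python
def cfw_blend(decoder_features, residual, w):
    """Eq. (6): F_m = F_d + C(F_e, F_d; theta) * w, with C's output given as `residual`."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w is expected to lie in [0, 1]")
    return [d + r * w for d, r in zip(decoder_features, residual)]


f_d = [0.2, 0.8, -0.4]
residual = [0.1, -0.2, 0.4]  # placeholder for the convolution output

realism_first = cfw_blend(f_d, residual, 0.0)   # decoder features unchanged
balanced = cfw_blend(f_d, residual, 0.5)        # the default trade-off in the text
fidelity_first = cfw_blend(f_d, residual, 1.0)  # full encoder guidance
```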

3.3 Aggregation Sampling

Due to the heightened sensitivity of the attention layers in Stable Diffusion with respect to the image resolution, it tends to produce inferior outputs for resolutions differing from its training settings, specifically $512{\times}512$ . This, in effect, constrains the practicality of StableSR.

A common workaround involves splitting the larger image into several overlapping smaller patches and processing each individually. While this strategy often yields good results for conventional CNN-based SR methods, it is not directly applicable to the diffusion paradigm. This is because discrepancies between patches are compounded and magnified over the course of diffusion iterations. A typical failure case is illustrated in Fig. 4.

Inspired by Jiménez (Jiménez, 2023), we apply a progressive patch aggregation sampling algorithm to handle images of arbitrary resolutions. Specifically, we begin by encoding the LR image into a latent feature map $\bm{F}\in\mathbb{R}^{h\times w}$, which is then subdivided into $M$ overlapping small patches $\{\bm{F}_{\Omega_{n}}\}_{n=1}^{M}$, each with a resolution of $64\times 64$, matching the training resolution (the downsampling scale factor of the autoencoder in Stable Diffusion is $8\times$). Here, $\Omega_{n}$ is the coordinate set of the $n$th patch in $\bm{F}$. During each timestep in the reverse sampling, each patch is individually processed through StableSR, with the processed patches subsequently aggregated. To integrate overlapping patches, a weight map $\bm{w}_{\Omega_{n}}\in\mathbb{R}^{h\times w}$, whose entries follow a Gaussian filter in $\Omega_{n}$ and are 0 elsewhere, is generated for each patch $\bm{F}_{\Omega_{n}}$. Overlapping pixels are then weighted in accordance with their respective Gaussian weight maps. In particular, we follow Jiménez (Jiménez, 2023) to define a padding function $f(\cdot)$ that expands any patch of size $64\times 64$ to the resolution of $h\times w$ by filling zeros outside the region $\Omega_{n}$. This procedure is reiterated until reaching the final iteration.

Given the output of each patch as $\epsilon_{\bm{\theta}}(\bm{Z}^{(t)}_{\Omega_{n}},\bm{F}_{\Omega_{n}},t)$ , where $\bm{Z}^{(t)}_{\Omega_{n}}$ is the $n$ th patch of the noisy input $\bm{Z}^{(t)}$ and $\bm{\theta}$ is the parameters of the diffusion model, the results of all the patches aggregated together can be formulated as follows:

$$\epsilon_{\bm{\theta}}(\bm{Z}^{(t)},\bm{F},t)=\sum_{n=1}^{M}\frac{\bm{w}_{\Omega_{n}}}{\hat{\bm{w}}}\odot f\left(\epsilon_{\bm{\theta}}\left(\bm{Z}^{(t)}_{\Omega_{n}},\bm{F}_{\Omega_{n}},t\right)\right), \tag{7}$$

where $\hat{\bm{w}}=\sum_{n}\bm{w}_{\Omega_{n}}$. Based on $\epsilon_{\bm{\theta}}(\bm{Z}^{(t)},\bm{F},t)$, we can obtain $\bm{Z}^{(t-1)}$ according to the sampling procedure, denoted as ${\rm Sampler}(\bm{Z}^{(t)},\epsilon_{\bm{\theta}}(\bm{Z}^{(t)},\bm{F},t))$, in the diffusion model. Subsequently, we re-split $\bm{Z}^{(t-1)}$ into overlapping patches and repeat the above steps until $t=1$. The whole process is summarized in Algorithm 1. Our experiments suggest that this progressive aggregation method substantially mitigates discrepancies in the overlapped regions, as depicted in Fig. 4. More details can be found in the supplementary material.

Algorithm 1 Progressive Patch Aggregation

1: Require: cropped regions $\{\Omega_{n}\}_{n=1}^{M}$, diffusion steps $T$, LR latent features $\bm{F}$
2: Initialize $\bm{w}_{\Omega_{n}}$ and $\hat{\bm{w}}$
3: $\bm{Z}^{(T)}\sim{\cal N}(0,{\mathbb{I}})$
4: for $t\in[T,\ldots,1]$ do
5:   for $n\in[1,\ldots,M]$ do
6:     Compute $\epsilon_{\bm{\theta}}\left(\bm{Z}^{(t)}_{\Omega_{n}},\bm{F}_{\Omega_{n}},t\right)$
7:   end for
8:   Compute $\epsilon_{\bm{\theta}}(\bm{Z}^{(t)},\bm{F},t)$ following Eq. (7)
9:   $\bm{Z}^{(t-1)}={\rm Sampler}(\bm{Z}^{(t)},\epsilon_{\bm{\theta}}(\bm{Z}^{(t)},\bm{F},t))$
10: end for
11: return $\bm{Z}_{0}$
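The aggregation loop can be sketched in one dimension with the standard library alone. The sketch is illustrative rather than a reproduction of the released implementation: the denoiser and the sampler are passed in as hypothetical callables (`predict_patch`, `sampler`), the Gaussian width (`sigma_ratio`) is an assumed value, and the way the last patch is shifted to end at the boundary is one possible reading of the dynamic overlap adjustment described in the introduction.

```python
import math


def patch_starts(length, patch, stride):
    """Start offsets of overlapping patches; the last patch is shifted to end at `length`."""
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch, stride))
    starts.append(length - patch)
    return starts


def gaussian_weights(patch, sigma_ratio=0.25):
    """Gaussian weights over one patch (sigma_ratio is an assumed setting)."""
    center = (patch - 1) / 2
    sigma = patch * sigma_ratio
    return [math.exp(-((i - center) ** 2) / (2 * sigma ** 2)) for i in range(patch)]


def aggregate(z, predict_patch, patch, stride):
    """1-D analogue of Eq. (7): weighted average of per-patch predictions."""
    weights = gaussian_weights(patch)
    total = [0.0] * len(z)
    norm = [0.0] * len(z)  # plays the role of w_hat
    for start in patch_starts(len(z), patch, stride):
        eps = predict_patch(z[start:start + patch])
        for i, (e, w) in enumerate(zip(eps, weights)):
            total[start + i] += w * e
            norm[start + i] += w
    return [t / n for t, n in zip(total, norm)]


def progressive_sampling(z, predict_patch, sampler, steps, patch, stride):
    """Loop of Algorithm 1: aggregate, take one sampler step, then re-split."""
    for t in range(steps, 0, -1):
        eps = aggregate(z, predict_patch, patch, stride)
        z = sampler(z, eps, t)
    return z


# Toy stand-ins for the denoiser and the sampler (both hypothetical).
toy_predict = lambda patch_values: list(patch_values)
toy_sampler = lambda z, eps, t: [zi - 0.1 * ei for zi, ei in zip(z, eps)]

z_init = [float(i) for i in range(10)]
z_final = progressive_sampling(z_init, toy_predict, toy_sampler, steps=3, patch=4, stride=3)
```

Since every patch is re-extracted from the freshly aggregated state at each step, differences between neighbouring patches are averaged out before they can accumulate over iterations.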

4 Experiments

4.1 Implementation Details

StableSR is built on Stable Diffusion 2.1-base (https://huggingface.co/stabilityai/stable-diffusion-2-1-base). Our time-aware encoder is similar to the contracting path of the denoising U-Net in Stable Diffusion but is much more lightweight (${\sim}$105M, including SFT layers). SFT layers are inserted in each residual block of Stable Diffusion for effective control. We finetune the diffusion model of StableSR for $117$ epochs with a batch size of $192$, and the prompt is fixed as null. We follow Stable Diffusion to use the Adam optimizer (Kingma and Ba, 2014), and the learning rate is set to $5\times 10^{-5}$. The training process is conducted on $512\times 512$ resolution with 8 NVIDIA Tesla 32G-V100 GPUs. For inference, we adopt DDPM sampling (Ho et al., 2020) with 200 timesteps. To handle images with arbitrary sizes, we adopt the proposed aggregation sampling strategy for images beyond $512\times 512$. As for images under $512\times 512$, we first enlarge the LR images such that the shorter side has a length of $512$ and rescale the results back to target resolutions after generation.
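The size handling at inference can be sketched as follows. The helper names are hypothetical, and the rounding of the longer side is an assumption, since the text only fixes the shorter side at 512.

```python
def inference_size(width, height, target=512):
    """Enlarge so that the shorter side equals `target`; larger images are left as-is."""
    short = min(width, height)
    if short >= target:
        return width, height
    scale = target / short
    return max(target, round(width * scale)), max(target, round(height * scale))


def latent_size(width, height, factor=8):
    """Latent resolution under the 8x downsampling of the Stable Diffusion autoencoder."""
    return width // factor, height // factor


small = inference_size(128, 128)   # enlarged to the training resolution
wide = inference_size(300, 200)    # shorter side becomes 512
large = inference_size(600, 800)   # unchanged; handled by aggregation sampling
latent = latent_size(*small)       # 64 x 64 latent patch
```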

To train CFW, we first generate 100k synthetic LR-HR pairs with $512\times 512$ resolution following the degradation pipeline in Real-ESRGAN (Wang et al., 2021c). Then, we adopt the finetuned diffusion model to generate the corresponding latent codes $\bm{Z}_{0}$ given the above LR images as conditions. The training losses are almost the same as the autoencoder used in LDM (Rombach et al., 2022), except that we use a fixed adversarial loss weight of $0.025$ rather than a self-adjustable one.

Table 1: Quantitative comparison with state-of-the-art methods on both synthetic and real-world benchmarks. Red and blue colors represent the best and second best performance, respectively.

Datasets	Metrics	RealSR	BSRGAN	DASR	Real-ESRGAN+	FeMaSR	LDM	SwinIR-GAN	IF_III	StableSR
DIV2K Valid	PSNR $\uparrow$	24.62	24.58	24.47	24.29	23.06	23.32	23.93	23.36	23.26
	SSIM $\uparrow$	0.5970	0.6269	0.6304	0.6372	0.5887	0.5762	0.6285	0.5636	0.5726
	LPIPS $\downarrow$	0.5276	0.3351	0.3543	0.3112	0.3126	0.3199	0.3160	0.4641	0.3114
	FID $\downarrow$	49.49	44.22	49.16	37.64	35.87	26.47	36.34	37.54	24.44
	CLIP-IQA $\uparrow$	0.3534	0.5246	0.5036	0.5276	0.5998	0.6245	0.5338	0.3980	0.6771
	MUSIQ $\uparrow$	28.57	61.19	55.19	61.05	60.83	62.27	60.22	43.71	65.92
RealSR	PSNR $\uparrow$	27.30	26.38	27.02	25.69	25.06	25.46	26.31	25.47	24.65
	SSIM $\uparrow$	0.7579	0.7651	0.7707	0.7614	0.7356	0.7145	0.7729	0.7067	0.7080
	LPIPS $\downarrow$	0.3570	0.2656	0.3134	0.2709	0.2937	0.3159	0.2539	0.3462	0.3002
	CLIP-IQA $\uparrow$	0.3687	0.5114	0.3198	0.4495	0.5406	0.5688	0.4360	0.3482	0.6234
	MUSIQ $\uparrow$	38.26	63.28	41.21	60.36	59.06	58.90	58.70	41.71	65.88
DRealSR	PSNR $\uparrow$	30.19	28.70	29.75	28.62	26.87	27.88	28.50	28.66	28.03
	SSIM $\uparrow$	0.8148	0.8028	0.8262	0.8052	0.7569	0.7448	0.8043	0.7860	0.7536
	LPIPS $\downarrow$	0.3938	0.2858	0.3099	0.2818	0.3157	0.3379	0.2743	0.3853	0.3284
	CLIP-IQA $\uparrow$	0.3744	0.5091	0.3813	0.4515	0.5634	0.5756	0.4447	0.2925	0.6357
	MUSIQ $\uparrow$	26.93	57.16	42.41	54.26	53.71	53.72	52.74	30.71	58.51
DPED-iphone	CLIP-IQA $\uparrow$	0.4496	0.4021	0.2826	0.3389	0.5306	0.4482	0.3373	0.2962	0.4799
DPED-iphone	MUSIQ $\uparrow$	45.60	45.89	32.68	42.42	49.95	44.23	43.30	37.49	50.48

4.2 Experimental Settings

Training Datasets. We adopt the degradation pipeline of Real-ESRGAN (Wang et al., 2021c) to synthesize LR/HR pairs on DIV2K (Agustsson and Timofte, 2017), DIV8K (Gu et al., 2019), Flickr2K (Timofte et al., 2017) and OutdoorSceneTraining (Wang et al., 2018a) datasets. We additionally add 5000 face images from the FFHQ dataset (Karras et al., 2019) for general cases.

Testing Datasets. We evaluate our approach on both synthetic and real-world datasets. For synthetic data, we follow the degradation pipeline of Real-ESRGAN (Wang et al., 2021c) and generate 3k LR-HR pairs from DIV2K validation set (Agustsson and Timofte, 2017). The resolution of LR is $128\times 128$ and that of the corresponding HR is $512\times 512$ . Note that for StableSR, the inputs are first upsampled to the same size as the outputs before inference. For real-world datasets, we follow common settings to conduct comparisons on RealSR (Cai et al., 2019), DRealSR (Wei et al., 2020) and DPED-iPhone (Ignatov et al., 2017). We further collect 40 images from the Internet for comparison.

Compared Methods. To verify the effectiveness of our approach, we compare our StableSR with several state-of-the-art methods (SR3 (Saharia et al., 2022b) is not included since its official code is unavailable), i.e., RealSR (Ji et al., 2020; we use the latest official model DF2K-JPEG), BSRGAN (Zhang et al., 2021b), Real-ESRGAN+ (Wang et al., 2021c), DASR (Liang et al., 2022), FeMaSR (Chen et al., 2022), latent diffusion model (LDM) (Rombach et al., 2022), SwinIR-GAN (Liang et al., 2021; we use the latest official SwinIR-GAN model, i.e., 003_realSR_BSRGAN_DFOWMFC_s64w8_SwinIR-L_x4_GAN.pth), and DeepFloyd IF_III (Deep-floyd, 2023). Since LDM is officially trained on images with $256\times 256$ resolution, we finetune it following the same training settings of StableSR for a fair comparison. For other methods, we directly use the official code and models for testing. Note that the results in this section are obtained on the same resolution as training, i.e., $128\times 128$. Specifically, for images from (Cai et al., 2019; Wei et al., 2020; Ignatov et al., 2017), we crop them at the center to obtain patches with $128\times 128$ resolution. For other real-world images, we first resize them such that the shorter sides are $128$ and then apply center cropping. As for other resolutions, one example of StableSR on real-world images under $1024\times 1024$ resolution is shown in Fig. 4. More results are provided in the supplementary material.

Evaluation Metrics. For benchmarks with paired data, i.e., DIV2K Valid, RealSR and DRealSR, we employ various perceptual metrics including LPIPS (Zhang et al., 2018a; we use LPIPS-ALEX by default), FID (Heusel et al., 2017), CLIP-IQA (Wang et al., 2023) and MUSIQ (Ke et al., 2021) to evaluate the perceptual quality of generated images. PSNR and SSIM scores (evaluated on the luminance channel in YCbCr color space) are also reported for reference. Since ground-truth images are unavailable in DPED-iPhone (Ignatov et al., 2017), we follow existing methods (Wang et al., 2021c; Chen et al., 2022) to report results on no-reference metrics, i.e., CLIP-IQA and MUSIQ, for perceptual quality evaluation. Besides, we further conduct a user study on $40$ real-world images to verify the effectiveness of our approach against existing methods.
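For reference, PSNR on the luminance channel can be sketched without external dependencies as below. The sketch assumes 8-bit RGB inputs, the ITU-R BT.601 luma transform that is common in SR evaluation, and a peak value of 255; the helper names are hypothetical and the exact evaluation code may differ in details such as border cropping.

```python
import math


def rgb_to_y(pixel):
    """BT.601 luma of one (R, G, B) pixel with values in [0, 255]."""
    r, g, b = pixel
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr_y(sr_pixels, hr_pixels):
    """PSNR in dB between two equal-length sequences of (R, G, B) pixels,
    measured on the luma channel only (peak value assumed to be 255)."""
    if len(sr_pixels) != len(hr_pixels) or not hr_pixels:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum(
        (rgb_to_y(a) - rgb_to_y(b)) ** 2 for a, b in zip(sr_pixels, hr_pixels)
    ) / len(hr_pixels)
    return math.inf if mse == 0 else 10.0 * math.log10(255.0 ** 2 / mse)


if __name__ == "__main__":
    hr = [(10, 20, 30), (200, 180, 160)]
    sr = [(12, 19, 33), (198, 183, 158)]
    print(f"PSNR-Y: {psnr_y(sr, hr):.2f} dB")
```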

4.3 Comparison with Existing Methods

Quantitative Comparisons. We first show the quantitative comparison on the synthetic DIV2K validation set and three real-world benchmarks. As shown in Table 1, our approach outperforms state-of-the-art SR methods in terms of multiple perceptual metrics, including FID, CLIP-IQA and MUSIQ. Specifically, on synthetic benchmark DIV2K Valid, our StableSR ( $w=0.5$ ) achieves a $24.44$ FID score, which is $7.7\%$ lower than LDM and at least $32.9\%$ lower than other GAN-based methods. Besides, our StableSR ( $w=0.5$ ) achieves the highest CLIP-IQA scores on the two commonly used real-world benchmarks (Cai et al., 2019; Wei et al., 2020), suggesting the superiority of StableSR. While we notice that StableSR achieves inferior performance on metrics including PSNR, SSIM and LPIPS compared with non-diffusion methods, these metrics only reflect certain aspects of performance (Ledig et al., 2017; Wang et al., 2018b; Blau and Michaeli, 2018). Besides, the previous non-diffusion methods tend to directly use $\ell_{2}$ losses and perceptual loss between the predictions and the corresponding ground truths for training, which are closely related to the calculation of PSNR and LPIPS, respectively. Different from previous methods, diffusion models (Ho et al., 2020; Rombach et al., 2022) only apply $\ell_{2}$ loss between the predicted and the ground-truth noise. We conjecture this is an important factor that makes diffusion models less competitive on these metrics, as observed by the recent work (Yue and Loy, 2022). Moreover, previous methods usually fail to restore faithful textures and generate blurry results, as shown in Fig. 5. In contrast, our StableSR is capable of generating sharp images with realistic details.

Qualitative Comparisons. To demonstrate the effectiveness of our method, we present visual results on real-world images from both real-world benchmarks (Cai et al., 2019; Wei et al., 2020) and the internet in Fig. 5 and Fig. 6. It is observed that StableSR outperforms previous methods in both artifact removal and detail generation. Specifically, StableSR is able to generate faithful details, as shown in the first row of Fig. 5, while other methods either show blurry results (DASR, BSRGAN, Real-ESRGAN+, LDM) or unnatural details (RealSR, FeMaSR). Moreover, as shown in the fourth row of Fig. 5, StableSR generates sharp edges without obvious degradations, whereas other state-of-the-art methods generate blurry results. Figure 6 further demonstrates the superiority of StableSR on images beyond $512\times 512$ .

User Study. To further examine the effectiveness of StableSR, we conduct a user study on 40 real-world LR images collected from the Internet. To alleviate potential bias, the collected real-world images contain diverse content, e.g., natural images with and without objects, and photos with texts and faces. The order of the images as well as the options are also randomly shuffled. We further provide the link of our user study for reference (https://forms.gle/gsLyVr6pSkAEbW8J9). We compare our approach with three commonly used SR methods with competitive performance, i.e., Real-ESRGAN+, SwinIR-GAN and LDM. Given an LR image as reference, the subject is asked to choose the best HR image generated from the four methods, i.e., StableSR, Real-ESRGAN+, SwinIR-GAN and LDM. With 40 LR images and 35 subjects for evaluation, there are $40\times 35=1400$ votes in total. As depicted in Fig. 7, by gaining over 80% of the votes, StableSR shows its potential capability for real-world SR applications. However, we also notice that StableSR may struggle in dealing with small texts, faces and patterns, indicating there is still room for improvement.

Comparison with Concurrent Diffusion Applications. We notice that recent concurrent works (Zhang et al., 2023; Deep-floyd, 2023) can also be adopted for image SR. While IF_III upscaler (Deep-floyd, 2023) is a super-resolution model trained from scratch, ControlNet-tile (Zhang et al., 2023) also adopts a diffusion prior. The key technical differences regarding the use of diffusion prior between our StableSR and ControlNet-tile lie in the different adaptor designs, i.e., ControlNet-tile adopts a trainable copy of the encoding layers in Stable Diffusion (Rombach et al., 2022), whilst StableSR does not rely on any layer copies of the fixed diffusion prior and is thus more flexible. Specifically, we introduce a time-aware encoder to modulate the feature maps of the fixed diffusion prior. This time-aware encoder is more lightweight than the copied layers in ControlNet-tile, i.e., 105M vs. 364M. As a result, StableSR is also faster than ControlNet-tile in terms of inference speed, i.e., 10.37s vs. 14.47s for 50 sampling steps. Here, we further conduct comparisons with these methods on real-world images. For fair comparisons, we use DDIM sampling with $\eta=1.0$ and $200$ timesteps for all the methods, and the seed is fixed to $42$. We further set $w=0.0$ in StableSR to avoid additional improvement due to CFW. For ControlNet-tile (Zhang et al., 2023), we generate additional prompts using stable-diffusion-webui (https://github.com/AUTOMATIC1111/stable-diffusion-webui) for better performance. For IF_III upscaler (Deep-floyd, 2023), we follow official examples to set the noise level to $100$ w/o prompts. As shown in Fig. 8, ControlNet-tile shows poor fidelity due to the lack of specific designs for SR. Compared with IF_III upscaler, the proposed StableSR is capable of generating more faithful details with sharper edges, e.g., the text in the first row, the tiger’s nose in the third row and the wing of the butterfly in the last row of Fig. 8. Note that IF_III upscaler is trained from scratch, which requires significant computational resources. The visual comparisons suggest the superiority of StableSR.

Comparison with Follow-up Approaches. During the submission of our work, we notice that several follow-up methods (Lin et al., 2023; Yu et al., 2024) are further proposed for image super-resolution by exploiting the diffusion prior with a ControlNet-like (Zhang et al., 2023) framework. We therefore conduct a further comparison with these works here. The key technical differences regarding the use of diffusion prior between our StableSR and DiffBIR lie in the different adaptor designs, i.e., DiffBIR follows ControlNet (Zhang et al., 2023) to adopt a trainable copy of the encoding layers in Stable Diffusion (Rombach et al., 2022), while StableSR does not rely on any layer copies of the fixed diffusion prior and is thus more flexible. Specifically, the generation module of DiffBIR is the same as ControlNet, leading to more trainable parameters (364M vs. 105M) and longer inference time (14.47s vs. 10.37s). Besides, DiffBIR requires an additional pre-clean model during both training and inference, as inspired by our earlier work DifFace (Yue and Loy, 2022), whilst our StableSR does not require such a pre-clean model during training. In the testing phase, this pre-clean model is also optional and can be removed (we do not use it by default, unless clarified). Details of the pre-clean model for StableSR can be found in the supplementary material. Similar to DiffBIR, another recent work SUPIR (Yu et al., 2024) proposes to adopt SDXL (Podell et al., 2023), a much larger diffusion model (2.6B vs. 865M), as diffusion prior and develops a trimmed ControlNet to reduce the model size. While both follow ControlNet (Zhang et al., 2023), SUPIR has many more trainable parameters than DiffBIR, i.e., 1.3B, leading to almost 2x the inference time of StableSR. We further conduct comparisons on real-world test data. As shown in Table 2 and Fig. 9, StableSR is comparable with DiffBIR. We further notice that DiffBIR tends to over-generate patterns, as shown in the last row of Fig. 9, while StableSR does not suffer from such a problem. As for SUPIR, we observe that it does not perform well on images with small resolutions, i.e., lower than 512 after upsampling. We conjecture this is because small cropped images lack semantic content and the prior adopted by SUPIR is trained on a $1024\times 1024$ resolution. However, we do observe that SUPIR outperforms our method on large resolutions beyond $1024$, which should be mostly due to the huge model size and the large training set with detailed prompts. Improving StableSR with a larger diffusion prior and training datasets with prompts can be regarded as a future direction.

Table 2: Quantitative comparison with follow-up works, i.e., DiffBIR (Lin et al., 2023) and SUPIR (Yu et al., 2024), on RealSR (Cai et al., 2019) and DRealSR (Wei et al., 2020) benchmarks. SUPIR does not perform well due to the resolution gap between test data ($512\times 512$) and SDXL prior ($1024\times 1024$).

Datasets	Metrics	DiffBIR	SUPIR	StableSR
RealSR	PSNR $\uparrow$	25.02	23.70	24.65
	SSIM $\uparrow$	0.6711	0.6647	0.7080
	LPIPS $\downarrow$	0.3568	0.3559	0.3002
	CLIP-IQA $\uparrow$	0.6568	0.6619	0.6234
	MUSIQ $\uparrow$	64.07	61.97	65.88
DRealSR	PSNR $\uparrow$	27.20	24.86	28.03
	SSIM $\uparrow$	0.6721	0.6441	0.7536
	LPIPS $\downarrow$	0.4274	0.4229	0.3284
	CLIP-IQA $\uparrow$	0.6293	0.6891	0.6357
	MUSIQ $\uparrow$	59.87	59.70	58.51

4.4 Ablation Study

Effectiveness of Diffusion Prior. We first verify the effectiveness of adopting diffusion prior for super-resolution. We train a baseline from scratch without loading a pretrained diffusion model as diffusion prior. The architecture is kept the same as our StableSR for fair comparison. As shown in Fig. 10, benefiting from the diffusion prior, StableSR achieves better LPIPS scores on both validation datasets during training. The visual comparisons at different epochs also indicate the significance of adopting diffusion prior. Moreover, we observe that training from scratch requires 2.06 times more GPU memory on average compared to StableSR on NVIDIA Tesla 32G-V100 GPUs.

Effectiveness of Network Design. In StableSR, a time-aware encoder and SFT layers are adopted to harness the diffusion prior. While concurrent works ControlNet (Zhang et al., 2023) and T2I-Adapter (Mou et al., 2024) propose to exploit diffusion prior for image generation, their effectiveness for image super-resolution is underexplored. Here, we further compare our design with theirs. Specifically, we first retrain a ControlNet for image super-resolution using the same diffusion prior and training pipelines as ours. Recall that we have shown the superiority of StableSR compared with ControlNet-tile in Fig. 8. With retraining, the performance of ControlNet for super-resolution can be improved, but it is still inferior to ours, as shown in Fig. 11. To compare with T2I-Adapter, while we have already verified the effectiveness of time-aware guidance, we further add a baseline w/o SFT layers by first mapping the features to the same shape as the prior features and then adding them together. Note that such a strategy can be regarded as a special case of SFT layers with $\bm{\alpha}^{n}=0$ in Eq.(1), since the added features then act only as the shift term. As shown in Fig. 12, SFT layers slightly improve the LPIPS scores on the validation sets during training.

Importance of Time-aware Guidance and Color Correction. We then investigate the significance of time-aware guidance and color correction. Recall that in Fig. 3, we already show that the time-aware guidance allows the encoder to adaptively adjust the condition strength. Here, we further verify its effectiveness on real-world benchmarks (Cai et al., 2019; Wei et al., 2020). As shown in Table 3, removing either time-aware guidance (i.e., removing the time-embedding layer) or color correction leads to worse SSIM and LPIPS. Moreover, the comparisons in Fig. 13 also indicate inferior performance without the above two components, suggesting the effectiveness of time-aware guidance and color correction. In addition to directly adopting color correction in the pixel domain, our proposed wavelet color correction can further boost the visual quality, as shown in Fig. 14, which may further facilitate practical use. Note that technically, the wavelet transform may introduce halo effects (Thorndike et al., 1920), though we do not observe this phenomenon during our experiments.

Table 3: Ablation studies of time-aware guidance and color correction on RealSR (Cai et al., 2019) and DRealSR (Wei et al., 2020) benchmarks.

Exp.	Strategies			RealSR / DRealSR
Exp.	Time aware	Pixel Color cor.	Wavelet Color cor.	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$
(a)		✓		24.65 / 27.68	0.7040 / 0.7280	0.3157 / 0.3456
(b)	✓			22.24 / 23.86	0.6840 / 0.7179	0.3180 / 0.3544
(c)	✓		✓	23.38 / 26.80	0.6870 / 0.7235	0.3157 / 0.3475
Default	✓	✓		24.65 / 28.03	0.7080 / 0.7536	0.3002 / 0.3284

Flexibility of Fidelity-realism Trade-off. Our CFW module inspired by CodeFormer (Zhou et al., 2022) allows a flexible realism-fidelity trade-off. In particular, given a controllable coefficient $w$ with a range of $[0,1]$ , CFW with a small $w$ tends to generate a realistic result, especially for large degradations, while CFW with a larger $w$ improves the fidelity. As shown in Table 4, compared with StableSR ( $w=0.0$ ), StableSR with larger values of $w$ (e.g., 0.75) achieves higher PSNR and SSIM on all three paired benchmarks, indicating better fidelity. In contrast, StableSR ( $w=0.0$ ) achieves better perceptual quality with higher CLIP-IQA scores and MUSIQ scores. Similar phenomena can also be observed in Fig. 15. We further observe that a proper $w$ can lead to improvement in both fidelity and perceptual quality. Specifically, StableSR ( $w=0.5$ ) shows comparable PSNR and SSIM with StableSR ( $w=1.0$ ) but achieves better perceptual metric scores in Table 4. Hence, we set the coefficient $w$ to 0.5 by default for trading between quality and fidelity. We observe that CFW necessitates extra GPU memory. Consequently, we designate it as an optional feature for varying applications.

Table 4: Ablation studies of the controllable coefficient $w$ on both synthetic (DIV2K Valid (Agustsson and Timofte, 2017)) and real-world (RealSR (Cai et al., 2019), DRealSR (Wei et al., 2020), and DPED-iPhone (Ignatov et al., 2017)) benchmarks.

Datasets	Metrics	StableSR ( $w=0.0$ )	StableSR ( $w=0.5$ )	StableSR ( $w=0.75$ )	StableSR ( $w=1.0$ )
DIV2K Valid	PSNR $\uparrow$	22.68	23.26	24.17	23.14
	SSIM $\uparrow$	0.5546	0.5726	0.6209	0.5681
	LPIPS $\downarrow$	0.3393	0.3114	0.3003	0.3077
	FID $\downarrow$	25.83	24.44	24.05	26.14
	CLIP-IQA $\uparrow$	0.6529	0.6771	0.5519	0.6197
	MUSIQ $\uparrow$	65.72	65.92	59.46	64.31
RealSR	PSNR $\uparrow$	24.07	24.65	25.37	24.70
	SSIM $\uparrow$	0.6829	0.7080	0.7435	0.7157
	LPIPS $\downarrow$	0.3190	0.3002	0.2672	0.2892
	CLIP-IQA $\uparrow$	0.6127	0.6234	0.5341	0.5847
	MUSIQ $\uparrow$	65.81	65.88	62.36	64.05
DRealSR	PSNR $\uparrow$	27.43	28.03	29.00	27.97
	SSIM $\uparrow$	0.7341	0.7536	0.7985	0.7540
	LPIPS $\downarrow$	0.3595	0.3284	0.2721	0.3080
	CLIP-IQA $\uparrow$	0.6340	0.6357	0.5070	0.5893
	MUSIQ $\uparrow$	58.98	58.51	53.12	56.77
DPED-iPhone	CLIP-IQA $\uparrow$	0.5015	0.4799	0.3405	0.4250
DPED-iPhone	MUSIQ $\uparrow$	51.90	50.48	41.81	47.96

Table 5: Comparison of model complexity. All methods are evaluated on $128\times 128$ input images for 4x SR using an NVIDIA Tesla 32G-V100 GPU. The runtime is averaged over ten runs with a batch size of 1.

	Real-ESRGAN+	FeMaSR	SwinIR-GAN	LDM	IF_III	StableSR	StableSR-Turbo
Model type	GAN	GAN	GAN	Diffusion	Diffusion	Diffusion	Diffusion
Number of inference steps	1	1	1	200	200	200	4
Runtime	0.08s	0.12s	0.31s	5.25s	17.78s	15.16s	0.83s
Trainable Params	16.70M	28.29M	28.01M	113.62M	473.40M	149.91M	149.91M

4.5 Complexity Comparison

StableSR is a diffusion-based approach and requires multi-step sampling for image generation. As shown in Table 5, when the number of sampling steps is set to 200, StableSR needs 15.16 seconds to generate a $512\times 512$ image on one NVIDIA Tesla 32G-V100 GPU. This is comparable to IF_III upscaler but slower than GAN-based SR methods such as Real-ESRGAN+ and SwinIR-GAN, which require only a single forward pass. Fast sampling strategy (Song et al., 2020; Lu et al., 2022; Karras et al., 2022) and model distillation (Salimans and Ho, 2021; Song et al., 2023b; Luo et al., 2023) are two promising solutions to improve efficiency. Another viable remedy is to shorten the chain of diffusion process (Yue et al., 2023). As for trainable parameters, StableSR has $149.91$ M trainable parameters, which is only 11.50% of the full model and less than IF_III, i.e., 473.40M. The trainable parameters can be further decreased with more careful design, e.g., adopting lightweight architectures (Chollet, 2017; Howard et al., 2019) or network pruning (Fang et al., 2023). Such exploration is beyond the scope of this paper.
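The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below only reuses numbers from Table 5 and from the text; the variable and function names are illustrative, and the implied full-model size is derived from the stated 11.50% share rather than reported directly.

```python
# Runtime in seconds and trainable parameters in millions, copied from Table 5.
runtime_s = {
    "Real-ESRGAN+": 0.08,
    "SwinIR-GAN": 0.31,
    "IF_III": 17.78,
    "StableSR": 15.16,
}
trainable_m = {"IF_III": 473.40, "StableSR": 149.91}


def implied_total_params(trainable, fraction):
    """Total parameter count implied by a trainable count and its share."""
    return trainable / fraction


# 149.91M trainable parameters at an 11.50% share imply roughly 1.3B in total.
total_m = implied_total_params(trainable_m["StableSR"], 0.1150)

# StableSR trains about a third as many parameters as IF_III.
param_ratio = trainable_m["StableSR"] / trainable_m["IF_III"]

# Runtime relative to a single-pass GAN and to the IF_III upscaler.
slowdown_vs_gan = runtime_s["StableSR"] / runtime_s["Real-ESRGAN+"]
if3_vs_stablesr = runtime_s["IF_III"] / runtime_s["StableSR"]

if __name__ == "__main__":
    print(f"implied full model: {total_m:.1f}M parameters")
    print(f"trainable params vs IF_III: {param_ratio:.2f}x")
    print(f"runtime vs Real-ESRGAN+: {slowdown_vs_gan:.1f}x")
    print(f"IF_III runtime vs StableSR: {if3_vs_stablesr:.2f}x")
```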

5 Inference Strategies

The proposed StableSR already demonstrates superior performance quantitatively and qualitatively on both synthetic and real-world benchmarks, as shown in Sec. 4. Here, we discuss several effective strategies during the sampling process that can further boost the inference performance without additional finetuning.

5.1 Classifier-free Guidance with Negative Prompts

The default StableSR is trained with null prompts. Interestingly, we observe that StableSR can react to prompts, especially negative prompts. We examine the use of classifier-free guidance (Ho and Salimans, 2021) with negative prompts to further improve the visual quality during sampling. Given two StableSR models conditioned on null prompts $\epsilon_{\bm{\theta}}(\bm{Z}^{(t)},\bm{F},[],t)$ and negative prompts $\epsilon_{\bm{\theta}}(\bm{Z}^{(t)},\bm{F},\bm{c},t)$ , respectively, the new sampling process can be performed using a linear combination of the estimations with a guidance scale $s$ :

$$\tilde{\epsilon}_{\bm{\theta}}=\epsilon_{\bm{\theta}}(\bm{Z}^{(t)},\bm{F},\bm{c},t)+s\left(\epsilon_{\bm{\theta}}(\bm{Z}^{(t)},\bm{F},[],t)-\epsilon_{\bm{\theta}}(\bm{Z}^{(t)},\bm{F},\bm{c},t)\right), \tag{8}$$

where $\bm{c}$ is the negative prompt for guidance. According to Eq. (8), it is worth noting that $s=0$ is equivalent to directly using negative prompts without guidance, and $s=1$ is equivalent to our default settings with the null prompt.
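The combination in Eq. (8) and its two special cases can be illustrated with a small sketch. The noise estimates are represented here as plain lists of floats standing in for the network outputs under the null and the negative prompt; the function name is hypothetical and no diffusion model is involved.

```python
def guided_noise(eps_null, eps_neg, s):
    """Eq. (8): eps_neg + s * (eps_null - eps_neg), applied elementwise.

    eps_null: estimate conditioned on the null prompt [].
    eps_neg:  estimate conditioned on the negative prompt c.
    s:        guidance scale.
    """
    if len(eps_null) != len(eps_neg):
        raise ValueError("estimates must have the same length")
    return [n + s * (u - n) for u, n in zip(eps_null, eps_neg)]


if __name__ == "__main__":
    eps_null = [1.0, 2.0]
    eps_neg = [0.5, 4.0]
    print(guided_noise(eps_null, eps_neg, 0.0))  # negative-prompt estimate
    print(guided_noise(eps_null, eps_neg, 1.0))  # null-prompt estimate
    print(guided_noise(eps_null, eps_neg, 2.5))  # pushed away from eps_neg
```

For $s>1$ the result is extrapolated beyond the null-prompt estimate, away from the negative-prompt estimate, which is the regime explored with $s=2.5$, $5.0$ and $7.5$.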

We compare the performance of StableSR with various positive prompts, i.e., (1) “(masterpiece:2), (best quality:2), (realistic:2), (very clear:2)”, and (2) “Good photo.”, and negative prompts, i.e., (a) “3d, cartoon, anime, sketches, (worst quality:2), (low quality:2)”, and (b) “Bad photo.”. As shown in Table 6, different prompts lead to diverse metric scores. Specifically, the classifier-free guidance with negative prompts shows a significant influence on the metrics, i.e., higher guidance scales lead to higher CLIP-IQA and MUSIQ scores, indicating sharper results. Similar phenomena can also be observed in Fig. 16. However, an overly strong guidance, e.g., $s=7.5$ can result in oversharpening.

Table 6: Comparison of different prompts and guidance strengths. Note that $s=0$ is equivalent to using negative prompts w/o guidance. Positive prompts are (1) “(masterpiece:2), (best quality:2), (realistic:2), (very clear:2)”, and (2) “Good photo.”. Negative prompts are (a) “3d, cartoon, anime, sketches, (worst quality:2), (low quality:2)”, and (b) “Bad photo.”. The first row is the default settings for StableSR.

Strategies			RealSR / DRealSR
Pos. Prompts	Neg. Prompts	Guidance Scale	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	CLIP-IQA $\uparrow$	MUSIQ $\uparrow$
[]	-	-	24.65 / 28.03	0.7080 / 0.7536	0.3002 / 0.3284	0.6234 / 0.6357	65.88 / 58.51
(1)	-	-	24.68 / 28.03	0.7025 / 0.7461	0.3151 / 0.3378	0.6251 / 0.6370	65.34 / 58.07
(2)	-	-	24.71 / 28.07	0.7049 / 0.7500	0.3118 / 0.3333	0.6219 / 0.6291	65.22 / 57.75
[]	(a)	$s=0.0$	24.80 / 28.18	0.7097 / 0.7562	0.3105 / 0.3316	0.6176 / 0.6224	64.86 / 57.31
		$s=2.5$	24.41 / 27.76	0.6972 / 0.7383	0.3168 / 0.3417	0.6306 / 0.6422	66.02 / 59.21
		$s=5.0$	23.96 / 27.21	0.6829 / 0.7188	0.3267 / 0.3583	0.6356 / 0.6558	66.84 / 61.07
		$s=7.5$	23.53 / 26.68	0.6673 / 0.7003	0.3399 / 0.3774	0.6323 / 0.6621	67.26 / 62.41
[]	(b)	$s=0.0$	24.77 / 28.13	0.7067 / 0.7520	0.3100 / 0.3317	0.6184 / 0.6239	64.81 / 57.27
		$s=2.5$	24.46 / 27.90	0.7017 / 0.7467	0.3170 / 0.3371	0.6303 / 0.6409	66.29 / 58.97
		$s=5.0$	24.13 / 27.61	0.6958 / 0.7391	0.3240 / 0.3467	0.6377 / 0.6490	67.43 / 60.69
		$s=7.5$	23.78 / 27.30	0.6894 / 0.7310	0.3320 / 0.3578	0.6421 / 0.6583	68.13 / 62.12

5.2 StableSR with SD-Turbo

The default sampler of StableSR is DDPM (Ho et al., 2020) with 200 sampling steps. Though effective, the sampling process can be time-consuming compared with non-diffusion approaches, as shown in Table 5. In practice, we observe that StableSR is capable of generating high-quality results much faster using advanced samplers with fewer sampling steps. Specifically, DDIM (Song et al., 2020) enables StableSR to generate results with faithful details in 20 steps. Moreover, StableSR can be further applied to SD-Turbo (Sauer et al., 2023) w/o further finetuning. As shown in Fig. 17, StableSR equipped with SD-Turbo can generate high-quality results with only 4 steps, significantly reducing the inference time, i.e., 0.83s as shown in Table 5, which is 6.3 times faster than LDM with 200 sampling steps, while still remarkably outperforming popular GAN-based methods (Wang et al., 2021c; Liang et al., 2021) and LDM (Rombach et al., 2022). Notably, directly speeding up LDM using existing fast sampling approaches, e.g., DDIM, leads to a severe performance drop, as shown in Fig. 17.

6 Limitations

Though benefiting from the diffusion prior, StableSR also shares similar limitations with it. Specifically, StableSR may struggle in handling small texts, faces and patterns as shown in Fig. 18. While these cases are challenging for existing generic super-resolution approaches including StableSR, we believe adopting a more powerful diffusion prior and training on more high-quality data can help. We leave these as future work.

7 Conclusion

Motivated by the rapid development of diffusion models and their wide applications to downstream tasks, this work discusses an important yet underexplored problem of how diffusion prior can be adopted for super-resolution. In this paper, we present StableSR, a new way to exploit diffusion prior for real-world SR while avoiding resource-intensive training from scratch. We devote our efforts to tackling well-known problems, such as high computational cost and fixed resolution, and propose respective solutions, including the time-aware encoder, controllable feature wrapping module, and progressive aggregation sampling scheme. Extensive experiments are conducted for evaluation, and effective inference strategies are further provided to facilitate practical applications. We believe that our exploration would lay a good foundation in this direction, and our proposed StableSR could provide useful insights for future works.

Acknowledgement: This study is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG2-PhD-2022-01-033[T]), RIE2020 Industry Alignment Fund Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s). We sincerely thank Yi Li for providing valuable advice and building the WebUI implementation of our work (https://github.com/pkuliyi2015/sd-webui-stablesr). We also thank the community for its continuous interest and contributions.

Appendix

Appendix A Details of Time-aware Encoder

As mentioned in the main paper, the architecture of the time-aware encoder is similar to the contracting path of the denoising U-Net in Stable Diffusion (Rombach et al., 2022) with much fewer parameters ( ${\sim}$ 105M, including SFT layers) by reducing the number of channels. The detailed settings are listed in Table 7.

Table 7: Settings of the time-aware encoder in StableSR.

Settings	Value
in_channels	4
model_channels	256
out_channels	256
num_res_blocks	2
dropout	0
channel_mult	[1, 1, 2, 2]
attention_resolutions	[4, 2, 1]
conv_resample	True
dims	2
use_fp16	False
num_heads	4
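To make the settings in Table 7 easier to read, the sketch below derives the per-level channel widths and the levels that would receive attention. It assumes the convention of common diffusion U-Net implementations, where `channel_mult` scales `model_channels` at each resolution level and `attention_resolutions` lists the downsampling factors at which attention is applied; this convention and the helper names are assumptions, not details stated in the table.

```python
# Settings copied from Table 7.
encoder_cfg = {
    "in_channels": 4,
    "model_channels": 256,
    "out_channels": 256,
    "num_res_blocks": 2,
    "dropout": 0,
    "channel_mult": [1, 1, 2, 2],
    "attention_resolutions": [4, 2, 1],
    "conv_resample": True,
    "dims": 2,
    "use_fp16": False,
    "num_heads": 4,
}


def level_channels(cfg):
    """Channel width at each resolution level (assumed convention)."""
    return [cfg["model_channels"] * m for m in cfg["channel_mult"]]


def attention_levels(cfg):
    """Whether each level uses attention, assuming level i has
    downsampling factor 2**i relative to the encoder input."""
    return [
        2 ** level in cfg["attention_resolutions"]
        for level in range(len(cfg["channel_mult"]))
    ]


if __name__ == "__main__":
    print(level_channels(encoder_cfg))
    print(attention_levels(encoder_cfg))
```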

Appendix B Aggregation Sampling

Here, we provide more details about our aggregation sampling strategy, which is an effective and practical solution that enables arbitrary-size image generation without a perceptible performance drop for diffusion-based restoration. Our aggregation sampling strategy is mainly inspired by Jiménez (Jiménez, 2023) and we further enable more flexible resolution by dynamically adjusting the overlapping size at the right and bottom boundaries as shown in Fig. 19.
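One way to realize the boundary handling described above is to place tiles on a regular stride and snap the last tile in each direction to the image border, so that its overlap with the previous tile grows as needed. The sketch below illustrates only this tile layout; the tile size of 64 and nominal overlap of 32 are assumed values, the function names are hypothetical, and the weighting used to blend overlapping tiles is omitted.

```python
def tile_starts(length, tile=64, overlap=32):
    """Start offsets of tiles along one axis.

    Tiles advance by (tile - overlap); the final tile is snapped to the
    boundary, which enlarges its overlap instead of padding the input.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    if length <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # dynamic overlap at the right/bottom edge
    return starts


def tile_grid(height, width, tile=64, overlap=32):
    """(top, left) corners of all tiles covering a height x width map."""
    return [
        (top, left)
        for top in tile_starts(height, tile, overlap)
        for left in tile_starts(width, tile, overlap)
    ]


if __name__ == "__main__":
    # A 100-wide map: the last tile starts at 36, overlapping the previous
    # tile by 60 columns rather than the nominal 32.
    print(tile_starts(100))
    print(len(tile_grid(100, 160)))
```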

Appendix C Pre-cleaning for Severe Degradations

It is observed that StableSR may yield suboptimal results when LR images are severely degraded with pronounced levels of blur or noise, as shown in the first column of Fig. 20. Drawing inspiration from RealBasicVSR (Chan et al., 2022b), we incorporate an auxiliary pre-cleaning phase preceding StableSR to address scenarios under severe degradations. Specifically, we first adopt an existing SR approach, e.g., Real-ESRGAN+ (Wang et al., 2021c) for general SR and CodeFormer (Zhou et al., 2022) for face SR, to mitigate the aforementioned severe degradations. For face SR, we further finetune our StableSR model for 50 epochs on FFHQ (Karras et al., 2019) using the same degradations as CodeFormer (Zhou et al., 2022). To suppress the amplification of artifacts originating from the pre-cleaning phase, a subsequent $2\times$ bicubic downsampling operation is further adopted after pre-cleaning. Subsequently, StableSR is used to generate the final outputs. As shown in Fig. 20, such a pre-cleaning stage substantially improves the robustness of StableSR.

Appendix D Additional Visual Results

D.1 Visual Results on Fixed Resolution

In this section, we provide additional qualitative comparisons on real-world images w/o ground truths under the resolution of $512\times 512$ . We obtain LR images with $128\times 128$ resolution. As shown in Fig. 21, StableSR successfully produces outputs with finer details and sharper edges, significantly outperforming state-of-the-art methods.

D.2 Visual Results on Arbitrary Resolution

In this section, we provide additional qualitative comparisons on the original resolution of real-world images w/o ground truths. As shown in Fig. 22, StableSR is capable of generating high-quality SR images beyond 4x resolution, indicating its practical use in real-world applications. Moreover, the results in Fig. 23 indicate that StableSR can generate realistic textures under diverse and complicated real-world scenarios such as buildings and texts, while existing methods either lead to blurry results or introduce unpleasant artifacts.

References

Agustsson and Timofte (2017) Agustsson E, Timofte R (2017) NTIRE 2017 challenge on single image super-resolution: Dataset and study. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR-W)
Avrahami et al. (2022) Avrahami O, Lischinski D, Fried O (2022) Blended diffusion for text-driven editing of natural images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Balaji et al. (2022) Balaji Y, Nah S, Huang X, Vahdat A, Song J, Kreis K, Aittala M, Aila T, Laine S, Catanzaro B, Karras T, Liu MY (2022) ediff-i: Text-to-image diffusion models with ensemble of expert denoisers. arXiv preprint arXiv:2211.01324
Blau and Michaeli (2018) Blau Y, Michaeli T (2018) The perception-distortion tradeoff. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Cai et al. (2019) Cai J, Zeng H, Yong H, Cao Z, Zhang L (2019) Toward real-world single image super-resolution: A new benchmark and a new model. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
Chan et al. (2021) Chan KC, Wang X, Xu X, Gu J, Loy CC (2021) GLEAN: Generative latent bank for large-factor image super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Chan et al. (2022a) Chan KC, Wang X, Xu X, Gu J, Loy CC (2022a) GLEAN: Generative latent bank for large-factor image super-resolution and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
Chan et al. (2022b) Chan KC, Zhou S, Xu X, Loy CC (2022b) Investigating tradeoffs in real-world video super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Chen et al. (2022) Chen C, Shi X, Qin Y, Li X, Han X, Yang T, Guo S (2022) Real-world blind super-resolution via feature matching with implicit high-resolution priors. In: Proceedings of the ACM International Conference on Multimedia (ACM MM)
Chen et al. (2021) Chen H, Wang Y, Guo T, Xu C, Deng Y, Liu Z, Ma S, Xu C, Xu C, Gao W (2021) Pre-trained image processing transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Choi et al. (2021) Choi J, Kim S, Jeong Y, Gwon Y, Yoon S (2021) Ilvr: Conditioning method for denoising diffusion probabilistic models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
Choi et al. (2022) Choi J, Lee J, Shin C, Kim S, Kim H, Yoon S (2022) Perception prioritized training of diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Chollet (2017) Chollet F (2017) Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Chung et al. (2022) Chung H, Sim B, Ryu D, Ye JC (2022) Improving diffusion models for inverse problems using manifold constraints. In: Proceedings of Advances in Neural Information Processing Systems (NeurIPS)
Dai et al. (2019) Dai T, Cai J, Zhang Y, Xia ST, Zhang L (2019) Second-order attention network for single image super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Deep-floyd (2023) Deep-floyd (2023) If. https://github.com/deep-floyd/IF
Dong et al. (2014) Dong C, Loy CC, He K, Tang X (2014) Learning a deep convolutional network for image super-resolution. In: Proceedings of the European Conference on Computer Vision (ECCV)
Dong et al. (2015) Dong C, Loy CC, He K, Tang X (2015) Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
Dong et al. (2016) Dong C, Loy CC, Tang X (2016) Accelerating the super-resolution convolutional neural network. In: Proceedings of the European Conference on Computer Vision (ECCV)
Fang et al. (2023) Fang G, Ma X, Wang X (2023) Structural pruning for diffusion models. In: Proceedings of Advances in Neural Information Processing Systems (NeurIPS)
Feng et al. (2023) Feng W, He X, Fu TJ, Jampani V, Akula A, Narayana P, Basu S, Wang XE, Wang WY (2023) Training-free structured diffusion guidance for compositional text-to-image synthesis. In: Proceedings of International Conference on Learning Representations (ICLR)
Fritsche et al. (2019) Fritsche M, Gu S, Timofte R (2019) Frequency separation for real-world super-resolution. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCV-W)
Gal et al. (2023) Gal R, Arar M, Atzmon Y, Bermano AH, Chechik G, Cohen-Or D (2023) Designing an encoder for fast personalization of text-to-image models. arXiv preprint arXiv:2302.12228
Gu et al. (2020) Gu J, Shen Y, Zhou B (2020) Image processing using multi-code GAN prior. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Gu et al. (2019) Gu S, Lugmayr A, Danelljan M, Fritsche M, Lamour J, Timofte R (2019) DIV8K: Diverse 8K resolution image dataset. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCV-W)
Gu et al. (2022) Gu S, Chen D, Bao J, Wen F, Zhang B, Chen D, Yuan L, Guo B (2022) Vector quantized diffusion model for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
He et al. (2019) He X, Mo Z, Wang P, Liu Y, Yang M, Cheng J (2019) ODE-inspired network design for single image super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Hertz et al. (2022) Hertz A, Mokady R, Tenenbaum J, Aberman K, Pritch Y, Cohen-Or D (2022) Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626
Heusel et al. (2017) Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Proceedings of Advances in Neural Information Processing Systems (NeurIPS)
Ho and Salimans (2021) Ho J, Salimans T (2021) Classifier-free diffusion guidance. In: Proceedings of Advances in Neural Information Processing Systems (NeurIPS)
Ho et al. (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. In: Proceedings of Advances in Neural Information Processing Systems (NeurIPS), vol 33
Howard et al. (2019) Howard A, Sandler M, Chu G, Chen LC, Chen B, Tan M, Wang W, Zhu Y, Pang R, Vasudevan V, et al. (2019) Searching for MobileNetV3. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
Hu et al. (2022) Hu EJ, Wallis P, Allen-Zhu Z, Li Y, Wang S, Wang L, Chen W, et al. (2022) LoRA: Low-rank adaptation of large language models. In: Proceedings of International Conference on Learning Representations (ICLR)
Ignatov et al. (2017) Ignatov A, Kobyshev N, Timofte R, Vanhoey K, Van Gool L (2017) DSLR-quality photos on mobile devices with deep convolutional networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
Ji et al. (2020) Ji X, Cao Y, Tai Y, Wang C, Li J, Huang F (2020) Real-world super-resolution via kernel estimation and noise injection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR-W)
Jiang et al. (2021) Jiang Y, Chan KC, Wang X, Loy CC, Liu Z (2021) Robust reference-based super-resolution via c2-matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Jiménez (2023) Jiménez ÁB (2023) Mixture of diffusers for scene composition and high resolution image generation. arXiv preprint arXiv:2302.02412
Karras et al. (2019) Karras T, Laine S, Aila T (2019) A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Karras et al. (2022) Karras T, Aittala M, Aila T, Laine S (2022) Elucidating the design space of diffusion-based generative models. In: Proceedings of Advances in Neural Information Processing Systems (NeurIPS)
Ke et al. (2021) Ke J, Wang Q, Wang Y, Milanfar P, Yang F (2021) MUSIQ: Multi-scale image quality transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
Kingma and Ba (2014) Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980
Ledig et al. (2017) Ledig C, Theis L, Huszár F, Caballero J, Cunningham A, Acosta A, Aitken A, Tejani A, Totz J, Wang Z, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Li et al. (2022) Li H, Yang Y, Chang M, Chen S, Feng H, Xu Z, Li Q, Chen Y (2022) SRDiff: Single image super-resolution with diffusion probabilistic models. Neurocomputing
Liang et al. (2021) Liang J, Cao J, Sun G, Zhang K, Van Gool L, Timofte R (2021) SwinIR: Image restoration using swin transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCV-W)
Liang et al. (2022) Liang J, Zeng H, Zhang L (2022) Efficient and degradation-adaptive network for real-world image super-resolution. In: Proceedings of the European Conference on Computer Vision (ECCV)
Lin et al. (2023) Lin X, He J, Chen Z, Lyu Z, Fei B, Dai B, Ouyang W, Qiao Y, Dong C (2023) DiffBIR: Towards blind image restoration with generative diffusion prior. arXiv preprint arXiv:2308.15070
Liu et al. (2021) Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
Lu et al. (2022) Lu C, Zhou Y, Bao F, Chen J, Li C, Zhu J (2022) DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In: Proceedings of Advances in Neural Information Processing Systems (NeurIPS)
Luo et al. (2023) Luo S, Tan Y, Huang L, Li J, Zhao H (2023) Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378
Maeda (2020) Maeda S (2020) Unpaired image super-resolution using pseudo-supervision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Meng and Kabashima (2022) Meng X, Kabashima Y (2022) Diffusion model based posterior sampling for noisy linear inverse problems. arXiv preprint arXiv:2211.12343
Menon et al. (2020) Menon S, Damian A, Hu S, Ravi N, Rudin C (2020) PULSE: Self-supervised photo upsampling via latent space exploration of generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Molad et al. (2023) Molad E, Horwitz E, Valevski D, Acha AR, Matias Y, Pritch Y, Leviathan Y, Hoshen Y (2023) Dreamix: Video diffusion models are general video editors. arXiv preprint arXiv:2302.01329
Mou et al. (2024) Mou C, Wang X, Xie L, Wu Y, Zhang J, Qi Z, Shan Y (2024) T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In: Proceedings of the AAAI Conference on Artificial Intelligence
Nichol et al. (2022) Nichol AQ, Dhariwal P, Ramesh A, Shyam P, Mishkin P, Mcgrew B, Sutskever I, Chen M (2022) GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In: Proceedings of International Conference on Machine Learning (ICML)
Oord et al. (2018) Oord Avd, Li Y, Vinyals O (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748
Pan et al. (2021) Pan X, Zhan X, Dai B, Lin D, Loy CC, Luo P (2021) Exploiting deep generative prior for versatile image restoration and manipulation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
Podell et al. (2023) Podell D, English Z, Lacey K, Blattmann A, Dockhorn T, Müller J, Penna J, Rombach R (2023) SDXL: Improving latent diffusion models for high-resolution image synthesis. In: Proceedings of International Conference on Learning Representations (ICLR)
Qi et al. (2023) Qi C, Cun X, Zhang Y, Lei C, Wang X, Shan Y, Chen Q (2023) FateZero: Fusing attentions for zero-shot text-based video editing. arXiv preprint arXiv:2303.09535
Ramesh et al. (2021) Ramesh A, Pavlov M, Goh G, Gray S, Voss C, Radford A, Chen M, Sutskever I (2021) Zero-shot text-to-image generation. In: Proceedings of International Conference on Machine Learning (ICML)
Ramesh et al. (2022) Ramesh A, Dhariwal P, Nichol A, Chu C, Chen M (2022) Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125
Rombach et al. (2022) Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B (2022) High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Ronneberger et al. (2015) Ronneberger O, Fischer P, Brox T (2015) U-Net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer, pp 234–241
Sahak et al. (2023) Sahak H, Watson D, Saharia C, Fleet D (2023) Denoising diffusion probabilistic models for robust image super-resolution in the wild. arXiv preprint arXiv:2302.07864
Saharia et al. (2022a) Saharia C, Chan W, Saxena S, Li L, Whang J, Denton E, Ghasemipour SKS, Gontijo-Lopes R, Ayan BK, Salimans T, et al. (2022a) Photorealistic text-to-image diffusion models with deep language understanding. In: Proceedings of Advances in Neural Information Processing Systems (NeurIPS)
Saharia et al. (2022b) Saharia C, Ho J, Chan W, Salimans T, Fleet DJ, Norouzi M (2022b) Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
Sajjadi et al. (2017) Sajjadi MS, Schölkopf B, Hirsch M (2017) EnhanceNet: Single image super-resolution through automated texture synthesis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
Salimans and Ho (2021) Salimans T, Ho J (2021) Progressive distillation for fast sampling of diffusion models. In: Proceedings of International Conference on Learning Representations (ICLR)
Sauer et al. (2023) Sauer A, Lorenz D, Blattmann A, Rombach R (2023) Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042
Sohl-Dickstein et al. (2015) Sohl-Dickstein J, Weiss E, Maheswaranathan N, Ganguli S (2015) Deep unsupervised learning using nonequilibrium thermodynamics. In: Proceedings of International Conference on Machine Learning (ICML)
Song et al. (2020) Song J, Meng C, Ermon S (2020) Denoising diffusion implicit models. In: Proceedings of International Conference on Learning Representations (ICLR)
Song et al. (2023a) Song J, Vahdat A, Mardani M, Kautz J (2023a) Pseudoinverse-guided diffusion models for inverse problems. In: Proceedings of International Conference on Learning Representations (ICLR)
Song et al. (2023b) Song Y, Dhariwal P, Chen M, Sutskever I (2023b) Consistency models. arXiv preprint arXiv:2303.01469
Thorndike et al. (1920) Thorndike EL, et al. (1920) A constant error in psychological ratings. Journal of applied psychology
Timofte et al. (2017) Timofte R, Agustsson E, Van Gool L, Yang MH, Zhang L (2017) NTIRE 2017 challenge on single image super-resolution: Methods and results. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR-W)
Wan et al. (2020) Wan Z, Zhang B, Chen D, Zhang P, Chen D, Liao J, Wen F (2020) Bringing old photos back to life. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Wang et al. (2023) Wang J, Chan KC, Loy CC (2023) Exploring CLIP for assessing the look and feel of images. In: Proceedings of the AAAI Conference on Artificial Intelligence
Wang et al. (2021a) Wang L, Wang Y, Dong X, Xu Q, Yang J, An W, Guo Y (2021a) Unsupervised degradation representation learning for blind super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Wang et al. (2018a) Wang X, Yu K, Dong C, Loy CC (2018a) Recovering realistic texture in image super-resolution by deep spatial feature transform. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Wang et al. (2018b) Wang X, Yu K, Wu S, Gu J, Liu Y, Dong C, Qiao Y, Loy CC (2018b) ESRGAN: Enhanced super-resolution generative adversarial networks. In: Proceedings of the European Conference on Computer Vision Workshops (ECCV-W)
Wang et al. (2021b) Wang X, Li Y, Zhang H, Shan Y (2021b) Towards real-world blind face restoration with generative facial prior. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Wang et al. (2021c) Wang X, Xie L, Dong C, Shan Y (2021c) Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCV-W)
Wang et al. (2022) Wang Y, Yu J, Zhang J (2022) Zero-shot image restoration using denoising diffusion null-space model. In: Proceedings of International Conference on Learning Representations (ICLR)
Wei et al. (2020) Wei P, Xie Z, Lu H, Zhan Z, Ye Q, Zuo W, Lin L (2020) Component divide-and-conquer for real-world image super-resolution. In: Proceedings of the European Conference on Computer Vision (ECCV)
Wei et al. (2021) Wei Y, Gu S, Li Y, Timofte R, Jin L, Song H (2021) Unsupervised real-world image super resolution via domain-distance aware training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Wu et al. (2022) Wu JZ, Ge Y, Wang X, Lei SW, Gu Y, Hsu W, Shan Y, Qie X, Shou MZ (2022) Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation. arXiv preprint arXiv:2212.11565
Xu et al. (2017) Xu X, Sun D, Pan J, Zhang Y, Pfister H, Yang MH (2017) Learning to super-resolve blurry face and text images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
Xu et al. (2019) Xu X, Ma Y, Sun W (2019) Towards real scene super-resolution with raw images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Yang et al. (2020) Yang F, Yang H, Fu J, Lu H, Guo B (2020) Learning texture transformer network for image super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Yang et al. (2021a) Yang S, Sohl-Dickstein J, Kingma DP, Kumar A, Ermon S, Poole B (2021a) Score-based generative modeling through stochastic differential equations. In: Proceedings of International Conference on Learning Representations (ICLR)
Yang et al. (2021b) Yang T, Ren P, Xie X, Zhang L (2021b) GAN prior embedded network for blind face restoration in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Yu et al. (2024) Yu F, Gu J, Li Z, Hu J, Kong X, Wang X, He J, Qiao Y, Dong C (2024) Scaling up to excellence: Practicing model scaling for photo-realistic image restoration in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Yu et al. (2018) Yu K, Dong C, Lin L, Loy CC (2018) Crafting a toolchain for image restoration by deep reinforcement learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Yue and Loy (2022) Yue Z, Loy CC (2022) DifFace: Blind face restoration with diffused error contraction. arXiv preprint arXiv:2212.06512
Yue et al. (2023) Yue Z, Wang J, Loy CC (2023) ResShift: Efficient diffusion model for image super-resolution by residual shifting. In: Proceedings of Advances in Neural Information Processing Systems (NeurIPS)
Zhang et al. (2021a) Zhang J, Lu S, Zhan F, Yu Y (2021a) Blind image super-resolution via contrastive representation learning. arXiv preprint arXiv:2107.00708
Zhang et al. (2021b) Zhang K, Liang J, Van Gool L, Timofte R (2021b) Designing a practical degradation model for deep blind image super-resolution. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
Zhang et al. (2023) Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
Zhang et al. (2018a) Zhang R, Isola P, Efros AA, Shechtman E, Wang O (2018a) The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Zhang et al. (2018b) Zhang Y, Li K, Li K, Wang L, Zhong B, Fu Y (2018b) Image super-resolution using very deep residual channel attention networks. In: Proceedings of the European Conference on Computer Vision (ECCV)
Zhang et al. (2019) Zhang Z, Wang Z, Lin Z, Qi H (2019) Image super-resolution by neural texture transfer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Zhao et al. (2022) Zhao Y, Su YC, Chu CT, Li Y, Renn M, Zhu Y, Chen C, Jia X (2022) Rethinking deep face restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Zheng et al. (2018) Zheng H, Ji M, Wang H, Liu Y, Fang L (2018) CrossNet: An end-to-end reference-based super resolution network using cross-scale warping. In: Proceedings of the European Conference on Computer Vision (ECCV)
Zhou et al. (2020) Zhou S, Zhang J, Zuo W, Loy CC (2020) Cross-scale internal graph neural network for image super-resolution. In: Proceedings of Advances in Neural Information Processing Systems (NeurIPS)
Zhou et al. (2022) Zhou S, Chan KC, Li C, Loy CC (2022) Towards robust blind face restoration with codebook lookup transformer. In: Proceedings of Advances in Neural Information Processing Systems (NeurIPS)
Zhu et al. (2017) Zhu JY, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)