Denoising as Adaptation: Noise-Space Domain Adaptation for Image Restoration

Kang Liao  Zongsheng Yue  Zhouxia Wang  Chen Change Loy
S-Lab, Nanyang Technological University
{kang.liao, zongsheng.yue, zhouxia.wang, ccloy}@ntu.edu.sg
Abstract

Although deep learning-based image restoration methods have made significant progress, they still struggle with limited generalization to real-world scenarios due to the substantial domain gap caused by training on synthetic data. Existing methods address this issue by improving data synthesis pipelines, estimating degradation kernels, employing deep internal learning, and performing domain adaptation and regularization. Previous domain adaptation methods have sought to bridge the domain gap by learning domain-invariant knowledge in either feature or pixel space. However, these techniques often struggle to extend to low-level vision tasks within a stable and compact framework. In this paper, we show that it is possible to perform domain adaptation via the noise space using diffusion models. In particular, by leveraging the unique property of how the multi-step denoising process is influenced by auxiliary conditional inputs, we obtain meaningful gradients from noise prediction to gradually align the restored results of both synthetic and real-world data to a common clean distribution. We refer to this method as denoising as adaptation. To prevent shortcuts during training, we present useful techniques such as channel shuffling and residual-swapping contrastive learning. Experimental results on three classical image restoration tasks, namely denoising, deblurring, and deraining, demonstrate the effectiveness of the proposed method. Code will be released at: https://github.com/KangLiao929/Noise-DA/.

1 Introduction

Image restoration is a long-standing yet challenging problem in computer vision. It includes a variety of sub-tasks, e.g., denoising [1, 2, 3], deblurring [4, 5], and deraining [6, 7], each of which has received considerable research attention. Many existing methods are based on deep learning, typically following a supervised learning pipeline. Since annotated samples are not available in real-world contexts, i.e., degradation is unknown, a common technique is to generate synthetic low-quality data from high-quality images based on some assumptions on the degradation process to obtain training pairs. This technique has achieved considerable success but is not perfect, as synthetic data cannot cover all unknown or unpredictable degradation factors, which can vary wildly due to uncontrollable environmental conditions. Consequently, existing restoration methods often struggle to generalize well to real-world scenarios.

Figure 1: (a) The prediction error of a diffusion model is highly dependent on the quality of the conditional inputs. In this experiment, we introduce an additional condition alongside the original noisy input. This condition is the same target image but corrupted with additive white Gaussian noise at a noise level $\sigma \in [0, 80]$. More details can be found in the Appendix. (b) The restoration network is optimized to provide “good” conditions to minimize the diffusion model’s noise prediction error, aiming for a clean target distribution.

Extensive studies have been conducted to address the lack of real-world training data. Some methods improve the data synthesis pipeline to generate more realistic degraded inputs for training [8, 9]. Other blind restoration approaches estimate the degradation kernel from the real degraded input during inference and use it as a conditional input to guide the restoration [10, 11]. Unsupervised methods [12, 13, 14, 15, 16, 5, 17] enhance input quality without relying on predefined pairs of clean and degraded images. These methods often use deep internal learning or self-supervised learning, where the model learns to predict clean images directly from the noisy or distorted data itself. In this paper, we investigate the problem assuming the existence of both synthetic data and real-world degraded images. This scenario fits a typical domain adaptation setting, where existing methods can be categorized into feature-space [18, 19, 20, 21, 22, 23] and pixel-space [24, 25, 26, 27] approaches. Both paradigms have their weaknesses: aligning high-level deep representations in feature space may overlook low-level variations essential for image restoration, while pixel-space approaches often involve computationally intensive adversarial paradigms that can lead to instability during training.

In this work, we present a novel adaptation method for image restoration, which uses a meaningful diffusion loss to mitigate the domain gap between synthetic and real-world degraded images. Our main idea stems from the observation shown in Fig. 1(a). Here, we measure the noise prediction error of a diffusion model conditioned on a noisy version of the target image. The trend in Fig. 1(a) shows that conditions with lower corruption levels lead to lower prediction errors of the diffusion model. In other words, “good” conditions give low diffusion loss, and “bad” conditions lead to high diffusion loss. While such behavior may be expected, it reveals an interesting property of how conditional inputs influence the prediction error of a diffusion model. Our method leverages this phenomenon by feeding both the restored synthetic image and the restored real image from a restoration network to the diffusion model as conditions, as shown in Fig. 1(b). Both networks are jointly trained, with the restoration network optimized to provide “good” conditions to minimize the diffusion model’s noise prediction error, aiming for a clean target distribution. The goal of providing good conditions drives the restoration network to learn to improve the quality of its outputs. After training, the diffusion model is discarded, leaving only the trained restoration network for inference.

To bridge the gap between the restored synthetic and real outputs, our method carefully conceals the identity of the conditions. This prevents the diffusion model from simply learning to differentiate between synthetic and real conditions based on their channel index, avoiding a trivial shortcut in training. In addition, the synthetic output is easy to identify from its pixel similarity to the noisy synthetic label when they share the same clean image. To avoid these shortcuts, we design a channel shuffling layer at the beginning of the diffusion model. It randomly shuffles the channel index of synthetic and real-world conditions at each training iteration before concatenating them. We further propose a residual-swapping contrastive learning strategy to ensure the model genuinely learns to restore images accurately, rather than relying on easily distinguishable features.

Our work represents the first attempt at addressing domain adaptation in the noise space for image restoration. We show the unique benefits offered by diffusion loss in eliminating the domain gap between the synthetic and real-world data, which cannot be achieved using existing losses. To verify the effectiveness of the proposed method, we conducted extensive experiments on three classical image restoration tasks, including denoising, deblurring, and deraining.

2 Related Work

Image Restoration. Image restoration aims to recover images degraded by factors like noise, blur, or data loss. Driven largely by the capabilities of various neural networks [28, 29, 30, 31, 32, 33], significant advancements have been made in sub-fields such as image denoising [34, 35, 36, 3, 15, 37], image deblurring [38, 39, 40, 41], and image deraining [42, 43, 44, 45]. In image restoration, loss functions are essential for training models. For example, the $L_1$ loss minimizes average absolute pixel differences, ensuring pixel-wise accuracy. Perceptual loss uses pre-trained neural networks to compare high-level features, ensuring perceptual similarity. Adversarial loss involves a discriminator distinguishing between real and restored images, pushing the generator to create more realistic outputs. However, restoration models trained on synthetic images with these conventional loss functions still suffer a significant drop in performance when applied to real-world domains.

To address the mismatch between training and testing degradations, some supervised image restoration techniques [8, 9] improve the data synthesis pipeline, focusing on creating a training degradation distribution that balances accuracy and generalization in real-world scenarios. Some methods [10, 11] estimate and correct the degradation kernels to improve the restoration quality. Our work is orthogonal to these methods, aiming to bridge the gap between training and testing degradations.

Unsupervised learning methods for image restoration leverage models that do not rely on paired training samples [12, 14, 15, 16, 5, 17, 46]. Techniques like Noise2Noise [12], Noise2Void [47], and Deep Image Prior [48] exploit the intrinsic properties of images, where the network learns to restore images by understanding the natural image statistics or by self-supervision. These unsupervised approaches have proven effective in restoration tasks, achieving impressive results comparable to supervised learning methods. However, they often struggle with handling highly complex or corrupted images due to their reliance on learned distributions and intrinsic image properties, which may not fully capture intricate details and show limited generalization to other restoration tasks.

Domain Adaptation. The concept of domain adaptation was proposed to eliminate the discrepancy between source and target domains [49, 50] and thereby improve the generalization ability of learning models. Previous methods can be categorized into feature-space and pixel-space approaches. For example, feature-space adaptation methods [18, 19, 20, 21, 22, 23] adjust the features extracted by networks so that they align across different domains. Classical techniques among these methods include minimizing the distance between feature spaces [18, 20] and introducing domain adversarial objectives [19, 21, 23]. However, aligning high-level deep representations may overlook crucial low-level variations that are essential for target tasks such as image restoration. In contrast, pixel-space domain adaptation methods [24, 25, 26, 27] achieve distribution alignment directly at the raw pixel level, by translating source data to match the “style” of a target domain. While their effectiveness is easier to understand and verify from domain-shifted visualizations, pixel-space adaptation methods require careful tuning and can be unstable during training. Recent methods [51, 52, 53] compensate for the limitation of isolated domain adaptation by jointly aligning feature space and pixel space, sharing a similar pipeline with CycleGAN [54]. However, they tend to be computationally demanding due to the need to train multiple networks (generally two generators and two discriminators) and the complexity of the cycle consistency loss. Different from the above feature-space and pixel-space methods, we propose a new noise-space solution that preserves low-level appearance across different domains with a compact and stable framework.

Diffusion Model. Diffusion models [55, 56, 57] have gained significant attention as a novel approach in generative modeling. They work by gradually transforming a simple distribution (usually Gaussian) into a complex distribution in a series of steps, reversing the diffusion process. This approach shows remarkable success in text-to-image generation [58, 59, 60] and image restoration [61, 62, 63]. Often, conditions are fed to the diffusion model for conditional generation, such as text [58], class label [64], visual prompt [65], and low-resolution image [66], to facilitate the approximation of the target distribution. In this work, we show that the diffusion’s forward denoising process has the potential to serve as a proxy task to improve the model’s generalization ability in image restoration tasks.

Figure 2: Overview of the proposed framework. The restored synthetic and real-world images from the image restoration network are conditioned onto the diffusion model, adapting to the clean distribution in a multi-step denoising manner. Both networks are jointly trained, and the diffusion model is discarded after training. Gradients obtained from the diffusion model are used to drive the restoration network to produce better conditions.

3 Methodology

Problem Definition. We start by formulating the problem of noise-space domain adaptation in the context of image restoration. Given a labeled dataset from a synthetic domain and an unlabeled dataset from a real-world domain, we aim to train a model on both the synthetic and real data that generalizes well to the real-world domain. (Following the notation in domain adaptation, we use “label” to denote the ground truth image in the task of image restoration.) Suppose that $\mathcal{D}^s = \{(\bm{x}_i^s, \bm{y}_i^s)\}_{i=1}^{N^s}$ denotes the labeled dataset containing $N^s$ samples from the source synthetic domain and $\mathcal{D}^r = \{\bm{x}_i^r\}_{i=1}^{N^r}$ denotes the unlabeled dataset with $N^r$ samples from the target real-world domain, where $\bm{y}^s$ is the clean image, $\bm{x}^s$ is the corresponding synthetic degraded image, and $\bm{x}^r$ is the real-world degraded image.

Image Restoration Baseline. The image restoration network can be generally formulated as a deep neural network $G(\cdot; \bm{\theta}_G)$ with learnable parameters $\bm{\theta}_G$. This network is trained to predict the ground truth image $\bm{y}^s$ from its degraded observation $\bm{x}^s$ on the synthetic domain. The proposed noise-space domain adaptation is not limited to a specific type of network architecture. One can choose from existing networks such as DnCNN [1], U-Net [67], RCAN [68], and SwinIR [33]. The approach is also orthogonal to existing loss functions used in image restoration, e.g., $L_1$ or $L_2$ loss, Charbonnier loss [69], perceptual loss [70, 71], and adversarial loss [30, 39]. To better validate the generality of the proposed approach, we adopt the widely used U-Net architecture and the Charbonnier loss, denoted as $\mathcal{L}_{Res}$, as our baseline. In the joint training, the diffusion model is trained using a diffusion objective, $\mathcal{L}_{Dif}$, while the restoration network is updated using both $\mathcal{L}_{Res}$ and $\mathcal{L}_{Dif}$. The diffusion model is discarded after training.
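For reference, the Charbonnier loss adopted as $\mathcal{L}_{Res}$ can be sketched in a few lines of NumPy. This is only an illustration: the function name `charbonnier_loss` and the smoothing constant `eps = 1e-3` are assumptions (a commonly used default), since this section does not state the constant used in the experiments.

```python
import numpy as np

def charbonnier_loss(pred, target, eps=1e-3):
    """Charbonnier loss: mean of sqrt(diff^2 + eps^2), a smooth variant of L1.

    eps is an assumed smoothing constant, not a value taken from the paper.
    """
    diff = pred - target
    return float(np.mean(np.sqrt(diff * diff + eps * eps)))

# Toy example: a clean image y^s and a slightly perturbed "restored" image.
rng = np.random.default_rng(0)
y_s = rng.random((3, 8, 8))                                # ground truth, CHW layout
y_hat_s = y_s + 0.05 * rng.standard_normal(y_s.shape)      # stand-in for G(x^s)
print(charbonnier_loss(y_hat_s, y_s))
```

For a perfect prediction the loss reduces to `eps` rather than zero, which keeps the gradient well defined near zero error.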

3.1 Noise-Space Domain Adaptation

Ideally, the ground truth images and the images restored by an image restoration model from both synthetic and real-world data should lie in a shared distribution $\mathcal{S}$ of high-quality clean images. However, attaining such an ideal model, one that can universally map any degraded image onto the distribution $\mathcal{S}$, is exceedingly challenging. Assuming that the restored images from synthetic and real-world data obey distinct distributions $\mathcal{S}^s$ and $\mathcal{S}^r$, our goal is to align $\mathcal{S}^s$ and $\mathcal{S}^r$ with $\mathcal{S}$. To this end, we introduce a diffusion model that conditions on $\mathcal{S}^s$ and $\mathcal{S}^r$ to approximate the target distribution $\mathcal{S}$. During training, this diffusion model is expected to guide the conditional distributions $\mathcal{S}^s$ and $\mathcal{S}^r$ toward $\mathcal{S}$.

Given the commonly adopted case where the ground truth images from the synthetic dataset are available, we first explore adapting to the target distribution $\mathcal{S}$ from the perspective of paired data. Without loss of generality, let us consider a synthetic degraded image $\bm{x}^s$ with its ground truth $\bm{y}^s$ from the synthetic domain and a real degraded image $\bm{x}^r$ from the real-world domain. Using the restoration network $G(\cdot; \bm{\theta}_G)$, we can obtain the restored images $\hat{\bm{y}}^s$ and $\hat{\bm{y}}^r$, respectively. Then, we introduce a diffusion denoising process as a proxy task in the training process. It employs the predicted images $\hat{\bm{y}}^s$ and $\hat{\bm{y}}^r$ as conditions to help the diffusion model fit the distribution $\mathcal{S}$. Following the notation in DDPM [56], we denote the diffusion model as $\bm{\epsilon}_\theta$ and formulate its optimization as the following objective:

$$\mathcal{L}_{Dif} = \mathbb{E}\left\| \bm{\epsilon} - \bm{\epsilon}_\theta\left(\tilde{\bm{y}}^s, \mathbf{C}(\hat{\bm{y}}^s, \hat{\bm{y}}^r), t\right) \right\|_2, \tag{1}$$

where $\tilde{\bm{y}}^s = \sqrt{\bar{\alpha}_t}\,\bm{y}^s + \sqrt{1-\bar{\alpha}_t}\,\bm{\epsilon}$, $\bm{\epsilon} \sim N(0, \bm{I})$, $\bar{\alpha}_t$ is the hyper-parameter of the noise schedule, and $\mathbf{C}(\cdot,\cdot)$ denotes the concatenation operation along the channel dimension. During the joint training shown in Fig. 2, supervision from the diffusion loss in Eq. (1) will back-propagate to the conditions $\hat{\bm{y}}^s$ and $\hat{\bm{y}}^r$ if they are under-restored, i.e., far away from the expected distribution $\mathcal{S}$. This encourages the preceding restoration network to align $\hat{\bm{y}}^s$ and $\hat{\bm{y}}^r$ as closely as possible to $\mathcal{S}$.
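To make the quantities in Eq. (1) concrete, the following NumPy sketch builds the noisy target $\tilde{\bm{y}}^s$, concatenates the two conditions along the channel axis, and evaluates the noise prediction error for one sample. The function `toy_denoiser` is a hypothetical stand-in for a trained $\bm{\epsilon}_\theta$: it simply treats the first three condition channels as an estimate of the clean image and inverts the forward equation. It is not the paper's network, and the value `alpha_bar_t = 0.5` is an arbitrary assumption. The sketch only illustrates why a cleaner condition can yield a lower diffusion loss.

```python
import numpy as np

def noisy_target(y, eps, alpha_bar_t):
    """Forward diffusion: y_tilde = sqrt(a_bar) * y + sqrt(1 - a_bar) * eps."""
    return np.sqrt(alpha_bar_t) * y + np.sqrt(1.0 - alpha_bar_t) * eps

def diffusion_loss(eps, eps_pred):
    """L2 norm of the noise prediction error for one sample (Eq. (1), no expectation)."""
    return float(np.linalg.norm((eps - eps_pred).ravel()))

def toy_denoiser(y_tilde, cond, alpha_bar_t):
    """Hypothetical stand-in for eps_theta (NOT a trained network).

    Uses the first 3 condition channels as a clean-image estimate and
    solves the forward equation for the noise.
    """
    y_est = cond[:3]
    return (y_tilde - np.sqrt(alpha_bar_t) * y_est) / np.sqrt(1.0 - alpha_bar_t)

rng = np.random.default_rng(0)
y_s = rng.random((3, 8, 8))               # clean synthetic target y^s (CHW)
y_hat_r = rng.random((3, 8, 8))           # restored real image (ignored by the toy stub)
eps = rng.standard_normal(y_s.shape)      # ground truth noise
alpha_bar_t = 0.5                         # assumed noise-schedule value at step t
y_tilde = noisy_target(y_s, eps, alpha_bar_t)

# C(y_hat_s, y_hat_r): channel-wise concatenation of the two conditions.
good = np.concatenate([y_s, y_hat_r], axis=0)
bad = np.concatenate([y_s + 0.3 * rng.standard_normal(y_s.shape), y_hat_r], axis=0)

loss_good = diffusion_loss(eps, toy_denoiser(y_tilde, good, alpha_bar_t))
loss_bad = diffusion_loss(eps, toy_denoiser(y_tilde, bad, alpha_bar_t))
print(loss_good, loss_bad)
```

In the actual method the denoiser is a learned U-Net and the gradient of this loss flows back through the conditions into the restoration network; the stub above has no parameters and is only meant to show the data flow.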

Figure 3: During joint training of image restoration network and diffusion network, the restored result of synthetic degraded images smoothly converges to the expected distribution over the epochs. However, the model tends to find shortcut learning on real-world images by matching the similarity between the conditions and the paired clean image or remembering the channel index. Consequently, the image restoration network learns to corrupt the high-frequency details in real-world images and the diffusion model learns to ignore them.

The joint training, however, could lead to trivial solutions or shortcuts, as shown in Fig. 3. For example, it is easy to distinguish the “synthetic” and “real” conditions by the pixel similarity between $\hat{\bm{y}}^s$ and $\tilde{\bm{y}}^s$ or by the channel index. Consequently, the restoration network can cheat the diffusion network by coarsely degrading the high-frequency information in real-world images. As illustrated in Fig. 3 (bottom), we identify three stages in this training process: (I) the diffusion network struggles to recognize which conditions aid denoising as both are heavily degraded, prompting the restoration network to enhance both; (II) the synthetic image is clearly restored and is easy to discriminate from its appearance; (III) the diffusion model distinguishes between the conditions, leading the restoration network to focus on the synthetic data while ignoring the real-world data.

Figure 4: The proposed solution to eliminate the shortcut learning in diffusion.

3.2 Eliminating Shortcut Learning in Diffusion

To avoid the above shortcut in diffusion, as shown in Fig. 4, we first propose a channel shuffling layer $f_{cs}$ to randomly shuffle the channel index of synthetic and real-world conditions at each iteration before concatenating them, i.e., $\mathbf{C}(f_{cs}(\hat{\bm{y}}^s, \hat{\bm{y}}^r))$. We omit the shuffling operator $f_{cs}$ in the following presentation for notational clarity. We show in the experiments that this strategy is crucial to bridging the gap between synthetic and real data.
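One plausible reading of the channel shuffling layer $f_{cs}$ is that the order of the two conditions is randomized before channel-wise concatenation, so the diffusion model cannot rely on a fixed channel position to tell them apart. The NumPy sketch below follows this reading; it is an assumption rather than the released implementation, and the function name `channel_shuffle_concat` is illustrative.

```python
import numpy as np

def channel_shuffle_concat(y_hat_s, y_hat_r, rng):
    """Concatenate the two conditions along the channel axis in random order.

    Assumed interpretation of f_cs: with probability 0.5 the synthetic
    condition comes first, otherwise the real-world condition comes first.
    """
    if rng.random() < 0.5:
        first, second = y_hat_s, y_hat_r
    else:
        first, second = y_hat_r, y_hat_s
    return np.concatenate([first, second], axis=0)

rng = np.random.default_rng(0)
y_hat_s = np.zeros((3, 4, 4))   # placeholder restored synthetic image (CHW)
y_hat_r = np.ones((3, 4, 4))    # placeholder restored real-world image (CHW)
cond = channel_shuffle_concat(y_hat_s, y_hat_r, rng)
print(cond.shape)               # 6 channels: two RGB conditions stacked
```

The shuffle is drawn anew at every training iteration, so over training each condition occupies both channel positions.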

In addition to channel shuffling, we also devise residual-swapping contrastive learning to ensure the network learns to restore genuinely instead of overfitting the paired synthetic appearance. Using the ground truth noise $\bm{\epsilon}$ as the anchor, we construct a positive example $\bm{\epsilon}^{pos}$ derived from Eq. (1): $\bm{\epsilon}^{pos} = \bm{\epsilon}_\theta\left(\tilde{\bm{y}}^s, \mathbf{C}(\hat{\bm{y}}^s, \hat{\bm{y}}^r), t\right)$, i.e., the noise predicted by the diffusion model conditioned on the restored synthetic and real-world images. We then swap the residual maps of these two conditions and formulate a negative example $\bm{\epsilon}^{neg}$ as follows:

$$\bm{\epsilon}^{neg} = \bm{\epsilon}_\theta\left(\tilde{\bm{y}}^s, \mathbf{C}(\hat{\bm{y}}^{s\leftarrow r}, \hat{\bm{y}}^{r\leftarrow s}), t\right), \quad \hat{\bm{y}}^{s\leftarrow r} = \bm{x}^s \oplus \mathcal{R}^r, \quad \hat{\bm{y}}^{r\leftarrow s} = \bm{x}^r \oplus \mathcal{R}^s, \tag{2}$$

where $\mathcal{R}^s$ and $\mathcal{R}^r$ are the residual maps estimated by the restoration network for the synthetic image and the real-world image, respectively, and $\oplus$ is the pixel-wise addition operator. By swapping the residual maps of the two conditions, we constrain the diffusion model to push the wrongly restored results away from the expected clean distribution regardless of their context. Based on the positive, negative, and anchor examples, the residual-swapping contrastive learning can be formulated as:

$$\mathcal{L}_{Con} = \max\left( \|\bm{\epsilon} - \bm{\epsilon}^{pos}\|_2 - \|\bm{\epsilon} - \bm{\epsilon}^{neg}\|_2 + \delta,\ 0 \right), \tag{3}$$

where $\delta$ denotes a predefined margin to separate the positive and negative samples. In this way, the loss of the diffusion model takes the mean of Eq. (1) and Eq. (3).
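The residual swap in Eq. (2) and the margin loss in Eq. (3) can be sketched as follows. The sketch assumes, as the text suggests, that the restoration network outputs a residual map that is added pixel-wise to its input. The function names are illustrative, and the margin is left as a required argument because its value is not given in this section.

```python
import numpy as np

def swap_residuals(x_s, x_r, res_s, res_r):
    """Build the positive and negative condition pairs.

    Positive: each input receives its own residual (ordinary restoration).
    Negative: residuals are swapped between the two inputs, as in Eq. (2).
    """
    positive = (x_s + res_s, x_r + res_r)   # (y_hat_s, y_hat_r)
    negative = (x_s + res_r, x_r + res_s)   # (y_hat_{s<-r}, y_hat_{r<-s})
    return positive, negative

def contrastive_loss(eps, eps_pos, eps_neg, margin):
    """Margin loss of Eq. (3) with the ground truth noise eps as the anchor."""
    d_pos = np.linalg.norm((eps - eps_pos).ravel())
    d_neg = np.linalg.norm((eps - eps_neg).ravel())
    return float(max(d_pos - d_neg + margin, 0.0))

# Toy check with hand-picked noise predictions (no network involved).
eps = np.zeros(4)
close = np.zeros(4)     # prediction equal to the anchor
far = np.ones(4)        # prediction at L2 distance 2 from the anchor
print(contrastive_loss(eps, close, far, margin=0.5))   # positive closer: no penalty
print(contrastive_loss(eps, far, close, margin=0.5))   # positive farther: penalized
```

The loss is zero once the positive prediction is closer to the anchor than the negative one by at least the margin, and grows linearly otherwise.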

In the above formulation, the synthetic restored image in the condition, denoted as $\hat{\bm{y}}^{s}$, and the input to the diffusion model, denoted as $\tilde{\bm{y}}^{s}$, form a pair of data with evident pixel-wise similarity. This similarity can potentially mislead the diffusion model into ignoring the real restored image $\hat{\bm{y}}^{r}$ in the condition, as analyzed in Fig. 3. It is important to note that the distribution $\mathcal{S}$ encapsulates the domain knowledge of high-quality clean images, including but not limited to the ground-truth images in the synthetic dataset. Motivated by this observation, the proposed method can be further extended by replacing the noisy input $\tilde{\bm{y}}^{s}$ with $\tilde{\bm{y}}^{c}$, defined as $\tilde{\bm{y}}^{c}=\sqrt{\bar{\alpha}_{t}}\,\bm{y}^{c}+\sqrt{1-\bar{\alpha}_{t}}\,\bm{\epsilon}$, where $\bm{y}^{c}$ is randomly sampled from an extensive unpaired high-quality image dataset. This strategy breaks the pixel-wise similarity between the “synthetic” condition and the diffusion input, thus forcing the diffusion model to guide both the “synthetic” and “real” conditions predicted by the restoration network at the domain level. We provide an ablation on this setting in Sec. 4.1.
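The construction of the noisy input from an unpaired clean image can be sketched as follows. This is a minimal NumPy illustration of the expression $\sqrt{\bar{\alpha}_{t}}\,\bm{y}^{c}+\sqrt{1-\bar{\alpha}_{t}}\,\bm{\epsilon}$ only; the function name `noise_clean_image`, the random stand-in patch, and the value `alpha_bar_t=0.5` are hypothetical choices made for the example and are not taken from the released code.

```python
import numpy as np

def noise_clean_image(y_c, alpha_bar_t, rng):
    """Return (y_tilde_c, eps), where
    y_tilde_c = sqrt(alpha_bar_t) * y_c + sqrt(1 - alpha_bar_t) * eps
    and eps is standard Gaussian noise with the same shape as y_c."""
    eps = rng.standard_normal(y_c.shape)
    y_tilde_c = np.sqrt(alpha_bar_t) * y_c + np.sqrt(1.0 - alpha_bar_t) * eps
    return y_tilde_c, eps

rng = np.random.default_rng(0)
# Stand-in for a patch drawn from the unpaired clean dataset (assumed range [-1, 1]).
y_c = rng.uniform(-1.0, 1.0, size=(3, 128, 128))
# alpha_bar_t = 0.5 is an arbitrary illustrative noise level.
y_tilde_c, eps = noise_clean_image(y_c, alpha_bar_t=0.5, rng=rng)
```

Because `y_c` is unrelated to either restored condition, the noisy input carries no pixel-wise cue that would identify the synthetic condition.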

3.3 Training

The image restoration network and diffusion model are jointly optimized by:

$$\mathcal{L}=\mathcal{L}_{Res}+\lambda_{Dif}\left[\frac{\mathcal{L}_{Dif}+\mathcal{L}_{Con}}{2}\right]. \tag{4}$$

Following previous work [19], we gradually increase $\lambda_{Dif}$ from $0$ to $\beta$ to avoid distracting the main image restoration task during the early stages of the training process:

$$\lambda_{Dif}=\left(\frac{2}{1+\exp(-\gamma\cdot p)}-1\right)\cdot\beta, \tag{5}$$

where $\gamma$ and $\beta$ are empirically set to $5$ and $0.2$, respectively, in all experiments, and $p=\min\left(\frac{n}{N},1\right)$, where $n$ denotes the current epoch index and $N$ represents the total number of training epochs.
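The weighting schedule of Eq. (5) and the combined objective of Eq. (4) can be sketched in a few lines of Python. The function names `lambda_dif` and `total_loss` are hypothetical, and the scalar loss values are placeholders; only $\gamma=5$ and $\beta=0.2$ come from the text.

```python
import math

GAMMA = 5.0  # gamma in Eq. (5)
BETA = 0.2   # beta in Eq. (5)

def lambda_dif(epoch, total_epochs, gamma=GAMMA, beta=BETA):
    """Weight of the diffusion-side loss at the given epoch, Eq. (5)."""
    p = min(epoch / total_epochs, 1.0)
    return (2.0 / (1.0 + math.exp(-gamma * p)) - 1.0) * beta

def total_loss(loss_res, loss_dif, loss_con, lam):
    """Joint objective of Eq. (4): restoration loss plus the weighted
    mean of the diffusion and contrastive losses."""
    return loss_res + lam * 0.5 * (loss_dif + loss_con)

# The weight starts at 0 and rises monotonically with the epoch index.
for n in (0, 25, 50, 100):
    print(n, round(lambda_dif(n, 100), 4))
```

With $\gamma=5$, the weight at $p=1$ equals $\beta\tanh(2.5)\approx 0.197$, so the schedule approaches $\beta$ closely without reaching it exactly.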

3.4 Discussion

The proposed denoising as adaptation is reminiscent of the domain adversarial objective proposed by Ganin and Lempitsky [19]. The main difference is that we do not use a domain classifier with a gradient reversal layer but a diffusion network for the loss. We categorize methods like [19] as feature-space domain adaptation approaches. Unlike these approaches, we show that denoising as adaptation is better suited for image restoration, as it can better preserve low-level appearance in the pixel-wise noise space. Compared to pixel-space approaches, which usually require multiple generator and discriminator networks, our method adopts a compact framework incorporating only a single additional denoising U-Net, ensuring stable adaptation training. After training, the diffusion network is discarded, and only the learned restoration network is required for testing. The framework comparison of the above three types of methods is presented in Sec. B of the Appendix.

4 Experiments

Training Dataset. For image denoising, we follow previous works [34, 31] and construct the synthetic training dataset based on DIV2K [72], Flickr2K [73], WED [74], and BSD [75]. The noisy images are obtained by adding additive white Gaussian noise (AWGN) of noise level $\sigma\in[0,75]$ to the source clean images. We use the training dataset of SIDD [76] as the real-world data. For image deraining, the synthetic and real-world training datasets are obtained from Rain13K [45] and SPA [77], respectively. For image deblurring, GoPro [38] and RealBlur-J [78] are selected as the synthetic and real-world training datasets, respectively. Please note that we only use the degraded images from these real-world datasets (without the ground truth) for training purposes. For large-scale unpaired clean images, all images in the MS-COCO dataset [79] are used.
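The synthetic denoising pairs described above can be generated along the following lines. This is a sketch under stated assumptions: the text gives only the range $\sigma\in[0,75]$, so drawing $\sigma$ uniformly per image, expressing it on the $[0,255]$ intensity scale, and leaving the result unclipped are assumptions of this example, and `add_awgn` is a hypothetical helper name.

```python
import numpy as np

def add_awgn(clean, rng, sigma_range=(0.0, 75.0)):
    """Add zero-mean white Gaussian noise to a clean image.

    clean: float array with intensities in [0, 255].
    Returns (noisy, sigma); sigma is drawn uniformly from sigma_range
    (an assumption; the text states only the range)."""
    sigma = rng.uniform(*sigma_range)
    noisy = clean + rng.normal(0.0, sigma, size=clean.shape)
    return noisy, sigma

rng = np.random.default_rng(0)
clean = np.full((64, 64, 3), 128.0)  # flat gray stand-in for a clean patch
noisy, sigma = add_awgn(clean, rng)
```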

Testing Dataset. The testing images of the real-world datasets (SIDD [76], SPA [77], RealBlur-J [78]) are employed to evaluate the performance of the corresponding image restoration models.

Training Settings. To train the diffusion model, we adopt $\alpha$ conditioning and a linear noise schedule ranging from $10^{-6}$ to $10^{-2}$, following previous works [61, 62, 80]. Moreover, an EMA strategy with a decay factor of $0.9999$ is used across our experiments. Both the restoration and diffusion networks are trained on $128\times 128$ patches, with random cropping and rotation for data augmentation. Our model is trained with a fixed learning rate of $5\times 10^{-5}$ using the Adam [81] optimizer, and the batch size is set to 40. All experiments are conducted on NVIDIA A100 GPUs.
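For intuition, the linear noise schedule and the EMA update can be sketched as below. The number of diffusion steps `T = 1000` is an assumption inferred from the time-step range used in the ablation study, and `ema_update` is a hypothetical helper operating on a plain dictionary of parameters rather than on an actual network.

```python
import numpy as np

T = 1000  # assumed number of diffusion steps
betas = np.linspace(1e-6, 1e-2, T)   # linear schedule from 1e-6 to 1e-2
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal-retention factor

def ema_update(ema_params, params, decay=0.9999):
    """One exponential-moving-average step over a dict of parameters."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k] for k in params}

# Early time-steps keep most of the signal, late time-steps almost none.
print(alpha_bar[99], alpha_bar[899])
```

Under this assumed schedule, more than 90% of the signal variance is retained at $t=100$ and less than 10% at $t=900$, which matches the low-noise and high-noise regimes contrasted in the ablation study.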

Metrics. The performance of various methods is mainly evaluated using the classical metrics: PSNR, SSIM, and LPIPS. For the image deraining task, we calculate PSNR/SSIM scores using the Y channel in YCbCr color space following existing methods [42, 43, 31].
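A Y-channel PSNR of the kind used for deraining can be computed as in the sketch below. The text does not specify the color conversion, so the ITU-R BT.601 luma coefficients (as used by MATLAB's `rgb2ycbcr`) and the peak value of 255 are assumptions of this example; `rgb_to_y` and `psnr_y` are hypothetical helper names.

```python
import numpy as np

def rgb_to_y(img):
    """Luma (Y) channel of an RGB image with values in [0, 1], shape (H, W, 3).
    Uses BT.601 coefficients; the output lies in [16, 235]."""
    return 16.0 + img @ np.array([65.481, 128.553, 24.966])

def psnr_y(restored, target):
    """PSNR in dB between the Y channels of two RGB images in [0, 1]."""
    mse = np.mean((rgb_to_y(restored) - rgb_to_y(target)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)

rng = np.random.default_rng(0)
target = rng.uniform(0.0, 1.0, size=(32, 32, 3))
restored = np.clip(target + rng.normal(0.0, 0.02, size=target.shape), 0.0, 1.0)
score = psnr_y(restored, target)
```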

4.1 Comparisons with State-of-the-Art Methods

We implement the proposed noise-space domain adaptation method using a simple, classical U-Net architecture [82]. To validate its effectiveness, we compare the proposed method with previous domain adaptation approaches, including DANN [19], DSN [22], PixelDA [27], and CyCADA [51], covering both feature-space and pixel-space adaptation solutions. For a fair comparison, we retrained these methods with the same standard settings and datasets. In addition, we consider some unsupervised image restoration methods and representative supervised methods such as Ne2Ne [14], MaskedD [15], NLCL [16], SelfDeblur [5], VDIP [17], and Restormer [31].

Figure 5: Visual comparison of the image denoising task on SIDD test dataset [76].

Comparison Results. The quantitative and qualitative comparison results are shown in Tabs. 1-3 and Figs. 5-6. The proposed method achieves the best PSNR and SSIM among the compared methods on all three image restoration tasks. In particular, previous feature-space domain adaptation methods [19, 22, 51] fail to capture the crucial low-level information, and pixel-space domain adaptation methods [27, 51] yield inferior results because precise style transfer between two domains is hard to control during adversarial training. Moreover, the self-supervised and unsupervised restoration methods [14, 15, 5, 16, 17] show noticeable artifacts and limited generalization performance due to inevitable information loss and hand-crafted designs for specific degradations. By contrast, our method achieves fine-grained domain adaptation in the pixel-wise noise space without introducing unstable training.

Table 1: Quantitative evaluation of the image denoising task on SIDD test dataset [76]. syn, real, both denote the model is trained on synthetic, real-world (w/o GT), and both synthetic and real-world (w/o GT) datasets, respectively. The best score is highlighted with color shading.

| Metrics | Vanilla | DANN [19] | DSN [22] | PixelDA [27] | CyCADA [51] | Ne2Ne [14] | MaskedD [15] | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Space | - | Feature | Feature | Pixel | Feature&Pixel | - | - | Noise |
| Train Data | syn | both | both | both | both | real | real | both |
| PSNR ↑ | 26.58 | 30.09 | 28.40 | 29.24 | 30.81 | 25.61 | 28.51 | 34.71 |
| SSIM ↑ | 0.6132 | 0.7832 | 0.6984 | 0.7611 | 0.8067 | 0.5647 | 0.7196 | 0.9202 |
| LPIPS ↓ | 0.3171 | 0.1348 | 0.2265 | 0.1403 | 0.1256 | 0.3039 | 0.2348 | 0.0903 |

Table 2: Quantitative evaluation of the image deraining task on SPA test dataset [77]. PSNR/SSIM scores are calculated using the Y channel in the YCbCr color space.

| Metrics | Vanilla | DANN [19] | DSN [22] | PixelDA [27] | CyCADA [51] | NLCL [16] | Restormer [31] | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Space | - | Feature | Feature | Pixel | Feature&Pixel | - | - | Noise |
| Train Data | syn | both | both | both | both | real | syn | both |
| PSNR ↑ | 33.04 | 32.21 | 33.56 | 30.20 | 32.21 | 20.68 | 34.17 | 34.39 |
| SSIM ↑ | 0.9540 | 0.9443 | 0.9552 | 0.9288 | 0.9442 | 0.8412 | 0.9492 | 0.9571 |
| LPIPS ↓ | 0.0477 | 0.0597 | 0.0512 | 0.0758 | 0.0597 | 0.0967 | 0.0488 | 0.0462 |

Table 3: Quantitative evaluation of the image deblurring task on RealBlur-J test dataset [78].

| Metrics | Vanilla | DANN [19] | DSN [22] | PixelDA [27] | CyCADA [51] | SelfDeblur [5] | VDIP [17] | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Space | - | Feature | Feature | Pixel | Feature&Pixel | - | - | Noise |
| Train Data | syn | both | both | both | both | real | real | both |
| PSNR ↑ | 26.27 | 26.11 | 26.28 | 24.71 | 26.36 | 23.23 | 24.89 | 26.46 |
| SSIM ↑ | 0.8012 | 0.7945 | 0.8003 | 0.7646 | 0.7936 | 0.6699 | 0.7404 | 0.8048 |
| LPIPS ↓ | 0.1389 | 0.1345 | 0.1380 | 0.1583 | 0.1340 | 0.1340 | 0.1589 | 0.1363 |

Figure 6: Visual comparison of the image deraining and image deblurring tasks on SPA [77] and RealBlur-J [78] test datasets.

Analysis. From the above results, we observe that the proposed method yields noticeable improvements over the Vanilla baseline (trained only on synthetic datasets) on image restoration tasks that involve high-frequency noise, such as image denoising and image deraining. For image denoising in particular, PSNR/SSIM improve by $+8.13/0.3070$. We argue that the target of image denoising naturally fits that of the denoising process in the diffusion model, which is more sensitive to other Gaussian-like noise with respect to the pre-sampled noise space. Thus, an intense diffusion loss is back-propagated if the conditioned images are under-restored, and the preceding restoration network tries to eliminate the noise in both the synthetic and real-world images as much as possible.

Table 4: Quantitative metrics of the proposed method (Ours) and its extension to the unpaired condition case (Ours-Ex). The results are reported as PSNR/SSIM/LPIPS. The best and second best scores are highlighted and underlined.

| Task | Ours | Ours-Ex |
| --- | --- | --- |
| Denoising | 34.71/0.9202/0.0903 | 33.44/0.8938/0.1064 |
| Deraining | 34.39/0.9571/0.0462 | 34.20/0.9587/0.0444 |
| Deblurring | 26.46/0.8048/0.1363 | 26.44/0.8030/0.1313 |

Extension. As mentioned in Sec. 3.2, our method can be extended to the unpaired condition case by replacing the diffusion input with images from other clean datasets. The shortcut issue is then directly eliminated, since trivial solutions such as matching the pixel-wise similarity between input and condition no longer exist. This extension keeps the channel shuffling layer but does not require the residual-swapping contrastive learning. We show the quantitative evaluation in Tab. 4. The results demonstrate that, although the condition and diffusion input are unpaired, our method can still learn to adapt the restored results from the synthetic and real-world domains to the clean image distribution, which also complements the restoration performance of the paired solution in some tasks (e.g., SSIM and LPIPS on deraining, and LPIPS on deblurring). More qualitative results are presented in the Appendix.

Noise Sampling Range Strategy Metrics
Exp. [1, 100] [900, 1000] [1, 1000] CS RS PSNR ↑ SSIM ↑
(a) 26.58 0.6132
(b) 16.77 0.6070
(c) 27.36 0.6590
(d) 32.07 0.8706
(e) 32.91 0.9082
(f) (Ours) 34.71 0.9202
Table 5: Ablation studies of variant networks on the SIDD test image denoising dataset. CS and RS represent the proposed channel shuffling layer and residual-swapping contrastive learning strategies, respectively.
Figure 7: Visual comparison results of ablation studies.

4.2 Ablation Studies

To evaluate the effectiveness of different components in the proposed method, we conduct ablation studies on the sampled noise levels of the diffusion model, determined by the time-step $t$, and on the training strategies to avoid shortcut learning, as shown in Tab. 5 and Fig. 7. Concretely, with low noise intensity, e.g., $t\in[1,100]$, it is easy for the diffusion model to discriminate the similarity of paired synthetic data even when the conditions are under-restored. As a result, shortcut learning occurs earlier during the training process and the real-world degraded image is heavily corrupted by the restoration network, with almost all details filtered out. On the other hand, when the intensity of the sampled noise is high, e.g., $t\in[900,1000]$, the diffusion model struggles to converge and the whole framework falls into a local optimum. By sampling the noise from a more diverse range with $t\in[1,1000]$, the restored results can be gradually adapted to the clean distribution. Moreover, the generalization ability of the restoration network is further improved by the designed channel shuffling layer (CS) and residual-swapping contrastive learning strategy (RS), which effectively eliminate the shortcut learning of the diffusion model. Therefore, higher restoration performance on real-world images and a more realistic visual appearance can be observed from (d) to (e) and (f) in Tab. 5 and Fig. 7.

Figure 8: Scalability of the proposed method on different network architectures.

4.3 Scalability

We further validate the scalability of the proposed method using different variants of U-Net-based image restoration networks and other types of architectures such as the Transformer-based network [32]. In particular, we group these networks by model size and obtain: Unet-T, Unet-S (the model investigated in the above experiments), Unet-B, Uformer-T, Uformer-S, and Uformer-B. More network details are listed in the Appendix. The quantitative results of PSNR vs. computational cost on the SIDD test dataset [76] are shown in Fig. 8. As the complexity and parameter count increase, the vanilla restoration network (orange elements) tends to overfit the synthetic training dataset and performs worse on the real-world test dataset. In contrast, the proposed domain adaptation method improves the generalization ability of image restoration models of various sizes and architectures (blue elements). Interestingly, for each type of architecture, our method achieves better adaptation performance as the complexity of the restoration network increases, demonstrating its effectiveness in addressing the overfitting problem of large models.

4.4 Limitation and Broader Impacts

In the proposed noise-space domain adaptation, we discard the diffusion model after its joint training with the image restoration network. Like previous domain adaptation methods, we need to retrain the proxy model whenever new datasets or tasks are involved. However, our learned diffusion model has a strong capacity to discriminate whether the restored results are good or inferior. Thus, it could be leveraged to directly provide prior knowledge for new restoration-related tasks. Consequently, the adaptation from the synthetic domain to a new real-world domain could be achieved more efficiently. We leave this for future work. As for the broader impacts of this work, image restoration can improve the quality and accessibility of visual information in various fields, including medical imaging, satellite imagery, and historical document preservation. It can contribute to better diagnostic accuracy in healthcare, more precise environmental monitoring, and the safeguarding of cultural heritage. While the image restoration network can vastly improve degraded images, it also carries the risk of inadvertently editing and altering the original information, which may compromise the authenticity of the restored images.

5 Conclusion

In this work, we have presented a novel approach that harnesses the diffusion model as a proxy network to address the domain adaptation issues in image restoration tasks. Different from previous feature-space and pixel-space domain adaptation approaches, the proposed method adapts the restored results to their shared clean distribution in the pixel-wise noise space, resulting in significant low-level appearance improvements within a compact and stable training framework. To mitigate the shortcut issue arising from the joint training of the restoration and diffusion models, we randomly shuffle the channel index of two conditions and propose a residual-swapping contrastive learning strategy to prevent the diffusion model from discriminating the conditions based on the paired similarity. Furthermore, the proposed method can be extended by relaxing the input constraint of the diffusion model, introducing diverse unpaired clean images as denoising input. Experimental results have demonstrated the effectiveness of the proposed noise-space approach beyond existing feature-space and pixel-space methods on image restoration tasks. In the future, we plan to further investigate adapting the synthetic source domain to the real-world target domain using diffusion models, particularly in other dense prediction vision tasks.

References

  • Zhang et al. [2017] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017.
  • Guo et al. [2019] Shi Guo, Zifei Yan, Kai Zhang, Wangmeng Zuo, and Lei Zhang. Toward convolutional blind denoising of real photographs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • Yue et al. [2024] Zongsheng Yue, Hongwei Yong, Qian Zhao, Lei Zhang, Deyu Meng, and Kwan-Yee K Wong. Deep variational network toward blind image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  • Pan et al. [2016] Jinshan Pan, Deqing Sun, Hanspeter Pfister, and Ming-Hsuan Yang. Blind image deblurring using dark channel prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • Ren et al. [2020] Dongwei Ren, Kai Zhang, Qilong Wang, Qinghua Hu, and Wangmeng Zuo. Neural blind deconvolution using deep priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
  • Fu et al. [2017] Xueyang Fu, Jiabin Huang, Delu Zeng, Yue Huang, Xinghao Ding, and John Paisley. Removing rain from single images via a deep detail network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • Wang et al. [2021] Hong Wang, Zongsheng Yue, Qi Xie, Qian Zhao, Yefeng Zheng, and Deyu Meng. From rain generation to rain removal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
  • Zhang et al. [2023] Ruofan Zhang, Jinjin Gu, Haoyu Chen, Chao Dong, Yulun Zhang, and Wenming Yang. Crafting training degradation distribution for the accuracy-generalization trade-off in real-world super-resolution. In International Conference on Machine Learning, 2023.
  • Luo et al. [2022] Zhengxiong Luo, Yan Huang, Shang Li, Liang Wang, and Tieniu Tan. Learning the degradation distribution for blind image super-resolution. arXiv preprint arXiv:2203.04962, 2022.
  • Gu et al. [2019] Jinjin Gu, Hannan Lu, Wangmeng Zuo, and Chao Dong. Blind super-resolution with iterative kernel correction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • Bell-Kligler et al. [2019] Sefi Bell-Kligler, Assaf Shocher, and Michal Irani. Blind super-resolution kernel estimation using an internal-gan. Advances in Neural Information Processing Systems, 2019.
  • Lehtinen et al. [2018] Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli Laine, Tero Karras, Miika Aittala, and Timo Aila. Noise2noise: Learning image restoration without clean data. arXiv preprint arXiv:1803.04189, 2018.
  • Shocher et al. [2018] Assaf Shocher, Nadav Cohen, and Michal Irani. “zero-shot” super-resolution using deep internal learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
  • Huang et al. [2021] Tao Huang, Songjiang Li, Xu Jia, Huchuan Lu, and Jianzhuang Liu. Neighbor2neighbor: Self-supervised denoising from single noisy images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
  • Chen et al. [2023] Haoyu Chen, Jinjin Gu, Yihao Liu, Salma Abdel Magid, Chao Dong, Qiong Wang, Hanspeter Pfister, and Lei Zhu. Masked image training for generalizable deep image denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
  • Ye et al. [2022] Yuntong Ye, Changfeng Yu, Yi Chang, Lin Zhu, Xi-Le Zhao, Luxin Yan, and Yonghong Tian. Unsupervised deraining: Where contrastive learning meets self-similarity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  • Huo et al. [2023] Dong Huo, Abbas Masoumzadeh, Rafsanjany Kushol, and Yee-Hong Yang. Blind image deconvolution using variational deep image prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • Tzeng et al. [2014] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
  • Ganin and Lempitsky [2015] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, 2015.
  • Long et al. [2015] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, 2015.
  • Tzeng et al. [2015] Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
  • Bousmalis et al. [2016] Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. Advances in Neural Information Processing Systems, 2016.
  • Tzeng et al. [2017] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017.
  • Liu and Tuzel [2016] Ming-Yu Liu and Oncel Tuzel. Coupled generative adversarial networks. Advances in Neural Information Processing Systems, 2016.
  • Taigman et al. [2016] Yaniv Taigman, Adam Polyak, and Lior Wolf. Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200, 2016.
  • Shrivastava et al. [2017] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning from simulated and unsupervised images through adversarial training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017.
  • Bousmalis et al. [2017] Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017.
  • Dong et al. [2014] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In Proceedings of the European Conference on Computer Vision, 2014.
  • Dong et al. [2015] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, 2015.
  • Wang et al. [2018] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision Workshops, 2018.
  • Zamir et al. [2022] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  • Wang et al. [2022] Zhendong Wang, Xiaodong Cun, Jianmin Bao, Wengang Zhou, Jianzhuang Liu, and Houqiang Li. Uformer: A general u-shaped transformer for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  • Liang et al. [2021] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
  • Zhang et al. [2018a] Kai Zhang, Wangmeng Zuo, and Lei Zhang. FFDNet: Toward a fast and flexible solution for cnn-based image denoising. IEEE Transactions on Image Processing, 27(9):4608–4622, 2018a.
  • Zhang et al. [2021] Kai Zhang, Yawei Li, Wangmeng Zuo, Lei Zhang, Luc Van Gool, and Radu Timofte. Plug-and-play image restoration with deep denoiser prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):6360–6376, 2021.
  • Ren et al. [2021] Chao Ren, Xiaohai He, Chuncheng Wang, and Zhibo Zhao. Adaptive consistency prior based deep network for image denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
  • Kim et al. [2020] Yoonsik Kim, Jae Woong Soh, Gu Yong Park, and Nam Ik Cho. Transfer learning from synthetic to real-noise denoising with adaptive instance normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
  • Nah et al. [2017] Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017.
  • Kupyn et al. [2018] Orest Kupyn, Volodymyr Budzan, Mykola Mykhailych, Dmytro Mishkin, and Jiří Matas. Deblurgan: Blind motion deblurring using conditional adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
  • Suin et al. [2020] Maitreya Suin, Kuldeep Purohit, and AN Rajagopalan. Spatially-attentive patch-hierarchical network for adaptive motion deblurring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
  • Zhang et al. [2019] Hongguang Zhang, Yuchao Dai, Hongdong Li, and Piotr Koniusz. Deep stacked hierarchical multi-patch network for image deblurring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • Jiang et al. [2020] Kui Jiang, Zhongyuan Wang, Peng Yi, Chen Chen, Baojin Huang, Yimin Luo, Jiayi Ma, and Junjun Jiang. Multi-scale progressive fusion network for single image deraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
  • Purohit et al. [2021] Kuldeep Purohit, Maitreya Suin, AN Rajagopalan, and Vishnu Naresh Boddeti. Spatially-adaptive image restoration using distortion-guided networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
  • Ren et al. [2019] Dongwei Ren, Wangmeng Zuo, Qinghua Hu, Pengfei Zhu, and Deyu Meng. Progressive image deraining networks: A better and simpler baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • Yang et al. [2017] Wenhan Yang, Robby T Tan, Jiashi Feng, Jiaying Liu, Zongming Guo, and Shuicheng Yan. Deep joint rain detection and removal from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017.
  • Chen et al. [2024] Lufei Chen, Xiangpeng Tian, Shuhua Xiong, Yinjie Lei, and Chao Ren. Unsupervised blind image deblurring based on self-enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
  • Krull et al. [2019] Alexander Krull, Tim-Oliver Buchholz, and Florian Jug. Noise2void-learning denoising from single noisy images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • Ulyanov et al. [2018] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
  • Saenko et al. [2010] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In Proceedings of the European Conference on Computer Vision, 2010.
  • Torralba and Efros [2011] Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2011.
  • Hoffman et al. [2018] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning, 2018.
  • Zheng et al. [2018] Chuanxia Zheng, Tat-Jen Cham, and Jianfei Cai. T2net: Synthetic-to-realistic translation for solving single-image depth estimation tasks. In Proceedings of the European Conference on Computer Vision, pages 767–783, 2018.
  • Chen et al. [2019] Yuhua Chen, Wen Li, Xiaoran Chen, and Luc Van Gool. Learning semantic segmentation from synthetic data: A geometrically guided input-output adaptation approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • Zhu et al. [2017] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
  • Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, 2015.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 2020.
  • Nichol and Dhariwal [2021] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, 2021.
  • Rombach et al. [2021] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
  • Saharia et al. [2022a] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 2022a.
  • Ruiz et al. [2023] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
  • Saharia et al. [2022b] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, 2022b.
  • Saharia et al. [2022c] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4713–4726, 2022c.
  • Yue et al. [2023] Zongsheng Yue, Jianyi Wang, and Chen Change Loy. Resshift: Efficient diffusion model for image super-resolution by residual shifting. In Advances in Neural Information Processing Systems, 2023.
  • Ho and Salimans [2022] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In Advances in Neural Information Processing Systems Workshop, 2022.
  • Bar et al. [2022] Amir Bar, Yossi Gandelsman, Trevor Darrell, Amir Globerson, and Alexei Efros. Visual prompting via image inpainting. Advances in Neural Information Processing Systems, 2022.
  • Wang et al. [2023] Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin C.K. Chan, and Chen Change Loy. Exploiting diffusion prior for real-world image super-resolution. arXiv preprint arXiv:2305.07015, 2023.
  • Yue et al. [2019] Zongsheng Yue, Hongwei Yong, Qian Zhao, Deyu Meng, and Lei Zhang. Variational denoising network: Toward blind noise modeling and removal. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • Zhang et al. [2018b] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision, 2018b.
  • Zamir et al. [2021] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
  • Johnson et al. [2016] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference on Computer Vision, 2016.
  • Zhang et al. [2018c] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018c.
  • Timofte et al. [2017] Radu Timofte, Eirikur Agustsson, Luc Van Gool, Ming-Hsuan Yang, and Lei Zhang. Ntire 2017 challenge on single image super-resolution: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2017.
  • Nah et al. [2019] Seungjun Nah, Radu Timofte, Sungyong Baik, Seokil Hong, Gyeongsik Moon, Sanghyun Son, and Kyoung Mu Lee. Ntire 2019 challenge on video deblurring: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.
  • Ma et al. [2016] Kede Ma, Zhengfang Duanmu, Qingbo Wu, Zhou Wang, Hongwei Yong, Hongliang Li, and Lei Zhang. Waterloo exploration database: New challenges for image quality assessment models. IEEE Transactions on Image Processing, 26(2):1004–1016, 2016.
  • Martin et al. [2001] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings Eighth IEEE International Conference on Computer Vision, 2001.
  • Abdelhamed et al. [2018] Abdelrahman Abdelhamed, Stephen Lin, and Michael S Brown. A high-quality denoising dataset for smartphone cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
  • Wang et al. [2019] Tianyu Wang, Xin Yang, Ke Xu, Shaozhe Chen, Qiang Zhang, and Rynson WH Lau. Spatial attentive single-image deraining with a high quality real rain dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • Rim et al. [2020] Jaesung Rim, Haeyun Lee, Jucheol Won, and Sunghyun Cho. Real-world blur dataset for learning and benchmarking deblurring algorithms. In Proceedings of the European Conference on Computer Vision, 2020.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. 2014.
  • Chen et al. [2020] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. Wavegrad: Estimating gradients for waveform generation. arXiv preprint arXiv:2009.00713, 2020.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention, 2015.
  • Plotz and Roth [2017] Tobias Plotz and Stefan Roth. Benchmarking denoising algorithms with real photographs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017.

Appendix

Appendix A More Implementation Details

A.1 Condition Evaluation on Diffusion Model

This work is motivated by the observation that a high-quality condition facilitates the denoising process of a diffusion model, as shown in Fig. 1(a). In this preliminary experiment, we first train a diffusion model that takes a conditional image in addition to its conventional input. We then test the noise prediction performance of this model under conditions of varying quality. Specifically, we corrupt the condition by adding additive white Gaussian noise (AWGN) with noise level $\sigma \in [0, 80]$ to the original clean images; the experiment is performed on 1,000 images from the MS-COCO test dataset [79]. The noise prediction error of the diffusion model is measured by the mean squared error (MSE).
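The evaluation protocol above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_noise_predictor` is a hypothetical stand-in for the trained conditional diffusion model (it simply subtracts the condition from the noisy input), images are flat lists of pixel values on a 0-255 scale, and the diffusion noise level of 25 is arbitrary. Only the corruption of the condition with AWGN and the MSE measurement follow the text.

```python
import random


def add_awgn(pixels, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation `sigma` (0-255 scale)."""
    return [p + rng.gauss(0.0, sigma) for p in pixels]


def mse(a, b):
    """Mean squared error between two equally sized sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def toy_noise_predictor(noisy_input, condition):
    """Hypothetical stand-in for a conditional diffusion model.

    It estimates the injected noise as (noisy_input - condition), so its
    accuracy depends directly on how clean the condition is.
    """
    return [x - c for x, c in zip(noisy_input, condition)]


def evaluate(sigmas, num_pixels=4096, seed=0):
    """Return {sigma: noise-prediction MSE} for conditions corrupted at each sigma."""
    rng = random.Random(seed)
    clean = [rng.uniform(0.0, 255.0) for _ in range(num_pixels)]
    # Noise injected by the (assumed) forward diffusion step.
    true_noise = [rng.gauss(0.0, 25.0) for _ in range(num_pixels)]
    noisy_input = [c + n for c, n in zip(clean, true_noise)]

    errors = {}
    for sigma in sigmas:
        condition = add_awgn(clean, sigma, rng)  # degrade the condition
        predicted = toy_noise_predictor(noisy_input, condition)
        errors[sigma] = mse(predicted, true_noise)
    return errors


errors = evaluate([0, 20, 40, 60, 80])
for sigma, err in errors.items():
    print(f"sigma={sigma:>2}  noise-prediction MSE={err:.2f}")
```

With this toy predictor the error grows roughly as the square of the noise level by construction; the sketch only shows how the measurement is set up and says nothing about the behaviour of the trained model.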

A.2 Comparison Settings

In the comparison experiments, we compare the proposed approach with three types of previous methods to comprehensively evaluate their generalization performance: domain adaptation methods, including DANN [19], DSN [22], PixelDA [27], and CyCADA [51]; unsupervised image restoration methods, including Ne2Ne [14], MaskedD [15], NLCL [16], SelfDeblur [5], and VDIP [17]; and representative supervised methods that serve as strong baselines in image restoration, such as Restormer [31].

A.3 Scalability Evaluation

To provide a comprehensive evaluation of the proposed method, we apply six variants of the image restoration network in our experiments: three variants of a convolution-based network [82], namely Unet-T (Tiny), Unet-S (Small), and Unet-B (Base); and three variants of a Transformer-based network [32], namely Uformer-T (Tiny), Uformer-S (Small), and Uformer-B (Base). These variants differ in the number of feature channels (C) and the number of layers at each encoder and decoder stage. The specific configurations, computational costs, and parameter counts are detailed below:

  • Unet-T: C=32, depths of Encoder = {2, 2, 2, 2}, GMACs: 3.14G, Parameters: 2.14M,

  • Unet-S: C=64, depths of Encoder = {2, 2, 2, 2}, GMACs: 12.48G, Parameters: 8.56M,

  • Unet-B: C=76, depths of Encoder = {2, 2, 2, 2}, GMACs: 17.58G, Parameters: 12.07M,

  • Uformer-T: C=16, depths of Encoder = {2, 2, 2, 2}, GMACs: 15.49G, Parameters: 9.50M,

  • Uformer-S: C=32, depths of Encoder = {2, 2, 2, 2}, GMACs: 34.76G, Parameters: 21.38M,

  • Uformer-B: C=32, depths of Encoder = {1, 2, 8, 8}, GMACs: 86.97G, Parameters: 53.58M,

and the depths of the Decoder match those of the Encoder.
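For reference, the configurations listed above can be written down as a small table in code. The sketch below is only an illustrative transcription: the `Variant` class and the helper functions are hypothetical names, not part of the released code. It also runs a simple sanity check on the listed numbers for the Unet variants, under the assumption (ours, not stated in the paper) that with fixed depths the parameter count of a convolutional network grows roughly with the square of the channel width.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One restoration-network configuration, transcribed from the list above."""
    name: str
    channels: int    # number of feature channels C
    depths: tuple    # layers per encoder stage (decoder matches)
    gmacs: float     # computational cost in GMACs
    params_m: float  # parameter count in millions


VARIANTS = [
    Variant("Unet-T", 32, (2, 2, 2, 2), 3.14, 2.14),
    Variant("Unet-S", 64, (2, 2, 2, 2), 12.48, 8.56),
    Variant("Unet-B", 76, (2, 2, 2, 2), 17.58, 12.07),
    Variant("Uformer-T", 16, (2, 2, 2, 2), 15.49, 9.50),
    Variant("Uformer-S", 32, (2, 2, 2, 2), 34.76, 21.38),
    Variant("Uformer-B", 32, (1, 2, 8, 8), 86.97, 53.58),
]
BY_NAME = {v.name: v for v in VARIANTS}


def param_ratio(small, large):
    """Ratio of listed parameter counts between two variants."""
    return large.params_m / small.params_m


def squared_width_ratio(small, large):
    """Ratio expected if parameters scaled exactly with C squared (an assumption)."""
    return (large.channels / small.channels) ** 2


# Sanity check restricted to the Unet variants, which share the same depths.
for a, b in [("Unet-T", "Unet-S"), ("Unet-S", "Unet-B")]:
    small, large = BY_NAME[a], BY_NAME[b]
    print(f"{a} -> {b}: listed ratio {param_ratio(small, large):.3f}, "
          f"C^2 ratio {squared_width_ratio(small, large):.3f}")
```

The check is deliberately limited to the Unet variants; the Uformer variants are not expected to follow the same quadratic rule, since Uformer-B also changes the depths.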

Figure 9: Overview of different domain adaptation (DA) approaches. (a) Feature-space DA aligns the intermediate features across source and target domains. (b) Pixel-space DA translates source data to the “style” of the target domain through adversarial learning. (c) The proposed noise-space DA is specifically designed for image restoration. It gradually adapts the results from both source and target domains to the target clean image distribution via multi-step denoising. In particular, the function network represents a restorer in the context of image restoration.

Appendix B Discussion on Different Domain Adaptation Methods

In Sec. 3.4, we discussed the advantages of the proposed method over previous feature-space and pixel-space domain adaptation methods. Fig. 9 further illustrates their respective frameworks. In contrast to previous adaptation methods, our method requires no domain classifier; instead, it introduces a meaningful diffusion loss function.

Appendix C More Visual Comparison Results

We provide more visual comparison results for the image denoising task in Fig. 10, the image deraining task in Fig. 11, and the image deblurring task in Fig. 12. The proposed method and its extension are denoted as ‘Ours’ and ‘Ours-Ex’, respectively.

Appendix D More Visual Results on Other Real-World Datasets

To show the generalization ability of the proposed method, we also visualize its restored results on other real-world datasets [83, 45] in Figs. 13, 14, and 15. These datasets were not seen during training and fall outside the distribution of the training datasets.

Figure 10: Visual comparison of the image denoising task on SIDD test dataset [76].
Figure 11: Visual comparison of the image deraining task on SPA test dataset [77].
Figure 12: Visual comparison of the image deblurring task on RealBlur-J [78] test dataset.
Figure 13: Visual results of the proposed method on DND real-world denoising test dataset [83].
Figure 14: Visual results of the proposed method on DND real-world denoising test dataset [83].
Figure 15: Visual results of the proposed method on ‘Real-Internet’ real-world deraining test dataset [45].