University of Electronic Science and Technology of China, Chengdu, CN
{dongqifan, junhaozhang}@std.uestc.edu.cn
[email protected]

ConStyle v2: A Strong Prompter for All-in-One Image Restoration

Dongqi Fan, Junhao Zhang, Liang Chang
Abstract

This paper introduces ConStyle v2, a strong plug-and-play prompter designed to output clean visual prompts and assist U-Net Image Restoration models in handling multiple degradations. The joint training process of IRConStyle, an Image Restoration framework consisting of ConStyle and a general restoration network, is divided into two stages: first, ConStyle is pre-trained alone; then its weights are frozen to guide the training of the general restoration network. Three improvements are proposed for the pre-training stage: unsupervised pre-training, adding a pretext task (i.e., classification), and adopting knowledge distillation. Without bells and whistles, ConStyle v2, a strong prompter for all-in-one Image Restoration, can be obtained in less than two GPU days and requires no fine-tuning. Extensive experiments on Restormer (transformer-based), NAFNet (CNN-based), MAXIM-1S (MLP-based), and a vanilla CNN network demonstrate that ConStyle v2 can turn any U-Net style Image Restoration model into an all-in-one Image Restoration model. Furthermore, models guided by the well-trained ConStyle v2 outperform those guided by ConStyle on some specific degradations. The code is available at: https://github.com/Dongqi-Fan/ConStyle_v2

Keywords: Image Restoration, Visual Prompting, All-in-One, Pre-training

1 Introduction

Image Restoration (IR) is a fundamental task in the computer vision community, which aims to reconstruct a high-quality image from a degraded one. Recent advancements in deep learning have shown promising results in specific IR tasks such as denoising [16, 8, 78], dehazing [46, 58, 22], deraining [29, 57, 34], desnowing [13, 12, 48], motion deblurring [35, 19, 74], defocus deblurring [61, 81, 36], low-light enhancement [69, 6, 25], and JPEG artifact removal/correction [33, 86]. However, these models are limited to addressing only one specific type of degradation. To tackle this issue, researchers have focused on developing models capable of handling multiple degradations [80, 17, 70, 42]. Yet these models require retraining for each type of degradation; that is, each set of weights is tailored to a single type of degradation. Such approaches are impractical because multiple degradations often coexist in real-world scenarios. For instance, rainy days are often associated with haze and reduced lighting.

Figure 1: Different ways of handling multiple degradations. (a) Priors are obtained by using as many sub-networks as there are degradations. (b) A degraded-clean example pair must be provided in both the training and inference stages. (c) ConStyle v2 adaptively outputs clean visual prompts according to different degradations to guide the training of the general restoration network.

All-in-one Image Restoration methods [84, 53, 66, 14] use a single set of weights to address multiple types of degradation. However, these all-in-one models often have a large number of parameters and require heavy computation, leading to time-consuming and inefficient training. For instance, Chen et al. [14] (Fig. 1 (a)) adjust the number of teacher networks to match the number of degradations, so the training process becomes more complex as more teacher networks are added; Li et al. [41] (Fig. 1 (a)) also use as many sub-networks as there are degradations, and their backbone is obtained through neural architecture search (NAS); PromptGIP [47] (Fig. 1 (b)) leverages the idea of visual prompting to help the model handle multiple degradations, but a degraded-clean sample pair must be provided in both the training and inference stages, and training requires 8 V100 GPUs. These methods are inefficient because they lack prior knowledge about the degradations in the input image. In other words, prior information about the degradation type must first be obtained and then passed to subsequent sub-networks. In contrast, thanks to ConStyle v2, the general restoration network (Fig. 1 (c)) does not need specific prior knowledge about degradations, but only a clean visual prompt (clean prior).

In this paper, we introduce a strong prompter for all-in-one Image Restoration: ConStyle v2 (Fig. 2 and Fig. 3). The key to our work is ensuring that ConStyle v2 generates clean visual prompts, thus mitigating the issue of model collapse and guiding the training of general restoration networks. Model collapse typically arises when a model struggles to handle multiple degradations simultaneously. To address this challenge, we propose three simple yet effective improvements to ConStyle: unsupervised pre-training, leveraging a pretext task to enhance semantic information extraction, and employing knowledge distillation to further enhance this capacity. ConStyle v2 consists only of convolution, linear, and BN layers, without any complex operators (Fig. 3 (c)), so training takes less than two days on a V100 GPU and an Intel Xeon Silver 4216 CPU. Once trained, ConStyle v2 can be seamlessly integrated into any U-Net model to facilitate its training without fine-tuning. Additionally, given the lack of datasets encompassing multiple degradations, we collect and produce a Mix Degradations dataset, which includes noise, motion blur, defocus blur, rain, snow, low-light, JPEG artifact, and haze, to meet the training requirements. To verify our method, we apply ConStyle v2 to three state-of-the-art IR models (Restormer [79], MAXIM-1S [65], NAFNet [10]) and a non-IR U-Net model consisting of vanilla convolutions. Fig. 4 shows the detailed architectures of the Original models and the ConStyle/ConStyle v2 models. Experimental results on 27 benchmarks demonstrate that ConStyle v2 is a powerful plug-and-play prompter for all-in-one image restoration and outperforms ConStyle [24] on specific degradations.

Our contributions can be summarized as follows:

  • Three simple yet effective enhancements are proposed to train ConStyle v2, and training takes less than two GPU days.

  • We propose a Mix Degradations dataset, which includes noise, motion blur, defocus blur, rain, snow, low-light, JPEG artifact, and haze, to cater to training needs.

  • We propose a strong plug-and-play prompter for all-in-one and specific image restoration, in which the model collapse issue is avoided.

2 Related Work

2.1 All-in-One Image Restoration

While numerous works [80, 17, 70, 42, 79, 65, 10, 9, 18, 39, 23] excel in various Image Restoration tasks, they are typically limited to addressing a single type of degradation with a specific set of weights. To solve this problem, all-in-one Image Restoration (IR) methods [84, 53, 66, 14, 49, 47, 54, 85, 38, 50, 55] have been developed. These methods aim to enable models to handle multiple degradations simultaneously. For example, AirNet [38] leverages MoCo [30] and Deformable Convolution [20], transforming the degradation priors obtained from the former into convolution kernels in the latter to enable dynamic degradation removal; DA-CLIP [49] builds upon the architecture of CLIP [56] and uses BLIP [40] to generate synthetic captions for high-quality images, after which low-quality images are matched with captions and the corresponding degradation types to form image-text-degradation pairs; ADMS [54] introduces a Filter Attribution method based on FAIG [73] to identify the contributions of individual filters to removing specific degradations, while IDR [85] proposes a learnable Principal Component Analysis and treats various IR tasks as a form of multi-task learning to acquire priors. Unlike the above methods, we aim to design a plug-and-play module that can transform a non-all-in-one model into an all-in-one model.

2.2 Visual Prompting

In the field of Natural Language Processing (NLP), Prompting Learning refers to providing task-specific instructions or in-context information to a model without the need for retraining. This approach has shown promising results in NLP, for example in GPT-3 [4]. Drawing inspiration from Prompting Learning in NLP, many excellent visual prompting works have recently appeared in IR [47, 50, 55, 11, 45, 3, 67, 63]. For example, ProRes [50] adds a target visual prompt to an input image to create a "prompted image". This prompted image is then flattened into patches; for new tasks or datasets, the weights of ProRes are frozen and learnable prompts are randomly initialized. PromptIR [55] introduces a Prompt Block in the decoder stage of the U-Net architecture. This block takes prompt components and the output of the previous transformer block as inputs, and its output is fed into the next transformer block. PromptGIP [47] proposes a training method akin to masked autoencoding, where certain portions of question images and answer images are randomly masked to prompt the model to reconstruct these patches from the unmasked areas. During inference, input-output pairs are assembled as task prompts to realize image restoration. Our approach also leverages visual prompting, but in a more efficient manner that eliminates the need to distinguish different degradations as the above methods do. Instead, it provides a clean visual prompt to other models.

Figure 2: The training diagram of the ConStyle v2. Only the Encoder is retained once training is complete.
Figure 3: The difference between ConStyle (a) and ConStyle v2 (b), and the detailed structure of the Momentum Encoder and Encoder (c). Con. Loss and Cross. Loss are abbreviations of Content Loss and CrossEntropy Loss. In ConStyle, the Momentum Encoder and Queue are removed only at the inference stage, whereas in ConStyle v2 they are removed as soon as pre-training is finished.

3 Method

The training diagram of ConStyle v2 is depicted in Fig. 2, while the distinctions between ConStyle and ConStyle v2 are illustrated in Fig. 3. Like ConStyle, ConStyle v2 retains only the Encoder once pre-training is complete. In this section, we first provide a brief overview of ConStyle (Sec. 3.1), then describe the Mix Degradations dataset (Sec. 3.2), show the problems ConStyle encounters under multiple degradations (Sec. 3.3), and finally present the three steps that improve ConStyle into ConStyle v2 (Sec. 3.4).

Figure 4: The detailed structure of the original models (a)(b)(c)(d) and the ConStyle/ConStyle v2 models (e)(f)(g)(h). DC represents the downsample-and-concat operation, and UC represents the upsample-and-concat operation.

3.1 Review of ConStyle

IRConStyle [24] is a versatile and robust IR framework consisting of ConStyle and a general restoration network. ConStyle includes several convolutional layers and one MLP layer, and is responsible for extracting latent features (the latent code and intermediate feature map) and passing them to the general restoration network. The general restoration network follows an abstract U-Net style architecture, allowing for the instantiation of any IR U-Net model. The training (Eq. 1) and inference (Eq. 2) processes of IRConStyle [24] can be described as follows:

$$I_{restored} = G(E(I_{degraded}, I_{clean}), I_{degraded}) \tag{1}$$
$$I_{restored} = G(E(I_{degraded}), I_{degraded}) \tag{2}$$

where $G$ stands for the general restoration network, $E$ for ConStyle, $I_{degraded}$ for the input degraded image, $I_{clean}$ for the corresponding clean image, and $I_{restored}$ for the output restored image. Built on the contrastive learning framework MoCo [30], ConStyle integrates the idea of style transfer and replaces the pretext task, Instance Discrimination [72], with a pretext task more suitable for IR. The total loss function for IRConStyle is as follows:

$$L_{total} = L_{style} + L_{content} + L_{InfoNCE} + L_{1} \tag{3}$$

$L_{style}$, $L_{content}$, and $L_{InfoNCE}$ are computed in ConStyle, while $L_{1}$ is computed in the general restoration network. Under the supervision of $L_{style}$, $L_{content}$, and $L_{InfoNCE}$, the latent features move closer to the clean space and further away from the degradation space. Since ConStyle can adaptively output clean latent features according to the input degraded images, it is natural to expect that ConStyle can turn the general restoration network into an all-in-one model. However, the experimental results of the ConStyle models do not meet this expectation.
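The difference between Eq. 1 and Eq. 2 is only whether the clean image is fed to ConStyle. The following is a minimal sketch of that calling convention, not the authors' implementation: `restore`, `toy_E`, and `toy_G` are hypothetical names, and the toy stand-ins return strings and tuples purely to make the data flow visible.

```python
def restore(G, E, degraded, clean=None):
    """Compose ConStyle (E) and the general restoration network (G).

    With a clean image the call follows Eq. 1 (training);
    without one it follows Eq. 2 (inference).
    """
    if clean is not None:
        latent = E(degraded, clean)  # Eq. 1: E sees both images
    else:
        latent = E(degraded)         # Eq. 2: E sees only the degraded image
    return G(latent, degraded)


# Hypothetical stand-ins that only record what they were given.
def toy_E(degraded, clean=None):
    return "latent(degraded, clean)" if clean is not None else "latent(degraded)"


def toy_G(latent, degraded):
    return (latent, degraded)


train_out = restore(toy_G, toy_E, "I_degraded", clean="I_clean")
infer_out = restore(toy_G, toy_E, "I_degraded")
print(train_out)
print(infer_out)
```

In both modes the general restoration network receives the degraded image itself in addition to the latent features, which is what allows any U-Net restoration model to be used as `G`.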

3.2 Mix Degradations Datasets

We need a training dataset that includes noise, motion blur, defocus blur, rain, snow, low light, JPEG artifact, and haze, but no existing training dataset meets this need. Therefore, we propose the Mix Degradations dataset, which consists of image pairs covering all of the aforementioned degradations. Details of the Mix Degradations dataset can be found in Tab. 1. The images with noise and JPEG artifacts are generated using the same established methods as [79, 65, 10] and [44, 33, 86], respectively. Note that in the OTS dataset, haze images with intensities of 0.04 and 0.06 are manually removed because they appear too clear to the human eye. In addition, the deraining dataset, which includes Rain14000 [26], Rain1800 [75], Rain800 [83], and Rain12 [43], initially contained 13,712 images, but two erroneous images were identified and removed. After a unified cropping process, the Mix Degradations dataset has 621,573 images, all of size 256 × 256. The Mix Degradations dataset, the uncropped joint datasets mentioned in Tab. 1 (totaling 46,301 images), and the data preparation file are all available via our GitHub link.

Table 1: Details of the Mix Degradations dataset.
| Task | Source | Used Number | Crop Size | Step Size | Final Number |
| --- | --- | --- | --- | --- | --- |
| Motion Blurring | GoPro [52] | 2,103 | 256 | 150 | 84,120 |
| Defocus Blurring | LFDOF [60] | 5,606 | 256 | 220 | 84,090 |
| Dehazing | OTS [37] | 12,500 | 256 | 240 | 82,425 |
| Low-light Enhancement | LoL v1 [71], LoL v2 [77], and FiveK [5] | 5,885 | 256 | 100 | 51,891 |
| JPEG Artifact Removal | DIV2K [2] | 800 | 256 | 165 | 77,904 |
| Denoising | DIV2K [2] | 800 | 256 | 165 | 77,904 |
| Desnowing | Snow100K [48] | 7,000 | 256 | 160 | 82,432 |
| Deraining | Rain14000, Rain1800, Rain800, and Rain12 | 13,710 | 256 | 240 | 80,807 |

3.3 ConStyle on Mix Degradations Datasets

To evaluate whether ConStyle [24] can directly convert U-Net models to all-in-one models, we conduct experiments using Original models (Restormer [79], NAFNet [10], MAXIM-1S [65]) and ConStyle models (ConStyle Restormer, ConStyle NAFNet, ConStyle MAXIM-1S) on the Mix Degradations dataset. These models are tested on GoPro [52] (motion deblurring), RealDOF [60] (defocus deblurring), LoL v1 [71] (low-light enhancement), SOTS outdoors [37] (dehazing), LIVE1 [62] (JPEG artifact removal), CSD [13] (desnowing), CBSD68 [51] (denoising), and Rain100H [76] (deraining). In addition, we introduce two more models for comparison: Original Conv and ConStyle Conv. The Original Conv is a vanilla U-Net convolution model, while the ConStyle Conv incorporates ConStyle into the Original Conv. These two additional models are used to verify the generality of ConStyle and ConStyle v2. Note that, to expedite evaluation during training, only a subset of each test dataset is used. For example, only 24 images are selected from GoPro's test dataset of 1,111 images. Using the full test datasets for all 8 tasks would significantly increase training time, as inference with Restormer alone on GoPro's test dataset takes 40 minutes on a V100 GPU.

Figure 5: Original Conv (a), ConStyle Conv (b), Restormer (c), and ConStyle Restormer (d) trained on Mix Degradations datasets.

The results in Fig. 5 show that, under the multiple-degradation setting, ConStyle Conv not only fails to surpass the Original Conv but also remains at a consistently low performance level after 80K iterations. Although ConStyle Restormer performs better than Restormer, it suffers from the same model collapse problem as Restormer. In the following section, we illustrate step by step how ConStyle is improved to ConStyle v2, using Restormer and Original Conv.

3.4 ConStyle v2

3.4.1 Unsupervised Pre-training

We find that even for specific IR tasks, ConStyle models exhibit varying degrees of model collapse. For example, in the dehazing task, the performance of ConStyle NAFNet, ConStyle MAXIM-1S, and ConStyle Restormer declines significantly after 250K, 30K, and 10K iterations, respectively. Interestingly, even within the same model, such as ConStyle MAXIM-1S, the onset of model collapse differs across denoising, deraining, and deblurring tasks, occurring at 10K, 100K, and 200K iterations, respectively. To address this challenge, IRConStyle [24] adopts a strategy of early stopping for ConStyle updates. Here, we intend to solve this problem more elegantly. Since the problem arises in joint training, it is natural to split the joint training process into two stages. Specifically, ConStyle is pre-trained independently; its weights are then fixed and it is integrated with other IR models for guided training. For the pre-training stage, we leverage the generation techniques of Real-ESRGAN [68] and ImageNet-C [31] in the Degradation Process (Fig. 2) for unsupervised training on ImageNet-1K [21].

Since our goal is to train ConStyle v2 to be a powerful prompter that can produce a clean visual prompt for different degradations, we use the method in ImageNet-C [31] to generate motion blur, snow, and low contrast, and the two-stage degradation method in Real-ESRGAN [68] to generate Gaussian blur, noise, and JPEG artifacts. For each batch of images, 40% of the images are randomly selected to receive motion blur, snow, and contrast changes, while the remaining 60% receive Gaussian blur, noise, and JPEG artifacts. For the detailed training settings of the pre-training stage, please see Sec. 4. The pre-trained ConStyle, with its weights fixed, is incorporated into the general restoration network for training on the Mix Degradations dataset. Here, we name the models of this stage Pre-train models. The results of Pre-train Restormer and Pre-train Conv can be seen in Fig. 6 (e) and (a), respectively.
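The 40%/60% split described above can be sketched as a per-batch assignment of each image to one of the two degradation pipelines. This is an illustrative sketch only: the function name, the group labels, and the rounding of 40% of the batch to a whole number of images are assumptions, and the degradations themselves (ImageNet-C and Real-ESRGAN style) are not implemented here.

```python
import random


def assign_degradation_groups(batch_size, imagenet_c_fraction=0.4, seed=None):
    """Randomly assign each image in a batch to one degradation pipeline.

    "imagenet_c": motion blur, snow, and contrast change.
    "real_esrgan": Gaussian blur, noise, and JPEG artifacts.
    """
    rng = random.Random(seed)
    # Assumption: the 40% share is rounded to the nearest whole image.
    n_imagenet_c = round(batch_size * imagenet_c_fraction)
    indices = list(range(batch_size))
    rng.shuffle(indices)
    chosen = set(indices[:n_imagenet_c])
    return ["imagenet_c" if i in chosen else "real_esrgan" for i in range(batch_size)]


# Pre-training batch size from Sec. 4.1.
groups = assign_degradation_groups(32, seed=0)
print(groups.count("imagenet_c"), groups.count("real_esrgan"))
```

Because the assignment is redrawn for every batch, each image can meet either pipeline over the course of pre-training.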

Figure 6: (a), (b), and (c) represent Pre-train ConStyle Conv, Class ConStyle Conv, and ConStyle v2 Conv. (e), (f), and (g) represent Pre-train ConStyle Restormer, Class ConStyle Restormer, and ConStyle v2 Restormer. (d) and (h) represent average performance on eight degradations.

3.4.2 Pretext Task

Figure 7: The visual results of the proposed degradation process. Several degradations are added to a single image, and the intensity of the degradations is random.

After the pre-training step, ConStyle Restormer shows a significant performance improvement and no longer suffers from model collapse. Conversely, ConStyle Conv continues to face unstable training and limited performance gains. We attribute this to the heavy degradation, which causes a loss of semantic information in the original image (Fig. 7). As a result, models with weak semantic extraction ability, such as ConStyle Conv, also restore images poorly, because image restoration involves pixel-wise operations and requires a comprehensive understanding of the entire image. We therefore introduce a pretext task (classification) to enhance the semantic information extraction capability of ConStyle and, in turn, that of the ConStyle models. Specifically, we add Classifier and Softmax layers after the Encoder and leverage the ImageNet labels (Fig. 2). Here, we name the models of this stage Class models. As shown in Fig. 6 (b) and (f), the addition of the pretext task has little influence on ConStyle Restormer, since the Transformer model already has strong semantic extraction ability. In contrast, for ConStyle Conv, the pretext task stabilizes training and significantly improves performance.

3.4.3 Knowledge Distillation

ConStyle has been significantly improved through pre-training and the addition of a pretext task, enabling it to generate clean visual prompts from degraded image input. To further boost its ability to extract semantic information, we take a last step to improve ConStyle to ConStyle v2. Inspired by BYOL [28], SimSiam [15], and DINO [7], we take advantage of knowledge distillation. Since the input of the Momentum Encoder is the clean image, its output visual prompt is cleaner than the output of the Encoder. By using the Momentum Encoder as a teacher and the Encoder as a student, the teacher network can adaptively guide the student network during training. Specifically, a Classifier and a Softmax layer are added after the Momentum Encoder, and the distance between the two outputs is measured by the Kullback-Leibler (KL) divergence (Fig. 2). This gives the final ConStyle v2, and the performance of ConStyle v2 Restormer (Fig. 6 (g)) and ConStyle v2 Conv (Fig. 6 (c)) is raised again compared with the other models. In addition, the gain in average performance across the eight degradations from each improvement is depicted in Fig. 6 (d) and (h).
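The classification pretext task and the distillation term can be illustrated on a single sample as follows. This is a minimal sketch under stated assumptions rather than the authors' training code: the KL direction (teacher relative to student), the weighting `kd_weight`, and all function names are assumptions, and the logits are made-up numbers standing in for the Classifier outputs of the Momentum Encoder (teacher) and the Encoder (student).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """Classification loss of the student against the ground-truth label."""
    return -math.log(softmax(student_logits)[label])


def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student); the direction is an assumption."""
    return sum(
        t * (math.log(t + eps) - math.log(s + eps))
        for t, s in zip(teacher_probs, student_probs)
    )


def pretext_and_distillation_loss(student_logits, teacher_logits, label, kd_weight=1.0):
    """Cross-entropy pretext loss plus a KL distillation term.

    kd_weight is a hypothetical balancing factor; the source does not state one.
    """
    teacher = softmax(teacher_logits)  # clean-image branch, used as the target
    student = softmax(student_logits)  # degraded-image branch
    return cross_entropy(student_logits, label) + kd_weight * kl_divergence(teacher, student)


student_logits = [1.0, 0.5, -0.5]   # made-up values for a 3-class toy case
teacher_logits = [2.0, 0.0, -1.0]
loss = pretext_and_distillation_loss(student_logits, teacher_logits, label=0)
print(round(loss, 4))
```

When the student reproduces the teacher's distribution exactly, the KL term vanishes and only the classification loss remains.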

4 Experiments

4.1 Implementation Details

All experiments in this paper are performed on an NVIDIA Tesla V100 GPU. To be consistent with ConStyle [24], we use the AdamW optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$, weight decay $= 1 \times 10^{-4}$) with an initial learning rate of $3 \times 10^{-4}$ and cosine annealing. Training stage: the batch size, crop size, and total iterations are set to 16, 128, and 700K, respectively. Pre-training stage: the batch size, crop size, and total iterations are set to 32, 224, and 200K, respectively. When generating degraded images, we directly use all configurations of Real-ESRGAN [68] and change the intensity of the degradations in ImageNet-C [31].
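For reference, a cosine annealing schedule over the stated iteration budget can be written as a closed-form function of the iteration index. The sketch below assumes a single decay cycle without restarts or warm-up and a final learning rate `min_lr` of zero; neither detail is given in the text, and the function name is hypothetical.

```python
import math


def cosine_annealing_lr(step, total_steps, base_lr=3e-4, min_lr=0.0):
    """Learning rate at a given iteration under single-cycle cosine annealing.

    min_lr is an assumed value; the source only states the initial rate.
    """
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Training stage: 700K iterations; pre-training stage: 200K iterations.
for step in (0, 350_000, 700_000):
    print(step, cosine_annealing_lr(step, total_steps=700_000))
```

The rate starts at the initial value, passes through half of it at the midpoint of training, and reaches `min_lr` at the final iteration.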

Table 2: Parameters, computations, and inference speed comparison. (*) represents the ConStyle, Pre-train, Class, and ConStyle v2 models. For instance, Conv(*) stands for ConStyle Conv, Pre-train Conv, Class Conv, and ConStyle v2 Conv.
| Method | Params(M) | GFLOPs | Speed(us) |
| --- | --- | --- | --- |
| Restormer [79] | 26.12 | 70.49 | 60 |
| Restormer(*) | 15.57 | 74.92 | 61 |
| NAFNet [10] | 87.40 | 49.13 | 53 |
| NAFNet(*) | 12.74 | 46.97 | 54 |
| MAXIM-1S [65] | 8.17 | 21.58 | 53 |
| MAXIM-1S(*) | 8.10 | 25.58 | 52 |
| Conv | 5.03 | 19.64 | 3 |
| Conv(*) | 6.78 | 25.90 | 5 |

Table 3: The average PSNR/SSIM on 27 benchmarks. Red means the best and blue means the second best.
| Model | Original | ConStyle | Pre-train | Class | ConStyle v2 |
| --- | --- | --- | --- | --- | --- |
| Restormer [79] | 6.40/0.0506 | 25.23/0.7844 | 27.24/0.8222 | 27.40/0.8305 | 27.60/0.8352 |
| NAFNet [10] | 27.45/0.8347 | 26.57/0.8054 | 27.43/0.8341 | 27.53/0.8379 | 27.55/0.8391 |
| MAXIM-1S [65] | 26.95/0.8319 | 26.63/0.8290 | 27.14/0.8319 | 27.41/0.8372 | 27.54/0.8400 |
| Conv | 21.44/0.6177 | 20.99/0.5986 | 20.14/0.6544 | 24.84/0.7889 | 26.06/0.7974 |

The Mix Degradations dataset is used in the training stage and ImageNet-1K in the pre-training stage. For evaluation, GoPro [52], HIDE [64], RealBlur-J [59], and RealBlur-R [59] are used for motion deblurring; DPDD [1] is used for defocus deblurring; SOTS outdoors [37] is used for dehazing; Rain100H [76], Rain100L [76], Test1200 [82], and Test2800 [27] are used for deraining; FiveK [5], LoL v1 [71], and LoL v2 [77] are used for low-light enhancement; CSD [13] and Snow100K (S, M, and L) [48] are used for desnowing; CBSD68 [51] and Urban100 [32] are used for denoising; and LIVE1 [62] is used for JPEG artifact removal.

Table 4: RGB image denoising results in PSNR. (*) indicates models trained with random sigma (0 to 50); the others are trained with fixed sigma. Blue means better and red means worse.
| Method | CBSD68(*) [51] σ=15 | CBSD68(*) [51] σ=25 | CBSD68(*) [51] σ=50 | CBSD68 [51] σ=15 | CBSD68 [51] σ=25 | CBSD68 [51] σ=50 | Urban100(*) [32] σ=15 | Urban100(*) [32] σ=25 | Urban100(*) [32] σ=50 | Urban100 [32] σ=15 | Urban100 [32] σ=25 | Urban100 [32] σ=50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ConStyle Restormer | 34.33 | 31.71 | 28.51 | 34.37 | 31.74 | 28.52 | 34.89 | 32.66 | 29.64 | 35.01 | 32.74 | 29.71 |
| ConStyle v2 Restormer | 34.33 | 31.71 | 28.51 | 34.37 | 31.74 | 28.52 | 34.89 | 32.67 | 29.66 | 35.01 | 32.77 | 29.71 |
| Differ. | 0 | 0 | 0 | 0 | +0 | 0 | 0 | +0.01 | +0.02 | 0 | +0.03 | +0 |
| ConStyle NAFNet | 34.31 | 31.69 | 28.50 | 34.34 | 31.71 | 28.52 | 34.82 | 32.58 | 29.55 | 34.91 | 32.64 | 29.63 |
| ConStyle v2 NAFNet | 34.33 | 31.71 | 28.52 | 34.36 | 31.73 | 28.54 | 34.88 | 32.66 | 29.65 | 34.96 | 32.72 | 29.68 |
| Differ. | +0.02 | +0.02 | +0.02 | +0.02 | +0.02 | +0.02 | +0.06 | +0.08 | +0.10 | +0.05 | +0.08 | +0.05 |
| ConStyle MAXIM-1S | 34.25 | 31.63 | 28.43 | 34.28 | 31.65 | 28.44 | 34.56 | 32.26 | 29.10 | 34.64 | 32.32 | 29.13 |
| ConStyle v2 MAXIM-1S | 34.19 | 31.55 | 28.34 | 34.22 | 31.58 | 28.36 | 34.60 | 32.32 | 29.19 | 34.70 | 32.40 | 29.26 |
| Differ. | -0.06 | -0.08 | -0.09 | -0.06 | -0.07 | -0.08 | +0.04 | +0.06 | +0.09 | +0.06 | +0.08 | +0.13 |

4.2 Model Analyses

Tab. 2 presents a comparison of parameters, computations, and speed between all models. All results are obtained using input data of size (2,3,128,128), and the speed is the average over 10,000 inference runs. The ConStyle v2/ConStyle models have fewer parameters than the original models (except for Original Conv and ConStyle v2 Conv) because, to demonstrate that the improvement of the ConStyle models does not come from simply expanding the scale of the network, the ConStyle models are downscaled by reducing the width and depth [24]. The ConStyle part itself adds 1.19M parameters to a model.

Table 5: Image motion deblurring and dehazing results in PSNR/SSIM. Blue means better and red means worse.
| Method | GoPro [52] | RealBlur-R [59] | RealBlur-J [59] | HIDE [64] | SOTS outdoor [37] |
| --- | --- | --- | --- | --- | --- |
| ConStyle Restormer | 31.45/0.9208 | 33.94/0.9454 | 26.63/0.8288 | 30.20/0.9098 | 30.85/0.9760 |
| ConStyle v2 Restormer | 31.36/0.9200 | 33.95/0.9447 | 26.63/0.8298 | 30.10/0.9080 | 31.32/0.9789 |
| Differ. | -0.09/-0.0008 | +0.01/-0.0007 | 0/-0.0010 | -0.10/-0.0018 | +0.47/+0.0029 |
| ConStyle NAFNet | 31.56/0.9230 | 33.89/0.9441 | 26.61/0.8308 | 30.24/0.9106 | 30.73/0.9743 |
| ConStyle v2 NAFNet | 31.68/0.9254 | 33.90/0.9444 | 26.63/0.8311 | 30.32/0.9115 | 32.34/0.9812 |
| Differ. | +0.12/+0.0024 | +0.01/+0.0003 | +0.02/+0.0003 | +0.08/+0.0009 | +1.61/+0.0069 |
| ConStyle MAXIM-1S | 30.77/0.9093 | 33.93/0.9435 | 26.54/0.8261 | 29.26/0.8951 | 30.59/0.9713 |
| ConStyle v2 MAXIM-1S | 31.01/0.9159 | 33.79/0.9407 | 26.47/0.8248 | 29.36/0.8952 | 30.68/0.9751 |
| Differ. | +0.24/+0.0066 | -0.14/-0.0028 | -0.07/-0.0013 | +0.10/+0.0001 | +0.09/+0.0038 |

4.3 All-in-One Image Restoration Result

To verify the overall performance of our methods, we calculate the average PSNR/SSIM on 27 benchmarks. As shown in Tab. 3, except that Pre-train Conv has a lower PSNR than Original Conv and ConStyle Conv, the performance of the other Pre-train, Class, and ConStyle v2 models is significantly higher than that of the Original and ConStyle models. This highlights the effectiveness of the three proposed methods: unsupervised pre-training, the pretext task, and knowledge distillation. Due to space constraints, the detailed results of the Restormer, NAFNet, MAXIM-1S, and Conv models can be found in the supplementary material. While the ConStyle v2 models may not outperform the ConStyle, Pre-train, and Class models on certain benchmarks, they still show significant improvement over the Original models. It is worth noting that scaling up the ConStyle v2 models to the size of the Original models could potentially yield even better results.
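As a reminder of how the PSNR figures are defined, the sketch below computes PSNR from the mean squared error between a reference and an estimate, and then takes an unweighted mean over per-benchmark scores. The function names are hypothetical, images are represented as flat lists of 8-bit pixel values for simplicity, and the unweighted averaging over benchmarks is an assumption about how the values in Tab. 3 are aggregated.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mean_score(scores):
    """Unweighted mean of per-benchmark scores (assumed aggregation)."""
    return sum(scores) / len(scores)


# Toy example: every pixel is off by one grey level, so the MSE is 1.
reference = [10, 20, 30, 40]
estimate = [11, 19, 31, 39]
print(round(psnr(reference, estimate), 2))  # 48.13
```

Because PSNR is logarithmic in the error, a model that collapses on even a few benchmarks pulls the average down sharply, which is consistent with the very low Original Restormer entry in Tab. 3.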

4.4 Single Image Restoration Result

Considering the significant improvement demonstrated by ConStyle v2 in handling multiple degradations, it is worth investigating whether this method can also help general restoration models address specific degradations. We simply use the same training settings as IRConStyle [24] for the specific degradation scenarios. ConStyle v2 models and ConStyle models are compared across motion deblurring, denoising, and dehazing tasks. The denoising results are presented in Tab. 4, while the results for motion deblurring and dehazing are shown in Tab. 5.

For denoising, except that ConStyle v2 MAXIM-1S performs slightly worse than ConStyle MAXIM-1S on CBSD68, the overall performance of the ConStyle v2 models is better than that of the ConStyle models. For dehazing, ConStyle v2 models significantly outperform ConStyle models, by as much as 1.61 dB for NAFNet. For motion deblurring, ConStyle v2 NAFNet is superior to ConStyle NAFNet, while ConStyle v2 is indistinguishable from ConStyle on Restormer and MAXIM-1S. In general, for a specific degradation, ConStyle v2 models do not need a carefully chosen iteration at which to freeze the weights, whereas ConStyle models do in order to avoid model collapse. Therefore, ConStyle v2 makes the entire IRConStyle framework more efficient.

4.5 All-in-One Image Restoration Visual Result

We present the visual results of the Original models and ConStyle v2 models on GoPro, DPDD, LoL v2, CBSD68, Rain100L, SOTS outdoors, Snow100K-M, and LIVE1. Due to space constraints and for fair comparison, we select the same single image per task for every model. The original degraded images and the target images are shown in Fig. 8. The visual results of Restormer and ConStyle v2 Restormer are shown in Fig. 9, those of NAFNet and ConStyle v2 NAFNet in Fig. 10, those of MAXIM-1S and ConStyle v2 MAXIM-1S in Fig. 11, and those of Original Conv and ConStyle v2 Conv in Fig. 12.

4.6 Ablation Studies

In the process of improving ConStyle into ConStyle v2, the improvements are obtained step by step (Sec. 3.4): unsupervised pre-training, adding a pretext task, and adopting knowledge distillation. Our improvement process therefore doubles as the ablation study. In addition, at every step, all models are trained on the Mix Degradations datasets for fair comparison (see the supplementary material for details).

Figure 8: The original degraded images (top) and target images (bottom).
Figure 9: The visual results of the Restormer (top) and ConStyle v2 Restormer (bottom).
Figure 10: The visual results of the NAFNet (top) and ConStyle v2 NAFNet (bottom).

5 Conclusions and Limitations

Figure 11: The visual results of the MAXIM-1S (top) and ConStyle v2 MAXIM-1S (bottom).
Figure 12: The visual results of the Original Conv (top) and ConStyle v2 Conv (bottom).

5.1 Conclusions

This paper leverages unsupervised pre-training, a pretext task, and knowledge distillation to improve ConStyle into a strong prompter for all-in-one image restoration. ConStyle v2 not only significantly improves the performance of the Original models under multiple-degradation settings, but also resolves model collapse (observed in Restormer) and the unstable training caused by model limitations (observed in Original Conv). Moreover, it removes the need in ConStyle to manually select, for each model and task, the iteration at which to freeze the weights, and it improves performance under certain specific degradations. Finally, because training datasets for multiple degradations are lacking, we collect and introduce the Mix Degradations datasets.

5.2 Limitations

Despite the advantages described in the experiments and conclusions, ConStyle v2 has two limitations. First, ConStyle v2 yields smaller improvements on low-light enhancement, deraining, and defocus deblurring than on other tasks. This is evident in the results of ConStyle v2 Conv on LoL v1 and ConStyle v2 MAXIM-1S on DPDD. The reason is that the degradation generation used during pre-training (Real-ESRGAN and ImageNet-C) does not cover rain, low light, or defocus blur, since effectively synthesizing these degradations remains a challenge in the IR community. Second, while ConStyle v2 models show promising results on quantitative metrics such as PSNR/SSIM, the visual improvements of ConStyle v2 Conv are less pronounced. This may be attributed to the inherent limitations of the Original Conv model, which is not specifically tailored for image restoration. In contrast, ConStyle v2 demonstrates significant visual enhancements in most tasks on MAXIM-1S and NAFNet.

References

  • [1] Abuolaim, A., Brown, M.S.: Defocus deblurring using dual-pixel data. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16. pp. 111–126. Springer (2020)
  • [2] Agustsson, E., Timofte, R.: Ntire 2017 challenge on single image super-resolution: Dataset and study. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. pp. 126–135 (2017)
  • [3] Bar, A., Gandelsman, Y., Darrell, T., Globerson, A., Efros, A.: Visual prompting via image inpainting. Advances in Neural Information Processing Systems 35, 25005–25017 (2022)
  • [4] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
  • [5] Bychkovsky, V., Paris, S., Chan, E., Durand, F.: Learning photographic global tonal adjustment with a database of input/output image pairs. In: CVPR 2011. pp. 97–104. IEEE (2011)
  • [6] Cai, Y., Bian, H., Lin, J., Wang, H., Timofte, R., Zhang, Y.: Retinexformer: One-stage retinex-based transformer for low-light image enhancement. arXiv preprint arXiv:2303.06705 (2023)
  • [7] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)
  • [8] Chang, M., Li, Q., Feng, H., Xu, Z.: Spatial-adaptive network for single image denoising. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16. pp. 171–187. Springer (2020)
  • [9] Chen, H., Wang, Y., Guo, T., Xu, C., Deng, Y., Liu, Z., Ma, S., Xu, C., Xu, C., Gao, W.: Pre-trained image processing transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12299–12310 (2021)
  • [10] Chen, L., Chu, X., Zhang, X., Sun, J.: Simple baselines for image restoration. In: European Conference on Computer Vision. pp. 17–33. Springer (2022)
  • [11] Chen, T., Saxena, S., Li, L., Lin, T.Y., Fleet, D.J., Hinton, G.E.: A unified sequence interface for vision tasks. Advances in Neural Information Processing Systems 35, 31333–31346 (2022)
  • [12] Chen, W.T., Fang, H.Y., Ding, J.J., Tsai, C.C., Kuo, S.Y.: Jstasr: Joint size and transparency-aware snow removal algorithm based on modified partial convolution and veiling effect removal. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16. pp. 754–770. Springer (2020)
  • [13] Chen, W.T., Fang, H.Y., Hsieh, C.L., Tsai, C.C., Chen, I., Ding, J.J., Kuo, S.Y., et al.: All snow removed: Single image desnowing algorithm using hierarchical dual-tree complex wavelet representation and contradict channel loss. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4196–4205 (2021)
  • [14] Chen, W.T., Huang, Z.K., Tsai, C.C., Yang, H.H., Ding, J.J., Kuo, S.Y.: Learning multiple adverse weather removal via two-stage knowledge learning and multi-contrastive regularization: Toward a unified model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17653–17662 (2022)
  • [15] Chen, X., He, K.: Exploring simple siamese representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15750–15758 (2021)
  • [16] Cheng, S., Wang, Y., Huang, H., Liu, D., Fan, H., Liu, S.: Nbnet: Noise basis learning for image denoising with subspace projection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4896–4906 (2021)
  • [17] Cui, Y., Ren, W., Yang, S., Cao, X., Knoll, A.: Irnext: Rethinking convolutional network design for image restoration (2023)
  • [18] Cui, Y., Tao, Y., Bing, Z., Ren, W., Gao, X., Cao, X., Huang, K., Knoll, A.: Selective frequency network for image restoration. In: The Eleventh International Conference on Learning Representations (2022)
  • [19] Cui, Y., Tao, Y., Ren, W., Knoll, A.: Dual-domain attention for image deblurring. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 479–487 (2023)
  • [20] Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. In: Proceedings of the IEEE international conference on computer vision. pp. 764–773 (2017)
  • [21] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. IEEE (2009)
  • [22] Dong, J., Pan, J.: Physics-based feature dehazing networks. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16. pp. 188–204. Springer (2020)
  • [23] Fan, D., Yue, T., Zhao, X., Chang, L.: Lir: Efficient degradation removal for lightweight image restoration. arXiv preprint arXiv:2402.01368 (2024)
  • [24] Fan, D., Zhao, X., Chang, L.: Irconstyle: Image restoration framework using contrastive learning and style transfer. arXiv preprint arXiv:2402.15784 (2024)
  • [25] Fu, H., Zheng, W., Meng, X., Wang, X., Wang, C., Ma, H.: You do not need additional priors or regularizers in retinex-based low-light image enhancement. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18125–18134 (2023)
  • [26] Fu, X., Huang, J., Zeng, D., Huang, Y., Ding, X., Paisley, J.: Removing rain from single images via a deep detail network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3855–3863 (2017)
  • [27] Fu, X., Huang, J., Zeng, D., Huang, Y., Ding, X., Paisley, J.: Removing rain from single images via a deep detail network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3855–3863 (2017)
  • [28] Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems 33, 21271–21284 (2020)
  • [29] Gu, J., Ma, X., Kong, X., Qiao, Y., Dong, C.: Networks are slacking off: Understanding generalization problem in image deraining. Advances in Neural Information Processing Systems 36 (2024)
  • [30] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9729–9738 (2020)
  • [31] Hendrycks, D., Dietterich, T.: Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261 (2019)
  • [32] Huang, J.B., Singh, A., Ahuja, N.: Single image super-resolution from transformed self-exemplars. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5197–5206 (2015)
  • [33] Jiang, J., Zhang, K., Timofte, R.: Towards flexible blind jpeg artifacts removal. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4997–5006 (2021)
  • [34] Jiang, K., Wang, Z., Yi, P., Chen, C., Huang, B., Luo, Y., Ma, J., Jiang, J.: Multi-scale progressive fusion network for single image deraining. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8346–8355 (2020)
  • [35] Kong, L., Dong, J., Ge, J., Li, M., Pan, J.: Efficient frequency domain-based transformers for high-quality image deblurring. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5886–5895 (2023)
  • [36] Lee, J., Son, H., Rim, J., Cho, S., Lee, S.: Iterative filter adaptive network for single image defocus deblurring. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2034–2042 (2021)
  • [37] Li, B., Ren, W., Fu, D., Tao, D., Feng, D., Zeng, W., Wang, Z.: Benchmarking single-image dehazing and beyond. IEEE Transactions on Image Processing 28(1), 492–505 (2018)
  • [38] Li, B., Liu, X., Hu, P., Wu, Z., Lv, J., Peng, X.: All-in-one image restoration for unknown corruption. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17452–17462 (2022)
  • [39] Li, F., Shen, L., Mi, Y., Li, Z.: Drcnet: Dynamic image restoration contrastive network. In: European Conference on Computer Vision. pp. 514–532. Springer (2022)
  • [40] Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning. pp. 12888–12900. PMLR (2022)
  • [41] Li, R., Tan, R.T., Cheong, L.F.: All in one bad weather removal using architectural search. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3175–3185 (2020)
  • [42] Li, Y., Fan, Y., Xiang, X., Demandolx, D., Ranjan, R., Timofte, R., Van Gool, L.: Efficient and explicit modelling of image hierarchies for image restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18278–18289 (2023)
  • [43] Li, Y., Tan, R.T., Guo, X., Lu, J., Brown, M.S.: Rain streak removal using layer priors. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2736–2744 (2016)
  • [44] Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., Timofte, R.: Swinir: Image restoration using swin transformer. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1833–1844 (2021)
  • [45] Liu, W., Shen, X., Pun, C.M., Cun, X.: Explicit visual prompting for low-level structure segmentations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19434–19445 (2023)
  • [46] Liu, X., Ma, Y., Shi, Z., Chen, J.: Griddehazenet: Attention-based multi-scale network for image dehazing. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 7314–7323 (2019)
  • [47] Liu, Y., Chen, X., Ma, X., Wang, X., Zhou, J., Qiao, Y., Dong, C.: Unifying image processing as visual prompting question answering. arXiv preprint arXiv:2310.10513 (2023)
  • [48] Liu, Y.F., Jaw, D.W., Huang, S.C., Hwang, J.N.: Desnownet: Context-aware deep network for snow removal. IEEE Transactions on Image Processing 27(6), 3064–3073 (2018)
  • [49] Luo, Z., Gustafsson, F.K., Zhao, Z., Sjölund, J., Schön, T.B.: Controlling vision-language models for universal image restoration. arXiv preprint arXiv:2310.01018 (2023)
  • [50] Ma, J., Cheng, T., Wang, G., Zhang, Q., Wang, X., Zhang, L.: Prores: Exploring degradation-aware visual prompt for universal image restoration. arXiv preprint arXiv:2306.13653 (2023)
  • [51] Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001. vol. 2, pp. 416–423. IEEE (2001)
  • [52] Nah, S., Hyun Kim, T., Mu Lee, K.: Deep multi-scale convolutional neural network for dynamic scene deblurring. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3883–3891 (2017)
  • [53] Park, D., Lee, B.H., Chun, S.Y.: All-in-one image restoration for unknown degradations using adaptive discriminative filters for specific degradations. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5815–5824. IEEE (2023)
  • [54] Park, D., Lee, B.H., Chun, S.Y.: All-in-one image restoration for unknown degradations using adaptive discriminative filters for specific degradations. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5815–5824. IEEE (2023)
  • [55] Potlapalli, V., Zamir, S.W., Khan, S., Khan, F.S.: Promptir: Prompting for all-in-one blind image restoration. arXiv preprint arXiv:2306.13090 (2023)
  • [56] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
  • [57] Ren, D., Zuo, W., Hu, Q., Zhu, P., Meng, D.: Progressive image deraining networks: A better and simpler baseline. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3937–3946 (2019)
  • [58] Ren, W., Ma, L., Zhang, J., Pan, J., Cao, X., Liu, W., Yang, M.H.: Gated fusion network for single image dehazing. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3253–3261 (2018)
  • [59] Rim, J., Lee, H., Won, J., Cho, S.: Real-world blur dataset for learning and benchmarking deblurring algorithms. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16. pp. 184–201. Springer (2020)
  • [60] Ruan, L., Chen, B., Li, J., Lam, M.L.: Aifnet: All-in-focus image restoration network using a light field-based dataset. IEEE Transactions on Computational Imaging 7, 675–688 (2021)
  • [61] Ruan, L., Chen, B., Li, J., Lam, M.: Learning to deblur using light field generated and real defocus images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16304–16313 (2022)
  • [62] Sheikh, H.: Live image quality assessment database release 2. http://live.ece.utexas.edu/research/quality (2005)
  • [63] Shen, Y., Fu, C., Chen, P., Zhang, M., Li, K., Sun, X., Wu, Y., Lin, S., Ji, R.: Aligning and prompting everything all at once for universal visual perception. arXiv preprint arXiv:2312.02153 (2023)
  • [64] Shen, Z., Wang, W., Lu, X., Shen, J., Ling, H., Xu, T., Shao, L.: Human-aware motion deblurring. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5572–5581 (2019)
  • [65] Tu, Z., Talebi, H., Zhang, H., Yang, F., Milanfar, P., Bovik, A., Li, Y.: Maxim: Multi-axis mlp for image processing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5769–5780 (2022)
  • [66] Valanarasu, J.M.J., Yasarla, R., Patel, V.M.: Transweather: Transformer-based restoration of images degraded by adverse weather conditions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2353–2363 (2022)
  • [67] Wang, X., Wang, W., Cao, Y., Shen, C., Huang, T.: Images speak in images: A generalist painter for in-context visual learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6830–6839 (2023)
  • [68] Wang, X., Xie, L., Dong, C., Shan, Y.: Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1905–1914 (2021)
  • [69] Wang, Y., Liu, Z., Liu, J., Xu, S., Liu, S.: Low-light image enhancement with illumination-aware gamma correction and complete image modelling network. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13128–13137 (2023)
  • [70] Wang, Z., Cun, X., Bao, J., Zhou, W., Liu, J., Li, H.: Uformer: A general u-shaped transformer for image restoration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 17683–17693 (2022)
  • [71] Wei, C., Wang, W., Yang, W., Liu, J.: Deep retinex decomposition for low-light enhancement. arXiv preprint arXiv:1808.04560 (2018)
  • [72] Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3733–3742 (2018)
  • [73] Xie, L., Wang, X., Dong, C., Qi, Z., Shan, Y.: Finding discriminative filters for specific degradations in blind super-resolution. Advances in Neural Information Processing Systems 34, 51–61 (2021)
  • [74] Yang, D., Yamac, M.: Motion aware double attention network for dynamic scene deblurring. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1113–1123 (2022)
  • [75] Yang, W., Tan, R.T., Feng, J., Liu, J., Guo, Z., Yan, S.: Deep joint rain detection and removal from a single image. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1357–1366 (2017)
  • [76] Yang, W., Tan, R.T., Feng, J., Liu, J., Guo, Z., Yan, S.: Deep joint rain detection and removal from a single image. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1357–1366 (2017)
  • [77] Yang, W., Wang, W., Huang, H., Wang, S., Liu, J.: Sparse gradient regularized deep retinex network for robust low-light image enhancement. IEEE Transactions on Image Processing 30, 2072–2086 (2021)
  • [78] Yue, Z., Yong, H., Zhao, Q., Meng, D., Zhang, L.: Variational denoising network: Toward blind noise modeling and removal. Advances in neural information processing systems 32 (2019)
  • [79] Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.H.: Restormer: Efficient transformer for high-resolution image restoration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5728–5739 (2022)
  • [80] Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.H., Shao, L.: Multi-stage progressive image restoration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14821–14831 (2021)
  • [81] Zhang, D., Wang, X.: Dynamic multi-scale network for dual-pixel images defocus deblurring with transformer. In: 2022 IEEE International Conference on Multimedia and Expo (ICME). pp. 1–6. IEEE (2022)
  • [82] Zhang, H., Patel, V.M.: Density-aware single image de-raining using a multi-stream dense network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 695–704 (2018)
  • [83] Zhang, H., Sindagi, V., Patel, V.M.: Image de-raining using a conditional generative adversarial network. IEEE transactions on circuits and systems for video technology 30(11), 3943–3956 (2019)
  • [84] Zhang, J., Huang, J., Yao, M., Yang, Z., Yu, H., Zhou, M., Zhao, F.: Ingredient-oriented multi-degradation learning for image restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5825–5835 (2023)
  • [85] Zhang, J., Huang, J., Yao, M., Yang, Z., Yu, H., Zhou, M., Zhao, F.: Ingredient-oriented multi-degradation learning for image restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5825–5835 (2023)
  • [86] Zheng, B., Chen, Y., Tian, X., Zhou, F., Liu, X.: Implicit dual-domain convolutional network for robust color image compression artifact reduction. IEEE Transactions on Circuits and Systems for Video Technology 30(11), 3982–3994 (2019)

Supplementary Material: A Strong Prompter for All-in-One Image Restoration

Dongqi Fan Junhao Zhang Liang Chang

Appendix 0.A All-in-One Image Restoration Result

Table 6: Image motion deblurring, deraining, low-light enhancement, defocus deblurring, and dehazing of Restormer models tested by PSNR/SSIM. Red means the best and blue means the second best.
Datasets | Original | ConStyle | Pre-train | Class | ConStyle v2
GoPro [52] | 5.91/0.0062 | 25.64/0.7912 | 27.94/0.8558 | 27.93/0.8554 | 28.03/0.8585
HIDE [64] | 5.99/0.0157 | 23.87/0.7560 | 26.08/0.8338 | 26.51/0.8372 | 26.53/0.8393
RealBlur-J [59] | 10.97/0.0591 | 18.49/0.5820 | 18.50/0.4209 | 19.79/0.6579 | 19.39/0.4664
RealBlur-R [59] | 8.56/0.3099 | 17.23/0.6388 | 19.88/0.4953 | 17.69/0.4213 | 23.23/0.7408
Rain100H [76] | 6.25/0.0032 | 18.41/0.5615 | 24.95/0.7818 | 25.23/0.7852 | 25.54/0.7949
Rain100L [76] | 6.25/0.0032 | 24.33/0.8136 | 22.04/0.8340 | 23.73/0.8503 | 24.49/0.8358
Test1200 [82] | 6.33/0.0169 | 27.72/0.8151 | 31.00/0.8847 | 31.30/0.8888 | 31.41/0.8926
Test2800 [27] | 6.29/0.0058 | 27.40/0.8353 | 30.93/0.9111 | 30.99/0.9119 | 31.06/0.9130
FiveK [5] | 6.64/0.0307 | 18.39/0.8026 | 24.50/0.9062 | 24.59/0.9058 | 24.46/0.9058
LoL v1 [71] | 6.38/0.0177 | 18.54/0.7231 | 21.35/0.8003 | 21.87/0.8015 | 21.75/0.8189
LoL v2 [77] | 5.57/0.0145 | 15.56/0.7042 | 24.06/0.9145 | 24.46/0.9165 | 24.10/0.9136
DPDD [1] | 7.73/0.0143 | 20.22/0.7037 | 15.00/0.6134 | 15.71/0.6104 | 14.38/0.5988
SOTS [37] | 5.23/0.0065 | 24.80/0.9326 | 28.64/0.9694 | 28.86/0.9700 | 29.86/0.9704
Table 7: Image denoising, JPEG artifact removal, and desnowing of Restormer models tested by PSNR/SSIM. Red means the best and blue means the second best.
Datasets | Setting | Original | ConStyle | Pre-train | Class | ConStyle v2
CBSD68 [51] | 15 | 6.62/0.0044 | 31.99/0.8844 | 33.33/0.9100 | 33.35/0.9117 | 33.40/0.9130
CBSD68 [51] | 25 | 6.62/0.0044 | 29.33/0.8146 | 30.40/0.8517 | 30.39/0.8515 | 30.45/0.8519
CBSD68 [51] | 50 | 6.62/0.0044 | 25.31/0.6445 | 25.63/0.7035 | 25.77/0.7081 | 25.79/0.7098
Urban100 [32] | 15 | 5.72/0.218 | 31.15/0.8923 | 32.43/0.9107 | 32.77/0.9100 | 32.45/0.9106
Urban100 [32] | 25 | 5.72/0.218 | 28.49/0.8351 | 29.60/0.8602 | 29.86/0.8661 | 29.65/0.8619
Urban100 [32] | 50 | 5.72/0.218 | 24.44/0.6716 | 24.69/0.7206 | 24.97/0.7366 | 25.07/0.7416
LIVE1 [62] | 10 | 6.20/0.0020 | 25.72/0.7512 | 27.00/0.7839 | 27.05/0.7852 | 27.12/0.7887
LIVE1 [62] | 20 | 6.20/0.0020 | 28.09/0.8329 | 29.37/0.8558 | 29.42/0.8570 | 29.47/0.8582
LIVE1 [62] | 30 | 6.20/0.0020 | 29.29/0.8674 | 30.67/0.8863 | 30.73/0.8874 | 30.77/0.8882
LIVE1 [62] | 40 | 6.20/0.0020 | 30.04/0.8866 | 31.58/0.9034 | 31.63/0.9044 | 31.67/0.9050
Snow100K [48] | S | 6.43/0.0203 | 29.12/0.8778 | 33.90/0.9418 | 34.04/0.9430 | 34.23/0.9443
Snow100K [48] | M | 6.44/0.0203 | 26.17/0.8549 | 32.17/0.9300 | 32.36/0.9316 | 32.49/0.9330
Snow100K [48] | L | 6.45/0.0214 | 23.57/0.7808 | 28.28/0.8761 | 28.46/0.8793 | 28.50/0.8809
CSD [13] | - | 5.43/0.0078 | 16.03/0.7596 | 16.00/0.7404 | 15.35/0.7595 | 16.13/0.7625
Table 8: Image motion deblurring, deraining, low-light enhancement, defocus deblurring, and dehazing of NAFNet models tested by PSNR/SSIM. Red means the best and blue means the second best.
Datasets | Original | ConStyle | Pre-train | Class | ConStyle v2
GoPro [52] | 27.52/0.8461 | 27.81/0.5839 | 27.81/0.8536 | 27.80/0.8530 | 27.82/0.8540
HIDE [64] | 25.01/0.8173 | 25.55/0.8293 | 25.27/0.8228 | 25.30/0.8125 | 25.38/0.8243
RealBlur-J [59] | 23.11/0.7418 | 24.32/0.7688 | 23.67/0.7435 | 25.17/0.7918 | 23.64/0.7419
RealBlur-R [59] | 29.18/0.7915 | 27.72/0.7397 | 28.63/0.7596 | 27.49/0.7281 | 27.34/0.7272
Rain100H [76] | 24.47/0.7666 | 21.93/0.7943 | 24.53/0.7437 | 24.72/0.7748 | 25.00/0.7777
Rain100L [76] | 21.93/0.7943 | 22.03/0.8003 | 21.85/0.8110 | 21.09/0.7958 | 22.12/0.8111
Test1200 [82] | 30.68/0.8820 | 29.38/0.8631 | 29.40/0.8793 | 29.41/0.8608 | 29.62/0.8669
Test2800 [27] | 30.70/0.9020 | 30.74/0.9087 | 30.97/0.9017 | 30.73/0.9087 | 30.75/0.9089
FiveK [5] | 24.43/0.9022 | 24.41/0.9028 | 24.42/0.9030 | 24.45/0.9018 | 24.47/0.9038
LoL v1 [71] | 21.73/0.7878 | 21.89/0.7804 | 21.74/0.8022 | 21.96/0.7993 | 22.05/0.7995
LoL v2 [77] | 24.08/0.9087 | 23.77/0.9085 | 24.08/0.9094 | 22.05/0.7995 | 24.47/0.9128
DPDD [1] | 13.34/0.5684 | 14.09/0.5860 | 14.11/0.5689 | 14.22/0.5817 | 14.25/0.5961
SOTS [37] | 29.01/0.9452 | 29.18/0.9650 | 28.51/0.9213 | 29.13/0.9646 | 29.19/0.9651
Table 9: Image denoising, JPEG artifact removal, and desnowing of NAFNet models tested by PSNR/SSIM. Red means the best and blue means the second best.
Datasets | Setting | Original | ConStyle | Pre-train | Class | ConStyle v2
CBSD68 [51] | 15 | 33.00/0.9004 | 33.20/0.9098 | 33.19/0.9005 | 33.20/0.9011 | 33.22/0.9099
CBSD68 [51] | 25 | 30.19/0.8417 | 29.94/0.8416 | 30.00/0.8357 | 30.22/0.8427 | 30.27/0.8440
CBSD68 [51] | 50 | 25.13/0.6917 | 22.64/0.6433 | 25.34/0.6963 | 25.24/0.6994 | 25.41/0.7058
Urban100 [32] | 15 | 32.81/0.9226 | 29.77/0.8670 | 32.90/0.9242 | 32.83/0.9241 | 32.82/0.9237
Urban100 [32] | 25 | 29.83/0.8718 | 25.20/0.7623 | 28.90/0.8655 | 29.91/0.8762 | 29.90/0.8761
Urban100 [32] | 50 | 24.28/0.7247 | 16.17/0.4767 | 24.38/0.7425 | 24.36/0.7459 | 24.49/0.7496
LIVE1 [62] | 10 | 26.94/0.7828 | 26.96/0.7824 | 26.85/0.7802 | 26.96/0.7829 | 26.99/0.7842
LIVE1 [62] | 20 | 29.28/0.8544 | 29.29/0.8539 | 29.28/0.8526 | 29.30/0.8545 | 29.31/0.8547
LIVE1 [62] | 30 | 30.58/0.8840 | 30.59/0.8850 | 30.49/0.8833 | 30.60/0.8851 | 30.60/0.8850
LIVE1 [62] | 40 | 31.48/0.9023 | 31.50/0.9027 | 31.54/0.9028 | 31.49/0.9025 | 31.50/0.9026
Snow100K [48] | S | 33.41/0.9399 | 33.45/0.9399 | 33.40/0.9323 | 33.38/0.9397 | 33.51/0.9404
Snow100K [48] | M | 31.77/0.9276 | 31.77/0.9275 | 31.65/0.9206 | 31.72/0.9272 | 31.79/0.9280
Snow100K [48] | L | 27.60/0.8710 | 27.64/0.8703 | 28.30/0.8771 | 27.59/0.8697 | 27.68/0.8711
CSD [13] | - | 16.49/0.7493 | 16.83/0.7545 | 16.69/0.7606 | 17.69/0.7677 | 17.29/0.7621
Table 10: Image motion deblurring, deraining, low-light enhancement, defocus deblurring, and dehazing of MAXIM-1S models tested by PSNR/SSIM. Red means the best and blue means the second best.
Datasets | Original | ConStyle | Pre-train | Class | ConStyle v2
GoPro [52] | 26.99/0.8294 | 27.35/0.8319 | 27.28/0.8379 | 27.35/0.8391 | 27.37/0.8401
HIDE [64] | 23.92/0.7764 | 25.13/0.8079 | 24.28/0.7924 | 25.54/0.8107 | 25.35/0.8096
RealBlur-J [59] | 24.05/0.7673 | 24.37/0.7574 | 23.33/0.7278 | 24.74/0.7669 | 25.05/0.7751
RealBlur-R [59] | 23.11/0.7305 | 24.94/0.6566 | 22.22/0.5779 | 25.83/0.6676 | 27.65/0.7341
Rain100H [76] | 21.58/0.7020 | 22.16/0.7777 | 22.84/0.7544 | 22.82/0.7503 | 23.75/0.7603
Rain100L [76] | 21.58/0.8020 | 21.81/0.7961 | 20.74/0.7860 | 19.90/0.7693 | 22.64/0.8076
Test1200 [82] | 30.52/0.8826 | 29.04/0.8783 | 30.81/0.8839 | 30.48/0.8815 | 30.76/0.8829
Test2800 [27] | 30.50/0.9060 | 28.34/0.8675 | 30.70/0.8999 | 30.83/0.9109 | 30.55/0.9075
FiveK [5] | 23.94/0.8953 | 24.10/0.8907 | 24.16/0.8990 | 24.15/0.9008 | 24.17/0.8993
LoL v1 [71] | 20.99/0.8000 | 20.97/0.8119 | 20.88/0.8027 | 21.03/0.8078 | 21.59/0.8089
LoL v2 [77] | 23.26/0.9036 | 24.09/0.9111 | 24.13/0.9055 | 24.44/0.9122 | 24.40/0.9112
DPDD [1] | 14.71/0.5969 | 15.09/0.6082 | 14.84/0.5982 | 15.47/0.6079 | 14.81/0.5989
SOTS [37] | 28.55/0.9586 | 29.36/0.9651 | 29.34/0.9660 | 29.24/0.9600 | 29.47/0.9679
Table 11: Image denoising, JPEG artifact removal, and desnowing of MAXIM-1S models tested by PSNR/SSIM. Red means the best and blue means the second best.
Datasets | Setting | Original | ConStyle | Pre-train | Class | ConStyle v2
CBSD68 [51] | 15 | 33.11/0.9094 | 33.20/0.9111 | 33.24/0.9113 | 33.29/0.9128 | 33.22/0.9113
CBSD68 [51] | 25 | 30.13/0.8462 | 30.27/0.8469 | 30.25/0.8470 | 30.33/0.8486 | 30.27/0.8477
CBSD68 [51] | 50 | 25.43/0.7102 | 25.50/0.7172 | 25.53/0.7195 | 25.67/0.7228 | 25.63/0.7132
Urban100 [32] | 15 | 32.47/0.9190 | 32.11/0.9186 | 32.66/0.9202 | 32.71/0.9187 | 32.77/0.9238
Urban100 [32] | 25 | 29.50/0.8693 | 29.04/0.8672 | 29.83/0.8689 | 29.86/0.8770 | 29.86/0.8777
Urban100 [32] | 50 | 24.54/0.7499 | 24.07/0.7456 | 25.01/0.7689 | 25.07/0.7718 | 24.94/0.7629
LIVE1 [62] | 10 | 26.82/0.7824 | 24.93/0.7519 | 26.82/0.7841 | 26.89/0.7830 | 26.90/0.7854
LIVE1 [62] | 20 | 29.06/0.8504 | 27.09/0.8345 | 29.24/0.8462 | 29.18/0.8506 | 29.27/0.8564
LIVE1 [62] | 30 | 30.19/0.8778 | 28.40/0.8683 | 30.45/0.8866 | 30.68/0.8876 | 30.59/0.8868
LIVE1 [62] | 40 | 30.98/0.8941 | 28.73/0.8847 | 31.56/0.9038 | 31.58/0.9004 | 31.37/0.9012
Snow100K [48] | S | 33.05/0.9380 | 33.48/0.9324 | 33.38/0.9408 | 33.53/0.9406 | 33.54/0.9420
Snow100K [48] | M | 31.46/0.9261 | 31.86/0.9209 | 31.78/0.9292 | 31.81/0.9211 | 31.87/0.9303
Snow100K [48] | L | 27.78/0.8726 | 28.01/0.8760 | 27.95/0.8755 | 28.16/0.8791 | 28.03/0.8773
CSD [13] | - | 13.60/0.7048 | 13.93/0.7087 | 13.48/0.7335 | 14.70/0.7344 | 14.27/0.7147
Table 12: Image motion deblurring, deraining, low-light enhancement, defocus deblurring, and dehazing of Conv models tested by PSNR/SSIM. Red means the best and blue means the second best.
Datasets | Original | ConStyle | Pre-train | Class | ConStyle v2
GoPro [52] | 21.56/0.6900 | 24.61/0.7275 | 25.67/0.7922 | 26.35/0.8108 | 26.54/0.8121
HIDE [64] | 20.04/0.6523 | 23.18/0.7001 | 21.84/0.7303 | 23.17/0.7019 | 23.27/0.7671
RealBlur-J [59] | 23.55/0.6025 | 25.22/0.6936 | 25.30/0.7888 | 23.66/0.7621 | 26.10/0.7979
RealBlur-R [59] | 30.01/0.8375 | 32.02/0.8901 | 30.32/0.8874 | 30.80/0.8889 | 32.60/0.9130
Rain100H [76] | 10.11/0.3799 | 12.32/0.3374 | 15.30/0.5284 | 11.96/0.4334 | 12.40/0.4236
Rain100L [76] | 17.64/0.6086 | 23.77/0.7262 | 18.89/0.7502 | 18.11/0.7265 | 18.95/0.7312
Test1200 [82] | 25.07/0.7105 | 21.27/0.6324 | 23.52/0.7902 | 28.41/0.8499 | 27.98/0.8479
Test2800 [27] | 25.40/0.7675 | 21.88/0.9619 | 23.44/0.8105 | 29.65/0.8964 | 29.27/0.8897
FiveK [5] | 17.19/0.7864 | 17.33/0.7005 | 17.72/0.7805 | 18.94/0.8032 | 19.50/0.8337
LoL v1 [71] | 7.77/0.1935 | 7.76/0.1887 | 8.32/0.2434 | 9.43/0.3634 | 7.96/0.2134
LoL v2 [77] | 11.05/0.4467 | 10.93/0.4086 | 13.02/0.5814 | 13.03/0.6542 | 13.21/0.6206
DPDD [1] | 20.32/0.5338 | 21.29/0.6129 | 19.88/0.6912 | 19.82/0.6825 | 21.92/0.7143
SOTS [37] | 22.92/0.9063 | 16.25/0.7639 | 22.22/0.9065 | 21.49/0.9160 | 24.21/0.9406
Table 13: Image denoising, JPEG artifact removal, and desnowing of Conv models tested by PSNR/SSIM. Red means the best and blue means the second best.
Datasets | Setting | Original | ConStyle | Pre-train | Class | ConStyle v2
CBSD68 [51] | 15 | 24.82/0.5730 | 24.56/0.5976 | 19.22/0.6107 | 32.92/0.9059 | 33.09/0.9047
CBSD68 [51] | 25 | 20.54/0.3950 | 21.27/0.4416 | 17.22/0.4439 | 28.33/0.8345 | 30.12/0.8354
CBSD68 [51] | 50 | 15.01/0.1999 | 16.16/0.2436 | 15.77/0.2747 | 21.74/0.6674 | 25.16/0.6830
Urban100 [32] | 15 | 24.85/0.6123 | 23.25/0.6137 | 11.72/0.4582 | 29.34/0.8898 | 32.38/0.9150
Urban100 [32] | 25 | 20.69/0.4569 | 20.48/0.4843 | 12.96/0.3982 | 25.31/0.8214 | 29.48/0.8608
Urban100 [32] | 50 | 15.18/0.2676 | 15.84/0.3055 | 13.97/0.2979 | 18.43/0.6606 | 24.40/0.7230
LIVE1 [62] | 10 | 24.95/0.7396 | 23.59/0.6604 | 25.06/0.7449 | 26.93/0.7836 | 26.80/0.7787
LIVE1 [62] | 20 | 26.53/0.8191 | 25.03/0.7324 | 26.40/0.8170 | 29.30/0.8555 | 29.22/0.8526
LIVE1 [62] | 30 | 27.32/0.8542 | 25.69/0.7619 | 26.40/0.8408 | 30.61/0.8864 | 30.54/0.8843
LIVE1 [62] | 40 | 27.83/0.8745 | 26.12/0.7793 | 26.40/0.8485 | 31.50/0.9035 | 31.40/0.9017
Snow100K [48] | S | 26.70/0.8268 | 23.46/0.7341 | 24.72/0.8582 | 31.55/0.9129 | 31.85/0.9243
Snow100K [48] | M | 24.79/0.7978 | 21.97/0.7103 | 24.34/0.8427 | 31.14/0.9200 | 30.43/0.9106
Snow100K [48] | L | 22.22/0.6906 | 18.56/0.6144 | 23.09/0.7833 | 27.15/0.8589 | 26.12/0.8456
CSD [13] | - | 15.78/0.7259 | 14.56/0.6666 | 15.66/0.7367 | 13.66/0.6878 | 16.63/0.7515