Email: {dongqifan, junhaozhang}@std.uestc.edu.cn
Email: [email protected]
ConStyle v2: A Strong Prompter for All-in-One Image Restoration
Abstract
This paper introduces ConStyle v2, a strong plug-and-play prompter designed to output clean visual prompts and assist U-Net Image Restoration models in handling multiple degradations. The joint training process of IRConStyle, an Image Restoration framework consisting of ConStyle and a general restoration network, is divided into two stages: first, ConStyle is pre-trained alone; then its weights are frozen to guide the training of the general restoration network. Three improvements are proposed for the pre-training stage: unsupervised pre-training, adding a pretext task (i.e., classification), and adopting knowledge distillation. Without bells and whistles, ConStyle v2, a strong prompter for all-in-one Image Restoration, can be obtained in less than two GPU days and does not require any fine-tuning. Extensive experiments on Restormer (transformer-based), NAFNet (CNN-based), MAXIM-1S (MLP-based), and a vanilla CNN network demonstrate that ConStyle v2 can turn any U-Net style Image Restoration model into an all-in-one Image Restoration model. Furthermore, models guided by the well-trained ConStyle v2 exhibit superior performance on some specific degradations compared to ConStyle. The code is available at: https://github.com/Dongqi-Fan/ConStyle_v2
Keywords: Image Restoration · Visual Prompting · All-in-One · Pre-training
1 Introduction
Image Restoration (IR) is a fundamental task in the computer vision community, which aims to reconstruct a high-quality image from a degraded one. Recent advancements in deep learning have shown promising results in specific IR tasks such as denoising [16, 8, 78], dehazing [46, 58, 22], deraining [29, 57, 34], desnowing [13, 12, 48], motion deblurring [35, 19, 74], defocus deblurring [61, 81, 36], low-light enhancement [69, 6, 25], and JPEG artifact removal/correction [33, 86]. However, these models are limited to addressing only one specific type of degradation. To tackle this issue, researchers have focused on developing models capable of handling multiple degradations [80, 17, 70, 42]. Yet, these models require retraining for each type of degradation; that is, each set of weights is tailored to a single type of degradation. Such approaches are not practical, as multiple degradations often coexist in real-world scenarios. For instance, rainy days are often associated with haze and reduced lighting.
![Refer to caption](x1.png)
All-in-one Image Restoration [84, 53, 66, 14] refers to methods that use a single set of weights to address multiple types of degradation. However, these all-in-one models often have a large number of parameters and require heavy computation, leading to time-consuming and inefficient training. For instance, Chen et al. [14] (Fig. 1 (a)) adjust the number of teacher networks based on the number of degradations, so the more teacher networks there are, the more complex the training process becomes; Li et al. [41] (Fig. 1 (a)) also use as many sub-networks as there are degradation types, and their backbone is obtained through neural architecture search (NAS); PromptGIP [47] (Fig. 1 (b)) leverages visual prompting to help the model handle multiple degradations, but a degraded-clean sample pair must be provided in both the training and inference stages, and training requires 8 V100 GPUs. These methods are inefficient because they lack prior knowledge about the degradations in the input image. In other words, prior information about the degradation type must first be obtained and then passed to subsequent sub-networks. In contrast, thanks to ConStyle v2, the general restoration network (Fig. 1 (c)) does not need specific prior knowledge about degradations, but only requires a clean visual prompt (clean prior).
In this paper, we introduce a strong prompter for all-in-one Image Restoration: ConStyle v2 (Fig. 2 and Fig. 3). The key to our work is to ensure that ConStyle v2 generates clean visual prompts, thus mitigating the issue of model collapse and guiding the training of general restoration networks. Model collapse typically arises when a model struggles to handle multiple degradations simultaneously. To address this challenge, we propose three simple yet effective improvements to ConStyle: unsupervised pre-training, leveraging a pretext task to enhance semantic information extraction, and employing knowledge distillation to further enhance this capacity. ConStyle v2 consists only of convolution, linear, and BN layers, without any complex operators (Fig. 3 (c)), so training takes less than two days on a V100 GPU and an Intel Xeon Silver 4216 CPU. Once trained, ConStyle v2 can be seamlessly integrated into any U-Net model to facilitate its training without fine-tuning. Additionally, given the lack of datasets encompassing multiple degradations, we collect and produce a Mix Degradations dataset, which includes noise, motion blur, defocus blur, rain, snow, low-light, JPEG artifact, and haze, to meet the training requirements. To verify our method, we apply ConStyle v2 to three state-of-the-art IR models (Restormer [79], MAXIM-1S [65], NAFNet [10]) and a non-IR U-Net model consisting of vanilla convolutions. Fig. 4 shows the detailed architectures of the Original models and the ConStyle/ConStyle v2 models. Experimental results on 27 benchmarks demonstrate that ConStyle v2 is a powerful plug-and-play prompter for all-in-one image restoration and exhibits superior performance on specific degradations compared to ConStyle [24].
Our contributions can be summarized as follows:
- Three simple yet effective enhancements are proposed to train ConStyle v2, and training takes less than two GPU days.
- We propose a Mix Degradations dataset, which includes noise, motion blur, defocus blur, rain, snow, low-light, JPEG artifact, and haze, to meet training needs.
- We propose a strong plug-and-play prompter for all-in-one and specific image restoration, in which the model collapse issue is avoided.
2 Related Work
2.1 All-in-One Image Restoration
While numerous works [80, 17, 70, 42, 79, 65, 10, 9, 18, 39, 23] excel in various Image Restoration tasks, they are typically limited to addressing a single type of degradation with a specific set of weights. To solve this problem, all-in-one Image Restoration (IR) methods [84, 53, 66, 14, 49, 47, 54, 85, 38, 50, 55] have been developed. These methods aim to enable models to handle multiple degradations simultaneously. For example, AirNet [38] leverages MoCo [30] and Deformable Convolution [20] to transform degradation priors obtained from the former into convolution kernels in the latter, enabling dynamic degradation removal; DA-CLIP [49] builds upon the architecture of CLIP [56], using BLIP [40] to generate synthetic captions for high-quality images and then matching low-quality images with the captions and corresponding degradation types to form image-text-degradation pairs; ADMS [54] introduces a Filter Attribution method based on FAIG [73] to identify the specific contributions of filters in removing specific degradations, while IDR [85] proposes a learnable Principal Component Analysis and treats various IR tasks as a form of multi-task learning to acquire priors. Different from the above methods, we aim to design a plug-and-play module that can transform a non-all-in-one model into an all-in-one model.
2.2 Visual Prompting
In the field of Natural Language Processing (NLP), Prompting Learning refers to providing task-specific instructions or in-context information to a model without the need for retraining. This approach has shown promising results in NLP, for example in GPT-3 [4]. Drawing inspiration from Prompting Learning in NLP, many excellent visual prompting works have recently appeared in IR [47, 50, 55, 11, 45, 3, 67, 63]. For example, ProRes [50] adds a target visual prompt to an input image to create a "prompted image". This prompted image is then flattened into patches, and, with the weights of ProRes frozen, learnable prompts are randomly initialized for new tasks or datasets; PromptIR [55] introduces a Prompt Block in the decoder stage of the U-Net architecture. This block takes prompt components and the output of the previous transformer block as inputs, with its output being fed into the next transformer block; PromptGIP [47] proposes a training method akin to masked autoencoding, where certain portions of question images and answer images are randomly masked to prompt the model to reconstruct these patches from the unmasked areas. During inference, input-output pairs are assembled as task prompts to realize image restoration. Our approach also leverages visual prompting, but in a more efficient manner: unlike the above methods, it does not need to distinguish between different degradations. Instead, it provides a clean visual prompt for other models.
![Refer to caption](x2.png)
![Refer to caption](x3.png)
3 Method
The training diagram of ConStyle v2 is depicted in Fig. 2, while the distinctions between ConStyle and ConStyle v2 are illustrated in Fig. 3. As with ConStyle, ConStyle v2 retains only the Encoder part once pre-training is complete. In this section, we first provide a brief overview of ConStyle (Sec. 3.1), then show the problems encountered by ConStyle under multiple degradations (Sec. 3.3), and finally illustrate the improvements made from ConStyle to ConStyle v2 in three steps (Sec. 3.4). In addition, the Mix Degradations dataset is described in Sec. 3.2.
![Refer to caption](x4.png)
3.1 Review of ConStyle
IRConStyle [24] is a versatile and robust IR framework consisting of ConStyle and a general restoration network. ConStyle comprises several convolutional layers and one MLP layer; it extracts latent features (the latent code and intermediate feature maps) and passes them to the general restoration network. The general restoration network follows an abstract U-Net style architecture, so any U-Net IR model can be instantiated in its place. The training (Eq. 1) and inference (Eq. 2) processes of IRConStyle [24] can be described as follows:

$$I_{restored} = G(I_{degraded},\, E(I_{degraded}, I_{clean})) \tag{1}$$

$$I_{restored} = G(I_{degraded},\, E(I_{degraded})) \tag{2}$$

where $G$ stands for the general restoration network, $E$ for ConStyle, $I_{degraded}$ for the input degraded image, $I_{clean}$ for the clean image that is available only during training, and $I_{restored}$ for the output restored clean image. Based on the contrastive learning framework MoCo [30], ConStyle integrates the idea of style transfer and replaces the pretext task, Instance Discrimination [72], with a pretext task more suitable for IR. The total loss function for IRConStyle is as follows:

$$\mathcal{L}_{total} = \mathcal{L}_{style} + \mathcal{L}_{content} + \mathcal{L}_{InfoNCE} + \mathcal{L}_{1} \tag{3}$$

The calculation of $\mathcal{L}_{style}$, $\mathcal{L}_{content}$, and $\mathcal{L}_{InfoNCE}$ is performed in ConStyle, while $\mathcal{L}_{1}$ is computed in the general restoration network. Under the supervision of $\mathcal{L}_{style}$, $\mathcal{L}_{content}$, and $\mathcal{L}_{InfoNCE}$, the latent features move closer to the clean space and further away from the degradation space. Since ConStyle can adaptively output clean latent features according to the input degraded image, it is natural to expect that ConStyle can turn the general restoration network into an all-in-one model. However, the experimental results on ConStyle models are not as expected.
3.2 Mix Degradations Datasets
We need a training dataset that includes noise, motion blur, defocus blur, rain, snow, low light, JPEG artifact, and haze, but no existing training dataset meets this need. Therefore, we propose a dataset, namely the Mix Degradations dataset, consisting of image pairs covering all of the aforementioned degradations. Details on the Mix Degradations dataset can be found in Tab. 1. The images with noise and JPEG artifacts are generated using the same established methods as [79, 65, 10] and [44, 33, 86], respectively. It is important to note that in the OTS dataset, haze images with intensities of 0.04 and 0.06 are manually removed because the haze is barely visible to the human eye. In addition, the deraining dataset, which includes Rain14000 [26], Rain1800 [75], Rain800 [83], and Rain12 [43], initially contained 13,712 images, but two erroneous images were identified and removed. After a unified cropping process, the Mix Degradations dataset has 621,573 images, all of size 256 × 256. The Mix Degradations dataset, the uncropped joint datasets mentioned in Tab. 1 (totaling 46,301 images), and the data preparation file are all available on our GitHub link.
Task | Motion Blurring | Defocus Blurring | Dehazing | Low-light Enhancement |
---|---|---|---|---|
Source | GoPro [52] | LFDOF [60] | OTS [37] | LoL v1 [71], LoL v2 [77], and FiveK [5] |
Images Used | 2,103 | 5,606 | 12,500 | 5,885 |
Crop Size | 256 | 256 | 256 | 256 |
Step Size | 150 | 220 | 240 | 100 |
Final Number | 84,120 | 84,090 | 82,425 | 51,891 |

Task | JPEG Artifact Removal | Denoising | Desnowing | Deraining |
---|---|---|---|---|
Source | DIV2K [2] | DIV2K [2] | Snow100K [48] | Rain14000, 1800, 800 and 12 |
Images Used | 800 | 800 | 7,000 | 13,710 |
Crop Size | 256 | 256 | 256 | 256 |
Step Size | 165 | 165 | 160 | 240 |
Final Number | 77,904 | 77,904 | 82,432 | 80,807 |
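The arithmetic behind Tab. 1 can be checked with a short sketch. The per-task counts are copied from the table; the cropping rule, however, is an assumption: the text only gives a crop size and a step size, so the sketch assumes a sliding window that adds one extra edge-aligned crop when the regular grid does not reach the image border. The 1280 × 720 GoPro frame size is also an assumption that does not appear in the text above. Under these assumptions the GoPro row is reproduced exactly, which suggests (but does not prove) that a rule of this kind was used.

```python
# Final crop counts per task, copied from Tab. 1.
final_counts = {
    "motion_blur": 84_120,
    "defocus_blur": 84_090,
    "haze": 82_425,
    "low_light": 51_891,
    "jpeg": 77_904,
    "noise": 77_904,
    "snow": 82_432,
    "rain": 80_807,
}
total_crops = sum(final_counts.values())


def window_starts(length, crop, step):
    """Start offsets of a sliding window along one axis (requires length >= crop).

    Assumed rule: regular grid with the given step, plus one edge-aligned
    window if the grid leaves uncovered pixels at the border.
    """
    starts = list(range(0, length - crop + 1, step))
    if starts[-1] + crop < length:
        starts.append(length - crop)
    return starts


def crops_per_image(height, width, crop, step):
    """Number of crop x crop patches extracted from one image."""
    return len(window_starts(height, crop, step)) * len(window_starts(width, crop, step))


# GoPro row of Tab. 1: 2,103 images, crop 256, step 150.
# The 1280 x 720 frame size is an assumption, not stated in the text.
gopro_per_image = crops_per_image(720, 1280, crop=256, step=150)
gopro_total = 2_103 * gopro_per_image

print(total_crops)                   # 621573
print(gopro_per_image, gopro_total)  # 40 84120
```

The sum of the eight "Final Number" entries matches the 621,573 images quoted in the text. Rows whose source images vary in size (for example DIV2K) cannot be verified this way without the actual image dimensions.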
3.3 ConStyle on Mix Degradations Datasets
To evaluate whether ConStyle [24] can directly convert U-Net models to all-in-one models, we conduct experiments using the Original models (Restormer [79], NAFNet [10], MAXIM-1S [65]) and the ConStyle models (ConStyle Restormer, ConStyle NAFNet, ConStyle MAXIM-1S) on the Mix Degradations dataset. These models are tested on GoPro [52] (motion blurring), RealDOF [60] (defocus blurring), LoL v1 [71] (low-light enhancement), SOTS outdoors [37] (dehazing), LIVE1 [62] (JPEG artifact removal), CSD [13] (desnowing), CBSD68 [51] (denoising), and Rain100H [76] (deraining). In addition, we introduce two more models for comparison: Original Conv and ConStyle Conv. The Original Conv is a vanilla U-Net convolution model, while the ConStyle Conv incorporates ConStyle into the Original Conv. These two additional models are used to verify the generality of ConStyle and ConStyle v2. It is important to note that, to expedite evaluation during training, only a subset of each test dataset is used. For example, only 24 images are selected from GoPro's test dataset of 1,111 images. Using the full test datasets for all 8 tasks would significantly increase training time, as inference with Restormer alone on GoPro's test dataset would take 40 minutes on a V100 GPU.
![Refer to caption](x5.png)
The results in Fig. 5 demonstrate that, under multiple degradation settings, ConStyle Conv not only fails to surpass the Original Conv but also remains at a consistently low performance level after 80K iterations. While ConStyle Restormer performs better than Restormer, it suffers from the same model collapse problem as Restormer. In the following section, we illustrate the process of improving ConStyle to ConStyle v2 step by step on Restormer and Original Conv.
3.4 ConStyle v2
3.4.1 Unsupervised Pre-training
We find that even for specific IR tasks, ConStyle models exhibit varying degrees of model collapse. For example, in the dehazing task, the performance of ConStyle NAFNet, ConStyle MAXIM-1S, and ConStyle Restormer declines significantly after 250K, 30K, and 10K iterations, respectively. Interestingly, even within the same model, such as ConStyle MAXIM-1S, the onset of model collapse differs across denoising, deraining, and deblurring tasks, occurring at 10K, 100K, and 200K iterations, respectively. To address this challenge, IRConStyle [24] stops ConStyle updates early. Here, we intend to solve this problem more elegantly. Since the problem arises in joint training, it is natural to split the joint training process into two stages. Specifically, ConStyle is first pre-trained independently; its weights are then fixed and it is integrated with other IR models to guide their training. For the pre-training stage, we leverage the generation techniques of Real-ESRGAN [68] and ImageNet-C [31] in the Degradation Process (Fig. 2) for unsupervised training on ImageNet-1K [21].
Since our goal is to train ConStyle v2 to be a powerful prompter that can produce a clean visual prompt for different degradations, we use the method in ImageNet-C [31] to generate motion blur, snow, and low contrast, and the two-stage degradation method in Real-ESRGAN [68] to generate Gaussian blur, noise, and JPEG artifacts. For each batch of images, 40% of the images are randomly selected to receive motion blur, snow, and a contrast change, while the remaining 60% receive Gaussian blur, noise, and JPEG artifacts. For the detailed training settings of the pre-training stage, please see Sec. 4. The pre-trained ConStyle, with its weights fixed, is incorporated into the general restoration network for training on the Mix Degradations dataset. Here, we name the models of this stage Pre-train models. The results of Pre-train Restormer and Pre-train Conv can be seen in Fig. 6 (a) and (e).
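The 40%/60% split described above can be sketched as a per-batch assignment of each image to one of the two degradation pipelines. This is a minimal illustration only: the function name `assign_pipelines`, the operation labels, and the rounding of the 40% share are all hypothetical, and the actual degradation functions from ImageNet-C and Real-ESRGAN are not implemented here.

```python
import random

# Hypothetical labels for the two degradation pipelines named in the text.
IMAGENET_C_OPS = ("motion_blur", "snow", "contrast")
REAL_ESRGAN_OPS = ("gaussian_blur", "noise", "jpeg")


def assign_pipelines(batch_size, imagenet_c_fraction=0.4, rng=None):
    """Randomly pick which images in a batch get the ImageNet-C style ops.

    Returns a list of length batch_size; entry i is the tuple of operation
    labels to apply to image i. The share is rounded to the nearest integer,
    which is an assumption (the text only states the percentages).
    """
    rng = rng or random.Random()
    n_imagenet_c = round(batch_size * imagenet_c_fraction)
    chosen = set(rng.sample(range(batch_size), n_imagenet_c))
    return [IMAGENET_C_OPS if i in chosen else REAL_ESRGAN_OPS
            for i in range(batch_size)]


# Pre-training batch size of 32, as given in the implementation details.
plan = assign_pipelines(32, rng=random.Random(0))
n_c = plan.count(IMAGENET_C_OPS)
print(n_c, len(plan) - n_c)  # 13 19
```

With a batch of 32, an exact 40% share is 12.8 images, so any implementation has to round; the sketch makes that choice explicit rather than hiding it.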
![Refer to caption](x6.png)
3.4.2 Pretext Task
![Refer to caption](x7.png)
Following the pre-training step, ConStyle Restormer demonstrates a significant performance improvement and no longer suffers from model collapse. Conversely, ConStyle Conv continues to face unstable training and shows limited improvement in performance. We attribute this to heavy degradation, which leads to a loss of semantic information (Fig. 7) in the original image. As a result, models with weak semantic extraction ability, such as ConStyle Conv, also have poor image restoration performance, because image restoration involves pixel-wise operations and requires a comprehensive understanding of the entire image. We therefore introduce a pretext task (classification) to enhance the semantic information extraction capability of ConStyle, and thereby that of the ConStyle models. Specifically, we add Classifier and Softmax layers after the Encoder and leverage the labels of ImageNet (Fig. 2). Here, we name the models of this stage Class models. As shown in Fig. 6 (b) and (f), the addition of the pretext task has little influence on ConStyle Restormer, since the Transformer model already has strong semantic extraction abilities. In contrast, for ConStyle Conv, the inclusion of the pretext task stabilizes training and significantly improves performance.
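As a rough illustration of the classification pretext task, the sketch below computes a softmax over classifier logits and the cross-entropy against an ImageNet-style integer label. It assumes the standard cross-entropy objective; the text only states that Classifier and Softmax layers and the ImageNet labels are used, so the exact loss and its weighting are not confirmed by the source. The function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the ground-truth class index."""
    return -math.log(softmax(logits)[label])


# Toy example with 4 classes (ImageNet-1K would have 1,000).
uninformed = cross_entropy([0.0, 0.0, 0.0, 0.0], label=2)   # equals log(4)
confident = cross_entropy([0.0, 0.0, 5.0, 0.0], label=2)    # much smaller
print(uninformed > confident)  # True
```

The loss falls as the encoder's features make the correct class more separable, which is the mechanism by which a classification head can push an encoder toward retaining semantic content.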
3.4.3 Knowledge Distillation
ConStyle has been significantly improved through pre-training and the addition of a pretext task, enabling it to generate clean visual prompts from degraded input images. To further boost its ability to extract semantic information, we take one last step to improve ConStyle to ConStyle v2. Inspired by BYOL [28], SimSiam [15], and DINO [7], we take advantage of knowledge distillation. Since the input of the Momentum Encoder is the clean image, its output visual prompt is cleaner than the output of the Encoder; by using the Momentum Encoder as the teacher and the Encoder as the student, the teacher network can adaptively guide the student network during training. Specifically, a Classifier and a Softmax layer are added after the Momentum Encoder, with the distance between the two outputs measured by the Kullback-Leibler (KL) divergence (Fig. 2). This yields the final ConStyle v2, and the performance of ConStyle v2 Restormer (Fig. 6 (g)) and ConStyle v2 Conv (Fig. 6 (c)) is raised again compared with the other models. In addition, the improvement in average performance across the eight degradations at each step is depicted in Fig. 6 (d) and (h).
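The distillation term can be sketched as a KL divergence between the teacher's and the student's softmax outputs. The direction KL(teacher ‖ student) and the absence of a temperature are assumptions made for illustration; the text only states that the distance is measured by the KL function. In a real training loop the teacher output would also be excluded from gradient computation, which this framework-free sketch does not model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits):
    """Assumed form: KL from the teacher (Momentum Encoder branch, clean input)
    to the student (Encoder branch, degraded input)."""
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))


matched = distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
mismatched = distillation_loss([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0])
print(matched, mismatched > 0.0)  # 0.0 True
```

The loss is zero only when the student's class distribution on the degraded image matches the teacher's distribution on the clean image, which is the sense in which the teacher guides the student.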
4 Experiments
4.1 Implementation Details
All experiments in this paper are performed on an NVIDIA Tesla V100 GPU. To be consistent with ConStyle [24], we use the AdamW (β1=0.9, β2=0.999, weight decay=) optimizer with an initial learning rate of and cosine annealing. Training Stage: The batch size, crop size, and total iterations are set to 16, 128, and 700K, respectively. Pre-training Stage: The batch size, crop size, and total iterations are set to 32, 224, and 200K, respectively. In the process of generating degraded images, we directly use all configurations in Real-ESRGAN [68] and change the intensity of degradation in ImageNet-C [31].
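For readers unfamiliar with cosine annealing, the schedule can be sketched in closed form. This is the standard single-cycle cosine decay without restarts or warm-up, which is an assumption about the variant used. The base learning rate is left as a relative scale of 1.0 because its value is not recoverable from the text, and a final learning rate of zero is likewise assumed.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Single-cycle cosine annealing from base_lr (step 0) to min_lr (last step)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 700_000  # training-stage iterations from the text
BASE_LR = 1.0          # placeholder: relative scale, not the actual learning rate

start = cosine_lr(0, TOTAL_STEPS, BASE_LR)
middle = cosine_lr(TOTAL_STEPS // 2, TOTAL_STEPS, BASE_LR)
end = cosine_lr(TOTAL_STEPS, TOTAL_STEPS, BASE_LR)
print(start, round(middle, 6), end)  # 1.0 0.5 0.0
```

Multiplying the returned factor by the actual initial learning rate gives the learning rate at any iteration under these assumptions.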
Model | Original | ConStyle | Pre-train | Class | ConStyle v2 |
---|---|---|---|---|---|
Restormer [79] | 6.40/0.0506 | 25.23/0.7844 | 27.24/0.8222 | 27.40/0.8305 | 27.60/0.8352 |
NAFNet [10] | 27.45/0.8347 | 26.57/0.8054 | 27.43/0.8341 | 27.53/0.8379 | 27.55/0.8391 |
MAXIM-1S [65] | 26.95/0.8319 | 26.63/0.8290 | 27.14/0.8319 | 27.41/0.8372 | 27.54/0.8400 |
Conv | 21.44/0.6177 | 20.99/0.5986 | 20.14/0.6544 | 24.84/0.7889 | 26.06/0.7974 |
The Mix Degradations dataset is used in the training stage and ImageNet-1K in the pre-training stage. For evaluation, GoPro [52], HIDE [64], RealBlur-J [59], and RealBlur-R [59] are used for motion deblurring, DPDD [1] is used for defocus deblurring, SOTS outdoors [37] is used for dehazing, Rain100H [76], Rain100L [76], Test1200 [82], and Test2800 [27] are used for deraining, FiveK [5], LoL v1 [71], and LoL v2 [77] are used for low-light enhancement, CSD [13] and Snow100K (S, M, and L) [48] are used for desnowing, CBSD68 [51] and Urban100 [32] are used for denoising, and LIVE1 [62] is used for JPEG artifact removal.
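Method | CBSD68(*) [51], σ=15 | CBSD68(*) [51], σ=25 | CBSD68(*) [51], σ=50 | CBSD68 [51], σ=15 | CBSD68 [51], σ=25 | CBSD68 [51], σ=50 | Urban100(*) [32], σ=15 | Urban100(*) [32], σ=25 | Urban100(*) [32], σ=50 | Urban100 [32], σ=15 | Urban100 [32], σ=25 | Urban100 [32], σ=50 |
---|---|---|---|---|---|---|---|---|---|---|---|---|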
ConStyle Restormer | 34.33 | 31.71 | 28.51 | 34.37 | 31.74 | 28.52 | 34.89 | 32.66 | 29.64 | 35.01 | 32.74 | 29.71 |
ConStyle v2 Restormer | 34.33 | 31.71 | 28.51 | 34.37 | 31.74 | 28.52 | 34.89 | 32.67 | 29.66 | 35.01 | 32.77 | 29.71 |
Differ. | 0 | 0 | 0 | 0 | +0 | 0 | 0 | +0.01 | +0.02 | 0 | +0.03 | +0 |
ConStyle NAFNet | 34.31 | 31.69 | 28.50 | 34.34 | 31.71 | 28.52 | 34.82 | 32.58 | 29.55 | 34.91 | 32.64 | 29.63 |
ConStyle v2 NAFNet | 34.33 | 31.71 | 28.52 | 34.36 | 31.73 | 28.54 | 34.88 | 32.66 | 29.65 | 34.96 | 32.72 | 29.68 |
Differ. | +0.02 | +0.02 | +0.02 | +0.02 | +0.02 | +0.02 | +0.06 | +0.08 | +0.10 | +0.05 | +0.08 | +0.05 |
ConStyle MAXIM-1S | 34.25 | 31.63 | 28.43 | 34.28 | 31.65 | 28.44 | 34.56 | 32.26 | 29.10 | 34.64 | 32.32 | 29.13 |
ConStyle v2 MAXIM-1S | 34.19 | 31.55 | 28.34 | 34.22 | 31.58 | 28.36 | 34.60 | 32.32 | 29.19 | 34.70 | 32.40 | 29.26 |
Differ. | -0.06 | -0.08 | -0.09 | -0.06 | -0.07 | -0.08 | +0.04 | +0.06 | +0.09 | +0.06 | +0.08 | +0.13 |
4.2 Model Analyses
Tab. 2 presents a comparison of parameters, computations, and speed across all models. All results are obtained using input data of size (2,3,128,128), and the speed is averaged over 10,000 inferences. The ConStyle v2/ConStyle models have fewer parameters than the Original models (except for Original Conv and ConStyle v2 Conv) because the ConStyle models are downscaled by reducing their width and depth [24], in order to demonstrate that their improvement does not come from simply expanding the scale of the network. The ConStyle part itself adds 1.19M parameters to a model.
Method | GoPro [52] | RealBlur-R [59] | RealBlur-J [59] | HIDE [64] | SOTS outdoor [37] |
---|---|---|---|---|---|
ConStyle Restormer | 31.45/0.9208 | 33.94/0.9454 | 26.63/0.8288 | 30.20/0.9098 | 30.85/0.9760 |
ConStyle v2 Restormer | 31.36/0.9200 | 33.95/0.9447 | 26.63/0.8298 | 30.10/0.9080 | 31.32/0.9789 |
Differ. | -0.09/ -0.0008 | +0.01/ -0.0007 | 0/ +0.0010 | -0.10/ -0.0018 | +0.47/ +0.0029 |
ConStyle NAFNet | 31.56/0.9230 | 33.89/0.9441 | 26.61/0.8308 | 30.24/0.9106 | 30.73/0.9743 |
ConStyle v2 NAFNet | 31.68/0.9254 | 33.90/0.9444 | 26.63/0.8311 | 30.32/0.9115 | 32.34/0.9812 |
Differ. | +0.12/ +0.0024 | +0.01/ +0.0003 | +0.02/ +0.0003 | +0.08/ +0.0009 | +1.61/ +0.0069 |
ConStyle MAXIM-1S | 30.77/0.9093 | 33.93/0.9435 | 26.54/0.8261 | 29.26/0.8951 | 30.59/0.9713 |
ConStyle v2 MAXIM-1S | 31.01/0.9159 | 33.79/0.9407 | 26.47/0.8248 | 29.36/0.8952 | 30.68/0.9751 |
Differ. | +0.24/ +0.0066 | -0.14/ -0.0028 | -0.07/ -0.0013 | +0.10/ +0.0001 | +0.09/ +0.0038 |
4.3 All-in-One Image Restoration Result
To verify the overall performance of our methods, we calculate the average PSNR/SSIM on 27 benchmarks. As shown in Tab. 3, except that Pre-train Conv has a lower PSNR than Original Conv and ConStyle Conv, the Pre-train, Class, and ConStyle v2 models perform significantly better than the Original and ConStyle models. This highlights the effectiveness of the three proposed methods: unsupervised pre-training, using a pretext task, and using knowledge distillation. Due to space constraints, the detailed results of the Restormer, NAFNet, MAXIM-1S, and Conv models can be found in the supplementary material. While the ConStyle v2 models may not outperform the ConStyle, Pre-train, and Class models on certain benchmarks, they still show significant improvement over the Original models. It is worth noting that scaling up the ConStyle v2 models to the size of the Original models could potentially yield even better results.
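For reference, the PSNR figures averaged above follow the usual definition, PSNR = 10·log10(MAX² / MSE). The sketch below applies it to flat lists of 8-bit pixel values and then averages over several images, as a benchmark score would. It is a minimal illustration: which implementation and which color space (RGB or luminance only) were used for the reported numbers is not stated in the text, and the function names are illustrative.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(restored):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mean_psnr(pairs):
    """Average PSNR over (reference, restored) pairs, e.g. one benchmark."""
    scores = [psnr(ref, out) for ref, out in pairs]
    return sum(scores) / len(scores)


clean = [10, 20, 30, 40]
off_by_one = [11, 21, 31, 41]   # MSE = 1
off_by_ten = [20, 30, 40, 50]   # MSE = 100, exactly 20 dB lower
print(round(psnr(clean, off_by_one), 2))  # 48.13
print(round(mean_psnr([(clean, off_by_one), (clean, off_by_ten)]), 2))  # 38.13
```

Because PSNR is logarithmic in the error, a collapsed model that outputs near-constant images scores far below a working one, which is consistent with the very low Original Restormer average in Tab. 3.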
4.4 Single Image Restoration Result
Considering the significant improvement demonstrated by ConStyle v2 in handling multiple degradations, it is worth investigating whether this method can also enhance general restoration models on specific degradations. We simply use the same training settings as IRConStyle [24] for the specific degradation scenarios. ConStyle v2 models and ConStyle models are compared on motion deblurring, denoising, and dehazing tasks. The denoising results are presented in Tab. 4, while the results for motion deblurring and dehazing are shown in Tab. 5.
For denoising, except that ConStyle v2 MAXIM-1S performs slightly worse than ConStyle MAXIM-1S on CBSD68, the overall performance of ConStyle v2 models is equal to or better than that of ConStyle models. For dehazing, ConStyle v2 models outperform ConStyle models on all three backbones, by as much as 1.61 dB on NAFNet. For motion deblurring, ConStyle v2 NAFNet is superior to ConStyle NAFNet, while the results on Restormer and MAXIM-1S are mixed and close to those of ConStyle. In general, for specific degradations, ConStyle v2 models do not require choosing an exact iteration at which to freeze the weights, whereas ConStyle models need this to avoid model collapse. Therefore, ConStyle v2 makes the entire IRConStyle framework more efficient.
4.5 All-in-One Image Restoration Visual Result
We present the visual results of the Original models and ConStyle v2 models on GoPro, DPDD, LoL v2, CBSD68, Rain100L, SOTS outdoors, Snow100K-M, and LIVE1. Due to space constraints and for fair comparison, we select the same single image per task for every model. The original degraded and target images are shown in Fig. 8. The visual results of Restormer and ConStyle v2 Restormer are shown in Fig. 9, those of NAFNet and ConStyle v2 NAFNet in Fig. 10, those of MAXIM-1S and ConStyle v2 MAXIM-1S in Fig. 11, and those of Original Conv and ConStyle v2 Conv in Fig. 12.
4.6 Ablation Studies
In the process of improving ConStyle to ConStyle v2, the results are obtained step by step (Sec. 3.4): unsupervised pre-training, adding a pretext task, and adopting knowledge distillation. Therefore, our improvement process as a whole also serves as the ablation study. In addition, for every improvement step, we train all models on the Mix Degradations dataset for fair comparison (see the supplementary material for details).
![Refer to caption](extracted/5692928/figs/target.png)
![Refer to caption](extracted/5692928/figs/Restormer.png)
![Refer to caption](extracted/5692928/figs/NAFNet.png)
5 Conclusions and Limitations
![Refer to caption](extracted/5692928/figs/MAXIM.png)
![Refer to caption](extracted/5692928/figs/Conv.png)
5.1 Conclusions
This paper leverages unsupervised pre-training, a pretext task, and knowledge distillation to improve ConStyle into a strong prompter for all-in-one image restoration. ConStyle v2 not only significantly improves the performance of the Original models under multiple degradation settings, but also solves the issues of model collapse (observed in Restormer) and unstable training caused by model limitations (observed in Original Conv). Moreover, ConStyle v2 avoids the redundant step in ConStyle of manually selecting, for each model and task, the iteration at which to freeze the weights, and the performance of ConStyle v2 models under certain specific degradations is also improved. Finally, due to the lack of training datasets for multiple degradations, the Mix Degradations dataset is collected and introduced.
5.2 Limitations
Despite the advantages described in the conclusions and experiments, there are two limitations. Firstly, ConStyle v2 exhibits limited improvements in low-light enhancement, deraining, and defocus deblurring compared to other tasks. This is evident in the results of ConStyle v2 Conv on LoL v1 and ConStyle v2 MAXIM-1S on DPDD. The reason is that, during the pre-training stage, rain, low light, and defocus blur are not included in the degradation generation (Real-ESRGAN and ImageNet-C), since effectively synthesizing defocus blur, rain, and low light is still a challenge in the IR community. Secondly, while ConStyle v2 models have shown promising results in quantitative metrics such as PSNR/SSIM, the visual improvements of ConStyle v2 Conv seem to be less pronounced. This may be attributed to the inherent limitations of the Original Conv model, which is not specifically tailored for image restoration. However, ConStyle v2 demonstrates significant visual enhancements in most tasks on MAXIM-1S and NAFNet.
References
- [1] Abuolaim, A., Brown, M.S.: Defocus deblurring using dual-pixel data. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16. pp. 111–126. Springer (2020)
- [2] Agustsson, E., Timofte, R.: Ntire 2017 challenge on single image super-resolution: Dataset and study. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. pp. 126–135 (2017)
- [3] Bar, A., Gandelsman, Y., Darrell, T., Globerson, A., Efros, A.: Visual prompting via image inpainting. Advances in Neural Information Processing Systems 35, 25005–25017 (2022)
- [4] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
- [5] Bychkovsky, V., Paris, S., Chan, E., Durand, F.: Learning photographic global tonal adjustment with a database of input/output image pairs. In: CVPR 2011. pp. 97–104. IEEE (2011)
- [6] Cai, Y., Bian, H., Lin, J., Wang, H., Timofte, R., Zhang, Y.: Retinexformer: One-stage retinex-based transformer for low-light image enhancement. arXiv preprint arXiv:2303.06705 (2023)
- [7] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)
- [8] Chang, M., Li, Q., Feng, H., Xu, Z.: Spatial-adaptive network for single image denoising. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16. pp. 171–187. Springer (2020)
- [9] Chen, H., Wang, Y., Guo, T., Xu, C., Deng, Y., Liu, Z., Ma, S., Xu, C., Xu, C., Gao, W.: Pre-trained image processing transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12299–12310 (2021)
- [10] Chen, L., Chu, X., Zhang, X., Sun, J.: Simple baselines for image restoration. In: European Conference on Computer Vision. pp. 17–33. Springer (2022)
- [11] Chen, T., Saxena, S., Li, L., Lin, T.Y., Fleet, D.J., Hinton, G.E.: A unified sequence interface for vision tasks. Advances in Neural Information Processing Systems 35, 31333–31346 (2022)
- [12] Chen, W.T., Fang, H.Y., Ding, J.J., Tsai, C.C., Kuo, S.Y.: Jstasr: Joint size and transparency-aware snow removal algorithm based on modified partial convolution and veiling effect removal. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16. pp. 754–770. Springer (2020)
- [13] Chen, W.T., Fang, H.Y., Hsieh, C.L., Tsai, C.C., Chen, I., Ding, J.J., Kuo, S.Y., et al.: All snow removed: Single image desnowing algorithm using hierarchical dual-tree complex wavelet representation and contradict channel loss. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4196–4205 (2021)
- [14] Chen, W.T., Huang, Z.K., Tsai, C.C., Yang, H.H., Ding, J.J., Kuo, S.Y.: Learning multiple adverse weather removal via two-stage knowledge learning and multi-contrastive regularization: Toward a unified model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17653–17662 (2022)
- [15] Chen, X., He, K.: Exploring simple siamese representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15750–15758 (2021)
- [16] Cheng, S., Wang, Y., Huang, H., Liu, D., Fan, H., Liu, S.: Nbnet: Noise basis learning for image denoising with subspace projection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4896–4906 (2021)
- [17] Cui, Y., Ren, W., Yang, S., Cao, X., Knoll, A.: Irnext: Rethinking convolutional network design for image restoration (2023)
- [18] Cui, Y., Tao, Y., Bing, Z., Ren, W., Gao, X., Cao, X., Huang, K., Knoll, A.: Selective frequency network for image restoration. In: The Eleventh International Conference on Learning Representations (2022)
- [19] Cui, Y., Tao, Y., Ren, W., Knoll, A.: Dual-domain attention for image deblurring. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 479–487 (2023)
- [20] Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. In: Proceedings of the IEEE international conference on computer vision. pp. 764–773 (2017)
- [21] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. IEEE (2009)
- [22] Dong, J., Pan, J.: Physics-based feature dehazing networks. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16. pp. 188–204. Springer (2020)
- [23] Fan, D., Yue, T., Zhao, X., Chang, L.: Lir: Efficient degradation removal for lightweight image restoration. arXiv preprint arXiv:2402.01368 (2024)
- [24] Fan, D., Zhao, X., Chang, L.: Irconstyle: Image restoration framework using contrastive learning and style transfer. arXiv preprint arXiv:2402.15784 (2024)
- [25] Fu, H., Zheng, W., Meng, X., Wang, X., Wang, C., Ma, H.: You do not need additional priors or regularizers in retinex-based low-light image enhancement. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18125–18134 (2023)
- [26] Fu, X., Huang, J., Zeng, D., Huang, Y., Ding, X., Paisley, J.: Removing rain from single images via a deep detail network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3855–3863 (2017)
- [27] Fu, X., Huang, J., Zeng, D., Huang, Y., Ding, X., Paisley, J.: Removing rain from single images via a deep detail network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3855–3863 (2017)
- [28] Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems 33, 21271–21284 (2020)
- [29] Gu, J., Ma, X., Kong, X., Qiao, Y., Dong, C.: Networks are slacking off: Understanding generalization problem in image deraining. Advances in Neural Information Processing Systems 36 (2024)
- [30] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9729–9738 (2020)
- [31] Hendrycks, D., Dietterich, T.: Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261 (2019)
- [32] Huang, J.B., Singh, A., Ahuja, N.: Single image super-resolution from transformed self-exemplars. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5197–5206 (2015)
- [33] Jiang, J., Zhang, K., Timofte, R.: Towards flexible blind jpeg artifacts removal. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4997–5006 (2021)
- [34] Jiang, K., Wang, Z., Yi, P., Chen, C., Huang, B., Luo, Y., Ma, J., Jiang, J.: Multi-scale progressive fusion network for single image deraining. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8346–8355 (2020)
- [35] Kong, L., Dong, J., Ge, J., Li, M., Pan, J.: Efficient frequency domain-based transformers for high-quality image deblurring. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5886–5895 (2023)
- [36] Lee, J., Son, H., Rim, J., Cho, S., Lee, S.: Iterative filter adaptive network for single image defocus deblurring. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2034–2042 (2021)
- [37] Li, B., Ren, W., Fu, D., Tao, D., Feng, D., Zeng, W., Wang, Z.: Benchmarking single-image dehazing and beyond. IEEE Transactions on Image Processing 28(1), 492–505 (2018)
- [38] Li, B., Liu, X., Hu, P., Wu, Z., Lv, J., Peng, X.: All-in-one image restoration for unknown corruption. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17452–17462 (2022)
- [39] Li, F., Shen, L., Mi, Y., Li, Z.: Drcnet: Dynamic image restoration contrastive network. In: European Conference on Computer Vision. pp. 514–532. Springer (2022)
- [40] Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning. pp. 12888–12900. PMLR (2022)
- [41] Li, R., Tan, R.T., Cheong, L.F.: All in one bad weather removal using architectural search. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3175–3185 (2020)
- [42] Li, Y., Fan, Y., Xiang, X., Demandolx, D., Ranjan, R., Timofte, R., Van Gool, L.: Efficient and explicit modelling of image hierarchies for image restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18278–18289 (2023)
- [43] Li, Y., Tan, R.T., Guo, X., Lu, J., Brown, M.S.: Rain streak removal using layer priors. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2736–2744 (2016)
- [44] Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., Timofte, R.: Swinir: Image restoration using swin transformer. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1833–1844 (2021)
- [45] Liu, W., Shen, X., Pun, C.M., Cun, X.: Explicit visual prompting for low-level structure segmentations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19434–19445 (2023)
- [46] Liu, X., Ma, Y., Shi, Z., Chen, J.: Griddehazenet: Attention-based multi-scale network for image dehazing. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 7314–7323 (2019)
- [47] Liu, Y., Chen, X., Ma, X., Wang, X., Zhou, J., Qiao, Y., Dong, C.: Unifying image processing as visual prompting question answering. arXiv preprint arXiv:2310.10513 (2023)
- [48] Liu, Y.F., Jaw, D.W., Huang, S.C., Hwang, J.N.: Desnownet: Context-aware deep network for snow removal. IEEE Transactions on Image Processing 27(6), 3064–3073 (2018)
- [49] Luo, Z., Gustafsson, F.K., Zhao, Z., Sjölund, J., Schön, T.B.: Controlling vision-language models for universal image restoration. arXiv preprint arXiv:2310.01018 (2023)
- [50] Ma, J., Cheng, T., Wang, G., Zhang, Q., Wang, X., Zhang, L.: Prores: Exploring degradation-aware visual prompt for universal image restoration. arXiv preprint arXiv:2306.13653 (2023)
- [51] Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001. vol. 2, pp. 416–423. IEEE (2001)
- [52] Nah, S., Hyun Kim, T., Mu Lee, K.: Deep multi-scale convolutional neural network for dynamic scene deblurring. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3883–3891 (2017)
- [53] Park, D., Lee, B.H., Chun, S.Y.: All-in-one image restoration for unknown degradations using adaptive discriminative filters for specific degradations. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5815–5824. IEEE (2023)
- [54] Park, D., Lee, B.H., Chun, S.Y.: All-in-one image restoration for unknown degradations using adaptive discriminative filters for specific degradations. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5815–5824. IEEE (2023)
- [55] Potlapalli, V., Zamir, S.W., Khan, S., Khan, F.S.: Promptir: Prompting for all-in-one blind image restoration. arXiv preprint arXiv:2306.13090 (2023)
- [56] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
- [57] Ren, D., Zuo, W., Hu, Q., Zhu, P., Meng, D.: Progressive image deraining networks: A better and simpler baseline. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3937–3946 (2019)
- [58] Ren, W., Ma, L., Zhang, J., Pan, J., Cao, X., Liu, W., Yang, M.H.: Gated fusion network for single image dehazing. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3253–3261 (2018)
- [59] Rim, J., Lee, H., Won, J., Cho, S.: Real-world blur dataset for learning and benchmarking deblurring algorithms. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16. pp. 184–201. Springer (2020)
- [60] Ruan, L., Chen, B., Li, J., Lam, M.L.: Aifnet: All-in-focus image restoration network using a light field-based dataset. IEEE Transactions on Computational Imaging 7, 675–688 (2021)
- [61] Ruan, L., Chen, B., Li, J., Lam, M.: Learning to deblur using light field generated and real defocus images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16304–16313 (2022)
- [62] Sheikh, H.: Live image quality assessment database release 2. http://live.ece.utexas.edu/research/quality (2005)
- [63] Shen, Y., Fu, C., Chen, P., Zhang, M., Li, K., Sun, X., Wu, Y., Lin, S., Ji, R.: Aligning and prompting everything all at once for universal visual perception. arXiv preprint arXiv:2312.02153 (2023)
- [64] Shen, Z., Wang, W., Lu, X., Shen, J., Ling, H., Xu, T., Shao, L.: Human-aware motion deblurring. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5572–5581 (2019)
- [65] Tu, Z., Talebi, H., Zhang, H., Yang, F., Milanfar, P., Bovik, A., Li, Y.: Maxim: Multi-axis mlp for image processing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5769–5780 (2022)
- [66] Valanarasu, J.M.J., Yasarla, R., Patel, V.M.: Transweather: Transformer-based restoration of images degraded by adverse weather conditions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2353–2363 (2022)
- [67] Wang, X., Wang, W., Cao, Y., Shen, C., Huang, T.: Images speak in images: A generalist painter for in-context visual learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6830–6839 (2023)
- [68] Wang, X., Xie, L., Dong, C., Shan, Y.: Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1905–1914 (2021)
- [69] Wang, Y., Liu, Z., Liu, J., Xu, S., Liu, S.: Low-light image enhancement with illumination-aware gamma correction and complete image modelling network. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13128–13137 (2023)
- [70] Wang, Z., Cun, X., Bao, J., Zhou, W., Liu, J., Li, H.: Uformer: A general u-shaped transformer for image restoration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 17683–17693 (2022)
- [71] Wei, C., Wang, W., Yang, W., Liu, J.: Deep retinex decomposition for low-light enhancement. arXiv preprint arXiv:1808.04560 (2018)
- [72] Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3733–3742 (2018)
- [73] Xie, L., Wang, X., Dong, C., Qi, Z., Shan, Y.: Finding discriminative filters for specific degradations in blind super-resolution. Advances in Neural Information Processing Systems 34, 51–61 (2021)
- [74] Yang, D., Yamac, M.: Motion aware double attention network for dynamic scene deblurring. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1113–1123 (2022)
- [75] Yang, W., Tan, R.T., Feng, J., Liu, J., Guo, Z., Yan, S.: Deep joint rain detection and removal from a single image. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1357–1366 (2017)
- [76] Yang, W., Tan, R.T., Feng, J., Liu, J., Guo, Z., Yan, S.: Deep joint rain detection and removal from a single image. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1357–1366 (2017)
- [77] Yang, W., Wang, W., Huang, H., Wang, S., Liu, J.: Sparse gradient regularized deep retinex network for robust low-light image enhancement. IEEE Transactions on Image Processing 30, 2072–2086 (2021)
- [78] Yue, Z., Yong, H., Zhao, Q., Meng, D., Zhang, L.: Variational denoising network: Toward blind noise modeling and removal. Advances in neural information processing systems 32 (2019)
- [79] Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.H.: Restormer: Efficient transformer for high-resolution image restoration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5728–5739 (2022)
- [80] Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.H., Shao, L.: Multi-stage progressive image restoration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14821–14831 (2021)
- [81] Zhang, D., Wang, X.: Dynamic multi-scale network for dual-pixel images defocus deblurring with transformer. In: 2022 IEEE International Conference on Multimedia and Expo (ICME). pp. 1–6. IEEE (2022)
- [82] Zhang, H., Patel, V.M.: Density-aware single image de-raining using a multi-stream dense network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 695–704 (2018)
- [83] Zhang, H., Sindagi, V., Patel, V.M.: Image de-raining using a conditional generative adversarial network. IEEE transactions on circuits and systems for video technology 30(11), 3943–3956 (2019)
- [84] Zhang, J., Huang, J., Yao, M., Yang, Z., Yu, H., Zhou, M., Zhao, F.: Ingredient-oriented multi-degradation learning for image restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5825–5835 (2023)
- [85] Zhang, J., Huang, J., Yao, M., Yang, Z., Yu, H., Zhou, M., Zhao, F.: Ingredient-oriented multi-degradation learning for image restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5825–5835 (2023)
- [86] Zheng, B., Chen, Y., Tian, X., Zhou, F., Liu, X.: Implicit dual-domain convolutional network for robust color image compression artifact reduction. IEEE Transactions on Circuits and Systems for Video Technology 30(11), 3982–3994 (2019)
Supplementary Material: A Strong Prompter for All-in-One Image Restoration
Dongqi Fan Junhao Zhang Liang Chang
Appendix A All-in-One Image Restoration Result
Datasets | Original | ConStyle | Pre-train | Class | ConStyle v2 |
GoPro [52] | 5.91/0.0062 | 25.64/0.7912 | 27.94/0.8558 | 27.93/0.8554 | 28.03/0.8585 |
HIDE [64] | 5.99/0.0157 | 23.87/0.7560 | 26.08/0.8338 | 26.51/0.8372 | 26.53/0.8393 |
RealBlur-J [59] | 10.97/0.0591 | 18.49/0.5820 | 18.50/0.4209 | 19.79/0.6579 | 19.39/0.4664 |
RealBlur-R [59] | 8.56/0.3099 | 17.23/0.6388 | 19.88/0.4953 | 17.69/0.4213 | 23.23/0.7408 |
Rain100H [76] | 6.25/0.0032 | 18.41/0.5615 | 24.95/0.7818 | 25.23/0.7852 | 25.54/0.7949 |
Rain100L [76] | 6.25/0.0032 | 24.33/0.8136 | 22.04/0.8340 | 23.73/0.8503 | 24.49/0.8358 |
Test1200 [82] | 6.33/0.0169 | 27.72/0.8151 | 31.00/0.8847 | 31.30/0.8888 | 31.41/0.8926 |
Test2800 [27] | 6.29/0.0058 | 27.40/0.8353 | 30.93/0.9111 | 30.99/0.9119 | 31.06/0.9130 |
FiveK [5] | 6.64/0.0307 | 18.39/0.8026 | 24.50/0.9062 | 24.59/0.9058 | 24.46/0.9058 |
LoL v1 [71] | 6.38/0.0177 | 18.54/0.7231 | 21.35/0.8003 | 21.87/0.8015 | 21.75/0.8189 |
LoL v2 [77] | 5.57/0.0145 | 15.56/0.7042 | 24.06/0.9145 | 24.46/0.9165 | 24.10/0.9136 |
DPDD [1] | 7.73/0.0143 | 20.22/0.7037 | 15.00/0.6134 | 15.71/0.6104 | 14.38/0.5988 |
SOTS [37] | 5.23/0.0065 | 24.80/0.9326 | 28.64/0.9694 | 28.86/0.9700 | 29.86/0.9704 |
Datasets | Setting | Original | ConStyle | Pre-train | Class | ConStyle v2 |
CBSD68 [51] | 15 | 6.62/0.0044 | 31.99/0.8844 | 33.33/0.9100 | 33.35/0.9117 | 33.40/0.9130 |
CBSD68 [51] | 25 | 6.62/0.0044 | 29.33/0.8146 | 30.40/0.8517 | 30.39/0.8515 | 30.45/0.8519 |
CBSD68 [51] | 50 | 6.62/0.0044 | 25.31/0.6445 | 25.63/0.7035 | 25.77/0.7081 | 25.79/0.7098 |
Urban100 [32] | 15 | 5.72/0.218 | 31.15/0.8923 | 32.43/0.9107 | 32.77/0.9100 | 32.45/0.9106 |
Urban100 [32] | 25 | 5.72/0.218 | 28.49/0.8351 | 29.60/0.8602 | 29.86/0.8661 | 29.65/0.8619 |
Urban100 [32] | 50 | 5.72/0.218 | 24.44/0.6716 | 24.69/0.7206 | 24.97/0.7366 | 25.07/0.7416 |
LIVE1 [62] | 10 | 6.20/0.0020 | 25.72/0.7512 | 27.00/0.7839 | 27.05/0.7852 | 27.12/0.7887 |
LIVE1 [62] | 20 | 6.20/0.0020 | 28.09/0.8329 | 29.37/0.8558 | 29.42/0.8570 | 29.47/0.8582 |
LIVE1 [62] | 30 | 6.20/0.0020 | 29.29/0.8674 | 30.67/0.8863 | 30.73/0.8874 | 30.77/0.8882 |
LIVE1 [62] | 40 | 6.20/0.0020 | 30.04/0.8866 | 31.58/0.9034 | 31.63/0.9044 | 31.67/0.9050 |
Snow100K [48] | S | 6.43/0.0203 | 29.12/0.8778 | 33.90/0.9418 | 34.04/0.9430 | 34.23/0.9443 |
Snow100K [48] | M | 6.44/0.0203 | 26.17/0.8549 | 32.17/0.9300 | 32.36/0.9316 | 32.49/0.9330 |
Snow100K [48] | L | 6.45/0.0214 | 23.57/0.7808 | 28.28/0.8761 | 28.46/0.8793 | 28.50/0.8809 |
CSD [13] | - | 5.43/0.0078 | 16.03/0.7596 | 16.00/0.7404 | 15.35/0.7595 | 16.13/0.7625 |
Datasets | Original | ConStyle | Pre-train | Class | ConStyle v2 |
GoPro [52] | 27.52/0.8461 | 27.81/0.5839 | 27.81/0.8536 | 27.80/0.8530 | 27.82/0.8540 |
HIDE [64] | 25.01/0.8173 | 25.55/0.8293 | 25.27/0.8228 | 25.30/0.8125 | 25.38/0.8243 |
RealBlur-J [59] | 23.11/0.7418 | 24.32/0.7688 | 23.67/0.7435 | 25.17/0.7918 | 23.64/0.7419 |
RealBlur-R [59] | 29.18/0.7915 | 27.72/0.7397 | 28.63/0.7596 | 27.49/0.7281 | 27.34/0.7272 |
Rain100H [76] | 24.47/0.7666 | 21.93/0.7943 | 24.53/0.7437 | 24.72/0.7748 | 25.00/0.7777 |
Rain100L [76] | 21.93/0.7943 | 22.03/0.8003 | 21.85/0.8110 | 21.09/0.7958 | 22.12/0.8111 |
Test1200 [82] | 30.68/0.8820 | 29.38/0.8631 | 29.40/0.8793 | 29.41/0.8608 | 29.62/0.8669 |
Test2800 [27] | 30.70/0.9020 | 30.74/0.9087 | 30.97/0.9017 | 30.73/0.9087 | 30.75/0.9089 |
FiveK [5] | 24.43/0.9022 | 24.41/0.9028 | 24.42/0.9030 | 24.45/0.9018 | 24.47/0.9038 |
LoL v1 [71] | 21.73/0.7878 | 21.89/0.7804 | 21.74/0.8022 | 21.96/0.7993 | 22.05/0.7995 |
LoL v2 [77] | 24.08/0.9087 | 23.77/0.9085 | 24.08/0.9094 | 22.05/0.7995 | 24.47/0.9128 |
DPDD [1] | 13.34/0.5684 | 14.09/0.5860 | 14.11/0.5689 | 14.22/0.5817 | 14.25/0.5961 |
SOTS [37] | 29.01/0.9452 | 29.18/0.9650 | 28.51/0.9213 | 29.13/0.9646 | 29.19/0.9651 |
Datasets | Setting | Original | ConStyle | Pre-train | Class | ConStyle v2 |
CBSD68 [51] | 15 | 33.00/0.9004 | 33.20/0.9098 | 33.19/0.9005 | 33.20/0.9011 | 33.22/0.9099 |
CBSD68 [51] | 25 | 30.19/0.8417 | 29.94/0.8416 | 30.00/0.8357 | 30.22/0.8427 | 30.27/0.8440 |
CBSD68 [51] | 50 | 25.13/0.6917 | 22.64/0.6433 | 25.34/0.6963 | 25.24/0.6994 | 25.41/0.7058 |
Urban100 [32] | 15 | 32.81/0.9226 | 29.77/0.8670 | 32.90/0.9242 | 32.83/0.9241 | 32.82/0.9237 |
Urban100 [32] | 25 | 29.83/0.8718 | 25.20/0.7623 | 28.90/0.8655 | 29.91/0.8762 | 29.90/0.8761 |
Urban100 [32] | 50 | 24.28/0.7247 | 16.17/0.4767 | 24.38/0.7425 | 24.36/0.7459 | 24.49/0.7496 |
LIVE1 [62] | 10 | 26.94/0.7828 | 26.96/0.7824 | 26.85/0.7802 | 26.96/0.7829 | 26.99/0.7842 |
LIVE1 [62] | 20 | 29.28/0.8544 | 29.29/0.8539 | 29.28/0.8526 | 29.30/0.8545 | 29.31/0.8547 |
LIVE1 [62] | 30 | 30.58/0.8840 | 30.59/0.8850 | 30.49/0.8833 | 30.60/0.8851 | 30.60/0.8850 |
LIVE1 [62] | 40 | 31.48/0.9023 | 31.50/0.9027 | 31.54/0.9028 | 31.49/0.9025 | 31.50/0.9026 |
Snow100K [48] | S | 33.41/0.9399 | 33.45/0.9399 | 33.40/0.9323 | 33.38/0.9397 | 33.51/0.9404 |
Snow100K [48] | M | 31.77/0.9276 | 31.77/0.9275 | 31.65/0.9206 | 31.72/0.9272 | 31.79/0.9280 |
Snow100K [48] | L | 27.60/0.8710 | 27.64/0.8703 | 28.30/0.8771 | 27.59/0.8697 | 27.68/0.8711 |
CSD [13] | - | 16.49/0.7493 | 16.83/0.7545 | 16.69/0.7606 | 17.69/0.7677 | 17.29/0.7621 |
Datasets | Original | ConStyle | Pre-train | Class | ConStyle v2 |
GoPro [52] | 26.99/0.8294 | 27.35/0.8319 | 27.28/0.8379 | 27.35/0.8391 | 27.37/0.8401 |
HIDE [64] | 23.92/0.7764 | 25.13/0.8079 | 24.28/0.7924 | 25.54/0.8107 | 25.35/0.8096 |
RealBlur-J [59] | 24.05/0.7673 | 24.37/0.7574 | 23.33/0.7278 | 24.74/0.7669 | 25.05/0.7751 |
RealBlur-R [59] | 23.11/0.7305 | 24.94/0.6566 | 22.22/0.5779 | 25.83/0.6676 | 27.65/0.7341 |
Rain100H [76] | 21.58/0.7020 | 22.16/0.7777 | 22.84/0.7544 | 22.82/0.7503 | 23.75/0.7603 |
Rain100L [76] | 21.58/0.8020 | 21.81/0.7961 | 20.74/0.7860 | 19.90/0.7693 | 22.64/0.8076 |
Test1200 [82] | 30.52/0.8826 | 29.04/0.8783 | 30.81/0.8839 | 30.48/0.8815 | 30.76/0.8829 |
Test2800 [27] | 30.50/0.9060 | 28.34/0.8675 | 30.70/0.8999 | 30.83/0.9109 | 30.55/0.9075 |
FiveK [5] | 23.94/0.8953 | 24.10/0.8907 | 24.16/0.8990 | 24.15/0.9008 | 24.17/0.8993 |
LoL v1 [71] | 20.99/0.8000 | 20.97/0.8119 | 20.88/0.8027 | 21.03/0.8078 | 21.59/0.8089 |
LoL v2 [77] | 23.26/0.9036 | 24.09/0.9111 | 24.13/0.9055 | 24.44/0.9122 | 24.40/0.9112 |
DPDD [1] | 14.71/0.5969 | 15.09/0.6082 | 14.84/0.5982 | 15.47/0.6079 | 14.81/0.5989 |
SOTS [37] | 28.55/0.9586 | 29.36/0.9651 | 29.34/0.9660 | 29.24/0.9600 | 29.47/0.9679 |
Datasets | Setting | Original | ConStyle | Pre-train | Class | ConStyle v2 |
CBSD68 [51] | 15 | 33.11/0.9094 | 33.20/0.9111 | 33.24/0.9113 | 33.29/0.9128 | 33.22/0.9113 |
CBSD68 [51] | 25 | 30.13/0.8462 | 30.27/0.8469 | 30.25/0.8470 | 30.33/0.8486 | 30.27/0.8477 |
CBSD68 [51] | 50 | 25.43/0.7102 | 25.50/0.7172 | 25.53/0.7195 | 25.67/0.7228 | 25.63/0.7132 |
Urban100 [32] | 15 | 32.47/0.9190 | 32.11/0.9186 | 32.66/0.9202 | 32.71/0.9187 | 32.77/0.9238 |
Urban100 [32] | 25 | 29.50/0.8693 | 29.04/0.8672 | 29.83/0.8689 | 29.86/0.8770 | 29.86/0.8777 |
Urban100 [32] | 50 | 24.54/0.7499 | 24.07/0.7456 | 25.01/0.7689 | 25.07/0.7718 | 24.94/0.7629 |
LIVE1 [62] | 10 | 26.82/0.7824 | 24.93/0.7519 | 26.82/0.7841 | 26.89/0.7830 | 26.90/0.7854 |
LIVE1 [62] | 20 | 29.06/0.8504 | 27.09/0.8345 | 29.24/0.8462 | 29.18/0.8506 | 29.27/0.8564 |
LIVE1 [62] | 30 | 30.19/0.8778 | 28.40/0.8683 | 30.45/0.8866 | 30.68/0.8876 | 30.59/0.8868 |
LIVE1 [62] | 40 | 30.98/0.8941 | 28.73/0.8847 | 31.56/0.9038 | 31.58/0.9004 | 31.37/0.9012 |
Snow100K [48] | S | 33.05/0.9380 | 33.48/0.9324 | 33.38/0.9408 | 33.53/0.9406 | 33.54/0.9420 |
Snow100K [48] | M | 31.46/0.9261 | 31.86/0.9209 | 31.78/0.9292 | 31.81/0.9211 | 31.87/0.9303 |
Snow100K [48] | L | 27.78/0.8726 | 28.01/0.8760 | 27.95/0.8755 | 28.16/0.8791 | 28.03/0.8773 |
CSD [13] | - | 13.60/0.7048 | 13.93/0.7087 | 13.48/0.7335 | 14.70/0.7344 | 14.27/0.7147 |
Datasets | Original | ConStyle | Pre-train | Class | ConStyle v2 |
GoPro [52] | 21.56/0.6900 | 24.61/0.7275 | 25.67/0.7922 | 26.35/0.8108 | 26.54/0.8121 |
HIDE [64] | 20.04/0.6523 | 23.18/0.7001 | 21.84/0.7303 | 23.17/0.7019 | 23.27/0.7671 |
RealBlur-J [59] | 23.55/0.6025 | 25.22/0.6936 | 25.30/0.7888 | 23.66/0.7621 | 26.10/0.7979 |
RealBlur-R [59] | 30.01/0.8375 | 32.02/0.8901 | 30.32/0.8874 | 30.80/0.8889 | 32.60/0.9130 |
Rain100H [76] | 10.11/0.3799 | 12.32/0.3374 | 15.30/0.5284 | 11.96/0.4334 | 12.40/0.4236 |
Rain100L [76] | 17.64/0.6086 | 23.77/0.7262 | 18.89/0.7502 | 18.11/0.7265 | 18.95/0.7312 |
Test1200 [82] | 25.07/0.7105 | 21.27/0.6324 | 23.52/0.7902 | 28.41/0.8499 | 27.98/0.8479 |
Test2800 [27] | 25.40/0.7675 | 21.88/0.9619 | 23.44/0.8105 | 29.65/0.8964 | 29.27/0.8897 |
FiveK [5] | 17.19/0.7864 | 17.33/0.7005 | 17.72/0.7805 | 18.94/0.8032 | 19.50/0.8337 |
LoL v1 [71] | 7.77/0.1935 | 7.76/0.1887 | 8.32/0.2434 | 9.43/0.3634 | 7.96/0.2134 |
LoL v2 [77] | 11.05/0.4467 | 10.93/0.4086 | 13.02/0.5814 | 13.03/0.6542 | 13.21/0.6206 |
DPDD [1] | 20.32/0.5338 | 21.29/0.6129 | 19.88/0.6912 | 19.82/0.6825 | 21.92/0.7143 |
SOTS [37] | 22.92/0.9063 | 16.25/0.7639 | 22.22/0.9065 | 21.49/0.9160 | 24.21/0.9406 |
Datasets | Setting | Original | ConStyle | Pre-train | Class | ConStyle v2 |
CBSD68 [51] | 15 | 24.82/0.5730 | 24.56/0.5976 | 19.22/0.6107 | 32.92/0.9059 | 33.09/0.9047 |
CBSD68 [51] | 25 | 20.54/0.3950 | 21.27/0.4416 | 17.22/0.4439 | 28.33/0.8345 | 30.12/0.8354 |
CBSD68 [51] | 50 | 15.01/0.1999 | 16.16/0.2436 | 15.77/0.2747 | 21.74/0.6674 | 25.16/0.6830 |
Urban100 [32] | 15 | 24.85/0.6123 | 23.25/0.6137 | 11.72/0.4582 | 29.34/0.8898 | 32.38/0.9150 |
Urban100 [32] | 25 | 20.69/0.4569 | 20.48/0.4843 | 12.96/0.3982 | 25.31/0.8214 | 29.48/0.8608 |
Urban100 [32] | 50 | 15.18/0.2676 | 15.84/0.3055 | 13.97/0.2979 | 18.43/0.6606 | 24.40/0.7230 |
LIVE1 [62] | 10 | 24.95/0.7396 | 23.59/0.6604 | 25.06/0.7449 | 26.93/0.7836 | 26.80/0.7787 |
LIVE1 [62] | 20 | 26.53/0.8191 | 25.03/0.7324 | 26.40/0.8170 | 29.30/0.8555 | 29.22/0.8526 |
LIVE1 [62] | 30 | 27.32/0.8542 | 25.69/0.7619 | 26.40/0.8408 | 30.61/0.8864 | 30.54/0.8843 |
LIVE1 [62] | 40 | 27.83/0.8745 | 26.12/0.7793 | 26.40/0.8485 | 31.50/0.9035 | 31.40/0.9017 |
Snow100K [48] | S | 26.70/0.8268 | 23.46/0.7341 | 24.72/0.8582 | 31.55/0.9129 | 31.85/0.9243 |
Snow100K [48] | M | 24.79/0.7978 | 21.97/0.7103 | 24.34/0.8427 | 31.14/0.9200 | 30.43/0.9106 |
Snow100K [48] | L | 22.22/0.6906 | 18.56/0.6144 | 23.09/0.7833 | 27.15/0.8589 | 26.12/0.8456 |
CSD [13] | - | 15.78/0.7259 | 14.56/0.6666 | 15.66/0.7367 | 13.66/0.6878 | 16.63/0.7515 |