

ConStyle v2: A Strong Prompter for All-in-One Image Restoration

Dongqi Fan, Junhao Zhang, Liang Chang
University of Electronic Science and Technology of China, Chengdu, CN
email: {dongqifan, junhaozhang}@std.uestc.edu.cn
email: [email protected]

Abstract

This paper introduces ConStyle v2, a strong plug-and-play prompter designed to output clean visual prompts and assist U-Net Image Restoration models in handling multiple degradations. The joint training process of IRConStyle, an Image Restoration framework consisting of ConStyle and a general restoration network, is divided into two stages: first, ConStyle is pre-trained alone; then its weights are frozen to guide the training of the general restoration network. Three improvements are proposed for the pre-training stage: unsupervised pre-training, adding a pretext task (i.e., classification), and adopting knowledge distillation. Without bells and whistles, we obtain ConStyle v2, a strong prompter for all-in-one Image Restoration, in less than two GPU days, and it does not require any fine-tuning. Extensive experiments on Restormer (transformer-based), NAFNet (CNN-based), MAXIM-1S (MLP-based), and a vanilla CNN network demonstrate that ConStyle v2 can turn any U-Net style Image Restoration model into an all-in-one Image Restoration model. Furthermore, models guided by the well-trained ConStyle v2 exhibit superior performance on some specific degradations compared to ConStyle. The code is available at: https://github.com/Dongqi-Fan/ConStyle_v2

Keywords:

Image Restoration · Visual Prompting · All-in-One · Pre-training

1 Introduction

Image Restoration (IR) is a fundamental task in the computer vision community, which aims to reconstruct a high-quality image from a degraded one. Recent advancements in deep learning have shown promising results in specific IR tasks such as denoising [16, 8, 78], dehazing [46, 58, 22], deraining [29, 57, 34], desnowing [13, 12, 48], motion deblurring [35, 19, 74], defocus deblurring [61, 81, 36], low-light enhancement [69, 6, 25], and JPEG artifact removal/correction [33, 86]. However, these models are limited to addressing only one specific type of degradation. To tackle this issue, researchers have focused on developing models capable of handling multiple degradations [80, 17, 70, 42]. Yet, these models require retraining for each type of degradation; that is, each set of weights is tailored to a single type of degradation. Such approaches are impractical, as multiple degradations often coexist in real-world scenarios. For instance, rainy days are often associated with haze and reduced lighting.

Figure 1: Different ways to handle multiple degradations. (a) The priors are obtained by setting up as many sub-networks as there are degradations. (b) A degradation-clean example pair must be provided in both the training and inference stages. (c) ConStyle v2 adaptively outputs clean visual prompts according to different degradations to guide the training of the general restoration network.

All-in-one Image Restoration methods [84, 53, 66, 14] use a single set of weights to address multiple types of degradation. However, these all-in-one models often have a large number of parameters and require heavy computation, leading to time-consuming and inefficient training. For instance, Chen et al. [14] (Fig. 1 (a)) adjust the number of teacher networks based on the number of degradations, so the more teacher networks there are, the more complex the training process becomes; Li et al. [41] (Fig. 1 (a)) also use as many sub-networks as there are degradations, and their backbone is obtained through neural architecture search (NAS); PromptGIP [47] (Fig. 1 (b)) leverages the idea of visual prompting to help the model handle multiple degradations, but a degradation-clean sample pair must be provided in both the training and inference stages, and training requires 8 V100 GPUs. These methods are inefficient because they lack prior knowledge about the degradations in the input image. In other words, the prior information about the degradation type needs to be obtained first and then passed to subsequent sub-networks. In contrast, thanks to ConStyle v2, the general restoration network (Fig. 1 (c)) does not need specific prior knowledge about degradations, but only a clean visual prompt (clean prior).

In this paper, we introduce a strong prompter for all-in-one Image Restoration: ConStyle v2 (Fig. 2 and Fig. 3). The key to our work is ensuring that ConStyle v2 generates clean visual prompts, thus mitigating the issue of model collapse and guiding the training of general restoration networks. Model collapse typically arises when a model struggles to handle multiple degradations simultaneously. To address this challenge, we propose three simple yet effective improvements to ConStyle: unsupervised pre-training, leveraging a pretext task to enhance semantic information extraction, and employing knowledge distillation to further enhance this capacity. ConStyle v2 consists only of convolution, linear, and BN layers, without any complex operators (Fig. 3 (c)), so training takes less than two days on a V100 GPU and an Intel Xeon Silver 4216 CPU. Once trained, ConStyle v2 can be seamlessly integrated into any U-Net model to facilitate its training, without fine-tuning. Additionally, given the lack of datasets encompassing multiple degradations, we collect and produce a Mix Degradations dataset, which includes noise, motion blur, defocus blur, rain, snow, low-light, JPEG artifact, and haze, to cater to training requirements. To verify our method, we apply ConStyle v2 to three state-of-the-art IR models (Restormer [79], MAXIM-1S [65], NAFNet [10]) and a non-IR U-Net model consisting of vanilla convolutions. Fig. 4 shows the detailed architectures of the Original models and the ConStyle/ConStyle v2 models. Experimental results on 27 benchmarks demonstrate that ConStyle v2 is a powerful plug-and-play prompter for all-in-one image restoration and exhibits superior performance on specific degradations compared to ConStyle [24].

Our contributions can be summarized as follows:

• Three simple yet effective enhancements are proposed to train ConStyle v2, and training takes less than two GPU days.
• We propose a Mix Degradations dataset, which includes noise, motion blur, defocus blur, rain, snow, low-light, JPEG artifact, and haze, to cater to training needs.
• We propose a strong plug-and-play prompter for all-in-one and specific image restoration, in which the model collapse issue is avoided.

2 Related Work

2.1 All-in-One Image Restoration

While numerous works [80, 17, 70, 42, 79, 65, 10, 9, 18, 39, 23] excel in various Image Restoration tasks, they are typically limited to addressing a single type of degradation with a specific set of weights. To solve this problem, all-in-one Image Restoration (IR) methods [84, 53, 66, 14, 49, 47, 54, 85, 38, 50, 55] have been developed. These methods aim to enable models to effectively handle multiple degradations simultaneously. For example, AirNet [38] leverages MoCo [30] and Deformable Convolution [20] to transform degradation priors obtained from the former into convolution kernels in the latter, enabling dynamic degradation removal; DA-CLIP [49] builds upon the architecture of CLIP [56], using BLIP [40] to generate synthetic captions for high-quality images, after which low-quality images are matched with captions and the corresponding degradation types to form image-text-degradation pairs; ADMS [54] introduces a Filter Attribution method based on FAIG [73] to identify the contributions of individual filters to removing specific degradations, while IDR [85] proposes a learnable Principal Component Analysis and treats various IR tasks as a form of multi-task learning to acquire priors. Different from the above methods, we aim to design a plug-and-play module that can transform a non-all-in-one model into an all-in-one model.

2.2 Visual Prompting

In the field of Natural Language Processing (NLP), Prompting Learning refers to providing task-specific instructions or in-context information to a model without the need for retraining. This approach has shown promising results in NLP, for example in GPT-3 [4]. Drawing inspiration from Prompting Learning in NLP, many visual prompting works have recently appeared in IR [47, 50, 55, 11, 45, 3, 67, 63]. For example, ProRes [50] adds a target visual prompt to an input image to create a "prompted image". This prompted image is then flattened into patches; for new tasks or datasets, the weights of ProRes are frozen and learnable prompts are randomly initialized. PromptIR [55] introduces a Prompt Block in the decoder stage of the U-Net architecture. This block takes prompt components and the output of the previous transformer block as inputs, and its output is fed into the next transformer block. PromptGIP [47] proposes a training method akin to masked autoencoding, where certain portions of question images and answer images are randomly masked to prompt the model to reconstruct these patches from the unmasked areas. During inference, input-output pairs are assembled as task prompts to realize image restoration. Our approach also leverages visual prompting, but in a more efficient manner: unlike the above methods, it does not need to distinguish between different degradations, and it simply provides a clean visual prompt to other models.

3 Method

The training diagram of ConStyle v2 is depicted in Fig. 2, while the distinctions between ConStyle and ConStyle v2 are illustrated in Fig. 3. The same as ConStyle, ConStyle v2 only retains the Encoder part when the pre-training is complete. In this section, we first provide a brief overview of ConStyle (Sec. 3.1), followed by showing problems encountered with ConStyle in multiple degradations (Sec. 3.3), and finally, we illustrate the improvements made from ConStyle to ConStyle v2 in three steps (Sec. 3.4). In addition, the Mix Degradations dataset is described in Sec. 3.2.

3.1 Review of ConStyle

IRConStyle [24] is a versatile and robust IR framework consisting of the ConStyle and a general restoration network. ConStyle includes several convolutional layers and one MLP layer, which is responsible for extracting latent features (the latent code and intermediate feature map) and then passing them to the general restoration network. The general restoration network follows an abstract U-Net style architecture, allowing for the instantiation of any IR U-Net model. The training stage (Eq. 1) and inference (Eq. 2) process of IRConStyle [24] can be described as follows:

$$I_{restored}=G(E(I_{degraded},I_{clean}),I_{degraded}) \tag{1}$$

$$I_{restored}=G(E(I_{degraded}),I_{degraded}) \tag{2}$$

Here, $G$ stands for the general restoration network, $E$ for ConStyle, $I_{degraded}$ for the input degraded image, $I_{clean}$ for the corresponding clean image (available only during training), and $I_{restored}$ for the output restored image. Based on the contrastive learning framework MoCo [30], ConStyle integrates the idea of style transfer and replaces the pretext task, Instance Discrimination [72], with one more suitable for IR. The total loss function for IRConStyle is as follows:

$$L_{total}=L_{style}+L_{content}+L_{InfoNCE}+L_{1} \tag{3}$$

The calculation of $L_{style}$ , $L_{content}$ , and $L_{InfoNCE}$ is performed in ConStyle, while $L_{1}$ is performed in general restoration network. Under the supervision of $L_{style}$ , $L_{content}$ , and $L_{InfoNCE}$ , latent features move closer to the clean space and further away from the degradation space. Since ConStyle can adaptively output clean latent features according to input degraded images, it is natural for us to believe that ConStyle should be able to turn the general restoration network into an all-in-one model. However, the experiment results on ConStyle models are not as expected.
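To make the role of $L_{InfoNCE}$ concrete, the following is a minimal sketch of the generic MoCo-style InfoNCE loss for a single query vector. It is not the implementation from [24]: the temperature value, the function name `info_nce`, and the reading that the positive key comes from a clean image while the negatives come from degraded images are assumptions made for illustration only.

```python
import math


def info_nce(query, positive, negatives, temperature=0.07):
    """Generic InfoNCE loss for one query (MoCo-style form, assumed here).

    query, positive: feature vectors (lists of floats), ideally L2-normalized.
    negatives: list of feature vectors the query should be pushed away from.
    temperature: softmax temperature (0.07 is a common default, an assumption).
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Similarity logits: the positive pair first, then every negative pair.
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]

    # Numerically stable log-sum-exp over all logits.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))

    # Cross-entropy with the positive at index 0.
    return log_sum - logits[0]


if __name__ == "__main__":
    q = [1.0, 0.0]
    print(info_nce(q, positive=[1.0, 0.0], negatives=[[0.0, 1.0]]))  # small loss
    print(info_nce(q, positive=[0.0, 1.0], negatives=[[1.0, 0.0]]))  # large loss
```

The loss is small when the query is aligned with the positive key and large when it is aligned with a negative key, which is the behaviour described above as moving latent features toward the clean space and away from the degradation space.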

3.2 Mix Degradations Datasets

We need a training dataset that includes noise, motion blur, defocus blur, rain, snow, low light, JPEG artifact, and haze, but existing training datasets do not meet this need. Therefore, we propose a dataset, namely the Mix Degradations datasets, consisting of image pairs covering all of the aforementioned degradations. Details on the Mix Degradations dataset can be found in Tab. 1. The images with noise and JPEG artifacts are generated using the same established methods as [79, 65, 10] and [44, 33, 86], respectively. Note that in the OTS dataset, haze images with intensities of 0.04 and 0.06 are manually removed because they appear too clear to the human eye. In addition, the deraining dataset, which includes Rain14000 [26], Rain1800 [75], Rain800 [83], and Rain12 [43], initially contained 13,712 images, but two erroneous pictures were identified and removed. After a unified cropping process, the Mix Degradations dataset has 621,573 images, all of size 256 × 256. The Mix Degradations datasets, the uncropped joint datasets mentioned in Tab. 1 (totaling 46,301 images), and the data preparation file are all available on our GitHub link.

Table 1: Details about Mix Degradations datasets.

Task	Motion Blurring	Defocus Blurring	Dehazing	Low-light Enhancement
Source	GoPro [52]	LFDOF [60]	OTS [37]	LoL v1 [71], LoL v2 [77], and FiveK [5]
Used Number	2,103	5,606	12,500	5,885
Crop Size	256	256	256	256
Step Size	150	220	240	100
Final Number	84,120	84,090	82,425	51,891
Task	JPEG Artifact Removal	Denoising	Desnowing	Deraining
Source	DIV2K [2]	DIV2K [2]	Snow100K [48]	Rain14000 [26], Rain1800 [75], Rain800 [83], and Rain12 [43]
Used Number	800	800	7,000	13,710
Crop Size	256	256	256	256
Step Size	165	165	160	240
Final Number	77,904	77,904	82,432	80,807
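The relation between Used Number, Crop Size, Step Size, and Final Number in Tab. 1 can be illustrated with a sliding-window crop counter. This is a sketch under stated assumptions rather than the released data preparation file: the function names are hypothetical, and the rule of adding one extra border-aligned window when the regular grid does not reach the image edge is an assumption.

```python
def crop_origins(length, crop_size, step):
    """Start offsets of sliding windows along one image axis.

    A final window aligned with the border is appended when the regular
    grid leaves part of the axis uncovered (assumed rule, see lead-in).
    """
    if length < crop_size:
        return []
    origins = list(range(0, length - crop_size + 1, step))
    if origins[-1] + crop_size < length:
        origins.append(length - crop_size)
    return origins


def count_crops(height, width, crop_size=256, step=150):
    """Number of crop_size x crop_size patches extracted from one image."""
    rows = crop_origins(height, crop_size, step)
    cols = crop_origins(width, crop_size, step)
    return len(rows) * len(cols)


if __name__ == "__main__":
    # A 1280 x 720 frame with the GoPro settings of Tab. 1 (crop 256, step 150).
    per_image = count_crops(720, 1280, crop_size=256, step=150)
    print(per_image, per_image * 2103)
```

Assuming 1280 × 720 GoPro frames, this rule gives 40 patches per image, and 2,103 × 40 = 84,120 coincides with the Final Number reported for motion blurring in Tab. 1. Datasets with varying image sizes, such as DIV2K, have no single per-image count.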

3.3 ConStyle on Mix Degradations Datasets

To evaluate whether ConStyle [24] can directly convert U-Net models to all-in-one models, we conduct experiments using Original models (Restormer [79], NAFNet [10], MAXIM-1S [65]) and ConStyle models (ConStyle Restormer, ConStyle NAFNet, ConStyle MAXIM-1S) on the Mix Degradations datasets. These models are tested on GoPro [52] (motion blurring), RealDOF [60] (defocus blurring), LoL v1 [71] (low-light enhancement), SOTS outdoors [37] (dehazing), LIVE1 [62] (JPEG artifact removal), CSD [13] (desnowing), CBSD68 [51] (denoising), and Rain100H [76] (deraining). In addition, we introduce two more models for comparison: Original Conv and ConStyle Conv. The Original Conv is a vanilla U-Net convolution model, while the ConStyle Conv incorporates ConStyle into the Original Conv. These two additional models are used to verify the generality of ConStyle and ConStyle v2. Note that, to expedite evaluation during training, only a subset of each test dataset is used. For example, only 24 images are selected from the 1,111 images in GoPro's test dataset. Using the full test datasets for all 8 tasks would significantly increase training time, as inference with Restormer on GoPro's test dataset alone takes 40 minutes on a V100 GPU.

The results in Fig. 5 demonstrate that, under multiple degradation settings, ConStyle Conv not only fails to surpass the Original Conv but also remains at a consistently low performance level after 80K iterations. While ConStyle Restormer performs better than Restormer, it suffers from model collapse just as Restormer does. In the following section, we illustrate step by step how ConStyle is improved into ConStyle v2, using Restormer and Original Conv.

3.4 ConStyle v2

3.4.1 Unsupervised Pre-training

We find that even for specific IR tasks, ConStyle models exhibit varying degrees of model collapse. For example, in the dehazing task, the performance of ConStyle NAFNet, ConStyle MAXIM-1S, and ConStyle Restormer declines significantly after 250K, 30K, and 10K iterations, respectively. Interestingly, even within the same model, such as ConStyle MAXIM-1S, the onset of model collapse differs across the denoising, deraining, and deblurring tasks, occurring at 10K, 100K, and 200K iterations, respectively. To address this challenge, IRConStyle [24] stops ConStyle updates early. Here, we aim to solve this problem more elegantly. Since the problem arises in joint training, it is natural to split the joint training process into two stages. Specifically, ConStyle is pre-trained independently, after which its weights are fixed and it is integrated with other IR models for guided training. For the pre-training stage, we leverage the generation techniques of Real-ESRGAN [68] and ImageNet-C [31] in the Degradation Process (Fig. 2) for unsupervised training on ImageNet-1K [21].

Since our goal is to train ConStyle v2 to be a powerful prompter that can produce a clean visual prompt for different degradations, we use the method in ImageNet-C [31] to generate motion blur, snow, and low contrast, and the two-stage degradation method in Real-ESRGAN [68] to generate Gaussian blur, noise, and JPEG artifacts. For each batch of images, 40% of the images are randomly selected to receive motion blur, snow, and contrast changes, while the remaining 60% receive Gaussian blur, noise, and JPEG artifacts. For the detailed training settings of the pre-training stage, please see Sec. 4. The pre-trained ConStyle, with its weights fixed, is incorporated into the general restoration network for training on the Mix Degradations datasets. Here, we name the models of this stage Pre-train models. The results of Pre-train Restormer and Pre-train Conv can be seen in Fig. 6 (a) and (e).
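The per-batch 40%/60% assignment described above can be sketched as follows. Only the index split is shown; the degradation pipelines themselves (ImageNet-C and Real-ESRGAN) are not reproduced. The function name `split_batch`, the use of rounding for non-integer counts, and the seed parameter are assumptions for illustration.

```python
import random


def split_batch(batch_size, imagenet_c_fraction=0.4, seed=None):
    """Randomly assign each image of a batch to one of two degradation groups.

    Returns (imagenet_c_indices, real_esrgan_indices):
      - the first group would receive motion blur, snow, and contrast changes;
      - the second would receive Gaussian blur, noise, and JPEG artifacts.
    Rounding the 40% share to the nearest integer is an assumption.
    """
    rng = random.Random(seed)
    indices = list(range(batch_size))
    rng.shuffle(indices)
    n_imagenet_c = round(batch_size * imagenet_c_fraction)
    return sorted(indices[:n_imagenet_c]), sorted(indices[n_imagenet_c:])


if __name__ == "__main__":
    # Batch size 32 is the pre-training batch size given in Sec. 4.1.
    group_c, group_esrgan = split_batch(32, seed=0)
    print(len(group_c), len(group_esrgan))
```

With a batch of 32 images, this rule places 13 images in the ImageNet-C-style group and 19 in the Real-ESRGAN-style group; a different rounding convention would shift one image between the groups.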

3.4.2 Pretext Task

After the pre-training step, ConStyle Restormer shows a significant performance improvement and no longer suffers from model collapse. Conversely, ConStyle Conv continues to face unstable training and limited performance gains. We attribute this to heavy degradation, which leads to a loss of semantic information (Fig. 7) in the original image. As a result, models with weak semantic extraction ability, such as ConStyle Conv, also restore images poorly, because image restoration involves pixel-wise operations and requires a comprehensive understanding of the entire image. We therefore introduce a pretext task (classification) to enhance the semantic information extraction capability of ConStyle, and in turn that of ConStyle models. Specifically, we add Classifier and Softmax layers after the Encoder and leverage the labels of ImageNet (Fig. 2). Here, we name the models of this stage Class models. As shown in Fig. 6 (b) and (f), adding the pretext task has little influence on ConStyle Restormer, since the Transformer model already has strong semantic extraction abilities. In contrast, for ConStyle Conv, the pretext task stabilizes training and significantly improves performance.

3.4.3 Knowledge Distillation

ConStyle has been significantly improved through pre-training and the addition of a pretext task, enabling it to generate clean visual prompts from degraded image input. To further boost ConStyle's ability to extract semantic information, we take the last step from ConStyle to ConStyle v2. Inspired by BYOL [28], SimSiam [15], and DINO [7], we take advantage of knowledge distillation. Since the input of the Momentum Encoder is the clean image, its output visual prompt is cleaner than the output of the Encoder. By using the Momentum Encoder as a teacher and the Encoder as a student, the teacher network is able to adaptively guide the student network during training. Specifically, a Classifier and Softmax layer are added after the Momentum Encoder, with the distance between the two outputs measured by the Kullback-Leibler (KL) divergence (Fig. 2). This gives the final ConStyle v2, and the performance of ConStyle v2 Restormer (Fig. 6 (g)) and ConStyle v2 Conv (Fig. 6 (c)) is raised again compared with the other models. In addition, the effect of each improvement on the average performance across eight degradations is depicted in Fig. 6 (d) and (h).
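The distillation term can be illustrated with a small pure-Python sketch that compares the softmax outputs of a teacher (Momentum Encoder branch) and a student (Encoder branch). The direction KL(teacher ‖ student), the function names, and the example logits are assumptions; the exact formulation used for ConStyle v2 is the one shown in Fig. 2 and the released code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) between the two softmax distributions.

    teacher_logits: classifier output for the clean image (teacher branch).
    student_logits: classifier output for the degraded image (student branch).
    The direction of the divergence is an assumption of this sketch.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    # Terms with p_i == 0 contribute nothing to the sum.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]
    print(kl_divergence(teacher, teacher))           # identical outputs
    print(kl_divergence(teacher, [-1.0, 0.5, 2.0]))  # disagreeing outputs
```

The divergence is zero when the student reproduces the teacher's class distribution and grows as the two disagree, so minimizing it pulls the prediction for the degraded input toward the prediction for the clean input.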

4 Experiments

4.1 Implementation Details

All experiments in this paper are performed on an NVIDIA Tesla V100 GPU. To be consistent with ConStyle [24], we use the AdamW optimizer ( $\beta_{1}$ =0.9, $\beta_{2}$ =0.999, weight decay= $1\times10^{-4}$ ) with an initial learning rate of $3\times10^{-4}$ and cosine annealing. Training Stage: The batch size, crop size, and total iterations are set to 16, 128, and 700K, respectively. Pre-training Stage: The batch size, crop size, and total iterations are set to 32, 224, and 200K, respectively. When generating degraded images, we directly use all configurations of Real-ESRGAN [68] and change the degradation intensity in ImageNet-C [31].
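For reference, the cosine annealing schedule mentioned above can be written as a short function. This sketch uses the initial learning rate and the training-stage iteration count from this section; the minimum learning rate of 0 and the absence of warm-up or restarts are assumptions, since they are not stated here.

```python
import math


def cosine_lr(step, total_steps=700_000, base_lr=3e-4, min_lr=0.0):
    """Learning rate at a given iteration under single-cycle cosine annealing.

    base_lr and total_steps follow the training-stage settings in Sec. 4.1;
    min_lr = 0 is an assumption of this sketch.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 175_000, 350_000, 525_000, 700_000):
        print(step, cosine_lr(step))
```

The rate starts at the initial value, reaches half of it at the midpoint of training, and decays to the minimum at the final iteration.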

Table 2: Parameters, computations, and inference speed comparison. (*) represents ConStyle, Pre-train, Class, and ConStyle v2 models. For instance, Conv (*) stands for ConStyle Conv, Pre-train Conv, Class Conv, and ConStyle v2 Conv.

Method	Restormer [79]	Restormer $(*)$	NAFNet [10]	NAFNet $(*)$
Params(M)	26.12	15.57	87.40	12.74
GFLOPs	70.49	74.92	49.13	46.97
Speed(us)	60	61	53	54
Method	MAXIM-1S [65]	MAXIM-1S $(*)$	Conv	Conv $(*)$
Params(M)	8.17	8.10	5.03	6.78
GFLOPs	21.58	25.58	19.64	25.90
Speed(us)	53	52	3	5

Table 3: The average performance of the PSNR/SSIM on 27 benchmarks. Red means the best and blue means the second best.

	Original	ConStyle	Pre-train	Class	ConStyle v2
Restormer [79]	6.40/0.0506	25.23/0.7844	27.24/0.8222	27.40/0.8305	27.60/0.8352
NAFNet [10]	27.45/0.8347	26.57/0.8054	27.43/0.8341	27.53/0.8379	27.55/0.8391
MAXIM-1S [65]	26.95/0.8319	26.63/0.8290	27.14/0.8319	27.41/0.8372	27.54/0.8400
Conv	21.44/0.6177	20.99/0.5986	20.14/0.6544	24.84/0.7889	26.06/0.7974

The Mix Degradations datasets are used in the training stage and ImageNet-1K in the pre-training stage. For evaluation, GoPro [52], HIDE [64], RealBlur-J [59], and RealBlur-R [59] are used for motion deblurring, DPDD [1] is used for defocus deblurring, SOTS outdoors [37] is used for dehazing, Rain100H [76], Rain100L [76], Test1200 [82], and Test2800 [27] are used for deraining, FiveK [5], LoL v1 [71], and LoL v2 [77] are used for low-light enhancement, CSD [13] and Snow100K (S, M, and L) [48] are used for desnowing, CBSD68 [51] and Urban100 [32] are used for denoising, and LIVE1 [62] is used for JPEG artifact removal.

Table 4: RGB image denoising results in PSNR. (*) indicates models trained with random sigma (0 to 50), while the others are trained with fixed sigma. Blue means better and red means worse.

Method	CBSD68(*) [51]			CBSD68 [51]			Urban100(*) [32]			Urban100 [32]
Method	$\sigma$ =15	$\sigma$ =25	$\sigma$ =50	$\sigma$ =15	$\sigma$ =25	$\sigma$ =50	$\sigma$ =15	$\sigma$ =25	$\sigma$ =50	$\sigma$ =15	$\sigma$ =25	$\sigma$ =50
ConStyle Restormer	34.33	31.71	28.51	34.37	31.74	28.52	34.89	32.66	29.64	35.01	32.74	29.71
ConStyle v2 Restormer	34.33	31.71	28.51	34.37	31.74	28.52	34.89	32.67	29.66	35.01	32.77	29.71
Differ.	0	0	0	0	+0	0	0	+0.01	+0.02	0	+0.03	+0
ConStyle NAFNet	34.31	31.69	28.50	34.34	31.71	28.52	34.82	32.58	29.55	34.91	32.64	29.63
ConStyle v2 NAFNet	34.33	31.71	28.52	34.36	31.73	28.54	34.88	32.66	29.65	34.96	32.72	29.68
Differ.	+0.02	+0.02	+0.02	+0.02	+0.02	+0.02	+0.06	+0.08	+0.10	+0.05	+0.08	+0.05
ConStyle MAXIM-1S	34.25	31.63	28.43	34.28	31.65	28.44	34.56	32.26	29.10	34.64	32.32	29.13
ConStyle v2 MAXIM-1S	34.19	31.55	28.34	34.22	31.58	28.36	34.60	32.32	29.19	34.70	32.40	29.26
Differ.	-0.06	-0.08	-0.09	-0.06	-0.07	-0.08	+0.04	+0.06	+0.09	+0.06	+0.08	+0.13

4.2 Model Analyses

Tab. 2 presents a comparison of parameters, computations, and speed between all models. All results are obtained using input data of size (2,3,128,128), and the speed is the average over 10,000 inference runs. The ConStyle v2/ConStyle models have fewer parameters than the original models (except for Original Conv and ConStyle v2 Conv) because, to demonstrate that the improvement of the ConStyle models does not simply come from expanding the scale of the network, the ConStyle models are downscaled by reducing their width and depth [24]. Introducing the ConStyle part itself increases the parameters of a model by 1.19M.

Table 5: Image motion deblurring and dehazing tested by PSNR/SSIM. Blue means better and red means worse.

Method	GoPro [52]	RealBlur-R [59]	RealBlur-J [59]	HIDE [64]	SOTS outdoor [37]
ConStyle Restormer	31.45/0.9208	33.94/0.9454	26.63/0.8288	30.20/0.9098	30.85/0.9760
ConStyle v2 Restormer	31.36/0.9200	33.95/0.9447	26.63/0.8298	30.10/0.9080	31.32/0.9789
Differ.	-0.09/ -0.0008	+0.01/ -0.0007	0/ +0.0010	-0.10/ -0.0018	+0.47/ +0.0029
ConStyle NAFNet	31.56/0.9230	33.89/0.9441	26.61/0.8308	30.24/0.9106	30.73/0.9743
ConStyle v2 NAFNet	31.68/0.9254	33.90/0.9444	26.63/0.8311	30.32/0.9115	32.34/0.9812
Differ.	+0.12/ +0.0024	+0.01/ +0.0003	+0.02/ +0.0003	+0.08/ +0.0009	+1.61/ +0.0069
ConStyle MAXIM-1S	30.77/0.9093	33.93/0.9435	26.54/0.8261	29.26/0.8951	30.59/0.9713
ConStyle v2 MAXIM-1S	31.01/0.9159	33.79/0.9407	26.47/0.8248	29.36/0.8952	30.68/0.9751
Differ.	+0.24/ +0.0066	-0.14/ -0.0028	-0.07/ -0.0013	+0.10/ +0.0001	+0.09/ +0.0038

4.3 All-in-One Image Restoration Result

To verify the overall performance of our methods, we calculate the average PSNR/SSIM on 27 benchmarks. As shown in Tab. 3, the Pre-train, Class, and ConStyle v2 models generally outperform the Original and ConStyle models, with two exceptions at the Pre-train stage: Pre-train Conv has a lower PSNR than both Original Conv and ConStyle Conv, and Pre-train NAFNet is marginally below Original NAFNet. This highlights the effectiveness of the three proposed methods: unsupervised pre-training, using a pretext task, and using knowledge distillation. Due to space constraints, the detailed results of the Restormer, NAFNet, MAXIM-1S, and Conv models can be found in the supplementary material. While the ConStyle v2 models may not outperform the ConStyle, Pre-train, and Class models on certain benchmarks, they still improve over the Original models on average. It is worth noting that scaling up the ConStyle v2 models to the size of the Original models could potentially yield even better results.
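As a reminder of how the PSNR values averaged in Tab. 3 are defined, the following sketch computes PSNR from flattened pixel values and averages it over several images. It uses the standard definition; the evaluation protocol of each benchmark (for example RGB versus luminance channel, or border cropping) is not specified here, so the function names and the 8-bit peak value are assumptions.

```python
import math


def psnr(restored, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flattened images.

    restored, target: equal-length sequences of pixel values.
    max_value: peak pixel value (255 for 8-bit images, assumed here).
    """
    if len(restored) != len(target) or len(restored) == 0:
        raise ValueError("images must be non-empty and have the same size")
    mse = sum((r - t) ** 2 for r, t in zip(restored, target)) / len(restored)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def average_psnr(pairs, max_value=255.0):
    """Mean PSNR over a list of (restored, target) image pairs."""
    scores = [psnr(r, t, max_value) for r, t in pairs]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    target = [10, 20, 30, 40]
    restored = [12, 18, 33, 40]
    print(psnr(restored, target))
```

A higher PSNR means a smaller mean squared error with respect to the clean target, which is why the collapsed models in Tab. 3 show very low values.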

4.4 Single Image Restoration Result

Considering the significant improvement demonstrated by ConStyle v2 in handling multiple degradations, it is worth investigating whether this method can also enhance general restoration models on specific degradations. We simply use the same training settings as IRConStyle [24] for the specific degradation scenarios. ConStyle v2 models and ConStyle models are compared across motion deblurring, denoising, and dehazing tasks. The denoising results are presented in Tab. 4, while the results for motion deblurring and dehazing are shown in Tab. 5.

For denoising, although ConStyle v2 MAXIM-1S is slightly worse than ConStyle MAXIM-1S on CBSD68, the overall performance of ConStyle v2 models is better than that of ConStyle models. For dehazing, ConStyle v2 models outperform ConStyle models, by as much as 1.61 dB for NAFNet. For motion deblurring, ConStyle v2 NAFNet is superior to ConStyle NAFNet, while ConStyle v2 and ConStyle are comparable on Restormer and MAXIM-1S. In general, for specific degradations, ConStyle v2 models do not require choosing a precise iteration at which to freeze the weights, while ConStyle models need to do so to avoid model collapse. Therefore, ConStyle v2 makes the entire IRConStyle framework more efficient.

4.5 All-in-One Image Restoration Visual Result

We present the visual results of the Original models and ConStyle v2 models on GoPro, DPDD, LoL v2, CBSD68, Rain100L, SOTS outdoors, Snow100K-M, and LIVE1. Due to space constraints and for a fair comparison, we select the same single image per task for every model. The original degraded and target images are shown in Fig. 8. The visual results of Restormer and ConStyle v2 Restormer are shown in Fig. 9, those of NAFNet and ConStyle v2 NAFNet in Fig. 10, those of MAXIM-1S and ConStyle v2 MAXIM-1S in Fig. 11, and those of Original Conv and ConStyle v2 Conv in Fig. 12.

4.6 Ablation Studies

ConStyle is improved into ConStyle v2 step by step (Sec. 3.4): unsupervised pre-training, adding a pretext task, and adopting knowledge distillation. Our improvement process therefore also serves as the ablation study. In addition, at every step of improvement, we train all models on the Mix Degradations datasets for fair comparison (see supplemental materials for details).

5 Conclusions and Limitations

5.1 Conclusions

This paper leverages unsupervised pre-training, a pretext task, and knowledge distillation to improve ConStyle into a strong prompter for all-in-one image restoration. ConStyle v2 not only significantly improves the performance of the Original models under multiple degradation settings, but also solves the issue of model collapse (observed in Restormer) and the unstable training caused by model limitations (observed in Original Conv). Moreover, the redundant step in ConStyle of manually selecting specific iterations at which to freeze weights for different models and tasks is avoided, and the performance of ConStyle v2 models under certain specific degradations is also improved. Finally, due to the lack of training datasets for multiple degradations, the Mix Degradations datasets are collected and introduced.

5.2 Limitations

Despite the advantages described in the conclusions and experiments, our method has two limitations. Firstly, ConStyle v2 exhibits limited improvements in low-light, deraining, and defocus deblurring tasks compared to other tasks. This is evident in the results of ConStyle v2 Conv on LoL v1 and ConStyle v2 MAXIM-1S on DPDD. The reason is that, during the pre-training stage, rain, low light, and defocus blur are not covered by the degradation generation (Real-ESRGAN and ImageNet-C), since effectively synthesizing defocus blur, rain, and low light is still a challenge in the IR community. Secondly, while ConStyle v2 models have shown promising results in quantitative indicators such as PSNR/SSIM, the visual improvements in ConStyle v2 Conv seem to be less pronounced. This may be attributed to the inherent limitations of the Original Conv model, which is not specifically tailored for image restoration. However, ConStyle v2 demonstrates significant visual enhancements in most tasks on MAXIM-1S and NAFNet.

References

[1] Abuolaim, A., Brown, M.S.: Defocus deblurring using dual-pixel data. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16. pp. 111–126. Springer (2020)
[2] Agustsson, E., Timofte, R.: Ntire 2017 challenge on single image super-resolution: Dataset and study. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. pp. 126–135 (2017)
[3] Bar, A., Gandelsman, Y., Darrell, T., Globerson, A., Efros, A.: Visual prompting via image inpainting. Advances in Neural Information Processing Systems 35, 25005–25017 (2022)
[4] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
[5] Bychkovsky, V., Paris, S., Chan, E., Durand, F.: Learning photographic global tonal adjustment with a database of input/output image pairs. In: CVPR 2011. pp. 97–104. IEEE (2011)
[6] Cai, Y., Bian, H., Lin, J., Wang, H., Timofte, R., Zhang, Y.: Retinexformer: One-stage retinex-based transformer for low-light image enhancement. arXiv preprint arXiv:2303.06705 (2023)
[7] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)
[8] Chang, M., Li, Q., Feng, H., Xu, Z.: Spatial-adaptive network for single image denoising. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16. pp. 171–187. Springer (2020)
[9] Chen, H., Wang, Y., Guo, T., Xu, C., Deng, Y., Liu, Z., Ma, S., Xu, C., Xu, C., Gao, W.: Pre-trained image processing transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12299–12310 (2021)
[10] Chen, L., Chu, X., Zhang, X., Sun, J.: Simple baselines for image restoration. In: European Conference on Computer Vision. pp. 17–33. Springer (2022)
[11] Chen, T., Saxena, S., Li, L., Lin, T.Y., Fleet, D.J., Hinton, G.E.: A unified sequence interface for vision tasks. Advances in Neural Information Processing Systems 35, 31333–31346 (2022)
[12] Chen, W.T., Fang, H.Y., Ding, J.J., Tsai, C.C., Kuo, S.Y.: Jstasr: Joint size and transparency-aware snow removal algorithm based on modified partial convolution and veiling effect removal. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16. pp. 754–770. Springer (2020)
[13] Chen, W.T., Fang, H.Y., Hsieh, C.L., Tsai, C.C., Chen, I., Ding, J.J., Kuo, S.Y., et al.: All snow removed: Single image desnowing algorithm using hierarchical dual-tree complex wavelet representation and contradict channel loss. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4196–4205 (2021)
[14] Chen, W.T., Huang, Z.K., Tsai, C.C., Yang, H.H., Ding, J.J., Kuo, S.Y.: Learning multiple adverse weather removal via two-stage knowledge learning and multi-contrastive regularization: Toward a unified model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17653–17662 (2022)
[15] Chen, X., He, K.: Exploring simple siamese representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15750–15758 (2021)
[16] Cheng, S., Wang, Y., Huang, H., Liu, D., Fan, H., Liu, S.: Nbnet: Noise basis learning for image denoising with subspace projection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4896–4906 (2021)
[17] Cui, Y., Ren, W., Yang, S., Cao, X., Knoll, A.: Irnext: Rethinking convolutional network design for image restoration (2023)
[18] Cui, Y., Tao, Y., Bing, Z., Ren, W., Gao, X., Cao, X., Huang, K., Knoll, A.: Selective frequency network for image restoration. In: The Eleventh International Conference on Learning Representations (2022)
[19] Cui, Y., Tao, Y., Ren, W., Knoll, A.: Dual-domain attention for image deblurring. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 479–487 (2023)
[20] Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. In: Proceedings of the IEEE international conference on computer vision. pp. 764–773 (2017)
[21] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. IEEE (2009)
[22] Dong, J., Pan, J.: Physics-based feature dehazing networks. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16. pp. 188–204. Springer (2020)
[23] Fan, D., Yue, T., Zhao, X., Chang, L.: Lir: Efficient degradation removal for lightweight image restoration. arXiv preprint arXiv:2402.01368 (2024)
[24] Fan, D., Zhao, X., Chang, L.: Irconstyle: Image restoration framework using contrastive learning and style transfer. arXiv preprint arXiv:2402.15784 (2024)
[25] Fu, H., Zheng, W., Meng, X., Wang, X., Wang, C., Ma, H.: You do not need additional priors or regularizers in retinex-based low-light image enhancement. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18125–18134 (2023)
[26] Fu, X., Huang, J., Zeng, D., Huang, Y., Ding, X., Paisley, J.: Removing rain from single images via a deep detail network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3855–3863 (2017)
[27] Fu, X., Huang, J., Zeng, D., Huang, Y., Ding, X., Paisley, J.: Removing rain from single images via a deep detail network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3855–3863 (2017)
[28] Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems 33, 21271–21284 (2020)
[29] Gu, J., Ma, X., Kong, X., Qiao, Y., Dong, C.: Networks are slacking off: Understanding generalization problem in image deraining. Advances in Neural Information Processing Systems 36 (2024)
[30] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9729–9738 (2020)
[31] Hendrycks, D., Dietterich, T.: Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261 (2019)
[32] Huang, J.B., Singh, A., Ahuja, N.: Single image super-resolution from transformed self-exemplars. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5197–5206 (2015)
[33] Jiang, J., Zhang, K., Timofte, R.: Towards flexible blind jpeg artifacts removal. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4997–5006 (2021)
[34] Jiang, K., Wang, Z., Yi, P., Chen, C., Huang, B., Luo, Y., Ma, J., Jiang, J.: Multi-scale progressive fusion network for single image deraining. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8346–8355 (2020)
[35] Kong, L., Dong, J., Ge, J., Li, M., Pan, J.: Efficient frequency domain-based transformers for high-quality image deblurring. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5886–5895 (2023)
[36] Lee, J., Son, H., Rim, J., Cho, S., Lee, S.: Iterative filter adaptive network for single image defocus deblurring. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2034–2042 (2021)
[37] Li, B., Ren, W., Fu, D., Tao, D., Feng, D., Zeng, W., Wang, Z.: Benchmarking single-image dehazing and beyond. IEEE Transactions on Image Processing 28(1), 492–505 (2018)
[38] Li, B., Liu, X., Hu, P., Wu, Z., Lv, J., Peng, X.: All-in-one image restoration for unknown corruption. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17452–17462 (2022)
[39] Li, F., Shen, L., Mi, Y., Li, Z.: Drcnet: Dynamic image restoration contrastive network. In: European Conference on Computer Vision. pp. 514–532. Springer (2022)
[40] Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning. pp. 12888–12900. PMLR (2022)
[41] Li, R., Tan, R.T., Cheong, L.F.: All in one bad weather removal using architectural search. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3175–3185 (2020)
[42] Li, Y., Fan, Y., Xiang, X., Demandolx, D., Ranjan, R., Timofte, R., Van Gool, L.: Efficient and explicit modelling of image hierarchies for image restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18278–18289 (2023)
[43] Li, Y., Tan, R.T., Guo, X., Lu, J., Brown, M.S.: Rain streak removal using layer priors. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2736–2744 (2016)
[44] Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., Timofte, R.: Swinir: Image restoration using swin transformer. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1833–1844 (2021)
[45] Liu, W., Shen, X., Pun, C.M., Cun, X.: Explicit visual prompting for low-level structure segmentations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19434–19445 (2023)
[46] Liu, X., Ma, Y., Shi, Z., Chen, J.: Griddehazenet: Attention-based multi-scale network for image dehazing. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 7314–7323 (2019)
[47] Liu, Y., Chen, X., Ma, X., Wang, X., Zhou, J., Qiao, Y., Dong, C.: Unifying image processing as visual prompting question answering. arXiv preprint arXiv:2310.10513 (2023)
[48] Liu, Y.F., Jaw, D.W., Huang, S.C., Hwang, J.N.: Desnownet: Context-aware deep network for snow removal. IEEE Transactions on Image Processing 27(6), 3064–3073 (2018)
[49] Luo, Z., Gustafsson, F.K., Zhao, Z., Sjölund, J., Schön, T.B.: Controlling vision-language models for universal image restoration. arXiv preprint arXiv:2310.01018 (2023)
[50] Ma, J., Cheng, T., Wang, G., Zhang, Q., Wang, X., Zhang, L.: Prores: Exploring degradation-aware visual prompt for universal image restoration. arXiv preprint arXiv:2306.13653 (2023)
[51] Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001. vol. 2, pp. 416–423. IEEE (2001)
[52] Nah, S., Hyun Kim, T., Mu Lee, K.: Deep multi-scale convolutional neural network for dynamic scene deblurring. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3883–3891 (2017)
[53] Park, D., Lee, B.H., Chun, S.Y.: All-in-one image restoration for unknown degradations using adaptive discriminative filters for specific degradations. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5815–5824. IEEE (2023)
[54] Park, D., Lee, B.H., Chun, S.Y.: All-in-one image restoration for unknown degradations using adaptive discriminative filters for specific degradations. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5815–5824. IEEE (2023)
[55] Potlapalli, V., Zamir, S.W., Khan, S., Khan, F.S.: Promptir: Prompting for all-in-one blind image restoration. arXiv preprint arXiv:2306.13090 (2023)
[56] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
[57] Ren, D., Zuo, W., Hu, Q., Zhu, P., Meng, D.: Progressive image deraining networks: A better and simpler baseline. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3937–3946 (2019)
[58] Ren, W., Ma, L., Zhang, J., Pan, J., Cao, X., Liu, W., Yang, M.H.: Gated fusion network for single image dehazing. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3253–3261 (2018)
[59] Rim, J., Lee, H., Won, J., Cho, S.: Real-world blur dataset for learning and benchmarking deblurring algorithms. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16. pp. 184–201. Springer (2020)
[60] Ruan, L., Chen, B., Li, J., Lam, M.L.: Aifnet: All-in-focus image restoration network using a light field-based dataset. IEEE Transactions on Computational Imaging 7, 675–688 (2021)
[61] Ruan, L., Chen, B., Li, J., Lam, M.: Learning to deblur using light field generated and real defocus images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16304–16313 (2022)
[62] Sheikh, H.: Live image quality assessment database release 2. http://live.ece.utexas.edu/research/quality (2005)
[63] Shen, Y., Fu, C., Chen, P., Zhang, M., Li, K., Sun, X., Wu, Y., Lin, S., Ji, R.: Aligning and prompting everything all at once for universal visual perception. arXiv preprint arXiv:2312.02153 (2023)
[64] Shen, Z., Wang, W., Lu, X., Shen, J., Ling, H., Xu, T., Shao, L.: Human-aware motion deblurring. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5572–5581 (2019)
[65] Tu, Z., Talebi, H., Zhang, H., Yang, F., Milanfar, P., Bovik, A., Li, Y.: Maxim: Multi-axis mlp for image processing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5769–5780 (2022)
[66] Valanarasu, J.M.J., Yasarla, R., Patel, V.M.: Transweather: Transformer-based restoration of images degraded by adverse weather conditions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2353–2363 (2022)
[67] Wang, X., Wang, W., Cao, Y., Shen, C., Huang, T.: Images speak in images: A generalist painter for in-context visual learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6830–6839 (2023)
[68] Wang, X., Xie, L., Dong, C., Shan, Y.: Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1905–1914 (2021)
[69] Wang, Y., Liu, Z., Liu, J., Xu, S., Liu, S.: Low-light image enhancement with illumination-aware gamma correction and complete image modelling network. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13128–13137 (2023)
[70] Wang, Z., Cun, X., Bao, J., Zhou, W., Liu, J., Li, H.: Uformer: A general u-shaped transformer for image restoration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 17683–17693 (2022)
[71] Wei, C., Wang, W., Yang, W., Liu, J.: Deep retinex decomposition for low-light enhancement. arXiv preprint arXiv:1808.04560 (2018)
[72] Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3733–3742 (2018)
[73] Xie, L., Wang, X., Dong, C., Qi, Z., Shan, Y.: Finding discriminative filters for specific degradations in blind super-resolution. Advances in Neural Information Processing Systems 34, 51–61 (2021)
[74] Yang, D., Yamac, M.: Motion aware double attention network for dynamic scene deblurring. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1113–1123 (2022)
[75] Yang, W., Tan, R.T., Feng, J., Liu, J., Guo, Z., Yan, S.: Deep joint rain detection and removal from a single image. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1357–1366 (2017)
[76] Yang, W., Tan, R.T., Feng, J., Liu, J., Guo, Z., Yan, S.: Deep joint rain detection and removal from a single image. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1357–1366 (2017)
[77] Yang, W., Wang, W., Huang, H., Wang, S., Liu, J.: Sparse gradient regularized deep retinex network for robust low-light image enhancement. IEEE Transactions on Image Processing 30, 2072–2086 (2021)
[78] Yue, Z., Yong, H., Zhao, Q., Meng, D., Zhang, L.: Variational denoising network: Toward blind noise modeling and removal. Advances in neural information processing systems 32 (2019)
[79] Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.H.: Restormer: Efficient transformer for high-resolution image restoration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5728–5739 (2022)
[80] Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.H., Shao, L.: Multi-stage progressive image restoration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14821–14831 (2021)
[81] Zhang, D., Wang, X.: Dynamic multi-scale network for dual-pixel images defocus deblurring with transformer. In: 2022 IEEE International Conference on Multimedia and Expo (ICME). pp. 1–6. IEEE (2022)
[82] Zhang, H., Patel, V.M.: Density-aware single image de-raining using a multi-stream dense network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 695–704 (2018)
[83] Zhang, H., Sindagi, V., Patel, V.M.: Image de-raining using a conditional generative adversarial network. IEEE transactions on circuits and systems for video technology 30(11), 3943–3956 (2019)
[84] Zhang, J., Huang, J., Yao, M., Yang, Z., Yu, H., Zhou, M., Zhao, F.: Ingredient-oriented multi-degradation learning for image restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5825–5835 (2023)
[85] Zhang, J., Huang, J., Yao, M., Yang, Z., Yu, H., Zhou, M., Zhao, F.: Ingredient-oriented multi-degradation learning for image restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5825–5835 (2023)
[86] Zheng, B., Chen, Y., Tian, X., Zhou, F., Liu, X.: Implicit dual-domain convolutional network for robust color image compression artifact reduction. IEEE Transactions on Circuits and Systems for Video Technology 30(11), 3982–3994 (2019)

Supplementary Material: A Strong Prompter for All-in-One Image Restoration

Dongqi Fan Junhao Zhang Liang Chang

Appendix A All-in-One Image Restoration Results

Table 6: Results of Restormer models on image motion deblurring, deraining, low-light enhancement, defocus deblurring, and dehazing, reported as PSNR/SSIM. Red means the best and blue means the second best.

Datasets	Original	ConStyle	Pre-train	Class	ConStyle v2
GoPro [52]	5.91/0.0062	25.64/0.7912	27.94/0.8558	27.93/0.8554	28.03/0.8585
HIDE [64]	5.99/0.0157	23.87/0.7560	26.08/0.8338	26.51/0.8372	26.53/0.8393
RealBlur-J [59]	10.97/0.0591	18.49/0.5820	18.50/0.4209	19.79/0.6579	19.39/0.4664
RealBlur-R [59]	8.56/0.3099	17.23/0.6388	19.88/0.4953	17.69/0.4213	23.23/0.7408
Rain100H [76]	6.25/0.0032	18.41/0.5615	24.95/0.7818	25.23/0.7852	25.54/0.7949
Rain100L [76]	6.25/0.0032	24.33/0.8136	22.04/0.8340	23.73/0.8503	24.49/0.8358
Test1200 [82]	6.33/0.0169	27.72/0.8151	31.00/0.8847	31.30/0.8888	31.41/0.8926
Test2800 [27]	6.29/0.0058	27.40/0.8353	30.93/0.9111	30.99/0.9119	31.06/0.9130
FiveK [5]	6.64/0.0307	18.39/0.8026	24.50/0.9062	24.59/0.9058	24.46/0.9058
LoL v1 [71]	6.38/0.0177	18.54/0.7231	21.35/0.8003	21.87/0.8015	21.75/0.8189
LoL v2 [77]	5.57/0.0145	15.56/0.7042	24.06/0.9145	24.46/0.9165	24.10/0.9136
DPDD [1]	7.73/0.0143	20.22/0.7037	15.00/0.6134	15.71/0.6104	14.38/0.5988
SOTS [37]	5.23/0.0065	24.80/0.9326	28.64/0.9694	28.86/0.9700	29.86/0.9704
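
For readers less familiar with the first number in each cell, the following is a minimal sketch of the standard PSNR definition, 10·log10(MAX²/MSE). It assumes 8-bit images (MAX = 255) compared over all channels; this section does not state the exact evaluation protocol (for example RGB versus luminance channel, or border cropping), so the function `psnr` is illustrative and not the authors' evaluation code.

```python
import numpy as np

def psnr(reference: np.ndarray, restored: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    # Convert to float first so that uint8 subtraction cannot wrap around.
    reference = reference.astype(np.float64)
    restored = restored.astype(np.float64)
    mse = np.mean((reference - restored) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy example: a constant offset of 16 grey levels gives MSE = 256.
clean = np.full((8, 8, 3), 100, dtype=np.uint8)
shifted = np.full((8, 8, 3), 116, dtype=np.uint8)
print(round(psnr(clean, shifted), 2))
```

Under this definition, values around 6 dB, as in the Original column above, correspond to a mean squared error that is a large fraction of the full intensity range, which is consistent with the model collapse reported for Restormer.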

Table 7: Results of Restormer models on image denoising, JPEG artifact removal, and desnowing, reported as PSNR/SSIM. Red means the best and blue means the second best.

Datasets	Setting	Original	ConStyle	Pre-train	Class	ConStyle v2
CBSD68 [51]	15	6.62/0.0044	31.99/0.8844	33.33/0.9100	33.35/0.9117	33.40/0.9130
	25	6.62/0.0044	29.33/0.8146	30.40/0.8517	30.39/0.8515	30.45/0.8519
	50	6.62/0.0044	25.31/0.6445	25.63/0.7035	25.77/0.7081	25.79/0.7098
Urban100 [32]	15	5.72/0.218	31.15/0.8923	32.43/0.9107	32.77/0.9100	32.45/0.9106
	25	5.72/0.218	28.49/0.8351	29.60/0.8602	29.86/0.8661	29.65/0.8619
	50	5.72/0.218	24.44/0.6716	24.69/0.7206	24.97/0.7366	25.07/0.7416
LIVE1 [62]	10	6.20/0.0020	25.72/0.7512	27.00/0.7839	27.05/0.7852	27.12/0.7887
	20	6.20/0.0020	28.09/0.8329	29.37/0.8558	29.42/0.8570	29.47/0.8582
	30	6.20/0.0020	29.29/0.8674	30.67/0.8863	30.73/0.8874	30.77/0.8882
	40	6.20/0.0020	30.04/0.8866	31.58/0.9034	31.63/0.9044	31.67/0.9050
Snow100K [48]	S	6.43/0.0203	29.12/0.8778	33.90/0.9418	34.04/0.9430	34.23/0.9443
	M	6.44/0.0203	26.17/0.8549	32.17/0.9300	32.36/0.9316	32.49/0.9330
	L	6.45/0.0214	23.57/0.7808	28.28/0.8761	28.46/0.8793	28.50/0.8809
CSD [13]		5.43/0.0078	16.03/0.7596	16.00/0.7404	15.35/0.7595	16.13/0.7625
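
The values 15, 25, and 50 in the CBSD68 and Urban100 rows are presumably noise levels. The sketch below shows how such noisy test inputs are commonly synthesized. It assumes, since this section does not say so, that the level is the standard deviation σ of additive white Gaussian noise on the 0-255 intensity scale. The function name `add_gaussian_noise` and the fixed seed are hypothetical choices for illustration, not part of the released code.

```python
import numpy as np

def add_gaussian_noise(image: np.ndarray, sigma: float, seed: int = 0) -> np.ndarray:
    """Add zero-mean Gaussian noise with standard deviation sigma (0-255 scale)."""
    rng = np.random.default_rng(seed)  # fixed seed for reproducibility (assumption)
    noisy = image.astype(np.float64) + rng.normal(0.0, sigma, size=image.shape)
    # Round and clip back to the valid 8-bit range.
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)

# Mid-grey test image, so that clipping at 0 and 255 is rare for small sigma.
gray = np.full((256, 256), 128, dtype=np.uint8)
for sigma in (15, 25, 50):
    noisy = add_gaussian_noise(gray, sigma)
    print(sigma, noisy.astype(np.float64).std())
```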

Table 8: Results of NAFNet models on image motion deblurring, deraining, low-light enhancement, defocus deblurring, and dehazing, reported as PSNR/SSIM. Red means the best and blue means the second best.

Datasets	Original	ConStyle	Pre-train	Class	ConStyle v2
GoPro [52]	27.52/0.8461	27.81/0.5839	27.81/0.8536	27.80/0.8530	27.82/0.8540
HIDE [64]	25.01/0.8173	25.55/0.8293	25.27/0.8228	25.30/0.8125	25.38/0.8243
RealBlur-J [59]	23.11/0.7418	24.32/0.7688	23.67/0.7435	25.17/0.7918	23.64/0.7419
RealBlur-R [59]	29.18/0.7915	27.72/0.7397	28.63/0.7596	27.49/0.7281	27.34/0.7272
Rain100H [76]	24.47/0.7666	21.93/0.7943	24.53/0.7437	24.72/0.7748	25.00/0.7777
Rain100L [76]	21.93/0.7943	22.03/0.8003	21.85/0.8110	21.09/0.7958	22.12/0.8111
Test1200 [82]	30.68/0.8820	29.38/0.8631	29.40/0.8793	29.41/0.8608	29.62/0.8669
Test2800 [27]	30.70/0.9020	30.74/0.9087	30.97/0.9017	30.73/0.9087	30.75/0.9089
FiveK [5]	24.43/0.9022	24.41/0.9028	24.42/0.9030	24.45/0.9018	24.47/0.9038
LoL v1 [71]	21.73/0.7878	21.89/0.7804	21.74/0.8022	21.96/0.7993	22.05/0.7995
LoL v2 [77]	24.08/0.9087	23.77/0.9085	24.08/0.9094	22.05/0.7995	24.47/0.9128
DPDD [1]	13.34/0.5684	14.09/0.5860	14.11/0.5689	14.22/0.5817	14.25/0.5961
SOTS [37]	29.01/0.9452	29.18/0.9650	28.51/0.9213	29.13/0.9646	29.19/0.9651

Table 9: Results of NAFNet models on image denoising, JPEG artifact removal, and desnowing, reported as PSNR/SSIM. Red means the best and blue means the second best.

Datasets	Setting	Original	ConStyle	Pre-train	Class	ConStyle v2
CBSD68 [51]	15	33.00/0.9004	33.20/0.9098	33.19/0.9005	33.20/0.9011	33.22/0.9099
	25	30.19/0.8417	29.94/0.8416	30.00/0.8357	30.22/0.8427	30.27/0.8440
	50	25.13/0.6917	22.64/0.6433	25.34/0.6963	25.24/0.6994	25.41/0.7058
Urban100 [32]	15	32.81/0.9226	29.77/0.8670	32.90/0.9242	32.83/0.9241	32.82/0.9237
	25	29.83/0.8718	25.20/0.7623	28.90/0.8655	29.91/0.8762	29.90/0.8761
	50	24.28/0.7247	16.17/0.4767	24.38/0.7425	24.36/0.7459	24.49/0.7496
LIVE1 [62]	10	26.94/0.7828	26.96/0.7824	26.85/0.7802	26.96/0.7829	26.99/0.7842
	20	29.28/0.8544	29.29/0.8539	29.28/0.8526	29.30/0.8545	29.31/0.8547
	30	30.58/0.8840	30.59/0.8850	30.49/0.8833	30.60/0.8851	30.60/0.8850
	40	31.48/0.9023	31.50/0.9027	31.54/0.9028	31.49/0.9025	31.50/0.9026
Snow100K [48]	S	33.41/0.9399	33.45/0.9399	33.40/0.9323	33.38/0.9397	33.51/0.9404
	M	31.77/0.9276	31.77/0.9275	31.65/0.9206	31.72/0.9272	31.79/0.9280
	L	27.60/0.8710	27.64/0.8703	28.30/0.8771	27.59/0.8697	27.68/0.8711
CSD [13]		16.49/0.7493	16.83/0.7545	16.69/0.7606	17.69/0.7677	17.29/0.7621

Table 10: Results of MAXIM-1S models on image motion deblurring, deraining, low-light enhancement, defocus deblurring, and dehazing, reported as PSNR/SSIM. Red means the best and blue means the second best.

Datasets	Original	ConStyle	Pre-train	Class	ConStyle v2
GoPro [52]	26.99/0.8294	27.35/0.8319	27.28/0.8379	27.35/0.8391	27.37/0.8401
HIDE [64]	23.92/0.7764	25.13/0.8079	24.28/0.7924	25.54/0.8107	25.35/0.8096
RealBlur-J [59]	24.05/0.7673	24.37/0.7574	23.33/0.7278	24.74/0.7669	25.05/0.7751
RealBlur-R [59]	23.11/0.7305	24.94/0.6566	22.22/0.5779	25.83/0.6676	27.65/0.7341
Rain100H [76]	21.58/0.7020	22.16/0.7777	22.84/0.7544	22.82/0.7503	23.75/0.7603
Rain100L [76]	21.58/0.8020	21.81/0.7961	20.74/0.7860	19.90/0.7693	22.64/0.8076
Test1200 [82]	30.52/0.8826	29.04/0.8783	30.81/0.8839	30.48/0.8815	30.76/0.8829
Test2800 [27]	30.50/0.9060	28.34/0.8675	30.70/0.8999	30.83/0.9109	30.55/0.9075
FiveK [5]	23.94/0.8953	24.10/0.8907	24.16/0.8990	24.15/0.9008	24.17/0.8993
LoL v1 [71]	20.99/0.8000	20.97/0.8119	20.88/0.8027	21.03/0.8078	21.59/0.8089
LoL v2 [77]	23.26/0.9036	24.09/0.9111	24.13/0.9055	24.44/0.9122	24.40/0.9112
DPDD [1]	14.71/0.5969	15.09/0.6082	14.84/0.5982	15.47/0.6079	14.81/0.5989
SOTS [37]	28.55/0.9586	29.36/0.9651	29.34/0.9660	29.24/0.9600	29.47/0.9679
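
Because the red and blue highlighting does not survive in plain text, the best and second-best entries of a row can be recovered from the numbers themselves. The sketch below does this for the SOTS row of Table 10; the helper names `parse_cell` and `top_two` are hypothetical, and ties (which occur in some rows) are broken by column order, which may differ from the authors' highlighting.

```python
COLUMNS = ["Original", "ConStyle", "Pre-train", "Class", "ConStyle v2"]

def parse_cell(cell: str) -> tuple:
    """Split a 'PSNR/SSIM' cell into two floats; float() tolerates stray spaces."""
    psnr_text, ssim_text = cell.split("/")
    return float(psnr_text), float(ssim_text)

def top_two(cells: list, metric_index: int) -> tuple:
    """Return the column names of the best and second-best value.

    metric_index is 0 for PSNR and 1 for SSIM (higher is better for both).
    """
    values = [parse_cell(cell)[metric_index] for cell in cells]
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return COLUMNS[order[0]], COLUMNS[order[1]]

# SOTS row of Table 10 (MAXIM-1S models), copied from the table above.
sots_row = ["28.55/0.9586", "29.36/0.9651", "29.34/0.9660", "29.24/0.9600", "29.47/0.9679"]
print(top_two(sots_row, 0))  # ranking by PSNR
print(top_two(sots_row, 1))  # ranking by SSIM
```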

Table 11: Results of MAXIM-1S models on image denoising, JPEG artifact removal, and desnowing, reported as PSNR/SSIM. Red means the best and blue means the second best.

Datasets	Setting	Original	ConStyle	Pre-train	Class	ConStyle v2
CBSD68 [51]	15	33.11/0.9094	33.20/0.9111	33.24/0.9113	33.29/0.9128	33.22/0.9113
	25	30.13/0.8462	30.27/0.8469	30.25/0.8470	30.33/0.8486	30.27/0.8477
	50	25.43/0.7102	25.50/0.7172	25.53/0.7195	25.67/0.7228	25.63/0.7132
Urban100 [32]	15	32.47/0.9190	32.11/0.9186	32.66/0.9202	32.71/0.9187	32.77/0.9238
	25	29.50/0.8693	29.04/0.8672	29.83/0.8689	29.86/0.8770	29.86/0.8777
	50	24.54/0.7499	24.07/0.7456	25.01/0.7689	25.07/0.7718	24.94/0.7629
LIVE1 [62]	10	26.82/0.7824	24.93/0.7519	26.82/0.7841	26.89/0.7830	26.90/0.7854
	20	29.06/0.8504	27.09/0.8345	29.24/0.8462	29.18/0.8506	29.27/0.8564
	30	30.19/0.8778	28.40/0.8683	30.45/0.8866	30.68/0.8876	30.59/0.8868
	40	30.98/0.8941	28.73/0.8847	31.56/0.9038	31.58/0.9004	31.37/0.9012
Snow100K [48]	S	33.05/0.9380	33.48/0.9324	33.38/0.9408	33.53/0.9406	33.54/0.9420
	M	31.46/0.9261	31.86/0.9209	31.78/0.9292	31.81/0.9211	31.87/0.9303
	L	27.78/0.8726	28.01/0.8760	27.95/0.8755	28.16/0.8791	28.03/0.8773
CSD [13]		13.60/0.7048	13.93/0.7087	13.48/0.7335	14.70/0.7344	14.27/0.7147

Table 12: Results of Conv models on image motion deblurring, deraining, low-light enhancement, defocus deblurring, and dehazing, reported as PSNR/SSIM. Red means the best and blue means the second best.

Datasets	Original	ConStyle	Pre-train	Class	ConStyle v2
GoPro [52]	21.56/0.6900	24.61/0.7275	25.67/0.7922	26.35/0.8108	26.54/0.8121
HIDE [64]	20.04/0.6523	23.18/0.7001	21.84/0.7303	23.17/0.7019	23.27/0.7671
RealBlur-J [59]	23.55/0.6025	25.22/0.6936	25.30/0.7888	23.66/0.7621	26.10/0.7979
RealBlur-R [59]	30.01/0.8375	32.02/0.8901	30.32/0.8874	30.80/0.8889	32.60/0.9130
Rain100H [76]	10.11/0.3799	12.32/0.3374	15.30/0.5284	11.96/0.4334	12.40/0.4236
Rain100L [76]	17.64/0.6086	23.77/0.7262	18.89/0.7502	18.11/0.7265	18.95/0.7312
Test1200 [82]	25.07/0.7105	21.27/0.6324	23.52/0.7902	28.41/0.8499	27.98/0.8479
Test2800 [27]	25.40/0.7675	21.88/0.9619	23.44/0.8105	29.65/0.8964	29.27/0.8897
FiveK [5]	17.19/0.7864	17.33/0.7005	17.72/0.7805	18.94/0.8032	19.50/0.8337
LoL v1 [71]	7.77/0.1935	7.76/0.1887	8.32/0.2434	9.43/0.3634	7.96/0.2134
LoL v2 [77]	11.05/0.4467	10.93/0.4086	13.02/0.5814	13.03/0.6542	13.21/0.6206
DPDD [1]	20.32/0.5338	21.29/0.6129	19.88/0.6912	19.82/0.6825	21.92/0.7143
SOTS [37]	22.92/0.9063	16.25/0.7639	22.22/0.9065	21.49/0.9160	24.21/0.9406

Table 13: Results of Conv models on image denoising, JPEG artifact removal, and desnowing, reported as PSNR/SSIM. Red means the best and blue means the second best.

Datasets	Setting	Original	ConStyle	Pre-train	Class	ConStyle v2
CBSD68 [51]	15	24.82/0.5730	24.56/0.5976	19.22/0.6107	32.92/0.9059	33.09/0.9047
	25	20.54/0.3950	21.27/0.4416	17.22/0.4439	28.33/0.8345	30.12/0.8354
	50	15.01/0.1999	16.16/0.2436	15.77/0.2747	21.74/0.6674	25.16/0.6830
Urban100 [32]	15	24.85/0.6123	23.25/0.6137	11.72/0.4582	29.34/0.8898	32.38/0.9150
	25	20.69/0.4569	20.48/0.4843	12.96/0.3982	25.31/0.8214	29.48/0.8608
	50	15.18/0.2676	15.84/0.3055	13.97/0.2979	18.43/0.6606	24.40/0.7230
LIVE1 [62]	10	24.95/0.7396	23.59/0.6604	25.06/0.7449	26.93/0.7836	26.80/0.7787
	20	26.53/0.8191	25.03/0.7324	26.40/0.8170	29.30/0.8555	29.22/0.8526
	30	27.32/0.8542	25.69/0.7619	26.40/0.8408	30.61/0.8864	30.54/0.8843
	40	27.83/0.8745	26.12/0.7793	26.40/0.8485	31.50/0.9035	31.40/0.9017
Snow100K [48]	S	26.70/0.8268	23.46/0.7341	24.72/0.8582	31.55/0.9129	31.85/0.9243
	M	24.79/0.7978	21.97/0.7103	24.34/0.8427	31.14/0.9200	30.43/0.9106
	L	22.22/0.6906	18.56/0.6144	23.09/0.7833	27.15/0.8589	26.12/0.8456
CSD [13]		15.78/0.7259	14.56/0.6666	15.66/0.7367	13.66/0.6878	16.63/0.7515