Make Explicit Calibration Implicit:
Calibrate Denoiser Instead of the Noise Model
Abstract
Explicit calibration-based methods have dominated RAW image denoising under extremely low-light environments. However, these methods are impeded by several critical limitations: a) the explicit calibration process is both labor- and time-intensive, b) denoisers are difficult to transfer across different camera models, and c) the disparity between synthetic and real noise is exacerbated by digital gain. To address these issues, we introduce a pipeline named Lighting Every Darkness (LED), which is effective regardless of the digital gain or the camera sensor. LED eliminates the need for explicit noise model calibration, instead utilizing an implicit fine-tuning process that allows quick deployment and requires minimal data. Structural modifications are also included to reduce the discrepancy between synthetic and real noise without extra computational demands. Our method surpasses existing methods on various camera models, including new ones not in public datasets, with just a few pairs per digital gain and only 0.5% of the typical iterations. Furthermore, LED allows researchers to focus more on deep learning advancements while still utilizing sensor engineering benefits. Code and related materials can be found at https://srameo.github.io/projects/led-iccv23/.
Index Terms:
Extreme low-light imaging, few-shot learning, deep low-light image denoising, low-light denoising dataset.
1 Introduction
Noise, an inescapable topic for image capturing, has been systematically investigated in recent years [2, 3, 4, 5, 6, 7, 8]. Compared to standard RGB images, RAW images offer two substantial advantages for image denoising: tractable, primitive noise distribution [8] and higher bit depth for differentiating signal from noise. Learning-based methodologies have demonstrated remarkable advancements in RAW image denoising, particularly when utilizing paired real datasets [9, 10, 11, 12]. However, creating extensive real RAW image datasets tailored to each camera model is impractical. Consequently, there has been a growing focus on applying learning-based techniques to synthetic datasets, a trend reflected in various studies [13, 14, 15, 8, 16, 17, 18].
Calibration-based noise synthesis, particularly when employing physics-based models, has demonstrated its proficiency in accurately fitting real noise characteristics [19, 8, 16, 20, 21, 22]. These methods typically adhere to a systematic process. Initially, they construct a well-designed noise model that aligns with the electronic imaging pipeline. Subsequently, a specific target camera is chosen, and the parameters of the pre-defined noise model are meticulously calibrated. The final step involves generating synthetic paired data for training a denoising network. Moreover, some approaches have been exploring the use of Deep Neural Network (DNN)-based generative models to facilitate the calibration of noise parameters [20, 21].
Despite their notable achievements, current methods encounter three principal limitations, as depicted in Fig. 2 (b). 1) Explicit camera-specific noise model calibration is time-consuming and labor-intensive, requiring specialized data collection with a consistent illumination environment and comprehensive post-processing. 2) Each denoising network (denoiser) is tailored for a specific camera model. This coupling makes it hard to adapt to different cameras, requiring repeated calibration and training for each distinct target camera. 3) The noise model trained with synthetic-only data may not encompass certain noise distributions, leading to what is termed out-of-model noise [8, 16, 22]. In other words, a domain gap persists between Synthetic Noise (SN) and Real Noise (RN). While recent advancements [21] have concentrated on reducing calibration costs through DNN-based methods, the coupling between networks and cameras and the out-of-model noise continue to increase training expenses and constrain overall performance.
We introduce an innovative pipeline, LED, for lighting every darkness, addressing the identified shortcomings of calibration-based methods. As illustrated in Fig. 2 (c), our framework eliminates the necessity for calibration data and operations related to the noise model. To sever the strong dependency between the denoising network and a specific target camera, we propose a dual-stage approach: pre-training with a virtual camera set, followed by fine-tuning with few-shot pairs from a specific real camera. ("Virtual" cameras do not correspond to any real camera models but carry reasonable noise parameters of the pre-defined noise model. They are sampled from a parameter space with our proposed sampling strategy. Details can be found in Sec. 3.2.) This strategy effectively decouples the network from being bound to a single camera model. Concerning the disparity between a virtual and a target camera and the challenges posed by out-of-model noise, we introduce the Re-parameterized Noise Removal (RepNR) block. During the pre-training stage, the RepNR block has several camera-specific alignments (CSA). Each CSA is responsible for learning the camera-specific information of a single virtual camera and aligning features to a shared space. Then, the common knowledge of in-model noise (components that have been assumed as part of the noise model) is learned by a shared denoising convolution. In the fine-tuning stage, we average all the CSAs of virtual cameras as initialization of the target camera. Additionally, we integrate a parallel convolution branch for Out-of-Model Noise Removal (OMNR). During the fine-tuning stage, LED implicitly "calibrates" the parameters of the denoiser, especially the CSAs, instead of explicitly calibrating the noise model. Only 2 pairs for each ratio (additional digital gain) captured by the target camera, 6 raw image pairs in total, are used for learning to remove real noise (a discussion on why 2 pairs for each ratio suffice can be found in Sec. 6).
During deployment, all the RepNR blocks can be structurally reparameterized [24, 25, 26] into a straightforward convolution without any extra computational cost, yielding a plain UNet [27].
To comprehensively evaluate the efficacy of LED across diverse camera models, we introduce a novel dataset specifically tailored for Multi-camera and dark scene RAW image denoising, referred to as MultiRAW. This dataset is distinct in that it includes five different camera models that have never appeared before. A notable feature of MultiRAW is its encompassment of various sensor sizes, ranging from full-frame cameras to APS-C format cameras, offering a more expansive and realistic testing ground. Furthermore, MultiRAW dataset will be used in the CVPR 2024 and subsequent MIPI (Mobile Intelligent Photography & Imaging) workshops. This utilization underscores its significance and potential impact in advancing the field of RAW image denoising, particularly in scenarios characterized by extremely low light conditions.
Previous methods primarily focused on constructing noise models and calibrating noise parameters, namely sensor-related engineering. In contrast, LED focuses on deep learning techniques such as few-shot and transfer learning. Our method nevertheless does not deviate from traditional noise modeling, which can still empower the pre-training stage of LED.
Our principal contributions are concisely encapsulated as follows:
• We introduce a novel, implicit "calibration" pipeline for lighting every darkness, eliminating the need for additional calibration-related expenses for noise parameter calculation.
• The implementation of Camera-Specific Alignments (CSA) mitigates the dependence of the denoising network on specific camera models. At the same time, the Out-of-Model Noise Removal (OMNR) mechanism facilitates few-shot transfer by learning the out-of-model noise of different sensors.
• We release a new dataset, MultiRAW, encompassing various camera models, assorted scenes, and varying brightness levels. This dataset substantially enriches the current landscape of open-source datasets and addresses the prevalent limitation of limited camera variety.
• Remarkably, our method requires only 2 RAW image pairs for each ratio and a mere 0.5% of the iterations typically needed by state-of-the-art methods (Fig. 1).
Compared to the ICCV 2023 [1] version, this journal extension includes several notable expansions. 1) Experiments (Sec. 5.5) demonstrate that our method can be seamlessly integrated with various existing network architectures and explicit calibration methods, showcasing the broad applicability of our proposed pipeline. 2) A discussion is provided on whether the network employs a noise prior or an image prior during denoising (detailed in Sec. 6), serving as guidance for further research. 3) We provide a detailed process for few-shot dataset collection and its considerations, laying the groundwork for widespread adoption of our implicit calibration pipeline, LED. 4) Following the collection process in 3), we introduce a new dataset, MultiRAW, featuring various camera models (not included in prior public datasets), multiple additional digital gains, and two different ISO configurations for each setting. 5) We plan to invigorate the RAW image denoising community by hosting a Few-shot RAW Image Denoising competition with the proposed MultiRAW dataset at the CVPR 2024 workshop: Mobile Intelligent Photography & Imaging.
2 Related Work
The issue of image capture in extremely dark scenes has received widespread attention from numerous camera/smartphone manufacturers. This section will revisit denoising techniques such as training with paired data and methods based on noise model calibration.
2.1 Training with Paired Real Data.
The field of RAW data exploitation for image denoising has its roots in the groundbreaking work of the SIDD project [6]. Progress in this area has recently broadened to encompass traditional light image denoising and the more complex challenges inherent in extremely low-light conditions. This expansion is illustrated by notable studies such as SID [7] and ELD [8]. While methodologies based on real noise have yielded encouraging results [28, 29, 30, 31, 32, 33], their widespread application is hampered by the considerable effort required to compile extensive datasets of paired low and high-quality images. To address this, employing training strategies that utilize paired low-quality raw images, exemplified by Noise2Noise [5] and Noise2NoiseFlow [17], offers an effective workaround to the tedious task of assembling noisy-clean image pairs. However, these techniques tend to under-perform in severe noise levels, especially in scenarios with extreme darkness [7, 8].
In this context, our LED aims to advance the understanding and effectiveness of real noise elimination. It incorporates insights from a limited number of paired images taken in extremely low-light conditions, thereby mitigating the data collection challenges associated with such environments.
2.2 Calibration-Based Denoising.
While alleviating the burden of compiling pairwise datasets, synthetic noise-based techniques encounter practical limitations. Common noise models like Poisson and Gaussian significantly diverge from actual noise distributions in extremely low-light conditions [7, 8]. (Denoising under extremely low-light scenarios necessitates the application of additional digital gain, up to 300, to the input, thereby intensifying the domain gap between real and synthetic noise.) In response, explicit calibration-based methods, simulating each noise component in electronic imaging pipelines [34, 35, 36, 37, 38], have thrived due to their reliability.
ELD [8] proposed a noise model that closely aligns with real noise characteristics, achieving notable performance in dark scenarios. Zhang et al. [16] acknowledged the complexity of modeling signal-independent noise sources and proposed a method that randomly samples such noise from dark frames. However, it still necessitates calibration for signal-dependent noise parameters (overall system gain). Monakhova et al. [20] devised a noise generator combining physics-based noise models with a generative adversarial framework [39]. Zou et al. [21] pursued more accurate and concise calibration by employing contrastive learning [40, 41] for parameter estimation.
Despite the impressive performance achieved by calibration-based methods, certain challenges persist. Stable illumination environments (e.g., consistent brightness and temperature), calibration-specific data collection (e.g., multiple images for each camera setting), and intricate post-processing tasks (e.g., alignment, localization, and statistical analyses) are prerequisites for precisely estimating noise parameters. Furthermore, repeated calibration and training processes are essential for distinct cameras, owing to the diversity of parameters and the nonuniform pre-defined noise model [42, 36, 38, 43]. Additionally, the domain gap between synthetic and real noise is not adequately addressed.
Our LED overcomes these challenges by replacing explicit calibration of the noise model with implicit calibration of the denoiser, using a pre-training and fine-tuning framework together with a RepNR block designed for noise removal.
2.3 From Synthetic to Real Noise.
The domain gap between real and synthetic noise, a fundamental challenge, becomes particularly pronounced when models trained on synthetic data are tested on real-world data. To bridge this gap, recent research has increasingly focused on employing techniques like Adaptive Instance Normalization (AdaIN) [44, 45] and few-shot learning [46, 47, 48], along with transfer learning [23] and domain adaptation [49] strategies. However, these approaches often struggle in extremely dark environments where the numerical instability caused by intense noise and high digital gain can impair signal reconstruction.
To address this, our framework introduces a novel camera-specific alignment strategy. This method reduces numerical instability and effectively separates camera-specific characteristics from the general attributes of the noise model. Moreover, unlike instance or layer normalization [50, 51], our alignment operations can be reparameterized into a straightforward convolution, similar to custom batch normalization [52]. This reparameterization ensures that our approach does not incur any additional computational burden.
3 Method
This section commences with an overview of the complete pipeline for our proposed raw image denoising with implicit calibration. Subsequently, we introduce our Reparameterized Noise Removal (RepNR) block. The comprehensive denoising pipeline is illustrated in Fig. 3.
3.1 Preliminaries and Motivation
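In raw image space, the captured signal $D$ is conventionally regarded as the sum of the clean image $I$ and various noise components $N$, expressed as Eqn. (1).
$$D = I + N, \tag{1}$$
where $N$ is assumed to follow a noise model,
$$N = N_{shot} + N_{read} + N_{row} + N_{quant} + \epsilon, \tag{2}$$
with $N_{shot}$, $N_{read}$, $N_{row}$, $N_{quant}$, and $\epsilon$ representing shot noise, read noise, row noise, quantization noise, and out-of-model noise, respectively. Apart from the out-of-model noise, other noise components are sampled from specific distributions:
$$\begin{aligned} N_{shot} + I &\sim \mathcal{P}\left(\frac{I}{K}\right) K, \\ N_{read} &\sim TL(\lambda; \mu_c, \sigma_{TL}), \\ N_{row} &\sim \mathcal{G}(0, \sigma_r), \\ N_{quant} &\sim \mathcal{U}\left(-\tfrac{1}{2}, \tfrac{1}{2}\right), \end{aligned} \tag{3}$$
where $K$ denotes the overall system gain. Here, $\mathcal{P}$, $\mathcal{G}$, and $\mathcal{U}$ represent Poisson, Gaussian, and uniform distributions, respectively. $TL(\lambda; \mu, \sigma)$ stands for the Tukey-lambda distribution [53] with shape $\lambda$, mean $\mu$, and standard deviation $\sigma$. Based on the assumption in ELD [8], a linear relationship governs the joint distribution of $(K, \sigma_{TL})$ and $(K, \sigma_r)$, expressed as:
$$\begin{aligned} \log(K) &\sim \mathcal{U}\left(\log(K_{min}), \log(K_{max})\right), \\ \log(\sigma_{TL}) \mid \log(K) &\sim \mathcal{G}\left(a_{TL}\log(K) + b_{TL}, \hat{\sigma}_{TL}\right), \\ \log(\sigma_r) \mid \log(K) &\sim \mathcal{G}\left(a_r\log(K) + b_r, \hat{\sigma}_r\right), \end{aligned} \tag{4}$$
where $K_{min}$ and $K_{max}$ denote the range of the overall system gain, determined by the minimal and maximum ISO value. $a$, $b$, and $\hat{\sigma}$ indicate the line's slope, bias, and an unbiased estimator of the standard deviation, respectively. In this context, a camera can be approximated as a ten-dimensional coordinate $\mathcal{C} \in \mathbb{R}^{10}$:
$$\mathcal{C} = (K_{min}, K_{max}, \lambda, \mu_c, a_{TL}, b_{TL}, \hat{\sigma}_{TL}, a_r, b_r, \hat{\sigma}_r). \tag{5}$$
Existing methods predominantly rely on explicit calibration to determine the coordinate $\mathcal{C}$, especially the linear relationship. It is a process characterized by intensive labor and a substantial domain gap (i.e., the gap between simulated noise and real noise). Moreover, the entanglement between neural networks and cameras requires repeated explicit calibration and training. In our implementation, these distributions and linear relationships are defined similarly to ELD [8]. However, more advanced noise models can also be employed as replacements to achieve theoretically superior performance.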
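To make the noise model of Sec. 3.1 concrete, the following is a minimal NumPy sketch of how one noisy frame could be synthesized from a clean frame under that model. It is an illustration under stated assumptions, not the released implementation: the function names, the parameter values, and the use of the Tukey-lambda quantile function for sampling are all hypothetical choices, and `sigma_tl` is treated as a scale factor rather than an exact standard deviation.

```python
import numpy as np

def tukey_lambda(rng, lam, size):
    """Draw unit-scale Tukey-lambda samples through the quantile function."""
    p = rng.uniform(1e-6, 1.0 - 1e-6, size)
    if abs(lam) < 1e-8:
        # The lam -> 0 limit of the quantile function is the logistic one.
        return np.log(p / (1.0 - p))
    return (p ** lam - (1.0 - p) ** lam) / lam

def synthesize_noisy(clean, K, lam, mu_c, sigma_tl, sigma_r, rng):
    """clean: (H, W) array in digital numbers. Returns one noisy raw frame."""
    h, w = clean.shape
    # Shot noise: Poisson photon counts scaled back by the system gain K.
    # This term already contains the clean signal (it equals I + N_shot).
    shot = rng.poisson(np.clip(clean, 0, None) / K) * K
    # Read noise: Tukey-lambda with location mu_c and scale sigma_tl (assumption).
    read = mu_c + sigma_tl * tukey_lambda(rng, lam, (h, w))
    # Row noise: one Gaussian offset per row, broadcast along that row.
    row = rng.normal(0.0, sigma_r, (h, 1))
    # Quantization noise: uniform within half a digital number.
    quant = rng.uniform(-0.5, 0.5, (h, w))
    return shot + read + row + quant

# Hypothetical parameter values, chosen only for illustration.
rng = np.random.default_rng(0)
clean = np.full((64, 64), 100.0)
noisy = synthesize_noisy(clean, K=2.0, lam=0.1, mu_c=0.0,
                         sigma_tl=2.0, sigma_r=1.0, rng=rng)
```

The out-of-model term is deliberately absent here: by definition it is the part of real noise that such a synthesizer cannot produce.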
We aim to streamline the complex calibration process and mitigate the strong coupling between networks and cameras. Additionally, we address the out-of-model noise comprehensively, a task facilitated by the structural modifications introduced in the RepNR block. Our motivation is to compel the network to function as a swift adapter [54, 55].
3.2 Pre-train with Camera-Specific Alignment
Preprocessing. We initiate the pre-training stage using virtual cameras to induce the network to function as a fast adapter. Given the number of virtual cameras $m$ and the parameter space (formulated as $\mathcal{S}$), for the $k$-th camera, we select the $k$-th $m$-bisection points for each parameter range and combine them to construct a virtual camera. Augmenting the data with synthetic noise, we can pre-train our network based on multiple virtual cameras, compelling the network to acquire common knowledge.
Camera-Specific Alignment. As depicted in Fig. 3, within the pre-training process, we introduce our Camera-Specific Alignment (CSA) module, which focuses on adjusting the distribution of input features. In the baseline model, a convolution followed by leaky-ReLU [56] constitutes the primary component. A multi-path alignment layer is inserted before each convolution of the network to align features from different virtual cameras into a shared space. The $k$-th path is the CSA corresponding to the $k$-th camera, aligning the $k$-th camera-specific feature distribution into a shared space. Let the feature of the $k$-th virtual camera be $F^k$. Formally, the $k$-th branch contains a weight $W^k$ and a bias $b^k$, each holding one scalar per channel, performing a channel-wise linear projection, denoted by $Y^k = W^k F^k + b^k$. $W^k$ is initialized as $\mathbf{1}$ and $b^k$ as $\mathbf{0}$, so the alignment has no effect on the convolution at the beginning.
During training, data augmented by the noise of the $k$-th virtual camera is fed into the $k$-th path for alignment and then into a shared convolution for further processing. The detailed pre-training pipeline is described in Algorithm 1.
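The virtual-camera construction can be sketched as follows. This is one possible reading, not necessarily the authors' exact rule: each parameter range is split into as many equal segments as there are virtual cameras, and each camera takes the midpoint of its own segment in every range. The parameter names and ranges below are hypothetical placeholders.

```python
def virtual_cameras(param_space, m):
    """Build m virtual cameras from {name: (low, high)} parameter ranges.

    Assumption: camera k takes the midpoint of the k-th of m equal
    segments of every range; the text only says "bisection points".
    """
    cameras = []
    for k in range(m):
        t = (k + 0.5) / m  # relative position inside each range
        cameras.append({name: low + t * (high - low)
                        for name, (low, high) in param_space.items()})
    return cameras

# Hypothetical ranges, for illustration only.
space = {"log_K": (-1.0, 1.0), "lam": (-0.3, 0.3)}
cams = virtual_cameras(space, 5)
```

Sampling every parameter at the same relative position keeps the virtual cameras spread along the whole space while using only one camera per segment.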
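A minimal NumPy sketch of the multi-path alignment layer described above is given below. The class name and array layout are assumptions made for illustration. Each path is a channel-wise affine map, and the identity initialization means the layer initially leaves the features untouched.

```python
import numpy as np

class MultiBranchCSA:
    """One channel-wise affine map per virtual camera (illustrative sketch)."""

    def __init__(self, num_cameras, channels):
        # Identity initialization: weight 1 and bias 0 for every path.
        self.weight = np.ones((num_cameras, channels))
        self.bias = np.zeros((num_cameras, channels))

    def __call__(self, feat, k):
        """feat: (C, H, W) feature from the k-th virtual camera."""
        w = self.weight[k][:, None, None]
        b = self.bias[k][:, None, None]
        return w * feat + b

rng = np.random.default_rng(0)
csa = MultiBranchCSA(num_cameras=3, channels=4)
feat = rng.normal(size=(4, 8, 8))
out_init = csa(feat, k=1)

# Changing one path (as training would) leaves the other paths untouched.
csa.weight[1] *= 2.0
out_cam0 = csa(feat, k=0)
out_cam1 = csa(feat, k=1)
```

In the pipeline, the output of the selected path would then go through the convolution shared by all cameras.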
3.3 Fine-tune with Few-shot RAW Image Pairs
Following the pre-training process, the model is intended for deployment in realistic denoising tasks. We advocate for a few-shot strategy, specifically employing only 6 pairs (2 pairs for each of the three ratios) of raw images to fine-tune the pre-trained model. We assume that convolutions have acquired sufficient capability to handle features aligned by CSAs. The convolutions remain frozen during subsequent fine-tuning to maximize the utilization of the model parameters obtained from pre-training. For addressing real noise, we substitute the multi-branch CSA with a new CSA layer, denoted as CSA$^T$ (CSA for the target camera). Unlike the multi-branch CSA during pre-training, the CSA$^T$ layer is initialized by averaging the pre-trained CSAs for improved generalization. The CSA$^T$ followed by a convolution mentioned above is called the in-model noise removal branch (IMNR).
Nevertheless, real noise encompasses the modeled part and some out-of-model noise. Since our CSA$^T$ layer is specifically designed for aligning features augmented by synthetic noise, a gap still exists between real noise and the noise that IMNR can handle (i.e., $\epsilon$ in Eqn. (2)). Therefore, we propose an out-of-model noise removal branch (OMNR) to learn the gap between real noise and the modeled components. We treat the OMNR component as a parallel branch alongside the IMNR branch, since previous research has demonstrated the efficacy of parallel convolution branches in transfer and continual learning [57]. OMNR comprises only a convolution, aiming to capture the structural characteristics of real noise from few-shot raw image pairs. Given the absence of prior information on the noise remainder $\epsilon$, we initialize the weights and bias of OMNR as tensors of $\mathbf{0}$. Combining IMNR with OMNR yields the proposed RepNR block. It is more reasonable to first learn in-model noise and subsequently address out-of-model noise. Therefore, we divide the optimization process into two steps: initially training IMNR and subsequently training OMNR. Following this approach, the iterations of two-step fine-tuning only account for 0.5% of the pre-training, rendering it highly feasible for practical implementation. The detailed fine-tuning pipeline is described in Algorithm 2.
Analysis on the Initialization of CSA$^T$. As mentioned in Sec. 3.3, we initialize CSA$^T$ by averaging the pre-trained CSAs in the multi-branch CSA layer. Given that every path shares the convolution in the multi-branch CSA, this initialization can be conceptualized as the ensemble of $m$ models, where $m$ is the number of paths, like (a)-(c) in Fig. 4. According to studies [58, 59, 60], the weighted average of different models can significantly enhance the model's generalization. This aligns with our objective of generalizing the model to the target noisy domain.
Another rationale for this approach is that CSAs are largely determined by the coordinates $\mathcal{C}$. From this perspective, the average of different CSAs can be considered the center of gravity of these coordinates. Moreover, the coordinates of test cameras, both in SID [7] and ELD [8], are encompassed in the parameter space $\mathcal{S}$. In such circumstances, averaging the pre-trained CSAs is a sound starting point. However, even if coordinates are not in the pre-defined parameter space (as in our MultiRAW dataset), LED can still achieve SOTA performance with a few more iterations during fine-tuning.
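The ensemble view of this initialization can be checked numerically. The sketch below (illustrative names and random values, not trained parameters) averages the per-camera CSA parameters and verifies that the resulting single CSA produces the same output as the average of all per-camera CSA outputs. Because the following convolution is shared and linear, the same equality carries through it.

```python
import numpy as np

def average_csa(weights, biases):
    """weights, biases: (m, C) per-camera CSA parameters -> target-camera init."""
    return weights.mean(axis=0), biases.mean(axis=0)

rng = np.random.default_rng(0)
m, c = 5, 8
# Stand-ins for pre-trained CSA parameters (random, for illustration only).
weights = 1.0 + 0.1 * rng.normal(size=(m, c))
biases = 0.1 * rng.normal(size=(m, c))
feat = rng.normal(size=(c, 4, 4))

w_t, b_t = average_csa(weights, biases)
out_avg_params = w_t[:, None, None] * feat + b_t[:, None, None]

# Ensemble: run every per-camera CSA, then average the outputs.
out_ensemble = np.mean(
    [w[:, None, None] * feat + b[:, None, None] for w, b in zip(weights, biases)],
    axis=0,
)
```

The equality holds only up to the first nonlinearity, so for the full network the ensemble interpretation is an approximation.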
3.4 Deploy
Upon completion of fine-tuning, the deployment of the model holds paramount importance for future applications. Directly substituting the convolution with our RepNR block would inevitably increase the number of parameters and the computational workload. However, our RepNR block solely comprises serial and parallel linear mappings. Additionally, the receptive field of each branch in the RepNR block is $3 \times 3$. Therefore, employing the structural reparameterization technique [61, 24, 25], our RepNR block can be transformed into a plain convolution during deployment, as illustrated in Fig. 4 (d). This implies that our model incurs no additional costs in the application process and facilitates a fair comparison with other methods. Regarding parallel reparameterization techniques, please refer to previous works [61, 24, 25, 62, 63]. Here, we primarily introduce the serial reparameterization techniques we employed.
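For completeness, the parallel case can be illustrated with a short NumPy sketch: two convolutions applied to the same input and summed are equivalent to a single convolution whose kernel and bias are the sums of the two. The helper `conv3x3` and all tensor shapes are assumptions made for this example, and the kernel size of 3 is also assumed.

```python
import numpy as np

def conv3x3(x, w, b):
    """Zero-padded 'same' 3x3 convolution (cross-correlation).

    x: (Cin, H, W), w: (Cout, Cin, 3, 3), b: (Cout,).
    """
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for u in range(3):
        for v in range(3):
            # Contribution of kernel tap (u, v), mixed over input channels.
            out += np.einsum("oi,ihw->ohw", w[:, :, u, v], xp[:, u:u + h, v:v + wd])
    return out + b[:, None, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))
w_in, b_in = rng.normal(size=(6, 4, 3, 3)), rng.normal(size=6)   # in-model branch
w_om, b_om = rng.normal(size=(6, 4, 3, 3)), rng.normal(size=6)   # out-of-model branch

two_branches = conv3x3(x, w_in, b_in) + conv3x3(x, w_om, b_om)
merged = conv3x3(x, w_in + w_om, b_in + b_om)

# With a zero-initialized parallel branch, the block equals the first branch.
at_init = conv3x3(x, w_in + np.zeros_like(w_om), b_in + np.zeros_like(b_om))
```

This merge is exact everywhere, including image borders, because both branches see the same padded input.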
Sequential Reparameterization. The reparameterization process can be denoted as the following equation:
$$W^{rep} = W^{conv}\,\mathrm{diag}(W^{csa}), \qquad b^{rep} = W^{conv} \circledast \mathrm{pad}(b^{csa}) + b^{conv}, \tag{6}$$
where $\mathrm{diag}(\cdot)$ and $\mathrm{pad}(\cdot)$ denote transforming a $c$-dimensional vector into a $c \times c$ diagonal matrix and replicate-padding a $c$-dimensional vector into a $c \times 3 \times 3$ matrix, respectively. The product with $\mathrm{diag}(W^{csa})$ is taken along the input-channel dimension, and $\circledast$ denotes convolution. $W^{csa}$, $W^{conv}$, and $W^{rep}$ stand for the weight of the CSA, the convolution, and the reparameterized weight, respectively, and $b^{csa}$, $b^{conv}$, and $b^{rep}$ stand for the corresponding biases.
Since our CSA operator solely comprises channel-wise operations, it is necessary to first transform it into a regular convolution using the $\mathrm{diag}(\cdot)$ operator during reparameterization. It is worth noting that such reparameterization is only an approximation of the original serial computation. To ensure consistency during training and testing, we employed the online reparameterization technique [64]. This technique allows for reparameterization during training and was originally intended to save GPU memory. However, our primary goal in using it is to ensure consistency between training and testing. More details can be found in our Github repo [65].
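The serial folding, and the reason it is only approximate, can be illustrated with the NumPy sketch below. The helper names and shapes are assumptions for this example, and zero padding with a 3x3 kernel is assumed. A channel-wise affine map followed by a convolution is folded into one convolution; the two agree at interior pixels and differ at the border, where zero padding hides the alignment bias from the original serial path.

```python
import numpy as np

def conv3x3(x, w, b):
    """Zero-padded 'same' 3x3 convolution. x: (Cin, H, W), w: (Cout, Cin, 3, 3)."""
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for u in range(3):
        for v in range(3):
            out += np.einsum("oi,ihw->ohw", w[:, :, u, v], xp[:, u:u + h, v:v + wd])
    return out + b[:, None, None]

def fold_csa_into_conv(w_csa, b_csa, w_conv, b_conv):
    """Fold y = w_csa * x + b_csa (per channel) into the following convolution."""
    # Scale each input-channel slice of the kernel by the alignment weight.
    w_rep = w_conv * w_csa[None, :, None, None]
    # Push the alignment bias through all nine taps of the kernel.
    b_rep = b_conv + np.einsum("oiuv,i->o", w_conv, b_csa)
    return w_rep, b_rep

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))
w_csa, b_csa = rng.normal(size=4), rng.normal(size=4)
w_conv, b_conv = rng.normal(size=(6, 4, 3, 3)), rng.normal(size=6)

serial = conv3x3(w_csa[:, None, None] * x + b_csa[:, None, None], w_conv, b_conv)
w_rep, b_rep = fold_csa_into_conv(w_csa, b_csa, w_conv, b_conv)
folded = conv3x3(x, w_rep, b_rep)

interior_match = np.allclose(serial[:, 1:-1, 1:-1], folded[:, 1:-1, 1:-1])
border_match = np.allclose(serial, folded)
```

Training directly with the folded form is one way to make this border discrepancy irrelevant, since the network then never sees the unfolded computation.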
4 Dark RAW Images (MultiRAW) Dataset
In this section, we introduce the MultiRAW dataset, details related to data collection (to guide the deployment of LED to any other cameras), and the availability and limitations of the data. Note that the description in this section has been simplified as much as possible to facilitate a more comfortable and rapid deployment of LED on any other camera models.
4.1 Overview of the MultiRAW Dataset
To further validate the effectiveness of LED across different cameras, we introduce the MultiRAW dataset. Compared to existing datasets, our MultiRAW dataset has the following advantages:
• Multi-Camera Data: To further demonstrate the effectiveness of LED across different cameras (corresponding to different noise parameters, i.e., coordinates $\mathcal{C}$), our dataset includes five distinct models not covered in existing datasets. Additionally, MultiRAW includes both full-frame cameras and APS-C format cameras with smaller sensor areas, which often exhibit stronger noise characteristics.
• Varied Illumination Settings: The dataset contains data under five different illumination ratios (, , , , and ), each representing varying levels of denoising difficulty.
• Dual ISO Configurations: There are two different ISO settings for each scene and illumination setting. These can be used not only for the fine-tuning stage of the LED method but also for testing the algorithm's robustness under different illumination settings.
In addition to the three highlighted points, the MultiRAW dataset spans 30 indoor scenes, featuring diverse backgrounds and varying types and quantities of objects being photographed. It includes seven different ISO settings ranging from 200 to 6400. The hardest example in our dataset resembles the image captured at a "pseudo" ISO up to 960,000 (). We captured a 5-image burst per setting to collect a broader range of noise samples for each ISO configuration under every illumination setting. This approach provides more test data pairs and lays the groundwork for burst raw image denoising in extremely dark environments. We also captured data for explicit calibration to reproduce existing calibration-based methods for a full evaluation.
Most existing datasets directly use low ISO and long exposure images as ground truth because the noise produced at low ISO settings is often negligible in full-frame cameras. However, since our shooting equipment includes APS-C format cameras with smaller sensor areas, we need to additionally perform multi-frame averaging denoising on low ISO and long exposure images (4 frames in our implementations). Therefore, we collected a total of noisy images and images for creating ground-truths, comprising pairs of data for both training and evaluation.
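The benefit of multi-frame averaging for ground-truth creation can be illustrated with a small NumPy simulation. The scene, noise level, and function name are hypothetical, and the noise is simplified to independent zero-mean Gaussian noise per frame. Under that assumption, averaging 4 aligned frames should roughly halve the residual noise standard deviation.

```python
import numpy as np

def average_frames(frames):
    """Average a burst of aligned frames of identical shape."""
    return np.mean(np.asarray(frames, dtype=np.float64), axis=0)

rng = np.random.default_rng(0)
clean = np.full((128, 128), 50.0)   # hypothetical static scene
sigma = 4.0                          # hypothetical per-frame noise level
frames = [clean + rng.normal(0.0, sigma, clean.shape) for _ in range(4)]

gt = average_frames(frames)
single_std = float(np.std(frames[0] - clean))
avg_std = float(np.std(gt - clean))
ratio = avg_std / single_std        # expected to be close to 1 / sqrt(4)
```

The square-root law only holds if the frames are well aligned and the scene is static, which is why the capture instructions stress remote control and stable lighting.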
4.2 Instructions on Data Collection
To ensure the quality of the dataset, special attention must be paid to lighting, alignment, and environmental factors during the shooting process:
• Lighting: To ensure consistent lighting conditions for the images, it is often necessary to supplement environmental lighting or adjust the aperture. This allows for correct exposure in low ISO and long exposure scenarios.
• Alignment: Remote control is essential to prevent misalignment issues. Additionally, to avoid camera shake caused by the mechanical shutter during photography, the camera should be set to electronic shutter mode for shooting.
• Temperature: To prevent the increase in camera temperature caused by continuous shooting (which typically leads to increased noise variance), it is necessary to set the interval between continuous shots to 5 seconds or more.
Moreover, to provide more information on signal-dependent noise (shot noise) for the fine-tuning of LED, the scenes photographed should have a wide variety of colors.
Categories | Methods | Extra Data Requirements | Iterations (K) | ×100 PSNR | ×100 SSIM | ×250 PSNR | ×250 SSIM | ×300 PSNR | ×300 SSIM
DNN Model Based | Monakhova et al. [20] | 1800 noisy-clean pairs | 327.6 | 38.7799 | 0.9120 | 34.4924 | 0.7900 | 31.2971 | 0.6990
DNN Model Based | NoiseFlow [13] | 1800 noisy-clean pairs | 777.6 | 37.0200 | 0.8820 | 32.9457 | 0.7699 | 29.8068 | 0.6700
Calibration-Based | Calibrated P-G | 300 calibration data | 257.6 | 39.1576 | 0.8963 | 33.8929 | 0.7630 | 31.0035 | 0.6522
Calibration-Based | ELD [8] | 300 calibration data | 257.6 | 41.8271 | 0.9538 | 38.8492 | 0.9278 | 35.9402 | 0.8982
Calibration-Based | Zhang et al. [16] | 150/150 for calib./database | 257.6 | 40.9232 | 0.9488 | 38.4397 | 0.9255 | 35.5439 | 0.8975
Real Data Based | SID [7] | 1800 noisy-clean pairs | 257.6 | 41.7273 | 0.9531 | 39.1353 | 0.9304 | 37.3627 | 0.9341
Real Data Based | Noise2Noise [5] | 12000 noisy pairs | 257.6 | 39.2769 | 0.8993 | 34.1660 | 0.7824 | 31.0991 | 0.7080
Real Data Based | AINDNet [23] | 300 noisy-clean pairs | 1.5 | 40.5636 | 0.9194 | 36.2538 | 0.8509 | 32.2291 | 0.7397
Real Data Based | AINDNet* | 300 noisy-clean pairs | 1.5 | 39.8052 | 0.9350 | 37.2210 | 0.9101 | 34.5615 | 0.8856
Ours | LED | 6 noisy-clean pairs | 1.5 | 41.9842 | 0.9539 | 39.3419 | 0.9317 | 36.6728 | 0.9147
Cam. | Ratio | Calibrated P-G (PSNR/SSIM) | ELD [8] (PSNR/SSIM) | LED (Ours) (PSNR/SSIM)
Sony A7S2 | | 54.3710/0.9977 | 52.8120/0.9957 | 51.9547/0.9968
Sony A7S2 | | 49.9973/0.9891 | 50.0152/0.9913 | 50.1762/0.9945
Sony A7S2 | | 41.5246/0.8668 | 44.9865/0.9707 | 45.3574/0.9779
Sony A7S2 | | 37.6866/0.7818 | 42.5440/0.9430 | 42.9747/0.9577
Nikon D850 | | 50.6207/0.9949 | 50.5628/0.9925 | 50.6222/0.9939
Nikon D850 | | 48.3461/0.9884 | 48.3667/0.9890 | 48.0684/0.9894
Nikon D850 | | 42.2231/0.9046 | 43.6907/0.9634 | 43.5620/0.9667
Nikon D850 | | 39.0084/0.8391 | 41.3311/0.9364 | 41.3984/0.9482
Canon EOS70D | | 42.7352/0.9915 | 42.4305/0.9900 | 48.5063/0.9924
Canon EOS70D | | 41.0061/0.9841 | 40.6364/0.9833 | 45.4415/0.9842
Canon EOS70D | | 36.7007/0.8700 | 37.7944/0.9255 | 39.5491/0.9360
Canon EOS70D | | 33.3459/0.7942 | 35.1554/0.8703 | 36.2362/0.8948
Canon EOS700D | | 42.0156/0.9900 | 41.9264/0.9881 | 47.7006/0.9910
Canon EOS700D | | 40.7658/0.9791 | 40.5297/0.9758 | 44.8541/0.9815
Canon EOS700D | | 36.7589/0.8697 | 36.9642/0.8937 | 38.3147/0.9206
Canon EOS700D | | 34.3376/0.8063 | 34.9231/0.8534 | 35.1962/0.8717
4.3 Dataset Application and Availability
Our dataset will be used in the Few-shot RAW Image Denoising track at the CVPR 2024 workshop: Mobile Intelligent Photography & Imaging. Following popular benchmarks, we fully release a subset of the data (about 20 scenes of the Canon EOSR10 and Sony A6400 camera models), along with a batch of test data. To prevent overfitting, we only make the images public, with the corresponding ground truths accessible via an online leaderboard on Google CodaLab [67]. A thumbnail of our MultiRAW dataset is illustrated in Fig. 5.
5 Experiments and Analysis
This section offers a comprehensive description of our implementation, details the evaluation metrics and datasets used, presents comparative experiments with other methods, and includes ablation studies to demonstrate the efficacy of our approach.
5.1 Implementation Details
Similar to most denoising methods [14, 68], we utilize the loss function as the training objective. We adopt the same UNet [27] architecture as previous methods for a fair comparison, with the distinction that we replace the convolution blocks inside the UNet with our proposed RepNR block. As mentioned in Sec. 3.4, the RepNR block can be structurally reparameterized into a simple convolution block without incurring additional computational costs. We employ the same data preprocessing and optimization strategy as ELD [8] during pre-training. The raw images with long exposure time in the SID [7] train subset are utilized for noise synthesis. Concerning data preprocessing, we pack the Bayer images into 4 channels, followed by crop** the long exposure data with a patch size of , non-overlap**, step , thereby increasing the iterations of one epoch from to . Our implementation is based on PyTorch [69] and MindSpore [70]. We train the models for 200 epochs (257.6K iterations) using the Adam optimizer [71] with and for optimization, without applying weight decay. The initial learning rate is set to and is halved at the 100th epoch (128.8K iterations) before being further reduced to at the 180th epoch (231.84K iterations).
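The packing step mentioned above can be sketched in NumPy as follows. The function names are illustrative, and the channel order assumes a 2x2 colour filter pattern read row by row (for an RGGB sensor that would be R, G, G, B); the actual order depends on the camera.

```python
import numpy as np

def pack_bayer(raw):
    """(H, W) Bayer mosaic with even H and W -> (4, H/2, W/2) packed planes."""
    return np.stack([
        raw[0::2, 0::2],   # top-left site of every 2x2 cell
        raw[0::2, 1::2],   # top-right site
        raw[1::2, 0::2],   # bottom-left site
        raw[1::2, 1::2],   # bottom-right site
    ])

def unpack_bayer(packed):
    """Inverse of pack_bayer: (4, h, w) planes -> (2h, 2w) mosaic."""
    _, h, w = packed.shape
    raw = np.empty((2 * h, 2 * w), dtype=packed.dtype)
    raw[0::2, 0::2] = packed[0]
    raw[0::2, 1::2] = packed[1]
    raw[1::2, 0::2] = packed[2]
    raw[1::2, 1::2] = packed[3]
    return raw

raw = np.arange(6 * 8, dtype=np.float32).reshape(6, 8)
packed = pack_bayer(raw)
restored = unpack_bayer(packed)
```

Packing halves the spatial resolution but gives the network one plane per colour site, so each channel contains samples of a single colour.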
During fine-tuning, we initially freeze the convolution and average the multi-branch CSAs to initialize the CSA for the target camera. We first train CSA until convergence, which constitutes the implicit calibration process we propose. After CSA has converged, we introduce the out-of-model noise removal (OMNR) branch (a parallel convolution) and freeze all the remaining parameters in our network, as depicted in Fig. 3 ④. Subsequently, we train the OMNR branch until convergence. Different datasets require different numbers of iterations and learning rates, the details of which are described in Sec. 5.2. After completing the training process, we deploy our model by reparameterizing the RepNR blocks into convolutions.
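The deployment step relies on the linearity of convolution: two parallel convolution branches applied to the same input can be merged into one convolution by summing their kernels and biases. The sketch below checks this identity numerically on a single-channel toy input. It is a generic illustration of structural reparameterization under assumed names (`k_main`, `k_omnr`) and random weights, not the RepNR implementation itself.

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain 2D cross-correlation with 'valid' padding (single channel)."""
    kh, kw = k.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
k_main = rng.normal(size=(3, 3))   # hypothetical main-branch kernel
k_omnr = rng.normal(size=(3, 3))   # hypothetical parallel-branch kernel
b_main, b_omnr = 0.1, -0.3         # hypothetical biases

# Training-time form: two branches whose outputs are summed.
two_branch = (conv2d_valid(x, k_main) + b_main) + (conv2d_valid(x, k_omnr) + b_omnr)

# Deployment form: one convolution with merged parameters.
k_merged = k_main + k_omnr
b_merged = b_main + b_omnr
one_branch = conv2d_valid(x, k_merged) + b_merged

max_abs_diff = float(np.max(np.abs(two_branch - one_branch)))
```

Because the merged convolution has the same shape as a single branch, the deployed network costs no more than the plain UNet at inference time.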
5.2 Evaluation Metrics and Datasets
PSNR and SSIM [72] are used as quantitative evaluation metrics for pixel-wise and structural assessment, respectively. Note that the pixel values of low-light raw images usually lie in a smaller range than those of sRGB images, typically after normalization. This can result in a lower mean squared error and a higher PSNR. We evaluate our proposed LED on three RAW-based denoising datasets, namely SID [7], ELD [8], and our proposed MultiRAW.
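The effect of the value range on PSNR can be made concrete with a small sketch. It assumes the standard definition PSNR = 10 log10(peak² / MSE) with a fixed peak of 1.0; the pixel values are made up for illustration. Halving the value range of both images while keeping the peak fixed divides the MSE by four and therefore raises PSNR by about 6.02 dB.

```python
import math

def psnr(clean, noisy, peak=1.0):
    """PSNR in dB for two equally long sequences of pixel values."""
    mse = sum((c - n) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)

# Toy pixel values in [0, 1]; purely illustrative.
clean = [0.10, 0.40, 0.80, 0.95]
noisy = [0.12, 0.37, 0.83, 0.90]

full_range = psnr(clean, noisy)
# Same content squeezed into half the range, peak still 1.0.
half_range = psnr([0.5 * c for c in clean], [0.5 * n for n in noisy])
gain_db = half_range - full_range   # equals 20 * log10(2)
```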
SID [7] dataset. The SID [7] dataset exclusively comprises the Sony A7S2 camera model, yet its test scenes are highly diverse, which makes it well suited to demonstrating the efficacy of the algorithm. Consequently, a substantial number of ablation experiments are based on this dataset. We randomly selected two pairs of data for each additional digital gain (×100, ×250, and ×300), for a total of six pairs, as the few-shot training dataset. Since the coordinate (first mentioned in Eqn. (5)) of the Sony A7S2 is already included in our pre-defined parameter space , the required training strategy can be relatively mild. We first fine-tune CSA using a learning rate of for 1K iterations. Subsequently, we fine-tune the OMNR branch for 500 iterations using a learning rate of .
ELD [8] dataset. The ELD [8] dataset encompasses four camera models: Sony A7S2, Nikon D850, Canon EOS70D, and Canon EOS700D. We used the paired raw images of the first two scenes for fine-tuning the pre-trained network, while the remaining eight scenes were used for evaluation. All the metrics in Tab. II are calculated across the eight scenes for fair comparison. On the ELD [8] dataset, since the coordinates of all four cameras are included in our pre-defined parameter space , the training strategy is the same as for the SID [7] dataset.
MultiRAW dataset. The MultiRAW dataset includes five camera models not previously mentioned: Sony A6400, Canon EOSR10, and three other cameras. Given that this dataset is intended for few-shot raw image denoising, we directly use its training set for fine-tuning. The training strategy on the MultiRAW dataset is somewhat more aggressive because the coordinates of the five camera models in the MultiRAW dataset are not included in our pre-defined parameter space . However, this fully verifies the effectiveness of our proposed LED on unseen camera models. During fine-tuning, we adopt the SGDR [73] learning rate decay strategy. Initially, CSA is trained with a learning rate from to for 1K iterations for rapid convergence. Subsequently, the OMNR branch is trained for 2K iterations with a learning rate from to .
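For reference, a single cycle of the SGDR schedule (without warm restarts) is a cosine decay between a maximum and a minimum learning rate. The sketch below implements that formula; the learning rate bounds and the step count are placeholder values chosen for illustration, not the settings used in the experiments.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max, lr_min):
    """Cosine decay from lr_max (step 0) to lr_min (step == total_steps)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Placeholder values, not taken from the paper.
LR_MAX, LR_MIN, TOTAL_STEPS = 1e-3, 1e-5, 1000

schedule = [
    cosine_annealing_lr(t, TOTAL_STEPS, LR_MAX, LR_MIN)
    for t in range(TOTAL_STEPS + 1)
]
```

The schedule starts at `LR_MAX`, passes through the mean of the two bounds at the halfway point, and ends at `LR_MIN`, which gives large early steps for rapid convergence and small final steps for stability.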
Camera | Ratio | P-G | | | AINDNet* [23] | | | ELD [8] | | | Zhang et al. [16] | | | LED (Ours) | |
Camera | Ratio | PSNR | SSIM | Time | PSNR | SSIM | Time | PSNR | SSIM | Time | PSNR | SSIM | Time | PSNR | SSIM | Time
Canon EOSR10 | | 45.5070 | 0.9895 | 4h 35m 27s | 42.8885 | 0.9749 | 15m 01s | 45.4837 | 0.9786 | 4h 37m 11s | 45.4036 | 0.9865 | 4h 29m 12s | 48.6290 | 0.9918 | 7m 17s
Canon EOSR10 | | 44.7179 | 0.9847 | | 41.8977 | 0.9670 | | 43.4092 | 0.9601 | | 43.9946 | 0.9803 | | 46.3750 | 0.9842 |
Canon EOSR10 | | 39.8212 | 0.9064 | | 39.2519 | 0.9391 | | 40.6755 | 0.9310 | | 41.2814 | 0.9594 | | 41.8574 | 0.9547 |
Canon EOSR10 | | 37.0122 | 0.8130 | | 38.3639 | 0.9279 | | 40.3582 | 0.9439 | | 40.1521 | 0.9486 | | 40.8654 | 0.9456 |
Canon EOSR10 | | 34.5953 | 0.7769 | | 35.7965 | 0.8700 | | 37.7036 | 0.8987 | | 37.6117 | 0.8967 | | 37.7800 | 0.8972 |
Sony A6400 | | 49.3146 | 0.9934 | 4h 23m 15s | 43.5193 | 0.9750 | 15m 15s | 48.9889 | 0.9927 | 4h 39m 27s | 48.3114 | 0.9913 | 4h 29m 32s | 49.0211 | 0.9936 | 7m 19s
Sony A6400 | | 47.7593 | 0.9880 | | 42.7484 | 0.9677 | | 47.1114 | 0.9835 | | 46.6079 | 0.9843 | | 47.4265 | 0.9880 |
Sony A6400 | | 43.6363 | 0.9415 | | 41.0480 | 0.9531 | | 43.1836 | 0.9346 | | 43.3121 | 0.9505 | | 43.7688 | 0.9613 |
Sony A6400 | | 41.3958 | 0.9131 | | 39.8725 | 0.9383 | | 42.0199 | 0.9204 | | 42.1055 | 0.9379 | | 42.5766 | 0.9562 |
Sony A6400 | | 38.1028 | 0.8427 | | 38.0563 | 0.9098 | | 39.5744 | 0.8873 | | 40.2146 | 0.9169 | | 40.3370 | 0.9381 |
Camera3 | | 41.1760 | 0.9798 | 4h 36m 23s | 40.7700 | 0.9594 | 15m 15s | 40.5599 | 0.9796 | 4h 38m 12s | 42.0061 | 0.9790 | 4h 30m 33s | 42.3091 | 0.9816 | 7m 13s
Camera3 | | 40.0307 | 0.9677 | | 39.4657 | 0.9420 | | 39.6185 | 0.9666 | | 40.4674 | 0.9672 | | 40.7769 | 0.9700 |
Camera3 | | 36.2148 | 0.8938 | | 36.1391 | 0.8914 | | 36.7027 | 0.9138 | | 37.2370 | 0.9280 | | 37.4741 | 0.9311 |
Camera3 | | 34.3638 | 0.8487 | | 35.1045 | 0.8783 | | 35.2796 | 0.8791 | | 36.0706 | 0.9045 | | 36.0443 | 0.9130 |
Camera3 | | 30.4170 | 0.7663 | | 31.4775 | 0.7760 | | 31.8913 | 0.8211 | | 32.8985 | 0.8532 | | 33.0504 | 0.8561 |
Camera4 | | 49.2394 | 0.9942 | 4h 36m 20s | 43.7557 | 0.9705 | 15m 08s | 47.9876 | 0.9924 | 4h 38m 15s | 47.4546 | 0.9887 | 4h 30m 30s | 50.1183 | 0.9945 | 7m 19s
Camera4 | | 47.6744 | 0.9895 | | 42.9754 | 0.9636 | | 46.3897 | 0.9811 | | 45.8446 | 0.9768 | | 47.7583 | 0.9895 |
Camera4 | | 41.9510 | 0.9335 | | 39.8534 | 0.9360 | | 42.4956 | 0.9537 | | 42.0030 | 0.9540 | | 41.9648 | 0.9587 |
Camera4 | | 40.5930 | 0.9230 | | 38.7384 | 0.9294 | | 41.0072 | 0.9463 | | 40.3252 | 0.9354 | | 40.5241 | 0.9503 |
Camera4 | | 36.6494 | 0.8391 | | 36.2330 | 0.8915 | | 38.5018 | 0.9108 | | 38.6361 | 0.9231 | | 38.1756 | 0.9209 |
Camera5 | | 48.6019 | 0.9928 | 4h 24m 03s | 42.8059 | 0.9713 | 14m 58s | 47.1503 | 0.9874 | 4h 18m 44s | 46.0550 | 0.9868 | 4h 29m 52s | 46.9796 | 0.9897 | 7m 16s
Camera5 | | 43.4577 | 0.9134 | | 41.6037 | 0.9545 | | 43.5000 | 0.9627 | | 43.9310 | 0.9749 | | 44.5822 | 0.9753 |
Camera5 | | 36.4346 | 0.7930 | | 38.1994 | 0.9081 | | 39.6707 | 0.9040 | | 39.9786 | 0.9321 | | 41.3606 | 0.9478 |
Camera5 | | 32.6378 | 0.7228 | | 36.4481 | 0.8836 | | 37.3455 | 0.8712 | | 37.6322 | 0.9017 | | 39.8046 | 0.9307 |
Camera5 | | 29.2045 | 0.6537 | | 32.9607 | 0.8229 | | 34.5113 | 0.8179 | | 33.9278 | 0.8524 | | 36.4322 | 0.8922 |
5.3 Comparison with State-of-the-art Methods
We assess the performance of our LED on three distinct datasets: the Sony subset of SID [7], the ELD dataset [8], and the 5 subsets in our MultiRAW dataset. This evaluation aims to gauge the generalization capabilities of LED across outdoor and indoor scenes and across more camera models, respectively. LED is benchmarked against state-of-the-art raw denoising methods designed for extremely low-light environments. These comparative analyses include:
• DNN model-based methods: Exemplars in this category include the approaches of Monakhova et al. [20] and NoiseFlow [13]. These methods are first trained on paired real raw images, enabling them to learn the noise generation process specific to a particular camera. However, they may require additional training iterations when applied to a novel camera model.
- •
- •
The denoising network for all methods above is trained under identical settings, following the parameters outlined in ELD [8]. This standardization ensures a fair and consistent basis for comparison, as elucidated in Sec. 5.1.
Quantitative Evaluation. As demonstrated in Tab. I, Tab. II and Tab. III, our approach surpasses previous calibration-based methods in denoising performance under extremely low-light conditions. The disparity between synthetic and real noise is exacerbated with a substantial ratio ( and ), resulting in diminished performance when training with synthetic noise. This is exemplified by the comparison between ELD [8] and SID [7]. Moreover, DNN model-based methods often exhibit larger discrepancies than calibration-based methods, with Monakhova et al. [20] failing to account for different system gains. Our method mitigates this discrepancy by fine-tuning with few-shot real data, achieving superior performance under and digital gain, as detailed in Tab. I. AINDNet [23] also demonstrates enhanced performance under extremely dark scenes, benefitting from a noise model with reduced deviation. Notably, the noise model deviation has minimal impact on denoising efficacy under small additional digital gain and may even enhance performance, as illustrated in Tab. II. Discussions related to this phenomenon can be found in Sec. 6. Significantly, our method exhibits superiority under extremely low-light scenes, even across different camera models. Additionally, compared with alternative methods, LED incurs lower training costs in terms of data requirements, training iterations, and training time.
Qualitative Evaluation. The visual comparisons presented in Fig. 6, Fig. 7 and Fig. 8 illustrate the performance of our method against other state-of-the-art approaches on the SID [7], ELD [8] and MultiRAW datasets, respectively. Under extremely low-light conditions, LED recovers more high-frequency information. As shown for Camera3 in Fig. 8, LED is the only method that restores the strings of all three badminton rackets, especially the blue one. In addition, intense noise significantly disrupts the color tone. In Fig. 6, input images exhibit noticeable green or purple color shifts, with many comparative methods struggling to restore the correct color tone. Leveraging implicit noise modeling and a diverse sampling space, LED efficiently reconstructs signals amidst severe noise interference, achieving accurate color rendering and preserving rich texture detail. Moreover, other methods often fail to discern and address amplified out-of-model noise, resulting in the corruption of the final image with fixed patterns or specific positional artifacts. In contrast, during fine-tuning, LED learns to effectively eliminate this camera-specific noise, enhancing visual quality and demonstrating robustness against such challenges.
Setting | | | 100 | 250 | 300
U-net | CSA | OMNR | PSNR/SSIM | PSNR/SSIM | PSNR/SSIM
✓ | | | 41.518/0.951 | 39.140/0.923 | 36.273/0.898
✓ | ✓ | | 41.866/0.954 | 39.201/0.931 | 36.499/0.912
✓ | ✓ | ✓ | 41.984/0.954 | 39.342/0.932 | 36.673/0.915
5.4 Ablation Studies
Reparameterized Noise Removal Block. We conduct experiments to analyze the impact of different components in the Reparameterized Noise Removal Block (RepNR). As depicted in Tab. IV, our RepNR consistently demonstrates improved performance across three different ratios, with each component in the RepNR block contributing positively to the overall pipeline.
Pre-training with Advanced Strategy. As outlined in Tab. V, pre-training with the SGDR [73] learning rate schedule and a larger batch size (equivalent to the training strategy of PMN [22]) yields further performance improvements while keeping the fine-tuning cost unchanged (2 image pairs for each ratio and 1.5K iterations). This underscores the scalability of the proposed LED. Additionally, in comparison to LLD [74], LED demonstrates superior performance with minimal data and training costs.
Comparison between CSA and Other Normalization. A technique similar to our proposed CSA is to insert normalization layers into the network, which is relatively common in transfer learning scenarios. To show the superiority of CSA over this common practice, we directly replace the CSAs with different kinds of normalization layers and observe the difference. As shown in Tab. VI, the alternatives are Instance Normalization [50], Layer Normalization [51], and Batch Normalization [52] (BN* denotes BN without running mean and running variance). None of the normalization layers achieves performance comparable to CSA. One main reason is that the value range of features is crucial to the denoising task. Normalization severely disrupts the value range of the features and breaks their stability. In contrast, CSA roughly maintains the original value range, preventing model performance from collapsing.
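The value-range argument can be illustrated with a toy example. The sketch below compares instance normalization with a per-channel scale and shift on a single feature channel. Treating CSA as a channel-wise affine transform is an assumption made for this illustration, and the feature values, scale, and shift are made up.

```python
import statistics

def instance_norm(values, eps=1e-5):
    """Normalize one channel to zero mean and (roughly) unit variance."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

def channel_affine(values, scale, shift):
    """Per-channel scale and shift; an assumed stand-in for CSA."""
    return [scale * v + shift for v in values]

# Toy feature channel with a small dynamic range, as in dark raw data.
feature = [0.010, 0.020, 0.015, 0.030, 0.025]

normed = instance_norm(feature)
aligned = channel_affine(feature, scale=1.05, shift=0.001)  # hypothetical values

range_before = max(feature) - min(feature)
range_normed = max(normed) - min(normed)      # inflated by two orders of magnitude
range_aligned = max(aligned) - min(aligned)   # within 5% of the original range
```

Normalization rescales the channel regardless of its original magnitude, whereas a mild affine transform leaves the magnitude, and hence the signal level information, largely intact.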
Method | 100 | 250 | 300 |
PSNR/SSIM | PSNR/SSIM | PSNR/SSIM | |
LED | 41.984/0.954 | 39.342/0.932 | 36.673/0.915 |
ELD [8] | 42.081/0.955 | 39.461/0.934 | 36.870/0.920 |
LLD [74] | 42.100/0.955 | 39.760/0.933 | 36.760/0.912 |
LED | 42.396/0.955 | 39.843/0.939 | 36.997/0.923 |
Metric | CSA | IN [50] | LN [51] | BN [52] | BN* |
PSNR | 39.161 | 26.596 | 26.605 | 26.412 | 23.995 |
SSIM | 0.9322 | 0.5883 | 0.5938 | 0.6066 | 0.4186 |
Virtual Camera Number. We conduct ablation studies on the number of virtual cameras used by LED. As shown in Fig. 9, LED achieves the best performance with five virtual cameras. Intuitively, too few cameras make it difficult for the model to learn common knowledge, while too many cameras significantly increase the difficulty of the learning process. Since five virtual cameras yield a clear improvement throughout the whole process, we choose five as the number of virtual cameras for pre-training.
Sampling Strategy. With uniform sampling, it is hard to cover the whole parameter space. In contrast, our sampling strategy covers the whole parameter space , resulting in better performance, as shown in Tab. VII. Based on this observation, we use the equivalence point strategy to choose the parameters of the virtual cameras. To reduce errors, we ran the uniform-sampling experiments three times and averaged the metrics.
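The coverage argument can be sketched in one dimension. The example below compares five evenly spaced samples with five random samples on the unit interval, measuring the largest distance from any location to its nearest sample. The one-dimensional interval, the even spacing rule, and the seed are assumptions for illustration only; the actual parameter space and selection rule are those defined in Sec. 3.

```python
import random

def covering_radius(points, low=0.0, high=1.0):
    """Largest distance from any location in [low, high] to its nearest sample."""
    pts = sorted(points)
    edge = max(pts[0] - low, high - pts[-1])
    inner = max(((b - a) / 2.0 for a, b in zip(pts, pts[1:])), default=0.0)
    return max(edge, inner)

NUM_CAMERAS = 5  # number of virtual cameras used in the ablation above

# Evenly spaced samples at the centres of five equal cells.
even = [(i + 0.5) / NUM_CAMERAS for i in range(NUM_CAMERAS)]

# Random samples; the seed is arbitrary.
rng = random.Random(0)
rand = [rng.random() for _ in range(NUM_CAMERAS)]

even_radius = covering_radius(even)   # 0.1, the smallest value five points can reach
rand_radius = covering_radius(rand)   # never smaller than even_radius
```

With only a handful of virtual cameras, random draws tend to cluster and leave parts of the space far from any sample, while a deterministic spread keeps every region close to at least one virtual camera.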
Setting | 100 | 250 | 300 |
PSNR/SSIM | PSNR/SSIM | PSNR/SSIM | |
Rand | 41.5253/0.9489 | 39.2755/0.9283 | 36.3940/0.9070 |
Ours | 41.9842/0.9539 | 39.3419/0.9317 | 36.6728/0.9147 |
Initialization of CSA for Target Camera. Given the initialization of CSA as described in Sec. 3.3, we present the PSNR/SSIM difference between initialization and model averaging. The results indicate that, in most scenarios, model averaging yields superior performance. Furthermore, the performance on the Sony A7S2 of SID [7], as shown in Tab. X, is considered representative of the generalization ability, owing to the scale of the dataset.
Fine-tuning with More Images. Ablation studies are conducted to explore the impact of the number of fine-tuning image pairs, illustrating the potential of our proposed LED. As depicted in Fig. 10, an increase in the quantity of paired data correlates with a gradual performance improvement. Moreover, LED outperforms ELD [8] even when fine-tuned with only two noisy-clean pairs. Further discussions are provided in Sec. 6.
5.5 Further Application
Equip RepNR block on other network architecture. By simply replacing the convolutional operators of other structures with our proposed RepNR Block, LED can be easily migrated to architectures beyond UNet. In Tab. VIII, we experimented with Restormer [31] and NAFNet [32], transformer-based and convolution-based, respectively. Results demonstrate that LED still possesses performance comparable to calibration-based methods.
Architecture | Method | |||
PSNR/SSIM | PSNR/SSIM | PSNR/SSIM | ||
Restormer [31] | P-G | 39.457/0.8943 | 33.956/0.7525 | 30.964/0.6409 |
ELD [8] | 42.568/0.9536 | 38.699/0.9280 | 35.863/0.9059 | |
LED | 42.452/0.9492 | 39.376/0.9143 | 36.322/0.9143 | |
NAFNet [32] | P-G | 39.388/0.8945 | 33.892/0.7541 | 30.948/0.6445 |
ELD [8] | 42.351/0.9535 | 38.697/0.9300 | 35.931/0.9112 | |
LED | 42.368/0.9532 | 39.277/0.9351 | 36.292/0.9188 |
LED pre-training could boost the performance of other methods. By integrating LED pre-training into various existing calibration-based or paired data-based methods, as referenced in [8, 7], our approach facilitates notable enhancements in performance as shown in Tab. IX. These improvements are not uniform but rather depend on the difference in the pre-training strategies employed. This proves particularly effective in industrial applications, where the demands for efficiency are paramount. The strategic application of LED pre-training not only boosts the performance of the denoiser but also paves the way for more advanced, adaptable, and efficient denoising.
Method | ||||||
PSNR | SSIM | PSNR | SSIM | PSNR | SSIM | |
ELD [8] | 41.827 | 0.9538 | 38.849 | 0.9278 | 35.940 | 0.8982 |
ELD [8]LED | 42.170 | 0.9558 | 39.285 | 0.9302 | 36.384 | 0.9058 |
ELD [8]LED | 42.471 | 0.9567 | 39.454 | 0.9333 | 36.534 | 0.9138 |
SID [7] | 41.727 | 0.9531 | 39.135 | 0.9304 | 37.363 | 0.9341 |
SID [7]LED | 42.277 | 0.9580 | 39.576 | 0.9445 | 37.518 | 0.9369 |
SID [7]LED | 42.320 | 0.9585 | 39.613 | 0.9455 | 37.614 | 0.9369 |
Init | Metric | Sony | Nikon | Canon | ||
A7S2# | A7S2 | D850 | EOS700D | EOS70D | ||
PSNR | 39.015 | 47.310 | 45.790 | 41.409 | 42.344 | |
SSIM | 0.9307 | 0.9809 | 0.9737 | 0.9408 | 0.9520 | |
Avg. | PSNR | 39.161 | 47.616 | 45.903 | 41.516 | 42.495 |
SSIM | 0.9322 | 0.9817 | 0.9743 | 0.9412 | 0.9524 |
Ratio | 1 | 2 | 4 | 2* |
41.295/0.9480 | 41.704/0.9523 | 41.432/0.9466 | 43.795/0.9648 | |
39.239/0.9350 | 39.410/0.9351 | 39.327/0.9367 | 41.311/0.9457 | |
38.314/0.9229 | 38.486/0.9216 | 38.499/0.9240 | 39.190/0.9278 |
6 Discussions
Why two pairs for each ratio? As indicated in Eqn. (4), the variance of noise exhibits a linear relationship with the overall system gain . With only one pair of data, establishing the correct linear relationship is unattainable, resulting in suboptimal performance, as demonstrated in Tab. XI. Furthermore, utilizing two or more pairs with similar system gains fails to precisely model the linear relationship due to a non-negligible error in the sampling scope ( in Eqn. (4)), as illustrated in Fig. 11. Following the principle of using two points to determine a straight line, adopting two pairs with marginally different system gains facilitates the accurate modeling of linearity, significantly enhancing denoising capabilities. Additionally, as shown in Fig. 10, an increase in the number of pairs enables a more accurate fitting of linearity, thereby reducing regression errors further.
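The two-point argument can be checked with a short numerical sketch. It assumes a generic linear model (variance = slope × gain + intercept) with made-up parameters and a fixed measurement error on one of the two points; none of these numbers come from the paper. The error in the recovered slope equals the measurement error divided by the gain separation, so two points at nearly identical gains amplify the error.

```python
def fit_line(p1, p2):
    """Return (slope, intercept) of the line through two (gain, variance) points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("two distinct gains are needed to determine a line")
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1

TRUE_SLOPE, TRUE_INTERCEPT = 2.0, 0.5   # hypothetical noise parameters
ERROR = 0.1                             # hypothetical error on one variance estimate

def variance(gain):
    """Noise variance under the assumed linear model."""
    return TRUE_SLOPE * gain + TRUE_INTERCEPT

def slope_error(gain_a, gain_b):
    """Absolute slope error when the second measurement is off by ERROR."""
    slope, _ = fit_line(
        (gain_a, variance(gain_a)),
        (gain_b, variance(gain_b) + ERROR),
    )
    return abs(slope - TRUE_SLOPE)

close_err = slope_error(1.0, 1.1)   # about 1.0: error amplified tenfold
far_err = slope_error(1.0, 4.0)     # about 0.033: error damped
```

A single point leaves the slope undetermined (`fit_line` raises an error when both gains coincide), which mirrors the one-pair failure case discussed above.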
For typical explicit calibration-based methods, the primary objective of the calibration process is to compute the linear relationships mentioned previously. Subsequently, the network is trained on synthetic data to learn this relationship. However, our implicit calibration adjusts the learned linear relationships of the network directly through “calibrating” network parameters. This approach makes the entire process more direct and enables the network to serve as a swift adapter.
Noise prior or image prior? Both! It is well known that existing calibration-based methods uniformly rely on noise priors (explicit noise model calibration). However, these methods can exhibit sudden performance degradation on certain cameras, as shown for the Canon EOS70D and Canon EOS700D in Tab. II. This is attributed to these methods having learned an excessive amount of image priors from other cameras during training. Sensors from different manufacturers have diverse response models, thus yielding different signal intensities for the same scene. In most calibration-based methods [8, 16], the network’s denoising ability is restricted to a certain image distribution prior, i.e., that of the Sony A7S2. As stated in [49] and shown in Fig. 12, the intensity distributions of the Nikon D850 and Sony A7S2 are highly similar. Therefore, a synthetic image generated from the response intensity of the Sony A7S2 and the noise model of the Nikon D850 exhibits only a slight discrepancy from the real image prior, helping the network achieve strong performance, as shown for the Nikon D850 in Tab. II. On the contrary, the intensity distributions of the Canon EOS700D and Sony A7S2 differ substantially, leading to a performance drop.
However, it is important to note that as the additional digital gain increases, the performance gap between LED and other methods gradually narrows. This is because higher digital gain leads to more pronounced noise, making the noise prior learned by the network more effective. Conversely, under low digital gain, the image prior previously learned by the network becomes predominant.
Based on this observation, the balance between the image prior and the noise prior is the key to this problem. With the help of the proposed CSA, features are aligned to the shared space before denoising, decreasing the influence of the image prior on the network. As shown in Tab. XII, even when pre-trained with the response model of the Sony A7S2, LED can outperform other calibration-based methods. Furthermore, fine-tuning with a few image pairs from the target camera supplies the camera-specific information, helping the network learn both the image prior and the noise prior.
RAW Src. | 1 | 10 | 100 | 200 |
PSNR/SSIM | PSNR/SSIM | PSNR/SSIM | PSNR/SSIM | |
Sony | 44.27/0.992 | 42.15/0.982 | 37.43/0.917 | 34.74/0.867 |
Canon | 46.24/0.992 | 44.14/0.983 | 37.94/0.920 | 34.78/0.869 |
7 Conclusion and Future Work
To address the inherent shortcomings of calibration-based methods, we introduce an implicit calibration pipeline designed to light even the darkest scenes. Leveraging the camera-specific alignment (CSA), we substitute the explicit calibration procedure with an implicit learning process on the denoiser. The CSA facilitates rapid adaptation to the target camera by separating camera-specific information from the common knowledge of the noise model. Additionally, a parallel convolution mechanism is implemented to learn and eliminate out-of-model noise. With 2 pairs for each ratio (a total of 6 pairs) and 1.5K iterations, our approach attains superior performance compared to existing methods.
Up to this point, the final output quality of LED is still strongly correlated with the data quality used in the few-shot fine-tuning. However, this is not solely a limitation of our method but a common drawback of most few-shot methods. Future work could focus more on making few-shot learning more stable. This represents a key distinction between LED and previous methods: earlier approaches primarily concentrated on engineering for sensor noise modeling rather than focusing on deep learning techniques like few-shot, transfer, or continual learning. Consequently, LED allows researchers to shift their focus from sensor engineering to exploring few-shot learning.
Acknowledgement
This research was supported by the NSFC (NO. 62225604, 62306153) and the Fundamental Research Funds for the Central Universities (Nankai University, 070-63233089). Computation was supported by the Supercomputing Center of Nankai University. Moreover, we would like to express our profound gratitude to Yixuan Huang, Yipeng Du, Bowen Yin, Yunheng Li, and Ruihong Cen (in no particular order) for their dedicated efforts in constructing our dataset.
References
- [1] X. Jin, J.-W. Xiao, L.-H. Han, C. Guo, R. Zhang, X. Liu, and C. Li, “Lighting every darkness in two pairs: A calibration-free pipeline for raw denoising,” in ICCV, 2023.
- [2] A. Buades, B. Coll, and J.-M. Morel, “A non-local algorithm for image denoising,” in CVPR, 2005.
- [3] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE TIP, 2017.
- [4] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep image prior,” in CVPR, 2018.
- [5] J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila, “Noise2noise: Learning image restoration without clean data,” in CVPR, 2018.
- [6] A. Abdelhamed, S. Lin, and M. S. Brown, “A high-quality denoising dataset for smartphone cameras,” in CVPR, 2018.
- [7] C. Chen, Q. Chen, J. Xu, and V. Koltun, “Learning to see in the dark,” in CVPR, 2018.
- [8] K. Wei, Y. Fu, Y. Zheng, and J. Yang, “Physics-based noise modeling for extreme low-light photography,” IEEE TPAMI, 2021.
- [9] K. Zhang, W. Zuo, and L. Zhang, “Ffdnet: Toward a fast and flexible solution for cnn-based image denoising,” IEEE TIP, 2018.
- [10] S. Guo, Z. Yan, K. Zhang, W. Zuo, and L. Zhang, “Toward convolutional blind denoising of real photographs,” in CVPR, 2019.
- [11] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, and L. Shao, “Learning enriched features for fast image restoration and enhancement,” IEEE TPAMI, 2022.
- [12] X. Jin, L.-H. Han, Z. Li, C.-L. Guo, Z. Chai, and C. Li, “Dnf: Decouple and feedback network for seeing in the dark,” in CVPR, 2023.
- [13] A. Abdelhamed, M. A. Brubaker, and M. S. Brown, “Noise flow: Noise modeling with conditional normalizing flows,” in ICCV, 2019.
- [14] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, and L. Shao, “Cycleisp: Real image restoration via improved data synthesis,” in CVPR, 2020.
- [15] G. Jang, W. Lee, S. Son, and K. M. Lee, “C2n: Practical generative noise modeling for real-world denoising,” in CVPR, 2021.
- [16] Y. Zhang, H. Qin, X. Wang, and H. Li, “Rethinking noise synthesis and modeling in raw denoising,” in ICCV, 2021.
- [17] A. Maleky, S. Kousha, M. S. Brown, and M. A. Brubaker, “Noise2noiseflow: Realistic camera noise modeling without clean images,” in CVPR, 2022.
- [18] S. Kousha, A. Maleky, M. S. Brown, and M. A. Brubaker, “Modeling srgb camera noise with normalizing flows,” in CVPR, 2022.
- [19] Y. Wang, H. Huang, Q. Xu, J. Liu, Y. Liu, and J. Wang, “Practical deep raw image denoising on mobile devices,” in ECCV, 2020.
- [20] K. Monakhova, S. R. Richter, L. Waller, and V. Koltun, “Dancing under the stars: video denoising in starlight,” in CVPR, 2022.
- [21] Y. Zou and Y. Fu, “Estimating fine-grained noise model via contrastive learning,” in CVPR, 2022.
- [22] H. Feng, L. Wang, Y. Wang, and H. Huang, “Learnability enhancement for low-light raw denoising: Where paired real data meets noise modeling,” in ACM MM, 2022.
- [23] Y. Kim, J. W. Soh, G. Y. Park, and N. I. Cho, “Transfer learning from synthetic to real-noise denoising with adaptive instance normalization,” in CVPR, 2020.
- [24] X. Ding, X. Zhang, J. Han, and G. Ding, “Diverse branch block: Building a convolution as an inception-like unit,” in CVPR, 2021.
- [25] X. Ding, X. Zhang, N. Ma, J. Han, G. Ding, and J. Sun, “Repvgg: Making vgg-style convnets great again,” in CVPR, 2021.
- [26] L. Chen, Y. Fu, K. Wei, D. Zheng, and F. Heide, “Instance segmentation in the dark,” IJCV, 2023.
- [27] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI, 2015.
- [28] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, and L. Shao, “Learning enriched features for real image restoration and enhancement,” in ECCV, 2020.
- [29] L. Chen, X. Lu, J. Zhang, X. Chu, and C. Chen, “Hinet: Half instance normalization network for image restoration,” in CVPR, 2021.
- [30] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, and L. Shao, “Multi-stage progressive image restoration,” in CVPR, 2021.
- [31] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, and M.-H. Yang, “Restormer: Efficient transformer for high-resolution image restoration,” in CVPR, 2022.
- [32] L. Chen, X. Chu, X. Zhang, and J. Sun, “Simple baselines for image restoration,” in ECCV, 2022.
- [33] Z. Zhang, Y. Jiang, W. Shao, X. Wang, P. Luo, K. Lin, and J. Gu, “Real-time controllable denoising for image and video,” in CVPR, 2023.
- [34] R. A. Boie and I. J. Cox, “An analysis of camera noise,” IEEE TPAMI, 1992.
- [35] G. E. Healey and R. Kondepudy, “Radiometric ccd camera calibration and noise estimation,” IEEE TPAMI, 1994.
- [36] R. D. Gow, D. Renshaw, K. Findlater, L. Grant, S. J. McLeod, J. Hart, and R. L. Nicol, “A comprehensive tool for modeling cmos image-sensor-noise performance,” IEEE TED, 2007.
- [37] K. Irie, A. E. McKinnon, K. Unsworth, and I. M. Woodhead, “A technique for evaluation of ccd video-camera noise,” IEEE TCSVT, 2008.
- [38] M. Konnik and J. Welsh, “High-level numerical simulations of noise in ccd and cmos photosensors: review and tutorial,” arXiv:1412.4031, 2014.
- [39] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” in NeurIPS, 2014.
- [40] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in ICML, 2020.
- [41] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in CVPR, 2020.
- [42] H. Wach and E. R. Dowski Jr, “Noise modeling for design and simulation of computational imaging systems,” in Visual Information Processing XIII, 2004.
- [43] M. Maggioni, E. Sánchez-Monge, and A. Foi, “Joint removal of random and fixed-pattern noise through spatiotemporal video filtering,” IEEE TIP, 2014.
- [44] X. Huang and S. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” in ICCV, 2017.
- [45] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in CVPR, 2019.
- [46] T. Hospedales, A. Antoniou, P. Micaelli, and A. Storkey, “Meta-learning in neural networks: A survey,” IEEE TPAMI, 2021.
- [47] H.-J. Ye, L. Ming, D.-C. Zhan, and W.-L. Chao, “Few-shot learning with a strong teacher,” IEEE TPAMI, 2022.
- [48] G. Huang, I. Laradji, D. Vazquez, S. Lacoste-Julien, and P. Rodriguez, “A survey of self-supervised and few-shot object detection,” IEEE TPAMI, 2022.
- [49] K. R. Prabhakar, V. Vinod, N. R. Sahoo, and R. V. Babu, “Few-shot domain adaptation for low light raw image enhancement,” in BMVC, 2021.
- [50] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance normalization: The missing ingredient for fast stylization,” arXiv:1607.08022, 2016.
- [51] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv:1607.06450, 2016.
- [52] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, 2015.
- [53] B. L. Joiner and J. R. Rosenblatt, “Some properties of the range in samples from tukey’s symmetric lambda distributions,” Journal of the American Statistical Association, 1971.
- [54] S. Ravi and H. Larochelle, “Optimization as a model for few-shot learning,” in ICLR, 2016.
- [55] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in ICML, 2017.
- [56] B. Xu, N. Wang, T. Chen, and M. Li, “Empirical evaluation of rectified activations in convolutional network,” arXiv:1505.00853, 2015.
- [57] C.-B. Zhang, J.-W. Xiao, X. Liu, Y.-C. Chen, and M.-M. Cheng, “Representation compensation networks for continual semantic segmentation,” in CVPR, 2022.
- [58] J. Cha, S. Chun, K. Lee, H.-C. Cho, S. Park, Y. Lee, and S. Park, “Swad: Domain generalization by seeking flat minima,” in NeurIPS, 2021.
- [59] P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson, “Averaging weights leads to wider optima and better generalization,” arXiv:1803.05407, 2018.
- [60] J.-W. Xiao, C.-B. Zhang, J. Feng, X. Liu, J. van de Weijer, and M.-M. Cheng, “Endpoints weight fusion for class incremental semantic segmentation,” in CVPR, 2023.
- [61] X. Ding, Y. Guo, G. Ding, and J. Han, “Acnet: Strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks,” in ICCV, 2019.
- [62] X. Ding, H. Chen, X. Zhang, J. Han, and G. Ding, “Repmlpnet: Hierarchical vision mlp with re-parameterized locality,” in CVPR, 2022.
- [63] X. Ding, Y. Zhang, Y. Ge, S. Zhao, L. Song, X. Yue, and Y. Shan, “Unireplknet: A universal perception large-kernel convnet for audio, video, point cloud, time-series and image recognition,” arXiv:2311.15599, 2023.
- [64] M. Hu, J. Feng, J. Hua, B. Lai, J. Huang, X. Gong, and X.-S. Hua, “Online convolutional re-parameterization,” in CVPR, 2022.
- [65] X. Jin, J.-W. Xiao, and Y. Huang, “Led,” https://github.com/Srameo/LED, 2023.
- [66] M. Riechert, “Rawpy,” https://github.com/letmaik/rawpy, 2014.
- [67] A. Pavao, I. Guyon, A.-C. Letournel, D.-T. Tran, X. Baro, H. J. Escalante, S. Escalera, T. Thomas, and Z. Xu, “CodaLab competitions: An open source platform to organize scientific challenges,” JMLR, 2023.
- [68] S. Cheng, Y. Wang, H. Huang, D. Liu, H. Fan, and S. Liu, “NBNet: Noise basis learning for image denoising with subspace projection,” in CVPR, 2021.
- [69] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” in NIPS Workshops, 2017.
- [70] MindSpore-AI, “MindSpore,” https://github.com/mindspore-ai/mindspore, 2019.
- [71] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014.
- [72] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE TIP, 2004.
- [73] I. Loshchilov and F. Hutter, “SGDR: Stochastic gradient descent with warm restarts,” in ICLR, 2017.
- [74] Y. Cao, M. Liu, S. Liu, X. Wang, L. Lei, and W. Zuo, “Physics-guided ISO-dependent sensor noise modeling for extreme low-light photography,” in CVPR, 2023.
Xin Jin received the BS degree from the College of Software, Nankai University, China, in 2022. He is currently a Ph.D. student at the College of Computer Science, Nankai University. His research interests include computational photography and video/image processing.
Jia-Wen Xiao received his BS degree from the College of Computer Science, Nankai University, China, in 2022. He is currently a Ph.D. student at the College of Computer Science, Nankai University. His research interests include continual learning, self-supervised learning, few-shot learning, and computational photography.
Ling-Hao Han is a Ph.D. student at the College of Computer Science, Nankai University, under the supervision of Prof. Ming-Ming Cheng. He received his bachelor’s degree from Nankai University in 2020. His research interests include image restoration, low-light image enhancement, and computational photography.
Chunle Guo received his PhD from Tianjin University, China. From 2018 to 2019, he was a Research Associate with the Department of Computer Science, City University of Hong Kong (CityU). He is now a postdoctoral research fellow working with Prof. Ming-Ming Cheng at Nankai University. His research interests lie in image processing, computer vision, and deep learning.
Xialei Liu is currently an associate professor at Nankai University, Tianjin, China. Before that, he was a postdoctoral research associate at the University of Edinburgh, Edinburgh, UK. He obtained his PhD at the Autonomous University of Barcelona, Barcelona, Spain. He received his B.S. and M.S. degrees from Northwestern Polytechnical University, Xi’an, China, in 2013 and 2016, respectively. His research interests include continual learning, self-supervised learning, and few-shot learning.
Chongyi Li is a professor at Nankai University, China. He was a Research Fellow and then a Research Assistant Professor with City University of Hong Kong and Nanyang Technological University from 2018 to 2023. His research interests include image enhancement and restoration, image generation and editing, and underwater imaging. He serves as an AE of the IEEE TCSVT and a Lead Guest AE of IJCV. He is an IEEE Senior Member.
Ming-Ming Cheng received his PhD degree from Tsinghua University in 2012, and then worked with Prof. Philip Torr in Oxford for 2 years. Since 2016, he has been a full professor at Nankai University, leading the Media Computing Lab. His research interests include computer vision and computer graphics. His awards include the ACM China Rising Star Award and the IBM Global SUR Award. He is a senior member of the IEEE and serves on the editorial boards of IEEE TPAMI and IEEE TIP.