Make Explicit Calibration Implicit:
Calibrate Denoiser Instead of the Noise Model

Xin **, Jia-Wen Xiao, Ling-Hao Han, Chunle Guo, Xialei Liu, Chongyi Li, and Ming-Ming Cheng

All the authors are with VCIP, CS, Nankai University, Tian**, China. CL Guo and MM Cheng ({guochunle,cmm}@nankai.edu.cn) are corresponding authors. This paper is an extension of our ICCV 2023 conference version [1].

Abstract

Explicit calibration-based methods have dominated RAW image denoising under extremely low-light environments. However, these methods are impeded by several critical limitations: a) the explicit calibration process is both labor- and time-intensive, b) transferring denoisers across different camera models is challenging, and c) the disparity between synthetic and real noise is exacerbated by digital gain. To address these issues, we introduce a pipeline named Lighting Every Darkness (LED), which is effective regardless of the digital gain or the camera sensor. LED eliminates the need for explicit noise model calibration, instead utilizing an implicit fine-tuning process that allows quick deployment and requires minimal data. Structural modifications are also included to reduce the discrepancy between synthetic and real noise without extra computational demands. Our method surpasses existing methods on various camera models, including new ones not in public datasets, with just a few pairs per digital gain and only 0.5$\%$ of the typical iterations. Furthermore, LED allows researchers to focus more on deep learning advancements while still utilizing sensor engineering benefits. Code and related materials can be found at https://srameo.github.io/projects/led-iccv23/.

Index Terms:

Extreme low-light imaging, few-shot learning, deep low-light image denoising, low-light denoising dataset.

1 Introduction

Noise, an inescapable topic for image capturing, has been systematically investigated in recent years [2, 3, 4, 5, 6, 7, 8]. Compared to standard RGB images, RAW images offer two substantial advantages for image denoising: tractable, primitive noise distribution [8] and higher bit depth for differentiating signal from noise. Learning-based methodologies have demonstrated remarkable advancements in RAW image denoising, particularly when utilizing paired real datasets [9, 10, 11, 12]. However, creating extensive real RAW image datasets tailored to each camera model is impractical. Consequently, there has been a growing focus on applying learning-based techniques to synthetic datasets, a trend reflected in various studies [13, 14, 15, 8, 16, 17, 18].

Calibration-based noise synthesis, particularly when employing physics-based models, has demonstrated its proficiency in accurately fitting real noise characteristics [19, 8, 16, 20, 21, 22]. These methods typically adhere to a systematic process. Initially, they construct a well-designed noise model that aligns with the electronic imaging pipeline. Subsequently, a specific target camera is chosen, and the parameters of the pre-defined noise model are meticulously calibrated. The final step involves generating synthetic paired data for training a denoising network. Moreover, some approaches have been exploring the use of Deep Neural Network (DNN)-based generative models to facilitate the calibration of noise parameters [20, 21].

\begin{overpic}[width=433.62pt]{led_radar.pdf} \put(89.0,12.1){\scriptsize~\cite{wei2021physics}} \put(94.5,4.6){\scriptsize~\cite{kim2020transfer}} \put(96.3,0.8){\scriptsize~\cite{zhang2021rethinking}} \end{overpic}

Figure 1: LED exhibits state-of-the-art performance across a spectrum of darkness scenarios, encompassing various digital gain levels and camera sensors, outperforming calibration-based and transfer learning-based methodologies. Furthermore, adopting our proposed pipeline for new camera models requires minimal cost. Metrics are scaled non-linearly for readability. Refer to Sec. 5 for a comprehensive explanation.

Despite their notable achievements, current methods encounter three principal limitations, as depicted in Fig. 2 (b). 1) Explicit camera-specific noise model calibration is time-consuming and labor-intensive, requiring specialized data collection in a consistent illumination environment and comprehensive post-processing. 2) Each denoising network (denoiser) is tailored to a specific camera model. This coupling hinders adaptation to different cameras, requiring repeated calibration and training for each target camera. 3) The pre-defined noise model may not encompass certain noise distributions, so a denoiser trained with synthetic-only data faces what is termed out-of-model noise [8, 16, 22]. In other words, a domain gap persists between Synthetic Noise (SN) and Real Noise (RN). While recent advancements [21] have concentrated on reducing calibration costs through DNN-based methods, the coupling between networks and cameras and the out-of-model noise continue to increase training expenses and constrain overall performance.

\begin{overpic}[width=433.62pt]{teaser_framework.pdf} \put(0.0,48.0){\small{(a) Paired data-based methods}} \put(0.0,2.0){\small{(b) Calibration-based methods}} \put(54.9,22.5){\small{(c) Our proposed method}} \end{overpic}

Figure 2: The thumbnail of paired data-based methods, explicit calibration-based methods, and our proposed LED (Zoom-in for best view). The “$\rightarrow$” denotes the limitations of the paired data- and calibration-based methods, and the “$\rightarrow$” highlights our solutions for the above limitations. Calib. represents the calibration operations, including pre-defining a noise model, collecting calibration-specialized data, post-processing, and calculating the noise parameters. In LED, the collection procedure only captures few-shot paired data, alleviating the deployment cost.

We introduce an innovative pipeline, LED, for lighting every darkness, addressing the identified shortcomings of calibration-based methods. As illustrated in Fig. 2 (c), our framework eliminates the necessity for calibration data and operations related to the noise model. To sever the strong dependency between the denoising network and a specific target camera, we propose a dual-stage approach: pre-training with a virtual camera set, followed by fine-tuning with few-shot pairs from a specific real camera. (“Virtual” cameras do not correspond to any real camera model but carry reasonable noise parameters of the pre-defined noise model; they are sampled from a parameter space $\mathcal{S}$ with our proposed sampling strategy, detailed in Sec. 3.2.) This strategy effectively decouples the network from being bound to a single camera model. Concerning the disparity between a virtual and a target camera and the challenges posed by out-of-model noise, we introduce the Re-parameterized Noise Removal (RepNR) block. During the pre-training stage, the RepNR block has several camera-specific alignments (CSA). Each CSA is responsible for learning the camera-specific information of a single virtual camera and aligning features to a shared space. Then, the common knowledge of in-model noise (components that are assumed to be part of the noise model) is learned by a shared denoising convolution. In the fine-tuning stage, we average all the CSAs of virtual cameras as the initialization for the target camera. Additionally, we integrate a parallel convolution branch for Out-of-Model Noise Removal (OMNR). During the fine-tuning stage, LED implicitly “calibrates” the parameters of the denoiser, especially the CSAs, instead of explicitly calibrating the noise model. Only 2 pairs for each ratio (additional digital gain) captured by the target camera, for a total of 6 raw image pairs, are used for learning to remove real noise (a discussion of why 2 pairs per ratio are used can be found in Sec. 6). During deployment, all the RepNR blocks can be structurally reparameterized [24, 25, 26] into a straightforward $3\times 3$ convolution without any extra computational cost, yielding a plain UNet [27].

To comprehensively evaluate the efficacy of LED across diverse camera models, we introduce a novel dataset specifically tailored for multi-camera, dark-scene RAW image denoising, referred to as MultiRAW. This dataset is distinct in that it includes five different camera models that have not appeared in prior public datasets. A notable feature of MultiRAW is its coverage of various sensor sizes, ranging from full-frame cameras to APS-C format cameras, offering a more expansive and realistic testing ground. Furthermore, the MultiRAW dataset will be used in the CVPR 2024 and subsequent MIPI (Mobile Intelligent Photography & Imaging) workshops. This utilization underscores its significance and potential impact in advancing the field of RAW image denoising, particularly in scenarios characterized by extremely low light conditions.

Previous methods primarily focused on constructing noise models and calibrating noise parameters, namely sensor-related engineering, whereas LED focuses on deep learning techniques such as few-shot and transfer learning. Additionally, our method does not deviate from traditional noise modeling methods, which can still empower the pre-training stage of LED.

Our principal contributions are concisely encapsulated as follows:

• We introduce a novel, implicit “calibration” pipeline for lighting every darkness, eliminating the need for additional calibration-related expenses for noise parameter calculation.

• The implementation of Camera-Specific Alignments (CSA) mitigates the dependence of the denoising network on specific camera models. At the same time, the Out-of-Model Noise Removal (OMNR) mechanism facilitates few-shot transfer by learning the out-of-model noise of different sensors.

• We release a new dataset, MultiRAW, encompassing various camera models, assorted scenes, and varying brightness levels. This dataset substantially enriches the current landscape of open-source datasets and addresses the prevalent limitation of limited camera variety.

• Remarkably, our method requires only 2 RAW image pairs for each ratio and a mere 0.5$\%$ of the iterations typically needed by state-of-the-art methods (Fig. 1).

Compared to the ICCV 2023 [1] version, this journal extension includes several notable expansions. 1) Experiments (Sec. 5.5) demonstrate that our method can be seamlessly integrated with various existing network architectures and explicit calibration methods, showcasing the broad applicability of our proposed pipeline. 2) A discussion is provided on whether the network employs a noise prior or an image prior during denoising (detailed in Sec. 6), serving as guidance for further research. 3) We provide a detailed few-shot dataset collection process and related considerations, laying the groundwork for widespread adoption of our implicit calibration pipeline, LED. 4) Building on the collection process in 3), we introduce a new dataset, MultiRAW, featuring various camera models (not included in prior public datasets), multiple additional digital gains, and two different ISO configurations for each setting. 5) We plan to invigorate the RAW image denoising community by hosting a Few-shot RAW Image Denoising competition with the proposed MultiRAW dataset at the CVPR 2024 workshop: Mobile Intelligent Photography & Imaging.

2 Related Work

The issue of image capture in extremely dark scenes has received widespread attention from numerous camera/smartphone manufacturers. This section will revisit denoising techniques such as training with paired data and methods based on noise model calibration.

2.1 Training with Paired Real Data.

The field of RAW data exploitation for image denoising has its roots in the groundbreaking work of the SIDD project [6]. Progress in this area has recently broadened from image denoising under normal lighting to the more complex challenges inherent in extremely low-light conditions. This expansion is illustrated by notable studies such as SID [7] and ELD [8]. While methodologies based on real noise have yielded encouraging results [28, 29, 30, 31, 32, 33], their widespread application is hampered by the considerable effort required to compile extensive datasets of paired low- and high-quality images. To address this, training strategies that utilize paired low-quality raw images, exemplified by Noise2Noise [5] and Noise2NoiseFlow [17], offer an effective workaround to the tedious task of assembling noisy-clean image pairs. However, these techniques tend to underperform at severe noise levels, especially in scenarios with extreme darkness [7, 8].

In this context, our LED aims to advance the understanding and effectiveness of real noise elimination. It incorporates insights from a limited number of paired images taken in extremely low-light conditions, thereby mitigating the data collection challenges associated with such environments.

2.2 Calibration-Based Denoising.

While alleviating the burden of compiling pairwise datasets, synthetic noise-based techniques encounter practical limitations. Common noise models like Poisson and Gaussian significantly diverge from actual noise distributions in extremely low-light conditions [7, 8]. (Denoising under extremely low-light scenarios necessitates applying additional digital gain, up to 300$\times$, to the input, thereby intensifying the domain gap between real and synthetic noise.) In response, explicit calibration-based methods, simulating each noise component in electronic imaging pipelines [34, 35, 36, 37, 38], have thrived due to their reliability.

ELD [8] proposed a noise model that closely aligns with real noise characteristics, achieving notable performance in dark scenarios. Zhang et al. [16] acknowledged the complexity of modeling signal-independent noise sources and proposed a method that randomly samples such noise from dark frames. However, it still necessitates calibration for signal-dependent noise parameters (overall system gain). Monakhova et al. [20] devised a noise generator combining physics-based noise models with a generative adversarial framework [39]. Zou et al. [21] pursued more accurate and concise calibration by employing contrastive learning [40, 41] for parameter estimation.

Despite the impressive performance achieved by calibration-based methods, certain challenges persist. Stable illumination environments (e.g., consistent brightness and temperature), calibration-specific data collection (e.g., multiple images for each camera setting), and intricate post-processing tasks (e.g., alignment, localization, and statistical analyses) are prerequisites for precisely estimating noise parameters. Furthermore, repeated calibration and training processes are essential for distinct cameras, owing to the diversity of parameters and the nonuniform pre-defined noise model [42, 36, 38, 43]. Additionally, the domain gap between synthetic and real noise is not adequately addressed.

Our LED overcomes these challenges by replacing the explicit calibration procedure with implicit calibration of the denoiser, realized through a pre-training and fine-tuning framework together with a RepNR block designed for noise removal.

2.3 From Synthetic to Real Noise.

The domain gap between real and synthetic noise, a fundamental challenge, becomes particularly pronounced when models trained on synthetic data are tested on real-world data. To bridge this gap, recent research has increasingly focused on employing techniques like Adaptive Instance Normalization (AdaIN) [44, 45] and few-shot learning [46, 47, 48], along with transfer learning [23] and domain adaptation [49] strategies. However, these approaches often struggle in extremely dark environments where the numerical instability caused by intense noise and high digital gain can impair signal reconstruction.

To address this, our framework introduces a novel camera-specific alignment strategy. This method reduces numerical instability and effectively separates camera-specific characteristics from the general attributes of the noise model. Moreover, unlike instance or layer normalization [50, 51], our alignment operations can be reparameterized into a straightforward convolution, similar to custom batch normalization [52]. This reparameterization ensures that our approach does not incur any additional computational burden.

\begin{overpic}[width=433.62pt]{archs.pdf} \put(49.9,22.3){\LARGE{\color[rgb]{0.4375,0.6796875,0.27734375}$\Rightarrow$}} \end{overpic}

Figure 3: Illustration of our proposed LED and RepNR block. The overall pipeline is delineated into four key stages: 1) Sampling a set of $m$ virtual cameras responsible for synthesizing noise at a later stage; 2) Pre-training the denoising network with $m$ camera-specific alignments (CSAs) and synthetic paired images, with each CSA corresponding to a virtual camera; 3) Utilizing the target camera to acquire a limited number of real noisy image pairs; 4) Fine-tuning the pre-trained denoising network with real noisy data, tailoring the network to the characteristics of the target camera. In the intermediary phase, we introduce distinct optimization strategies tailored for the specific training stages of our RepNR block. During the stage transition, indicated by “$\Rightarrow$”, we average the CSAs to initialize the CSA ${}^{T}$. Subsequently, once CSA ${}^{T}$ reaches convergence, we introduce the OMNR ($3\times 3$) branch alongside the existing IMNR ($3\times 3$ + CSA ${}^{T}$) branch, and proceed with the training process.

3 Method

This section commences with an overview of the complete pipeline for our proposed raw image denoising with implicit calibration. Subsequently, we introduce our Reparameterized Noise Removal (RepNR) block. The comprehensive denoising pipeline is illustrated in Fig. 3.

3.1 Preliminaries and Motivation

In raw image space, the captured signals $D$ are conventionally regarded as the sum of the clean image $I$ and various noise components $N$ , expressed as Eqn. (1).

D=I+N,

(1)

where $N$ is assumed to follow a noise model,

\displaystyle N=N_{shot}+N_{read}+N_{row}+N_{quant}+\epsilon,

(2)

with $N_{shot}$ , $N_{read}$ , $N_{row}$ , $N_{quant}$ , and $\epsilon$ representing shot noise, read noise, row noise, quantization noise, and out-of-model noise, respectively. Apart from the out-of-model noise, other noise components are sampled from specific distributions:

\begin{split}&N_{shot}+I\sim\mathcal{P}(\frac{I}{K})K,\\ &N_{read}\sim\mathcal{T}(\lambda;\mu_{c},\sigma_{\mathcal{T}}),\\ &N_{row}\sim\mathcal{N}(0,\sigma_{r}),\\ &N_{quant}\sim U(-\frac{1}{2},\frac{1}{2}),\end{split}

(3)

where $K$ denotes the overall system gain. Here, $\mathcal{P}$, $\mathcal{N}$, and $U$ represent Poisson, Gaussian, and uniform distributions, respectively. $\mathcal{T}(\lambda;\mu,\sigma)$ stands for the Tukey-lambda distribution [53] with shape $\lambda$, mean $\mu$, and standard deviation $\sigma$. Based on the assumption in ELD [8], a linear relationship in the log domain governs the joint distributions of $(K,\sigma_{\mathcal{T}})$ and $(K,\sigma_{r})$, expressed as:

\displaystyle\begin{split}&\log(K)\sim U(\log(\hat{K}_{min}),\log(\hat{K}_{max})),\\ &\log(\sigma_{\mathcal{T}})|\log(K)\sim\mathcal{N}(a_{\mathcal{T}}\log(K)+b_{\mathcal{T}},\hat{\sigma}_{\mathcal{T}}),\\ &\log(\sigma_{r})|\log(K)\sim\mathcal{N}(a_{r}\log(K)+b_{r},\hat{\sigma}_{r}),\end{split}

(4)

where $\hat{K}_{min}$ and $\hat{K}_{max}$ denote the range of the overall system gain, determined by the minimum and maximum ISO values. $a$, $b$, and $\hat{\sigma}$ indicate the line’s slope, bias, and an unbiased estimator of the standard deviation, respectively. In this context, a camera can be approximated as a ten-dimensional coordinate $\mathcal{C}$:

\displaystyle\mathcal{C}=(\hat{K}_{min},\hat{K}_{max},\lambda,\mu_{c},a_{\mathcal{T}},b_{\mathcal{T}},\hat{\sigma}_{\mathcal{T}},a_{r},b_{r},\hat{\sigma}_{r}).

(5)
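As a minimal sketch of how a camera coordinate $\mathcal{C}$ from Eqn. (5) yields concrete noise parameters through Eqn. (4), the Python snippet below samples $(K,\sigma_{\mathcal{T}},\sigma_{r})$. The class name `CameraCoordinate`, the function `sample_noise_params`, and all numeric values are illustrative assumptions, not calibrated parameters of any real camera.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class CameraCoordinate:
    """Ten-dimensional coordinate C of Eqn. (5); field names are illustrative."""
    k_min: float          # \hat{K}_{min}
    k_max: float          # \hat{K}_{max}
    lam: float            # Tukey-lambda shape \lambda
    mu_c: float           # read-noise mean \mu_c
    a_t: float            # slope of log(sigma_T) vs. log(K)
    b_t: float            # bias of log(sigma_T) vs. log(K)
    sigma_hat_t: float    # spread around the sigma_T line
    a_r: float            # slope of log(sigma_r) vs. log(K)
    b_r: float            # bias of log(sigma_r) vs. log(K)
    sigma_hat_r: float    # spread around the sigma_r line


def sample_noise_params(cam, rng):
    """Draw (K, sigma_T, sigma_r) following the three lines of Eqn. (4)."""
    log_k = rng.uniform(math.log(cam.k_min), math.log(cam.k_max))
    log_sigma_t = rng.gauss(cam.a_t * log_k + cam.b_t, cam.sigma_hat_t)
    log_sigma_r = rng.gauss(cam.a_r * log_k + cam.b_r, cam.sigma_hat_r)
    return math.exp(log_k), math.exp(log_sigma_t), math.exp(log_sigma_r)


# Hypothetical coordinate; the numbers are placeholders, not measurements.
camera = CameraCoordinate(0.1, 10.0, 0.1, 0.0, 0.8, 0.5, 0.05, 0.7, -1.0, 0.05)
K, sigma_t, sigma_r = sample_noise_params(camera, random.Random(0))
```

Sampling in the log domain keeps $K$, $\sigma_{\mathcal{T}}$, and $\sigma_{r}$ strictly positive, which matches their role as a gain and two scale parameters.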

Existing methods predominantly rely on explicit calibration to determine the coordinate $\mathcal{C}$ , especially the linear relationship. It is a process characterized by intensive labor and a substantial domain gap (i.e., the gap between simulated noise and real noise). Moreover, the entanglement between neural networks and cameras requires repeated explicit calibration and training. In our implementation, these distributions and linear relationships are defined similarly to ELD [8]. However, we can also employ more advanced noise models as replacements to achieve theoretically superior performance.
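The in-model part of Eqns. (1)-(3) can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the noise synthesis used in the experiments: the helper names are hypothetical, $\sigma_{\mathcal{T}}$ is treated as the scale of the Tukey-lambda quantile function, row noise is drawn once per image row, and the out-of-model term $\epsilon$ is omitted because it has no closed form.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's method; adequate for the modest photon counts used here."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def tukey_lambda_quantile(p, lam):
    """Quantile function of the standard Tukey-lambda distribution."""
    if lam == 0:
        return math.log(p / (1.0 - p))
    return (p ** lam - (1.0 - p) ** lam) / lam


def synthesize_noisy_raw(clean, K, lam, mu_c, sigma_t, sigma_r, rng):
    """Return D = I + N for a 2-D list `clean` (Eqns. (1)-(3), epsilon omitted)."""
    noisy = []
    for row in clean:
        n_row = rng.gauss(0.0, sigma_r)          # one draw shared by the whole row
        out_row = []
        for value in row:
            shot_plus_signal = K * sample_poisson(value / K, rng)   # N_shot + I
            p = min(max(rng.random(), 1e-12), 1.0 - 1e-12)          # keep p inside (0, 1)
            n_read = mu_c + sigma_t * tukey_lambda_quantile(p, lam)
            n_quant = rng.uniform(-0.5, 0.5)
            out_row.append(shot_plus_signal + n_read + n_row + n_quant)
        noisy.append(out_row)
    return noisy


clean = [[100.0] * 20 for _ in range(20)]        # flat toy patch, placeholder values
noisy = synthesize_noisy_raw(clean, K=1.0, lam=0.1, mu_c=0.0,
                             sigma_t=1.0, sigma_r=0.5, rng=random.Random(0))
```

Inverse-transform sampling through the quantile function is used for the read noise because the Tukey-lambda distribution is defined by its quantile function and has no general closed-form density.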

We aim to streamline the complex calibration process and mitigate the strong coupling between networks and cameras. Additionally, we address the out-of-model noise comprehensively, a task facilitated by the structural modifications introduced in the RepNR block. Our motivation is to compel the network to function as a swift adapter [54, 55].

Algorithm 1 Pre-training pipeline in LED
Require: model $\Phi$, $m$, $\mathcal{S}$, clean dataset $D$
$\Phi_{\text{pre}}\leftarrow$ insert-multi-CSA($\Phi$)
$\{c_{k}\}_{k=1}^{m}\leftarrow$ generate-virtual-camera($\mathcal{S}$)
while not converged do
  Sample mini-batch $x_{i}\sim D$
  $k\leftarrow$ random$(1,m)$
  $\tilde{x_{i}}\leftarrow$ augment$(c_{k},x_{i})$
  $\Phi_{\text{pre},k}\leftarrow$ select-CSA($\Phi_{\text{pre}},k$)
  train$(\Phi_{\text{pre},k},\{\tilde{x_{i}},x_{i}\})$
end while

3.2 Pre-train with Camera-Specific Alignment

Preprocessing. We initiate the pre-training stage using virtual cameras to induce the network to function as a fast adapter. Given the number of virtual cameras $m$ and the parameter space $\mathcal{S}$, for the $k$-th camera we select the $k$-th of the $m$ bisection points of each parameter range and combine them to construct a virtual camera. By augmenting the data with synthetic noise, we can pre-train our network on multiple virtual cameras, compelling the network to acquire common knowledge.
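One plausible reading of this sampling strategy is sketched below. Taking the midpoint of the $k$-th of $m$ equal sub-intervals is an assumption made for illustration, as are the function name `generate_virtual_cameras` and the placeholder ranges; the exact split rule should be taken from the released code.

```python
def generate_virtual_cameras(parameter_space, m):
    """Build m virtual cameras; camera k takes the k-th of m evenly spaced points per range."""
    cameras = []
    for k in range(m):
        camera = {}
        for name, (low, high) in parameter_space.items():
            step = (high - low) / m
            camera[name] = low + (k + 0.5) * step   # midpoint of the k-th sub-interval (assumption)
        cameras.append(camera)
    return cameras


# Hypothetical parameter space S: name -> (min, max); values are placeholders.
parameter_space = {"a_t": (0.6, 1.0), "b_t": (0.0, 1.0),
                   "a_r": (0.5, 0.9), "b_r": (-2.0, 0.0)}
virtual_cameras = generate_virtual_cameras(parameter_space, m=5)
```

Because every parameter of camera $k$ comes from the same relative position in its range, the $m$ virtual cameras spread evenly across $\mathcal{S}$ instead of clustering as random draws might.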

Camera-Specific Alignment. As depicted in Fig. 3, within the pre-training process, we introduce our Camera-Specific Alignment (CSA) module, which focuses on adjusting the distribution of input features. In the baseline model, a $3\times 3$ convolution followed by leaky-ReLU [56] constitutes the primary component. A multi-path alignment layer is inserted before each convolution of the network to align features from different virtual cameras into a shared space. Each path represents the CSA corresponding to the $k$-th camera, aligning the $k$-th camera-specific feature distribution into a shared space. Let the feature of the $k$-th virtual camera be $F_{k}\in\mathcal{R}^{B\times C\times H\times W}$. Formally, the $k$-th branch contains a weight $W_{k}\in\mathcal{R}^{C}$ and a bias $b_{k}\in\mathcal{R}^{C}$, performing a channel-wise linear projection, denoted by $Y=W_{k}F_{k}+b_{k}$. Each $W_{k}$ is initialized as $\mathbf{1}$ and each $b_{k}$ as $\mathbf{0}$, so the alignment has no effect on the input of the $3\times 3$ convolution at the beginning.

During training, data augmented by the noise of the $k$ -th virtual camera is fed into the $k$ -th path for alignment and a shared $3\times 3$ convolution for further processing. The detailed pre-training pipeline is described in Algorithm 1.
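The multi-path alignment can be sketched with nested Python lists standing in for feature tensors. The class name `MultiPathCSA` and the toy values are hypothetical, and the batch dimension is dropped for brevity; the point is only that each path is a per-channel affine map that starts as the identity and is selected by the camera index $k$.

```python
import random


class MultiPathCSA:
    """m parallel channel-wise affine paths, one per virtual camera."""

    def __init__(self, num_cameras, channels):
        # W_k = 1 and b_k = 0, so every path starts as the identity.
        self.weight = [[1.0] * channels for _ in range(num_cameras)]
        self.bias = [[0.0] * channels for _ in range(num_cameras)]

    def forward(self, feature, k):
        """Apply Y = W_k * F_k + b_k per channel; `feature` is a [C][H][W] nested list."""
        return [[[self.weight[k][c] * v + self.bias[k][c] for v in row]
                 for row in plane]
                for c, plane in enumerate(feature)]


csa = MultiPathCSA(num_cameras=5, channels=2)
feature = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]   # C=2, H=W=2
k = random.Random(0).randrange(5)     # one virtual camera is picked per mini-batch
aligned = csa.forward(feature, k)     # identical to `feature` at initialization

csa.weight[1], csa.bias[1] = [2.0, 0.5], [1.0, -1.0]   # pretend path 1 has been trained
shifted = csa.forward(feature, 1)
```

Only the selected path receives gradients for a given mini-batch, while the convolution that follows is shared by all paths.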

3.3 Fine-tune with Few-shot RAW Image Pairs

Following the pre-training process, the model is intended for deployment in realistic denoising tasks. We advocate for a few-shot strategy, specifically employing only 6 pairs (2 pairs for each of the three ratios) of raw images to fine-tune the pre-trained model. We assume that $3\times 3$ convolutions have acquired sufficient capability to handle features aligned by CSAs. The convolutions remain frozen during subsequent fine-tuning to maximize the utilization of the model parameters obtained from pre-training. For addressing real noise, we substitute the multi-branch CSA with a new CSA layer, denoted as CSA ${}^{T}$ (CSA for the target camera). Unlike the multi-branch CSA during pre-training, the CSA ${}^{T}$ layer is initialized by averaging the pre-trained CSAs for improved generalization. The CSA ${}^{T}$ followed by a $3\times 3$ convolution branch mentioned above is called the in-model noise removal branch (IMNR).

\begin{overpic}[width=208.13574pt]{rep+ensemble.pdf} \put(15.0,32.0){(a)} \put(60.0,32.0){(b)} \put(92.0,32.0){(c)} \put(51.0,-5.0){(d)} \end{overpic}

Figure 4: Illustration for the initializing strategy of CSA ${}^{T}$ and the reparameterization process. (a) RepNR block during pre-training. (b) Our RepNR block can be seen as $m$ parameter-sharing blocks, each for a specific virtual camera. (c) We initialize the CSA ${}^{T}$ by averaging the pre-trained CSAs, which can be considered model ensembling. (d) The reparameterization process during deployment. Rep. denotes reparameterize. We detail the sequential reparameterization process in Sec. 3.4.

Nevertheless, real noise encompasses the modeled part and some out-of-model noise. Since our CSA layer is specifically designed for aligning features augmented by synthetic noise, a gap still exists between real noise and the noise that IMNR can handle (i.e., $\epsilon$ in Eqn. (2)). Therefore, we introduce an out-of-model noise removal branch (OMNR) to learn the gap between real noise and the modeled components. We place OMNR as a parallel branch alongside the IMNR branch, since previous research has demonstrated the efficacy of parallel convolution branches in transfer and continual learning [57]. OMNR comprises only a $3\times 3$ convolution, aiming to capture the structural characteristics of real noise from few-shot raw image pairs. Given the absence of prior information on the noise remainder $\epsilon$, we initialize the weights and bias of OMNR to $\mathbf{0}$. Combining IMNR with OMNR yields the proposed RepNR block. It is more reasonable to first learn in-model noise and subsequently address out-of-model noise. Therefore, we divide the optimization process into two steps: first training IMNR and then training OMNR. With this approach, the iterations of the two-step fine-tuning amount to only 0.5$\%$ of those of pre-training, rendering it highly feasible for practical implementation. The detailed fine-tuning pipeline is described in Algorithm 2.
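Two properties of the parallel OMNR branch can be checked on a toy example: a zero-initialized branch leaves the output of the block unchanged, and two parallel $3\times 3$ branches acting on the same input merge into a single kernel by summation. The sketch below uses a single channel, assumes zero padding, and treats CSA ${}^{T}$ as already folded into the IMNR kernel; the kernels are arbitrary stand-ins, not learned weights.

```python
def conv3x3(x, kernel, bias=0.0):
    """Zero-padded single-channel 3x3 convolution on a 2-D list."""
    h, w = len(x), len(x[0])
    out = [[bias] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            for dr in range(3):
                for dc in range(3):
                    rr, cc = r + dr - 1, c + dc - 1
                    if 0 <= rr < h and 0 <= cc < w:
                        out[r][c] += kernel[dr][dc] * x[rr][cc]
    return out


def add(a, b):
    """Element-wise sum of two 2-D lists."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
imnr_kernel = [[0.0, 1.0, 0.0], [1.0, -4.0, 1.0], [0.0, 1.0, 0.0]]   # stand-in for frozen IMNR
omnr_kernel = [[0.0] * 3 for _ in range(3)]                           # OMNR starts at zero

before = conv3x3(x, imnr_kernel)
after_adding_omnr = add(before, conv3x3(x, omnr_kernel))   # unchanged by the zero branch

omnr_kernel[1][1] = 0.5                                     # pretend OMNR has learned something
two_branch = add(conv3x3(x, imnr_kernel), conv3x3(x, omnr_kernel))
merged_kernel = add(imnr_kernel, omnr_kernel)               # parallel branches merge by summation
one_branch = conv3x3(x, merged_kernel)                      # same result as `two_branch`
```

Zero initialization therefore makes the second fine-tuning step start exactly from the IMNR solution, and the extra branch costs nothing once the kernels are merged.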

Analysis on the Initialization of CSA ${}^{T}$ . As mentioned in Sec. 3.3, we initialize CSA ${}^{T}$ by averaging the pre-trained CSAs in the multi-branch CSA layer. Given that every path shares the convolution in the multi-branch CSA, this initialization can be conceptualized as the ensemble of $m$ models, where $m$ is the number of paths, like (a)-(c) in Fig. 4. According to studies [58, 59, 60], the weighted average of different models can significantly enhance the model’s generalization. This aligns with our objective of generalizing the model to the target noisy domain.
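The ensemble view can be verified numerically for a single block. Because the CSA is affine and the shared convolution is linear, averaging the CSA parameters gives the same pre-activation output as averaging the outputs of the $m$ paths. The sketch below replaces the shared convolution with a small matrix acting on one feature vector, which is an assumption made for brevity, and all values are hypothetical. The identity is exact only before the non-linearity of one block; across the full network the ensemble interpretation is an approximation.

```python
def csa(feature, weight, bias):
    """Channel-wise affine alignment of one per-pixel feature vector."""
    return [w * f + b for w, f, b in zip(weight, feature, bias)]


def shared_linear(feature, matrix):
    """Stand-in for the shared convolution: any linear map works for this argument."""
    return [sum(m_ij * f for m_ij, f in zip(row, feature)) for row in matrix]


feature = [1.0, -2.0, 0.5]
matrix = [[0.2, -0.1, 0.4], [1.0, 0.0, -0.5]]
csas = [([1.0, 1.2, 0.8], [0.0, 0.1, -0.1]),     # (W_k, b_k) of m = 3 hypothetical virtual cameras
        ([0.9, 1.0, 1.1], [0.2, 0.0, 0.0]),
        ([1.1, 0.8, 1.1], [0.1, -0.1, 0.1])]
m = len(csas)

# CSA^T initialized as the average of the pre-trained CSAs.
avg_w = [sum(w[c] for w, _ in csas) / m for c in range(3)]
avg_b = [sum(b[c] for _, b in csas) / m for c in range(3)]
out_averaged_csa = shared_linear(csa(feature, avg_w, avg_b), matrix)

# Ensemble: average the outputs of the m single-camera paths.
outs = [shared_linear(csa(feature, w, b), matrix) for w, b in csas]
out_ensemble = [sum(o[j] for o in outs) / m for j in range(2)]
```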

Another rationale for this approach is that CSAs are largely determined by the coordinates $\mathcal{C}$. From this perspective, the average of different CSAs can be considered the center of gravity of these coordinates. Moreover, the coordinates of test cameras, both in SID [7] and ELD [8], are encompassed in the parameter space $\mathcal{S}$. In such circumstances, averaging the pre-trained CSAs is a sound starting point. However, even if the coordinates $\mathcal{C}$ are not in the pre-defined parameter space $\mathcal{S}$ (as in our MultiRAW dataset), LED can still achieve state-of-the-art (SOTA) performance with a few more iterations during fine-tuning.

Algorithm 2 Fine-tuning and deploy pipeline in LED
Require: pre-trained model $\Phi_{\text{pre}}$, real dataset $D_{\text{real}}$
$\Phi_{\text{ft}}\leftarrow$ freeze-$3\times 3$($\Phi_{\text{pre}}$)
$\Phi_{\text{ft}}\leftarrow$ average-CSA($\Phi_{\text{ft}}$)
while not converged do
  Sample mini-batch pairs $\{x_{i},y_{i}\}\sim D_{\text{real}}$
  train$(\Phi_{\text{ft}},\{x_{i},y_{i}\})$
end while
$\Phi_{\text{ft}}\leftarrow$ freeze-IMNR($\Phi_{\text{ft}}$)
$\Phi_{\text{ft}}\leftarrow$ add-OMNR($\Phi_{\text{ft}}$)
while not converged do
  Sample mini-batch pairs $\{x_{i},y_{i}\}\sim D_{\text{real}}$
  train$(\Phi_{\text{ft}},\{x_{i},y_{i}\})$
end while
$\Phi_{\text{final}}\leftarrow$ deploy($\Phi_{\text{ft}}$)
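The two training loops of Algorithm 2 differ only in which parameter groups receive updates. A minimal sketch of that schedule is given below; the stage names and group names (`conv3x3`, `csa_t`, `omnr_conv`) are hypothetical labels chosen for illustration, not identifiers from the released code.

```python
# Parameter groups that exist and that are trainable in each fine-tuning step.
STAGES = {
    # First loop: the shared 3x3 convolutions are frozen and CSA^T adapts.
    "imnr": {"present": {"conv3x3", "csa_t"}, "trainable": {"csa_t"}},
    # Second loop: IMNR (CSA^T + 3x3 conv) is frozen and the new OMNR branch adapts.
    "omnr": {"present": {"conv3x3", "csa_t", "omnr_conv"}, "trainable": {"omnr_conv"}},
}


def frozen_groups(stage):
    """Groups that exist in the given step but receive no gradient updates."""
    return STAGES[stage]["present"] - STAGES[stage]["trainable"]


for stage in ("imnr", "omnr"):
    print(stage, "trainable:", sorted(STAGES[stage]["trainable"]),
          "frozen:", sorted(frozen_groups(stage)))
```

In a deep learning framework, the same schedule would typically be realized by disabling gradients for the frozen groups before each loop.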

3.4 Deploy

Upon completion of fine-tuning, the deployment of the model is crucial for future applications. Directly substituting the $3\times 3$ convolution with our RepNR block would inevitably increase the number of parameters and the computational workload. However, our RepNR block solely comprises serial and parallel linear mappings. Additionally, the receptive field of each branch in the RepNR block is $3$. Therefore, employing the structural reparameterization technique [61, 24, 25], our RepNR block can be transformed into a plain $3\times 3$ convolution during deployment, as illustrated in Fig. 4 (d). This implies that our model incurs no additional cost at inference and facilitates a fair comparison with other methods. For parallel reparameterization techniques, please refer to previous works [61, 24, 25, 62, 63]. Here, we primarily introduce the serial reparameterization technique we employ.

Sequential Reparameterization. The reparameterization process can be denoted as the following equation:

\displaystyle\begin{split}W_{\mathbf{rep}}&=\mathbf{diag}(W)\otimes W_{3\times 3},\\ b_{\mathbf{rep}}&=W_{3\times 3}\otimes\mathbf{pad}(b)+b_{3\times 3},\end{split}

(6)

where $\mathbf{diag}$ transforms a $C$-dimensional vector into a $C\times C$ diagonal matrix, and $\mathbf{pad}$ replicate-pads a $1\times 1\times C$ vector into a $3\times 3\times C$ tensor. $W$, $W_{3\times 3}$, and $W_{\mathbf{rep}}$ stand for the weight of the CSA, the weight of the $3\times 3$ convolution, and the reparameterized weight, respectively, and $b_{\ast}$ denotes the bias of the corresponding type.

Since our CSA operator solely comprises $1\times 1$ channel-wise operations, it must first be transformed into a regular $1\times 1$ convolution using the $\mathbf{diag}$ operator during reparameterization. It is worth noting that such reparameterization can only approximate $W_{\mathbf{rep}}$ and $b_{\mathbf{rep}}$. To ensure consistency between training and testing, we employ the online reparameterization technique [64]. This technique performs reparameterization during training and is intended to save GPU memory; our primary goal in using it, however, is consistency between training and testing. More details can be found in our GitHub repo [65].
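Eqn. (6) can be illustrated on a toy example by folding a channel-wise affine map into the $3\times 3$ convolution that follows it. The sketch below assumes zero padding and uses hypothetical helper names. Under that assumption, the folded convolution matches the two-step computation at interior pixels but not at the border, where the CSA bias is absent from the padded region; this is one likely reason the folding is only approximate.

```python
def conv3x3(x, weight, bias):
    """Zero-padded 3x3 convolution. x: [C_in][H][W], weight: [C_out][C_in][3][3]."""
    height, width = len(x[0]), len(x[0][0])
    out = []
    for o, kernels in enumerate(weight):
        plane = [[bias[o]] * width for _ in range(height)]
        for i, kernel in enumerate(kernels):
            for r in range(height):
                for c in range(width):
                    for dr in range(3):
                        for dc in range(3):
                            rr, cc = r + dr - 1, c + dc - 1
                            if 0 <= rr < height and 0 <= cc < width:
                                plane[r][c] += kernel[dr][dc] * x[i][rr][cc]
        out.append(plane)
    return out


def fold_csa(csa_w, csa_b, weight, bias):
    """Fold Y = W * F + b into the following 3x3 convolution, as in Eqn. (6)."""
    w_rep = [[[[v * csa_w[i] for v in row] for row in kernel]
              for i, kernel in enumerate(kernels)] for kernels in weight]
    b_rep = [bias[o] + sum(csa_b[i] * sum(map(sum, kernel))
                           for i, kernel in enumerate(kernels))
             for o, kernels in enumerate(weight)]
    return w_rep, b_rep


x = [[[1.0] * 4 for _ in range(4)]]              # C_in = 1, 4x4 input of ones
weight = [[[[1.0] * 3 for _ in range(3)]]]       # C_out = 1, all-ones kernel
bias, csa_w, csa_b = [0.0], [2.0], [1.0]

aligned = [[[csa_w[0] * v + csa_b[0] for v in row] for row in x[0]]]
two_step = conv3x3(aligned, weight, bias)        # CSA followed by convolution
w_rep, b_rep = fold_csa(csa_w, csa_b, weight, bias)
folded = conv3x3(x, w_rep, b_rep)                # single reparameterized convolution

print(two_step[0][1][1], folded[0][1][1])   # interior pixel: 27.0 27.0
print(two_step[0][0][0], folded[0][0][0])   # corner pixel:   12.0 17.0
```

Performing the folding during training, as online reparameterization does, means the network is optimized with the same border behavior it will have at deployment.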

4 Dark RAW Images (MultiRAW) Dataset

In this section, we introduce the MultiRAW dataset, the details of data collection (to guide the deployment of LED to other cameras), and the availability and limitations of the data. Note that the description in this section has been simplified as much as possible to facilitate rapid deployment of LED on other camera models.

4.1 Overview of the MultiRAW Dataset

To further validate the effectiveness of LED across different cameras, we introduce the MultiRAW dataset. Compared to existing datasets, our MultiRAW dataset has the following advantages:

• Multi-Camera Data: To further demonstrate the effectiveness of LED across different cameras (corresponding to different noise parameters, i.e., coordinates $\mathcal{C}$), our dataset includes five distinct camera models not covered in existing datasets. Additionally, MultiRAW includes both full-frame cameras and APS-C format cameras, whose smaller sensor areas often exhibit stronger noise characteristics.

• Varied Illumination Settings: The dataset contains data under five different illumination ratios ($\times 1$, $\times 10$, $\times 100$, $\times 200$, and $\times 300$), each representing a different level of denoising difficulty.

• Dual ISO Configurations: There are two different ISO settings for each scene and illumination setting. These can be used not only for the fine-tuning stage of the LED method but also for testing the algorithm’s robustness under different illumination settings.

In addition to the three highlighted points, the MultiRAW dataset spans 30 indoor scenes, featuring diverse backgrounds and varying types and quantities of objects being photographed. It includes seven different ISO settings ranging from 200 to 6400. The hardest example in our dataset resembles an image captured at a “pseudo” ISO of up to 960,000 ( $3200\times 300$ ). We captured a 5-image burst per setting to collect a broader range of noise samples for each ISO configuration under every illumination setting. This approach provides more test data pairs and lays the groundwork for burst raw image denoising in extremely dark environments. We also captured data for explicit calibration so that existing calibration-based methods can be reproduced for a full evaluation.

Most existing datasets directly use low ISO and long exposure images as ground truth because the noise produced at low ISO settings is often negligible in full-frame cameras. However, since our shooting equipment includes APS-C format cameras with smaller sensor areas, we need to additionally perform multi-frame averaging denoising on low ISO and long exposure images (4 frames in our implementations). Therefore, we collected a total of $(5\times 5\times 2)\times 5\times 30=7,500$ noisy images and $4\times 5\times 30=600$ images for creating $150$ ground truths, comprising $7,500$ pairs of data for both training and evaluation.
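The counts above can be double-checked with a few lines of arithmetic. The sketch below assumes that the factors correspond to cameras, illumination ratios, ISO settings, burst length, and scenes, as listed in the dataset overview; the variable names are hypothetical.

```python
# Dataset dimensions as described in the text (variable names are hypothetical).
cameras, ratios, isos, burst, scenes = 5, 5, 2, 5, 30
gt_frames = 4  # long-exposure frames averaged into one ground truth

noisy_images = cameras * ratios * isos * burst * scenes
gt_source_images = gt_frames * cameras * scenes
ground_truths = cameras * scenes  # one averaged ground truth per camera and scene
```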

Figure 5: A thumbnail of our MultiRAW dataset (Zoom in for best view). It features 30 unique scenes, captured using 5 distinct camera models previously unrepresented in public datasets, under 5 varied lighting conditions (ranging from $\times 1$ to $\times 300$ ratios). For each camera, scene, and lighting combination, we recorded images in dual ISO configurations to enhance the tuning of our LED (detailed in Sec. 6), along with a burst of 5 images for expanded application. In total, MultiRAW provides 7,500 paired images for both training and evaluative purposes. The visual results are amplified and post-processed with the ISP provided by RawPy [66], then downsampled $4\times$ to reduce file size.

4.2 Instructions on Data Collection

To ensure the quality of the dataset, special attention must be paid to lighting, alignment, and environmental factors during the shooting process:

•

Lighting: To ensure consistent lighting conditions for the images, it is often necessary to supplement environmental lighting or adjust the aperture. This allows for correct exposure in low ISO and long exposure scenarios.
•

Alignment: Remote control is essential to prevent misalignment issues. Additionally, to avoid camera shake caused by the mechanical shutter during photography, the camera should be set to electronic shutter mode for shooting.
•

Temperature: To prevent the increase in camera temperature caused by continuous shooting (which typically leads to increased noise variance), it is necessary to set the interval between continuous shots to 5 seconds or more.

Moreover, to provide more information on signal-dependent noise (shot noise) for the fine-tuning of LED, the scenes photographed should have a wide variety of colors.

TABLE I: Quantitative results on the SID [7] Sony subset. The best result is in bold, whereas the second best one is underlined. The extra data requirements and iterations (K) are calculated when transferred to a new target camera. The DNN model-based methods require training noise generators for the target camera, resulting in larger iteration requirements. AINDNet* indicates that the AINDNet is pre-trained with our proposed noise model instead of AWGN. It is worth noting that all methods except AINDNet are trained with the same UNet architecture, while we keep the AINDNet the same as in its original paper, with almost twice the number of parameters compared to the UNet.

Categories	Methods	Extra Data Requirements	Iterations (K)	$\times 100$		$\times 250$		$\times 300$
Categories	Methods	Extra Data Requirements	Iterations (K)	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM
DNN Model Based	Kristina et al. [20]	$\sim$ 1800 noisy-clean pairs	327.6	38.7799	0.9120	34.4924	0.7900	31.2971	0.6990
DNN Model Based	NoiseFlow [13]	$\sim$ 1800 noisy-clean pairs	777.6	37.0200	0.8820	32.9457	0.7699	29.8068	0.6700
Calibration-Based	Calibrated P-G	$\sim$ 300 calibration data	257.6	39.1576	0.8963	33.8929	0.7630	31.0035	0.6522
	ELD [8]	$\sim$ 300 calibration data	257.6	41.8271	0.9538	38.8492	0.9278	35.9402	0.8982
	Zhang et al. [16]	$\sim$ 150/ $\sim$ 150 for calib./database	257.6	40.9232	0.9488	38.4397	0.9255	35.5439	0.8975
Real Data Based	SID [7]	$\sim$ 1800 noisy-clean pairs	257.6	41.7273	0.9531	39.1353	0.9304	37.3627	0.9341
	Noise2Noise [5]	$\sim$ 12000 noisy pairs	257.6	39.2769	0.8993	34.1660	0.7824	31.0991	0.7080
	AINDNet [23]	$\sim$ 300 noisy-clean pairs	1.5	40.5636	0.9194	36.2538	0.8509	32.2291	0.7397
	AINDNet*	$\sim$ 300 noisy-clean pairs	1.5	39.8052	0.9350	37.2210	0.9101	34.5615	0.8856
	LED (Ours)	6 noisy-clean pairs	1.5	41.9842	0.9539	39.3419	0.9317	36.6728	0.9147

TABLE II: Quantitative results on four camera models, SonyA7S2, NikonD850, Canon EOS70D and Canon EOS700D, of the ELD [8] dataset. The best result is denoted as bold. The reasons for the significant performance improvement observed with Canon cameras are discussed in detail in Sec. 6. All the metrics in this table are calculated with the last eight scenes in the ELD [8] dataset, details in .

Cam.	Ratio	Calibrated P-G	ELD [8]	LED (Ours)
Cam.	Ratio	PSNR/SSIM	PSNR/SSIM	PSNR/SSIM
Sony A7S2	$\times{1}$	54.3710/0.9977	52.8120/0.9957	51.9547/0.9968
	$\times{10}$	49.9973/0.9891	50.0152/0.9913	50.1762/0.9945
	$\times{100}$	41.5246/0.8668	44.9865/0.9707	45.3574/0.9779
	$\times{200}$	37.6866/0.7818	42.5440/0.9430	42.9747/0.9577

Cam.	Ratio	Calibrated P-G	ELD [8]	LED (Ours)
Cam.	Ratio	PSNR/SSIM	PSNR/SSIM	PSNR/SSIM
Nikon D850	$\times{1}$	50.6207/0.9949	50.5628/0.9925	50.6222/0.9939
	$\times{10}$	48.3461/0.9884	48.3667/0.9890	48.0684/0.9894
	$\times{100}$	42.2231/0.9046	43.6907/0.9634	43.5620/0.9667
	$\times{200}$	39.0084/0.8391	41.3311/0.9364	41.3984/0.9482

Cam.	Ratio	Calibrated P-G	ELD [8]	LED (Ours)
Cam.	Ratio	PSNR/SSIM	PSNR/SSIM	PSNR/SSIM
Canon EOS70D	$\times{1}$	42.7352/0.9915	42.4305/0.9900	48.5063/0.9924
	$\times{10}$	41.0061/0.9841	40.6364/0.9833	45.4415/0.9842
	$\times{100}$	36.7007/0.8700	37.7944/0.9255	39.5491/0.9360
	$\times{200}$	33.3459/0.7942	35.1554/0.8703	36.2362/0.8948

Cam.	Ratio	Calibrated P-G	ELD [8]	LED (Ours)
Cam.	Ratio	PSNR/SSIM	PSNR/SSIM	PSNR/SSIM
Canon EOS700D	$\times{1}$	42.0156/0.9900	41.9264/0.9881	47.7006/0.9910
	$\times{10}$	40.7658/0.9791	40.5297/0.9758	44.8541/0.9815
	$\times{100}$	36.7589/0.8697	36.9642/0.8937	38.3147/0.9206
	$\times{200}$	34.3376/0.8063	34.9231/0.8534	35.1962/0.8717

4.3 Dataset Application and Availability

Our dataset will be used in the Few-shot RAW Image Denoising track at the CVPR 2024 workshop: Mobile Intelligent Photography & Imaging. Following popular benchmarks, we fully release a subset of the data (about 20 scenes of the Canon EOSR10 and Sony A6400 camera models), along with a batch of test data. To prevent overfitting, we only make the images public, with the corresponding ground truths accessible via an online leaderboard on Google CodaLab [67]. A thumbnail of our MultiRAW dataset is illustrated in Fig. 5.

5 Experiments and Analysis

This section offers a comprehensive description of our implementation, details the evaluation metrics and datasets used, presents comparative experiments with other methods, and includes ablation studies to demonstrate the efficacy of our approach.

5.1 Implementation Details

Similar to most denoising methods [14, 68], we utilize the $L1$ loss function as the training objective. We adopt the same UNet [27] architecture as previous methods for a fair comparison, with the distinction that we replace the convolution blocks inside the UNet with our proposed RepNR block. As mentioned in Sec. 3.4, the RepNR block can be structurally reparameterized into a simple convolution block without incurring additional computational costs. We employ the same data preprocessing and optimization strategy as ELD [8] during pre-training. The raw images with long exposure time in the SID [7] train subset are utilized for noise synthesis. Concerning data preprocessing, we pack the Bayer images into 4 channels, followed by cropping the long exposure data with a patch size of $512\times 512$ , non-overlapping, step $256$ , thereby increasing the iterations of one epoch from $161$ to $1288$ . Our implementation is based on PyTorch [69] and MindSpore [70]. We train the models for 200 epochs (257.6K iterations) using the Adam optimizer [71] with $\beta_{1}=0.9$ and $\beta_{2}=0.999$ for optimization, without applying weight decay. The initial learning rate is set to $10^{-4}$ and is halved at the 100th epoch (128.8K iterations) before being further reduced to $10^{-5}$ at the 180th epoch (231.84K iterations).
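The pre-training schedule described above can be written down as a small step function. This is a minimal sketch, assuming that "at the 100th epoch" means the rate changes once 100 full epochs have been completed; the names `ITERS_PER_EPOCH`, `TOTAL_ITERS`, and `pretrain_lr` are hypothetical.

```python
ITERS_PER_EPOCH = 1288               # iterations per epoch after cropping
TOTAL_ITERS = 200 * ITERS_PER_EPOCH  # 200 epochs of pre-training

def pretrain_lr(iteration):
    """Learning rate at a given 0-based pre-training iteration.

    Assumption: the rate switches after 100 and 180 completed epochs.
    """
    epoch = iteration // ITERS_PER_EPOCH
    if epoch < 100:
        return 1e-4   # initial learning rate
    if epoch < 180:
        return 5e-5   # halved
    return 1e-5       # final reduction
```

The epoch boundaries then fall at iterations 128,800 and 231,840, matching the 128.8K and 231.84K figures quoted in the text.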

During fine-tuning, we initially freeze the $3\times 3$ convolution and average the multi-branch CSA to initialize CSA ${}^{T}$ . We first train CSA ${}^{T}$ until convergence, which constitutes the implicit calibration process we propose. After CSA ${}^{T}$ has converged, we introduce the out-of-model noise removal branch (a parallel $3\times 3$ convolution) and freeze all the remaining parameters in our network, as depicted in Fig. 3 ④. Subsequently, we train the OMNR until convergence. Different datasets require different numbers of iterations and learning rates, the details of which are described in Sec. 5.2. After completing the training process, we deploy our model by reparameterizing the RepNR blocks into convolutions.

5.2 Evaluation Metrics and Datasets

PSNR and SSIM [72] are utilized as quantitative evaluation metrics for pixel-wise and structural assessment. It’s important to note that the pixel value of low-light raw images usually lies in a smaller range than sRGB images, typically $[0,0.5]$ after normalization. This can result in a lower mean square error and higher PSNR. We evaluated our proposed LED on 3 RAW-based denoising datasets, namely SID [7], ELD [8] and our proposed MultiRAW.
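The effect of the value range on PSNR can be seen in a toy example. The sketch below assumes a fixed peak value of 1.0 and made-up pixel values; `psnr` is a plain reference implementation written for this illustration, not the evaluation code used for the tables.

```python
import math

def psnr(pred, target, peak=1.0):
    """PSNR in dB between two equally long pixel lists, for a fixed peak."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

# Hypothetical pixel values spanning [0, 1].
target = [0.1, 0.4, 0.7, 0.9]
pred = [0.12, 0.37, 0.71, 0.86]
full = psnr(pred, target)

# The same relative error, but with all values compressed into [0, 0.5].
half = psnr([0.5 * p for p in pred], [0.5 * t for t in target])

# Halving the range divides the MSE by 4, so PSNR rises by 20 * log10(2) dB.
gain_db = half - full
```

With the peak kept at 1.0, compressing the values into half the range raises PSNR by about 6 dB even though the relative error is unchanged, which is why PSNR values on dark RAW data tend to look high.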

SID [7] dataset. The SID [7] dataset exclusively comprises the Sony A7S2 camera model, yet its test scenes are highly diverse, which makes it well suited to demonstrating the algorithm’s efficacy. Consequently, a substantial number of ablation experiments are based on this dataset. We randomly selected two pairs of data for each additional digital gain ( $\times 100$ , $\times 250$ , and $\times 300$ ), for a total of six pairs, as the few-shot training dataset. Since the coordinate $\mathcal{C}$ (first mentioned in Eqn. (5)) of the Sony A7S2 is already included in our pre-defined parameter space $\mathcal{S}$ , the required training strategy can be relatively mild. We first fine-tune CSA ${}^{T}$ using a learning rate of $10^{-4}$ for 1K iterations. Subsequently, we fine-tune the OMNR branch for 500 iterations using a learning rate of $10^{-5}$ .

ELD [8] dataset. The ELD [8] dataset encompasses four camera models: Sony A7S2, Nikon D850, Canon EOS70D, and Canon EOS700D. We used the paired raw images of the first two scenarios for fine-tuning the pre-trained network, while the remaining eight scenarios were used for evaluation. All the metrics in Tab. II are calculated across the eight scenes for fair comparison. On the ELD [8] dataset, since the coordinates $\mathcal{C}$ of all four cameras are included in our pre-defined parameter space $\mathcal{S}$ , the training strategy is the same as for the SID [7] dataset.

MultiRAW dataset. The MultiRAW dataset includes five camera models not previously mentioned: Sony A6400, Canon EOSR10, and three other cameras. Given that this dataset is intended for few-shot raw image denoising, we directly use its training set for fine-tuning. The training strategy on the MultiRAW dataset is somewhat more aggressive because the coordinates $\mathcal{C}$ of the 5 camera models in the MultiRAW dataset are not included in our pre-defined parameter space $\mathcal{S}$ . However, this allows us to fully verify the effectiveness of our proposed LED on unseen camera models. During the fine-tuning process, we adopted the SGDR [73] learning rate decay strategy. Initially, CSA ${}^{T}$ is trained with a learning rate from $2\times 10^{-4}$ to $5\times 10^{-5}$ for 1K iterations for rapid convergence. Subsequently, the OMNR is trained for 2K iterations with a learning rate from $10^{-4}$ to $10^{-5}$ .
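For reference, the cosine decay used by SGDR can be sketched as follows. The example assumes a single annealing cycle without warm restarts for each stage, which the text does not specify; `cosine_lr`, `csa_lr`, and `omnr_lr` are hypothetical helper names.

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min):
    """Cosine-annealed learning rate over one SGDR cycle (no restart assumed)."""
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

def csa_lr(step):
    """Stage 1 (CSA^T): 2e-4 down to 5e-5 over 1K iterations."""
    return cosine_lr(step, 1000, 2e-4, 5e-5)

def omnr_lr(step):
    """Stage 2 (OMNR): 1e-4 down to 1e-5 over 2K iterations."""
    return cosine_lr(step, 2000, 1e-4, 1e-5)
```

Under this assumption, the rate starts at the upper value, passes the midpoint of the two bounds halfway through the stage, and ends at the lower value.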

TABLE III: Quantitative results on the five different camera models, Canon EOSR10, Sony A6400, and three other camera models, of the proposed MultiRAW dataset. The best result is in bold. Time denotes the training time on a single NVIDIA GeForce RTX 3090 GPU with the training strategy described in Sec. 5.1. For LED and AINDNet [23], Time denotes the training time of the fine-tuning stage (required only when deploying to new camera models). AINDNet* indicates that the AINDNet is pre-trained with our proposed noise model instead of AWGN. All methods except AINDNet are trained with the same UNet architecture, while we keep the AINDNet the same as in its original paper, with almost twice the number of parameters compared to the UNet. Please note that these metrics were calculated across all scenarios of the proposed MultiRAW dataset.

Camera	Ratio	P-G			AINDNet* [23]			ELD [8]			Zhang et al. [16]			LED (Ours)
Camera	Ratio	PSNR	SSIM	Time	PSNR	SSIM	Time	PSNR	SSIM	Time	PSNR	SSIM	Time	PSNR	SSIM	Time
Canon EOSR10	$\times 1$	45.5070	0.9895	4h 35m 27s	42.8885	0.9749	15m 01s	45.4837	0.9786	4h 37m 11s	45.4036	0.9865	4h 29m 12s	48.6290	0.9918	7m 17s
	$\times 10$	44.7179	0.9847		41.8977	0.9670		43.4092	0.9601		43.9946	0.9803		46.3750	0.9842
	$\times 100$	39.8212	0.9064		39.2519	0.9391		40.6755	0.9310		41.2814	0.9594		41.8574	0.9547
	$\times 200$	37.0122	0.8130		38.3639	0.9279		40.3582	0.9439		40.1521	0.9486		40.8654	0.9456
	$\times 300$	34.5953	0.7769		35.7965	0.8700		37.7036	0.8987		37.6117	0.8967		37.7800	0.8972
Sony A6400	$\times 1$	49.3146	0.9934	4h 23m 15s	43.5193	0.9750	15m 15s	48.9889	0.9927	4h 39m 27s	48.3114	0.9913	4h 29m 32s	49.0211	0.9936	7m 19s
	$\times 10$	47.7593	0.9880		42.7484	0.9677		47.1114	0.9835		46.6079	0.9843		47.4265	0.9880
	$\times 100$	43.6363	0.9415		41.0480	0.9531		43.1836	0.9346		43.3121	0.9505		43.7688	0.9613
	$\times 200$	41.3958	0.9131		39.8725	0.9383		42.0199	0.9204		42.1055	0.9379		42.5766	0.9562
	$\times 300$	38.1028	0.8427		38.0563	0.9098		39.5744	0.8873		40.2146	0.9169		40.3370	0.9381
Camera3	$\times 1$	41.1760	0.9798	4h 36m 23s	40.7700	0.9594	15m 15s	40.5599	0.9796	4h 38m 12s	42.0061	0.9790	4h 30m 33s	42.3091	0.9816	7m 13s
	$\times 10$	40.0307	0.9677		39.4657	0.9420		39.6185	0.9666		40.4674	0.9672		40.7769	0.9700
	$\times 100$	36.2148	0.8938		36.1391	0.8914		36.7027	0.9138		37.2370	0.9280		37.4741	0.9311
	$\times 200$	34.3638	0.8487		35.1045	0.8783		35.2796	0.8791		36.0706	0.9045		36.0443	0.9130
	$\times 300$	30.4170	0.7663		31.4775	0.7760		31.8913	0.8211		32.8985	0.8532		33.0504	0.8561
Camera4	$\times 1$	49.2394	0.9942	4h 36m 20s	43.7557	0.9705	15m 08s	47.9876	0.9924	4h 38m 15s	47.4546	0.9887	4h 30m 30s	50.1183	0.9945	7m 19s
	$\times 10$	47.6744	0.9895		42.9754	0.9636		46.3897	0.9811		45.8446	0.9768		47.7583	0.9895
	$\times 100$	41.9510	0.9335		39.8534	0.9360		42.4956	0.9537		42.0030	0.9540		41.9648	0.9587
	$\times 200$	40.5930	0.9230		38.7384	0.9294		41.0072	0.9463		40.3252	0.9354		40.5241	0.9503
	$\times 300$	36.6494	0.8391		36.2330	0.8915		38.5018	0.9108		38.6361	0.9231		38.1756	0.9209
Camera5	$\times 1$	48.6019	0.9928	4h 24m 03s	42.8059	0.9713	14m 58s	47.1503	0.9874	4h 18m 44s	46.0550	0.9868	4h 29m 52s	46.9796	0.9897	7m 16s
	$\times 10$	43.4577	0.9134		41.6037	0.9545		43.5000	0.9627		43.9310	0.9749		44.5822	0.9753
	$\times 100$	36.4346	0.7930		38.1994	0.9081		39.6707	0.9040		39.9786	0.9321		41.3606	0.9478
	$\times 200$	32.6378	0.7228		36.4481	0.8836		37.3455	0.8712		37.6322	0.9017		39.8046	0.9307
	$\times 300$	29.2045	0.6537		32.9607	0.8229		34.5113	0.8179		33.9278	0.8524		36.4322	0.8922

\begin{overpic}[width=429.28616pt]{sid_comparison.pdf} \put(5.8,53.5){{Input}} \put(18.9,53.5){{AINDNet~\cite{kim2020transfer}}} \put(34.6,53.5){{Zhang~{et al.}~\cite{zhang2021rethinking}}} \put(55.0,53.5){{ELD~\cite{wei2021physics}}} \put(70.2,53.5){{{\bf{LED~(Ours)}}}} \put(90.5,53.5){GT} \end{overpic}

Figure 6: Visual comparison between our LED and other state-of-the-art methods on the SID [7] dataset (Zoom-in for best view). We amplified and post-processed the input images with the same ISP as ELD [8].

5.3 Comparison with State-of-the-art Methods

We assess the performance of our LED on three distinct datasets: the Sony subset of SID [7], the ELD dataset [8], and the 5 subsets in our MultiRAW dataset. This evaluation aims to gauge the generalization capabilities of LED across outdoor and indoor scenes and across more camera models, respectively. LED is benchmarked against state-of-the-art raw denoising methods designed for extremely low-light environments. These comparative analyses include:

•

DNN model-based methods: Exemplars in this category encompass the approaches presented by Kristina et al. [20] and NoiseFlow [13]. These methodologies initially undergo training on paired real raw images, enabling them to learn the intricacies of noise generation specific to a particular camera. However, they may necessitate additional iterations when applied to a novel camera model.
•

Calibration-based methods: This classification encompasses ELD [8], the approach proposed by Zhang et al. [16], and Calibrated P-G. Noteworthy is the requirement for a time-intensive and laborious calibration process intrinsic to these methods.
•

Real data-based methods: Techniques falling under this category involve training with various data pairings, such as noisy-clean pairs (SID [7]), noisy-noisy pairs (Noise2Noise [5]), and transfer learning as demonstrated by AINDNet [23].

The denoising network for all methods above is trained under identical settings, following the parameters outlined in ELD [8]. This standardization ensures a fair and consistent basis for comparison, as elucidated in Sec. 5.1.

Quantitative Evaluation. As demonstrated in Tab. I, Tab. II and Tab. III, our approach surpasses previous calibration-based methods in denoising performance under extremely low-light conditions. The disparity between synthetic and real noise is exacerbated at large ratios ( $\times 250$ and $\times 300$ ), resulting in diminished performance when training with synthetic noise. This is exemplified by the comparison between ELD [8] and SID [7]. Moreover, DNN model-based methods often exhibit more significant discrepancies than calibration-based methods, with Kristina et al. [20] failing to account for different system gains. Our method mitigates this discrepancy by fine-tuning with few-shot real data, achieving superior performance under $\times 100$ and $\times 250$ digital gain, as detailed in Tab. I. AINDNet [23] also demonstrates enhanced performance under extremely dark scenes, benefitting from a noise model with reduced deviation. Notably, the noise model deviation has minimal impact on denoising efficacy under small additional digital gain and may even enhance performance, as illustrated in Tab. II. Discussions related to this phenomenon can be found in Sec. 6. Significantly, our method exhibits superiority under extremely low-light scenes, even across different camera models. Additionally, when compared to alternative methods, LED incurs lower training costs in terms of data requirements, training iterations, and training time.

Qualitative Evaluation. The visual comparisons presented in Fig. 6, Fig. 7 and Fig. 8 illustrate the performance of our method against other state-of-the-art approaches on the SID [7], ELD [8] and MultiRAW datasets, respectively. Under extremely low-light conditions, LED recovers more high-frequency information. As shown in Camera3 in Fig. 8, LED is the only method to restore the strings of all three badminton rackets, especially the blue one. Also, the presence of intense noise significantly disrupts the color tone. In Fig. 6, input images exhibit noticeable green or purple color shifts, with many comparative methods struggling to restore the correct color tone. Leveraging implicit noise modeling and a diverse sampling space, LED efficiently reconstructs signals amidst severe noise interference, achieving accurate color rendering and preserving rich texture detail. Moreover, other methods often fail to discern and address enlarged out-of-model noises, resulting in the corruption of the final image with fixed patterns or specific positional artifacts. In contrast, during the fine-tuning, LED learns to effectively eliminate these camera-specific noises, enhancing visual quality and demonstrating robustness against such challenges.

\begin{overpic}[width=208.13574pt]{eld_comparison.pdf} \put(8.0,41.5){Input} \put(30.8,41.5){ELD~\cite{wei2021physics}} \put(52.4,41.5){{\bf{LED~(Ours)}}} \put(84.2,41.5){GT} \end{overpic}

Figure 7: Visual comparison on the ELD [8] dataset (Zoom-in for best view).

\begin{overpic}[width=424.94574pt]{multiraw_comparison.pdf} \put(7.5,65.0){{Input}} \put(24.5,65.0){{P-G}} \put(39.0,65.0){{ELD~\cite{wei2021physics}}} \put(52.0,65.0){{Zhang~{et al.}~\cite{zhang2021rethinking}}} \put(70.5,65.0){{{\bf{LED~(Ours)}}}} \put(90.5,65.0){GT} \end{overpic}

Figure 8: Visual comparison between our LED and other state-of-the-art calibration-based methods on our proposed MultiRAW dataset, along with 5 cameras (Zoom-in for best view). We amplified and post-processed the input images with the same ISP as ELD [8].

TABLE IV: Ablation studies on the RepNR block. The provided metrics are with the fine-tuning, as shown in ③ of Fig. 3.

Setting			$\times$ 100	$\times$ 250	$\times$ 300
U-net	CSA	OMNR	PSNR/SSIM	PSNR/SSIM	PSNR/SSIM
✓			41.518/0.951	39.140/0.923	36.273/0.898
✓	✓		41.866/0.954	39.201/0.931	36.499/0.912
✓	✓	✓	41.984/0.954	39.342/0.932	36.673/0.915

5.4 Ablation Studies

Reparameterized Noise Removal Block. We conduct experiments to analyze the impact of different components in the Reparameterized Noise Removal Block (RepNR). As depicted in Tab. IV, our RepNR consistently demonstrates improved performance across three different ratios, with each component in the RepNR block contributing positively to the overall pipeline.

Pre-training with Advanced Strategy. As outlined in Tab. V, pre-training with the SGDR [73] learning rate schedule and a larger batch size (equivalent to the training strategy of PMN [22]) yields further performance improvements, all while maintaining the same fine-tuning (2 image pairs for each ratio and 1.5K iterations). This underscores the scalability of the proposed LED. Additionally, in comparison to LLD [74], LED demonstrates superior performance with minimal data and training costs.

Comparison between CSA and Other Normalization. A technique similar to our proposed one is to insert normalization layers into the network, which is relatively common in transfer learning scenarios. To show the superiority of CSA over this usual approach, we directly replace the CSAs with different kinds of normalization layers and observe the difference. As shown in Tab. VI, the alternatives are Instance Normalization [50], Layer Normalization [51], and Batch Normalization [52] ( $*$ denotes BN without running mean and running variance). None of the normalization layers achieves performance comparable to CSA. One main reason is that the value range of features is crucial to the denoising task. Normalization seriously destroys the value range of the feature and breaks its stability. On the contrary, CSA roughly maintains the original value range, preventing model performance from collapsing.
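The value-range argument can be illustrated on a toy feature channel. The sketch below assumes that CSA acts as a per-channel scale and shift and that instance normalization uses per-channel statistics with a small `eps`; both functions and the sample values are simplified stand-ins written for this example, not the layers used in the experiments.

```python
import math

def csa(channel, weight=1.0, bias=0.0):
    """Channel-wise scale and shift; (1, 0) leaves the feature untouched."""
    return [weight * v + bias for v in channel]

def instance_norm(channel, eps=1e-12):
    """Normalize one channel of one sample to zero mean and unit variance."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    return [(v - mean) / math.sqrt(var + eps) for v in channel]

# Hypothetical features: the same pattern at two very different magnitudes.
dark = [0.010, 0.012, 0.011, 0.013]
bright = [40.0 * v for v in dark]

# After normalization the two channels become nearly indistinguishable,
# so the magnitude information is lost.
gap_after_norm = max(abs(a - b) for a, b in
                     zip(instance_norm(dark), instance_norm(bright)))

# A (1, 0)-initialized CSA keeps each channel in its original value range.
dark_after_csa = csa(dark)
```

In this toy case, normalization maps a dim and a bright channel to almost the same output, whereas the scale-and-shift operator preserves the difference in magnitude.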

TABLE V: Ablation studies on the pre-training strategy. The notation with $\star$ indicates utilizing the same training strategy as PMN [22] for the denoiser. At the same time, LED $\star$ employs this strategy specifically for the pre-training stage and keeps the fine-tuning the same as before.

Method	$\times$ 100	$\times$ 250	$\times$ 300
Method	PSNR/SSIM	PSNR/SSIM	PSNR/SSIM
LED	41.984/0.954	39.342/0.932	36.673/0.915
ELD $\star$ [8]	42.081/0.955	39.461/0.934	36.870/0.920
LLD $\star$ [74]	42.100/0.955	39.760/0.933	36.760/0.912
LED $\star$	42.396/0.955	39.843/0.939	36.997/0.923

TABLE VI: Ablation studies on the CSA. BN* denotes batch normalization without running mean and running variance.

Metric	CSA	IN [50]	LN [51]	BN [52]	BN*
PSNR	39.161	26.596	26.605	26.412	23.995
SSIM	0.9322	0.5883	0.5938	0.6066	0.4186

Virtual Camera Number. We conduct ablation studies on the number of virtual cameras used by our proposed LED. As shown in Fig. 9, LED achieves the best performance with five virtual cameras. Intuitively, too few cameras make it difficult for the model to learn common knowledge, while too many cameras significantly increase the difficulty of the learning process. Since five virtual cameras show a clear improvement, we chose five as the number of virtual cameras for our pre-training process.

Sampling Strategy. Uniform random sampling makes it hard to cover the whole parameter space $\mathcal{S}$ . In contrast, our sampling strategy covers the whole parameter space $\mathcal{S}$ , resulting in better performance, as shown in Tab. VII. Based on this observation, we use the equivalence point strategy to choose the parameters of the virtual cameras. To reduce errors, we repeated the uniform sampling experiment three times and averaged the metrics.

TABLE VII: Ablation studies on virtual camera sampling strategy. Rand represents leveraging uniform distribution as the strategy. The results of Rand are derived from the average of three trials to minimize errors.

Setting	$\times$ 100	$\times$ 250	$\times$ 300
Setting	PSNR/SSIM	PSNR/SSIM	PSNR/SSIM
Rand	41.5253/0.9489	39.2755/0.9283	36.3940/0.9070
Ours	41.9842/0.9539	39.3419/0.9317	36.6728/0.9147

\begin{overpic}[width=208.13574pt]{camera_number.pdf} \end{overpic}

Figure 9: Ablation studies on virtual camera numbers. PSNR and SSIM reach the apex when the virtual camera number is 5.

Initialization of CSA for Target Camera. Given the initialization of CSA ${}^{T}$ as described in Sec. 3.3, we compare the PSNR/SSIM obtained with the $\mathbf{(1,0)}$ initialization and with model averaging in Tab. X. The results indicate that, in most scenarios, model averaging yields superior performance. Furthermore, the performance on the Sony A7S2 of SID [7] is considered representative of the generalization ability, owing to the scale of the dataset.
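The two initialization options can be sketched in a few lines. The example assumes that each CSA branch is stored as a pair of per-channel weight and bias lists, one pair per virtual camera; `average_csa` and `identity_csa` are hypothetical names introduced for illustration.

```python
def average_csa(branches):
    """Average per-channel (weights, biases) over the virtual-camera CSAs.

    branches: list of (weights, biases) tuples, one per virtual camera.
    """
    n = len(branches)
    channels = len(branches[0][0])
    weights = [sum(b[0][c] for b in branches) / n for c in range(channels)]
    biases = [sum(b[1][c] for b in branches) / n for c in range(channels)]
    return weights, biases

def identity_csa(channels):
    """The (1, 0) initialization: unit scale and zero shift."""
    return [1.0] * channels, [0.0] * channels

# Hypothetical two-channel CSAs learned for two virtual cameras.
virtual_cameras = [([1.0, 2.0], [0.0, 0.5]),
                   ([3.0, 4.0], [1.0, 0.5])]
csa_target_init = average_csa(virtual_cameras)
```

Model averaging starts the target-camera CSA from a point that already reflects the pre-trained branches, while the $\mathbf{(1,0)}$ option starts from the identity mapping.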

Fine-tuning with More Images. Ablation studies are conducted to explore the impact of the number of fine-tuning pairs, illustrating the potential of our proposed LED. As depicted in Fig. 10, an increase in the quantity of paired data correlates with a gradual performance improvement. Moreover, LED outperforms ELD [8] even when fine-tuned with only two noisy-clean pairs for each ratio. Further discussions are provided in Sec. 6.

\begin{overpic}[width=208.13574pt]{fewshot_list.pdf} \end{overpic}

Figure 10: Ablation studies on the amount of data used for fine-tuning. LED achieves superior performance with just 2 pairs for each ratio.

5.5 Further Application

Equipping Other Network Architectures with the RepNR Block. By simply replacing the convolutional operators of other structures with our proposed RepNR block, LED can be easily migrated to architectures beyond UNet. In Tab. VIII, we experiment with Restormer [31] and NAFNet [32], which are transformer-based and convolution-based, respectively. Results demonstrate that LED still achieves performance comparable to calibration-based methods.

TABLE VIII: Experiments on network architecture. For LED, we first replace most of the convolution blocks with our proposed RepNR block during pre-training and fine-tuning. At deployment, LED yields the same architecture as the other methods without any additional computational burden, owing to the structural reparameterization procedure.

Architecture	Method	$\times 100$	$\times 250$	$\times 300$
Architecture	Method	PSNR/SSIM	PSNR/SSIM	PSNR/SSIM
Restormer [31]	P-G	39.457/0.8943	33.956/0.7525	30.964/0.6409
	ELD [8]	42.568/0.9536	38.699/0.9280	35.863/0.9059
	LED	42.452/0.9492	39.376/0.9143	36.322/0.9143
NAFNet [32]	P-G	39.388/0.8945	33.892/0.7541	30.948/0.6445
	ELD [8]	42.351/0.9535	38.697/0.9300	35.931/0.9112
	LED	42.368/0.9532	39.277/0.9351	36.292/0.9188

\begin{overpic}[width=208.13574pt]{two_pairs.pdf} \end{overpic}

Figure 11: Illustration of the feasible solution space (blue area) depicting the linear relationship between the overall system gain $\log(K)$ and noise variance $\log(\sigma)$ under various sample strategies.

LED pre-training could boost the performance of other methods. By integrating LED pre-training into various existing calibration-based or paired data-based methods, as referenced in [8, 7], our approach facilitates notable enhancements in performance as shown in Tab. IX. These improvements are not uniform but rather depend on the difference in the pre-training strategies employed. This proves particularly effective in industrial applications, where the demands for efficiency are paramount. The strategic application of LED pre-training not only boosts the performance of the denoiser but also paves the way for more advanced, adaptable, and efficient denoising.

TABLE IX: Experiments on LED pre-training with other methods. $\mathbf{X}+\mathbf{Y}$ denotes that method $\mathbf{X}$ is trained on the pre-trained network of $\mathbf{Y}$ . $\star$ indicates the utilization of the same advanced training strategy as PMN [22] for the denoiser during pre-training.

Method	$\times 100$		$\times 250$		$\times 300$
Method	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM
ELD [8]	41.827	0.9538	38.849	0.9278	35.940	0.8982
ELD [8] $+$ LED	42.170	0.9558	39.285	0.9302	36.384	0.9058
ELD [8] $+$ LED $\star$	42.471	0.9567	39.454	0.9333	36.534	0.9138
SID [7]	41.727	0.9531	39.135	0.9304	37.363	0.9341
SID [7] $+$ LED	42.277	0.9580	39.576	0.9445	37.518	0.9369
SID [7] $+$ LED $\star$	42.320	0.9585	39.613	0.9455	37.614	0.9369

TABLE X: Ablation studies on the initialization strategy of CSA for the target camera. “Sony A7S2#” denotes that fine-tuning and testing is performed on the SID [7] dataset, while other evaluations are conducted based on the ELD [8] dataset.

Init	Metric	Sony		Nikon	Canon
Init	Metric	A7S2#	A7S2	D850	EOS700D	EOS70D
$\mathbf{(1,0)}$	PSNR	39.015	47.310	45.790	41.409	42.344
$\mathbf{(1,0)}$	SSIM	0.9307	0.9809	0.9737	0.9408	0.9520
Avg.	PSNR	39.161	47.616	45.903	41.516	42.495
Avg.	SSIM	0.9322	0.9817	0.9743	0.9412	0.9524

TABLE XI: Ablation studies on the pair count for fine-tuning, tested on the synthetic dataset. $n$ represents fine-tuning with $n$ data pairs with a similar overall system gain for each ratio. $n^{*}$ denotes pairs of data with marginally different overall system gains.

Ratio	1	2	4	2*
$\times 100$	41.295/0.9480	41.704/0.9523	41.432/0.9466	43.795/0.9648
$\times 250$	39.239/0.9350	39.410/0.9351	39.327/0.9367	41.311/0.9457
$\times 300$	38.314/0.9229	38.486/0.9216	38.499/0.9240	39.190/0.9278

6 Discussions

Why $2$ pairs for each ratio? As indicated in Eqn. (4), the variance of noise $\log(\sigma)$ exhibits a linear relationship with the overall system gain $\log(K)$ . With only one pair of data, establishing the correct linear relationship is unattainable, resulting in suboptimal performance, as demonstrated in Tab. XI. Furthermore, utilizing two or more pairs with similar system gains fails to precisely model the linear relationship due to a non-negligible error in the sampling scope ( $\hat{\sigma}$ in Eqn. (4)), as illustrated in Fig. 11. Following the principle of using two points to determine a straight line, adopting two pairs with marginally different system gains facilitates the accurate modeling of linearity, significantly enhancing denoising capabilities. Additionally, as shown in Fig. 10, an increase in the number of pairs enables a more accurate fitting of linearity, thereby reducing regression errors further.

For typical explicit calibration-based methods, the primary objective of the calibration process is to compute the linear relationships mentioned previously. Subsequently, the network is trained on synthetic data to learn this relationship. However, our implicit calibration adjusts the learned linear relationships of the network directly through “calibrating” network parameters. This approach makes the entire process more direct and enables the network to serve as a swift adapter.

\begin{overpic}[width=208.13574pt]{raw_distribution_nikon.pdf} \put(69.3,19.0){\small{$KLD=0.0289$}} \end{overpic}
\begin{overpic}[width=208.13574pt]{raw_distribution_canon.pdf} \put(69.3,22.0){\small{$KLD=0.2978$}} \end{overpic}

Figure 12: Histogram of intensities captured in the same scene with three different camera models: Nikon D850, Canon EOS700D, and Sony A7S2. $KLD$ denotes the KL divergence between distributions. Note that the distribution is similar between Nikon and Sony, while the difference remains between Sony and Canon.
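A KL divergence between two intensity histograms, such as the one reported in Fig. 12, can be computed along the following lines. This is a minimal sketch under stated assumptions: the number of bins, the normalized intensity range, the smoothing constant, and the direction $\mathrm{KL}(P\,\|\,Q)$ are our choices, since the text does not specify them, and `histogram_kld` is a hypothetical helper name.

```python
import numpy as np

def histogram_kld(samples_p, samples_q, bins=64, value_range=(0.0, 1.0), eps=1e-12):
    """KL(P || Q) between intensity histograms built on shared bin edges."""
    p, _ = np.histogram(samples_p, bins=bins, range=value_range)
    q, _ = np.histogram(samples_q, bins=bins, range=value_range)
    # Smooth empty bins, then normalize the counts into probabilities.
    p = p.astype(float) + eps
    q = q.astype(float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# Synthetic stand-ins for normalized RAW intensities of two cameras.
uniform_like = np.linspace(0.0, 1.0, 1000)
dark_skewed = uniform_like ** 2  # more mass at low intensities

kld_same = histogram_kld(uniform_like, uniform_like)
kld_diff = histogram_kld(dark_skewed, uniform_like)
```

Sharing the bin edges between both histograms matters: the divergence is only meaningful when the two distributions are discretized identically.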

Noise prior or image prior? Both! Existing calibration-based methods uniformly rely on a noise prior (explicit noise model calibration). However, these methods can exhibit sudden performance degradation on certain cameras, as shown for Canon EOS70D and Canon EOS700D in Tab. II. This is attributed to these methods having learned an excessive amount of image priors from other cameras during training. Sensors from different manufacturers have different response models and thus yield different signal intensities for the same scene. In most calibration-based methods [8, 16], the network’s denoising ability is restricted to a certain image distribution prior, i.e., that of Sony A7S2. As stated in [49] and shown in Fig. 12, the intensity distributions of Nikon D850 and Sony A7S2 are highly similar. Therefore, a synthetic image generated from the response intensity of Sony A7S2 and the noise model of Nikon D850 deviates only slightly from the real image prior, which helps the network achieve strong performance, as shown for Nikon D850 in Tab. II. In contrast, the intensity distributions of Canon EOS700D and Sony A7S2 differ substantially, leading to a performance drop.

However, it is important to note that as the additional digital gain increases, the performance gap between LED and other methods gradually narrows. This is because higher digital gain leads to more pronounced noise, making the noise prior learned by the network more effective. Conversely, under low digital gain, the image prior previously learned by the network becomes predominant.

Based on this observation, the balance between image prior and noise prior is the key to this problem. With the help of the proposed CSA, features are aligned to the shared space before denoising, decreasing the influence of the image prior on the network. As shown in Tab. XII, even when pre-trained with the response model of Sony A7S2, LED can outperform other calibration-based methods. Furthermore, fine-tuning on a few pairs of images from the target camera supplies the camera-specific information, helping the network learn both the image prior and the noise prior.

TABLE XII: Ablation studies on training with noisy pairs generated from different RAW sources. The experiments are based on the Canon EOS700D camera and Sony A7S2 of the ELD [8] dataset. RAW Src. denotes that the RAW image pairs for fine-tuning are generated by the ground truth of Sony A7S2 or Canon EOS700D.

RAW Src.	$\times 1$ (PSNR/SSIM)	$\times 10$ (PSNR/SSIM)	$\times 100$ (PSNR/SSIM)	$\times 200$ (PSNR/SSIM)
Sony	44.27/0.992	42.15/0.982	37.43/0.917	34.74/0.867
Canon	46.24/0.992	44.14/0.983	37.94/0.920	34.78/0.869

7 Conclusion and Future Work

To address the inherent shortcomings of calibration-based methods, we introduce an implicit calibration pipeline designed to light even the darkest scenes. Leveraging the camera-specific alignment (CSA), we substitute the explicit calibration procedure with an implicit learning process on the denoiser. The CSA facilitates rapid adaptation to the target camera by separating camera-specific information from the common knowledge of the noise model. Additionally, a parallel convolution mechanism is implemented to learn and eliminate out-of-model noise. With 2 pairs for each ratio (a total of 6 pairs) and 1.5K iterations, our approach attains superior performance compared to existing methods.

Up to this point, the final output quality of LED is still strongly correlated with the data quality used in the few-shot fine-tuning. However, this is not solely a limitation of our method but a common drawback of most few-shot methods. Future work could focus more on making few-shot learning more stable. This represents a key distinction between LED and previous methods: earlier approaches primarily concentrated on engineering for sensor noise modeling rather than focusing on deep learning techniques like few-shot, transfer, or continual learning. Consequently, LED allows researchers to shift their focus from sensor engineering to exploring few-shot learning.

Acknowledgement

This research was supported by the NSFC (NO. 62225604, 62306153) and the Fundamental Research Funds for the Central Universities (Nankai University, 070-63233089). Computation was supported by the Supercomputing Center of Nankai University. Moreover, we would like to express our profound gratitude to Yixuan Huang, Yipeng Du, Bowen Yin, Yunheng Li, and Ruihong Cen (in no particular order) for their dedicated efforts in constructing our dataset.

References

[1] X. **, J.-W. Xiao, L.-H. Han, C. Guo, R. Zhang, X. Liu, and C. Li, “Lighting every darkness in two pairs: A calibration-free pipeline for raw denoising,” in ICCV, 2023.
[2] A. Buades, B. Coll, and J.-M. Morel, “A non-local algorithm for image denoising,” in CVPR, 2005.
[3] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE TIP, 2017.
[4] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep image prior,” in CVPR, 2018.
[5] J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila, “Noise2noise: Learning image restoration without clean data,” in CVPR, 2018.
[6] A. Abdelhamed, S. Lin, and M. S. Brown, “A high-quality denoising dataset for smartphone cameras,” in CVPR, 2018.
[7] C. Chen, Q. Chen, J. Xu, and V. Koltun, “Learning to see in the dark,” in CVPR, 2018.
[8] K. Wei, Y. Fu, Y. Zheng, and J. Yang, “Physics-based noise modeling for extreme low-light photography,” IEEE TPAMI, 2021.
[9] K. Zhang, W. Zuo, and L. Zhang, “Ffdnet: Toward a fast and flexible solution for cnn-based image denoising,” IEEE TIP, 2018.
[10] S. Guo, Z. Yan, K. Zhang, W. Zuo, and L. Zhang, “Toward convolutional blind denoising of real photographs,” in CVPR, 2019.
[11] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, and L. Shao, “Learning enriched features for fast image restoration and enhancement,” IEEE TPAMI, 2022.
[12] X. **, L.-H. Han, Z. Li, C.-L. Guo, Z. Chai, and C. Li, “Dnf: Decouple and feedback network for seeing in the dark,” in CVPR, 2023.
[13] A. Abdelhamed, M. A. Brubaker, and M. S. Brown, “Noise flow: Noise modeling with conditional normalizing flows,” in ICCV, 2019.
[14] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, and L. Shao, “Cycleisp: Real image restoration via improved data synthesis,” in CVPR, 2020.
[15] G. Jang, W. Lee, S. Son, and K. M. Lee, “C2n: Practical generative noise modeling for real-world denoising,” in CVPR, 2021.
[16] Y. Zhang, H. Qin, X. Wang, and H. Li, “Rethinking noise synthesis and modeling in raw denoising,” in ICCV, 2021.
[17] A. Maleky, S. Kousha, M. S. Brown, and M. A. Brubaker, “Noise2noiseflow: Realistic camera noise modeling without clean images,” in CVPR, 2022.
[18] S. Kousha, A. Maleky, M. S. Brown, and M. A. Brubaker, “Modeling srgb camera noise with normalizing flows,” in CVPR, 2022.
[19] Y. Wang, H. Huang, Q. Xu, J. Liu, Y. Liu, and J. Wang, “Practical deep raw image denoising on mobile devices,” in ECCV, 2020.
[20] K. Monakhova, S. R. Richter, L. Waller, and V. Koltun, “Dancing under the stars: video denoising in starlight,” in CVPR, 2022.
[21] Y. Zou and Y. Fu, “Estimating fine-grained noise model via contrastive learning,” in CVPR, 2022.
[22] H. Feng, L. Wang, Y. Wang, and H. Huang, “Learnability enhancement for low-light raw denoising: Where paired real data meets noise modeling,” in ACM MM, 2022.
[23] Y. Kim, J. W. Soh, G. Y. Park, and N. I. Cho, “Transfer learning from synthetic to real-noise denoising with adaptive instance normalization,” in CVPR, 2020.
[24] X. Ding, X. Zhang, J. Han, and G. Ding, “Diverse branch block: Building a convolution as an inception-like unit,” in CVPR, 2021.
[25] X. Ding, X. Zhang, N. Ma, J. Han, G. Ding, and J. Sun, “Repvgg: Making vgg-style convnets great again,” in CVPR, 2021.
[26] L. Chen, Y. Fu, K. Wei, D. Zheng, and F. Heide, “Instance segmentation in the dark,” IJCV, 2023.
[27] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI, 2015.
[28] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, and L. Shao, “Learning enriched features for real image restoration and enhancement,” in ECCV, 2020.
[29] L. Chen, X. Lu, J. Zhang, X. Chu, and C. Chen, “Hinet: Half instance normalization network for image restoration,” in CVPR, 2021.
[30] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, and L. Shao, “Multi-stage progressive image restoration,” in CVPR, 2021.
[31] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, and M.-H. Yang, “Restormer: Efficient transformer for high-resolution image restoration,” in CVPR, 2022.
[32] L. Chen, X. Chu, X. Zhang, and J. Sun, “Simple baselines for image restoration,” in ECCV, 2022.
[33] Z. Zhang, Y. Jiang, W. Shao, X. Wang, P. Luo, K. Lin, and J. Gu, “Real-time controllable denoising for image and video,” in CVPR, 2023.
[34] R. A. Boie and I. J. Cox, “An analysis of camera noise,” IEEE TPAMI, 1992.
[35] G. E. Healey and R. Kondepudy, “Radiometric ccd camera calibration and noise estimation,” IEEE TPAMI, 1994.
[36] R. D. Gow, D. Renshaw, K. Findlater, L. Grant, S. J. McLeod, J. Hart, and R. L. Nicol, “A comprehensive tool for modeling cmos image-sensor-noise performance,” IEEE TED, 2007.
[37] K. Irie, A. E. McKinnon, K. Unsworth, and I. M. Woodhead, “A technique for evaluation of ccd video-camera noise,” IEEE TCSVT, 2008.
[38] M. Konnik and J. Welsh, “High-level numerical simulations of noise in ccd and cmos photosensors: review and tutorial,” arXiv:1412.4031, 2014.
[39] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” in NeurIPS, 2014.
[40] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in ICML, 2020.
[41] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in CVPR, 2020.
[42] H. Wach and E. R. Dowski Jr, “Noise modeling for design and simulation of computational imaging systems,” in Visual Information Processing XIII, 2004.
[43] M. Maggioni, E. Sánchez-Monge, and A. Foi, “Joint removal of random and fixed-pattern noise through spatiotemporal video filtering,” IEEE TIP, 2014.
[44] X. Huang and S. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” in ICCV, 2017.
[45] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in CVPR, 2019.
[46] T. Hospedales, A. Antoniou, P. Micaelli, and A. Storkey, “Meta-learning in neural networks: A survey,” IEEE TPAMI, 2021.
[47] H.-J. Ye, L. Ming, D.-C. Zhan, and W.-L. Chao, “Few-shot learning with a strong teacher,” IEEE TPAMI, 2022.
[48] G. Huang, I. Laradji, D. Vazquez, S. Lacoste-Julien, and P. Rodriguez, “A survey of self-supervised and few-shot object detection,” IEEE TPAMI, 2022.
[49] K. R. Prabhakar, V. Vinod, N. R. Sahoo, and R. V. Babu, “Few-shot domain adaptation for low light raw image enhancement,” in BMVC, 2021.
[50] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance normalization: The missing ingredient for fast stylization,” arXiv:1607.08022, 2016.
[51] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv:1607.06450, 2016.
[52] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, 2015.
[53] B. L. Joiner and J. R. Rosenblatt, “Some properties of the range in samples from tukey’s symmetric lambda distributions,” Journal of the American Statistical Association, 1971.
[54] S. Ravi and H. Larochelle, “Optimization as a model for few-shot learning,” in ICLR, 2016.
[55] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in ICML, 2017.
[56] B. Xu, N. Wang, T. Chen, and M. Li, “Empirical evaluation of rectified activations in convolutional network,” arXiv:1505.00853, 2015.
[57] C.-B. Zhang, J.-W. Xiao, X. Liu, Y.-C. Chen, and M.-M. Cheng, “Representation compensation networks for continual semantic segmentation,” in CVPR, 2022.
[58] J. Cha, S. Chun, K. Lee, H.-C. Cho, S. Park, Y. Lee, and S. Park, “Swad: Domain generalization by seeking flat minima,” in NeurIPS, 2021.
[59] P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson, “Averaging weights leads to wider optima and better generalization,” arXiv:1803.05407, 2018.
[60] J.-W. Xiao, C.-B. Zhang, J. Feng, X. Liu, J. van de Weijer, and M.-M. Cheng, “Endpoints weight fusion for class incremental semantic segmentation,” in CVPR, 2023.
[61] X. Ding, Y. Guo, G. Ding, and J. Han, “Acnet: Strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks,” in ICCV, 2019.
[62] X. Ding, H. Chen, X. Zhang, J. Han, and G. Ding, “Repmlpnet: Hierarchical vision mlp with re-parameterized locality,” in CVPR, 2022.
[63] X. Ding, Y. Zhang, Y. Ge, S. Zhao, L. Song, X. Yue, and Y. Shan, “Unireplknet: A universal perception large-kernel convnet for audio, video, point cloud, time-series and image recognition,” arXiv:2311.15599, 2023.
[64] M. Hu, J. Feng, J. Hua, B. Lai, J. Huang, X. Gong, and X.-S. Hua, “Online convolutional re-parameterization,” in CVPR, 2022.
[65] X. **, J.-W. Xiao, and Y. Huang, “Led,” https://github.com/Srameo/LED, 2023.
[66] M. Riechert, “Rawpy,” https://github.com/letmaik/rawpy, 2014.
[67] A. Pavao, I. Guyon, A.-C. Letournel, D.-T. Tran, X. Baro, H. J. Escalante, S. Escalera, T. Thomas, and Z. Xu, “Codalab competitions: An open source platform to organize scientific challenges,” JMLR, 2023.
[68] S. Cheng, Y. Wang, H. Huang, D. Liu, H. Fan, and S. Liu, “Nbnet: Noise basis learning for image denoising with subspace projection,” in CVPR, 2021.
[69] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” in NIPS Workshops, 2017.
[70] Mindspore-AI, “Mindspore,” https://github.com/mindspore-ai/mindspore, 2019.
[71] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014.
[72] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE TIP, 2004.
[73] I. Loshchilov and F. Hutter, “Sgdr: Stochastic gradient descent with warm restarts,” in ICLR, 2017.
[74] Y. Cao, M. Liu, S. Liu, X. Wang, L. Lei, and W. Zuo, “Physics-guided iso-dependent sensor noise modeling for extreme low-light photography,” in CVPR, 2023.