License: CC BY-NC-SA 4.0
arXiv:2308.03448v2 [cs.CV] 25 Dec 2023

Make Explicit Calibration Implicit:
Calibrate Denoiser Instead of the Noise Model

Xin Jin, Jia-Wen Xiao, Ling-Hao Han, Chunle Guo, Xialei Liu, Chongyi Li, and Ming-Ming Cheng

All the authors are with VCIP, CS, Nankai University, Tianjin, China. CL Guo and MM Cheng ({guochunle,cmm}@nankai.edu.cn) are corresponding authors. This paper is an extension of our ICCV 2023 conference version [1].
Abstract

Explicit calibration-based methods have dominated RAW image denoising under extremely low-light environments. However, these methods are impeded by several critical limitations: a) the explicit calibration process is both labor- and time-intensive, b) transferring denoisers across different camera models is challenging, and c) the disparity between synthetic and real noise is exacerbated by digital gain. To address these issues, we introduce a groundbreaking pipeline named Lighting Every Darkness (LED), which is effective regardless of the digital gain or the camera sensor. LED eliminates the need for explicit noise model calibration, instead utilizing an implicit fine-tuning process that allows quick deployment and requires minimal data. Structural modifications are also included to reduce the discrepancy between synthetic and real noise without extra computational demands. Our method surpasses existing methods on various camera models, including new ones not in public datasets, with just a few pairs per digital gain and only 0.5% of the typical iterations. Furthermore, LED allows researchers to focus more on deep learning advancements while still utilizing sensor engineering benefits. Code and related materials can be found at https://srameo.github.io/projects/led-iccv23/.

Index Terms:
Extreme low-light imaging, few-shot learning, deep low-light image denoising, low-light denoising dataset.

1 Introduction

Noise, an inescapable topic for image capturing, has been systematically investigated in recent years [2, 3, 4, 5, 6, 7, 8]. Compared to standard RGB images, RAW images offer two substantial advantages for image denoising: tractable, primitive noise distribution [8] and higher bit depth for differentiating signal from noise. Learning-based methodologies have demonstrated remarkable advancements in RAW image denoising, particularly when utilizing paired real datasets [9, 10, 11, 12]. However, creating extensive real RAW image datasets tailored to each camera model is impractical. Consequently, there has been a growing focus on applying learning-based techniques to synthetic datasets, a trend reflected in various studies [13, 14, 15, 8, 16, 17, 18].

Calibration-based noise synthesis, particularly when employing physics-based models, has demonstrated its proficiency in accurately fitting real noise characteristics [19, 8, 16, 20, 21, 22]. These methods typically adhere to a systematic process. Initially, they construct a well-designed noise model that aligns with the electronic imaging pipeline. Subsequently, a specific target camera is chosen, and the parameters of the pre-defined noise model are meticulously calibrated. The final step involves generating synthetic paired data for training a denoising network. Moreover, some approaches have been exploring the use of Deep Neural Network (DNN)-based generative models to facilitate the calibration of noise parameters [20, 21].

Figure 1: LED exhibits unparalleled state-of-the-art performance across a spectrum of darkness scenarios, encompassing various digital gain levels and camera sensors, outperforming calibration-based and transfer learning-based methodologies. Furthermore, adopting our proposed pipeline for new camera models requires minimal cost. Metrics are scaled into non-linear space for best understanding. Refer to Sec. 5 for a comprehensive explanation.

Despite their notable achievements, current methods encounter three principal limitations, as depicted in Fig. 2 (b). 1) Explicit camera-specific noise model calibration is time-consuming and labor-intensive, requiring specialized data collection with a consistent illumination environment and comprehensive post-processing. 2) Each denoising network (denoiser) is tailored for a specific camera model. Such coupling makes it hard to adapt to different cameras, requiring repeated calibration and training for distinct target cameras. 3) The noise model used for synthetic-only training may not encompass certain noise distributions, leading to what is termed out-of-model noise [8, 16, 22]. In other words, a domain gap persists between Synthetic Noise (SN) and Real Noise (RN). While recent advancements [21] have concentrated on reducing calibration costs through DNN-based methods, the coupling between networks and cameras and the out-of-model noise continue to increase training expenses and constrain overall performance.

[Figure 2 panels: (a) Paired data-based methods; (b) Calibration-based methods; (c) Our proposed method]
Figure 2: The thumbnail of paired data-based methods, explicit calibration-based methods, and our proposed LED (Zoom-in for best view). The “$\rightarrow$” denotes the limitations of the paired data- and calibration-based methods, and the “$\rightarrow$” highlights our solutions for the above limitations. Calib. represents the calibration operations, including pre-defining a noise model, collecting calibration-specialized data, post-processing, and calculating the noise parameters. In LED, the collection procedure only captures few-shot paired data, alleviating the deployment cost.

We introduce an innovative pipeline, LED, for lighting every darkness, addressing the identified shortcomings of calibration-based methods. As illustrated in Fig. 2 (c), our framework eliminates the necessity for calibration data and operations related to the noise model. To sever the strong dependency between the denoising network and a specific target camera, we propose a dual-stage approach: pre-training with a virtual camera set (see footnote 1), followed by fine-tuning with few-shot pairs from a specific real camera. This strategy effectively decouples the network from being bound to a single camera model. Concerning the disparity between a virtual and a target camera and the challenges posed by out-of-model noise, we introduce the Re-parameterized Noise Removal (RepNR) block. During the pre-training stage, the RepNR block has several camera-specific alignments (CSA). Each CSA is responsible for learning the camera-specific information of a single virtual camera and aligning features to a shared space. Then, the common knowledge of in-model noise (components that have been assumed as part of the noise model) is learned by a shared denoising convolution. In the fine-tuning stage, we average all the CSAs of virtual cameras as the initialization for the target camera. Additionally, we integrate a parallel convolution branch for Out-of-Model Noise Removal (OMNR). During the fine-tuning stage, LED implicitly “calibrates” the parameters of the denoiser, especially the CSAs, instead of explicitly calibrating the noise model. Only 2 pairs for each ratio (additional digital gain) captured by the target camera, 6 raw image pairs in total, are used for learning to remove real noise (a discussion of why 2 pairs per ratio are used can be found in Sec. 6). During deployment, all the RepNR blocks can be structurally reparameterized [24, 25, 26] into a straightforward $3\times 3$ convolution without any extra computational cost, yielding a plain UNet [27].

Footnote 1: “Virtual” cameras do not correspond to any real camera model but have reasonable noise parameters under the pre-defined noise model. They are sampled from a parameter space $\mathcal{S}$ with our proposed sampling strategy. Details can be found in Sec. 3.2.

To comprehensively evaluate the efficacy of LED across diverse camera models, we introduce a novel dataset specifically tailored for multi-camera, dark-scene RAW image denoising, referred to as MultiRAW. This dataset is distinct in that it includes five different camera models that do not appear in prior public datasets. A notable feature of MultiRAW is its coverage of various sensor sizes, ranging from full-frame cameras to APS-C format cameras, offering a more expansive and realistic testing ground. Furthermore, the MultiRAW dataset will be used in the CVPR 2024 and subsequent MIPI (Mobile Intelligent Photography & Imaging) workshops. This use underscores its significance and potential impact in advancing the field of RAW image denoising, particularly in scenarios characterized by extremely low-light conditions.

Previous methods primarily focused on constructing noise models and calibrating noise parameters, namely sensor-related engineering. In contrast, LED focuses on deep learning techniques such as few-shot and transfer learning. Additionally, our method does not discard traditional noise modeling methods, which can still empower the pre-training stage of LED.

Our principal contributions are concisely encapsulated as follows:

  • We introduce a novel, implicit “calibration” pipeline for lighting every darkness, eliminating the need for additional calibration-related expenses for noise parameter calculation.

  • The implementation of Camera-Specific Alignments (CSA) mitigates the dependence of the denoising network on specific camera models. At the same time, the Out-of-Model Noise Removal (OMNR) mechanism facilitates few-shot transfer by learning the out-of-model noise of different sensors.

  • We release a new dataset, MultiRAW, encompassing various camera models, assorted scenes, and varying brightness levels. This dataset substantially enriches the current landscape of open-source datasets and addresses the prevalent limitation of limited camera variety.

  • Remarkably, our method requires only 2 RAW image pairs for each ratio and a mere 0.5% of the iterations typically needed by state-of-the-art methods (Fig. 1).

Compared to the ICCV 2023 [1] version, this journal extension includes several notable expansions. 1) Experiments (Sec. 5.5) demonstrate that our method can be seamlessly integrated with various existing network architectures and explicit calibration methods, showcasing the broad applicability of our proposed pipeline. 2) Furthermore, a discussion is provided on whether the network employs a noise prior or an image prior during denoising (detailed in Sec. 6), serving as guidance for further research. 3) We provide a detailed process for few-shot dataset collection and the associated considerations, laying the groundwork for widespread adoption of our implicit calibration pipeline, LED. 4) Based on the collection process in 3), we introduce a new dataset, MultiRAW, featuring various camera models (not included in prior public datasets) and multiple additional digital gains, with each setting encompassing two different ISO configurations. 5) We plan to invigorate the RAW image denoising community by hosting a Few-shot RAW Image Denoising competition with the proposed MultiRAW dataset at the CVPR 2024 workshop: Mobile Intelligent Photography & Imaging.

2 Related Work

The issue of image capture in extremely dark scenes has received widespread attention from numerous camera/smartphone manufacturers. This section will revisit denoising techniques such as training with paired data and methods based on noise model calibration.

2.1 Training with Paired Real Data.

The field of RAW data exploitation for image denoising has its roots in the groundbreaking work of the SIDD project [6]. Progress in this area has recently broadened to encompass traditional light image denoising and the more complex challenges inherent in extremely low-light conditions. This expansion is illustrated by notable studies such as SID [7] and ELD [8]. While methodologies based on real noise have yielded encouraging results [28, 29, 30, 31, 32, 33], their widespread application is hampered by the considerable effort required to compile extensive datasets of paired low and high-quality images. To address this, employing training strategies that utilize paired low-quality raw images, exemplified by Noise2Noise [5] and Noise2NoiseFlow [17], offers an effective workaround to the tedious task of assembling noisy-clean image pairs. However, these techniques tend to under-perform in severe noise levels, especially in scenarios with extreme darkness [7, 8].

In this context, our LED aims to advance the understanding and effectiveness of real noise elimination. It incorporates insights from a limited number of paired images taken in extremely low-light conditions, thereby mitigating the data collection challenges associated with such environments.

2.2 Calibration-Based Denoising.

While alleviating the burden of compiling pairwise datasets, synthetic noise-based techniques encounter practical limitations. Common noise models like Poisson and Gaussian significantly diverge from actual noise distributions in extremely low-light conditions [7, 8] (see footnote 2). In response, explicit calibration-based methods, simulating each noise component in electronic imaging pipelines [34, 35, 36, 37, 38], have thrived due to their reliability.

Footnote 2: Denoising under extremely low-light scenarios necessitates the application of additional digital gain (up to $300\times$) to the input, thereby intensifying the domain gap between real and synthetic noise.

ELD [8] proposed a noise model that closely aligns with real noise characteristics, achieving notable performance in dark scenarios. Zhang et al. [16] acknowledged the complexity of modeling signal-independent noise sources and proposed a method that randomly samples such noise from dark frames. However, it still necessitates calibration for signal-dependent noise parameters (overall system gain). Monakhova et al. [20] devised a noise generator combining physics-based noise models with a generative adversarial framework [39]. Zou et al. [21] pursued more accurate and concise calibration by employing contrastive learning [40, 41] for parameter estimation.

Despite the impressive performance achieved by calibration-based methods, certain challenges persist. Stable illumination environments (e.g., consistent brightness and temperature), calibration-specific data collection (e.g., multiple images for each camera setting), and intricate post-processing tasks (e.g., alignment, localization, and statistical analyses) are prerequisites for precisely estimating noise parameters. Furthermore, repeated calibration and training processes are essential for distinct cameras, owing to the diversity of parameters and the nonuniform pre-defined noise model [42, 36, 38, 43]. Additionally, the domain gap between synthetic and real noise is not adequately addressed.

Our LED overcomes these challenges by replacing explicit calibration of the noise model with implicit calibration of the denoiser, realized through a pre-training and fine-tuning framework together with a RepNR block designed for noise removal.

2.3 From Synthetic to Real Noise.

The domain gap between real and synthetic noise, a fundamental challenge, becomes particularly pronounced when models trained on synthetic data are tested on real-world data. To bridge this gap, recent research has increasingly focused on employing techniques like Adaptive Instance Normalization (AdaIN) [44, 45] and few-shot learning [46, 47, 48], along with transfer learning [23] and domain adaptation [49] strategies. However, these approaches often struggle in extremely dark environments where the numerical instability caused by intense noise and high digital gain can impair signal reconstruction.

To address this, our framework introduces a novel camera-specific alignment strategy. This method reduces numerical instability and effectively separates camera-specific characteristics from the general attributes of the noise model. Moreover, unlike instance or layer normalization [50, 51], our alignment operations can be reparameterized into a straightforward convolution, similar to custom batch normalization [52]. This reparameterization ensures that our approach does not incur any additional computational burden.

Figure 3: Illustration of our proposed LED and RepNR block. The overall pipeline is delineated into four key stages: 1) Sampling a set of $m$ virtual cameras responsible for synthesizing noise at a later stage; 2) Pre-training the denoising network with $m$ camera-specific alignments (CSAs) and synthetic paired images, with each CSA corresponding to a virtual camera; 3) Utilizing the target camera to acquire a limited number of real noisy image pairs; 4) Fine-tuning the pre-trained denoising network with real noisy data, tailoring the network to the characteristics of the target camera. In the intermediary phase, we introduce distinct optimization strategies tailored for the specific training stages of our RepNR block. During the stage transition, indicated by “$\Rightarrow$”, we average the CSAs to initialize the CSA$^{T}$. Subsequently, once CSA$^{T}$ reaches convergence, we introduce the OMNR ($3\times 3$) branch alongside the existing IMNR ($3\times 3$ + CSA$^{T}$) branch, and proceed with the training process.

3 Method

This section commences with an overview of the complete pipeline for our proposed raw image denoising with implicit calibration. Subsequently, we introduce our Reparameterized Noise Removal (RepNR) block. The comprehensive denoising pipeline is illustrated in Fig. 3.

3.1 Preliminaries and Motivation

In raw image space, the captured signals $D$ are conventionally regarded as the sum of the clean image $I$ and various noise components $N$, expressed as Eqn. (1).

$$D = I + N, \tag{1}$$

where $N$ is assumed to follow a noise model,

$$N = N_{shot} + N_{read} + N_{row} + N_{quant} + \epsilon, \tag{2}$$

with $N_{shot}$, $N_{read}$, $N_{row}$, $N_{quant}$, and $\epsilon$ representing shot noise, read noise, row noise, quantization noise, and out-of-model noise, respectively. Apart from the out-of-model noise, other noise components are sampled from specific distributions:

$$\begin{split}
&N_{shot} + I \sim \mathcal{P}\left(\frac{I}{K}\right) K, \\
&N_{read} \sim \mathcal{T}(\lambda; \mu_c, \sigma_{\mathcal{T}}), \\
&N_{row} \sim \mathcal{N}(0, \sigma_r), \\
&N_{quant} \sim U\left(-\frac{1}{2}, \frac{1}{2}\right),
\end{split} \tag{3}$$

where $K$ denotes the overall system gain. Here, $\mathcal{P}$, $\mathcal{N}$, and $U$ represent Poisson, Gaussian, and uniform distributions, respectively. $\mathcal{T}(\lambda; \mu, \sigma)$ stands for the Tukey-lambda distribution [53] with shape $\lambda$, mean $\mu$, and standard deviation $\sigma$. Based on the assumption in ELD [8], a linear relationship governs the joint distribution of $(K, \sigma_{\mathcal{T}})$ and $(K, \sigma_r)$, expressed as:

$$\begin{split}
&\log(K) \sim U\left(\log(\hat{K}_{min}), \log(\hat{K}_{max})\right), \\
&\log(\sigma_{\mathcal{T}}) \,|\, \log(K) \sim \mathcal{N}(a_{\mathcal{T}} \log(K) + b_{\mathcal{T}}, \hat{\sigma}_{\mathcal{T}}), \\
&\log(\sigma_r) \,|\, \log(K) \sim \mathcal{N}(a_r \log(K) + b_r, \hat{\sigma}_r),
\end{split} \tag{4}$$

where $\hat{K}_{min}$ and $\hat{K}_{max}$ denote the range of the overall system gain, determined by the minimum and maximum ISO values. $a$, $b$, and $\hat{\sigma}$ indicate the line’s slope, bias, and an unbiased estimator of the standard deviation, respectively. In this context, a camera can be approximated as a ten-dimensional coordinate $\mathcal{C}$:

$$\mathcal{C} = (\hat{K}_{min}, \hat{K}_{max}, \lambda, \mu_c, a_{\mathcal{T}}, b_{\mathcal{T}}, \hat{\sigma}_{\mathcal{T}}, a_r, b_r, \hat{\sigma}_r). \tag{5}$$

Existing methods predominantly rely on explicit calibration to determine the coordinate $\mathcal{C}$, especially the linear relationship. It is a process characterized by intensive labor and a substantial domain gap (i.e., the gap between simulated noise and real noise). Moreover, the entanglement between neural networks and cameras requires repeated explicit calibration and training. In our implementation, these distributions and linear relationships are defined similarly to ELD [8]. However, we can also employ more advanced noise models as replacements to achieve theoretically superior performance.
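As a rough illustration of Eqns. (1)-(4), the following Python sketch draws one set of noise parameters for a camera coordinate and then synthesizes a noisy observation from a clean image. It is not the authors' implementation. The dictionary keys, the example parameter values, the helper names, the Gaussian approximation used for large Poisson means, and the reading of the second argument of each distribution as a scale parameter are all assumptions made for illustration, and the out-of-model term $\epsilon$ is omitted because it is not modeled.

```python
import math
import random

# Hypothetical camera coordinate C (Eqn. (5)); all values are made up.
EXAMPLE_CAMERA = {
    "K_min": 0.1, "K_max": 10.0, "lam": -0.1, "mu_c": 0.0,
    "a_T": 0.8, "b_T": 0.5, "sigma_hat_T": 0.1,
    "a_r": 0.7, "b_r": -1.0, "sigma_hat_r": 0.1,
}

def sample_noise_params(cam, rng):
    """Draw (K, sigma_T, sigma_r) following the log-linear model of Eqn. (4)."""
    log_k = rng.uniform(math.log(cam["K_min"]), math.log(cam["K_max"]))
    log_sigma_t = rng.gauss(cam["a_T"] * log_k + cam["b_T"], cam["sigma_hat_T"])
    log_sigma_r = rng.gauss(cam["a_r"] * log_k + cam["b_r"], cam["sigma_hat_r"])
    return math.exp(log_k), math.exp(log_sigma_t), math.exp(log_sigma_r)

def poisson(mean, rng):
    """Knuth's sampler for small means; Gaussian approximation for large ones."""
    if mean > 30.0:
        return max(0, round(rng.gauss(mean, math.sqrt(mean))))
    limit, count, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count

def tukey_lambda(lam, mu, sigma, rng):
    """Inverse-CDF sample of a Tukey-lambda variate, shifted by mu, scaled by sigma."""
    p = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep p away from 0 and 1
    if lam == 0.0:
        q = math.log(p / (1.0 - p))  # lambda = 0 is the logistic distribution
    else:
        q = (p ** lam - (1.0 - p) ** lam) / lam
    return mu + sigma * q

def synthesize(clean, cam, rng):
    """Return (D, K) with D = I + N for a 2-D list `clean`, without epsilon."""
    k, sigma_t, sigma_r = sample_noise_params(cam, rng)
    noisy = []
    for row in clean:
        n_row = rng.gauss(0.0, sigma_r)  # one row-noise value shared by the row
        out = []
        for value in row:
            signal_plus_shot = poisson(value / k, rng) * k  # N_shot + I
            n_read = tukey_lambda(cam["lam"], cam["mu_c"], sigma_t, rng)
            n_quant = rng.uniform(-0.5, 0.5)
            out.append(signal_plus_shot + n_read + n_row + n_quant)
        noisy.append(out)
    return noisy, k
```

Sampling in the log domain keeps $K$ and both standard deviations positive, which is one practical reason the linear relationship is stated on logarithms.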

We aim to streamline the complex calibration process and mitigate the strong coupling between networks and cameras. Additionally, we address the out-of-model noise comprehensively, a task facilitated by the structural modifications introduced in the RepNR block. Our motivation is to compel the network to function as a swift adapter [54, 55].

Algorithm 1 Pre-training pipeline in LED
Require: model $\Phi$, $m$, $\mathcal{S}$, clean dataset $D$
  $\Phi_{\text{pre}} \leftarrow$ insert-multi-CSA($\Phi$)
  $\{c_k\}_{k=1}^{m} \leftarrow$ generate-virtual-camera($\mathcal{S}$)
  while not converged do
     Sample mini-batch $x_i \sim D$
     $k \leftarrow$ random$(1, m)$
     $\tilde{x_i} \leftarrow$ augment$(c_k, x_i)$
     $\Phi_{\text{pre},k} \leftarrow$ select-CSA($\Phi_{\text{pre}}, k$)
     train$(\Phi_{\text{pre},k}, \{\tilde{x_i}, x_i\})$
  end while

3.2 Pre-train with Camera-Specific Alignment

Preprocessing. We initiate the pre-training stage using virtual cameras to induce the network to function as a fast adapter. Given the number of virtual cameras $m$ and the parameter space (formulated as $\mathcal{S}$), for the $k$-th camera, we select the $k$-th of the $m$ bisection points of each parameter range and combine them to construct a virtual camera. Augmenting the data with synthetic noise, we can pre-train our network based on multiple virtual cameras, compelling the network to acquire common knowledge.
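The construction of the virtual camera set can be sketched as follows. The text does not spell out exactly where the bisection points lie, so this sketch assumes that every parameter range is split into $m$ equal segments and that the $k$-th camera takes the midpoint of the $k$-th segment. The function name, the parameter names, and the ranges are hypothetical.

```python
def build_virtual_cameras(space, m):
    """Return m virtual cameras, each a dict mapping parameter name to value.

    `space` maps each parameter name to its (low, high) range.
    """
    cameras = []
    for k in range(m):
        # Assumed reading: midpoint of the k-th of m equal segments.
        frac = (k + 0.5) / m
        cameras.append({name: low + frac * (high - low)
                        for name, (low, high) in space.items()})
    return cameras

# Hypothetical parameter space S; the ranges are made up for illustration.
SPACE = {"a_T": (0.6, 1.0), "b_T": (0.0, 1.0), "a_r": (0.5, 0.9)}
virtual_cameras = build_virtual_cameras(SPACE, m=5)
```

Under this reading the $m$ cameras lie along a diagonal of the parameter space, so they are well separated in every parameter at once.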

Camera-Specific Alignment. As depicted in Fig. 3, within the pre-training process, we introduce our Camera-Specific Alignment (CSA) module, which focuses on adjusting the distribution of input features. In the baseline model, a $3\times 3$ convolution followed by leaky-ReLU [56] constitutes the primary component. A multi-path alignment layer is inserted before each convolution of the network to align features from different virtual cameras into a shared space. Each path represents the CSA corresponding to the $k$-th camera, aligning the $k$-th camera-specific feature distribution into a shared space. Let the feature of the $k$-th virtual camera be $F_k \in \mathcal{R}^{B\times C\times H\times W}$. Formally, the $k$-th branch contains a weight $W_k \in \mathcal{R}^{C}$ and a bias $b_k \in \mathcal{R}^{C}$, performing channel-wise linear projection, denoted by $Y = W_k F + b_k$. $W_k$ are initialized as $\mathbf{1}$, and $b_k$ are initialized as $\mathbf{0}$, with no effect on the $3\times 3$ convolution at the beginning.
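A minimal sketch of the multi-path alignment layer, written in plain Python on nested lists so that it runs without any framework, is given below. The class name and the data layout (a list of $C$ channels, each a flat list of values for a single sample) are assumptions for illustration. A practical implementation would use tensors and learnable parameters.

```python
class MultiPathCSA:
    """m channel-wise affine paths, one per virtual camera."""

    def __init__(self, num_cameras, channels):
        # Identity initialization: weights are 1 and biases are 0, so the
        # layer initially leaves the input of the following convolution unchanged.
        self.weight = [[1.0] * channels for _ in range(num_cameras)]
        self.bias = [[0.0] * channels for _ in range(num_cameras)]

    def forward(self, feature, k):
        """Apply Y = W_k * F + b_k per channel, using only the k-th path."""
        w, b = self.weight[k], self.bias[k]
        return [[w[c] * value + b[c] for value in channel]
                for c, channel in enumerate(feature)]

csa = MultiPathCSA(num_cameras=3, channels=2)
feature = [[1.0, 2.0], [3.0, 4.0]]      # C = 2 channels, 2 values each
aligned = csa.forward(feature, k=0)     # identical to `feature` at initialization
```

Because each path only scales and shifts channels, it adds $2C$ parameters per camera, which is small compared with the shared convolution that follows it.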

During training, data augmented by the noise of the $k$-th virtual camera is fed into the $k$-th path for alignment and a shared $3\times 3$ convolution for further processing. The detailed pre-training pipeline is described in Algorithm 1.
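The control flow of Algorithm 1 can be sketched as a short loop in which the camera index drawn for each step selects both the noise used for augmentation and the CSA path that is trained. The callables `augment` and `train_step` are hypothetical placeholders for noise synthesis and one optimizer update, and a fixed iteration count stands in for the convergence test of Algorithm 1.

```python
import random

def pretrain(num_cameras, dataset, augment, train_step, num_iters, seed=0):
    """Route each sampled item through the path of a randomly chosen camera.

    Returns how many steps were spent on each virtual camera.
    """
    rng = random.Random(seed)
    usage = [0] * num_cameras
    for _ in range(num_iters):
        clean = rng.choice(dataset)        # sample x_i from D
        k = rng.randrange(num_cameras)     # pick a virtual camera (0-based)
        noisy = augment(k, clean)          # synthesize noise of camera k
        train_step(k, noisy, clean)        # update path k and the shared conv
        usage[k] += 1
    return usage
```

For example, `pretrain(3, [1.0, 2.0], lambda k, x: x + k, lambda k, noisy, clean: None, num_iters=50)` runs 50 routed steps with stub callables and returns the per-camera step counts.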

3.3 Fine-tune with Few-shot RAW Image Pairs

Following the pre-training process, the model is intended for deployment in realistic denoising tasks. We advocate for a few-shot strategy, specifically employing only 6 pairs (2 pairs for each of the three ratios) of raw images to fine-tune the pre-trained model. We assume that $3\times 3$ convolutions have acquired sufficient capability to handle features aligned by CSAs. The convolutions remain frozen during subsequent fine-tuning to maximize the utilization of the model parameters obtained from pre-training. For addressing real noise, we substitute the multi-branch CSA with a new CSA layer, denoted as CSA$^{T}$ (CSA for the target camera). Unlike the multi-branch CSA during pre-training, the CSA$^{T}$ layer is initialized by averaging the pre-trained CSAs for improved generalization. The CSA$^{T}$ followed by a $3\times 3$ convolution branch mentioned above is called the in-model noise removal branch (IMNR).
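The averaging initialization of CSA$^{T}$ amounts to a per-channel mean over the $m$ pre-trained paths. The sketch below assumes the weights and biases are stored as one list of $C$ values per virtual camera. The function name and the numbers are illustrative only.

```python
def average_csa(weights, biases):
    """Initialize the target-camera CSA as the channel-wise mean of m CSAs.

    `weights` and `biases` each hold m lists of C per-channel values.
    """
    m = len(weights)
    target_weight = [sum(column) / m for column in zip(*weights)]
    target_bias = [sum(column) / m for column in zip(*biases)]
    return target_weight, target_bias

# Two hypothetical pre-trained CSAs with C = 2 channels.
w_t, b_t = average_csa([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.5], [1.0, 1.5]])
```

Since every path feeds the same shared convolution, averaging the affine parameters is equivalent to averaging the aligned features of the $m$ paths for any given input.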

Figure 4: Illustration of the initialization strategy of CSA$^{T}$ and the reparameterization process. (a) RepNR block during pre-training. (b) Our RepNR block can be seen as $m$ parameter-sharing blocks, each for a specific virtual camera. (c) We initialize the CSA$^{T}$ by averaging the pre-trained CSAs, which can be considered model ensembling. (d) The reparameterization process during deployment. Rep. denotes reparameterize. We detail the sequential reparameterization process in Sec. 3.4.

Nevertheless, real noise encompasses the modeled part and some out-of-model noise. Since our CSA layer is specifically designed for aligning features augmented by synthetic noise, a gap still exists between real noise and the one that IMNR can handle (i.e., $\epsilon$ in Eqn. (2)). Therefore, we propose introducing an out-of-model noise removal branch (OMNR), to learn the gap between real noise and the modeled components. We treat the OMNR component as a parallel branch alongside the IMNR branch, due to previous research that has demonstrated the efficacy of parallel convolution branches in transfer and continual learning [57]. OMNR comprises only a $3\times 3$ convolution, aiming to capture the structural characteristics of real noise from few-shot raw image pairs. Given the absence of prior information on the noise remainder $\epsilon$, we initialize the weights and bias of OMNR as a tensor of $\mathbf{0}$. Combining IMNR with OMNR yields the proposed RepNR block. It is worth noting that it is more reasonable to first learn in-model noise and subsequently address out-of-model noise. Therefore, we divide the optimization process into two steps: initially training IMNR and subsequently training OMNR. Following this approach, iterations of two-step fine-tuning only account for 0.5% of the pre-training, rendering it highly feasible for practical implementation. The detailed fine-tuning pipeline is described in Algorithm 2.

Analysis on the Initialization of CSA$^{T}$. As mentioned in Sec. 3.3, we initialize CSA$^{T}$ by averaging the pre-trained CSAs in the multi-branch CSA layer. Given that every path shares the convolution in the multi-branch CSA, this initialization can be conceptualized as the ensemble of $m$ models, where $m$ is the number of paths, like (a)-(c) in Fig. 4. According to studies [58, 59, 60], the weighted average of different models can significantly enhance the model’s generalization. This aligns with our objective of generalizing the model to the target noisy domain.
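
The ensemble reading of this initialization can be checked numerically on a toy case. The sketch below is an illustration under assumptions, not the authors' code: the helper names `average_csa` and `shared_linear` are hypothetical, the numbers are made up, and a small fixed matrix stands in for the shared convolution, since only its linearity matters for the argument.

```python
def average_csa(weights, biases):
    """Average m per-channel CSA parameter sets into a single CSA^T initialization."""
    m = len(weights)
    w_t = [sum(col) / m for col in zip(*weights)]
    b_t = [sum(col) / m for col in zip(*biases)]
    return w_t, b_t


def shared_linear(x, matrix):
    """Stand-in for the shared convolution: any fixed linear map works here."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in matrix]


# m = 3 pre-trained paths over C = 2 channels (illustrative values only).
weights = [[1.0, 2.0], [3.0, 0.5], [2.0, 1.5]]
biases = [[0.0, 1.0], [0.5, -1.0], [1.0, 0.0]]
matrix = [[1.0, -2.0], [0.5, 4.0]]
x = [2.0, -1.0]  # one feature vector at a single spatial position

w_t, b_t = average_csa(weights, biases)

# Route A: average the CSA parameters first, then apply the shared map once.
averaged_first = shared_linear([w * v + b for w, v, b in zip(w_t, x, b_t)], matrix)

# Route B: run every branch separately and average the m outputs (an ensemble).
branch_outputs = [
    shared_linear([w * v + b for w, v, b in zip(wk, x, bk)], matrix)
    for wk, bk in zip(weights, biases)
]
ensembled = [sum(col) / len(branch_outputs) for col in zip(*branch_outputs)]
```

Both routes give the same vector because the shared map is linear. The equality is exact only for a single linear layer; once leaky-ReLU and stacked blocks are involved, the ensemble view is an approximation rather than an identity.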

Another rationale for this approach is that CSAs are largely determined by the coordinates $\mathcal{C}$. From this perspective, the average of different CSAs can be considered the center of gravity of these coordinates. Moreover, the coordinates of test cameras, both in SID [7] and ELD [8], are encompassed in the parameter space $\mathcal{S}$. In such circumstances, averaging the pre-trained CSAs is a sound starting point. However, even if the coordinates $\mathcal{C}$ are not in the pre-defined parameter space $\mathcal{S}$ (as in our MultiRAW dataset), LED can still achieve SOTA performance with a few more iterations during fine-tuning.

Algorithm 2 Fine-tuning and deploy pipeline in LED
Require: pre-trained model $\Phi_{\text{pre}}$, real dataset $D_{\text{real}}$
  $\Phi_{\text{ft}}\leftarrow$ freeze-3$\times$3($\Phi_{\text{pre}}$)
  $\Phi_{\text{ft}}\leftarrow$ average-CSA($\Phi_{\text{ft}}$)
  while not converged do
     Sample mini-batch pairs $\{x_{i},y_{i}\}\sim D_{\text{real}}$
     train$(\Phi_{\text{ft}},\{x_{i},y_{i}\})$
  end while
  $\Phi_{\text{ft}}\leftarrow$ freeze-IMNR($\Phi_{\text{ft}}$)
  $\Phi_{\text{ft}}\leftarrow$ add-OMNR($\Phi_{\text{ft}}$)
  while not converged do
     Sample mini-batch pairs $\{x_{i},y_{i}\}\sim D_{\text{real}}$
     train$(\Phi_{\text{ft}},\{x_{i},y_{i}\})$
  end while
  $\Phi_{\text{final}}\leftarrow$ deploy($\Phi_{\text{ft}}$)
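
As a rough sketch of which parameters Algorithm 2 updates at each step, the two stages can be written as a pair of trainability masks. The group names (`conv3x3`, `csa_t`, `omnr`) and the helper `set_trainable` are hypothetical labels for one RepNR block, chosen for illustration; the actual training loop and optimizer are omitted.

```python
def set_trainable(params, trainable_groups):
    """Mark only the named parameter groups as trainable; everything else is frozen."""
    return {name: (name in trainable_groups) for name in params}


# Hypothetical parameter groups of one RepNR block after pre-training.
params = {"conv3x3": "pre-trained", "csa_t": "mean of pre-trained CSAs"}

# Step 1 (implicit calibration): freeze the 3x3 convolution, train CSA^T only.
stage1 = set_trainable(params, {"csa_t"})

# Step 2: add the zero-initialized OMNR branch, freeze IMNR (CSA^T + conv),
# and train the OMNR branch only.
params["omnr"] = "zeros"
stage2 = set_trainable(params, {"omnr"})
```

In a deep learning framework the same masks would typically be realized by toggling gradient tracking per parameter group before each of the two `while` loops.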

3.4 Deploy

Upon completion of fine-tuning, the deployment of the model holds paramount importance for future applications. Directly substituting the $3\times 3$ convolution with our RepNR block would inevitably increase the number of parameters and the computational workload. However, it is noteworthy that our RepNR block solely comprises serial and parallel linear mappings. Additionally, the receptive field of each branch in the RepNR block is $3$. Therefore, employing the structural reparameterization technique [61, 24, 25], our RepNR block can be transformed into a plain $3\times 3$ convolution during deployment, as illustrated in Fig. 4 (d). This implies that our model incurs no additional costs in the application process and facilitates a fair comparison with other methods. Regarding parallel reparameterization techniques, please refer to previous works [61, 24, 25, 62, 63]. Here, we primarily introduce the serial reparameterization techniques we employed.

Sequential Reparameterization. The reparameterization process can be denoted as the following equation:

\begin{equation}
\begin{split}
W_{\mathbf{rep}} &= \mathbf{diag}(W)\otimes W_{3\times 3},\\
b_{\mathbf{rep}} &= W_{3\times 3}\otimes\mathbf{pad}(b)+b_{3\times 3},
\end{split}
\tag{6}
\end{equation}

where $\mathbf{diag}$ transforms a $C$-dimensional vector into a $C\times C$ diagonal matrix, and $\mathbf{pad}$ replicate-pads a $1\times 1\times C$ vector into a $3\times 3\times C$ tensor. $W$, $W_{3\times 3}$, and $W_{\mathbf{rep}}$ stand for the weight of the CSA, the $3\times 3$ convolution, and the reparameterized weight, respectively, and $b_{\ast}$ denotes the bias of the corresponding type.

Since our CSA operator solely comprises $1\times 1$ channel-wise operations, it must first be transformed into a regular $1\times 1$ convolution using the $\mathbf{diag}$ operator during reparameterization. It is worth noting that such reparameterization can only approximate $W_{\mathbf{rep}}$ and $b_{\mathbf{rep}}$. To ensure consistency during training and testing, we employed the online reparameterization technique [64]. This technique performs reparameterization during training and is intended to save GPU memory; our primary goal in using it, however, is to ensure consistency between training and testing. More details can be found in our GitHub repo [65].
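
The collapse of a RepNR block into a single convolution can be verified on random data. The following is a minimal pure-Python sketch under assumptions, not the released implementation: the helper names are hypothetical, tensors are nested lists, and the convolution is unpadded ("valid") so that the fold is exact. With zero padding, border pixels would see padded zeros that never received the CSA bias, so the folded bias would no longer match there; this may be related to the approximation noted above.

```python
import random


def conv3x3_valid(x, weight, bias):
    """3x3 convolution without padding. x: [C_in][H][W], weight: [C_out][C_in][3][3]."""
    height, width = len(x[0]), len(x[0][0])
    out = []
    for w_o, b_o in zip(weight, bias):
        plane = []
        for i in range(height - 2):
            row = []
            for j in range(width - 2):
                acc = b_o
                for c, w_oc in enumerate(w_o):
                    for di in range(3):
                        for dj in range(3):
                            acc += w_oc[di][dj] * x[c][i + di][j + dj]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def fold_csa(scale, shift, weight, bias):
    """Sequential step, cf. Eqn. (6): absorb the channel-wise CSA into the 3x3 kernel."""
    w_rep = [[[[v * scale[c] for v in krow] for krow in w_oc]
              for c, w_oc in enumerate(w_o)] for w_o in weight]
    b_rep = [b_o + sum(v * shift[c] for c, w_oc in enumerate(w_o)
                       for krow in w_oc for v in krow)
             for w_o, b_o in zip(weight, bias)]
    return w_rep, b_rep


def merge_parallel(w_a, b_a, w_b, b_b):
    """Parallel step: two 3x3 branches on the same input sum into one kernel."""
    w = [[[[p + q for p, q in zip(ra, rb)] for ra, rb in zip(ka, kb)]
          for ka, kb in zip(oa, ob)] for oa, ob in zip(w_a, w_b)]
    return w, [p + q for p, q in zip(b_a, b_b)]


rng = random.Random(0)
C_IN, C_OUT, SIZE = 2, 3, 5


def rand():
    return rng.uniform(-1.0, 1.0)


def rand_kernel():
    return [[[[rand() for _ in range(3)] for _ in range(3)]
             for _ in range(C_IN)] for _ in range(C_OUT)]


x = [[[rand() for _ in range(SIZE)] for _ in range(SIZE)] for _ in range(C_IN)]
scale, shift = [rand() for _ in range(C_IN)], [rand() for _ in range(C_IN)]
w_imnr, b_imnr = rand_kernel(), [rand() for _ in range(C_OUT)]
w_omnr, b_omnr = rand_kernel(), [rand() for _ in range(C_OUT)]

# Training-time RepNR block: conv(CSA^T(x)) + OMNR(x).
aligned = [[[scale[c] * v + shift[c] for v in row] for row in plane]
           for c, plane in enumerate(x)]
imnr_out = conv3x3_valid(aligned, w_imnr, b_imnr)
omnr_out = conv3x3_valid(x, w_omnr, b_omnr)

# Deployment: one plain 3x3 convolution with folded and merged parameters.
w_rep, b_rep = merge_parallel(*fold_csa(scale, shift, w_imnr, b_imnr), w_omnr, b_omnr)
deployed = conv3x3_valid(x, w_rep, b_rep)

max_diff = max(abs(d - a - b)
               for pd, pa, pb in zip(deployed, imnr_out, omnr_out)
               for rd, ra, rb in zip(pd, pa, pb)
               for d, a, b in zip(rd, ra, rb))
```

Here `max_diff` stays at floating-point round-off level, so the deployed single convolution reproduces the two-branch block on this input while holding only one kernel and one bias.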

4 Dark RAW Images (MultiRAW) Dataset

In this section, we introduce the MultiRAW dataset, details related to data collection (to guide the deployment of LED to any other cameras), and the availability and limitations of the data. Note that the description in this section has been simplified as much as possible to facilitate a more comfortable and rapid deployment of LED on other camera models.

4.1 Overview of the MultiRAW Dataset

To further validate the effectiveness of LED across different cameras, we introduce the MultiRAW dataset. Compared to existing datasets, our MultiRAW dataset has the following advantages:

  • Multi-Camera Data: To further demonstrate the effectiveness of LED across different cameras (corresponding to different noise parameters, i.e., coordinates $\mathcal{C}$), our dataset includes five distinct models not covered in existing datasets. Additionally, MultiRAW includes both full-frame cameras and APS-C format cameras; the latter have smaller sensor areas and often exhibit stronger noise.

  • Varied Illumination Settings: The dataset contains data under five different illumination ratios ($\times 1$, $\times 10$, $\times 100$, $\times 200$, and $\times 300$), each representing varying levels of denoising difficulty.

  • Dual ISO Configurations: There are two different ISO settings for each scene and illumination setting. These can be used not only for the fine-tuning stage of the LED method but also for testing the algorithm’s robustness under different illumination settings.

In addition to the three highlighted points, the MultiRAW dataset spans 30 indoor scenes, featuring diverse backgrounds and varying types and quantities of objects being photographed. It includes seven different ISO settings ranging from 200 to 6400. The hardest example in our dataset resembles an image captured at a “pseudo” ISO of up to 960,000 ($3200\times 300$). We captured a 5-image burst per setting to collect a broader range of noise samples for each ISO configuration under every illumination setting. This approach provides more test data pairs and lays the groundwork for burst raw image denoising in extremely dark environments. We also captured data for explicit calibration so that existing calibration-based methods can be reproduced for a full evaluation.

Most existing datasets directly use low ISO and long exposure images as ground truth because the noise produced at low ISO settings is often negligible in full-frame cameras. However, since our shooting equipment includes APS-C format cameras with smaller sensor areas, we need to additionally perform multi-frame averaging denoising on low ISO and long exposure images (4 frames in our implementations). Therefore, we collected a total of $(5\times 5\times 2)\times 5\times 30=7{,}500$ noisy images and $4\times 5\times 30=600$ images for creating $150$ ground truths, comprising $(5\times 5\times 2)\times 5\times 30=7{,}500$ pairs of data for both training and evaluation.
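
The counts above can be reproduced with a few lines of arithmetic. The factor names below are our reading of the products, inferred from the dataset description (cameras, illumination ratios, ISO settings, burst length, scenes); the text itself does not label the individual factors, so treat the naming as an assumption.

```python
# Factor names are an interpretation of the products quoted in the text.
CAMERAS, RATIOS, ISOS = 5, 5, 2
BURST, SCENES = 5, 30
GT_FRAMES = 4  # low-ISO, long-exposure frames averaged per ground truth

noisy_images = (CAMERAS * RATIOS * ISOS) * BURST * SCENES
gt_source_frames = GT_FRAMES * CAMERAS * SCENES
ground_truths = gt_source_frames // GT_FRAMES

print(noisy_images, gt_source_frames, ground_truths)  # 7500 600 150
```

Under this reading there is one averaged ground truth per camera and scene, shared by all ratios, ISO settings, and burst frames of that scene.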

Figure 5: A thumbnail of our MultiRAW dataset (Zoom in for best view). It features 30 unique scenes, captured using 5 distinct camera models previously unrepresented in public datasets, under 5 varied lighting conditions (ranging from $\times 1$ to $\times 300$ ratios). For each camera, scene, and lighting combination, we recorded images in dual ISO configurations to enhance the tuning of our LED (detailed in Sec. 6), along with a burst of 5 images for expanded application. In total, MultiRAW provides 7,500 paired images for both training and evaluative purposes. The visual results are amplified and post-processed with the ISP provided by RawPy [66], and then downsampled by $4\times$ to reduce file size.

4.2 Instructions on Data Collection

To ensure the quality of the dataset, special attention must be paid to lighting, alignment, and environmental factors during the shooting process:

  • Lighting: To ensure consistent lighting conditions for the images, it is often necessary to supplement environmental lighting or adjust the aperture. This allows for correct exposure in low ISO and long exposure scenarios.

  • Alignment: Remote control is essential to prevent misalignment issues. Additionally, to avoid camera shake caused by the mechanical shutter during photography, the camera should be set to electronic shutter mode for shooting.

  • Temperature: To prevent the increase in camera temperature caused by continuous shooting (which typically leads to increased noise variance), it is necessary to set the interval between continuous shots to 5 seconds or more.

Moreover, to provide more information on signal-dependent noise (shot noise) for the fine-tuning of LED, the scenes photographed should have a wide variety of colors.

TABLE I: Quantitative results on the SID [7] Sony subset. The best result is in bold, whereas the second best one is underlined. The extra data requirements and iterations (K) are calculated when transferred to a new target camera. The DNN model-based methods require training noise generators for the target camera, resulting in larger iteration requirements. AINDNet* indicates that the AINDNet is pre-trained with our proposed noise model instead of AWGN. It is worth noting that all methods except AINDNet are trained with the same UNet architecture, while we keep the AINDNet the same as in its paper, with almost twice the number of parameters compared to the UNet.
| Categories | Methods | Extra Data Requirements | Iterations (K) | $\times 100$ PSNR | $\times 100$ SSIM | $\times 250$ PSNR | $\times 250$ SSIM | $\times 300$ PSNR | $\times 300$ SSIM |
|---|---|---|---|---|---|---|---|---|---|
| DNN Model Based | Kristina et al. [20] | $\sim$1800 noisy-clean pairs | 327.6 | 38.7799 | 0.9120 | 34.4924 | 0.7900 | 31.2971 | 0.6990 |
| | NoiseFlow [13] | $\sim$1800 noisy-clean pairs | 777.6 | 37.0200 | 0.8820 | 32.9457 | 0.7699 | 29.8068 | 0.6700 |
| Calibration-Based | Calibrated P-G | $\sim$300 calibration data | 257.6 | 39.1576 | 0.8963 | 33.8929 | 0.7630 | 31.0035 | 0.6522 |
| | ELD [8] | $\sim$300 calibration data | 257.6 | 41.8271 | 0.9538 | 38.8492 | 0.9278 | 35.9402 | 0.8982 |
| | Zhang et al. [16] | $\sim$150/$\sim$150 for calib./database | 257.6 | 40.9232 | 0.9488 | 38.4397 | 0.9255 | 35.5439 | 0.8975 |
| Real Data Based | SID [7] | $\sim$1800 noisy-clean pairs | 257.6 | 41.7273 | 0.9531 | 39.1353 | 0.9304 | 37.3627 | 0.9341 |
| | Noise2Noise [5] | $\sim$12000 noisy pairs | 257.6 | 39.2769 | 0.8993 | 34.1660 | 0.7824 | 31.0991 | 0.7080 |
| | AINDNet [23] | $\sim$300 noisy-clean pairs | 1.5 | 40.5636 | 0.9194 | 36.2538 | 0.8509 | 32.2291 | 0.7397 |
| | AINDNet* | $\sim$300 noisy-clean pairs | 1.5 | 39.8052 | 0.9350 | 37.2210 | 0.9101 | 34.5615 | 0.8856 |
| | LED (Ours) | 6 noisy-clean pairs | 1.5 | 41.9842 | 0.9539 | 39.3419 | 0.9317 | 36.6728 | 0.9147 |
TABLE II: Quantitative results on four camera models, SonyA7S2, NikonD850, Canon EOS70D and Canon EOS700D, of the ELD [8] dataset. The best result is denoted as bold. The reasons for the significant performance improvement observed with Canon cameras are discussed in detail in Sec. 6. All the metrics in this table are calculated with the last eight scenes in the ELD [8] dataset, details in .
| Cam. | Ratio | Calibrated P-G (PSNR/SSIM) | ELD [8] (PSNR/SSIM) | LED (Ours) (PSNR/SSIM) |
|---|---|---|---|---|
| Sony A7S2 | $\times 1$ | 54.3710/0.9977 | 52.8120/0.9957 | 51.9547/0.9968 |
| | $\times 10$ | 49.9973/0.9891 | 50.0152/0.9913 | 50.1762/0.9945 |
| | $\times 100$ | 41.5246/0.8668 | 44.9865/0.9707 | 45.3574/0.9779 |
| | $\times 200$ | 37.6866/0.7818 | 42.5440/0.9430 | 42.9747/0.9577 |
| Nikon D850 | $\times 1$ | 50.6207/0.9949 | 50.5628/0.9925 | 50.6222/0.9939 |
| | $\times 10$ | 48.3461/0.9884 | 48.3667/0.9890 | 48.0684/0.9894 |
| | $\times 100$ | 42.2231/0.9046 | 43.6907/0.9634 | 43.5620/0.9667 |
| | $\times 200$ | 39.0084/0.8391 | 41.3311/0.9364 | 41.3984/0.9482 |
| Canon EOS70D | $\times 1$ | 42.7352/0.9915 | 42.4305/0.9900 | 48.5063/0.9924 |
| | $\times 10$ | 41.0061/0.9841 | 40.6364/0.9833 | 45.4415/0.9842 |
| | $\times 100$ | 36.7007/0.8700 | 37.7944/0.9255 | 39.5491/0.9360 |
| | $\times 200$ | 33.3459/0.7942 | 35.1554/0.8703 | 36.2362/0.8948 |
| Canon EOS700D | $\times 1$ | 42.0156/0.9900 | 41.9264/0.9881 | 47.7006/0.9910 |
| | $\times 10$ | 40.7658/0.9791 | 40.5297/0.9758 | 44.8541/0.9815 |
| | $\times 100$ | 36.7589/0.8697 | 36.9642/0.8937 | 38.3147/0.9206 |
| | $\times 200$ | 34.3376/0.8063 | 34.9231/0.8534 | 35.1962/0.8717 |

4.3 Dataset Application and Availability

Our dataset will be used in the Few-shot RAW Image Denoising track at the CVPR 2024 workshop: Mobile Intelligent Photography & Imaging. Following popular benchmarks, we fully release a subset of the data (about 20 scenes of the Canon EOSR10 and Sony A6400 camera models), along with a batch of test data. To prevent overfitting, we only make the images public, with the corresponding ground truths accessible via an online leaderboard on Google CodaLab [67]. A thumbnail of our MultiRAW dataset is illustrated in Fig. 5.

5 Experiments and Analysis

This section offers a comprehensive description of our implementation, details the evaluation metrics and datasets used, presents comparative experiments with other methods, and includes ablation studies to demonstrate the efficacy of our approach.

5.1 Implementation Details

Similar to most denoising methods [14, 68], we utilize the $L1$ loss function as the training objective. We adopt the same UNet [27] architecture as previous methods for a fair comparison, with the distinction that we replace the convolution blocks inside the UNet with our proposed RepNR block. As mentioned in Sec. 3.4, the RepNR block can be structurally reparameterized into a simple convolution block without incurring additional computational costs. We employ the same data preprocessing and optimization strategy as ELD [8] during pre-training. The raw images with long exposure time in the SID [7] train subset are utilized for noise synthesis. Concerning data preprocessing, we pack the Bayer images into 4 channels, followed by cropping the long exposure data with a patch size of $512\times 512$, non-overlapping, step $256$, thereby increasing the iterations of one epoch from $161$ to $1288$. Our implementation is based on PyTorch [69] and MindSpore [70]. We train the models for 200 epochs (257.6K iterations) using the Adam optimizer [71] with $\beta_{1}=0.9$ and $\beta_{2}=0.999$ for optimization, without applying weight decay. The initial learning rate is set to $10^{-4}$ and is halved at the 100th epoch (128.8K iterations) before being further reduced to $10^{-5}$ at the 180th epoch (231.84K iterations).
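
The iteration counts quoted above follow directly from the per-epoch figure, and the step schedule is easy to state as a function. The sketch below is illustrative: the function name `learning_rate` and the 0-indexed epoch convention are assumptions, so the exact epoch at which each step takes effect may differ by one from the actual training script.

```python
ITERS_PER_EPOCH = 1288
TOTAL_EPOCHS = 200


def learning_rate(epoch):
    """Step schedule from the text; `epoch` is 0-indexed (an assumption of this sketch)."""
    if epoch < 100:
        return 1e-4
    if epoch < 180:
        return 5e-5  # halved at the 100th epoch
    return 1e-5      # further reduced at the 180th epoch


total_iterations = ITERS_PER_EPOCH * TOTAL_EPOCHS  # 257,600 = 257.6K
halve_at = ITERS_PER_EPOCH * 100                   # 128,800 = 128.8K
final_drop_at = ITERS_PER_EPOCH * 180              # 231,840 = 231.84K
patches_per_image = ITERS_PER_EPOCH // 161         # 8 iterations per original image
```

The factor of 8 between 161 and 1288 iterations per epoch is the only information about the crop count used here; it is read off the two numbers rather than derived from the patch geometry.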

During fine-tuning, we initially freeze the $3\times 3$ convolution and average the multi-branch CSA to initialize CSA$^{T}$. We first train CSA$^{T}$ until convergence, which constitutes the implicit calibration process we propose. After CSA$^{T}$ has converged, we introduce the out-of-model noise removal branch (a parallel $3\times 3$ convolution) and freeze all the remaining parameters in our network, as depicted in Fig. 3 ④. Subsequently, we train the OMNR until convergence. Different datasets require different numbers of iterations and learning rates, the details of which are described in Sec. 5.2. After completing the training process, we deploy our model by reparameterizing the RepNR blocks into convolutions.

5.2 Evaluation Metrics and Datasets

PSNR and SSIM [72] are utilized as quantitative evaluation metrics for pixel-wise and structural assessment. It’s important to note that the pixel value of low-light raw images usually lies in a smaller range than sRGB images, typically $[0,0.5]$ after normalization. This can result in a lower mean square error and higher PSNR. We evaluated our proposed LED on 3 RAW-based denoising datasets, namely SID [7], ELD [8] and our proposed MultiRAW.
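
The effect of the value range on PSNR can be made concrete with a toy example. This is a minimal sketch with made-up numbers, assuming PSNR is computed against a fixed peak of 1.0 on normalized data; the function name `psnr` is ours and the exact evaluation code may differ.

```python
import math


def psnr(reference, estimate, peak=1.0):
    """PSNR in dB for flat sequences of values normalized to [0, peak]."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    return 10.0 * math.log10(peak ** 2 / mse)


# A toy signal spanning [0, 1] and an estimate with a small systematic error.
reference = [0.0, 0.25, 0.5, 0.75, 1.0]
estimate = [v * 0.98 + 0.01 for v in reference]

# The same content squeezed into [0, 0.5]: errors shrink by 2x, the peak stays at 1.
dark_reference = [0.5 * v for v in reference]
dark_estimate = [0.5 * v for v in estimate]

gain_db = psnr(dark_reference, dark_estimate) - psnr(reference, estimate)
```

Halving the value range divides the mean square error by four, so `gain_db` is about 6.02 dB ($20\log_{10}2$) even though the relative error is unchanged. PSNR values on low-light raw data are therefore not directly comparable with those reported on sRGB images.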

SID [7] dataset. The SID [7] dataset exclusively comprises the Sony A7S2 camera model, yet its test scenes are highly diverse, effectively demonstrating the algorithm’s efficacy to the greatest extent. Consequently, a substantial number of ablation experiments are based on this dataset. We randomly selected two pairs of data for each additional digital gain ($\times 100$, $\times 250$, and $\times 300$), for a total of six pairs, as the few-shot training dataset. Since the coordinate $\mathcal{C}$ (first mentioned in Eqn. (5)) of the Sony A7S2 is already included in our pre-defined parameter space $\mathcal{S}$, the required training strategy can be relatively mild. We first fine-tune CSA$^{T}$ using a learning rate of $10^{-4}$ for 1K iterations. Subsequently, we fine-tune the OMNR branch for 500 iterations using a learning rate of $10^{-5}$.

ELD [8] dataset. The ELD [8] dataset encompasses four camera models: Sony A7S2, Nikon D850, Canon EOS70D, and Canon EOS700D. We used the paired raw images of the first two scenarios for fine-tuning the pre-trained network, while the remaining eight scenarios were used for evaluation. All the metrics in Tab. II are calculated across the eight scenes for fair comparison. On the ELD [8] dataset, since the coordinates $\mathcal{C}$ of all four cameras are included in our pre-defined parameter space $\mathcal{S}$, the training strategy is the same as for the SID [7] dataset.

MultiRAW dataset. The MultiRAW dataset includes five camera models not previously mentioned: Sony A6400, Canon EOSR10, and three other cameras. Given that this dataset is intended for few-shot raw image denoising, we directly use its training set for fine-tuning. The training strategy on the MultiRAW dataset may be somewhat aggressive because the coordinates $\mathcal{C}$ of the 5 camera models in the MultiRAW dataset are not included in our pre-defined parameter space $\mathcal{S}$. However, this fully verifies the effectiveness of our proposed LED on unseen camera models. During the fine-tuning process, we adopted the SGDR [73] learning rate decay strategy. Initially, CSA$^{T}$ is trained with a learning rate from $2\times 10^{-4}$ to $5\times 10^{-5}$ for 1K iterations for rapid convergence. Subsequently, the OMNR is trained for 2K iterations with a learning rate from $10^{-4}$ to $10^{-5}$.
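
For reference, the decay used on MultiRAW can be sketched with the cosine-annealing formula of SGDR. The text gives only the start and end learning rates and the iteration counts, so this sketch assumes a single cosine cycle without warm restarts; the function name `cosine_lr` is hypothetical.

```python
import math


def cosine_lr(step, total_steps, lr_max, lr_min):
    """Single cosine-annealing cycle (the SGDR schedule without restarts)."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Stage 1: CSA^T, 1K iterations from 2e-4 down to 5e-5.
csa_schedule = [cosine_lr(s, 1000, 2e-4, 5e-5) for s in range(1001)]
# Stage 2: OMNR, 2K iterations from 1e-4 down to 1e-5.
omnr_schedule = [cosine_lr(s, 2000, 1e-4, 1e-5) for s in range(2001)]
```

Each schedule starts at its maximum, passes through the midpoint of the two rates halfway through, and ends at its minimum. If restarts are used in practice, the same function would simply be re-applied per cycle with the cycle length in place of `total_steps`.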

TABLE III: Quantitative results on the five different camera models, Canon EOSR10, Sony A6400, and three other camera models, of the proposed MultiRAW dataset. The best result is in bold. Time denotes the training time on a single Nvidia Geforce 3090 GPU with training strategy declared in Sec. 5.1. For LED and AINDNet [23], Time denotes the training time of the fine-tuning stage (only when deploying to new camera models.). AINDNet* indicates that the AINDNet is pre-trained with our proposed noise model instead of AWGN. All methods except AINDNet are trained with the same UNet architecture, while we keep the AINDNet the same as their paper with almost twice the number of parameters compared to the UNet. Please note that these metrics were calculated across all scenarios of the proposed MultiRAW dataset.
| Camera | Ratio | P-G (PSNR / SSIM) | AINDNet* [23] (PSNR / SSIM) | ELD [8] (PSNR / SSIM) | Zhang et al. [16] (PSNR / SSIM) | LED (Ours) (PSNR / SSIM) |
|---|---|---|---|---|---|---|
| Canon EOSR10 | $\times 1$ | 45.5070 / 0.9895 | 42.8885 / 0.9749 | 45.4837 / 0.9786 | 45.4036 / 0.9865 | 48.6290 / 0.9918 |
| | $\times 10$ | 44.7179 / 0.9847 | 41.8977 / 0.9670 | 43.4092 / 0.9601 | 43.9946 / 0.9803 | 46.3750 / 0.9842 |
| | $\times 100$ | 39.8212 / 0.9064 | 39.2519 / 0.9391 | 40.6755 / 0.9310 | 41.2814 / 0.9594 | 41.8574 / 0.9547 |
| | $\times 200$ | 37.0122 / 0.8130 | 38.3639 / 0.9279 | 40.3582 / 0.9439 | 40.1521 / 0.9486 | 40.8654 / 0.9456 |
| | $\times 300$ | 34.5953 / 0.7769 | 35.7965 / 0.8700 | 37.7036 / 0.8987 | 37.6117 / 0.8967 | 37.7800 / 0.8972 |
| | Time | 4h 35m 27s | 15m 01s | 4h 37m 11s | 4h 29m 12s | 7m 17s |
| Sony A6400 | $\times 1$ | 49.3146 / 0.9934 | 43.5193 / 0.9750 | 48.9889 / 0.9927 | 48.3114 / 0.9913 | 49.0211 / 0.9936 |
| | $\times 10$ | 47.7593 / 0.9880 | 42.7484 / 0.9677 | 47.1114 / 0.9835 | 46.6079 / 0.9843 | 47.4265 / 0.9880 |
| | $\times 100$ | 43.6363 / 0.9415 | 41.0480 / 0.9531 | 43.1836 / 0.9346 | 43.3121 / 0.9505 | 43.7688 / 0.9613 |
| | $\times 200$ | 41.3958 / 0.9131 | 39.8725 / 0.9383 | 42.0199 / 0.9204 | 42.1055 / 0.9379 | 42.5766 / 0.9562 |
| | $\times 300$ | 38.1028 / 0.8427 | 38.0563 / 0.9098 | 39.5744 / 0.8873 | 40.2146 / 0.9169 | 40.3370 / 0.9381 |
| | Time | 4h 23m 15s | 15m 15s | 4h 39m 27s | 4h 29m 32s | 7m 19s |
| Camera3 | $\times 1$ | 41.1760 / 0.9798 | 40.7700 / 0.9594 | 40.5599 / 0.9796 | 42.0061 / 0.9790 | 42.3091 / 0.9816 |
| | $\times 10$ | 40.0307 / 0.9677 | 39.4657 / 0.9420 | 39.6185 / 0.9666 | 40.4674 / 0.9672 | 40.7769 / 0.9700 |
| | $\times 100$ | 36.2148 / 0.8938 | 36.1391 / 0.8914 | 36.7027 / 0.9138 | 37.2370 / 0.9280 | 37.4741 / 0.9311 |
| | $\times 200$ | 34.3638 / 0.8487 | 35.1045 / 0.8783 | 35.2796 / 0.8791 | 36.0706 / 0.9045 | 36.0443 / 0.9130 |
| | $\times 300$ | 30.4170 / 0.7663 | 31.4775 / 0.7760 | 31.8913 / 0.8211 | 32.8985 / 0.8532 | 33.0504 / 0.8561 |
| | Time | 4h 36m 23s | 15m 15s | 4h 38m 12s | 4h 30m 33s | 7m 13s |
| Camera4 | $\times 1$ | 49.2394 / 0.9942 | 43.7557 / 0.9705 | 47.9876 / 0.9924 | 47.4546 / 0.9887 | 50.1183 / 0.9945 |
| | $\times 10$ | 47.6744 / 0.9895 | 42.9754 / 0.9636 | 46.3897 / 0.9811 | 45.8446 / 0.9768 | 47.7583 / 0.9895 |
| | $\times 100$ | 41.9510 / 0.9335 | 39.8534 / 0.9360 | 42.4956 / 0.9537 | 42.0030 / 0.9540 | 41.9648 / 0.9587 |
| | $\times 200$ | 40.5930 / 0.9230 | 38.7384 / 0.9294 | 41.0072 / 0.9463 | 40.3252 / 0.9354 | 40.5241 / 0.9503 |
| | $\times 300$ | 36.6494 / 0.8391 | 36.2330 / 0.8915 | 38.5018 / 0.9108 | 38.6361 / 0.9231 | 38.1756 / 0.9209 |
| | Time | 4h 36m 20s | 15m 08s | 4h 38m 15s | 4h 30m 30s | 7m 19s |
| Camera5 | $\times 1$ | 48.6019 / 0.9928 | 42.8059 / 0.9713 | 47.1503 / 0.9874 | 46.0550 / 0.9868 | 46.9796 / 0.9897 |
| | $\times 10$ | 43.4577 / 0.9134 | 41.6037 / 0.9545 | 43.5000 / 0.9627 | 43.9310 / 0.9749 | 44.5822 / 0.9753 |
| | $\times 100$ | 36.4346 / 0.7930 | 38.1994 / 0.9081 | 39.6707 / 0.9040 | 39.9786 / 0.9321 | 41.3606 / 0.9478 |
| | $\times 200$ | 32.6378 / 0.7228 | 36.4481 / 0.8836 | 37.3455 / 0.8712 | 37.6322 / 0.9017 | 39.8046 / 0.9307 |
| | $\times 300$ | 29.2045 / 0.6537 | 32.9607 / 0.8229 | 34.5113 / 0.8179 | 33.9278 / 0.8524 | 36.4322 / 0.8922 |
| | Time | 4h 24m 03s | 14m 58s | 4h 18m 44s | 4h 29m 52s | 7m 16s |
Figure 6: Visual comparison between our LED and other state-of-the-art methods on the SID [7] dataset (Zoom-in for best view). Columns from left to right: Input, AINDNet [23], Zhang et al. [16], ELD [8], LED (Ours), and GT. We amplified and post-processed the input images with the same ISP as ELD [8].

5.3 Comparison with State-of-the-art Methods

We assess the performance of our LED on three distinct datasets: the Sony subset of SID [7], the ELD dataset [8], and the 5 subsets in our MultiRAW dataset. This evaluation aims to gauge the generalization capabilities of LED across outdoor and indoor scenes and across more camera models, respectively. LED is benchmarked against state-of-the-art raw denoising methods designed for extremely low-light environments. These comparative analyses include:

  • DNN model-based methods: Exemplars in this category encompass the approaches presented by Kristina et al.  [20] and NoiseFlow [13]. These methodologies initially undergo training on paired real raw images, enabling them to learn the intricacies of noise generation specific to a particular camera. However, they may necessitate additional iterations when applied to a novel camera model.

  • Calibration-based methods: This classification encompasses ELD [8], the approach proposed by Zhang et al.  [16], and Calibrated P-G. Noteworthy is the requirement for a time-intensive and laborious calibration process intrinsic to these methods.

  • Real data-based methods: Techniques falling under this category involve training with various data pairings, such as noisy-clean pairs (SID [7]), noisy-noisy pairs (Noise2Noise [5]), and transfer learning as demonstrated by AINDNet [23].

The denoising network for all methods above is trained under identical settings, following the parameters outlined in ELD [8]. This standardization ensures a fair and consistent basis for comparison, as elucidated in Sec. 5.1.

Quantitative Evaluation. As demonstrated in Tab. I, Tab. II and Tab. III, our approach surpasses previous calibration-based methods in denoising performance under extremely low-light conditions. The disparity between synthetic and real noise is exacerbated at large ratios ($\times 250$ and $\times 300$), resulting in diminished performance when training with synthetic noise. This is exemplified by comparing ELD [8] and SID [7]. Moreover, DNN model-based methods often exhibit more significant discrepancies than calibration-based methods, with Kristina et al. [20] failing to account for different system gains. Our method mitigates this discrepancy by fine-tuning with few-shot real data, achieving superior performance under $\times 100$ and $\times 250$ digital gain, as detailed in Tab. I. AINDNet [23] also demonstrates enhanced performance under extremely dark scenes, benefitting from a noise model with reduced deviation. Notably, the noise model deviation has minimal impact on denoising efficacy under small additional digital gain and may even enhance performance, as illustrated in Tab. II. Discussions related to this phenomenon can be found in Sec. 6. Significantly, our method exhibits superiority under extremely low-light scenes, even across different camera models. Additionally, when compared to alternative methods, LED introduces lower training costs in terms of data requirements, training iterations, and training time.

Qualitative Evaluation. The visual comparisons presented in Fig. 6, Fig. 7 and Fig. 8 illustrate the performance of our method against other state-of-the-art approaches on the SID [7], ELD [8] and MultiRAW datasets, respectively. Under extremely low-light conditions, LED recovers more high-frequency information. As shown in Camera3 in Fig. 8, LED is the only method to restore the strings of all three badminton rackets, especially the blue one. Also, the presence of intense noise significantly disrupts the color tone. In Fig. 6, input images exhibit noticeable green or purple color shifts, with many comparative methods struggling to restore the correct color tone. Leveraging implicit noise modeling and a diverse sampling space, LED efficiently reconstructs signals amidst severe noise interference, achieving accurate color rendering and preserving rich texture detail. Moreover, other methods often fail to discern and address enlarged out-of-model noises, resulting in the corruption of the final image with fixed patterns or specific positional artifacts. In contrast, during the fine-tuning, LED learns to effectively eliminate these camera-specific noises, enhancing visual quality and demonstrating robustness against such challenges.

Figure 7: Visual comparison on the ELD [8] dataset (zoom in for best view). Columns, left to right: Input, ELD [8], LED (Ours), GT.
Figure 8: Visual comparison between our LED and other state-of-the-art calibration-based methods on our proposed MultiRAW dataset, covering 5 cameras (zoom in for best view). Columns, left to right: Input, P-G, ELD [8], Zhang et al. [16], LED (Ours), GT. We amplified and post-processed the input images with the same ISP as ELD [8].
TABLE IV: Ablation studies on the RepNR block. The reported metrics are measured after fine-tuning, i.e., stage ③ of Fig. 3.
Setting $\times 100$ $\times 250$ $\times 300$
U-net CSA OMNR PSNR/SSIM PSNR/SSIM PSNR/SSIM
41.518/0.951 39.140/0.923 36.273/0.898
41.866/0.954 39.201/0.931 36.499/0.912
41.984/0.954 39.342/0.932 36.673/0.915

5.4 Ablation Studies

Reparameterized Noise Removal Block. We conduct experiments to analyze the impact of different components in the Reparameterized Noise Removal Block (RepNR). As depicted in Tab. IV, our RepNR consistently demonstrates improved performance across three different ratios, with each component in the RepNR block contributing positively to the overall pipeline.

Pre-training with Advanced Strategy. As outlined in Tab. V, pre-training with the SGDR [73] optimizer and larger batch size (equivalent to the training strategy of PMN [22]) yields further performance improvements, all while maintaining the same fine-tuning (2 image pairs for each ratio and 1.5K iterations). This underscores the scalability of the proposed LED. Additionally, in comparison to LLD [74], LED demonstrates superior performance with minimal data and training costs.

Comparison between CSA and Other Normalization. A technique similar to our proposed one is to insert normalization layers into the network, which is relatively common in transfer learning scenarios. To show the superiority of CSA over this usual practice, we directly replace the CSAs with different kinds of normalization layers and observe the difference. As shown in Tab. VI, the alternatives are Instance-Normalization [50], Layer-Normalization [51], and Batch-Normalization [52] (BN* denotes BN without running-mean and running-variance). None of the normalization layers achieves performance comparable to CSA. One main reason is that the value range of the features is crucial to the denoising task. Normalization seriously destroys the value range of the features and breaks its stability. In contrast, CSA roughly maintains the original value range, preventing model performance from collapsing.
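The value-range argument can be illustrated with a small NumPy sketch. It assumes, purely for illustration, that CSA acts as a channel-wise affine transform (a per-channel scale and bias, which is consistent with the $\mathbf{(1,0)}$ initialization in Tab. X). The function names `csa` and `instance_norm`, the toy feature map, and the parameter values are hypothetical and are not taken from the LED code base.

```python
import numpy as np

def instance_norm(x, eps=1e-8):
    """Normalize each channel of a (C, H, W) feature map to zero mean, unit variance."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def csa(x, scale, bias):
    """Assumed CSA form: channel-wise affine transform on a (C, H, W) feature map."""
    return x * scale[:, None, None] + bias[:, None, None]

rng = np.random.default_rng(0)
# Toy features: two channels whose magnitudes differ by more than an order of magnitude
feat = np.stack([
    rng.normal(loc=0.02, scale=0.005, size=(8, 8)),  # small-valued channel
    rng.normal(loc=0.80, scale=0.050, size=(8, 8)),  # large-valued channel
])

normed = instance_norm(feat)
aligned = csa(feat, scale=np.array([1.05, 0.95]), bias=np.array([0.001, -0.002]))

print("input channel means:", feat.mean(axis=(1, 2)))
print("IN channel means   :", normed.mean(axis=(1, 2)))
print("CSA channel means  :", aligned.mean(axis=(1, 2)))
```

In this toy setting, instance normalization maps both channels to zero mean and erases the difference in magnitude between them, whereas the near-identity affine transform keeps each channel close to its original range.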

TABLE V: Ablation studies on the pre-training strategy. Methods marked with $\star$ use the same training strategy as PMN [22] for the denoiser. LED$\star$ employs this strategy only for the pre-training stage and keeps the fine-tuning the same as before.
Method $\times 100$ $\times 250$ $\times 300$
PSNR/SSIM PSNR/SSIM PSNR/SSIM
LED 41.984/0.954 39.342/0.932 36.673/0.915
ELD$\star$ [8] 42.081/0.955 39.461/0.934 36.870/0.920
LLD$\star$ [74] 42.100/0.955 39.760/0.933 36.760/0.912
LED$\star$ 42.396/0.955 39.843/0.939 36.997/0.923
TABLE VI: Ablation studies on the CSA. BN* denotes batch normalization with running mean and running variance.
Metric CSA IN [50] LN [51] BN [52] BN*
PSNR 39.161 26.596 26.605 26.412 23.995
SSIM 0.9322 0.5883 0.5938 0.6066 0.4186

Virtual Camera Number. We conduct ablation studies on the number of virtual cameras used by LED. As shown in Fig. 9, LED achieves the best performance with five virtual cameras. Intuitively, too few cameras make it difficult for the model to learn common knowledge, while too many cameras significantly increase the difficulty of the learning process. Since five virtual cameras yield the best PSNR and SSIM, we use five virtual cameras in our pre-training process.

Sampling Strategy. Uniform random sampling makes it hard to cover the whole parameter space $\mathcal{S}$. In contrast, our sampling strategy covers the whole parameter space $\mathcal{S}$, resulting in better performance, as shown in Tab. VII. Based on this observation, we use the equivalence point strategy to choose the parameters of the virtual cameras. To reduce errors, we ran the experiments with uniform sampling three times and averaged the metrics.
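The coverage argument can be sketched on a one-dimensional stand-in for the parameter space $\mathcal{S}$. The sketch below assumes that the equivalence point strategy can be read as placing the virtual cameras at evenly spaced points, which is an interpretation rather than a description of the released code. The interval bounds, the helper names, and the coverage metric (the largest distance from any point of the interval to its nearest sample) are hypothetical.

```python
import numpy as np

def coverage_radius(samples, low, high):
    """Largest distance from any point in [low, high] to its nearest sample."""
    s = np.sort(np.asarray(samples, dtype=float))
    edge = max(s[0] - low, high - s[-1])
    inner = np.max(np.diff(s)) / 2 if len(s) > 1 else 0.0
    return max(edge, inner)

def evenly_spaced(n, low, high):
    """Centers of n equal-width bins in [low, high]."""
    width = (high - low) / n
    return low + width * (np.arange(n) + 0.5)

low, high, n_cameras = 0.0, 1.0, 5
rng = np.random.default_rng(0)

even = evenly_spaced(n_cameras, low, high)
even_radius = coverage_radius(even, low, high)
# Repeat random sampling many times to see its typical coverage
rand_radii = [
    coverage_radius(rng.uniform(low, high, n_cameras), low, high)
    for _ in range(1000)
]

print("evenly spaced radius:", even_radius)
print("random mean radius  :", np.mean(rand_radii))
```

With five samples on a unit interval, no placement can achieve a coverage radius below 0.1, and the evenly spaced placement attains it, while random draws leave larger uncovered gaps on average.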

TABLE VII: Ablation studies on virtual camera sampling strategy. Rand represents leveraging uniform distribution as the strategy. The results of Rand are derived from the average of three trials to minimize errors.
Setting $\times 100$ $\times 250$ $\times 300$
PSNR/SSIM PSNR/SSIM PSNR/SSIM
Rand 41.5253/0.9489 39.2755/0.9283 36.3940/0.9070
Ours 41.9842/0.9539 39.3419/0.9317 36.6728/0.9147
Figure 9: Ablation studies on virtual camera numbers. PSNR and SSIM reach the apex when the virtual camera number is 5.

Initialization of CSA for Target Camera. Given the initialization of the target-camera CSA described in Sec. 3.3, we present the PSNR/SSIM difference between the $\mathbf{(1,0)}$ initialization and model averaging. The results indicate that, in most scenarios, model averaging yields superior performance. Furthermore, the performance on the Sony A7S2 of SID [7], as shown in Tab. X, is considered representative of the generalization ability, owing to the scale of the dataset.
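The two initialization options can be sketched as follows. The sketch assumes that each CSA holds a per-channel scale and bias and that model averaging means taking the element-wise mean of the CSA parameters learned for the virtual cameras during pre-training. The function names and the randomly generated "pre-trained" parameters are hypothetical placeholders.

```python
import numpy as np

def init_identity(num_channels):
    """(1, 0) initialization: unit scale and zero bias."""
    return np.ones(num_channels), np.zeros(num_channels)

def init_average(virtual_csas):
    """Model averaging: mean of the (scale, bias) pairs of the virtual cameras."""
    scales = np.stack([scale for scale, _ in virtual_csas])
    biases = np.stack([bias for _, bias in virtual_csas])
    return scales.mean(axis=0), biases.mean(axis=0)

rng = np.random.default_rng(0)
num_channels, num_virtual_cameras = 4, 5
# Hypothetical pre-trained CSA parameters, one (scale, bias) pair per virtual camera
virtual_csas = [
    (1.0 + 0.1 * rng.standard_normal(num_channels),
     0.01 * rng.standard_normal(num_channels))
    for _ in range(num_virtual_cameras)
]

scale_id, bias_id = init_identity(num_channels)
scale_avg, bias_avg = init_average(virtual_csas)
print("identity init:", scale_id, bias_id)
print("averaged init:", scale_avg, bias_avg)
```

Under these assumptions, the averaged initialization starts fine-tuning from a point that already reflects what the virtual cameras have in common, instead of from a neutral identity transform.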

Fine-tuning with More Images. Ablation studies are conducted to explore the impact of the number of fine-tuning pairs, illustrating the potential of our proposed LED. As depicted in Fig. 10, increasing the quantity of paired data leads to a gradual performance improvement. Moreover, LED outperforms ELD [8] even when fine-tuned on only two noisy-clean pairs. Further discussions are provided in Sec. 6.

Figure 10: Ablation studies on the amount of data used for fine-tuning. LED achieves superior performance with just 2 pairs for each ratio.

5.5 Further Application

Equipping other network architectures with the RepNR block. By simply replacing the convolutional operators of other structures with our proposed RepNR block, LED can be easily migrated to architectures beyond U-Net. In Tab. VIII, we experiment with Restormer [31] and NAFNet [32], which are transformer-based and convolution-based, respectively. The results demonstrate that LED still achieves performance comparable to calibration-based methods.
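The reason such a replacement adds no cost at deployment is structural reparameterization [24, 25]: linear branches used during training can be folded into a single convolution. The NumPy sketch below illustrates the idea on an assumed block layout (a convolution followed by a channel-wise affine transform, plus a parallel convolution); this layout, the helper `conv2d`, and all variable names are illustrative assumptions and not the exact RepNR implementation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d(x, weight, bias):
    """Valid cross-correlation. x: (C_in, H, W); weight: (C_out, C_in, k, k); bias: (C_out,)."""
    k = weight.shape[-1]
    windows = sliding_window_view(x, (k, k), axis=(1, 2))  # (C_in, H', W', k, k)
    out = np.einsum("chwij,ocij->ohw", windows, weight)
    return out + bias[:, None, None]

rng = np.random.default_rng(0)
c_in, c_out, k = 3, 4, 3
x = rng.standard_normal((c_in, 8, 8))

# Training-time block (assumed layout): conv -> channel-wise affine, plus a parallel conv
w_main, b_main = rng.standard_normal((c_out, c_in, k, k)), rng.standard_normal(c_out)
scale, shift = rng.standard_normal(c_out), rng.standard_normal(c_out)
w_par, b_par = rng.standard_normal((c_out, c_in, k, k)), rng.standard_normal(c_out)

y_train = (
    conv2d(x, w_main, b_main) * scale[:, None, None] + shift[:, None, None]
    + conv2d(x, w_par, b_par)
)

# Deployment: fold the affine transform and the parallel branch into one convolution
w_deploy = w_main * scale[:, None, None, None] + w_par
b_deploy = b_main * scale + shift + b_par
y_deploy = conv2d(x, w_deploy, b_deploy)

print("max abs difference:", np.abs(y_train - y_deploy).max())
```

Because every operation in the training-time block is linear in the input, the merged kernel reproduces the block's output up to floating-point error, so the deployed network has the same operator count as the plain architecture.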

TABLE VIII: Experiments on network architecture. For LED, we first replace most of the convolution blocks with our proposed RepNR block during pre-training and fine-tuning. At deployment, LED yields the same architecture as the other methods without any additional computational burden, owing to the structural reparameterization procedure.
Architecture Method $\times 100$ $\times 250$ $\times 300$
PSNR/SSIM PSNR/SSIM PSNR/SSIM
Restormer [31] P-G 39.457/0.8943 33.956/0.7525 30.964/0.6409
ELD [8] 42.568/0.9536 38.699/0.9280 35.863/0.9059
LED 42.452/0.9492 39.376/0.9143 36.322/0.9143
NAFNet [32] P-G 39.388/0.8945 33.892/0.7541 30.948/0.6445
ELD [8] 42.351/0.9535 38.697/0.9300 35.931/0.9112
LED 42.368/0.9532 39.277/0.9351 36.292/0.9188
Figure 11: Illustration of the feasible solution space (blue area) depicting the linear relationship between the overall system gain $\log(K)$ and the noise variance $\log(\sigma)$ under various sampling strategies.

LED pre-training could boost the performance of other methods. By integrating LED pre-training into various existing calibration-based or paired data-based methods, as referenced in [8, 7], our approach facilitates notable enhancements in performance as shown in Tab. IX. These improvements are not uniform but rather depend on the difference in the pre-training strategies employed. This proves particularly effective in industrial applications, where the demands for efficiency are paramount. The strategic application of LED pre-training not only boosts the performance of the denoiser but also paves the way for more advanced, adaptable, and efficient denoising.

TABLE IX: Experiments on LED pre-training with other methods. $\mathbf{X}+\mathbf{Y}$ denotes that method $\mathbf{X}$ is trained starting from the pre-trained network of $\mathbf{Y}$. $\star$ indicates that the advanced training strategy of PMN [22] is used for the denoiser during pre-training.
Method $\times 100$ $\times 250$ $\times 300$
PSNR SSIM PSNR SSIM PSNR SSIM
ELD [8] 41.827 0.9538 38.849 0.9278 35.940 0.8982
ELD [8] + LED 42.170 0.9558 39.285 0.9302 36.384 0.9058
ELD [8] + LED$\star$ 42.471 0.9567 39.454 0.9333 36.534 0.9138
SID [7] 41.727 0.9531 39.135 0.9304 37.363 0.9341
SID [7] + LED 42.277 0.9580 39.576 0.9445 37.518 0.9369
SID [7] + LED$\star$ 42.320 0.9585 39.613 0.9455 37.614 0.9369
TABLE X: Ablation studies on the initialization strategy of CSA for the target camera. “Sony A7S2#” denotes that fine-tuning and testing is performed on the SID [7] dataset, while other evaluations are conducted based on the ELD [8] dataset.
Init Metric Sony Nikon Canon
A7S2# A7S2 D850 EOS700D EOS70D
$\mathbf{(1,0)}$ PSNR 39.015 47.310 45.790 41.409 42.344
SSIM 0.9307 0.9809 0.9737 0.9408 0.9520
Avg. PSNR 39.161 47.616 45.903 41.516 42.495
SSIM 0.9322 0.9817 0.9743 0.9412 0.9524
TABLE XI: Ablation studies on the pair count for fine-tuning, tested on the synthetic dataset. $n$ denotes fine-tuning with $n$ data pairs of similar overall system gain for each ratio. $n^{*}$ denotes pairs of data with marginally different overall system gains.
Ratio 1 2 4 $2^{*}$
$\times 100$ 41.295/0.9480 41.704/0.9523 41.432/0.9466 43.795/0.9648
$\times 250$ 39.239/0.9350 39.410/0.9351 39.327/0.9367 41.311/0.9457
$\times 300$ 38.314/0.9229 38.486/0.9216 38.499/0.9240 39.190/0.9278

6 Discussions

Why 2 pairs for each ratio? As indicated in Eqn. (4), the variance of noise $\log(\sigma)$ exhibits a linear relationship with the overall system gain $\log(K)$. With only one pair of data, establishing the correct linear relationship is unattainable, resulting in suboptimal performance, as demonstrated in Tab. XI. Furthermore, utilizing two or more pairs with similar system gains fails to precisely model the linear relationship due to a non-negligible error in the sampling scope ($\hat{\sigma}$ in Eqn. (4)), as illustrated in Fig. 11. Following the principle of using two points to determine a straight line, adopting two pairs with marginally different system gains facilitates the accurate modeling of linearity, significantly enhancing denoising capabilities. Additionally, as shown in Fig. 10, an increase in the number of pairs enables a more accurate fitting of linearity, thereby reducing regression errors further.
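The two-point argument can be checked numerically. The following sketch fits a line through two noisy observations of $\log(\sigma)$ at two values of $\log(K)$ and compares the slope error when the two gains are close with the error when they are clearly separated. The line parameters, the error level, and the chosen gains are hypothetical values for illustration and are not calibrated from any camera.

```python
import numpy as np

def fit_line_two_points(x, y):
    """Slope and intercept of the line through two points."""
    slope = (y[1] - y[0]) / (x[1] - x[0])
    return slope, y[0] - slope * x[0]

def slope_error(log_k_pair, true_slope, true_intercept, noise_std, trials, rng):
    """Mean absolute slope error when log(sigma) is observed with Gaussian error."""
    x = np.asarray(log_k_pair, dtype=float)
    errors = []
    for _ in range(trials):
        y = true_slope * x + true_intercept + rng.normal(0.0, noise_std, size=2)
        slope, _ = fit_line_two_points(x, y)
        errors.append(abs(slope - true_slope))
    return float(np.mean(errors))

rng = np.random.default_rng(0)
true_slope, true_intercept = 0.5, 1.0  # hypothetical line parameters
noise_std = 0.05                       # hypothetical error of each log(sigma) estimate

err_similar = slope_error((1.00, 1.05), true_slope, true_intercept, noise_std, 2000, rng)
err_distinct = slope_error((1.00, 2.00), true_slope, true_intercept, noise_std, 2000, rng)
print("slope error, similar gains :", err_similar)
print("slope error, distinct gains:", err_distinct)
```

Since the slope error scales inversely with the distance between the two $\log(K)$ values, the same per-point error produces a far less reliable line when the two gains nearly coincide.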

For typical explicit calibration-based methods, the primary objective of the calibration process is to compute the linear relationships mentioned previously. Subsequently, the network is trained on synthetic data to learn this relationship. However, our implicit calibration adjusts the learned linear relationships of the network directly through “calibrating” network parameters. This approach makes the entire process more direct and enables the network to serve as a swift adapter.

Figure 12: Histograms of intensities captured in the same scene with three different camera models: Nikon D850, Canon EOS700D, and Sony A7S2. $KLD$ denotes the KL divergence between distributions (Nikon panel: $KLD=0.0289$; Canon panel: $KLD=0.2978$). Note that the distribution is similar between Nikon and Sony, while a clear difference remains between Sony and Canon.

Noise prior or image prior? Both! It is well known that existing calibration-based methods uniformly rely on a noise prior (explicit noise model calibration). However, these methods can exhibit sudden performance degradation on certain cameras, as shown for Canon EOS70D and Canon EOS700D in Tab. II. This is attributed to these methods having learned an excessive amount of image priors from other cameras during training. Sensors from different manufacturers have different response models and thus yield different signal intensities for the same scene. In most calibration-based methods [8, 16], the network's denoising ability is restricted to a certain image distribution prior, i.e., that of the Sony A7S2. As stated in [49] and shown in Fig. 12, the intensity distributions of Nikon D850 and Sony A7S2 are highly similar. Therefore, a synthetic image generated from the response intensity of Sony A7S2 and the noise model of Nikon D850 deviates only slightly from the real image prior, helping the network achieve strong performance, as shown for Nikon D850 in Tab. II. In contrast, the intensity distributions of Canon EOS700D and Sony A7S2 differ considerably, leading to a performance drop.
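A histogram-based KL divergence of the kind reported in Fig. 12 can be computed as in the sketch below. The bin count, the smoothing constant, the direction of the divergence, and the synthetic Beta-distributed intensities are all assumptions made for illustration; they stand in for normalized RAW intensities and do not reproduce the values in the figure.

```python
import numpy as np

def kl_divergence(p_samples, q_samples, bins=64, value_range=(0.0, 1.0), eps=1e-10):
    """KL(P || Q) between the histograms of two intensity samples."""
    p, _ = np.histogram(p_samples, bins=bins, range=value_range)
    q, _ = np.histogram(q_samples, bins=bins, range=value_range)
    # Smooth empty bins slightly so the logarithm stays finite, then normalize
    p = p.astype(float) + eps
    q = q.astype(float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
# Synthetic stand-ins for normalized intensities of three cameras
cam_a = rng.beta(2.0, 8.0, size=50_000)
cam_b = rng.beta(2.1, 8.0, size=50_000)  # response close to cam_a
cam_c = rng.beta(4.0, 6.0, size=50_000)  # clearly different response

kl_ab = kl_divergence(cam_a, cam_b)
kl_ac = kl_divergence(cam_a, cam_c)
print("KL(a || b):", kl_ab)
print("KL(a || c):", kl_ac)
```

A small divergence indicates that images from one camera are a reasonable proxy for the image prior of the other, which is the situation described above for Nikon D850 and Sony A7S2.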

However, it is important to note that as the additional digital gain increases, the performance gap between LED and other methods gradually narrows. This is because higher digital gain leads to more pronounced noise, making the noise prior learned by the network more effective. Conversely, under low digital gain, the image prior previously learned by the network becomes predominant.

Based on this observation, the balance between image prior and noise prior is the key to this problem. With the help of the proposed CSA, features are aligned to the shared space before denoising, decreasing the influence of the image prior on the network. As shown in Tab. XII, even when pre-trained with the response model of Sony A7S2, LED can outperform other calibration-based methods. Furthermore, fine-tuning on a few image pairs from the target camera supplies the camera-specific information, helping the network learn both the image prior and the noise prior.

TABLE XII: Ablation studies on training with noisy pairs generated from different RAW sources. The experiments are based on the Canon EOS700D camera and Sony A7S2 of the ELD [8] dataset. RAW Src. denotes that the RAW image pairs for fine-tuning are generated by the ground truth of Sony A7S2 or Canon EOS700D.
RAW Src. $\times 1$ $\times 10$ $\times 100$ $\times 200$
PSNR/SSIM PSNR/SSIM PSNR/SSIM PSNR/SSIM
Sony 44.27/0.992 42.15/0.982 37.43/0.917 34.74/0.867
Canon 46.24/0.992 44.14/0.983 37.94/0.920 34.78/0.869

7 Conclusion and Future Work

To address the inherent shortcomings of calibration-based methods, we introduce an implicit calibration pipeline designed to light even the darkest scenes. Leveraging the camera-specific alignment (CSA), we substitute the explicit calibration procedure with an implicit learning process on the denoiser. The CSA facilitates rapid adaptation to the target camera by separating camera-specific information from the common knowledge of the noise model. Additionally, a parallel convolution mechanism is implemented to learn and eliminate out-of-model noise. With 2 pairs for each ratio (a total of 6 pairs) and 1.5K iterations, our approach attains superior performance compared to existing methods.

Up to this point, the final output quality of LED is still strongly correlated with the data quality used in the few-shot fine-tuning. However, this is not solely a limitation of our method but a common drawback of most few-shot methods. Future work could focus more on making few-shot learning more stable. This represents a key distinction between LED and previous methods: earlier approaches primarily concentrated on engineering for sensor noise modeling rather than focusing on deep learning techniques like few-shot, transfer, or continual learning. Consequently, LED allows researchers to shift their focus from sensor engineering to exploring few-shot learning.

Acknowledgement

This research was supported by the NSFC (NO. 62225604, 62306153) and the Fundamental Research Funds for the Central Universities (Nankai University, 070-63233089). Computation was supported by the Supercomputing Center of Nankai University. Moreover, we would like to express our profound gratitude to Yixuan Huang, Yipeng Du, Bowen Yin, Yunheng Li, and Ruihong Cen (in no particular order) for their dedicated efforts in constructing our dataset.

References

  • [1] X. Jin, J.-W. Xiao, L.-H. Han, C. Guo, R. Zhang, X. Liu, and C. Li, “Lighting every darkness in two pairs: A calibration-free pipeline for raw denoising,” in ICCV, 2023.
  • [2] A. Buades, B. Coll, and J.-M. Morel, “A non-local algorithm for image denoising,” in CVPR, 2005.
  • [3] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE TIP, 2017.
  • [4] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep image prior,” in CVPR, 2018.
  • [5] J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila, “Noise2noise: Learning image restoration without clean data,” in CVPR, 2018.
  • [6] A. Abdelhamed, S. Lin, and M. S. Brown, “A high-quality denoising dataset for smartphone cameras,” in CVPR, 2018.
  • [7] C. Chen, Q. Chen, J. Xu, and V. Koltun, “Learning to see in the dark,” in CVPR, 2018.
  • [8] K. Wei, Y. Fu, Y. Zheng, and J. Yang, “Physics-based noise modeling for extreme low-light photography,” IEEE TPAMI, 2021.
  • [9] K. Zhang, W. Zuo, and L. Zhang, “Ffdnet: Toward a fast and flexible solution for cnn-based image denoising,” IEEE TIP, 2018.
  • [10] S. Guo, Z. Yan, K. Zhang, W. Zuo, and L. Zhang, “Toward convolutional blind denoising of real photographs,” in CVPR, 2019.
  • [11] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, and L. Shao, “Learning enriched features for fast image restoration and enhancement,” IEEE TPAMI, 2022.
  • [12] X. Jin, L.-H. Han, Z. Li, C.-L. Guo, Z. Chai, and C. Li, “Dnf: Decouple and feedback network for seeing in the dark,” in CVPR, 2023.
  • [13] A. Abdelhamed, M. A. Brubaker, and M. S. Brown, “Noise flow: Noise modeling with conditional normalizing flows,” in ICCV, 2019.
  • [14] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, and L. Shao, “Cycleisp: Real image restoration via improved data synthesis,” in CVPR, 2020.
  • [15] G. Jang, W. Lee, S. Son, and K. M. Lee, “C2n: Practical generative noise modeling for real-world denoising,” in CVPR, 2021.
  • [16] Y. Zhang, H. Qin, X. Wang, and H. Li, “Rethinking noise synthesis and modeling in raw denoising,” in ICCV, 2021.
  • [17] A. Maleky, S. Kousha, M. S. Brown, and M. A. Brubaker, “Noise2noiseflow: Realistic camera noise modeling without clean images,” in CVPR, 2022.
  • [18] S. Kousha, A. Maleky, M. S. Brown, and M. A. Brubaker, “Modeling srgb camera noise with normalizing flows,” in CVPR, 2022.
  • [19] Y. Wang, H. Huang, Q. Xu, J. Liu, Y. Liu, and J. Wang, “Practical deep raw image denoising on mobile devices,” in ECCV, 2020.
  • [20] K. Monakhova, S. R. Richter, L. Waller, and V. Koltun, “Dancing under the stars: video denoising in starlight,” in CVPR, 2022.
  • [21] Y. Zou and Y. Fu, “Estimating fine-grained noise model via contrastive learning,” in CVPR, 2022.
  • [22] H. Feng, L. Wang, Y. Wang, and H. Huang, “Learnability enhancement for low-light raw denoising: Where paired real data meets noise modeling,” in ACM MM, 2022.
  • [23] Y. Kim, J. W. Soh, G. Y. Park, and N. I. Cho, “Transfer learning from synthetic to real-noise denoising with adaptive instance normalization,” in CVPR, 2020.
  • [24] X. Ding, X. Zhang, J. Han, and G. Ding, “Diverse branch block: Building a convolution as an inception-like unit,” in CVPR, 2021.
  • [25] X. Ding, X. Zhang, N. Ma, J. Han, G. Ding, and J. Sun, “Repvgg: Making vgg-style convnets great again,” in CVPR, 2021.
  • [26] L. Chen, Y. Fu, K. Wei, D. Zheng, and F. Heide, “Instance segmentation in the dark,” IJCV, 2023.
  • [27] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI, 2015.
  • [28] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, and L. Shao, “Learning enriched features for real image restoration and enhancement,” in ECCV, 2020.
  • [29] L. Chen, X. Lu, J. Zhang, X. Chu, and C. Chen, “Hinet: Half instance normalization network for image restoration,” in CVPR, 2021.
  • [30] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, and L. Shao, “Multi-stage progressive image restoration,” in CVPR, 2021.
  • [31] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, and M.-H. Yang, “Restormer: Efficient transformer for high-resolution image restoration,” in CVPR, 2022.
  • [32] L. Chen, X. Chu, X. Zhang, and J. Sun, “Simple baselines for image restoration,” in ECCV, 2022.
  • [33] Z. Zhang, Y. Jiang, W. Shao, X. Wang, P. Luo, K. Lin, and J. Gu, “Real-time controllable denoising for image and video,” in CVPR, 2023.
  • [34] R. A. Boie and I. J. Cox, “An analysis of camera noise,” IEEE TPAMI, 1992.
  • [35] G. E. Healey and R. Kondepudy, “Radiometric ccd camera calibration and noise estimation,” IEEE TPAMI, 1994.
  • [36] R. D. Gow, D. Renshaw, K. Findlater, L. Grant, S. J. McLeod, J. Hart, and R. L. Nicol, “A comprehensive tool for modeling cmos image-sensor-noise performance,” IEEE TED, 2007.
  • [37] K. Irie, A. E. McKinnon, K. Unsworth, and I. M. Woodhead, “A technique for evaluation of ccd video-camera noise,” IEEE TCSVT, 2008.
  • [38] M. Konnik and J. Welsh, “High-level numerical simulations of noise in ccd and cmos photosensors: review and tutorial,” arXiv:1412.4031, 2014.
  • [39] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” in NeurIPS, 2014.
  • [40] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in ICML, 2020.
  • [41] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in CVPR, 2020.
  • [42] H. Wach and E. R. Dowski Jr, “Noise modeling for design and simulation of computational imaging systems,” in Visual Information Processing XIII, 2004.
  • [43] M. Maggioni, E. Sánchez-Monge, and A. Foi, “Joint removal of random and fixed-pattern noise through spatiotemporal video filtering,” IEEE TIP, 2014.
  • [44] X. Huang and S. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” in ICCV, 2017.
  • [45] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in CVPR, 2019.
  • [46] T. Hospedales, A. Antoniou, P. Micaelli, and A. Storkey, “Meta-learning in neural networks: A survey,” IEEE TPAMI, 2021.
  • [47] H.-J. Ye, L. Ming, D.-C. Zhan, and W.-L. Chao, “Few-shot learning with a strong teacher,” IEEE TPAMI, 2022.
  • [48] G. Huang, I. Laradji, D. Vazquez, S. Lacoste-Julien, and P. Rodriguez, “A survey of self-supervised and few-shot object detection,” IEEE TPAMI, 2022.
  • [49] K. R. Prabhakar, V. Vinod, N. R. Sahoo, and R. V. Babu, “Few-shot domain adaptation for low light raw image enhancement,” in BMVC, 2021.
  • [50] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance normalization: The missing ingredient for fast stylization,” arXiv:1607.08022, 2016.
  • [51] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv:1607.06450, 2016.
  • [52] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, 2015.
  • [53] B. L. Joiner and J. R. Rosenblatt, “Some properties of the range in samples from tukey’s symmetric lambda distributions,” Journal of the American Statistical Association, 1971.
  • [54] S. Ravi and H. Larochelle, “Optimization as a model for few-shot learning,” in ICLR, 2016.
  • [55] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in ICML, 2017.
  • [56] B. Xu, N. Wang, T. Chen, and M. Li, “Empirical evaluation of rectified activations in convolutional network,” arXiv:1505.00853, 2015.
  • [57] C.-B. Zhang, J.-W. Xiao, X. Liu, Y.-C. Chen, and M.-M. Cheng, “Representation compensation networks for continual semantic segmentation,” in CVPR, 2022.
  • [58] J. Cha, S. Chun, K. Lee, H.-C. Cho, S. Park, Y. Lee, and S. Park, “Swad: Domain generalization by seeking flat minima,” in NeurIPS, 2021.
  • [59] P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson, “Averaging weights leads to wider optima and better generalization,” arXiv:1803.05407, 2018.
  • [60] J.-W. Xiao, C.-B. Zhang, J. Feng, X. Liu, J. van de Weijer, and M.-M. Cheng, “Endpoints weight fusion for class incremental semantic segmentation,” in CVPR, 2023.
  • [61] X. Ding, Y. Guo, G. Ding, and J. Han, “Acnet: Strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks,” in ICCV, 2019.
  • [62] X. Ding, H. Chen, X. Zhang, J. Han, and G. Ding, “Repmlpnet: Hierarchical vision mlp with re-parameterized locality,” in CVPR, 2022.
  • [63] X. Ding, Y. Zhang, Y. Ge, S. Zhao, L. Song, X. Yue, and Y. Shan, “Unireplknet: A universal perception large-kernel convnet for audio, video, point cloud, time-series and image recognition,” arXiv:2311.15599, 2023.
  • [64] M. Hu, J. Feng, J. Hua, B. Lai, J. Huang, X. Gong, and X.-S. Hua, “Online convolutional re-parameterization,” in CVPR, 2022.
  • [65] X. Jin, J.-W. Xiao, and Y. Huang, “Led,” https://github.com/Srameo/LED, 2023.
  • [66] M. Riechert, “Rawpy,” https://github.com/letmaik/rawpy, 2014.
  • [67] A. Pavao, I. Guyon, A.-C. Letournel, D.-T. Tran, X. Baro, H. J. Escalante, S. Escalera, T. Thomas, and Z. Xu, “Codalab competitions: An open source platform to organize scientific challenges,” JMLR, 2023.
  • [68] S. Cheng, Y. Wang, H. Huang, D. Liu, H. Fan, and S. Liu, “Nbnet: Noise basis learning for image denoising with subspace projection,” in CVPR, 2021.
  • [69] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” in NIPS Workshops, 2017.
  • [70] Mindspore-AI, “Mindspore,” https://github.com/mindspore-ai/mindspore, 2019.
  • [71] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014.
  • [72] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE TIP, 2004.
  • [73] I. Loshchilov and F. Hutter, “Sgdr: Stochastic gradient descent with warm restarts,” in ICLR, 2017.
  • [74] Y. Cao, M. Liu, S. Liu, X. Wang, L. Lei, and W. Zuo, “Physics-guided iso-dependent sensor noise modeling for extreme low-light photography,” in CVPR, 2023.
Xin Jin received the BS degree from the College of Software, Nankai University, China, in 2022. He is currently a Ph.D. student at the College of Computer Science, Nankai University. His research interests include computational photography and video/image processing.
Jia-Wen Xiao received his BS degree from the College of Computer Science, Nankai University, China, in 2022. He is currently a Ph.D. student at the College of Computer Science, Nankai University. His research interests include continual learning, self-supervised learning, few-shot learning, and computational photography.
Ling-Hao Han is a Ph.D. student at the College of Computer Science, Nankai University, under Prof. Ming-Ming Cheng’s supervision. Before that, he received a Bachelor’s Degree from Nankai University in 2020. His research interests include image restoration, low-light image enhancement, and computational photography.
Chunle Guo received his PhD from Tianjin University in China. He continued his research as a Research Associate with the Department of Computer Science, City University of Hong Kong (CityU), from 2018 to 2019. He is now a postdoctoral research fellow working with Prof. Ming-Ming Cheng at Nankai University. His research interests lie in image processing, computer vision, and deep learning.
Xialei Liu is currently an associate professor at Nankai University, Tianjin, China. Before that, he was a postdoc research associate at the University of Edinburgh, Edinburgh, UK. He obtained his PhD at the Autonomous University of Barcelona, Barcelona, Spain. He received B.S. and M.S. degrees from Northwestern Polytechnical University, Xi’an, China, in 2013 and 2016, respectively. His research interests include continual learning, self-supervised learning, and few-shot learning.
Chongyi Li is a professor at Nankai University, China. He was a Research Fellow and then a Research Assistant Professor with City University of Hong Kong and Nanyang Technological University from 2018 to 2023. His research interests include image enhancement and restoration, image generation and editing, and underwater imaging. He serves as an AE of the IEEE TCSVT, and a Lead Guest AE of IJCV. He is an IEEE Senior Member.
Ming-Ming Cheng received his PhD degree from Tsinghua University in 2012, and then worked with Prof. Philip Torr in Oxford for 2 years. Since 2016, he has been a full professor at Nankai University, leading the Media Computing Lab. His research interests include computer vision and computer graphics. He has received awards including the ACM China Rising Star Award and the IBM Global SUR Award. He is a senior member of the IEEE and on the editorial boards of IEEE TPAMI and IEEE TIP.