
LiAuto, China
{lifei3,houwenbo,jiapeng}@lixiang.com

RMFA-Net: A Neural ISP for Real RAW to RGB Image Reconstruction

Fei Li, Wenbo Hou, Peng Jia

Abstract

Deep learning-based ISP algorithms have demonstrated significant potential in RAW-to-RGB reconstruction. However, existing networks do not fully account for the specific characteristics of raw data, such as the black level and the color filter array (CFA), which can degrade texture and color if mishandled. Uneven exposure in raw data is also rarely handled carefully, which harms contrast and brightness. In this paper, we introduce RMFA-Net to tackle these problems. We perform implicit black level correction to mitigate color shifts in dim scenes. To preserve high-frequency information and prevent misalignment, we propose a novel Three-Channel-Split mode. To address uneven exposure, we design an explicit tone mapping module based on the Retinex theory. We train and evaluate our models on the dataset released by the Mobile AI 2022 Learned Smartphone ISP Challenge. RMFA-Net outperforms previous algorithms, achieving a PSNR of over 25 dB and surpassing the state of the art by 1 dB. Furthermore, we develop a lightweight version, RMFANet-tiny, for engineering deployment; it still maintains strong performance, surpassing the state of the art by 0.5 dB.

Keywords: Neural ISP · Uneven Exposure · RMFANet

1 Introduction

The Image Signal Processor (ISP) is a specialized system designed to reconstruct RGB images from raw data captured by CMOS sensors. Traditional ISP algorithms are manually crafted and rely on a deep understanding of the sensors and complex tuning, which limits their applicability in fields such as autonomous driving and robotics. While image quality is well understood for human vision applications, it is not well-defined for visual perception systems. Deep learning-based ISP algorithms have emerged as a promising approach with significant potential and versatility. In recent years, there has been increasing interest in developing efficient, high-performance learning-based ISP algorithms tailored to specific domain requirements.

Figure 1: RMFA-Net: Residual-Multi-Frequency Attention Network. Figure (a) illustrates the overall architecture of RMFA-Net. The network consists of several key modules: the input module, stack of RMFA blocks and the output layer. Figure (b) illustrates the internal structure of RMFA block.

There are three primary approaches to enhance ISP performance with deep learning algorithms: image enhancement, network in the loop, and end-to-end neural ISP. Image enhancement algorithms [6, 2, 7, 19, 22] focus on improving specific aspects of image quality, such as denoising [38], HDR [35], and super-resolution [36], through post-ISP processing on RGB images. Network in the loop replaces certain sub-modules of traditional hardware ISP pipelines, like demosaicing [25], with neural networks, necessitating frequent data exchange between different hardware types. Neural ISP is an emerging approach that replaces the entire ISP system with an end-to-end neural network [30], directly taking raw data as input and producing RGB images, showing promise in enhancing image quality and perception performance.

When designing a neural ISP algorithm, careful consideration must be given to the preprocessing of raw data. We argue that raw data preprocessing should be sensor-dependent, considering characteristics such as black level, CFA, and pattern, which vary significantly among sensors [27]. Black level, for instance, impacts color, particularly in dim scenes. Different CFAs have varying sampling rates. For example, in the Bayer filter [1], the sampling rate of the green channel is twice that of the red and blue channels, which means the green channel contains more information. Changes in patterns contribute to color shifts in trained networks. An effective ISP system should address key challenges related to resolution, contrast, color, and brightness. From a deep learning perspective, resolution relies more on local features [21], while color and brightness depend more on global features [14]. Global statistics affect overall image contrast, while local statistics impact contrast in smaller regions [3]. Therefore, a network should accurately extract and represent both global and local features.

In this work, we propose a novel data preprocessing method that preserves raw information by keeping the G channel unsplit while avoiding misalignment through a pixel-location-invariant channel split. We introduce RMFA-Net (Residual Multi-Frequency Attention network), a new network architecture utilizing convolutional modules with various kernel sizes to extract global and local features. Additionally, we design an explicit tone mapping module based on Retinex theory [23] to remove uneven exposure interference from raw data, facilitating more effective tone mapping learning. Finally, we combine these designs into a unified module called RMFA, serving as the fundamental building block of RMFA-Net. Through training and evaluation on the Fujifilm UltraISP dataset [13], our proposed method achieves image quality exceeding 25 dB in terms of PSNR and an SSIM score of 0.889.

Our contributions can be summarized as follows:

[1] We propose a new method of data preprocessing that incorporates explicit black level correction to prevent color shifts in dim scenes. Additionally, we introduce a three-channel-split mode to address the differences in sampling rates, ensuring the preservation of complete raw information while avoiding misalignment;

[2] To improve the learning of tone mapping, we design an explicit algorithm based on Retinex theory, effectively removing uneven exposure from the raw data;

[3] Furthermore, we present RMFA-Net, an end-to-end deep learning approach for raw to RGB image reconstruction. Through our proposed method, we achieve state-of-the-art image quality on the Fujifilm UltraISP dataset.

The remainder of the paper is structured as follows. Section 2 summarizes related work on deep learning-based ISP algorithms from the perspective of network architecture. Section 3 presents our proposed network, RMFA-Net, and describes the underlying design choices. Section 4 shows and analyzes the experimental results and discusses the limitations of the solution. Finally, Section 5 concludes the paper.

2 Related Work

A seminal work in the field is PyNET [14], an explicit multi-branch architecture trained to directly map raw Bayer sensor data to RGB images captured by a DSLR camera. This model adopts an inverted pyramidal shape and processes images at five different scales, with each scale corresponding to a branch that is trained sequentially. PyNET effectively extracts both global and local features, merging them to generate the final outputs. It achieves image quality on par with the commercial ISP system of the Huawei P20 camera phone. A subsequent work, PyNET-CA [18], introduced a channel attention mechanism to further enhance performance. Moreover, lightweight versions of PyNET, Micro ISP [12] and PyNET-V2 [11], were proposed to enable efficient execution on mobile devices, achieving a balance between image quality and computational efficiency.

Since the introduction of SRCNN [19], which addressed the super-resolution challenge using a CNN, various derivative models have been proposed to tackle image reconstruction tasks [31, 39, 32, 4]. Building upon this structure, AIISP [9] proposed the Channel Spatial Attention Network, incorporating double attention modules (DAM) with skip connections to enhance spatial dependencies and overcome the vanishing gradient problem. ENERZAi [9] introduced DenseNet-based residual blocks with separable convolutions, transposed convolution, and a channel attention mechanism to improve image quality. CVML [9] utilized residual blocks to extract a rich set of features from the input data, while a transposed convolution layer was employed for upsampling the final feature maps to the target resolution. EdS [9] proposed a ResNet-based architecture derived from [10] and incorporated two additional $4\times 4$ convolutional layers with a stride of 2 to extract global features. Multimedia [13] introduced the enormous Re-parameter Convolution (eReopConv) layer as a replacement for standard convolution, while HITZST01 [13] proposed the Enhance Features Module in their RFD-CSA architecture, which effectively extracts features at multiple model levels while maintaining performance with a long-term residual connection.

The U-Net [29] architecture is also widely adopted in image reconstruction tasks [34, 15, 5, 16, 8]. Specifically for the Raw2RGB reconstruction task, W-Net [33] combines two U-Net structures with channel attention modules and achieves good performance. SalGAN [28] employs the U-Net structure as the generator in an adversarial training scheme and incorporates a spatial attention scheme into the loss function. isp_forever [9] proposes a U-Net-based model augmented with a channel attention module. MiAlgo [13] introduces a 4-level U-Net-based structure, where several convolutional layers are replaced with a residual group to enhance the network’s reconstruction ability. CASIA 1st [13] adopts a teacher-guided training strategy and proposes both teacher and student networks based on the U-Net architecture, incorporating a self-attention module.

A notable event in the field is the Learned Smartphone ISP on Mobile GPUs with Deep Learning, Mobile AI & AIM Challenge, which took place in 2021 [9] and 2022 [13]. The challenge aimed to foster the development of efficient and high-performance models for inference on edge devices. Participants presented numerous innovative ideas that pushed the performance boundaries to new heights.

3 Proposed Method

3.1 Network Architecture

Fig. 1 provides a schematic representation of the proposed deep learning architecture. The network is divided into three main parts: the input module, a stack of RMFA blocks, and the output module. The input module takes images of size $256\times 256\times 3$ as input and expands the depth from 3 to a uniform width. In this part, two convolutional layers with a kernel size of $3\times 3$ are stacked. It is important to note that the tanh function is used to map the results to the interval $(-1, 1)$ [9]. The second part consists of a stack of RMFA modules. The third part is the output module, where a convolutional layer followed by a sigmoid activation function is used to generate the output.

3.2 Black Level Correction

In an electronic imaging device, the sensor generates electrons even in the absence of light. These electrons are also collected and read out by the sensor. To address this issue, the sensor incorporates optical black pixels to mitigate the black level and adds a fixed pedestal to the final output. For instance, the IMX586 sensor (see https://www.sony.com/en/SonyInfo/News/Press/201807/18-060E/) has a black offset of 63. However, the pedestal can introduce color shift, particularly in dim scenes. Instead of directly normalizing the raw data as done in many existing deep learning-based algorithms, we subtract the black level first to address this concern.
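
The effect of the pedestal on dim pixels can be illustrated with a minimal sketch. The black offset of 63 is the value quoted above; the white level of 4095 (12-bit data) and the function names are assumptions made for illustration, not part of our implementation.

```python
def black_level_correct(raw, black_level=63, white_level=4095):
    """Subtract the pedestal, clip to the valid range, and scale to [0, 1].

    raw: 2-D list of integer sensor values.
    black_level: pedestal; 63 is the IMX586 offset quoted in the text.
    white_level: assumed saturation value (12-bit), an illustrative choice.
    """
    span = white_level - black_level
    return [[min(max(v - black_level, 0), span) / float(span) for v in row]
            for row in raw]


def naive_normalize(raw, white_level=4095):
    """Direct normalization that leaves the pedestal inside the signal."""
    return [[v / float(white_level) for v in row] for row in raw]


# A dim pixel pair: true signal R = 10, G = 20 on top of the pedestal.
dim = [[63 + 10, 63 + 20]]
naive = naive_normalize(dim)
corrected = black_level_correct(dim)

# Ratio R/G seen by the network with and without the correction.
naive_ratio = naive[0][0] / naive[0][1]              # about 0.88, pedestal dominates
corrected_ratio = corrected[0][0] / corrected[0][1]  # 0.5, the true ratio
```

In this toy case the uncorrected R/G ratio is pulled towards 1 because the pedestal dominates both channels, which is one way to picture the color shift in dim scenes.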

3.3 Three Channel Split

In previous works, it is common practice to split each channel (R, Gr, Gb, B) of the Bayer pattern and stack them as the inputs to neural networks, as depicted in Fig. 2(b). However, we argue that this approach may not be optimal for two reasons.

Figure 2: (a) Bayer CFA; (b) Four-Channel-Split mode; (c) Three-Channel-Split mode

Firstly, each channel of the raw data has different sampling rates. In the Bayer pattern, the green channel has a sampling rate twice that of the red and blue channels, as shown in Fig. 2(a). Additionally, the green channel typically exhibits better sensitivity compared to the red and blue channels, resulting in more textures, especially high-frequency textures, and higher signal-to-noise ratio (SNR). We believe that the green channel is more beneficial for recovering high-frequency textures and should be handled with care.

Secondly, there is a loss of high-frequency information when using the four-channel mode. In this mode, the green channel is further split into Gr and Gb channels. This additional split corresponds to applying downsampling to the green channel, which inevitably leads to the loss of high-frequency information in the raw data. The size of data in the four-channel mode is half that of the original size, as depicted in Fig. 2(b). For example, a $3\times 3$ patch of four-channel data corresponds to a $5\times 5$ patch in the original-sized data. This artificial change in spatial frequency makes it challenging for the network to accurately extract and reconstruct the lost high-frequency information in the original $5\times 5$ patch.

Furthermore, misalignment is another issue to consider. As illustrated by the black box in Fig. 2(a), pixels from the same location in the four channels actually correspond to $2\times 2$ neighborhoods in the original raw data. This misalignment is likely to cause blur and negatively impact the reconstruction quality.

To address the aforementioned problems, we have devised a new method, as illustrated in Fig. 2(c). We split the Bayer raw data into three channels (R, G, B), where each channel retains the size of the raw data. For the unsampled pixels, we fill them with 1s. As a result, the sampling rate of the G channel remains unchanged, preserving the high-frequency texture information as much as possible. We believe this approach will be more beneficial for the network to reconstruct the high-frequency information accurately.
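
The two packing schemes can be contrasted with a short sketch. It assumes an RGGB layout (R at even row and even column, B at odd row and odd column, G elsewhere); the actual layout is sensor-dependent, and the function names are illustrative only.

```python
def three_channel_split(raw, fill=1):
    """Split a Bayer mosaic into full-size R, G, B planes.

    Unsampled positions are filled with `fill` (1, as described in the text).
    Assumes an RGGB layout, which is an illustrative choice.
    """
    h, w = len(raw), len(raw[0])
    r = [[fill] * w for _ in range(h)]
    g = [[fill] * w for _ in range(h)]
    b = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if y % 2 == 0 and x % 2 == 0:
                r[y][x] = raw[y][x]
            elif y % 2 == 1 and x % 2 == 1:
                b[y][x] = raw[y][x]
            else:
                g[y][x] = raw[y][x]  # Gr and Gb stay on one plane
    return r, g, b


def four_channel_split(raw):
    """Conventional packing: four half-resolution planes (R, Gr, Gb, B)."""
    r = [row[0::2] for row in raw[0::2]]
    gr = [row[1::2] for row in raw[0::2]]
    gb = [row[0::2] for row in raw[1::2]]
    b = [row[1::2] for row in raw[1::2]]
    return r, gr, gb, b


raw = [[10, 20, 11, 21],
       [22, 30, 23, 31],
       [12, 24, 13, 25],
       [26, 32, 27, 33]]
r, g, b = three_channel_split(raw)
r4, gr4, gb4, b4 = four_channel_split(raw)
```

In the three-channel output every value keeps its original pixel coordinate, and the G plane retains its quincunx sampling. In the four-channel output, index (0, 0) of each plane refers to a different pixel of the same $2\times 2$ neighborhood.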

It is important to note that this method requires additional computation and memory compared to the traditional four-channel mode. However, we have made efforts to optimize the network architecture to strike a balance between computational cost and memory usage. The goal is to ensure efficient processing while still retaining the benefits of the proposed splitting method.

By adopting this new approach, we aim to address the issues related to sampling rate, high-frequency information loss, and misalignment, ultimately improving the reconstruction quality of the ISP algorithm.

3.4 Texture Module

In order to achieve high-quality image reconstruction, it is important to capture both high-frequency and low-frequency textures. To address this, we have designed a texture module that consists of two sub-branches:

  1) High-Frequency Information Extraction Branch: This sub-branch focuses on extracting high-frequency information from the input data. It utilizes a kernel size of $1\times 1$ to capture fine details and subtle variations in the image. By using a smaller kernel size, the network can effectively capture high-frequency textures and preserve the intricate details in the reconstructed image.

  2) Low-Frequency Branch: This sub-branch is responsible for capturing the low-frequency information in the input data. It utilizes a larger kernel size of $3\times 3$ to capture broader features and smooth out the image. The larger kernel size allows the network to capture low-frequency textures, such as overall color and tone variations, and ensure the reconstructed image maintains a visually pleasing appearance.

By combining these two sub-branches within the texture module, we aim to capture a wide range of textures and enhance the overall image quality during the reconstruction process. The high-frequency branch preserves fine details, while the low-frequency branch captures broader features, resulting in a more comprehensive representation of the image’s texture information.

3.5 Tone Mapping

To generate 8-bit RGB images from 12-bit raw data, it is crucial for the network to perform tone mapping, which involves adjusting the tone locally and globally to achieve higher contrast while maintaining consistent brightness and color. Existing neural network-based methods typically achieve this implicitly by minimizing certain structure losses. However, these approaches often overlook the uneven exposure property of raw data, which can hinder the learning process.

Raw data is typically captured with auto exposure to obtain suitable brightness. However, the auto exposure strategy may only properly cover certain areas of the image, leaving other regions overexposed or underexposed. This introduces a challenge as the unevenness can vary with ambient illuminance, resulting in diverse tone distributions within the raw data. When learning a tone mapping function, the network is expected to locally brighten underexposed areas, darken overexposed areas, and globally render the tone to adapt to the 8-bit output. Essentially, the network needs to map this diversity to a relatively consistent range, which can be seen as an approximate many-to-one mapping. We argue that this significantly increases the difficulty and complexity of the learning process.

In our approach, we aim to simplify this problem by addressing the uneven exposure directly and transforming the relationship from a many-to-one to a one-to-one mapping. According to Retinex theory [23], an image can be decomposed into reflectance and illumination components:

$$Retinex(I) = Reflectance(r) \cdot Illumination(S) \tag{1}$$

The reflectance component represents the intrinsic properties of the image and remains consistent under all lighting conditions, while the illumination component represents the variations in lighting [23]. We assume that by extracting the consistent reflectance, we can remove the uneven exposure, simplifying the tone mapping function from a many-to-one mapping to a one-to-one mapping.
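
As a side note, the multiplicative model in Eq. (1) becomes additive in the log domain, which gives one way to read illumination removal as a subtraction. This is a standard property of the Retinex model and is stated here only as intuition, using the symbols $I$, $r$, and $S$ from Eq. (1); a network operating on learned feature maps does not compute explicit logarithms.

```latex
\log I = \log r + \log S
\quad\Longrightarrow\quad
\log r = \log I - \log S
```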

Based on this analysis, we explicitly designed a tone mapping module. To estimate the illumination component, which relies more on global features, we introduced a Pooling-Convolution structure. This structure downsamples the input data through a pooling layer and then applies a convolutional layer. The initial pooling layer enlarges the receptive field of the subsequent convolutional layer, and this unit can be repeated to extract more accurate global features. The feature maps are then upsampled to their original size. To estimate the reflectance, we directly subtract the estimated illumination from the outputs of the texture module. As described in Section 3.4, the texture module extracts local textures using small convolutional kernels. By removing the low-frequency information extracted by the global features, we consider the final outputs as the reflectance of the image.
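
The pool, upsample, and subtract steps can be sketched on a toy feature map. This is a simplified illustration under stated assumptions: the learned convolution after pooling is omitted, average pooling and nearest-neighbor upsampling are arbitrary choices, and the same map stands in for both the module input and the texture output.

```python
def avg_pool2x2(x):
    """2x2 average pooling with stride 2 (height and width must be even)."""
    h, w = len(x), len(x[0])
    return [[(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4.0
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]


def upsample_nearest2x(x):
    """Nearest-neighbor upsampling by a factor of 2 in both directions."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def remove_illumination(texture, source):
    """Estimate a smooth illumination map from `source` and subtract it."""
    illumination = upsample_nearest2x(avg_pool2x2(source))
    return [[t - s for t, s in zip(t_row, s_row)]
            for t_row, s_row in zip(texture, illumination)]


# Same fine pattern (+/-1) on a dim left half and a bright right half.
x = [[1, 3, 9, 11],
     [3, 1, 11, 9],
     [1, 3, 9, 11],
     [3, 1, 11, 9]]
residual = remove_illumination(x, x)
```

After the subtraction, the dim and bright halves carry the same residual pattern, which is the sense in which uneven exposure is removed before the tone is rendered.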

By explicitly considering the uneven exposure property and designing a dedicated tone mapping module, we aim to improve the learning process and enhance the network’s ability to perform effective tone mapping while generating 8-bit RGB images from 12-bit raw data.

3.6 RMFA block

The Residual Multi-Frequency Attention (RMFA) module serves as the fundamental building block of our model. As depicted in Fig. 1(b), the outputs of the texture module and tone mapping module are first concatenated. Subsequently, a convolutional layer is employed to map the number of feature maps back to the original channel width. A ReLU activation function is applied after each convolutional operation. Channel attention and spatial attention modules are then added sequentially. Finally, a skip connection is introduced to prevent performance degradation. Being a versatile building block, the RMFA module can be seamlessly integrated into various architectures, enhancing the flexibility and adaptability of our model.

4 Experiment and Results

4.1 Dataset

The Mobile AI Workshop [13, 9] provided a dataset consisting of 24,000 pairs of training RAW-RGB images for algorithm development. In this paper, we utilized this dataset for both training and evaluating our proposed model. Originally, this dataset was part of the Fujifilm UltraISP dataset [11], which involved capturing high-quality images using the Fujifilm GFX100 medium format 102 MP camera and acquiring raw data from the Sony IMX586 Quad Bayer mobile camera sensor. As the collected RAW-RGB image pairs were not perfectly aligned, a state-of-the-art deep learning-based dense matching algorithm was employed to align the images. From the aligned images, $256\times 256$ pixel patches were extracted for further processing. It is important to note that all alignment operations were conducted solely on the Fujifilm RGB images, while the RAW data from the Sony sensor remained unmodified, preserving the original values as read from the sensor.

To create train and test datasets, we randomly divided the dataset with a ratio of 9:1, ensuring a stable evaluation. We repeated this experiment 10 times and the reported results are the average output of each repetition, providing robustness and reliability in our findings.
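
The evaluation protocol can be sketched as follows. The seeding scheme and the `evaluate` callback are hypothetical placeholders for training and scoring a model; only the 9:1 ratio, the 10 repetitions, and the dataset size of 24,000 pairs come from the text.

```python
import random


def split_indices(n_items, train_ratio=0.9, seed=0):
    """Shuffle item indices and split them into train and test parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_items * train_ratio))
    return indices[:cut], indices[cut:]


def averaged_metric(n_items, evaluate, repeats=10):
    """Average a metric over several independent random splits.

    `evaluate(train_idx, test_idx)` is a hypothetical callback that trains
    on one part and returns a score (e.g. PSNR) on the other.
    """
    scores = []
    for rep in range(repeats):
        train_idx, test_idx = split_indices(n_items, seed=rep)
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / len(scores)


train_idx, test_idx = split_indices(24000)
```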

4.2 Training Details

The loss function employed for training the RMFA-Net model is a linear combination of several components: the L1 loss, a perceptual loss based on four VGG16 layers [17], a structural similarity index measure (SSIM) loss [37], and a color loss [10]. The weight $\theta$ of the L1 loss is set to 1.0, while the weights for the other components (perceptual, SSIM, and color) are adjusted accordingly. The overall loss function is defined as follows:

$$Loss = \theta L_{l1} + \eta L_{vgg} + \lambda L_{ssim} + \gamma L_{color} \tag{2}$$
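
Eq. (2) amounts to a weighted sum of four scalar terms. In the sketch below only $\theta = 1.0$ is taken from the text; the remaining weights are left as required arguments because their values are not stated, and the numbers in the usage line are arbitrary placeholders.

```python
def total_loss(l1, vgg, ssim, color, *, eta, lam, gamma, theta=1.0):
    """Weighted sum of the four loss terms in Eq. (2).

    theta defaults to 1.0 as stated in the text; eta, lam (lambda), and
    gamma must be supplied by the caller.
    """
    return theta * l1 + eta * vgg + lam * ssim + gamma * color


# Placeholder term values and weights, for illustration only.
example = total_loss(0.5, 0.2, 0.1, 0.05, eta=1.0, lam=1.0, gamma=1.0)
```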

The model was implemented using PyTorch and trained on a single NVIDIA Tesla A100 GPU. The batch size varied between 8 and 64, depending on the scale of the model. During training, image patches were randomly extracted from the RAW images, with dimensions of $3\times 256\times 256$ using three-channel packing (RGB). Corresponding patches were also extracted from the sRGB images with the same dimensions.

The model parameters were optimized for 1000 epochs using the Adam algorithm [20] with an initial learning rate of 1e-3. A cosine annealing schedule [26] was employed, aggressively decreasing the learning rate to 1e-7 every 100 epochs.
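
One reading of this schedule is cosine annealing with a restart every 100 epochs. The sketch below follows that interpretation, which is an assumption; the function name is illustrative, and the three constants are the values quoted above.

```python
import math


def cosine_restart_lr(epoch, lr_max=1e-3, lr_min=1e-7, period=100):
    """Learning rate under cosine annealing that restarts every `period` epochs."""
    t = epoch % period  # position inside the current cycle
    cosine = 1.0 + math.cos(math.pi * t / period)
    return lr_min + 0.5 * (lr_max - lr_min) * cosine


schedule = [cosine_restart_lr(e) for e in range(1000)]
```

Under this reading the rate starts at 1e-3, decays towards 1e-7 by the end of each 100-epoch cycle, and then jumps back to 1e-3 at the start of the next cycle.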

4.3 Ablation Study

We compared our method with the top 5 winners of the Mobile AI Workshop@CVPR 2022 [13]. The workshop allowed participants to develop larger and more powerful models for the task at hand. To evaluate the image quality, we used the PSNR and SSIM metrics, which were also employed by the workshop. Additionally, we provide the parameters of each model for reference.
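
For reference, PSNR can be computed from the mean squared error as in the following sketch. The peak value of 255 assumes 8-bit outputs, and the helper names are illustrative; this is the standard definition rather than the challenge's scoring script.

```python
import math


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2-D lists."""
    squared_error = 0.0
    count = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            squared_error += (p - t) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks for a PSNR gain of `gain_db` dB."""
    return 10.0 ** (-gain_db / 10.0)


one_db = mse_ratio_for_gain(1.0)  # about 0.794
```

A gain of 1 dB therefore corresponds to roughly a 20% reduction in mean squared error.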

Table 1: Average PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index Measure) results on the test images.

| Model | PSNR (dB) | SSIM | Parameters (MB) |
|---|---|---|---|
| RMFA-Net (ours) | 25.1 | 0.889 | 0.19 |
| HITZST01 | 24.09 | 0.8667 | 1.2 |
| ENERZAi | 24.08 | 0.8778 | 4.5 |
| CASIA 1st | 24.09 | 0.884 | 205 |
| Multimedia | 23.96 | 0.8543 | 0.029 |
| HITZST01 | 23.89 | 0.8666 | 0.060 |

Figure 3: Outputs of RMFA-Net (top) and target photos captured with the Fujifilm GFX100 (bottom).

As shown in Tab. 1, our method achieved the highest results in terms of PSNR and SSIM. Fig. 3 displays the outputs of RMFA-Net alongside the corresponding target photos. It is evident that the network’s outputs are accurate and closely resemble the ground truth. Both global and local levels of brightness are reconstructed with precision. The estimation of hue and saturation of colors is also commendable, exhibiting no noticeable color cast. The white balance is successfully achieved. Considering that the dataset was collected during daytime with sunlight as the only illuminant, it should be noted that when designing a neural ISP algorithm with a more powerful and robust white balance function, it is important to consider a wider range of illuminant types and color temperatures (CCT).

The recovery of resolution and texture is consistent with expectations, particularly in high-frequency areas. Moreover, no prominent artifacts, such as false color or zipper effects, are observed in these regions. The overall contrast and sharpness of the output images are also commendable, indicating successful learning of the tone mapping function.

4.4 The Effectiveness of Three Channel Split Mode

To demonstrate the effectiveness of the three-channel-split approach proposed in this paper, we designed a four-channel version of the network. The raw data is preprocessed following the method used in PyNET [14], and pixel-shuffle operations are employed to upsample the network outputs to the original size.

Table 2: Quantitative results of RMFA-Net and RMFA-Net-four-channel-split. The winner of the Mobile AI Workshop@CVPR 2022 [13] is included for reference.

| Model | PSNR (dB) | SSIM | Parameters (MB) |
|---|---|---|---|
| RMFA-Net | 25.1 | 0.889 | 0.19 |
| RMFA-Net-Four | 24.7 | 0.881 | 0.196 |
| HITZST01 | 24.09 | 0.8667 | 1.2 |

Figure 4: Outputs of four-channel mode (left), three-channel mode (middle), and target photos captured with the Fujifilm GFX100 (right).

From Tab. 2, it can be observed that the three-channel-split method outperforms the four-channel approach by approximately 0.4 dB. Fig. 4 illustrates the visual results of each model alongside their corresponding ground truth images. In the output images of the four-channel model, noticeable artifacts such as moire patterns, blurring, zipper effects, and false colors are evident. Conversely, these artifacts are barely noticeable in the outputs of RMFA-Net, highlighting the effectiveness of the three-channel-split mode.

In the four-channel-split mode, the input data size is half of the raw data. To reconstruct the original size, upsampling operations are necessary. Pixel shuffle and interpolation are commonly used for this purpose. However, it is important to note that there is a natural limitation on the sampling rate in order to accurately recover the original signal. According to the Nyquist Sampling Theorem [24], a bandlimited continuous-time signal can be perfectly reconstructed from its samples if the waveform is sampled at a rate that is at least twice as fast as its highest frequency component. In other words, the highest frequency component that can be accurately represented is limited to:

$$f_{max} < \frac{1}{2} f_{s} \tag{3}$$

Frequencies higher than this limit cannot be accurately recovered and will appear as low-frequency aliasing. This is why moire patterns or zipper effects are more likely to occur in the four-channel mode, especially in areas with high frequencies. To remove these effects, more complex network architectures or post-processing techniques would be required. This explains the observed PSNR drop in the four-channel mode models.
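
The aliasing argument can be illustrated in one dimension. The sketch below keeps every second sample, as the Gr/Gb split does along each axis; the signals are toy square patterns chosen for illustration, not measured data.

```python
def decimate(signal, factor=2, phase=0):
    """Keep every `factor`-th sample, starting at index `phase`."""
    return signal[phase::factor]


# Finest pattern on the full grid: alternates every pixel (0.5 cycles/pixel).
fine = [1, 0] * 8
# Coarser pattern: period of 8 pixels (0.125 cycles/pixel).
coarse = [1, 1, 1, 1, 0, 0, 0, 0] * 2

fine_even = decimate(fine, phase=0)   # all ones: the texture is gone
fine_odd = decimate(fine, phase=1)    # all zeros: a different flat value
coarse_even = decimate(coarse)        # still alternates, structure survives
```

After decimation the sampling rate is 0.5 samples per original pixel, so Eq. (3) limits the representable frequency to below 0.25 cycles per pixel. The fine pattern exceeds this and collapses to a constant whose value depends on the sampling phase, while the coarse pattern is preserved.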

4.5 The Effectiveness of Tone Mapping Module

To demonstrate the effectiveness of the tone mapping module, we conducted an experiment where we removed it from RMFA-Net. The results are reported in Tab. 3, revealing a decrease in PSNR of over 0.2 dB. Fig. 5 visually presents the impact of the tone mapping module, showing that it achieves a high level of consistency with the target photos. The outputs of RMFA-Net with the tone mapping module exhibit better contrast and sharpness. The brightness is accurately reconstructed at a similar level to the target photos.

Table 3: Results of RMFA-Net and RMFA-Net-without-tonemapping. The winner of the Mobile AI Workshop@CVPR 2022 [13] is included for reference.

| Model | PSNR (dB) | SSIM | Parameters (MB) |
|---|---|---|---|
| RMFA-Net | 25.1 | 0.889 | 0.19 |
| RMFA-Net-without-tonemapping | 24.87 | 0.886 | 0.196 |
| HITZST01 | 24.09 | 0.8667 | 1.2 |

By comparing the first and third columns of Fig. 5, it is evident that without the tone mapping module, the network fails to accurately recover information in dim scenes. Additionally, in the absence of the module, the highlight areas become prone to overexposure, leading to a loss of details as depicted in the fourth column. With the inclusion of the tone mapping module and accurate brightness adjustment, the saturation is reconstructed more correctly, as shown in the second and fifth columns. The absence of the tone mapping module results in an obvious color cast in the network outputs, indicating that the explicitly designed tone mapping module is more effective in achieving accurate and consistent color representation.

Figure 5: Outputs of RMFA-Net-without-tonemapping (top row), RMFA-Net (middle row), and Fujifilm GFX100 target photos (bottom row).

4.6 Adjusting the Model Complexity

RMFA-Net offers the flexibility to adapt its computational complexity by adjusting the number of building blocks and the channel width. This allows for the design of networks with varying sizes based on the desired runtime, task performance, and computational budget. In this section, we present five RMFA-Net models with different depths and widths. Specifically, for the tiny model, we stack two RMFA blocks with a channel width of 16. For the medium model, we use 8 RMFA blocks with a channel width of 16. The large model consists of 20 RMFA blocks with a channel width of 16. Additionally, we extend the channel width of the large model to 32 and 64. The quantitative results are presented in Tab. 4, indicating that even the tiny model achieves a commendable score. This demonstrates the flexibility of the model’s architecture, allowing for the adjustment of complexity to accommodate devices with varying computational power.
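
The five variants can be summarized as a small configuration table. The class and field names below are illustrative and do not correspond to a released implementation; the block counts and widths are those given above, and the variant names follow Tab. 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RMFAConfig:
    """Depth and width of one RMFA-Net variant (hypothetical container)."""
    num_blocks: int  # number of stacked RMFA blocks
    width: int       # channel width of the feature maps


VARIANTS = {
    "Tiny-RMFA-Net-W16": RMFAConfig(num_blocks=2, width=16),
    "Medium-RMFA-Net-W16": RMFAConfig(num_blocks=8, width=16),
    "Large-RMFA-Net-W16": RMFAConfig(num_blocks=20, width=16),
    "Large-RMFA-Net-W32": RMFAConfig(num_blocks=20, width=32),
    "Large-RMFA-Net-W64": RMFAConfig(num_blocks=20, width=64),
}
```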

Table 4: Quantitative results of several RMFA-Net models with different depths and widths. The winner of the Mobile AI Workshop@CVPR 2022 [13] is included for reference.

| Model | PSNR (dB) | SSIM | Parameters (MB) |
|---|---|---|---|
| Tiny-RMFA-Net-W16 | 24.549 | 0.88 | 0.022 |
| Medium-RMFA-Net-W16 | 24.86 | 0.8819 | 0.0798 |
| Large-RMFA-Net-W16 | 25.1 | 0.889 | 0.19 |
| Large-RMFA-Net-W32 | 25.315 | 0.8911 | 0.7726 |
| Large-RMFA-Net-W64 | 25.22 | 0.8906 | 3.16 |
| HITZST01 | 24.09 | 0.8667 | 1.2 |

4.7 Limitations

While RMFA-Net shows significant improvements in RAW to RGB image reconstruction, there are still several avenues for future exploration and improvement. We identify the following limitations and challenges:

Generalization to diverse RAW inputs: Our model is trained and evaluated on a specific dataset with limited variations in illuminants and color temperatures. Future work should focus on enhancing the generalization ability of RMFA-Net to handle a wider range of RAW inputs captured in various lighting conditions and scenes.

Computational efficiency: While RMFA-Net achieves strong image quality, it comes with a higher computational cost. This cost is primarily introduced by the three-channel split mode, which keeps the input data at its original size. Balancing computational cost against the desired level of image quality is therefore an important consideration when deploying RMFA-Net in practical applications. Moreover, optimizing the model's architecture or developing lightweight versions without compromising performance would be beneficial, especially for real-time applications on resource-constrained devices.

Robustness to noise and artifacts: RAW images often contain noise and artifacts due to sensor limitations or imperfect alignment. Enhancing the robustness of RMFA-Net to handle such challenges and produce clean and artifact-free outputs is an important area for future research.
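
The computational-efficiency limitation above attributes the extra cost to keeping the input at its original size. The sketch below illustrates why that matters, under two loudly stated assumptions that are not confirmed by this section: the sensor uses an RGGB Bayer layout, and the three-channel split places each CFA sample in its own color plane at full resolution with zeros elsewhere. It is compared against the common four-channel packing, which halves the spatial size. The function names are hypothetical and this is not the authors' implementation.

```python
# Illustrative only. Assumes an RGGB Bayer layout and a zero-filled,
# full-resolution three-plane split; neither is confirmed by the paper text here.

def three_channel_split(raw):
    """Scatter an H x W RGGB mosaic into three H x W planes (R, G, B)."""
    h, w = len(raw), len(raw[0])
    planes = [[[0] * w for _ in range(h)] for _ in range(3)]
    for y in range(h):
        for x in range(w):
            if y % 2 == 0 and x % 2 == 0:
                c = 0  # red site
            elif y % 2 == 1 and x % 2 == 1:
                c = 2  # blue site
            else:
                c = 1  # green site
            planes[c][y][x] = raw[y][x]
    return planes


def pack_four_channel(raw):
    """Pack an H x W RGGB mosaic into four (H/2) x (W/2) planes (R, G1, G2, B)."""
    h, w = len(raw), len(raw[0])
    offsets = [(0, 0), (0, 1), (1, 0), (1, 1)]
    return [
        [[raw[y + dy][x + dx] for x in range(0, w, 2)] for y in range(0, h, 2)]
        for dy, dx in offsets
    ]


# Toy 4 x 4 mosaic with distinct values so every sample can be traced.
raw = [[y * 4 + x for x in range(4)] for y in range(4)]
split = three_channel_split(raw)
packed = pack_four_channel(raw)

# Number of spatial positions each convolution has to visit.
split_positions = len(split[0]) * len(split[0][0])
packed_positions = len(packed[0]) * len(packed[0][0])
print(split_positions, packed_positions, split_positions / packed_positions)
```

Because the cost of a convolution layer scales with the number of spatial positions, a network of equal width running on the full-resolution split visits four times as many positions per layer as one running on the packed input.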

5 Conclusion

In this paper, we proposed RMFA-Net, a novel deep learning model for real RAW to RGB image reconstruction. Through extensive experiments and evaluations, we demonstrated the effectiveness of our approach in terms of quantitative metrics such as PSNR and SSIM, as well as visual quality compared to state-of-the-art methods. RMFA-Net excels in reconstructing accurate brightness, color, texture, and overall image details, providing promising results for real-world RAW image processing.

While RMFA-Net shows significant advancements, several areas remain for future exploration and enhancement. One key aspect is optimizing computational efficiency, which is crucial for practical deployment. Improving the robustness of the model to noise and artifacts would also be beneficial. Finally, given the diversity of sensors, enhancing the model's generalization is an important direction for future research.

References

  • [1] Bayer, B.E.: Color imaging array (1976)
  • [2] Cai, B., Xu, X., Jia, K., Qing, C., Tao, D.: Dehazenet: An end-to-end system for single image haze removal. IEEE Transactions on Image Processing 25(11), 5187–5198 (2016)
  • [3] Cai, J., Gu, S., Zhang, L.: Learning a deep single image contrast enhancer from multi-exposure images. IEEE Transactions on Image Processing 27(4), 2049–2062 (2018)
  • [4] Chen, C., Chen, Q., Xu, J., Koltun, V.: Learning to see in the dark. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018)
  • [5] Cho, S.J., Ji, S.W., Hong, J.P., Jung, S.W., Ko, S.J.: Rethinking coarse-to-fine approach in single image deblurring (2021)
  • [6] Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence 38(2), 295–307 (2015)
  • [7] Guo, Q., Sun, J., Juefei-Xu, F., Ma, L., Xie, X., Feng, W., Liu, Y., Zhao, J.: Efficientderain: Learning pixel-wise dilation filtering for high-efficiency single-image deraining. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 1487–1495 (2021)
  • [8] Hu, X., Naiel, M.A., Wong, A., Lamm, M., Fieguth, P.: Runet: A robust unet architecture for image super-resolution. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops (2019)
  • [9] Ignatov, A., Chiang, C.M., Kuo, H.K., Sycheva, A., Timofte, R.: Learned smartphone isp on mobile npus with deep learning, mobile ai 2021 challenge: Report. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2503–2514 (2021)
  • [10] Ignatov, A., Kobyshev, N., Timofte, R., Vanhoey, K., Van Gool, L.: Dslr-quality photos on mobile devices with deep convolutional networks. In: Proceedings of the IEEE international conference on computer vision. pp. 3277–3285 (2017)
  • [11] Ignatov, A., Malivenko, G., Timofte, R., Tseng, Y., Xu, Y.S., Yu, P.H., Chiang, C.M., Kuo, H.K., Chen, M.H., Cheng, C.M., et al.: Pynet-v2 mobile: Efficient on-device photo processing with neural networks. In: 2022 26th International Conference on Pattern Recognition (ICPR). pp. 677–684. IEEE (2022)
  • [12] Ignatov, A., Sycheva, A., Timofte, R., Tseng, Y., Xu, Y.S., Yu, P.H., Chiang, C.M., Kuo, H.K., Chen, M.H., Cheng, C.M., et al.: Microisp: Processing 32mp photos on mobile devices with deep learning. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part II. pp. 729–746. Springer (2023)
  • [13] Ignatov, A., Timofte, R., Liu, S., Feng, C., Bai, F., Wang, X., Lei, L., Yi, Z., Xiang, Y., Liu, Z., et al.: Learned smartphone isp on mobile gpus with deep learning, mobile ai & aim 2022 challenge: report. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III. pp. 44–70. Springer (2023)
  • [14] Ignatov, A., Van Gool, L., Timofte, R.: Replacing mobile camera isp with a single deep learning model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. pp. 536–537 (2020)
  • [15] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. IEEE (2016)
  • [16] Jia, F., Wong, W.H., Zeng, T.: Ddunet: Dense dense u-net with applications in image denoising. In: International Conference on Computer Vision (2021)
  • [17] Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14. pp. 694–711. Springer (2016)
  • [18] Kim, B.H., Song, J., Ye, J.C., Baek, J.: Pynet-ca: enhanced pynet with channel attention for end-to-end mobile image signal processing. In: Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16. pp. 202–212. Springer (2020)
  • [19] Kim, J., Lee, J.K., Lee, K.M.: Accurate image super-resolution using very deep convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1646–1654 (2016)
  • [20] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [21] Kong, F., Li, M., Liu, S., Liu, D., He, J., Bai, Y., Chen, F., Fu, L.: Residual local feature network for efficient super-resolution (2022)
  • [22] Kupyn, O., Budzan, V., Mykhailych, M., Mishkin, D., Matas, J.: Deblurgan: Blind motion deblurring using conditional adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 8183–8192 (2018)
  • [23] Land, E.H., McCann, J.J.: Lightness and retinex theory. JOSA 61(1), 1–11 (1971)
  • [24] Landau, H.: Sampling, data transmission, and the nyquist rate. Proceedings of the IEEE 55(10), 1701–1706 (1967)
  • [25] Liu, L., Jia, X., Liu, J., Tian, Q.: Joint demosaicing and denoising with self guidance. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2240–2249 (2020)
  • [26] Loshchilov, I., Hutter, F.: Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)
  • [27] Lukac, R., Plataniotis, K.N.: Color filter arrays: design and performance analysis. Consumer Electronics IEEE Transactions on 51(4), 1260–1267 (2005)
  • [28] Pan, J., Canton, C., Mcguinness, K., O’Connor, N.E., Giro-I-Nieto, X.: Salgan: Visual saliency prediction with generative adversarial networks (2017)
  • [29] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. pp. 234–241. Springer (2015)
  • [30] Schwartz, E., Giryes, R., Bronstein, A.M.: Deepisp: Toward learning an end-to-end image processing pipeline. IEEE Transactions on Image Processing 28(2), 912–923 (2018)
  • [31] Shi, W., Caballero, J., Huszár, F., Totz, J., Wang, Z.: Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
  • [32] Uchida, K., Tanaka, M., Okutomi, M.: Non-blind image restoration based on convolutional neural network. In: 2018 IEEE 7th Global Conference on Consumer Electronics (GCCE) (2018)
  • [33] Uhm, K.H., Kim, S.W., Ji, S.W., Cho, S.J., Hong, J.P., Ko, S.J.: W-net: Two-stage u-net with misaligned data for raw-to-rgb mapping. In: 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW). pp. 3636–3642 (2019). https://doi.org/10.1109/ICCVW.2019.00448
  • [34] Wang, C., Xu, C., Wang, C., Tao, D.: Perceptual adversarial networks for image-to-image transformation. IEEE Transactions on Image Processing pp. 4066–4079 (2018)
  • [35] Wang, L., Yoon, K.J.: Deep learning for hdr imaging: State-of-the-art and future trends. IEEE transactions on pattern analysis and machine intelligence 44(12), 8874–8895 (2021)
  • [36] Wang, Z., Chen, J., Hoi, S.C.: Deep learning for image super-resolution: A survey. IEEE transactions on pattern analysis and machine intelligence 43(10), 3365–3387 (2020)
  • [37] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13(4), 600–612 (2004)
  • [38] Zhang, K., Zuo, W., Chen, Y., Meng, D., Zhang, L.: Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE transactions on image processing 26(7), 3142–3155 (2017)
  • [39] Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., Fu, Y.: Image super-resolution using very deep residual channel attention networks. In: Proceedings of the European conference on computer vision (ECCV). pp. 286–301 (2018)