License: arXiv.org perpetual non-exclusive license
arXiv:2403.06831v1 [cs.CV] 11 Mar 2024

HDRTransDC: High Dynamic Range Image Reconstruction with Transformer Deformation Convolution

Shuaikang Shang, Xuejing Kang*, Anlong Ming
School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications
{shangshuaikang, kangxuejing}@bupt.edu.cn, [email protected]
Abstract

High Dynamic Range (HDR) imaging aims to generate an artifact-free HDR image with realistic details by fusing multi-exposure Low Dynamic Range (LDR) images. Because of large motion and severe under-/over-exposure among the input LDR images, HDR imaging suffers from ghosting artifacts and fusion distortions. To address these issues, we propose an HDR Transformer Deformation Convolution (HDRTransDC) network to generate high-quality HDR images, which consists of the Transformer Deformable Convolution Alignment Module (TDCAM) and the Dynamic Weight Fusion Block (DWFB). To remove ghosting artifacts, the proposed TDCAM searches the entire non-reference features for long-distance content similar to the reference feature, which accurately removes misalignment and fills in the content occluded by moving objects. To eliminate fusion distortions, we propose the DWFB, which spatially and adaptively selects useful information across frames to effectively fuse multi-exposed features. Extensive experiments show that our method achieves state-of-the-art performance both quantitatively and qualitatively.

1 Introduction

The limited range of most digital photography sensors causes under-/over-exposed regions to appear in Low Dynamic Range (LDR) images, so these images cannot reproduce what a human observer sees in the actual scene. To generate an image with a broad illuminance range, High Dynamic Range (HDR) imaging has been developed in recent years. The most common HDR imaging methods Debevec and Malik (2008); Granados et al. (2010); Reinhard et al. (2010); Mertens et al. (2007); Ma et al. (2017); Kalantari et al. (2017) merge several LDR images with different exposures to obtain a high-quality HDR image. However, misalignment usually appears in the LDR images due to the motion of the camera and of foreground objects. When misaligned LDR images with different exposures are fused, ghosting artifacts often appear in the HDR results Myszkowski et al. (2008); Zimmer et al. (2011). This problem is challenging because it is difficult to compensate for the misalignment and the occluded content caused by moving objects.

To address the above problems, several traditional methods have been proposed, including alignment-based methods Bogoni (2000); Kang et al. (2003); Zimmer et al. (2011); Sen et al. (2012) and rejection-based methods Zhang and Cham (2011); Granados et al. (2013); Lee et al. (2014); Oh et al. (2014). The alignment-based methods employ optical flow or homographies to explicitly align the motion regions, but they are error-prone and cause ghosts when the LDRs contain motion and saturation. The rejection-based methods attempt to remove the misaligned components before fusion, but they suffer from insufficient information recovery in other regions.

Figure 1: We compare the results of our method and the state-of-the-art networks. In this sample, large motions and under-/over-exposure appear between multi-exposure images. Compared with existing methods that suffer from severe ghosting artifacts, our method can effectively remove the ghosts and generate a high-quality HDR image.

With the rise of Deep Neural Networks (DNNs), recent methods reconstruct an HDR image using Convolutional Neural Networks (CNNs). Some non-end-to-end methods Kalantari et al. (2017); Wu et al. (2018) first explicitly align the LDRs by optical flow or homographies and then merge the aligned LDRs with a CNN. However, these methods are limited by the unreliability of optical flow and homographies, which introduces artifacts into the results. Other, end-to-end methods Yan et al. (2019, 2020); Niu et al. (2021) directly use a CNN to implicitly align and merge multi-exposure information at the feature level. In these methods, a specially designed alignment module or an adversarial learning strategy is used to handle misalignment, thereby achieving better performance. However, the results still suffer from severe ghosting artifacts when the LDR images contain large motions and severe saturation, as shown in Figure 1. The reason lies in the local extraction of the CNN-based alignment module: these methods cannot extract long-range relevant features to align moving objects and recover occluded areas, so the saturated regions of the reference image are filled with moving objects. Besides, existing methods fuse multi-exposure features by simply stacking convolutional layers, which cannot effectively select the complementary information needed to accurately recover the saturated regions, resulting in fusion distortions.

In this paper, to remove ghosting artifacts and fusion distortions and thus generate a high-quality HDR image, we propose an HDR imaging framework with Transformer Deformation Convolution (HDRTransDC), which establishes the complicated alignment and fusion relationship between reference and non-reference images. First, we propose a Transformer Deformable Convolution Alignment Module (TDCAM) to compensate for the misalignment and occluded content. Specifically, our TDCAM uses a Transformer Offset Estimator (TOE) that learns global information to capture the location of long-range content similar to the reference image. A deformable convolution then extracts the matching content and performs a geometric transformation of the original non-reference images, thereby aligning the non-reference images to the reference image. Second, to further alleviate fusion distortions, we propose the Dynamic Weight Fusion Block (DWFB), which spatially and adaptively selects useful information across frames to effectively fuse multi-exposed features. The main contributions of the paper are summarized as follows:

  • The proposed TDCAM extracts long-distance content similar to the reference feature in the entire non-reference features to align the objects and recover the occluded regions, thereby resolving the ghosting artifacts.
  • The proposed DWFB adaptively selects useful information across frames to effectively fuse multi-exposed features, thus further alleviating the fusion distortions.
  • Extensive experiments are conducted to demonstrate that our method achieves excellent results and state-of-the-art performance.

2 Related Work

Several methods have been proposed to alleviate the ghosting artifacts in HDR imaging, and these methods can be generalized into two categories: the traditional methods and the CNN-based methods.

2.1 Traditional methods

2.1.1 Motion alignment methods

These methods focus on aligning the non-reference LDRs to the reference one before fusing them, either by using optical flow or by searching for the best matching region. For example, some works adopt optical flow in the image domain Bogoni (2000); Zimmer et al. (2011) or the luminance domain Kang et al. (2003) to align the LDR images. Sen et al. (2012) utilizes a patch-based energy-minimization formulation that integrates alignment and reconstruction in a joint optimization. However, these methods are nearly ineffective for large motions, since accurate alignment is challenging, and they produce visible ghosting artifacts.

2.1.2 Motion rejection methods

These methods aim to detect motion areas between the reference and non-reference images and exclude these pixels during fusion. For instance, Zhang and Cham (2011) finds the misaligned regions with the guidance of gradient-based quality assessment and generates a motion weighting map. Granados et al. (2013) uses a Markov random field prior and the color difference of the inputs for detection and generation. Lee et al. (2014) and Oh et al. (2014) have also proposed rank minimization to detect motion and synthesize HDR images. However, these methods depend heavily on the effectiveness of motion detection and lose the content of moving objects, so they cannot recover sufficient information.

2.2 CNN-based methods

With the great success of deep learning in vision tasks, several CNN-based methods have been proposed to reconstruct ghost-free high-quality HDR images. Kalantari et al. (2017) proposed the first CNN-based method that uses a CNN to merge the LDR images that were aligned explicitly by optical flow. DeepHDR Wu et al. (2018) formulated HDR imaging as an image translation problem, and used homography to align the camera motion and a CNN to fuse them. However, the optical flow and homography are unreliable in the presence of occlusion and saturation, which results in artifacts and color distortions. Instead of aligning LDRs in the image domain, AHDRNet Yan et al. (2019) employed spatial attention to suppress the motion regions between the LDR images for ghost-free HDR images and improved the performance. Prabhakar et al. (2020) proposed the bilateral guided upsampler and motion compensation based on CNN for HDR deghosting. HDR-GAN Niu et al. (2021) first introduced the Generative Adversarial Networks (GAN) framework to generate the missing content. Although the CNN-based methods can alleviate the ghosts in the weak motion regions and retain reasonable local details, the ghosting artifacts are still severe when LDR images contain large motions and saturation. The main reason lies in the limited receptive field of CNN, which cannot capture long-distance similar features to align the large motion and fill the occluded regions.

3 Method

Figure 2: The architecture of our proposed network. Our network consists of a feature extraction network and a feature merging network. The feature extraction network generates the aligned features through the proposed TDCAM to remove the ghosting artifacts. Then the aligned features are fed into the feature merging network that is constructed based on the proposed DWFB to get the final HDR image.
Figure 3: The framework of the proposed Transformer Deformation Convolution Alignment Module (TDCAM).

In this section, we present the HDRTransDC network, which compensates for large motion and fuses the useful content of multi-exposed features to generate ghost-free, high-quality HDR images. Given three LDR images $\{L_i \mid i = 1, 2, 3\}$ (sorted by their exposure time), our goal is to merge them into an HDR image whose structure matches that of the reference image. Following previous work Kalantari et al. (2017); Yan et al. (2019), the medium-exposure image $L_2$ is regarded as the reference image, and we transform $\{L_i\}$ to the HDR domain $\{H_i\}$ with gamma correction and exposure normalization:

$$H_i = \frac{L_i^{\gamma}}{t_i}, \quad i = 1, 2, 3 \tag{1}$$

where $t_i$ denotes the exposure time of LDR image $L_i$ and $\gamma$ is the gamma correction parameter, which we set to 2.2. It has been observed that the LDR images are effective for detecting saturated and noisy regions, whereas the HDR-domain images are effective for detecting moving objects. Hence, we concatenate $L_i$ and $H_i$ along the channel dimension to obtain a 6-channel input $X_i$.
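As a minimal illustration of Eq. (1) and the channel concatenation, the following Python sketch processes a single RGB pixel. The function names and the sample pixel and exposure time are assumptions made for illustration; they are not taken from the authors' implementation.

```python
GAMMA = 2.2  # gamma correction parameter stated in the text


def ldr_to_hdr(ldr_pixel, exposure_time, gamma=GAMMA):
    """Map one LDR pixel (channel values in [0, 1]) to the HDR domain, Eq. (1)."""
    return [(c ** gamma) / exposure_time for c in ldr_pixel]


def build_input(ldr_pixel, exposure_time):
    """Concatenate L_i and H_i along the channel axis to form the 6-channel X_i."""
    return list(ldr_pixel) + ldr_to_hdr(ldr_pixel, exposure_time)


# Hypothetical values: an RGB pixel from a frame with exposure time 4.0.
x = build_input([0.5, 0.25, 1.0], exposure_time=4.0)
```

The first three entries of `x` are the original LDR values and the last three are their HDR-domain counterparts, so a fully saturated channel (1.0) maps to `1 / exposure_time`.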

As shown in Figure 2, the entire network consists of the feature extraction network and the feature merging network. Since the input LDR images are captured from different exposure domains, in the feature extraction network we first use three Residual Blocks (RBs) to extract the domain-specific features that are beneficial for detecting moving objects. Each RB consists of two convolution layers and a LeakyReLU activation function. Large motion causes severe misalignment and occluded content in the non-reference images, so the saturated regions of the reference image cannot be filled with reasonable non-reference information, resulting in ghosting artifacts. Here, we utilize stacked TDCAMs to adaptively capture the long-range relevant information, compensate for the moving objects and occluded content, and align the multi-exposed features.

In the under-/over-exposed regions, where information is severely missing, inaccurate information recovery leads to fusion distortions such as blurring and halos in the results. In the feature merging network, we use a series of Residual Dynamic Weight Fusion Blocks (RDWFBs) to fuse the extracted aligned features and obtain the final HDR image. The RDWFB uses the proposed DWFB as its basic unit, which spatially and adaptively selects useful information across frames to restore the content of saturated regions while maintaining the well-exposed regions of the reference image, further alleviating the fusion distortions.

3.1 Transformer Deformation Convolution Alignment Module

When LDR images contain long-distance motions, neighboring information cannot be used to handle misalignment and fill occluded content accurately. Here, we propose the TDCAM to obtain aligned non-reference features under the guidance of the reference features. As shown in Figure 3, the proposed TDCAM uses the TOE to generate precise offsets of relevant features after searching for similar information over the entire feature map. Then, to obtain the aligned features, the deformable convolution extracts the matching information and performs the geometric transformation of the original features based on the predicted offsets.

Transformer Offset Estimator. Considering that LDR images contain both aligned and misaligned regions, the TOE uses local Transformers and global Transformers, respectively, to capture the positions of the most relevant features. Because the final offsets represent the position difference between the extracted features and the generated feature, we add a learnable absolute positional encoding $Pos \in \mathbb{R}^{H \times W \times C}$ to the non-reference features $F_{non} \in \mathbb{R}^{H \times W \times C}$.

For the aligned regions, we use $M$ local Transformer layers to localize the locally relevant features. In each local Transformer layer, we divide the non-reference feature maps into several windows, and the feature similarity is calculated inside each window. Hence, the details in the aligned regions are enhanced by the information within a window. For large-motion regions, we use $N$ global Transformer layers to capture the long-distance relevant features. The global feature similarity is computed so that the layer is aware of global information and can search for the content used to compensate for the misalignment and occluded content.

Figure 4: The extraction locations of proposed TDCAM.

In order to maintain the structural consistency between the non-reference image and the reference image, we search the non-reference image for features similar to the reference image. Hence, in both the local Transformer layer and the global Transformer layer, we use the reference tokens $Z_{ref}$ to generate the query $Q$ and the non-reference tokens $Z_{non}$ to generate the key $K$ and the value $V$:

$$Q = Z_{ref} W_q, \quad K = Z_{non} W_k, \quad V = Z_{non} W_v \tag{2}$$

where $W_q$, $W_k$, and $W_v$ are the weights. The output of a Transformer layer can be calculated by:

$$\begin{aligned} Y' &= MSA(Q, K, V) + Z_{non} \\ Y &= FFN(Y') + Y' \end{aligned} \tag{3}$$

where $MSA$ is the multi-head self-attention and $FFN$ denotes the feed-forward network. Here, the position and similarity of local and global information are captured through the Transformer layers. Finally, a CNN-based block, which consists of several convolution layers and the LeakyReLU activation function, is used to generate the final offsets of similar features.
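The attention step of Eqs. (2) and (3) can be sketched in a few lines of plain Python. This is a simplified, single-head illustration without the FFN; the scaling by the square root of the token dimension is the usual scaled dot-product convention and is an assumption here, as are the helper names and the toy tokens.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(z_ref, z_non, w_q, w_k, w_v):
    """Single-head version of Eqs. (2)-(3) without the FFN: queries come from
    the reference tokens, keys and values from the non-reference tokens."""
    q, k, v = matmul(z_ref, w_q), matmul(z_non, w_k), matmul(z_non, w_v)
    scale = math.sqrt(len(q[0]))  # assumed scaled dot-product attention
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    attn = [softmax(row) for row in scores]
    out = matmul(attn, v)
    # Residual connection with the non-reference tokens, as in Eq. (3).
    y = [[o + z for o, z in zip(orow, zrow)] for orow, zrow in zip(out, z_non)]
    return y, attn


identity = [[1.0, 0.0], [0.0, 1.0]]
z_ref = [[4.0, 0.0], [0.0, 4.0]]   # two reference tokens
z_non = [[0.0, 1.0], [1.0, 0.0]]   # similar content at swapped positions
y, attn = cross_attention(z_ref, z_non, identity, identity, identity)
```

In this toy case the first reference token places most of its attention on the second non-reference token, which is the kind of position information an offset estimator could exploit.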

Deformable Convolution Layer. Based on the offsets $\Delta P$ predicted by the TOE, deformable convolution is used to extract the matching features from the original non-reference feature $F_{non}$:

$$F_{non}^{a}(p) = \sum_{k=1}^{K} m_k \cdot F_{non}\big(p + p_k + \Delta P(p, k)\big) \tag{4}$$

where $F_{non}^{a}$ denotes the aligned non-reference features, and $K$ and $k$ are the number and index of kernel weights. $m_k$, $p$, and $p_k$ denote the $k$-th kernel weight, the center position, and the $k$-th fixed offset, respectively. The original non-reference features are geometrically transformed via convolution extraction, so that the misalignment is removed and the occluded content is compensated. $K$ can also be adjusted to change the number of relevant features captured by the TOE.
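To make Eq. (4) concrete, the sketch below evaluates it at one output position of a single-channel feature map in plain Python. Bilinear interpolation for fractional sampling positions follows common deformable-convolution practice and is an assumption here, as are the 3×3 grid, the helper names, and the toy offsets.

```python
import math


def bilinear(feat, y, x):
    """Sample a 2-D map at fractional (y, x); positions outside the map read as 0."""
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feat[yy][xx]
    return value


# Fixed 3x3 sampling grid p_k (K = 9).
GRID = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def deform_conv_at(feat, p, kernel, offsets):
    """Eq. (4) at one output position p: sum_k m_k * F(p + p_k + dP(p, k))."""
    total = 0.0
    for m_k, (gy, gx), (oy, ox) in zip(kernel, GRID, offsets):
        total += m_k * bilinear(feat, p[0] + gy + oy, p[1] + gx + ox)
    return total


feat = [[float(10 * r + c) for c in range(6)] for r in range(6)]
kernel = [0.0] * 9
kernel[4] = 1.0                      # only the centre tap is active
no_shift = [(0.0, 0.0)] * 9          # behaves like an ordinary convolution
far_shift = [(0.0, 3.0)] * 9         # hypothetical offsets pointing 3 px right
local_value = deform_conv_at(feat, (2, 1), kernel, no_shift)
distant_value = deform_conv_at(feat, (2, 1), kernel, far_shift)
```

With zero offsets the output at position (2, 1) reads the feature at (2, 1); with the shifted offsets the same output position reads the feature at (2, 4), which illustrates how large predicted offsets let a small kernel reach distant content.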

Through the combination of TOE and deformable convolution, the proposed TDCAM can extract long-distance features after being aware of the global information to generate more accurate and robust aligned non-reference features. To further explain the advantages of our proposed TDCAM in resolving ghosting, we visualize its sampling map for alignment. As shown in Figure 4, when LDRs contain large motion (green points), the CNN-based method can only extract local features (red points) for the alignment and cannot recover the occluded regions. Since the non-reference image is not aligned, the saturation regions of the reference image will be filled with moving objects and ghosts will appear in the HDR images. On the contrary, our TDCAM can extract long-range content (red points) in the non-reference image according to the content of the reference image to restore the occluded regions, hence our TDCAM can effectively extract the aligned non-reference features and significantly alleviate the ghosting artifacts.

Figure 5: The design of the proposed Dynamic Weight Fusion Block (DWFB) and Residual Dynamic Weight Fusion Block (RDWFB).

3.2 Dynamic Weight Fusion Block

After the aligned features are extracted, the multi-exposed information must also be fused accurately to obtain a high-quality HDR image. The well-exposed information of the reference image helps to keep the result consistent with the reference, and the complementary information of the high-/low-exposure images is effective in recovering the missing content in the under-/over-exposed regions. Since convolutions share fusion weights across the entire feature map, directly stacked convolutional layers can hardly highlight, spatially and dynamically, the information of the appropriate image for an accurate result. Moreover, traditional convolution pays more attention to the information of the reference image and weakens the complementary information of the non-reference images, resulting in fusion distortions. To effectively fuse the multi-exposed features, we propose the DWFB, which spatially selects the useful information across frames, as shown in Figure 5(a).

Specifically, given the input feature $x \in \mathbb{R}^{H \times W \times D}$, we first use a spatial attention mechanism to generate a dynamic weight $\Delta w$ based on the content of the input multi-exposure features:

$$\Delta w = \phi(x) \tag{5}$$

where $\phi$ is the attention block. The attention block generates the dynamic weight by sensing the exposure levels of multiple frames in different areas. The dynamic weight $\Delta w$ adaptively selects the information of different frames: in the well-exposed regions of the reference image, the information of the reference image is selected, and in the badly exposed regions of the reference image, the corresponding supplementary content of the non-reference images is selected. Then, we use the weight $\Delta w$, which is shared in the spatial dimension and fully connected in the channel dimension, to fuse inter-channel information and merge the content of the multi-exposed images.
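The idea of content-dependent selection can be illustrated with a toy per-pixel sketch. This is only one possible reading of the description above, not the authors' DWFB: the attention block is replaced by a hypothetical per-channel sigmoid gate, the channel mixing by a fixed matrix shared across positions, and all weights are hand-picked for the example.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dynamic_weight(pixel, attn_w, attn_b):
    """Hypothetical attention block phi: one gate in (0, 1) per channel,
    computed from the content of the current pixel."""
    return [sigmoid(sum(a * c for a, c in zip(row, pixel)) + b)
            for row, b in zip(attn_w, attn_b)]


def fuse_pixel(pixel, attn_w, attn_b, mix_w):
    """Gate the channels with the dynamic weight, then mix them with a matrix
    shared by all spatial positions (equivalent to a 1x1 convolution)."""
    gate = dynamic_weight(pixel, attn_w, attn_b)
    gated = [g * c for g, c in zip(gate, pixel)]
    return [sum(m * c for m, c in zip(row, gated)) for row in mix_w]


# Channel 0 stands in for the reference feature, channel 1 for a non-reference one.
# Hand-picked weights: the gates switch when the reference value approaches 0.9.
ATTN_W = [[-40.0, 0.0], [40.0, 0.0]]
ATTN_B = [36.0, -36.0]
MIX_W = [[1.0, 1.0]]                 # one output channel summing both inputs

well_exposed = fuse_pixel([0.4, 0.7], ATTN_W, ATTN_B, MIX_W)
saturated = fuse_pixel([1.0, 0.7], ATTN_W, ATTN_B, MIX_W)
```

For the well-exposed pixel the output stays close to the reference value 0.4, whereas for the saturated pixel it moves close to the non-reference value 0.7, which mirrors the selection behaviour described for $\Delta w$.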

| Method | PSNR-$\mu$ | PSNR-$l$ | SSIM-$\mu$ | SSIM-$l$ | HDR-VDP-2 |
|---|---|---|---|---|---|
| Kalantari et al. (2017) | 42.67 | 41.23 | 0.9888 | 0.9846 | 65.05 |
| DeepHDR Wu et al. (2018) | 41.65 | 40.88 | 0.9860 | 0.9858 | 64.90 |
| AHDRNet Yan et al. (2019) | 43.63 | 41.14 | 0.9900 | 0.9702 | 64.61 |
| NHDRNet Yan et al. (2020) | 42.41 | 41.43 | 0.9877 | 0.9857 | 61.21 |
| HDR-GAN Niu et al. (2021) | 43.92 | 41.57 | 0.9905 | 0.9865 | 65.45 |
| APNT-Fusion Chen et al. (2022) | 43.96 | 41.69 | 0.9898 | **0.9914** | - |
| Liu et al. (2022) | 44.32 | 42.18 | 0.9916 | 0.9884 | 66.03 |
| HDRTransDC (Ours) | **44.66** | **42.51** | **0.9917** | 0.9892 | **66.10** |

Table 1: Quantitative comparison between previous works and the proposed method. - indicates that the results can’t be obtained because they are not reported and the code is not available. Bold indicates the best performance.

As shown in Figure 5(b), we build the RDWFB on top of the DWFB and use it to form the HDR merging network, which fuses the aligned multi-exposure features into the final result. The residual structure of the RDWFB helps to preserve image information, which is beneficial for reconstructing accurate HDR images.

3.3 Training Loss

Figure 6: A sample in Kalantari’s dataset. Visual comparison of the proposed method and existing methods on Kalantari’s testset. The arrow highlights the difference between the results generated by the methods.

In the HDR task, we jointly use the $l_1$ loss and a gradient loss to reconstruct satisfactory results. Given the predicted HDR image $H$ and the ground-truth HDR image $H_{gt}$, the $l_1$ loss is computed as in previous methods Yan et al. (2019); Niu et al. (2021):

$$\mathcal{L}_{l1} = \|\Gamma(H) - \Gamma(H_{gt})\|_1 \tag{6}$$

where $\Gamma(\cdot)$ stands for tonemapping based on the $\mu$-law Kalantari et al. (2017); Yan et al. (2019); Niu et al. (2021). Considering that the $l_1$ loss may damage high-frequency content, such as textures and edges, in the reconstructed HDR images Xu et al. (2020a), we add a gradient loss based on the Sobel operator to complement the $l_1$ loss and retain the high-frequency content:

$$\mathcal{L}_{grad} = \|\nabla \Gamma(H) - \nabla \Gamma(H_{gt})\|_1 \tag{7}$$

where $\nabla$ denotes the Sobel operator. Finally, the total loss $\mathcal{L}$ is defined as:

$$\mathcal{L} = \mathcal{L}_{l1} + \lambda_g \mathcal{L}_{grad} \tag{8}$$

where $\lambda_g$ is the trade-off coefficient, which we set to 0.6.
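Eqs. (6) to (8) can be sketched for a single-channel image in plain Python as follows. Several details are assumptions rather than statements from the text: the $\mu$-law is written as $\Gamma(H) = \log(1 + \mu H) / \log(1 + \mu)$ with $\mu = 5000$, the norms are averaged over pixels, and the horizontal and vertical Sobel responses are compared separately and summed.

```python
import math

MU = 5000.0       # assumed compression parameter; not stated in this section
LAMBDA_G = 0.6    # trade-off coefficient from the text


def tonemap(h):
    """mu-law tonemapping applied to every pixel of a 2-D image."""
    return [[math.log(1.0 + MU * v) / math.log(1.0 + MU) for v in row] for row in h]


def l1(a, b):
    """Mean absolute difference between two 2-D arrays of equal size."""
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel(img, kernel):
    """Valid-mode 3x3 filtering (no padding) with a Sobel kernel."""
    h, w = len(img), len(img[0])
    return [[sum(kernel[i][j] * img[r + i][c + j] for i in range(3) for j in range(3))
             for c in range(w - 2)] for r in range(h - 2)]


def total_loss(pred, gt):
    """Eq. (8): tonemapped L1 loss plus weighted Sobel gradient loss."""
    tp, tg = tonemap(pred), tonemap(gt)
    loss_l1 = l1(tp, tg)
    loss_grad = (l1(sobel(tp, SOBEL_X), sobel(tg, SOBEL_X))
                 + l1(sobel(tp, SOBEL_Y), sobel(tg, SOBEL_Y)))
    return loss_l1 + LAMBDA_G * loss_grad


# Toy 4x4 images with values in [0, 1].
gt = [[0.1 * (r + c) for c in range(4)] for r in range(4)]
pred = [[v * 0.9 for v in row] for row in gt]
loss = total_loss(pred, gt)
```

Because both terms are computed after tonemapping, errors in dark regions contribute more than they would in the linear domain, which matches the motivation for using the $\mu$-law in the loss.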

Based on the total loss, the proposed model is optimized to focus on both overall content and high-frequency regions, further promoting the generation of high-quality images with rich details.

Figure 7: Another sample in Kalantari’s dataset. Visual comparison of the proposed method and existing methods on Kalantari’s testset. The arrow highlights the difference between the results generated by the methods.

4 Experiments

4.1 Implementation Details

We implement our network in PyTorch. For training, we use the Adam optimizer with $\beta_1$ and $\beta_2$ set to 0.5 and 0.999, respectively. The learning rate is initialized to $1 \times 10^{-4}$ and gradually decreased to 0. We train our network for 6000 epochs with a batch size of 2. From the training images, we randomly crop $256 \times 256$ patches and flip and rotate them for augmentation. The whole training is performed on an NVIDIA RTX 3090 GPU with 24 GB of memory.

4.2 Datasets and Metrics

Datasets. To evaluate our method, we select Kalantari’s dataset Kalantari et al. (2017), which consists of 74 training scenes and 15 test scenes. Each scene contains three LDR images with exposure $\{-2, 0, +2\}$ or $\{-3, 0, +3\}$ and an HDR ground truth. The resolution of the images is $1000 \times 1500$. Flipping and rotation are applied to the cropped patches during training. We conduct quantitative and qualitative evaluation experiments on Kalantari’s dataset. To verify the generalization of our method to other datasets, we also perform a qualitative evaluation on Tursun’s Tursun et al. (2016) and Sen’s Sen et al. (2012) datasets, which have no ground truth.

Metrics. Following previous works Yan et al. (2019); Niu et al. (2021); Chen et al. (2022), we compute the PSNR and SSIM values in both the linear and tonemapped (using the $\mu$-law) domains to compare these methods, denoted by PSNR-$l$, PSNR-$\mu$, SSIM-$l$, and SSIM-$\mu$. To assess the visibility and quality of HDR images, we also apply HDR-VDP-2 Mantiuk et al. (2011) as a metric.
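The difference between PSNR-$l$ and PSNR-$\mu$ can be sketched as follows. The sketch assumes HDR values normalized to $[0, 1]$ (peak value 1) and the common $\mu$-law form with $\mu = 5000$; both are assumptions, and the function names are illustrative only.

```python
import math

MU = 5000.0  # assumed mu-law parameter; the text only names the mu-law


def mu_law(v):
    return math.log(1.0 + MU * v) / math.log(1.0 + MU)


def psnr(pred, gt, peak=1.0):
    """PSNR in dB between two flat lists of pixel values."""
    mse = sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def psnr_l(pred, gt):
    """PSNR in the linear HDR domain."""
    return psnr(pred, gt)


def psnr_mu(pred, gt):
    """PSNR after mu-law tonemapping of both images."""
    return psnr([mu_law(p) for p in pred], [mu_law(g) for g in gt])


# Hypothetical dark pixels: small linear errors become large after tonemapping.
pred, gt = [0.01, 0.02], [0.02, 0.01]
linear_db, tonemapped_db = psnr_l(pred, gt), psnr_mu(pred, gt)
```

For dark pixels such as these, the tonemapped score is much lower than the linear one, which is why the two domains are reported separately.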

Figure 8: The example from Tursun’s dataset.

4.3 Comparison With the SOTA methods

Comparison on Kalantari’s dataset. We conduct a quantitative comparison on the test data of Kalantari’s dataset. The results are listed in Table 1. Our method outperforms the existing method of Liu et al. (2022) by 0.34 dB, 0.33 dB, and 0.07 in terms of PSNR-$\mu$, PSNR-$l$, and HDR-VDP-2, respectively. Our best performance on PSNR-$l$ and PSNR-$\mu$ indicates that the proposed method reconstructs results with the highest fidelity in both the HDR domain and the tonemapped domain. Furthermore, the other metrics demonstrate the reconstruction ability of our method in terms of structure and visual quality.

The qualitative comparison on some challenging scenes with saturated backgrounds and foreground motions is shown in Figures 6 and 7. The LDR patches contain large motions of the arm or body, resulting in severe misalignment between the LDR images and occluded content in the non-reference images. As shown in Figures 6(a)(b) and 7(a)(b), the non-end-to-end methods Kalantari et al. (2017); Wu et al. (2018) still suffer from ghosts since they use unreliable optical flow or homographies to explicitly align the large motion. AHDRNet Yan et al. (2019) and Liu’s method Liu et al. (2022) use attention to suppress the motion regions. HDR-GAN Niu et al. (2021) utilizes a CNN-based residual block to align the moving objects. However, due to the local extraction of their alignment modules, these methods still produce severe ghosting artifacts, as shown in Figures 6(c)-(e) and 7(c)-(e). In contrast, our method can extract long-distance information to align the moving objects and recover the occluded regions. That is, for the regions covered by the arm, marked by the red blocks in Figures 6 and 7, our method can effectively extract the information of walls or cars to fill in the occluded content. Hence, the ghosts are effectively removed from the results, and more accurate and robust HDR images are reconstructed.

Comparison on datasets w/o GT To verify the generalization of the proposed method, we also conduct qualitative experiments on Tursun’s and Sen’s datasets. Because these datasets provide no ground truth, we report only visual comparisons. As shown in Figure 8, the red blocks indicate large motion regions, and the carpet is occluded by trains in the non-reference images, making it difficult to recover the information of over-exposed regions in the reference image. The previous methods cannot accurately fill the occluded content and suffer from distortions and ghosts in the HDR images. The proposed method handles the problems caused by the large motion and generates aligned features, thus reconstructing high-quality ghost-free results. In Figure 9, the boy’s hand occludes a region with textures like tree branches; previous methods produce artifacts and lose some texture content. Our method effectively mitigates ghosting artifacts and hallucinates complete textures.

Figure 9: The example from Sen’s dataset.
| Method | Time (s) |
| --- | --- |
| DeepHDR Wu et al. (2018) | 0.140 |
| AHDRNet Yan et al. (2019) | 0.003 |
| NHDRNet Yan et al. (2020) | 0.157 |
| HDR-GAN Niu et al. (2021) | 0.139 |
| Liu Liu et al. (2022) | 3.288 |
| Ours | 1.503 |

Table 2: Average running time and performance for different methods.

Analysis of Computational Budgets We also compare the inference time of our proposed method with that of the state-of-the-art methods. Specifically, the inference time is the average time needed to generate 15 images with a resolution of $1500\times 1000$ from Kalantari’s dataset on a GPU. Note that we do not count the time spent on homographies in DeepHDR Wu et al. (2018), because this model assumes that the input LDRs are pre-aligned, and homographies are usually computed on the CPU and are very time-consuming. As shown in Table 2, our model takes 1.503 s per image, less than half of the 3.288 s required by Liu’s method Liu et al. (2022), although it is slower than the other compared methods; we consider this inference time acceptable.
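The measurement protocol described above can be sketched as follows. This is a minimal sketch under stated assumptions: `run_model`, `warmup`, and `synchronize` are hypothetical names introduced for illustration, and the optional `synchronize` hook stands in for whatever GPU barrier the framework offers (for example `torch.cuda.synchronize` in PyTorch), since GPU kernels are typically launched asynchronously.

```python
import time


def average_inference_time(run_model, inputs, warmup=1, synchronize=None):
    """Return the mean wall-clock time (seconds) of run_model over inputs.

    run_model:   callable taking one input sample (hypothetical stand-in
                 for a network forward pass).
    warmup:      number of untimed calls made first, so one-off setup
                 costs are excluded from the average.
    synchronize: optional callable that blocks until queued GPU work has
                 finished; without it, asynchronous launches would make
                 the timing too optimistic.
    """
    if len(inputs) == 0:
        raise ValueError("inputs must contain at least one sample")

    # Untimed warm-up passes.
    for sample in inputs[:warmup]:
        run_model(sample)
    if synchronize is not None:
        synchronize()

    # Timed passes over every sample.
    start = time.perf_counter()
    for sample in inputs:
        run_model(sample)
    if synchronize is not None:
        synchronize()
    elapsed = time.perf_counter() - start

    return elapsed / len(inputs)
```

With the 15 test scenes as `inputs`, the returned value corresponds to one entry in the time column of Table 2.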

| Method | PSNR-$\mu$ | PSNR-$l$ | HDR-VDP-2 |
| --- | --- | --- | --- |
| (1) w/o TDCAM | 44.30 | 42.26 | 65.60 |
| (2) w/o DWFB | 44.02 | 42.15 | 65.07 |
| (3) w/o $\mathcal{L}_{grad}$ | 44.54 | 42.21 | 65.98 |
| (4) HDRTransDC | 44.66 | 42.51 | 66.10 |

Table 3: The quantitative results of ablation study.

4.4 Ablation Study

We conduct qualitative and quantitative ablation experiments on Kalantari’s dataset to evaluate the effectiveness of each component in the proposed HDRTransDC.

Ablation Analysis of Network Architecture In experiment (1), to verify the effectiveness of the proposed alignment module, we replace the proposed TDCAM with a deformable convolution layer whose offsets are estimated by a CNN. As shown in Table 3, comparing experiments (1) and (4), the TDCAM improves PSNR-$\mu$ by 0.36 dB, PSNR-$l$ by 0.25 dB, and HDR-VDP-2 by 0.50. The reason is that it can extract distant similar features to align the multi-exposure features. The qualitative results in Figure 10 (a) demonstrate that the TDCAM effectively alleviates ghosting artifacts and significantly improves the quality of the reconstructed image. In experiment (2), we remove the DWFB and use traditional convolution layers to merge the features. Comparing experiments (2) and (4), the DWFB improves PSNR-$\mu$ by 0.64 dB, PSNR-$l$ by 0.36 dB, and HDR-VDP-2 by 1.03, since it can perceive the exposure levels of different regions and adaptively select useful information from different frames. As shown in Figure 10 (b), the proposed DWFB further mitigates fusion distortions and recovers missing content, so that more satisfactory results are reconstructed.

Ablation Analysis of Loss We also conduct an ablation experiment to verify the effect of the gradient loss. As shown in Table 3, comparing experiments (3) and (4), the gradient loss improves PSNR-$\mu$ by 0.12 dB, PSNR-$l$ by 0.30 dB, and HDR-VDP-2 by 0.12, since this loss directs more attention to high-frequency regions to retain textures and edges. The qualitative results in Figure 10 (c) show that textures and details are blurred if the gradient loss is removed, confirming the effectiveness of this loss.
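To illustrate why a gradient term emphasizes edges and textures, the following is a minimal sketch of one common form of gradient loss. It is an assumption made for illustration only: the $L_1$ distance between forward-difference image gradients is used here, which is not necessarily the exact formulation of $\mathcal{L}_{grad}$ in this paper, and the function names are hypothetical.

```python
import numpy as np


def image_gradients(img):
    """Forward differences of an (H, W, C) image along height and width."""
    dy = img[1:, :, :] - img[:-1, :, :]
    dx = img[:, 1:, :] - img[:, :-1, :]
    return dy, dx


def gradient_loss(pred, target):
    """Assumed form: mean absolute difference between image gradients.

    Flat regions contribute nothing, so the loss is driven by edges and
    textures, i.e. the high-frequency content of the images.
    """
    pred_dy, pred_dx = image_gradients(pred)
    tgt_dy, tgt_dx = image_gradients(target)
    return np.mean(np.abs(pred_dy - tgt_dy)) + np.mean(np.abs(pred_dx - tgt_dx))
```

A uniform brightness offset leaves this loss at zero, whereas a blurred edge is penalized, which is consistent with the observation that removing the term leads to blurrier textures.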

5 Conclusion

This paper proposes the HDRTransDC network, which establishes the complicated alignment and fusion relationship between reference and non-reference images to reconstruct high-quality HDR images. Specifically, we propose the TDCAM to extract long-range relevant information that compensates for misaligned components and occluded content, thereby removing the ghosting artifacts caused by large motion and saturation. In addition, we propose the DWFB to adaptively select useful information across frames and effectively fuse multi-exposed features, which reduces fusion distortions. Extensive comparative experiments verify that the proposed method achieves state-of-the-art performance and generates artifact-free HDR images with realistic details. Ablation experiments further verify the effectiveness of each component.

Figure 10: The qualitative results of our ablation study.

References

  • Bogoni [2000] Luca Bogoni. Extending dynamic range of monochrome and color images through fusion. In Proceedings 15th International Conference on Pattern Recognition. ICPR-2000, volume 3, pages 7–12. IEEE, 2000.
  • Chen et al. [2022] Jie Chen, Zaifeng Yang, Tsz Nam Chan, Hui Li, Junhui Hou, and Lap-Pui Chau. Attention-guided progressive neural texture fusion for high dynamic range image restoration. IEEE Transactions on Image Processing, 31:2661–2672, 2022.
  • Granados et al. [2010] Miguel Granados, Boris Ajdin, Michael Wand, Christian Theobalt, Hans-Peter Seidel, and Hendrik PA Lensch. Optimal hdr reconstruction with linear digital cameras. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 215–222. IEEE, 2010.
  • Debevec and Malik [2008] Paul E Debevec and Jitendra Malik. Recovering high dynamic range radiance maps from photographs. In ACM SIGGRAPH 2008 classes, pages 1–10. 2008.
  • Granados et al. [2013] Miguel Granados, Kwang In Kim, James Tompkin, and Christian Theobalt. Automatic noise modeling for ghost-free hdr reconstruction. ACM Transactions on Graphics (TOG), 32(6):1–10, 2013.
  • Kalantari et al. [2017] Nima Khademi Kalantari, Ravi Ramamoorthi, et al. Deep high dynamic range imaging of dynamic scenes. ACM Trans. Graph., 36(4):144–1, 2017.
  • Kang et al. [2003] Sing Bing Kang, Matthew Uyttendaele, Simon Winder, and Richard Szeliski. High dynamic range video. ACM Transactions on Graphics (TOG), 22(3):319–325, 2003.
  • Lee et al. [2014] Chul Lee, Yuelong Li, and Vishal Monga. Ghost-free high dynamic range imaging via rank minimization. IEEE signal processing letters, 21(9):1045–1049, 2014.
  • Liu et al. [2022] Zhen Liu, Yinglong Wang, Bing Zeng, and Shuaicheng Liu. Ghost-free high dynamic range imaging with context-aware transformer. arXiv preprint arXiv:2208.05114, 2022.
  • Ma et al. [2017] Kede Ma, Hui Li, Hongwei Yong, Zhou Wang, Deyu Meng, and Lei Zhang. Robust multi-exposure image fusion: a structural patch decomposition approach. IEEE Transactions on Image Processing, 26(5):2519–2532, 2017.
  • Mantiuk et al. [2011] Rafał Mantiuk, Kil Joong Kim, Allan G Rempel, and Wolfgang Heidrich. Hdr-vdp-2: A calibrated visual metric for visibility and quality predictions in all luminance conditions. ACM Transactions on graphics (TOG), 30(4):1–14, 2011.
  • Mertens et al. [2007] Tom Mertens, Jan Kautz, and Frank Van Reeth. Exposure fusion. In 15th Pacific Conference on Computer Graphics and Applications (PG’07), pages 382–390. IEEE, 2007.
  • Myszkowski et al. [2008] Karol Myszkowski, Rafal Mantiuk, and Grzegorz Krawczyk. High dynamic range video. Synthesis Lectures on Computer Graphics and Animation, 1(1):1–158, 2008.
  • Niu et al. [2021] Yuzhen Niu, Jianbin Wu, Wenxi Liu, Wenzhong Guo, and Rynson WH Lau. Hdr-gan: Hdr image reconstruction from multi-exposed ldr images with large motions. IEEE Transactions on Image Processing, 30:3885–3896, 2021.
  • Oh et al. [2014] Tae-Hyun Oh, Joon-Young Lee, Yu-Wing Tai, and In So Kweon. Robust high dynamic range imaging by rank minimization. IEEE transactions on pattern analysis and machine intelligence, 37(6):1219–1232, 2014.
  • Prabhakar et al. [2020] K Ram Prabhakar, Susmit Agrawal, Durgesh Kumar Singh, Balraj Ashwath, and R Venkatesh Babu. Towards practical and efficient high-resolution hdr deghosting with cnn. In European Conference on Computer Vision, pages 497–513. Springer, 2020.
  • Reinhard et al. [2010] Erik Reinhard, Wolfgang Heidrich, Paul Debevec, Sumanta Pattanaik, Greg Ward, and Karol Myszkowski. High dynamic range imaging: acquisition, display, and image-based lighting. Morgan Kaufmann, 2010.
  • Sen et al. [2012] Pradeep Sen, Nima Khademi Kalantari, Maziar Yaesoubi, Soheil Darabi, Dan B Goldman, and Eli Shechtman. Robust patch-based hdr reconstruction of dynamic scenes. ACM Trans. Graph., 31(6):203–1, 2012.
  • Tursun et al. [2016] Okan Tarhan Tursun, Ahmet Oğuz Akyüz, Aykut Erdem, and Erkut Erdem. An objective deghosting quality metric for hdr images. In Computer Graphics Forum, volume 35, pages 139–152. Wiley Online Library, 2016.
  • Wu et al. [2018] Shangzhe Wu, Jiarui Xu, Yu-Wing Tai, and Chi-Keung Tang. Deep high dynamic range imaging with large foreground motions. In Proceedings of the European Conference on Computer Vision (ECCV), pages 117–132, 2018.
  • Xu et al. [2020a] H. Xu, J. Ma, and X. P. Zhang. Mef-gan: Multi-exposure image fusion via generative adversarial networks. IEEE Transactions on Image Processing, PP(99):1–1, 2020.
  • Yan et al. [2019] Qingsen Yan, Dong Gong, Qinfeng Shi, Anton van den Hengel, Chunhua Shen, Ian Reid, and Yanning Zhang. Attention-guided network for ghost-free high dynamic range imaging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1751–1760, 2019.
  • Yan et al. [2020] Qingsen Yan, Lei Zhang, Yu Liu, Yu Zhu, **qiu Sun, Qinfeng Shi, and Yanning Zhang. Deep hdr imaging via a non-local network. IEEE Transactions on Image Processing, 29:4308–4322, 2020.
  • Zhang and Cham [2011] Wei Zhang and Wai-Kuen Cham. Gradient-directed multiexposure composition. IEEE Transactions on Image Processing, 21(4):2318–2323, 2011.
  • Zimmer et al. [2011] Henning Zimmer, Andrés Bruhn, and Joachim Weickert. Freehand hdr imaging of moving scenes with simultaneous resolution enhancement. In Computer Graphics Forum, volume 30, pages 405–414. Wiley Online Library, 2011.