Benchmark Evaluation of Image Fusion algorithms for Smartphone Camera Capture

Lucas Nedel Kirsten
Motorola Mobility Comércio de Produtos Eletrônicos Ltda, Jaguariúna, SP 13918-900 BR

Abstract: This paper investigates the trade-off between computational resource utilization and image quality in the context of image fusion techniques for smartphone camera capture. The study explores various combinations of fusion methods, fusion weights, number of frames, and stacking (a.k.a. merging) techniques using a proprietary dataset of images captured with Motorola smartphones. The objective was to identify optimal configurations that balance computational efficiency with image quality. Our results indicate that multi-scale methods and their single-scale fusion counterparts return similar image quality measures and runtime, but single-scale ones have lower memory usage. Furthermore, we identified that fusion methods operating in the YUV color space yield better performance in terms of image quality, resource utilization, and runtime. The study also shows that fusion weights have an overall small impact on image quality, runtime, and memory. Moreover, our results reveal that increasing the number of highly exposed input frames does not necessarily improve image quality and comes with a corresponding increase in computational resource usage and runtime, and that stacking methods, although reducing memory usage, may compromise image quality. Finally, our work underscores the importance of thoughtful configuration selection for image fusion techniques in constrained environments and offers insights for future image fusion method development, particularly in the realm of smartphone applications.

Index Terms— Image Fusion, Smartphone image capture.

I INTRODUCTION

Photographs captured in environments with highly imbalanced lighting usually have poor quality, mainly due to the presence of under- and over-saturated regions. This is caused by the limited dynamic range of digital cameras, i.e., the ratio of the highest brightness to the lowest brightness in a scene mertens. For most cameras, this dynamic range spans $2\!\sim\!3$ orders of magnitude, while real scenes can span $10$ orders of magnitude or more fyuv. Acquiring multiple exposure frames of the same scene allows capturing the high dynamic range; however, upon display, the intensities need to be remapped to match the device’s range through tone mapping reinhard2010high or with exposure fusion methods xu2022multi; survey_class; survey_deep.

Exposure fusion methods rely on computing perceptual quality measures (a.k.a. weight maps) for each pixel in the multi-exposure low-dynamic-range (LDR) sequence of frames, and then selecting the “good” pixels using some weighted merge-like operation (e.g., mean or median) mertens; survey_class, which suggests that the choice of the weight maps should have a high impact on the quality of the resulting image. Exposure fusion methods have several advantages compared to other methods (e.g., tone mapping): the pipeline is simpler (since no intermediate HDR image needs to be computed fattal2023gradient), no camera calibration is required, and images captured with external assistance (e.g., with flash) can even be used to improve the results mertens.

Most popular fusion methods compute the final “merged” image using Laplacian pyramids (a.k.a. multi-scale image fusion methods), in which a sequence of sub-images is generated to produce more natural transitions at the edges of the merged image mertens; fyuv. However, the multiple operations required for computing the pyramid sub-images can introduce a large computational overhead depending on the number and size of the input images, especially in hardware-constrained applications (e.g., smartphones). One workaround is to combine images with close Exposure Value (EV) using some reduction function (e.g., mean or median), a process known as image stacking or image merging, to reduce the number of input images survey_class; xu2022multi. Single-scale image fusion methods have also been proposed to alleviate the pyramid computational overhead by approximating the overall effect of the Laplacian pyramids ssf_rgb. However, as observed in fyuv; survey_class, single-scale methods usually produce worse image quality due to the large gray-level differences between the images to be fused, which create obvious seams. Yet, is the overhead introduced by multi-scale methods justified by the quality of the merged image when compared to single-scale methods? What is the individual impact of the weight maps on the final image? What is the trade-off between computational resource usage and image quality when using more input frames? And do stacking methods help reduce the required computational resources without losing image quality?

To answer these questions, we investigate the individual and joint impacts on computational resource usage, runtime, and image quality of the following variables: fusion method, weights of the fusion method, stacking method, and number of frames used. The main objective of this work is to determine the trade-offs among these variables and to establish the best combinations to be employed in hardware-constrained environments, such as smartphones. We evaluated multiple combinations of these variables on a proprietary dataset composed of Motorola smartphone pictures collected in diverse environments and under various lighting conditions. Our evaluation protocol used standard literature metrics for image quality and correlated them with the computational resources and runtime used by each variable combination. Moreover, although the current literature has mainly focused on deep learning-based approaches survey_deep; xu2022multi, we emphasize that our findings can be directly used for such methods, since some of the evaluated variables (such as the number of frames to be used and the stacking method) are also important hyper-parameters in such learning algorithms.

II RELATED WORK

Multi-scale image fusion methods refer to approaches that employ a sequence of Laplacian pyramids to decompose the details of the input frames (and weight maps) into “sub-images” (i.e., downsized versions of the input frames), to eliminate exposure differences and make the local transitions more natural fyuv. The pioneering Merge Mertens mertens algorithm is based on computing three quality measures related to the saturation, contrast, and exposure levels of the input frames, and then applying the Laplacian pyramid sequence to both the input frames and the weight maps. The resulting sub-images and maps are then merged in a weighted-average fashion.
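As a rough illustration of the multi-scale idea described above, the following NumPy sketch blends Laplacian pyramids of the frames with Gaussian pyramids of normalized weight maps. It is a simplified stand-in and not the Merge Mertens implementation: it assumes grayscale float frames whose sides are divisible by 2^(levels-1), it replaces the usual Gaussian filtering with 2x2 block averaging and pixel repetition, and all function names are hypothetical.

```python
import numpy as np

def downsample(img):
    """Halve the resolution by averaging 2x2 blocks (a box filter stands in for a Gaussian)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    """Double the resolution by pixel repetition."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def gaussian_pyramid(img, levels):
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(downsample(pyr[-1]))
    return pyr

def laplacian_pyramid(img, levels):
    gauss = gaussian_pyramid(img, levels)
    # Each band stores the detail lost when moving one level down.
    bands = [g - upsample(g_next) for g, g_next in zip(gauss[:-1], gauss[1:])]
    return bands + [gauss[-1]]  # the coarsest level is kept as-is

def collapse(pyr):
    """Rebuild an image from its Laplacian pyramid, coarsest level first."""
    out = pyr[-1]
    for band in reversed(pyr[:-1]):
        out = upsample(out) + band
    return out

def multiscale_fuse(frames, weights, levels=3):
    """Blend frames band by band; frames and weights are lists of 2-D float arrays."""
    weights = np.stack(weights)
    weights = weights / (weights.sum(axis=0) + 1e-12)  # normalize across frames
    fused = None
    for frame, w in zip(frames, weights):
        lap = laplacian_pyramid(frame, levels)
        gw = gaussian_pyramid(w, levels)
        contrib = [band * wl for band, wl in zip(lap, gw)]
        fused = contrib if fused is None else [f + c for f, c in zip(fused, contrib)]
    return collapse(fused)
```

Every frame contributes one pyramid of sub-images plus one pyramid of weights, which is where the memory and runtime overhead discussed in this section comes from.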

The Merge Mertens algorithm operates on the RGB color channels, which can introduce a large overhead, especially for hardware-constrained applications (e.g., smartphones), due to the multiple operations required for computing the pyramid sub-images on all 3 input channels of each frame. To improve Merge Mertens performance, Liu et al. fyuv proposed the Fast YUV method, based on the YUV color space. The authors propose computing the pyramid sub-images using only the Y channel (hence reducing the pyramid computational effort to roughly 1/3 of that of the standard Merge Mertens), using 2 image quality measures (Merge Mertens employs 3), and fusing the UV channels using a maximum reduction operation over the frame sequence.

Both Merge Mertens and Fast YUV depend on computing multiple sub-images through a sequence of Laplacian pyramids. Depending on the size and number of frames in the sequence, this process can demand a large amount of computational resources, impacting both memory usage and runtime. A simple alternative would be to perform all computations at the original image resolution. However, as noted in fyuv, due to the large gray-level differences between the images to be fused, this usually leads to unnatural transitions at the edges, creating obvious seams. Nevertheless, Ancuti et al. ssf_rgb proposed a single-scale image fusion method (namely, SSF) based on approximating the operations in the pyramid step of the standard Merge Mertens algorithm. Although their method did not generally outperform the reference multi-scale fusion method, it considerably reduced both the edge transition issues referred to in fyuv and the computational resource requirements.

More recently, methods based on deep learning have also been explored for image fusion 10288540; survey_deep; xu2022multi. The main advantage of such techniques is their capability to learn from examples, so there is no need to manually define the image quality measures for the weight maps. However, such methods are designed to work with specific input setups (e.g., they require the same number of frames with defined EV to work properly), which is usually not the case in real applications. For instance, the frames’ EV can depend on the detected scene illuminance, and the number of frames can depend on predefined computational restrictions for capture, or on the output of some “frame selector” algorithm (e.g., one that eliminates poor quality frames). Moreover, in the scope of this work, evaluating learning-based methods would require a full retraining step for each considered variable combination to establish a fair baseline comparison, making it infeasible in terms of time (see Section III-C for more details). Nevertheless, our findings can be directly used to define better setups for such methods (e.g., number of frames, stacking method to be used), as we discuss in more detail in Sections IV and V.

III EXPERIMENTAL SETUP

In this work, we investigate the trade-off between computational resources (i.e., runtime and memory usage) and image quality measures for combinations of the following variables when fusing LDR images: fusion method, weights of the fusion method, number of frames used, and stacking method. We employed a proprietary dataset composed of Motorola smartphone pictures collected using the front camera in diverse environments and under various lighting conditions. We emphasize that the usual complete process of mobile image capture relies on more steps than the ones considered in this work, such as frame selection for removing low-quality frames, frame alignment, and image enhancement (e.g., color correction and contrast improvement). However, considering all these steps as variables would be infeasible due to the huge number of possible combinations to be evaluated. Hence, we limit our experiments to the aforementioned variables, which have a higher impact on the fusion process itself survey_class; xu2022multi; survey_deep. We proceed to provide details regarding our complete experimental setup.

III-A DATASET

We employed a proprietary dataset composed of 480 images captured with the 8-megapixel front camera of a Motorola smartphone. The images were collected in several different indoor and outdoor scenes with environment illuminance ranging from 0.15 lux up to 20.2 lux. Six frames were captured for each scene with EV values: -24, 0, 1, 2, 3, and 4. The frames were captured in order from the lowest (EV -24) to the highest (EV 4). The ground-truth (GT) images were manually generated using image processing software to obtain the “best quality” merged image. The 6 captured frames plus a very long exposed frame (which is not available during the usual device’s camera capture, and was used for color improvement and correction) were accessible to the annotators. However, image quality assessment is generally a subjective task lpips (i.e., different annotators will usually provide distinct GT images for the same set of frames), and hence, the following protocol was established to standardize the development of the GT images:

• Regarding brightness and color, the GT should be comparable to the EV 3 frame;
• Regarding clarity (details), the GT should be comparable to the EV 0 and EV 3 frames;
• Artifacts, such as noise, must be removed in the GT;
• Over- and underexposed areas must be corrected in the GT.

We provide examples of the dataset in Figure 1.

Figure 1: Example of image frames for three scenes with different illuminance levels (value on the left) from the employed dataset.

III-B VARIABLE COMBINATIONS

Each of our dataset images was evaluated using all possible combinations of the following variables: fusion method, weights of the fusion method, number of frames used, and stacking method (54,850 total tested combinations). Table I summarizes the space of values for each of the considered variables, where one “variable combination” refers to choosing one row value for each column of this table. We proceed to provide details regarding each of the considered variables.

TABLE I: Variables space of values summary. Each variable combination refers to choosing one row value for each column. $\mathbf{\mathcal{W}}_{C}$ refers to Contrast weight, $\mathbf{\mathcal{W}}_{E}$ to Exposure weight, and $\mathbf{\mathcal{W}}_{S}$ to Saturation weight.

Fusion method	Fusion weights	EV $\geq 0$ Frames	Stacking method
Merge Mertens	$\mathbf{\mathcal{W}}_{C}$ , $\mathbf{\mathcal{W}}_{E}$ , $\mathbf{\mathcal{W}}_{S}$ ^∗	1 to 5	Mean^†
Fast YUV	$\mathbf{\mathcal{W}}_{C}$ , $\mathbf{\mathcal{W}}_{E}$		Median^†
SSF RGB	$\mathbf{\mathcal{W}}_{C}$ , $\mathbf{\mathcal{W}}_{S}$ ^∗		None
SSF YUV	$\mathbf{\mathcal{W}}_{E}$ , $\mathbf{\mathcal{W}}_{S}$ ^∗		

^∗Only applicable to Merge Mertens and SSF RGB. ^†Only applicable when using more than 1 EV $\geq 0$ frame.
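To make the notion of a “variable combination” concrete, the sketch below enumerates the per-scene configurations that appear as rows in Table II. The dictionary layout and names are illustrative assumptions, not taken from the released code; the weight sets and the rule that stacking needs more than one EV-positive frame follow Table I and its footnotes.

```python
from itertools import product

# Weight sets evaluated per fusion method, as listed in Table II
# (C: contrast, E: exposure, S: saturation).
RGB_WEIGHT_SETS = [("E", "C"), ("S", "C"), ("S", "E"), ("S", "E", "C")]
YUV_WEIGHT_SETS = [("E",), ("C",), ("C", "E")]
FUSION_WEIGHTS = {
    "Merge Mertens": RGB_WEIGHT_SETS,
    "SSF RGB": RGB_WEIGHT_SETS,
    "Fast YUV": YUV_WEIGHT_SETS,
    "SSF YUV": YUV_WEIGHT_SETS,
}
STACKING = ("Mean", "Median", "None")
EV_POSITIVE_FRAMES = range(1, 6)  # N = 1..5

def enumerate_configurations():
    configs = []
    for method, weight_sets in FUSION_WEIGHTS.items():
        for weights, stacking, n in product(weight_sets, STACKING, EV_POSITIVE_FRAMES):
            if stacking != "None" and n < 2:
                continue  # stacking needs more than one EV-positive frame
            configs.append((method, weights, stacking, n))
    return configs

configs = enumerate_configurations()
print(len(configs))  # 182 configurations per scene (14 method/weight sets x 13 stacking/frame options)
```

Each of these configurations is then applied to every scene in the dataset.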

III-B1 FUSION METHOD

We implemented the Merge Mertens mertens, Fast YUV fyuv, and Single-Scale Fusion ssf_rgb (namely, SSF RGB) methods following their originally proposed implementations. We also implemented an adapted version of SSF RGB that works on YUV images (namely, SSF YUV), adapting SSF RGB ssf_rgb in the same way that Fast YUV fyuv adapts Merge Mertens, with the following modifications: we employ the single Laplacian step only on the Y channel; we compute 2 weight maps equal to those of Fast YUV; and the UV channels are fused using the maximum per-frame function. All methods were developed in the Python 3 python programming language with the OpenCV opencv library, and are available at: https://github.com/LucasKirsten/Benchmark-Image-Fusion.
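The split between luminance and chroma handling in the YUV-based methods can be sketched as follows. This is only an illustration under stated assumptions: the Y channel is fused with a plain per-pixel weighted average, standing in for the pyramid (Fast YUV) or its single-scale approximation (SSF YUV), and the “maximum per-frame function” is read as a plain element-wise maximum over the frames, a detail the text does not specify. The function name `fuse_yuv` is hypothetical.

```python
import numpy as np

def fuse_yuv(frames_yuv, y_weights):
    """Fuse N YUV frames.

    frames_yuv: (N, H, W, 3) float array with channels in Y, U, V order.
    y_weights:  (N, H, W) non-negative weights applied to the Y channel only.
    """
    frames_yuv = np.asarray(frames_yuv, dtype=np.float64)
    y_weights = np.asarray(y_weights, dtype=np.float64)
    # Normalize the weights so that they sum to 1 across the frames.
    norm = y_weights / (y_weights.sum(axis=0, keepdims=True) + 1e-12)
    y = (norm * frames_yuv[..., 0]).sum(axis=0)  # weighted luminance
    uv = frames_yuv[..., 1:].max(axis=0)         # assumed element-wise maximum
    return np.dstack([y, uv])                    # (H, W, 3)
```

Because weight maps and blending are only needed for one of the three channels, this structure is consistent with the lower runtime and memory reported for the YUV-based methods.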

III-B2 FUSION WEIGHTS

In all fusion methods, the final weight map is computed following:

$$\mathbf{\mathcal{W}}=\prod_{i=1}^{M}(\mathbf{\mathcal{W}}_{i})^{k_{i}}, \tag{1}$$

where $\mathbf{\mathcal{W}}_{i}\in\mathbb{R}^{W\times H\times N}$ is the $i$-th weight map, $k_{i}$ is an associated exponent (usually ranging from 0 to 1), $N$ is the number of input frames, and $M$ is the number of computed weights: $M=3$ ($\mathbf{\mathcal{W}}_{C}$: contrast, $\mathbf{\mathcal{W}}_{S}$: saturation, and $\mathbf{\mathcal{W}}_{E}$: exposure) for the RGB-based fusion methods (Merge Mertens and SSF RGB), and $M=2$ ($\mathbf{\mathcal{W}}_{C}$ and $\mathbf{\mathcal{W}}_{E}$) for the YUV-based ones (Fast YUV and SSF YUV). Note that, if $k_{j}=0$, the corresponding weight map has no effect on the fusion process, since ${(\mathbf{\mathcal{W}}_{j})^{0}=\mathbf{1}}$. In the fusion method implementations, we exploited this fact to individually evaluate the influence of each proposed weight map. First, we computed the fused image using all the weights proposed by the method (equivalent to setting all $k_{i}=1$), and then we excluded the computation of one weight map at a time (equivalent to setting one $k_{i}=0$, but without the computational cost of first computing the weight map and then raising it to the power 0). Specifically, for the RGB methods we used 4 weight combinations (1 using all weights, and the other 3 each excluding one weight), and for the YUV methods we used 3 weight combinations (1 using all weights, and the other 2 each excluding one weight).
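A minimal sketch of Eq. (1), including the shortcut of skipping a weight map whose exponent is 0, could look as follows. The two quality measures are illustrative assumptions rather than the exact measures of the evaluated methods: a Gaussian well-exposedness curve with an assumed width of 0.2 on frames scaled to [0, 1], and the absolute response of a 4-neighbour Laplacian whose borders wrap around for brevity. All names are hypothetical.

```python
import numpy as np

def exposure_weight(frames, sigma=0.2):
    """Favor intensities near mid-gray; frames are (H, W, N) floats in [0, 1]."""
    return np.exp(-((frames - 0.5) ** 2) / (2 * sigma ** 2))

def contrast_weight(frames):
    """Absolute 4-neighbour Laplacian response (borders wrap around)."""
    lap = (-4 * frames
           + np.roll(frames, 1, axis=0) + np.roll(frames, -1, axis=0)
           + np.roll(frames, 1, axis=1) + np.roll(frames, -1, axis=1))
    return np.abs(lap)

def combine_weights(frames, measures, exponents):
    """Eq. (1): product of weight maps raised to their exponents k_i.

    measures:  dict mapping a name to a function frames -> (H, W, N) map.
    exponents: dict mapping a name to k_i; missing names count as k_i = 0.
    """
    combined = np.ones_like(frames, dtype=np.float64)
    for name, measure in measures.items():
        k = exponents.get(name, 0.0)
        if k == 0:
            continue  # W**0 == 1, so the map is never computed
        combined *= measure(frames) ** k
    return combined
```

Skipping the call, instead of computing the map and raising it to the power 0, is what allows the runtime and memory measurements to reflect the cost of each individual weight map.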

III-B3 NUMBER OF FRAMES

We used the single EV-negative (EV -24) frame in all tested combinations (to correct overexposed areas), and iterated over the number of EV-positive (EV $\geq 0$) frames. Hence, for each combination, we used 1 EV-negative + $N$ EV-positive frames, with $N$ ranging from 1 to 5. The EV-positive frames were chosen in ascending EV order (i.e., 0, 1, 2, 3, and 4), which is the order in which the frames were acquired on the smartphones.

III-B4 STACKING METHOD

We evaluated the usage of the Mean and Median stacking functions on the EV-positive frames. Both methods are applied in the form:

$$S=\mathcal{F}\left(\bigcup_{i=1}^{N}I_{i}\right), \tag{2}$$

where $S$ is the stacked image, ${I_{i}\in\mathbb{R}^{W\times H\times C}}$ is the $i$-th image to be stacked, ${\mathcal{F}:\mathbb{R}^{N\times W\times H\times C}\rightarrow\mathbb{R}^{W\times H\times C}}$ is the stacking function, and $N$ is the number of frames to be stacked. We also evaluated the effects of not using any stacking method. Hence, in total, three combinations were evaluated: Mean, Median, and “None” (referring to not using any stacking method on the EV-positive frames).
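The frame selection of Section III-B3 and the stacking of Eq. (2) can be sketched together as below. The function name `prepare_inputs` and the dictionary keyed by EV are hypothetical conveniences; the sketch only assumes that frames are arrays of identical shape.

```python
import numpy as np

STACKERS = {"Mean": np.mean, "Median": np.median}

def prepare_inputs(frames_by_ev, n_positive, stacking="None"):
    """Return the list of frames handed to the fusion method.

    frames_by_ev maps an EV value to an (H, W, C) array. The EV-negative
    frame is always kept; the first n_positive EV-positive frames (in
    ascending EV order) are either passed through or stacked into one.
    """
    evs = sorted(frames_by_ev)
    negative = [frames_by_ev[ev] for ev in evs if ev < 0]
    positive = [frames_by_ev[ev] for ev in evs if ev >= 0][:n_positive]
    if stacking == "None" or len(positive) < 2:
        return negative + positive
    # Eq. (2): reduce (N, H, W, C) to (H, W, C) along the frame axis.
    stacked = STACKERS[stacking](np.stack(positive), axis=0)
    return negative + [stacked]
```

With Mean or Median, the fusion method always receives two frames regardless of $N$, whereas with “None” it receives $1+N$ frames.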

III-C EVALUATION PROTOCOL

For evaluating the quality of the fused images, we employed three standard literature metrics, namely Multi-scale Structural Similarity Index (MS-SSIM) ssim, Peak Signal-to-Noise Ratio (PSNR), and Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS) ergas; and one metric based on deep learning: Learned Perceptual Image Patch Similarity (LPIPS) lpips. MS-SSIM (we used the implementation provided by the Scikit-image scikit-image Python library) is a perception-based metric that considers image degradation as the perceived change in structural information, while also incorporating important perceptual aspects, such as luminance, image distortion, and the combination of contrast distortion survey_class. PSNR (also computed with Scikit-image) is the ratio between the peak signal power and the noise power. ERGAS (we used the implementation available at https://github.com/andrewekhalel/sewar) quantifies the quality of images produced by fusing high-spatial-resolution images. LPIPS (available at https://github.com/richzhang/PerceptualSimilarity) employs a large-scale, highly varied perceptual similarity dataset to fine-tune deep learning models for the image quality assessment task. The image quality experiments were conducted on an AWS c6a.32xlarge instance (see https://aws.amazon.com/pt/ec2/instance-types/c6a/) with 3rd generation AMD EPYC processors, 128 CPUs, and 256 GB of memory. We leveraged the high number of CPUs to parallelize the variable combination computation for each tested image. Nevertheless, the whole computation took more than one week to complete. In addition, we manually analyzed the images to confirm that the metrics were in agreement with the visual results.
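For reference, PSNR follows the standard definition $10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$. The sketch below is a plain NumPy version of that formula, not the Scikit-image code used in the experiments, and it assumes 8-bit images (peak value 255) unless another `data_range` is given.

```python
import numpy as np

def psnr(reference, test, data_range=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)  # avoid uint8 overflow
    test = np.asarray(test, dtype=np.float64)
    mse = np.mean((reference - test) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)
```

Higher PSNR and MS-SSIM indicate a fused image closer to the GT, whereas lower LPIPS and ERGAS do.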

The runtime and memory experiments used a Motorola smartphone with 4 GB of RAM, four 2.4 GHz Kryo 265 Gold and four 1.9 GHz Kryo 265 Silver processors, a Qualcomm SM6225 Snapdragon 680 4G chipset, and an Adreno 610 GPU. The algorithms for each variable combination were converted to the TensorFlow Lite format (available at https://www.tensorflow.org/lite/) to run on the device, and the measurements were performed with the TensorFlow Lite benchmark tool (available at https://www.tensorflow.org/lite/performance/measurement) with 10 simulated runs using the GPU.

IV RESULTS AND DISCUSSIONS

Table II presents the grouped results for our experiments involving all the variable combinations. Notably, the best-performing configuration varies across variable choices, with some consistencies observed in some optimal parameter combinations, as we proceed to discuss. We recall that the usual complete process of mobile image capture relies on more steps than the ones considered in this work, which may be the main cause of some low image quality metric values. To better interpret the individual variable impacts, we also present in Table III the grouped results for the tested fusion methods and weights, aggregated over all stacking methods and numbers of EV-positive frames (i.e., a stacking- and frame-agnostic evaluation). Similarly, in Table IV, we show the grouped results for the stacking methods and numbers of EV-positive frames, aggregated over all fusion methods and weight combinations (i.e., a fusion-agnostic evaluation).

TABLE II: Complete grouped results for the tested variable combinations, reporting the mean value for each metric. Time is in seconds and Memory is in megabytes (MB). Best results are marked in bold, whereas worst results are underlined. $\mathbf{\mathcal{W}}_{C}$ refers to Contrast weight, $\mathbf{\mathcal{W}}_{E}$ to Exposure weight, and $\mathbf{\mathcal{W}}_{S}$ to Saturation weight.

Fusion Method and Weights	Stacking Method	EV $\geq 0$ Frames	MS-SSIM	PSNR	LPIPS	ERGAS	Time	Memory
Fast YUV $\mathbf{\mathcal{W}}_{E}$	Mean	2	0.46	14.35	0.62	13.09	1.02	366.36
		3	0.47	14.55	0.62	12.80	1.18	378.04
		4	0.46	14.43	0.64	12.06	1.39	389.38
		5	0.47	14.63	0.64	11.79	1.53	397.00
	Median	2	0.46	14.35	0.62	13.09	1.14	427.43
		3	0.47	14.58	0.61	12.76	1.20	469.35
		4	0.46	14.45	0.64	12.03	1.46	507.51
		5	0.46	14.67	0.64	11.73	1.58	549.89
	None	1	0.45	14.16	0.63	13.38	0.25	385.11
		2	0.50	17.14	0.62	9.69	0.29	453.77
		3	0.52	18.88	0.62	8.00	0.31	487.91
		4	0.51	20.13	0.67	6.71	0.43	670.61
		5	0.51	20.73	0.68	6.33	0.45	705.27
Fast YUV $\mathbf{\mathcal{W}}_{C}$	Mean	2	0.46	14.35	0.62	13.09	1.03	362.55
		3	0.47	14.55	0.62	12.80	1.18	374.21
		4	0.46	14.43	0.64	12.06	1.39	385.71
		5	0.47	14.63	0.64	11.79	1.53	396.92
	Median	2	0.46	14.35	0.62	13.09	1.15	424.42
		3	0.47	14.58	0.61	12.76	1.21	465.33
		4	0.46	14.45	0.64	12.03	1.46	507.82
		5	0.46	14.67	0.64	11.73	1.57	549.48
	None	1	0.45	14.16	0.63	13.38	0.26	385.05
		2	0.50	17.14	0.62	9.69	0.30	449.83
		3	0.52	18.88	0.62	8.00	0.31	484.23
		4	0.51	20.13	0.67	6.71	0.44	670.75
		5	0.51	20.73	0.68	6.33	0.46	705.24
Fast YUV $\mathbf{\mathcal{W}}_{C}$ , $\mathbf{\mathcal{W}}_{E}$	Mean	2	0.46	14.35	0.62	13.09	1.05	362.54
		3	0.47	14.55	0.62	12.80	1.19	374.06
		4	0.46	14.43	0.64	12.06	1.38	385.52
		5	0.47	14.63	0.64	11.79	1.54	396.97
	Median	2	0.46	14.35	0.62	13.09	1.17	423.73
		3	0.47	14.58	0.61	12.76	1.22	466.30
		4	0.46	14.45	0.64	12.03	1.47	508.03
		5	0.46	14.67	0.64	11.73	1.58	549.95
	None	1	0.45	14.16	0.63	13.38	0.27	385.31
		2	0.50	17.14	0.62	9.69	0.30	450.00
		3	0.52	18.88	0.62	8.00	0.31	484.40
		4	0.51	20.13	0.67	6.71	0.47	670.99
		5	0.51	20.73	0.68	6.33	0.47	705.45
Mertens $\mathbf{\mathcal{W}}_{E}$ , $\mathbf{\mathcal{W}}_{C}$	Mean	2	0.46	14.35	0.62	13.10	2.46	942.10
		3	0.47	14.55	0.62	12.80	2.62	953.94
		4	0.47	14.49	0.64	11.97	2.81	965.05
		5	0.47	14.70	0.64	11.70	2.96	976.22
	Median	2	0.46	14.35	0.62	13.09	2.58	987.91
		3	0.47	14.58	0.61	12.76	2.64	1022.32
		4	0.46	14.51	0.64	11.94	2.88	1056.87
		5	0.47	14.74	0.64	11.64	3.02	1090.68
	None	1	0.45	14.16	0.63	13.38	1.87	931.11
		2	0.50	17.14	0.62	9.69	2.44	1103.98
		3	0.52	18.88	0.62	8.00	2.95	1219.85
		4	0.51	20.40	0.67	6.46	3.65	1725.26
		5	0.52	21.05	0.68	6.07	4.17	1841.25
Mertens $\mathbf{\mathcal{W}}_{S}$ , $\mathbf{\mathcal{W}}_{C}$	Mean	2	0.46	14.35	0.62	13.10	2.82	843.65
		3	0.47	14.55	0.62	12.80	2.98	855.00
		4	0.46	14.43	0.64	12.06	3.22	866.34
		5	0.47	14.63	0.64	11.79	3.31	877.78
	Median	2	0.46	14.35	0.62	13.09	2.94	889.57
		3	0.47	14.58	0.61	12.76	3.00	923.65
		4	0.46	14.45	0.64	12.03	3.26	957.96
		5	0.46	14.67	0.64	11.73	3.40	992.39
	None	1	0.45	14.16	0.63	13.38	2.23	832.16
		2	0.50	17.14	0.62	9.69	3.08	981.93
		3	0.52	18.88	0.62	8.00	3.75	1079.14
		4	0.51	20.13	0.67	6.71	4.62	1477.49
		5	0.51	20.73	0.68	6.33	5.33	1574.33
Mertens $\mathbf{\mathcal{W}}_{S}$ , $\mathbf{\mathcal{W}}_{E}$	Mean	2	0.46	14.35	0.62	13.10	2.22	1544.81
		3	0.47	14.55	0.62	12.80	2.36	1556.66
		4	0.46	14.43	0.64	12.06	2.58	1566.65
		5	0.47	14.63	0.64	11.79	2.74	1577.89
	Median	2	0.46	14.35	0.62	13.09	2.34	1591.12
		3	0.47	14.58	0.61	12.76	2.40	1625.21
		4	0.46	14.45	0.64	12.03	2.63	1667.64
		5	0.46	14.67	0.64	11.73	2.77	1708.71
	None	1	0.45	14.16	0.63	13.38	1.47	1560.23
		2	0.50	17.14	0.62	9.69	1.87	1792.38
		3	0.52	18.88	0.62	8.00	2.18	1913.18
		4	0.51	20.13	0.67	6.71	2.83	2617.55
		5	0.51	20.73	0.68	6.33	3.81	2681.38
Mertens $\mathbf{\mathcal{W}}_{S}$ , $\mathbf{\mathcal{W}}_{E}$ , $\mathbf{\mathcal{W}}_{C}$	Mean	2	0.46	14.35	0.62	13.10	2.93	992.66
		3	0.47	14.55	0.62	12.80	3.08	1003.67
		4	0.46	14.43	0.64	12.06	3.33	1015.21
		5	0.47	14.63	0.64	11.79	3.45	1026.95
	Median	2	0.46	14.35	0.62	13.09	3.06	1038.21
		3	0.47	14.58	0.61	12.76	3.13	1072.92
		4	0.46	14.45	0.64	12.03	3.38	1106.75
		5	0.46	14.67	0.64	11.73	3.49	1141.16
	None	1	0.45	14.16	0.63	13.38	2.36	980.92
		2	0.50	17.14	0.62	9.69	3.20	1167.05
		3	0.52	18.88	0.62	8.00	3.89	1292.47
		4	0.51	20.13	0.67	6.71	4.87	1830.16
		5	0.51	20.73	0.68	6.33	5.60	1956.23
SSF RGB $\mathbf{\mathcal{W}}_{E}$ , $\mathbf{\mathcal{W}}_{C}$	Mean	2	0.46	14.35	0.62	13.09	2.22	649.55
		3	0.47	14.55	0.62	12.80	2.37	661.16
		4	0.46	14.43	0.64	12.06	2.56	671.91
		5	0.47	14.63	0.64	11.79	2.71	683.99
	Median	2	0.46	14.35	0.62	13.09	2.34	695.43
		3	0.47	14.58	0.61	12.76	2.38	729.11
		4	0.46	14.45	0.64	12.03	2.63	763.57
		5	0.46	14.67	0.64	11.73	2.77	798.20
	None	1	0.45	14.16	0.63	13.38	1.60	638.16
		2	0.50	17.14	0.62	9.69	2.20	818.38
		3	0.52	18.88	0.62	8.00	2.67	932.22
		4	0.51	20.13	0.67	6.71	3.45	1415.96
		5	0.51	20.73	0.68	6.33	3.93	1529.54
SSF RGB $\mathbf{\mathcal{W}}_{S}$ , $\mathbf{\mathcal{W}}_{C}$	Mean	2	0.46	14.35	0.62	13.09	2.56	548.68
		3	0.47	14.55	0.62	12.80	2.75	559.65
		4	0.46	14.43	0.64	12.06	2.95	571.24
		5	0.47	14.63	0.64	11.79	3.08	582.68
	Median	2	0.46	14.35	0.62	13.09	2.71	593.86
		3	0.47	14.58	0.61	12.76	2.77	628.52
		4	0.46	14.45	0.64	12.03	2.98	662.90
		5	0.46	14.67	0.64	11.73	3.14	702.49
	None	1	0.45	14.16	0.63	13.38	2.00	536.87
		2	0.50	17.14	0.62	9.69	2.81	696.62
		3	0.52	18.88	0.62	8.00	3.49	791.53
		4	0.51	20.13	0.67	6.71	4.42	1190.99
		5	0.51	20.73	0.68	6.33	5.10	1285.76
SSF RGB $\mathbf{\mathcal{W}}_{S}$ , $\mathbf{\mathcal{W}}_{E}$	Mean	2	0.46	14.35	0.62	13.09	2.17	882.84
		3	0.47	14.55	0.62	12.80	2.35	894.14
		4	0.46	14.43	0.64	12.06	2.52	905.78
		5	0.47	14.63	0.64	11.79	2.65	917.29
	Median	2	0.46	14.35	0.62	13.09	2.29	940.41
		3	0.47	14.58	0.61	12.76	2.36	960.53
		4	0.46	14.45	0.64	12.03	2.61	1002.66
		5	0.46	14.67	0.64	11.73	2.73	1044.50
	None	1	0.45	14.16	0.63	13.38	1.42	897.84
		2	0.50	17.14	0.62	9.69	1.92	1105.16
		3	0.52	18.88	0.62	8.00	2.32	1222.63
		4	0.51	20.13	0.67	6.71	2.97	1849.41
		5	0.51	20.73	0.68	6.33	3.42	1968.71
SSF RGB $\mathbf{\mathcal{W}}_{S}$ , $\mathbf{\mathcal{W}}_{E}$ , $\mathbf{\mathcal{W}}_{C}$	Mean	2	0.46	14.35	0.62	13.09	2.70	706.74
		3	0.47	14.55	0.62	12.80	2.86	718.16
		4	0.46	14.43	0.64	12.06	3.07	729.73
		5	0.47	14.63	0.64	11.79	3.24	740.82
	Median	2	0.46	14.35	0.62	13.09	2.84	752.55
		3	0.47	14.58	0.61	12.76	2.89	786.71
		4	0.46	14.45	0.64	12.03	3.15	821.25
		5	0.46	14.67	0.64	11.73	3.26	855.28
	None	1	0.45	14.16	0.63	13.38	2.07	695.07
		2	0.50	17.14	0.62	9.69	2.96	893.09
		3	0.52	18.88	0.62	8.00	3.65	1016.46
		4	0.51	20.13	0.67	6.71	4.68	1555.05
		5	0.51	20.73	0.68	6.33	5.38	1678.54
SSF YUV $\mathbf{\mathcal{W}}_{E}$	Mean	2	0.46	14.35	0.62	13.09	1.02	307.72
		3	0.47	14.55	0.62	12.80	1.20	318.99
		4	0.46	14.43	0.64	12.06	1.40	330.30
		5	0.47	14.63	0.64	11.79	1.54	342.03
	Median	2	0.46	14.35	0.62	13.09	1.16	368.85
		3	0.47	14.58	0.61	12.76	1.22	411.22
		4	0.46	14.45	0.64	12.03	1.48	452.79
		5	0.46	14.67	0.64	11.73	1.61	494.84
	None	1	0.45	14.16	0.63	13.38	0.26	327.99
		2	0.50	17.14	0.62	9.69	0.31	394.91
		3	0.52	18.88	0.62	8.00	0.32	428.82
		4	0.51	20.13	0.67	6.71	0.47	619.59
		5	0.51	20.73	0.68	6.33	0.49	653.74
SSF YUV $\mathbf{\mathcal{W}}_{C}$	Mean	2	0.46	14.35	0.62	13.09	1.03	299.96
		3	0.47	14.55	0.62	12.80	1.19	311.75
		4	0.46	14.43	0.64	12.06	1.39	323.20
		5	0.47	14.63	0.64	11.79	1.54	334.27
	Median	2	0.46	14.35	0.62	13.09	1.18	361.31
		3	0.47	14.58	0.61	12.76	1.22	403.32
		4	0.46	14.45	0.64	12.03	1.47	445.14
		5	0.46	14.67	0.64	11.73	1.59	486.88
	None	1	0.45	14.16	0.63	13.38	0.27	322.35
		2	0.50	17.14	0.62	9.69	0.31	383.21
		3	0.52	18.88	0.62	8.00	0.31	417.72
		4	0.51	20.13	0.67	6.71	0.49	604.32
		5	0.51	20.73	0.68	6.33	0.51	638.61
SSF YUV $\mathbf{\mathcal{W}}_{C}$ , $\mathbf{\mathcal{W}}_{E}$	Mean	2	0.46	14.35	0.62	13.09	1.04	303.91
		3	0.47	14.55	0.62	12.80	1.20	315.48
		4	0.46	14.43	0.64	12.06	1.41	327.12
		5	0.47	14.63	0.64	11.79	1.55	338.46
	Median	2	0.46	14.35	0.62	13.09	1.18	365.34
		3	0.47	14.58	0.61	12.76	1.22	407.45
		4	0.46	14.45	0.64	12.03	1.47	449.17
		5	0.46	14.67	0.64	11.73	1.59	490.96
	None	1	0.45	14.16	0.63	13.38	0.28	326.61
		2	0.50	17.14	0.62	9.69	0.32	391.31
		3	0.52	18.88	0.62	8.00	0.31	425.79
		4	0.51	20.13	0.67	6.71	0.51	612.03
		5	0.51	20.73	0.68	6.33	0.52	646.34

TABLE III: Grouped results for the Fusion Methods and Weights, reporting the mean value for each metric. Time is in seconds and Memory is in megabytes (MB). Best results are marked in bold, whereas worst results are underlined. $\mathbf{\mathcal{W}}_{C}$ refers to Contrast weight, $\mathbf{\mathcal{W}}_{E}$ to Exposure weight, and $\mathbf{\mathcal{W}}_{S}$ to Saturation weight.

Fusion Method	Fusion Weights	MS-SSIM	PSNR	LPIPS	ERGAS	Time	Memory
Fast YUV	$\mathbf{\mathcal{W}}_{E}$	0.48	15.48	0.62	11.75	0.94	475.97
	$\mathbf{\mathcal{W}}_{C}$	0.48	15.48	0.62	11.75	0.94	473.97
	$\mathbf{\mathcal{W}}_{C}$ , $\mathbf{\mathcal{W}}_{E}$	0.48	15.48	0.62	11.75	0.95	474.10
Mertens	$\mathbf{\mathcal{W}}_{E}$ , $\mathbf{\mathcal{W}}_{C}$	0.48	15.51	0.62	11.71	2.85	1139.73
	$\mathbf{\mathcal{W}}_{S}$ , $\mathbf{\mathcal{W}}_{C}$	0.48	15.48	0.62	11.75	3.38	1011.64
	$\mathbf{\mathcal{W}}_{S}$ , $\mathbf{\mathcal{W}}_{E}$	0.48	15.48	0.62	11.75	2.48	1800.26
	$\mathbf{\mathcal{W}}_{S}$ , $\mathbf{\mathcal{W}}_{E}$ , $\mathbf{\mathcal{W}}_{C}$	0.48	15.48	0.62	11.75	3.52	1201.87
SSF RGB	$\mathbf{\mathcal{W}}_{E}$ , $\mathbf{\mathcal{W}}_{C}$	0.48	15.48	0.62	11.75	2.60	845.17
	$\mathbf{\mathcal{W}}_{S}$ , $\mathbf{\mathcal{W}}_{C}$	0.48	15.48	0.62	11.75	3.14	719.37
	$\mathbf{\mathcal{W}}_{S}$ , $\mathbf{\mathcal{W}}_{E}$	0.48	15.48	0.62	11.75	2.44	1122.45
	$\mathbf{\mathcal{W}}_{S}$ , $\mathbf{\mathcal{W}}_{E}$ , $\mathbf{\mathcal{W}}_{C}$	0.48	15.48	0.62	11.75	3.29	919.19
SSF YUV	$\mathbf{\mathcal{W}}_{E}$	0.48	15.48	0.62	11.75	0.96	419.37
	$\mathbf{\mathcal{W}}_{C}$	0.48	15.48	0.62	11.75	0.96	410.16
	$\mathbf{\mathcal{W}}_{C}$ , $\mathbf{\mathcal{W}}_{E}$	0.48	15.48	0.62	11.75	0.97	415.38

TABLE IV: Grouped results for the tested Number of EV-positive frames and the employed Stacking Method, reporting the mean value for each metric. Time is in seconds and Memory is in megabytes (MB). Best results are marked in bold, whereas worst results are underlined.

EV $\geq 0$ Frames	Stacking Method	MS-SSIM	PSNR	LPIPS	ERGAS	Time	Memory
1	None	0.45	14.16	0.63	13.38	1.19	657.48
2	Mean	0.46	14.35	0.62	13.09	1.88	651.01
	Median	0.46	14.35	0.62	13.09	2.01	704.30
	None	0.50	17.14	0.62	9.69	1.59	791.54
3	Mean	0.47	14.55	0.62	12.80	2.04	662.49
	Median	0.47	14.58	0.61	12.76	2.06	740.85
	None	0.52	18.88	0.62	8.00	1.91	871.17
4	Mean	0.46	14.43	0.64	12.05	2.24	673.79
	Median	0.46	14.46	0.64	12.02	2.31	779.29
	None	0.51	20.16	0.67	6.69	2.45	1250.73
5	Mean	0.47	14.63	0.64	11.78	2.38	684.95
	Median	0.46	14.68	0.64	11.72	2.44	818.24
	None	0.51	20.77	0.68	6.31	2.83	1326.46

Regarding the fusion methods and weights, observe that combinations involving the Exposure and Contrast weights tend to yield the highest MS-SSIM and PSNR, and the lowest ERGAS values, indicating slightly better image quality and lower spectral distortion. However, it is worth noting that all quality measures are very similar across configurations, implying that these variables have a minimal effect on the final image quality. Regarding computational resources, methods that operate in the YUV color space are faster and use less memory than those that operate in RGB. The primary explanation appears to be that most operations are performed exclusively on the Y channel, whereas RGB methods require operations across all three channels. Furthermore, Fast YUV had the fastest runtime among all methods (including the single-scale ones), while its memory usage (about 474 MB on average) was moderately higher than that of its single-scale counterpart, SSF YUV (about 415 MB). The runtime result could be attributed to the fact that, although SSF YUV eliminates the multi-level pyramid step, it introduces additional dot product operations, which appear to be more resource-intensive on the tested smartphone hardware.

Regarding stacking methods, the None method (i.e., not using any stacking method) consistently outperforms the others on MS-SSIM, PSNR, and ERGAS across all frame counts, while also showing the lowest processing time when using 3 EV-positive frames or fewer. LPIPS is the exception: it is nearly identical across stacking methods for 2 and 3 frames and favors Mean and Median for 4 and 5 frames. Overall, the None stacking method proves to be a robust choice, offering high-quality results with efficient processing across varying numbers of frames. Nevertheless, None stacking has the highest memory usage of the three options (for the same number of frames, about 1.9 times that of Mean stacking in the worst case, and about 1.2 times in the best). These findings are significant because they show that opting not to use any stacking method yields superior image quality on most measures and, for up to 3 EV-positive frames, also reduces processing time. This conclusion is not immediately intuitive, as one might anticipate that employing some stacking method would at least decrease runtime, given that fewer frames would be supplied to the fusion algorithm.
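As a minimal sketch of what the three stacking options compared above do to the fusion input, the following assumes that stacking merges all EV-positive frames pixel-wise into a single frame and that the remaining frames pass through unchanged. The function names and this exact procedure are illustrative assumptions, not the benchmarked implementation.

```python
import numpy as np


def stack_frames(frames, method="none"):
    """Merge a list of equally shaped frames pixel-wise.

    Returns a list: the frames unchanged for "none", or a single
    merged frame for "mean" and "median".
    """
    if method == "none":
        return list(frames)
    stacked = np.stack(frames, axis=0)  # shape: (N, H, W[, C])
    if method == "mean":
        return [stacked.mean(axis=0)]
    if method == "median":
        return [np.median(stacked, axis=0)]
    raise ValueError(f"unknown stacking method: {method}")


def fusion_inputs(ev_negative, ev_positive, method="none"):
    # Hypothetical helper: only the EV-positive frames are stacked;
    # EV-negative frames are passed to the fusion step as they are.
    return list(ev_negative) + stack_frames(ev_positive, method)
```

Under these assumptions, one EV-negative frame plus three EV-positive frames gives the fusion step four inputs with `"none"` but only two with `"mean"` or `"median"`.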

Regarding solely the number of frames used, note that, as expected, using more frames usually requires more processing time and memory, but it does not necessarily translate into better image quality. Specifically, with 3 frames, None stacking achieves the best MS-SSIM score and Median stacking the best LPIPS score. As previously mentioned, incorporating additional EV-positive frames during fusion involves feeding images with longer exposure. Hence, for this particular use case, we observe that using frames with EV higher than 2 does not necessarily improve the fused image quality.
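The observations in the two preceding paragraphs can be read directly off Table IV. The short sketch below copies a subset of the table's columns and looks up the best configuration per metric and the memory ratio of None stacking. The helper names are hypothetical, and using Mean as the default baseline for the ratio is an assumption about how the 1.9 and 1.2 factors were obtained.

```python
# Mean values copied from Table IV:
# (EV >= 0 frames, stacking) -> (MS-SSIM, PSNR, LPIPS, time [s], memory [MB])
ROWS = {
    (1, "None"): (0.45, 14.16, 0.63, 1.19, 657.48),
    (2, "Mean"): (0.46, 14.35, 0.62, 1.88, 651.01),
    (2, "Median"): (0.46, 14.35, 0.62, 2.01, 704.30),
    (2, "None"): (0.50, 17.14, 0.62, 1.59, 791.54),
    (3, "Mean"): (0.47, 14.55, 0.62, 2.04, 662.49),
    (3, "Median"): (0.47, 14.58, 0.61, 2.06, 740.85),
    (3, "None"): (0.52, 18.88, 0.62, 1.91, 871.17),
    (4, "Mean"): (0.46, 14.43, 0.64, 2.24, 673.79),
    (4, "Median"): (0.46, 14.46, 0.64, 2.31, 779.29),
    (4, "None"): (0.51, 20.16, 0.67, 2.45, 1250.73),
    (5, "Mean"): (0.47, 14.63, 0.64, 2.38, 684.95),
    (5, "Median"): (0.46, 14.68, 0.64, 2.44, 818.24),
    (5, "None"): (0.51, 20.77, 0.68, 2.83, 1326.46),
}
COLUMNS = ("ms_ssim", "psnr", "lpips", "time_s", "memory_mb")
TABLE_IV = {key: dict(zip(COLUMNS, values)) for key, values in ROWS.items()}


def best(metric, higher_is_better=True):
    """Return the (frames, stacking) configuration with the best mean value."""
    pick = max if higher_is_better else min
    return pick(TABLE_IV, key=lambda cfg: TABLE_IV[cfg][metric])


def memory_ratio(frames, baseline="Mean"):
    """Memory of None stacking relative to a baseline, same frame count."""
    none_mb = TABLE_IV[(frames, "None")]["memory_mb"]
    return none_mb / TABLE_IV[(frames, baseline)]["memory_mb"]


if __name__ == "__main__":
    print("best MS-SSIM:", best("ms_ssim"))
    print("best PSNR:", best("psnr"))
    print("best LPIPS:", best("lpips", higher_is_better=False))
    for n in (2, 3, 4, 5):
        print(n, "frames, None/Mean memory:", round(memory_ratio(n), 2))
```

The metrics do not agree on a single best configuration, which is why the choice of setup depends on the target metric and on the available memory.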

Overall, these findings underscore the importance of carefully selecting the image fusion setup based on the desired outcome metric and the computational constraints. Our visual inspection corroborates these findings, as illustrated in Figures 2, 3 and 4. Specifically, Figure 2 shows that increasing the number of EV-positive frames above 3 (EV values higher than 2) does not seem to have a major effect on image quality. Moreover, regarding stacking methods, Figure 3 shows that None stacking produces brighter (clearer) images compared to Mean and Median. Finally, regarding the fusion method and weights, Figure 4 demonstrates that the tested methods and their weight variations produce similar results.

V CONCLUSIONS

In this work, we examined the trade-off between computational resources and the quality of images generated by different fusion methods, fusion weights, numbers of frames, and stacking techniques. Our study used a proprietary dataset comprising images taken with Motorola smartphones’ front cameras across different environmental settings and lighting conditions. Our goal was to determine the variable combinations that produce the best image quality relative to their runtime and resource usage. Regarding multi- versus single-scale methods, the literature often highlights that, although single-scale methods should run faster, they usually do not provide good image quality mertens; fyuv; survey_class. However, our work shows that the multi-scale Fast YUV fyuv had the fastest runtime and the second-lowest memory usage among all tested methods. Moreover, both versions of the single-scale SSF ssf_rgb (RGB and YUV) had similar image quality results compared to the tested multi-scale methods (Fast YUV and Mertens mertens). Our research also uncovered that methods operating in the YUV color space exhibit superior benchmark performance compared to RGB-based ones. Specifically, they produce similar image quality with faster runtime and lower memory usage. This discovery is significant as it suggests that the development of new learning-based methods could benefit from using YUV color space images, with a focus on operating solely on the Y channel to conserve computational resources. Additionally, our findings show that the choice among the fusion weights proposed in the literature has minimal impact on the final image quality. This aligns with the recent trend in deep learning methods survey_deep; xu2022multi, which aim to learn weight maps from example data rather than relying on hand-coded ones.

Regarding the number of frames used, we showed that feeding more input frames to the fusion method will not necessarily improve the final image quality. More specifically, our experimental setup demonstrated that using frames with an EV value greater than 2 (in our specific case, using more than 3 frames) usually will not improve the final image quality but, as expected, will require more computational resources, increasing runtime and memory usage. To address this limitation, we also investigated stacking methods as a means to decrease the number of EV-positive input frames fed into the fusion method. However, our experiments revealed that the tested methods (Mean and Median) resulted in lower image quality compared to not using any stacking (None method), despite exhibiting similar runtime. Nevertheless, stacking methods significantly reduced memory usage, particularly when more frames were employed. Regarding the recent trend in deep learning methods, this finding suggests that the architecture of these models could be designed to accommodate a large number of frames and then employ some learning-based stacking method (e.g., using convolutional layers with a reduced number of neurons compared to the input frames) to mitigate the need for extensive memory resources.

Finally, our results underscore the importance of carefully configuring the image fusion setup based on both the target image quality metrics and the computational limitations of the system. Moreover, they offer valuable insights for the development of new fusion techniques. In future work, we aim to leverage these findings to devise more efficient methods for smartphone image fusion.