LU2Net: A Lightweight Network for Real-time
Underwater Image Enhancement

Haodong Yang, Jisheng Xu, Zhiliang Lin and Jianping He

Haodong Yang is with the Department of Computer Science, Shanghai Jiao Tong University. Jisheng Xu, Zhiliang Lin and Jianping He are with the Department of Automation, Shanghai Jiao Tong University, Key Laboratory of System Control and Information Processing, Ministry of Education of China, and Shanghai Engineering Research Center of Intelligent Control and Management, Shanghai 200240, China. Zhiliang Lin is with the School of Ocean and Civil Engineering, Shanghai Jiao Tong University, State Key Laboratory of Ocean Engineering, Shanghai 200240, China. Emails: {yanghaodong, Jimmy_xu, linzhiliang, jphe}@sjtu.edu.cn.

Abstract

Computer vision techniques have empowered underwater robots to undertake a multitude of tasks, including object tracking and path planning. However, underwater optical factors such as light refraction and absorption degrade underwater images and thereby challenge underwater vision. A variety of underwater image enhancement methods have been proposed to improve the effectiveness of underwater vision perception. Nevertheless, real-time vision tasks on underwater robots additionally demand algorithmic efficiency. In this paper, we introduce Lightweight Underwater UNet (LU2Net), a novel U-shape network designed specifically for real-time enhancement of underwater images. The proposed model incorporates axial depthwise convolution and a channel attention module, which significantly reduce its computational demands and parameter count and thereby improve processing speed. Extensive experiments on a dataset and on real-world underwater robots demonstrate the performance and speed of the proposed model. It provides well-enhanced underwater images at a speed 8 times faster than the current state-of-the-art underwater image enhancement method. Moreover, LU2Net is able to handle real-time underwater video enhancement.

I INTRODUCTION

With the remarkable progress of computer vision in recent years, underwater vision is emerging as a prevalent and vital means of gathering information for underwater robots. A variety of vision models have been integrated into underwater robots for marine tasks, like object detection [1], monocular depth estimation [2] and visual odometry [3]. Real-time and high-quality underwater images provide essential information for robotic decision making and improve the safety and stability of underwater robots.

Though computer vision technologies strengthen the perception abilities of underwater robots, low-quality underwater images degrade the performance of marine robotic tasks. In contrast to the relatively stable onshore environment, the underwater environment is characterized by complexity and variability. Greater absorption of red light and scattering by suspended particles in water lead to imbalanced color channels and inconsistent blurriness in images [4]. Additionally, light may be insufficient in deep water, limiting the visibility of target objects and surroundings. The changeable underwater environment thus strongly affects the quality of underwater images. As a result, computer vision models designed for clear onshore images and videos may suffer a drop in performance when applied on underwater robots for marine vision tasks.

To restore degraded underwater images, researchers have proposed a variety of underwater image enhancement (UIE) methods. Traditional UIE methods based on visual and physical priors [5, 6, 7, 8, 9, 10, 11, 12] are usually unable to adapt to varied underwater environments and can produce inappropriate enhancement. Deep neural networks for underwater image enhancement have been intensively investigated. UIEC^2-Net [13] integrates the RGB and HSV color spaces in a single CNN, utilizing different image properties. PUGAN [14] estimates physical parameters to guide image enhancement in neural networks. Recently, [15] introduced the transformer into UIE. However, the above-mentioned learning-based methods often focus on visual effects rather than processing speed. For underwater robots performing practical marine tasks, it is necessary to provide fast image enhancement.

To address the aforementioned issue, in this paper, we propose Lightweight Underwater UNet (LU2Net) for real-time underwater image enhancement. Axial depthwise convolution is embedded into LU2Net, which contributes to larger receptive fields, so more details are perceived with fewer convolution layers. In addition, a channel attention module is integrated to adaptively adjust channel weights and mitigate inconsistent attenuation among channels. Experiments on a dataset and on real-world robots demonstrate that our model outperforms state-of-the-art methods in both quantitative metrics and processing speed. Our main contributions can be summarized as:

  • We propose a novel U-shape network, LU2Net, that integrates a lightweight convolution and adaptive channel weights for real-time underwater image enhancement. This lightweight U-shape model enables underwater robots to obtain high-quality images in real time for marine vision tasks.

  • Our specially designed block structure embeds channel-wise attention and axial convolution. This simple and fast block structure mitigates inconsistent attenuation among color channels and achieves large receptive fields while consuming little time and few resources.

  • Our model demonstrates significant advantages in experiments on a dataset and in practical tests on underwater robots. While maintaining better performance, our model runs 8 times faster than the state-of-the-art model, which makes LU2Net suitable for video enhancement.

Figure 1: The illustration of LU2Net structure. Specially designed encoder and decoder blocks enable larger receptive fields and the adaptive adjustment of channel weights. Skip connection ensures the full utilization of multi-stage information.

II RELATED WORKS

II-A UIE Methods

Various methods have been proposed to enhance recorded single underwater images, which can be divided into the following three categories.

i) Visual prior methods: These approaches focus on adjusting pixel values to improve image quality. Early methods include histogram equalization [16], which transforms a narrow unimodal image histogram into a balanced distribution. Image fusion [17, 18] takes images enhanced by different algorithms as input and generates results by fusing them. Retinex-based methods [8] obtain the actual appearance of the object by estimating and removing the illumination light in the environment. More recently, morphology-based techniques [6, 7] have been introduced to underwater image enhancement. Visual prior methods inherently improve contrast and saturation in underwater images. However, they may lead to over-enhancement when applied to real-world underwater images.

ii) Physics-based methods: These methods utilize prior knowledge of the physics of underwater imaging to guide image enhancement. For example, the UDCP algorithm [11] is based on the fact that most non-sky pixels in haze-free images possess at least one color channel with significantly low intensity. By estimating the weights of different channels through the less attenuated blue and green channels, UDCP demonstrates effectiveness in processing deep-sea images. Another approach, RDCP [12], leverages the information of the rapidly attenuated red light and considers both natural light and artificial light to achieve good enhancement results. However, as these methods focus on physical models and do not consider human visual perception, the visual quality of the enhanced images may be compromised.

iii) Deep learning methods: The emergence of deep learning has sparked interest in data-driven underwater image enhancement. CNN-based methods learn the mapping from original underwater images to high-quality images. UWCNN [19] introduced a simple CNN architecture with stacked convolution layers for UIE tasks. UIEC^2-Net [13] utilizes image properties in various color spaces. In addition, a variety of GAN-based models have been proposed, which usually take non-image information for enhancement. WaterGAN [20] estimates and utilizes depth information. PUGAN [14] combines the physical model of underwater imaging and the generative adversarial network. Recently, transformer-based methods [15] have also been introduced into UIE. Though pleasing visual effects are achieved, the importance of speed and network size is frequently ignored. Current deep learning methods often suffer from relatively slow processing speed and a reliance on high-end graphics processing hardware. This hinders their applicability for real-time vision tasks on robotic platforms with limited computing capabilities.

II-B Lightweight U-Net

U-Net employs a U-shape network with an encoder-decoder architecture for biomedical image segmentation [21]. Since its emergence, the U-shape architecture has been widely adopted in various vision tasks. For instance, it has proven effective for image enhancement: Peng et al. [15] introduced a U-shape structure into the UIE task.

After the success of U-Net, researchers have explored methods to streamline U-shape networks, aiming to reduce computational complexity and improve efficiency. For example, UNeXt, proposed in [22], combines a multi-layer perceptron with U-Net, resulting in fewer convolution layers and network parameters. In [23], U-Lite leverages depthwise separable convolution, significantly reducing the parameter count while maintaining high performance. Motivated by these works, in this paper, we adopt the idea from U-Lite to reduce network size and complexity.

III METHODOLOGY

In this section, we introduce the structure of LU2Net. As illustrated in Fig. 1, a U-shape network structure is utilized for multi-scale information extraction. A lightweight convolution architecture, axial depthwise convolution [23], is leveraged to achieve larger receptive fields with fewer layers and parameters. In addition, the channel attention module proposed in [24] is embedded in LU2Net, which adaptively adjusts channel weights and effectively corrects color distortion.

With these modules, novel encoder and decoder block structures are designed for improved performance, fewer parameters and shorter processing time. The novel block structures reduce the size and increase the speed of our model while maintaining significant enhancement performance, as illustrated in Section IV.

III-A U-Shape Network Architecture

As shown in Fig. 1, the LU2Net proposed in this paper adopts a streamlined U-shape network design, achieving high performance without excessive block stacking. Our U-shape network architecture consists of a classic encoder-decoder structure. Stacked encoder and decoder blocks provide a holistic view of multi-scale features. The skip connections in the network transfer information from earlier encoder stages to later decoder stages, improving performance by incorporating multi-stage information.

Different from the traditional U-Net [21], whose blocks are constructed from stacked convolution layers, the proposed model introduces novel encoder and decoder block structures. The blocks incorporate axial depthwise convolution, enabling larger receptive fields, so fewer convolution layers are required in our blocks. The channel attention module in each block adaptively adjusts channel weights, which provides a light and fast solution for underwater color distortion. Utilizing these block structures, LU2Net achieves state-of-the-art enhancement performance and outstanding processing speed with a small number of blocks.

III-B Axial Depthwise Convolution Module

Axial depthwise convolution is integrated in our model for higher performance with fewer parameters and lower computational complexity. It combines depthwise separable convolution [25] and axial feature extraction, resulting in a novel convolution variant [23]. By replacing common convolution layers with axial depthwise convolution, the number of parameters in the blocks decreases substantially while the performance is improved.

By transforming traditional square convolution kernels into axial variants, axial depthwise convolution expands the receptive field while reducing the number of parameters. Fig. 2 compares its receptive fields with those of traditional convolution. This modification allows the model to leverage more details from the input image, leading to superior performance with fewer parameters.

Axial depthwise convolution consists of two stages, as shown in Fig. 3. First, depthwise convolution keeps the depth of the output unchanged: each pair of horizontal and vertical convolution kernels is applied individually to one input channel. Then, pointwise convolution is carried out, convolving across all input channels simultaneously. By dividing a single convolution operation into two parts, axial depthwise convolution enables a massive reduction of parameters.
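The parameter saving can be checked with a short calculation. The sketch below counts weights for a standard k×k convolution and for an axial depthwise convolution followed by a pointwise convolution. The function names, the omission of bias terms, and the example sizes (64 channels, kernel length 7) are illustrative assumptions, not values reported in this paper.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k


def axial_depthwise_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of an axial depthwise + pointwise convolution (bias omitted)."""
    # One 1 x k (horizontal) and one k x 1 (vertical) kernel per input channel.
    depthwise = c_in * 2 * k
    # A 1 x 1 kernel over all input channels for every output channel.
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer size, chosen only for illustration.
standard = standard_conv_params(64, 64, 7)   # 200704 weights
axial = axial_depthwise_params(64, 64, 7)    # 4992 weights
reduction = standard / axial                 # roughly a 40-fold reduction
```

Under these assumptions the pointwise stage dominates the axial variant's cost, so lengthening the axial kernels enlarges the receptive field at only a small price in parameters.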

Figure 2: The comparison of receptive fields between axial depthwise convolution and traditional convolution. As the network gets deeper with more convolution layers, the receptive-field advantage of axial depthwise convolution grows. Thus more details are perceived with the number of layers unchanged.
Input: A 3-dimensional tensor In of l channels
Output: A 3-dimensional tensor Out of l channels
for i ← 0 to l − 1 do
    Hconv[i] = Conv(In[i], horizontal = True);
    Vconv[i] = Conv(In[i], vertical = True);
    DWconv[i] = Hconv[i] + Vconv[i] + In[i];
Out = Conv(DWconv, all = True)
Algorithm 1 Axial Depthwise Convolution
Figure 3: The structure of axial depthwise convolution and pointwise convolution. In axial depthwise convolution, each convolution kernel operates on a single input channel and the number of channels remains the same. In pointwise convolution, each kernel combines all input channels and corresponds to one output channel.
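The two stages can be traced in a dependency-free reference implementation. The sketch below is illustrative and not the authors' code: it works on nested Python lists, assumes odd-length kernels with zero padding, omits biases, and uses hypothetical function names. Its pointwise stage takes a generic mixing matrix, so the number of output channels need not equal the number of input channels.

```python
from typing import List

Channel = List[List[float]]  # one H x W feature map


def axial_conv(channel: Channel, kernel: List[float], horizontal: bool) -> Channel:
    """Zero-padded 1-D convolution of one channel along a single axis."""
    h, w = len(channel), len(channel[0])
    r = len(kernel) // 2  # kernel radius, assuming odd kernel length
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for k, weight in enumerate(kernel):
                yy, xx = (y, x + k - r) if horizontal else (y + k - r, x)
                if 0 <= yy < h and 0 <= xx < w:
                    acc += weight * channel[yy][xx]
            out[y][x] = acc
    return out


def axial_depthwise_conv(inp: List[Channel],
                         h_kernels: List[List[float]],
                         v_kernels: List[List[float]],
                         pointwise: List[List[float]]) -> List[Channel]:
    """Depthwise stage (horizontal + vertical + residual), then 1 x 1 mixing."""
    dw = []
    for ch, hk, vk in zip(inp, h_kernels, v_kernels):
        hc = axial_conv(ch, hk, horizontal=True)
        vc = axial_conv(ch, vk, horizontal=False)
        # Sum of both axial responses plus the input itself.
        dw.append([[a + b + c for a, b, c in zip(rh, rv, ri)]
                   for rh, rv, ri in zip(hc, vc, ch)])
    height, width = len(inp[0]), len(inp[0][0])
    # Pointwise stage: each output channel is a weighted sum over all channels.
    return [[[sum(wt * dw[i][y][x] for i, wt in enumerate(row))
              for x in range(width)] for y in range(height)]
            for row in pointwise]


# A single 3 x 3 channel of ones with all-ones kernels of length 3.
ones = [[[1.0] * 3 for _ in range(3)]]
out = axial_depthwise_conv(ones, [[1.0, 1.0, 1.0]], [[1.0, 1.0, 1.0]], [[1.0]])
```

In a deep learning framework the depthwise stage would normally be expressed as two grouped convolutions with 1×k and k×1 kernels; the explicit loops here only make the data flow visible.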

III-C Channel Attention Module

The channel attention module [24] focuses on dynamically adjusting the weights of each channel based on its features. This module offers an effective solution for channel-specific degradation in underwater images. By incorporating the channel attention module, our model can learn the significance of different channels and assign higher weights to the most informative ones. Thus, we can reduce the inconsistent attenuation by addressing the varying degradation levels among channels.

The channel attention module comprises three main components. First, global average pooling is performed on each channel, reducing its two-dimensional features to a single scalar value. Next, through a multi-layer perceptron (MLP) or a similar MLP-like architecture, the excitation process transforms these pooled values into channel-wise weights. Finally, each channel is multiplied element-wise by its corresponding weight, resulting in the final output. The process is represented in Fig. 4.

Input: A 3-dimensional tensor In of l channels
Output: A 3-dimensional adjusted tensor Out of l channels
for i ← 0 to l − 1 do
    w_0[i] = avgPool(In[i]);
w = MLP(w_0);
for i ← 0 to l − 1 do
    Out[i] = w[i] × In[i];
Algorithm 2 CALayer
Figure 4: The illustration of CALayer based on the channel attention module. The features of the input channels are first extracted by average pooling and then transformed into channel weights by an MLP-like architecture. The channels are adapted by element-wise multiplication of the input channels with the final weights.
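The three steps of the channel attention module can be written out directly. The following is a minimal sketch on nested lists, assuming the squeeze-and-excitation form of [24] for the MLP (one hidden layer with ReLU, then a sigmoid, biases omitted). The weight matrices `w1` and `w2` and the function name are hypothetical, and the exact MLP used in LU2Net may differ.

```python
import math
from typing import List

Channel = List[List[float]]  # one H x W feature map


def channel_attention(inp: List[Channel],
                      w1: List[List[float]],
                      w2: List[List[float]]) -> List[Channel]:
    """Rescale each channel by a weight computed from globally pooled features."""
    # Step 1: global average pooling, one scalar per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in inp]
    # Step 2: MLP (assumed form: linear -> ReLU -> linear -> sigmoid).
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w1]
    weights = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
               for row in w2]
    # Step 3: element-wise multiplication of every channel by its weight.
    return [[[wt * v for v in row] for row in ch] for ch, wt in zip(inp, weights)]


# Two 1 x 2 channels; a zero second layer gives every channel weight 0.5.
features = [[[2.0, 4.0]], [[0.0, 0.0]]]
scaled = channel_attention(features, w1=[[1.0, 0.0]], w2=[[0.0], [0.0]])
```

Because the weights depend only on per-channel averages, the module adds few parameters and little computation relative to the convolutions around it.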

III-D Loss Function Design

To attain a more comprehensive sense of image information and achieve a more pleasing visual effect, we design a loss function expressed as follows:

$$L_{total} = l_{RGB} + l_{LAB} + l_{LCH} + l_{SSIM} + l_{VGG}. \tag{1}$$

$l_{RGB}$, $l_{LAB}$ and $l_{LCH}$ are the mean squared errors in the RGB, LAB and LCH color spaces, $l_{SSIM}$ is the SSIM loss, and $l_{VGG}$ is the VGG loss. SSIM (Structural Similarity Index) evaluates the similarity of luminance, contrast and structure between the output image and the ground truth. The VGG loss compares the high-level features that a pretrained VGG network extracts from the two images.
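To make the structure of Eq. (1) concrete, the sketch below computes a mean squared error and a simplified SSIM term on flattened pixel lists and sums five terms with unit weights. Several points are assumptions rather than details given in the paper: the SSIM loss is taken as 1 − SSIM, SSIM is computed over the whole image instead of local windows, the constants use the common defaults K1 = 0.01 and K2 = 0.03, and the color-space conversions and VGG features are assumed to be computed elsewhere.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two flattened images."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def global_ssim(a: Sequence[float], b: Sequence[float], data_range: float = 1.0) -> float:
    """Single-window SSIM over the whole image (a simplification)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))


def ssim_loss(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed form of the SSIM loss: 1 - SSIM."""
    return 1.0 - global_ssim(a, b)


def total_loss(l_rgb: float, l_lab: float, l_lch: float,
               l_ssim: float, l_vgg: float) -> float:
    """Unweighted sum of the five terms, as in Eq. (1)."""
    return l_rgb + l_lab + l_lch + l_ssim + l_vgg
```

For identical images both `mse` and `ssim_loss` are zero, so the total reduces to whatever the remaining terms contribute.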

In the case of underwater image enhancement, where datasets are relatively small, it is essential to incorporate comprehensive image information from various perspectives to achieve superior performance. Therefore, our proposed loss function (1) leverages multiple image properties to encompass a broader range of image information, resulting in improved performance.

SSIM is a widely adopted metric for evaluating image quality, where a higher SSIM score typically indicates superior image quality. According to the experimental findings reported in [15], the RGB, LAB, and LCH color spaces yield the highest SSIM scores when applied to underwater images. Therefore, we incorporate these three color spaces and the SSIM loss into our loss function, capturing multiple image characteristics.

Furthermore, we employ the VGG loss so that the enhanced images are more perceptually pleasing.

IV EXPERIMENTS

In this section, we begin by providing an overview of the experimental settings about the processing hardware and the underwater robotic platform on which the real-world experiments are conducted. Following this, training specifics for our Lightweight Underwater UNet model are presented. Then, training experiments conducted on a dataset compare the enhancement performance and processing speed of our model with other state-of-the-art models. Moreover, our model is integrated into the underwater remotely operated vehicle (ROV) to test the visual effects of enhanced images and processing speed in real-world underwater tasks.

IV-A Experimental Settings

Our experiments are conducted on a laptop with an RTX 3060 GPU. In addition, LU2Net is run on an i7-10750H processor to test its speed without a graphics processing unit.

Figure 5: Our underwater ROV and experimental environment

Fig. 5 illustrates the structure of the underwater ROV used for real-world experiments and the experimental environment. Our underwater ROV is equipped with a waterproof camera for visual perception, which outputs video at 640×480 resolution and 30 frames per second (fps). An artificial light source provides visible light in poor lighting conditions. Thrusters attached to the body give our ROV its motion ability.

IV-B Training Details

We use a comprehensive underwater image enhancement dataset [15], LSUI, which consists of 5000 pairs of original input images and corresponding ground truth images. This dataset significantly surpasses previous datasets in terms of size, making it an ideal choice for our experiments.

For the implementation of our model, we utilize the Python programming language and the PyTorch framework. Throughout the entire training process, which spans 150 epochs, we employ the Adam optimization algorithm. The initial learning rate is set to 0.0005, and every 40 epochs, the learning rate is reduced by 20 percent.
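The learning-rate schedule described above can be sketched as a step decay. The snippet assumes that "reduced by 20 percent" means multiplying the current rate by 0.8 at epochs 40, 80 and 120 (counting epochs from 0); the function name is hypothetical. In PyTorch, a schedule of this form is commonly expressed with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.8)`, although the paper does not state which scheduler was used.

```python
def learning_rate(epoch: int, base_lr: float = 5e-4,
                  step: int = 40, factor: float = 0.8) -> float:
    """Step decay: multiply the rate by `factor` once every `step` epochs."""
    return base_lr * factor ** (epoch // step)


# Learning rate at every epoch of a 150-epoch run.
schedule = [learning_rate(epoch) for epoch in range(150)]
```

Under this reading the rate falls from 0.0005 to 0.000256 over the 150 epochs.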

IV-C Model Evaluation

As is customary, we partition the LSUI dataset into a training dataset and a testing dataset, following an 8:2 split. The images in the dataset are resized to 256×256×3 and normalized so that pixel values fall within the range [−1, 1].
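The normalization step can be illustrated with a pair of helper functions. This sketch assumes 8-bit input images with pixel values in 0 to 255, which the paper does not state explicitly; the function names are hypothetical.

```python
def normalize(pixel: int) -> float:
    """Map an 8-bit pixel value in [0, 255] to the range [-1, 1]."""
    return pixel / 127.5 - 1.0


def denormalize(value: float) -> int:
    """Map a network output in [-1, 1] back to an 8-bit pixel value."""
    pixel = round((value + 1.0) * 127.5)
    # Clamp in case the network output slightly overshoots the range.
    return min(255, max(0, pixel))
```

The inverse mapping is needed when an enhanced frame is written back to an image or video stream.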

To evaluate the enhancement performance and processing efficiency of our proposed method, we compare it with recent state-of-the-art UIE methods. Each model undergoes training on the designated training dataset for a total of 150 epochs.

During the comparison, we assess various metrics, including image quality, processing time, and parameter count, to provide a comprehensive evaluation of different methods.

IV-C1 Enhancement Performance Evaluation

Figure 6: Illustration of enhancement results by different models. Panels: (a) Real-world images, (b) MLLE, (c) U-Trans, (d) Ours, (e) Ground truth.

MLLE serves as an example of a non-deep-learning approach for underwater image enhancement, while U-Trans stands out as a recently proposed deep learning method that achieves superior results. Additionally, UWCNN and WaterNet are notable examples of previous state-of-the-art UIE models.

MLLE inherits the limitations associated with traditional UIE methods, including color distortion and an inability to handle diverse underwater color and lighting conditions. In contrast, both U-Trans and our proposed model generate visually appealing enhanced images. These enhanced images enable improved identification of underwater objects and the surrounding environment, thereby facilitating higher-level tasks such as object tracking, object detection, and obstacle avoidance.

From the comparison in Table I and the enhanced underwater images in Fig. 6, our model attains state-of-the-art enhancement performance among the selected methods. The evaluation covers PSNR, SSIM and subjective visual quality. Our model achieves the highest PSNR (25.549) and SSIM (0.868) of the compared methods. We attribute this to the axial depthwise convolution and the channel attention module in our model's architecture, which target the channel-dependent degradation of underwater images.

TABLE I: Comparisons of enhancement performance
Method      PSNR (dB)   SSIM
MLLE        18.133      0.730
WaterNet    19.517      0.813
UWCNN       22.621      0.825
U-Trans     25.032      0.843
Ours        25.549      0.868
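For reference, the PSNR values in Table I follow the standard definition, PSNR = 10 log10(MAX² / MSE). The sketch below applies it to flattened pixel lists; the function name and the default peak value of 1.0 are illustrative assumptions, and the paper does not state the value range used for its evaluation.

```python
import math
from typing import Sequence


def psnr(output: Sequence[float], target: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((o - t) ** 2 for o, t in zip(output, target)) / len(output)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, a gain of about 3 dB corresponds to halving the mean squared error.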

IV-C2 Model Speed and Parameters Comparison

The experimental results demonstrate that our model achieves a remarkable processing speed. Utilizing a common RTX 3060 laptop, our model is capable of outputting videos at 100 fps, thereby meeting the real-time demand of underwater tasks. Additionally, without a GPU, our model can still generate enhanced images at the speed of 12 fps on an i7-10750H processor.

Moreover, the experimental results show that our model has the lowest demand for computational resources, as indicated by the lowest FLOPs (floating point operations) requirement. Furthermore, with a relatively small number of parameters, our model does not require a large amount of memory. Consequently, our model is lightweight enough to be deployed on underwater robotic platforms while achieving impressive enhancement results. As a low-level underwater image enhancement solution, our model preserves ample computational resources for other tasks and can run alongside high-level vision models.

TABLE II: Comparisons of processing speed and parameters
Method      FLOPs     Parameters   Time/Frame
MLLE        /         /            0.163s
WaterNet    143.3G    1.0M         0.475s
UWCNN       5.2G      40.0K        0.005s
U-Trans     60.2G     22.8M        0.08s
Ours        2.8G      176K         0.01s
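The frame rates and the speed-up quoted in the text follow directly from the Time/Frame column of Table II. The short sketch below converts the reported times to frames per second and checks which methods keep up with the 30 fps camera of the ROV; the variable and function names are illustrative.

```python
# Time per frame in seconds, copied from Table II.
TIME_PER_FRAME_S = {
    "MLLE": 0.163,
    "WaterNet": 0.475,
    "UWCNN": 0.005,
    "U-Trans": 0.08,
    "Ours": 0.01,
}
CAMERA_FPS = 30  # frame rate of the ROV camera


def fps(seconds_per_frame: float) -> float:
    """Frames per second for a given processing time per frame."""
    return 1.0 / seconds_per_frame


# Speed-up of LU2Net over U-Trans, the strongest baseline in Table I.
speedup_vs_utrans = TIME_PER_FRAME_S["U-Trans"] / TIME_PER_FRAME_S["Ours"]

# Methods fast enough to process every camera frame.
real_time_models = [name for name, t in TIME_PER_FRAME_S.items()
                    if fps(t) >= CAMERA_FPS]
```

With these numbers LU2Net reaches about 100 fps, U-Trans 12.5 fps, and only UWCNN and LU2Net meet the 30 fps budget.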

IV-D Real-world Test

To test the performance of our model on real-world robots, we conduct a real-time underwater image enhancement experiment on our vision-driven underwater ROV. Because the other models are too slow for real-time use, we only compare UWCNN and our model, both of which can handle real-time enhancement tasks. In the absence of ground truth, PSNR and SSIM are not suitable for assessing real-world underwater image enhancement. Instead, we choose UCIQE [26] as the alternative metric, which combines chroma, saturation, and contrast to quantify image quality. A higher UCIQE value generally indicates better image quality.

Figure 7: Real-world enhancement results by real-time models. Panels: (a) Real-world, (b) UWCNN, (c) Ours. The UCIQE value is indicated in the top left corner of each image.

The real-world experiments shown in Fig. 7 demonstrate the ability of LU2Net to provide high-quality enhanced underwater videos. The 30 fps video output by the camera is enhanced without noticeable delay, showcasing the potential of our model for real-time marine vision tasks.

Subjectively, our model produces balanced and detailed images. Moreover, the higher UCIQE values further support the performance of LU2Net. With our model, human operators can clearly identify underwater objects and environments through well-enhanced videos and make proper decisions for better robotic safety and performance.

V CONCLUSIONS

In this paper, we proposed a lightweight underwater image enhancement model that effectively removes image distortion and provides real-time enhancement. Our model combines axial depthwise convolution, the channel attention module, and the U-shape network structure, resulting in a lightweight network with fast processing speed. The experiments conducted on the dataset and on real-world robots confirmed the model's low computational requirements and efficient resource utilization, which enable it to handle real-time image and video enhancement. This advantage makes our model well suited for integration into underwater robots for real-time vision tasks.

References

  • [1] D. R. Yoerger, A. F. Govindarajan, J. C. Howland, J. K. Llopiz, P. H. Wiebe, M. Curran, J. Fujii, D. Gomez-Ibanez, K. Katija, B. H. Robison et al., “A hybrid underwater robot for multidisciplinary investigation of the ocean twilight zone,” Science Robotics, vol. 6, no. 55, p. eabe1901, 2021.
  • [2] B. Yu, J. Wu, and M. J. Islam, “Udepth: Fast monocular depth estimation for visually-guided underwater robots,” in Proceedings of IEEE International Conference on Robotics and Automation, 2023, pp. 3116–3123.
  • [3] A. Bucci, L. Zacchini, M. Franchi, A. Ridolfi, and B. Allotta, “Comparison of feature detection and outlier removal strategies in a mono visual odometry algorithm for underwater navigation,” Applied Ocean Research, vol. 118, p. 102961, 2022.
  • [4] J. S. Jaffe, “Computer modeling and the design of optimal underwater imaging systems,” IEEE Journal of Oceanic Engineering, vol. 15, no. 2, pp. 101–111, 1990.
  • [5] H. S. Lee, S. W. Moon, and I. K. Eom, “Underwater image enhancement using successive color correction and superpixel dark channel prior,” Symmetry, vol. 12, no. 8, 2020.
  • [6] S.-B. Gao, M. Zhang, Q. Zhao, X.-S. Zhang, and Y.-J. Li, “Underwater image enhancement using adaptive retinal mechanisms,” IEEE Transactions on Image Processing, vol. 28, no. 11, pp. 5580–5595, 2019.
  • [7] J. Yuan, W. Cao, Z. Cai, and B. Su, “An underwater image vision enhancement algorithm based on contour bougie morphology,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 10, pp. 8117–8128, 2020.
  • [8] S. Zhang, T. Wang, J. Dong, and H. Yu, “Underwater image enhancement via extended multi-scale retinex,” Neurocomputing, vol. 245, pp. 1–9, 2017.
  • [9] W. Zhang, P. Zhuang, H.-H. Sun, G. Li, S. Kwong, and C. Li, “Underwater image enhancement via minimal color loss and locally adaptive contrast enhancement,” IEEE Transactions on Image Processing, vol. 31, pp. 3997–4010, 2022.
  • [10] M. Yang, A. Sowmya, Z. Wei, and B. Zheng, “Offshore underwater image restoration using reflection-decomposition-based transmission map estimation,” IEEE Journal of Oceanic Engineering, vol. 45, no. 2, pp. 521–533, 2020.
  • [11] P. Drews, E. Nascimento, F. Moraes, S. Botelho, and M. Campos, “Transmission estimation in underwater single images,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2013, pp. 825–830.
  • [12] A. Galdran, D. Pardo, A. Picón, and A. Alvarez-Gila, “Automatic red-channel underwater image restoration,” Journal of Visual Communication and Image Representation, vol. 26, pp. 132–145, 2015.
  • [13] Y. Wang, J. Guo, H. Gao, and H. Yue, “Uiec^2-net: Cnn-based underwater image enhancement using two color space,” Signal Processing: Image Communication, vol. 96, p. 116250, 2021.
  • [14] R. Cong, W. Yang, W. Zhang, C. Li, C.-L. Guo, Q. Huang, and S. Kwong, “Pugan: Physical model-guided underwater image enhancement using gan with dual-discriminators,” IEEE Transactions on Image Processing, vol. 32, pp. 4472–4485, 2023.
  • [15] L. Peng, C. Zhu, and L. Bian, “U-shape transformer for underwater image enhancement,” IEEE Transactions on Image Processing, 2023.
  • [16] M. S. Hitam, E. A. Awalludin, W. N. J. H. W. Yussof, and Z. Bachok, “Mixture contrast limited adaptive histogram equalization for underwater image enhancement,” in Proceedings of International Conference on Computer Applications Technology, 2013, pp. 1–5.
  • [17] C. Ancuti, C. O. Ancuti, T. Haber, and P. Bekaert, “Enhancing underwater images and videos by fusion,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 81–88.
  • [18] C. O. Ancuti, C. Ancuti, C. De Vleeschouwer, and P. Bekaert, “Color balance and fusion for underwater image enhancement,” IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 379–393, 2017.
  • [19] C. Li, S. Anwar, and F. Porikli, “Underwater scene prior inspired deep underwater image and video enhancement,” Pattern Recognition, vol. 98, p. 107038, 2020.
  • [20] J. Li, K. A. Skinner, R. M. Eustice, and M. Johnson-Roberson, “Watergan: Unsupervised generative network to enable real-time color correction of monocular underwater images,” IEEE Robotics and Automation Letters, vol. 3, no. 1, pp. 387–394, 2017.
  • [21] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Proceedings of Medical Image Computing and Computer-Assisted Intervention.   Springer, 2015, pp. 234–241.
  • [22] J. M. J. Valanarasu and V. M. Patel, “Unext: Mlp-based rapid medical image segmentation network,” in Proceedings of International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2022, pp. 23–33.
  • [23] B.-D. Dinh, T.-T. Nguyen, T.-T. Tran, and V.-T. Pham, “1m parameters are enough? a lightweight cnn-based model for medical image segmentation,” in Proceedings of Asia Pacific Signal and Information Processing Association Annual Summit and Conference, 2023, pp. 1279–1284.
  • [24] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
  • [25] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, July 2017.
  • [26] M. Yang and A. Sowmya, “An underwater color image quality evaluation metric,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 6062–6071, 2015.