UCM-Net: A Lightweight and Efficient Solution for Skin Lesion Segmentation using MLP and CNN
Abstract
Skin cancer presents a formidable public health challenge, with its incidence expected to rise significantly in the coming decades. Early diagnosis is vital for effective treatment, highlighting the importance of computer-aided diagnosis systems in detecting and managing this disease. A critical task in these systems is accurately segmenting skin lesions from images, which is essential for further analysis, classification, and detection. This task is notably complex due to the diverse characteristics of lesions, such as their appearance, shape, size, color, texture, and location, compounded by image quality issues like noise, artifacts, and occlusions. While recent advancements in deep learning have shown promise in skin lesion segmentation, they often need more computational and extensive parameter requirements, limiting their practicality for mobile health applications. Addressing these challenges, we introduce UCM-Net, a novel, efficient, and lightweight model that synergizes the strengths of Multi-Layer Perceptrons (MLP) and Convolutional Neural Networks (CNN). The innovative UCM-Net Block diverges from conventional UNet architectures, offering significant reductions in parameter count and enhancing the model’s learning efficiency. This results in robust segmentation performance while maintaining a low computational footprint. UCM-Net’s performance is rigorously evaluated using the PH2, ISIC 2017, and ISIC 2018 datasets. It demonstrates its effectiveness with fewer than 50K parameters and requires less than 0.05 Giga-Operations Per Second (GLOPs). Moreover, its minimal memory requirement is just 1.19MB in CPU environment positions. It is a potential benchmark for efficiency in skin lesion segmentation, suitable for deployment in resource-constrained settings. In order to facilitate accessibility and further research in the field,the UCM-Net source code is https://github.com/chunyuyuan/UCM-Net.
Keywords Medical image segmentation Light-weight model Mobile health
1 Introduction
Skin cancer poses a significant global health concern and stands as one of the leading cancer types worldwide. Skin cancer can be broadly categorized into two types: melanoma and non-melanoma. While melanoma accounts for only 1% of cases, it is responsible for the majority of deaths due to its aggressive nature. In 2022, it was estimated that melanoma would account for approximately 7,650 deaths in the United States, affecting 5,080 men and 2,570 women [1, 2]. In addition, it is estimated that the United States will have 97,610 new cases of melanoma in 2023. Current statistics suggest that one in five Americans will develop skin cancer at some point in their lives, underscoring the gravity of this issue. Over the past few decades, skin cancer has emerged as a substantial public health problem, resulting in annual expenses of approximately $ 8.1 billion in the United States alone [3].
Skin cancer [4] is a prevalent and potentially life-threatening disease affecting millions worldwide. Among the various types of skin cancer, malignant melanoma is known for its rapid progression and high mortality rate if not detected and treated early. Early and accurate diagnosis is, therefore, critical to improving patient outcomes. Medical imaging [5], particularly dermatoscopy and dermoscopy, is crucial in diagnosing skin cancer. Dermatologists and healthcare professionals rely on these imaging techniques to examine and analyze skin lesions for signs of malignancy. However, the manual interpretation of such images with a naked eye is a time-consuming and error-prone process, heavily reliant on the expertise of the examining physician [6, 7].
To address these challenges and improve the accuracy and efficiency of skin cancer diagnosis, computer-aided tools and artificial intelligence (AI) have been leveraged in recent years[8, 9, 10]. Skin cancer segmentation, a fundamental step in the diagnostic process, involves delineating the boundaries of skin lesions within medical images. This task is essential for quantifying lesion characteristics, monitoring changes over time, and aiding in the decision-making process for treatment. Segmenting skin lesions from images faces several key challenges [11]: unclear boundaries where the lesion blends into surrounding skin; illumination variations that alter lesion appearance; artifacts like hair and bubbles that obscure lesion boundaries; variability in lesion size and shape; different imaging conditions and resolutions; age-related skin changes affecting texture; complex backgrounds that hinder segmentation; and differences in skin color due to race and climate. Figure 1 presents some representative samples of complex skin lesion.
Overcoming these difficulties is crucial for accurate segmentation to enable early diagnosis and treatment of malignant melanoma. Recently, a groundbreaking transformation in skin cancer segmentation has been driven by the development of advanced deep-learning algorithms [12, 13, 14, 15, 16]. These AI-driven approaches have exhibited remarkable capabilities in automating the segmentation of skin lesions, significantly reducing the burden on healthcare professionals and potentially improving diagnostic accuracy. In addition, the rapid advancements in AI techniques and the widespread adoption of smart devices,such as the point-of-care ultrasound (POCUS) devices or smartphones [17, 18, 19], have brought about transformative changes in the healthcare industry [20]. Figure 2 briefly presents the entire diagnose of skin cancer detection with portable devices and AI techniques.
Patients now have greater access to medical information, remote monitoring, and personalized care, leading to increased satisfaction with their healthcare experiences. However, amidst these advancements, there are still challenges that need to be addressed. One such challenge is the accurate and efficient segmentation of skin lesions for diagnostic purposes within limited computation hardwares and devices. Most of AI medical methods are developed based on deep-learning [21]. The major deep-learning methods utilize expensive computation overhead and a large number of learning parameters to achieve a good prediction result. It is a challenge to embed these methods to hardware-limit devices [22, 23]. In this study, we introduce UCM-Net, a lightweight and robust approach for skin lesion segmentation. UCM-Net leverages a novel hybrid module that combines Convolutional Neural Networks (CNN) and Multi-Layer Perceptrons (MLP) to enhance feature learning while reducing parameters. Utilizing group loss functions, our method surpasses existing machine learning-based techniques in skin lesion segmentation.
Key contributions of UCM-Net include:
-
1.
Hybrid Module: We introduce the UCM-Net Block, a hybrid structure combining CNN and MLP with superior feature-learning capabilities and reduced computation and parameters.
-
2.
Efficient Segmentation: UCM-Net is developed based on UCM-Net Blocks and the base model U-Net, offering a highly efficient method for skin lesion segmentation. It is the first model with less than 50 KB parameters and less than 0.05 Giga-Operations Per Second (GLOPs). UCM-Net is 1177 times faster and has 622 times fewer parameters than U-Net. Compared to the state-of-the-art EGE-UNet, UCM-Net reduces parameter and computation costs by 1.06x and 1.56x.
-
3.
Light-weight Segmentation: Compared to published models, our model contains lower number of parameters. Besides,we further reduced the model’s memory usage by engineering methods without third party library. Finally, our UCM-Net only spent 1.19MB on CPU environment.
-
4.
New Group Loss Function: In the training parts, we present a new group loss function with the output loss and internal stage losses. The new designed function can improve UCM-Net’s learning ability.
-
5.
Improved Segmentation: UCM-Net’s segmentation performance is evaluated using mean Intersection over Union (mIoU) and mean Dice similarity score (mDice). On the PH2, Isic2017 and Isic2018 datasets, UCM-Net enhances the baseline U-Net model by an average of 2.53% in mIoU and 1.50% in mDice. Notably, UCM-Net outperforms the state-of-the-art EGE-UNet on Ph2, ISIC 2017 and ISIC 2018 datasets, with respective mean IoU scores of 89.62% (UCM-Net) vs 88.39% , 80.71% (UCM-Net) vs 80.22% and 81.26% (UCM-Net) vs 81.16%.
2 Related works
AI Method Categories and Applications AI-driven approaches for biomedical images can be broadly classified as supervised learning methods, semi-supervised learning methods, and unsupervised learning methods [24, 25, 26]. Supervised learning is a solution with labeled image data, expecting to develop predictive capability. The labeled image data can be the previous patients’ diagnosed results, such as computed tomography(CT), with Clinicians’ analysis. With the labeled data, AI-driven solutions can be developed and performed against the ground truth results. Supervised learning solutions are widely applied to disease classification, tumor location detection, and tumor segmentation [27]. Relatively, unsupervised learning is a discovering process, diving into unlabeled data to capture hidden information. Unsupervised learning solutions derive insights directly from unlabeled medical data without inadequate or biased human supervisions and can be used for information compression, dimensional reduction, super resolution for medical image and sequence data detection and analysis such as protein, DNA and RNA [28]. In recent years, semi-supervised learning is becoming popular, which utilizes a large number of unlabeled data in conjunction with the limited amount of labeled data to train higher-performing models. Semi-supervised learning solutions can be also applied into disease classification and medical segmentation [29, 30].
TinyML for healthcare Integrating the potential of TinyML into healthcare, particularly for tasks such as lesion segmentation, offers an exciting avenue for future research and practical applications. TinyML, which refers to the deployment of machine learning models on low-power, miniature hardware, presents a unique opportunity to bring advanced analytical capabilities directly to the point of care. This approach could democratize access to sophisticated medical image analysis technologies, enabling real-time, on-device processing in environments where traditional computing resources are scarce or in mobile healthcare applications. For instance, applying TinyML for lesion segmentation could facilitate immediate diagnostic insights during patient examinations or in remote areas, significantly reducing the time and infrastructure typically required for such analyses. The integration of TinyML into healthcare devices promises to enhance diagnostic processes, improve patient outcomes, and extend the reach of advanced medical technologies to underserved regions.To further enhance the efficiency and feasibility of deploying TinyML in such critical applications, researchers are exploring innovative approaches like hyper-structure optimization[31] and the use of quantitative methods such as binary neural networks[32]. Hyper-structure optimization aims to minimize the number of parameters without compromising the model’s performance, thus ensuring that the models are lightweight yet effective enough for deployment on tiny devices. As we explore the optimization and application of UCM-Net in this context, we acknowledge the pioneering work and insights offered by Partha. in their exploration of TinyML’s potential in healthcare [33], which serves as a survey foundational reference for our proposed direction.
Supervised Methods of Segmentation As the technique evolves and develops, the solution of AI for medical image segmentation is from purely applying a convolution neural network(CNN) such as U-Net and Att-UNet [34] to a hybrid structure method like TransUNet [35], TransFuse [36] and SANet [37]. U-Net is a earest CNN solution on biomedical image segmentation, which replaces pooling operators by upsampling operators. Att-UNet is developed on the top of U-Net adding attention structures. TransUNet merits both Transformers and U-Net for medical image segmentation. TransFuse is a novel approach that combines Transformers and CNNs with late fusion for medical image segmentation, achieving a strong performance while maintaining high efficiency, with potential applications in various medical-related tasks. SANet [38], the Shallow Attention Network, addresses challenges in polyp segmentation by mitigating color inconsistencies, preserving small polyps through shallow attention modules, and balancing pixel distributions, achieving remarkable performance improvements. Swin-UNet is a UNet-like pure Transformer for medical image segmentation that proposed shifted windows as the encoder to extract context features, and a transformer-based decoder with patch expanding layer performs the up-sampling operation to restore the spatial resolution of the feature maps. MedT [39] is also a transformer-based network architecture, a gated axial-attention model that introduces an additional control mechanism in the self-attention module. ConvUNeXt [40], an efficient model inspired by ConvNeXts [41] and based on the classic UNet architecture, achieving excellent medical image segmentation results with a significantly reduced parameter count while incorporating features such as large convolution kernels, depth-wise separable convolution, residual connections, and a lightweight attention mechanism. UNeXt [42] is introduced as an efficient Convolutional multilayer perceptron (MLP) based network that reduces parameters and computational complexity while achieving superior segmentation performance through tokenized MLP blocks and channel shifting, making it suitable for point-of-care applications. MALUNet [43] and its extended version EGE-UNet [44] develop new attention modules to significantly reduce parameters and computational complexity while achieving powerful skin lesion segmentation performance, making it highly suitable for resource-constrained clinical environments.
3 UCM-Net
Network Design Figure 4 provides a comprehensive view of the structural framework of UCM-Net, an advanced architecture that showcases a distinctive U-Shape design. Our design is developed from U-Net. UCM-Net includes a down-sampling encoder and an up-sampling decoder, resulting in a high-powered network for skin lesion segmentation. The entirety of the network encompasses six stages of encoder-decoder units, each equipped with channel capacities of {8, 16, 24, 32, 48, 64}. Within each stage, we leverage a convolutional block alongside our novel UCMNet block, facilitating the extraction and acquisition of essential features. In the convolutional block, we opt for a kernel size of 1, a choice that serves to further curtail the parameter count. Our innovative UCMNet introduces a hybrid structure module, wherein an amalgamation of a Multi-Layer Perceptron (MLP) linear component and Convolutional Neural Network (CNN) is employed, bolstered by the inclusion of skip connections. This strategic amalgamation fortifies the network’s prowess in feature acquisition and learning capabilities.
Convolution Block In our designing, we contain two different convolution blocks. Convolution Block 1 utilizes a kernel size of 3×3, which is commonly employed for its effectiveness in capturing spatial relationships within the input features. This size is particularly advantageous in the network’s initial layers, where preserving the spatial integrity of feature maps is essential for decoding complex input patterns. Convolution Block 2 is characterized by a 1×1 kernel size, drastically reducing the number of learnable parameters and computational load[45]. The smaller kernel size serves to linearly combine features across channels, allowing for dimensionality reduction and feature recombination without spatial aggregation, thus enhancing the network’s efficiency.
UCM-Net Block The UCM-Net Block exemplifies a sophisticated approach to combining Convolutional Neural Networks (CNNs) with Multilayer Perceptrons (MLPs) for robust feature learning. This hybrid model benefits from the spatial feature extraction capabilities of CNNs and the pattern recognition strengths of MLPs through a series of strategically designed operations. Initially, the input feature map undergoes resha** to cater to the distinct structural requirements of CNNs and MLPs—transitioning from a four-dimensional tensor for CNN processing to a three-dimensional tensor for MLP operations. Following this, the block applies a sequence of linear and convolutional transformations, each accompanied by normalization and activation functions. This includes the strategic use of different kernel sizes in the convolutional stages to balance computational efficiency with the network’s learning capacity. Furthermore, a residual connection is integrated, enhancing information flow through the network and supporting the construction of deeper architectures by mitigating the vanishing gradient problem. Through these mechanisms, UCM-Net adeptly captures and processes complex patterns in image data, demonstrating its potential for advanced image analysis applications. The pseudocode of UCM-Net Block 1 presents our defined sequence of operations, which is how we combine CNN with MLP(Linear transformation operation) for feature learning.
Loss functions: In our solution, we designed a group loss function similar to those used in TransFuse [36] and EGE-UNet [44]. However, the base loss function for each stage is distinct. The base loss function from TransFuse and EGE-UNet is calculated using binary cross-entropy (BCE) (1) and Dice loss (Dice) (2) components. Our proposed base loss function is calculated from BCE components and squared-Dice loss (SDice) components, to calculate the loss from the scaled layer masks in different stages compared with the ground truth masks. Equations (1) and (2) present the stage loss in different layers and the output loss in the output layer, which are calculated using binary cross-entropy (BCE) and Dice loss (Dice) components, respectively.
(1) |
where is the total number of pixels (for image segmentation) or elements (for other tasks), is the ground truth value, and is the predicted probability for the -th element.
(2) |
where smooth is a small constant added to improve numerical stability.
(3) |
which represents an enhancement over the standard Dice loss by emphasizing the squared terms of intersections and unions.
(4) |
(5) |
Equations (1) and (3) define the base loss functions (5) for our proposed model, incorporating the Dice loss, and squared-Dice loss components. is the weight for different stages. In this paper, we set to 0.1, 0.2, 0.3, 0.4, and 0.5 based on the i-th stage, as illustrated in Figure 4. Equations (8) is our proposed group loss function that calculates the loss from the scaled layer masks in different stages with ground truth masks. Equation and present the stage loss in different stage layer and output loss in the output layer.
(6) |
(7) |
(8) |
4 Experiments and Results
4.1 Experiments Setting
Datasets To evaluate the efficiency and performance of model with other published models, we pick the three public skin segementation datasets from PH2, International Skin Imaging Collaboration, namely ISIC2017 [46, 47] and ISIC2018 [48, 49]. The PH2 [50] is a dermoscopic image database which 400 images of contains manual segmentation and the clinical diagnosis. The ISIC2017 dataset comprises 2150 dermoscopy images, and ISIC2018 includes 2694 images. We noted that earlier studies[44, 43] have already presented a dataset version with a pre-established train-test partition, maintaining a 7:3 ratio. In our experimental setup, we opted to utilize the previously published dataset version.
Implementation Details Our UCM-Net is implemented with Pytorch [51] framework. All experiments are conducted on the instance node at Lambda [52] that has a single NVIDIA RTX A6000 GPU (24 GB), 14vCPUs, 46 GiB RAM and 512 GiB SSD. The images are normalized and resized to 256256. Simple data augmentations are applied, including horizontal flip**, vertical flip**, and random rotation. We noticed the prior studies [44, 43] applied initial image processing with the calculated mean and standard deviation (std) values of the whole train and test datasets separately. While this approach can potentially enhance their models’ training and testing performance, the outcomes are notably influenced by the computed mean and std values. Additionally, if the test dataset’s context information is unknown, this operation can render the trained model less practical. In our experiment, we don’t calculate the mean and std values based on the train and test datasets. Besides, TransFuse-S and SwinNet require the pre-train models with the specified input image size in their encoding stage. To enable fair benchmark testing, we follow the image input size for the TransFuse-S [36, 53],TransUNet[35, 54] and SwinNet [37, 55] to 192256, 224224 and 224224, Correspondingly. For the optimizer, we select AdamW [56] initialized with a learning rate of 0.001 and a weight decay of 0.01. The CosineAnnealingLR [57] is Utilized as the scheduler with a maximum number of iterations of 50 and a minimum learning rate of 1e-5. A total of 300 epochs are trained with a training batch size of 8 and a testing batch size of 1.
Evaluate Metrics To assess the predictive performance of our methods, we employ mean Intersection over Union (mIoU) and mean Dice similarity score (mDice) as evaluation metrics. It’s worth noting that previous studies [44, 43] and [42] have employed distinct calculation methods for mIoU and mDice. To comprehensively compare the performance predictions, our experiments include the presentation of mIoU, mDice, mIoU*, and mDice* results. These results are calculated using the following equations:
(9) |
(10) |
where intersection represents the number of pixels that are common between the predicted output and the ground truth, smooth is a small constant added to improve numerical stability and union represents the total number of pixels in both the predicted output and the ground truth.
(11) |
(12) |
where N is the number of images, represents the IoU score for image i and represents the Dice score for image i.
(13) |
(14) |
where TP represents the number of true positive pixels, FP represents the number of false positive pixels and FN represents the number of false negative pixels.
(15) |
(16) |
where represents the total number of true positive pixels for images, represents the total number of false positive pixels for images and represents the total number of false negative pixels for images.
In our benchmark experiments, we evaluate our method’s performance and compare the results among other published efficient models’. To ensure a fair comparison, we perform three sets of experiments for each method and subsequently present the mean and std of the prediction outcomes across each dataset.
4.2 Performance Comparisons
Dataset | Models | Year | mIoU(%)↑ | mDice(%)↑ | mIoU*(%)↑ | mDice*(%)↑ |
---|---|---|---|---|---|---|
PH2 | TransUnet∗ [35, 54] | 2021 | 90.77 ± 0.030 | 95.09 ± 0.014 | 91.89 ± 0.148 | 95.77 ± 0.080 |
SANet∗ [38, 58] | 2021 | 89.99 ± 0.349 | 94.62 ± 0.206 | 91.06 ± 0.481 | 95.32 ± 0.264 | |
SwinNet∗ [37, 55] | 2021 | 88.92 ± 0.493 | 94.03 ± 0.289 | 89.96 ± 0.388 | 94.71 ± 0.215 | |
TransFuse-S∗ [36, 53] | 2021 | 91.15 ± 0.131 | 95.30 ± 0.078 | 92.16 ± 0.113 | 95.92 ± 0.061 | |
U-Net [59, 60] | 2015 | 89.09 ± 0.224 | 94.13 ± 0.112 | 90.08 ± 0.545 | 94.78 ± 0.302 | |
Att-UNet [61, 34] | 2018 | 89.01 ± 0.345 | 94.06 ± 0.222 | 89.82 ± 0.371 | 94.64 ± 0.206 | |
MedT [39, 62] | 2021 | 87.77 ± 0.099 | 93.35 ± 0.063 | 88.53 ± 0.338 | 93.92 ± 0.190 | |
ConvUNeXt [40, 63] | 2022 | 89.46 ± 0.056 | 94.33 ± 0.035 | 90.39 ± 0.172 | 94.95 ± 0.095 | |
UNeXt-S [42, 64] | 2022 | 89.03 ± 0.246 | 94.09 ± 0.163 | 89.689 ± 0.269 | 94.56 ± 0.149 | |
MALUNet [43, 65] | 2022 | 87.75 ± 0.378 | 93.35 ± 0.214 | 88.19 ± 0.329 | 93.73 ± 0.186 | |
EGE-UNet [44, 66] | 2023 | 88.39 ± 0.303 | 93.73 ± 0.189 | 88.93 ± 0.802 | 94.14 ± 0.450 | |
UCM-Net (ours) | 2023 | 89.62 ± 0.221 | 94.42 ± 0.124 | 90.51 ± 0.211 | 94.96 ± 0.116 |
-
•
∗: this method needs the pre-train model on training.
Dataset | Models | Year | mIoU(%)↑ | mDice(%)↑ | mIoU*(%)↑ | mDice*(%)↑ |
---|---|---|---|---|---|---|
isic2017 | TransUnet∗ [35, 54] | 2021 | 81.12 ± 0.152 | 88.38 ± 0.157 | 80.02 ± 0.011 | 88.90 ± 0.113 |
SANet∗[38, 58] | 2021 | 79.34 ± 0.023 | 87.04 ± 0.155 | 79.34 ± 0.102 | 88.15 ± 0.060 | |
SwinNet∗ [37, 55] | 2021 | 80.51 ± 0.102 | 87.81 ± 0.231 | 80.48 ± 0.134 | 89.19 ± 0.073 | |
TransFuse-S∗ [36, 53] | 2021 | 80.73 ± 0.312 | 88.03 ± 0.111 | 79.87 ± 0.109 | 88.81 ± 0.058 | |
U-Net [59, 60] | 2015 | 78.12 ± 0.175 | 85.97 ± 0.196 | 76.42 ± 0.381 | 86.63 ± 0.245 | |
Att-UNet [61, 34] | 2018 | 78.45 ± 0.113 | 86.22 ± 0.124 | 77.14 ± 0.097 | 87.10 ± 0.062 | |
MedT [39, 62] | 2021 | 79.05 ± 0.231 | 86.59 ± 0.125 | 77.61 ± 0.121 | 87.40 ± 0.401 | |
ConvUNeXt [40, 63] | 2022 | 78.78 ± 0.362 | 86.21 ± 0.267 | 76.98 ± 0.490 | 86.99 ± 0.313 | |
UNeXt-S [42, 64] | 2022 | 78.79 ± 0.234 | 86.30 ± 0.140 | 77.48 ± 0.466 | 87.31 ± 0.296 | |
MALUNet [43, 65] | 2022 | 79.97 ± 0.389 | 87.23 ± 0.345 | 79.11 ± 0.345 | 88.34 ± 0.215 | |
EGE-UNet [44, 66] | 2023 | 80.22 ± 0.237 | 87.24 ± 0.149 | 79.71 ± 0.411 | 88.71 ± 0.255 | |
UCM-Net (ours) | 2023 | 80.71 ± 0.345 | 87.66 ± 0.221 | 79.29 ± 0.188 | 88.45 ± 0.117 |
-
•
∗: this method needs the pre-train model on training.
Dataset | Models | Year | mIoU(%)↑ | mDice(%)↑ | mIoU*(%)↑ | mDice*(%)↑ |
---|---|---|---|---|---|---|
isic2018 | TransUnet∗ [35, 54] | 2021 | 82.00 ± 0.152 | 89.11 ± 0.031 | 81.68 ± 0.511 | 89.92 ± 0.010 |
SANet∗ [38, 58] | 2021 | 80.37 ± 0.124 | 87.87 ± 0.114 | 79.39 ± 0.135 | 88.51 ± 0.084 | |
SwinNet∗ [37, 55] | 2021 | 81.41 ± 0.069 | 88.58 ± 0.019 | 80.72 ± 0.069 | 89.33 ± 0.042 | |
TransFuse-S∗ [36, 53] | 2021 | 81.69 ± 0.128 | 88.82 ± 0.134 | 81.29 ± 0.084 | 89.68 ± 0.070 | |
U-Net [59, 60] | 2015 | 79.86 ± 0.075 | 87.57 ± 0.085 | 78.27 ± 0.300 | 87.81 ± 0.188 | |
Att-UNet [61, 34] | 2018 | 80.05 ± 0.079 | 87.62 ± 0.078 | 78.38 ± 0.151 | 87.88 ± 0.095 | |
MedT [39, 62] | 2021 | 80.34 ± 0.034 | 87.77 ± 0.107 | 79.29 ± 0.411 | 88.45 ± 0.251 | |
ConvUNeXt [40, 63] | 2022 | 80.51 ± 0.043 | 87.99 ± 0.049 | 78.71 ± 0.128 | 88.09 ± 0.080 | |
UNeXt-S [42, 64] | 2022 | 80.70 ± 0.226 | 88.17 ± 0.194 | 79.26 ± 0.497 | 88.43 ± 0.309 | |
MALUNet [43, 65] | 2022 | 80.95 ± 0.393 | 88.25 ± 0.315 | 79.99 ± 0.644 | 88.88 ± 0.398 | |
EGE-UNet [44, 66] | 2023 | 81.16 ± 0.104 | 88.36 ± 0.086 | 80.28 ± 0.363 | 89.06 ± 0.223 | |
UCM-Net (ours) | 2023 | 81.26 ± 0.030 | 88.48 ± 0.109 | 80.85 ± 0.251 | 89.35 ± 0.111 |
-
•
∗: this method needs the pre-train model on training.
Models | Year | Image Size (H x W) | Params↓ | GFLOPs↓ | Memory (MB)↓ |
---|---|---|---|---|---|
TransUNet∗ [35, 54] | 2021 | 224 x 224 | 105,277,081 | 24.7278 | 402.3614 |
SANet∗ [38, 58] | 2021 | 256 x 256 | 23,899,497 | 5.9983 | 91.9233 |
SwinNet∗ [37, 55] | 2021 | 224 x 224 | 20,076,204 | 5.5635 | 77.3503 |
TransFuse-S∗ [36, 53] | 2021 | 192 x 256 | 26,248,725 | 8.6462 | 100.8809 |
U-Net [59, 60] | 2015 | 256 x 256 | 31,037,633 | 54.7378 | 119.3992 |
Att-UNet [61, 34] | 2018 | 256 x 256 | 34,878,573 | 66.6318 | 134.0512 |
MedT [39, 62] | 2021 | 256 x 256 | 1,564,202 | 2.4061 | 6.9670 |
ConvUNeXt [40, 63] | 2022 | 256 x 256 | 3,505,697 | 7.2537 | 14.3732 |
UNeXt-S [42, 64] | 2022 | 256 x 256 | 253,561 | 0.1038 | 1.9673 |
MALUNet [43, 65] | 2022 | 256 x 256 | 177,943 | 0.0830 | 1.6788 |
EGE-UNet [44, 66] | 2023 | 256 x 256 | 53,374 | 0.0721 | 1.2036 |
UCM-Net(ours) | 2023 | 256 x 256 | 49,932 | 0.0465 | 1.1905 |
-
•
∗: this method needs the pre-train model on training.
Table 1-3 comprehensively evaluates the performance of our UCM-Net, a novel skin lesion segmentation model, compared to well-established models, using the widely recognized PH2, ISIC2017 and ISIC2018 datasets. Introduced in 2023, UCM-Net is a robust and highly competitive solution in this domain. One of the key takeaways from the table is UCM-Net’s ability to outperform EGE-UNet, which had previously held the title of the state-of-the-art model for skin lesion segmentation. Our model achieves superior results across various prediction metrics, emphasizing its advancement in the field and its potential to redefine the standard for accurate skin lesion delineation. Moreover, UCM-Net’s performance is notably competitive even when compared to SwinNet, a model that relies on pre-trained models during training. Table 4 complements this assessment by comparing computational aspects and the number of parameters for various segmentation models. Remarkably, UCM-Net, operating with the same number of channels {8,16,24,32,48,64} and image size, as EGE-UNet, boasts fewer parameters and lower GFLOPs. Additionally, even when compared to TransUNet, TransFuse-S and SwinNet, which operate with smaller image sizes, UCM-Net demonstrates faster computational speed. In Figure 5, we present a visual exhibition of all the non-pretrained models’ segmentation outputs. This figure directly compares our segmentation results, those produced by other methods, and the ground truth, all displayed side by side using representative sample images. Notably, our segmentation results demonstrate a remarkable level of accuracy, closely resembling the ground truth annotations. Tables 1-3 and Figure 5 collectively underscore UCM-Net’s exceptional performance and efficiency in skin lesion segmentation, affirming its potential to make a substantial impact in advancing early skin cancer diagnosis and treatment.
4.3 Ablation results
To demonstrate the efficiency and effectiveness of our proposed modules, we conducted a series of ablation experiments on dataset ISIC2017. We develop UCM-Net based on U-Net. Figure 6 shows the different block structures in the stages among the compared models. Table 5 shows the ablation experiments’ results including the number of parameters, Giga Flops, mIoU score and mDice score. U-Net variant and variant 1 are six-stage U-Nets with the stage channels {8,16,24,32,48,64}. The details of UCM-Net are illustrated in Figure 4.
Models | Structure Reference | Params↓ | GFLOPs↓ | mIoU(%)↑ | mDice(%)↑ |
---|---|---|---|---|---|
U-Net(baseline) | Figure 6 (A) | 310,376,33 | 54.7378 | 78.34 | 86.23 |
U-Net Variant | Figure 6 (A) | 248,531 | 0.5715 | 78.48 | 86.22 |
U-Net Variant 1 | Figure 6 (B) | 148,157 | 0.3700 | 73.89 | 82.36 |
UCM-Net | Figure 6 (C) | 49,932 | 0.0465 | 79.76 | 86.94 |
UCM-Net + Group Loss | Figure 6 (C) | 49,932 | 0.0465 | 80.63 | 87.64 |
UCM-Net + Group Loss(ours) | Figure 6 (C) | 49,932 | 0.0465 | 81.01 | 87.98 |
As the results of the U-Net and U-Net variants are shown in Table 5, although the number of parameters is reduced, the U-Net variant performs better with the six-stage structure. When we set the one convolution kernel to 1 to reduce the number of parameters, the model’s performance drops severely. However, when we replaced the convolution with our proposed UCMNet block, the results showed that the model’s performance improved significantly. The UCM-Net, as depicted in Figure 4 and Figure 6(C), with 49,932 parameters and 0.0465 GFLOPs, outperforms the U-Net variant with 248,531 parameters and the baseline U-Net with 31,037,633 parameters in terms of both mean Intersection over Union (mIoU) and mean Dice Similarity Coefficient (mDice) metrics.
Furthermore, when incorporating the proposed Group Loss into the UCM-Net architecture, denoted as "UCM-Net + Group Loss(ours)", the model’s performance continued to excel. This enhancement resulted in a higher mIoU of 81.00% and a mDice of 87.98%, demonstrating the effectiveness of the proposed Group Loss in further improving segmentation accuracy. Figures 7-8 show that "UCM-Net + Group Loss(ours)" always presents the high scores of mIoU and mDice with the training epoch increase.
The above findings in the ablation experiments underscore the significance of architectural innovations such as the UCMNet block in achieving superior semantic segmentation performance, even with fewer parameters and computational complexity than the U-Net baseline. We note the batch size will affect the prediction performance. So, we tested our model with different batch sizes and reported it in Table 6 in the appendix.
5 Conclusion
This paper introduces UCM-Net, a novel, lightweight, and highly efficient solution. UCM-Net combines MLP and CNN, providing robust feature learning capabilities while maintaining a minimal parameter count and reduced computational demand. We applied this innovative approach to the challenging task of skin lesion segmentation, conducting comprehensive experiments with a range of evaluation metrics to showcase its effectiveness and efficiency. The results of our extensive experiments unequivocally demonstrate UCM-Net’s superior performance compared to the state-of-the-art EGE-UNet. Remarkably, UCM-Net is the first model with fewer than 50KB parameters and consuming less than 0.05 GLOPs for skin lesion segmentation.
Looking forward to future research endeavors, we aim to expand the application of UCM-Net to other critical medical image tasks, advancing the field and exploring how this efficient architecture can contribute to a broader spectrum of healthcare applications. This potential revolution in utilizing deep learning for medical image analysis opens up numerous possibilities for enhancing patient care and diagnostic accuracy. However, it is essential to acknowledge that compared to models leveraging pre-trained components, UCM-Net currently demonstrates lower accuracy performance. Recognizing this limitation, our future efforts will focus on optimizing UCM-Net to improve its accuracy. Such advancements are crucial for ensuring that the model maintains its efficiency and competes favorably in performance with existing state-of-the-art solutions. By addressing these challenges, we aim to advance the field further and expand the impact of deep learning in healthcare applications, making significant contributions to medical imaging and beyond.
References
- [1] American cancer society. cancer facts figures 2022. atlanta: American cancer society. 2022.
- [2] RL Siegel, KD Miller, HE Fuchs, and A Jemal. Cancer statistics, 2022. ca. 506. Cancer J. Clin, 72(7-33):507, 2022.
- [3] Rebecca L Siegel, Kimberly D Miller, and Nikita Sandeep Wagle. Cancer statistics, 2023. 2023.
- [4] Robin Marks. An overview of skin cancers. Cancer, 75(S2):607–612, 1995.
- [5] Thomas Martin Lehmann, Claudia Gonner, and Klaus Spitzer. Survey: Interpolation methods in medical image processing. IEEE transactions on medical imaging, 18(11):1049–1075, 1999.
- [6] Muhammad Nasir, Muhammad Attique Khan, Muhammad Sharif, Ikram Ullah Lali, Tanzila Saba, and Tassawar Iqbal. An improved strategy for skin lesion detection and classification using uniform segmentation and feature selection based approach. Microscopy research and technique, 81(6):528–543, 2018.
- [7] Catarina Barata, M Emre Celebi, and Jorge S Marques. Explainable skin lesion diagnosis using taxonomies. Pattern Recognition, 110:107413, 2021.
- [8] Isaac Sanchez and Sos Agaian. Computer aided diagnosis of lesions extracted from large skin surfaces. In 2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 2879–2884. IEEE, 2012.
- [9] Isaac Sanchez and Sos Agaian. A new system of computer-aided diagnosis of skin lesions. In Image Processing: Algorithms and Systems X; and Parallel Processing for Imaging Applications II, volume 8295, pages 390–401. SPIE, 2012.
- [10] Alex Liew, Sos Agaian, and Liang Zhao. Mitigation of adversarial noise attacks on skin cancer detection via ordered statistics binary local features. In Multimodal Image Exploitation and Learning 2023, volume 12526, pages 153–164. SPIE, 2023.
- [11] Khalid M Hosny, Doaa Elshora, Ehab R Mohamed, Eleni Vrochidou, and George A Papakostas. Deep learning and optimization-based methods for skin lesions segmentation: A review. IEEE Access, 2023.
- [12] Vladimir Frants and Sos Agaian. Dermoscopic image segmentation based on modified grabcut with octree color quantization. In Mobile Multimedia/Image Processing, Security, and Applications 2020, volume 11399, pages 119–130. SPIE, 2020.
- [13] Muhammad Imran Razzak, Saeeda Naz, and Ahmad Zaib. Deep learning for medical image processing: Overview, challenges and the future. Classification in BioApps: Automation of Decision Making, pages 323–350, 2018.
- [14] Andreas Maier, Christopher Syben, Tobias Lasser, and Christian Riess. A gentle introduction to deep learning in medical image processing. Zeitschrift für Medizinische Physik, 29(2):86–101, 2019.
- [15] Neeraj Sharma and Lalit M Aggarwal. Automated medical image segmentation techniques. Journal of medical physics/Association of Medical Physicists of India, 35(1):3, 2010.
- [16] Zahra Mirikharaji, Kumar Abhishek, Alceu Bissoto, Catarina Barata, Sandra Avila, Eduardo Valle, M Emre Celebi, and Ghassan Hamarneh. A survey on deep learning for skin lesion segmentation. Medical Image Analysis, page 102863, 2023.
- [17] Tiago M de Carvalho, Eline Noels, Marlies Wakkee, Andreea Udrea, and Tamar Nijsten. Development of smartphone apps for skin cancer risk assessment: progress and promise. JMIR Dermatology, 2(1):e13376, 2019.
- [18] butterflynetwork. https://www.butterflynetwork.com/iq-ultrasound-individuals.
- [19] phonemedical. https://blog.google/technology/health/ai-dermatology-preview-io-2021/.
- [20] Sandeep Kumar Vashist. Point-of-care diagnostics: Recent advances and trends. Biosensors, 7(4):62, 2017.
- [21] Andre Esteva, Alexandre Robicquet, Bharath Ramsundar, Volodymyr Kuleshov, Mark DePristo, Katherine Chou, Claire Cui, Greg Corrado, Sebastian Thrun, and Jeff Dean. A guide to deep learning in healthcare. Nature medicine, 25(1):24–29, 2019.
- [22] Chunlei Chen, Peng Zhang, Huixiang Zhang, Jiangyan Dai, Yugen Yi, Huihui Zhang, and Yonghui Zhang. Deep learning on computational-resource-limited platforms: a survey. Mobile Information Systems, 2020:1–19, 2020.
- [23] Neil C Thompson, Kristjan Greenewald, Keeheon Lee, and Gabriel F Manso. The computational limits of deep learning. arXiv preprint arXiv:2007.05558, 2020.
- [24] Hyunseok Seo, Masoud Badiei Khuzani, Varun Vasudevan, Charles Huang, Hongyi Ren, Ruoxiu Xiao, Xiao Jia, and Lei Xing. Machine learning techniques for biomedical image segmentation: an overview of technical aspects and introduction to state-of-art applications. Medical physics, 47(5):e148–e167, 2020.
- [25] Sertan Serte, Ali Serener, and Fadi Al-Turjman. Deep learning in medical imaging: A brief review. Transactions on Emerging Telecommunications Technologies, 33(10):e4080, 2022.
- [26] Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen Awm Van Der Laak, Bram Van Ginneken, and Clara I Sánchez. A survey on deep learning in medical image analysis. Medical image analysis, 42:60–88, 2017.
- [27] Abeer Aljuaid and Mohd Anwar. Survey of supervised learning for medical image processing. SN Computer Science, 3(4):292, 2022.
- [28] Khalid Raza and Nripendra K Singh. A tour of unsupervised deep learning for medical image analysis. Current Medical Imaging, 17(9):1059–1077, 2021.
- [29] Quande Liu, Lequan Yu, Luyang Luo, Qi Dou, and Pheng Ann Heng. Semi-supervised medical image classification with relation-driven self-ensembling model. IEEE transactions on medical imaging, 39(11):3429–3440, 2020.
- [30] Rushi Jiao, Yichi Zhang, Le Ding, Rong Cai, and Jicong Zhang. Learning with limited annotations: a survey on deep semi-supervised learning for medical image segmentation. arXiv preprint arXiv:2207.14191, 2022.
- [31] Hyeji Kim, Muhammad Umar Karim Khan, and Chong-Min Kyung. Efficient neural network compression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12569–12577, 2019.
- [32] Chunyu Yuan and Sos S Agaian. A comprehensive review of binary neural network. Artificial Intelligence Review, 56(11):12949–13013, 2023.
- [33] Partha Pratim Ray. A review on tinyml: State-of-the-art and prospects. Journal of King Saud University-Computer and Information Sciences, 34(4):1595–1623, 2022.
- [34] Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y Hammerla, Bernhard Kainz, et al. Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999, 2018.
- [35] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L Yuille, and Yuyin Zhou. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021.
- [36] Yundong Zhang, Huiye Liu, and Qiang Hu. Transfuse: Fusing transformers and cnns for medical image segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, pages 14–24. Springer, 2021.
- [37] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang. Swin-unet: Unet-like pure transformer for medical image segmentation. In European conference on computer vision, pages 205–218. Springer, 2022.
- [38] Jun Wei, Yiwen Hu, Ruimao Zhang, Zhen Li, S Kevin Zhou, and Shuguang Cui. Shallow attention network for polyp segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, pages 699–708. Springer, 2021.
- [39] Jeya Maria Jose Valanarasu, Poojan Oza, Ilker Hacihaliloglu, and Vishal M Patel. Medical transformer: Gated axial-attention for medical image segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, pages 36–46. Springer, 2021.
- [40] Zhimeng Han, Muwei Jian, and Gai-Ge Wang. Convunext: An efficient convolution neural network for medical image segmentation. Knowledge-Based Systems, 253:109512, 2022.
- [41] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986, 2022.
- [42] Jeya Maria Jose Valanarasu and Vishal M Patel. Unext: Mlp-based rapid medical image segmentation network. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 23–33. Springer, 2022.
- [43] Jiacheng Ruan, Suncheng Xiang, Mingye Xie, Ting Liu, and Yuzhuo Fu. Malunet: A multi-attention and light-weight unet for skin lesion segmentation. In 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 1150–1156. IEEE, 2022.
- [44] Jiacheng Ruan, Mingye Xie, **gsheng Gao, Ting Liu, and Yuzhuo Fu. Ege-unet: an efficient group enhanced unet for skin lesion segmentation. arXiv preprint arXiv:2307.08473, 2023.
- [45] Gousia Habib and Shaima Qureshi. Optimization and acceleration of convolutional neural networks: A survey. Journal of King Saud University-Computer and Information Sciences, 34(7):4244–4268, 2022.
- [46] Isic 2017 challenge dataset. https://challenge.isic-archive.com/data/#2017.
- [47] Matt Berseth. Isic 2017-skin lesion analysis towards melanoma detection. arXiv preprint arXiv:1703.00523, 2017.
- [48] Isic 2018 challenge dataset. https://challenge.isic-archive.com/data/#2018.
- [49] Noel Codella, Veronica Rotemberg, Philipp Tschandl, M Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic). arXiv preprint arXiv:1902.03368, 2019.
- [50] Teresa Mendonça, Pedro M Ferreira, Jorge S Marques, André RS Marcal, and Jorge Rozeira. Ph 2-a dermoscopic image database for research and benchmarking. In 2013 35th annual international conference of the IEEE engineering in medicine and biology society (EMBC), pages 5437–5440. IEEE, 2013.
- [51] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
- [52] lambda cloud gpu. https://cloud.lambdalabs.com/instances.
- [53] Transfuse official code. https://github.com/Rayicer/TransFuse.
- [54] Transunet official code. https://github.com/Beckschen/TransUNet.
- [55] Swinnet official code. https://github.com/HuCaoFighting/Swin-Unet.
- [56] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018.
- [57] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
- [58] Sanet official code. https://github.com/weijun88/SANet.
- [59] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
- [60] U-net highest stars code. https://github.com/milesial/Pytorch-UNet.
- [61] Att u-net highest stars code. https://github.com/LeeJunHyun/Image_Segmentation.
- [62] Medt official code. https://github.com/jeya-maria-jose/Medical-Transformer.
- [63] Convunext official code. https://github.com/1914669687/ConvUNeXt.
- [64] Unext official code. https://github.com/jeya-maria-jose/UNeXt-pytorch.
- [65] Malunet official code. https://github.com/JCruan519/MALUNet.
- [66] Ege-net official code. https://github.com/JCruan519/EGE-UNet.
6 Appendix
Models | Batch Size | mIoU(%)↑ | mDice(%)↑ | mIoU*(%)↑ | mDice*(%)↑ |
---|---|---|---|---|---|
TransUnet∗ [35, 54] | 8 | 90.77 ± 0.030 | 95.09 ± 0.014 | 91.89 ± 0.148 | 95.77 ± 0.080 |
SANet∗ [38, 58] | 8 | 89.99 ± 0.349 | 94.62 ± 0.206 | 91.06 ± 0.481 | 95.32 ± 0.264 |
SwinNet∗ [37, 55] | 8 | 88.92 ± 0.493 | 94.03 ± 0.289 | 89.96 ± 0.388 | 94.71 ± 0.215 |
TransFuse-S∗ [36, 53] | 8 | 91.15 ± 0.131 | 95.30 ± 0.078 | 92.16 ± 0.113 | 95.92 ± 0.061 |
UCM-Net + Group Loss(ours) | 1 | 89.68 | 94.45 | 89.96 | 94.72 |
2 | 89.88 | 94.57 | 90.39 | 94.95 | |
4 | 89.57 | 94.38 | 90.18 | 94.84 | |
8 | 89.39 | 94.28 | 89.67 | 94.55 | |
16 | 88.44 | 93.74 | 88.85 | 94.10 | |
32 | 89.15 | 94.15 | 90.23 | 94.86 |
We tested our model with different batch sizes and reported it in Table 6 in the appendix.