UCM-Net: A Lightweight and Efficient Solution for Skin Lesion Segmentation using MLP and CNN

Chunyu Yuan

The Graduate Center, City University of New York
[email protected]
&Dongfang Zhao

University of Washington
[email protected]
&Sos S. Agaian

The Graduate Center, City University of New York 
College of Staten Island, City University of New York
[email protected]
Abstract

Skin cancer presents a formidable public health challenge, with its incidence expected to rise significantly in the coming decades. Early diagnosis is vital for effective treatment, highlighting the importance of computer-aided diagnosis systems in detecting and managing this disease. A critical task in these systems is accurately segmenting skin lesions from images, which is essential for further analysis, classification, and detection. This task is notably complex due to the diverse characteristics of lesions, such as their appearance, shape, size, color, texture, and location, compounded by image quality issues like noise, artifacts, and occlusions. While recent advancements in deep learning have shown promise in skin lesion segmentation, they often need more computational and extensive parameter requirements, limiting their practicality for mobile health applications. Addressing these challenges, we introduce UCM-Net, a novel, efficient, and lightweight model that synergizes the strengths of Multi-Layer Perceptrons (MLP) and Convolutional Neural Networks (CNN). The innovative UCM-Net Block diverges from conventional UNet architectures, offering significant reductions in parameter count and enhancing the model’s learning efficiency. This results in robust segmentation performance while maintaining a low computational footprint. UCM-Net’s performance is rigorously evaluated using the PH2, ISIC 2017, and ISIC 2018 datasets. It demonstrates its effectiveness with fewer than 50K parameters and requires less than 0.05 Giga-Operations Per Second (GLOPs). Moreover, its minimal memory requirement is just 1.19MB in CPU environment positions. It is a potential benchmark for efficiency in skin lesion segmentation, suitable for deployment in resource-constrained settings. In order to facilitate accessibility and further research in the field,the UCM-Net source code is https://github.com/chunyuyuan/UCM-Net.

Keywords Medical image segmentation  \cdot Light-weight model  \cdot Mobile health

1 Introduction

Skin cancer poses a significant global health concern and stands as one of the leading cancer types worldwide. Skin cancer can be broadly categorized into two types: melanoma and non-melanoma. While melanoma accounts for only 1% of cases, it is responsible for the majority of deaths due to its aggressive nature. In 2022, it was estimated that melanoma would account for approximately 7,650 deaths in the United States, affecting 5,080 men and 2,570 women [1, 2]. In addition, it is estimated that the United States will have 97,610 new cases of melanoma in 2023. Current statistics suggest that one in five Americans will develop skin cancer at some point in their lives, underscoring the gravity of this issue. Over the past few decades, skin cancer has emerged as a substantial public health problem, resulting in annual expenses of approximately $ 8.1 billion in the United States alone [3].
Skin cancer [4] is a prevalent and potentially life-threatening disease affecting millions worldwide. Among the various types of skin cancer, malignant melanoma is known for its rapid progression and high mortality rate if not detected and treated early. Early and accurate diagnosis is, therefore, critical to improving patient outcomes. Medical imaging [5], particularly dermatoscopy and dermoscopy, is crucial in diagnosing skin cancer. Dermatologists and healthcare professionals rely on these imaging techniques to examine and analyze skin lesions for signs of malignancy. However, the manual interpretation of such images with a naked eye is a time-consuming and error-prone process, heavily reliant on the expertise of the examining physician [6, 7].

To address these challenges and improve the accuracy and efficiency of skin cancer diagnosis, computer-aided tools and artificial intelligence (AI) have been leveraged in recent years[8, 9, 10]. Skin cancer segmentation, a fundamental step in the diagnostic process, involves delineating the boundaries of skin lesions within medical images. This task is essential for quantifying lesion characteristics, monitoring changes over time, and aiding in the decision-making process for treatment. Segmenting skin lesions from images faces several key challenges [11]: unclear boundaries where the lesion blends into surrounding skin; illumination variations that alter lesion appearance; artifacts like hair and bubbles that obscure lesion boundaries; variability in lesion size and shape; different imaging conditions and resolutions; age-related skin changes affecting texture; complex backgrounds that hinder segmentation; and differences in skin color due to race and climate. Figure 1 presents some representative samples of complex skin lesion.

Refer to caption
Figure 1: Complex skin lesion samples

Overcoming these difficulties is crucial for accurate segmentation to enable early diagnosis and treatment of malignant melanoma. Recently, a groundbreaking transformation in skin cancer segmentation has been driven by the development of advanced deep-learning algorithms [12, 13, 14, 15, 16]. These AI-driven approaches have exhibited remarkable capabilities in automating the segmentation of skin lesions, significantly reducing the burden on healthcare professionals and potentially improving diagnostic accuracy. In addition, the rapid advancements in AI techniques and the widespread adoption of smart devices,such as the point-of-care ultrasound (POCUS) devices or smartphones [17, 18, 19], have brought about transformative changes in the healthcare industry [20]. Figure 2 briefly presents the entire diagnose of skin cancer detection with portable devices and AI techniques.

Refer to caption
Figure 2: AI diagnose process of skin cancer detection

Patients now have greater access to medical information, remote monitoring, and personalized care, leading to increased satisfaction with their healthcare experiences. However, amidst these advancements, there are still challenges that need to be addressed. One such challenge is the accurate and efficient segmentation of skin lesions for diagnostic purposes within limited computation hardwares and devices. Most of AI medical methods are developed based on deep-learning [21]. The major deep-learning methods utilize expensive computation overhead and a large number of learning parameters to achieve a good prediction result. It is a challenge to embed these methods to hardware-limit devices [22, 23]. In this study, we introduce UCM-Net, a lightweight and robust approach for skin lesion segmentation. UCM-Net leverages a novel hybrid module that combines Convolutional Neural Networks (CNN) and Multi-Layer Perceptrons (MLP) to enhance feature learning while reducing parameters. Utilizing group loss functions, our method surpasses existing machine learning-based techniques in skin lesion segmentation.

Key contributions of UCM-Net include:

  1. 1.

    Hybrid Module: We introduce the UCM-Net Block, a hybrid structure combining CNN and MLP with superior feature-learning capabilities and reduced computation and parameters.

  2. 2.

    Efficient Segmentation: UCM-Net is developed based on UCM-Net Blocks and the base model U-Net, offering a highly efficient method for skin lesion segmentation. It is the first model with less than 50 KB parameters and less than 0.05 Giga-Operations Per Second (GLOPs). UCM-Net is 1177 times faster and has 622 times fewer parameters than U-Net. Compared to the state-of-the-art EGE-UNet, UCM-Net reduces parameter and computation costs by 1.06x and 1.56x.

  3. 3.

    Light-weight Segmentation: Compared to published models, our model contains lower number of parameters. Besides,we further reduced the model’s memory usage by engineering methods without third party library. Finally, our UCM-Net only spent 1.19MB on CPU environment.

  4. 4.

    New Group Loss Function: In the training parts, we present a new group loss function with the output loss and internal stage losses. The new designed function can improve UCM-Net’s learning ability.

  5. 5.

    Improved Segmentation: UCM-Net’s segmentation performance is evaluated using mean Intersection over Union (mIoU) and mean Dice similarity score (mDice). On the PH2, Isic2017 and Isic2018 datasets, UCM-Net enhances the baseline U-Net model by an average of 2.53% in mIoU and 1.50% in mDice. Notably, UCM-Net outperforms the state-of-the-art EGE-UNet on Ph2, ISIC 2017 and ISIC 2018 datasets, with respective mean IoU scores of 89.62% (UCM-Net) vs 88.39% , 80.71% (UCM-Net) vs 80.22% and 81.26% (UCM-Net) vs 81.16%.

Refer to caption
Figure 3: This figure shows the visualization of comparative experimental results on the ISIC2017 dataset. The X-axis represents mDice score (higher is better), while Y-axis represents mIoU (higher is better). The color depth represents the number of parameters (blue is better).

2 Related works

AI Method Categories and Applications AI-driven approaches for biomedical images can be broadly classified as supervised learning methods, semi-supervised learning methods, and unsupervised learning methods [24, 25, 26]. Supervised learning is a solution with labeled image data, expecting to develop predictive capability. The labeled image data can be the previous patients’ diagnosed results, such as computed tomography(CT), with Clinicians’ analysis. With the labeled data, AI-driven solutions can be developed and performed against the ground truth results. Supervised learning solutions are widely applied to disease classification, tumor location detection, and tumor segmentation [27]. Relatively, unsupervised learning is a discovering process, diving into unlabeled data to capture hidden information. Unsupervised learning solutions derive insights directly from unlabeled medical data without inadequate or biased human supervisions and can be used for information compression, dimensional reduction, super resolution for medical image and sequence data detection and analysis such as protein, DNA and RNA [28]. In recent years, semi-supervised learning is becoming popular, which utilizes a large number of unlabeled data in conjunction with the limited amount of labeled data to train higher-performing models. Semi-supervised learning solutions can be also applied into disease classification and medical segmentation [29, 30].

TinyML for healthcare Integrating the potential of TinyML into healthcare, particularly for tasks such as lesion segmentation, offers an exciting avenue for future research and practical applications. TinyML, which refers to the deployment of machine learning models on low-power, miniature hardware, presents a unique opportunity to bring advanced analytical capabilities directly to the point of care. This approach could democratize access to sophisticated medical image analysis technologies, enabling real-time, on-device processing in environments where traditional computing resources are scarce or in mobile healthcare applications. For instance, applying TinyML for lesion segmentation could facilitate immediate diagnostic insights during patient examinations or in remote areas, significantly reducing the time and infrastructure typically required for such analyses. The integration of TinyML into healthcare devices promises to enhance diagnostic processes, improve patient outcomes, and extend the reach of advanced medical technologies to underserved regions.To further enhance the efficiency and feasibility of deploying TinyML in such critical applications, researchers are exploring innovative approaches like hyper-structure optimization[31] and the use of quantitative methods such as binary neural networks[32]. Hyper-structure optimization aims to minimize the number of parameters without compromising the model’s performance, thus ensuring that the models are lightweight yet effective enough for deployment on tiny devices. As we explore the optimization and application of UCM-Net in this context, we acknowledge the pioneering work and insights offered by Partha. in their exploration of TinyML’s potential in healthcare [33], which serves as a survey foundational reference for our proposed direction.

Supervised Methods of Segmentation As the technique evolves and develops, the solution of AI for medical image segmentation is from purely applying a convolution neural network(CNN) such as U-Net and Att-UNet [34] to a hybrid structure method like TransUNet [35], TransFuse [36] and SANet [37]. U-Net is a earest CNN solution on biomedical image segmentation, which replaces pooling operators by upsampling operators. Att-UNet is developed on the top of U-Net adding attention structures. TransUNet merits both Transformers and U-Net for medical image segmentation. TransFuse is a novel approach that combines Transformers and CNNs with late fusion for medical image segmentation, achieving a strong performance while maintaining high efficiency, with potential applications in various medical-related tasks. SANet [38], the Shallow Attention Network, addresses challenges in polyp segmentation by mitigating color inconsistencies, preserving small polyps through shallow attention modules, and balancing pixel distributions, achieving remarkable performance improvements. Swin-UNet is a UNet-like pure Transformer for medical image segmentation that proposed shifted windows as the encoder to extract context features, and a transformer-based decoder with patch expanding layer performs the up-sampling operation to restore the spatial resolution of the feature maps. MedT [39] is also a transformer-based network architecture, a gated axial-attention model that introduces an additional control mechanism in the self-attention module. ConvUNeXt [40], an efficient model inspired by ConvNeXts [41] and based on the classic UNet architecture, achieving excellent medical image segmentation results with a significantly reduced parameter count while incorporating features such as large convolution kernels, depth-wise separable convolution, residual connections, and a lightweight attention mechanism. UNeXt [42] is introduced as an efficient Convolutional multilayer perceptron (MLP) based network that reduces parameters and computational complexity while achieving superior segmentation performance through tokenized MLP blocks and channel shifting, making it suitable for point-of-care applications. MALUNet [43] and its extended version EGE-UNet [44] develop new attention modules to significantly reduce parameters and computational complexity while achieving powerful skin lesion segmentation performance, making it highly suitable for resource-constrained clinical environments.

Refer to caption
Figure 4: UCM-Net Structure

3 UCM-Net

Network Design Figure 4 provides a comprehensive view of the structural framework of UCM-Net, an advanced architecture that showcases a distinctive U-Shape design. Our design is developed from U-Net. UCM-Net includes a down-sampling encoder and an up-sampling decoder, resulting in a high-powered network for skin lesion segmentation. The entirety of the network encompasses six stages of encoder-decoder units, each equipped with channel capacities of {8, 16, 24, 32, 48, 64}. Within each stage, we leverage a convolutional block alongside our novel UCMNet block, facilitating the extraction and acquisition of essential features. In the convolutional block, we opt for a kernel size of 1, a choice that serves to further curtail the parameter count. Our innovative UCMNet introduces a hybrid structure module, wherein an amalgamation of a Multi-Layer Perceptron (MLP) linear component and Convolutional Neural Network (CNN) is employed, bolstered by the inclusion of skip connections. This strategic amalgamation fortifies the network’s prowess in feature acquisition and learning capabilities.

Convolution Block In our designing, we contain two different convolution blocks. Convolution Block 1 utilizes a kernel size of 3×3, which is commonly employed for its effectiveness in capturing spatial relationships within the input features. This size is particularly advantageous in the network’s initial layers, where preserving the spatial integrity of feature maps is essential for decoding complex input patterns. Convolution Block 2 is characterized by a 1×1 kernel size, drastically reducing the number of learnable parameters and computational load[45]. The smaller kernel size serves to linearly combine features across channels, allowing for dimensionality reduction and feature recombination without spatial aggregation, thus enhancing the network’s efficiency.

# Input: X,the feature map with shape [Batch(B), Channel(C), Height(H), Width(W)]
# Output: Out,the feature map with shape [B, Height*Width(N),C]
# Operator: Conv, 2D Convolution  LN, LayerNorm  BN, BatchNorm,  Linear, Linear Transformation  Leaky, Leaky RelU
# UCM-Net Block Processing Pipeline
B, C, H, W = X.shape()
# Transform Feature from [B,C,H,W] to [B,H*W,C]
X = X.flatten(2).trnaspose(1,2)
# Copy feature for later residual addition
X1 = copy(X)
X = Linear(LN(X))
B, N, C = X.shape()
# Transform Feature from [B,H*W,C] to [B,C,H,W]
X = X.transpose(1,2).view(B,C,H,W)
X = Conv(LN(X))
X = Conv(Leaky(X))
# Transform Feature from [B,C,H,W] to [B,H*W,C]
X = X.flatten(2).trnaspose(1,2)
X = Linear(BN(X))
# Transform Feature from [B,H*W,C] to [B,C,H,W]
X = X.transpose(1,2).view(B,C,H,W)
X = Conv(LN(X))
# Transform Feature from [B,C,H,W] to [B,H*W,C]
X = X.flatten(2).trnaspose(1,2)
# Output with residual addition
Out = X + X1

Algorithm 1 PyTorch-style pseudocode for UCM-Net Block

UCM-Net Block The UCM-Net Block exemplifies a sophisticated approach to combining Convolutional Neural Networks (CNNs) with Multilayer Perceptrons (MLPs) for robust feature learning. This hybrid model benefits from the spatial feature extraction capabilities of CNNs and the pattern recognition strengths of MLPs through a series of strategically designed operations. Initially, the input feature map undergoes resha** to cater to the distinct structural requirements of CNNs and MLPs—transitioning from a four-dimensional tensor for CNN processing to a three-dimensional tensor for MLP operations. Following this, the block applies a sequence of linear and convolutional transformations, each accompanied by normalization and activation functions. This includes the strategic use of different kernel sizes in the convolutional stages to balance computational efficiency with the network’s learning capacity. Furthermore, a residual connection is integrated, enhancing information flow through the network and supporting the construction of deeper architectures by mitigating the vanishing gradient problem. Through these mechanisms, UCM-Net adeptly captures and processes complex patterns in image data, demonstrating its potential for advanced image analysis applications. The pseudocode of UCM-Net Block 1 presents our defined sequence of operations, which is how we combine CNN with MLP(Linear transformation operation) for feature learning.

Loss functions: In our solution, we designed a group loss function similar to those used in TransFuse [36] and EGE-UNet [44]. However, the base loss function for each stage is distinct. The base loss function from TransFuse and EGE-UNet is calculated using binary cross-entropy (BCE) (1) and Dice loss (Dice) (2) components. Our proposed base loss function is calculated from BCE components and squared-Dice loss (SDice) components, to calculate the loss from the scaled layer masks in different stages compared with the ground truth masks. Equations (1) and (2) present the stage loss in different layers and the output loss in the output layer, which are calculated using binary cross-entropy (BCE) and Dice loss (Dice) components, respectively.

BCE=1Ni=1N[yilog(pi)+(1yi)log(1pi)],BCE1𝑁superscriptsubscript𝑖1𝑁delimited-[]subscript𝑦𝑖subscript𝑝𝑖1subscript𝑦𝑖1subscript𝑝𝑖\displaystyle\hskip 56.9055pt\text{BCE}=-\frac{1}{N}\sum_{i=1}^{N}\left[y_{i}% \log(p_{i})+(1-y_{i})\log(1-p_{i})\right],BCE = - divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT [ italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_log ( italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) + ( 1 - italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) roman_log ( 1 - italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ] , (1)

where N𝑁Nitalic_N is the total number of pixels (for image segmentation) or elements (for other tasks), yisubscript𝑦𝑖y_{i}italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the ground truth value, and pisubscript𝑝𝑖p_{i}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the predicted probability for the i𝑖iitalic_i-th element.

Dice_Loss=12×i=1N(piyi)+smoothi=1Npi+i=1Nyi+smooth,Dice_Loss12superscriptsubscript𝑖1𝑁subscript𝑝𝑖subscript𝑦𝑖smoothsuperscriptsubscript𝑖1𝑁subscript𝑝𝑖superscriptsubscript𝑖1𝑁subscript𝑦𝑖smooth\displaystyle\hskip 56.9055pt\text{Dice\_Loss}=1-\frac{2\times\sum_{i=1}^{N}(p% _{i}\cdot y_{i})+\text{smooth}}{\sum_{i=1}^{N}p_{i}+\sum_{i=1}^{N}y_{i}+\text{% smooth}},Dice_Loss = 1 - divide start_ARG 2 × ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ( italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⋅ italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) + smooth end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + smooth end_ARG , (2)

where smooth is a small constant added to improve numerical stability.

Squared_Dice_Loss=12×(i=1N(piyi))2+smooth(i=1Npi)2+(i=1Nyi)2+smooth,Squared_Dice_Loss12superscriptsuperscriptsubscript𝑖1𝑁subscript𝑝𝑖subscript𝑦𝑖2smoothsuperscriptsuperscriptsubscript𝑖1𝑁subscript𝑝𝑖2superscriptsuperscriptsubscript𝑖1𝑁subscript𝑦𝑖2smooth\displaystyle\hskip 56.9055pt\text{Squared\_Dice\_Loss}=1-\frac{2\times\left(% \sum_{i=1}^{N}(p_{i}\cdot y_{i})\right)^{2}+\text{smooth}}{\left(\sum_{i=1}^{N% }p_{i}\right)^{2}+\left(\sum_{i=1}^{N}y_{i}\right)^{2}+\text{smooth}},Squared_Dice_Loss = 1 - divide start_ARG 2 × ( ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ( italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⋅ italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + smooth end_ARG start_ARG ( ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + ( ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + smooth end_ARG , (3)

which represents an enhancement over the standard Dice loss by emphasizing the squared terms of intersections and unions.

Base_loss1=BCE+Dice_Loss,Base_loss1BCEDice_Loss\displaystyle\hskip 56.9055pt\text{Base\_loss1}=\text{BCE}+\text{Dice\_Loss},Base_loss1 = BCE + Dice_Loss , (4)
Base_loss2=BCE+Squared_Dice_Loss,Base_loss2BCESquared_Dice_Loss\displaystyle\hskip 56.9055pt\text{Base\_loss2}=\text{BCE}+\text{Squared\_Dice% \_Loss},Base_loss2 = BCE + Squared_Dice_Loss , (5)

Equations (1) and (3) define the base loss functions (5) for our proposed model, incorporating the Dice loss, and squared-Dice loss components. λisubscript𝜆𝑖\lambda_{i}italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the weight for different stages. In this paper, we set λisubscript𝜆𝑖\lambda_{i}italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT to 0.1, 0.2, 0.3, 0.4, and 0.5 based on the i-th stage, as illustrated in Figure 4. Equations (8) is our proposed group loss function that calculates the loss from the scaled layer masks in different stages with ground truth masks. Equation and present the stage loss in different stage layer and output loss in the output layer.

LossStage=BCE(StagePred,Target)+Squared_Dice_Loss(StagePred,Target),𝐿𝑜𝑠subscript𝑠𝑆𝑡𝑎𝑔𝑒BCE𝑆𝑡𝑎𝑔𝑒𝑃𝑟𝑒𝑑𝑇𝑎𝑟𝑔𝑒𝑡Squared_Dice_Loss𝑆𝑡𝑎𝑔𝑒𝑃𝑟𝑒𝑑𝑇𝑎𝑟𝑔𝑒𝑡\displaystyle\hskip 56.9055ptLoss_{Stage}=\text{BCE}(StagePred,Target)+\text{% Squared\_Dice\_Loss}(StagePred,Target),italic_L italic_o italic_s italic_s start_POSTSUBSCRIPT italic_S italic_t italic_a italic_g italic_e end_POSTSUBSCRIPT = BCE ( italic_S italic_t italic_a italic_g italic_e italic_P italic_r italic_e italic_d , italic_T italic_a italic_r italic_g italic_e italic_t ) + Squared_Dice_Loss ( italic_S italic_t italic_a italic_g italic_e italic_P italic_r italic_e italic_d , italic_T italic_a italic_r italic_g italic_e italic_t ) , (6)
LossOutput=BCE(OutputPred,Target)+Squared_Dice_Loss(OutputPred,Target),𝐿𝑜𝑠subscript𝑠𝑂𝑢𝑡𝑝𝑢𝑡BCE𝑂𝑢𝑡𝑝𝑢𝑡𝑃𝑟𝑒𝑑𝑇𝑎𝑟𝑔𝑒𝑡Squared_Dice_Loss𝑂𝑢𝑡𝑝𝑢𝑡𝑃𝑟𝑒𝑑𝑇𝑎𝑟𝑔𝑒𝑡\displaystyle\hskip 56.9055ptLoss_{Output}=\text{BCE}(OutputPred,Target)+\text% {Squared\_Dice\_Loss}(OutputPred,Target),italic_L italic_o italic_s italic_s start_POSTSUBSCRIPT italic_O italic_u italic_t italic_p italic_u italic_t end_POSTSUBSCRIPT = BCE ( italic_O italic_u italic_t italic_p italic_u italic_t italic_P italic_r italic_e italic_d , italic_T italic_a italic_r italic_g italic_e italic_t ) + Squared_Dice_Loss ( italic_O italic_u italic_t italic_p italic_u italic_t italic_P italic_r italic_e italic_d , italic_T italic_a italic_r italic_g italic_e italic_t ) , (7)
Group_Loss=LossOutput+i=15λi×LossStagei𝐺𝑟𝑜𝑢𝑝_𝐿𝑜𝑠𝑠𝐿𝑜𝑠subscript𝑠𝑂𝑢𝑡𝑝𝑢𝑡superscriptsubscript𝑖15subscript𝜆𝑖𝐿𝑜𝑠subscript𝑠𝑆𝑡𝑎𝑔subscript𝑒𝑖\displaystyle\hskip 56.9055ptGroup\_Loss=Loss_{Output}+\sum_{i=1}^{5}\lambda_{% i}\times Loss_{Stage_{i}}italic_G italic_r italic_o italic_u italic_p _ italic_L italic_o italic_s italic_s = italic_L italic_o italic_s italic_s start_POSTSUBSCRIPT italic_O italic_u italic_t italic_p italic_u italic_t end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT × italic_L italic_o italic_s italic_s start_POSTSUBSCRIPT italic_S italic_t italic_a italic_g italic_e start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT (8)

4 Experiments and Results

4.1 Experiments Setting

Datasets To evaluate the efficiency and performance of model with other published models, we pick the three public skin segementation datasets from PH2, International Skin Imaging Collaboration, namely ISIC2017 [46, 47] and ISIC2018 [48, 49]. The PH2 [50] is a dermoscopic image database which 400 images of contains manual segmentation and the clinical diagnosis. The ISIC2017 dataset comprises 2150 dermoscopy images, and ISIC2018 includes 2694 images. We noted that earlier studies[44, 43] have already presented a dataset version with a pre-established train-test partition, maintaining a 7:3 ratio. In our experimental setup, we opted to utilize the previously published dataset version.

Implementation Details Our UCM-Net is implemented with Pytorch [51] framework. All experiments are conducted on the instance node at Lambda [52] that has a single NVIDIA RTX A6000 GPU (24 GB), 14vCPUs, 46 GiB RAM and 512 GiB SSD. The images are normalized and resized to 256×\times×256. Simple data augmentations are applied, including horizontal flip**, vertical flip**, and random rotation. We noticed the prior studies [44, 43] applied initial image processing with the calculated mean and standard deviation (std) values of the whole train and test datasets separately. While this approach can potentially enhance their models’ training and testing performance, the outcomes are notably influenced by the computed mean and std values. Additionally, if the test dataset’s context information is unknown, this operation can render the trained model less practical. In our experiment, we don’t calculate the mean and std values based on the train and test datasets. Besides, TransFuse-S and SwinNet require the pre-train models with the specified input image size in their encoding stage. To enable fair benchmark testing, we follow the image input size for the TransFuse-S [36, 53],TransUNet[35, 54] and SwinNet [37, 55] to 192×\times×256, 224×\times×224 and 224×\times×224, Correspondingly. For the optimizer, we select AdamW [56] initialized with a learning rate of 0.001 and a weight decay of 0.01. The CosineAnnealingLR [57] is Utilized as the scheduler with a maximum number of iterations of 50 and a minimum learning rate of 1e-5. A total of 300 epochs are trained with a training batch size of 8 and a testing batch size of 1.

Evaluate Metrics To assess the predictive performance of our methods, we employ mean Intersection over Union (mIoU) and mean Dice similarity score (mDice) as evaluation metrics. It’s worth noting that previous studies [44, 43] and [42] have employed distinct calculation methods for mIoU and mDice. To comprehensively compare the performance predictions, our experiments include the presentation of mIoU, mDice, mIoU*, and mDice* results. These results are calculated using the following equations:

IoU=intersectionunionIoUintersectionunion\text{IoU}=\frac{\text{intersection}}{\text{union}}IoU = divide start_ARG intersection end_ARG start_ARG union end_ARG (9)
Dice=2×intersection+smooothsum of pixels in prediction+sum of pixels in ground truth+smooothDice2intersectionsmooothsum of pixels in predictionsum of pixels in ground truthsmoooth\text{Dice}=\frac{2\times\text{intersection}+\text{smoooth}}{\text{sum of % pixels in prediction}+\text{sum of pixels in ground truth}+\text{smoooth}}Dice = divide start_ARG 2 × intersection + smoooth end_ARG start_ARG sum of pixels in prediction + sum of pixels in ground truth + smoooth end_ARG (10)

where intersection represents the number of pixels that are common between the predicted output and the ground truth, smooth is a small constant added to improve numerical stability and union represents the total number of pixels in both the predicted output and the ground truth.

mIoU=1Ni=1NIoUimIoU1𝑁superscriptsubscript𝑖1𝑁subscriptIoU𝑖\text{mIoU}=\frac{1}{N}\sum_{i=1}^{N}\text{IoU}_{i}mIoU = divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT IoU start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT (11)
mDice=1Ni=1NDiceimDice1𝑁superscriptsubscript𝑖1𝑁subscriptDice𝑖\text{mDice}=\frac{1}{N}\sum_{i=1}^{N}\text{Dice}_{i}mDice = divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT Dice start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT (12)

where N is the number of images, IoUisubscriptIoU𝑖\text{IoU}_{i}IoU start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT represents the IoU score for image i and DiceisubscriptDice𝑖\text{Dice}_{i}Dice start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT represents the Dice score for image i.

IoU*=TPTP+FP+FNIoU*TPTPFPFN\text{IoU*}=\frac{\text{TP}}{\text{TP}+\text{FP}+\text{FN}}IoU* = divide start_ARG TP end_ARG start_ARG TP + FP + FN end_ARG (13)
Dice*=2×TP2×TP+FP+FNDice*2TP2TPFPFN\text{Dice*}=\frac{2\times\text{TP}}{2\times\text{TP}+\text{FP}+\text{FN}}Dice* = divide start_ARG 2 × TP end_ARG start_ARG 2 × TP + FP + FN end_ARG (14)

where TP represents the number of true positive pixels, FP represents the number of false positive pixels and FN represents the number of false negative pixels.

mIoU*=TPsumTPsum+FPsum+FNsummIoU*subscriptTP𝑠𝑢𝑚subscriptTP𝑠𝑢𝑚subscriptFP𝑠𝑢𝑚subscriptFN𝑠𝑢𝑚\text{mIoU*}=\frac{\text{TP}_{sum}}{\text{TP}_{sum}+\text{FP}_{sum}+\text{FN}_% {sum}}mIoU* = divide start_ARG TP start_POSTSUBSCRIPT italic_s italic_u italic_m end_POSTSUBSCRIPT end_ARG start_ARG TP start_POSTSUBSCRIPT italic_s italic_u italic_m end_POSTSUBSCRIPT + FP start_POSTSUBSCRIPT italic_s italic_u italic_m end_POSTSUBSCRIPT + FN start_POSTSUBSCRIPT italic_s italic_u italic_m end_POSTSUBSCRIPT end_ARG (15)
mDice*=2×TPsum2×TPsum+FPsum+FNsummDice*2subscriptTP𝑠𝑢𝑚2subscriptTP𝑠𝑢𝑚subscriptFP𝑠𝑢𝑚subscriptFN𝑠𝑢𝑚\text{mDice*}=\frac{2\times\text{TP}_{sum}}{2\times\text{TP}_{sum}+\text{FP}_{% sum}+\text{FN}_{sum}}mDice* = divide start_ARG 2 × TP start_POSTSUBSCRIPT italic_s italic_u italic_m end_POSTSUBSCRIPT end_ARG start_ARG 2 × TP start_POSTSUBSCRIPT italic_s italic_u italic_m end_POSTSUBSCRIPT + FP start_POSTSUBSCRIPT italic_s italic_u italic_m end_POSTSUBSCRIPT + FN start_POSTSUBSCRIPT italic_s italic_u italic_m end_POSTSUBSCRIPT end_ARG (16)

where TPsumsubscriptTP𝑠𝑢𝑚\text{TP}_{sum}TP start_POSTSUBSCRIPT italic_s italic_u italic_m end_POSTSUBSCRIPT represents the total number of true positive pixels for images, FPsumsubscriptFP𝑠𝑢𝑚\text{FP}_{sum}FP start_POSTSUBSCRIPT italic_s italic_u italic_m end_POSTSUBSCRIPT represents the total number of false positive pixels for images and FNsumsubscriptFN𝑠𝑢𝑚\text{FN}_{sum}FN start_POSTSUBSCRIPT italic_s italic_u italic_m end_POSTSUBSCRIPT represents the total number of false negative pixels for images.

In our benchmark experiments, we evaluate our method’s performance and compare the results among other published efficient models’. To ensure a fair comparison, we perform three sets of experiments for each method and subsequently present the mean and std of the prediction outcomes across each dataset.

4.2 Performance Comparisons

Table 1: Comparative prediction results on the PH2 dataset
Dataset Models Year mIoU(%)↑ mDice(%)↑ mIoU*(%)↑ mDice*(%)↑
PH2 TransUnet [35, 54] 2021 90.77 ± 0.030 95.09 ± 0.014 91.89 ± 0.148 95.77 ± 0.080
SANet [38, 58] 2021 89.99 ± 0.349 94.62 ± 0.206 91.06 ± 0.481 95.32 ± 0.264
SwinNet [37, 55] 2021 88.92 ± 0.493 94.03 ± 0.289 89.96 ± 0.388 94.71 ± 0.215
TransFuse-S [36, 53] 2021 91.15 ± 0.131 95.30 ± 0.078 92.16 ± 0.113 95.92 ± 0.061
U-Net [59, 60] 2015 89.09 ± 0.224 94.13 ± 0.112 90.08 ± 0.545 94.78 ± 0.302
Att-UNet [61, 34] 2018 89.01 ± 0.345 94.06 ± 0.222 89.82 ± 0.371 94.64 ± 0.206
MedT [39, 62] 2021 87.77 ± 0.099 93.35 ± 0.063 88.53 ± 0.338 93.92 ± 0.190
ConvUNeXt [40, 63] 2022 89.46 ± 0.056 94.33 ± 0.035 90.39 ± 0.172 94.95 ± 0.095
UNeXt-S [42, 64] 2022 89.03 ± 0.246 94.09 ± 0.163 89.689 ± 0.269 94.56 ± 0.149
MALUNet [43, 65] 2022 87.75 ± 0.378 93.35 ± 0.214 88.19 ± 0.329 93.73 ± 0.186
EGE-UNet [44, 66] 2023 88.39 ± 0.303 93.73 ± 0.189 88.93 ± 0.802 94.14 ± 0.450
UCM-Net (ours) 2023 89.62 ± 0.221 94.42 ± 0.124 90.51 ± 0.211 94.96 ± 0.116
  • : this method needs the pre-train model on training.

Table 2: Comparative prediction results on the ISIC2017 dataset
Dataset Models Year mIoU(%)↑ mDice(%)↑ mIoU*(%)↑ mDice*(%)↑
isic2017 TransUnet [35, 54] 2021 81.12 ± 0.152 88.38 ± 0.157 80.02 ± 0.011 88.90 ± 0.113
SANet[38, 58] 2021 79.34 ± 0.023 87.04 ± 0.155 79.34 ± 0.102 88.15 ± 0.060
SwinNet [37, 55] 2021 80.51 ± 0.102 87.81 ± 0.231 80.48 ± 0.134 89.19 ± 0.073
TransFuse-S [36, 53] 2021 80.73 ± 0.312 88.03 ± 0.111 79.87 ± 0.109 88.81 ± 0.058
U-Net [59, 60] 2015 78.12 ± 0.175 85.97 ± 0.196 76.42 ± 0.381 86.63 ± 0.245
Att-UNet [61, 34] 2018 78.45 ± 0.113 86.22 ± 0.124 77.14 ± 0.097 87.10 ± 0.062
MedT [39, 62] 2021 79.05 ± 0.231 86.59 ± 0.125 77.61 ± 0.121 87.40 ± 0.401
ConvUNeXt [40, 63] 2022 78.78 ± 0.362 86.21 ± 0.267 76.98 ± 0.490 86.99 ± 0.313
UNeXt-S [42, 64] 2022 78.79 ± 0.234 86.30 ± 0.140 77.48 ± 0.466 87.31 ± 0.296
MALUNet [43, 65] 2022 79.97 ± 0.389 87.23 ± 0.345 79.11 ± 0.345 88.34 ± 0.215
EGE-UNet [44, 66] 2023 80.22 ± 0.237 87.24 ± 0.149 79.71 ± 0.411 88.71 ± 0.255
UCM-Net (ours) 2023 80.71 ± 0.345 87.66 ± 0.221 79.29 ± 0.188 88.45 ± 0.117
  • : this method needs the pre-train model on training.

Table 3: Comparative prediction results on the ISIC2018
Dataset Models Year mIoU(%)↑ mDice(%)↑ mIoU*(%)↑ mDice*(%)↑
isic2018 TransUnet [35, 54] 2021 82.00 ± 0.152 89.11 ± 0.031 81.68 ± 0.511 89.92 ± 0.010
SANet [38, 58] 2021 80.37 ± 0.124 87.87 ± 0.114 79.39 ± 0.135 88.51 ± 0.084
SwinNet [37, 55] 2021 81.41 ± 0.069 88.58 ± 0.019 80.72 ± 0.069 89.33 ± 0.042
TransFuse-S [36, 53] 2021 81.69 ± 0.128 88.82 ± 0.134 81.29 ± 0.084 89.68 ± 0.070
U-Net [59, 60] 2015 79.86 ± 0.075 87.57 ± 0.085 78.27 ± 0.300 87.81 ± 0.188
Att-UNet [61, 34] 2018 80.05 ± 0.079 87.62 ± 0.078 78.38 ± 0.151 87.88 ± 0.095
MedT [39, 62] 2021 80.34 ± 0.034 87.77 ± 0.107 79.29 ± 0.411 88.45 ± 0.251
ConvUNeXt [40, 63] 2022 80.51 ± 0.043 87.99 ± 0.049 78.71 ± 0.128 88.09 ± 0.080
UNeXt-S [42, 64] 2022 80.70 ± 0.226 88.17 ± 0.194 79.26 ± 0.497 88.43 ± 0.309
MALUNet [43, 65] 2022 80.95 ± 0.393 88.25 ± 0.315 79.99 ± 0.644 88.88 ± 0.398
EGE-UNet [44, 66] 2023 81.16 ± 0.104 88.36 ± 0.086 80.28 ± 0.363 89.06 ± 0.223
UCM-Net (ours) 2023 81.26 ± 0.030 88.48 ± 0.109 80.85 ± 0.251 89.35 ± 0.111
  • : this method needs the pre-train model on training.

Table 4: Comparative performance results on models’ computations and the number of parameters.
Models Year Image Size (H x W) Params↓ GFLOPs↓ Memory (MB)↓
TransUNet [35, 54] 2021 224 x 224 105,277,081 24.7278 402.3614
SANet [38, 58] 2021 256 x 256 23,899,497 5.9983 91.9233
SwinNet [37, 55] 2021 224 x 224 20,076,204 5.5635 77.3503
TransFuse-S [36, 53] 2021 192 x 256 26,248,725 8.6462 100.8809
U-Net [59, 60] 2015 256 x 256 31,037,633 54.7378 119.3992
Att-UNet [61, 34] 2018 256 x 256 34,878,573 66.6318 134.0512
MedT [39, 62] 2021 256 x 256 1,564,202 2.4061 6.9670
ConvUNeXt [40, 63] 2022 256 x 256 3,505,697 7.2537 14.3732
UNeXt-S [42, 64] 2022 256 x 256 253,561 0.1038 1.9673
MALUNet [43, 65] 2022 256 x 256 177,943 0.0830 1.6788
EGE-UNet [44, 66] 2023 256 x 256 53,374 0.0721 1.2036
UCM-Net(ours) 2023 256 x 256 49,932 0.0465 1.1905
  • : this method needs the pre-train model on training.

Refer to caption
Figure 5: Vision performance comparison on samples

Table 1-3 comprehensively evaluates the performance of our UCM-Net, a novel skin lesion segmentation model, compared to well-established models, using the widely recognized PH2, ISIC2017 and ISIC2018 datasets. Introduced in 2023, UCM-Net is a robust and highly competitive solution in this domain. One of the key takeaways from the table is UCM-Net’s ability to outperform EGE-UNet, which had previously held the title of the state-of-the-art model for skin lesion segmentation. Our model achieves superior results across various prediction metrics, emphasizing its advancement in the field and its potential to redefine the standard for accurate skin lesion delineation. Moreover, UCM-Net’s performance is notably competitive even when compared to SwinNet, a model that relies on pre-trained models during training. Table 4 complements this assessment by comparing computational aspects and the number of parameters for various segmentation models. Remarkably, UCM-Net, operating with the same number of channels {8,16,24,32,48,64} and image size, as EGE-UNet, boasts fewer parameters and lower GFLOPs. Additionally, even when compared to TransUNet, TransFuse-S and SwinNet, which operate with smaller image sizes, UCM-Net demonstrates faster computational speed. In Figure 5, we present a visual exhibition of all the non-pretrained models’ segmentation outputs. This figure directly compares our segmentation results, those produced by other methods, and the ground truth, all displayed side by side using representative sample images. Notably, our segmentation results demonstrate a remarkable level of accuracy, closely resembling the ground truth annotations. Tables 1-3 and Figure 5 collectively underscore UCM-Net’s exceptional performance and efficiency in skin lesion segmentation, affirming its potential to make a substantial impact in advancing early skin cancer diagnosis and treatment.

4.3 Ablation results

To demonstrate the efficiency and effectiveness of our proposed modules, we conducted a series of ablation experiments on dataset ISIC2017. We develop UCM-Net based on U-Net. Figure 6 shows the different block structures in the stages among the compared models. Table 5 shows the ablation experiments’ results including the number of parameters, Giga Flops, mIoU score and mDice score. U-Net variant and variant 1 are six-stage U-Nets with the stage channels {8,16,24,32,48,64}. The details of UCM-Net are illustrated in Figure 4.

Refer to caption
Figure 6: Stage Block Structures in ablation experiments.

A: U-Net and U-Net Variant. B: U-Net Variant 1. C: UCM-Net

Table 5: Ablation experiments’ results on the ISIC2017 dataset
Models Structure Reference Params↓ GFLOPs↓ mIoU(%)↑ mDice(%)↑
U-Net(baseline) Figure 6 (A) 310,376,33 54.7378 78.34 86.23
U-Net Variant Figure 6 (A) 248,531 0.5715 78.48 86.22
U-Net Variant 1 Figure 6 (B) 148,157 0.3700 73.89 82.36
UCM-Net Figure 6 (C) 49,932 0.0465 79.76 86.94
UCM-Net + Group Loss Figure 6 (C) 49,932 0.0465 80.63 87.64
UCM-Net + Group Loss(ours) Figure 6 (C) 49,932 0.0465 81.01 87.98
Refer to caption
Figure 7: IoU vs Epoch results of ablation experiments
Refer to caption
Figure 8: Dice vs Epoch results of ablation experiments

As the results of the U-Net and U-Net variants are shown in Table 5, although the number of parameters is reduced, the U-Net variant performs better with the six-stage structure. When we set the one convolution kernel to 1 to reduce the number of parameters, the model’s performance drops severely. However, when we replaced the convolution with our proposed UCMNet block, the results showed that the model’s performance improved significantly. The UCM-Net, as depicted in Figure 4 and Figure 6(C), with 49,932 parameters and 0.0465 GFLOPs, outperforms the U-Net variant with 248,531 parameters and the baseline U-Net with 31,037,633 parameters in terms of both mean Intersection over Union (mIoU) and mean Dice Similarity Coefficient (mDice) metrics.

Furthermore, when incorporating the proposed Group Loss into the UCM-Net architecture, denoted as "UCM-Net + Group Loss(ours)", the model’s performance continued to excel. This enhancement resulted in a higher mIoU of 81.00% and a mDice of 87.98%, demonstrating the effectiveness of the proposed Group Loss in further improving segmentation accuracy. Figures 7-8 show that "UCM-Net + Group Loss(ours)" always presents the high scores of mIoU and mDice with the training epoch increase.

The above findings in the ablation experiments underscore the significance of architectural innovations such as the UCMNet block in achieving superior semantic segmentation performance, even with fewer parameters and computational complexity than the U-Net baseline. We note the batch size will affect the prediction performance. So, we tested our model with different batch sizes and reported it in Table 6 in the appendix.

5 Conclusion

This paper introduces UCM-Net, a novel, lightweight, and highly efficient solution. UCM-Net combines MLP and CNN, providing robust feature learning capabilities while maintaining a minimal parameter count and reduced computational demand. We applied this innovative approach to the challenging task of skin lesion segmentation, conducting comprehensive experiments with a range of evaluation metrics to showcase its effectiveness and efficiency. The results of our extensive experiments unequivocally demonstrate UCM-Net’s superior performance compared to the state-of-the-art EGE-UNet. Remarkably, UCM-Net is the first model with fewer than 50KB parameters and consuming less than 0.05 GLOPs for skin lesion segmentation. Looking forward to future research endeavors, we aim to expand the application of UCM-Net to other critical medical image tasks, advancing the field and exploring how this efficient architecture can contribute to a broader spectrum of healthcare applications. This potential revolution in utilizing deep learning for medical image analysis opens up numerous possibilities for enhancing patient care and diagnostic accuracy. However, it is essential to acknowledge that compared to models leveraging pre-trained components, UCM-Net currently demonstrates lower accuracy performance. Recognizing this limitation, our future efforts will focus on optimizing UCM-Net to improve its accuracy. Such advancements are crucial for ensuring that the model maintains its efficiency and competes favorably in performance with existing state-of-the-art solutions. By addressing these challenges, we aim to advance the field further and expand the impact of deep learning in healthcare applications, making significant contributions to medical imaging and beyond.

References

  • [1] American cancer society. cancer facts figures 2022. atlanta: American cancer society. 2022.
  • [2] RL Siegel, KD Miller, HE Fuchs, and A Jemal. Cancer statistics, 2022. ca. 506. Cancer J. Clin, 72(7-33):507, 2022.
  • [3] Rebecca L Siegel, Kimberly D Miller, and Nikita Sandeep Wagle. Cancer statistics, 2023. 2023.
  • [4] Robin Marks. An overview of skin cancers. Cancer, 75(S2):607–612, 1995.
  • [5] Thomas Martin Lehmann, Claudia Gonner, and Klaus Spitzer. Survey: Interpolation methods in medical image processing. IEEE transactions on medical imaging, 18(11):1049–1075, 1999.
  • [6] Muhammad Nasir, Muhammad Attique Khan, Muhammad Sharif, Ikram Ullah Lali, Tanzila Saba, and Tassawar Iqbal. An improved strategy for skin lesion detection and classification using uniform segmentation and feature selection based approach. Microscopy research and technique, 81(6):528–543, 2018.
  • [7] Catarina Barata, M Emre Celebi, and Jorge S Marques. Explainable skin lesion diagnosis using taxonomies. Pattern Recognition, 110:107413, 2021.
  • [8] Isaac Sanchez and Sos Agaian. Computer aided diagnosis of lesions extracted from large skin surfaces. In 2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 2879–2884. IEEE, 2012.
  • [9] Isaac Sanchez and Sos Agaian. A new system of computer-aided diagnosis of skin lesions. In Image Processing: Algorithms and Systems X; and Parallel Processing for Imaging Applications II, volume 8295, pages 390–401. SPIE, 2012.
  • [10] Alex Liew, Sos Agaian, and Liang Zhao. Mitigation of adversarial noise attacks on skin cancer detection via ordered statistics binary local features. In Multimodal Image Exploitation and Learning 2023, volume 12526, pages 153–164. SPIE, 2023.
  • [11] Khalid M Hosny, Doaa Elshora, Ehab R Mohamed, Eleni Vrochidou, and George A Papakostas. Deep learning and optimization-based methods for skin lesions segmentation: A review. IEEE Access, 2023.
  • [12] Vladimir Frants and Sos Agaian. Dermoscopic image segmentation based on modified grabcut with octree color quantization. In Mobile Multimedia/Image Processing, Security, and Applications 2020, volume 11399, pages 119–130. SPIE, 2020.
  • [13] Muhammad Imran Razzak, Saeeda Naz, and Ahmad Zaib. Deep learning for medical image processing: Overview, challenges and the future. Classification in BioApps: Automation of Decision Making, pages 323–350, 2018.
  • [14] Andreas Maier, Christopher Syben, Tobias Lasser, and Christian Riess. A gentle introduction to deep learning in medical image processing. Zeitschrift für Medizinische Physik, 29(2):86–101, 2019.
  • [15] Neeraj Sharma and Lalit M Aggarwal. Automated medical image segmentation techniques. Journal of medical physics/Association of Medical Physicists of India, 35(1):3, 2010.
  • [16] Zahra Mirikharaji, Kumar Abhishek, Alceu Bissoto, Catarina Barata, Sandra Avila, Eduardo Valle, M Emre Celebi, and Ghassan Hamarneh. A survey on deep learning for skin lesion segmentation. Medical Image Analysis, page 102863, 2023.
  • [17] Tiago M de Carvalho, Eline Noels, Marlies Wakkee, Andreea Udrea, and Tamar Nijsten. Development of smartphone apps for skin cancer risk assessment: progress and promise. JMIR Dermatology, 2(1):e13376, 2019.
  • [18] butterflynetwork. https://www.butterflynetwork.com/iq-ultrasound-individuals.
  • [19] phonemedical. https://blog.google/technology/health/ai-dermatology-preview-io-2021/.
  • [20] Sandeep Kumar Vashist. Point-of-care diagnostics: Recent advances and trends. Biosensors, 7(4):62, 2017.
  • [21] Andre Esteva, Alexandre Robicquet, Bharath Ramsundar, Volodymyr Kuleshov, Mark DePristo, Katherine Chou, Claire Cui, Greg Corrado, Sebastian Thrun, and Jeff Dean. A guide to deep learning in healthcare. Nature medicine, 25(1):24–29, 2019.
  • [22] Chunlei Chen, Peng Zhang, Huixiang Zhang, Jiangyan Dai, Yugen Yi, Huihui Zhang, and Yonghui Zhang. Deep learning on computational-resource-limited platforms: a survey. Mobile Information Systems, 2020:1–19, 2020.
  • [23] Neil C Thompson, Kristjan Greenewald, Keeheon Lee, and Gabriel F Manso. The computational limits of deep learning. arXiv preprint arXiv:2007.05558, 2020.
  • [24] Hyunseok Seo, Masoud Badiei Khuzani, Varun Vasudevan, Charles Huang, Hongyi Ren, Ruoxiu Xiao, Xiao Jia, and Lei Xing. Machine learning techniques for biomedical image segmentation: an overview of technical aspects and introduction to state-of-art applications. Medical physics, 47(5):e148–e167, 2020.
  • [25] Sertan Serte, Ali Serener, and Fadi Al-Turjman. Deep learning in medical imaging: A brief review. Transactions on Emerging Telecommunications Technologies, 33(10):e4080, 2022.
  • [26] Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen Awm Van Der Laak, Bram Van Ginneken, and Clara I Sánchez. A survey on deep learning in medical image analysis. Medical image analysis, 42:60–88, 2017.
  • [27] Abeer Aljuaid and Mohd Anwar. Survey of supervised learning for medical image processing. SN Computer Science, 3(4):292, 2022.
  • [28] Khalid Raza and Nripendra K Singh. A tour of unsupervised deep learning for medical image analysis. Current Medical Imaging, 17(9):1059–1077, 2021.
  • [29] Quande Liu, Lequan Yu, Luyang Luo, Qi Dou, and Pheng Ann Heng. Semi-supervised medical image classification with relation-driven self-ensembling model. IEEE transactions on medical imaging, 39(11):3429–3440, 2020.
  • [30] Rushi Jiao, Yichi Zhang, Le Ding, Rong Cai, and Jicong Zhang. Learning with limited annotations: a survey on deep semi-supervised learning for medical image segmentation. arXiv preprint arXiv:2207.14191, 2022.
  • [31] Hyeji Kim, Muhammad Umar Karim Khan, and Chong-Min Kyung. Efficient neural network compression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12569–12577, 2019.
  • [32] Chunyu Yuan and Sos S Agaian. A comprehensive review of binary neural network. Artificial Intelligence Review, 56(11):12949–13013, 2023.
  • [33] Partha Pratim Ray. A review on tinyml: State-of-the-art and prospects. Journal of King Saud University-Computer and Information Sciences, 34(4):1595–1623, 2022.
  • [34] Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y Hammerla, Bernhard Kainz, et al. Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999, 2018.
  • [35] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L Yuille, and Yuyin Zhou. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021.
  • [36] Yundong Zhang, Huiye Liu, and Qiang Hu. Transfuse: Fusing transformers and cnns for medical image segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, pages 14–24. Springer, 2021.
  • [37] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang. Swin-unet: Unet-like pure transformer for medical image segmentation. In European conference on computer vision, pages 205–218. Springer, 2022.
  • [38] Jun Wei, Yiwen Hu, Ruimao Zhang, Zhen Li, S Kevin Zhou, and Shuguang Cui. Shallow attention network for polyp segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, pages 699–708. Springer, 2021.
  • [39] Jeya Maria Jose Valanarasu, Poojan Oza, Ilker Hacihaliloglu, and Vishal M Patel. Medical transformer: Gated axial-attention for medical image segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, pages 36–46. Springer, 2021.
  • [40] Zhimeng Han, Muwei Jian, and Gai-Ge Wang. Convunext: An efficient convolution neural network for medical image segmentation. Knowledge-Based Systems, 253:109512, 2022.
  • [41] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986, 2022.
  • [42] Jeya Maria Jose Valanarasu and Vishal M Patel. Unext: Mlp-based rapid medical image segmentation network. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 23–33. Springer, 2022.
  • [43] Jiacheng Ruan, Suncheng Xiang, Mingye Xie, Ting Liu, and Yuzhuo Fu. Malunet: A multi-attention and light-weight unet for skin lesion segmentation. In 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 1150–1156. IEEE, 2022.
  • [44] Jiacheng Ruan, Mingye Xie, **gsheng Gao, Ting Liu, and Yuzhuo Fu. Ege-unet: an efficient group enhanced unet for skin lesion segmentation. arXiv preprint arXiv:2307.08473, 2023.
  • [45] Gousia Habib and Shaima Qureshi. Optimization and acceleration of convolutional neural networks: A survey. Journal of King Saud University-Computer and Information Sciences, 34(7):4244–4268, 2022.
  • [46] Isic 2017 challenge dataset. https://challenge.isic-archive.com/data/#2017.
  • [47] Matt Berseth. Isic 2017-skin lesion analysis towards melanoma detection. arXiv preprint arXiv:1703.00523, 2017.
  • [48] Isic 2018 challenge dataset. https://challenge.isic-archive.com/data/#2018.
  • [49] Noel Codella, Veronica Rotemberg, Philipp Tschandl, M Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic). arXiv preprint arXiv:1902.03368, 2019.
  • [50] Teresa Mendonça, Pedro M Ferreira, Jorge S Marques, André RS Marcal, and Jorge Rozeira. Ph 2-a dermoscopic image database for research and benchmarking. In 2013 35th annual international conference of the IEEE engineering in medicine and biology society (EMBC), pages 5437–5440. IEEE, 2013.
  • [51] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • [52] lambda cloud gpu. https://cloud.lambdalabs.com/instances.
  • [53] Transfuse official code. https://github.com/Rayicer/TransFuse.
  • [54] Transunet official code. https://github.com/Beckschen/TransUNet.
  • [55] Swinnet official code. https://github.com/HuCaoFighting/Swin-Unet.
  • [56] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018.
  • [57] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  • [58] Sanet official code. https://github.com/weijun88/SANet.
  • [59] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
  • [60] U-net highest stars code. https://github.com/milesial/Pytorch-UNet.
  • [61] Att u-net highest stars code. https://github.com/LeeJunHyun/Image_Segmentation.
  • [62] Medt official code. https://github.com/jeya-maria-jose/Medical-Transformer.
  • [63] Convunext official code. https://github.com/1914669687/ConvUNeXt.
  • [64] Unext official code. https://github.com/jeya-maria-jose/UNeXt-pytorch.
  • [65] Malunet official code. https://github.com/JCruan519/MALUNet.
  • [66] Ege-net official code. https://github.com/JCruan519/EGE-UNet.

6 Appendix

Table 6: Batch size experiments’ results on the PH2 dataset
Models Batch Size mIoU(%)↑ mDice(%)↑ mIoU*(%)↑ mDice*(%)↑
TransUnet [35, 54] 8 90.77 ± 0.030 95.09 ± 0.014 91.89 ± 0.148 95.77 ± 0.080
SANet [38, 58] 8 89.99 ± 0.349 94.62 ± 0.206 91.06 ± 0.481 95.32 ± 0.264
SwinNet [37, 55] 8 88.92 ± 0.493 94.03 ± 0.289 89.96 ± 0.388 94.71 ± 0.215
TransFuse-S [36, 53] 8 91.15 ± 0.131 95.30 ± 0.078 92.16 ± 0.113 95.92 ± 0.061
UCM-Net + Group Loss(ours) 1 89.68 94.45 89.96 94.72
2 89.88 94.57 90.39 94.95
4 89.57 94.38 90.18 94.84
8 89.39 94.28 89.67 94.55
16 88.44 93.74 88.85 94.10
32 89.15 94.15 90.23 94.86

We tested our model with different batch sizes and reported it in Table 6 in the appendix.