EFCNet: Every Feature Counts for Small Medical Object Segmentation

Lingjie Kong, Qiaoling Wei, Chengming Xu, Han Chen, Yanwei Fu
Fudan University, Shanghai, China   
Equal contribution. Corresponding author.
Abstract

This paper explores the segmentation of very small medical objects with significant clinical value. While Convolutional Neural Networks (CNNs), particularly UNet-like models, and recent Transformers have shown substantial progress in image segmentation, our empirical findings reveal their poor performance in segmenting the small medical objects and lesions concerned in this paper. This limitation may be attributed to information loss during their encoding and decoding process. In response to this challenge, we propose a novel model named EFCNet for small object segmentation in medical images. Our model incorporates two modules: the Cross-Stage Axial Attention Module (CSAA) and the Multi-Precision Supervision Module (MPS). These modules address information loss during encoding and decoding procedures, respectively. Specifically, CSAA integrates features from all stages of the encoder to adaptively learn suitable information needed in different decoding stages, thereby reducing information loss in the encoder. On the other hand, MPS introduces a novel multi-precision supervision mechanism to the decoder. This mechanism prioritizes attention to low-resolution features in the initial stages of the decoder, mitigating information loss caused by subsequent convolution and sampling processes and enhancing the model’s global perception. We evaluate our model on two benchmark medical image datasets. The results demonstrate that EFCNet significantly outperforms previous segmentation methods designed for both medical and normal images.

1 Introduction

Small medical objects, such as HyperReflective Dots (HRDs) observed on Optical Coherence Tomography (OCT), are frequently encountered in disease research. Various studies [26, 15, 1, 9] have confirmed the significant relevance of these small lesions to medical diagnosis and treatment. However, manually labeling these small objects in medical images is time-consuming and labor-intensive, representing a substantial drain on medical resources. Consequently, there is a pressing need to automate the segmentation of small medical objects using computer vision algorithms. For this type of task, one readily recalls the numerous image segmentation models such as U-Net [27], ResUNet [39], DenseUNet [11], ResUNet++ [18], TransFuse [42] and Swin-Unet [3], as well as the recent SAM [20]. Given their generally strong performance, it is natural to ask: can these methods solve small medical object segmentation?

Figure 1: Samples of small medical objects in two datasets (S-HRD and S-Polyp) in our work.
Figure 2: Comparison of our method (d) against conventional encoder-decoder based methods in previous works (a)-(c). (a)&(b) In traditional methods [27, 39], the single-stage features of the encoder and the corresponding single-stage features of the decoder are fused by concatenation or addition, with one segmentation head at the end of the decoder. (c) Some methods [24, 41] attempt to add attention mechanisms to the encoder, which are however limited to single-stage features, and only one segmentation head is adopted at the end of the decoder. (d) Our method aggregates all features from each stage of the encoder through CSAA to guide the decoding procedure. Multi-resolution features in each stage of the decoder are segmented at multiple precisions by multiple segmentation heads through MPS.

Unfortunately, these well-known methods generally perform poorly on this specific task. Moreover, research specifically dedicated to the segmentation of small medical objects is lacking. We identify significant information loss in previous segmentation methods, in two respects: 1) Earlier methods [27, 19] often use only features from the preceding stage in the decoder and the corresponding stage in the encoder, limiting the features directly available to each decoding stage. This contradicts previous studies [25, 8] indicating that segmentation accuracy improves when both shallow and deep features are used, and it leaves information from other encoder stages untapped. 2) Many prior approaches [39, 19, 14] employ only one segmentation head for supervision at the last decoder stage. In contrast, studies [43, 6, 7] highlight the strong global perception of low-resolution features in early decoding stages, which is valuable for small object localization. However, information in these early decoder stages is partly lost during convolution and upsampling. Additionally, small medical objects carry less information than standard-sized objects, which intensifies the impact of information loss on segmentation accuracy. Some samples of small medical objects are illustrated in Fig. 1. Notably, the SAM model [20] performs inadequately on this segmentation task.

To address the challenge of information loss during segmentation, we introduce a novel solution based on the encoder-decoder structure. Our model meticulously attends to all features in each stage of both the encoder and decoder, enhancing the accuracy of small medical object segmentation. Figure 2 illustrates the distinctions between our approach and prior methods. Specifically, for the encoder, we present the Cross-Stage Axial Attention Module (CSAA), leveraging the attention mechanism to integrate features from all stages. This adaptation enables the model to dynamically learn information necessary for each decoding stage. CSAA facilitates direct reference to all valuable information in the encoder during the decoding process of each stage, mitigating information loss in the encoder. Simultaneously, we introduce the Multi-Precision Supervision Module (MPS) for the decoder. This module adds segmentation heads with varying precision for supervision after each stage in the decoder. Low-precision segmentation heads focus on low-resolution features, temporarily overlooking local details to leverage their robust global perception. With MPS, the model effectively exploits information from each stage in the decoder, reducing information loss in this part of the model.

To validate the efficacy of the proposed method, we carry out comprehensive experiments on two datasets, S-HRD and S-Polyp, as illustrated in Fig. 1. These datasets comprise a fundus Optical Coherence Tomography (OCT) image dataset created by our team and a subset of CVC-ClinicDB [2]. Experimental results on these two datasets demonstrate that our method outperforms previous state-of-the-art models in terms of dice similarity coefficient (DSC) and intersection over union (IoU).
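For reference, the two reported metrics can be computed from binary masks as in the minimal sketch below. It uses the standard definitions of DSC and IoU; the function name `dice_and_iou`, the smoothing term `eps`, and the toy 8x8 masks are illustrative assumptions, not details from our evaluation code.

```python
import numpy as np

def dice_and_iou(pred, target, eps=1e-7):
    """Return (DSC, IoU) for two binary masks of the same shape.

    eps is a hypothetical smoothing term that avoids division by zero
    when both masks are empty.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    dice = (2 * inter + eps) / (pred.sum() + target.sum() + eps)
    iou = (inter + eps) / (union + eps)
    return float(dice), float(iou)

# Toy example: a 2x2 object and a prediction shifted by one pixel.
target = np.zeros((8, 8), dtype=bool)
target[3:5, 3:5] = True
pred = np.zeros((8, 8), dtype=bool)
pred[3:5, 4:6] = True
dice, iou = dice_and_iou(pred, target)
print(f"DSC={dice:.3f}, IoU={iou:.3f}")  # DSC=0.500, IoU=0.333
```

The toy case illustrates why small objects are demanding under these metrics: a one-pixel shift of a 2x2 object already halves the DSC.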

Our contributions are outlined as follows:
1. Innovative Segmentation Approach: We introduce a novel concept to address the challenge of small medical object segmentation. Emphasizing the significance of every feature in medical images, our model meticulously attends to all features at each stage. This approach enables the extraction of diverse information, thereby mitigating information loss associated with small medical objects.
2. Proposed Modules for Enhanced Accuracy: We devise two key modules, namely the Cross-Stage Axial Attention Module (CSAA) and the Multi-Precision Supervision Module (MPS). These modules effectively tackle information loss in the encoder and decoder, respectively, resulting in an improved segmentation accuracy for the model.
3. Benchmark Construction and Model Validation: We establish a new benchmark for evaluating small medical object segmentation. Through experiments on two datasets focusing on small medical objects, our model significantly outperforms previous state-of-the-art models. This demonstrates the robustness and superiority of our proposed approach in this challenging domain.

2 Related Works

Medical Image Segmentation. Recently, researchers have introduced several innovative methods [5, 23, 10, 12, 33] for semantic segmentation in medical images. Zhang et al. [42] proposed a unique hybrid structure that concurrently integrates CNN and Transformer, leading to a reduction in the loss of low-level details. Chen et al. [4] incorporated a Transformer module into the U-Net encoder, enhancing the model’s ability for long-range modeling. Wang et al. [36] addressed overfitting by employing a Transformer encoder and introduced the progressive locality decoder to improve local information processing in medical images. While these methods have significantly contributed to medical image segmentation, they often fall short in accounting for the impact of object size on segmentation results. Particularly, these models tend to underperform when confronted with the segmentation of small objects. Lou et al. [21] recognized the significance of considering the size of medical objects in the segmentation process, and introduced a Context Axial Reverse Attention module (CaraNet) to assist the model in detecting local information related to small medical objects. However, the use of bilinear interpolation in the decode stage of CaraNet leads to substantial information loss, significantly affecting the segmentation of small medical objects. In contrast to the aforementioned methods, our model addresses the issue of information loss in small medical objects through CSAA and MPS, effectively improving segmentation accuracy.

General Segmentation Model. Recently, Kirillov et al. [20] introduced the Segment Anything Model (SAM), a versatile segmentation model that has made significant strides in the realm of natural image segmentation. Despite its success in general applications, SAM proves unsuitable for many medical image segmentation tasks due to the intricate structures and complex boundaries present, particularly in cases involving small medical objects, without manual guidance [17, 13]. In response to these limitations, Ma et al. [22] devised MedSAM as an adaptation of SAM tailored specifically for medical image segmentation. MedSAM exhibits notable advancements in handling medical image segmentation tasks compared to SAM [20]. Nevertheless, even with these improvements, MedSAM [22] still struggles when tasked with the segmentation of small medical objects concerned in this paper.

Attention Mechanism. Numerous attention-based methods [24, 28, 16, 29, 32] have emerged in recent years, applied to diverse tasks in computer vision and natural language processing. Vaswani et al. [34] broke away from the conventional convolutional structure and introduced the Transformer, a novel architecture utilizing attention mechanisms. Woo et al. [38] enhanced convolutional neural networks by incorporating both Channel Attention and Spatial Attention. Zhang et al. [40] proposed the Pyramid Squeeze Attention Module (PSA) to enable the model to capture spatial information across different channels. Building upon the insights gained from the aforementioned attention-based methods, we introduce the Cross-Stage Axial Attention Module (CSAA). This module facilitates feature fusion and minimizes information loss within the model.

3 Method

We first introduce the problem setup and notation for small medical object segmentation. In Sec. 3.1, we outline the overall structure of EFCNet. We detail the CSAA in Sec. 3.2 and the MPS in Sec. 3.3. Lastly, we discuss our loss function in Sec. 3.4.

Problem Setup and Notations. In our segmentation task, we denote the medical image dataset as $X=\{x_{1},\ldots,x_{m} \mid x_{i}\in\mathbb{R}^{C\times H\times W},\ i=1,2,\ldots,m\}$. Doctors meticulously and manually annotate lesions in each image $x_{i}$, forming the ground truth set $Y=\{y_{1},\ldots,y_{m} \mid y_{i}\in\{0,1\}^{1\times H\times W},\ i=1,2,\ldots,m\}$. The complete dataset is denoted as $D=\{X,Y\}$, which is split into a training set $D_{\text{train}}=\{X_{\text{train}},Y_{\text{train}}\}$ and a testing set $D_{\text{test}}=\{X_{\text{test}},Y_{\text{test}}\}$.

Our objective is to develop an algorithm that empowers our model to effectively segment small medical objects from $D_{\text{train}}$ and demonstrate robust performance on $D_{\text{test}}$.

3.1 Overall Architecture

Figure 3: Structure of our proposed EFCNet. (a) Overview of our method, featuring an Encoder-Decoder architecture equipped with the Cross-Stage Axial Attention Module (CSAA) and Multi-Precision Supervision Module (MPS). (b) Details of our Cross-Stage Axial Attention Module (CSAA). The CSAA combines features from each stage of the encoder, dynamically extracts information about small medical objects, and directs the decoding process of each stage in the decoder.

We propose a novel method called EFCNet to address the challenge of segmenting small medical objects. Specifically, we have designed two modules, the Cross-Stage Axial Attention Module (CSAA) and the Multi-Precision Supervision Module (MPS), to ensure that the model focuses on small medical object information in both the encoder and decoder.

In Fig. 3(a), an image containing small medical objects serves as input to our model. Initially, the image undergoes encoding through $k$ stages, producing feature maps encompassing diverse information about small medical objects. Subsequently, the CSAA Module processes features from all encoder stages, compelling the model to adaptively learn pertinent information for the decoding phase. The decoder then sequentially processes these feature maps under the guidance of the CSAA Module. Lastly, the feature maps from all decoder stages enter the MPS Module to yield multi-precision prediction results, receiving separate supervisions. During testing, we utilize the segmentation output from the last decoder stage as the final result of our model.

3.2 CSAA Module

Information about small medical objects is dispersed across different encoder stages, each containing varied types of data. However, much of this information is not directly usable by the decoder, and some is lost during the convolution and downsampling processes. To reduce information loss in the encoder and fully leverage insights about small medical objects, we present a novel Cross-Stage Axial Attention Module (CSAA). This module adaptively learns from features in all encoder stages and subsequently guides the decoding process. As depicted in Fig. 3(b), CSAA has four steps: resizing, W-dimensional axial attention, H-dimensional axial attention and resizing back.

Resizing. To enhance the fusion of features from all encoder stages, we resize all feature maps in each encoder stage to $(C^{*},H^{*},W^{*})$, adjusting both spatial and channel dimensions through convolution operations:

$$f_{i}^{*}=\sigma(BN(conv_{i}^{e}(f_{i}^{e}))),\quad i=1,2,\ldots,k, \tag{1}$$

where $f_{i}^{e}$ represents the feature map in stage $i$ of the encoder; $\sigma$ and $BN$ denote ReLU and Batch Normalization respectively; and $k$ is the number of stages in the encoder and decoder. We consider the $k$-stage resized feature maps $\{f_{i}^{*}\}_{i=1}^{k}$ as the input for the subsequent W-dimensional axial attention step.
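The resizing step of Eq. 1 can be sketched in NumPy as follows. This is only an illustration under stated assumptions: each stage is downsampled by a convolution whose kernel size equals its stride (so the target resolution is that of the deepest stage), Batch Normalization uses batch statistics without learned scale and shift, and the names `strided_conv` and `bn_relu` as well as all shapes are hypothetical. The text only specifies that convolution, BN and ReLU are applied.

```python
import numpy as np

def strided_conv(f, weight):
    """Convolution with kernel size == stride == s (assumed design choice).

    f:      (N, C_in, H, W), with H and W divisible by s
    weight: (C_out, C_in, s, s)
    returns (N, C_out, H // s, W // s)
    """
    n, c_in, h, w = f.shape
    s = weight.shape[2]
    # Split the map into non-overlapping s x s patches.
    patches = f.reshape(n, c_in, h // s, s, w // s, s)
    # Each output pixel is a linear map of one patch.
    return np.einsum("ncipjq,ocpq->noij", patches, weight)

def bn_relu(x, eps=1e-5):
    """Per-channel batch normalization (batch statistics only), then ReLU."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return np.maximum((x - mean) / np.sqrt(var + eps), 0.0)

# Hypothetical 3-stage encoder; every stage is mapped to (C*, H*, W*) = (16, 8, 8).
rng = np.random.default_rng(0)
stages = [rng.standard_normal(shape) for shape in
          [(2, 8, 32, 32), (2, 16, 16, 16), (2, 32, 8, 8)]]
weights = [rng.standard_normal((16, f.shape[1], f.shape[2] // 8, f.shape[2] // 8))
           for f in stages]
resized = [bn_relu(strided_conv(f, wgt)) for f, wgt in zip(stages, weights)]
print([r.shape for r in resized])
```

After this step all stages share one shape, which is what allows their tokens to be concatenated in the attention steps.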

W-Dimensional CSAA. We first generate Query $\{Q_{i,w}\}_{i=1}^{k}$, Key $\{K_{i,w}\}_{i=1}^{k}$ and Value $\{V_{i,w}\}_{i=1}^{k}$ based on the feature maps $\{f_{i}^{*}\}_{i=1}^{k}$ in the width (W) dimension:

$$\begin{aligned}
Q_{i,w}&=W_{i}^{Q}(f_{i,w}^{*}),\quad i=1,2,\ldots,k,\\
K_{i,w}&=W_{i}^{K}(f_{1,w}^{*},f_{2,w}^{*},\ldots,f_{k,w}^{*}),\quad i=1,2,\ldots,k,\\
V_{i,w}&=W_{i}^{V}(f_{1,w}^{*},f_{2,w}^{*},\ldots,f_{k,w}^{*}),\quad i=1,2,\ldots,k,
\end{aligned} \tag{2}$$

where $\{W_{i}^{Q}\}_{i=1}^{k}$, $\{W_{i}^{K}\}_{i=1}^{k}$, $\{W_{i}^{V}\}_{i=1}^{k}$ represent the weight matrices used to generate $\{Q_{i,w}\}_{i=1}^{k}$, $\{K_{i,w}\}_{i=1}^{k}$, $\{V_{i,w}\}_{i=1}^{k}$ respectively; $k$ is the number of stages in the encoder and decoder; and $f_{i,w}^{*}$ denotes the feature $f_{i}^{*}$ in width dimension. Equation 2 shows that $K_{i,w}$ and $V_{i,w}$ merge the information from all stages of the encoder in width dimension. Next, we get the output $\{f_{i}^{w}\}_{i=1}^{k}$ of W-Dimensional Axial-Attention by

$$f_{i}^{w}=\text{Softmax}\left(\frac{Q_{i,w}K_{i,w}^{T}}{\sqrt{C^{*}H^{*}}}\right)V_{i,w},\quad i=1,2,\ldots,k. \tag{3}$$
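Eq. 2 and Eq. 3 can be sketched in NumPy as below. The sketch rests on assumptions that go beyond the text: each of the $W^{*}$ columns of a resized map is treated as one token with $C^{*}H^{*}$ features (consistent with the scaling factor $\sqrt{C^{*}H^{*}}$), the tokens of all $k$ stages are concatenated to form keys and values, and $W_{i}^{Q}$, $W_{i}^{K}$, $W_{i}^{V}$ are plain square matrices. The function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def width_tokens(f):
    """(C, H, W) -> (W, C*H): one token per column (assumed tokenization)."""
    c, h, w = f.shape
    return f.transpose(2, 0, 1).reshape(w, c * h)

def cross_stage_width_attention(feats, w_q, w_k, w_v):
    """feats: list of k arrays of shape (C, H, W), already resized.
    w_q, w_k, w_v: lists of k matrices of shape (C*H, C*H).
    Returns a list of k arrays of shape (C, H, W)."""
    c, h, w = feats[0].shape
    d = c * h
    # Tokens from all stages, shared by every key and value: (k*W, d).
    all_tokens = np.concatenate([width_tokens(f) for f in feats], axis=0)
    outputs = []
    for i, f in enumerate(feats):
        q = width_tokens(f) @ w_q[i]                    # (W, d), stage i only
        key = all_tokens @ w_k[i]                       # (k*W, d), all stages
        val = all_tokens @ w_v[i]                       # (k*W, d), all stages
        attn = softmax(q @ key.T / np.sqrt(d), axis=-1)  # (W, k*W)
        out = attn @ val                                # (W, d)
        outputs.append(out.reshape(w, c, h).transpose(1, 2, 0))
    return outputs

# Hypothetical sizes: k = 3 stages, (C*, H*, W*) = (2, 4, 5).
rng = np.random.default_rng(0)
feats = [rng.standard_normal((2, 4, 5)) for _ in range(3)]
mats = lambda: [rng.standard_normal((8, 8)) * 0.1 for _ in range(3)]
outs = cross_stage_width_attention(feats, mats(), mats(), mats())
print([o.shape for o in outs])
```

Each query comes from a single stage while keys and values span all stages, so every output position can draw on shallow and deep features at once.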

H-Dimensional CSAA. Similarly, we first generate Query $\{Q_{i,h}\}_{i=1}^{k}$, Key $\{K_{i,h}\}_{i=1}^{k}$ and Value $\{V_{i,h}\}_{i=1}^{k}$ based on the feature maps $\{f_{i}^{w}\}_{i=1}^{k}$ in the height (H) dimension:

$$\begin{aligned}
Q_{i,h}&=W_{i}^{Q}(f_{i,h}^{w}),\quad i=1,2,\ldots,k,\\
K_{i,h}&=W_{i}^{K}(f_{1,h}^{w},f_{2,h}^{w},\ldots,f_{k,h}^{w}),\quad i=1,2,\ldots,k,\\
V_{i,h}&=W_{i}^{V}(f_{1,h}^{w},f_{2,h}^{w},\ldots,f_{k,h}^{w}),\quad i=1,2,\ldots,k,
\end{aligned} \tag{4}$$

where $\{W_{i}^{Q}\}_{i=1}^{k}$, $\{W_{i}^{K}\}_{i=1}^{k}$, $\{W_{i}^{V}\}_{i=1}^{k}$ represent the weight matrices used to generate $\{Q_{i,h}\}_{i=1}^{k}$, $\{K_{i,h}\}_{i=1}^{k}$, $\{V_{i,h}\}_{i=1}^{k}$ respectively; $k$ is the number of stages in the encoder and decoder; and $f_{i,h}^{w}$ denotes the feature $f_{i}^{w}$ in height dimension. Through Eq. 2 and Eq. 4, we merge the information from all stages of the encoder in both width dimension and height dimension. Next, we obtain the output $\{f_{i}^{h}\}_{i=1}^{k}$ of H-Dimensional Axial-Attention through the attention operation:

$$f_{i}^{h}=\text{Softmax}\left(\frac{Q_{i,h}K_{i,h}^{T}}{\sqrt{C^{*}W^{*}}}\right)V_{i,h},\quad i=1,2,\ldots,k. \tag{5}$$
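The scaling factors $\sqrt{C^{*}H^{*}}$ and $\sqrt{C^{*}W^{*}}$ suggest how the feature maps are tokenized along each axis. The block below spells out one reading of the tensor shapes, assuming the usual scaled dot-product convention in which the denominator is the square root of the per-token feature dimension and assuming that tokens from all $k$ stages are concatenated for keys and values. These shapes are an inference and are not stated explicitly in the text.

```latex
\begin{aligned}
\text{W-axis:}\quad & Q_{i,w}\in\mathbb{R}^{W^{*}\times C^{*}H^{*}},\quad
K_{i,w},V_{i,w}\in\mathbb{R}^{kW^{*}\times C^{*}H^{*}},\quad
\text{Softmax}(\cdot)\in\mathbb{R}^{W^{*}\times kW^{*}},\quad
f_{i}^{w}\in\mathbb{R}^{W^{*}\times C^{*}H^{*}},\\
\text{H-axis:}\quad & Q_{i,h}\in\mathbb{R}^{H^{*}\times C^{*}W^{*}},\quad
K_{i,h},V_{i,h}\in\mathbb{R}^{kH^{*}\times C^{*}W^{*}},\quad
\text{Softmax}(\cdot)\in\mathbb{R}^{H^{*}\times kH^{*}},\quad
f_{i}^{h}\in\mathbb{R}^{H^{*}\times C^{*}W^{*}}.
\end{aligned}
```

Under this reading, each output is reshaped back to $(C^{*},H^{*},W^{*})$ after its axis, so the H-axis step operates on maps that already mix information across stages along the width.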

Resizing back. To aid the guidance of the decoding process with the information acquired through axial attention, we resize the output feature $f_{i}^{h}$ of the two-step axial attention to match the dimensions of the feature map in the corresponding decoding stage $i$, adjusting both spatial and channel dimensions through convolution operations, denoted as $\{f_{i}^{attn}\in\mathbb{R}^{C_{i}\times H_{i}\times W_{i}}\}_{i=1}^{k}$,

$$f_{i}^{attn}=\sigma(BN(conv_{i}^{h}(f_{i}^{h}))),\quad i=1,\ldots,k, \tag{6}$$

where $\sigma$ and $BN$ denote ReLU and Batch Normalization, respectively, and $k$ is the number of stages in the encoder and decoder. Finally, we concatenate $f_{i}^{attn}$ to the feature map of the corresponding stage in the decoder along the channel dimension.

It is noteworthy that a traditional two-dimensional attention mechanism demands substantial computing resources [35]. To address this, we employ two-step one-dimensional attention modules in CSAA, applying attention sequentially along the width and height dimensions of the feature maps.
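The following is a minimal NumPy sketch of the two-step one-dimensional attention described above, not the authors' implementation. It assumes that queries, keys and values are the row (or column) tokens themselves, so the learned projections that produce $Q$, $K$ and $V$ are omitted, and the feature size (8, 16, 16) is purely illustrative. The scaling by $\sqrt{C W}$ in the height step mirrors Eq. (5).

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend_along_height(f):
    """Attention over the H rows of a (C, H, W) map; each row is a C*W-dim token."""
    C, H, W = f.shape
    tokens = f.transpose(1, 0, 2).reshape(H, C * W)   # (H, C*W)
    # Assumption: Q = K = V = tokens (learned projections omitted for brevity).
    scores = tokens @ tokens.T / np.sqrt(C * W)       # (H, H) attention logits
    out = softmax(scores, axis=-1) @ tokens           # (H, C*W)
    return out.reshape(H, C, W).transpose(1, 0, 2)    # back to (C, H, W)


def attend_along_width(f):
    """Same operation along the width axis, implemented by swapping H and W."""
    return attend_along_height(f.transpose(0, 2, 1)).transpose(0, 2, 1)


def two_step_axial_attention(f):
    """Width attention followed by height attention."""
    return attend_along_height(attend_along_width(f))


rng = np.random.default_rng(0)
f = rng.standard_normal((8, 16, 16))      # hypothetical (C, H, W) feature map
out = two_step_axial_attention(f)

H, W = f.shape[1:]
full_2d_scores = (H * W) ** 2             # one score per pair of pixels
axial_scores = H * H + W * W              # one H x H map plus one W x W map
print(out.shape, full_2d_scores, axial_scores)
```

For this toy size, full two-dimensional attention would need 65536 pairwise scores, whereas the two axial steps together need 512, which illustrates why the sequential one-dimensional formulation is cheaper.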

The CSAA significantly aids the model in extracting information about small medical objects from the encoder and appropriately allocates it to the corresponding stage of the decoder. Unlike prior models, our decoding process for each stage in the decoder is influenced by segmentation-related information gleaned from all stages of the encoder, facilitated by CSAA. Through CSAA, our model achieves feature fusion in the encoder, and reinforces the linkage between the encoder and decoder.

3.3 MPS Module

Low-resolution features in the decoder possess robust global perception, enhancing the model’s performance in small medical object segmentation. However, in prior models such as [27, 4, 21], the globally perceptual information is not fully harnessed; and a significant amount of useful information is lost in the subsequent convolution and upsampling processes. To tackle this issue, we introduce the Multi-Precision Supervision Module (MPS) to extract information from low-resolution features in the decoder and diminish its loss in the ensuing decoding process. Specifically, MPS consists of two steps: segmentation and upsampling.

Segmentation. To thoroughly extract information about small medical objects from each stage of the decoder, we individually feed feature maps from each decoder stage into corresponding segmentation heads. This process yields segmentation results with distinct resolutions, denoted as $\{P_{i}\in\mathbb{R}^{C_{i}\times H_{i}\times W_{i}}\}_{i=1}^{k}$,

$$P_{i}=S(\sigma(BN(conv_{i}^{d}(f_{i}^{d})))),\quad i=1,\dots,k, \tag{7}$$

where $f_{i}^{d}$ represents the feature map in stage $i$ of the decoder; $\sigma$ and $BN$ denote ReLU and Batch Normalization, respectively; and $S(\cdot)$ represents the sigmoid function.

Upsampling. We employ nearest-neighbor interpolation to upsample the segmentation results obtained in the previous step to the size of the ground truth image. This process yields multi-precision segmentation results $\{M_{i}\}_{i=1}^{k}$ for small medical objects, as illustrated in Fig. 3(a).

$$M_{i}=Upsample(P_{i})\in\mathbb{R}^{C\times H\times W},\quad i=1,2,\dots,k. \tag{8}$$

We supervise the segmentation results of each precision with the ground truth label, ensuring that each stage of the decoder carries sufficient information to facilitate the segmentation of small medical objects.

In MPS, we formulate a supervision strategy with varying precision for features of different resolutions. Recognizing that low-resolution features possess robust global perception but lack local details, we employ low-precision supervision for them. This choice temporarily sets aside local details while capitalizing on the global perception of low-resolution features. The multi-precision supervision approach thus retains the advantages of the conventional single segmentation head while preserving additional global perception from low-resolution features. Consequently, it enhances the model’s performance in small medical object segmentation.
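The two MPS steps can be sketched as follows. This is an illustrative NumPy example rather than the authors' code: the per-stage logits stand in for the outputs of the segmentation heads (the convolution, Batch Normalization and ReLU of Eq. (7) are omitted), the four resolutions and the order of the stages are assumptions, and upsampling is assumed to be nearest-neighbor with integer scale factors.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function S(.)."""
    return 1.0 / (1.0 + np.exp(-x))


def upsample_nearest(p, out_h, out_w):
    """Nearest-neighbor upsampling of an (h, w) map by integer factors."""
    h, w = p.shape
    assert out_h % h == 0 and out_w % w == 0, "integer scale factors only"
    return np.repeat(np.repeat(p, out_h // h, axis=0), out_w // w, axis=1)


H, W = 8, 8                                   # toy ground-truth size
rng = np.random.default_rng(0)
# Hypothetical head outputs at 1/8, 1/4, 1/2 and full resolution.
stage_logits = [rng.standard_normal((H // s, W // s)) for s in (8, 4, 2, 1)]

P = [sigmoid(z) for z in stage_logits]        # per-stage predictions P_i
M = [upsample_nearest(p, H, W) for p in P]    # multi-precision maps M_i
print([m.shape for m in M])
```

Every map in `M` has the ground-truth size, so each one can be compared with the same label; the coarsest map is constant over large blocks, which is what makes its supervision low-precision.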

3.4 Loss function

Considering that positive and negative pixels are extremely imbalanced in small medical object segmentation tasks, we adopt a combination of DiceLoss [31] and Binary Cross Entropy (BCE) Loss during training. The loss for the segmentation map produced by each stage of the decoder is set as follows.

$$\mathcal{L}_{i}=\lambda_{1}\cdot\mathcal{L}_{\text{Dice}}(M_{i},Y)+\lambda_{2}\cdot\mathcal{L}_{\text{BCE}}(M_{i},Y), \tag{9}$$

where $i=1,2,\dots,k$ denotes the stage index; $M_{i}$ represents the result predicted by the model at stage $i$; $Y$ represents the ground truth; and the hyperparameters $\lambda_{1}$ and $\lambda_{2}$ balance DiceLoss and BCELoss. Taking into account the segmentation results output by every stage of the decoder, the total loss of the model is

$$\mathcal{L}_{\text{total}}=\sum_{i=1}^{k}\alpha_{i}\cdot\mathcal{L}_{i}, \tag{10}$$

where the hyperparameters $\{\alpha_{i}\}_{i=1}^{k}$ weight the losses of the segmentation results with different precision.
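Eqs. (9) and (10) can be written out as a short pure-Python sketch over flattened probability maps. It is an illustration under stated assumptions: a plain soft Dice loss is used here as a simplification of the DiceLoss cited in the text, the smoothing constants `eps` are hypothetical, and the default weights are the values reported under Implementation Details.

```python
import math


def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2*sum(p*y) / (sum(p) + sum(y))."""
    inter = sum(p * y for p, y in zip(pred, target))
    return 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(pred)


def stage_loss(pred, target, lam1=0.7, lam2=0.3):
    """Eq. (9): weighted sum of Dice and BCE for one decoder stage."""
    return lam1 * dice_loss(pred, target) + lam2 * bce_loss(pred, target)


def total_loss(stage_preds, target, alphas=(1.0, 0.9, 0.8, 0.7)):
    """Eq. (10): alpha-weighted sum of the per-stage losses."""
    return sum(a * stage_loss(m, target) for a, m in zip(alphas, stage_preds))


# Toy example: one positive pixel among four, four decoder stages.
y = [0, 0, 0, 1]
perfect = [[0.0, 0.0, 0.0, 1.0]] * 4
uncertain = [[0.5, 0.5, 0.5, 0.5]] * 4
print(total_loss(perfect, y), total_loss(uncertain, y))
```

The perfect prediction yields a loss close to zero, while the uniformly uncertain one is penalized by both terms; the Dice term keeps the single positive pixel from being swamped by the three negatives.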

4 Experiment

S-HRD Dataset. We have gathered a dataset comprising 313 optical coherence tomography (OCT) images from patients with macular edema, with the objective of segmenting small HyperReflective Dots (small HRDs) within them. We refer to this dataset as S-HRD, where ’S’ indicates ’Small’. In S-HRD, the area of each lesion is less than 1 percent of the entire image size. All the ground truths have been manually labeled by experienced eye doctors with over ten years of expertise. We have ensured that appropriate consent has been obtained for the utilization and presentation of images in our research. For further insights into the data collection process, privacy considerations and relevant medical knowledge, please refer to the Supplementary.

S-Polyp Dataset. We build a small-polyp segmentation dataset by excluding images with sizable medical lesions in CVC-ClinicDB [2]. From this selection, we retain 229 images where all lesions are small medical objects. We label this dataset as S-Polyp. In S-Polyp, the area of each lesion is less than 5 percent of the entire image size. Examples of both S-HRD and S-Polyp are illustrated in Fig. 1.

To mitigate limitations and address the specificity of the two datasets, we employ a five-fold cross-validation approach to assess the performance of our model.
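As an illustration of the protocol, the sketch below builds a five-fold split over image indices. The text does not state how the folds were drawn, so the shuffling, the seed and the helper name `five_fold_splits` are assumptions made for this example only.

```python
import random


def five_fold_splits(n_images, n_folds=5, seed=0):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation."""
    idx = list(range(n_images))
    random.Random(seed).shuffle(idx)          # hypothetical shuffling step
    folds = [sorted(idx[i::n_folds]) for i in range(n_folds)]
    for i, val in enumerate(folds):
        train = sorted(j for k, f in enumerate(folds) if k != i for j in f)
        yield train, val


# 313 is the number of images in S-HRD.
for fold, (train, val) in enumerate(five_fold_splits(313)):
    print(f"Fold{fold}: {len(train)} train / {len(val)} val")
```

Each image appears in exactly one validation fold, so the per-fold scores in the results table are computed on disjoint held-out subsets.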

Definition of Small Medical Objects. Given the absence of a consistent definition for small medical objects in previous works, we establish our own criteria. In our framework, an object is considered a small medical object if the ratio of its pixel count $n$ to the total pixel count $N$ of the entire image is less than 5%. Objects with a ratio below 1% are classified as extremely small medical objects. In the S-Polyp dataset, all objects fall under the category of small medical objects, while in S-HRD, all objects are classified as extremely small medical objects.
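The criterion can be expressed directly in code. The sketch below is only an illustration: the function name and the example pixel counts are hypothetical, and the image size of 352 × 352 is the input resolution reported under Implementation Details.

```python
def classify_object(n_pixels, total_pixels):
    """Classify an object by the ratio n / N of its pixels to the image pixels."""
    ratio = n_pixels / total_pixels
    if ratio < 0.01:
        return "extremely small"   # below 1% of the image
    if ratio < 0.05:
        return "small"             # below 5% of the image
    return "not small"


N = 352 * 352  # total pixel count of a resized input image
for n in (1000, 5000, 10000):
    print(n, round(n / N, 4), classify_object(n, N))
```

At this resolution the 1% threshold corresponds to roughly 1239 pixels and the 5% threshold to roughly 6195 pixels.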

Evaluation Metrics. We use two common metrics to compare our model with previous state-of-the-art models. The Dice Similarity Coefficient (DSC) is defined as:

$$DSC=\frac{2\times|P\cap G|}{|P|+|G|}, \tag{11}$$

where $P$ represents the area of the predicted label and $G$ represents the area of the ground truth. Intersection over Union (IoU) is defined as:

$$IoU=\frac{S_{i}}{S_{u}}, \tag{12}$$

where $S_{i}$ represents the area where the predicted label and the ground truth overlap, and $S_{u}$ represents the area of their union.
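Both metrics can be computed from binary masks in a few lines. The sketch below is a plain-Python illustration of Eqs. (11) and (12); the convention of returning 1.0 when both masks are empty is an assumption, and the 3 × 3 masks are toy data.

```python
def dsc_and_iou(pred, gt):
    """Return (DSC, IoU) for two binary masks given as 2D lists of 0/1."""
    p = [v for row in pred for v in row]
    g = [v for row in gt for v in row]
    inter = sum(a * b for a, b in zip(p, g))   # |P ∩ G|, i.e. S_i
    total = sum(p) + sum(g)                    # |P| + |G|
    union = total - inter                      # S_u
    if total == 0:
        return 1.0, 1.0                        # assumed convention: both empty
    return 2 * inter / total, inter / union


pred = [[0, 1, 1],
        [0, 1, 0],
        [0, 0, 0]]
gt = [[0, 1, 0],
      [0, 1, 0],
      [0, 1, 0]]
dsc, iou = dsc_and_iou(pred, gt)
print(round(dsc, 4), round(iou, 4))  # 0.6667 0.5
```

For a single pair of masks the two metrics are related by $DSC = 2\,IoU/(1+IoU)$, so they always rank predictions on one image identically; averages over many images need not satisfy this identity.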

Implementation Details. We conduct our experiments on one NVIDIA RTX A6000 GPU with 48GB of memory. The SGD optimizer is employed with an initial learning rate of 0.01. Training spans 200 epochs with a batch size of 4. All input images are resized uniformly to $352\times 352$. The model has 4 stages in both the encoder and decoder. We set $\lambda_{1}$ and $\lambda_{2}$ to 0.7 and 0.3, respectively, to balance DiceLoss and BCELoss, and set $\alpha_{1}$, $\alpha_{2}$, $\alpha_{3}$, $\alpha_{4}$ to 1.0, 0.9, 0.8, 0.7, respectively, to balance the losses of the multi-precision segmentation results.

Competitors. Given the limited number of works dedicated specifically to small medical object segmentation, we reference recent models from the broader field of medical image segmentation, including state-of-the-art models: (1) CNN based methods: U-Net [27], Attention-UNet [24], MSU-Net [30], CaraNet [21]; (2) Transformer based methods: TransFuse [42], TransUNet [4], SSFormer [36], Swin-UNet [3]; (3) Segment Anything Model (SAM) [20] and the related works: SAM without any prompt, SAM with point, SAM with box and MedSAM [22]. (4) Additionally, we assess the performance of an enlarged version of U-Net (U-Net-Large), where both the encoder and decoder are scaled up from 4 layers to 12 layers. This exploration aims to understand the impact of model size on segmentation accuracy.

4.1 Quantitative Results

Table 1: Comparison of our EFCNet with competitors on S-HRD and S-Polyp in DSC (%) and IoU (%).
| Metric | Method | S-HRD Fold0 | S-HRD Fold1 | S-HRD Fold2 | S-HRD Fold3 | S-HRD Fold4 | S-HRD Mean | S-Polyp Fold0 | S-Polyp Fold1 | S-Polyp Fold2 | S-Polyp Fold3 | S-Polyp Fold4 | S-Polyp Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DSC (%) | U-Net [27] | 42.09 | 36.51 | 37.74 | 35.90 | 41.29 | 38.71 | 75.90 | 75.47 | 81.95 | 72.06 | 76.73 | 76.42 |
| DSC (%) | U-Net-Large [27] | 43.02 | 36.72 | 38.24 | 37.82 | 41.60 | 39.48 | 78.93 | 78.32 | 82.76 | 73.70 | 77.84 | 78.31 |
| DSC (%) | Attn-UNet [24] | 44.33 | 39.97 | 39.07 | 38.46 | 39.90 | 40.35 | 79.89 | 81.17 | 81.19 | 77.01 | 74.68 | 78.79 |
| DSC (%) | MSU-Net [30] | 44.06 | 40.94 | 38.66 | 38.54 | 39.44 | 40.33 | 87.60 | 86.64 | 86.03 | 82.99 | 75.56 | 83.76 |
| DSC (%) | CaraNet [21] | 17.03 | 13.79 | 12.46 | 13.40 | 13.07 | 13.95 | 85.12 | 78.66 | 88.14 | 75.02 | 82.56 | 81.90 |
| DSC (%) | TransUNet [4] | 32.84 | 30.34 | 30.71 | 31.54 | 32.81 | 31.65 | 84.83 | 82.13 | 88.24 | 76.53 | 82.85 | 82.92 |
| DSC (%) | TransFuse [42] | 14.95 | 10.11 | 11.59 | 11.96 | 11.13 | 11.95 | 80.73 | 72.43 | 82.33 | 72.53 | 81.04 | 77.81 |
| DSC (%) | SSFormer [36] | 34.66 | 27.74 | 27.06 | 28.49 | 25.97 | 28.78 | 85.92 | 84.44 | 88.00 | 78.80 | 83.37 | 84.11 |
| DSC (%) | Swin-UNet [3] | 07.90 | 04.27 | 04.29 | 08.08 | 07.32 | 06.37 | 48.16 | 53.91 | 50.00 | 49.57 | 61.68 | 52.66 |
| DSC (%) | SAM [20] | 03.81 | 03.66 | 01.47 | 01.40 | 02.89 | 02.64 | 44.65 | 40.57 | 49.94 | 56.43 | 46.31 | 47.58 |
| DSC (%) | SAM (box) [20] | 03.99 | 02.78 | 02.60 | 02.67 | 02.55 | 02.92 | 77.21 | 80.40 | 82.65 | 77.73 | 78.60 | 79.32 |
| DSC (%) | SAM (point) [20] | 10.49 | 07.48 | 05.16 | 04.66 | 07.79 | 07.12 | 72.53 | 73.27 | 75.30 | 67.80 | 72.08 | 72.20 |
| DSC (%) | MedSAM [22] | 04.07 | 03.29 | 03.16 | 02.86 | 03.11 | 03.30 | 76.46 | 79.26 | 80.97 | 75.26 | 78.70 | 78.13 |
| DSC (%) | EFCNet (Ours) | 49.10 | 44.91 | 43.72 | 43.46 | 44.95 | 45.23 | 89.11 | 90.74 | 89.20 | 85.39 | 83.58 | 87.60 |
| IoU (%) | U-Net [27] | 28.83 | 24.09 | 24.20 | 24.12 | 26.90 | 25.63 | 65.98 | 68.77 | 73.42 | 63.67 | 58.63 | 66.09 |
| IoU (%) | U-Net-Large [27] | 29.99 | 25.09 | 25.21 | 25.13 | 27.88 | 26.66 | 68.93 | 69.99 | 76.75 | 62.55 | 65.98 | 68.84 |
| IoU (%) | Attn-UNet [24] | 30.75 | 27.47 | 26.27 | 25.95 | 26.68 | 27.42 | 72.69 | 72.59 | 73.30 | 68.58 | 68.38 | 71.11 |
| IoU (%) | MSU-Net [30] | 30.20 | 27.98 | 25.60 | 25.48 | 25.76 | 27.00 | 79.51 | 78.79 | 77.54 | 72.72 | 68.21 | 75.35 |
| IoU (%) | CaraNet [21] | 10.36 | 08.04 | 07.21 | 07.77 | 07.50 | 08.18 | 77.24 | 72.04 | 81.40 | 67.29 | 75.56 | 74.71 |
| IoU (%) | TransUNet [4] | 22.00 | 19.87 | 19.55 | 21.02 | 21.21 | 20.73 | 75.40 | 75.29 | 81.02 | 68.83 | 75.20 | 75.15 |
| IoU (%) | TransFuse [42] | 09.05 | 05.75 | 06.58 | 07.15 | 06.25 | 06.96 | 70.77 | 64.74 | 74.47 | 65.13 | 73.01 | 69.62 |
| IoU (%) | SSFormer [36] | 22.65 | 17.67 | 16.57 | 17.95 | 16.02 | 18.17 | 78.16 | 77.30 | 80.93 | 72.74 | 75.52 | 76.93 |
| IoU (%) | Swin-UNet [3] | 05.03 | 02.38 | 02.32 | 05.08 | 04.16 | 03.79 | 37.36 | 41.84 | 39.23 | 38.00 | 50.10 | 41.31 |
| IoU (%) | SAM [20] | 02.14 | 02.11 | 00.77 | 00.73 | 01.62 | 01.47 | 39.17 | 35.44 | 44.97 | 51.01 | 41.50 | 42.42 |
| IoU (%) | SAM (box) [20] | 02.09 | 01.45 | 01.34 | 01.40 | 01.31 | 01.52 | 67.04 | 70.62 | 72.97 | 67.71 | 69.34 | 69.54 |
| IoU (%) | SAM (point) [20] | 06.86 | 04.49 | 03.10 | 02.65 | 04.85 | 04.39 | 64.17 | 65.89 | 67.54 | 60.15 | 65.60 | 64.67 |
| IoU (%) | MedSAM [22] | 02.13 | 01.71 | 01.66 | 01.49 | 01.64 | 01.73 | 66.47 | 69.21 | 71.61 | 65.59 | 69.13 | 68.40 |
| IoU (%) | EFCNet (Ours) | 35.06 | 31.45 | 29.84 | 29.35 | 30.25 | 31.19 | 82.54 | 83.71 | 82.08 | 76.98 | 75.59 | 80.18 |

As shown in Tab. 1, our model consistently outperforms previous state-of-the-art models across all folds for both S-HRD and S-Polyp, measured by DSC and IoU.

On S-HRD, our model improves the mean DSC by 4.88 percentage points and the mean IoU by 3.77 percentage points over the best competing method. Similarly, on S-Polyp, our model improves the mean DSC by 3.49 percentage points and the mean IoU by 3.25 percentage points.
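These margins can be reproduced from the Mean columns of Tab. 1. The short script below copies those means (leading zeros dropped) and, for each dataset and metric, finds the strongest competitor and its gap to EFCNet; the variable names are illustrative.

```python
# Mean DSC / IoU (%) over five folds, copied from Table 1.
# Columns: S-HRD DSC, S-HRD IoU, S-Polyp DSC, S-Polyp IoU.
means = {
    "U-Net": (38.71, 25.63, 76.42, 66.09),
    "U-Net-Large": (39.48, 26.66, 78.31, 68.84),
    "Attn-UNet": (40.35, 27.42, 78.79, 71.11),
    "MSU-Net": (40.33, 27.00, 83.76, 75.35),
    "CaraNet": (13.95, 8.18, 81.90, 74.71),
    "TransUNet": (31.65, 20.73, 82.92, 75.15),
    "TransFuse": (11.95, 6.96, 77.81, 69.62),
    "SSFormer": (28.78, 18.17, 84.11, 76.93),
    "Swin-UNet": (6.37, 3.79, 52.66, 41.31),
    "SAM": (2.64, 1.47, 47.58, 42.42),
    "SAM (box)": (2.92, 1.52, 79.32, 69.54),
    "SAM (point)": (7.12, 4.39, 72.20, 64.67),
    "MedSAM": (3.30, 1.73, 78.13, 68.40),
    "EFCNet": (45.23, 31.19, 87.60, 80.18),
}
columns = ("S-HRD DSC", "S-HRD IoU", "S-Polyp DSC", "S-Polyp IoU")

margins = {}
for c, name in enumerate(columns):
    # Strongest competitor for this dataset/metric, excluding EFCNet itself.
    best = max((m for m in means if m != "EFCNet"), key=lambda m: means[m][c])
    margins[name] = (best, round(means["EFCNet"][c] - means[best][c], 2))
    print(name, margins[name])
```

The strongest competitor is Attn-UNet on S-HRD and SSFormer on S-Polyp for both metrics, and the gap is larger on S-HRD, where the objects are smaller.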

Table 1 shows that all methods score considerably lower on S-HRD than on S-Polyp. This is because the objects in S-HRD are smaller than those in S-Polyp, so the images contain less information about them. Nevertheless, our model still performs best among all methods.

Furthermore, the results on S-HRD and S-Polyp indicate that the smaller the medical objects in the datasets, the more significant the improvement of our model compared to previous SOTA methods. This underscores the superiority of our model in small medical object segmentation.

Additionally, simply increasing the size of U-Net (U-Net-Large) yields only marginal improvements in segmentation performance compared to the standard-sized U-Net. In contrast, our EFCNet demonstrates substantial improvement. This indicates that the superior performance of EFCNet is primarily attributable to our model design rather than to a larger model size. While the addition of CSAA and MPS increases the model’s cost, we believe the improvement justifies this cost in the realm of small medical object segmentation. We provide a detailed comparison of model costs in the Supplementary.

4.2 Visualization

Figure 4: Visualization of EFCNet (ours) and previous SOTA methods on S-HRD and S-Polyp. The previous SOTA method on S-HRD is Attn-UNet [24], and the previous SOTA method on S-Polyp is SSFormer [36]. The green circle areas show extremely small medical objects captured by our method that are not captured by previous SOTA methods. The yellow circle areas show that the segmentation of the boundaries of small medical objects in our method is significantly better than the previous SOTA method. The red circle areas show the wrong segmentation of small medical objects in the previous SOTA method while our method is correct.

Some visual results of different methods on S-HRD and S-Polyp are shown in Fig. 4. We compare our EFCNet with the previous SOTA methods. According to our experimental results in Tab. 1, the previous SOTA method on S-HRD is Attn-UNet [24], and the previous SOTA method on S-Polyp is SSFormer [36].

As illustrated in Fig. 4, the strengths of our method are predominantly evident in three aspects: (1) Our method excels at capturing extremely small medical objects, as highlighted in the green circle areas. (2) Our method demonstrates higher accuracy in terms of segmenting the boundaries of small medical objects, as evidenced by the yellow circle areas. (3) Our method is significantly less prone to erroneously segmenting the background into small medical objects, as indicated by the red circle areas.

On one hand, CSAA facilitates the application of valuable local information from low-level features in the encoder to the segmentation process, ensuring the model’s capability to capture fine details of small medical objects. On the other hand, MPS enables the model to leverage global perception inherent in low-resolution features in the initial stages of the decoder, enhancing its ability to locate small medical objects.

4.3 Ablation Studies

Table 2: Ablation study on CSAA module and MPS module on S-HRD and S-Polyp in DSC (%) and IoU (%).
| Methods | S-HRD DSC (%) | S-HRD IoU (%) | S-Polyp DSC (%) | S-Polyp IoU (%) |
|---|---|---|---|---|
| U-Net | 36.58 | 24.14 | 69.98 | 63.27 |
| U-Net+CSAA | 39.19 | 27.09 | 78.84 | 70.66 |
| U-Net+MPS | 38.98 | 26.85 | 80.20 | 74.14 |
| U-Net+CSAA+MPS (Ours) | 41.82 | 29.69 | 83.26 | 76.29 |

We perform several sets of ablation experiments on S-HRD and S-Polyp, confirming that CSAA and MPS each improve the segmentation of small medical objects.

Effectiveness of CSAA and MPS. We incorporate CSAA and MPS into our U-Net backbone individually, and the experimental results are presented in Tab. 2. It is evident that each module contributes to the improvement of our model’s performance. Furthermore, with the addition of both CSAA and MPS, the segmentation ability of our model experiences further enhancement.

Table 3: Ablation study on the number of stages that CSAA aggregates on S-HRD and S-Polyp in DSC (%) and IoU (%).
| Methods | S-HRD DSC (%) | S-HRD IoU (%) | S-Polyp DSC (%) | S-Polyp IoU (%) |
|---|---|---|---|---|
| Concat-One | 38.98 | 26.85 | 80.20 | 74.14 |
| AA-One | 40.37 | 28.13 | 82.36 | 75.94 |
| AA-All (Ours) | 41.82 | 29.69 | 83.26 | 76.29 |

Number of Stages in CSAA. We change the number of stages aggregated in CSAA module: AA-All aggregates features in all stages of the encoder, which is exactly the CSAA applied in our final model. AA-One only performs axial attention on features in one stage of the encoder. Concat-One only concatenates features in each stage of the encoder to the corresponding decoder without any other processing. The performance of these three methods on S-HRD and S-Polyp is shown in Tab. 3. It can be seen that among the three models, Concat-One performs the worst. Compared with Concat-One, AA-One can improve the segmentation ability of the model. CSAA aggregates features of all stages of the encoder and performs best among these three methods above.

Table 4: Ablation study on the number of MPS connected to the decoder on S-HRD and S-Polyp in DSC (%) and IoU (%).
| Methods | S-HRD DSC (%) | S-HRD IoU (%) | S-Polyp DSC (%) | S-Polyp IoU (%) |
|---|---|---|---|---|
| MPS-1 | 39.19 | 27.09 | 78.84 | 70.66 |
| MPS-2 | 40.07 | 27.17 | 82.00 | 73.87 |
| MPS-3 | 40.18 | 28.32 | 82.51 | 75.13 |
| MPS-4 (Ours) | 41.82 | 29.69 | 83.26 | 76.29 |

Number of Supervisions in MPS. We vary the number of supervisions in the MPS: MPS-4 connects segmentation heads to all stages of the decoder, representing the MPS configuration in our final model. MPS-3, MPS-2, and MPS-1 connect three, two and one segmentation heads to the decoder respectively. The performance of these four models on S-HRD and S-Polyp is shown in Tab. 4. It is evident that among these four models, increased supervision correlates with improved model performance.

5 Conclusion

We introduce a novel model called EFCNet to address the challenging task of small object segmentation in medical images. EFCNet pays sufficient attention to all features of each stage in the model, effectively reducing the information loss of small medical objects and improving the segmentation accuracy. Specifically, we propose Cross-Stage Axial Attention Module (CSAA) and Multi-Precision Supervision Module (MPS), which alleviate the loss of information in the encoder and decoder respectively, leading to a substantial enhancement in model performance. Moreover, we establish a new benchmark for small medical object segmentation research. Our experiments on two datasets demonstrate that CSAA and MPS contribute to improved segmentation accuracy, with our model significantly outperforming previous state-of-the-art models.

References

  • Arthi et al. [2021] M Arthi, Manavi D Sindal, and R Rashmita. Hyperreflective foci as biomarkers for inflammation in diabetic macular edema: Retrospective analysis of treatment naïve eyes from south india. Indian Journal of Ophthalmology, 69(5):1197, 2021.
  • Bernal et al. [2015] Jorge Bernal, F Javier Sánchez, Gloria Fernández-Esparrach, Debora Gil, Cristina Rodríguez, and Fernando Vilariño. Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Computerized medical imaging and graphics, 43:99–111, 2015.
  • Cao et al. [2022] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang. Swin-unet: Unet-like pure transformer for medical image segmentation. In European conference on computer vision, pages 205–218. Springer, 2022.
  • Chen et al. [2021] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L Yuille, and Yuyin Zhou. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021.
  • Chen et al. [2018] Liang Chen, Paul Bentley, Kensaku Mori, Kazunari Misawa, Michitaka Fujiwara, and Daniel Rueckert. Drinet for medical image segmentation. IEEE transactions on medical imaging, 37(11):2453–2462, 2018.
  • Chen et al. [2014] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062, 2014.
  • Chen et al. [2017] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
  • Cheng et al. [2020] Ho Kei Cheng, Jihoon Chung, Yu-Wing Tai, and Chi-Keung Tang. Cascadepsp: Toward class-agnostic and very high-resolution segmentation via global and local refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8890–8899, 2020.
  • Chung et al. [2019] Yoo-Ri Chung, Young Ho Kim, Seong Jung Ha, Hye-Eun Byeon, Chung-Hyun Cho, Jeong Hun Kim, Kihwang Lee, et al. Role of inflammation in classification of diabetic macular edema by optical coherence tomography. Journal of Diabetes Research, 2019, 2019.
  • Gu et al. [2019] Zaiwang Gu, Jun Cheng, Huazhu Fu, Kang Zhou, Huaying Hao, Yitian Zhao, Tianyang Zhang, Shenghua Gao, and Jiang Liu. Ce-net: Context encoder network for 2d medical image segmentation. IEEE transactions on medical imaging, 38(10):2281–2292, 2019.
  • Guan et al. [2020] Steven Guan, Amir A. Khan, Siddhartha Sikdar, and Parag V. Chitnis. Fully dense unet for 2-d sparse photoacoustic tomography artifact removal. IEEE Journal of Biomedical and Health Informatics, 24(2):568–576, 2020.
  • Hatamizadeh et al. [2022] Ali Hatamizadeh, Yucheng Tang, Vishwesh Nath, Dong Yang, Andriy Myronenko, Bennett Landman, Holger R Roth, and Daguang Xu. Unetr: Transformers for 3d medical image segmentation. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 574–584, 2022.
  • He et al. [2023] Sheng He, Rina Bao, Jingpeng Li, P Ellen Grant, and Yangming Ou. Accuracy of segment-anything model (sam) in medical image segmentation tasks. arXiv preprint arXiv:2304.09324, 2023.
  • Huang et al. [2020] Huimin Huang, Lanfen Lin, Ruofeng Tong, Hongjie Hu, Qiaowei Zhang, Yutaro Iwamoto, Xianhua Han, Yen-Wei Chen, and Jian Wu. Unet 3+: A full-scale connected unet for medical image segmentation. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1055–1059. IEEE, 2020.
  • Huang et al. [2021] Haifan Huang, Liangjiu Zhu, Weifang Zhu, Tian Lin, Leonoor Inge Los, Chenpu Yao, Xinjian Chen, and Haoyu Chen. Algorithm for detection and quantification of hyperreflective dots on optical coherence tomography in diabetic macular edema. Frontiers in Medicine, 8:688986, 2021.
  • Huang et al. [2019] Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. Attention on attention for image captioning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4634–4643, 2019.
  • Huang et al. [2023] Yuhao Huang, Xin Yang, Lian Liu, Han Zhou, Ao Chang, Xinrui Zhou, Rusi Chen, Junxuan Yu, Jiongquan Chen, Chaoyu Chen, et al. Segment anything model for medical images? arXiv preprint arXiv:2304.14660, 2023.
  • Jha et al. [2019] Debesh Jha, Pia H. Smedsrud, Michael A. Riegler, Dag Johansen, Thomas de Lange, Pal Halvorsen, and Havard D. Johansen. Resunet++: An advanced architecture for medical image segmentation, 2019.
  • Jha et al. [2020] Debesh Jha, Michael A Riegler, Dag Johansen, Pål Halvorsen, and Håvard D Johansen. Doubleu-net: A deep convolutional neural network for medical image segmentation. In 2020 IEEE 33rd International symposium on computer-based medical systems (CBMS), pages 558–564. IEEE, 2020.
  • Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  • Lou et al. [2022] Ange Lou, Shuyue Guan, Hanseok Ko, and Murray H Loew. Caranet: context axial reverse attention network for segmentation of small medical objects. In Medical Imaging 2022: Image Processing, pages 81–92. SPIE, 2022.
  • Ma and Wang [2023] Jun Ma and Bo Wang. Segment anything in medical images. arXiv preprint arXiv:2304.12306, 2023.
  • Milletari et al. [2016] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 fourth international conference on 3D vision (3DV), pages 565–571. Ieee, 2016.
  • Oktay et al. [2018] Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y Hammerla, Bernhard Kainz, et al. Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999, 2018.
  • Poudel et al. [2018] Rudra PK Poudel, Ujwal Bonde, Stephan Liwicki, and Christopher Zach. Contextnet: Exploring context and detail for semantic segmentation in real-time. arXiv preprint arXiv:1805.04554, 2018.
  • Qin et al. [2021] Shiyue Qin, Chaoyang Zhang, Haifeng Qin, Hai Xie, Dawei Luo, Qinghua Qiu, Kun Liu, Jingting Zhang, Guoxu Xu, and Jingfa Zhang. Hyperreflective foci and subretinal fluid are potential imaging biomarkers to evaluate anti-vegf effect in diabetic macular edema. Frontiers in Physiology, page 2337, 2021.
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
  • Shen et al. [2018] Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Sen Wang, and Chengqi Zhang. Reinforced self-attention network: a hybrid of hard and soft attention for sequence modeling. arXiv preprint arXiv:1801.10296, 2018.
  • Sinha and Dolz [2020] Ashish Sinha and Jose Dolz. Multi-scale self-guided attention for medical image segmentation. IEEE journal of biomedical and health informatics, 25(1):121–130, 2020.
  • Su et al. [2021] Run Su, Deyun Zhang, Jinhuai Liu, and Chuandong Cheng. Msu-net: Multi-scale u-net for 2d medical image segmentation. Frontiers in Genetics, 12:639930, 2021.
  • Sudre et al. [2017] Carole H Sudre, Wenqi Li, Tom Vercauteren, Sebastien Ourselin, and M Jorge Cardoso. Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: Third International Workshop, DLMIA 2017, and 7th International Workshop, ML-CDS 2017, Held in Conjunction with MICCAI 2017, Québec City, QC, Canada, September 14, Proceedings 3, pages 240–248. Springer, 2017.
  • Tao et al. [2020] Andrew Tao, Karan Sapra, and Bryan Catanzaro. Hierarchical multi-scale attention for semantic segmentation. arXiv preprint arXiv:2005.10821, 2020.
  • Valanarasu and Patel [2022] Jeya Maria Jose Valanarasu and Vishal M Patel. Unext: Mlp-based rapid medical image segmentation network. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part V, pages 23–33. Springer, 2022.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Wang et al. [2020] Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV, pages 108–126. Springer, 2020.
  • Wang et al. [2022] Jinfeng Wang, Qiming Huang, Feilong Tang, Jia Meng, Jionglong Su, and Sifan Song. Stepwise feature fusion: Local guides global. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part III, pages 110–120. Springer, 2022.
  • Williams [2008] John R Williams. The declaration of helsinki and public health. Bulletin of the World Health Organization, 86:650–652, 2008.
  • Woo et al. [2018] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV), pages 3–19, 2018.
  • Xiao et al. [2018] Xiao Xiao, Shen Lian, Zhiming Luo, and Shaozi Li. Weighted res-unet for high-quality retina vessel segmentation. In 2018 9th International Conference on Information Technology in Medicine and Education (ITME), pages 327–331, 2018.
  • Zhang et al. [2022a] Hu Zhang, Keke Zu, Jian Lu, Yuru Zou, and Deyu Meng. Epsanet: An efficient pyramid squeeze attention block on convolutional neural network. In Proceedings of the Asian Conference on Computer Vision, pages 1161–1177, 2022a.
  • Zhang et al. [2022b] Mingjin Zhang, Rui Zhang, Yuxiang Yang, Haichen Bai, Jing Zhang, and Jie Guo. Isnet: Shape matters for infrared small target detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 877–886, 2022b.
  • Zhang et al. [2021] Yundong Zhang, Huiye Liu, and Qiang Hu. Transfuse: Fusing transformers and cnns for medical image segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, pages 14–24. Springer, 2021.
  • Zhang et al. [2018] Zhenli Zhang, Xiangyu Zhang, Chao Peng, Xiangyang Xue, and Jian Sun. Exfuse: Enhancing feature fusion for semantic segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 269–284, 2018.

Supplementary Material

7 Overview

In the supplementary material, we first provide an additional analysis of model cost in Sec. 8. We then provide additional details about the architecture of our EFCNet in Sec. 9 and about the experiment settings in Sec. 10. Finally, we discuss ethical considerations in Sec. 11.

8 Analysis of Model Cost

We compare the model cost of our EFCNet with that of other UNet-based methods in Tab. 5, and report the performance improvement of each method over U-Net [27] in Tab. 6. For small medical object segmentation, simply increasing the model size, as in U-Net-Large, does not bring a significant improvement in segmentation performance over the standard-sized U-Net. In comparison, our EFCNet achieves far better segmentation performance than the other UNet-based methods at a model cost lower than that of U-Net-Large. That said, our model remains relatively large, which we consider necessary to address the intricate challenge of this segmentation task.
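To make the cost argument concrete, the figures in Tab. 5 can be turned into per-module overheads relative to the standard U-Net. The following is a minimal sketch rather than part of the original analysis; the `COST` dictionary and the `overhead` helper are hypothetical names introduced here, and only the numbers are taken from Tab. 5.

```python
# (FLOPs in G, Params in M), copied from Tab. 5.
COST = {
    "U-Net": (91.94, 37.66),
    "U-Net-Large": (430.57, 100.37),
    "U-Net+CSAA": (375.52, 87.37),
    "U-Net+MPS": (101.48, 38.81),
    "EFCNet": (385.06, 88.52),
}


def overhead(name, base="U-Net"):
    """Return the extra (GFLOPs, M params) of `name` relative to `base`."""
    flops, params = COST[name]
    base_flops, base_params = COST[base]
    return round(flops - base_flops, 2), round(params - base_params, 2)


csaa = overhead("U-Net+CSAA")                    # cost added by CSAA alone
mps = overhead("U-Net+MPS")                      # cost added by MPS alone
full = overhead("EFCNet")                        # cost added by both modules
saving = overhead("EFCNet", base="U-Net-Large")  # negative: EFCNet is cheaper

for label, (flops, params) in [("CSAA", csaa), ("MPS", mps),
                               ("CSAA+MPS", full), ("vs U-Net-Large", saving)]:
    print(f"{label:>15}: {flops:+8.2f} GFLOPs, {params:+7.2f} M params")
```

Under these figures, the two overheads add up to the overhead of the full model, CSAA accounts for most of the added cost, and MPS adds comparatively little.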

9 Additional Details about Model Architecture

We provide additional details about the backbone network, the CSAA Module and the MPS Module in our EFCNet as shown in Tab. 7, Tab. 8 and Tab. 9 respectively.

10 Additional Details about Experiment Settings

Experimental environment. The environment of our experiment is as follows. GPU: NVIDIA RTX A6000; CUDA Version: 11.7; Python Version: 3.10.4; Torch Version: 1.13.1.

Split of Datasets. We perform five-fold cross-validation in our experiments. In each split, each dataset is divided into a training set, a validation set and a testing set at a ratio of 7:1:2.
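One way to realise a five-fold protocol with a 7:1:2 ratio is sketched below. The text does not state how the folds are constructed, so the rotation scheme is an assumption: the shuffled indices are cut into ten chunks, each fold takes two chunks for testing and the next chunk for validation, and the remaining seven are used for training. The function name `five_fold_splits` and the `seed` parameter are hypothetical.

```python
import random


def five_fold_splits(num_samples, seed=0):
    """Yield (train, val, test) index lists for five folds at a 7:1:2 ratio."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)

    # Cut the shuffled indices into 10 near-equal chunks.
    bounds = [round(i * num_samples / 10) for i in range(11)]
    chunks = [indices[bounds[i]:bounds[i + 1]] for i in range(10)]

    for fold in range(5):
        test_ids = (2 * fold, 2 * fold + 1)   # 2 chunks -> 20% for testing
        val_id = (2 * fold + 2) % 10          # 1 chunk  -> 10% for validation
        test = [i for c in test_ids for i in chunks[c]]
        val = list(chunks[val_id])
        train = [i for c in range(10)
                 if c not in test_ids and c != val_id
                 for i in chunks[c]]          # 7 chunks -> 70% for training
        yield train, val, test


for k, (train, val, test) in enumerate(five_fold_splits(100)):
    print(f"fold {k}: train={len(train)} val={len(val)} test={len(test)}")
```

With this rotation, every sample appears in the testing set of exactly one fold.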

Table 5: Model cost comparison of our EFCNet with other UNet-based methods.
       Methods                      FLOPs (G)    Params (M)
       U-Net [27]                   91.94        37.66
       Attn-UNet [24]               125.94       34.88
       MSU-Net [30]                 143.05       47.09
       U-Net-Large [27]             430.57       100.37
       U-Net+CSAA                   375.52       87.37
       U-Net+MPS                    101.48       38.81
       U-Net+CSAA+MPS (EFCNet)      385.06       88.52
Table 6: Performance improvement among our EFCNet and other UNet-based methods compared to U-Net [27].
       Methods                      S-HRD                     S-Polyp
                                    ΔDSC (%)    ΔIoU (%)      ΔDSC (%)    ΔIoU (%)
       U-Net [27]                   +0.00       +0.00         +0.00       +0.00
       Attn-UNet [24]               +1.64       +1.79         +2.37       +5.02
       MSU-Net [30]                 +1.62       +1.37         +7.34       +9.26
       U-Net-Large [27]             +0.77       +1.03         +1.89       +2.75
       U-Net+CSAA+MPS (EFCNet)      +6.52       +5.56         +11.18      +14.09
Table 7: Details of the backbone network of our EFCNet.
       Stage             Layer                                      Output shape
       Input             -                                          (3, 352, 352)
       Encoder stage1    (Conv, BN, ReLU) × 2, Downsample           (64, 176, 176)
       Encoder stage2    (Conv, BN, ReLU) × 2, Downsample           (128, 88, 88)
       Encoder stage3    (Conv, BN, ReLU) × 2, Downsample           (256, 44, 44)
       Encoder stage4    (Conv, BN, ReLU) × 2, Downsample           (512, 22, 22)
       Decoder stage4    (Conv, BN, ReLU) × 2, Upsample, Concat     (1024, 44, 44)
       Decoder stage3    (Conv, BN, ReLU) × 2, Upsample, Concat     (512, 88, 88)
       Decoder stage2    (Conv, BN, ReLU) × 2, Upsample, Concat     (256, 176, 176)
       Decoder stage1    (Conv, BN, ReLU) × 2, Upsample, Concat     (128, 352, 352)
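The output shapes in Tab. 7 can be checked with simple shape arithmetic. The sketch below is an illustration only, under two assumptions that the table does not state: the convolutions preserve spatial size, and in each decoder stage the convolutions reduce the channel count to that of the skip feature being concatenated. The helper names `encoder_shapes` and `decoder_shapes` are hypothetical; the skip shapes are the "Resize back" outputs of the CSAA stages.

```python
def encoder_shapes(in_shape=(3, 352, 352), widths=(64, 128, 256, 512)):
    """Trace (C, H, W) after each encoder stage: convs, then 2x downsampling."""
    _, h, w = in_shape
    shapes = []
    for c in widths:
        h, w = h // 2, w // 2          # convs keep H and W; downsample halves them
        shapes.append((c, h, w))
    return shapes


def decoder_shapes(bottom, skips):
    """Trace (C, H, W) after each decoder stage: convs, 2x upsample, concat.

    `skips` lists the skip-feature shapes from the deepest stage to the shallowest.
    """
    _, h, w = bottom
    shapes = []
    for skip_c, skip_h, skip_w in skips:
        c = skip_c                     # assumption: convs map to the skip's channels
        h, w = h * 2, w * 2            # upsample doubles H and W
        if (h, w) != (skip_h, skip_w):
            raise ValueError("skip feature and upsampled feature differ in size")
        shapes.append((c + skip_c, h, w))   # channel-wise concatenation
    return shapes


enc = encoder_shapes()
skips = [(512, 44, 44), (256, 88, 88), (128, 176, 176), (64, 352, 352)]
dec = decoder_shapes(enc[-1], skips)
print(enc)
print(dec)
```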
Table 8: Details of the CSAA Module in our EFCNet.
       Stage          Layer                                       Output shape
       CSAA stage1    Resize: ((Conv, BN, ReLU) × 2)              (C*, H*, W*)
                      W-CSAA: one-dimensional attention module    (C*, H*, W*)
                      H-CSAA: one-dimensional attention module    (C*, H*, W*)
                      Resize back: ((Conv, BN, ReLU) × 2)         (64, 352, 352)
       CSAA stage2    Resize: ((Conv, BN, ReLU) × 2)              (C*, H*, W*)
                      W-CSAA: one-dimensional attention module    (C*, H*, W*)
                      H-CSAA: one-dimensional attention module    (C*, H*, W*)
                      Resize back: ((Conv, BN, ReLU) × 2)         (128, 176, 176)
       CSAA stage3    Resize: ((Conv, BN, ReLU) × 2)              (C*, H*, W*)
                      W-CSAA: one-dimensional attention module    (C*, H*, W*)
                      H-CSAA: one-dimensional attention module    (C*, H*, W*)
                      Resize back: ((Conv, BN, ReLU) × 2)         (256, 88, 88)
       CSAA stage4    Resize: ((Conv, BN, ReLU) × 2)              (C*, H*, W*)
                      W-CSAA: one-dimensional attention module    (C*, H*, W*)
                      H-CSAA: one-dimensional attention module    (C*, H*, W*)
                      Resize back: ((Conv, BN, ReLU) × 2)         (512, 44, 44)
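The W-CSAA and H-CSAA rows of Tab. 8 apply one-dimensional attention along the width and then along the height, each preserving the (C*, H*, W*) shape. As a rough illustration of this axial pattern only, the NumPy sketch below runs plain self-attention on a single feature map with identity query, key and value projections. It is not the CSAA module itself: the table does not specify the learned projections or how features from different encoder stages enter the attention, and the function names are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend_along_width(feat):
    """feat: (C, H, W). Each row attends over its own W positions."""
    c = feat.shape[0]
    # scores[h, i, j] = <feat[:, h, i], feat[:, h, j]> / sqrt(C)
    scores = np.einsum("chi,chj->hij", feat, feat) / np.sqrt(c)
    weights = softmax(scores, axis=-1)
    # out[c, h, i] = sum_j weights[h, i, j] * feat[c, h, j]
    return np.einsum("hij,chj->chi", weights, feat)


def axial_attention(feat):
    """Width-wise attention followed by height-wise attention."""
    out = attend_along_width(feat)
    # Swap H and W, reuse the width routine, then swap back.
    return attend_along_width(out.transpose(0, 2, 1)).transpose(0, 2, 1)


x = np.random.default_rng(0).normal(size=(8, 6, 10))
y = axial_attention(x)
print(x.shape, y.shape)
```

Each one-dimensional pass builds a W × W (or H × H) attention map per row (or column) instead of a single (HW) × (HW) map, which is the usual motivation for axial attention.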
Table 9: Details of the MPS Module in our EFCNet.
       Stage         Layer                                                   Output shape
       MPS stage1    Segmentation: (Conv, BN, ReLU) × 2, Conv, Sigmoid       (1, 352, 352)
       MPS stage2    Segmentation: (Conv, BN, ReLU) × 2, Conv, Sigmoid       (1, 176, 176)
                     Upsample: Nearest interpolation                         (1, 352, 352)
       MPS stage3    Segmentation: (Conv, BN, ReLU) × 2, Conv, Sigmoid       (1, 88, 88)
                     Upsample: Nearest interpolation                         (1, 352, 352)
       MPS stage4    Segmentation: (Conv, BN, ReLU) × 2, Conv, Sigmoid       (1, 44, 44)
                     Upsample: Nearest interpolation                         (1, 352, 352)
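In Tab. 9, the low-resolution predictions of MPS stages 2 to 4 are brought to 352 × 352 by nearest interpolation. The pure-Python sketch below illustrates that step on a small map. It assumes the common index rule in which each output pixel copies the input pixel at `floor(y * in_h / out_h)`; the function name `nearest_upsample` and the `MPS_SIZES` table are hypothetical, with the sizes read from Tab. 9.

```python
# Side length of the prediction at each MPS stage, read from Tab. 9.
MPS_SIZES = {1: 352, 2: 176, 3: 88, 4: 44}
FULL_SIZE = 352


def nearest_upsample(mask, out_h, out_w):
    """Upsample a 2D list `mask` to (out_h, out_w) by nearest interpolation."""
    in_h, in_w = len(mask), len(mask[0])
    return [
        [mask[(y * in_h) // out_h][(x * in_w) // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


# Upsampling factor needed at each stage to reach full resolution.
factors = {stage: FULL_SIZE // size for stage, size in MPS_SIZES.items()}
print(factors)

small = [[0, 1],
         [1, 0]]
for row in nearest_upsample(small, 4, 4):
    print(row)
```

Because every output pixel is a copy of one input pixel, the upsampled map keeps the values of the low-resolution prediction and only repeats them in blocks.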

11 Ethical Considerations

The collection and utilization of human data in our research project adheres to the highest ethical standards. Our study has received the full approval of the Institutional Review Board and the Ethics Committee of a hospital. The approval process was conducted in strict accordance with the principles outlined in the Declaration of Helsinki [37], which provides ethical guidelines for medical research involving human subjects. All recruited patients have signed informed consent for the publication of this paper. Below, we provide information on data collection and annotation, and on appropriate consent and privacy considerations.

Data collection and annotation. In this investigation, retinal OCT scans are collected from the eyes of 313 patients who sought treatment for macular edema associated with diabetic retinopathy or retinal vein occlusion at a hospital within the past six months. The investigation has been approved by the Institutional Review Board and the Ethics Committee of a hospital, in accordance with the principles of the Declaration of Helsinki [37]. Each OCT scan is centered on the fovea, either vertically or horizontally. Macular edema is defined as a central retinal thickness (CRT) greater than 300 μm. Small hyperreflective dots (small HRDs) are defined as discrete tiny dots with a diameter between 20 μm and 40 μm, characterized by reflectivity similar to that of the nerve fiber layer and the absence of back shadowing [15]. The ground truths of small HRDs have been manually labeled by experienced eye doctors with over ten years of expertise.

Appropriate consent and privacy considerations. We confirm that appropriate consent has been obtained for the use and display of images in our research. To uphold the privacy and confidentiality of the individuals involved, we take rigorous measures to ensure that all identifying information, including names, genders, and birth dates of patients, has been thoroughly removed from the images prior to any processing or analysis. These steps are taken to safeguard patient privacy and adhere to ethical standards. We understand the critical importance of addressing privacy concerns when dealing with medical images, and we are committed to upholding the highest ethical standards in our research practices.