SPIRONet: SPatial-Frequency Learning and
TopologIcal Channel InteRactiOn Network for Vessel Segmentation

De-Xing Huang, Xiao-Hu Zhou, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Zhen-Qiu Feng, Mei-Jiang Gui, Hao Li, Tian-Yu Xiang, Bo-Xian Yao, and Zeng-Guang Hou

This work was supported in part by the National Natural Science Foundation of China under Grant 62373351, Grant 62222316, Grant U20A20224, Grant U1913601, Grant 62073325, Grant 61720106012, Grant 62003198; in part by the Youth Innovation Promotion Association of Chinese Academy of Sciences (CAS) under Grant 2020140; and in part by the CIE-Tencent Robotics X Rhino-Bird Focused Research Program. (Corresponding authors: Xiao-Hu Zhou and Zeng-Guang Hou.)

D.-X. Huang, X.-H. Zhou, X.-L. Xie, S.-Q. Liu, S.-Y. Wang, Z.-Q. Feng, M.-J. Gui, H. Li, T.-Y. Xiang, and B.-X. Yao are with the State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China (e-mail: [email protected]; [email protected]).

Z.-G. Hou is with the State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, also with the CAS Center for Excellence in Brain Science and Intelligence Technology, Beijing 100190, China, also with the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China, and also with the Joint Laboratory of Intelligence Science and Technology, Institute of Systems Engineering, Macau University of Science and Technology, Taipa, Macau (e-mail: [email protected]).
Abstract

Automatic vessel segmentation is paramount for developing next-generation interventional navigation systems. However, current approaches suffer from suboptimal segmentation performance due to significant challenges in intraoperative images (i.e., low signal-to-noise ratio, small or slender vessels, and strong interference). In this paper, a novel SPatial-frequency learning and topologIcal channel inteRactiOn Network (SPIRONet) is proposed to address the above issues. Specifically, dual encoders are utilized to comprehensively capture local spatial and global frequency vessel features. Then, a cross-attention fusion module is introduced to effectively fuse spatial and frequency features, thereby enhancing feature discriminability. Furthermore, a topological channel interaction module is designed to filter out task-irrelevant responses based on graph neural networks. Extensive experimental results on several challenging datasets (CADSA, CAXF, DCA1, and XCAD) demonstrate state-of-the-art performance of our method. Moreover, the inference speed of SPIRONet is 21 FPS with a $512\times 512$ input size, surpassing clinical real-time requirements (6$\sim$12 FPS). These promising outcomes indicate SPIRONet’s potential for integration into vascular interventional navigation systems. Code is available at https://github.com/Dxhuang-CASIA/SPIRONet.

Index Terms:
vessel segmentation, Fourier transform, spatial-frequency fusion, graph convolutional networks (GCNs).

I Introduction


Figure 1: Illustration of challenges in vessel segmentation. i) Low signal-to-noise ratio (SNR). ii) Small or slender vessel branches. iii) Non-target and motion artifact interference. X-ray fluoroscopy images and their corresponding ground truths are from the XCAD dataset [1].

Cardiovascular diseases are major contributors to global morbidity and mortality rates [2][3]. Intravascular interventions have gained considerable attention for their minimally invasive nature and swift postoperative recovery times [4][5]. These procedures require physicians to deliver instruments (e.g., guidewires, catheters, or balloons) precisely to target vessels, typically guided by intraoperative digital subtraction angiographies (DSAs) [6] or X-ray fluoroscopies [7]. However, challenges such as uneven contrast agent flow [8] or vascular occlusions [6] may prevent some vessel branches from being opacified in intraoperative images. To ensure safe deployments of instruments, developments of intelligent navigation systems are essential [9][10]. As a foundational element of navigation systems, real-time segmentation of vessel morphology is critically important [11][12].

However, accurately segmenting vessels from intraoperative images is non-trivial [13][1]. As shown in Fig. 1, the primary challenges can be summarized three-fold: i) To reduce radiation exposure for patients and physicians, interventions utilize low-power X-rays, resulting in low signal-to-noise ratio (SNR) images. ii) Complex vessel structures often include small and slender branches, which are difficult to distinguish. iii) Non-target vessels, vessel-like objects (e.g., guidewires or catheters), and motion artifacts arising from patient physiological activities can cause significant interference.

Early vessel segmentation methods rely on conventional image processing techniques. These methods begin by enhancing vessel features through image filters [14][15][16], followed by region-growing [17][18] or machine learning techniques [19][20] to obtain segmentation results. However, these conventional methods struggle to capture high-level semantic features crucial for successful segmentation, and their parameters are typically selected based on empirical evidence [21]. Consequently, the robustness and generalization capabilities of these methods are inadequate, rendering them unsuitable for clinical deployment.

In recent years, deep learning methods have taken dominant positions in various vision tasks due to their powerful capability to learn high-level semantic features [22][23]. Specifically, in the medical image segmentation domain, U-Net [24] and its variants [25][26][27] have gained widespread adoption, demonstrating remarkable success across different imaging modalities. Innovations based on U-Net have aimed to further enhance vessel segmentation performance through various methods, including designing attention modules [21][28], exploiting full-resolution learning [29], and integrating transformers [30]. Despite these advancements, current methods have not fully addressed the challenges highlighted in Fig. 1, resulting in suboptimal results. For example, CAU-net [21] utilizes a channel attention mechanism to learn channel-wise dependencies and minimize interference, yet it struggles to accurately classify vessel structures in images with low SNR. Transformers [31] offer advantages in capturing long-range vessel dependencies, which is beneficial for identifying vessel structures in low SNR images [30]. However, their image partitioning strategies [31][32] may disrupt vessel continuity, adversely affecting precise identification of small or slender vessels.

Based on the above observations, this paper proposes a novel SPatial-frequency learning and topologIcal channel inteRactiOn Network (SPIRONet) for vessel segmentation. SPIRONet leverages a spatial and a frequency encoder to extract local and global vessel features, respectively. Specifically, the spatial encoder consists of successive residual blocks [33], while the frequency encoder, based on Fourier transform, has an image-size receptive field (see Eq. (1)). Considering that local spatial features and global frequency features are complementary and mutually enhancing [34][35][36][37], a cross-attention fusion module is designed to integrate these two types of features. This fusion enables SPIRONet to discover more distinct vessel characteristics from low SNR images and accurately identify small or slender vessels. Furthermore, SPIRONet incorporates a topological channel interaction module, using graph neural networks (GNNs) [38] to explicitly learn topological channel relationships and eliminate interference. This comprehensive design allows SPIRONet to effectively tackle the challenges outlined in Fig. 1, achieving remarkable segmentation performance.

In summary, the main contributions of this work are as follows:

  • A frequency encoder based on Fourier transform is designed to capture global vessel features in the frequency domain with high efficacy. A cross-attention fusion module is further introduced to fuse complementary spatial and frequency features.

  • A topological channel interaction module is developed to filter out task-irrelevant responses via explicit channel relationship modeling and effective information interaction.

  • Our SPIRONet markedly outperforms the state-of-the-art alternatives on two in-house datasets (CADSA and CAXF) and two publicly available benchmarks (DCA1 and XCAD) with a real-time inference rate of 21 FPS. (In intravascular intervention scenarios, “real-time” is defined as 6$\sim$12 FPS due to the low capture frequency of medical equipment such as X-ray systems [39].)

The remainder of this paper is organized as follows: Section II briefly reviews current works related to this research. Section III depicts the proposed SPIRONet in detail. Section IV introduces datasets utilized in experiments and model configurations. Quantitative and qualitative experimental results are presented in Section V. Section VI gives a discussion about our model. Finally, Section VII concludes this paper.

II Related works

II-A Traditional vessel segmentation approaches

Traditional vessel segmentation methods primarily rely on pixel intensities. Among them, region-growing [40] is one of the most typical techniques. Jiang et al. [17] introduced an improved region-growing method that selects high-quality seeds based on spectral information. The ELEMENT framework [41] was developed for multi-modal vessel segmentation. It integrates connectivity features with region-growing to identify potential vessel pixels and employs the Weka framework for segmentation by leveraging a comprehensive set of complementary features. In addition to region-growing methods, other alternatives exist. Dehkordi et al. [42] proposed an active contour model that incorporates a local feature fitting energy for vessel segmentation. Similarly, Memari et al. [43] adopted fuzzy C-means clustering to delineate coarse vessel structures, which are then refined using an integrated level set approach. Another innovative method, Tensor-cut [20], conceptualizes each voxel as a second-order tensor and employs a graph cuts algorithm for final segmentation. Despite their efficacy, these methods predominantly depend on manually designed features, requiring complex processing steps and facing scalability challenges [1][44].

II-B Vessel segmentation based on deep learning

With the tremendous advance of deep learning, researchers have designed various deep networks to improve vessel segmentation quality. U-Net [24] stands out for its encoder-decoder architecture, complemented by multi-scale skip connections, enabling the efficient capture of both low-level and high-level features. Building on U-Net, models such as Attn-UNet [26] and UNet++ [25] have been introduced to further enhance segmentation performance through the incorporation of attention mechanisms and the redesign of skip connections. For vessel-specific models, Gu et al. [45] developed CE-Net, which employs a dense atrous convolution (DAC) and residual multi-kernel pooling (RMP) to simultaneously capture high-level features and preserve spatial vessel details. Similarly, CS2-Net [28] integrates channel and spatial attention modules to improve feature representations. However, limited by the receptive fields of convolutional neural networks (CNNs), these models cannot fully exploit global contexts of images [31]. In contrast, transformers exhibit a robust capacity for global context modeling [31] and have been extensively applied in natural and medical image segmentation [34][46]. TransUNet [34] has been a forerunner in integrating transformers with medical image segmentation, utilizing transformers to encode CNN features for comprehensive global context modeling. UCTransNet [47] replaced the original skip connections with a channel transformer (CTrans) to reduce semantic gaps between shallower-level encoders and decoders. Additionally, Li et al. [30] proposed a global transformer and dual local attention network GT-DLA-dsHFF, which achieves deep-shallow hierarchical feature fusion to capture global and local vessel characteristics.

II-C Learning from frequency domain

Fourier transform is a fundamental technique in conventional signal processing [48]. Leveraging its global information aggregation capabilities (see Eq. (1)), many works have incorporated Fourier transform into deep neural networks to capture long-range dependencies [49][50][51][52]. These approaches manipulate frequency information through various operations, including convolution [49], groupwise MLP layers [50], elementwise multiplication with trainable parameters [51], and adaptive frequency filters [52]. In medical image segmentation, FRCU-Net [53] introduced a channel-wise attention mechanism to re-calibrate different frequencies from Laplacian pyramids, producing more discriminative representations. Huang et al. [54] proposed a plug-and-play frequency domain attention module (FDAM) to retain valuable frequency information of medical images. Li et al. [55] proposed GFUNet, replacing U-Net’s original encoder with GFNet [51] to harness frequency features. Despite these advances, GFUNet lacks semantic adaptability [52], a critical aspect of global feature learning [56]. Addressing this gap, our work separates frequency features into amplitude and phase components, employing straightforward convolutional layers to generate semantic-adaptive masks.

II-D Channel refinement module

In high-level semantic features, each channel map is a class-specific response, with different channel responses often correlated with one another [57][58]. Mainstream research has been devoted to discovering channel correspondences to refine feature representations and eliminate task-irrelevant responses. Among these efforts, channel attention mechanisms (CAMs) stand out and have been extensively applied across various vision tasks [59][60][61]. Hu et al. [59] proposed a squeeze-and-excitation (SE) block, which recalibrates channel feature responses by explicitly modeling relationships between channels. ECA-Net [61] introduced an efficient channel attention (ECA) module, enhancing the SE block to produce channel attention maps via 1D convolution without reducing dimensionality. Li et al. [21] proposed CAU-net, which incorporates the SE block to capture vessel details and mitigate noise response. Similarly, CAR-UNet [62] integrated a channel attention double residual block (CADRB) to analyze channel statistics, while Mou et al. [28] explored self-attention across channel dimensions to address long-range dependencies and refine channel features. However, we argue that channel correlations form a graph structure, and the aforementioned methods allocate weights to each channel without explicitly formulating topological channel relationships or directly interacting channel features [58], making them less effective at enhancing vessel responses. A pioneering approach [58] introduced a dynamic-channel graph convolutional network, mapping channels onto a topological space to enhance feature refinement on a graph. However, it is applied to feature maps with lower spatial resolution (i.e., at the bottom of the encoder), and irrelevant responses may be amplified again during decoding. In contrast, our proposed topological channel interaction module, activated post-decoding, aims to filter out irrelevant responses at higher resolution.

III Methodology

III-A Preliminaries: 2D Fourier transform

Fourier transform plays a vital role in the signal processing domain [48] and is a key component in SPIRONet. Digital images are 2D spatial signals, which can be transformed into the frequency domain via 2D discrete Fourier transform (DFT):

$$\bm{X}(u,v)=\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}\bm{x}(h,w)\,e^{-j2\pi\left(\frac{uh}{H}+\frac{vw}{W}\right)} \tag{1}$$

where $\bm{x}\in\mathbb{R}^{1\times H\times W}$ is a single-channel image, $H$ and $W$ represent the height and width of $\bm{x}$, and $u$ and $v$ are coordinates in the frequency domain. In practice, the DFT is implemented using the fast Fourier transform (FFT) algorithm [63] with $\mathcal{O}(N\log N)$ complexity.
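As a quick sanity check of Eq. (1), the following minimal NumPy sketch evaluates the double sum directly and compares it with a library FFT. The helper name `dft2_naive` and the toy image size are illustrative assumptions, not part of the original work.

```python
import numpy as np

def dft2_naive(x):
    """Direct evaluation of Eq. (1) for a single-channel image x of shape (H, W)."""
    H, W = x.shape
    h = np.arange(H)
    w = np.arange(W)
    # DFT matrices: row[u, h] = exp(-j*2*pi*u*h/H), col[w, v] = exp(-j*2*pi*v*w/W)
    row = np.exp(-2j * np.pi * np.outer(h, h) / H)
    col = np.exp(-2j * np.pi * np.outer(w, w) / W)
    # X[u, v] = sum_h sum_w row[u, h] * x[h, w] * col[w, v]
    return row @ x @ col

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 6))          # toy image (hypothetical size)
X_naive = dft2_naive(x)
X_fft = np.fft.fft2(x)                   # same sign convention, no normalization
max_err = np.max(np.abs(X_naive - X_fft))
```

The direct sum costs far more than the FFT for realistic image sizes; it is shown only to make the definition concrete.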

The amplitude $\bm{\mathcal{A}}$ and phase $\bm{\mathcal{P}}$ are two significant components of $\bm{X}$. For a given $\bm{X}$, these two components can be derived as follows:

$$\begin{aligned}
\bm{\mathcal{A}}(u,v) &= \sqrt{\left\{{\rm Re}\left[\bm{X}(u,v)\right]\right\}^{2}+\left\{{\rm Im}\left[\bm{X}(u,v)\right]\right\}^{2}} \\
\bm{\mathcal{P}}(u,v) &= \arctan\left\{\frac{{\rm Im}\left[\bm{X}(u,v)\right]}{{\rm Re}\left[\bm{X}(u,v)\right]}\right\}
\end{aligned} \tag{2}$$

Similarly, given $\bm{\mathcal{A}}$ and $\bm{\mathcal{P}}$, the real and imaginary parts of $\bm{X}$ can be represented as:

$$\begin{aligned}
{\rm Re}\left[\bm{X}(u,v)\right] &= \bm{\mathcal{A}}(u,v)\cos\left\{\bm{\mathcal{P}}(u,v)\right\} \\
{\rm Im}\left[\bm{X}(u,v)\right] &= \bm{\mathcal{A}}(u,v)\sin\left\{\bm{\mathcal{P}}(u,v)\right\}
\end{aligned} \tag{3}$$
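The decomposition in Eq. (2) and the recomposition in Eq. (3) can be checked numerically. The sketch below is an illustration under the assumption that the phase is computed with a quadrant-aware arctangent (as `np.angle` does); it also shows that a plain `arctan` of the ratio would not round-trip wherever the real part is negative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))      # toy single-channel image
X = np.fft.fft2(x)

amp = np.abs(X)                      # amplitude, Eq. (2)
pha = np.angle(X)                    # phase via arctan2(Im, Re), quadrant-aware

real = amp * np.cos(pha)             # real part, Eq. (3)
imag = amp * np.sin(pha)             # imaginary part, Eq. (3)
X_rec = real + 1j * imag
x_rec = np.fft.ifft2(X_rec).real     # back to the spatial domain

# Plain arctan(Im/Re) only returns angles in (-pi/2, pi/2), so the sign of
# coefficients with a negative real part is lost.
pha_naive = np.arctan(X.imag / X.real)
X_naive = amp * np.exp(1j * pha_naive)
```

Here `X_rec` matches `X` and `x_rec` matches `x` up to floating-point error, whereas `X_naive` does not.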


Figure 2: The overview of our SPIRONet. It adopts a spatial encoder and a frequency encoder to capture complementary spatial and frequency vessel features. These two kinds of features are fused effectively by cross-attention fusion modules. The fused features are fed into CNN decoders to recover the original resolutions. After that, multi-channel features containing class-specific responses are refined by a topological channel interaction module based on GNNs. Finally, vessel predictions are obtained through a segmentation head.

Eq. (1) demonstrates that each element of $\bm{X}(u,v)$ contains information from every pixel of the image $\bm{x}$. Consequently, $\bm{\mathcal{A}}$ and $\bm{\mathcal{P}}$ have image-size receptive fields. These components capture distinct characteristics of images. The amplitude component $\bm{\mathcal{A}}$ excels in capturing textural details of vessels, which is crucial for accurately segmenting small or slender vessels. Conversely, the phase component $\bm{\mathcal{P}}$ conveys spatial relationships among different parts of images, effectively preserving structural features of vessels.
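The image-size receptive field can be illustrated with a small experiment: perturbing a single pixel changes every frequency coefficient. The image size and pixel location below are arbitrary choices for illustration.

```python
import numpy as np

H, W = 16, 16
x = np.zeros((H, W))
x_perturbed = x.copy()
x_perturbed[5, 9] = 1.0              # change one pixel only

# Difference between the two spectra
delta = np.fft.fft2(x_perturbed) - np.fft.fft2(x)

# Count the frequency coefficients affected by the single-pixel change
changed = np.abs(delta) > 1e-12
num_changed = int(changed.sum())     # equals H * W: every coefficient changes
```

By contrast, a $3\times 3$ convolution applied to the same pair of images would differ only in a $3\times 3$ neighborhood around the perturbed pixel.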


Figure 3: The architecture of encoder blocks. (a) Spatial encoder block; (b) Frequency encoder block. $\oplus$ denotes element-wise addition.

III-B Overall architecture

Our SPIRONet is detailed in Fig. 2 and adopts the U-shape [24] encoder-decoder architecture. It utilizes two parallel encoders to learn local spatial features and global frequency features, respectively. A specially crafted cross-attention fusion module is employed to mutually fuse spatial and frequency features. The fused features are then transmitted to the CNN decoder through skip connections. At the end of the decoder, a topological channel interaction module is deployed to refine channel features and filter out task-irrelevant responses. Finally, vessel predictions are generated through a segmentation head ($1\times 1$ convolution layer).

III-C Spatial-frequency representation learning

Algorithm 1 Pseudo-Code of the frequency encoder block in a PyTorch-like Style.
# x: input feature maps
# conv_amp, conv_pha: amplitude/phase convolutional layers
# conv_channel: channel adjustment
# fast Fourier transform (FFT) Eq. (1)
X = rfft2(x)
# decompose X into amplitude and phase components Eq. (2)
amp = abs(X)
pha = angle(X)
# parallel paths
amp_fuse = conv_amp(amp) + amp
pha_fuse = conv_pha(pha) + pha
# get real and imaginary parts Eq. (3)
real = amp_fuse * cos(pha_fuse)
imag = amp_fuse * sin(pha_fuse)
# inverse fast Fourier transform (iFFT)
X_ = complex(real, imag)
x_ = irfft2(X_)
# final outputs
out = x_ + x
out = conv_channel(out)
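For readers who want to execute the steps of Algorithm 1, the following is a minimal NumPy sketch rather than the authors' PyTorch implementation. The learned layers are passed in as plain callables, and the stand-ins `zero_path` and `identity` are hypothetical placeholders used only to check the data flow. One practical detail, stated here as an assumption about implementation rather than something given in the pseudocode, is that the inverse real FFT is given the original spatial size so that odd widths are restored correctly.

```python
import numpy as np

def freq_block(x, conv_amp, conv_pha, conv_channel):
    """NumPy sketch of Algorithm 1 for feature maps x of shape (C, H, W)."""
    X = np.fft.rfft2(x)                       # FFT over the last two axes, Eq. (1)
    amp, pha = np.abs(X), np.angle(X)         # amplitude and phase, Eq. (2)
    amp_fuse = conv_amp(amp) + amp            # residual amplitude path
    pha_fuse = conv_pha(pha) + pha            # residual phase path
    real = amp_fuse * np.cos(pha_fuse)        # Eq. (3)
    imag = amp_fuse * np.sin(pha_fuse)
    # Pass the spatial size explicitly so that odd widths are recovered
    x_ = np.fft.irfft2(real + 1j * imag, s=x.shape[-2:])
    return conv_channel(x_ + x)               # residual connection, then channel adjustment

# Hypothetical stand-ins for the learned convolutional layers
zero_path = lambda t: np.zeros_like(t)
identity = lambda t: t

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 7))            # odd width on purpose
out = freq_block(x, zero_path, zero_path, identity)
```

With the amplitude and phase paths contributing nothing, the spectrum is reconstructed unchanged, so the block output reduces to `x_ + x = 2 * x`, which is a convenient check that the transform pair and the residual connection are wired correctly.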


Figure 4: Cross-attention module. $\otimes$, $\oplus$, and ⓒ indicate matrix multiplication, element-wise addition, and channel-dimension concatenation, respectively.

Fig. 3 shows architectures of the proposed encoder blocks. Inputs of the $i$-th spatial encoder block ${\rm Enc}_{\rm Spa}^{i}$ and frequency encoder block ${\rm Enc}_{\rm Freq}^{i}$ are $\bm{f}_{\rm Spa}^{i}$ and $\bm{f}_{\rm Freq}^{i}$, respectively. Following previous methods [34][37], CNNs are utilized to extract local vessel features. The spatial encoder block is similar to the residual module in ResNet [33]. Inspired by [64], two parallel paths with convolution layers are adopted to process the amplitude $\bm{F}_{\rm Amp}^{i}$ and phase $\bm{F}_{\rm Pha}^{i}$ components, respectively. The frequency encoder block can be easily implemented in deep learning frameworks (e.g., PyTorch [65]), as presented in Algorithm 1. Outputs of the $i$-th encoder blocks can be formulated as follows:

$$\begin{aligned}
\hat{\bm{f}}_{\rm Spa}^{i} &= {\rm Enc}_{\rm Spa}^{i}\left(\bm{f}_{\rm Spa}^{i}\right) \\
\hat{\bm{f}}_{\rm Freq}^{i} &= {\rm Enc}_{\rm Freq}^{i}\left(\bm{f}_{\rm Freq}^{i}\right)
\end{aligned} \tag{4}$$

Then the outputs are downsampled via $2\times 2$ max-pooling layers before being sent into the $(i+1)$-th encoder blocks.

III-D Cross-attention fusion

As discussed in previous sections, the spatial encoder concentrates on learning local vessel features, while the frequency encoder captures long-range vessel dependencies, benefiting from certain properties of Fourier transform. Recent research indicates that local and global features are complementary, providing mutual guidance for learning more robust features [35][37]. To effectively integrate local spatial features and global frequency features, a cross-attention fusion module is proposed, as depicted in Fig.  4.

Inputs of the $i$-th cross-attention fusion module are features ($\hat{\bm{f}}_{\rm Spa}^{i},\hat{\bm{f}}_{\rm Freq}^{i}\in\mathbb{R}^{C_{i}\times H_{i}\times W_{i}}$) extracted by the $i$-th spatial and frequency encoder blocks, where $C_{i}$, $H_{i}$, and $W_{i}$ represent the channel number, height, and width of feature maps.
First, these two features are projected into the embedding space through different “Conv-BN-ReLU” layers $\bm{W}^{i}(\cdot)$ to generate the spatial query $\bm{Q}^{i}_{\rm Spa}$ and key $\bm{K}^{i}_{\rm Spa}$, as well as the frequency query $\bm{Q}^{i}_{\rm Freq}$ and key $\bm{K}^{i}_{\rm Freq}$. The mixed value is the projection of the concatenated feature ${\rm Concat}\left(\hat{\bm{f}}_{\rm Spa}^{i},\hat{\bm{f}}_{\rm Freq}^{i}\right)\in\mathbb{R}^{2C_{i}\times H_{i}\times W_{i}}$. To alleviate the high complexity brought by matrix multiplications, pyramid pooling modules (PPM) [66] are adopted to sample specific feature maps. The above process can be formulated by:

$$\bm{Q}^{i}_{\rm Spa}=\bm{W}^{i}_{\rm SQ}\left(\hat{\bm{f}}_{\rm Spa}^{i}\right),\quad \bm{Q}^{i}_{\rm Freq}=\bm{W}^{i}_{\rm FQ}\left(\hat{\bm{f}}_{\rm Freq}^{i}\right) \tag{5}$$
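$$\bm{K}^{i}_{\rm Spa}={\rm PPM}\left[\bm{W}^{i}_{\rm SK}\left(\hat{\bm{f}}_{\rm Spa}^{i}\right)\right],\quad \bm{K}^{i}_{\rm Freq}={\rm PPM}\left[\bm{W}^{i}_{\rm FK}\left(\hat{\bm{f}}_{\rm Freq}^{i}\right)\right] \tag{6}$$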
$$\bm{V}^{i}={\rm PPM}\left\{\bm{W}^{i}_{\rm V}\left[{\rm Concat}\left(\hat{\bm{f}}_{\rm Spa}^{i},\hat{\bm{f}}_{\rm Freq}^{i}\right)\right]\right\} \tag{7}$$

where $\bm{Q}^{i}_{\rm Spa},\bm{Q}^{i}_{\rm Freq}\in\mathbb{R}^{\left(H_{i}\cdot W_{i}\right)\times d_{0}}$, $\bm{K}^{i}_{\rm Spa},\bm{K}^{i}_{\rm Freq}\in\mathbb{R}^{N\times d_{0}}$, and $\bm{V}^{i}\in\mathbb{R}^{N\times d_{0}}$. $d_{0}$ is the dimension of the embedding space and $N$ is the spatial resolution sampled by PPM, with $N\ll H_{i}\cdot W_{i}$. Thus, the cross-attention (CA) can be calculated as follows:

(8)

III-E Topological channel interaction

Inspired by [58], we hypothesize that channel relationships are graph-structured and introduce a topological channel interaction (TCI) module to enhance vessel-specific responses, as illustrated in Fig. 5. Graph neural networks (GNNs) are widely used to learn features of graph-structured data. Given a graph $\mathcal{G}=\left(V,E\right)$ and its adjacency matrix $\bm{A}$, the output of a GNN can be formulated as follows [38]:

$$\bm{H}_{\rm out}=\sigma\left(\tilde{\bm{L}}\bm{H}_{\rm in}\bm{\Theta}\right) \tag{9}$$

$$\tilde{\bm{L}}=\tilde{\bm{D}}^{-\frac{1}{2}}\tilde{\bm{A}}\tilde{\bm{D}}^{-\frac{1}{2}} \tag{10}$$


Figure 5: Topological channel interaction module. $\oplus$ represents element-wise addition.

where $\sigma\left(\cdot\right)$ is a non-linear activation function and $\bm{\Theta}$ denotes the trainable parameters. $\tilde{\bm{L}}$ is the Laplacian matrix, $\tilde{\bm{A}}=\bm{A}+\bm{I}$, and $\tilde{\bm{D}}_{ii}=\sum_{j}\tilde{\bm{A}}_{ij}$. The input of the topological channel interaction module is $\bm{H}_{\rm in}=\bm{f}\in\mathbb{R}^{C\times(H\cdot W)/4^{2}}$, obtained by downsampling the decoder output $\bm{f}_{\rm in}\in\mathbb{R}^{C\times H\times W}$ with a $4\times 4$ max-pooling layer. To learn data-dependent topological graph structures, we follow [67] and use an improved Laplacian matrix:

$$\tilde{\bm{L}}=\bm{I}-\tilde{\bm{D}}^{-\frac{1}{2}}\tilde{\bm{A}}\tilde{\bm{D}}^{-\frac{1}{2}} \tag{11}$$

where $\tilde{\bm{A}},\tilde{\bm{D}}\in\mathbb{R}^{C\times C}$, and $C$ denotes the number of channels of the input feature maps $\bm{f}$.
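As a small illustration of Eqs. (10) and (11), the following sketch builds both normalized forms for a toy three-node graph in plain Python. The helper names (`normalized_adjacency`, `improved_laplacian`) and the toy adjacency matrix are illustrative assumptions and are not taken from the released code.

```python
import math

def normalized_adjacency(a_tilde):
    """Return D^{-1/2} A~ D^{-1/2} (Eq. 10); assumes positive row sums."""
    n = len(a_tilde)
    deg = [sum(row) for row in a_tilde]              # D~_ii = sum_j A~_ij
    inv_sqrt = [1.0 / math.sqrt(d) for d in deg]
    return [[inv_sqrt[i] * a_tilde[i][j] * inv_sqrt[j] for j in range(n)]
            for i in range(n)]

def improved_laplacian(a_tilde):
    """Return I - D^{-1/2} A~ D^{-1/2} (Eq. 11)."""
    n = len(a_tilde)
    norm = normalized_adjacency(a_tilde)
    return [[(1.0 if i == j else 0.0) - norm[i][j] for j in range(n)]
            for i in range(n)]

# Toy path graph 0 - 1 - 2 (hypothetical); self-loops are added via A~ = A + I.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
A_tilde = [[A[i][j] + (1 if i == j else 0) for j in range(3)] for i in range(3)]
L = improved_laplacian(A_tilde)
```

For this toy graph the degrees are 2, 3, and 2, so the diagonal of the resulting matrix is $1/2$, $2/3$, $1/2$, and the matrix is symmetric.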

To formulate topological relationships among different channels, we calculate channels’ cosine similarities by:

$$\tilde{a}_{ij}=\frac{\bm{W}\left(\bm{f}\right)_{i}^{T}\bm{W}\left(\bm{f}\right)_{j}}{\|\bm{W}\left(\bm{f}\right)_{i}\|_{2}\,\|\bm{W}\left(\bm{f}\right)_{j}\|_{2}} \tag{12}$$

where $\bm{W}(\cdot)$ is an embedding layer consisting of "Conv-ReLU" and $\tilde{a}_{ij}$ is the $(i,j)$-th element of $\tilde{\bm{A}}$. Thus, Eq. (9) can be written as follows:

$$\bm{H}_{\rm out}=\sigma\left\{\tilde{\bm{L}}\,{\rm Down}_{/4}\left(\bm{f}_{\rm in}\right)\bm{\Theta}\right\} \tag{13}$$

where $\sigma$ is the ReLU activation function and ${\rm Down}_{/4}$ denotes the downsampling operator. Overall, the output of our topological channel interaction module can be formulated as:

$$\bm{f}_{\rm out}={\rm Up}_{\times 4}\left\{\sigma\left[\tilde{\bm{L}}\,{\rm Down}_{/4}\left(\bm{f}_{\rm in}\right)\bm{\Theta}\right]\right\}+\bm{f}_{\rm in} \tag{14}$$

where ${\rm Up}_{\times 4}$ denotes the upsampling operator. Specifically, we implement the downsampling and upsampling operators via max pooling and transposed convolution, respectively.
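The data flow of Eqs. (12) to (14) can be sketched in plain Python as follows. This is only a minimal illustration under several loudly stated assumptions: the learned "Conv-ReLU" embedding $\bm{W}(\cdot)$ is replaced by a bare ReLU, $\bm{\Theta}$ is taken to be the identity, and the transposed convolution is replaced by nearest-neighbour upsampling. All function names are hypothetical.

```python
import math

def max_pool4(channel):
    """4x4 max pooling with stride 4 on one H x W channel (H, W divisible by 4)."""
    h, w = len(channel), len(channel[0])
    return [[max(channel[r + dr][c + dc] for dr in range(4) for dc in range(4))
             for c in range(0, w, 4)]
            for r in range(0, h, 4)]

def cosine_adjacency(rows):
    """Pairwise cosine similarity between channel vectors (Eq. 12)."""
    norms = [math.sqrt(sum(v * v for v in r)) or 1.0 for r in rows]
    return [[sum(a * b for a, b in zip(ri, rj)) / (ni * nj)
             for rj, nj in zip(rows, norms)]
            for ri, ni in zip(rows, norms)]

def improved_laplacian(a_tilde):
    """I - D^{-1/2} A~ D^{-1/2} (Eq. 11); rows with zero degree are left unscaled."""
    n = len(a_tilde)
    inv_sqrt = [1.0 / math.sqrt(s) if s > 0 else 0.0
                for s in (sum(row) for row in a_tilde)]
    return [[(1.0 if i == j else 0.0) - inv_sqrt[i] * a_tilde[i][j] * inv_sqrt[j]
             for j in range(n)] for i in range(n)]

def tci_forward(f_in):
    """Sketch of Eq. (14) for f_in given as C channels of H x W nested lists."""
    channels = len(f_in)
    height, width = len(f_in[0]), len(f_in[0][0])
    pooled = [max_pool4(ch) for ch in f_in]                    # Down_{/4}
    wp = width // 4
    h_in = [[v for row in ch for v in row] for ch in pooled]   # C x (H*W/16)
    embedded = [[max(v, 0.0) for v in vec] for vec in h_in]    # stand-in for Conv-ReLU
    lap = improved_laplacian(cosine_adjacency(embedded))
    n_pos = len(h_in[0])
    # Channel-wise graph propagation with Theta = identity, then ReLU (Eq. 13).
    h_out = [[max(sum(lap[i][k] * h_in[k][n] for k in range(channels)), 0.0)
              for n in range(n_pos)]
             for i in range(channels)]
    # Nearest-neighbour x4 upsampling plus the residual connection (Eq. 14).
    return [[[f_in[c][r][col] + h_out[c][(r // 4) * wp + col // 4]
              for col in range(width)]
             for r in range(height)]
            for c in range(channels)]

# Two constant 4 x 4 channels (toy input).
f_in = [[[1.0] * 4 for _ in range(4)], [[2.0] * 4 for _ in range(4)]]
f_out = tci_forward(f_in)
```

The output keeps the input shape $C\times H\times W$, which is what allows the residual addition in Eq. (14).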

III-F Loss function

Consistent with previous works, binary cross entropy (BCE) loss is adopted as the loss function. It is defined as follows:

$$\mathcal{L}_{\rm BCE}=-\frac{1}{N}\sum_{i=1}^{N}\left\{y_{\rm gt}^{(i)}\log{y_{\rm pred}^{(i)}}+(1-y_{\rm gt}^{(i)})\log{(1-y_{\rm pred}^{(i)})}\right\} \tag{15}$$

where $y_{\rm gt}^{(i)}$ and $y_{\rm pred}^{(i)}$ refer to the ground truth and the model prediction of the $i$-th image, respectively, and $N$ is the batch size.
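A minimal numerical sketch of Eq. (15) is given below. It treats the targets and predicted probabilities as flat lists and clamps the probabilities with a small `eps` to avoid $\log 0$. Both the flattening and the clamp are assumptions made for illustration, and `bce_loss` is a hypothetical helper name.

```python
import math

def bce_loss(y_gt, y_pred, eps=1e-7):
    """Mean binary cross entropy over paired targets and probabilities (Eq. 15)."""
    total = 0.0
    for t, p in zip(y_gt, y_pred):
        p = min(max(p, eps), 1.0 - eps)   # clamp (assumption) to avoid log(0)
        total += t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return -total / len(y_gt)

# An uninformative prediction of 0.5 costs log(2) per element.
uninformative = bce_loss([1.0, 0.0], [0.5, 0.5])
# A perfect prediction costs (almost) nothing.
perfect = bce_loss([1.0, 0.0], [1.0, 0.0])
```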

IV Experimental setup

IV-A Datasets

In our experiments, four challenging benchmark datasets are utilized, comprising two in-house datasets and two publicly available datasets.

Chronic artery digital subtraction angiography dataset (CADSA). The CADSA dataset is derived from Beijing Tiantan Hospital, Capital Medical University (https://www.bjtth.org/Html/News/Main/1001125.html). It contains 275 images from 26 patient DSA sequences, with an original resolution of $750\times 750$. Chronic arteries are annotated by experienced physicians using ITK-SNAP [68]. 197 images from 15 patients' DSA sequences are randomly selected for training, and the remaining 78 images from 5 patients' DSA sequences are used for testing.

Coronary arteries X-ray fluoroscopy dataset (CAXF) [21]. This dataset was proposed in our previous work. It comprises 538 images from 36 X-ray fluoroscopy sequences, with each image at a resolution of $512\times 512$. Specifically, 412 images are from 7 sequences, and 126 images are randomly selected from 29 sequences. Following the previous settings [21], 337 images from 24 sequences are utilized for training, and the remaining 201 images from 12 sequences are selected for testing.

DCA1 [69]. This dataset is provided by the Mexican Social Security Institute, UMAE T1-León. It includes 134 X-ray coronary angiograms with a resolution of $300\times 300$, along with corresponding ground truths annotated by an expert cardiologist. Following [29], this dataset is split into 100 training images and 34 testing images.

XCAD [1]. The XCAD dataset contains 1747 coronary angiograms obtained by a General Electric Innova IGS 520 system. Each image has a resolution of $512\times 512$. It is important to note that only 126 images in this dataset have vessel annotations. We randomly select 100 images as the training set, and the remaining 26 images as the testing set.

IV-B Evaluation metrics

To thoroughly evaluate the proposed method and baselines, four evaluation metrics are selected: sensitivity (Sen.), ${\rm F_{1}}$-score (${\rm F_{1}}$), Intersection over Union (IoU), and Matthews correlation coefficient (MCC).

$$\text{Sen.}=\frac{\text{TP}}{\text{TP}+\text{FN}} \tag{16}$$

$${\rm F_{1}}=\frac{2\times\text{TP}}{2\times\text{TP}+\text{FP}+\text{FN}} \tag{17}$$

$$\text{IoU}=\frac{\text{TP}}{\text{TP}+\text{FP}+\text{FN}} \tag{18}$$

$$\text{MCC}=\frac{\text{TP}\times\text{TN}-\text{FP}\times\text{FN}}{\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}} \tag{19}$$

where TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative pixels in segmentation results, respectively.
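The four metrics can be computed directly from pixel-level confusion counts, as in the sketch below. The function name and the example counts are hypothetical. MCC uses its standard definition, and the zero-denominator fallback to 0 is an assumption made for robustness.

```python
import math

def segmentation_metrics(tp, tn, fp, fn):
    """Sensitivity, F1, IoU, and MCC from confusion counts (assumes tp + fn > 0)."""
    sen = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0   # fallback is an assumption
    return {"Sen": sen, "F1": f1, "IoU": iou, "MCC": mcc}

# Hypothetical counts for a 1000-pixel image with 90 vessel pixels.
m = segmentation_metrics(tp=80, tn=900, fp=10, fn=10)
```

Note that ${\rm F_{1}}$ and IoU are monotonically related through ${\rm F_{1}}=2\,\text{IoU}/(1+\text{IoU})$, so the two always rank models identically on a single image, whereas MCC additionally accounts for true negatives.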

IV-C Implementation details

TABLE I: Ablation experimental results on the XCAD dataset. The best results are in bold. The second best results are underlined. All results are averaged over 3 random seeds.

| Variants | SE | FE | CA | TCI | Sen. (%) $\uparrow$ | ${\rm F_{1}}$ (%) $\uparrow$ | IoU (%) $\uparrow$ | MCC (%) $\uparrow$ |
|---|---|---|---|---|---|---|---|---|
| I | | | | | 80.84 | 80.06 | 67.11 | 79.11 |
| II | | | | | 76.00 | 77.14 | 63.11 | 76.19 |
| III | | | | | 82.10 | 81.09 | 68.39 | 80.15 |
| IV | | | | | 82.00 | 81.22 | 68.57 | 80.23 |
| V | | | | | 82.23 | 80.73 | 67.98 | 79.77 |
| VI | | | | | 78.15 | 77.63 | 63.78 | 76.64 |
| VII | | | | | $\bm{83.07}$ | $\underline{81.36}$ | $\underline{68.76}$ | $\underline{80.40}$ |
| SPIRONet | | | | | $\underline{82.91}$ | $\bm{81.76}$ | $\bm{69.37}$ | $\bm{80.82}$ |

SE: Spatial encoder; FE: Frequency encoder; CA: Cross-attention; TCI: Topological channel interaction.

TABLE II: Comparison with state-of-the-arts on the CADSA and CAXF datasets. The best results are highlighted in blue and the second best results are highlighted in red. "Mean $\pm$ std" are reported over 3 random seeds.

| Model | CADSA Sen. (%) $\uparrow$ | CADSA ${\rm F_{1}}$ (%) $\uparrow$ | CADSA IoU (%) $\uparrow$ | CADSA MCC (%) $\uparrow$ | CAXF Sen. (%) $\uparrow$ | CAXF ${\rm F_{1}}$ (%) $\uparrow$ | CAXF IoU (%) $\uparrow$ | CAXF MCC (%) $\uparrow$ |
|---|---|---|---|---|---|---|---|---|
| UNet [24] [MICCAI'15] | 72.11 ± 4.05 | 76.58 ± 2.90 | 64.60 ± 3.20 | 77.58 ± 2.59 | 89.22 ± 0.42 | 89.90 ± 0.18 | 81.80 ± 0.29 | 89.50 ± 0.18 |
| UNet++ [25] [TMI'19] | 74.59 ± 4.46 | 78.04 ± 2.26 | 66.20 ± 2.49 | 78.88 ± 1.86 | 89.53 ± 0.28 | 90.01 ± 0.18 | 81.96 ± 0.28 | 89.59 ± 0.18 |
| AttnUNet [26] [MedIA'19] | 73.27 ± 5.70 | 76.58 ± 2.81 | 64.34 ± 3.14 | 77.79 ± 2.20 | 89.13 ± 0.15 | 89.76 ± 0.09 | 81.58 ± 0.15 | 89.34 ± 0.09 |
| CE-Net [45] [TMI'19] | 75.61 ± 2.69 | 77.80 ± 0.93 | 65.20 ± 1.20 | 78.69 ± 0.84 | 89.74 ± 0.37 | 89.93 ± 0.13 | 81.82 ± 0.22 | 89.52 ± 0.14 |
| CAU-net [21] [ICONIP'20] | 76.09 ± 0.17 | 79.05 ± 1.43 | 67.40 ± 1.57 | 79.88 ± 1.29 | 89.31 ± 0.27 | 89.85 ± 0.16 | 81.70 ± 0.23 | 89.43 ± 0.15 |
| TransUNet [34] [ICMLW'21] | 74.59 ± 1.69 | 75.80 ± 2.80 | 63.60 ± 2.91 | 76.86 ± 2.48 | 90.16 ± 0.18 | 89.89 ± 0.09 | 81.75 ± 0.14 | 89.47 ± 0.09 |
| CS2-Net [28] [MedIA'21] | 66.12 ± 5.08 | 73.10 ± 3.80 | 60.29 ± 4.02 | 74.95 ± 3.08 | 89.84 ± 0.10 | 89.84 ± 0.03 | 81.67 ± 0.07 | 89.41 ± 0.04 |
| FR-UNet [29] [JBHI'21] | 59.94 ± 0.77 | 62.70 ± 0.42 | 50.72 ± 0.31 | 63.55 ± 0.36 | 90.10 ± 0.78 | 89.17 ± 0.14 | 80.56 ± 0.21 | 88.71 ± 0.14 |
| DE-DCGCN-EE [58] [TMI'22] | 66.42 ± 1.90 | 70.57 ± 0.55 | 57.14 ± 0.74 | 72.16 ± 0.54 | 89.29 ± 0.28 | 88.91 ± 0.07 | 80.16 ± 0.10 | 88.45 ± 0.08 |
| GT-DLA-dsHFF [30] [TCyber'23] | 70.14 ± 7.32 | 71.32 ± 3.09 | 57.83 ± 2.93 | 72.67 ± 2.57 | 89.94 ± 0.07 | 89.94 ± 0.13 | 81.82 ± 0.20 | 89.52 ± 0.13 |
| SPIRONet [Ours] | 81.20 ± 2.72 | 80.10 ± 0.24 | 68.27 ± 0.48 | 80.61 ± 0.19 | 90.80 ± 0.87 | 90.32 ± 0.27 | 82.48 ± 0.45 | 89.94 ± 0.28 |

TABLE III: Comparison with state-of-the-arts on the DCA1 and XCAD datasets. The best results are highlighted in blue and the second best results are highlighted in red. "Mean $\pm$ std" are reported over 3 random seeds.

| Model | DCA1 Sen. (%) $\uparrow$ | DCA1 ${\rm F_{1}}$ (%) $\uparrow$ | DCA1 IoU (%) $\uparrow$ | DCA1 MCC (%) $\uparrow$ | XCAD Sen. (%) $\uparrow$ | XCAD ${\rm F_{1}}$ (%) $\uparrow$ | XCAD IoU (%) $\uparrow$ | XCAD MCC (%) $\uparrow$ |
|---|---|---|---|---|---|---|---|---|
| UNet [24] [MICCAI'15] | 80.04 ± 0.36 | 78.90 ± 0.21 | 65.31 ± 0.30 | 77.95 ± 0.23 | 80.64 ± 1.31 | 80.74 ± 0.08 | 67.98 ± 0.11 | 79.81 ± 0.12 |
| UNet++ [25] [TMI'19] | 80.10 ± 0.23 | 78.45 ± 0.04 | 64.69 ± 0.06 | 77.49 ± 0.04 | 81.13 ± 0.62 | 80.43 ± 0.25 | 67.56 ± 0.34 | 79.45 ± 0.26 |
| AttnUNet [26] [MedIA'19] | 79.20 ± 0.30 | 78.04 ± 0.01 | 64.18 ± 0.01 | 77.07 ± 0.01 | 80.08 ± 0.35 | 79.98 ± 0.24 | 66.94 ± 0.36 | 79.01 ± 0.26 |
| CE-Net [45] [TMI'19] | 79.49 ± 0.73 | 77.84 ± 0.12 | 63.87 ± 0.16 | 76.84 ± 0.13 | 79.96 ± 0.45 | 79.95 ± 0.13 | 66.80 ± 0.17 | 78.95 ± 0.14 |
| CAU-net [21] [ICONIP'20] | 79.29 ± 0.18 | 77.82 ± 0.12 | 63.88 ± 0.15 | 76.86 ± 0.12 | 80.22 ± 0.75 | 79.31 ± 0.31 | 66.06 ± 0.44 | 78.31 ± 0.32 |
| TransUNet [34] [ICMLW'21] | 81.09 ± 0.46 | 78.82 ± 0.12 | 65.19 ± 0.18 | 77.82 ± 0.12 | 80.86 ± 0.99 | 80.24 ± 0.49 | 67.28 ± 0.62 | 79.26 ± 0.49 |
| CS2-Net [28] [MedIA'21] | 78.46 ± 0.52 | 77.87 ± 0.25 | 63.94 ± 0.32 | 76.92 ± 0.23 | 79.47 ± 0.72 | 79.23 ± 0.30 | 65.98 ± 0.39 | 78.30 ± 0.32 |
| FR-UNet [29] [JBHI'21] | 78.96 ± 1.95 | 79.59 ± 0.30 | 66.22 ± 0.42 | 79.47 ± 0.28 | 81.65 ± 1.70 | 79.79 ± 0.38 | 66.66 ± 0.49 | 78.87 ± 0.34 |
| DE-DCGCN-EE [58] [TMI'22] | 78.48 ± 0.19 | 77.82 ± 0.09 | 63.87 ± 0.11 | 76.83 ± 0.08 | 79.89 ± 0.43 | 79.12 ± 0.12 | 65.76 ± 0.17 | 78.06 ± 0.12 |
| GT-DLA-dsHFF [30] [TCyber'23] | 75.62 ± 0.74 | 77.17 ± 0.24 | 62.97 ± 0.32 | 76.12 ± 0.24 | 80.56 ± 0.52 | 80.44 ± 0.35 | 67.53 ± 0.44 | 79.45 ± 0.36 |
| SPIRONet [Ours] | 80.76 ± 0.94 | 79.75 ± 0.48 | 66.45 ± 0.66 | 78.75 ± 0.48 | 82.91 ± 0.63 | 81.76 ± 0.51 | 69.73 ± 0.69 | 80.82 ± 0.53 |

All experiments are performed based on PyTorch 1.12.0 [65], Python 3.8, and Ubuntu 18.04. Our models and baseline methods are trained on a single Nvidia GeForce RTX 3090 GPU with 24GB of memory. For data augmentation, we use random flipping and random rotation within the range of $\left[-20^{\circ},20^{\circ}\right]$. SGD is adopted as our optimizer with a momentum of 0.9 and a weight decay of 0.0001. A polynomial annealing policy is utilized to adjust the learning rate, which can be written as $lr\leftarrow lr_{\rm init}\times\left(1-\frac{\rm epoch}{\rm total\_epochs}\right)^{0.9}$. The batch size is set to 4 for all datasets. The image size for all datasets is $512\times 512$, except for DCA1, which is $300\times 300$. Furthermore, for CADSA, we train our model with an initial learning rate of 0.03 for 200 epochs. For CAXF, the initial learning rate is set to 0.08 and the total number of epochs is 800. For DCA1, the model is trained for 600 epochs with an initial learning rate of 0.045. For XCAD, we opt for an initial learning rate of 0.06 and train for 700 epochs. We train models over 3 random seeds and report "mean $\pm$ std" of all metrics.
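The polynomial annealing policy and the per-dataset settings above can be written down compactly, as in the sketch below. The names `poly_lr` and `TRAIN_CONFIG` are hypothetical, and the sketch assumes the rate is updated once per epoch with a zero-based epoch index, which the text does not specify.

```python
def poly_lr(lr_init, epoch, total_epochs, power=0.9):
    """Polynomial annealing: lr_init * (1 - epoch / total_epochs) ** power."""
    return lr_init * (1.0 - epoch / total_epochs) ** power

# Initial learning rate and epoch budget per dataset, as reported in the text.
TRAIN_CONFIG = {
    "CADSA": {"lr_init": 0.03, "epochs": 200},
    "CAXF": {"lr_init": 0.08, "epochs": 800},
    "DCA1": {"lr_init": 0.045, "epochs": 600},
    "XCAD": {"lr_init": 0.06, "epochs": 700},
}

for name, cfg in TRAIN_CONFIG.items():
    halfway = poly_lr(cfg["lr_init"], cfg["epochs"] // 2, cfg["epochs"])
    print(f"{name}: learning rate at the halfway point = {halfway:.4f}")
```

With a power of 0.9 the schedule is close to linear decay: halfway through training the rate is $0.5^{0.9}\approx 0.536$ of its initial value, and it reaches zero at the final epoch.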

V Results

Extensive experiments are conducted to answer the following questions:

  • Q1: Do the proposed modules facilitate vessel segmentation?

  • Q2: Does the proposed SPIRONet yield better vessel segmentation performance than state-of-the-art methods?

  • Q3: Does our model achieve a desirable trade-off between computational efficiency and segmentation performances?

V-A Ablation studies (Q1)

In this section, we conduct experiments on the XCAD dataset to verify the effectiveness of key components in SPIRONet. The quantitative results are reported in Table I. It should be noted that for Variant III and Variant VII, we simply perform element-wise addition to fuse spatial and frequency features. Our main observations are as follows: i) For learning vessel features, both local spatial features and global frequency features are useful. Models that incorporate both spatial and frequency features outperform those utilizing only a single type of feature. ii) Spatial features seem to matter more than frequency features in vessel segmentation tasks. One possible explanation is that, similar to vision transformers [31], the frequency encoder requires a large amount of data to explore long-range dependencies in images. The datasets used in this paper are small, which prevents the model from learning robust representations for segmentation. iii) The cross-attention fusion module can effectively fuse spatial and frequency features, further improving segmentation performance (i.e., $81.36\%\rightarrow 81.76\%$ in ${\rm F_{1}}$, $68.76\%\rightarrow 69.37\%$ in IoU, and $80.40\%\rightarrow 80.82\%$ in MCC) compared to the element-wise addition method (Variant VII). iv) The topological channel interaction module is capable of refining channel features and removing task-irrelevant responses, resulting in performance gains of $0.91\%$ in sensitivity, $0.54\%$ in ${\rm F_{1}}$, $0.80\%$ in IoU, and $0.59\%$ in MCC against Variant IV.
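The gains quoted in observations iii) and iv) follow directly from Table I and can be re-derived with a few lines of arithmetic. The dictionary below simply copies the relevant rows, and the helper name `gain` is hypothetical.

```python
# (Sen., F1, IoU, MCC) in percent on XCAD, copied from Table I.
TABLE_I = {
    "IV": (82.00, 81.22, 68.57, 80.23),
    "VII": (83.07, 81.36, 68.76, 80.40),
    "SPIRONet": (82.91, 81.76, 69.37, 80.82),
}

def gain(base, full):
    """Per-metric difference full - base, rounded to two decimals."""
    return tuple(round(f - b, 2) for b, f in zip(TABLE_I[base], TABLE_I[full]))

tci_gain = gain("IV", "SPIRONet")    # effect of adding the TCI module
ca_gain = gain("VII", "SPIRONet")    # cross-attention vs. element-wise addition
```

Here `tci_gain` reproduces the four gains listed in iv). `ca_gain` reproduces the ${\rm F_{1}}$, IoU, and MCC improvements in iii), and also shows that sensitivity is 0.16 points lower than for Variant VII, consistent with the bold and underlined entries in Table I.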


Figure 6: Visualization of vessel segmentation results on the CADSA, CAXF [21], DCA1 [69], and XCAD [1] datasets (from top to bottom). IoU (%) for each image is shown in the upper right corner.

V-B Comparisons with state-of-the-arts (Q2)

We evaluate our SPIRONet against state-of-the-art (SOTA) segmentation models across four datasets. To ensure fairness, we train these models using source codes available in their respective GitHub repositories, applying the same configurations as those used for SPIRONet.

V-B1 In-house datasets

Table II summarizes quantitative results on two in-house datasets, CADSA and CAXF [21]. Notably, our proposed SPIRONet achieves $81.20\%$ sensitivity, $80.10\%$ ${\rm F_{1}}$, $68.27\%$ IoU, and $80.61\%$ MCC on the CADSA dataset and $90.80\%$ sensitivity, $90.32\%$ ${\rm F_{1}}$, $82.48\%$ IoU, and $89.94\%$ MCC on the CAXF dataset, surpassing all SOTA models on every metric, with the largest margins on CADSA. On the CADSA dataset, it is worth noting the significant improvement from CAU-net [21] (the second best method) to SPIRONet by $5.11\%$ in sensitivity ($76.09\%\rightarrow 81.20\%$). Improved sensitivity indicates that a model is more capable of extracting thin vessels and vessel boundaries [29], demonstrating that SPIRONet has a superior ability to learn discriminative vessel features.

V-B2 Publicly available benchmarks

Table III presents comparison results on two publicly available benchmarks, DCA1 [69] and XCAD [1]. On the DCA1 dataset, our SPIRONet achieves the best performance in ${\rm F_{1}}$ and IoU and has sensitivity and MCC comparable to those of SOTA models. SPIRONet obtains the best outcomes in all metrics on the XCAD dataset. These results demonstrate that our SPIRONet generalizes well across different datasets.

TABLE IV: Model parameters, floating-point operations (FLOPs), and inference rates. The input size of models is set to $512\times 512$. Inference rates are measured on the test set of the XCAD dataset and repeated five times.

| Model | #params (M) | FLOPs (G) | Rate (FPS) |
|---|---|---|---|
| UNet [24] [MICCAI'15] | 17.26 | 160.44 | 61.52 ± 0.91 |
| UNet++ [25] [TMI'19] | 9.16 | 139.46 | 45.03 ± 0.08 |
| AttnUNet [26] [MedIA'19] | 57.16 | 541.04 | 29.00 ± 0.26 |
| CE-Net [45] [TMI'19] | 29.00 | 35.60 | 74.11 ± 0.38 |
| CAU-net [21] [ICONIP'20] | 1.95 | 13.91 | 102.98 ± 5.37 |
| TransUNet [34] [ICMLW'21] | 93.23 | 129.45 | 24.82 ± 0.13 |
| CS2-Net [28] [MedIA'21] | 8.40 | 55.85 | 82.02 ± 0.43 |
| FR-UNet [29] [JBHI'21] | 5.72 | 235.60 | 26.75 ± 0.05 |
| DE-DCGCN-EE [58] [TMI'22] | 14.11 | 294.46 | 18.07 ± 0.02 |
| GT-DLA-dsHFF [30] [TCyber'23] | 26.10 | 474.60 | 11.80 ± 0.20 |
| SPIRONet [Ours] | 16.98 | 200.90 | 20.57 ± 0.09 |

V-B3 Visualization results

We further visualize vessel segmentation results of several models to provide qualitative comparisons, including UNet [24], UNet++ [25], CE-Net [45], CAU-net [21], CS2-Net [28], DE-DCGCN-EE [58], and our SPIRONet. Here, we can intuitively observe how SPIRONet addresses the challenges presented in Fig. 1: i) Uneven flow of contrast agent and low-power X-rays cause the images in rows 1 and 3 to exhibit low SNR. Our model demonstrates a strong capability in locating challenging vessel regions, as highlighted in red boxes; ii) The slender vessel in row 2, highlighted in red boxes, is usually not fully segmented by baselines. By fully exploring local spatial and global frequency features, our model achieves more precise segmentation results; iii) Non-target vessels highlighted in red boxes in row 4 interfere with segmentation, resulting in some false positive predictions by baselines. Benefiting from the proposed topological channel interaction module, our model successfully distinguishes irrelevant interference and avoids false positive predictions.

V-C Model complexity (Q3)

Table IV compares the baseline models with our SPIRONet in terms of model parameters, floating-point operations (FLOPs), and inference rates. Note that #params and FLOPs are calculated with the thop library (https://pypi.org/project/thop/). SPIRONet has fewer parameters than half of the baselines, and its FLOPs are moderate (200.90 G, lower than four of the ten baselines). Moreover, the inference rate of our model is approximately 21 FPS, which satisfies clinical real-time requirements (6$\sim$12 FPS) [39]. Although some models, such as CAU-net [21], have faster inference rates, considering SPIRONet’s impressive performances across four datasets, this represents a favorable trade-off.
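As an illustration of how an inference rate of the form "mean ± deviation over repeated runs" could be obtained, the following is a minimal sketch using only the Python standard library. It is not the authors' benchmarking script: `measure_fps`, `meets_realtime`, and `dummy_forward` are hypothetical names, and the dummy workload merely stands in for one forward pass on a single frame.

```python
import statistics
import time


def measure_fps(run_once, num_frames=50, repeats=5, warmup=5):
    """Return (mean, std) of frames per second over `repeats` timed runs.

    `run_once` is any zero-argument callable standing in for one forward
    pass on a single frame (hypothetical placeholder, not the real model).
    """
    for _ in range(warmup):  # discard start-up costs before timing
        run_once()
    rates = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(num_frames):
            run_once()
        elapsed = time.perf_counter() - start
        rates.append(num_frames / elapsed)
    return statistics.mean(rates), statistics.stdev(rates)


def meets_realtime(fps, required_fps=12.0):
    """Compare a rate with the upper end of the 6~12 FPS clinical range."""
    return fps >= required_fps


def dummy_forward():
    # Stand-in workload; replace with a real model call when profiling.
    sum(i * i for i in range(1000))


mean_fps, std_fps = measure_fps(dummy_forward)
print(f"{mean_fps:.2f} +/- {std_fps:.2f} FPS")
print("SPIRONet rate from Table IV is real-time:", meets_realtime(20.57))
```

When the model runs on a GPU, the device typically needs to be synchronized before each clock reading, because kernel launches are asynchronous; otherwise the measured rate can be overly optimistic.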

VI Discussion

In this section, we delve deeper into some specific designs in our model. Default configurations of SPIRONet are highlighted in gray. The experiments are conducted on the XCAD dataset.

Since the feature encoder plays a critical role in our model, we further compare our frequency operation design with those in previous works [50], [51]. Quantitative results are presented in Table V. The baseline refers to the model without frequency encoder blocks (Variant V in Table I). Note that in models incorporating frequency features but lacking the cross-attention (CA) module, we simply perform element-wise addition to fuse spatial and frequency features. Several observations can be drawn from Table V: i) When equipped with the cross-attention module, models incorporating frequency features improve consistently in $\mathrm{F_1}$, IoU, and MCC, and in sensitivity for AFNO and GFN;

TABLE V: Comparisons with other frequency operators on the XCAD dataset. The best results are in bold. The second best results are underlined. All results are averaged over 3 random seeds.
Model | Sen. (%) $\uparrow$ | $\mathrm{F_1}$ (%) $\uparrow$ | IoU (%) $\uparrow$ | MCC (%) $\uparrow$
V [Ours] | 82.23 | 80.73 | 67.98 | 79.77
V+AFNO [50] [ICLR’22] | 82.22 | 80.60 | 67.80 | 79.67
V+GFN [51] [TPAMI’23] | 82.14 | 80.62 | 67.82 | 79.67
V+FE [Ours] | $\underline{83.07}$ | $\underline{81.36}$ | $\underline{68.76}$ | $\underline{80.40}$
V+CA+AFNO [50] [ICLR’22] | 83.05 | 81.23 | 68.62 | 80.24
V+CA+GFN [51] [TPAMI’23] | $\mathbf{83.12}$ | 81.24 | 68.61 | 80.26
V+CA+FE [Ours] | 82.91 | $\mathbf{81.76}$ | $\mathbf{69.37}$ | $\mathbf{80.82}$

FE: Frequency encoder; CA: Cross-attention.

TABLE VI: Comparisons with other channel refinement modules on the XCAD dataset. The best results are in bold. The second best results are underlined. All results are averaged over 3 random seeds.
Model | Sen. (%) $\uparrow$ | $\mathrm{F_1}$ (%) $\uparrow$ | IoU (%) $\uparrow$ | MCC (%) $\uparrow$
IV [Ours] | $\underline{82.00}$ | $\underline{81.22}$ | $\underline{68.57}$ | $\underline{80.23}$
IV+SE [59] [TPAMI’19] | 81.61 | 80.99 | 68.26 | 80.02
IV+SK [70] [CVPR’19] | 81.77 | 81.06 | 68.35 | 80.07
IV+ECA [61] [CVPR’20] | 81.66 | 81.06 | 68.34 | 80.06
IV+FCA [71] [ICCV’21] | 81.58 | 81.17 | 68.50 | 80.19
IV+TCI [Ours] | $\mathbf{82.91}$ | $\mathbf{81.76}$ | $\mathbf{69.37}$ | $\mathbf{80.82}$

TCI: Topological channel interaction.

ii) The proposed frequency operation design clearly outperforms those in AFNO [50] and GFN [51], which utilize deterministic functions or masks to filter frequency information. In contrast, our model efficiently learns specific frequency features from different inputs, benefiting from the semantic-adaptive design of our frequency encoder.
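The effect of the cross-attention module in observation i) can be checked with simple arithmetic on the values reported in Table V. The sketch below only recomputes differences between table rows; the names `without_ca`, `with_ca`, and `ca_gain` are illustrative and do not come from the released code.

```python
# Values copied from Table V (XCAD), in percent: Sen., F1, IoU, MCC.
METRICS = ("Sen.", "F1", "IoU", "MCC")

# Frequency operators fused by element-wise addition (no cross-attention).
without_ca = {
    "AFNO": (82.22, 80.60, 67.80, 79.67),
    "GFN": (82.14, 80.62, 67.82, 79.67),
    "FE": (83.07, 81.36, 68.76, 80.40),
}

# The same operators fused by the cross-attention (CA) module.
with_ca = {
    "AFNO": (83.05, 81.23, 68.62, 80.24),
    "GFN": (83.12, 81.24, 68.61, 80.26),
    "FE": (82.91, 81.76, 69.37, 80.82),
}


def ca_gain(operator):
    """Per-metric change (percentage points) when CA replaces addition."""
    pairs = zip(METRICS, with_ca[operator], without_ca[operator])
    return {metric: round(w - wo, 2) for metric, w, wo in pairs}


for op in without_ca:
    print(op, ca_gain(op))
```

Under these numbers, $\mathrm{F_1}$, IoU, and MCC increase for all three operators, while sensitivity increases for AFNO and GFN and decreases slightly for FE.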

Furthermore, we compare our proposed topological channel interaction module with several typical channel refinement modules [59], [61], [70], [71]. Table VI summarizes the results. Our model incorporating TCI achieves considerable performance gains over those incorporating other channel refinement modules, namely $1.14\%\sim 1.33\%$ increases in sensitivity, $0.59\%\sim 0.77\%$ increases in $\mathrm{F_1}$, $0.87\%\sim 1.11\%$ increases in IoU, and $0.63\%\sim 0.80\%$ increases in MCC. Surprisingly, the baseline model IV, which does not utilize channel refinement modules, performs even better than models that incorporate advanced channel refinement modules. This phenomenon indicates that simply allocating weights to each channel may introduce extra interference in our vessel segmentation scenario. Unlike current methods, we use graphs to formulate unstructured relationships among channels and exchange information across different channels via graph neural networks, thus filtering out irrelevant noise and enhancing vessel-specific responses.
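The gain ranges quoted above follow directly from Table VI. The short sketch below recomputes them as the smallest and largest difference between IV+TCI and each competing module; `gain_range` is an illustrative helper, and the numbers are copied from the table.

```python
# Values copied from Table VI (XCAD), in percent: Sen., F1, IoU, MCC.
METRICS = ("Sen.", "F1", "IoU", "MCC")

tci = (82.91, 81.76, 69.37, 80.82)  # IV+TCI (ours)
others = {
    "SE": (81.61, 80.99, 68.26, 80.02),
    "SK": (81.77, 81.06, 68.35, 80.07),
    "ECA": (81.66, 81.06, 68.34, 80.06),
    "FCA": (81.58, 81.17, 68.50, 80.19),
}


def gain_range(metric_index):
    """Smallest and largest gain of TCI over the other modules."""
    gains = [round(tci[metric_index] - row[metric_index], 2)
             for row in others.values()]
    return min(gains), max(gains)


for i, name in enumerate(METRICS):
    low, high = gain_range(i)
    print(f"{name}: {low:.2f} to {high:.2f} percentage points")
```

The differences are absolute (percentage points), which matches the ranges reported in the text.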


Figure 7: Visualization of class activation maps [72] produced by different models. (a) Image; (b) Ground truth; (c) IV; (d) IV+SK [70]; (e) IV+FCA [71]; (f) IV+TCI (Ours).

Additionally, we employ Grad-CAM [72] to compare discriminative regions with and without channel refinement modules. As depicted in Fig. 7, regions highlighted in red boxes exhibit serious interference, including vessel-like interventional instruments and motion artifacts. Comparison methods tend to assign weights to these irrelevant regions (highlighted in red boxes), thus producing more false positive predictions. By leveraging the topological channel interaction module, our SPIRONet demonstrates enhanced abilities in filtering out irrelevant responses compared to other models.
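For readers unfamiliar with Grad-CAM, its standard formulation from [72] is summarized below. The symbols are introduced here for illustration: $A^{k}$ is the $k$-th channel of the chosen feature map, $y^{c}$ is the score for class $c$ (for segmentation, some aggregate of the vessel logits, whose exact form is not specified in this section), and $Z$ is the number of spatial positions.

```latex
\alpha_k^{c} = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^{c}}{\partial A_{ij}^{k}},
\qquad
L_{\text{Grad-CAM}}^{c} = \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^{c} A^{k} \right)
```

Because the map is a channel-weighted sum, channels that respond to instruments or motion artifacts contribute directly to the highlighted regions, which is why channel refinement is visible in these visualizations.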

VII Conclusion

This paper proposes a SPatial-frequency learning and topologIcal channel inteRactiOn Network (SPIRONet) to tackle challenges in vessel segmentation. Dual encoders in SPIRONet can extract local spatial and global frequency vessel features effectively. Then, complementary spatial and frequency vessel features can be fused by cross-attention fusion modules. Furthermore, the proposed topological channel interaction module is able to filter out task-irrelevant responses in multi-channel feature maps. Promising experimental results on four benchmarks have demonstrated the effectiveness of our model. In future work, knowledge distillation will be explored to further enhance the inference efficiency of SPIRONet. Additionally, SPIRONet will be verified in real clinical scenarios.

References

  • [1] Y. Ma et al., “Self-supervised vessel segmentation via adversarial learning,” in Proc. ICCV, 2021, pp. 7536–7545.
  • [2] R. L. Sacco et al., “The heart of 25 by 25: Achieving the goal of reducing global and regional premature deaths from cardiovascular diseases and stroke: A modeling study from the American Heart Association and World Heart Federation,” Circulation, vol. 133, no. 23, pp. e674–e690, 2016.
  • [3] G. A. Roth, G. A. Mensah, and V. Fuster, “The global burden of cardiovascular diseases and risks: A compass for global action,” J. Am. Coll. Cardiol., vol. 76, no. 25, pp. 2980–2981, 2020.
  • [4] M. Wan et al., “Symptomatic and asymptomatic chronic carotid artery occlusion on high-resolution MR vessel wall imaging,” Am. J. Neuroradiol., vol. 43, no. 1, pp. 110–116, 2022.
  • [5] N. B. Langer and M. Argenziano, “Minimally invasive cardiovascular surgery: Incisions and approaches,” Methodist Debakey Cardiovasc. J., vol. 12, no. 1, p. 4, 2016.
  • [6] F. Cagnazzo et al., “Endovascular recanalization of chronically occluded internal carotid artery,” J. Neurointerv. Surg., vol. 12, no. 10, pp. 946–951, 2020.
  • [7] H. Li, X.-H. Zhou, X.-L. Xie, S.-Q. Liu, Z.-Q. Feng, and Z.-G. Hou, “CASOG: Conservative actor–critic with smooth gradient for skill learning in robot-assisted intervention,” IEEE Trans. Ind. Electron., 2023, DOI: 10.1109/TIE.2023.3310021.
  • [8] C. Meng, Y. Li, Y. Xu, N. Li, and K. Xia, “A weakly supervised framework for 2D/3D vascular registration oriented to incomplete 2D blood vessels,” IEEE Trans. Med. Robot. Bionics, vol. 4, no. 2, pp. 381–390, 2022.
  • [9] M. E. Abdelaziz, L. Tian, M. Hamady, G.-Z. Yang, and B. Temelkuran, “X-ray to MR: The progress of flexible instruments for endovascular navigation,” Prog. Biomed. Eng., vol. 3, no. 3, p. 032004, 2021.
  • [10] M. J. Eagleton, “Updates in endovascular procedural navigation,” Can. J. Cardiol., vol. 38, no. 5, pp. 662–671, 2022.
  • [11] D.-X. Huang et al., “Real-time 2D/3D registration via CNN regression and centroid alignment,” IEEE Trans. Autom. Sci. Eng., 2024, DOI: 10.1109/TASE.2023.3345927.
  • [12] J. Zhu et al., “3D/2D vessel registration based on Monte Carlo tree search and manifold regularization,” IEEE Trans. Med. Imaging, 2023, DOI: 10.1109/TMI.2023.3347896.
  • [13] S. Xia et al., “Vessel segmentation of X-ray coronary angiographic image sequence,” IEEE Trans. Biomed. Eng., vol. 67, no. 5, pp. 1338–1348, 2019.
  • [14] A. F. Frangi, W. J. Niessen, K. L. Vincken, and M. A. Viergever, “Multiscale vessel enhancement filtering,” in Proc. MICCAI, 1998, pp. 130–137.
  • [15] R. Manniesing, M. A. Viergever, and W. J. Niessen, “Vessel enhancing diffusion: A scale space representation of vessel structures,” Med. Image Anal., vol. 10, no. 6, pp. 815–825, 2006.
  • [16] F. Cervantes-Sanchez et al., “Segmentation of coronary angiograms using Gabor filters and Boltzmann univariate marginal distribution algorithm,” Comput. Intell. Neurosci., vol. 2016, 2016.
  • [17] H. Jiang et al., “A region growing vessel segmentation algorithm based on spectrum information,” Comput. Math. Methods Med., vol. 2013, p. 743870, 2013.
  • [18] Y.-Z. Zeng et al., “Automatic liver vessel segmentation using 3D region growing and hybrid active contour model,” Comput. Biol. Med., vol. 97, pp. 63–73, 2018.
  • [19] N. Sangsefidi, A. H. Foruzan, and A. Dolati, “Balancing the data term of graph-cuts algorithm to improve segmentation of hepatic vascular structures,” Comput. Biol. Med., vol. 93, pp. 117–126, 2018.
  • [20] C. Wang et al., “Tensor-cut: A tensor-based graph-cut blood vessel segmentation method and its application to renal artery segmentation,” Med. Image Anal., vol. 60, p. 101623, 2020.
  • [21] R.-Q. Li, G.-B. Bian, X.-H. Zhou, X. Xie, Z.-L. Ni, and Z. Hou, “CAU-net: A novel convolutional neural network for coronary artery segmentation in digital subtraction angiography,” in Proc. ICONIP, 2020, pp. 185–196.
  • [22] S. Minaee, Y. Boykov, F. Porikli, A. Plaza, N. Kehtarnavaz, and D. Terzopoulos, “Image segmentation using deep learning: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 7, pp. 3523–3542, 2021.
  • [23] X. Chen et al., “Recent advances and clinical applications of deep learning in medical image analysis,” Med. Image Anal., vol. 79, p. 102444, 2022.
  • [24] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. MICCAI, 2015, pp. 234–241.
  • [25] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “UNet++: Redesigning skip connections to exploit multiscale features in image segmentation,” IEEE Trans. Med. Imaging, vol. 39, no. 6, pp. 1856–1867, 2019.
  • [26] J. Schlemper et al., “Attention gated networks: Learning to leverage salient regions in medical images,” Med. Image Anal., vol. 53, pp. 197–207, 2019.
  • [27] D.-X. Huang et al., “MOSformer: Momentum encoder-based inter-slice fusion transformer for medical image segmentation,” arXiv:2401.11856, 2024.
  • [28] L. Mou et al., “CS2-Net: Deep learning segmentation of curvilinear structures in medical imaging,” Med. Image Anal., vol. 67, p. 101874, 2021.
  • [29] W. Liu et al., “Full-resolution network and dual-threshold iteration for retinal vessel and coronary angiograph segmentation,” IEEE J. Biomed. Health Inform., vol. 26, no. 9, pp. 4623–4634, 2022.
  • [30] Y. Li et al., “Global transformer and dual local attention network via deep-shallow hierarchical feature fusion for retinal vessel segmentation,” IEEE Trans. Cybern., vol. 53, no. 9, pp. 5826–5839, 2023.
  • [31] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv:2010.11929, 2020.
  • [32] Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proc. ICCV, 2021, pp. 10 012–10 022.
  • [33] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016, pp. 770–778.
  • [34] J. Chen et al., “TransUNet: Transformers make strong encoders for medical image segmentation,” arXiv:2102.04306, 2021.
  • [35] J. Guo et al., “CMT: Convolutional neural networks meet vision transformers,” in Proc. CVPR, 2022, pp. 12 175–12 185.
  • [36] M. Heidari et al., “HiFormer: Hierarchical multi-scale representations using transformers for medical image segmentation,” in Proc. WACV, 2023, pp. 6202–6212.
  • [37] H. Kuang et al., “Hybrid CNN-Transformer network with circular feature interaction for acute ischemic stroke lesion segmentation on non-contrast CT scans,” IEEE Trans. Med. Imaging, 2024, DOI: 10.1109/TMI.2024.3362879.
  • [38] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv:1609.02907, 2016.
  • [39] H. Heidbuchel et al., “Practical ways to reduce radiation dose for patients and staff during device implantations and electrophysiological procedures,” Europace, vol. 16, no. 7, pp. 946–964, 2014.
  • [40] R. Adams and L. Bischof, “Seeded region growing,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 16, no. 6, pp. 641–647, 1994.
  • [41] E. O. Rodrigues, A. Conci, and P. Liatsis, “ELEMENT: Multi-modal retinal vessel segmentation based on a coupled region growing and machine learning approach,” IEEE J. Biomed. Health Inform., vol. 24, no. 12, pp. 3507–3519, 2020.
  • [42] M. Taghizadeh Dehkordi, A. M. Doost Hoseini, S. Sadri, and H. Soltanianzadeh, “Local feature fitting active contour for segmenting vessels in angiograms,” IET Comput. Vis., vol. 8, no. 3, pp. 161–170, 2014.
  • [43] N. Memari, A. R. Ramli, M. I. B. Saripan, S. Mashohor, and M. Moghbel, “Retinal blood vessel segmentation by using matched filtering and fuzzy C-means clustering with integrated level set method for diabetic retinopathy assessment,” J. Med. Biol. Eng., vol. 39, pp. 713–731, 2019.
  • [44] B. Kim, Y. Oh, and J. C. Ye, “Diffusion adversarial representation learning for self-supervised vessel segmentation,” arXiv:2209.14566, 2022.
  • [45] Z. Gu et al., “CE-Net: Context encoder network for 2D medical image segmentation,” IEEE Trans. Med. Imaging, vol. 38, no. 10, pp. 2281–2292, 2019.
  • [46] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “SegFormer: Simple and efficient design for semantic segmentation with transformers,” in Proc. NeurIPS, vol. 34, 2021, pp. 12 077–12 090.
  • [47] H. Wang, P. Cao, J. Wang, and O. R. Zaiane, “UCTransNet: Rethinking the skip connections in U-Net from a channel-wise perspective with transformer,” in Proc. AAAI, vol. 36, no. 3, 2022, pp. 2441–2449.
  • [48] R. C. Gonzalez and P. Wintz, Digital image processing. Addison-Wesley Longman Publishing Co., Inc., 1987.
  • [49] L. Chi, B. Jiang, and Y. Mu, “Fast Fourier convolution,” in Proc. NeurIPS, vol. 33, 2020, pp. 4479–4488.
  • [50] J. Guibas, M. Mardani, Z. Li, A. Tao, A. Anandkumar, and B. Catanzaro, “Adaptive Fourier neural operators: Efficient token mixers for transformers,” arXiv:2111.13587, 2021.
  • [51] Y. Rao, W. Zhao, Z. Zhu, J. Zhou, and J. Lu, “GFNet: Global filter networks for visual recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 9, pp. 10 960–10 973, 2023.
  • [52] Z. Huang, Z. Zhang, C. Lan, Z.-J. Zha, Y. Lu, and B. Guo, “Adaptive frequency filters as efficient global token mixers,” in Proc. ICCV, 2023, pp. 6049–6059.
  • [53] R. Azad, A. Bozorgpour, M. Asadi-Aghbolaghi, D. Merhof, and S. Escalera, “Deep frequency re-calibration U-Net for medical image segmentation,” in Proc. ICCVW, 2021, pp. 3274–3283.
  • [54] Y. Huang, C. Zhou, L. Chen, J. Chen, and S. Lan, “Medical frequency domain learning: Consider inter-class and intra-class frequency for medical image segmentation and classification,” in Proc. BIBM, 2021, pp. 897–904.
  • [55] P. Li, R. Zhou, J. He, S. Zhao, and Y. Tian, “A global-frequency-domain network for medical image segmentation,” Comput. Biol. Med., vol. 164, p. 107290, 2023.
  • [56] D. Bau, J.-Y. Zhu, H. Strobelt, A. Lapedriza, B. Zhou, and A. Torralba, “Understanding the role of individual units in a deep neural network,” Proc. Natl. Acad. Sci. USA, vol. 117, no. 48, pp. 30 071–30 078, 2020.
  • [57] J. Fu, J. Liu, J. Jiang, Y. Li, Y. Bao, and H. Lu, “Scene segmentation with dual relation-aware attention network,” IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 6, pp. 2547–2560, 2020.
  • [58] Y. Li, Y. Zhang, W. Cui, B. Lei, X. Kuang, and T. Zhang, “Dual encoder-based dynamic-channel graph convolutional network with edge enhancement for retinal vessel segmentation,” IEEE Trans. Med. Imaging, vol. 41, no. 8, pp. 1975–1989, 2022.
  • [59] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, “Squeeze-and-Excitation networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 8, pp. 2011–2023, 2019.
  • [60] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “CBAM: Convolutional block attention module,” in Proc. ECCV, 2018, pp. 3–19.
  • [61] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, “ECA-Net: Efficient channel attention for deep convolutional neural networks,” in Proc. CVPR, 2020, pp. 11 534–11 542.
  • [62] C. Guo, M. Szemenyei, Y. Hu, W. Wang, W. Zhou, and Y. Yi, “Channel attention residual U-Net for retinal vessel segmentation,” in Proc. ICASSP, 2021, pp. 1185–1189.
  • [63] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Math. Comput., vol. 19, no. 90, pp. 297–301, 1965.
  • [64] C. Wang, J. Jiang, Z. Zhong, and X. Liu, “Spatial-frequency mutual learning for face super-resolution,” in Proc. CVPR, 2023, pp. 22 356–22 366.
  • [65] A. Paszke et al., “PyTorch: An imperative style, high-performance deep learning library,” in Proc. NeurIPS, vol. 32, 2019.
  • [66] Z. Zhu, M. Xu, S. Bai, T. Huang, and X. Bai, “Asymmetric non-local neural networks for semantic segmentation,” in Proc. ICCV, 2019, pp. 593–602.
  • [67] X. Li, Y. Yang, Q. Zhao, T. Shen, Z. Lin, and H. Liu, “Spatial pyramid based graph reasoning for semantic segmentation,” in Proc. CVPR, 2020, pp. 8950–8959.
  • [68] P. A. Yushkevich et al., “User-guided 3D active contour segmentation of anatomical structures: Significantly improved efficiency and reliability,” NeuroImage, vol. 31, no. 3, pp. 1116–1128, 2006.
  • [69] F. Cervantes-Sanchez, I. Cruz-Aceves, A. Hernandez-Aguirre, M. A. Hernandez-Gonzalez, and S. E. Solorio-Meza, “Automatic segmentation of coronary arteries in X-ray angiograms using multiscale analysis and artificial neural networks,” Appl. Sci., vol. 9, no. 24, p. 5507, 2019.
  • [70] X. Li, W. Wang, X. Hu, and J. Yang, “Selective kernel networks,” in Proc. CVPR, 2019, pp. 510–519.
  • [71] Z. Qin, P. Zhang, F. Wu, and X. Li, “FcaNet: Frequency channel attention networks,” in Proc. ICCV, 2021, pp. 783–792.
  • [72] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” in Proc. ICCV, 2017, pp. 618–626.