SPIRONet: SPatial-Frequency Learning and
TopologIcal Channel InteRactiOn Network for Vessel Segmentation

De-Xing Huang, Xiao-Hu Zhou^∗, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Zhen-Qiu Feng, Mei-Jiang Gui, Hao Li, Tian-Yu Xiang, Bo-Xian Yao, and Zeng-Guang Hou^∗

This work was supported in part by the National Natural Science Foundation of China under Grant 62373351, Grant 62222316, Grant U20A20224, Grant U1913601, Grant 62073325, Grant 61720106012, and Grant 62003198; in part by the Youth Innovation Promotion Association of Chinese Academy of Sciences (CAS) under Grant 2020140; and in part by the CIE-Tencent Robotics X Rhino-Bird Focused Research Program. (Corresponding authors: Xiao-Hu Zhou and Zeng-Guang Hou.)

D.-X. Huang, X.-H. Zhou, X.-L. Xie, S.-Q. Liu, S.-Y. Wang, Z.-Q. Feng, M.-J. Gui, H. Li, T.-Y. Xiang, and B.-X. Yao are with the State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China (e-mail: [email protected]; [email protected]).

Z.-G. Hou is with the State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, also with the CAS Center for Excellence in Brain Science and Intelligence Technology, Beijing 100190, China, also with the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China, and also with the Joint Laboratory of Intelligence Science and Technology, Institute of Systems Engineering, Macau University of Science and Technology, Taipa, Macau (e-mail: [email protected]).

Abstract

Automatic vessel segmentation is paramount for developing next-generation interventional navigation systems. However, current approaches suffer from suboptimal segmentation performance due to significant challenges in intraoperative images (i.e., low signal-to-noise ratio, small or slender vessels, and strong interference). In this paper, a novel SPatial-frequency learning and topologIcal channel inteRactiOn Network (SPIRONet) is proposed to address the above issues. Specifically, dual encoders are utilized to comprehensively capture local spatial and global frequency vessel features. Then, a cross-attention fusion module is introduced to effectively fuse spatial and frequency features, thereby enhancing feature discriminability. Furthermore, a topological channel interaction module is designed to filter out task-irrelevant responses based on graph neural networks. Extensive experimental results on several challenging datasets (CADSA, CAXF, DCA1, and XCAD) demonstrate the state-of-the-art performance of our method. Moreover, the inference speed of SPIRONet is 21 FPS with a $512\times 512$ input size, surpassing clinical real-time requirements (6 $\sim$ 12 FPS). These promising outcomes indicate SPIRONet’s potential for integration into vascular interventional navigation systems. Code is available at https://github.com/Dxhuang-CASIA/SPIRONet.

Index Terms:

vessel segmentation, Fourier transform, spatial-frequency fusion, graph convolutional networks (GCNs).

I Introduction

Figure 1: Illustration of challenges in vessel segmentation. i) Low signal-to-noise ratio (SNR). ii) Small or slender vessel branches. iii) Non-target and motion artifact interference. X-ray fluoroscopy images and their corresponding ground truths are from the XCAD dataset [1].

Cardiovascular diseases are major contributors to global morbidity and mortality rates [2], [3]. Intravascular interventions have gained considerable attention for their minimally invasive nature and swift postoperative recovery times [4], [5]. These procedures require physicians to deliver instruments (e.g., guidewires, catheters, or balloons) precisely to target vessels, typically guided by intraoperative digital subtraction angiographies (DSAs) [6] or X-ray fluoroscopies [7]. However, challenges such as uneven contrast agent flow [8] or vascular occlusions [6] may prevent some vessel branches from being opacified in intraoperative images. To ensure safe deployments of instruments, developments of intelligent navigation systems are essential [9], [10]. As a foundational element of navigation systems, real-time segmentation of vessel morphology is critically important [11], [12].

However, accurately segmenting vessels from intraoperative images is non-trivial [13], [1]. As shown in Fig. 1, the primary challenges can be summarized three-fold: i) To reduce radiation exposure for patients and physicians, interventions utilize low-power X-rays, resulting in low signal-to-noise ratio (SNR) images. ii) Complex vessel structures often include small and slender branches, which are difficult to distinguish. iii) Non-target vessels, vessel-like objects (e.g., guidewires or catheters), and motion artifacts arising from patient physiological activities can cause significant interference.

Early vessel segmentation methods rely on conventional image processing techniques. These methods begin by enhancing vessel features through image filters [14], [15], [16], followed by region-growing [17], [18] or machine learning techniques [19], [20] to obtain segmentation results. However, these conventional methods struggle to capture the high-level semantic features crucial for successful segmentation, and their parameters are typically selected empirically [21]. Consequently, the robustness and generalization capabilities of these methods are inadequate, rendering them unsuitable for clinical deployment.

In recent years, deep learning methods have taken dominant positions in various vision tasks due to their powerful capability to learn high-level semantic features [22], [23]. Specifically, in the medical image segmentation domain, U-Net [24] and its variants [25], [26], [27] have gained widespread adoption, demonstrating remarkable success across different imaging modalities. Innovations based on U-Net have aimed to further enhance vessel segmentation performance through various methods, including designing attention modules [21], [28], exploiting full-resolution learning [29], and integrating transformers [30]. Despite these advancements, current methods have not fully addressed the challenges highlighted in Fig. 1, resulting in suboptimal results. For example, CAU-net [21] utilizes a channel attention mechanism to learn channel-wise dependencies and minimize interference, yet it struggles to accurately classify vessel structures in images with low SNR. Transformers [31] offer advantages in capturing long-range vessel dependencies, which is beneficial for identifying vessel structures in low SNR images [30]. However, their image partitioning strategies [31], [32] may disrupt vessel continuity, adversely affecting the precise identification of small or slender vessels.

Based on the above observations, this paper proposes a novel SPatial-frequency learning and topologIcal channel inteRactiOn Network (SPIRONet) for vessel segmentation. SPIRONet leverages a spatial and a frequency encoder to extract local and global vessel features, respectively. Specifically, the spatial encoder consists of successive residual blocks [33], while the frequency encoder, based on Fourier transform, has an image-size receptive field (see Eq. (1)). Considering that local spatial features and global frequency features are complementary and mutually enhancing [34], [35], [36], [37], a cross-attention fusion module is designed to integrate these two types of features. This fusion enables SPIRONet to discover more distinct vessel characteristics from low SNR images and accurately identify small or slender vessels. Furthermore, SPIRONet incorporates a topological channel interaction module, using graph neural networks (GNNs) [38] to explicitly learn topological channel relationships and eliminate interference. This comprehensive design allows SPIRONet to effectively tackle the challenges outlined in Fig. 1, achieving remarkable segmentation performance.

In summary, the main contributions of this work are as follows:

• A frequency encoder based on Fourier transform is designed to capture global vessel features in the frequency domain with high efficacy. A cross-attention fusion module is further introduced to fuse complementary spatial and frequency features.

• A topological channel interaction module is developed to filter out task-irrelevant responses via explicit channel relationship modeling and effective information interaction.

• Our SPIRONet markedly outperforms the state-of-the-art alternatives on two in-house datasets (CADSA and CAXF) and two publicly available benchmarks (DCA1 and XCAD) with a real-time inference rate of 21 FPS.¹

¹ In intravascular intervention scenarios, the definition of “real-time” is 6 $\sim$ 12 FPS due to the low capture frequency of medical equipment like X-ray systems [39].

The remainder of this paper is organized as follows: Section II briefly reviews current works related to this research. Section III depicts the proposed SPIRONet in detail. Section IV introduces datasets utilized in experiments and model configurations. Quantitative and qualitative experimental results are presented in Section V. Section VI gives a discussion about our model. Finally, Section VII concludes this paper.

II Related works

II-A Traditional vessel segmentation approaches

Traditional vessel segmentation methods primarily rely on pixel intensities. Among them, region-growing [40] is one of the most typical techniques. Jiang et al. [17] introduced an improved region-growing method that selects high-quality seeds based on spectral information. The ELEMENT framework [41] was developed for multi-modal vessel segmentation. It integrates connectivity features with region-growing to identify potential vessel pixels and employs the Weka framework for segmentation by leveraging a comprehensive set of complementary features. In addition to region-growing methods, other alternatives exist. Dehkordi et al. [42] proposed an active contour model that incorporates a local feature fitting energy for vessel segmentation. Similarly, Memari et al. [43] adopted fuzzy C-means clustering to delineate coarse vessel structures, which are then refined using an integrated level set approach. Another innovative method, Tensor-cut [20], conceptualizes each voxel as a second-order tensor and employs a graph cuts algorithm for final segmentation. Despite their efficacy, these methods predominantly depend on manually designed features, requiring complex processing steps and facing scalability challenges [1], [44].

II-B Vessel segmentation based on deep learning

With the tremendous advance of deep learning, researchers have designed various deep networks to improve vessel segmentation quality. U-Net [24] stands out for its encoder-decoder architecture, complemented by multi-scale skip connections, enabling the efficient capture of both low-level and high-level features. Building on UNet, models such as Attn-UNet [26] and UNet++ [25] have been introduced to further enhance segmentation performance through the incorporation of attention mechanisms and the redesign of skip connections. For vessel-specific models, Gu et al. [45] developed CE-Net, which employs a dense atrous convolution (DAC) and residual multi-kernel pooling (RMP) to simultaneously capture high-level features and preserve spatial vessel details. Similarly, CS²-Net [28] integrates channel and spatial attention modules to improve feature representations. However, limited by the receptive fields of convolutional neural networks (CNNs), these models cannot fully exploit global contexts of images [31]. In contrast, transformers exhibit a robust capacity for global context modeling [31] and have been extensively applied in natural and medical image segmentation [34], [46]. TransUNet [34] has been a forerunner in integrating transformers with medical image segmentation, utilizing transformers to encode CNN features for comprehensive global context modeling. UCTransNet [47] replaced the original skip connections with a channel transformer (CTrans) to reduce semantic gaps between shallower-level encoders and decoders. Additionally, Li et al. [30] proposed a global transformer and dual local attention network GT-DLA-dsHFF, which achieves deep-shallow hierarchical feature fusion to capture global and local vessel characteristics.

II-C Learning from frequency domain

Fourier transform is a fundamental technique in conventional signal processing [48]. Leveraging its global information aggregation capabilities (see Eq. (1)), many works have incorporated Fourier transform into deep neural networks to capture long-range dependencies [49], [50], [51], [52]. These approaches manipulate frequency information through various operations, including convolution [49], groupwise MLP layers [50], elementwise multiplication with trainable parameters [51], and adaptive frequency filters [52]. In medical image segmentation, FRCU-Net [53] introduced a channel-wise attention mechanism to re-calibrate different frequencies from Laplacian pyramids, producing more discriminative representations. Huang et al. [54] proposed a plug-and-play frequency domain attention module (FDAM) to retain valuable frequency information of medical images. Li et al. [55] proposed GFUNet, replacing U-Net’s original encoder with GFNet [51] to harness frequency features. Despite these advances, GFUNet lacks semantic adaptability [52], a critical aspect of global feature learning [56]. Addressing this gap, our work separates frequency features into amplitude and phase components, employing straightforward convolutional layers to generate semantic-adaptive masks.

II-D Channel refinement module

In high-level semantic features, each channel map is a class-specific response, with different channel responses often correlated with one another [57], [58]. Mainstream research has been devoted to discovering channel correspondences to refine feature representations and eliminate task-irrelevant responses. Among these efforts, channel attention mechanisms (CAMs) stand out and have been extensively applied across various vision tasks [59], [60], [61]. Hu et al. [59] proposed a squeeze-and-excitation (SE) block, which recalibrates channel feature responses by explicitly modeling relationships between channels. ECA-Net [61] introduced an efficient channel attention (ECA) module, enhancing the SE block to produce channel attention maps via 1D convolution without reducing dimensionality. Li et al. [21] proposed CAU-net, which incorporates the SE block to capture vessel details and mitigate noise response. Similarly, CAR-UNet [62] integrated a channel attention double residual block (CADRB) to analyze channel statistics, while Mou et al. [28] explored self-attention across channel dimensions to address long-range dependencies and refine channel features. However, we argue that channel correlations form a graph structure, and the aforementioned methods allocate weights to each channel without explicitly formulating topological channel relationships or directly interacting channel features [58], making them less effective at enhancing vessel responses. A pioneering approach by [58] introduced a dynamic-channel graph convolutional network, mapping channels onto a topological space to enhance feature refinement on a graph. However, it is applied to feature maps with lower spatial resolution (i.e., at the bottom of the encoder), and irrelevant responses may be amplified again during decoding. In contrast, our proposed topological channel interaction module, activated post-decoding, aims to filter out irrelevant responses at higher resolution.

III Methodology

III-A Preliminaries: 2D Fourier transform

Fourier transform plays a vital role in the signal processing domain [48] and is a key component in SPIRONet. Digital images are 2D spatial signals, which can be transformed into the frequency domain via 2D discrete Fourier transform (DFT):

	$\displaystyle\bm{X}(u,v)=\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}\bm{x}\left(h,w\right)e^{-j2\pi\left(\frac{uh}{H}+\frac{vw}{W}\right)}$		(1)

where $\bm{x}\in\mathbb{R}^{1\times H\times W}$ is a single-channel image, $H$ and $W$ represent the height and width of $\bm{x}$, and $u$ and $v$ are coordinates in the frequency domain. In practice, DFT is implemented using the fast Fourier transform (FFT) algorithm [63] with $\mathcal{O}(N\log N)$ complexity.
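As a small numerical check of Eq. (1), the sketch below evaluates the double sum directly and compares it with NumPy's FFT. The function name `dft2_naive` and the random $4\times 6$ test image are illustrative choices, not part of the original method.

```python
import numpy as np

def dft2_naive(x):
    """Evaluate Eq. (1) directly for a single-channel image x of shape (H, W)."""
    H, W = x.shape
    h = np.arange(H).reshape(H, 1)   # row indices of the image
    w = np.arange(W).reshape(1, W)   # column indices of the image
    X = np.zeros((H, W), dtype=complex)
    for u in range(H):
        for v in range(W):
            # every pixel contributes to each frequency coefficient X(u, v)
            kernel = np.exp(-2j * np.pi * (u * h / H + v * w / W))
            X[u, v] = np.sum(x * kernel)
    return X

rng = np.random.default_rng(0)
x = rng.random((4, 6))
X_naive = dft2_naive(x)
X_fft = np.fft.fft2(x)   # FFT-based evaluation of the same transform
max_abs_diff = float(np.max(np.abs(X_naive - X_fft)))
```

The direct evaluation costs one full pass over the image per coefficient, which is why the FFT is used in practice; the two results should agree up to floating-point error.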

The amplitude $\bm{\mathcal{A}}$ and phase $\bm{\mathcal{P}}$ components are two significant components of $\bm{X}$ . For a given $\bm{X}$ , the above two components can be derived as follows:

	$\displaystyle\bm{\mathcal{A}}(u,v)$	$\displaystyle=\sqrt{\left\{{\rm Re}\left[\bm{X}(u,v)\right]\right\}^{2}+\left\{{\rm Im}\left[\bm{X}(u,v)\right]\right\}^{2}}$		(2)
	$\displaystyle\bm{\mathcal{P}}(u,v)$	$\displaystyle=\arctan\left\{\frac{{\rm Im}\left[\bm{X}(u,v)\right]}{{\rm Re}\left[\bm{X}(u,v)\right]}\right\}$		(2)

Similarly, for the given $\bm{\mathcal{A}}$ and $\bm{\mathcal{P}}$ , real and imaginary parts of $\bm{X}$ can be represented as:

	$\displaystyle{\rm Re}\left[\bm{X}(u,v)\right]$	$\displaystyle=\bm{\mathcal{A}}(u,v)\cos\left\{\bm{\mathcal{P}}(u,v)\right\}$		(3)
	$\displaystyle{\rm Im}\left[\bm{X}(u,v)\right]$	$\displaystyle=\bm{\mathcal{A}}(u,v)\sin\left\{\bm{\mathcal{P}}(u,v)\right\}$		(3)

Eq. (1) demonstrates that each element of $\bm{X}(u,v)$ contains information from every pixel of the image $\bm{x}$. Consequently, $\bm{\mathcal{A}}$ and $\bm{\mathcal{P}}$ have image-size receptive fields. These components capture unique characteristics of images. The amplitude component $\bm{\mathcal{A}}$ excels in capturing textural details of vessels, which is crucial for accurately segmenting small or slender vessels. Conversely, the phase component $\bm{\mathcal{P}}$ conveys spatial relationships among different parts of images, effectively preserving structural features of vessels.
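The decomposition and its inverse can be checked numerically. The following sketch, which assumes NumPy and a random stand-in image, splits a spectrum into amplitude and phase, rebuilds the image from them, and then perturbs a single pixel to show that every frequency coefficient changes. Note that `np.angle` computes the quadrant-aware `arctan2(Im, Re)`, which is the usual numerical counterpart of the arctangent in Eq. (2).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((8, 8))                  # stand-in for a single-channel image

X = np.fft.fft2(x)
amp = np.abs(X)                         # amplitude component, Eq. (2)
pha = np.angle(X)                       # phase component, Eq. (2), via arctan2(Im, Re)

real = amp * np.cos(pha)                # real part, Eq. (3)
imag = amp * np.sin(pha)                # imaginary part, Eq. (3)
x_rec = np.fft.ifft2(real + 1j * imag).real
round_trip_error = float(np.max(np.abs(x_rec - x)))

# Perturb one pixel: the spectrum of a unit impulse has magnitude 1 everywhere,
# so every coefficient X(u, v) moves by exactly 1 in magnitude.
x_shifted = x.copy()
x_shifted[3, 5] += 1.0
coeff_change = np.abs(np.fft.fft2(x_shifted) - X)
```

That a single-pixel change reaches all coefficients is the numerical face of the image-size receptive field discussed above.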

III-B Overall architecture

Our SPIRONet is detailed in Fig. 2, adopting the U-shaped [24] encoder-decoder architecture. It utilizes two parallel encoders to learn local spatial features and global frequency features, respectively. A specially crafted cross-attention fusion module is employed to mutually fuse spatial and frequency features. The fused features are then transmitted to the CNN decoder through skip connections. At the end of the decoder, a topological channel interaction module is deployed to refine channel features and filter out task-irrelevant responses. Finally, vessel predictions are generated through a segmentation head ($1\times 1$ convolution layer).

III-C Spatial-frequency representation learning

Algorithm 1 Pseudo-code of the frequency encoder block in a PyTorch-like style.

import torch

# x: input feature maps of shape (B, C, H, W)
# conv_amp, conv_pha: amplitude/phase convolutional layers
# conv_channel: channel adjustment layer

# fast Fourier transform (FFT), Eq. (1)
X = torch.fft.rfft2(x)
# decompose X into amplitude and phase components, Eq. (2)
amp = torch.abs(X)
pha = torch.angle(X)
# parallel paths with residual connections
amp_fuse = conv_amp(amp) + amp
pha_fuse = conv_pha(pha) + pha
# recover real and imaginary parts, Eq. (3)
real = amp_fuse * torch.cos(pha_fuse)
imag = amp_fuse * torch.sin(pha_fuse)
# inverse fast Fourier transform (iFFT) back to the input's spatial size
X_ = torch.complex(real, imag)
x_ = torch.fft.irfft2(X_, s=x.shape[-2:])
# final outputs
out = x_ + x
out = conv_channel(out)

Fig. 3 shows architectures of the proposed encoder blocks. Inputs of the $i$ -th spatial encoder block ${\rm Enc}_{\rm Spa}^{i}$ and frequency encoder block ${\rm Enc}_{\rm Freq}^{i}$ are $\bm{f}_{\rm Spa}^{i}$ and $\bm{f}_{\rm Freq}^{i}$ , respectively. Following the previous methods [34], [37], CNNs are utilized to extract local vessel features. The spatial encoder block is similar to the residual module in ResNet [33]. Inspired by [64], two parallel paths with convolution layers are adopted to process amplitude $\bm{F}_{\rm Amp}^{i}$ and phase $\bm{F}_{\rm Pha}^{i}$ components, respectively. The frequency encoder block can be easily implemented in deep learning frameworks (e.g., PyTorch [65]), as presented in Algorithm 1. Outputs of the $i$ -th encoder blocks can be formulated as follows:

		$\displaystyle\hat{\bm{f}}_{\rm Spa}^{i}={\rm Enc}_{\rm Spa}^{i}\left(\bm{f}_{\rm Spa}^{i}\right)$		(4)
		$\displaystyle\hat{\bm{f}}_{\rm Freq}^{i}={\rm Enc}_{\rm Freq}^{i}\left(\bm{f}_{\rm Freq}^{i}\right)$		(4)

Then the outputs are downsampled via $2\times 2$ maxpooling layers before being sent into the $\left(i+1\right)$ -th encoder.
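Algorithm 1 relies on the real-input FFT, which stores only half of the spectrum along the last axis. The sketch below uses NumPy's equivalents of the calls in Algorithm 1 to illustrate the shapes involved, with identity mappings standing in for the learned convolutions (an assumption made so that the round trip can be checked). The odd feature-map size is chosen on purpose: without the explicit output size `s`, the inverse transform cannot know that the original width was odd.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((2, 7, 9))               # (channels, H, W) with odd H and W

X = np.fft.rfft2(x)                     # transforms the last two axes
half_spectrum_shape = X.shape           # last axis holds W // 2 + 1 frequencies

# With identity "convolutions", the amplitude and phase paths leave X unchanged,
# so the inverse transform should return the input.
amp, pha = np.abs(X), np.angle(X)
X_rec = amp * np.cos(pha) + 1j * amp * np.sin(pha)
x_rec = np.fft.irfft2(X_rec, s=x.shape[-2:])   # s restores the odd width

reconstruction_error = float(np.max(np.abs(x_rec - x)))
```

Because the half spectrum has a different spatial shape from the input, the convolutions on the amplitude and phase paths operate on feature maps of width `W // 2 + 1` rather than `W`.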

III-D Cross-attention fusion

As discussed in previous sections, the spatial encoder concentrates on learning local vessel features, while the frequency encoder captures long-range vessel dependencies, benefiting from certain properties of Fourier transform. Recent research indicates that local and global features are complementary, providing mutual guidance for learning more robust features [35], [37]. To effectively integrate local spatial features and global frequency features, a cross-attention fusion module is proposed, as depicted in Fig. 4.

Inputs of the $i$-th cross-attention fusion module are features ($\hat{\bm{f}}_{\rm Spa}^{i},\hat{\bm{f}}_{\rm Freq}^{i}\in\mathbb{R}^{C_{i}\times H_{i}\times W_{i}}$) extracted by the $i$-th spatial and frequency encoder blocks, where $C_{i}$, $H_{i}$, and $W_{i}$ represent the channel number, height, and width of feature maps. First, these two features are projected into the embedding space through different “Conv-BN-ReLU” layers $\bm{W}^{i}(\cdot)$ to generate the spatial query $\bm{Q}^{i}_{\rm Spa}$ and key $\bm{K}^{i}_{\rm Spa}$, as well as the frequency query $\bm{Q}^{i}_{\rm Freq}$ and key $\bm{K}^{i}_{\rm Freq}$. The mixed value is the projection of the concatenated feature ${\rm Concat}\left(\hat{\bm{f}}_{\rm Spa}^{i},\hat{\bm{f}}_{\rm Freq}^{i}\right)\in\mathbb{R}^{2C_{i}\times H_{i}\times W_{i}}$. To alleviate the high complexity brought by matrix multiplications, pyramid pooling modules (PPM) [66] are adopted to sample specific feature maps. The above process can be formulated by:

	$\displaystyle\bm{Q}^{i}_{\rm Spa}=\bm{W}^{i}_{\rm SQ}\left(\hat{\bm{f}}_{\rm Spa}^{i}\right),\quad\bm{Q}^{i}_{\rm Freq}=\bm{W}^{i}_{\rm FQ}\left(\hat{\bm{f}}_{\rm Freq}^{i}\right)$		(5)
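	$\displaystyle\bm{K}^{i}_{\rm Spa}={\rm PPM}\left\{\bm{W}^{i}_{\rm SK}\left(\hat{\bm{f}}_{\rm Spa}^{i}\right)\right\},\quad\bm{K}^{i}_{\rm Freq}={\rm PPM}\left\{\bm{W}^{i}_{\rm FK}\left(\hat{\bm{f}}_{\rm Freq}^{i}\right)\right\}$		(6)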
	$\displaystyle\bm{V}^{i}={\rm PPM}\left\{\bm{W}^{i}_{\rm V}\left[{\rm Concat}\left(\hat{\bm{f}}_{\rm Spa}^{i},\hat{\bm{f}}_{\rm Freq}^{i}\right)\right]\right\}$		(7)

where $\bm{Q}^{i}_{\rm Spa},\bm{Q}^{i}_{\rm Freq}\in\mathbb{R}^{\left(H_{i}\cdot W_{i}\right)\times d_{0}}$, $\bm{K}^{i}_{\rm Spa},\bm{K}^{i}_{\rm Freq}\in\mathbb{R}^{N\times d_{0}}$, and $\bm{V}^{i}\in\mathbb{R}^{N\times d_{0}}$. $d_{0}$ is the dimension of the embedding space and $N$ is the spatial resolution sampled by PPM, with $N\ll H_{i}\cdot W_{i}$. Thus, the cross-attention (CA) can be calculated as follows:

(8)
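To make the shapes concrete, the following NumPy sketch assumes standard scaled dot-product attention in which the spatial query attends to the frequency key, the frequency query attends to the spatial key, and the two results are summed. This pairing, the $\sqrt{d_{0}}$ scaling, the summation, the tensor sizes, and the function name `cross_attention` are all assumptions for illustration and may differ from the authors' formulation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the row maximum for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_spa, q_freq, k_spa, k_freq, v):
    """Hypothetical cross-attention: each branch's query attends to the other branch's key."""
    d0 = q_spa.shape[-1]
    attn_sf = softmax(q_spa @ k_freq.T / np.sqrt(d0))   # (H*W, N)
    attn_fs = softmax(q_freq @ k_spa.T / np.sqrt(d0))   # (H*W, N)
    return attn_sf @ v + attn_fs @ v, attn_sf           # output has shape (H*W, d0)

rng = np.random.default_rng(0)
HW, N, d0 = 32 * 32, 110, 16        # N << H*W after pooling; sizes are arbitrary
q_spa, q_freq = rng.standard_normal((2, HW, d0))
k_spa, k_freq, v = rng.standard_normal((3, N, d0))
out, attn_sf = cross_attention(q_spa, q_freq, k_spa, k_freq, v)
```

Whatever the exact form of the attention, the benefit of pooling the keys and values is visible in the shapes: each attention map is $(H_{i}\cdot W_{i})\times N$ instead of $(H_{i}\cdot W_{i})\times(H_{i}\cdot W_{i})$.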

III-E Topological channel interaction

Inspired by [58], we hypothesize that channel relationships are graph-structured and introduce a topological channel interaction (TCI) module to enhance vessel-specific responses, as illustrated in Fig. 5. Graph neural networks (GNNs) are widely utilized to learn features of graph-structured data. Given a graph $\mathcal{G}=\left(V,E\right)$ and its adjacency matrix $\bm{A}$, the output of a GNN can be formulated as follows [38]:

	$\displaystyle\bm{H}_{\rm out}=\sigma\left(\tilde{\bm{L}}\bm{H}_{\rm in}\bm{\Theta}\right)$		(9)
	$\displaystyle\tilde{\bm{L}}=\tilde{\bm{D}}^{-\frac{1}{2}}\tilde{\bm{A}}\tilde{\bm{D}}^{-\frac{1}{2}}$		(10)

where $\sigma\left(\cdot\right)$ is a non-linear activation function and $\bm{\Theta}$ denotes trainable parameters. $\tilde{\bm{L}}$ is the Laplacian matrix, $\tilde{\bm{A}}=\bm{A}+\bm{I}$, and $\tilde{\bm{D}}_{ii}=\sum_{j}\tilde{\bm{A}}_{ij}$. The input of the topological channel interaction module is $\bm{H}_{\rm in}=\bm{f}\in\mathbb{R}^{C\times(H\cdot W)/4^{2}}$, obtained by downsampling the decoder output $\bm{f}_{\rm in}\in\mathbb{R}^{C\times H\times W}$ with a $4\times 4$ maxpooling layer. To learn data-dependent topological graph structures, we follow [67] and use an improved Laplacian matrix:

	$\displaystyle\tilde{\bm{L}}=\bm{I}-\tilde{\bm{D}}^{-\frac{1}{2}}\tilde{\bm{A}}\tilde{\bm{D}}^{-\frac{1}{2}}$		(11)

where $\tilde{\bm{A}},\tilde{\bm{D}}\in\mathbb{R}^{C\times C}$ . $C$ denotes the number of channels of input feature maps $\bm{f}$ .

To formulate topological relationships among different channels, we calculate channels’ cosine similarities by:

	$\displaystyle\tilde{a}_{ij}=\frac{\bm{W}\left(\bm{f}\right)_{i}^{T}\bm{W}\left(\bm{f}\right)_{j}}{||\bm{W}\left(\bm{f}\right)_{i}||_{2}||\bm{W}\left(\bm{f}\right)_{j}||_{2}}$		(12)

where $\bm{W}(\cdot)$ is an embedding layer consisting of “Conv-ReLU” and $\tilde{a}_{ij}\in\tilde{\bm{A}}$. Thus Eq. (9) can be written as follows:

	$\displaystyle\bm{H}_{\rm out}=\sigma\left\{\tilde{\bm{L}}{\rm Down}_{/4}\left(\bm{f}_{\rm in}\right)\bm{\Theta}\right\}$		(13)

where $\sigma$ is the ReLU activation function and ${\rm Down}_{/4}$ indicates the downsample operator. Overall, outputs of our topological channel interaction module can be formulated by:

	$\displaystyle\bm{f}_{\rm out}={\rm Up}_{\times 4}\left\{\sigma\left[\tilde{\bm{L}}{\rm Down}_{/4}\left(\bm{f}_{\rm in}\right)\bm{\Theta}\right]\right\}+\bm{f}_{\rm in}$		(14)

where ${\rm Up}_{\times 4}$ indicates the upsample operator. Specifically, the downsample and upsample operators are implemented via maxpooling and transposed convolution, respectively.
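The data flow of Eqs. (11)-(14) can be sketched in a few lines of NumPy. This is a minimal illustration under stated simplifications rather than the authors' implementation: a plain ReLU stands in for the “Conv-ReLU” embedding layer, nearest-neighbour repetition replaces the learned transposed convolution, $\bm{\Theta}$ is a random matrix, and the name `tci_sketch` is hypothetical.

```python
import numpy as np

def tci_sketch(f_in, theta, pool=4):
    """Illustrative version of Eqs. (11)-(14) for one feature map f_in of shape (C, H, W)."""
    C, H, W = f_in.shape
    # Down_{/4}: 4x4 max pooling, then flatten each channel into a node feature vector
    pooled = f_in.reshape(C, H // pool, pool, W // pool, pool).max(axis=(2, 4))
    h_in = pooled.reshape(C, -1)                          # (C, H*W / 16)

    # Eq. (12): cosine similarity between channel embeddings
    # (assumption: a plain ReLU stands in for the "Conv-ReLU" embedding layer)
    emb = np.maximum(h_in, 0.0)
    unit = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12)
    A = unit @ unit.T                                     # (C, C)

    # Eq. (11): L = I - D^{-1/2} A D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = np.eye(C) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

    # Eq. (13): graph convolution over channels with ReLU activation
    h_out = np.maximum(L @ h_in @ theta, 0.0)             # (C, H*W / 16)

    # Eq. (14): upsample and add the residual
    # (assumption: nearest-neighbour repetition replaces the transposed convolution)
    up = h_out.reshape(C, H // pool, W // pool)
    up = up.repeat(pool, axis=1).repeat(pool, axis=2)
    return up + f_in, L

rng = np.random.default_rng(0)
C, H, W = 8, 16, 16
f_in = rng.random((C, H, W))                              # non-negative stand-in features
theta = rng.standard_normal((H * W // 16, H * W // 16)) * 0.1
f_out, L = tci_sketch(f_in, theta)
```

Each channel is a graph node here, so $\tilde{\bm{L}}$ is only $C\times C$; the pooling step shortens the node feature vectors and keeps the matrix products cheap at the decoder's full resolution.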

III-F Loss function

Consistent with previous works, binary cross entropy (BCE) loss is adopted as the loss function. It is defined as follows:

	$\displaystyle\mathcal{L}_{\rm BCE}=-\frac{1}{N}\sum_{i=1}^{N}\left\{y_{\rm gt}^{(i)}\log{y_{\rm pred}^{(i)}}+(1-y_{\rm gt}^{(i)})\log{(1-y_{\rm pred}^{(i)})}\right\}$		(15)

where $y_{\rm gt}^{(i)}$ and $y_{\rm pred}^{(i)}$ refer to the ground truth and model prediction of the $i$-th image, respectively. $N$ is the size of a batch.
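A scalar version of Eq. (15) is sketched below in plain Python. Treating each label and prediction as a single number, rather than a full image, is a simplification; the clamping constant `eps` and the function name `bce_loss` are assumptions added to keep the logarithm finite.

```python
import math

def bce_loss(y_gt, y_pred, eps=1e-7):
    """Mean binary cross entropy of Eq. (15) over paired labels and probabilities."""
    total = 0.0
    for gt, p in zip(y_gt, y_pred):
        p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0); eps is an assumption
        total += gt * math.log(p) + (1.0 - gt) * math.log(1.0 - p)
    return -total / len(y_gt)

labels = [1, 0, 1, 0]
loss_confident = bce_loss(labels, [0.9, 0.1, 0.8, 0.2])    # mostly correct predictions
loss_uninformed = bce_loss(labels, [0.5, 0.5, 0.5, 0.5])   # equals log(2)
```

An uninformed prediction of 0.5 everywhere gives a loss of $\log 2$, which serves as a useful reference point when reading training curves.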

IV Experimental setup

IV-A Datasets

In our experiments, four challenging benchmark datasets are utilized, comprising two in-house datasets and two publicly available datasets.

Chronic artery digital subtraction angiography dataset (CADSA). The CADSA dataset is derived from Beijing Tiantan Hospital, Capital Medical University.² It contains 275 images from 26 patient DSA sequences, with an original resolution of $750\times 750$. Chronic arteries are annotated by experienced physicians using ITK-SNAP [68]. 197 images from 15 patients’ DSA sequences are randomly selected for training, and the remaining 78 images from 5 patients’ DSA sequences are used for testing.

² https://www.bjtth.org/Html/News/Main/1001125.html

Coronary arteries X-ray fluoroscopy dataset (CAXF) [21]. This dataset is proposed in our previous work. It comprises 538 images from 36 X-ray fluoroscopy sequences, with each image at a resolution of $512\times 512$ . Specifically, 412 images are from 7 sequences, and 126 images are randomly selected from 29 sequences. Following the previous settings [21], 337 images from 24 sequences are utilized for training, and the remaining 201 images from 12 sequences are selected for testing.

DCA1 [69]. This dataset is provided by the Mexican Social Security Institute, UMAE T1-León. It includes 134 $300\times 300$ X-ray coronary angiograms along with corresponding ground truths annotated by an expert cardiologist. Following [29], this dataset is split into 100 training images and 34 testing images.

XCAD [1]. The XCAD dataset contains 1747 coronary angiograms obtained by a General Electric Innova IGS 520 system. Each image has a resolution of $512\times 512$ . It is important to note that only 126 images in this dataset have vessel annotations. We randomly select 100 images as the training set, and the remaining 26 images as the testing set.

IV-B Evaluation metrics

To thoroughly evaluate the proposed method and baselines, four evaluation metrics are selected, including sensitivity (Sen.), ${\rm F_{1}}$ -Score ( ${\rm F_{1}}$ ), Intersection over Union (IoU), and Matthews correlation coefficient (MCC).

	$\displaystyle\text{Sen.}=\frac{\text{TP}}{\text{TP}+\text{FN}}$		(16)
	$\displaystyle{\rm F_{1}}=\frac{2\times\text{TP}}{2\times\text{TP}+\text{FP}+\text{FN}}$		(17)
	$\displaystyle\text{IoU}=\frac{\text{TP}}{\text{TP}+\text{FP}+\text{FN}}$		(18)
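	$\displaystyle\text{MCC}=\frac{\text{TP}\times\text{TN}-\text{FP}\times\text{FN}}{\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}}$		(19)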

where TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative pixels in segmentation results, respectively.
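The four metrics follow directly from the pixel-level confusion counts. The sketch below uses the standard definition of MCC; the function name `segmentation_metrics` and the example counts are illustrative assumptions.

```python
import math

def segmentation_metrics(tp, tn, fp, fn):
    """Sensitivity, F1, IoU, and MCC from pixel-level confusion counts."""
    sen = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom
    return sen, f1, iou, mcc

# Hypothetical counts for a 1000-pixel image with a thin vessel
sen, f1, iou, mcc = segmentation_metrics(tp=80, tn=900, fp=10, fn=10)
```

F1 and IoU are monotonically related through $\text{IoU}={\rm F_{1}}/(2-{\rm F_{1}})$, so they always rank methods identically on a single image. MCC is the only one of the four that also uses TN, which makes it informative when background pixels dominate, as they do in vessel images.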

IV-C Implementation details

TABLE I: Ablation experimental results on the XCAD dataset. The best results are in bold. The second best results are underlined. All results are averaged over 3 random seeds.

Variants	SE	FE	CA	TCI	Sen. (%) $\uparrow$	${\rm F_{1}}$ (%) $\uparrow$	IoU (%) $\uparrow$	MCC (%) $\uparrow$
I	✓				$80.84$	$80.06$	$67.11$	$79.11$
II		✓			$76.00$	$77.14$	$63.11$	$76.19$
III	✓	✓			$82.10$	$81.09$	$68.39$	$80.15$
IV	✓	✓	✓		$82.00$	$81.22$	$68.57$	$80.23$
V	✓			✓	$82.23$	$80.73$	$67.98$	$79.77$
VI		✓		✓	$78.15$	$77.63$	$63.78$	$76.64$
VII	✓	✓		✓	$\bm{83.07}$	$\underline{81.36}$	$\underline{68.76}$	$\underline{80.40}$
SPIRONet	✓	✓	✓	✓	$\underline{82.91}$	$\bm{81.76}$	$\bm{69.37}$	$\bm{80.82}$

SE: Spatial encoder; FE: Frequency encoder; CA: Cross-attention; TCI: Topological channel interaction.

TABLE II: Comparison with state-of-the-art methods on the CADSA and CAXF datasets. The best results are highlighted in blue and the second best results are highlighted in red. “Mean $\pm$ std” values are reported over 3 random seeds.

Model	CADSA				CAXF
Model	Sen. (%) $\uparrow$	${\rm F_{1}}$ (%) $\uparrow$	IoU (%) $\uparrow$	MCC (%) $\uparrow$	Sen. (%) $\uparrow$	${\rm F_{1}}$ (%) $\uparrow$	IoU (%) $\uparrow$	MCC (%) $\uparrow$
UNet [24] [MICCAI’15]	$72.11$ $\pm 4.05$	$76.58$ $\pm 2.90$	$64.60$ $\pm 3.20$	$77.58$ $\pm 2.59$	$89.22$ $\pm 0.42$	$89.90$ $\pm 0.18$	$81.80$ $\pm 0.29$	$89.50$ $\pm 0.18$
UNet++ [25] [TMI’19]	$74.59$ $\pm 4.46$	$78.04$ $\pm 2.26$	$66.20$ $\pm 2.49$	$78.88$ $\pm 1.86$	$89.53$ $\pm 0.28$	$90.01$ $\pm 0.18$	$81.96$ $\pm 0.28$	$89.59$ $\pm 0.18$
AttnUNet [26] [MedIA’19]	$73.27$ $\pm 5.70$	$76.58$ $\pm 2.81$	$64.34$ $\pm 3.14$	$77.79$ $\pm 2.20$	$89.13$ $\pm 0.15$	$89.76$ $\pm 0.09$	$81.58$ $\pm 0.15$	$89.34$ $\pm 0.09$
CE-Net [45] [TMI’19]	$75.61$ $\pm 2.69$	$77.80$ $\pm 0.93$	$65.20$ $\pm 1.20$	$78.69$ $\pm 0.84$	$89.74$ $\pm 0.37$	$89.93$ $\pm 0.13$	$81.82$ $\pm 0.22$	$89.52$ $\pm 0.14$
CAU-net [21] [ICONIP’20]	$76.09$ $\pm 0.17$	$79.05$ $\pm 1.43$	$67.40$ $\pm 1.57$	$79.88$ $\pm 1.29$	$89.31$ $\pm 0.27$	$89.85$ $\pm 0.16$	$81.70$ $\pm 0.23$	$89.43$ $\pm 0.15$
TransUNet [34] [ICMLW’21]	$74.59$ $\pm 1.69$	$75.80$ $\pm 2.80$	$63.60$ $\pm 2.91$	$76.86$ $\pm 2.48$	$90.16$ $\pm 0.18$	$89.89$ $\pm 0.09$	$81.75$ $\pm 0.14$	$89.47$ $\pm 0.09$
CS²-Net [28] [MedIA’21]	$66.12$ $\pm 5.08$	$73.10$ $\pm 3.80$	$60.29$ $\pm 4.02$	$74.95$ $\pm 3.08$	$89.84$ $\pm 0.10$	$89.84$ $\pm 0.03$	$81.67$ $\pm 0.07$	$89.41$ $\pm 0.04$
FR-UNet [29] [JBHI’21]	$59.94$ $\pm 0.77$	$62.70$ $\pm 0.42$	$50.72$ $\pm 0.31$	$63.55$ $\pm 0.36$	$90.10$ $\pm 0.78$	$89.17$ $\pm 0.14$	$80.56$ $\pm 0.21$	$88.71$ $\pm 0.14$
DE-DCGCN-EE [58] [TMI’22]	$66.42$ $\pm 1.90$	$70.57$ $\pm 0.55$	$57.14$ $\pm 0.74$	$72.16$ $\pm 0.54$	$89.29$ $\pm 0.28$	$88.91$ $\pm 0.07$	$80.16$ $\pm 0.10$	$88.45$ $\pm 0.08$
GT-DLA-dsHFF [30] [TCyber’23]	$70.14$ $\pm 7.32$	$71.32$ $\pm 3.09$	$57.83$ $\pm 2.93$	$72.67$ $\pm 2.57$	$89.94$ $\pm 0.07$	$89.94$ $\pm 0.13$	$81.82$ $\pm 0.20$	$89.52$ $\pm 0.13$
SPIRONet [Ours]	$81.20$ $\pm 2.72$	$80.10$ $\pm 0.24$	$68.27$ $\pm 0.48$	$80.61$ $\pm 0.19$	$90.80$ $\pm 0.87$	$90.32$ $\pm 0.27$	$82.48$ $\pm 0.45$	$89.94$ $\pm 0.28$

TABLE III: Comparison with state-of-the-art methods on the DCA1 and XCAD datasets. The best results are highlighted in blue and the second best results are highlighted in red. “Mean $\pm$ std” values are reported over 3 random seeds.

Model	DCA1				XCAD
Model	Sen. (%) $\uparrow$	${\rm F_{1}}$ (%) $\uparrow$	IoU (%) $\uparrow$	MCC (%) $\uparrow$	Sen. (%) $\uparrow$	${\rm F_{1}}$ (%) $\uparrow$	IoU (%) $\uparrow$	MCC (%) $\uparrow$
UNet [24] [MICCAI’15]	$80.04$ $\pm 0.36$	$78.90$ $\pm 0.21$	$65.31$ $\pm 0.30$	$77.95$ $\pm 0.23$	$80.64$ $\pm 1.31$	$80.74$ $\pm 0.08$	$67.98$ $\pm 0.11$	$79.81$ $\pm 0.12$
UNet++ [25] [TMI’19]	$80.10$ $\pm 0.23$	$78.45$ $\pm 0.04$	$64.69$ $\pm 0.06$	$77.49$ $\pm 0.04$	$81.13$ $\pm 0.62$	$80.43$ $\pm 0.25$	$67.56$ $\pm 0.34$	$79.45$ $\pm 0.26$
AttnUNet [26] [MedIA’19]	$79.20$ $\pm 0.30$	$78.04$ $\pm 0.01$	$64.18$ $\pm 0.01$	$77.07$ $\pm 0.01$	$80.08$ $\pm 0.35$	$79.98$ $\pm 0.24$	$66.94$ $\pm 0.36$	$79.01$ $\pm 0.26$
CE-Net [45] [TMI’19]	$79.49$ $\pm 0.73$	$77.84$ $\pm 0.12$	$63.87$ $\pm 0.16$	$76.84$ $\pm 0.13$	$79.96$ $\pm 0.45$	$79.95$ $\pm 0.13$	$66.80$ $\pm 0.17$	$78.95$ $\pm 0.14$
CAU-net [21] [ICONIP’20]	$79.29$ $\pm 0.18$	$77.82$ $\pm 0.12$	$63.88$ $\pm 0.15$	$76.86$ $\pm 0.12$	$80.22$ $\pm 0.75$	$79.31$ $\pm 0.31$	$66.06$ $\pm 0.44$	$78.31$ $\pm 0.32$
TransUNet [34] [ICMLW’21]	$81.09$ $\pm 0.46$	$78.82$ $\pm 0.12$	$65.19$ $\pm 0.18$	$77.82$ $\pm 0.12$	$80.86$ $\pm 0.99$	$80.24$ $\pm 0.49$	$67.28$ $\pm 0.62$	$79.26$ $\pm 0.49$
CS²-Net [28] [MedIA’21]	$78.46$ $\pm 0.52$	$77.87$ $\pm 0.25$	$63.94$ $\pm 0.32$	$76.92$ $\pm 0.23$	$79.47$ $\pm 0.72$	$79.23$ $\pm 0.30$	$65.98$ $\pm 0.39$	$78.30$ $\pm 0.32$
FR-UNet [29] [JBHI’21]	$78.96$ $\pm 1.95$	$79.59$ $\pm 0.30$	$66.22$ $\pm 0.42$	$79.47$ $\pm 0.28$	$81.65$ $\pm 1.70$	$79.79$ $\pm 0.38$	$66.66$ $\pm 0.49$	$78.87$ $\pm 0.34$
DE-DCGCN-EE [58] [TMI’22]	$78.48$ $\pm 0.19$	$77.82$ $\pm 0.09$	$63.87$ $\pm 0.11$	$76.83$ $\pm 0.08$	$79.89$ $\pm 0.43$	$79.12$ $\pm 0.12$	$65.76$ $\pm 0.17$	$78.06$ $\pm 0.12$
GT-DLA-dsHFF [30] [TCyber’23]	$75.62$ $\pm 0.74$	$77.17$ $\pm 0.24$	$62.97$ $\pm 0.32$	$76.12$ $\pm 0.24$	$80.56$ $\pm 0.52$	$80.44$ $\pm 0.35$	$67.53$ $\pm 0.44$	$79.45$ $\pm 0.36$
SPIRONet [Ours]	$80.76$ $\pm 0.94$	$79.75$ $\pm 0.48$	$66.45$ $\pm 0.66$	$78.75$ $\pm 0.48$	$82.91$ $\pm 0.63$	$81.76$ $\pm 0.51$	$69.73$ $\pm 0.69$	$80.82$ $\pm 0.53$

All experiments are performed based on PyTorch 1.12.0 [65], Python 3.8, and Ubuntu 18.04. Our models and baseline methods are trained on a single Nvidia GeForce RTX 3090 GPU with 24GB of memory. For data augmentation, we use random flipping and random rotation within the range of $\left[-20^{\circ},20^{\circ}\right]$. SGD is adopted as our optimizer with a momentum of 0.9 and a weight decay of 0.0001. A polynomial annealing policy is utilized to adjust the learning rate, which can be written as $lr\leftarrow lr_{\rm init}*\left(1-\frac{\rm epoch}{\rm total\_epochs}\right)^{0.9}$. The batch size is set to 4 for all datasets. The image size for all datasets is $512\times 512$, except for DCA1, which is $300\times 300$. Furthermore, for CADSA, we train our model with an initial learning rate of 0.03 for 200 epochs. For CAXF, the initial learning rate is set to 0.08 and the total number of epochs is 800. For DCA1, the model is trained for 600 epochs with an initial learning rate of 0.045. For XCAD, we adopt an initial learning rate of 0.06 and train for 700 epochs. We train models over 3 random seeds and report ‘‘mean $\pm$ std’’ of all metrics.
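The polynomial annealing rule and the per-dataset settings listed above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the authors' training code: the names `poly_lr` and `SCHEDULES` are hypothetical, and the epoch index is assumed to start at 0 and to be updated once per epoch.

```python
# Per-dataset (initial learning rate, total epochs), as stated in the text.
# The dictionary name and layout are illustrative assumptions.
SCHEDULES = {
    "CADSA": (0.03, 200),
    "CAXF": (0.08, 800),
    "DCA1": (0.045, 600),
    "XCAD": (0.06, 700),
}


def poly_lr(lr_init: float, epoch: int, total_epochs: int, power: float = 0.9) -> float:
    """Polynomial annealing: lr_init * (1 - epoch / total_epochs) ** power."""
    if not 0 <= epoch <= total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs]")
    return lr_init * (1.0 - epoch / total_epochs) ** power


if __name__ == "__main__":
    for name, (lr_init, total) in SCHEDULES.items():
        halfway = poly_lr(lr_init, total // 2, total)
        print(f"{name}: start={lr_init}, halfway={halfway:.4f}")
```

With the exponent 0.9, the learning rate decays almost linearly and reaches zero only at the final epoch.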

V Results

Extensive experiments are conducted to answer the following questions:

• Q1: Do the proposed modules facilitate vessel segmentation?
• Q2: Does the proposed SPIRONet yield better vessel segmentation performances than state-of-the-arts?
• Q3: Does our model achieve a desirable trade-off between computational efficiency and segmentation performances?

V-A Ablation studies (Q1)

In this section, we conduct experiments on the XCAD dataset to verify the effectiveness of key components in SPIRONet. The quantitative results are reported in Table I. It should be noted that for Variant III and Variant VII, we simply perform element-wise addition to fuse spatial and frequency features. Our main observations are as follows: i) For learning vessel features, both local spatial features and global frequency features are useful. Models that incorporate both spatial and frequency features demonstrate significantly enhanced performance compared to those utilizing only a single type of feature. ii) Spatial features seem to matter more than frequency features in vessel segmentation tasks. One possible explanation is that, similar to vision transformers [31], the frequency encoder requires a large amount of data to explore long-range dependencies in images. The datasets used in this paper are small, which prevents the model from learning robust representations for segmentation. iii) The cross-attention fusion module can effectively fuse spatial and frequency features, further improving segmentation performances (i.e., $81.36\%\rightarrow 81.76\%$ in ${\rm F_{1}}$, $68.76\%\rightarrow 69.37\%$ in IoU, and $80.40\%\rightarrow 80.82\%$ in MCC) compared with the element-wise addition method (Variant VII). iv) The topological channel interaction module is capable of refining channel features and removing task-irrelevant responses, resulting in performance gains of $0.91\%$ in sensitivity, $0.54\%$ in ${\rm F_{1}}$, $0.80\%$ in IoU, and $0.59\%$ in MCC over Variant IV.

V-B Comparisons with state-of-the-arts (Q2)

We evaluate our SPIRONet against state-of-the-art (SOTA) segmentation models across four datasets. To ensure fairness, we train these models using the source code available in their respective GitHub repositories, applying the same configurations as those used for SPIRONet.

V-B1 In-house datasets

Table II summarizes quantitative results on two in-house datasets, CADSA and CAXF [21]. Notably, our proposed SPIRONet achieves $81.20\%$ sensitivity, $80.10\%$ ${\rm F_{1}}$ , $68.27\%$ IoU, and $80.61\%$ MCC on the CADSA dataset and $90.80\%$ sensitivity, $90.32\%$ ${\rm F_{1}}$ , $82.48\%$ IoU, and $89.94\%$ MCC on the CAXF dataset, surpassing all SOTA models by a considerable margin. On the CADSA dataset, it is worth noting the significant improvement from CAU-net [21] (the second best method) to SPIRONet by $5.11\%$ in sensitivity. Improved sensitivity indicates that models are more capable of extracting thin vessels and vessel boundaries [29], demonstrating SPIRONet has superior abilities to learn discriminative vessel features.
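For readers who want to see how the four reported metrics relate to each other, the sketch below computes them from pixel-level confusion counts. It assumes the standard confusion-matrix definitions of sensitivity, ${\rm F_{1}}$, IoU, and MCC; the function name `vessel_metrics` and the example counts are hypothetical and do not come from the paper's evaluation code or data.

```python
import math


def vessel_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary segmentation metrics from confusion counts (as fractions)."""

    def safe_div(num: float, den: float) -> float:
        # Return 0.0 for an empty denominator instead of raising.
        return num / den if den else 0.0

    sensitivity = safe_div(tp, tp + fn)          # share of vessel pixels recovered
    f1 = safe_div(2 * tp, 2 * tp + fp + fn)      # Dice / F1 score
    iou = safe_div(tp, tp + fp + fn)             # intersection over union
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = safe_div(tp * tn - fp * fn, mcc_den)   # Matthews correlation coefficient
    return {"Sen": sensitivity, "F1": f1, "IoU": iou, "MCC": mcc}


if __name__ == "__main__":
    # Made-up counts, used only to exercise the formulas.
    for name, value in vessel_metrics(tp=80, fp=20, fn=20, tn=880).items():
        print(f"{name}: {100 * value:.2f}%")
```

Because sensitivity ignores false positives, a model can raise it by over-segmenting; reading it together with ${\rm F_{1}}$, IoU, and MCC, as the tables do, guards against that.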

V-B2 Publicly available benchmarks

Table III presents comparison results on two publicly available benchmarks, DCA1 [69] and XCAD [1]. On the DCA1 dataset, our SPIRONet achieves the best performances in ${\rm F_{1}}$ and IoU and has comparable sensitivity and MCC to SOTA models. SPIRONet obtains the best outcomes in all metrics on the XCAD dataset. These results demonstrate that our SPIRONet exhibits superior generalization across different datasets.

TABLE IV: Model parameters, floating-point operations (FLOPs), and inference rates. The input size of models is set to $512\times 512$. ^∗ means the experiments are conducted on the test set of the XCAD dataset and repeated five times.

Model	#params (M)	FLOPs (G)	Rate^∗ (FPS)
UNet [24] [MICCAI’15]	$17.26$	$160.44$	$61.52$ $\pm 0.91$
UNet++ [25] [TMI’19]	$9.16$	$139.46$	$45.03$ $\pm 0.08$
AttnUNet [26] [MedIA’19]	$57.16$	$541.04$	$29.00$ $\pm 0.26$
CE-Net [45] [TMI’19]	$29.00$	$35.60$	$74.11$ $\pm 0.38$
CAU-net [21] [ICONIP’20]	$1.95$	$13.91$	$102.98$ $\pm 5.37$
TransUNet [34] [ICMLW’21]	$93.23$	$129.45$	$24.82$ $\pm 0.13$
CS²-Net [28] [MedIA’21]	$8.40$	$55.85$	$82.02$ $\pm 0.43$
FR-UNet [29] [JBHI’21]	$5.72$	$235.60$	$26.75$ $\pm 0.05$
DE-DCGCN-EE [58] [TMI’22]	$14.11$	$294.46$	$18.07$ $\pm 0.02$
GT-DLA-dsHFF [30] [TCyber’23]	$26.10$	$474.60$	$11.80$ $\pm 0.20$
SPIRONet [Ours]	$16.98$	$200.90$	$20.57$ $\pm 0.09$

V-B3 Visualization results

We further visualize vessel segmentation results of several models to provide qualitative comparisons, including: U-Net [24], UNet++ [25], CE-Net [45], CAU-net [21], CS²-Net [28], DE-DCGCN-EE [58], and our SPIRONet. Here, we can intuitively observe how SPIRONet addresses the challenges presented in Fig. 1: i) Uneven flow of contrast agent and low-power X-rays cause images in rows 1 and 3 to exhibit low SNR. Our model demonstrates a strong capability in locating challenging vessel regions, as highlighted in red boxes; ii) The slender vessel in row 2, highlighted in red boxes, is usually not fully segmented by baselines. By fully exploring local spatial and global frequency features, our model achieves more precise segmentation results; iii) Non-target vessels highlighted in red boxes in row 4 cause interference for segmentation, resulting in some false positive predictions by baselines. Benefiting from the proposed topological channel interaction module, our model successfully distinguishes irrelevant interferences and avoids false positive predictions.

V-C Model complexity (Q3)

Table IV presents comparisons between baseline models and our SPIRONet across various dimensions, including model parameters, floating-point operations (FLOPs), and inference rates. It should be noted that #params and FLOPs are calculated with the thop library (https://pypi.org/project/thop/). SPIRONet has fewer parameters than half of the baselines, and its FLOPs are acceptable. Moreover, the inference rate of our model is approximately 21 FPS, which satisfies clinical real-time requirements (6 $\sim$ 12 FPS) [39]. Although some models, such as CAU-net [21], have faster inference rates, SPIRONet’s inference rate represents a favorable trade-off given its impressive performances across four datasets.
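The ‘‘mean $\pm$ std’’ inference rates in Table IV come from repeated passes over a test set. The sketch below shows one way such a measurement could be organized in plain Python. It is an assumption-laden illustration rather than the authors' benchmark script: `measure_fps` and `meets_realtime` are hypothetical names, the sample standard deviation is an assumed choice, and a real GPU benchmark would additionally need warm-up iterations and device synchronization before reading the clock.

```python
import statistics
import time
from typing import Any, Callable, Sequence, Tuple


def measure_fps(
    infer: Callable[[Any], Any], inputs: Sequence[Any], repeats: int = 5
) -> Tuple[float, float]:
    """Run `infer` over all inputs `repeats` times; return (mean FPS, std FPS)."""
    if repeats < 2:
        raise ValueError("at least two repeats are needed for a standard deviation")
    rates = []
    for _ in range(repeats):
        start = time.perf_counter()
        for x in inputs:
            infer(x)
        elapsed = time.perf_counter() - start
        rates.append(len(inputs) / elapsed)  # frames processed per second
    return statistics.mean(rates), statistics.stdev(rates)


def meets_realtime(fps: float, required_fps: float = 12.0) -> bool:
    """Compare against the upper end of the 6 to 12 FPS clinical range cited in the text."""
    return fps >= required_fps


if __name__ == "__main__":
    # Dummy workload standing in for a segmentation model's forward pass.
    mean_fps, std_fps = measure_fps(lambda x: sum(range(10_000)), list(range(50)))
    print(f"{mean_fps:.2f} +/- {std_fps:.2f} FPS")
    print("SPIRONet (20.57 FPS, Table IV) real-time:", meets_realtime(20.57))
```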

VI Discussion

In this section, we delve deeper into some specific designs in our model. Default configurations of SPIRONet are highlighted in gray. The experiments are conducted on the XCAD dataset.

Since the feature encoder plays a critical role in our model, we further compare our frequency operation design with those in previous works [50], [51]. Quantitative results are presented in Table V. The baseline refers to the model without frequency encoder blocks (Variant V in Table I). It should be noted that in models incorporating frequency features but lacking the cross-attention module CA, we simply perform element-wise addition to fuse spatial and frequency features. Several observations can be drawn from Table V: i) When equipped with the cross-attention module, models incorporating frequency features enjoy consistent performance improvements;

TABLE V: Comparisons with other frequency operators on the XCAD dataset. The best results are in bold. The second best results are underlined. All results are averaged over 3 random seeds.

Model	Sen. (%) $\uparrow$	${\rm F_{1}}$ (%) $\uparrow$	IoU (%) $\uparrow$	MCC (%) $\uparrow$
V [Ours]	$82.23$	$80.73$	$67.98$	$79.77$
V+AFNO [50] [ICLR’22]	$82.22$	$80.60$	$67.80$	$79.67$
V+GFN [51] [TPAMI’23]	$82.14$	$80.62$	$67.82$	$79.67$
V+FE [Ours]	$\underline{83.07}$	$\underline{81.36}$	$\underline{68.76}$	$\underline{80.40}$
V+CA+AFNO [50] [ICLR’22]	$83.05$	$81.23$	$68.62$	$80.24$
V+CA+GFN [51] [TPAMI’23]	$\bm{83.12}$	$81.24$	$68.61$	$80.26$
V+CA+FE [Ours]	$82.91$	$\bm{81.76}$	$\bm{69.37}$	$\bm{80.82}$

FE: Frequency encoder; CA: Cross-attention.

TABLE VI: Comparisons with other channel refinement modules on the XCAD dataset. The best results are in bold. The second best results are underlined. All results are averaged over 3 random seeds.

Model	Sen. (%) $\uparrow$	${\rm F_{1}}$ (%) $\uparrow$	IoU (%) $\uparrow$	MCC (%) $\uparrow$
IV [Ours]	$\underline{82.00}$	$\underline{81.22}$	$\underline{68.57}$	$\underline{80.23}$
IV+SE [59] [TPAMI’19]	$81.61$	$80.99$	$68.26$	$80.02$
IV+SK [70] [CVPR’19]	$81.77$	$81.06$	$68.35$	$80.07$
IV+ECA [61] [CVPR’20]	$81.66$	$81.06$	$68.34$	$80.06$
IV+FCA [71] [ICCV’21]	$81.58$	$81.17$	$68.50$	$80.19$
IV+TCI [Ours]	$\bm{82.91}$	$\bm{81.76}$	$\bm{69.37}$	$\bm{80.82}$

TCI: Topological channel interaction.

ii) The proposed frequency operation design clearly outperforms those in AFNO [50] and GFN [51], which utilize deterministic functions or masks to filter frequency information. In contrast, our model efficiently learns specific frequency features from different inputs, benefiting from the semantic-adaptive design of our frequency encoder.

Furthermore, we compare our proposed topological channel interaction (TCI) module with some typical channel refinement modules [59], [61], [70], [71]. Table VI summarizes the results. Our model with TCI achieves considerable performance gains over models with other channel refinement modules: $1.14\%\sim 1.33\%$ increases in sensitivity, $0.59\%\sim 0.77\%$ increases in ${\rm F_{1}}$, $0.87\%\sim 1.11\%$ increases in IoU, and $0.63\%\sim 0.80\%$ increases in MCC. Surprisingly, the baseline model IV, which does not utilize channel refinement modules, performs even better than models that incorporate advanced channel refinement modules. This phenomenon indicates that simply allocating weights to each channel may introduce extra interference in our vessel segmentation scenario. Unlike current methods, we use graphs to formulate unstructured relationships among channels and exchange information across different channels via graph neural networks, thus filtering out irrelevant noise and enhancing vessel-specific responses.
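The gain ranges quoted above follow directly from the rows of Table VI. The short script below recomputes them as a sanity check; the values are copied from Table VI, while the names `TABLE_VI`, `gain_ranges`, and `baseline_beats_attention` are hypothetical helpers introduced only for this illustration.

```python
# Rows of Table VI: (Sen., F1, IoU, MCC) in percent on the XCAD dataset.
TABLE_VI = {
    "IV": (82.00, 81.22, 68.57, 80.23),
    "IV+SE": (81.61, 80.99, 68.26, 80.02),
    "IV+SK": (81.77, 81.06, 68.35, 80.07),
    "IV+ECA": (81.66, 81.06, 68.34, 80.06),
    "IV+FCA": (81.58, 81.17, 68.50, 80.19),
    "IV+TCI": (82.91, 81.76, 69.37, 80.82),
}
METRICS = ("Sen.", "F1", "IoU", "MCC")
ATTENTION_VARIANTS = ("IV+SE", "IV+SK", "IV+ECA", "IV+FCA")


def gain_ranges() -> dict:
    """Smallest and largest gain of IV+TCI over the other refinement modules."""
    tci = TABLE_VI["IV+TCI"]
    ranges = {}
    for i, metric in enumerate(METRICS):
        gains = [round(tci[i] - TABLE_VI[name][i], 2) for name in ATTENTION_VARIANTS]
        ranges[metric] = (min(gains), max(gains))
    return ranges


def baseline_beats_attention() -> bool:
    """True if plain Variant IV exceeds every refinement variant on every metric."""
    base = TABLE_VI["IV"]
    return all(
        base[i] > TABLE_VI[name][i]
        for name in ATTENTION_VARIANTS
        for i in range(len(METRICS))
    )


if __name__ == "__main__":
    for metric, (low, high) in gain_ranges().items():
        print(f"{metric}: {low:.2f} to {high:.2f} percentage points")
    print("Baseline IV beats all attention variants:", baseline_beats_attention())
```

The recomputed ranges match the figures in the text, and the second check confirms that the plain baseline outperforms SE, SK, ECA, and FCA on all four metrics.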

Additionally, we employ Grad-CAM [72] to compare discriminative regions with and without channel refinement modules. As depicted in Fig. 7, regions highlighted in red boxes exhibit serious interference, including vessel-like interventional instruments and motion artifacts. Comparison methods tend to assign weights to these irrelevant regions (highlighted in red boxes), thus producing more false positive predictions. By leveraging the topological channel interaction module, our SPIRONet demonstrates enhanced abilities in filtering out irrelevant responses compared to other models.

VII Conclusion

This paper proposes a SPatial-frequency learning and topologIcal channel inteRactiOn Network (SPIRONet) to tackle challenges in vessel segmentation. Dual encoders in SPIRONet can extract local spatial and global frequency vessel features effectively. Then, complementary spatial and frequency vessel features can be fused by cross-attention fusion modules. Furthermore, the proposed topological channel interaction module is able to filter out task-irrelevant responses in multi-channel feature maps. Promising experimental results on four benchmarks have demonstrated the effectiveness of our model. In future work, knowledge distillation will be explored to further enhance the inference efficiency of SPIRONet. Additionally, SPIRONet will be verified in real clinical scenarios.

References

[1] Y. Ma et al., ‘‘Self-supervised vessel segmentation via adversarial learning,’’ in Proc. ICCV, 2021, pp. 7536--7545.
[2] R. L. Sacco et al., ‘‘The heart of 25 by 25: Achieving the goal of reducing global and regional premature deaths from cardiovascular diseases and stroke: A modeling study from the American Heart Association and World Heart Federation,’’ Circulation, vol. 133, no. 23, pp. e674--e690, 2016.
[3] G. A. Roth, G. A. Mensah, and V. Fuster, ‘‘The global burden of cardiovascular diseases and risks: A compass for global action,’’ J. Am. Coll. Cardiol., vol. 76, no. 25, pp. 2980--2981, 2020.
[4] M. Wan et al., ‘‘Symptomatic and asymptomatic chronic carotid artery occlusion on high-resolution MR vessel wall imaging,’’ Am. J. Neuroradiol., vol. 43, no. 1, pp. 110--116, 2022.
[5] N. B. Langer and M. Argenziano, ‘‘Minimally invasive cardiovascular surgery: Incisions and approaches,’’ Methodist Debakey Cardiovasc. J., vol. 12, no. 1, p. 4, 2016.
[6] F. Cagnazzo et al., ‘‘Endovascular recanalization of chronically occluded internal carotid artery,’’ J. Neurointerv. Surg., vol. 12, no. 10, pp. 946--951, 2020.
[7] H. Li, X.-H. Zhou, X.-L. Xie, S.-Q. Liu, Z.-Q. Feng, and Z.-G. Hou, ‘‘CASOG: Conservative actor–critic with smooth gradient for skill learning in robot-assisted intervention,’’ IEEE Trans. Ind. Electron., 2023, DOI: 10.1109/TIE.2023.3310021.
[8] C. Meng, Y. Li, Y. Xu, N. Li, and K. Xia, ‘‘A weakly supervised framework for 2D/3D vascular registration oriented to incomplete 2D blood vessels,’’ IEEE Trans. Med. Robot. Bionics, vol. 4, no. 2, pp. 381--390, 2022.
[9] M. E. Abdelaziz, L. Tian, M. Hamady, G.-Z. Yang, and B. Temelkuran, ‘‘X-ray to MR: The progress of flexible instruments for endovascular navigation,’’ Prog. Biomed. Eng., vol. 3, no. 3, p. 032004, 2021.
[10] M. J. Eagleton, ‘‘Updates in endovascular procedural navigation,’’ Can. J. Cardiol., vol. 38, no. 5, pp. 662--671, 2022.
[11] D.-X. Huang et al., ‘‘Real-time 2D/3D registration via CNN regression and centroid alignment,’’ IEEE Trans. Autom. Sci. Eng., 2024, DOI: 10.1109/TASE.2023.3345927.
[12] J. Zhu et al., ‘‘3D/2D vessel registration based on Monte Carlo tree search and manifold regularization,’’ IEEE Trans. Med. Imaging, 2023, DOI: 10.1109/TMI.2023.3347896.
[13] S. Xia et al., ‘‘Vessel segmentation of X-ray coronary angiographic image sequence,’’ IEEE Trans. Biomed. Eng., vol. 67, no. 5, pp. 1338--1348, 2019.
[14] A. F. Frangi, W. J. Niessen, K. L. Vincken, and M. A. Viergever, ‘‘Multiscale vessel enhancement filtering,’’ in Proc. MICCAI, 1998, pp. 130--137.
[15] R. Manniesing, M. A. Viergever, and W. J. Niessen, ‘‘Vessel enhancing diffusion: A scale space representation of vessel structures,’’ Med. Image Anal., vol. 10, no. 6, pp. 815--825, 2006.
[16] F. Cervantes-Sanchez et al., ‘‘Segmentation of coronary angiograms using Gabor filters and Boltzmann univariate marginal distribution algorithm,’’ Comput. Intell. Neurosci., vol. 2016, 2016.
[17] H. Jiang et al., ‘‘A region growing vessel segmentation algorithm based on spectrum information,’’ Comput. Math. Methods Med., vol. 2013, p. 743870, 2013.
[18] Y.-Z. Zeng et al., ‘‘Automatic liver vessel segmentation using 3D region growing and hybrid active contour model,’’ Comput. Biol. Med., vol. 97, pp. 63--73, 2018.
[19] N. Sangsefidi, A. H. Foruzan, and A. Dolati, ‘‘Balancing the data term of graph-cuts algorithm to improve segmentation of hepatic vascular structures,’’ Comput. Biol. Med., vol. 93, pp. 117--126, 2018.
[20] C. Wang et al., ‘‘Tensor-cut: A tensor-based graph-cut blood vessel segmentation method and its application to renal artery segmentation,’’ Med. Image Anal., vol. 60, p. 101623, 2020.
[21] R.-Q. Li, G.-B. Bian, X.-H. Zhou, X. Xie, Z.-L. Ni, and Z. Hou, ‘‘CAU-net: A novel convolutional neural network for coronary artery segmentation in digital subtraction angiography,’’ in Proc. ICONIP, 2020, pp. 185--196.
[22] S. Minaee, Y. Boykov, F. Porikli, A. Plaza, N. Kehtarnavaz, and D. Terzopoulos, ‘‘Image segmentation using deep learning: A survey,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 7, pp. 3523--3542, 2021.
[23] X. Chen et al., ‘‘Recent advances and clinical applications of deep learning in medical image analysis,’’ Med. Image Anal., vol. 79, p. 102444, 2022.
[24] O. Ronneberger, P. Fischer, and T. Brox, ‘‘U-Net: Convolutional networks for biomedical image segmentation,’’ in Proc. MICCAI, 2015, pp. 234--241.
[25] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, ‘‘UNet++: Redesigning skip connections to exploit multiscale features in image segmentation,’’ IEEE Trans. Med. Imaging, vol. 39, no. 6, pp. 1856--1867, 2019.
[26] J. Schlemper et al., ‘‘Attention gated networks: Learning to leverage salient regions in medical images,’’ Med. Image Anal., vol. 53, pp. 197--207, 2019.
[27] D.-X. Huang et al., ‘‘MOSformer: Momentum encoder-based inter-slice fusion transformer for medical image segmentation,’’ arXiv:2401.11856, 2024.
[28] L. Mou et al., ‘‘CS²-Net: Deep learning segmentation of curvilinear structures in medical imaging,’’ Med. Image Anal., vol. 67, p. 101874, 2021.
[29] W. Liu et al., ‘‘Full-resolution network and dual-threshold iteration for retinal vessel and coronary angiograph segmentation,’’ IEEE J. Biomed. Health Inform., vol. 26, no. 9, pp. 4623--4634, 2022.
[30] Y. Li et al., ‘‘Global transformer and dual local attention network via deep-shallow hierarchical feature fusion for retinal vessel segmentation,’’ IEEE Trans. Cybern., vol. 53, no. 9, pp. 5826--5839, 2023.
[31] A. Dosovitskiy et al., ‘‘An image is worth 16x16 words: Transformers for image recognition at scale,’’ arXiv:2010.11929, 2020.
[32] Z. Liu et al., ‘‘Swin transformer: Hierarchical vision transformer using shifted windows,’’ in Proc. ICCV, 2021, pp. 10 012--10 022.
[33] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for image recognition,’’ in Proc. CVPR, 2016, pp. 770--778.
[34] J. Chen et al., ‘‘TransUNet: Transformers make strong encoders for medical image segmentation,’’ arXiv:2102.04306, 2021.
[35] J. Guo et al., ‘‘CMT: Convolutional neural networks meet vision transformers,’’ in Proc. CVPR, 2022, pp. 12 175--12 185.
[36] M. Heidari et al., ‘‘HiFormer: Hierarchical multi-scale representations using transformers for medical image segmentation,’’ in Proc. WACV, 2023, pp. 6202--6212.
[37] H. Kuang et al., ‘‘Hybrid CNN-Transformer network with circular feature interaction for acute ischemic stroke lesion segmentation on non-contrast CT scans,’’ IEEE Trans. Med. Imaging, 2024, DOI: 10.1109/TMI.2024.3362879.
[38] T. N. Kipf and M. Welling, ‘‘Semi-supervised classification with graph convolutional networks,’’ arXiv:1609.02907, 2016.
[39] H. Heidbuchel et al., ‘‘Practical ways to reduce radiation dose for patients and staff during device implantations and electrophysiological procedures,’’ Europace, vol. 16, no. 7, pp. 946--964, 2014.
[40] R. Adams and L. Bischof, ‘‘Seeded region growing,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 16, no. 6, pp. 641--647, 1994.
[41] E. O. Rodrigues, A. Conci, and P. Liatsis, ‘‘ELEMENT: Multi-modal retinal vessel segmentation based on a coupled region growing and machine learning approach,’’ IEEE J. Biomed. Health Inform., vol. 24, no. 12, pp. 3507--3519, 2020.
[42] M. Taghizadeh Dehkordi, A. M. Doost Hoseini, S. Sadri, and H. Soltanianzadeh, ‘‘Local feature fitting active contour for segmenting vessels in angiograms,’’ IET Comput. Vis., vol. 8, no. 3, pp. 161--170, 2014.
[43] N. Memari, A. R. Ramli, M. I. B. Saripan, S. Mashohor, and M. Moghbel, ‘‘Retinal blood vessel segmentation by using matched filtering and fuzzy C-means clustering with integrated level set method for diabetic retinopathy assessment,’’ J. Med. Biol. Eng., vol. 39, pp. 713--731, 2019.
[44] B. Kim, Y. Oh, and J. C. Ye, ‘‘Diffusion adversarial representation learning for self-supervised vessel segmentation,’’ arXiv:2209.14566, 2022.
[45] Z. Gu et al., ‘‘CE-Net: Context encoder network for 2D medical image segmentation,’’ IEEE Trans. Med. Imaging, vol. 38, no. 10, pp. 2281--2292, 2019.
[46] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, ‘‘SegFormer: Simple and efficient design for semantic segmentation with transformers,’’ in Proc. NeurIPS, vol. 34, 2021, pp. 12 077--12 090.
[47] H. Wang, P. Cao, J. Wang, and O. R. Zaiane, ‘‘UCTransNet: Rethinking the skip connections in U-Net from a channel-wise perspective with transformer,’’ in Proc. AAAI, vol. 36, no. 3, 2022, pp. 2441--2449.
[48] R. C. Gonzalez and P. Wintz, Digital image processing. Addison-Wesley Longman Publishing Co., Inc., 1987.
[49] L. Chi, B. Jiang, and Y. Mu, ‘‘Fast Fourier convolution,’’ in Proc. NeurIPS, vol. 33, 2020, pp. 4479--4488.
[50] J. Guibas, M. Mardani, Z. Li, A. Tao, A. Anandkumar, and B. Catanzaro, ‘‘Adaptive Fourier neural operators: Efficient token mixers for transformers,’’ arXiv:2111.13587, 2021.
[51] Y. Rao, W. Zhao, Z. Zhu, J. Zhou, and J. Lu, ‘‘GFNet: Global filter networks for visual recognition,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 9, pp. 10 960--10 973, 2023.
[52] Z. Huang, Z. Zhang, C. Lan, Z.-J. Zha, Y. Lu, and B. Guo, ‘‘Adaptive frequency filters as efficient global token mixers,’’ in Proc. ICCV, 2023, pp. 6049--6059.
[53] R. Azad, A. Bozorgpour, M. Asadi-Aghbolaghi, D. Merhof, and S. Escalera, ‘‘Deep frequency re-calibration U-Net for medical image segmentation,’’ in Proc. ICCVW, 2021, pp. 3274--3283.
[54] Y. Huang, C. Zhou, L. Chen, J. Chen, and S. Lan, ‘‘Medical frequency domain learning: Consider inter-class and intra-class frequency for medical image segmentation and classification,’’ in Proc. BIBM, 2021, pp. 897--904.
[55] P. Li, R. Zhou, J. He, S. Zhao, and Y. Tian, ‘‘A global-frequency-domain network for medical image segmentation,’’ Comput. Biol. Med., vol. 164, p. 107290, 2023.
[56] D. Bau, J.-Y. Zhu, H. Strobelt, A. Lapedriza, B. Zhou, and A. Torralba, ‘‘Understanding the role of individual units in a deep neural network,’’ Proc. Natl. Acad. Sci. USA, vol. 117, no. 48, pp. 30 071--30 078, 2020.
[57] J. Fu, J. Liu, J. Jiang, Y. Li, Y. Bao, and H. Lu, ‘‘Scene segmentation with dual relation-aware attention network,’’ IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 6, pp. 2547--2560, 2020.
[58] Y. Li, Y. Zhang, W. Cui, B. Lei, X. Kuang, and T. Zhang, ‘‘Dual encoder-based dynamic-channel graph convolutional network with edge enhancement for retinal vessel segmentation,’’ IEEE Trans. Med. Imaging, vol. 41, no. 8, pp. 1975--1989, 2022.
[59] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, ‘‘Squeeze-and-Excitation networks,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 8, pp. 2011--2023, 2019.
[60] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, ‘‘CBAM: Convolutional block attention module,’’ in Proc. ECCV, 2018, pp. 3--19.
[61] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, ‘‘ECA-Net: Efficient channel attention for deep convolutional neural networks,’’ in Proc. CVPR, 2020, pp. 11 534--11 542.
[62] C. Guo, M. Szemenyei, Y. Hu, W. Wang, W. Zhou, and Y. Yi, ‘‘Channel attention residual U-Net for retinal vessel segmentation,’’ in Proc. ICASSP, 2021, pp. 1185--1189.
[63] J. W. Cooley and J. W. Tukey, ‘‘An algorithm for the machine calculation of complex Fourier series,’’ Math. Comput., vol. 19, no. 90, pp. 297--301, 1965.
[64] C. Wang, J. Jiang, Z. Zhong, and X. Liu, ‘‘Spatial-frequency mutual learning for face super-resolution,’’ in Proc. CVPR, 2023, pp. 22 356--22 366.
[65] A. Paszke et al., ‘‘PyTorch: An imperative style, high-performance deep learning library,’’ in Proc. NeurIPS, vol. 32, 2019.
[66] Z. Zhu, M. Xu, S. Bai, T. Huang, and X. Bai, ‘‘Asymmetric non-local neural networks for semantic segmentation,’’ in Proc. ICCV, 2019, pp. 593--602.
[67] X. Li, Y. Yang, Q. Zhao, T. Shen, Z. Lin, and H. Liu, ‘‘Spatial pyramid based graph reasoning for semantic segmentation,’’ in Proc. CVPR, 2020, pp. 8950--8959.
[68] P. A. Yushkevich et al., ‘‘User-guided 3D active contour segmentation of anatomical structures: Significantly improved efficiency and reliability,’’ NeuroImage, vol. 31, no. 3, pp. 1116--1128, 2006.
[69] F. Cervantes-Sanchez, I. Cruz-Aceves, A. Hernandez-Aguirre, M. A. Hernandez-Gonzalez, and S. E. Solorio-Meza, ‘‘Automatic segmentation of coronary arteries in X-ray angiograms using multiscale analysis and artificial neural networks,’’ Appl. Sci., vol. 9, no. 24, p. 5507, 2019.
[70] X. Li, W. Wang, X. Hu, and J. Yang, ‘‘Selective kernel networks,’’ in Proc. CVPR, 2019, pp. 510--519.
[71] Z. Qin, P. Zhang, F. Wu, and X. Li, ‘‘FcaNet: Frequency channel attention networks,’’ in Proc. ICCV, 2021, pp. 783--792.
[72] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, ‘‘Grad-CAM: Visual explanations from deep networks via gradient-based localization,’’ in Proc. ICCV, 2017, pp. 618--626.
