License: CC BY-NC-SA 4.0
arXiv:2311.11205v2 [eess.IV] 20 Jan 2024
1 Department of Computer Science, University of Liverpool, UK
2 Imperial College London, UK
3 University of Information Technology, HCMC, Vietnam
4 Vietnam National University, HCMC, Vietnam
5 Alder Hey Children’s Hospital, Liverpool, UK

Shape-Sensitive Loss for Catheter and Guidewire Segmentation

Chayun Kongtongvattana 1, Baoru Huang 2, **gxuan Kang 1, Hoan Nguyen 3,4, Olajide Olufemi 5, Anh Nguyen 1
Abstract

We introduce a shape-sensitive loss function for catheter and guidewire segmentation and utilize it in a vision transformer network to establish a new state-of-the-art result on a large-scale X-ray image dataset. We transform network-derived predictions and their corresponding ground truths into signed distance maps (SDMs), thereby enabling any network to concentrate on the essential boundaries rather than merely the overall contours. These SDMs are passed through the vision transformer, which efficiently produces high-dimensional feature vectors encapsulating critical image attributes. By computing the cosine similarity between these feature vectors, we gain a nuanced understanding of image similarity that goes beyond the limitations of traditional overlap-based measures. The advantages of our approach are manifold, ranging from scale and translation invariance to superior detection of subtle differences, thus ensuring precise localization and delineation of the medical instruments within the images. Comprehensive quantitative and qualitative analyses substantiate the significant enhancement in performance over existing baselines, demonstrating the promise held by our new shape-sensitive loss function for improving catheter and guidewire segmentation.

Keywords:
Shape-sensitive loss function, Vision Transformer (ViT), Catheter and guidewire segmentation, Signed distance maps (SDMs)

1 Introduction

Endovascular interventions have drastically changed cardiovascular surgery, bringing forth benefits like minimized trauma and faster recovery. However, they also present potential hazards such as damage to the vessel wall [1, 2, 23]. Accurate segmentation of catheters and guidewires within X-ray images is pivotal for reducing these risks.

Figure 1: Catheter and guidewire segmentation in X-ray images. First row: The input X-ray images. Second row: The segmentation results. Red denotes the catheter, green denotes the guidewire.

The merging of computer vision and machine learning in the medical domain has spurred advancements in addressing challenges associated with endovascular interventions [3, 4, 5, 6, 7]. In particular, deep learning has proven significant in refining the precision of surgical procedures and enhancing patient safety [8, 25]. Despite these strides, the segmentation of intricate structures of catheters and guidewires in X-ray images remains a hurdle. Traditional loss functions often fall short in capturing the global spatial relationships crucial for this task [9, 28, 10]. While research has employed convolutional neural networks (CNNs) with these loss functions to deliver promising results [11, 34, 16, 17], the challenge persists.

Our study presents a novel shape-sensitive loss function that uses the Vision Transformer (ViT) to gauge the similarity between signed distance maps (SDMs) [13, 14]. This approach aims to offer superior context awareness and structure sensitivity, leading to enhanced segmentation.

The structure of this paper comprises a literature review, a detailed explanation of our proposed loss function, presentation of experimental results, and a concluding section discussing potential future directions.

Outlined below are the foundational contributions of this work:

  • Introduction of a shape-sensitive loss function that merges spatial distance insights with the feature extraction prowess of Vision Transformer.

  • A unique technique for converting network outputs into SDMs, preserving structural intricacies.

  • A balanced approach, merging our shape-sensitive loss with the traditional Dice loss to ensure holistic segmentation performance.

2 Related Works

2.1 Catheter Segmentation

Catheter segmentation has gained momentum with the introduction of deep learning frameworks [39, 15]. The FW-Net utilizes an end-to-end approach with an encoder-decoder structure, optical flow extraction, and a unique flow-guided warping function to ensure temporal continuity in imaging sequences [15]. Another method employs deep convolutional neural networks for segmenting catheters and guidewires in 2D X-ray fluoroscopic sequences, using previous image contexts for enhanced accuracy and achieving a notable median centerline distance error of 0.2 mm [18]. A further approach combines CNNs with transfer learning, exploiting synthetic fluoroscopic images to develop a streamlined segmentation model requiring minimal manually annotated data, significantly reducing testing time while remaining adaptable to higher input resolutions [16]. These advancements underscore the continuous evolution and adaptability of deep learning methodologies in catheter and guidewire segmentation tasks.

Table 1: Types of Semantic Segmentation Loss Functions

Distribution-Based Loss
Binary Cross-Entropy [19]: $-\sum y\log(p)+(1-y)\log(1-p)$
Weighted Cross-Entropy [22]: $-\sum w_{y}y\log(p)$
Balanced Cross-Entropy [20]: $-\beta\sum y\log(p)-(1-\beta)\sum(1-y)\log(1-p)$
Focal [21]: $-\sum(1-p)^{\gamma}y\log(p)$

Region-Based Loss
Dice [24]: $1-\frac{2\sum(y\cap p)}{\sum y+\sum p}$
Tversky [26]: $\frac{\sum(y\cap p)}{\sum(y\cap p)+\alpha\sum(y-p)+\beta\sum(p-y)}$
Focal Tversky [27]: $(1-Tversky)^{\gamma}$

Compound Loss
Combo [31]: $CL(y,\hat{y})=\alpha L_{m-bce}-(1-\alpha)DL(y,\hat{y})$
ELL [32]: $L_{ELL}=\alpha L_{Dice}+\beta L_{CE}$

Boundary-Based Loss
HD [30]: $L_{HD_{DT}}=\frac{1}{N}\sum_{i=1}^{N}[(s_{i}-g_{i})\cdot(d_{Gi}^{2}-d_{Si}^{2})]$
InverseForm [12]: $L_{\text{if}}(b_{\text{pred}},b_{\text{gt}})=\sum_{j=1}^{N}d_{\text{if}}(b_{\text{pred,j}},b_{\text{gt,j}})$
Shape-sensitive (ours): Equation 5

2.2 Loss Function for Medical Segmentation

Loss functions in deep learning for image segmentation tasks are pivotal for determining the quality of medical image segmentations (Table 1). These functions can be broadly categorized into four primary types [10]: distribution-based, region-based, boundary-based, and compounded loss. Distribution-based losses, such as Binary Cross-Entropy [19], Focal loss [21], and Weighted Cross-Entropy [22], measure the dissimilarity between predicted and true probability distributions. While effective in modeling these distributions, they might lack spatial coherence and can struggle with class imbalances. On the other hand, region-based losses like Dice loss [24], Tversky Loss [26], and Focal Tversky loss [27] excel in scenarios with class imbalances by focusing on the overlap between predicted and actual segments, but they may miss finer details. Boundary-based losses, such as Shape-aware loss [29] and Hausdorff Distance loss [30], prioritize boundary accuracy but can be sensitive to minor perturbations. Meanwhile, compounded losses like Combo loss [31] and Exponential Logarithmic Loss [32] provide a comprehensive approach by merging features from different loss types, though they can potentially increase training complexity.
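To make the contrast between distribution-based and region-based losses concrete, the following minimal sketch evaluates binary cross-entropy and Dice loss (Table 1) on a toy mask in which only 2 of 20 pixels belong to the instrument. The function names, the `eps` smoothing term, the per-pixel averaging of the cross-entropy, and the toy values are illustrative assumptions and are not taken from the experiments in this paper.

```python
import math

def binary_cross_entropy(y, p, eps=1e-7):
    """Binary cross-entropy averaged over pixels (labels y in {0, 1}, probabilities p)."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return -total / len(y)

def dice_loss(y, p, eps=1e-7):
    """Soft Dice loss: 1 - 2 * sum(y * p) / (sum(y) + sum(p))."""
    intersection = sum(yi * pi for yi, pi in zip(y, p))
    return 1.0 - (2.0 * intersection + eps) / (sum(y) + sum(p) + eps)

# Toy flattened mask: 2 foreground pixels out of 20 (assumed values).
y = [1, 1] + [0] * 18
# A prediction that labels almost everything as background.
p_all_background = [0.05] * 20

bce_value = binary_cross_entropy(y, p_all_background)
dice_value = dice_loss(y, p_all_background)
print(f"BCE:  {bce_value:.4f}")
print(f"Dice: {dice_value:.4f}")
```

In this toy case the cross-entropy stays small even though the thin structure is missed entirely, while the Dice loss stays close to 1, which illustrates why region-based losses are often preferred under strong class imbalance.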

The pursuit of accuracy, especially for intricate structures, led to the inception of shape-aware loss functions, which factor in the spatial relationships and distances of pixels from target boundaries. The InverseForm loss function [12] is a noteworthy example. It integrates an Inverse Transformation Network into the loss calculation to produce a Transformation Matrix, ensuring alignment between predicted segmentation and target boundaries. Nonetheless, capturing high-level features that provide global context can be challenging for some methods. Drawing inspiration from such integrative approaches, our proposal incorporates Vision Transformers (ViTs) within the loss function. With their attention mechanism, ViTs capture global contextual features from images, paving the way for a more accurate representation of complex structures, such as catheters and guidewires in X-ray images.

3 Shape-Sensitive Loss for Catheter and Guidewire Segmentation

In this section, we introduce a novel approach enhancing catheter and guidewire segmentation in X-ray images. Leveraging SDM and ViT foundations, our method integrates key modifications to improve accuracy, offering a fresh perspective in precise medical imaging.

3.1 Preliminaries: Signed Distance Map

The Signed Distance Map (SDM) is reshaping image segmentation by mapping each pixel to its distance from the object’s contour, signifying its relation to the boundary. This method enhances the clarity of segmentation, especially for complex structures. SDM’s emphasis on boundary localization is crucial in medical imaging tasks such as catheter and guidewire segmentation. By using SDM, models become more sensitive to boundaries, prioritizing localization and improving segmentation accuracy, resulting in fewer errors in X-ray image segmentation. Fig. 2 shows an example of SDM.

Initial Transformation to SDM: Both the predicted output and its corresponding label undergo an initial transformation to fit the SDM format, setting the stage for further operations tailored for this representation. Let $N(x,y)$ be the network output, where $(x,y)$ are pixel coordinates.

Transformation to Contour: The contour representation, $C(x,y)$, is derived by thresholding the network output at a value $T$:

$$C(x,y)=\begin{cases}1&\text{if }N(x,y)>T\\ 0&\text{otherwise}\end{cases} \tag{1}$$

Transformation to SDM: Define $d((x,y),(i,j))$ as the Euclidean distance between any point $(x,y)$ and its nearest boundary point $(i,j)$.

$$\text{SDM}(x,y)=\begin{cases}0,&\text{if }C(x,y)=1\\ \min_{(i,j)}d((x,y),(i,j)),&\text{if }C(x,y)\neq 1\end{cases} \tag{2}$$

The computation of $d((x,y),(i,j))$ often employs methods like the Fast Marching Method or the Distance Transform algorithm. The SDM produces a continuous spectrum of values, with positive distances outside the object and negative distances inside. Leveraging SDM in the loss function allows the network to achieve precise segmentation of intricate structures in X-ray images.
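The two transformations above can be sketched in a few lines. This is a minimal illustration that follows Eq. (1) and Eq. (2) literally, using a brute-force nearest-point search instead of the Fast Marching Method or a distance transform. The function names, the threshold value 0.5, and the toy network output are assumptions made for the example.

```python
import math

def threshold(output, t=0.5):
    """Eq. (1): binarize the network output at threshold t."""
    return [[1 if v > t else 0 for v in row] for row in output]

def distance_map(contour):
    """Eq. (2): 0 on contour pixels, otherwise the Euclidean distance
    to the nearest contour pixel (brute force, for illustration only)."""
    points = [(i, j) for i, row in enumerate(contour)
              for j, v in enumerate(row) if v == 1]
    if not points:
        raise ValueError("contour contains no foreground pixels")
    result = []
    for x, row in enumerate(contour):
        out_row = []
        for y, v in enumerate(row):
            if v == 1:
                out_row.append(0.0)
            else:
                out_row.append(min(math.hypot(x - i, y - j) for i, j in points))
        result.append(out_row)
    return result

# Toy 3 x 4 network output (assumed values).
output = [
    [0.1, 0.2, 0.1, 0.0],
    [0.1, 0.9, 0.8, 0.1],
    [0.0, 0.1, 0.2, 0.1],
]
contour = threshold(output, t=0.5)
sdm = distance_map(contour)
for row in sdm:
    print(["%.2f" % v for v in row])
```

Because the sketch implements Eq. (2) as written, all returned values are non-negative; assigning negative distances inside the object would require an additional interior mask, which is not shown here. For full-size images, a library routine such as SciPy's `scipy.ndimage.distance_transform_edt` would typically replace the quadratic-time loop.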

Figure 2: Illustration of the process to create the Signed Distance Maps. Top row: Original ground-truth images. Bottom row: Signed Distance Maps, calculated from the contours and overlaid with the corresponding contour images.

3.2 Feature Extraction using Vision Transformer

The Vision Transformer (ViT) excels in extracting high-level features, pivotal for understanding the intricate patterns in Signed Distance Maps (SDMs). Our approach, while building upon ViT’s established use with SDM, introduces key modifications to enhance feature extraction. These modifications, including optimized patch sizes for high-resolution imaging and an adaptive attention mechanism targeting critical anatomical features, are specifically tailored to address the variability in medical data and the need for precise segmentation. This integration not only demonstrates ViT’s potential in specialized tasks, but also distinctly sets apart our method, as shown in Fig. 3.

For feature extraction, the last layer of the ViT is bypassed, allowing it to produce high-dimensional vectors that capture detailed SDM patterns, a critical component for precise segmentation. We opt for the ViT-B/384 configuration, which is designed for high-resolution images. Using a 384-patch size, it captures fine-grained details, making it apt for high-resolution SDM analysis. From this, the features $\alpha$ for the predicted SDM and $\beta$ for the true SDM are determined, forming the core of the loss computation.

$$\alpha,\beta=\text{ViT}(\text{SDM}_{\text{Output}},\text{SDM}_{\text{Label}}) \tag{3}$$

In the equation above, $\text{ViT}(\text{SDM}_{\text{Output}},\text{SDM}_{\text{Label}})$ denotes the feature extraction process of passing the SDMs through the ViT. By utilizing ViT, segmentation is further enhanced, resulting in a thorough understanding of SDM structures.

3.3 Shape-Sensitive Loss Computation

Figure 3: An overview of our framework.

Accurate segmentation heavily hinges on an effective loss computation mechanism to steer the model’s learning, ensuring the precise identification of segmentation boundaries within SDMs. The employed loss computation in this work is fundamentally shape-sensitive, an approach meticulously tailored to heighten the model’s sensitivity to the geometric details within the SDMs. This heightened sensitivity is achieved through the use of cosine similarity between the high-level features extracted from both the predicted and true SDMs.

The premise behind employing cosine similarity lies in its capacity to consider the angle between the feature vectors, offering a nuanced and detailed insight into their alignment. This perspective enables a more delicate evaluation of the segmentation process, capturing the geometric intricacies within the high-dimensional feature space derived from the SDMs.

$$\text{CosSim}(\alpha,\beta)=\frac{\alpha\cdot\beta}{|\alpha|\times|\beta|} \tag{4}$$

The shape-sensitive loss, denoted as $\mathcal{L}_{\text{SS}}$, is thus defined as the deviation from perfect alignment, represented mathematically as:

$$\mathcal{L}_{\text{SS}}(\text{SDM}_{\text{Output}},\text{SDM}_{\text{Label}})=1-\operatorname{CosSim}(\text{ViT}(\text{SDM}_{\text{Output}}),\text{ViT}(\text{SDM}_{\text{Label}})) \tag{5}$$
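As a small numerical illustration of Eq. (4) and Eq. (5), the sketch below computes the shape-sensitive loss from two feature vectors. The vectors stand in for the ViT embeddings $\alpha$ and $\beta$; the ViT itself is not run here, and the function names and example values are assumptions made for the illustration.

```python
import math

def cosine_similarity(a, b):
    """Eq. (4): dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

def shape_sensitive_loss(feat_pred, feat_label):
    """Eq. (5): one minus the cosine similarity of the two feature vectors."""
    return 1.0 - cosine_similarity(feat_pred, feat_label)

# Stand-in feature vectors (assumed values, not real ViT outputs).
alpha = [0.2, 0.8, 0.1]
beta = [0.4, 1.6, 0.2]  # same direction as alpha, twice the magnitude

loss_aligned = shape_sensitive_loss(alpha, beta)
loss_orthogonal = shape_sensitive_loss([1.0, 0.0], [0.0, 1.0])
print(f"aligned:    {loss_aligned:.6f}")
print(f"orthogonal: {loss_orthogonal:.6f}")
```

The aligned pair yields a loss of approximately zero despite the difference in magnitude, because the cosine similarity depends only on the angle between the feature vectors.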

Further enriching the loss computation, the total loss, $\mathcal{L}_{\text{total}}$, amalgamates the shape-sensitive loss with the Dice loss, a well-regarded loss function for segmentation tasks. This composite measure is meticulously balanced for optimal segmentation results, ensuring the model maintains a comprehensive focus, attending to shape sensitivity and other pivotal aspects of segmentation.

$$\mathcal{L}_{\text{total}}=\gamma\mathcal{L}_{\text{Dice}}+\delta\mathcal{L}_{\text{SS}}(\text{SDM}_{\text{Output}},\text{SDM}_{\text{Label}}) \tag{6}$$

This structured loss computation strategy, by intertwining geometric sensitivity with established loss functions, furnishes the model with a robust and multifaceted learning signal, underpinning the attainment of superior segmentation performance in processing SDMs.
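The weighted combination in Eq. (6) can be sketched as follows. The two input loss values are hypothetical numbers chosen for the example, and the function name is an assumption; the $(\gamma, \delta)$ pairs are the blending settings examined in Table 4.

```python
def total_loss(dice_loss, ss_loss, gamma=0.5, delta=0.5):
    """Eq. (6): weighted sum of the Dice loss and the shape-sensitive loss."""
    return gamma * dice_loss + delta * ss_loss

# Hypothetical loss values for one training batch (not from the paper).
dice_value = 0.40
ss_value = 0.10

# Blending settings listed in Table 4, as (gamma, delta) pairs.
settings = [(0.1, 0.9), (0.3, 0.7), (0.5, 0.5), (0.7, 0.3), (0.9, 0.1)]
blended = {(g, d): total_loss(dice_value, ss_value, g, d) for g, d in settings}

for (g, d), value in blended.items():
    print(f"gamma={g}, delta={d}: total loss = {value:.3f}")
```

Since each pair sums to 1, the total loss is a convex combination of the two terms, so changing the weights shifts emphasis between overlap and shape without changing the overall scale of the loss.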

4 Experimental Results

4.1 Experiment Setup

Dataset: We assessed our proposed loss function using our newly collected dataset. This dataset includes 5,086 real animal X-ray images and 18,791 phantom X-ray images, both paired with ground truth annotations. While real animal images are inherently 512 × 512 pixels, phantom images were resized from 1024 × 1024 to 512 × 512, ensuring labels remained accurate. The dataset was divided with a 70-30 split for training and testing, respectively.

Evaluation metrics: Our semantic segmentation performance was gauged using several standard metrics:

  • Dice Coefficient: Measures overlap between prediction and ground truth.

  • Jaccard Similarity (IoU): Calculates the ratio of intersected region to the combined predicted and ground truth areas.

  • Mean Intersection over Union (mIoU): This is an average IoU across all classes, predominantly for multi-class segmentation.

  • Accuracy: Determines the proportion of correctly identified pixels.
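The four metrics can be sketched on flattened label maps as shown below. The integer class encoding (0 for background, 1 for catheter, 2 for guidewire), the function names, and the toy labels are assumptions made for this illustration; the convention of returning 1.0 when a class is absent from both maps is also an assumption.

```python
def confusion_counts(pred, truth, cls):
    """True positives, false positives, and false negatives for one class."""
    tp = sum(1 for p, t in zip(pred, truth) if p == cls and t == cls)
    fp = sum(1 for p, t in zip(pred, truth) if p == cls and t != cls)
    fn = sum(1 for p, t in zip(pred, truth) if p != cls and t == cls)
    return tp, fp, fn

def dice_coefficient(pred, truth, cls):
    """Dice = 2 * TP / (2 * TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, truth, cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0

def iou(pred, truth, cls):
    """Jaccard similarity (IoU) = TP / (TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, truth, cls)
    denom = tp + fp + fn
    return tp / denom if denom else 1.0

def mean_iou(pred, truth, classes):
    """Average IoU over the given classes."""
    return sum(iou(pred, truth, c) for c in classes) / len(classes)

def accuracy(pred, truth):
    """Fraction of pixels whose predicted label matches the ground truth."""
    return sum(1 for p, t in zip(pred, truth) if p == t) / len(truth)

# Flattened toy label maps: 0 = background, 1 = catheter, 2 = guidewire.
truth = [0, 0, 1, 1, 2, 2, 0, 0]
pred = [0, 1, 1, 1, 2, 0, 0, 0]

print("Dice (catheter):", dice_coefficient(pred, truth, 1))
print("IoU (catheter): ", iou(pred, truth, 1))
print("mIoU:           ", mean_iou(pred, truth, [0, 1, 2]))
print("Accuracy:       ", accuracy(pred, truth))
```

The example also shows why accuracy alone can be misleading for thin instruments: most pixels are background, so accuracy remains high even when half of the guidewire is missed.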

We integrate our method into different segmentation backbones, including U-Net [33], U-Net++ [36], U-Net3+ [37], TransU-Net [35], and SwinU-Net [38].

4.2 Results on Real Animal X-Ray Images

As shown in Table 2, all segmentation networks benefited from our method. TransU-Net exhibited a boost in the Dice coefficient from 54.52% to 57.16%, a gain of 2.64 percentage points. Similar growth was observed in Jaccard Similarity, mIoU, and Accuracy. The U-Net, a foundational segmentation architecture, also improved in all four metrics after embedding our loss function, with the Dice coefficient rising by 1.75 percentage points. The other models, including U-Net++, U-Net3+, and SwinU-Net, mirrored this trend. Notably, the TransU-Net combined with our loss function outperformed all other architectures in every metric.

Table 2: Comparing results for real animal X-ray images (all values in %)
Network              Dice    Jaccard  mIoU    Accuracy
TransU-Net           54.52   43.87    61.15   78.25
TransU-Net w/ours    57.16   46.04    62.66   79.57
U-Net                46.20   36.62    54.14   71.57
U-Net w/ours         47.95   38.79    55.05   72.89
U-Net++              48.77   39.22    56.17   72.30
U-Net++ w/ours       50.35   41.39    57.08   73.62
U-Net3+              49.24   40.12    56.37   73.14
U-Net3+ w/ours       51.37   42.29    57.28   74.46
SwinU-Net            52.74   42.14    57.48   77.67
SwinU-Net w/ours     54.71   44.58    59.04   79.13
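The per-network Dice gains quoted in the text can be recomputed directly from Table 2. The short sketch below does this arithmetic; the variable names are illustrative, and the numbers are copied from the Dice column of the table.

```python
# Dice scores (%) from Table 2: (baseline, with the shape-sensitive loss).
dice_scores = {
    "TransU-Net": (54.52, 57.16),
    "U-Net": (46.20, 47.95),
    "U-Net++": (48.77, 50.35),
    "U-Net3+": (49.24, 51.37),
    "SwinU-Net": (52.74, 54.71),
}

# Gain in percentage points for each backbone.
gains = {name: round(ours - base, 2) for name, (base, ours) in dice_scores.items()}

# Backbone with the highest Dice score when trained with the proposed loss.
best_network = max(dice_scores, key=lambda name: dice_scores[name][1])

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} percentage points")
print("Best with the proposed loss:", best_network)
```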

4.3 Results on Phantom X-ray Images

In our expanded study on phantom X-ray images, our method again improved every segmentation backbone (Table 3). When we integrated our method with the TransU-Net model, the Dice coefficient rose from 40.83% to 43.71%, a gain of 2.88 percentage points, supporting our method’s effectiveness in enhancing the segmentation accuracy of intricate medical images. Figure 4 highlights another property of our approach: across the different architectures, the parameter count remained unchanged when our loss function was introduced. This suggests that our method improves segmentation performance without adding parameters to the segmentation network.

Table 3: Comparing results for phantom X-ray images (all values in %)
Network              Dice    Jaccard  mIoU    Accuracy
TransU-Net           40.83   33.94    46.01   66.18
TransU-Net w/ours    43.71   36.65    47.21   67.58
U-Net                34.51   27.62    43.14   60.57
U-Net w/ours         37.13   28.56    45.83   62.29
U-Net++              35.24   27.43    44.72   61.76
U-Net++ w/ours       37.25   28.94    45.97   63.12
U-Net3+              35.53   27.87    44.89   61.88
U-Net3+ w/ours       38.07   29.54    46.21   63.48
SwinU-Net            39.44   31.27    45.29   65.37
SwinU-Net w/ours     41.17   33.79    46.23   66.88

4.4 Ablation Study

We conducted the ablation study on the real animal X-ray images. Table 6 shows that cosine similarity outperformed the other distance measurements for comparing feature embeddings. Table 5 shows that combining the Dice loss with our shape-sensitive loss function produced the best segmentation results among the tested loss combinations. Table 4 reports the impact of the blending parameters across different network architectures. The TransU-Net framework with blending coefficients of 0.5 for both $\gamma$ and $\delta$ delivered the best performance, reaching a Dice coefficient of 57.16%.

Table 4: Effects of Blending Parameters: $\gamma$ for Dice Loss and $\delta$ for Our Proposed Shape-Sensitive Loss on Various Network Architectures.
Network       $\gamma$   $\delta$   Dice (%)
TransU-Net    0.1        0.9        56.68
TransU-Net    0.3        0.7        57.03
TransU-Net    0.5        0.5        57.16
TransU-Net    0.7        0.3        56.91
TransU-Net    0.9        0.1        56.89
SwinU-Net     0.1        0.9        54.15
SwinU-Net     0.3        0.7        54.23
SwinU-Net     0.5        0.5        54.71
SwinU-Net     0.7        0.3        54.55
SwinU-Net     0.9        0.1        54.50
Figure 4: Comparison of Dice coefficient and number of parameters of different networks on phantom X-ray images.
Table 5: Performance comparison of different loss functions. CE = Cross-Entropy; FT = Focal Tversky.
Loss Function     Dice (%)
CE and Dice       54.52
FT and ours       55.78
Combo and ours    56.63
Dice and ours     57.16

Table 6: Comparison between various distance measurements for loss value calculation.
Dist. Measurement     Dice (%)
Cosine similarity     57.16
Euclidean distance    56.78
Manhattan distance    56.10
Jaccard similarity    56.72
Hamming distance      54.03
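To illustrate how three of the distance measurements in Table 6 behave on feature embeddings, the sketch below compares a cosine-based distance with the Euclidean and Manhattan distances on a pair of vectors that differ only in magnitude. The vectors and function names are assumptions made for the example; how each measurement was turned into a loss value in the experiments is not specified here, and the Jaccard and Hamming variants are omitted because their definition on real-valued embeddings is not given.

```python
import math

def cosine_distance(a, b):
    """One minus the cosine similarity of a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Stand-in embeddings (assumed values): b is a scaled copy of a.
a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]

d_cos = cosine_distance(a, b)
d_euc = euclidean_distance(a, b)
d_man = manhattan_distance(a, b)
print(f"cosine: {d_cos:.3f}, Euclidean: {d_euc:.3f}, Manhattan: {d_man:.3f}")
```

The cosine-based distance treats the two embeddings as identical, whereas the Euclidean and Manhattan distances penalize the difference in magnitude; this is one plausible reading of why the angle-based measure behaves differently in Table 6.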

5 Conclusion

In this study, we proposed a customized loss function for shape-sensitive segmentation using a pre-trained Vision Transformer (ViT) network. By converting the network prediction and ground truth segmentation maps into signed distance maps and extracting high-level features through the ViT, we obtained feature vectors for both. The distance between boundary maps was then evaluated using cosine similarity, which measured the dissimilarity between the extracted high-level features. By combining our customized loss function with the Dice loss, we aimed to leverage shape-sensitive segmentation and capture finer details, ultimately improving the overall segmentation accuracy. Our approach demonstrated the effectiveness of utilizing ViT and signed distance maps in the segmentation task, providing valuable insights into optimizing boundary map distances for accurate segmentation results. While our strategy underscored the merits of employing both ViT and signed distance maps for segmentation, it is essential to acknowledge the relatively low accuracy of our prediction results. This highlights a pivotal area for improvement, motivating further refinements and optimizations to raise the segmentation accuracy in future work.

References

  • [1] H. Rafii-Tari, C. J. Payne, G.-Z. Yang.: Current and emerging robot-assisted endovascular catheterization technologies: A review. In: Annals of Biomedical Engineering (2014)
  • [2] N. Simaan, R. M. Yasin, and L. Wang.: Medical technologies and challenges of robot-assisted minimally invasive intervention and diagnostics. In: Annual Review of Control, Robotics, and Autonomous Systems (2018)
  • [3] Y. Thakur, J. S. Bax, D. W. Holdsworth, and M. Drangova.: Design and performance evaluation of a remote catheter navigation system. In: IEEE Transactions on Biomedical Engineering (2009)
  • [4] M. E. M. K. Abdelaziz, D. Kundrat, M. Pupillo, G. Dagnino, T. MY, W. C. Kwok, V. Groenhuis, F. J. Siepel, C. Riga, S. Stramigioli, et al.: Toward a versatile robotic platform for fluoroscopy and MRI-guided endovascular interventions: A pre-clinical study. In: IROS (2019)
  • [5] G.-B. Bian, X.-L. Xie, Z.-Q. Feng, Z.-G. Hou, P. Wei, L. Cheng, and M. Tan.: An enhanced dual-finger robotic hand for catheter manipulating in vascular intervention: A preliminary study. In: ICIA (2013)
  • [6] W. Chi, G. Dagnino, T. Kwok, A. Nguyen, D. Kundrat, E. M. K. Abdelaziz, C. Riga, C. Bicknell, and G.-Z. Yang.: Collaborative robot-assisted endovascular catheterization with generative adversarial imitation learning. In: ICRA (2020)
  • [7] Y. Zhao, S. Guo, Y. Wang, J. Cui, Y. Ma, Y. Zeng, X. Liu, Y. Jiang, Y. Li, L. Shi, et al.: A CNN-based prototype method of unstructured surgical state perception and navigation for an endovascular surgery robot. In: Medical & Biological Engineering & Computing (2019)
  • [8] M. Benavente Molinero, G. Dagnino, J. Liu, W. Chi, M. Abdelaziz, T. Kwok, C. Riga, and G. Yang.: Haptic Guidance for Robot-Assisted Endovascular Procedures: Implementation and Evaluation on Surgical Simulator. In: IROS (2019)
  • [9] K. O’Shea and R. Nash.: An Introduction to Convolutional Neural Networks, (2015), arXiv:1511.08458. cs.NE.
  • [10] S. Jadon.: A survey of loss functions for semantic segmentation. In: 2020 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB) Oct. (2020)
  • [11] Shuai Guo, Songyuan Tang, Jianjun Zhu, **gfan Fan, Danni Ai, Hong Song, ** Liang, Jian Yang.: Improved U-Net for Guidewire Tip Segmentation in X-ray Fluoroscopy Images. In: ICAIP ’19: Proceedings of the 2019 3rd International Conference on Advances in Image Processing (2019)
  • [12] S. Borse, Y. Wang, Y. Zhang, and F. Porikli.: InverseForm: A Loss Function for Structured Boundary-Aware Segmentation. arXiv:2104.02745 [cs.CV] (2021)
  • [13] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929 [cs.CV] (2021)
  • [14] Q. Huang, Y. Zhou, and L. Tao.: Dual-Term Loss Function For Shape-Aware Medical Image Segmentation. In: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pp. 1798-1802 (2021), doi: 10.1109/ISBI48211.2021.9433775.
  • [15] Anh Nguyen, Dennis Kundrat, Giulio Dagnino, et al.: End-to-end real-time catheter segmentation with optical flow-guided warping during endovascular intervention. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 9967–9973 (2020), doi: 10.1109/ICRA40945.2020.9197307.
  • [16] Gherardini M, Mazomenos E, Menciassi A, Stoyanov D.: Catheter segmentation in X-ray fluoroscopy using synthetic data and transfer learning with light U-nets. Comput Methods Programs Biomed, 192:105420, August 2020. doi: 10.1016/j.cmpb.2020.105420. Epub 2020 Feb 29. PMID: 32171151; PMCID: PMC7903142.
  • [17] W. Wang, Q. Li, C. Xiao, D. Zhang, L. Miao, and L. Wang.: An Improved Boundary-Aware U-Net for Ore Image Semantic Segmentation. In: Sensors, vol. 21, no. 8, article no. 2615, (2021) DOI: 10.3390/s21082615.
  • [18] Pierre Ambrosini, Daniel Ruijters, Wiro J. Niessen, Adriaan Moelker, Theo van Walsum.: Fully Automatic and Real-Time Catheter Segmentation in X-Ray Fluoroscopy. arXiv preprint arXiv:1707.05137 (2017)
  • [19] Yi-de Ma, Qing Liu, and Zhi-Bai Qian.: Automated image segmentation using improved PCNN model based on cross-entropy. In: Proceedings of 2004 International Symposium on Intelligent Multimedia, Video and Speech Processing, pages 743–746. IEEE (2004)
  • [20] Saining Xie and Zhuowen Tu.: Holistically-nested edge detection. In: Proceedings of the IEEE International Conference on Computer Vision, pages 1395–1403 (2015)
  • [21] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar.: Focal loss for dense object detection. arXiv preprint arXiv:1708.02002 (2017)
  • [22] Vasyl Pihur, Susmita Datta, and Somnath Datta.: Weighted rank aggregation of cluster validation measures: a monte carlo cross-entropy approach. In: Bioinformatics, 23(13):1607–1615 (2007)
  • [23] Jianu, Tudor, Baoru Huang, Mohamed EMK Abdelaziz, Minh Nhat Vu, Sebastiano Fichera, Chun-Yi Lee, Pierre Berthet-Rayne, and Anh Nguyen. Cathsim: An open-source simulator for autonomous cannulation. arXiv preprint arXiv:2208.01455 (2022).
  • [24] Carole H Sudre, Wenqi Li, Tom Vercauteren, Sebastien Ourselin, and M Jorge Cardoso.: Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In: Deep learning in medical image analysis and multimodal learning for clinical decision support, pages 240–248. Springer (2017)
  • [25] Huang, Baoru, Yicheng Hu, Anh Nguyen, Stamatia Giannarou, and Daniel S. Elson. Detecting the Sensing Area of a Laparoscopic Probe in Minimally Invasive Cancer Surgery. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 260-270. Cham: Springer Nature Switzerland, 2023.
  • [26] Seyed Sadegh Mohseni Salehi, Deniz Erdogmus, and Ali Gholipour.: Tversky loss function for image segmentation using 3D fully convolutional deep networks. In: International Workshop on Machine Learning in Medical Imaging, pages 379–387. Springer (2017)
  • [27] Nabila Abraham and Naimul Mefraz Khan.: A novel focal Tversky loss function with improved attention U-Net for lesion segmentation. In: 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pages 683–687. IEEE (2019)
  • [28] Huang, Baoru, Jian-Qing Zheng, Anh Nguyen, Chi Xu, Ioannis Gkouzionis, Kunal Vyas, David Tuch, Stamatia Giannarou, and Daniel S. Elson. Self-supervised depth estimation in laparoscopic image using 3D geometric consistency. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 13-22. Cham: Springer Nature Switzerland, 2022.
  • [29] Zeeshan Hayder, Xuming He, and Mathieu Salzmann.: Shape-aware instance segmentation. arXiv preprint arXiv:1612.03129, 2(5):7 (2016)
  • [30] Davood Karimi and Septimiu E. Salcudean.: Reducing the Hausdorff distance in medical image segmentation with convolutional neural networks. In: IEEE Transactions on Medical Imaging, 39(2):499–513 (2019)
  • [31] Saeid Asgari Taghanaki, Yefeng Zheng, S Kevin Zhou, Bogdan Georgescu, Puneet Sharma, Daguang Xu, Dorin Comaniciu, and Ghassan Hamarneh.: Combo loss: Handling input and output imbalance in multi-organ segmentation. In: Computerized Medical Imaging and Graphics, 75:24–33 (2019)
  • [32] Ken CL Wong, Mehdi Moradi, Hui Tang, and Tanveer Syeda-Mahmood.: 3D segmentation with exponential logarithmic loss for highly unbalanced object sizes. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 612–619. Springer (2018)
  • [33] O. Ronneberger, P. Fischer, and T. Brox.: U-Net: Convolutional Networks for Biomedical Image Segmentation. In: CoRR (2015)
  • [34] Tran, Minh Q., Tuong Do, Huy Tran, Erman Tjiputra, Quang D. Tran, and Anh Nguyen. Light-weight deformable registration using adversarial learning with distilling knowledge. IEEE transactions on medical imaging 41, no. 6 (2022): 1443-1453.
  • [35] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. Yuille, and Y. Zhou.: TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. In: CoRR (2021)
  • [36] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang.: UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In: arXiv:cs.CV (2018)
  • [37] H. Huang, L. Lin, R. Tong, H. Hu, Q. Zhang, Y. Iwamoto, X. Han, Y.-W. Chen, and J. Wu.: UNet 3+: A Full-Scale Connected UNet for Medical Image Segmentation. In: arXiv:eess.IV (2020)
  • [38] E. K. Aghdam, R. Azad, M. Zarvani, and D. Merhof.: Attention Swin U-Net: Cross-Contextual Attention Mechanism for Skin Lesion Segmentation. In: arXiv:eess.IV (2022)
  • [39] Ghosh, Rahul, et al. Automated catheter segmentation and tip detection in cerebral angiography with topology-aware geometric deep learning. In: Journal of NeuroInterventional Surgery (2023).