Benchmarking Retinal Blood Vessel Segmentation Models for Cross-Dataset and Cross-Disease Generalization

Jeremiah Fadugba University of Ibadan, Nigeria. African Institute for Mathematical Sciences, Rwanda. Patrick Köhler Hertie Institute for AI in Brain Health, University of Tübingen, Germany. Lisa Koch Hertie Institute for AI in Brain Health, University of Tübingen, Germany. Department of Diabetes, Endocrinology, Nutritional Medicine and Metabolism UDEM, Inselspital, Bern University Hospital, University of Bern, Switzerland. Petru Manescu University College London, London. Philipp Berens Hertie Institute for AI in Brain Health, University of Tübingen, Germany. Tübingen AI Center, Tübingen, Germany.

Abstract

Purpose: Retinal blood vessel segmentation can extract clinically relevant information from fundus images. As manual tracing is cumbersome, algorithms based on Convolutional Neural Networks have been developed. Such studies have used small publicly available datasets for training and measuring performance, running the risk of overfitting. Here, we provide a rigorous benchmark for various architectural and training choices commonly used in the literature on the largest dataset published to date.
Methods: We train and evaluate five published models on the publicly available FIVES fundus image dataset, which exceeds previous ones in size and quality and which also contains images from common ophthalmological conditions (diabetic retinopathy, age-related macular degeneration, glaucoma). We compare the performance of different model architectures across different loss functions, levels of image quality and ophthalmological conditions and assess their ability to perform well in the face of disease-induced domain shifts.
Results: Given sufficient training data, basic architectures such as U-Net perform just as well as more advanced ones, and transfer across disease-induced domain shifts typically works well for most architectures. However, we find that image quality is a key factor determining segmentation outcomes.
Conclusions: When optimizing for segmentation performance, investing in a well-curated dataset to train a standard architecture yields better results than tuning a sophisticated architecture on a smaller dataset or one with lower image quality.
Translational Relevance: We distilled the utility of architectural advances in terms of their clinical relevance, thereby providing practical guidance for model choices depending on the circumstances of the clinical setting.

1 Introduction

Retinal fundus imaging is the only noninvasive instrument that allows us to identify geometric characteristics of a deeper microvascular system such as vessel diameters or branching angles. These techniques are not only employed in the diagnosis of retinal diseases such as glaucoma, diabetic retinopathy (DR), and age-related macular degeneration (AMD) (Zhou et al., 2022), but also in the diagnosis of microvascular conditions such as hypertension and atherosclerosis (Fathi and Naghsh-Nilchi, 2013; Kanski and Bowling, 2011). While hypertension, for instance, can lead to remodeling of blood vessels, diseases like diabetes can cause new vessels to appear. Therefore, segmenting the vessels from the fundus images, that is, delineating their boundaries from the background, is an important initial step in several diagnostic procedures.

Manual segmentation of retinal vessels is a time-consuming process, requiring approximately three to five hours per image (Jin et al., 2022). Machine learning algorithms have been shown to solve numerous medical segmentation tasks effectively (Salpea et al., 2022; Yao et al., 2024; Ma et al., 2024), including retinal vessel segmentation (Hegde et al., 2023). Recently, deep learning approaches have demonstrated the capacity to achieve human-level performance in this task (Lin et al., 2023). The majority of these approaches are based on the UNet model (Ronneberger et al., 2015). Some of them include architectural modifications tailored to the specific task of segmenting retinal blood vessels (Liu et al., 2022b; Galdran et al., 2020; Guo et al., 2020), which requires the ability to detect relatively thin objects with a high degree of connectivity. While all models share the ability to encode the image efficiently into a more compact representation, they differ in the inductive biases according to which they are designed. For example, FR-UNet (Liu et al., 2022b) aims to retain the full resolution of the image, whereas MA-Net (Fan et al., 2020) places greater emphasis on the encoded dependencies between local and global image features.

Despite the plethora of different approaches, there is currently no direct comparison that highlights the respective strengths and weaknesses of the different models for a wide range of clinical settings. Moreover, existing work on retinal vessel segmentation has been evaluated on few and relatively small publicly available datasets (Fu et al., 2016; Zhao et al., 2020; Zhou et al., 2018; Wang and Jiang, 2019). This is in part due to the absence of large publicly available retinal image datasets with annotated blood vessels. In addition to their small sample size, the most commonly studied datasets, DRIVE (Staal et al., 2004) and CHASEDB1 (Fraz et al., 2012), vary strongly in their image quality and labeling protocols (Galdran et al., 2020). Furthermore, they contain few images of patients with retinal disease, such that it is currently unclear how the available models generalize to real patient populations with differing characteristics in terms of retinal diseases, image quality, and varying dataset sources.

Given the wide range of choices for architectures, loss functions, or training protocols, it is also unclear which of these factors actually matter when a large dataset is used for training.

In this paper, we present a comprehensive survey and benchmark of the current state of the art in retinal blood vessel segmentation on fundus images. We investigate the most commonly used configurations of model architectures and loss functions for model training and evaluate the segmentation performance on three datasets, including the largest publicly available dataset (Jin et al., 2022) as well as previously well-established smaller benchmark datasets (Staal et al., 2004; Fraz et al., 2012). We explicitly test whether the different models are robust to the inherent variability of fundus images due to different diseases or changes in image quality.

2 Methods

2.1 Data

We developed and evaluated our models on three publicly available retinal fundus datasets for vessel segmentation (Fig. 1, Table 1). We used the largest high-quality dataset to date for Fundus Image Vessel Segmentation (FIVES) (Jin et al., 2022), which has so far not been used to analyze segmentation models. It consists of 800 retinal images from 573 patients. Retinal vessels were manually annotated in a consensus procedure involving three senior ophthalmologists and 24 medical residents. The dataset includes disease labels for AMD (“A”, 200 images), DR (“D”, 200), Glaucoma (“G”, 200) and Healthy (“N”, 200), as well as manually assessed image quality grades. Image quality was attributed to either illumination and color ($644$ images), blur ($669$ images) or low contrast ($765$ images).

In addition, we used well-known fundus datasets for further evaluation: The DRIVE dataset (Staal et al., 2004) consists of 40 fundus images obtained from a diabetic retinopathy screening program, of which seven images (not further specified) contained pathologies. Finally, the CHASEDB1 dataset (Fraz et al., 2012) consists of 28 fundus images of the left and right eyes of 14 children.

Figure 1: Example images and manual segmentations for FIVES (including an example from each subgroup), DRIVE and CHASEDB1.

Table 1: Summary of publicly available retinal vessel segmentation datasets used in this study including the prevalence of images with diabetic retinopathy (“D”), age-related macular degeneration (“A”), glaucoma (“G”), as well as healthy (“N”) images.

Dataset	Year	Images	Resolution	Disease prevalence	Annotators	Train/Test split
FIVES	2021	800	$2048\times 2048$	200 N, 200 A, 200 D, 200 G	Group	600/200
DRIVE	2004	40	$768\times 584$	33 N, 7 D	3	20/20
CHASEDB1	2011	28	$990\times 960$	28 N	2	20/8

All images were pre-processed using Contrast Limited Adaptive Histogram Equalization (CLAHE) (Pizer et al., 1987) with a clip limit of 2 and a grid size of $8\times 8$. The images and segmentation masks were resampled to $512\times 512$ pixels. Finally, the datasets were divided into training and test folds according to the “official” splits.

2.2 Segmentation model architecture

We performed a comprehensive literature search and selected 22 articles on retinal vessel segmentation published between 2016 and 2022 (see Table 2). Based on their reported segmentation performance and the availability of code, we included the standard UNet segmentation model and four variants in our benchmark. In the following, we briefly outline the respective model architectures.

Table 2: Self-reported performance of some existing vessel segmentation methods from our literature survey. We report the loss used for training, reported Dice (DSC), Area under the Curve (AUC), and Matthews Correlation Coefficient (MCC) metrics for models evaluated on DRIVE and CHASEDB1, where available. We also show which models reported their performance at native resolution and omit others that either did not report which approach they used or which used a different evaluation approach. The best results are shown in bold.

			DRIVE			CHASE-DB1
Method	Resolution	Loss	DSC	AUC	MCC	DSC	AUC	MCC
(Fu et al., 2016)	-	-	78.75	94.04	-	75.49	94.82	-
(Zhang et al., 2016)	-	-	-	96.36	-	-	96.06	-
(Orlando et al., 2017)	-	-	78.57	95.07	75.56	73.32	95.24	70.46
(Gu et al., 2017)	-	-	78.86	-	75.89	72.02	-	69.08
(Wu et al., 2018)	-	-	-	98.07	-	-	98.25	-
(Yan et al., 2018)	-	-	81.83	97.52	-	-	97.81	-
(Zhuang, 2019)	-	BCE	82.02	97.93	-	80.31	98.39	-
(Wang and Jiang, 2019)	-	-	81.44	-	78.94	78.63	-	76.55
(Wang et al., 2019a) (DEU-Net)	-	DiceBCE	82.70	97.72	-	80.37	98.12	-
(Araújo et al., 2019)	-	BCE	-	97.90	-	-	98.20	-
(Fu et al., 2019)	-	-	80.48	97.19	-	-	-	-
(Wang et al., 2019b)	-	-	80.93	-	78.51	78.09	-	75.91
(Gu et al., 2019)	-	Dice	-	97.79	-	-	-	-
(Zhao et al., 2019) (R-sGAN)	-	-	78.82	-	-	-	-	-
(Laibacher et al., 2019) (M2U-Net)	-	DiceBCE	80.91	97.14	-	80.06	97.03	-
(Shin et al., 2019)	-	-	82.63	98.01	-	80.34	98.30	-
(Li et al., 2019) (iter-Net)	Native	-	82.05	98.16	-	80.73	98.51	-
(Guo et al., 2020) (SA-UNet)	Native	BCE	82.63	98.64	80.97	81.53	99.05	80.33
(Zhou et al., 2021)	-	-	83.16	98.86	-	82.71	99.20	-
(Kamran et al., 2021) (RV-GAN)	-	-	86.90	98.87	-	89.57	99.14	-
(Galdran et al., 2022a) (W-Net)	Native	BCE	82.79	98.10	80.24	81.69	98.47	79.74
(Liu et al., 2022b) (FR-UNet)	Native	BCE	83.16	98.89	-	81.51	99.20	-

UNet

The standard UNet (Ronneberger et al., 2015) is the most common baseline algorithm for segmentation tasks including vessel segmentation. Its architecture is characterized by a U-shaped network structure consisting of an encoder and a decoder. The encoder is responsible for capturing the features of the input image. The decoder, on the other hand, takes the learned features from the encoder and gradually upsamples them to the original input image size. The purpose of the decoder is to map the result of the encoder onto a pixel-wise segmentation map that has the same dimensions as the input image. One of the innovations of the UNet is the use of skip connections. These allow the decoder to access feature maps from the encoder at multiple resolutions. In this way, the UNet combines both high-level and low-level features during the decoding process, helping to capture fine-grained details in the segmentation output. The final output of the UNet is a pixel-wise segmentation map where each pixel is assigned a probability of belonging to the foreground.

FR-UNet

The FR-UNet architecture (Liu et al., 2022b) is the current state-of-the-art model for vessel segmentation. It is based on the idea of preserving the full resolution of the image during training. This is achieved by expanding the network horizontally and vertically through a multi-resolution convolution mechanism.

MA-Net

The Multi-scale Attention Network (MA-Net) (Fan et al., 2020) was initially developed for liver and tumor segmentation and introduces a self-attention mechanism to adaptively integrate local features with their global dependencies. It also aims to exploit the full resolution of the image to capture rich information during end-to-end training but, unlike FR-UNet, it captures rich contextual dependencies through attention, using two blocks: a Position-wise Attention Block (PAB), which captures the spatial dependencies between pixels in a global view, and a Multi-scale Fusion Attention Block (MFAB), which captures the channel dependencies between arbitrary feature maps through multi-scale semantic feature fusion.

SA-UNet

The Spatial Attention UNet (SA-UNet) (Guo et al., 2020) modifies the standard UNet architecture by introducing a spatial attention module. This module learns the spatial relationship between features from the encoder stage. The spatial attention map is derived by applying max-pooling and average-pooling operations along the channel axis and concatenating the results to produce an “efficient feature” map. The features from this stage are then passed to the decoder.

W-Net

W-Net (Galdran et al., 2020) is a cascaded extension of the UNet architecture. An input image is passed through a standard UNet, and the resulting output is concatenated with the input image and passed through a second UNet to produce the final output. This cascading approach enhances prediction performance but doubles the number of parameters in the network, thereby increasing the computational requirements. W-Net compensates for this by keeping the base UNet shallow, that is, by reducing the number of layers in the network.

2.3 Loss functions used for training

For retinal vessel segmentation, various loss functions have been used, including the Binary Cross-Entropy (BCE), the SoftDice loss, the DiceBCE loss and losses based on the centerline Dice (clDice) score.

The Binary Cross-Entropy (BCE) loss regards segmentation as a pixel-wise binary classification problem. For a ground truth segmentation $y$ and predicted probabilities $\hat{y}$, the BCE loss is defined as

$$\mathcal{L}_{\textrm{BCE}}(y,\hat{y}):=-\frac{1}{N}\sum_{i=1}^{N}\left[y_{i}\cdot\log(\hat{y}_{i})+(1-y_{i})\cdot\log(1-\hat{y}_{i})\right]. \tag{1}$$

Here, $N$ is the total number of pixels in $y$, $i$ indexes the pixels, and we assume $y_{i}\in\{0,1\}$. For foreground pixels ($y_{i}=1$), only the predicted log-probability $\log(\hat{y}_{i})$ of belonging to the foreground contributes to the loss, while for background pixels ($y_{i}=0$), the contribution is the log-probability of belonging to the background, $\log(1-\hat{y}_{i})$.

The SoftDice loss function (Milletari et al., 2016) reformulates the Dice Similarity Score, a popular evaluation metric for image segmentation, to make it differentiable with respect to the predicted probabilities, such that it can be used as an optimization criterion. It is defined as

$$\mathcal{L}_{\textrm{SoftDice}}(y,\hat{y}):=1-\frac{2\sum_{i=1}^{N}y_{i}\cdot\hat{y}_{i}+\epsilon}{\sum_{i=1}^{N}y_{i}+\sum_{i=1}^{N}\hat{y}_{i}+\epsilon}, \tag{2}$$

where $\epsilon>0$ is a smoothing factor to prevent division by zero.

The DiceBCE loss function is a combination of both the SoftDice and the Binary Cross-Entropy. Previous work has shown that an optimal combination of these two loss functions improves performance (Liu et al., 2022a; Ma et al., 2021; Galdran et al., 2022b). The DiceBCE is calculated as follows:

$$\mathcal{L}_{\textrm{DiceBCE}}(y,\hat{y})=\alpha\cdot\mathcal{L}_{\textrm{BCE}}(y,\hat{y})+\beta\cdot\mathcal{L}_{\textrm{SoftDice}}(y,\hat{y}), \tag{3}$$

where $\alpha$ and $\beta$ are weighting factors that balance the contribution of each loss.
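As an illustration of Eqs. (1) to (3), the following is a minimal pure-Python sketch operating on flattened pixel lists. It is not the implementation used in the benchmark: the function names, the clamping constant, the smoothing value and the equal weights $\alpha=\beta=0.5$ are assumptions chosen for the example, since the weighting used in the study is not stated here.

```python
import math


def bce_loss(y, y_hat, clamp=1e-7):
    """Eq. (1): mean binary cross-entropy over all pixels."""
    total = 0.0
    for y_i, p_i in zip(y, y_hat):
        # Clamp probabilities to avoid log(0); the clamp value is an assumption.
        p_i = min(max(p_i, clamp), 1.0 - clamp)
        total += y_i * math.log(p_i) + (1 - y_i) * math.log(1.0 - p_i)
    return -total / len(y)


def soft_dice_loss(y, y_hat, eps=1e-6):
    """Eq. (2): one minus the smoothed soft Dice overlap."""
    intersection = sum(y_i * p_i for y_i, p_i in zip(y, y_hat))
    return 1.0 - (2.0 * intersection + eps) / (sum(y) + sum(y_hat) + eps)


def dice_bce_loss(y, y_hat, alpha=0.5, beta=0.5):
    """Eq. (3): weighted sum of BCE and SoftDice (weights are assumptions)."""
    return alpha * bce_loss(y, y_hat) + beta * soft_dice_loss(y, y_hat)


# Toy example: four pixels, two of them vessel (1) and two background (0).
y = [1, 1, 0, 0]
y_hat = [0.9, 0.6, 0.2, 0.1]
print(bce_loss(y, y_hat), soft_dice_loss(y, y_hat), dice_bce_loss(y, y_hat))
```

In a training setting the same expressions would be written with tensor operations so that gradients can flow through $\hat{y}$; the scalar version above is only meant to make the terms of each equation explicit.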

In addition, we use a loss function based on the centerline Dice (clDice) (Shit et al., 2021), a similarity measure that is calculated from the intersection of the morphological skeletons $S_{\hat{Y}},S_{Y}$ of the predicted and ground truth segmentations $\hat{Y},Y$. The rationale behind incorporating the vessel structure into the loss function is twofold: to preserve tiny vessels, which cover only a small area but carry important structural information and therefore contribute little to BCE- or DSC-based losses if omitted or annotated wrongly, and to enforce the connectivity of the vessels, which is not considered by Dice scores. As proposed in (Shit et al., 2021), we first compute the fraction of $S_{Y}$ that lies in $\hat{Y}$, given as $S_{Y}2\hat{Y}:=\frac{\lvert S_{Y}\cap\hat{Y}\rvert}{\lvert S_{Y}\rvert}$, and, analogously, the fraction of $S_{\hat{Y}}$ that lies in $Y$, given as $S_{\hat{Y}}2Y:=\frac{\lvert S_{\hat{Y}}\cap Y\rvert}{\lvert S_{\hat{Y}}\rvert}$. The clDice is then defined as

$$\textrm{clDice}(Y,\hat{Y}):=2\,\frac{S_{\hat{Y}}2Y\cdot S_{Y}2\hat{Y}}{S_{\hat{Y}}2Y+S_{Y}2\hat{Y}}. \tag{4}$$
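To make Eq. (4) concrete, here is a small sketch that evaluates the (non-differentiable) clDice on binary masks represented as sets of pixel coordinates. The skeletons are assumed to be given; in practice they would come from a morphological skeletonization routine. The function name and the convention of returning 0 for empty skeletons are assumptions made for this example.

```python
def cl_dice(mask_true, mask_pred, skel_true, skel_pred):
    """Eq. (4) for masks and skeletons given as sets of (row, col) pixels."""
    if not skel_true or not skel_pred:
        return 0.0  # convention for empty skeletons (an assumption)
    s_y_to_yhat = len(skel_true & mask_pred) / len(skel_true)  # S_Y 2 Y_hat
    s_yhat_to_y = len(skel_pred & mask_true) / len(skel_pred)  # S_Yhat 2 Y
    denominator = s_y_to_yhat + s_yhat_to_y
    if denominator == 0:
        return 0.0
    return 2.0 * s_y_to_yhat * s_yhat_to_y / denominator


# Toy vessel: 3 pixels thick and 5 pixels long; its skeleton is the middle row.
mask_true = {(r, c) for r in range(3) for c in range(5)}
skel_true = {(1, c) for c in range(5)}

# The prediction misses the last two columns of the vessel.
mask_pred = {(r, c) for r in range(3) for c in range(3)}
skel_pred = {(1, c) for c in range(3)}

print(cl_dice(mask_true, mask_pred, skel_true, skel_pred))  # 0.6 and 1.0 combine to 0.75
```

In this toy case, three of the five ground-truth skeleton pixels lie inside the prediction (0.6), while the whole predicted skeleton lies inside the ground truth (1.0), so the harmonic mean is 0.75.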

The clDice can be made differentiable using an iterative soft skeletonization approach (Shit et al., 2021), in which min- and max-pooling serve as proxies for morphological erosion and dilation. The resulting soft-clDice can be incorporated as an auxiliary loss term, weighted by a factor $\alpha$, to enforce morphological similarity:

$$\mathcal{L}(y,\hat{y}):=(1-\alpha)\mathcal{L}_{\textrm{SoftDice}}(y,\hat{y})+\alpha(1-\textrm{softclDice}(y,\hat{y})). \tag{5}$$

2.4 Training details

We used data augmentation techniques, including random horizontal and vertical flips with a fixed random rotation during training. All model training was carried out using PyTorch on a single NVIDIA GeForce RTX 2080 Ti GPU. The models were trained for 70 epochs with a batch size of 4. We employed the Adam optimizer, incorporating a weight decay factor of $1\times 10^{-5}$, an initial learning rate set to $1\times 10^{-4}$, and a cosine annealing strategy for learning rate scheduling.
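For readers unfamiliar with cosine annealing, the sketch below evaluates the standard closed-form schedule with the initial learning rate and epoch count stated above. The minimum learning rate of 0, the annealing period equal to the full 70 epochs, and per-epoch updates are assumptions, as these details are not specified in the text.

```python
import math


def cosine_annealing_lr(epoch, base_lr=1e-4, total_epochs=70, min_lr=0.0):
    """Closed-form cosine annealing; min_lr and the period are assumptions."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


# Learning rate at the start, in the middle, and near the end of training.
for epoch in (0, 35, 69):
    print(epoch, cosine_annealing_lr(epoch))
```

The schedule starts at the initial learning rate, passes through half of it at the midpoint, and decays smoothly towards the minimum value at the end of training.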

2.5 Measures used for evaluation

To evaluate the performance of the models, we used the Dice Similarity Coefficient (DSC) and the Matthews Correlation Coefficient (MCC). The evaluation was performed on the official test set of each dataset.
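Both measures can be computed from the pixel-wise confusion counts of a binarized prediction. The following sketch uses the standard definitions of DSC and MCC; the function names and the convention of returning 0 when a denominator vanishes are assumptions for this example, and the binarization threshold applied to the predicted probabilities is not part of it.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for two equally long lists of 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def dice_coefficient(tp, fp, fn):
    """DSC = 2 TP / (2 TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Toy example with eight pixels: 2 TP, 4 TN, 1 FP, 1 FN.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(dice_coefficient(tp, fp, fn), matthews_corrcoef(tp, tn, fp, fn))
```

Unlike the DSC, the MCC also accounts for true negatives, which is relevant for vessel segmentation because background pixels vastly outnumber vessel pixels.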

2.6 Code and data availability

All implemented models, analysis and visualization code are available on GitHub (https://github.com/berenslab/Retinal-Vessel-Segmentation-Benchmark). The datasets FIVES (https://shorturl.at/WQyGV), DRIVE (https://drive.grand-challenge.org/), and CHASEDB1 (https://blogs.kingston.ac.uk/retinal/chasedb1/) used in this study are officially available at the referenced repositories.

3 Results

We first analyzed the performance of five state-of-the-art models trained with different loss functions in a classical in-domain setting using the FIVES dataset (Jin et al., 2022), which is the largest dataset of fully annotated fundus images available. We then investigated the generalization capabilities of the different models and studied their performance in a cross-dataset setting. Finally, we compared how the performance of these models varied under different ophthalmological conditions and for different image quality levels.

3.1 Choice of architecture and training loss for vessel segmentation

We trained five prominent segmentation model architectures including a standard UNet on the FIVES training split using four different loss functions (see Methods). We then evaluated their performance on the test split of the same dataset using the Dice Similarity Coefficient (DSC) and the Matthews Correlation Coefficient (MCC).

Regardless of the loss function or the evaluation measure, the UNet, FR-UNet and MA-Net performed similarly and clearly better than the SA-UNet and W-Net (Table 3). For example, the top three models achieved a DSC of about 0.9 regardless of the loss function, while the other two models achieved a DSC of about 0.85. The loss function had a comparatively minor influence on the final performance, with DiceBCE and clDice leading to slightly better performing models.

The performance of the top three models was thus close to the reported inter-rater consistency among junior graders at a DSC of 0.92 but slightly lower than the inter-rater consistency between a senior grader and several junior graders at a DSC of 0.96 (Jin et al., 2022). Thus, the best models achieved near-human performance when trained on a large dataset, but surprisingly, none of the architectural variants were able to meaningfully improve upon the baseline UNet architecture.

Table 3: Performance of our implementation of the studied segmentation models on the FIVES dataset across various loss functions.

	DSC				MCC
	BCE	Dice	DiceBCE	clDice	BCE	Dice	DiceBCE	clDice
UNet	89.76	89.87	90.15	90.06	89.04	89.19	89.47	89.38
FR-UNet	90.14	90.04	90.37	90.29	89.57	89.47	89.80	89.69
MA-Net	89.41	89.88	89.97	90.05	88.76	89.23	89.29	89.36
SA-UNet	85.90	85.57	86.55	85.76	84.91	84.64	85.64	84.83
W-Net	84.42	85.62	85.65	85.67	83.50	84.72	84.74	84.75

3.2 Robustness of different architectures to domain shifts

We further investigated the generalization capabilities of the trained models on additional datasets that were not part of the training procedure. Due to the domain shift, we expect that a model trained on one dataset would not achieve the same level of performance when tested on another dataset. For this purpose, we additionally used the much smaller CHASEDB1 and DRIVE datasets (see Methods), which look clearly different from the FIVES dataset and thus should induce strong domain shifts (Fig. 1). In-domain, the trained models reached a performance similar to that reported in the literature (Table 4). For this analysis, we focused on the DiceBCE loss function for all models for simplicity. We trained all models individually on each dataset and evaluated their performance on the two remaining datasets.

Table 4: In-domain Dice scores (DSC, in %) of all models on each dataset.

Model	FIVES	DRIVE	CHASEDB1
UNet	90.15	77.23	80.55
FR-UNet	90.37	80.62	81.65
MA-Net	89.97	77.78	80.56
SA-UNet	86.55	81.65	79.30
W-Net	85.65	77.98	74.98

Table 5: Number of high quality images per category in the train split of FIVES (n=600).

	Illumination/ Color	Blur	Contrast
AMD	142	134	150
DR	127	117	149
Glaucoma	82	109	131
Normal	145	150	150
$\Sigma$	496	510	580

We found that the simple UNet and the SA-UNet handled dataset shifts better than other models, as their cross-dataset performance was closer to their in-domain performance (Fig. 2). Interestingly, the SA-UNet model, which showed comparatively low performance in the in-domain settings, did not have much of a performance gap in the cross-dataset setting. We found that the W-Net and the FR-UNet were the most sensitive to dataset shifts. Models trained on the FIVES dataset (green diamonds) generalized well to both the DRIVE and CHASEDB1 datasets (Fig. 2). This is likely due to the large number of samples available for training and their high-quality annotation. Interestingly, for the UNet, the cross-dataset performance of a model trained on FIVES, when evaluated on DRIVE and CHASEDB1, was no worse than the in-domain performance of a model trained directly on these datasets (Fig. 2a). This suggests that generalization capabilities across datasets are an important factor to be evaluated when developing new retinal vessel segmentation methods.

3.3 Robustness of different architectures to disease-related domain shifts

Next, we evaluated the performance of all models in four subgroups related to ophthalmological conditions (AMD, DR, glaucoma and healthy images). We considered two setups: First, we trained each model on three subgroups and tested it on the remaining subgroup (denoted as 3 vs. 1) to mimic the scenario where a model is applied to fundus images with diseases not present in the training data. In addition, we trained each model on the whole training set and evaluated its performance for the individual subgroups in the test set.
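The 3 vs. 1 protocol can be sketched as follows. The image identifiers are hypothetical placeholders, and the choice to draw the three training subgroups from the training split and the held-out subgroup from the test split is an assumption about the setup rather than a documented detail.

```python
def three_vs_one_folds(train_items, test_items, subgroups=("A", "D", "G", "N")):
    """Build one fold per held-out subgroup.

    Each item is an (image_id, subgroup) pair. For every subgroup, the model
    would be trained on training images of the other three subgroups and
    evaluated on test images of the held-out subgroup.
    """
    folds = {}
    for held_out in subgroups:
        train = [name for name, group in train_items if group != held_out]
        test = [name for name, group in test_items if group == held_out]
        folds[held_out] = (train, test)
    return folds


# Hypothetical image identifiers, one per subgroup and split.
train_items = [("img_01", "A"), ("img_02", "D"), ("img_03", "G"), ("img_04", "N")]
test_items = [("img_05", "A"), ("img_06", "D"), ("img_07", "G"), ("img_08", "N")]

folds = three_vs_one_folds(train_items, test_items)
print(folds["G"])  # training images without glaucoma, test images with glaucoma only
```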

We found that the median performance in the 3 vs. 1 setting was similar to the in-domain performance of the respective model, with W-Net having the maximum performance drop of 4.03% (Fig. 3 a, b, c). Thus, domain shifts with respect to disease did not affect the segmentation quality substantially, at least when the model was trained on a sufficiently large dataset.

We also found that all models were better at segmenting fundus images from healthy eyes and eyes with AMD compared to fundus images from DR and glaucoma patients (Fig. 3 a, b). UNet, FR-UNet and MA-Net also performed best in the subgroup analysis, whereas SA-UNet and W-Net performed worse, mirroring their lower performance in the standard setting (Table 3).

3.4 Robustness of different architectures to image quality

Finally, we evaluated how robustly the different models performed on images of different quality. Each image in the FIVES dataset is labeled as “high” or “low” quality with respect to illumination and color, blur, and low contrast, scored by an automatic algorithm (Wang et al., 2016) (Fig. 4c, Table 5). We combined all image quality metrics into an overall quality score, with a score of $0$ representing low quality in all quality criteria, and a score of $3$ representing good quality in all criteria.
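One straightforward way to obtain such a score is to count the criteria rated as high quality, as in the sketch below. Summing the three binary labels is an assumption consistent with the 0 to 3 range described above, and the Dice values in the example are arbitrary placeholders, not results from this study.

```python
from collections import defaultdict
from statistics import mean


def overall_quality(illumination_ok, blur_ok, contrast_ok):
    """Count how many of the three criteria are rated high quality (0 to 3)."""
    return int(bool(illumination_ok)) + int(bool(blur_ok)) + int(bool(contrast_ok))


def mean_dice_by_quality(records):
    """Group per-image Dice scores by overall quality score and average them."""
    buckets = defaultdict(list)
    for flags, dice in records:
        buckets[overall_quality(*flags)].append(dice)
    return {score: mean(values) for score, values in sorted(buckets.items())}


# Placeholder records: ((illumination, blur, contrast), per-image Dice).
records = [
    ((1, 1, 1), 0.92),
    ((1, 1, 1), 0.90),
    ((1, 0, 1), 0.85),
    ((0, 0, 0), 0.70),
]
print(mean_dice_by_quality(records))
```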

We evaluated the segmentation performance of all models for each image quality level (Figure 4 a). Not surprisingly, the models performed worst for the poorest quality images. All models benefited to a similar degree from increasing image quality, with the exception of the W-Net, which appeared to be most susceptible to image quality degradation and benefited most from quality improvements. Interestingly, the FR-UNet, being one of the overall top-performing models, performed worse than the generally low-performing SA-UNet model for the subgroup with the poorest overall image quality. This may be because the FR-UNet operates at full resolution, potentially making its intermediate representations more susceptible to noise, which is averaged out when the resolution is downsampled in the other models.

We then investigated the three individual components of image quality separately. Poor contrast affected all models strongly and led to the highest drop in segmentation performance (Figure 4 b), while the effect of blur and illumination was less pronounced. Nevertheless, all components of image quality had large effects on segmentation performance (much larger than the effect of different diseases), underscoring the importance of good image quality for tasks such as vessel segmentation.

4 Discussion

In this paper, we comprehensively compared the state of the art in retinal vessel segmentation (Liu et al., 2022b; Guo et al., 2020; Galdran et al., 2020) and investigated commonly used model architectures and loss functions on three publicly available datasets (Staal et al., 2004; Jin et al., 2022; Fraz et al., 2012). Most existing evaluation papers (Hegde et al., 2023) primarily focus on summarizing available research and generally conclude by endorsing deep learning (DL)-based approaches, such that a rigorous benchmark of a variety of DL methods and their robustness has been much needed. We evaluated the models’ in-domain performance, their generalization to different datasets, and their robustness to images with different diseases and different levels of image quality.

We found that the choice of loss function did not crucially affect the segmentation quality, but the clDice and DiceBCE losses marginally outperformed the other optimization criteria. Similarly, we could not identify a single optimal model architecture, with UNet, FR-UNet and MA-Net performing similarly well. This implies that, interestingly, the original UNet without modification remains state-of-the-art for vessel segmentation.

Deep learning-based segmentation models are known to be susceptible to dataset shifts (Boone et al., 2023; Koch et al., 2024), and, as we show in this study, retinal vessel segmentation models are no exception. Interestingly, while variations in image quality due to differences in the imaging setup have the largest impact on the segmentation output, variations due to disease manifestation/prevalence do not substantially affect the models’ performance. Therefore, high image quality is crucial for successful vessel segmentation. Among the publicly available datasets, training on FIVES, which has the largest sample size at a high image quality, allowed for the best generalization to other datasets when averaged over all models. Hence, we recommend choosing FIVES as a training set for studies where cross-dataset generalization matters.

While models do not generalize well from high to poor image quality, they generalize quite well to unseen diseases. When curating data for practice, this means that ensuring high image quality should be the primary concern, whereas generalization to diseases not present in the training data can likely be expected. Foundation models for medical image segmentation trained on a wide range of imaging data (Ma et al., 2024) may further improve generalization of retinal image segmentation, but currently face challenges in accurately segmenting retinal vessels even with fine-tuning (Shi et al., 2023).

Finally, future studies should consider how different segmentation algorithms affect downstream tasks and evaluate them directly as part of the clinical pipelines. For example, probabilistic segmentation approaches have been evaluated as part of an optic disc and cup segmentation pipeline for glaucoma diagnosis (Wundram et al., 2024). Similar evaluation pipelines for vessel segmentation would be needed to directly address the clinical potential of different segmentation models.

Acknowledgements

This work was supported by a grant from the Carnegie Corporation of New York (provided through the African Institute for Mathematical Sciences), the German Science Foundation (BE5601/8-1 and the Excellence Cluster 2064 “Machine Learning — New Perspectives for Science”, project number 390727645), the Carl Zeiss Foundation (“Certification and Foundations of Safe Machine Learning Systems in Healthcare”) and the Hertie Foundation. PB is also a member of EKFS-Kolleg “ClinBrain”.

References

Araújo et al. (2019) Ricardo J. Araújo, Jaime S. Cardoso, and Hélder P. Oliveira. A deep learning design for improving topology coherence in blood vessel segmentation. In Dinggang Shen, Tianming Liu, Terry M. Peters, Lawrence H. Staib, Caroline Essert, Sean Zhou, Pew-Thian Yap, and Ali Khan, editors, Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, pages 93–101, Cham, 2019. Springer International Publishing. ISBN 978-3-030-32239-7.
Boone et al. (2023) Lyndon Boone, Mahdi Biparva, Parisa Mojiri Forooshani, Joel Ramirez, Mario Masellis, Robert Bartha, Sean Symons, Stephen Strother, Sandra E Black, Chris Heyn, Anne L Martel, Richard H Swartz, and Maged Goubran. ROOD-MRI: Benchmarking the robustness of deep learning segmentation models to out-of-distribution and corrupted data in MRI. Neuroimage, 278(120289):120289, September 2023.
Fan et al. (2020) Tongle Fan, Guanglei Wang, Yan Li, and Hongrui Wang. Ma-net: A multi-scale attention network for liver and tumor segmentation. IEEE Access, 8:179656–179665, 2020. doi: 10.1109/ACCESS.2020.3025372.
Fathi and Naghsh-Nilchi (2013) Abdolhossein Fathi and Ahmad Reza Naghsh-Nilchi. Automatic wavelet-based retinal blood vessels segmentation and vessel diameter estimation. Biomedical Signal Processing and Control, 8(1):71–80, 2013.
Fraz et al. (2012) Muhammad Moazam Fraz, Paolo Remagnino, Andreas Hoppe, Bunyarit Uyyanonvara, Alicja R. Rudnicka, Christopher G. Owen, and Sarah A. Barman. An ensemble classification-based approach applied to retinal blood vessel segmentation. IEEE Transactions on Biomedical Engineering, 59(9):2538–2548, 2012. doi: 10.1109/TBME.2012.2205687.
Fu et al. (2016) Huazhu Fu, Yanwu Xu, Stephen Lin, Damon Wing Kee Wong, and Jiang Liu. Deepvessel: Retinal vessel segmentation via deep learning and conditional random field. In Sebastien Ourselin, Leo Joskowicz, Mert R. Sabuncu, Gozde Unal, and William Wells, editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016, pages 132–139, Cham, 2016. Springer International Publishing. ISBN 978-3-319-46723-8.
Fu et al. (2019) Weilin Fu, Katharina Breininger, Roman Schaffert, Nishant Ravikumar, and Andreas Maier. A divide-and-conquer approach towards understanding deep networks. In Dinggang Shen, Tianming Liu, Terry M. Peters, Lawrence H. Staib, Caroline Essert, Sean Zhou, Pew-Thian Yap, and Ali Khan, editors, Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, pages 183–191, Cham, 2019. Springer International Publishing. ISBN 978-3-030-32239-7.
Galdran et al. (2020) Adrian Galdran, André Anjos, José Dolz, Hadi Chakor, Hervé Lombaert, and Ismail Ben Ayed. The little w-net that could: State-of-the-art retinal vessel segmentation with minimalistic models, 2020.
Galdran et al. (2022a) Adrian Galdran, André Anjos, José Dolz, Hadi Chakor, Hervé Lombaert, and Ismail Ben Ayed. State-of-the-art retinal vessel segmentation with minimalistic models. Scientific Reports, 12(1):6174, Apr 2022a. ISSN 2045-2322. doi: 10.1038/s41598-022-09675-y. URL https://doi.org/10.1038/s41598-022-09675-y.
Galdran et al. (2022b) Adrian Galdran, Gustavo Carneiro, and Miguel Ángel González Ballester. On the optimal combination of cross-entropy and soft dice losses for lesion segmentation with out-of-distribution robustness, 2022b.
Gu et al. (2017) Lin Gu, Xiaowei Zhang, He Zhao, Huiqi Li, and Li Cheng. Segment 2d and 3d filaments by learning structured and contextual features. IEEE Transactions on Medical Imaging, 36(2):596–606, 2017. doi: 10.1109/TMI.2016.2623357.
Gu et al. (2019) Zaiwang Gu, Jun Cheng, Huazhu Fu, Kang Zhou, Huaying Hao, Yitian Zhao, Tianyang Zhang, Shenghua Gao, and Jiang Liu. Ce-net: Context encoder network for 2d medical image segmentation. IEEE Transactions on Medical Imaging, 38(10):2281–2292, 2019. doi: 10.1109/TMI.2019.2903562.
Guo et al. (2020) Changlu Guo, Márton Szemenyei, Yugen Yi, Wenle Wang, Buer Chen, and Changqi Fan. Sa-unet: Spatial attention u-net for retinal vessel segmentation, 2020.
Hegde et al. (2023) Govardhan Hegde, Srikanth Prabhu, Shourya Gupta, Gautham Manuru Prabhu, Anshita Palorkar, Metta Venkata Srujan, and Sulatha V Bhandary. A systematic review of deep learning approaches for vessel segmentation in retinal fundus images. In Journal of Physics: Conference Series, volume 2571, page 012021. IOP Publishing, 2023.
Jin et al. (2022) Kai Jin, Xingru Huang, Jingxing Zhou, Yunxiang Li, Yan Yan, Yibao Sun, Qianni Zhang, Yaqi Wang, and Juan Ye. Fives: A fundus image dataset for artificial intelligence based vessel segmentation. Scientific Data, 9(1):475, Aug 2022. ISSN 2052-4463. doi: 10.1038/s41597-022-01564-3. URL https://doi.org/10.1038/s41597-022-01564-3.
Kamran et al. (2021) Sharif Amit Kamran, Khondker Fariha Hossain, Alireza Tavakkoli, Stewart Lee Zuckerbrod, Kenton M. Sanders, and Salah A. Baker. RV-GAN: Segmenting retinal vascular structure in fundus photographs using a novel multi-scale generative adversarial network. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, pages 34–44. Springer International Publishing, 2021. doi: 10.1007/978-3-030-87237-3_4. URL https://doi.org/10.1007%2F978-3-030-87237-3_4.
Kanski and Bowling (2011) Jack J Kanski and Brad Bowling. Clinical ophthalmology: a systematic approach. Elsevier Health Sciences, 2011.
Koch et al. (2024) Lisa M Koch, Christian F Baumgartner, and Philipp Berens. Distribution shift detection for the postmarket surveillance of medical AI algorithms: a retrospective simulation study. NPJ Digit. Med., 7(1):120, May 2024.
Laibacher et al. (2019) Tim Laibacher, Tillman Weyde, and Sepehr Jalali. M2u-net: Effective and efficient retinal vessel segmentation for real-world applications. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 115–124, 2019. doi: 10.1109/CVPRW.2019.00020.
Li et al. (2019) Liangzhi Li, Manisha Verma, Yuta Nakashima, Hajime Nagahara, and Ryo Kawasaki. Iternet: Retinal image segmentation utilizing structural redundancy in vessel networks, 2019.
Lin et al. (2023) Ji Lin, Xingru Huang, Huiyu Zhou, Yaqi Wang, and Qianni Zhang. Stimulus-guided adaptive transformer network for retinal blood vessel segmentation in fundus images. Medical Image Analysis, 89:102929, 2023. ISSN 1361-8415. doi: 10.1016/j.media.2023.102929. URL https://www.sciencedirect.com/science/article/pii/S1361841523001895.
Liu et al. (2022a) Bingyuan Liu, Jose Dolz, Adrian Galdran, Riadh Kobbi, and Ismail Ben Ayed. The hidden label-marginal biases of segmentation losses, 2022a.
Liu et al. (2022b) Wentao Liu, Huihua Yang, Tong Tian, Zhiwei Cao, Xipeng Pan, Weijin Xu, Yang Jin, and Feng Gao. Full-resolution network and dual-threshold iteration for retinal vessel and coronary angiograph segmentation. IEEE Journal of Biomedical and Health Informatics, 26(9):4623–4634, 2022b. doi: 10.1109/JBHI.2022.3188710.
Ma et al. (2021) Jun Ma, Jianan Chen, Matthew Ng, Rui Huang, Yu Li, Chen Li, Xiaoping Yang, and Anne L. Martel. Loss odyssey in medical image segmentation. Medical Image Analysis, 71:102035, 2021. ISSN 1361-8415. doi: 10.1016/j.media.2021.102035. URL https://www.sciencedirect.com/science/article/pii/S1361841521000815.
Ma et al. (2024) Jun Ma, Yuting He, Feifei Li, Lin Han, Chenyu You, and Bo Wang. Segment anything in medical images. Nat. Commun., 15(1):654, January 2024.
Milletari et al. (2016) Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pages 565–571, 2016. doi: 10.1109/3DV.2016.79.
Orlando et al. (2017) José Ignacio Orlando, Elena Prokofyeva, and Matthew B. Blaschko. A discriminatively trained fully connected conditional random field model for blood vessel segmentation in fundus images. IEEE Transactions on Biomedical Engineering, 64(1):16–27, 2017. doi: 10.1109/TBME.2016.2535311.
Pizer et al. (1987) Stephen M Pizer, E Philip Amburn, John D Austin, Robert Cromartie, Ari Geselowitz, Trey Greer, Bart ter Haar Romeny, John B Zimmerman, and Karel Zuiderveld. Adaptive histogram equalization and its variations. Computer vision, graphics, and image processing, 39(3):355–368, 1987.
Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation, 2015.
Salpea et al. (2022) Natalia Salpea, Paraskevi Tzouveli, and Dimitrios Kollias. Medical image segmentation: A review of modern architectures. In European Conference on Computer Vision, pages 691–708. Springer, 2022.
Shi et al. (2023) Peilun Shi, Jianing Qiu, Sai Mu Dalike Abaxi, Hao Wei, Frank P.-W. Lo, and Wu Yuan. Generalist vision foundation models for medical imaging: A case study of segment anything model on zero-shot medical segmentation. Diagnostics, 13(11):1947, June 2023. ISSN 2075-4418. doi: 10.3390/diagnostics13111947. URL http://dx.doi.org/10.3390/diagnostics13111947.
Shin et al. (2019) Seung Yeon Shin, Soochahn Lee, Il Dong Yun, and Kyoung Mu Lee. Deep vessel segmentation by learning graphical connectivity. Medical Image Analysis, 58:101556, 2019. ISSN 1361-8415. doi: 10.1016/j.media.2019.101556. URL https://www.sciencedirect.com/science/article/pii/S1361841519300982.
Shit et al. (2021) Suprosanna Shit, Johannes C. Paetzold, Anjany Sekuboyina, Ivan Ezhov, Alexander Unger, Andrey Zhylka, Josien P. W. Pluim, Ulrich Bauer, and Bjoern H. Menze. clDice - a novel topology-preserving loss function for tubular structure segmentation. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, jun 2021. doi: 10.1109/cvpr46437.2021.01629. URL https://doi.org/10.1109%2Fcvpr46437.2021.01629.
Staal et al. (2004) J. Staal, M.D. Abramoff, M. Niemeijer, M.A. Viergever, and B. van Ginneken. Ridge-based vessel segmentation in color images of the retina. IEEE Transactions on Medical Imaging, 23(4):501–509, 2004. doi: 10.1109/TMI.2004.825627.
Wang et al. (2019a) Bo Wang, Shuang Qiu, and Huiguang He. Dual encoding u-net for retinal vessel segmentation. In Dinggang Shen, Tianming Liu, Terry M. Peters, Lawrence H. Staib, Caroline Essert, Sean Zhou, Pew-Thian Yap, and Ali Khan, editors, Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, pages 84–92, Cham, 2019a. Springer International Publishing. ISBN 978-3-030-32239-7.
Wang et al. (2016) Shaoze Wang, Kai Jin, Haitong Lu, Chuming Cheng, Juan Ye, and Dahong Qian. Human visual system-based fundus image quality assessment of portable fundus camera photographs. IEEE Transactions on Medical Imaging, 35(4):1046–1055, 2016. doi: 10.1109/TMI.2015.2506902.
Wang and Jiang (2019) Xiaohong Wang and Xudong Jiang. Retinal vessel segmentation by a divide-and-conquer funnel-structured classification framework. Signal Processing, 165:104–114, 2019. ISSN 0165-1684. doi: 10.1016/j.sigpro.2019.06.018. URL https://www.sciencedirect.com/science/article/pii/S0165168419302269.
Wang et al. (2019b) Xiaohong Wang, Xudong Jiang, and Jianfeng Ren. Blood vessel segmentation from fundus image by a cascade classification framework. Pattern Recognition, 88:331–341, 2019b. ISSN 0031-3203. doi: 10.1016/j.patcog.2018.11.030. URL https://www.sciencedirect.com/science/article/pii/S0031320318304199.
Wu et al. (2018) Yicheng Wu, Yong Xia, Yang Song, Yanning Zhang, and Weidong Cai. Multiscale network followed network model for retinal vessel segmentation. In Alejandro F. Frangi, Julia A. Schnabel, Christos Davatzikos, Carlos Alberola-López, and Gabor Fichtinger, editors, Medical Image Computing and Computer Assisted Intervention – MICCAI 2018, pages 119–126, Cham, 2018. Springer International Publishing. ISBN 978-3-030-00934-2.
Wundram et al. (2024) Anna M Wundram, Paul Fischer, Stephan Wunderlich, Hanna Faber, Lisa M Koch, Philipp Berens, and Christian F Baumgartner. Leveraging probabilistic segmentation models for improved glaucoma diagnosis: A clinical pipeline approach. In Medical Imaging with Deep Learning, 2024.
Yan et al. (2018) Zengqiang Yan, Xin Yang, and Kwang-Ting Cheng. Joint segment-level and pixel-wise losses for deep learning based retinal vessel segmentation. IEEE Transactions on Biomedical Engineering, 65(9):1912–1923, 2018. doi: 10.1109/TBME.2018.2828137.
Yao et al. (2024) Wenjian Yao, Jiajun Bai, Wei Liao, Yuheng Chen, Mengjuan Liu, and Yao Xie. From cnn to transformer: A review of medical image segmentation models. Journal of Imaging Informatics in Medicine, pages 1–19, 2024.
Zhang et al. (2016) Jiong Zhang, Behdad Dashtbozorg, Erik Bekkers, Josien P. W. Pluim, Remco Duits, and Bart M. ter Haar Romeny. Robust retinal vessel segmentation via locally adaptive derivative frames in orientation scores. IEEE Transactions on Medical Imaging, 35(12):2631–2644, 2016. doi: 10.1109/TMI.2016.2587062.
Zhao et al. (2019) He Zhao, Huiqi Li, Sebastian Maurer-Stroh, Yuhong Guo, Qiuju Deng, and Li Cheng. Supervised segmentation of un-annotated retinal fundus images by synthesis. IEEE Transactions on Medical Imaging, 38(1):46–56, 2019. doi: 10.1109/TMI.2018.2854886.
Zhao et al. (2020) He Zhao, Huiqi Li, and Li Cheng. Improving retinal vessel segmentation with joint local loss by matting. Pattern Recognition, 98:107068, 2020. ISSN 0031-3203. doi: 10.1016/j.patcog.2019.107068. URL https://www.sciencedirect.com/science/article/pii/S0031320319303693.
Zhou et al. (2022) Yukun Zhou, Siegfried K. Wagner, Mark A. Chia, An Zhao, Peter Woodward-Court, Moucheng Xu, Robbert Struyven, Daniel C. Alexander, and Pearse A. Keane. AutoMorph: Automated Retinal Vascular Morphology Quantification Via a Deep Learning Pipeline. Translational Vision Science & Technology, 11(7):12–12, 07 2022. ISSN 2164-2591. doi: 10.1167/tvst.11.7.12. URL https://doi.org/10.1167/tvst.11.7.12.
Zhou et al. (2021) Yuqian Zhou, Hanchao Yu, and Humphrey Shi. Study group learning: Improving retinal vessel segmentation trained with noisy labels, 2021.
Zhou et al. (2018) Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. Unet++: A nested u-net architecture for medical image segmentation, 2018.
Zhuang (2019) Juntang Zhuang. Laddernet: Multi-path networks based on u-net for medical image segmentation, 2019.