License: CC BY 4.0
arXiv:2403.15218v1 [cs.CV] 22 Mar 2024

Anytime, Anywhere, Anyone: Investigating the Feasibility of Segment Anything Model for Crowd-Sourcing Medical Image Annotations

Pranav Kulkarni   Adway Kanhere*   Dharmam Savani*   Andrew Chan
Devina Chatterjee   Paul H. Yi   Vishwa S. Parekh
University of Maryland Medical Intelligent Imaging (UM2ii) Center
University of Maryland School of Medicine, Baltimore, MD 21201
{pkulkarni,akanhere,dsavani,andrew.chan,devinachatterjee,pyi,vparekh}@som.umaryland.edu
* Authors contributed equally to this work. Corresponding author.
Abstract

Curating annotations for medical image segmentation is a labor-intensive and time-consuming task that requires domain expertise, resulting in “narrowly” focused deep learning (DL) models with limited translational utility. Recently, foundation models like the Segment Anything Model (SAM) have revolutionized semantic segmentation with exceptional zero-shot generalizability across various domains, including medical imaging, and hold a lot of promise for streamlining the annotation process. However, SAM has yet to be evaluated in a crowd-sourced setting to curate annotations for training 3D DL segmentation models. In this work, we explore the potential of SAM for crowd-sourcing “sparse” annotations from non-experts to generate “dense” segmentation masks for training 3D nnU-Net models, a state-of-the-art DL segmentation model. Our results indicate that while SAM-generated annotations exhibit high mean Dice scores compared to ground-truth annotations, nnU-Net models trained on SAM-generated annotations perform significantly worse than nnU-Net models trained on ground-truth annotations ($p < 0.001$, all).

1 Introduction

Medical image segmentation is one of the most fundamental tasks in computer-assisted clinical decision support, forming the basis for many downstream applications from diagnosis to treatment planning and response assessment. However, developing medical image segmentation models requires a domain expert (e.g., a radiologist) to manually annotate different objects of interest across a training dataset consisting of hundreds of patients, making it an extremely labor-intensive and time-consuming task [7, 19, 20]. As a result, most datasets and segmentation models developed in prior literature are “narrowly” focused only on the task at hand, thereby reducing their clinical translational utility.

Figure 1: Example of SAM on (a) abdominal CT, (b) hand x-ray, and (c) knee MRI. SAM can operate in either “segment anything” mode (column 2), where SAM automatically segments all potential objects of interest in an image, or “prompting” mode, where SAM can segment an object of interest using an interactive prompt like bounding boxes (column 3) or points (column 4).

To address this challenge, many different approaches have been proposed in recent years where users can use less time-consuming “sparse” annotations, such as scribbles and bounding boxes, to interactively prompt a pre-trained deep learning (DL) model to create “dense” annotations like detailed boundary masks [7, 18, 10]. While these approaches have been shown to significantly reduce the annotation time per object, they require an expert not only to create these annotations interactively, but also to fine-tune and validate them [7]. Therefore, there is a critical need for a data curation pipeline for medical image segmentation that would allow non-experts to annotate datasets with sparse annotations without the need for an expert in the loop.

Table 1: Dataset description and availability of organ annotations for the MSD Liver, MSD Spleen, and BTCV datasets.
Variable                    MSD Liver                          MSD Spleen                         BTCV
No. of Scans                201                                61                                 50
Modality                    CT                                 CT                                 CT
Volume Size (voxels)        512 x 212 x 25 – 512 x 212 x 287   512 x 212 x 25 – 512 x 212 x 268   512 x 212 x 25 – 512 x 212 x 295
In-Plane Resolution (mm²)   0.63 x 3.63 – 1 x 1                0.73 x 3.73 – 0.98 x 8.98          0.59 x 9.59 – 0.98 x 8.98
Slice Thickness (mm)        0.70 – 5                           1.5 – 8                            2.5 – 5
Aorta                       -                                  -                                  30 (60%)
Left Kidney                 -                                  -                                  30 (60%)
Right Kidney                -                                  -                                  30 (60%)
Liver                       131 (65%)                          -                                  30 (60%)
Spleen                      -                                  41 (67%)                           30 (60%)

Recently, large DL foundation models trained in a self-supervised manner on large-scale datasets on the order of a billion samples are revolutionizing the field of computer vision with their strong zero-shot generalizability [12, 15, 3]. This means that they do not need to be fine-tuned for medical imaging tasks and can operate directly out-of-the-box. The Segment Anything Model (SAM) is one such popular and open-source foundational model based on vision transformers (ViTs) with the ability for zero-shot semantic segmentation [12, 8]. SAM works by interactively prompting images with sparse annotations, such as points or bounding boxes, to generate segmentation masks (Figure 1).

Recent literature has suggested that SAM holds a lot of promise for annotating medical imaging datasets using sparse annotations [5, 2, 17, 6, 16, 15]. However, current approaches are limited to the evaluation of SAM in simulated settings as opposed to a realistic crowd-sourced setting and have yet to evaluate the effectiveness of SAM-generated annotations for training DL segmentation models. The purpose of this study is to 1) evaluate SAM for crowd-sourcing annotations for medical imaging datasets from non-expert annotators, and 2) investigate the feasibility of using SAM-generated annotations for training 3D DL segmentation models.

2 Methods

Figure 2: Pipeline for crowd-sourcing sparse annotations from non-expert annotators for the purpose of training 3D DL segmentation models using SAM-generated annotations. Suppose there is an unannotated medical imaging dataset. Sparse annotations for objects of interest (e.g., organs, tumors, etc.) can be crowd-sourced from non-expert annotators. Then, segmentation masks for the objects of interest can be generated using SAM. Finally, the SAM-generated annotations can be used to train a 3D DL segmentation model (e.g., U-Net).

This retrospective study used publicly available datasets and was acknowledged by our IRB as non-human subjects research. Our code is available at: https://github.com/UM2ii/SAM_DataAnnotation

2.1 Segment Anything Model

The Segment Anything Model (SAM) is a computer vision foundation model for semantic segmentation based on ViTs [12, 8]. It comprises an image encoder that extracts features from an image to create embeddings, a prompt encoder that creates embeddings from “sparse” annotations (e.g., points, bounding boxes), and a mask decoder that uses image and prompt embeddings to generate “dense” segmentation masks for the objects of interest in the image. SAM is trained on a large-scale dataset comprising over 11 million images with over 1 billion segmentation masks. This enables SAM to have exceptional zero-shot and few-shot generalizability for semantic segmentation tasks across various domains, including medical imaging [5, 2, 6, 16, 15].

2.2 Datasets

The Medical Segmentation Decathlon (MSD) is a collection of 10 benchmark datasets for segmentation spanning different body parts and modalities [1]. In our study, we used two datasets: 1) The MSD Liver dataset consists of $n = 201$ portal-venous phase contrast-enhanced abdominal CT scans, out of which $n = 131$ (65%) contained annotations for the liver and liver tumors. We discarded all liver tumor annotations. 2) The MSD Spleen dataset consists of $n = 61$ portal-venous phase contrast-enhanced abdominal CT scans, out of which $n = 41$ (67%) contained annotations for the spleen. The Beyond the Cranial Vault (BTCV) dataset consists of $n = 50$ portal-venous phase contrast-enhanced abdominal CT scans, out of which $n = 30$ (60%) contained annotations for 13 abdominal organs [13]. In our study, five organs of interest were included: aorta, left and right kidneys, liver, and spleen. The dataset description and availability of organ annotations for the MSD Liver, MSD Spleen, and BTCV datasets are provided in Table 1.

2.3 Crowd-Sourcing Sparse Annotations

We implemented a pipeline for crowd-sourcing annotations with SAM for the purpose of training 3D DL segmentation models (Figure 2). Since the volumes are provided in the NIfTI format, we used med2image (version 2.6.6) to convert a NIfTI volume into PNG images for each slice. The PNG format was chosen as it is a lossless standard format for image analysis in computer vision. We used the OpenLabeling tool (version 1.3) to annotate the organs of interest for each slice within a volume [4]. Bounding boxes were chosen as the sparse annotation method due to their superior performance for generating segmentation masks with SAM [5, 6, 16, 15]. In this study, we evaluate the effectiveness of SAM-generated annotations for training 3D DL segmentation models across two experiments: 1) simulated “sparse” annotations, and 2) crowd-sourced “sparse” annotations.

Figure 3: Illustration of the OpenLabeling tool used for crowd-sourcing bounding box annotations for the BTCV training set across the five organs of interest.

2.3.1 Simulated Annotations

The goal of this experiment was to represent an “ideal” scenario for crowd-sourcing annotations, where all organs of interest are accurately annotated for every volume in the dataset. For liver segmentation, we use the MSD Liver and BTCV datasets as the training ($n = 131$) and testing ($n = 30$) sets, respectively. Using the ground-truth annotations, the bounding box prompts were simulated for each slice of a volume in the training set. Ground-truth annotations were blurred using a uniform filter (kernel size = 5 x 5) and binarized (thresholded to pixel values greater than 0.01) to introduce jitter. Bounding boxes were drawn programmatically around each resulting region of interest. For spleen segmentation, this process was repeated with the MSD Spleen dataset ($n = 41$) as the training set.
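The simulation step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name simulate_box_prompts is hypothetical, SciPy's ndimage module is assumed for filtering, and treating each connected component as one "region of interest" is an assumption not stated in the text.

```python
import numpy as np
from scipy import ndimage


def simulate_box_prompts(gt_slice, kernel_size=5, threshold=0.01):
    """Return jittered boxes (x_min, y_min, x_max, y_max) for one 2D binary slice."""
    # Blur the binary mask with a uniform (mean) filter so its edges spread outward.
    blurred = ndimage.uniform_filter(gt_slice.astype(np.float32), size=kernel_size)
    # Re-binarize: every pixel reached by the blur becomes foreground.
    jittered = blurred > threshold
    # Assumption: each connected component is one region of interest.
    labeled, _ = ndimage.label(jittered)
    boxes = []
    for rows, cols in ndimage.find_objects(labeled):
        # Slices are half-open, so subtract 1 to get inclusive pixel coordinates.
        boxes.append((cols.start, rows.start, cols.stop - 1, rows.stop - 1))
    return boxes


# Toy slice: a 5 x 12 rectangle standing in for an organ cross-section.
mask = np.zeros((32, 32), dtype=np.uint8)
mask[10:15, 8:20] = 1
boxes = simulate_box_prompts(mask)
print(boxes)
```

With a 5 x 5 kernel, the box grows by two pixels on every side of the ground-truth region, which is the source of the jitter relative to a tight bounding box.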

2.3.2 Crowd-Sourced Annotations

The goal of this experiment was to represent a real-world scenario when crowd-sourcing annotations from non-experts. We randomly split the BTCV dataset into training ($n = 15$) and testing ($n = 15$) sets. The training set was annotated with the five organs of interest by four non-experts from diverse backgrounds with small-to-moderate knowledge of anatomical structures (Figure 3). Two annotators (A.K., D.S.) had software engineering backgrounds, one annotator (A.C.) had a bioengineering background, and one annotator (D.C.) was a medical student. All annotators had experience in biomedical informatics with a mean 2.50 years of experience (range 1–5 years). They were provided with basic orientation to familiarize them with relevant anatomical structures. Three annotators were assigned one task (D.S., aorta; A.C., liver; A.K., spleen) and one annotator was assigned two tasks (D.C., left and right kidneys). They were instructed to draw bounding boxes surrounding the region of interest for each slice in a volume. The volumes were annotated by the non-expert annotators independently, and consensus agreement was not required.

2.4 Generating Dense Annotations

After crowd-sourcing sparse annotations, we used SAM with the ViT-Huge backbone to generate segmentation masks for the organs of interest [12]. We also generated masks using MedSAM, a version of SAM fine-tuned for medical image segmentation tasks and based on the ViT-Base backbone [15]. Since SAM only supports 2D segmentation, annotations for a volume were generated by passing each slice with its corresponding bounding boxes as input to SAM for inference. The pixel values of all input images were normalized to range 0–255. The annotations were selected from three generated masks based on the highest confidence score. The SAM-generated annotations were converted to NIfTI using NiBabel (version 5.2). We measured the mean Dice similarity coefficient of SAM-generated annotations on the ground-truth annotations of the training set for the volume (hereafter, ‘volume Dice score’) and each annotated slice (hereafter, ‘slice Dice score’). In all cases, the mean Dice scores are reported with the 95% confidence interval as ‘Mean ± 95% CI’.
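The pre- and post-processing around SAM inference can be illustrated with a short NumPy sketch. It does not call SAM itself; the three candidate masks and their confidence scores are assumed to be given as arrays. All function names are hypothetical, min-max scaling is an assumed reading of "normalized to range 0–255", and scoring two empty masks as 1.0 is a convention chosen here, not one stated in the text.

```python
import numpy as np


def normalize_to_uint8(image):
    """Min-max scale an image to the range 0-255 (assumed normalization scheme)."""
    image = image.astype(np.float64)
    span = image.max() - image.min()
    if span == 0:
        return np.zeros(image.shape, dtype=np.uint8)
    return np.round(255.0 * (image - image.min()) / span).astype(np.uint8)


def select_best_mask(masks, scores):
    """Pick the candidate mask with the highest confidence score."""
    return masks[int(np.argmax(scores))]


def dice_score(pred, truth):
    """Dice similarity coefficient of two binary masks (a 2D slice or a 3D volume)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumption)
    return float(2.0 * np.logical_and(pred, truth).sum() / denom)


# Toy example: a 4-pixel ground truth and three candidate masks.
truth = np.zeros((4, 4), dtype=np.uint8)
truth[1:3, 1:3] = 1
candidates = np.zeros((3, 4, 4), dtype=np.uint8)
candidates[0, 0:2, 0:2] = 1
candidates[1, 1:3, 1:4] = 1  # 6 pixels, 4 of which overlap the ground truth
candidates[2, 2:4, 2:4] = 1
best = select_best_mask(candidates, scores=[0.2, 0.9, 0.5])
print(dice_score(best, truth))  # 2 * 4 / (6 + 4) = 0.8
```

The same dice_score function yields a slice Dice score when given 2D arrays and a volume Dice score when given the stacked 3D arrays.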

2.5 Training DL Segmentation Models

We train DL segmentation models using the nnU-Net framework on the SAM-generated (hereafter, ‘SAM-nnU-Net’) and ground-truth annotations (hereafter, ‘GT-nnU-Net’). nnU-Net is a self-configuring 3D U-Net architecture that automatically optimizes hyperparameters for the training dataset and has achieved state-of-the-art (SOTA) performance across various biomedical segmentation tasks [11, 19, 20, 18]. The models were trained with five-fold cross-validation for 1000 epochs using nnU-Net’s default training procedure [11]. The mirroring augmentation was removed as it is not anatomically valid for abdominal CT scans [20]. We compared the performance of the SAM-nnU-Net and GT-nnU-Net models using mean volume Dice scores on the ground-truth test set. All models were evaluated using PyTorch (version 2.0.1) and CUDA (version 12.0) on eight NVIDIA GeForce RTX A6000 GPUs.

2.6 Statistical Analysis

The mean Dice scores were compared with Wilcoxon signed-rank tests because the paired samples were not normally distributed, as indicated by $p < 0.05$ using the Shapiro-Wilk test for normality. All statistical analysis was performed by a statistical analyst (P.K., 6 years of experience) using SciPy (version 1.11.2). Statistical significance was defined as $p < 0.05$.
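A paired comparison of this kind can be sketched with SciPy's shapiro and wilcoxon functions. The helper name compare_paired_dice is hypothetical, applying the Shapiro-Wilk test to the paired differences is an assumption (the text does not say which samples were tested), and the Dice scores below are synthetic values made up for illustration, not results from the study.

```python
import numpy as np
from scipy import stats


def compare_paired_dice(scores_a, scores_b, alpha=0.05):
    """Normality check plus Wilcoxon signed-rank test on paired Dice scores."""
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    # Assumption: normality is assessed on the paired differences.
    _, p_normality = stats.shapiro(scores_a - scores_b)
    # Non-parametric paired test; it does not assume normal differences.
    _, p_value = stats.wilcoxon(scores_a, scores_b)
    return {
        "p_normality": float(p_normality),
        "p_value": float(p_value),
        "significant": bool(p_value < alpha),
    }


# Synthetic per-volume Dice scores for two hypothetical models.
model_a = [0.90, 0.88, 0.93, 0.91, 0.87, 0.92, 0.89, 0.94, 0.90, 0.86]
deltas = [0.05, 0.07, 0.11, 0.06, 0.09, 0.12, 0.08, 0.10, 0.04, 0.13]
model_b = [a - d for a, d in zip(model_a, deltas)]
result = compare_paired_dice(model_a, model_b)
print(result)
```

Because every synthetic volume scores higher under the first model, the signed-rank test reports a significant difference at the 0.05 level.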

3 Results

3.1 Simulated Annotations

3.1.1 Liver Segmentation

The SAM-generated annotations measured a mean volume Dice score of 0.86 ± 0.02 on the ground-truth annotations for the MSD Liver training set, which is significantly higher than the mean volume Dice score of 0.80 ± 0.04 ($p < 0.001$) for MedSAM. Similarly, the SAM-generated annotations measured a mean slice Dice score of 0.85 ± 0.00 on the ground-truth annotations for the MSD Liver training set, which is significantly higher than the mean slice Dice score of 0.77 ± 0.00 ($p < 0.001$) for MedSAM. The SAM-nnU-Net model trained on the MSD Liver dataset measured a mean volume Dice score of 0.83 ± 0.03 on the BTCV dataset, which is significantly lower than the mean volume Dice score of 0.90 ± 0.03 ($p < 0.001$) for GT-nnU-Net.

3.1.2 Spleen Segmentation

The SAM-generated annotations measured a mean volume Dice score of 0.88 ± 0.02 on the ground-truth annotations for the MSD Spleen training set, which is comparable to the mean volume Dice score of 0.87 ± 0.03 ($p = 0.23$) for MedSAM. Similarly, the SAM-generated annotations measured a mean slice Dice score of 0.87 ± 0.01 on the ground-truth annotations for the MSD Spleen training set, which is significantly higher than the mean slice Dice score of 0.83 ± 0.01 ($p < 0.001$) for MedSAM. The SAM-nnU-Net model trained on the MSD Spleen dataset measured a mean volume Dice score of 0.81 ± 0.01 on the BTCV dataset, which is significantly lower than the mean volume Dice score of 0.87 ± 0.01 ($p = 0.004$) for GT-nnU-Net.

3.2 Crowd-Sourced Annotations

Figure 4: An example of crowd-sourced SAM-generated annotations for a CT scan from the BTCV training set in the axial, coronal, and sagittal views. The SAM-generated annotations are filled in while the ground-truth annotations are outlined in blue.

The non-expert annotators annotated 651 slices from $n = 15$ volumes in the BTCV training set with 1840 bounding boxes using our data curation pipeline (Figure 4). They took 55.60 ± 8.76 mins to annotate an organ of interest across all volumes, with a mean of 3.29 ± 1.04 seconds per slice. Out of $n = 15$ volumes, the non-experts annotated all five organs of interest for $n = 11$ volumes (73%). For the remaining $n = 4$ volumes (27%), annotations for left and right kidneys were missing for $n = 3$ volumes (75%), liver was missing for $n = 2$ volumes (50%), and spleen was missing for $n = 1$ volume (25%).

For measuring the mean volume and slice Dice scores for SAM- and MedSAM-generated annotations, we exclude the $n = 4$ volumes with missing annotations. The SAM-generated annotations measured a mean volume Dice score of 0.75 ± 0.09 on the ground-truth annotations for the BTCV training set, compared to the mean volume Dice score of 0.74 ± 0.09 ($p = 0.70$) for MedSAM. Stratified by organs of interest, the SAM-generated annotations were comparable to MedSAM-generated annotations for all five organs ($p \geq 0.05$, all) (Table 2). Similarly, the SAM-generated annotations measured a mean slice Dice score of 0.88 ± 0.02 on the ground-truth annotations for the BTCV training set, compared to the mean slice Dice score of 0.88 ± 0.02 ($p = 0.34$) for MedSAM. Stratified by organs of interest, the SAM-generated annotations were comparable to MedSAM-generated annotations for left and right kidneys ($p \geq 0.05$, both) (Table 3). However, the mean slice Dice scores for MedSAM-generated annotations were significantly higher than the mean slice Dice scores for SAM-generated annotations for aorta (0.89 ± 0.02 vs 0.88 ± 0.01, $p = 0.02$) and liver (0.89 ± 0.02 vs 0.84 ± 0.02, $p < 0.001$), but significantly lower for spleen (0.88 ± 0.04 vs 0.91 ± 0.03, $p = 0.02$).

Table 2: Mean volume Dice scores for the SAM- and MedSAM-generated annotations on the ground-truth annotations of the BTCV training set ($n = 15$) across all five organs of interest.
Organ          SAM           MedSAM        p-value
Aorta          0.70 ± 0.09   0.65 ± 0.09   0.08
Left Kidney    0.74 ± 0.11   0.74 ± 0.10   0.77
Right Kidney   0.78 ± 0.10   0.77 ± 0.09   0.97
Liver          0.73 ± 0.13   0.75 ± 0.14   0.12
Spleen         0.80 ± 0.14   0.78 ± 0.14   0.90

We trained two sets of 3D DL segmentation models: 1) Trained on partially annotated $n = 15$ volumes from the BTCV training set. 2) Trained on fully annotated $n = 11$ volumes from the BTCV training set. Both sets of DL models were evaluated on the BTCV test set ($n = 15$).

The SAM-nnU-Net model trained on partially annotated $n = 15$ volumes from the BTCV training set measured a mean Dice score of 0.77 ± 0.06 on the BTCV test set, which is significantly lower than the mean Dice score of 0.91 ± 0.05 ($p < 0.001$) for the GT-nnU-Net model. Stratified by organs of interest, the SAM-nnU-Net model had a significantly lower mean Dice score than the GT-nnU-Net model for all five organs ($p < 0.006$, all) (Table 4).

The SAM-nnU-Net model trained on fully annotated $n = 11$ volumes from the BTCV training set measured a mean Dice score of 0.80 ± 0.05 on the BTCV test set, which is significantly lower than the mean Dice score of 0.90 ± 0.05 ($p < 0.001$) for the GT-nnU-Net model (Figure 5). Stratified by organs of interest, the SAM-nnU-Net model had a significantly lower mean Dice score than the GT-nnU-Net model for aorta, left kidney, liver, and spleen ($p < 0.02$, all) (Table 5). However, for right kidneys, the mean Dice score of 0.78 ± 0.07 for the SAM-nnU-Net model was comparable to the mean Dice score of 0.87 ± 0.11 for the GT-nnU-Net model ($p = 0.06$).

Table 3: Mean slice Dice scores for the SAM- and MedSAM-generated annotations on the ground-truth annotations of the BTCV training set ($n = 15$) across all five organs of interest.
Organ          SAM           MedSAM        p-value
Aorta          0.88 ± 0.01   0.89 ± 0.02   0.02
Left Kidney    0.88 ± 0.04   0.89 ± 0.02   0.66
Right Kidney   0.90 ± 0.02   0.88 ± 0.02   0.11
Liver          0.84 ± 0.02   0.89 ± 0.02   <0.001
Spleen         0.91 ± 0.03   0.86 ± 0.04   0.02
Figure 5: An example of (a) GT-nnU-Net and (b) SAM-nnU-Net segmentations for a CT scan from the BTCV test set in the axial, coronal, and sagittal views. The models are trained on fully annotated n=11 volumes from the BTCV training set. The predicted segmentations are filled in while the ground-truth annotations are outlined in blue.

By excluding the $n = 4$ volumes with missing annotations from the BTCV training set, we observe a significant increase in the mean Dice score of the SAM-nnU-Net model from 0.77 ± 0.06 to 0.80 ± 0.05 ($p = 0.008$) (Tables 4 and 5). Stratified by organs of interest, we observe significant improvements in mean Dice scores for aorta ($p = 0.04$), left kidney ($p = 0.002$), right kidney ($p = 0.02$), and liver ($p = 0.04$). However, for the GT-nnU-Net model, we observe no significant difference in mean Dice score after excluding volumes with missing annotations ($p = 0.06$) (Tables 4 and 5). Stratified by organs of interest, we observe a significant decrease in mean Dice scores for aorta ($p = 0.008$) and right kidney ($p = 0.04$).

Table 4: Mean volume Dice scores of the GT-nnU-Net and SAM-nnU-Net on the BTCV test set ($n = 15$) across all five organs of interest. The models are trained on partially annotated $n = 15$ volumes from the BTCV training set.
Organ          GT-nnU-Net    SAM-nnU-Net   p-value
Aorta          0.93 ± 0.01   0.75 ± 0.05   <0.001
Left Kidney    0.90 ± 0.11   0.72 ± 0.11   <0.001
Right Kidney   0.90 ± 0.10   0.75 ± 0.07   0.005
Liver          0.94 ± 0.05   0.82 ± 0.05   <0.001
Spleen         0.87 ± 0.12   0.79 ± 0.12   <0.001

4 Discussion

Crowd-sourcing of annotations using foundation models like SAM has the potential to revolutionize the curation of large-scale datasets for medical image segmentation. It transforms a labor-intensive and time-consuming process that requires domain expertise into one that allows anyone, whether expert or non-expert, to annotate objects of interest in a medical image using “sparse” annotations from anywhere, using any device, and at any time without the need for an expert in the loop.

However, our results indicate that while SAM-generated annotations exhibit high mean slice Dice scores compared to ground-truth annotations, the SAM-nnU-Net models perform significantly worse than the GT-nnU-Net models across the simulated setting and the crowd-sourced setting for the segmentation of abdominal organs using CT scans ($p < 0.001$, all). This discrepancy is due to SAM being primarily designed for 2D semantic segmentation, thereby resulting in a lack of spatial relationships between features in 3D (e.g., depth) and poor “connectivity” in annotations between two consecutive slices (Figure 4). As indicated by our results, this leads to significantly worse mean volume Dice scores for SAM-generated annotations when compared to ground-truth annotations and translates into sub-optimal performance for the SAM-nnU-Net models, despite nnU-Net being a SOTA model architecture.

Table 5: Mean volume Dice scores of the GT-nnU-Net and SAM-nnU-Net on the BTCV test set ($n = 15$) across all five organs of interest. The models are trained on fully annotated $n = 11$ volumes from the BTCV training set.
Organ          GT-nnU-Net    SAM-nnU-Net   p-value
Aorta          0.92 ± 0.01   0.78 ± 0.04   <0.001
Left Kidney    0.87 ± 0.12   0.78 ± 0.08   0.02
Right Kidney   0.87 ± 0.11   0.78 ± 0.07   0.06
Liver          0.94 ± 0.05   0.84 ± 0.04   <0.001
Spleen         0.88 ± 0.11   0.80 ± 0.11   <0.001

The disparity can be addressed by developing foundation models specialized for 3D semantic segmentation that are able to capture spatial relationships between features, a key difference between natural images and medical images. Recent literature has explored 3D adapters that retain SAM’s rich knowledge base for semantic segmentation while incorporating spatial relationships when generating “dense” segmentation masks in 3D [14, 17, 21]. One popular approach is using a 3D CNN decoder to capture spatial relationships between image embeddings created slice-by-slice by SAM’s image encoder [2, 9]. These variations have demonstrated significantly better performance in 3D medical images when compared to SAM and MedSAM.

Another crucial consideration for crowd-sourcing annotations is that there is a potential for unreliable and spurious annotations due to the uncertainty associated with non-expert annotators. Our results indicate the importance of quality assessment. In our study, the non-expert annotators failed to completely annotate $n = 4$ volumes (27%) with all five organs of interest, resulting in a partially annotated and unreliable dataset for training DL segmentation models. By filtering out incomplete and unreliable annotations, we observed a significant increase in the performance of the SAM-nnU-Net model.

Therefore, robust quality assessment is critical not only for evaluating the quality of crowd-sourced annotations, but also for filtering out low-quality annotations without manual intervention from an expert. One potential technique is inter-rater reliability, where low-quality annotations that agree poorly with other annotations can be identified and filtered out from the study without impacting the quality of annotations generated by more reliable annotators. Moreover, this enables unreliable and spurious annotators with consistently low-quality annotations to be similarly identified and removed from the study. In addition, uncertainty estimation of SAM-generated annotations using multiple bounding box prompts can be used to identify potential errors in SAM-generated annotations and guide the selection of higher-quality annotations [6].
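One way such an inter-rater check could look for bounding boxes is sketched below. It is purely illustrative and was not part of this study, where each organ was annotated by a single annotator. It assumes several annotators box the same object on the same slice, uses intersection-over-union (IoU) as the agreement measure, and flags annotators whose mean IoU with the others falls below a threshold. The function names, the 0.5 threshold, and the example boxes are all hypothetical.

```python
from itertools import combinations


def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    overlap_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_w * overlap_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def flag_low_agreement(boxes_by_annotator, min_mean_iou=0.5):
    """Return annotators whose mean IoU with all other annotators is below the threshold."""
    ious = {name: [] for name in boxes_by_annotator}
    for (name_a, box_a), (name_b, box_b) in combinations(boxes_by_annotator.items(), 2):
        iou = box_iou(box_a, box_b)
        ious[name_a].append(iou)
        ious[name_b].append(iou)
    return sorted(
        name for name, values in ious.items()
        if values and sum(values) / len(values) < min_mean_iou
    )


# Hypothetical boxes from four annotators for one organ on one slice.
boxes = {
    "A": (10, 10, 50, 50),
    "B": (12, 11, 52, 49),
    "C": (30, 30, 90, 90),  # disagrees strongly with the other three
    "D": (9, 12, 49, 51),
}
flagged = flag_low_agreement(boxes)
print(flagged)
```

A mean over pairwise IoUs is sensitive to the number of annotators; with only two or three raters, a single outlier can pull everyone below the threshold, so a median or a consensus box may be a better reference in small groups.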

Our work has certain limitations. 1) We only consider SAM for our study, which is primarily designed for 2D segmentation with natural images and lacks the spatial relationships critical for 3D segmentation with medical images. 2) While our simulated experiments encapsulate large-scale datasets, we use a small dataset ($n = 15$) for crowd-sourcing “sparse” annotations from non-experts to train DL segmentation models. 3) In our crowd-sourcing experiment, each organ of interest is annotated by a single annotator. This has the potential for unreliable and spurious annotations due to the absence of consensus agreement among multiple annotators. 4) We only consider CT scans and have not evaluated the potential for crowd-sourcing for other modalities like MRI and PET scans.

For future work, we intend to expand the scale of our study with a larger multi-organ segmentation dataset than the BTCV dataset, a larger number of expert and non-expert annotators for crowd-sourcing “sparse” annotations, and specialized adapters of SAM for 3D medical image segmentation. We also intend to include quality assessment metrics like inter-rater reliability to filter out unreliable and spurious annotations while retaining high-quality annotations for our analysis.

5 Conclusion

Limitations in current approaches warrant caution before incorporating crowd-sourced annotations from non-experts. To take full advantage of crowd-sourcing, specialized foundation models need to be developed for 3D segmentation. Furthermore, crowd-sourcing approaches should incorporate quality assessment to filter out low-quality annotations. While we may not be ready for non-expert annotations yet, they have the potential for streamlining the annotation process for medical image segmentation by enabling anyone to annotate medical images from anywhere and at any time.

References

  • Antonelli et al. [2022] Michela Antonelli, Annika Reinke, Spyridon Bakas, Keyvan Farahani, Annette Kopp-Schneider, Bennett A Landman, Geert Litjens, Bjoern Menze, Olaf Ronneberger, Ronald M Summers, et al. The medical segmentation decathlon. Nature communications, 13(1):4128, 2022.
  • Bui et al. [2023] Nhat-Tan Bui, Dinh-Hieu Hoang, Minh-Triet Tran, and Ngan Le. Sam3d: Segment anything model in volumetric medical images. arXiv preprint arXiv:2309.03493, 2023.
  • Butoi et al. [2023] Victor Ion Butoi, Jose Javier Gonzalez Ortiz, Tianyu Ma, Mert R Sabuncu, John Guttag, and Adrian V Dalca. Universeg: Universal medical image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21438–21451, 2023.
  • Cartucho et al. [2018] Joao Cartucho, Rodrigo Ventura, and Manuela Veloso. Robust object recognition through symbiotic deep learning in mobile robots. In 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 2336–2341. IEEE, 2018.
  • Cheng et al. [2023] Dongjie Cheng, Ziyuan Qin, Zekun Jiang, Shaoting Zhang, Qicheng Lao, and Kang Li. Sam on medical images: A comprehensive study on three prompt modes. arXiv preprint arXiv:2305.00035, 2023.
  • Deng et al. [2023] Guoyao Deng, Ke Zou, Kai Ren, Meng Wang, Xuedong Yuan, Sancong Ying, and Huazhu Fu. Sam-u: Multi-box prompts triggered uncertainty estimation for reliable sam in medical image. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 368–377. Springer, 2023.
  • Diaz-Pinto et al. [2022] Andres Diaz-Pinto, Sachidanand Alle, Vishwesh Nath, Yucheng Tang, Alvin Ihsani, Muhammad Asad, Fernando Pérez-García, Pritesh Mehta, Wenqi Li, Mona Flores, et al. Monai label: A framework for ai-assisted interactive labeling of 3d medical images. arXiv preprint arXiv:2203.12362, 2022.
  • Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Gong et al. [2023] Shizhan Gong, Yuan Zhong, Wenao Ma, Jinpeng Li, Zhao Wang, Jingyang Zhang, Pheng-Ann Heng, and Qi Dou. 3dsam-adapter: Holistic adaptation of sam from 2d to 3d for promptable medical image segmentation. arXiv preprint arXiv:2306.13465, 2023.
  • Huang et al. [2018] Zilong Huang, Xinggang Wang, Jiasi Wang, Wenyu Liu, and Jingdong Wang. Weakly-supervised semantic segmentation network with deep seeded region growing. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7014–7023, 2018.
  • Isensee et al. [2021] Fabian Isensee, Paul F Jaeger, Simon AA Kohl, Jens Petersen, and Klaus H Maier-Hein. nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2):203–211, 2021.
  • Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
  • Landman et al. [2015] Bennett Landman, Zhoubing Xu, J Igelsias, Martin Styner, Thomas Langerak, and Arno Klein. Miccai multi-atlas labeling beyond the cranial vault–workshop and challenge. In Proc. MICCAI Multi-Atlas Labeling Beyond Cranial Vault—Workshop Challenge, page 12, 2015.
  • Lei et al. [2023] Wenhui Lei, Xu Wei, Xiaofan Zhang, Kang Li, and Shaoting Zhang. Medlsam: Localize and segment anything model for 3d medical images. arXiv preprint arXiv:2306.14752, 2023.
  • Ma et al. [2024] Jun Ma, Yuting He, Feifei Li, Lin Han, Chenyu You, and Bo Wang. Segment anything in medical images. Nature Communications, 15(1):654, 2024.
  • Mazurowski et al. [2023] Maciej A Mazurowski, Haoyu Dong, Hanxue Gu, Jichen Yang, Nicholas Konz, and Yixin Zhang. Segment anything model for medical image analysis: an experimental study. Medical Image Analysis, 89:102918, 2023.
  • Quan et al. [2024] Quan Quan, Fenghe Tang, Zikang Xu, Heqin Zhu, and S Kevin Zhou. Slide-sam: Medical sam meets sliding window. In Medical Imaging with Deep Learning, 2024.
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
  • Sebro and Mongan [2023] Ronnie Sebro and John Mongan. Totalsegmentator: A gift to the biomedical imaging community. Radiology: Artificial Intelligence, 5(5):e230235, 2023.
  • Wasserthal et al. [2023] Jakob Wasserthal, Hanns-Christian Breit, Manfred T Meyer, Maurice Pradella, Daniel Hinck, Alexander W Sauter, Tobias Heye, Daniel T Boll, Joshy Cyriac, Shan Yang, et al. Totalsegmentator: Robust segmentation of 104 anatomic structures in ct images. Radiology: Artificial Intelligence, 5(5), 2023.
  • Wu et al. [2023] Junde Wu, Rao Fu, Huihui Fang, Yuanpei Liu, Zhaowei Wang, Yanwu Xu, Yueming Jin, and Tal Arbel. Medical sam adapter: Adapting segment anything model for medical image segmentation. arXiv preprint arXiv:2304.12620, 2023.