1 Karlsruhe Institute of Technology, Karlsruhe, Germany
2 HIDSS4Health - Helmholtz Information and Data Science School for Health, Karlsruhe/Heidelberg, Germany
3 Institute for AI in Medicine, University Hospital Essen, Essen, Germany
4 Cancer Research Center Cologne Essen (CCCE), University Medicine Essen, Essen,
Email: 1{[email protected]} , 3{[email protected]}

Rethinking Annotator Simulation:
Realistic Evaluation of Whole-Body PET Lesion Interactive Segmentation Methods

Zdravko Marinov 1,2, Moon Kim 3, Jens Kleesiek 3,4, Rainer Stiefelhagen 1
Abstract

Interactive segmentation plays a crucial role in accelerating annotation, particularly in domains requiring specialized expertise such as nuclear medicine. For example, annotating lesions in whole-body Positron Emission Tomography (PET) images can require over an hour per volume. While previous works evaluate interactive segmentation models through either real user studies or simulated annotators, both approaches present challenges. Real user studies are expensive and often limited in scale, while simulated annotators, also known as robot users, tend to overestimate model performance due to their idealized nature. To address these limitations, we introduce four evaluation metrics that quantify the user shift between real and simulated annotators. In an initial user study involving four annotators, we assess existing robot users using our proposed metrics and find that robot users significantly deviate in performance and annotation behavior compared to real annotators. Based on these findings, we propose a more realistic robot user that reduces the user shift by incorporating human factors such as click variation and inter-annotator disagreement. We validate our robot user in a second user study, involving four other annotators, and show it consistently reduces the simulated-to-real user shift compared to traditional robot users. By employing our robot user, we can conduct larger-scale and more cost-efficient evaluations of interactive segmentation models, while preserving the fidelity of real user studies. Our implementation is based on MONAI Label and will be made publicly available.

Keywords:
Interactive segmentation · Robot user · Realistic simulation

1 Introduction

Deep learning models have made significant progress in segmenting anatomical structures and lesions in medical images but often rely on manually labeled datasets [1, 2, 3, 4, 5, 6]. This poses a challenge for volumetric medical data where annotating each voxel demands considerable time and expertise. Interactive segmentation mitigates this issue by leveraging less demanding annotations, such as clicks, instead of dense voxelwise labels [7, 8, 9, 10, 11, 12, 13, 14, 15]. Clicks are combined with the image as a joint input for the interactive model and guide it spatially toward the segmentation target. Annotators can refine model outputs by placing clicks in missegmented areas, leading to an improved segmentation and high-quality predictions [9, 10, 11, 12, 13, 14, 15]. Once approved by medical experts, these predictions may serve as new labels [7]. Prior methods evaluate interactive models by simulating clicks on the test split (a "robot user") [15, 16, 17, 18] or by involving real annotators in a user study [9, 10, 11]. However, real user studies are costly, with a limited sample size, and robot users often overestimate model performance due to their idealized nature. Similar to a domain shift encountered when assessing models with out-of-domain data (e.g., from a different scanner), a user shift arises when validating an interactive model via simulated robot users and deploying it in real clinical settings, where its performance often diverges [7]. We address these challenges for whole-body PET lesion segmentation with the following contributions:

  1. We evaluate 4 robot users (R1)–(R4) on the AutoPET dataset [1] and conduct 2 user studies, each with 4 medical annotators, to show the disparity between simulated and real user performance of existing robot users.

  2. We introduce 4 evaluation metrics (M1)–(M4) to quantify the simulated-to-real user shift in terms of segmentation accuracy, annotator behavior, and conformity to ground-truth labels.

  3. We propose a novel robot user that mitigates the pitfalls identified in 1. by simulating clicks that disagree with the ground-truth labels. Our robot user reduces the user shift (defined in 2.) and the segmentation performance gap to real users compared to previous robot users in both our user studies.

Related Work. Previous research on robot users mainly explores classical non-deep learning methods and overlooks the disparity with real annotators. For example, Kohli et al. [18] compare four Graph Cut-based interactive models [19] and conclude that placing clicks at the center of the largest error consistently yields optimal results across all models. However, their comparison is limited to natural images, and they do not explore deep learning-based approaches. Moschidis and Graham [16] compare two robot users for 3D medical image segmentation: one targeting central regions and the other targeting boundary regions. However, their study also examines classical non-deep learning methods and lacks simulated clicks for iterative corrections. Benenson et al. [20] compare iterative boundary and central clicks and find that central clicks outperform boundary clicks, particularly when adding random noise perturbations. However, they also only explore the domain of natural images. The closest work to ours is Amrehn et al. [17], which compares robot users using an interactive U-Net [21] for liver lesion segmentation. Their results suggest that a U-Net trained with a robot user using more spatially distributed clicks generalizes well when evaluated with a different robot user. However, they do not explore the generalization to real annotator interactions. In contrast to previous work, our focus lies on evaluating deep learning-based methods incorporating iterative corrections, with an emphasis on reducing the disparity between simulated and real annotators. Interactive segmentation reviews [7, 8] have discussed the lack of user-centric metrics for medical interactive segmentation. We address this by introducing 4 metrics that capture user behavior and quantify the simulated-to-real user shift.

Figure 1: Our proposed evaluation metrics and examples of label non-conformity.

2 Methods

We explore iterative interactive models that simulate clicks in a loop of 10 iterations. In each click iteration $i \in \{1, \dots, 10\}$, a robot user $R$ simulates a click, denoted as $\texttt{clicks}(R,I)[i] \in \mathbb{N}^{3}$, and combines it with the image $I \in \mathbb{R}^{W \times H \times D}$ as a joint input, where $W \times H \times D$ are the image dimensions. Using this joint input, the model predicts a segmentation mask $\texttt{pred}(I)[i] \in \{0,1\}^{W \times H \times D}$. Then, the missegmented regions within this prediction, denoted as $\texttt{err}(I)[i] \in \{0,1\}^{W \times H \times D}$, are employed to generate $\texttt{clicks}(R,I)[i+1]$ for the next iteration. We provide a notation table for all our equation terms in the supplementary.
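The click-predict-correct loop described above can be sketched as follows. This is a minimal illustration, not the SW-FastEdit or MONAI Label implementation: masks are stored as Python sets of voxel coordinates, `toy_model` and `first_error_voxel` are hypothetical stand-ins for the interactive model and a robot user, and treating the whole label as the error before the first click is an assumption made for brevity.

```python
from typing import Callable, List, Set, Tuple

Voxel = Tuple[int, int, int]
Mask = Set[Voxel]  # a binary mask stored as the set of its non-zero voxels


def run_interaction(
    label: Mask,
    predict: Callable[[List[Voxel]], Mask],
    next_click: Callable[[Mask], Voxel],
    n_iters: int = 10,
) -> Tuple[List[Voxel], List[Mask]]:
    """Alternate between simulating a click and updating the prediction."""
    clicks: List[Voxel] = []
    preds: List[Mask] = []
    err: Mask = set(label)  # assumption: before any click, the whole label is missing
    for _ in range(n_iters):
        if not err:  # nothing left to correct
            break
        clicks.append(next_click(err))  # clicks(R, I)[i]
        pred = predict(clicks)          # pred(I)[i]
        preds.append(pred)
        err = label ^ pred              # err(I)[i]: voxels where pred and label differ
    return clicks, preds


def toy_model(clicks: List[Voxel]) -> Mask:
    """Hypothetical model: segments exactly the clicked voxels."""
    return set(clicks)


def first_error_voxel(err: Mask) -> Voxel:
    """Hypothetical robot user: deterministically picks the smallest error voxel."""
    return min(err)


label = {(0, 0, 0), (0, 0, 1), (5, 5, 5)}
clicks, preds = run_interaction(label, toy_model, first_error_voxel)
print(clicks)
```

With this toy model the loop stops early once the prediction equals the label; a real interactive model would typically use all 10 iterations.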

2.1 Robot Users

(R1) Center Click: A common approach is to simulate clicks in the center of the largest missegmented component [7, 22]. However, the first click is placed in the center of the largest component of the label $I_{Y}$. This is defined as:

$$
\texttt{clicks}(R1,I)[i]=\begin{cases}
\texttt{center}(\texttt{largest\_component}(I_{Y})), & \text{if } i=1\\
\texttt{center}(\texttt{largest\_component}(\texttt{err}(I)[i-1])), & \text{if } i>1
\end{cases}
\tag{1}
$$

where $I_{Y} \in \{0,1\}^{W \times H \times D}$ is the ground-truth label for image $I$, $\texttt{center}(\cdot)$ computes the geometric center of a component as in [22], and $\texttt{largest\_component}(\cdot)$ computes the largest connected component.
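A minimal sketch of Eq. (1) follows, using sets of voxel coordinates as masks. The 6-connectivity and the choice of the component voxel nearest to the centroid as the "center" are assumptions for illustration; the method of [22] may differ. The function names are hypothetical.

```python
from collections import deque
from typing import List, Set, Tuple

Voxel = Tuple[int, int, int]
Mask = Set[Voxel]

NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def components(mask: Mask) -> List[Mask]:
    """Split a mask into 6-connected components with a breadth-first search."""
    remaining = set(mask)
    found: List[Mask] = []
    while remaining:
        seed = remaining.pop()
        component, queue = {seed}, deque([seed])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in NEIGHBORS:
                n = (x + dx, y + dy, z + dz)
                if n in remaining:
                    remaining.remove(n)
                    component.add(n)
                    queue.append(n)
        found.append(component)
    return found


def center(component: Mask) -> Voxel:
    """Component voxel closest to the centroid, so the click stays inside the component."""
    n = len(component)
    centroid = [sum(v[k] for v in component) / n for k in range(3)]
    return min(
        component,
        key=lambda v: (sum((v[k] - centroid[k]) ** 2 for k in range(3)), v),
    )


def center_click(label: Mask, prev_err: Mask, i: int) -> Voxel:
    """Eq. (1): use the label for the first click and the previous error afterwards.

    Assumes the selected region is non-empty.
    """
    region = label if i == 1 else prev_err
    return center(max(components(region), key=len))


# Toy label: a 3x3x3 cube plus one isolated voxel.
cube = {(x, y, z) for x in range(3) for y in range(3) for z in range(3)}
label = cube | {(10, 10, 10)}
print(center_click(label, set(), i=1))
```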

(R2) Uncertainty: Zheng et al. [23] sample a click in each iteration using the epistemic uncertainty of the model as a sampling distribution, defined as:

$$
\texttt{clicks}(R2,I)[i]\sim\begin{cases}
\texttt{uniform}(I_{Y}), & \text{if } i=1\\
\texttt{epistemic}(\texttt{pred}(I)[i-1]), & \text{if } i>1
\end{cases}
\tag{2}
$$

where $\texttt{epistemic}(\cdot)$ is the normalized epistemic uncertainty in $[0,1]$ using Monte Carlo Dropout [24], and $\texttt{uniform}(X)$ defines a uniform distribution over the non-zero entries of $X$.
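As a rough sketch of the sampling step in Eq. (2), the snippet below derives a normalized uncertainty map from several stochastic forward passes and samples a click proportionally to it. Using the per-voxel variance across passes as the uncertainty measure is an assumption; the text only states that Monte Carlo Dropout is used. The probability maps are hypothetical toy values.

```python
import random
from statistics import pvariance
from typing import Dict, List, Tuple

Voxel = Tuple[int, int, int]


def epistemic(mc_probs: List[Dict[Voxel, float]]) -> Dict[Voxel, float]:
    """Per-voxel variance over stochastic forward passes, scaled to [0, 1]."""
    voxels = mc_probs[0].keys()
    var = {v: pvariance([p[v] for p in mc_probs]) for v in voxels}
    peak = max(var.values())
    return {v: (u / peak if peak > 0 else 0.0) for v, u in var.items()}


def sample_click(weights: Dict[Voxel, float], rng: random.Random) -> Voxel:
    """Sample one voxel with probability proportional to its weight."""
    voxels = sorted(weights)
    w = [weights[v] for v in voxels]
    if sum(w) == 0:  # no uncertainty anywhere: fall back to uniform sampling
        return rng.choice(voxels)
    return rng.choices(voxels, weights=w, k=1)[0]


# Three hypothetical dropout passes over two voxels.
certain, uncertain = (0, 0, 0), (4, 4, 4)
passes = [
    {certain: 1.0, uncertain: 0.0},
    {certain: 1.0, uncertain: 0.5},
    {certain: 1.0, uncertain: 1.0},
]
unc = epistemic(passes)
click = sample_click(unc, random.Random(0))
print(unc, click)
```

Voxels on which all passes agree receive zero weight and are never sampled.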

(R3) Euclidean Distance Transform (EDT): Previous methods [9, 10] apply the EDT on missegmented regions as a sampling distribution for clicks:

$$
\texttt{clicks}(R3,I)[i]\sim\begin{cases}
\texttt{uniform}(I_{Y}), & \text{if } i=1\\
\texttt{EDT}(\texttt{err}(I)[i-1]), & \text{if } i>1
\end{cases}
\tag{3}
$$

where $\texttt{EDT}(\texttt{err}(I)[i-1])$ is the normalized EDT of the non-zero entries in the missegmented regions $\texttt{err}(I)[i-1]$ from the previous iteration.
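To illustrate EDT-weighted sampling as in Eq. (3), the sketch below computes a brute-force distance transform on a small error region and samples a click proportionally to it. It assumes isotropic unit voxel spacing and is only practical for toy masks; production code would use an optimized distance transform.

```python
import math
import random
from itertools import product
from typing import Dict, Set, Tuple

Voxel = Tuple[int, int, int]
Mask = Set[Voxel]


def edt(mask: Mask) -> Dict[Voxel, float]:
    """Distance of each mask voxel to the nearest voxel outside the mask (brute force)."""
    lo = [min(v[k] for v in mask) - 1 for k in range(3)]
    hi = [max(v[k] for v in mask) + 1 for k in range(3)]
    # The nearest outside voxel always lies within the bounding box padded by one voxel.
    box = product(*(range(lo[k], hi[k] + 1) for k in range(3)))
    outside = [v for v in box if v not in mask]
    return {v: min(math.dist(v, o) for o in outside) for v in mask}


def edt_click(err: Mask, rng: random.Random) -> Voxel:
    """Sample a click with probability proportional to the normalized EDT."""
    dist = edt(err)
    peak = max(dist.values())
    voxels = sorted(dist)
    return rng.choices(voxels, weights=[dist[v] / peak for v in voxels], k=1)[0]


# Toy error region: a 3x3x3 cube whose center voxel is farthest from the background.
cube = {(x, y, z) for x in range(3) for y in range(3) for z in range(3)}
distances = edt(cube)
click = edt_click(cube, random.Random(0))
print(distances[(1, 1, 1)], click)
```

Interior voxels get larger weights than voxels on the surface of the error region, so clicks tend to land away from its boundary.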

(R4) Uniform: The final robot user samples uniformly either from the previous error [17] or from the label for the first click as:

$$
\texttt{clicks}(R4,I)[i]\sim\begin{cases}
\texttt{uniform}(I_{Y}), & \text{if } i=1\\
\texttt{uniform}(\texttt{err}(I)[i-1]), & \text{if } i>1
\end{cases}
\tag{4}
$$

Note: In each iteration we simulate two types of clicks: $\texttt{clicks}(R,I)[i]^{\text{lesion}}$ and $\texttt{clicks}(R,I)[i]^{\text{background}}$. We designate the under- and over-segmented regions as missegmented areas $\texttt{err}(I)[i]$ for the "lesion" and "background" classes respectively, and omit the class labels in Eq. (1)-(4) for clarity.
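The two error regions from the note above can be written with set differences when masks are stored as sets of voxel coordinates. This is an illustrative sketch; `error_regions` is a hypothetical helper name.

```python
from typing import Set, Tuple

Voxel = Tuple[int, int, int]
Mask = Set[Voxel]


def error_regions(label: Mask, pred: Mask) -> Tuple[Mask, Mask]:
    """Return the regions that drive "lesion" and "background" clicks."""
    under_segmented = label - pred  # lesion voxels the model missed: "lesion" clicks
    over_segmented = pred - label   # background voxels predicted as lesion: "background" clicks
    return under_segmented, over_segmented


label = {(0, 0, 0), (0, 0, 1)}
pred = {(0, 0, 1), (0, 0, 2)}
lesion_err, background_err = error_regions(label, pred)
print(lesion_err, background_err)
```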

(R$_{\text{ours}}$) Our Robot User: In our first user study, we found that 25% of our annotators' clicks are outside the ground-truth labels. Label non-conforming clicks stem from two factors (see Fig. 1, top left): 1) ambiguous weak boundaries in the low-resolution PET scans, leading to clicks slightly outside the label boundaries; and 2) unannotated high uptake regions, spatially isolated from ground-truth labels. To address the first issue, we propose integrating click perturbations to spatially displace clicks with a probability $p_{\text{perturb}}$. For the second issue, we propose to systematically incorporate label non-conformity by sampling clicks in high uptake regions outside the ground-truth labels with a probability $p_{\text{system}}$. To achieve this, our robot user extends (R1) and is defined as:

$$
\texttt{clicks}(R_{\text{ours}},I)[i]=\begin{cases}
\texttt{clicks}(R1,I)[i], & \text{if } p_{i,1}\geq p_{\text{perturb}} \text{ and } p_{i,2}\geq p_{\text{system}}\\
\texttt{clicks}(R1,I)[i]+\widetilde{z}, & \text{if } p_{i,1}<p_{\text{perturb}} \text{ and } p_{i,2}\geq p_{\text{system}}\\
\widetilde{s}, & \text{if } p_{i,1}\geq p_{\text{perturb}} \text{ and } p_{i,2}<p_{\text{system}}\\
\widetilde{s}+\widetilde{z}, & \text{if } p_{i,1}<p_{\text{perturb}} \text{ and } p_{i,2}<p_{\text{system}}
\end{cases}
\tag{5}
$$

where $\widetilde{s}\sim\texttt{SUV}(I,I_{Y})$ and $\widetilde{z}\sim\mathcal{U}_{[-a,a]^{3}}$. $\texttt{SUV}(I,I_{Y})$ defines a distribution over the normalized Standardized Uptake Values in $I$ which are outside the label $I_{Y}$, $\widetilde{z}$ is a random perturbation with a maximal amplitude $a\in\mathbb{N}$, and in each iteration $p_{i,1}$ and $p_{i,2}$ are independently sampled from $\mathcal{U}_{[0,1]}$ to decide which case is applied.
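The four cases of Eq. (5) reduce to two independent decisions: whether to replace the (R1) click with an SUV-sampled voxel, and whether to add a perturbation. The sketch below follows that reading. It assumes integer-valued perturbations, takes the (R1) click and the SUV values outside the label as precomputed inputs, and omits clipping to the image bounds; `our_click` and `suv_outside` are hypothetical names.

```python
import random
from typing import Dict, Tuple

Voxel = Tuple[int, int, int]


def our_click(
    r1_click: Voxel,
    suv_outside: Dict[Voxel, float],
    p_perturb: float,
    p_system: float,
    a: int,
    rng: random.Random,
) -> Voxel:
    """Eq. (5): optionally swap in an SUV-sampled voxel, then optionally perturb it."""
    p1, p2 = rng.random(), rng.random()  # p_{i,1}, p_{i,2} drawn from U[0, 1)
    click = r1_click
    if p2 < p_system:
        # Systematic non-conformity: sample a voxel outside the label,
        # weighted by its (non-negative) uptake value.
        voxels = sorted(suv_outside)
        click = rng.choices(voxels, weights=[suv_outside[v] for v in voxels], k=1)[0]
    if p1 < p_perturb:
        # Spatial perturbation z with integer offsets in [-a, a]^3 (assumption).
        z = [rng.randint(-a, a) for _ in range(3)]
        click = (click[0] + z[0], click[1] + z[1], click[2] + z[2])
    return click


rng = random.Random(0)
suv_outside = {(9, 9, 9): 2.0}  # hypothetical uptake value outside the label
unchanged = our_click((5, 5, 5), suv_outside, 0.0, 0.0, 35, rng)
systematic = our_click((5, 5, 5), suv_outside, 0.0, 1.0, 35, rng)
perturbed = our_click((5, 5, 5), suv_outside, 1.0, 0.0, 2, rng)
print(unchanged, systematic, perturbed)
```

Setting both probabilities to zero recovers (R1), which makes the extension easy to check against the baseline.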

2.2 Model Architecture and Dataset

We use the pre-trained SW-FastEdit [9] interactive model based on MONAI Label [25] with a U-Net backbone [21] and conduct our user studies on the openly available AutoPET [1] dataset, which consists of 1014 PET/CT volumes with annotated tumor lesions of melanoma, lung cancer, or lymphoma. We exclusively utilize PET data and use SW-FastEdit's official test split of 10% of the volumes. The PET volumes have a voxel size of $2.0 \times 2.0 \times 3.0\ \text{mm}^{3}$ and an average resolution of $400 \times 400 \times 352$ voxels. Both user studies were conducted using 3D Slicer [26] and its MONAI Label plugin. We implemented our robot user experiments with MONAI Label [25] and will release the code.

3 Experiments and Results

3.1 Evaluation Metrics

For all metrics, we denote $\mathcal{I}$ as the set of PET images labeled in a user study, $\mathcal{A}$ as the set of real annotators participating in the study, and fix the number of clicks per image to 10. We visualize examples for (M1)-(M4) in Fig. 1.

(M1) The Label Conformity for an annotator $A$ is defined as:

$$
\textbf{M}_{1}(A)=\frac{1}{|\mathcal{I}|}\frac{1}{10}\sum_{I\in\mathcal{I}}\sum_{i=1}^{10}\textbf{[}I_{Y}[\texttt{clicks}(A,I)[i]]=1\textbf{]}
\tag{6}
$$

where $\textbf{[}\cdot\textbf{]}$ is the Iverson bracket. (M1) measures to what extent an annotator's clicks agree with the ground-truth labels of the PET images.
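A minimal sketch of (M1) with masks stored as sets of voxel coordinates follows. It divides by the number of clicks per image, which equals the fixed factor of 10 in Eq. (6) when every image has 10 clicks; the toy inputs are hypothetical.

```python
from typing import Sequence, Set, Tuple

Voxel = Tuple[int, int, int]
Mask = Set[Voxel]


def label_conformity(clicks_per_image: Sequence[Sequence[Voxel]], labels: Sequence[Mask]) -> float:
    """Fraction of clicks that fall inside the ground-truth label, averaged over images."""
    total = 0.0
    for clicks, label in zip(clicks_per_image, labels):
        total += sum(c in label for c in clicks) / len(clicks)
    return total / len(labels)


labels = [{(0, 0, 0), (0, 0, 1)}, {(1, 1, 1)}]
clicks = [[(0, 0, 0), (9, 9, 9)], [(1, 1, 1), (1, 1, 1)]]
m1 = label_conformity(clicks, labels)
print(m1)
```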

(M2) The Centerness for annotator $A$ is defined as:

$$
\textbf{M}_{2}(A)=\frac{1}{|\mathcal{I}|}\frac{1}{|\bar{C}(A,I)|}\sum_{I\in\mathcal{I}}\sum_{c\in\bar{C}(A,I)}\frac{\texttt{bound}(c,I_{Y})}{\texttt{bound}(c,I_{Y})+\texttt{cent\_dist}(c,I_{Y})}
\tag{7}
$$

where $\bar{C}(A,I)=\{c\ |\ c\in\texttt{clicks}(A,I)\ \text{and}\ I_{Y}[c]=1\}$ is the set of label conforming clicks of annotator $A$ for image $I$, $\texttt{bound}(c,I_{Y})$ is the minimum distance of click $c$ to the boundary of the label $I_{Y}$, and $\texttt{cent\_dist}(c,I_{Y})$ is the minimum distance of click $c$ to the center of the label $I_{Y}$. Small (M2) values indicate that label-conforming clicks are placed near the boundary, whereas large values show that clicks are placed near the central regions of the label.
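The following sketch shows one way to compute (M2) on toy data. Several choices are assumptions rather than details given in the text: the boundary distance is measured to the nearest voxel just outside the label, the center distance to the nearest component centroid, the factor $1/|\bar{C}(A,I)|$ is read as a per-image normalization, and images without label-conforming clicks are skipped.

```python
import math
from collections import deque
from typing import List, Sequence, Set, Tuple

Voxel = Tuple[int, int, int]
Mask = Set[Voxel]
NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def components(mask: Mask) -> List[Mask]:
    """Split a mask into 6-connected components with a breadth-first search."""
    remaining, found = set(mask), []
    while remaining:
        seed = remaining.pop()
        component, queue = {seed}, deque([seed])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in NEIGHBORS:
                n = (x + dx, y + dy, z + dz)
                if n in remaining:
                    remaining.remove(n)
                    component.add(n)
                    queue.append(n)
        found.append(component)
    return found


def centerness(clicks_per_image: Sequence[Sequence[Voxel]], labels: Sequence[Mask]) -> float:
    """Mean of bound / (bound + cent_dist) over label-conforming clicks.

    Assumes at least one image has a label-conforming click.
    """
    per_image = []
    for clicks, label in zip(clicks_per_image, labels):
        conforming = [c for c in clicks if c in label]
        if not conforming:
            continue  # assumption: images without conforming clicks are skipped
        # Voxels just outside the label; the nearest one defines bound(c, I_Y) here.
        shell = {(x + dx, y + dy, z + dz) for x, y, z in label for dx, dy, dz in NEIGHBORS} - label
        centers = [
            tuple(sum(v[k] for v in comp) / len(comp) for k in range(3))
            for comp in components(label)
        ]
        scores = []
        for c in conforming:
            bound = min(math.dist(c, o) for o in shell)
            cent_dist = min(math.dist(c, m) for m in centers)
            scores.append(bound / (bound + cent_dist))
        per_image.append(sum(scores) / len(scores))
    return sum(per_image) / len(per_image)


cube = {(x, y, z) for x in range(3) for y in range(3) for z in range(3)}
central = centerness([[(1, 1, 1)]], [cube])  # click at the cube center
corner = centerness([[(0, 0, 0)]], [cube])   # click at a cube corner
print(central, corner)
```

A click at the component center scores 1, while clicks near the boundary score closer to 0.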

(M3) The Click Diversity for annotator $A$ is defined as:

$$
\textbf{M}_{3}(A)=\frac{1}{|\mathcal{I}|}\sum_{I\in\mathcal{I}}\frac{|\{\widetilde{Y}\ |\ \widetilde{Y}\in\texttt{components}(I_{Y})\ \text{and}\ \exists c\in\texttt{clicks}(A,I):\ c\in\widetilde{Y}\}|}{\min(|\texttt{components}(I_{Y})|,\ |\texttt{clicks}(A,I)|)}
\tag{8}
$$

where $\texttt{components}(\cdot)$ is the set of all connected components. (M3) measures to what extent clicks are spread out in different connected components in the label.
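A small sketch of (M3) follows. The 6-connectivity used to find components is an assumption, and every image is assumed to have a non-empty label and at least one click; the toy label is hypothetical.

```python
from collections import deque
from typing import List, Sequence, Set, Tuple

Voxel = Tuple[int, int, int]
Mask = Set[Voxel]
NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def components(mask: Mask) -> List[Mask]:
    """Split a mask into 6-connected components with a breadth-first search."""
    remaining, found = set(mask), []
    while remaining:
        seed = remaining.pop()
        component, queue = {seed}, deque([seed])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in NEIGHBORS:
                n = (x + dx, y + dy, z + dz)
                if n in remaining:
                    remaining.remove(n)
                    component.add(n)
                    queue.append(n)
        found.append(component)
    return found


def click_diversity(clicks_per_image: Sequence[Sequence[Voxel]], labels: Sequence[Mask]) -> float:
    """Share of label components hit by at least one click, averaged over images."""
    total = 0.0
    for clicks, label in zip(clicks_per_image, labels):
        comps = components(label)
        hit = sum(1 for comp in comps if any(c in comp for c in clicks))
        total += hit / min(len(comps), len(clicks))
    return total / len(labels)


# Toy label with three isolated single-voxel lesions.
a, b, c = (0, 0, 0), (5, 5, 5), (9, 9, 9)
label = {a, b, c}
spread = click_diversity([[a, b]], [label])    # two clicks, two different lesions
repeated = click_diversity([[a, a]], [label])  # two clicks, same lesion
print(spread, repeated)
```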

(M4) The Label Proximity for an annotator $A$ is defined as:

$$
\textbf{M}_{4}(A)=\frac{1}{|\mathcal{I}|}\frac{1}{|\hat{C}(A,I)|}\sum_{I\in\mathcal{I}}\sum_{c\in\hat{C}(A,I)}\frac{1}{d(c,I_{Y})}
\tag{9}
$$

where $\hat{C}(A,I)=\{c\ |\ c\in\texttt{clicks}(A,I)\ \text{and}\ I_{Y}[c]=0\}$ is the set of label non-conforming clicks of annotator $A$ for image $I$, and $d(c,I_{Y})=\min(\{||c-y||\ |\ y\in\mathbb{N}^{W\times H\times D}\ \text{and}\ I_{Y}[y]=1\})$. (M4) computes the average inverse distance of the annotator clicks outside the ground-truth label to the label $I_{Y}$. Higher (M4) values suggest non-conforming clicks are close to the label boundary, while lower values indicate clicks are far from any component of the label $I_{Y}$, suggesting systematic non-conformity.
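(M4) can be sketched with a brute-force nearest-voxel search, as below. Reading $1/|\hat{C}(A,I)|$ as a per-image normalization and skipping images without non-conforming clicks are assumptions; distances are in voxel units, and the toy inputs are hypothetical.

```python
import math
from typing import Sequence, Set, Tuple

Voxel = Tuple[int, int, int]
Mask = Set[Voxel]


def label_proximity(clicks_per_image: Sequence[Sequence[Voxel]], labels: Sequence[Mask]) -> float:
    """Mean inverse distance from clicks outside the label to the nearest label voxel.

    Assumes non-empty labels and at least one image with a non-conforming click.
    """
    per_image = []
    for clicks, label in zip(clicks_per_image, labels):
        outside = [c for c in clicks if c not in label]
        if not outside:
            continue  # assumption: images without non-conforming clicks are skipped
        inverse = [1.0 / min(math.dist(c, y) for y in label) for c in outside]
        per_image.append(sum(inverse) / len(inverse))
    return sum(per_image) / len(per_image)


label = {(0, 0, 0)}
near = label_proximity([[(0, 0, 1), (0, 0, 0)]], [label])  # one click adjacent to the label
far = label_proximity([[(0, 0, 2), (0, 0, 4)]], [label])   # clicks 2 and 4 voxels away
print(near, far)
```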

(M5) The Consistent Improvement is defined in [15] as:

$$
\textbf{M}_{5}(A)=\frac{1}{|\mathcal{I}|}\frac{1}{10}\sum_{I\in\mathcal{I}}\sum_{i=1}^{10}\textbf{[}\texttt{dice}(A,I)[i]>\texttt{dice}(A,I)[i-1]\textbf{]}
\tag{10}
$$

where $\texttt{dice}(A,I)[i]$ is the Dice score after annotator $A$'s $i^{\text{th}}$ click on image $I$.
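The sketch below computes a Dice score on set-based masks and the share of clicks that improve it, following Eq. (10). It assumes each Dice curve starts with a value for $i=0$ (the score before the first click), which the text does not specify; the example curve is hypothetical.

```python
from typing import Sequence, Set, Tuple

Voxel = Tuple[int, int, int]
Mask = Set[Voxel]


def dice(pred: Mask, label: Mask) -> float:
    """Dice score between two binary masks stored as voxel sets."""
    if not pred and not label:
        return 1.0
    return 2 * len(pred & label) / (len(pred) + len(label))


def consistent_improvement(dice_curves: Sequence[Sequence[float]]) -> float:
    """Share of clicks that strictly increase the Dice score, averaged over images.

    Each curve holds the Dice before any click followed by the Dice after each click.
    """
    total = 0.0
    for curve in dice_curves:
        steps = len(curve) - 1
        total += sum(curve[i] > curve[i - 1] for i in range(1, len(curve))) / steps
    return total / len(dice_curves)


overlap = dice({(0, 0, 0), (0, 0, 1)}, {(0, 0, 1), (0, 0, 2)})
m5 = consistent_improvement([[0.0, 0.2, 0.2, 0.5]])  # hypothetical 3-click curve
print(overlap, m5)
```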

(M6) The User Shift determines the mean absolute difference in all metrics (M1)-(M5) between a simulated robot user $R$ and all real annotators $\mathcal{A}$:

$$
\textbf{M}_{6}(R,\mathcal{A})=\frac{1}{|\mathcal{A}|}\frac{1}{5}\sum_{A\in\mathcal{A}}\sum_{i=1}^{5}|\textbf{M}_{i}(R)-\textbf{M}_{i}(A)|
\tag{11}
$$

(M7) The Dice Difference for a robot user $R$ is defined as:

$$
\textbf{M}_{7}(R,\mathcal{A})=\frac{1}{|\mathcal{I}|}\frac{1}{|\mathcal{A}|}\frac{1}{10}\sum_{I\in\mathcal{I}}\sum_{A\in\mathcal{A}}\sum_{i=1}^{10}|\texttt{dice}(A,I)[i]-\texttt{dice}(R,I)[i]|
\tag{12}
$$

(M6) quantifies the fidelity of the robot user in emulating annotator behavior, while (M7) evaluates its ability to reproduce the segmentation performance of the interactive model as used by real annotators.
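Both aggregate metrics are plain mean absolute differences, which the following sketch makes explicit. The data layout (lists indexed by annotator, image, and click) and the toy numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from typing import Sequence


def user_shift(robot_metrics: Sequence[float], annotator_metrics: Sequence[Sequence[float]]) -> float:
    """Eq. (11): mean absolute metric difference between a robot user and all annotators.

    robot_metrics holds the metric values of the robot user;
    annotator_metrics holds one such list per real annotator.
    """
    total = sum(
        abs(r - m)
        for metrics in annotator_metrics
        for r, m in zip(robot_metrics, metrics)
    )
    return total / (len(annotator_metrics) * len(robot_metrics))


def dice_difference(
    robot_dice: Sequence[Sequence[float]],
    annotator_dice: Sequence[Sequence[Sequence[float]]],
) -> float:
    """Eq. (12): mean absolute Dice gap per click.

    robot_dice[I][i] is the robot user's Dice on image I after click i;
    annotator_dice[A][I][i] is the same for annotator A.
    """
    total, count = 0.0, 0
    for curves in annotator_dice:
        for robot_curve, human_curve in zip(robot_dice, curves):
            for r, h in zip(robot_curve, human_curve):
                total += abs(h - r)
                count += 1
    return total / count


# Toy example with two metrics, two annotators, one image, and two clicks.
m6 = user_shift([1.0, 0.5], [[0.5, 0.5], [1.0, 0.0]])
m7 = dice_difference([[0.5, 1.0]], [[[0.5, 0.5]], [[1.0, 1.0]]])
print(m6, m7)
```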

3.2 User Studies and Results

Setup. We conduct two user studies, each with four annotators from a medical background. In both studies, annotators were instructed to place 10 "lesion" and 10 "background" clicks, updating the model prediction after each pair of clicks to replicate the workflow of simulated robot users. In our first user study, four annotators labeled the same 10 PET volumes from the test split. We used this user study to determine the optimal values of $p_{\text{perturb}}$ and $p_{\text{system}}$ for our robot user. In our second user study, four different annotators labeled 6 PET volumes. We conducted this as a "validation" user study to confirm that our results from the first user study generalize to other volumes and annotators. For both studies, we applied each robot user to the same PET images annotated by the real users.

Figure 2: Analysis of $p_{\text{perturb}}$ (left) and $p_{\text{system}}$ (right) in our first user study.
Table 1: User Shift and Dice Difference of all robot users on both user studies. Columns (R1)-(R4) are previous work; the "Ours" columns use $a=35$ and are labeled with $p_{\text{perturb}}$ / $p_{\text{system}}$.

| Study | Metric | (R1) | (R2) | (R3) | (R4) | Ours 25% / 0% | Ours 19.6% / 6.7% | Ours 13.4% / 13.4% | Ours 6.7% / 19.6% | Ours 0% / 25% |
| User Study 1 | (M6) User Shift | 27.4 | 35.0 | 28.5 | 29.5 | 9.4 | 8.4 | 6.8 | 9.0 | 11.6 |
| User Study 1 | (M7) Dice Difference | 8.7 | 10.0 | 9.2 | 11.6 | 6.0 | 5.3 | 3.6 | 5.8 | 6.9 |
| User Study 2 | (M6) User Shift | 30.0 | 31.7 | 33.8 | 30.0 | 8.4 | 7.6 | 6.7 | 8.6 | 9.2 |
| User Study 2 | (M7) Dice Difference | 8.5 | 9.0 | 7.0 | 7.5 | 5.3 | 4.8 | 3.7 | 6.2 | 6.7 |

Results: Our Robot User. In the first user study, we assessed our robot user by varying $p_{\text{perturb}}$, $p_{\text{system}}$, and the perturbation amplitude $a$, and plotted the results in Fig. 2. Spatial perturbations with $p_{\text{perturb}} \leq 75\%$ consistently outperform existing robot users in terms of user shift. The lowest user shift is achieved with $p_{\text{perturb}} \leq 75\%$ and $a \in [20, 35]$, in particular with $p_{\text{perturb}} = 25\%$ and $a = 35$; it deteriorates for $a > 35$ or $p_{\text{perturb}} = 100\%$ due to excessive spatial noise. Incorporating systematic non-conformity also consistently reduces the user shift, with $p_{\text{system}} = 25\%$ as the optimal value, similar to $p_{\text{perturb}}$. Since $25\%$ is the optimal value for both $p_{\text{perturb}}$ and $p_{\text{system}}$, we explore mixing them with a joint probability of $25\%$.
The results in Table 1 show that mixing further reduces both the user shift and the Dice difference, with the best results obtained when $p_{\text{system}} = p_{\text{perturb}}$.
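The mixing of the two non-conformity types can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `simulate_click`, the interpretation of $a$ as a maximum per-axis voxel offset, the even split between the two non-conformity types, and the choice of the top 1% uptake voxels outside the label are all assumptions made for illustration.

```python
import numpy as np


def simulate_click(label, uptake, rng, p_nonconform=0.25, amplitude=35):
    """Sample one simulated click as a tuple of voxel indices.

    Hypothetical sketch: with probability `p_nonconform` the click is
    non-conforming, split evenly (assumption) between a spatial perturbation
    and a systematic click on a high-uptake voxel outside the label.
    """
    label = label.astype(bool)
    inside = np.argwhere(label)
    # Start from a label-conforming click, as traditional robot users do.
    click = inside[rng.integers(len(inside))]

    if rng.random() >= p_nonconform:
        return tuple(int(c) for c in click)

    if rng.random() < 0.5:
        # Spatial perturbation: uniform integer offset per axis within
        # [-amplitude, amplitude], clipped to the volume bounds. The
        # perturbed click may still land inside the label.
        offset = rng.integers(-amplitude, amplitude + 1, size=label.ndim)
        click = np.clip(click + offset, 0, np.array(label.shape) - 1)
    else:
        # Systematic non-conformity: pick a voxel among the highest-uptake
        # voxels outside the label (top 1% is an arbitrary choice here).
        outside = np.argwhere(~label)
        if len(outside) > 0:
            values = uptake[~label]  # same C-order as `outside`
            n_top = max(1, len(values) // 100)
            top = np.argsort(values)[-n_top:]
            click = outside[rng.choice(top)]

    return tuple(int(c) for c in click)


# Toy usage on a synthetic volume with a cubic "lesion".
rng = np.random.default_rng(0)
label = np.zeros((32, 32, 32), dtype=bool)
label[12:20, 12:20, 12:20] = True
uptake = rng.random(label.shape)
clicks = [simulate_click(label, uptake, rng, amplitude=10) for _ in range(1000)]
outside_fraction = np.mean([not label[c] for c in clicks])
```

In this toy setup `outside_fraction` stays below the 25% non-conformity rate, because a perturbed click can still fall inside the label.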

Figure 3: Metric values (M1)-(M5) of all robot users on both user studies.

Results: Previous Work. The results, plotted in Fig. 3 and listed in Table 1, reveal a large discrepancy between existing robot users and the average annotator in all metrics. This contrast is especially notable in (M1) and (M4), since robot users always produce label-conforming clicks, while real annotators click outside the label in $25\%$ of their interactions. Building on this insight, our robot user introduces label non-conformity in $25\%$ of its simulated clicks by spatially perturbing clicks and systematically sampling from high-uptake regions outside the label. This non-conformity achieves the lowest user shift and Dice difference in both user studies. Our robot user reduces the Dice difference from $8.7\%$ to $3.6\%$ in the first user study and from $7.0\%$ to $3.7\%$ in the second, which indicates that the Dice score reported when evaluating with our robot user is considerably closer to that obtained with real annotators. The Dice curves are visualized in Fig. 4.
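To put the reported Dice differences in relative terms, the short calculation below uses only the four values quoted above. The helper name `relative_reduction` is introduced here purely for illustration.

```python
def relative_reduction(before, after):
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


# Dice differences (in percentage points) quoted in the text.
study1 = relative_reduction(8.7, 3.6)
study2 = relative_reduction(7.0, 3.7)

print(f"User study 1: {study1:.1%}")  # 58.6%
print(f"User study 2: {study2:.1%}")  # 47.1%
```

The Dice difference is thus reduced by roughly 59% in the first user study and 47% in the second.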

Figure 4: Mean Dice curves of all robot users for both user studies.

User Shift vs. Dice Difference. As the user shift only quantifies the behavioral shift, we examine its correlation with the Dice difference for all our robot user configurations in the first user study. Fig. 5 reveals a Pearson correlation of $\rho = 0.89$ between the user shift and the Dice difference. Importantly, omitting any of our metrics (M1)-(M5) from (M6) decreases the correlation to $\rho < 0.8$. This confirms that our proposed metrics quantify not only the annotation style but also how this style influences the segmentation performance.
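The Pearson coefficient itself is straightforward to compute, as the sketch below shows. As illustrative input it uses the nine (M6)/(M7) pairs from the User Study 2 rows of Table 1. The $\rho = 0.89$ reported above is computed over all robot user configurations in the first user study, so this example is not expected to reproduce that value.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


# (M6) user shift and (M7) Dice difference, User Study 2 rows of Table 1.
user_shift = [30.0, 31.7, 33.8, 30.0, 8.4, 7.6, 6.7, 8.6, 9.2]
dice_difference = [8.5, 9.0, 7.0, 7.5, 5.3, 4.8, 3.7, 6.2, 6.7]

rho = pearson(user_shift, dice_difference)
print(f"Pearson correlation: {rho:.2f}")
```

Repeating the same computation with one of (M1)-(M5) left out of the user shift would give the ablation described above; that step is omitted here because it requires the per-metric values.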

Figure 5: Correlation between (M6) and (M7) on the first user study results.

4 Conclusion

Our user studies reveal the challenges in evaluating interactive models through simulated interactions. Despite its simplicity, our robot user exposes fundamental flaws in traditional robot users that heavily rely on ground-truth labels. This is particularly problematic in domains where experts disagree on the ground truth in 25% of their interactions, as observed in our user studies for whole-body PET lesion annotation. Traditional robot users exhibit significant user shift and Dice difference compared to real annotators, resulting in overly optimistic Dice scores and unrealistic annotation behavior. By incorporating click perturbations and systematic label non-conformity, we substantially reduce the user shift and Dice difference compared to previous robot users. This facilitates a more realistic evaluation of interactive model performance without the need for extensive user studies involving the entire test split.

Acknowledgements. The user studies were conducted in collaboration with the Annotation Lab Essen (https://annotationlab.ikim.nrw/). The present contribution is supported by the Helmholtz Association under the joint research school “HIDSS4Health – Helmholtz Information and Data Science School for Health”. This work was performed on the HoreKa supercomputer funded by the Ministry of Science, Research and the Arts Baden-Württemberg and by the Federal Ministry of Education and Research.

References

  • [1] Gatidis, Sergios, et al. "The autoPET challenge: Towards fully automated lesion segmentation in oncologic PET/CT imaging." (2023).
  • [2] Menze, Bjoern H., et al. "The multimodal brain tumor image segmentation benchmark (BRATS)." IEEE transactions on medical imaging 34.10 (2014): 1993-2024.
  • [3] Antonelli, Michela, et al. "The medical segmentation decathlon." Nature communications 13.1 (2022): 4128.
  • [4] Ji, Yuanfeng, et al. "Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation." Advances in Neural Information Processing Systems 35 (2022): 36722-36732.
  • [5] Wasserthal, Jakob, et al. "Totalsegmentator: Robust segmentation of 104 anatomic structures in ct images." Radiology: Artificial Intelligence 5.5 (2023).
  • [6] Hernandez Petzsche, Moritz R., et al. "ISLES 2022: A multi-center magnetic resonance imaging stroke lesion segmentation dataset." Scientific data 9.1 (2022): 762.
  • [7] Marinov, Zdravko, et al. "Deep Interactive Segmentation of Medical Images: A Systematic Review and Taxonomy." arXiv preprint arXiv:2311.13964 (2023).
  • [8] Zhao, Feng, and Xianghua Xie. "An overview of interactive medical image segmentation." Annals of the BMVA 2013.7 (2013): 1-22.
  • [9] Hadlich, Matthias, et al. "Sliding Window FastEdit: A Framework for Lesion Annotation in Whole-body PET Images." arXiv preprint arXiv:2311.14482 (2023).
  • [10] Hallitschke, V.J., et al. "Multimodal Interactive Lung Lesion Segmentation: A Framework for Annotating PET/CT Images Based on Physiological and Anatomical Cues," 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI), Cartagena, Colombia, 2023, pp. 1-5.
  • [11] Asad, Muhammad, et al. "Adaptive Multi-scale Online Likelihood Network for AI-Assisted Interactive Segmentation." International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2023.
  • [12] Wang, Guotai, et al. "DeepIGeoS: a deep interactive geodesic framework for medical image segmentation." IEEE transactions on pattern analysis and machine intelligence 41.7 (2018): 1559-1572.
  • [13] Luo, Xiangde, et al. "MIDeepSeg: Minimally interactive segmentation of unseen objects from medical images using deep learning." Medical image analysis 72 (2021): 102102.
  • [14] Wang, Guotai, et al. "Interactive medical image segmentation using deep learning with image-specific fine tuning." IEEE transactions on medical imaging 37.7 (2018): 1562-1573.
  • [15] Marinov, Z., Stiefelhagen R., Kleesiek J. "Guiding the Guidance: A Comparative Analysis of User Guidance Signals for Interactive Segmentation of Volumetric Images." International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2023.
  • [16] Moschidis, Emmanouil, and Jim Graham. "A systematic performance evaluation of interactive image segmentation methods based on simulated user interaction." 2010 IEEE International Symposium on Biomedical Imaging: From Nano to Macro. IEEE, 2010.
  • [17] Amrehn, Mario, et al. "Interactive neural network robot user investigation for medical image segmentation." Bildverarbeitung für die Medizin 2019: Algorithmen–Systeme–Anwendungen. Proceedings des Workshops vom 17. bis 19. März 2019 in Lübeck. Springer Fachmedien Wiesbaden, 2019.
  • [18] Kohli, Pushmeet, et al. "User-centric learning and evaluation of interactive segmentation systems." International journal of computer vision 100 (2012): 261-274.
  • [19] Boykov, Yuri, and Gareth Funka-Lea. "Graph cuts and efficient ND image segmentation." International journal of computer vision 70.2 (2006): 109-131.
  • [20] Benenson, Rodrigo, Stefan Popov, and Vittorio Ferrari. "Large-scale interactive object segmentation with human annotators." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.
  • [21] Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer International Publishing, 2015.
  • [22] Liu, Qin, et al. "iSegFormer: interactive segmentation via transformers with application to 3D knee MR images." International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2022.
  • [23] Zheng, Ervine, et al. "A continual learning framework for uncertainty-aware interactive image segmentation." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. No. 7. 2021.
  • [24] Gal, Yarin, and Zoubin Ghahramani. "Dropout as a bayesian approximation: Representing model uncertainty in deep learning." international conference on machine learning. PMLR, 2016.
  • [25] Diaz-Pinto, Andres, et al. "Monai label: A framework for ai-assisted interactive labeling of 3d medical images." arXiv preprint arXiv:2203.12362 (2022).
  • [26] Fedorov, Andriy, et al. "3D Slicer as an image computing platform for the Quantitative Imaging Network." Magnetic resonance imaging 30.9 (2012): 1323-1341.