License: CC BY 4.0
arXiv:2403.12834v1 [cs.CV] 19 Mar 2024
Affiliations:
1 Division of Medical Image Computing, DKFZ, Heidelberg, Germany
2 Helmholtz Imaging, DKFZ, Heidelberg, Germany
3 Interactive Machine Learning Group, DKFZ, Heidelberg, Germany
4 Medical Faculty Heidelberg, University of Heidelberg, Heidelberg, Germany
5 Faculty of Mathematics and Computer Science, Heidelberg University, Germany
6 HIDSS4Health - Helmholtz Information and Data Science School for Health

Embarrassingly Simple Scribble Supervision for 3D Medical Segmentation

Karol Gotkowski (1, 2), Carsten Lüth (2, 3), Paul F. Jäger (2, 3), Sebastian Ziegler (1, 2), Lars Krämer (1, 2), Stefan Denner (1, 4), Shuhan Xiao (1, 5), Nico Disch (1, 5, 6), Klaus H. Maier-Hein (1, 2), Fabian Isensee (1, 2)
Abstract

Traditionally, segmentation algorithms require dense annotations for training, demanding significant annotation efforts, particularly within the 3D medical imaging field. Scribble-supervised learning emerges as a possible solution to this challenge, promising a reduction in annotation efforts when creating large-scale datasets.
Recently, a plethora of methods for optimized learning from scribbles have been proposed, but have so far failed to position scribble annotation as a beneficial alternative. We relate this shortcoming to two major issues: 1) the complex nature of many methods which deeply ties them to the underlying segmentation model, thus preventing a migration to more powerful state-of-the-art models as the field progresses and 2) the lack of a systematic evaluation to validate consistent performance across the broader medical domain, resulting in a lack of trust when applying these methods to new segmentation problems.
To address these issues, we propose a comprehensive scribble supervision benchmark consisting of seven datasets covering a diverse set of anatomies and pathologies imaged with varying modalities. We furthermore propose the systematic use of partial losses, i.e. losses that are only computed on annotated voxels. Contrary to most existing methods, these losses can be seamlessly integrated into state-of-the-art segmentation methods, enabling them to learn from scribble annotations while preserving their original loss formulations. Our evaluation using nnU-Net reveals that while most existing methods suffer from a lack of generalization, the proposed approach consistently delivers state-of-the-art performance. Thanks to its simplicity, our approach presents an embarrassingly simple yet effective solution to the challenges of scribble supervision. Source code as well as our extensive scribble benchmarking suite will be made publicly available upon publication.

Keywords: scribble supervised learning, segmentation, 3D, medical

1 Introduction

In 3D medical image segmentation, learning from densely annotated images is the predominant approach. Generating annotations is costly and time-consuming, causing this process to be the major bottleneck when creating large-scale datasets. An emerging solution to this challenge lies in approaches able to learn from sparsely annotated data, particularly in the form of scribbles (see Figure 1A). Given a fixed annotation budget, scribble-supervised methods allow the annotation of more images, enabling the training of more robust and potentially more accurate models in a domain where datasets would otherwise be too small to capture the naturally occurring variations.

Figure 1: (A): Depiction of how scribble-supervised learning reduces annotation effort. (B): Partial losses only consider annotated voxels during the loss computation. This idea, initially introduced for the Cross-Entropy loss, can be extended to other popular segmentation losses like the Dice loss to enable state-of-the-art segmentation methods to learn from scribble annotations while preserving their original loss formulations. (C): Drawbacks of current systemic and lightweight scribble supervision methods compared to the use of partial losses.

In light of this promise, an extensive range of scribble-supervised methods have recently been proposed [8, 13, 5, 2, 23, 15, 22, 16, 7, 21, 12, 11, 6, 20, 17]. These can be categorized into systemic and lightweight, depending on the strategy used. Systemic methods are commonly designed with a specific task in mind such as cardiac segmentation and modify integral parts of the training pipeline, tightly binding them to the base method they are built on. Typical strategies of systemic methods include custom augmentations [22, 23, 15], intricate regularized loss functions [8, 13, 22, 23, 7, 12, 11, 6], pseudo-label learning [13, 16, 7, 12, 11, 6] and novel segmentation model architectures [8, 13, 5, 16, 2, 7, 21, 11, 6]. In contrast, lightweight methods typically revolve around extending the loss formulation or adding regularization terms in order to make better use of the provided scribble labels. At their core, these methods utilize the partial Cross-Entropy (pCE) to consider only annotated voxels for the loss computation. Due to their detachment from the segmentation architecture, they can easily be transferred to other segmentation methods. In principle, lightweight methods have the potential to generalize well across tasks. However, so far, evaluation in the respective publications was highly limited covering either only one dataset in the medical domain [17] or only considering natural images [20, 19].
Although scribble supervision is an active field of research, the proposed methods have so far failed to position scribble-supervised learning as a beneficial alternative to densely supervised learning. We relate this shortcoming to two hurdles that the current literature must overcome. First, with segmentation being a highly active area of research, scribble learning functionality has to be readily integrable into new state-of-the-art methods as they are released. This requirement stands in contrast to the most prominent scribble learning approaches today, which require a deep systemic integration into the segmentation pipeline. Second, scribble supervision methods need to deliver consistent performance across tasks and modalities, without the need to optimize a method for each new problem it is applied to. Yet a persistent issue accompanying recent developments is the lack of a comprehensive evaluation of the proposed methods to validate claims of general functionality. To date, scribble supervision methods are typically evaluated exclusively within a single specific task such as cardiac segmentation, omitting a broader validation across the medical domain.
We propose solutions to both of these shortcomings. In a first step, we introduce a large-scale benchmark for systematically evaluating recently proposed methods across tasks and modalities, revealing that lightweight methods have superior generalization relative to systemic methods (see Figure 1C). We take this as inspiration to revisit a core component of these methods: the pCE loss, which on its own already achieves competitive results. However, state-of-the-art segmentation methods rely on more than just the Cross-Entropy loss to maximize segmentation performance. We propose to generalize the idea behind pCE to arbitrary loss functions, enabling them to make use of sparsely annotated labels by ignoring all non-annotated voxels in the loss computation. By adapting state-of-the-art segmentation methods to use such partial losses, we unlock the ability to learn from sparsely annotated data without interfering with their loss composition. Using nnU-Net [9] as an example, we show how the partial loss can enable state-of-the-art segmentation methods to learn from scribble annotations, yielding improved segmentation performance relative to the competition (see Figure 1B). Moreover, we show that the partial loss is annotation agnostic and can be applied to arbitrary sparse annotations. We will make our benchmark as well as our implementation of partial losses in nnU-Net publicly available upon acceptance of this manuscript.

2 Materials & Methods

2.1 Partial loss

We propose to generalize the simple but effective idea behind the partial Cross-Entropy loss, namely the omission of all non-annotated voxels in the loss computation, to arbitrary segmentation losses. We refer to this concept as partial loss, highlighting that the loss computation is constrained to the masked set of annotated voxels. As a lightweight extension, this idea can be seamlessly integrated into existing segmentation losses, enabling us to modify state-of-the-art segmentation methods such that they can be used to learn from scribble annotations.
Let $Y=\{Y_{i}\}_{i=1}^{N}\in\mathcal{C}\cup\xi$ be a partially annotated reference segmentation for a given set of classes $\mathcal{C}$, with non-annotated voxels being assigned to the additional ignore class $\xi$. The filtering function $A(x,Y)$ returns all values of an indexable $x$ for the corresponding subset of indices $\mathcal{S}\subseteq\{1,...,N\}$ for which $Y_{i}\in\mathcal{C}\ \forall i\in\mathcal{S}$. The segmentation model returns softmax outputs defining $p(\hat{Y})$. Given a loss ($L$), the partial loss ($\mathrm{pL}$) for a single image is defined as follows:

$$\mathrm{pL}(p(\hat{Y}),Y)=L(A(p(\hat{Y}),Y),A(Y,Y)) \tag{1}$$

We intend to couple the partial loss with the nnU-Net [9], which utilizes a combination of Cross-Entropy and Dice as a loss function to achieve state-of-the-art performance. Modifying the loss in nnU-Net yields the following equation:

$$\mathrm{pL}(p(\hat{Y}),Y)=\underbrace{\dfrac{1}{|\mathcal{C}|}\sum_{c\in\mathcal{C}}\left(1-\dfrac{2\cdot\sum_{i\in\mathcal{S}}\mathbb{1}_{Y_{i}=c}\,p(\hat{Y}_{i}=Y_{i})}{\sum_{i\in\mathcal{S}}\left(p(\hat{Y}_{i}=c)+\mathbb{1}_{Y_{i}=c}\right)}\right)}_{\mathrm{pDICE}(p(\hat{Y}),Y)}\;\underbrace{-\sum_{i\in\mathcal{S}}\log\left(p(\hat{Y}_{i}=Y_{i})\right)}_{\mathrm{pCE}(p(\hat{Y}),Y)}$$
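
To make the definition concrete, the following is a minimal NumPy sketch of the partial Dice plus partial Cross-Entropy loss described in this section. It is not the nnU-Net implementation: the function names, the `IGNORE` label value of -1, the flattened `(C, N)` layout of the softmax outputs and the small `eps` stabilizer are all assumptions made for illustration. The Cross-Entropy term is summed over annotated voxels to mirror the equation; practical implementations often average instead.

```python
import numpy as np

IGNORE = -1  # assumed label value for the ignore class (non-annotated voxels)


def partial_ce(probs, labels, ignore=IGNORE, eps=1e-8):
    """Partial Cross-Entropy: negative log-likelihood over annotated voxels only.

    probs:  (C, N) softmax outputs, one column per voxel
    labels: (N,) integer class labels, `ignore` marks non-annotated voxels
    """
    annotated = labels != ignore            # index set S
    p = probs[:, annotated]                 # A(p(Y_hat), Y)
    y = labels[annotated]                   # A(Y, Y)
    p_true = p[y, np.arange(y.size)]        # probability of the reference class
    return -np.sum(np.log(p_true + eps))


def partial_dice(probs, labels, ignore=IGNORE, eps=1e-8):
    """Partial soft Dice loss, averaged over classes, on annotated voxels only."""
    annotated = labels != ignore
    p = probs[:, annotated]                 # (C, |S|)
    y = labels[annotated]
    onehot = np.eye(probs.shape[0])[y].T    # (C, |S|) indicator 1[Y_i = c]
    intersection = np.sum(onehot * p, axis=1)
    denominator = np.sum(p + onehot, axis=1)
    dice = 2.0 * intersection / (denominator + eps)
    return np.mean(1.0 - dice)


def partial_loss(probs, labels, ignore=IGNORE):
    """Sum of partial Dice and partial Cross-Entropy."""
    return partial_dice(probs, labels, ignore) + partial_ce(probs, labels, ignore)
```

Because both terms only index annotated voxels, changing the predictions at voxels labeled `IGNORE` leaves the loss value unchanged, which is the property that allows training from scribbles.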

2.2 Benchmark Creation

The lack of a comprehensive scribble supervision benchmark has so far hindered the development of robust scribble supervision methods as well as systematic comparisons between them. Previous approaches almost exclusively relied on the ACDC [3, 21] and MSCMR [24, 25] datasets, both representing the rather narrow task of cardiac segmentation in cine MRI.
To address this, we have created a new benchmark for medical 3D scribble supervision, encompassing seven datasets of diverse segmentation tasks: LiTS [4], BraTS2020 [18], AMOS2022 [10], KiTS2023 [1], WORD [17], MSCMR [24, 25], and ACDC [3, 21]. We automate scribble generation for these datasets using the existing dense reference segmentations. For each image slice, we generate two scribbles of different types for each class: interior scribbles mimicking human annotations placed randomly within the class interior as non-uniform rational B-Splines (NURBS), and border scribbles roughly delineating a small part of the class border with random slight offsets. This dual approach follows best practices for scribble generation and ensures that both class interiors and boundaries are represented during training. We refer to Figure 2 for a visual representation of our automatically generated scribbles. Through the introduction of this benchmark, we aim to catalyze the development and comparison of new medical 3D scribble supervision methods.
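
As a rough illustration of deriving sparse labels from a dense reference segmentation, the sketch below draws one interior scribble per class in a 2D slice. It deliberately swaps in a much simpler technique than the one described above: a random walk constrained to the class mask instead of NURBS curves, and no border scribbles. The function names, the `IGNORE` value of -1 and the `n_steps` parameter are hypothetical.

```python
import numpy as np

IGNORE = -1  # assumed label value for non-annotated pixels


def random_walk_scribble(mask, n_steps, rng):
    """Trace a random walk that never leaves `mask`; return it as a boolean mask."""
    scribble = np.zeros(mask.shape, dtype=bool)
    coords = np.argwhere(mask)
    if coords.size == 0:
        return scribble
    r, c = coords[rng.integers(len(coords))]   # random start inside the class
    scribble[r, c] = True
    moves = np.array([(-1, 0), (1, 0), (0, -1), (0, 1)])
    for _ in range(n_steps):
        dr, dc = moves[rng.integers(len(moves))]
        nr, nc = r + dr, c + dc
        inside = 0 <= nr < mask.shape[0] and 0 <= nc < mask.shape[1]
        if inside and mask[nr, nc]:            # only step onto pixels of the same class
            r, c = nr, nc
            scribble[r, c] = True
    return scribble


def scribbles_from_dense(dense_slice, n_steps=20, seed=0):
    """Create a sparse label map: one interior scribble per class, IGNORE elsewhere."""
    rng = np.random.default_rng(seed)
    sparse = np.full(dense_slice.shape, IGNORE, dtype=np.int64)
    for cls in np.unique(dense_slice):
        walk = random_walk_scribble(dense_slice == cls, n_steps, rng)
        sparse[walk] = cls
    return sparse
```

By construction, every annotated pixel carries the label of the dense reference, so the resulting map can be fed directly to a loss that skips `IGNORE` pixels.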

Figure 2: Examples of segmentation types with (A) depicting a dense segmentation and (B) a scribble segmentation generated by our automatic scribble generation method based on the dense segmentation.

3 Results & Discussion

3.0.1 The proposed benchmark covers the diversity of medical segmentation tasks

Our benchmark encompasses seven diverse datasets reflecting a variety of anatomical and pathological structures imaged with several different modalities (CT, MRI and cine MRI). It enables an unprecedented systematic evaluation of performance and generalization capabilities of the different methods. On average, we generate 6 scribbles per slice, resulting in only 0.2% of image voxels being annotated. When respecting the class memberships of the voxels, we find that on average 8.5% of each class is annotated. The higher number relative to the naive image voxels stems from the oversampling caused by generating two scribbles per class per slice, see Section 2.2. In contrast, expert scribbles on ACDC, MSCMR, and WORD annotate 0.5% of image voxels and on average 12.7% per class. Considering that scribbles disproportionately cut down on annotation time, particularly due to relaxed precision requirements at object boundaries, these numbers underline the potential of scribbles to reduce annotation costs. Detailed descriptive statistics of our benchmark are shown in Appendix 0.A and a qualitative comparison between expert and generated scribbles is depicted in Appendix 0.B.

3.0.2 Systematic benchmarking reveals generalization patterns of existing scribble supervision methods

We use the proposed benchmark to perform a systematic evaluation of existing scribble learning methods. Expert scribbles, if available (such as for the datasets ACDC, MSCMR, WORD), are included as a separate category. Existing methods included in our comparison encompass the lightweight methods DenseCRF [20] and WORD [17], which we couple with the nnU-Net as their segmentation backbone, and the systemic methods CycleMix [22], ShapePU [23] and ScribFormer [14]. Our results are summarized in Table 1. The systemic methods ShapePU, CycleMix, and ScribFormer show excellent performance on their original evaluation datasets ACDC and MSCMR. However, when tested across the entire benchmark, they fail to generalize to new tasks and therefore have the lowest performance by a substantial margin. Among the systemic methods, CycleMix is the most consistent performer with an average Dice score of 0.598. Lightweight methods show competitive generalization capabilities across all datasets and, while unable to keep up with purpose-built systemic methods on cardiac datasets, their overall performance is more consistent. DenseCRF and WORD perform similarly well across all tasks and achieve average Dice scores of 0.736 and 0.758, respectively.

Table 1: Evaluation of scribble supervision methods using the Dice score. We use our systematic benchmark to compare the proposed use of partial losses to an extensive set of baselines and conduct comprehensive ablation studies. If available, we report the performance on the manually provided expert labels as well.
| Method | Experts ACDC | Experts MSCMR | Experts WORD | Generated ACDC | Generated MSCMR | Generated WORD | Generated LiTS | Generated BraTS | Generated AMOS | Generated KiTS | Generated Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ShapePU | 0.843 | 0.837 | 0.226 | 0.765 | 0.530 | 0.436 | 0.210 | 0.060 | 0.322 | 0.233 | 0.365 |
| ScribFormer | 0.884 | 0.845 | 0.180 | 0.900 | 0.861 | 0.420 | 0.457 | 0.071 | 0.283 | 0.268 | 0.466 |
| CycleMix | 0.876 | 0.872 | 0.406 | 0.896 | 0.852 | 0.626 | 0.554 | 0.250 | 0.487 | 0.518 | 0.598 |
| nnUNet+DenseCRF | 0.734 | 0.724 | 0.580 | 0.890 | 0.832 | 0.830 | 0.685 | 0.370 | 0.788 | 0.759 | 0.736 |
| nnUNet+WORD | 0.519 | 0.652 | 0.729 | 0.728 | 0.698 | 0.788 | 0.728 | 0.725 | 0.836 | 0.806 | 0.758 |
| nnUNet+pCE (2D) | 0.623 | 0.756 | 0.549 | 0.848 | 0.784 | 0.828 | 0.684 | 0.324 | 0.795 | 0.769 | 0.719 |
| nnUNet+pL (2D) | 0.653 | 0.700 | 0.695 | 0.893 | 0.864 | 0.831 | 0.691 | 0.442 | 0.790 | 0.740 | 0.750 |
| nnUNet+pCE | 0.841 | 0.893 | 0.693 | 0.888 | 0.887 | 0.787 | 0.777 | 0.418 | 0.838 | 0.812 | 0.772 |
| nnUNet+ResUNet+pL | 0.837 | 0.880 | 0.763 | 0.895 | 0.889 | 0.856 | 0.750 | 0.518 | 0.842 | 0.834 | 0.798 |
| nnUNet+pL | 0.868 | 0.858 | 0.723 | 0.895 | 0.885 | 0.837 | 0.761 | 0.680 | 0.840 | 0.823 | 0.817 |
| nnUNet (dense superv.) | 0.926 | 0.906 | 0.862 | 0.926 | 0.906 | 0.862 | 0.786 | 0.827 | 0.860 | 0.846 | 0.858 |
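
The Mean column of Table 1 can be reproduced from the per-dataset scores on the generated benchmark alone; the expert-scribble columns do not enter it. The short check below recomputes the means for a few rows. The dictionary layout and variable names are illustrative only; the numbers are copied from the table.

```python
from statistics import mean

# Dice scores on the generated benchmark, in the order
# ACDC, MSCMR, WORD, LiTS, BraTS, AMOS, KiTS (values copied from Table 1).
generated = {
    "CycleMix":        [0.896, 0.852, 0.626, 0.554, 0.250, 0.487, 0.518],
    "nnUNet+DenseCRF": [0.890, 0.832, 0.830, 0.685, 0.370, 0.788, 0.759],
    "nnUNet+WORD":     [0.728, 0.698, 0.788, 0.728, 0.725, 0.836, 0.806],
    "nnUNet+pCE":      [0.888, 0.887, 0.787, 0.777, 0.418, 0.838, 0.812],
    "nnUNet+pL":       [0.895, 0.885, 0.837, 0.761, 0.680, 0.840, 0.823],
}

# Unweighted mean over the seven datasets, rounded as in the table.
means = {method: round(mean(scores), 3) for method, scores in generated.items()}

# Difference between the rounded means of partial loss and pCE-only training.
margin = round(means["nnUNet+pL"] - means["nnUNet+pCE"], 3)

for method, value in means.items():
    print(f"{method:16s} {value:.3f}")
print(f"pL vs. pCE margin: {margin:.3f}")
```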

3.0.3 pCE alone delivers competitive scribble supervision

The standard pCE baseline in conjunction with nnU-Net yields a noteworthy finding: pCE alone markedly surpasses all existing systemic and lightweight methods, delivering an average Dice score of 0.772. These results underscore the critical role of employing state-of-the-art segmentation models and establish pCE as a robust baseline for future methods. WORD and DenseCRF use the pCE as a basis and propose additional regularization terms to further improve results. Interestingly, comparing nnUNet+WORD with nnUNet+pCE reveals a reduction in performance (0.758 vs. 0.772), casting doubt on the methodological efficacy of the WORD approach. Conversely, the DenseCRF method (which is exclusively implemented in 2D) improves over the 2D nnUNet+pCE baseline (0.736 vs. 0.719), validating its methodological innovations.

3.0.4 Partial losses achieve state-of-the-art in learning from scribbles

Modern state-of-the-art segmentation methods require more than just a Cross-Entropy loss to achieve maximum performance. We evaluate the integration of partial losses in nnU-Net, yielding the implementation of a partial Dice loss that is used in conjunction with pCE (see Section 2.1). Experiments on our comprehensive benchmark reveal that this approach not only outperforms a pCE-only baseline by a margin of 0.045, but also surpasses all systemic and lightweight methods on average. Notably, the use of partial losses shows robust generalization capabilities across the entire benchmark, except for BraTS, a common challenge for all methods (see Section 3.0.5), and maintains consistent performance across scribble styles (see Appendix 0.B for a qualitative comparison of styles). When examining the results of partial losses in nnU-Net qualitatively, we also find that this approach achieves the most consistent class boundaries when compared against systemic and lightweight methods, as shown in Figure 3. These outcomes affirm the general utility of partial losses when integrated into state-of-the-art segmentation methods. As an example, we demonstrate the portability of the partial losses by applying them to a different state-of-the-art architecture, a residual U-Net (“nnUNet+ResUNet+pL”), where they maintain their superior performance against all other evaluated baselines.

Figure 3: Qualitative comparison of the top-performing systemic method, CycleMix, the top-performing lightweight method, WORD coupled with the nnU-Net, and partial losses in nnU-Net. A green image outline represents a high quality prediction while a red outline represents a critical prediction failure.

3.0.5 Partial losses can learn from arbitrary sparse annotation strategies

The versatility of partial losses extends beyond scribble annotations, accommodating any form of sparse annotation due to its lack of scribble-specific components. We perform an exploratory study examining a different sparse annotation strategy: selecting a subset of all available slices and annotating those in a traditional dense manner while leaving the remaining slices unannotated (Slices). Slices are selected until the same annotation budget (measured as % annotated, averaged over all classes) as was used for the scribbles is attained. Additionally, under the identical annotation budget, we select a random subset of densely annotated images until the budget is met, simulating a scenario where fewer images can be annotated than would be possible with sparse annotation strategies (Dense). We compare both strategies, trained with the proposed partial losses in nnU-Net, against scribble annotations using the same annotation budget (Scribbles). Our study’s findings, presented in Table 2, illustrate that in the context of a fixed annotation budget, dense annotations underperform compared to sparse annotation strategies. A notable exception is the BraTS dataset, where Dense achieves a significantly higher Dice score. This discrepancy may be attributed to the appearance of the structures present in BraTS, characterized by classes with fuzzy borders, which makes them potentially less suited for sparse annotation methods. Among the sparse strategies, scribbles outperform slices, further highlighting the need to train methods on inputs that are as diverse as possible to ensure robustness.

Table 2: Comparison of annotation strategies under fixed annotation budget using the Dice score. Dense annotation forces a lower number of annotated images while sparse annotation strategies can cover all images in the respective datasets. Experiments are performed on our benchmark using partial losses in nnU-Net.
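| Method | ACDC | MSCMR | WORD | LiTS | BraTS | AMOS | KiTS | Mean |
|---|---|---|---|---|---|---|---|---|
| Dense (fewer images) | 0.891 | 0.618 | 0.848 | 0.666 | 0.795 | 0.838 | 0.631 | 0.755 |
| Slices | 0.911 | 0.747 | 0.791 | 0.741 | 0.568 | 0.843 | 0.728 | 0.761 |
| Scribbles | 0.895 | 0.885 | 0.837 | 0.761 | 0.680 | 0.840 | 0.823 | 0.817 |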
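
The Slices strategy can be sketched as follows, under several assumptions that the text does not specify: slices are drawn in random order along the first axis, the budget is a fraction in [0, 1] measured per image as the class-averaged share of annotated voxels, and selection stops as soon as the budget is reached. The function names and the `IGNORE` value of -1 are hypothetical.

```python
import numpy as np

IGNORE = -1  # assumed label value for non-annotated voxels


def annotated_fraction(sparse, dense):
    """Share of each class's reference voxels that is annotated, averaged over classes."""
    classes = np.unique(dense)
    fractions = [np.sum(sparse == c) / np.sum(dense == c) for c in classes]
    return float(np.mean(fractions))


def select_slices(dense, budget, seed=0):
    """Densely annotate randomly chosen slices until the annotation budget is reached.

    dense:  (Z, H, W) dense reference segmentation (must not contain IGNORE)
    budget: target annotated fraction, averaged over classes
    """
    rng = np.random.default_rng(seed)
    sparse = np.full(dense.shape, IGNORE, dtype=np.int64)
    for z in rng.permutation(dense.shape[0]):
        if annotated_fraction(sparse, dense) >= budget:
            break
        sparse[z] = dense[z]  # copy the full dense annotation of this slice
    return sparse
```

The returned volume uses the same ignore-label convention as a scribble annotation, so it can be consumed by the same partial loss without further changes.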

4 Conclusion

In this manuscript, we introduced an embarrassingly simple scribble supervision method. Partial losses, a generalization of the idea behind the partial Cross-Entropy to learn only from an annotated subset of voxels, enable arbitrary state-of-the-art segmentation methods to learn from scribbles while preserving their original loss formulations. Our exemplary integration into nnU-Net yields the best mean Dice score across the benchmark, demonstrating better generalization capabilities than other lightweight methods and even matching highly specialized systemic methods on their respective tasks. We furthermore introduced a systematic benchmark for evaluating scribble supervision methods spanning a diverse range of segmentation tasks, datasets and modalities. The benchmark and our reference implementation of partial losses in nnU-Net will be made publicly available upon publication in the hope that they will serve as a starting point for future progress in the field.

References

  • [1] The 2023 kidney tumor segmentation challenge, https://kits-challenge.org/kits23/
  • [2] Asad, M., Fidon, L., Vercauteren, T.: Econet: Efficient convolutional online likelihood network for scribble-based interactive segmentation. In: International Conference on Medical Imaging with Deep Learning. pp. 35–47. PMLR (2022)
  • [3] Bernard, O., Lalande, A., Zotti, C., Cervenansky, F., Yang, X., Heng, P.A., Cetin, I., Lekadir, K., Camara, O., Ballester, M.A.G., et al.: Deep learning techniques for automatic mri cardiac multi-structures segmentation and diagnosis: is the problem solved? IEEE transactions on medical imaging 37(11), 2514–2525 (2018)
  • [4] Bilic, P., Christ, P., Li, H.B., Vorontsov, E., Ben-Cohen, A., Kaissis, G., Szeskin, A., Jacobs, C., Mamani, G.E.H., Chartrand, G., et al.: The liver tumor segmentation benchmark (lits). Medical Image Analysis 84, 102680 (2023)
  • [5] Cai, H., Qi, L., Yu, Q., Shi, Y., Gao, Y.: 3d medical image segmentation with sparse annotation via cross-teaching between 3d and 2d networks. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 614–624. Springer (2023)
  • [6] Can, Y.B., Chaitanya, K., Mustafa, B., Koch, L.M., Konukoglu, E., Baumgartner, C.F.: Learning to segment medical images with scribble-supervision alone. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4. pp. 236–244. Springer (2018)
  • [7] Chen, Q., Hong, Y.: Scribble2d5: Weakly-supervised volumetric image segmentation via scribble annotations. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–243. Springer (2022)
  • [8] Han, M., Luo, X., Liao, W., Zhang, S., Zhang, S., Wang, G.: Scribble-based 3d multiple abdominal organ segmentation via triple-branch multi-dilated network with pixel-and class-wise consistency. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 33–42. Springer (2023)
  • [9] Isensee, F., Jaeger, P.F., Kohl, S.A., Petersen, J., Maier-Hein, K.H.: nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods 18(2), 203–211 (2021)
  • [10] Ji, Y., Bai, H., Ge, C., Yang, J., Zhu, Y., Zhang, R., Li, Z., Zhanng, L., Ma, W., Wan, X., et al.: Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation. Advances in Neural Information Processing Systems 35, 36722–36732 (2022)
  • [11] Ji, Z., Shen, Y., Ma, C., Gao, M.: Scribble-based hierarchical weakly supervised learning for brain tumor segmentation. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part III 22. pp. 175–183. Springer (2019)
  • [12] Lee, H., Jeong, W.K.: Scribble2label: Scribble-supervised cell segmentation via self-generating pseudo-labels with consistency. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part I 23. pp. 14–23. Springer (2020)
  • [13] Li, Z., Zheng, Y., Luo, X., Shan, D., Hong, Q.: Scribblevc: Scribble-supervised medical image segmentation with vision-class embedding. In: Proceedings of the 31st ACM International Conference on Multimedia. pp. 3384–3393 (2023)
  • [14] Li, Z., Zheng, Y., Shan, D., Yang, S., Li, Q., Wang, B., Zhang, Y., Hong, Q., Shen, D.: Scribformer: Transformer makes cnn work better for scribble-based medical image segmentation. IEEE Transactions on Medical Imaging (2024)
  • [15] Liu, X., Yuan, Q., Gao, Y., He, K., Wang, S., Tang, X., Tang, J., Shen, D.: Weakly supervised segmentation of covid19 infection with scribble annotation on ct images. Pattern recognition 122, 108341 (2022)
  • [16] Luo, X., Hu, M., Liao, W., Zhai, S., Song, T., Wang, G., Zhang, S.: Scribble-supervised medical image segmentation via dual-branch network and dynamically mixed pseudo labels supervision. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 528–538. Springer (2022)
  • [17] Luo, X., Liao, W., Xiao, J., Chen, J., Song, T., Zhang, X., Li, K., Metaxas, D.N., Wang, G., Zhang, S.: Word: A large scale dataset, benchmark and clinical applicable study for abdominal organ segmentation from ct image. arXiv preprint arXiv:2111.02403 (2021)
  • [18] Menze, B.H., Jakab, A., Bauer, S., Kalpathy-Cramer, J., Farahani, K., Kirby, J., Burren, Y., Porz, N., Slotboom, J., Wiest, R., et al.: The multimodal brain tumor image segmentation benchmark (brats). IEEE transactions on medical imaging 34(10), 1993–2024 (2014)
  • [19] Tang, M., Djelouah, A., Perazzi, F., Boykov, Y., Schroers, C.: Normalized cut loss for weakly-supervised cnn segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1818–1827 (2018)
  • [20] Tang, M., Perazzi, F., Djelouah, A., Ben Ayed, I., Schroers, C., Boykov, Y.: On regularized losses for weakly-supervised cnn segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 507–522 (2018)
  • [21] Valvano, G., Leo, A., Tsaftaris, S.A.: Learning to segment from scribbles using multi-scale adversarial attention gates. IEEE Transactions on Medical Imaging 40(8), 1990–2001 (2021)
  • [22] Zhang, K., Zhuang, X.: Cyclemix: A holistic strategy for medical image segmentation from scribble supervision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11656–11665 (2022)
  • [23] Zhang, K., Zhuang, X.: Shapepu: A new pu learning framework regularized by global consistency for scribble supervised cardiac segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 162–172. Springer (2022)
  • [24] Zhuang, X.: Multivariate mixture model for cardiac segmentation from multi-sequence mri. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 581–588. Springer (2016)
  • [25] Zhuang, X.: Multivariate mixture model for myocardial segmentation combining multi-source images. IEEE transactions on pattern analysis and machine intelligence 41(12), 2933–2946 (2018)

Appendix 0.A Scribble Evaluation

Table 3: Scribble evaluation results quantifying annotated scribbles with the following metrics: Scribble2Img (%): Percentage of scribble voxels in relation to image voxels averaged over all images; Scribble2Ref (%): Percentage of scribble voxels of a class in relation to the full reference segmentation voxels of a class averaged over all classes and images; Scribbles: Number of all scribbles in an image averaged over all images; ScribblesFG: Number of foreground scribbles in an image averaged over all images; ScribblesSlice: Number of all scribbles per slice averaged over all slices of all images.
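| Source | Dataset | Scribble2Img (%) | Scribble2Ref (%) | Scribbles | ScribblesFG | ScribblesSlice |
|---|---|---|---|---|---|---|
| Experts | ACDC | 0.271 | 8.741 | 45 | 35 | 5 |
| Experts | MSCMR | 1.061 | 21.925 | 118 | 66 | 8 |
| Experts | WORD | 0.033 | 7.380 | 150 | 95 | 10 |
| Experts | Mean | 0.455 | 12.682 | 104 | 65 | 8 |
| Benchmark | ACDC | 0.488 | 10.401 | 72 | 53 | 7 |
| Benchmark | MSCMR | 0.241 | 6.140 | 113 | 82 | 7 |
| Benchmark | WORD | 0.052 | 11.667 | 1589 | 1185 | 8 |
| Benchmark | LiTS | 0.237 | 3.213 | 1318 | 411 | 3 |
| Benchmark | BraTS | 0.384 | 9.788 | 605 | 295 | 4 |
| Benchmark | AMOS | 0.053 | 12.722 | 1243 | 959 | 9 |
| Benchmark | KiTS | 0.168 | 5.422 | 574 | 194 | 3 |
| Benchmark | Mean | 0.232 | 8.479 | 788 | 454 | 6 |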
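
For a single image, the two coverage metrics of Table 3 could be computed roughly as sketched below. This is an interpretation of the caption, not the authors' evaluation code: the `IGNORE` value of -1 is assumed, and whether the background class enters the Scribble2Ref average is not stated, so this sketch includes every class present in the reference. Averaging over images is left to the caller.

```python
import numpy as np

IGNORE = -1  # assumed label value for non-annotated voxels


def scribble2img(sparse):
    """Percentage of all image voxels that carry a scribble label."""
    return 100.0 * float(np.mean(sparse != IGNORE))


def scribble2ref(sparse, dense):
    """Percentage of each class's reference voxels covered by scribbles, averaged over classes."""
    ratios = [
        100.0 * np.sum(sparse == c) / np.sum(dense == c)
        for c in np.unique(dense)
    ]
    return float(np.mean(ratios))


# Toy example: class 0 covers 80 voxels, class 1 covers 20 voxels.
dense = np.zeros((10, 10), dtype=np.int64)
dense[:, 8:] = 1
sparse = np.full(dense.shape, IGNORE, dtype=np.int64)
sparse[0, 0:4] = 0    # 4 of 80 class-0 voxels annotated
sparse[0, 8:10] = 1   # 2 of 20 class-1 voxels annotated
```

In this toy example six of 100 voxels are annotated, while the class-averaged coverage is higher because the small class is annotated at a higher rate. The same effect explains why Scribble2Ref exceeds Scribble2Img throughout Table 3.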

Appendix 0.B Scribble Styles

Figure 4: Comparison of different scribble annotation styles by experts on the ACDC, MSCMR, and WORD datasets and our generated benchmark scribbles.