Robustness Testing of Black-Box Models Against CT Degradation Through Test-Time Augmentation

Jack Highton
School of Biomedical Engineering and Imaging Sciences, King’s College London, London, UK; Aival Ltd., London, UK

Quok Zong Chong
Aival Ltd., London, UK

Samuel Finestone
Aival Ltd., London, UK

Arian Beqiri
School of Biomedical Engineering and Imaging Sciences, King’s College London, London, UK; Aival Ltd., London, UK

Julia A. Schnabel
School of Biomedical Engineering and Imaging Sciences, King’s College London, London, UK; School of Computation, Information and Technology, Technical University of Munich, Munich, Germany; Helmholtz Munich, Munich, Germany

Kanwal K. Bhatia
Aival Ltd., London, UK

Abstract

Deep learning models for medical image segmentation and object detection are becoming increasingly available as clinical products. However, details are rarely provided about the training data, so models may unexpectedly fail when cases differ from those in the training distribution. An approach allowing potential users to independently test a model’s robustness, treating it as a ‘black-box’ and using only a few cases from their own site, is key for adoption. To address this, we present a method to test the robustness of these models against CT image quality variation. We demonstrate the framework by showing that, given the same training data, the model architecture and data pre-processing greatly affect the robustness of several frequently used segmentation and object detection methods to simulated CT imaging artifacts and degradation. Our framework also addresses the concern about the sustainability of deep learning models in clinical use, by considering future shifts in image quality due to scanner deterioration or imaging protocol changes which are not reflected in a limited local test dataset.

Keywords: Robustness Testing, Out of Distribution, Computed Tomography

1 Introduction

Segmentation and object detection are important medical imaging analysis tasks in clinical and research settings. Segmentation is a fundamental analysis step because it provides boundary localization and volume quantification of anatomical and pathological structures, which may be key for diagnosis and treatment planning (Giorgio and De Stefano, 2013) (Cabral Jr. et al., 1993). Automated object detection can enhance radiological reporting by highlighting pathological features (Choi et al., 2022), or can guide further medical image analysis to focus on important features of the image (** et al., 2022). However, a lack of clinicians’ trust in deep learning based applications can undermine adoption, which can be remedied with scientific robustness testing of an application using data from the clinicians’ own center (Ahmad, 2021). Within this field, robustness refers to a model’s ability to maintain performance when encountering data which differs from the training dataset, due to a shift in demographics, acquisition protocol, or acquisition artifacts (Galati et al., 2022).

X-ray Computed Tomography (CT) is a frequently used medical imaging modality with many applications, including radiotherapy planning, tumor diagnosis, angiography and trauma analysis  (Liguori et al., 2015). This work presents a method to facilitate model independent robustness testing using a limited amount of CT data.

Deep neural network based models have recently become the foremost approach for automating many segmentation and object detection tasks and can achieve expert human performance in some applications (Kooi et al., 2017). However, models have various vulnerabilities which can make them less robust than human visual assessment (Geirhos et al., 2018). This issue becomes particularly acute as deep learning systems are increasingly distributed as commercial products, where restricted information about the model architecture and training data renders the application an effective ‘black-box’ for the end user, making its robustness unclear and undermining trust.

Robustness testing of a deep learning model should consider the types of input degradation it would encounter during real-world use, where degradation is the process by which the quality of an image is diminished or compromised. Although it is well known that models are susceptible to adversarial examples  (Szegedy et al., 2013), these cases are created by an agent deliberately attempting to fool the system which is unlikely in a medical imaging context. Also, although common degradations affecting natural image quality, e.g. noise, contrast alterations, and blurring, have been shown to affect models’ performance  (Geirhos et al., 2018), the degradations seen in medical images have a fundamentally different nature. The key concern for robustness in a medical imaging application is degradation caused by the acquisition process. This may be caused by the acquisition protocol changing, such as the resolution or CT exposure parameters. There may also be image artifacts, defined as image features not present in reality but which appear due to unintentional acquisition phenomena, such as the patient moving during the scan or the effect of metal implants.

For a model trained on data with consistent acquisition parameters and without notable artifacts, these degradations can result in a discrepancy between the characteristics of training and test data, called a distribution shift (Quiñonero-Candela et al., 2008). Test data affected by this discrepancy is a form of out-of-distribution (OOD) data, which in a medical imaging context may be frequently encountered and which can undermine the robustness of deep learning applications. Although there are many approaches to retrospectively mitigate CT artifacts, including simulations to train models to remove them (van der Ham et al., 2022), there is little understanding of how specific parameter changes and artifacts in CT imaging can create a distribution shift which affects the performance of models and therefore their safety in a medical context.

We address this by developing a framework to systematically assess the robustness of black-box models for segmentation and object detection with CT images (see Figure 1). This would allow a user with a limited test dataset acquired from their own center, with annotations representing the optimal output for those cases according to an expert, to test a segmentation or object detection model against OOD cases not present in their dataset but which are likely to occur during future use. Our suite of tests includes increased CT noise, artifacts due to metal implants, and patient motion during the scan. Our framework allows direct comparison of the robustness of multiple ‘black-box’ models designed for the same task. This aims to aid confidence in the decision to adopt deep learning based methods in clinical practice, as well as trust in the continuing robustness of these models to future changes in image quality.

Figure 1: An overview of the proposed framework for robustness testing. The potential user of a black-box model wishes to test its robustness to CT image degradation using a small locally acquired test dataset, annotated with segmentation maps or object boxes outlined by an expert. The performance of the model is evaluated after various simulation-based augmentations are applied to the input image (center). Then the evaluation results, alongside information from the test dataset noise distribution, are used to calculate robustness metrics for the model.

2 Related work

2.1 Identifying OOD images with knowledge of the model and training data

The most direct method to test a model’s robustness against OOD data is to evaluate its performance on a dataset which is known to be OOD compared to the model’s training data. In medical imaging, this approach has been used to test the accuracy of OOD detection algorithms designed to automatically identify cases which the model is not expected to be robust to (Vasiliuk et al., 2023b) (Vasiliuk et al., 2023a) (Nguyen et al., 2023). OOD test sets were compiled by selecting datasets with a known distribution shift relative to the training dataset, such as different acquisition protocols or underlying patient conditions. However, this is dependent on knowledge about the training data, which is not available unless provided by the developers.

Methods do exist to explore if input data are OOD when the training dataset is not known. If access to a model’s architecture and trained weights is available, methods like Mahalanobis distance (Anthony and Kamnitsas, 2023) and generalized ODIN (Vasiliuk et al., 2023a) can be used to predict if an input medical image is OOD. In other situations the model’s weights may not be accessible, such as when it has been integrated into a product or if model obfuscation methods have been applied (Zhou et al., 2023). In this case, the output softmax confidence score can be used to predict if an input is OOD (Hendrycks and Gimpel, 2016) (Liu et al., 2020). However, OOD detection methods based on confidence output have limited accuracy in natural image applications (Liu et al., 2020), which becomes significantly worse for medical imaging applications in segmentation (Vasiliuk et al., 2023a). Also, obtaining the softmax output may depend on having access to the final layer of the neural network architecture before a threshold is applied.

2.2 Testing robustness against OOD images from benchmark datasets

A sufficiently diverse benchmark dataset can provide a fully model-agnostic method to test a black-box model’s robustness against OOD data (Boone et al., 2023). An ideal benchmark dataset would contain a range of input images with corresponding annotations, where any potentially OOD properties of each input image are labeled. This allows evaluation of the accuracy of the final output against the annotations, summarizing how much the model’s performance degrades when it encounters each given type of OOD data. An example benchmark dataset contained cardiac Magnetic Resonance Imaging (MRI) images labeled according to the imaging center, scanner type, and disease condition, which formed part of a challenge to create generalizable segmentation models (Campello et al., 2021). However, the report from the challenge organizers did not discuss what properties of images in certain groups led to reduced performance from the submitted models (Campello et al., 2021). MOOD 2020 is another example benchmark dataset for brain MRI and abdominal CT (Zimmerer et al., 2022). Some images were labeled by humans as OOD due to anomalous pathology, but most were created by augmenting images from the in-distribution set. The categories of augmentations included removal of slices, blurring, global and local deformations, and randomly inserting patches of noise and sections of other images. Some have questioned whether these augmentations realistically depict anomalous medical images in the real world (Li et al., 2023a), suggesting the augmented image patches should be formed by blurring multiple extracted image features together. However, the patch based augmentations still would not reflect realistic OOD cases which result from the acquisition process.

2.3 Testing robustness against images with acquisition based augmentations

Augmentations to create potentially OOD images can instead be designed to recreate acquisition phenomena known to cause anomalous cases in the real world. For example, a patient may enter the scanner in an unusual position due to a spinal injury (Yang et al., 2023). This could be reflected by image translation and rotation augmentations, which are already frequently included as default training augmentations (Goceri, 2023). It has been found that MRI hardware and software updates can change image properties enough to affect segmentation methods and thus distort brain tissue volume results in longitudinal studies (Medawar et al., 2021) (Potvin et al., 2019). Changing the scanner or its parameters may alter the image resolution or field of view, which can be recreated by the common training augmentations of rescaling and cropping (Goceri, 2023).

A shift in the image noise distribution is another effect of changing scanning parameters, such as when the CT radiation dose changes by altering the tube current or slice scan time (Goldman, 2007). Even though adding Gaussian distributed noise is a common training-time augmentation technique (Goceri, 2023), in the medical imaging domain actual noise has a non-Gaussian distribution associated with the acquisition modality. Speckle, Rician and Poisson distributed noise have been suggested as training-time augmentations for ultrasound (Singla et al., 2022), MRI (Boone et al., 2023), and planar X-Ray (Khalifa et al., 2022) images respectively. The noise in CT scans has a complicated spatial structure, augmentation of which requires either a physics-based simulation of the CT acquisition process (Won Kim and Kim, 2014) or the use of a generative model (Liu et al., 2022). A physics-based simulation of CT noise has been used to augment the training data for models to classify lung image patches for the presence of lung nodules (Omigbodun et al., 2019), which did not lead to a performance improvement for test data of various noise levels compared to Gaussian noise augmentation. However, this study only considered one specific task, and did not use the physics-based model to assess the robustness of different model architectures to CT noise.

Patient motion during a scan can cause image artifacts resulting in image degradation. A simulation of MRI motion artifacts has been used to augment the training data of a segmentation model, which improved its robustness against real world artifacts (Shaw et al., 2020).

A recent study (Boone et al., 2023) provided a benchmark dataset to test the robustness of models for MRI segmentation against OOD cases, created using a series of physics-based augmentations. This included noise, illumination field, resolution changes, spatial transformations and motion artifacts. The authors also defined a degradation metric to quantify a model’s robustness against a certain augmentation, which takes into account the mean or standard deviation of the segmentation performance metrics (e.g. Dice (Dice, 1945)) after a range of augmentation severity is applied to a test set. Since they found that the variance of the segmentation performance metrics increases with more extreme augmentations, a similar degradation metric using the variance was suggested. Both approaches allow direct comparison of the robustness of two models when tested with the same augmented dataset, by comparing the degradation value.

2.4 Testing robustness against OOD images with limited datasets

It is crucial for the end user to be able to independently test cases from their own clinical center to support their decision to adopt deep learning based applications for medical imaging analysis in a clinical context (Jacobson and Krupinski, 2021), because they must be appropriately trained for the targeted population demographics (Vayena et al., 2018), disease presentation, and acquisition system (Prior et al., 2020). Testing data has to be manually labeled by human experts, resulting in high cost and limited size of testing datasets (Koh et al., 2022), which is often exacerbated at a local level (Shaikhina and Khovanova, 2017).

If the test dataset is limited in size, the statistical significance of summative performance metrics (e.g. mean Dice) may be undermined and it may lack rarer types of OOD images which may be encountered in the future. Methods have been developed to expand small medical imaging datasets with synthetic data produced by generative adversarial networks (GANs), mostly to train models for classification tasks such as COVID-19 detection (Waheed et al., 2020). GANs have also been used to expand a training dataset for 2D retina image segmentation (Noguchi et al., 2020). Since this approach also requires the generation of accurate corresponding synthetic ground truth segmentation masks, the segmentation model being trained has to be integrated into the data synthesizing algorithm, rendering this an inappropriate way to generate test data to independently assess black-box models. Non-deep learning approaches for expanding datasets mix features in the images to create new images. A cropping and patching method has been proposed for both images and segmentation mask annotations (Noguchi et al., 2020); however, the images produced contain anatomy which is concatenated in a clearly unrealistic way. Laplacian blending has been proposed as a more realistic alternative for classification datasets (Sanaat et al., 2022), but this method would pose problems for generating valid annotations in the transition regions for both segmentation and object detection applications.

The above methods to expand limited datasets can generate cases which are potentially OOD due to new combinations of image features. However, the patch based methods combine image features in an unrealistic way, while GAN based methods create ground truths which are not independent of the model being tested. Furthermore, these methods cannot add OOD cases resulting from acquisition artifacts/degradation not already present in the dataset. This motivates our CT simulation based framework, which is independent of the models being tested and allows them to be treated as a black-box. The augmentations generated by the simulation can also add common artifacts/degradations not initially present, or ones with greater severity, to the limited test dataset.

2.5 Contributions

To address these issues with existing approaches to independently test the robustness of black-box models with limited datasets, we develop, for the first time to our knowledge, a benchmarking platform to create datasets for evaluating the robustness of models to corruptions and artifacts in CT. Analogous to previous work (Boone et al., 2023), which presented a system for robustness and out-of-distribution testing specifically for brain MRI segmentation models (ROOD-MRI), here we expand and further generalize this concept to more generic segmentation and localization road-testing tasks in CT imaging. We make the following scientific contributions to this field:

  • We propose a physics-based CT simulation for modeling associated acquisition anomalies: noise variation, metal implant artifacts, and patient motion. Our simulation approach only uses parameters obtained from analysis of the CT images themselves, allowing use when details of the CT acquisition are not available.

  • Using test-time data augmentations from the simulation, we demonstrate that different deep learning model architectures have weaker robustness to different CT artifacts.

  • We investigate established models, designed for object detection (retinaNet) and for segmentation (U-Net, nnUnet). We also convert the tested segmentation models into object detection applications, to compare their robustness with retinaNet.

  • Inspired by ROOD-MRI (Boone et al., 2023) which uses fixed weighting of the contribution of each augmentation severity level to calculate the summative noise degradation metric, here we expand on that formula by using an empirical distribution of noise in a given dataset to calibrate the weighting of noise augmentation severity levels.

The remainder of our paper is structured as follows. In Section 3 we outline the datasets we used to train models. We briefly describe the established model architectures that we trained and then applied the robustness testing framework to: 3D-UNet (Çiçek et al., 2016), nnUnet (Isensee et al., 2021), and retinaNet (Lin et al., 2017). Then, we describe the metrics used to evaluate their performance at segmentation or object detection using test sub-datasets. Section 4 describes how our CT simulation approach is based only on parameters extracted from the test CT images themselves, allowing general use to generate the augmentations for robustness testing. In Section 5, we show these applied to generate a summary robustness metric calibrated with the dataset. Then our results show how these methods demonstrate the strengths and weaknesses of different model architectures.

3 Materials and metrics

3.1 Datasets

3.1.1 LUNA 16

LUNA 16 is a lung nodule object detection challenge dataset (Jacobs and van Ginneken, 2016). We included it in this study because lung nodule annotation is crucial for lung cancer diagnosis, but manual labeling is time-consuming and error prone, so deep learning applications are of significant interest (Ren et al., 2020).

The dataset includes CT scans with a maximum slice thickness of 2.5 mm from the publicly available LIDC/IDRI database (Armato III et al., 2011). Annotations were collected by 4 experienced radiologists, and the LUNA 16 annotations consist of all nodules larger than 3 mm accepted by at least 3 out of 4 radiologists. The annotations were presented as ground truths for object detection, where all nodules were labeled as having the same class while their position and size were defined by 3D bounding box co-ordinates. We used 570 images for training models while 30 were left aside for testing.

The training images were pre-processed by resampling via bilinear interpolation to (0.703, 0.703, 1.25 mm) voxel sizes and the intensity was windowed to -1024 to 300 Hounsfield Units (HU). This pre-processing was applied to the test data after the application of CT simulation based augmentations described in Section 4.

3.1.2 Segmentation decathlon - liver task

As part of the Medical Segmentation Decathlon challenge (Antonelli et al., 2022), the liver CT dataset consists of contrast-enhanced CT images from patients with primary cancers and metastatic liver disease. We included this dataset due to the importance of robust organ segmentation for treatment planning, such as radiotherapy (Yu et al., 2022).

The dataset contained target segmentation maps outlining the liver. It was acquired in the IRCAD Hôpitaux Universitaires, France and contained a subset of patients from the 2017 Liver Tumor Segmentation challenge (Bilic et al., 2023). We used 570 images for training models and 30 were left aside for testing.

The training images were pre-processed by resampling via bilinear interpolation to (1.5, 1.5, 1.5 mm), but the intensity was not windowed by default because the effect of training models with and without windowing was compared during robustness testing. This pre-processing was applied to the test data after the application of CT simulation based augmentations described in Section 4.

3.2 Segmentation model architectures and training

3.2.1 3D-UNet

The implementation of the 3D-UNet (Çiçek et al., 2016) segmentation model provided within the MONAI library (Cardoso et al., 2022) was utilized. 3D-UNet was chosen because, during the past five years, it has frequently been applied as a baseline architecture against which the performance of newly developed models is compared (Li et al., 2023b) (Bui et al., 2019). We trained this architecture with three pre-processing approaches for the training data, to test the effect they have on robustness:

  • No image augmentation, apart from random sampling of patches for training with size (96,96,48) for the liver decathlon dataset and (192,192,80) for LUNA 16.

  • Image augmentation using the protocol used for nnUnet shown in Table 1, followed by the random patch selection described above.

  • Image intensity windowing in the range -60 to 160 HU, followed by the random patch selection described above. This option was only used for the liver segmentation dataset because the window corresponds to the range of intensity of the liver and surrounding volume, while the intensity of lung nodules in the LUNA 16 dataset is very variable.

3.2.2 nnUNet

An integration of the nnUnet (Isensee et al., 2021) segmentation protocol within the MONAI library (Yang, 2023) was used. This trains an ensemble of UNet derived models with parameter optimization based on the data and hardware before selecting the best performing combination, as outlined in Table 1. nnUnet was chosen because it is considered a state-of-the-art segmentation protocol, after achieving the best performance across multiple medical imaging segmentation applications in the Medical Segmentation Decathlon challenge (Antonelli et al., 2022).

Table 1: Comparison of the properties of the nnUNet and base 3D-UNet used here.

| Property | nnUNet | 3D-UNet |
|---|---|---|
| Training Augmentation | Rescale x0.9 or 1.2, 15%; Gaussian Noise std 0.01, 15%; Gaussian Smooth x0.5-1.15, 15%; Intensity -0.3 or 0.3, 15%; Flip (each axis), 50% | None |
| Learning Rate | ‘poly’ schedule (initial 0.01) | 0.001 |
| Training Epochs | 1000 | 600 |
| Batch Size | 2 | 1 |
| Loss Function | Dice + Cross Entropy | Dice |
| Intensity Normalization | clipped: 0.5 to 99.5% | clipped: -57 to 164 HU |
| In-plane Image Resampling | 3rd order spline | Linear interpolation |
| Slice Image Resampling | Nearest Neighbor | Linear interpolation |
| In-plane Annotation Resampling | Linear interpolation | Linear interpolation |
| Slice Annotation Resampling | Nearest Neighbor | Linear interpolation |
| Patch Size | Maximum for Batch Size 2* | 96x96x96 |
| Low-resolution Patch Size | 25% of median image size | N/A |
| Ensemble Selection Options | 3D-UNet; 2D-UNet; Low-resolution UNet cascade; Ensemble of two** | N/A |

*The largest patch size possible with Batch Size 2 given the available memory.
**The best pair of the models, ensembled by averaging softmax probabilities, chosen by cross validation of the training data.

3.3 Object detection model architectures and training

3.3.1 retinaNet

We utilized an implementation of the retinaNet (Lin et al., 2017) architecture provided within the MONAI library. retinaNet is composed of a backbone network (ResNet in this implementation) with several downsampling layers followed by several upsampling layers. Each of the upsampling layers is sampled by two convolutional sub-networks to predict the object class and bounding box co-ordinates, giving independent box predictions considering different resolution scales which allows accurate identification of both large and small objects. This is appropriate for the LUNA 16 dataset, which contains annotated nodules varying widely in diameter from 3.0 mm to 28.3 mm. retinaNet also uses a focal loss function for classification, cross entropy weighted by uncertainty so difficult to classify examples are more heavily weighted, but there was only one class of nodule considered so this feature was not relevant.

3.3.2 Conversion of segmentation models to object detection models

Object detection models are evaluated here using the mean Average Precision (mAP), while the segmentation models are evaluated using the Dice score. Therefore, in order to allow direct comparison, we added a post-processing step to the segmentation models applied to the LUNA 16 dataset to create bounding boxes for the nodules like an object detection model, which can be evaluated using mAP. To do this, contiguous volumes of voxels labeled as ‘nodule’ within the predicted segmentation maps with a volume greater than 14.14 mm³ were selected. This is the volume of a sphere with a 3 mm diameter, the minimum size of annotated nodules in the LUNA 16 dataset. For each of these contiguous volumes, the minimum enclosing box with edges parallel to the orthogonal image axes was calculated and treated as a box produced by an object detection model with that ‘nodule’ class at 100% confidence.
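The post-processing step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `boxes_from_segmentation`, the box layout (start and exclusive stop voxel indices per axis), and the use of SciPy's default face connectivity to define "contiguous" are all assumptions made for the example.

```python
import numpy as np
from scipy import ndimage


def boxes_from_segmentation(mask, voxel_size_mm, min_volume_mm3=14.14):
    """Turn a binary 'nodule' mask into axis-aligned boxes.

    Each box is (z0, y0, x0, z1, y1, x1) in voxel indices, with exclusive
    stop indices. The layout and names are illustrative assumptions.
    """
    # Label contiguous foreground regions (SciPy default: face connectivity).
    labels, _ = ndimage.label(mask)
    voxel_volume = float(np.prod(voxel_size_mm))
    boxes = []
    for index, slices in enumerate(ndimage.find_objects(labels), start=1):
        region_voxels = int((labels[slices] == index).sum())
        # Keep only regions strictly larger than the minimum volume.
        if region_voxels * voxel_volume <= min_volume_mm3:
            continue
        start = [s.start for s in slices]
        stop = [s.stop for s in slices]
        # Segmentation output carries no confidence, so the score is fixed at 1.0.
        boxes.append({"box": tuple(start + stop), "label": "nodule", "score": 1.0})
    return boxes


if __name__ == "__main__":
    demo = np.zeros((20, 20, 20), dtype=bool)
    demo[2:6, 2:6, 2:6] = True   # 64 voxels: kept
    demo[10, 10, 10] = True      # 1 voxel: discarded
    print(boxes_from_segmentation(demo, (1.0, 1.0, 1.0)))
```

Because every box receives the same score, the precision-recall curve of a converted segmentation model collapses to a single operating point, which is worth keeping in mind when comparing its mAP with that of a native detector.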

3.4 Evaluation metrics

3.4.1 Segmentation evaluation

The Dice coefficient (Dice, 1945) (DSC) is a de facto standard (Maier-Hein et al., 2024) used to quantify the accuracy of a segmentation method (Zijdenbos et al., 1994):

$$DSC = \frac{2 \times |A \cap B|}{|A| + |B|} \tag{1}$$

where $A$ and $B$ represent the sets of ground truth and inferred segmentation voxels, and $|A|$ and $|B|$ are the number of voxels in those sets. DSC ranges from 0 to 1, where 1 represents perfect overlap between the ground truth and inference. The mean DSC across a test dataset is used to evaluate the accuracy of a segmentation model. The consistency of the model may be impacted when OOD images are encountered, reflected by an increase in the standard deviation of DSC values (Boone et al., 2023).
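As a concrete illustration of Equation 1 and of the dataset-level mean and standard deviation, a minimal NumPy sketch is given below. The function names are hypothetical, and returning 1.0 when both masks are empty is a convention assumed here, not something specified in the text.

```python
import numpy as np


def dice_coefficient(ground_truth, prediction):
    """Dice coefficient between two binary masks of the same shape."""
    a = np.asarray(ground_truth, dtype=bool)
    b = np.asarray(prediction, dtype=bool)
    denominator = a.sum() + b.sum()
    if denominator == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return float(2.0 * np.logical_and(a, b).sum() / denominator)


def summarize_dice(pairs):
    """Mean and standard deviation of DSC over (ground_truth, prediction) pairs."""
    scores = np.array([dice_coefficient(gt, pred) for gt, pred in pairs])
    return float(scores.mean()), float(scores.std())


if __name__ == "__main__":
    gt = np.array([1, 1, 0, 0])
    pred = np.array([1, 0, 1, 0])
    print(dice_coefficient(gt, pred))
    print(summarize_dice([(gt, pred), (gt, gt)]))
```

Reporting the standard deviation alongside the mean makes the loss of consistency under OOD inputs visible even when the mean changes little.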

3.4.2 Object detection evaluation

The mean Average Precision (mAP) is an evaluation metric for object detection methods used across several benchmark challenges (Padilla et al., 2020), which summarizes the precision-recall trade-off dictated by confidence levels of the predicted bounding boxes.

mAP is the mean of the Average Precision (AP) for each class, where AP is the area under a precision recall curve that has been preprocessed to remove zig-zag behavior (Padilla et al., 2021). To calculate the precision and recall at each threshold confidence value, each bounding box predicted by the model with a confidence exceeding the threshold is compared with the ground truth boxes using the metric Intersection over Union (IoU) (Padilla et al., 2021):

$$IoU = \frac{|A \cap B|}{|A \cup B|} \tag{2}$$

If the IoU between a predicted box and a ground truth box of the corresponding class exceeds a specified threshold, the predicted box is a true positive. If the IoU does not meet the threshold, another predicted box has already met the threshold, or the predicted class is wrong, then the predicted box is a false positive. Any ground truth box which does not have a corresponding true positive is counted as a false negative. The precision and recall can then be calculated for each threshold confidence value. mAP can be calculated for a single test image or across a test set. It should be noted that a ‘black-box’ object detection model may have an intrinsically fixed confidence threshold, in which case AP reduces to precision multiplied by recall.
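The matching rules above can be sketched for a single class and a single pair of thresholds as follows. This is an illustrative sketch under stated assumptions: boxes are (z0, y0, x0, z1, y1, x1) tuples, predictions are processed in descending confidence order with greedy one-to-one matching (a common convention, not confirmed by the text), and an IoU equal to the threshold counts as a match. The names `box_iou` and `match_boxes` are hypothetical.

```python
def box_volume(box):
    """Volume of an axis-aligned box (z0, y0, x0, z1, y1, x1)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou(box_a, box_b):
    """Intersection over Union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        low = max(box_a[axis], box_b[axis])
        high = min(box_a[axis + 3], box_b[axis + 3])
        if high <= low:
            return 0.0  # no overlap along this axis
        intersection *= high - low
    union = box_volume(box_a) + box_volume(box_b) - intersection
    return intersection / union


def match_boxes(predictions, ground_truths, iou_threshold, confidence_threshold):
    """Count (true positives, false positives, false negatives) for one class.

    predictions: list of (box, score); ground_truths: list of boxes.
    """
    kept = sorted(
        (p for p in predictions if p[1] > confidence_threshold),
        key=lambda p: p[1],
        reverse=True,
    )
    matched = set()
    tp = fp = 0
    for box, _score in kept:
        best_iou, best_index = 0.0, None
        for index, gt_box in enumerate(ground_truths):
            if index in matched:
                continue  # each ground truth box can be matched only once
            iou = box_iou(box, gt_box)
            if iou > best_iou:
                best_iou, best_index = iou, index
        if best_index is not None and best_iou >= iou_threshold:
            matched.add(best_index)
            tp += 1
        else:
            fp += 1
    fn = len(ground_truths) - len(matched)
    return tp, fp, fn


if __name__ == "__main__":
    gts = [(0, 0, 0, 10, 10, 10)]
    preds = [((0, 0, 0, 10, 10, 10), 0.9), ((0, 0, 0, 10, 10, 5), 0.8)]
    tp, fp, fn = match_boxes(preds, gts, iou_threshold=0.5, confidence_threshold=0.5)
    print("precision", tp / (tp + fp), "recall", tp / (tp + fn))
```

Sweeping `confidence_threshold` over the predicted scores yields the precision-recall pairs from which AP is computed; repeating the count for several `iou_threshold` values and averaging gives the multi-threshold mAP.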

mAP is often calculated with a range of IoU thresholds, with the mean result taken (Padilla et al., 2020). This approach is taken here, because the models being compared may perform best at different IoU thresholds. Furthermore, the IoU may be heavily affected by how tightly the object is bounded by the ground truth box, so a range of IoU thresholds helps capture this variability. This is particularly the case for annotations of lung nodules, which vary greatly in size, have boundaries which are difficult for a clinician to precisely discern (Larici et al., 2017), and which have a varying heterogeneous shape compared to an enclosing cuboid.

4 Methods for augmentation of CT data

4.1 CT simulation

A custom physics based simulation of CT was used to generate CT specific augmentations. This was based on the Astra toolbox (Van Aarle et al., 2016), which creates the system geometry then numerically simulates the stages of the acquisition process.

The pre-processing of the input images to be augmented follows a previously outlined method (Won Kim and Kim, 2014). First, the image CT values (Hounsfield Units) are converted to X-ray attenuation values $\mu$ (cm$^{-1}$):

$$\mu = \frac{\text{CT number}}{1000} \times \mu_{\text{water}} + \mu_{\text{water}} \tag{3}$$

where $\mu_{\text{water}}$ is the linear attenuation coefficient of water, which is 0.18 cm$^{-1}$ at 120 kVp (Boedeker et al., 2007). Then, total variation based denoising (Rubin, 1992) was used to remove the noise of the original acquisition from the attenuation image.
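Equation 3 and its inverse can be written as a short NumPy sketch, shown below. The function names are hypothetical, and the inverse mapping is included only as an illustration of how a simulated attenuation image could be returned to Hounsfield Units; it is simply Equation 3 rearranged. The denoising step is not shown.

```python
import numpy as np

# Linear attenuation coefficient of water quoted in the text for 120 kVp (cm^-1).
MU_WATER_PER_CM = 0.18


def hu_to_attenuation(ct_number_hu, mu_water=MU_WATER_PER_CM):
    """Convert CT numbers (HU) to linear attenuation values (cm^-1), Eq. 3."""
    hu = np.asarray(ct_number_hu, dtype=float)
    return hu / 1000.0 * mu_water + mu_water


def attenuation_to_hu(mu, mu_water=MU_WATER_PER_CM):
    """Inverse of Eq. 3: convert attenuation values (cm^-1) back to HU."""
    mu = np.asarray(mu, dtype=float)
    return (mu - mu_water) / mu_water * 1000.0


if __name__ == "__main__":
    hu = np.array([-1000.0, 0.0, 1000.0])  # air, water, dense bone-like value
    mu = hu_to_attenuation(hu)
    print(mu)
    print(attenuation_to_hu(mu))
```

Air at -1000 HU maps to zero attenuation and water at 0 HU maps to $\mu_{\text{water}}$, which is a quick sanity check on the conversion.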

Figure 2: The geometry parameters of the simulated CT system: $\phi$ is the beam angle, $d_1$ the source to image center distance, $d_2$ the image center to array distance, $d_{det}$ the distance between adjacent detectors, and $n_{det}$ the number of detectors.

An outline of the geometry of the CT scanner system is required to simulate the acquisition process, but as this is often not available from image metadata, we use the following procedure to allow general CT augmentation. The geometric parameters required by the ASTRA toolbox are outlined in Figure 2: source to image center distance $d_1$, image center to array distance $d_2$, number of X-ray detectors $n_{det}$, and distance between adjacent detectors $d_{det}$. We assumed the beam angle $\phi$ to be 60°, as is typical of modern scanners (Peyrin and Engelke, 2021), although a specific value is sometimes available in the metadata. We also assumed $n_{det}$ to equal 1500, which is consistent with modern scanners (Hermena and Young, 2021); higher values did not change the reconstructed image. The other geometric parameters can be calculated as follows.

$d_1$ can be obtained using $\phi$ and the diameter of the circular field of view $d_{fov}$. $d_{fov}$ can be measured from the input CT image by identifying the edge of the circular field of view, which is achieved by sampling a line of voxels from the corner of the central slice of the image to the image center, until an intensity value other than the uniform value outside the field of view is encountered.

$$d_1 = \frac{d_{fov}}{\sin(\phi/2)} \qquad (4)$$
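The field-of-view measurement described above can be sketched as follows. This is an illustrative implementation under stated assumptions: the slice is indexed as `slice_2d[row][col]`, the corner voxel holds the uniform padding value, and nearest-voxel sampling along the diagonal is used. The function name and the `pixel_spacing` and `tol` parameters are hypothetical.

```python
import math


def fov_diameter(slice_2d, pixel_spacing: float = 1.0, tol: float = 0.0):
    """Estimate the circular field-of-view diameter from one axial slice.

    Walks from the top-left corner towards the image center and stops at the
    first voxel whose value differs from the corner (padding) value.
    Returns None if no such voxel is found.
    """
    n_rows, n_cols = len(slice_2d), len(slice_2d[0])
    cy, cx = (n_rows - 1) / 2.0, (n_cols - 1) / 2.0
    background = slice_2d[0][0]  # assumed uniform value outside the FOV
    steps = max(n_rows, n_cols)
    for i in range(steps + 1):
        t = i / steps
        r, c = round(t * cy), round(t * cx)
        if abs(slice_2d[r][c] - background) > tol:
            radius = math.hypot(r - cy, c - cx)
            return 2.0 * radius * pixel_spacing
    return None
```

The estimate is limited to voxel precision along the diagonal, so a small underestimate of the true diameter is expected.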

$d_2$ can be assumed to be similar (approximated as equal) to $d_1$ because both the X-ray source and detectors are attached to the same rotating gantry. The exact value will not affect the simulation if the air is assumed to have negligible attenuation and the values of $n_{det}$ and $d_{det}$ correspond to the projected size of the detector array arc given $d_1$ and $d_2$.

$$d_{det} = \frac{2\,(d_1 + d_2)\,\tan(\phi/2)}{n_{det}} \qquad (5)$$
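Equations (4) and (5) can be combined into a short helper, sketched below with the default values stated in the text ($\phi$ = 60°, $n_{det}$ = 1500) and $d_2$ set equal to $d_1$. The function name is hypothetical, and all lengths are in whatever unit $d_{fov}$ is given in.

```python
import math


def fan_beam_geometry(d_fov: float, phi_deg: float = 60.0, n_det: int = 1500):
    """Return (d1, d2, d_det) from the field-of-view diameter."""
    half_angle = math.radians(phi_deg) / 2.0
    d1 = d_fov / math.sin(half_angle)                        # Equation (4)
    d2 = d1                                                  # approximated as equal to d1
    d_det = 2.0 * (d1 + d2) * math.tan(half_angle) / n_det   # Equation (5)
    return d1, d2, d_det
```

For example, a 500 mm field of view with the default settings gives $d_1 = d_2 \approx 1000$ mm and $d_{det} \approx 1.54$ mm.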

Another parameter required by the acquisition simulation is the number of sampled angles $n_\theta$, which was empirically tuned alongside the noise parameters in the process outlined in Figure 3.

The sinogram generated from the projection stage of the simulation was used to reconstruct a CT image using ASTRA’s filtered back projection function with the default Ramachandran-Lakshminarayanan filter (Ramachandran and Lakshminarayanan, 1971).

Once the projection stage of the simulation has generated a sinogram for each slice, augmented noise can be added. This can be modeled as a combination of Poisson noise due to a restricted number of photons reaching the detector, and Gaussian electronic noise (Won Kim and Kim, 2014). As our augmentation approach is designed to be general, including for when specific information about the beam properties is not available, we empirically tune $Q_0$, the incident flux of photons on each detector point with no X-ray attenuation. This, in combination with the attenuation of the X-ray beam by the scanned object, determines the actual incident number of photons and thus the Poisson noise. We also empirically tune the Gaussian noise parameter $\sigma$. The sinogram after application of the noise augmentation, $S_{ns}$, is

$$S_{ns} = S_0 + P(Q_0 \exp(-S_0)) + G(\sigma) \qquad (6)$$

where $S_0$ is the initial sinogram, $P(x)$ is a random sample from the Poisson distribution with expected value $x$, and $G(\sigma)$ is a random sample from a Gaussian distribution with zero mean and standard deviation $\sigma$.
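A minimal NumPy sketch of Equation (6) is given below. It follows the equation term by term as written; any conversion between photon counts and line-integral units performed elsewhere in the pipeline is not described in this section and is therefore not assumed here. The function name and the optional `rng` argument are hypothetical.

```python
import numpy as np


def add_sinogram_noise(s0, q0: float, sigma: float, rng=None):
    """Apply Equation (6): S_ns = S_0 + P(Q_0 exp(-S_0)) + G(sigma)."""
    rng = np.random.default_rng() if rng is None else rng
    s0 = np.asarray(s0, dtype=float)
    # Poisson sample whose expected value is the attenuated photon count.
    poisson_term = rng.poisson(q0 * np.exp(-s0))
    # Zero-mean Gaussian electronic noise with standard deviation sigma.
    gaussian_term = rng.normal(0.0, sigma, size=s0.shape)
    return s0 + poisson_term + gaussian_term
```

Passing a seeded generator, for example `np.random.default_rng(0)`, makes the augmentation reproducible across runs.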

Figure 3: Outline of the CT noise simulation tuning process. Step 1: The image intensity is converted from HU to attenuation units before denoising and noise extraction via the total variation method. Step 2: The simulation parameters $Q_0$, $\sigma$, and $n_\theta$ are selected to maximize the patchwise similarity in the radial noise power spectrum (NPS) between the noise generated by the simulation and the noise extracted from the original image, as described in Table 2. Step 3: To increase the level of noise, the incident flux $Q_0$ is decreased until a desired augmented image noise level is reached, as described in Section 4.2.1.

The noise model has three parameters which require empirical tuning using the input dataset (see Table 2). We did this by comparing the CT noise generated by the model with the noise extracted from the input test images by total variation denoising (see Figure 3). For each test image, the central slice was broken into a 10×10 grid, the radial noise power spectrum (NPS) was calculated in each cell (Dolly et al., 2016), and the mean of the sum of squared differences between the NPS curves from the data and the model, mSSE$_{NPS}$, was calculated. The parameters were tuned to minimize the mSSE$_{NPS}$, which represents dissimilarity in the CT noise while accounting for its spatial variation. The tuning process was a two-stage coarse-to-fine grid search, using the search parameters in Table 2.
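The patchwise comparison can be sketched as below. This is an illustration under assumptions: the NPS is taken as the squared FFT magnitude of the mean-subtracted cell divided by the number of pixels, radial averaging uses equal-width bins up to the Nyquist radius, and the mean is taken over grid cells. The function names and the `n_bins` parameter are hypothetical.

```python
import numpy as np


def radial_nps(patch, n_bins: int = 16):
    """Radially averaged noise power spectrum of one 2D noise patch."""
    patch = np.asarray(patch, dtype=float)
    patch = patch - patch.mean()
    power = np.abs(np.fft.fftshift(np.fft.fft2(patch))) ** 2 / patch.size
    rows, cols = patch.shape
    y, x = np.indices(patch.shape)
    r = np.hypot(y - rows // 2, x - cols // 2)   # distance from zero frequency
    edges = np.linspace(0.0, min(rows, cols) / 2.0, n_bins + 1)
    idx = np.digitize(r.ravel(), edges) - 1
    flat = power.ravel()
    nps = np.zeros(n_bins)
    for b in range(n_bins):
        sel = idx == b
        if sel.any():
            nps[b] = flat[sel].mean()
    return nps


def msse_nps(noise_data, noise_model, grid: int = 10, n_bins: int = 16) -> float:
    """Mean over grid cells of the summed squared NPS difference."""
    a = np.asarray(noise_data, dtype=float)
    b = np.asarray(noise_model, dtype=float)
    row_edges = np.linspace(0, a.shape[0], grid + 1).astype(int)
    col_edges = np.linspace(0, a.shape[1], grid + 1).astype(int)
    sse = []
    for i in range(grid):
        for j in range(grid):
            cell = (slice(row_edges[i], row_edges[i + 1]),
                    slice(col_edges[j], col_edges[j + 1]))
            diff = radial_nps(a[cell], n_bins) - radial_nps(b[cell], n_bins)
            sse.append(np.sum(diff ** 2))
    return float(np.mean(sse))
```

Identical noise fields give a value of zero, and the value grows as the simulated noise departs from the extracted noise in magnitude or texture.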

Table 2: The unitless simulation parameters tuned to recreate the CT noise extracted from the test dataset, by minimizing the mean noise power spectrum dissimilarity (mSSE$_{NPS}$). The fine grid search parameters were found by applying the multiplications or additions shown to the optimum parameter found in the coarse search stage.

| Parameter | Description | Coarse search | Fine search |
|---|---|---|---|
| $Q_0$ | Incident photon flux | $10^4$, $10^5$, $10^6$, $10^7$ | ×0.5, ×0.75, ×1.0, ×2.5, ×5 |
| $\sigma$ | s.d. of Gaussian noise | 0, 0.1, 1, 10 | ×0.5, ×0.75, ×1.0, ×2.5, ×5 |
| $n_\theta$ | Number of sampled angles | 720, 1440, 2160, 2880 | −360, +0, +360 |
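The two-stage search over the Table 2 values can be sketched as follows. The `objective` callable is a hypothetical stand-in for running the simulation and returning mSSE$_{NPS}$; exhaustively evaluating every combination in each stage is an assumption about how the grid was traversed.

```python
import itertools

# Search values taken from Table 2.
COARSE = {
    "q0": [1e4, 1e5, 1e6, 1e7],
    "sigma": [0.0, 0.1, 1.0, 10.0],
    "n_theta": [720, 1440, 2160, 2880],
}
FINE_SCALES = [0.5, 0.75, 1.0, 2.5, 5.0]   # applied to Q_0 and sigma
FINE_ANGLE_OFFSETS = [-360, 0, 360]        # applied to n_theta


def _best(grid, objective):
    """Return the (q0, sigma, n_theta) combination with the lowest objective."""
    combos = itertools.product(grid["q0"], grid["sigma"], grid["n_theta"])
    return min(combos, key=lambda p: objective(*p))


def coarse_to_fine_search(objective):
    """objective(q0, sigma, n_theta) -> dissimilarity to minimize."""
    q0, sigma, n_theta = _best(COARSE, objective)
    fine = {
        "q0": [q0 * f for f in FINE_SCALES],
        "sigma": [sigma * f for f in FINE_SCALES],
        "n_theta": [n_theta + d for d in FINE_ANGLE_OFFSETS],
    }
    return _best(fine, objective)
```

Note that if the coarse stage selects $\sigma = 0$, the multiplicative fine stage cannot move it away from zero.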

4.2 Artifact generation

4.2.1 Noise

The noise in the image produced by the CT simulation can be increased by reducing the value of $Q_0$, modeling a lower X-ray dose, until the desired standard deviation (s.d.) of the simulated noise field is obtained. This search is done by rounding the optimized $Q_0$ to the nearest order of magnitude ($10^m$, where $m$ is an integer) and decreasing $m$, then fine-tuning using multiples of [0.5, 0.75, 1.0, 1.25, 1.5] to obtain the closest s.d. value.
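One possible reading of this search is sketched below. The stopping rule for the decade search (step down until the target s.d. is reached, keeping the closest decade) is an assumption, as is the `noise_sd` callable, which stands in for running the simulation at a given $Q_0$ and measuring the s.d. of the resulting noise field.

```python
import math

FINE_MULTIPLES = [0.5, 0.75, 1.0, 1.25, 1.5]  # from the text


def find_q0_for_noise(q0_tuned: float, target_sd: float, noise_sd,
                      min_exponent: int = 0) -> float:
    """Return the Q_0 whose simulated noise s.d. is closest to target_sd."""
    m = round(math.log10(q0_tuned))            # nearest order of magnitude
    best_q0 = 10.0 ** m
    sd = noise_sd(best_q0)
    best_err = abs(sd - target_sd)
    # Coarse stage: lower the exponent while the noise is still below target.
    while sd < target_sd and m > min_exponent:
        m -= 1
        sd = noise_sd(10.0 ** m)
        if abs(sd - target_sd) < best_err:
            best_q0, best_err = 10.0 ** m, abs(sd - target_sd)
    # Fine stage: try the listed multiples of the best decade.
    candidates = [best_q0 * f for f in FINE_MULTIPLES]
    return min(candidates, key=lambda q: abs(noise_sd(q) - target_sd))
```

In practice `noise_sd` is expensive, so caching its results per $Q_0$ value would avoid repeated simulations.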

As shown by the simulated images in Figure 4 and the noise power spectra in Figure 5, this method of increasing the noise level does not significantly change the spatial distribution or texture of the noise field. Therefore, the strength of this augmentation can be quantified by the noise s.d., even though the augmentation is caused by altering the physical parameter Q0.

Figure 4: The CT noise augmentation is performed by reducing the simulation parameter Q0 to the value resulting in the output images having an increased noise component with a given standard deviation (s.d.).
Figure 5: The radial noise power spectrum (NPS) is sampled from the noise extracted from the original image within a central patch (left) for comparison with the same patch of the simulated noise fields. The normalized NPS curves are shown to have similar profiles (right).

4.2.2 Metal implants

Metallic implants can cause streaking artifacts in CT images. In order to simulate this phenomenon, we augment the input image slices by inserting a cylindrical region with a radiodensity of 20,000 Hounsfield units, corresponding to pure steel (Bolliger et al., 2009). These implants were centered at the highest intensity point in the spine and orientated axially, with a length spanning the axial range of the annotated objects or segmentation plus an extra 15mm in both directions. This arrangement was chosen because the image slices which contain part of the metal implant are affected by the characteristic streaking artifact, so in the augmented images generated by the CT simulation, this artifact is present in the slices containing the annotated objects or segmentation and also in the region superior and inferior to them. Therefore, the effect of the metal implant artifact surrounds the annotated objects or segmentation.

We scaled the augmentation strength by increasing the radius of the implants. The output of the CT simulation contained the characteristic streaking artifacts after the input images were augmented with a metal implant. For small implants the artifacts were visually similar to examples in the LUNA 16 dataset, as shown in Figure 6. Simulated implants with a larger radius caused more extensive streaking, which is visually similar to reported observations of artifacts due to spinal implants (Barrett and Keat, 2004). However, comparably extensive streaking was not seen in the LUNA 16 dataset.
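The implant insertion step, applied before the CT simulation, can be sketched as below. The array ordering (z, y, x), the function name, and the parameters are assumptions for illustration; locating the highest intensity point in the spine is not shown, so the in-plane center is passed in directly.

```python
import numpy as np

STEEL_HU = 20000.0  # radiodensity used for the implant in the text


def insert_cylindrical_implant(volume_hu, center_yx, radius_mm, z_range_mm,
                               spacing_zyx, margin_mm=15.0):
    """Return a copy of the volume with an axial steel cylinder inserted.

    z_range_mm: (lowest, highest) axial position of the annotation in mm;
    the cylinder extends margin_mm beyond it in both directions.
    """
    vol = np.array(volume_hu, dtype=float, copy=True)
    dz, dy, dx = spacing_zyx
    z_mm = np.arange(vol.shape[0]) * dz
    z_mask = (z_mm >= z_range_mm[0] - margin_mm) & (z_mm <= z_range_mm[1] + margin_mm)
    yy, xx = np.indices(vol.shape[1:])
    # In-plane distance of every voxel from the cylinder axis, in mm.
    dist = np.hypot((yy - center_yx[0]) * dy, (xx - center_yx[1]) * dx)
    disk = dist <= radius_mm
    vol[z_mask[:, None, None] & disk[None, :, :]] = STEEL_HU
    return vol
```

Increasing `radius_mm` then scales the augmentation strength as described above, with the streaking itself produced by the subsequent projection and reconstruction.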

Figure 6: The simulation of a metallic implant as a form of CT data augmentation causes streaking artifacts with a severity depending on the implant radius (white numbers). An example of the simulated artifact due to a 2.5mm radius implant (blue outline) is visually similar to an example artifact found in the LUNA 16 training dataset (orange outline).

4.2.3 Motion

We model patient motion as a rigid change in pose by tilting during the sequential acquisition of slices, by applying an axial rotation to all slices above or below a certain point and thus producing a discontinuity in the CT image. We consider two types of motion augmentation, which differ due to how their severity is scaled:

  • Motion magnitude (mag): The augmentation strength is increased by increasing the amount of rotational motion in degrees, while the location of the slice discontinuity is constant at 10mm inferior to the lowest point of the annotation.

  • Motion proximity (prx): The augmentation strength is increased by decreasing the distance of the motion discontinuity to the lowest point of the annotation, while the amount of rotation is kept constant at 10°.
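Both motion augmentations reduce to rotating all slices on one side of a chosen slice index. A minimal NumPy-only sketch is given below, assuming a (z, y, x) array in which the rotated slices are those with index below `split_index`, nearest-neighbour resampling, and an air-valued fill; the function name and these choices are illustrative only.

```python
import numpy as np


def apply_tilt_motion(volume, split_index: int, angle_deg: float,
                      fill_value: float = -1000.0):
    """Rotate slices [0, split_index) in-plane about the image center."""
    vol = np.asarray(volume, dtype=float)
    out = vol.copy()
    if split_index <= 0:
        return out
    rows, cols = vol.shape[1:]
    cy, cx = (rows - 1) / 2.0, (cols - 1) / 2.0
    t = np.deg2rad(angle_deg)
    yy, xx = np.indices((rows, cols))
    # Inverse mapping: for each output voxel, find the source voxel.
    ys = cy + (yy - cy) * np.cos(t) - (xx - cx) * np.sin(t)
    xs = cx + (yy - cy) * np.sin(t) + (xx - cx) * np.cos(t)
    yi, xi = np.rint(ys).astype(int), np.rint(xs).astype(int)
    valid = (yi >= 0) & (yi < rows) & (xi >= 0) & (xi < cols)
    rotated = np.full((split_index, rows, cols), fill_value, dtype=float)
    rotated[:, valid] = vol[:split_index, yi[valid], xi[valid]]
    out[:split_index] = rotated
    return out
```

For the magnitude variant `angle_deg` is varied with `split_index` fixed, and for the proximity variant `split_index` is moved towards the annotation with `angle_deg` fixed at 10.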

5 Experiments and results

5.1 Dataset analysis

Figure 7 shows the measurement of the noise distribution in the LUNA 16 dataset. This is used to find the set of weights $\omega_s$, replacing the coefficient $\omega_s = (2/3)^s$ in equation (7) (Boone et al., 2023), where $s$ is the level of augmentation severity.

$$Deg_{TM} = \frac{1}{\sum_{s=1}^{5}\omega_s}\sum_{s=1}^{5}\omega_s\,(mM_{clean} - mM_{T,s}), \quad \text{with: } \omega_s = (2/3)^s \qquad (7)$$

We obtained the weights (0.221,0.044,0.006,0,0,0) corresponding to the noise augmentation levels of (10,20,50,100,200,350,500) by dividing the frequency distribution at those points by its value at the base level of noise (see Figure 7b).
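Equation (7) is a weighted mean of the metric drop at each severity level, which can be sketched as below. The function name is hypothetical; the default weights are the $(2/3)^s$ coefficients of Equation (7), and the optional `weights` argument is where empirically derived weights would be passed instead.

```python
def degradation(clean_metric, augmented_metrics, weights=None):
    """Equation (7): weighted mean drop in a metric across severity levels.

    augmented_metrics[s-1] is the mean metric at severity level s = 1, 2, ...
    """
    n_levels = len(augmented_metrics)
    if weights is None:
        weights = [(2.0 / 3.0) ** s for s in range(1, n_levels + 1)]
    if len(weights) != n_levels:
        raise ValueError("one weight is required per severity level")
    total = sum(weights)
    drop = sum(w * (clean_metric - m) for w, m in zip(weights, augmented_metrics))
    return drop / total
```

A model whose metric is unchanged at every severity level scores zero, and a uniform drop of 0.1 at every level scores 0.1 regardless of the weights.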

Figure 7: Measurements of the noise in the original LUNA 16 dataset allow empirical scaling of the degradation weights. The standard deviation of the noise extracted by the total variation method is measured for each case. a) We use a histogram to identify the modal level of noise as a base level (found to be 4 HU). b) We use the cases with noise equal to or higher than the base level to produce a second histogram, to which the distribution function (green) is fitted. Both histograms use a bin width of 1 HU.

5.2 Noise simulation

An example from the LUNA 16 dataset with high noise is shown in Figure 8. The CT noise is removed by total variation based denoising (Rubin, 1992) and recreated by the simulation, showing good visual similarity with the original image noise. In this case, the tuned simulation values are: $Q_0 = 2.5 \times 10^{6}$, $\sigma = 0$, and $n_\theta = 2160$.

Figure 8: a) A slice from the LUNA 16 dataset, with a nodule labeled with a blue box. b) The slice after denoising using the total variation method. c) The simulation is shown to recreate visually similar CT noise.

5.3 Model degradation results overview

An overview of the performance and degradation scores shows that nnUnet was overall the most robust segmentation model (see Table 3), while retinaNet was overall the most robust object detection model (see Table 4). These summary results are analyzed closely in the following subsections.

Table 3: Degradation results for the segmentation models, where a lower score means better robustness (↓). The DSC is measured between the annotation segmentation map and the model prediction, where a higher Base DSC means better model performance before augmentation (↑). 3D-UNet (aug) is the 3D-UNet architecture trained with the same set of training augmentations as nnUnet. 3D-UNet (win) is the 3D-UNet architecture with pre-processing of the input image by intensity windowing with the range -60 to 160 HU.

| Dataset | Model | Base DSC ↑ | Degradation ↓: Noise | Degradation ↓: Metal | Degradation ↓: Motion mag | Degradation ↓: Motion prx |
|---|---|---|---|---|---|---|
| Liver Seg. | 3D-UNet | 0.808 ± 0.09 | 0.123 | 0.111 | 0.002 | 0.001 |
| Liver Seg. | 3D-UNet (aug) | 0.862 ± 0.09 | 0.128 | 0.106 | 0.003 | 0.000 |
| Liver Seg. | 3D-UNet (win) | 0.908 ± 0.10 | 0.217 | 0.155 | 0.000 | 0.001 |
| Liver Seg. | nnUNet | 0.962 ± 0.10 | 0.061 | 0.070 | 0.000 | 0.000 |
| LUNA 16 | 3D-UNet | 0.751 ± 0.07 | 0.050 | 0.167 | 0.037 | 0.027 |
| LUNA 16 | 3D-UNet (aug) | 0.700 ± 0.08 | 0.046 | 0.143 | 0.023 | 0.041 |
| LUNA 16 | nnUNet | 0.830 ± 0.07 | 0.031 | 0.166 | 0.038 | 0.051 |

Table 4: Degradation results for 3D Object Detection models, where a lower score means better robustness (↓). For each image, mAP is the mean result for the IoU thresholds (0.01, 0.15… 0.9), where a higher base mAP means better performance before augmentation (↑). 3D-UNet (aug) is the 3D-UNet architecture trained with the same set of training augmentations as nnUnet. The nnUNet and 3D-UNet segmentation models are converted to detection models producing object bounding boxes, see Section 3.3.2.

| Dataset | Model | Base mAP ↑ | Degradation ↓: Noise | Degradation ↓: Metal | Degradation ↓: Motion mag | Degradation ↓: Motion prx |
|---|---|---|---|---|---|---|
| LUNA 16 | RetinaNet | 0.709 ± 0.20 | 0.001 | 0.228 | 0.003 | 0.023 |
| LUNA 16 | 3D-UNet | 0.549 ± 0.17 | 0.067 | 0.243 | 0.037 | 0.033 |
| LUNA 16 | 3D-UNet (aug) | 0.591 ± 0.19 | 0.068 | 0.231 | 0.023 | 0.052 |
| LUNA 16 | nnUNet | 0.660 ± 0.17 | 0.085 | 0.264 | 0.038 | 0.015 |

5.4 CT noise augmentation

Figure 9: The Dice (DSC) score results against the level of CT noise added by augmentation, for the segmentation models each tested with 30 cases from the liver segmentation dataset. The means with standard deviation error bars are shown in purple. nnUnet demonstrates higher robustness than the 3D-Unet models, shown by the shallower drop off in DSC with increasing noise and lower degradation metrics.

As shown in the ‘Noise’ column of Table 3 and plotted in Figure 9, applying noise augmentation to the liver segmentation decathlon dataset shows that the nnUnet is more robust to increased CT noise than the 3D-UNet, as the mean DSC degradation (mDDeg) is lower. The robustness of the 3D-UNet increased when the nnUnet suite of training augmentation transforms was applied (see Table 1). Therefore, part of the increase in robustness is due to the training time augmentations, and part is due to the model ensemble selection protocol and parameter tuning utilized by nnUnet.

Adding contrast windowing as part of the training and test pre-processing for 3D-UNet (HU windowing) resulted in greatly reduced robustness. This can be expected, because the model is trained to handle a smaller range of intensity values (-60 to 160 HU) and sufficient CT noise results in values being beyond this range. Although the effect of windowing on the input image shown in Figure 10 would not be visible to the user if this pre-processing step is intrinsic to a black-box model, the difference in robustness demonstrated by our framework would be.
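The effect of the window on the relative size of the noise can be illustrated with a small sketch. The rescaling to [0, 1] after clipping and the function names are assumptions, since the exact pre-processing of the windowed model is not specified beyond the HU range; the wider -1000 to 1000 HU range in the usage note is purely illustrative.

```python
def window_hu(values, low=-60.0, high=160.0):
    """Clip HU values to [low, high] and rescale them to [0, 1]."""
    width = high - low
    return [(min(max(v, low), high) - low) / width for v in values]


def relative_noise(noise_sd_hu, low, high):
    """Noise s.d. expressed as a fraction of the intensity range seen by the model."""
    return noise_sd_hu / (high - low)
```

For example, noise with an s.d. of 50 HU spans about 23% of the -60 to 160 HU window (`relative_noise(50, -60, 160)`), compared with 2.5% of a -1000 to 1000 HU range, and any value pushed outside the window is saturated by the clipping.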

Figure 10: Top: The output segmentation for the 3D-UNet model (without nnUnet based augmentation). Bottom: The same model trained with intensity windowing to the range -60 to 160 HU as a pre-processing step. The annotation segmentation map is blue, the predicted segmentation map is orange, and the overlap is purple. The windowing makes the effect of the noise greater relative to the range of intensity values, resulting in the DSC reducing faster as the simulated noise level increases. Note the windowed images would not be available to a user if the windowing was integrated into the model, but we are able to reveal this as the model developer.

As shown in Table 4, testing with the LUNA 16 test dataset shows that the retinaNet object detection model is very robust to increased CT noise, with no positive degradation, in contrast to any of the object detection models derived from segmentation models. When the robustness of the segmentation models was tested with the DSC metric, the nnUnet was more robust than the 3D-UNet. Adding the nnUnet suite of training augmentation transforms did not improve the robustness of 3D-UNet at this task.

A notable response of the 3D-UNet and nnUnet models to increasing CT noise is the great increase in DSC and mAP standard deviation. This is due to the metrics improving for some of the cases, because some false positive nodule segmentations or boxes were avoided when the nodule-like image features were obscured by noise. However, for the whole dataset this is outweighed by fewer true positive outputs with increased noise augmentation. This effect shows the importance of the standard deviation based degradation metrics (sMDeg and sPDeg) in complementing the metric based degradation metrics (mMDeg and mPDeg).

5.5 Metal implant augmentation

As shown in the ‘Metal’ column of Table 3, applying a simulated cylindrical implant to the liver segmentation decathlon dataset shows that the nnUnet is more robust to streak artifacts than the 3D-UNet. As in the noise augmentation testing, the robustness of the 3D-UNet increased when the nnUnet suite of training augmentation transforms was applied. Table 3, Table 4 and Figure 12 also show this when the cylindrical implant augmentation was applied to the LUNA 16 test dataset and the segmentation models were tested. Figure 12 also shows that retinaNet is more robust to streak artifacts than the object detection models derived from segmentation models. Figure 11 shows the streak artifacts across a band of image slices caused by a simulated cylindrical spinal metal implant in a case from LUNA 16, as well as how this impacts the models’ output.

Figure 11: A coronal view of the streak artifacts produced by a simulated 7.5 mm radius cylindrical implant which extends 15mm above and below the annotated nodules, as well as the model outputs after this augmentation on a case from LUNA 16. A higher mean Average Precision (mAP) means the output object detection boxes (orange) more accurately reflect the annotations (blue). A higher Dice score (DSC) means the output segmentation maps (orange) have a greater overlapping volume (green) with the annotation maps (blue).
Figure 12: Top: The Dice (DSC) score results against simulated implant radius for the tested segmentation models. Bottom: mean Average Precision (mAP) results against simulated implant radius for the tested object detection models, including segmentation models converted to generate bounding boxes. These are shown for the 30 test cases from the LUNA 16 dataset, with a cylindrical spinal implant added which induces a simulated streak artifact. The means with standard deviation error bars are shown in purple. The models degrade at a similar rate as the implant radius is increased, but retinaNet is more robust than the converted detection models.

5.6 Motion augmentation

The ‘Motion mag’ columns of Tables 3 and 4 show that all of the models tested were robust to rotational motion augmentation when the mean metric based degradation was considered. The models’ performance does not significantly degrade as the magnitude of the rotational motion is increased beyond 10 degrees (see Figure 13). However, the standard deviation based degradation (sDDeg and sPDeg) shows a significant effect for the LUNA 16 test dataset, as the rotational discontinuity 5mm from the lowest nodule caused both increases and decreases in the segmentation DSC.

Figure 13: Top: The Dice (DSC) score results against the amount of rotational motion for the tested segmentation models. Bottom: mean Average precision (mAP) results against the amount of rotational motion for the tested object detection models, including segmentation models converted to generate bounding boxes. These are shown for the 30 test cases from the LUNA 16 dataset, with the discontinuity resulting from the single tilt motion located 5mm inferior to the lowest nodule containing slice according to the annotation. The means with standard deviation error bars are shown in purple. The models did not significantly degrade as the rotation increased beyond 10 degrees.

The reason for this is shown by the plots in Figure 14. When the discontinuity is shifted very close to the annotated object, the discontinuity sometimes improves the segmentation or detection result as the discontinuity matches the annotated object boundary in the axial direction. This effect is also seen for object detection models.

Figure 14: Top: The Dice (DSC) score results against proximity of the motion discontinuity from the lowest point containing a nodule, for the tested segmentation models. Bottom: mean Average Precision (mAP) results against proximity of the motion discontinuity from the lowest point containing a nodule, for the tested object detection models including segmentation models converted to generate bounding boxes. These are shown for each of the 30 test cases from the LUNA 16 dataset with 10 degrees of rotational motion causing a discontinuity at varying proximity to the lower edge of the lowest nodule according to the annotations. The means with standard deviation error bars are shown in purple. The models’ performance degrades and then improves as the discontinuity approaches the annotation.

An example of this effect is shown in Figure 15, where the model outputs improve when the discontinuity is adjacent to the nodule. This suggests a random motion discontinuity location should be chosen for robustness testing, rather than one based on annotation, to avoid giving the tested models implicit information which they would not have in real world use.

Figure 15: Discontinuity-nodule 10mm (top row): A coronal view of the image discontinuity produced by a 10 degree patient rotation 10mm below the annotated nodule, as well as the model outputs after this augmentation on a case from LUNA 16. Discontinuity-nodule 0mm (bottom row): The corresponding outcome with the discontinuity immediately below the annotated nodule, showing the 3D-UNet and RetinaNet performance improving due to the sharp boundary coinciding with the bottom of the annotated nodule.

6 Discussion and conclusion

In this work, we present a framework combining augmentations and evaluation metrics for evaluating the robustness of segmentation and object detection models with OOD cases produced by simulated protocol changes and artifacts in CT.

The experiments we performed to demonstrate this framework show that modern models are susceptible to augmentations modeling distribution shifts due to increased CT noise, streaking artifacts caused by high-attenuation implants, and patient movement part way through the scan. More specifically, we show that different architectures have differing levels of robustness to these OOD phenomena. We find that nnUnet had higher robustness to increased CT noise and streaking artifacts compared to 3D-Unet when segmenting the liver. A further experiment, introducing the same suite of training augmentations used in nnUnet to 3D-Unet, shows that nnUnet’s superior robustness was partially due to the fixed augmentations and partially due to the use of an ensemble of Unets in its training protocol. Furthermore, we find that adding intensity windowing to the pre-processing, as is common for CT image analysis, improves the segmentation performance but reduces the robustness to increased CT noise or metal implants. This is because the windowing removes image detail in the liver and surrounding tissue, while maintaining an intensity boundary between them, making the segmentation task easier for in-distribution images. However, this also means the trained 3D-Unet is less able to handle the intensity variation caused by increased CT noise or streak artifacts, so performance for those OOD test images is worse than if windowing is not used. Experiments assessing the robustness of an object detection architecture, retinaNet, when applied to nodule detection, show that it is highly robust to increased CT noise. When the 3D-Unet and nnUnet segmentation models are converted to object detection models, they perform worse than retinaNet without augmentation and also show worse robustness to noise. All models tested show significant weakness to the streaking artifacts caused by metal implants.

We find that the mean model performance decreases with increasing severity level of the noise and metal artifact augmentations, but the variance in the performance for individual cases increases, reflected by positive standard-deviation degradation scores. This has been seen in other research for augmentations used in robustness testing of models for MRI segmentation (Boone et al., 2023). However, we see some effects which are specific to CT artifact augmentations. Abrupt boundaries between adjacent image slices introduced by the augmentations can improve the segmentation or object detection accuracy in some cases while decreasing the average performance across the dataset, thus greatly increasing the variance. This is seen when a motion discontinuity is simulated close to the boundary of the annotated organ or nodule (see Figure 15). This effect is specific to simulating the CT modality because sinograms are acquired slice by slice, so the profile of artifacts may spread across an axial plane while abruptly changing from one plane to the next. This contrasts with an MRI protocol which reconstructs a 3D image after sampling a 3D trajectory through k-space, resulting in artifacts which propagate in all directions (Nárai et al., 2022).

In order to empirically scale the weights for our degradation calculations we measured the distribution of CT noise in the datasets, but a limitation is that we did not do this for the other augmentations. For CT noise, this was achieved by extracting the noise and measuring its standard deviation, which corresponds to how the severity of the noise artifact is quantified. However, for other types of augmentation, this empirical modeling is difficult given the challenges in translating observed image variances to simulation based transforms. For example, streaking artifacts are hard to retrospectively quantify in the image due to their very heterogeneous spatial profile. Instead, the severity of the augmentation is defined by the metal implant size, a parameter of the simulation used to generate the artifact. Further studies quantifying the characteristics of distribution shifts of artifacts in CT may enable more specific definition of model robustness, based on degradation weights fitted to the distribution associated with a specific imaging center.

A limitation of the CT simulation emerges for some of the test CT images, which had anatomy cropped by the circular field of view. As the area beyond the field of view is assumed to be air, this would have reduced the apparent attenuation of the X-ray beams at some angles in the simulation compared to a real-life CT scan, which may have altered the noise distribution. This could be mitigated in future research by registering a non-cropped atlas CT image to these cases, allowing an approximation of the attenuation beyond the field of view.

A limitation of our motion augmentation method is that it only considers a single rigid movement due to the patient changing position, which occurs between the acquisition of individual slices. Fast motion during the acquisition of a sinogram, and the impact of the complicated artifacts which follow, are not considered. In lung imaging, deformable motion due to breathing or the heartbeat during the acquisition of a slice sinogram can cause artifacts which appear as the duplication or distortion of blood vessels and nodules (van der Ham et al., 2022), which could affect a model more than a rigid discontinuity. In one study, radiologists identified artifacts of this type severe enough to affect diagnostic interpretation in 30% of CT scans acquired from outpatients (Dasegowda et al., 2023). An augmentation for fast motion artifacts could be added in future work, by applying a deformable motion transform to the image before a subset of the projection angles are sampled during the simulation (van der Ham et al., 2022). However, determining a corresponding transform for the annotated segmentation maps or object box co-ordinates would be a problem for robustness testing.

Our presented framework enables direct comparison of black-box models for the same segmentation or object detection task, so a potential user with a limited annotated local dataset can test and compare the robustness of available models against potential CT artifacts or image quality changes which could occur over the models’ service life at their site. The augmentations could also be used to generate a larger benchmarking dataset for the community to study architectural and model-design considerations that result in improved robustness. We recommend that the parameters which determine which slices are affected by simulated CT artifacts, i.e. the timing of motion, the axial position of an implant, or the implant’s extension in the axial direction, are randomly varied rather than determined by the annotation, to avoid sharp discontinuities between the slices influencing the tested models with implied information about those annotations.


Acknowledgments

This research was funded by Innovate UK, grant 10033899. It was further supported by the Wellcome/EPSRC Centre for Medical Engineering [WT 203148/Z/16/Z].


Ethical Standards

This work was based on publicly available datasets, see below.


Conflicts of Interest

Julia A. Schnabel: Founder of XRnostics Ltd.
Kanwal K. Bhatia: Founder of Metalynx Ltd. trading as Aival.
Jack Highton, Quok Zong Chong, Samuel Finestone, and Arian Beqiri: employees of Metalynx Ltd. trading as Aival.


Data availability

This work was based on publicly available datasets. Luna 16: https://luna16.grand-challenge.org/ Segmentation Decathlon: http://medicaldecathlon.com/

References

  • Ahmad (2021) Rani Ahmad. Reviewing the relationship between machines and radiology: the application of artificial intelligence. Acta Radiologica Open, 10(2):2058460121990296, 2021.
  • Anthony and Kamnitsas (2023) Harry Anthony and Konstantinos Kamnitsas. On the use of mahalanobis distance for out-of-distribution detection with neural networks for medical imaging. In International Workshop on Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, pages 136–146. Springer, 2023.
  • Antonelli et al. (2022) Michela Antonelli, Annika Reinke, Spyridon Bakas, Keyvan Farahani, Annette Kopp-Schneider, Bennett A Landman, Geert Litjens, Bjoern Menze, Olaf Ronneberger, Ronald M Summers, et al. The medical segmentation decathlon. Nature Communications, 13(1):4128, 2022.
  • Armato III et al. (2011) Samuel G Armato III, Geoffrey McLennan, Luc Bidaut, Michael F McNitt-Gray, Charles R Meyer, Anthony P Reeves, Binsheng Zhao, Denise R Aberle, Claudia I Henschke, Eric A Hoffman, et al. The lung image database consortium (lidc) and image database resource initiative (idri): a completed reference database of lung nodules on ct scans. Medical Physics, 38(2):915–931, 2011.
  • Barrett and Keat (2004) Julia F Barrett and Nicholas Keat. Artifacts in ct: recognition and avoidance. Radiographics, 24(6):1679–1691, 2004.
  • Bilic et al. (2023) Patrick Bilic, Patrick Christ, Hongwei Bran Li, Eugene Vorontsov, Avi Ben-Cohen, Georgios Kaissis, Adi Szeskin, Colin Jacobs, Gabriel Efrain Humpire Mamani, Gabriel Chartrand, et al. The liver tumor segmentation benchmark (lits). Medical Image Analysis, 84:102680, 2023.
  • Boedeker et al. (2007) Kirsten L Boedeker, Virgil N Cooper, and Michael F McNitt-Gray. Application of the noise power spectrum in modern diagnostic mdct: part i. measurement of noise power spectra and noise equivalent quanta. Physics in Medicine & Biology, 52(14):4027, 2007.
  • Bolliger et al. (2009) Stephan A Bolliger, Lars Oesterhelweg, Danny Spendlove, Steffen Ross, and Michael J Thali. Is differentiation of frequently encountered foreign bodies in corpses possible by hounsfield density measurement? Journal of Forensic Sciences, 54(5):1119–1122, 2009.
  • Boone et al. (2023) Lyndon Boone, Mahdi Biparva, Parisa Mojiri Forooshani, Joel Ramirez, Mario Masellis, Robert Bartha, Sean Symons, Stephen Strother, Sandra E Black, Chris Heyn, et al. Rood-mri: Benchmarking the robustness of deep learning segmentation models to out-of-distribution and corrupted data in mri. NeuroImage, 278:120289, 2023.
  • Bui et al. (2019) Toan Duc Bui, Li Wang, Jian Chen, Weili Lin, Gang Li, and Dinggang Shen. Multi-task learning for neonatal brain segmentation using 3d dense-unet with dense attention guided by geodesic distance. In Domain Adaptation and Representation Transfer and Medical Image Learning with Less Labels and Imperfect Data: First MICCAI Workshop, DART 2019, and First International Workshop, MIL3ID 2019, Shenzhen, Held in Conjunction with MICCAI 2019, Shenzhen, China, October 13 and 17, 2019, Proceedings 1, pages 243–251. Springer, 2019.
  • Cabral Jr. et al. (1993) James E Cabral Jr., Keith S White, Yongmin Kim, and Eric L Effmann. Interactive segmentation of brain tumors in mr images using 3d region growing. In Medical Imaging 1993: Image Processing, volume 1898, pages 171–181. SPIE, 1993.
  • Campello et al. (2021) Victor M Campello, Polyxeni Gkontra, Cristian Izquierdo, Carlos Martin-Isla, Alireza Sojoudi, Peter M Full, Klaus Maier-Hein, Yao Zhang, Zhiqiang He, Jun Ma, et al. Multi-centre, multi-vendor and multi-disease cardiac segmentation: the m&ms challenge. IEEE Transactions on Medical Imaging, 40(12):3543–3554, 2021.
  • Cardoso et al. (2022) M Jorge Cardoso, Wenqi Li, Richard Brown, Nic Ma, Eric Kerfoot, Yiheng Wang, Benjamin Murrey, Andriy Myronenko, Can Zhao, Dong Yang, et al. MONAI: An open-source framework for deep learning in healthcare. arXiv preprint arXiv:2211.02701, 2022.
  • Choi et al. (2022) Jae Won Choi, Yeon Jin Cho, Ji Young Ha, Yun Young Lee, Seok Young Koh, June Young Seo, Young Hun Choi, Jung-Eun Cheon, Ji Hoon Phi, Injoon Kim, et al. Deep learning-assisted diagnosis of pediatric skull fractures on plain radiographs. Korean Journal of Radiology, 23(3):343, 2022.
  • Çiçek et al. (2016) Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox, and Olaf Ronneberger. 3D U-Net: learning dense volumetric segmentation from sparse annotation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2016: 19th International Conference, Athens, Greece, October 17-21, 2016, Proceedings, Part II 19, pages 424–432. Springer, 2016.
  • Dasegowda et al. (2023) Giridhar Dasegowda, Bernardo C Bizzo, Parisa Kaviani, Lina Karout, Shadi Ebrahimian, Subba R Digumarthy, Nir Neumark, James M Hillis, Mannudeep K Kalra, and Keith J Dreyer. Auto-detection of motion artifacts on ct pulmonary angiograms with a physician-trained ai algorithm. Diagnostics, 13(4):778, 2023.
  • Dice (1945) Lee R Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
  • Dolly et al. (2016) Steven Dolly, Hsin-Chen Chen, Mark Anastasio, Sasa Mutic, and Hua Li. Practical considerations for noise power spectra estimation for clinical ct scanners. Journal of Applied Clinical Medical Physics, 17(3):392–407, 2016.
  • Galati et al. (2022) Francesco Galati, Sébastien Ourselin, and Maria A Zuluaga. From accuracy to reliability and robustness in cardiac magnetic resonance image segmentation: a review. Applied Sciences, 12(8):3936, 2022.
  • Geirhos et al. (2018) Robert Geirhos, Carlos RM Temme, Jonas Rauber, Heiko H Schütt, Matthias Bethge, and Felix A Wichmann. Generalisation in humans and deep neural networks. Advances in Neural Information Processing Systems, 31, 2018.
  • Giorgio and De Stefano (2013) Antonio Giorgio and Nicola De Stefano. Clinical use of brain volumetry. Journal of Magnetic Resonance Imaging, 37(1):1–14, 2013.
  • Goceri (2023) Evgin Goceri. Medical image data augmentation: techniques, comparisons and interpretations. Artificial Intelligence Review, pages 1–45, 2023.
  • Goldman (2007) Lee W Goldman. Principles of ct: dose and image quality ct. Journal of Nuclear Medicine Technology, 35(4):213–225, 2007.
  • Hendrycks and Gimpel (2016) Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations, 2016.
  • Hermena and Young (2021) Shady Hermena and Michael Young. CT-scan image production procedures. StatPearls. Treasure Island, (FL): StatPearls, 2021. PMID: 34662062.
  • Isensee et al. (2021) Fabian Isensee, Paul F Jaeger, Simon AA Kohl, Jens Petersen, and Klaus H Maier-Hein. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18(2):203–211, 2021.
  • Jacobs and van Ginneken (2016) Colin Jacobs and Bram van Ginneken. Lung nodule analysis 2016, 2016. URL https://luna16.grand-challenge.org/.
  • Jacobson and Krupinski (2021) Francine L Jacobson and Elizabeth A Krupinski. Clinical validation is the key to adopting ai in clinical practice. Radiology: Artificial Intelligence, 3(4):e210104, 2021.
  • Jin et al. (2022) Chao Jin, Jayaram K Udupa, Liming Zhao, Yubing Tong, Dewey Odhner, Gargi Pednekar, Sanghita Nag, Sharon Lewis, Nicholas Poole, Sutirth Mannikeri, et al. Object recognition in medical images via anatomy-guided deep learning. Medical Image Analysis, 81:102527, 2022.
  • Khalifa et al. (2022) Nour Eldeen Khalifa, Mohamed Loey, and Seyedali Mirjalili. A comprehensive survey of recent trends in deep learning for digital images augmentation. Artificial Intelligence Review, pages 1–27, 2022.
  • Koh et al. (2022) Dow-Mu Koh, Nickolas Papanikolaou, Ulrich Bick, Rowland Illing, Charles E Kahn Jr., Jayshree Kalpathi-Cramer, Celso Matos, Luis Martí-Bonmatí, Anne Miles, Seong Ki Mun, et al. Artificial intelligence and machine learning in cancer imaging. Communications Medicine, 2(1):133, 2022.
  • Kooi et al. (2017) Thijs Kooi, Geert Litjens, Bram Van Ginneken, Albert Gubern-Mérida, Clara I Sánchez, Ritse Mann, Ard den Heeten, and Nico Karssemeijer. Large scale deep learning for computer aided detection of mammographic lesions. Medical Image Analysis, 35:303–312, 2017.
  • Larici et al. (2017) Anna Rita Larici, Alessandra Farchione, Paola Franchi, Mario Ciliberto, Giuseppe Cicchetti, Lucio Calandriello, Annemilia Del Ciello, and Lorenzo Bonomo. Lung nodules: size still matters. European Respiratory Review, 26(146), 2017.
  • Li et al. (2023a) Wei Li, Guanghai Liu, Haoyi Fan, Zuoyong Li, and David Zhang. Self-supervised multi-scale cropping and simple masked attentive predicting for lung ct-scan anomaly detection. IEEE Transactions on Medical Imaging, 2023a.
  • Li et al. (2023b) Yuchun Li, Cong Lin, Yu Zhang, Siling Feng, Mengxing Huang, and Zhiming Bai. Automatic segmentation of prostate mri based on 3d pyramid pooling unet. Medical Physics, 50(2):906–921, 2023b.
  • Liguori et al. (2015) Carlo Liguori, Giulia Frauenfelder, Carlo Massaroni, Paola Saccomandi, Francesco Giurazza, Francesca Pitocco, Riccardo Marano, and Emiliano Schena. Emerging clinical applications of computed tomography. Medical Devices: Evidence and Research, pages 265–278, 2015.
  • Lin et al. (2017) Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
  • Liu et al. (2020) Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of-distribution detection. Advances in Neural Information Processing Systems, 33:21464–21475, 2020.
  • Liu et al. (2022) Xuan Liu, Xiaokun Liang, Lei Deng, Shan Tan, and Yaoqin Xie. Learning low-dose ct degradation from unpaired data with flow-based model. Medical Physics, 49(12):7516–7530, 2022.
  • Maier-Hein et al. (2024) Lena Maier-Hein, Annika Reinke, Patrick Godau, Minu D Tizabi, Florian Buettner, Evangelia Christodoulou, Ben Glocker, Fabian Isensee, Jens Kleesiek, Michal Kozubek, et al. Metrics reloaded: recommendations for image analysis validation. Nature Methods, pages 1–18, 2024.
  • Medawar et al. (2021) Evelyn Medawar, Ronja Thieleking, Iryna Manuilova, Maria Paerisch, Arno Villringer, A Veronica Witte, and Frauke Beyer. Estimating the effect of a scanner upgrade on measures of grey matter structure for longitudinal designs. PLOS One, 16(10):e0239021, 2021.
  • Nárai et al. (2022) Ádám Nárai, Petra Hermann, Tibor Auer, Péter Kemenczky, János Szalma, István Homolya, Eszter Somogyi, Pál Vakli, Béla Weiss, and Zoltán Vidnyánszky. Movement-related artefacts (mr-art) dataset of matched motion-corrupted and clean structural mri brain scans. Scientific Data, 9(1):630, 2022.
  • Nguyen et al. (2023) Duy Minh Ho Nguyen, Tan Ngoc Pham, Nghiem Tuong Diep, Nghi Quoc Phan, Quang Pham, Vinh Tong, Binh T Nguyen, Ngan Hoang Le, Nhat Ho, Pengtao Xie, et al. On the out of distribution robustness of foundation models in medical image segmentation. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023): R0-FoMo: Robustness of Few-shot and Zero-shot Learning in Large Foundation Models, 2023.
  • Noguchi et al. (2020) Shunjiro Noguchi, Mizuho Nishio, Masahiro Yakami, Keita Nakagomi, and Kaori Togashi. Bone segmentation on whole-body ct using convolutional neural network with novel data augmentation techniques. Computers in Biology and Medicine, 121:103767, 2020.
  • Omigbodun et al. (2019) Akinyinka O Omigbodun, Frederic Noo, Michael McNitt-Gray, William Hsu, and Scott S Hsieh. The effects of physics-based data augmentation on the generalizability of deep neural networks: Demonstration on nodule false-positive reduction. Medical Physics, 46(10):4563–4574, 2019.
  • Padilla et al. (2020) Rafael Padilla, Sergio L Netto, and Eduardo AB Da Silva. A survey on performance metrics for object-detection algorithms. In 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), pages 237–242. IEEE, 2020.
  • Padilla et al. (2021) Rafael Padilla, Wesley L Passos, Thadeu LB Dias, Sergio L Netto, and Eduardo AB Da Silva. A comparative analysis of object detection metrics with a companion open-source toolkit. Electronics, 10(3):279, 2021.
  • Peyrin and Engelke (2021) Françoise Peyrin and Klaus Engelke. Ct imaging: Basics and new trends. In Handbook of Particle Detection and Imaging, pages 1173–1215. Springer, 2021.
  • Potvin et al. (2019) Olivier Potvin, April Khademi, Isabelle Chouinard, Farnaz Farokhian, Louis Dieumegarde, Ilana Leppert, Rick Hoge, Maria Natasha Rajah, Pierre Bellec, Simon Duchesne, et al. Measurement variability following mri system upgrade. Frontiers in Neurology, 10:726, 2019.
  • Prior et al. (2020) Fred Prior, J Almeida, P Kathiravelu, T Kurc, K Smith, Thomas J Fitzgerald, and J Saltz. Open access image repositories: high-quality data to enable machine learning research. Clinical Radiology, 75(1):7–12, 2020.
  • Quiñonero-Candela et al. (2008) Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset shift in machine learning. MIT Press, 2008.
  • Ramachandran and Lakshminarayanan (1971) GN Ramachandran and AV Lakshminarayanan. Three-dimensional reconstruction from radiographs and electron micrographs: application of convolutions instead of fourier transforms. Proceedings of the National Academy of Sciences, 68(9):2236–2240, 1971.
  • Ren et al. (2020) He Ren, Lingxiao Zhou, Gang Liu, Xueqing Peng, Weiya Shi, Huilin Xu, Fei Shan, and Lei Liu. An unsupervised semi-automated pulmonary nodule segmentation method based on enhanced region growing. Quantitative Imaging in Medicine and Surgery, 10(1):233, 2020.
  • Rubin (1992) LI Rubin. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60:259–265, 1992.
  • Sanaat et al. (2022) Amirhossein Sanaat, Isaac Shiri, Sohrab Ferdowsi, Hossein Arabi, and Habib Zaidi. Robust-deep: a method for increasing brain imaging datasets to improve deep learning models’ performance and robustness. Journal of Digital Imaging, 35(3):469–481, 2022.
  • Shaikhina and Khovanova (2017) Torgyn Shaikhina and Natalia A Khovanova. Handling limited datasets with neural networks in medical applications: A small-data approach. Artificial Intelligence in Medicine, 75:51–63, 2017.
  • Shaw et al. (2020) Richard Shaw, Carole H Sudre, Thomas Varsavsky, Sébastien Ourselin, and M Jorge Cardoso. A k-space model of movement artefacts: application to segmentation augmentation and artefact removal. IEEE Transactions on Medical Imaging, 39(9):2881–2892, 2020.
  • Singla et al. (2022) Rohit Singla, Cailin Ringstrom, Ricky Hu, Victoria Lessoway, Janice Reid, Robert Rohling, and Christophe Nguan. Speckle and shadows: ultrasound-specific physics-based data augmentation for kidney segmentation. In International Conference on Medical Imaging with Deep Learning, pages 1139–1148. PMLR, 2022.
  • Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • Van Aarle et al. (2016) Wim Van Aarle, Willem Jan Palenstijn, Jeroen Cant, Eline Janssens, Folkert Bleichrodt, Andrei Dabravolski, Jan De Beenhouwer, K Joost Batenburg, and Jan Sijbers. Fast and flexible x-ray tomography using the astra toolbox. Optics Express, 24(22):25129–25147, 2016.
  • van der Ham et al. (2022) Guus van der Ham, Rudolfs Latisenko, Michail Tsiaousis, and Gijs van Tulder. Generating artificial artifacts for motion artifact detection in chest ct. In International Workshop on Simulation and Synthesis in Medical Imaging, pages 12–23. Springer, 2022.
  • Vasiliuk et al. (2023a) Anton Vasiliuk, Daria Frolova, Mikhail Belyaev, and Boris Shirokikh. Limitations of out-of-distribution detection in 3d medical image segmentation. Journal of Imaging, 9(9):191, 2023a.
  • Vasiliuk et al. (2023b) Anton Vasiliuk, Daria Frolova, Mikhail Belyaev, and Boris Shirokikh. Redesigning out-of-distribution detection on 3d medical images. In International Workshop on Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, pages 126–135. Springer, 2023b.
  • Vayena et al. (2018) Effy Vayena, Alessandro Blasimme, and I Glenn Cohen. Machine learning in medicine: addressing ethical challenges. PLOS Medicine, 15(11):e1002689, 2018.
  • Waheed et al. (2020) Abdul Waheed, Muskan Goyal, Deepak Gupta, Ashish Khanna, Fadi Al-Turjman, and Plácido Rogerio Pinheiro. Covidgan: data augmentation using auxiliary classifier gan for improved covid-19 detection. IEEE Access, 8:91916–91923, 2020.
  • Won Kim and Kim (2014) Chang Won Kim and Jong Hyo Kim. Realistic simulation of reduced-dose ct with noise modeling and sinogram synthesis using dicom ct images. Medical Physics, 41(1):011901, 2014.
  • Yang (2023) Dong Yang. MONAI and nnU-Net integration, 2023. URL https://github.com/Project-MONAI/tutorials/tree/main/nnunet.
  • Yang et al. (2023) Wei Yang, Zecheng Cai, Xiaoyin Liu, Wenqi Yuan, Rong Ma, Zhen Chen, Jianqun Zhang, Peng Wu, and Zhaohui Ge. Imaging study of the effect of postural changes on the retroperitoneal oblique corridor in degenerative lumbar scoliosis. European Spine Journal, pages 1–7, 2023.
  • Yu et al. (2022) Cenji Yu, Chidinma P Anakwenze, Yao Zhao, Rachael M Martin, Ethan B Ludmir, Joshua S. Niedzielski, Asad Qureshi, Prajnan Das, Emma B Holliday, Ann C Raldow, et al. Multi-organ segmentation of abdominal structures from non-contrast and contrast enhanced ct images. Scientific Reports, 12(1):19093, 2022.
  • Zhou et al. (2023) Mingyi Zhou, Xiang Gao, Jing Wu, John Grundy, Xiao Chen, Chunyang Chen, and Li Li. Modelobfuscator: Obfuscating model information to protect deployed ml-based systems. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 1005–1017, 2023.
  • Zijdenbos et al. (1994) Alex P Zijdenbos, Benoit M Dawant, Richard A Margolin, and Andrew C Palmer. Morphometric analysis of white matter lesions in mr images: method and validation. IEEE Transactions on Medical Imaging, 13(4):716–724, 1994.
  • Zimmerer et al. (2022) David Zimmerer, Peter M Full, Fabian Isensee, Paul Jäger, Tim Adler, Jens Petersen, Gregor Köhler, Tobias Ross, Annika Reinke, Antanas Kascenas, et al. Mood 2020: A public benchmark for out-of-distribution detection and localization on medical images. IEEE Transactions on Medical Imaging, 41(10):2728–2738, 2022.

Appendix A

Figure 16: Top: Dice score (DSC) against the level of CT noise added by augmentation, for the tested segmentation models. Bottom: mean average precision (mAP) against the level of CT noise added by augmentation, for the tested object detection models, including segmentation models converted to generate bounding boxes. Results are shown for each of the 30 test cases from the LUNA16 dataset. The mean across the test cases is shown in purple, with error bars representing the standard deviation. The results in this plot are summarized in the DSC and mAP results tables.
Figure 17: Dice score (DSC) for the tested segmentation models, shown for each of the 30 test cases from the liver segmentation dataset, with simulated CT streak artifacts of varying severity produced by inserting a cylindrical spinal implant of varying radius. The mean across the test cases is shown in purple, with error bars representing the standard deviation. The results in this plot are summarized in the DSC results table.
Figure 18: Dice score (DSC) produced by each tested model for each of the 30 test cases in the liver segmentation dataset, with simulated sudden patient motion of 10 degrees placed so that the resulting discontinuity lies at varying proximity to the liver ground truth segmentation. The mean Dice score across the test cases is shown in purple, with error bars representing the standard deviation. The results in this plot are summarized in the DSC results table.