How we won BraTS 2023 Adult Glioma challenge? Just faking it!
Enhanced Synthetic Data Augmentation and Model Ensemble for brain tumour segmentation

André Ferreira (1, 2, 3; ORCID 0000-0002-9332-0091), Naida Solak (2, 6), Jianning Li (2, 3, 4), Philipp Dammann (7), Jens Kleesiek (3, 4, 5), Victor Alves (1; ORCID 0000-0003-1819-7051), Jan Egger (2, 3, 4, 6)

(1) Center ALGORITMI / LASI, University of Minho, Braga, 4710-057, Portugal
(2) Computer Algorithms for Medicine Laboratory, Graz, Austria
(3) Institute for AI in Medicine (IKIM), University Medicine Essen, Girardetstraße 2, Essen, 45131, Germany
(4) Cancer Research Center Cologne Essen (CCCE), University Medicine Essen, Hufelandstraße 55, Essen, 45147, Germany
(5) German Cancer Consortium (DKTK), Partner Site Essen, Hufelandstraße 55, Essen, 45147, Germany
(6) Institute of Computer Graphics and Vision, Graz University of Technology, Inffeldgasse 16, Graz, 8010, Austria
(7) Department of Neurosurgery and Spine Surgery, University Hospital Essen, Essen, Germany
Email: {id10656}@alunos.uminho.pt

Supported by the University of Minho and the Institute for Artificial Intelligence in Medicine.

Abstract

Deep Learning is the state-of-the-art technology for segmenting brain tumours. However, it requires a large amount of high-quality data, which is difficult to obtain, especially in the medical field. Our solutions therefore address this problem by using unconventional mechanisms for data augmentation. Generative adversarial networks and registration are used to massively increase the number of samples available for training three different deep learning models for brain tumour segmentation, the first task of the BraTS2023 challenge. The first model is the standard nnU-Net, the second is the Swin UNETR and the third is the winning solution of the BraTS 2021 Challenge. The entire pipeline is built on the nnU-Net implementation, except for the generation of the synthetic data. Convolutional networks and transformers are combined in an ensemble so that each can compensate for the other's weaknesses. Using the new metric, our best solution achieves Dice scores of 0.9005, 0.8673, 0.8509 and HD95 of 14.940, 14.467, 17.699 (whole tumour, tumour core and enhancing tumour) on the validation set.

Keywords: Generative adversarial networks, Registration, Synthetic data, Brain tumour segmentation, nnU-Net

1 Introduction

Brain tumours originate from different cell types, mainly from glial cells (astrocytes, oligodendrocytes, microglia, ependymal cells), and are then referred to as gliomas. The World Health Organization (WHO) classifies brain tumours into grades 1 to 4 based on histologic features and molecular parameters. Grade 1 tumours are typically slow-growing and benign, whereas grade 4 tumours, such as glioblastomas (GBMs), are the most aggressive and malignant forms. Indeed, glioblastomas are among the most deadly types of cancer due to their location and invasive growth. Patients diagnosed with glioblastoma now have a median survival of approximately 16 months with standard treatment (radiotherapy and temozolomide). Extensive research to improve diagnosis, characterisation and treatment has reduced the mortality rate of this disease [22], but the mortality rate of GBMs remains high and significant improvements in patient survival have been elusive. Glioma segmentation is a critical step for assessing tumour evolution and treatment efficacy, for survival prediction and for treatment planning. Multiple modalities of MRI scans (T1, T2, T1Gd and FLAIR) are usually used for accurate segmentation of the tumour and its individual regions [28].

MRI is a medical imaging technique that is often used to detect gliomas and to assess their response to treatment [29]. The development of new therapies depends on accurate segmentation. Manual segmentation has been used for this purpose, but it is very time-consuming and suffers from inter- and intra-examiner variability [7, 28]. Efforts have therefore been made to automate this process. Machine learning techniques are the most advanced methods for performing such segmentation, but they have the disadvantage of requiring large, high-quality datasets for training in order to achieve the performance required for clinical purposes [6].

The Brain Tumor Segmentation Challenge (BraTS) [2, 22, 5, 3, 4] provides a large, fully annotated and publicly available dataset for model development and promotes a competition to evaluate the latest state-of-the-art approaches for brain diffuse glioma segmentation. This competition was launched in 2012 and continues to evolve each year, adding more samples and many different tasks. The 2023 competition includes 9 different tasks, the first of which is the traditional segmentation of adult gliomas.

1.1 State-of-the-art

Since the challenge of 2014, deep convolutional networks have been the state-of-the-art for brain tumour segmentation [27, 23]. The most advanced recent strategies are mainly based on deep neural networks (DNNs), due to the rapid development of these tools, the availability of increasingly powerful GPUs and the availability of training data. Most solutions are based on the U-Net [24], which has yielded convincing results. Many architectural changes to the U-Net have been introduced to improve it, e.g., residual connections [14], densely connected layers [21, 32] and attention mechanisms [11].

The winners of the last 6 editions all used DNNs. Kamnitsas et al. [15] (2017 winner) explore Ensembles of Multiple Models and Architectures (EMMA), more specifically the 3D convolutional networks DeepMedic [17, 16], FCN [19] and U-Net [24]. The use of EMMA seems to reduce the influence of the meta-parameters of each model and helps to avoid overfitting.

The winners of the 2020 edition [13] propose, for task 1, the use of the nnU-Net [12], a U-Net based architecture, as a baseline and implement some BraTS-specific optimisations. These optimisations are: optimising regions rather than individual classes, using a bigger batch size (from 2 to 5), applying a more aggressive data augmentation, replacing instance normalisation with batch normalisation (which works better with more aggressive data augmentation), using batch Dice instead of regular Dice, and applying post-processing distinct from the regular nnU-Net.

The winners of the 2021 edition [20] also use the nnU-Net as baseline, with the same BraTS-specific optimisations as [13] plus new ones. Because the dataset grew from the 2020 edition (494 cases) to the 2021 edition (1470 cases), they use a larger network, doubling the number of filters in the encoder part of the nnU-Net and increasing the maximum number of filters in the bottleneck to 512, while keeping the decoder intact. Batch normalisation is replaced by group normalisation as it performs better with small batch sizes, allowing the use of batch size 2 instead of 5. An axial attention decoder was also applied, but it did not improve the results in the 5-fold cross-validation and was not tested in the final phase of the BraTS challenge.

The winners of the 2022 edition [30] achieve the best results by using an ensemble of three different architectures: DeepSeg [31], the improved nnU-Net proposed by [20], and DeepSCAN [21]. The ensemble is created using the Simultaneous Truth and Performance Level Estimation (STAPLE) algorithm. The BraTS-Africa 2023 challenge [1] also uses the STAPLE ensemble of three different models to create the ground truth segmentations.

It is important to note that the same post-processing is applied in all three solutions. As explained in detail in [13], some ground truths contain no ET label, in which case the BraTS evaluation assigns the worst possible results to false positive predictions. To mitigate this scenario, when the number of predicted ET voxels is below a certain threshold, ET is replaced by NCR. A threshold of 200 is used in all three solutions. This year (BraTS2023), however, a new metric is used that requires new post-processing techniques, as will be explained later.
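The ET replacement rule described above can be sketched in a few lines. This is only an illustration, not the code of the cited works: it assumes the pre-2023 label convention (1 = NCR, 2 = ED, 4 = ET), a segmentation flattened to a list of voxel labels, and a hypothetical function name `replace_small_et`.

```python
def replace_small_et(labels, threshold=200, et_label=4, ncr_label=1):
    """Relabel every ET voxel as NCR when fewer than `threshold` ET voxels exist.

    Assumption: `labels` is a flat list of integer voxel labels using the
    pre-2023 BraTS convention (1 = NCR, 2 = ED, 4 = ET).
    """
    et_count = sum(1 for v in labels if v == et_label)
    if 0 < et_count < threshold:
        return [ncr_label if v == et_label else v for v in labels]
    return list(labels)


# Two ET voxels are below the threshold, so they are relabelled as NCR.
small_prediction = [0, 2, 4, 4, 1]
cleaned = replace_small_et(small_prediction)
```

A prediction with 200 or more ET voxels is returned unchanged, so the rule only affects cases where a tiny ET region is likely to be a false positive.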

Our approach consists of solving the problem of brain tumour segmentation by increasing the amount of available data, using two different, non-conventional strategies for data augmentation. Furthermore, convolutional neural networks (CNNs) and transformer-based networks are assumed to complement each other, which is why ensembles of these two distinct architectures are also tested.

2 Methods

The machine used for all these tasks is an IKIM cluster node with 6 NVIDIA RTX 6000 GPUs, 48 GB of VRAM, 1024 GB of RAM, and an AMD EPYC 7402 24-Core Processor.

2.1 Data

The Task 1 dataset consists of 1470 cases, each of which contains 4 modalities (T1, T1Gd, T2 and FLAIR), as can be seen in Figure 1. 1251 cases have the corresponding ground truth, so this subset is used for training, and the remaining 219 cases, which do not have a freely available ground truth, form the subset for evaluation. The test set in this year's (2023) edition contains many more routine clinical mpMRI scans. All MRI scans were pre-processed as follows: they were co-registered to the same anatomical template, interpolated to the same resolution (1 mm³), and skull-stripped. The scans have the shape 240×240×155.

The ground truth of the evaluation subset is hidden from the participants, who can only access the Dice scores and 95% Hausdorff distances through the participation platform. The evaluation is performed by sub-region and not by individual label. The sub-regions are the enhancing tumour (ET), the tumour core (TC), and the whole tumour (WT). This year (BraTS2023), the label value 4 was replaced by the value 3. The naming convention has also changed from the previous challenges.
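As an illustration of evaluation by sub-region rather than by label, the sketch below maps voxel labels to the three regions and computes a Dice score per region. It assumes the usual BraTS 2023 label meaning (1 = NCR, 2 = ED, 3 = ET), which is not spelled out in this section, and works on flat Python lists for simplicity. The names `REGIONS`, `labels_to_regions` and `dice` are hypothetical.

```python
# Assumed BraTS 2023 label meaning: 1 = NCR, 2 = ED, 3 = ET.
REGIONS = {
    "WT": {1, 2, 3},  # whole tumour: every tumour label
    "TC": {1, 3},     # tumour core: NCR + ET
    "ET": {3},        # enhancing tumour
}


def labels_to_regions(labels):
    """Return one binary mask (list of 0/1) per evaluated sub-region."""
    return {
        name: [1 if v in members else 0 for v in labels]
        for name, members in REGIONS.items()
    }


def dice(pred, ref):
    """Dice similarity coefficient between two binary masks of equal length."""
    intersection = sum(p and r for p, r in zip(pred, ref))
    total = sum(pred) + sum(ref)
    # Convention: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total


prediction = labels_to_regions([0, 1, 2, 3, 3])
reference = labels_to_regions([0, 1, 2, 2, 3])
scores = {name: dice(prediction[name], reference[name]) for name in REGIONS}
```

Because the regions are nested (ET inside TC inside WT), a single mislabelled voxel can change several region scores at once, which is why the regions rather than the raw labels are optimised and evaluated.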

2.2 Data augmentation

2.2.1 Data augmentation via registration:

Inspired by the winning solution for the AutoImplant Challenge 2020 [8], a larger dataset was built via registration. Each scan can be registered with any other scan and then warped into the other. The Advanced Normalization Tools package (ANTs, http://stnava.github.io/ANTs/) is used to perform this registration. It is a software package used for normalising data to a template, and it provides different scripts (such as antsRegistrationSyNQuick.sh) that apply different transformations to the images, such as rigid, affine, non-linear and all of them combined. Usually, one (first) image is used as the moving image and the other (second) one as the fixed image, meaning that the moving image is warped into the fixed image space by applying the computed transformation, while the inverse transformation warps the fixed image into the moving image space.

Only the training set was used to create new samples via registration, since it is the only one that contains the ground truth. The transformation and the inverse transformation are computed for each pair of cases and applied to each scan (including the ground truth). After creating a reasonable amount of registered data (23049 samples, i.e. about 92000 MRI scans and 23049 ground truths), all data were converted to integers, as registration produces floating-point values. This process took around 2 weeks. Figure 2 presents a sample.
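The registration of one moving/fixed pair could be scripted roughly as follows. This sketch only builds the command lines and does not run them. The file names and the helper `build_registration_commands` are hypothetical, the choice of the combined rigid + affine + SyN transform (`-t s`) is an assumption, and nearest-neighbour interpolation for the label map is also an assumption (the pipeline described above converts the outputs to integers afterwards instead).

```python
def build_registration_commands(fixed, moving, moving_label, prefix):
    """Build ANTs command lines (argument lists) for one moving/fixed pair.

    All paths are hypothetical placeholders chosen for illustration.
    """
    register = [
        "antsRegistrationSyNQuick.sh",
        "-d", "3",      # image dimensionality
        "-f", fixed,    # fixed (target) image
        "-m", moving,   # moving image
        "-o", prefix,   # prefix of all output files
        "-t", "s",      # assumption: rigid + affine + deformable SyN
    ]
    # Warp the moving label map into the fixed space with the computed transforms.
    warp_label = [
        "antsApplyTransforms",
        "-d", "3",
        "-i", moving_label,
        "-r", fixed,
        "-o", prefix + "label_warped.nii.gz",
        "-n", "NearestNeighbor",  # assumption: keeps label values integer
        "-t", prefix + "1Warp.nii.gz",
        "-t", prefix + "0GenericAffine.mat",
    ]
    return register, warp_label


register_cmd, warp_cmd = build_registration_commands(
    "case_B_t1.nii.gz", "case_A_t1.nii.gz", "case_A_seg.nii.gz", "A_to_B_"
)
print(" ".join(register_cmd))
print(" ".join(warp_cmd))
```

The same transform files would be reused for the other three modalities of the moving case, so that all four scans and the ground truth stay aligned.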

2.2.2 Data augmentation with GANs:

Generative adversarial networks (GANs) are known for their ability to generate realistic data. Ferreira et al. (2022) [9] present an overview of the generation of realistic volumetric data. This systematic review shows that a large number of works use GANs for various tasks, e.g. denoising, classification, segmentation, image translation, reconstruction and others. In many cases, synthetic data generated by GANs is used to increase the amount of data available for training deep learning models, i.e. for data augmentation.

In this work, a GAN (represented in Figure 3, referred to as GliGAN) is trained to generate synthetic tumours that can be randomly inserted into the healthy parts of the brain. Placing synthetic tumours in the provided training set reduces the class imbalance between tumour labels and healthy brain tissue and creates greater variability in tumour properties and locations. The GAN architecture consists of a generator and a discriminator. The generator uses the MONAI (https://monai.io/) implementation of the Swin UNETR [11], with parameters: img_size=(96, 96, 96), in_channels=4, out_channels=1, and feature_size=48. The discriminator is a CNN based on Ferreira et al. (2022) [10], with an increased number of layers (one more 3D convolution at the end, with stride 1, kernel size 3, padding 0 and no spectral normalisation), and a sigmoid before the output.

The input of the generator is created as follows. Each scan is cropped to a 96×96×96 volume centred on the tumour, normalised to [-1,1], corrupted with additive Gaussian noise (mean 0 and standard deviation 1) and normalised to [-1,1] again, resulting in a noisy scan as can be seen in Figure 3 (z). Voxels with a tumour label different from 0 are replaced by Gaussian noise. Neighbouring voxels are also replaced by noise with a given probability, which decreases with the distance from the tumour centre, producing a spherical effect. This probability is computed taking into account the size of the tumour: as bigger tumours might have a greater impact on the surrounding tissue, distant voxels are more likely to be replaced by noise than in cases with smaller tumours. The probability decays with the distance to the centre following equation 1, and the base of the exponential in its denominator varies linearly with the tumour size (equation 2).

$$prob=\frac{83}{exponent^{distance}+82} \tag{1}$$

where the $exponent$ is defined by equation 2

$$exponent=-\frac{0.2}{68}\cdot max\_size+1.1-96\cdot\left(-\frac{0.2}{68}\right) \tag{2}$$

and $max\_size$ is the largest tumour size among the three planes.
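Equations 1 and 2 translate directly into code. The sketch below is a plain transcription for illustration, assuming that the distance and the tumour size are measured in voxels. The function names are hypothetical.

```python
def exponent(max_size):
    """Equation 2: base of the exponential, linear in the largest tumour size."""
    slope = -0.2 / 68
    return slope * max_size + 1.1 - 96 * slope


def replacement_probability(distance, max_size):
    """Equation 1: probability that a voxel at `distance` from the tumour
    centre is replaced by noise."""
    return 83 / (exponent(max_size) ** distance + 82)


# Compare the decay for a small (28) and a large (96) tumour.
for size in (28, 96):
    profile = [replacement_probability(d, size) for d in (0, 20, 40, 60)]
    print(size, [round(p, 3) for p in profile])
```

The base equals 1.3 for `max_size = 28` and 1.1 for `max_size = 96`, and the probability is exactly 1 at the centre. At a distance of 40, the probability is roughly 0.65 for the larger tumour and below 0.01 for the smaller one, which reproduces the intended behaviour that larger tumours affect more distant tissue.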

The tumour centre is calculated considering the first and last slice where the tumour appears in all three planes. This strategy allows for more realistic generation than only adding noise to the tumour or placing a cube of noise, as it is harder to detect the edges of the noise. Only replacing the tumour voxels by noise would not allow the network to learn how to replicate the mass effect of the growing tumour. Square noise has been tested but produces unrealistic results, as shown in Figure 4.
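One possible reading of this centre computation is the midpoint of the tumour's bounding box, sketched below. Taking the midpoint between the first and last slice is an assumption, as is counting the size inclusively. The function name is hypothetical.

```python
def tumour_centre_and_size(voxels):
    """Return the bounding-box centre and the largest extent of a tumour.

    `voxels` is an iterable of (x, y, z) indices with a non-zero tumour label.
    Assumption: the centre is the midpoint between the first and last slice
    containing tumour along each axis.
    """
    voxels = list(voxels)
    firsts = [min(v[axis] for v in voxels) for axis in range(3)]
    lasts = [max(v[axis] for v in voxels) for axis in range(3)]
    centre = tuple((first + last) / 2 for first, last in zip(firsts, lasts))
    max_size = max(last - first + 1 for first, last in zip(firsts, lasts))
    return centre, max_size


centre, max_size = tumour_centre_and_size(
    [(10, 20, 30), (14, 20, 36), (12, 25, 33)]
)
```

The returned `max_size` corresponds to the quantity used in equation 2, and distances to `centre` can be fed into equation 1.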

The loss functions of the generator (G) and discriminator (D) are defined by equations 3 and 4:

$$L_{G}=-\lambda_{1}\mathbb{E}_{y,z}[\log D(G(z|y))]+\lambda_{2}\mathbb{E}_{x,y,z}\left\|x-G(z|y)\right\|_{MAE} \tag{3}$$

$$L_{D}=\mathbb{E}_{y,z}[\log D(G(z|y))]-\mathbb{E}_{x,y}[\log D(x|y)] \tag{4}$$

The training is divided into two steps. The first step uses $\lambda_{1}=1$ and $\lambda_{2}=5$ for 200000 iterations. After performing the first step, it was found that the tissue visible to the generator (voxels without noise) was noisy, but the tumour volume was realistic. To solve this problem, the network was trained for another 1000 epochs, linearly increasing the weight of the mean absolute error (MAE) component ($\lambda_{2}$) and reducing the weight of the adversarial loss ($\lambda_{1}$), where $\lambda_{1}=\frac{1}{\lambda_{2}}$, ending with $\lambda_{2}=100$, i.e., $\lambda_{2}=\frac{100-1}{1000}*epoch+1$.

This allowed for a realistic surrounding tissue and tumour with realistic texture and overall appearance, as shown in Figure 5 (second row). The baseline uses only the MAE component as the loss function.
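The weight schedule of the second training step can be written out as a small helper. This is a sketch of the schedule only: the adversarial and MAE terms are assumed to be computed elsewhere and passed in as plain numbers, and the function names are hypothetical.

```python
def loss_weights(epoch, total_epochs=1000, final_lambda2=100.0):
    """Second-step schedule: lambda2 grows linearly from 1 to `final_lambda2`
    and lambda1 is its reciprocal."""
    lambda2 = (final_lambda2 - 1.0) / total_epochs * epoch + 1.0
    lambda1 = 1.0 / lambda2
    return lambda1, lambda2


def generator_loss(adversarial_term, mae_term, epoch):
    """Equation 3 with precomputed terms.

    adversarial_term stands for E[log D(G(z|y))] and mae_term for the MAE
    between the real and the generated scan (both assumed given).
    """
    lambda1, lambda2 = loss_weights(epoch)
    return -lambda1 * adversarial_term + lambda2 * mae_term


for epoch in (0, 500, 1000):
    print(epoch, loss_weights(epoch))
```

At the end of the schedule the MAE term is weighted 100 and the adversarial term 0.01, so the reconstruction of the visible tissue dominates the late training.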

For the generation process (the new scans with the fake tumour for the segmentation task), two approaches were tested to create two datasets. In the first dataset, for each real case in the training dataset, 30 random labels (from the remaining cases) were selected and randomly placed in a healthy part of the scan. From the resulting set, 23049 cases were randomly selected. This dataset is referred to as G. The second dataset was created using a random label generator, i.e. another GAN based on [10] was trained to generate new synthetic labels. For training this GAN, all labels smaller than or equal to 96×96×96 were cropped and resized to 64×64×64. The synthetic labels are then used as input to the GliGAN. This dataset is referred to as rG. To allow a fair comparison between all data augmentation strategies, 23049 cases were generated for each strategy.

2.2.3 Pre-processing:

Pre-processing is performed by the nnU-Net pipeline. Before the training step, the brain voxels of each scan are normalised using z-score normalisation, keeping the background at zero.
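The normalisation step can be illustrated with a minimal sketch. It is not the nnU-Net implementation: it assumes that the brain mask is simply the set of non-zero voxels of a skull-stripped scan, and it works on a flat list of intensities. The function name is hypothetical.

```python
from statistics import mean, pstdev


def zscore_brain(voxels):
    """Z-score normalise brain voxels and keep background voxels at zero.

    Assumption: background voxels are exactly the zero-valued voxels.
    """
    brain = [v for v in voxels if v != 0]
    if not brain:
        return [0.0] * len(voxels)
    mu = mean(brain)
    sigma = pstdev(brain)
    if sigma == 0:
        sigma = 1.0  # avoid dividing by zero for a constant image
    return [(v - mu) / sigma if v != 0 else 0.0 for v in voxels]


normalised = zscore_brain([0, 2, 4, 0, 6])
```

Computing the statistics only over brain voxels prevents the large zero background from dominating the mean and standard deviation.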

2.2.4 Networks:

Multiple networks were tested to determine which network (or ensemble) provided better results. Each network was implemented in the newer version of the nnU-Net (https://github.com/MIC-DKFZ/nnUNet) to take advantage of the pre-processing and data augmentation pipeline provided by this framework.

Baseline (B):

The fully automated framework nnU-Net [12] was used as a baseline (3D full resolution), without any configuration changes. Figure 6 shows this architecture in detail. The inputs are random patches of shape 128×160×112. The network is trained with a batch size of 5, region-based training and deep supervision.

Swin UNETR (S):

The Swin UNETR [11] is a U-Net like network in which the convolutional encoder is replaced by Swin transformer blocks. This encoder is, therefore, capable of capturing long-range information, as opposed to fully convolutional networks. The Swin Transformer uses shifted windows, allowing the use of high resolution images, as is the case of the BraTS dataset (by having linear computational complexity regarding the image size [18]).

The architecture of the Swin UNETR is described in [11]. The inputs are random patches of shape 128×128×128, and the remaining pre-processing, including regular data augmentation, is applied by the nnU-Net pipeline. Since this network is heavier than the nnU-Net, a batch size of 4 is used. Deep supervision is also used.

We hypothesise that a transformer-based architecture is complementary to the fully convolutional nnU-Net architecture, as found by [33].

2021 winner (L):

The winners of the 2021 edition [20] use the framework provided by nnU-Net and implement some improvements over the nnU-Net solution proposed in 2020 [13]. Isensee et al. [13] propose a more aggressive data augmentation than the one provided by default in the nnU-Net pipeline, and therefore the use of batch normalisation instead of instance normalisation, as this seems to produce better results with very aggressive data augmentation. Since the dataset used for training includes samples generated by our own data augmentation strategies as well as those explained in [13], we believe that batch normalisation gives better results than any other normalisation. Batch Dice is also used for gradient computation instead of sample Dice [13]. In addition, Luu et al. [20] double the number of filters in the encoder and increase the size of the bottleneck to 512, as can be seen in Figure 7. They also use group normalisation, as they claim it is better for small batch sizes. However, we were able to use a batch size of 5, so batch normalisation was preferred.

2.3 Selection criterion

As mentioned by Isensee et al. [13], the BraTS challenge follows a "rank and then aggregate" approach for ranking the proposed solutions, as it is well suited for combining different segmentation metrics (https://zenodo.org/record/3718904). This ranking method works as follows: each participant is ranked for each of the X testing cases; each case includes 3 regions and the metrics used are the Dice similarity coefficient (DSC) and the 95% Hausdorff distance (HD95), which gives X×3×2 rankings per participant; the final ranking is the average of all these rankings normalised by the number of participants, from 0 (best) to 1 (worst). When the ground truth does not contain a specific label, false positives are strongly penalised by assigning the worst possible value for each metric (DSC 0 and HD95 374), whereas the best possible result (DSC 1 and HD95 0) is assigned when the model outputs an empty label.

This year (BraTS2023), two new performance metrics are used, the lesion-based Dice score and the lesion-based Hausdorff distance-95. With these metrics, the evaluation is done at the lesion level rather than at the scan level. These new metrics can be used to assess how well the model can detect multiple tumours in the same case. Therefore, for each case, the DSC and HD95 are calculated for each lesion and averaged per patient. Thus, the results are heavily penalised by the segmentation of non-existent tumours (FP) and the lack of segmentation (FN). FNs are almost impossible to avoid in a post-processing step, but FP can be reduced by using a suitable threshold to remove some segmentations, although this comes with a slight increase in the number of FNs.

Therefore, selecting the solution with the best DSC and/or HD95 is not the best approach for an optimal performance on the BraTS2023 competition. For this choice, an implementation based on the ranking strategy of the BraTS competition is used, in order to select the best solution.
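The "rank and then aggregate" scheme described in this section can be sketched as follows. This is an illustrative reimplementation, not the official BraTS ranking code: the normalisation `(mean rank - 1) / (participants - 1)` and the use of average ranks for ties are assumptions, and the nested-list input layout is hypothetical.

```python
from statistics import mean


def rank_values(values, higher_is_better):
    """Return 1-based ranks (1 = best); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=higher_is_better)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def rank_then_aggregate(dsc, hd95):
    """dsc[p][c][r] and hd95[p][c][r] hold the metric of participant p on
    case c and region r. Returns one score per participant in [0, 1],
    where 0 is best. Requires at least two participants."""
    n_participants = len(dsc)
    n_cases, n_regions = len(dsc[0]), len(dsc[0][0])
    collected = [[] for _ in range(n_participants)]
    for c in range(n_cases):
        for r in range(n_regions):
            for table, higher in ((dsc, True), (hd95, False)):
                column = [table[p][c][r] for p in range(n_participants)]
                for p, rank in enumerate(rank_values(column, higher)):
                    collected[p].append(rank)
    # Each participant now holds cases x regions x 2 ranks.
    return [(mean(r) - 1) / (n_participants - 1) for r in collected]


# Two candidate solutions, one case, three regions (WT, TC, ET).
dsc = [[[0.90, 0.80, 0.70]], [[0.80, 0.70, 0.60]]]
hd95 = [[[5.0, 6.0, 7.0]], [[10.0, 12.0, 14.0]]]
final_scores = rank_then_aggregate(dsc, hd95)
```

In this toy example the first solution wins every individual ranking, so it receives the score 0 and the second receives 1. When used for model selection, the "participants" are the candidate solutions being compared.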

2.4 Ensemble strategy

Sawant et al. [25] claim that the maximum success of ensembles is only achieved when the number of models used is between 60 and 70. However, training and inference of more than 60 models makes the segmentation task too time-consuming and computationally heavy. Therefore, it was decided to train one model for each architecture and for each data augmentation strategy, giving a total of 3×3=9 models. Each model was trained with 5-fold cross-validation, giving a total of 9×5=45 checkpoints.

In order to get the best out of all the models and find the best solution, several ensemble methods were tested:

• Averaging: averaging the probabilities of all 45 checkpoints;
• STAPLE: application of the STAPLE algorithm to each label provided by each network (after averaging the 5-fold probability maps and converting to integers);
• CNN: training a CNN to produce the final labels using the probability maps of each network as input (after averaging the 5-fold probability maps);
• Weighting: similar to the previous method, but instead of convolutions, a learnable parameter is used that learns how much weight each region of each individual probability map of each network should have.
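The averaging strategy, the simplest of the four, can be sketched as follows for one region. The probability maps are flattened to lists for simplicity, and binarising the mean at 0.5 is an assumption about how the averaged probabilities are converted to labels. The function name is hypothetical.

```python
def average_ensemble(prob_maps, cutoff=0.5):
    """Average the probability maps of several checkpoints for one region.

    `prob_maps` is a list with one flat list of voxel probabilities per
    checkpoint. Returns the voxel-wise mean and the binary mask obtained by
    thresholding the mean at `cutoff` (assumed to be 0.5).
    """
    n_maps = len(prob_maps)
    mean_map = [sum(values) / n_maps for values in zip(*prob_maps)]
    mask = [1 if p >= cutoff else 0 for p in mean_map]
    return mean_map, mask


# Two checkpoints disagree on the third voxel; the mean decides.
checkpoints = [
    [0.9, 0.2, 0.6],
    [0.7, 0.4, 0.3],
]
mean_map, mask = average_ensemble(checkpoints)
```

Averaging probabilities rather than voting on hard labels lets a confident checkpoint outweigh an uncertain one, which is one reason this baseline is hard to beat.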

3 Results

We refer to our solutions using the following abbreviations:

• B: Baseline network, explained in section 2.2.4.
• S: Swin UNETR, explained in section 2.2.4.
• L: Architecture based on BraTS2021 winner, explained in section 2.2.4.
• G: Synthetic data generated by the GliGAN, explained in section 2.2.2.
• rG: Synthetic data generated by the random label generator and GliGAN, explained in section 2.2.2.
• R: Synthetic data generated by registration, explained in section 2.2.1.

Since this edition of BraTS introduced a new metric that takes into account the number of tumours in the ground truth and in the predictions, penalising FPs and FNs, Table 1 presents both the legacy (old) metric and the new metric for the training set. Each solution is formally defined as $S_{M}^{DA}$ where $S$ is the solution, $M$ the model and $DA$ the data augmentation strategy. E.g., the solution which uses the baseline network B and the synthetic dataset G is represented as $S_{B}^{G}$. For ensembles, several models and datasets are listed, e.g., $S_{B,S}^{G,rG}$. Table 2 shows the legacy results on the validation set for our final submission, for the ensemble of each data augmentation strategy and for the winners of 2021 [20] and 2022 [30]. Table 3 shows the results (using the new metric) of the online validation platform. This evaluation is performed online as the participants have no access to the ground truth. Only ensembles were submitted to the platform, as they produced the best results and the number of submissions is limited. The results of the 3-model ensembles help to evaluate how good each data augmentation strategy is.

Table 1: Results of the training set. The best results are in bold and the second best underlined. "All" is defined as $S_{B,L,S,B,L,S,B,L,S}^{G,G,G,rG,rG,rG,R,R,R}$.

	Legacy (DSC)				New metric (DSC)
Solutions	WT	TC	ET	Mean	WT	TC	ET	Mean
$S_{B}^{G}$	0.9388	0.9204	0.8833	0.9142	0.8326	0.8690	0.8168	0.8395
$S_{S}^{G}$	0.9378	0.9148	0.8787	0.9104	0.7855	0.8533	0.8010	0.8133
$S_{L}^{G}$	0.9377	0.9183	0.8846	0.9136	0.8332	0.8673	0.8207	0.8404
$S_{B}^{rG}$	0.9405	0.9213	0.8819	0.9146	0.8265	0.8774	0.8224	0.8421
$S_{S}^{rG}$	0.9397	0.9165	0.8797	0.9120	0.7627	0.8569	0.8057	0.8084
$S_{L}^{rG}$	0.9400	0.9188	0.8873	0.9154	0.8525	0.8669	0.8160	0.8451
$S_{B}^{R}$	0.9380	0.9180	0.8742	0.9101	0.8423	0.8731	0.8118	0.8424
$S_{S}^{R}$	0.9357	0.9085	0.8680	0.9041	0.8013	0.8415	0.7889	0.8106
$S_{L}^{R}$	0.9387	0.9139	0.8830	0.9119	0.8401	0.8664	0.8183	0.8416
$S_{B,L,S}^{G,G,G}$	0.9409	0.9211	0.8879	0.9166	0.8526	0.8742	0.8249	0.8505
$S_{B,L,S}^{rG,rG,rG}$	0.9428	0.9215	0.8866	0.9170	0.8583	0.8790	0.8268	0.8547
$S_{B,L,S}^{R,R,R}$	0.9405	0.9186	0.8783	0.9124	0.8531	0.8809	0.8196	0.8512
All	0.9432	0.9229	0.8861	0.9174	0.8663	0.8839	0.8291	0.8598

Table 2: Results of the validation set of our best solution, i.e., "All" $S_{B,L,S,B,L,S,B,L,S}^{G,G,G,rG,rG,rG,R,R,R}$ with thresholds of 250, 150 and 100, and of the winners of 2021 and 2022. The best results are in bold and the second best underlined.

	Legacy DSC				Legacy HD95
Solutions	ET	TC	WT	Mean	ET	TC	WT	Mean
All	0.8464	0.8769	0.9294	0.8842	17.81	11.12	4.26	11.06
$S_{B,L,S}^{rG,rG,rG}$	0.8484	0.8781	0.9286	0.8850	17.75	11.08	4.19	11.00
$S_{B,L,S}^{G,G,G}$	0.8515	0.8761	0.9283	0.8853	19.32	12.70	4.27	12.10
$S_{B,L,S}^{R,R,R}$	0.8318	0.8719	0.9279	0.8772	21.36	13.04	4.61	13.00
2022 [30]	0.8438	0.8753	0.9271	0.8821	17.50	7.53	3.60	9.54
2021 [20]	0.8451	0.8781	0.9275	0.8836	20.73	7.62	3.47	10.61

Table 3: Validation set results computed by the validation platform (new metric). The threshold columns give the threshold values used for WT, TC and ET respectively. The best results are in bold and the second best underlined. "All" is defined as $S_{B,L,S,B,L,S,B,L,S}^{G,G,G,rG,rG,rG,R,R,R}$.

	Thresholds			DSC				HD95
Solutions	WT	TC	ET	WT	TC	ET	Mean	WT	TC	ET	Mean
All	250	150	100	0.9005	0.8673	0.8509	0.8729	14.940	14.467	17.699	15.702
All	1450	150	100	0.9101	0.8673	0.8509	0.8761	11.113	14.467	17.699	14.426
$S_{B,L,S}^{G,G,G}$	100	50	100	0.8867	0.8575	0.8537	0.8660	19.322	18.807	17.321	18.483
$S_{B,L,S}^{G,G,G}$	300	200	200	0.8969	0.8660	0.8528	0.8719	15.798	16.397	19.470	17.222
$S_{B,L,S}^{rG,rG,rG}$	100	50	50	0.8918	0.8565	0.8347	0.8610	17.201	19.352	23.880	20.144
$S_{B,L,S}^{rG,rG,rG}$	250	150	100	0.8972	0.8686	0.8527	0.8728	15.510	14.420	17.643	15.858
$S_{B,L,S}^{R,R,R}$	100	50	50	0.8874	0.8564	0.8269	0.8569	19.637	19.090	23.794	20.840
$S_{B,L,S}^{R,R,R}$	250	150	100	0.8990	0.8606	0.8350	0.8648	15.121	18.275	21.049	18.148

3.1 Qualitative results

Figure 8 presents the predicted segmentations of cases 01774-000, 00521-001 and 00190-000. In the first two cases our models performed poorly (Figure 8, first and second rows). The ET is not detected by our solutions, which leads to a Dice score of 0 for this label. This could be due to the quality of the acquired scans, as this strongly influences the performance of deep learning models. As a future improvement, more synthetic data could be generated from these poorer-quality scans. The last row of Figure 8 shows a case with an almost perfect segmentation. Our solution achieved a DSC above 0.9 for most of the validation cases.

4 Discussion

From Table 1 we can conclude that the best solution is the ensemble (average of the probability maps of each model, rounded to an integer). However, if we compare the legacy results with the results of the new metric, we can also deduce that our solution is heavily penalised by the existence of FPs and FNs. Therefore, post-processing based on a threshold is performed to remove small predicted tumours that are not actually tumours. For this purpose, several values were tested for each region (WT, TC and ET). It was found that the best thresholds for the training set are $WT_{250}TC_{100}ET_{50}$ (for DSC) and $WT_{250}TC_{50}ET_{50}$ (for HD95). However, on the validation set, the best values are $WT_{250}TC_{150}ET_{100}$ (for both DSC and HD95).
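A threshold-based clean-up of this kind can be sketched as the removal of small connected components from one region mask. This is a hedged illustration, not the post-processing code used for the submission: the use of 6-connectivity, the per-component interpretation of the threshold, and the representation of a mask as a set of voxel indices are all assumptions, and the function name is hypothetical.

```python
from collections import deque

# Assumption: voxels are connected through shared faces (6-connectivity).
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def remove_small_components(voxels, min_size):
    """Keep only connected components with at least `min_size` voxels.

    `voxels` is a set of (x, y, z) foreground indices of one region (e.g. WT).
    """
    remaining = set(voxels)
    kept = set()
    while remaining:
        seed = remaining.pop()
        component = {seed}
        queue = deque([seed])
        # Breadth-first search collects every voxel connected to the seed.
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in NEIGHBOURS:
                neighbour = (x + dx, y + dy, z + dz)
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        if len(component) >= min_size:
            kept |= component
    return kept


large_lesion = {(x, 0, 0) for x in range(5)}
spurious_voxel = {(10, 10, 10)}
cleaned_mask = remove_small_components(large_lesion | spurious_voxel, min_size=3)
```

Raising `min_size` removes more false positive lesions but also risks deleting small true lesions, which is the FP/FN trade-off that motivates tuning the threshold per region.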

Table 2 compares the legacy results of our solution with the winners of 2021 and 2022. The results are very similar. However, our solution achieves better Dice scores but worse HD95, especially for the tumour core. This may be related to the threshold values used or to the fact that, since our models were trained with a larger variety of data, the predictions might vary more, which increases the HD95. Regarding only our solutions, it can be seen that the ensemble $S_{B,L,S}^{G,G,G}$ achieves the best mean DSC, and that $S_{B,L,S}^{rG,rG,rG}$ achieves the best HD95. The ensemble $S_{B,L,S,B,L,S}^{G,G,G,rG,rG,rG}$ might produce better results than our final submission, but this was not tested due to lack of time.

Several other combinations of ensembles and thresholds were tested on the validation set, and it was consistently found that the validation set required larger thresholds than the training set. It was also found that disabling the data augmentation option of the nnU-Net during inference yielded better results on the validation set, which also allows the use of more models for the ensemble, as "fast" inference is up to 8 times faster than normal inference.

Table 3 shows the extent of the influence of large threshold values. Two different sets of thresholds are given for each ensemble. The first is the one with the best results on the training set (using the ranking system) and the second is an adjustment made after analysing the results provided by the platform for the first. It can be seen that larger thresholds are required for the validation set compared to the training set. However, it is important to note that we cannot increase the threshold too much, as this could increase the number of FNs. The second row of the table shows the best overall results; however, a value of 1450 voxels for the WT threshold might be too high for the test set. Therefore, the first solution is chosen for the test phase.

We can also conclude that the data augmentation with GliGAN gives the best DSC and HD95 for the ET. Furthermore, the ensemble of all available models gives the best solution using the ranking system (as explained in [13]) on both the training and validation sets.

The other ensemble strategies were also tested but produced worse results than regular averaging, so they are not presented here. While these strategies have the potential to improve the results, since the focus of our solution is on using synthetic data, most of the effort has been spent on improving the synthetic data, rather than improving the ensemble strategy.

The major limitation of the GliGAN is that it is only capable of generating tumours within a volume of 96×96×96. This could be solved by using a generator with a greater input size. Since the generator is a Swin UNETR, an input image greater than 96×96×96 could be used [26], but this was not tested.

For future improvements of our solution: the ensemble $S_{B,L,S,B,L,S}^{G,G,G,rG,rG,rG}$ should be tested on the validation set; the cases that yield the worst results should be identified and more synthetic data produced based on those cases; and the synthetic data should be used to train other networks.

Acknowledgements

André Ferreira thanks the Fundação para a Ciência e Tecnologia (FCT) Portugal for the grant 2022.11928.BD. This work has been supported by FCT within the R&D Units Project Scope: UIDB/00319/2020 and this work received funding from enFaced (FWF KLI 678), enFaced 2.0 (FWF KLI 1044) and KITE (Plattform für KI-Translation Essen, EFRE-0801977) from the REACT-EU initiative (https://kite.ikim.nrw/).

References

[1] Adewole, M., Rudie, J.D., Gbadamosi, A., Toyobo, O., Raymond, C., Zhang, D., Omidiji, O., Akinola, R., Suwaid, M.A., Emegoakor, A., et al.: The brain tumor segmentation (brats) challenge 2023: Glioma segmentation in sub-saharan africa patient population (brats-africa). arXiv preprint arXiv:2305.19369 (2023)
[2] Baid, U., Ghodasara, S., Mohan, S., Bilello, M., Calabrese, E., Colak, E., Farahani, K., Kalpathy-Cramer, J., Kitamura, F.C., Pati, S., et al.: The rsna-asnr-miccai brats 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv preprint arXiv:2107.02314 (2021)
[3] Bakas, S., Akbari, H., Sotiras, A., Bilello, M., Rozycki, M., Kirby, J., Freymann, J., Farahani, K., Davatzikos, C.: Segmentation labels and radiomic features for the pre-operative scans of the tcga-gbm collection. The cancer imaging archive (2017). https://doi.org/10.7937/K9/TCIA.2017.KLXWJJ1Q
[4] Bakas, S., Akbari, H., Sotiras, A., Bilello, M., Rozycki, M., Kirby, J., Freymann, J., Farahani, K., Davatzikos, C.: Segmentation labels and radiomic features for the pre-operative scans of the tcga-lgg collection. The cancer imaging archive (2017). https://doi.org/10.7937/K9/TCIA.2017.GJQ7R0EF
[5] Bakas, S., Akbari, H., Sotiras, A., Bilello, M., Rozycki, M., Kirby, J.S., Freymann, J.B., Farahani, K., Davatzikos, C.: Advancing the cancer genome atlas glioma mri collections with expert segmentation labels and radiomic features. Scientific data 4(1), 1–13 (2017). https://doi.org/10.1038/sdata.2017.117
[6] Egger, J., Gsaxner, C., Pepe, A., Pomykala, K.L., Jonske, F., Kurz, M., Li, J., Kleesiek, J.: Medical deep learning—a systematic meta-review. Computer methods and programs in biomedicine 221, 106874 (2022)
[7] Egger, J., Kapur, T., Fedorov, A., Pieper, S., Miller, J.V., Veeraraghavan, H., Freisleben, B., Golby, A.J., Nimsky, C., Kikinis, R.: Gbm volumetry using the 3d slicer medical image computing platform. Scientific reports 3(1), 1364 (2013)
[8] Ellis, D.G., Aizenberg, M.R.: Deep learning using augmentation via registration: 1st place solution to the autoimplant 2020 challenge. In: Towards the Automatization of Cranial Implant Design in Cranioplasty: First Challenge, AutoImplant 2020, Held in Conjunction with MICCAI 2020, Lima, Peru, October 8, 2020, Proceedings 1. pp. 47–55. Springer (2020)
[9] Ferreira, A., Li, J., Pomykala, K.L., Kleesiek, J., Alves, V., Egger, J.: Gan-based generation of realistic 3d data: A systematic review and taxonomy. arXiv preprint arXiv:2207.01390 (2022)
[10] Ferreira, A., Magalhães, R., Mériaux, S., Alves, V.: Generation of synthetic rat brain mri scans with a 3d enhanced alpha generative adversarial network. Applied Sciences 12(10), 4844 (2022)
[11] Hatamizadeh, A., Nath, V., Tang, Y., Yang, D., Roth, H.R., Xu, D.: Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images. In: International MICCAI Brainlesion Workshop. pp. 272–284. Springer (2021)
[12] Isensee, F., Jaeger, P.F., Kohl, S.A., Petersen, J., Maier-Hein, K.H.: nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods 18(2), 203–211 (2021)
[13] Isensee, F., Jäger, P.F., Full, P.M., Vollmuth, P., Maier-Hein, K.H.: nnu-net for brain tumor segmentation. In: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 6th International Workshop, BrainLes 2020, Held in Conjunction with MICCAI 2020, Lima, Peru, October 4, 2020, Revised Selected Papers, Part II 6. pp. 118–132. Springer (2021)
[14] Jiang, Z., Ding, C., Liu, M., Tao, D.: Two-stage cascaded u-net: 1st place solution to brats challenge 2019 segmentation task. In: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 5th International Workshop, BrainLes 2019, Held in Conjunction with MICCAI 2019, Shenzhen, China, October 17, 2019, Revised Selected Papers, Part I 5. pp. 231–241. Springer (2020)
[15] Kamnitsas, K., Bai, W., Ferrante, E., McDonagh, S., Sinclair, M., Pawlowski, N., Rajchl, M., Lee, M., Kainz, B., Rueckert, D., et al.: Ensembles of multiple models and architectures for robust brain tumour segmentation. In: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: Third International Workshop, BrainLes 2017, Held in Conjunction with MICCAI 2017, Quebec City, QC, Canada, September 14, 2017, Revised Selected Papers 3. pp. 450–462. Springer (2018)
[16] Kamnitsas, K., Chen, L., Ledig, C., Rueckert, D., Glocker, B., et al.: Multi-scale 3d convolutional neural networks for lesion segmentation in brain mri. Ischemic stroke lesion segmentation 13, 46 (2015)
[17] Kamnitsas, K., Ledig, C., Newcombe, V.F., Simpson, J.P., Kane, A.D., Menon, D.K., Rueckert, D., Glocker, B.: Efficient multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation. Medical image analysis 36, 61–78 (2017)
[18] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 10012–10022 (2021)
[19] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3431–3440 (2015)
[20] Luu, H.M., Park, S.H.: Extending nn-unet for brain tumor segmentation. In: International MICCAI Brainlesion Workshop. pp. 173–186. Springer (2021)
[21] McKinley, R., Meier, R., Wiest, R.: Ensembles of densely-connected cnns with label-uncertainty for brain tumor segmentation. In: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 4th International Workshop, BrainLes 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Revised Selected Papers, Part II 4. pp. 456–465. Springer (2019)
[22] Menze, B.H., Jakab, A., Bauer, S., Kalpathy-Cramer, J., Farahani, K., Kirby, J., Burren, Y., Porz, N., Slotboom, J., Wiest, R., et al.: The multimodal brain tumor image segmentation benchmark (brats). IEEE transactions on medical imaging 34(10), 1993–2024 (2014). https://doi.org/10.1109/TMI.2014.2377694
[23] Pereira, S., Pinto, A., Alves, V., Silva, C.A.: Brain tumor segmentation using convolutional neural networks in mri images. IEEE transactions on medical imaging 35(5), 1240–1251 (2016)
[24] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. pp. 234–241. Springer (2015)
[25] Sawant, S., Erick, F., Schmidkonz, C., Ramming, A., Lang, E., Wittenberg, T., Götz, T.I., et al.: Comparing ensemble methods combined with different aggregating models using micrograph cell segmentation as an initial application example. Journal of Pathology Informatics 14, 100304 (2023)
[26] Tang, Y., Yang, D., Li, W., Roth, H.R., Landman, B., Xu, D., Nath, V., Hatamizadeh, A.: Self-supervised pre-training of swin transformers for 3d medical image analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20730–20740 (2022)
[27] Urban, G., Bendszus, M., Hamprecht, F., Kleesiek, J.: Multi-modal brain tumor segmentation using deep convolutional neural networks. MICCAI BraTS (brain tumor segmentation) challenge. Proceedings, winning contribution pp. 31–35 (2014)
[28] Visser, M., Müller, D., van Duijn, R., Smits, M., Verburg, N., Hendriks, E., Nabuurs, R., Bot, J., Eijgelaar, R., Witte, M., et al.: Inter-rater agreement in glioma segmentations on longitudinal mri. NeuroImage: Clinical 22, 101727 (2019)
[29] Vollmuth, P., Pflüger, I., Petersen, J., Neuberger, U., Brugnara, G., Schell, M., Keßler, T., Foltyn, M., Harting, I., Sahm, F., et al.: Automated quantitative tumour response assessment of mri in neuro-oncology with artificial neural networks (2019)
[30] Zeineldin, R.A., Karar, M.E., Burgert, O., Mathis-Ullrich, F.: Multimodal cnn networks for brain tumor segmentation in mri: a brats 2022 challenge solution. arXiv preprint arXiv:2212.09310 (2022)
[31] Zeineldin, R.A., Karar, M.E., Coburger, J., Wirtz, C.R., Burgert, O.: Deepseg: deep neural network framework for automatic brain tumor segmentation using magnetic resonance flair images. International journal of computer assisted radiology and surgery 15, 909–920 (2020)
[32] Zhao, Y.X., Zhang, Y.M., Liu, C.L.: Bag of tricks for 3d mri brain tumor segmentation. In: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 5th International Workshop, BrainLes 2019, Held in Conjunction with MICCAI 2019, Shenzhen, China, October 17, 2019, Revised Selected Papers, Part I 5. pp. 210–220. Springer (2020)
[33] Zhou, H.Y., Guo, J., Zhang, Y., Yu, L., Wang, L., Yu, Y.: nnformer: Interleaved transformer for volumetric segmentation. arXiv preprint arXiv:2109.03201 (2021)
