1 Salzburg University of Applied Sciences, email: {firstname,lastname}@fh-salzburg.ac.at
2 University of Salzburg
3 MedPhoton GmbH

Multimodal Learning With Intraoperative CBCT & Variably Aligned Preoperative CT Data To Improve Segmentation

Maximilian E. Tschuchnig (1, 2; ORCID 0000-0002-1441-4752), Philipp Steininger (3), Michael Gadermayr (1; ORCID 0000-0003-1450-9222)
Abstract

Cone-beam computed tomography (CBCT) is an important tool facilitating computer aided interventions, despite often suffering from artifacts that pose challenges for accurate interpretation. While the degraded image quality can affect downstream segmentation, the availability of high quality, preoperative scans represents potential for improvements. Here we consider a setting where preoperative CT and intraoperative CBCT scans are available but the alignment (registration) between the scans is imperfect. We propose a multimodal learning method that fuses roughly aligned CBCT and CT scans and investigate the effect of CBCT quality and misalignment on the final segmentation performance. For that purpose, we make use of a synthetically generated data set containing real CT and synthetic CBCT volumes. As an application scenario, we focus on liver and liver tumor segmentation. We show that the fusion of preoperative CT and simulated, intraoperative CBCT mostly improves segmentation performance (compared to using intraoperative CBCT only) and that even clearly misaligned preoperative data has the potential to improve segmentation performance.

Keywords:
Multimodal Learning Intraoperative Segmentation Radiology

1 Introduction

To establish computer-assisted interventions, precise and reliable imaging, especially intraoperative imaging, is crucial. Mobile robotic medical imaging systems, like cone-beam computed tomography (CBCT) [12], enable intraoperative medical imaging with real time capabilities. CBCT is an imaging method that utilizes a cone-shaped X-ray beam and a flat-panel detector to capture detailed, three-dimensional images of a patient’s anatomy using a mobile system [9]. However, this type of intraoperative imaging often has the disadvantage of suffering from more artifacts than preoperative CT imaging, which affects the performance of downstream tasks like segmentation [15].

While this degraded image quality can affect downstream segmentation, the availability of high quality preoperative scans represents a high potential for improvements based on the idea of multimodal learning [17, 11, 16]. Multimodal learning is an approach that involves fusing images from multiple domains to improve machine learning models for a downstream task like segmentation. In medical imaging, a common approach is to enrich computed tomography (CT) data, focusing on bone structures, with magnetic resonance imaging data for soft tissue analysis [17, 11]. Multimodal learning is typically separated into three fusion strategies [17, 16]: early, late and hybrid. The most common multimodal fusion, early-fusion, combines images of different modalities before they are processed by a downstream model. Typically, the two domains are fused along a dimension additional to the spatial volume dimensions and processed jointly [17, 13]. Another form of early fusion processes the volumes of different modalities in separate feature extraction stages and then fuses the extracted features. Late-fusion is performed before the final layer of the downstream task. In a segmentation case, late-fusion merges the features extracted from multiple independent encoder-decoder networks before the final layer (facilitating segmentation). Hybrid-fusion combines aspects of both early and late fusion for enhanced performance.
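The channel-wise variant of early fusion described above can be illustrated with a minimal sketch. It assumes NumPy arrays of identical spatial shape; the function name `early_fuse` and the toy volume sizes are hypothetical and not taken from the original implementation.

```python
import numpy as np

def early_fuse(cbct, ct):
    """Stack two volumes of shape (D, H, W) into one array of shape (2, D, H, W)."""
    if cbct.shape != ct.shape:
        raise ValueError("volumes must share the same spatial shape")
    # Channel 0 holds the intraoperative CBCT, channel 1 the preoperative CT.
    return np.stack([cbct, ct], axis=0)

# Hypothetical toy volumes; real scans are far larger.
cbct = np.zeros((4, 8, 8), dtype=np.float32)
ct = np.ones((4, 8, 8), dtype=np.float32)
fused = early_fuse(cbct, ct)
print(fused.shape)
```

The additional leading axis plays the role of the channel dimension that a convolutional model consumes, so both modalities are seen jointly from the first layer on.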

Typically, multimodal learning assumes that the fused images are aligned, utilizing affine or even elastic registration. However, registration of 3D samples, if performed accurately at high resolution, is computationally expensive, especially if non-linear deformations need to be compensated. Deep learning-based approaches [2, 4] have shown promising performance while also reducing computational complexity compared to classical, optimization-based approaches. Podobnik et al. [11] integrated registration within their hybrid multimodal segmentation approach. Integrated registration was achieved by merging features from different modality branches using the affine Spatial Transformer Networks (STN) localization net [8], along with a grid generator and sampler.

Here we consider the setting where preoperative CT scans and intraoperative CBCT scans are available but the alignment (registration) between the scans is imperfect. We hypothesised that adding high quality, preoperative CT to intraoperative CBCT scans increases segmentation performance. We further assume that the more accurate the alignment between the modalities is, the more pronounced the performance increase will be.

Contribution: We propose a method that combines roughly aligned CBCT and CT scans (early-fusion) and investigate the effect of CBCT quality and misalignment (based on affine and elastic transformations) on segmentation performance.

In detail, we generated a collection of synthetic CBCT data, focusing on the segmentation of liver and liver tumors based on the LiTS CT dataset [3]. Beyond varying the number of digitally reconstructed radiographs (DRR) used for CBCT synthesis, resulting in 5 different imaging qualities, we generated 9 variably misaligned versions based on linear and non-linear models and 1 baseline model, resulting in 50 sub data sets overall. We evaluated all 45 combinations based on a unet architecture and compared the performances to the 5 unimodal baseline settings.
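The counts in this paragraph can be checked with a short enumeration. The breakdown of the 9 multimodal variants into 4 affine, 4 affine-plus-elastic and 1 perfectly aligned setting is an assumption inferred from the rows of Table 1; the parameter values are those given in the dataset generation section, and all variable names are hypothetical.

```python
from itertools import product

qualities = [490, 256, 128, 64, 32]        # number of DRRs per CBCT
alignment_factors = [1, 0.5, 0.25, 0.125]  # nonzero alignment factors

# Assumed breakdown of the 9 multimodal variants per quality level:
# 4 affine-only, 4 affine followed by elastic, 1 perfectly aligned.
multimodal = (
    [("affine", a) for a in alignment_factors]
    + [("elastic", a) for a in alignment_factors]
    + [("aligned", 0)]
)
baseline = [("cbct_only", None)]

multimodal_runs = list(product(qualities, multimodal))
baseline_runs = list(product(qualities, baseline))
total_runs = len(multimodal_runs) + len(baseline_runs)
print(len(multimodal_runs), len(baseline_runs), total_runs)
```

Under this reading, evaluating every setting for both liver and liver tumor segmentation doubles the total.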

2 Methodology

We propose to fuse intraoperative CBCT with roughly aligned, high quality, preoperative CT to investigate if this kind of additional information improves model training for computer aided intervention systems. To accomplish this study we apply multimodal learning in medical imaging for the downstream task of liver and liver tumor segmentation using CT and CBCT data based on the LiTS dataset [3] (for the CBCT version of LiTS, see the Kaggle dataset "CBCT Liver and Liver Tumor Segmentation Train Data"). Similar to Ren et al. [13] we used early-fusion for the multimodal setup. Furthermore, we investigated two parameters: $\alpha_{np}$, the number of DRRs used to simulate the current CBCT (representing undersampling), and $\alpha_a$, the current alignment factor between the preoperative CT and the intraoperative CBCT. As a baseline, segmentation was performed using only the intraoperative CBCT scans, which was evaluated for all settings of $\alpha_{np}$. Applied to the task of liver and liver tumor segmentation, this setup results in 100 different experiments. Misalignment was performed using affine transformations during training and validation to decrease computational complexity in the dataloader. During validation, which is only performed once for each sample, affine misalignment was followed by elastic misalignment to evaluate the model's ability to generalize to non-linear deformations.

Figure 1: Multimodal model configuration. After fusing the intraoperative CBCT and preoperative CT (early fusion), the data was processed by the shown unet, segmenting liver and liver tumors.

We used a holistic, 3D unet introduced by Çiçek et al. [5] as the basis of our segmentation model and our baseline for all investigated settings. The 3D unet was updated to process the multimodal data by adding a paired and variably misaligned, preoperative CT as a second channel, resulting in a 4D data structure. The 3D unet used for segmentation, shown in Fig. 1, consisted of an encoder with 3 double convolution layers and $3 \times 3 \times 3$ convolutional kernels, connected by 3D max pooling. The latent space was constructed using one double convolution block followed by the unet decoder, mirroring the encoder. As is typical for unet, each double convolutional output in the encoder was also connected to the decoder double convolutional block of the same order. Additionally, one 3D convolutional layer was added to the decoder with a filter size of $1 \times 1 \times 1$ and the number of filters set to the number of segmentation classes (in the case of liver and liver tumor segmentation this value was set to 2). The number of feature maps was set to $\{32, 64, 128, 256\}$ as shown in Fig. 1. Batch norm was applied after each layer in the double convolutional blocks. The model was trained utilizing a sum of binary cross-entropy and Dice similarity. For our baseline, a unimodal unet was used, with only the CBCT as input. Our multimodal approach added the high quality, misaligned and preoperative CT as a second channel to the CBCT.
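The training objective can be sketched in plain Python. The text only states that binary cross-entropy and Dice similarity are summed; that the Dice term enters as one minus a soft Dice score, the smoothing constant `eps`, and the function names are assumptions made for illustration.

```python
import math

def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened voxel probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)

def soft_dice(probs, targets, eps=1e-7):
    """Soft Dice score computed on probabilities instead of binary masks."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return (2.0 * intersection + eps) / (denominator + eps)

def combined_loss(probs, targets):
    """Assumed form of the objective: BCE plus (1 - soft Dice)."""
    return binary_cross_entropy(probs, targets) + (1.0 - soft_dice(probs, targets))

targets = [1, 0, 1, 0]
good = combined_loss([0.9, 0.1, 0.8, 0.2], targets)
bad = combined_loss([0.1, 0.9, 0.2, 0.8], targets)
print(good < bad)
```

In a multi-class setting such as liver and liver tumor, such a loss would typically be evaluated per output channel and then aggregated; how the channels were weighted is not stated in the text.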

2.1 Experimental Details

All models were trained on an Ubuntu server using NVIDIA RTX A6000 graphics cards. Due to the large size of the data and memory restrictions (48 GB VRAM), the volumes were downscaled (isotropic) by a factor of two. To binarize the masks, a threshold of 0.5 was applied to each channel of the unet output. As an additional baseline, CBCT volumes with perfectly aligned CT volumes, $\alpha_a = 0$, were also investigated. All experiments were trained and evaluated 4 times to facilitate stable results, with the same random splits as well as the same random CT misalignments for comparable results. Adam was used as an optimizer with a learning rate of 0.005. The source code for the experiments can be found on https://github.com/blind/blind. Since annotations were not available for the LiTS test dataset, the test dataset was disregarded for this publication and the (processed) LiTS training set was separated into training-validation-testing data [1, 6]. The separation was performed using the ratios 0.7 (training), 0.2 (validation), 0.1 (testing).
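A reproducible split with these ratios could look like the following sketch. The seed value, the rounding rule (flooring, with the remainder assigned to testing) and the assumption that all 131 LiTS training volumes remain after processing are illustrative choices, not details from the original code.

```python
import random

def split_indices(n_volumes, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle volume indices with a fixed seed and cut them by ratio."""
    indices = list(range(n_volumes))
    random.Random(seed).shuffle(indices)  # same seed gives the same split
    n_train = int(ratios[0] * n_volumes)
    n_val = int(ratios[1] * n_volumes)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # remainder goes to testing
    return train, val, test

train, val, test = split_indices(131)
print(len(train), len(val), len(test))
```

Reusing one seed across repeated runs keeps the splits identical, which matches the stated goal of comparable results.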

2.2 Dataset Generation

For evaluation of the multimodal learning approach, paired CT and CBCT volumes with variable degrees of misalignment had to be generated.

Figure 2: Data generation process: after centering the original CT volumes around the liver (using the liver segmentations), $\alpha_{np}$ projections were simulated. Finally, CBCT were simulated and aligned with the original CT and masks, in order to fit them to the CBCT field-of-view.

To generate CBCT/CT pairs we simulated DRRs from the CT volumes, with variable undersampling (with $\alpha_{np} \in \{32, 64, 128, 256, 490\}$ representing the predefined number of DRRs). We then used these DRRs to simulate CBCT with varying visual quality. Higher $\alpha_{np}$ corresponded to better image quality, with $\alpha_{np} = 490$ serving as the CBCT quality similar to preoperative CT and $\alpha_{np} = 32$ as the lowest visual quality with significant artifacts [14]. Fig. 2 shows these steps to convert the CT LiTS dataset into synthetic CBCT scans.
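To give a feeling for the degree of undersampling, the sketch below relates each setting to the densest one (490 projections). The angular spacing additionally assumes evenly spaced projections over a full 360 degree rotation, which the text does not specify; the function names are hypothetical.

```python
FULL_SAMPLING = 490  # densest setting used in the study

def undersampling_factor(n_projections, full=FULL_SAMPLING):
    """How many times fewer projections than the densest setting."""
    return full / n_projections

def angular_step_deg(n_projections, arc_deg=360.0):
    """Angle between neighbouring projections (assumes even spacing over arc_deg)."""
    return arc_deg / n_projections

for n in [490, 256, 128, 64, 32]:
    print(n, round(undersampling_factor(n), 2), round(angular_step_deg(n), 2))
```

Under these assumptions the lowest-quality setting uses roughly fifteen times fewer projections than the highest-quality one.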

The LiTS dataset was chosen to perform the experiments [3]. LiTS consists of 131 abdominal CT scans in the training set and 70 test volumes. The 131 training volumes include segmentations of 1) the liver and 2) liver tumors. The dataset contains data from 7 different institutions with a diverse set of liver tumor diseases. The CT scans were acquired using different CT scanners and acquisition protocols. For further information about the dataset we refer to Bilic et al. [3] (to download the LiTS dataset, follow the link: https://competitions.codalab.org/competitions/17094).

Affine misalignment was performed using random (non-isotropic) scaling, with the scaling parameter sampled from $\mathcal{U}(1 - 0.5 \cdot \alpha_a, 1 + 0.5 \cdot \alpha_a)$, rotation, with the parameter sampled from $\mathcal{U}(-22.5 \cdot \alpha_a, 22.5 \cdot \alpha_a)$, and translation, with the parameter sampled from $\mathcal{U}(0, 0.5 \cdot \alpha_a)$, using tri-linear interpolation. Elastic misalignment was applied with a maximum displacement sampled from $\mathcal{U}(0, 20 \cdot \alpha_a)$ with 7 control points and no locked borders. To reduce the number of parameters controlling misalignment to one, the alignment factor $\alpha_a \in \{0, 0.125, 0.25, 0.5, 1\}$ was introduced, controlling the misalignment between pairs of scans. Augmentation was performed using TorchIO RandomAffine and RandomElasticDeformation.
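How the single factor controls all four distributions can be sketched without any imaging library. The dictionary keys, the function names and the choice of one independent draw per spatial axis for the affine parameters are assumptions; in the original pipeline the resulting ranges would be handed to the TorchIO transforms named above.

```python
import random

def misalignment_ranges(alpha_a):
    """Bounds (low, high) of each uniform distribution for a given alignment factor."""
    return {
        "scale": (1.0 - 0.5 * alpha_a, 1.0 + 0.5 * alpha_a),
        "rotation": (-22.5 * alpha_a, 22.5 * alpha_a),
        "translation": (0.0, 0.5 * alpha_a),
        "elastic_max_displacement": (0.0, 20.0 * alpha_a),
    }

def sample_misalignment(alpha_a, rng):
    """Draw one random misalignment; affine parameters get one value per axis."""
    sample = {}
    for name, (low, high) in misalignment_ranges(alpha_a).items():
        if name == "elastic_max_displacement":
            sample[name] = rng.uniform(low, high)
        else:
            # Assumption: independent draws for the three spatial axes.
            sample[name] = tuple(rng.uniform(low, high) for _ in range(3))
    return sample

rng = random.Random(0)
for alpha_a in [0, 0.125, 0.25, 0.5, 1]:
    print(alpha_a, misalignment_ranges(alpha_a)["scale"])
example = sample_misalignment(0.5, rng)
```

With an alignment factor of 0 every range collapses to the identity transform, which corresponds to the perfectly aligned setting.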

Fig. 3 shows examples of affine misalignment with $\alpha_a \in \{0, 0.125, 0.25, 0.5, 1\}$ and 4 different volumes. Fig. 4 shows the same volumes, with applied random elastic deformation. Here we display the elastic transformation in isolation from affine transformations for improved visibility.

Figure 3: Results of the random affine augmentation with differing augmentation factor $\alpha_a \in \{0, 0.125, 0.25, 0.5, 1\}$ of 4 different volumes.

Figure 4: Results of the random elastic augmentation with differing augmentation factor $\alpha_a \in \{0, 0.125, 0.25, 0.5, 1\}$ of 4 different volumes.

3 Results

Experimental results are shown in Table 1, denoted as mean Dice values. Rows in which the test data was transformed using only affine transformations are marked as affine-s. Rows with affine followed by elastic transformations are marked as elastic-s. Two baselines are also shown: unimodal CBCT with no preoperative CT (base CBCT) and CBCT with perfectly aligned CT (no misalignment). Blue values show improvements compared to the baseline (base CBCT), while red values show decreased scores. If the values are bold there is at least a 5% increase/decrease. The + and - markers correspond to improvements/decreases relative to the previous, less accurately aligned experiments. If the markers are colored, the increase/decrease compared to the previous, less accurately aligned example is larger than 5%. The different undersampling settings $\alpha_{np}$ are displayed horizontally and $\alpha_a$ vertically for both liver and liver tumor segmentation. In the affine setting, the scores increased compared to the baseline in all liver segmentation cases and in 11 out of 20 liver tumor segmentation cases. The biggest improvement for liver segmentation was achieved with the parameters $\alpha_{np} = 32$ and $\alpha_a = 0$, with an improvement from 0.784 to 0.932, an increase of 0.148.
For liver tumor segmentation, the biggest improvement was also achieved with the parameters $\alpha_{np} = 32$ and $\alpha_a = 0$, from 0.029 to 0.297, an increase of 0.268. There was also an observable trend of increasing Dice scores along $\alpha_a \in \{0.5, 0.25, 0.125, 0\}$. This trend was stronger the lower $\alpha_{np}$ was. Adding elastic transformations on top of affine transformations led to slightly lower scores in the cases of $\alpha_a \in \{0.5, 0.25\}$, with $\alpha_a = 0.25$ suffering most from adding elastic transformations, showing an average Dice decrease of 0.023 for liver and 0.032 for liver tumor segmentation. Regarding $\alpha_{np}$, the decrease in average Dice caused by elastic transformations grew as image quality dropped. The largest decrease in Dice attributable to elastic transformation was reached with $\alpha_{np} = 32$, with 0.016 for liver and 0.021 for liver tumor segmentation.
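The counts and differences reported above can be reproduced directly from the affine rows of Table 1. The sketch below copies the mean Dice values from the table; the variable and function names are hypothetical.

```python
# Mean Dice values from Table 1, ordered by number of DRRs: 490, 256, 128, 64, 32.
baseline = {
    "liver": [0.884, 0.884, 0.859, 0.817, 0.784],
    "tumor": [0.165, 0.162, 0.093, 0.061, 0.029],
}
affine = {
    "liver": {
        1: [0.906, 0.897, 0.879, 0.857, 0.806],
        0.5: [0.895, 0.891, 0.875, 0.863, 0.851],
        0.25: [0.904, 0.906, 0.893, 0.883, 0.883],
        0.125: [0.908, 0.900, 0.889, 0.884, 0.887],
    },
    "tumor": {
        1: [0.154, 0.124, 0.057, 0.023, 0.051],
        0.5: [0.096, 0.098, 0.077, 0.056, 0.107],
        0.25: [0.181, 0.160, 0.163, 0.169, 0.171],
        0.125: [0.187, 0.166, 0.166, 0.152, 0.185],
    },
}

def count_improvements(base, rows):
    """Number of settings whose mean Dice exceeds the unimodal baseline."""
    return sum(v > b for row in rows.values() for v, b in zip(row, base))

liver_wins = count_improvements(baseline["liver"], affine["liver"])
tumor_wins = count_improvements(baseline["tumor"], affine["tumor"])
liver_gain = round(0.932 - 0.784, 3)  # perfectly aligned CT, 32 DRRs
tumor_gain = round(0.297 - 0.029, 3)
print(liver_wins, tumor_wins, liver_gain, tumor_gain)
```

The comparison uses the rounded table entries, so ties at three decimals count as no improvement.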

Table 1: Experimental results (affine and elastic misalignment) showing the mean Dice values for liver and liver tumor segmentation; the number in each column header is $\alpha_{np}$. Blue values denote improvements compared to the baseline (base CBCT), while red values show decreased scores. If the values are in bold there is at least a 5% increase/decrease. The + and - markers correspond to increases/decreases compared to the previous, less accurately aligned setting. If a marker is colored (rendered here as ++ or --), this increase/decrease is larger than 5%.

| Setting | Liver 490 | Liver 256 | Liver 128 | Liver 64 | Liver 32 | Tumor 490 | Tumor 256 | Tumor 128 | Tumor 64 | Tumor 32 |
|---|---|---|---|---|---|---|---|---|---|---|
| base CBCT | 0.884 | 0.884 | 0.859 | 0.817 | 0.784 | 0.165 | 0.162 | 0.093 | 0.061 | 0.029 |
| no misalignment | 0.933 + | 0.931 + | 0.931 + | 0.933 + | 0.932 + | 0.330 ++ | 0.322 ++ | 0.298 ++ | 0.325 ++ | 0.297 ++ |
| affine-s1 | 0.906 + | 0.897 + | 0.879 + | 0.857 + | 0.806 + | 0.154 - | 0.124 - | 0.057 - | 0.023 - | 0.051 + |
| affine-s0.5 | 0.895 - | 0.891 - | 0.875 - | 0.863 + | 0.851 + | 0.096 -- | 0.098 - | 0.077 + | 0.056 + | 0.107 ++ |
| affine-s0.25 | 0.904 + | 0.906 + | 0.893 + | 0.883 + | 0.883 + | 0.181 ++ | 0.160 ++ | 0.163 ++ | 0.169 ++ | 0.171 ++ |
| affine-s0.125 | 0.908 + | 0.900 - | 0.889 - | 0.884 + | 0.887 + | 0.187 + | 0.166 + | 0.166 + | 0.152 - | 0.185 + |
| elastic-s1 | 0.907 + | 0.897 + | 0.880 + | 0.857 + | 0.081 + | 0.153 - | 0.122 - | 0.057 - | 0.022 - | 0.049 + |
| elastic-s0.5 | 0.895 - | 0.891 - | 0.870 - | 0.845 - | 0.800 + | 0.098 -- | 0.103 - | 0.085 + | 0.029 + | 0.049 + |
| elastic-s0.25 | 0.893 - | 0.898 + | 0.869 - | 0.853 + | 0.846 + | 0.154 + | 0.142 ++ | 0.136 ++ | 0.129 ++ | 0.122 ++ |
| elastic-s0.125 | 0.910 + | 0.901 + | 0.891 + | 0.885 + | 0.887 + | 0.186 + | 0.168 + | 0.163 + | 0.151 + | 0.168 + |

4 Discussion

The results confirmed the hypothesis that enriching intraoperative CBCT with roughly aligned preoperative CT can improve downstream tasks like segmentation. Most multimodal setups improved downstream performance, with the only outliers observable in liver tumor segmentation with $\alpha_a \in \{0.5, 1\}$ and $\alpha_{np} \in \{490, 256, 128, 64\}$ and with $(\alpha_a, \alpha_{np}) = (0.25, 256)$ in the affine misaligned cases. If the data was additionally misaligned through elastic transformations, the setting $(\alpha_a, \alpha_{np}) = (0.25, 490)$ also led to a decrease in average Dice.

Several trends are observable. For one, the worse the CBCT quality, the more can be gained by adding high quality CT, leading to an increase of Dice from $0.784 \rightarrow 0.932$ for liver and $0.029 \rightarrow 0.297$ for liver tumor segmentation. The effect of adding preoperative information is therefore particularly pronounced for low-quality intraoperative CBCT. It is especially pronounced in the case of liver segmentation, where baseline scores can be reached using $\alpha_{np} = 32$.

We also observe a trend regarding alignment of preoperative CT and intraoperative CBCT. At first, increased alignment led to mostly worse average Dice for $\alpha_a \in \{1, 0.5\}$, followed by a notable increase in average Dice for $\alpha_a \in \{0.25, 0.125, 0\}$. This trend was only observed for liver tumor segmentation. We hypothesise that this is the case due to heavily misaligned data interfering with the training process. This is especially the case for complex structures like tumors. Here, the model can easily learn to ignore the higher quality but heavily misaligned CT but overfit to better aligned CT data. However, further experiments to determine the reason for this should be performed. For now our findings suggest that if there is significant misalignment and the target is difficult, such as liver tumor segmentation, preregistration is crucial for successful multimodal learning.

The addition of elastic misalignment led to a decrease in downstream segmentation performance that scaled with the undersampling $\alpha_{np}$. The same elastic misalignment therefore had more pronounced degrading effects on segmentation performance the worse the image quality got, hinting at a possible implicit registration effect that diminishes with image quality. Regarding the factor $\alpha_a$, elastic misalignment had no effect in the cases of $\alpha_a \in \{1, 0.125\}$, with the average change $< 0.005$. However, in the cases of $\alpha_a \in \{0.5, 0.25\}$ the effect of adding elastic misalignment was on average a decrease of 0.021. These insights hint at the deformation of $\alpha_a = 1$ already being almost unfeasible to handle in the affine case, leaving the addition of elastic transforms in these cases without effect. Since we also observed almost no effect in the cases of $\alpha_a = 0.125$, we suspect that in these cases the elastic transform was negligible in comparison to the affine misalignment. Interestingly, these performance gains show that the unet used for segmentation implicitly learned limited registration.

Although we only evaluated our results on unet, this form of multimodal learning is theoretically applicable to segmentation models based on similar building blocks, most importantly convolutional filters and pooling. Further experiments are needed to investigate this assumption for other architectures, such as the Segment Anything Model [10] or UNETR [7].

5 Conclusion

This study highlights the effectiveness of multimodal learning for downstream tasks, combining roughly aligned intraoperative CBCT with high quality, preoperative CT data. It further investigates how the different factors of volume quality and volume alignment influence the performance of a specific multimodal learning based model. Using the multimodal learning setup, improvements in segmentation accuracy were reported, especially when CBCT volume quality was suboptimal, suggesting that high-quality preoperative CT data can compensate for intraoperative CBCT limitations as long as the data is roughly aligned. Therefore, using this approach, clinicians could be supplied with more reliable information for surgical decision-making, particularly in near real-time settings of computer-assisted intervention. Assuming there is an efficient way for rough preregistration, this underscores the practical applicability of the approach. We further showed that a simple 3D unet was able to learn limited, implicit registration.

Since evaluation was based on synthetic data with relatively simple misalignment and without preregistration, useful next steps would be to evaluate using real, paired preoperative and intraoperative data as well as to incorporate (pre)registration into the multimodal model. Furthermore, continued exploration into multimodal approaches, including late and hybrid fusion, is a promising future research direction.


5.0.1 Acknowledgements

This project was partly funded by the Austrian Research Promotion Agency (FFG) under the bridge project "CIRCUIT: Towards Comprehensive CBCT Imaging Pipelines for Real-time Acquisition, Analysis, Interaction and Visualization" (CIRCUIT), no. 41545455 and by the county of Salzburg under the project AIBIA.

5.0.2 Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.

References

  • [1] Araújo, J.D.L., da Cruz, L.B., Diniz, J.O.B., Ferreira, J.L., Silva, A.C., de Paiva, A.C., Gattass, M.: Liver segmentation from computed tomography images using cascade deep learning. Computers in Biology and Medicine 140, 105095 (2022)
  • [2] Balakrishnan, G., Zhao, A., Sabuncu, M.R., Guttag, J., Dalca, A.V.: Voxelmorph: a learning framework for deformable medical image registration. IEEE transactions on medical imaging 38(8), 1788–1800 (2019)
  • [3] Bilic, P., Christ, P., Li, H., Vorontsov, E., Ben-Cohen, A., Kaissis, G., others, Menze, B.: The liver tumor segmentation benchmark (lits). Medical Image Analysis 84 (Feb 2023). https://doi.org/10.1016/j.media.2022.102680
  • [4] Chen, J., Frey, E.C., He, Y., Segars, W.P., Li, Y., Du, Y.: Transmorph: Transformer for unsupervised medical image registration. Medical image analysis 82, 102615 (2022)
  • [5] Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3d u-net: learning dense volumetric segmentation from sparse annotation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2016: 19th International Conference, Athens, Greece, October 17-21, 2016, Proceedings, Part II 19. pp. 424–432 (2016)
  • [6] Han, K., Liu, L., Song, Y., Liu, Y., Qiu, C., Tang, Y., Teng, Q., Liu, Z.: An effective semi-supervised approach for liver ct image segmentation. IEEE Journal of Biomedical and Health Informatics 26(8), 3999–4007 (2022)
  • [7] Hatamizadeh, A., Tang, Y., Nath, V., Yang, D., Myronenko, A., Landman, B., Roth, H.R., Xu, D.: Unetr: Transformers for 3d medical image segmentation. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 574–584 (2022)
  • [8] Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. Advances in neural information processing systems 28 (2015)
  • [9] Jaffray, D.A., Siewerdsen, J.H., Wong, J.W., Martinez, A.A.: Flat-panel cone-beam computed tomography for image-guided radiation therapy. International Journal of Radiation Oncology* Biology* Physics 53(5), 1337–1349 (2002)
  • [10] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4015–4026 (2023)
  • [11] Podobnik, G., Strojan, P., Peterlin, P., Ibragimov, B., Vrtovec, T.: Multimodal ct and mr segmentation of head and neck organs-at-risk. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 745–755 (2023)
  • [12] Rafferty, M.A., Siewerdsen, J.H., Chan, Y., Daly, M.J., Moseley, D.J., Jaffray, D.A., Irish, J.C.: Intraoperative cone-beam ct for guidance of temporal bone surgery. Otolaryngology—Head and Neck Surgery 134(5), 801–808 (2006)
  • [13] Ren, J., Eriksen, J.G., Nijkamp, J., Korreman, S.S.: Comparing different ct, pet and mri multi-modality image combinations for deep learning-based head and neck tumor segmentation. Acta Oncologica 60(11), 1399–1406 (2021)
  • [14] Tschuchnig, M.E., Coste-Marin, J., Steininger, P., Gadermayr, M.: Multi-task learning to improve semantic segmentation of cbct scans using image reconstruction. In: BVM Workshop. pp. 243–248 (2024)
  • [15] Wei, C., Albrecht, J., Rit, S., Laurendeau, M., Thummerer, A., Corradini, S., others, Landry, G.: Reduction of cone-beam ct artifacts in a robotic cbct device using saddle trajectories with integrated infrared tracking. Medical Physics (2024)
  • [16] Zhang, Y., Yang, J., Tian, J., Shi, Z., Zhong, C., Zhang, Y., He, Z.: Modality-aware mutual learning for multi-modal medical image segmentation. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24. pp. 589–599 (2021)
  • [17] Zhang, Y., Sidibé, D., Morel, O., Mériaudeau, F.: Deep multimodal fusion for semantic image segmentation: A survey. Image and Vision Computing 105, 104042 (2021)