-
On Instabilities of Unsupervised Denoising Diffusion Models in Magnetic Resonance Imaging Reconstruction
Authors:
Tianyu Han,
Sven Nebelung,
Firas Khader,
Jakob Nikolas Kather,
Daniel Truhn
Abstract:
Denoising diffusion models offer a promising approach to accelerating magnetic resonance imaging (MRI) and producing diagnostic-level images in an unsupervised manner. However, our study demonstrates that even tiny worst-case potential perturbations transferred from a surrogate model can cause these models to generate fake tissue structures that may mislead clinicians. The transferability of such worst-case perturbations indicates that the robustness of image reconstruction may be compromised due to MR system imperfections or other sources of noise. Moreover, at larger perturbation strengths, diffusion models exhibit Gaussian noise-like artifacts that are distinct from those observed in supervised models and are more challenging to detect. Our results highlight the vulnerability of current state-of-the-art diffusion-based reconstruction models to possible worst-case perturbations and underscore the need for further research to improve their robustness and reliability in clinical settings.
Submitted 23 June, 2024;
originally announced June 2024.
-
Compute-Efficient Medical Image Classification with Softmax-Free Transformers and Sequence Normalization
Authors:
Firas Khader,
Omar S. M. El Nahhas,
Tianyu Han,
Gustav Müller-Franzes,
Sven Nebelung,
Jakob Nikolas Kather,
Daniel Truhn
Abstract:
The Transformer model has been pivotal in advancing fields such as natural language processing, speech recognition, and computer vision. However, a critical limitation of this model is its quadratic computational and memory complexity relative to the sequence length, which constrains its application to longer sequences. This is especially crucial in medical imaging, where high-resolution images can reach gigapixel scale. Efforts to address this issue have predominantly focused on complex techniques, such as decomposing the softmax operation integral to the Transformer's architecture. This paper addresses the quadratic computational complexity of Transformer models and introduces a remarkably simple and effective method that circumvents this issue by eliminating the softmax function from the attention mechanism and adopting a sequence normalization technique for the key, query, and value tokens. Coupled with a reordering of matrix multiplications, this approach reduces the memory and compute complexity to a linear scale. We evaluate this approach across various medical imaging datasets comprising fundoscopic, dermoscopic, radiologic, and histologic imaging data. Our findings highlight that these models exhibit performance comparable to traditional Transformer models while efficiently handling longer sequences.
Submitted 3 June, 2024;
originally announced June 2024.
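The reordering of matrix multiplications mentioned in the abstract above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the exact form of the sequence normalization, so the per-feature standardization across the sequence axis used here, the 1/n scaling, and all function names are assumptions made for illustration. The point it shows is that, once the softmax is removed, (Q Kᵀ) V equals Q (Kᵀ V) by associativity, and the second form never builds the n × n attention matrix.

```python
import numpy as np


def sequence_normalize(x, eps=1e-6):
    """Standardize each feature across the sequence axis.

    Assumed form of "sequence normalization"; the abstract does not
    give the exact definition used in the paper.
    """
    mean = x.mean(axis=0, keepdims=True)
    std = x.std(axis=0, keepdims=True)
    return (x - mean) / (std + eps)


def quadratic_attention(q, k, v):
    """Softmax-free attention that builds the full n x n matrix: O(n^2 d)."""
    n = q.shape[0]
    return (q @ k.T) @ v / n


def linear_attention(q, k, v):
    """Same result with the products reordered: O(n d^2), no n x n matrix."""
    n = q.shape[0]
    return q @ (k.T @ v) / n


rng = np.random.default_rng(0)
n, d = 512, 16  # sequence length and feature dimension (illustrative values)
q, k, v = (sequence_normalize(rng.standard_normal((n, d))) for _ in range(3))

out_quadratic = quadratic_attention(q, k, v)
out_linear = linear_attention(q, k, v)
```

Both functions return the same array up to floating-point error; only the cost differs. The intermediate `k.T @ v` has shape d × d, so memory no longer grows with the square of the sequence length.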
-
MRSegmentator: Robust Multi-Modality Segmentation of 40 Classes in MRI and CT Sequences
Authors:
Hartmut Häntze,
Lina Xu,
Felix J. Dorfner,
Leonhard Donle,
Daniel Truhn,
Hugo Aerts,
Mathias Prokop,
Bram van Ginneken,
Alessa Hering,
Lisa C. Adams,
Keno K. Bressem
Abstract:
Purpose: To introduce a deep learning model capable of multi-organ segmentation in MRI scans, offering a solution to the current limitations in MRI analysis due to challenges in resolution, standardized intensity values, and variability in sequences.
Materials and Methods: The model was trained on 1,200 manually annotated MRI scans from the UK Biobank, 221 in-house MRI scans, and 1,228 CT scans, leveraging cross-modality transfer learning from CT segmentation models. A human-in-the-loop annotation workflow was employed to efficiently create high-quality segmentations. The model's performance was evaluated on the NAKO and AMOS22 datasets, containing 600 and 60 MRI examinations, respectively. Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD) were used to assess segmentation accuracy. The model will be open sourced.
Results: The model showcased high accuracy in segmenting well-defined organs, achieving Dice Similarity Coefficient (DSC) scores of 0.97 for the right and left lungs, and 0.95 for the heart. It also demonstrated robustness in organs like the liver (DSC: 0.96) and kidneys (DSC: 0.95 left, 0.95 right), which present more variability. However, segmentation of smaller and complex structures such as the portal and splenic veins (DSC: 0.54) and adrenal glands (DSC: 0.65 left, 0.61 right) revealed the need for further model optimization.
Conclusion: The proposed model is a robust tool for accurate segmentation of 40 anatomical structures in MRI and CT images. By leveraging cross-modality learning and interactive annotation, the model achieves strong performance and generalizability across diverse datasets, making it a valuable resource for researchers and clinicians. It is open source and can be downloaded from https://github.com/hhaentze/MRSegmentator.
Submitted 13 May, 2024; v1 submitted 10 May, 2024;
originally announced May 2024.
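For readers unfamiliar with the Dice Similarity Coefficient reported in the MRSegmentator abstract above, the following is a minimal sketch of the standard definition, DSC = 2|A ∩ B| / (|A| + |B|), for two binary masks. The function name and the toy masks are illustrative assumptions and are not taken from the MRSegmentator code base; returning 1.0 when both masks are empty is one common convention, not necessarily the one the authors used.

```python
import numpy as np


def dice_coefficient(pred, target):
    """Dice Similarity Coefficient between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Both masks empty: treated as perfect agreement (a convention).
        return 1.0
    overlap = np.logical_and(pred, target).sum()
    return 2.0 * overlap / denom


# Toy 2 x 3 masks: 3 foreground voxels each, 2 of them shared.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]
dsc = dice_coefficient(pred_mask, true_mask)  # 2 * 2 / (3 + 3)
```

A DSC of 1 means perfect overlap and 0 means none, which is why small structures such as vessels or adrenal glands, where a few misplaced voxels form a large share of the mask, tend to score lower.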
-
Autonomous Artificial Intelligence Agents for Clinical Decision Making in Oncology
Authors:
Dyke Ferber,
Omar S. M. El Nahhas,
Georg Wölflein,
Isabella C. Wiest,
Jan Clusmann,
Marie-Elisabeth Leßman,
Sebastian Foersch,
Jacqueline Lammert,
Maximilian Tschochohei,
Dirk Jäger,
Manuel Salto-Tellez,
Nikolaus Schultz,
Daniel Truhn,
Jakob Nikolas Kather
Abstract:
Multimodal artificial intelligence (AI) systems have the potential to enhance clinical decision-making by interpreting various types of medical data. However, the effectiveness of these models across all medical fields is uncertain. Each discipline presents unique challenges that need to be addressed for optimal performance. This complexity is further increased when attempting to integrate different fields into a single model. Here, we introduce an alternative approach to multimodal medical AI that utilizes the generalist capabilities of a large language model (LLM) as a central reasoning engine. This engine autonomously coordinates and deploys a set of specialized medical AI tools. These tools include text, radiology and histopathology image interpretation, genomic data processing, web searches, and document retrieval from medical guidelines. We validate our system across a series of clinical oncology scenarios that closely resemble typical patient care workflows. We show that the system has a high capability in employing appropriate tools (97%), drawing correct conclusions (93.6%), and providing complete (94%) and helpful (89.2%) recommendations for individual patient cases while consistently referencing relevant literature (82.5%) upon instruction. This work provides evidence that LLMs can effectively plan and execute domain-specific models to retrieve or synthesize new information when used as autonomous agents. This enables them to function as specialist, patient-tailored clinical assistants. It also simplifies regulatory compliance by allowing each component tool to be individually validated and approved. We believe that our work can serve as a proof of concept for more advanced LLM agents in the medical domain.
Submitted 6 April, 2024;
originally announced April 2024.
-
In-context learning enables multimodal large language models to classify cancer pathology images
Authors:
Dyke Ferber,
Georg Wölflein,
Isabella C. Wiest,
Marta Ligero,
Srividhya Sainath,
Narmin Ghaffari Laleh,
Omar S. M. El Nahhas,
Gustav Müller-Franzes,
Dirk Jäger,
Daniel Truhn,
Jakob Nikolas Kather
Abstract:
Medical image classification requires labeled, task-specific datasets which are used to train deep learning networks de novo, or to fine-tune foundation models. However, this process is computationally and technically demanding. In language processing, in-context learning provides an alternative, where models learn from within prompts, bypassing the need for parameter updates. Yet, in-context learning remains underexplored in medical image analysis. Here, we systematically evaluate the model Generative Pretrained Transformer 4 with Vision capabilities (GPT-4V) on cancer image processing with in-context learning on three cancer histopathology tasks of high importance: classification of tissue subtypes in colorectal cancer, colon polyp subtyping, and breast tumor detection in lymph node sections. Our results show that in-context learning is sufficient to match or even outperform specialized neural networks trained for particular tasks, while only requiring a minimal number of samples. In summary, this study demonstrates that large vision language models trained on non-domain-specific data can be applied out of the box to solve medical image-processing tasks in histopathology. This democratizes access to generalist AI models for medical experts without a technical background, especially in areas where annotated data is scarce.
Submitted 12 March, 2024;
originally announced March 2024.
-
Joint multi-task learning improves weakly-supervised biomarker prediction in computational pathology
Authors:
Omar S. M. El Nahhas,
Georg Wölflein,
Marta Ligero,
Tim Lenz,
Marko van Treeck,
Firas Khader,
Daniel Truhn,
Jakob Nikolas Kather
Abstract:
Deep Learning (DL) can predict biomarkers directly from digitized cancer histology in a weakly-supervised setting. Recently, the prediction of continuous biomarkers through regression-based DL has seen increasing interest. Nonetheless, clinical decision making often requires a categorical outcome. Consequently, we developed a weakly-supervised joint multi-task Transformer architecture which has been trained and evaluated on four public patient cohorts for the prediction of two key predictive biomarkers, microsatellite instability (MSI) and homologous recombination deficiency (HRD), trained with auxiliary regression tasks related to the tumor microenvironment. Moreover, we perform a comprehensive benchmark of 16 approaches of task balancing for weakly-supervised joint multi-task learning in computational pathology. Using our novel approach, we improve over the state-of-the-art area under the receiver operating characteristic by +7.7% and +4.1%, as well as yielding better clustering of latent embeddings by +8% and +5% for the prediction of MSI and HRD in external cohorts, respectively.
Submitted 6 March, 2024;
originally announced March 2024.
-
An Ordinal Regression Framework for a Deep Learning Based Severity Assessment for Chest Radiographs
Authors:
Patrick Wienholt,
Alexander Hermans,
Firas Khader,
Behrus Puladi,
Bastian Leibe,
Christiane Kuhl,
Sven Nebelung,
Daniel Truhn
Abstract:
This study investigates the application of ordinal regression methods for categorizing disease severity in chest radiographs. We propose a framework that divides the ordinal regression problem into three parts: a model, a target function, and a classification function. Different encoding methods, including one-hot, Gaussian, progress-bar, and our soft-progress-bar, are applied using ResNet50 and ViT-B-16 deep learning models. We show that the choice of encoding has a strong impact on performance and that the best encoding depends on the chosen weighting of Cohen's kappa and also on the model architecture used. We make our code publicly available on GitHub.
Submitted 8 February, 2024;
originally announced February 2024.
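The target encodings named in the ordinal regression abstract above can be sketched as follows. This is a hedged illustration, not the authors' code: the formulas below are common textbook forms of one-hot, Gaussian, and progress-bar targets, the `sigma` parameter and the 0.5 decoding threshold are assumptions, and the authors' own soft-progress-bar variant is deliberately omitted because the abstract does not define it. In the abstract's terms, each encoder plays the role of a target function and `decode_progress_bar` plays the role of a classification function.

```python
import numpy as np


def one_hot(k, num_classes):
    """1 at the true class index, 0 elsewhere (ignores class order)."""
    target = np.zeros(num_classes)
    target[k] = 1.0
    return target


def gaussian(k, num_classes, sigma=1.0):
    """Probability mass centred on class k, decaying with ordinal distance.

    sigma is an assumed width parameter.
    """
    idx = np.arange(num_classes)
    target = np.exp(-0.5 * ((idx - k) / sigma) ** 2)
    return target / target.sum()


def progress_bar(k, num_classes):
    """num_classes - 1 binary outputs; the first k are 'filled'."""
    return (np.arange(num_classes - 1) < k).astype(float)


def decode_progress_bar(outputs, threshold=0.5):
    """Map predicted bar values back to a class by counting filled slots."""
    return int((np.asarray(outputs) > threshold).sum())


severity = 2  # true class out of 5 ordered severity grades (toy example)
targets = {
    "one_hot": one_hot(severity, 5),
    "gaussian": gaussian(severity, 5),
    "progress_bar": progress_bar(severity, 5),
}
predicted_class = decode_progress_bar([0.9, 0.7, 0.2, 0.1])
```

Unlike one-hot targets, the Gaussian and progress-bar encodings penalize a prediction that is off by one grade less than one that is off by three, which is the property that makes them candidates for ordinal problems.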
-
LongHealth: A Question Answering Benchmark with Long Clinical Documents
Authors:
Lisa Adams,
Felix Busch,
Tianyu Han,
Jean-Baptiste Excoffier,
Matthieu Ortala,
Alexander Löser,
Hugo JWL. Aerts,
Jakob Nikolas Kather,
Daniel Truhn,
Keno Bressem
Abstract:
Background: Recent advancements in large language models (LLMs) offer potential benefits in healthcare, particularly in processing extensive patient records. However, existing benchmarks do not fully assess LLMs' capability in handling real-world, lengthy clinical data.
Methods: We present the LongHealth benchmark, comprising 20 detailed fictional patient cases across various diseases, with each case containing 5,090 to 6,754 words. The benchmark comprises 400 multiple-choice questions in three categories (information extraction, negation, and sorting) that challenge LLMs to extract and interpret information from large clinical documents.
Results: We evaluated nine open-source LLMs with a minimum context length of 16,000 tokens and also included OpenAI's proprietary and cost-efficient GPT-3.5 Turbo for comparison. The highest accuracy was observed for Mixtral-8x7B-Instruct-v0.1, particularly in tasks focused on information retrieval from single and multiple patient documents. However, all models struggled significantly in tasks requiring the identification of missing information, highlighting a critical area for improvement in clinical data interpretation.
Conclusion: While LLMs show considerable potential for processing long clinical documents, their current accuracy levels are insufficient for reliable clinical use, especially in scenarios requiring the identification of missing information. The LongHealth benchmark provides a more realistic assessment of LLMs in a healthcare setting and highlights the need for further model refinement for safe and effective clinical application.
We make the benchmark and evaluation code publicly available.
Submitted 25 January, 2024;
originally announced January 2024.
-
From Whole-slide Image to Biomarker Prediction: A Protocol for End-to-End Deep Learning in Computational Pathology
Authors:
Omar S. M. El Nahhas,
Marko van Treeck,
Georg Wölflein,
Michaela Unger,
Marta Ligero,
Tim Lenz,
Sophia J. Wagner,
Katherine J. Hewitt,
Firas Khader,
Sebastian Foersch,
Daniel Truhn,
Jakob Nikolas Kather
Abstract:
Hematoxylin and eosin (H&E)-stained whole-slide images (WSIs) are the foundation of cancer diagnosis. In recent years, the development of deep learning-based methods in computational pathology has enabled the prediction of biomarkers directly from WSIs. However, accurately linking tissue phenotype to biomarkers at scale remains a crucial challenge for democratizing complex biomarkers in precision oncology. This protocol describes a practical workflow for solid tumor associative modeling in pathology (STAMP), enabling prediction of biomarkers directly from WSIs using deep learning. The STAMP workflow is biomarker agnostic and allows for genetic and clinicopathologic tabular data to be included as an additional input, together with histopathology images. The protocol consists of five main stages which have been successfully applied to various research problems: formal problem definition, data preprocessing, modeling, evaluation, and clinical translation. The STAMP workflow differentiates itself through its focus on serving as a collaborative framework that can be used by clinicians and engineers alike for setting up research projects in the field of computational pathology. As an example task, we applied STAMP to the prediction of microsatellite instability (MSI) status in colorectal cancer, showing accurate performance for the identification of MSI-high tumors. Moreover, we provide an open-source codebase which has been deployed at several hospitals across the globe to set up computational pathology workflows. The STAMP workflow requires one workday of hands-on computational execution and basic command line knowledge.
Submitted 18 December, 2023;
originally announced December 2023.
-
From Text to Image: Exploring GPT-4Vision's Potential in Advanced Radiological Analysis across Subspecialties
Authors:
Felix Busch,
Tianyu Han,
Marcus Makowski,
Daniel Truhn,
Keno Bressem,
Lisa Adams
Abstract:
The study evaluates and compares GPT-4 and GPT-4Vision for radiological tasks, suggesting GPT-4Vision may recognize radiological features from images, thereby enhancing its diagnostic potential over text-based descriptions.
Submitted 24 November, 2023;
originally announced November 2023.
-
Benchmarking Pathology Feature Extractors for Whole Slide Image Classification
Authors:
Georg Wölflein,
Dyke Ferber,
Asier R. Meneghetti,
Omar S. M. El Nahhas,
Daniel Truhn,
Zunamys I. Carrero,
David J. Harrison,
Ognjen Arandjelović,
Jakob Nikolas Kather
Abstract:
Weakly supervised whole slide image classification is a key task in computational pathology, which involves predicting a slide-level label from a set of image patches constituting the slide. Constructing models to solve this task involves multiple design choices, often made without robust empirical or conclusive theoretical justification. To address this, we conduct a comprehensive benchmarking of feature extractors to answer three critical questions: 1) Is stain normalisation still a necessary preprocessing step? 2) Which feature extractors are best for downstream slide-level classification? 3) How does magnification affect downstream performance? Our study constitutes the most comprehensive evaluation of publicly available pathology feature extractors to date, involving more than 10,000 training runs across 14 feature extractors, 9 tasks, 5 datasets, 3 downstream architectures, 2 levels of magnification, and various preprocessing setups. Our findings challenge existing assumptions: 1) We observe empirically, and by analysing the latent space, that skipping stain normalisation and image augmentations does not degrade performance, while significantly reducing memory and computational demands. 2) We develop a novel evaluation metric to compare relative downstream performance, and show that the choice of feature extractor is the most consequential factor for downstream performance. 3) We find that lower-magnification slides are sufficient for accurate slide-level classification. Contrary to previous patch-level benchmarking studies, our approach emphasises clinical relevance by focusing on slide-level biomarker prediction tasks in a weakly supervised setting with external validation cohorts. Our findings stand to streamline digital pathology workflows by minimising preprocessing needs and informing the selection of feature extractors.
Submitted 21 June, 2024; v1 submitted 20 November, 2023;
originally announced November 2023.
-
Time-efficient combined morphologic and quantitative joint MRI based on clinical image contrasts -- An exploratory in-situ study of standardized cartilage defects
Authors:
Teresa Lemainque,
Nicola Pridöhl,
Shuo Zhang,
Marc Huppertz,
Manuel Post,
Can Yüksel,
Masami Yoneyama,
Andreas Prescher,
Christiane Kuhl,
Daniel Truhn,
Sven Nebelung
Abstract:
OBJECTIVES: Quantitative MRI techniques such as T2 and T1$ρ$ mapping are beneficial in evaluating cartilage and meniscus. We aimed to evaluate the MIXTURE (Multi-Interleaved X-prepared Turbo-Spin Echo with IntUitive RElaxometry) sequences that provide morphologic images with clinical turbo spin-echo (TSE) contrasts and additional parameter maps versus reference TSE sequences in an in-situ model of human cartilage defects.
MATERIALS AND METHODS: Prospectively, standardized cartilage defects of 8mm, 5mm, and 3mm diameter were created in the lateral femora of 10 human cadaveric knee specimens (81$\pm$10 years, nine male/one female). Using a clinical 3T MRI scanner and knee coil, MIXTURE sequences combining (i) proton-density weighted fat-saturated (PD-w FS) images and T2 maps and (ii) T1-weighted images and T1$ρ$ maps were acquired before and after defect creation, alongside the corresponding 2D TSE and 3D TSE reference sequences. Defect delineability, bone texture, and cartilage relaxation times were quantified. Inter-sequence comparisons were made using appropriate parametric and non-parametric tests.
RESULTS: Overall, defect delineability and texture features were not significantly different between the MIXTURE and reference sequences. After defect creation, relaxation times increased significantly in the central femur (for T2) and all regions combined (for T1$ρ$).
CONCLUSION: MIXTURE sequences permit time-efficient simultaneous morphologic and quantitative joint assessment based on clinical image contrasts. While providing T2 or T1$ρ$ maps in clinically feasible scan time, morphologic image features, i.e., cartilage defect delineability and bone texture, were comparable between MIXTURE and corresponding reference sequences.
Submitted 14 November, 2023;
originally announced November 2023.
-
Two for One -- Combined Morphologic and Quantitative Knee Joint MRI Using a Versatile Turbo Spin-Echo Platform
Authors:
Teresa Lemainque,
Nicola Pridoehl,
Marc Huppertz,
Manuel Post,
Can Yüksel,
Karl Ludger Radke,
Shuo Zhang,
Masami Yoneyama,
Andreas Prescher,
Christiane Kuhl,
Daniel Truhn,
Sven Nebelung
Abstract:
Introduction: Quantitative MRI techniques such as T2 and T1$ρ$ mapping are beneficial in evaluating knee joint pathologies; however, long acquisition times limit their clinical adoption. MIXTURE (Multi-Interleaved X-prepared Turbo-Spin Echo with IntUitive RElaxometry) provides a versatile turbo spin-echo (TSE) sequence platform for simultaneous morphologic and quantitative joint imaging yet lacks comparative evaluation in basic and translational research contexts.
Methods: Two MIXTURE sequences were designed along clinical requirements: (i) MIX1, combining proton density (PD)-weighted fat-saturated (FS) images and quantitative T2 mapping (acquisition time: 4:59 min), and (ii) MIX2, combining T1-weighted images with quantitative T1$ρ$ mapping (6:38 min). MIXTURE sequences and their reference 2D and 3D TSE counterparts were acquired from ten human cadaveric knee joints using a clinical 3T MRI scanner and knee coil. Contrast, contrast-to-noise ratios, and coefficients of variation were comparatively evaluated using parametric tests. Clinical radiologists (n=3) assessed diagnostic quality as a function of sequence and anatomic structure using 5-point Likert scales and ordinal regression. The significance level was set to α=0.01.
Results: MIX1 and MIX2 had at least equal diagnostic quality compared to the 2D and 3D TSE sequences of the same image weighting. Contrast, contrast-to-noise ratios, and coefficients of variation were largely similar for the PD-weighted FS and T1-weighted images.
Discussion: In clinically feasible scan times, the MIXTURE sequence platform yields (i) morphologic images of diagnostic quality and adjustable TSE-based contrasts and (ii) quantitative parameter mapping with additional insights on soft tissue composition and ultrastructure.
Submitted 2 November, 2023; v1 submitted 31 October, 2023;
originally announced October 2023.
-
On the Impact of Cross-Domain Data on German Language Models
Authors:
Amin Dada,
Aokun Chen,
Cheng Peng,
Kaleb E Smith,
Ahmad Idrissi-Yaghir,
Constantin Marc Seibold,
Jianning Li,
Lars Heiliger,
Xi Yang,
Christoph M. Friedrich,
Daniel Truhn,
Jan Egger,
Jiang Bian,
Jens Kleesiek,
Yonghui Wu
Abstract:
Traditionally, large language models have been trained either on general web crawls or on domain-specific data. However, recent successes of generative large language models have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. Through training a series of models ranging between 122M and 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements up to $4.45\%$ over the previous state-of-the-art. The models are available at https://huggingface.co/ikim-uk-essen
Submitted 13 October, 2023; v1 submitted 11 October, 2023;
originally announced October 2023.
-
Mind the Gap: Federated Learning Broadens Domain Generalization in Diagnostic AI Models
Authors:
Soroosh Tayebi Arasteh,
Christiane Kuhl,
Marwin-Jonathan Saehn,
Peter Isfort,
Daniel Truhn,
Sven Nebelung
Abstract:
Developing robust artificial intelligence (AI) models that generalize well to unseen datasets is challenging and usually requires large and variable datasets, preferably from multiple institutions. In federated learning (FL), a model is trained collaboratively at numerous sites that hold local datasets without exchanging them. So far, the impact of training strategy, i.e., local versus collaborative, on the diagnostic on-domain and off-domain performance of AI models interpreting chest radiographs has not been assessed. Consequently, using 610,000 chest radiographs from five institutions across the globe, we assessed diagnostic performance as a function of training strategy (i.e., local vs. collaborative), network architecture (i.e., convolutional vs. transformer-based), generalization performance (i.e., on-domain vs. off-domain), imaging finding (i.e., cardiomegaly, pleural effusion, pneumonia, atelectasis, consolidation, pneumothorax, and no abnormality), dataset size (i.e., from n=18,000 to 213,921 radiographs), and dataset diversity. Large datasets not only showed minimal performance gains with FL but, in some instances, even exhibited decreases. In contrast, smaller datasets revealed marked improvements. Thus, on-domain performance was mainly driven by training data size. However, off-domain performance leaned more on training diversity. When trained collaboratively across diverse external institutions, AI models consistently surpassed models trained locally for off-domain tasks, emphasizing FL's potential in leveraging data diversity. In conclusion, FL can bolster diagnostic privacy, reproducibility, and off-domain reliability of AI models and, potentially, optimize healthcare outcomes.
Submitted 19 December, 2023; v1 submitted 1 October, 2023;
originally announced October 2023.
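The collaborative training described in the federated learning abstract above, where sites keep their data local and only model parameters are combined, can be sketched with federated averaging (FedAvg). FedAvg is one common aggregation scheme and is used here only as an illustration; the abstract does not state which aggregation rule the authors applied, and the function name, weight vectors, and site sizes below are hypothetical.

```python
import numpy as np


def federated_average(site_weights, site_sizes):
    """Combine per-site model weights, weighted by local dataset size.

    Only the weight arrays are shared; the training data never leaves
    the sites. This is the FedAvg rule, assumed here for illustration.
    """
    sizes = np.asarray(site_sizes, dtype=float)
    coeffs = sizes / sizes.sum()
    return sum(c * w for c, w in zip(coeffs, site_weights))


# Three hypothetical sites, each holding a locally trained weight vector.
site_weights = [
    np.array([1.0, 2.0]),
    np.array([3.0, 4.0]),
    np.array([5.0, 6.0]),
]
site_sizes = [100, 100, 200]  # number of local training samples per site

# Coefficients are 0.25, 0.25, 0.5, so the result is [3.5, 4.5].
global_weights = federated_average(site_weights, site_sizes)
```

In a full training loop this aggregation step would be repeated each round: the averaged weights are sent back to every site, trained further on local data, and averaged again.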
-
Reconstruction of Patient-Specific Confounders in AI-based Radiologic Image Interpretation using Generative Pretraining
Authors:
Tianyu Han,
Laura Žigutytė,
Luisa Huck,
Marc Huppertz,
Robert Siepmann,
Yossi Gandelsman,
Christian Blüthgen,
Firas Khader,
Christiane Kuhl,
Sven Nebelung,
Jakob Kather,
Daniel Truhn
Abstract:
Detecting misleading patterns in automated diagnostic assistance systems, such as those powered by Artificial Intelligence, is critical to ensuring their reliability, particularly in healthcare. Current techniques for evaluating deep learning models cannot visualize confounding factors at a diagnostic level. Here, we propose a self-conditioned diffusion model termed DiffChest and train it on a dataset of 515,704 chest radiographs from 194,956 patients from multiple healthcare centers in the United States and Europe. DiffChest explains classifications on a patient-specific level and visualizes the confounding factors that may mislead the model. We found high inter-reader agreement when evaluating DiffChest's capability to identify treatment-related confounders, with Fleiss' Kappa values of 0.8 or higher across most imaging findings. Confounders were accurately captured with 11.1% to 100% prevalence rates. Furthermore, our pretraining process optimized the model to capture the most relevant information from the input radiographs. DiffChest achieved excellent diagnostic accuracy when diagnosing 11 chest conditions, such as pleural effusion and cardiac insufficiency, and at least sufficient diagnostic accuracy for the remaining conditions. Our findings highlight the potential of pretraining based on diffusion models in medical image classification, specifically in providing insights into confounding factors and model robustness.
Submitted 29 September, 2023;
originally announced September 2023.
-
Medical Foundation Models are Susceptible to Targeted Misinformation Attacks
Authors:
Tianyu Han,
Sven Nebelung,
Firas Khader,
Tianci Wang,
Gustav Mueller-Franzes,
Christiane Kuhl,
Sebastian Försch,
Jens Kleesiek,
Christoph Haarburger,
Keno K. Bressem,
Jakob Nikolas Kather,
Daniel Truhn
Abstract:
Large language models (LLMs) have broad medical knowledge and can reason about medical information across many domains, holding promising potential for diverse medical applications in the near future. In this study, we demonstrate a concerning vulnerability of LLMs in medicine. Through targeted manipulation of just 1.1% of the model's weights, we can deliberately inject an incorrect biomedical fact. The erroneous information is then propagated in the model's output, whilst its performance on other biomedical tasks remains intact. We validate our findings in a set of 1,038 incorrect biomedical facts. This peculiar susceptibility raises serious security and trustworthiness concerns for the application of LLMs in healthcare settings. It accentuates the need for robust protective measures, thorough verification mechanisms, and stringent management of access to these models, ensuring their reliable and safe use in medical practice.
Submitted 29 September, 2023;
originally announced September 2023.
-
Large Language Models Streamline Automated Machine Learning for Clinical Studies
Authors:
Soroosh Tayebi Arasteh,
Tianyu Han,
Mahshad Lotfinia,
Christiane Kuhl,
Jakob Nikolas Kather,
Daniel Truhn,
Sven Nebelung
Abstract:
A knowledge gap persists between machine learning (ML) developers (e.g., data scientists) and practitioners (e.g., clinicians), hampering the full utilization of ML for clinical data analysis. We investigated the potential of the ChatGPT Advanced Data Analysis (ADA), an extension of GPT-4, to bridge this gap and perform ML analyses efficiently. Real-world clinical datasets and study details from large trials across various medical specialties were presented to ChatGPT ADA without specific guidance. ChatGPT ADA autonomously developed state-of-the-art ML models based on the original study's training data to predict clinical outcomes such as cancer development, cancer progression, disease complications, or biomarkers such as pathogenic gene sequences. Following the re-implementation and optimization of the published models, the head-to-head comparison of the ChatGPT ADA-crafted ML models and their respective manually crafted counterparts revealed no significant differences in traditional performance metrics (P>0.071). Strikingly, the ChatGPT ADA-crafted ML models often outperformed their counterparts. In conclusion, ChatGPT ADA offers a promising avenue to democratize ML in medicine by simplifying complex data analyses, yet should enhance, not replace, specialized training and resources, to promote broader applications in medical research and practice.
Submitted 21 February, 2024; v1 submitted 27 August, 2023;
originally announced August 2023.
-
Enhancing Network Initialization for Medical AI Models Using Large-Scale, Unlabeled Natural Images
Authors:
Soroosh Tayebi Arasteh,
Leo Misera,
Jakob Nikolas Kather,
Daniel Truhn,
Sven Nebelung
Abstract:
Pre-training datasets, like ImageNet, have become the gold standard in medical image analysis. However, the emergence of self-supervised learning (SSL), which leverages unlabeled data to learn robust features, presents an opportunity to bypass the intensive labeling process. In this study, we explored if SSL for pre-training on non-medical images can be applied to chest radiographs and how it compares to supervised pre-training on non-medical images and on medical images. We utilized a vision transformer and initialized its weights based on (i) SSL pre-training on natural images (DINOv2), (ii) SL pre-training on natural images (ImageNet dataset), and (iii) SL pre-training on chest radiographs from the MIMIC-CXR database. We tested our approach on over 800,000 chest radiographs from six large global datasets, diagnosing more than 20 different imaging findings. Our SSL pre-training on curated images not only outperformed ImageNet-based pre-training (P<0.001 for all datasets) but, in certain cases, also exceeded SL on the MIMIC-CXR dataset. Our findings suggest that selecting the right pre-training strategy, especially with SSL, can be pivotal for improving artificial intelligence (AI)'s diagnostic accuracy in medical imaging. By demonstrating the promise of SSL in chest radiograph analysis, we underline a transformative shift towards more efficient and accurate AI models in medical imaging.
Submitted 8 February, 2024; v1 submitted 15 August, 2023;
originally announced August 2023.
-
Preserving privacy in domain transfer of medical AI models comes at no performance costs: The integral role of differential privacy
Authors:
Soroosh Tayebi Arasteh,
Mahshad Lotfinia,
Teresa Nolte,
Marwin Saehn,
Peter Isfort,
Christiane Kuhl,
Sven Nebelung,
Georgios Kaissis,
Daniel Truhn
Abstract:
Developing robust and effective artificial intelligence (AI) models in medicine requires access to large amounts of patient data. The use of AI models solely trained on large multi-institutional datasets can help with this, yet the imperative to ensure data privacy remains, particularly as membership inference risks breaching patient confidentiality. As a proposed remedy, we advocate for the integration of differential privacy (DP). We specifically investigate the performance of models trained with DP as compared to models trained without DP on data from institutions that the model had not seen during its training (i.e., external validation) - the situation that is reflective of the clinical use of AI models. By leveraging more than 590,000 chest radiographs from five institutions, we evaluated the efficacy of DP-enhanced domain transfer (DP-DT) in diagnosing cardiomegaly, pleural effusion, pneumonia, atelectasis, and in identifying healthy subjects. We juxtaposed DP-DT with non-DP-DT and examined diagnostic accuracy and demographic fairness using the area under the receiver operating characteristic curve (AUC) as the main metric, as well as accuracy, sensitivity, and specificity. Our results show that DP-DT, even with exceptionally high privacy levels (epsilon around 1), performs comparably to non-DP-DT (P>0.119 across all domains). Furthermore, DP-DT led to marginal AUC differences - less than 1% - for nearly all subgroups, relative to non-DP-DT. Despite consistent evidence suggesting that DP models induce significant performance degradation for on-domain applications, we show that off-domain performance is almost unaffected. Therefore, we ardently advocate for the adoption of DP in training diagnostic medical AI models, given its minimal impact on performance.
Submitted 7 December, 2023; v1 submitted 10 June, 2023;
originally announced June 2023.
-
Transformers for CT Reconstruction From Monoplanar and Biplanar Radiographs
Authors:
Firas Khader,
Gustav Müller-Franzes,
Tianyu Han,
Sven Nebelung,
Christiane Kuhl,
Johannes Stegmaier,
Daniel Truhn
Abstract:
Computed Tomography (CT) scans provide detailed and accurate information of internal structures in the body. They are constructed by sending x-rays through the body from different directions and combining this information into a three-dimensional volume. Such volumes can then be used to diagnose a wide range of conditions and allow for volumetric measurements of organs. In this work, we tackle the problem of reconstructing CT images from biplanar x-rays only. X-rays are widely available and even if the CT reconstructed from these radiographs is not a replacement for a complete CT in the diagnostic setting, it might serve to spare the patients from radiation where a CT is only acquired for rough measurements such as determining organ size. We propose a novel method based on the transformer architecture, by framing the underlying task as a language translation problem. Radiographs and CT images are first embedded into latent quantized codebook vectors using two different autoencoder networks. We then train a GPT model to reconstruct the codebook vectors of the CT image, conditioned on the codebook vectors of the x-rays, and show that this approach leads to realistic-looking images. To encourage further research in this direction, we make our code publicly available on GitHub: XXX.
Submitted 11 May, 2023;
originally announced May 2023.
-
Cascaded Cross-Attention Networks for Data-Efficient Whole-Slide Image Classification Using Transformers
Authors:
Firas Khader,
Jakob Nikolas Kather,
Tianyu Han,
Sven Nebelung,
Christiane Kuhl,
Johannes Stegmaier,
Daniel Truhn
Abstract:
Whole-Slide Imaging allows for the capturing and digitization of high-resolution images of histological specimens. An automated analysis of such images using deep learning models is therefore in high demand. The transformer architecture has been proposed as a possible candidate for effectively leveraging the high-resolution information. Here, the whole-slide image is partitioned into smaller image patches and feature tokens are extracted from these image patches. However, while the conventional transformer allows for a simultaneous processing of a large set of input tokens, the computational demand scales quadratically with the number of input tokens and thus quadratically with the number of image patches. To address this problem we propose a novel cascaded cross-attention network (CCAN) based on the cross-attention mechanism that scales linearly with the number of extracted patches. Our experiments demonstrate that this architecture is at least on par with and even outperforms other attention-based state-of-the-art methods on two public datasets: on the use case of lung cancer (TCGA NSCLC) our model reaches a mean area under the receiver operating characteristic curve (AUC) of 0.970 $\pm$ 0.008 and on renal cancer (TCGA RCC) reaches a mean AUC of 0.985 $\pm$ 0.004. Furthermore, we show that our proposed model is efficient in low-data regimes, making it a promising approach for analyzing whole-slide images in resource-limited settings. To foster research in this direction, we make our code publicly available on GitHub: XXX.
Submitted 11 May, 2023;
originally announced May 2023.
-
Fibroglandular Tissue Segmentation in Breast MRI using Vision Transformers -- A multi-institutional evaluation
Authors:
Gustav Müller-Franzes,
Fritz Müller-Franzes,
Luisa Huck,
Vanessa Raaff,
Eva Kemmer,
Firas Khader,
Soroosh Tayebi Arasteh,
Teresa Nolte,
Jakob Nikolas Kather,
Sven Nebelung,
Christiane Kuhl,
Daniel Truhn
Abstract:
Accurate and automatic segmentation of fibroglandular tissue in breast MRI screening is essential for the quantification of breast density and background parenchymal enhancement. In this retrospective study, we developed and evaluated a transformer-based neural network for breast segmentation (TraBS) in multi-institutional MRI data, and compared its performance to the well established convolutional neural network nnUNet. TraBS and nnUNet were trained and tested on 200 internal and 40 external breast MRI examinations using manual segmentations generated by experienced human readers. Segmentation performance was assessed in terms of the Dice score and the average symmetric surface distance. The Dice score for nnUNet was lower than for TraBS on the internal testset (0.909$\pm$0.069 versus 0.916$\pm$0.067, P<0.001) and on the external testset (0.824$\pm$0.144 versus 0.864$\pm$0.081, P=0.004). Moreover, the average symmetric surface distance was higher (=worse) for nnUNet than for TraBS on the internal (0.657$\pm$2.856 versus 0.548$\pm$2.195, P=0.001) and on the external testset (0.727$\pm$0.620 versus 0.584$\pm$0.413, P=0.03). Our study demonstrates that transformer-based networks improve the quality of fibroglandular tissue segmentation in breast MRI compared to convolutional-based models like nnUNet. These findings might help to enhance the accuracy of breast density and parenchymal enhancement quantification in breast MRI screening.
Submitted 18 April, 2023;
originally announced April 2023.
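The abstract above reports segmentation quality as a Dice score. For readers unfamiliar with the metric, the following minimal sketch computes it for two flattened binary masks as 2|A∩B| / (|A| + |B|). The function name `dice_score`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not details taken from the study.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice overlap between two flattened binary masks (1 = foreground)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks empty: treated as perfect agreement (a convention).
        return 1.0
    return 2.0 * intersection / total


# Hypothetical 4-voxel example: one voxel overlaps, each mask has two.
predicted = [1, 1, 0, 0]
reference = [1, 0, 1, 0]
score = dice_score(predicted, reference)
```

A score of 1 means the predicted and reference masks coincide exactly, while 0 means they share no foreground voxels, so higher values are better.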
-
MedAlpaca -- An Open-Source Collection of Medical Conversational AI Models and Training Data
Authors:
Tianyu Han,
Lisa C. Adams,
Jens-Michalis Papaioannou,
Paul Grundmann,
Tom Oberhauser,
Alexander Löser,
Daniel Truhn,
Keno K. Bressem
Abstract:
As large language models (LLMs) like OpenAI's GPT series continue to make strides, we witness the emergence of artificial intelligence applications in an ever-expanding range of fields. In medicine, these LLMs hold considerable promise for improving medical workflows, diagnostics, patient care, and education. Yet, there is an urgent need for open-source models that can be deployed on-premises to safeguard patient privacy. In our work, we present an innovative dataset consisting of over 160,000 entries, specifically crafted to fine-tune LLMs for effective medical applications. We investigate the impact of fine-tuning these datasets on publicly accessible pre-trained LLMs, and subsequently, we juxtapose the performance of pre-trained-only models against the fine-tuned models concerning the examinations that future medical doctors must pass to achieve certification.
Submitted 4 October, 2023; v1 submitted 14 April, 2023;
originally announced April 2023.
-
AIROGS: Artificial Intelligence for RObust Glaucoma Screening Challenge
Authors:
Coen de Vente,
Koenraad A. Vermeer,
Nicolas Jaccard,
He Wang,
Hongyi Sun,
Firas Khader,
Daniel Truhn,
Temirgali Aimyshev,
Yerkebulan Zhanibekuly,
Tien-Dung Le,
Adrian Galdran,
Miguel Ángel González Ballester,
Gustavo Carneiro,
Devika R G,
Hrishikesh P S,
Densen Puthussery,
Hong Liu,
Zekang Yang,
Satoshi Kondo,
Satoshi Kasai,
Edward Wang,
Ashritha Durvasula,
Jónathan Heras,
Miguel Ángel Zapata,
Teresa Araújo
, et al. (11 additional authors not shown)
Abstract:
The early detection of glaucoma is essential in preventing visual impairment. Artificial intelligence (AI) can be used to analyze color fundus photographs (CFPs) in a cost-effective manner, making glaucoma screening more accessible. While AI models for glaucoma screening from CFPs have shown promising results in laboratory settings, their performance decreases significantly in real-world scenarios due to the presence of out-of-distribution and low-quality images. To address this issue, we propose the Artificial Intelligence for Robust Glaucoma Screening (AIROGS) challenge. This challenge includes a large dataset of around 113,000 images from about 60,000 patients and 500 different screening centers, and encourages the development of algorithms that are robust to ungradable and unexpected input data. We evaluated solutions from 14 teams in this paper, and found that the best teams performed similarly to a set of 20 expert ophthalmologists and optometrists. The highest-scoring team achieved an area under the receiver operating characteristic curve of 0.99 (95% CI: 0.98-0.99) for detecting ungradable images on-the-fly. Additionally, many of the algorithms showed robust performance when tested on three other publicly available datasets. These results demonstrate the feasibility of robust AI-enabled glaucoma screening.
Submitted 10 February, 2023; v1 submitted 3 February, 2023;
originally announced February 2023.
-
Private, fair and accurate: Training large-scale, privacy-preserving AI models in medical imaging
Authors:
Soroosh Tayebi Arasteh,
Alexander Ziller,
Christiane Kuhl,
Marcus Makowski,
Sven Nebelung,
Rickmer Braren,
Daniel Rueckert,
Daniel Truhn,
Georgios Kaissis
Abstract:
Artificial intelligence (AI) models are increasingly used in the medical domain. However, as medical data is highly sensitive, special precautions to ensure its protection are required. The gold standard for privacy preservation is the introduction of differential privacy (DP) to model training. Prior work indicates that DP has negative implications on model accuracy and fairness, which are unacceptable in medicine and represent a main barrier to the widespread use of privacy-preserving techniques. In this work, we evaluated the effect of privacy-preserving training of AI models regarding accuracy and fairness compared to non-private training. For this, we used two datasets: (1) A large dataset (N=193,311) of high-quality clinical chest radiographs, and (2) a dataset (N=1,625) of 3D abdominal computed tomography (CT) images, with the task of classifying the presence of pancreatic ductal adenocarcinoma (PDAC). Both were retrospectively collected and manually labeled by experienced radiologists. We then compared non-private deep convolutional neural networks (CNNs) and privacy-preserving (DP) models with respect to privacy-utility trade-offs measured as area under the receiver-operator-characteristic curve (AUROC), and privacy-fairness trade-offs, measured as Pearson's r or Statistical Parity Difference. We found that, while the privacy-preserving training yielded lower accuracy, it largely did not amplify discrimination against age, sex or co-morbidity. Our study shows that -- under the challenging realistic circumstances of a real-life clinical dataset -- the privacy-preserving training of diagnostic deep learning models is possible with excellent diagnostic accuracy and fairness.
Submitted 16 March, 2024; v1 submitted 3 February, 2023;
originally announced February 2023.
-
Fully transformer-based biomarker prediction from colorectal cancer histology: a large-scale multicentric study
Authors:
Sophia J. Wagner,
Daniel Reisenbüchler,
Nicholas P. West,
Jan Moritz Niehues,
Gregory Patrick Veldhuizen,
Philip Quirke,
Heike I. Grabsch,
Piet A. van den Brandt,
Gordon G. A. Hutchins,
Susan D. Richman,
Tanwei Yuan,
Rupert Langer,
Josien Christina Anna Jenniskens,
Kelly Offermans,
Wolfram Mueller,
Richard Gray,
Stephen B. Gruber,
Joel K. Greenson,
Gad Rennert,
Joseph D. Bonner,
Daniel Schmolze,
Jacqueline A. James,
Maurice B. Loughrey,
Manuel Salto-Tellez,
Hermann Brenner
, et al. (6 additional authors not shown)
Abstract:
Background: Deep learning (DL) can extract predictive and prognostic biomarkers from routine pathology slides in colorectal cancer. For example, a DL test for the diagnosis of microsatellite instability (MSI) in CRC was approved in 2022. Current approaches rely on convolutional neural networks (CNNs). Transformer networks are outperforming CNNs and are replacing them in many applications, but have not been used for biomarker prediction in cancer at a large scale. In addition, most DL approaches have been trained on small patient cohorts, which limits their clinical utility. Methods: In this study, we developed a new fully transformer-based pipeline for end-to-end biomarker prediction from pathology slides. We combine a pre-trained transformer encoder and a transformer network for patch aggregation, capable of yielding single and multi-target prediction at patient level. We train our pipeline on over 9,000 patients from 10 colorectal cancer cohorts. Results: A fully transformer-based approach massively improves the performance, generalizability, data efficiency, and interpretability as compared with current state-of-the-art algorithms. After training on a large multicenter cohort, we achieve a sensitivity of 0.97 with a negative predictive value of 0.99 for MSI prediction on surgical resection specimens. We demonstrate for the first time that resection specimen-only training reaches clinical-grade performance on endoscopic biopsy tissue, solving a long-standing diagnostic problem. Interpretation: A fully transformer-based end-to-end pipeline trained on thousands of pathology slides yields clinical-grade performance for biomarker prediction on surgical resections and biopsies. Our new methods are freely available under an open source license.
Submitted 1 March, 2023; v1 submitted 23 January, 2023;
originally announced January 2023.
-
Medical Diagnosis with Large Scale Multimodal Transformers: Leveraging Diverse Data for More Accurate Diagnosis
Authors:
Firas Khader,
Gustav Mueller-Franzes,
Tianci Wang,
Tianyu Han,
Soroosh Tayebi Arasteh,
Christoph Haarburger,
Johannes Stegmaier,
Keno Bressem,
Christiane Kuhl,
Sven Nebelung,
Jakob Nikolas Kather,
Daniel Truhn
Abstract:
Multimodal deep learning has been used to predict clinical endpoints and diagnoses from clinical routine data. However, these models suffer from scaling issues: they have to learn pairwise interactions between each piece of information in each data type, thereby escalating model complexity beyond manageable scales. This has so far precluded a widespread use of multimodal deep learning. Here, we present a new technical approach of "learnable synergies", in which the model only selects relevant interactions between data modalities and keeps an "internal memory" of relevant data. Our approach is easily scalable and naturally adapts to multimodal data inputs from clinical routine. We demonstrate this approach on three large multimodal datasets from radiology and ophthalmology and show that it outperforms state-of-the-art models in clinically relevant diagnosis tasks. Our new approach is transferable and will allow the application of multimodal deep learning to a broad set of clinically relevant problems.
Submitted 20 December, 2022; v1 submitted 18 December, 2022;
originally announced December 2022.
-
Diffusion Probabilistic Models beat GANs on Medical Images
Authors:
Gustav Müller-Franzes,
Jan Moritz Niehues,
Firas Khader,
Soroosh Tayebi Arasteh,
Christoph Haarburger,
Christiane Kuhl,
Tianci Wang,
Tianyu Han,
Sven Nebelung,
Jakob Nikolas Kather,
Daniel Truhn
Abstract:
The success of Deep Learning applications critically depends on the quality and scale of the underlying training data. Generative adversarial networks (GANs) can generate arbitrarily large datasets, but diversity and fidelity are limited, which has recently been addressed by denoising diffusion probabilistic models (DDPMs) whose superiority has been demonstrated on natural images. In this study, we propose Medfusion, a conditional latent DDPM for medical images. We compare our DDPM-based model against GAN-based models, which constitute the current state-of-the-art in the medical domain. Medfusion was trained and compared with (i) StyleGan-3 on n=101,442 images from the AIROGS challenge dataset to generate fundoscopies with and without glaucoma, (ii) ProGAN on n=191,027 images from the CheXpert dataset to generate radiographs with and without cardiomegaly and (iii) wGAN on n=19,557 images from the CRCMS dataset to generate histopathological images with and without microsatellite stability. In the AIROGS, CRCMS, and CheXpert datasets, Medfusion achieved lower (=better) FID than the GANs (11.63 versus 20.43, 30.03 versus 49.26, and 17.28 versus 84.31). Also, fidelity (precision) and diversity (recall) were higher (=better) for Medfusion in all three datasets. Our study shows that DDPMs are a superior alternative to GANs for image synthesis in the medical domain.
Submitted 14 December, 2022;
originally announced December 2022.
-
Collaborative Training of Medical Artificial Intelligence Models with non-uniform Labels
Authors:
Soroosh Tayebi Arasteh,
Peter Isfort,
Marwin Saehn,
Gustav Mueller-Franzes,
Firas Khader,
Jakob Nikolas Kather,
Christiane Kuhl,
Sven Nebelung,
Daniel Truhn
Abstract:
Due to the rapid advancements in recent years, medical image analysis is largely dominated by deep learning (DL). However, building powerful and robust DL models requires training with large multi-party datasets. While multiple stakeholders have provided publicly available datasets, the ways in which these data are labeled vary widely. For Instance, an institution might provide a dataset of chest…
▽ More
Due to the rapid advancements in recent years, medical image analysis is largely dominated by deep learning (DL). However, building powerful and robust DL models requires training with large multi-party datasets. While multiple stakeholders have provided publicly available datasets, the ways in which these data are labeled vary widely. For Instance, an institution might provide a dataset of chest radiographs containing labels denoting the presence of pneumonia, while another institution might have a focus on determining the presence of metastases in the lung. Training a single AI model utilizing all these data is not feasible with conventional federated learning (FL). This prompts us to propose an extension to the widespread FL process, namely flexible federated learning (FFL) for collaborative training on such data. Using 695,000 chest radiographs from five institutions from across the globe - each with differing labels - we demonstrate that having heterogeneously labeled datasets, FFL-based training leads to significant performance increase compared to conventional FL training, where only the uniformly annotated images are utilized. We believe that our proposed algorithm could accelerate the process of bringing collaborative training methods from research and simulation phase to the real-world applications in healthcare.
Submitted 13 April, 2023; v1 submitted 24 November, 2022;
originally announced November 2022.
-
Medical Diffusion: Denoising Diffusion Probabilistic Models for 3D Medical Image Generation
Authors:
Firas Khader,
Gustav Mueller-Franzes,
Soroosh Tayebi Arasteh,
Tianyu Han,
Christoph Haarburger,
Maximilian Schulze-Hagen,
Philipp Schad,
Sandy Engelhardt,
Bettina Baessler,
Sebastian Foersch,
Johannes Stegmaier,
Christiane Kuhl,
Sven Nebelung,
Jakob Nikolas Kather,
Daniel Truhn
Abstract:
Recent advances in computer vision have shown promising results in image generation. Diffusion probabilistic models in particular have generated realistic images from textual input, as demonstrated by DALL-E 2, Imagen and Stable Diffusion. However, their use in medicine, where image data typically comprises three-dimensional volumes, has not been systematically evaluated. Synthetic images may play a crucial role in privacy-preserving artificial intelligence and can also be used to augment small datasets. Here we show that diffusion probabilistic models can synthesize high-quality medical imaging data, which we show for Magnetic Resonance Images (MRI) and Computed Tomography (CT) images. We provide quantitative measurements of their performance through a reader study with two medical experts who rated the quality of the synthesized images in three categories: realistic image appearance, anatomical correctness and consistency between slices. Furthermore, we demonstrate that synthetic images can be used in self-supervised pre-training and improve the performance of breast segmentation models when data is scarce (Dice score 0.91 vs. 0.95 without vs. with synthetic data). The code is publicly available on GitHub: https://github.com/FirasGit/medicaldiffusion.
Submitted 3 January, 2023; v1 submitted 7 November, 2022;
originally announced November 2022.
-
What Does DALL-E 2 Know About Radiology?
Authors:
Lisa C. Adams,
Felix Busch,
Daniel Truhn,
Marcus R. Makowski,
Hugo JWL. Aerts,
Keno K. Bressem
Abstract:
Generative models such as DALL-E 2 could represent a promising future tool for image generation, augmentation, and manipulation for artificial intelligence research in radiology, provided that these models have sufficient medical domain knowledge. Here we show that DALL-E 2 has learned relevant representations of X-ray images with promising capabilities in terms of zero-shot text-to-image generation of new images, continuation of an image beyond its original boundaries, or removal of elements, while its capabilities for pathology generation or for CT, MRI, and ultrasound images are still limited. The use of generative models for augmenting and generating radiological data thus seems feasible, even if further fine-tuning and adaptation of these models to the respective domain is required beforehand.
Submitted 27 September, 2022;
originally announced September 2022.
-
Image prediction of disease progression by style-based manifold extrapolation
Authors:
Tianyu Han,
Jakob Nikolas Kather,
Federico Pedersoli,
Markus Zimmermann,
Sebastian Keil,
Maximilian Schulze-Hagen,
Marc Terwoelbeck,
Peter Isfort,
Christoph Haarburger,
Fabian Kiessling,
Volkmar Schulz,
Christiane Kuhl,
Sven Nebelung,
Daniel Truhn
Abstract:
Disease-modifying management aims to prevent deterioration and progression of the disease, not just relieve symptoms. Unfortunately, the development of necessary therapies is often hampered by the failure to recognize the presymptomatic disease and limited understanding of disease development. We present a generic solution for this problem by a methodology that allows the prediction of progression risk and morphology in individuals using a latent extrapolation optimization approach. To this end, we combined a regularized generative adversarial network (GAN) and a latent nearest neighbor algorithm for joint optimization to generate plausible images of future time points. We evaluated our method on osteoarthritis (OA) data from a multi-center longitudinal study (the Osteoarthritis Initiative, OAI). With presymptomatic baseline data, our model is generative and significantly outperforms the end-to-end learning model in discriminating the progressive cohort. Two experiments were performed with seven experienced radiologists. When no synthetic follow-up radiographs were provided, our model performed better than all seven radiologists. In cases where the synthetic follow-ups generated by our model were available, the specificity and sensitivity of all readers in discriminating progressors increased from $72.3\%$ to $88.6\%$ and from $42.1\%$ to $51.6\%$, respectively. Our results open up a new possibility of using model-based morphology and risk prediction to make predictions about future disease occurrence, as demonstrated in the example of OA.
Submitted 8 April, 2022; v1 submitted 22 November, 2021;
originally announced November 2021.
-
Advancing diagnostic performance and clinical usability of neural networks via adversarial training and dual batch normalization
Authors:
Tianyu Han,
Sven Nebelung,
Federico Pedersoli,
Markus Zimmermann,
Maximilian Schulze-Hagen,
Michael Ho,
Christoph Haarburger,
Fabian Kiessling,
Christiane Kuhl,
Volkmar Schulz,
Daniel Truhn
Abstract:
Unmasking the decision-making process of machine learning models is essential for implementing diagnostic support systems in clinical practice. Here, we demonstrate that adversarially trained models can significantly enhance the usability of pathology detection as compared to their standard counterparts. We let six experienced radiologists rate the interpretability of saliency maps in datasets of X-rays, computed tomography, and magnetic resonance imaging scans. Significant improvements were found for our adversarial models, which could be further improved by the application of dual batch normalization. Contrary to previous research on adversarially trained models, we found that the accuracy of such models was equal to standard models when sufficiently large datasets and dual batch norm training were used. To ensure transferability, we additionally validated our results on an external test set of 22,433 X-rays. These findings elucidate that different paths for adversarial and real images are needed during training to achieve state-of-the-art results with superior clinical interpretability.
Submitted 25 November, 2020;
originally announced November 2020.
-
An Asymmetric Cycle-Consistency Loss for Dealing with Many-to-One Mappings in Image Translation: A Study on Thigh MR Scans
Authors:
Michael Gadermayr,
Maximilian Tschuchnig,
Laxmi Gupta,
Dorit Merhof,
Nils Krämer,
Daniel Truhn,
Burkhard Gess
Abstract:
Generative adversarial networks using a cycle-consistency loss facilitate unpaired training of image-translation models and thereby exhibit a very high potential in manifold medical applications. However, the fact that images in one domain potentially map to more than one image in another domain (e.g. in case of pathological changes) poses a major challenge for training the networks. In this work, we offer a solution to improve the training process in case of many-to-one mappings by modifying the cycle-consistency loss. We show formally and empirically that the proposed method improves the performance significantly without radically changing the architecture and without increasing the overall complexity. We evaluate our method on thigh MRI scans with the final goal of segmenting the muscle in fat-infiltrated patients' data.
Submitted 11 January, 2021; v1 submitted 23 April, 2020;
originally announced April 2020.
-
Radiomic Feature Stability Analysis based on Probabilistic Segmentations
Authors:
Christoph Haarburger,
Justus Schock,
Daniel Truhn,
Philippe Weitz,
Gustav Mueller-Franzes,
Leon Weninger,
Dorit Merhof
Abstract:
Identifying image features that are robust with respect to segmentation variability and domain shift is a tough challenge in radiomics. So far, this problem has mainly been tackled in test-retest analyses. In this work we analyze radiomics feature stability based on probabilistic segmentations. Based on a public lung cancer dataset, we generate an arbitrary number of plausible segmentations using a Probabilistic U-Net. From these segmentations, we extract a high number of plausible feature vectors for each lung tumor and analyze feature variance with respect to the segmentations. Our results suggest that there are groups of radiomic features that are more (e.g. statistics features) and less (e.g. gray-level size zone matrix features) robust against segmentation variability. Finally, we demonstrate that segmentation variance impacts the performance of a prognostic lung cancer survival model and propose a new and potentially more robust radiomics feature selection workflow.
Submitted 21 January, 2020; v1 submitted 13 October, 2019;
originally announced October 2019.
-
Multi Scale Curriculum CNN for Context-Aware Breast MRI Malignancy Classification
Authors:
Christoph Haarburger,
Michael Baumgartner,
Daniel Truhn,
Mirjam Broeckmann,
Hannah Schneider,
Simone Schrading,
Christiane Kuhl,
Dorit Merhof
Abstract:
Classification of malignancy for breast cancer and other cancer types is usually tackled as an object detection problem: Individual lesions are first localized and then classified with respect to malignancy. However, the drawback of this approach is that abstract features incorporating several lesions and areas that are not labelled as a lesion but contain global medically relevant information are thus disregarded: especially for dynamic contrast-enhanced breast MRI, criteria such as background parenchymal enhancement and location within the breast are important for diagnosis and cannot be captured by object detection approaches properly.
In this work, we propose a 3D CNN and a multi scale curriculum learning strategy to classify malignancy globally based on an MRI of the whole breast. Thus, the global context of the whole breast rather than individual lesions is taken into account. Our proposed approach does not rely on lesion segmentations, which renders the annotation of training data much more effective than in current object detection approaches.
Achieving an AUROC of 0.89, we compare the performance of our approach to Mask R-CNN and Retina U-Net as well as a radiologist. Our performance is on par with approaches that, in contrast to our method, rely on pixelwise segmentations of lesions.
Submitted 17 June, 2019; v1 submitted 14 June, 2019;
originally announced June 2019.
-
Spiral Blurring Correction with Water-Fat Separation for Magnetic Resonance Fingerprinting in the Breast
Authors:
Teresa Nolte,
Nicolas Gross-Weege,
Mariya Doneva,
Peter Koken,
Aaldert Elevelt,
Daniel Truhn,
Christiane Kuhl,
Volkmar Schulz
Abstract:
PURPOSE: Magnetic Resonance Fingerprinting (MRF) with spiral readout enables rapid quantification of tissue relaxation times. However, it is prone to blurring due to off-resonance effects. Hence, fat blurring into adjacent regions might prevent identification of small tumors by their quantitative T1 and T2 values. This study aims to correct for the blurring artifacts, thereby enabling fast quantitative mapping in the female breast.
METHODS: The impact of fat blurring on spiral MRF results was first assessed by simulations. Then, MRF was combined with 3-point Dixon water-fat separation and spiral blurring correction based on conjugate phase reconstruction. The approach was assessed in phantom experiments and compared to Cartesian reference measurements, namely inversion recovery (IR), multi-echo spin echo (MESE) and Cartesian MRF, by normalized root mean square error (NRMSE) and standard deviation (STD) calculations. Feasibility is further demonstrated in-vivo for quantitative breast measurements of 6 healthy female volunteers, age range 24-31 years.
RESULTS: In the phantom experiment, the blurring correction reduced the NRMSE per phantom vial on average from 16% to 8% for T1 and from 18% to 11% for T2 when comparing spiral MRF to IR/MESE sequences. When comparing to Cartesian MRF, the NRMSE reduced from 15% to 8% for T1 and from 12% to 7% for T2. Furthermore, STDs decreased. In-vivo, the blurring correction removed fat bias on T1/T2 from a rim of about 7-8 mm width adjacent to fatty structures.
CONCLUSION: The blurring correction for spiral MRF yields improved quantitative maps in the presence of water and fat.
Submitted 9 May, 2019;
originally announced May 2019.