-
Superhuman performance in urology board questions by an explainable large language model enabled for context integration of the European Association of Urology guidelines: the UroBot study
Authors:
Martin J. Hetz,
Nicolas Carl,
Sarah Haggenmüller,
Christoph Wies,
Maurice Stephan Michel,
Frederik Wessels,
Titus J. Brinker
Abstract:
Large Language Models (LLMs) are revolutionizing medical Question-Answering (medQA) through extensive use of medical literature. However, their performance is often hampered by outdated training data and a lack of explainability, which limits clinical applicability. This study aimed to create and assess UroBot, a urology-specialized chatbot, by comparing it with state-of-the-art models and the performance of urologists on urological board questions, ensuring full clinician-verifiability. UroBot was developed using OpenAI's GPT-3.5, GPT-4, and GPT-4o models, employing retrieval-augmented generation (RAG) and the latest 2023 guidelines from the European Association of Urology (EAU). The evaluation included ten runs of 200 European Board of Urology (EBU) In-Service Assessment (ISA) questions, with performance assessed by the mean Rate of Correct Answers (RoCA). UroBot-4o achieved an average RoCA of 88.4%, surpassing GPT-4o (77.6%) by 10.8 percentage points. It was also clinician-verifiable and exhibited the highest run agreement as indicated by Fleiss' Kappa (k = 0.979). By comparison, the average performance of urologists on board questions, as reported in the literature, is 68.7%. UroBot's clinician-verifiable nature and superior accuracy compared to both existing models and urologists on board questions highlight its potential for clinical integration. The study also provides the necessary code and instructions for further development of UroBot.
Submitted 4 June, 2024; v1 submitted 3 June, 2024;
originally announced June 2024.
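The two summary statistics named in the UroBot abstract, the mean Rate of Correct Answers (RoCA) over repeated runs and Fleiss' kappa as a measure of run agreement, can be illustrated with a minimal sketch. The function names and the toy answer data are hypothetical and are not taken from the study's published code; the sketch assumes each run is treated as one "rater" and each question as one "subject".

```python
from collections import Counter


def mean_roca(runs, answer_key):
    """Mean fraction of correctly answered questions across runs.

    runs: list of runs, each a list with one chosen option per question.
    answer_key: list with the correct option per question.
    """
    per_run = [
        sum(a == k for a, k in zip(run, answer_key)) / len(answer_key)
        for run in runs
    ]
    return sum(per_run) / len(per_run)


def fleiss_kappa(runs):
    """Fleiss' kappa, treating every run as a rater and every question as a subject."""
    n_raters = len(runs)
    n_items = len(runs[0])
    # How often each option was chosen for each question.
    counts = [Counter(run[i] for run in runs) for i in range(n_items)]
    # Observed agreement per question, then averaged.
    p_items = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall share of each option.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_exp = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    if p_exp == 1.0:
        # Convention chosen for this sketch: only one option was ever used.
        return 1.0
    return (p_bar - p_exp) / (1 - p_exp)


# Hypothetical toy data: 3 runs over 4 questions.
answer_key = ["A", "B", "C", "D"]
runs = [
    ["A", "B", "C", "D"],
    ["A", "B", "C", "A"],
    ["A", "B", "D", "D"],
]
print(round(mean_roca(runs, answer_key), 4))
print(round(fleiss_kappa(runs), 4))
```

A kappa close to 1, such as the reported 0.979, means that the ten runs almost always selected the same option for a given question, independently of whether that option was correct.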
-
Advancing dermatological diagnosis: Development of a hyperspectral dermatoscope for enhanced skin imaging
Authors:
Martin J. Hetz,
Carina Nogueira Garcia,
Sarah Haggenmüller,
Titus J. Brinker
Abstract:
Clinical dermatology necessitates precision and innovation for efficient diagnosis and treatment of various skin conditions. This paper introduces the development of a cutting-edge hyperspectral dermatoscope (the Hyperscope) tailored for human skin analysis. We detail the requirements for such a device and the design considerations, from optical configurations to sensor selection, necessary to capture a wide spectral range with high fidelity. Preliminary results from 15 individuals and 160 recorded skin images demonstrate the potential of the Hyperscope in identifying and characterizing various skin conditions, offering a promising avenue for non-invasive skin evaluation and a platform for future research in dermatology-related hyperspectral imaging.
Submitted 25 June, 2024; v1 submitted 1 March, 2024;
originally announced March 2024.
-
Clinical Melanoma Diagnosis with Artificial Intelligence: Insights from a Prospective Multicenter Study
Authors:
Lukas Heinlein,
Roman C. Maron,
Achim Hekler,
Sarah Haggenmüller,
Christoph Wies,
Jochen S. Utikal,
Friedegund Meier,
Sarah Hobelsberger,
Frank F. Gellrich,
Mildred Sergon,
Axel Hauschild,
Lars E. French,
Lucie Heinzerling,
Justin G. Schlager,
Kamran Ghoreschi,
Max Schlaak,
Franz J. Hilke,
Gabriela Poch,
Sören Korsing,
Carola Berking,
Markus V. Heppt,
Michael Erdmann,
Sebastian Haferkamp,
Konstantin Drexler,
Dirk Schadendorf
, et al. (5 additional authors not shown)
Abstract:
Early detection of melanoma, a potentially lethal type of skin cancer with high prevalence worldwide, improves patient prognosis. In retrospective studies, artificial intelligence (AI) has proven to be helpful for enhancing melanoma detection. However, there are few prospective studies confirming these promising results. Existing studies are limited by low sample sizes, overly homogeneous datasets, or lack of inclusion of rare melanoma subtypes, preventing a fair and thorough evaluation of AI and its generalizability, a crucial aspect for its application in the clinical setting. Therefore, we assessed 'All Data are Ext' (ADAE), an established open-source ensemble algorithm for detecting melanomas, by comparing its diagnostic accuracy to that of dermatologists on a prospectively collected, external, heterogeneous test set comprising eight distinct hospitals, four different camera setups, rare melanoma subtypes, and special anatomical sites. We advanced the algorithm with real test-time augmentation (R-TTA, i.e. providing real photographs of lesions taken from multiple angles and averaging the predictions), and evaluated its generalization capabilities. Overall, the AI showed higher balanced accuracy than dermatologists (0.798, 95% confidence interval (CI) 0.779-0.814 vs. 0.781, 95% CI 0.760-0.802; p<0.001), obtaining a higher sensitivity (0.921, 95% CI 0.900-0.942 vs. 0.734, 95% CI 0.701-0.770; p<0.001) at the cost of a lower specificity (0.673, 95% CI 0.641-0.702 vs. 0.828, 95% CI 0.804-0.852; p<0.001). As the algorithm exhibited a significant performance advantage on our heterogeneous dataset exclusively comprising melanoma-suspicious lesions, AI may offer the potential to support dermatologists particularly in diagnosing challenging cases.
Submitted 25 January, 2024;
originally announced January 2024.
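The relationship between the sensitivity, specificity, and balanced accuracy figures quoted in the prospective melanoma abstract can be checked with a short worked example. It assumes the standard binary definition of balanced accuracy as the mean of sensitivity and specificity; how the study aggregated results across readers or bootstrap samples is not stated in the abstract, so the function names and this aggregation are illustrative only.

```python
def rates_from_counts(tp, fn, tn, fp):
    """Sensitivity and specificity from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


def balanced_accuracy(sensitivity, specificity):
    """Standard binary balanced accuracy: mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2


# Point estimates quoted in the abstract.
ai = balanced_accuracy(0.921, 0.673)
derm = balanced_accuracy(0.734, 0.828)
print(round(ai, 3), round(derm, 3))
```

The values obtained this way agree with the reported 0.798 and 0.781 to within rounding of the published inputs, which shows the trade-off directly: the algorithm gains more in sensitivity than it loses in specificity.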
-
On the calibration of neural networks for histological slide-level classification
Authors:
Alexander Kurz,
Hendrik A. Mehrtens,
Tabea-Clara Bucher,
Titus J. Brinker
Abstract:
Deep Neural Networks have shown promising classification performance when predicting certain biomarkers from Whole Slide Images in digital pathology. However, the calibration of the networks' output probabilities is often not evaluated. Communicating uncertainty by providing reliable confidence scores is of high relevance in the medical context. In this work, we compare three neural network architectures that combine feature representations on patch-level to a slide-level prediction with respect to their classification performance and evaluate their calibration. As the slide-level classification task, we choose the prediction of Microsatellite Instability from Colorectal Cancer tissue sections. We observe that Transformers lead to good results in terms of classification performance and calibration. When evaluating the classification performance on a separate dataset, we observe that Transformers generalize best. The investigation of reliability diagrams provides additional insights to the Expected Calibration Error metric and we observe that especially Transformers push the output probabilities to extreme values, which results in overconfident predictions.
Submitted 15 December, 2023;
originally announced December 2023.
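The Expected Calibration Error mentioned in the calibration abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses equal-width confidence bins with half-open intervals, and the function name, the bin count, and the toy inputs are all hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between accuracy and mean confidence per bin.

    confidences: predicted probability of the chosen class, in [0, 1].
    correct: 1 if the prediction was right, else 0.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Overconfident toy model: 95% confidence, but only half the predictions are right.
print(round(expected_calibration_error([0.95] * 4, [1, 1, 0, 0]), 3))
```

Output probabilities pushed toward extreme values, as the abstract reports for Transformers, land in the outermost bins, where any shortfall in accuracy translates directly into a large calibration gap.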
-
Mitigating the Influence of Domain Shift in Skin Lesion Classification: A Benchmark Study of Unsupervised Domain Adaptation Methods on Dermoscopic Images
Authors:
Sireesha Chamarthi,
Katharina Fogelberg,
Roman C. Maron,
Titus J. Brinker,
Julia Niebling
Abstract:
The potential of deep neural networks in skin lesion classification has already been demonstrated to be on par with, if not superior to, dermatologists' diagnoses. However, the performance of these models usually deteriorates when the test data differs significantly from the training data (i.e. domain shift). This concerning limitation for models intended to be used in real-world skin lesion classification tasks poses a risk to patients. For example, different image acquisition systems or previously unseen anatomical sites on the patient can suffice to cause such domain shifts. Mitigating the negative effect of such shifts is therefore crucial, but developing effective methods to address domain shift has proven to be challenging. In this study, we carry out an in-depth analysis of eight different unsupervised domain adaptation methods to analyze their effectiveness in improving generalization for dermoscopic datasets. To ensure robustness of our findings, we test each method on a total of ten distinct datasets, thereby covering a variety of possible domain shifts. In addition, we investigated which factors in the domain shifted datasets have an impact on the effectiveness of domain adaptation methods. Our findings show that all of the eight domain adaptation methods result in improved AUPRC for the majority of analyzed datasets. Altogether, these results indicate that unsupervised domain adaptations generally lead to performance improvements for the binary melanoma-nevus classification task regardless of the nature of the domain shift. However, small or heavily imbalanced datasets lead to a reduced conformity of the results due to the influence of these factors on the methods' performance.
Submitted 5 October, 2023;
originally announced October 2023.
-
Evaluating Deep Learning-based Melanoma Classification using Immunohistochemistry and Routine Histology: A Three Center Study
Authors:
Christoph Wies,
Lucas Schneider,
Sarah Haggenmueller,
Tabea-Clara Bucher,
Sarah Hobelsberger,
Markus V. Heppt,
Gerardo Ferrara,
Eva I. Krieghoff-Henning,
Titus J. Brinker
Abstract:
Pathologists routinely use immunohistochemical (IHC)-stained tissue slides against MelanA in addition to hematoxylin and eosin (H&E)-stained slides to improve their accuracy in diagnosing melanomas. The use of diagnostic Deep Learning (DL)-based support systems for automated examination of tissue morphology and cellular composition has been well studied in standard H&E-stained tissue slides. In contrast, there are few studies that analyze IHC slides using DL. Therefore, we investigated the separate and joint performance of ResNets trained on MelanA and corresponding H&E-stained slides. The MelanA classifier achieved an area under receiver operating characteristics curve (AUROC) of 0.82 and 0.74 on out of distribution (OOD)-datasets, similar to the H&E-based benchmark classification of 0.81 and 0.75, respectively. A combined classifier using MelanA and H&E achieved AUROCs of 0.85 and 0.81 on the OOD datasets. DL MelanA-based assistance systems show the same performance as the benchmark H&E classification and may be improved by multi stain classification to assist pathologists in their clinical routine.
Submitted 8 September, 2023; v1 submitted 7 September, 2023;
originally announced September 2023.
-
Using Multiple Dermoscopic Photographs of One Lesion Improves Melanoma Classification via Deep Learning: A Prognostic Diagnostic Accuracy Study
Authors:
Achim Hekler,
Roman C. Maron,
Sarah Haggenmüller,
Max Schmitt,
Christoph Wies,
Jochen S. Utikal,
Friedegund Meier,
Sarah Hobelsberger,
Frank F. Gellrich,
Mildred Sergon,
Axel Hauschild,
Lars E. French,
Lucie Heinzerling,
Justin G. Schlager,
Kamran Ghoreschi,
Max Schlaak,
Franz J. Hilke,
Gabriela Poch,
Sören Korsing,
Carola Berking,
Markus V. Heppt,
Michael Erdmann,
Sebastian Haferkamp,
Konstantin Drexler,
Dirk Schadendorf
, et al. (6 additional authors not shown)
Abstract:
Background: Convolutional neural network (CNN)-based melanoma classifiers face several challenges that limit their usefulness in clinical practice. Objective: To investigate the impact of multiple real-world dermoscopic views of a single lesion of interest on a CNN-based melanoma classifier.
Methods: This study evaluated 656 suspected melanoma lesions. Classifier performance was measured using area under the receiver operating characteristic curve (AUROC), expected calibration error (ECE) and maximum confidence change (MCC) for (I) a single-view scenario, (II) a multiview scenario using multiple artificially modified images per lesion and (III) a multiview scenario with multiple real-world images per lesion.
Results: The multiview approach with real-world images significantly increased the AUROC from 0.905 (95% CI, 0.879-0.929) in the single-view approach to 0.930 (95% CI, 0.909-0.951). ECE and MCC also improved significantly from 0.131 (95% CI, 0.105-0.159) to 0.072 (95% CI, 0.052-0.093) and from 0.149 (95% CI, 0.125-0.171) to 0.115 (95% CI, 0.099-0.131), respectively. Comparing multiview real-world to artificially modified images showed comparable diagnostic accuracy and uncertainty estimation, but significantly worse robustness for the latter.
Conclusion: Using multiple real-world images is an inexpensive method to positively impact the performance of a CNN-based melanoma classifier.
Submitted 5 June, 2023;
originally announced June 2023.
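The multiview idea in the abstract above, combining several real-world dermoscopic photographs of the same lesion into one prediction, can be sketched in a few lines. The sketch assumes that per-view melanoma probabilities are simply averaged, as the related prospective-study abstract describes for real test-time augmentation; the abstract of this study does not state its aggregation rule, and the function names, the threshold, and the toy probabilities are hypothetical.

```python
def average_views(view_probabilities):
    """Mean melanoma probability over all photographs of one lesion."""
    return sum(view_probabilities) / len(view_probabilities)


def classify_lesion(view_probabilities, threshold=0.5):
    """Return the averaged probability and whether it reaches the threshold."""
    probability = average_views(view_probabilities)
    return probability, probability >= threshold


# One lesion photographed three times; the classifier's output varies by view.
print(classify_lesion([0.5, 0.75, 1.0]))
```

Averaging dampens the influence of a single unrepresentative photograph, which is one plausible reason for the improved calibration and robustness reported in the results.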
-
Domain shifts in dermoscopic skin cancer datasets: Evaluation of essential limitations for clinical translation
Authors:
Katharina Fogelberg,
Sireesha Chamarthi,
Roman C. Maron,
Julia Niebling,
Titus J. Brinker
Abstract:
The limited ability of Convolutional Neural Networks to generalize to images from previously unseen domains is a major limitation, in particular, for safety-critical clinical tasks such as dermoscopic skin cancer classification. In order to translate CNN-based applications into the clinic, it is essential that they are able to adapt to domain shifts. Such new conditions can arise through the use of different image acquisition systems or varying lighting conditions. In dermoscopy, shifts can also occur as a change in patient age or occurrence of rare lesion localizations (e.g. palms). These are not prominently represented in most training datasets and can therefore lead to a decrease in performance. In order to verify the generalizability of classification models in real world clinical settings it is crucial to have access to data which mimics such domain shifts. To our knowledge no dermoscopic image dataset exists where such domain shifts are properly described and quantified. We therefore grouped publicly available images from the ISIC archive based on their metadata (e.g. acquisition location, lesion localization, patient age) to generate meaningful domains. To verify that these domains are in fact distinct, we used multiple quantification measures to estimate the presence and intensity of domain shifts. Additionally, we analyzed the performance on these domains with and without an unsupervised domain adaptation technique. We observed that in most of our grouped domains, domain shifts in fact exist. Based on our results, we believe these datasets to be helpful for testing the generalization capabilities of dermoscopic skin cancer classifiers.
Submitted 3 July, 2023; v1 submitted 14 April, 2023;
originally announced April 2023.
-
Dermatologist-like explainable AI enhances trust and confidence in diagnosing melanoma
Authors:
Tirtha Chanda,
Katja Hauser,
Sarah Hobelsberger,
Tabea-Clara Bucher,
Carina Nogueira Garcia,
Christoph Wies,
Harald Kittler,
Philipp Tschandl,
Cristian Navarrete-Dechent,
Sebastian Podlipnik,
Emmanouil Chousakos,
Iva Crnaric,
Jovana Majstorovic,
Linda Alhajwan,
Tanya Foreman,
Sandra Peternel,
Sergei Sarap,
İrem Özdemir,
Raymond L. Barnhill,
Mar Llamas Velasco,
Gabriela Poch,
Sören Korsing,
Wiebke Sondermann,
Frank Friedrich Gellrich,
Markus V. Heppt
, et al. (10 additional authors not shown)
Abstract:
Although artificial intelligence (AI) systems have been shown to improve the accuracy of initial melanoma diagnosis, the lack of transparency in how these systems identify melanoma poses severe obstacles to user acceptance. Explainable artificial intelligence (XAI) methods can help to increase transparency, but most XAI methods are unable to produce precisely located domain-specific explanations, making the explanations difficult to interpret. Moreover, the impact of XAI methods on dermatologists has not yet been evaluated. Extending on two existing classifiers, we developed an XAI system that produces text and region based explanations that are easily interpretable by dermatologists alongside its differential diagnoses of melanomas and nevi. To evaluate this system, we conducted a three-part reader study to assess its impact on clinicians' diagnostic accuracy, confidence, and trust in the XAI-support. We showed that our XAI's explanations were highly aligned with clinicians' explanations and that both the clinicians' trust in the support system and their confidence in their diagnoses were significantly increased when using our XAI compared to using a conventional AI system. The clinicians' diagnostic accuracy was numerically, albeit not significantly, increased. This work demonstrates that clinicians are willing to adopt such an XAI system, motivating their future use in the clinic.
Submitted 17 March, 2023;
originally announced March 2023.
-
Multi-domain stain normalization for digital pathology: A cycle-consistent adversarial network for whole slide images
Authors:
Martin J. Hetz,
Tabea-Clara Bucher,
Titus J. Brinker
Abstract:
The variation in histologic staining between different medical centers is one of the most profound challenges in the field of computer-aided diagnosis. The appearance disparity of pathological whole slide images causes algorithms to become less reliable, which in turn impedes the widespread applicability of downstream tasks like cancer diagnosis. Furthermore, different stainings lead to biases in the training which in case of domain shifts negatively affect the test performance. Therefore, in this paper we propose MultiStain-CycleGAN, a multi-domain approach to stain normalization based on CycleGAN. Our modifications to CycleGAN allow us to normalize images of different origins without retraining or using different models. We perform an extensive evaluation of our method using various metrics and compare it to commonly used methods that are multi-domain capable. First, we evaluate how well our method fools a domain classifier that tries to assign a medical center to an image. Then, we test our normalization on the tumor classification performance of a downstream classifier. Furthermore, we evaluate the image quality of the normalized images using the Structural similarity index and the ability to reduce the domain shift using the Fréchet inception distance. We show that our method proves to be multi-domain capable, provides the highest image quality among the compared methods, and can most reliably fool the domain classifier while keeping the tumor classifier performance high. By reducing the domain influence, biases in the data can be removed on the one hand and the origin of the whole slide image can be disguised on the other, thus enhancing patient data privacy.
Submitted 23 January, 2023;
originally announced January 2023.
-
Benchmarking common uncertainty estimation methods with histopathological images under domain shift and label noise
Authors:
Hendrik A. Mehrtens,
Alexander Kurz,
Tabea-Clara Bucher,
Titus J. Brinker
Abstract:
In the past years, deep learning has seen an increase in usage in the domain of histopathological applications. However, while these approaches have shown great potential, in high-risk environments deep learning models need to be able to judge their uncertainty and be able to reject inputs when there is a significant chance of misclassification. In this work, we conduct a rigorous evaluation of the most commonly used uncertainty and robustness methods for the classification of Whole Slide Images, with a focus on the task of selective classification, where the model should reject the classification in situations in which it is uncertain. We conduct our experiments on tile-level under the aspects of domain shift and label noise, as well as on slide-level. In our experiments, we compare Deep Ensembles, Monte-Carlo Dropout, Stochastic Variational Inference, Test-Time Data Augmentation as well as ensembles of the latter approaches. We observe that ensembles of methods generally lead to better uncertainty estimates as well as an increased robustness towards domain shifts and label noise, while contrary to results from classical computer vision benchmarks no systematic gain of the other methods can be shown. Across methods, a rejection of the most uncertain samples reliably leads to a significant increase in classification accuracy on both in-distribution as well as out-of-distribution data. Furthermore, we conduct experiments comparing these methods under varying conditions of label noise. Lastly, we publish our code framework to facilitate further research on uncertainty estimation on histopathological data.
Submitted 6 July, 2023; v1 submitted 3 January, 2023;
originally announced January 2023.
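The selective classification task described in the uncertainty benchmarking abstract, in which the model rejects the inputs it is least certain about, can be illustrated with a minimal sketch. The function name, the rejection rule (discard a fixed fraction of the most uncertain samples), and the toy values are assumptions made for illustration and are not taken from the published code framework.

```python
def selective_accuracy(uncertainties, correct, reject_fraction):
    """Accuracy on the samples kept after rejecting the most uncertain ones.

    uncertainties: one uncertainty score per sample (higher = less certain).
    correct: 1 if the prediction was right, else 0.
    reject_fraction: share of samples to reject, in [0, 1).
    """
    n = len(correct)
    n_keep = n - int(n * reject_fraction)
    if n_keep <= 0:
        raise ValueError("reject_fraction leaves no samples to evaluate")
    # Keep the n_keep samples with the lowest uncertainty.
    order = sorted(range(n), key=lambda i: uncertainties[i])
    kept = order[:n_keep]
    return sum(correct[i] for i in kept) / n_keep


# Toy example: the single wrong prediction is also the most uncertain one.
uncertainties = [0.1, 0.9, 0.2, 0.8]
correct = [1, 0, 1, 1]
for fraction in (0.0, 0.25, 0.5):
    print(fraction, selective_accuracy(uncertainties, correct, fraction))
```

Plotting this accuracy against the rejected fraction gives an accuracy-rejection curve; a useful uncertainty estimate makes that curve rise, which matches the abstract's observation that rejecting the most uncertain samples increases accuracy.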
-
Gelfand triples for the Kohn-Nirenberg quantization on homogeneous Lie groups
Authors:
Jonas Brinker,
Jens Wirth
Abstract:
In this paper, we study the group Fourier transform and the Kohn-Nirenberg quantization for homogeneous Lie groups as mappings between certain Gelfand triples. For this, we restrict our considerations to the case where the homogeneous Lie group $G$ admits irreducible unitary representations that are square integrable modulo the center $Z(G)$ of $G$, and where $\dim Z(G)=1$. Replacing the Schwartz space by a certain subspace $\mathcal S_*(G) \hookrightarrow \mathcal S(G)$, we characterise the range of the group Fourier transform on $\mathcal S_*(G)$ and construct distributions and Gelfand triples around $L^2(G,μ)$ and its Fourier image $L^2(\hat G,\hat μ)$, such that the Fourier transform becomes a Gelfand triple isomorphism. We give results on the multiplication of distributions with a large class of vector valued smooth functions and use this to establish the Kohn-Nirenberg quantization as an isomorphism for our Gelfand triples and provide an explicit formula for the Kohn-Nirenberg symbol of an operator.
Submitted 20 June, 2020; v1 submitted 1 January, 2020;
originally announced January 2020.
-
Verification of exceptional points in the collapse dynamics of Bose-Einstein condensates
Authors:
Jonas Brinker,
Jacob Fuchs,
Jörg Main,
Günter Wunner,
Holger Cartarius
Abstract:
In Bose-Einstein condensates with an attractive contact interaction the stable ground state and an unstable excited state emerge in a tangent bifurcation at a critical value of the scattering length. At the bifurcation point both the energies and the wave functions of the two states coalesce, which is the characteristic of an exceptional point. In numerical simulations signatures of the exceptional point can be observed by encircling the bifurcation point in the complex extended space of the scattering length; however, this method cannot be applied in an experiment. Here we show in which way the exceptional point affects the collapse dynamics of the Bose-Einstein condensate. The harmonic inversion analysis of the time signal given as the spatial extension of the collapsing condensate wave function can provide clear evidence for the existence of an exceptional point. This method can be used for an experimental verification of exceptional points in Bose-Einstein condensates.
Submitted 20 November, 2014; v1 submitted 15 October, 2014;
originally announced October 2014.