-
Is Open-Source There Yet? A Comparative Study on Commercial and Open-Source LLMs in Their Ability to Label Chest X-Ray Reports
Authors:
Felix J. Dorfner,
Liv Jürgensen,
Leonhard Donle,
Fares Al Mohamad,
Tobias R. Bodenmann,
Mason C. Cleveland,
Felix Busch,
Lisa C. Adams,
James Sato,
Thomas Schultz,
Albert E. Kim,
Jameson Merkow,
Keno K. Bressem,
Christopher P. Bridge
Abstract:
Introduction: With the rapid advances in large language models (LLMs), numerous new open-source and commercial models have been released. While recent publications have explored GPT-4 in its application to extracting information of interest from radiology reports, there has not been a real-world comparison of GPT-4 to different leading open-source models.
Materials and Methods: Two different and independent datasets were used. The first dataset consists of 540 chest x-ray reports that were created at the Massachusetts General Hospital between July 2019 and July 2021. The second dataset consists of 500 chest x-ray reports from the ImaGenome dataset. We then compared the commercial models GPT-3.5 Turbo and GPT-4 from OpenAI to the open-source models Mistral-7B, Mixtral-8x7B, Llama2-13B, Llama2-70B, and QWEN1.5-72B, as well as CheXbert and CheXpert-labeler, in their ability to accurately label the presence of multiple findings in x-ray text reports using different prompting techniques.
Results: On the ImaGenome dataset, the best performing open-source model was Llama2-70B with micro F1-scores of 0.972 and 0.970 for zero- and few-shot prompts, respectively. GPT-4 achieved micro F1-scores of 0.975 and 0.984, respectively. On the institutional dataset, the best performing open-source model was QWEN1.5-72B with micro F1-scores of 0.952 and 0.965 for zero- and few-shot prompting, respectively. GPT-4 achieved micro F1-scores of 0.975 and 0.973, respectively.
Conclusion: In this paper, we show that while GPT-4 is superior to open-source models in zero-shot report labeling, the implementation of few-shot prompting can bring open-source models on par with GPT-4. This shows that open-source models could be a performant and privacy preserving alternative to GPT-4 for the task of radiology report classification.
Submitted 19 February, 2024;
originally announced February 2024.
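The micro F1-scores quoted in the abstract above pool true positives, false positives, and false negatives over every (report, finding) pair before computing F1. A minimal sketch of that metric follows; the function name `micro_f1` and the toy labels are illustrative assumptions, not the authors' evaluation code or data.

```python
from typing import Sequence


def micro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Micro-averaged F1 over all (report, finding) pairs of binary labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1  # finding present and predicted
            elif t == 0 and p == 1:
                fp += 1  # finding absent but predicted
            elif t == 1 and p == 0:
                fn += 1  # finding present but missed
    denom = 2 * tp + fp + fn
    # Convention assumed here: return 0.0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data: 3 reports x 3 findings (not taken from the paper).
y_true = [[1, 0, 1], [0, 0, 1], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]
score = micro_f1(y_true, y_pred)
print(f"micro F1 = {score:.3f}")
```

Because counts are pooled before the ratio is taken, frequent findings weigh more heavily than rare ones, which is the usual reason to report micro rather than macro F1 for imbalanced label sets.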
-
Where is the Truth? The Risk of Getting Confounded in a Continual World
Authors:
Florian Peter Busch,
Roshni Kamath,
Rupert Mitchell,
Wolfgang Stammer,
Kristian Kersting,
Martin Mundt
Abstract:
A dataset is confounded if it is most easily solved via a spurious correlation, which fails to generalize to new data. In this work, we show that, in a continual learning setting where confounders may vary in time across tasks, the challenge of mitigating the effect of confounders far exceeds the standard forgetting problem normally considered. In particular, we provide a formal description of such continual confounders and identify that, in general, spurious correlations are easily ignored when training for all tasks jointly, but it is harder to avoid confounding when they are considered sequentially. These descriptions serve as a basis for constructing a novel CLEVR-based continually confounded dataset, which we term the ConCon dataset. Our evaluations demonstrate that standard continual learning methods fail to ignore the dataset's confounders. Overall, our work highlights the challenges of confounding factors, particularly in continual learning settings, and demonstrates the need for developing continual learning methods to robustly tackle these.
Submitted 15 June, 2024; v1 submitted 9 February, 2024;
originally announced February 2024.
-
LongHealth: A Question Answering Benchmark with Long Clinical Documents
Authors:
Lisa Adams,
Felix Busch,
Tianyu Han,
Jean-Baptiste Excoffier,
Matthieu Ortala,
Alexander Löser,
Hugo JWL. Aerts,
Jakob Nikolas Kather,
Daniel Truhn,
Keno Bressem
Abstract:
Background: Recent advancements in large language models (LLMs) offer potential benefits in healthcare, particularly in processing extensive patient records. However, existing benchmarks do not fully assess LLMs' capability in handling real-world, lengthy clinical data.
Methods: We present the LongHealth benchmark, comprising 20 detailed fictional patient cases across various diseases, with each case containing 5,090 to 6,754 words. The benchmark poses 400 multiple-choice questions in three categories (information extraction, negation, and sorting), challenging LLMs to extract and interpret information from large clinical documents.
Results: We evaluated nine open-source LLMs with a minimum context length of 16,000 tokens and also included OpenAI's proprietary and cost-efficient GPT-3.5 Turbo for comparison. The highest accuracy was observed for Mixtral-8x7B-Instruct-v0.1, particularly in tasks focused on information retrieval from single and multiple patient documents. However, all models struggled significantly in tasks requiring the identification of missing information, highlighting a critical area for improvement in clinical data interpretation.
Conclusion: While LLMs show considerable potential for processing long clinical documents, their current accuracy levels are insufficient for reliable clinical use, especially in scenarios requiring the identification of missing information. The LongHealth benchmark provides a more realistic assessment of LLMs in a healthcare setting and highlights the need for further model refinement for safe and effective clinical application.
We make the benchmark and evaluation code publicly available.
Submitted 25 January, 2024;
originally announced January 2024.
-
From Text to Image: Exploring GPT-4Vision's Potential in Advanced Radiological Analysis across Subspecialties
Authors:
Felix Busch,
Tianyu Han,
Marcus Makowski,
Daniel Truhn,
Keno Bressem,
Lisa Adams
Abstract:
The study evaluates and compares GPT-4 and GPT-4Vision for radiological tasks, suggesting GPT-4Vision may recognize radiological features from images, thereby enhancing its diagnostic potential over text-based descriptions.
Submitted 24 November, 2023;
originally announced November 2023.
-
MEDBERT.de: A Comprehensive German BERT Model for the Medical Domain
Authors:
Keno K. Bressem,
Jens-Michalis Papaioannou,
Paul Grundmann,
Florian Borchert,
Lisa C. Adams,
Leonhard Liu,
Felix Busch,
Lina Xu,
Jan P. Loyen,
Stefan M. Niehues,
Moritz Augustin,
Lennart Grosser,
Marcus R. Makowski,
Hugo JWL. Aerts,
Alexander Löser
Abstract:
This paper presents medBERTde, a pre-trained German BERT model specifically designed for the German medical domain. The model has been trained on a large corpus of 4.7 Million German medical documents and has been shown to achieve new state-of-the-art performance on eight different medical benchmarks covering a wide range of disciplines and medical document types. In addition to evaluating the overall performance of the model, this paper also conducts a more in-depth analysis of its capabilities. We investigate the impact of data deduplication on the model's performance, as well as the potential benefits of using more efficient tokenization methods. Our results indicate that domain-specific models such as medBERTde are particularly useful for longer texts, and that deduplication of training data does not necessarily lead to improved performance. Furthermore, we found that efficient tokenization plays only a minor role in improving model performance, and attribute most of the improved performance to the large amount of training data. To encourage further research, the pre-trained model weights and new benchmarks based on radiological data are made publicly available for use by the scientific community.
Submitted 24 March, 2023; v1 submitted 14 March, 2023;
originally announced March 2023.
-
Continual Causal Abstractions
Authors:
Matej Zečević,
Moritz Willig,
Jonas Seng,
Florian Peter Busch
Abstract:
This short paper discusses continually updated causal abstractions as a potential direction of future research. The key idea is to revise the existing level of causal abstraction to a different level of detail that is both consistent with the history of observed data and more effective in solving a given task.
Submitted 6 January, 2023; v1 submitted 23 December, 2022;
originally announced December 2022.
-
What Does DALL-E 2 Know About Radiology?
Authors:
Lisa C. Adams,
Felix Busch,
Daniel Truhn,
Marcus R. Makowski,
Hugo JWL. Aerts,
Keno K. Bressem
Abstract:
Generative models such as DALL-E 2 could represent a promising future tool for image generation, augmentation, and manipulation for artificial intelligence research in radiology provided that these models have sufficient medical domain knowledge. Here we show that DALL-E 2 has learned relevant representations of X-ray images with promising capabilities in terms of zero-shot text-to-image generation of new images, continuation of an image beyond its original boundaries, or removal of elements, while pathology generation or CT, MRI, and ultrasound images are still limited. The use of generative models for augmenting and generating radiological data thus seems feasible, even if further fine-tuning and adaptation of these models to the respective domain is required beforehand.
Submitted 27 September, 2022;
originally announced September 2022.
-
Combining Predictions under Uncertainty: The Case of Random Decision Trees
Authors:
Florian Busch,
Moritz Kulessa,
Eneldo Loza Mencía,
Hendrik Blockeel
Abstract:
A common approach to aggregating classification estimates in an ensemble of decision trees is either to use voting or to average the probabilities for each class. The latter takes uncertainty into account, but not the reliability of the uncertainty estimates (so to speak, the "uncertainty about the uncertainty"). More generally, much remains unknown about how to best combine probabilistic estimates from multiple sources. In this paper, we investigate a number of alternative prediction methods. Our methods are inspired by the theories of probability, belief functions and reliable classification, as well as a principle that we call evidence accumulation. Our experiments on a variety of data sets are based on random decision trees, which guarantees a high diversity in the predictions to be combined. Somewhat unexpectedly, we found that taking the average over the probabilities is actually hard to beat. However, evidence accumulation showed consistently better results on all but very small leaves.
Submitted 15 August, 2022;
originally announced August 2022.
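The two baseline aggregation schemes named in the abstract above, voting and probability averaging, can disagree when one tree is far more confident than the others. The sketch below illustrates only these two baselines, not the paper's evidence accumulation method; the function names and the per-tree probabilities are hypothetical.

```python
from collections import Counter
from typing import List


def majority_vote(tree_probs: List[List[float]]) -> int:
    """Each tree votes for its most probable class; ties go to the lowest class index."""
    votes = Counter(max(range(len(p)), key=p.__getitem__) for p in tree_probs)
    best = max(votes.values())
    return min(c for c, n in votes.items() if n == best)


def average_probabilities(tree_probs: List[List[float]]) -> List[float]:
    """Average the per-class probability estimates over all trees."""
    n_trees = len(tree_probs)
    n_classes = len(tree_probs[0])
    return [sum(p[c] for p in tree_probs) / n_trees for c in range(n_classes)]


def average_predict(tree_probs: List[List[float]]) -> int:
    """Predict the class with the highest averaged probability."""
    avg = average_probabilities(tree_probs)
    return max(range(len(avg)), key=avg.__getitem__)


# Hypothetical estimates from three trees for a two-class problem:
# two trees weakly favor class 0, one tree strongly favors class 1.
tree_probs = [[0.55, 0.45], [0.60, 0.40], [0.05, 0.95]]
vote_label = majority_vote(tree_probs)
avg_label = average_predict(tree_probs)
print("voting:", vote_label, "averaging:", avg_label)
```

Voting discards how confident each tree is, while averaging lets the single confident tree outweigh two lukewarm ones. Neither scheme accounts for how reliable each estimate is (for example, how many training examples fell into the leaf), which is the gap the abstract points to.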
-
Attributions Beyond Neural Networks: The Linear Program Case
Authors:
Florian Peter Busch,
Matej Zečević,
Kristian Kersting,
Devendra Singh Dhami
Abstract:
Linear Programs (LPs) have been one of the building blocks in machine learning and have championed recent strides in differentiable optimizers for learning systems. While there exist solvers for even high-dimensional LPs, understanding said high-dimensional solutions poses an orthogonal and unresolved problem. We introduce an approach where we consider neural encodings for LPs that justify the application of attribution methods from explainable artificial intelligence (XAI) designed for neural learning systems. The several encoding functions we propose take into account aspects such as feasibility of the decision space, the cost attached to each input, or the distance to special points of interest. We investigate the mathematical consequences of several XAI methods on said neural LP encodings. We empirically show that the attribution methods Saliency and LIME reveal indistinguishable results up to perturbation levels, and we propose the property of Directedness as the main discriminative criterion between Saliency and LIME on one hand, and a perturbation-based Feature Permutation approach on the other hand. Directedness indicates whether an attribution method gives feature attributions with respect to an increase of that feature. We further notice the baseline selection problem beyond the classical computer vision setting for Integrated Gradients.
Submitted 14 June, 2022;
originally announced June 2022.
-
Integrating Parcel Deliveries into a Ride-Pooling Service -- An Agent-Based Simulation Study
Authors:
Fabian Fehn,
Roman Engelhardt,
Florian Dandl,
Klaus Bogenberger,
Fritz Busch
Abstract:
This paper examines the integration of freight delivery into the passenger transport of an on-demand ride-pooling service. The goal of this research is to use existing passenger trips for logistics services and thus reduce additional vehicle kilometers for freight delivery and the total number of vehicles on the road network. This is achieved by merging the need for two separate fleets into a single one by combining the services. To evaluate the potential of such a mobility-on-demand service, this paper uses an agent-based simulation framework and integrates three heuristic parcel assignment strategies into a ride-pooling fleet control algorithm. Two integration scenarios (moderate and full) are set up. While in both scenarios passengers and parcels share rides in one vehicle, in the moderate scenario no stops for parcel pick-up and delivery are allowed during a passenger ride to decrease customer inconvenience. Using real-world demand data for a case study of Munich, Germany, the two integration scenarios together with the three assignment strategies are compared to the status quo, which uses two separate vehicle fleets for passenger and logistics transport. The results indicate that the integration of logistics services into a ride-pooling service is possible and can exploit unused system capacities without deteriorating passenger transport. Depending on the assignment strategies, nearly all parcels can be served up to a parcel-to-passenger demand ratio of 1:10, while the overall fleet kilometers can be decreased compared to the status quo.
Submitted 10 May, 2022;
originally announced May 2022.
-
A Gaussian Process Model for Opponent Prediction in Autonomous Racing
Authors:
Edward L. Zhu,
Finn Lukas Busch,
Jake Johnson,
Francesco Borrelli
Abstract:
In head-to-head racing, an accurate model of interactive behavior of the opposing target vehicle (TV) is required to perform tightly constrained, but highly rewarding maneuvers such as overtaking. However, such information is not typically made available in competitive scenarios; we therefore propose to construct a prediction and uncertainty model given data of the TV from previous races. In particular, a one-step Gaussian process (GP) model is trained on closed-loop interaction data to learn the behavior of a TV driven by an unknown policy. Predictions of the nominal trajectory and associated uncertainty are rolled out via a sampling-based approach and are used in a model predictive control (MPC) policy for the ego vehicle in order to intelligently trade off between safety and performance when attempting overtaking maneuvers against a TV. We demonstrate the GP-based predictor in closed loop with the MPC policy in simulation races and compare its performance against several predictors from the literature. In a Monte Carlo study, we observe that the GP-based predictor achieves similar win rates while maintaining safety in up to 3x more races. We finally demonstrate the prediction and control framework in real time in an experimental study on a 1/10th scale racecar platform operating at speeds of around 2.8 m/s, and show a significant level of improvement when using the GP-based predictor over a baseline MPC predictor. Videos of the hardware experiments can be found at https://youtu.be/KMSs4ofDfIs.
Submitted 1 March, 2023; v1 submitted 26 April, 2022;
originally announced April 2022.
-
Finding Structure and Causality in Linear Programs
Authors:
Matej Zečević,
Florian Peter Busch,
Devendra Singh Dhami,
Kristian Kersting
Abstract:
Linear Programs (LP) are celebrated widely, particularly so in machine learning where they have allowed for effectively solving probabilistic inference tasks or imposing structure on end-to-end learning systems. Their potential might seem depleted but we propose a foundational, causal perspective that reveals intriguing intra- and inter-structure relations for LP components. We conduct a systematic, empirical investigation on general-, shortest path- and energy system LPs.
Submitted 29 March, 2022;
originally announced March 2022.