-
Generalization in Healthcare AI: Evaluation of a Clinical Large Language Model
Authors:
Salman Rahman,
Lavender Yao Jiang,
Saadia Gabriel,
Yindalon Aphinyanaphongs,
Eric Karl Oermann,
Rumi Chunara
Abstract:
Advances in large language models (LLMs) provide new opportunities in healthcare for improved patient care, clinical decision-making, and enhancement of physician and administrator workflows. However, the potential of these models importantly depends on their ability to generalize effectively across clinical environments and populations, a challenge often underestimated in early development. To better understand reasons for these challenges and inform mitigation approaches, we evaluated ClinicLLM, an LLM trained on [HOSPITAL]'s clinical notes, analyzing its performance on 30-day all-cause readmission prediction focusing on variability across hospitals and patient characteristics. We found poorer generalization particularly in hospitals with fewer samples, among patients with government and unspecified insurance, the elderly, and those with high comorbidities. To understand reasons for lack of generalization, we investigated sample sizes for fine-tuning, note content (number of words per note), patient characteristics (comorbidity level, age, insurance type, borough), and health system aspects (hospital, all-cause 30-day readmission, and mortality rates). We used descriptive statistics and supervised classification to identify features. We found that, along with sample size, patient age, number of comorbidities, and the number of words in notes are all important factors related to generalization. Finally, we compared local fine-tuning (hospital specific), instance-based augmented fine-tuning and cluster-based fine-tuning for improving generalization. Among these, local fine-tuning proved most effective, increasing AUC by 0.25% to 11.74% (most helpful in settings with limited data). Overall, this study provides new insights for enhancing the deployment of large language models in the societally important domain of healthcare, and improving their performance for broader populations.
Submitted 24 February, 2024; v1 submitted 14 February, 2024;
originally announced February 2024.
-
Making the Most Out of the Limited Context Length: Predictive Power Varies with Clinical Note Type and Note Section
Authors:
Hongyi Zheng,
Yixin Zhu,
Lavender Yao Jiang,
Kyunghyun Cho,
Eric Karl Oermann
Abstract:
Recent advances in large language models have led to renewed interest in natural language processing in healthcare using the free text of clinical notes. One distinguishing characteristic of clinical notes is their long time span over multiple long documents. The unique structure of clinical notes creates a new design choice: when the context length for a language model predictor is limited, which part of clinical notes should we choose as the input? Existing studies either choose the inputs with domain knowledge or simply truncate them. We propose a framework to analyze the sections with high predictive power. Using MIMIC-III, we show that: 1) predictive power distribution is different between nursing notes and discharge notes and 2) combining different types of notes could improve performance when the context length is large. Our findings suggest that a carefully selected sampling function could enable more efficient information extraction from clinical notes.
Submitted 13 July, 2023;
originally announced July 2023.
-
Language Model Classifier Aligns Better with Physician Word Sensitivity than XGBoost on Readmission Prediction
Authors:
Grace Yang,
Ming Cao,
Lavender Y. Jiang,
Xujin C. Liu,
Alexander T. M. Cheung,
Hannah Weiss,
David Kurland,
Kyunghyun Cho,
Eric K. Oermann
Abstract:
Traditional evaluation metrics for classification in natural language processing such as accuracy and area under the curve fail to differentiate between models with different predictive behaviors despite their similar performance metrics. We introduce sensitivity score, a metric that scrutinizes models' behaviors at the vocabulary level to provide insights into disparities in their decision-making logic. We assess the sensitivity score on a set of representative words in the test set using two classifiers trained for hospital readmission classification with similar performance statistics. Our experiments compare the decision-making logic of clinicians and classifiers based on rank correlations of sensitivity scores. The results indicate that the language model's sensitivity score aligns better with the professionals than the XGBoost classifier on TF-IDF embeddings, which suggests that XGBoost uses some spurious features. Overall, this metric offers a novel perspective on assessing models' robustness by quantifying their discrepancy with professional opinions. Our code is available on GitHub (https://github.com/nyuolab/Model_Sensitivity).
Submitted 15 November, 2022; v1 submitted 13 November, 2022;
originally announced November 2022.
-
Identifying and mitigating bias in algorithms used to manage patients in a pandemic
Authors:
Yifan Li,
Garrett Yoon,
Mustafa Nasir-Moin,
David Rosenberg,
Sean Neifert,
Douglas Kondziolka,
Eric Karl Oermann
Abstract:
Numerous COVID-19 clinical decision support systems have been developed. However, many of these systems lack validity because of methodological shortcomings, including algorithmic bias. Methods: Logistic regression models were created to predict COVID-19 mortality, ventilator status, and inpatient status using a real-world dataset from four hospitals in New York City, and were analyzed for biases against race, gender, and age. Simple thresholding adjustments were applied in the training process to establish more equitable models. Results: Compared to the naively trained models, the calibrated models showed a 57% decrease in the number of biased trials, while predictive performance, measured by area under the receiver operating characteristic curve (AUC), remained unchanged. After calibration, the average sensitivity of the predictive models increased from 0.527 to 0.955. Conclusion: We demonstrate that naively training and deploying machine learning models on real-world data for predictive analytics of COVID-19 carries a high risk of bias. Simple adjustments or calibrations implemented during model training can lead to substantial and sustained gains in fairness on subsequent deployment.
Submitted 30 October, 2021;
originally announced November 2021.
-
Patient level simulation and reinforcement learning to discover novel strategies for treating ovarian cancer
Authors:
Brian Murphy,
Mustafa Nasir-Moin,
Grace von Oiste,
Viola Chen,
Howard A Riina,
Douglas Kondziolka,
Eric K Oermann
Abstract:
The prognosis for patients with epithelial ovarian cancer remains dismal despite improvements in survival for other cancers. Treatment involves multiple lines of chemotherapy and becomes increasingly heterogeneous after first-line therapy. Reinforcement learning with real-world outcomes data has the potential to identify novel treatment strategies to improve overall survival. We design a reinforcement learning environment to model epithelial ovarian cancer treatment trajectories and use model-free reinforcement learning to investigate therapeutic regimens for simulated patients.
Submitted 22 October, 2021;
originally announced October 2021.
-
Out-of-distribution Generalization in the Presence of Nuisance-Induced Spurious Correlations
Authors:
Aahlad Puli,
Lily H. Zhang,
Eric K. Oermann,
Rajesh Ranganath
Abstract:
In many prediction problems, spurious correlations are induced by a changing relationship between the label and a nuisance variable that is also correlated with the covariates. For example, in classifying animals in natural images, the background, which is a nuisance, can predict the type of animal. This nuisance-label relationship does not always hold, and the performance of a model trained under one such relationship may be poor on data with a different nuisance-label relationship. To build predictive models that perform well regardless of the nuisance-label relationship, we develop Nuisance-Randomized Distillation (NURD). We introduce the nuisance-randomized distribution, a distribution where the nuisance and the label are independent. Under this distribution, we define the set of representations such that conditioning on any member, the nuisance and the label remain independent. We prove that the representations in this set always perform better than chance, while representations outside of this set may not. NURD finds a representation from this set that is most informative of the label under the nuisance-randomized distribution, and we prove that this representation achieves the highest performance regardless of the nuisance-label relationship. We evaluate NURD on several tasks including chest X-ray classification where, using non-lung patches as the nuisance, NURD produces models that predict pneumonia under strong spurious correlations.
Submitted 12 February, 2023; v1 submitted 29 June, 2021;
originally announced July 2021.
-
Confounding variables can degrade generalization performance of radiological deep learning models
Authors:
John R. Zech,
Marcus A. Badgeley,
Manway Liu,
Anthony B. Costa,
Joseph J. Titano,
Eric K. Oermann
Abstract:
Early results in using convolutional neural networks (CNNs) on x-rays to diagnose disease have been promising, but it has not yet been shown that models trained on x-rays from one hospital or one group of hospitals will work equally well at different hospitals. Before these tools are used for computer-aided diagnosis in real-world clinical settings, we must verify their ability to generalize across a variety of hospital systems. A cross-sectional design was used to train and evaluate pneumonia screening CNNs on 158,323 chest x-rays from NIH (n=112,120 from 30,805 patients), Mount Sinai (n=42,396 from 12,904 patients), and Indiana (n=3,807 from 3,683 patients). In 3 of 5 natural comparisons, performance on chest x-rays from outside hospitals was significantly lower than on held-out x-rays from the original hospital systems. CNNs were able to detect where an x-ray was acquired (hospital system, hospital department) with extremely high accuracy and calibrate predictions accordingly. The performance of CNNs in diagnosing diseases on x-rays may reflect not only their ability to identify disease-specific imaging findings on x-rays, but also their ability to exploit confounding information. Estimates of CNN performance based on test data from hospital systems used for model training may overstate their likely real-world performance.
Submitted 12 July, 2018; v1 submitted 1 July, 2018;
originally announced July 2018.