-
Quantification of BERT Diagnosis Generalizability Across Medical Specialties Using Semantic Dataset Distance
Authors:
Mihir P. Khambete,
William Su,
Juan Garcia,
Marcus A. Badgeley
Abstract:
Deep learning models in healthcare may fail to generalize on data from unseen corpora. Additionally, no quantitative metric exists to tell how existing models will perform on new data. Previous studies demonstrated that NLP models of medical notes generalize variably between institutions, but ignored other levels of healthcare organization. We measured SciBERT diagnosis sentiment classifier generalizability between medical specialties using EHR sentences from MIMIC-III. Models trained on one specialty performed better on internal test sets than mixed or external test sets (mean AUCs 0.92, 0.87, and 0.83, respectively; p = 0.016). When models are trained on more specialties, they have better test performances (p < 1e-4). Model performance on new corpora is directly correlated to the similarity between train and test sentence content (p < 1e-4). Future studies should assess additional axes of generalization to ensure deep learning models fulfil their intended purpose across institutions, specialties, and practices.
Submitted 19 February, 2021; v1 submitted 14 August, 2020;
originally announced August 2020.
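The AUC values quoted in the abstract above compare one model across internal and external test sets. As a minimal sketch of how such a comparison could be computed, the snippet below evaluates AUC as the probability that a randomly chosen positive example outscores a randomly chosen negative one. The function name `roc_auc` and the toy labels and scores are hypothetical illustrations, not data or code from the paper.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a random positive > score of a random negative).

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up scores for illustration only (not results from the paper).
internal_auc = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
external_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
print(internal_auc, external_auc)
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch; a rank-based formulation would be the usual choice for large test sets.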
-
Constructing a control-ready model of EEG signal during general anesthesia in humans
Authors:
John H. Abel,
Marcus A. Badgeley,
Taylor E. Baum,
Sourish Chakravarty,
Patrick L. Purdon,
Emery N. Brown
Abstract:
Significant effort toward the automation of general anesthesia has been made in the past decade. One open challenge is in the development of control-ready patient models for closed-loop anesthesia delivery. Standard depth-of-anesthesia tracking does not readily capture inter-individual differences in response to anesthetics, especially those due to age, and does not aim to predict a relationship between a control input (infused anesthetic dose) and system state (commonly, a function of electroencephalography (EEG) signal). In this work, we developed a control-ready patient model for closed-loop propofol-induced anesthesia using data recorded during a clinical study of EEG during general anesthesia in ten healthy volunteers. We used principal component analysis to identify the low-dimensional state-space in which EEG signal evolves during anesthesia delivery. We parameterized the response of the EEG signal to changes in propofol target-site concentration using logistic models. We note that inter-individual differences in anesthetic sensitivity may be captured by varying a constant cofactor of the predicted effect-site concentration. We linked the EEG dose-response with the control input using a pharmacokinetic model. Finally, we present a simple nonlinear model predictive control in silico demonstration of how such a closed-loop system would work.
Submitted 17 December, 2019;
originally announced December 2019.
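The abstract above describes a logistic dose-response model in which inter-individual sensitivity is captured by a constant cofactor on the predicted effect-site concentration. A minimal sketch of that idea follows. The function name `eeg_state` and the parameters `lower`, `upper`, `c50`, and `slope` are assumptions made for illustration; the abstract does not give the paper's actual parameterization.

```python
import math


def eeg_state(ce, lower, upper, c50, slope, sensitivity=1.0):
    """Logistic response of an EEG-derived state to effect-site concentration.

    ce          -- predicted effect-site concentration (hypothetical units)
    lower/upper -- asymptotes of the response (assumed parameters)
    c50         -- scaled concentration giving the half-maximal response
    slope       -- steepness of the logistic curve
    sensitivity -- per-patient constant cofactor applied to ce
    """
    scaled = sensitivity * ce
    return lower + (upper - lower) / (1.0 + math.exp(-slope * (scaled - c50)))


# At scaled concentration equal to c50 the response is halfway between asymptotes.
print(eeg_state(2.0, lower=0.0, upper=1.0, c50=2.0, slope=1.5))
# A more sensitive (hypothetical) patient responds more at the same concentration.
print(eeg_state(2.0, lower=0.0, upper=1.0, c50=2.0, slope=1.5, sensitivity=1.5))
```

Placing the cofactor on the concentration rather than on `c50` keeps the population-level curve fixed and shifts each individual along it, which is one reading of the abstract's description.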
-
Deep Learning Predicts Hip Fracture using Confounding Patient and Healthcare Variables
Authors:
Marcus A. Badgeley,
John R. Zech,
Luke Oakden-Rayner,
Benjamin S. Glicksberg,
Manway Liu,
William Gale,
Michael V. McConnell,
Beth Percha,
Thomas M. Snyder,
Joel T. Dudley
Abstract:
Hip fractures are a leading cause of death and disability among older adults. Hip fractures are also the most commonly missed diagnosis on pelvic radiographs. Computer-Aided Diagnosis (CAD) algorithms have shown promise for helping radiologists detect fractures, but the image features underpinning their predictions are notoriously difficult to understand. In this study, we trained deep learning models on 17,587 radiographs to classify fracture, five patient traits, and 14 hospital process variables. All 20 variables could be predicted from a radiograph (p < 0.05), with the best performances on scanner model (AUC=1.00), scanner brand (AUC=0.98), and whether the order was marked "priority" (AUC=0.79). Fracture was predicted moderately well from the image (AUC=0.78) and better when combining image features with patient data (AUC=0.86, p=2e-9) or patient data plus hospital process features (AUC=0.91, p=1e-21). The model performance on a test set with matched patient variables was significantly lower than a random test set (AUC=0.67, p=0.003); and when the test set was matched on patient and image acquisition variables, the model performed randomly (AUC=0.52, 95% CI 0.46-0.58), indicating that these variables were the main source of the model's predictive ability overall. We also used Naive Bayes to combine evidence from image models with patient and hospital data and found their inclusion improved performance, but that this approach was nevertheless inferior to directly modeling all variables. If CAD algorithms are inexplicably leveraging patient and process variables in their predictions, it is unclear how radiologists should interpret their predictions in the context of other known patient data. Further research is needed to illuminate deep learning decision processes so that computers and clinicians can effectively cooperate.
Submitted 8 November, 2018;
originally announced November 2018.
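The abstract above mentions using Naive Bayes to combine evidence from image models with patient and hospital data. One common way to do this, sketched below, is to add up each source's log-odds shift relative to a shared prior, which assumes the sources are conditionally independent given the outcome. The function name `combine_naive_bayes` and the example probabilities are hypothetical; the paper's exact procedure is not stated in the abstract.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def combine_naive_bayes(probabilities, prior):
    """Combine per-source posterior probabilities under a Naive Bayes assumption.

    Each source contributes its log-odds shift relative to the shared prior;
    the shifts add because the sources are assumed conditionally independent.
    """
    prior_logit = logit(prior)
    total = prior_logit + sum(logit(p) - prior_logit for p in probabilities)
    return 1.0 / (1.0 + math.exp(-total))


# Hypothetical example: an image model and a patient-data model each report 0.8
# against a 0.5 prior; the combined estimate is more confident than either alone.
print(combine_naive_bayes([0.8, 0.8], prior=0.5))
```

When the sources are correlated, as image and acquisition variables may be, the independence assumption double-counts evidence, which is consistent with the abstract's finding that this approach was inferior to modeling all variables directly.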
-
Confounding variables can degrade generalization performance of radiological deep learning models
Authors:
John R. Zech,
Marcus A. Badgeley,
Manway Liu,
Anthony B. Costa,
Joseph J. Titano,
Eric K. Oermann
Abstract:
Early results in using convolutional neural networks (CNNs) on x-rays to diagnose disease have been promising, but it has not yet been shown that models trained on x-rays from one hospital or one group of hospitals will work equally well at different hospitals. Before these tools are used for computer-aided diagnosis in real-world clinical settings, we must verify their ability to generalize across a variety of hospital systems. A cross-sectional design was used to train and evaluate pneumonia screening CNNs on 158,323 chest x-rays from NIH (n=112,120 from 30,805 patients), Mount Sinai (n=42,396 from 12,904 patients), and Indiana (n=3,807 from 3,683 patients). In 3 of 5 natural comparisons, performance on chest x-rays from outside hospitals was significantly lower than on held-out x-rays from the original hospital systems. CNNs were able to detect where an x-ray was acquired (hospital system, hospital department) with extremely high accuracy and calibrate predictions accordingly. The performance of CNNs in diagnosing diseases on x-rays may reflect not only their ability to identify disease-specific imaging findings on x-rays, but also their ability to exploit confounding information. Estimates of CNN performance based on test data from hospital systems used for model training may overstate their likely real-world performance.
Submitted 12 July, 2018; v1 submitted 1 July, 2018;
originally announced July 2018.