-
Mitigating Overconfidence in Out-of-Distribution Detection by Capturing Extreme Activations
Authors:
Mohammad Azizmalayeri,
Ameen Abu-Hanna,
Giovanni Cinà
Abstract:
Detecting out-of-distribution (OOD) instances is crucial for the reliable deployment of machine learning models in real-world scenarios. OOD inputs are commonly expected to cause a more uncertain prediction in the primary task; however, there are OOD cases for which the model returns a highly confident prediction. This phenomenon, denoted as "overconfidence", presents a challenge to OOD detection.…
▽ More
Detecting out-of-distribution (OOD) instances is crucial for the reliable deployment of machine learning models in real-world scenarios. OOD inputs are commonly expected to cause a more uncertain prediction in the primary task; however, there are OOD cases for which the model returns a highly confident prediction. This phenomenon, denoted as "overconfidence", presents a challenge to OOD detection. Specifically, theoretical evidence indicates that overconfidence is an intrinsic property of certain neural network architectures, leading to poor OOD detection. In this work, we address this issue by measuring extreme activation values in the penultimate layer of neural networks and then leverage this proxy of overconfidence to improve on several OOD detection baselines. We test our method on a wide array of experiments spanning synthetic data and real-world data, tabular and image datasets, multiple architectures such as ResNet and Transformer, different training loss functions, and include the scenarios examined in previous theoretical work. Compared to the baselines, our method often grants substantial improvements, with double-digit increases in OOD detection AUC, and it does not damage performance in any scenario.
△ Less
Submitted 21 May, 2024;
originally announced May 2024.
-
When accurate prediction models yield harmful self-fulfilling prophecies
Authors:
Wouter A. C. van Amsterdam,
Nan van Geloven,
Jesse H. Krijthe,
Rajesh Ranganath,
Giovanni Ciná
Abstract:
Objective: Prediction models are popular in medical research and practice. By predicting an outcome of interest for specific patients, these models may help inform difficult treatment decisions, and are often hailed as the poster children for personalized, data-driven healthcare. Many prediction models are deployed for decision support based on their prediction accuracy in validation studies. We i…
▽ More
Objective: Prediction models are popular in medical research and practice. By predicting an outcome of interest for specific patients, these models may help inform difficult treatment decisions, and are often hailed as the poster children for personalized, data-driven healthcare. Many prediction models are deployed for decision support based on their prediction accuracy in validation studies. We investigate whether this is a safe and valid approach.
Materials and Methods: We show that using prediction models for decision making can lead to harmful decisions, even when the predictions exhibit good discrimination after deployment. These models are harmful self-fulfilling prophecies: their deployment harms a group of patients but the worse outcome of these patients does not invalidate the predictive power of the model.
Results: Our main result is a formal characterization of a set of such prediction models. Next we show that models that are well calibrated before and after deployment are useless for decision making as they made no change in the data distribution.
Discussion: Our results point to the need to revise standard practices for validation, deployment and evaluation of prediction models that are used in medical decisions.
Conclusion: Outcome prediction models can yield harmful self-fulfilling prophecies when used for decision making, a new perspective on prediction model development, deployment and monitoring is needed.
△ Less
Submitted 8 February, 2024; v1 submitted 2 December, 2023;
originally announced December 2023.
-
Causal prediction models for medication safety monitoring: The diagnosis of vancomycin-induced acute kidney injury
Authors:
Izak Yasrebi-de Kom,
Joanna Klopotowska,
Dave Dongelmans,
Nicolette De Keizer,
Kitty Jager,
Ameen Abu-Hanna,
Giovanni Cinà
Abstract:
The current best practice approach for the retrospective diagnosis of adverse drug events (ADEs) in hospitalized patients relies on a full patient chart review and a formal causality assessment by multiple medical experts. This evaluation serves to qualitatively estimate the probability of causation (PC); the probability that a drug was a necessary cause of an adverse event. This practice is manua…
▽ More
The current best practice approach for the retrospective diagnosis of adverse drug events (ADEs) in hospitalized patients relies on a full patient chart review and a formal causality assessment by multiple medical experts. This evaluation serves to qualitatively estimate the probability of causation (PC); the probability that a drug was a necessary cause of an adverse event. This practice is manual, resource intensive and prone to human biases, and may thus benefit from data-driven decision support. Here, we pioneer a causal modeling approach using observational data to estimate a lower bound of the PC (PC$_{low}$). This method includes two key causal inference components: (1) the target trial emulation framework and (2) estimation of individualized treatment effects using machine learning. We apply our method to the clinically relevant use-case of vancomycin-induced acute kidney injury in intensive care patients, and compare our causal model-based PC$_{low}$ estimates to qualitative estimates of the PC provided by a medical expert. Important limitations and potential improvements are discussed, and we conclude that future improved causal models could provide essential data-driven support for medication safety monitoring in hospitalized patients.
△ Less
Submitted 15 November, 2023;
originally announced November 2023.
-
Unmasking the Chameleons: A Benchmark for Out-of-Distribution Detection in Medical Tabular Data
Authors:
Mohammad Azizmalayeri,
Ameen Abu-Hanna,
Giovanni Ciná
Abstract:
Despite their success, Machine Learning (ML) models do not generalize effectively to data not originating from the training distribution. To reliably employ ML models in real-world healthcare systems and avoid inaccurate predictions on out-of-distribution (OOD) data, it is crucial to detect OOD samples. Numerous OOD detection approaches have been suggested in other fields - especially in computer…
▽ More
Despite their success, Machine Learning (ML) models do not generalize effectively to data not originating from the training distribution. To reliably employ ML models in real-world healthcare systems and avoid inaccurate predictions on out-of-distribution (OOD) data, it is crucial to detect OOD samples. Numerous OOD detection approaches have been suggested in other fields - especially in computer vision - but it remains unclear whether the challenge is resolved when dealing with medical tabular data. To answer this pressing need, we propose an extensive reproducible benchmark to compare different methods across a suite of tests including both near and far OODs. Our benchmark leverages the latest versions of eICU and MIMIC-IV, two public datasets encompassing tens of thousands of ICU patients in several hospitals. We consider a wide array of density-based methods and SOTA post-hoc detectors across diverse predictive architectures, including MLP, ResNet, and Transformer. Our findings show that i) the problem appears to be solved for far-OODs, but remains open for near-OODs; ii) post-hoc methods alone perform poorly, but improve substantially when coupled with distance-based mechanisms; iii) the transformer architecture is far less overconfident compared to MLP and ResNet.
△ Less
Submitted 28 September, 2023;
originally announced September 2023.
-
Fixing confirmation bias in feature attribution methods via semantic match
Authors:
Giovanni Cinà,
Daniel Fernandez-Llaneza,
Ludovico Deponte,
Nishant Mishra,
Tabea E. Röber,
Sandro Pezzelle,
Iacer Calixto,
Rob Goedhart,
Ş. İlker Birbil
Abstract:
Feature attribution methods have become a staple method to disentangle the complex behavior of black box models. Despite their success, some scholars have argued that such methods suffer from a serious flaw: they do not allow a reliable interpretation in terms of human concepts. Simply put, visualizing an array of feature contributions is not enough for humans to conclude something about a model's…
▽ More
Feature attribution methods have become a staple method to disentangle the complex behavior of black box models. Despite their success, some scholars have argued that such methods suffer from a serious flaw: they do not allow a reliable interpretation in terms of human concepts. Simply put, visualizing an array of feature contributions is not enough for humans to conclude something about a model's internal representations, and confirmation bias can trick users into false beliefs about model behavior. We argue that a structured approach is required to test whether our hypotheses on the model are confirmed by the feature attributions. This is what we call the "semantic match" between human concepts and (sub-symbolic) explanations. Building on the conceptual framework put forward in Cinà et al. [2023], we propose a structured approach to evaluate semantic match in practice. We showcase the procedure in a suite of experiments spanning tabular and image data, and show how the assessment of semantic match can give insight into both desirable (e.g., focusing on an object relevant for prediction) and undesirable model behaviors (e.g., focusing on a spurious correlation). We couple our experimental results with an analysis on the metrics to measure semantic match, and argue that this approach constitutes the first step towards resolving the issue of confirmation bias in XAI.
△ Less
Submitted 26 February, 2024; v1 submitted 3 July, 2023;
originally announced July 2023.
-
Semantic match: Debugging feature attribution methods in XAI for healthcare
Authors:
Giovanni Cinà,
Tabea E. Röber,
Rob Goedhart,
Ş. İlker Birbil
Abstract:
The recent spike in certified Artificial Intelligence (AI) tools for healthcare has renewed the debate around adoption of this technology. One thread of such debate concerns Explainable AI (XAI) and its promise to render AI devices more transparent and trustworthy. A few voices active in the medical AI space have expressed concerns on the reliability of Explainable AI techniques and especially fea…
▽ More
The recent spike in certified Artificial Intelligence (AI) tools for healthcare has renewed the debate around adoption of this technology. One thread of such debate concerns Explainable AI (XAI) and its promise to render AI devices more transparent and trustworthy. A few voices active in the medical AI space have expressed concerns on the reliability of Explainable AI techniques and especially feature attribution methods, questioning their use and inclusion in guidelines and standards. Despite valid concerns, we argue that existing criticism on the viability of post-hoc local explainability methods throws away the baby with the bathwater by generalizing a problem that is specific to image data. We begin by characterizing the problem as a lack of semantic match between explanations and human understanding. To understand when feature importance can be used reliably, we introduce a distinction between feature importance of low- and high-level features. We argue that for data types where low-level features come endowed with a clear semantics, such as tabular data like Electronic Health Records (EHRs), semantic match can be obtained, and thus feature attribution methods can still be employed in a meaningful and useful way. Finally, we sketch a procedure to test whether semantic match has been achieved.
△ Less
Submitted 24 February, 2023; v1 submitted 5 January, 2023;
originally announced January 2023.
-
Why we do need Explainable AI for Healthcare
Authors:
Giovanni Cinà,
Tabea Röber,
Rob Goedhart,
Ilker Birbil
Abstract:
The recent spike in certified Artificial Intelligence (AI) tools for healthcare has renewed the debate around adoption of this technology. One thread of such debate concerns Explainable AI and its promise to render AI devices more transparent and trustworthy. A few voices active in the medical AI space have expressed concerns on the reliability of Explainable AI techniques, questioning their use a…
▽ More
The recent spike in certified Artificial Intelligence (AI) tools for healthcare has renewed the debate around adoption of this technology. One thread of such debate concerns Explainable AI and its promise to render AI devices more transparent and trustworthy. A few voices active in the medical AI space have expressed concerns on the reliability of Explainable AI techniques, questioning their use and inclusion in guidelines and standards. Revisiting such criticisms, this article offers a balanced and comprehensive perspective on the utility of Explainable AI, focusing on the specificity of clinical applications of AI and placing them in the context of healthcare interventions. Against its detractors and despite valid concerns, we argue that the Explainable AI research program is still central to human-machine interaction and ultimately our main tool against loss of control, a danger that cannot be prevented by rigorous clinical validation alone.
△ Less
Submitted 30 June, 2022;
originally announced June 2022.
-
Out-of-Distribution Detection for Medical Applications: Guidelines for Practical Evaluation
Authors:
Karina Zadorozhny,
Patrick Thoral,
Paul Elbers,
Giovanni Cinà
Abstract:
Detection of Out-of-Distribution (OOD) samples in real time is a crucial safety check for deployment of machine learning models in the medical field. Despite a growing number of uncertainty quantification techniques, there is a lack of evaluation guidelines on how to select OOD detection methods in practice. This gap impedes implementation of OOD detection methods for real-world applications. Here…
▽ More
Detection of Out-of-Distribution (OOD) samples in real time is a crucial safety check for deployment of machine learning models in the medical field. Despite a growing number of uncertainty quantification techniques, there is a lack of evaluation guidelines on how to select OOD detection methods in practice. This gap impedes implementation of OOD detection methods for real-world applications. Here, we propose a series of practical considerations and tests to choose the best OOD detector for a specific medical dataset. These guidelines are illustrated on a real-life use case of Electronic Health Records (EHR). Our results can serve as a guide for implementation of OOD detection methods in clinical practice, mitigating risks associated with the use of machine learning models in healthcare.
△ Less
Submitted 5 May, 2022; v1 submitted 30 September, 2021;
originally announced September 2021.
-
A pragmatic approach to estimating average treatment effects from EHR data: the effect of prone positioning on mechanically ventilated COVID-19 patients
Authors:
Adam Izdebski,
Patrick J. Thoral,
Robbert C. A. Lalisang,
Dean M. McHugh,
Diederik Gommers,
Olaf L. Cremer,
Rob J. Bosman,
Sander Rigter,
Evert-Jan Wils,
Tim Frenzel,
Dave A. Dongelmans,
Remko de Jong,
Marco A. A. Peters,
Marlijn J. A Kamps,
Dharmanand Ramnarain,
Ralph Nowitzky,
Fleur G. C. A. Nooteboom,
Wouter de Ruijter,
Louise C. Urlings-Strop,
Ellen G. M. Smit,
D. Jannet Mehagnoul-Schipper,
Tom Dormans,
Cornelis P. C. de Jager,
Stefaan H. A. Hendriks,
Sefanja Achterberg
, et al. (21 additional authors not shown)
Abstract:
Despite the recent progress in the field of causal inference, to date there is no agreed upon methodology to glean treatment effect estimation from observational data. The consequence on clinical practice is that, when lacking results from a randomized trial, medical personnel is left without guidance on what seems to be effective in a real-world scenario. This article proposes a pragmatic methodo…
▽ More
Despite the recent progress in the field of causal inference, to date there is no agreed upon methodology to glean treatment effect estimation from observational data. The consequence on clinical practice is that, when lacking results from a randomized trial, medical personnel is left without guidance on what seems to be effective in a real-world scenario. This article proposes a pragmatic methodology to obtain preliminary but robust estimation of treatment effect from observational studies, to provide front-line clinicians with a degree of confidence in their treatment strategy. Our study design is applied to an open problem, the estimation of treatment effect of the proning maneuver on COVID-19 Intensive Care patients.
△ Less
Submitted 3 December, 2021; v1 submitted 14 September, 2021;
originally announced September 2021.
-
Know Your Limits: Uncertainty Estimation with ReLU Classifiers Fails at Reliable OOD Detection
Authors:
Dennis Ulmer,
Giovanni Cinà
Abstract:
A crucial requirement for reliable deployment of deep learning models for safety-critical applications is the ability to identify out-of-distribution (OOD) data points, samples which differ from the training data and on which a model might underperform. Previous work has attempted to tackle this problem using uncertainty estimation techniques. However, there is empirical evidence that a large fami…
▽ More
A crucial requirement for reliable deployment of deep learning models for safety-critical applications is the ability to identify out-of-distribution (OOD) data points, samples which differ from the training data and on which a model might underperform. Previous work has attempted to tackle this problem using uncertainty estimation techniques. However, there is empirical evidence that a large family of these techniques do not detect OOD reliably in classification tasks.
This paper gives a theoretical explanation for said experimental findings and illustrates it on synthetic data. We prove that such techniques are not able to reliably identify OOD samples in a classification setting, since their level of confidence is generalized to unseen areas of the feature space. This result stems from the interplay between the representation of ReLU networks as piece-wise affine transformations, the saturating nature of activation functions like softmax, and the most widely-used uncertainty metrics.
△ Less
Submitted 10 June, 2021; v1 submitted 9 December, 2020;
originally announced December 2020.
-
Trust Issues: Uncertainty Estimation Does Not Enable Reliable OOD Detection On Medical Tabular Data
Authors:
Dennis Ulmer,
Lotta Meijerink,
Giovanni Cinà
Abstract:
When deploying machine learning models in high-stakes real-world environments such as health care, it is crucial to accurately assess the uncertainty concerning a model's prediction on abnormal inputs. However, there is a scarcity of literature analyzing this problem on medical data, especially on mixed-type tabular data such as Electronic Health Records. We close this gap by presenting a series o…
▽ More
When deploying machine learning models in high-stakes real-world environments such as health care, it is crucial to accurately assess the uncertainty concerning a model's prediction on abnormal inputs. However, there is a scarcity of literature analyzing this problem on medical data, especially on mixed-type tabular data such as Electronic Health Records. We close this gap by presenting a series of tests including a large variety of contemporary uncertainty estimation techniques, in order to determine whether they are able to identify out-of-distribution (OOD) patients. In contrast to previous work, we design tests on realistic and clinically relevant OOD groups, and run experiments on real-world medical data. We find that almost all techniques fail to achieve convincing results, partly disagreeing with earlier findings.
△ Less
Submitted 6 November, 2020;
originally announced November 2020.
-
Uncertainty estimation for classification and risk prediction on medical tabular data
Authors:
Lotta Meijerink,
Giovanni Cinà,
Michele Tonutti
Abstract:
In a data-scarce field such as healthcare, where models often deliver predictions on patients with rare conditions, the ability to measure the uncertainty of a model's prediction could potentially lead to improved effectiveness of decision support tools and increased user trust. This work advances the understanding of uncertainty estimation for classification and risk prediction on medical tabular…
▽ More
In a data-scarce field such as healthcare, where models often deliver predictions on patients with rare conditions, the ability to measure the uncertainty of a model's prediction could potentially lead to improved effectiveness of decision support tools and increased user trust. This work advances the understanding of uncertainty estimation for classification and risk prediction on medical tabular data, in a two-fold way. First, we expand and refine the set of heuristics to select an uncertainty estimation technique, introducing tests for clinically-relevant scenarios such as generalization to uncommon pathologies, changes in clinical protocol and simulations of corrupted data. We furthermore differentiate these heuristics depending on the clinical use-case. Second, we observe that ensembles and related techniques perform poorly when it comes to detecting out-of-domain examples, a critical task which is carried out more successfully by auto-encoders. These remarks are enriched by considerations of the interplay of uncertainty estimation with class imbalance, post-modeling calibration and other modeling procedures. Our findings are supported by an array of experiments on toy and real-world data.
△ Less
Submitted 23 May, 2020; v1 submitted 13 April, 2020;
originally announced April 2020.
-
Bayesian Modelling in Practice: Using Uncertainty to Improve Trustworthiness in Medical Applications
Authors:
David Ruhe,
Giovanni Cinà,
Michele Tonutti,
Daan de Bruin,
Paul Elbers
Abstract:
The Intensive Care Unit (ICU) is a hospital department where machine learning has the potential to provide valuable assistance in clinical decision making. Classical machine learning models usually only provide point-estimates and no uncertainty of predictions. In practice, uncertain predictions should be presented to doctors with extra care in order to prevent potentially catastrophic treatment d…
▽ More
The Intensive Care Unit (ICU) is a hospital department where machine learning has the potential to provide valuable assistance in clinical decision making. Classical machine learning models usually only provide point-estimates and no uncertainty of predictions. In practice, uncertain predictions should be presented to doctors with extra care in order to prevent potentially catastrophic treatment decisions. In this work we show how Bayesian modelling and the predictive uncertainty that it provides can be used to mitigate risk of misguided prediction and to detect out-of-domain examples in a medical setting. We derive analytically a bound on the prediction loss with respect to predictive uncertainty. The bound shows that uncertainty can mitigate loss. Furthermore, we apply a Bayesian Neural Network to the MIMIC-III dataset, predicting risk of mortality of ICU patients. Our empirical results show that uncertainty can indeed prevent potential errors and reliably identifies out-of-domain patients. These results suggest that Bayesian predictive uncertainty can greatly improve trustworthiness of machine learning models in high-risk settings such as the ICU.
△ Less
Submitted 20 June, 2019;
originally announced June 2019.