-
Designing monitoring strategies for deployed machine learning algorithms: navigating performativity through a causal lens
Authors:
Jean Feng,
Adarsh Subbaswamy,
Alexej Gossmann,
Harvineet Singh,
Berkman Sahiner,
Mi-Ok Kim,
Gene Pennello,
Nicholas Petrick,
Romain Pirracchio,
Fan Xia
Abstract:
After a machine learning (ML)-based system is deployed, monitoring its performance is important to ensure the safety and effectiveness of the algorithm over time. When an ML algorithm interacts with its environment, the algorithm can affect the data-generating mechanism and be a major source of bias when evaluating its standalone performance, an issue known as performativity. Although prior work has shown how to validate models in the presence of performativity using causal inference techniques, there has been little work on how to monitor models in the presence of performativity. Unlike the setting of model validation, there is much less agreement on which performance metrics to monitor. Different monitoring criteria impact how interpretable the resulting test statistic is, what assumptions are needed for identifiability, and the speed of detection. When this choice is further coupled with the decision to use observational versus interventional data, ML deployment teams are faced with a multitude of monitoring options. The aim of this work is to highlight the relatively under-appreciated complexity of designing a monitoring strategy and how causal reasoning can provide a systematic framework for choosing between these options. As a motivating example, we consider an ML-based risk prediction algorithm for predicting unplanned readmissions. Bringing together tools from causal inference and statistical process control, we consider six monitoring procedures (three candidate monitoring criteria and two data sources) and investigate their operating characteristics in simulation studies. Results from this case study emphasize the seemingly simple (and obvious) fact that not all monitoring systems are created equal, which has real-world impacts on the design and documentation of ML monitoring systems.
Submitted 26 February, 2024; v1 submitted 19 November, 2023;
originally announced November 2023.
-
Is this model reliable for everyone? Testing for strong calibration
Authors:
Jean Feng,
Alexej Gossmann,
Romain Pirracchio,
Nicholas Petrick,
Gene Pennello,
Berkman Sahiner
Abstract:
In a well-calibrated risk prediction model, the average predicted probability is close to the true event rate for any given subgroup. Such models are reliable across heterogeneous populations and satisfy strong notions of algorithmic fairness. However, the task of auditing a model for strong calibration is well known to be difficult -- particularly for machine learning (ML) algorithms -- due to the sheer number of potential subgroups. As such, common practice is to only assess calibration with respect to a few predefined subgroups. Recent developments in goodness-of-fit testing offer potential solutions but are not designed for settings with weak signal or where the poorly calibrated subgroup is small, as they either overly subdivide the data or fail to divide the data at all. We introduce a new testing procedure based on the following insight: if we can reorder observations by their expected residuals, there should be a change in the association between the predicted and observed residuals along this sequence if a poorly calibrated subgroup exists. This lets us reframe the problem of calibration testing into one of changepoint detection, for which powerful methods already exist. We begin by introducing a sample-splitting procedure where a portion of the data is used to train a suite of candidate models for predicting the residual, and the remaining data are used to perform a score-based cumulative sum (CUSUM) test. To further improve power, we then extend this adaptive CUSUM test to incorporate cross-validation, while maintaining Type I error control under minimal assumptions. Compared to existing methods, the proposed procedure consistently achieved higher power in simulation studies and more than doubled the power when auditing a mortality risk prediction model.
Submitted 27 July, 2023;
originally announced July 2023.
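The reordering idea in this abstract can be illustrated with a much simplified sketch. It is not the authors' procedure: it assumes a single ordering score that was already fitted on a separate data split (`order_score`, a hypothetical input), uses a plain maximum-absolute-partial-sum statistic rather than the score-based CUSUM of the paper, and calibrates it by simulating outcomes under the null that the predicted probabilities are correct. The function names are illustrative.

```python
import numpy as np

def cusum_stat(resid, order_score):
    """Max absolute partial sum of residuals after sorting by order_score (descending)."""
    idx = np.argsort(-order_score)          # largest predicted residual first
    partial = np.cumsum(resid[idx])         # running sum along the reordered sequence
    return np.max(np.abs(partial)) / np.sqrt(len(resid))

def cusum_calibration_test(y, p, order_score, n_sim=2000, seed=0):
    """Monte Carlo p-value for H0: y_i ~ Bernoulli(p_i), with the ordering held fixed."""
    rng = np.random.default_rng(seed)
    observed = cusum_stat(y - p, order_score)
    null = np.empty(n_sim)
    for b in range(n_sim):
        y_sim = rng.binomial(1, p)          # outcomes drawn from a perfectly calibrated model
        null[b] = cusum_stat(y_sim - p, order_score)
    pval = (1 + np.sum(null >= observed)) / (n_sim + 1)
    return observed, pval
```

If a poorly calibrated subgroup is ranked near the front of the sequence, the partial sums drift away from zero early and the statistic becomes large relative to its simulated null distribution.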
-
Evaluation of wait time saving effectiveness of triage algorithms
Authors:
Yee Lam Elim Thompson,
Gary M Levine,
Weijie Chen,
Berkman Sahiner,
Qin Li,
Nicholas Petrick,
Jana G Delfino,
Miguel A Lago,
Qian Cao,
Frank W Samuelson
Abstract:
In the past decade, artificial intelligence (AI) algorithms have shown promise for transforming many aspects of healthcare. One application is to triage patients' radiological medical images based on the algorithm's binary outputs. Such AI-based prioritization software is known as computer-aided triage and notification (CADt). The main benefit of these devices is to speed up radiological review of images with time-sensitive findings. However, as CADt devices become more common in clinical workflows, there is still a lack of quantitative methods to evaluate a device's effectiveness in saving patients' waiting times. In this paper, we present a mathematical framework based on queueing theory to calculate the average waiting time per patient image before and after a CADt device is used. We study four workflow models with multiple radiologists (servers) and priority classes for a range of AI diagnostic performance, radiologists' reading rates, and patient image (customer) arrival rates. Due to model complexity, an approximation method known as the Recursive Dimensionality Reduction technique is applied. We define a performance metric to measure the device's time-saving effectiveness. A software tool is developed to simulate clinical workflow of image review/interpretation, to verify theoretical results, and to provide confidence intervals of the performance metric we defined. It is shown quantitatively that a triage device is more effective in a busy, short-staffed setting, which is consistent with our clinical intuition and simulation results. Although this work is motivated by the need for evaluating CADt devices, the framework we present in this paper can be applied to any algorithm that prioritizes customers based on its binary outputs.
Submitted 13 March, 2023;
originally announced March 2023.
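The before/after waiting-time comparison described in this abstract can be sketched for the simplest possible case. The sketch assumes a single radiologist, Poisson arrivals, exponential reading times with the same rate for every image, and non-preemptive priority for AI-positive images; it uses the standard M/M/1 and two-class priority-queue formulas, not the multi-radiologist models or the Recursive Dimensionality Reduction approximation of the paper. The "time saved for diseased images" quantity and all function names are illustrative assumptions, not the paper's metric.

```python
def fifo_wait(lam, mu):
    """Mean wait in queue for an M/M/1 first-come-first-served system."""
    rho = lam / mu
    assert rho < 1, "queue must be stable"
    return rho / (mu - lam)

def priority_waits(lam_high, lam_low, mu):
    """Mean waits (high, low) under non-preemptive priority with a common service rate."""
    lam = lam_high + lam_low
    rho, rho_high = lam / mu, lam_high / mu
    assert rho < 1, "queue must be stable"
    w0 = lam / mu**2                        # mean residual work seen by an arriving image
    w_high = w0 / (1 - rho_high)
    w_low = w0 / ((1 - rho_high) * (1 - rho))
    return w_high, w_low

def time_saved_for_diseased(lam, mu, prevalence, sensitivity, specificity):
    """Reduction in mean wait for diseased images when AI-positive images jump the queue."""
    frac_flagged = prevalence * sensitivity + (1 - prevalence) * (1 - specificity)
    w_high, w_low = priority_waits(lam * frac_flagged, lam * (1 - frac_flagged), mu)
    # A diseased image is flagged with probability equal to the sensitivity.
    wait_with_ai = sensitivity * w_high + (1 - sensitivity) * w_low
    return fifo_wait(lam, mu) - wait_with_ai

busy = time_saved_for_diseased(lam=0.8, mu=1.0, prevalence=0.1, sensitivity=0.95, specificity=0.9)
quiet = time_saved_for_diseased(lam=0.3, mu=1.0, prevalence=0.1, sensitivity=0.95, specificity=0.9)
```

Under these assumptions `busy` exceeds `quiet`, which matches the abstract's qualitative finding that triage helps more when the reading room is heavily loaded.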
-
Monitoring machine learning (ML)-based risk prediction algorithms in the presence of confounding medical interventions
Authors:
Jean Feng,
Alexej Gossmann,
Gene Pennello,
Nicholas Petrick,
Berkman Sahiner,
Romain Pirracchio
Abstract:
Performance monitoring of machine learning (ML)-based risk prediction models in healthcare is complicated by the issue of confounding medical interventions (CMI): when an algorithm predicts a patient to be at high risk for an adverse event, clinicians are more likely to administer prophylactic treatment and alter the very target that the algorithm aims to predict. A simple approach is to ignore CMI and monitor only the untreated patients, whose outcomes remain unaltered. In general, ignoring CMI may inflate Type I error because (i) untreated patients disproportionately represent those with low predicted risk and (ii) evolution in both the model and clinician trust in the model can induce complex dependencies that violate standard assumptions. Nevertheless, we show that valid inference is still possible if one monitors conditional performance and if either conditional exchangeability or time-constant selection bias holds. Specifically, we develop a new score-based cumulative sum (CUSUM) monitoring procedure with dynamic control limits. Through simulations, we demonstrate the benefits of combining model updating with monitoring and investigate how over-trust in a prediction model may delay detection of performance deterioration. Finally, we illustrate how these monitoring methods can be used to detect calibration decay of an ML-based risk calculator for postoperative nausea and vomiting during the COVID-19 pandemic.
Submitted 14 April, 2023; v1 submitted 17 November, 2022;
originally announced November 2022.
-
Sequential algorithmic modification with test data reuse
Authors:
Jean Feng,
Gene Pennello,
Nicholas Petrick,
Berkman Sahiner,
Romain Pirracchio,
Alexej Gossmann
Abstract:
After initial release of a machine learning algorithm, the model can be fine-tuned by retraining on subsequently gathered data, adding newly discovered features, or more. Each modification introduces a risk of deteriorating performance and must be validated on a test dataset. It may not always be practical to assemble a new dataset for testing each modification, especially when most modifications are minor or are implemented in rapid succession. Recent works have shown how one can repeatedly test modifications on the same dataset and protect against overfitting by (i) discretizing test results along a grid and (ii) applying a Bonferroni correction to adjust for the total number of modifications considered by an adaptive developer. However, the standard Bonferroni correction is overly conservative when most modifications are beneficial and/or highly correlated. This work investigates more powerful approaches using alpha-recycling and sequentially-rejective graphical procedures (SRGPs). We introduce novel extensions that account for correlation between adaptively chosen algorithmic modifications. In empirical analyses, the SRGPs control the error rate of approving unacceptable modifications and approve a substantially higher number of beneficial modifications than previous approaches.
Submitted 21 March, 2022;
originally announced March 2022.
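The contrast between a Bonferroni correction and alpha-recycling can be shown on a toy case. The sketch below assumes a fixed, pre-specified sequence of p-values, one per proposed modification, and a simple chain in which the significance level of each approved modification is passed to the next one. It does not cover the adaptive developer, the discretization of test results, or the correlation-aware extensions described in the abstract, and the function names are illustrative.

```python
def bonferroni_approvals(pvals, alpha=0.05):
    """Approve a modification only if its p-value clears alpha / (number of modifications)."""
    level = alpha / len(pvals)
    return [p <= level for p in pvals]

def alpha_recycling_approvals(pvals, alpha=0.05):
    """Chain-style recycling: an approved test hands its whole level to the next test."""
    base = alpha / len(pvals)               # each test starts with an equal share of alpha
    carried = 0.0                           # level inherited from the previous approval
    decisions = []
    for p in pvals:
        level = base + carried
        if p <= level:
            decisions.append(True)
            carried = level                 # recycle the spent level forward
        else:
            decisions.append(False)
            carried = 0.0                   # an unapproved test keeps its level
    return decisions

pvals = [0.008, 0.015, 0.025, 0.2, 0.012]
bonf = bonferroni_approvals(pvals)          # only the first modification is approved
recycled = alpha_recycling_approvals(pvals) # the first three modifications are approved
```

When early modifications are clearly beneficial, recycling lets later tests run at a larger level than `alpha / len(pvals)`, which is the source of the extra power; the levels in use never sum to more than `alpha`.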
-
Bayesian logistic regression for online recalibration and revision of risk prediction models with performance guarantees
Authors:
Jean Feng,
Alexej Gossmann,
Berkman Sahiner,
Romain Pirracchio
Abstract:
After deploying a clinical prediction model, subsequently collected data can be used to fine-tune its predictions and adapt to temporal shifts. Because model updating carries risks of over-updating/fitting, we study online methods with performance guarantees. We introduce two procedures for continual recalibration or revision of an underlying prediction model: Bayesian logistic regression (BLR) and a Markov variant that explicitly models distribution shifts (MarBLR). We perform empirical evaluation via simulations and a real-world study predicting COPD risk. We derive "Type I and II" regret bounds, which guarantee the procedures are non-inferior to a static model and competitive with an oracle logistic reviser in terms of the average loss. Both procedures consistently outperformed the static model and other online logistic revision methods. In simulations, the average estimated calibration index (aECI) of the original model was 0.828 (95%CI 0.818-0.938). Online recalibration using BLR and MarBLR improved the aECI, attaining 0.265 (95%CI 0.230-0.300) and 0.241 (95%CI 0.216-0.266), respectively. When performing more extensive logistic model revisions, BLR and MarBLR increased the average AUC (aAUC) from 0.767 (95%CI 0.765-0.769) to 0.800 (95%CI 0.798-0.802) and 0.799 (95%CI 0.797-0.801), respectively, in stationary settings and protected against substantial model decay. In the COPD study, BLR and MarBLR dynamically combined the original model with a continually-refitted gradient boosted tree to achieve aAUCs of 0.924 (95%CI 0.913-0.935) and 0.925 (95%CI 0.914-0.935), compared to the static model's aAUC of 0.904 (95%CI 0.892-0.916). Despite its simplicity, BLR is highly competitive with MarBLR. MarBLR outperforms BLR when its prior better reflects the data. BLR and MarBLR can improve the transportability of clinical prediction models and maintain their performance over time.
Submitted 13 October, 2021;
originally announced October 2021.
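As a rough illustration of logistic recalibration with a Bayesian prior, the sketch below computes a single batch maximum a posteriori (MAP) estimate of an intercept and slope applied to the logit of the original prediction. It assumes an independent Gaussian prior centered on the identity recalibration (intercept 0, slope 1) and fits by Newton's method. It is not the online BLR or MarBLR procedure from the abstract, which update sequentially and come with regret bounds; the function names and the prior variance are assumptions.

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def map_recalibration(p_orig, y, prior_mean=(0.0, 1.0), prior_var=1.0, n_iter=25):
    """MAP intercept and slope for P(y=1) = sigmoid(a + b * logit(p_orig))."""
    X = np.column_stack([np.ones_like(p_orig), logit(p_orig)])
    m0 = np.asarray(prior_mean, dtype=float)
    theta = m0.copy()                       # start at "no recalibration"
    for _ in range(n_iter):
        mu = sigmoid(X @ theta)
        # Gradient and Hessian of the log-posterior (log-likelihood plus Gaussian log-prior).
        grad = X.T @ (y - mu) - (theta - m0) / prior_var
        hess = -(X.T * (mu * (1 - mu))) @ X - np.eye(2) / prior_var
        theta = theta - np.linalg.solve(hess, grad)   # Newton step
    return theta

def recalibrate(p_orig, theta):
    """Apply the fitted intercept and slope to the original predictions."""
    return sigmoid(theta[0] + theta[1] * logit(p_orig))
```

With little data the estimate stays near the prior mean, so the original model is left almost unchanged; as data accumulate, the likelihood dominates and the fitted intercept and slope track the observed miscalibration.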