-
missForestPredict -- Missing data imputation for prediction settings
Authors:
Elena Albu,
Shan Gao,
Laure Wynants,
Ben Van Calster
Abstract:
Prediction models are used to predict an outcome based on input variables. Missing data in input variables often occurs at model development and at prediction time. The missForestPredict R package proposes an adaptation of the missForest imputation algorithm that is fast, user-friendly and tailored for prediction settings. The algorithm iteratively imputes variables using random forests until a convergence criterion (unified for continuous and categorical variables and based on the out-of-bag error) is met. The imputation models are saved for each variable and iteration and can be applied later to new observations at prediction time. The missForestPredict package offers extended error monitoring, control over variables used in the imputation and custom initialization. This allows users to tailor the imputation to their specific needs. The missForestPredict algorithm is compared to mean/mode imputation, linear regression imputation, mice, k-nearest neighbours, bagging, miceRanger and IterativeImputer on eight simulated datasets with simulated missingness (48 scenarios) and eight large public datasets using different prediction models. missForestPredict provides competitive results in prediction settings within short computation times.
Submitted 2 July, 2024;
originally announced July 2024.
-
A comparison of regression models for static and dynamic prediction of a prognostic outcome during admission in electronic health care records
Authors:
Shan Gao,
Elena Albu,
Hein Putter,
Pieter Stijnen,
Frank Rademakers,
Veerle Cossey,
Yves Debaveye,
Christel Janssens,
Ben Van Calster,
Laure Wynants
Abstract:
Objective: Hospitals register information in the electronic health records (EHR) continuously until discharge or death. As such, there is no censoring for in-hospital outcomes. We aimed to compare different dynamic regression modeling approaches to predict central line-associated bloodstream infections (CLABSI) in EHR while accounting for competing events precluding CLABSI. Materials and Methods: We analyzed data from 30,862 catheter episodes at University Hospitals Leuven from 2012 and 2013 to predict 7-day risk of CLABSI. Competing events are discharge and death. Static models at catheter onset included logistic, multinomial logistic, Cox, cause-specific hazard, and Fine-Gray regression. Dynamic models updated predictions daily up to 30 days after catheter onset (i.e. landmarks 0 to 30 days), and included landmark supermodel extensions of the static models, separate Fine-Gray models per landmark time, and regularized multi-task learning (RMTL). Model performance was assessed using 100 random 2:1 train-test splits. Results: The Cox model performed worst of all static models in terms of area under the receiver operating characteristic curve (AUC) and calibration. Dynamic landmark supermodels reached peak AUCs between 0.741 and 0.747 at landmark 5. The Cox landmark supermodel had the worst AUCs (<=0.731) and calibration up to landmark 7. Separate Fine-Gray models per landmark performed worst for later landmarks, when the number of patients at risk was low. Discussion and Conclusion: Categorical and time-to-event approaches had similar performance in the static and dynamic settings, except Cox models. Ignoring competing risks caused problems for risk prediction in the time-to-event framework (Cox), but not in the categorical framework (logistic regression).
Submitted 6 May, 2024; v1 submitted 3 May, 2024;
originally announced May 2024.
-
The harms of class imbalance corrections for machine learning based prediction models: a simulation study
Authors:
Alex Carriero,
Kim Luijken,
Anne de Hond,
Karel GM Moons,
Ben van Calster,
Maarten van Smeden
Abstract:
Risk prediction models are increasingly used in healthcare to aid in clinical decision making. In most clinical contexts, model calibration (i.e., assessing the reliability of risk estimates) is critical. Data available for model development are often not perfectly balanced with respect to the modeled outcome (i.e., individuals with vs. without the event of interest are not equally represented in the data). It is common for researchers to correct this class imbalance, yet the effect of such imbalance corrections on the calibration of machine learning models is largely unknown. We studied the effect of imbalance corrections on model calibration for a variety of machine learning algorithms. Using extensive Monte Carlo simulations we compared the out-of-sample predictive performance of models developed with an imbalance correction to those developed without a correction for class imbalance across different data-generating scenarios (varying sample size, the number of predictors and event fraction). Our findings were illustrated in a case study using MIMIC-III data. In all simulation scenarios, prediction models developed without a correction for class imbalance consistently had equal or better calibration performance than prediction models developed with a correction for class imbalance. The miscalibration introduced by correcting for class imbalance was characterized by an over-estimation of risk and could not always be corrected with re-calibration. Correcting for class imbalance is not always necessary and may even be harmful for clinical prediction models which aim to produce reliable risk estimates on an individual basis.
Submitted 30 April, 2024;
originally announced April 2024.
-
Comparison of static and dynamic random forests models for EHR data in the presence of competing risks: predicting central line-associated bloodstream infection
Authors:
Elena Albu,
Shan Gao,
Pieter Stijnen,
Frank Rademakers,
Christel Janssens,
Veerle Cossey,
Yves Debaveye,
Laure Wynants,
Ben Van Calster
Abstract:
Prognostic outcomes related to hospital admissions typically do not suffer from censoring, and can be modeled either categorically or as time-to-event. Competing events are common but often ignored. We compared the performance of random forest (RF) models to predict the risk of central line-associated bloodstream infections (CLABSI) using different outcome operationalizations. We included data from 27478 admissions to the University Hospitals Leuven, covering 30862 catheter episodes (970 CLABSI, 1466 deaths and 28426 discharges) to build static and dynamic RF models for binary (CLABSI vs no CLABSI), multinomial (CLABSI, discharge, death or no event), survival (time to CLABSI) and competing risks (time to CLABSI, discharge or death) outcomes to predict the 7-day CLABSI risk. We evaluated model performance across 100 train/test splits. Performance of binary, multinomial and competing risks models was similar: AUROC was 0.74 for baseline predictions, rose to 0.78 for predictions at day 5 in the catheter episode, and decreased thereafter. Survival models overestimated the risk of CLABSI (E:O ratios between 1.2 and 1.6), and had AUROCs about 0.01 lower than other models. Binary and multinomial models had lowest computation times. Models including multiple outcome events (multinomial and competing risks) display a different internal structure compared to binary and survival models. In the absence of censoring, complex modelling choices do not considerably improve the predictive performance compared to a binary model for CLABSI prediction in our studied settings. Survival models censoring the competing events at their time of occurrence should be avoided.
Submitted 24 May, 2024; v1 submitted 24 April, 2024;
originally announced April 2024.
-
Understanding random forests and overfitting: a visualization and simulation study
Authors:
Lasai Barreñada,
Paula Dhiman,
Dirk Timmerman,
Anne-Laure Boulesteix,
Ben Van Calster
Abstract:
Random forests have become popular for clinical risk prediction modelling. In a case study on predicting ovarian malignancy, we observed training c-statistics close to 1. Although this suggests overfitting, performance was competitive on test data. We aimed to understand the behaviour of random forests by (1) visualizing data space in three real world case studies and (2) a simulation study. For the case studies, risk estimates were visualised using heatmaps in a 2-dimensional subspace. The simulation study included 48 logistic data generating mechanisms (DGM), varying the predictor distribution, the number of predictors, the correlation between predictors, the true c-statistic and the strength of true predictors. For each DGM, 1000 training datasets of size 200 or 4000 were simulated and RF models trained with minimum node size 2 or 20 using ranger package, resulting in 192 scenarios in total. The visualizations suggested that the model learned spikes of probability around events in the training set. A cluster of events created a bigger peak, isolated events local peaks. In the simulation study, median training c-statistics were between 0.97 and 1 unless there were 4 or 16 binary predictors with minimum node size 20. Median test c-statistics were higher with higher events per variable, higher minimum node size, and binary predictors. Median training slopes were always above 1, and were not correlated with median test slopes across scenarios (correlation -0.11). Median test slopes were higher with higher true c-statistic, higher minimum node size, and higher sample size. Random forests learn local probability peaks that often yield near perfect training c-statistics without strongly affecting c-statistics on test data. When the aim is probability estimation, the simulation results go against the common recommendation to use fully grown trees in random forest models.
Submitted 28 February, 2024;
originally announced February 2024.
-
How to develop, externally validate, and update multinomial prediction models
Authors:
Celina K Gehringer,
Glen P Martin,
Ben Van Calster,
Kimme L Hyrich,
Suzanne M M Verstappen,
Jamie C Sergeant
Abstract:
Multinomial prediction models (MPMs) have a range of potential applications across healthcare where the primary outcome of interest has multiple nominal or ordinal categories. However, the application of MPMs is scarce, which may be due to the added methodological complexities that they bring. This article provides a guide of how to develop, externally validate, and update MPMs. Using a previously developed and validated MPM for treatment outcomes in rheumatoid arthritis as an example, we outline guidance and recommendations for producing a clinical prediction model using multinomial logistic regression. This article is intended to supplement existing general guidance on prediction model research. This guide is split into three parts: 1) Outcome definition and variable selection, 2) Model development, and 3) Model evaluation (including performance assessment, internal and external validation, and model recalibration). We outline how to evaluate and interpret the predictive performance of MPMs. R code is provided. We recommend the application of MPMs in clinical settings where the prediction of a nominal polytomous outcome is of interest. Future methodological research could focus on MPM-specific considerations for variable selection and sample size criteria for external validation.
Submitted 20 December, 2023; v1 submitted 19 December, 2023;
originally announced December 2023.
-
Minimum Sample Size for Developing a Multivariable Prediction Model using Multinomial Logistic Regression
Authors:
Alexander Pate,
Richard D Riley,
Gary S Collins,
Maarten van Smeden,
Ben Van Calster,
Joie Ensor,
Glen P Martin
Abstract:
Multinomial logistic regression models allow one to predict the risk of a categorical outcome with more than 2 categories. When developing such a model, researchers should ensure the number of participants (n) is appropriate relative to the number of events (E.k) and the number of predictor parameters (p.k) for each category k. We propose three criteria to determine the minimum n required in light of existing criteria developed for binary outcomes. The first criterion aims to minimise model overfitting. The second aims to minimise the difference between the observed and adjusted Nagelkerke R2. The third criterion aims to ensure the overall risk is estimated precisely. For criterion (i), we show the sample size must be based on the anticipated Cox-Snell R2 of distinct one-to-one logistic regression models corresponding to the sub-models of the multinomial logistic regression, rather than on the overall Cox-Snell R2 of the multinomial logistic regression. We tested the performance of the proposed criterion (i) through a simulation study, and found that it resulted in the desired level of overfitting. Criteria (ii) and (iii) are natural extensions from previously proposed criteria for binary outcomes. We illustrate how to implement the sample size criteria through a worked example considering the development of a multinomial risk prediction model for tumour type when presented with an ovarian mass. Code is provided for the simulation and worked example. We will embed our proposed criteria within the pmsampsize R library and Stata modules.
Submitted 26 July, 2022;
originally announced July 2022.
-
The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression
Authors:
Ruben van den Goorbergh,
Maarten van Smeden,
Dirk Timmerman,
Ben Van Calster
Abstract:
Methods to correct class imbalance, i.e. imbalance between the frequency of outcome events and non-events, are receiving increasing interest for developing prediction models. We examined the effect of imbalance correction on the performance of standard and penalized (ridge) logistic regression models in terms of discrimination, calibration, and classification. We examined random undersampling, random oversampling and SMOTE using Monte Carlo simulations and a case study on ovarian cancer diagnosis. The results indicated that all imbalance correction methods led to poor calibration (strong overestimation of the probability to belong to the minority class), but not to better discrimination in terms of the area under the receiver operating characteristic curve. Imbalance correction improved classification in terms of sensitivity and specificity, but similar results were obtained by shifting the probability threshold instead. Our study shows that outcome imbalance is not a problem in itself, and that imbalance correction may even worsen model performance.
Submitted 18 February, 2022;
originally announced February 2022.
-
Risk prediction models for discrete ordinal outcomes: calibration and the impact of the proportional odds assumption
Authors:
Michael Edlinger,
Maarten van Smeden,
Hannes F Alber,
Maria Wanitschek,
Ben Van Calster
Abstract:
Calibration is a vital aspect of the performance of risk prediction models, but research in the context of ordinal outcomes is scarce. This study compared calibration measures for risk models predicting a discrete ordinal outcome, and investigated the impact of the proportional odds assumption on calibration and overfitting. We studied the multinomial, cumulative, adjacent category, continuation ratio, and stereotype logit/logistic models. To assess calibration, we investigated calibration intercepts and slopes, calibration plots, and the estimated calibration index. Using large sample simulations, we studied the performance of models for risk estimation under various conditions, assuming that the true model has either a multinomial logistic form or a cumulative logit proportional odds form. Small sample simulations were used to compare the tendency for overfitting between models. As a case study, we developed models to diagnose the degree of coronary artery disease (five categories) in symptomatic patients. When the true model was multinomial logistic, proportional odds models often yielded poor risk estimates, with calibration slopes deviating considerably from unity even on large model development datasets. The stereotype logistic model improved the calibration slope, but still provided biased risk estimates for individual patients. When the true model had a cumulative logit proportional odds form, multinomial logistic regression provided biased risk estimates, although these biases were modest. Non-proportional odds models require more parameters to be estimated from the data, and hence suffered more from overfitting. Despite larger sample size requirements, we generally recommend multinomial logistic regression for risk prediction modeling of discrete ordinal outcomes.
Submitted 18 November, 2021; v1 submitted 19 April, 2021;
originally announced April 2021.
-
On the variability of regression shrinkage methods for clinical prediction models: simulation study on predictive performance
Authors:
Ben Van Calster,
Maarten van Smeden,
Ewout W. Steyerberg
Abstract:
When developing risk prediction models, shrinkage methods are recommended, especially when the sample size is limited. Several earlier studies have shown that the shrinkage of model coefficients can reduce overfitting of the prediction model and subsequently result in better predictive performance on average. In this simulation study, we aimed to investigate the variability of regression shrinkage on predictive performance for a binary outcome, with focus on the calibration slope. The slope indicates whether risk predictions are too extreme (slope < 1) or not extreme enough (slope > 1). We investigated the following shrinkage methods in comparison to standard maximum likelihood estimation: uniform shrinkage (likelihood-based and bootstrap-based), ridge regression, penalized maximum likelihood, LASSO regression, adaptive LASSO, non-negative garrote, and Firth's correction. There were three main findings. First, shrinkage improved calibration slopes on average. Second, the between-sample variability of calibration slopes was often increased relative to maximum likelihood. Among the shrinkage methods, the bootstrap-based uniform shrinkage worked well overall. In contrast to other shrinkage approaches, Firth's correction had only a small shrinkage effect but did so with low variability. Third, the correlation between the estimated shrinkage and the optimal shrinkage to remove overfitting was typically negative. Hence, although shrinkage improved predictions on average, it often worked poorly in individual datasets, in particular when shrinkage was most needed. The observed variability of shrinkage methods implies that these methods do not solve problems associated with small sample size or low number of events per variable.
Submitted 26 July, 2019;
originally announced July 2019.
-
Impact of predictor measurement heterogeneity across settings on performance of prediction models: a measurement error perspective
Authors:
Kim Luijken,
Rolf H. H. Groenwold,
Ben van Calster,
Ewout W. Steyerberg,
Maarten van Smeden
Abstract:
It is widely acknowledged that the predictive performance of clinical prediction models should be studied in patients that were not part of the data in which the model was derived. Out-of-sample performance can be hampered when predictors are measured differently at derivation and external validation. This may occur, for instance, when predictors are measured using different measurement protocols or when tests are produced by different manufacturers. Although such heterogeneity in predictor measurement between derivation and validation data is common, the impact on the out-of-sample performance is not well studied. Using analytical and simulation approaches, we examined out-of-sample performance of prediction models under various scenarios of heterogeneous predictor measurement. These scenarios were defined and clarified using an established taxonomy of measurement error models. The results of our simulations indicate that predictor measurement heterogeneity can induce miscalibration of prediction and affects discrimination and overall predictive accuracy, to extents that the prediction model may no longer be considered clinically useful. The measurement error taxonomy was found to be helpful in identifying and predicting effects of heterogeneous predictor measurements between settings of prediction model derivation and validation. Our work indicates that homogeneity of measurement strategies across settings is of paramount importance in prediction research.
Submitted 5 February, 2019; v1 submitted 27 June, 2018;
originally announced June 2018.