-
Empirical Evidence That There Is No Such Thing As A Validated Prediction Model
Authors:
Florian D. van Leeuwen,
Ewout W. Steyerberg,
David van Klaveren,
Ben Wessler,
David M. Kent,
Erik W. van Zwet
Abstract:
Background: External validations are essential to assess clinical prediction models (CPMs) before deployment. Apart from model misspecification, differences in patient population and other factors influence a model's AUC (c-statistic). We aimed to quantify variation in AUCs across external validation studies and adjust expectations of a model's performance in a new setting.
Methods: The Tufts-PACE CPM Registry contains CPMs for cardiovascular disease prognosis. We analyzed the AUCs of 469 CPMs with a total of 1,603 external validations. For each CPM, we performed a random effects meta-analysis to estimate the between-study standard deviation $τ$ among the AUCs. Since the majority of these meta-analyses has only a handful of validations, this leads to very poor estimates of $τ$. So, we estimated a log normal distribution of $τ$ across all CPMs and used this as an empirical prior. We compared this empirical Bayesian approach with frequentist meta-analyses using cross-validation.
Results: The 469 CPMs had a median of 2 external validations (IQR: [1-3]). The estimated distribution of $τ$ had a mean of 0.055 and a standard deviation of 0.015. If $τ$ = 0.05, the 95% prediction interval for the AUC in a new setting is at least +/- 0.1, regardless of the number of validations. Frequentist methods underestimate the uncertainty about the AUC in a new setting. Accounting for $τ$ in a Bayesian approach achieved near nominal coverage.
Conclusion: Due to large heterogeneity among the validated AUC values of a CPM, there is great irreducible uncertainty in predicting the AUC in a new setting. This uncertainty is underestimated by existing methods. The proposed empirical Bayes approach addresses this problem and merits wide application in judging the validity of prediction models.
Submitted 12 June, 2024;
originally announced June 2024.
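As a rough illustration of the +/- 0.1 figure in the abstract above: under a normal approximation, the half-width of a 95% prediction interval for the AUC in a new setting is about 1.96 * sqrt(τ^2 + SE^2), where SE is the standard error of the pooled AUC. More validations shrink SE but not τ, so the half-width cannot fall below 1.96 * τ. The sketch below assumes this textbook random-effects formula on the raw AUC scale; the function name and the example SE values are hypothetical and are not taken from the paper.

```python
import math

Z_95 = 1.959964  # two-sided 95% standard normal quantile


def prediction_half_width(tau, se_pooled=0.0, z=Z_95):
    """Half-width of a normal-approximation prediction interval.

    tau        between-study standard deviation of the AUC
    se_pooled  standard error of the pooled AUC (hypothetical input;
               it shrinks towards 0 as validations accumulate)
    """
    return z * math.sqrt(tau ** 2 + se_pooled ** 2)


# Lower bound on the half-width when tau = 0.05, reached only as SE -> 0.
floor = prediction_half_width(0.05)

# Illustrative SE values: the half-width approaches, but never drops below, the floor.
for se in (0.05, 0.02, 0.005, 0.0):
    print(f"SE = {se:.3f}  half-width = {prediction_half_width(0.05, se):.3f}")
```

The last row (SE = 0) corresponds to an unlimited number of validations and still leaves a half-width of roughly 0.1, which is the sense in which the uncertainty is irreducible.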
-
Metropolitan-scale heralded entanglement of solid-state qubits
Authors:
Arian J. Stolk,
Kian L. van der Enden,
Marie-Christine Slater,
Ingmar te Raa-Derckx,
Pieter Botma,
Joris van Rantwijk,
Benjamin Biemond,
Ronald A. J. Hagen,
Rodolf W. Herfst,
Wouter D. Koek,
Arjan J. H. Meskers,
René Vollmer,
Erwin J. van Zwet,
Matthew Markham,
Andrew M. Edmonds,
Jan Fabian Geus,
Florian Elsen,
Bernd Jungbluth,
Constantin Haefner,
Christoph Tresp,
Jürgen Stuhler,
Stephan Ritter,
Ronald Hanson
Abstract:
A key challenge towards future quantum internet technology is connecting quantum processors at metropolitan scale. Here, we report on heralded entanglement between two independently operated quantum network nodes separated by 10km. The two nodes hosting diamond spin qubits are linked with a midpoint station via 25km of deployed optical fiber. We minimize the effects of fiber photon loss by quantum frequency conversion of the qubit-native photons to the telecom L-band and by embedding the link in an extensible phase-stabilized architecture enabling the use of the loss-resilient single-photon entangling protocol. By capitalizing on the full heralding capabilities of the network link in combination with real-time feedback logic on the long-lived qubits, we demonstrate the delivery of a predefined entangled state on the nodes irrespective of the heralding detection pattern. Addressing key scaling challenges and being compatible with different qubit systems, our architecture establishes a generic platform for exploring metropolitan-scale quantum networks.
Submitted 4 April, 2024;
originally announced April 2024.
-
G-formula for causal inference via multiple imputation
Authors:
Jonathan W. Bartlett,
Camila Olarte Parra,
Emily Granger,
Ruth H. Keogh,
Erik W. van Zwet,
Rhian M. Daniel
Abstract:
G-formula is a popular approach for estimating treatment or exposure effects from longitudinal data that are subject to time-varying confounding. G-formula estimation is typically performed by Monte-Carlo simulation, with non-parametric bootstrapping used for inference. We show that G-formula can be implemented by exploiting existing methods for multiple imputation (MI) for synthetic data. This involves using an existing modified version of Rubin's variance estimator. In practice missing data is ubiquitous in longitudinal datasets. We show that such missing data can be readily accommodated as part of the MI procedure when using G-formula, and describe how MI software can be used to implement the approach. We explore its performance using a simulation study and an application from cystic fibrosis.
Submitted 11 October, 2023; v1 submitted 27 January, 2023;
originally announced January 2023.
-
Think before you shrink: Alternatives to default shrinkage methods can improve prediction accuracy, calibration and coverage
Authors:
Mark A. van de Wiel,
Gwenaël G. R. Leday,
Jeroen Hoogland,
Martijn W. Heymans,
Erik W. van Zwet,
Ailko H. Zwinderman
Abstract:
While shrinkage is essential in high-dimensional settings, its use for low-dimensional regression-based prediction has been debated. It reduces variance, often leading to improved prediction accuracy. However, it also inevitably introduces bias, which may harm two other measures of predictive performance: calibration and coverage of confidence intervals. Much of the criticism stems from the usage of standard shrinkage methods, such as lasso and ridge with a single, cross-validated penalty. Our aim is to show that readily available alternatives can strongly improve predictive performance, in terms of accuracy, calibration or coverage. For linear regression, we use small sample splits of a large, fairly typical epidemiological data set to illustrate this. We show that usage of differential ridge penalties for covariate groups may enhance prediction accuracy, while calibration and coverage benefit from additional shrinkage of the penalties. In the logistic setting, we apply an external simulation to demonstrate that local shrinkage improves calibration with respect to global shrinkage, while providing better prediction accuracy than other solutions, like Firth's correction. The benefits of the alternative shrinkage methods are easily accessible via example implementations using \texttt{mgcv} and \texttt{r-stan}, including the estimation of multiple penalties. A synthetic copy of the large data set is shared for reproducibility.
Submitted 24 January, 2023;
originally announced January 2023.
-
Telecom-band quantum interference of frequency-converted photons from remote detuned NV centers
Authors:
Arian Stolk,
Kian L. van der Enden,
Marie-Christine Roehsner,
Annick Teepe,
Stein O. J. Faes,
Sidney Cadot,
Joris van Rantwijk,
Ingmar te Raa,
Ronald Hagen,
Ad Verlaan,
Benjamin Biemond,
Andrey Khorev,
Jaco Morits,
René Vollmer,
Matthew Markham,
Andrew M. Edmonds,
Erwin van Zwet,
Ronald Hanson
Abstract:
Entanglement distribution over quantum networks has the promise of realizing fundamentally new technologies. Entanglement between separated quantum processing nodes has been achieved on several experimental platforms in the past decade. To move towards metropolitan-scale quantum network test beds, the creation and transmission of indistinguishable single photons over existing telecom infrastructure is key. Here we report the interference of photons emitted by remote, spectrally detuned NV center-based network nodes, using quantum frequency conversion to the telecom L-band. We find a visibility of 0.79$\pm$0.03 and an indistinguishability between converted NV photons around 0.9 over the full range of the emission duration, confirming the removal of the spectral information present. Our approach implements fully separated and independent control over the nodes, time-multiplexing of control and quantum signals, and active feedback to stabilize the output frequency. Our results demonstrate a working principle that can be readily employed on other platforms and show a clear path towards generating metropolitan-scale, solid-state entanglement over deployed telecom fibers.
Submitted 31 January, 2022;
originally announced February 2022.
-
Benchmarking survival outcomes: A funnel plot for survival data
Authors:
Hein Putter,
Dirk-Jan Eikema,
Liesbeth C. de Wreede,
Eoin McGrath,
Isabel Sanchez-Ortega,
Riccardo Saccardi,
John A. Snowden,
Erik W. van Zwet
Abstract:
Benchmarking is commonly used in many healthcare settings to monitor clinical performance, with the aim of increasing cost-effectiveness and safe care of patients. The funnel plot is a popular tool in visualizing the performance of a healthcare center in relation to other centers and to a target, taking into account statistical uncertainty. In this paper we develop methodology for constructing funnel plots for survival data. The method takes into account censoring and can deal with differences in censoring distributions across centers. Practical issues in implementing the methodology are discussed, particularly in the setting of benchmarking clinical outcomes for hematopoietic stem cell transplantation. A simulation study is performed to assess the performance of the funnel plots under several scenarios. Our methodology is illustrated using data from the EBMT benchmarking project.
Submitted 26 April, 2021;
originally announced April 2021.
-
A proposal for informative default priors scaled by the standard error of estimates
Authors:
Erik van Zwet,
Andrew Gelman
Abstract:
If we have an unbiased estimate of some parameter of interest, then its absolute value is positively biased for the absolute value of the parameter. This bias is large when the signal-to-noise ratio (SNR) is small, and it becomes even larger when we condition on statistical significance; the winner's curse. This is a frequentist motivation for regularization. To determine a suitable amount of shrinkage, we propose to estimate the distribution of the SNR from a large collection or corpus of similar studies and use this as a prior distribution. The wider the scope of the corpus, the less informative the prior, but a wider scope does not necessarily result in a more diffuse prior. We show that the estimation of the prior simplifies if we require that posterior inference is equivariant under linear transformations of the data. We demonstrate our approach with corpora of 86 replication studies from psychology and 178 phase 3 clinical trials. Our suggestion is not intended to be a replacement for a prior based on full information about a particular problem; rather, it represents a familywise choice that should yield better long-term properties than the current default uniform prior, which has led to systematic overestimates of effect sizes and a replication crisis when these inflated estimates have not shown up in later studies.
Submitted 30 November, 2020;
originally announced November 2020.
-
The statistical properties of RCTs and a proposal for shrinkage
Authors:
Erik van Zwet,
Simon Schwab,
Stephen Senn
Abstract:
We abstract the concept of a randomized controlled trial (RCT) as a triple (beta,b,s), where beta is the primary efficacy parameter, b the estimate and s the standard error (s>0). The parameter beta is either a difference of means, a log odds ratio or a log hazard ratio. If we assume that b is unbiased and normally distributed, then we can estimate the full joint distribution of (beta,b,s) from a sample of pairs (b_i,s_i). We have collected 23,747 such pairs from the Cochrane database to do so. Here, we report the estimated distribution of the signal-to-noise ratio beta/s and the achieved power. We estimate the median achieved power to be 0.13. We also consider the exaggeration ratio which is the factor by which the magnitude of beta is overestimated. We find that if the estimate is just significant at the 5% level, we would expect it to overestimate the true effect by a factor of 1.7. This exaggeration is sometimes referred to as the winner's curse and it is undoubtedly to a considerable extent responsible for disappointing replication results. For this reason, we believe it is important to shrink the unbiased estimator, and we propose a method for doing so.
Submitted 30 November, 2020;
originally announced November 2020.
-
The Significance Filter, the Winner's Curse and the Need to Shrink
Authors:
Erik van Zwet,
Eric Cator
Abstract:
The "significance filter" refers to focusing exclusively on statistically significant results. Since frequentist properties such as unbiasedness and coverage are valid only before the data have been observed, there are no guarantees if we condition on significance. In fact, the significance filter leads to overestimation of the magnitude of the parameter, which has been called the "winner's curse". It can also lead to undercoverage of the confidence interval. Moreover, these problems become more severe if the power is low. While these issues clearly deserve our attention, they have been studied only informally and mathematical results are lacking. Here we study them from the frequentist and the Bayesian perspective. We prove that the relative bias of the magnitude is a decreasing function of the power and that the usual confidence interval undercovers when the power is less than 50%. We conclude that failure to apply the appropriate amount of shrinkage can lead to misleading inferences.
Submitted 20 September, 2020;
originally announced September 2020.
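The claim in the abstract above, that the relative bias of the magnitude decreases with power, can be checked numerically in the simplest setting. The sketch below assumes a z-statistic distributed as N(SNR, 1) and a two-sided 5% test, and computes the power together with the exaggeration ratio E(|z| given significance) / |SNR| in closed form. The function names and the SNR grid are illustrative assumptions, and this fixed-SNR calculation is not the paper's proof or its Bayesian analysis.

```python
import math

CRIT = 1.959964  # two-sided 5% critical value for a z-test


def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def power(snr, crit=CRIT):
    """P(|z| > crit) when z ~ N(snr, 1)."""
    snr = abs(snr)
    return (1.0 - norm_cdf(crit - snr)) + norm_cdf(-crit - snr)


def exaggeration(snr, crit=CRIT):
    """E(|z| given |z| > crit) divided by |snr|, for z ~ N(snr, 1)."""
    snr = abs(snr)
    if snr == 0.0:
        raise ValueError("the ratio is undefined when the true effect is zero")
    # Partial expectations of |z| over the two rejection regions.
    upper = snr * (1.0 - norm_cdf(crit - snr)) + norm_pdf(crit - snr)
    lower = -snr * norm_cdf(-crit - snr) + norm_pdf(crit + snr)
    return (upper + lower) / power(snr, crit) / snr


grid = [0.5, 1.0, 2.0, 2.8, 4.0]  # illustrative signal-to-noise ratios
powers = [power(s) for s in grid]
ratios = [exaggeration(s) for s in grid]
for s, p, r in zip(grid, powers, ratios):
    print(f"SNR = {s:.1f}  power = {p:.2f}  exaggeration = {r:.2f}")
```

Across the grid the power rises while the exaggeration ratio falls towards 1, so the overestimation is worst exactly where studies are least powered.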
-
Simultaneous Confidence Intervals for Ranks With Application to Ranking Institutions
Authors:
Diaa Al Mohamad,
Jelle J. Goeman,
Erik W. van Zwet
Abstract:
When a ranking of institutions such as medical centers or universities is based on an indicator provided with a standard error, confidence intervals should be calculated to assess the quality of these ranks. We consider the problem of constructing simultaneous confidence intervals for the ranks of means based on an observed sample. For this aim, the only available method from the literature uses Monte-Carlo simulations and is highly anticonservative, especially when the means are close to each other or have ties. We present a novel method based on Tukey's honest significant difference test (HSD). Our new method is, on the contrary, conservative when there are no ties. By properly rescaling these two methods to the nominal confidence level, they surprisingly perform very similarly. The Monte-Carlo method is however unscalable when the number of institutions is larger than 30 to 50 and thus stays anticonservative. We provide extensive simulations to support our claims and the two methods are compared in terms of their simultaneous coverage and their efficiency. We provide a data analysis for 64 hospitals in the Netherlands and compare both methods. Software for our new methods is available online in package ICRanks downloadable from CRAN. Supplementary materials include supplementary R code for the simulations and proofs of the propositions presented in this paper.
Submitted 11 December, 2018;
originally announced December 2018.
-
A default prior for regression coefficients
Authors:
Erik van Zwet
Abstract:
When the sample size is not too small, M-estimators of regression coefficients are approximately normal and unbiased. This leads to the familiar frequentist inference in terms of normality-based confidence intervals and p-values. From a Bayesian perspective, use of the (improper) uniform prior yields matching results in the sense that posterior quantiles agree with one-sided confidence bounds. For this, and various other reasons, the uniform prior is often considered objective or non-informative. In spite of this, we argue that the uniform prior is not suitable as a default prior for inference about a regression coefficient in the context of the bio-medical and social sciences. We propose that a more suitable default choice is the normal distribution with mean zero and standard deviation equal to the standard error of the M-estimator. We base this recommendation on two arguments. First, we show that this prior is non-informative for inference about the sign of the regression coefficient. Secondly, we show that this prior agrees well with a meta-analysis of 50 articles from the MEDLINE database.
Submitted 18 October, 2018; v1 submitted 22 September, 2018;
originally announced September 2018.
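To see what the default prior proposed in the abstract above does to an estimate, here is the standard conjugate-normal calculation. It rests on a simplifying assumption made for illustration only: the estimate $b$ is exactly normal around the coefficient $β$ with known standard error $s$.

```latex
b \mid \beta \sim N(\beta,\, s^2), \qquad \beta \sim N(0,\, s^2)
\;\Longrightarrow\;
\beta \mid b \sim
N\!\left( \frac{s^2}{s^2 + s^2}\, b,\; \frac{s^2 \cdot s^2}{s^2 + s^2} \right)
= N\!\left( \frac{b}{2},\; \frac{s^2}{2} \right).
```

Under this assumption the posterior mean is half of the estimate, and the 95% posterior interval is $b/2 \pm 1.96\, s/\sqrt{2}$, which is narrower than the usual $b \pm 1.96\, s$.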
-
Adaptive Critical Value for Constrained Likelihood Ratio Testing
Authors:
Diaa Al Mohamad,
Jelle J. Goeman,
Erik W. van Zwet,
Eric A. Cator
Abstract:
We present a new way of testing ordered hypotheses against all alternatives which overpowers the classical approach both in simplicity and statistical power. Our new method tests the constrained likelihood ratio statistic against the quantile of one and only one chi-squared random variable with data-dependent degrees of freedom instead of a mixture of chi-squares. Our new test is proved to have a valid finite-sample significance level $α$ and provides more power, especially for sparse alternatives (those with a small or moderate number of null constraint violations), in comparison to the classical approach. Our method is also easier to use than the classical approach, which requires calculating or simulating a set of complicated weights. Two special cases are considered in more detail, namely the case of testing orthants $μ_1<0, \cdots, μ_n<0$ and the isotonic case of testing $μ_1<μ_2<μ_3$ against all alternatives. Contours of the difference in power are shown for these examples, showing the interest of our new approach.
Submitted 25 June, 2018; v1 submitted 4 June, 2018;
originally announced June 2018.
-
Simultaneous confidence sets for ranks using the partitioning principle - Technical report
Authors:
Diaa Al Mohamad,
Erik W. van Zwet,
Jelle J. Goeman,
Aldo Solari
Abstract:
Ranking institutions such as medical centers or universities is based on an indicator accompanied by an uncertainty measure such as a standard deviation, and confidence intervals should be calculated to assess the quality of these ranks. We consider the problem of constructing simultaneous confidence intervals for the ranks of centers based on an observed sample. We present in this paper a novel method based on multiple testing which uses the partitioning principle and employs the likelihood ratio (LR) test on the partitions. The complexity of the algorithm is super exponential. We present several ways and shortcuts to reduce this complexity. We also provide a polynomial algorithm which produces a very good bracketing for the multiple testing by linearizing the critical value of the LR test. We show that Tukey's Honest Significant Difference (HSD) test can be written as a partitioning procedure. The new methodology has promising properties in the sense that it opens the door in a simple and easy way to construct new methods which may trade the exponential complexity with power of the test or vice versa. In comparison to Tukey's HSD test, the LR test seems to give better results when the centers are close to each other or the uncertainty in the data is high, which is confirmed in a simulation study.
Submitted 9 August, 2017;
originally announced August 2017.
-
An improvement of Tukey's HSD with application to ranking institutions
Authors:
Diaa Al Mohamad,
Jelle J. Goeman,
Erik W. van Zwet
Abstract:
When a ranking of institutions such as medical centers or universities is based on an indicator provided with a standard error, confidence intervals should be calculated to assess the quality of these ranks. We consider the problem of constructing simultaneous confidence intervals (CIs) for the ranks of centers based on an observed sample. We present a novel method based on Tukey's honest significant difference test (HSD) which is the first method to produce valid simultaneous CIs for ranks. Moreover, we introduce a new variant of Tukey's HSD based on the sequential rejection principle. The new algorithm ensures familywise error control, and produces simultaneous confidence intervals for the ranks uniformly shorter than those provided by Tukey's HSD for the same level of significance. We illustrate the method through both simulations and real data analysis from 64 hospitals in the Netherlands. Software for our new methods is available online in package \texttt{ICRanks} downloadable from CRAN. Supplementary materials include supplementary R code for the simulations and proofs of the propositions presented in this paper.
Submitted 22 November, 2018; v1 submitted 8 August, 2017;
originally announced August 2017.
-
Partial exchangeability of the prior via shuffling
Authors:
Erik van Zwet
Abstract:
In inference problems involving a multi-dimensional parameter $θ$, it is often natural to consider decision rules that have a risk which is invariant under some group $G$ of permutations of $θ$. We show that this implies that the Bayes risk of the rule is {\em as if} the prior distribution of the parameter is partially exchangeable with respect to $G$. We provide a symmetrization technique for incorporating partial exchangeability of $θ$ into a statistical model, without assuming any other prior information. We refer to this technique as {\em shuffling}. Shuffling can be viewed as an instance of empirical Bayes, where we estimate the (unordered) multiset of parameter values $\{θ_1,θ_2,\dots,θ_p\}$ while using a uniform prior on $G$ for their ordering. Estimation of the multiset is a missing data problem which can be tackled with a stochastic EM algorithm. We show that in the special case of estimating the mean-value parameter in a regular exponential family model, shuffling leads to an estimator that is a weighted average of permuted versions of the usual maximum likelihood estimator. This is a novel form of shrinkage.
Submitted 27 June, 2014; v1 submitted 28 May, 2014;
originally announced May 2014.
-
Measuring Traffic
Authors:
Peter J. Bickel,
Chao Chen,
Jaimyoung Kwon,
John Rice,
Erik van Zwet,
Pravin Varaiya
Abstract:
A traffic performance measurement system, PeMS, currently functions as a statewide repository for traffic data gathered by thousands of automatic sensors. It has integrated data collection, processing and communications infrastructure with data storage and analytical tools. In this paper, we discuss statistical issues that have emerged as we attempt to process a data stream of 2 GB per day of wildly varying quality. In particular, we focus on detecting sensor malfunction, imputation of missing or bad data, estimation of velocity and forecasting of travel times on freeway networks.
Submitted 18 April, 2008;
originally announced April 2008.