-
Think before you shrink: Alternatives to default shrinkage methods can improve prediction accuracy, calibration and coverage
Authors:
Mark A. van de Wiel,
Gwenaël G. R. Leday,
Jeroen Hoogland,
Martijn W. Heymans,
Erik W. van Zwet,
Ailko H. Zwinderman
Abstract:
While shrinkage is essential in high-dimensional settings, its use for low-dimensional regression-based prediction has been debated. It reduces variance, often leading to improved prediction accuracy. However, it also inevitably introduces bias, which may harm two other measures of predictive performance: calibration and coverage of confidence intervals. Much of the criticism stems from the use of standard shrinkage methods, such as lasso and ridge with a single, cross-validated penalty. Our aim is to show that readily available alternatives can strongly improve predictive performance, in terms of accuracy, calibration or coverage. For linear regression, we use small sample splits of a large, fairly typical epidemiological data set to illustrate this. We show that the use of differential ridge penalties for covariate groups may enhance prediction accuracy, while calibration and coverage benefit from additional shrinkage of the penalties. In the logistic setting, we apply an external simulation to demonstrate that local shrinkage improves calibration with respect to global shrinkage, while providing better prediction accuracy than other solutions, like Firth's correction. The benefits of the alternative shrinkage methods are easily accessible via example implementations using mgcv and r-stan, including the estimation of multiple penalties. A synthetic copy of the large data set is shared for reproducibility.
Submitted 24 January, 2023;
originally announced January 2023.
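The idea of differential ridge penalties for covariate groups, mentioned in the abstract above, can be illustrated with a minimal sketch. This is not the authors' mgcv or r-stan implementation: the function name `group_ridge`, the simulated data, and the fixed penalty values are all assumptions made for illustration, the penalties are supplied by hand rather than estimated, and no intercept is modeled.

```python
import numpy as np

def group_ridge(X, y, groups, penalties):
    """Ridge estimate with one penalty per covariate group (hypothetical helper).

    X: (n, p) design matrix, y: (n,) response,
    groups: length-p integer array, groups[j] indexes the penalty of covariate j,
    penalties: one non-negative penalty per group.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Expand the per-group penalties to one penalty per covariate.
    lam = np.asarray(penalties, dtype=float)[np.asarray(groups)]
    # Closed form: (X'X + diag(lam))^{-1} X'y
    return np.linalg.solve(X.T @ X + np.diag(lam), X.T @ y)

# Simulated data (assumption): two groups of two covariates each.
rng = np.random.default_rng(0)
n = 50
X = rng.standard_normal((n, 4))
beta_true = np.array([1.0, -1.0, 0.2, 0.0])
y = X @ beta_true + rng.standard_normal(n)
groups = np.array([0, 0, 1, 1])

# A single shared penalty versus a light penalty on group 0 and a heavy one on group 1.
beta_single = group_ridge(X, y, groups, [5.0, 5.0])
beta_diff = group_ridge(X, y, groups, [1.0, 50.0])
```

With equal penalties the sketch reduces to ordinary ridge regression, which is the "single penalty" default that the abstract contrasts with.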
-
Estimation of variance components, heritability and the ridge penalty in high-dimensional generalized linear models
Authors:
Jurre R. Veerman,
Gwenaël G. R. Leday,
Mark A. van de Wiel
Abstract:
For high-dimensional linear regression models, we review and compare several estimators of the variances $τ^2$ and $σ^2$ of the random slopes and errors, respectively. These variances relate directly to the ridge regression penalty $λ$ and the heritability index $h^2$, often used in genetics. Direct and indirect estimators of these, based on either cross-validation (CV) or maximum marginal likelihood (MML), are also discussed. The comparisons include several cases of the covariate matrix $\mathbf{X}_{n \times p}$, with $p \gg n$, such as multi-collinear covariates and data-derived ones. In addition, we study robustness against departures from the model, such as sparse instead of dense effects and non-Gaussian errors.
An example on weight gain data with genomic covariates confirms the good performance of MML compared to CV. Several extensions are presented: first, to the high-dimensional linear mixed effects model, with REML as an alternative to MML; second, to the conjugate Bayesian setting, which proves to be a good alternative; third, and most prominently, to generalized linear models, for which we derive a computationally efficient MML estimator by rewriting the marginal likelihood as an $n$-dimensional integral. For Poisson and Binomial ridge regression, we demonstrate the superior accuracy of the resulting MML estimator of $λ$ as compared to CV. Software is provided to enable reproduction of all results presented here.
Submitted 7 February, 2019;
originally announced February 2019.
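The link between the variance components and the ridge penalty stated in the abstract above can be checked numerically. The sketch below assumes the standard random-effects formulation, with slopes drawn from N(0, $τ^2$) and errors from N(0, $σ^2$), under which $λ = σ^2/τ^2$; the heritability formula $h^2 = pτ^2/(pτ^2 + σ^2)$ additionally assumes standardized covariates. These formulas and all function names are assumptions for illustration and are not taken from the paper's software.

```python
import numpy as np

def ridge_penalty(sigma2, tau2):
    """Ridge penalty implied by the variance components (assumed: lambda = sigma^2 / tau^2)."""
    return sigma2 / tau2

def heritability(sigma2, tau2, p):
    """Heritability index, assuming p standardized covariates."""
    return p * tau2 / (p * tau2 + sigma2)

def posterior_mean(X, y, sigma2, tau2):
    """Posterior mean of the slopes under beta_j ~ N(0, tau2) and eps_i ~ N(0, sigma2)."""
    p = X.shape[1]
    precision = X.T @ X / sigma2 + np.eye(p) / tau2
    return np.linalg.solve(precision, X.T @ y / sigma2)

def ridge(X, y, lam):
    """Ordinary ridge estimate with a single penalty lam."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Simulated high-dimensional data (assumption): p > n.
rng = np.random.default_rng(2)
n, p = 20, 100
sigma2, tau2 = 1.0, 0.01
X = rng.standard_normal((n, p))
beta = rng.normal(0.0, np.sqrt(tau2), size=p)
y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n)

lam = ridge_penalty(sigma2, tau2)
beta_ridge = ridge(X, y, lam)
beta_post = posterior_mean(X, y, sigma2, tau2)
h2 = heritability(sigma2, tau2, p)
```

In this setup `beta_ridge` and `beta_post` coincide up to numerical error, which is why estimating $τ^2$ and $σ^2$ amounts to estimating $λ$.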
-
Shrinkage estimation of large covariance matrices using multiple shrinkage targets
Authors:
Harry Gray,
Gwenaël G. R. Leday,
Catalina A. Vallejos,
Sylvia Richardson
Abstract:
Linear shrinkage estimators of a covariance matrix (defined by a weighted average of the sample covariance matrix and a pre-specified shrinkage target matrix) are popular when analysing high-throughput molecular data. However, their performance strongly relies on an appropriate choice of target matrix. This paper introduces a more flexible class of linear shrinkage estimators that can accommodate multiple shrinkage target matrices, directly accounting for the uncertainty regarding the target choice. This is done within a conjugate Bayesian framework, which is computationally efficient. Using both simulated and real data, we show that the proposed estimator is less sensitive to target misspecification and can outperform state-of-the-art (nonparametric) single-target linear shrinkage estimators. Using protein expression data from The Cancer Proteome Atlas we illustrate how multiple sources of prior information (obtained from more than 30 different cancer types) can be incorporated into the proposed multi-target linear shrinkage estimator. In particular, it is shown that the target-specific weights can provide insights into the differences and similarities between cancer types. Software for the method is freely available as an R-package at http://github.com/HGray384/TAS.
Submitted 21 September, 2018;
originally announced September 2018.
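The weighted-average form of a linear shrinkage estimator described in the abstract above can be sketched as follows. This is only an illustration of the convex combination with several targets: the weights are fixed by hand here, whereas the paper derives them within a conjugate Bayesian framework, and the function `linear_shrinkage`, the two targets, and the simulated data are assumptions, not part of the TAS package.

```python
import numpy as np

def linear_shrinkage(S, targets, weights):
    """Convex combination of a sample covariance S and several target matrices.

    weights[k] is the weight of targets[k]; the sample covariance receives
    the remaining weight 1 - sum(weights).
    """
    S = np.asarray(S, dtype=float)
    w = np.asarray(weights, dtype=float)
    if w.ndim != 1 or len(w) != len(targets):
        raise ValueError("need exactly one weight per target")
    if np.any(w < 0) or w.sum() > 1:
        raise ValueError("weights must be non-negative and sum to at most 1")
    est = (1.0 - w.sum()) * S
    for wk, T in zip(w, targets):
        est = est + wk * np.asarray(T, dtype=float)
    return est

# Simulated data (assumption): fewer samples than variables, so S is singular.
rng = np.random.default_rng(1)
n, p = 10, 20
data = rng.standard_normal((n, p))
S = np.cov(data, rowvar=False)

# Two illustrative targets: the identity and the diagonal of S.
T_identity = np.eye(p)
T_diag = np.diag(np.diag(S))
Sigma_hat = linear_shrinkage(S, [T_identity, T_diag], [0.3, 0.2])
```

Although `S` is singular when there are fewer samples than variables, `Sigma_hat` is positive definite because the identity target receives positive weight.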
-
Fast Bayesian inference in large Gaussian graphical models
Authors:
Gwenaël G. R. Leday,
Sylvia Richardson
Abstract:
Despite major methodological developments, Bayesian inference for Gaussian graphical models remains challenging in high dimension due to the tremendous size of the model space. This article proposes a method to infer the marginal and conditional independence structures between variables by multiple testing of hypotheses. Specifically, we introduce closed-form Bayes factors under the Gaussian conjugate model to evaluate the null hypotheses of marginal and conditional independence between variables. Their computation for all pairs of variables is shown to be extremely efficient, thereby allowing us to address large problems with thousands of nodes. Moreover, we derive exact tail probabilities from the null distributions of the Bayes factors. These allow the use of any multiplicity correction procedure to control error rates for incorrect edge inclusion. We demonstrate the proposed approach to graphical model selection on various simulated examples as well as on a large gene expression data set from The Cancer Genome Atlas.
Submitted 6 April, 2018; v1 submitted 21 March, 2018;
originally announced March 2018.
-
An empirical Bayes approach to network recovery using external knowledge
Authors:
Gino B. Kpogbezan,
Aad W. van der Vaart,
Wessel N. van Wieringen,
Gwenaël G. R. Leday,
Mark A. van de Wiel
Abstract:
Reconstruction of a high-dimensional network may benefit substantially from the inclusion of prior knowledge on the network topology. In the case of gene interaction networks, such knowledge may come, for instance, from pathway repositories like KEGG, or be inferred from data of a pilot study. The Bayesian framework provides a natural means of including such prior knowledge. Based on a Bayesian Simultaneous Equation Model, we develop an appealing empirical Bayes procedure which automatically assesses the relevance of the prior knowledge used. We use a variational Bayes method to approximate the posterior densities and compare its accuracy with that of a Gibbs sampling strategy. Our method is computationally fast, and can outperform known competitors. In a simulation study we show that accurate prior data can greatly improve the reconstruction of the network, but need not harm the reconstruction if wrong. We demonstrate the benefits of the method in an analysis of gene expression data from GEO. In particular, the edges of the recovered network have superior reproducibility (compared to that of competitors) over resampled versions of the data.
Submitted 24 May, 2016;
originally announced May 2016.
-
Gene network reconstruction using global-local shrinkage priors
Authors:
Gwenaël G. R. Leday,
Mathisca C. M. de Gunst,
Gino B. Kpogbezan,
Aad W. van der Vaart,
Wessel N. van Wieringen,
Mark A. van de Wiel
Abstract:
Reconstructing a gene network from high-throughput molecular data is often a challenging task, as the number of parameters to estimate is easily much larger than the sample size. A conventional remedy is to regularize or penalize the model likelihood. In network models, this is often done locally in the neighbourhood of each node or gene. However, estimation of the many regularization parameters is often difficult and can result in large statistical uncertainties. In this paper we propose to combine local regularization with global shrinkage of the regularization parameters to borrow strength between genes and improve inference. We employ a simple Bayesian model with non-sparse, conjugate priors to facilitate the use of fast variational approximations to posteriors. We discuss empirical Bayes estimation of hyper-parameters of the priors, and propose a novel approach to rank-based posterior thresholding. Using extensive model- and data-based simulations, we demonstrate that the proposed inference strategy outperforms popular (sparse) methods, yields more stable edges, and is more reproducible.
Submitted 13 October, 2015;
originally announced October 2015.
-
Modeling association between DNA copy number and gene expression with constrained piecewise linear regression splines
Authors:
Gwenaël G. R. Leday,
Aad W. van der Vaart,
Wessel N. van Wieringen,
Mark A. van de Wiel
Abstract:
DNA copy number and mRNA expression are widely used data types in cancer studies, which, combined, provide more insight than either does separately. Whereas in existing literature the form of the relationship between these two types of markers is fixed a priori, in this paper we model their association. We employ piecewise linear regression splines (PLRS), which combine good interpretation with sufficient flexibility to identify any plausible type of relationship. The specification of the model leads to estimation and model selection in a constrained, nonstandard setting. We provide methodology for testing the effect of DNA on mRNA and choosing the appropriate model. Furthermore, we present a novel approach to obtain reliable confidence bands for constrained PLRS, which incorporates model uncertainty. The procedures are applied to colorectal and breast cancer data. Common assumptions are found to be potentially misleading for biologically relevant genes. More flexible models may bring more insight into the interaction between the two markers.
Submitted 6 December, 2013;
originally announced December 2013.
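A piecewise linear regression spline of the kind named in the abstract above can be sketched with a hinge (truncated-linear) basis. The sketch is an unconstrained least-squares fit with a single known knot; the constraints, model selection, and confidence bands described in the abstract are not reproduced, and the helper names, the knot location, and the noiseless toy data are assumptions for illustration.

```python
import numpy as np

def hinge_basis(x, knots):
    """Design matrix with columns 1, x, and (x - k)_+ for each knot k."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x] + [np.maximum(x - k, 0.0) for k in knots]
    return np.column_stack(cols)

def fit_plrs(x, y, knots):
    """Unconstrained least-squares fit of a piecewise linear spline."""
    B = hinge_basis(x, knots)
    coef, *_ = np.linalg.lstsq(B, np.asarray(y, dtype=float), rcond=None)
    return coef

# Noiseless toy data (assumption): the slope changes from 0.2 to 1.7 at x = 0.
x = np.linspace(-1.0, 1.0, 41)
y = 0.5 + 0.2 * x + 1.5 * np.maximum(x - 0.0, 0.0)
coef = fit_plrs(x, y, knots=[0.0])
```

Each coefficient after the first two is the change in slope at the corresponding knot, which is what makes this parametrization easy to interpret.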