-
Guiding adaptive shrinkage by co-data to improve regression-based prediction and feature selection
Authors:
Mark A. van de Wiel,
Wessel N. van Wieringen
Abstract:
The high dimensional nature of genomics data complicates feature selection, in particular in low sample size studies - not uncommon in clinical prediction settings. It is widely recognized that complementary data on the features, `co-data', may improve results. Examples are prior feature groups or p-values from a related study. Such co-data are ubiquitous in genomics settings due to the availability of public repositories. Yet, the uptake of learning methods that structurally use such co-data is limited. We review guided adaptive shrinkage methods: a class of regression-based learners that use co-data to adapt the shrinkage parameters, crucial for the performance of those learners. We discuss technical aspects, but also the applicability in terms of types of co-data that can be handled. This class of methods is contrasted with several others. In particular, group-adaptive shrinkage is compared with the better-known sparse group-lasso by evaluating feature selection. Finally, we demonstrate the versatility of the guided shrinkage methodology by showing how to `do-it-yourself': we integrate implementations of a co-data learner and the spike-and-slab prior for the purpose of improving feature selection in genetics studies.
Submitted 8 May, 2024;
originally announced May 2024.
-
Co-data Learning for Bayesian Additive Regression Trees
Authors:
Jeroen M. Goedhart,
Thomas Klausch,
Jurriaan Janssen,
Mark A. van de Wiel
Abstract:
Medical prediction applications often need to deal with small sample sizes compared to the number of covariates. Such data pose problems for prediction and variable selection, especially when the covariate-response relationship is complicated. To address these challenges, we propose to incorporate co-data, i.e. external information on the covariates, into Bayesian additive regression trees (BART), a sum-of-trees prediction model that utilizes priors on the tree parameters to prevent overfitting. To incorporate co-data, an empirical Bayes (EB) framework is developed that estimates, assisted by a co-data model, prior covariate weights in the BART model. The proposed method can handle multiple types of co-data simultaneously. Furthermore, the proposed EB framework enables the estimation of the other hyperparameters of BART as well, rendering an appealing alternative to cross-validation. We show that the method finds relevant covariates and that it improves prediction compared to default BART in simulations. If the covariate-response relationship is nonlinear, the method benefits from the flexibility of BART to outperform regression-based co-data learners. Finally, the use of co-data enhances prediction in an application to diffuse large B-cell lymphoma prognosis based on clinical covariates, gene mutations, DNA translocations, and DNA copy number data.
Keywords: Bayesian additive regression trees; Empirical Bayes; Co-data; High-dimensional data; Omics; Prediction
Submitted 16 November, 2023;
originally announced November 2023.
-
Linked shrinkage to improve estimation of interaction effects in regression models
Authors:
Mark A. van de Wiel,
Matteo Amestoy,
Jeroen Hoogland
Abstract:
We address a classical problem in statistics: adding two-way interaction terms to a regression model. As the covariate dimension increases quadratically, we develop an estimator that adapts well to this increase, while providing accurate estimates and appropriate inference. Existing strategies overcome the dimensionality problem by only allowing interactions between relevant main effects. Building on this philosophy, we implement a softer link between the two types of effects using a local shrinkage model. We empirically show that borrowing strength between the amount of shrinkage for main effects and their interactions can strongly improve estimation of the regression coefficients. Moreover, we evaluate the potential of the model for inference, which is notoriously hard for selection strategies. Large-scale cohort data are used to provide realistic illustrations and evaluations. Comparisons with other methods are provided. The evaluation of variable importance is not trivial in regression models with many interaction terms. Therefore, we derive a new analytical formula for the Shapley value, which enables rapid assessment of individual-specific variable importance scores and their uncertainties. Finally, while not targeting prediction, we do show that our models can be very competitive with a more advanced machine learner, like random forest, even for fairly large sample sizes. The implementation of our method in RStan is fairly straightforward, allowing for adjustments to specific needs.
Submitted 25 September, 2023;
originally announced September 2023.
-
Think before you shrink: Alternatives to default shrinkage methods can improve prediction accuracy, calibration and coverage
Authors:
Mark A. van de Wiel,
Gwenaël G. R. Leday,
Jeroen Hoogland,
Martijn W. Heymans,
Erik W. van Zwet,
Ailko H. Zwinderman
Abstract:
While shrinkage is essential in high-dimensional settings, its use for low-dimensional regression-based prediction has been debated. It reduces variance, often leading to improved prediction accuracy. However, it also inevitably introduces bias, which may harm two other measures of predictive performance: calibration and coverage of confidence intervals. Much of the criticism stems from the usage of standard shrinkage methods, such as lasso and ridge with a single, cross-validated penalty. Our aim is to show that readily available alternatives can strongly improve predictive performance, in terms of accuracy, calibration or coverage. For linear regression, we use small sample splits of a large, fairly typical epidemiological data set to illustrate this. We show that usage of differential ridge penalties for covariate groups may enhance prediction accuracy, while calibration and coverage benefit from additional shrinkage of the penalties. In the logistic setting, we apply an external simulation to demonstrate that local shrinkage improves calibration with respect to global shrinkage, while providing better prediction accuracy than other solutions, like Firth's correction. The benefits of the alternative shrinkage methods are easily accessible via example implementations using \texttt{mgcv} and \texttt{r-stan}, including the estimation of multiple penalties. A synthetic copy of the large data set is shared for reproducibility.
Submitted 24 January, 2023;
originally announced January 2023.
-
Penalised regression with multiple sources of prior effects
Authors:
Armin Rauschenberger,
Zied Landoulsi,
Mark A. van de Wiel,
Enrico Glaab
Abstract:
In many high-dimensional prediction or classification tasks, complementary data on the features are available, e.g. prior biological knowledge on (epi)genetic markers. Here we consider tasks with numerical prior information that provide an insight into the importance (weight) and the direction (sign) of the feature effects, e.g. regression coefficients from previous studies. We propose an approach for integrating multiple sources of such prior information into penalised regression. If suitable co-data are available, this improves the predictive performance, as shown by simulation and application. The proposed method is implemented in the R package `transreg' (https://github.com/lcsb-bds/transreg).
Submitted 16 December, 2022;
originally announced December 2022.
-
Estimation of Predictive Performance in High-Dimensional Data Settings using Learning Curves
Authors:
Jeroen M. Goedhart,
Thomas Klausch,
Mark A. van de Wiel
Abstract:
In high-dimensional prediction settings, it remains challenging to reliably estimate the test performance. To address this challenge, a novel performance estimation framework is presented. This framework, called Learn2Evaluate, is based on learning curves by fitting a smooth monotone curve depicting test performance as a function of the sample size. Learn2Evaluate has several advantages compared to commonly applied performance estimation methodologies. Firstly, a learning curve offers a graphical overview of a learner. This overview assists in assessing the potential benefit of adding training samples and it provides a more complete comparison between learners than performance estimates at a fixed subsample size. Secondly, a learning curve facilitates estimation of the performance at the total sample size rather than a subsample size. Thirdly, Learn2Evaluate allows the computation of a theoretically justified and useful lower confidence bound. Furthermore, this bound may be tightened by performing a bias correction. The benefits of Learn2Evaluate are illustrated by a simulation study and applications to omics data.
Submitted 8 June, 2022;
originally announced June 2022.
-
ecpc: An R-package for generic co-data models for high-dimensional prediction
Authors:
Mirrelijn M. van Nee,
Lodewyk F. A. Wessels,
Mark A. van de Wiel
Abstract:
High-dimensional prediction considers data with more variables than samples. Generic research goals are to find the best predictor or to select variables. Results may be improved by exploiting prior information in the form of co-data, providing complementary data not on the samples, but on the variables. We consider adaptive ridge penalised generalised linear and Cox models, in which the variable specific ridge penalties are adapted to the co-data to give a priori more weight to more important variables. The R-package ecpc originally accommodated various and possibly multiple co-data sources, including categorical co-data, i.e. groups of variables, and continuous co-data. Continuous co-data, however, was handled by adaptive discretisation, potentially inefficiently modelling and losing information. Here, we present an extension to the method and software for generic co-data models, particularly for continuous co-data. At the basis lies a classical linear regression model, regressing prior variance weights on the co-data. Co-data variables are then estimated with empirical Bayes moment estimation. After placing the estimation procedure in the classical regression framework, extension to generalised additive and shape constrained co-data models is straightforward. Besides, we show how ridge penalties may be transformed to elastic net penalties with the R-package squeezy. In simulation studies we first compare various co-data models for continuous co-data from the extension to the original method. Secondly, we compare variable selection performance to other variable selection methods. Moreover, we demonstrate use of the package in several examples throughout the paper.
Submitted 16 May, 2022;
originally announced May 2022.
-
A Bayesian accelerated failure time model for interval censored three-state screening outcomes
Authors:
Thomas Klausch,
Eddymurphy U. Akwiwu,
Mark A. van de Wiel,
Veerle M. H. Coupe,
Johannes Berkhof
Abstract:
Women infected by the human papillomavirus (HPV) are at increased risk of developing cervical intraepithelial neoplasia (CIN) lesions. CIN are classified into three grades of increasing severity (CIN-1, CIN-2, and CIN-3) and can eventually develop into cervical cancer. The main purpose of screening is to detect CIN-2 and CIN-3 cases, which are usually surgically removed. Screening data from the POBASCAM trial involving 1,454 HPV-positive women are analyzed with two objectives: to estimate (a) the transition time from HPV diagnosis to CIN-3; and (b) the transition time from CIN-2 to CIN-3. The screening data have two key characteristics. First, the CIN state is monitored in an interval-censored sequence of screening times. Second, a woman's progression to CIN-3 is only observed if she progresses both to CIN-2 and from CIN-2 to CIN-3 in the same screening interval. We propose a Bayesian accelerated failure time model for the two transition times in this three-state model. To deal with the unusual censoring structure of the screening data, we develop a Metropolis-within-Gibbs algorithm with data augmentation from the truncated transition time distributions.
Submitted 3 December, 2021; v1 submitted 6 October, 2021;
originally announced October 2021.
-
Semi-supervised empirical Bayes group-regularized factor regression
Authors:
Magnus M. Münch,
Mark A. van de Wiel,
Aad W. van der Vaart,
Carel F. W. Peeters
Abstract:
The features in high dimensional biomedical prediction problems are often well described with lower dimensional manifolds. An example is genes that are organised in smaller functional networks. The outcome can then be described with the factor regression model. A benefit of the factor model is that it allows for straightforward inclusion of unlabeled observations in the estimation of the model, i.e., semi-supervised learning. In addition, the high dimensional features in biomedical prediction problems are often well characterised. Examples are genes, for which annotation is available, and metabolites, for which $p$-values from a previous study are available. In this paper, the extra information on the features is included in the prior model for the features. The extra information is weighted and included in the estimation through empirical Bayes, with variational approximations to speed up the computation. The method is demonstrated in simulations and two applications. One application considers influenza vaccine efficacy prediction based on microarray data. The second application predicts oral cancer metastasis from RNAseq data.
Submitted 6 April, 2021;
originally announced April 2021.
-
Fast marginal likelihood estimation of penalties for group-adaptive elastic net
Authors:
Mirrelijn M. van Nee,
Tim van de Brug,
Mark A. van de Wiel
Abstract:
Nowadays, clinical research routinely uses omics data, such as gene expression, for predicting clinical outcomes or selecting markers. Additionally, so-called co-data are often available, providing complementary information on the covariates, like p-values from previously published studies or groups of genes corresponding to pathways. Elastic net penalisation is widely used for prediction and covariate selection. Group-adaptive elastic net penalisation learns from co-data to improve the prediction and covariate selection, by penalising important groups of covariates less than other groups. Existing methods are, however, computationally expensive. Here we present a fast method for marginal likelihood estimation of group-adaptive elastic net penalties for generalised linear models. We first derive a low-dimensional representation of the Taylor approximation of the marginal likelihood and its first derivative for group-adaptive ridge penalties, to efficiently estimate these penalties. Then we show by using asymptotic normality of the linear predictors that the marginal likelihood for elastic net models may be approximated well by the marginal likelihood for ridge models. The ridge group penalties are then transformed to elastic net group penalties by using the variance function. The method allows for overlapping groups and unpenalised variables. We demonstrate the method in a model-based simulation study and an application to cancer genomics. The method substantially decreases computation time and outperforms or matches other methods by learning from co-data.
Submitted 11 January, 2021;
originally announced January 2021.
-
Fast cross-validation for multi-penalty ridge regression
Authors:
Mark A. van de Wiel,
Mirrelijn M. van Nee,
Armin Rauschenberger
Abstract:
High-dimensional prediction with multiple data types needs to account for potentially strong differences in predictive signal. Ridge regression is a simple model for high-dimensional data that has challenged the predictive performance of many more complex models and learners, and that allows inclusion of data type specific penalties. The largest challenge for multi-penalty ridge is to optimize these penalties efficiently in a cross-validation (CV) setting, in particular for GLM and Cox ridge regression, which require an additional estimation loop by iterative weighted least squares (IWLS). Our main contribution is a computationally very efficient formula for the multi-penalty, sample-weighted hat-matrix, as used in the IWLS algorithm. As a result, nearly all computations are in low-dimensional space, rendering a speed-up of several orders of magnitude. We developed a flexible framework that facilitates multiple types of response, unpenalized covariates, several performance criteria and repeated CV. Extensions to paired and preferential data types are included and illustrated on several cancer genomics survival prediction problems. Moreover, we present similar computational shortcuts for maximum marginal likelihood and Bayesian probit regression. The corresponding R-package, multiridge, serves as a versatile standalone tool, but also as a fast benchmark for other more complex models and multi-view learners.
Submitted 1 April, 2021; v1 submitted 19 May, 2020;
originally announced May 2020.
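The claim above that nearly all computations can be carried out in low-dimensional space can be illustrated for the simplest case, linear-response ridge with one penalty per data type. The sketch below is not the paper's hat-matrix formula for IWLS and does not use the multiridge package; it only shows the standard identity that the p-dimensional ridge solution can be obtained from an n x n system. The function names, the simulated data, and the penalty values are hypothetical.

```python
import numpy as np

def multi_penalty_ridge_primal(X_blocks, y, penalties):
    """Solve (X'X + Lambda) beta = X'y directly in the p-dimensional space."""
    X = np.hstack(X_blocks)
    # One penalty per data type, repeated for every column of that block.
    lam = np.concatenate(
        [np.full(Xb.shape[1], l) for Xb, l in zip(X_blocks, penalties)]
    )
    return np.linalg.solve(X.T @ X + np.diag(lam), X.T @ y)

def multi_penalty_ridge_dual(X_blocks, y, penalties):
    """Same estimate, but the only linear system solved is n x n."""
    n = y.shape[0]
    # Penalty-weighted Gram matrix: sum over blocks of X_b X_b' / lambda_b.
    K = sum(Xb @ Xb.T / l for Xb, l in zip(X_blocks, penalties))
    alpha = np.linalg.solve(K + np.eye(n), y)
    # Map the n-dimensional solution back to coefficients, block by block.
    return np.concatenate(
        [Xb.T @ alpha / l for Xb, l in zip(X_blocks, penalties)]
    )

# Hypothetical example: n = 20 samples, two data types with 300 and 500 features.
rng = np.random.default_rng(0)
n = 20
X_blocks = [rng.normal(size=(n, 300)), rng.normal(size=(n, 500))]
y = rng.normal(size=n)
penalties = [5.0, 50.0]  # illustrative values, not tuned by cross-validation

beta_primal = multi_penalty_ridge_primal(X_blocks, y, penalties)
beta_dual = multi_penalty_ridge_dual(X_blocks, y, penalties)
max_abs_diff = float(np.max(np.abs(beta_primal - beta_dual)))
print("largest coefficient difference:", max_abs_diff)
```

In this sketch, changing a penalty only rescales the per-block Gram matrices, which can be computed once and reused; that reuse is what makes repeated penalty evaluation cheap when p is much larger than n.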
-
Flexible co-data learning for high-dimensional prediction
Authors:
Mirrelijn M. van Nee,
Lodewyk F. A. Wessels,
Mark A. van de Wiel
Abstract:
Clinical research often focuses on complex traits in which many variables play a role in mechanisms driving, or curing, diseases. Clinical prediction is hard when data is high-dimensional, but additional information, like domain knowledge and previously published studies, may be helpful to improve predictions. Such complementary data, or co-data, provide information on the covariates, such as genomic location or p-values from external studies. Our method enables exploiting multiple and various co-data sources to improve predictions. We use discrete or continuous co-data to define possibly overlapping or hierarchically structured groups of covariates. These are then used to estimate adaptive multi-group ridge penalties for generalised linear and Cox models. We combine empirical Bayes estimation of group penalty hyperparameters with an extra level of shrinkage. This renders a uniquely flexible framework as any type of shrinkage can be used on the group level. The hyperparameter shrinkage learns how relevant a specific co-data source is, counters overfitting of hyperparameters for many groups, and accounts for structured co-data. We describe various types of co-data and propose suitable forms of hypershrinkage. The method is very versatile, as it allows for integration and weighting of multiple co-data sets, inclusion of unpenalised covariates and posterior variable selection. We demonstrate it on two cancer genomics applications and show that it may improve the performance of other dense and parsimonious prognostic models substantially, and stabilises variable selection.
Submitted 8 May, 2020;
originally announced May 2020.
-
Stable prediction with radiomics data
Authors:
Carel F. W. Peeters,
Caroline Übelhör,
Steven W. Mes,
Roland Martens,
Thomas Koopman,
Pim de Graaf,
Floris H. P. van Velden,
Ronald Boellaard,
Jonas A. Castelijns,
Dennis E. te Beest,
Martijn W. Heymans,
Mark A. van de Wiel
Abstract:
Motivation: Radiomics refers to the high-throughput mining of quantitative features from radiographic images. It is a promising field in that it may provide a non-invasive solution for screening and classification. Standard machine learning classification and feature selection techniques, however, tend to display inferior performance in terms of (the stability of) predictive performance. This is due to the heavy multicollinearity present in radiomic data. We set out to provide an easy-to-use approach that deals with this problem.
Results: We developed a four-step approach that projects the original high-dimensional feature space onto a lower-dimensional latent-feature space, while retaining most of the covariation in the data. It consists of (i) penalized maximum likelihood estimation of a redundancy filtered correlation matrix. The resulting matrix (ii) is the input for a maximum likelihood factor analysis procedure. This two-stage maximum-likelihood approach can be used to (iii) produce a compact set of stable features that (iv) can be directly used in any (regression-based) classifier or predictor. It outperforms other classification (and feature selection) techniques in both external and internal validation settings regarding survival in squamous cell cancers.
Submitted 27 March, 2019;
originally announced March 2019.
-
Estimation of variance components, heritability and the ridge penalty in high-dimensional generalized linear models
Authors:
Jurre R. Veerman,
Gwenael G. R. Leday,
Mark A. van de Wiel
Abstract:
For high-dimensional linear regression models, we review and compare several estimators of variances $τ^2$ and $σ^2$ of the random slopes and errors, respectively. These variances relate directly to ridge regression penalty $λ$ and heritability index $h^2$, often used in genetics. Direct and indirect estimators of these, either based on cross-validation (CV) or maximum marginal likelihood (MML), are also discussed. The comparisons include several cases of covariate matrix $\mathbf{X}_{n \times p}$, with $p \gg n$, such as multi-collinear covariates and data-derived ones. In addition, we study robustness against departures from the model such as sparse instead of dense effects and non-Gaussian errors.
An example on weight gain data with genomic covariates confirms the good performance of MML compared to CV. Several extensions are presented. First, to the high-dimensional linear mixed effects model, with REML as an alternative to MML. Second, to the conjugate Bayesian setting, which proves to be a good alternative. Third, and most prominently, to generalized linear models for which we derive a computationally efficient MML estimator by re-writing the marginal likelihood as an $n$-dimensional integral. For Poisson and Binomial ridge regression, we demonstrate the superior accuracy of the resulting MML estimator of $λ$ as compared to CV. Software is provided to enable reproduction of all results presented here.
Submitted 7 February, 2019;
originally announced February 2019.
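The abstract above states that $τ^2$ and $σ^2$ relate directly to $λ$ and $h^2$ without giving the relations. A common way to write them, under the usual random-slopes formulation of ridge regression, is sketched below. The standardization of the covariates to unit variance is an assumption made here for the heritability expression; the paper itself may use a different scaling.

```latex
% Assumed model: y = X\beta + \varepsilon with independent random slopes and errors
\beta_j \sim N(0, \tau^2), \qquad \varepsilon_i \sim N(0, \sigma^2)

% The ridge estimator with penalty \lambda equals the posterior mean of \beta when
\lambda = \frac{\sigma^2}{\tau^2}

% With p covariates standardized to unit variance (an assumption made here):
h^2 = \frac{p\,\tau^2}{p\,\tau^2 + \sigma^2} = \frac{p}{p + \lambda}
```

Under these assumptions, an estimate of either pair ($τ^2$, $σ^2$) or of $λ$ alone determines $h^2$, which is why estimators of the variances, the penalty, and heritability can be compared on common ground.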
-
Incorporating prior information and borrowing information in high-dimensional sparse regression using the horseshoe and variational Bayes
Authors:
Gino B. Kpogbezan,
Mark A. van de Wiel,
Wessel N. van Wieringen,
Aad W. van der Vaart
Abstract:
We introduce a sparse high-dimensional regression approach that can incorporate prior information on the regression parameters and can borrow information across a set of similar datasets. Prior information may for instance come from previous studies or genomic databases, and information borrowed across a set of genes or genomic networks. The approach is based on prior modelling of the regression parameters using the horseshoe prior, with a prior on the sparsity index that depends on external information. Multiple datasets are integrated by applying an empirical Bayes strategy on hyperparameters. For computational efficiency we approximate the posterior distribution using a variational Bayes method. The proposed framework is useful for analysing large-scale data sets with complex dependence structures. We illustrate this by applications to the reconstruction of gene regulatory networks and to eQTL mapping.
Submitted 29 January, 2019;
originally announced January 2019.
-
Estimating Bayesian Optimal Treatment Regimes for Dichotomous Outcomes using Observational Data
Authors:
Thomas Klausch,
Peter van de Ven,
Tim van de Brug,
Mark A. van de Wiel,
Johannes Berkhof
Abstract:
Optimal treatment regimes (OTR) are individualised treatment assignment strategies that identify a medical treatment as optimal given all background information available on the individual. We discuss Bayes optimal treatment regimes estimated using a loss function defined on the bivariate distribution of dichotomous potential outcomes. The proposed approach allows considering more general objectives for the OTR than maximization of an expected outcome (e.g., survival probability) by taking into account, for example, unnecessary treatment burden. As a motivating example we consider the case of oropharynx cancer treatment where unnecessary burden due to chemotherapy is to be avoided while maximizing survival chances. Assuming ignorable treatment assignment we describe Bayesian inference about the OTR including a sensitivity analysis on the unobserved partial association of the potential outcomes. We evaluate the methodology by simulations that apply Bayesian parametric and more flexible non-parametric outcome models. The proposed OTR for oropharynx cancer reduces the frequency of the more burdensome chemotherapy assignment by approximately 75% without reducing the average survival probability. This regime thus offers a strong increase in expected quality of life of patients.
Submitted 28 September, 2018; v1 submitted 18 September, 2018;
originally announced September 2018.
-
Detecting SNPs with interactive effects on a quantitative trait
Authors:
Armin Rauschenberger,
Renee X. Menezes,
Mark A. van de Wiel,
Natasja M. van Schoor,
Marianne A. Jonker
Abstract:
Here we propose a test to detect effects of single nucleotide polymorphisms (SNPs) on a quantitative trait. Significant SNP-SNP interactions are more difficult to detect than significant SNPs, partly due to the massive number of SNP-SNP combinations. We propose to move away from testing interaction terms, and move towards testing whether an individual SNP is involved in any interaction. This reduces the multiple testing burden to one test per SNP, and allows for interactions with unobserved factors. Analysing one SNP at a time, we split the individuals into two groups, based on the number of minor alleles. If the quantitative trait differs in mean between the two groups, the SNP has a main effect. If the quantitative trait differs in distribution between some individuals in one group and all other individuals, the SNP possibly has an interactive effect. We propose a mixture test to detect both types of effects. Implicitly, the membership probabilities may suggest potential interacting variables. Analysing simulated and experimental data, we show that the proposed test is statistically powerful, maintains the type I error rate, and detects meaningful signals. The R package semisup is available from Bioconductor.
Submitted 23 May, 2018;
originally announced May 2018.
-
Adaptive group-regularized logistic elastic net regression
Authors:
Magnus M. Münch,
Carel F. W. Peeters,
Aad W. van der Vaart,
Mark A. van de Wiel
Abstract:
In high-dimensional data settings, additional information on the features is often available. Examples of such external information in omics research are: (a) p-values from a previous study, (b) a summary of prior information, and (c) omics annotation. The inclusion of this information in the analysis may enhance classification performance and feature selection, but is not straightforward in the standard regression setting. As a solution to this problem, we propose a group-regularized (logistic) elastic net regression method, where each penalty parameter corresponds to a group of features based on the external information. The method, termed gren, makes use of the Bayesian formulation of logistic elastic net regression to estimate both the model and penalty parameters in an approximate empirical-variational Bayes framework. Simulations and an application to a colon cancer microRNA study show that, if the partitioning of the features is informative, classification performance and feature selection are indeed enhanced.
Submitted 1 May, 2018;
originally announced May 2018.
-
Blood-based metabolic signatures in Alzheimer's disease
Authors:
Francisca A. de Leeuw,
Carel F. W. Peeters,
Maartje I. Kester,
Amy C. Harms,
Eduard A. Struys,
Thomas Hankemeier,
Herman W. T. van Vlijmen,
Sven J. van der Lee,
Cornelia M. van Duijn,
Philip Scheltens,
Ayşe Demirkan,
Mark A. van de Wiel,
Wiesje M. van der Flier,
Charlotte E. Teunissen
Abstract:
Introduction: Identification of blood-based metabolic changes might provide early and easy-to-obtain biomarkers.
Methods: We included 127 AD patients and 121 controls with CSF-biomarker-confirmed diagnosis (cut-off tau/A$β_{42}$: 0.52). Mass spectrometry platforms determined the concentrations of 53 amine, 22 organic acid, 120 lipid, and 40 oxidative stress compounds. Multiple signatures were assessed: differential expression (nested linear models), classification (logistic regression), and regulatory (network extraction).
Results: Twenty-six metabolites were differentially expressed. Metabolites improved the classification performance of clinical variables from 74% to 79%. Network models identified 5 hubs of metabolic dysregulation: Tyrosine, glycylglycine, glutamine, lysophosphatic acid C18:2 and platelet activating factor C16:0. The metabolite network for APOE $ε$4 negative AD patients was less cohesive compared to the network for APOE $ε$4 positive AD patients.
Discussion: Multiple signatures point to various promising peripheral markers for further validation. The network differences in AD patients according to APOE genotype may reflect different pathways to AD.
Submitted 21 September, 2017;
originally announced September 2017.
-
Learning from a lot: Empirical Bayes in high-dimensional prediction settings
Authors:
Mark A. van de Wiel,
Dennis E. te Beest,
Magnus Münch
Abstract:
Empirical Bayes is a versatile approach to `learn from a lot' in two ways: first, from a large number of variables and second, from a potentially large amount of prior information, e.g. stored in public repositories. We review applications of a variety of empirical Bayes methods to several well-known model-based prediction methods including penalized regression, linear discriminant analysis, and Bayesian models with sparse or dense priors. We discuss `formal' empirical Bayes methods which maximize the marginal likelihood, but also more informal approaches based on other data summaries. We contrast empirical Bayes to cross-validation and full Bayes, and discuss hybrid approaches. To study the relation between the quality of an empirical Bayes estimator and $p$, the number of variables, we consider a simple empirical Bayes estimator in a linear model setting.
We argue that empirical Bayes is particularly useful when the prior contains multiple parameters which model a priori information on variables, termed `co-data'. In particular, we present two novel examples that allow for co-data. First, a Bayesian spike-and-slab setting that facilitates inclusion of multiple co-data sources and types; second, a hybrid empirical Bayes-full Bayes ridge regression approach for estimation of the posterior predictive interval.
Submitted 16 March, 2018; v1 submitted 13 September, 2017;
originally announced September 2017.
-
Improved high-dimensional prediction with Random Forests by the use of co-data
Authors:
Dennis E. te Beest,
Steven W. Mes,
Ruud H. Brakenhoff,
Mark A. van de Wiel
Abstract:
Prediction in high-dimensional settings is difficult due to the large number of variables relative to the sample size. We demonstrate how auxiliary "co-data" can be used to improve the performance of a Random Forest in such a setting. Co-data are incorporated in the Random Forest by replacing the uniform sampling probabilities (used to draw candidate variables, the default for a Random Forest) by co-data moderated sampling probabilities. Co-data are defined here as any type of information that is available on the variables of the primary data, but does not use its response labels. These moderated sampling probabilities are, inspired by empirical Bayes, learned from the data at hand. We demonstrate this co-data moderated Random Forest (CoRF) with one example. In the example we aim to predict lymph node metastasis with gene expression data. We demonstrate how a set of external p-values, a gene signature, and the correlation between gene expression and DNA copy number can improve the predictive performance.
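The idea of replacing uniform sampling of candidate variables by weighted sampling can be illustrated with a minimal sketch. It is not the CoRF implementation: the weights below are hypothetical placeholders for probabilities that the method would learn from co-data, and the helper name `draw_candidates` is an assumption made for illustration only.

```python
import random


def draw_candidates(weights, mtry, rng):
    """Draw `mtry` distinct variable indices, each draw proportional to `weights`.

    `weights` stands in for co-data moderated sampling probabilities
    (hypothetical values here); variables with weight 0 are never drawn.
    """
    # Only variables with positive weight can be selected.
    remaining = {j: w for j, w in enumerate(weights) if w > 0}
    if mtry > len(remaining):
        raise ValueError("mtry exceeds the number of variables with positive weight")
    chosen = []
    for _ in range(mtry):
        indices = list(remaining)
        # One weighted draw, then remove the pick so indices stay distinct.
        pick = rng.choices(indices, weights=[remaining[j] for j in indices], k=1)[0]
        chosen.append(pick)
        del remaining[pick]
    return chosen


# Hypothetical weights for six variables; a uniform Random Forest would use equal weights.
codata_weights = [0.0, 0.5, 3.0, 1.0, 0.0, 2.5]
candidates = draw_candidates(codata_weights, mtry=3, rng=random.Random(1))
```

In this sketch, variables that the co-data mark as uninformative (weight 0) are never offered as split candidates, while the others are offered in proportion to their weight.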
Submitted 2 June, 2017;
originally announced June 2017.
-
The Spectral Condition Number Plot for Regularization Parameter Determination
Authors:
Carel F. W. Peeters,
Mark A. van de Wiel,
Wessel N. van Wieringen
Abstract:
Many modern statistical applications ask for the estimation of a covariance (or precision) matrix in settings where the number of variables is larger than the number of observations. There exists a broad class of ridge-type estimators that employs regularization to cope with the subsequent singularity of the sample covariance matrix. These estimators depend on a penalty parameter and choosing its value can be hard, in terms of being computationally unfeasible or tenable only for a restricted set of ridge-type estimators. Here we introduce a simple graphical tool, the spectral condition number plot, for informed heuristic penalty parameter selection. The proposed tool is computationally friendly and can be employed for the full class of ridge-type covariance (precision) estimators.
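To make the quantity behind such a plot concrete, here is a minimal sketch under a simplifying assumption: the ridge-type estimator is taken to be the sample covariance matrix plus the penalty times the identity, so each eigenvalue is shifted by the penalty. The broader class of estimators covered by the abstract is not reproduced, and the function name and the example eigenvalues are hypothetical.

```python
def condition_numbers(eigenvalues, penalties):
    """Spectral condition number of S + lam * I for each penalty lam.

    `eigenvalues` are the eigenvalues of the sample covariance matrix S
    (assumed to be supplied by the caller); adding lam * I shifts each
    eigenvalue by lam, so the condition number is
    (d_max + lam) / (d_min + lam).
    """
    if any(lam <= 0 for lam in penalties):
        raise ValueError("penalties must be strictly positive")
    d_max, d_min = max(eigenvalues), min(eigenvalues)
    return [(d_max + lam) / (d_min + lam) for lam in penalties]


# Hypothetical eigenvalues of a singular sample covariance matrix (smallest is 0).
sample_eigenvalues = [4.0, 1.0, 0.0]
penalties = [0.5, 1.0, 4.0]
kappas = condition_numbers(sample_eigenvalues, penalties)
# kappas == [9.0, 5.0, 2.0]
```

Plotting these values against the penalty gives a curve from which a penalty can be read off at the point where the condition number stops improving quickly.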
Submitted 14 August, 2016;
originally announced August 2016.
-
An empirical Bayes approach to network recovery using external knowledge
Authors:
Gino B. Kpogbezan,
Aad W. van der Vaart,
Wessel N. van Wieringen,
Gwenaël G. R. Leday,
Mark A. van de Wiel
Abstract:
Reconstruction of a high-dimensional network may benefit substantially from the inclusion of prior knowledge on the network topology. In the case of gene interaction networks such knowledge may come for instance from pathway repositories like KEGG, or be inferred from data of a pilot study. The Bayesian framework provides a natural means of including such prior knowledge. Based on a Bayesian Simultaneous Equation Model, we develop an appealing empirical Bayes procedure which automatically assesses the relevance of the used prior knowledge. We use a variational Bayes method to approximate the posterior densities and compare its accuracy with that of a Gibbs sampling strategy. Our method is computationally fast, and can outperform known competitors. In a simulation study we show that accurate prior data can greatly improve the reconstruction of the network, but need not harm the reconstruction if wrong. We demonstrate the benefits of the method in an analysis of gene expression data from GEO. In particular, the edges of the recovered network have superior reproducibility (compared to that of competitors) over resampled versions of the data.
Submitted 24 May, 2016;
originally announced May 2016.
-
Gene network reconstruction using global-local shrinkage priors
Authors:
Gwenaël G. R. Leday,
Mathisca C. M. de Gunst,
Gino B. Kpogbezan,
Aad W. van der Vaart,
Wessel N. van Wieringen,
Mark A. van de Wiel
Abstract:
Reconstructing a gene network from high-throughput molecular data is often a challenging task, as the number of parameters to estimate is easily much larger than the sample size. A conventional remedy is to regularize or penalize the model likelihood. In network models, this is often done locally in the neighbourhood of each node or gene. However, estimation of the many regularization parameters is often difficult and can result in large statistical uncertainties. In this paper we propose to combine local regularization with global shrinkage of the regularization parameters to borrow strength between genes and improve inference. We employ a simple Bayesian model with non-sparse, conjugate priors to facilitate the use of fast variational approximations to posteriors. We discuss empirical Bayes estimation of hyper-parameters of the priors, and propose a novel approach to rank-based posterior thresholding. Using extensive model- and data-based simulations, we demonstrate that the proposed inference strategy outperforms popular (sparse) methods, yields more stable edges, and is more reproducible.
Submitted 13 October, 2015;
originally announced October 2015.
-
Better prediction by use of co-data: Adaptive group-regularized ridge regression
Authors:
Mark A. van de Wiel,
Tonje G. Lien,
Wina Verlaat,
Wessel N. van Wieringen,
Saskia M. Wilting
Abstract:
For many high-dimensional studies, additional information on the variables, like (genomic) annotation or external p-values, is available. In the context of binary and continuous prediction, we develop a method for adaptive group-regularized (logistic) ridge regression, which makes structural use of such 'co-data'. Here, 'groups' refer to a partition of the variables according to the co-data. We derive empirical Bayes estimates of group-specific penalties, which possess several nice properties: i) they are analytical; ii) they adapt to the informativeness of the co-data for the data at hand; iii) only one global penalty parameter requires tuning by cross-validation. In addition, the method allows use of multiple types of co-data at little extra computational effort.
We show that the group-specific penalties may lead to a larger distinction between `near-zero' and relatively large regression parameters, which facilitates post-hoc variable selection. The method, termed GRridge, is implemented in an easy-to-use R-package. It is demonstrated on two cancer genomics studies, which both concern the discrimination of precancerous cervical lesions from normal cervix tissues using methylation microarray data. For both examples, GRridge clearly improves the predictive performances of ordinary logistic ridge regression and the group lasso. In addition, we show that for the second study the relatively good predictive performance is maintained when selecting only 42 variables.
Submitted 18 May, 2015; v1 submitted 13 November, 2014;
originally announced November 2014.
-
Modeling association between DNA copy number and gene expression with constrained piecewise linear regression splines
Authors:
Gwenaël G. R. Leday,
Aad W. van der Vaart,
Wessel N. van Wieringen,
Mark A. van de Wiel
Abstract:
DNA copy number and mRNA expression are widely used data types in cancer studies, which, combined, provide more insight than either does separately. Whereas in existing literature the form of the relationship between these two types of markers is fixed a priori, in this paper we model their association. We employ piecewise linear regression splines (PLRS), which combine good interpretation with sufficient flexibility to identify any plausible type of relationship. The specification of the model leads to estimation and model selection in a constrained, nonstandard setting. We provide methodology for testing the effect of DNA on mRNA and choosing the appropriate model. Furthermore, we present a novel approach to obtain reliable confidence bands for constrained PLRS, which incorporates model uncertainty. The procedures are applied to colorectal and breast cancer data. Common assumptions are found to be potentially misleading for biologically relevant genes. More flexible models may bring more insight into the interaction between the two markers.
Submitted 6 December, 2013;
originally announced December 2013.
-
A nonparametric control chart based on the Mann-Whitney statistic
Authors:
Subhabrata Chakraborti,
Mark A. van de Wiel
Abstract:
Nonparametric or distribution-free charts can be useful in statistical process control problems when there is limited or no knowledge about the underlying process distribution. In this paper, a phase II Shewhart-type chart is considered for location, based on reference data from phase I analysis and the well-known Mann-Whitney statistic. Control limits are computed using Lugannani-Rice-saddlepoint, Edgeworth, and other approximations along with Monte Carlo estimation. The derivations take account of estimation and the dependence from the use of a reference sample. An illustrative numerical example is presented. The in-control performance of the proposed chart is shown to be much superior to the classical Shewhart $\bar{X}$ chart. Further comparisons on the basis of some percentiles of the out-of-control conditional run length distribution and the unconditional out-of-control ARL show that the proposed chart is almost as good as the Shewhart $\bar{X}$ chart for the normal distribution, but is more powerful for a heavy-tailed distribution such as the Laplace, or for a skewed distribution such as the Gamma. Interactive software, enabling a complete implementation of the chart, is made available on a website.
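The charting statistic itself is simple to compute. The sketch below counts, for one phase II test sample, the pairs in which a test observation exceeds a reference observation (ties counted as one half, a common convention that is an assumption here). The control limits are hypothetical inputs; the approximations the abstract mentions for deriving them are not reproduced, and the function names are illustrative only.

```python
def mann_whitney_u(reference, test):
    """Mann-Whitney statistic: number of (reference, test) pairs with test > reference.

    Ties contribute 0.5 (an assumed convention for this sketch).
    """
    u = 0.0
    for y in test:
        for x in reference:
            if y > x:
                u += 1.0
            elif y == x:
                u += 0.5
    return u


def signals(u, lower, upper):
    """Return True when the statistic falls outside the (user-supplied) control limits."""
    return u < lower or u > upper


# Hypothetical phase I reference data and one phase II test sample.
reference_sample = [1.0, 2.0, 3.0]
test_sample = [2.5, 4.0]
u_stat = mann_whitney_u(reference_sample, test_sample)
# u_stat == 5.0 out of a maximum of 3 * 2 = 6 pairs
```

A value near the maximum (or near zero) indicates that the test sample sits above (or below) the reference data, which is what a location chart is meant to flag.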
Submitted 15 May, 2008;
originally announced May 2008.