-
Semi-supervised empirical Bayes group-regularized factor regression
Authors:
Magnus M. Münch,
Mark A. van de Wiel,
Aad W. van der Vaart,
Carel F. W. Peeters
Abstract:
The features in high dimensional biomedical prediction problems are often well described with lower dimensional manifolds. An example is genes that are organised in smaller functional networks. The outcome can then be described with the factor regression model. A benefit of the factor model is that it allows for straightforward inclusion of unlabeled observations in the estimation of the model, i.e., semi-supervised learning. In addition, the high dimensional features in biomedical prediction problems are often well characterised. Examples are genes, for which annotation is available, and metabolites with $p$-values from a previous study available. In this paper, the extra information on the features is included in the prior model for the features. The extra information is weighted and included in the estimation through empirical Bayes, with variational approximations to speed up the computation. The method is demonstrated in simulations and two applications. One application considers influenza vaccine efficacy prediction based on microarray data. The second application predicts oral cancer metastasis from RNAseq data.
Submitted 6 April, 2021;
originally announced April 2021.
-
rags2ridges: A One-Stop-Shop for Graphical Modeling of High-Dimensional Precision Matrices
Authors:
Carel F. W. Peeters,
Anders Ellern Bilgrau,
Wessel N. van Wieringen
Abstract:
A graphical model is an undirected network representing the conditional independence properties between random variables. Graphical modeling has become part and parcel of systems or network approaches to multivariate data, in particular when the variable dimension exceeds the observation dimension. rags2ridges is an R package for graphical modeling of high-dimensional precision matrices. It provides a modular framework for the extraction, visualization, and analysis of Gaussian graphical models from high-dimensional data. Moreover, it can handle the incorporation of prior information as well as multiple heterogeneous data classes. As such, it provides a one-stop-shop for graphical modeling of high-dimensional precision matrices. The functionality of the package is illustrated with an example dataset pertaining to blood-based metabolite measurements in persons suffering from Alzheimer's Disease.
Submitted 12 October, 2020;
originally announced October 2020.
-
A Note on a Simple and Practical Randomized Response Framework for Eliciting Sensitive Dichotomous & Quantitative Information
Authors:
Carel F. W. Peeters,
Gerty J. L. M. Lensvelt-Mulders,
Karin Lasthuizen
Abstract:
Many issues of interest to social scientists and policymakers are of a sensitive nature in the sense that they are intrusive, stigmatizing or incriminating to the respondent. This results in refusals to cooperate or evasive cooperation in studies using self-reports. In a seminal article Warner proposed to curb this problem by generating an artificial variability in responses to inoculate the individual meaning of answers to sensitive questions. This procedure was further developed and extended, and came to be known as the randomized response (RR) technique. Here, we propose a unified treatment for eliciting sensitive binary as well as quantitative information with RR based on a model where the inoculating elements are provided for by the randomization device. The procedure is simple and we will argue that its implementation in a computer-assisted setting may have superior practical capabilities.
Submitted 17 September, 2019;
originally announced September 2019.
-
Rotational Uniqueness Conditions Under Oblique Factor Correlation Metric
Authors:
Carel F. W. Peeters
Abstract:
In an addendum to his seminal 1969 article Jöreskog stated two sets of conditions for rotational identification of the oblique factor solution under utilization of fixed zero elements in the factor loadings matrix. These condition sets, formulated under factor correlation and factor covariance metrics, respectively, were claimed to be equivalent and to lead to global rotational uniqueness of the factor solution. It is shown here that the conditions for the oblique factor correlation structure need to be amended for global rotational uniqueness, and hence, that the condition sets are not equivalent in terms of unicity of the solution.
Submitted 17 September, 2019;
originally announced September 2019.
-
Social Network Analysis of Corruption Structures: Adjacency Matrices Supporting the Visualization and Quantification of Layeredness
Authors:
Carel F. W. Peeters
Abstract:
Often, corruption is described as taking place within or supported by a network: A collection of individuals structured in such a way as to enable the transaction of bribes for favors. Surprisingly, despite the network nomenclature, corruption is rarely analyzed from the network perspective using the tools of network science. Here, we will argue that analyzing corruption from the perspective of network science is beneficial to its understanding. In passing, this chapter, a contribution to the Liber Amicorum in honor of Leo Huberts, gives a very short introduction to social network analysis.
Submitted 17 September, 2019;
originally announced September 2019.
-
Stable prediction with radiomics data
Authors:
Carel F. W. Peeters,
Caroline Übelhör,
Steven W. Mes,
Roland Martens,
Thomas Koopman,
Pim de Graaf,
Floris H. P. van Velden,
Ronald Boellaard,
Jonas A. Castelijns,
Dennis E. te Beest,
Martijn W. Heymans,
Mark A. van de Wiel
Abstract:
Motivation: Radiomics refers to the high-throughput mining of quantitative features from radiographic images. It is a promising field in that it may provide a non-invasive solution for screening and classification. Standard machine learning classification and feature selection techniques, however, tend to display inferior performance in terms of (the stability of) predictive performance. This is due to the heavy multicollinearity present in radiomic data. We set out to provide an easy-to-use approach that deals with this problem.
Results: We developed a four-step approach that projects the original high-dimensional feature space onto a lower-dimensional latent-feature space, while retaining most of the covariation in the data. It consists of (i) penalized maximum likelihood estimation of a redundancy filtered correlation matrix. The resulting matrix (ii) is the input for a maximum likelihood factor analysis procedure. This two-stage maximum-likelihood approach can be used to (iii) produce a compact set of stable features that (iv) can be directly used in any (regression-based) classifier or predictor. It outperforms other classification (and feature selection) techniques in both external and internal validation settings regarding survival in squamous cell cancers.
Submitted 27 March, 2019;
originally announced March 2019.
-
Adaptive group-regularized logistic elastic net regression
Authors:
Magnus M. Münch,
Carel F. W. Peeters,
Aad W. van der Vaart,
Mark A. van de Wiel
Abstract:
In high-dimensional data settings, additional information on the features is often available. Examples of such external information in omics research are: (a) p-values from a previous study, (b) a summary of prior information, and (c) omics annotation. The inclusion of this information in the analysis may enhance classification performance and feature selection, but is not straightforward in the standard regression setting. As a solution to this problem, we propose a group-regularized (logistic) elastic net regression method, where each penalty parameter corresponds to a group of features based on the external information. The method, termed gren, makes use of the Bayesian formulation of logistic elastic net regression to estimate both the model and penalty parameters in an approximate empirical-variational Bayes framework. Simulations and an application to a colon cancer microRNA study show that, if the partitioning of the features is informative, classification performance and feature selection are indeed enhanced.
Submitted 1 May, 2018;
originally announced May 2018.
-
Inequality Constrained Multilevel Models
Authors:
Bernet S. Kato,
Carel F. W. Peeters
Abstract:
Multilevel or hierarchical data structures can occur in many areas of research, including economics, psychology, sociology, agriculture, medicine, and public health. Over the last 25 years, there has been increasing interest in developing suitable techniques for the statistical analysis of multilevel data, and this has resulted in a broad class of models known under the generic name of multilevel models. Generally, multilevel models are useful for exploring how relationships vary across higher-level units taking into account the within and between cluster variations. Research scientists often have substantive theories in mind when evaluating data with statistical models. Substantive theories often involve inequality constraints among the parameters to translate a theory into a model. This chapter shows how the inequality constrained multilevel linear model can be given a Bayesian formulation, how the model parameters can be estimated using a so-called augmented Gibbs sampler, and how posterior probabilities can be computed to assist the researcher in model selection.
Submitted 4 January, 2018;
originally announced January 2018.
-
Blood-based metabolic signatures in Alzheimer's disease
Authors:
Francisca A. de Leeuw,
Carel F. W. Peeters,
Maartje I. Kester,
Amy C. Harms,
Eduard A. Struys,
Thomas Hankemeier,
Herman W. T. van Vlijmen,
Sven J. van der Lee,
Cornelia M. van Duijn,
Philip Scheltens,
Ayşe Demirkan,
Mark A. van de Wiel,
Wiesje M. van der Flier,
Charlotte E. Teunissen
Abstract:
Introduction: Identification of blood-based metabolic changes might provide early and easy-to-obtain biomarkers.
Methods: We included 127 AD patients and 121 controls with CSF-biomarker-confirmed diagnosis (cut-off tau/A$β_{42}$: 0.52). Mass spectrometry platforms determined the concentrations of 53 amine, 22 organic acid, 120 lipid, and 40 oxidative stress compounds. Multiple signatures were assessed: differential expression (nested linear models), classification (logistic regression), and regulatory (network extraction).
Results: Twenty-six metabolites were differentially expressed. Metabolites improved the classification performance of clinical variables from 74% to 79%. Network models identified 5 hubs of metabolic dysregulation: Tyrosine, glycylglycine, glutamine, lysophosphatic acid C18:2 and platelet activating factor C16:0. The metabolite network for APOE $ε$4 negative AD patients was less cohesive compared to the network for APOE $ε$4 positive AD patients.
Discussion: Multiple signatures point to various promising peripheral markers for further validation. The network differences in AD patients according to APOE genotype may reflect different pathways to AD.
Submitted 21 September, 2017;
originally announced September 2017.
-
Detecting functional decline from normal ageing to dementia: development and validation of a short version of the Amsterdam IADL Questionnaire
Authors:
Roos J. Jutten,
Carel F. W. Peeters,
Sophie M. J. Leijdesdorff,
Pieter Jelle Visser,
Andrea B. Maier,
Caroline B. Terwee,
Philip Scheltens,
Sietske A. M. Sikkes
Abstract:
INTRODUCTION: Detecting functional decline from normal ageing to dementia is relevant for diagnostic and prognostic purposes. Therefore, the Amsterdam IADL Questionnaire (A-IADL-Q) was developed: a 70-item proxy-based tool with good psychometric properties. We aimed to design a short version whilst preserving its psychometric quality. METHODS: Study partners of subjects (n=1355), ranging from cognitively normal to dementia subjects, completed the original A-IADL-Q. We selected the short version items using a stepwise procedure combining missing data, Item Response Theory and input from respondents and experts. We investigated internal consistency of the short version as well as concordance with the original version. To assess its construct validity, we additionally investigated concordance between the short version and the Mini-Mental State Examination (MMSE) and Disability Assessment for Dementia (DAD). Lastly, we investigated differences in IADL scores between diagnostic groups across the dementia spectrum. RESULTS: We selected 30 items covering the entire spectrum of IADL functioning. Internal consistency (.98) and concordance with the original version (.97) were very high. Concordance with the MMSE (.72) and DAD (.87) scores was high. IADL impairment scores increased across the spectrum from normal cognition to dementia. DISCUSSION: The A-IADL-Q Short Version (A-IADL-Q-SV) consists of 30 items. The A-IADL-Q-SV has maintained the psychometric quality of the original A-IADL-Q. As such, it is a concise measure of functional decline.
Submitted 21 March, 2017; v1 submitted 21 October, 2016;
originally announced October 2016.
-
Pathophysiological Domains Underlying the Metabolic Syndrome: An Alternative Factor Analytic Strategy
Authors:
Carel F. W. Peeters,
James Dziura,
Floryt van Wesel
Abstract:
Purpose: Factor analysis (FA) has become part and parcel of metabolic syndrome (MBS) research. Both exploration- and confirmation-driven factor analyses are rampant. However, factor analytic results on MBS differ widely, a situation that is at least in part attributable to misapplication of FA. Here, our purpose is (i) to review factor analytic efforts in the study of MBS with emphasis on misuse of the FA model and (ii) to propose an alternative factor analytic strategy.
Methods: The proposed factor analytic strategy consists of four steps and confronts weaknesses in application of the FA model. At its heart lies the explicit separation of dimensionality and pattern selection as well as the direct evaluation of competing inequality-constrained loading patterns. A high-profile MBS data set with anthropometric measurements on overweight children and adolescents is reanalyzed using this strategy.
Results: The reanalysis implied a more parsimonious constellation of pathophysiological domains underlying phenotypic expressions of MBS than the original analysis (and many other analyses). The results emphasize correlated factors of impaired glucose metabolism and impaired lipid metabolism.
Conclusions: Pathophysiological domains underlying phenotypic expressions of MBS included in the analysis are driven by multiple interrelated metabolic impairments. These findings indirectly point to the possible existence of a multifactorial aetiology.
Submitted 8 September, 2016;
originally announced September 2016.
-
The Spectral Condition Number Plot for Regularization Parameter Determination
Authors:
Carel F. W. Peeters,
Mark A. van de Wiel,
Wessel N. van Wieringen
Abstract:
Many modern statistical applications ask for the estimation of a covariance (or precision) matrix in settings where the number of variables is larger than the number of observations. There exists a broad class of ridge-type estimators that employs regularization to cope with the subsequent singularity of the sample covariance matrix. These estimators depend on a penalty parameter and choosing its value can be hard, in terms of being computationally unfeasible or tenable only for a restricted set of ridge-type estimators. Here we introduce a simple graphical tool, the spectral condition number plot, for informed heuristic penalty parameter selection. The proposed tool is computationally friendly and can be employed for the full class of ridge-type covariance (precision) estimators.
Submitted 14 August, 2016;
originally announced August 2016.
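As a rough illustration of the idea in the abstract above, the following minimal sketch computes the spectral condition number of a regularized covariance matrix over a grid of penalty values. It assumes the simplest ridge-type estimator, S + λI, which is only one member of the class the abstract refers to; the function name `spectral_condition_numbers`, the simulated data, and the penalty grid are all hypothetical and not taken from the paper.

```python
import numpy as np


def spectral_condition_numbers(S, penalties):
    """Spectral condition number of S + lam * I for each penalty lam.

    Adding lam * I shifts every eigenvalue of S by lam, so the condition
    number is (d_max + lam) / (d_min + lam).
    """
    eigvals = np.linalg.eigvalsh(S)      # eigenvalues in ascending order
    d_min = max(eigvals[0], 0.0)         # clip tiny negative round-off
    d_max = eigvals[-1]
    penalties = np.asarray(penalties, dtype=float)
    return (d_max + penalties) / (d_min + penalties)


# Simulated high-dimensional setting: fewer observations than variables,
# so the sample covariance matrix S is singular.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 25))        # n = 10 observations, p = 25 variables
S = np.cov(X, rowvar=False)

penalties = np.logspace(-4, 2, 25)       # illustrative penalty grid
cond = spectral_condition_numbers(S, penalties)
```

Plotting `cond` against `penalties` on log-log axes gives a curve of the kind the abstract describes; under this assumed estimator one would look for the penalty value beyond which the condition number stops dropping sharply.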
-
Bayesian Constrained-Model Selection for Factor Analytic Modeling
Authors:
Carel F. W. Peeters
Abstract:
My dissertation revolves around Bayesian approaches towards constrained statistical inference in the factor analysis (FA) model. Two interconnected types of restricted-model selection are considered. These types have a natural connection to selection problems in the exploratory FA (EFA) and confirmatory FA (CFA) model and are termed Type I and Type II model selection. Type I constrained-model selection is taken to mean the determination of the appropriate dimensionality of a model. This type of constrained-model selection connects with EFA in the sense of selecting the optimal dimensionality of the latent vector. Type II model selection is taken to mean the determination of appropriate inequality, order or shape restrictions on the parameter space. The dissertation connects Type II constrained-model selection to CFA by focusing on the determination of linear inequality constraints as expressions of the direction and (relative) strength of factor loadings. The figures accompanying this article are taken from the slides of my Division 5 Awards Symposium Invited address at the APA 2015 Annual Convention in Toronto. These slides can be retrieved from https://github.com/CFWP/ConventionTalk.
Submitted 12 April, 2016; v1 submitted 18 March, 2016;
originally announced March 2016.
-
Targeted Fused Ridge Estimation of Inverse Covariance Matrices from Multiple High-Dimensional Data Classes
Authors:
Anders Ellern Bilgrau,
Carel F. W. Peeters,
Poul Svante Eriksen,
Martin Bøgsted,
Wessel N. van Wieringen
Abstract:
We consider the problem of jointly estimating multiple inverse covariance matrices from high-dimensional data consisting of distinct classes. An $\ell_2$-penalized maximum likelihood approach is employed. The suggested approach is flexible and generic, incorporating several other $\ell_2$-penalized estimators as special cases. In addition, the approach allows specification of target matrices through which prior knowledge may be incorporated and which can stabilize the estimation procedure in high-dimensional settings. The result is a targeted fused ridge estimator that is of use when the precision matrices of the constituent classes are believed to chiefly share the same structure while potentially differing in a number of locations of interest. It has many applications in (multi)factorial study designs. We focus on the graphical interpretation of precision matrices with the proposed estimator then serving as a basis for integrative or meta-analytic Gaussian graphical modeling. Situations are considered in which the classes are defined by data sets and subtypes of diseases. The performance of the proposed estimator in the graphical modeling setting is assessed through extensive simulation experiments. Its practical usability is illustrated by the differential network modeling of 12 large-scale gene expression data sets of diffuse large B-cell lymphoma subtypes. The estimator and its related procedures are incorporated into the R-package rags2ridges.
Submitted 26 March, 2020; v1 submitted 26 September, 2015;
originally announced September 2015.
-
Ridge Estimation of Inverse Covariance Matrices from High-Dimensional Data
Authors:
Wessel N. van Wieringen,
Carel F. W. Peeters
Abstract:
We study ridge estimation of the precision matrix in the high-dimensional setting where the number of variables is large relative to the sample size. We first review two archetypal ridge estimators and note that their utilized penalties do not coincide with common ridge penalties. Subsequently, starting from a common ridge penalty, analytic expressions are derived for two alternative ridge estimators of the precision matrix. The alternative estimators are compared to the archetypes with regard to eigenvalue shrinkage and risk. The alternatives are also compared to the graphical lasso within the context of graphical modeling. The comparisons may give reason to prefer the proposed alternative estimators.
Submitted 24 September, 2015; v1 submitted 4 March, 2014;
originally announced March 2014.