Search | arXiv e-print repository

Clusterpath Gaussian Graphical Modeling

Authors: D. J. W. Touw, A. Alfons, P. J. F. Groenen, I. Wilms

Abstract: Graphical models serve as effective tools for visualizing conditional dependencies between variables. However, as the number of variables grows, interpretation becomes increasingly difficult, and estimation uncertainty increases due to the large number of parameters relative to the number of observations. To address these challenges, we introduce the Clusterpath estimator of the Gaussian Graphical Model (CGGM) that encourages variable clustering in the graphical model in a data-driven way. Through the use of a clusterpath penalty, we group variables together, which in turn results in a block-structured precision matrix whose block structure remains preserved in the covariance matrix. We present a computationally efficient implementation of the CGGM estimator by using a cyclic block coordinate descent algorithm. In simulations, we show that CGGM not only matches, but oftentimes outperforms other state-of-the-art methods for variable clustering in graphical models. We also demonstrate CGGM's practical advantages and versatility on a diverse collection of empirical applications.

Submitted 30 June, 2024; originally announced July 2024.

Comments: 43 pages, 11 figures

arXiv:2302.04627 [pdf, ps, other]

Dual scaling of rating data

Authors: Michel van de Velden, Patrick J. F. Groenen

Abstract: When applied to contingency tables, dual scaling and correspondence analysis are mathematically equivalent methods. For the analysis of rating data, however, the methods differ. To a large extent this is due to differences in preprocessing of the data. In particular, in dual scaling, ratings are transformed either to rank order or to successive category data before applying a customised dual scaling approach. In correspondence analysis, on the other hand, a so-called doubling of the original ratings is applied before applying the usual correspondence analysis formulas. In this paper, we consider these differences in detail. We propose a dual scaling variant that can be applied directly to the ratings and we compare theoretical as well as practical properties of the different approaches.

Submitted 9 February, 2023; originally announced February 2023.

arXiv:2212.02914 [pdf, other]

Effects of Visual Priming on Rating Scale Usage

Authors: Pieter C. Schoonees, Patrick J. F. Groenen, Michel van de Velden, Hester van Herk

Abstract: Rating scales are much used in survey research. Often, it is assumed that the scores obtained through rating scales can be compared within and between respondents when studies are in one country. In addition, it is assumed that they can be treated as a numerical scale. In this paper, we study the anchoring effect of a visual stimulus on rating scale usage. To do so, we set up a randomized experiment where the experimental group was primed by being asked to rate the filling of a cylinder that was presented visually. For a five-point rating scale, we find that primed respondents use Category 1 less and Categories 3 and 4 more, with no effect on Categories 2 and 5.

Submitted 6 December, 2022; originally announced December 2022.

arXiv:2211.01877 [pdf, other]

Convex Clustering through MM: An Efficient Algorithm to Perform Hierarchical Clustering

Authors: Daniel J. W. Touw, Patrick J. F. Groenen, Yoshikazu Terada

Abstract: Convex clustering is a modern method with both hierarchical and $k$-means clustering characteristics. Although convex clustering can capture complex clustering structures hidden in data, the existing convex clustering algorithms are not scalable to large data sets with sample sizes greater than several thousand. Moreover, it is known that convex clustering sometimes fails to produce a complete hierarchical clustering structure. This issue arises if clusters split up or the minimum number of possible clusters is larger than the desired number of clusters. In this paper, we propose convex clustering through majorization-minimization (CCMM), an iterative algorithm that uses cluster fusions and a highly efficient updating scheme derived using diagonal majorization. Additionally, we explore different strategies to ensure that the hierarchical clustering structure terminates in a single cluster. With a current desktop computer, CCMM efficiently solves convex clustering problems featuring over one million objects in seven-dimensional space, achieving a solution time of 51 seconds on average.

Submitted 21 December, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

Comments: 27 pages, 8 figures

arXiv:2202.12063 [pdf, other]

doi 10.18637/jss.v103.i13

Robust Mediation Analysis: The R Package robmed

Authors: Andreas Alfons, Nüfer Y. Ateş, Patrick J. F. Groenen

Abstract: Mediation analysis is one of the most widely used statistical techniques in the social, behavioral, and medical sciences. Mediation models allow researchers to study how an independent variable affects a dependent variable indirectly through one or more intervening variables, which are called mediators. The analysis is often carried out via a series of linear regressions, in which case the indirect effects can be computed as products of coefficients from those regressions. Statistical significance of the indirect effects is typically assessed via a bootstrap test based on ordinary least-squares estimates. However, this test is sensitive to outliers or other deviations from normality assumptions, which poses a serious threat to empirical testing of theory about mediation mechanisms. The R package robmed implements a robust procedure for mediation analysis based on the fast-and-robust bootstrap methodology for robust regression estimators, which yields reliable results even when the data deviate from the usual normality assumptions. Various other procedures for mediation analysis are included in package robmed as well. Moreover, robmed introduces a new formula interface that allows users to specify mediation models with a single formula, and provides various plots for diagnostics or visual representation of the results.

Submitted 17 August, 2022; v1 submitted 24 February, 2022; originally announced February 2022.

Journal ref: Journal of Statistical Software, 103(13), 1-45 (2022)

arXiv:2102.08232 [pdf, other]

doi 10.1007/978-981-99-2240-6_4

The MELODIC family for simultaneous binary logistic regression in a reduced space

Authors: Mark de Rooij, Patrick J. F. Groenen

Abstract: Logistic regression is a commonly used method for binary classification. Researchers often have more than a single binary response variable and simultaneous analysis is beneficial because it provides insight into the dependencies among response variables as well as between the predictor variables and the responses. Moreover, in such a simultaneous analysis the equations can lend each other strength, which might increase predictive accuracy. In this paper, we propose the MELODIC family for simultaneous binary logistic regression modeling. In this family, the regression models are defined in a Euclidean space of reduced dimension, based on a distance rule. The model may be interpreted in terms of logistic regression coefficients or in terms of a biplot. We discuss a fast iterative majorization (or MM) algorithm for parameter estimation. Two applications are shown in detail: one relating personality characteristics to drug consumption profiles and one relating personality characteristics to depressive and anxiety disorders. We present a thorough comparison of our MELODIC family with alternative approaches for multivariate binary data.

Submitted 24 June, 2022; v1 submitted 16 February, 2021; originally announced February 2021.

Comments: [v2] added a paragraph on page 7 about the equivalence to a logistic reduced rank model; [v2] updated the description of the relationship to logistic reduced rank models on page 37

arXiv:2002.08146 [pdf, other]

A censored mixture model for modeling risk taking

Authors: Nienke F. S. Dijkstra, Henning Tiemeier, Bernd C. Figner, Patrick J. F. Groenen

Abstract: Risk behavior can have substantial consequences for health, well-being, and functioning. Previous studies have shown an association between real-world risk behavior and risk behavior on experimental tasks, such as the Columbia Card Task, but their modeling is challenging for several reasons. First, many of the experimental risk tasks may end prematurely leading to censored observations. Second, certain outcome values can be more attractive than others. Third, a priori unknown groups of participants can react differently to certain risk-levels. Here, we propose the Censored Mixture Model (CMM), which models risk taking while handling censoring, experimental conditions, and attractiveness to certain outcomes.

Submitted 19 February, 2020; originally announced February 2020.

Comments: 29 pages, 9 figures

arXiv:1807.04982 [pdf, other]

Generalized simultaneous component analysis of binary and quantitative data

Authors: Yipeng Song, Johan A. Westerhuis, Nanne Aben, Lodewyk F. A. Wessels, Patrick J. F. Groenen, Age K. Smilde

Abstract: In the current era of systems biological research there is a need for the integrative analysis of binary and quantitative genomics data sets measured on the same objects. One standard tool for exploring the underlying dependence structure present in multiple quantitative data sets is the simultaneous component analysis (SCA) model. However, it does not have any provisions for the case where part of the data is binary. To this end, we propose the generalized SCA (GSCA) model, which takes into account the distinct mathematical properties of binary and quantitative measurements in the maximum likelihood framework. As in the SCA model, a common low-dimensional subspace is assumed to represent the shared information between these two distinct types of measurements. However, the GSCA model can easily be overfitted when a rank larger than one is used, causing some of the estimated parameters to become very large. To achieve a low-rank solution and combat overfitting, we propose to use a concave variant of the nuclear norm penalty. An efficient majorization algorithm is developed to fit this model with different concave penalties. Realistic simulations (low signal-to-noise ratio and highly imbalanced binary data) are used to evaluate the performance of the proposed model in recovering the underlying structure. Also, a missing-value-based cross-validation procedure is implemented for model selection. We illustrate the usefulness of the GSCA model for exploratory data analysis of quantitative gene expression and binary copy number aberration (CNA) measurements obtained from the GDSC1000 data sets.

Submitted 3 June, 2019; v1 submitted 13 July, 2018; originally announced July 2018.

Comments: 19 pages, 10 figures

arXiv:1701.06967 [pdf, other]

SparseStep: Approximating the Counting Norm for Sparse Regularization

Authors: Gerrit J. J. van den Burg, Patrick J. F. Groenen, Andreas Alfons

Abstract: The SparseStep algorithm is presented for the estimation of a sparse parameter vector in the linear regression problem. The algorithm works by adding an approximation of the exact counting norm as a constraint on the model parameters and iteratively strengthening this approximation to arrive at a sparse solution. Theoretical analysis of the penalty function shows that the estimator yields unbiased estimates of the parameter vector. An iterative majorization algorithm is derived which has a straightforward implementation reminiscent of ridge regression. In addition, the SparseStep algorithm is compared with similar methods through a rigorous simulation study which shows it often outperforms existing methods in both model fit and prediction accuracy.

Submitted 24 January, 2017; originally announced January 2017.

MSC Class: 62J05; 62J07

arXiv:1603.03174 [pdf, other]

Multinomial Multiple Correspondence Analysis

Authors: Patrick J. F. Groenen, Julie Josse

Abstract: Relations between categorical variables can be analyzed conveniently by multiple correspondence analysis (MCA). It is well suited to discover relations that may exist between categories of different variables. The graphical representation of MCA results in so-called biplots makes it easy to interpret the most important associations. However, a major drawback of MCA is that it does not have an underlying probability model for an individual selecting a category on a variable. In this paper, we propose such a probability model, called multinomial multiple correspondence analysis (MMCA), that combines the underlying low-rank representation of MCA with maximum likelihood. An efficient majorization algorithm that uses an elegant bound for the second derivative is derived to estimate the parameters. The proposed model can easily lead to overfitting, causing some of the parameters to wander off to infinity. We add the nuclear norm penalty to counter this issue and discuss ways of selecting regularization parameters. The proposed approach is well suited to study and visualize the dependences for high-dimensional data.

Submitted 10 March, 2016; originally announced March 2016.

arXiv:1504.07005 [pdf]

Regularized Consensus PCA

Authors: Michel Tenenhaus, Arthur Tenenhaus, Patrick J. F. Groenen

Abstract: A new framework for many multiblock component methods (including consensus and hierarchical PCA) is proposed. It is based on the consensus PCA model: a scheme connecting each block of variables to a superblock obtained by concatenation of all blocks. Regularized consensus PCA is obtained by applying regularized generalized canonical correlation analysis to this scheme for the function $g(x) = x^m$ where $m \ge 1$. A gradient algorithm is proposed. At convergence, a solution of the stationary equation related to the optimization problem is obtained. For m = 1, 2 or 4 and shrinkage constants equal to 0 or 1, many multiblock component methods are recovered.

Submitted 27 April, 2015; originally announced April 2015.

Showing 1–11 of 11 results for author: Groenen, P J F