-
Group COMBSS: Group Selection via Continuous Optimization
Authors:
Anant Mathur,
Sarat Moka,
Benoit Liquet,
Zdravko Botev
Abstract:
We present a new optimization method for the group selection problem in linear regression. In this problem, predictors are assumed to have a natural group structure and the goal is to select a small set of groups that best fits the response. The incorporation of group structure in a predictor matrix is a key factor in obtaining better estimators and identifying associations between response and predictors. Such a discrete constrained problem is well-known to be hard, particularly in high-dimensional settings where the number of predictors is much larger than the number of observations. We propose to tackle this problem by framing the underlying discrete binary constrained problem into an unconstrained continuous optimization problem. The performance of our proposed approach is compared to state-of-the-art variable selection strategies on simulated data sets. We illustrate the effectiveness of our approach on a genetic dataset to identify grouping of markers across chromosomes.
Submitted 20 April, 2024;
originally announced April 2024.
-
Best Subset Solution Path for Linear Dimension Reduction Models using Continuous Optimization
Authors:
Benoit Liquet,
Sarat Moka,
Samuel Muller
Abstract:
The selection of best variables is a challenging problem in supervised and unsupervised learning, especially in high dimensional contexts where the number of variables is usually much larger than the number of observations. In this paper, we focus on two multivariate statistical methods: principal components analysis and partial least squares. Both approaches are popular linear dimension-reduction methods with numerous applications in several fields including genomics, biology, environmental science, and engineering. In particular, these approaches build principal components, new variables that are combinations of all the original variables. A main drawback of principal components is the difficulty of interpreting them when the number of variables is large. To define principal components from the most relevant variables, we propose to cast the best subset solution path method into principal component analysis and partial least squares frameworks. We offer a new alternative by exploiting a continuous optimization algorithm for best subset solution path. Empirical studies show the efficacy of our approach for providing the best subset solution path. The usage of our algorithm is further demonstrated through the analysis of two real datasets. The first dataset is analyzed using principal component analysis, while the analysis of the second dataset is based on the partial least squares framework.
Submitted 29 March, 2024;
originally announced March 2024.
-
Spatial Autoregressive Model on a Dirichlet Distribution
Authors:
Teo Nguyen,
Sarat Moka,
Kerrie Mengersen,
Benoit Liquet
Abstract:
Compositional data find broad application across diverse fields due to their efficacy in representing proportions or percentages of various components within a whole. Spatial dependencies often exist in compositional data, particularly when the data represents different land uses or ecological variables. Ignoring the spatial autocorrelations in modelling of compositional data may lead to incorrect estimates of parameters. Hence, it is essential to incorporate spatial information into the statistical analysis of compositional data to obtain accurate and reliable results. However, traditional statistical methods are not directly applicable to compositional data due to the correlation between its observations, which are constrained to lie on a simplex. To address this challenge, the Dirichlet distribution is commonly employed, as its support aligns with the nature of compositional vectors. Specifically, the R package DirichletReg provides a regression model, termed Dirichlet regression, tailored for compositional data. However, this model fails to account for spatial dependencies, thereby restricting its utility in spatial contexts. In this study, we introduce a novel spatial autoregressive Dirichlet regression model for compositional data, adeptly integrating spatial dependencies among observations. We construct a maximum likelihood estimator for a Dirichlet density function augmented with a spatial lag term. We compare this spatial autoregressive model with the same model without spatial lag, where we test both models on synthetic data as well as two real datasets, using different metrics. By considering the spatial relationships among observations, our model provides more accurate and reliable results for the analysis of compositional data. The model is further evaluated against a spatial multinomial regression model for compositional data, and their relative effectiveness is discussed.
Submitted 19 March, 2024;
originally announced March 2024.
-
A maximum penalised likelihood approach for semiparametric accelerated failure time models with time-varying covariates and partly interval censoring
Authors:
Aishwarya Bhaskaran,
Ding Ma,
Benoit Liquet,
Angela Hong,
Serigne N Lo,
Stephane Heritier,
Jun Ma
Abstract:
Accelerated failure time (AFT) models are frequently used for modelling survival data. This approach is attractive as it quantifies the direct relationship between the time until an event occurs and various covariates. It asserts that the failure times experience either acceleration or deceleration through a multiplicative factor when these covariates are present. While existing literature provides numerous methods for fitting AFT models with time-fixed covariates, adapting these approaches to scenarios involving both time-varying covariates and partly interval-censored data remains challenging. In this paper, we introduce a maximum penalised likelihood approach to fit a semiparametric AFT model. This method, designed for survival data with partly interval-censored failure times, accommodates both time-fixed and time-varying covariates. We utilise Gaussian basis functions to construct a smooth approximation of the nonparametric baseline hazard and fit the model via a constrained optimisation approach. To illustrate the effectiveness of our proposed method, we conduct a comprehensive simulation study. We also present an implementation of our approach on a randomised clinical trial dataset on advanced melanoma patients.
Submitted 18 March, 2024;
originally announced March 2024.
-
COMBSS: Best Subset Selection via Continuous Optimization
Authors:
Sarat Moka,
Benoit Liquet,
Houying Zhu,
Samuel Muller
Abstract:
The problem of best subset selection in linear regression is considered with the aim to find a fixed size subset of features that best fits the response. This is particularly challenging when the total available number of features is very large compared to the number of data samples. Existing optimal methods for solving this problem tend to be slow while fast methods tend to have low accuracy. Ideally, new methods perform best subset selection faster than existing optimal methods but with comparable accuracy, or, being more accurate than methods of comparable computational speed. Here, we propose a novel continuous optimization method that identifies a subset solution path, a small set of models of varying size, that consists of candidates for the single best subset of features, that is optimal in a specific sense in linear regression. Our method turns out to be fast, making the best subset selection possible when the number of features is well in excess of thousands. Because of the outstanding overall performance, framing the best subset selection challenge as a continuous optimization problem opens new research directions for feature extraction for a large variety of regression models.
Submitted 24 November, 2023; v1 submitted 5 May, 2022;
originally announced May 2022.
-
Nonstationary Spatial Process Models with Spatially Varying Covariance Kernels
Authors:
Sébastien Coube-Sisqueille,
Sudipto Banerjee,
Benoît Liquet
Abstract:
Spatial process models for capturing nonstationary behavior in scientific data present several challenges with regard to statistical inference and uncertainty quantification. While nonstationary spatially-varying kernels are attractive for their flexibility and richness, their practical implementation has been reported to be overwhelmingly cumbersome because of the high-dimensional parameter spaces resulting from the spatially varying process parameters. Matters are considerably exacerbated with the massive numbers of spatial locations over which measurements are available. With limited theoretical tractability offered by nonstationary spatial processes, overcoming such computational bottlenecks requires a synergy between model construction and algorithm development. We build a class of scalable nonstationary spatial process models using spatially varying covariance kernels. We present some novel consequences of such representations that befit computationally efficient implementation. More specifically, we operate within a coherent Bayesian modeling framework to achieve full uncertainty quantification using a Hybrid Monte-Carlo with nested interweaving. We carry out experiments on synthetic data sets to explore model selection and parameter identifiability and assess inferential improvements accrued from the nonstationary modeling. We illustrate strengths and pitfalls with a data set on remotely sensed normalized difference vegetation index with further analysis of a lead contamination data set in the Supplement.
Submitted 28 March, 2024; v1 submitted 22 March, 2022;
originally announced March 2022.
-
Understanding links between water-quality variables and nitrate concentration in freshwater streams using high-frequency sensor data
Authors:
Claire Kermorvant,
Benoit Liquet,
Guy Litt,
Kerrie Mengersen,
Erin Peterson,
Rob Hyndman,
Jeremy B. Jones Jr.,
Catherine Leigh
Abstract:
Real time monitoring using in situ sensors is becoming a common approach for measuring water quality within watersheds. High frequency measurements produce big data sets that present opportunities to conduct new analyses for improved understanding of water quality dynamics and more effective management of rivers and streams. Of primary importance is enhancing knowledge of the relationships between nitrate, one of the most reactive forms of inorganic nitrogen in the aquatic environment, and other water quality variables. We analysed high frequency water quality data from in situ sensors deployed in three sites from different watersheds and climate zones within the National Ecological Observatory Network, USA. We used generalised additive mixed models to explain the nonlinear relationships at each site between nitrate concentration and conductivity, turbidity, dissolved oxygen, water temperature, and elevation. Temporal auto correlation was modelled with an auto regressive moving average model and we examined the relative importance of the explanatory variables. Total deviance explained by the models was high for all sites. Although variable importance and the smooth regression parameters differed among sites, the models explaining the most variation in nitrate contained the same explanatory variables. This study demonstrates that building a model for nitrate using the same set of explanatory water quality variables is achievable, even for sites with vastly different environmental and climatic characteristics. Applying such models will assist managers to select cost effective water quality variables to monitor when the goals are to gain a spatially and temporally in depth understanding of nitrate dynamics and adapt management plans accordingly.
Submitted 3 June, 2021;
originally announced June 2021.
-
Improving performances of MCMC for Nearest Neighbor Gaussian Process models with full data augmentation
Authors:
Sébastien Coube-Sisqueille,
Benoît Liquet
Abstract:
Even though Nearest Neighbor Gaussian Processes (NNGP) considerably alleviate the MCMC implementation of Bayesian space-time models, they do not solve the convergence problems caused by high model dimension. Frugal alternatives such as response or collapsed algorithms are an answer. Our approach is to keep full data augmentation but to try and make it more efficient. We present two strategies to do so. The first scheme is to pay particular attention to the seemingly trivial fixed effects of the model. We show empirically that re-centering the latent field on the intercept critically improves chain behavior. We extend this approach to other fixed effects that may interfere with a coherent spatial field. We propose a simple method that requires no tuning while remaining affordable thanks to NNGP's sparsity. The second scheme accelerates the sampling of the random field using Chromatic samplers. This method makes long sequential simulation boil down to group-parallelized or group-vectorized sampling. The attractive possibility to parallelize NNGP likelihood can therefore be carried over to field sampling. We present an R implementation of our methods for Gaussian fields in the public repository https://github.com/SebastienCoube/Improving_NNGP_full_augmentation. An extensive vignette is provided. We run our implementation on two synthetic toy examples along with the state of the art package spNNGP. Finally, we apply our method to a real data set of lead contamination in the United States of America mainland.
Submitted 14 September, 2021; v1 submitted 2 October, 2020;
originally announced October 2020.
-
Estimation of Semi-Markov Multi-state Models: A Comparison of the Sojourn Times and Transition Intensities Approaches
Authors:
Azam Asanjarani,
Benoit Liquet,
Yoni Nazarathy
Abstract:
Semi-Markov models are widely used for survival analysis and reliability analysis. In general, there are two competing parameterizations and each entails its own interpretation and inference properties. On the one hand, a semi-Markov process can be defined based on the distribution of sojourn times, often via hazard rates, together with transition probabilities of an embedded Markov chain. On the other hand, intensity transition functions may be used, often referred to as the hazard rates of the semi-Markov process. We summarize and contrast these two parameterizations both from a probabilistic and an inference perspective, and we highlight relationships between the two approaches. In general, the intensity transition based approach allows the likelihood to be split into likelihoods of two-state models having fewer parameters, allowing efficient computation and usage of many survival analysis tools. Nevertheless, in certain cases the sojourn time based approach is natural and has been exploited extensively in applications. In contrasting the two approaches and contemporary relevant R packages used for inference, we use two real datasets highlighting the probabilistic and inference properties of each approach. This analysis is accompanied by an R vignette.
Submitted 28 December, 2020; v1 submitted 29 May, 2020;
originally announced May 2020.
-
A Unified Parallel Algorithm for Regularized Group PLS Scalable to Big Data
Authors:
Pierre Lafaye de Micheaux,
Benoit Liquet,
Matthew Sutton
Abstract:
Partial Least Squares (PLS) methods have been heavily exploited to analyse the association between two blocks of data. These powerful approaches can be applied to data sets where the number of variables is greater than the number of observations and in the presence of high collinearity between variables. Different sparse versions of PLS have been developed to integrate multiple data sets while simultaneously selecting the contributing variables. Sparse modelling is a key factor in obtaining better estimators and identifying associations between multiple data sets. The cornerstone of the sparse versions of PLS methods is the link between the SVD of a matrix (constructed from deflated versions of the original matrices of data) and least squares minimisation in linear regression. We present here an accurate description of the most popular PLS methods, alongside their mathematical proofs. A unified algorithm is proposed to perform all four types of PLS including their regularised versions. Various approaches to decrease the computation time are offered, and we show how the whole procedure can be scalable to big data sets.
Submitted 22 February, 2017;
originally announced February 2017.
-
CEoptim: Cross-Entropy R Package for Optimization
Authors:
Tim Benham,
Qibin Duan,
Dirk P. Kroese,
Benoit Liquet
Abstract:
The cross-entropy (CE) method is a simple and versatile technique for optimization, based on Kullback-Leibler (or cross-entropy) minimization. The method can be applied to a wide range of optimization tasks, including continuous, discrete, mixed and constrained optimization problems. The new package CEoptim provides the R implementation of the CE method for optimization. We describe the general CE methodology for optimization as well as some useful modifications. The usage and efficacy of CEoptim are demonstrated through a variety of optimization examples, including model fitting, combinatorial optimization, and maximum likelihood estimation.
Submitted 5 March, 2015;
originally announced March 2015.
-
Estimation of extended mixed models using latent classes and latent processes: the R package lcmm
Authors:
Cécile Proust-Lima,
Viviane Philipps,
Benoit Liquet
Abstract:
The R package lcmm provides a series of functions to estimate statistical models based on linear mixed model theory. It includes the estimation of mixed models and latent class mixed models for Gaussian longitudinal outcomes (hlme), curvilinear and ordinal univariate longitudinal outcomes (lcmm) and curvilinear multivariate outcomes (multlcmm), as well as joint latent class mixed models (Jointlcmm) for a (Gaussian or curvilinear) longitudinal outcome and a time-to-event that can be possibly left-truncated, right-censored and defined in a competing setting. Maximum likelihood estimators are obtained using a modified Marquardt algorithm with strict convergence criteria based on the parameters and likelihood stability, and on the negativity of the second derivatives. The package also provides various post-fit functions including goodness-of-fit analyses, classification, plots, predicted trajectories, individual dynamic prediction of the event and predictive accuracy assessment. This paper constitutes a companion paper to the package by introducing each family of models, the estimation technique, some implementation details and giving examples through a dataset on cognitive aging.
Submitted 24 January, 2016; v1 submitted 3 March, 2015;
originally announced March 2015.
-
ClustOfVar: An R Package for the Clustering of Variables
Authors:
M. Chavent,
V. Kuentz,
B. Liquet,
L. Saracco
Abstract:
Clustering of variables is a way to arrange variables into homogeneous clusters, i.e., groups of variables which are strongly related to each other and thus bring the same information. These approaches can then be useful for dimension reduction and variable selection. Several specific methods have been developed for the clustering of numerical variables. However, for qualitative variables or mixtures of quantitative and qualitative variables, far fewer methods have been proposed. The R package ClustOfVar was specifically developed for this purpose. The homogeneity criterion of a cluster is defined as the sum of correlation ratios (for qualitative variables) and squared correlations (for quantitative variables) to a synthetic quantitative variable, summarizing "as well as possible" the variables in the cluster. This synthetic variable is the first principal component obtained with the PCAMIX method. Two algorithms for the clustering of variables are proposed: an iterative relocation algorithm and ascendant hierarchical clustering. We also propose a bootstrap approach in order to determine suitable numbers of clusters. We illustrate the methodologies and the associated package on small datasets.
Submitted 1 December, 2011;
originally announced December 2011.