Robust Linear Mixed Models using Hierarchical Gamma-Divergence

Authors: Shonosuke Sugasawa, Francis K. C. Hui, Alan H. Welsh

Abstract: Linear mixed models (LMMs), which typically assume normality for both the random effects and error terms, are a popular class of methods for analyzing longitudinal and clustered data. However, such models can be sensitive to outliers, and this can lead to poor statistical results (e.g., biased inference on model parameters and inaccurate prediction of random effects) if the data are contaminated. We propose a new approach to robust estimation and inference for LMMs using a hierarchical gamma divergence, which offers an automated, data-driven approach to downweight the effects of outliers occurring in both the error and the random effects, using normalized powered density weights. For estimation and inference, we develop a computationally scalable minorization-maximization algorithm for the resulting objective function, along with a clustered bootstrap method for uncertainty quantification and a Hyvarinen score criterion for selecting a tuning parameter controlling the degree of robustness. When the genuine and contamination mixed effects distributions are sufficiently separated, then under suitable regularity conditions assuming the number of clusters tends to infinity, we show the resulting robust estimates can be asymptotically controlled even under a heavy level of (covariate-dependent) contamination. Simulation studies demonstrate hierarchical gamma divergence consistently outperforms several currently available methods for robustifying LMMs, under a wide range of scenarios of outlier generation at both the response and random effects levels. We illustrate the proposed method using data from a multi-center AIDS cohort study, where the use of robust LMMs based on hierarchical gamma divergence produces noticeably different results compared to methods that do not adequately adjust for potential outlier contamination.

Submitted 1 July, 2024; originally announced July 2024.

Comments: 30 pages (main) + 6 pages (supplement)

arXiv:2403.11562 [pdf, other]

A Comparison of Joint Species Distribution Models for Percent Cover Data

Authors: Pekka Korhonen, Francis K. C. Hui, Jenni Niku, Sara Taskinen, Bert van der Veen

Abstract: 1. Joint species distribution models (JSDMs) have gained considerable traction among ecologists over the past decade, due to their capacity to answer a wide range of questions at both the species- and the community-level. The family of generalized linear latent variable models in particular has proven popular for building JSDMs, being able to handle many response types including presence-absence data, biomass, overdispersed and/or zero-inflated counts. 2. We extend latent variable models to handle percent cover data, with vegetation, sessile invertebrate, and macroalgal cover data representing the prime examples of such data arising in community ecology. 3. Sparsity is a commonly encountered challenge with percent cover data. Responses are typically recorded as percentages covered per plot, though some species may be completely absent or present, i.e., have 0% or 100% cover respectively, rendering the use of the beta distribution inadequate. 4. We propose two JSDMs suitable for percent cover data, namely a hurdle beta model and an ordered beta model. We compare the two proposed approaches to a beta distribution for shifted responses, transformed presence-absence data, and an ordinal model for percent cover classes. Results demonstrate the hurdle beta JSDM was generally the most accurate at retrieving the latent variables and predicting ecological percent cover data.

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2402.12803 [pdf, other]

Joint Mean and Correlation Regression Models for Multivariate Data

Authors: Zhi Yang Tho, Francis K. C. Hui, Tao Zou

Abstract: We propose a new joint mean and correlation regression model for correlated multivariate discrete responses, that simultaneously regresses the mean of each response against a set of covariates, and the correlations between responses against a set of similarity/distance measures. A set of joint estimating equations is formulated to construct an estimator of both the mean regression coefficients and the correlation regression parameters. Under a general setting where the number of responses can tend to infinity, the joint estimator is demonstrated to be consistent and asymptotically normally distributed, with differing rates of convergence due to the mean regression coefficients being heterogeneous across responses. An iterative estimation procedure is developed to obtain parameter estimates in the required, constrained parameter space. We apply the proposed model to a multivariate abundance dataset comprising overdispersed counts of 38 Carabidae ground beetle species sampled throughout Scotland, along with information about the environmental conditions of each site and the traits of each species. Results show in particular that the relationships between the mean abundances of various beetle species and environmental covariates are different and that beetle total length has a statistically important effect in driving the correlations between the species. Simulations demonstrate the strong finite sample performance of the proposed estimator in terms of point estimation and inference.

Submitted 20 February, 2024; originally announced February 2024.

arXiv:2402.12719 [pdf, other]

Restricted maximum likelihood estimation in generalized linear mixed models

Authors: Luca Maestrini, Francis K. C. Hui, Alan H. Welsh

Abstract: Restricted maximum likelihood (REML) estimation is a widely accepted and frequently used method for fitting linear mixed models, with its principal advantage being that it produces less biased estimates of the variance components. However, the concept of REML does not immediately generalize to the setting of non-normally distributed responses, and it is not always clear the extent to which, either asymptotically or in finite samples, such generalizations reduce the bias of variance component estimates compared to standard unrestricted maximum likelihood estimation. In this article, we review various attempts that have been made over the past four decades to extend REML estimation in generalized linear mixed models. We establish four major classes of approaches, namely approximate linearization, integrated likelihood, modified profile likelihoods, and direct bias correction of the score function, and show that while these four classes may have differing motivations and derivations, they often arrive at a similar if not the same REML estimate. We compare the finite sample performance of these four classes through a numerical study involving binary and count data, with results demonstrating that they perform similarly well in reducing the finite sample bias of variance components.

Submitted 20 February, 2024; originally announced February 2024.

arXiv:2401.13379 [pdf, other]

An Ising Similarity Regression Model for Modeling Multivariate Binary Data

Authors: Zhi Yang Tho, Francis K. C. Hui, Tao Zou

Abstract: Understanding the dependence structure between response variables is an important component in the analysis of correlated multivariate data. This article focuses on modeling dependence structures in multivariate binary data, motivated by a study aiming to understand how patterns in different U.S. senators' votes are determined by similarities (or lack thereof) in their attributes, e.g., political parties and social network profiles. To address such a research question, we propose a new Ising similarity regression model which regresses pairwise interaction coefficients in the Ising model against a set of similarity measures available/constructed from covariates. Model selection approaches are further developed through regularizing the pseudo-likelihood function with an adaptive lasso penalty to enable the selection of relevant similarity measures. We establish estimation and selection consistency of the proposed estimator under a general setting where the number of similarity measures and responses tend to infinity. A simulation study demonstrates the strong finite sample performance of the proposed estimator in terms of parameter estimation and similarity selection. Applying the Ising similarity regression model to a dataset of roll call voting records of 100 U.S. senators, we are able to quantify how similarities in senators' parties, businessman occupations and social network profiles drive their voting associations.

Submitted 24 January, 2024; originally announced January 2024.

arXiv:2310.13858 [pdf, other]

Likelihood-based surrogate dimension reduction

Authors: Linh H. Nghiem, Francis K. C. Hui, Samuel Mueller, A. H. Welsh

Abstract: We consider the problem of surrogate sufficient dimension reduction, that is, estimating the central subspace of a regression model, when the covariates are contaminated by measurement error. When no measurement error is present, a likelihood-based dimension reduction method that relies on maximizing the likelihood of a Gaussian inverse regression model on the Grassmann manifold is well-known to have superior performance to traditional inverse moment methods. We propose two likelihood-based estimators for the central subspace in measurement error settings, which make different adjustments to the observed surrogates. Both estimators are computed based on maximizing objective functions on the Grassmann manifold and are shown to consistently recover the true central subspace. When the central subspace is assumed to depend on only a few covariates, we further propose to augment the likelihood function with a penalty term that induces sparsity on the Grassmann manifold to obtain sparse estimators. The resulting objective function has a closed-form Riemann gradient which facilitates efficient computation of the penalized estimator. We leverage the state-of-the-art trust region algorithm on the Grassmann manifold to compute the proposed estimators efficiently. Simulation studies and a data application demonstrate the proposed likelihood-based estimators perform better than inverse moment-based estimators in terms of both estimation and variable selection accuracy.

Submitted 20 October, 2023; originally announced October 2023.

arXiv:2310.05548 [pdf, other]

Cokrig-and-Regress for Spatially Misaligned Environmental Data

Authors: Z. Y. Tho, F. K. C. Hui, A. H. Welsh, T. Zou

Abstract: Spatially misaligned data, where the response and covariates are observed at different spatial locations, commonly arise in many environmental studies. Much of the statistical literature on handling spatially misaligned data has been devoted to the case of a single covariate and a linear relationship between the response and this covariate. Motivated by spatially misaligned data collected on air pollution and weather in China, we propose a cokrig-and-regress (CNR) method to estimate spatial regression models involving multiple covariates and potentially non-linear associations. The CNR estimator is constructed by replacing the unobserved covariates (at the response locations) by their cokriging predictor derived from the observed but misaligned covariates under a multivariate Gaussian assumption, where a generalized Kronecker product covariance is used to account for spatial correlations within and between covariates. A parametric bootstrap approach is employed to bias-correct the CNR estimates of the spatial covariance parameters and for uncertainty quantification. Simulation studies demonstrate that CNR outperforms several existing methods for handling spatially misaligned data, such as nearest-neighbor interpolation. Applying CNR to the spatially misaligned air pollution and weather data in China reveals a number of non-linear relationships between PM$_{2.5}$ concentration and several meteorological covariates.

Submitted 9 October, 2023; originally announced October 2023.

arXiv:2107.02627 [pdf, other]

Fast, universal estimation of latent variable models using extended variational approximations

Authors: Pekka Korhonen, Francis K. C. Hui, Jenni Niku, Sara Taskinen

Abstract: Generalized linear latent variable models (GLLVMs) are a class of methods for analyzing multi-response data which has garnered considerable popularity in recent years, for example, in the analysis of multivariate abundance data in ecology. One of the main features of GLLVMs is their capacity to handle a variety of response types, such as (overdispersed) counts, binomial responses, (semi-)continuous, and proportions data. On the other hand, the introduction of underlying latent variables presents some major computational challenges, as the resulting marginal likelihood function involves an intractable integral for non-normally distributed responses. This has spurred research into approximation methods to overcome this integral, with a recent and particularly computationally scalable one being that of variational approximations (VA). However, research into the use of VA for GLLVMs and related models has been hampered by the fact that closed-form approximations have only been obtained for certain pairs of response distributions and link functions. In this article, we propose an extended variational approximations (EVA) approach which widens the set of VA-applicable GLLVMs drastically. EVA draws inspiration from the underlying idea of Laplace approximations: by replacing the complete-data likelihood function with its second order Taylor approximation about the mean of the variational distribution, we can obtain a closed-form approximation to the marginal likelihood of the GLLVM for any response type and link function. Through simulation studies and an application to a testate amoebae data set in ecology, we demonstrate how EVA results in a universal approach to fitting GLLVMs, which remains competitive in terms of estimation and inferential performance relative to both standard VA and a Laplace approximation approach, while being computationally more scalable than both in practice.

Submitted 6 July, 2021; originally announced July 2021.

arXiv:2104.09838 [pdf, other]

Sparse Sliced Inverse Regression via Cholesky Matrix Penalization

Authors: Linh Nghiem, Francis K. C. Hui, Samuel Mueller, A. H. Welsh

Abstract: We introduce a new sparse sliced inverse regression estimator called Cholesky matrix penalization and its adaptive version for achieving sparsity in estimating the dimensions of the central subspace. The new estimators use the Cholesky decomposition of the covariance matrix of the covariates and include a regularization term in the objective function to achieve sparsity in a computationally efficient manner. We establish the theoretical values of the tuning parameters that achieve estimation and variable selection consistency for the central subspace. Furthermore, we propose a new projection information criterion to select the tuning parameter for our proposed estimators and prove that the new criterion facilitates selection consistency. The Cholesky matrix penalization estimator inherits the strength of the Matrix Lasso and the Lasso sliced inverse regression estimator; it has superior performance in numerical studies and can be adapted to other sufficient dimension methods in the literature.

Submitted 20 April, 2021; originally announced April 2021.

arXiv:2104.09812 [pdf, other]

Screening methods for linear errors-in-variables models in high dimensions

Authors: Linh Nghiem, Francis K. C. Hui, Samuel Mueller, A. H. Welsh

Abstract: Microarray studies, in order to identify genes associated with an outcome of interest, usually produce noisy measurements for a large number of gene expression features from a small number of subjects. One common approach to analyzing such high-dimensional data is to use linear errors-in-variables models; however, current methods for fitting such models are computationally expensive. In this paper, we present two efficient screening procedures, namely corrected penalized marginal screening and corrected sure independence screening, to reduce the number of variables for final model building. Both screening procedures are based on fitting corrected marginal regression models relating the outcome to each contaminated covariate separately, which can be computed efficiently even with a large number of features. Under mild conditions, we show that these procedures achieve screening consistency and reduce the number of features considerably, even when the number of covariates grows exponentially with the sample size. Additionally, if the true covariates are weakly correlated, corrected penalized marginal screening can achieve full variable selection consistency. Through simulation studies and an analysis of gene expression data for bone mineral density of Norwegian women, we demonstrate that the two new screening procedures make estimation of linear errors-in-variables models computationally scalable in high dimensional settings, and improve finite sample estimation and selection performance compared with estimators that do not employ a screening stage.

Submitted 20 April, 2021; originally announced April 2021.

arXiv:2010.02469 [pdf, other]

Generalized Matrix Factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays

Authors: Łukasz Kidziński, Francis K. C. Hui, David I. Warton, Trevor Hastie

Abstract: Unmeasured or latent variables are often the cause of correlations between multivariate measurements, which are studied in a variety of fields such as psychology, ecology, and medicine. For Gaussian measurements, there are classical tools such as factor analysis or principal component analysis with a well-established theory and fast algorithms. Generalized Linear Latent Variable models (GLLVMs) generalize such factor models to non-Gaussian responses. However, current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets with thousands of observational units or responses. In this article, we propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood and then using a Newton method and Fisher scoring to learn the model parameters. Computationally, our method is noticeably faster and more stable, enabling GLLVM fits to much larger matrices than previously possible. We apply our method on a dataset of 48,000 observational units with over 2,000 observed species in each unit and find that most of the variability can be explained with a handful of factors. We publish an easy-to-use implementation of our proposed fitting algorithm.

Submitted 27 January, 2022; v1 submitted 6 October, 2020; originally announced October 2020.

arXiv:1911.08628 [pdf, other]

doi 10.1007/978-981-15-1960-4_1

Symbolic Formulae for Linear Mixed Models

Authors: Emi Tanaka, Francis K. C. Hui

Abstract: A statistical model is a mathematical representation of an often simplified or idealised data-generating process. In this paper, we focus on a particular type of statistical model, called linear mixed models (LMMs), that is widely used in many disciplines, e.g. agriculture, ecology, econometrics, psychology. Mixed models, also commonly known as multi-level, nested, hierarchical or panel data models, incorporate a combination of fixed and random effects, with LMMs being a special case. The inclusion of random effects in particular gives LMMs considerable flexibility in accounting for many types of complex correlated structures often found in data. This flexibility, however, has given rise to a number of ways by which an end-user can specify the precise form of the LMM that they wish to fit in statistical software. In this paper, we review the software design for specification of the LMM (and its special case, the linear model), focusing in particular on the use of high-level symbolic model formulae and two popular but contrasting R-packages in lme4 and asreml.

Submitted 19 November, 2019; originally announced November 2019.

arXiv:1509.04834 [pdf, ps, other]

doi 10.1214/15-AOAS813

Multi-species distribution modeling using penalized mixture of regressions

Authors: Francis K. C. Hui, David I. Warton, Scott D. Foster

Abstract: Multi-species distribution modeling, which relates the occurrence of multiple species to environmental variables, is an important tool used by ecologists for both predicting the distribution of species in a community and identifying the important variables driving species co-occurrences. Recently, Dunstan, Foster and Darnell [Ecol. Model. 222 (2011) 955-963] proposed using finite mixture of regression (FMR) models for multi-species distribution modeling, where species are clustered based on their environmental response to form a small number of "archetypal responses." As an illustrative example, they applied their mixture model approach to a presence-absence data set of 200 marine organisms, collected along the Great Barrier Reef in Australia. Little attention, however, was given to the problem of model selection - since the archetypes (mixture components) may depend on different but likely overlapping sets of covariates, a method is needed for performing variable selection on all components simultaneously. In this article, we consider using penalized likelihood functions for variable selection in FMR models. We propose two penalties which exploit the grouped structure of the covariates, that is, each covariate is represented by a group of coefficients, one for each component. This leads to an attractive form of shrinkage that allows a covariate to be removed from all components simultaneously. Both penalties are shown to possess specific forms of variable selection consistency, with simulations indicating they outperform other methods which do not take into account the grouped structure. When applied to the Great Barrier Reef data set, penalized FMR models offer more insight into the important variables driving species co-occurrence in the marine community (compared to previous results where no model selection was conducted), while offering a computationally stable method of modeling complex species-environment relationships (through regularization).

Submitted 16 September, 2015; originally announced September 2015.

Comments: Published at http://dx.doi.org/10.1214/15-AOAS813 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS813

Journal ref: Annals of Applied Statistics 2015, Vol. 9, No. 2, 866-882

arXiv:1211.3460 [pdf, other]

A Nonparametric Measure of Local Association for two-way Contingency Tables

Authors: Francis K. C. Hui, Gery Geenens

Abstract: In contingency table analysis, the odds ratio is a commonly applied measure used to summarize the degree of association between two categorical variables, say R and S. Suppose now that for each individual in the table, a vector of continuous variables X is also observed. It is then vital to analyze whether and how the degree of association varies with X. In this work, we extend the classical odds ratio to the conditional case, and develop nonparametric estimators of this "pointwise odds ratio" to summarize the strength of local association between R and S given X. To allow for maximum flexibility, we make this extension using kernel regression. We develop confidence intervals based on these nonparametric estimators. We demonstrate via simulation that our pointwise odds ratio estimators can outperform model-based counterparts from logistic regression and GAMs, without the need for a linearity or additivity assumption. Finally, we illustrate its application to a dataset of patients from an intensive care unit (ICU), offering a greater insight into how the association between survival of patients admitted for emergency versus elective reasons varies with the patients' ages.

Submitted 14 November, 2012; originally announced November 2012.

Showing 1–14 of 14 results for author: Hui, F K C