-
Hidden Markov Models for Multivariate Panel Data
Authors:
Mackenzie R. Neal,
Alexa A. Sochaniwsky,
Paul D. McNicholas
Abstract:
While advances continue to be made in model-based clustering, challenges persist in modeling various data types such as panel data. Multivariate panel data present difficulties for clustering algorithms due to the unique correlation structure, a consequence of taking observations on several subjects over multiple time points. Additionally, panel data are often plagued by missing data and dropouts, presenting issues for estimation algorithms. This research presents a family of hidden Markov models that compensate for the unique correlation structures that arise in panel data. A modified expectation-maximization algorithm capable of handling missing not at random data and dropout is presented and used to perform model estimation.
Submitted 15 May, 2024; v1 submitted 5 April, 2024;
originally announced April 2024.
-
Finite Mixtures of Multivariate Poisson-Log Normal Factor Analyzers for Clustering Count Data
Authors:
Andrea Payne,
Anjali Silva,
Steven J. Rothstein,
Paul D. McNicholas,
Sanjeena Subedi
Abstract:
A mixture of multivariate Poisson-log normal factor analyzers is introduced by imposing constraints on the covariance matrix, which resulted in flexible models for clustering purposes. In particular, a class of eight parsimonious mixture models based on the mixtures of factor analyzers model are introduced. Variational Gaussian approximation is used for parameter estimation, and information criteria are used for model selection. The proposed models are explored in the context of clustering discrete data arising from RNA sequencing studies. Using real and simulated data, the models are shown to give favourable clustering performance. The GitHub R package for this work is available at https://github.com/anjalisilva/mixMPLNFA and is released under the open-source MIT license.
Submitted 13 November, 2023;
originally announced November 2023.
-
Clustering Three-Way Data with Outliers
Authors:
Katharine M. Clark,
Paul D. McNicholas
Abstract:
Matrix-variate distributions are a recent addition to the model-based clustering field, thereby making it possible to analyze data in matrix form with complex structure such as images and time series. Due to its recent appearance, there is limited literature on matrix-variate data, with even less on dealing with outliers in these models. An approach for clustering matrix-variate normal data with outliers is discussed. The approach, which uses the distribution of subset log-likelihoods, extends the OCLUST algorithm to matrix-variate normal data and uses an iterative approach to detect and trim outliers.
Submitted 11 October, 2023; v1 submitted 8 October, 2023;
originally announced October 2023.
-
Longitudinal Data Clustering with a Copula Kernel Mixture Model
Authors:
Xi Zhang,
Orla A. Murphy,
Paul D. McNicholas
Abstract:
Many common clustering methods cannot be used for clustering multivariate longitudinal data in cases where variables exhibit high autocorrelations. In this article, a copula kernel mixture model (CKMM) is proposed for clustering data of this type. The CKMM is a finite mixture model which decomposes each mixture component's joint density function into its copula and marginal distribution functions. In this decomposition, the Gaussian copula is used due to its mathematical tractability and Gaussian kernel functions are used to estimate the marginal distributions. A generalized expectation-maximization algorithm is used to estimate the model parameters. The performance of the proposed model is assessed in a simulation study and on two real datasets. The proposed model is shown to have effective performance in comparison to standard methods, such as K-means with dynamic time warping clustering and latent growth models.
Submitted 21 July, 2023;
originally announced July 2023.
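A note on the decomposition described in the abstract above: by Sklar's theorem, a joint density can be written as a copula density multiplied by the marginal densities. The sketch below uses generic notation chosen for illustration (the symbols $f_j$, $F_j$, $c$, $R$, and $d$ are assumptions, not the paper's own notation), with the Gaussian copula density given for a correlation matrix $R$.

```latex
f(x_1,\dots,x_d) = c\bigl(F_1(x_1),\dots,F_d(x_d)\bigr)\,\prod_{j=1}^{d} f_j(x_j),
\qquad
c(u_1,\dots,u_d) = |R|^{-1/2}\exp\!\Bigl\{-\tfrac{1}{2}\, z^{\top}\bigl(R^{-1}-I_d\bigr) z\Bigr\},
\quad z_j = \Phi^{-1}(u_j).
```

In a mixture, each component would carry its own copula parameters and marginals; how the CKMM parameterizes these is not stated in the abstract.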
-
Flexible Variable Selection for Clustering and Classification
Authors:
Mackenzie R. Neal,
Paul D. McNicholas
Abstract:
The importance of variable selection for clustering has been recognized for some time, and mixture models are well-established as a statistical approach to clustering. Yet, the literature on variable selection in model-based clustering remains largely rooted in the assumption of Gaussian clusters. Unsurprisingly, variable selection algorithms based on this assumption tend to break down in the presence of cluster skewness. A novel variable selection algorithm is presented that utilizes the Manly transformation mixture model to select variables based on their ability to separate clusters, and is effective even when clusters depart from the Gaussian assumption. The proposed approach, which is implemented within the R package vscc, is compared to existing variable selection methods -- including an existing method that can account for cluster skewness -- using simulated and real datasets.
Submitted 9 February, 2024; v1 submitted 25 May, 2023;
originally announced May 2023.
-
Model-based clustering via skewed matrix-variate cluster-weighted models
Authors:
Michael P. B. Gallaugher,
Salvatore D. Tomarchio,
Paul D. McNicholas,
Antonio Punzo
Abstract:
Cluster-weighted models (CWMs) extend finite mixtures of regressions (FMRs) in order to allow the distribution of covariates to contribute to the clustering process. In a matrix-variate framework, the matrix-variate normal CWM has been recently introduced. However, problems may be encountered when data exhibit skewness or other deviations from normality in the responses, covariates or both. Thus, we introduce a family of 24 matrix-variate CWMs which are obtained by allowing both the responses and covariates to be modelled by using one of four existing skewed matrix-variate distributions or the matrix-variate normal distribution. Endowed with a greater flexibility, our matrix-variate CWMs are able to handle this kind of data in a more suitable manner. As a by-product, the four skewed matrix-variate FMRs are also introduced. Maximum likelihood parameter estimates are derived using an expectation-conditional maximization algorithm. Parameter recovery, classification assessment, and the capability of the Bayesian information criterion to detect the underlying groups are investigated using simulated data. Lastly, our matrix-variate CWMs, along with the matrix-variate normal CWM and matrix-variate FMRs, are applied to two real datasets for illustrative purposes.
Submitted 29 November, 2021;
originally announced November 2021.
-
Four Skewed Tensor Distributions
Authors:
Michael P. B. Gallaugher,
Peter A. Tait,
Paul D. McNicholas
Abstract:
With the rise of the "big data" phenomenon in recent years, data is coming in many different complex forms. One example of this is multi-way data that come in the form of higher-order tensors such as coloured images and movie clips. Although there has been a recent rise in models for looking at the simple case of three-way data in the form of matrices, there is a relative paucity of higher-order tensor variate methods. The most common tensor distribution in the literature is the tensor variate normal distribution; however, its use can be problematic if the data exhibit skewness or outliers. Herein, we develop four skewed tensor variate distributions which to our knowledge are the first skewed tensor distributions to be proposed in the literature, and are able to parameterize both skewness and tail weight. Properties and parameter estimation are discussed, and real and simulated data are used for illustration.
Submitted 16 June, 2021;
originally announced June 2021.
-
Matrix Normal Cluster-Weighted Models
Authors:
Salvatore D. Tomarchio,
Paul D. McNicholas,
Antonio Punzo
Abstract:
Finite mixtures of regressions with fixed covariates are a commonly used model-based clustering methodology to deal with regression data. However, they assume assignment independence, i.e. the allocation of data points to the clusters is made independently of the distribution of the covariates. In order to take into account the latter aspect, finite mixtures of regressions with random covariates, also known as cluster-weighted models (CWMs), have been proposed in the univariate and multivariate literature. In this paper, the CWM is extended to matrix data, e.g. those data where a set of variables are simultaneously observed at different time points or locations. Specifically, the cluster-specific marginal distribution of the covariates, and the cluster-specific conditional distribution of the responses given the covariates, are assumed to be matrix normal. Maximum likelihood parameter estimates are derived using an ECM algorithm. Parameter recovery, classification assessment and the capability of the BIC to detect the underlying groups are analyzed on simulated data. Finally, two real data applications concerning educational indicators and the Italian non-life insurance market are presented.
Submitted 24 April, 2021;
originally announced April 2021.
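The distinction the abstract above draws between fixed and random covariates can be written compactly. This is a sketch in generic notation (the symbols $\pi_g$, $\xi_g$, $\psi_g$, and $G$ are assumptions for illustration): a finite mixture of regressions models only the conditional density of the response, whereas a CWM models the joint density, so the covariate distribution also informs cluster membership.

```latex
\text{FMR (fixed covariates):}\quad p(y \mid x) = \sum_{g=1}^{G} \pi_g\, p(y \mid x;\, \xi_g),
\qquad
\text{CWM (random covariates):}\quad p(x, y) = \sum_{g=1}^{G} \pi_g\, p(y \mid x;\, \xi_g)\, p(x;\, \psi_g).
```

In the matrix setting described above, $x$ and $y$ would be matrices and both component densities would be matrix normal.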
-
Multivariate Cluster Weighted Models Using Skewed Distributions
Authors:
Michael P. B. Gallaugher,
Salvatore D. Tomarchio,
Paul D. McNicholas,
Antonio Punzo
Abstract:
Much work has been done in the area of the cluster weighted model (CWM), which extends the finite mixture of regression model to include modelling of the covariates. Although many types of distributions have been considered for both the response and covariates, to our knowledge skewed distributions have not yet been considered in this paradigm. Herein, a family of 24 novel CWMs is considered, which allows both the covariates and response variables to be modelled using one of four skewed distributions, or the normal distribution. Parameter estimation is performed using the expectation-maximization algorithm and both simulated and real data are used for illustration.
Submitted 17 March, 2021;
originally announced March 2021.
-
Skewed Distributions or Transformations? Modelling Skewness for a Cluster Analysis
Authors:
Michael P. B. Gallaugher,
Paul D. McNicholas,
Volodymyr Melnykov,
Xuwen Zhu
Abstract:
Because of its mathematical tractability, the Gaussian mixture model holds a special place in the literature for clustering and classification. For all its benefits, however, the Gaussian mixture model poses problems when the data is skewed or contains outliers. Because of this, methods have been developed over the years for handling skewed data, and fall into two general categories. The first is to consider a mixture of more flexible skewed distributions, and the second is based on incorporating a transformation to near normality. Although these methods have been compared in their respective papers, there has yet to be a detailed comparison to determine when one method might be more suitable than the other. Herein, we provide a detailed comparison on many benchmarking datasets, as well as describe a novel method to assess cluster separation.
Submitted 18 November, 2020;
originally announced November 2020.
-
Defying the Circadian Rhythm: Clustering Participant Telemetry in the UK Biobank Data
Authors:
Nikola Pocuca,
Mark Farrell,
Paul D. McNicholas
Abstract:
The UK Biobank dataset follows over 500,000 volunteers and contains a diverse set of information related to societal outcomes. Among this vast collection, a large quantity of telemetry collected from wrist-worn accelerometers provides a snapshot of participant activity. Using this data, a population of shift workers, subjected to disrupted circadian rhythms, is analysed using a mixture model-based approach to yield protective effects from physical activity on survival outcomes. In this paper, we develop a scalable, standardized, and unique methodology that efficiently clusters a vast quantity of participant telemetry. By building upon the work of Doherty et al. (2017), we introduce a standardized, low-dimensional feature for clustering purposes. Participants are clustered using a matrix variate mixture model-based approach. Once clustered, survival analysis is performed to demonstrate distinct lifetime outcomes for individuals within each cluster. In summary, we process, cluster, and analyse a subset of UK Biobank participants to show the protective effects from physical activity on circadian disrupted individuals.
Submitted 16 November, 2020;
originally announced November 2020.
-
Mixtures of Contaminated Matrix Variate Normal Distributions
Authors:
Salvatore D. Tomarchio,
Michael P. B. Gallaugher,
Antonio Punzo,
Paul D. McNicholas
Abstract:
Analysis of three-way data is becoming ever more prevalent in the literature, especially in the area of clustering and classification. Real data, including real three-way data, are often contaminated by potential outlying observations. Their detection, as well as the development of robust models insensitive to their presence, is particularly important for this type of data because of the practical issues concerning their effective visualization. Herein, the contaminated matrix variate normal distribution is discussed and then utilized in the mixture model paradigm for clustering. One key advantage of the proposed model is the ability to automatically detect potential outlying matrices by computing their a posteriori probability to be a "good" or "bad" point. Such detection is currently unavailable using existing matrix variate methods. An expectation conditional maximization algorithm is used for parameter estimation, and both simulated and real data are used for illustration.
Submitted 8 May, 2020;
originally announced May 2020.
-
Parsimonious Mixtures of Matrix Variate Bilinear Factor Analyzers
Authors:
Michael P. B. Gallaugher,
Paul D. McNicholas
Abstract:
Over the years, data have become increasingly higher dimensional, which has prompted an increased need for dimension reduction techniques. This is perhaps especially true for clustering (unsupervised classification) as well as semi-supervised and supervised classification. Many methods have been proposed in the literature for two-way (multivariate) data and quite recently methods have been presented for three-way (matrix variate) data. One such method is the mixtures of matrix variate bilinear factor analyzers (MMVBFA) model. Herein, we propose a total of 64 parsimonious MMVBFA models. Simulated and real data are used for illustration.
Submitted 20 November, 2019;
originally announced November 2019.
-
Assessing and Visualizing Matrix Variate Normality
Authors:
Nikola Pocuca,
Michael P. B. Gallaugher,
Katharine M. Clark,
Paul D. McNicholas
Abstract:
A framework for assessing the matrix variate normality of three-way data is developed. The framework comprises a visual method and a goodness of fit test based on the Mahalanobis squared distance (MSD). The MSD of multivariate and matrix variate normal estimators, respectively, are used as an assessment tool for matrix variate normality. Specifically, these are used in the form of a distance-distance (DD) plot as a graphical method for visualizing matrix variate normality. In addition, we employ the popular Kolmogorov-Smirnov goodness of fit test in the context of assessing matrix variate normality for three-way data. Finally, an appropriate simulation study spanning a large range of dimensions and data sizes shows that for various settings, the test proves itself highly robust.
Submitted 7 October, 2019;
originally announced October 2019.
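For reference, the Mahalanobis squared distance mentioned in the abstract above has a simple closed form in the matrix variate normal case. The sketch assumes an $r \times c$ matrix $X$ with mean $M$, an $r \times r$ row covariance $\Sigma$, and a $c \times c$ column covariance $\Psi$ (notation chosen here for illustration, not taken from the paper). With known parameters, the distance coincides with the multivariate distance of $\operatorname{vec}(X)$ and is chi-squared distributed with $rc$ degrees of freedom; with estimated parameters the reference distribution is only approximate.

```latex
D^2(X) = \operatorname{tr}\!\bigl\{\Sigma^{-1}(X-M)\,\Psi^{-1}(X-M)^{\top}\bigr\}
       = \operatorname{vec}(X-M)^{\top}\,(\Psi\otimes\Sigma)^{-1}\operatorname{vec}(X-M),
\qquad D^2(X) \sim \chi^2_{rc}.
```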
-
Clustering Higher Order Data: An Application to Pediatric Multi-variable Longitudinal Data
Authors:
Peter A. Tait,
Paul D. McNicholas,
Joyce Obeid
Abstract:
Physical activity levels are an important predictor of cardiovascular health and increasingly being measured by sensors, like accelerometers. Accelerometers produce rich multivariate data that can inform important clinical decisions related to individual patients and public health. The CHAMPION study, a study of youth with chronic inflammatory conditions, aims to determine the links between heart health, inflammation, physical activity, and fitness. The accelerometer data from CHAMPION is represented as 4-dimensional arrays, and a finite mixture of multidimensional arrays model is developed for clustering. The use of model-based clustering for multidimensional arrays has thus far been limited to two-dimensional arrays, i.e., matrices or order-two tensors, and the work in this paper can also be seen as an approach for clustering D-dimensional arrays for D > 2 or, in other words, for clustering order-D tensors.
Submitted 4 December, 2020; v1 submitted 19 July, 2019;
originally announced July 2019.
-
Model-based clustering and classification using mixtures of multivariate skewed power exponential distributions
Authors:
Utkarsh J. Dang,
Michael P. B. Gallaugher,
Ryan P. Browne,
Paul D. McNicholas
Abstract:
Families of mixtures of multivariate power exponential (MPE) distributions have been previously introduced and shown to be competitive for cluster analysis in comparison to other elliptical mixtures including mixtures of Gaussian distributions. Herein, we propose a family of mixtures of multivariate skewed power exponential distributions to combine the flexibility of the MPE distribution with the ability to model skewness. These mixtures are more robust to variations from normality and can account for skewness, varying tail weight, and peakedness of data. A generalized expectation-maximization approach combining minorization-maximization and optimization based on accelerated line search algorithms on the Stiefel manifold is used for parameter estimation. These mixtures are implemented both in the model-based clustering and classification frameworks. Both simulated and benchmark data are used for illustration and comparison to other mixture families.
Submitted 20 January, 2023; v1 submitted 3 July, 2019;
originally announced July 2019.
-
Finding Outliers in Gaussian Model-Based Clustering
Authors:
Katharine M. Clark,
Paul D. McNicholas
Abstract:
Clustering, or unsupervised classification, is a task often plagued by outliers. Yet there is a paucity of work on handling outliers in clustering. Outlier identification algorithms tend to fall into three broad categories: outlier inclusion, outlier trimming, and post hoc outlier identification methods, with the former two often requiring pre-specification of the number of outliers. The fact that sample squared Mahalanobis distance is beta-distributed is used to derive an approximate distribution for the log-likelihoods of subset finite Gaussian mixture models. An algorithm is then proposed that removes the least plausible points according to the subset log-likelihoods, which are deemed outliers, until the subset log-likelihoods adhere to the reference distribution. This results in a trimming method, called OCLUST, that inherently estimates the number of outliers.
Submitted 30 May, 2024; v1 submitted 1 July, 2019;
originally announced July 2019.
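The beta-distribution fact cited in the abstract above can be illustrated numerically. For $n$ observations from a $p$-variate normal, with the sample mean and unbiased sample covariance plugged in, the scaled distance $n D_i^2/(n-1)^2$ follows a Beta$(p/2, (n-p-1)/2)$ distribution. The minimal Python sketch below (the function name and the 2-D setup are illustrative assumptions, and this is not the OCLUST algorithm itself) computes the scaled distances and checks two consequences: every value lies in $[0, 1]$, and their average equals the beta mean $p/(n-1)$.

```python
import random


def scaled_mahalanobis_2d(points):
    """Return n/(n-1)^2 * D_i^2 for 2-D points, using the sample mean and
    the unbiased sample covariance matrix (hypothetical helper)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the unbiased sample covariance matrix.
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    scale = n / (n - 1) ** 2
    out = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form with the explicit inverse of the 2x2 covariance.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        out.append(scale * d2)
    return out


rng = random.Random(1)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
scaled = scaled_mahalanobis_2d(data)

# The average of the scaled distances equals p / (n - 1) = 2 / 49,
# which is also the mean of Beta(p/2, (n - p - 1)/2).
print(sum(scaled) / len(scaled), 2 / 49)
```

The identity for the average holds exactly for any dataset, so it is a sanity check on the scaling rather than a test of normality.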
-
Flexible Clustering with a Sparse Mixture of Generalized Hyperbolic Distributions
Authors:
Alexa A. Sochaniwsky,
Michael P. B. Gallaugher,
Yang Tang,
Paul D. McNicholas
Abstract:
Robust clustering of high-dimensional data is an important topic because clusters in real datasets are often heavy-tailed and/or asymmetric. Traditional approaches to model-based clustering often fail for high dimensional data, e.g., due to the number of free covariance parameters. A parametrization of the component scale matrices for the mixture of generalized hyperbolic distributions is proposed. This parameterization includes a penalty term in the likelihood. An analytically feasible expectation-maximization algorithm is developed by placing a gamma-lasso penalty constraining the concentration matrix. The proposed methodology is investigated through simulation studies and illustrated using two real datasets.
Submitted 6 June, 2024; v1 submitted 12 March, 2019;
originally announced March 2019.
-
Clustering Discrete-Valued Time Series
Authors:
Tyler Roick,
Dimitris Karlis,
Paul D. McNicholas
Abstract:
There is a need for the development of models that are able to account for discreteness in data, along with its time series properties and correlation. Our focus falls on INteger-valued AutoRegressive (INAR) type models. The INAR type models can be used in conjunction with existing model-based clustering techniques to cluster discrete-valued time series data. With the use of a finite mixture model, several existing techniques such as the selection of the number of clusters, estimation using expectation-maximization and model selection are applicable. The proposed model is then demonstrated on real data to illustrate its clustering applications.
Submitted 27 March, 2020; v1 submitted 26 January, 2019;
originally announced January 2019.
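To make the INAR idea in the abstract above concrete, here is a minimal Python sketch of the simplest member of that class, an INAR(1) process with binomial thinning and Poisson innovations: each of the previous period's counts survives with probability `alpha`, and a Poisson number of new arrivals is added. The function names, parameter names, and the choice of Poisson innovations are assumptions for illustration; the paper's specific models are not described in the abstract.

```python
import math
import random


def poisson(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_inar1(alpha, lam, length, seed=0):
    """Simulate X_t = alpha o X_{t-1} + e_t, where 'o' is binomial thinning
    and e_t ~ Poisson(lam). Requires 0 <= alpha < 1."""
    rng = random.Random(seed)
    # Start from the stationary distribution, Poisson(lam / (1 - alpha)).
    x = poisson(rng, lam / (1 - alpha))
    series = []
    for _ in range(length):
        # Binomial thinning: each of the x counts survives with prob. alpha.
        survivors = sum(rng.random() < alpha for _ in range(x))
        x = survivors + poisson(rng, lam)
        series.append(x)
    return series


series = simulate_inar1(alpha=0.5, lam=2.0, length=200, seed=1)
print(series[:10])
```

In a clustering setting, one such process per mixture component would give each cluster its own `alpha` and `lam`.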
-
Modeling Frequency and Severity of Claims with the Zero-Inflated Generalized Cluster-Weighted Models
Authors:
Nikola Pocuca,
Petar Jevtic,
Paul D. McNicholas,
Tatjana Miljkovic
Abstract:
In this paper, we propose two important extensions to cluster-weighted models (CWMs). First, we extend CWMs to have generalized cluster-weighted models (GCWMs) by allowing modeling of non-Gaussian distribution of the continuous covariates, as they frequently occur in insurance practice. Secondly, we introduce a zero-inflated extension of GCWM (ZI-GCWM) for modeling insurance claims data with excess zeros coming from heterogeneous sources. Additionally, we give two expectation-optimization (EM) algorithms for parameter estimation given the proposed models. An appropriate simulation study shows that, for various settings and in contrast to the existing mixture-based approaches, both extended models perform well. Finally, a real data set based on French automobile policies is used to illustrate the application of the proposed extensions.
Submitted 31 December, 2018;
originally announced December 2018.
-
Detecting British Columbia Coastal Rainfall Patterns by Clustering Gaussian Processes
Authors:
Forrest Paton,
Paul D. McNicholas
Abstract:
Functional data analysis is a statistical framework where data are assumed to follow some functional form. This method of analysis is commonly applied to time series data, where time, measured continuously or in discrete intervals, serves as the location for a function's value. Gaussian processes are a generalization of the multivariate normal distribution to function space and, in this paper, they are used to shed light on coastal rainfall patterns in British Columbia (BC). Specifically, this work addresses the question of how one should carry out an exploratory cluster analysis for the BC, or any similar, coastal rainfall data. An approach is developed for clustering multiple processes observed on a comparable interval, based on how similar their underlying covariance kernels are. This approach provides interesting insights into the BC data, and these insights can be framed in terms of El Niño and La Niña; however, the result is not simply one cluster representing El Niño years and another for La Niña years. From one perspective, the results show that clustering annual rainfall can potentially be used to identify extreme weather patterns.
Submitted 3 April, 2020; v1 submitted 23 December, 2018;
originally announced December 2018.
-
An Evolutionary Algorithm with Crossover and Mutation for Model-Based Clustering
Authors:
Sharon M. McNicholas,
Paul D. McNicholas,
Daniel A. Ashlock
Abstract:
An evolutionary algorithm (EA) is developed as an alternative to the EM algorithm for parameter estimation in model-based clustering. This EA facilitates a different search of the fitness landscape, i.e., the likelihood surface, utilizing both crossover and mutation. Furthermore, this EA represents an efficient approach to "hard" model-based clustering and so it can be viewed as a sort of generalization of the k-means algorithm, which is itself equivalent to a restricted Gaussian mixture model. The EA is illustrated on several datasets, and its performance is compared to other hard clustering approaches and model-based clustering via the EM algorithm.
Submitted 8 June, 2020; v1 submitted 31 October, 2018;
originally announced November 2018.
-
Mixtures of Skewed Matrix Variate Bilinear Factor Analyzers
Authors:
Michael P. B. Gallaugher,
Paul D. McNicholas
Abstract:
In recent years, data have become increasingly higher dimensional and, therefore, an increased need has arisen for dimension reduction techniques for clustering. Although such techniques are firmly established in the literature for multivariate data, there is a relative paucity in the area of matrix variate, or three-way, data. Furthermore, the few methods that are available all assume matrix variate normality, which is not always sensible if cluster skewness or excess kurtosis is present. Mixtures of bilinear factor analyzers using skewed matrix variate distributions are proposed. In all, four such mixture models are presented, based on matrix variate skew-t, generalized hyperbolic, variance-gamma, and normal inverse Gaussian distributions, respectively.
Submitted 27 September, 2019; v1 submitted 7 September, 2018;
originally announced September 2018.
-
Parameter-wise co-clustering for high-dimensional data
Authors:
M. P. B. Gallaugher,
C. Biernacki,
P. D. McNicholas
Abstract:
In recent years, data dimensionality has increasingly become a concern, leading to many parameter and dimension reduction techniques being proposed in the literature. A parameter-wise co-clustering model, for data modelled via continuous random variables, is presented. The proposed model, although allowing more flexibility, still maintains the very high degree of parsimony achieved by traditional co-clustering. A stochastic expectation-maximization (SEM) algorithm along with a Gibbs sampler is used for parameter estimation and an integrated complete log-likelihood criterion is used for model selection. Simulated and real datasets are used for illustration and comparison with traditional co-clustering.
Submitted 30 September, 2020; v1 submitted 25 August, 2018;
originally announced August 2018.
-
Finite mixtures of matrix-variate Poisson-log normal distributions for three-way count data
Authors:
Anjali Silva,
Steven J. Rothstein,
Paul D. McNicholas,
Xiaoke Qin,
Sanjeena Subedi
Abstract:
Three-way data structures, characterized by three entities, the units, the variables and the occasions, are frequent in biological studies. In RNA sequencing, three-way data structures are obtained when high-throughput transcriptome sequencing data are collected for $n$ genes across $p$ conditions at $r$ occasions. Matrix variate distributions offer a natural way to model three-way data and mixtures of matrix variate distributions can be used to cluster three-way data. Clustering of gene expression data is carried out as a means of discovering gene co-expression networks. In this work, a mixture of matrix variate Poisson-log normal distributions is proposed for clustering read counts from RNA sequencing. By considering the matrix variate structure, full information on the conditions and occasions of the RNA sequencing dataset is simultaneously considered, and the number of covariance parameters to be estimated is reduced. We propose three different frameworks for parameter estimation: a Markov chain Monte Carlo based approach, a variational Gaussian approximation based approach, and a hybrid approach. Various information criteria are used for model selection. The models are applied to both real and simulated data, and we demonstrate that the proposed approaches can recover the underlying cluster structure in both cases. In simulation studies where the true model parameters are known, our proposed approach shows good parameter recovery.
Submitted 21 June, 2022; v1 submitted 22 July, 2018;
originally announced July 2018.
-
Robust Model-Based Clustering of Voting Records
Authors:
Yang Tang,
Paul D. McNicholas,
Antonio Punzo
Abstract:
We explore the possibility of discovering extreme voting patterns in the U.S. Congressional voting records by drawing ideas from the mixture of contaminated normal distributions. A mixture of latent trait models via contaminated normal distributions is proposed. We assume that the low dimensional continuous latent variable comes from a contaminated normal distribution and, therefore, picks up extreme patterns in the observed binary data while clustering. We consider in particular such a model for the analysis of voting records. The model is applied to a U.S. Congressional Voting data set on 16 issues. Note that this approach is the first instance within the literature of a mixture model handling binary data with possible extreme patterns.
Submitted 10 May, 2018;
originally announced May 2018.
-
A Latent Gaussian Mixture Model for Clustering Longitudinal Data
Authors:
Vanessa S. E. Bierling,
Paul D. McNicholas
Abstract:
Finite mixture models have become a popular tool for clustering. Amongst other uses, they have been applied for clustering longitudinal data and clustering high-dimensional data. In the latter case, a latent Gaussian mixture model is sometimes used. Although there has been much work on clustering using latent variables and on clustering longitudinal data, respectively, there has been a paucity of work that combines these features. An approach is developed for clustering longitudinal data with many time points based on an extension of the mixture of common factor analyzers model. A variation of the expectation-maximization algorithm is used for parameter estimation and the Bayesian information criterion is used for model selection. The approach is illustrated using real and simulated data.
Submitted 13 April, 2018;
originally announced April 2018.
-
Clustering and Semi-Supervised Classification for Clickstream Data via Mixture Models
Authors:
Michael P. B. Gallaugher,
Paul D. McNicholas
Abstract:
Finite mixture models have been used for unsupervised learning for some time, and their use within the semi-supervised paradigm is becoming more commonplace. Clickstream data is one of the various emerging data types that demands particular attention because there is a notable paucity of statistical learning approaches currently available. A mixture of first-order continuous time Markov models is introduced for unsupervised and semi-supervised learning of clickstream data. This approach assumes continuous time, which distinguishes it from existing mixture model-based approaches; practically, this allows account to be taken of the amount of time each user spends on each webpage. The approach is evaluated, and compared to the discrete time approach, using simulated and real data.
Submitted 16 December, 2020; v1 submitted 13 February, 2018;
originally announced February 2018.
-
A Mixture of Matrix Variate Bilinear Factor Analyzers
Authors:
Michael P. B. Gallaugher,
Paul D. McNicholas
Abstract:
Over the years data has become increasingly higher dimensional, which has prompted an increased need for dimension reduction techniques. This is perhaps especially true for clustering (unsupervised classification) as well as semi-supervised and supervised classification. Although dimension reduction in the area of clustering for multivariate data has been quite thoroughly discussed within the literature, there is relatively little work in the area of three-way, or matrix variate, data. Herein, we develop a mixture of matrix variate bilinear factor analyzers (MMVBFA) model for use in clustering high-dimensional matrix variate data. This work can be considered both the first matrix variate bilinear factor analysis model as well as the first MMVBFA model. Parameter estimation is discussed, and the MMVBFA model is illustrated using simulated and real data.
Submitted 29 September, 2018; v1 submitted 22 December, 2017;
originally announced December 2017.
-
A Multivariate Poisson-Log Normal Mixture Model for Clustering Transcriptome Sequencing Data
Authors:
Anjali Silva,
Steven J. Rothstein,
Paul D. McNicholas,
Sanjeena Subedi
Abstract:
High-dimensional data of discrete and skewed nature is commonly encountered in high-throughput sequencing studies. Analyzing the network itself or the interplay between genes in this type of data continues to present many challenges. As data visualization techniques become cumbersome for higher dimensions and unconvincing when there is no clear separation between homogeneous subgroups within the data, cluster analysis provides an intuitive alternative. The aim of applying mixture model-based clustering in this context is to discover groups of co-expressed genes, which can shed light on biological functions and pathways of gene products. A mixture of multivariate Poisson-Log Normal (MPLN) model is proposed for clustering of high-throughput transcriptome sequencing data. The MPLN model is able to fit a wide range of correlation and overdispersion situations, and is ideal for modeling multivariate count data from RNA sequencing studies. Parameter estimation is carried out via a Markov chain Monte Carlo expectation-maximization algorithm (MCMC-EM), and information criteria are used for model selection.
Submitted 29 November, 2017;
originally announced November 2017.
-
Mixtures of Hidden Truncation Hyperbolic Factor Analyzers
Authors:
Paula M. Murray,
Ryan P. Browne,
Paul D. McNicholas
Abstract:
The mixture of factor analyzers model was first introduced over 20 years ago and, in the meantime, has been extended to several non-Gaussian analogues. In general, these analogues account for situations with heavy tailed and/or skewed clusters. An approach is introduced that unifies many of these approaches into one very general model: the mixture of hidden truncation hyperbolic factor analyzers (MHTHFA) model. In the process of doing this, a hidden truncation hyperbolic factor analysis model is also introduced. The MHTHFA model is illustrated for clustering as well as semi-supervised classification using two real datasets.
Submitted 27 October, 2018; v1 submitted 4 November, 2017;
originally announced November 2017.
-
On Fractionally-Supervised Classification: Weight Selection and Extension to the Multivariate t-Distribution
Authors:
Michael P. B. Gallaugher,
Paul D. McNicholas
Abstract:
Recent work on fractionally-supervised classification (FSC), an approach that allows classification to be carried out with a fractional amount of weight given to the unlabelled points, is further developed in two respects. The primary development addresses a question of fundamental importance over how to choose the amount of weight given to the unlabelled points. The resolution of this matter is essential because it makes FSC more readily applicable to real problems. Interestingly, the resolution of the weight selection problem opens up the possibility of a different approach to model selection in model-based clustering and classification. A secondary development demonstrates that the FSC approach can be effective beyond Gaussian mixture models. To this end, an FSC approach is illustrated using mixtures of multivariate t-distributions.
Submitted 24 September, 2017;
originally announced September 2017.
-
Hidden Truncation Hyperbolic Distributions, Finite Mixtures Thereof, and Their Application for Clustering
Authors:
Paula M. Murray,
Ryan P. Browne,
Paul D. McNicholas
Abstract:
A hidden truncation hyperbolic (HTH) distribution is introduced and finite mixtures thereof are applied for clustering. A stochastic representation of the HTH distribution is given and a density is derived. A hierarchical representation is described, which aids in parameter estimation. Finite mixtures of HTH distributions are presented and their identifiability is proved. The convexity of the HTH distribution is discussed, which is important in clustering applications, and some theoretical results in this direction are presented. The relationship between the HTH distribution and other skewed distributions in the literature is discussed. Illustrations are provided --- both of the HTH distribution and application of finite mixtures thereof for clustering.
Submitted 20 July, 2017; v1 submitted 7 July, 2017;
originally announced July 2017.
-
Subspace Clustering with the Multivariate-t Distribution
Authors:
Angelina Pesevski,
Brian C. Franczak,
Paul D. McNicholas
Abstract:
Clustering procedures suitable for the analysis of very high-dimensional data are needed for many modern data sets. In model-based clustering, a method called high-dimensional data clustering (HDDC) uses a family of Gaussian mixture models for clustering. HDDC is based on the idea that high-dimensional data usually exist in lower-dimensional subspaces; as such, an intrinsic dimension for each sub-population of the observed data can be estimated and cluster analysis can be performed in this lower-dimensional subspace. As a result, only a fraction of the total number of parameters needs to be estimated and a computationally efficient parameter estimation scheme based on the EM algorithm was developed. This family of models has gained attention due to its superior classification performance compared to other families of mixture models; however, it still suffers from the usual limitations of Gaussian mixture model-based approaches. In this paper, a robust analogue of the HDDC approach is proposed. This approach, which extends the HDDC procedure to include the multivariate-t distribution, encompasses 28 models that rectify the aforementioned shortcomings of the HDDC procedure. Our tHDDC procedure is fitted to both simulated and real data sets and is compared to the HDDC procedure using an image reconstruction problem that arose from satellite imagery of Mars' surface.
Submitted 27 June, 2017;
originally announced June 2017.
-
Flexible High-Dimensional Unsupervised Learning with Missing Data
Authors:
Yuhong Wei,
Yang Tang,
Paul D. McNicholas
Abstract:
The mixture of factor analyzers (MFA) model is a famous mixture model-based approach for unsupervised learning with high-dimensional data. It can be useful, inter alia, in situations where the data dimensionality far exceeds the number of observations. In recent years, the MFA model has been extended to non-Gaussian mixtures to account for clusters with heavier tail weight and/or asymmetry. The generalized hyperbolic factor analyzers (MGHFA) model is one such extension, which leads to a flexible modelling paradigm that accounts for both heavier tail weight and cluster asymmetry. In many practical applications, the occurrence of missing values often complicates data analyses. A generalization of the MGHFA is presented to accommodate missing values. Under a missing-at-random mechanism, we develop a computationally efficient alternating expectation conditional maximization algorithm for parameter estimation of the MGHFA model with different patterns of missing values. The imputation of missing values under an incomplete-data structure of MGHFA is also investigated. The performance of our proposed methodology is illustrated through the analysis of simulated and real data.
Submitted 9 November, 2018; v1 submitted 19 June, 2017;
originally announced June 2017.
-
Clustering Airbnb Reviews
Authors:
Yang Tang,
Paul D. McNicholas
Abstract:
In the last decade, online customer reviews have increasingly exerted influence on consumers' decisions when booking accommodation online. The renewed importance of the concept of word-of-mouth is reflected in the growing interest in investigating consumers' experience by analyzing their online reviews through the process of text mining and sentiment analysis. A clustering approach is developed for Boston Airbnb reviews submitted in the English language and collected from 2009 to 2016. This approach is based on a mixture of latent variable models, which provides an appealing framework for handling clustered binary data. We address here the problem of discovering meaningful segments of consumers that are coherent in terms of both the underlying topics and the sentiment behind the reviews. A penalized mixture of latent traits approach is developed to reduce the number of parameters and identify variables that are not informative for clustering. The introduction of component-specific rate parameters avoids the over-penalization that can occur when inferring a shared rate parameter on clustered data. We divided the guests into four groups: property driven guests, host driven guests, guests with a recent overall negative stay, and guests with some negative experiences.
Submitted 27 June, 2019; v1 submitted 8 May, 2017;
originally announced May 2017.
-
Flexible Clustering for High-Dimensional Data via Mixtures of Joint Generalized Hyperbolic Models
Authors:
Yang Tang,
Ryan P. Browne,
Paul D. McNicholas
Abstract:
A mixture of joint generalized hyperbolic distributions (MJGHD) is introduced for asymmetric clustering for high-dimensional data. The MJGHD approach takes into account the cluster-specific subspace, thereby limiting the number of parameters to estimate while also facilitating visualization of results. Identifiability is discussed, and a multi-cycle ECM algorithm is outlined for parameter estimation. The MJGHD approach is illustrated on two real data sets, where the Bayesian information criterion is used for model selection.
Submitted 6 January, 2018; v1 submitted 8 May, 2017;
originally announced May 2017.
-
Three Skewed Matrix Variate Distributions
Authors:
Michael P. B. Gallaugher,
Paul D. McNicholas
Abstract:
Three-way data can be conveniently modelled by using matrix variate distributions. Although there has been a lot of work for the matrix variate normal distribution, there is little work in the area of matrix skew distributions. Three matrix variate distributions that incorporate skewness, as well as other flexible properties such as concentration, are discussed. Equivalences to multivariate analogues are presented, and moment generating functions are derived. Maximum likelihood parameter estimation is discussed, and simulated data is used for illustration.
Submitted 13 August, 2018; v1 submitted 8 April, 2017;
originally announced April 2017.
-
Finite Mixtures of Skewed Matrix Variate Distributions
Authors:
Michael P. B. Gallaugher,
Paul D. McNicholas
Abstract:
Clustering is the process of finding underlying group structures in data. Although mixture model-based clustering is firmly established in the multivariate case, there is a relative paucity of work on matrix variate distributions and none for clustering with mixtures of skewed matrix variate distributions. Four finite mixtures of skewed matrix variate distributions are considered. Parameter estimation is carried out using an expectation-conditional maximization algorithm, and both simulated and real data are used for illustration.
Submitted 5 March, 2018; v1 submitted 26 March, 2017;
originally announced March 2017.
-
Extending Growth Mixture Models Using Continuous Non-Elliptical Distributions
Authors:
Yuhong Wei,
Yang Tang,
Emilie Shireman,
Paul D. McNicholas,
Douglas L. Steinley
Abstract:
Growth mixture models (GMMs) incorporate both conventional random effects growth modeling and latent trajectory classes as in finite mixture modeling; therefore, they offer a way to handle the unobserved heterogeneity between subjects in their development. GMMs with Gaussian random effects dominate the literature. When the data are asymmetric and/or have heavier tails, more than one latent class is required to capture the observed variable distribution. Therefore, a GMM with continuous non-elliptical distributions is proposed to capture skewness and heavier tails in the data set. Specifically, multivariate skew-t distributions and generalized hyperbolic distributions are introduced to extend GMMs. When extending GMMs, four statistical models are considered with differing distributions of measurement errors and random effects. The mathematical development of GMMs with non-elliptical distributions relies on their expression as normal variance-mean mixtures and the resultant relationship with the generalized inverse Gaussian distribution. Parameter estimation is outlined within the expectation-maximization framework before the performance of our GMMs with non-elliptical distributions is illustrated on simulated and real data.
Submitted 13 November, 2017; v1 submitted 25 March, 2017;
originally announced March 2017.
-
Mixtures of Generalized Hyperbolic Distributions and Mixtures of Skew-t Distributions for Model-Based Clustering with Incomplete Data
Authors:
Yuhong Wei,
Yang Tang,
Paul D. McNicholas
Abstract:
Robust clustering from incomplete data is an important topic because, in many practical situations, real data sets are heavy-tailed, asymmetric, and/or have arbitrary patterns of missing observations. Flexible methods and algorithms for model-based clustering are presented via mixture of the generalized hyperbolic distributions and its limiting case, the mixture of multivariate skew-t distributions. An analytically feasible EM algorithm is formulated for parameter estimation and imputation of missing values for mixture models employing missing at random mechanisms. The proposed methodologies are investigated through a simulation study with varying proportions of synthetic missing values and illustrated using a real dataset. Comparisons are made with those obtained from the traditional mixture of generalized hyperbolic distribution counterparts by filling in the missing data using the mean imputation method.
Submitted 19 August, 2018; v1 submitted 6 March, 2017;
originally announced March 2017.
-
A Matrix Variate Skew-t Distribution
Authors:
Michael P. B. Gallaugher,
Paul D. McNicholas
Abstract:
Although there is ample work in the literature dealing with skewness in the multivariate setting, there is a relative paucity of work in the matrix variate paradigm. Such work is, for example, useful for modelling three-way data. A matrix variate skew-t distribution is derived based on a mean-variance matrix normal mixture. An expectation-conditional maximization algorithm is developed for parameter estimation. Simulated data are used for illustration.
Submitted 12 April, 2017; v1 submitted 3 March, 2017;
originally announced March 2017.
-
ContaminatedMixt: An R Package for Fitting Parsimonious Mixtures of Multivariate Contaminated Normal Distributions
Authors:
Antonio Punzo,
Angelo Mazza,
Paul D. McNicholas
Abstract:
We introduce the R package ContaminatedMixt, conceived to disseminate the use of mixtures of multivariate contaminated normal distributions as a tool for robust clustering and classification under the common assumption of elliptically contoured groups. Thirteen variants of the model are also implemented to introduce parsimony. The expectation-conditional maximization algorithm is adopted to obtain maximum likelihood parameter estimates, and likelihood-based model selection criteria are used to select the model and the number of groups. Parallel computation can be used on multicore PCs and computer clusters when several models have to be fitted. Unlike the more popular mixtures of multivariate normal and t distributions, this approach also allows for automatic detection of mild outliers via the maximum a posteriori probabilities procedure. To exemplify the use of the package, applications to artificial and real data are presented.
Submitted 12 June, 2016;
originally announced June 2016.
-
Mixtures of Multivariate Power Exponential Distributions
Authors:
Utkarsh J. Dang,
Ryan P. Browne,
Paul D. McNicholas
Abstract:
An expanded family of mixtures of multivariate power exponential distributions is introduced. While fitting heavy tails and skewness has received much attention in the model-based clustering literature recently, we investigate the use of a distribution that can deal with both varying tail weight and peakedness of data. A family of parsimonious models is proposed using an eigen-decomposition of the scale matrix. A generalized expectation-maximization algorithm is presented that combines convex optimization via a minorization-maximization approach and optimization based on accelerated line search algorithms on the Stiefel manifold. Lastly, the utility of this family of models is illustrated using both toy and benchmark data.
Submitted 12 June, 2015;
originally announced June 2015.
-
Multivariate response and parsimony for Gaussian cluster-weighted models
Authors:
Utkarsh J. Dang,
Antonio Punzo,
Paul D. McNicholas,
Salvatore Ingrassia,
Ryan P. Browne
Abstract:
A family of parsimonious Gaussian cluster-weighted models is presented. This family concerns a multivariate extension to cluster-weighted modelling that can account for correlations between multivariate responses. Parsimony is attained by constraining parts of an eigen-decomposition imposed on the component covariance matrices. A sufficient condition for identifiability is provided and an expectation-maximization algorithm is presented for parameter estimation. Model performance is investigated on both synthetic and classical real data sets and compared with some popular approaches. Finally, accounting for linear dependencies in the presence of a linear regression structure is shown to offer better performance, vis-à-vis clustering, over existing methodologies.
Submitted 26 February, 2016; v1 submitted 3 November, 2014;
originally announced November 2014.
-
Robust Clustering in Regression Analysis via the Contaminated Gaussian Cluster-Weighted Model
Authors:
Antonio Punzo,
Paul D. McNicholas
Abstract:
The Gaussian cluster-weighted model (CWM) is a mixture of regression models with random covariates that allows for flexible clustering of a random vector composed of response variables and covariates. In each mixture component, it adopts a Gaussian distribution for both the covariates and the responses given the covariates. To robustify the approach with respect to possible elliptical heavy-tailed departures from normality, due to the presence of atypical observations, the contaminated Gaussian CWM is introduced here. In addition to the parameters of the Gaussian CWM, each mixture component of our contaminated CWM has a parameter controlling the proportion of outliers, one controlling the proportion of leverage points, one specifying the degree of contamination with respect to the response variables, and one specifying the degree of contamination with respect to the covariates. Crucially, these parameters do not have to be specified a priori, adding flexibility to our approach. Furthermore, once the model is estimated and the observations are assigned to the groups, a finer intra-group classification into typical points, outliers, good leverage points, and bad leverage points (concepts of primary importance in robust regression analysis) can be directly obtained. Relations with other mixture-based contaminated models are analyzed, identifiability conditions are provided, an expectation-conditional maximization algorithm is outlined for parameter estimation, and various implementation and operational issues are discussed. Properties of the estimators of the regression coefficients are evaluated through Monte Carlo experiments and compared to the estimators from the Gaussian CWM. A sensitivity study is also conducted based on a real data set.
Submitted 21 September, 2014;
originally announced September 2014.
-
High-dimensional unsupervised classification via parsimonious contaminated mixtures
Authors:
Antonio Punzo,
Martin Blostein,
Paul D. McNicholas
Abstract:
The contaminated Gaussian distribution represents a simple heavy-tailed elliptical generalization of the Gaussian distribution; unlike the often-considered t-distribution, it also allows for automatic detection of mild outlying or "bad" points in the same way that observations are typically assigned to the groups in the finite mixture model context. Starting from this distribution, we propose the contaminated factor analysis model as a method for dimensionality reduction and detection of bad points in higher dimensions. A mixture of contaminated Gaussian factor analyzers (MCGFA) model follows therefrom, and extends the recently proposed mixture of contaminated Gaussian distributions to high-dimensional data. We introduce a family of 32 parsimonious models formed by introducing constraints on the covariance and contamination structures of the general MCGFA model. We outline a variant of the expectation-maximization algorithm for parameter estimation. Various implementation issues are discussed, and the novel family of models is compared to well-established approaches on both simulated and real data.
Submitted 28 August, 2019; v1 submitted 9 August, 2014;
originally announced August 2014.
-
An Adaptive LASSO-Penalized BIC
Authors:
Sakyajit Bhattacharya,
Paul D. McNicholas
Abstract:
Mixture models are becoming a popular tool for the clustering and classification of high-dimensional data. In such high-dimensional applications, model selection is problematic. The Bayesian information criterion, which is popular in lower-dimensional applications, tends to underestimate the true number of components in high dimensions. We introduce an adaptive LASSO-penalized BIC (ALPBIC) to mitigate this problem. The efficacy of the ALPBIC is illustrated via applications of parsimonious mixtures of factor analyzers. The selection of the best model by ALPBIC is shown to be consistent with increasing numbers of observations based on simulated and real data analyses.
Submitted 5 June, 2014;
originally announced June 2014.
-
Modelling Receiver Operating Characteristic Curves Using Gaussian Mixtures
Authors:
Amay Cheam,
Paul D. McNicholas
Abstract:
The receiver operating characteristic (ROC) curve is widely applied in measuring the performance of diagnostic tests. Many direct and indirect approaches have been proposed for modelling the ROC curve, and because of its tractability, the Gaussian distribution has typically been used to model both populations. We propose using a Gaussian mixture model, leading to a more flexible approach that better accounts for atypical data. Monte Carlo simulation is used to circumvent the absence of a closed form. We show that our method performs favourably when compared to the crude binormal curve and to the semi-parametric frequentist binormal ROC using the famous LABROC procedure.
Submitted 4 June, 2014;
originally announced June 2014.
-
Hypothesis Testing for Parsimonious Gaussian Mixture Models
Authors:
Antonio Punzo,
Ryan P. Browne,
Paul D. McNicholas
Abstract:
Gaussian mixture models with eigen-decomposed covariance structures make up the most popular family of mixture models for clustering and classification, i.e., the Gaussian parsimonious clustering models (GPCM). Although the GPCM family has been used for almost 20 years, selecting the best member of the family in a given situation remains a troublesome problem. Likelihood ratio tests are developed to tackle this problem. These likelihood ratio tests use the heteroscedastic model under the alternative hypothesis but provide much more flexibility and real-world applicability than previous approaches that compare the homoscedastic Gaussian mixture versus the heteroscedastic one. Along the way, a novel maximum likelihood estimation procedure is developed for two members of the GPCM family. Simulations show that the $\chi^2$ reference distribution gives a reasonable approximation for the LR statistics only when the sample size is considerable and when the mixture components are well separated; accordingly, following Lo (2008), a parametric bootstrap is adopted. Furthermore, by generalizing the idea of Greselin and Punzo (2013) to the clustering context, a closed testing procedure, having the defined likelihood ratio tests as local tests, is introduced to assess a unique model in the general family. The advantages of this likelihood ratio testing procedure are illustrated via an application to the well-known Iris data set.
Submitted 2 May, 2014;
originally announced May 2014.