
Showing 1–50 of 79 results for author: McNicholas, P D

  1. arXiv:2404.04122  [pdf, ps, other]

    stat.ME

    Hidden Markov Models for Multivariate Panel Data

    Authors: Mackenzie R. Neal, Alexa A. Sochaniwsky, Paul D. McNicholas

    Abstract: While advances continue to be made in model-based clustering, challenges persist in modeling various data types such as panel data. Multivariate panel data present difficulties for clustering algorithms due to the unique correlation structure, a consequence of taking observations on several subjects over multiple time points. Additionally, panel data are often plagued by missing data and dropouts,…

    Submitted 15 May, 2024; v1 submitted 5 April, 2024; originally announced April 2024.

  2. arXiv:2311.07762  [pdf, other]

    stat.ME stat.CO stat.ML

    Finite Mixtures of Multivariate Poisson-Log Normal Factor Analyzers for Clustering Count Data

    Authors: Andrea Payne, Anjali Silva, Steven J. Rothstein, Paul D. McNicholas, Sanjeena Subedi

    Abstract: A mixture of multivariate Poisson-log normal factor analyzers is introduced by imposing constraints on the covariance matrix, which resulted in flexible models for clustering purposes. In particular, a class of eight parsimonious mixture models based on the mixtures of factor analyzers model are introduced. Variational Gaussian approximation is used for parameter estimation, and information criter…

    Submitted 13 November, 2023; originally announced November 2023.

    Comments: 29 pages, 2 figures

    MSC Class: 62H30

  3. arXiv:2310.05288  [pdf, other]

    stat.ML cs.LG

    Clustering Three-Way Data with Outliers

    Authors: Katharine M. Clark, Paul D. McNicholas

    Abstract: Matrix-variate distributions are a recent addition to the model-based clustering field, thereby making it possible to analyze data in matrix form with complex structure such as images and time series. Due to its recent appearance, there is limited literature on matrix-variate data, with even less on dealing with outliers in these models. An approach for clustering matrix-variate normal data with o…

    Submitted 11 October, 2023; v1 submitted 8 October, 2023; originally announced October 2023.

  4. arXiv:2307.11682  [pdf, other]

    stat.ME stat.CO

    Longitudinal Data Clustering with a Copula Kernel Mixture Model

    Authors: Xi Zhang, Orla A. Murphy, Paul D. McNicholas

    Abstract: Many common clustering methods cannot be used for clustering multivariate longitudinal data in cases where variables exhibit high autocorrelations. In this article, a copula kernel mixture model (CKMM) is proposed for clustering data of this type. The CKMM is a finite mixture model which decomposes each mixture component's joint density function into its copula and marginal distribution functions.…

    Submitted 21 July, 2023; originally announced July 2023.

  5. arXiv:2305.16464  [pdf, other]

    stat.ME

    Flexible Variable Selection for Clustering and Classification

    Authors: Mackenzie R. Neal, Paul D. McNicholas

    Abstract: The importance of variable selection for clustering has been recognized for some time, and mixture models are well-established as a statistical approach to clustering. Yet, the literature on variable selection in model-based clustering remains largely rooted in the assumption of Gaussian clusters. Unsurprisingly, variable selection algorithms based on this assumption tend to break down in the pres…

    Submitted 9 February, 2024; v1 submitted 25 May, 2023; originally announced May 2023.

  6. arXiv:2111.14952  [pdf, ps, other]

    stat.ME stat.AP

    Model-based clustering via skewed matrix-variate cluster-weighted models

    Authors: Michael P. B. Gallaugher, Salvatore D. Tomarchio, Paul D. McNicholas, Antonio Punzo

    Abstract: Cluster-weighted models (CWMs) extend finite mixtures of regressions (FMRs) in order to allow the distribution of covariates to contribute to the clustering process. In a matrix-variate framework, the matrix-variate normal CWM has been recently introduced. However, problems may be encountered when data exhibit skewness or other deviations from normality in the responses, covariates or both. Thus,…

    Submitted 29 November, 2021; originally announced November 2021.

  7. arXiv:2106.08984  [pdf, other]

    stat.ME

    Four Skewed Tensor Distributions

    Authors: Michael P. B. Gallaugher, Peter A. Tait, Paul D. McNicholas

    Abstract: With the rise of the "big data" phenomenon in recent years, data is coming in many different complex forms. One example of this is multi-way data that come in the form of higher-order tensors such as coloured images and movie clips. Although there has been a recent rise in models for looking at the simple case of three-way data in the form of matrices, there is a relative paucity of higher-order t…

    Submitted 16 June, 2021; originally announced June 2021.

  8. arXiv:2104.11900  [pdf, ps, other]

    stat.ME

    Matrix Normal Cluster-Weighted Models

    Authors: Salvatore D. Tomarchio, Paul D. McNicholas, Antonio Punzo

    Abstract: Finite mixtures of regressions with fixed covariates are a commonly used model-based clustering methodology to deal with regression data. However, they assume assignment independence, i.e. the allocation of data points to the clusters is made independently of the distribution of the covariates. In order to take into account the latter aspect, finite mixtures of regressions with random covariates,…

    Submitted 24 April, 2021; originally announced April 2021.

  9. arXiv:2103.09792  [pdf, ps, other]

    stat.ME

    Multivariate Cluster Weighted Models Using Skewed Distributions

    Authors: Michael P. B. Gallaugher, Salvatore D. Tomarchio, Paul D. McNicholas, Antonio Punzo

    Abstract: Much work has been done in the area of the cluster weighted model (CWM), which extends the finite mixture of regression model to include modelling of the covariates. Although many types of distributions have been considered for both the response and covariates, to our knowledge skewed distributions have not yet been considered in this paradigm. Herein, a family of 24 novel CWMs are considered whic…

    Submitted 17 March, 2021; originally announced March 2021.

  10. arXiv:2011.09152  [pdf, other]

    stat.AP

    Skewed Distributions or Transformations? Modelling Skewness for a Cluster Analysis

    Authors: Michael P. B. Gallaugher, Paul D. McNicholas, Volodymyr Melnykov, Xuwen Zhu

    Abstract: Because of its mathematical tractability, the Gaussian mixture model holds a special place in the literature for clustering and classification. For all its benefits, however, the Gaussian mixture model poses problems when the data is skewed or contains outliers. Because of this, methods have been developed over the years for handling skewed data, and fall into two general categories. The first is…

    Submitted 18 November, 2020; originally announced November 2020.

  11. arXiv:2011.08350  [pdf, other]

    stat.AP

    Defying the Circadian Rhythm: Clustering Participant Telemetry in the UK Biobank Data

    Authors: Nikola Pocuca, Mark Farrell, Paul D. McNicholas

    Abstract: The UK Biobank dataset follows over 500,000 volunteers and contains a diverse set of information related to societal outcomes. Among this vast collection, a large quantity of telemetry collected from wrist-worn accelerometers provides a snapshot of participant activity. Using this data, a population of shift workers, subjected to disrupted circadian rhythms, is analysed using a mixture model-based…

    Submitted 16 November, 2020; originally announced November 2020.

    Comments: 28 pages, 23 figures

  12. arXiv:2005.03861  [pdf, ps, other]

    stat.ME

    Mixtures of Contaminated Matrix Variate Normal Distributions

    Authors: Salvatore D. Tomarchio, Michael P. B. Gallaugher, Antonio Punzo, Paul D. McNicholas

    Abstract: Analysis of three-way data is becoming ever more prevalent in the literature, especially in the area of clustering and classification. Real data, including real three-way data, are often contaminated by potential outlying observations. Their detection, as well as the development of robust models insensitive to their presence, is particularly important for this type of data because of the practical…

    Submitted 8 May, 2020; originally announced May 2020.

  13. arXiv:1911.09012  [pdf, ps, other]

    stat.ME stat.CO

    Parsimonious Mixtures of Matrix Variate Bilinear Factor Analyzers

    Authors: Michael P. B. Gallaugher, Paul D. McNicholas

    Abstract: Over the years, data have become increasingly higher dimensional, which has prompted an increased need for dimension reduction techniques. This is perhaps especially true for clustering (unsupervised classification) as well as semi-supervised and supervised classification. Many methods have been proposed in the literature for two-way (multivariate) data and quite recently methods have been present…

    Submitted 20 November, 2019; originally announced November 2019.

  14. arXiv:1910.02859  [pdf, other]

    stat.ME math.ST

    Assessing and Visualizing Matrix Variate Normality

    Authors: Nikola Pocuca, Michael P. B. Gallaugher, Katharine M. Clark, Paul D. McNicholas

    Abstract: A framework for assessing the matrix variate normality of three-way data is developed. The framework comprises a visual method and a goodness of fit test based on the Mahalanobis squared distance (MSD). The MSD of multivariate and matrix variate normal estimators, respectively, are used as an assessment tool for matrix variate normality. Specifically, these are used in the form of a distance-dista…

    Submitted 7 October, 2019; originally announced October 2019.

  15. arXiv:1907.08566  [pdf, other]

    stat.ME stat.AP stat.ML

    Clustering Higher Order Data: An Application to Pediatric Multi-variable Longitudinal Data

    Authors: Peter A. Tait, Paul D. McNicholas, Joyce Obeid

    Abstract: Physical activity levels are an important predictor of cardiovascular health and increasingly being measured by sensors, like accelerometers. Accelerometers produce rich multivariate data that can inform important clinical decisions related to individual patients and public health. The CHAMPION study, a study of youth with chronic inflammatory conditions, aims to determine the links between heart…

    Submitted 4 December, 2020; v1 submitted 19 July, 2019; originally announced July 2019.

  16. arXiv:1907.01938  [pdf, ps, other]

    stat.CO stat.ME

    Model-based clustering and classification using mixtures of multivariate skewed power exponential distributions

    Authors: Utkarsh J. Dang, Michael P. B. Gallaugher, Ryan P. Browne, Paul D. McNicholas

    Abstract: Families of mixtures of multivariate power exponential (MPE) distributions have been previously introduced and shown to be competitive for cluster analysis in comparison to other elliptical mixtures including mixtures of Gaussian distributions. Herein, we propose a family of mixtures of multivariate skewed power exponential distributions to combine the flexibility of the MPE distribution with the…

    Submitted 20 January, 2023; v1 submitted 3 July, 2019; originally announced July 2019.

  17. Finding Outliers in Gaussian Model-Based Clustering

    Authors: Katharine M. Clark, Paul D. McNicholas

    Abstract: Clustering, or unsupervised classification, is a task often plagued by outliers. Yet there is a paucity of work on handling outliers in clustering. Outlier identification algorithms tend to fall into three broad categories: outlier inclusion, outlier trimming, and post hoc outlier identification methods, with the former two often requiring pre-specification of the number of outliers. The fact that…

    Submitted 30 May, 2024; v1 submitted 1 July, 2019; originally announced July 2019.

  18. arXiv:1903.05054  [pdf, other]

    stat.ME stat.ML

    Flexible Clustering with a Sparse Mixture of Generalized Hyperbolic Distributions

    Authors: Alexa A. Sochaniwsky, Michael P. B. Gallaugher, Yang Tang, Paul D. McNicholas

    Abstract: Robust clustering of high-dimensional data is an important topic because clusters in real datasets are often heavy-tailed and/or asymmetric. Traditional approaches to model-based clustering often fail for high dimensional data, e.g., due to the number of free covariance parameters. A parametrization of the component scale matrices for the mixture of generalized hyperbolic distributions is proposed…

    Submitted 6 June, 2024; v1 submitted 12 March, 2019; originally announced March 2019.

  19. arXiv:1901.09249  [pdf, other]

    stat.ME stat.AP stat.ML

    Clustering Discrete-Valued Time Series

    Authors: Tyler Roick, Dimitris Karlis, Paul D. McNicholas

    Abstract: There is a need for the development of models that are able to account for discreteness in data, along with its time series properties and correlation. Our focus falls on INteger-valued AutoRegressive (INAR) type models. The INAR type models can be used in conjunction with existing model-based clustering techniques to cluster discrete-valued time series data. With the use of a finite mixture model…

    Submitted 27 March, 2020; v1 submitted 26 January, 2019; originally announced January 2019.

  20. arXiv:1812.11829  [pdf, other]

    stat.AP

    Modeling Frequency and Severity of Claims with the Zero-Inflated Generalized Cluster-Weighted Models

    Authors: Nikola Pocuca, Petar Jevtic, Paul D. McNicholas, Tatjana Miljkovic

    Abstract: In this paper, we propose two important extensions to cluster-weighted models (CWMs). First, we extend CWMs to have generalized cluster-weighted models (GCWMs) by allowing modeling of non-Gaussian distribution of the continuous covariates, as they frequently occur in insurance practice. Secondly, we introduce a zero-inflated extension of GCWM (ZI-GCWM) for modeling insurance claims data with exces…

    Submitted 31 December, 2018; originally announced December 2018.

  21. arXiv:1812.09758  [pdf, other]

    stat.AP stat.ME stat.ML

    Detecting British Columbia Coastal Rainfall Patterns by Clustering Gaussian Processes

    Authors: Forrest Paton, Paul D. McNicholas

    Abstract: Functional data analysis is a statistical framework where data are assumed to follow some functional form. This method of analysis is commonly applied to time series data, where time, measured continuously or in discrete intervals, serves as the location for a function's value. Gaussian processes are a generalization of the multivariate normal distribution to function space and, in this paper, the…

    Submitted 3 April, 2020; v1 submitted 23 December, 2018; originally announced December 2018.

  22. arXiv:1811.00097  [pdf, other]

    stat.CO stat.ML

    An Evolutionary Algorithm with Crossover and Mutation for Model-Based Clustering

    Authors: Sharon M. McNicholas, Paul D. McNicholas, Daniel A. Ashlock

    Abstract: An evolutionary algorithm (EA) is developed as an alternative to the EM algorithm for parameter estimation in model-based clustering. This EA facilitates a different search of the fitness landscape, i.e., the likelihood surface, utilizing both crossover and mutation. Furthermore, this EA represents an efficient approach to "hard" model-based clustering and so it can be viewed as a sort of generali…

    Submitted 8 June, 2020; v1 submitted 31 October, 2018; originally announced November 2018.

  23. arXiv:1809.02385  [pdf, other]

    stat.ME stat.CO stat.ML

    Mixtures of Skewed Matrix Variate Bilinear Factor Analyzers

    Authors: Michael P. B. Gallaugher, Paul D. McNicholas

    Abstract: In recent years, data have become increasingly higher dimensional and, therefore, an increased need has arisen for dimension reduction techniques for clustering. Although such techniques are firmly established in the literature for multivariate data, there is a relative paucity in the area of matrix variate, or three-way, data. Furthermore, the few methods that are available all assume matrix vari…

    Submitted 27 September, 2019; v1 submitted 7 September, 2018; originally announced September 2018.

  24. arXiv:1808.08366  [pdf, other]

    stat.ML cs.LG

    Parameter-wise co-clustering for high-dimensional data

    Authors: M. P. B. Gallaugher, C. Biernacki, P. D. McNicholas

    Abstract: In recent years, data dimensionality has increasingly become a concern, leading to many parameter and dimension reduction techniques being proposed in the literature. A parameter-wise co-clustering model, for data modelled via continuous random variables, is presented. The proposed model, although allowing more flexibility, still maintains the very high degree of parsimony achieved by traditional…

    Submitted 30 September, 2020; v1 submitted 25 August, 2018; originally announced August 2018.

    Comments: Submitted to Pattern Recognition Letters

  25. arXiv:1807.08380  [pdf, other]

    stat.ME

    Finite mixtures of matrix-variate Poisson-log normal distributions for three-way count data

    Authors: Anjali Silva, Steven J. Rothstein, Paul D. McNicholas, Xiaoke Qin, Sanjeena Subedi

    Abstract: Three-way data structures, characterized by three entities, the units, the variables and the occasions, are frequent in biological studies. In RNA sequencing, three-way data structures are obtained when high-throughput transcriptome sequencing data are collected for $n$ genes across $p$ conditions at $r$ occasions. Matrix variate distributions offer a natural way to model three-way data and mixtur…

    Submitted 21 June, 2022; v1 submitted 22 July, 2018; originally announced July 2018.

  26. arXiv:1805.04203  [pdf, other]

    stat.ME stat.AP

    Robust Model-Based Clustering of Voting Records

    Authors: Yang Tang, Paul D. McNicholas, Antonio Punzo

    Abstract: We explore the possibility of discovering extreme voting patterns in the U.S. Congressional voting records by drawing ideas from the mixture of contaminated normal distributions. A mixture of latent trait models via contaminated normal distributions is proposed. We assume that the low dimensional continuous latent variable comes from a contaminated normal distribution and, therefore, picks up extr…

    Submitted 10 May, 2018; originally announced May 2018.

  27. arXiv:1804.05133  [pdf, other]

    stat.ME stat.ML

    A Latent Gaussian Mixture Model for Clustering Longitudinal Data

    Authors: Vanessa S. E. Bierling, Paul D. McNicholas

    Abstract: Finite mixture models have become a popular tool for clustering. Amongst other uses, they have been applied for clustering longitudinal data and clustering high-dimensional data. In the latter case, a latent Gaussian mixture model is sometimes used. Although there has been much work on clustering using latent variables and on clustering longitudinal data, respectively, there has been a paucity of…

    Submitted 13 April, 2018; originally announced April 2018.

  28. arXiv:1802.04849  [pdf, other]

    stat.ME stat.ML

    Clustering and Semi-Supervised Classification for Clickstream Data via Mixture Models

    Authors: Michael P. B. Gallaugher, Paul D. McNicholas

    Abstract: Finite mixture models have been used for unsupervised learning for some time, and their use within the semi-supervised paradigm is becoming more commonplace. Clickstream data is one of the various emerging data types that demands particular attention because there is a notable paucity of statistical learning approaches currently available. A mixture of first-order continuous time Markov models is…

    Submitted 16 December, 2020; v1 submitted 13 February, 2018; originally announced February 2018.

  29. arXiv:1712.08664  [pdf, other]

    stat.ME stat.CO stat.ML

    A Mixture of Matrix Variate Bilinear Factor Analyzers

    Authors: Michael P. B. Gallaugher, Paul D. McNicholas

    Abstract: Over the years data has become increasingly higher dimensional, which has prompted an increased need for dimension reduction techniques. This is perhaps especially true for clustering (unsupervised classification) as well as semi-supervised and supervised classification. Although dimension reduction in the area of clustering for multivariate data has been quite thoroughly discussed within the lite…

    Submitted 29 September, 2018; v1 submitted 22 December, 2017; originally announced December 2017.

  30. arXiv:1711.11190  [pdf, ps, other]

    stat.ME q-bio.QM stat.CO

    A Multivariate Poisson-Log Normal Mixture Model for Clustering Transcriptome Sequencing Data

    Authors: Anjali Silva, Steven J. Rothstein, Paul D. McNicholas, Sanjeena Subedi

    Abstract: High-dimensional data of discrete and skewed nature is commonly encountered in high-throughput sequencing studies. Analyzing the network itself or the interplay between genes in this type of data continues to present many challenges. As data visualization techniques become cumbersome for higher dimensions and unconvincing when there is no clear separation between homogeneous subgroups within the d…

    Submitted 29 November, 2017; originally announced November 2017.

  31. arXiv:1711.01504  [pdf, ps, other]

    stat.ME stat.CO

    Mixtures of Hidden Truncation Hyperbolic Factor Analyzers

    Authors: Paula M. Murray, Ryan P. Browne, Paul D. McNicholas

    Abstract: The mixture of factor analyzers model was first introduced over 20 years ago and, in the meantime, has been extended to several non-Gaussian analogues. In general, these analogues account for situations with heavy tailed and/or skewed clusters. An approach is introduced that unifies many of these approaches into one very general model: the mixture of hidden truncation hyperbolic factor analyzers (…

    Submitted 27 October, 2018; v1 submitted 4 November, 2017; originally announced November 2017.

  32. arXiv:1709.08258  [pdf, ps, other]

    stat.ME stat.CO

    On Fractionally-Supervised Classification: Weight Selection and Extension to the Multivariate t-Distribution

    Authors: Michael P. B. Gallaugher, Paul D. McNicholas

    Abstract: Recent work on fractionally-supervised classification (FSC), an approach that allows classification to be carried out with a fractional amount of weight given to the unlabelled points, is further developed in two respects. The primary development addresses a question of fundamental importance over how to choose the amount of weight given to the unlabelled points. The resolution of this matter is e…

    Submitted 24 September, 2017; originally announced September 2017.

  33. Hidden Truncation Hyperbolic Distributions, Finite Mixtures Thereof, and Their Application for Clustering

    Authors: Paula M. Murray, Ryan P. Browne, Paul D. McNicholas

    Abstract: A hidden truncation hyperbolic (HTH) distribution is introduced and finite mixtures thereof are applied for clustering. A stochastic representation of the HTH distribution is given and a density is derived. A hierarchical representation is described, which aids in parameter estimation. Finite mixtures of HTH distributions are presented and their identifiability is proved. The convexity of the HTH…

    Submitted 20 July, 2017; v1 submitted 7 July, 2017; originally announced July 2017.

  34. arXiv:1706.08927  [pdf, other]

    stat.ME

    Subspace Clustering with the Multivariate-t Distribution

    Authors: Angelina Pesevski, Brian C. Franczak, Paul D. McNicholas

    Abstract: Clustering procedures suitable for the analysis of very high-dimensional data are needed for many modern data sets. In model-based clustering, a method called high-dimensional data clustering (HDDC) uses a family of Gaussian mixture models for clustering. HDDC is based on the idea that high-dimensional data usually exists in lower-dimensional subspaces; as such, an intrinsic dimension for each sub…

    Submitted 27 June, 2017; originally announced June 2017.

    Comments: 16 pages, 2 figures

  35. arXiv:1706.06185  [pdf, other]

    stat.CO stat.ME

    Flexible High-Dimensional Unsupervised Learning with Missing Data

    Authors: Yuhong Wei, Yang Tang, Paul D. McNicholas

    Abstract: The mixture of factor analyzers (MFA) model is a famous mixture model-based approach for unsupervised learning with high-dimensional data. It can be useful, inter alia, in situations where the data dimensionality far exceeds the number of observations. In recent years, the MFA model has been extended to non-Gaussian mixtures to account for clusters with heavier tail weight and/or asymmetry. The ge…

    Submitted 9 November, 2018; v1 submitted 19 June, 2017; originally announced June 2017.

  36. arXiv:1705.03134  [pdf, other]

    stat.AP stat.CO stat.ME

    Clustering Airbnb Reviews

    Authors: Yang Tang, Paul D. McNicholas

    Abstract: In the last decade, online customer reviews increasingly exert influence on consumers' decision when booking accommodation online. The renewal importance to the concept of word-of mouth is reflected in the growing interests in investigating consumers' experience by analyzing their online reviews through the process of text mining and sentiment analysis. A clustering approach is developed for Bosto…

    Submitted 27 June, 2019; v1 submitted 8 May, 2017; originally announced May 2017.

  37. arXiv:1705.03130  [pdf, other]

    stat.ME stat.CO

    Flexible Clustering for High-Dimensional Data via Mixtures of Joint Generalized Hyperbolic Models

    Authors: Yang Tang, Ryan P. Browne, Paul D. McNicholas

    Abstract: A mixture of joint generalized hyperbolic distributions (MJGHD) is introduced for asymmetric clustering for high-dimensional data. The MJGHD approach takes into account the cluster-specific subspace, thereby limiting the number of parameters to estimate while also facilitating visualization of results. Identifiability is discussed, and a multi-cycle ECM algorithm is outlined for parameter estimati…

    Submitted 6 January, 2018; v1 submitted 8 May, 2017; originally announced May 2017.

  38. arXiv:1704.02531  [pdf, other]

    stat.ME math.ST

    Three Skewed Matrix Variate Distributions

    Authors: Michael P. B. Gallaugher, Paul D. McNicholas

    Abstract: Three-way data can be conveniently modelled by using matrix variate distributions. Although there has been a lot of work for the matrix variate normal distribution, there is little work in the area of matrix skew distributions. Three matrix variate distributions that incorporate skewness, as well as other flexible properties such as concentration, are discussed. Equivalences to multivariate analog…

    Submitted 13 August, 2018; v1 submitted 8 April, 2017; originally announced April 2017.

  39. arXiv:1703.08882  [pdf, other]

    stat.ME stat.CO

    Finite Mixtures of Skewed Matrix Variate Distributions

    Authors: Michael P. B. Gallaugher, Paul D. McNicholas

    Abstract: Clustering is the process of finding underlying group structures in data. Although mixture model-based clustering is firmly established in the multivariate case, there is a relative paucity of work on matrix variate distributions and none for clustering with mixtures of skewed matrix variate distributions. Four finite mixtures of skewed matrix variate distributions are considered. Parameter estima…

    Submitted 5 March, 2018; v1 submitted 26 March, 2017; originally announced March 2017.

  40. arXiv:1703.08723  [pdf, other]

    stat.ME stat.AP stat.CO

    Extending Growth Mixture Models Using Continuous Non-Elliptical Distributions

    Authors: Yuhong Wei, Yang Tang, Emilie Shireman, Paul D. McNicholas, Douglas L. Steinley

    Abstract: Growth mixture models (GMMs) incorporate both conventional random effects growth modeling and latent trajectory classes as in finite mixture modeling; therefore, they offer a way to handle the unobserved heterogeneity between subjects in their development. GMMs with Gaussian random effects dominate the literature. When the data are asymmetric and/or have heavier tails, more than one latent class i…

    Submitted 13 November, 2017; v1 submitted 25 March, 2017; originally announced March 2017.

  41. Mixtures of Generalized Hyperbolic Distributions and Mixtures of Skew-t Distributions for Model-Based Clustering with Incomplete Data

    Authors: Yuhong Wei, Yang Tang, Paul D. McNicholas

    Abstract: Robust clustering from incomplete data is an important topic because, in many practical situations, real data sets are heavy-tailed, asymmetric, and/or have arbitrary patterns of missing observations. Flexible methods and algorithms for model-based clustering are presented via mixture of the generalized hyperbolic distributions and its limiting case, the mixture of multivariate skew-t distribution…

    Submitted 19 August, 2018; v1 submitted 6 March, 2017; originally announced March 2017.

  42. arXiv:1703.01364  [pdf, other]

    stat.ME math.ST

    A Matrix Variate Skew-t Distribution

    Authors: Michael P. B. Gallaugher, Paul D. McNicholas

    Abstract: Although there is ample work in the literature dealing with skewness in the multivariate setting, there is a relative paucity of work in the matrix variate paradigm. Such work is, for example, useful for modelling three-way data. A matrix variate skew-t distribution is derived based on a mean-variance matrix normal mixture. An expectation-conditional maximization algorithm is developed for paramet…

    Submitted 12 April, 2017; v1 submitted 3 March, 2017; originally announced March 2017.

  43. arXiv:1606.03766  [pdf, other]

    stat.CO

    ContaminatedMixt: An R Package for Fitting Parsimonious Mixtures of Multivariate Contaminated Normal Distributions

    Authors: Antonio Punzo, Angelo Mazza, Paul D. McNicholas

    Abstract: We introduce the R package ContaminatedMixt, conceived to disseminate the use of mixtures of multivariate contaminated normal distributions as a tool for robust clustering and classification under the common assumption of elliptically contoured groups. Thirteen variants of the model are also implemented to introduce parsimony. The expectation-conditional maximization algorithm is adopted to obtain…

    Submitted 12 June, 2016; originally announced June 2016.

  44. arXiv:1506.04137  [pdf, ps, other]

    stat.ME stat.CO

    Mixtures of Multivariate Power Exponential Distributions

    Authors: Utkarsh J. Dang, Ryan P. Browne, Paul D. McNicholas

    Abstract: An expanded family of mixtures of multivariate power exponential distributions is introduced. While fitting heavy-tails and skewness has received much attention in the model-based clustering literature recently, we investigate the use of a distribution that can deal with both varying tail-weight and peakedness of data. A family of parsimonious models is proposed using an eigen-decomposition of the…

    Submitted 12 June, 2015; originally announced June 2015.

  45. arXiv:1411.0560  [pdf, ps, other]

    stat.CO stat.ME stat.ML

    Multivariate response and parsimony for Gaussian cluster-weighted models

    Authors: Utkarsh J. Dang, Antonio Punzo, Paul D. McNicholas, Salvatore Ingrassia, Ryan P. Browne

    Abstract: A family of parsimonious Gaussian cluster-weighted models is presented. This family concerns a multivariate extension to cluster-weighted modelling that can account for correlations between multivariate responses. Parsimony is attained by constraining parts of an eigen-decomposition imposed on the component covariance matrices. A sufficient condition for identifiability is provided and an expectat…

    Submitted 26 February, 2016; v1 submitted 3 November, 2014; originally announced November 2014.

  46. arXiv:1409.6019  [pdf, ps, other]

    stat.ME math.ST stat.CO

    Robust Clustering in Regression Analysis via the Contaminated Gaussian Cluster-Weighted Model

    Authors: Antonio Punzo, Paul D. McNicholas

    Abstract: The Gaussian cluster-weighted model (CWM) is a mixture of regression models with random covariates that allows for flexible clustering of a random vector composed of response variables and covariates. In each mixture component, it adopts a Gaussian distribution for both the covariates and the responses given the covariates. To robustify the approach with respect to possible elliptical heavy tailed…

    Submitted 21 September, 2014; originally announced September 2014.

  47. arXiv:1408.2128  [pdf, ps, other]

    stat.ME stat.AP stat.CO

    High-dimensional unsupervised classification via parsimonious contaminated mixtures

    Authors: Antonio Punzo, Martin Blostein, Paul D. McNicholas

    Abstract: The contaminated Gaussian distribution represents a simple heavy-tailed elliptical generalization of the Gaussian distribution; unlike the often-considered t-distribution, it also allows for automatic detection of mild outlying or "bad" points in the same way that observations are typically assigned to the groups in the finite mixture model context. Starting from this distribution, we propose the…

    Submitted 28 August, 2019; v1 submitted 9 August, 2014; originally announced August 2014.

  48. arXiv:1406.1332  [pdf, other]

    stat.ME

    An Adaptive LASSO-Penalized BIC

    Authors: Sakyajit Bhattacharya, Paul D. McNicholas

    Abstract: Mixture models are becoming a popular tool for the clustering and classification of high-dimensional data. In such high dimensional applications, model selection is problematic. The Bayesian information criterion, which is popular in lower dimensional applications, tends to underestimate the true number of components in high dimensions. We introduce an adaptive LASSO-penalized BIC (ALPBIC) to miti…

    Submitted 5 June, 2014; originally announced June 2014.

  49. arXiv:1406.1245  [pdf, other]

    stat.ME stat.AP stat.CO

    Modelling Receiver Operating Characteristic Curves Using Gaussian Mixtures

    Authors: Amay Cheam, Paul D. McNicholas

    Abstract: The receiver operating characteristic curve is widely applied in measuring the performance of diagnostic tests. Many direct and indirect approaches have been proposed for modelling the ROC curve, and because of its tractability, the Gaussian distribution has typically been used to model both populations. We propose using a Gaussian mixture model, leading to a more flexible approach that better acc…

    Submitted 4 June, 2014; originally announced June 2014.

  50. arXiv:1405.0377  [pdf, ps, other]

    stat.ME stat.CO

    Hypothesis Testing for Parsimonious Gaussian Mixture Models

    Authors: Antonio Punzo, Ryan P. Browne, Paul D. McNicholas

    Abstract: Gaussian mixture models with eigen-decomposed covariance structures make up the most popular family of mixture models for clustering and classification, i.e., the Gaussian parsimonious clustering models (GPCM). Although the GPCM family has been used for almost 20 years, selecting the best member of the family in a given situation remains a troublesome problem. Likelihood ratio tests are developed…

    Submitted 2 May, 2014; originally announced May 2014.