-
greed: An R Package for Model-Based Clustering by Greedy Maximization of the Integrated Classification Likelihood
Authors:
Etienne Côme,
Nicolas Jouvin
Abstract:
The greed package implements the general and flexible framework of arXiv:2002.11577 for model-based clustering in the R language. Based on the direct maximization of the exact Integrated Classification Likelihood with respect to the partition, it performs clustering and selection of the number of groups jointly. This combinatorial problem is handled through an efficient hybrid genetic algorithm, while a final hierarchical step gives access to coarser partitions and extracts an ordering of the clusters. This methodology is applicable to a wide variety of latent variable models and, hence, can handle various data types as well as heterogeneous data. Classical models for continuous, count, categorical and graph data are implemented, and new models may be incorporated thanks to S4 class abstraction. This paper introduces the package and the design choices that guided its development, and illustrates its usage on practical use cases.
Submitted 29 April, 2022;
originally announced April 2022.
-
A Bayesian Fisher-EM algorithm for discriminative Gaussian subspace clustering
Authors:
Nicolas Jouvin,
Charles Bouveyron,
Pierre Latouche
Abstract:
High-dimensional data clustering remains a challenging task for modern statistics and machine learning, with a wide range of applications. In this work, we consider the powerful discriminative latent mixture model and extend it to the Bayesian framework. The data are modeled as a mixture of Gaussians in a low-dimensional discriminative subspace, a Gaussian prior distribution is introduced over the latent group means, and a family of twelve submodels is derived by considering different covariance structures. Model inference is done with a variational EM algorithm, while the discriminative subspace is estimated via a Fisher step maximizing an unsupervised Fisher criterion. An empirical Bayes procedure is proposed for the estimation of the prior hyper-parameters, and an integrated classification likelihood criterion is derived for selecting both the number of clusters and the submodel. The performance of the resulting Bayesian Fisher-EM algorithm is investigated in two thorough simulated scenarios, covering both dimensionality and noise, which assess its superiority with respect to state-of-the-art Gaussian subspace clustering models. In addition to standard real data benchmarks, an application to single image denoising is proposed, displaying relevant results. This work comes with a reference implementation for the R software in the FisherEM package accompanying the paper.
Submitted 8 December, 2020;
originally announced December 2020.
-
Hierarchical clustering with discrete latent variable models and the integrated classification likelihood
Authors:
Etienne Côme,
Nicolas Jouvin,
Pierre Latouche,
Charles Bouveyron
Abstract:
Finding a set of nested partitions of a dataset is useful to uncover relevant structure at different scales, and is often handled with a data-dependent methodology. In this paper, we introduce a general two-step methodology for model-based hierarchical clustering. Considering the integrated classification likelihood criterion as an objective function, this work applies to every discrete latent variable model (DLVM) where this quantity is tractable. The first step of the methodology involves maximizing the criterion with respect to the partition. Addressing the known problem of sub-optimal local maxima found by greedy hill climbing heuristics, we introduce a new hybrid algorithm based on a genetic algorithm that efficiently explores the space of solutions. The resulting algorithm carefully combines and merges different solutions, and allows the joint inference of the number $K$ of clusters as well as the clusters themselves. Starting from this natural partition, the second step of the methodology is based on a bottom-up greedy procedure to extract a hierarchy of clusters. In a Bayesian context, this is achieved by considering the Dirichlet cluster proportion prior parameter $\alpha$ as a regularization term controlling the granularity of the clustering. A new approximation of the criterion is derived as a log-linear function of $\alpha$, enabling a simple functional form of the merge decision criterion. This second step allows the exploration of the clustering at coarser scales. The proposed approach is compared with existing strategies on simulated as well as real settings, and its results are shown to be particularly relevant. A reference implementation of this work is available in the R package greed accompanying the paper.
Submitted 21 April, 2021; v1 submitted 26 February, 2020;
originally announced February 2020.
-
Greedy clustering of count data through a mixture of multinomial PCA
Authors:
Nicolas Jouvin,
Pierre Latouche,
Charles Bouveyron,
Guillaume Bataillon,
Alain Livartowski
Abstract:
Count data is becoming increasingly ubiquitous in a wide range of applications, with datasets growing both in size and in dimension. In this context, an increasing amount of work is dedicated to the construction of statistical models directly accounting for the discrete nature of the data. Moreover, it has been shown that integrating dimension reduction into clustering can drastically improve performance and stability. In this paper, we rely on the mixture of multinomial PCA, a mixture model for the clustering of count data, also known as the probabilistic clustering-projection model in the literature. Related to the latent Dirichlet allocation model, it offers the flexibility of topic modeling while being able to assign each observation to a unique cluster. We introduce a greedy clustering algorithm, where inference and clustering are done jointly by combining a classification variational expectation maximization algorithm with a branch-and-bound-like strategy on a variational lower bound. An integrated classification likelihood criterion is derived for model selection, and a thorough study with numerical experiments is proposed to assess both the performance and robustness of the method. Finally, we illustrate the qualitative interest of the method in a real-world application, the clustering of anatomopathological medical reports, in partnership with expert practitioners from the Institut Curie hospital.
Submitted 10 July, 2020; v1 submitted 2 September, 2019;
originally announced September 2019.