Search | arXiv e-print repository

The quantile-based classifier with variable-wise parameters

Authors: Marco Berrettini, Christian Hennig, Cinzia Viroli

Abstract: Quantile-based classifiers can classify high-dimensional observations by minimising a discrepancy of an observation to a class based on suitable quantiles of the within-class distributions, corresponding to a unique percentage for all variables. The present work extends these classifiers by introducing a way to determine potentially different optimal percentages for different variables. Furthermore, a variable-wise scale parameter is introduced. A simple greedy algorithm to estimate the parameters is proposed. Their consistency in a nonparametric setting is proved. Experiments using artificially generated and real data confirm the potential of the quantile-based classifier with variable-wise parameters.

Submitted 21 April, 2024; originally announced April 2024.

arXiv:2401.12126 [pdf, other]

Approaches to biological species delimitation based on genetic and spatial dissimilarity

Authors: Gabriele d'Angella, Christian Hennig

Abstract: The delimitation of biological species, i.e., deciding which individuals belong to the same species and whether and how many different species are represented in a data set, is key to the conservation of biodiversity. Much existing work uses only genetic data for species delimitation, often employing some kind of cluster analysis. This can be misleading, because geographically distant groups of individuals can be genetically quite different even if they belong to the same species. We investigate the problem of testing whether two potentially separated groups of individuals can belong to a single species or not based on genetic and spatial data. Existing methods such as the partial Mantel test and jackknife-based distance-distance regression are considered. New approaches, i.e., an adaptation of a mixed effects model, a bootstrap approach, and a jackknife version of partial Mantel, are proposed. All these methods address the issue that distance data violate the independence assumption for standard inference regarding correlation and regression; a standard linear regression is also considered. The approaches are compared on simulated meta-populations generated with SLiM and GSpace - two software packages that can simulate spatially-explicit genetic data at an individual level. Simulations show that the new jackknife version of the partial Mantel test provides a good compromise between power and respecting the nominal type I error rate. Mixed-effects models have larger power than jackknife-based methods, but tend to display type I error rates slightly above the significance level. An application on brassy ringlets concludes the paper.

Submitted 3 June, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

Comments: Paper of 26 pages with 6 figures; appendix of 19 pages with 17 figures. February 2024 update: tiny notation edit, results unchanged. April 2024 update: additional simulation results and plots; introduction and description of the methodologies edited; broader appendix with new charts. June 2024 update: Minor edits in methods description

arXiv:2311.06108 [pdf, other]

Nonparametric consistency for maximum likelihood estimation and clustering based on mixtures of elliptically-symmetric distributions

Authors: Pietro Coretto, Christian Hennig

Abstract: The consistency of the maximum likelihood estimator for mixtures of elliptically-symmetric distributions for estimating its population version is shown, where the underlying distribution $P$ is nonparametric and does not necessarily belong to the class of mixtures on which the estimator is based. In a situation where $P$ is a mixture of well enough separated but nonparametric distributions it is shown that the components of the population version of the estimator correspond to the well separated components of $P$. This provides some theoretical justification for the use of such estimators for cluster analysis in case that $P$ has well separated subpopulations even if these subpopulations differ from what the mixture model assumes.

Submitted 26 April, 2024; v1 submitted 10 November, 2023; originally announced November 2023.

MSC Class: 62H30; 62F35

arXiv:2309.08468 [pdf, other]

Choice of trimming proportion and number of clusters in robust clustering based on trimming

Authors: Luis Angel García-Escudero, Christian Hennig, Agustín Mayo-Iscar, Gianluca Morelli, Marco Riani

Abstract: So-called "classification trimmed likelihood curves" have been proposed as a useful heuristic tool to determine the number of clusters and trimming proportion in trimming-based robust clustering methods. However, these curves need a careful visual inspection, and this way of choosing parameters requires subjective decisions. This work is intended to provide theoretical background for the understanding of these curves and the elements involved in their derivation. Moreover, a parametric bootstrap approach is presented in order to automatize the choice of parameter more by providing a reduced list of "sensible" choices for the parameters. The user can then pick a solution that fits their aims from that reduced list.

Submitted 15 September, 2023; originally announced September 2023.

arXiv:2308.14478 [pdf, ps, other]

Some issues in robust clustering

Authors: Christian Hennig

Abstract: Some key issues in robust clustering are discussed with focus on Gaussian mixture model based clustering, namely the formal definition of outliers, ambiguity between groups of outliers and clusters, the interaction between robust clustering and the estimation of the number of clusters, the essential dependence of (not only) robust clustering on tuning decisions, and shortcomings of existing measurements of cluster stability when it comes to outliers.

Submitted 28 August, 2023; originally announced August 2023.

Comments: 11 pages, no figures

MSC Class: 62H30

arXiv:2304.13406 [pdf]

Onset of a conceptual outline map to get a hold on the jungle of cluster analysis

Authors: Iven Van Mechelen, Christian Hennig, Henk A. L. Kiers

Abstract: The domain of cluster analysis is a meeting point for a very rich multidisciplinary encounter, with cluster-analytic methods being studied and developed in discrete mathematics, numerical analysis, statistics, data analysis, data science, and computer science (including machine learning, data mining, and knowledge discovery), to name but a few. The other side of the coin, however, is that the domain suffers from a major accessibility problem as well as from the fact that it is rife with division across many pretty isolated islands. As a way out, the present paper offers a thorough and in-depth review of the clustering domain as a whole under the form of an outline map based on an overarching conceptual framework and a common language. With this framework we wish to contribute to structuring the clustering domain, to characterizing methods that have often been developed and studied in quite different contexts, to identifying links between methods, and to introducing a frame of reference for optimally setting up cluster analyses in data-analytic practice.

Submitted 10 April, 2024; v1 submitted 26 April, 2023; originally announced April 2023.

Comments: 44 pages, 4 figures

MSC Class: 62H30

arXiv:2204.09793 [pdf, other]

Clustering of football players based on performance data and aggregated clustering validity indexes

Authors: Serhat Akhanli, Christian Hennig

Abstract: We analyse football (soccer) player performance data with mixed type variables from the 2014-15 season of eight European major leagues. We cluster these data based on a tailor-made dissimilarity measure. In order to decide between the many available clustering methods and to choose an appropriate number of clusters, we use the approach by Akhanli and Hennig (2020). This is based on several validation criteria that refer to different desirable characteristics of a clustering. These characteristics are chosen based on the aim of clustering, and this allows to define a suitable validation index as weighted average of calibrated individual indexes measuring the desirable features. We derive two different clusterings. The first one is a partition of the data set into major groups of essentially different players, which can be used for the analysis of a team's composition. The second one divides the data set into many small clusters (with 10 players on average), which can be used for finding players with a very similar profile to a given player. It is discussed in depth what characteristics are desirable for these clusterings. Weighting the criteria for the second clustering is informed by a survey of football experts.

Submitted 20 April, 2022; originally announced April 2022.

Comments: 26 pages, 5 figures

MSC Class: 62H30

arXiv:2108.09243 [pdf, other]

A comparison of different clustering approaches for high-dimensional presence-absence data

Authors: Gabriele d'Angella, Christian Hennig

Abstract: Presence-absence data is defined by vectors or matrices of zeroes and ones, where the ones usually indicate a "presence" in a certain place. Presence-absence data occur for example when investigating geographical species distributions, genetic information, or the occurrence of certain terms in texts. There are many applications for clustering such data; one example is to find so-called biotic elements, i.e., groups of species that tend to occur together geographically. Presence-absence data can be clustered in various ways, namely using a latent class mixture approach with local independence, distance-based hierarchical clustering with the Jaccard distance, or also using clustering methods for continuous data on a multidimensional scaling representation of the distances. These methods are conceptually very different and can therefore not easily be compared theoretically. We compare their performance with a comprehensive simulation study based on models for species distributions. This has been accepted for publication in Ferreira, J., Bekker, A., Arashi, M. and Chen, D. (eds.) Innovations in multivariate statistical modelling: navigating theoretical and multidisciplinary domains, Springer Emerging Topics in Statistics and Biostatistics.

Submitted 22 November, 2021; v1 submitted 20 August, 2021; originally announced August 2021.

Comments: 22 pages, 6 Figures

MSC Class: 62H30

arXiv:2107.04946 [pdf, other]

Inference for the proportional odds cumulative logit model with monotonicity constraints for ordinal predictors and ordinal response

Authors: Javier Espinosa-Brito, Christian Hennig

Abstract: The proportional odds cumulative logit model (POCLM) is a standard regression model for an ordinal response. Ordinality of predictors can be incorporated by monotonicity constraints for the corresponding parameters. It is shown that estimators defined by optimization, such as maximum likelihood estimators, for an unconstrained model and for parameters in the interior set of the parameter space of a constrained model are asymptotically equivalent. This is used in order to derive asymptotic confidence regions and tests for the constrained model, involving simple modifications for finite samples. The finite sample coverage probability of the confidence regions is investigated by simulation. Tests concern the effect of individual variables, monotonicity, and a specified monotonicity direction. The methodology is applied on real data related to the assessment of school performance.

Submitted 1 June, 2023; v1 submitted 10 July, 2021; originally announced July 2021.

arXiv:2103.01281 [pdf, ps, other]

Validation of cluster analysis results on validation data: A systematic framework

Authors: Theresa Ullmann, Christian Hennig, Anne-Laure Boulesteix

Abstract: Cluster analysis refers to a wide range of data analytic techniques for class discovery and is popular in many application fields. To judge the quality of a clustering result, different cluster validation procedures have been proposed in the literature. While there is extensive work on classical validation techniques, such as internal and external validation, less attention has been given to validating and replicating a clustering result using a validation dataset. Such a dataset may be part of the original dataset, which is separated before analysis begins, or it could be an independently collected dataset. We present a systematic structured framework for validating clustering results on validation data that includes most existing validation approaches. In particular, we review classical validation techniques such as internal and external validation, stability analysis, hypothesis testing, and visual validation, and show how they can be interpreted in terms of our framework. We precisely define and formalise different types of validation of clustering results on a validation dataset and explain how each type can be implemented in practice. Furthermore, we give examples of how clustering studies from the applied literature that used a validation dataset can be classified into the framework.

Submitted 10 January, 2022; v1 submitted 1 March, 2021; originally announced March 2021.

Comments: 32 pages, 1 figure

arXiv:2102.03645 [pdf, other]

An empirical comparison and characterisation of nine popular clustering methods

Authors: Christian Hennig

Abstract: Nine popular clustering methods are applied to 42 real data sets. The aim is to give a detailed characterisation of the methods by means of several cluster validation indexes that measure various individual aspects of the resulting clusters such as small within-cluster distances, separation of clusters, closeness to a Gaussian distribution etc. as introduced in Hennig (2019). 30 of the data sets come with a "true" clustering. On these data sets the similarity of the clusterings from the nine methods to the "true" clusterings is explored. Furthermore, a mixed effects regression relates the observable individual aspects of the clusters to the similarity with the "true" clusterings, which in real clustering problems is unobservable. The study gives new insight not only into the ability of the methods to discover "true" clusterings, but also into properties of clusterings that can be expected from the methods, which is crucial for the choice of a method in a real situation without a given "true" clustering.

Submitted 6 February, 2021; originally announced February 2021.

Comments: 44 pages, 9 Figures

MSC Class: 62H30

arXiv:2009.00921 [pdf, other]

An adequacy approach for deciding the number of clusters for OTRIMLE robust Gaussian mixture based clustering

Authors: Christian Hennig, Pietro Coretto

Abstract: We introduce a new approach to deciding the number of clusters. The approach is applied to Optimally Tuned Robust Improper Maximum Likelihood Estimation (OTRIMLE; Coretto and Hennig 2016) of a Gaussian mixture model allowing for observations to be classified as "noise", but it can be applied to other clustering methods as well. The quality of a clustering is assessed by a statistic $Q$ that measures how close the within-cluster distributions are to elliptical unimodal distributions that have the only mode in the mean. This nonparametric measure allows for non-Gaussian clusters as long as they have a good quality according to $Q$. The simplicity of a model is assessed by a measure $S$ that prefers a smaller number of clusters unless additional clusters can reduce the estimated noise proportion substantially. The simplest model is then chosen that is adequate for the data in the sense that its observed value of $Q$ is not significantly larger than what is expected for data truly generated from the fitted model, as can be assessed by parametric bootstrap. The approach is compared with model-based clustering using the Bayesian Information Criterion (BIC) and the Integrated Complete Likelihood (ICL) in a simulation study and on two datasets of scientific interest. Keywords: parametric bootstrap; noise component; unimodality; model-based clustering

Submitted 25 December, 2020; v1 submitted 2 September, 2020; originally announced September 2020.

Comments: 35 pages, 13 figures

MSC Class: 62H30

arXiv:2007.05748 [pdf, ps, other]

doi 10.1007/978-3-030-19071-2_105-1

Probability Models in Statistical Data Analysis: Uses, Interpretations, Frequentism-As-Model

Authors: Christian Hennig

Abstract: Note: Published now as a chapter in "Handbook of the History and Philosophy of Mathematical Practice" (Springer Nature, editor B. Sriraman, https://doi.org/10.1007/978-3-030-19071-2_105-1). The application of mathematical probability theory in statistics is quite controversial. Controversies regard both the interpretation of probability, and approaches to statistical inference. After having given an overview of the main approaches, I will propose a re-interpretation of frequentist probability. Most statisticians are aware that probability models interpreted in a frequentist manner are not really true in objective reality, but only idealisations. I argue that this is often ignored when actually applying frequentist methods and interpreting the results, and that keeping up the awareness for the essential difference between reality and models can lead to a more appropriate use and interpretation of frequentist models and methods, called "frequentism-as-model". This is elaborated showing connections to existing work, appreciating the special role of independently and identically distributed observations and subject matter knowledge, giving an account of how and under what conditions models that are not true can be useful, giving detailed interpretations of tests and confidence intervals, confronting their implicit compatibility logic with the inverse probability logic of Bayesian inference, re-interpreting the role of model assumptions, appreciating robustness, and the role of "interpretative equivalence" of models. Epistemic probability shares the issue that its models are only idealisations, and an analogous "epistemic-probability-as-model" can also be developed.

Submitted 18 November, 2023; v1 submitted 11 July, 2020; originally announced July 2020.

Comments: 55 pages, no figures. Accepted for publication as a chapter in "Handbook of the History and Philosophy of Mathematical Practice - Practical, Historical and Philosophical Instances of Probability" (Springer Nature, editor Egan Chernoff)

MSC Class: 62A01

arXiv:2002.01822 [pdf, other]

Comparing clusterings and numbers of clusters by aggregation of calibrated clustering validity indexes

Authors: Serhat Emre Akhanli, Christian Hennig

Abstract: A key issue in cluster analysis is the choice of an appropriate clustering method and the determination of the best number of clusters. Different clusterings are optimal on the same data set according to different criteria, and the choice of such criteria depends on the context and aim of clustering. Therefore, researchers need to consider what data analytic characteristics the clusters they are aiming at are supposed to have, among others within-cluster homogeneity, between-clusters separation, and stability. Here, a set of internal clustering validity indexes measuring different aspects of clustering quality is proposed, including some indexes from the literature. Users can choose the indexes that are relevant in the application at hand. In order to measure the overall quality of a clustering (for comparing clusterings from different methods and/or different numbers of clusters), the index values are calibrated for aggregation. Calibration is relative to a set of random clusterings on the same data. Two specific aggregated indexes are proposed and compared with existing indexes on simulated and real data.

Submitted 23 June, 2020; v1 submitted 5 February, 2020; originally announced February 2020.

Comments: 42 pages, 11 figures

MSC Class: 62H30

arXiv:1911.13272 [pdf, ps, other]

Minkowski distances and standardisation for clustering and classification of high dimensional data

Authors: Christian Hennig

Abstract: There are many distance-based methods for classification and clustering, and for data with a high number of dimensions and a lower number of observations, processing distances is computationally advantageous compared to the raw data matrix. Euclidean distances are used as a default for continuous multivariate data, but there are alternatives. Here the so-called Minkowski distances, $L_1$ (city block)-, $L_2$ (Euclidean)-, $L_3$-, $L_4$-, and maximum distances are combined with different schemes of standardisation of the variables before aggregating them. Boxplot transformation is proposed, a new transformation method for a single variable that standardises the majority of observations but brings outliers closer to the main bulk of the data. Distances are compared in simulations for clustering by partitioning around medoids, complete and average linkage, and classification by nearest neighbours, of data with a low number of observations but high dimensionality. The $L_1$-distance and the boxplot transformation show good results.

Submitted 23 June, 2020; v1 submitted 29 November, 2019; originally announced November 2019.

Comments: Preliminary version; final version to be published by Springer, using Springer's svmult LaTeX style

MSC Class: 62H30

arXiv:1910.11339 [pdf, other]

Clustering with the Average Silhouette Width

Authors: Fatima Batool, Christian Hennig

Abstract: The Average Silhouette Width (ASW; Rousseeuw (1987)) is a popular cluster validation index to estimate the number of clusters. Here we address the question whether it also is suitable as a general objective function to be optimized for finding a clustering. We will propose two algorithms (the standard version OSil and a fast version FOSil) and compare them with existing clustering methods in an extensive simulation study covering the cases of a known and unknown number of clusters. Real data sets are also analysed, partly exploring the use of the new methods with non-Euclidean distances. We will also show that the ASW satisfies some axioms that have been proposed for cluster quality functions (Ackerman and Ben-David (2009)). The new methods prove useful and sensible in many cases, but some weaknesses are also highlighted. These also concern the use of the ASW for estimating the number of clusters together with other methods, which is of general interest due to the popularity of the ASW for this task.

Submitted 21 November, 2020; v1 submitted 24 October, 2019; originally announced October 2019.

Comments: 36 pages

MSC Class: 62H30; ACM Class: I.5.3

arXiv:1908.02218 [pdf, other]

Should we test the model assumptions before running a model-based test?

Authors: M. Iqbal Shamsudheen, Christian Hennig

Abstract: Statistical methods are based on model assumptions, and it is statistical folklore that a method's model assumptions should be checked before applying it. This can be formally done by running one or more misspecification tests of model assumptions before running a method that requires these assumptions; here we focus on model-based tests. A combined test procedure can be defined by specifying a protocol in which first model assumptions are tested and then, conditionally on the outcome, a test is run that requires or does not require the tested assumptions. Although such an approach is often taken in practice, much of the literature that investigated this is surprisingly critical of it. Our aim is to explore conditions under which model checking is advisable or not advisable. For this, we review results regarding such "combined procedures" in the literature, we review and discuss controversial views on the role of model checking in statistics, and we present a general setup in which we can show that preliminary model checking is advantageous, which implies conditions for making model checking worthwhile.

Submitted 17 April, 2023; v1 submitted 6 August, 2019; originally announced August 2019.

Comments: 35 pages, 1 figure

MSC Class: 62F03

arXiv:1905.08876 [pdf, other]

Many perspectives on Deborah Mayo's "Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars"

Authors: Andrew Gelman, Brian Haig, Christian Hennig, Art Owen, Robert Cousins, Stan Young, Christian Robert, Corey Yanofsky, E. J. Wagenmakers, Ron Kenett, Daniel Lakeland

Abstract: The new book by philosopher Deborah Mayo is relevant to data science for topical reasons, as she takes various controversial positions regarding hypothesis testing and statistical practice, and also as an entry point to thinking about the philosophy of statistics. The present article is a slightly expanded version of a series of informal reviews and comments on Mayo's book. We hope this discussion will introduce people to Mayo's ideas along with other perspectives on the topics she addresses.

Submitted 29 May, 2019; v1 submitted 21 May, 2019; originally announced May 2019.

Comments: 23 pages

arXiv:1809.10496 [pdf, ps, other]

doi 10.1002/widm.1511

Benchmarking in cluster analysis: A white paper

Authors: Iven Van Mechelen, Anne-Laure Boulesteix, Rainer Dangl, Nema Dean, Isabelle Guyon, Christian Hennig, Friedrich Leisch, Douglas Steinley

Abstract: Note: A revised version of this is now published. Please cite and read (it's open access): Van Mechelen, I., Boulesteix, A.-L., Dangl, R., Dean, N., Hennig, C., Leisch, F., Steinley, D., Warrens, M. J. (2023). A white paper on good research practices in benchmarking: The case of cluster analysis. WIREs Data Mining and Knowledge Discovery, e1511. https://doi.org/10.1002/widm.1511 To achieve scientific progress in terms of building a cumulative body of knowledge, careful attention to benchmarking is of the utmost importance. This means that proposals of new methods of data pre-processing, new data-analytic techniques, and new methods of output post-processing, should be extensively and carefully compared with existing alternatives, and that existing methods should be subjected to neutral comparison studies. To date, benchmarking and recommendations for benchmarking have been frequently seen in the context of supervised learning. Unfortunately, there has been a dearth of guidelines for benchmarking in an unsupervised setting, with the area of clustering as an important subdomain. To address this problem, discussion is given to the theoretical conceptual underpinnings of benchmarking in the field of cluster analysis by means of simulated as well as empirical data. Subsequently, the practicalities of how to address benchmarking questions in clustering are dealt with, and foundational recommendations are made.

Submitted 30 July, 2023; v1 submitted 27 September, 2018; originally announced September 2018.

MSC Class: 62H30

Journal ref: WIREs Data Mining and Knowledge Discovery, 2023, e1511

arXiv:1806.10403 [pdf, ps, other]

Quantile-based clustering

Authors: Christian Hennig, Cinzia Viroli, Laura Anderlucci

Abstract: A new cluster analysis method, $K$-quantiles clustering, is introduced. $K$-quantiles clustering can be computed by a simple greedy algorithm in the style of the classical Lloyd's algorithm for $K$-means. It can be applied to large and high-dimensional datasets. It allows for within-cluster skewness and internal variable scaling based on within-cluster variation. Different versions allow for different levels of parsimony and computational efficiency. Although $K$-quantiles clustering is conceived as nonparametric, it can be connected to a fixed partition model of generalized asymmetric Laplace-distributions. The consistency of $K$-quantiles clustering is proved, and it is shown that $K$-quantiles clusters correspond to well separated mixture components in a nonparametric mixture. In a simulation, $K$-quantiles clustering is compared with a number of popular clustering methods with good results. A high-dimensional microarray dataset is clustered by $K$-quantiles.

Submitted 8 November, 2019; v1 submitted 27 June, 2018; originally announced June 2018.

arXiv:1804.08715 [pdf, other]

A constrained regression model for an ordinal response with ordinal predictors

Authors: Javier Espinosa, Christian Hennig

Abstract: A regression model is proposed for the analysis of an ordinal response variable depending on a set of multiple covariates containing ordinal and potentially other variables. The proportional odds model (McCullagh (1980)) is used for the ordinal response, and constrained maximum likelihood estimation is used to account for the ordinality of covariates. Ordinal predictors are coded by dummy variables. The parameters associated with the categories of the ordinal predictor(s) are constrained, enforcing them to be monotonic (isotonic or antitonic). A decision rule is introduced for classifying the ordinal predictors' monotonicity directions, also providing information on whether observations are compatible with both or no monotonicity direction. In addition, a monotonicity test for the parameters of any ordinal predictor is proposed. The monotonicity constrained model is proposed together with three estimation methods and compared to the unconstrained one based on simulations. The model is applied to real data explaining a 10-point Likert scale quality of life self-assessment variable from ordinal and other predictors.

Submitted 23 April, 2018; originally announced April 2018.

Comments: 33 pages, 7 figures, 1 appendix

MSC Class: 62H12; 62J05; 62-07

arXiv:1704.00959 [pdf, other]

Using clustering of rankings to explain brand preferences with personality and socio-demographic variables

Authors: Daniel Müllensiefen, Christian Hennig, Hedie Howells

Abstract: The primary aim of market segmentation is to identify relevant groups of consumers that can be addressed efficiently by marketing or advertising campaigns. This paper addresses the issue whether consumer groups can be identified from background variables that are not brand-related and how much personality vs. socio-demographic variables contribute to the identification of consumer clusters. This is done by clustering aggregated preferences for 25 brands across 5 different product categories, and by relating socio-demographic and personality variables to the clusters using logistic regression and random forests over a range of different numbers of clusters. Results indicate that some personality variables contribute significantly to the identification of consumer groups in one sample. However, these results were not replicated on a second sample that was more heterogeneous in terms of socio-demographic characteristics and not representative of the brands' target audience.

Submitted 4 April, 2017; originally announced April 2017.

Comments: 26 pages, 12 figures

MSC Class: 62H30; 91B08

arXiv:1703.09282 [pdf, other]

Cluster validation by measurement of clustering characteristics relevant to the user

Authors: Christian Hennig

Abstract: There are many cluster analysis methods that can produce quite different clusterings on the same dataset. Cluster validation is about the evaluation of the quality of a clustering; "relative cluster validation" is about using such criteria to compare clusterings. This can be used to select one of a set of clusterings from different methods, or from the same method run with different parameters such as different numbers of clusters. There are many cluster validation indexes in the literature. Most of them attempt to measure the overall quality of a clustering by a single number, but this can be inappropriate. There are various different characteristics of a clustering that can be relevant in practice, depending on the aim of clustering, such as low within-cluster distances and high between-cluster separation. In this paper, a number of validation criteria will be introduced that refer to different desirable characteristics of a clustering, and that characterise a clustering in a multidimensional way. In specific applications the user may be interested in some of these criteria rather than others. A focus of the paper is on methodology to standardise the different characteristics so that users can aggregate them in a suitable way specifying weights for the various criteria that are relevant in the clustering application at hand.

Submitted 8 September, 2020; v1 submitted 27 March, 2017; originally announced March 2017.

Comments: 20 pages, 2 figures

MSC Class: 62H30

arXiv:1604.02668 [pdf, other]

doi 10.1177/0962280217710050

Distance for Functional Data Clustering Based on Smoothing Parameter Commutation

Authors: ShengLi Tzeng, Christian Hennig, Yu-Fen Li, Chien-Ju Lin

Abstract: We propose a novel method to determine the dissimilarity between subjects for functional data clustering. Spline smoothing or interpolation is commonly used to deal with data of this type. Instead of estimating the best-representing curve for each subject as fixed during clustering, we measure the dissimilarity between subjects based on varying curve estimates with commutation of smoothing parameters pair-by-pair (of subjects). The intuitions are that smoothing parameters of smoothing splines reflect inverse signal-to-noise ratios and that, when an identical smoothing parameter is applied, the smoothed curves for two similar subjects are expected to be close. The effectiveness of our proposal is shown through simulations comparing it to other dissimilarity measures. It also has several pragmatic advantages. First, missing values or irregular time points can be handled directly, thanks to the nature of smoothing splines. Second, conventional clustering methods based on dissimilarity can be employed straightforwardly, and the dissimilarity also serves as a useful tool for outlier detection. Third, the implementation is easy since subroutines for smoothing splines and numerical integration are widely available. Fourth, the computational complexity does not increase and is on a par with that of calculating Euclidean distance between curves estimated by smoothing splines.

Submitted 10 April, 2016; originally announced April 2016.

Journal ref: Statistical Methods in Medical Research, 27 (2018)

arXiv:1602.06989 [pdf, ps, other]

doi 10.1016/j.ins.2015.06.039

Recovering the number of clusters in data sets with noise features using feature rescaling factors

Authors: Renato Cordeiro de Amorim, Christian Hennig

Abstract: In this paper we introduce three methods for re-scaling data sets aiming at improving the likelihood of clustering validity indexes to return the true number of spherical Gaussian clusters with additional noise features. Our method obtains feature re-scaling factors taking into account the structure of a given data set and the intuitive idea that different features may have different degrees of relevance at different clusters. We experiment with the Silhouette (using squared Euclidean, Manhattan, and the p$^{th}$ power of the Minkowski distance), Dunn's, Calinski-Harabasz and Hartigan indexes on data sets with spherical Gaussian clusters with and without noise features. We conclude that our methods indeed increase the chances of estimating the true number of clusters in a data set.

Submitted 22 February, 2016; originally announced February 2016.

Journal ref: Information Sciences 324 (2015), 126-145

arXiv:1508.05453 [pdf, ps, other]

Beyond subjective and objective in statistics

Authors: Andrew Gelman, Christian Hennig

Abstract: We argue that the words "objectivity" and "subjectivity" in statistics discourse are used in a mostly unhelpful way, and we propose to replace each of them with broader collections of attributes, with objectivity replaced by transparency, consensus, impartiality, and correspondence to observable reality, and subjectivity replaced by awareness of multiple perspectives and context dependence. The advantage of these reformulations is that the replacement terms do not oppose each other. Instead of debating over whether a given statistical method is subjective or objective (or normatively debating the relative merits of subjectivity and objectivity in statistical practice), we can recognize desirable attributes such as transparency and acknowledgment of multiple perspectives as complementary goals. We demonstrate the implications of our proposal with recent applied examples from pharmacology, election polling, and socioeconomic stratification.

Submitted 21 August, 2015; originally announced August 2015.

Comments: 35 pages

arXiv:1503.02059 [pdf, ps, other]

Clustering strategy and method selection

Authors: Christian Hennig

Abstract: This paper is a chapter in the forthcoming Handbook of Cluster Analysis, Hennig et al. (2015). For definitions of basic clustering methods and some further methodology, other chapters of the Handbook are referred to. To read this version of the paper without the Handbook, some knowledge of cluster analysis methodology is required. The aim of this chapter is to provide a framework for all the decisions that are required when carrying out a cluster analysis in practice. A general attitude to clustering is outlined, which connects these decisions closely to the clustering aims in a given application. From this point of view, the chapter then discusses aspects of data processing such as the choice of the representation of the objects to be clustered, dissimilarity design, transformation and standardization of variables. Regarding the choice of the clustering method, it is explored how different methods correspond to different clustering aims. Then an overview of benchmarking studies comparing different clustering methods is given, as well as an outline of theoretical approaches to characterize desiderata for clustering by axioms. Finally, aspects of cluster validation, i.e., the assessment of the quality of a clustering in a given dataset, are discussed, including finding an appropriate number of clusters, testing homogeneity, internal and external cluster validation, assessing clustering stability and data visualization.

Submitted 6 March, 2015; originally announced March 2015.

arXiv:1502.02574 [pdf, ps, other]

Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters

Authors: Christian Hennig, Chien-Ju Lin

Abstract: There are two notoriously hard problems in cluster analysis, estimating the number of clusters, and checking whether the population to be clustered is not actually homogeneous. Given a dataset, a clustering method and a cluster validation index, this paper proposes to set up null models that capture structural features of the data that cannot be interpreted as indicating clustering. Artificial datasets are sampled from the null model with parameters estimated from the original dataset. This can be used for testing the null hypothesis of a homogeneous population against a clustering alternative. It can also be used to calibrate the validation index for estimating the number of clusters, by taking into account the expected distribution of the index under the null model for any given number of clusters. The approach is illustrated by three examples, involving various different clustering techniques (partitioning around medoids, hierarchical methods, a Gaussian mixture model), validation indexes (average silhouette width, prediction strength and BIC), and issues such as mixed type data, temporal and spatial autocorrelation.

Submitted 9 February, 2015; originally announced February 2015.

MSC Class: 62H30; 62F03; 62F40

arXiv:1502.02555 [pdf, ps, other]

What are the true clusters?

Authors: Christian Hennig

Abstract: Constructivist philosophy and Hasok Chang's active scientific realism are used to argue that the idea of "truth" in cluster analysis depends on the context and the clustering aims. Different characteristics of clusterings are required in different situations. Researchers should be explicit about the requirements and the idea of "true clusters" on which their research is based, because clustering becomes scientific not through uniqueness but through transparent and open communication. The idea of "natural kinds" is a human construct, but it highlights the human experience that the reality outside the observer's control seems to make certain distinctions between categories inevitable. Various desirable characteristics of clusterings and various approaches to define a context-dependent truth are listed, and I discuss what impact these ideas can have on the comparison of clustering methods, and the choice of a clustering method and related decisions in practice.

Submitted 9 February, 2015; originally announced February 2015.

MSC Class: 03A05; 62H30; 91C20

arXiv:1406.0808 [pdf, other]

doi 10.1080/01621459.2015.1100996

Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering

Authors: Pietro Coretto, Christian Hennig

Abstract: The two main topics of this paper are the introduction of the "optimally tuned improper maximum likelihood estimator" (OTRIMLE) for robust clustering based on the multivariate Gaussian model for clusters, and a comprehensive simulation study comparing the OTRIMLE to Maximum Likelihood in Gaussian mixtures with and without noise component, mixtures of t-distributions, and the TCLUST approach for trimmed clustering. The OTRIMLE uses an improper constant density for modelling outliers and noise. This can be chosen optimally so that the non-noise part of the data looks as close to a Gaussian mixture as possible. Some deviation from Gaussianity can be traded in for lowering the estimated noise proportion. Covariance matrix constraints and computation of the OTRIMLE are also treated. In the simulation study, all methods are confronted with setups in which their model assumptions are not exactly fulfilled, and in order to evaluate the experiments in a standardized way by misclassification rates, a new model-based definition of "true clusters" is introduced that deviates from the usual identification of mixture components with clusters. In the study, every method turns out to be superior for one or more setups, but the OTRIMLE achieves the most satisfactory overall performance. The methods are also applied to two real datasets, one without and one with known "true" clusters.

Submitted 28 January, 2017; v1 submitted 2 June, 2014; originally announced June 2014.

MSC Class: 62H30; 62F35; 62P25

Journal ref: Journal of the American Statistical Association 111(516), pp. 1648–1659 (2016)

arXiv:1309.6895 [pdf, other]

Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering

Authors: Pietro Coretto, Christian Hennig

Abstract: The robust improper maximum likelihood estimator (RIMLE) is a new method for robust multivariate clustering finding approximately Gaussian clusters. It maximizes a pseudo-likelihood defined by adding a component with improper constant density for accommodating outliers to a Gaussian mixture. A special case of the RIMLE is MLE for multivariate finite Gaussian mixture models. In this paper we treat existence, consistency, and breakdown theory for the RIMLE comprehensively. RIMLE's existence is proved under non-smooth covariance matrix constraints. It is shown that these can be implemented via a computationally feasible Expectation-Conditional Maximization algorithm.

Submitted 13 February, 2018; v1 submitted 26 September, 2013; originally announced September 2013.

Comments: The title of this paper was originally: "A consistent and breakdown robust model-based clustering method"

MSC Class: 62H30; 62F35

Journal ref: 2017, Journal of Machine Learning Research, Vol. 18(142), pp. 1-39. Download link: http://jmlr.org/papers/v18/16-382.html

arXiv:1303.1282 [pdf, ps, other]

Quantile-based classifiers

Authors: Christian Hennig, Cinzia Viroli

Abstract: Quantile classifiers for potentially high-dimensional data are defined by classifying an observation according to a sum of appropriately weighted component-wise distances of the components of the observation to the within-class quantiles. An optimal percentage for the quantiles can be chosen by minimizing the misclassification error in the training sample. It is shown that this is consistent, for $n \to \infty$, for the classification rule with asymptotically optimal quantile, and that, under some assumptions, for $p\to\infty$ the probability of correct classification converges to one. The role of skewness of the involved variables is discussed, which leads to an improved classifier. The optimal quantile classifier performs very well in a comprehensive simulation study and a real data set from chemistry (classification of bioaerosols) compared to nine other classifiers, including the support vector machine and the recently proposed median-based classifier (Hall et al., 2009), which inspired the quantile classifier.

Submitted 12 November, 2013; v1 submitted 6 March, 2013; originally announced March 2013.
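
Illustration (not part of the arXiv record): the classification rule described in the abstract above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not spell out the weighting, so the sketch assumes the usual asymmetric "check loss" for quantiles, with weight $\theta$ above the quantile and $1-\theta$ below it. All function names (`empirical_quantile`, `check_loss`, `fit_quantiles`, `classify`, `training_error`, `choose_theta`) are hypothetical, and the skewness correction mentioned in the abstract is omitted.

```python
from math import floor


def empirical_quantile(values, theta):
    """Empirical theta-quantile, interpolating linearly between order statistics."""
    xs = sorted(values)
    pos = theta * (len(xs) - 1)
    lo = floor(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def check_loss(x, q, theta):
    """Assumed weighting: theta above the quantile q, (1 - theta) below it."""
    return theta * (x - q) if x > q else (1 - theta) * (q - x)


def fit_quantiles(X, y, theta):
    """Map each class label to its list of per-variable theta-quantiles."""
    p = len(X[0])
    return {
        k: [
            empirical_quantile([row[j] for row, lab in zip(X, y) if lab == k], theta)
            for j in range(p)
        ]
        for k in sorted(set(y))
    }


def classify(x, quantiles, theta):
    """Assign x to the class with the smallest summed component-wise distance."""
    def discrepancy(k):
        return sum(check_loss(xj, qj, theta) for xj, qj in zip(x, quantiles[k]))
    return min(quantiles, key=discrepancy)


def training_error(X, y, theta):
    """Misclassification rate on the training sample for a given theta."""
    q = fit_quantiles(X, y, theta)
    return sum(classify(row, q, theta) != lab for row, lab in zip(X, y)) / len(y)


def choose_theta(X, y, grid):
    """Pick the theta in grid with the lowest training error (first one on ties)."""
    return min(grid, key=lambda t: training_error(X, y, t))


# Toy data: two well-separated classes in two variables.
X = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0], [10.0, 11.0], [11.0, 12.0], [12.0, 13.0]]
y = ["a", "a", "a", "b", "b", "b"]
theta = choose_theta(X, y, [0.1, 0.5, 0.9])
label = classify([1.5, 2.5], fit_quantiles(X, y, theta), theta)
```

Selecting $\theta$ over a grid by training error mirrors the abstract's statement that the percentage is chosen by minimizing the misclassification error in the training sample; the grid values here are arbitrary.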

Showing 1–32 of 32 results for author: Hennig, C