Showing 1–32 of 32 results for author: Hennig, C

  1. arXiv:2404.13589  [pdf, other]

    stat.ME

    The quantile-based classifier with variable-wise parameters

    Authors: Marco Berrettini, Christian Hennig, Cinzia Viroli

    Abstract: Quantile-based classifiers can classify high-dimensional observations by minimising a discrepancy of an observation to a class based on suitable quantiles of the within-class distributions, corresponding to a unique percentage for all variables. The present work extends these classifiers by introducing a way to determine potentially different optimal percentages for different variables. Furthermor…

    Submitted 21 April, 2024; originally announced April 2024.

  2. arXiv:2401.12126  [pdf, other]

    q-bio.PE stat.AP stat.ME

    Approaches to biological species delimitation based on genetic and spatial dissimilarity

    Authors: Gabriele d'Angella, Christian Hennig

    Abstract: The delimitation of biological species, i.e., deciding which individuals belong to the same species and whether and how many different species are represented in a data set, is key to the conservation of biodiversity. Much existing work uses only genetic data for species delimitation, often employing some kind of cluster analysis. This can be misleading, because geographically distant groups of in…

    Submitted 3 June, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

    Comments: Paper of 26 pages with 6 figures; appendix of 19 pages with 17 figures. February 2024 update: tiny notation edit, results unchanged. April 2024 update: additional simulation results and plots; introduction and description of the methodologies edited; broader appendix with new charts. June 2024 update: Minor edits in methods description

  3. arXiv:2311.06108  [pdf, other]

    math.ST stat.ML

    Nonparametric consistency for maximum likelihood estimation and clustering based on mixtures of elliptically-symmetric distributions

    Authors: Pietro Coretto, Christian Hennig

    Abstract: The consistency of the maximum likelihood estimator for mixtures of elliptically-symmetric distributions for estimating its population version is shown, where the underlying distribution $P$ is nonparametric and does not necessarily belong to the class of mixtures on which the estimator is based. In a situation where $P$ is a mixture of well enough separated but nonparametric distributions it is s…

    Submitted 26 April, 2024; v1 submitted 10 November, 2023; originally announced November 2023.

    MSC Class: 62H30; 62F35

  4. arXiv:2309.08468  [pdf, other]

    stat.ME stat.CO

    Choice of trimming proportion and number of clusters in robust clustering based on trimming

    Authors: Luis Angel García-Escudero, Christian Hennig, Agustín Mayo-Iscar, Gianluca Morelli, Marco Riani

    Abstract: So-called "classification trimmed likelihood curves" have been proposed as a useful heuristic tool to determine the number of clusters and trimming proportion in trimming-based robust clustering methods. However, these curves need a careful visual inspection, and this way of choosing parameters requires subjective decisions. This work is intended to provide theoretical background for the understa…

    Submitted 15 September, 2023; originally announced September 2023.

  5. arXiv:2308.14478  [pdf, ps, other]

    stat.ML cs.LG stat.ME

    Some issues in robust clustering

    Authors: Christian Hennig

    Abstract: Some key issues in robust clustering are discussed with focus on Gaussian mixture model based clustering, namely the formal definition of outliers, ambiguity between groups of outliers and clusters, the interaction between robust clustering and the estimation of the number of clusters, the essential dependence of (not only) robust clustering on tuning decisions, and shortcomings of existing measur…

    Submitted 28 August, 2023; originally announced August 2023.

    Comments: 11 pages, no figures

    MSC Class: 62H30

  6. arXiv:2304.13406  [pdf]

    stat.OT

    Onset of a conceptual outline map to get a hold on the jungle of cluster analysis

    Authors: Iven Van Mechelen, Christian Hennig, Henk A. L. Kiers

    Abstract: The domain of cluster analysis is a meeting point for a very rich multidisciplinary encounter, with cluster-analytic methods being studied and developed in discrete mathematics, numerical analysis, statistics, data analysis, data science, and computer science (including machine learning, data mining, and knowledge discovery), to name but a few. The other side of the coin, however, is that the doma…

    Submitted 10 April, 2024; v1 submitted 26 April, 2023; originally announced April 2023.

    Comments: 44 pages, 4 figures

    MSC Class: 62H30

  7. arXiv:2204.09793  [pdf, other]

    stat.AP

    Clustering of football players based on performance data and aggregated clustering validity indexes

    Authors: Serhat Akhanli, Christian Hennig

    Abstract: We analyse football (soccer) player performance data with mixed type variables from the 2014-15 season of eight European major leagues. We cluster these data based on a tailor-made dissimilarity measure. In order to decide between the many available clustering methods and to choose an appropriate number of clusters, we use the approach by Akhanli and Hennig (2020). This is based on several valid…

    Submitted 20 April, 2022; originally announced April 2022.

    Comments: 26 pages, 5 figures

    MSC Class: 62H30

  8. arXiv:2108.09243  [pdf, other]

    stat.ME

    A comparison of different clustering approaches for high-dimensional presence-absence data

    Authors: Gabriele d'Angella, Christian Hennig

    Abstract: Presence-absence data is defined by vectors or matrices of zeroes and ones, where the ones usually indicate a "presence" in a certain place. Presence-absence data occur for example when investigating geographical species distributions, genetic information, or the occurrence of certain terms in texts. There are many applications for clustering such data; one example is to find so-called biotic elem…

    Submitted 22 November, 2021; v1 submitted 20 August, 2021; originally announced August 2021.

    Comments: 22 pages, 6 Figures

    MSC Class: 62H30

  9. arXiv:2107.04946  [pdf, other]

    econ.EM stat.ME

    Inference for the proportional odds cumulative logit model with monotonicity constraints for ordinal predictors and ordinal response

    Authors: Javier Espinosa-Brito, Christian Hennig

    Abstract: The proportional odds cumulative logit model (POCLM) is a standard regression model for an ordinal response. Ordinality of predictors can be incorporated by monotonicity constraints for the corresponding parameters. It is shown that estimators defined by optimization, such as maximum likelihood estimators, for an unconstrained model and for parameters in the interior set of the parameter space of…

    Submitted 1 June, 2023; v1 submitted 10 July, 2021; originally announced July 2021.

  10. arXiv:2103.01281  [pdf, ps, other]

    stat.ME

    Validation of cluster analysis results on validation data: A systematic framework

    Authors: Theresa Ullmann, Christian Hennig, Anne-Laure Boulesteix

    Abstract: Cluster analysis refers to a wide range of data analytic techniques for class discovery and is popular in many application fields. To judge the quality of a clustering result, different cluster validation procedures have been proposed in the literature. While there is extensive work on classical validation techniques, such as internal and external validation, less attention has been given to valid…

    Submitted 10 January, 2022; v1 submitted 1 March, 2021; originally announced March 2021.

    Comments: 32 pages, 1 figure

  11. arXiv:2102.03645  [pdf, other]

    stat.ME

    An empirical comparison and characterisation of nine popular clustering methods

    Authors: Christian Hennig

    Abstract: Nine popular clustering methods are applied to 42 real data sets. The aim is to give a detailed characterisation of the methods by means of several cluster validation indexes that measure various individual aspects of the resulting clusters such as small within-cluster distances, separation of clusters, closeness to a Gaussian distribution etc. as introduced in Hennig (2019). 30 of the data sets c…

    Submitted 6 February, 2021; originally announced February 2021.

    Comments: 44 pages, 9 Figures

    MSC Class: 62H30

  12. arXiv:2009.00921  [pdf, other]

    stat.ME

    An adequacy approach for deciding the number of clusters for OTRIMLE robust Gaussian mixture based clustering

    Authors: Christian Hennig, Pietro Coretto

    Abstract: We introduce a new approach to deciding the number of clusters. The approach is applied to Optimally Tuned Robust Improper Maximum Likelihood Estimation (OTRIMLE; Coretto and Hennig 2016) of a Gaussian mixture model allowing for observations to be classified as "noise", but it can be applied to other clustering methods as well. The quality of a clustering is assessed by a statistic $Q$ that measur…

    Submitted 25 December, 2020; v1 submitted 2 September, 2020; originally announced September 2020.

    Comments: 35 pages, 13 figures

    MSC Class: 62H30

  13. Probability Models in Statistical Data Analysis: Uses, Interpretations, Frequentism-As-Model

    Authors: Christian Hennig

    Abstract: Note: Published now as a chapter in "Handbook of the History and Philosophy of Mathematical Practice" (Springer Nature, editor B. Sriraman, https://doi.org/10.1007/978-3-030-19071-2_105-1). The application of mathematical probability theory in statistics is quite controversial. Controversies regard both the interpretation of probability, and approaches to statistical inference. After having give…

    Submitted 18 November, 2023; v1 submitted 11 July, 2020; originally announced July 2020.

    Comments: 55 pages, no figures. Accepted for publication as a chapter in "Handbook of the History and Philosophy of Mathematical Practice - Practical, Historical and Philosophical Instances of Probability" (Springer Nature, editor Egan Chernoff)

    MSC Class: 62A01

  14. arXiv:2002.01822  [pdf, other]

    stat.ME

    Comparing clusterings and numbers of clusters by aggregation of calibrated clustering validity indexes

    Authors: Serhat Emre Akhanli, Christian Hennig

    Abstract: A key issue in cluster analysis is the choice of an appropriate clustering method and the determination of the best number of clusters. Different clusterings are optimal on the same data set according to different criteria, and the choice of such criteria depends on the context and aim of clustering. Therefore, researchers need to consider what data analytic characteristics the clusters they are a…

    Submitted 23 June, 2020; v1 submitted 5 February, 2020; originally announced February 2020.

    Comments: 42 pages, 11 figures

    MSC Class: 62H30

  15. arXiv:1911.13272  [pdf, ps, other]

    stat.ME

    Minkowski distances and standardisation for clustering and classification of high dimensional data

    Authors: Christian Hennig

    Abstract: There are many distance-based methods for classification and clustering, and for data with a high number of dimensions and a lower number of observations, processing distances is computationally advantageous compared to the raw data matrix. Euclidean distances are used as a default for continuous multivariate data, but there are alternatives. Here the so-called Minkowski distances, $L_1$ (city blo…

    Submitted 23 June, 2020; v1 submitted 29 November, 2019; originally announced November 2019.

    Comments: Preliminary version; final version to be published by Springer, using Springer's svmult LaTeX style

    MSC Class: 62H30

  16. arXiv:1910.11339  [pdf, other]

    stat.ML cs.LG

    Clustering with the Average Silhouette Width

    Authors: Fatima Batool, Christian Hennig

    Abstract: The Average Silhouette Width (ASW; Rousseeuw (1987)) is a popular cluster validation index to estimate the number of clusters. Here we address the question whether it also is suitable as a general objective function to be optimized for finding a clustering. We will propose two algorithms (the standard version OSil and a fast version FOSil) and compare them with existing clustering methods in an ex…

    Submitted 21 November, 2020; v1 submitted 24 October, 2019; originally announced October 2019.

    Comments: 36 pages

    MSC Class: 62H30 ACM Class: I.5.3

  17. arXiv:1908.02218  [pdf, other]

    stat.ME

    Should we test the model assumptions before running a model-based test?

    Authors: M. Iqbal Shamsudheen, Christian Hennig

    Abstract: Statistical methods are based on model assumptions, and it is statistical folklore that a method's model assumptions should be checked before applying it. This can be formally done by running one or more misspecification tests of model assumptions before running a method that requires these assumptions; here we focus on model-based tests. A combined test procedure can be defined by specifying a pr…

    Submitted 17 April, 2023; v1 submitted 6 August, 2019; originally announced August 2019.

    Comments: 35 pages, 1 figure

    MSC Class: 62F03

  18. arXiv:1905.08876  [pdf, other]

    stat.OT

    Many perspectives on Deborah Mayo's "Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars"

    Authors: Andrew Gelman, Brian Haig, Christian Hennig, Art Owen, Robert Cousins, Stan Young, Christian Robert, Corey Yanofsky, E. J. Wagenmakers, Ron Kenett, Daniel Lakeland

    Abstract: The new book by philosopher Deborah Mayo is relevant to data science for topical reasons, as she takes various controversial positions regarding hypothesis testing and statistical practice, and also as an entry point to thinking about the philosophy of statistics. The present article is a slightly expanded version of a series of informal reviews and comments on Mayo's book. We hope this discussion…

    Submitted 29 May, 2019; v1 submitted 21 May, 2019; originally announced May 2019.

    Comments: 23 pages

  19. Benchmarking in cluster analysis: A white paper

    Authors: Iven Van Mechelen, Anne-Laure Boulesteix, Rainer Dangl, Nema Dean, Isabelle Guyon, Christian Hennig, Friedrich Leisch, Douglas Steinley

    Abstract: Note: A revised version of this is now published. Please cite and read (it's open access): Van Mechelen, I., Boulesteix, A.-L., Dangl, R., Dean, N., Hennig, C., Leisch, F., Steinley, D., Warrens, M. J. (2023). A white paper on good research practices in benchmarking: The case of cluster analysis. WIREs Data Mining and Knowledge Discovery, e1511. https://doi.org/10.1002/widm.1511 To achieve scien…

    Submitted 30 July, 2023; v1 submitted 27 September, 2018; originally announced September 2018.

    MSC Class: 62H30

    Journal ref: WIREs Data Mining and Knowledge Discovery, 2023, e1511

  20. arXiv:1806.10403  [pdf, ps, other]

    stat.ME

    Quantile-based clustering

    Authors: Christian Hennig, Cinzia Viroli, Laura Anderlucci

    Abstract: A new cluster analysis method, $K$-quantiles clustering, is introduced. $K$-quantiles clustering can be computed by a simple greedy algorithm in the style of the classical Lloyd's algorithm for $K$-means. It can be applied to large and high-dimensional datasets. It allows for within-cluster skewness and internal variable scaling based on within-cluster variation. Different versions allow for diffe…

    Submitted 8 November, 2019; v1 submitted 27 June, 2018; originally announced June 2018.

  21. arXiv:1804.08715  [pdf, other]

    stat.ME

    A constrained regression model for an ordinal response with ordinal predictors

    Authors: Javier Espinosa, Christian Hennig

    Abstract: A regression model is proposed for the analysis of an ordinal response variable depending on a set of multiple covariates containing ordinal and potentially other variables. The proportional odds model (McCullagh (1980)) is used for the ordinal response, and constrained maximum likelihood estimation is used to account for the ordinality of covariates. Ordinal predictors are coded by dummy variab…

    Submitted 23 April, 2018; originally announced April 2018.

    Comments: 33 pages, 7 figures, 1 appendix

    MSC Class: 62H12; 62J05; 62-07

  22. arXiv:1704.00959  [pdf, other]

    stat.AP

    Using clustering of rankings to explain brand preferences with personality and socio-demographic variables

    Authors: Daniel Müllensiefen, Christian Hennig, Hedie Howells

    Abstract: The primary aim of market segmentation is to identify relevant groups of consumers that can be addressed efficiently by marketing or advertising campaigns. This paper addresses the issue whether consumer groups can be identified from background variables that are not brand-related and how much personality vs. socio-demographic variables contribute to the identification of consumer clusters. This i…

    Submitted 4 April, 2017; originally announced April 2017.

    Comments: 26 pages, 12 figures

    MSC Class: 62H30; 91B08

  23. arXiv:1703.09282  [pdf, other]

    stat.ME

    Cluster validation by measurement of clustering characteristics relevant to the user

    Authors: Christian Hennig

    Abstract: There are many cluster analysis methods that can produce quite different clusterings on the same dataset. Cluster validation is about the evaluation of the quality of a clustering; "relative cluster validation" is about using such criteria to compare clusterings. This can be used to select one of a set of clusterings from different methods, or from the same method run with different parameters suc…

    Submitted 8 September, 2020; v1 submitted 27 March, 2017; originally announced March 2017.

    Comments: 20 pages, 2 figures

    MSC Class: 62H30

  24. arXiv:1604.02668  [pdf, other]

    stat.ME stat.AP stat.ML

    Distance for Functional Data Clustering Based on Smoothing Parameter Commutation

    Authors: ShengLi Tzeng, Christian Hennig, Yu-Fen Li, Chien-Ju Lin

    Abstract: We propose a novel method to determine the dissimilarity between subjects for functional data clustering. Spline smoothing or interpolation is common to deal with data of such type. Instead of estimating the best-representing curve for each subject as fixed during clustering, we measure the dissimilarity between subjects based on varying curve estimates with commutation of smoothing parameters pai…

    Submitted 10 April, 2016; originally announced April 2016.

    Journal ref: Statistical Methods in Medical Research, 27 (2018)

  25. Recovering the number of clusters in data sets with noise features using feature rescaling factors

    Authors: Renato Cordeiro de Amorim, Christian Hennig

    Abstract: In this paper we introduce three methods for re-scaling data sets aiming at improving the likelihood of clustering validity indexes to return the true number of spherical Gaussian clusters with additional noise features. Our method obtains feature re-scaling factors taking into account the structure of a given data set and the intuitive idea that different features may have different degrees of re…

    Submitted 22 February, 2016; originally announced February 2016.

    Journal ref: Information Sciences 324 (2015), 126-145

  26. arXiv:1508.05453  [pdf, ps, other]

    stat.OT stat.AP

    Beyond subjective and objective in statistics

    Authors: Andrew Gelman, Christian Hennig

    Abstract: We argue that the words "objectivity" and "subjectivity" in statistics discourse are used in a mostly unhelpful way, and we propose to replace each of them with broader collections of attributes, with objectivity replaced by transparency, consensus, impartiality, and correspondence to observable reality, and subjectivity replaced by awareness of multiple perspectives and context dependence. The ad…

    Submitted 21 August, 2015; originally announced August 2015.

    Comments: 35 pages

  27. arXiv:1503.02059  [pdf, ps, other]

    stat.ME

    Clustering strategy and method selection

    Authors: Christian Hennig

    Abstract: This paper is a chapter in the forthcoming Handbook of Cluster Analysis, Hennig et al. (2015). For definitions of basic clustering methods and some further methodology, other chapters of the Handbook are referred to. To read this version of the paper without the Handbook, some knowledge of cluster analysis methodology is required. The aim of this chapter is to provide a framework for all the dec…

    Submitted 6 March, 2015; originally announced March 2015.

  28. arXiv:1502.02574  [pdf, ps, other]

    stat.ME

    Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters

    Authors: Christian Hennig, Chien-Ju Lin

    Abstract: There are two notoriously hard problems in cluster analysis, estimating the number of clusters, and checking whether the population to be clustered is not actually homogeneous. Given a dataset, a clustering method and a cluster validation index, this paper proposes to set up null models that capture structural features of the data that cannot be interpreted as indicating clustering. Artificial dat…

    Submitted 9 February, 2015; originally announced February 2015.

    MSC Class: 62H30; 62F03; 62F40

  29. arXiv:1502.02555  [pdf, ps, other]

    stat.OT

    What are the true clusters?

    Authors: Christian Hennig

    Abstract: Constructivist philosophy and Hasok Chang's active scientific realism are used to argue that the idea of "truth" in cluster analysis depends on the context and the clustering aims. Different characteristics of clusterings are required in different situations. Researchers should be explicit about on what requirements and what idea of "true clusters" their research is based, because clustering becom…

    Submitted 9 February, 2015; originally announced February 2015.

    MSC Class: 03A05; 62H30; 91C20

  30. Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering

    Authors: Pietro Coretto, Christian Hennig

    Abstract: The two main topics of this paper are the introduction of the "optimally tuned improper maximum likelihood estimator" (OTRIMLE) for robust clustering based on the multivariate Gaussian model for clusters, and a comprehensive simulation study comparing the OTRIMLE to Maximum Likelihood in Gaussian mixtures with and without noise component, mixtures of t-distributions, and the TCLUST approach for tr…

    Submitted 28 January, 2017; v1 submitted 2 June, 2014; originally announced June 2014.

    MSC Class: 62H30; 62F35; 62P25

    Journal ref: Journal of the American Statistical Association 111(516), pp. 1648-1659 (2016)

  31. arXiv:1309.6895  [pdf, other]

    stat.ME

    Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering

    Authors: Pietro Coretto, Christian Hennig

    Abstract: The robust improper maximum likelihood estimator (RIMLE) is a new method for robust multivariate clustering finding approximately Gaussian clusters. It maximizes a pseudo-likelihood defined by adding a component with improper constant density for accommodating outliers to a Gaussian mixture. A special case of the RIMLE is MLE for multivariate finite Gaussian mixture models. In this paper we treat…

    Submitted 13 February, 2018; v1 submitted 26 September, 2013; originally announced September 2013.

    Comments: The title of this paper was originally: "A consistent and breakdown robust model-based clustering method"

    MSC Class: 62H30; 62F35

    Journal ref: 2017, Journal of Machine Learning Research, Vol. 18(142), pp. 1-39. Download link: http://jmlr.org/papers/v18/16-382.html

  32. arXiv:1303.1282  [pdf, ps, other]

    stat.ME

    Quantile-based classifiers

    Authors: Christian Hennig, Cinzia Viroli

    Abstract: Quantile classifiers for potentially high-dimensional data are defined by classifying an observation according to a sum of appropriately weighted component-wise distances of the components of the observation to the within-class quantiles. An optimal percentage for the quantiles can be chosen by minimizing the misclassification error in the training sample. It is shown that this is consistent, fo…

    Submitted 12 November, 2013; v1 submitted 6 March, 2013; originally announced March 2013.