-
The quantile-based classifier with variable-wise parameters
Authors:
Marco Berrettini,
Christian Hennig,
Cinzia Viroli
Abstract:
Quantile-based classifiers can classify high-dimensional observations by minimising a discrepancy of an observation to a class based on suitable quantiles of the within-class distributions, corresponding to a unique percentage for all variables. The present work extends these classifiers by introducing a way to determine potentially different optimal percentages for different variables. Furthermore, a variable-wise scale parameter is introduced. A simple greedy algorithm to estimate the parameters is proposed. Their consistency in a nonparametric setting is proved. Experiments using artificially generated and real data confirm the potential of the quantile-based classifier with variable-wise parameters.
Submitted 21 April, 2024;
originally announced April 2024.
-
Approaches to biological species delimitation based on genetic and spatial dissimilarity
Authors:
Gabriele d'Angella,
Christian Hennig
Abstract:
The delimitation of biological species, i.e., deciding which individuals belong to the same species and whether and how many different species are represented in a data set, is key to the conservation of biodiversity. Much existing work uses only genetic data for species delimitation, often employing some kind of cluster analysis. This can be misleading, because geographically distant groups of individuals can be genetically quite different even if they belong to the same species. We investigate the problem of testing whether two potentially separated groups of individuals can belong to a single species or not based on genetic and spatial data. Existing methods such as the partial Mantel test and jackknife-based distance-distance regression are considered. New approaches, i.e., an adaptation of a mixed effects model, a bootstrap approach, and a jackknife version of partial Mantel, are proposed. All these methods address the issue that distance data violate the independence assumption for standard inference regarding correlation and regression; a standard linear regression is also considered. The approaches are compared on simulated meta-populations generated with SLiM and GSpace - two software packages that can simulate spatially-explicit genetic data at an individual level. Simulations show that the new jackknife version of the partial Mantel test provides a good compromise between power and respecting the nominal type I error rate. Mixed-effects models have larger power than jackknife-based methods, but tend to display type I error rates slightly above the significance level. An application on brassy ringlets concludes the paper.
Submitted 3 June, 2024; v1 submitted 22 January, 2024;
originally announced January 2024.
-
Nonparametric consistency for maximum likelihood estimation and clustering based on mixtures of elliptically-symmetric distributions
Authors:
Pietro Coretto,
Christian Hennig
Abstract:
The consistency of the maximum likelihood estimator for mixtures of elliptically-symmetric distributions for estimating its population version is shown, where the underlying distribution $P$ is nonparametric and does not necessarily belong to the class of mixtures on which the estimator is based. In a situation where $P$ is a mixture of well enough separated but nonparametric distributions it is shown that the components of the population version of the estimator correspond to the well separated components of $P$. This provides some theoretical justification for the use of such estimators for cluster analysis in case that $P$ has well separated subpopulations even if these subpopulations differ from what the mixture model assumes.
Submitted 26 April, 2024; v1 submitted 10 November, 2023;
originally announced November 2023.
-
Choice of trimming proportion and number of clusters in robust clustering based on trimming
Authors:
Luis Angel García-Escudero,
Christian Hennig,
Agustín Mayo-Iscar,
Gianluca Morelli,
Marco Riani
Abstract:
So-called "classification trimmed likelihood curves" have been proposed as a useful heuristic tool to determine the number of clusters and trimming proportion in trimming-based robust clustering methods. However, these curves need careful visual inspection, and this way of choosing parameters requires subjective decisions. This work is intended to provide theoretical background for the understanding of these curves and the elements involved in their derivation. Moreover, a parametric bootstrap approach is presented in order to further automate the choice of parameters by providing a reduced list of "sensible" choices for the parameters. The user can then pick a solution that fits their aims from that reduced list.
Submitted 15 September, 2023;
originally announced September 2023.
-
Some issues in robust clustering
Authors:
Christian Hennig
Abstract:
Some key issues in robust clustering are discussed with focus on Gaussian mixture model based clustering, namely the formal definition of outliers, ambiguity between groups of outliers and clusters, the interaction between robust clustering and the estimation of the number of clusters, the essential dependence of (not only) robust clustering on tuning decisions, and shortcomings of existing measurements of cluster stability when it comes to outliers.
Submitted 28 August, 2023;
originally announced August 2023.
-
Onset of a conceptual outline map to get a hold on the jungle of cluster analysis
Authors:
Iven Van Mechelen,
Christian Hennig,
Henk A. L. Kiers
Abstract:
The domain of cluster analysis is a meeting point for a very rich multidisciplinary encounter, with cluster-analytic methods being studied and developed in discrete mathematics, numerical analysis, statistics, data analysis, data science, and computer science (including machine learning, data mining, and knowledge discovery), to name but a few. The other side of the coin, however, is that the domain suffers from a major accessibility problem as well as from the fact that it is rife with division across many pretty isolated islands. As a way out, the present paper offers a thorough and in-depth review of the clustering domain as a whole under the form of an outline map based on an overarching conceptual framework and a common language. With this framework we wish to contribute to structuring the clustering domain, to characterizing methods that have often been developed and studied in quite different contexts, to identifying links between methods, and to introducing a frame of reference for optimally setting up cluster analyses in data-analytic practice.
Submitted 11 July, 2024; v1 submitted 26 April, 2023;
originally announced April 2023.
-
Clustering of football players based on performance data and aggregated clustering validity indexes
Authors:
Serhat Akhanli,
Christian Hennig
Abstract:
We analyse football (soccer) player performance data with mixed type variables from the 2014-15 season of eight European major leagues. We cluster these data based on a tailor-made dissimilarity measure.
In order to decide between the many available clustering methods and to choose an appropriate number of clusters, we use the approach by Akhanli and Hennig (2020). This is based on several validation criteria that refer to different desirable characteristics of a clustering. These characteristics are chosen based on the aim of clustering, and this allows one to define a suitable validation index as a weighted average of calibrated individual indexes measuring the desirable features.
We derive two different clusterings. The first one is a partition of the data set into major groups of essentially different players, which can be used for the analysis of a team's composition. The second one divides the data set into many small clusters (with 10 players on average), which can be used for finding players with a very similar profile to a given player. It is discussed in depth what characteristics are desirable for these clusterings. Weighting the criteria for the second clustering is informed by a survey of football experts.
Submitted 20 April, 2022;
originally announced April 2022.
-
A comparison of different clustering approaches for high-dimensional presence-absence data
Authors:
Gabriele d'Angella,
Christian Hennig
Abstract:
Presence-absence data is defined by vectors or matrices of zeroes and ones, where the ones usually indicate a "presence" in a certain place. Presence-absence data occur for example when investigating geographical species distributions, genetic information, or the occurrence of certain terms in texts. There are many applications for clustering such data; one example is to find so-called biotic elements, i.e., groups of species that tend to occur together geographically. Presence-absence data can be clustered in various ways, namely using a latent class mixture approach with local independence, distance-based hierarchical clustering with the Jaccard distance, or also using clustering methods for continuous data on a multidimensional scaling representation of the distances. These methods are conceptually very different and can therefore not easily be compared theoretically. We compare their performance with a comprehensive simulation study based on models for species distributions.
This has been accepted for publication in Ferreira, J., Bekker, A., Arashi, M. and Chen, D. (eds.) Innovations in multivariate statistical modelling: navigating theoretical and multidisciplinary domains, Springer Emerging Topics in Statistics and Biostatistics.
Submitted 22 November, 2021; v1 submitted 20 August, 2021;
originally announced August 2021.
-
Parameters not empirically identifiable or distinguishable, including correlation between Gaussian observations
Authors:
Christian Hennig
Abstract:
Note: Accepted version, published in Statistical Papers, https://doi.org/10.1007/s00362-023-01414-3.
It is shown that some theoretically identifiable parameters cannot be empirically identified, meaning that no consistent estimator of them can exist. An important example is a constant correlation between Gaussian observations (in presence of such correlation not even the mean can be empirically identified). Empirical identifiability and three versions of empirical distinguishability are defined. Two different constant correlations between Gaussian observations cannot even be empirically distinguished. Further examples are cluster membership parameters in $k$-means clustering. Several existing results in the literature are connected to the new framework. General conditions are discussed under which independence can be distinguished from dependence.
Submitted 17 April, 2023; v1 submitted 20 August, 2021;
originally announced August 2021.
-
Inference for the proportional odds cumulative logit model with monotonicity constraints for ordinal predictors and ordinal response
Authors:
Javier Espinosa-Brito,
Christian Hennig
Abstract:
The proportional odds cumulative logit model (POCLM) is a standard regression model for an ordinal response. Ordinality of predictors can be incorporated by monotonicity constraints for the corresponding parameters. It is shown that estimators defined by optimization, such as maximum likelihood estimators, for an unconstrained model and for parameters in the interior set of the parameter space of a constrained model are asymptotically equivalent. This is used in order to derive asymptotic confidence regions and tests for the constrained model, involving simple modifications for finite samples. The finite sample coverage probability of the confidence regions is investigated by simulation. Tests concern the effect of individual variables, monotonicity, and a specified monotonicity direction. The methodology is applied on real data related to the assessment of school performance.
Submitted 1 June, 2023; v1 submitted 10 July, 2021;
originally announced July 2021.
-
Validation of cluster analysis results on validation data: A systematic framework
Authors:
Theresa Ullmann,
Christian Hennig,
Anne-Laure Boulesteix
Abstract:
Cluster analysis refers to a wide range of data analytic techniques for class discovery and is popular in many application fields. To judge the quality of a clustering result, different cluster validation procedures have been proposed in the literature. While there is extensive work on classical validation techniques, such as internal and external validation, less attention has been given to validating and replicating a clustering result using a validation dataset. Such a dataset may be part of the original dataset, which is separated before analysis begins, or it could be an independently collected dataset. We present a systematic structured framework for validating clustering results on validation data that includes most existing validation approaches. In particular, we review classical validation techniques such as internal and external validation, stability analysis, hypothesis testing, and visual validation, and show how they can be interpreted in terms of our framework. We precisely define and formalise different types of validation of clustering results on a validation dataset and explain how each type can be implemented in practice. Furthermore, we give examples of how clustering studies from the applied literature that used a validation dataset can be classified into the framework.
Submitted 10 January, 2022; v1 submitted 1 March, 2021;
originally announced March 2021.
-
An empirical comparison and characterisation of nine popular clustering methods
Authors:
Christian Hennig
Abstract:
Nine popular clustering methods are applied to 42 real data sets. The aim is to give a detailed characterisation of the methods by means of several cluster validation indexes that measure various individual aspects of the resulting clusters such as small within-cluster distances, separation of clusters, closeness to a Gaussian distribution etc. as introduced in Hennig (2019). 30 of the data sets come with a "true" clustering. On these data sets the similarity of the clusterings from the nine methods to the "true" clusterings is explored. Furthermore, a mixed effects regression relates the observable individual aspects of the clusters to the similarity with the "true" clusterings, which in real clustering problems is unobservable. The study gives new insight not only into the ability of the methods to discover "true" clusterings, but also into properties of clusterings that can be expected from the methods, which is crucial for the choice of a method in a real situation without a given "true" clustering.
Submitted 6 February, 2021;
originally announced February 2021.
-
The missing pieces of the PuO2 nanoparticle puzzle
Authors:
Evgeny Gerber,
Anna Yu Romanchuk,
Ivan Pidchenko,
Lucia Amidani,
Andre Rossberg,
Christoph Hennig,
Gavin B M Vaughan,
Alexander Trigub,
Tolganay Egorova,
Stephen Bauters,
Tatiana Plakhova,
Myrtille O J Y Hunault,
Stephan Weiss,
Sergei M Butorin,
Andreas C Scheinost,
Stepan N Kalmykov,
Kristina O Kvashnina
Abstract:
The nanoscience field often produces results more mystifying than any other discipline. It has been argued that changes in the plutonium dioxide (PuO2) particle size from bulk to nano can have a drastic effect on PuO2 properties. Here we report a full characterization of PuO2 nanoparticles (NPs) at the atomic level and probe their local and electronic structures by a variety of methods available at the synchrotron.
Submitted 15 October, 2020;
originally announced October 2020.
-
An adequacy approach for deciding the number of clusters for OTRIMLE robust Gaussian mixture based clustering
Authors:
Christian Hennig,
Pietro Coretto
Abstract:
We introduce a new approach to deciding the number of clusters. The approach is applied to Optimally Tuned Robust Improper Maximum Likelihood Estimation (OTRIMLE; Coretto and Hennig 2016) of a Gaussian mixture model allowing for observations to be classified as "noise", but it can be applied to other clustering methods as well. The quality of a clustering is assessed by a statistic $Q$ that measures how close the within-cluster distributions are to elliptical unimodal distributions that have the only mode in the mean. This nonparametric measure allows for non-Gaussian clusters as long as they have a good quality according to $Q$. The simplicity of a model is assessed by a measure $S$ that prefers a smaller number of clusters unless additional clusters can reduce the estimated noise proportion substantially. The simplest model is then chosen that is adequate for the data in the sense that its observed value of $Q$ is not significantly larger than what is expected for data truly generated from the fitted model, as can be assessed by parametric bootstrap. The approach is compared with model-based clustering using the Bayesian Information Criterion (BIC) and the Integrated Complete Likelihood (ICL) in a simulation study and on two datasets of scientific interest. Keywords: parametric bootstrap; noise component; unimodality; model-based clustering
Submitted 25 December, 2020; v1 submitted 2 September, 2020;
originally announced September 2020.
-
Probability Models in Statistical Data Analysis: Uses, Interpretations, Frequentism-As-Model
Authors:
Christian Hennig
Abstract:
Note: Published now as a chapter in "Handbook of the History and Philosophy of Mathematical Practice" (Springer Nature, editor B. Sriraman, https://doi.org/10.1007/978-3-030-19071-2_105-1).
The application of mathematical probability theory in statistics is quite controversial. Controversies regard both the interpretation of probability, and approaches to statistical inference. After having given an overview of the main approaches, I will propose a re-interpretation of frequentist probability. Most statisticians are aware that probability models interpreted in a frequentist manner are not really true in objective reality, but only idealisations. I argue that this is often ignored when actually applying frequentist methods and interpreting the results, and that keeping up the awareness for the essential difference between reality and models can lead to a more appropriate use and interpretation of frequentist models and methods, called "frequentism-as-model". This is elaborated showing connections to existing work, appreciating the special role of independently and identically distributed observations and subject matter knowledge, giving an account of how and under what conditions models that are not true can be useful, giving detailed interpretations of tests and confidence intervals, confronting their implicit compatibility logic with the inverse probability logic of Bayesian inference, re-interpreting the role of model assumptions, appreciating robustness, and the role of "interpretative equivalence" of models. Epistemic probability shares the issue that its models are only idealisations, and an analogous "epistemic-probability-as-model" can also be developed.
Submitted 18 November, 2023; v1 submitted 11 July, 2020;
originally announced July 2020.
-
Comparing clusterings and numbers of clusters by aggregation of calibrated clustering validity indexes
Authors:
Serhat Emre Akhanli,
Christian Hennig
Abstract:
A key issue in cluster analysis is the choice of an appropriate clustering method and the determination of the best number of clusters. Different clusterings are optimal on the same data set according to different criteria, and the choice of such criteria depends on the context and aim of clustering. Therefore, researchers need to consider what data analytic characteristics the clusters they are aiming at are supposed to have, among others within-cluster homogeneity, between-clusters separation, and stability. Here, a set of internal clustering validity indexes measuring different aspects of clustering quality is proposed, including some indexes from the literature. Users can choose the indexes that are relevant in the application at hand. In order to measure the overall quality of a clustering (for comparing clusterings from different methods and/or different numbers of clusters), the index values are calibrated for aggregation. Calibration is relative to a set of random clusterings on the same data. Two specific aggregated indexes are proposed and compared with existing indexes on simulated and real data.
Submitted 23 June, 2020; v1 submitted 5 February, 2020;
originally announced February 2020.
-
Minkowski distances and standardisation for clustering and classification of high dimensional data
Authors:
Christian Hennig
Abstract:
There are many distance-based methods for classification and clustering, and for data with a high number of dimensions and a lower number of observations, processing distances is computationally advantageous compared to the raw data matrix. Euclidean distances are used as a default for continuous multivariate data, but there are alternatives. Here the so-called Minkowski distances, $L_1$ (city block)-, $L_2$ (Euclidean)-, $L_3$-, $L_4$-, and maximum distances are combined with different schemes of standardisation of the variables before aggregating them. Boxplot transformation is proposed, a new transformation method for a single variable that standardises the majority of observations but brings outliers closer to the main bulk of the data. Distances are compared in simulations for clustering by partitioning around medoids, complete and average linkage, and classification by nearest neighbours, of data with a low number of observations but high dimensionality. The $L_1$-distance and the boxplot transformation show good results.
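As a rough illustration of the distance family named in this abstract, the following sketch computes $L_1$, $L_2$, $L_3$, $L_4$ and maximum distances between two vectors. The function name `minkowski` and the example vectors are hypothetical choices made here; the sketch applies no standardisation and does not implement the boxplot transformation proposed in the paper.

```python
def minkowski(x, y, p):
    """Minkowski (L_p) distance between two equal-length sequences.

    p = float("inf") gives the maximum distance. Illustrative helper only;
    variables are used as given, with no standardisation applied.
    """
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == float("inf"):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


# Hypothetical example vectors.
x = [0.0, 0.0, 0.0]
y = [3.0, 4.0, 0.0]

# Larger p puts more weight on the variable with the largest difference.
for p in (1, 2, 3, 4, float("inf")):
    print(p, minkowski(x, y, p))
```

In a high-dimensional setting one would typically standardise each variable before aggregating; which scheme to use is exactly the question the paper studies.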
Submitted 23 June, 2020; v1 submitted 29 November, 2019;
originally announced November 2019.
-
Clustering with the Average Silhouette Width
Authors:
Fatima Batool,
Christian Hennig
Abstract:
The Average Silhouette Width (ASW; Rousseeuw (1987)) is a popular cluster validation index to estimate the number of clusters. Here we address the question whether it also is suitable as a general objective function to be optimized for finding a clustering. We will propose two algorithms (the standard version OSil and a fast version FOSil) and compare them with existing clustering methods in an extensive simulation study covering the cases of a known and unknown number of clusters. Real data sets are also analysed, partly exploring the use of the new methods with non-Euclidean distances. We will also show that the ASW satisfies some axioms that have been proposed for cluster quality functions (Ackerman and Ben-David (2009)). The new methods prove useful and sensible in many cases, but some weaknesses are also highlighted. These also concern the use of the ASW for estimating the number of clusters together with other methods, which is of general interest due to the popularity of the ASW for this task.
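For readers unfamiliar with the index, the sketch below evaluates the Average Silhouette Width of a given clustering using the standard definition from Rousseeuw (1987): for each point, $a$ is the mean distance to the other members of its own cluster, $b$ is the smallest mean distance to any other cluster, and the silhouette width is $(b - a) / \max(a, b)$. The function name, the use of Euclidean distance, and the toy data are assumptions made here. This only evaluates the index; it is not the OSil or FOSil optimisation algorithm from the paper.

```python
def average_silhouette_width(points, labels):
    """Average Silhouette Width of a clustering (Euclidean distances assumed).

    Points in singleton clusters get silhouette width 0 by convention.
    """
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

    # Group point indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the ASW needs at least two clusters")

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return sum(widths) / len(widths)


# Hypothetical toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = average_silhouette_width(points, [0, 0, 1, 1])
bad = average_silhouette_width(points, [0, 1, 0, 1])
print(good, bad)
```

Using the ASW as an objective function, as the abstract describes, would mean searching over label assignments for the one that maximises this value.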
Submitted 21 November, 2020; v1 submitted 24 October, 2019;
originally announced October 2019.
-
Should we test the model assumptions before running a model-based test?
Authors:
M. Iqbal Shamsudheen,
Christian Hennig
Abstract:
Statistical methods are based on model assumptions, and it is statistical folklore that a method's model assumptions should be checked before applying it. This can be formally done by running one or more misspecification tests of model assumptions before running a method that requires these assumptions; here we focus on model-based tests. A combined test procedure can be defined by specifying a protocol in which first model assumptions are tested and then, conditionally on the outcome, a test is run that requires or does not require the tested assumptions. Although such an approach is often taken in practice, much of the literature that investigated this is surprisingly critical of it. Our aim is to explore conditions under which model checking is advisable or not advisable. For this, we review results regarding such "combined procedures" in the literature, we review and discuss controversial views on the role of model checking in statistics, and we present a general setup in which we can show that preliminary model checking is advantageous, which implies conditions for making model checking worthwhile.
Submitted 17 April, 2023; v1 submitted 6 August, 2019;
originally announced August 2019.
-
Many perspectives on Deborah Mayo's "Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars"
Authors:
Andrew Gelman,
Brian Haig,
Christian Hennig,
Art Owen,
Robert Cousins,
Stan Young,
Christian Robert,
Corey Yanofsky,
E. J. Wagenmakers,
Ron Kenett,
Daniel Lakeland
Abstract:
The new book by philosopher Deborah Mayo is relevant to data science for topical reasons, as she takes various controversial positions regarding hypothesis testing and statistical practice, and also as an entry point to thinking about the philosophy of statistics. The present article is a slightly expanded version of a series of informal reviews and comments on Mayo's book. We hope this discussion will introduce people to Mayo's ideas along with other perspectives on the topics she addresses.
Submitted 29 May, 2019; v1 submitted 21 May, 2019;
originally announced May 2019.
-
Benchmarking in cluster analysis: A white paper
Authors:
Iven Van Mechelen,
Anne-Laure Boulesteix,
Rainer Dangl,
Nema Dean,
Isabelle Guyon,
Christian Hennig,
Friedrich Leisch,
Douglas Steinley
Abstract:
Note: A revised version of this is now published. Please cite and read (it's open access): Van Mechelen, I., Boulesteix, A.-L., Dangl, R., Dean, N., Hennig, C., Leisch, F., Steinley, D., Warrens, M. J. (2023). A white paper on good research practices in benchmarking: The case of cluster analysis. WIREs Data Mining and Knowledge Discovery, e1511. https://doi.org/10.1002/widm.1511
To achieve scientific progress in terms of building a cumulative body of knowledge, careful attention to benchmarking is of the utmost importance. This means that proposals of new methods of data pre-processing, new data-analytic techniques, and new methods of output post-processing, should be extensively and carefully compared with existing alternatives, and that existing methods should be subjected to neutral comparison studies. To date, benchmarking and recommendations for benchmarking have been frequently seen in the context of supervised learning. Unfortunately, there has been a dearth of guidelines for benchmarking in an unsupervised setting, with the area of clustering as an important subdomain. To address this problem, discussion is given to the theoretical conceptual underpinnings of benchmarking in the field of cluster analysis by means of simulated as well as empirical data. Subsequently, the practicalities of how to address benchmarking questions in clustering are dealt with, and foundational recommendations are made.
Submitted 30 July, 2023; v1 submitted 27 September, 2018;
originally announced September 2018.
-
Quantile-based clustering
Authors:
Christian Hennig,
Cinzia Viroli,
Laura Anderlucci
Abstract:
A new cluster analysis method, $K$-quantiles clustering, is introduced. $K$-quantiles clustering can be computed by a simple greedy algorithm in the style of the classical Lloyd's algorithm for $K$-means. It can be applied to large and high-dimensional datasets. It allows for within-cluster skewness and internal variable scaling based on within-cluster variation. Different versions allow for different levels of parsimony and computational efficiency. Although $K$-quantiles clustering is conceived as nonparametric, it can be connected to a fixed partition model of generalized asymmetric Laplace-distributions. The consistency of $K$-quantiles clustering is proved, and it is shown that $K$-quantiles clusters correspond to well separated mixture components in a nonparametric mixture. In a simulation, $K$-quantiles clustering is compared with a number of popular clustering methods with good results. A high-dimensional microarray dataset is clustered by $K$-quantiles.
Submitted 8 November, 2019; v1 submitted 27 June, 2018;
originally announced June 2018.
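The Lloyd-style alternation mentioned in the abstract above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: it uses a single fixed quantile level `theta` for all variables, no variable scaling, and the quantile "check" loss as the point-to-cluster discrepancy (chosen here because it is the loss associated with the asymmetric Laplace distribution). The names `check_loss`, `empirical_quantile`, and `k_quantiles` are hypothetical.

```python
import math


def check_loss(u, theta):
    """Quantile (check) loss: theta * u if u >= 0, (theta - 1) * u otherwise."""
    return u * (theta - (1.0 if u < 0 else 0.0))


def empirical_quantile(values, theta):
    """Order statistic that minimizes the summed check loss at level theta."""
    s = sorted(values)
    k = min(len(s) - 1, max(0, math.ceil(theta * len(s)) - 1))
    return s[k]


def k_quantiles(data, k, theta=0.5, n_iter=50, init=None):
    """Alternate between assigning points and updating variable-wise quantiles.

    Assumed simplifications: one common theta, no scaling, and initial
    centers taken from `init` (or the first k points if `init` is None).
    """
    dims = len(data[0])
    centers = [list(c) for c in (init if init is not None else data[:k])]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest center in summed check loss.
        new_labels = [
            min(
                range(k),
                key=lambda c: sum(
                    check_loss(x[j] - centers[c][j], theta) for j in range(dims)
                ),
            )
            for x in data
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Update step: variable-wise theta-quantile of each non-empty cluster.
        for c in range(k):
            members = [x for x, lab in zip(data, labels) if lab == c]
            if members:
                centers[c] = [
                    empirical_quantile([m[j] for m in members], theta)
                    for j in range(dims)
                ]
    return labels, centers


if __name__ == "__main__":
    data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
    print(k_quantiles(data, 2, theta=0.5, init=[data[0], data[3]]))
```

With `theta=0.5` the update step reduces to variable-wise medians; other values of `theta` let the cluster representatives sit off-centre, which is one way to accommodate within-cluster skewness.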
-
A constrained regression model for an ordinal response with ordinal predictors
Authors:
Javier Espinosa,
Christian Hennig
Abstract:
A regression model is proposed for the analysis of an ordinal response variable depending on a set of multiple covariates containing ordinal and potentially other variables. The proportional odds model (McCullagh (1980)) is used for the ordinal response, and constrained maximum likelihood estimation is used to account for the ordinality of covariates.
Ordinal predictors are coded by dummy variables. The parameters associated with the categories of the ordinal predictor(s) are constrained, enforcing them to be monotonic (isotonic or antitonic). A decision rule is introduced for classifying the ordinal predictors' monotonicity directions, also providing information on whether observations are compatible with both or no monotonicity direction. In addition, a monotonicity test for the parameters of any ordinal predictor is proposed. The monotonicity constrained model is proposed together with three estimation methods and compared to the unconstrained one based on simulations.
The model is applied to real data explaining a 10-point Likert scale quality of life self-assessment variable from ordinal and other predictors.
Submitted 23 April, 2018;
originally announced April 2018.
-
Galaxies in X-ray Selected Clusters and Groups in Dark Energy Survey Data II: Hierarchical Bayesian Modeling of the Red-Sequence Galaxy Luminosity Function
Authors:
Y. Zhang,
C. J. Miller,
P. Rooney,
A. Bermeo,
A. K. Romer,
C. Vergara cervantes,
E. S. Rykoff,
C. Hennig,
R. Das,
T. Mckay,
J. Song,
H. Wilcox,
D. Bacon,
S. L. Bridle,
C. Collins,
C. Conselice,
M. Hilton,
B. Hoyle,
S. Kay,
A. R. Liddle,
R. G. Mann,
N. Mehrtens,
J. Mayers,
R. C. Nichol,
M. Sahlen
, et al. (55 additional authors not shown)
Abstract:
Using $\sim 100$ X-ray selected clusters in the Dark Energy Survey Science Verification data, we constrain the luminosity function (LF) of cluster red sequence galaxies as a function of redshift. This is the first homogeneous optical/X-ray sample large enough to constrain the evolution of the luminosity function simultaneously in redshift ($0.1<z<1.05$) and cluster mass ($13.5 \le \log_{10}(M_{200crit}) \lesssim 15.0$). We pay particular attention to completeness issues and the detection limit of the galaxy sample. We then apply a hierarchical Bayesian model to fit the cluster galaxy LFs via a Schechter function, including its characteristic break ($m^*$) to a faint end power-law slope ($α$). Our method enables us to avoid known issues in similar analyses based on stacking or binning the clusters. We find weak and statistically insignificant ($\sim 1.9 σ$) evolution in the faint end slope $α$ versus redshift. We also find no dependence of $α$ or $m^*$ on the X-ray inferred cluster masses. However, the amplitude of the LF as a function of cluster mass is constrained to $\sim 20\%$ precision. As a by-product of our algorithm, we utilize the correlation between the LF and cluster mass to provide an improved estimate of the individual cluster masses as well as the scatter in true mass given the X-ray inferred masses. This technique can be applied to a larger sample of X-ray or optically selected clusters from the Dark Energy Survey, significantly improving the sensitivity of the analysis.
Submitted 29 June, 2019; v1 submitted 16 October, 2017;
originally announced October 2017.
-
Using clustering of rankings to explain brand preferences with personality and socio-demographic variables
Authors:
Daniel Müllensiefen,
Christian Hennig,
Hedie Howells
Abstract:
The primary aim of market segmentation is to identify relevant groups of consumers that can be addressed efficiently by marketing or advertising campaigns. This paper addresses the issue whether consumer groups can be identified from background variables that are not brand-related and how much personality vs. socio-demographic variables contribute to the identification of consumer clusters. This is done by clustering aggregated preferences for 25 brands across 5 different product categories, and by relating socio-demographic and personality variables to the clusters using logistic regression and random forests over a range of different numbers of clusters. Results indicate that some personality variables contribute significantly to the identification of consumer groups in one sample. However, these results were not replicated on a second sample that was more heterogeneous in terms of socio-demographic characteristics and not representative of the brands' target audience.
Submitted 4 April, 2017;
originally announced April 2017.
-
Cluster validation by measurement of clustering characteristics relevant to the user
Authors:
Christian Hennig
Abstract:
There are many cluster analysis methods that can produce quite different clusterings on the same dataset. Cluster validation is about the evaluation of the quality of a clustering; "relative cluster validation" is about using such criteria to compare clusterings. This can be used to select one of a set of clusterings from different methods, or from the same method run with different parameters such as different numbers of clusters.
There are many cluster validation indexes in the literature. Most of them attempt to measure the overall quality of a clustering by a single number, but this can be inappropriate. There are various different characteristics of a clustering that can be relevant in practice, depending on the aim of clustering, such as low within-cluster distances and high between-cluster separation.
In this paper, a number of validation criteria will be introduced that refer to different desirable characteristics of a clustering, and that characterise a clustering in a multidimensional way. In specific applications the user may be interested in some of these criteria rather than others. A focus of the paper is on methodology to standardise the different characteristics so that users can aggregate them in a suitable way specifying weights for the various criteria that are relevant in the clustering application at hand.
Submitted 8 September, 2020; v1 submitted 27 March, 2017;
originally announced March 2017.
-
Distance for Functional Data Clustering Based on Smoothing Parameter Commutation
Authors:
ShengLi Tzeng,
Christian Hennig,
Yu-Fen Li,
Chien-Ju Lin
Abstract:
We propose a novel method to determine the dissimilarity between subjects for functional data clustering. Spline smoothing or interpolation is commonly used to deal with data of this type. Instead of estimating the best-representing curve for each subject as fixed during clustering, we measure the dissimilarity between subjects based on varying curve estimates with commutation of smoothing parameters pair-by-pair (of subjects). The intuitions are that smoothing parameters of smoothing splines reflect inverse signal-to-noise ratios and that, when an identical smoothing parameter is applied, the smoothed curves for two similar subjects are expected to be close. The effectiveness of our proposal is shown through simulations comparing it to other dissimilarity measures. It also has several pragmatic advantages. First, missing values or irregular time points can be handled directly, thanks to the nature of smoothing splines. Second, conventional clustering methods based on dissimilarities can be employed straightforwardly, and the dissimilarity also serves as a useful tool for outlier detection. Third, the implementation is easy, since subroutines for smoothing splines and numerical integration are widely available. Fourth, the computational complexity does not increase and is on par with that of calculating the Euclidean distance between curves estimated by smoothing splines.
Submitted 10 April, 2016;
originally announced April 2016.
-
Galaxy Populations in Massive Galaxy Clusters to z=1.1: Color Distribution, Concentration, Halo Occupation Number and Red Sequence Fraction
Authors:
C. Hennig,
J. J. Mohr,
A. Zenteno,
S. Desai,
J. P. Dietrich,
S. Bocquet,
V. Strazzullo,
A. Saro,
T. M. C. Abbott,
F. B. Abdalla,
M. Bayliss,
A. Benoit-Levy,
R. A. Bernstein,
E. Bertin,
D. Brooks,
R. Capasso,
D. Capozzi,
A. Carnero,
M. Carrasco Kind,
J. Carretero,
I. Chiu,
C. B. D'Andrea,
L. N. daCosta,
H. T. Diehl,
P. Doel
, et al. (48 additional authors not shown)
Abstract:
We study the galaxy populations in 74 Sunyaev Zeldovich Effect (SZE) selected clusters from the South Pole Telescope (SPT) survey that have been imaged in the science verification phase of the Dark Energy Survey (DES). The sample extends up to $z\sim 1.1$ with $4 \times 10^{14} M_{\odot}\le M_{200}\le 3\times 10^{15} M_{\odot}$. Using the band containing the 4000 Å break and its redward neighbor, we study the color-magnitude distributions of cluster galaxies to $\sim m_*+2$, finding: (1) the intrinsic rest frame $g-r$ color width of the red sequence (RS) population is $\sim$0.03 out to $z\sim0.85$ with a preference for an increase to $\sim0.07$ at $z=1$ and (2) the prominence of the RS declines beyond $z\sim0.6$. The spatial distribution of cluster galaxies is well described by the NFW profile out to $4R_{200}$ with a concentration of $c_{\mathrm{g}} = 3.59^{+0.20}_{-0.18}$, $5.37^{+0.27}_{-0.24}$ and $1.38^{+0.21}_{-0.19}$ for the full, the RS and the blue non-RS populations, respectively, but with $\sim40$% to 55% cluster to cluster variation and no statistically significant redshift or mass trends. The number of galaxies within the virial region $N_{200}$ exhibits a mass trend indicating that the number of galaxies per unit total mass is lower in the most massive clusters, and shows no significant redshift trend. The red sequence (RS) fraction within $R_{200}$ is $(68\pm3)$% at $z=0.46$, varies from $\sim$55% at $z=1$ to $\sim$80% at $z=0.1$, and exhibits intrinsic variation among clusters of $\sim14$%. We discuss a model that suggests the observed redshift trend in RS fraction favors a transformation timescale for infalling field galaxies to become RS galaxies of 2 to 3 Gyr.
Submitted 4 April, 2016;
originally announced April 2016.
-
Recovering the number of clusters in data sets with noise features using feature rescaling factors
Authors:
Renato Cordeiro de Amorim,
Christian Hennig
Abstract:
In this paper we introduce three methods for re-scaling data sets, aiming to increase the likelihood that clustering validity indexes return the true number of spherical Gaussian clusters in the presence of additional noise features. Our methods obtain feature re-scaling factors taking into account the structure of a given data set and the intuitive idea that different features may have different degrees of relevance at different clusters.
We experiment with the Silhouette (using squared Euclidean, Manhattan, and the p$^{th}$ power of the Minkowski distance), Dunn's, Calinski-Harabasz and Hartigan indexes on data sets with spherical Gaussian clusters with and without noise features. We conclude that our methods indeed increase the chances of estimating the true number of clusters in a data set.
Submitted 22 February, 2016;
originally announced February 2016.
-
Detection of Enhancement in Number Densities of Background Galaxies due to Magnification by Massive Galaxy Clusters
Authors:
I. Chiu,
J. P. Dietrich,
J. Mohr,
D. E. Applegate,
B. A. Benson,
L. E. Bleem,
M. B. Bayliss,
S. Bocquet,
J. E. Carlstrom,
R. Capasso,
S. Desai,
C. Gangkofner,
A. H. Gonzalez,
N. Gupta,
C. Hennig,
H. Hoekstra,
A. von der Linden,
J. Liu,
M. McDonald,
C. L. Reichardt,
A. Saro,
T. Schrabback,
V. Strazzullo,
C. W. Stubbs,
A. Zenteno
Abstract:
We present a detection of the enhancement in the number densities of background galaxies induced from lensing magnification and use it to test the Sunyaev-Zel'dovich effect (SZE) inferred masses in a sample of 19 galaxy clusters with median redshift $z\simeq0.42$ selected from the South Pole Telescope SPT-SZ survey. Two background galaxy populations are selected for this study through their photometric colours; they have median redshifts ${z}_{\mathrm{median}}\simeq0.9$ (low-$z$ background) and ${z}_{\mathrm{median}}\simeq1.8$ (high-$z$ background). Stacking these populations, we detect the magnification bias effect at $3.3σ$ and $1.3σ$ for the low- and high-$z$ backgrounds, respectively. We fit NFW models simultaneously to all observed magnification bias profiles to estimate the multiplicative factor $η$ that describes the ratio of the weak lensing mass to the mass inferred from the SZE observable-mass relation. We further quantify systematic uncertainties in $η$ resulting from the photometric noise and bias, the cluster galaxy contamination and the estimations of the background properties. The resulting $η$ for the combined background populations with $1σ$ uncertainties is $0.83\pm0.24\mathrm{(stat)}\pm0.074\mathrm{(sys)}$, indicating good consistency between the lensing and the SZE-inferred masses. We use our best-fit $η$ to predict the weak lensing shear profiles and compare these predictions with observations, showing agreement between the magnification and shear mass constraints. This work demonstrates the promise of using the magnification as a complementary method to estimate cluster masses in large surveys.
Submitted 9 February, 2016; v1 submitted 6 October, 2015;
originally announced October 2015.
-
Beyond subjective and objective in statistics
Authors:
Andrew Gelman,
Christian Hennig
Abstract:
We argue that the words "objectivity" and "subjectivity" in statistics discourse are used in a mostly unhelpful way, and we propose to replace each of them with broader collections of attributes, with objectivity replaced by transparency, consensus, impartiality, and correspondence to observable reality, and subjectivity replaced by awareness of multiple perspectives and context dependence. The advantage of these reformulations is that the replacement terms do not oppose each other. Instead of debating over whether a given statistical method is subjective or objective (or normatively debating the relative merits of subjectivity and objectivity in statistical practice), we can recognize desirable attributes such as transparency and acknowledgment of multiple perspectives as complementary goals. We demonstrate the implications of our proposal with recent applied examples from pharmacology, election polling, and socioeconomic stratification.
Submitted 21 August, 2015;
originally announced August 2015.
-
Constraints on the Richness-Mass Relation and the Optical-SZE Positional Offset Distribution for SZE-Selected Clusters
Authors:
A. Saro,
S. Bocquet,
E. Rozo,
B. A. Benson,
J. Mohr,
E. S. Rykoff,
M. Soares-Santos,
L. Bleem,
S. Dodelson,
P. Melchior,
F. Sobreira,
V. Upadhyay,
J. Weller,
T. Abbott,
F. B. Abdalla,
S. Allam,
R. Armstrong,
M. Banerji,
A. H. Bauer,
M. Bayliss,
A. Benoit-Levy,
G. M. Bernstein,
E. Bertin,
M. Brodwin,
D. Brooks
, et al. (77 additional authors not shown)
Abstract:
We cross-match galaxy cluster candidates selected via their Sunyaev-Zel'dovich effect (SZE) signatures in 129.1 deg$^2$ of the South Pole Telescope 2500d SPT-SZ survey with optically identified clusters selected from the Dark Energy Survey (DES) science verification data. We identify 25 clusters between $0.1\lesssim z\lesssim 0.8$ in the union of the SPT-SZ and redMaPPer (RM) samples. RM is an optical cluster finding algorithm that also returns a richness estimate for each cluster. We model the richness $λ$-mass relation with the following function $\langle\lnλ|M_{500}\rangle\propto B_λ\ln M_{500}+C_λ\ln E(z)$ and use SPT-SZ cluster masses and RM richnesses $λ$ to constrain the parameters. We find $B_λ= 1.14^{+0.21}_{-0.18}$ and $C_λ=0.73^{+0.77}_{-0.75}$. The associated scatter in mass at fixed richness is $σ_{\ln M|λ} = 0.18^{+0.08}_{-0.05}$ at a characteristic richness $λ=70$. We demonstrate that our model provides an adequate description of the matched sample, showing that the fraction of SPT-SZ selected clusters with RM counterparts is consistent with expectations and that the fraction of RM selected clusters with SPT-SZ counterparts is in mild tension with expectation. We model the optical-SZE cluster positional offset distribution with the sum of two Gaussians, showing that it is consistent with a dominant, centrally peaked population and a sub-dominant population characterized by larger offsets. We also cross-match the RM catalog with SPT-SZ candidates below the official catalog threshold significance $ξ=4.5$, using the RM catalog to provide optical confirmation and redshifts for additional low-$ξ$ SPT-SZ candidates. In this way, we identify 15 additional clusters with $ξ\in [4,4.5]$ over the redshift regime explored by RM in the overlapping region between DES science verification data and the SPT-SZ survey.
Submitted 25 June, 2015;
originally announced June 2015.
-
Galaxies in X-ray Selected Clusters and Groups in Dark Energy Survey Data I: Stellar Mass Growth of Bright Central Galaxies Since z~1.2
Authors:
Y. Zhang,
C. Miller,
T. Mckay,
P. Rooney,
A. E. Evrard,
A. K. Romer,
R. Perfecto,
J. Song,
S. Desai,
J. Mohr,
H. Wilcox,
A. Bermeo,
T. Jeltema,
D. Hollowood,
D. Bacon,
D. Capozzi,
C. Collins,
R. Das,
D. Gerdes,
C. Hennig,
M. Hilton,
B. Hoyle,
S. Kay,
A. Liddle,
R. G. Mann
, et al. (58 additional authors not shown)
Abstract:
Using the science verification data of the Dark Energy Survey (DES) for a new sample of 106 X-Ray selected clusters and groups, we study the stellar mass growth of Bright Central Galaxies (BCGs) since redshift 1.2. Compared with the expectation in a semi-analytical model applied to the Millennium Simulation, the observed BCGs become under-massive/under-luminous with decreasing redshift. We incorporate the uncertainties associated with cluster mass, redshift, and BCG stellar mass measurements into analysis of a redshift-dependent BCG-cluster mass relation, $m_{*}\propto(\frac{M_{200}}{1.5\times 10^{14}M_{\odot}})^{0.24\pm 0.08}(1+z)^{-0.19\pm0.34}$, and compare the observed relation to the model prediction. We estimate the average growth rate since $z = 1.0$ for BCGs hosted by clusters of $M_{200, z}=10^{13.8}M_{\odot}$, at $z=1.0$: $m_{*, BCG}$ appears to have grown by $0.13\pm0.11$ dex, in tension at $\sim 2.5 σ$ significance level with the $0.40$ dex growth rate expected from the semi-analytic model. We show that the buildup of extended intra-cluster light after $z=1.0$ may alleviate this tension in BCG growth rates.
Submitted 2 December, 2015; v1 submitted 12 April, 2015;
originally announced April 2015.
-
Clustering strategy and method selection
Authors:
Christian Hennig
Abstract:
This paper is a chapter in the forthcoming Handbook of Cluster Analysis, Hennig et al. (2015). For definitions of basic clustering methods and some further methodology, other chapters of the Handbook are referred to. To read this version of the paper without the Handbook, some knowledge of cluster analysis methodology is required.
The aim of this chapter is to provide a framework for all the decisions that are required when carrying out a cluster analysis in practice. A general attitude to clustering is outlined, which connects these decisions closely to the clustering aims in a given application. From this point of view, the chapter then discusses aspects of data processing such as the choice of the representation of the objects to be clustered, dissimilarity design, transformation and standardization of variables. Regarding the choice of the clustering method, it is explored how different methods correspond to different clustering aims. Then an overview of benchmarking studies comparing different clustering methods is given, as well as an outline of theoretical approaches to characterize desiderata for clustering by axioms. Finally, aspects of cluster validation, i.e., the assessment of the quality of a clustering in a given dataset, are discussed, including finding an appropriate number of clusters, testing homogeneity, internal and external cluster validation, assessing clustering stability and data visualization.
Submitted 6 March, 2015;
originally announced March 2015.
-
Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters
Authors:
Christian Hennig,
Chien-Ju Lin
Abstract:
There are two notoriously hard problems in cluster analysis: estimating the number of clusters, and checking whether the population to be clustered is not actually homogeneous. Given a dataset, a clustering method and a cluster validation index, this paper proposes to set up null models that capture structural features of the data that cannot be interpreted as indicating clustering. Artificial datasets are sampled from the null model with parameters estimated from the original dataset. This can be used for testing the null hypothesis of a homogeneous population against a clustering alternative. It can also be used to calibrate the validation index for estimating the number of clusters, by taking into account the expected distribution of the index under the null model for any given number of clusters. The approach is illustrated by three examples, involving various different clustering techniques (partitioning around medoids, hierarchical methods, a Gaussian mixture model), validation indexes (average silhouette width, prediction strength and BIC), and issues such as mixed type data, temporal and spatial autocorrelation.
Submitted 9 February, 2015;
originally announced February 2015.
-
What are the true clusters?
Authors:
Christian Hennig
Abstract:
Constructivist philosophy and Hasok Chang's active scientific realism are used to argue that the idea of "truth" in cluster analysis depends on the context and the clustering aims. Different characteristics of clusterings are required in different situations. Researchers should be explicit about the requirements and the idea of "true clusters" on which their research is based, because clustering becomes scientific not through uniqueness but through transparent and open communication. The idea of "natural kinds" is a human construct, but it highlights the human experience that the reality outside the observer's control seems to make certain distinctions between categories inevitable. Various desirable characteristics of clusterings and various approaches to define a context-dependent truth are listed, and I discuss what impact these ideas can have on the comparison of clustering methods, and the choice of a clustering method and related decisions in practice.
Submitted 9 February, 2015;
originally announced February 2015.
-
Baryon Content of Massive Galaxy Clusters (0.57 < z < 1.33)
Authors:
I. Chiu,
J. Mohr,
M. McDonald,
S. Bocquet,
M. L. Ashby,
M. Bayliss,
B. A. Benson,
L. E. Bleem,
M. Brodwin,
S. Desai,
J. P. Dietrich,
W. R. Forman,
C. Gangkofner,
A. H. Gonzalez,
C. Hennig,
J. Liu,
C. L. Reichardt,
A. Saro,
B. Stalder,
S. A. Stanford,
J. Song,
T. Schrabback,
R. Suhada,
V. Strazzullo,
A. Zenteno
Abstract:
We study the stellar, Brightest Cluster Galaxy (BCG) and intracluster medium (ICM) masses of 14 South Pole Telescope (SPT) selected galaxy clusters with median redshift $z=0.9$ and median mass $M_{500}=6\times10^{14}M_{\odot}$. We estimate stellar masses for each cluster and BCG using six photometric bands spanning the range from the ultraviolet to the near-infrared observed with the VLT, HST and Spitzer. The ICM masses are derived from Chandra and XMM-Newton X-ray observations, and the virial masses are derived from the SPT Sunyaev-Zel'dovich Effect signature.
At $z=0.9$ the BCG mass $M_{\star}^{\textrm{BCG}}$ constitutes $0.12\pm0.01$% of the halo mass for a $6\times10^{14}M_{\odot}$ cluster, and this fraction falls as $M_{500}^{-0.58\pm0.07}$. The cluster stellar mass function has a characteristic mass $M_{0}=10^{11.0\pm0.1}M_{\odot}$, and the number of galaxies per unit mass in clusters is larger than in the field by a factor $1.65\pm0.2$. Both results are consistent with measurements on group scales and at lower redshift. We combine our SPT sample with previously published samples at low redshift that we correct to a common initial mass function and for systematic differences in virial masses. We then explore mass and redshift trends in the stellar fraction (fstar), the ICM fraction (fICM), the cold baryon fraction (fc) and the baryon fraction (fb). At a pivot mass of $6\times10^{14}M_{\odot}$ and redshift $z=0.9$, the characteristic values are fstar=$1.1\pm0.1$%, fICM=$9.6\pm0.5$%, fc=$10.4\pm1.2$% and fb=$10.7\pm0.6$%. These fractions all vary with cluster mass at high significance, indicating that higher mass clusters have lower fstar and fc and higher fICM and fb. When accounting for a 15% systematic virial mass uncertainty, there is no statistically significant redshift trend at fixed mass in these baryon fractions.
(abridged)
Submitted 3 October, 2015; v1 submitted 25 December, 2014;
originally announced December 2014.
-
A Measurement of Gravitational Lensing of the Cosmic Microwave Background by Galaxy Clusters Using Data from the South Pole Telescope
Authors:
E. J. Baxter,
R. Keisler,
S. Dodelson,
K. A. Aird,
S. W. Allen,
M. L. N. Ashby,
M. Bautz,
M. Bayliss,
B. A. Benson,
L. E. Bleem,
S. Bocquet,
M. Brodwin,
J. E. Carlstrom,
C. L. Chang,
I. Chiu,
H-M. Cho,
A. Clocchiatti,
T. M. Crawford,
A. T. Crites,
S. Desai,
J. P. Dietrich,
T. de Haan,
M. A. Dobbs,
R. J. Foley,
W. R. Forman
, et al. (50 additional authors not shown)
Abstract:
Clusters of galaxies are expected to gravitationally lens the cosmic microwave background (CMB) and thereby generate a distinct signal in the CMB on arcminute scales. Measurements of this effect can be used to constrain the masses of galaxy clusters with CMB data alone. Here we present a measurement of lensing of the CMB by galaxy clusters using data from the South Pole Telescope (SPT). We develop a maximum likelihood approach to extract the CMB cluster lensing signal and validate the method on mock data. We quantify the effects on our analysis of several potential sources of systematic error and find that they generally act to reduce the best-fit cluster mass. It is estimated that this bias to lower cluster mass is roughly $0.85σ$ in units of the statistical error bar, although this estimate should be viewed as an upper limit. We apply our maximum likelihood technique to 513 clusters selected via their SZ signatures in SPT data, and rule out the null hypothesis of no lensing at $3.1σ$. The lensing-derived mass estimate for the full cluster sample is consistent with that inferred from the SZ flux: $M_{200,\mathrm{lens}} = 0.83_{-0.37}^{+0.38}\, M_{200,\mathrm{SZ}}$ (68% C.L., statistical error only).
Submitted 23 June, 2015; v1 submitted 23 December, 2014;
originally announced December 2014.
-
Galaxy Clusters Discovered via the Sunyaev-Zel'dovich Effect in the 2500-square-degree SPT-SZ survey
Authors:
L. E. Bleem,
B. Stalder,
T. de Haan,
K. A. Aird,
S. W. Allen,
D. E. Applegate,
M. L. N. Ashby,
M. Bautz,
M. Bayliss,
B. A. Benson,
S. Bocquet,
M. Brodwin,
J. E. Carlstrom,
C. L. Chang,
I. Chiu,
H. M. Cho,
A. Clocchiatti,
T. M. Crawford,
A. T. Crites,
S. Desai,
J. P. Dietrich,
M. A. Dobbs,
R. J. Foley,
W. R. Forman,
E. M. George
, et al. (49 additional authors not shown)
Abstract:
We present a catalog of galaxy clusters selected via their Sunyaev-Zel'dovich (SZ) effect signature from 2500 deg$^2$ of South Pole Telescope (SPT) data. This work represents the complete sample of clusters detected at high significance in the 2500-square-degree SPT-SZ survey, which was completed in 2011. A total of 677 (409) cluster candidates are identified above a signal-to-noise threshold of $ξ$ =4.5 (5.0). Ground- and space-based optical and near-infrared (NIR) imaging confirms overdensities of similarly colored galaxies in the direction of 516 (or 76%) of the $ξ$>4.5 candidates and 387 (or 95%) of the $ξ$>5 candidates; the measured purity is consistent with expectations from simulations. Of these confirmed clusters, 415 were first identified in SPT data, including 251 new discoveries reported in this work. We estimate photometric redshifts for all candidates with identified optical and/or NIR counterparts; we additionally report redshifts derived from spectroscopic observations for 141 of these systems. The mass threshold of the catalog is roughly independent of redshift above $z$~0.25 leading to a sample of massive clusters that extends to high redshift. The median mass of the sample is $M_{\scriptsize 500c}(ρ_\mathrm{crit})$ ~ 3.5 x 10$^{14} M_\odot h^{-1}$, the median redshift is $z_{med}$ =0.55, and the highest-redshift systems are at $z$>1.4. The combination of large redshift extent, clean selection, and high typical mass makes this cluster sample of particular interest for cosmological analyses and studies of cluster formation and evolution.
Submitted 13 February, 2015; v1 submitted 2 September, 2014;
originally announced September 2014.
-
Analysis of Sunyaev-Zel'dovich Effect Mass-Observable Relations using South Pole Telescope Observations of an X-ray Selected Sample of Low Mass Galaxy Clusters and Groups
Authors:
J. Liu,
J. Mohr,
A. Saro,
K. A. Aird,
M. L. N. Ashby,
M. Bautz,
M. Bayliss,
B. A. Benson,
L. E. Bleem,
S. Bocquet,
M. Brodwin,
J. E. Carlstrom,
C. L. Chang,
I. Chiu,
H. M. Cho,
A. Clocchiatti,
T. M. Crawford,
A. T. Crites,
T. de Haan,
S. Desai,
J. P. Dietrich,
M. A. Dobbs,
R. J. Foley,
D. Gangkofner,
E. M. George
, et al. (41 additional authors not shown)
Abstract:
(Abridged) We use 95, 150, and 220GHz observations from the SPT to examine the SZE signatures of a sample of 46 X-ray selected groups and clusters drawn from ~6 deg^2 of the XMM-BCS. These systems extend to redshift z=1.02, have characteristic masses ~3x lower than clusters detected directly in the SPT data and probe the SZE signal to the lowest X-ray luminosities (>10^42 erg s^-1) yet.
We develop an analysis tool that combines the SZE information for the full ensemble of X-ray-selected clusters. Using X-ray luminosity as a mass proxy, we extract selection-bias corrected constraints on the SZE significance- and Y_500-mass relations. The SZE significance-mass relation is in good agreement with an extrapolation of the relation obtained from high mass clusters. However, the fit to the Y_500-mass relation at low masses, while in good agreement with the extrapolation from high mass SPT clusters, is in tension at 2.8 sigma with the constraints from the Planck sample. We examine the tension with the Planck relation, discussing sample differences and biases that could contribute.
We also present an analysis of the radio galaxy point source population in this ensemble of X-ray selected systems. We find 18 of our systems have 843 MHz SUMSS sources within 2 arcmin of the X-ray centre, and three of these are also detected at significance >4 by SPT. Of these three, two are associated with the group brightest cluster galaxies, and the third is likely an unassociated quasar candidate. We examine the impact of these point sources on our SZE scaling relation analyses and find no evidence of biases. We also examine the impact of dusty galaxies using constraints from the 220 GHz data. The stacked sample provides 2.8$σ$ significant evidence of dusty galaxy flux, which would correspond to an average underestimate of the SPT Y_500 signal that is ($17\pm9$) per cent in this sample of low mass systems.
Submitted 29 May, 2015; v1 submitted 28 July, 2014;
originally announced July 2014.
-
Optical Confirmation and Redshift Estimation of the Planck Cluster Candidates overlapping the Pan-STARRS Survey
Authors:
J. Liu,
C. Hennig,
S. Desai,
B. Hoyle,
J. Koppenhoefer,
J. J. Mohr,
K. Paech,
W. S. Burgett,
K. C. Chambers,
S. Cole,
P. W. Draper,
N. Kaiser,
N. Metcalfe,
J. S. Morgan,
P. A. Price,
C. W. Stubbs,
J. L. Tonry,
R. J. Wainscoat,
C. Waters
Abstract:
We report results of a study of Planck Sunyaev-Zel'dovich effect (SZE) selected galaxy cluster candidates using the Panoramic Survey Telescope & Rapid Response System (Pan-STARRS) imaging data. We first examine 150 Planck confirmed galaxy clusters with spectroscopic redshifts to test our algorithm for identifying optical counterparts and measuring their redshifts; our redshifts have a typical accuracy of $σ_{z/(1+z)} \sim 0.022$ for this sample. Using 60 random sky locations, we estimate that our chance of contamination through a random superposition is ~ 3 per cent. We then examine an additional 237 Planck galaxy cluster candidates that have no redshift in the source catalogue. Of these 237 unconfirmed cluster candidates we are able to confirm 60 galaxy clusters and measure their redshifts. A further 83 candidates are so heavily contaminated by stars due to their location near the Galactic plane that we do not attempt to identify counterparts. For the remaining 94 candidates we find no optical counterpart but use the depth of the Pan-STARRS1 data to estimate a redshift lower limit $z_{\text{lim}(10^{15})}$ beyond which we would not have expected to detect enough galaxies for confirmation. Scaling from the already published Planck sample, we expect that $\sim$12 of these unconfirmed candidates may be real clusters.
Submitted 29 May, 2015; v1 submitted 22 July, 2014;
originally announced July 2014.
-
Mass Calibration and Cosmological Analysis of the SPT-SZ Galaxy Cluster Sample Using Velocity Dispersion $σ_v$ and X-ray $Y_\textrm{X}$ Measurements
Authors:
S. Bocquet,
A. Saro,
J. J. Mohr,
K. A. Aird,
M. L. N. Ashby,
M. Bautz,
M. Bayliss,
G. Bazin,
B. A. Benson,
L. E. Bleem,
M. Brodwin,
J. E. Carlstrom,
C. L. Chang,
I. Chiu,
H. M. Cho,
A. Clocchiatti,
T. M. Crawford,
A. T. Crites,
S. Desai,
T. de Haan,
J. P. Dietrich,
M. A. Dobbs,
R. J. Foley,
W. R. Forman,
D. Gangkofner
, et al. (46 additional authors not shown)
Abstract:
We present a velocity dispersion-based mass calibration of the South Pole Telescope Sunyaev-Zel'dovich effect survey (SPT-SZ) galaxy cluster sample. Using a homogeneously selected sample of 100 cluster candidates from 720 deg$^2$ of the survey along with 63 velocity dispersion ($σ_v$) and 16 X-ray Yx measurements of sample clusters, we simultaneously calibrate the mass-observable relation and constrain cosmological parameters. The calibrations using $σ_v$ and Yx are consistent at the $0.6σ$ level, with the $σ_v$ calibration preferring ~16% higher masses. We use the full cluster dataset to measure $σ_8(Ω_ m/0.27)^{0.3}=0.809\pm0.036$. The SPT cluster abundance is lower than preferred by either the WMAP9 or Planck+WMAP9 polarization (WP) data, but assuming the sum of the neutrino masses is $\sum m_ν=0.06$ eV, we find the datasets to be consistent at the 1.0$σ$ level for WMAP9 and 1.5$σ$ for Planck+WP. Allowing for larger $\sum m_ν$ further reconciles the results. When we combine the cluster and Planck+WP datasets with BAO and SNIa, the preferred cluster masses are $1.9σ$ higher than the Yx calibration and $0.8σ$ higher than the $σ_v$ calibration. Given the scale of these shifts (~44% and ~23% in mass, respectively), we execute a goodness of fit test; it reveals no tension, indicating that the best-fit model provides an adequate description of the data. Using the multi-probe dataset, we measure $Ω_ m=0.299\pm0.009$ and $σ_8=0.829\pm0.011$. Within a $ν$CDM model we find $\sum m_ν= 0.148\pm0.081$ eV. We present a consistency test of the cosmic growth rate. Allowing both the growth index $γ$ and the dark energy equation of state parameter $w$ to vary, we find $γ=0.73\pm0.28$ and $w=-1.007\pm0.065$, demonstrating that the expansion and the growth histories are consistent with a $Λ$CDM model ($γ=0.55; \,w=-1$).
Submitted 2 December, 2014; v1 submitted 10 July, 2014;
originally announced July 2014.
-
Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering
Authors:
Pietro Coretto,
Christian Hennig
Abstract:
The two main topics of this paper are the introduction of the "optimally tuned improper maximum likelihood estimator" (OTRIMLE) for robust clustering based on the multivariate Gaussian model for clusters, and a comprehensive simulation study comparing the OTRIMLE to Maximum Likelihood in Gaussian mixtures with and without noise component, mixtures of t-distributions, and the TCLUST approach for trimmed clustering. The OTRIMLE uses an improper constant density for modelling outliers and noise. This can be chosen optimally so that the non-noise part of the data looks as close to a Gaussian mixture as possible. Some deviation from Gaussianity can be traded in for lowering the estimated noise proportion. Covariance matrix constraints and computation of the OTRIMLE are also treated. In the simulation study, all methods are confronted with setups in which their model assumptions are not exactly fulfilled, and in order to evaluate the experiments in a standardized way by misclassification rates, a new model-based definition of "true clusters" is introduced that deviates from the usual identification of mixture components with clusters. In the study, every method turns out to be superior for one or more setups, but the OTRIMLE achieves the most satisfactory overall performance. The methods are also applied to two real datasets, one without and one with known "true" clusters.
Submitted 28 January, 2017; v1 submitted 2 June, 2014;
originally announced June 2014.
-
Constraints on the CMB Temperature Evolution using Multi-Band Measurements of the Sunyaev Zel'dovich Effect with the South Pole Telescope
Authors:
A. Saro,
J. Liu,
J. J. Mohr,
K. A. Aird,
M. L. N. Ashby,
M. Bayliss,
B. A. Benson,
L. E. Bleem,
S. Bocquet,
M. Brodwin,
J. E. Carlstrom,
C. L. Chang,
I. Chiu,
H. M. Cho,
A. Clocchiatti,
T. M. Crawford,
A. T. Crites,
T. de Haan,
S. Desai,
J. P. Dietrich,
M. A. Dobbs,
K. Dolag,
J. P. Dudley,
R. J. Foley,
D. Gangkofner
, et al. (46 additional authors not shown)
Abstract:
The adiabatic evolution of the temperature of the cosmic microwave background (CMB) is a key prediction of standard cosmology. We study deviations from the expected adiabatic evolution of the CMB temperature of the form $T(z) =T_0(1+z)^{1-α}$ using measurements of the spectrum of the Sunyaev Zel'dovich Effect with the South Pole Telescope (SPT). We present a method for using the ratio of the Sunyaev Zel'dovich signal measured at 95 and 150 GHz in the SPT data to constrain the temperature of the CMB. We demonstrate that this approach provides unbiased results using mock observations of clusters from a new set of hydrodynamical simulations. We apply this method to a sample of 158 SPT-selected clusters, spanning the redshift range $0.05 < z < 1.35$, and measure $α= 0.017^{+0.030}_{-0.028}$, consistent with the standard model prediction of $α=0$. In combination with other published results, we constrain $α= 0.011 \pm 0.016$, an improvement of $\sim 20\%$ over published constraints. This measurement also provides a strong constraint on the effective equation of state in models of decaying dark energy $w_\mathrm{eff} = -0.987^{+0.016}_{-0.017}$.
Submitted 9 December, 2013;
originally announced December 2013.
-
Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering
Authors:
Pietro Coretto,
Christian Hennig
Abstract:
The robust improper maximum likelihood estimator (RIMLE) is a new method for robust multivariate clustering finding approximately Gaussian clusters. It maximizes a pseudo-likelihood defined by adding a component with improper constant density for accommodating outliers to a Gaussian mixture. A special case of the RIMLE is MLE for multivariate finite Gaussian mixture models. In this paper we treat existence, consistency, and breakdown theory for the RIMLE comprehensively. RIMLE's existence is proved under non-smooth covariance matrix constraints. It is shown that these can be implemented via a computationally feasible Expectation-Conditional Maximization algorithm.
Submitted 13 February, 2018; v1 submitted 26 September, 2013;
originally announced September 2013.
-
Quantile-based classifiers
Authors:
Christian Hennig,
Cinzia Viroli
Abstract:
Quantile classifiers for potentially high-dimensional data are defined by classifying an observation according to a sum of appropriately weighted component-wise distances of the components of the observation to the within-class quantiles. An optimal percentage for the quantiles can be chosen by minimizing the misclassification error in the training sample.
It is shown that this is consistent, for $n \to \infty$, for the classification rule with asymptotically optimal quantile, and that, under some assumptions, for $p\to\infty$ the probability of correct classification converges to one. The role of skewness of the involved variables is discussed, which leads to an improved classifier.
The optimal quantile classifier performs very well in a comprehensive simulation study and a real data set from chemistry (classification of bioaerosols) compared to nine other classifiers, including the support vector machine and the recently proposed median-based classifier (Hall et al., 2009), which inspired the quantile classifier.
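The classification rule described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the asymmetric absolute ("check") loss used as the component-wise distance, the equal weighting of variables, the grid of candidate percentages, and all function names are assumptions made for illustration only.

```python
import numpy as np

def check_loss(u, theta):
    # Assumed component-wise distance: theta*u for u >= 0, (theta-1)*u for u < 0.
    return np.where(u >= 0, theta * u, (theta - 1.0) * u)

def fit_quantiles(X, y, theta):
    # Empirical theta-quantile of every variable within every class.
    classes = np.unique(y)
    q = np.stack([np.quantile(X[y == k], theta, axis=0) for k in classes])
    return classes, q

def predict(X, classes, q, theta):
    # dist[i, k] = sum over variables j of check_loss(X[i, j] - q[k, j]).
    dist = check_loss(X[:, None, :] - q[None, :, :], theta).sum(axis=2)
    return classes[np.argmin(dist, axis=1)]

def select_theta(X, y, grid=np.linspace(0.05, 0.95, 19)):
    # Choose the percentage with the lowest training misclassification rate.
    best_theta, best_err = None, np.inf
    for theta in grid:
        classes, q = fit_quantiles(X, y, theta)
        err = np.mean(predict(X, classes, q, theta) != y)
        if err < best_err:
            best_theta, best_err = theta, err
    return best_theta, best_err

# Toy data (hypothetical): two Gaussian classes in five dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 5)), rng.normal(4.0, 1.0, (100, 5))])
y = np.repeat([0, 1], 100)
theta_hat, train_err = select_theta(X, y)
```

The sketch uses a single percentage for all variables, as in this abstract; skewness corrections and variable weights mentioned there are deliberately left out.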
Submitted 12 November, 2013; v1 submitted 6 March, 2013;
originally announced March 2013.
-
Breakdown points for maximum likelihood estimators of location-scale mixtures
Authors:
Christian Hennig
Abstract:
ML-estimation based on mixtures of Normal distributions is a widely used tool for cluster analysis. However, a single outlier can make the parameter estimation of at least one of the mixture components break down. Among others, the estimation of mixtures of t-distributions by McLachlan and Peel [Finite Mixture Models (2000) Wiley, New York] and the addition of a further mixture component accounting for "noise" by Fraley and Raftery [The Computer J. 41 (1998) 578-588] were suggested as more robust alternatives. In this paper, the definition of an adequate robustness measure for cluster analysis is discussed and bounds for the breakdown points of the mentioned methods are given. It turns out that the two alternatives, while adding stability in the presence of outliers of moderate size, do not possess a substantially better breakdown behavior than estimation based on Normal mixtures. If the number of clusters s is treated as fixed, r additional points suffice for all three methods to let the parameters of r clusters explode. Only in the case of r=s is this not possible for t-mixtures. The ability to estimate the number of mixture components, for example, by use of the Bayesian information criterion of Schwarz [Ann. Statist. 6 (1978) 461-464], and to isolate gross outliers as clusters of one point, is crucial for an improved breakdown behavior of all three techniques. Furthermore, a mixture of Normals with an improper uniform distribution is proposed to achieve more robustness in the case of a fixed number of components.
Submitted 5 October, 2004;
originally announced October 2004.