Showing 1–16 of 16 results for author: Zamar, R H

  1. arXiv:2102.06851  [pdf, other]

    stat.ME math.ST stat.CO

    Robust Model-Based Clustering

    Authors: Juan D. Gonzalez, Ricardo Maronna, Victor J. Yohai, Ruben H. Zamar

    Abstract: We propose a new class of robust and Fisher-consistent estimators for mixture models. These estimators can be used to construct robust model-based clustering procedures. We study in detail the case of multivariate normal mixtures and propose a procedure that uses S estimators of multivariate location and scatter. We develop an algorithm to compute the estimators and to build the clusters which is…

    Submitted 8 June, 2021; v1 submitted 12 February, 2021; originally announced February 2021.

  2. arXiv:2010.00950  [pdf, other]

    stat.ML cs.LG stat.ME

    Regularized K-means through hard-thresholding

    Authors: Jakob Raymaekers, Ruben H. Zamar

    Abstract: We study a framework of regularized $K$-means methods based on direct penalization of the size of the cluster centers. Different penalization strategies are considered and compared through simulation and theoretical analysis. Based on the results, we propose HT $K$-means, which uses an $\ell_0$ penalty to induce sparsity in the variables. Different techniques for selecting the tuning parameter are…

    Submitted 2 October, 2020; originally announced October 2020.

  3. Pooled variable scaling for cluster analysis

    Authors: Jakob Raymaekers, Ruben H. Zamar

    Abstract: We propose a new approach for scaling prior to cluster analysis based on the concept of pooled variance. Unlike available scaling procedures such as the standard deviation and the range, our proposed scale avoids dampening the beneficial effect of informative clustering variables. We confirm through an extensive simulation study and applications to well-known real data examples that the proposed s…

    Submitted 25 July, 2020; v1 submitted 22 December, 2019; originally announced December 2019.

    Comments: 29 pages, 32 figures

  4. arXiv:1906.08198  [pdf, other]

    stat.CO

    Robust Clustering Using Tau-Scales

    Authors: Juan D. Gonzalez, Victor J. Yohai, Ruben H. Zamar

    Abstract: K means is a popular non-parametric clustering procedure introduced by Steinhaus (1956) and further developed by MacQueen (1967). It is known, however, that K means does not perform well in the presence of outliers. Cuesta-Albertos et al. (1997) introduced a robust alternative, trimmed K means, which can be tuned to be robust or efficient, but cannot achieve these two properties simultaneously in a…

    Submitted 19 June, 2019; originally announced June 2019.

    Comments: 40 pages, 9 figures

    MSC Class: 62G35; 62H30; 62H35

  5. arXiv:1808.06016  [pdf, other]

    stat.ME

    A Stepwise Approach for High-Dimensional Gaussian Graphical Models

    Authors: Ginette Lafit, Francisco J. Nogales, Marcelo Ruiz, Ruben H. Zamar

    Abstract: We present a stepwise approach to estimate high dimensional Gaussian graphical models. We exploit the relation between the partial correlation coefficients and the distribution of the prediction errors, and parametrize the model in terms of the Pearson correlation coefficients between the prediction errors of the nodes' best linear predictors. We propose a novel stepwise algorithm for detecting pa…

    Submitted 17 August, 2018; originally announced August 2018.

    Comments: 26 pages, 5 figures, 4 tables

  6. arXiv:1707.00727  [pdf, other]

    stat.ML

    Regression Phalanxes

    Authors: Hongyang Zhang, William J. Welch, Ruben H. Zamar

    Abstract: Tomal et al. (2015) introduced the notion of "phalanxes" in the context of rare-class detection in two-class classification problems. A phalanx is a subset of features that work well for classification tasks. In this paper, we propose a different class of phalanxes for application in regression settings. We define a "Regression Phalanx" - a subset of features that work well together for prediction…

    Submitted 3 July, 2017; originally announced July 2017.

  7. arXiv:1706.06971  [pdf, ps, other]

    stat.ML

    Ensembles of phalanxes across assessment metrics for robust ranking of homologous proteins

    Authors: Jabed H Tomal, William J Welch, Ruben H Zamar

    Abstract: Two proteins are homologous if they have a common evolutionary origin, and the binary classification problem is to identify proteins in a candidate set that are homologous to a particular native protein. The feature (explanatory) variables available for classification are various measures of similarity of proteins. There are multiple classification problems of this type for different native protei…

    Submitted 9 September, 2019; v1 submitted 21 June, 2017; originally announced June 2017.

    Comments: 29 pages, 4 figures, 8 tables and 2 algorithms

  8. arXiv:1609.00402  [pdf, other]

    math.ST

    Multivariate Location and Scatter Matrix Estimation Under Cellwise and Casewise Contamination

    Authors: Andy Leung, Victor J. Yohai, Ruben H. Zamar

    Abstract: We consider the problem of multivariate location and scatter matrix estimation when the data contain cellwise and casewise outliers. Agostinelli et al. (2015) propose a two-step approach to deal with this problem: first, apply a univariate filter to remove cellwise outliers and second, apply a generalized S-estimator to downweight casewise outliers. We improve this proposal in three main direction…

    Submitted 25 December, 2016; v1 submitted 1 September, 2016; originally announced September 2016.

    MSC Class: 62G35; 62G05; 62G20

  9. Robust regression estimation and inference in the presence of cellwise and casewise contamination

    Authors: Andy Leung, Hongyang Zhang, Ruben H. Zamar

    Abstract: Cellwise outliers are likely to occur together with casewise outliers in modern data sets with relatively large dimension. Recent work has shown that traditional robust regression methods may fail for data sets in this paradigm. The proposed method, called three-step regression, proceeds as follows: first, it uses a consistent univariate filter to detect and eliminate extreme cellwise outliers; se…

    Submitted 25 December, 2016; v1 submitted 8 September, 2015; originally announced September 2015.

    MSC Class: 62G35; 62G05; 62G20

  10. arXiv:1409.0745  [pdf, other]

    stat.ML cs.LG

    Multi-rank Sparse Hierarchical Clustering

    Authors: Hongyang Zhang, Ruben H. Zamar

    Abstract: There has been a surge in the number of large and flat data sets - data sets containing a large number of features and a relatively small number of observations - due to the growing ability to collect and store information in medical research and other fields. Hierarchical clustering is a widely used clustering tool. In hierarchical clustering, large and flat data sets may allow for a better cover…

    Submitted 3 July, 2017; v1 submitted 2 September, 2014; originally announced September 2014.

  11. arXiv:1406.6031  [pdf, other]

    math.ST

    Robust estimation of multivariate location and scatter in the presence of cellwise and casewise contamination

    Authors: Claudio Agostinelli, Andy Leung, Victor J. Yohai, Ruben H. Zamar

    Abstract: Multivariate location and scatter matrix estimation is a cornerstone in multivariate data analysis. We consider this problem when the data may contain independent cellwise and casewise outliers. Flat data sets with a large number of variables and a relatively small number of cases are commonplace in modern statistical applications. In these cases global down-weighting of an entire case, as perfor…

    Submitted 23 June, 2014; originally announced June 2014.

    MSC Class: 62G35 (Primary); 62G05 (Secondary)

  12. arXiv:1303.4805  [pdf, ps, other]

    stat.ML stat.CO

    Ensembling classification models based on phalanxes of variables with applications in drug discovery

    Authors: Jabed H. Tomal, William J. Welch, Ruben H. Zamar

    Abstract: Statistical detection of a rare class of objects in a two-class classification problem can pose several challenges. Because the class of interest is rare in the training data, there is relatively little information in the known class response labels for model building. At the same time the available explanatory variables are often moderately high dimensional. In the four assays of our drug-discove…

    Submitted 15 May, 2015; v1 submitted 19 March, 2013; originally announced March 2013.

    Comments: Published at http://dx.doi.org/10.1214/14-AOAS778 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Report number: IMS-AOAS-AOAS778

    Journal ref: Annals of Applied Statistics 2015, Vol. 9, No. 1, 69-93

  13. Propagation of outliers in multivariate data

    Authors: Fatemah Alqallaf, Stefan Van Aelst, Victor J. Yohai, Ruben H. Zamar

    Abstract: We investigate the performance of robust estimates of multivariate location under nonstandard data contamination models such as componentwise outliers (i.e., contamination in each variable is independent from the other variables). This model brings up a possible new source of statistical error that we call "propagation of outliers." This source of error is unusual in the sense that it is generat…

    Submitted 3 March, 2009; originally announced March 2009.

    Comments: Published at http://dx.doi.org/10.1214/07-AOS588 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Report number: IMS-AOS-AOS588

    MSC Class: 62F35 (Primary) 62H12 (Secondary)

    Journal ref: Annals of Statistics 2009, Vol. 37, No. 1, 311-331

  14. Discussion: Conditional growth charts

    Authors: Matias Salibian-Barrera, Ruben H. Zamar

    Abstract: Discussion of Conditional growth charts [math.ST/0702634]

    Submitted 22 February, 2007; originally announced February 2007.

    Comments: Published at http://dx.doi.org/10.1214/009053606000000669 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Report number: IMS-AOS-AOS0102D

    Journal ref: Annals of Statistics 2006, Vol. 34, No. 5, 2113-2118

  15. Robust nonparametric inference for the median

    Authors: Victor J. Yohai, Ruben H. Zamar

    Abstract: We consider the problem of constructing robust nonparametric confidence intervals and tests of hypothesis for the median when the data distribution is unknown and the data may contain a small fraction of contamination. We propose a modification of the sign test (and its associated confidence interval) which attains the nominal significance level (probability coverage) for any distribution in the…

    Submitted 29 March, 2005; originally announced March 2005.

    Comments: Published at http://dx.doi.org/10.1214/009053604000000634 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Report number: IMS-AOS-AOS283

    MSC Class: 62F35 (Primary) 62G35 (Secondary)

    Journal ref: Annals of Statistics 2004, Vol. 32, No. 5, 1841-1857

  16. Uniform asymptotics for robust location estimates when the scale is unknown

    Authors: Matias Salibian-Barrera, Ruben H. Zamar

    Abstract: Most asymptotic results for robust estimates rely on regularity conditions that are difficult to verify in practice. Moreover, these results apply to fixed distribution functions. In the robustness context the distribution of the data remains largely unspecified and hence results that hold uniformly over a set of possible distribution functions are of theoretical and practical interest. Also, it…

    Submitted 5 October, 2004; originally announced October 2004.

    Comments: Published by the Institute of Mathematical Statistics (http://www.imstat.org) in the Annals of Statistics (http://www.imstat.org/aos/) at http://dx.doi.org/10.1214/009053604000000544

    Report number: IMS-AOS-AOS235

    MSC Class: 62F35; 62F12; 62E20. (Primary)

    Journal ref: Annals of Statistics 2004, Vol. 32, No. 4, 1434-1447