Search | arXiv e-print repository

An Axiomatic Definition of Hierarchical Clustering

Authors: Ery Arias-Castro, Elizabeth Coda

Abstract: In this paper, we take an axiomatic approach to defining a population hierarchical clustering for piecewise constant densities, and in a similar manner to Lebesgue integration, extend this definition to more general densities. When the density satisfies some mild conditions, e.g., when it has connected support, is continuous, and vanishes only at infinity, or when the connected components of the density satisfy these conditions, our axiomatic definition results in Hartigan's definition of cluster tree.

Submitted 3 July, 2024; originally announced July 2024.

arXiv:2312.04924 [pdf, other]

Sparse Anomaly Detection Across Referentials: A Rank-Based Higher Criticism Approach

Authors: Ivo V. Stoepker, Rui M. Castro, Ery Arias-Castro

Abstract: Detecting anomalies in large sets of observations is crucial in various applications, such as epidemiological studies, gene expression studies, and systems monitoring. We consider settings where the units of interest result in multiple independent observations from potentially distinct referentials. Scan statistics and related methods are commonly used in such settings, but rely on stringent modeling assumptions for proper calibration. We instead propose a rank-based variant of the higher criticism statistic that only requires independent observations originating from ordered spaces. We show under what conditions the resulting methodology is able to detect the presence of anomalies. These conditions are stated in a general, non-parametric manner, and depend solely on the probabilities of anomalous observations exceeding nominal observations. The analysis requires a refined understanding of the distribution of the ranks under the presence of anomalies, and in particular of the rank-induced dependencies. The methodology is robust against heavy-tailed distributions through the use of ranks. Within the exponential family and a family of convolutional models, we analytically quantify the asymptotic performance of our methodology and the performance of the oracle, and show the difference is small for many common models. Simulations confirm these results. We show the applicability of the methodology through an analysis of quality control data of a pharmaceutical manufacturing process.

Submitted 8 December, 2023; originally announced December 2023.

MSC Class: 62G10; 62G20; 62G32; 62J15

arXiv:2310.10900 [pdf, other]

Stability of Sequential Lateration and of Stress Minimization in the Presence of Noise

Authors: Ery Arias-Castro, Siddharth Vishwanath

Abstract: Sequential lateration is a class of methods for multidimensional scaling where a suitable subset of nodes is first embedded by some method, e.g., a clique embedded by classical scaling, and then the remaining nodes are recursively embedded by lateration. A graph is a lateration graph when it can be embedded by such a procedure. We provide a stability result for a particular variant of sequential lateration. We do so in a setting where the dissimilarities represent noisy Euclidean distances between nodes in a geometric lateration graph. We then deduce, as a corollary, a perturbation bound for stress minimization. To argue that our setting applies broadly, we show that a (large) random geometric graph is a lateration graph with high probability under mild conditions, extending a previous result of Aspnes et al. (2006).

Submitted 26 March, 2024; v1 submitted 16 October, 2023; originally announced October 2023.

Comments: arXiv admin note: substantial text overlap with arXiv:2207.07218

arXiv:2310.00211 [pdf, ps, other]

Theoretical Foundations of Ordinal Multidimensional Scaling, Including Internal and External Unfolding

Authors: Ery Arias-Castro, Clément Berenfeld, Daniel Kane

Abstract: We provide a comprehensive theory of multiple variants of ordinal multidimensional scaling, including external and internal unfolding. We do so in the continuous model of Shepard (1966).

Submitted 5 October, 2023; v1 submitted 29 September, 2023; originally announced October 2023.

Comments: same exact version with funding information added

arXiv:2208.14540 [pdf, ps, other]

Embedding Functional Data: Multidimensional Scaling and Manifold Learning

Authors: Ery Arias-Castro, Wanli Qiao

Abstract: We adapt concepts, methodology, and theory originally developed in the areas of multidimensional scaling and dimensionality reduction for multivariate data to the functional setting. We focus on classical scaling and Isomap -- prototypical methods that have played important roles in these areas -- and showcase their use in the context of functional data analysis. In the process, we highlight the crucial role that the ambient metric plays.

Submitted 30 August, 2022; originally announced August 2022.

arXiv:2207.11121 [pdf, other]

Fitting a Multi-modal Density by Dynamic Programming

Authors: Ery Arias-Castro, He Jiang

Abstract: We consider the problem of fitting a probability density function when it is constrained to have a given number of modal intervals. We propose a dynamic programming approach to solving this problem numerically. When this number is not known, we provide several data-driven ways for selecting it. We perform some numerical experiments to illustrate our methodology.

Submitted 14 July, 2022; originally announced July 2022.

arXiv:2207.07218 [pdf, other]

On the Selection of Tuning Parameters for Patch-Stitching Embedding Methods

Authors: Ery Arias-Castro, Phong Alain Chau

Abstract: While classical scaling, just like principal component analysis, is parameter-free, other methods for embedding multivariate data require the selection of one or several tuning parameters. This tuning can be difficult due to the unsupervised nature of the situation. We propose a simple, almost obvious, approach to supervise the choice of tuning parameter(s): minimize a notion of stress. We apply this approach to the selection of the patch size in a prototypical patch-stitching embedding method, both in the multidimensional scaling (aka network localization) setting and in the dimensionality reduction (aka manifold learning) setting. In our study, we uncover a new bias--variance tradeoff phenomenon.

Submitted 17 October, 2023; v1 submitted 14 July, 2022; originally announced July 2022.

Comments: Title change. Theory was removed to spin off another paper [arXiv:2310.10900]

arXiv:2202.09023 [pdf, ps, other]

Clustering by Hill-Climbing: Consistency Results

Authors: Ery Arias-Castro, Wanli Qiao

Abstract: We consider several hill-climbing approaches to clustering as formulated by Fukunaga and Hostetler in the 1970's. We study both continuous-space and discrete-space (i.e., medoid) variants and establish their consistency.

Submitted 18 February, 2022; originally announced February 2022.

arXiv:2111.10298 [pdf, other]

An Asymptotic Equivalence between the Mean-Shift Algorithm and the Cluster Tree

Authors: Ery Arias-Castro, Wanli Qiao

Abstract: Two important nonparametric approaches to clustering emerged in the 1970's: clustering by level sets or cluster tree as proposed by Hartigan, and clustering by gradient lines or gradient flow as proposed by Fukunaga and Hostetler. In a recent paper, we argue the thesis that these two approaches are fundamentally the same by showing that the gradient flow provides a way to move along the cluster tree. In making a stronger case, we are confronted with the fact that the cluster tree does not define a partition of the entire support of the underlying density, while the gradient flow does. In the present paper, we resolve this conundrum by proposing two ways of obtaining a partition from the cluster tree -- each one of them very natural in its own right -- and showing that both of them reduce to the partition given by the gradient flow under standard assumptions on the sampling density.

Submitted 19 November, 2021; originally announced November 2021.

arXiv:2109.08362 [pdf, other]

Moving Up the Cluster Tree with the Gradient Flow

Authors: Ery Arias-Castro, Wanli Qiao

Abstract: The paper establishes a strong correspondence between two important clustering approaches that emerged in the 1970's: clustering by level sets or cluster tree as proposed by Hartigan and clustering by gradient lines or gradient flow as proposed by Fukunaga and Hostetler. We do so by showing that we can move up the cluster tree by following the gradient ascent flow.

Submitted 9 December, 2021; v1 submitted 17 September, 2021; originally announced September 2021.

Comments: This is an expanded version. We changed the title to better reflect the contribution made in the paper

arXiv:2105.03122 [pdf, other]

The Coreness and H-Index of Random Geometric Graphs

Authors: Eddie Aamari, Ery Arias-Castro, Clément Berenfeld

Abstract: In network analysis, a measure of node centrality provides a scale indicating how central a node is within a network. The coreness is a popular notion of centrality that accounts for the maximal smallest degree of a subgraph containing a given node. In this paper, we study the coreness of random geometric graphs and show that, with an increasing number of nodes and properly chosen connectivity radius, the coreness converges to a new object, that we call the continuum coreness. In the process, we show that other popular notions of centrality measures, namely the H-index and its iterates, also converge under the same setting to new limiting objects.

Submitted 13 June, 2024; v1 submitted 7 May, 2021; originally announced May 2021.

arXiv:2104.07870 [pdf, other]

Estimation of the Global Mode of a Density: Minimaxity, Adaptation, and Computational Complexity

Authors: Ery Arias-Castro, Wanli Qiao, Lin Zheng

Abstract: We consider the estimation of the global mode of a density under some decay rate condition around the global mode. We show that the maximum of a histogram, with proper choice of bandwidth, achieves the minimax rate that we establish for the setting that we consider. This is based on knowledge of the decay rate. Addressing the situation where the decay rate is unknown, we propose a multiscale variant consisting in the recursive refinement of a histogram, which is shown to be minimax adaptive. These methods run in linear time, and we prove in an appendix that this is best possible: There is no estimation procedure that runs in sublinear time that achieves the minimax rate.

Submitted 15 April, 2021; originally announced April 2021.
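
To illustrate the histogram-based estimator mentioned in this abstract, here is a minimal Python sketch for one-dimensional data. The function name `histogram_mode`, the fixed `bin_width` argument, and the choice to report the midpoint of the fullest bin are assumptions made for illustration; the bandwidth choice and the multiscale refinement studied in the paper are not reproduced here.

```python
import numpy as np

def histogram_mode(sample, bin_width):
    """Return the midpoint of the fullest histogram bin (1-D data).

    Illustrative sketch only: bins start at the sample minimum and all
    have the same user-supplied width.
    """
    sample = np.asarray(sample, dtype=float)
    lo = sample.min()
    # Map each observation to the index of the bin that contains it.
    idx = np.floor((sample - lo) / bin_width).astype(int)
    # Count observations per bin in a single linear pass.
    counts = np.bincount(idx)
    best = int(np.argmax(counts))
    return lo + (best + 0.5) * bin_width

# Usage on simulated data whose true mode is at 2.0.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=20000)
mode_hat = histogram_mode(x, bin_width=0.2)
```

Binning and counting each take one pass over the data, which is in line with the linear running time mentioned in the abstract.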

arXiv:2012.07937 [pdf, other]

Template Matching with Ranks

Authors: Ery Arias-Castro, Lin Zheng

Abstract: We consider the problem of matching a template to a noisy signal. Motivated by some recent proposals in the signal processing literature, we suggest a rank-based method and study its asymptotic properties using some well-established techniques in empirical process theory combined with Hájek's projection method. The resulting estimator of the shift is shown to achieve a parametric rate of convergence and to be asymptotically normal. Some numerical simulations corroborate these findings.

Submitted 14 December, 2020; originally announced December 2020.

arXiv:2011.12478 [pdf, other]

Minimax Estimation of Distances on a Surface and Minimax Manifold Learning in the Isometric-to-Convex Setting

Authors: Ery Arias-Castro, Phong Alain Chau

Abstract: We start by considering the problem of estimating intrinsic distances on a smooth submanifold. We show that minimax optimality can be obtained via a reconstruction of the surface, and discuss the use of a particular mesh construction -- the tangential Delaunay complex -- for that purpose. We then turn to manifold learning and argue that a variant of Isomap where the distances are instead computed on a reconstructed surface is minimax optimal for the isometric variant of the problem.

Submitted 3 October, 2023; v1 submitted 24 November, 2020; originally announced November 2020.

arXiv:2010.09906 [pdf, ps, other]

On the Consistency of Metric and Non-Metric K-medoids

Authors: Ery Arias-Castro, He Jiang

Abstract: We establish the consistency of K-medoids in the context of metric spaces. We start by proving that K-medoids is asymptotically equivalent to K-means restricted to the support of the underlying distribution under general conditions, including a wide selection of loss functions. This asymptotic equivalence, in turn, enables us to apply the work of Parna (1986) on the consistency of K-means. This general approach applies also to non-metric settings where only an ordering of the dissimilarities is available. We consider two types of ordinal information: one where all quadruple comparisons are available; and one where only triple comparisons are available. We provide some numerical experiments to illustrate our theory.

Submitted 19 October, 2020; originally announced October 2020.

arXiv:2009.04072 [pdf, other]

Template Matching and Change Point Detection by M-estimation

Authors: Ery Arias-Castro, Lin Zheng

Abstract: We consider the fundamental problem of matching a template to a signal. We do so by M-estimation, which encompasses procedures that are robust to gross errors (i.e., outliers). Using standard results from empirical process theory, we derive the convergence rate and the asymptotic distribution of the M-estimator under relatively mild assumptions. We also discuss the optimality of the estimator, both in finite samples in the minimax sense and in the large-sample limit in terms of local minimaxity and relative efficiency. Although most of the paper is dedicated to the study of the basic shift model in the context of a random design, we consider many extensions towards the end of the paper, including more flexible templates, fixed designs, the agnostic setting, and more.

Submitted 8 September, 2020; originally announced September 2020.

arXiv:2009.03117 [pdf, other]

Anomaly Detection for a Large Number of Streams: A Permutation-Based Higher Criticism Approach

Authors: Ivo V. Stoepker, Rui M. Castro, Ery Arias-Castro, Edwin van den Heuvel

Abstract: Anomaly detection when observing a large number of data streams is essential in a variety of applications, ranging from epidemiological studies to monitoring of complex systems. High-dimensional scenarios are usually tackled with scan-statistics and related methods, requiring stringent modeling assumptions for proper calibration. In this work we take a non-parametric stance, and propose a permutation-based variant of the higher criticism statistic not requiring knowledge of the null distribution. This results in an exact test in finite samples which is asymptotically optimal in the wide class of exponential models. We demonstrate the power loss in finite samples is minimal with respect to the oracle test. Furthermore, since the proposed statistic does not rely on asymptotic approximations it typically performs better than popular variants of higher criticism that rely on such approximations. We include recommendations such that the test can be readily applied in practice, and demonstrate its applicability in monitoring the content uniformity of an active ingredient for a batch-produced drug product.

Submitted 6 October, 2022; v1 submitted 7 September, 2020; originally announced September 2020.

arXiv:1906.08884 [pdf, other]

A Multiscale Scan Statistic for Adaptive Submatrix Localization

Authors: Yuchao Liu, Ery Arias-Castro

Abstract: We consider the problem of localizing a submatrix with larger-than-usual entry values inside a data matrix, without the prior knowledge of the submatrix size. We establish an optimization framework based on a multiscale scan statistic, and develop algorithms in order to approach the optimizer. We also show that our estimator only requires a signal strength of the same order as the minimax estimator with oracle knowledge of the submatrix size, to exactly recover the anomaly with high probability. We perform some simulations that show that our estimator has superior performance compared to other estimators which do not require prior submatrix knowledge, while being comparatively faster to compute.

Submitted 20 June, 2019; originally announced June 2019.

Comments: The original version was accepted by KDD2019 Research Track. Detail of the proof is available at https://escholarship.org/uc/item/9wt627dg

arXiv:1811.07105 [pdf, other]

Detection of Sparse Positive Dependence

Authors: Ery Arias-Castro, Rong Huang, Nicolas Verzelen

Abstract: In a bivariate setting, we consider the problem of detecting a sparse contamination or mixture component, where the effect manifests itself as a positive dependence between the variables, which are otherwise independent in the main component. We first look at this problem in the context of a normal mixture model. In essence, the situation reduces to a univariate setting where the effect is a decrease in variance. In particular, a higher criticism test based on the pairwise differences is shown to achieve the detection boundary defined by the (oracle) likelihood ratio test. We then turn to a Gaussian copula model where the marginal distributions are unknown. Standard invariance considerations lead us to consider rank tests. In fact, a higher criticism test based on the pairwise rank differences achieves the detection boundary in the normal mixture model, although not in the very sparse regime. We do not know of any rank test that has any power in that regime.

Submitted 9 January, 2020; v1 submitted 17 November, 2018; originally announced November 2018.

arXiv:1811.01101 [pdf, other]

Some Random Paths with Angle Constraints

Authors: Clément Berenfeld, Ery Arias-Castro

Abstract: We propose a simple, geometrically-motivated construction of smooth random paths in the plane. The construction is such that, with probability one, the paths have finite curvature everywhere (and the realizations are visually pleasing when simulated on a computer). Our construction is Markov of order 2. We show that a simpler construction which is Markov of order 1 fails to exhibit the desired finite curvature property.

Submitted 2 November, 2018; originally announced November 2018.

arXiv:1810.09569 [pdf, other]

Perturbation Bounds for Procrustes, Classical Scaling, and Trilateration, with Applications to Manifold Learning

Authors: Ery Arias-Castro, Adel Javanmard, Bruno Pelletier

Abstract: One of the common tasks in unsupervised learning is dimensionality reduction, where the goal is to find meaningful low-dimensional structures hidden in high-dimensional data. Sometimes referred to as manifold learning, this problem is closely related to the problem of localization, which aims at embedding a weighted graph into a low-dimensional Euclidean space. Several methods have been proposed for localization, and also manifold learning. Nonetheless, the robustness property of most of them is little understood. In this paper, we obtain perturbation bounds for classical scaling and trilateration, which are then applied to derive performance bounds for Isomap, Landmark Isomap, and Maximum Variance Unfolding. A new perturbation bound for Procrustes analysis plays a key role.

Submitted 24 October, 2019; v1 submitted 22 October, 2018; originally announced October 2018.

Comments: 33 pages, 6 Figures
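
For readers unfamiliar with classical scaling, which this abstract refers to, the following Python sketch shows the textbook version: double-center the squared dissimilarities and keep the top eigenvectors. The function name `classical_scaling` and the clipping of negative eigenvalues to zero are illustrative assumptions, and nothing here reproduces the perturbation analysis of the paper.

```python
import numpy as np

def classical_scaling(D, dim):
    """Embed n items in `dim` dimensions from an n x n dissimilarity matrix.

    Textbook classical (Torgerson) scaling; illustrative sketch only.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Centering matrix J = I - (1/n) 11^T.
    J = np.eye(n) - np.ones((n, n)) / n
    # Double-centered squared dissimilarities: a Gram matrix when D is Euclidean.
    B = -0.5 * J @ (D ** 2) @ J
    w, V = np.linalg.eigh(B)            # eigenvalues in ascending order
    top = np.argsort(w)[::-1][:dim]     # indices of the `dim` largest eigenvalues
    w_top = np.clip(w[top], 0.0, None)  # guard against small negative values
    return V[:, top] * np.sqrt(w_top)

# Usage: exact Euclidean distances are recovered up to a rigid motion.
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = classical_scaling(D, dim=2)
D_hat = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
```

With noiseless Euclidean input the embedded distances `D_hat` agree with `D` up to rounding; the abstract concerns what happens when `D` is perturbed.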

arXiv:1808.00631 [pdf, other]

A Scan Procedure for Multiple Testing

Authors: Shiyun Chen, Andrew Ying, Ery Arias-Castro

Abstract: In a multiple testing framework, we propose a method that identifies the interval with the highest estimated false discovery rate of P-values and rejects the corresponding null hypotheses. Unlike the Benjamini-Hochberg method, which does the same but over intervals with an endpoint at the origin, the new procedure `scans' all intervals. In parallel with Storey et al. (2004), we show that this scan procedure provides strong control of asymptotic false discovery rate. In addition, we investigate its asymptotic false non-discovery rate, deriving conditions under which it outperforms the Benjamini-Hochberg procedure. For example, the scan procedure is superior in power-law location models.

Submitted 1 August, 2018; originally announced August 2018.
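
As a point of reference for the comparison made in this abstract, here is a minimal Python sketch of the standard Benjamini-Hochberg step-up procedure, which only considers intervals of P-values starting at the origin. The function name `benjamini_hochberg` and the default level `q=0.1` are assumptions for illustration; the scan procedure proposed in the paper is not implemented here.

```python
import numpy as np

def benjamini_hochberg(pvalues, q=0.1):
    """Return a boolean mask of rejected hypotheses at FDR level q.

    Standard step-up rule: reject the k smallest P-values, where k is the
    largest index with p_(k) <= q * k / n. Illustrative sketch only.
    """
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, n + 1) / n
    below = p[order] <= thresholds
    reject = np.zeros(n, dtype=bool)
    if below.any():
        # Last sorted position that meets its threshold (0-based).
        k = int(np.max(np.nonzero(below)[0]))
        reject[order[:k + 1]] = True
    return reject

# Usage on a small hand-made example.
rejected = benjamini_hochberg([0.001, 0.8, 0.02, 0.5, 0.03], q=0.1)
```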

arXiv:1807.10785 [pdf, other]

The Sparse Variance Contamination Model

Authors: Ery Arias-Castro, Rong Huang

Abstract: We consider a Gaussian contamination (i.e., mixture) model where the contamination manifests itself as a change in variance. We study this model in various asymptotic regimes, in parallel with the work of Ingster (1997) and Donoho and Jin (2004), who considered a similar model where the contamination was in the mean instead.

Submitted 27 July, 2018; originally announced July 2018.

arXiv:1804.10611 [pdf, other]

On the Estimation of Latent Distances Using Graph Distances

Authors: Ery Arias-Castro, Antoine Channarond, Bruno Pelletier, Nicolas Verzelen

Abstract: We are given the adjacency matrix of a geometric graph and the task of recovering the latent positions. We study one of the most popular approaches which consists in using the graph distances and derive error bounds under various assumptions on the link function. In the simplest case where the link function is proportional to an indicator function, the bound matches an information lower bound that we derive.

Submitted 11 August, 2020; v1 submitted 27 April, 2018; originally announced April 2018.

arXiv:1802.08715 [pdf, other]

Detection of Sparse Mixtures: Higher Criticism and Scan Statistic

Authors: Ery Arias-Castro, Andrew Ying

Abstract: We consider the problem of detecting a sparse mixture as studied by Ingster (1997) and Donoho and Jin (2004). We consider a wide array of base distributions. In particular, we study the situation when the base distribution has polynomial tails, a situation that has not received much attention in the literature. Perhaps surprisingly, we find that in the context of such a power-law distribution, the higher criticism does not achieve the detection boundary. However, the scan statistic does.

Submitted 23 February, 2018; originally announced February 2018.
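
The higher criticism statistic discussed in this abstract can be sketched in Python as follows, using one common P-value form of the statistic. The function name `higher_criticism`, the truncation fraction `alpha0`, and the clipping constant `eps` are illustrative assumptions, and this is not necessarily the exact variant analyzed in the paper; the scan statistic is not shown.

```python
import numpy as np

def higher_criticism(pvalues, alpha0=0.5, eps=1e-12):
    """One common form of the higher criticism statistic from P-values.

    Maximizes the standardized gap between the empirical and uniform CDFs
    over the smallest alpha0 fraction of sorted P-values. Sketch only.
    """
    p = np.sort(np.asarray(pvalues, dtype=float))
    # Keep P-values away from 0 and 1 to avoid division by zero.
    p = np.clip(p, eps, 1.0 - eps)
    n = p.size
    i = np.arange(1, n + 1)
    z = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1.0 - p))
    k = max(1, int(alpha0 * n))
    return float(np.max(z[:k]))

# Usage: a sparse set of very small P-values inflates the statistic.
rng = np.random.default_rng(2)
null_p = rng.uniform(size=10000)
alt_p = null_p.copy()
alt_p[:100] = 1e-6          # hypothetical sparse contamination
hc_null = higher_criticism(null_p)
hc_alt = higher_criticism(alt_p)
```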

arXiv:1711.11220 [pdf, other]

RANSAC Algorithms for Subspace Recovery and Subspace Clustering

Authors: Ery Arias-Castro, Jue Wang

Abstract: We consider the RANSAC algorithm in the context of subspace recovery and subspace clustering. We derive some theory and perform some numerical experiments. We also draw some correspondences with the methods of Hardt and Moitra (2013) and Chen and Lerman (2009b).

Submitted 29 November, 2017; originally announced November 2017.

arXiv:1706.09441 [pdf, other]

Unconstrained and Curvature-Constrained Shortest-Path Distances and their Approximation

Authors: Ery Arias-Castro, Thibaut Le Gouic

Abstract: We study shortest paths and their distances on a subset of a Euclidean space, and their approximation by their equivalents in a neighborhood graph defined on a sample from that subset. In particular, we recover and extend the results of Bernstein et al. (2000). We do the same with curvature-constrained shortest paths and their distances, establishing what we believe are the first approximation bounds for them.

Submitted 24 October, 2018; v1 submitted 28 June, 2017; originally announced June 2017.

arXiv:1705.10190 [pdf, other]

Sequential Multiple Testing

Authors: Shiyun Chen, Ery Arias-Castro

Abstract: We study an online multiple testing problem where the hypotheses arrive sequentially in a stream. The test statistics are independent and assumed to have the same distribution under their respective null hypotheses. We investigate two procedures LORD and LOND, proposed by (Javanmard and Montanari, 2015), which are proved to control the FDR in an online manner. In some (static) model, we show that LORD is optimal in some asymptotic sense, in particular as powerful as the (static) Benjamini-Hochberg procedure to first asymptotic order. We also quantify the performance of LOND. Some numerical experiments complement our theory.

Submitted 25 May, 2017; originally announced May 2017.

Comments: arXiv admin note: text overlap with arXiv:1604.07520

arXiv:1607.08156 [pdf, ps, other]

Remember the Curse of Dimensionality: The Case of Goodness-of-Fit Testing in Arbitrary Dimension

Authors: Ery Arias-Castro, Bruno Pelletier, Venkatesh Saligrama

Abstract: Despite a substantial literature on nonparametric two-sample goodness-of-fit testing in arbitrary dimensions spanning decades, there is no mention there of any curse of dimensionality. Only more recently Ramdas et al. (2015) have discussed this issue in the context of kernel methods by showing that their performance degrades with the dimension even when the underlying distributions are isotropic Gaussians. We take a minimax perspective and follow in the footsteps of Ingster (1987) to derive the minimax rate in arbitrary dimension when the discrepancy is measured in the L2 metric. That rate is revealed to be nonparametric and exhibit a prototypical curse of dimensionality. We further extend Ingster's work to show that the chi-squared test achieves the minimax rate. Moreover, we show that the test can be made to work when the distributions have support of low intrinsic dimension. Finally, inspired by Ingster (2000), we consider a multiscale version of the chi-square test which can adapt to unknown smoothness and/or unknown intrinsic dimensionality without much loss in power.

Submitted 11 September, 2018; v1 submitted 27 July, 2016; originally announced July 2016.

Comments: This version comes after the publication of the paper in the Journal of Nonparametric Statistics. The main change is to cite the work of Ramdas et al. Some very minor typos were also corrected

arXiv:1607.07549 [pdf, ps, other]

Concentration of Measure for Radial Distributions and Consequences for Statistical Modeling

Authors: Ery Arias-Castro, Xiao Pu

Abstract: Motivated by problems in high-dimensional statistics such as mixture modeling for classification and clustering, we consider the behavior of radial densities as the dimension increases. We establish a form of concentration of measure, and even a convergence in distribution, under additional assumptions. This extends the well-known behavior of the normal distribution (its concentration around the sphere of radius square-root of the dimension) to other radial densities. We draw some possible consequences for statistical modeling in high-dimensions, including a possible universality property of Gaussian mixtures.

Submitted 11 September, 2016; v1 submitted 26 July, 2016; originally announced July 2016.

arXiv:1605.01333 [pdf, other]

Minimax Estimation of the Volume of a Set with Smooth Boundary

Authors: Ery Arias-Castro, Beatriz Pateiro-López, Alberto Rodríguez-Casal

Abstract: We consider the problem of estimating the volume of a compact domain in a Euclidean space based on a uniform sample from the domain. We assume the domain has a boundary with positive reach. We propose a data splitting approach to correct the bias of the plug-in estimator based on the sample alpha-convex hull. We show that this simple estimator achieves a minimax lower bound that we derive. Some numerical experiments corroborate our theoretical findings.

Submitted 4 May, 2016; originally announced May 2016.

arXiv:1604.07520 [pdf, other]

Distribution-free Multiple Testing

Authors: Ery Arias-Castro, Shiyun Chen

Abstract: We study a stylized multiple testing problem where the test statistics are independent and assumed to have the same distribution under their respective null hypotheses. We first show that, in the normal means model where the test statistics are normal Z-scores, the well-known method of (Benjamini and Hochberg, 1995) is optimal in some asymptotic sense. We then show that this is also the case of a recent distribution-free method proposed by Foygel-Barber and Candès (2015). The method is distribution-free in the sense that it is agnostic to the null distribution - it only requires that the null distribution be symmetric. We extend these optimality results to other location models with a base distribution having fast-decaying tails.

Submitted 26 April, 2016; originally announced April 2016.

arXiv:1604.07449 [pdf, other]

Distribution-free Detection of a Submatrix

Authors: Ery Arias-Castro, Yuchao Liu

Abstract: We consider the problem of detecting the presence of a submatrix with larger-than-usual values in a large data matrix. This problem was considered in (Butucea and Ingster, 2013) under a one-parameter exponential family, and one of the tests they analyzed is the scan test. Taking a nonparametric stance, we show that a calibration by permutation leads to the same (first-order) asymptotic performance. This is true for the two types of permutations we consider. We also study the corresponding rank-based variants and precisely quantify the loss in asymptotic power.

Submitted 25 April, 2016; originally announced April 2016.

arXiv:1603.05947 [pdf, ps, other]

Noisy Hypotheses in the Age of Discovery Science

Authors: Ery Arias-Castro

Abstract: We draw attention to one specific issue raised by Ioannidis (2005), that of very many hypotheses being tested in a given field of investigation. To better isolate the problem that arises in this (massive) multiple testing scenario, we consider a utopian setting where the hypotheses are tested with no additional bias. We show that, as the number of hypotheses being tested becomes much larger than the discoveries to be made, it becomes impossible to reliably identify true discoveries. This phenomenon, well-known to statisticians working in the field of multiple testing, puts in jeopardy any naive pursuit in (pure) discovery science.

Submitted 9 November, 2016; v1 submitted 18 March, 2016; originally announced March 2016.

arXiv:1511.01009 [pdf, ps, other]

Detecting a Path of Correlations in a Network

Authors: Ery Arias-Castro, Gábor Lugosi, Nicolas Verzelen

Abstract: We consider the problem of detecting an anomaly in the form of a path of correlations hidden in white noise. We provide a minimax lower bound and a test that, under mild assumptions, is able to achieve the lower bound up to a multiplicative constant.

Submitted 22 December, 2016; v1 submitted 3 November, 2015; originally announced November 2015.

Comments: arXiv admin note: text overlap with arXiv:1504.06984

arXiv:1509.05790 [pdf, ps, other]

On the Consistency of the Crossmatch Test

Authors: Ery Arias-Castro, Bruno Pelletier

Abstract: Rosenbaum (2005) proposed the crossmatch test for two-sample goodness-of-fit testing in arbitrary dimensions. We prove that the test is consistent against all fixed alternatives. In the process, we develop a general consistency result based on (Henze & Penrose, 1999) that applies more generally.

Submitted 18 September, 2015; originally announced September 2015.

arXiv:1508.03002 [pdf, other]

Distribution-Free Detection of Structured Anomalies: Permutation and Rank-Based Scans

Authors: Ery Arias-Castro, Rui M. Castro, Ervin Tánczos, Meng Wang

Abstract: The scan statistic is by far the most popular method for anomaly detection, being popular in syndromic surveillance, signal and image processing, and target detection based on sensor networks, among other applications. The use of the scan statistics in such settings yields a hypothesis testing procedure, where the null hypothesis corresponds to the absence of anomalous behavior. If the null distribution is known, then calibration of a scan-based test is relatively easy, as it can be done by Monte Carlo simulation. When the null distribution is unknown, it is less straightforward. We investigate two procedures. The first one is a calibration by permutation and the other is a rank-based scan test, which is distribution-free and less sensitive to outliers. Furthermore, the rank scan test requires only a one-time calibration for a given data size making it computationally much more appealing. In both cases, we quantify the performance loss with respect to an oracle scan test that knows the null distribution. We show that using one of these calibration procedures results in only a very small loss of power in the context of a natural exponential family. This includes the classical normal location model, popular in signal processing, and the Poisson model, popular in syndromic surveillance. We perform numerical experiments on simulated data further supporting our theory and also on a real dataset from genomics.

Submitted 24 November, 2016; v1 submitted 12 August, 2015; originally announced August 2015.

arXiv:1507.00065 [pdf, ps, other]

On Estimating the Perimeter Using the Alpha-Shape

Authors: Ery Arias-Castro, Alberto Rodríguez Casal

Abstract: We consider the problem of estimating the perimeter of a smooth domain in the plane based on a sample from the uniform distribution over the domain. We study the performance of the estimator defined as the perimeter of the alpha-shape of the sample. Some numerical experiments corroborate our theoretical findings.

Submitted 30 June, 2015; originally announced July 2015.

arXiv:1505.01247 [pdf, other]

The Sparse Poisson Means Model

Authors: Ery Arias-Castro, Meng Wang

Abstract: We consider the problem of detecting a sparse Poisson mixture. Our results parallel those for the detection of a sparse normal mixture, pioneered by Ingster (1997) and Donoho and Jin (2004), when the Poisson means are larger than logarithmic in the sample size. In particular, a form of higher criticism achieves the detection boundary in the whole sparse regime. When the Poisson means are smaller than logarithmic in the sample size, a different regime arises in which simple multiple testing with Bonferroni correction is enough in the sparse regime. We present some numerical experiments that confirm our theoretical findings.

Submitted 5 May, 2015; originally announced May 2015.

arXiv:1504.06984 [pdf, other]

Detecting Markov Random Fields Hidden in White Noise

Authors: Ery Arias-Castro, Sébastien Bubeck, Gábor Lugosi, Nicolas Verzelen

Abstract: Motivated by change point problems in time series and the detection of textured objects in images, we consider the problem of detecting a piece of a Gaussian Markov random field hidden in white Gaussian noise. We derive minimax lower bounds and propose near-optimal tests.

Submitted 14 October, 2015; v1 submitted 27 April, 2015; originally announced April 2015.

Comments: In the 2nd version we removed the part on path detection, which will appear on its own in a separate paper

arXiv:1501.02861 [pdf, ps, other]

Some theory for ordinal embedding

Authors: Ery Arias-Castro

Abstract: Motivated by recent work on ordinal embedding (Kleindessner and von Luxburg, 2014), we derive large sample consistency results and rates of convergence for the problem of embedding points based on triple or quadruple distance comparisons. We also consider a variant of this problem where only local comparisons are provided. Finally, inspired by (Jamieson and Nowak, 2011), we bound the number of such comparisons needed to achieve consistency.

Submitted 4 May, 2016; v1 submitted 12 January, 2015; originally announced January 2015.

arXiv:1409.7127 [pdf, other]

Exact Asymptotics for the Scan Statistic and Fast Alternatives

Authors: James Sharpnack, Ery Arias-Castro

Abstract: We consider the problem of detecting a rectangle of activation in a grid of sensors in d-dimensions with noisy measurements. This has applications to massive surveillance projects and anomaly detection in large datasets in which one detects anomalously high measurements over rectangular regions, or more generally, blobs. Recently, the asymptotic distribution of a multiscale scan statistic was established in (Kabluchko, 2011) under the null hypothesis, using non-constant boundary crossing probabilities for locally-stationary Gaussian random fields derived in (Chan and Lai, 2006). Using a similar approach, we derive the exact asymptotic level and power of four variants of the scan statistic: an oracle scan that knows the dimensions of the activation rectangle; the multiscale scan statistic just mentioned; an adaptive variant; and an epsilon-net approximation to the latter, in the spirit of (Arias-Castro, 2005). This approximate scan runs in time near-linear in the size of the grid and achieves the same asymptotic power as the adaptive scan. We complement our theory with some numerical experiments.

Submitted 24 September, 2014; originally announced September 2014.

MSC Class: 62F03

arXiv:1405.1478 [pdf, other]

Detection and Feature Selection in Sparse Mixture Models

Authors: Nicolas Verzelen, Ery Arias-Castro

Abstract: We consider Gaussian mixture models in high dimensions and concentrate on the twin tasks of detection and feature selection. Under sparsity assumptions on the difference in means, we derive information bounds and establish the performance of various procedures, including the top sparse eigenvalue of the sample covariance matrix and other projection tests based on moments, such as the skewness and kurtosis tests of Malkovich and Afifi (1973), and other variants which we were better able to control under the null.

Submitted 1 October, 2016; v1 submitted 6 May, 2014; originally announced May 2014.

Comments: 70 pages

arXiv:1308.2955 [pdf, other]

Community Detection in Sparse Random Networks

Authors: Ery Arias-Castro, Nicolas Verzelen

Abstract: We consider the problem of detecting a tight community in a sparse random network. This is formalized as testing for the existence of a dense random subgraph in a random graph. Under the null hypothesis, the graph is a realization of an Erdös-Rényi graph on $N$ vertices and with connection probability $p_0$; under the alternative, there is an unknown subgraph on $n$ vertices where the connection probability is $p_1 > p_0$. In Arias-Castro and Verzelen (2012), we focused on the asymptotically dense regime where $p_0$ is large enough that $n p_0 > (n/N)^{o(1)}$. We consider here the asymptotically sparse regime where $p_0$ is small enough that $n p_0 < (n/N)^{c_0}$ for some $c_0 > 0$. As before, we derive information theoretic lower bounds, and also establish the performance of various tests. Compared to our previous work, the arguments for the lower bounds are based on the same technology, but are substantially more technical in the details; also, the methods we study are different: besides a variant of the scan statistic, we study other statistics such as the size of the largest connected component, the number of triangles, the eigengap of the adjacency matrix, etc. Our detection bounds are sharp, except in the Poisson regime where we were not able to fully characterize the constant arising in the bound.

Submitted 25 September, 2014; v1 submitted 13 August, 2013; originally announced August 2013.

arXiv:1308.0346 [pdf, other]

Distribution-Free Tests for Sparse Heterogeneous Mixtures

Authors: Ery Arias-Castro, Meng Wang

Abstract: We consider the problem of detecting sparse heterogeneous mixtures from a nonparametric perspective, and develop distribution-free tests when all effects have the same sign. Specifically, we assume that the null distribution is symmetric about zero, while the true effects have positive median. We evaluate the precise performance of classical tests for the median (t-test, sign test) and classical tests for symmetry (signed-rank, Smirnov, total number of runs, longest run tests) showing that none of them is asymptotically optimal for the normal mixture model in all sparsity regimes. We then suggest two new tests. The main one is a form of Higher Criticism, or Anderson-Darling, test for symmetry. It is shown to be asymptotically optimal for the normal mixture model, and other generalized Gaussian mixture models, in all sparsity regimes. Our numerical experiments confirm our theoretical findings.

Submitted 15 November, 2013; v1 submitted 1 August, 2013; originally announced August 2013.

arXiv:1302.7099 [pdf, ps, other]

Community Detection in Random Networks

Authors: Ery Arias-Castro, Nicolas Verzelen

Abstract: We formalize the problem of detecting a community in a network into testing whether in a given (random) graph there is a subgraph that is unusually dense. We observe an undirected and unweighted graph on N nodes. Under the null hypothesis, the graph is a realization of an Erdös-Rényi graph with probability p0. Under the (composite) alternative, there is a subgraph of n nodes where the probability of connection is p1 > p0. We derive a detection lower bound for detecting such a subgraph in terms of N, n, p0, p1 and exhibit a test that achieves that lower bound. We do this both when p0 is known and unknown. We also consider the problem of testing in polynomial-time. As an aside, we consider the problem of detecting a clique, which is intimately related to the planted clique problem. Our focus in this paper is in the quasi-normal regime where n p0 is either bounded away from zero, or tends to zero slowly.

Submitted 28 February, 2013; originally announced February 2013.

arXiv:1208.6516 [pdf, other]

A two-stage denoising filter: the preprocessed Yaroslavsky filter

Authors: Joseph Salmon, Rebecca Willett, Ery Arias-Castro

Abstract: This paper describes a simple image noise removal method which combines a preprocessing step with the Yaroslavsky filter for strong numerical, visual, and theoretical performance on a broad class of images. The framework developed is a two-stage approach. In the first stage the image is filtered with a classical denoising method (e.g., wavelet or curvelet thresholding). In the second stage a modification of the Yaroslavsky filter is performed on the original noisy image, where the weights of the filters are governed by pixel similarities in the denoised image from the first stage. Similar prefiltering ideas have proved effective previously in the literature, and this paper provides theoretical guarantees and important insight into why prefiltering can be effective. Empirically, this simple approach achieves very good performance for cartoon images, and can be computed much more quickly than current patch-based denoising algorithms.

Submitted 31 August, 2012; originally announced August 2012.

ACM Class: I.4.3; I.4.10; I.5.1; G.3

arXiv:1208.2635 [pdf, ps, other]

Variable Selection with Exponential Weights and $l_0$-Penalization

Authors: Ery Arias-Castro, Karim Lounici

Abstract: In the context of a linear model with a sparse coefficient vector, exponential weights methods have been shown to achieve oracle inequalities for prediction. We show that such methods also succeed at variable selection and estimation under the necessary identifiability condition on the design matrix, instead of much stronger assumptions required by other methods such as the Lasso or the Dantzig Selector. The same analysis yields consistency results for Bayesian methods and BIC-type variable selection under similar conditions.

Submitted 16 September, 2012; v1 submitted 13 August, 2012; originally announced August 2012.

Comments: 23 pages; 1 figure

arXiv:1202.5536 [pdf, ps, other]

doi 10.3150/13-BEJ565

Detecting positive correlations in a multivariate sample

Authors: Ery Arias-Castro, Sébastien Bubeck, Gábor Lugosi

Abstract: We consider the problem of testing whether a correlation matrix of a multivariate normal population is the identity matrix. We focus on sparse classes of alternatives where only a few entries are nonzero and, in fact, positive. We derive a general lower bound applicable to various classes and study the performance of some near-optimal tests. We pay special attention to computational feasibility and construct near-optimal tests that can be computed efficiently. Finally, we apply our results to prove new lower bounds for the clique number of high-dimensional random geometric graphs.

Submitted 14 April, 2015; v1 submitted 24 February, 2012; originally announced February 2012.

Comments: Published at http://dx.doi.org/10.3150/13-BEJ565 in the Bernoulli (http://isi.cbs.nl/bernoulli/) by the International Statistical Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm)

Report number: IMS-BEJ-BEJ565

Journal ref: Bernoulli 2015, Vol. 21, No. 1, 209-241

arXiv:1112.6235 [pdf, ps, other]

Detecting a Vector Based on Linear Measurements

Authors: Ery Arias-Castro

Abstract: We consider a situation where the state of a system is represented by a real-valued vector. Under normal circumstances, the vector is zero, while an event manifests as non-zero entries in this vector, possibly few. Our interest is in the design of algorithms that can reliably detect events (i.e., test whether the vector is zero or not) with the least amount of information. We place ourselves in a situation, now common in the signal processing literature, where information about the vector comes in the form of noisy linear measurements. We derive information bounds in an active learning setup and exhibit some simple near-optimal algorithms. In particular, our results show that the task of detection within this setting is at once much easier, simpler and different than the tasks of estimation and support recovery.

Submitted 29 December, 2011; originally announced December 2011.

Showing 1–50 of 66 results for author: Arias-Castro, E