-
Topological Analysis for Detecting Anomalies (TADA) in Time Series
Authors:
Frédéric Chazal,
Martin Royer,
Clément Levrard
Abstract:
This paper introduces a new methodology, based on Topological Data Analysis, for detecting anomalies in multivariate time series; it aims to detect global changes in the dependency structure between channels. The proposed approach is lean enough to handle large-scale datasets, and extensive numerical experiments back the intuition that it is more suitable for detecting global changes of correlation structures than existing methods. Some theoretical guarantees for quantization algorithms based on dependent time sequences are also provided.
Submitted 10 June, 2024;
originally announced June 2024.
-
Wasserstein GANs are Minimax Optimal Distribution Estimators
Authors:
Arthur Stéphanovitch,
Eddie Aamari,
Clément Levrard
Abstract:
We provide non-asymptotic rates of convergence of the Wasserstein Generative Adversarial Network (WGAN) estimator. We build neural network classes representing the generators and discriminators which yield a GAN that achieves the minimax optimal rate for estimating a certain probability measure $μ$ with support in $\mathbb{R}^p$. The probability $μ$ is considered to be the push forward of the Lebesgue measure on the $d$-dimensional torus $\mathbb{T}^d$ by a map $g^\star:\mathbb{T}^d\rightarrow \mathbb{R}^p$ of smoothness $β+1$. Measuring the error with the $γ$-Hölder Integral Probability Metric (IPM), we obtain, up to logarithmic factors, the minimax optimal rate $O(n^{-\frac{β+γ}{2β+d}}\vee n^{-\frac{1}{2}})$ where $n$ is the sample size, $β$ determines the smoothness of the target measure $μ$, $γ$ is the smoothness of the IPM ($γ=1$ is the Wasserstein case) and $d\leq p$ is the intrinsic dimension of $μ$. In the process, we derive a sharp interpolation inequality between Hölder IPMs. This novel result in the theory of function spaces generalizes classical interpolation inequalities to the case where the measures involved have densities on different manifolds.
Submitted 30 November, 2023;
originally announced November 2023.
-
Statistical learning on measures: an application to persistence diagrams
Authors:
Olympio Hacquard,
Gilles Blanchard,
Clément Levrard
Abstract:
We consider a binary supervised learning classification problem where instead of having data in a finite-dimensional Euclidean space, we observe measures on a compact space $\mathcal{X}$. Formally, we observe data $D_N = (μ_1, Y_1), \ldots, (μ_N, Y_N)$ where $μ_i$ is a measure on $\mathcal{X}$ and $Y_i$ is a label in $\{0, 1\}$. Given a set $\mathcal{F}$ of base-classifiers on $\mathcal{X}$, we build corresponding classifiers in the space of measures. We provide upper and lower bounds on the Rademacher complexity of this new class of classifiers that can be expressed simply in terms of corresponding quantities for the class $\mathcal{F}$. If the measures $μ_i$ are uniform over a finite set, this classification task boils down to a multi-instance learning problem. However, our approach allows more flexibility and diversity in the input data we can deal with. While such a framework has many possible applications, this work places a strong emphasis on classifying data via topological descriptors called persistence diagrams. These objects are discrete measures on $\mathbb{R}^2$, where the coordinates of each point correspond to the range of scales at which a topological feature exists. We present several classifiers on measures and show how they can heuristically and theoretically enable good classification performance in various settings in the case of persistence diagrams.
Submitted 31 May, 2023; v1 submitted 15 March, 2023;
originally announced March 2023.
-
Optimal Reach Estimation and Metric Learning
Authors:
Eddie Aamari,
Clément Berenfeld,
Clément Levrard
Abstract:
We study the estimation of the reach, a ubiquitous regularity parameter in manifold estimation and geometric data analysis. Given an i.i.d. sample over an unknown $d$-dimensional $\mathcal{C}^k$-smooth submanifold $M$ of $\mathbb{R}^D$, we provide optimal nonasymptotic bounds for the estimation of its reach. We build upon a formulation of the reach in terms of maximal curvature on the one hand, and geodesic metric distortion on the other hand. The derived rates are adaptive: they depend on whether the reach of $M$ arises from curvature or from a bottleneck structure. In the process, we derive optimal geodesic metric estimation bounds.
Submitted 13 July, 2022;
originally announced July 2022.
-
Topologically penalized regression on manifolds
Authors:
Olympio Hacquard,
Krishnakumar Balasubramanian,
Gilles Blanchard,
Clément Levrard,
Wolfgang Polonik
Abstract:
We study a regression problem on a compact manifold M. In order to take advantage of the underlying geometry and topology of the data, the regression task is performed on the basis of the first several eigenfunctions of the Laplace-Beltrami operator of the manifold, which are regularized with topological penalties. The proposed penalties are based on the topology of the sub-level sets of either the eigenfunctions or the estimated function. The overall approach is shown to yield promising and competitive performance on various applications to both synthetic and real data sets. We also provide theoretical guarantees on the regression function estimates, on both their prediction error and their smoothness (in a topological sense). Taken together, these results support the relevance of our approach in the case where the targeted function is "topologically smooth".
Submitted 10 June, 2022; v1 submitted 26 October, 2021;
originally announced October 2021.
-
Minimax Boundary Estimation and Estimation with Boundary
Authors:
Eddie Aamari,
Catherine Aaron,
Clément Levrard
Abstract:
We derive non-asymptotic minimax bounds for the Hausdorff estimation of $d$-dimensional submanifolds $M \subset \mathbb{R}^D$ with (possibly) non-empty boundary $\partial M$. The model unifies and extends the most prevalent $\mathcal{C}^2$-type set estimation models: manifolds without boundary, and full-dimensional domains. We consider both the estimation of the manifold $M$ itself and that of its boundary $\partial M$ if non-empty. Given $n$ samples, the minimax rates are of order $O\bigl((\log n/n)^{2/d}\bigr)$ if $\partial M = \emptyset$ and $O\bigl((\log n/n)^{2/(d+1)}\bigr)$ if $\partial M \neq \emptyset$, up to logarithmic factors. In the process, we develop a Voronoi-based procedure that makes it possible to identify enough points $O\bigl((\log n/n)^{2/(d+1)}\bigr)$-close to $\partial M$ to reconstruct it.
Submitted 10 March, 2023; v1 submitted 6 August, 2021;
originally announced August 2021.
-
Optimal quantization of the mean measure and applications to statistical learning
Authors:
Frédéric Chazal,
Clément Levrard,
Martin Royer
Abstract:
This paper addresses the case where data come as point sets, or more generally as discrete measures. Our motivation is twofold. First, we intend to approximate with a compactly supported measure the mean of the measure-generating process, which coincides with the intensity measure in the point process framework, or with the expected persistence diagram in the framework of persistence-based topological data analysis. To this end, we provide two algorithms that we prove almost minimax optimal. Second, we build from the estimator of the mean measure a vectorization map that sends every measure into a finite-dimensional Euclidean space, and investigate its properties through a clustering-oriented lens. In a nutshell, we show that in a mixture of measure-generating processes, our technique yields a representation in $\mathbb{R}^k$, for $k \in \mathbb{N}^*$, that guarantees a good clustering of the data points with high probability. Interestingly, our results apply in the framework of persistence-based shape classification via the ATOL procedure described in \cite{Royer19}.
Submitted 18 March, 2021; v1 submitted 4 February, 2020;
originally announced February 2020.
-
Robust Bregman Clustering
Authors:
Aurélie Fischer,
Clément Levrard,
Claire Brécheteau
Abstract:
Using a trimming approach, we investigate a k-means type method based on Bregman divergences for clustering data possibly corrupted with clutter noise. The main interest of Bregman divergences is that the standard Lloyd algorithm adapts to these distortion measures, and they are well-suited for clustering data sampled according to mixture models from exponential families. We prove that there exists an optimal codebook, and that an empirically optimal codebook converges a.s. to an optimal codebook in the distortion sense. Moreover, we obtain the sub-Gaussian rate of convergence for k-means, $1/\sqrt{n}$, under mild tail assumptions. Also, we derive a Lloyd-type algorithm with a trimming parameter that can be selected from data according to some heuristic, and present some experimental results.
Submitted 9 September, 2020; v1 submitted 11 December, 2018;
originally announced December 2018.
-
The k-PDTM : a coreset for robust geometric inference
Authors:
Claire Brécheteau,
Clément Levrard
Abstract:
Analyzing the sub-level sets of the distance to a compact sub-manifold of $\mathbb{R}^d$ is a common method in TDA to understand its topology. The distance to measure (DTM) was introduced by Chazal, Cohen-Steiner and Mérigot in [7] to address the non-robustness of the distance to a compact set to noise and outliers. This function makes it possible to infer the topology of a compact subset of $\mathbb{R}^d$ from a noisy cloud of $n$ points lying nearby in the Wasserstein sense. In practice, these sub-level sets may be computed using approximations of the DTM such as the q-witnessed distance [10] or other power distances [6]. These approaches eventually lead to computing the homology of unions of $n$ growing balls, which might become intractable whenever $n$ is large. To simultaneously address the two problems of a large number of points and noise, we introduce the k-power distance to measure (k-PDTM). This new approximation of the distance to measure may be thought of as a k-coreset based approximation of the DTM. Its sublevel sets consist of unions of $k$ balls, $k \ll n$, and this distance is also proved robust to noise. We assess the quality of this approximation for $k$ possibly dramatically smaller than $n$; for instance, $k = n^{1/3}$ is proved to be optimal for 2-dimensional shapes. We also provide an algorithm to compute this k-PDTM.
Submitted 31 January, 2018;
originally announced January 2018.
-
Quantization/clustering: when and why does k-means work?
Authors:
Clément Levrard
Abstract:
Though mostly used as a clustering algorithm, k-means was originally designed as a quantization algorithm. Namely, it aims at providing a compression of a probability distribution with k points. Building upon [21, 33], we investigate how and when these two approaches are compatible. Namely, we show that provided the sample distribution satisfies a margin-like condition (in the sense of [27] for supervised learning), both the associated empirical risk minimizer and the output of Lloyd's algorithm provide almost optimal classification in certain cases (in the sense of [6]). Besides, we also show that they achieve fast and optimal convergence rates in terms of sample size and compression risk.
Submitted 30 January, 2018; v1 submitted 11 January, 2018;
originally announced January 2018.
-
Non-Asymptotic Rates for Manifold, Tangent Space, and Curvature Estimation
Authors:
Eddie Aamari,
Clément Levrard
Abstract:
Given an $n$-sample drawn on a submanifold $M \subset \mathbb{R}^D$, we derive optimal rates for the estimation of tangent spaces $T_X M$, the second fundamental form $II_X^M$, and the submanifold $M$. After motivating their study, we introduce a quantitative class of $\mathcal{C}^k$-submanifolds in analogy with Hölder classes. The proposed estimators are based on local polynomials and make it possible to deal simultaneously with the three problems at stake. Minimax lower bounds are derived using a conditional version of Assouad's lemma when the base point $X$ is random.
Submitted 5 February, 2018; v1 submitted 2 May, 2017;
originally announced May 2017.
-
Stability and Minimax Optimality of Tangential Delaunay Complexes for Manifold Reconstruction
Authors:
Eddie Aamari,
Clément Levrard
Abstract:
We consider the problem of optimality in manifold reconstruction. A random sample $\mathbb{X}_n = \left\{X_1,\ldots,X_n\right\}\subset \mathbb{R}^D$ composed of points close to a $d$-dimensional submanifold $M$, with or without outliers drawn in the ambient space, is observed. Based on the Tangential Delaunay Complex, we construct an estimator $\hat{M}$ that is ambient isotopic and Hausdorff-close to $M$ with high probability. The estimator $\hat{M}$ is built from existing algorithms. In a model with additive noise of small amplitude, we show that this estimator is asymptotically minimax optimal for the Hausdorff distance over a class of submanifolds satisfying a reach constraint. Therefore, even with no a priori information on the tangent spaces of $M$, our estimator based on Tangential Delaunay Complexes is optimal. This shows that the optimal rate of convergence can be achieved through existing algorithms. A similar result is also derived in a model with outliers. A geometric interpolation result is derived, showing that the Tangential Delaunay Complex is stable with respect to noise and perturbations of the tangent spaces. In the process, a decluttering procedure and a tangent space estimator both based on local principal component analysis (PCA) are studied.
Submitted 31 January, 2018; v1 submitted 9 December, 2015;
originally announced December 2015.
-
Sparse Oracle Inequalities for Variable Selection via Regularized Quantization
Authors:
Clément Levrard
Abstract:
We give oracle inequalities on procedures that combine quantization and variable selection via a weighted Lasso $k$-means type algorithm. The results are derived for a general family of weights, which can be tuned to size the influence of the variables in different ways. Moreover, these theoretical guarantees are proved to adapt to the corresponding sparsity of the optimal codebooks, if appropriate. Even if there is no sparsity assumption on the optimal codebooks, our procedure is proved to be close to a sparse approximation of the optimal codebooks, as has been done for the Generalized Linear Models in regression. If the optimal codebooks have a sparse support, we also show that this support can be asymptotically recovered, giving an asymptotic upper bound on the probability of misclassification. These results are illustrated with Gaussian mixture models in arbitrary dimension with sparsity assumptions on the means, which are standard distributions in model-based clustering.
Submitted 6 July, 2016; v1 submitted 12 June, 2014;
originally announced June 2014.
-
Nonasymptotic bounds for vector quantization in Hilbert spaces
Authors:
Clément Levrard
Abstract:
Recent results in quantization theory show that the mean-squared expected distortion can reach a rate of convergence of $\mathcal{O}(1/n)$, where $n$ is the sample size [see, e.g., IEEE Trans. Inform. Theory 60 (2014) 7279-7292 or Electron. J. Stat. 7 (2013) 1716-1746]. This rate is attained for the empirical risk minimizer strategy, if the source distribution satisfies some regularity conditions. However, the dependency of the average distortion on other parameters is not known, and these results are only valid for distributions over finite-dimensional Euclidean spaces. This paper deals with the general case of distributions over separable, possibly infinite dimensional, Hilbert spaces. A condition is proposed, which may be thought of as a margin condition [see, e.g., Ann. Statist. 27 (1999) 1808-1829], under which a nonasymptotic upper bound on the expected distortion rate of the empirically optimal quantizer is derived. The dependency of the distortion on other parameters of distributions is then discussed, in particular through a minimax lower bound.
Submitted 1 April, 2015; v1 submitted 26 May, 2014;
originally announced May 2014.
-
Margin conditions for vector quantization
Authors:
Clément Levrard
Abstract:
In this report, oracle inequalities on the excess risk of the empirical risk minimizer in the quantization framework are derived. These inequalities are based on conditions which may be thought of as margin-type conditions, such as the one derived in the statistical learning framework. Furthermore, these inequalities derive from innovative chaining techniques and their use for Dudley's entropy integral.
Submitted 25 April, 2014; v1 submitted 26 October, 2013;
originally announced October 2013.
-
Fast rates for empirical vector quantization
Authors:
Clément Levrard
Abstract:
We consider the rate of convergence of the expected loss of empirically optimal vector quantizers. Earlier results show that the mean-squared expected distortion for any fixed distribution supported on a bounded set and satisfying some regularity conditions decreases at the rate O(log n/n). We prove that this rate is actually O(1/n). Although these conditions are hard to check, we show that well-polarized distributions with continuous densities supported on a bounded set are included in the scope of this result.
Submitted 29 January, 2012;
originally announced January 2012.