-
Multivariate Gaussian Approximation for Random Forest via Region-based Stabilization
Authors:
Zhaoyang Shi,
Chinmoy Bhattacharjee,
Krishnakumar Balasubramanian,
Wolfgang Polonik
Abstract:
We derive Gaussian approximation bounds for random forest predictions based on a set of training points given by a Poisson process, under fairly mild regularity assumptions on the data generating process. Our approach is based on the key observation that the random forest predictions satisfy a certain geometric property called region-based stabilization. In the process of developing our results for the random forest, we also establish a probabilistic result, which might be of independent interest, on multivariate Gaussian approximation bounds for general functionals of Poisson process that are region-based stabilizing. This general result makes use of the Malliavin-Stein method, and is potentially applicable to various related statistical problems.
Submitted 25 March, 2024; v1 submitted 14 March, 2024;
originally announced March 2024.
-
Nonsmooth Nonparametric Regression via Fractional Laplacian Eigenmaps
Authors:
Zhaoyang Shi,
Krishnakumar Balasubramanian,
Wolfgang Polonik
Abstract:
We develop nonparametric regression methods for the case when the true regression function is not necessarily smooth. More specifically, our approach uses the fractional Laplacian and is designed to handle the case when the true regression function lies in an $L_2$-fractional Sobolev space with order $s\in (0,1)$. This function class is a Hilbert space lying between the space of square-integrable functions and the first-order Sobolev space consisting of differentiable functions. It contains fractional power functions, piecewise constant or polynomial functions and bump functions as canonical examples. For the proposed approach, we prove upper bounds on the in-sample mean-squared estimation error of order $n^{-\frac{2s}{2s+d}}$, where $d$ is the dimension, $s$ is the aforementioned order parameter and $n$ is the number of observations. We also provide preliminary empirical results validating the practical performance of the developed estimators.
Submitted 22 February, 2024;
originally announced February 2024.
-
Adaptive and non-adaptive minimax rates for weighted Laplacian-eigenmap based nonparametric regression
Authors:
Zhaoyang Shi,
Krishnakumar Balasubramanian,
Wolfgang Polonik
Abstract:
We show both adaptive and non-adaptive minimax rates of convergence for a family of weighted Laplacian-Eigenmap based nonparametric regression methods, when the true regression function belongs to a Sobolev space and the sampling density is bounded from above and below. The adaptation methodology is based on extensions of Lepski's method and is over both the smoothness parameter ($s\in\mathbb{N}_{+}$) and the norm parameter ($M>0$) determining the constraints on the Sobolev space. Our results extend the non-adaptive result in \cite{green2021minimax}, established for a specific normalized graph Laplacian, to a wide class of weighted Laplacian matrices used in practice, including the unnormalized Laplacian and random walk Laplacian.
Submitted 31 October, 2023;
originally announced November 2023.
-
A Flexible Approach for Normal Approximation of Geometric and Topological Statistics
Authors:
Zhaoyang Shi,
Krishnakumar Balasubramanian,
Wolfgang Polonik
Abstract:
We derive normal approximation results for a class of stabilizing functionals of binomial or Poisson point processes that are not necessarily expressible as sums of certain score functions. Our approach is based on a flexible notion of the add-one cost operator, which helps one to deal with the second-order cost operator via suitable first-order operators. We combine this flexible notion with the theory of strong stabilization to establish our results. We illustrate the applicability of our results by establishing normal approximation results for certain geometric and topological statistics arising frequently in practice. Several existing results also emerge as special cases of our approach.
Submitted 19 October, 2022;
originally announced October 2022.
-
Topologically penalized regression on manifolds
Authors:
Olympio Hacquard,
Krishnakumar Balasubramanian,
Gilles Blanchard,
Clément Levrard,
Wolfgang Polonik
Abstract:
We study a regression problem on a compact manifold M. In order to take advantage of the underlying geometry and topology of the data, the regression task is performed on the basis of the first several eigenfunctions of the Laplace-Beltrami operator of the manifold, that are regularized with topological penalties. The proposed penalties are based on the topology of the sub-level sets of either the eigenfunctions or the estimated function. The overall approach is shown to yield promising and competitive performance on various applications to both synthetic and real data sets. We also provide theoretical guarantees on the regression function estimates, on both its prediction error and its smoothness (in a topological sense). Taken together, these results support the relevance of our approach in the case where the targeted function is "topologically smooth".
Submitted 10 June, 2022; v1 submitted 26 October, 2021;
originally announced October 2021.
-
Algorithms for ridge estimation with convergence guarantees
Authors:
Wanli Qiao,
Wolfgang Polonik
Abstract:
The extraction of filamentary structure from a point cloud is discussed. The filaments are modeled as ridge lines or higher dimensional ridges of an underlying density. We propose two novel algorithms, and provide theoretical guarantees for their convergence. We consider the new algorithms as alternatives to the Subspace Constraint Mean Shift (SCMS) algorithm that do not suffer from a shortcoming of the SCMS that is also revealed in this paper.
Submitted 25 April, 2021;
originally announced April 2021.
-
Testing For Global Covariate Effects in Dynamic Interaction Event Networks
Authors:
Alexander Kreiss,
Enno Mammen,
Wolfgang Polonik
Abstract:
In statistical network analysis it is common to observe so-called interaction data. Such data is characterized by actors forming the vertices and interacting along edges of the network, where edges are randomly formed and dissolved over the observation horizon. In addition, covariates are observed and the goal is to model the impact of the covariates on the interactions. We distinguish two types of covariates: global, system-wide covariates (i.e. covariates taking the same value for all individuals, such as seasonality) and local, dyadic covariates modeling interactions between two individuals in the network. Existing continuous time network models are extended to allow for comparing a completely parametric model and a model that is parametric only in the local covariates but has a global non-parametric time component. This allows one, for instance, to test whether global time dynamics can be explained by simple global covariates like weather, seasonality etc. The procedure is applied to a bike-sharing network by using weather and weekdays as global covariates and distances between the bike stations as local covariates.
Submitted 15 June, 2023; v1 submitted 26 March, 2021;
originally announced March 2021.
-
On approximation theorems for the Euler characteristic with applications to the bootstrap
Authors:
Johannes Krebs,
Benjamin Roycraft,
Wolfgang Polonik
Abstract:
We study approximation theorems for the Euler characteristic of the Vietoris-Rips and Čech filtration. The filtration is obtained from a Poisson or binomial sampling scheme in the critical regime. We apply our results to the smooth bootstrap of the Euler characteristic and determine its rate of convergence in the Kantorovich-Wasserstein distance and in the Kolmogorov distance.
Submitted 20 September, 2021; v1 submitted 15 May, 2020;
originally announced May 2020.
-
Bootstrapping Persistent Betti Numbers and Other Stabilizing Statistics
Authors:
Benjamin Roycraft,
Johannes Krebs,
Wolfgang Polonik
Abstract:
The present contribution investigates multivariate bootstrap procedures for general stabilizing statistics, with specific application to topological data analysis. Existing limit theorems for topological statistics prove difficult to use in practice for the construction of confidence intervals, motivating the use of the bootstrap in this capacity. However, the standard nonparametric bootstrap does not directly provide for asymptotically valid confidence intervals in some situations. A smoothed bootstrap procedure, instead, is shown to give consistent estimation in these settings. The present work relates to other general results in the area of stabilizing statistics, including central limit theorems for functionals of Poisson and Binomial processes in the critical regime. Specific statistics considered include the persistent Betti numbers of Čech and Vietoris-Rips complexes over point sets in $\mathbb R^d$, along with Euler characteristics, and the total edge length of the $k$-nearest neighbor graph. Special emphasis is made throughout to weakening the necessary conditions needed to establish bootstrap consistency. In particular, the assumption of a continuous underlying density is not required. A simulation study is provided to assess the performance of the smoothed bootstrap for finite sample sizes, and the method is further applied to the cosmic web dataset from the Sloan Digital Sky Survey (SDSS). Source code is available at github.com/btroycraft/stabilizing_statistics_bootstrap.
Submitted 25 March, 2021; v1 submitted 4 May, 2020;
originally announced May 2020.
-
On the asymptotic normality of persistent Betti numbers
Authors:
Johannes T. N. Krebs,
Wolfgang Polonik
Abstract:
Persistent Betti numbers are a major tool in persistent homology, a subfield of topological data analysis. Many tools in persistent homology rely on the properties of persistent Betti numbers considered as a two-dimensional stochastic process $ (r,s) \mapsto n^{-1/2} (β^{r,s}_q ( \mathcal{K}(n^{1/d} S_n))-\mathbb{E}[β^{r,s}_q ( \mathcal{K}( n^{1/d} S_n))])$. So far, pointwise limit theorems have been established in different set-ups. In particular, the pointwise asymptotic normality of (persistent) Betti numbers has been established for stationary Poisson processes and binomial processes with constant intensity function in the so-called critical (or thermodynamic) regime, see Yogeshwaran et al. [2017] and Hiraoka et al. [2018].
In this contribution, we derive a strong stabilizing property (in the spirit of Penrose and Yukich [2001]) of persistent Betti numbers and generalize the existing results on the asymptotic normality to the multivariate case and to a broader class of underlying Poisson and binomial processes. Most importantly, we show that the multivariate asymptotic normality holds for all pairs $(r,s)$, $0\le r\le s<\infty$, and that it is not affected by percolation effects in the underlying random geometric graph.
Submitted 2 March, 2020; v1 submitted 7 March, 2019;
originally announced March 2019.
-
Nonparametric Confidence Regions for Level Sets: Statistical Properties and Geometry
Authors:
Wanli Qiao,
Wolfgang Polonik
Abstract:
This paper studies and critically discusses the construction of nonparametric confidence regions for density level sets. Methodologies based on both vertical variation and horizontal variation are considered. The investigations provide theoretical insight into the behavior of these confidence regions via large sample theory. We also discuss the geometric relationships underlying the construction of horizontal and vertical methods, and how finite sample performance of these confidence regions is influenced by geometric or topological aspects. These discussions are supported by numerical studies.
Submitted 4 March, 2019;
originally announced March 2019.
-
Multiscale geometric feature extraction for high-dimensional and non-Euclidean data with application
Authors:
Gabriel Chandler,
Wolfgang Polonik
Abstract:
A method for extracting multiscale geometric features from a data cloud is proposed and analyzed. The basic idea is to map each pair of data points into a real-valued feature function defined on $[0,1]$. The construction of these feature functions is heavily based on geometric considerations, which has the benefits of enhancing interpretability. Further statistical analysis is then based on the collection of the feature functions. The potential of the method is illustrated by different applications, including classification of high-dimensional and non-Euclidean data. For continuous data in Euclidean space, our feature functions contain information about the underlying density at a given base point (small scale features), and also about the depth of the base point (large scale feature). As shown by our theoretical investigations, the method combats the curse of dimensionality, and also shows some adaptiveness towards sparsity. Connections to other concepts, such as random set theory, localized depth measures and nonlinear multidimensional scaling, are also explored.
Submitted 12 December, 2019; v1 submitted 26 November, 2018;
originally announced November 2018.
-
On the choice of weight functions for linear representations of persistence diagrams
Authors:
Vincent Divol,
Wolfgang Polonik
Abstract:
Persistence diagrams are efficient descriptors of the topology of a point cloud. As they do not naturally belong to a Hilbert space, standard statistical methods cannot be directly applied to them. Instead, feature maps (or representations) are commonly used for the analysis. A large class of feature maps, which we call linear, depends on some weight functions, the choice of which is a critical issue. An important criterion to choose a weight function is to ensure stability of the feature maps with respect to Wasserstein distances on diagrams. We improve known results on the stability of such maps, and extend it to general weight functions. We also address the choice of the weight function by considering an asymptotic setting; assume that $\mathbb{X}_n$ is an i.i.d. sample from a density on $[0,1]^d$. For the Čech and Rips filtrations, we characterize the weight functions for which the corresponding feature maps converge as $n$ approaches infinity, and by doing so, we prove laws of large numbers for the total persistences of such diagrams. Those two approaches (stability and convergence) lead to the same simple heuristic for tuning weight functions: if the data lies near a $d$-dimensional manifold, then a sensible choice of weight function is the persistence to the power $α$ with $α\geq d$.
Submitted 30 November, 2020; v1 submitted 10 July, 2018;
originally announced July 2018.
-
Neighborhood selection with application to social networks
Authors:
Nana Wang,
Wolfgang Polonik
Abstract:
The topic of this paper is modeling and analyzing dependence in stochastic social networks. Using a latent variable block model allows the analysis of dependence between blocks via the analysis of a latent graphical model. Our approach to the analysis of the graphical model then is based on the idea underlying the neighborhood selection scheme put forward by Meinshausen and Bühlmann (2006). However, because of the latent nature of our model, estimates have to be used in lieu of the unobserved variables. This leads to a novel analysis of graphical models under uncertainty, in the spirit of Rosenbaum et al. (2010), or Belloni et al. (2017). Lasso-based selectors, and a class of Dantzig-type selectors are studied.
Submitted 23 August, 2018; v1 submitted 16 November, 2017;
originally announced November 2017.
-
Nonparametric inference for continuous-time event counting and link-based dynamic network models
Authors:
Alexander Kreiß,
Enno Mammen,
Wolfgang Polonik
Abstract:
A flexible approach for modeling both dynamic event counting and dynamic link-based networks based on counting processes is proposed, and estimation in these models is studied. We consider nonparametric likelihood based estimation of parameter functions via kernel smoothing. The asymptotic behavior of these estimators is rigorously analyzed by allowing the number of nodes to tend to infinity. The finite sample performance of the estimators is illustrated through an empirical analysis of bike share data.
Submitted 28 May, 2019; v1 submitted 10 May, 2017;
originally announced May 2017.
-
Theoretical Analysis of Nonparametric Filament Estimation
Authors:
Wanli Qiao,
Wolfgang Polonik
Abstract:
This paper provides a rigorous study of the nonparametric estimation of filaments or ridge lines of a probability density $f$. Points on the filament are considered as local extrema of the density when traversing the support of $f$ along the integral curve driven by the vector field of second eigenvectors of the Hessian of $f$. We 'parametrize' points on the filaments by such integral curves, and thus both the estimation of integral curves and of filaments will be considered via a plug-in method using kernel density estimation. We establish rates of convergence and asymptotic distribution results for the estimation of both the integral curves and the filaments. The main theoretical result establishes the asymptotic distribution of the uniform deviation of the estimated filament from its theoretical counterpart. This result utilizes the extreme value behavior of non-stationary Gaussian processes indexed by manifolds $M_h, h \in(0,1]$ as $h \to 0$.
Submitted 24 October, 2015;
originally announced October 2015.
-
Extrema of locally stationary Gaussian fields on growing manifolds
Authors:
Wanli Qiao,
Wolfgang Polonik
Abstract:
We consider a class of non-homogeneous, continuous, centered Gaussian random fields $\{X_h(t), t \in {\cal M}_h;\,0 < h \le 1\}$ where ${\cal M}_h$ denotes a rescaled smooth manifold, i.e. ${\cal M}_h = \frac{1}{h} {\cal M},$ and study the limit behavior of the extreme values of these Gaussian random fields when $h$ tends to zero, which means that the manifold is growing. Our main result can be thought of as a generalization of a classical result of Bickel and Rosenblatt (1973a), and also of results by Mikhaleva and Piterbarg (1997).
Submitted 23 October, 2015;
originally announced October 2015.
-
Asymptotic normality of plug-in level set estimates
Authors:
David M. Mason,
Wolfgang Polonik
Abstract:
We establish the asymptotic normality of the $G$-measure of the symmetric difference between the level set and a plug-in-type estimator of it formed by replacing the density in the definition of the level set by a kernel density estimator. Our proof will highlight the efficacy of Poissonization methods in the treatment of large sample theory problems of this kind.
Submitted 7 August, 2009;
originally announced August 2009.
-
Empirical spectral processes for locally stationary time series
Authors:
Rainer Dahlhaus,
Wolfgang Polonik
Abstract:
A time-varying empirical spectral process indexed by classes of functions is defined for locally stationary time series. We derive weak convergence in a function space, and prove a maximal exponential inequality and a Glivenko-Cantelli-type convergence result. The results use conditions based on the metric entropy of the index class. In contrast to related earlier work, no Gaussian assumption is made. As applications, quasi-likelihood estimation, goodness-of-fit testing and inference under model misspecification are discussed. In an extended application, uniform rates of convergence are derived for local Whittle estimates of the parameter curves of locally stationary time series models.
Submitted 9 February, 2009;
originally announced February 2009.
-
Nonparametric quasi-maximum likelihood estimation for Gaussian locally stationary processes
Authors:
Rainer Dahlhaus,
Wolfgang Polonik
Abstract:
This paper deals with nonparametric maximum likelihood estimation for Gaussian locally stationary processes. Our nonparametric MLE is constructed by minimizing a frequency domain likelihood over a class of functions. The asymptotic behavior of the resulting estimator is studied. The results depend on the richness of the class of functions. Both sieve estimation and global estimation are considered. Our results apply, in particular, to estimation under shape constraints. As an example, autoregressive model fitting with a monotonic variance function is discussed in detail, including algorithmic considerations. A key technical tool is the time-varying empirical spectral process indexed by functions. For this process, a Bernstein-type exponential inequality and a central limit theorem are derived. These results for empirical spectral processes are of independent interest.
Submitted 1 August, 2007;
originally announced August 2007.