-
Hebbian learning inspired estimation of the linear regression parameters from queries
Authors:
Johannes Schmidt-Hieber,
Wouter M Koolen
Abstract:
Local learning rules in biological neural networks (BNNs) are commonly referred to as Hebbian learning. [26] links a biologically motivated Hebbian learning rule to a specific zeroth-order optimization method. In this work, we study a variation of this Hebbian learning rule to recover the regression vector in the linear regression model. Zeroth-order optimization methods are known to converge with suboptimal rate for large parameter dimension compared to first-order methods like gradient descent, and are therefore thought to be in general inferior. By establishing upper and lower bounds, we show, however, that such methods achieve near-optimal rates if only queries of the linear regression loss are available. Moreover, we prove that this Hebbian learning rule can achieve considerably faster rates than any non-adaptive method that selects the queries independently of the data.
Submitted 26 September, 2023;
originally announced November 2023.
-
Convergence guarantees for forward gradient descent in the linear regression model
Authors:
Thijs Bos,
Johannes Schmidt-Hieber
Abstract:
Renewed interest in the relationship between artificial and biological neural networks motivates the study of gradient-free methods. Considering the linear regression model with random design, we theoretically analyze in this work the biologically motivated (weight-perturbed) forward gradient scheme that is based on random linear combination of the gradient. If d denotes the number of parameters and k the number of samples, we prove that the mean squared error of this method converges for $k\gtrsim d^2\log(d)$ with rate $d^2\log(d)/k.$ Compared to the dimension dependence d for stochastic gradient descent, an additional factor $d\log(d)$ occurs.
Submitted 20 June, 2024; v1 submitted 26 September, 2023;
originally announced September 2023.
-
Dropout Regularization Versus $\ell_2$-Penalization in the Linear Model
Authors:
Gabriel Clara,
Sophie Langer,
Johannes Schmidt-Hieber
Abstract:
We investigate the statistical behavior of gradient descent iterates with dropout in the linear regression model. In particular, non-asymptotic bounds for the convergence of expectations and covariance matrices of the iterates are derived. The results shed more light on the widely cited connection between dropout and $\ell_2$-regularization in the linear model. We indicate a more subtle relationship, owing to interactions between the gradient descent dynamics and the additional randomness induced by dropout. Further, we study a simplified variant of dropout which does not have a regularizing effect and converges to the least squares estimator.
Submitted 25 April, 2024; v1 submitted 18 June, 2023;
originally announced June 2023.
-
A supervised deep learning method for nonparametric density estimation
Authors:
Thijs Bos,
Johannes Schmidt-Hieber
Abstract:
Nonparametric density estimation is an unsupervised learning problem. In this work we propose a two-step procedure that casts the density estimation problem in the first step into a supervised regression problem. The advantage is that we can afterwards apply supervised learning methods. Compared to the standard nonparametric regression setting, the proposed procedure creates, however, dependence among the training samples. To derive statistical risk bounds, one can therefore not rely on the well-developed theory for i.i.d. data. To overcome this, we prove an oracle inequality for this specific form of data dependence. As an application, it is shown that under a compositional structure assumption on the underlying density, the proposed two-step method achieves convergence rates that are faster than the standard nonparametric rates. A simulation study illustrates the finite sample performance.
Submitted 3 June, 2024; v1 submitted 18 June, 2023;
originally announced June 2023.
-
Lower bounds for the trade-off between bias and mean absolute deviation
Authors:
Alexis Derumigny,
Johannes Schmidt-Hieber
Abstract:
In nonparametric statistics, rate-optimal estimators typically balance bias and stochastic error. The recent work on overparametrization raises the question whether rate-optimal estimators exist that do not obey this trade-off. In this work we consider pointwise estimation in the Gaussian white noise model with regression function $f$ in a class of $β$-Hölder smooth functions. Let 'worst-case' refer to the supremum over all functions $f$ in the Hölder class. It is shown that any estimator with worst-case bias $\lesssim n^{-β/(2β+1)}=: ψ_n$ must necessarily also have a worst-case mean absolute deviation that is lower bounded by $\gtrsim ψ_n.$ To derive the result, we establish abstract inequalities relating the change of expectation for two probability measures to the mean absolute deviation.
Submitted 20 June, 2024; v1 submitted 21 March, 2023;
originally announced March 2023.
-
Codivergences and information matrices
Authors:
Alexis Derumigny,
Johannes Schmidt-Hieber
Abstract:
We propose a new concept of codivergence, which quantifies the similarity between two probability measures $P_1, P_2$ relative to a reference probability measure $P_0$. In the neighborhood of the reference measure $P_0$, a codivergence behaves like an inner product between the measures $P_1 - P_0$ and $P_2 - P_0$. Codivergences of covariance-type and correlation-type are introduced and studied with a focus on two specific correlation-type codivergences, the $χ^2$-codivergence and the Hellinger codivergence. We derive explicit expressions for several common parametric families of probability distributions. For a codivergence, we introduce moreover the divergence matrix as an analogue of the Gram matrix. It is shown that the $χ^2$-divergence matrix satisfies a data-processing inequality.
Submitted 9 May, 2024; v1 submitted 14 March, 2023;
originally announced March 2023.
-
Interpreting learning in biological neural networks as zero-order optimization method
Authors:
Johannes Schmidt-Hieber
Abstract:
Recently, significant progress has been made regarding the statistical understanding of artificial neural networks (ANNs). ANNs are motivated by the functioning of the brain, but differ in several crucial aspects. In particular, the locality in the updating rule of the connection parameters in biological neural networks (BNNs) makes it biologically implausible that the learning of the brain is based on gradient descent. In this work, we look at the brain as a statistical method for supervised learning. The main contribution is to relate the local updating rule of the connection parameters in BNNs to a zero-order optimization method. It is shown that the expected values of the iterates implement a modification of gradient descent.
Submitted 23 March, 2023; v1 submitted 27 January, 2023;
originally announced January 2023.
-
A statistical analysis of an image classification problem
Authors:
Sophie Langer,
Johannes Schmidt-Hieber
Abstract:
The availability of massive image databases resulted in the development of scalable machine learning methods such as convolutional neural networks (CNNs) filtering and processing these data. While the very recent theoretical work on CNNs focuses on standard nonparametric denoising problems, the variability in image classification datasets does, however, not originate from additive noise but from variation of the shape and other characteristics of the same object across different images. To address this problem, we consider a simple supervised classification problem for object detection on grayscale images. While from the function estimation point of view, every pixel is a variable and large images lead to high-dimensional function recovery tasks suffering from the curse of dimensionality, increasing the number of pixels in our image deformation model enhances the image resolution and makes the object classification problem easier. We propose and theoretically analyze two different procedures. The first method estimates the image deformation by support alignment. Under a minimal separation condition, it is shown that perfect classification is possible. The second method fits a CNN to the data. We derive a rate for the misclassification error depending on the sample size and the number of pixels. Both classifiers are empirically compared on images generated from the MNIST handwritten digit database. The obtained results corroborate the theoretical findings.
Submitted 5 June, 2022;
originally announced June 2022.
-
On the inability of Gaussian process regression to optimally learn compositional functions
Authors:
Matteo Giordano,
Kolyan Ray,
Johannes Schmidt-Hieber
Abstract:
We rigorously prove that deep Gaussian process priors can outperform Gaussian process priors if the target function has a compositional structure. To this end, we study information-theoretic lower bounds for posterior contraction rates for Gaussian process regression in a continuous regression model. We show that if the true function is a generalized additive function, then the posterior based on any mean-zero Gaussian process can only recover the truth at a rate that is strictly slower than the minimax rate by a factor that is polynomially suboptimal in the sample size $n$.
Submitted 27 September, 2022; v1 submitted 16 May, 2022;
originally announced May 2022.
-
Local convergence rates of the nonparametric least squares estimator with applications to transfer learning
Authors:
Johannes Schmidt-Hieber,
Petr Zamolodtchikov
Abstract:
Convergence properties of empirical risk minimizers can be conveniently expressed in terms of the associated population risk. To derive bounds for the performance of the estimator under covariate shift, however, pointwise convergence rates are required. Under weak assumptions on the design distribution, it is shown that least squares estimators (LSE) over 1-Lipschitz functions are also minimax rate optimal with respect to a weighted uniform norm, where the weighting accounts in a natural way for the non-uniformity of the design distribution. This implies that although least squares is a global criterion, the LSE adapts locally to the size of the design density. We develop a new indirect proof technique that establishes the local convergence behavior based on a carefully chosen local perturbation of the LSE. The obtained local rates are then applied to analyze the LSE for transfer learning under covariate shift.
Submitted 29 December, 2023; v1 submitted 11 April, 2022;
originally announced April 2022.
-
On generalization bounds for deep networks based on loss surface implicit regularization
Authors:
Masaaki Imaizumi,
Johannes Schmidt-Hieber
Abstract:
The classical statistical learning theory implies that fitting too many parameters leads to overfitting and poor performance. That modern deep neural networks generalize well despite a large number of parameters contradicts this finding and constitutes a major unsolved problem towards explaining the success of deep learning. While previous work focuses on the implicit regularization induced by stochastic gradient descent (SGD), we study here how the local geometry of the energy landscape around local minima affects the statistical properties of SGD with Gaussian gradient noise. We argue that under reasonable assumptions, the local geometry forces SGD to stay close to a low dimensional subspace and that this induces another form of implicit regularization and results in tighter bounds on the generalization error for deep neural networks. To derive generalization error bounds for neural networks, we first introduce a notion of stagnation sets around the local minima and impose a local essential convexity property of the population risk. Under these conditions, lower bounds for SGD to remain in these stagnation sets are derived. If stagnation occurs, we derive a bound on the generalization error of deep neural networks involving the spectral norms of the weight matrices but not the number of network parameters. Technically, our proofs are based on controlling the change of parameter values in the SGD iterates and local uniform convergence of the empirical loss functions based on the entropy of suitable neighborhoods around local minima.
Submitted 16 October, 2022; v1 submitted 12 January, 2022;
originally announced January 2022.
-
Convergence rates of deep ReLU networks for multiclass classification
Authors:
Thijs Bos,
Johannes Schmidt-Hieber
Abstract:
For classification problems, trained deep neural networks return probabilities of class memberships. In this work we study convergence of the learned probabilities to the true conditional class probabilities. More specifically we consider sparse deep ReLU network reconstructions minimizing cross-entropy loss in the multiclass classification setup. Interesting phenomena occur when the class membership probabilities are close to zero. Convergence rates are derived that depend on the near-zero behaviour via a margin-type condition.
Submitted 2 August, 2021;
originally announced August 2021.
-
Posterior contraction for deep Gaussian process priors
Authors:
Gianluca Finocchio,
Johannes Schmidt-Hieber
Abstract:
We study posterior contraction rates for a class of deep Gaussian process priors applied to the nonparametric regression problem under a general composition assumption on the regression function. It is shown that the contraction rates can achieve the minimax convergence rate (up to $\log n$ factors), while being adaptive to the underlying structure and smoothness of the target function. The proposed framework extends the Bayesian nonparametric theory for Gaussian process priors.
Submitted 13 August, 2022; v1 submitted 16 May, 2021;
originally announced May 2021.
-
The Kolmogorov-Arnold representation theorem revisited
Authors:
Johannes Schmidt-Hieber
Abstract:
There is a longstanding debate whether the Kolmogorov-Arnold representation theorem can explain the use of more than one hidden layer in neural networks. The Kolmogorov-Arnold representation decomposes a multivariate function into an interior and an outer function and therefore has indeed a similar structure as a neural network with two hidden layers. But there are distinctive differences. One of the main obstacles is that the outer function depends on the represented function and can be wildly varying even if the represented function is smooth. We derive modifications of the Kolmogorov-Arnold representation that transfer smoothness properties of the represented function to the outer function and can be well approximated by ReLU networks. It appears that instead of two hidden layers, a more natural interpretation of the Kolmogorov-Arnold representation is that of a deep neural network where most of the layers are required to approximate the interior function.
Submitted 2 January, 2021; v1 submitted 31 July, 2020;
originally announced July 2020.
-
On lower bounds for the bias-variance trade-off
Authors:
Alexis Derumigny,
Johannes Schmidt-Hieber
Abstract:
It is a common phenomenon that for high-dimensional and nonparametric statistical models, rate-optimal estimators balance squared bias and variance. Although this balancing is widely observed, little is known whether methods exist that could avoid the trade-off between bias and variance. We propose a general strategy to obtain lower bounds on the variance of any estimator with bias smaller than a prespecified bound. This shows to which extent the bias-variance trade-off is unavoidable and allows to quantify the loss of performance for methods that do not obey it. The approach is based on a number of abstract lower bounds for the variance involving the change of expectation with respect to different probability measures as well as information measures such as the Kullback-Leibler or $χ^2$-divergence. In a second part of the article, the abstract lower bounds are applied to several statistical models including the Gaussian white noise model, a boundary estimation problem, the Gaussian sequence model and the high-dimensional linear regression model. For these specific statistical applications, different types of bias-variance trade-offs occur that vary considerably in their strength. For the trade-off between integrated squared bias and integrated variance in the Gaussian white noise model, we propose to combine the general strategy for lower bounds with a reduction technique. This allows us to reduce the original problem to a lower bound on the bias-variance trade-off for estimators with additional symmetry properties in a simpler statistical model. In the Gaussian sequence model, different phase transitions of the bias-variance trade-off occur. Although there is a non-trivial interplay between bias and variance, the rate of the squared bias and the variance do not have to be balanced in order to achieve the minimax estimation rate.
Submitted 20 March, 2023; v1 submitted 30 May, 2020;
originally announced June 2020.
-
On frequentist coverage of Bayesian credible sets for estimation of the mean under constraints
Authors:
Kevin Duisters,
Johannes Schmidt-Hieber
Abstract:
Frequentist coverage of $(1-α)$-highest posterior density (HPD) credible sets is studied in a signal plus noise model under a large class of noise distributions. We consider a specific class of spike-and-slab prior distributions. Different regimes are identified and we derive closed form expressions for the $(1-α)$-HPD on each of these regimes. Similar to the earlier work by Marchand and Strawderman, it is shown that under suitable conditions, the frequentist coverage can drop to $1-3α/2.$
Submitted 9 March, 2020;
originally announced March 2020.
-
Deep ReLU network approximation of functions on a manifold
Authors:
Johannes Schmidt-Hieber
Abstract:
Whereas recovery of the manifold from data is a well-studied topic, approximation rates for functions defined on manifolds are less known. In this work, we study a regression problem with inputs on a $d^*$-dimensional manifold that is embedded into a space with potentially much larger ambient dimension. It is shown that sparsely connected deep ReLU networks can approximate a Hölder function with smoothness index $β$ up to error $ε$ using of the order of $ε^{-d^*/β}\log(1/ε)$ many non-zero network parameters. As an application, we derive statistical convergence rates for the estimator minimizing the empirical risk over all possible choices of bounded network parameters.
Submitted 2 August, 2019;
originally announced August 2019.
-
Bayesian variance estimation in the Gaussian sequence model with partial information on the means
Authors:
Gianluca Finocchio,
Johannes Schmidt-Hieber
Abstract:
Consider the Gaussian sequence model under the additional assumption that a fixed fraction of the means is known. We study the problem of variance estimation from a frequentist Bayesian perspective. The maximum likelihood estimator (MLE) for $σ^2$ is biased and inconsistent. This raises the question whether the posterior is able to correct the MLE in this case. By developing a new proving strategy that uses refined properties of the posterior distribution, we find that the marginal posterior is inconsistent for any i.i.d. prior on the mean parameters. In particular, no assumption on the decay of the prior needs to be imposed. Surprisingly, we also find that consistency can be retained for a hierarchical prior based on Gaussian mixtures. In this case we also establish a limiting shape result and determine the limit distribution. In contrast to the classical Bernstein-von Mises theorem, the limit is non-Gaussian. We show that the Bayesian analysis leads to new statistical estimators outperforming the correctly calibrated MLE in a numerical simulation study.
Submitted 18 December, 2019; v1 submitted 9 April, 2019;
originally announced April 2019.
-
Nonparametric Bayesian analysis of the compound Poisson prior for support boundary recovery
Authors:
Markus Reiss,
Johannes Schmidt-Hieber
Abstract:
Given data from a Poisson point process with intensity $(x,y) \mapsto n \mathbf{1}(f(x)\leq y),$ frequentist properties for the Bayesian reconstruction of the support boundary function $f$ are derived. We mainly study compound Poisson process priors with fixed intensity proving that the posterior contracts with nearly optimal rate for monotone and piecewise constant support boundaries and adapts to Hölder smooth boundaries with smoothness index at most one. We then derive a non-standard Bernstein-von Mises result for a compound Poisson process prior and a function space with increasing parameter dimension. As an intermediate result the limiting shape of the posterior for random histogram type priors is obtained. In both settings, it is shown that the marginal posterior of the functional $\vartheta =\int f$ performs an automatic bias correction and contracts with a faster rate than the MLE. In this case, $(1-α)$-credible sets are also asymptotic $(1-α)$-confidence intervals. As a negative result, it is shown that the frequentist coverage of credible sets is lost for linear functions indicating that credible sets only have frequentist coverage for priors that are specifically constructed to match properties of the underlying true function.
Submitted 11 September, 2018;
originally announced September 2018.
-
Posterior analysis of $n$ in the binomial $(n,p)$ problem with both parameters unknown -- with applications to quantitative nanoscopy
Authors:
Johannes Schmidt-Hieber,
Laura Fee Schneider,
Thomas Staudt,
Andrea Krajina,
Timo Aspelmeier,
Axel Munk
Abstract:
Estimation of the population size $n$ from $k$ i.i.d. binomial observations with unknown success probability $p$ is relevant to a multitude of applications and has a long history. Without additional prior information this is a notoriously difficult task when $p$ becomes small, and the Bayesian approach becomes particularly useful. For a large class of priors, we establish posterior contraction and a Bernstein-von Mises type theorem in a setting where $p\rightarrow0$ and $n\rightarrow\infty$ as $k\to\infty$. Furthermore, we suggest a new class of Bayesian estimators for $n$ and provide a comprehensive simulation study in which we investigate their performance. To showcase the advantages of a Bayesian approach on real data, we also benchmark our estimators in a novel application from super-resolution microscopy.
Submitted 16 November, 2020; v1 submitted 7 September, 2018;
originally announced September 2018.
-
A comparison of deep networks with ReLU activation function and linear spline-type methods
Authors:
Konstantin Eckle,
Johannes Schmidt-Hieber
Abstract:
Deep neural networks (DNNs) generate much richer function spaces than shallow networks. Since the function spaces induced by shallow networks have several approximation theoretic drawbacks, this explains, however, not necessarily the success of deep networks. In this article we take another route by comparing the expressive power of DNNs with ReLU activation function to piecewise linear spline methods. We show that MARS (multivariate adaptive regression splines) is improper learnable by DNNs in the sense that for any given function that can be expressed as a function in MARS with $M$ parameters there exists a multilayer neural network with $O(M \log (M/\varepsilon))$ parameters that approximates this function up to sup-norm error $\varepsilon.$ We show a similar result for expansions with respect to the Faber-Schauder system. Based on this, we derive risk comparison inequalities that bound the statistical risk of fitting a neural network by the statistical risk of spline-based methods. This shows that deep networks perform better or only slightly worse than the considered spline methods. We provide a constructive proof for the function approximations.
Submitted 24 September, 2018; v1 submitted 6 April, 2018;
originally announced April 2018.
-
Asymptotic nonequivalence of density estimation and Gaussian white noise for small densities
Authors:
Kolyan Ray,
Johannes Schmidt-Hieber
Abstract:
It is well-known that density estimation on the unit interval is asymptotically equivalent to a Gaussian white noise experiment, provided the densities are sufficiently smooth and uniformly bounded away from zero. We show that a uniform lower bound, whose size we sharply characterize, is in general necessary for asymptotic equivalence to hold.
Submitted 6 November, 2018; v1 submitted 9 February, 2018;
originally announced February 2018.
-
Nonparametric regression using deep neural networks with ReLU activation function
Authors:
Johannes Schmidt-Hieber
Abstract:
Consider the multivariate nonparametric regression model. It is shown that estimators based on sparsely connected deep neural networks with ReLU activation function and properly chosen network architecture achieve the minimax rates of convergence (up to $\log n$-factors) under a general composition assumption on the regression function. The framework includes many well-studied structural constraints such as (generalized) additive models. While there is a lot of flexibility in the network architecture, the tuning parameter is the sparsity of the network. Specifically, we consider large networks with number of potential network parameters exceeding the sample size. The analysis gives some insights into why multilayer feedforward neural networks perform well in practice. Interestingly, for ReLU activation function the depth (number of layers) of the neural network architectures plays an important role and our theory suggests that for nonparametric regression, scaling the network depth with the sample size is natural. It is also shown that under the composition assumption wavelet estimators can only achieve suboptimal rates.
Submitted 13 September, 2020; v1 submitted 22 August, 2017;
originally announced August 2017.
-
Tests for qualitative features in the random coefficients model
Authors:
Fabian Dunker,
Konstantin Eckle,
Katharina Proksch,
Johannes Schmidt-Hieber
Abstract:
The random coefficients model is an extension of the linear regression model that allows for unobserved heterogeneity in the population by modeling the regression coefficients as random variables. Given data from this model, the statistical challenge is to recover information about the joint density of the random coefficients which is a multivariate and ill-posed problem. Because of the curse of dimensionality and the ill-posedness, pointwise nonparametric estimation of the joint density is difficult and suffers from slow convergence rates. Larger features, such as an increase of the density along some direction or a well-accentuated mode can, however, be detected much more easily from data by means of statistical tests. In this article, we follow this strategy and construct tests and confidence statements for qualitative features of the joint density, such as increases, decreases and modes. We propose a multiple testing approach based on aggregating single tests which are designed to extract shape information on fixed scales and directions. Using recent tools for Gaussian approximations of multivariate empirical processes, we derive expressions for the critical value. We apply our method to simulated and real data.
Submitted 13 March, 2018; v1 submitted 4 April, 2017;
originally announced April 2017.
-
Posterior contraction rates for support boundary recovery
Authors:
Markus Reiss,
Johannes Schmidt-Hieber
Abstract:
Given a sample of a Poisson point process with intensity $λ_f(x,y) = n \mathbf{1}(f(x) \leq y),$ we study recovery of the boundary function $f$ from a nonparametric Bayes perspective. Because of the irregularity of this model, the analysis is non-standard. We establish a general result for the posterior contraction rate with respect to the $L^1$-norm based on entropy and one-sided small probability bounds. From this, specific posterior contraction results are derived for Gaussian process priors and priors based on random wavelet series.
Submitted 12 June, 2020; v1 submitted 24 March, 2017;
originally announced March 2017.
-
The Le Cam distance between density estimation, Poisson processes and Gaussian white noise
Authors:
Kolyan Ray,
Johannes Schmidt-Hieber
Abstract:
It is well-known that density estimation on the unit interval is asymptotically equivalent to a Gaussian white noise experiment, provided the densities have Hölder smoothness larger than $1/2$ and are uniformly bounded away from zero. We derive matching lower and constructive upper bounds for the Le Cam deficiencies between these experiments, with explicit dependence on both the sample size and the size of the densities in the parameter space. As a consequence, we derive sharp conditions on how small the densities can be for asymptotic equivalence to hold. The related case of Poisson intensity estimation is also treated.
Submitted 14 April, 2018; v1 submitted 5 August, 2016;
originally announced August 2016.
-
Minimax theory for a class of non-linear statistical inverse problems
Authors:
Kolyan Ray,
Johannes Schmidt-Hieber
Abstract:
We study a class of statistical inverse problems with non-linear pointwise operators motivated by concrete statistical applications. A two-step procedure is proposed, where the first step smoothes the data and inverts the non-linearity. This reduces the initial non-linear problem to a linear inverse problem with deterministic noise, which is then solved in a second step. The noise reduction step is based on wavelet thresholding and is shown to be minimax optimal (up to logarithmic factors) in a pointwise function-dependent sense. Our analysis is based on a modified notion of Hölder smoothness scales that are natural in this setting.
Submitted 11 May, 2016; v1 submitted 1 December, 2015;
originally announced December 2015.
-
A regularity class for the roots of non-negative functions
Authors:
Kolyan Ray,
Johannes Schmidt-Hieber
Abstract:
We investigate the regularity of the positive roots of a non-negative function of one variable. A modified Hölder space $\mathcal{F}^β$ is introduced such that if $f\in \mathcal{F}^β$ then $f^α\in C^{αβ}$. This provides sufficient conditions to overcome the usual limitation in the square root case ($α= 1/2$) for Hölder functions that $f^{1/2}$ need be no more than $C^1$ in general. We also derive bounds on the wavelet coefficients of $f^α$, which provide a finer understanding of its local regularity.
Submitted 16 March, 2017; v1 submitted 30 October, 2015;
originally announced October 2015.
-
Conditions for Posterior Contraction in the Sparse Normal Means Problem
Authors:
Stéphanie van der Pas,
Jean-Bernard Salomond,
Johannes Schmidt-Hieber
Abstract:
The first Bayesian results for the sparse normal means problem were proven for spike-and-slab priors. However, these priors are less convenient from a computational point of view. In the meantime, a large number of continuous shrinkage priors have been proposed. Many of these shrinkage priors can be written as a scale mixture of normals, which makes them particularly easy to implement. We propose general conditions on the prior on the local variance in scale mixtures of normals, such that posterior contraction at the minimax rate is assured. The conditions require tails at least as heavy as Laplace, but not too heavy, and a large amount of mass around zero relative to the tails, more so as the sparsity increases. These conditions give some general guidelines for choosing a shrinkage prior for estimation under a nearly black sparsity assumption. We verify these conditions for the class of priors considered by Ghosh and Chakrabarti (2015), which includes the horseshoe and the normal-exponential gamma priors, and for the horseshoe+, the inverse-Gaussian prior, the normal-gamma prior, and the spike-and-slab Lasso, and thus extend the number of shrinkage priors which are known to lead to posterior contraction at the minimax estimation rate.
Submitted 13 October, 2015; v1 submitted 8 October, 2015;
originally announced October 2015.
-
Bayesian linear regression with sparse priors
Authors:
Ismaël Castillo,
Johannes Schmidt-Hieber,
Aad van der Vaart
Abstract:
We study full Bayesian procedures for high-dimensional linear regression under sparsity constraints. The prior is a mixture of point masses at zero and continuous distributions. Under compatibility conditions on the design matrix, the posterior distribution is shown to contract at the optimal rate for recovery of the unknown sparse vector, and to give optimal prediction of the response vector. It is also shown to select the correct sparse model, or at least the coefficients that are significantly different from zero. The asymptotic shape of the posterior distribution is characterized and employed in the construction and study of credible sets for uncertainty quantification.
Submitted 14 October, 2015; v1 submitted 4 March, 2014;
originally announced March 2014.
-
Asymptotic equivalence for regression under fractional noise
Authors:
Johannes Schmidt-Hieber
Abstract:
Consider estimation of the regression function based on a model with equidistant design and measurement errors generated from a fractional Gaussian noise process. In previous literature, this model has been heuristically linked to an experiment, where the anti-derivative of the regression function is continuously observed under additive perturbation by a fractional Brownian motion. Based on a reformulation of the problem using reproducing kernel Hilbert spaces, we derive abstract approximation conditions on function spaces under which asymptotic equivalence between these models can be established and show that the conditions are satisfied for certain Sobolev balls exceeding some minimal smoothness. Furthermore, we construct a sequence space representation and provide necessary conditions for asymptotic equivalence to hold.
Submitted 1 December, 2014; v1 submitted 2 December, 2013;
originally announced December 2013.
-
Spot volatility estimation for high-frequency data: adaptive estimation in practice
Authors:
Till Sabel,
Johannes Schmidt-Hieber,
Axel Munk
Abstract:
We develop further the spot volatility estimator introduced in Hoffmann, Munk and Schmidt-Hieber (2012) from a practical point of view and make it useful for the analysis of high-frequency financial data. In a first part, we adjust the estimator substantially in order to achieve good finite sample performance and to overcome difficulties arising from violations of the additive microstructure noise model (e.g. jumps, rounding errors). These modifications are justified by simulations. The second part is devoted to investigating the behavior of volatility in response to macroeconomic events. We give evidence that the spot volatility of Euro-BUND futures is considerably higher during press conferences of the European Central Bank. As an outlook, we present an estimator for the spot covolatility of two different prices.
Submitted 24 September, 2013;
originally announced September 2013.
-
On adaptive posterior concentration rates
Authors:
Marc Hoffmann,
Judith Rousseau,
Johannes Schmidt-Hieber
Abstract:
We investigate the problem of deriving posterior concentration rates under different loss functions in nonparametric Bayes. We first provide a lower bound on posterior coverages of shrinking neighbourhoods that relates the metric or loss under which the shrinking neighbourhood is considered, and an intrinsic pre-metric linked to frequentist separation rates. In the Gaussian white noise model, we construct feasible priors based on a spike and slab procedure reminiscent of wavelet thresholding that achieve adaptive rates of contraction under $L^2$ or $L^{\infty}$ metrics when the underlying parameter belongs to a collection of Hölder balls and that moreover achieve our lower bound. We analyse the consequences in terms of asymptotic behaviour of posterior credible balls as well as frequentist minimax adaptive estimation. Our results are appended with an upper bound for the contraction rate under an arbitrary loss in a generic regular experiment. The upper bound is attained for certain sieve priors and makes it possible to extend our results to density estimation.
Submitted 5 November, 2015; v1 submitted 22 May, 2013;
originally announced May 2013.
-
On an estimator achieving the adaptive rate in nonparametric regression under $L^p$-loss for all $1\leq p \leq \infty$
Authors:
Johannes Schmidt-Hieber
Abstract:
Consider nonparametric function estimation under $L^p$-loss. The minimax rate for estimation of the regression function over a Hölder ball with smoothness index $β$ is $n^{-β/(2β+1)}$ if $1\leq p<\infty$ and $(n/\log n)^{-β/(2β+1)}$ if $p=\infty.$ There are many known procedures that either attain this rate for $p=\infty$ but are suboptimal by a $\log n$ factor in the case $p<\infty$ or the other way around. In this article, we construct an estimator that simultaneously achieves the optimal rates under $L^p$-risk for all $1\leq p\leq \infty$ without prior knowledge of $β.$ In contrast to classical wavelet thresholding methods that kill small empirical wavelet coefficients and keep large ones, it is essential for simultaneous adaptation that on each resolution level, the largest empirical wavelet coefficients are truncated. This leads to a completely different point of view on wavelet thresholding. The crucial part in the construction of the estimator is the size of the truncation level which is linked to the unknown smoothness index. Although estimation of the smoothness index is known to be a difficult task, there is a data-driven choice of the truncation level that is sufficiently precise for our purpose.
Submitted 7 February, 2015; v1 submitted 13 March, 2013;
originally announced March 2013.
-
Asymptotically efficient estimation of a scale parameter in Gaussian time series and closed-form expressions for the Fisher information
Authors:
Till Sabel,
Johannes Schmidt-Hieber
Abstract:
Mimicking the maximum likelihood estimator, we construct first order Cramér-Rao efficient and explicitly computable estimators for the scale parameter $σ^2$ in the model $Z_{i,n}=σn^{-β}X_i+Y_i,i=1,\ldots,n,β>0$ with independent, stationary Gaussian processes $(X_i)_{i\in\mathbb{N}}$, $(Y_i)_{i\in\mathbb{N}}$, where $(X_i)_{i\in\mathbb{N}}$ possibly exhibits long-range dependence. In a second part, closed-form expressions for the asymptotic behavior of the corresponding Fisher information are derived. Our main finding is that depending on the behavior of the spectral densities at zero, the Fisher information has asymptotically two different scaling regimes, which are separated by a sharp phase transition. The most prominent example included in our analysis is the Fisher information for the scaling factor of a high-frequency sample of fractional Brownian motion under additive noise.
Submitted 13 March, 2014; v1 submitted 27 August, 2012;
originally announced August 2012.
-
Multiscale Methods for Shape Constraints in Deconvolution: Confidence Statements for Qualitative Features
Authors:
Johannes Schmidt-Hieber,
Axel Munk,
Lutz Duembgen
Abstract:
We derive multiscale statistics for deconvolution in order to detect qualitative features of the unknown density. An important example covered within this framework is to test for local monotonicity on all scales simultaneously. We investigate the moderately ill-posed setting, where the Fourier transform of the error density in the deconvolution model is of polynomial decay. For multiscale testing, we consider a calibration, motivated by the modulus of continuity of Brownian motion. We investigate the performance of our results from both the theoretical and simulation based point of view. A major consequence of our work is that the detection of qualitative features of a density in a deconvolution problem is a doable task although the minimax rates for pointwise estimation are very slow.
Submitted 17 December, 2012; v1 submitted 7 July, 2011;
originally announced July 2011.
-
Adaptive wavelet estimation of the diffusion coefficient under additive error measurements
Authors:
Marc Hoffmann,
Axel Munk,
Johannes Schmidt-Hieber
Abstract:
We study nonparametric estimation of the diffusion coefficient from discrete data, when the observations are blurred by additional noise. Such issues have been developed over the last 10 years in several application fields and in particular in high frequency financial data modelling, however mainly from a parametric and semiparametric point of view. This paper addresses the nonparametric estimation of the path of the (possibly stochastic) diffusion coefficient in a relatively general setting. By developing pre-averaging techniques combined with wavelet thresholding, we construct adaptive estimators that achieve a nearly optimal rate within a large scale of smoothness constraints of Besov type. Since the diffusion coefficient is usually genuinely random, we propose a new criterion to assess the quality of estimation; we retrieve the usual minimax theory when this approach is restricted to a deterministic diffusion coefficient. In particular, we take advantage of recent results of Reiss [33] of asymptotic equivalence between a Gaussian diffusion with additive noise and Gaussian white noise model, in order to prove a sharp lower bound.
Submitted 29 December, 2011; v1 submitted 27 July, 2010;
originally announced July 2010.
-
Lower bounds for volatility estimation in microstructure noise models
Authors:
Axel Munk,
Johannes Schmidt-Hieber
Abstract:
In this paper we derive lower bounds in minimax sense for estimation of the instantaneous volatility if the diffusion type part cannot be observed directly but under some additional Gaussian noise. Three different models are considered. Our technique is based on a general inequality for Kullback-Leibler divergence of multivariate normal random variables and spectral analysis of the processes. The derived lower bounds are indeed optimal. Upper bounds can be found in Munk and Schmidt-Hieber [18]. Our major finding is that the Gaussian microstructure noise introduces an additional degree of ill-posedness for each model, respectively.
Submitted 16 February, 2010;
originally announced February 2010.
-
Nonparametric estimation of the volatility function in a high-frequency model corrupted by noise
Authors:
Axel Munk,
Johannes Schmidt-Hieber
Abstract:
We consider the models $Y_{i,n}=\int_0^{i/n} σ(s)dW_s+τ(i/n)ε_{i,n}$ and $\tilde Y_{i,n}=σ(i/n)W_{i/n}+τ(i/n)ε_{i,n}$, $i=1,\ldots,n$, where $W_t$ denotes a standard Brownian motion and $ε_{i,n}$ are centered i.i.d. random variables with $E(ε_{i,n}^2)=1$ and finite fourth moment. Furthermore, $σ$ and $τ$ are unknown deterministic functions and $W_t$ and $(ε_{1,n},\ldots,ε_{n,n})$ are assumed to be independent processes. Based on a spectral decomposition of the covariance structures we derive series estimators for $σ^2$ and $τ^2$ and investigate their rate of convergence of the MISE in dependence of their smoothness. To this end specific basis functions and their corresponding Sobolev ellipsoids are introduced and we show that our estimators are optimal in minimax sense. Our work is motivated by microstructure noise models. Our major finding is that the microstructure noise $ε_{i,n}$ introduces an additional degree of ill-posedness of $1/2$, irrespective of the tail behavior of $ε_{i,n}$. The method is illustrated by a small numerical study.
Submitted 6 April, 2010; v1 submitted 21 August, 2009;
originally announced August 2009.