Search | arXiv e-print repository

Spectral complexity of deep neural networks

Authors: Simmaco Di Lillo, Domenico Marinucci, Michele Salvi, Stefano Vigogna

Abstract: It is well-known that randomly initialized, push-forward, fully-connected neural networks weakly converge to isotropic Gaussian processes, in the limit where the width of all layers goes to infinity. In this paper, we propose to use the angular power spectrum of the limiting field to characterize the complexity of the network architecture. In particular, we define sequences of random variables associated with the angular power spectrum, and provide a full characterization of the network complexity in terms of the asymptotic distribution of these sequences as the depth diverges. On this basis, we classify neural networks as low-disorder, sparse, or high-disorder; we show how this classification highlights a number of distinct features for standard activation functions, and in particular, sparsity properties of ReLU networks. Our theoretical results are also validated by numerical simulations.

Submitted 27 June, 2024; v1 submitted 15 May, 2024; originally announced May 2024.

MSC Class: 68T07; 60G60; 33C55; 62M15

arXiv:2403.08750 [pdf, ps, other]

Neural reproducing kernel Banach spaces and representer theorems for deep networks

Authors: Francesca Bartolucci, Ernesto De Vito, Lorenzo Rosasco, Stefano Vigogna

Abstract: Studying the function spaces defined by neural networks helps to understand the corresponding learning models and their inductive bias. While in some limits neural networks correspond to function spaces that are reproducing kernel Hilbert spaces, these regimes do not capture the properties of the networks used in practice. In contrast, in this paper we show that deep neural networks define suitable reproducing kernel Banach spaces. These spaces are equipped with norms that enforce a form of sparsity, enabling them to adapt to potential latent structures within the input data and their representations. In particular, leveraging the theory of reproducing kernel Banach spaces, combined with variational results, we derive representer theorems that justify the finite architectures commonly employed in applications. Our study extends analogous results for shallow networks and can be seen as a step towards considering more practically plausible neural architectures.

Submitted 13 March, 2024; originally announced March 2024.

arXiv:2306.16932 [pdf, ps, other]

A Quantitative Functional Central Limit Theorem for Shallow Neural Networks

Authors: Valentina Cammarota, Domenico Marinucci, Michele Salvi, Stefano Vigogna

Abstract: We prove a Quantitative Functional Central Limit Theorem for one-hidden-layer neural networks with generic activation function. The rates of convergence that we establish depend heavily on the smoothness of the activation function, and they range from logarithmic in non-differentiable cases such as the ReLU to $\sqrt{n}$ for very regular activations. Our main tools are functional versions of the Stein-Malliavin approach; in particular, we exploit heavily a quantitative functional central limit theorem which has been recently established by Bourguin and Campese (2020).

Submitted 5 July, 2023; v1 submitted 29 June, 2023; originally announced June 2023.

MSC Class: 60F17; 68T07; 60G60

arXiv:2305.16014 [pdf, other]

How many samples are needed to leverage smoothness?

Authors: Vivien Cabannes, Stefano Vigogna

Abstract: A core principle in statistical learning is that smoothness of target functions makes it possible to break the curse of dimensionality. However, learning a smooth function seems to require enough samples close to one another to get meaningful estimates of high-order derivatives, which would be hard in machine learning problems where the ratio between number of data and input dimension is relatively small. By deriving new lower bounds on the generalization error, this paper formalizes such an intuition, before investigating the role of constants and transitory regimes which are usually not depicted beyond classical learning theory statements while they play a dominant role in practice.

Submitted 16 October, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

Comments: 34 pages, 13 figures

MSC Class: 68T05

ACM Class: I.2.6; F.2.2; G.3

Journal ref: NeurIPS 2023

arXiv:2205.10055 [pdf, other]

A Case of Exponential Convergence Rates for SVM

Authors: Vivien Cabannes, Stefano Vigogna

Abstract: Classification is often the first problem described in introductory machine learning classes. Generalization guarantees of classification have historically been offered by Vapnik-Chervonenkis theory. Yet those guarantees are based on intractable algorithms, which has led to the theory of surrogate methods in classification. Guarantees offered by surrogate methods are based on calibration inequalities, which have been shown to be highly sub-optimal under some margin conditions, falling short of capturing exponential convergence phenomena. Those "super" fast rates are becoming well understood for smooth surrogates, but the picture remains blurry for non-smooth losses such as the hinge loss, associated with the renowned support vector machines. In this paper, we present a simple mechanism to obtain fast convergence rates and we investigate its usage for SVM. In particular, we show that SVM can exhibit exponential convergence rates even without assuming the hard Tsybakov margin condition.

Submitted 22 May, 2023; v1 submitted 20 May, 2022; originally announced May 2022.

Comments: 16 pages, 6 figures

MSC Class: 68T05

ACM Class: G.3

Journal ref: Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, 2023, PMLR 206:359-374

arXiv:2202.01773 [pdf, other]

Multiclass learning with margin: exponential rates with no bias-variance trade-off

Authors: Stefano Vigogna, Giacomo Meanti, Ernesto De Vito, Lorenzo Rosasco

Abstract: We study the behavior of error bounds for multiclass classification under suitable margin conditions. For a wide variety of methods we prove that the classification error under a hard-margin condition decreases exponentially fast without any bias-variance trade-off. Different convergence rates can be obtained in correspondence of different margin assumptions. With a self-contained and instructive analysis we are able to generalize known results from the binary to the multiclass setting.

Submitted 3 February, 2022; originally announced February 2022.

arXiv:2109.09710 [pdf, ps, other]

Understanding neural networks with reproducing kernel Banach spaces

Authors: Francesca Bartolucci, Ernesto De Vito, Lorenzo Rosasco, Stefano Vigogna

Abstract: Characterizing the function spaces corresponding to neural networks can provide a way to understand their properties. In this paper we discuss how the theory of reproducing kernel Banach spaces can be used to tackle this challenge. In particular, we prove a representer theorem for a wide class of reproducing kernel Banach spaces that admit a suitable integral representation and include one hidden layer neural networks of possibly infinite width. Further, we show that, for a suitable class of ReLU activation functions, the norm in the corresponding reproducing kernel Banach space can be characterized in terms of the inverse Radon transform of a bounded real measure, with norm given by the total variation norm of the measure. Our analysis simplifies and extends recent results in [34,29,30].

Submitted 26 October, 2021; v1 submitted 20 September, 2021; originally announced September 2021.

arXiv:2106.12231 [pdf, ps, other]

ParK: Sound and Efficient Kernel Ridge Regression by Feature Space Partitions

Authors: Luigi Carratino, Stefano Vigogna, Daniele Calandriello, Lorenzo Rosasco

Abstract: We introduce ParK, a new large-scale solver for kernel ridge regression. Our approach combines partitioning with random projections and iterative optimization to reduce space and time complexity while provably maintaining the same statistical accuracy. In particular, constructing suitable partitions directly in the feature space rather than in the input space, we promote orthogonality between the local estimators, thus ensuring that key quantities such as local effective dimension and bias remain under control. We characterize the statistical-computational tradeoff of our model, and demonstrate the effectiveness of our method by numerical experiments on large-scale datasets.

Submitted 17 October, 2022; v1 submitted 23 June, 2021; originally announced June 2021.

arXiv:2101.05119 [pdf, ps, other]

Multiscale regression on unknown manifolds

Authors: Wenjing Liao, Mauro Maggioni, Stefano Vigogna

Abstract: We consider the regression problem of estimating functions on $\mathbb{R}^D$ but supported on a $d$-dimensional manifold $ \mathcal{M} \subset \mathbb{R}^D $ with $ d \ll D $. Drawing ideas from multi-resolution analysis and nonlinear approximation, we construct low-dimensional coordinates on $\mathcal{M}$ at multiple scales, and perform multiscale regression by local polynomial fitting. We propose a data-driven wavelet thresholding scheme that automatically adapts to the unknown regularity of the function, allowing for efficient estimation of functions exhibiting nonuniform regularity at different locations and scales. We analyze the generalization error of our method by proving finite sample bounds in high probability on rich classes of priors. Our estimator attains optimal learning rates (up to logarithmic factors) as if the function was defined on a known Euclidean domain of dimension $d$, instead of an unknown manifold embedded in $\mathbb{R}^D$. The implemented algorithm has quasilinear complexity in the sample size, with constants linear in $D$ and exponential in $d$. Our work therefore establishes a new framework for regression on low-dimensional sets embedded in high dimensions, with fast implementation and strong theoretical guarantees.

Submitted 13 January, 2021; originally announced January 2021.

arXiv:2006.09870 [pdf, ps, other]

Construction and Monte Carlo estimation of wavelet frames generated by a reproducing kernel

Authors: Ernesto De Vito, Zeljko Kereta, Valeriya Naumova, Lorenzo Rosasco, Stefano Vigogna

Abstract: We introduce a construction of multiscale tight frames on general domains. The frame elements are obtained by spectral filtering of the integral operator associated with a reproducing kernel. Our construction extends classical wavelets as well as generalized wavelets on both continuous and discrete non-Euclidean structures such as Riemannian manifolds and weighted graphs. Moreover, it allows one to study the relation between continuous and discrete frames in a random sampling regime, where discrete frames can be seen as Monte Carlo estimates of the continuous ones. Pairing spectral regularization with learning theory, we show that a sample frame tends to its population counterpart, and derive explicit finite-sample rates on spaces of Sobolev and Besov regularity. Our results prove the stability of frames constructed on empirical data, in the sense that all stochastic discretizations have the same underlying limit regardless of the set of initial training samples.

Submitted 8 March, 2021; v1 submitted 17 June, 2020; originally announced June 2020.

MSC Class: 42C15; 42C40; 65T60; 46E22; 47A52; 68T05

arXiv:2003.04788 [pdf, other]

Estimating multi-index models with response-conditional least squares

Authors: Timo Klock, Alessandro Lanteri, Stefano Vigogna

Abstract: The multi-index model is a simple yet powerful high-dimensional regression model which circumvents the curse of dimensionality assuming $ \mathbb{E} [ Y | X ] = g(A^\top X) $ for some unknown index space $A$ and link function $g$. In this paper we introduce a method for the estimation of the index space, and study the propagation error of an index space estimate in the regression of the link function. The proposed method approximates the index space by the span of linear regression slope coefficients computed over level sets of the data. Being based on ordinary least squares, our approach is easy to implement and computationally efficient. We prove a tight concentration bound that shows $N^{-1/2}$-convergence, but also faithfully describes the dependence on the chosen partition of level sets, hence giving indications on the hyperparameter tuning. The estimator's competitiveness is confirmed by extensive comparisons with state-of-the-art methods, both on synthetic and real data sets. As a second contribution, we establish minimax optimal generalization bounds for k-nearest neighbors and piecewise polynomial regression when trained on samples projected onto any $N^{-1/2}$-consistent estimate of the index space, thus providing complete and provable estimation of the multi-index model.

Submitted 3 June, 2020; v1 submitted 10 March, 2020; originally announced March 2020.

Comments: 30 pages, 13 figures, 1 table

MSC Class: 62G05; 62G08; 62H99

arXiv:2002.10008 [pdf, other]

Conditional regression for single-index models

Authors: Alessandro Lanteri, Mauro Maggioni, Stefano Vigogna

Abstract: The single-index model is a statistical model for intrinsic regression where responses are assumed to depend on a single yet unknown linear combination of the predictors, allowing one to express the regression function as $ \mathbb{E} [ Y | X ] = f ( \langle v , X \rangle ) $ for some unknown index vector $v$ and link function $f$. Conditional methods provide a simple and effective approach to estimate $v$ by averaging moments of $X$ conditioned on $Y$, but depend on parameters whose optimal choice is unknown and do not provide generalization bounds on $f$. In this paper we propose a new conditional method converging at $\sqrt{n}$ rate under an explicit parameter characterization. Moreover, we prove that polynomial partitioning estimates achieve the $1$-dimensional min-max rate for regression of Hölder functions when combined with any $\sqrt{n}$-convergent index estimator. Overall this yields an estimator for dimension reduction and regression of single-index models that attains statistical optimality in quasilinear time.

Submitted 27 May, 2022; v1 submitted 23 February, 2020; originally announced February 2020.

MSC Class: 62G05 (Primary) 62G08; 62H99 (Secondary)

arXiv:1903.06594 [pdf, ps, other]

Monte Carlo wavelets: a randomized approach to frame discretization

Authors: Zeljko Kereta, Stefano Vigogna, Valeriya Naumova, Lorenzo Rosasco, Ernesto De Vito

Abstract: In this paper we propose and study a family of continuous wavelets on general domains, and a corresponding stochastic discretization that we call Monte Carlo wavelets. First, using tools from the theory of reproducing kernel Hilbert spaces and associated integral operators, we define a family of continuous wavelets by spectral calculus. Then, we propose a stochastic discretization based on Monte Carlo estimates of integral operators. Using concentration of measure results, we establish the convergence of such a discretization and derive convergence rates under natural regularity assumptions.

Submitted 23 October, 2019; v1 submitted 15 March, 2019; originally announced March 2019.

arXiv:1510.04547 [pdf, ps, other]

doi 10.1142/S021953051750004X

Continuous and discrete frames generated by the evolution flow of the Schrödinger equation

Authors: Giovanni S. Alberti, Stephan Dahlke, Filippo De Mari, Ernesto De Vito, Stefano Vigogna

Abstract: We study a family of coherent states, called Schrödingerlets, both in the continuous and discrete setting. They are defined in terms of the Schrödinger equation of a free quantum particle and some of its invariant transformations.

Submitted 9 December, 2016; v1 submitted 15 October, 2015; originally announced October 2015.

Comments: 20 pages

Report number: SAM Reports, 2015-29

MSC Class: 22D10; 42C40; 42C15

Journal ref: Anal. Appl. 15, 915, 2017

arXiv:1403.1396 [pdf, ps, other]

Intrinsic Localization of Anisotropic Frames II: $α$-Molecules

Authors: Philipp Grohs, Stefano Vigogna

Abstract: This article is a continuation of the recent paper [Grohs, Intrinsic localization of anisotropic frames, ACHA, 2013], where off-diagonal-decay properties (often referred to as 'localization' in the literature) of Moore-Penrose pseudoinverses of (bi-infinite) matrices are established, whenever the latter possess similar off-diagonal-decay properties. This problem is especially interesting if the matrix arises as a discretization of an operator with respect to a frame or basis. Previous work on this problem has been restricted to wavelet- or Gabor frames. In the previous work we extended these results to frames of parabolic molecules, including curvelets or shearlets as special cases. The present paper extends and unifies these results by establishing analogous properties for frames of $α$-molecules as introduced in recent work [Grohs, Keiper, Kutyniok, Schäfer, Alpha molecules: curvelets, shearlets, ridgelets, and beyond, Proc. SPIE. 8858, 2013]. Since wavelets, curvelets, shearlets, ridgelets and hybrid shearlets all constitute instances of $α$-molecules, our results establish localization properties for all these systems simultaneously.

Submitted 6 March, 2014; originally announced March 2014.

Comments: 16 pages

MSC Class: Primary 41AXX; Secondary 41A25; 53B; 22E

arXiv:1402.5833 [pdf, other]

Geometric classification of semidirect products in the maximal parabolic subgroup of $\operatorname{Sp}(2,\mathbb{R})$

Authors: Filippo De Mari, Ernesto De Vito, Stefano Vigogna

Abstract: We classify up to conjugation by $\operatorname{GL}(2,\mathbb{R})$ (more precisely, block diagonal symplectic matrices) all the semidirect products inside the maximal parabolic of $\operatorname{Sp}(2,\mathbb{R})$ by means of an essentially geometric argument. This classification has already been established without geometry, under a stricter notion of equivalence, namely conjugation by arbitrary symplectic matrices. The present approach might be useful in higher dimensions and provides some insight.

Submitted 24 February, 2014; originally announced February 2014.

Comments: 11 pages, 1 figure

arXiv:1402.3917 [pdf, other]

Coorbit spaces with voice in a Fréchet space

Authors: Stephan Dahlke, Filippo De Mari, Ernesto De Vito, Demetrio Labate, Gabrielle Steidl, Gerd Teschke, Stefano Vigogna

Abstract: We set up a new general coorbit space theory for reproducing representations of a locally compact second countable group $G$ that are not necessarily irreducible nor integrable. Our basic assumption is that the kernel associated with the voice transform belongs to a Fréchet space $\mathcal T$ of functions on $G$, which generalizes the classical choice $\mathcal T=L_w^1(G)$. Our basic example is $ \mathcal T=\bigcap_{p\in(1,+\infty)} L^p(G)$, or a weighted version of it. By means of this choice it is possible to treat, for instance, Paley-Wiener spaces and coorbit spaces related to Shannon wavelets and Schrödingerlets.

Submitted 17 February, 2014; originally announced February 2014.

Comments: 52 pages, 1 figure

MSC Class: 43A15; 42B35; 22D10; 46A04; 46F05

Showing 1–17 of 17 results for author: Vigogna, S