On the Minimal Degree Bias in Generalization on the Unseen for non-Boolean Functions

Authors: Denys Pushkin, Raphaël Berthier, Emmanuel Abbe

Abstract: We investigate the out-of-domain generalization of random feature (RF) models and Transformers. We first prove that in the `generalization on the unseen (GOTU)' setting, where training data is fully seen in some part of the domain but testing is made on another part, and for RF models in the small feature regime, the convergence takes place to interpolators of minimal degree as in the Boolean case (Abbe et al., 2023). We then consider the sparse target regime and explain how this regime relates to the small feature regime, but with a different regularization term that can alter the picture in the non-Boolean case. We show two different outcomes for the sparse regime with q-ary data tokens: (1) if the data is embedded with roots of unities, then a min-degree interpolator is learned like in the Boolean case for RF models, (2) if the data is not embedded as such, e.g., simply as integers, then RF models and Transformers may not learn minimal degree interpolators. This shows that the Boolean setting and its roots of unities generalization are special cases where the minimal degree interpolator offers a rare characterization of how learning takes place. For more general integer and real-valued settings, a more nuanced picture remains to be fully characterized.

Submitted 10 June, 2024; originally announced June 2024.

Comments: 9 pages of main body, 24 pages in total, 7 figures. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024

arXiv:2304.09576 [pdf, other]

Leveraging the two timescale regime to demonstrate convergence of neural networks

Authors: Pierre Marion, Raphaël Berthier

Abstract: We study the training dynamics of shallow neural networks, in a two-timescale regime in which the stepsizes for the inner layer are much smaller than those for the outer layer. In this regime, we prove convergence of the gradient flow to a global optimum of the non-convex optimization problem in a simple univariate setting. The number of neurons need not be asymptotically large for our result to hold, distinguishing our result from popular recent approaches such as the neural tangent kernel or mean-field regimes. Experimental illustration is provided, showing that the stochastic gradient descent behaves according to our description of the gradient flow and thus converges to a global optimum in the two-timescale regime, but can fail outside of this regime.

Submitted 25 October, 2023; v1 submitted 19 April, 2023; originally announced April 2023.

Comments: NeurIPS 2023. 34 pages, 10 figures

arXiv:2303.00055 [pdf, other]

Learning time-scales in two-layers neural networks

Authors: Raphaël Berthier, Andrea Montanari, Kangjie Zhou

Abstract: Gradient-based learning in multi-layer neural networks displays a number of striking features. In particular, the decrease rate of empirical risk is non-monotone even after averaging over large batches. Long plateaus in which one observes barely any progress alternate with intervals of rapid decrease. These successive phases of learning often take place on very different time scales. Finally, models learnt in an early phase are typically `simpler' or `easier to learn' although in a way that is difficult to formalize. Although theoretical explanations of these phenomena have been put forward, each of them captures at best certain specific regimes. In this paper, we study the gradient flow dynamics of a wide two-layer neural network in high-dimension, when data are distributed according to a single-index model (i.e., the target function depends on a one-dimensional projection of the covariates). Based on a mixture of new rigorous results, non-rigorous mathematical derivations, and numerical simulations, we propose a scenario for the learning dynamics in this setting. In particular, the proposed evolution exhibits separation of timescales and intermittency. These behaviors arise naturally because the population gradient flow can be recast as a singularly perturbed dynamical system.

Submitted 17 April, 2024; v1 submitted 28 February, 2023; originally announced March 2023.

Comments: 64 pages, 15 figures

MSC Class: 34E15; 37N40; 68T07

arXiv:2208.14673 [pdf, other]

Incremental Learning in Diagonal Linear Networks

Authors: Raphaël Berthier

Abstract: Diagonal linear networks (DLNs) are a toy simplification of artificial neural networks; they consist in a quadratic reparametrization of linear regression inducing a sparse implicit regularization. In this paper, we describe the trajectory of the gradient flow of DLNs in the limit of small initialization. We show that incremental learning is effectively performed in the limit: coordinates are successively activated, while the iterate is the minimizer of the loss constrained to have support on the active coordinates only. This shows that the sparse implicit regularization of DLNs decreases with time. This work is restricted to the underparametrized regime with anti-correlated features for technical reasons.

Submitted 13 November, 2023; v1 submitted 31 August, 2022; originally announced August 2022.

Journal ref: Journal of Machine Learning Research, 2023, 24 (171), pp.1-26

arXiv:2202.10742 [pdf, other]

Acceleration of Gossip Algorithms through the Euler-Poisson-Darboux Equation

Authors: Raphaël Berthier, Mufan Li

Abstract: Gossip algorithms and their accelerated versions have been studied exclusively in discrete time on graphs. In this work, we take a different approach, and consider the scaling limit of gossip algorithms in both large graphs and large number of iterations. These limits lead to well-known partial differential equations (PDEs) with insightful properties. On lattices, we prove that the non-accelerated gossip algorithm of Boyd et al. [2006] converges to the heat equation, and the accelerated Jacobi polynomial iteration of Berthier et al. [2020] converges to the Euler-Poisson-Darboux (EPD) equation - a damped wave equation. Remarkably, with appropriate parameters, the fundamental solution of the EPD equation has the ideal gossip behaviour: a uniform density over an ellipsoid, whose radius increases at a rate proportional to t - the fastest possible rate for locally communicating gossip algorithms. This is in contrast with the heat equation where the density spreads on a typical scale of $\sqrt{t}$. Additionally, we provide simulations demonstrating that the gossip algorithms are accurately approximated by their limiting PDEs.

Submitted 22 February, 2022; originally announced February 2022.

arXiv:2109.11905 [pdf, ps, other]

Graph-based Approximate Message Passing Iterations

Authors: Cédric Gerbelot, Raphaël Berthier

Abstract: Approximate-message passing (AMP) algorithms have become an important element of high-dimensional statistical inference, mostly due to their adaptability and concentration properties, the state evolution (SE) equations. This is demonstrated by the growing number of new iterations proposed for increasingly complex problems, ranging from multi-layer inference to low-rank matrix estimation with elaborate priors. In this paper, we address the following questions: is there a structure underlying all AMP iterations that unifies them in a common framework? Can we use such a structure to give a modular proof of state evolution equations, adaptable to new AMP iterations without reproducing each time the full argument? We propose an answer to both questions, showing that AMP instances can be generically indexed by an oriented graph. This enables to give a unified interpretation of these iterations, independent from the problem they solve, and a way of composing them arbitrarily. We then show that all AMP iterations indexed by such a graph admit rigorous SE equations, extending the reach of previous proofs, and proving a number of recent heuristic derivations of those equations. Our proof naturally includes non-separable functions and we show how existing refinements, such as spatial coupling or matrix-valued variables, can be combined with our framework.

Submitted 19 April, 2022; v1 submitted 24 September, 2021; originally announced September 2021.

Comments: 59 pages, 24 main, 35 appendix

arXiv:2106.07644 [pdf, other]

A Continuized View on Nesterov Acceleration for Stochastic Gradient Descent and Randomized Gossip

Authors: Mathieu Even, Raphaël Berthier, Francis Bach, Nicolas Flammarion, Pierre Gaillard, Hadrien Hendrikx, Laurent Massoulié, Adrien Taylor

Abstract: We introduce the continuized Nesterov acceleration, a close variant of Nesterov acceleration whose variables are indexed by a continuous time parameter. The two variables continuously mix following a linear ordinary differential equation and take gradient steps at random times. This continuized variant benefits from the best of the continuous and the discrete frameworks: as a continuous process, one can use differential calculus to analyze convergence and obtain analytical expressions for the parameters; and a discretization of the continuized process can be computed exactly with convergence rates similar to those of Nesterov original acceleration. We show that the discretization has the same structure as Nesterov acceleration, but with random parameters. We provide continuized Nesterov acceleration under deterministic as well as stochastic gradients, with either additive or multiplicative noise. Finally, using our continuized framework and expressing the gossip averaging problem as the stochastic minimization of a certain energy function, we provide the first rigorous acceleration of asynchronous gossip algorithms.

Submitted 27 October, 2021; v1 submitted 10 June, 2021; originally announced June 2021.

Comments: arXiv admin note: substantial text overlap with arXiv:2102.06035

arXiv:2102.06035 [pdf, other]

A Continuized View on Nesterov Acceleration

Authors: Raphaël Berthier, Francis Bach, Nicolas Flammarion, Pierre Gaillard, Adrien Taylor

Abstract: We introduce the "continuized" Nesterov acceleration, a close variant of Nesterov acceleration whose variables are indexed by a continuous time parameter. The two variables continuously mix following a linear ordinary differential equation and take gradient steps at random times. This continuized variant benefits from the best of the continuous and the discrete frameworks: as a continuous process, one can use differential calculus to analyze convergence and obtain analytical expressions for the parameters; but a discretization of the continuized process can be computed exactly with convergence rates similar to those of Nesterov original acceleration. We show that the discretization has the same structure as Nesterov acceleration, but with random parameters.

Submitted 11 February, 2021; originally announced February 2021.

arXiv:2006.08212 [pdf, other]

Tight Nonparametric Convergence Rates for Stochastic Gradient Descent under the Noiseless Linear Model

Authors: Raphaël Berthier, Francis Bach, Pierre Gaillard

Abstract: In the context of statistical supervised learning, the noiseless linear model assumes that there exists a deterministic linear relation $Y = \langle θ_*, Φ(U) \rangle$ between the random output $Y$ and the random feature vector $Φ(U)$, a potentially non-linear transformation of the inputs $U$. We analyze the convergence of single-pass, fixed step-size stochastic gradient descent on the least-square risk under this model. The convergence of the iterates to the optimum $θ_*$ and the decay of the generalization error follow polynomial convergence rates with exponents that both depend on the regularities of the optimum $θ_*$ and of the feature vectors $Φ(u)$. We interpret our result in the reproducing kernel Hilbert space framework. As a special case, we analyze an online algorithm for estimating a real function on the unit interval from the noiseless observation of its value at randomly sampled points; the convergence depends on the Sobolev smoothness of the function and of a chosen kernel. Finally, we apply our analysis beyond the supervised learning setting to obtain convergence rates for the averaging process (a.k.a. gossip algorithm) on a graph depending on its spectral dimension.

Submitted 27 October, 2020; v1 submitted 15 June, 2020; originally announced June 2020.

arXiv:1805.08531 [pdf, other]

Accelerated Gossip in Networks of Given Dimension using Jacobi Polynomial Iterations

Authors: Raphaël Berthier, Francis Bach, Pierre Gaillard

Abstract: Consider a network of agents connected by communication links, where each agent holds a real value. The gossip problem consists in estimating the average of the values diffused in the network in a distributed manner. We develop a method solving the gossip problem that depends only on the spectral dimension of the network, that is, in the communication network set-up, the dimension of the space in which the agents live. This contrasts with previous work that required the spectral gap of the network as a parameter, or suffered from slow mixing. Our method shows an important improvement over existing algorithms in the non-asymptotic regime, i.e., when the values are far from being fully mixed in the network. Our approach stems from a polynomial-based point of view on gossip algorithms, as well as an approximation of the spectral measure of the graphs with a Jacobi measure. We show the power of the approach with simulations on various graphs, and with performance guarantees on graphs of known spectral dimension, such as grids and random percolation bonds. An extension of this work to distributed Laplacian solvers is discussed. As a side result, we also use the polynomial-based point of view to show the convergence of the message passing algorithm for gossip of Moallemi & Van Roy on regular graphs. The explicit computation of the rate of the convergence shows that message passing has a slow rate of convergence on graphs with small spectral gap.

Submitted 11 June, 2019; v1 submitted 22 May, 2018; originally announced May 2018.

arXiv:1708.03950 [pdf, other]

State Evolution for Approximate Message Passing with Non-Separable Functions

Authors: Raphael Berthier, Andrea Montanari, Phan-Minh Nguyen

Abstract: Given a high-dimensional data matrix ${\boldsymbol A}\in{\mathbb R}^{m\times n}$, Approximate Message Passing (AMP) algorithms construct sequences of vectors ${\boldsymbol u}^t\in{\mathbb R}^n$, ${\boldsymbol v}^t\in{\mathbb R}^m$, indexed by $t\in\{0,1,2\dots\}$ by iteratively applying ${\boldsymbol A}$ or ${\boldsymbol A}^{\sf T}$, and suitable non-linear functions, which depend on the specific application. Special instances of this approach have been developed --among other applications-- for compressed sensing reconstruction, robust regression, Bayesian estimation, low-rank matrix recovery, phase retrieval, and community detection in graphs. For certain classes of random matrices ${\boldsymbol A}$, AMP admits an asymptotically exact description in the high-dimensional limit $m,n\to\infty$, which goes under the name of `state evolution.' Earlier work established state evolution for separable non-linearities (under certain regularity conditions). Nevertheless, empirical work demonstrated several important applications that require non-separable functions. In this paper we generalize state evolution to Lipschitz continuous non-separable nonlinearities, for Gaussian matrices ${\boldsymbol A}$. Our proof makes use of Bolthausen's conditioning technique along with several approximation arguments. In particular, we introduce a modified algorithm (called LAMP for Long AMP) which is of independent interest.

Submitted 13 August, 2017; originally announced August 2017.

Comments: 41 pages, 4 figures

Showing 1–11 of 11 results for author: Berthier, R