Search | arXiv e-print repository

Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning

Authors: Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, Mikhail Belkin

Abstract: In this paper, we first present an explanation regarding the common occurrence of spikes in the training loss when neural networks are trained with stochastic gradient descent (SGD). We provide evidence that the spikes in the training loss of SGD are "catapults", an optimization phenomenon originally observed in GD with large learning rates in [Lewkowycz et al. 2020]. We empirically show that these catapults occur in a low-dimensional subspace spanned by the top eigenvectors of the tangent kernel, for both GD and SGD. Second, we posit an explanation for how catapults lead to better generalization by demonstrating that catapults promote feature learning by increasing alignment with the Average Gradient Outer Product (AGOP) of the true predictor. Furthermore, we demonstrate that a smaller batch size in SGD induces a larger number of catapults, thereby improving AGOP alignment and test performance.

Submitted 5 June, 2024; v1 submitted 7 June, 2023; originally announced June 2023.

Comments: ICML 2024

arXiv:2205.11787

Quadratic models for understanding catapult dynamics of neural networks

Authors: Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, Mikhail Belkin

Abstract: While neural networks can be approximated by linear models as their width increases, certain properties of wide neural networks cannot be captured by linear models. In this work we show that recently proposed Neural Quadratic Models can exhibit the "catapult phase" [Lewkowycz et al. 2020] that arises when training such models with large learning rates. We then empirically show that the behaviour of neural quadratic models parallels that of neural networks in generalization, especially in the catapult phase regime. Our analysis further demonstrates that quadratic models can be an effective tool for analysis of neural networks.

Submitted 1 May, 2024; v1 submitted 24 May, 2022; originally announced May 2022.

Comments: accepted in ICLR 2024; changed the title
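
To make the "catapult phase" named in the abstract above concrete, here is a minimal sketch on a toy model that is not taken from the paper's text: a two-layer linear network f = (u · v) / sqrt(m) trained by full-batch gradient descent on a single example with input 1 and label 0. The function name `run_gd`, the width `m`, the seed, and the choice of learning rate as 3 divided by the initial tangent-kernel value are all assumptions made for illustration. In this regime the loss typically rises before it falls, while the tangent-kernel value shrinks.

```python
import math
import random


def run_gd(m=512, lr_scale=3.0, steps=300, seed=0):
    """Gradient descent on the toy loss 0.5 * f**2 with f = (u . v) / sqrt(m).

    Hypothetical illustration: the learning rate is lr_scale divided by the
    tangent-kernel value at initialization, so lr_scale between 2 and 4 is
    the assumed "large learning rate" regime.
    """
    rng = random.Random(seed)
    u = [rng.gauss(0.0, 1.0) for _ in range(m)]
    v = [rng.gauss(0.0, 1.0) for _ in range(m)]
    sqrt_m = math.sqrt(m)

    def output(u, v):
        return sum(a * b for a, b in zip(u, v)) / sqrt_m

    def kernel(u, v):
        # Tangent kernel of f at the single training point: ||grad f||^2.
        return (sum(a * a for a in u) + sum(b * b for b in v)) / m

    lr = lr_scale / kernel(u, v)
    losses, kernels = [], []
    for _ in range(steps):
        f = output(u, v)
        losses.append(0.5 * f * f)
        kernels.append(kernel(u, v))
        step = lr * f / sqrt_m
        # Simultaneous update: both right-hand sides use the old u and v.
        u, v = ([a - step * b for a, b in zip(u, v)],
                [b - step * a for a, b in zip(u, v)])
    return losses, kernels


if __name__ == "__main__":
    losses, kernels = run_gd()
    print("initial loss:", losses[0])
    print("peak loss:   ", max(losses))
    print("final loss:  ", losses[-1])
    print("kernel at start and end:", kernels[0], kernels[-1])
```

Setting `lr_scale` below 2 in the same sketch gives an ordinary monotone descent for comparison.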

arXiv:2112.14872

Local Quadratic Convergence of Stochastic Gradient Descent with Adaptive Step Size

Authors: Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

Abstract: Establishing a fast rate of convergence for optimization methods is crucial to their applicability in practice. With the increasing popularity of deep learning over the past decade, stochastic gradient descent and its adaptive variants (e.g. Adagrad, Adam, etc.) have become prominent methods of choice for machine learning practitioners. While a large number of works have demonstrated that these first order optimization methods can achieve sub-linear or linear convergence, we establish local quadratic convergence for stochastic gradient descent with adaptive step size for problems such as matrix inversion.

Submitted 29 December, 2021; originally announced December 2021.

Comments: ICML 2021 Workshop on Beyond first-order methods in ML systems

arXiv:2104.13758

A Non-Nested Multilevel Method for Meshless Solution of the Poisson Equation in Heat Transfer and Fluid Flow

Authors: Anand Radhakrishnan, Michael Xu, Shantanu Shahane, Surya Pratap Vanka

Abstract: We present a non-nested multilevel algorithm for solving the Poisson equation discretized at scattered points using polyharmonic radial basis function (PHS-RBF) interpolations. We append polynomials to the radial basis functions to achieve exponential convergence of discretization errors. The interpolations are performed over local clouds of points and the Poisson equation is collocated at each of the scattered points, resulting in a sparse set of discrete equations for the unknown variables. To solve this set of equations, we have developed a non-nested multilevel algorithm utilizing multiple independently generated coarse sets of points. The restriction and prolongation operators are also constructed with the same RBF interpolation procedure. The performance of the algorithm for Dirichlet and all-Neumann boundary conditions is evaluated in three model geometries using a manufactured solution. For Dirichlet boundary conditions, rapid convergence is observed using an SOR point solver as the relaxation scheme. For cases of all-Neumann boundary conditions, convergence is seen to slow down with the degree of the appended polynomial. However, when the multilevel procedure is combined with a GMRES algorithm, the convergence is seen to significantly improve. The GMRES accelerated multilevel algorithm is included in a fractional step method to solve incompressible Navier-Stokes equations.

Submitted 28 April, 2021; originally announced April 2021.

arXiv:2010.01702

doi: 10.1016/j.jcp.2021.110623

A High-Order Accurate Meshless Method for Solution of Incompressible Fluid Flow Problems

Authors: Shantanu Shahane, Anand Radhakrishnan, Surya Pratap Vanka

Abstract: Meshless solution to differential equations using radial basis functions (RBF) is an alternative to the grid-based methods commonly used. Since the meshless method does not need an underlying connectivity in the form of control volumes or elements, issues such as grid skewness that adversely impact accuracy are eliminated. Gaussian, Multiquadrics and inverse Multiquadrics are some of the most popular RBFs used for the solutions of fluid flow and heat transfer problems. But they have additional shape parameters that have to be fine-tuned for accuracy and stability. Moreover, they also face stagnation error when the point density is increased for accuracy. Recently, Polyharmonic splines (PHS) with appended polynomials have been shown to solve the above issues and give rapid convergence of discretization errors with the degree of appended polynomials. In this research, we extend the PHS-RBF method for the solution of incompressible Navier-Stokes equations. A fractional step method with explicit convection and explicit diffusion terms is combined with a pressure Poisson equation to satisfy the momentum and continuity equations. Systematic convergence tests have been performed for five model problems with two of them having analytical solutions. We demonstrate fast convergence both with refinement of number of points and degree of appended polynomials. The method is further applied to solve problems such as lid-driven cavity and vortex shedding over a circular cylinder. We have also analyzed the performance of this approach for solution of Euler equations. The proposed method shows promise for solving fluid flow and heat transfer problems in complex domains with high accuracy.

Submitted 31 July, 2021; v1 submitted 4 October, 2020; originally announced October 2020.

arXiv:1706.06091

Counting Markov Equivalence Classes for DAG models on Trees

Authors: Adityanarayanan Radhakrishnan, Liam Solus, Caroline Uhler

Abstract: DAG models are statistical models satisfying a collection of conditional independence relations encoded by the nonedges of a directed acyclic graph (DAG) $\mathcal{G}$. Such models are used to model complex cause-effect systems across a variety of research fields. From observational data alone, a DAG model $\mathcal{G}$ is only recoverable up to Markov equivalence. Combinatorially, two DAGs are Markov equivalent if and only if they have the same underlying undirected graph (i.e. skeleton) and the same set of the induced subDAGs $i\to j \leftarrow k$, known as immoralities. Hence it is of interest to study the number and size of Markov equivalence classes (MECs). In a recent paper, the authors introduced a pair of generating functions that enumerate the number of MECs on a fixed skeleton by number of immoralities and by class size, and they studied the complexity of computing these functions. In this paper, we lay the foundation for studying these generating functions by analyzing their structure for trees and other closely related graphs. We describe these polynomials for some important families of graphs including paths, stars, cycles, spider graphs, caterpillars, and complete binary trees. In doing so, we recover important connections to independence polynomials, and extend some classical identities that hold for Fibonacci numbers. We also provide tight lower and upper bounds for the number and size of MECs on any tree. Finally, we use computational methods to show that the number and distribution of high degree nodes in a triangle-free graph dictate the number and size of MECs.

Submitted 17 June, 2017; originally announced June 2017.

Comments: 31 Pages, 25 Figures, 1 Table
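
As a small illustration of the counting problem described in the abstract above, the sketch below enumerates MECs on a path skeleton by brute force. It is not the paper's method: it relies only on the stated characterization (same skeleton plus same immoralities) and on the fact that every orientation of a tree is acyclic. The function name `count_mecs_on_path` and the node labelling 0, 1, ..., p-1 are assumptions made for this example.

```python
from itertools import product


def count_mecs_on_path(p):
    """Count Markov equivalence classes of DAGs whose skeleton is a path on p nodes.

    Edge i joins nodes i and i+1. Every orientation of a path is acyclic and
    all orientations share the same skeleton, so two orientations are Markov
    equivalent exactly when they have the same set of immoralities.
    """
    num_edges = max(p - 1, 0)
    classes = set()
    for points_right in product([False, True], repeat=num_edges):
        # Internal node v is the middle of an immorality when both of its
        # edges point into it: edge v-1 points right and edge v points left.
        immoral_nodes = frozenset(
            v for v in range(1, p - 1)
            if points_right[v - 1] and not points_right[v]
        )
        classes.add(immoral_nodes)
    return len(classes)


if __name__ == "__main__":
    for p in range(2, 9):
        print(p, count_mecs_on_path(p))
```

For small paths the counts produced this way follow the Fibonacci recurrence, which matches the connection to Fibonacci numbers mentioned in the abstract; the enumeration takes time exponential in p and is only meant for tiny cases.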

arXiv:1611.07493

Counting Markov Equivalence Classes by Number of Immoralities

Authors: Adityanarayanan Radhakrishnan, Liam Solus, Caroline Uhler

Abstract: Two directed acyclic graphs (DAGs) are called Markov equivalent if and only if they have the same underlying undirected graph (i.e. skeleton) and the same set of immoralities. Using observational data, a DAG model can only be determined up to Markov equivalence, and so it is desirable to understand the size and number of Markov equivalence classes (MECs) combinatorially. In this paper, we address this enumerative question using a pair of generating functions that encode the number and size of MECs on a skeleton $G$, and in doing so we connect this problem to classical problems in combinatorial optimization. The first is a graph polynomial that counts the number of MECs on $G$ by their number of immoralities. Using connections to the independent set problem, we show that computing a DAG on $G$ with the maximum possible number of immoralities is NP-hard. The second generating function counts the MECs on $G$ according to their size. Via computer enumeration, we show that this generating function is distinct for every connected graph on $p$ nodes for all $p\leq 10$.

Submitted 17 June, 2017; v1 submitted 22 November, 2016; originally announced November 2016.

Comments: 10 pages, 3 Figures, 1 Table
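
The Markov equivalence criterion stated in the abstract above (same skeleton, same immoralities) translates directly into a check on edge lists. The following is a minimal sketch under assumptions of my own: a DAG is given as a collection of `(parent, child)` pairs, node labels are mutually comparable, and the input is taken to be acyclic without verification. The helper names are hypothetical and do not come from the paper.

```python
from itertools import combinations


def skeleton(edges):
    """Undirected version of the graph: each edge as an unordered pair."""
    return {frozenset(edge) for edge in edges}


def immoralities(edges):
    """Induced subgraphs i -> j <- k in which i and k are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, ps in parents.items():
        for i, k in combinations(sorted(ps), 2):
            if frozenset((i, k)) not in skel:
                found.add((i, child, k))
    return found


def markov_equivalent(edges_a, edges_b):
    """True when both DAGs share a skeleton and a set of immoralities.

    Assumes both edge lists describe acyclic graphs; this is not checked.
    """
    return (skeleton(edges_a) == skeleton(edges_b)
            and immoralities(edges_a) == immoralities(edges_b))


if __name__ == "__main__":
    chain = [("a", "b"), ("b", "c")]      # a -> b -> c
    reverse = [("b", "a"), ("c", "b")]    # a <- b <- c
    collider = [("a", "b"), ("c", "b")]   # a -> b <- c
    print(markov_equivalent(chain, reverse))
    print(markov_equivalent(chain, collider))
```

The chain and its reversal share a skeleton and have no immoralities, while the collider has the same skeleton but contains the immorality a -> b <- c, so it falls in a different class.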

Showing 1–7 of 7 results for author: Radhakrishnan, A