
On the fast convergence of minibatch heavy ball momentum

Authors: Raghu Bollapragada, Tyler Chen, Rachel Ward

Abstract: Simple stochastic momentum methods are widely used in machine learning optimization, but their good practical performance is at odds with an absence of theoretical guarantees of acceleration in the literature. In this work, we aim to close the gap between theory and practice by showing that stochastic heavy ball momentum retains the fast linear rate of (deterministic) heavy ball momentum on quadratic optimization problems, at least when minibatching with a sufficiently large batch size. The algorithm we study can be interpreted as an accelerated randomized Kaczmarz algorithm with minibatching and heavy ball momentum. The analysis relies on carefully decomposing the momentum transition matrix, and using new spectral norm concentration bounds for products of independent random matrices. We provide numerical illustrations demonstrating that our bounds are reasonably sharp.

Submitted 12 December, 2023; v1 submitted 15 June, 2022; originally announced June 2022.

MSC Class: 65K05; 90C06; 90C30; 65F10; 68W20
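
The iteration named in the abstract above (minibatch stochastic gradient steps with heavy ball momentum on a quadratic problem) might be sketched as follows. This is a minimal illustration, not the authors' algorithm: the function name, the uniform sampling without replacement, and the fixed `step` and `momentum` values are all assumptions made for the example, and the paper's own sampling scheme and parameter choices may differ.

```python
import numpy as np

def minibatch_heavy_ball(A, b, batch_size, step, momentum, iters, seed=0):
    """Hypothetical sketch: minimize (1/2n)||Ax - b||^2 with minibatch heavy ball."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x_prev = np.zeros(d)
    x = np.zeros(d)
    for _ in range(iters):
        # Assumption: rows are sampled uniformly without replacement.
        idx = rng.choice(n, size=batch_size, replace=False)
        A_batch, b_batch = A[idx], b[idx]
        # Minibatch estimate of the gradient of the least-squares objective.
        grad = A_batch.T @ (A_batch @ x - b_batch) / batch_size
        # Heavy ball update: gradient step plus momentum along the last displacement.
        x_next = x - step * grad + momentum * (x - x_prev)
        x_prev, x = x, x_next
    return x
```

On a consistent linear system the minibatch gradient vanishes at the solution, which is the setting in which a linear rate for such an iteration is plausible; with too small a batch or too aggressive a `momentum` the sketch can fail to converge.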

arXiv:2110.15442

Scalable Unidirectional Pareto Optimality for Multi-Task Learning with Constraints

Authors: Soumyajit Gupta, Gurpreet Singh, Raghu Bollapragada, Matthew Lease

Abstract: Multi-objective optimization (MOO) problems require balancing competing objectives, often under constraints. The Pareto optimal solution set defines all possible optimal trade-offs over such objectives. In this work, we present a novel method for Pareto-front learning: inducing the full Pareto manifold at train-time so users can pick any desired optimal trade-off point at run-time. Our key insight is to exploit Fritz-John Conditions for a novel guided double gradient descent strategy. Evaluation on synthetic benchmark problems allows us to vary MOO problem difficulty in controlled fashion and measure accuracy vs. known analytic solutions. We further test scalability and generalization in learning optimal neural model parameterizations for Multi-Task Learning (MTL) on image classification. Results show consistent improvement in accuracy and efficiency over prior MTL methods as well as techniques from operations research.

Submitted 16 April, 2022; v1 submitted 28 October, 2021; originally announced October 2021.

arXiv:2109.12213

Adaptive Sampling Quasi-Newton Methods for Zeroth-Order Stochastic Optimization

Authors: Raghu Bollapragada, Stefan M. Wild

Abstract: We consider unconstrained stochastic optimization problems with no available gradient information. Such problems arise in settings from derivative-free simulation optimization to reinforcement learning. We propose an adaptive sampling quasi-Newton method where we estimate the gradients of a stochastic function using finite differences within a common random number framework. We develop modified versions of a norm test and an inner product quasi-Newton test to control the sample sizes used in the stochastic approximations and provide global convergence results to the neighborhood of the optimal solution. We present numerical experiments on simulation optimization problems to illustrate the performance of the proposed algorithm. When compared with classical zeroth-order stochastic gradient methods, we observe that our strategies of adapting the sample sizes significantly improve performance in terms of the number of stochastic function evaluations required.

Submitted 24 September, 2021; originally announced September 2021.
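
The gradient estimate mentioned in the abstract above (finite differences within a common random number framework) can be illustrated with a short sketch. Everything named here is an assumption for illustration: `crn_fd_gradient`, the oracle signature `F(x, seed)`, the forward-difference form, and the step `h`. The adaptive sample-size tests and the quasi-Newton update from the paper are not shown.

```python
import numpy as np

def crn_fd_gradient(F, x, seeds, h=1e-4):
    """Hypothetical sketch: forward-difference gradient with common random numbers.

    F(x, seed) is an assumed stochastic oracle whose randomness is fixed by `seed`.
    """
    x = np.asarray(x, dtype=float)
    seeds = list(seeds)
    grad = np.zeros_like(x)
    for seed in seeds:
        # Evaluate the base point once per random realization.
        f_base = F(x, seed)
        for i in range(x.size):
            shifted = x.copy()
            shifted[i] += h
            # Reusing the same seed at both points lets shared noise cancel.
            grad[i] += (F(shifted, seed) - f_base) / h
    # Average the per-seed difference quotients.
    return grad / len(seeds)
```

Because both evaluations in each difference quotient share a seed, purely additive noise cancels exactly and the estimate is left with only the usual O(h) forward-difference bias; with independent seeds the noise would instead be amplified by 1/h.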

arXiv:1802.05374

A Progressive Batching L-BFGS Method for Machine Learning

Authors: Raghu Bollapragada, Dheevatsa Mudigere, Jorge Nocedal, Hao-Jun Michael Shi, Ping Tak Peter Tang

Abstract: The standard L-BFGS method relies on gradient approximations that are not dominated by noise, so that search directions are descent directions, the line search is reliable, and quasi-Newton updating yields useful quadratic models of the objective function. All of this appears to call for a full batch approach, but since small batch sizes give rise to faster algorithms with better generalization properties, L-BFGS is currently not considered an algorithm of choice for large-scale machine learning applications. One need not, however, choose between the two extremes represented by the full batch or highly stochastic regimes, and may instead follow a progressive batching approach in which the sample size increases during the course of the optimization. In this paper, we present a new version of the L-BFGS algorithm that combines three basic components - progressive batching, a stochastic line search, and stable quasi-Newton updating - and that performs well on training logistic regression and deep neural networks. We provide supporting convergence theory for the method.

Submitted 30 May, 2018; v1 submitted 14 February, 2018; originally announced February 2018.

Comments: ICML 2018. 25 pages, 17 figures, 2 tables

arXiv:1705.06211

An Investigation of Newton-Sketch and Subsampled Newton Methods

Authors: Albert S. Berahas, Raghu Bollapragada, Jorge Nocedal

Abstract: Sketching, a dimensionality reduction technique, has received much attention in the statistics community. In this paper, we study sketching in the context of Newton's method for solving finite-sum optimization problems in which the number of variables and data points are both large. We study two forms of sketching that perform dimensionality reduction in data space: Hessian subsampling and randomized Hadamard transformations. Each has its own advantages, and their relative tradeoffs have not been investigated in the optimization literature. Our study focuses on practical versions of the two methods in which the resulting linear systems of equations are solved approximately, at every iteration, using an iterative solver. The advantages of using the conjugate gradient method vs. a stochastic gradient iteration are revealed through a set of numerical experiments, and a complexity analysis of the Hessian subsampling method is presented.

Submitted 30 May, 2019; v1 submitted 17 May, 2017; originally announced May 2017.

Comments: 36 pages, 22 figures
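
One of the two approaches named in the abstract above, Hessian subsampling with the Newton system solved approximately by conjugate gradient, might look roughly like the sketch below. It is an illustration under stated assumptions, not the paper's implementation: the L2-regularized logistic regression objective with labels in {0, 1}, the unit step length with no line search, and all function names (`logistic_gradient`, `conjugate_gradient`, `subsampled_newton_step`) are chosen for the example.

```python
import numpy as np

def logistic_gradient(X, y, w, reg):
    """Full gradient of mean logistic loss plus (reg/2)||w||^2, labels in {0, 1}."""
    probs = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (probs - y) / X.shape[0] + reg * w

def conjugate_gradient(matvec, rhs, max_iter=10, tol=1e-10):
    """Approximately solve matvec(s) = rhs for a symmetric positive definite operator."""
    s = np.zeros_like(rhs)
    r = rhs.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol:
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        s += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return s

def subsampled_newton_step(X, y, w, sample_size, rng, reg=1e-2, cg_iters=10):
    """Hypothetical sketch: full gradient, subsampled Hessian, inexact CG solve."""
    grad = logistic_gradient(X, y, w, reg)
    # Assumption: the Hessian sample is drawn uniformly without replacement.
    idx = rng.choice(X.shape[0], size=sample_size, replace=False)
    X_s = X[idx]
    probs = 1.0 / (1.0 + np.exp(-(X_s @ w)))
    weights = probs * (1.0 - probs)

    def hess_vec(v):
        # Hessian-vector product on the subsample; the Hessian is never formed.
        return X_s.T @ (weights * (X_s @ v)) / sample_size + reg * v

    # Assumption: unit step length, no line search.
    return w + conjugate_gradient(hess_vec, -grad, max_iter=cg_iters)
```

Capping `cg_iters` is what makes the linear solve approximate; a practical method would normally also add a line search or another globalization safeguard, which this sketch omits.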

Showing 1–5 of 5 results for author: Bollapragada, R