Search | arXiv e-print repository

Two Losses Are Better Than One: Faster Optimization Using a Cheaper Proxy

Authors: Blake Woodworth, Konstantin Mishchenko, Francis Bach

Abstract: We present an algorithm for minimizing an objective with hard-to-compute gradients by using a related, easier-to-access function as a proxy. Our algorithm is based on approximate proximal point iterations on the proxy combined with relatively few stochastic gradients from the objective. When the difference between the objective and the proxy is $δ$-smooth, our algorithm guarantees convergence at a rate matching stochastic gradient descent on a $δ$-smooth objective, which can lead to substantially better sample efficiency. Our algorithm has many potential applications in machine learning, and provides a principled means of leveraging synthetic data, physics simulators, mixed public and private data, and more.

Submitted 7 June, 2023; v1 submitted 7 February, 2023; originally announced February 2023.

arXiv:2206.07638 [pdf, other]

Asynchronous SGD Beats Minibatch SGD Under Arbitrary Delays

Authors: Konstantin Mishchenko, Francis Bach, Mathieu Even, Blake Woodworth

Abstract: The existing analysis of asynchronous stochastic gradient descent (SGD) degrades dramatically when any delay is large, giving the impression that performance depends primarily on the delay. On the contrary, we prove much better guarantees for the same asynchronous SGD algorithm regardless of the delays in the gradients, depending instead just on the number of parallel devices used to implement the algorithm. Our guarantees are strictly better than the existing analyses, and we also argue that asynchronous SGD outperforms synchronous minibatch SGD in the settings we consider. For our analysis, we introduce a novel recursion based on "virtual iterates" and delay-adaptive stepsizes, which allow us to derive state-of-the-art guarantees for both convex and non-convex objectives.

Submitted 20 April, 2023; v1 submitted 15 June, 2022; originally announced June 2022.

arXiv:2204.04970 [pdf, other]

Non-Convex Optimization with Certificates and Fast Rates Through Kernel Sums of Squares

Authors: Blake Woodworth, Francis Bach, Alessandro Rudi

Abstract: We consider potentially non-convex optimization problems, for which optimal rates of approximation depend on the dimension of the parameter space and the smoothness of the function to be optimized. In this paper, we propose an algorithm that achieves close to optimal a priori computational guarantees, while also providing a posteriori certificates of optimality. Our general formulation builds on infinite-dimensional sums-of-squares and Fourier analysis, and is instantiated on the minimization of multivariate periodic functions.

Submitted 11 April, 2022; originally announced April 2022.

arXiv:2111.04640 [pdf, other]

Experiments conducted in the burning plasma regime with inertial fusion implosions

Authors: J. S. Ross, J. E. Ralph, A. B. Zylstra, A. L. Kritcher, H. F. Robey, C. V. Young, O. A. Hurricane, D. A. Callahan, K. L. Baker, D. T. Casey, T. Doeppner, L. Divol, M. Hohenberger, S. Le Pape, A. Pak, P. K. Patel, R. Tommasini, S. J. Ali, P. A. Amendt, L. J. Atherton, B. Bachmann, D. Bailey, L. R. Benedetti, L. Berzak Hopkins, R. Betti, et al. (127 additional authors not shown)

Abstract: An experimental program is currently underway at the National Ignition Facility (NIF) to compress deuterium and tritium (DT) fuel to densities and temperatures sufficient to achieve fusion and energy gain. The primary approach being investigated is indirect drive inertial confinement fusion (ICF), where a high-Z radiation cavity (a hohlraum) is heated by lasers, converting the incident energy into x-ray radiation which in turn drives the DT fuel filled capsule causing it to implode. Previous experiments reported DT fuel gain exceeding unity [O.A. Hurricane et al., Nature 506, 343 (2014)] and then exceeding the kinetic energy of the imploding fuel [S. Le Pape et al., Phys. Rev. Lett. 120, 245003 (2018)]. We report on recent experiments that have achieved record fusion neutron yields on NIF, greater than 100 kJ with momentary fusion powers exceeding 1 PW, and have for the first time entered the burning plasma regime where fusion alpha-heating of the fuel exceeds the energy delivered to the fuel via compression. This was accomplished by increasing the size of the high-density carbon (HDC) capsule, increasing energy coupling, while controlling symmetry and implosion design parameters. Two tactics were successful in controlling the radiation flux symmetry and therefore the implosion symmetry: transferring energy between laser cones via plasma waves, and changing the shape of the hohlraum. In conducting these experiments, we controlled for known sources of degradation. Herein we show how these experiments were performed to produce record performance, and demonstrate the data fidelity leading us to conclude that these shots have entered the burning plasma regime.

Submitted 8 November, 2021; originally announced November 2021.

arXiv:2110.02954 [pdf, other]

A Stochastic Newton Algorithm for Distributed Convex Optimization

Authors: Brian Bullins, Kumar Kshitij Patel, Ohad Shamir, Nathan Srebro, Blake Woodworth

Abstract: We propose and analyze a stochastic Newton algorithm for homogeneous distributed stochastic convex optimization, where each machine can calculate stochastic gradients of the same population objective, as well as stochastic Hessian-vector products (products of an independent unbiased estimator of the Hessian of the population objective with arbitrary vectors), with many such stochastic computations performed between rounds of communication. We show that our method can reduce the number, and frequency, of required communication rounds compared to existing methods without hurting performance, by proving convergence guarantees for quasi-self-concordant objectives (e.g., logistic regression), alongside empirical evidence.

Submitted 7 October, 2021; originally announced October 2021.

arXiv:2109.00534 [pdf, other]

The Minimax Complexity of Distributed Optimization

Authors: Blake Woodworth

Abstract: In this thesis, I study the minimax oracle complexity of distributed stochastic optimization. First, I present the "graph oracle model", an extension of the classic oracle complexity framework that can be applied to study distributed optimization algorithms. Next, I describe a general approach to proving optimization lower bounds for arbitrary randomized algorithms (as opposed to more restricted classes of algorithms, e.g., deterministic or "zero-respecting" algorithms), which is used extensively throughout the thesis. For the remainder of the thesis, I focus on the specific case of the "intermittent communication setting", where multiple computing devices work in parallel with limited communication amongst themselves. In this setting, I analyze the theoretical properties of the popular Local Stochastic Gradient Descent (SGD) algorithm in the convex setting, both for homogeneous and heterogeneous objectives. I provide the first guarantees for Local SGD that improve over simple baseline methods, but show that Local SGD is not optimal in general. In pursuit of optimal methods in the intermittent communication setting, I then show matching upper and lower bounds for the intermittent communication setting with homogeneous convex, heterogeneous convex, and homogeneous non-convex objectives. These upper bounds are attained by simple variants of SGD which are therefore optimal. Finally, I discuss several additional assumptions about the objective or more powerful oracles that might be exploitable in order to develop better intermittent communication algorithms with better guarantees than our lower bounds allow.

Submitted 1 September, 2021; originally announced September 2021.

arXiv:2107.06917 [pdf, other]

A Field Guide to Federated Optimization

Authors: Jianyu Wang, Zachary Charles, Zheng Xu, Gauri Joshi, H. Brendan McMahan, Blaise Aguera y Arcas, Maruan Al-Shedivat, Galen Andrew, Salman Avestimehr, Katharine Daly, Deepesh Data, Suhas Diggavi, Hubert Eichner, Advait Gadhikar, Zachary Garrett, Antonious M. Girgis, Filip Hanzely, Andrew Hard, Chaoyang He, Samuel Horvath, Zhouyuan Huo, Alex Ingerman, Martin Jaggi, Tara Javidi, Peter Kairouz, et al. (28 additional authors not shown)

Abstract: Federated learning and analytics are a distributed approach for collaboratively learning models (or statistics) from decentralized data, motivated by and designed for privacy protection. The distributed learning process can be formulated as solving federated optimization problems, which emphasize communication efficiency, data heterogeneity, compatibility with privacy and system requirements, and other constraints that are not primary considerations in other problem settings. This paper provides recommendations and guidelines on formulating, designing, evaluating and analyzing federated optimization algorithms through concrete examples and practical implementation, with a focus on conducting effective simulations to infer real-world performance. The goal of this work is not to survey the current literature, but to inspire researchers and practitioners to design federated learning algorithms that can be used in various practical applications.

Submitted 14 July, 2021; originally announced July 2021.

arXiv:2106.02720 [pdf, ps, other]

An Even More Optimal Stochastic Optimization Algorithm: Minibatching and Interpolation Learning

Authors: Blake Woodworth, Nathan Srebro

Abstract: We present and analyze an algorithm for optimizing smooth and convex or strongly convex objectives using minibatch stochastic gradient estimates. The algorithm is optimal with respect to its dependence on both the minibatch size and minimum expected loss simultaneously. This improves over the optimal method of Lan (2012), which is insensitive to the minimum expected loss; over the optimistic acceleration of Cotter et al. (2011), which has suboptimal dependence on the minibatch size; and over the algorithm of Liu and Belkin (2018), which is limited to least squares problems and is also similarly suboptimal with respect to the minibatch size. Applied to interpolation learning, the improvement over Cotter et al. and Liu and Belkin translates to a linear, rather than square-root, parallelization speedup.

Submitted 26 October, 2021; v1 submitted 4 June, 2021; originally announced June 2021.

Comments: 24 pages

arXiv:2102.09769 [pdf, other]

On the Implicit Bias of Initialization Shape: Beyond Infinitesimal Mirror Descent

Authors: Shahar Azulay, Edward Moroshko, Mor Shpigel Nacson, Blake Woodworth, Nathan Srebro, Amir Globerson, Daniel Soudry

Abstract: Recent work has highlighted the role of initialization scale in determining the structure of the solutions that gradient methods converge to. In particular, it was shown that large initialization leads to the neural tangent kernel regime solution, whereas small initialization leads to so called "rich regimes". However, the initialization structure is richer than the overall scale alone and involves relative magnitudes of different weights and layers in the network. Here we show that these relative scales, which we refer to as initialization shape, play an important role in determining the learned model. We develop a novel technique for deriving the inductive bias of gradient-flow and use it to obtain closed-form implicit regularizers for multiple cases of interest.

Submitted 19 February, 2021; originally announced February 2021.

Comments: 33 pages, 2 figures

MSC Class: 68T07 (Primary) ACM Class: I.2.6; G.1.6

arXiv:2102.01583 [pdf, other]

The Min-Max Complexity of Distributed Stochastic Convex Optimization with Intermittent Communication

Authors: Blake Woodworth, Brian Bullins, Ohad Shamir, Nathan Srebro

Abstract: We resolve the min-max complexity of distributed stochastic convex optimization (up to a log factor) in the intermittent communication setting, where $M$ machines work in parallel over the course of $R$ rounds of communication to optimize the objective, and during each round of communication, each machine may sequentially compute $K$ stochastic gradient estimates. We present a novel lower bound with a matching upper bound that establishes an optimal algorithm.

Submitted 5 August, 2021; v1 submitted 2 February, 2021; originally announced February 2021.

Comments: 48 pages
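
To make the intermittent communication setting in the abstract above concrete, here is a minimal Python sketch of two simple baselines that fit it, Minibatch SGD and Local SGD. It is not the optimal algorithm from the paper. The one-dimensional objective $F(x) = x^2/2$, the Gaussian gradient noise, and all function names are assumptions chosen for illustration; only $M$, $R$, and $K$ come from the abstract.

```python
import random


def stochastic_grad(x, rng, noise_std=1.0):
    """Noisy gradient of the assumed toy objective F(x) = 0.5 * x**2."""
    return x + rng.gauss(0.0, noise_std)


def minibatch_sgd(x0, M, R, K, lr, rng, noise_std=1.0):
    """R rounds; in each round all M machines evaluate K gradients at the
    same shared point, and one step is taken with the averaged gradient."""
    x = x0
    for _ in range(R):
        grads = [stochastic_grad(x, rng, noise_std) for _ in range(M * K)]
        x -= lr * sum(grads) / len(grads)
    return x


def local_sgd(x0, M, R, K, lr, rng, noise_std=1.0):
    """R rounds; in each round every machine runs K sequential SGD steps
    from the shared point, then the M local iterates are averaged."""
    x = x0
    for _ in range(R):
        local_iterates = []
        for _ in range(M):
            y = x
            for _ in range(K):
                y -= lr * stochastic_grad(y, rng, noise_std)
            local_iterates.append(y)
        x = sum(local_iterates) / M  # the only communication in the round
    return x
```

Both routines use the same budget of $M \cdot K \cdot R$ stochastic gradients and $R$ communication rounds. With the noise switched off (`noise_std=0.0`), Minibatch SGD takes $R$ gradient steps in total while Local SGD takes $KR$, which is the basic trade-off these papers analyze.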

arXiv:2007.06738 [pdf, other]

Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy

Authors: Edward Moroshko, Suriya Gunasekar, Blake Woodworth, Jason D. Lee, Nathan Srebro, Daniel Soudry

Abstract: We provide a detailed asymptotic study of gradient flow trajectories and their implicit optimization bias when minimizing the exponential loss over "diagonal linear networks". This is the simplest model displaying a transition between "kernel" and non-kernel ("rich" or "active") regimes. We show how the transition is controlled by the relationship between the initialization scale and how accurately we minimize the training loss. Our results indicate that some limit behaviors of gradient descent only kick in at ridiculous training accuracies (well beyond $10^{-100}$). Moreover, the implicit bias at reasonable initialization scales and training accuracies is more complex and not captured by these limits.

Submitted 13 July, 2020; originally announced July 2020.

arXiv:2006.04735 [pdf, other]

Minibatch vs Local SGD for Heterogeneous Distributed Learning

Authors: Blake Woodworth, Kumar Kshitij Patel, Nathan Srebro

Abstract: We analyze Local SGD (aka parallel or federated SGD) and Minibatch SGD in the heterogeneous distributed setting, where each machine has access to stochastic gradient estimates for a different, machine-specific, convex objective; the goal is to optimize w.r.t. the average objective; and machines can only communicate intermittently. We argue that, (i) Minibatch SGD (even without acceleration) dominates all existing analysis of Local SGD in this setting, (ii) accelerated Minibatch SGD is optimal when the heterogeneity is high, and (iii) present the first upper bound for Local SGD that improves over Minibatch SGD in a non-homogeneous regime.

Submitted 1 March, 2022; v1 submitted 8 June, 2020; originally announced June 2020.

Comments: 34 pages

arXiv:2004.01025 [pdf, ps, other]

Mirrorless Mirror Descent: A Natural Derivation of Mirror Descent

Authors: Suriya Gunasekar, Blake Woodworth, Nathan Srebro

Abstract: We present a primal only derivation of Mirror Descent as a "partial" discretization of gradient flow on a Riemannian manifold where the metric tensor is the Hessian of the Mirror Descent potential. We contrast this discretization to Natural Gradient Descent, which is obtained by a "full" forward Euler discretization. This view helps shed light on the relationship between the methods and allows generalizing Mirror Descent to general Riemannian geometries, even when the metric tensor is not a Hessian, and thus there is no "dual."

Submitted 1 July, 2021; v1 submitted 2 April, 2020; originally announced April 2020.

Comments: 11 pages
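
The contrast between Mirror Descent and Natural Gradient Descent in the abstract above can be illustrated with a small Python sketch. It assumes one specific potential, the unnormalized negative entropy $ψ(w) = \sum_i (w_i \log w_i - w_i)$ on the positive orthant, whose Hessian is $\mathrm{diag}(1/w_i)$; this choice and the function names are illustrative and are not taken from the paper. Under that assumption, Mirror Descent updates the dual variable $\nabla ψ(w) = \log w$, while Natural Gradient Descent preconditions the gradient by the inverse Hessian.

```python
import math


def mirror_descent_step(w, grad, lr):
    """Mirror Descent with the assumed entropy potential:
    log(w_new) = log(w) - lr * grad, i.e. a multiplicative update."""
    return [wi * math.exp(-lr * gi) for wi, gi in zip(w, grad)]


def natural_gradient_step(w, grad, lr):
    """Natural Gradient Descent with metric diag(1 / w_i):
    w_new = w - lr * diag(w) * grad, a forward Euler step."""
    return [wi - lr * wi * gi for wi, gi in zip(w, grad)]
```

For this potential the two updates agree to first order in the step size, since $e^{-η g} = 1 - η g + O(η^2)$, and differ at second order. Mirror Descent also keeps every coordinate positive for any step size, whereas the Natural Gradient step can leave the positive orthant when $η g_i > 1$.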

arXiv:2002.09277 [pdf, other]

Kernel and Rich Regimes in Overparametrized Models

Authors: Blake Woodworth, Suriya Gunasekar, Jason D. Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, Nathan Srebro

Abstract: A recent line of work studies overparametrized neural networks in the "kernel regime," i.e. when the network behaves during training as a kernelized linear predictor, and thus training with gradient descent has the effect of finding the minimum RKHS norm solution. This stands in contrast to other studies which demonstrate how gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms. Building on an observation by Chizat and Bach, we show how the scale of the initialization controls the transition between the "kernel" (aka lazy) and "rich" (aka active) regimes and affects generalization properties in multilayer homogeneous models. We also highlight an interesting role for the width of a model in the case that the predictor is not identically zero at initialization. We provide a complete and detailed analysis for a family of simple depth-$D$ models that already exhibit an interesting and meaningful transition between the kernel and rich regimes, and we also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.

Submitted 27 July, 2020; v1 submitted 20 February, 2020; originally announced February 2020.

Comments: This updates and significantly extends a previous article (arXiv:1906.05827); Sections 6 and 7 are the most significant additions. 31 pages. arXiv admin note: text overlap with arXiv:1906.05827

arXiv:2002.07839 [pdf, other]

Is Local SGD Better than Minibatch SGD?

Authors: Blake Woodworth, Kumar Kshitij Patel, Sebastian U. Stich, Zhen Dai, Brian Bullins, H. Brendan McMahan, Ohad Shamir, Nathan Srebro

Abstract: We study local SGD (also known as parallel SGD and federated averaging), a natural and frequently used stochastic distributed optimization method. Its theoretical foundations are currently lacking and we highlight how all existing error guarantees in the convex setting are dominated by a simple baseline, minibatch SGD. (1) For quadratic objectives we prove that local SGD strictly dominates minibatch SGD and that accelerated local SGD is minimax optimal for quadratics; (2) For general convex objectives we provide the first guarantee that at least sometimes improves over minibatch SGD; (3) We show that indeed local SGD does not dominate minibatch SGD by presenting a lower bound on the performance of local SGD that is worse than the minibatch SGD guarantee.

Submitted 20 July, 2020; v1 submitted 18 February, 2020; originally announced February 2020.

Comments: 29 pages

arXiv:1912.02365 [pdf, other]

Lower Bounds for Non-Convex Stochastic Optimization

Authors: Yossi Arjevani, Yair Carmon, John C. Duchi, Dylan J. Foster, Nathan Srebro, Blake Woodworth

Abstract: We lower bound the complexity of finding $ε$-stationary points (with gradient norm at most $ε$) using stochastic first-order methods. In a well-studied model where algorithms access smooth, potentially non-convex functions through queries to an unbiased stochastic gradient oracle with bounded variance, we prove that (in the worst case) any algorithm requires at least $ε^{-4}$ queries to find an $ε$-stationary point. The lower bound is tight, and establishes that stochastic gradient descent is minimax optimal in this model. In a more restrictive model where the noisy gradient estimates satisfy a mean-squared smoothness property, we prove a lower bound of $ε^{-3}$ queries, establishing the optimality of recently proposed variance reduction techniques.

Submitted 27 February, 2022; v1 submitted 4 December, 2019; originally announced December 2019.

Comments: Correction to hard instance dimensions in Theorem 3

arXiv:1911.02212 [pdf, other]

The gradient complexity of linear regression

Authors: Mark Braverman, Elad Hazan, Max Simchowitz, Blake Woodworth

Abstract: We investigate the computational complexity of several basic linear algebra primitives, including largest eigenvector computation and linear regression, in the computational model that allows access to the data via a matrix-vector product oracle. We show that for polynomial accuracy, $Θ(d)$ calls to the oracle are necessary and sufficient even for a randomized algorithm. Our lower bound is based on a reduction to estimating the least eigenvalue of a random Wishart matrix. This simple distribution enables a concise proof, leveraging a few key properties of the random Wishart ensemble.

Submitted 23 May, 2021; v1 submitted 6 November, 2019; originally announced November 2019.
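
The matrix-vector product oracle model named in the abstract above can be sketched in a few lines of Python. In this sketch the algorithm never reads the matrix directly and is charged one unit per product. Power iteration is used here only as a familiar example of an oracle-based method for the top eigenvector; it is not claimed to be the method analyzed in the paper, and the class and function names are hypothetical.

```python
import math


class MatVecOracle:
    """Hides a matrix A and counts how many products A @ v are requested."""

    def __init__(self, A):
        self._A = A          # list of rows; algorithms must not touch this
        self.calls = 0

    def __call__(self, v):
        self.calls += 1
        return [sum(a * x for a, x in zip(row, v)) for row in self._A]


def power_iteration(oracle, v0, num_iters):
    """Estimate the top eigenvector using exactly num_iters oracle calls."""
    v = list(v0)
    for _ in range(num_iters):
        w = oracle(v)                                # one oracle query
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]                    # renormalize
    return v
```

The `calls` counter is the quantity that the $Θ(d)$ bound in the abstract refers to: the number of oracle queries, as opposed to arithmetic operations or wall-clock time.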

arXiv:1907.00762 [pdf, other]

Open Problem: The Oracle Complexity of Convex Optimization with Limited Memory

Authors: Blake Woodworth, Nathan Srebro

Abstract: We note that known methods achieving the optimal oracle complexity for first order convex optimization require quadratic memory, and ask whether this is necessary, and more broadly seek to characterize the minimax number of first order queries required to optimize a convex Lipschitz function subject to a memory constraint.

Submitted 1 July, 2019; originally announced July 2019.

Comments: 9 pages

arXiv:1906.09231 [pdf, other]

Guaranteed Validity for Empirical Approaches to Adaptive Data Analysis

Authors: Ryan Rogers, Aaron Roth, Adam Smith, Nathan Srebro, Om Thakkar, Blake Woodworth

Abstract: We design a general framework for answering adaptive statistical queries that focuses on providing explicit confidence intervals along with point estimates. Prior work in this area has either focused on providing tight confidence intervals for specific analyses, or providing general worst-case bounds for point estimates. Unfortunately, as we observe, these worst-case bounds are loose in many settings --- often not even beating simple baselines like sample splitting. Our main contribution is to design a framework for providing valid, instance-specific confidence intervals for point estimates that can be generated by heuristics. When paired with good heuristics, this method gives guarantees that are orders of magnitude better than the best worst-case bounds. We provide a Python library implementing our method.

Submitted 9 March, 2020; v1 submitted 21 June, 2019; originally announced June 2019.

Comments: Accepted to appear in the proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020

arXiv:1906.05827

Kernel and Rich Regimes in Overparametrized Models

Authors: Blake Woodworth, Suriya Gunasekar, Pedro Savarese, Edward Moroshko, Itay Golan, Jason Lee, Daniel Soudry, Nathan Srebro

Abstract: A recent line of work studies overparametrized neural networks in the "kernel regime," i.e. when the network behaves during training as a kernelized linear predictor, and thus training with gradient descent has the effect of finding the minimum RKHS norm solution. This stands in contrast to other studies which demonstrate how gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms. Building on an observation by Chizat and Bach, we show how the scale of the initialization controls the transition between the "kernel" (aka lazy) and "rich" (aka active) regimes and affects generalization properties in multilayer homogeneous models. We provide a complete and detailed analysis for a simple two-layer model that already exhibits an interesting and meaningful transition between the kernel and rich regimes, and we demonstrate the transition for more complex matrix factorization models and multilayer non-linear networks.

Submitted 25 February, 2020; v1 submitted 13 June, 2019; originally announced June 2019.

Comments: This paper has been substantially modified, updated, and expanded with additional content (arXiv:2002.09277). To avoid confusion with already existing citations, we are withdrawing the old version of this article

arXiv:1902.04686 [pdf, ps, other]

The Complexity of Making the Gradient Small in Stochastic Convex Optimization

Authors: Dylan J. Foster, Ayush Sekhari, Ohad Shamir, Nathan Srebro, Karthik Sridharan, Blake Woodworth

Abstract: We give nearly matching upper and lower bounds on the oracle complexity of finding $ε$-stationary points ($\| \nabla F(x) \| \leq ε$) in stochastic convex optimization. We jointly analyze the oracle complexity in both the local stochastic oracle model and the global oracle (or, statistical learning) model. This allows us to decompose the complexity of finding near-stationary points into optimization complexity and sample complexity, and reveals some surprising differences between the complexity of stochastic optimization versus learning. Notably, we show that in the global oracle/statistical learning model, only logarithmic dependence on smoothness is required to find a near-stationary point, whereas polynomial dependence on smoothness is necessary in the local stochastic oracle model. In other words, the separation in complexity between the two models can be exponential, and the folklore understanding that smoothness is required to find stationary points is only weakly true for statistical learning. Our upper bounds are based on extensions of a recent "recursive regularization" technique proposed by Allen-Zhu (2018). We show how to extend the technique to achieve near-optimal rates, and in particular show how to leverage the extra information available in the global oracle model. Our algorithm for the global model can be implemented efficiently through finite sum methods, and suggests an interesting new computational-statistical tradeoff.

Submitted 14 February, 2019; v1 submitted 12 February, 2019; originally announced February 2019.

arXiv:1807.00028 [pdf, other]

Training Well-Generalizing Classifiers for Fairness Metrics and Other Data-Dependent Constraints

Authors: Andrew Cotter, Maya Gupta, Heinrich Jiang, Nathan Srebro, Karthik Sridharan, Serena Wang, Blake Woodworth, Seungil You

Abstract: Classifiers can be trained with data-dependent constraints to satisfy fairness goals, reduce churn, achieve a targeted false positive rate, or other policy goals. We study the generalization performance for such constrained optimization problems, in terms of how well the constraints are satisfied at evaluation time, given that they are satisfied at training time. To improve generalization performance, we frame the problem as a two-player game where one player optimizes the model parameters on a training dataset, and the other player enforces the constraints on an independent validation dataset. We build on recent work in two-player constrained optimization to show that if one uses this two-dataset approach, then constraint generalization can be significantly improved. As we illustrate experimentally, this approach works not only in theory, but also in practice.

Submitted 28 September, 2018; v1 submitted 29 June, 2018; originally announced July 2018.

arXiv:1805.10222 [pdf, other]

Graph Oracle Models, Lower Bounds, and Gaps for Parallel Stochastic Optimization

Authors: Blake Woodworth, Jialei Wang, Adam Smith, Brendan McMahan, Nathan Srebro

Abstract: We suggest a general oracle-based framework that captures different parallel stochastic optimization settings described by a dependency graph, and derive generic lower bounds in terms of this graph. We then use the framework and derive lower bounds for several specific parallel optimization settings, including delayed updates and parallel processing with intermittent communication. We highlight gaps between lower and upper bounds on the oracle complexity, and cases where the "natural" algorithms are not known to be optimal.

Submitted 11 February, 2019; v1 submitted 25 May, 2018; originally announced May 2018.

arXiv:1803.04307 [pdf, ps, other]

The Everlasting Database: Statistical Validity at a Fair Price

Authors: Blake Woodworth, Vitaly Feldman, Saharon Rosset, Nathan Srebro

Abstract: The problem of handling adaptivity in data analysis, intentional or not, permeates a variety of fields, including test-set overfitting in ML challenges and the accumulation of invalid scientific discoveries. We propose a mechanism for answering an arbitrarily long sequence of potentially adaptive statistical queries, by charging a price for each query and using the proceeds to collect additional samples. Crucially, we guarantee statistical validity without any assumptions on how the queries are generated. We also ensure with high probability that the cost for $M$ non-adaptive queries is $O(\log M)$, while the cost to a potentially adaptive user who makes $M$ queries that do not depend on any others is $O(\sqrt{M})$.

Submitted 2 April, 2019; v1 submitted 12 March, 2018; originally announced March 2018.

Comments: 22 pages, accepted to NeurIPS 2018

arXiv:1709.03594 [pdf, ps, other]

Lower Bound for Randomized First Order Convex Optimization

Authors: Blake Woodworth, Nathan Srebro

Abstract: We provide an explicit construction and direct proof for the lower bound on the number of first order oracle accesses required for a randomized algorithm to minimize a convex Lipschitz function.

Submitted 3 November, 2017; v1 submitted 11 September, 2017; originally announced September 2017.

Comments: 8 pages

arXiv:1705.09280 [pdf, other]

Implicit Regularization in Matrix Factorization

Authors: Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, Nathan Srebro

Abstract: We study implicit regularization when optimizing an underdetermined quadratic objective over a matrix $X$ with gradient descent on a factorization of $X$. We conjecture and provide empirical and theoretical evidence that with small enough step sizes and initialization close enough to the origin, gradient descent on a full dimensional factorization converges to the minimum nuclear norm solution.

Submitted 25 May, 2017; originally announced May 2017.

arXiv:1702.06081 [pdf, other]

Learning Non-Discriminatory Predictors

Authors: Blake Woodworth, Suriya Gunasekar, Mesrob I. Ohannessian, Nathan Srebro

Abstract: We consider learning a predictor which is non-discriminatory with respect to a "protected attribute" according to the notion of "equalized odds" proposed by Hardt et al. [2016]. We study the problem of learning such a non-discriminatory predictor from a finite training set, both statistically and computationally. We show that a post-hoc correction approach, as suggested by Hardt et al, can be highly suboptimal, present a nearly-optimal statistical procedure, argue that the associated computational problem is intractable, and suggest a second moment relaxation of the non-discrimination definition for which learning is tractable.

Submitted 1 November, 2017; v1 submitted 20 February, 2017; originally announced February 2017.

Comments: 28 pages

arXiv:1605.08003 [pdf, ps, other]

Tight Complexity Bounds for Optimizing Composite Objectives

Authors: Blake Woodworth, Nathan Srebro

Abstract: We provide tight upper and lower bounds on the complexity of minimizing the average of $m$ convex functions using gradient and prox oracles of the component functions. We show a significant gap between the complexity of deterministic vs randomized optimization. For smooth functions, we show that accelerated gradient descent (AGD) and an accelerated variant of SVRG are optimal in the deterministic and randomized settings respectively, and that a gradient oracle is sufficient for the optimal rate. For non-smooth functions, having access to prox oracles reduces the complexity and we present optimal methods based on smoothing that improve over methods using just gradient accesses.

Submitted 4 April, 2019; v1 submitted 25 May, 2016; originally announced May 2016.

Showing 1–28 of 28 results for author: Woodworth, B