
Showing 1–32 of 32 results for author: Defazio, A

Searching in archive cs.
  1. arXiv:2405.15682

    cs.LG cs.AI math.OC stat.ML

    The Road Less Scheduled

    Authors: Aaron Defazio, Xingyu Yang, Harsh Mehta, Konstantin Mishchenko, Ahmed Khaled, Ashok Cutkosky

    Abstract: Existing learning rate schedules that do not require specification of the optimization stopping step T are greatly out-performed by learning rate schedules that depend on T. We propose an approach that avoids the need for this stopping time by eschewing the use of schedules entirely, while exhibiting state-of-the-art performance compared to schedules across a wide family of problems ranging from c…

    Submitted 30 May, 2024; v1 submitted 24 May, 2024; originally announced May 2024.

  2. arXiv:2403.04081

    cs.LG math.OC

    Directional Smoothness and Gradient Methods: Convergence and Adaptivity

    Authors: Aaron Mishkin, Ahmed Khaled, Yuanhao Wang, Aaron Defazio, Robert M. Gower

    Abstract: We develop new sub-optimality bounds for gradient descent (GD) that depend on the conditioning of the objective along the path of optimization, rather than on global, worst-case constants. Key to our proofs is directional smoothness, a measure of gradient variation that we use to develop upper-bounds on the objective. Minimizing these upper-bounds requires solving implicit equations to obtain a se…

    Submitted 6 March, 2024; originally announced March 2024.

    Comments: Twenty-four pages

  3. arXiv:2310.07831

    cs.LG cs.AI stat.ML

    When, Why and How Much? Adaptive Learning Rate Scheduling by Refinement

    Authors: Aaron Defazio, Ashok Cutkosky, Harsh Mehta, Konstantin Mishchenko

    Abstract: Learning rate schedules used in practice bear little resemblance to those recommended by theory. We close much of this theory/practice gap, and as a consequence are able to derive new problem-adaptive learning rate schedules. Our key technical contribution is a refined analysis of learning rate schedules for a wide class of optimization algorithms (including SGD). In contrast to most prior works t…

    Submitted 11 October, 2023; originally announced October 2023.

  4. arXiv:2306.06101

    cs.LG cs.AI math.OC stat.ML

    Prodigy: An Expeditiously Adaptive Parameter-Free Learner

    Authors: Konstantin Mishchenko, Aaron Defazio

    Abstract: We consider the problem of estimating the learning rate in adaptive methods, such as AdaGrad and Adam. We propose Prodigy, an algorithm that provably estimates the distance to the solution $D$, which is needed to set the learning rate optimally. At its core, Prodigy is a modification of the D-Adaptation method for learning-rate-free learning. It improves upon the convergence rate of D-Adaptation b…

    Submitted 19 March, 2024; v1 submitted 9 June, 2023; originally announced June 2023.

  5. arXiv:2306.00144

    cs.LG

    Mechanic: A Learning Rate Tuner

    Authors: Ashok Cutkosky, Aaron Defazio, Harsh Mehta

    Abstract: We introduce a technique for tuning the learning rate scale factor of any base optimization algorithm and schedule automatically, which we call \textsc{mechanic}. Our method provides a practical realization of recent theoretical reductions for accomplishing a similar goal in online convex optimization. We rigorously evaluate \textsc{mechanic} on a range of large scale deep learning tasks with vary…

    Submitted 1 June, 2023; v1 submitted 31 May, 2023; originally announced June 2023.

  6. arXiv:2305.07583

    cs.LG math.OC

    MoMo: Momentum Models for Adaptive Learning Rates

    Authors: Fabian Schaipp, Ruben Ohana, Michael Eickenberg, Aaron Defazio, Robert M. Gower

    Abstract: Training a modern machine learning architecture on a new task requires extensive learning-rate tuning, which comes at a high computational cost. Here we develop new Polyak-type adaptive learning rates that can be used on top of any momentum method, and require less tuning to perform well. We first develop MoMo, a Momentum Model based adaptive learning rate for SGD-M (stochastic gradient descent wi…

    Submitted 5 June, 2024; v1 submitted 12 May, 2023; originally announced May 2023.

    MSC Class: 90C53; 74S60; 90C06; 62L20; 68W20; 15B52; 65Y20; 68W40 ACM Class: G.1.6

  7. arXiv:2301.07733

    cs.LG cs.AI math.OC stat.ML

    Learning-Rate-Free Learning by D-Adaptation

    Authors: Aaron Defazio, Konstantin Mishchenko

    Abstract: D-Adaptation is an approach to automatically setting the learning rate which asymptotically achieves the optimal rate of convergence for minimizing convex Lipschitz functions, with no back-tracking or line searches, and no additional function value or gradient evaluations per step. Our approach is the first hyper-parameter free method for this class without additional multiplicative log factors in…

    Submitted 7 July, 2023; v1 submitted 18 January, 2023; originally announced January 2023.

  8. arXiv:2206.06900

    cs.LG math.OC stat.ML

    Grad-GradaGrad? A Non-Monotone Adaptive Stochastic Gradient Method

    Authors: Aaron Defazio, Baoyu Zhou, Lin Xiao

    Abstract: The classical AdaGrad method adapts the learning rate by dividing by the square root of a sum of squared gradients. Because this sum on the denominator is increasing, the method can only decrease step sizes over time, and requires a learning rate scaling hyper-parameter to be carefully tuned. To overcome this restriction, we introduce GradaGrad, a method in the same family that naturally grows or…

    Submitted 14 June, 2022; originally announced June 2022.

  9. arXiv:2106.11851

    cs.LG math.OC

    Stochastic Polyak Stepsize with a Moving Target

    Authors: Robert M. Gower, Aaron Defazio, Michael Rabbat

    Abstract: We propose a new stochastic gradient method called MOTAPS (Moving Targetted Polyak Stepsize) that uses recorded past loss values to compute adaptive stepsizes. MOTAPS can be seen as a variant of the Stochastic Polyak (SP) method, which also uses loss values to adjust the stepsize. The downside to the SP method is that it only converges when the interpolation condition holds. MOTAPS i…

    Submitted 23 September, 2021; v1 submitted 22 June, 2021; originally announced June 2021.

    Comments: 49 pages, 13 figures, 1 table

    MSC Class: 90C53; 74S60; 90C06; 62L20; 68W20; 15B52; 65Y20; 68W40 ACM Class: G.1.6

  10. arXiv:2101.11075

    cs.LG cs.AI math.OC

    Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization

    Authors: Aaron Defazio, Samy Jelassi

    Abstract: We introduce MADGRAD, a novel optimization method in the family of AdaGrad adaptive gradient methods. MADGRAD shows excellent performance on deep learning optimization problems from multiple fields, including classification and image-to-image tasks in vision, and recurrent and bidirectionally-masked models in natural language processing. For each of these tasks, MADGRAD matches or outperforms both…

    Submitted 26 August, 2021; v1 submitted 26 January, 2021; originally announced January 2021.

  11. arXiv:2010.10502

    cs.LG math.OC stat.ML

    Dual Averaging is Surprisingly Effective for Deep Learning Optimization

    Authors: Samy Jelassi, Aaron Defazio

    Abstract: First-order stochastic optimization methods are currently the most widely used class of methods for training deep neural networks. However, the choice of the optimizer has become an ad-hoc rule that can significantly affect the performance. For instance, SGD with momentum (SGD+M) is typically used in computer vision (CV) and Adam is used for training transformer models for Natural Language Process…

    Submitted 20 October, 2020; originally announced October 2020.

  12. arXiv:2010.00406

    cs.LG math.OC stat.ML

    Momentum via Primal Averaging: Theoretical Insights and Learning Rate Schedules for Non-Convex Optimization

    Authors: Aaron Defazio

    Abstract: Momentum methods are now used pervasively within the machine learning community for training non-convex models such as deep neural networks. Empirically, they outperform traditional stochastic gradient descent (SGD) approaches. In this work we develop a Lyapunov analysis of SGD with momentum (SGD+M), by utilizing an equivalent rewriting of the method known as the stochastic primal averaging (SPA)…

    Submitted 1 June, 2021; v1 submitted 1 October, 2020; originally announced October 2020.

  13. arXiv:2006.07867

    cs.LG math.OC stat.ML

    Almost sure convergence rates for Stochastic Gradient Descent and Stochastic Heavy Ball

    Authors: Othmane Sebbouh, Robert M. Gower, Aaron Defazio

    Abstract: We study stochastic gradient descent (SGD) and the stochastic heavy ball method (SHB, otherwise known as the momentum method) for the general stochastic approximation problem. For SGD, in the convex and smooth setting, we provide the first \emph{almost sure} asymptotic convergence \emph{rates} for a weighted average of the iterates. More precisely, we show that the convergence rate of the funct…

    Submitted 5 February, 2021; v1 submitted 14 June, 2020; originally announced June 2020.

  14. arXiv:2006.01244

    cs.LG math.OC stat.ML

    The Power of Factorial Powers: New Parameter settings for (Stochastic) Optimization

    Authors: Aaron Defazio, Robert M. Gower

    Abstract: The convergence rates for convex and non-convex optimization methods depend on the choice of a host of constants, including step sizes, Lyapunov function constants and momentum constants. In this work we propose the use of factorial powers as a flexible tool for defining constants that appear in convergence proofs. We list a number of remarkable properties that these sequences enjoy, and show how…

    Submitted 11 April, 2023; v1 submitted 1 June, 2020; originally announced June 2020.

  15. arXiv:2004.06688

    eess.IV cs.CV

    End-to-End Variational Networks for Accelerated MRI Reconstruction

    Authors: Anuroop Sriram, Jure Zbontar, Tullie Murrell, Aaron Defazio, C. Lawrence Zitnick, Nafissa Yakubova, Florian Knoll, Patricia Johnson

    Abstract: The slow acquisition speed of magnetic resonance imaging (MRI) has led to the development of two complementary methods: acquiring multiple views of the anatomy simultaneously (parallel imaging) and acquiring fewer samples than necessary for traditional signal processing methods (compressed sensing). While the combination of these methods has the potential to allow much faster scan times, reconstru…

    Submitted 15 April, 2020; v1 submitted 14 April, 2020; originally announced April 2020.

  16. arXiv:2001.08699

    eess.IV cs.CV cs.LG stat.ML

    MRI Banding Removal via Adversarial Training

    Authors: Aaron Defazio, Tullie Murrell, Michael P. Recht

    Abstract: MRI images reconstructed from sub-sampled Cartesian data using deep learning techniques often show a characteristic banding (sometimes described as streaking), which is particularly strong in low signal-to-noise regions of the reconstructed image. In this work, we propose the use of an adversarial loss that penalizes banding structures without requiring any human annotation. Our technique greatly…

    Submitted 8 October, 2020; v1 submitted 23 January, 2020; originally announced January 2020.

  17. arXiv:2001.02518

    eess.IV cs.CV

    Advancing machine learning for MR image reconstruction with an open competition: Overview of the 2019 fastMRI challenge

    Authors: Florian Knoll, Tullie Murrell, Anuroop Sriram, Nafissa Yakubova, Jure Zbontar, Michael Rabbat, Aaron Defazio, Matthew J. Muckley, Daniel K. Sodickson, C. Lawrence Zitnick, Michael P. Recht

    Abstract: Purpose: To advance research in the field of machine learning for MR image reconstruction with an open challenge. Methods: We provided participants with a dataset of raw k-space data from 1,594 consecutive clinical exams of the knee. The goal of the challenge was to reconstruct images from these data. In order to strike a balance between realistic data and a shallow learning curve for those not al…

    Submitted 6 January, 2020; originally announced January 2020.

  18. arXiv:1912.01101

    eess.IV cs.CV cs.LG

    Offset Sampling Improves Deep Learning based Accelerated MRI Reconstructions by Exploiting Symmetry

    Authors: Aaron Defazio

    Abstract: Deep learning approaches to accelerated MRI take a matrix of sampled Fourier-space lines as input and produce a spatial image as output. In this work we show that by careful choice of the offset used in the sampling procedure, the symmetries in k-space can be better exploited, producing higher quality reconstructions than given by standard equally-spaced samples or randomized samples motivated by…

    Submitted 4 February, 2020; v1 submitted 2 December, 2019; originally announced December 2019.

  19. arXiv:1910.12325

    eess.IV cs.CV

    GrappaNet: Combining Parallel Imaging with Deep Learning for Multi-Coil MRI Reconstruction

    Authors: Anuroop Sriram, Jure Zbontar, Tullie Murrell, C. Lawrence Zitnick, Aaron Defazio, Daniel K. Sodickson

    Abstract: Magnetic Resonance Image (MRI) acquisition is an inherently slow process which has spurred the development of two different acceleration methods: acquiring multiple correlated samples simultaneously (parallel imaging) and acquiring fewer samples than necessary for traditional signal processing methods (compressed sensing). Both methods provide complementary approaches to accelerating the speed of…

    Submitted 30 March, 2020; v1 submitted 27 October, 2019; originally announced October 2019.

  20. Beyond Folklore: A Scaling Calculus for the Design and Initialization of ReLU Networks

    Authors: Aaron Defazio, Léon Bottou

    Abstract: We propose a system for calculating a "scaling constant" for layers and weights of neural networks. We relate this scaling constant to two important quantities that relate to the optimizability of neural networks, and argue that a network that is "preconditioned" via scaling, in the sense that all weights have the same scaling constant, will be easier to train. This scaling calculus results in a n…

    Submitted 11 February, 2021; v1 submitted 10 June, 2019; originally announced June 2019.

    Journal ref: Neural Comput & Applic (2022)

  21. arXiv:1812.04634

    cs.LG math.OC stat.ML

    On the Curved Geometry of Accelerated Optimization

    Authors: Aaron Defazio

    Abstract: In this work we propose a differential geometric motivation for Nesterov's accelerated gradient method (AGM) for strongly-convex problems. By considering the optimization procedure as occurring on a Riemannian manifold with a natural structure, the AGM method can be seen as the proximal point method applied in this curved space. This viewpoint can also be extended to the continuous time case, wher…

    Submitted 20 November, 2019; v1 submitted 11 December, 2018; originally announced December 2018.

    Comments: NeurIPS 2019 Accepted paper

  22. arXiv:1812.04549

    cs.LG stat.ML

    Controlling Covariate Shift using Balanced Normalization of Weights

    Authors: Aaron Defazio, Léon Bottou

    Abstract: We introduce a new normalization technique that exhibits the fast convergence properties of batch normalization using a transformation of layer weights instead of layer outputs. The proposed technique keeps the contribution of positive and negative weights to the layer output balanced. We validate our method on a set of standard benchmarks including CIFAR-10/100, SVHN and ILSVRC 2012 ImageNet.

    Submitted 10 May, 2019; v1 submitted 11 December, 2018; originally announced December 2018.

  23. arXiv:1812.04529

    cs.LG stat.ML

    On the Ineffectiveness of Variance Reduced Optimization for Deep Learning

    Authors: Aaron Defazio, Léon Bottou

    Abstract: The application of stochastic variance reduction to optimization has shown remarkable recent theoretical and practical success. The applicability of these techniques to the hard non-convex optimization problems encountered during training of modern deep neural networks is an open problem. We show that naive application of the SVRG technique and related approaches fail, and explore why.

    Submitted 20 November, 2019; v1 submitted 11 December, 2018; originally announced December 2018.

  24. arXiv:1811.08839

    cs.CV cs.LG eess.SP physics.med-ph stat.ML

    fastMRI: An Open Dataset and Benchmarks for Accelerated MRI

    Authors: Jure Zbontar, Florian Knoll, Anuroop Sriram, Tullie Murrell, Zhengnan Huang, Matthew J. Muckley, Aaron Defazio, Ruben Stern, Patricia Johnson, Mary Bruno, Marc Parente, Krzysztof J. Geras, Joe Katsnelson, Hersh Chandarana, Zizhao Zhang, Michal Drozdzal, Adriana Romero, Michael Rabbat, Pascal Vincent, Nafissa Yakubova, James Pinkerton, Duo Wang, Erich Owens, C. Lawrence Zitnick, Michael P. Recht, et al. (2 additional authors not shown)

    Abstract: Accelerating Magnetic Resonance Imaging (MRI) by taking fewer measurements has the potential to reduce medical costs, minimize stress to patients and make MRI possible in applications where it is currently prohibitively slow or expensive. We introduce the fastMRI dataset, a large-scale collection of both raw MR measurements and clinical MR images, that can be used for training and evaluation of ma…

    Submitted 11 December, 2019; v1 submitted 21 November, 2018; originally announced November 2018.

    Comments: 35 pages, 10 figures

  25. arXiv:1602.02442

    stat.ML cs.LG

    A Simple Practical Accelerated Method for Finite Sums

    Authors: Aaron Defazio

    Abstract: We describe a novel optimization method for finite sums (such as empirical risk minimization problems) building on the recently introduced SAGA method. Our method achieves an accelerated convergence rate on strongly convex smooth problems. Our method has only one parameter (a step size), and is radically simpler than other accelerated methods for finite sums. Additionally it can be applied when th…

    Submitted 27 October, 2016; v1 submitted 7 February, 2016; originally announced February 2016.

  26. arXiv:1510.02533

    cs.LG stat.ML

    New Optimisation Methods for Machine Learning

    Authors: Aaron Defazio

    Abstract: A thesis submitted for the degree of Doctor of Philosophy of The Australian National University. In this work we introduce several new optimisation methods for problems in machine learning. Our algorithms broadly fall into two categories: optimisation of finite sums and of graph structured objectives. The finite sum problem is simply the minimisation of objective functions that are naturally exp…

    Submitted 18 March, 2016; v1 submitted 8 October, 2015; originally announced October 2015.

    Comments: PhD thesis, 205 pages

  27. arXiv:1504.04406

    stat.ML cs.LG math.OC stat.CO

    Non-Uniform Stochastic Average Gradient Method for Training Conditional Random Fields

    Authors: Mark Schmidt, Reza Babanezhad, Mohamed Osama Ahmed, Aaron Defazio, Ann Clifton, Anoop Sarkar

    Abstract: We apply stochastic average gradient (SAG) algorithms for training conditional random fields (CRFs). We describe a practical implementation that uses structure in the CRF gradient to reduce the memory requirement of this linearly-convergent stochastic gradient method, propose a non-uniform sampling scheme that substantially improves practical performance, and analyze the rate of convergence of the…

    Submitted 16 April, 2015; originally announced April 2015.

    Comments: AI/Stats 2015, 24 pages

  28. arXiv:1410.8620

    cs.LG cs.AI

    A Comparison of learning algorithms on the Arcade Learning Environment

    Authors: Aaron Defazio, Thore Graepel

    Abstract: Reinforcement learning agents have traditionally been evaluated on small toy problems. With advances in computing power and the advent of the Arcade Learning Environment, it is now possible to evaluate algorithms on diverse and difficult problems within a consistent framework. We discuss some challenges posed by the arcade learning environment which do not manifest in simpler environments. We then…

    Submitted 30 October, 2014; originally announced October 2014.

  29. arXiv:1407.2710

    cs.LG stat.ML

    Finito: A Faster, Permutable Incremental Gradient Method for Big Data Problems

    Authors: Aaron J. Defazio, Tibério S. Caetano, Justin Domke

    Abstract: Recent advances in optimization theory have shown that smooth strongly convex finite sums can be minimized faster than by treating them as a black box "batch" problem. In this work we introduce a new method in this class with a theoretical convergence rate four times faster than existing methods, for sums with sufficiently many terms. This method is also amenable to a sampling without replacement…

    Submitted 10 July, 2014; originally announced July 2014.

    Journal ref: International Conference on Machine Learning 2014

  30. arXiv:1407.2697

    cs.LG stat.ML

    A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation

    Authors: Aaron J. Defazio, Tiberio S. Caetano

    Abstract: A key problem in statistics and machine learning is the determination of network structure from data. We consider the case where the structure of the graph to be reconstructed is known to be scale-free. We show that in such cases it is natural to formulate structured sparsity inducing priors using submodular functions, and we use their Lovász extension to obtain a convex relaxation. For tractable…

    Submitted 10 July, 2014; originally announced July 2014.

    Journal ref: Advances in Neural Information Processing Systems 25 (NIPS 2012)

  31. arXiv:1407.0202

    cs.LG math.OC stat.ML

    SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives

    Authors: Aaron Defazio, Francis Bach, Simon Lacoste-Julien

    Abstract: In this work we introduce a new optimisation method called SAGA in the spirit of SAG, SDCA, MISO and SVRG, a set of recently proposed incremental gradient algorithms with fast linear convergence rates. SAGA improves on the theory behind SAG and SVRG, with better theoretical convergence rates, and has support for composite objectives where a proximal operator is used on the regulariser. Unlike SDCA…

    Submitted 16 December, 2014; v1 submitted 1 July, 2014; originally announced July 2014.

    Comments: Advances In Neural Information Processing Systems, Nov 2014, Montreal, Canada

  32. arXiv:1206.4622

    cs.LG cs.IR stat.ML

    A Graphical Model Formulation of Collaborative Filtering Neighbourhood Methods with Fast Maximum Entropy Training

    Authors: Aaron Defazio, Tiberio Caetano

    Abstract: Item neighbourhood methods for collaborative filtering learn a weighted graph over the set of items, where each item is connected to those it is most similar to. The prediction of a user's rating on an item is then given by that rating of neighbouring items, weighted by their similarity. This paper presents a new neighbourhood approach which we call item fields, whereby an undirected graphical mod…

    Submitted 18 June, 2012; originally announced June 2012.

    Comments: ICML2012