-
From Learning to Optimize to Learning Optimization Algorithms
Authors:
Camille Castera,
Peter Ochs
Abstract:
Towards designing learned optimization algorithms that are usable beyond their training setting, we identify key principles that classical algorithms obey but that have, up to now, not been used for Learning to Optimize (L2O). Following these principles, we provide a general design pipeline, taking into account data, architecture and learning strategy, and thereby enabling a synergy between classical optimization and L2O, resulting in a philosophy of Learning Optimization Algorithms. As a consequence, our learned algorithms perform well far beyond problems from the training distribution. We demonstrate the success of these novel principles by designing a new learning-enhanced BFGS algorithm and provide numerical experiments evidencing its adaptation to many settings at test time.
Submitted 28 May, 2024;
originally announced May 2024.
-
Near-optimal Closed-loop Method via Lyapunov Damping for Convex Optimization
Authors:
Severin Maier,
Camille Castera,
Peter Ochs
Abstract:
We introduce an autonomous system with closed-loop damping for first-order convex optimization. While, to this day, optimal rates of convergence are almost exclusively achieved by non-autonomous methods via open-loop damping (e.g., Nesterov's algorithm), we show that our system, featuring a closed-loop damping, exhibits a rate arbitrarily close to the optimal one. We do so by coupling the damping and the speed of convergence of the system via a well-chosen Lyapunov function. By discretizing our system we then derive an algorithm and present numerical experiments supporting our theoretical findings.
Submitted 15 April, 2024; v1 submitted 16 November, 2023;
originally announced November 2023.
-
Inertial Newton Algorithms Avoiding Strict Saddle Points
Authors:
Camille Castera
Abstract:
We study the asymptotic behavior of second-order algorithms mixing Newton's method and inertial gradient descent in non-convex landscapes. We show that, despite the Newtonian behavior of these methods, they almost always escape strict saddle points. We also evidence the role played by the hyper-parameters of these methods in their qualitative behavior near critical points. The theoretical results are supported by numerical illustrations.
Submitted 12 February, 2024; v1 submitted 8 November, 2021;
originally announced November 2021.
-
Second-order step-size tuning of SGD for non-convex optimization
Authors:
Camille Castera,
Jérôme Bolte,
Cédric Févotte,
Edouard Pauwels
Abstract:
In view of a direct and simple improvement of vanilla SGD, this paper presents a fine-tuning of its step-sizes in the mini-batch case. For doing so, one estimates curvature, based on a local quadratic model and using only noisy gradient approximations. One obtains a new stochastic first-order method (Step-Tuned SGD), enhanced by second-order information, which can be seen as a stochastic version of the classical Barzilai-Borwein method. Our theoretical results ensure almost sure convergence to the critical set and we provide convergence rates. Experiments on deep residual network training illustrate the favorable properties of our approach. For such networks we observe, during training, both a sudden drop of the loss and an improvement of test accuracy at medium stages, yielding better results than SGD, RMSprop, or ADAM.
Submitted 21 November, 2021; v1 submitted 5 March, 2021;
originally announced March 2021.
-
An Inertial Newton Algorithm for Deep Learning
Authors:
Camille Castera,
Jérôme Bolte,
Cédric Févotte,
Edouard Pauwels
Abstract:
We introduce a new second-order inertial optimization method for machine learning called INNA. It exploits the geometry of the loss function while only requiring stochastic approximations of the function values and the generalized gradients. This makes INNA fully implementable and adapted to large-scale optimization problems such as the training of deep neural networks. The algorithm combines both gradient-descent and Newton-like behaviors as well as inertia. We prove the convergence of INNA for most deep learning problems. To do so, we provide a well-suited framework to analyze deep learning loss functions involving tame optimization in which we study a continuous dynamical system together with its discrete stochastic approximations. We prove sublinear convergence for the continuous-time differential inclusion which underlies our algorithm. Additionally, we also show how standard optimization mini-batch methods applied to non-smooth non-convex problems can yield a certain type of spurious stationary points never discussed before. We address this issue by providing a theoretical framework around the new idea of $D$-criticality; we then give a simple asymptotic analysis of INNA. Our algorithm allows for using an aggressive learning rate of $o(1/\log k)$. From an empirical viewpoint, we show that INNA returns competitive results with respect to state of the art (stochastic gradient descent, ADAGRAD, ADAM) on popular deep learning benchmark problems.
Submitted 28 July, 2021; v1 submitted 29 May, 2019;
originally announced May 2019.