Search | arXiv e-print repository

From Learning to Optimize to Learning Optimization Algorithms

Abstract: Towards designing learned optimization algorithms that are usable beyond their training setting, we identify key principles that classical algorithms obey, but have up to now, not been used for Learning to Optimize (L2O). Following these principles, we provide a general design pipeline, taking into account data, architecture and learning strategy, and thereby enabling a synergy between classical o… ▽ More Towards designing learned optimization algorithms that are usable beyond their training setting, we identify key principles that classical algorithms obey, but have up to now, not been used for Learning to Optimize (L2O). Following these principles, we provide a general design pipeline, taking into account data, architecture and learning strategy, and thereby enabling a synergy between classical optimization and L2O, resulting in a philosophy of Learning Optimization Algorithms. As a consequence our learned algorithms perform well far beyond problems from the training distribution. We demonstrate the success of these novel principles by designing a new learning-enhanced BFGS algorithm and provide numerical experiments evidencing its adaptation to many settings at test time. △ Less

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2404.03290 [pdf, other]

Learning-to-Optimize with PAC-Bayesian Guarantees: Theoretical Considerations and Practical Implementation

Authors: Michael Sucker, Jalal Fadili, Peter Ochs

Abstract: We use the PAC-Bayesian theory for the setting of learning-to-optimize. To the best of our knowledge, we present the first framework to learn optimization algorithms with provable generalization guarantees (PAC-Bayesian bounds) and explicit trade-off between convergence guarantees and convergence speed, which contrasts with the typical worst-case analysis. Our learned optimization algorithms prova… ▽ More We use the PAC-Bayesian theory for the setting of learning-to-optimize. To the best of our knowledge, we present the first framework to learn optimization algorithms with provable generalization guarantees (PAC-Bayesian bounds) and explicit trade-off between convergence guarantees and convergence speed, which contrasts with the typical worst-case analysis. Our learned optimization algorithms provably outperform related ones derived from a (deterministic) worst-case analysis. The results rely on PAC-Bayesian bounds for general, possibly unbounded loss-functions based on exponential families. Then, we reformulate the learning procedure into a one-dimensional minimization problem and study the possibility to find a global minimum. Furthermore, we provide a concrete algorithmic realization of the framework and new methodologies for learning-to-optimize, and we conduct four practically relevant experiments to support our theory. With this, we showcase that the provided learning framework yields optimization algorithms that provably outperform the state-of-the-art by orders of magnitude. △ Less

Submitted 4 April, 2024; originally announced April 2024.

arXiv:2311.10053 [pdf, other]

Near-optimal Closed-loop Method via Lyapunov Dam** for Convex Optimization

Authors: Severin Maier, Camille Castera, Peter Ochs

Abstract: We introduce an autonomous system with closed-loop dam** for first-order convex optimization. While, to this day, optimal rates of convergence are almost exclusively achieved by non-autonomous methods via open-loop dam** (e.g., Nesterov's algorithm), we show that our system, featuring a closed-loop dam**, exhibits a rate arbitrarily close to the optimal one. We do so by coupling the dam**… ▽ More We introduce an autonomous system with closed-loop dam** for first-order convex optimization. While, to this day, optimal rates of convergence are almost exclusively achieved by non-autonomous methods via open-loop dam** (e.g., Nesterov's algorithm), we show that our system, featuring a closed-loop dam**, exhibits a rate arbitrarily close to the optimal one. We do so by coupling the dam** and the speed of convergence of the system via a well-chosen Lyapunov function. By discretizing our system we then derive an algorithm and present numerical experiments supporting our theoretical findings. △ Less

Submitted 15 April, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

arXiv:2210.11113 [pdf, other]

PAC-Bayesian Learning of Optimization Algorithms

Authors: Michael Sucker, Peter Ochs

Abstract: We apply the PAC-Bayes theory to the setting of learning-to-optimize. To the best of our knowledge, we present the first framework to learn optimization algorithms with provable generalization guarantees (PAC-bounds) and explicit trade-off between a high probability of convergence and a high convergence speed. Even in the limit case, where convergence is guaranteed, our learned optimization algori… ▽ More We apply the PAC-Bayes theory to the setting of learning-to-optimize. To the best of our knowledge, we present the first framework to learn optimization algorithms with provable generalization guarantees (PAC-bounds) and explicit trade-off between a high probability of convergence and a high convergence speed. Even in the limit case, where convergence is guaranteed, our learned optimization algorithms provably outperform related algorithms based on a (deterministic) worst-case analysis. Our results rely on PAC-Bayes bounds for general, unbounded loss-functions based on exponential families. By generalizing existing ideas, we reformulate the learning procedure into a one-dimensional minimization problem and study the possibility to find a global minimum, which enables the algorithmic realization of the learning procedure. As a proof-of-concept, we learn hyperparameters of standard optimization algorithms to empirically underline our theory. △ Less

Submitted 15 February, 2023; v1 submitted 20 October, 2022; originally announced October 2022.

Comments: Accepted to AISTATS 2023

arXiv:2208.03107 [pdf, other]

Fixed-Point Automatic Differentiation of Forward--Backward Splitting Algorithms for Partly Smooth Functions

Authors: Sheheryar Mehmood, Peter Ochs

Abstract: A large class of non-smooth practical optimization problems can be written as minimization of a sum of smooth and partly smooth functions. We consider such structured problems which also depend on a parameter vector and study the problem of differentiating its solution map** with respect to the parameter which has far reaching applications in sensitivity analysis and parameter learning optmizati… ▽ More A large class of non-smooth practical optimization problems can be written as minimization of a sum of smooth and partly smooth functions. We consider such structured problems which also depend on a parameter vector and study the problem of differentiating its solution map** with respect to the parameter which has far reaching applications in sensitivity analysis and parameter learning optmization problems. We show that under partial smoothness and other mild assumptions, Automatic Differentiation (AD) of the sequence generated by proximal splitting algorithms converges to the derivative of the solution map**. For a variant of automatic differentiation, which we call Fixed-Point Automatic Differentiation (FPAD), we remedy the memory overhead problem of the Reverse Mode AD and moreover provide faster convergence theoretically. We numerically illustrate the convergence and convergence rates of AD and FPAD on Lasso and Group Lasso problems and demonstrate the working of FPAD on prototypical practical image denoising problem by learning the regularization term. △ Less

Submitted 5 August, 2022; originally announced August 2022.

arXiv:2012.13161 [pdf, other]

Global Convergence of Model Function Based Bregman Proximal Minimization Algorithms

Authors: Mahesh Chandra Mukkamala, Jalal Fadili, Peter Ochs

Abstract: Lipschitz continuity of the gradient map** of a continuously differentiable function plays a crucial role in designing various optimization algorithms. However, many functions arising in practical applications such as low rank matrix factorization or deep neural network problems do not have a Lipschitz continuous gradient. This led to the development of a generalized notion known as the $L$-smad… ▽ More Lipschitz continuity of the gradient map** of a continuously differentiable function plays a crucial role in designing various optimization algorithms. However, many functions arising in practical applications such as low rank matrix factorization or deep neural network problems do not have a Lipschitz continuous gradient. This led to the development of a generalized notion known as the $L$-smad property, which is based on generalized proximity measures called Bregman distances. However, the $L$-smad property cannot handle nonsmooth functions, for example, simple nonsmooth functions like $\abs{x^4-1}$ and also many practical composite problems are out of scope. We fix this issue by proposing the MAP property, which generalizes the $L$-smad property and is also valid for a large class of nonconvex nonsmooth composite problems. Based on the proposed MAP property, we propose a globally convergent algorithm called Model BPG, that unifies several existing algorithms. The convergence analysis is based on a new Lyapunov function. We also numerically illustrate the superior performance of Model BPG on standard phase retrieval problems, robust phase retrieval problems, and Poisson linear inverse problems, when compared to a state of the art optimization method that is valid for generic nonconvex nonsmooth optimization problems. △ Less

Submitted 24 December, 2020; originally announced December 2020.

Comments: 44 pages, 22 figures

MSC Class: 90C25; 26B25; 49M27; 52A41; 65K05

arXiv:2008.07872 [pdf, other]

Self-supervised Sparse to Dense Motion Segmentation

Authors: Amirhossein Kardoost, Kalun Ho, Peter Ochs, Margret Keuper

Abstract: Observable motion in videos can give rise to the definition of objects moving with respect to the scene. The task of segmenting such moving objects is referred to as motion segmentation and is usually tackled either by aggregating motion information in long, sparse point trajectories, or by directly producing per frame dense segmentations relying on large amounts of training data. In this paper, w… ▽ More Observable motion in videos can give rise to the definition of objects moving with respect to the scene. The task of segmenting such moving objects is referred to as motion segmentation and is usually tackled either by aggregating motion information in long, sparse point trajectories, or by directly producing per frame dense segmentations relying on large amounts of training data. In this paper, we propose a self supervised method to learn the densification of sparse motion segmentations from single video frames. While previous approaches towards motion segmentation build upon pre-training on large surrogate datasets and use dense motion information as an essential cue for the pixelwise segmentation, our model does not require pre-training and operates at test time on single frames. It can be trained in a sequence specific way to produce high quality dense segmentations from sparse and noisy input. We evaluate our method on the well-known motion segmentation datasets FBMS59 and DAVIS16. △ Less

Submitted 18 August, 2020; originally announced August 2020.

arXiv:1910.03638 [pdf, other]

Bregman Proximal Framework for Deep Linear Neural Networks

Authors: Mahesh Chandra Mukkamala, Felix Westerkamp, Emanuel Laude, Daniel Cremers, Peter Ochs

Abstract: A typical assumption for the analysis of first order optimization methods is the Lipschitz continuity of the gradient of the objective function. However, for many practical applications this assumption is violated, including loss functions in deep learning. To overcome this issue, certain extensions based on generalized proximity measures known as Bregman distances were introduced. This initiated… ▽ More A typical assumption for the analysis of first order optimization methods is the Lipschitz continuity of the gradient of the objective function. However, for many practical applications this assumption is violated, including loss functions in deep learning. To overcome this issue, certain extensions based on generalized proximity measures known as Bregman distances were introduced. This initiated the development of the Bregman proximal gradient (BPG) algorithm and an inertial variant (momentum based) CoCaIn BPG, which however rely on problem dependent Bregman distances. In this paper, we develop Bregman distances for using BPG methods to train Deep Linear Neural Networks. The main implications of our results are strong convergence guarantees for these algorithms. We also propose several strategies for their efficient implementation, for example, closed form updates and a closed form expression for the inertial parameter of CoCaIn BPG. Moreover, the BPG method requires neither diminishing step sizes nor line search, unlike its corresponding Euclidean version. We numerically illustrate the competitiveness of the proposed methods compared to existing state of the art schemes. △ Less

Submitted 8 October, 2019; originally announced October 2019.

Comments: 34 pages, 54 images

MSC Class: 90C26; 26B25; 90C30; 49M27; 47J25; 65K05; 65F22

arXiv:1905.09050 [pdf, other]

Beyond Alternating Updates for Matrix Factorization with Inertial Bregman Proximal Gradient Algorithms

Authors: Mahesh Chandra Mukkamala, Peter Ochs

Abstract: Matrix Factorization is a popular non-convex optimization problem, for which alternating minimization schemes are mostly used. They usually suffer from the major drawback that the solution is biased towards one of the optimization variables. A remedy is non-alternating schemes. However, due to a lack of Lipschitz continuity of the gradient in matrix factorization problems, convergence cannot be gu… ▽ More Matrix Factorization is a popular non-convex optimization problem, for which alternating minimization schemes are mostly used. They usually suffer from the major drawback that the solution is biased towards one of the optimization variables. A remedy is non-alternating schemes. However, due to a lack of Lipschitz continuity of the gradient in matrix factorization problems, convergence cannot be guaranteed. A recently developed approach relies on the concept of Bregman distances, which generalizes the standard Euclidean distance. We exploit this theory by proposing a novel Bregman distance for matrix factorization problems, which, at the same time, allows for simple/closed form update steps. Therefore, for non-alternating schemes, such as the recently introduced Bregman Proximal Gradient (BPG) method and an inertial variant Convex--Concave Inertial BPG (CoCaIn BPG), convergence of the whole sequence to a stationary point is proved for Matrix Factorization. In several experiments, we observe a superior performance of our non-alternating schemes in terms of speed and objective value at the limit point. △ Less

Submitted 6 December, 2019; v1 submitted 22 May, 2019; originally announced May 2019.

Comments: Accepted at NeuRIPS 2019. Paper url: http://papers.nips.cc/paper/8679-beyond-alternating-updates-for-matrix-factorization-with-inertial-bregman-proximal-gradient-algorithms

arXiv:1904.03537 [pdf, other]

Convex-Concave Backtracking for Inertial Bregman Proximal Gradient Algorithms in Non-Convex Optimization

Authors: Mahesh Chandra Mukkamala, Peter Ochs, Thomas Pock, Shoham Sabach

Abstract: Backtracking line-search is an old yet powerful strategy for finding a better step sizes to be used in proximal gradient algorithms. The main principle is to locally find a simple convex upper bound of the objective function, which in turn controls the step size that is used. In case of inertial proximal gradient algorithms, the situation becomes much more difficult and usually leads to very restr… ▽ More Backtracking line-search is an old yet powerful strategy for finding a better step sizes to be used in proximal gradient algorithms. The main principle is to locally find a simple convex upper bound of the objective function, which in turn controls the step size that is used. In case of inertial proximal gradient algorithms, the situation becomes much more difficult and usually leads to very restrictive rules on the extrapolation parameter. In this paper, we show that the extrapolation parameter can be controlled by locally finding also a simple concave lower bound of the objective function. This gives rise to a double convex-concave backtracking procedure which allows for an adaptive choice of both the step size and extrapolation parameters. We apply this procedure to the class of inertial Bregman proximal gradient methods, and prove that any sequence generated by these algorithms converges globally to a critical point of the function at hand. Numerical experiments on a number of challenging non-convex problems in image processing and machine learning were conducted and show the power of combining inertial step and double backtracking strategy in achieving improved performances. △ Less

Submitted 5 November, 2019; v1 submitted 6 April, 2019; originally announced April 2019.

Comments: 29 pages

MSC Class: 90C25; 26B25; 49M27; 52A41; 65K05

arXiv:1901.08087 [pdf, other]

Model Function Based Conditional Gradient Method with Armijo-like Line Search

Authors: Yura Malitsky, Peter Ochs

Abstract: The Conditional Gradient Method is generalized to a class of non-smooth non-convex optimization problems with many applications in machine learning. The proposed algorithm iterates by minimizing so-called model functions over the constraint set. Complemented with an Amijo line search procedure, we prove that subsequences converge to a stationary point. The abstract framework of model functions pro… ▽ More The Conditional Gradient Method is generalized to a class of non-smooth non-convex optimization problems with many applications in machine learning. The proposed algorithm iterates by minimizing so-called model functions over the constraint set. Complemented with an Amijo line search procedure, we prove that subsequences converge to a stationary point. The abstract framework of model functions provides great flexibility for the design of concrete algorithms. As special cases, for example, we develop an algorithm for additive composite problems and an algorithm for non-linear composite problems which leads to a Gauss--Newton-type algorithm. Both instances are novel in non-smooth non-convex optimization and come with numerous applications in machine learning. Moreover, we obtain a hybrid version of Conditional Gradient and Proximal Minimization schemes for free, which combines advantages of both. Our algorithm is shown to perform favorably on a sparse non-linear robust regression problem and we discuss the flexibility of the proposed framework in several matrix factorization formulations. △ Less

Submitted 23 January, 2019; originally announced January 2019.

arXiv:1803.08660 [pdf, other]

Lifting Layers: Analysis and Applications

Authors: Peter Ochs, Tim Meinhardt, Laura Leal-Taixe, Michael Moeller

Abstract: The great advances of learning-based approaches in image processing and computer vision are largely based on deeply nested networks that compose linear transfer functions with suitable non-linearities. Interestingly, the most frequently used non-linearities in imaging applications (variants of the rectified linear unit) are uncommon in low dimensional approximation problems. In this paper we propo… ▽ More The great advances of learning-based approaches in image processing and computer vision are largely based on deeply nested networks that compose linear transfer functions with suitable non-linearities. Interestingly, the most frequently used non-linearities in imaging applications (variants of the rectified linear unit) are uncommon in low dimensional approximation problems. In this paper we propose a novel non-linear transfer function, called lifting, which is motivated from a related technique in convex optimization. A lifting layer increases the dimensionality of the input, naturally yields a linear spline when combined with a fully connected layer, and therefore closes the gap between low and high dimensional approximation problems. Moreover, applying the lifting operation to the loss layer of the network allows us to handle non-convex and flat (zero-gradient) cost functions. We analyze the proposed lifting theoretically, exemplify interesting properties in synthetic experiments and demonstrate its effectiveness in deep learning approaches to image classification and denoising. △ Less

Submitted 23 March, 2018; originally announced March 2018.

arXiv:1404.4805 [pdf, other]

iPiano: Inertial Proximal Algorithm for Non-Convex Optimization

Authors: Peter Ochs, Yun** Chen, Thomas Brox, Thomas Pock

Abstract: In this paper we study an algorithm for solving a minimization problem composed of a differentiable (possibly non-convex) and a convex (possibly non-differentiable) function. The algorithm iPiano combines forward-backward splitting with an inertial force. It can be seen as a non-smooth split version of the Heavy-ball method from Polyak. A rigorous analysis of the algorithm for the proposed class o… ▽ More In this paper we study an algorithm for solving a minimization problem composed of a differentiable (possibly non-convex) and a convex (possibly non-differentiable) function. The algorithm iPiano combines forward-backward splitting with an inertial force. It can be seen as a non-smooth split version of the Heavy-ball method from Polyak. A rigorous analysis of the algorithm for the proposed class of problems yields global convergence of the function values and the arguments. This makes the algorithm robust for usage on non-convex problems. The convergence result is obtained based on the \KL inequality. This is a very weak restriction, which was used to prove convergence for several other gradient methods. First, an abstract convergence theorem for a generic algorithm is proved, and, then iPiano is shown to satisfy the requirements of this theorem. Furthermore, a convergence rate is established for the general problem class. We demonstrate iPiano on computer vision problems: image denoising with learned priors and diffusion based image compression. △ Less

Submitted 18 April, 2014; originally announced April 2014.

Comments: 32pages, 7 figures, to appear in SIAM Journal on Imaging Sciences

Showing 1–13 of 13 results for author: Ochs, P