-
MADA: Meta-Adaptive Optimizers through hyper-gradient Descent
Authors:
Kaan Ozkara,
Can Karakus,
Parameswaran Raman,
Mingyi Hong,
Shoham Sabach,
Branislav Kveton,
Volkan Cevher
Abstract:
Following the introduction of Adam, several novel adaptive optimizers for deep learning have been proposed. These optimizers typically excel in some tasks but may not outperform Adam uniformly across all tasks. In this work, we introduce Meta-Adaptive Optimizers (MADA), a unified optimizer framework that can generalize several known optimizers and dynamically learn the most suitable one during tra…
▽ More
Following the introduction of Adam, several novel adaptive optimizers for deep learning have been proposed. These optimizers typically excel in some tasks but may not outperform Adam uniformly across all tasks. In this work, we introduce Meta-Adaptive Optimizers (MADA), a unified optimizer framework that can generalize several known optimizers and dynamically learn the most suitable one during training. The key idea in MADA is to parameterize the space of optimizers and dynamically search through it using hyper-gradient descent during training. We empirically compare MADA to other popular optimizers on vision and language tasks, and find that MADA consistently outperforms Adam and other popular optimizers, and is robust against sub-optimally tuned hyper-parameters. MADA achieves a greater validation performance improvement over Adam compared to other popular optimizers during GPT-2 training and fine-tuning. We also propose AVGrad, a modification of AMSGrad that replaces the maximum operator with averaging, which is more suitable for hyper-gradient optimization. Finally, we provide a convergence analysis to show that parameterized interpolations of optimizers can improve their error bounds (up to constants), hinting at an advantage for meta-optimizers.
△ Less
Submitted 17 June, 2024; v1 submitted 16 January, 2024;
originally announced January 2024.
-
Krylov Cubic Regularized Newton: A Subspace Second-Order Method with Dimension-Free Convergence Rate
Authors:
Ruichen Jiang,
Parameswaran Raman,
Shoham Sabach,
Aryan Mokhtari,
Mingyi Hong,
Volkan Cevher
Abstract:
Second-order optimization methods, such as cubic regularized Newton methods, are known for their rapid convergence rates; nevertheless, they become impractical in high-dimensional problems due to their substantial memory requirements and computational costs. One promising approach is to execute second-order updates within a lower-dimensional subspace, giving rise to subspace second-order methods.…
▽ More
Second-order optimization methods, such as cubic regularized Newton methods, are known for their rapid convergence rates; nevertheless, they become impractical in high-dimensional problems due to their substantial memory requirements and computational costs. One promising approach is to execute second-order updates within a lower-dimensional subspace, giving rise to subspace second-order methods. However, the majority of existing subspace second-order methods randomly select subspaces, consequently resulting in slower convergence rates depending on the problem's dimension $d$. In this paper, we introduce a novel subspace cubic regularized Newton method that achieves a dimension-independent global convergence rate of ${O}\left(\frac{1}{mk}+\frac{1}{k^2}\right)$ for solving convex optimization problems. Here, $m$ represents the subspace dimension, which can be significantly smaller than $d$. Instead of adopting a random subspace, our primary innovation involves performing the cubic regularized Newton update within the Krylov subspace associated with the Hessian and the gradient of the objective function. This result marks the first instance of a dimension-independent convergence rate for a subspace second-order method. Furthermore, when specific spectral conditions of the Hessian are met, our method recovers the convergence rate of a full-dimensional cubic regularized Newton method. Numerical experiments show our method converges faster than existing random subspace methods, especially for high-dimensional problems.
△ Less
Submitted 5 January, 2024;
originally announced January 2024.
-
Convex Bi-Level Optimization Problems with Non-smooth Outer Objective Function
Authors:
Roey Merchav,
Shoham Sabach
Abstract:
In this paper, we propose the Bi-Sub-Gradient (Bi-SG) method, which is a generalization of the classical sub-gradient method to the setting of convex bi-level optimization problems. This is a first-order method that is very easy to implement in the sense that it requires only a computation of the associated proximal map** or a sub-gradient of the outer non-smooth objective function, in addition…
▽ More
In this paper, we propose the Bi-Sub-Gradient (Bi-SG) method, which is a generalization of the classical sub-gradient method to the setting of convex bi-level optimization problems. This is a first-order method that is very easy to implement in the sense that it requires only a computation of the associated proximal map** or a sub-gradient of the outer non-smooth objective function, in addition to a proximal gradient step on the inner optimization problem. We show, under very mild assumptions, that Bi-SG tackles bi-level optimization problems and achieves sub-linear rates both in terms of the inner and outer objective functions. Moreover, if the outer objective function is additionally strongly convex (still could be non-smooth), the outer rate can be improved to a linear rate. Last, we prove that the distance of the generated sequence to the set of optimal solutions of the bi-level problem converges to zero.
△ Less
Submitted 17 July, 2023;
originally announced July 2023.
-
Faster Projection-Free Augmented Lagrangian Methods via Weak Proximal Oracle
Authors:
Dan Garber,
Tsur Livney,
Shoham Sabach
Abstract:
This paper considers a convex composite optimization problem with affine constraints, which includes problems that take the form of minimizing a smooth convex objective function over the intersection of (simple) convex sets, or regularized with multiple (simple) functions. Motivated by high-dimensional applications in which exact projection/proximal computations are not tractable, we propose a \te…
▽ More
This paper considers a convex composite optimization problem with affine constraints, which includes problems that take the form of minimizing a smooth convex objective function over the intersection of (simple) convex sets, or regularized with multiple (simple) functions. Motivated by high-dimensional applications in which exact projection/proximal computations are not tractable, we propose a \textit{projection-free} augmented Lagrangian-based method, in which primal updates are carried out using a \textit{weak proximal oracle} (WPO). In an earlier work, WPO was shown to be more powerful than the standard \textit{linear minimization oracle} (LMO) that underlies conditional gradient-based methods (aka Frank-Wolfe methods). Moreover, WPO is computationally tractable for many high-dimensional problems of interest, including those motivated by recovery of low-rank matrices and tensors, and optimization over polytopes which admit efficient LMOs. The main result of this paper shows that under a certain curvature assumption (which is weaker than strong convexity), our WPO-based algorithm achieves an ergodic rate of convergence of $O(1/T)$ for both the objective residual and feasibility gap. This result, to the best of our knowledge, improves upon the $O(1/\sqrt{T})$ rate for existing LMO-based projection-free methods for this class of problems. Empirical experiments on a low-rank and sparse covariance matrix estimation task and the Max Cut semidefinite relaxation demonstrate that of our method can outperform state-of-the-art LMO-based Lagrangian-based methods.
△ Less
Submitted 21 February, 2023; v1 submitted 25 October, 2022;
originally announced October 2022.
-
Faster Lagrangian-Based Methods in Convex Optimization
Authors:
Shoham Sabach,
Marc Teboulle
Abstract:
In this paper, we aim at unifying, simplifying and improving the convergence rate analysis of Lagrangian-based methods for convex optimization problems. We first introduce the notion of nice primal algorithmic map, which plays a central role in the unification and in the simplification of the analysis of most Lagrangian-based methods. Equipped with a nice primal algorithmic map, we then introduce…
▽ More
In this paper, we aim at unifying, simplifying and improving the convergence rate analysis of Lagrangian-based methods for convex optimization problems. We first introduce the notion of nice primal algorithmic map, which plays a central role in the unification and in the simplification of the analysis of most Lagrangian-based methods. Equipped with a nice primal algorithmic map, we then introduce a versatile generic scheme, which allows for the design and analysis of Faster LAGrangian (FLAG) methods with new provably sublinear rate of convergence expressed in terms of function values and feasibility violation of the original (non-ergodic) generated sequence. To demonstrate the power and versatility of our approach and results, we show that most well-known iconic Lagrangian-based schemes admit a nice primal algorithmic map, and hence share the new faster rate of convergence results within their corresponding FLAG.
△ Less
Submitted 6 June, 2023; v1 submitted 27 October, 2020;
originally announced October 2020.
-
Alternating Minimization Based First-Order Method for the Wireless Sensor Network Localization Problem
Authors:
Eyal Gur,
Shoham Sabach,
Shimrit Shtern
Abstract:
We propose an algorithm for the Wireless Sensor Network localization problem, which is based on the well-known algorithmic framework of Alternating Minimization. We start with a non-smooth and non-convex minimization, and transform it into an equivalent smooth and non-convex problem, which stands at the heart of our study. This paves the way to a new method which is globally convergent: not only d…
▽ More
We propose an algorithm for the Wireless Sensor Network localization problem, which is based on the well-known algorithmic framework of Alternating Minimization. We start with a non-smooth and non-convex minimization, and transform it into an equivalent smooth and non-convex problem, which stands at the heart of our study. This paves the way to a new method which is globally convergent: not only does the sequence of objective function values converge, but the sequence of the location estimates also converges to a unique location that is a critical point of the corresponding (original) objective function. The proposed algorithm has a range of fully distributed to fully centralized implementations, which all have the property of global convergence. The algorithm is tested over several network configurations, and it is shown to produce more accurate solutions within a shorter time relative to existing methods.
△ Less
Submitted 14 October, 2020;
originally announced October 2020.
-
Non-Convex Split Feasibility Problems: Models, Algorithms and Theory
Authors:
Aviv Gibali,
Shoham Sabach,
Sergey Voldman
Abstract:
In this paper, we propose a catalog of iterative methods for solving the Split Feasibility Problem in the non-convex setting. We study four different optimization formulations of the problem, where each model has advantageous in different settings of the problem. For each model, we study relevant iterative algorithms, some of which are well-known in this area and some are new. All the studied meth…
▽ More
In this paper, we propose a catalog of iterative methods for solving the Split Feasibility Problem in the non-convex setting. We study four different optimization formulations of the problem, where each model has advantageous in different settings of the problem. For each model, we study relevant iterative algorithms, some of which are well-known in this area and some are new. All the studied methods, including the well-known CQ Algorithm, are proven to have global convergence guarantees in the non-convex setting under mild conditions on the problem's data.
△ Less
Submitted 9 October, 2020;
originally announced October 2020.
-
Convex-Concave Backtracking for Inertial Bregman Proximal Gradient Algorithms in Non-Convex Optimization
Authors:
Mahesh Chandra Mukkamala,
Peter Ochs,
Thomas Pock,
Shoham Sabach
Abstract:
Backtracking line-search is an old yet powerful strategy for finding a better step sizes to be used in proximal gradient algorithms. The main principle is to locally find a simple convex upper bound of the objective function, which in turn controls the step size that is used. In case of inertial proximal gradient algorithms, the situation becomes much more difficult and usually leads to very restr…
▽ More
Backtracking line-search is an old yet powerful strategy for finding a better step sizes to be used in proximal gradient algorithms. The main principle is to locally find a simple convex upper bound of the objective function, which in turn controls the step size that is used. In case of inertial proximal gradient algorithms, the situation becomes much more difficult and usually leads to very restrictive rules on the extrapolation parameter. In this paper, we show that the extrapolation parameter can be controlled by locally finding also a simple concave lower bound of the objective function. This gives rise to a double convex-concave backtracking procedure which allows for an adaptive choice of both the step size and extrapolation parameters. We apply this procedure to the class of inertial Bregman proximal gradient methods, and prove that any sequence generated by these algorithms converges globally to a critical point of the function at hand. Numerical experiments on a number of challenging non-convex problems in image processing and machine learning were conducted and show the power of combining inertial step and double backtracking strategy in achieving improved performances.
△ Less
Submitted 5 November, 2019; v1 submitted 6 April, 2019;
originally announced April 2019.
-
Optimization on Spheres: Models and Proximal Algorithms with Computational Performance Comparisons
Authors:
D. Russell Luke,
Shoham Sabach,
Marc Teboulle
Abstract:
We present a unified treatment of the abstract problem of finding the best approximation between a cone and spheres in the image of affine transformations. Prominent instances of this problem are phase retrieval and source localization. The common geometry binding these problems permits a generic application of algorithmic ideas and abstract convergence results for nonconvex optimization. We organ…
▽ More
We present a unified treatment of the abstract problem of finding the best approximation between a cone and spheres in the image of affine transformations. Prominent instances of this problem are phase retrieval and source localization. The common geometry binding these problems permits a generic application of algorithmic ideas and abstract convergence results for nonconvex optimization. We organize variational models for this problem into three different classes and derive the main algorithmic approaches within these classes (13 in all). We identify the central ideas underlying these methods and provide thorough numerical benchmarks comparing their performance on synthetic and laboratory data. The software and data of our experiments are all publicly accessible. We also introduce one new algorithm, a cyclic relaxed Douglas-Rachford algorithm, which outperforms all other algorithms by every measure: speed, stability and accuracy. The analysis of this algorithm remains open.
△ Less
Submitted 5 October, 2018;
originally announced October 2018.
-
Improved Complexities of Conditional Gradient-Type Methods with Applications to Robust Matrix Recovery Problems
Authors:
Dan Garber,
Shoham Sabach,
Atara Kaplan
Abstract:
Motivated by robust matrix recovery problems such as Robust Principal Component Analysis, we consider a general optimization problem of minimizing a smooth and strongly convex loss function applied to the sum of two blocks of variables, where each block of variables is constrained or regularized individually. We study a Conditional Gradient-Type method which is able to leverage the special structu…
▽ More
Motivated by robust matrix recovery problems such as Robust Principal Component Analysis, we consider a general optimization problem of minimizing a smooth and strongly convex loss function applied to the sum of two blocks of variables, where each block of variables is constrained or regularized individually. We study a Conditional Gradient-Type method which is able to leverage the special structure of the problem to obtain faster convergence rates than those attainable via standard methods, under a variety of assumptions. In particular, our method is appealing for matrix problems in which one of the blocks corresponds to a low-rank matrix since it avoids prohibitive full-rank singular value decompositions required by most standard methods. While our initial motivation comes from problems which originated in statistics, our analysis does not impose any statistical assumptions on the data.
△ Less
Submitted 15 November, 2019; v1 submitted 15 February, 2018;
originally announced February 2018.
-
Nonconvex Lagrangian-Based Optimization: Monitoring Schemes and Global Convergence
Authors:
Jérôme Bolte,
Shoham Sabach,
Marc Teboulle
Abstract:
We introduce a novel approach addressing global analysis of a difficult class of nonconvex-nonsmooth optimization problems within the important framework of Lagrangian-based methods. This genuine nonlinear class captures many problems in modern disparate fields of applications. It features complex geometries, qualification conditions, and other regularity properties do not hold everywhere. To addr…
▽ More
We introduce a novel approach addressing global analysis of a difficult class of nonconvex-nonsmooth optimization problems within the important framework of Lagrangian-based methods. This genuine nonlinear class captures many problems in modern disparate fields of applications. It features complex geometries, qualification conditions, and other regularity properties do not hold everywhere. To address these issues we work along several research lines to develop an original general Lagrangian methodology which can deal, all at once, with the above obstacles. A first innovative feature of our approach is to introduce the concept of Lagrangian sequences for a broad class of algorithms. Central to this methodology is the idea of turning an arbitrary descent method into a multiplier method. Secondly, we provide these methods with a transitional regime allowing us to identify in finitely many steps a zone where we can tune the step-sizes of the algorithm for the final converging regime. Then, despite the min-max nature of Lagrangian methods, using an original Lyapunov method we prove that each bounded sequence generated by the resulting monitoring schemes are globally convergent to a critical point for some fundamental Lagrangian-based methods in the broad semialgebraic setting, which to the best of our knowledge, are the first of this kind.
△ Less
Submitted 9 January, 2018;
originally announced January 2018.
-
First Order Methods beyond Convexity and Lipschitz Gradient Continuity with Applications to Quadratic Inverse Problems
Authors:
Jérôme Bolte,
Shoham Sabach,
Marc Teboulle,
Yakov Vaisbourd
Abstract:
We focus on nonconvex and nonsmooth minimization problems with a composite objective, where the differentiable part of the objective is freed from the usual and restrictive global Lipschitz gradient continuity assumption. This longstanding smoothness restriction is pervasive in first order methods (FOM), and was recently circumvent for convex composite optimization by Bauschke, Bolte and Teboulle,…
▽ More
We focus on nonconvex and nonsmooth minimization problems with a composite objective, where the differentiable part of the objective is freed from the usual and restrictive global Lipschitz gradient continuity assumption. This longstanding smoothness restriction is pervasive in first order methods (FOM), and was recently circumvent for convex composite optimization by Bauschke, Bolte and Teboulle, through a simple and elegant framework which captures, all at once, the geometry of the function and of the feasible set. Building on this work, we tackle genuine nonconvex problems. We first complement and extend their approach to derive a full extended descent lemma by introducing the notion of smooth adaptable functions. We then consider a Bregman-based proximal gradient methods for the nonconvex composite model with smooth adaptable functions, which is proven to globally converge to a critical point under natural assumptions on the problem's data. To illustrate the power and potential of our general framework and results, we consider a broad class of quadratic inverse problems with sparsity constraints which arises in many fundamental applications, and we apply our approach to derive new globally convergent schemes for this class.
△ Less
Submitted 20 June, 2017;
originally announced June 2017.
-
On Fienup Methods for Regularized Phase Retrieval
Authors:
Edouard Pauwels,
Amir Beck,
Yonina C. Eldar,
Shoham Sabach
Abstract:
Alternating minimization, or Fienup methods, have a long history in phase retrieval. We provide new insights related to the empirical and theoretical analysis of these algorithms when used with Fourier measurements and combined with convex priors. In particular, we show that Fienup methods can be viewed as performing alternating minimization on a regularized nonconvex least-squares problem with re…
▽ More
Alternating minimization, or Fienup methods, have a long history in phase retrieval. We provide new insights related to the empirical and theoretical analysis of these algorithms when used with Fourier measurements and combined with convex priors. In particular, we show that Fienup methods can be viewed as performing alternating minimization on a regularized nonconvex least-squares problem with respect to amplitude measurements. We then prove that under mild additional structural assumptions on the prior (semi-algebraicity), the sequence of signal estimates has a smooth convergent behaviour towards a critical point of the nonconvex regularized least-squares objective. Finally, we propose an extension to Fienup techniques, based on a projected gradient descent interpretation and acceleration using inertial terms. We demonstrate experimentally that this modification combined with an $\ell_1$ prior constitutes a competitive approach for sparse phase retrieval.
△ Less
Submitted 27 February, 2017;
originally announced February 2017.
-
A First Order Method for Solving Convex Bi-Level Optimization Problems
Authors:
Shoham Sabach,
Shimrit Shtern
Abstract:
In this paper we study convex bi-level optimization problems for which the inner level consists of minimization of the sum of smooth and nonsmooth functions. The outer level aims at minimizing a smooth and strongly convex function over the optimal solutions set of the inner problem. We analyze a first order method which is based on an existing fixed-point algorithm. Global sublinear rate of conver…
▽ More
In this paper we study convex bi-level optimization problems for which the inner level consists of minimization of the sum of smooth and nonsmooth functions. The outer level aims at minimizing a smooth and strongly convex function over the optimal solutions set of the inner problem. We analyze a first order method which is based on an existing fixed-point algorithm. Global sublinear rate of convergence of the method is established in terms of the inner objective function values.
△ Less
Submitted 13 February, 2017;
originally announced February 2017.
-
Inertial Proximal Alternating Linearized Minimization (iPALM) for Nonconvex and Nonsmooth Problems
Authors:
Thomas Pock,
Shoham Sabach
Abstract:
In this paper we study nonconvex and nonsmooth optimization problems with semi-algebraic data, where the variables vector is split into several blocks of variables. The problem consists of one smooth function of the entire variables vector and the sum of nonsmooth functions for each block separately. We analyze an inertial version of the Proximal Alternating Linearized Minimization (PALM) algorith…
▽ More
In this paper we study nonconvex and nonsmooth optimization problems with semi-algebraic data, where the variables vector is split into several blocks of variables. The problem consists of one smooth function of the entire variables vector and the sum of nonsmooth functions for each block separately. We analyze an inertial version of the Proximal Alternating Linearized Minimization (PALM) algorithm and prove its global convergence to a critical point of the objective function at hand. We illustrate our theoretical findings by presenting numerical experiments on blind image deconvolution, on sparse non-negative matrix factorization and on dictionary learning, which demonstrate the viability and effectiveness of the proposed method.
△ Less
Submitted 8 February, 2017;
originally announced February 2017.
-
The Cyclic Block Conditional Gradient Method for Convex Optimization Problems
Authors:
Amir Beck,
Edouard Pauwels,
Shoham Sabach
Abstract:
In this paper we study the convex problem of optimizing the sum of a smooth function and a compactly supported non-smooth term with a specific separable form. We analyze the block version of the generalized conditional gradient method when the blocks are chosen in a cyclic order. A global sublinear rate of convergence is established for two different stepsize strategies commonly used in this class…
▽ More
In this paper we study the convex problem of optimizing the sum of a smooth function and a compactly supported non-smooth term with a specific separable form. We analyze the block version of the generalized conditional gradient method when the blocks are chosen in a cyclic order. A global sublinear rate of convergence is established for two different stepsize strategies commonly used in this class of methods. Numerical comparisons of the proposed method to both the classical conditional gradient algorithm and its random block version demonstrate the effectiveness of the cyclic block update rule.
△ Less
Submitted 25 September, 2015; v1 submitted 12 February, 2015;
originally announced February 2015.
-
Proximal Heterogeneous Block Input-Output Method and application to Blind Ptychographic Diffraction Imaging
Authors:
Robert Hesse,
D. Russell Luke,
Shoham Sabach,
Matthew K. Tam
Abstract:
We propose a general alternating minimization algorithm for nonconvex optimization problems with separable structure and nonconvex coupling between blocks of variables. To fix our ideas, we apply the methodology to the problem of blind ptychographic imaging. Compared to other schemes in the literature, our approach differs in two ways: (i) it is posed within a clear mathematical framework with pra…
▽ More
We propose a general alternating minimization algorithm for nonconvex optimization problems with separable structure and nonconvex coupling between blocks of variables. To fix our ideas, we apply the methodology to the problem of blind ptychographic imaging. Compared to other schemes in the literature, our approach differs in two ways: (i) it is posed within a clear mathematical framework with practically verifiable assumptions, and (ii) under the given assumptions, it is provably convergent to critical points. A numerical comparison of our proposed algorithm with the current state-of-the-art on simulated and experimental data validates our approach and points toward directions for further improvement.
△ Less
Submitted 8 August, 2014;
originally announced August 2014.