-
Bolstering Stochastic Gradient Descent with Model Building
Authors:
S. Ilker Birbil,
Ozgur Martin,
Gonenc Onay,
Figen Oztoprak
Abstract:
The stochastic gradient descent method and its variants constitute the core optimization algorithms that achieve good convergence rates for solving machine learning problems. These rates are obtained especially when these algorithms are fine-tuned for the application at hand. Although this tuning process can incur large computational costs, recent work has shown that these costs can be reduced by line search methods that iteratively adjust the step length. We propose an alternative approach to stochastic line search by using a new algorithm based on forward step model building. This model building step incorporates second-order information that allows adjusting not only the step length but also the search direction. Noting that deep learning model parameters come in groups (layers of tensors), our method builds its model and calculates a new step for each parameter group. This novel diagonalization approach makes the selected step lengths adaptive. We provide convergence rate analysis, and experimentally show that the proposed algorithm achieves faster convergence and better generalization in well-known test problems. More precisely, the proposed algorithm, SMB, requires less tuning and shows performance comparable to other adaptive methods.
Submitted 13 March, 2024; v1 submitted 13 November, 2021;
originally announced November 2021.
-
Constrained Optimization in the Presence of Noise
Authors:
Figen Oztoprak,
Richard Byrd,
Jorge Nocedal
Abstract:
The problem of interest is the minimization of a nonlinear function subject to nonlinear equality constraints using a sequential quadratic programming (SQP) method. The minimization must be performed while observing only noisy evaluations of the objective and constraint functions. In order to obtain stability, the classical SQP method is modified by relaxing the standard Armijo line search based on the noise level in the functions, which is assumed to be known. Convergence theory is presented giving conditions under which the iterates converge to a neighborhood of the solution characterized by the noise level and the problem conditioning. The analysis assumes that the SQP algorithm does not require regularization or trust regions. Numerical experiments indicate that the relaxed line search improves the practical performance of the method on problems involving uniformly distributed noise. One important application of this work is in the field of derivative-free optimization, when finite differences are employed to estimate gradients.
Submitted 8 October, 2021;
originally announced October 2021.
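As a rough illustration of the relaxed Armijo line search mentioned in the abstract above, the following Python sketch shows a backtracking search whose sufficient-decrease test is loosened by a multiple of the noise level. It is not the authors' SQP method: it treats the unconstrained case only, and the function name `relaxed_armijo`, the relaxation term `2 * eps_f`, and the constants `c1` and `shrink` are assumptions chosen for illustration.

```python
def relaxed_armijo(f, x, d, slope, eps_f, alpha=1.0, c1=1e-4, shrink=0.5, max_iter=30):
    """Backtracking line search with a noise-relaxed Armijo test (illustrative sketch).

    f      : (possibly noisy) objective, takes a list of floats
    x, d   : current point and search direction (lists of floats)
    slope  : estimate of the directional derivative g^T d (should be negative)
    eps_f  : assumed known bound on the noise in evaluations of f
    Returns (alpha, trial_point); alpha is 0.0 if no step was accepted.
    """
    fx = f(x)
    for _ in range(max_iter):
        trial = [xi + alpha * di for xi, di in zip(x, d)]
        # Classical Armijo would require f(trial) <= fx + c1 * alpha * slope.
        # The extra 2 * eps_f keeps noise alone from rejecting a good step
        # (assumed form of the relaxation, not taken from the abstract).
        if f(trial) <= fx + c1 * alpha * slope + 2.0 * eps_f:
            return alpha, trial
        alpha *= shrink
    return 0.0, list(x)
```

With `eps_f = 0` the test reduces to the classical Armijo condition, so the relaxation only takes effect when a noise level is supplied.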
-
On the Numerical Performance of Derivative-Free Optimization Methods Based on Finite-Difference Approximations
Authors:
Hao-Jun Michael Shi,
Melody Qiming Xuan,
Figen Oztoprak,
Jorge Nocedal
Abstract:
The goal of this paper is to investigate an approach for derivative-free optimization that has not received sufficient attention in the literature, yet is one of the simplest to implement and parallelize. It consists of computing gradients of a smoothed approximation of the objective function (and constraints), and employing them within established codes. These gradient approximations are calculated by finite differences, with a differencing interval determined by the noise level in the functions and a bound on the second or third derivatives. It is assumed that the noise level is known or can be estimated by means of difference tables or sampling. The use of finite differences has been largely dismissed in the derivative-free optimization literature as too expensive in terms of function evaluations and/or as impractical when the objective function contains noise. The test results presented in this paper suggest that such views should be re-examined and that the finite-difference approach has much to recommend it. The tests compared NEWUOA, DFO-LS and COBYLA against the finite-difference approach on three classes of problems: general unconstrained problems, nonlinear least squares, and general nonlinear programs with equality constraints.
Submitted 19 February, 2021;
originally announced February 2021.
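The abstract above says the differencing interval is determined by the noise level and a bound on the second or third derivatives. A minimal Python sketch of one standard way to do this follows; it is not the paper's code. The function names are hypothetical. The formulas come from minimizing the usual error bounds: `L2*h/2 + 2*eps_f/h` for forward differences and `L3*h**2/6 + eps_f/h` for central differences, where `eps_f` bounds the noise and `L2`, `L3` bound the second and third derivatives.

```python
import math

def forward_diff_interval(eps_f, L2):
    """Interval h minimizing the forward-difference error bound L2*h/2 + 2*eps_f/h."""
    return 2.0 * math.sqrt(eps_f / L2)

def central_diff_interval(eps_f, L3):
    """Interval h minimizing the central-difference error bound L3*h**2/6 + eps_f/h."""
    return (3.0 * eps_f / L3) ** (1.0 / 3.0)

def forward_diff_gradient(f, x, eps_f, L2):
    """Forward-difference gradient estimate of a noisy f at x (list of floats).

    Uses the same noise-aware interval for every coordinate, which is a
    simplifying assumption; costs len(x) + 1 function evaluations.
    """
    h = forward_diff_interval(eps_f, L2)
    fx = f(x)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += h
        grad.append((f(shifted) - fx) / h)
    return grad
```

The coordinate evaluations inside `forward_diff_gradient` are independent of one another, which is why this approach is easy to parallelize.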
-
An Alternative Globalization Strategy for Unconstrained Optimization
Authors:
Figen Öztoprak,
Ş. İlker Birbil
Abstract:
We propose a new globalization strategy that can be used in unconstrained optimization algorithms to support rapid convergence from remote starting points. Our approach is based on using multiple points at each iteration to build a representative model of the objective function. Using the new information gathered from those multiple points, a local step is gradually improved by updating its direction as well as its length. We give a global convergence result and also provide parallel implementation details accompanied by a numerical study. Our numerical study shows that the proposed algorithm is a promising alternative as a globalization strategy.
Submitted 15 May, 2017;
originally announced May 2017.
-
HAMSI: A Parallel Incremental Optimization Algorithm Using Quadratic Approximations for Solving Partially Separable Problems
Authors:
Kamer Kaya,
Figen Öztoprak,
Ş. İlker Birbil,
A. Taylan Cemgil,
Umut Şimşekli,
Nurdan Kuru,
Hazal Koptagel,
M. Kaan Öztürk
Abstract:
We propose HAMSI (Hessian Approximated Multiple Subsets Iteration), which is a provably convergent, second-order incremental algorithm for solving large-scale partially separable optimization problems. The algorithm is based on a local quadratic approximation and hence allows incorporating curvature information to speed up convergence. HAMSI is inherently parallel and scales well with the number of processors. Combined with techniques for effectively utilizing modern parallel computer architectures, we illustrate that the proposed method converges more rapidly than a parallel stochastic gradient descent when both methods are used to solve large-scale matrix factorization problems. This performance gain comes only at the expense of using memory that scales linearly with the total size of the optimization variables. We conclude that HAMSI may be considered a viable alternative in many large-scale problems where first-order methods based on variants of stochastic gradient descent are applicable.
Submitted 4 August, 2017; v1 submitted 5 September, 2015;
originally announced September 2015.
-
Parallel Stochastic Gradient Markov Chain Monte Carlo for Matrix Factorisation Models
Authors:
Umut Şimşekli,
Hazal Koptagel,
Hakan Güldaş,
A. Taylan Cemgil,
Figen Öztoprak,
Ş. İlker Birbil
Abstract:
For large matrix factorisation problems, we develop a distributed Markov Chain Monte Carlo (MCMC) method based on stochastic gradient Langevin dynamics (SGLD) that we call Parallel SGLD (PSGLD). PSGLD has very favourable scaling properties with increasing data size and is comparable in terms of computational requirements to optimisation methods based on stochastic gradient descent. PSGLD achieves high performance by exploiting the conditional independence structure of the MF models to sub-sample data in a systematic manner so as to allow parallelisation and distributed computation. We provide a convergence proof of the algorithm and verify its superior performance on various architectures such as Graphics Processing Units, shared memory multi-core systems and multi-computer clusters.
Submitted 28 September, 2015; v1 submitted 3 June, 2015;
originally announced June 2015.
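For readers unfamiliar with SGLD, the base method named in the abstract above, a minimal Python sketch of a single plain SGLD update follows. It does not implement PSGLD's systematic sub-sampling or any parallelism. The function name `sgld_step`, its arguments, and the injectable `rng` are assumptions for illustration. The update is a half-step along a mini-batch estimate of the log-posterior gradient plus Gaussian noise whose variance equals the step size.

```python
import math
import random

def sgld_step(theta, grad_log_prior, grad_log_lik, batch, n_total, step, rng=random):
    """One plain SGLD update (illustrative sketch, not the paper's PSGLD).

    theta          : current parameters (list of floats)
    grad_log_prior : theta -> gradient of log p(theta)
    grad_log_lik   : (theta, x) -> gradient of log p(x | theta) for one data point
    batch          : mini-batch of data points
    n_total        : total number of data points (rescales the mini-batch gradient)
    step           : step size; also the variance of the injected noise
    rng            : object with a gauss(mu, sigma) method (hypothetical hook for testing)
    """
    scale = n_total / len(batch)
    g_prior = grad_log_prior(theta)
    g_lik = [0.0] * len(theta)
    for x in batch:
        g = grad_log_lik(theta, x)
        for j in range(len(theta)):
            g_lik[j] += g[j]
    new_theta = []
    for j in range(len(theta)):
        drift = 0.5 * step * (g_prior[j] + scale * g_lik[j])
        noise = rng.gauss(0.0, math.sqrt(step))
        new_theta.append(theta[j] + drift + noise)
    return new_theta
```

Without the noise term the update is a stochastic gradient ascent step on the log-posterior, which is why the per-iteration cost is comparable to that of stochastic gradient descent.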
-
A Second-Order Method for Convex $\ell_1$-Regularized Optimization with Active Set Prediction
Authors:
Nitish Shirish Keskar,
Jorge Nocedal,
Figen Oztoprak,
Andreas Waechter
Abstract:
We describe an active-set method for the minimization of an objective function $\phi$ that is the sum of a smooth convex function and an $\ell_1$-regularization term. A distinctive feature of the method is the way in which active-set identification and second-order subspace minimization steps are integrated to combine the predictive power of the two approaches. At every iteration, the algorithm selects a candidate set of free and fixed variables, performs an (inexact) subspace phase, and then assesses the quality of the new active set. If it is not judged to be acceptable, then the set of free variables is restricted and a new active-set prediction is made. We establish global convergence for our approach, and compare the new method against the state-of-the-art code LIBLINEAR.
Submitted 16 May, 2015;
originally announced May 2015.
-
An Inexact Successive Quadratic Approximation Method for Convex L-1 Regularized Optimization
Authors:
Richard H. Byrd,
Jorge Nocedal,
Figen Oztoprak
Abstract:
We study a Newton-like method for the minimization of an objective function that is the sum of a smooth convex function and an l-1 regularization term. This method, which is sometimes referred to in the literature as a proximal Newton method, computes a step by minimizing a piecewise quadratic model of the objective function. In order to make this approach efficient in practice, it is imperative to perform this inner minimization inexactly. In this paper, we give inexactness conditions that guarantee global convergence and that can be used to control the local rate of convergence of the iteration. Our inexactness conditions are based on a semi-smooth function that represents a (continuous) measure of the optimality conditions of the problem, and that embodies the soft-thresholding iteration. We give careful consideration to the algorithm employed for the inner minimization, and report numerical results on two test sets originating in machine learning.
Submitted 13 September, 2013;
originally announced September 2013.
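The abstract above refers to an optimality measure that embodies the soft-thresholding iteration. A minimal Python sketch of that idea follows; it is not necessarily the paper's exact function. The names `soft_threshold` and `ista_residual` and the step parameter `tau` are assumptions. For the problem of minimizing `f(x) + lam * ||x||_1` with `f` smooth and convex, the residual `x - S(x - tau * grad_f(x), tau * lam)` is continuous in `x` and vanishes exactly at a minimizer, where `S` is the soft-thresholding operator.

```python
def soft_threshold(v, t):
    """Componentwise soft-thresholding: shrink each entry of v toward zero by t."""
    out = []
    for vi in v:
        if vi > t:
            out.append(vi - t)
        elif vi < -t:
            out.append(vi + t)
        else:
            out.append(0.0)
    return out

def ista_residual(x, grad, lam, tau=1.0):
    """Difference between x and one soft-thresholding (ISTA) step from x.

    x    : current point (list of floats)
    grad : gradient of the smooth part f at x
    lam  : l1 regularization weight
    tau  : step length of the soft-thresholding iteration (assumed parameter)
    A zero residual means x satisfies the optimality conditions.
    """
    shifted = [xi - tau * gi for xi, gi in zip(x, grad)]
    prox = soft_threshold(shifted, tau * lam)
    return [xi - pi for xi, pi in zip(x, prox)]
```

As a check, for `0.5 * (x - 3)**2 + |x|` the minimizer is `x = 2`, where the gradient of the smooth part is `-1` and `ista_residual([2.0], [-1.0], 1.0)` is zero. An inexact inner solver could use the norm of such a residual as a stopping test, which is the role the abstract describes.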