arXiv:2402.11920 [pdf, other]

A Feasible Method for Constrained Derivative-Free Optimization

Authors: Melody Qiming Xuan, Jorge Nocedal

Abstract: This paper explores a method for solving constrained optimization problems when the derivatives of the objective function are unavailable, while the derivatives of the constraints are known. We allow the objective and constraint function to be nonconvex. The method constructs a quadratic model of the objective function via interpolation and computes a step by minimizing this model subject to the original constraints in the problem and a trust region constraint. The step computation requires the solution of a general nonlinear program, which is economically feasible when the constraints and their derivatives are very inexpensive to compute compared to the objective function. The paper includes a summary of numerical results that highlight the method's promising potential.

Submitted 19 February, 2024; originally announced February 2024.

arXiv:2401.15007 [pdf, other]

Noise-Tolerant Optimization Methods for the Solution of a Robust Design Problem

Authors: Yuchen Lou, Shigeng Sun, Jorge Nocedal

Abstract: The development of nonlinear optimization algorithms capable of performing reliably in the presence of noise has garnered considerable attention lately. This paper advocates for strategies to create noise-tolerant nonlinear optimization algorithms by adapting classical deterministic methods. These adaptations follow certain design guidelines described here, which make use of estimates of the noise level in the problem. The application of our methodology is illustrated by the development of a line search gradient projection method, which is tested on an engineering design problem. It is shown that a new self-calibrated line search and noise-aware finite-difference techniques are effective even in the high noise regime. Numerical experiments investigate the resiliency of key algorithmic components. A convergence analysis of the line search gradient projection method establishes convergence to a neighborhood of the solution.

Submitted 26 January, 2024; originally announced January 2024.

MSC Class: 90C30; 90C15; 93B51; 65K05

arXiv:2201.00973 [pdf, other]

A Trust Region Method for the Optimization of Noisy Functions

Authors: Shigeng Sun, Jorge Nocedal

Abstract: Classical trust region methods were designed to solve problems in which function and gradient information are exact. This paper considers the case when there are bounded errors (or noise) in the above computations and proposes a simple modification of the trust region method to cope with these errors. The new algorithm only requires information about the size of the errors in the function evaluations and incurs no additional computational expense. It is shown that, when applied to a smooth (but not necessarily convex) objective function, the iterates of the algorithm visit a neighborhood of stationarity infinitely often, and that the rest of the sequence cannot stray too far away, as measured by function values. Numerical results illustrate how the classical trust region algorithm may fail in the presence of noise, and how the proposed algorithm ensures steady progress towards stationarity in these cases.

Submitted 3 January, 2022; originally announced January 2022.

MSC Class: 65K05; 68Q25; 65G99; 90C30

arXiv:2110.06380 [pdf, other]

Adaptive Finite-Difference Interval Estimation for Noisy Derivative-Free Optimization

Authors: Hao-Jun Michael Shi, Yuchen Xie, Melody Qiming Xuan, Jorge Nocedal

Abstract: A common approach for minimizing a smooth nonlinear function is to employ finite-difference approximations to the gradient. While this can be easily performed when no error is present within the function evaluations, when the function is noisy, the optimal choice requires information about the noise level and higher-order derivatives of the function, which is often unavailable. Given the noise level of the function, we propose a bisection search for finding a finite-difference interval for any finite-difference scheme that balances the truncation error, which arises from the error in the Taylor series approximation, and the measurement error, which results from noise in the function evaluation. Our procedure produces reliable estimates of the finite-difference interval at low cost without explicitly approximating higher-order derivatives. We show its numerical reliability and accuracy on a set of test problems. When combined with L-BFGS, we obtain a robust method for minimizing noisy black-box functions, as illustrated on a subset of unconstrained CUTEst problems with synthetically added noise.

Submitted 22 March, 2022; v1 submitted 12 October, 2021; originally announced October 2021.

Comments: 39 pages, 20 tables, 6 figures

arXiv:2110.04355 [pdf, other]

Constrained Optimization in the Presence of Noise

Authors: Figen Oztoprak, Richard Byrd, Jorge Nocedal

Abstract: The problem of interest is the minimization of a nonlinear function subject to nonlinear equality constraints using a sequential quadratic programming (SQP) method. The minimization must be performed while observing only noisy evaluations of the objective and constraint functions. In order to obtain stability, the classical SQP method is modified by relaxing the standard Armijo line search based on the noise level in the functions, which is assumed to be known. Convergence theory is presented giving conditions under which the iterates converge to a neighborhood of the solution characterized by the noise level and the problem conditioning. The analysis assumes that the SQP algorithm does not require regularization or trust regions. Numerical experiments indicate that the relaxed line search improves the practical performance of the method on problems involving uniformly distributed noise. One important application of this work is in the field of derivative-free optimization, when finite differences are employed to estimate gradients.

Submitted 8 October, 2021; originally announced October 2021.
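
Illustration: the idea of relaxing the Armijo test by the noise level can be shown with a minimal sketch. It assumes one common form of relaxation, in which a slack of twice the noise bound `eps_f` is added to the right-hand side of the sufficient-decrease condition. The paper itself works with an SQP merit function, which is not reproduced here; the function name `relaxed_backtracking`, the constants, and the test problem are all illustrative assumptions.

```python
def relaxed_backtracking(f, x, d, fx, slope, eps_f, c=1e-4, shrink=0.5, max_iter=30):
    """Backtracking line search with a noise-relaxed Armijo condition.

    Accepts the first step length alpha that satisfies
        f(x + alpha*d) <= f(x) + c*alpha*slope + 2*eps_f,
    where slope is the directional derivative g^T d (negative for a descent
    direction) and eps_f is an assumed bound on the noise in f.
    Setting eps_f = 0 recovers the classical Armijo test.
    """
    alpha = 1.0
    for _ in range(max_iter):
        trial = [xi + alpha * di for xi, di in zip(x, d)]
        if f(trial) <= fx + c * alpha * slope + 2.0 * eps_f:
            return alpha
        alpha *= shrink
    return None  # no acceptable step within max_iter reductions


def quadratic(x):
    # Noise-free stand-in objective, used only to keep the example deterministic.
    return sum(xi * xi for xi in x)


x0 = [1.0]
d0 = [-2.0]    # steepest-descent direction for x^2 at x = 1
slope0 = -4.0  # g^T d = 2 * (-2)

alpha_classical = relaxed_backtracking(quadratic, x0, d0, quadratic(x0), slope0, eps_f=0.0)
alpha_relaxed = relaxed_backtracking(quadratic, x0, d0, quadratic(x0), slope0, eps_f=0.01)
```

In this toy case the classical test rejects the unit step and backtracks, while the relaxed test accepts it because the lack of decrease lies within the assumed noise tolerance.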

arXiv:2102.09762 [pdf, other]

On the Numerical Performance of Derivative-Free Optimization Methods Based on Finite-Difference Approximations

Authors: Hao-Jun Michael Shi, Melody Qiming Xuan, Figen Oztoprak, Jorge Nocedal

Abstract: The goal of this paper is to investigate an approach for derivative-free optimization that has not received sufficient attention in the literature and is yet one of the simplest to implement and parallelize. It consists of computing gradients of a smoothed approximation of the objective function (and constraints), and employing them within established codes. These gradient approximations are calculated by finite differences, with a differencing interval determined by the noise level in the functions and a bound on the second or third derivatives. It is assumed that noise level is known or can be estimated by means of difference tables or sampling. The use of finite differences has been largely dismissed in the derivative-free optimization literature as too expensive in terms of function evaluations and/or as impractical when the objective function contains noise. The test results presented in this paper suggest that such views should be re-examined and that the finite-difference approach has much to be recommended. The tests compared NEWUOA, DFO-LS and COBYLA against the finite-difference approach on three classes of problems: general unconstrained problems, nonlinear least squares, and general nonlinear programs with equality constraints.

Submitted 19 February, 2021; originally announced February 2021.

Comments: 82 pages, 38 tables, 29 figures
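
Illustration: a differencing interval chosen from the noise level and a second-derivative bound can be sketched as below. The sketch assumes the classical forward-difference rule h = 2*sqrt(eps_f / L), which minimizes the error bound L*h/2 + 2*eps_f/h; the constants used in the paper may differ. The helper names, the noisy test function, and the values of `eps_f` and `L` are hypothetical.

```python
import math
import random


def fd_interval(eps_f, L):
    """Forward-difference interval balancing truncation error (L*h/2)
    against noise error (2*eps_f/h); assumes |f''| <= L along each axis."""
    return 2.0 * math.sqrt(eps_f / L)


def forward_diff_gradient(f, x, h):
    """Forward-difference gradient estimate using len(x) + 1 evaluations of f."""
    fx = f(x)
    grad = []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += h
        grad.append((f(xp) - fx) / h)
    return grad


def make_noisy_quadratic(eps_f, seed=0):
    """f(x) = sum(x_i^2) plus uniform noise in [-eps_f, eps_f] (synthetic)."""
    rng = random.Random(seed)

    def f(x):
        return sum(xi * xi for xi in x) + rng.uniform(-eps_f, eps_f)

    return f


eps_f = 1e-6   # assumed noise level
L = 2.0        # second-derivative bound for sum(x_i^2)
h = fd_interval(eps_f, L)

f_noisy = make_noisy_quadratic(eps_f)
x = [1.0, -2.0, 0.5]
g_est = forward_diff_gradient(f_noisy, x, h)
g_true = [2.0 * xi for xi in x]

max_err = max(abs(a - b) for a, b in zip(g_est, g_true))
err_bound = 2.0 * math.sqrt(L * eps_f)  # worst-case error at the optimal h
```

With this choice of `h`, each gradient component is guaranteed to be within `err_bound` of the true derivative, so the achievable accuracy scales with the square root of the noise level.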

arXiv:2012.15411 [pdf, other]

Constrained and Composite Optimization via Adaptive Sampling Methods

Authors: Yuchen Xie, Raghu Bollapragada, Richard Byrd, Jorge Nocedal

Abstract: The motivation for this paper stems from the desire to develop an adaptive sampling method for solving constrained optimization problems in which the objective function is stochastic and the constraints are deterministic. The method proposed in this paper is a proximal gradient method that can also be applied to the composite optimization problem min f(x) + h(x), where f is stochastic and h is convex (but not necessarily differentiable). Adaptive sampling methods employ a mechanism for gradually improving the quality of the gradient approximation so as to keep computational cost to a minimum. The mechanism commonly employed in unconstrained optimization is no longer reliable in the constrained or composite optimization settings because it is based on pointwise decisions that cannot correctly predict the quality of the proximal gradient step. The method proposed in this paper measures the result of a complete step to determine if the gradient approximation is accurate enough; otherwise a more accurate gradient is generated and a new step is computed. Convergence results are established both for strongly convex and general convex f. Numerical experiments are presented to illustrate the practical behavior of the method.

Submitted 30 December, 2020; originally announced December 2020.

Comments: 26 pages, 13 figures

arXiv:2010.04352 [pdf, other]

A Noise-Tolerant Quasi-Newton Algorithm for Unconstrained Optimization

Authors: Hao-Jun Michael Shi, Yuchen Xie, Richard Byrd, Jorge Nocedal

Abstract: This paper describes an extension of the BFGS and L-BFGS methods for the minimization of a nonlinear function subject to errors. This work is motivated by applications that contain computational noise, employ low-precision arithmetic, or are subject to statistical noise. The classical BFGS and L-BFGS methods can fail in such circumstances because the updating procedure can be corrupted and the line search can behave erratically. The proposed method addresses these difficulties and ensures that the BFGS update is stable by employing a lengthening procedure that spaces out the points at which gradient differences are collected. A new line search, designed to tolerate errors, guarantees that the Armijo-Wolfe conditions are satisfied under most reasonable conditions, and works in conjunction with the lengthening procedure. The proposed methods are shown to enjoy convergence guarantees for strongly convex functions. Detailed implementations of the methods are presented, together with encouraging numerical results.

Submitted 8 September, 2021; v1 submitted 8 October, 2020; originally announced October 2020.

Comments: 27 pages, 13 figures, 2 tables

arXiv:1901.09063 [pdf, other]

Analysis of the BFGS Method with Errors

Authors: Yuchen Xie, Richard Byrd, Jorge Nocedal

Abstract: The classical convergence analysis of quasi-Newton methods assumes that the function and gradients employed at each iteration are exact. In this paper, we consider the case when there are (bounded) errors in both computations and establish conditions under which a slight modification of the BFGS algorithm with an Armijo-Wolfe line search converges to a neighborhood of the solution that is determined by the size of the errors. One of our results is an extension of the analysis presented in Byrd, R. H., & Nocedal, J. (1989), which establishes that, for strongly convex functions, a fraction of the BFGS iterates are good iterates. We present numerical results illustrating the performance of the new BFGS method in the presence of noise.

Submitted 25 January, 2019; originally announced January 2019.

arXiv:1803.10173 [pdf, other]

Derivative-Free Optimization of Noisy Functions via Quasi-Newton Methods

Authors: Albert S. Berahas, Richard H. Byrd, Jorge Nocedal

Abstract: This paper presents a finite difference quasi-Newton method for the minimization of noisy functions. The method takes advantage of the scalability and power of BFGS updating, and employs an adaptive procedure for choosing the differencing interval $h$ based on the noise estimation techniques of Hamming (2012) and Moré and Wild (2011). This noise estimation procedure and the selection of $h$ are inexpensive but not always accurate, and to prevent failures the algorithm incorporates a recovery mechanism that takes appropriate action in the case when the line search procedure is unable to produce an acceptable point. A novel convergence analysis is presented that considers the effect of a noisy line search procedure. Numerical experiments comparing the method to a function interpolating trust region method are presented.

Submitted 8 January, 2019; v1 submitted 27 March, 2018; originally announced March 2018.

Comments: 26 pages, 9 figures

arXiv:1802.05374 [pdf, other]

A Progressive Batching L-BFGS Method for Machine Learning

Authors: Raghu Bollapragada, Dheevatsa Mudigere, Jorge Nocedal, Hao-Jun Michael Shi, Ping Tak Peter Tang

Abstract: The standard L-BFGS method relies on gradient approximations that are not dominated by noise, so that search directions are descent directions, the line search is reliable, and quasi-Newton updating yields useful quadratic models of the objective function. All of this appears to call for a full batch approach, but since small batch sizes give rise to faster algorithms with better generalization properties, L-BFGS is currently not considered an algorithm of choice for large-scale machine learning applications. One need not, however, choose between the two extremes represented by the full batch or highly stochastic regimes, and may instead follow a progressive batching approach in which the sample size increases during the course of the optimization. In this paper, we present a new version of the L-BFGS algorithm that combines three basic components - progressive batching, a stochastic line search, and stable quasi-Newton updating - and that performs well on training logistic regression and deep neural networks. We provide supporting convergence theory for the method.

Submitted 30 May, 2018; v1 submitted 14 February, 2018; originally announced February 2018.

Comments: ICML 2018. 25 pages, 17 figures, 2 tables

arXiv:1710.11258 [pdf, other]

Adaptive Sampling Strategies for Stochastic Optimization

Authors: Raghu Bollapragada, Richard Byrd, Jorge Nocedal

Abstract: In this paper, we propose a stochastic optimization method that adaptively controls the sample size used in the computation of gradient approximations. Unlike other variance reduction techniques that either require additional storage or the regular computation of full gradients, the proposed method reduces variance by increasing the sample size as needed. The decision to increase the sample size is governed by an inner product test that ensures that search directions are descent directions with high probability. We show that the inner product test improves upon the well known norm test, and can be used as a basis for an algorithm that is globally convergent on nonconvex functions and enjoys a global linear rate of convergence on strongly convex functions. Numerical experiments on logistic regression problems illustrate the performance of the algorithm.

Submitted 30 October, 2017; originally announced October 2017.

Comments: 32 Pages
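
Illustration: the two sample-size tests named in the abstract can be sketched as follows. The sketch assumes their commonly stated sample-variance forms: the norm test compares the variance of the per-sample gradients with theta^2 * ||g||^2, and the inner product test compares the variance of the inner products g_i^T g with theta^2 * ||g||^4, where g is the batch-averaged gradient. The function names, the value of `theta`, and the data are illustrative assumptions, not taken from the paper.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def norm_test(per_sample_grads, theta):
    """True if the sample variance of the gradients, divided by the batch size,
    is at most theta^2 * ||g||^2 (assumed form of the norm test)."""
    n = len(per_sample_grads)
    g = mean_vector(per_sample_grads)
    var = sum(
        dot([a - b for a, b in zip(gi, g)], [a - b for a, b in zip(gi, g)])
        for gi in per_sample_grads
    ) / (n - 1)
    return var / n <= theta ** 2 * dot(g, g)


def inner_product_test(per_sample_grads, theta):
    """True if the sample variance of g_i^T g, divided by the batch size,
    is at most theta^2 * ||g||^4 (assumed form of the inner product test)."""
    n = len(per_sample_grads)
    g = mean_vector(per_sample_grads)
    gg = dot(g, g)
    var = sum((dot(gi, g) - gg) ** 2 for gi in per_sample_grads) / (n - 1)
    return var / n <= theta ** 2 * gg ** 2


theta = 0.9

# Per-sample gradients whose spread is orthogonal to the batch gradient [1, 0].
orthogonal_noise = [[1.0, 5.0], [1.0, -5.0]]
# Per-sample gradients whose spread lies along the batch gradient direction.
aligned_noise = [[3.0, 0.0], [-3.0, 0.0], [0.3, 0.0]]

norm_ok_orth = norm_test(orthogonal_noise, theta)
ip_ok_orth = inner_product_test(orthogonal_noise, theta)
norm_ok_aligned = norm_test(aligned_noise, theta)
ip_ok_aligned = inner_product_test(aligned_noise, theta)
```

In the first data set the norm test requests a larger sample while the inner product test does not, because the variation does not affect whether the batch gradient is a descent direction. In the second data set both tests fail.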

arXiv:1705.06211 [pdf, other]

An Investigation of Newton-Sketch and Subsampled Newton Methods

Authors: Albert S. Berahas, Raghu Bollapragada, Jorge Nocedal

Abstract: Sketching, a dimensionality reduction technique, has received much attention in the statistics community. In this paper, we study sketching in the context of Newton's method for solving finite-sum optimization problems in which the number of variables and data points are both large. We study two forms of sketching that perform dimensionality reduction in data space: Hessian subsampling and randomized Hadamard transformations. Each has its own advantages, and their relative tradeoffs have not been investigated in the optimization literature. Our study focuses on practical versions of the two methods in which the resulting linear systems of equations are solved approximately, at every iteration, using an iterative solver. The advantages of using the conjugate gradient method vs. a stochastic gradient iteration are revealed through a set of numerical experiments, and a complexity analysis of the Hessian subsampling method is presented.

Submitted 30 May, 2019; v1 submitted 17 May, 2017; originally announced May 2017.

Comments: 36 pages, 22 figures

arXiv:1609.08502 [pdf, other]

Exact and Inexact Subsampled Newton Methods for Optimization

Authors: Raghu Bollapragada, Richard Byrd, Jorge Nocedal

Abstract: The paper studies the solution of stochastic optimization problems in which approximations to the gradient and Hessian are obtained through subsampling. We first consider Newton-like methods that employ these approximations and discuss how to coordinate the accuracy in the gradient and Hessian to yield a superlinear rate of convergence in expectation. The second part of the paper analyzes an inexact Newton method that solves linear systems approximately using the conjugate gradient (CG) method, and that samples the Hessian and not the gradient (the gradient is assumed to be exact). We provide a complexity analysis for this method based on the properties of the CG iteration and the quality of the Hessian approximation, and compare it with a method that employs a stochastic gradient iteration instead of the CG method. We report preliminary numerical results that illustrate the performance of inexact subsampled Newton methods on machine learning applications based on logistic regression.

Submitted 27 September, 2016; originally announced September 2016.

Comments: 37 pages

arXiv:1609.04836 [pdf, other]

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Authors: Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang

Abstract: The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$-$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We discuss several strategies to attempt to help large-batch methods eliminate this generalization gap.

Submitted 9 February, 2017; v1 submitted 15 September, 2016; originally announced September 2016.

Comments: Accepted as a conference paper at ICLR 2017

arXiv:1606.04838 [pdf, other]

Optimization Methods for Large-Scale Machine Learning

Authors: Léon Bottou, Frank E. Curtis, Jorge Nocedal

Abstract: This paper provides a review and commentary on the past, present, and future of numerical optimization algorithms in the context of machine learning applications. Through case studies on text classification and the training of deep neural networks, we discuss how optimization problems arise in machine learning and what makes them challenging. A major theme of our study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient (SG) method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter. Based on this viewpoint, we present a comprehensive theory of a straightforward, yet versatile SG algorithm, discuss its practical behavior, and highlight opportunities for designing algorithms with improved performance. This leads to a discussion about the next generation of optimization methods for large-scale machine learning, including an investigation of two main streams of research on techniques that diminish noise in the stochastic directions and methods that make use of second-order derivative approximations.

Submitted 8 February, 2018; v1 submitted 15 June, 2016; originally announced June 2016.

arXiv:1605.06049 [pdf, other]

A Multi-Batch L-BFGS Method for Machine Learning

Authors: Albert S. Berahas, Jorge Nocedal, Martin Takáč

Abstract: The question of how to parallelize the stochastic gradient descent (SGD) method has received much attention in the literature. In this paper, we focus instead on batch methods that use a sizeable fraction of the training set at each iteration to facilitate parallelism, and that employ second-order information. In order to improve the learning process, we follow a multi-batch approach in which the batch changes at each iteration. This can cause difficulties because L-BFGS employs gradient differences to update the Hessian approximations, and when these gradients are computed using different data points the process can be unstable. This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, illustrates the behavior of the algorithm in a distributed computing platform, and studies its convergence properties for both the convex and nonconvex cases.

Submitted 23 October, 2016; v1 submitted 19 May, 2016; originally announced May 2016.

Comments: NIPS 2016. 31 pages, 22 figures

arXiv:1505.04315 [pdf, ps, other]

A Second-Order Method for Convex $\ell_1$-Regularized Optimization with Active Set Prediction

Authors: Nitish Shirish Keskar, Jorge Nocedal, Figen Oztoprak, Andreas Waechter

Abstract: We describe an active-set method for the minimization of an objective function $φ$ that is the sum of a smooth convex function and an $\ell_1$-regularization term. A distinctive feature of the method is the way in which active-set identification and second-order subspace minimization steps are integrated to combine the predictive power of the two approaches. At every iteration, the algorithm selects a candidate set of free and fixed variables, performs an (inexact) subspace phase, and then assesses the quality of the new active set. If it is not judged to be acceptable, then the set of free variables is restricted and a new active-set prediction is made. We establish global convergence for our approach, and compare the new method against the state-of-the-art code LIBLINEAR.

Submitted 16 May, 2015; originally announced May 2015.

arXiv:1412.1844 [pdf, ps, other]

An Algorithm for Quadratic $\ell_1$-Regularized Optimization with a Flexible Active-Set Strategy

Authors: Stefan Solntsev, Jorge Nocedal, Richard Byrd

Abstract: We present an active-set method for minimizing an objective that is the sum of a convex quadratic and $\ell_1$ regularization term. Unlike two-phase methods that combine a first-order active set identification step and a subspace phase consisting of a cycle of conjugate gradient (CG) iterations, the method presented here has the flexibility of computing a first-order proximal gradient step or a subspace CG step at each iteration. The decision of which type of step to perform is based on the relative magnitudes of some scaled components of the minimum norm subgradient of the objective function. The paper establishes global rates of convergence, as well as work complexity estimates for two variants of our approach, which we call the iiCG method. Numerical results illustrating the behavior of the method on a variety of test problems are presented.

Submitted 4 December, 2014; originally announced December 2014.

arXiv:1401.7020 [pdf, other]

A Stochastic Quasi-Newton Method for Large-Scale Optimization

Authors: R. H. Byrd, S. L. Hansen, J. Nocedal, Y. Singer

Abstract: The question of how to incorporate curvature information in stochastic approximation methods is challenging. The direct application of classical quasi-Newton updating techniques for deterministic optimization leads to noisy curvature estimates that have harmful effects on the robustness of the iteration. In this paper, we propose a stochastic quasi-Newton method that is efficient, robust and scalable. It employs the classical BFGS update formula in its limited memory form, and is based on the observation that it is beneficial to collect curvature information pointwise, and at regular intervals, through (sub-sampled) Hessian-vector products. This technique differs from the classical approach that would compute differences of gradients, and where controlling the quality of the curvature estimates can be difficult. We present numerical results on problems arising in machine learning that suggest that the proposed method shows much promise.

Submitted 18 February, 2015; v1 submitted 27 January, 2014; originally announced January 2014.

arXiv:1309.3529 [pdf, ps, other]

An Inexact Successive Quadratic Approximation Method for Convex L-1 Regularized Optimization

Authors: Richard H. Byrd, Jorge Nocedal, Figen Oztoprak

Abstract: We study a Newton-like method for the minimization of an objective function that is the sum of a smooth convex function and an l-1 regularization term. This method, which is sometimes referred to in the literature as a proximal Newton method, computes a step by minimizing a piecewise quadratic model of the objective function. In order to make this approach efficient in practice, it is imperative to perform this inner minimization inexactly. In this paper, we give inexactness conditions that guarantee global convergence and that can be used to control the local rate of convergence of the iteration. Our inexactness conditions are based on a semi-smooth function that represents a (continuous) measure of the optimality conditions of the problem, and that embodies the soft-thresholding iteration. We give careful consideration to the algorithm employed for the inner minimization, and report numerical results on two test sets originating in machine learning.

Submitted 13 September, 2013; originally announced September 2013.
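
Illustration: the soft-thresholding iteration mentioned in the abstract is built on a simple componentwise operator, sketched here together with one proximal gradient step for an l-1 regularized objective f(x) + lam * ||x||_1. This shows only the standard operator, not the paper's inexactness conditions or optimality measure; the function names and numerical values are illustrative assumptions.

```python
def soft_threshold(z, tau):
    """Componentwise soft-thresholding: shrink each entry toward zero by tau,
    and set entries with magnitude at most tau to zero."""
    out = []
    for zi in z:
        if zi > tau:
            out.append(zi - tau)
        elif zi < -tau:
            out.append(zi + tau)
        else:
            out.append(0.0)
    return out


def proximal_gradient_step(x, grad, step, lam):
    """One soft-thresholding (proximal gradient) step:
    x_new = soft_threshold(x - step * grad, step * lam)."""
    shifted = [xi - step * gi for xi, gi in zip(x, grad)]
    return soft_threshold(shifted, step * lam)


shrunk = soft_threshold([3.0, -0.5, -2.0], 1.0)

# One step from x = [2.0, 0.5] with a hypothetical gradient of the smooth part.
x_new = proximal_gradient_step([2.0, 0.5], [1.0, 1.0], step=0.5, lam=1.0)
```

The second component of `x_new` is exactly zero, which is how the operator produces sparse iterates.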

Showing 1–21 of 21 results for author: Nocedal, J