Higher-Order Newton Methods with Polynomial Work per Iteration
Abstract
We present generalizations of Newton’s method that incorporate derivatives of an arbitrary order $d$ but maintain a polynomial dependence on dimension in their cost per iteration. At each step, our $d$th-order method uses semidefinite programming to construct and minimize a sum of squares-convex approximation to the $d$th-order Taylor expansion of the function we wish to minimize. We prove that our $d$th-order method has local convergence of order $d$. This results in lower oracle complexity compared to the classical Newton method. We show on numerical examples that basins of attraction around local minima can get larger as $d$ increases. Under additional assumptions, we present a modified algorithm, again with polynomial cost per iteration, which is globally convergent and has local convergence of order $d$.
Keywords. Newton’s method, tensor methods, semidefinite programming, sum of squares methods, convergence analysis.
1 Introduction
Newton’s method is perhaps one of the most well-known and prominent algorithms in optimization. In its attempt to minimize a function $f:\mathbb{R}^n \to \mathbb{R}$, this algorithm replaces $f$ with its second-order Taylor expansion at an iterate $x_k$ and defines the next iterate $x_{k+1}$ to be a critical point of this quadratic approximation. This critical point coincides with a minimizer of the quadratic approximation in the case where the Hessian of $f$ at $x_k$ is positive semidefinite.
The work required in each iteration of Newton’s method consists of solving a system of linear equations which arises from setting the gradient of the quadratic approximation to zero. This can be carried out in time that grows polynomially with the dimension $n$. Perhaps the most well-known theorem about the performance of Newton’s method is its local quadratic convergence. More precisely, under the assumptions that the second derivative of $f$ is locally Lipschitz around a local minimizer $x^*$, and that the Hessian at $x^*$ is positive definite, there exists a full-dimensional basin around $x^*$ and a constant $c$, such that if $x_0$ is in this basin, one has
$$\|x_{k+1} - x^*\| \leq c\,\|x_k - x^*\|^2$$
for all $k \geq 0$. We note however that Newton’s method is in general not globally convergent. Lack of global convergence can occur even when, in addition to the previous assumptions, $f$ is assumed to be strongly convex (see, e.g., Example 5.1 in Section 5).
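As a small numerical illustration of the local quadratic rate, the sketch below runs the classical Newton iteration on the univariate function $f(x) = e^x - 2x$, whose minimizer is $x^* = \ln 2$. This test function and the helper name `newton_errors` are assumptions made here for illustration; they do not come from the paper. The ratios $|x_{k+1} - x^*| / |x_k - x^*|^2$ remain bounded, which is what quadratic convergence means.

```python
import math

def newton_errors(x0, num_iters):
    """Run Newton's method on f(x) = exp(x) - 2x and record |x_k - x*|."""
    x_star = math.log(2.0)  # unique minimizer, since f'(x) = exp(x) - 2
    x = x0
    errors = [abs(x - x_star)]
    for _ in range(num_iters):
        grad = math.exp(x) - 2.0   # f'(x)
        hess = math.exp(x)         # f''(x) > 0 everywhere
        x = x - grad / hess        # classical Newton step
        errors.append(abs(x - x_star))
    return errors

errors = newton_errors(1.0, 3)
# Quadratic convergence: e_{k+1} / e_k^2 stays bounded by a constant.
ratios = [errors[k + 1] / errors[k] ** 2 for k in range(3)]
print(errors)
print(ratios)
```

Only the first few ratios are informative; once the error reaches machine precision, floating-point rounding dominates.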
As higher-order Taylor expansions provide closer local approximations to the function $f$, it is natural to ask why Newton’s method limits the order of Taylor approximation to 2. The main barrier to higher-order Newton methods is the computational burden associated with minimizing polynomials of degree larger than 2 which would arise from higher-order Taylor expansions. For instance, any of the following tasks that one could consider for each iteration of a higher-order Newton method are in general NP-hard:
(i) finding a global minimum of polynomials of even degree at least 4 (see, e.g., [39]; note that odd-degree polynomials are unbounded below),
(ii) finding a local minimum of polynomials of degree at least 4 (see [8, Theorem 2.1]),
(iii) finding a second-order point (i.e., a point where the gradient vanishes and the Hessian is positive semidefinite) of polynomials of degree at least 4 (see [7, Theorem 2.2]),
(iv) finding a critical point (i.e., a point where the gradient vanishes) of polynomials of degree at least 3 (see [7, Theorem 2.1]).
In addition to matters related to computation, there are geometric distinctions between Newton’s method and higher-order analogues of it. For example, even when the function is strongly convex and the starting iterate is arbitrarily close to its minimizer, Taylor expansions of even degree and larger than 2 may not be bounded below. One can see this by examining the strongly convex univariate function and its 4th order Taylor expansion near the origin.
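The following sketch illustrates this phenomenon on $f(x) = x^2 - \cos x$, a function chosen here purely for illustration (an assumption; it is not necessarily the example the text has in mind). Since $f''(x) = 2 + \cos x \geq 1$, this $f$ is strongly convex with minimizer $0$, yet $f''''(x) = -\cos x < 0$ near the origin, so the 4th-order Taylor expansion taken at any point close to the minimizer has a negative leading coefficient and is unbounded below. The function names are hypothetical.

```python
import math

def taylor4_coeffs(x0):
    """Coefficients c_0..c_4 of the 4th-order Taylor expansion of
    f(x) = x^2 - cos(x) at x0, in powers of (x - x0)."""
    derivs = [
        x0 * x0 - math.cos(x0),   # f(x0)
        2.0 * x0 + math.sin(x0),  # f'(x0)
        2.0 + math.cos(x0),       # f''(x0) >= 1, so f is strongly convex
        -math.sin(x0),            # f'''(x0)
        -math.cos(x0),            # f''''(x0), negative near the origin
    ]
    return [d / math.factorial(j) for j, d in enumerate(derivs)]

def taylor4(x0, x):
    """Evaluate the 4th-order Taylor expansion at x."""
    h = x - x0
    return sum(c * h ** j for j, c in enumerate(taylor4_coeffs(x0)))

x0 = 0.01  # expansion point close to the minimizer 0
print(taylor4_coeffs(x0)[4])                      # negative quartic coefficient
print(taylor4(x0, 10.0), taylor4(x0, 100.0))      # values decrease without bound
```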
Despite these barriers, the question of whether one can make higher-order Newton methods tractable and in some way superior to Newton’s method has been considered at least since the work of Chebyshev [20] (see Section 1.1 for more recent literature). More specifically, the question that is of interest to us is whether it is possible to design a higher-order Newton method (i.e., a method which utilizes a Taylor expansion of degree $d$ in each iteration) in such a way that (i) the work per iteration grows polynomially with the dimension, and (ii) the local order of convergence grows with $d$, hence requiring fewer function evaluations as $d$ increases. In this paper, we show that this is indeed possible (Algorithm 1 and Theorem 4).
Our algorithm relies on sum of squares techniques in optimization [44], [30] and semidefinite programming and does not require the function $f$ to be convex. For any fixed degree $d$, our approach is to approximate the $d$th-order Taylor expansion of $f$ with an “sos-convex” polynomial (see Section 2 for a definition). Sos-convex polynomials form a subclass of convex polynomials whose convexity has an explicit algebraic proof. One can then use a first-order sum of squares relaxation to minimize this sos-convex polynomial. It turns out that both the task of finding a suitable sos-convex polynomial and that of minimizing it can be carried out by solving two semidefinite programs whose sizes are polynomial in the dimension (in fact of the same order as the number of terms in the Taylor expansion). As is well known, semidefinite programs can be solved to arbitrary accuracy in polynomial time; see [48] and references therein.
We work with sos-convex polynomials instead of general convex polynomials since the latter set lacks a tractable description [5], and the former, as we show, turns out to be sufficient for achieving an algorithm with superlinear local convergence. Our sum of squares based algorithm works for higher-order Newton methods of any order and can be easily implemented using any sum of squares parser (e.g., YALMIP [35] or SOSTOOLS [45]). This is in contrast to previous work where implementable algorithms have been worked out only for $d = 3$; see [40], [23, Sect. 1.5], [25, Sect. 5]. While we present our algorithms in the unconstrained case, they can be readily implemented in the presence of sos-convex constraints (such as linear constraints or convex quadratic constraints). We note, however, that our interest in this paper is only in generalizing Newton’s method in terms of its convergence order and polynomial work per iteration, and not in the practical aspects of implementation. Designing more scalable algorithms for semidefinite programs is an active area of research [36, 49]. In addition, we believe that there are promising future research directions which could make our algorithms more practical at larger scale (see Section 7).
1.1 Related Work
Over the years, there have been many adaptations of and extensions to Newton’s method. A primary example is the pioneering work of Nesterov and Polyak [41], where the idea of Newton’s method with cubic regularization was introduced. We do not review the large literature that emerged from this work since the order of Taylor expansion in this line of work is still equal to 2, and hence these methods are not considered “higher-order” (i.e., $d \geq 3$). However, the framework that we propose, similar to most of the literature, follows the structure of [41] (and [33, 37]) in terms of minimizing, in each iteration, a Taylor expansion of a certain order plus an appropriate regularization term. Recently, there has been a body of work following this structure with Taylor expansions of order higher than two [40, 9, 12, 28, 29, 25]. Unlike our paper, these works are in the setting of convex optimization, do not study the complexity of minimizing the regularized Taylor expansion in each iteration (except in the case of $d = 3$ for a subset of these papers), and derive sublinear rates of global convergence. There has also been work on lower bounds on the rates of convergence for such methods [11, 1, 13, 40]. These lower bounds are nearly achieved by the algorithms in the aforementioned papers. The recent textbook [17] provides an accessible summary of this literature and its broader scope. See also [14, 16, 15] and references therein.
In terms of work per iteration of higher-order Newton methods, Nesterov presents a polynomial-time algorithm in [40] for minimizing a quartically-regularized third-order Taylor expansion. This problem was recently revisited in [18], where an algorithm for recovering an approximate second-order point for a possibly nonconvex quartically-regularized third-order Taylor expansion is presented. In [47], a different third-order Newton method is presented which has polynomial work per iteration. In each iteration, this algorithm moves to a local minimum of the third-order Taylor expansion. It turns out that local minima of cubic polynomials can be found by semidefinite programs of polynomial size [7]. To the best of our knowledge, no efficient algorithm for higher-order Newton methods of degree $d \geq 4$ has been presented. In fact, designing such an algorithm is referred to as an open problem in [23, Sec. 1.5] and [25, Sec. 5]. Interestingly, Nesterov asks in [40, Sec. 6] whether it is possible to tackle this problem using “some tools from algebraic geometry and the related technique of sums of squares”. This is precisely the approach that we take in this paper.
To our knowledge, the only works that establish superlinear rates of local convergence for higher-order Newton methods are [47] and [24] (and the related PhD thesis [23]), the latter of which came to our attention at the time of writing this paper. In [47], the authors establish a third-order local convergence rate for an unregularized third-order Newton method applied to a strongly convex function. In [24], the authors establish superlinear local convergence for higher-order Newton methods applied to convex optimization problems with composite objective. When the smooth part of the objective function is strongly convex, the authors show local convergence of order $d$ in function value and norm of the subgradient for their proposed $d$th-order Newton method. An algorithm carrying out the work per iteration of this method, however, is available only in the case of $d = 3$ (and is the same as that in [40]). Moreover, similar to much of the literature, the regularization term that is added to the Taylor expansion in this method requires knowledge of the Lipschitz constant of the $d$th derivative of $f$. Our proof technique for local superlinear convergence differs from that of [24] both in the parts where the sum of squares programming aspects come in and in the parts where they do not. Furthermore, our method has polynomial work per iteration for any degree $d$. It also does not rely on knowledge of any Lipschitz constants. Our regularization term is instead derived from the optimal value of a semidefinite program which can be written down from the coefficients of the Taylor expansion alone. This optimized approach can potentially lead to smaller deviations from the Taylor expansion and therefore an improved convergence factor. Finally, we note that in our work, assumptions on convexity of $f$ and knowledge of the Lipschitz constant of its $d$th derivative are made only in Section 6, where global convergence is established.
Our approach in Section 6 is based on incorporating sum of squares methods into the framework of Nesterov in [40], though in theory this can also be done with other globally convergent higher-order Newton methods. In fact, at the time of revision, there has already been interesting follow-up work to our paper which combines our sum of squares framework with adaptive regularization techniques for tensor methods and analyzes the complexity of the resulting algorithm for finding an approximate stationary point of a nonconvex function [19].
1.2 Organization and Contributions
In Section 2, we review preliminaries on sos-convexity, sos-convex polynomial optimization, and error rates of derivatives of Taylor expansions. In Section 3, we present our main algorithm (Algorithm 1). In Section 4, we prove that our algorithm is well-defined in the sense that the semidefinite programs it executes are always feasible and that the next iterate is always uniquely defined (Theorem 3). We then prove that our semidefinite programming-based $d$th-order Newton scheme has local convergence of order $d$ (Theorem 4). Compared to the classical Newton method, this leads to fewer calls to the Taylor expansion oracle (a common oracle in this literature; see e.g., [12], [29, Sect. 2.2], [1, Sect. 1.1], [11, Sect. 2], [17, Chap. 1.2]) at the price of requiring higher-order derivatives. The proof of Theorem 4 is more involved than the proof of local quadratic convergence of Newton’s method. This is in part because the expression for the next iterate of Newton’s method is explicit, whereas our next iterate comes from the solution to two semidefinite programs. We also remark that our proof framework is applicable to a broader class of higher-order Newton methods that may not necessarily use sum of squares techniques.
In Section 5, we present three numerical examples. We give an explicit expression and a geometric interpretation of our third-order Newton method in dimension one. We compare the basins of attraction of local minima for our higher-order methods to those of the classical Newton method. In Section 6, we present a slightly modified higher-order Newton method which is globally convergent under additional convexity and Lipschitzness assumptions similar to those in [40]. This modified algorithm works in the case of $d$ being an odd integer and still has polynomial work per iteration and local convergence of order $d$. Finally, in Section 7, we present a few directions for future research.
2 Preliminaries
2.1 SOS-Convex Polynomial Optimization
In each iteration of the higher-order Newton methods that we propose, two semidefinite programs (SDPs) need to be solved. These SDPs arise from the notion of sos-convexity, which is reviewed in this subsection.
Definition 1.
A polynomial $p:\mathbb{R}^n \to \mathbb{R}$ is said to be a sum of squares (sos) if there exist polynomials $q_1, \dots, q_r:\mathbb{R}^n \to \mathbb{R}$ such that $p = \sum_{i=1}^r q_i^2$.
As is well known, one can check if a polynomial is sos by solving an SDP. The next theorem establishes this link. We denote that a symmetric matrix $Q$ is positive semidefinite (i.e., has nonnegative eigenvalues) with the standard notation $Q \succeq 0$.
Theorem 1 (see, e.g., [44]).
For a variable $x \in \mathbb{R}^n$ and an even integer $d$, let $z_{d/2}(x)$ denote the vector of all monomials of degree at most $d/2$ in $x$. A polynomial $p:\mathbb{R}^n \to \mathbb{R}$ of degree $d$ is sos if and only if there exists a symmetric matrix $Q$ such that (i) $p(x) = z_{d/2}(x)^T Q\, z_{d/2}(x)$ for all $x \in \mathbb{R}^n$, and, (ii) $Q \succeq 0$.
The first constraint above can be written as a finite number of linear equations by coefficient matching. Therefore, the two constraints together represent the intersection of an affine subspace with the cone of positive semidefinite matrices. Thus, as polynomials can be encoded as an ordered vector of coefficients, the set of sos polynomials of a given degree has a description as the feasible region of a semidefinite program. Furthermore, the size of this SDP grows polynomially in $n$ when $d$ is fixed.
Throughout this paper, we denote the gradient vector (resp. Hessian matrix) of a function $f$ with the standard notation $\nabla f$ (resp. $\nabla^2 f$).
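To make the coefficient-matching step concrete, the sketch below verifies a Gram-matrix certificate for a univariate quartic in the monomial basis $(1, x, x^2)$. It only checks a given certificate; finding one in general requires an SDP solver, which is not shown. The polynomial $p(x) = x^4 + 2x^3 + 3x^2 + 2x + 2 = (x^2 + x + 1)^2 + 1^2$ and all function names are assumptions chosen for illustration.

```python
def gram_from_factors(factors):
    """Build Q = sum_i q_i q_i^T, where each q_i lists the coefficients of a
    polynomial in the basis (1, x, x^2). Such a Q is positive semidefinite by
    construction, since v^T Q v = sum_i (q_i . v)^2 >= 0."""
    m = len(factors[0])
    return [[sum(q[i] * q[j] for q in factors) for j in range(m)]
            for i in range(m)]

def poly_from_gram(Q):
    """Coefficients (lowest degree first) of z(x)^T Q z(x), z = (1, x, x^2).
    Entry Q[i][j] multiplies x^(i + j); this is the coefficient matching."""
    m = len(Q)
    coeffs = [0.0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            coeffs[i + j] += Q[i][j]
    return coeffs

def quadratic_form(Q, v):
    """Evaluate v^T Q v."""
    return sum(v[i] * Q[i][j] * v[j]
               for i in range(len(v)) for j in range(len(v)))

# p(x) = (x^2 + x + 1)^2 + 1^2, written as two coefficient vectors.
factors = [[1, 1, 1], [1, 0, 0]]
Q = gram_from_factors(factors)
print(Q)
print(poly_from_gram(Q))  # coefficients of p, lowest degree first
```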
Definition 2 (SOS-Convex).
A polynomial $p:\mathbb{R}^n \to \mathbb{R}$ is said to be sos-convex if the polynomial $g:\mathbb{R}^{2n} \to \mathbb{R}$ defined as $g(x, y) = y^T \nabla^2 p(x)\, y$ is sos.
Note that any sos-convex polynomial is convex. The converse statement is not true, except for certain dimensions and degrees (see [6]). By Theorem 1 above, the set of sos-convex polynomials of a given degree also forms the feasible region of a semidefinite program. Because the polynomial $y^T \nabla^2 p(x)\, y$ is quadratic in $y$, one can reduce the size of the underlying SDP. More specifically, a polynomial $p$ of degree $d$ is sos-convex if and only if there exists a symmetric matrix $Q \succeq 0$ such that $y^T \nabla^2 p(x)\, y = (z_{d/2-1}(x) \otimes y)^T Q\, (z_{d/2-1}(x) \otimes y)$. (Here, $\otimes$ denotes the Kronecker product. Note also that an odd-degree polynomial can never be convex, except for the trivial case of affine polynomials.) We see that the size of the SDP that represents sos-convex polynomials of degree $d$ in $n$ variables grows polynomially in $n$ when $d$ is fixed.
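We next explain why sos-convex polynomial optimization problems can be solved with the first level of the so-called Lasserre hierarchy. A polynomial optimization problem is a problem of the form
$$\min_{x \in \mathbb{R}^n} \; p(x) \quad \text{s.t.} \quad g_i(x) \leq 0, \quad i = 1, \dots, m, \tag{1}$$
where $p, g_1, \dots, g_m$ are real-valued polynomial functions of a variable $x \in \mathbb{R}^n$. The first-level Lasserre relaxation (see [30]) corresponding to problem (1) takes the form
$$\max_{\gamma \in \mathbb{R},\ \lambda \in \mathbb{R}^m} \; \gamma \quad \text{s.t.} \quad p(x) - \gamma + \sum_{i=1}^m \lambda_i g_i(x) \ \text{is sos}, \quad \lambda \geq 0. \tag{2}$$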
The reader can check that the optimal value of (2) is always a lower bound on that of (1). The next theorem establishes that this lower bound is tight when the defining polynomials of (1) are sos-convex.
Theorem 2 (See Corollary 2.5 from [31], and Theorem 3.3 from [32]).
Suppose that the polynomials in (1) are sos-convex, the optimal value of (1) is finite, and that the Slater condition holds (that is, there exists a point at which every inequality constraint of (1) holds strictly). Then, the optimal values of (1) and (2) are the same. Moreover, an optimal solution to (1) can be readily recovered from a solution to the semidefinite program that is dual to (2).
This result is already proven by Lasserre in [32] using a lemma of Helton and Nie from [26]. For completeness and for the benefit of the reader, we give an alternative short proof of the first claim.
Proof.
Recalling that an sos polynomial is nonnegative and that for , it is easy to see that the optimal value of (1) is larger than or equal to the optimal value of (2). To show the opposite inequality, let be the optimal value of (1). Then, the convex function is nonnegative over the set . By the convex Farkas lemma (see, e.g., [46, Theorem 2.1]), there exists a nonnegative vector such that for all . Notice that is sos-convex since it is a conic combination of sos-convex polynomials. Thus, by [6, Theorem 3.1], the polynomial is sos. Let be an optimal solution to (1) (such a vector must exist [10]). Observe that the polynomial is also sos (since it is the restriction of to ). By optimality of to (1), we have . Since is nonnegative, we have and . Thus, , and hence must be sos. Therefore, is feasible to (2), and hence the optimal value of (2) is at least ; i.e., the optimal value of (1).
∎
2.2 Error Rates of Taylor Remainders
In this subsection, we review certain error rates of multivariate Taylor expansions that will be used in our arguments. We denote by the th order symmetric tensor of order- partial derivatives of the function . We denote the tensor product of a set of vectors with .444We use this slightly nonstandard notation to avoid confusion with the Kronecker product. We use the notation to denote the tensor product of a vector with itself times. With this notation, we can define the th-order Taylor expansion of a -times differentiable function at a point as
where denotes the standard tensor inner product. The remainder or error term of the Taylor expansion is
For a th-order tensor , let us define the following norm
where denotes the Euclidean 2-norm of the vector . Note that for cases of and , this expression reduces to the standard Euclidean norm and the spectral norm, respectively.
We will need the following lemma in Section 4.
Lemma 1 (see, e.g., inequality (11) in [9]).
Fix a vector . Suppose has a Lipschitz constant over a convex set containing , i.e.,
for all . Then, for any , we have
and
3 Algorithm Definition
For a given integer $d \geq 2$, we consider the task of minimizing a function $f:\mathbb{R}^n \to \mathbb{R}$ which is assumed to have derivatives up to order $d$, and a local minimum $x^*$ satisfying $\nabla^2 f(x^*) \succ 0$. We also assume that the $d$th derivative of $f$ is locally Lipschitz around the point $x^*$, i.e., there is a radius $r > 0$, and a scalar $L > 0$, such that for all points $x, y$ within distance $r$ of $x^*$, we have
$$\|\nabla^d f(x) - \nabla^d f(y)\| \leq L\,\|x - y\|.$$
Note that the latter assumption is always satisfied if the $(d+1)$th derivative of $f$ exists and is continuous. Our goal is to minimize $f$ by iteratively minimizing a surrogate function of the type
$$\psi_{x_k,t}(x) := T_{x_k,d}(x) + t\,\|x - x_k\|^{d'},$$
where $x_k$ is our current iterate, $T_{x_k,d}$ is the Taylor expansion of $f$ of order $d$ at $x_k$, $d'$ is the smallest even integer greater than $d$ (as we require the surrogate to be a polynomial), and $t$ is chosen according to the following sum of squares program:
$$\min_{t \in \mathbb{R}} \; t \quad \text{s.t.} \quad \psi_{x_k,t} \ \text{is sos-convex}. \tag{3}$$
In view of Theorem 1 and the remarks after Definition 2, this program can be reformulated as an SDP of size polynomial in $n$. Letting $t(x_k)$ denote the optimal value of (3) for a given $x_k$, we define our surrogate function to be
$$\psi_{x_k}(x) := T_{x_k,d}(x) + t(x_k)\,\|x - x_k\|^{d'}. \tag{4}$$
In our algorithm, we choose $x_{k+1}$ to be the minimizer of $\psi_{x_k}$ (which exists and is unique; see Theorem 3 below). By Theorem 2, since $\psi_{x_k}$ is sos-convex, we can find its minimizer via another SDP of size polynomial in $n$.
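In one variable, convex polynomials are sos-convex, so the regularization step can be carried out by hand. The sketch below treats the case $d = 3$, assuming a surrogate of the form $T(s) + t\,s^4$ with $s = x - x_k$, $T(s) = f + a s + \tfrac{b}{2} s^2 + \tfrac{c}{6} s^3$, and $b = f''(x_k) > 0$. The exact form of the regularizer is an assumption here, and the function names are hypothetical. The second derivative $b + c s + 12 t s^2$ is nonnegative everywhere exactly when its discriminant $c^2 - 48 t b$ is nonpositive, which gives the smallest admissible $t = c^2 / (48 b)$.

```python
def smallest_convexifying_t(b, c):
    """Smallest t such that s -> b + c*s + 12*t*s^2 is nonnegative for all s,
    assuming b > 0 (discriminant condition c^2 - 48*t*b <= 0)."""
    if b <= 0:
        raise ValueError("this sketch assumes a positive second derivative b")
    return c * c / (48.0 * b)

def surrogate_second_derivative(b, c, t, s):
    """Second derivative of T(s) + t*s^4 at offset s from the iterate."""
    return b + c * s + 12.0 * t * s * s

b, c = 2.0, 3.0  # illustrative values of f''(x_k) and f'''(x_k)
t_star = smallest_convexifying_t(b, c)
grid = [-5.0 + 0.001 * i for i in range(10001)]

# With t = t_star the surrogate is convex; with a smaller t it is not.
min_at_t_star = min(surrogate_second_derivative(b, c, t_star, s) for s in grid)
min_below = min(surrogate_second_derivative(b, c, 0.9 * t_star, s) for s in grid)
print(t_star, min_at_t_star, min_below)
```

In higher dimensions no such closed form is available, which is where the semidefinite programming formulation comes in.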
If $x_k$ is far from $x^*$ so that $\nabla^2 f(x_k)$ is not positive definite, it may occur that (3) is infeasible. If this occurs, we fix a positive scalar (our analysis applies to any positive value of this scalar) and instead solve the SDP:
(5) | ||||
s.t. | ||||
Let denote the optimal value of (5) and define
(6) |
We then let be the minimizer (which again exists and is unique; see Theorem 3 below). As before, we can find a minimizer of by solving an SDP of size polynomial in ; see Theorem 2.
Our overall algorithm is summarized below:
4 Algorithm Analysis and Convergence
In this section, we present our main technical results. Theorem 3 shows that our algorithm is well-defined for all initial conditions. Theorem 4 gives our convergence result. We remind the reader that the assumptions made on $f$ are described in the first paragraph of Section 3. In particular, the function $f$ is not required to be convex, and the $d$th derivatives of $f$ are not required to be globally Lipschitz.
Theorem 3.
Theorem 4.
The power in this theorem is referred to as the order of convergence and the constant is referred to as the factor of convergence. We note that the factor of convergence arising from our proof is explicit.
To prove Theorems 3 and 4, we first establish some technical lemmas. Lemmas 2 and 3 are used to prove the first claim of Theorem 3; Lemmas 4 and 5 are for the second claim; and Lemmas 3, 4, and 6 are employed in the proof of Theorem 4.
In Lemma 2, we show that a particular polynomial is in the interior of the cone of sos-convex polynomials. This is used in Lemma 3 to show that we can always make our surrogate functions defined in (3) and (5) sos-convex.
Lemma 2.
Let . The polynomial
is in the interior of the cone of sos-convex polynomials in variables and of degree at most .
Proof.
We first establish the following claim:
Claim 0. For all , the polynomial
can be written as , where is the standard basis of monomials of degree up to with the monomials appearing in ascending order of degree, and is a positive definite matrix.
To prove Claim 0, it suffices to show that for all , there exists a constant and a positive definite matrix such that . Indeed, if , we can observe that
where can be taken to be positive semidefinite since is sos. If , we can observe that
where can be taken to be positive semidefinite since is sos.
Let us now proceed by induction on to prove the claim made in the previous paragraph. The case of is clear since we can take any and the associated matrix is simply a matrix containing the scalar . Now suppose that the induction hypothesis holds for . To construct and , we will add matrices associated with the polynomials and , where is an arbitrary scalar. From the induction hypothesis, there exist a scalar and a matrix of size that satisfy
Meanwhile, observe that we can write
for some matrices and , where the zero block is of size . Indeed, we can take the matrix to be diagonal with its diagonal entries equalling the coefficients of and move the coefficients of to the matrix . Adding the two identities, we observe that:
Since and are both positive definite matrices, by the Schur complement condition, whenever , the matrix on the right-hand side of the above expression will be positive definite. One can therefore choose to be any large enough value of that satisfies the previous condition and let . We have thus proved Claim 0.
By Claim 0 (with replaced by ), we can fix a positive definite matrix such that for all . One can check that
where can be taken to be positive semidefinite since is sos. Since the matrix is positive definite, it follows that is in the interior of the cone of sos-convex polynomials of degree at most . ∎
Lemma 3.
Suppose has continuous derivatives up to order over a compact set . If for all , then (i.e., the optimal value of (3)) is uniformly bounded from above over .
Proof.
Let be a positive scalar such that for all . Let be any vector in , and define
Since , we have . Let (resp. ) be the sum of the quadratic and higher (resp. cubic and higher) terms of . For a polynomial , define as the infinity norm of the coefficients of when expressed in the standard monomial basis. By Lemma 2, we can fix a positive scalar such that for any polynomial of degree at most with , we have that the polynomial is sos-convex. Fix a scalar such that for all . Define . We have since all terms of are of cubic or higher order. Then we can write
Since , the first term is sos-convex. We can bound the second term as follows: . Thus, the sum of the second and the third term is sos-convex by the definition of . It follows that the polynomial
We can then conclude the sos-convexity of the polynomials
-
(a)
,
-
(b)
,
-
(c)
,
-
(d)
,
-
(e)
, and
-
(f)
,
respectively (a) by scaling, (b) by the observation that the affine terms do not affect sos-convexity, (c) by rewriting, (d) by a linear change of coordinates, (e) by another rescaling, and (f) by an affine change of coordinates. Thus, we have for .
∎
We next use a quadrature rule for integration to establish a technical lemma that is needed for the remainder of this section. By a polynomial matrix, we mean a matrix whose entries are polynomial functions.
Lemma 4.
Let be univariate polynomial matrix whose entries have degree at most , where is even. Suppose for all . Then,
for .
Proof.
The next lemma directly proves the second claim of Theorem 3 and is possibly of independent interest.
Lemma 5.
If a convex polynomial $p:\mathbb{R}^n \to \mathbb{R}$ satisfies $\nabla^2 p(\bar{x}) \succ 0$ at some point $\bar{x} \in \mathbb{R}^n$, then $p$ has a unique minimizer.
Proof.
Without loss of generality, assume . Let be an even integer greater than the degree of the Hessian of . For any ,
where the inequality follows from Lemma 4. Thus, $p$ is lower bounded by a coercive quadratic function, and hence is coercive itself. (We recall that a function $g$ is coercive if $g(x) \to \infty$ as $\|x\| \to \infty$.) A coercive function that is convex (and hence continuous) has at least one minimizer.
Suppose for the sake of contradiction that $p$ had two minimizers $x_1 \neq x_2$. Then, by convexity, any point on the line segment connecting $x_1$ and $x_2$ would also be a minimizer. Since $p$ is a polynomial, it follows that $p$ must be constant along the line passing through $x_1$ and $x_2$. This contradicts coercivity. ∎
We remark that the statement of Lemma 5 does not hold for non-polynomial convex functions (consider, e.g., the univariate function ).
The next lemma is used in the proof of Theorem 4.
Lemma 6.
There exists a constant such that if then .
Proof.
We show that we can take
where and are as in the first paragraph of Section 3. By Lemma 1, for every satisfying , we have
Thus, if , we have
It follows that
Indeed, if there was a unit vector such that if , the previous inequality would be violated.
Recall from (4) that is obtained by adding to the convex function . Therefore, we have which gives the claim.
∎
Proof of Theorem 3.
(i) When , the proof of Lemma 3 with demonstrates a feasible solution to (3).
This argument also extends to show feasibility of (5) since the polynomial has a positive definite Hessian at .
(ii) At Algorithm 1 (resp. Algorithm 1), (resp. ) has a positive definite Hessian at .
Moreover, the polynomial (resp. ) is sos-convex and therefore convex.
Thus, by Lemma 5, (resp. ) has a unique minimizer.
∎
Proof of Theorem 4.
Since , it suffices to show that there exist constants such that if , then .
By continuity of the map , there exists a scalar such that for all with .
Let be the constant needed for the conclusion of Lemma 6 to hold. Define
and . Suppose . Note that in this case, Algorithm 1 finds the next iterate by minimizing the polynomial defined in .
By the fundamental theorem of calculus, we have
Since minimizes , we have , and thus
We can bound the norm of this vector from below:
(7) |
Applying first Lemma 4 and then Lemma 6, we have
Substituting this into (7) and rearranging yields
(8) |
Expanding , we have
Applying Lemma 1 and noting that , we have
Using Lemma 3 and the fact that , we get
Substituting into (8), we have
as desired. ∎
5 Numerical Examples
We present three examples to compare the performance of our $d$th-order Newton methods and the classical Newton method.
5.1 The Univariate Case
In the univariate case, the iterations of the classical Newton method read
In terms of finding a root of , this iteration can be interpreted as first computing the first-order Taylor expansion of at , and then finding the root of this affine function to define .
We derive a similar explicit formula for our higher-order Newton method in the case where $d = 3$, $f'''(x_k) \neq 0$, and $f''(x_k)$ is positive. Since convex univariate polynomials are sos-convex, finding explicit solutions to the two SDPs involved in each iteration of our algorithm reduces to arguments about roots of univariate polynomials.
Proposition 1.
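In the univariate case, when $f''(x_k) > 0$ and $f'''(x_k) \neq 0$, the next iterate of the 3rd-order version of Algorithm 1 is given by
$$x_{k+1} = x_k - \frac{2 f''(x_k)}{f'''(x_k)} \left( 1 - \sqrt[3]{1 - \frac{3 f'(x_k) f'''(x_k)}{2 f''(x_k)^2}} \right).$$
(Note that when $f''(x_k) > 0$ and $f'''(x_k) = 0$, the third-order Taylor series is convex and coincides with the second-order Taylor series. Therefore, the next iterates of the third-order and the classical Newton method coincide.)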
Proof.
To simplify notation, we let and . By translation, we may assume . Then , , where is the smallest constant that makes convex. We have . The discriminant of is , which tells us that .
To find , we look for the root of . One can write the expression for in the following form:
Observe that a univariate cubic polynomial of the form , with , has a unique root at . Therefore, after a translation back by , we have
∎
As in the case of the classical Newton method, the expression in Proposition 1 can be interpreted geometrically in terms of finding a root of . This iteration computes the second-order Taylor expansion of at , adds a sufficiently large cubic term to enforce monotonicity, and then finds the root of this monotone cubic function to define .
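The sketch below compares this univariate third-order step with the classical Newton step. It assumes the update $x_{k+1} = x_k - \frac{2 f''}{f'''}\bigl(1 - \sqrt[3]{1 - \frac{3 f' f'''}{2 f''^2}}\bigr)$ (derivatives evaluated at $x_k$), which is what one obtains by minimizing the cubic Taylor expansion plus the smallest quartic term $t\,(x - x_k)^4$ that makes it convex. The test function $f(x) = \sqrt{1 + x^2}$ and all function names are assumptions chosen for illustration. The code uses the algebraically equivalent form $x_k - \frac{3 f'/f''}{1 + y + y^2}$, with $y$ the cube root above, which avoids cancellation when $f'''(x_k)$ is close to zero.

```python
import math

def real_cbrt(v):
    """Real cube root, valid for negative arguments as well."""
    return math.copysign(abs(v) ** (1.0 / 3.0), v)

def derivatives(x):
    """First, second and third derivatives of f(x) = sqrt(1 + x^2)."""
    s = math.sqrt(1.0 + x * x)
    return x / s, s ** -3, -3.0 * x * s ** -5

def newton_step(x):
    g, h, _ = derivatives(x)
    return x - g / h

def third_order_step(x):
    g, h, c = derivatives(x)
    y = real_cbrt(1.0 - 1.5 * g * c / (h * h))
    # Same as x - (2h/c) * (1 - y), rewritten via 1 - y^3 = (1 - y)(1 + y + y^2).
    return x - (3.0 * g / h) / (1.0 + y + y * y)

def iterate(step, x0, num_iters):
    xs = [x0]
    for _ in range(num_iters):
        xs.append(step(xs[-1]))
    return xs

newton_path = iterate(newton_step, 3.0, 3)            # moves away from 0
third_order_path = iterate(third_order_step, 3.0, 8)  # approaches the minimizer 0
print(newton_path)
print(third_order_path)
```

For this particular test function the classical Newton map simplifies to $x \mapsto -x^3$, so it diverges from any starting point with $|x_0| > 1$, while the third-order iterates started at $x_0 = 3$ contract toward the minimizer.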
Example 1
In this example, we apply our method to the univariate function
(9) |
This is a strictly convex function with its unique minimizer at . One can check that the classical Newton method converges to this minimizer if and only if . Using Proposition 1, we can calculate the exact basin of convergence of our third-order Newton method to be , where
This is strictly larger than the basin of convergence of the classical method.
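A basin radius of this kind can also be estimated numerically by bisection on the starting point. The sketch below does this for the univariate third-order step, under two assumptions that should be read as illustrative: the test function is taken to be $f(x) = \sqrt{1 + x^2}$ (for which the classical Newton method converges exactly when $|x_0| < 1$), and the third-order update is the closed-form step obtained from the cubic Taylor expansion plus the smallest convexifying quartic term. All function names are hypothetical.

```python
import math

def real_cbrt(v):
    """Real cube root, valid for negative arguments as well."""
    return math.copysign(abs(v) ** (1.0 / 3.0), v)

def third_order_step(x):
    """Third-order step for f(x) = sqrt(1 + x^2), in a cancellation-free form."""
    s = math.sqrt(1.0 + x * x)
    g, h, c = x / s, s ** -3, -3.0 * x * s ** -5   # f', f'', f'''
    y = real_cbrt(1.0 - 1.5 * g * c / (h * h))
    return x - (3.0 * g / h) / (1.0 + y + y * y)

def converges(x0, max_iters=200, tol=1e-8, blowup=1e6):
    """Return True if the iterates reach the minimizer 0, False if they escape."""
    x = x0
    for _ in range(max_iters):
        if abs(x) < tol:
            return True
        if abs(x) > blowup:
            return False
        x = third_order_step(x)
    return abs(x) < tol

def basin_radius(lo=1.0, hi=10.0, num_bisections=40):
    """Bisection on the starting point; assumes the basin is an interval
    (-r, r) with lo inside it and hi outside it."""
    for _ in range(num_bisections):
        mid = 0.5 * (lo + hi)
        if converges(mid):
            lo = mid
        else:
            hi = mid
    return lo

radius = basin_radius()
print(radius)
```

Under these assumptions the estimate comes out at roughly 3.4, compared with a radius of 1 for the classical Newton method on the same function.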
Figure 1 demonstrates the difference between one iteration of the classical and our third-order Newton method starting at the point . We display the quadratic and quartic polynomials and . The minimizers of these polynomials are denoted by and , which are respectively the next iterates of the classical and our third-order Newton method. Since the third-order Taylor expansion of provides a more accurate approximation, we see that the next iterate of our method is closer to , while that of the classical Newton method moves farther away from .
For our $d$th-order Newton methods with $d > 3$, we calculate the radii of convergence numerically. These radii increase with degree as the following table demonstrates:
Degree | Radius of Convergence |
---|---|
2 (Classical Newton) | 1 |
3 | 3.4 |
4 | 4.5 |
5 | 5.9 |
We can visualize the speed of convergence of the fifth-order method, for example, in Figure 2. In this figure, we plot the absolute value of starting at , which is close to the boundary of the basin. In just five iterations, the method reaches a point with absolute value approximately .
Example 2
In this example, we compare our third-order method to the classical Newton method when applied to the function
(10) |
This is a strongly convex function with its unique minimizer at .
In Figure 3, (resp. ) is the map that takes a point to the corresponding next iterate of the classical (resp. third-order) Newton method. In this example, the third-order method satisfies for all nonzero , implying global convergence of the method. Meanwhile, the classical Newton method oscillates between when is outside of the range , where is the point of intersection of the functions and .
In Figure 4, we can see a comparison of the iterates of the third-order and the classical Newton method starting from the initial condition . While both methods converge to the minimizer, the third-order method converges much faster.
5.2 A Multivariate Example
In our last example, we compare the classical and the third-order Newton methods applied to a standard test function in nonlinear optimization called the Beale function:
This nonconvex function has a single global minimum at and no other local minima. In Figure 5, we explore the behavior of both methods with initial conditions in the region . We initialize the classical method and our third-order method at a fine grid of points in this box and run both methods for iterations. For our third-order method, we take the parameter in Algorithm 1 to be equal to . In Figure 5, the color yellow corresponds to initial points that converge to , and the color blue corresponds to any other behavior including divergence or convergence to a point which is not a local minimum. In this example, the two basins are incomparable, but that of the third-order method is more contiguous and larger in volume.
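As a rough companion to the experiment above, the following Python sketch runs only the classical Newton method on the Beale function from a coarse grid of starting points and counts how many runs end at the global minimizer (3, 0.5). The formula is the standard definition of the Beale test function. The box [-4.5, 4.5]^2, the 21-by-21 grid, the 50-iteration budget, the tolerances, and all function names are illustrative assumptions and are not the settings behind Figure 5. The third-order method is not reproduced here, since it requires a semidefinite programming solver.

```python
import math

# Beale test function: f(x, y) = sum_i (c_i - x + x*y**i)**2 for i = 1, 2, 3.
C = (1.5, 2.25, 2.625)
X_STAR = (3.0, 0.5)  # known global minimizer, where f = 0


def beale(x, y):
    return sum((c - x + x * y**i) ** 2 for i, c in enumerate(C, start=1))


def grad_hess(x, y):
    """Analytic gradient (g1, g2) and Hessian entries (h11, h12, h22)."""
    g1 = g2 = h11 = h12 = h22 = 0.0
    for i, c in enumerate(C, start=1):
        r = c - x + x * y**i                # residual r_i
        rx = y**i - 1.0                     # d r_i / dx
        ry = i * x * y ** (i - 1)           # d r_i / dy
        rxy = i * y ** (i - 1)              # d^2 r_i / dx dy
        ryy = i * (i - 1) * x * y ** (i - 2) if i >= 2 else 0.0
        g1 += 2 * r * rx
        g2 += 2 * r * ry
        h11 += 2 * rx * rx                  # d^2 r_i / dx^2 is zero
        h12 += 2 * (rx * ry + r * rxy)
        h22 += 2 * (ry * ry + r * ryy)
    return (g1, g2), (h11, h12, h22)


def newton(x, y, iters=50, bound=1e6):
    """Pure Newton iteration; returns None if iterates blow up or H is singular."""
    for _ in range(iters):
        if not (abs(x) <= bound and abs(y) <= bound):
            return None
        (g1, g2), (h11, h12, h22) = grad_hess(x, y)
        det = h11 * h22 - h12 * h12
        if det == 0.0:
            return None
        # Solve the 2x2 system H d = -g in closed form and take the full step.
        x -= (h22 * g1 - h12 * g2) / det
        y -= (h11 * g2 - h12 * g1) / det
    return (x, y)


def converges_to_minimizer(x0, y0, tol=1e-6):
    end = newton(x0, y0)
    if end is None:
        return False
    return math.hypot(end[0] - X_STAR[0], end[1] - X_STAR[1]) < tol


# Coarse grid over an assumed box (hypothetical, not the region of Figure 5).
n = 21
grid = [-4.5 + 9.0 * k / (n - 1) for k in range(n)]
hits = sum(converges_to_minimizer(x0, y0) for x0 in grid for y0 in grid)
print(f"{hits} of {n * n} starting points reach the global minimizer")
```

Starting points for which the iteration diverges, hits a singular Hessian, or settles at another critical point are all counted as "other behavior", mirroring the two-color classification described above.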
6 Global convergence
In this section, we present a slightly modified algorithm which has global convergence under additional assumptions. There is a vast literature on modifications to Newton’s method that lead to global convergence in special circumstances: see, e.g., [41, 43, 38, 22]. In the setting of our work, it turns out that we can use a result of Nesterov from [40] to show that a simple modification to our algorithm that still has polynomial work per iteration is globally convergent when the Taylor expansion is made to an odd order. (The reason we need the Taylor expansion order to be odd is that in the work of Nesterov, the Taylor polynomial is regularized by a term of degree one larger. We need this new term to be a polynomial function for sum of squares methods to be readily applicable.) This modified algorithm (Algorithm 2 below) also inherits the local convergence order of Algorithm 1.
As in [40], suppose the th derivative of the function that we wish to minimize has a Lipschitz constant , and that an upper bound on is known. In this setting, consider the following algorithm:
Using the same arguments as those in the proof of Theorem 3, one can see that the next iterate produced by this algorithm is well-defined whenever . Also as before, problem (3) can be solved as a semidefinite program of size polynomial in the dimension. This claim also holds for the problem of finding the (unique) minimizer of the degree polynomial
This is because the polynomials and are sos-convex and a conic combination of two sos-convex polynomials is sos-convex, making Theorem 2 applicable.
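The closure property invoked here can be spelled out in one line. Recall that a polynomial is sos-convex if its Hessian factors as $M(x)^{\top} M(x)$ for some polynomial matrix $M(x)$. The symbols $q_1$, $q_2$, $M_1$, $M_2$, $\alpha$, and $\beta$ below are our own generic notation for two sos-convex polynomials and nonnegative weights, not the paper's specific polynomials.

```latex
\nabla^2 q_i(x) = M_i(x)^{\top} M_i(x), \quad i = 1, 2
\;\Longrightarrow\;
\nabla^2 (\alpha q_1 + \beta q_2)(x)
= \begin{bmatrix} \sqrt{\alpha}\, M_1(x) \\ \sqrt{\beta}\, M_2(x) \end{bmatrix}^{\top}
  \begin{bmatrix} \sqrt{\alpha}\, M_1(x) \\ \sqrt{\beta}\, M_2(x) \end{bmatrix},
\qquad \alpha, \beta \ge 0.
```

Stacking the two factors thus gives a factorization of the Hessian of the conic combination, which is exactly the sos-convexity certificate.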
Theorem 5.
Suppose has bounded level sets, a positive definite Hessian everywhere, and the Lipschitz constant of its th derivative bounded above by . (The assumptions that we make here are the same as those in [40] except that our assumption of positive definiteness of the Hessian is stronger than the assumption of positive semidefiniteness of the Hessian made in [40].) Then, the iterates of Algorithm 2 starting from any converge to the (unique) minimizer of . Furthermore, Algorithm 2 has local convergence rate of order .
Proof.
Since the Hessian of is positive definite everywhere, the function is strictly convex. This, along with boundedness of the level sets, implies that has a unique (global) minimizer which we call .
Define . By Theorem 1 from [40], we have for all , thus the method is monotone; i.e., . Let and . Since the set is compact and the method is monotone, there exists a scalar such that for all . By the arguments in the proof of Theorem 2 from [40], we can conclude that
where .
By Lemma 3, we know that
is finite. Letting , we have , and therefore for all . Continuing the argument from the proof of Theorem 2 from [40], we can conclude that
Thus, we have and therefore .
For the local superlinear convergence rate, it suffices to show that for close enough to , we have
for some constant . Let and be as in the proof of Theorem 4, and . By the arguments in the proof of Theorem 4, for every , we have
Substituting into (8) (with replaced with ), we have
where
We note that by Lemma 3, is finite. ∎
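The monotonicity used in the proof above can be observed on a small one-dimensional sketch. The Python code below is not Algorithm 2 (in particular, it involves no semidefinite program); it only illustrates the regularized Taylor step of [40] for third order, where each iterate minimizes the third-order Taylor expansion plus M/24 times the fourth power of the step. The test function sqrt(1 + x^2), the constants L3 = 3 (a bound on its fourth derivative) and M = 3 * L3, the starting point, the iteration counts, and all names are our illustrative assumptions.

```python
import math


# Assumed test function f(x) = sqrt(1 + x^2) and its first three derivatives.
def f(x):
    return math.sqrt(1.0 + x * x)


def d1(x):
    return x / math.sqrt(1.0 + x * x)


def d2(x):
    return (1.0 + x * x) ** -1.5


def d3(x):
    return -3.0 * x * (1.0 + x * x) ** -2.5


L3 = 3.0     # sup |f''''| for this f, i.e. a Lipschitz constant of f'''
M = 3 * L3   # regularization weight chosen as p * L_p with p = 3


def regularized_step(x, M=M):
    """Minimize the model f(x) + f'(x)h + f''(x)h^2/2 + f'''(x)h^3/6 + M h^4/24 over h."""
    g, h2, h3 = d1(x), d2(x), d3(x)

    def model_derivative(h):
        return g + h2 * h + h3 * h * h / 2.0 + M * h**3 / 6.0

    # The model is convex for this choice of M, so its derivative is
    # nondecreasing and the minimizer is the root found by bisection.
    lo, hi = -1.0, 1.0
    while model_derivative(lo) > 0.0:
        lo *= 2.0
    while model_derivative(hi) < 0.0:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if model_derivative(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return x + 0.5 * (lo + hi)


def run(x0, iters=40):
    xs = [x0]
    for _ in range(iters):
        xs.append(regularized_step(xs[-1]))
    return xs


xs = run(10.0)
for k in (0, 5, 10, 15, 20):
    print(k, xs[k], f(xs[k]))
```

Because the regularized model upper-bounds the function when M is at least the Lipschitz constant, the function values along the iterates should never increase, which is the property the proof relies on before invoking compactness of the level set.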
7 Future directions
Besides the question of extending the results of Section 6 to the case of even, there are a few other potential directions for future research that we wish to highlight:
- Can we replace the SDPs used in Algorithm 1 with more scalable conic programs such as linear programs (LPs) or second-order cone programs (SOCPs)? There has been work (see, e.g., [4]) on replacing methods based on sos programming with LP or SOCP-based approaches that rely on more tractable subsets of sos polynomials, such as the so-called diagonally dominant sum of squares (dsos) or scaled diagonally dominant sum of squares (sdsos) polynomials. In our setting, we might wish to replace the constraint in (3) (or (5)) that a polynomial is sos-convex with a constraint that it is “dsos-convex” or “sdsos-convex” (see, e.g., [3]). The results in [3] on the difference of dsos-convex decompositions of arbitrary polynomials could be explored to potentially replace the first SDP in each iteration of Algorithm 1 with an LP or SOCP. One would then need to establish an appropriate dsos or sdsos version of Theorem 2 to replace our second SDP with an LP or SOCP. It would be interesting to compare the factor of convergence of such an algorithm to that of the SDP-based approach.
- Can we create a method that uses a sparse subset of higher-order derivatives of the function and that perhaps approximates the remaining derivatives in order to speed up each iteration? Such a method would be a higher-order analogue to the so-called “quasi-Newton” methods which rely on approximations of the Hessian of (see, e.g., [42, Chap. 6]). An example of such a higher-order quasi-Newton method which results in semidefinite programs of small size in each iteration has been proposed in [2], but its convergence properties are currently unknown.
- Can we use our method or a modification thereof to solve systems of nonlinear equations (in a way that is superior to simply minimizing the sum of the squares of the equations)? The classical Newton method and its variants can be used for this purpose (see, e.g., [42, Sect. 11.1]). What are the right higher-order analogues of these approaches?
- Each iteration of the algorithms that we have presented in this paper can be interpreted as applying just one iteration of the so-called “convex-concave procedure” (see, e.g., [34]) to a particular difference of convex decomposition of the Taylor expansion of . Are there benefits to working with alternative difference of convex decompositions (see, e.g., [3]) of the Taylor expansion, or to running more iterations of the convex-concave procedure before the Taylor polynomial is updated?
Acknowledgements
We would like to thank Jean-Bernard Lasserre for insightful discussions around the results in [32].
References
- [1] N. Agarwal and E. Hazan. Lower bounds for higher-order convex optimization. In Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 774–792, 2018.
- [2] A. A. Ahmadi, C. Dibek, and G. Hall. Sums of separable and quadratic polynomials. Mathematics of Operations Research, 48, 2022.
- [3] A. A. Ahmadi and G. Hall. DC decomposition of nonconvex polynomials with algebraic techniques. Mathematical Programming, 169(1):69–94, 2018.
- [4] A. A. Ahmadi and A. Majumdar. DSOS and SDSOS optimization: More tractable alternatives to sum of squares and semidefinite optimization. SIAM Journal on Applied Algebra and Geometry, 3(2):193–230, 2019.
- [5] A. A. Ahmadi, A. Olshevsky, P. A. Parrilo, and J. N. Tsitsiklis. NP-hardness of deciding convexity of quartic polynomials and related problems. Mathematical Programming, 137:453–476, 2013.
- [6] A. A. Ahmadi and P. A. Parrilo. A complete characterization of the gap between convexity and sos-convexity. SIAM Journal on Optimization, 23(2):811–833, 2013.
- [7] A. A. Ahmadi and J. Zhang. Complexity aspects of local minima and related notions. Advances in Mathematics, 397:108119, 2022.
- [8] A. A. Ahmadi and J. Zhang. On the complexity of finding a local minimizer of a quadratic function over a polytope. Mathematical Programming, 195(1-2):783–792, 2022.
- [9] M. Baes. Estimate sequence methods: extensions and approximations. Institute for Operations Research, ETH, Zürich, Switzerland, 2(1), 2009.
- [10] E. G. Belousov and D. Klatte. A Frank–Wolfe type theorem for convex polynomial programs. Computational Optimization and Applications, 22(1):37–48, 2002.
- [11] E. G. Birgin, J. L. Gardenghi, J. M. Martínez, S. A. Santos, and P. L. Toint. Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Mathematical Programming, 163(1):359–368, 2017.
- [12] S. Bubeck, Q. Jiang, Y. T. Lee, Y. Li, and A. Sidford. Near-optimal method for highly smooth convex optimization. In Conference on Learning Theory, pages 492–507. Proceedings of Machine Learning Research, 2019.
- [13] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points I. Mathematical Programming, 184(1):71–120, 2020.
- [14] C. Cartis, N. I. Gould, and P. L. Toint. Universal regularization methods: varying the power, the smoothness and the accuracy. SIAM Journal on Optimization, 29(1):595–615, 2019.
- [15] C. Cartis, N. I. Gould, and P. L. Toint. A concise second-order complexity analysis for unconstrained optimization using high-order regularized models. Optimization Methods and Software, 35(2):243–256, 2020.
- [16] C. Cartis, N. I. Gould, and P. L. Toint. Sharp worst-case evaluation complexity bounds for arbitrary-order nonconvex optimization with inexpensive constraints. SIAM Journal on Optimization, 30(1):513–541, 2020.
- [17] C. Cartis, N. I. Gould, and P. L. Toint. Evaluation Complexity of Algorithms for Nonconvex Optimization: Theory, Computation and Perspectives. SIAM, 2022.
- [18] C. Cartis and W. Zhu. Second-order methods for quartically-regularised cubic polynomials, with applications to high-order tensor methods. arXiv preprint arXiv:2308.15336, 2023.
- [19] C. Cartis and W. Zhu. Global convergence of high-order regularization methods with sums-of-squares Taylor models. arXiv preprint arXiv:2404.03035, 2024.
- [20] P. L. Chebyshev. Polnoe Sobranie Sochinenii. Izd. Akad. Nauk SSSR, 5:7–25, 1951.
- [21] C. W. Clenshaw and A. R. Curtis. A method for numerical integration on an automatic computer. Numerische Mathematik, 2(1):197–205, 1960.
- [22] A. Conn, N. Gould, and P. Toint. Trust Region Methods. MPS-SIAM Series on Optimization. Society for Industrial and Applied Mathematics, 2000.
- [23] N. Doikov. New second-order and tensor methods in convex optimization. PhD thesis, Université catholique de Louvain, 2021.
- [24] N. Doikov and Y. Nesterov. Local convergence of tensor methods. Mathematical Programming, 193(1):315–336, 2022.
- [25] G. N. Grapiglia and Y. Nesterov. Tensor methods for finding approximate stationary points of convex functions. Optimization Methods and Software, 37(2):605–638, 2022.
- [26] J. W. Helton and J. Nie. Semidefinite representation of convex sets. Mathematical Programming, 122:21–64, 2010.
- [27] J. P. Imhof. On the method for numerical integration of Clenshaw and Curtis. Numerische Mathematik, 5(1):138–141, 1963.
- [28] B. Jiang, T. Lin, and S. Zhang. A unified adaptive tensor approximation scheme to accelerate composite convex optimization. SIAM Journal on Optimization, 30(4):2897–2926, 2020.
- [29] B. Jiang, H. Wang, and S. Zhang. An optimal high-order tensor method for convex optimization. Mathematics of Operations Research, 46(4):1390–1412, 2021.
- [30] J.-B. Lasserre. Global optimization with polynomials and the problem of moments. SIAM Journal on Optimization, 11:796–817, 2000.
- [31] J.-B. Lasserre. Representation of nonnegative convex polynomials. Archiv der Mathematik, 91(2):126–130, 2008.
- [32] J.-B. Lasserre. Convexity in semialgebraic geometry and polynomial optimization. SIAM Journal on Optimization, 19:1995–2014, 2009.
- [33] K. Levenberg. A method for the solution of certain non-linear problems in least squares. Quarterly of Applied Mathematics, 2(2):164–168, 1944.
- [34] T. Lipp and S. Boyd. Variations and extension of the convex–concave procedure. Optimization and Engineering, 17(2):263–287, 2016.
- [35] J. Löfberg. YALMIP: A toolbox for modeling and optimization in MATLAB. In IEEE International Conference on Robotics and Automation, pages 284–289, 2004.
- [36] A. Majumdar, G. Hall, and A. A. Ahmadi. Recent scalability improvements for semidefinite programming with applications in machine learning, control, and robotics. Annual Review of Control, Robotics, and Autonomous Systems, 3:331–360, 2020.
- [37] D. W. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):431–441, 1963.
- [38] J. J. Moré. Recent Developments in Algorithms and Software for Trust Region Methods, pages 258–287. Springer Berlin Heidelberg, 1983.
- [39] K. G. Murty and S. N. Kabadi. Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2):117–129, 1987.
- [40] Y. Nesterov. Implementable tensor methods in unconstrained convex optimization. Mathematical Programming, 186(1):157–183, 2021.
- [41] Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
- [42] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2006.
- [43] J. M. Ortega and W. C. Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables. SIAM, 2000.
- [44] P. A. Parrilo. Structured semidefinite programs and semialgebraic geometry methods in robustness and optimization. PhD thesis, California Institute of Technology, 2000.
- [45] S. Prajna, A. Papachristodoulou, and P. A. Parrilo. Introducing SOSTOOLS: A general purpose sum of squares programming solver. In Proceedings of the 41st IEEE Conference on Decision and Control, volume 1, pages 741–746, 2002.
- [46] I. Pólik and T. Terlaky. A survey of the S-lemma. SIAM Review, 49(3):371–418, 2007.
- [47] O. Silina and J. Zhang. An unregularized third order Newton method. arXiv preprint arXiv:2209.10051, 2022.
- [48] L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM Review, 38(1):49–95, 1996.
- [49] A. Yurtsever, J. A. Tropp, O. Fercoq, M. Udell, and V. Cevher. Scalable semidefinite programming. SIAM Journal on Mathematics of Data Science, 3(1):171–200, 2021.