Heavy Ball Momentum for Non-Strongly Convex Optimization
Abstract
When considering the minimization of a quadratic or strongly convex function, it is well known that first-order methods involving an inertial term weighted by a constant-in-time parameter are particularly efficient (see Polyak [32], Nesterov [28], and references therein). By setting the inertial parameter according to the condition number of the objective function, these methods guarantee a fast exponential decay of the error. We prove that schemes of this type (later called Heavy Ball schemes) remain relevant in a relaxed setting, i.e. for composite functions satisfying a quadratic growth condition. In particular, we adapt V-FISTA, introduced by Beck in [10] for strongly convex functions, to this broader class of functions. To the authors’ knowledge, the resulting worst-case convergence rates are faster than any other in the literature, including those of FISTA restart schemes. No assumption on the set of minimizers is required and guarantees are also given in the non-optimal case, i.e. when the condition number is not exactly known. This analysis follows the study of the corresponding continuous-time dynamical system (Heavy Ball with friction system), for which new convergence results for the trajectory are shown.
1 Introduction
In many image processing or statistical problems, one needs to minimize a convex function $F$ defined on $\mathbb{R}^N$ with a non-empty set of minimizers $X^*$. In this context, when $N$ is large (i.e. for large-scale problems), second-order algorithms cannot be used and only the gradient or a subgradient of $F$ can be computed to get a minimizing sequence $(x_k)_{k \in \mathbb{N}}$.
If is convex, differentiable and has a -Lipschitz gradient, the explicit gradient descent algorithm (GD) with step defined by
is a simple first order algorithm that provides a sequence converging to a minimizer of . This method is actually slow on this class of convex functions since its asymptotic convergence rate is
This decay rate can be improved when considering strongly convex functions, since the worst-case guarantee is then
(1) |
This asymptotic decay is faster than the one obtained for convex functions, but when $\frac{\mu}{L} \ll 1$ it can still be slow in practice. As $\frac{\mu}{L}$ is the inverse of the condition number of $F$, GD is particularly slow for large-scale problems.
Two remarks can be made about these decays. First, if $F$ is not differentiable but composite, GD can be replaced by the Forward-Backward algorithm and the two decays above are still valid. Precise definitions of composite functions and of the Forward-Backward algorithm are given in Section 2. Second, the above exponential decay of the error is stated under a strong convexity assumption but extends to weaker hypotheses such as a quadratic growth condition.
In 1964 Polyak introduces the Heavy Ball (HB) scheme inspired by mechanics, which improves the decay of gradient descent on the class of strongly convex functions by incorporating inertia. This scheme generates a sequence of iterates ensuring that:
(2) |
If $\frac{\mu}{L} \ll 1$, this convergence rate is significantly faster than the rate (1) guaranteed by the Forward-Backward algorithm. This theoretical improvement reflects a better performance in practice. At the core of Polyak’s analysis is the fact that, in the neighborhood of its unique minimizer, $F$ behaves like a quadratic function. However, the $C^2$ assumption is crucial in Polyak’s analysis, and examples of simple strongly convex functions for which (HB) provides diverging sequences can be found in [21].
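To make the gap between the rates (1) and (2) concrete, the following minimal Python sketch runs gradient descent and a heavy ball iteration on a two-dimensional quadratic. It is an illustration under stated assumptions, not the scheme analyzed in this paper: the test function, the starting point and all helper names are hypothetical, and the heavy ball parameters are the classical tuning for quadratics usually attributed to Polyak, $s = 4/(\sqrt{L}+\sqrt{\mu})^2$ and $\beta = \left((\sqrt{L}-\sqrt{\mu})/(\sqrt{L}+\sqrt{\mu})\right)^2$.

```python
import math

def quad_value(x, eigs):
    """F(x) = 0.5 * sum_i eigs[i] * x[i]**2, minimized at 0 with F* = 0."""
    return 0.5 * sum(l * xi * xi for l, xi in zip(eigs, x))

def quad_grad(x, eigs):
    return [l * xi for l, xi in zip(eigs, x)]

def gradient_descent(x0, eigs, n_iter):
    """Explicit gradient descent with step 1/L (L = largest eigenvalue)."""
    L = max(eigs)
    x = list(x0)
    for _ in range(n_iter):
        g = quad_grad(x, eigs)
        x = [xi - gi / L for xi, gi in zip(x, g)]
    return x

def polyak_heavy_ball(x0, eigs, n_iter):
    """Heavy ball iteration with the classical tuning for quadratics (assumed)."""
    mu, L = min(eigs), max(eigs)
    s = 4.0 / (math.sqrt(L) + math.sqrt(mu)) ** 2
    beta = ((math.sqrt(L) - math.sqrt(mu)) / (math.sqrt(L) + math.sqrt(mu))) ** 2
    x_prev, x = list(x0), list(x0)  # zero initial velocity
    for _ in range(n_iter):
        g = quad_grad(x, eigs)
        x_next = [xi - s * gi + beta * (xi - xpi)
                  for xi, gi, xpi in zip(x, g, x_prev)]
        x_prev, x = x, x_next
    return x

# Hypothetical ill-conditioned example: mu = 0.01, L = 1, so mu / L = 0.01.
eigs = [0.01, 1.0]
x0 = [1.0, 1.0]
f_gd = quad_value(gradient_descent(x0, eigs, 200), eigs)
f_hb = quad_value(polyak_heavy_ball(x0, eigs, 200), eigs)
print(f_gd, f_hb)
```

With $\mu/L = 10^{-2}$, the heavy ball error after 200 iterations is many orders of magnitude below that of gradient descent, which is the qualitative difference between a rate driven by $\mu/L$ and one driven by $\sqrt{\mu/L}$.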
In 1983 Nesterov [27] proposes an inertial scheme built to speed up the convergence of GD on the class of convex functions. This acceleration process is at the core of FISTA introduced by Beck and Teboulle [11], which applies to composite functions and provides a sequence such that
The details of this algorithm and its convergence rates are given in Section 2. The main difference between the Heavy Ball algorithm and the Nesterov scheme is the inertia parameter: it is constant over the iterations and depends on $\mu$ and $L$ for Heavy Ball, while for FISTA it depends on the iteration number $k$ and tends to 1 when $k$ goes to $+\infty$.
Many variations of these schemes have been proposed during the last decade, see Table 1 for various examples, and the behavior, rates and stability of these various schemes are now well understood. A common approach is to study an associated dynamical system via a Lyapunov analysis before deriving convergence results on the scheme, see e.g. [9] and the references therein.
Several Heavy Ball schemes have been proposed to provide fast decays of the type (2) under weaker hypotheses than $C^2$ regularity and strong convexity [28, 10, 35, 37, 36]. But for all these schemes, a fast geometric decay such as (2) is achieved only on classes of functions having a unique minimizer: no known inertial scheme achieves such rates on the class of convex functions satisfying a simple quadratic growth condition,
(3) |
or equivalently, in the convex setting, a Łojasiewicz property with exponent $\frac{1}{2}$, without introducing an additional uniqueness hypothesis. In other words, no known inertial scheme provides better asymptotic bounds than (GD) within the class of convex functions satisfying a quadratic growth condition. Thus, it remains unclear whether inertia holds any real significance for this class of functions.
The main contribution of this work is to provide Heavy Ball schemes, similar to Beck’s V-FISTA, ensuring rates of on the class of convex functions satisfying some quadratic growth condition, where the value of will be specified later. The inertia parameter depends on the knowledge of and , but this new scheme guarantees exponential decay even if is overestimated. We prove that an overestimation of only results in suboptimal exponential decay. Theorem 1 provides a straightforward Lyapunov analysis and a fast convergence rate for a given friction parameter, while Theorem 2 yields rates that can be achieved even if the friction parameter is not set in an optimal way, demonstrating that fast exponential decay is robust to a mild overestimation of the quadratic growth parameter .
The paper is organized as follows: in Section 2 we introduce the main geometric assumption made on the function to minimize, namely the quadratic growth condition, and review the literature on the convergence rates of inertial algorithms under this condition. Section 3 is devoted to our two main theorems and their corollaries, proving that Heavy Ball like methods can be properly parameterized to achieve fast exponential decay for this class of functions. Section 4 presents the continuous counterpart of the discrete analysis of Section 3, providing a guide to construct the proofs of the theorems presented in Section 3, as well as new results on the convergence rate for the trajectories of the Heavy Ball dynamical system. All the proofs are gathered in Section 5, and the more technical ones are detailed in the Appendix.
2 Geometry of convex functions and inertial algorithms: definitions and state of the art.
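Let us first recall some basic notations and definitions. We assume that $\mathbb{R}^N$ is equipped with the Euclidean scalar product $\langle \cdot, \cdot \rangle$ and the associated norm $\|\cdot\|$. As usual, $B(x^*, r)$ denotes the open Euclidean ball with center $x^* \in \mathbb{R}^N$ and radius $r > 0$. For any subset $X \subset \mathbb{R}^N$, the Euclidean distance $d(\cdot, X)$ is defined as:
$$d(x, X) = \inf_{y \in X} \|x - y\|.$$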
2.1 Framework and notations
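In this paper we focus on the class of composite functions $F = f + h$, where $f$ is a convex, differentiable function having an $L$-Lipschitz gradient and $h$ is a proper lower semicontinuous (l.s.c.) convex function whose proximal operator is known. The proximal operator of $h$ is denoted by $\operatorname{prox}_h$ and defined by:
$$\operatorname{prox}_h(x) = \operatorname*{arg\,min}_{y \in \mathbb{R}^N} \left( h(y) + \frac{1}{2}\|x - y\|^2 \right). \tag{4}$$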
For this class of functions, a classical minimization algorithm is the Forward-Backward algorithm (FB), whose iterations are described by:
$$x_{k+1} = \operatorname{prox}_{sh}\left(x_k - s \nabla f(x_k)\right), \quad s > 0. \tag{5}$$
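As an illustration of the Forward-Backward iteration, here is a minimal Python sketch. It assumes that (5) has the usual form $x_{k+1} = \operatorname{prox}_{sh}(x_k - s\nabla f(x_k))$ with step $s = 1/L$; the separable test problem $f(x) = \frac{1}{2}\sum_i a_i (x_i - b_i)^2$, $h = \lambda\|\cdot\|_1$ and every helper name are hypothetical choices made so that the result can be compared with a closed-form minimizer.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * |.| evaluated at the scalar v."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0

def forward_backward(grad_f, prox_h, x0, step, n_iter):
    """Forward (gradient) step on f followed by a backward (prox) step on h."""
    x = list(x0)
    for _ in range(n_iter):
        g = grad_f(x)
        x = prox_h([xi - step * gi for xi, gi in zip(x, g)], step)
    return x

# Hypothetical separable problem: f(x) = 0.5 * sum a_i (x_i - b_i)^2, h = lam * ||x||_1.
a = [1.0, 4.0]
b = [2.0, 0.1]
lam = 0.5

def grad_f(x):
    return [ai * (xi - bi) for ai, xi, bi in zip(a, x, b)]

def prox_h(z, step):
    # prox of step * lam * ||.||_1 acts coordinate by coordinate
    return [soft_threshold(zi, step * lam) for zi in z]

L = max(a)  # Lipschitz constant of grad_f for this diagonal problem
x_fb = forward_backward(grad_f, prox_h, [0.0, 0.0], 1.0 / L, 200)

# Coordinate-wise closed form of the minimizer, used only as a sanity check.
x_closed_form = [soft_threshold(bi, lam / ai) for ai, bi in zip(a, b)]
print(x_fb, x_closed_form)
```

The second coordinate is set exactly to zero by the proximal step, which is the behavior that makes this splitting attractive for sparsity-promoting penalties.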
Without further assumptions on $F$, the convergence of the FB algorithm, i.e. the decay of $F(x_k) - F^*$ along the iterates, may be slow. In this paper we are interested in inertial methods, which are among the most effective first-order optimization methods and may ensure better convergence rates, especially when $F$ is additionally strongly convex:
Definition 1 (Strong convexity ).
Let $F$ be a proper lower semicontinuous convex function. The function $F$ is said to be $\mu$-strongly convex for some real constant $\mu > 0$ if the function $x \mapsto F(x) - \frac{\mu}{2}\|x\|^2$ is convex.
Weakening this assumption, we consider the class of convex composite functions satisfying some quadratic growth condition:
Definition 2 (Quadratic growth condition ).
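Let $F$ be a proper lower semicontinuous convex function such that: $X^* = \operatorname*{arg\,min} F \neq \emptyset$ and $F^* = \inf F$. The function $F$ satisfies a quadratic growth condition for some real constant $\mu > 0$ if:
$$\forall x \in \mathbb{R}^N, \quad \frac{\mu}{2}\, d(x, X^*)^2 \leq F(x) - F^*. \tag{6}$$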
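A simple one-dimensional example may help to see that quadratic growth does not force uniqueness of the minimizer. The sketch below assumes the condition is read as $\frac{\mu}{2} d(x, X^*)^2 \leq F(x) - F^*$ and uses the hypothetical function $F(x) = \frac{1}{2}\max(|x| - 1, 0)^2$, whose set of minimizers is the whole interval $[-1, 1]$ and which is therefore not strongly convex. The helper names and the grid are illustrative only.

```python
def F(x):
    """F(x) = 0.5 * max(|x| - 1, 0)^2, with F* = 0 attained on [-1, 1]."""
    d = max(abs(x) - 1.0, 0.0)
    return 0.5 * d * d

def dist_to_minimizers(x):
    """Euclidean distance from x to X* = [-1, 1]."""
    return max(abs(x) - 1.0, 0.0)

def has_quadratic_growth(mu, points, tol=1e-12):
    """Check (mu / 2) * d(x, X*)^2 <= F(x) - F* on a finite set of points."""
    return all(F(x) + tol >= 0.5 * mu * dist_to_minimizers(x) ** 2
               for x in points)

grid = [-3.0 + 0.01 * i for i in range(601)]  # sample points in [-3, 3]
print(has_quadratic_growth(1.0, grid))  # growth holds with mu = 1
print(has_quadratic_growth(1.1, grid))  # fails for any larger constant
```

Here $F(x) - F^*$ equals $\frac{1}{2} d(x, X^*)^2$ exactly, so $\mu = 1$ is the largest admissible growth constant for this example.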
Classically, the quadratic growth condition can be seen as a relaxation of strong convexity. Note that satisfying some growth condition does not impose the uniqueness of the minimizer, as strong convexity does. In the convex setting, the quadratic growth condition is equivalent to a global Łojasiewicz property with exponent $\frac{1}{2}$ [20]. The Łojasiewicz property [24, 25] is a key tool in the mathematical analysis of continuous and discrete dynamical systems, initially introduced to prove the convergence of the trajectories for the gradient flow of analytic functions. An extension to nonsmooth functions has been proposed by Bolte et al. in [13]:
Definition 3.
Let be a proper lower semicontinuous convex function with . The function has a Łojasiewicz property of exponent if for any minimizer , there exist real constants and such that:
(7) |
The Łojasiewicz property is said to be global if (7) is satisfied for any .
A general inertial optimization method can be described as follows:
(8) |
where denotes some friction parameter and or depending on the considered method. Historically, in his seminal work [32], B.T. Polyak proposes a first inertial scheme by choosing a constant friction parameter and , for the minimization of strongly convex functions. One of the most popular inertial algorithms is FISTA (for Fast Iterative Shrinkage-Thresholding Algorithm), introduced by Beck and Teboulle in [11] to minimize convex composite functions. Inspired by Nesterov’s accelerated gradient method proposed in [27], the friction parameter is defined by:
(9) |
where the sequence is recursively defined by: and . Chambolle and Dossal propose in [16] a variant of FISTA defining for any where . The original choice of Nesterov can be approximated by setting .
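The behavior of the FISTA inertia can be checked numerically. The sketch below assumes the classical Beck and Teboulle choice, namely $\alpha_k = \frac{t_k - 1}{t_{k+1}}$ with $t_1 = 1$ and $t_{k+1} = \frac{1 + \sqrt{1 + 4 t_k^2}}{2}$; the function name is hypothetical.

```python
import math

def fista_inertia_sequence(n_terms):
    """Return [alpha_1, ..., alpha_n] with alpha_k = (t_k - 1) / t_{k+1}."""
    alphas = []
    t = 1.0  # t_1 = 1
    for _ in range(n_terms):
        t_next = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t * t))
        alphas.append((t - 1.0) / t_next)
        t = t_next
    return alphas

alphas = fista_inertia_sequence(1000)
print(alphas[0], alphas[9], alphas[-1])
```

The sequence starts at zero, increases with the iteration number and approaches 1, in contrast with the constant inertia used by Heavy Ball schemes.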
In this paper we consider the family of Heavy Ball algorithms for which the friction parameter is set to a constant and . The term Heavy Ball refers to a family of optimization schemes that can be interpreted as discretizations of the following second-order ordinary differential equation:
$$\ddot{x}(t) + \alpha \dot{x}(t) + \nabla F(x(t)) = 0, \tag{10}$$
which describes the motion of a heavy ball in a potential field with constant friction. The inertia coefficient has to be parameterized according to the geometry of $F$ to get an optimal convergence rate. For the class of $\mu$-strongly convex functions, Beck in [10, Chapter 10.7.7] proposes the following choice (following Nesterov’s choice [28]):
$$\alpha = \frac{1 - \sqrt{\kappa}}{1 + \sqrt{\kappa}}, \tag{11}$$
where $\kappa = \frac{\mu}{L}$, leading to the algorithm V-FISTA (seen as a variant of FISTA by the author).
2.2 Convergence rate of inertial algorithms under quadratic growth assumptions
Let be a convex composite function where is a convex, differentiable function having a -Lipschitz gradient and is a proper semicontinuous convex function whose proximal operator is known.
When the function to minimize satisfies some additional growth assumption , Garrigos et al. [20] prove that the Forward-Backward algorithm (5) provides better theoretical guarantees than in the general convex case. More precisely, they show that the function values achieve an exponential decay along the iterates of Forward-Backward, without any assumption on the set of minimizers . Observe that this convergence rate depends on the ratio $\frac{\mu}{L}$, which represents the inverse of the conditioning of $F$ and can be very small in large-scale optimization. Note also that Necoara et al. proved similar results for the projected gradient algorithm in [26].
While Nesterov’s accelerations allow for speeding up gradient-based algorithms for the class of convex functions, it is less clear for the class of convex functions satisfying some quadratic growth condition. Considering FISTA, in its historical form by Beck and Teboulle (9) or its variant introduced by Chambolle and Dossal in [16], the convergence rate is still polynomial for the class of convex functions satisfying some quadratic growth condition. Note however that considering the variant of FISTA introduced by [16], Aujol et al. prove in [9] that the sequence provided by (8) with and satisfies:
(12) |
which is better than the rate $O(1/k^2)$ in the convex setting from [27, 11], which is in fact $o(1/k^2)$ as proved by Attouch and Peypouquet [4]. Although this decay is not exponential, the authors show that the friction parameter can be set according to some desired accuracy, and in that case the number of iterations required to achieve this accuracy is comparable to methods ensuring a fast exponential decay of the error, i.e. an exponential decay depending on $\sqrt{\kappa}$, which can be much faster than an exponential decay rate solely in $\kappa$ as for Forward-Backward. Note that this result holds under the assumption that $F$ has a unique minimizer.
One way to accelerate the convergence of FISTA for the class of composite convex functions having some quadratic growth property is to use restart strategies. The idea of this approach is to take advantage of inertia while avoiding oscillations by re-initializing the inertia parameter to zero when some restart condition is verified. Empirical restart rules have been proposed by Giselsson and Boyd [22] or O’Donoghue and Candès [30], offering an improved convergence of FISTA in practice but without theoretical guarantees. Elementary computations show that re-initializing the inertia parameter every iterations allows the resulting sequence to satisfy:
(13) |
This convergence rate is actually the fastest in the literature for restart methods and does not require the uniqueness of the minimizer. But note that it requires knowledge of the value of $\mu$, see e.g. [29, 30, 26]. Recently, adaptive restart schemes have been developed aiming at exploiting the geometry assumption to derive improved convergence rates without knowing exactly the growth parameter $\mu$: Fercoq and Qu [19], Alamo et al. [1], Aujol et al. [6, 5] introduce restart schemes ensuring a fast exponential decay of the error (i.e. depending on $\sqrt{\kappa}$). The schemes having the best theoretical guarantees in this setting are the one proposed by Alamo et al. in [1] () and the method introduced by Renegar and Grimmer in [33] (). As for the optimal periodic restart, no uniqueness assumption is needed on the set of minimizers of $F$ to obtain these guarantees.
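The periodic restart mechanism described above can be sketched in a few lines of Python. This is only a sketch under stated assumptions: it treats the smooth case ($h = 0$), uses the standard FISTA recursion of Beck and Teboulle, and the restart period of 30 iterations is an arbitrary illustrative value, not the optimal period discussed in the text. The function and variable names are hypothetical.

```python
import math

def fista_with_periodic_restart(grad_f, L, x0, n_iter, period):
    """FISTA for a smooth convex f, with the inertia reset every `period` steps."""
    x = list(x0)
    y = list(x0)
    t = 1.0
    for k in range(1, n_iter + 1):
        g = grad_f(y)
        x_new = [yi - gi / L for yi, gi in zip(y, g)]
        t_new = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t * t))
        alpha = (t - 1.0) / t_new  # increasing inertia between two restarts
        y = [xn + alpha * (xn - xo) for xn, xo in zip(x_new, x)]
        x, t = x_new, t_new
        if k % period == 0:
            # Restart: drop the accumulated momentum and start again from x.
            y, t = list(x), 1.0
    return x

# Hypothetical test function: F(x) = 0.5 * (0.01 * x_1^2 + x_2^2), mu = 0.01, L = 1.
eigs = [0.01, 1.0]

def grad_f(x):
    return [l * xi for l, xi in zip(eigs, x)]

def value(x):
    return 0.5 * sum(l * xi * xi for l, xi in zip(eigs, x))

x_restart = fista_with_periodic_restart(grad_f, 1.0, [1.0, 1.0], 300, 30)
print(value(x_restart))
```

Between two restarts the classical FISTA bound $F(x_k) - F^* \leq \frac{2L\|x_0 - x^*\|^2}{(k+1)^2}$ applies, and combining it with the growth condition shows that each cycle of 30 iterations contracts the error by a fixed factor on this example.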
In contrast to FISTA and Nesterov’s accelerated gradient method, Heavy Ball type schemes are designed for convex functions satisfying additional growth assumptions such as $\mu$-strong convexity. To this end, these methods need to be calibrated according to the growth parameter $\mu$. In his seminal paper [32], Polyak introduces the first Heavy Ball method for $\mu$-strongly convex functions which guarantees a convergence rate of the error of . This decay rate is significantly fast but relies strongly on the $C^2$ assumption. Ghadimi et al. in [21] provide a convex function such that this method does not converge. Nesterov proposes in [28] the accelerated gradient method for strongly convex functions which only requires a assumption ensuring that the error decreases as . In this setting, several Heavy Ball schemes have been proposed such as Siegel’s Heavy Ball method [35] and the geometric descent method [15] which have the same theoretical asymptotic guarantees as Nesterov’s accelerated gradient method for strongly convex functions, the Heavy Ball method by Aujol et al. [7] for strongly convex functions (which we will denote ADR- Heavy Ball), the triple momentum method by Van Scoy et al. [37] and ITEM by Taylor and Drori [36]. The latter two schemes are built thanks to the Performance Estimation Problem approach introduced by Drori and Teboulle [18] and they provide the best bounds on the error for this class of functions and first-order methods (). Some of these schemes can be adapted to composite optimization as detailed in Table 1. Note that Beck generalizes Nesterov’s accelerated gradient method for strongly convex functions to composite optimization in [10] with V-FISTA, proving the same theoretical convergence rate of the error.
Algorithm | Reference | Assumption on | Convergence rate of |
Polyak’s Heavy Ball |
Polyak [32] |
and |
|
Nesterov’s accelerated gradient method for strongly convex functions |
Nesterov [28] |
||
Necoara et al. [26] |
and |
||
, and uniqueness of the projection of the iterates onto |
|||
Geometric descent method |
Bubeck et al. [15] |
||
Chen et al. [17] |
|||
Adapted to composite optimization |
|||
Triple momentum method |
Van Scoy et al. [37] |
and |
|
ITEM |
Taylor, Drori [36] |
and |
|
Polyak’s Heavy Ball with general friction |
Ghadimi et al. [21] |
and |
|
Siegel’s Heavy Ball |
Siegel [35] |
and |
|
Adapted to composite optimization |
|||
V-FISTA |
Beck [10] |
||
Adapted to composite optimization |
|||
ADR- Heavy Ball |
Aujol et al. [7] |
||
Adapted to composite optimization |
|||
ADR- Heavy Ball |
Aujol et al. [8] |
and uniqueness of the minimizer |
|
Adapted to composite optimization |
|||
ADR- Heavy Ball |
Aujol et al. [8] |
||
Adapted to composite optimization |
Recently, Heavy Ball type schemes have been studied under weaker geometry assumptions than strong convexity. Necoara et al. prove in [26] that the convergence rate of Nesterov’s accelerated gradient method for strongly convex functions is actually valid for -quasi-strongly convex functions i.e. for functions satisfying:
(14) |
where denotes the projection onto , provided that the iterates share the same projection onto the set of minimizers. In [8], Aujol et al. build a Heavy Ball type scheme (ADR- Heavy Ball) for functions satisfying the quadratic growth assumption guaranteeing that
(15) |
as long as has a unique minimizer.
Thus, the theoretical guarantees of Heavy Ball type schemes are the best in the literature among first-order methods for functions satisfying growth conditions but they do not hold without assuming the uniqueness of the minimizer. If this hypothesis is not verified, the theoretical convergence rates are similar to those of Forward-Backward, and the relevance of applying such algorithms in this context can therefore be questioned.
3 Fast exponential decay for Heavy Ball type algorithms
Let us now consider Heavy Ball type methods that can be generically described as variants of the V-FISTA algorithm proposed by Beck in [10]:
(V-FISTA) |
with , , and any . Recall that in the original definition of V-FISTA [10], the damping parameter is set to $\alpha = \frac{1 - \sqrt{\kappa}}{1 + \sqrt{\kappa}}$, where $\kappa = \frac{\mu}{L}$ denotes the inverse of the conditioning of the function to minimize.
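For readers who want to experiment, here is a minimal Python sketch of a constant-inertia scheme of V-FISTA type. It assumes the iteration takes the form used by Beck, $y_k = x_k + \alpha (x_k - x_{k-1})$ followed by $x_{k+1} = \operatorname{prox}_{sh}(y_k - s \nabla f(y_k))$ with $s = 1/L$. The test problem is a hypothetical two-variable LASSO with two identical columns, $F(x) = \frac{1}{2}(x_1 + x_2 - b)^2 + \lambda(|x_1| + |x_2|)$, whose minimizers form a segment and are therefore not unique. The value `kappa = 0.25` is an arbitrary illustrative input and is not claimed to be the growth ratio of this function.

```python
import math

def soft_threshold(v, tau):
    """Proximal operator of tau * |.| evaluated at the scalar v."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0

def v_fista(grad_f, prox_h, x0, step, alpha, n_iter):
    """Constant-inertia forward-backward scheme (assumed V-FISTA form)."""
    x_prev, x = list(x0), list(x0)
    for _ in range(n_iter):
        # Extrapolation with a constant inertia parameter alpha.
        y = [xi + alpha * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        g = grad_f(y)
        x_next = prox_h([yi - step * gi for yi, gi in zip(y, g)], step)
        x_prev, x = x, x_next
    return x

b, lam = 2.0, 0.5

def grad_f(x):
    r = x[0] + x[1] - b
    return [r, r]

def prox_h(z, step):
    return [soft_threshold(zi, step * lam) for zi in z]

def objective(x):
    return 0.5 * (x[0] + x[1] - b) ** 2 + lam * (abs(x[0]) + abs(x[1]))

L = 2.0       # Lipschitz constant of grad_f (largest eigenvalue of [[1, 1], [1, 1]])
kappa = 0.25  # hypothetical value of mu / L, chosen only for illustration
alpha = (1.0 - math.sqrt(kappa)) / (1.0 + math.sqrt(kappa))

x_out = v_fista(grad_f, prox_h, [3.0, 0.0], 1.0 / L, alpha, 100)
f_star = lam * b - 0.5 * lam ** 2  # minimal value, attained when x1 + x2 = b - lam, x >= 0
print(x_out, objective(x_out) - f_star)
```

On this example the iterates reach the segment of minimizers and then settle on one of its points, even though the minimizer is not unique.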
The main contribution in this section is to prove that Heavy Ball methods like (V-FISTA) can be properly parameterized to achieve fast exponential decay rates (i.e. depending on ) for the class of convex composite functions satisfying some quadratic growth property and without assuming the uniqueness of the minimizer. An example of such functions is the LASSO functional :
$$F(x) = \frac{1}{2}\|Ax - b\|^2 + \lambda \|x\|_1. \tag{16}$$
This function is convex, satisfies a quadratic growth condition but may not have a unique minimizer. To the best of our knowledge, this is the first result proving that an inertial method can actually improve the convergence rate of the Forward-Backward algorithm, which is in $\kappa$ and not in $\sqrt{\kappa}$. In large-scale problems, the inverse $\kappa$ of the conditioning of the function to minimize could be very small, so that decaying in $\kappa$ could be much slower than in $\sqrt{\kappa}$.
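The practical meaning of a decay in $\kappa$ versus a decay in $\sqrt{\kappa}$ can be illustrated with a rough iteration count. The sketch below compares the number of iterations needed for $e^{-\kappa k}$ and $e^{-\sqrt{\kappa} k}$ to fall below a tolerance. It deliberately ignores the multiplicative constants that appear in the exponents and in front of the actual bounds, so it only gives orders of magnitude; the values of `kappa` and `eps` are hypothetical.

```python
import math

def iterations_to_reach(eps, exponent):
    """Smallest integer k such that exp(-exponent * k) <= eps."""
    return math.ceil(math.log(1.0 / eps) / exponent)

kappa = 1e-4  # hypothetical inverse condition number
eps = 1e-6    # hypothetical target accuracy

k_slow = iterations_to_reach(eps, kappa)             # decay driven by kappa
k_fast = iterations_to_reach(eps, math.sqrt(kappa))  # decay driven by sqrt(kappa)
print(k_slow, k_fast)
```

The ratio between the two counts is of order $1/\sqrt{\kappa}$, which is why the distinction matters for ill-conditioned problems.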
Our purpose is to build an inertial algorithm providing fast exponential decay, that is ensuring that but not to optimize the value of . Both Theorems 1 and 2 provide such bounds. In Theorem 1, the value of the inertial parameter is chosen to provide a fast decay rate under a mild condition of and by using a quite simple Lyapunov analysis. Indeed, other choices could have been made leading to slightly different inertia, rates, conditions on and Lyapunov functions, using the same approach.
The rates given by Theorem 2 are different from Theorem 1 since the objective is not the same. This second theorem provides bounds for a large set of inertial parameters and the inertia is not optimized for a fixed bound on . Indeed, in various settings, the exact value of is not precisely known. For the LASSO functional (16) for example, the value of the quadratic growth parameter may be hard to estimate. The goal of this second theorem is to provide fast exponential decays even if the inertia parameter is not set to its optimal value, and to provide rates that are more accurate when is smaller. The proof of this second theorem is inspired by the proof of Theorem 3 which provides results on the solution of the Heavy Ball dynamical system.
Theorem 1.
Let where is a convex differentiable function having a -Lipschitz gradient for some , and a proper convex l.s.c. function. Assume that satisfies a quadratic growth condition for some real parameter . Let be the sequence provided by (V-FISTA) with and . If , then:
(17) |
and
(18) |
Theorem 1, whose proof can be found in Section 5.1, ensures that for a well-chosen parameter which depends on , the decay of the error along the iterates of (V-FISTA) is at worst of order . Looking back at the results in the literature, this convergence rate is slower than those of most other Heavy Ball schemes. However, remember that the required assumptions on in these works (summarized in Table 1) are stronger than those needed in Theorem 1. The only method proposed for functions satisfying , i.e. ADR- Heavy Ball [8], ensures a fast exponential decay of the error:
if the function has a unique minimizer. This theoretical decay is faster than (17), but it does not hold if has multiple minimizers. To the authors’ knowledge, the fast exponential decay of the error given by Theorem 1 is the first in the literature for Heavy Ball methods in this setting and without any uniqueness assumption on the set of minimizers.
In addition, the guarantee on the decay of the error given in (17) is faster than that given by FISTA restarted periodically and optimally as it only ensures (even with some oracle [26])
This means that in this setting, (V-FISTA) is theoretically more relevant than a periodic restart of FISTA when the growth parameter is known. In other words, one should define a constant inertia parameter depending on and instead of setting an increasing inertia parameter and re-initializing it optimally.
The second claim of Theorem 1 gives an asymptotic control on which ensures that . As a consequence, the length of the trajectory of the sequence is finite and it converges strongly to a minimizer of the function .
Below is a second theorem about the iterates of (V-FISTA) which gives stronger and more general results than Theorem 1. The proof is built using the parallel with dynamical systems (see Section 4) and is located in Section 5.2.
Theorem 2.
Let where is a convex differentiable function having a -Lipschitz gradient for some , and a proper convex l.s.c. function. Assume that satisfies a quadratic growth condition for some real parameter . Let be the sequence provided by (V-FISTA) with where . Then, for any :
(19) |
and
(20) |
where satisfies the following inequality:
(21) |
The statements of Theorem 2 are less readable than those of Theorem 1 but they are actually stronger. The inequality (21) hides the convergence rates which can be obtained for a given . Observe that the larger , the better the convergence rate. The best rates are obtained when:
(22) |
The admissible maximum value of can be thus obtained studying the limit case when :
(23) |
i.e. (since :
(24) |
This can be seen as a quadratic polynomial in , whose discriminant is:
Hence the largest value of for which the discriminant satisfies is , which corresponds to a limit maximum value of equal to . These two observations highlight that the convergence rates given by Theorem 2, are faster than that given in Theorem 1 for suitable choices of , as expressed in the following corollary. Note that the convergence guarantees and best choices of depend on the value of the conditioning number since appears in Equation (22).
Corollary 1.
Let where is a convex differentiable function having a -Lipschitz gradient for some , and a proper convex l.s.c. function. Assume that satisfies a quadratic growth condition for some real parameter . Let .
Let be two real parameters chosen such that:
(25) |
and the value of is maximum. Let be the sequence provided by (V-FISTA) with . Then:
(26) |
and
(27) |
where:
and there exist three real-valued functions , such that: and:
Thus, Corollary 1 provides better convergence rates than Theorem 1 (since ). We can remark that the guarantees given by Aujol et al. in [8] for ADR- are still better with the additional assumption that has a unique minimizer.
In fact, Theorem 2 and inequality (21) hide more than improved convergence rates. To illustrate this, we provide in Figure 1 a graph displaying the evolution of with respect to and such that satisfy (22). An interactive graph can be found at https://www.desmos.com/calculator/syrtiatos6. We can see on this graph that inequality (21) makes it possible to obtain convergence guarantees even for non-optimal choices of , i.e. large values of .
By exploiting this observation, the following corollary provides convergence rates for (V-FISTA) if is too small which can be the case if is overestimated. A brief proof is given in Section 5.3.
Corollary 2.
Let where is a convex differentiable function having a -Lipschitz gradient for some , and a proper convex l.s.c. function. Assume that satisfies a quadratic growth condition for some real parameter . Let be the sequence provided by (V-FISTA) with and for some . Then, if ,
(28) |
and
(29) |
where .
Corollary 2 allows us to derive convergence rates for (V-FISTA) even if is far from its optimal value. Let us describe two examples:
- Suppose that where . Then, applying Corollary 2 with , we get that the iterates of (V-FISTA) for this inertia parameter ensure a decrease of the error in where . Thus, if we choose where is an upper estimation of , then we get a theoretical guarantee on the error with . In this way, we obtain that if , then the error decreases as where .
- Assume now that is arbitrarily set to without knowing the actual value of . Consequently, we have that where . If (i.e. if ), then Corollary 2 states that the error along the iterates of (V-FISTA) for this choice of decreases in the worst case as where which can be upper bounded by . As a consequence, if is sufficiently ill-conditioned, (V-FISTA) has better theoretical guarantees for this choice of than Forward-Backward which ensures that [20].
The convergence guarantees obtained in non-optimal cases and the robustness to an overestimation of the growth parameter are the main contributions of this work. To the authors’ knowledge, there are no such results in the literature. This provides a better understanding of the behavior of the iterates of (V-FISTA) for a wide range of values of .
4 Convergence rates for the trajectories of the Heavy Ball dynamical system
The so-called Heavy Ball equation
$$\ddot{x}(t) + \alpha \dot{x}(t) + \nabla F(x(t)) = 0, \tag{30}$$
where $\alpha > 0$ denotes the friction parameter, has been studied for decades. See for example Attouch et al. [3] for a general study of the dynamical system: existence of solutions, link with the mechanical system and convergence of the trajectories to critical points if no strong assumptions are made on $F$. The main result of this section is to prove that if $F$ is convex, differentiable and satisfies a quadratic growth condition, the solution of (30) ensures a fast exponential decay. A crucial point here is that the uniqueness of the minimizer of $F$ is not needed to get such fast rates. Before stating Theorem 3, we give an overview of the literature. We highlight that slower exponential decays are already known in this setting, that a fast decay is known if the minimizer of $F$ is supposed to be unique and that various other decays can be achieved in nonconvex settings. Note that the proof of Theorem 3 has served as a guide for the analysis of the Heavy Ball algorithm as performed in Section 3 and for the proof of Theorem 2.
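The trajectories of the Heavy Ball system can be explored numerically. The sketch below assumes the classical form $\ddot{x}(t) + \alpha \dot{x}(t) + \nabla F(x(t)) = 0$, rewrites it as a first-order system in $(x, v)$ with $v = \dot{x}$, and integrates it with a hand-written fourth-order Runge-Kutta step. The potential $F(x) = \frac{1}{2}\max(|x| - 1, 0)^2$, the friction value $\alpha = 1.5$, the time step and the initial condition are hypothetical choices for illustration, not the parameters analyzed in this section.

```python
def F(x):
    """F(x) = 0.5 * max(|x| - 1, 0)^2: convex, with minimizers X* = [-1, 1]."""
    d = max(abs(x) - 1.0, 0.0)
    return 0.5 * d * d

def grad_F(x):
    d = max(abs(x) - 1.0, 0.0)
    return d if x > 0 else -d

def hbf_rhs(state, alpha):
    """Right-hand side of the first-order system x' = v, v' = -alpha v - F'(x)."""
    x, v = state
    return (v, -alpha * v - grad_F(x))

def rk4_step(state, alpha, dt):
    k1 = hbf_rhs(state, alpha)
    k2 = hbf_rhs((state[0] + 0.5 * dt * k1[0], state[1] + 0.5 * dt * k1[1]), alpha)
    k3 = hbf_rhs((state[0] + 0.5 * dt * k2[0], state[1] + 0.5 * dt * k2[1]), alpha)
    k4 = hbf_rhs((state[0] + dt * k3[0], state[1] + dt * k3[1]), alpha)
    return (
        state[0] + dt / 6.0 * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0]),
        state[1] + dt / 6.0 * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1]),
    )

def integrate_hbf(x0, v0, alpha, dt, n_steps):
    state = (x0, v0)
    for _ in range(n_steps):
        state = rk4_step(state, alpha, dt)
    return state

# Start at x = 3 with zero velocity and integrate up to time T = 20.
x_T, v_T = integrate_hbf(3.0, 0.0, 1.5, 0.01, 2000)
print(x_T, v_T, F(x_T))
```

In this run the trajectory enters the flat region $[-1, 1]$ with a small velocity and comes to rest at a point of the set of minimizers, which illustrates convergence of the trajectory without uniqueness of the minimizer.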
4.1 State of the art
The first study of the convergence rate of $F(x(t)) - F^*$, where $x$ is a solution of (30), under a strong convexity assumption is due to Polyak [32]. In his seminal work, Polyak observes that if the function is a quadratic function, , the solution of (30) ensures the fastest decay of when where is the smallest non-negative eigenvalue of . He deduced that if $F$ is $C^2$ and $\mu$-strongly convex, the solution of (HBF) satisfies
(31) |
Both $C^2$ regularity and strong convexity are necessary to achieve such a decay. During the last decades several works provide various bounds depending on geometrical assumptions made on $F$. A summary is given in Table 3. To achieve an exponential decay of $F(x(t)) - F^*$, a Łojasiewicz condition with exponent $\frac{1}{2}$ is required in all these works. In [12], the authors proved that with other Łojasiewicz exponents only polynomial decay can be achieved. Nevertheless the exact exponential decay highly depends on the assumptions made on $F$. If is -strongly convex and it was first proved that the decay was , see for example Siegel [35] for a simple proof. Aujol et al. [7] extend this former result, giving a better rate for functions that are quasi-strongly convex and have a unique minimizer, and a weaker convergence rate if $F$ satisfies only a quadratic growth condition and has a unique minimizer, see Table 3 for more details. All these results ensure fast exponential decays of $F(x(t)) - F^*$ and assume the convexity of $F$, a quadratic growth condition and uniqueness of the minimizer of $F$.
In [13], Bégout et al. provide several results on the trajectory , solution of (HBF) if is a function satisfying some Łojasiewicz hypothesis but is not necessarily convex. The authors prove that under such a hypothesis the trajectory strongly converges to a minimizer of , and provide several decay rates. If satisfies a Łojasiewicz hypothesis with an exponent , the decay rate is polynomial, see [13, Corollary 5.1]. Indeed, these polynomial bounds are similar to those achieved by the gradient flow under similar hypotheses. If satisfies a Łojasiewicz hypothesis with an exponent , the rate is actually exponential. More recent works by Polyak et al. [31] and Apidopoulos et al. [2] also provide exponential decay under a Łojasiewicz hypothesis on depending on and , see Table 3.
Reference | Assumption on | Exponential rate of |
---|---|---|
Polyak [32] |
, and convexity |
|
Siegel [35] |
and convexity |
|
Aujol et al. [8] |
and convexity |
|
Uniqueness of the minimizer |
||
Polyak, Shcherbakov [31] |
, Łojasiewicz with and constant , -Lipschitz gradient |
|
with |
||
Apidopoulos et al. [2] |
Łojasiewicz with and constant , and -Lipschitz gradient |
|
with |
The goal of the next part is to show that under convexity and quadratic growth conditions, a faster exponential rate, independent of , can be achieved for the solution of the Heavy Ball dynamical system (HBF) without assuming the uniqueness of the minimizer.
4.2 Fast exponential decay under quadratic growth conditions
We consider the Heavy Ball Friction (HBF) system defined as follows:
(HBF) |
where , and is a convex differentiable function satisfying some quadratic growth condition. Generalizing recent works [35, 8, 7] and making assumptions about the regularity of the boundary of the set of minimizers, we prove that the trajectories of (HBF) can achieve a fast exponential decay:
Theorem 3.
Let $F$ be a convex differentiable function having a non-empty set of minimizers $X^*$. Suppose that $X^*$ has a $C^2$ boundary or that it is a polyhedral set. Let be a solution of (HBF) for some and . If satisfies the assumption for some and , then
(32) |
where . Moreover,
(33) |
We give elements of proof in the following section and a demonstration of the second claim is given in Section A.1.
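As an informal numerical illustration of the setting of Theorem 3, the following sketch integrates a Heavy Ball system of the form x'' + alpha x' + grad F(x) = 0 on a toy function whose set of minimizers is a box, hence polyhedral and not reduced to one point. Everything here is an assumption made for illustration and does not come from the paper: the test function (half the squared distance to the box), the growth convention F - F* >= (mu/2) d(x, X*)^2, the friction value alpha = 2 sqrt(mu) (which is not claimed to be the value prescribed by the theorem), the RK4 integrator and all function names.

```python
import numpy as np

# Hypothetical test problem (not from the paper):
# F(x) = 0.5 * dist(x, X*)^2 with X* = [-1, 1]^2, a polyhedral set of minimizers.
LOW, HIGH = -1.0, 1.0


def project(x):
    """Euclidean projection onto the box X* = [LOW, HIGH]^n."""
    return np.clip(x, LOW, HIGH)


def F(x):
    """Half the squared distance to X*; convex, differentiable, F* = 0."""
    return 0.5 * np.sum((x - project(x)) ** 2)


def grad_F(x):
    """Gradient of half the squared distance to a closed convex set."""
    return x - project(x)


def hbf_rhs(x, v, alpha):
    """First-order form of x'' + alpha * x' + grad F(x) = 0."""
    return v, -alpha * v - grad_F(x)


def integrate_hbf(x0, alpha, dt=1e-2, n_steps=2000):
    """Classical RK4 on (x, v) with zero initial velocity.

    Returns the final position and the values F(x(t_k)) along the path.
    """
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    values = [F(x)]
    for _ in range(n_steps):
        k1x, k1v = hbf_rhs(x, v, alpha)
        k2x, k2v = hbf_rhs(x + 0.5 * dt * k1x, v + 0.5 * dt * k1v, alpha)
        k3x, k3v = hbf_rhs(x + 0.5 * dt * k2x, v + 0.5 * dt * k2v, alpha)
        k4x, k4v = hbf_rhs(x + dt * k3x, v + dt * k3v, alpha)
        x = x + dt / 6.0 * (k1x + 2 * k2x + 2 * k3x + k4x)
        v = v + dt / 6.0 * (k1v + 2 * k2v + 2 * k3v + k4v)
        values.append(F(x))
    return x, np.array(values)


mu = 1.0                    # growth constant of this particular F (assumed convention)
alpha = 2.0 * np.sqrt(mu)   # illustrative friction, NOT the theorem's tuned value
x_final, values = integrate_hbf([4.0, 0.5], alpha)
print("final point:", x_final)
print("F at t=0 and at the final time:", values[0], values[-1])
```

In this toy run the second coordinate never moves because the gradient vanishes in that direction, so the limit point depends on the initial condition. This is the situation without a unique minimizer that the theorem is designed to cover.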
Proposition 1.
Let be a convex differentiable function having a non empty set of minimizers . Suppose that has a bound or that it is a polyhedral set. Assume that is a -quasi strongly convex function, i.e. there exists such that:
where denotes the projection of onto . Let be a solution of (HBF) for some and . Then if :
(34) |
where .
Remark 1.
Theorem 3 and Proposition 1 are based on the assumption that the set of minimizers has a bound or is a polyhedral set. More generally, the corresponding statements hold if the set is second order regular by the definition of Bonnans et al. [14], which is a weaker assumption. Given the technical nature of this hypothesis, the results are given for special cases. We refer the careful reader to the above reference for more details.
Remark 2.
4.2.1 Comparisons and comments
The first study of (HBF) was proposed by Polyak [32]. In this seminal work, Polyak considers -strongly convex functions and proves that for such functions the solution of (HBF) satisfies for a suitable choice of the friction parameter . If the function is and -strongly convex, the convergence rate is weaker, see for example [35, 7]. If is , satisfies a quadratic growth condition and has a unique minimizer, which is a weaker assumption than strong convexity, Aujol et al. [7] proved that the solution of (HBF) satisfies for another choice of the friction parameter , which is slightly slower than the rate achieved by Polyak. All the above works use the fact that the function to minimize has a unique minimizer. Indeed, if is , convex, satisfies a quadratic growth condition and has several minimizers, there were no results ensuring that the solution of (HBF) satisfies for any . As far as we know, Theorem 3 is the first result ensuring such a decay rate on this set of convex functions. This fast decay allows us to prove Theorem 2, which ensures a fast decay of an inertial algorithm on the same set of convex functions.
Several other articles provide interesting results on the decay rate of the solution of the Heavy Ball ODE (HBF). In [31, 2, 12], the authors provide a general analysis based on Łojasiewicz properties. In these three articles, some results on the trajectory or the error are given. It is not simple to perform a fair comparison between these results and Theorem 3 because our analysis relies on convexity, whereas the global analysis of these works does not use this assumption. Nevertheless, [12] and [2] provide some decay bounds when the convexity assumption is added. More precisely, in [12], Corollary 5.5 ensures that if is convex, and satisfies a quadratic growth condition with parameter then the trajectory converges to a minimizer of , the length of the trajectory is finite and . Indeed, for such functions and the Theorem 3 ensures a better decay rate of the trajectory to the set of minimizers.
The work of Apidopoulos et al. [2] deepens that of Polyak et al. [31], providing explicit decay of under similar hypotheses, i.e. Łojasiewicz properties, assumptions and a uniform bound on the Hessian of in the neighborhood of the set of minimizers. That is why we compare our results to those in [2], but the same conclusions hold with [31]. The bounds provided by the authors depend on a uniform bound on the Hessian of , which is not the case for Theorem 3, whose bound is better than Theorem 2 of [2] when . It turns out that the analysis of Apidopoulos et al. has been developed in a non convex setting and, in this setting, the use of this bound on the Hessian seems to be the only known way to get bounds on . Convexity seems to be the price to pay to get bounds independent of this Lipschitz constant .
This analysis of the Heavy Ball dynamical system provides a guideline for the analysis of the Heavy Ball algorithm.
Remark 3.
Even if the convexity of seems to be a key hypothesis to reach such a decay rate, the careful reader may notice that Theorem 3 actually holds for star convex functions, i.e. functions satisfying:
where denotes the set of minimizers of .
4.2.2 Elements of proof
The results obtained in this paper rely on a Lyapunov approach. Let us recall that when has a unique minimizer, i.e. , then a classical Lyapunov choice for (HBF) is:
(35) |
for some well-chosen parameters and . Our approach to extend that type of analysis without the uniqueness assumption is to adapt the Lyapunov energy to our relaxed setting. Let have a non empty set of minimizers which is not reduced to one point. Let
(36) |
where for all , denotes the projection of onto , i.e.
This slight modification leads to a question when attempting to differentiate the Lyapunov energy: is differentiable?
The smoothness of is related to the smoothness of . In fact, if is directionally differentiable then is right-differentiable (and left-differentiable) and its right-hand derivative is equal to .
In [14, Theorem 7.2], Bonnans et al. prove that if a closed convex set is second order regular at for some , then is directionally differentiable at . We refer the reader to [14, 34] to have a complete understanding of the notion of second order regularity. Note that this geometry assumption is satisfied by sets having a bound [23] and polyhedral sets [34].
Throughout the rest of this section we assume that the set of minimizers is second order regular. Consequently, is right-differentiable as well as . For the sake of simplicity, let and denote the corresponding right-hand derivatives. We can write that:
(37) |
where
Observe that is exactly equal to respectively if has a unique minimizer . The objective is then to control the additional terms and . We introduce Figure 2 to give an intuition of the behaviour of these terms.
We can first prove that is positive by using the expression and the property of the projection onto a convex set. Indeed, as is a closed convex set, for any and :
Thus, for any we have:
By considering tending towards we can deduce that .
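The projection property used in this step can be checked numerically. The sketch below is only an illustration under assumptions that are not in the paper: the closed convex set is taken to be the box [-1, 1]^3, the helper names are hypothetical, and the sample points are random. It verifies two standard facts about the projection P onto a closed convex set: the inequality <x - P(x), y - P(x)> <= 0 for every y in the set, and the identity stating that the gradient of half the squared distance to the set is x - P(x), here compared with central finite differences.

```python
import numpy as np


def project_box(x, low=-1.0, high=1.0):
    """Euclidean projection onto the box [low, high]^n (hypothetical choice of X*)."""
    return np.clip(x, low, high)


def half_sq_dist(x):
    """Half the squared distance from x to the box."""
    return 0.5 * np.sum((x - project_box(x)) ** 2)


rng = np.random.default_rng(0)
h = 1e-6                      # finite-difference step
worst_inner = -np.inf         # largest value of <x - P(x), y - P(x)> observed
worst_grad_err = 0.0          # largest gap between finite differences and x - P(x)

for _ in range(200):
    x = rng.uniform(-3.0, 3.0, size=3)   # arbitrary point
    y = rng.uniform(-1.0, 1.0, size=3)   # point of the convex set
    p = project_box(x)

    # Characterization of the projection: the inner product is never positive.
    worst_inner = max(worst_inner, float(np.dot(x - p, y - p)))

    # Central finite differences of 0.5 * dist^2 against the closed form x - P(x).
    fd = np.array([
        (half_sq_dist(x + h * e) - half_sq_dist(x - h * e)) / (2.0 * h)
        for e in np.eye(3)
    ])
    worst_grad_err = max(worst_grad_err, float(np.max(np.abs(fd - (x - p)))))

print("max <x - P(x), y - P(x)> over samples:", worst_inner)
print("max finite-difference error on the gradient:", worst_grad_err)
```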
In [14, Theorem 7.2] the authors give an expression of the directional derivative for a closed convex set being second order regular at for some . This directional derivative satisfies:
Considering the assumptions made on we can deduce that for all .
These results ensure that for any choice of parameters and , . From this point, the proofs of the convergence results stated in Theorem 3 and Proposition 1 follow the original proofs, taking instead of and applying the following lemma, whose proof is given in Section A.2.
Lemma 1.
Let be a continuous function which is right-differentiable. Assume that
(38) |
where denotes the right derivative of at . Then,
(39) |
5 Proofs of Theorems 1 and 2
The proofs of Theorems 1 and 2 are built around the following discrete Lyapunov energies
(40) |
and
(41) |
where and denotes the projection of onto for any . For the sake of clarity, we use the following notations:
(42) |
The following lemma is necessary in order to handle the terms related to non uniqueness of the minimizers. We give a proof in Section A.3.
Lemma 2.
For all , the following equalities hold:
-
1.
-
2.
Moreover, we introduce a lemma which encodes the fact that the sequence is provided by (V-FISTA). The proof is based on the descent lemma proved in [16] and it can be found in Appendix A.4.
Lemma 3.
Let be the sequence provided by (V-FISTA) with . Then, for any ,
We would like to point out that several bounds are deduced from the properties of the projection onto a convex set. Indeed, if is a closed convex set such that , then for any and ,
where denotes the projection of onto . This property directly guarantees inequalities such as
or
5.1 Proof of Theorem 1
Let be the sequence provided by (V-FISTA) for some to be defined. We define the following discrete Lyapunov energy:
(43) |
where and denotes the projection of on for any . By setting and considering that has a unique minimizer, we recover the energy considered by Beck in [10].
The aim of this proof is to find as large as possible such that for a well-chosen set of parameters,
(44) |
The proof is divided into three parts. We first use the lemmas introduced at the beginning of Section 5 and the properties of the projection onto a convex set to handle the terms related to the non uniqueness of the minimizers. Then, we give a set of parameters which leads to the desired inequality (44) by using the geometry assumption satisfied by . The convergence of the trajectories is obtained in the last section using the previous results and elementary computations.
5.1.1 Preliminary work
We recall that we use the notations defined in (42). By rewriting (43) and using the second claim of Lemma 2 we have:
(45) |
Lemma 3 ensures that if :
(46) | ||||
This inequality combined with (45) ensures that:
(47) |
where:
and is defined by
Due to the properties of the projection onto a convex set we have that for all :
and since , we can conclude that and consequently that:
(48) |
5.1.2 Getting the convergence rate
Recall that we want to find such that: . We choose the following set of parameters:
Then we get that:
(49) | ||||
where for this parameter choice we have:
Under the condition , we have that and hence,
(50) |
Moreover, as the condition ensures that , we can apply the following lemma, which is proved in Section A.5.
Lemma 4.
Let be the sequence provided by (V-FISTA) and . Then for all :
(51) |
Lemma 4 guarantees that if :
(52) |
Moreover, as satisfies we can write that and consequently:
(53) | ||||
Noticing that we can conclude that:
(54) |
Hence: As we consider that , we have that . As a consequence, the geometry condition ensures that and
(55) |
5.1.3 Convergence of the trajectories
Let . Using the inequality , we get that:
(56) |
Thus, using the definition of and the geometry of :
(57) |
Then, by applying inequality (54), we deduce that:
(58) |
and consequently,
(59) |
Hence,
(60) |
5.2 Proof of Theorem 2
5.2.1 Structure of the proof
The proof of Theorem 2 is built around the approach provided in [8] in order to prove convergence rates of the trajectories of the Heavy Ball system described by:
(HBF) |
for some . Indeed, the sequence generated by (V-FISTA) can be seen as a discretization of (HBF) (when is differentiable) and the strategy can be adapted to the discrete setting. Note that the parameter in (HBF) does not play the same role as in (V-FISTA). Indeed, we have that behaves as .
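To make the discrete scheme concrete, here is a minimal sketch of an inertial forward-backward iteration with constant inertia, in the form given by Beck [10] for V-FISTA: an extrapolation step followed by a proximal gradient step. Several choices are assumptions made for illustration only: the function and parameter names, the test problem (a quadratic whose Hessian has a two-dimensional kernel, so that the minimizers form an affine subspace and the function satisfies a quadratic growth condition without being strongly convex), the trivial proximal operator corresponding to a zero nonsmooth term, and the inertia value, which is the classical strongly convex choice computed from the smallest nonzero eigenvalue and not the parameter prescribed by Theorems 1 and 2.

```python
import numpy as np


def v_fista(grad_f, prox_h, x0, step, inertia, n_iter):
    """Constant-inertia forward-backward scheme.

    y_k     = x_k + inertia * (x_k - x_{k-1})
    x_{k+1} = prox_{step * h}(y_k - step * grad_f(y_k))
    """
    x_prev = np.array(x0, dtype=float)
    x = x_prev.copy()
    for _ in range(n_iter):
        y = x + inertia * (x - x_prev)
        x_prev, x = x, prox_h(y - step * grad_f(y), step)
    return x


# Hypothetical test problem: f(x) = 0.5 * (x - x_star)^T H (x - x_star),
# where H is symmetric positive semidefinite with a two-dimensional kernel.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
H = Q @ np.diag([100.0, 25.0, 1.0, 0.0, 0.0]) @ Q.T
x_star = rng.standard_normal(5)


def f(x):
    d = x - x_star
    return 0.5 * d @ H @ d


def grad_f(x):
    return H @ (x - x_star)


def prox_h(z, step):
    """Proximal operator of h = 0 (identity); replace it for a composite problem."""
    return z


L = 100.0    # Lipschitz constant of grad_f (largest eigenvalue of H)
mu = 1.0     # smallest nonzero eigenvalue, used as growth parameter (assumption)
step = 1.0 / L
# Classical strongly convex inertia, used here purely as an illustration.
inertia = (1.0 - np.sqrt(mu * step)) / (1.0 + np.sqrt(mu * step))

x_out = v_fista(grad_f, prox_h, np.zeros(5), step, inertia, n_iter=500)
print("F(x_out) - F* =", f(x_out))
```

On this example the gradient has no component in the kernel of H, so the iterates only move in the directions where the function grows, and the sequence converges to one particular minimizer rather than to a prescribed point.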
This proof relies on the analysis of the following Lyapunov energy:
(61) |
The strategy of the proof is straightforward: we aim to find a set of parameters such that for any ,
(62) |
In this way, simple calculations show that it ensures
(63) |
which leads us to the conclusion.
5.2.2 Proof
Let be the sequence provided by (V-FISTA). Following the notations introduced in (42), we can rewrite:
(64) |
Following the first claim of Lemma 2,
(65) | ||||
Observe that due to the properties of the projection onto a convex set, for any :
(66) |
By exploiting the expression (65), we can show the following lemma. The proof can be found in Section A.6.
Lemma 5.
For any , we have that:
(67) | ||||
This inequality combined with (66) guarantees that for any ,
(68) | ||||
We make the following choice of parameters:
(69) |
This set of parameters ensures that the following inequality is valid:
(70) | ||||
Note that we consider a parameter which implies that . Suppose in addition that . Then, we get that for any ,
(71) |
As satisfies the assumption , we can write that and hence:
(72) |
Thus, if
(73) |
then for any ,
(74) |
Note that the solutions of (73) automatically satisfy . Elementary computations show that this implies
(75) |
Note that since ,
(76) |
and using the assumption ,
(77) |
Moreover, for any , . Thus, if (73) is satisfied, then for any :
(78) |
In addition, by applying the inequality , we get that:
(79) |
where . The assumption gives that
(80) |
and given the definition of we obtain,
(81) |
By combining the above inequality with (75), we can finally prove that if (73) is valid, then
(82) |
5.3 Proof of Corollary 2
Let be the sequence given by (V-FISTA) for some and . According to Theorem 2, if for some , then (19) and (20) are valid for any satisfying:
(83) |
Corollary 2 relies on the following lemma.
Lemma 6.
Let
If , then for any , .
Proof. Given the expression of , simple computations give that for any and ,
(84) |
We define the function as follows
such that . We get that:
(85) |
Consequently, if and , we have that and
Since which is strictly negative if , we can deduce that for any . As , this lemma is proved.
In fact, is a lower bound of the highest value of satisfying (22) for some and as illustrated in Figure 3.
Appendix A Technical proofs
A.1 Proof of Theorem 3
Our analysis follows that introduced in [8]. We set and we consider the following Lyapunov energy:
(86) |
with and .
Following the discussion of Section 4.2.2, the assumptions of Theorem 3 ensure that is right-differentiable and for all ,
(87) |
where denotes the right derivative of . By using the convexity of and replacing the parameters by their value,
(88) |
Let us define . The above inequality guarantees that for all :
(89) |
As satisfies and for all , we obtain that for all :
(90) |
We refer the reader to the proof of [8, Theorem 1] for further developments on each of the above steps and a discussion on the value of the parameters. Lemma 1 then guarantees that, for all ,
(91) |
Since satisfies , elementary computations show that:
(92) |
and consequently,
(93) |
This inequality implies that for all :
(94) |
and
(95) |
The first statement of Theorem 3 can be demonstrated by combining (91) and (94) and rewriting (see [8, Section 6.1] for further details). We prove the second result as follows.
Using inequality , we get that:
(96) | ||||
By applying the previous inequalities, we have that:
(97) |
The bound on the energy given in (91) leads us to the conclusion:
(98) |
A.2 Proof of Lemma 1
Let denote the derivative of when it is well defined. According to [38], the function is differentiable except at a countable set of points. This implies that there exists and such that for any and , is well defined and equal to . We suppose that the sequence is ordered such that for any and that when .
Suppose that .
-
•
If is differentiable at , then is differentiable on the interval and in this interval. Consequently inequality (38) ensures that,
-
•
If is not differentiable at , then inequality (38) guarantees that for sufficiently small,
Then, the previous discussion allows us to say that is differentiable on . As a consequence, we can say that there exists such that for any :
As this inequality is valid for any , we finally get the desired inequality (39).
We now suppose that . We just proved that (39) is true for all . Therefore, for all ,
and as is continuous we get the same inequality at .
By using the same arguments, we can prove that (39) is valid for any . Indeed, if , then it means that or that for some . In both cases, we get the desired inequality by applying the above reasoning to the consecutive intervals for .
A.3 Proof of Lemma 2
Let . By rewriting
we get that:
Noticing that leads to:
The second claim is proved using the same approach. We rewrite
and consequently:
By applying the same rewriting of , simple calculations give that:
A.4 Proof of Lemma 3
The first claim is straightforward as Lemma 3.1 of [16] ensures that:
By writing and , we can conclude.
A.5 Proof of Lemma 4
Let be the sequence provided by (V-FISTA) and . We can write the Lyapunov energy in the following way:
Since , we can write that:
which leads to the final result.
A.6 Proof of Lemma 5
Acknowledgements
JFA acknowledges support of the EU Horizon 2020 research and innovation program under the Marie Skłodowska-Curie NoMADS grant agreement No777826, and PEPR PDE-AI. HL acknowledges the financial support of the Ministry of Education, University and Research (grant ML4IP R205T7J2KP). This work was supported by the ANR MICROBLIND (grant ANR-21-CE48-0008) and the ANR Masdol (grant ANR-PRC-CE23).
References
- [1] T. Alamo, P. Krupa, and D. Limon. Restart of accelerated first-order methods with linear convergence under a quadratic functional growth condition. IEEE Transactions on Automatic Control, 68(1):612–619, 2022.
- [2] V. Apidopoulos, N. Ginatta, and S. Villa. Convergence rates for the Heavy-Ball continuous dynamics for non-convex optimization, under Polyak–Łojasiewicz condition. Journal of Global Optimization, pages 1–27, 2022.
- [3] H. Attouch, X. Goudou, and P. Redont. The Heavy-Ball with friction method, I. the continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative dynamical system. Communications in Contemporary Mathematics, 2(01):1–34, 2000.
- [4] H. Attouch and J. Peypouquet. The rate of convergence of Nesterov’s accelerated forward-backward method is actually faster than 1/k². SIAM Journal on Optimization, 26(3):1824–1834, 2016.
- [5] J.-F. Aujol, L. Calatroni, C. Dossal, H. Labarrière, and A. Rondepierre. Parameter-free FISTA by adaptive restart and backtracking. arXiv preprint arXiv:2307.14323, 2023.
- [6] J.-F. Aujol, C. Dossal, H. Labarrière, and A. Rondepierre. FISTA restart using an automatic estimation of the growth parameter. Hal Preprint 03153525, May 2022.
- [7] J.-F. Aujol, C. Dossal, and A. Rondepierre. Convergence rates of the Heavy Ball method for quasi-strongly convex optimization. SIAM Journal on Optimization, 32(3):1817–1842, 2022.
- [8] J.-F. Aujol, C. Dossal, and A. Rondepierre. Convergence rates of the Heavy-Ball method under the Łojasiewicz property. Mathematical Programming, pages 1–60, 2022.
- [9] J.-F. Aujol, C. Dossal, and A. Rondepierre. FISTA is an automatic geometrically optimized algorithm for strongly convex functions. Mathematical Programming, 204(1-2), 2024.
- [10] A. Beck. First-order methods in optimization. SIAM, 2017.
- [11] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
- [12] P. Bégout, J. Bolte, and M. A. Jendoubi. On damped second-order gradient systems. Journal of Differential Equations, 259(7):3115–3143, 2015.
- [13] J. Bolte, A. Daniilidis, A. Lewis, and M. Shiota. Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18(2):556–572, 2007.
- [14] J. F. Bonnans, R. Cominetti, and A. Shapiro. Sensitivity analysis of optimization problems under second order regular constraints. Mathematics of Operations Research, 23(4):806–831, 1998.
- [15] S. Bubeck, Y. T. Lee, and M. Singh. A geometric alternative to Nesterov’s accelerated gradient descent. arXiv preprint arXiv:1506.08187, 2015.
- [16] A. Chambolle and C. Dossal. On the convergence of the iterates of the “fast iterative shrinkage/thresholding algorithm”. Journal of Optimization theory and Applications, 166(3):968–982, 2015.
- [17] S. Chen, S. Ma, and W. Liu. Geometric descent method for convex composite minimization. Advances in Neural Information Processing Systems, 30, 2017.
- [18] Y. Drori and M. Teboulle. Performance of first-order methods for smooth convex minimization: a novel approach. Mathematical Programming, 145(1):451–482, June 2014.
- [19] O. Fercoq and Z. Qu. Adaptive restart of accelerated gradient methods under local quadratic growth condition. IMA Journal of Numerical Analysis, 39(4):2069–2095, 2019.
- [20] G. Garrigos, L. Rosasco, and S. Villa. Convergence of the forward-backward algorithm: beyond the worst-case with the help of geometry. Mathematical Programming, pages 1–60, 2022.
- [21] E. Ghadimi, H. R. Feyzmahdavian, and M. Johansson. Global convergence of the Heavy-Ball method for convex optimization. In 2015 European control conference (ECC), pages 310–315. IEEE, 2015.
- [22] P. Giselsson and S. Boyd. Monotonicity and restart in fast gradient methods. In 53rd IEEE Conference on Decision and Control, pages 5058–5063. IEEE, 2014.
- [23] J.-B. Hiriart-Urruty. At what points is the projection mapping differentiable? The American Mathematical Monthly, 89(7):456–458, 1982.
- [24] S. Łojasiewicz. Une propriété topologique des sous-ensembles analytiques réels. In Les Équations aux Dérivées Partielles (Paris, 1962), pages 87–89. Éditions du Centre National de la Recherche Scientifique, Paris, 1963.
- [25] S. Łojasiewicz. Sur la géométrie semi- et sous-analytique. Annales de l’Institut Fourier. Université de Grenoble, 43(5):1575–1595, 1993.
- [26] I. Necoara, Y. Nesterov, and F. Glineur. Linear convergence of first order methods for non-strongly convex optimization. Mathematical Programming, 175(1):69–107, 2019.
- [27] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). In Sov. Math. Dokl, volume 27, 1983.
- [28] Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003.
- [29] Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical programming, 140(1):125–161, 2013.
- [30] B. O’Donoghue and E. Candes. Adaptive restart for accelerated gradient schemes. Foundations of computational mathematics, 15(3):715–732, 2015.
- [31] B. Polyak and P. Shcherbakov. Lyapunov functions: An optimization theory perspective. IFAC-PapersOnLine, 50(1):7456–7461, 2017. 20th IFAC World Congress.
- [32] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR computational mathematics and mathematical physics, 4(5):1–17, 1964.
- [33] J. Renegar and B. Grimmer. A simple nearly optimal restart scheme for speeding up first-order methods. Foundations of computational mathematics, 22(1):211–256, 2022.
- [34] A. Shapiro. Differentiability properties of metric projections onto convex sets. Journal of Optimization Theory and Applications, 169(3):953–964, 2016.
- [35] J. W. Siegel. Accelerated first-order methods: Differential equations and Lyapunov functions. arXiv preprint arXiv:1903.05671, 2019.
- [36] A. Taylor and Y. Drori. An optimal gradient method for smooth strongly convex minimization. Mathematical Programming, pages 1–38, 2022.
- [37] B. Van Scoy, R. A. Freeman, and K. M. Lynch. The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Systems Letters, 2(1):49–54, 2017.
- [38] G. C. Young. A note on derivates and differential coefficients. Acta mathematica, 37(1):141–154, 1914.