Heavy Ball Momentum for Non-Strongly Convex Optimization

J.-F. Aujol¹, C. Dossal², H. Labarrière³, A. Rondepierre²,⁴

¹ Univ. Bordeaux, Bordeaux INP, CNRS, IMB, UMR 5251, F-33400 Talence, France
² IMT, Univ. Toulouse, INSA Toulouse, Toulouse, France
³ MaLGa, DIBRIS, Università di Genova, Genoa, Italy
⁴ LAAS, Univ. Toulouse, CNRS, Toulouse, France

(March 11, 2024)

Abstract

When considering the minimization of a quadratic or strongly convex function, it is well known that first-order methods involving an inertial term weighted by a constant-in-time parameter are particularly efficient (see Polyak [32], Nesterov [28], and references therein). By setting the inertial parameter according to the condition number of the objective function, these methods guarantee a fast exponential decay of the error. We prove that schemes of this type (later called Heavy Ball schemes) are relevant in a relaxed setting, i.e. for composite functions satisfying a quadratic growth condition. In particular, we adapt V-FISTA, introduced by Beck in [10] for strongly convex functions, to this broader class of functions. To the authors’ knowledge, the resulting worst-case convergence rates are faster than any other in the literature, including those of FISTA restart schemes. No assumption on the set of minimizers is required and guarantees are also given in the non-optimal case, i.e. when the condition number is not exactly known. This analysis follows the study of the corresponding continuous-time dynamical system (Heavy Ball with friction system), for which new convergence results of the trajectory are shown.

1 Introduction

In many image processing or statistical problems, the optimization of a convex function $F$ from $\mathbb{R}^{N}$ to $\mathbb{R}\cup\{+\infty\}$ with a non empty set of minimizers may be needed. In this context, when $N$ is large (i.e. for large scale problems), second order algorithms cannot be used and only gradient or sub-gradient of $F$ can be computed to get a minimizing sequence $(x_{n})_{n\in\mathbb{N}}$ .

If $F$ is convex, differentiable and has a $L$ -Lipschitz gradient, the explicit gradient descent algorithm (GD) with step $s=\frac{1}{L}$ defined by

x_{n+1}=x_{n}-s\nabla F(x_{n})

is a simple first order algorithm that provides a sequence converging to a minimizer $x^{*}$ of $F$ . This method is actually slow on this class of convex functions since its asymptotic convergence rate is

F(x_{n})-F(x^{*})=\mathcal{O}\left(n^{-1}\right).

This decay rate can be improved when considering $\mu-$ strongly convex functions, since the worst-case guarantee is then

F(x_{n})-F(x^{*})=\mathcal{O}\left(e^{-\frac{\mu}{L}n}\right).

(1)

This asymptotic decay is faster than the one obtained for convex functions but when $\kappa:=\frac{\mu}{L}\ll 1$ , this decay can still be slow in practice. As $\kappa$ is the inverse of the condition number of $F$ , GD is particularly slow for large scale problems.
Two remarks can be made about these decays. First, if $F$ is not differentiable but composite, GD can be replaced by the Forward-Backward algorithm and the two decays above are still valid. We provide an exact definition of composite and Forward-Backward algorithm in Section 2. Second, the above exponential decay of the error is given under a strong convexity assumption but can be extended under weaker hypotheses such as a quadratic growth condition.
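The decay (1) can be observed on a toy problem. The sketch below is only an illustration under assumptions that are not in the text: the objective is a diagonal quadratic whose eigenvalues play the role of $\mu$ and $L$, and the helper name `gradient_descent_gaps` is hypothetical.

```python
import math

def gradient_descent_gaps(eigenvalues, x0, n_iter):
    """Run GD with step 1/L on F(x) = 0.5 * sum(lam_i * x_i**2); return F(x_n) - F* for n = 0..n_iter."""
    L = max(eigenvalues)
    step = 1.0 / L
    x = list(x0)
    gaps = []
    for _ in range(n_iter + 1):
        # F* = 0 is attained at x = 0, so the gap is F(x) itself.
        gaps.append(0.5 * sum(lam * xi * xi for lam, xi in zip(eigenvalues, x)))
        # Gradient step: the i-th gradient component is lam_i * x_i.
        x = [xi - step * lam * xi for lam, xi in zip(eigenvalues, x)]
    return gaps

eigenvalues = [1.0, 0.01]           # assumed toy spectrum: mu = 0.01, L = 1
kappa = min(eigenvalues) / max(eigenvalues)
gaps = gradient_descent_gaps(eigenvalues, x0=[1.0, 1.0], n_iter=200)
for n in (0, 50, 100, 200):
    # Observed gap next to the envelope exp(-kappa * n) * (F(x_0) - F*).
    print(n, gaps[n], math.exp(-kappa * n) * gaps[0])
```

On this example the slow coordinate (eigenvalue $\mu$) dictates the decay, which is why a small $\kappa$ makes GD slow.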

In 1964 Polyak introduces the Heavy Ball (HB) scheme inspired by mechanics, which improves the decay of gradient descent on the class of $C^{2}$ strongly convex functions by incorporating inertia. This scheme generates a sequence of iterates $(x_{n})_{n\in\mathbb{N}}$ ensuring that:

F(x_{n})-F(x^{*})=\mathcal{O}\left(e^{-4\sqrt{\kappa}n}\right).

(2)

If $\kappa\ll 1$ , this convergence rate is significantly faster than (1) guaranteed by the Forward-Backward algorithm. This theoretical improvement reflects a better performance in practice. At the core of Polyak’s analysis is the fact that in the neighborhood of its unique minimizer, $F$ behaves like a quadratic function. But the $C^{2}$ assumption is crucial in Polyak’s analysis, and examples of simple $C^{1}$ strongly convex functions $F$ for which (HB) produces diverging sequences can be found in [21].

In 1983 Nesterov [27] proposes an inertial scheme built to speed up the convergence of GD on the class of convex functions. This acceleration process is at the core of FISTA introduced by Beck and Teboulle [11], which applies to composite functions and provides a sequence such that

F(x_{n})-F(x^{*})=\mathcal{O}\left(n^{-2}\right).

The details of this algorithm and its convergence rates are given in Section 2. The main difference between the Heavy Ball algorithm and the Nesterov scheme is the inertia parameter: it is constant over iterations and depends on $\kappa:=\frac{\mu}{L}$ for Heavy Ball, while it depends on the iteration number and tends to 1 when $n$ goes to $+\infty$ for FISTA.

Many variations of these schemes have been proposed during the last decade (see Table 1 for various examples), and the behavior, rates and stability of these various schemes are now well understood. A common approach is to study an associated dynamical system via a Lyapunov analysis before deriving convergence results on the scheme, see e.g. [9] and the references therein.

Several Heavy Ball schemes have been proposed to provide fast decays of the type (2) under weaker hypotheses than $C^{2}$ and strong convexity [28, 10, 35, 37, 36]. But for all these schemes, a fast geometrical decay such as (2) is achieved only on classes of functions having a unique minimizer: no known inertial scheme achieves such rates on the class of convex functions satisfying a simple quadratic growth condition,

\exists\mu>0,\ \forall x\in\mathbb{R}^{N},\,\forall x^{*}\in X^{*},\quad F(x)-F(x^{*})\geqslant\frac{\mu}{2}d(x,X^{*})^{2},

(3)

or equivalently in the convex setting a Łojasiewicz property with parameter $\theta=\frac{1}{2}$ , without introducing an additional uniqueness hypothesis. In other words, no known inertial scheme provides better asymptotic bounds than (GD) within the class of convex functions satisfying a quadratic growth condition. Thus, it remains unclear whether inertia holds any real significance for this class of functions.

The main contribution of this work is to provide Heavy Ball schemes, similar to Beck’s V-FISTA, ensuring rates of $O(e^{-c\sqrt{\kappa}n})$ on the class of convex functions satisfying some quadratic growth condition, where the value of $c$ will be specified later. The inertia parameter depends on the knowledge of $L$ and $\mu$ , but this new scheme guarantees exponential decay even if $\kappa=\frac{\mu}{L}$ is overestimated. We prove that an overestimation of $\kappa$ only results in suboptimal exponential decay. Theorem 1 provides a straightforward Lyapunov analysis and a fast convergence rate for a given friction parameter, while Theorem 2 yields rates that can be achieved even if the friction parameter is not set in an optimal way, demonstrating that fast exponential decay is robust to a mild overestimation of the quadratic growth parameter $\mu$ .

The paper is organized as follows: in Section 2 we introduce the main geometric assumption made on the function to minimize, namely the quadratic growth condition, and propose a review of the literature on the convergence rates of inertial algorithms under this condition. Section 3 is devoted to our two main theorems and a corollary proving that Heavy Ball like methods can be properly parameterized to achieve fast exponential decay for this class of functions. Section 4 presents the continuous counterpart of the discrete analysis proposed in Section 3, providing a guide to construct the proofs of the theorems presented in Section 3, as well as new results on the convergence rate of the trajectories of the Heavy Ball dynamical system. All the proofs have been gathered in Section 5, and the more technical ones are detailed in the Appendix.

2 Geometry of convex functions and inertial algorithms: definitions and state of the art.

Let us first recall some basic notations and definitions. We assume that $\mathbb{R}^{N}$ is equipped with the Euclidean scalar product $\langle\cdot,\cdot\rangle$ and the associated norm $\|\cdot\|$ . As usual $B(x^{*},r)$ denotes the open Euclidean ball with center $x^{*}\in\mathbb{R}^{N}$ and radius $r>0$ . For any real subset $X\subset\mathbb{R}^{N}$ , the Euclidean distance $d$ is defined as:

\forall x\in\mathbb{R}^{N},\ d(x,X)=\inf_{y\in X}\|x-y\|.

2.1 Framework and notations

In this paper we focus on the class of composite functions: $F=f+h$ where $f$ is a convex, differentiable function having a $L$ -Lipschitz gradient and $h$ is a proper lower semicontinuous (l.s.c.) convex function whose proximal operator is known. The proximal operator of $h$ is denoted by $\text{prox}_{h}$ and defined by:

\text{prox}_{h}(x)=\textup{argmin}\,_{y\in\mathbb{R}^{N}}{\left(h(y)+\frac{1}{2}\|y-x\|^{2}\right)}.

(4)

For this class of functions a classical minimization algorithm is the Forward-Backward algorithm (FB) whose iterations are described by:

x_{0}\in\mathbb{R}^{N},\quad x_{n+1}=\text{prox}_{sh}(x_{n}-s\nabla f(x_{n})),\ s\in\left(0,\frac{2}{L}\right).

(5)
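As an illustration of iteration (5), here is a minimal sketch on a separable LASSO-type problem. The problem data, the diagonal form of $f$ and the function names are assumptions made for this example only; for $h=\lambda\|\cdot\|_{1}$ the proximal operator of $sh$ is the componentwise soft-thresholding with threshold $s\lambda$.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |.| applied to a scalar v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def forward_backward(a, b, lam, x0, n_iter):
    """FB iterations (5) for f(x) = 0.5 * sum(a_i * (x_i - b_i)**2), h = lam * ||x||_1."""
    L = max(a)              # Lipschitz constant of grad f for this diagonal f
    s = 1.0 / L             # any step in (0, 2/L) is admissible in (5)
    x = list(x0)
    for _ in range(n_iter):
        # Forward (gradient) step on f, then backward (proximal) step on s*h.
        x = [soft_threshold(xi - s * ai * (xi - bi), s * lam)
             for xi, ai, bi in zip(x, a, b)]
    return x

a, b, lam = [1.0, 0.1], [2.0, -0.5], 0.1      # hypothetical problem data
x = forward_backward(a, b, lam, x0=[5.0, 5.0], n_iter=500)
# Closed-form minimizer of this separable problem, for comparison.
x_star = [soft_threshold(bi, lam / ai) for ai, bi in zip(a, b)]
print(x, x_star)
```

The separable structure is chosen only so that the minimizer is available in closed form; the same two-step update applies to any $f$ with a computable gradient.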

Without further assumptions on $F$ , the convergence of the FB algorithm, i.e. the decay of $F(x_{n})-F^{*}$ along the iterates, may be slow. In this paper we are interested in inertial methods, which are among the most effective first order optimization methods, and may ensure better convergence rates, especially when $F$ is additionally strongly convex:

Definition 1 (Strong convexity $\mathcal{S}_{\mu}$ ).

Let $F:\mathbb{R}^{N}\rightarrow\mathbb{R}\cup\{+\infty\}$ be a proper lower semicontinuous convex function. The function $F$ is said to be $\mu$ -strongly convex for some real constant $\mu>0$ if the function $x\mapsto F(x)-\frac{\mu}{2}\|x\|^{2}$ is convex.

Weakening this assumption, we consider the class of convex composite functions satisfying some quadratic growth condition:

Definition 2 (Quadratic growth condition $\mathcal{G}^{2}_{\mu}$ ).

Let $F:\mathbb{R}^{N}\rightarrow\mathbb{R}\cup\{+\infty\}$ be a proper lower semicontinuous convex function such that: $X^{*}=\textup{argmin}\,F\neq\emptyset$ and $F^{*}=\min F$ . The function $F$ satisfies a quadratic growth condition $\mathcal{G}_{\mu}^{2}$ for some real constant $\mu>0$ if:

\forall x\in\mathbb{R}^{N},\ F(x)-F^{*}\geqslant\frac{\mu}{2}d(x,X^{*})^{2}.

(6)
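To make condition (6) concrete, the following sketch checks it numerically on a function that is not strongly convex. The example function $F(x_{1},x_{2})=\frac{1}{2}(x_{1}-x_{2})^{2}$ and the value $\mu=2$ are assumptions chosen for illustration: its minimizers form the line $x_{1}=x_{2}$, so the minimizer is not unique, and a direct computation gives $F(x)-F^{*}=d(x,X^{*})^{2}$.

```python
import math
import random

def F(x1, x2):
    # Convex, minimal value F* = 0, attained on the whole line x1 = x2.
    return 0.5 * (x1 - x2) ** 2

def dist_to_minimizers(x1, x2):
    # Euclidean distance from (x1, x2) to the line X* = {x1 = x2}.
    return abs(x1 - x2) / math.sqrt(2.0)

mu = 2.0  # growth parameter for this particular F; (6) holds with equality here
random.seed(0)
points = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(1000)]
# Small tolerance because both sides are equal up to rounding.
holds = all(F(x1, x2) >= 0.5 * mu * dist_to_minimizers(x1, x2) ** 2 - 1e-12
            for x1, x2 in points)
print(holds)
```

A sampled check of this kind is of course not a proof; it only illustrates that (6) measures growth with respect to the distance to the whole set $X^{*}$ rather than to a single point.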

Classically the quadratic growth condition $\mathcal{G}_{\mu}^{2}$ can be seen as a relaxation of the strong convexity. Note that satisfying some growth condition does not impose the uniqueness of the minimizer as it does for strong convexity. In the convex setting, the quadratic growth condition $\mathcal{G}_{\mu}^{2}$ is equivalent to a global Łojasiewicz property with an exponent $\frac{1}{2}$ [20]. The Łojasiewicz property [24, 25] is a key tool in the mathematical analysis of continuous and discrete dynamical systems, initially introduced to prove the convergence of the trajectories for the gradient flow of analytic functions. An extension to nonsmooth functions has been proposed by Bolte et al. in [13]:

Definition 3.

Let $F:\mathbb{R}^{N}\rightarrow\mathbb{R}\cup\{+\infty\}$ be a proper lower semicontinuous convex function with $X^{*}=\textup{argmin}\,F\neq\emptyset$ . The function $F$ has a Łojasiewicz property of exponent $\theta\in[0,1)$ if for any minimizer $x^{*}\in X^{*}$ , there exist real constants $c_{\ell}>0$ and $\varepsilon>0$ such that:

\forall x\in B(x^{*},\varepsilon),\ \left(F(x)-F(x^{*})\right)^{\theta}\leqslant c_{\ell}d(0,\partial F(x)).

(7)

The Łojasiewicz property is said to be global if (7) is satisfied for any $x\in\mathbb{R}^{N}$ .

A general inertial optimization method can be described as follows:

\forall n\in\mathbb{N},\ \left\{\begin{gathered}y_{n}=x_{n}+\alpha_{n}(x_{n}-x_{n-1}),\\ x_{n+1}=\text{prox}_{sh}\left(y_{n}-s\nabla f(z_{n})\right),\end{gathered}\right.

(8)

where $\alpha_{n}>0$ denotes some friction parameter and $z_{n}=x_{n}$ or $y_{n}$ depending on the considered method. Historically, in his seminal work [32], B.T. Polyak proposes a first inertial scheme by choosing a constant friction parameter $\alpha_{n}=\alpha$ and $z_{n}=x_{n}$ , for the minimization of $C^{2}$ strongly convex functions. One of the most popular inertial algorithms is FISTA (for Fast Iterative Shrinkage-Thresholding Algorithm), introduced by Beck and Teboulle in [11] to minimize convex composite functions. Inspired by Nesterov’s accelerated gradient method proposed in [27], the friction parameter $\alpha_{n}$ is defined by:

\alpha_{n}=\frac{t_{n-1}-1}{t_{n}},\qquad z_{n}=y_{n}

(9)

where the sequence $(t_{n})_{n\in\mathbb{N}}$ is recursively defined by: $t_{0}=1$ and $t_{n+1}=\frac{1+\sqrt{1+4t_{n}^{2}}}{2}$ . Chambolle and Dossal propose in [16] a variant of FISTA defining $\alpha_{n}=\frac{n-1}{n+\alpha-1}$ for any $n\in\mathbb{N}^{*}$ where $\alpha\geqslant 3$ . The original choice of Nesterov can be approximated by setting $\alpha=3$ .
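The two inertia sequences above can be compared numerically. This is a small sketch, with hypothetical function names, that evaluates $\alpha_{n}=\frac{t_{n-1}-1}{t_{n}}$ from (9) for $n\geqslant 1$ and the variant $\frac{n-1}{n+\alpha-1}$ with $\alpha=3$.

```python
import math

def fista_inertia(n_max):
    """Return [alpha_1, ..., alpha_n_max] with alpha_n = (t_{n-1} - 1) / t_n as in (9)."""
    t_prev = 1.0                      # t_0 = 1
    alphas = []
    for _ in range(n_max):
        t = (1.0 + math.sqrt(1.0 + 4.0 * t_prev ** 2)) / 2.0   # t_n from t_{n-1}
        alphas.append((t_prev - 1.0) / t)
        t_prev = t
    return alphas

def chambolle_dossal_inertia(n, alpha=3.0):
    """Inertia (n - 1) / (n + alpha - 1) of the variant from [16], for n >= 1."""
    return (n - 1.0) / (n + alpha - 1.0)

alphas = fista_inertia(1000)
for n in (1, 10, 100, 1000):
    # Both sequences start at 0 for n = 1 and increase towards 1.
    print(n, alphas[n - 1], chambolle_dossal_inertia(n))
```

Both sequences increase towards 1, in contrast with the constant inertia used by the Heavy Ball schemes considered in this paper.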

In this paper we consider the family of Heavy Ball algorithms for which the friction parameter $\alpha_{n}$ is set to a constant $\alpha>0$ and $z_{n}=y_{n}$ . The term Heavy Ball refers to a family of optimization schemes that can be interpreted as discretizations of the following second-order ordinary differential equation:

\ddot{x}(t)+\alpha\dot{x}(t)+\nabla F(x(t))=0,

(10)

which describes the move of a heavy ball in a potential field with a constant friction. The inertia coefficient $\alpha$ has to be parameterized according to the geometry of $F$ to get an optimal convergence rate. For the class of $\mu$ -strongly convex functions, Beck in [10, Chapter 10.7.7] proposes the following choice (following Nesterov’s choice [28]):

\alpha=\frac{1-\sqrt{\kappa}}{1+\sqrt{\kappa}},\qquad z_{n}=y_{n}

(11)

where $\kappa=\frac{\mu}{L}$ , leading to the algorithm V-FISTA (seen as a variant of FISTA by the author).

2.2 Convergence rate of inertial algorithms under quadratic growth assumptions

Let $F=f+h$ be a convex composite function where $f$ is a convex, differentiable function having a $L$ -Lipschitz gradient and $h$ is a proper lower semicontinuous convex function whose proximal operator is known.

When the function $F$ to minimize satisfies some additional growth assumption $\mathcal{G}^{2}_{\mu}$ , Garrigos et al. [20] prove that the Forward-Backward algorithm (5) provides better theoretical guarantees than in the general convex case. More precisely, they show that the function values achieve an exponential decay $F(x_{n})-F^{*}=\mathcal{O}\left(e^{-\frac{\mu}{L}n}\right)$ along the iterates of Forward-Backward, without any assumption on the set of minimizers $X^{*}$ . Observe that this convergence rate depends on the ratio $\kappa=\frac{\mu}{L}$ which represents the inverse of the conditioning of $F$ and can be very small in large-scale optimization. Note also that Necoara et al. proved similar results for the projected gradient algorithm in [26].

While Nesterov’s accelerations allow for speeding up gradient-based algorithms for the class of convex functions, it is less clear for the class of convex functions satisfying some quadratic growth condition. Considering FISTA, in its historical form by Beck and Teboulle (9) or its variant introduced by Chambolle and Dossal in [16], the convergence rate is still polynomial for the class of convex functions satisfying some quadratic growth condition. Note however that considering the variant of FISTA introduced by [16], Aujol et al. prove in [9] that the sequence $(x_{n})_{n\in\mathbb{N}}$ provided by (8) with $\alpha_{n}=\frac{n-1}{n+\alpha-1}$ and $\alpha\geqslant 3$ satisfies:

F(x_{n})-F^{*}=\mathcal{O}\left(n^{-\frac{2\alpha}{3}}\right),

(12)

which is better than the rate $\mathcal{O}\left(n^{-2}\right)$ obtained in the convex setting in [27, 11] (which is in fact $o\left(n^{-2}\right)$ , as proved by Attouch and Peypouquet [4]). Although this decay is not exponential, the authors show that the friction parameter $\alpha$ can be set according to some desired accuracy, and in that case the number of iterations required to achieve this accuracy is comparable to that of methods ensuring a fast exponential decay of the error, i.e. an exponential decay depending on $\sqrt{\frac{\mu}{L}}$ , which can be much faster than an exponential decay rate solely in $\kappa=\frac{\mu}{L}$ as for Forward-Backward. Note that this result holds under the assumption that $F$ has a unique minimizer.

A way to accelerate the convergence of FISTA for the class of composite convex functions having some quadratic growth property is to use restart strategies. The idea of this approach is to benefit from inertia while avoiding oscillations by re-initializing the inertia parameter to zero when some restart condition is verified. Empirical restart rules have been proposed by Giselsson and Boyd [22] or O’Donoghue and Candès [30], offering an improved convergence of FISTA in practice but without theoretical guarantees. Elementary computations show that re-initializing the inertia parameter every $\lfloor 2e\sqrt{\frac{L}{\mu}}\rfloor$ iterations allows the resulting sequence to satisfy:

F(x_{n})-F^{*}=\mathcal{O}\left(e^{-\frac{1}{e}\sqrt{\kappa}n}\right).

(13)

This convergence rate is actually the fastest in the literature for restart methods and does not require the uniqueness of the minimizer. But note that it requires knowledge of the value of $\mu$ , see e.g. [29, 30, 26]. Recently, adaptive restart schemes have been developed aiming at exploiting the geometry assumption $\mathcal{G}_{\mu}^{2}$ to derive improved convergence rates without knowing exactly the growth parameter $\mu$ : Fercoq and Qu [19], Alamo et al. [1], Aujol et al. [6, 5] introduce restart schemes ensuring a fast exponential decay of the error (i.e. depending on $\sqrt{\kappa}$ ). The schemes having the best theoretical guarantees in this setting are the one proposed by Alamo et al. in [1] ( $F(x_{n})-F^{*}=\mathcal{O}\left(e^{-\frac{\ln(15)}{4e}\sqrt{\kappa}n}\right)$ ) and the method introduced by Renegar and Grimmer in [33] ( $F(x_{n})-F^{*}=\mathcal{O}\left(e^{-\frac{1}{2\sqrt{2}}\sqrt{\kappa}n}\right)$ ). As for the optimal periodic restart, no uniqueness assumption is needed on the set of minimizers of $F$ to obtain these guarantees.
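The practical gap between a decay in $\kappa$ and a decay in $\sqrt{\kappa}$ can be quantified with a back-of-the-envelope computation. The sketch below uses hypothetical helper names and ignores the constants hidden in the $\mathcal{O}(\cdot)$ notation, so the iteration counts it prints are indicative only.

```python
import math

def restart_period(L, mu):
    """Fixed restart period floor(2e * sqrt(L / mu)) of the periodic restart scheme."""
    return math.floor(2.0 * math.e * math.sqrt(L / mu))

def iterations_for(rate_constant, eps):
    """Smallest n with exp(-rate_constant * n) <= eps (constants in O(.) are ignored)."""
    return math.ceil(math.log(1.0 / eps) / rate_constant)

L, mu, eps = 1.0, 1e-4, 1e-6          # assumed values, for illustration only
kappa = mu / L
print("restart period:", restart_period(L, mu))
print("Forward-Backward, exp(-kappa n):", iterations_for(kappa, eps))
print("periodic restart (13), exp(-sqrt(kappa) n / e):",
      iterations_for(math.sqrt(kappa) / math.e, eps))
```

With these assumed values the iteration count implied by (13) is smaller than the one implied by the Forward-Backward rate by a factor of order $\frac{1}{e\sqrt{\kappa}}$.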

In contrast to FISTA and Nesterov’s accelerated gradient method, Heavy Ball type schemes are designed for convex functions satisfying additional growth assumptions such as $\mu$ -strong convexity. To this end, these methods need to be calibrated according to the growth parameter $\mu$ . In his seminal paper [32], Polyak introduces the first Heavy Ball method for $C^{2}$ $\mu$ -strongly convex functions, which guarantees a convergence rate of the error of $\mathcal{O}\left(e^{-4\sqrt{\kappa}n}\right)$ . This decay rate is very fast but relies strongly on the $C^{2}$ assumption. Ghadimi et al. in [21] provide a $C^{1}$ convex function such that this method does not converge. Nesterov proposes in [28] the accelerated gradient method for strongly convex functions, which only requires a $C^{1}$ assumption and ensures that the error decreases as $\mathcal{O}\left(e^{-\sqrt{\kappa}n}\right)$ . In this setting, several Heavy Ball schemes have been proposed, such as Siegel’s Heavy Ball method [35] and the geometric descent method [15], which have the same theoretical asymptotic guarantees as Nesterov’s accelerated gradient method for strongly convex functions, the Heavy Ball method by Aujol et al. [7] for strongly convex functions (which we will denote ADR- $\mathcal{S}_{\mu}$ Heavy Ball), the triple momentum method by Van Scoy et al. [37] and ITEM by Taylor and Drori [36]. The latter two schemes are built thanks to the Performance Estimation Problem approach introduced by Drori and Teboulle [18] and they provide the best bounds on the error for this class of functions and first-order methods ( $\mathcal{O}\left(e^{-2\sqrt{\kappa}n}\right)$ ). Some of these schemes can be adapted to composite optimization as detailed in Table 1. Note that Beck generalizes Nesterov’s accelerated gradient method for strongly convex functions to composite optimization in [10] with V-FISTA, proving the same theoretical convergence rate of the error.

Algorithm	Reference	Assumption on $F$	Convergence rate of $F(x_{n})-F^{*}$
Polyak’s Heavy Ball	Polyak [32]	$\mathcal{S}_{\mu}$ and $C^{2}$	$\mathcal{O}\left(e^{-4\sqrt{\kappa}n}\right)$
Nesterov’s accelerated gradient method for strongly convex functions	Nesterov [28]; Necoara et al. [26]	$\mathcal{S}_{\mu}$ and $C^{1}$; $\mathcal{Q}_{\mu}$ , $C^{1}$ and uniqueness of the projection of the iterates onto $X^{*}$	$\mathcal{O}\left(e^{-\sqrt{\kappa}n}\right)$
Geometric descent method	Bubeck et al. [15]; Chen et al. [17]	$\mathcal{S}_{\mu}$; adapted to composite optimization	$\mathcal{O}\left(e^{-\sqrt{\kappa}n}\right)$
Triple momentum method	Van Scoy et al. [37]	$\mathcal{S}_{\mu}$ and $C^{1}$	$\mathcal{O}\left(e^{-2\sqrt{\kappa}n}\right)$
ITEM	Taylor, Drori [36]	$\mathcal{S}_{\mu}$ and $C^{1}$	$\mathcal{O}\left(e^{-2\sqrt{\kappa}n}\right)$
Polyak’s Heavy Ball with general friction	Ghadimi et al. [21]	$\mathcal{S}_{\mu}$ and $C^{1}$	$\mathcal{O}\left(e^{-\kappa n}\right)$
Siegel’s Heavy Ball	Siegel [35]	$\mathcal{S}_{\mu}$ and $C^{1}$; adapted to composite optimization	$\mathcal{O}\left(e^{-\sqrt{\kappa}n}\right)$
V-FISTA	Beck [10]	$\mathcal{S}_{\mu}$; adapted to composite optimization	$\mathcal{O}\left(e^{-\sqrt{\kappa}n}\right)$
ADR- $\mathcal{S}_{\mu}$ Heavy Ball	Aujol et al. [7]	$\mathcal{S}_{\mu}$; adapted to composite optimization	$\mathcal{O}\left(e^{\left(-\sqrt{2\kappa}+\mathcal{O}(\kappa)\right)n}\right)$
ADR- $\mathcal{G}^{2}_{\mu}$ Heavy Ball	Aujol et al. [8]	$\mathcal{G}^{2}_{\mu}$ and uniqueness of the minimizer; adapted to composite optimization	$\mathcal{O}\left(e^{(-(2-\sqrt{2})\sqrt{\kappa}+\mathcal{O}(\kappa))n}\right)$
ADR- $\mathcal{G}^{2}_{\mu}$ Heavy Ball	Aujol et al. [8]	$\mathcal{G}^{2}_{\mu}$; adapted to composite optimization	$\mathcal{O}\left(e^{(-\kappa+\varepsilon+\mathcal{O}(\kappa))n}\right)$

Table 1: Convergence rate of $F(x_{n})-F^{*}$ for Heavy Ball type schemes with various geometry assumptions on $F$ .

Recently, Heavy Ball type schemes have been studied under weaker geometry assumptions than strong convexity. Necoara et al. prove in [26] that the convergence rate of Nesterov’s accelerated gradient method for strongly convex functions is actually valid for $C^{1}$ $\mu$ -quasi-strongly convex functions, i.e. for functions satisfying:

\forall x\in\mathbb{R}^{N},\ \langle\nabla F(x),x-x^{*}\rangle\geqslant F(x)-F(x^{*})+\frac{\mu}{2}\|x-x^{*}\|^{2},

(14)

where $x^{*}$ denotes the projection of $x$ onto $X^{*}$ , provided that the iterates share the same projection onto the set of minimizers. In [8], Aujol et al. build a Heavy Ball type scheme (ADR- $\mathcal{G}^{2}_{\mu}$ Heavy Ball) for functions satisfying the quadratic growth assumption $\mathcal{G}^{2}_{\mu}$ guaranteeing that

F(x_{n})-F^{*}=\mathcal{O}\left(e^{(-(2-\sqrt{2})\sqrt{\kappa}+\mathcal{O}(\kappa))n}\right),

(15)

as long as $F$ has a unique minimizer.

Thus, the theoretical guarantees of Heavy Ball type schemes are the best in the literature among first-order methods for functions satisfying growth conditions but they do not hold without assuming the uniqueness of the minimizer. If this hypothesis is not verified, the theoretical convergence rates are similar to those of Forward-Backward, and the relevance of applying such algorithms in this context can therefore be questioned.

3 Fast exponential decay for Heavy Ball type algorithms

Let us now consider Heavy Ball type methods that can be generically described as variants of the V-FISTA algorithm proposed by Beck in [10]:

\forall n\in\mathbb{N},\ \left\{\begin{gathered}x_{n+1}=\text{prox}_{sh}\left(y_{n}-s\nabla f(y_{n})\right),\\ y_{n+1}=x_{n+1}+\alpha(x_{n+1}-x_{n}),\end{gathered}\right.

(V-FISTA)

with $x_{0}\in\mathbb{R}^{N}$ , $s=\frac{1}{L}$ , $y_{0}=x_{0}$ and any $\alpha>0$ . Recall that in the original definition of V-FISTA [10], the damping parameter $\alpha$ is set to: $\frac{1-\sqrt{\kappa}}{1+\sqrt{\kappa}}$ where $\kappa=\frac{\mu}{L}$ denotes the inverse of the conditioning of the function $F$ to minimize.
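The scheme (V-FISTA) translates almost line by line into code. The following is a minimal sketch rather than a reference implementation: the separable test problem, its data and the function names are assumptions introduced for illustration, and $\alpha$ is set to the value recalled above from [10], which is meaningful here because this particular $f$ is strongly convex.

```python
import math

def soft_threshold(v, t):
    """Proximal operator of t * |.| applied to a scalar v."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def v_fista(a, b, lam, alpha, x0, n_iter):
    """(V-FISTA) for f(x) = 0.5 * sum(a_i * (x_i - b_i)**2) and h = lam * ||x||_1."""
    s = 1.0 / max(a)                 # s = 1/L, with L = max(a) for this diagonal f
    x = list(x0)
    y = list(x0)                     # y_0 = x_0
    for _ in range(n_iter):
        # x_{n+1} = prox_{sh}(y_n - s * grad f(y_n))
        x_new = [soft_threshold(yi - s * ai * (yi - bi), s * lam)
                 for yi, ai, bi in zip(y, a, b)]
        # y_{n+1} = x_{n+1} + alpha * (x_{n+1} - x_n)
        y = [xn + alpha * (xn - xo) for xn, xo in zip(x_new, x)]
        x = x_new
    return x

a, b, lam = [1.0, 0.01], [2.0, 30.0], 0.1     # hypothetical problem data
kappa = min(a) / max(a)              # this f is mu-strongly convex with mu = min(a)
alpha = (1.0 - math.sqrt(kappa)) / (1.0 + math.sqrt(kappa))   # choice recalled from [10]
x = v_fista(a, b, lam, alpha, x0=[0.0, 0.0], n_iter=500)
# Closed-form minimizer of this separable problem, for comparison.
x_star = [soft_threshold(bi, lam / ai) for ai, bi in zip(a, b)]
print(x, x_star)
```

Compared with Forward-Backward, the only change is that the gradient and proximal steps are applied at the extrapolated point $y_{n}$ instead of $x_{n}$ , with a constant extrapolation weight.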

The main contribution in this section is to prove that Heavy Ball methods like (V-FISTA) can be properly parameterized to achieve fast exponential decay rates (i.e. depending on $\sqrt{\kappa}$ ) for the class of convex composite functions satisfying some quadratic growth property $\mathcal{G}^{2}_{\mu}$ and without assuming the uniqueness of the minimizer. An example of such functions is the LASSO functional:

F(x)=\frac{1}{2}\left\|{Ax-y}\right\|_{2}^{2}+\lambda\left\|{x}\right\|_{1}.

(16)

This function $F$ is convex, satisfies a quadratic growth condition but may not have a unique minimizer. To the best of our knowledge, this is the first result proving that an inertial method can actually improve the convergence rate of the Forward-Backward algorithm, which is in $O(e^{-\kappa n})$ and not in $O(e^{-c\sqrt{\kappa}n})$ . For large-scale problems, the inverse $\kappa$ of the conditioning of the function to minimize can be very small, so that a decay in $\kappa$ can be much slower than a decay in $\sqrt{\kappa}$ .

Our purpose is to build an inertial algorithm providing fast exponential decay, that is ensuring that $F(x_{n})-F(x^{*})=O(e^{-c\sqrt{\kappa}n})$ , but not to optimize the value of $c$ . Both Theorems 1 and 2 provide such bounds. In Theorem 1, the value of the inertial parameter $\alpha$ is chosen to provide a fast decay rate under a mild condition on $\kappa$ and by using a quite simple Lyapunov analysis. Indeed, other choices could have been made leading to slightly different inertia, rates, conditions on $\kappa$ and Lyapunov functions, using the same approach.

The rates given by Theorem 2 are different from Theorem 1 since the objective is not the same. This second theorem provides bounds for a large set of inertial parameters and the inertia is not optimized for a fixed bound on $\kappa$ . Indeed, in various settings, the exact value of $\kappa$ is not precisely known. For the LASSO functional (16) for example, the value of the quadratic growth parameter $\mu$ may be hard to estimate. The goal of this second theorem is to provide fast exponential decays even if the inertia parameter is not set to its optimal value, and to provide rates that are more accurate when $\kappa$ is smaller. The proof of this second theorem is inspired by the proof of Theorem 3 which provides results on the solution of the Heavy Ball dynamical system.

Theorem 1.

Let $F=f+h$ where $f$ is a convex differentiable function having a $L$ -Lipschitz gradient for some $L>0$ , and $h$ a proper convex l.s.c. function. Assume that $F$ satisfies a quadratic growth condition $\mathcal{G}_{\mu}^{2}$ for some real parameter $\mu>0$ . Let $(x_{n})_{n\in\mathbb{N}}$ be the sequence provided by (V-FISTA) with $\alpha=1-\frac{5}{3\sqrt{3}}\sqrt{\kappa}$ and $s=\frac{1}{L}$ . If $\kappa\leqslant\frac{1}{3}$ , then:

\forall n\in\mathbb{N},\ F(x_{n})-F^{*}\leqslant\frac{4}{3}\left(1-\frac{2}{3\sqrt{3}}\sqrt{\kappa}\right)^{n}(F(x_{0})-F^{*}),

(17)

and

\|x_{n}-x_{n-1}\|=\mathcal{O}\left(e^{-\frac{\sqrt{\kappa}}{3\sqrt{3}}n}\right).

(18)

Theorem 1, whose proof can be found in Section 5.1, ensures that for a well-chosen parameter $\alpha$ which depends on $\kappa$ , the decay of the error along the iterates of (V-FISTA) is at worst of order $\mathcal{O}\left(e^{-\frac{2}{3\sqrt{3}}\sqrt{\kappa}n}\right)$ . Looking back at the results in the literature, this convergence rate is slower than those of most other Heavy Ball schemes. However, remember that the required assumptions on $F$ in these works (summarized in Table 1) are stronger than those needed in Theorem 1. The only method proposed for functions satisfying $\mathcal{G}^{2}_{\mu}$ , i.e. ADR- $\mathcal{G}^{2}_{\mu}$ Heavy Ball [8], ensures a fast exponential decay of the error:

F(x_{n})-F^{*}=\mathcal{O}\left(e^{(-(2-\sqrt{2})\sqrt{\kappa}+\mathcal{O}(\kappa))n}\right),

if the function $F$ has a unique minimizer. This theoretical decay is faster than (17), but it does not hold if $F$ has multiple minimizers. To the authors’ knowledge, the fast exponential decay of the error given by Theorem 1 is the first in the literature for Heavy Ball methods in this setting and without any uniqueness assumption on the set of minimizers.

In addition, the guarantee on the decay of the error given in (17) is faster than that given by FISTA restarted periodically and optimally as it only ensures (even with some oracle [26])

F(x_{n})-F^{*}=\mathcal{O}\left(e^{-\frac{1}{e}\sqrt{\kappa}n}\right).

This means that in this setting, (V-FISTA) is theoretically more relevant than a periodic restart of FISTA when the growth parameter $\mu$ is known. In other words, one should define a constant inertia parameter depending on $\mu$ and $L$ instead of setting an increasing inertia parameter and re-initializing it optimally.
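The bound (17) and the comparison of the constants $\frac{2}{3\sqrt{3}}$ and $\frac{1}{e}$ can be checked numerically on a small example. The sketch below is an illustration under assumptions that are not in the text: the objective is a diagonal quadratic with one zero-curvature direction (so the minimizers form a line and $F^{*}=0$ ), $h=0$ so that the proximal step is the identity, and all function names are hypothetical.

```python
import math

def theorem1_alpha(kappa):
    """Inertia parameter of Theorem 1: alpha = 1 - 5 / (3 sqrt(3)) * sqrt(kappa)."""
    return 1.0 - 5.0 / (3.0 * math.sqrt(3.0)) * math.sqrt(kappa)

def theorem1_bound(kappa, n, initial_gap):
    """Right-hand side of (17)."""
    factor = 1.0 - 2.0 / (3.0 * math.sqrt(3.0)) * math.sqrt(kappa)
    return 4.0 / 3.0 * factor ** n * initial_gap

def v_fista_gaps(a, alpha, x0, n_iter):
    """F(x_n) - F* along (V-FISTA) for F(x) = 0.5 * sum(a_i * x_i**2), h = 0."""
    s = 1.0 / max(a)                 # s = 1/L
    x, y = list(x0), list(x0)        # y_0 = x_0
    gaps = []
    for _ in range(n_iter + 1):
        gaps.append(0.5 * sum(ai * xi * xi for ai, xi in zip(a, x)))
        x_new = [yi - s * ai * yi for yi, ai in zip(y, a)]        # gradient step at y_n
        y = [xn + alpha * (xn - xo) for xn, xo in zip(x_new, x)]  # extrapolation
        x = x_new
    return gaps

# Third coordinate has zero curvature: F satisfies the quadratic growth
# condition with mu = 0.01 (and L = 1) but has infinitely many minimizers.
a = [1.0, 0.01, 0.0]
kappa = 0.01
gaps = v_fista_gaps(a, theorem1_alpha(kappa), x0=[1.0, 1.0, 1.0], n_iter=300)
ok = all(gaps[n] <= theorem1_bound(kappa, n, gaps[0]) for n in range(301))
# Exponent constants: Theorem 1 versus optimal periodic restart of FISTA.
print(ok, 2.0 / (3.0 * math.sqrt(3.0)), 1.0 / math.e)
```

Since $\left(1-c\sqrt{\kappa}\right)^{n}\leqslant e^{-c\sqrt{\kappa}n}$ , comparing the two printed constants is enough to compare the worst-case exponential rates.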

The second claim of Theorem 1 gives an asymptotic control on $\|x_{n}-x_{n-1}\|$ which ensures that $\sum_{n\in\mathbb{N}^{*}}\|x_{n}-x_{n-1}\|<+\infty$ . As a consequence, the length of the trajectory of the sequence $\left(x_{n}\right)_{n\in\mathbb{N}}$ is finite and it converges strongly to a minimizer of the function $F$ .

Below is a second theorem about the iterates of (V-FISTA) which gives stronger and more general results than Theorem 1. The proof is built using the parallel with dynamical systems (see Section 4) and is located in Section 5.2.

Theorem 2.

Let $F=f+h$ where $f$ is a convex differentiable function having a $L$ -Lipschitz gradient for some $L>0$ , and $h$ a proper convex l.s.c. function. Assume that $F$ satisfies a quadratic growth condition $\mathcal{G}_{\mu}^{2}$ for some real parameter $\mu>0$ . Let $(x_{n})_{n\in\mathbb{N}}$ be the sequence provided by (V-FISTA) with $\alpha=1-\omega\sqrt{\kappa}$ where $\omega\in\left(0,\frac{1}{\sqrt{\kappa}}\right)$ . Then, for any $n\in\mathbb{N}$ :

F(x_{n})-F^{*}\leqslant\left(1+(\omega-\tau)^{2}+(\omega-\tau)\omega\tau\sqrt{\kappa}\right)\left(1-\tau\sqrt{\kappa}+\tau^{2}\kappa\right)^{n}(F(x_{0})-F^{*}),

(19)

and

\|x_{n}-x_{n-1}\|=\mathcal{O}\left(e^{-\frac{1}{2}\tau\sqrt{\kappa}\left(1-\tau\sqrt{\kappa}\right)n}\right),

(20)

where $\tau>0$ satisfies the following inequality:

\left(1-\omega\sqrt{\kappa}\right)\tau^{3}-\omega\left(2-\omega\sqrt{\kappa}\right)\tau^{2}+(\omega^{2}+2)\tau-\omega\leqslant 0.

(21)

The statements of Theorem 2 are less readable than those of Theorem 1 but they are actually stronger. The inequality (21) hides the convergence rates which can be obtained for a given $\omega\in\left(0,\frac{1}{\sqrt{\kappa}}\right)$ . Observe that the larger $\tau$ , the better the convergence rate. The best rates are obtained when:

\left(1-\omega\sqrt{\kappa}\right)\tau^{3}-\omega\left(2-\omega\sqrt{\kappa}\right)\tau^{2}+(\omega^{2}+2)\tau-\omega=0.

(22)

The maximum admissible value of $\tau$ can thus be obtained by studying the limit case $\kappa=0$ :

\tau^{3}-2\omega\tau^{2}+(\omega^{2}+2)\tau-\omega=0.

(23)

i.e. (since $\tau>0$ ):

\omega^{2}-\frac{1+2\tau^{2}}{\tau}\omega+2+\tau^{2}=0.

(24)

This can be seen as a quadratic polynomial in $\omega$ , whose discriminant is:

\Delta=\frac{1}{\tau^{2}}\left(1+4\tau^{4}+4\tau^{2}-8\tau^{2}-4\tau^{4}\right)=\frac{1}{\tau^{2}}\left(1-4\tau^{2}\right)=\frac{1}{\tau^{2}}\left(1-2\tau\right)(1+2\tau).

Hence the largest value of $\tau$ for which the discriminant satisfies $\Delta\geqslant 0$ is $\tau=\frac{1}{2}$ , which corresponds to the limit value $\omega=\frac{3}{2}$ . These two observations highlight that the convergence rates given by Theorem 2 are faster than those given in Theorem 1 for suitable choices of $\alpha$ , as expressed in the following corollary. Note that the convergence guarantees and the best choices of $\alpha$ depend on the value of the condition number since $\kappa$ appears in Equation (22).

Corollary 1.

Let $(\omega,\tau)\in(\mathbb{R}_{+})^{2}$ be two real parameters chosen such that:

\left(1-\omega\sqrt{\kappa}\right)\tau^{3}-\omega\left(2-\omega\sqrt{\kappa}\right)\tau^{2}+(\omega^{2}+2)\tau-\omega=0

(25)

and the value of $\tau$ is maximum. Let $(x_{n})_{n\in\mathbb{N}}$ be the sequence provided by (V-FISTA) with $\alpha=1-\omega\sqrt{\kappa}$ . Then:

F(x_{n})-F^{*}\leqslant C\left(1-\sigma\sqrt{\kappa}\right)^{n}(F(x_{0})-F^{*})

(26)

and

\|x_{n}-x_{n-1}\|=\mathcal{O}\left(e^{-\frac{\sigma}{2}\sqrt{\kappa}n}\right),

(27)

where:

C=1+(\omega-\tau)^{2}+(\omega-\tau)\omega\tau\sqrt{\kappa},\quad\sigma=\tau-\tau^{2}\sqrt{\kappa},

and there exist three real-valued functions $\varepsilon_{i}$ , $i=1,2,3$ such that $\lim_{t\to 0}\varepsilon_{i}(t)=0$ and:

\omega=\frac{3}{2}-\varepsilon_{1}(\kappa),\qquad\tau=\frac{1}{2}-\varepsilon_{2}(\kappa),\qquad\sigma=\frac{1}{2}-\varepsilon_{3}(\kappa).

Table 2 provides admissible sets of parameters $(\omega,\tau)$ for Corollary 1.

$\kappa$	$\omega$	$\tau$	$\sigma$	$C$
$1$	$1.2$	$\mathbf{0.39}$	$0.23$	$2.04$
$\frac{1}{3}$	$1.32$	$\mathbf{0.42}$	$0.31$	$2.1$
$10^{-1}$	$1.39$	$\mathbf{0.45}$	$0.38$	$2.07$
$10^{-2}$	$1.46$	$\mathbf{0.48}$	$0.45$	$2.03$
$10^{-3}$	$1.49$	$\mathbf{0.494}$	$0.486$	$2.02$
$10^{-4}$	$1.495$	$\mathbf{0.498}$	$0.495$	$2.002$

Table 2: Admissible sets of parameters for Corollary 1.

Thus, Corollary 1 provides better convergence rates than Theorem 1 (since $\frac{2}{3\sqrt{3}}\approx 0.38$ ). We can remark that the guarantees given by Aujol et al. in [8] for ADR- $\mathcal{G}^{2}_{\mu}$ are still better with the additional assumption that $F$ has a unique minimizer.
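The entries of Table 2 can be approximated numerically. The sketch below is an illustration under stated assumptions rather than the procedure used to build the table: it rewrites (25) as a quadratic polynomial in $\omega$ (an expansion done here, not taken from the text), restricts the search to $\tau\in(0,\frac{1}{2}]$ and $\kappa\in(0,1]$ , and locates by bisection the largest $\tau$ for which this quadratic has a real root. The function names `cubic`, `discriminant` and `best_parameters` are hypothetical.

```python
import math

def cubic(omega, tau, kappa):
    """Left-hand side of equation (25)."""
    s = math.sqrt(kappa)
    return ((1 - omega * s) * tau**3 - omega * (2 - omega * s) * tau**2
            + (omega**2 + 2) * tau - omega)

def discriminant(tau, kappa):
    """Discriminant of (25) seen as a*omega^2 - b*omega + c = 0."""
    s = math.sqrt(kappa)
    a = tau + s * tau**2               # coefficient of omega^2
    b = s * tau**3 + 2 * tau**2 + 1    # minus the coefficient of omega
    c = tau**3 + 2 * tau               # constant term
    return b * b - 4 * a * c

def best_parameters(kappa, tol=1e-12):
    """Largest tau in (0, 1/2] such that (25) admits a real omega (assumes 0 < kappa <= 1)."""
    s = math.sqrt(kappa)
    lo, hi = 0.0, 0.5   # the discriminant equals 1 at tau = 0 and is negative at tau = 1/2
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if discriminant(mid, kappa) >= 0:
            lo = mid
        else:
            hi = mid
    tau = lo
    # At the maximal tau the quadratic in omega has a (numerically) double root.
    omega = (s * tau**3 + 2 * tau**2 + 1) / (2 * (tau + s * tau**2))
    sigma = tau - tau**2 * s
    C = 1 + (omega - tau)**2 + (omega - tau) * omega * tau * s
    return omega, tau, sigma, C

if __name__ == "__main__":
    for kappa in (1.0, 1 / 3, 1e-1, 1e-2, 1e-3, 1e-4):
        print(kappa, best_parameters(kappa))
```

For $\kappa=0$ the discriminant reduces to $1-4\tau^{2}$ , which is consistent with the limit values $\tau=\frac{1}{2}$ and $\omega=\frac{3}{2}$ discussed above. The values returned for $\kappa>0$ may differ slightly from the rounded entries of Table 2.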

In fact, Theorem 2 and inequality (21) hide more than improved convergence rates. To illustrate this, Figure 1 displays the evolution of $\tau$ with respect to $\omega$ and $\kappa$ when $(\tau,\omega,\kappa)$ satisfy (22). An interactive graph is available at https://www.desmos.com/calculator/syrtiatos6. We can see on this graph that inequality (21) makes it possible to obtain convergence guarantees even for non-optimal choices of $\alpha$ , i.e. large values of $\omega$ .

Figure 1: Evolution of $\tau$ with respect to $\omega$ for several values of $\kappa$ such that $(\tau,\omega,\kappa)$ satisfy (22).

By exploiting this observation, the following corollary provides convergence rates for (V-FISTA) when $\alpha$ is too small, which can be the case if $\mu$ is overestimated. A brief proof is given in Section 5.3.

Corollary 2.

Let $F=f+h$ where $f$ is a convex differentiable function having a $L$ -Lipschitz gradient for some $L>0$ , and $h$ a proper convex l.s.c. function. Assume that $F$ satisfies a quadratic growth condition $\mathcal{G}_{\mu}^{2}$ for some real parameter $\mu>0$ . Let $(x_{n})_{n\in\mathbb{N}}$ be the sequence provided by (V-FISTA) with $s=\frac{1}{L}$ and $\alpha=1-\theta$ for some $\theta\in\left[\frac{3}{2}\sqrt{\kappa},1\right)$ . Then, if $\kappa\leqslant\frac{1}{10}$ ,

F(x_{n})-F^{*}=\mathcal{O}\left(e^{-\tau\kappa n}\right),

(28)

and

\|x_{n}-x_{n-1}\|=\mathcal{O}\left(e^{-\frac{\tau}{2}\kappa n}\right)

(29)

where $\tau=\frac{2}{3\theta}\left(1-\frac{2}{3\theta}\sqrt{\kappa}\right)$ .

Corollary 2 allows us to derive convergence rates for (V-FISTA) even if $\alpha$ is far from its optimal value. Let us describe two examples:

•

Suppose that $\alpha=1-C\sqrt{\kappa}$ where $C\geqslant\frac{3}{2}$ . Then, applying Corollary 2 with $\theta=C\sqrt{\kappa}$ , we get that the iterates of (V-FISTA) for this inertia parameter ensure a decrease of the error in $\mathcal{O}\left(e^{-\tau\sqrt{\kappa}n}\right)$ where $\tau=\frac{2(3C-2)}{9C^{2}}$ . Thus, if we choose $\alpha=1-\frac{3}{2}\sqrt{\tilde{\kappa}}$ where $\tilde{\kappa}$ is an upper estimation of $\kappa$ , then we get a theoretical guarantee on the error with $C=\frac{3}{2}\sqrt{\frac{\tilde{\kappa}}{\kappa}}$ . In this way, we obtain that if $\tilde{\kappa}=10\kappa$ , then the error decreases as $\mathcal{O}\left(e^{-\tau\sqrt{\kappa}n}\right)$ where $\tau\approx 0.12$ .
•

Assume now that $\alpha$ is arbitrarily set to $\alpha=\frac{9}{10}$ without knowing the actual value of $\kappa$ . Consequently, we have that $\alpha=1-\theta$ where $\theta=\frac{1}{10}$ . If $\theta\geqslant\frac{3}{2}\sqrt{\kappa}$ (i.e. if $\kappa\leqslant\frac{1}{225}$ ), then Corollary 2 states that the error along the iterates of (V-FISTA) for this choice of $\alpha$ decreases in the worst case as $e^{-\tau\kappa n}$ where $\tau=\frac{20}{3}\left(1-\frac{20}{3}\sqrt{\kappa}\right)$ , which is lower bounded by $\frac{100}{27}\approx 3.7$ on this range of $\kappa$ . As a consequence, if $F$ is sufficiently ill-conditioned, (V-FISTA) has better theoretical guarantees for this choice of $\alpha$ than Forward-Backward, which ensures that $F(x_{n})-F^{*}=\mathcal{O}\left(e^{-\kappa n}\right)$ [20].
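The two examples can be reproduced with a few lines of arithmetic. The snippet below is a minimal sketch that only evaluates the formula for $\tau$ given in Corollary 2; the function name `corollary2_tau` and the sample value $\kappa=10^{-4}$ are illustrative choices, and the function does not check the admissibility conditions $\kappa\leqslant\frac{1}{10}$ and $\theta\in\left[\frac{3}{2}\sqrt{\kappa},1\right)$ .

```python
import math

def corollary2_tau(theta, kappa):
    """tau of Corollary 2; the guaranteed decay is O(exp(-tau * kappa * n))."""
    r = 2.0 / (3.0 * theta)
    return r * (1.0 - r * math.sqrt(kappa))

# First example: alpha = 1 - (3/2) sqrt(kappa_tilde) with kappa_tilde = 10 * kappa.
kappa = 1e-4                                   # illustrative value
theta = 1.5 * math.sqrt(10 * kappa)
# Decay written as exp(-coeff * sqrt(kappa) * n), as in the first example.
coeff_example_1 = corollary2_tau(theta, kappa) * math.sqrt(kappa)

# Same quantity through the closed form 2(3C - 2) / (9 C^2) with C = (3/2) sqrt(10).
C = 1.5 * math.sqrt(10)
coeff_closed_form = 2 * (3 * C - 2) / (9 * C**2)

# Second example: alpha = 9/10, evaluated at the limit case kappa = 1/225.
tau_example_2 = corollary2_tau(0.1, 1 / 225)

if __name__ == "__main__":
    print(coeff_example_1, coeff_closed_form, tau_example_2)
```

The first two quantities coincide and are close to $0.12$ , while the third one equals $\frac{100}{27}$ , the smallest value of $\tau$ over the admissible range $\kappa\leqslant\frac{1}{225}$ .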

The convergence guarantees obtained in non-optimal cases and the robustness to an overestimation of the growth parameter are the main contributions of this work. To the authors’ knowledge, there are no such results in the literature. This provides a better understanding of the behavior of the iterates of (V-FISTA) for a wide range of values of $\alpha$ .

4 Convergence rates for the trajectories of the Heavy Ball dynamical system

The so called Heavy Ball equation

\ddot{x}(t)+\alpha\dot{x}(t)+\nabla F(x(t))=0

(30)

where $\alpha>0$ denotes the friction parameter, has been studied for decades. See for example Attouch et al. [3] for a general study of the dynamical system: existence of solutions, link with the mechanical system and convergence of the trajectories to critical points if no strong assumptions are made on $F$ . The main result of this section is that if $F$ is convex, differentiable and satisfies a quadratic growth condition, the solution of (30) ensures a fast exponential decay. A crucial point here is that the uniqueness of the minimizer of $F$ is not needed to get such fast rates. Before stating Theorem 3, we give an overview of the literature. We highlight that slower exponential decays are already known in this setting, that a fast decay is known if the minimizer of $F$ is supposed to be unique and that various other decays can be achieved in non convex settings. Note that the proof of Theorem 3 has served as a guide for the analysis of the Heavy Ball algorithm as performed in Section 3 and for the proof of Theorem 2.

4.1 State of the art

The first study of the convergence rate of $F(x(t))-F^{*}$ , where $x$ is a solution of (30), under a strong convexity assumption is due to Polyak [32]. In his seminal work, Polyak observes that if the function $F$ is a quadratic function, $F(x)=\left\|{Ax}\right\|^{2}$ , the solution of (30) ensures the fastest decay of $F(x(t))-F(x^{*})$ when $\alpha=\sqrt{\mu}$ where $\mu$ is the smallest non negative eigenvalue of $A^{\top}A$ . He deduced that if $F$ is $C^{2}$ and $\mu$ -strongly convex $(\mathcal{S}_{\mu})$ , the solution of (HBF) satisfies

F(x(t))-F(x^{*})=\mathcal{O}(e^{-2\sqrt{\mu}t}).

(31)

Both $C^{2}$ regularity and strong convexity are necessary to achieve such a decay. During the last decades, several works have provided various bounds depending on geometrical assumptions made on $F$ . A summary is given in Table 3. To achieve an exponential decay of $F(x(t))-F(x^{*})$ , a Łojasiewicz condition with parameter $\frac{1}{2}$ is required in all these works. In [12], the authors proved that with other Łojasiewicz exponents only polynomial decay can be achieved. Nevertheless, the exact exponential decay highly depends on the assumptions made on $F$ . If $F$ is $\mu$ -strongly convex and $C^{1}$ , it was first proved that the decay was $\mathcal{O}(e^{-\sqrt{\mu}t})$ , see for example Siegel [35] for a simple proof. Aujol et al. [7] extend this result, giving a better rate for functions that are quasi-strongly convex and have a unique minimizer, and a weaker convergence rate if $F$ satisfies only a quadratic growth condition and has a unique minimizer, see Table 3 for more details. All these results ensure fast exponential decays of $F(x(t))-F(x^{*})$ and assume the convexity of $F$ , a quadratic growth condition and the uniqueness of the minimizer of $F$ .

In [13], Bégout et al. provide several results on the trajectory $(x_{t})$ , solution of (HBF), if $F$ is a $C^{2}$ function satisfying some Łojasiewicz hypothesis but is not necessarily convex. The authors prove that under such a hypothesis the trajectory strongly converges to a minimizer of $F$ , and provide several decay rates. If $F$ satisfies a Łojasiewicz hypothesis with an exponent $\theta\in]0,\frac{1}{2}[$ , the decay rate is polynomial, see [13, Corollary 5.1]. These polynomial bounds are similar to the ones achieved by the gradient flow under similar hypotheses. If $F$ satisfies a Łojasiewicz hypothesis with an exponent $\theta=\frac{1}{2}$ , the rate is actually exponential. More recent works by Polyak et al. [31] and Apidopoulos et al. [2] also provide exponential decay under a Łojasiewicz hypothesis on $F$ , with rates depending on $L$ and $\mu$ , see Table 3.

Reference	Assumption on $F$	Exponential rate $K$ of $F(x(t))-F^{*}$
Polyak [32]	$\mathcal{S}_{\mu}$ , $C^{2}$ and convexity	$2\sqrt{\mu}$
Siegel [35]	$\mathcal{S}_{\mu}$ and convexity	$\sqrt{\mu}$
Aujol et al. [8]	$\mathcal{G}^{2}_{\mu}$ , convexity and uniqueness of the minimizer	$(2-\sqrt{2})\sqrt{\mu}$
Polyak, Shcherbakov [31]	$C^{2}$ , Łojasiewicz with $\theta=\frac{1}{2}$ and constant $c_{\ell}$ , $L$ -Lipschitz gradient	$2\frac{\mu^{\frac{3}{2}}}{(\sqrt{2}+1)\mu+L}$ with $\mu=\frac{1}{2c_{\ell}^{2}}$
Apidopoulos et al. [2]	Łojasiewicz with $\theta=\frac{1}{2}$ and constant $c_{\ell}$ , and $L$ -Lipschitz gradient	$2\left(\sqrt{\frac{L}{\mu}}-\sqrt{\frac{L-\mu}{\mu}}\right)\sqrt{\mu}$ with $\mu=\frac{1}{2c_{\ell}^{2}}$

Table 3: Convergence rate of $F(x(t))-F^{*}$ where $x$ is solution of (HBF) for the best choice of $\alpha$ (which depends on the assumptions satisfied by $F$ ). The constant $K$ is defined such that $F(x(t))-F^{*}=\mathcal{O}\left(e^{-Kt}\right)$ .

The goal of the next part is to show that under convexity and quadratic growth conditions, a faster exponential rate, independent of $L$ , can be achieved for the solution of the Heavy Ball dynamical system (HBF) without assuming the uniqueness of the minimizer.

4.2 Fast exponential decay under quadratic growth conditions

We consider the Heavy Ball Friction (HBF) system defined as follows:

\forall t\geqslant t_{0},\quad\ddot{x}(t)+\alpha\dot{x}(t)+\nabla F(x(t))=0,

(HBF)

where $t_{0}>0$ , $\alpha>0$ and $F:\mathbb{R}^{N}\rightarrow\mathbb{R}$ is a convex differentiable function satisfying some quadratic growth condition. Generalizing recent works [35, 8, 7] and making assumptions about the regularity of the boundary of the set of minimizers, we prove that the trajectories of (HBF) can achieve a fast exponential decay:

Theorem 3.

Let $F$ be a convex differentiable function having a non empty set of minimizers $X^{*}$ . Suppose that $X^{*}$ has a $C^{2}$ boundary or that it is a polyhedral set. Let $x$ be a solution of (HBF) for some $t_{0}\geqslant 0$ and $\alpha>0$ . If $F$ satisfies the assumption $\mathcal{G}^{2}_{\mu}$ for some $\mu>0$ and $\alpha=\left(2-\frac{\sqrt{2}}{2}\right)\sqrt{\mu}$ , then

\forall t\geqslant t_{0},\ F(x(t))-F^{*}\leqslant\left(\frac{11}{2}-2\sqrt{2}\right)M_{0}e^{-(2-\sqrt{2})\sqrt{\mu}(t-t_{0})},

(32)

where $M_{0}=F(x(t_{0}))-F^{*}+\frac{1}{2}\|\dot{x}(t_{0})\|^{2}$ . Moreover,

\|\dot{x}(t)\|=\mathcal{O}\left(e^{-\left(1-\frac{\sqrt{2}}{2}\right)\sqrt{\mu}t}\right).

(33)

We give elements of proof in the following section and a demonstration of the second claim is given in Section A.1.
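The bound (32) can be illustrated numerically on a toy problem with a non-unique minimizer. The sketch below is only an illustration under assumptions that are not part of the paper: it takes $F(x)=\frac{1}{2}d(x,X^{*})^{2}$ with the polyhedral set $X^{*}=[-1,1]\times\{0\}$ (so that $F^{*}=0$ and $\mathcal{G}^{2}_{\mu}$ holds with $\mu=1$ ), integrates (HBF) with a fixed-step fourth-order Runge-Kutta scheme starting from $t_{0}=0$ , and records the largest ratio between $F(x(t))$ and the right-hand side of (32). All names and numerical values are hypothetical.

```python
import math

MU = 1.0
ALPHA = (2 - math.sqrt(2) / 2) * math.sqrt(MU)   # friction parameter of Theorem 3

def F(x):
    # F(x) = 0.5 * dist(x, X*)^2 with X* = [-1, 1] x {0}, hence F* = 0
    u = max(abs(x[0]) - 1.0, 0.0)
    return 0.5 * (u * u + x[1] * x[1])

def grad_F(x):
    # gradient of F, equal to x minus its projection onto X*
    u = max(abs(x[0]) - 1.0, 0.0)
    return (math.copysign(u, x[0]), x[1])

def rhs(state):
    # first-order form of (HBF): state = (x1, x2, v1, v2)
    x1, x2, v1, v2 = state
    g1, g2 = grad_F((x1, x2))
    return (v1, v2, -ALPHA * v1 - g1, -ALPHA * v2 - g2)

def rk4_step(state, h):
    def shifted(k, c):
        return tuple(s + c * ki for s, ki in zip(state, k))
    k1 = rhs(state)
    k2 = rhs(shifted(k1, h / 2))
    k3 = rhs(shifted(k2, h / 2))
    k4 = rhs(shifted(k3, h))
    return tuple(s + h / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def worst_ratio(x0=(3.0, 2.0), v0=(0.0, 0.0), T=20.0, h=1e-3):
    """Largest value of F(x(t)) divided by the right-hand side of (32) on [0, T]."""
    state = (x0[0], x0[1], v0[0], v0[1])
    M0 = F(x0) + 0.5 * (v0[0] ** 2 + v0[1] ** 2)
    const = 11 / 2 - 2 * math.sqrt(2)
    worst = 0.0
    for k in range(int(round(T / h)) + 1):
        bound = const * M0 * math.exp(-(2 - math.sqrt(2)) * math.sqrt(MU) * k * h)
        worst = max(worst, F(state[:2]) / bound)
        state = rk4_step(state, h)
    return worst

if __name__ == "__main__":
    print(worst_ratio())
```

A returned value below $1$ means that the computed trajectory stays under the bound (32) on the whole time grid; this is a sanity check on one example, not a verification of the theorem.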

Proposition 1.

Let $F$ be a convex differentiable function having a non empty set of minimizers $X^{*}$ . Suppose that $X^{*}$ has a $C^{2}$ boundary or that it is a polyhedral set. Assume that $F$ is a $\mu$ -quasi strongly convex function, i.e. there exists $\mu>0$ such that:

\forall x\in\mathbb{R}^{N},\ \langle\nabla F(x),x-x^{*}\rangle\geqslant F(x)-F^{*}+\frac{\mu}{2}\|x-x^{*}\|^{2},

where $x^{*}$ denotes the projection of $x$ onto $X^{*}$ . Let $x$ be a solution of (HBF) for some $t_{0}\geqslant 0$ and $\alpha>0$ . Then if $\alpha=\frac{3}{\sqrt{2}}\sqrt{\mu}$ :

\forall t\geqslant t_{0},\ F(x(t))-F^{*}\leqslant 39M_{0}e^{-\sqrt{2\mu}(t-t_{0})},

(34)

where $M_{0}=F(x(t_{0}))-F^{*}+\frac{1}{2}\|\dot{x}(t_{0})\|^{2}$ .

Remark 1.

Theorem 3 and Proposition 1 are based on the assumption that the set of minimizers $X^{*}$ has a $C^{2}$ boundary or is a polyhedral set. More generally, the corresponding statements hold if the set $X^{*}$ is second order regular in the sense of Bonnans et al. [14], which is a weaker assumption. Given the technical nature of this hypothesis, the results are given for special cases. We refer the careful reader to the above reference for more details.

Remark 2.

The fact that a regularity assumption on the set of minimizers $X^{*}$ is needed to obtain these results is a curiosity, since no such hypothesis is required in the discrete case, i.e. for Theorems 1 and 2. It is directly related to the time-continuous nature of the trajectory $x$ .

4.2.1 Comparisons and comments

The first study of (HBF) was proposed by Polyak [32]. In this seminal work, Polyak considers $C^{2}$ $\mu$ -strongly convex functions and proves that for such functions the solution $x$ of (HBF) satisfies $F(x(t))-F(x^{*})=\mathcal{O}(e^{-2\sqrt{\mu}t})$ for a suitable choice of the friction parameter $\alpha$ . If the function $F$ is $C^{1}$ and $\mu$ -strongly convex, the convergence rate is weaker, see for example [35, 7]. If $F$ is $C^{1}$ , satisfies a quadratic growth condition and has a unique minimizer, which is a weaker assumption than strong convexity, Aujol et al. [7] proved that the solution of (HBF) satisfies $F(x(t))-F(x^{*})=\mathcal{O}(e^{-(2-\sqrt{2})\sqrt{\mu}t})$ for another choice of the friction parameter $\alpha$ , which is slightly slower than the rate achieved by Polyak. All the above works use the fact that the function $F$ to minimize has a unique minimizer. Indeed, if $F$ is $C^{1}$ , convex, satisfies a quadratic growth condition and has several minimizers, there were no results ensuring that the solution of (HBF) satisfies $F(x(t))-F(x^{*})=\mathcal{O}(e^{-C\sqrt{\mu}t})$ for any $C>0$ . As far as we know, Theorem 3 is the first one ensuring such a decay rate on this set of convex functions. This fast decay makes it possible to prove Theorem 2, which ensures a fast decay of an inertial algorithm on the same set of convex functions.

Several other articles provide interesting results on the decay rate of the solution of the Heavy Ball ODE (HBF). In [31, 2, 12], the authors provide general analyses based on Łojasiewicz properties. In these three articles, some results on the trajectory $x(t)$ or the error $F(x(t))-F(x^{*})$ are given. It is not simple to perform a fair comparison between these results and Theorem 3 because our analysis relies on convexity while the global analysis of these works does not use this assumption. Nevertheless, [12] and [2] provide some decay bounds when the convexity assumption is added. More precisely, in [12], Corollary 5.5 ensures that if $F$ is convex, $C^{2}$ and satisfies a quadratic growth condition with parameter $\mu$ then the trajectory $x(t)$ converges to a minimizer $x^{*}$ of $F$ , the length of the trajectory is finite and $\left\|{x(t)-x^{*}}\right\|^{2}=\mathcal{O}(e^{-\mu t})$ . Indeed, for such functions $d(x,X^{*})^{2}\leqslant\frac{2}{\mu}(F(x(t))-F(x^{*}))$ and Theorem 3 ensures a better decay rate of the trajectory to the set of minimizers.

The work of Apidopoulos et al. [2] deepens that of Polyak et al. [31], providing an explicit decay of $F(x(t))-F(x^{*})$ under similar hypotheses, i.e. Łojasiewicz properties, $C^{2}$ assumptions and a uniform bound on the Hessian of $F$ in the neighborhood of the set of minimizers. That is why we compare our results to those in [2], but the same conclusions hold with [31]. The bounds provided by the authors depend on a uniform bound $L$ on the Hessian of $F$ , which is not the case for Theorem 3, whose bound is better than that of Theorem 2 of [2] when $\frac{L}{\mu}>3$ . It turns out that the analysis of Apidopoulos et al. has been developed in a non convex setting and, in this setting, the use of this bound on the Hessian seems to be the only known way to get bounds on $F(x(t))-F(x^{*})$ . The convexity seems to be the price to pay to get bounds independent of this Lipschitz constant $L$ .

This analysis of the Heavy Ball dynamical system provides a guideline for the analysis of the Heavy Ball algorithm.

Remark 3.

Even if the convexity of $F$ seems to be a key hypothesis to reach such a decay rate, the careful reader may notice that Theorem 3 actually holds for star convex functions, i.e. functions satisfying:

\forall x\in\mathbb{R}^{N},\ \forall x^{*}\in X^{*},\ F(x)-F(x^{*})\leqslant\left\langle x-x^{*},\nabla F(x)\right\rangle,

where $X^{*}$ denotes the set of minimizers of $F$ .

4.2.2 Elements of proof

The results obtained in this paper rely on a Lyapunov approach. Let us recall that when $F$ has a unique minimizer, i.e. $X^{*}=\{x^{*}\}$ , a classical Lyapunov choice for (HBF) is:

\mathcal{E}(t)=F(x(t))-F^{*}+\frac{1}{2}\|\lambda(x(t)-x^{*})+\dot{x}(t)\|^{2}+\frac{\xi}{2}\|x(t)-x^{*}\|^{2},

(35)

for some well-chosen parameters $\lambda>0$ and $\xi\in\mathbb{R}$ . Our approach to extend that type of analysis without the uniqueness assumption is to adapt the Lyapunov energy to our relaxed setting. Let $F$ have a non empty set of minimizers $X^{*}$ which is not reduced to one point. Let

\mathcal{E}^{*}(t)=F(x(t))-F^{*}+\frac{1}{2}\|\lambda(x(t)-x^{*}(t))+\dot{x}(t)\|^{2}+\frac{\xi}{2}\|x(t)-x^{*}(t)\|^{2},

(36)

where for all $t\geqslant t_{0}$ , $x^{*}(t)$ denotes the projection of $x(t)$ onto $X^{*}$ , i.e.

x^{*}(t)=\textup{arg}\,\inf\limits_{x^{*}\in X^{*}}\|x(t)-x^{*}\|^{2}:=P_{X^{*}}(x(t)).

This slight modification leads to a question when attempting to differentiate the Lyapunov energy: is $t\mapsto x^{*}(t)$ differentiable?

The smoothness of $t\mapsto x^{*}(t)$ is related to the smoothness of $P_{X^{*}}$ . In fact, if $P_{X^{*}}$ is directionally differentiable then $t\mapsto x^{*}(t)$ is right-differentiable (and left-differentiable) and its right-hand derivative is equal to $P^{\prime}_{X^{*}}(x(t),\dot{x}(t))$ .

In [14, Theorem 7.2], Bonnans et al. prove that if a closed convex set $\mathcal{S}\subset\mathcal{X}$ is second order regular at $P_{\mathcal{S}}(x)$ for some $x\in\mathcal{X}$ , then $P_{\mathcal{S}}$ is directionally differentiable at $x$ . We refer the reader to [14, 34] for a complete understanding of the notion of second order regularity. Note that this geometry assumption is satisfied by sets having a $C^{2}$ boundary [23] and by polyhedral sets [34].

Throughout the rest of this section we assume that the set of minimizers is second order regular. Consequently, $t\mapsto x^{*}(t)$ is right-differentiable, and so is $\mathcal{E}^{*}$ . For the sake of simplicity, let $\dot{x^{*}}$ and $\dot{\mathcal{E}}^{*}$ denote the corresponding right-hand derivatives. We can write that:

\dot{\mathcal{E}}^{*}(t)=D(t)-(\lambda^{2}+\xi)\langle x(t)-x^{*}(t),\dot{x^{*}}(t)\rangle-\lambda\langle\dot{x}(t),\dot{x^{*}}(t)\rangle,

(37)

where

D(t)=\langle\nabla F(x(t)),\dot{x}(t)\rangle+\langle\lambda(x(t)-x^{*}(t))+\dot{x}(t),\lambda\dot{x}(t)+\ddot{x}(t)\rangle+\xi\langle x(t)-x^{*}(t),\dot{x}(t)\rangle.

Observe that $D$ is exactly equal to $\dot{\mathcal{E}}$ if $F$ has a unique minimizer $x^{*}$ . The objective is then to control the additional terms $\langle x(t)-x^{*}(t),\dot{x^{*}}(t)\rangle$ and $\langle\dot{x}(t),\dot{x^{*}}(t)\rangle$ . We introduce Figure 2 to give an intuition of the behaviour of these terms.

Figure 2: Behaviour of $\dot{x^{*}}$ for a set of minimizers having a $C^{2}$ boundary (on the left) and a polyhedral set of minimizers (on the right).

We can first prove that $\langle\dot{x}(t),\dot{x^{*}}(t)\rangle$ is nonnegative by using the expression $\dot{x^{*}}(t)=\lim\limits_{h\rightarrow 0,\ h>0}\frac{x^{*}(t+h)-x^{*}(t)}{h}$ and the property of the projection onto a convex set. Indeed, as $X^{*}$ is a closed convex set, for any $x\in\mathbb{R}^{N}$ and $u\in X^{*}$ :

\langle x-P_{X^{*}}(x),u-P_{X^{*}}(x)\rangle\leqslant 0.

Thus, for any $h>0$ we have:

\begin{aligned}\langle x(t+h)-x(t),x^{*}(t+h)-x^{*}(t)\rangle&=\langle x(t+h)-x^{*}(t+h),x^{*}(t+h)-x^{*}(t)\rangle\\ &\quad+\|x^{*}(t+h)-x^{*}(t)\|^{2}\\ &\quad+\langle x(t)-x^{*}(t),x^{*}(t)-x^{*}(t+h)\rangle\geqslant 0.\end{aligned}

By considering $h$ tending towards $0$ we can deduce that $\langle\dot{x}(t),\dot{x^{*}}(t)\rangle\geqslant 0$ .
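The decomposition above shows in particular that $\langle x-y,P_{X^{*}}(x)-P_{X^{*}}(y)\rangle\geqslant\|P_{X^{*}}(x)-P_{X^{*}}(y)\|^{2}$ for any pair of points. The following sketch checks this inequality on random points; the box $[-1,1]^{3}$ stands in for $X^{*}$ and is an illustrative assumption, as are the function names.

```python
import random

def project_box(x, lo=-1.0, hi=1.0):
    """Euclidean projection onto the box [lo, hi]^N (a polyhedral convex set)."""
    return [min(max(xi, lo), hi) for xi in x]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def smallest_gap(n_trials=1000, dim=3, seed=0):
    """Smallest value of <x - y, P(x) - P(y)> - ||P(x) - P(y)||^2 over random pairs."""
    rng = random.Random(seed)
    smallest = float("inf")
    for _ in range(n_trials):
        x = [rng.uniform(-3, 3) for _ in range(dim)]
        y = [rng.uniform(-3, 3) for _ in range(dim)]
        px, py = project_box(x), project_box(y)
        dx = [a - b for a, b in zip(x, y)]
        dp = [a - b for a, b in zip(px, py)]
        smallest = min(smallest, dot(dx, dp) - dot(dp, dp))
    return smallest

if __name__ == "__main__":
    print(smallest_gap())   # expected to be nonnegative up to rounding errors
```

Taking $x=x(t+h)$ and $y=x(t)$ , dividing by $h^{2}$ and letting $h$ tend to $0$ gives the sign of $\langle\dot{x}(t),\dot{x^{*}}(t)\rangle$ used in the proof.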

In [14, Theorem 7.2] the authors give an expression of the directional derivative $P^{\prime}_{\mathcal{S}}(x,d)$ for a closed convex set $\mathcal{S}\subset\mathcal{X}$ being second order regular at $P_{\mathcal{S}}(x)$ for some $x\in\mathcal{X}$ . This directional derivative satisfies:

\langle x-P_{\mathcal{S}}(x),P^{\prime}_{\mathcal{S}}(x,d)\rangle=0.

Considering the assumptions made on $X^{*}$ we can deduce that $\langle x(t)-x^{*}(t),\dot{x^{*}}(t)\rangle=0$ for all $t\geqslant t_{0}$ .

These results ensure that for any choice of parameter $\lambda>0$ and $\xi\in\mathbb{R}$ , $\dot{\mathcal{E}}^{*}(t)\leqslant D(t)$ . From this point, the proofs of the convergence results stated in Theorem 3 and Proposition 1 follow the original proofs, taking $D$ instead of $\dot{\mathcal{E}}$ and by applying the following lemma. A proof is given in Section A.2.

Lemma 1.

Let $\phi:\mathbb{R}\rightarrow\mathbb{R}$ be a continuous function which is right-differentiable. Assume that

\forall t\geqslant t_{0},\leavevmode\nobreak\ \phi_{+}(t)\leqslant\psi(t),

(38)

where $\phi_{+}(t)=\lim\limits_{h\rightarrow 0,\ h>0}\frac{\phi(t+h)-\phi(t)}{h}$ denotes the right derivative of $\phi$ at $t$ . Then,

\forall t\geqslant t_{0},\ \phi(t)\leqslant\phi(t_{0})+\int_{t_{0}}^{t}\psi(u)du.

(39)

5 Proofs of Theorems 1 and 2

The proofs of Theorems 1 and 2 are built around the following discrete Lyapunov energies

\mathcal{E}_{n}=\frac{2}{L}(F(x_{n})-F^{*})+\left\|\lambda(x_{n-1}-x_{n-1}^{*})+x_{n}-x_{n-1}\right\|^{2},

(40)

and

\mathcal{E}_{n}=\frac{2}{L}(F(x_{n})-F^{*})+\alpha\left\|\lambda(x_{n}-x_{n}^{*})+x_{n}-x_{n-1}\right\|^{2}+\lambda(1-\alpha)^{2}\|x_{n}-x_{n}^{*}\|^{2},

(41)

where $\lambda>0$ and $x_{n}^{*}$ denotes the projection of $x_{n}$ onto $X^{*}$ for any $n\in\mathbb{N}$ . For the sake of clarity, we use the following notations:

\begin{gathered}w_{n}=\frac{2}{L}(F(x_{n})-F^{*}),\ h_{n}=\|x_{n}-x^{*}_{n}\|^{2},\\ \delta_{n}=\|x_{n}-x_{n-1}\|^{2},\ \gamma_{n}^{*}=\|x_{n}^{*}-x_{n-1}^{*}\|^{2}.\end{gathered}

(42)

The following lemma is necessary in order to handle the terms related to non uniqueness of the minimizers. We give a proof in Section A.3.

Lemma 2.

For all $n\in\mathbb{N}^{*}$ , the following equalities hold:

1.

$\langle x_{n}-x_{n}^{*},x_{n}-x_{n-1}\rangle=\frac{1}{2}(h_{n}-h_{n-1}+\delta_{n}-\gamma_{n}^{*})+\langle x_{n-1}-x_{n-1}^{*},x_{n}^{*}-x_{n-1}^{*}\rangle.$
2.

$\langle x_{n-1}-x_{n-1}^{*},x_{n}-x_{n-1}\rangle=\frac{1}{2}(h_{n}-h_{n-1}-\delta_{n}+\gamma_{n}^{*})+\langle x_{n}-x_{n}^{*},x_{n}^{*}-x_{n-1}^{*}\rangle.$
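Both equalities of Lemma 2 are purely algebraic and can be checked numerically. The sketch below draws random pairs $(x_{n-1},x_{n})$ and compares the two sides of each identity with the notations of (42); the box $[-1,1]^{4}$ used in place of $X^{*}$ and the function names are illustrative assumptions.

```python
import random

def project_box(x, lo=-1.0, hi=1.0):
    """Euclidean projection onto the box [lo, hi]^N, used here in place of X*."""
    return [min(max(xi, lo), hi) for xi in x]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def lemma2_max_error(n_trials=200, dim=4, seed=0):
    """Largest gap between both sides of the two identities of Lemma 2."""
    rng = random.Random(seed)
    max_err = 0.0
    for _ in range(n_trials):
        x_prev = [rng.uniform(-3, 3) for _ in range(dim)]   # x_{n-1}
        x_cur = [rng.uniform(-3, 3) for _ in range(dim)]    # x_n
        p_prev, p_cur = project_box(x_prev), project_box(x_cur)
        h_cur = dot(sub(x_cur, p_cur), sub(x_cur, p_cur))
        h_prev = dot(sub(x_prev, p_prev), sub(x_prev, p_prev))
        delta = dot(sub(x_cur, x_prev), sub(x_cur, x_prev))
        gamma = dot(sub(p_cur, p_prev), sub(p_cur, p_prev))
        lhs1 = dot(sub(x_cur, p_cur), sub(x_cur, x_prev))
        rhs1 = 0.5 * (h_cur - h_prev + delta - gamma) + dot(sub(x_prev, p_prev), sub(p_cur, p_prev))
        lhs2 = dot(sub(x_prev, p_prev), sub(x_cur, x_prev))
        rhs2 = 0.5 * (h_cur - h_prev - delta + gamma) + dot(sub(x_cur, p_cur), sub(p_cur, p_prev))
        max_err = max(max_err, abs(lhs1 - rhs1), abs(lhs2 - rhs2))
    return max_err

if __name__ == "__main__":
    print(lemma2_max_error())   # expected to be of the order of rounding errors
```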

Moreover, we introduce a lemma which encodes the fact that the sequence $(x_{n})_{n\in\mathbb{N}}$ is provided by (V-FISTA). The proof is based on the descent lemma proved in [16] and it can be found in Appendix A.4.

Lemma 3.

Let $(x_{n})_{n\in\mathbb{N}}$ be the sequence provided by (V-FISTA) with $s=\frac{1}{L}$ . Then, for any $n\in\mathbb{N}^{*}$ ,

\begin{aligned}w_{n+1}-w_{n}&\leqslant\alpha^{2}\delta_{n}-\delta_{n+1},\\ w_{n+1}&\leqslant(1+\alpha)h_{n}+(\alpha^{2}+\alpha)\delta_{n}-\alpha h_{n-1}-h_{n+1}-\gamma_{n+1}^{*}-\alpha\gamma_{n}^{*}\\ &\quad+2\alpha\langle x_{n-1}-x_{n-1}^{*},x_{n}^{*}-x_{n-1}^{*}\rangle-2\langle x_{n+1}-x_{n+1}^{*},x_{n+1}^{*}-x_{n}^{*}\rangle.\end{aligned}

We would like to point out that several controls are deduced from the properties of the projection onto a convex set. Indeed, if $C$ is a closed convex set such that $C\subset\mathbb{R}^{N}$ , then for any $x\in\mathbb{R}^{N}$ and $y\in C$ ,

\langle x-p,y-p\rangle\leqslant 0,

where $p$ denotes the projection of $x$ onto $C$ . This property directly guarantees inequalities such as

\langle x_{n}-x_{n}^{*},x_{n}^{*}-x_{n-1}^{*}\rangle\geqslant 0,

\langle x_{n-1}-x_{n-1}^{*},x_{n}^{*}-x_{n-1}^{*}\rangle\leqslant 0.

5.1 Proof of Theorem 1

Let $(x_{n})_{n\in\mathbb{N}}$ be the sequence provided by (V-FISTA) for some $\alpha>0$ to be defined. We define the following discrete Lyapunov energy:

\mathcal{E}_{n}=\frac{2}{L}(F(x_{n})-F^{*})+\|x_{n}-x_{n-1}+\lambda(x_{n-1}-x_{n-1}^{*})\|^{2},

(43)

where $\lambda>0$ and $x_{n}^{*}$ denotes the projection of $x_{n}$ on $X^{*}$ for any $n\in\mathbb{N}$ . By setting $\lambda=\sqrt{\kappa}$ and considering that $F$ has a unique minimizer, we recover the energy considered by Beck in [10].

The aim of this proof is to find $\tau>0$ as large as possible such that for a well-chosen set of parameters,

\mathcal{E}_{n+1}-(1-\tau\sqrt{\kappa})\mathcal{E}_{n}\leqslant 0.

(44)

The proof is divided into three parts. We first use the lemmas introduced at the beginning of Section 5 and the properties of the projection onto a convex set to handle the terms related to the non uniqueness of the minimizers. Then, we give a set of parameters which leads to the desired inequality (44) by using the geometry assumption satisfied by $F$ . The convergence of the trajectories is obtained in the last part using the previous results and elementary computations.

5.1.1 Preliminary work

We recall that we use the notations defined in (42). By rewriting (43) and using the second claim of Lemma 2 we have:

\mathcal{E}_{n}=w_{n}+(1-\lambda)\delta_{n}+\lambda(h_{n}-h_{n-1})+\lambda^{2}h_{n-1}+\lambda\gamma_{n}^{*}+2\lambda\langle x_{n}-x_{n}^{*},x_{n}^{*}-x_{n-1}^{*}\rangle.

(45)

Lemma 3 ensures that if $\lambda\leqslant 1$ :

\begin{aligned}w_{n+1}-(1-\lambda)w_{n}&\leqslant\alpha(\alpha+\lambda)\delta_{n}-(1-\lambda)\delta_{n+1}+\alpha\lambda(h_{n}-h_{n-1})+\lambda(h_{n}-h_{n+1})\\ &\quad-\lambda\gamma_{n+1}^{*}-\alpha\lambda\gamma_{n}^{*}+2\alpha\lambda\langle x_{n-1}-x_{n-1}^{*},x_{n}^{*}-x_{n-1}^{*}\rangle\\ &\quad-2\lambda\langle x_{n+1}-x_{n+1}^{*},x_{n+1}^{*}-x_{n}^{*}\rangle.\end{aligned}

(46)

This inequality combined with (45) ensures that:

\mathcal{E}_{n+1}-(1-\lambda)\mathcal{E}_{n}\leqslant a_{1}\delta_{n}+a_{2}(h_{n}-h_{n-1})+a_{3}h_{n}+\mathcal{X}_{n}^{*},

(47)

where:

a_{1}=\alpha(\alpha+\lambda)-(1-\lambda)^{2},\ a_{2}=\alpha\lambda-\lambda(1-\lambda)+(1-\lambda)\lambda^{2},\ a_{3}=\lambda^{3},

and $\mathcal{X}_{n}^{*}$ is defined by

\begin{aligned}\mathcal{X}_{n}^{*}={}&-\lambda(1-\lambda+\alpha)\gamma_{n}^{*}+2\alpha\lambda\langle x_{n-1}-x_{n-1}^{*},x_{n}^{*}-x_{n-1}^{*}\rangle\\ &-2\lambda(1-\lambda)\langle x_{n}-x_{n}^{*},x_{n}^{*}-x_{n-1}^{*}\rangle.\end{aligned}

Due to the properties of the projection onto a convex set we have that for all $n\in\mathbb{N}$ :

\left\{\begin{gathered}\langle x_{n-1}-x_{n-1}^{*},x_{n}^{*}-x_{n-1}^{*}\rangle\leqslant 0,\\ \langle x_{n}-x_{n}^{*},x_{n}^{*}-x_{n-1}^{*}\rangle\geqslant 0,\end{gathered}\right.

and since $\gamma_{n}^{*}\geqslant 0$ , we can conclude that $\mathcal{X}_{n}^{*}\leqslant 0$ and consequently that:

\mathcal{E}_{n+1}-(1-\lambda)\mathcal{E}_{n}\leqslant a_{1}\delta_{n}+a_{2}(h_{n}-h_{n-1})+a_{3}h_{n}.

(48)

5.1.2 Getting the convergence rate

Recall that we want to find $\tau>0$ such that: $\mathcal{E}_{n+1}-(1-\tau\sqrt{\kappa})\mathcal{E}_{n}\leqslant 0$ . We choose the following set of parameters:

\tau=\frac{2}{3\sqrt{3}},\quad\lambda=\frac{1}{\sqrt{3}}\sqrt{\kappa},\quad\alpha=1-\frac{5}{3\sqrt{3}}\sqrt{\kappa}=1-\frac{5}{3}\lambda.

Then we get that:

\begin{aligned}\mathcal{E}_{n+1}-(1-\tau\sqrt{\kappa})\mathcal{E}_{n}&=\mathcal{E}_{n+1}-(1-\lambda)\mathcal{E}_{n}+\left(\frac{2}{3\sqrt{3}}-\frac{1}{\sqrt{3}}\right)\sqrt{\kappa}\mathcal{E}_{n}\\ &\leqslant a_{1}\delta_{n}+a_{2}(h_{n}-h_{n-1})+a_{3}h_{n}-\frac{1}{3\sqrt{3}}\sqrt{\kappa}\mathcal{E}_{n},\end{aligned}

(49)

where for this parameter choice we have:

\begin{aligned}a_{1}&=-\frac{\lambda}{3}\left(1-\frac{\lambda}{3}\right)=\frac{\sqrt{\kappa}}{27}(\sqrt{\kappa}-3\sqrt{3}),\\ a_{2}&=\lambda^{2}\left(\frac{1}{3}-\lambda\right)=\frac{\kappa}{3\sqrt{3}}\left(\frac{1}{\sqrt{3}}-\sqrt{\kappa}\right),\\ a_{3}&=\lambda^{3}=\frac{\kappa^{\frac{3}{2}}}{3\sqrt{3}}.\end{aligned}

Under the condition $\kappa\leqslant\frac{1}{3}$ , we have that $a_{1}\leqslant 0$ and hence,

\mathcal{E}_{n+1}-\left(1-\frac{2}{3\sqrt{3}}\sqrt{\kappa}\right)\mathcal{E}_{n}\leqslant\frac{\kappa}{3\sqrt{3}}\left(\frac{1}{\sqrt{3}}-\sqrt{\kappa}\right)(h_{n}-h_{n-1})+\frac{\kappa^{\frac{3}{2}}}{3\sqrt{3}}h_{n}-\frac{1}{3\sqrt{3}}\sqrt{\kappa}\mathcal{E}_{n}.

(50)
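The closed forms of $a_{1}$ , $a_{2}$ and $a_{3}$ for this parameter choice, as well as their signs when $\kappa\leqslant\frac{1}{3}$ , can be double-checked numerically. The snippet below is a minimal sketch comparing the general definitions given after (47) with the closed forms; the function names and the sample values of $\kappa$ are illustrative.

```python
import math

def coefficients(kappa):
    """a1, a2, a3 from their definitions, with lambda = sqrt(kappa / 3) and alpha = 1 - 5 lambda / 3."""
    lam = math.sqrt(kappa / 3)
    alpha = 1 - 5 * lam / 3
    a1 = alpha * (alpha + lam) - (1 - lam) ** 2
    a2 = alpha * lam - lam * (1 - lam) + (1 - lam) * lam ** 2
    a3 = lam ** 3
    return a1, a2, a3

def closed_forms(kappa):
    """Closed forms of a1, a2, a3 in terms of kappa."""
    s = math.sqrt(kappa)
    a1 = s / 27 * (s - 3 * math.sqrt(3))
    a2 = kappa / (3 * math.sqrt(3)) * (1 / math.sqrt(3) - s)
    a3 = kappa ** 1.5 / (3 * math.sqrt(3))
    return a1, a2, a3

if __name__ == "__main__":
    for kappa in (1e-4, 1e-2, 1e-1, 1 / 3):
        print(kappa, coefficients(kappa), closed_forms(kappa))
```

On these samples, $a_{1}$ is nonpositive and $a_{2}$ is nonnegative, the latter vanishing at $\kappa=\frac{1}{3}$ .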

Moreover, as the condition $\kappa\leqslant\frac{1}{3}$ ensures that $a_{2}\geqslant 0$ , we can apply the following lemma, which is proved in Section A.5.

Lemma 4.

Let $(x_{n})_{n\in\mathbb{N}}$ be the sequence provided by (V-FISTA) and $\lambda=\frac{1}{\sqrt{3}}\sqrt{\kappa}$ . Then for all $n\in\mathbb{N}$ :

h_{n}-h_{n-1}\leqslant\frac{\sqrt{3}}{\sqrt{\kappa}}\left(\mathcal{E}_{n}-w_{n}\right).

(51)

Lemma 4 guarantees that if $\kappa\leqslant\frac{1}{3}$ :

\mathcal{E}_{n+1}-(1-\frac{2}{3\sqrt{3}}\sqrt{\kappa})\mathcal{E}_{n}\leqslant\frac{\kappa^{\frac{3}{2}}}{3\sqrt{3}}h_{n}-\frac{\sqrt{\kappa}}{3}\left(\frac{1}{\sqrt{3}}-\sqrt{\kappa}\right)w_{n}-\frac{\kappa}{3}\mathcal{E}_{n}.

(52)

Moreover, as $F$ satisfies $\mathcal{G}^{2}_{\mu}$ we can write that $h_{n}\leqslant\frac{w_{n}}{\kappa}$ and consequently:

\begin{aligned}\mathcal{E}_{n+1}-(1-\frac{2}{3\sqrt{3}}\sqrt{\kappa})\mathcal{E}_{n}&\leqslant\left(\frac{\sqrt{\kappa}}{3\sqrt{3}}-\frac{\sqrt{\kappa}}{3}\left(\frac{1}{\sqrt{3}}-\sqrt{\kappa}\right)\right)w_{n}-\frac{\kappa}{3}\mathcal{E}_{n}\\ &\leqslant\frac{\kappa}{3}w_{n}-\frac{\kappa}{3}\mathcal{E}_{n}.\end{aligned}

(53)

Noticing that $w_{n}\leqslant\mathcal{E}_{n}$ we can conclude that:

\mathcal{E}_{n+1}-(1-\frac{2}{3\sqrt{3}}\sqrt{\kappa})\mathcal{E}_{n}\leqslant 0.

(54)

Hence: $\mathcal{E}_{n}\leqslant\left(1-\frac{2}{3\sqrt{3}}\sqrt{\kappa}\right)^{n}\mathcal{E}_{0}.$ As we consider that $x_{-1}=x_{0}$ , we have that $\mathcal{E}_{0}=w_{0}+\lambda^{2}h_{0}$ . As a consequence, the geometry condition $\mathcal{G}^{2}_{\mu}$ ensures that $\mathcal{E}_{0}\leqslant\frac{4}{3}w_{0}$ and

F(x_{n})-F^{*}\leqslant\frac{4}{3}\left(1-\frac{2}{3\sqrt{3}}\sqrt{\kappa}\right)^{n}(F(x_{0})-F^{*}).

(55)
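To give a sense of scale for the bound (55), the following minimal Python sketch computes the smallest $n$ for which the right-hand side of (55) falls below a relative tolerance. The function names `contraction_factor` and `iterations_for_tolerance` and the tolerance `eps` are illustrative choices that do not appear in the analysis; only the constants $\frac{4}{3}$ and $\frac{2}{3\sqrt{3}}$ are taken from (55).

```python
import math

def contraction_factor(kappa):
    """Per-iteration factor 1 - (2 / (3*sqrt(3))) * sqrt(kappa) from (55)."""
    if not 0.0 < kappa <= 1.0 / 3.0:
        raise ValueError("the bound (55) is stated for 0 < kappa <= 1/3")
    return 1.0 - 2.0 / (3.0 * math.sqrt(3.0)) * math.sqrt(kappa)

def iterations_for_tolerance(kappa, eps):
    """Smallest n such that (4/3) * rho**n <= eps, with rho from (55).

    eps is a relative tolerance on (F(x_n) - F*) / (F(x_0) - F*).
    """
    rho = contraction_factor(kappa)
    if eps >= 4.0 / 3.0:
        return 0
    # Closed-form guess, then a guard against floating-point rounding.
    n = max(0, math.ceil(math.log(3.0 * eps / 4.0) / math.log(rho)))
    while (4.0 / 3.0) * rho ** n > eps:
        n += 1
    return n
```

Since $\log(1-u)\approx-u$ for small $u$, the returned count grows roughly like $\frac{3\sqrt{3}}{2\sqrt{\kappa}}\log\frac{4}{3\varepsilon}$, which reflects the $\sqrt{\kappa}$ dependence of the rate.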

5.1.3 Convergence of the trajectories

Let $b_{n}=\|x_{n}-x_{n-1}+\lambda(x_{n-1}-x_{n-1}^{*})\|^{2}$ . Using the inequality $\|u\|^{2}\leqslant 2\|u+v\|^{2}+2\|v\|^{2}$ , we get that:

\delta_{n}\leqslant 2b_{n}+\frac{2}{\lambda^{2}}h_{n-1}.

(56)

Thus, using the definition of $\mathcal{E}$ and the geometry of $F$ :

\delta_{n}\leqslant 2\mathcal{E}_{n}+\frac{2}{\lambda^{2}\kappa}w_{n-1}.

(57)

Then, by applying inequality (54), we deduce that:

\delta_{n}\leqslant\left(2\left(1-\frac{2}{3\sqrt{3}}\sqrt{\kappa}\right)+% \frac{2}{\lambda^{2}\kappa}\right)\mathcal{E}_{n-1},

(58)

and consequently,

\delta_{n}\leqslant\left(2\left(1-\frac{2}{3\sqrt{3}}\sqrt{\kappa}\right)+% \frac{2}{\lambda^{2}\kappa}\right)\left(1-\frac{2}{3\sqrt{3}}\sqrt{\kappa}% \right)^{n-1}\mathcal{E}_{0}.

(59)

Hence,

\|x_{n}-x_{n-1}\|=\mathcal{O}\left(e^{-\frac{1}{3\sqrt{3}}\sqrt{\kappa}n}% \right).

(60)

5.2 Proof of Theorem 2

5.2.1 Structure of the proof

The proof of Theorem 2 is built around the approach provided in [8] in order to prove convergence rates of the trajectories of the Heavy Ball system described by:

\ddot{x}(t)+\alpha_{c}\dot{x}(t)+\nabla F(x(t))=0,

(HBF)

for some $\alpha_{c}>0$ . Indeed, the sequence generated by (V-FISTA) can be seen as a discretization of (HBF) (when $F$ is differentiable) and the strategy can be adapted to the discrete setting. Note that the parameter $\alpha_{c}$ in (HBF) does not play the same role as $\alpha$ in (V-FISTA). Indeed, we have that $\alpha_{c}$ behaves as $\sqrt{L}(1-\alpha)$ .
This proof relies on the analysis of the following Lyapunov energy:

\forall n\in\mathbb{N},\quad\mathcal{E}_{n}=\frac{2}{L}(F(x_{n})-F^{*})+\alpha% \|x_{n}-x_{n-1}+\lambda(x_{n}-x_{n}^{*})\|^{2}+\lambda(1-\alpha)^{2}\|x_{n}-x_% {n}^{*}\|^{2}.

(61)

The strategy of the proof is straightforward: we aim to find a set of parameters $(\alpha,\lambda,\nu)\in\left(\mathbb{R}^{+}\right)^{3}$ such that for any $n\in\mathbb{N}$ ,

\mathcal{E}_{n+1}-\mathcal{E}_{n}+\nu\mathcal{E}_{n+1}\leqslant 0.

(62)

In this way, simple calculations show that it ensures

\forall n\in\mathbb{N},\quad\mathcal{E}_{n}\leqslant\left(1-\nu+\nu^{2}\right)% ^{n}\mathcal{E}_{0}

(63)

which leads us to the conclusion.
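The simple calculations behind this step can be spelled out as follows (a short worked derivation, using only that $\mathcal{E}_{n}\geqslant 0$ and $\nu\geqslant 0$ ). Inequality (62) reads $(1+\nu)\mathcal{E}_{n+1}\leqslant\mathcal{E}_{n}$ , and since $(1+\nu)(1-\nu+\nu^{2})=1+\nu^{3}\geqslant 1$ ,

\mathcal{E}_{n+1}\leqslant\frac{1}{1+\nu}\mathcal{E}_{n}\leqslant\left(1-\nu+\nu^{2}\right)\mathcal{E}_{n},

so that (63) follows by induction on $n$ .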

5.2.2 Proof

Let $(x_{n})_{n\in\mathbb{N}}$ be the sequence provided by (V-FISTA). Following the notations introduced in (42), we can rewrite:

\forall n\in\mathbb{N},\leavevmode\nobreak\ \mathcal{E}_{n}=w_{n}+\lambda(% \lambda\alpha+(1-\alpha)^{2})h_{n}+\alpha\delta_{n}+2\alpha\lambda\langle x_{n% }-x_{n}^{*},x_{n}-x_{n-1}\rangle.

(64)

Following the first claim of Lemma 2,

\begin{aligned}\mathcal{E}_{n}={}&w_{n}+\lambda(\lambda\alpha+(1-\alpha)^{2})h_{n}+\lambda\alpha(h_{n}-h_{n-1})+(1+\lambda)\alpha\delta_{n}\\
&-\lambda\alpha\gamma_{n}^{*}+2\lambda\alpha\langle x_{n-1}-x_{n-1}^{*},x_{n}^{*}-x_{n-1}^{*}\rangle.\end{aligned}

(65)

Observe that, due to the properties of the projection onto a convex set, for any $n\in\mathbb{N}$ :

\mathcal{E}_{n}\leqslant w_{n}+\lambda(\lambda\alpha+(1-\alpha)^{2})h_{n}+% \lambda\alpha(h_{n}-h_{n-1})+(1+\lambda)\alpha\delta_{n}.

(66)

By exploiting the expression (65), we can show the following lemma. The proof can be found in Section A.6.

Lemma 5.

For any $n\in\mathbb{N}$ , we have that:

\begin{aligned}\mathcal{E}_{n+1}-\mathcal{E}_{n}\leqslant{}&-\lambda w_{n+1}+\lambda\alpha(\lambda+\alpha-1)(h_{n+1}-h_{n})\\
&+(\lambda\alpha+\alpha-1)\delta_{n+1}-\alpha(1-\alpha-\lambda\alpha)\delta_{n}.\end{aligned}

(67)

This inequality combined with (66), applied at index $n+1$ , guarantees that for any $\nu>0$ ,

\begin{aligned}\mathcal{E}_{n+1}-\mathcal{E}_{n}+\nu\mathcal{E}_{n+1}\leqslant{}&(\nu-\lambda)w_{n+1}+\lambda\alpha(\lambda+\alpha-1+\nu)(h_{n+1}-h_{n})\\
&+((1+\lambda)\alpha(1+\nu)-1)\delta_{n+1}-\alpha(1-\alpha-\lambda\alpha)\delta_{n}\\
&+\nu\lambda(\lambda\alpha+(1-\alpha)^{2})h_{n+1}.\end{aligned}

(68)

We make the following choice of parameters:

\alpha=1-\omega\sqrt{\kappa},\leavevmode\nobreak\ \nu=\tau\sqrt{\kappa},% \leavevmode\nobreak\ \lambda=1-\alpha-\nu=(\omega-\tau)\sqrt{\kappa}.

(69)

This set of parameters ensures that the following inequality is valid:

\begin{aligned}\mathcal{E}_{n+1}-\mathcal{E}_{n}+\tau\sqrt{\kappa}\mathcal{E}_{n+1}\leqslant{}&(2\tau-\omega)\sqrt{\kappa}w_{n+1}-(\omega\tau+(\omega-\tau)^{2}+\omega\tau(\omega-\tau)\sqrt{\kappa})\kappa\delta_{n+1}\\
&-(1-\omega\sqrt{\kappa})(\tau+\omega(\omega-\tau)\sqrt{\kappa})\sqrt{\kappa}\delta_{n}\\
&+\tau(\omega-\tau)(\omega-\tau+\omega\tau\sqrt{\kappa})\kappa^{\frac{3}{2}}h_{n+1}.\end{aligned}

(70)
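The substitution of (69) into (68) can be checked numerically. The following Python sketch evaluates the coefficients of (68) for the parameters (69) and compares the coefficients of $w_{n+1}$ , $\delta_{n+1}$ and $h_{n+1}$ with the closed forms in (70); it also checks that the coefficient of $h_{n+1}-h_{n}$ vanishes and that the coefficient of $\delta_{n}$ is nonpositive. The function names and the tolerance `tol` are illustrative assumptions, and a numerical check on a few values is of course not a proof.

```python
import math

def coefficients_68(alpha, lam, nu):
    """Coefficients of w_{n+1}, h_{n+1}-h_n, delta_{n+1}, delta_n, h_{n+1} in (68)."""
    return {
        "w": nu - lam,
        "dh": lam * alpha * (lam + alpha - 1.0 + nu),
        "delta_next": (1.0 + lam) * alpha * (1.0 + nu) - 1.0,
        "delta": -alpha * (1.0 - alpha - lam * alpha),
        "h": nu * lam * (lam * alpha + (1.0 - alpha) ** 2),
    }

def check_parameter_choice(omega, tau, kappa, tol=1e-12):
    """Return True if (68) with the parameters (69) matches the terms of (70)."""
    s = math.sqrt(kappa)
    alpha, nu = 1.0 - omega * s, tau * s
    lam = 1.0 - alpha - nu  # equals (omega - tau) * sqrt(kappa), see (69)
    c = coefficients_68(alpha, lam, nu)
    expected_w = (2.0 * tau - omega) * s
    expected_delta_next = -(omega * tau + (omega - tau) ** 2
                            + omega * tau * (omega - tau) * s) * kappa
    expected_h = tau * (omega - tau) * (omega - tau + omega * tau * s) * kappa ** 1.5
    return (abs(c["w"] - expected_w) < tol
            and abs(c["dh"]) < tol                      # lambda + alpha - 1 + nu = 0
            and abs(c["delta_next"] - expected_delta_next) < tol
            and abs(c["h"] - expected_h) < tol
            and c["delta"] <= 0.0)                      # this term can be dropped
```

The vanishing of the coefficient of $h_{n+1}-h_{n}$ is the reason for the choice $\lambda=1-\alpha-\nu$ in (69).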

Note that we consider a parameter $\alpha>0$ which implies that $1-\omega\sqrt{\kappa}>0$ . Suppose in addition that $\omega>2\tau$ . Then, we get that for any $n\in\mathbb{N}$ ,

\mathcal{E}_{n+1}-\mathcal{E}_{n}+\tau\sqrt{\kappa}\mathcal{E}_{n+1}\leqslant-% (\omega-2\tau)\sqrt{\kappa}w_{n+1}+\tau(\omega-\tau)(\omega-\tau+\omega\tau% \sqrt{\kappa})\kappa^{\frac{3}{2}}h_{n+1}.

(71)

As $F$ satisfies the assumption $\mathcal{G}^{2}_{\mu}$ , we can write that $\kappa h_{n+1}\leqslant w_{n+1}$ and hence:

\mathcal{E}_{n+1}-\mathcal{E}_{n}+\tau\sqrt{\kappa}\mathcal{E}_{n+1}\leqslant% \left(\tau(\omega-\tau)(\omega-\tau+\omega\tau\sqrt{\kappa})-\omega+2\tau% \right)\kappa^{\frac{3}{2}}h_{n+1}.

(72)

Thus, if

\left(1-\omega\sqrt{\kappa}\right)\tau^{3}-\omega\left(2-\omega\sqrt{\kappa}% \right)\tau^{2}+(\omega^{2}+2)\tau-\omega\leqslant 0,

(73)

then for any $n\in\mathbb{N}$ ,

\mathcal{E}_{n+1}-\mathcal{E}_{n}+\tau\sqrt{\kappa}\mathcal{E}_{n+1}\leqslant 0.

(74)
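Condition (73) is a cubic inequality in $\tau$ and is easy to explore numerically. The sketch below, written in Python, evaluates the left-hand side of (73) and returns one admissible $\tau$ by bisection on $[0,\frac{\omega}{2}]$ , using that the left-hand side equals $-\omega<0$ at $\tau=0$ and $\frac{\omega^{3}}{8}(1+\omega\sqrt{\kappa})>0$ at $\tau=\frac{\omega}{2}$ . The names `P` (borrowed from the notation of Section 5.3) and `admissible_tau`, as well as the iteration count, are illustrative assumptions; the returned value satisfies (73) but is not claimed to be the largest such $\tau$ in general.

```python
import math

def P(tau, omega, kappa):
    """Left-hand side of (73)."""
    s = math.sqrt(kappa)
    return ((1.0 - omega * s) * tau ** 3
            - omega * (2.0 - omega * s) * tau ** 2
            + (omega ** 2 + 2.0) * tau
            - omega)

def admissible_tau(omega, kappa, iters=200):
    """Bisection on [0, omega/2]; the returned tau satisfies P(tau) <= 0."""
    if not 0.0 < omega * math.sqrt(kappa) < 1.0:
        raise ValueError("need 0 < omega*sqrt(kappa) < 1 so that alpha lies in (0, 1)")
    lo, hi = 0.0, omega / 2.0   # P(lo) < 0 < P(hi)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if P(mid, omega, kappa) <= 0.0:
            lo = mid
        else:
            hi = mid
    return lo
```

For instance, `admissible_tau(1.5, 0.1)` returns a value slightly above $\frac{4}{9}$ and below $\frac{\omega}{2}=\frac{3}{4}$ .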

Note that the solutions of (73) automatically satisfy $\omega>2\tau$ . Elementary computations show that (74) implies

\mathcal{E}_{n}\leqslant\left(1-\tau\sqrt{\kappa}+\tau^{2}\kappa\right)^{n}% \mathcal{E}_{0}.

(75)

Note that since $x_{-1}=x_{0}$ ,

\mathcal{E}_{0}=w_{0}+\lambda\left(\lambda\alpha+(1-\alpha)^{2}\right)h_{0}=w_% {0}+\left((\omega-\tau)^{2}+(\omega-\tau)\omega\tau\sqrt{\kappa}\right)\kappa h% _{0},

(76)

and using the assumption $\mathcal{G}^{2}_{\mu}$ ,

\mathcal{E}_{0}\leqslant\left(1+(\omega-\tau)^{2}+(\omega-\tau)\omega\tau\sqrt% {\kappa}\right)w_{0}.

(77)

Moreover, for any $n\in\mathbb{N}$ , $w_{n}\leqslant\mathcal{E}_{n}$ . Thus, if (73) is satisfied, then for any $n\in\mathbb{N}$ :

F(x_{n})-F^{*}\leqslant\left(1+(\omega-\tau)^{2}+(\omega-\tau)\omega\tau\sqrt{% \kappa}\right)\left(1-\tau\sqrt{\kappa}+\tau^{2}\kappa\right)^{n}(F(x_{0})-F^{% *}).

(78)
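As a sanity check of (78), the following Python sketch runs the inertial scheme on a toy problem and compares the observed decay with the bound. Everything about the test problem is an assumption made for illustration: the smooth case $g=0$ , a diagonal quadratic $f(x)=\frac{1}{2}\sum_{i}d_{i}x_{i}^{2}$ with one zero eigenvalue (convex, not strongly convex, with quadratic growth constant $\mu$ equal to the smallest nonzero $d_{i}$ ), the step $s=\frac{1}{L}$ , and the choice $\omega=\frac{3}{2}$ , $\tau=\frac{2}{3\omega}$ , which Lemma 6 in Section 5.3 shows to satisfy (73) when $\kappa\leqslant\frac{1}{10}$ . The function names are hypothetical.

```python
import math

def vfista_quadratic(d, x0, alpha, n_iter):
    """Inertial gradient iterations with constant alpha and step 1/L.

    f(x) = 0.5 * sum_i d[i] * x[i]**2 and g = 0, so F* = 0.
    Returns the values F(x_n) - F* for n = 0, ..., n_iter.
    """
    L = max(d)
    x_prev = list(x0)  # convention x_{-1} = x_0
    x = list(x0)
    gaps = [0.5 * sum(di * xi * xi for di, xi in zip(d, x))]
    for _ in range(n_iter):
        y = [xi + alpha * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        x_prev = x
        x = [yi - di * yi / L for di, yi in zip(d, y)]  # gradient step at y
        gaps.append(0.5 * sum(di * xi * xi for di, xi in zip(d, x)))
    return gaps

def bound_78(kappa, omega, tau, n, gap0):
    """Right-hand side of (78)."""
    s = math.sqrt(kappa)
    prefactor = 1.0 + (omega - tau) ** 2 + (omega - tau) * omega * tau * s
    return prefactor * (1.0 - tau * s + tau * tau * kappa) ** n * gap0

d = [0.0, 0.01, 0.3, 1.0]      # L = 1, mu = 0.01, one flat direction
kappa = 0.01                   # kappa = mu / L
omega = 1.5
tau = 2.0 / (3.0 * omega)
alpha = 1.0 - omega * math.sqrt(kappa)
gaps = vfista_quadratic(d, [1.0, 1.0, 1.0, 1.0], alpha, 200)
bound_holds = all(g <= bound_78(kappa, omega, tau, n, gaps[0]) + 1e-15
                  for n, g in enumerate(gaps))
```

On this example the observed decay is faster than the worst-case bound, as expected for a quadratic objective.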

In addition, by applying the inequality $\|u\|^{2}\leqslant 2\|u+v\|^{2}+2\|v\|^{2}$ , we get that:

\delta_{n}\leqslant 2b_{n}+\frac{2}{\lambda^{2}}h_{n},

(79)

where $b_{n}=\|\lambda(x_{n}-x_{n}^{*})+x_{n}-x_{n-1}\|^{2}$ . The assumption $\mathcal{G}^{2}_{\mu}$ gives that

\delta_{n}\leqslant 2b_{n}+\frac{2}{\lambda^{2}\kappa}w_{n},

(80)

and given the definition of $\mathcal{E}_{n}$ we obtain,

\delta_{n}\leqslant\left(\frac{2}{\alpha}+\frac{2}{\lambda^{2}\kappa}\right)% \mathcal{E}_{n}.

(81)

By combining the above inequality with (75), we can finally prove that if (73) is valid, then

\|x_{n}-x_{n-1}\|=\mathcal{O}\left(e^{-\frac{1}{2}\tau\sqrt{\kappa}\left(1-% \tau\sqrt{\kappa}\right)n}\right).

(82)

5.3 Proof of Corollary 2

Let $(x_{n})_{n\in\mathbb{N}}$ be the sequence given by (V-FISTA) for some $\alpha>0$ and $s=\frac{1}{L}$ . According to Theorem 2, if $\alpha=1-\omega\sqrt{\kappa}$ for some $\omega\in\left(0,\frac{1}{\sqrt{\kappa}}\right)$ , then (19) and (20) are valid for any $\tau>0$ satisfying:

\left(1-\omega\sqrt{\kappa}\right)\tau^{3}-\omega\left(2-\omega\sqrt{\kappa}% \right)\tau^{2}+(\omega^{2}+2)\tau-\omega\leqslant 0.

(83)

Corollary 2 relies on the following lemma.

Lemma 6.

Let

P:(\tau,\omega,\kappa)\mapsto\left(1-\omega\sqrt{\kappa}\right)\tau^{3}-\omega% \left(2-\omega\sqrt{\kappa}\right)\tau^{2}+(\omega^{2}+2)\tau-\omega.

If $\kappa\leqslant\frac{1}{10}$ , then for any $\omega\geqslant\frac{3}{2}$ , $P\left(\frac{2}{3\omega},\omega,\kappa\right)<0$ .

Proof. Given the expression of $P$ , simple computations give that for any $\kappa\in(0,1)$ and $\omega>0$ ,

P\left(\frac{2}{3\omega},\omega,\kappa\right)=\left(1-\omega\sqrt{\kappa}% \right)\frac{8-12\omega^{2}}{27\omega^{3}}+\frac{8-3\omega^{2}}{9\omega}.

(84)

We define the function $\Phi$ as follows

\Phi(\omega;\kappa)=-9\omega^{4}+12\omega^{3}\sqrt{\kappa}+12\omega^{2}-8% \omega\sqrt{\kappa}+8,

such that $\frac{\Phi(\omega;\kappa)}{27\omega^{3}}=P\left(\frac{2}{3\omega},\omega,% \kappa\right)$ . We get that:

\frac{\partial\Phi}{\partial\omega}(\omega;\kappa)=-36\omega^{2}(\omega-\sqrt{% \kappa})+24\omega-8\sqrt{\kappa}.

(85)

Consequently, if $\omega\geqslant\frac{3}{2}$ and $\kappa\leqslant\frac{1}{10}$ , we have that $\omega-\sqrt{\kappa}>1$ and

\frac{\partial\Phi}{\partial\omega}(\omega;\kappa)\leqslant-36\omega^{2}+24% \omega<0.

Since $\Phi\left(\frac{3}{2};\kappa\right)=-\frac{169}{16}+\frac{57}{2}\sqrt{\kappa}$ which is strictly negative if $\kappa\leqslant\frac{1}{10}$ , we can deduce that $\Phi(\omega;\kappa)<0$ for any $\omega\geqslant\frac{3}{2}$ . As $\frac{\Phi(\omega;\kappa)}{27\omega^{3}}=P\left(\frac{2}{3\omega},\omega,% \kappa\right)$ , this lemma is proved.
In fact, $\omega\mapsto\frac{2}{3\omega}$ is a lower bound of the highest value of $\tau$ satisfying (22) for some $\omega\geqslant\frac{3}{2}$ and $\kappa=\frac{1}{10}$ as illustrated in Figure 3.
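The computations in the proof of Lemma 6 can be checked numerically. The Python sketch below verifies, on a grid of values, the identity $\frac{\Phi(\omega;\kappa)}{27\omega^{3}}=P\left(\frac{2}{3\omega},\omega,\kappa\right)$ and the strict negativity claimed by the lemma. The grid and the tolerance are arbitrary illustrative choices, so this is a sanity check rather than a substitute for the proof.

```python
import math

def P(tau, omega, kappa):
    """Polynomial P of Lemma 6."""
    s = math.sqrt(kappa)
    return ((1.0 - omega * s) * tau ** 3
            - omega * (2.0 - omega * s) * tau ** 2
            + (omega ** 2 + 2.0) * tau
            - omega)

def Phi(omega, kappa):
    """Function Phi from the proof of Lemma 6."""
    s = math.sqrt(kappa)
    return (-9.0 * omega ** 4 + 12.0 * omega ** 3 * s
            + 12.0 * omega ** 2 - 8.0 * omega * s + 8.0)

def check_lemma6(kappas, omegas, tol=1e-9):
    """True if P(2/(3*omega)) = Phi/(27*omega**3) and P(2/(3*omega)) < 0 on the grid."""
    for kappa in kappas:
        for omega in omegas:
            p = P(2.0 / (3.0 * omega), omega, kappa)
            if abs(p - Phi(omega, kappa) / (27.0 * omega ** 3)) > tol:
                return False
            if p >= 0.0:
                return False
    return True

kappas = [0.1 * (k + 1) / 20 for k in range(20)]   # grid in (0, 1/10]
omegas = [1.5 + 0.05 * j for j in range(60)]       # grid starting at 3/2
lemma6_holds_on_grid = check_lemma6(kappas, omegas)
```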

According to Lemma 6, if $\kappa\leqslant\frac{1}{10}$ and $\omega\in\left[\frac{3}{2},\frac{1}{\sqrt{\kappa}}\right)$ , then (19) and (20) are valid with $\tau=\frac{2}{3\omega}$ . Equivalently, writing $\theta=\omega\sqrt{\kappa}$ , if we consider $\alpha=1-\theta$ for some $\theta\in\left[\frac{3}{2}\sqrt{\kappa},1\right)$ , then (19) and (20) are satisfied with $\tau=\frac{2}{3\theta}\sqrt{\kappa}$ . This leads to the conclusion of Corollary 2.
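The effect of a non-optimal inertial parameter can be illustrated with a few lines of Python. The sketch below maps $(\kappa,\theta)$ to $\alpha=1-\theta$ , $\tau=\frac{2}{3\theta}\sqrt{\kappa}$ and to the contraction factor $1-\tau\sqrt{\kappa}+\tau^{2}\kappa$ that appears in (78). The function name `corollary2_parameters` is a hypothetical helper introduced here for illustration only.

```python
import math

def corollary2_parameters(kappa, theta):
    """Return (alpha, tau, factor) for alpha = 1 - theta.

    factor is 1 - tau*sqrt(kappa) + tau**2 * kappa, the per-iteration
    factor in (78) evaluated at tau = 2*sqrt(kappa) / (3*theta).
    """
    if not 0.0 < kappa <= 0.1:
        raise ValueError("requires 0 < kappa <= 1/10")
    if theta < 1.5 * math.sqrt(kappa) or theta >= 1.0:
        raise ValueError("requires (3/2)*sqrt(kappa) <= theta < 1")
    s = math.sqrt(kappa)
    alpha = 1.0 - theta
    tau = 2.0 * s / (3.0 * theta)
    factor = 1.0 - tau * s + tau * tau * kappa
    return alpha, tau, factor
```

Within the admissible range, the factor is smallest at $\theta=\frac{3}{2}\sqrt{\kappa}$ and moves closer to $1$ as $\theta$ increases, which quantifies the loss incurred when $\theta$ is chosen larger than necessary.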

Appendix A Technical proofs

A.1 Proof of Theorem 3

Our analysis follows that introduced in [8]. We set $\alpha=\left(2-\frac{\sqrt{2}}{2}\right)\sqrt{\mu}$ and we consider the following Lyapunov energy:

\mathcal{E}(t)=F(x(t))-F^{*}+\frac{1}{2}\|\lambda(x(t)-x^{*}(t))+\dot{x}(t)\|^{2}+\frac{\xi}{2}\|x(t)-x^{*}(t)\|^{2},

(86)

with $\lambda=\sqrt{\mu}$ and $\xi=-\left(1-\frac{\sqrt{2}}{2}\right)\mu$ .

Following the discussion of Section 4.2.2, the assumptions of Theorem 3 ensure that $\mathcal{E}$ is right-differentiable and for all $t\geqslant t_{0}$ ,

\dot{\mathcal{E}}(t)\leqslant-\lambda\langle\nabla F(x(t)),x(t)-x^{*}(t)% \rangle+(\lambda-\alpha)\|\dot{x}(t)\|^{2}+(\xi+\lambda(\lambda-\alpha))% \langle x(t)-x^{*}(t),\dot{x}(t)\rangle,

(87)

where $\dot{\mathcal{E}}$ denotes the right derivative of $\mathcal{E}$ . By using the convexity of $F$ and replacing the parameters by their value,

\dot{\mathcal{E}}(t)\leqslant-\sqrt{\mu}(F(x(t))-F^{*})-\left(1-\frac{\sqrt{2}% }{2}\right)\sqrt{\mu}\|\dot{x}(t)\|^{2}-(2-\sqrt{2})\mu\langle x(t)-x^{*}(t),% \dot{x}(t)\rangle.

(88)

Let us define $\delta=(2-\sqrt{2})\sqrt{\mu}$ . The above inequality guarantees that for all $t\geqslant t_{0}$ :

\dot{\mathcal{E}}(t)+\delta\mathcal{E}(t)\leqslant\left(1-\sqrt{2}\right)\sqrt% {\mu}\left(F(x(t))-F^{*}\right)+\frac{\sqrt{2}-1}{2}\mu^{\frac{3}{2}}\|x(t)-x^% {*}(t)\|^{2}.

(89)

As $F$ satisfies $\mathcal{G}^{2}_{\mu}$ and $\|x(t)-x^{*}(t)\|=d(x(t),X^{*})$ for all $t\geqslant t_{0}$ , we obtain that for all $t\geqslant t_{0}$ :

\dot{\mathcal{E}}(t)+\delta\mathcal{E}(t)\leqslant\left(\left(1-\sqrt{2}\right% )\sqrt{\mu}\frac{\mu}{2}+\frac{\sqrt{2}-1}{2}\mu^{\frac{3}{2}}\right)\|x(t)-x^% {*}(t)\|^{2}\leqslant 0.

(90)

We refer the reader to the proof of [8, Theorem 1] for further developments on each of the above steps and a discussion on the value of the parameters. Lemma 1 then guarantees that, for all $t\geqslant t_{0}$ ,

\mathcal{E}(t)\leqslant\mathcal{E}(t_{0})e^{-(2-\sqrt{2})\sqrt{\mu}(t-t_{0})}.

(91)

Since $F$ satisfies $\mathcal{G}^{2}_{\mu}$ , elementary computations show that:

F(x(t))-F^{*}+\xi\|x(t)-x^{*}(t)\|^{2}\geqslant(\sqrt{2}-1)(F(x(t))-F^{*}),

(92)

and consequently,

\forall t\geqslant t_{0},\leavevmode\nobreak\ \mathcal{E}(t)\geqslant\left(% \sqrt{2}-1\right)(F(x(t))-F^{*})+\frac{1}{2}\|\lambda(x(t)-x^{*}(t))+\dot{x}(t% )\|^{2}.

(93)

This inequality implies that for all $t\geqslant t_{0}$ :

F(x(t))-F^{*}\leqslant\frac{1}{\sqrt{2}-1}\mathcal{E}(t),

(94)

and

\|\lambda(x(t)-x^{*}(t))+\dot{x}(t)\|^{2}\leqslant 2\mathcal{E}(t).

(95)

The first statement of Theorem 3 can be demonstrated by combining (91) and (94) and rewriting $\mathcal{E}(t_{0})$ (see [8, Section 6.1] for further details). We prove the second result as follows.
Using inequality $\|u\|^{2}\leqslant 2\|u+v\|^{2}+2\|v\|^{2}$ , we get that:

\begin{aligned}\|\dot{x}(t)\|^{2}&\leqslant 2\|\lambda(x(t)-x^{*}(t))+\dot{x}(t)\|^{2}+2\mu\|x(t)-x^{*}(t)\|^{2}\\
&\leqslant 2\|\lambda(x(t)-x^{*}(t))+\dot{x}(t)\|^{2}+4(F(x(t))-F^{*}).\end{aligned}

(96)

By applying the previous inequalities, we have that:

\|\dot{x}(t)\|^{2}\leqslant 4\left(1+\frac{1}{\sqrt{2}-1}\right)\mathcal{E}(t).

(97)

The bound on the energy given in (91) leads us to the conclusion:

\|\dot{x}(t)\|=\mathcal{O}\left(e^{-\left(1-\frac{\sqrt{2}}{2}\right)\sqrt{\mu}t}\right).

(98)
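The decay rate $e^{-(2-\sqrt{2})\sqrt{\mu}(t-t_{0})}$ obtained in (91) can be observed on a simple example. The Python sketch below integrates the Heavy Ball system with the friction parameter $\alpha=\left(2-\frac{\sqrt{2}}{2}\right)\sqrt{\mu}$ for the one-dimensional quadratic $F(x)=\frac{\mu}{2}x^{2}$ , which satisfies $\mathcal{G}^{2}_{\mu}$ , using a classical fourth-order Runge-Kutta scheme. The test function, the initial velocity $\dot{x}(0)=0$ , the step `dt` and the prefactor $\frac{1}{\sqrt{2}-1}$ (suggested by (94) but not the exact constant of Theorem 3) are all assumptions made for illustration.

```python
import math

def simulate_hbf_quadratic(mu, x0, t_end, dt=1e-3):
    """RK4 integration of x'' + alpha*x' + mu*x = 0, x(0) = x0, x'(0) = 0.

    alpha = (2 - sqrt(2)/2) * sqrt(mu). Returns a list of (t, F(x(t)))
    with F(x) = 0.5 * mu * x**2, so that F* = 0.
    """
    alpha = (2.0 - math.sqrt(2.0) / 2.0) * math.sqrt(mu)

    def rhs(x, v):
        return v, -alpha * v - mu * x

    x, v, t = x0, 0.0, 0.0
    out = [(t, 0.5 * mu * x * x)]
    for _ in range(int(round(t_end / dt))):
        k1x, k1v = rhs(x, v)
        k2x, k2v = rhs(x + 0.5 * dt * k1x, v + 0.5 * dt * k1v)
        k3x, k3v = rhs(x + 0.5 * dt * k2x, v + 0.5 * dt * k2v)
        k4x, k4v = rhs(x + dt * k3x, v + dt * k3v)
        x += dt * (k1x + 2.0 * k2x + 2.0 * k3x + k4x) / 6.0
        v += dt * (k1v + 2.0 * k2v + 2.0 * k3v + k4v) / 6.0
        t += dt
        out.append((t, 0.5 * mu * x * x))
    return out

mu = 0.25
trajectory = simulate_hbf_quadratic(mu, x0=1.0, t_end=20.0)
rate = (2.0 - math.sqrt(2.0)) * math.sqrt(mu)
prefactor = 1.0 / (math.sqrt(2.0) - 1.0)   # illustrative constant, see lead-in
f0 = trajectory[0][1]
decay_ok = all(f <= prefactor * f0 * math.exp(-rate * t) + 1e-12
               for t, f in trajectory)
```

For this strongly convex quadratic the trajectory is underdamped and decays faster than the guaranteed envelope; the guarantee is a worst-case statement over the whole class $\mathcal{G}^{2}_{\mu}$ .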

A.2 Proof of Lemma 1

Let $\phi^{\prime}$ denote the derivative of $\phi$ where it is well defined. According to [38], the function $\phi$ is differentiable except on a countable set of points. This implies that there exist $N\in\mathbb{N}^{*}\cup\{+\infty\}$ and points $(t_{i})_{i\in\llbracket 1,N\rrbracket}$ such that for any $i\in\llbracket 0,N-1\rrbracket$ and $t\in(t_{i},t_{i+1})$ , $\phi^{\prime}(t)$ is well defined and equal to $\phi_{+}(t)$ . We suppose that the points are ordered so that $t_{0}<t_{i}<t_{i+1}$ for any $i$ , and that $t_{N}=+\infty$ when $N\neq+\infty$ .
Suppose that $t\in(t_{0},t_{1})$ .

• If $\phi$ is differentiable at $t_{0}$ , then $\phi$ is differentiable on the interval $[t_{0},t_{1})$ and $\phi^{\prime}=\phi_{+}$ on this interval. Consequently, inequality (38) ensures that

\phi(t)\leqslant\phi(t_{0})+\int_{t_{0}}^{t}\psi(u)du.

• If $\phi$ is not differentiable at $t_{0}$ , then inequality (38) guarantees that for $h>0$ sufficiently small,

\phi(t_{0}+h)\leqslant\phi(t_{0})+h\psi(t_{0}).

Then, the previous discussion allows us to say that $\phi$ is differentiable on $[t_{0}+h,t_{1})$ . As a consequence, we can say that there exists $H\in(0,t-t_{0})$ such that for any $h\in(0,H)$ :

\phi(t)\leqslant\phi(t_{0}+h)+\int_{t_{0}+h}^{t}\psi(u)du\leqslant\phi(t_{0})+% \int_{t_{0}}^{t}\psi(u)du+\int_{t_{0}}^{t_{0}+h}\left(\psi(t_{0})-\psi(u)% \right)du.

As this inequality is valid for any $h\in(0,H)$ , we finally get the wanted inequality (39).

We now suppose that $t=t_{1}$ . We just proved that (39) is true for all $t\in(t_{0},t_{1})$ . Therefore, for all $t\in(t_{0},t_{1})$ ,

\phi(t)\leqslant\phi(t_{0})+\int_{t_{0}}^{t_{1}}\psi(u)du,

and as $\phi$ is continuous we get the same inequality at $t=t_{1}$ .
By using the same arguments, we can prove that (39) is valid for any $t>t_{1}$ . Indeed, if $t>t_{1}$ , then either $t\in(t_{i},t_{i+1})$ or $t=t_{i}$ for some $i\in\llbracket 1,N\rrbracket$ . In both cases, we get the wanted inequality by applying the above reasoning to the consecutive intervals $(t_{j},t_{j+1})$ for $0\leqslant j\leqslant i$ .

A.3 Proof of Lemma 2

Let $n\in\mathbb{N}^{*}$ . By rewriting

x_{n}-x_{n}^{*}=\frac{1}{2}\left((x_{n}-x_{n-1})+(x_{n-1}-x_{n-1}^{*})+(x_{n-1% }^{*}-x_{n}^{*})+(x_{n}-x_{n}^{*})\right),

we get that:

\langle x_{n}-x_{n}^{*},x_{n}-x_{n-1}\rangle=\frac{1}{2}\delta_{n}+\frac{1}{2}% \langle(x_{n-1}-x_{n-1}^{*})+(x_{n-1}^{*}-x_{n}^{*})+(x_{n}-x_{n}^{*}),x_{n}-x% _{n-1}\rangle.

Noticing that $x_{n}-x_{n-1}=(x_{n}-x_{n}^{*})+(x_{n}^{*}-x_{n-1}^{*})+(x_{n-1}^{*}-x_{n-1})$ leads to:

\begin{aligned}2\langle x_{n}-x_{n}^{*},x_{n}-x_{n-1}\rangle={}&\delta_{n}+\langle x_{n-1}-x_{n-1}^{*},x_{n}-x_{n}^{*}\rangle+\langle x_{n-1}-x_{n-1}^{*},x_{n}^{*}-x_{n-1}^{*}\rangle\\
&-h_{n-1}-\langle x_{n}^{*}-x_{n-1}^{*},x_{n}-x_{n}^{*}\rangle+\langle x_{n-1}-x_{n-1}^{*},x_{n}^{*}-x_{n-1}^{*}\rangle\\
&-\gamma_{n}^{*}+\langle x_{n}-x_{n}^{*},x_{n}^{*}-x_{n-1}^{*}\rangle-\langle x_{n-1}-x_{n-1}^{*},x_{n}-x_{n}^{*}\rangle+h_{n}\\
={}&h_{n}-h_{n-1}+\delta_{n}-\gamma_{n}^{*}+2\langle x_{n-1}-x_{n-1}^{*},x_{n}^{*}-x_{n-1}^{*}\rangle.\end{aligned}

The second claim is proved using the same approach. We rewrite

x_{n-1}-x_{n-1}^{*}=\frac{1}{2}\left((x_{n-1}-x_{n})+(x_{n}-x_{n}^{*})+(x_{n}^{*}-x_{n-1}^{*})+(x_{n-1}-x_{n-1}^{*})\right),

and consequently:

2\langle x_{n-1}-x_{n-1}^{*},x_{n}-x_{n-1}\rangle=-\delta_{n}+\langle(x_{n}-x_{n}^{*})+(x_{n}^{*}-x_{n-1}^{*})+(x_{n-1}-x_{n-1}^{*}),x_{n}-x_{n-1}\rangle.

By applying the same rewriting of $x_{n}-x_{n-1}$ , simple calculations give that:

\langle x_{n-1}-x_{n-1}^{*},x_{n}-x_{n-1}\rangle=\frac{1}{2}(h_{n}-h_{n-1}-% \delta_{n}+\gamma_{n}^{*})+\langle x_{n}-x_{n}^{*},x_{n}^{*}-x_{n-1}^{*}\rangle.
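Both claims of Lemma 2 are algebraic identities and can be tested on random data. The Python sketch below does so with a box $[-1,1]^{d}$ playing the role of the set of minimizers $X^{*}$ (an arbitrary illustrative choice, as are the function names), and also checks the two sign properties of the projection that are used throughout the proofs.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def project_onto_box(x, lower=-1.0, upper=1.0):
    """Euclidean projection onto the box [lower, upper]^d (stand-in for X*)."""
    return [min(max(xi, lower), upper) for xi in x]

def lemma2_check(x_prev, x):
    """Residuals of both claims of Lemma 2 and the two projection cross terms."""
    p_prev, p = project_onto_box(x_prev), project_onto_box(x)
    h = dot(sub(x, p), sub(x, p))                           # h_n
    h_prev = dot(sub(x_prev, p_prev), sub(x_prev, p_prev))  # h_{n-1}
    delta = dot(sub(x, x_prev), sub(x, x_prev))             # delta_n
    gamma = dot(sub(p, p_prev), sub(p, p_prev))             # gamma_n^*
    cross_prev = dot(sub(x_prev, p_prev), sub(p, p_prev))   # <= 0 by projection
    cross_cur = dot(sub(x, p), sub(p, p_prev))              # >= 0 by projection
    first = (dot(sub(x, p), sub(x, x_prev))
             - (0.5 * (h - h_prev + delta - gamma) + cross_prev))
    second = (dot(sub(x_prev, p_prev), sub(x, x_prev))
              - (0.5 * (h - h_prev - delta + gamma) + cross_cur))
    return first, second, cross_prev, cross_cur

rng = random.Random(0)
results = [lemma2_check([rng.uniform(-3.0, 3.0) for _ in range(5)],
                        [rng.uniform(-3.0, 3.0) for _ in range(5)])
           for _ in range(1000)]
```

The two residuals vanish up to rounding for any choice of points, while the signs of the cross terms rely on $x_{n}^{*}$ and $x_{n-1}^{*}$ being projections onto a convex set.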

A.4 Proof of Lemma 3

The first claim is straightforward as Lemma 3.1 of [16] ensures that:

F(x_{n+1})-F(x_{n})\leqslant\frac{L}{2}\left(\|y_{n}-x_{n}\|^{2}-\|x_{n+1}-x_{% n}\|^{2}\right).

By writing $y_{n}=x_{n}+\alpha_{n}(x_{n}-x_{n-1})$ and $\frac{2}{L}(F(x_{n+1})-F(x_{n}))=w_{n+1}-w_{n}$ , we can conclude.

By applying Lemma 3.1 of [16] to another pair of points, we get that:

F(x_{n+1})-F^{*}\leqslant\frac{L}{2}\left(\|y_{n}-x_{n}^{*}\|^{2}-\|x_{n+1}-x_% {n}^{*}\|^{2}\right).

It follows that:

\begin{aligned}w_{n+1}&\leqslant\|x_{n}+\alpha_{n}(x_{n}-x_{n-1})-x_{n}^{*}\|^{2}-\|(x_{n+1}-x_{n+1}^{*})+(x_{n+1}^{*}-x_{n}^{*})\|^{2}\\
&\leqslant h_{n}+\alpha_{n}^{2}\delta_{n}-h_{n+1}-\gamma_{n+1}^{*}+2\alpha_{n}\langle x_{n}-x_{n}^{*},x_{n}-x_{n-1}\rangle\\
&\quad-2\langle x_{n+1}-x_{n+1}^{*},x_{n+1}^{*}-x_{n}^{*}\rangle.\end{aligned}

Since the first claim of Lemma 2 ensures that:

\langle x_{n}-x_{n}^{*},x_{n}-x_{n-1}\rangle=\frac{1}{2}(h_{n}-h_{n-1}+\delta_% {n}-\gamma_{n}^{*})+\langle x_{n-1}-x_{n-1}^{*},x_{n}^{*}-x_{n-1}^{*}\rangle,

we can deduce that:

\begin{aligned}w_{n+1}\leqslant{}&(1+\alpha_{n})h_{n}+(\alpha_{n}^{2}+\alpha_{n})\delta_{n}-\alpha_{n}h_{n-1}-h_{n+1}-\gamma_{n+1}^{*}-\alpha_{n}\gamma_{n}^{*}\\
&+2\alpha_{n}\langle x_{n-1}-x_{n-1}^{*},x_{n}^{*}-x_{n-1}^{*}\rangle-2\langle x_{n+1}-x_{n+1}^{*},x_{n+1}^{*}-x_{n}^{*}\rangle.\end{aligned}

A.5 Proof of Lemma 4

Let $(x_{n})_{n\in\mathbb{N}}$ be the sequence provided by (V-FISTA) and $s=\frac{1}{L}$ . We can write the Lyapunov energy $(\mathcal{E}_{n})_{n\in\mathbb{N}}$ in the following way:

\mathcal{E}_{n}=w_{n}+(1-\lambda)\delta_{n}+\lambda(h_{n}-h_{n-1})+\lambda^{2}% h_{n-1}+\lambda\gamma_{n}^{*}+2\lambda\langle x_{n}-x_{n}^{*},x_{n}^{*}-x_{n-1% }^{*}\rangle.

Since $\langle x_{n}-x_{n}^{*},x_{n}^{*}-x_{n-1}^{*}\rangle\geqslant 0$ , we can write that:

\mathcal{E}_{n}\geqslant w_{n}+\lambda(h_{n}-h_{n-1}),

which leads to the final result.

A.6 Proof of Lemma 5

Let $(x_{n})_{n\in\mathbb{N}}$ be the sequence provided by (V-FISTA). By using the expression (65) of $\mathcal{E}_{n}$ , we get that:

\begin{aligned}\mathcal{E}_{n+1}-\mathcal{E}_{n}={}&w_{n+1}-w_{n}+\lambda\left(\alpha+\lambda\alpha+(1-\alpha)^{2}\right)(h_{n+1}-h_{n})+\alpha(1+\lambda)\delta_{n+1}\\
&-\lambda\alpha(h_{n}-h_{n-1})-\alpha(1+\lambda)\delta_{n}-\lambda\alpha\gamma_{n+1}^{*}+\lambda\alpha\gamma_{n}^{*}\\
&+2\lambda\alpha\langle x_{n}-x_{n}^{*},x_{n+1}^{*}-x_{n}^{*}\rangle-2\lambda\alpha\langle x_{n-1}-x_{n-1}^{*},x_{n}^{*}-x_{n-1}^{*}\rangle.\end{aligned}

(99)

The first claim of Lemma 3 combined with the inequality $-\lambda\alpha\gamma_{n+1}^{*}+2\lambda\alpha\langle x_{n}-x_{n}^{*},x_{n+1}^{*}-x_{n}^{*}\rangle\leqslant 0$ leads to:

\begin{aligned}\mathcal{E}_{n+1}-\mathcal{E}_{n}\leqslant{}&\lambda\left(\alpha+\lambda\alpha+(1-\alpha)^{2}\right)(h_{n+1}-h_{n})+(\alpha+\lambda\alpha-1)\delta_{n+1}\\
&-\lambda\alpha(h_{n}-h_{n-1})-\alpha(1+\lambda-\alpha)\delta_{n}+\lambda\alpha\gamma_{n}^{*}\\
&-2\lambda\alpha\langle x_{n-1}-x_{n-1}^{*},x_{n}^{*}-x_{n-1}^{*}\rangle.\end{aligned}

(100)

According to the second claim of Lemma 3, we have

\begin{aligned}0\leqslant{}&-\lambda w_{n+1}+\lambda(\alpha+\alpha^{2})\delta_{n}+\lambda\alpha(h_{n}-h_{n-1})-\lambda(h_{n+1}-h_{n})\\
&-\lambda\gamma_{n+1}^{*}-\lambda\alpha\gamma_{n}^{*}+2\lambda\alpha\langle x_{n-1}-x_{n-1}^{*},x_{n}^{*}-x_{n-1}^{*}\rangle\\
&-2\lambda\langle x_{n+1}-x_{n+1}^{*},x_{n+1}^{*}-x_{n}^{*}\rangle,\end{aligned}

(101)

and as $\langle x_{n+1}-x_{n+1}^{*},x_{n+1}^{*}-x_{n}^{*}\rangle\geqslant 0$ , it follows that

\begin{aligned}\mathcal{E}_{n+1}-\mathcal{E}_{n}\leqslant{}&-\lambda w_{n+1}+\lambda\left(\alpha+\lambda\alpha+(1-\alpha)^{2}-1\right)(h_{n+1}-h_{n})\\
&+(\alpha+\lambda\alpha-1)\delta_{n+1}-\alpha(1-\alpha-\lambda\alpha)\delta_{n}.\end{aligned}

(102)

By expanding $\alpha+\lambda\alpha+(1-\alpha)^{2}-1$ we get to the conclusion.
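Explicitly, this last computation reads

\alpha+\lambda\alpha+(1-\alpha)^{2}-1=\alpha+\lambda\alpha-2\alpha+\alpha^{2}=\alpha(\lambda+\alpha-1),

so that the coefficient of $h_{n+1}-h_{n}$ in (102) is $\lambda\alpha(\lambda+\alpha-1)$ , which is exactly the coefficient appearing in (67).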

Acknowledgements

JFA acknowledges support of the EU Horizon 2020 research and innovation program under the Marie Skłodowska-Curie NoMADS grant agreement No777826, and PEPR PDE-AI. HL acknowledges the financial support of the Ministry of Education, University and Research (grant ML4IP R205T7J2KP). This work was supported by the ANR MICROBLIND (grant ANR-21-CE48-0008) and the ANR Masdol (grant ANR-PRC-CE23).

References

[1] T. Alamo, P. Krupa, and D. Limon. Restart of accelerated first-order methods with linear convergence under a quadratic functional growth condition. IEEE Transactions on Automatic Control, 68(1):612–619, 2022.
[2] V. Apidopoulos, N. Ginatta, and S. Villa. Convergence rates for the Heavy-Ball continuous dynamics for non-convex optimization, under Polyak–Łojasiewicz condition. Journal of Global Optimization, pages 1–27, 2022.
[3] H. Attouch, X. Goudou, and P. Redont. The Heavy-Ball with friction method, I. the continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative dynamical system. Communications in Contemporary Mathematics, 2(01):1–34, 2000.
[4] H. Attouch and J. Peypouquet. The rate of convergence of Nesterov’s accelerated forward-backward method is actually faster than $1/k^{2}$ . SIAM Journal on Optimization, 26(3):1824–1834, 2016.
[5] J.-F. Aujol, L. Calatroni, C. Dossal, H. Labarrière, and A. Rondepierre. Parameter-free FISTA by adaptive restart and backtracking. arXiv preprint arXiv:2307.14323, 2023.
[6] J.-F. Aujol, C. Dossal, H. Labarrière, and A. Rondepierre. FISTA restart using an automatic estimation of the growth parameter. Hal Preprint 03153525, May 2022.
[7] J.-F. Aujol, C. Dossal, and A. Rondepierre. Convergence rates of the Heavy Ball method for quasi-strongly convex optimization. SIAM Journal on Optimization, 32(3):1817–1842, 2022.
[8] J.-F. Aujol, C. Dossal, and A. Rondepierre. Convergence rates of the Heavy-Ball method under the Łojasiewicz property. Mathematical Programming, pages 1–60, 2022.
[9] J.-F. Aujol, C. Dossal, and A. Rondepierre. FISTA is an automatic geometrically optimized algorithm for strongly convex functions. Mathematical Programming, 204(1-2), 2024.
[10] A. Beck. First-order methods in optimization. SIAM, 2017.
[11] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
[12] P. Bégout, J. Bolte, and M. A. Jendoubi. On damped second-order gradient systems. Journal of Differential Equations, 259(7):3115–3143, 2015.
[13] J. Bolte, A. Daniilidis, A. Lewis, and M. Shiota. Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18(2):556–572, 2007.
[14] J. F. Bonnans, R. Cominetti, and A. Shapiro. Sensitivity analysis of optimization problems under second order regular constraints. Mathematics of Operations Research, 23(4):806–831, 1998.
[15] S. Bubeck, Y. T. Lee, and M. Singh. A geometric alternative to Nesterov’s accelerated gradient descent. arXiv preprint arXiv:1506.08187, 2015.
[16] A. Chambolle and C. Dossal. On the convergence of the iterates of the “fast iterative shrinkage/thresholding algorithm”. Journal of Optimization theory and Applications, 166(3):968–982, 2015.
[17] S. Chen, S. Ma, and W. Liu. Geometric descent method for convex composite minimization. Advances in Neural Information Processing Systems, 30, 2017.
[18] Y. Drori and M. Teboulle. Performance of first-order methods for smooth convex minimization: a novel approach. Mathematical Programming, 145(1):451–482, June 2014.
[19] O. Fercoq and Z. Qu. Adaptive restart of accelerated gradient methods under local quadratic growth condition. IMA Journal of Numerical Analysis, 39(4):2069–2095, 2019.
[20] G. Garrigos, L. Rosasco, and S. Villa. Convergence of the forward-backward algorithm: beyond the worst-case with the help of geometry. Mathematical Programming, pages 1–60, 2022.
[21] E. Ghadimi, H. R. Feyzmahdavian, and M. Johansson. Global convergence of the Heavy-Ball method for convex optimization. In 2015 European control conference (ECC), pages 310–315. IEEE, 2015.
[22] P. Giselsson and S. Boyd. Monotonicity and restart in fast gradient methods. In 53rd IEEE Conference on Decision and Control, pages 5058–5063. IEEE, 2014.
[23] J.-B. Hiriart-Urruty. At what points is the projection mapping differentiable? The American Mathematical Monthly, 89(7):456–458, 1982.
[24] S. Łojasiewicz. Une propriété topologique des sous-ensembles analytiques réels. In Les Équations aux Dérivées Partielles (Paris, 1962), pages 87–89. Éditions du Centre National de la Recherche Scientifique, Paris, 1963.
[25] S. Łojasiewicz. Sur la géométrie semi- et sous-analytique. Annales de l’Institut Fourier. Université de Grenoble, 43(5):1575–1595, 1993.
[26] I. Necoara, Y. Nesterov, and F. Glineur. Linear convergence of first order methods for non-strongly convex optimization. Mathematical Programming, 175(1):69–107, 2019.
[27] Y. Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^{2})$ . In Sov. Math. Dokl, volume 27, 1983.
[28] Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003.
[29] Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical programming, 140(1):125–161, 2013.
[30] B. O’Donoghue and E. Candes. Adaptive restart for accelerated gradient schemes. Foundations of computational mathematics, 15(3):715–732, 2015.
[31] B. Polyak and P. Shcherbakov. Lyapunov functions: An optimization theory perspective. IFAC-PapersOnLine, 50(1):7456–7461, 2017. 20th IFAC World Congress.
[32] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. Ussr computational mathematics and mathematical physics, 4(5):1–17, 1964.
[33] J. Renegar and B. Grimmer. A simple nearly optimal restart scheme for speeding up first-order methods. Foundations of computational mathematics, 22(1):211–256, 2022.
[34] A. Shapiro. Differentiability properties of metric projections onto convex sets. Journal of Optimization Theory and Applications, 169(3):953–964, 2016.
[35] J. W. Siegel. Accelerated first-order methods: Differential equations and Lyapunov functions. arXiv preprint arXiv:1903.05671, 2019.
[36] A. Taylor and Y. Drori. An optimal gradient method for smooth strongly convex minimization. Mathematical Programming, pages 1–38, 2022.
[37] B. Van Scoy, R. A. Freeman, and K. M. Lynch. The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Systems Letters, 2(1):49–54, 2017.
[38] G. C. Young. A note on derivates and differential coefficients. Acta mathematica, 37(1):141–154, 1914.
