-
Study of the behaviour of Nesterov Accelerated Gradient in a non convex setting: the strongly quasar convex case
Authors:
J Hermant,
J. -F Aujol,
C Dossal,
A Rondepierre
Abstract:
We study the convergence of Nesterov Accelerated Gradient (NAG) minimization algorithm applied to a class of non convex functions called strongly quasar convex functions, which can exhibit highly non convex behaviour. We show that in the case of strongly quasar convex functions, NAG can achieve an accelerated convergence speed at the cost of a lower curvature assumption. We provide a continuous analysis through high resolution ODEs, in which negative friction may appear. Finally, we investigate connections with a weaker class of non convex functions (smooth Polyak-Łojasiewicz functions) by characterizing the gap between this class and the one of smooth strongly quasar convex functions.
Submitted 30 May, 2024;
originally announced May 2024.
-
Heavy Ball Momentum for Non-Strongly Convex Optimization
Authors:
Jean-François Aujol,
Charles Dossal,
Hippolyte Labarrière,
Aude Rondepierre
Abstract:
When considering the minimization of a quadratic or strongly convex function, it is well known that first-order methods involving an inertial term weighted by a constant-in-time parameter are particularly efficient (see Polyak [32], Nesterov [28], and references therein). By setting the inertial parameter according to the condition number of the objective function, these methods guarantee a fast exponential decay of the error. We prove that this type of schemes (which are later called Heavy Ball schemes) is relevant in a relaxed setting, i.e. for composite functions satisfying a quadratic growth condition. In particular, we adapt V-FISTA, introduced by Beck in [10] for strongly convex functions, to this broader class of functions. To the authors' knowledge, the resulting worst-case convergence rates are faster than any other in the literature, including those of FISTA restart schemes. No assumption on the set of minimizers is required and guarantees are also given in the non-optimal case, i.e. when the condition number is not exactly known. This analysis follows the study of the corresponding continuous-time dynamical system (Heavy Ball with friction system), for which new convergence results of the trajectory are shown.
Submitted 11 March, 2024;
originally announced March 2024.
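As an illustration of the constant-inertia idea mentioned in the abstract above, here is a minimal sketch of the classical Polyak heavy ball iteration on a diagonal quadratic, with step size and momentum set from the condition number. It is not the V-FISTA variant studied in the paper; the function names, the test problem and the iteration count are assumptions chosen for the example.

```python
import numpy as np


def heavy_ball_quadratic(eigs, x0, n_iter):
    """Polyak heavy ball on f(x) = 0.5 * sum(eigs * x**2).

    Uses the classical tuning for quadratics with eigenvalues in [mu, L]:
    step 4 / (sqrt(L) + sqrt(mu))**2 and
    momentum ((sqrt(L) - sqrt(mu)) / (sqrt(L) + sqrt(mu)))**2.
    """
    mu, L = eigs.min(), eigs.max()
    alpha = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
    beta = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
    x_prev = x0.copy()
    x = x0.copy()
    for _ in range(n_iter):
        grad = eigs * x  # gradient of the diagonal quadratic
        x, x_prev = x - alpha * grad + beta * (x - x_prev), x
    return x


def gradient_descent_quadratic(eigs, x0, n_iter):
    """Plain gradient descent with step 1 / L, for comparison."""
    L = eigs.max()
    x = x0.copy()
    for _ in range(n_iter):
        x = x - (eigs * x) / L
    return x


# Hypothetical test problem: condition number L / mu = 100.
eigs = np.linspace(1.0, 100.0, 20)
x0 = np.ones(20)
err_hb = np.linalg.norm(heavy_ball_quadratic(eigs, x0, 300))
err_gd = np.linalg.norm(gradient_descent_quadratic(eigs, x0, 300))
```

On this toy problem the inertial iterate ends up much closer to the minimizer (the origin) than gradient descent after the same number of iterations, which is the kind of fast exponential decay the abstract refers to for well-tuned inertial schemes.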
-
Parameter-Free FISTA by Adaptive Restart and Backtracking
Authors:
Jean-François Aujol,
Luca Calatroni,
Charles Dossal,
Hippolyte Labarrière,
Aude Rondepierre
Abstract:
We consider a combined restarting and adaptive backtracking strategy for the popular Fast Iterative Shrinking-Thresholding Algorithm frequently employed for accelerating the convergence speed of large-scale structured convex optimization problems. Several variants of FISTA enjoy a provable linear convergence rate for the function values $F(x_n)$ of the form $\mathcal{O}( e^{-K\sqrt{μ/L}~n})$ under the prior knowledge of problem conditioning, i.e. of the ratio between the (Łojasiewicz) parameter $μ$ determining the growth of the objective function and the Lipschitz constant $L$ of its smooth component. These parameters are nonetheless hard to estimate in many practical cases. Recent works address the problem by estimating either parameter via suitable adaptive strategies. In our work both parameters can be estimated at the same time by means of an algorithmic restarting scheme where, at each restart, a non-monotone estimation of $L$ is performed. For this scheme, theoretical convergence results are proved, showing that a $\mathcal{O}( e^{-K\sqrt{μ/L}n})$ convergence speed can still be achieved along with quantitative estimates of the conditioning. The resulting Free-FISTA algorithm is therefore parameter-free. Several numerical results are reported to confirm the practical interest of its use in many exemplar problems.
Submitted 26 July, 2023;
originally announced July 2023.
-
Fast convergence of inertial dynamics with Hessian-driven damping under geometry assumptions
Authors:
Jean-François Aujol,
Charles Dossal,
Văn Hào Hoàng,
Hippolyte Labarrière,
Aude Rondepierre
Abstract:
First-order optimization algorithms can be considered as a discretization of ordinary differential equations (ODEs) \cite{su2014differential}. In this perspective, studying the properties of the corresponding trajectories may lead to convergence results which can be transferred to the numerical scheme. In this paper we analyse the following ODE introduced by Attouch et al. in \cite{attouch2016fast}: \begin{equation*} \forall t\geqslant t_0,~\ddot{x}(t)+\frac{α}{t}\dot{x}(t)+βH_F(x(t))\dot{x}(t)+\nabla F(x(t))=0,\end{equation*} where $α>0$, $β>0$ and $H_F$ denotes the Hessian of $F$. This ODE can be derived to build numerical schemes which do not require $F$ to be twice differentiable as shown in \cite{attouch2020first,attouch2021convergence}. We provide strong convergence results on the error $F(x(t))-F^*$ and integrability properties on $\|\nabla F(x(t))\|$ under some geometry assumptions on $F$ such as quadratic growth around the set of minimizers. In particular, we show that the decay rate of the error for a strongly convex function is $O(t^{-α-\varepsilon})$ for any $\varepsilon>0$. These results are briefly illustrated at the end of the paper.
Submitted 20 June, 2022; v1 submitted 14 June, 2022;
originally announced June 2022.
-
Nesterov's acceleration and Polyak's heavy ball method in continuous time: convergence rate analysis under geometric conditions and perturbations
Authors:
Othmane Sebbouh,
Charles Dossal,
Aude Rondepierre
Abstract:
In this article a family of second order ODEs associated to inertial gradient descent is studied. These ODEs are widely used to build trajectories converging to a minimizer $x^*$ of a function $F$, possibly convex. This family includes the continuous version of the Nesterov inertial scheme and the continuous heavy ball method. Several damping parameters, not necessarily vanishing, and a perturbation term $g$ are thus considered. The damping parameter is linked to the inertia of the associated inertial scheme and the perturbation term $g$ is linked to the error that can be made on the gradient of the function $F$. This article presents new asymptotic bounds on $F(x(t))-F(x^*)$ where $x$ is a solution of the ODE, when $F$ is convex and satisfies local geometrical properties such as Łojasiewicz properties and under integrability conditions on $g$. Even if geometrical properties and perturbations were already studied for most ODEs of these families, it is the first time they are jointly studied. All these results give an insight on the behavior of these inertial and perturbed algorithms if $F$ satisfies some Łojasiewicz properties especially in the setting of stochastic algorithms.
Submitted 5 July, 2019;
originally announced July 2019.
-
Optimal convergence rates for Nesterov acceleration
Authors:
Jean François Aujol,
Charles Dossal,
Aude Rondepierre
Abstract:
In this paper, we study the behavior of solutions of the ODE associated to Nesterov acceleration. It is well-known since the pioneering work of Nesterov that the rate of convergence $O(1/t^2)$ is optimal for the class of convex functions with Lipschitz gradient. In this work, we show that better convergence rates can be obtained with some additional geometrical conditions, such as Łojasiewicz property. More precisely, we prove the optimal convergence rates that can be obtained depending on the geometry of the function $F$ to minimize. The convergence rates are new, and they shed new light on the behavior of Nesterov acceleration schemes. We prove in particular that the classical Nesterov scheme may provide convergence rates that are worse than the classical gradient descent scheme on sharp functions: for instance, the convergence rate for strongly convex functions is not geometric for the classical Nesterov scheme (while it is the case for the gradient descent algorithm). This shows that applying the classical Nesterov acceleration on convex functions without looking more at the geometrical properties of the objective functions may lead to sub-optimal algorithms.
Submitted 8 July, 2019; v1 submitted 15 May, 2018;
originally announced May 2018.
-
Overrelaxed Sinkhorn-Knopp Algorithm for Regularized Optimal Transport
Authors:
Alexis Thibault,
Lénaïc Chizat,
Charles Dossal,
Nicolas Papadakis
Abstract:
This article describes a set of methods for quickly computing the solution to the regularized optimal transport problem. It generalizes and improves upon the widely-used iterative Bregman projections algorithm (or Sinkhorn--Knopp algorithm). We first propose to rely on regularized nonlinear acceleration schemes. In practice, such approaches lead to fast algorithms, but their global convergence is not ensured. Hence, we next propose a new algorithm with convergence guarantees. The idea is to overrelax the Bregman projection operators, allowing for faster convergence. We propose a simple method for establishing global convergence by ensuring the decrease of a Lyapunov function at each step. An adaptive choice of overrelaxation parameter based on the Lyapunov function is constructed. We also suggest a heuristic to choose a suitable asymptotic overrelaxation parameter, based on a local convergence analysis. Our numerical experiments show a gain in convergence speed by an order of magnitude in certain regimes.
Submitted 31 March, 2021; v1 submitted 6 November, 2017;
originally announced November 2017.
-
Sampling the Fourier transform along radial lines
Authors:
Charles Dossal,
Vincent Duval,
Clarice Poon
Abstract:
This article considers the use of total variation minimization for the recovery of a superposition of point sources from samples of its Fourier transform along radial lines. We present a numerical algorithm for the computation of solutions to this infinite dimensional problem. The theoretical results of this paper make precise the link between the sampling operator and the recoverability of the point sources.
Submitted 4 June, 2017; v1 submitted 20 December, 2016;
originally announced December 2016.
-
The Degrees of Freedom of Partly Smooth Regularizers
Authors:
Samuel Vaiter,
Charles-Alban Deledalle,
Jalal M. Fadili,
Gabriel Peyré,
Charles Dossal
Abstract:
In this paper, we are concerned with regularized regression problems where the prior regularizer is a proper lower semicontinuous and convex function which is also partly smooth relative to a Riemannian submanifold. This encompasses as special cases several known penalties such as the Lasso ($\ell^1$-norm), the group Lasso ($\ell^1-\ell^2$-norm), the $\ell^\infty$-norm, and the nuclear norm. This also includes so-called analysis-type priors, i.e. composition of the previously mentioned penalties with linear operators, typical examples being the total variation or fused Lasso penalties. We study the sensitivity of any regularized minimizer to perturbations of the observations and provide its precise local parameterization. Our main sensitivity analysis result shows that the predictor moves locally stably along the same active submanifold as the observations undergo small perturbations. This local stability is a consequence of the smoothness of the regularizer when restricted to the active submanifold, which in turn plays a pivotal role to get a closed form expression for the variations of the predictor w.r.t. observations. We also show that, for a variety of regularizers, including polyhedral ones or the group Lasso and its analysis counterpart, this divergence formula holds Lebesgue almost everywhere. When the perturbation is random (with an appropriate continuous distribution), this allows us to derive an unbiased estimator of the degrees of freedom and of the risk of the estimator prediction. Our results hold true without requiring the design matrix to be full column rank. They generalize those already known in the literature such as the Lasso problem, the general Lasso problem (analysis $\ell^1$-penalty), or the group Lasso where existing results for the latter assume that the design is full column rank.
Submitted 10 February, 2016; v1 submitted 22 April, 2014;
originally announced April 2014.
-
Consistency of l1 recovery from noisy deterministic measurements
Authors:
Charles Dossal,
Rémi Tesson
Abstract:
In this paper a new result of recovery of sparse vectors from deterministic and noisy measurements by l1 minimization is given. The sparse vector is randomly chosen and follows a generic p-sparse model introduced by Candès et al. The main theorem ensures consistency of l1 minimization with high probability. This first result is then extended to compressible vectors.
Submitted 3 December, 2012;
originally announced December 2012.
-
Risk estimation for matrix recovery with spectral regularization
Authors:
Charles-Alban Deledalle,
Samuel Vaiter,
Gabriel Peyré,
Jalal Fadili,
Charles Dossal
Abstract:
In this paper, we develop an approach to recursively estimate the quadratic risk for matrix recovery problems regularized with spectral functions. Toward this end, in the spirit of the SURE theory, a key step is to compute the (weak) derivative and divergence of a solution with respect to the observations. As such a solution is not available in closed form, but rather through a proximal splitting algorithm, we propose to recursively compute the divergence from the sequence of iterates. A second challenge that we unlocked is the computation of the (weak) derivative of the proximity operator of a spectral function. To show the potential applicability of our approach, we exemplify it on a matrix completion problem to objectively and automatically select the regularization parameter.
Submitted 1 November, 2012; v1 submitted 7 May, 2012;
originally announced May 2012.
-
The Degrees of Freedom of the Group Lasso
Authors:
Samuel Vaiter,
Charles Deledalle,
Gabriel Peyré,
Jalal Fadili,
Charles Dossal
Abstract:
This paper studies the sensitivity to the observations of the block/group Lasso solution to an overdetermined linear regression model. Such a regularization is known to promote sparsity patterns structured as nonoverlapping groups of coefficients. Our main contribution provides a local parameterization of the solution with respect to the observations. As a byproduct, we give an unbiased estimate of the degrees of freedom of the group Lasso. Among other applications of such results, one can choose in a principled and objective way the regularization parameter of the Lasso through model selection criteria.
Submitted 7 May, 2012;
originally announced May 2012.
-
Local Behavior of Sparse Analysis Regularization: Applications to Risk Estimation
Authors:
Samuel Vaiter,
Charles Deledalle,
Gabriel Peyré,
Charles Dossal,
Jalal Fadili
Abstract:
In this paper, we aim at recovering an unknown signal x0 from noisy measurements y=Phi*x0+w, where Phi is an ill-conditioned or singular linear operator and w accounts for some noise. To regularize such an ill-posed inverse problem, we impose an analysis sparsity prior. More precisely, the recovery is cast as a convex optimization program where the objective is the sum of a quadratic data fidelity term and a regularization term formed of the L1-norm of the correlations between the sought-after signal and atoms in a given (generally overcomplete) dictionary. The L1-sparsity analysis prior is weighted by a regularization parameter lambda>0. In this paper, we prove that any minimizer of this problem is a piecewise-affine function of the observations y and the regularization parameter lambda. As a byproduct, we exploit these properties to get an objectively guided choice of lambda. In particular, we develop an extension of the Generalized Stein Unbiased Risk Estimator (GSURE) and show that it is an unbiased and reliable estimator of an appropriately defined risk. The latter encompasses special cases such as the prediction risk, the projection risk and the estimation risk. We apply these risk estimators to the special case of L1-sparsity analysis regularization. We also discuss implementation issues and propose fast algorithms to solve the L1 analysis minimization problem and to compute the associated GSURE. We finally illustrate the applicability of our framework to parameter(s) selection on several imaging problems.
Submitted 10 October, 2012; v1 submitted 14 April, 2012;
originally announced April 2012.
-
The degrees of freedom of the Lasso for general design matrix
Authors:
Charles Dossal,
Maher Kachour,
Jalal M. Fadili,
Gabriel Peyré,
Christophe Chesneau
Abstract:
In this paper, we investigate the degrees of freedom ($\dof$) of penalized $\ell_1$ minimization (also known as the Lasso) for linear regression models. We give a closed-form expression of the $\dof$ of the Lasso response. Namely, we show that for any given Lasso regularization parameter $λ$ and any observed data $y$ belonging to a set of full (Lebesgue) measure, the cardinality of the support of a particular solution of the Lasso problem is an unbiased estimator of the degrees of freedom. This is achieved without the need of uniqueness of the Lasso solution. Thus, our result holds true for both the underdetermined and the overdetermined case, where the latter was originally studied in \cite{zou}. We also show, by providing a simple counterexample, that although the $\dof$ theorem of \cite{zou} is correct, their proof contains a flaw since their divergence formula holds on a different set of a full measure than the one that they claim. An effective estimator of the number of degrees of freedom may have several applications including an objectively guided choice of the regularization parameter in the Lasso through the $\sure$ framework. Our theoretical findings are illustrated through several numerical simulations.
Submitted 28 May, 2012; v1 submitted 4 November, 2011;
originally announced November 2011.
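The degrees-of-freedom statement in the abstract above (the support size of a suitable Lasso solution is an unbiased estimator) can be illustrated with a short sketch. The solver below is a plain ISTA loop, not the authors' code; the function names, the synthetic data, the tolerance and the iteration count are assumptions made for the example. In the underdetermined case the abstract refers to a particular solution, which an iterative solver is not guaranteed to return, so the count should be read as an approximation.

```python
import numpy as np


def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (componentwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def lasso_ista(X, y, lam, n_iter=5000):
    """Approximately minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by ISTA."""
    lipschitz = np.linalg.norm(X, 2) ** 2  # squared spectral norm of X
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)
        b = soft_threshold(b - grad / lipschitz, lam / lipschitz)
    return b


def dof_estimate(b, tol=1e-10):
    """Support cardinality, used here as the degrees-of-freedom estimate."""
    return int(np.sum(np.abs(b) > tol))


# Hypothetical underdetermined problem: 20 observations, 50 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 50))
b_true = np.zeros(50)
b_true[:3] = [2.0, -1.5, 1.0]
y = X @ b_true + 0.1 * rng.standard_normal(20)

lam_max = np.max(np.abs(X.T @ y))  # above this value the zero vector is a solution
b_hat = lasso_ista(X, y, 0.1 * lam_max)
b_zero = lasso_ista(X, y, 1.1 * lam_max)
dof_hat = dof_estimate(b_hat)
dof_zero = dof_estimate(b_zero)
```

Averaging such an estimate over noise realizations, or plugging it into a SURE-type risk formula to select the regularization parameter, would be the natural next step suggested by the abstract.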
-
Sharp Support Recovery from Noisy Random Measurements by L1 minimization
Authors:
Charles Dossal,
Marie-Line Chabanol,
Gabriel Peyré,
Jalal Fadili
Abstract:
In this paper, we investigate the theoretical guarantees of penalized $\lun$ minimization (also called Basis Pursuit Denoising or Lasso) in terms of sparsity pattern recovery (support and sign consistency) from noisy measurements with non-necessarily random noise, when the sensing operator belongs to the Gaussian ensemble (i.e. random design matrix with i.i.d. Gaussian entries). More precisely, we derive sharp non-asymptotic bounds on the sparsity level and (minimal) signal-to-noise ratio that ensure support identification for most signals and most Gaussian sensing matrices by solving the Lasso problem with an appropriately chosen regularization parameter. Our first purpose is to establish conditions allowing exact sparsity pattern recovery when the signal is strictly sparse. Then, these conditions are extended to cover the compressible or nearly sparse case. In these two results, the role of the minimal signal-to-noise ratio is crucial. Our third main result gets rid of this assumption in the strictly sparse case, but this time, the Lasso allows only partial recovery of the support. We also provide in this case a sharp $\ell_2$-consistency result on the coefficient vector. The results of the present work have several distinctive features compared to previous ones. One of them is that the leading constants involved in all the bounds are sharp and explicit. This is illustrated by some numerical experiments where it is indeed shown that the sharp sparsity level threshold identified by our theoretical results below which sparsistency of the Lasso is guaranteed meets that empirically observed.
Submitted 12 September, 2011; v1 submitted 8 January, 2011;
originally announced January 2011.
-
Bandlet Image Estimation with Model Selection
Authors:
Charles Dossal,
Erwan Le Pennec,
Stéphane Mallat
Abstract:
To estimate geometrically regular images in the white noise model and obtain an adaptive near asymptotic minimaxity result, we consider a model selection based bandlet estimator. This bandlet estimator combines the best basis selection behaviour of the model selection and the approximation properties of the bandlet dictionary. We derive its near asymptotic minimaxity for geometrically regular images as an example of model selection with general dictionary of orthogonal bases. This paper is thus also a self contained tutorial on model selection with orthogonal bases dictionary.
Submitted 12 December, 2009; v1 submitted 18 September, 2008;
originally announced September 2008.