
The gradient's limit of a definable family of functions is a conservative set-valued field

Abstract: It is well known that the convergence of a family of smooth functions does not imply the convergence of its gradients. In this work, we show that if the family is definable in an o-minimal structure (for instance semialgebraic, subanalytic, or any composition of the previous with exp, log), then the gradient's limit is a conservative set-valued field in the sense introduced by Bolte and Pauwels. Immediate implications of this result on convergence guarantees of smoothing methods are discussed. Finally, a more general result is established, where the functions in the original family might be merely Lipschitz continuous and vector-valued, and the gradients are replaced by their Clarke Jacobians or an arbitrary definable conservative map.

Submitted 21 February, 2024; v1 submitted 13 February, 2024; originally announced February 2024.

arXiv:2305.13187 [pdf, ps, other]

SignSVRG: fixing SignSGD via variance reduction

Authors: Evgenii Chzhen, Sholom Schechtman

Abstract: We consider the problem of unconstrained minimization of finite sums of functions. We propose a simple yet practical way to incorporate variance reduction techniques into SignSGD, guaranteeing convergence that is similar to that of full sign gradient descent. The core idea is first instantiated on the problem of minimizing sums of convex and Lipschitz functions and is then extended to the smooth case via variance reduction. Our analysis is elementary and much simpler than the typical proof for variance reduction methods. We show that for smooth functions our method achieves an $\mathcal{O}(1 / \sqrt{T})$ rate for the expected norm of the gradient and an $\mathcal{O}(1/T)$ rate in the case of smooth convex functions, recovering the convergence results of deterministic methods while preserving the computational advantages of SignSGD.

Submitted 22 May, 2023; originally announced May 2023.

arXiv:2303.09261 [pdf, other]

Orthogonal Directions Constrained Gradient Method: from non-linear equality constraints to Stiefel manifold

Authors: Sholom Schechtman, Daniil Tiapkin, Michael Muehlebach, Eric Moulines

Abstract: We consider the problem of minimizing a non-convex function over a smooth manifold $\mathcal{M}$. We propose a novel algorithm, the Orthogonal Directions Constrained Gradient Method (ODCGM), which only requires computing a projection onto a vector space. ODCGM is infeasible, but the iterates are constantly pulled towards the manifold, ensuring the convergence of ODCGM towards $\mathcal{M}$. ODCGM is much simpler to implement than the classical methods, which require the computation of a retraction. Moreover, we show that ODCGM exhibits the near-optimal oracle complexities $\mathcal{O}(1/\varepsilon^2)$ and $\mathcal{O}(1/\varepsilon^4)$ in the deterministic and stochastic cases, respectively. Furthermore, we establish that, under an appropriate choice of the projection metric, our method recovers the landing algorithm of Ablin and Peyré (2022), a recently introduced algorithm for optimization over the Stiefel manifold. As a result, we significantly extend the analysis of Ablin and Peyré (2022), establishing near-optimal rates in both the deterministic and stochastic frameworks. Finally, we perform numerical experiments which show the efficiency of ODCGM in a high-dimensional setting.

Submitted 16 March, 2023; originally announced March 2023.

arXiv:2109.02455

Stochastic Subgradient Descent on a Generic Definable Function Converges to a Minimizer

Authors: Sholom Schechtman

Abstract: It was previously shown by Davis and Drusvyatskiy that every Clarke critical point of a generic, semialgebraic (and more generally definable in an o-minimal structure), weakly convex function lies on an active manifold and is either a local minimum or an active strict saddle. In the first part of this work, we show that when the weak convexity assumption fails, a third type of point appears: a sharply repulsive critical point. Moreover, we show that the corresponding active manifolds satisfy the Verdier and the angle conditions which were introduced by us in our previous work. In the second part of this work, we show that, under a density-like assumption on the perturbation sequence, the stochastic subgradient descent (SGD) avoids sharply repulsive critical points with probability one. We show that such a density-like assumption could be obtained upon adding a small random perturbation (e.g. a nondegenerate Gaussian) at each iteration of the algorithm. These results, combined with our previous work on the avoidance of active strict saddles, show that the SGD on a generic definable (e.g. semialgebraic) function converges to a local minimum.

Submitted 10 February, 2022; v1 submitted 6 September, 2021; originally announced September 2021.

Comments: This paper was withdrawn due to a mistake in the work of Benaïm-Hofbauer-Sorin "Stochastic Approximations and Differential Inclusions". In the latter, the equivalence in Theorem 4.1 is not true and in particular the linearly interpolated process of the iterates is not an APT of the associated DI. This equivalence was at the heart of Propositions 7, 8 and Theorem 2 of the present paper

MSC Class: 65K10; 62L20; 49J52; 32B20

arXiv:2108.02072 [pdf, ps, other]

Stochastic Subgradient Descent Escapes Active Strict Saddles on Weakly Convex Functions

Authors: Pascal Bianchi, Walid Hachem, Sholom Schechtman

Abstract: In non-smooth stochastic optimization, we establish the non-convergence of the stochastic subgradient descent (SGD) to the critical points recently called active strict saddles by Davis and Drusvyatskiy. Such points lie on a manifold $M$ where the function $f$ has a direction of second-order negative curvature. Off this manifold, the norm of the Clarke subdifferential of $f$ is lower-bounded. We require two conditions on $f$. The first assumption is a Verdier stratification condition, which is a refinement of the popular Whitney stratification. It allows us to establish a reinforced version of the projection formula of Bolte et al. for Whitney stratifiable functions, which is of independent interest. The second assumption, termed the angle condition, allows us to control the distance of the iterates to $M$. When $f$ is weakly convex, our assumptions are generic. Consequently, generically in the class of definable weakly convex functions, the SGD converges to a local minimizer.

Submitted 25 July, 2023; v1 submitted 4 August, 2021; originally announced August 2021.

Comments: Accepted for publication in Mathematics of Operations Research

MSC Class: 65K10; 62L20 (Primary); 49J52; 32B20 (Secondary)

arXiv:2103.16253 [pdf, ps, other]

doi 10.1007/s11590-022-01884-8

Stochastic proximal subgradient descent oscillates in the vicinity of its accumulation set

Authors: Sholom Schechtman

Abstract: We analyze the stochastic proximal subgradient descent in the case where the objective functions are path differentiable and verify a Sard-type condition. While the accumulation set may not be reduced to a unique point, we show that the time spent by the iterates to move from one accumulation point to another goes to infinity. An oscillation-type behavior of the drift is established. These results show a strong stability property of the proximal subgradient descent. Using the theory of closed measures, Bolte, Pauwels and Ríos-Zertuche established this type of behavior for the deterministic subgradient descent. Our technique of proof relies on the classical works on stochastic approximation of differential inclusions, which allows us to extend results in the deterministic case to a stochastic and proximal setting, as well as to treat these different cases in a unified manner.

Submitted 24 May, 2022; v1 submitted 30 March, 2021; originally announced March 2021.

MSC Class: 65K10; 62L20 (Primary); 62M45 (Secondary)

ACM Class: G.1.6; I.2.6

arXiv:2012.04002 [pdf, ps, other]

Stochastic optimization with momentum: convergence, fluctuations, and traps avoidance

Authors: A. Barakat, P. Bianchi, W. Hachem, Sh. Schechtman

Abstract: In this paper, a general stochastic optimization procedure is studied, unifying several variants of the stochastic gradient descent such as, among others, the stochastic heavy ball method, the Stochastic Nesterov Accelerated Gradient algorithm (S-NAG), and the widely used Adam algorithm. The algorithm is seen as a noisy Euler discretization of a non-autonomous ordinary differential equation, recently introduced by Belotto da Silva and Gazeau, which is analyzed in depth. Assuming that the objective function is non-convex and differentiable, the stability and the almost sure convergence of the iterates to the set of critical points are established. A noteworthy special case is the convergence proof of S-NAG in a non-convex setting. Under some assumptions, the convergence rate is provided in the form of a Central Limit Theorem. Finally, the non-convergence of the algorithm to undesired critical points, such as local maxima or saddle points, is established. Here, the main ingredient is a new avoidance of traps result for non-autonomous settings, which is of independent interest.

Submitted 10 July, 2021; v1 submitted 7 December, 2020; originally announced December 2020.

Comments: Accepted for publication in Electronic Journal of Statistics. 49 pages

MSC Class: 62L20; 34A12; 60F99

arXiv:2005.08513 [pdf, ps, other]

Convergence of constant step stochastic gradient descent for non-smooth non-convex functions

Authors: Pascal Bianchi, Walid Hachem, Sholom Schechtman

Abstract: This paper studies the asymptotic behavior of the constant step Stochastic Gradient Descent for the minimization of an unknown function $F$, defined as the expectation of a non-convex, non-smooth, locally Lipschitz random function. As the gradient may not exist, it is replaced by a certain operator: a reasonable choice is to use an element of the Clarke subdifferential of the random function; another choice is the output of the celebrated backpropagation algorithm, which is popular amongst practitioners, and whose properties have recently been studied by Bolte and Pauwels [7]. Since the expectation of the chosen operator is not in general an element of the Clarke subdifferential $\partial F$ of the mean function, it has been assumed in the literature that an oracle of $\partial F$ is available. As a first result, it is shown in this paper that such an oracle is not needed for almost all initialization points of the algorithm. Next, in the small step size regime, it is shown that the interpolated trajectory of the algorithm converges in probability (in the compact convergence sense) towards the set of solutions of the differential inclusion. Finally, viewing the iterates as a Markov chain whose transition kernel is indexed by the step size, it is shown that the invariant distributions of the kernel converge weakly to the set of invariant distributions of this differential inclusion as the step size tends to zero. These results show that when the step size is small, with large probability, the iterates eventually lie in a neighborhood of the critical points of the mean function $F$.

Submitted 12 April, 2022; v1 submitted 18 May, 2020; originally announced May 2020.

Journal ref: Set-Valued and Variational Analysis, Springer, 2022

Showing 1–8 of 8 results for author: Schechtman, S