Stochastic Optimization under Hidden Convexity
This work is supported by the ETH AI Center Doctoral Fellowship, NCCR Automation, and the Swiss National Science Foundation.
Abstract
In this work, we consider constrained stochastic optimization problems under hidden convexity, i.e., those that admit a convex reformulation via a non-linear (but invertible) map. A number of non-convex problems ranging from optimal control, revenue and inventory management, to convex reinforcement learning all admit such a hidden convex structure. Unfortunately, in the majority of applications considered, the map is unavailable or implicit; therefore, directly solving the convex reformulation is not possible. On the other hand, the stochastic gradients with respect to the original variable are often easy to obtain. Motivated by these observations, we examine the basic projected stochastic (sub-) gradient methods for solving such problems under hidden convexity. We provide the first sample complexity guarantees for global convergence in smooth and non-smooth settings. Additionally, in the smooth setting, we improve our results to last-iterate convergence in terms of the function value gap using the momentum variant of projected stochastic gradient descent.
1 Introduction
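We study constrained stochastic optimization
$$\min_{x \in \mathcal{X}} F(x) := \mathbb{E}_{\xi \sim \mathcal{D}}\left[f(x, \xi)\right], \tag{1}$$
where $\mathcal{X}$ is a closed convex subset of $\mathbb{R}^d$, $\xi$ is a random variable following an unknown distribution $\mathcal{D}$, and $F$ is possibly non-convex in $x$. Our central structural assumption about (1) is that it admits a convex reformulation of the form
$$\min_{u \in \mathcal{U}} H(u) := F(c^{-1}(u)), \tag{2}$$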
where $H$ is a convex function defined on a closed convex set $\mathcal{U}$, and $c : \mathcal{X} \to \mathcal{U}$ is an invertible map (with its inverse denoted by $c^{-1}$). This property is often referred to as hidden convexity and frequently appears in various modern applications, for example, policy optimization in convex reinforcement learning and optimal control [92, 80], generative models [50], supply chain and revenue management [36, 15], and training neural networks [83, 30]. In the optimization literature, hidden convexity was identified much earlier and dates back to at least the 1990s in the context of quadratic optimization [78, 8]. Since then, several works have developed various tools to identify such a property [7, 6]. More recently, hidden convexity has been established for a wider class of non-convex programs [81, 84, 14] and for non-monotone games [82, 62], to name just a few. Despite the existence of the convex reformulation, the transformation function $c$ is usually hard to compute or even unknown, and one cannot readily solve the convex reformulation. This motivates the use of (sub-) gradient methods that optimize directly over the variable $x$. Perhaps the most basic algorithm is the projected stochastic (sub-) gradient method (SM). Starting from $x^0 \in \mathcal{X}$, SM generates a sequence $\{x^t\}_{t \ge 0}$ via
$$x^{t+1} = \Pi_{\mathcal{X}}\left(x^t - \eta\, g^t\right), \tag{3}$$
where $\eta > 0$ is the step-size, $g^t$ denotes an unbiased estimate of the (sub-) gradient of $F$ at the point $x^t$, and $\Pi_{\mathcal{X}}$ is the Euclidean projection onto the convex set $\mathcal{X}$; see Section 1.2 for details. When $F$ is differentiable, this reduces to Projected SGD. Stochastic (sub-) gradient methods and their numerous variants have a long history of development since the first works on stochastic approximation appeared in the 1950s [73, 49, 9, 18]. Their analyses are richly documented for convex problems and general nonconvex problems (refer to Appendix C for a detailed summary); however, their convergence behavior under hidden convexity is still not precisely understood.
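To make update (3) concrete, here is a minimal Python sketch of the projected stochastic subgradient loop on a one-dimensional toy problem. The toy objective $F(x) = |\sqrt{x} - 1.5|$ on $[0.5, 4]$, the additive Gaussian noise model, the constant step-size, and all function names are illustrative assumptions and are not taken from the applications discussed in this paper. The toy objective is hidden convex through the (assumed) map $c(x) = \sqrt{x}$ and the convex function $H(u) = |u - 1.5|$.

```python
import math
import random


def project_interval(x, lo, hi):
    """Euclidean projection of a scalar onto the interval [lo, hi]."""
    return min(max(x, lo), hi)


def stochastic_subgradient(x, rng, noise_std=0.1):
    """Noisy subgradient of the toy objective F(x) = |sqrt(x) - 1.5|.

    The zero-mean Gaussian noise is an assumed oracle model.
    """
    u = math.sqrt(x)
    sign = (u > 1.5) - (u < 1.5)  # subgradient of |u - 1.5| w.r.t. u
    return sign / (2.0 * u) + rng.gauss(0.0, noise_std)


def projected_sm(x0, step, iters, lo=0.5, hi=4.0, seed=0):
    """Run x <- Proj_X(x - step * g) with one stochastic subgradient per step."""
    rng = random.Random(seed)
    x = x0
    for _ in range(iters):
        g = stochastic_subgradient(x, rng)
        x = project_interval(x - step * g, lo, hi)
    return x


x_last = projected_sm(x0=4.0, step=0.01, iters=5000)
```

In this sketch the last iterate settles near the global minimizer $x = 2.25$ even though the toy $F$ is non-convex in $x$; only one stochastic subgradient is used per iteration.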
Although hidden convexity has been previously identified in certain applications, the analysis of gradient methods under this condition is mostly done on a case-by-case basis for specific applications and often requires strong additional assumptions [93, 4, 16, 15]. In this work, we formally consider general-purpose stochastic optimization under hidden convexity and study the sample complexities for solving such problems via the projected stochastic (sub-) gradient method and its variants.
Contributions:
1. We identify key properties of hidden convex optimization and demonstrate how these conditions can be used to derive global convergence of gradient methods.
2. In the general non-differentiable case, we analyze convergence of the projected stochastic sub-gradient method (SM) and obtain sample complexity for driving the Moreau envelope of (1) -close to the optimal value in expectation. Here is the weak convexity parameter of , is the bound on (stochastic) subgradients, and is the modulus of hidden convexity of that relates to the invertibility of the mapping $c$; see Section 2 and Section 4 for details. To our knowledge, this is the first result to address the non-differentiable setting under hidden convexity.
3. Next, we specialize our results to the differentiable smooth setting, and obtain a similar sample complexity for Projected SGD, replacing by the variance of stochastic gradients . Furthermore, we analyze the momentum variant of Projected SGD, improving our result to last-iterate convergence in terms of the function value gap, i.e., we have after iterations/stochastic gradient calls, where is the Lipschitz constant of .
4. In the presence of strong convexity of the reformulated problem, we further improve the sample complexity for all of the above-mentioned algorithms. For instance, Projected SGD attains sample complexity for achieving an -optimal solution, where corresponds to the strong convexity of .
Importantly, we show that all studied algorithms provably converge in online fashion, using only one stochastic gradient at every iteration.
1.1 Related work
The most closely related works are [92, 93, 4, 16, 40], which analyze gradient methods under similar structural assumptions in the context of specific applications.
Policy gradient methods in RL. Several works [92, 93, 4] exploited properties similar to hidden convexity in reinforcement learning (RL) applications and analyzed policy gradient (PG) type methods with global convergence guarantees. In [92], the authors considered a PG method with projection, but it is limited to the case where exact gradients are available. It is unclear how to extend their technique to the case of stochastic gradients with bounded variance (without resorting to large batches). Next, [93] considered the stochastic setting and proposed a variance-reduced PG method with truncation using large batches of trajectories. Recently, [4] removed the requirement for large batches using a normalized variance-reduced PG method. However, their results are difficult to extend to the constrained case due to the normalization. Moreover, both works [93, 4] utilize variance-reduced estimators, which require additional smoothness assumptions for theoretical analysis.
Stochastic gradient methods in revenue management. A different line of work [15, 16] considered hidden convex objectives in revenue management and studied global convergence of gradient-based methods over . For a special revenue management problem, [15] introduced a preconditioned gradient-based method that obtains an sample complexity under the assumptions that the domain is a box constraint, the transformation function is separable, and additional access to is available. Leveraging the box constraint structure, [15] also analyzed Projected SGD and derived sample complexity. In contrast, we show that Projected SGD can achieve a better sample complexity for a general convex compact constraint , and further extend the results to the non-smooth setting.
Nonconvex online learning. Recently, [40] considered a structural property similar to hidden convexity and imposed strong assumptions on the reparameterization map (see Assumptions 1, 2 and 4 therein) under which non-convex online gradient descent in the original space is equivalent to online mirror descent for the (convex) reformulated problem. Such equivalence allows them to demonstrate an regret bound. Instead, we directly derive the last iterate convergence in the function value using a different technique and make less restrictive assumptions on , which allows us to cover a wide range of applications.
Related structural assumptions. Several other non-convex structural assumptions that also ensure global convergence have been explored in the optimization literature, including essential strong convexity [56], quasar (strong) convexity [42], the restricted secant inequality [90], error bounds [59], quadratic growth [10], and the Polyak-Łojasiewicz (PŁ) condition [71, 58], also known as the global Kurdyka-Łojasiewicz (KŁ) or gradient domination condition. For an in-depth discussion on the relationships between these properties, refer to [47, 72] and references therein. Notably, the PŁ condition along with its various generalizations to constrained minimization such as Proximal-PŁ [47] and variational gradient dominance [85] have gained popularity in recent years. The convergence of gradient-based methods under the PŁ-type conditions has been extensively analyzed, e.g., in the deterministic setting [47, 89] and the stochastic setting [32, 37, 48, 75, 54, 17, 88, 31]. Despite a few examples [35, 33, 26, 85] showing that some variants of the PŁ condition hold, how to verify PŁ-like conditions for non-convex problems remains largely open in general. The situation becomes even more challenging when dealing with constrained optimization and/or non-differentiable objectives, where a suitable generalization of gradient dominance needs to be introduced and carefully studied. Unlike the PŁ condition, hidden convexity considered in this work is a very natural property, which easily extends to constrained optimization and non-differentiable objectives.
1.2 Notations and organization
In the following, we briefly review some basic notation from convex analysis. Throughout, we denote by the inner product in along with its induced Euclidean norm . For a real valued matrix , we denote by its operator norm, i.e., . The map is called invertible if there exists a map (called inverse) such that for any and for any . For any and any , if , we say is convex. We denote the diameter of as . For a function , if there exists such that for all and , it holds we call convex on if , and -strongly convex on if .
A function is -weakly convex (-WC) if for any fixed , is convex in . If , we assign . The (Fréchet) sub-differential of at is The elements are called sub-gradients of at , see [24] for alternative equivalent definitions of the sub-differential set for -WC functions. A differentiable function is -smooth on if its gradient is -Lipschitz continuous on the set , i.e., it holds for all . For a convex set , the projection of a point onto is . We denote as the indicator function of a set and define if and otherwise. We define by the set of optimal points of and by its optimal value. A point is called a stationary point of a weakly convex function if . For any function and a real , we define the Moreau envelope and the proximal mapping as follows
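For intuition about these two objects, the following minimal Python sketch evaluates them for the scalar function $|x|$. It assumes the common unconstrained convention $F_\lambda(x) = \min_y \{F(y) + \frac{1}{2\lambda}(y - x)^2\}$, which may differ from the parametrization used in this paper; the function names are hypothetical. For $|x|$ the proximal mapping is soft-thresholding and the envelope is the Huber function, and the sketch cross-checks the closed form against a brute-force grid search.

```python
def prox_abs(x, lam):
    """Proximal mapping of |.|: argmin_y |y| + (y - x)^2 / (2 * lam)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def moreau_abs(x, lam):
    """Moreau envelope of |.| evaluated through its proximal point."""
    y = prox_abs(x, lam)
    return abs(y) + (y - x) ** 2 / (2.0 * lam)


def moreau_brute_force(x, lam, lo=-5.0, hi=5.0, n=100001):
    """Grid-search approximation of the same minimization (sanity check)."""
    step = (hi - lo) / (n - 1)
    best = float("inf")
    for i in range(n):
        y = lo + i * step
        best = min(best, abs(y) + (y - x) ** 2 / (2.0 * lam))
    return best
```

The envelope is a smooth lower approximation of the original function, which is why it can serve as a convergence measure when the objective itself is non-smooth.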
The rest of the paper is organized as follows. We formally introduce the hidden convex function class in Section 2 followed by motivating examples. The properties of hidden convex optimization, which are useful to analyze global convergence of gradient methods are in Section 3. Our main global convergence results are in Sections 4, 5, and 6 for subgradient methods, Projected SGD, and Projected SGD with momentum, respectively. The conclusion follows in Section 7.
2 Hidden Convex Problem Class
The existence of a convex reformulation (2) for the problem (1) signifies its representation as a compositional optimization in the form:
$$\min_{x \in \mathcal{X}} F(x) := H(c(x)). \tag{4}$$
Formally, we make the following definition.
Definition 1.
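The above problem is called hidden convex with modulus $\mu_c$ (or function $F$ is hidden convex on $\mathcal{X}$) if its components satisfy the following underlying conditions.
- C.1 The domain $\mathcal{U} = c(\mathcal{X})$ is convex, and the function $H : \mathcal{U} \to \mathbb{R}$ satisfies, for some $\mu_H \ge 0$ and all $u, v \in \mathcal{U}$, $\lambda \in [0, 1]$,
$$H((1-\lambda)u + \lambda v) \le (1-\lambda)H(u) + \lambda H(v) - \frac{\lambda(1-\lambda)\mu_H}{2}\|u - v\|^2. \tag{5}$$
- C.2 The map $c : \mathcal{X} \to \mathcal{U}$ is invertible. There exists $\mu_c > 0$ such that for all $x, y \in \mathcal{X}$ it holds
$$\|c(x) - c(y)\| \ge \mu_c \|x - y\|. \tag{6}$$
In particular, if $\mu_H > 0$, we say the above problem is $(\mu_c, \mu_H)$-hidden strongly convex.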
Notice that the hidden convex problem class includes the convex problem as a special case when the transformation map is the identity. In addition, it also includes many non-convex problems. For a simple example, let and consider , , . Then , albeit concave, is hidden convex on by construction. Another simple example considers and , , . The obtained composition is also hidden convex on , although it is both non-convex and non-concave on .
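As a numerical illustration of the definition, the following Python sketch checks the two conditions on a hypothetical one-dimensional instance chosen here for illustration (it is not claimed to be the example from the text): $\mathcal{X} = [1, 4]$, $c(x) = \sqrt{x}$, $H(u) = u$, so that $F(x) = \sqrt{x}$ is concave in $x$ while $H$ is linear, hence convex. The sketch assumes that condition C.2 takes the form $|c(x) - c(y)| \ge \mu_c |x - y|$, which for this instance holds with $\mu_c = 1/4$.

```python
import math


def c(x):
    """Assumed transformation map on X = [1, 4]."""
    return math.sqrt(x)


def c_inv(u):
    """Inverse of c on U = [1, 2]."""
    return u * u


def H(u):
    """Convex (here linear) function in the reformulated variable."""
    return u


def F(x):
    """Original objective F = H(c(x)) = sqrt(x), concave in x."""
    return H(c(x))


GRID = [1.0 + 3.0 * i / 20 for i in range(21)]  # grid on X = [1, 4]
MU_C = 0.25  # since |sqrt(x) - sqrt(y)| = |x - y| / (sqrt(x) + sqrt(y))


def is_midpoint_concave(func, pts, tol=1e-12):
    """Check func((p+q)/2) >= (func(p)+func(q))/2 on all grid pairs."""
    return all(func(0.5 * (p + q)) >= 0.5 * (func(p) + func(q)) - tol
               for p in pts for q in pts)


def satisfies_c2(pts, mu_c, tol=1e-12):
    """Check |c(p) - c(q)| >= mu_c * |p - q| on all grid pairs."""
    return all(abs(c(p) - c(q)) >= mu_c * abs(p - q) - tol
               for p in pts for q in pts)
```

On this grid, `is_midpoint_concave(F, GRID)` and `satisfies_c2(GRID, MU_C)` both hold, whereas a modulus larger than $1/2$ fails the second check.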
In what follows, we present several more practical problems, which belong to our hidden convex class.
2.1 Non-linear least squares [67, 66, 28]
Consider solving a system of nonlinear equations under a box constraint, e.g., with for , . This problem can be equivalently formulated as
When is an invertible mapping, it belongs to the hidden convex optimization class. For , , and , we illustrate its contour plot in Fig. 1.
2.2 Minimizing posinomial functions [13, 29]
For power control in communication systems and optimal doping profile problems [13, 29], one often needs to minimize posinomial functions of the following form
where and for all , . The function is non-convex but admits a convex reformulation via a variable change . The convex reformulation is of the form
where is convex. One can easily see that the above problem is hidden convex on a convex compact constraint .
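The convexification of a posinomial can be checked numerically. The Python sketch below assumes the standard geometric-programming substitution $u_j = \log x_j$ (equivalently $x_j = e^{u_j}$) and uses a small hypothetical posinomial $x_1^{0.5} x_2^{0.5} + x_1^{-1} x_2^{-1}$ with made-up coefficients and exponents. It measures the midpoint convexity gap of a function, which must be non-positive for every pair of points if the function is convex.

```python
import math
import random

# Hypothetical posinomial: sum_i b_i * prod_j x_j ** a_ij with b_i > 0.
COEFFS = [1.0, 1.0]
EXPONENTS = [(0.5, 0.5), (-1.0, -1.0)]


def posinomial(x):
    """Evaluate the posinomial at a point x with positive coordinates."""
    return sum(b * math.prod(xj ** a for xj, a in zip(x, row))
               for b, row in zip(COEFFS, EXPONENTS))


def reformulated(u):
    """Posinomial after the change of variables x_j = exp(u_j)."""
    return posinomial([math.exp(uj) for uj in u])


def midpoint_gap(func, p, q):
    """func(midpoint) - average of func at endpoints; <= 0 if func is convex."""
    mid = [(a + b) / 2.0 for a, b in zip(p, q)]
    return func(mid) - 0.5 * (func(p) + func(q))


# The original function violates midpoint convexity on this pair ...
gap_original = midpoint_gap(posinomial, [1.0, 4.0], [4.0, 1.0])

# ... while the reformulated function passes on random pairs.
rng = random.Random(0)
worst_gap_reformulated = max(
    midpoint_gap(reformulated,
                 [rng.uniform(-1, 1), rng.uniform(-1, 1)],
                 [rng.uniform(-1, 1), rng.uniform(-1, 1)])
    for _ in range(200)
)
```

After the substitution each monomial becomes the exponential of a linear function of $u$, and a positive combination of such terms is convex, which is what the random check reflects.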
2.3 System level synthesis in optimal control [2]
Consider a linear time-varying system
where is a state, is a control input, and is an exogenous disturbance process, and are independent for . Matrices and determine the system dynamics. Define , , , and consider a time-varying controller of the form , which depends on a control matrix
The goal of the system level synthesis is to find a control policy to minimize some loss functions, e.g., quadratic in and : where and .
Despite the fact that is convex in both and , it is non-convex in the decision variable . Nevertheless, it admits a convex reformulation [2] of the form
where are the new variables, is a strongly-convex function of . is a deterministic matrix, which depends on matrices , , , and is the identity matrix. Moreover, there exists a bijection between variables and subject to the constraints of the reformulated problem. The (inverse of the) map is given by [2]. Therefore, one can easily verify that the optimization problem over is hidden convex. A number of other problems in optimal control also admit suitable convex reformulations. We refer readers to [12, 80] for more examples.
2.4 Convex reinforcement learning [92]
The convex reinforcement learning (RL) problem generalizes the classical RL setting. It is based on a discounted Markov Decision Process , where and denote the (finite) state and action spaces respectively, is the state transition probability kernel (where denotes the distribution over ), is the initial state distribution and is the discount factor. A stationary policy maps each state to a distribution over the action space . The set of all (stationary) policies is denoted by . At each time step in a state , the RL agent chooses an action with probability and the environment transitions to a state with probability We denote by the probability distribution of the Markov chain induced by the policy with an initial state distribution . For any policy , we define the state-action occupancy measure
(7) |
The set of such state-action occupancy measures is denoted by
Different from the classical RL, convex RL considers a general (convex) utility function that maps the state-action occupancy measure to a cost and aims to find a policy that minimizes the cost
(8) |
Notice that is not convex in in general. However, for several commonly used utility functions, exhibits convexity in the occupancy measure . For standard RL, is linear in , where is the reward vector. For the pure exploration setting, focused on fully exploring the transitions in the environment, represents the negative entropy of , which is also convex [92]. For the imitation learning where the objective is to imitate the expert’s behavior given their sampled trajectories, denotes the KL-divergence between and the state-action occupancy measure learned from the expert’s sampled trajectories, which is also convex [92]. Thus, the convex RL problem belongs to the hidden convex class with and (with ). Under mild assumptions on the initial distribution , the constant can be estimated, see e.g., [92]. Note that in convex RL, we can control only implicitly by changing the policy . The exact computation of the transformation map and its inverse requires the knowledge of the state transition probability kernel and can be computationally expensive.
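To illustrate the relationship between a policy and its occupancy measure, here is a minimal Python sketch for a tiny tabular MDP. It assumes the normalized discounted convention $\lambda(s,a) = (1-\gamma)\sum_{t \ge 0} \gamma^t \Pr(s_t = s, a_t = a)$, which may differ by a constant factor from definition (7); the two-state MDP, the policy, and all names are hypothetical. The sum is truncated at a finite horizon, so the result is an approximation.

```python
def occupancy_measure(policy, P, rho0, gamma, horizon=2000):
    """Truncated, normalized discounted state-action occupancy measure.

    policy[s][a]: probability of action a in state s
    P[s][a][s2]:  probability of moving to s2 from s under action a
    rho0[s]:      initial state distribution
    """
    n_s, n_a = len(rho0), len(policy[0])
    lam = [[0.0] * n_a for _ in range(n_s)]
    d = list(rho0)            # state distribution at the current time step
    weight = 1.0 - gamma      # (1 - gamma) * gamma ** t
    for _ in range(horizon):
        nxt = [0.0] * n_s
        for s in range(n_s):
            for a in range(n_a):
                p_sa = d[s] * policy[s][a]
                lam[s][a] += weight * p_sa
                for s2 in range(n_s):
                    nxt[s2] += p_sa * P[s][a][s2]
        d = nxt
        weight *= gamma
    return lam


# Hypothetical 2-state, 2-action MDP.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
policy = [[0.7, 0.3], [0.4, 0.6]]
lam = occupancy_measure(policy, P, rho0=[1.0, 0.0], gamma=0.9)

# The policy can be read back from lam by normalizing over actions.
recovered_pi_00 = lam[0][0] / (lam[0][0] + lam[0][1])
```

Under this convention the entries of `lam` are non-negative and sum to (approximately) one, and for states with positive visitation the policy is recovered by normalizing over actions, which is one way to see the invertibility between policies and occupancy measures.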
2.5 Revenue management and inventory control [15, 16]
Consider a booking limit control in a passenger network revenue management problem. The goal is to maximize the revenue by finding an optimal booking limit threshold for each demand class, e.g., flying from New York to Seattle in economy class. Such a problem forms a two-stage stochastic program of the form
(9) | ||||
where denotes the number of demand classes in the airline network, is the booking limit control threshold for each demand class, is the random demand vector (of the same dimension as ) during the reservation stage, denotes the number of reservations accepted, and denotes the revenue collected during the reservation stage with being the price vector. In the service stage, denotes the penalty on the airline companies when there are number of reservations with plane seat capacity that is random, is the actual number of passengers that can get on the plane, and is the penalty vector for denying boarding to passengers with reservations. Notice that is non-convex in due to the truncation between and . However, when admits component-wise independent coordinates, this problem admits a convex reformulation via a variable change [15], i.e., . Note that, compared to the previous applications, the transformation function involves the unknown distribution and thus is not explicitly known.
3 Properties of Hidden Convex Optimization
In this section, we provide key properties of hidden convex problems and discuss their connections with gradient dominated function classes.
3.1 Globally optimal solution
The following proposition shows that every stationary point of a hidden convex function is a global minimum.
Proposition 1.
Let be hidden convex on and be a stationary point. If the map is differentiable at , then is a global minimum, i.e., for any .
Proof.
By the definition of a stationary point and the chain rule [74] (Theorem 10.49), we can write
(10) |
where . As the map is invertible, then for some implies . Thus, we have . Since function is convex, by the sufficient optimality condition, is a globally optimal solution, i.e., for any . As a result, we have for any . ∎
Note that a similar result appeared in [92] under additional smoothness assumptions on and . The above proof is much simpler and does not require smoothness.
3.2 Connections with gradient dominated functions
It is natural to ask what is the connection between hidden convex problems and previously studied gradient dominated function classes that also ensure the global convergence of gradient-based algorithms in the smooth setting [47]. Unfortunately, the exact characterization is difficult to establish in the constrained setting. With the following proposition, we show that a problem satisfies the global KŁ condition if it is hidden strongly convex ().
Proposition 2.
Let be differentiable and -hidden strongly convex, the map be differentiable on . Then the optimization problem satisfies the global KŁ condition, i.e.,
(11) |
Proof.
Note that even in this restrictive case when , it remains unclear how condition (11) can be used to establish global convergence of Projected SGD. To our knowledge, under the condition (11), no global analysis of stochastic gradient methods exists in the literature even for smooth . In the more interesting case when is merely convex, the above global KŁ condition becomes vacuous as . Despite this, Proposition 1 still holds even when .
3.3 Key inequalities for analysis of gradient methods
The following observations are key tools for deriving global convergence under hidden convexity.
Proposition 3.
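Let $F$ be hidden convex with $\mu_c > 0$ and $\mu_H \ge 0$. For any $\alpha \in [0, 1]$ and $x \in \mathcal{X}$, define $x_\alpha := c^{-1}\big((1-\alpha)c(x) + \alpha c(x^*)\big)$ for an optimal point $x^* \in \mathcal{X}^*$. Then
$$F(x_\alpha) \le (1-\alpha)F(x) + \alpha F(x^*) - \frac{\alpha(1-\alpha)\mu_H}{2}\|c(x) - c(x^*)\|^2 \tag{12}$$
and
$$\|x_\alpha - x\| \le \frac{\alpha}{\mu_c}\|c(x) - c(x^*)\|. \tag{13}$$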
Proof.
By (strong) convexity of and convexity of , we have
where the inequality uses the fact that is a convex set and that for any . By definition of and (6), we derive
∎
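The two steps of this argument can be spelled out as follows. This is a sketch under the assumptions that $x_\alpha := c^{-1}\big((1-\alpha)c(x) + \alpha c(x^*)\big)$ for $\alpha \in [0,1]$, that $H$ satisfies the (strong) convexity inequality with modulus $\mu_H \ge 0$ on the convex set $\mathcal{U}$, and that condition (6) reads $\|c(x) - c(y)\| \ge \mu_c \|x - y\|$; the symbols are the ones assumed here and may differ in form from the original statement.

```latex
\begin{align*}
F(x_\alpha)
  &= H\big((1-\alpha)\,c(x) + \alpha\, c(x^*)\big) \\
  &\le (1-\alpha)\,H(c(x)) + \alpha\, H(c(x^*))
      - \frac{\alpha(1-\alpha)\mu_H}{2}\,\|c(x) - c(x^*)\|^2 \\
  &= (1-\alpha)\,F(x) + \alpha\, F(x^*)
      - \frac{\alpha(1-\alpha)\mu_H}{2}\,\|c(x) - c(x^*)\|^2, \\[4pt]
\|x_\alpha - x\|
  &\le \frac{1}{\mu_c}\,\|c(x_\alpha) - c(x)\|
   = \frac{1}{\mu_c}\,\big\|(1-\alpha)\,c(x) + \alpha\, c(x^*) - c(x)\big\|
   = \frac{\alpha}{\mu_c}\,\|c(x^*) - c(x)\|.
\end{align*}
```

The first chain uses that the convex combination stays in $\mathcal{U}$ and that $H(c(x)) = F(x)$; the second uses only the lower bound on $\|c(x_\alpha) - c(x)\|$ from (6) and the definition of $x_\alpha$.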
4 Stochastic Subgradient Method
In this section, we show how Proposition 3 can be used to analyze convergence of the projected stochastic subgradient method (SM) as described in (3) in the non-smooth setting.
We first make the following assumptions.
- A.1 is -weakly convex on a closed, convex set .
- A.2 We have access to a stochastic sub-gradient oracle of at any , which outputs a random vector such that where is the sub-differential set of at . Moreover, there exists
The above assumptions are standard and appear frequently in non-smooth optimization [24, 95, 23, 22]. Weak convexity is known to be a much weaker condition than smoothness [22]. Notably, in the context of our hidden convexity (C.1 and C.2), weak convexity is not restrictive and comes for free from the Lipschitz continuity of and the smoothness of the transformation function . Specifically, if is convex and -Lipschitz continuous on and is -smooth, then it can be shown that the composition is -weakly convex with ; see e.g. Proposition 2.2(c) in [95]. In the absence of smoothness, the second assumption on the bounded second moment of the (stochastic) sub-gradients is typical even in the convex case. Later in Section 5, we show this assumption can be further relaxed to bounded variance in the smooth setting.
Let , and . We define the Lyapunov function
where is the Moreau envelope of . Notice that for any and if and only if .
Before stating the main result, we recall the following useful lemma from [22] that controls the distance between one step of the SM, , and one step of the proximal point method, . We include its proof in Appendix A for completeness.
The next theorem is the essential step for establishing the global convergence of SM in Theorems 2 and 3.
Proof.
By the definition of , we have for any
where in we use the optimality of , follows from Young’s inequality for any , and in we apply the result of lemma 1. We now select , which guarantees , , and . Thus
We are now ready to utilize the properties of hidden convex functions to bound and for some specific choice of . By Proposition 3, we have for
Combining three inequalities above, we have
where the last inequality holds since (by the choice , ) and recognizing . Subtracting from both sides, we conclude the proof. ∎
4.1 Hidden Convex Setting
We first demonstrate the convergence rate of SM in the hidden convex setting.
Proof.
Setting in Theorem 1 and leveraging compactness of , we have
Unrolling the recursion for to , we get
where the last step holds by setting after ∎
We remark that in the absence of smoothness of , the guarantee on might not necessarily translate to the function value gap . However, with the following corollary we show that the output of SM, , is in fact close to an -approximate global solution .
Corollary 1.
Proof.
The result follows directly from the definition of and Theorem 2. ∎
4.2 Hidden Strongly Convex Setting
The following theorem presents a stronger result in the case when is additionally hidden strongly convex.
Proof.
We invoke Theorem 1 with . The choice of guarantees the coefficient in front of is non-positive and
It remains to conclude the proof by unrolling the recursion and setting the step-size accordingly. ∎
In the presence of hidden strong convexity, since the optimal is unique, we can establish a strong convergence of the sequence to .
Corollary 2.
Proof.
Since is -strongly convex on , we have
where the first inequality follows by the first-order characterization of strong convexity and the optimality condition, and the last inequality holds by C.2. Recall that with . Then
where the second inequality holds by (4.2) and the last step follows by Theorem 3. ∎
The above results highlight a notable observation: although is non-smooth and non-convex, simple SM converges to a globally optimal solution. This stands in contrast to recent results in general non-smooth non-convex optimization, where more sophisticated (randomized) algorithms are needed to obtain a meaningful solution (e.g., -Goldstein stationary point) [46]. Moreover, it is worth emphasizing the distinctions in the sample complexity results compared to classical findings in convex settings: Theorems 2 and 3 imply the sample complexities of and respectively to reach for hidden convex and hidden strongly convex problems, whereas in convex and strongly convex settings, the sample complexities are and respectively to reach [64].
5 Projected SGD
In this section, we consider the smooth setting when is continuously differentiable. In this case, SM reduces to Projected SGD (P-SGD):
In particular, we assume that
- A.1’ The function is differentiable on a closed, convex set and its gradient is -Lipschitz continuous.
- A.2’ We have access to an unbiased stochastic gradient oracle with bounded variance , i.e. for any : , and
where expectations are with respect to the random variable .
Note that Assumption A.2’ on bounded variance is considerably weaker than Assumption A.2 on the bounded second moment of stochastic gradients in the previous section. Replacing Lemma 1 with Lemma 4 in the proof of Theorem 1 of the previous section, we are able to derive the following results under Assumptions A.1’ and A.2’.
Using the above result, we provide a refined analysis of Projected SGD in the differentiable setting with smoothness and bounded variance.
5.1 Hidden Convex Setting
We start with the hidden convex case.
Proof.
Similar to Corollary 1, we can show that is close to an -global optimal solution. But in the case of smooth , we can also derive a stronger result after applying one (post-processing) step of Projected SGD with mini-batch. We defer this result to Corollary 4 in Appendix B.
Theorem 5 implies that in deterministic case when , the iteration complexity of the gradient method is , which coincides with the iteration complexity of Projected GD in the smooth convex setting in terms of (up to a logarithmic factor). However, in the stochastic setting, sample complexity is worse than the well known sample complexity in the convex case [51]. On the other hand, for general smooth nonconvex optimization, Projected SGD is only known to converge to a first-order stationary point (FOSP), i.e., find with , with the sample complexity [51, 22].
5.2 Hidden Strongly Convex Setting
Similar to the exposition in Section 4, we present an improved sample complexity result for hidden strongly convex problems.
Similarly to Corollary 2, we can translate convergence in to last-iterate convergence in terms of the distance to the optimal solution.
Corollary 3.
Theorem 6 and Corollary 3 imply that if , Projected SGD converges linearly in deterministic setting (when ) and achieves sample complexity in the stochastic setting. This means that compared to the special case of strongly convex optimization, the above rates have the same dependence on (up to a logarithmic factor) [79, 51].
6 Projected SGD with Momentum
We observe that the previous section only guarantees convergence of ; however, this might not directly imply convergence in terms of the original function since for any . It is known that in convex optimization, momentum is often helpful to establish last-iterate convergence, see e.g., [55, 76]. Motivated by this, we consider Projected SGD with Polyak’s (heavy-ball) momentum [69] in the smooth setting. We show that with an extra momentum step, we can establish last-iterate convergence to an -optimal solution. The Projected SGD with momentum admits the following updates:
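A minimal Python sketch of one common form of such updates is given below. It assumes the scheme $x^{t+1} = \Pi_{\mathcal{X}}(x^t - \eta g^t)$, $g^{t+1} = (1-\beta) g^t + \beta \nabla f(x^{t+1}, \xi^{t+1})$, i.e., a projected step along a moving average of stochastic gradients; the exact form and parameter names used in this paper may differ. The smooth toy objective $F(x) = (\sqrt{x} - 1.5)^2$ on $[0.5, 4]$, the Gaussian noise, and the values of `step` and `beta` are illustrative assumptions.

```python
import math
import random


def project_interval(x, lo, hi):
    """Euclidean projection of a scalar onto the interval [lo, hi]."""
    return min(max(x, lo), hi)


def stochastic_gradient(x, rng, noise_std=0.1):
    """Noisy gradient of the toy objective F(x) = (sqrt(x) - 1.5) ** 2."""
    u = math.sqrt(x)
    return (u - 1.5) / u + rng.gauss(0.0, noise_std)


def projected_sgd_momentum(x0, step, beta, iters, lo=0.5, hi=4.0, seed=0):
    """Projected SGD driven by a moving-average gradient estimator g."""
    rng = random.Random(seed)
    x = x0
    g = stochastic_gradient(x, rng)  # initialize the estimator with one sample
    for _ in range(iters):
        x = project_interval(x - step * g, lo, hi)
        # One fresh stochastic gradient per iteration (no mini-batch).
        g = (1.0 - beta) * g + beta * stochastic_gradient(x, rng)
    return x, g


x_last, g_last = projected_sgd_momentum(x0=4.0, step=0.02, beta=0.1, iters=3000)
```

In this toy run the last iterate ends up close to the minimizer $x = 2.25$, and the averaged estimator `g_last` is close to the true gradient there (zero, since the minimizer is interior in this example).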
Our analysis in this section uses the same properties presented in Section 3.3, but the Lyapunov function used here is completely different from used in Sections 4 and 5. Let , for any , we define the Lyapunov function
(17) |
The following lemma controls the error between the momentum gradient estimator and the true gradient . Similar recursive error control was previously used in general non-convex optimization, e.g., in [21, 19, 34].
The following result is the key to derive global convergence guarantee for Projected SGD with momentum under hidden convexity.
Theorem 7.
Proof.
By the update rule of and following the standard descent inequality (cf. lemma 5), we have for any that
(19) |
By the smoothness of , we derive
where follows from (19), holds by Young’s inequality, i.e., with , , holds by the smoothness of , i.e., . We are now ready to utilize the properties of hidden convex functions to bound and for some specific choice of . We select , for some , and . By Proposition 3, we have for ,
and
Combining the three inequalities above and utilizing the assumption , we complete the proof. ∎
6.1 Hidden Convex Setting
Proof.
By Theorem 7, subtracting from both sides of (18), setting , and taking the expectation, we have for any that
where the second inequality uses boundedness of .
Summing up the inequality above with a multiple of the result of Lemma 2, we recognize the Lyapunov function defined in (17), and derive
where the last step holds for and . Unrolling the recursion from to and choosing , we obtain
where the last inequality holds by setting , , and the number of iterations as
∎
6.2 Hidden Strongly Convex Setting
We conclude the section with the improved result for Projected SGD with momentum under hidden strong convexity.
Proof.
We remark that both Theorems 8 and 9 provide last iterate global convergence for Projected SGD with momentum without the need of using large mini-batch. Additionally, the gradient estimate is guaranteed to converge to the true gradient at the optimum , which might be non-zero when minimizing over a compact set . In the hidden strongly convex case, similarly to Corollaries 2 and 3, the result of Theorem 9 can be translated to the point convergence to the optimal solution.
7 Conclusions
In this work, we study stochastic optimization under hidden convexity and develop sample complexity results for batch-free stochastic (sub-) gradient methods with projection.
Several questions remain open. 1) We know that in case , the derived sample complexity is worst-case optimal (up to a logarithmic factor) in terms of dependence on since it matches the optimal rate known for strongly convex , and therefore, the complexity bounds are unimprovable for SM and P-SGD. However, for merely convex , i.e., , it is unclear if our sample complexity is tight for SM and P-SGD. 2) The benefits of momentum variants of P-SGD can be further explored, e.g., to understand if Nesterov’s acceleration is possible under hidden convexity. 3) When , our iteration and sample complexity results depend on the diameter of the reformulated problem. It would be interesting to explore if can be replaced with the distance to the solution, i.e., .
There are also many other directions to explore in the future. 1) SM and P-SGD are the simplest generic methods for solving (1). It is important to explore more advanced specialized algorithms for applications, which may potentially speed up the convergence. For instance, given stochastic information about the map , one can utilize the samples or in the algorithm. Despite some recent progress [15], the rigorous validation of such methods remains an open problem with a general convex constrained . 2) The development of stochastic gradient methods for solving hidden convex problems with non-convex (e.g., hidden convex) constraints is an interesting research direction [86, 25, 96, 53]. 3) Extension of our results to hidden convex saddle point problems and games [68] also merits further exploration.
Appendix A Technical Lemma
We report the following technical lemma from [22, 34] and include their slightly modified proofs for completeness.
Lemma 3 (Lemma 3.2 in [22]).
Let , and for any , define , where . Then where .
Proof.
By definition of and , we have
where the last equality holds, since , and are both convex (due to the conic combination rule). Multiplying both sides by and rearranging, we get Therefore, by the optimality condition for the proximal sub-problem, we have . ∎
Proof of Lemma 1. Lemma 3 states that for any and , we have . Thus, using the update rule of and non-expansiveness of the projection, we derive
where in we use unbiasedness of the gradient estimator. In , we use Young’s inequality and A.2, and holds by the hypo-monotonicity inequality . The last inequality holds by the choice of and .
Proof.
For a differentiable Lemma 3 implies that for , we have . Thus, using the update rule of and non-expansiveness of the projection, we derive
where in and we use unbiasedness of the gradient estimator and bounded variance. In , we use the Cauchy–Schwarz inequality and smoothness of , i.e., . The last inequality holds by the choice of and , and . ∎
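Both proofs above rely on non-expansiveness of the Euclidean projection onto a closed convex set, i.e., the projection never increases the distance between two points. The following short Python sketch checks this property numerically; the two sets used (a box and a Euclidean ball), the dimension, and the sampling scheme are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def project_box(x, lo, hi):
    # Euclidean projection onto the box [lo, hi]^d.
    return np.clip(x, lo, hi)

def project_ball(x, radius):
    # Euclidean projection onto the ball of the given radius centered at 0.
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

projections = (
    lambda z: project_box(z, -1.0, 1.0),
    lambda z: project_ball(z, 2.0),
)

rng = np.random.default_rng(0)
worst_ratio = 0.0
for _ in range(1000):
    # Two random points in R^5, often outside both sets.
    x, y = rng.normal(scale=3.0, size=(2, 5))
    for proj in projections:
        ratio = np.linalg.norm(proj(x) - proj(y)) / np.linalg.norm(x - y)
        worst_ratio = max(worst_ratio, ratio)

# Non-expansiveness means this ratio never exceeds 1.
print("largest observed ratio:", worst_ratio)
```

A random check of this kind is of course no proof; it only illustrates the inequality that the derivations invoke after applying the update rule.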
Proof of Lemma 2. Using the update rule of and the unbiasedness of stochastic gradients, we have
where the first inequality uses Young’s inequality and the bound of the variance of stochastic gradients, and the last step uses the Lipschitz continuity of the gradient and the fact that for all .
The following technical lemma is fairly standard, e.g., see [41].
Lemma 5.
Let be convex and for some , , define , then
Appendix B Further Improvements with Mini-batching
Applying only one (post-processing) step of mini-batch P-SGD to the output of P-SGD with batch size one is sufficient to translate convergence from to .
Corollary 4.
Proof.
The following corollary shows that if we apply mini-batching at each iteration with sufficiently large batch-size, then the number of iterations required for convergence is reduced to .
Corollary 5.
Let the assumptions of Theorem 5 hold and be the Lipschitz constant of over . Suppose P-SGD with batch-size is applied, i.e., is generated by with , . Define , where , and . Then after
Proof.
The proof follows from the previous corollary by replacing with . ∎
However, we highlight that the results of Theorems 4 and 5 do not require using large batches of samples at every iteration.
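The effect of mini-batching discussed in this appendix can be illustrated with a minimal Python sketch. It is not the procedure of Corollaries 4 and 5: the toy objective `F(x) = 0.5 * ||c(x) - b||^2` with `c(x) = x + x**3`, the box `[-1, 1]^2`, the Gaussian noise model, the step size, and the batch sizes are all assumptions made for illustration. The sketch uses the standard mini-batch form in which the update averages `batch_size` independent stochastic gradients before the projection step, so the variance of the estimator shrinks by a factor of `batch_size`.

```python
import numpy as np

def c(x):
    # Hypothetical invertible map (strictly increasing in each coordinate).
    return x + x**3

def F(x, b):
    # Toy hidden convex objective: convex in u = c(x), non-convex in x.
    return 0.5 * np.sum((c(x) - b) ** 2)

def grad_F(x, b):
    return (c(x) - b) * (1.0 + 3.0 * x**2)

def minibatch_grad(x, b, batch_size, sigma, rng):
    # Average of batch_size noisy gradients; variance scales as sigma^2 / batch_size.
    noise = sigma * rng.standard_normal((batch_size,) + x.shape)
    return grad_F(x, b) + noise.mean(axis=0)

def minibatch_psgd(x0, b, f_star, batch_size, eta=0.01, sigma=1.0,
                   n_iters=2000, seed=0):
    # Projected SGD on the box [-1, 1]^d; returns the last iterate and the
    # function value gap averaged over the second half of the run.
    rng = np.random.default_rng(seed)
    x = np.clip(np.asarray(x0, dtype=float), -1.0, 1.0)
    gaps = []
    for t in range(n_iters):
        g = minibatch_grad(x, b, batch_size, sigma, rng)
        x = np.clip(x - eta * g, -1.0, 1.0)
        if t >= n_iters // 2:
            gaps.append(F(x, b) - f_star)
    return x, float(np.mean(gaps))

def estimator_variance(x, b, batch_size, rng, n_draws=2000, sigma=1.0):
    # Empirical total variance of the mini-batch gradient estimator at x.
    draws = np.stack([minibatch_grad(x, b, batch_size, sigma, rng)
                      for _ in range(n_draws)])
    return float(draws.var(axis=0).sum())

b = np.array([c(0.5), 3.0])
x_star = np.array([0.5, 1.0])        # constrained minimizer for this b
f_star = F(x_star, b)

rng = np.random.default_rng(1)
x_probe = np.array([0.2, 0.3])
var_1 = estimator_variance(x_probe, b, 1, rng)
var_64 = estimator_variance(x_probe, b, 64, rng)

_, gap_1 = minibatch_psgd(np.zeros(2), b, f_star, batch_size=1)
_, gap_64 = minibatch_psgd(np.zeros(2), b, f_star, batch_size=64)

print("variance ratio (B=1 / B=64):", var_1 / var_64)
print("average gap, B=1: ", gap_1)
print("average gap, B=64:", gap_64)
```

With the same constant step size, the larger batch should give a smaller residual function value gap at the price of 64 times more samples per iteration, which is the trade-off the corollaries above quantify.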
Appendix C Historical Remarks
The sub-gradient method, its special case, Projected SGD, and their numerous variants have a long history of development since the first works on stochastic approximation appeared in the 1950s [73, 49, 9, 18].
Convex optimization. The case of convex is particularly well documented [63, 1, 37]. Researchers have studied how to deal with convex constraints, proximal operators, and general Bregman divergences [64, 5], and how to leverage averaging and momentum schemes [70, 76, 38, 55]. In the convex case, global convergence of gradient methods in the function value, i.e., finding with for any , is naturally possible and the sample complexity required is . (This holds for P-SGD under smoothness and bounded variance assumptions, A.1’ and A.2’ in Section 2, or for SM under Lipschitz continuity and bounded second moment of stochastic sub-gradients, i.e., A.2.)
Non-convex optimization. In the last decade, the interest of the optimization community has shifted towards general non-convex problems (often smooth or weakly convex), where only convergence to a FOSP is possible in general [48, 27, 3, 87], i.e., finding with when is smooth. Similar to developments in convex optimization, convergence of non-convex SGD extends to the constrained/proximal setting [39, 51, 11], mirror descent [95, 23], momentum [57, 34], variance reduction [21, 3], and the biased gradient setting [45, 43, 44]. For the more general weakly convex case [22, 60, 95], the convergence guarantees are usually with respect to the gradient norm of a smoothed objective. Some works consider non-convex functions with a specific compositional structure similar to (2), e.g., the composition of a convex function with a differentiable and smooth map , see [65, 52, 28, 94]. Recently, a number of works have focused on non-convex non-smooth optimization (beyond weak convexity) and developed convergence guarantees for suitably defined notions of FOSP [20, 91, 46]. Although the above works consider non-convex problems, which find a wide range of applications, they often only provide convergence to a FOSP rather than global convergence in the function value.
References
- [1] Alekh Agarwal, Martin J Wainwright, Peter Bartlett, and Pradeep Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, volume 22, 2009.
- [2] James Anderson, John C Doyle, Steven H Low, and Nikolai Matni. System level synthesis. Annual Reviews in Control, 47:364–393, 2019.
- [3] Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. Mathematical Programming, 199(1-2):165–214, 2023.
- [4] Anas Barakat, Ilyas Fatkhullin, and Niao He. Reinforcement learning with general utilities: Simpler variance reduction and large state-action space. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 1753–1800, 2023.
- [5] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
- [6] Aharon Ben-Tal and Dick Den Hertog. Hidden conic quadratic representation of some nonconvex quadratic optimization problems. Mathematical Programming, 143:1–29, 2014.
- [7] Aharon Ben-Tal, Dick Den Hertog, and Monique Laurent. Hidden convexity in partially separable optimization. 2011.
- [8] Aharon Ben-Tal and Marc Teboulle. Hidden convexity in some nonconvex quadratically constrained quadratic programming. Mathematical Programming, 72(1):51–63, 1996.
- [9] Julius R Blum. Multidimensional stochastic approximation methods. The Annals of Mathematical Statistics, pages 737–744, 1954.
- [10] Joseph Frédéric Bonnans and Alexander Ioffe. Second-order sufficiency and quadratic growth for nonisolated minima. Mathematics of Operations Research, 20(4):801–817, 1995.
- [11] Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
- [12] Stephen Boyd, Laurent El Ghaoui, Eric Feron, and Venkataramanan Balakrishnan. Linear matrix inequalities in system and control theory. SIAM, 1994.
- [13] Stephen Boyd, Seung-Jean Kim, Lieven Vandenberghe, and Arash Hassibi. A tutorial on geometric programming. Optimization and Engineering, 8:67–127, 2007.
- [14] Jean-Philippe Chancelier and Michel De Lara. Conditional infimum and hidden convexity in optimization. arXiv preprint arXiv:2104.05266, 2021.
- [15] Xin Chen, Niao He, Yifan Hu, and Zikun Ye. Efficient algorithms for minimizing compositions of convex functions and random functions and its applications in network revenue management. arXiv preprint arXiv:2205.01774, 2022.
- [16] Yiwei Chen and Cong Shi. Network revenue management with online inverse batch gradient descent method. Available at SSRN 3331939, 2022.
- [17] Emilie Chouzenoux, Jean-Baptiste Fest, and Audrey Repetti. A Kurdyka-Lojasiewicz property for stochastic optimization algorithms in a non-convex setting. arXiv preprint arXiv:2302.06447v3, 2023.
- [18] Kai Lai Chung. On a stochastic approximation method. The Annals of Mathematical Statistics, pages 463–483, 1954.
- [19] Ashok Cutkosky and Harsh Mehta. Momentum improves normalized SGD. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 2260–2268, 2020.
- [20] Ashok Cutkosky, Harsh Mehta, and Francesco Orabona. Optimal stochastic non-smooth non-convex optimization through online-to-non-convex conversion. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 6643–6670, 2023.
- [21] Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex SGD. In Advances in Neural Information Processing Systems, volume 32, 2019.
- [22] Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019.
- [23] Damek Davis, Dmitriy Drusvyatskiy, and Kellie J MacPhee. Stochastic model-based minimization under high-order growth. arXiv preprint arXiv:1807.00255, 2018.
- [24] Damek Davis and Benjamin Grimmer. Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems. SIAM Journal on Optimization, 29(3):1908–1930, 2019.
- [25] Dongsheng Ding, Kaiqing Zhang, Jiali Duan, Tamer Başar, and Mihailo R Jovanović. Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs. arXiv preprint arXiv:2206.02346, 2022.
- [26] Yuhao Ding, Junzi Zhang, and Javad Lavaei. On the global optimum convergence of momentum-based policy gradient. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, pages 1910–1934, 2022.
- [27] Yoel Drori and Ohad Shamir. The complexity of finding stationary points with stochastic gradient descent. In Proceedings of the 37th International Conference on Machine Learning, pages 2658–2667, 2020.
- [28] Dmitriy Drusvyatskiy and Courtney Paquette. Efficiency of minimizing compositions of convex functions and smooth maps. Mathematical Programming, 178:503–558, 2019.
- [29] Richard J Duffin. Geometric programming-theory and application. Technical report, 1967.
- [30] Tolga Ergen and Mert Pilanci. Global optimality beyond two layers: Training deep ReLU networks via convex programs. In Proceedings of the 38th International Conference on Machine Learning, pages 2993–3003, 2021.
- [31] Ilyas Fatkhullin, Anas Barakat, Anastasia Kireeva, and Niao He. Stochastic policy gradient methods: Improved sample complexity for Fisher-non-degenerate policies. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 9827–9869, 2023.
- [32] Ilyas Fatkhullin, Jalal Etesami, Niao He, and Negar Kiyavash. Sharp analysis of stochastic optimization under global Kurdyka-Łojasiewicz inequality. In Advances in Neural Information Processing Systems, 2022.
- [33] Ilyas Fatkhullin and Boris Polyak. Optimizing Static Linear Feedback: Gradient Method. SIAM Journal on Control and Optimization, 59(5):3887–3911, 2021.
- [34] Ilyas Fatkhullin, Alexander Tyurin, and Peter Richtárik. Momentum provably improves error feedback! In Advances in Neural Information Processing Systems, 2023.
- [35] Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In Proceedings of the 35th International Conference on Machine Learning, pages 1467–1476, 2018.
- [36] Qi Feng and J George Shanthikumar. Supply and demand functions in inventory models. Operations Research, 66(1):77–91, 2018.
- [37] Xavier Fontaine, Valentin De Bortoli, and Alain Durmus. Convergence rates and approximation results for SGD and its continuous-time counterpart. In Proceedings of the 34th Annual Conference on Learning Theory, pages 1965–2058, 2021.
- [38] Sébastien Gadat, Fabien Panloup, and Sofiane Saadane. Stochastic heavy ball. arXiv preprint arXiv:1609.04228v2, 2018.
- [39] Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
- [40] Udaya Ghai, Zhou Lu, and Elad Hazan. Non-convex online learning via algorithmic equivalence. In Advances in Neural Information Processing Systems, volume 35, pages 22161–22172, 2022.
- [41] Osman Güler. On the convergence of the proximal point algorithm for convex minimization. SIAM Journal on Control and Optimization, 29(2):403–419, 1991.
- [42] Oliver Hinder, Aaron Sidford, and Nimit Sohoni. Near-optimal methods for minimizing star-convex functions and beyond. In Proceedings of the 33rd Annual Conference on Learning Theory, pages 1894–1938, 2020.
- [43] Yifan Hu, Xin Chen, and Niao He. On the bias-variance-cost tradeoff of stochastic optimization. In Advances in Neural Information Processing Systems, volume 34, 2021.
- [44] Yifan Hu, Wang Jie, Yao Xie, Andreas Krause, and Daniel Kuhn. Contextual stochastic bilevel optimization. In Advances in Neural Information Processing Systems, 2023.
- [45] Yifan Hu, Siqi Zhang, Xin Chen, and Niao He. Biased stochastic first-order methods for conditional stochastic optimization and applications in meta learning. In Advances in Neural Information Processing Systems, volume 33, pages 2759–2770, 2020.
- [46] Michael Jordan, Guy Kornowski, Tianyi Lin, Ohad Shamir, and Manolis Zampetakis. Deterministic nonsmooth nonconvex optimization. In Proceedings of the 36th Annual Conference on Learning Theory, pages 4570–4597, 2023.
- [47] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.
- [48] Ahmed Khaled and Peter Richtárik. Better theory for SGD in the nonconvex world. Transactions on Machine Learning Research, 2023.
- [49] J. Kiefer and J. Wolfowitz. Stochastic Estimation of the Maximum of a Regression Function. The Annals of Mathematical Statistics, 23(3):462 – 466, 1952.
- [50] Ivan Kobyzev, Simon JD Prince, and Marcus A Brubaker. Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):3964–3979, 2020.
- [51] Guanghui Lan. First-order and stochastic optimization methods for machine learning, volume 1. Springer, 2020.
- [52] Adrian S Lewis and Stephen J Wright. A proximal method for composite minimization. Mathematical Programming, 158:501–546, 2016.
- [53] Tianjiao Li, Ziwei Guan, Shaofeng Zou, Tengyu Xu, Yingbin Liang, and Guanghui Lan. Faster algorithm and sharper analysis for constrained Markov decision process. arXiv preprint arXiv:2110.10351, 2021.
- [54] Xiao Li, Andre Milzarek, and Junwen Qiu. Convergence of random reshuffling under the Kurdyka–Łojasiewicz inequality. SIAM Journal on Optimization, 33(2):1092–1120, 2023.
- [55] Xiaoyu Li, Mingrui Liu, and Francesco Orabona. On the last iterate convergence of momentum methods. In International Conference on Algorithmic Learning Theory, pages 699–717, 2022.
- [56] Ji Liu, Steve Wright, Christopher Ré, Victor Bittorf, and Srikrishna Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. In Proceedings of the 31st International Conference on Machine Learning, pages 469–477, 2014.
- [57] Yanli Liu, Yuan Gao, and Wotao Yin. An improved analysis of stochastic gradient descent with momentum. In Advances in Neural Information Processing Systems, volume 33, pages 18261–18271, 2020.
- [58] Stanislaw Lojasiewicz. A topological property of real analytical subsets. Partial differential equations, 117:87–89, 1963.
- [59] Zhi-Quan Luo and Paul Tseng. Error bounds and convergence analysis of feasible descent methods: a general approach. Annals of Operations Research, 46(1):157–178, 1993.
- [60] Vien Mai and Mikael Johansson. Convergence of a stochastic gradient method with momentum for non-smooth non-convex optimization. In Proceedings of the 37th International Conference on Machine Learning, pages 6630–6639, 2020.
- [61] Sentao Miao and Yining Wang. Network revenue management with nonparametric demand learning: -regret and polynomial dimension dependency. Available at SSRN 3948140, 2021.
- [62] Andjela Mladenovic, Iosif Sakos, Gauthier Gidel, and Georgios Piliouras. Generalized natural gradient flows in hidden convex-concave games and GANs. In Proceedings of the 9th International Conference on Learning Representations, 2021.
- [63] Eric Moulines and Francis Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, volume 24, 2011.
- [64] A. S. Nemirovski and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley UK/USA, 1983.
- [65] Yu Nesterov. Modified Gauss–Newton scheme with worst case guarantees for global performance. Optimization Methods and Software, pages 469–483, 2007.
- [66] Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
- [67] Jorge Nocedal and Stephen J Wright. Numerical optimization. Springer, 1999.
- [68] Sarath Pattathil, Kaiqing Zhang, and Asuman Ozdaglar. Symmetric (optimistic) natural policy gradient for multi-agent learning with parameter convergence. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics, pages 5641–5685, 2023.
- [69] Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
- [70] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
- [71] Boris Teodorovich Polyak. Gradient methods for minimizing functionals. Zhurnal vychislitel’noi matematiki i matematicheskoi fiziki, 3(4):643–653, 1963.
- [72] Quentin Rebjock and Nicolas Boumal. Fast convergence to non-isolated minima: four equivalent conditions for functions. arXiv preprint arXiv:2303.00096, 2023.
- [73] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
- [74] R Tyrrell Rockafellar and Roger J-B Wets. Variational analysis, volume 317. Springer Science & Business Media, 2009.
- [75] Kevin Scaman, Cedric Malherbe, and Ludovic Dos Santos. Convergence rates of non-convex stochastic gradient descent under a generic Łojasiewicz condition and local smoothness. In Proceedings of the 39th International Conference on Machine Learning, 2022.
- [76] Othmane Sebbouh, Robert M Gower, and Aaron Defazio. Almost sure convergence rates for stochastic gradient descent and stochastic heavy ball. In Proceedings of 34th Annual Conference on Learning Theory, pages 3935–3971, 2021.
- [77] Lorenzo Stella, Andreas Themelis, and Panagiotis Patrinos. Forward-backward quasi-Newton methods for nonsmooth optimization problems. Computational Optimization and Applications, 67(3):443–487, 2017.
- [78] Ronald J Stern and Henry Wolkowicz. Indefinite trust region subproblems and nonsymmetric eigenvalue perturbations. SIAM Journal on Optimization, 5(2):286–313, 1995.
- [79] Sebastian U Stich. Unified optimal analysis of the (stochastic) gradient method. arXiv preprint arXiv:1907.04232, 2019.
- [80] Yue Sun and Maryam Fazel. Learning optimal controllers by policy gradient: Global optimality via convex parameterization. In 2021 60th IEEE Conference on Decision and Control (CDC), pages 4576–4581. IEEE, 2021.
- [81] Ryan Tibshirani. Slides on hidden convexity, 2016.
- [82] Emmanouil-Vasileios Vlatakis-Gkaragkounis, Lampros Flokas, and Georgios Piliouras. Solving min-max optimization with hidden structure via gradient descent ascent. In Advances in Neural Information Processing Systems, volume 34, pages 2373–2386, 2021.
- [83] Yifei Wang, Jonathan Lacotte, and Mert Pilanci. The hidden convex optimization landscape of two-layer ReLU neural networks: an exact characterization of the optimal solutions. arXiv preprint arXiv:2006.05900, 2020.
- [84] Yong Xia. A survey of hidden convex optimization. Journal of the Operations Research Society of China, 8(1):1–28, 2020.
- [85] Lin Xiao. On the convergence rates of policy gradient methods. Journal of Machine Learning Research, 23(282):1–36, 2022.
- [86] Tengyu Xu, Yingbin Liang, and Guanghui Lan. CRPO: A new approach for safe reinforcement learning with convergence guarantee. In Proceedings of the 38th International Conference on Machine Learning, pages 11480–11491, 2021.
- [87] Junchi Yang, Xiang Li, Ilyas Fatkhullin, and Niao He. Two sides of one coin: the limits of untuned SGD and the power of adaptive methods. In Advances in Neural Information Processing Systems, 2023.
- [88] Rui Yuan, Robert M Gower, and Alessandro Lazaric. A general sample complexity analysis of vanilla policy gradient. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, pages 3332–3380, 2022.
- [89] Pengyun Yue, Cong Fang, and Zhouchen Lin. On the lower bound of minimizing Polyak-Łojasiewicz functions. In Proceedings of the 36th Annual Conference on Learning Theory, volume 195, pages 2948–2968, 2023.
- [90] Hui Zhang and Wotao Yin. Gradient methods for convex minimization: better rates under weaker conditions. arXiv preprint arXiv:1303.4645, 2013.
- [91] Jingzhao Zhang, Hongzhou Lin, Stefanie Jegelka, Suvrit Sra, and Ali Jadbabaie. Complexity of finding stationary points of nonconvex nonsmooth functions. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 11173–11182, 2020.
- [92] Junyu Zhang, Alec Koppel, Amrit Singh Bedi, Csaba Szepesvari, and Mengdi Wang. Variational policy gradient method for reinforcement learning with general utilities. In Advances in Neural Information Processing Systems, volume 33, pages 4572–4583, 2020.
- [93] Junyu Zhang, Chengzhuo Ni, Csaba Szepesvari, Mengdi Wang, et al. On the convergence and sample efficiency of variance-reduced policy gradient method. In Advances in Neural Information Processing Systems, volume 34, pages 2228–2240, 2021.
- [94] Junyu Zhang and Lin Xiao. Stochastic variance-reduced prox-linear algorithms for nonconvex composite optimization. Mathematical Programming, 2021.
- [95] Siqi Zhang and Niao He. On the convergence rate of stochastic mirror descent for nonsmooth nonconvex optimization. arXiv preprint arXiv:1806.04781, 2018.
- [96] Feiran Zhao, Keyou You, and Tamer Başar. Global convergence of policy gradient primal-dual methods for risk-constrained LQRs. IEEE Transactions on Automatic Control, 2023.