To solve the stochastic reformulation (8) of the optimization problem (1), we adapt the stochastic subgradient projection method from [17]. We refer to this algorithm as the Mini-batch Stochastic Subgradient Projection method (Mini-batch SSP).
Algorithm 1 (Mini-batch SSP):
(18)
(19)
(20)
Using the sampling paradigm of Section 3, the Mini-batch SSP algorithm covers a wide range of mini-batch variants, each associated with a specific probability law governing the data selection rule used to form the mini-batches. Most of these variants, which allow different mini-batch sizes for the objective function and for the functional constraints, have not been explicitly considered in the literature before, e.g., the variants corresponding to partition and nice samplings. Note that at each iteration the algorithm takes a mini-batch stochastic proximal subgradient step aimed at minimizing the objective function (see (19)), followed by a mini-batch subgradient projection step minimizing the feasibility violation (see (20)). More precisely, if the random vector has for all and for all , then step (19) is a mini-batch proximal subgradient iteration:
[display equation missing]
Similarly, if the random vector has for all and for all , then step (20) minimizes the feasibility violation of the observed mini-batch of constraints, i.e., we choose from the mini-batch the constraint that is violated the most, for some index , and then perform a Polyak-type subgradient update on it [24]:
[display equation missing]
Consider an arbitrary nonzero . With a slight abuse of notation, we compute the vector by:
[display equation missing]
When , we have for any choice of . Note that in the Mini-batch SSP algorithm and are deterministic stepsizes. Moreover, when , is the projection of onto the hyperplane given by the functional constraint that is violated the most in the observed mini-batch of constraints given by the index set :
[display equation missing]
that is, we have when we choose . In the next sections we analyse the convergence behaviour of the Mini-batch SSP algorithm and derive rates depending explicitly on the mini-batch sizes and on the properties of the objective function.
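As a rough illustration of the two steps described above, the following Python sketch performs one iteration: a mini-batch proximal subgradient step on the objective, followed by a Polyak-type step on the most violated constraint of the sampled mini-batch. Everything named here is an assumption made for illustration and is not notation from the paper: the function `ssp_step`, the callables `grad_f`, `prox_g`, `h`, `grad_h` and `proj_Y`, the uniform averaging of the sampled subgradients, and the convention that no correction is made when no sampled constraint is violated.

```python
import numpy as np

def ssp_step(x, alpha, beta, obj_batch, con_batch,
             grad_f, prox_g, h, grad_h, proj_Y=lambda z: z):
    """One illustrative iteration (hypothetical interface, not the paper's)."""
    # Step 1: average the sampled (sub)gradients of the objective terms
    # (uniform weights are an assumption), then apply the proximal map.
    g = np.mean([grad_f(i, x) for i in obj_batch], axis=0)
    v = prox_g(x - alpha * g, alpha)

    # Step 2: select the most violated constraint in the sampled mini-batch.
    values = [h(j, v) for j in con_batch]
    j_star = con_batch[int(np.argmax(values))]
    violation = max(h(j_star, v), 0.0)
    if violation == 0.0:
        return proj_Y(v)  # every sampled constraint holds: no correction

    # Polyak-type subgradient step on the selected constraint. The
    # subgradient d is assumed nonzero whenever the constraint is violated.
    d = grad_h(j_star, v)
    return proj_Y(v - beta * violation / np.dot(d, d) * d)

# Usage with two affine constraints h_j(x) = a_j^T x - b_j, a zero objective
# gradient and an identity proximal map (all hypothetical test data).
A = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])
affine = dict(
    grad_f=lambda i, x: np.zeros_like(x),
    prox_g=lambda z, a: z,
    h=lambda j, x: A[j] @ x - b[j],
    grad_h=lambda j, x: A[j],
)
x1 = ssp_step(np.array([3.0, 2.0]), alpha=0.1, beta=1.0,
              obj_batch=[0], con_batch=[0, 1], **affine)
```

With `beta=1.0` and affine constraints, the returned point lies on the hyperplane of the most violated sampled constraint, which matches the projection interpretation given in the text.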
4.1 Convergence analysis: convex objective function
In this section we assume that the functions and in problem (1) are convex and that the random vectors and are nonnegative. Let us define the filtration as the σ-algebra generated by the history of the random vectors and :
[display equation missing]
The next lemma, whose proof is similar to that of Lemma 5 in [17], provides a key descent property for the sequence (recall that and ).
Lemma 4.1.
Let and , with , be convex functions and . Additionally, let the bounded gradient condition from Assumption 2.1 hold. Then, for any and stepsize , we have the following recursion:
[display equation missing] (21)
with and given in Lemma 3.2.
The following lemma establishes a relation between and . The proof is similar to Lemma 6 in [17].
Lemma 4.2.
Let , with , be convex functions and . Additionally, assume that the bounded subgradient condition from Assumption 2.3 holds. Then, for any such that , the following relation holds:
[display equation missing] (22)
with given in Lemma 3.3.
Taking now , then and
[display equation missing]
Thus for any , we have:
[display equation missing] (23)
Lemma 4.3.
Let Assumptions 2.3 and 2.4 hold and the random vectors and be nonnegative. Then, the following relation is valid:
[display equation missing]
with and given in Lemmas 3.3 and 3.4, respectively.
Proof.
Note that for we have and using Lemma 4.2 with , we get:
[display equation missing]
Taking conditional expectation on given , we get:
[display equation missing]
Taking now the full expectation, we obtain our statement.
∎
For simplicity of the exposition let us introduce the following constant:
[display equation missing] (24)
We impose the following conditions on the stepsize :
[display equation missing] (25)
Then, we can define the following average sequence generated by the Mini-batch SSP algorithm:
[display equation missing]
Note that this type of average sequence is also considered in [5] for unconstrained stochastic optimization problems.
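The displayed definition of the average sequence does not survive in this text. The sketch below therefore assumes the stepsize-weighted average that is common in analyses of this kind, i.e., the iterates weighted by their stepsizes and normalized by the sum of the stepsizes. The function name `weighted_average` and the incremental update are illustrative choices, not the paper's definition.

```python
def weighted_average(iterates, stepsizes):
    """Stepsize-weighted average of the iterates (assumed form)."""
    total = 0.0
    avg = None
    for x, alpha in zip(iterates, stepsizes):
        total += alpha
        # Incremental form of sum(alpha_j * x_j) / sum(alpha_j); it avoids
        # storing all iterates and works for floats or NumPy arrays.
        avg = x if avg is None else avg + (alpha / total) * (x - avg)
    return avg

# Hypothetical scalar iterates: weights 3 and 1 give (3*1 + 1*3) / 4.
x_hat = weighted_average([1.0, 3.0], [3.0, 1.0])
```

The incremental update keeps the memory cost independent of the number of iterations, which matters when the average is maintained alongside the main loop.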
The next theorem derives sublinear convergence rates for the average sequence .
Theorem 4.4.
Let , , with , and , with , be convex functions. Additionally, let Assumptions 2.1, 2.3 and 2.4 hold and let the random vectors be nonnegative. Further, consider a nonincreasing positive stepsize sequence as in (25), satisfying and , and a stepsize . Then, we have the following convergence rates for the average sequence in terms of optimality and feasibility violation for problem (1):
[display equations missing]
Proof.
Combining Lemma 4.3 with Lemma 4.1, we have:
[display equation missing]
Together with the fact that , this yields:
[display equation missing]
Summing this relation from , we get:
[display equation missing]
From the definition of the average sequence and the convexity of and of , we get a sublinear rate in expectation for the average sequence in terms of optimality:
[display equation missing]
Also by using Jensen’s inequality and , we have:
[display equation missing]
This proves our statements.
∎
For the stepsize , with and satisfying (25), we have:
[display equation missing]
Consequently, for we obtain from Theorem 4.4 the following sublinear convergence rates:
[display equations missing] (26)
For the particular choice we can perform the same analysis as before and obtain similar convergence bounds (by replacing with ). Now, if we neglect the logarithmic terms, we get exactly the same rates as in (26), but replacing with . Hence, we omit the details for this case.
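The concrete stepsize formula is missing from this text. The sketch below assumes the common polynomially decaying schedule alpha_j = alpha0 / (j + 1)^gamma with gamma in [1/2, 1), and checks numerically the two quantities that typically drive such rates: the sum of the stepsizes, which grows like k^(1 - gamma), and the sum of their squares, which stays bounded for gamma > 1/2 and grows only logarithmically for gamma = 1/2. The name `stepsize_sums` and the schedule itself are assumptions.

```python
import math

def stepsize_sums(alpha0, gamma, k):
    """Return (sum of alpha_j, sum of alpha_j^2) over the first k stepsizes."""
    alphas = [alpha0 / (j + 1) ** gamma for j in range(k)]
    return sum(alphas), sum(a * a for a in alphas)

k = 1000
# gamma = 1/2: the squared stepsizes sum to a harmonic number (log growth).
s1_half, s2_half = stepsize_sums(1.0, 0.5, k)
# gamma = 3/4: the squared stepsizes stay bounded independently of k.
s1_slow, s2_slow = stepsize_sums(1.0, 0.75, k)

# Integral comparison gives these elementary bounds:
lower_half = 2.0 * (math.sqrt(k + 1) - 1.0)      # <= sum of alpha_j
upper_half = 1.0 + math.log(k)                   # >= sum of alpha_j^2
lower_slow = ((k + 1) ** 0.25 - 1.0) / 0.25      # <= sum of alpha_j
upper_slow = 2.0 * 0.75 / (2.0 * 0.75 - 1.0)     # >= sum of alpha_j^2
```

Under this assumed schedule, the ratio of the second sum to the first explains why the case gamma = 1/2 carries an extra logarithmic factor compared with gamma in (1/2, 1).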
Minimizing the right-hand side of the bound for optimality in (26) with respect to , we get an optimal choice for the initial stepsize, i.e., . Since must lie in , we consider for some . We distinguish two cases:
Case 1: If , where is an estimate of , then the expressions for the rates from (26) are (after ignoring terms):
[display equations missing]
Using the definition of and replacing the values for , , and from Theorem 3.5 for both types of samplings, i.e., partition or -, -nice samplings, we get:
[display equations missing]
Case 2: If , for some , then the expressions for the rates from (26) are (after ignoring terms):
[display equation missing] (27)
[display equation missing] (28)
When , from (27) and (28) we have:
[display equations missing]
Using the definition of and the expressions for and from Theorem 3.5 for the partition or -, -nice samplings, we get:
[display equations missing]
When , from (27) and (28) we have:
[display equations missing]
Using the definition of and the expressions for , , and from Theorem 3.5 for the partition or -, -nice samplings, we get:
[display equations missing]
Note that for the initial stepsize choices and for the two particular choices of the sampling (partition or nice samplings), we obtain convergence rates depending explicitly on mini-batch sizes and , namely or , respectively. Hence, in these settings we have linear dependence on the mini-batch sizes for algorithm Mini-batch SSP.
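The two samplings mentioned above can be illustrated with sampling vectors whose entries have expectation one, so that the sampled average is an unbiased estimate of the full average. The definitions of Section 3 are not reproduced in this text, so the sketch below follows the standard constructions and should be read as an assumption. The function names, the uniform choice among blocks, and the scaling factors are illustrative.

```python
import random

def nice_sampling_vector(n, tau, rng=random):
    """tau-nice sampling: a uniformly random subset of size tau."""
    chosen = rng.sample(range(n), tau)
    v = [0.0] * n
    for i in chosen:
        v[i] = n / tau  # each index is chosen with probability tau / n
    return v

def partition_sampling_vector(n, blocks, rng=random):
    """Partition sampling: pick one block of a fixed partition uniformly."""
    block = rng.choice(blocks)
    v = [0.0] * n
    for i in block:
        v[i] = float(len(blocks))  # each block has probability 1 / len(blocks)
    return v

# Hypothetical usage with seeded generators for reproducibility.
v_nice = nice_sampling_vector(10, 4, random.Random(0))
blocks = [[0, 1], [2, 3], [4, 5]]
v_part = partition_sampling_vector(6, blocks, random.Random(1))
```

In both constructions the nonzero entries equal the inverse of the selection probability, which is what makes the mini-batch estimate unbiased.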
Furthermore, since in the convex case we can consider a stepsize sequence , then for one can notice immediately that our stepsize sequence also depends linearly on the mini-batch size for the two particular choices of sampling (partition or nice samplings), i.e., .
Finally, one can notice that when , from Theorem 4.4 improved rates can be derived for Mini-batch SSP in the convex case. For example, for stepsize , with and , we obtain convergence rates for in optimality and feasibility violation of order and , respectively. In particular, for these rates become of order and .
In conclusion, by specializing Theorem 4.4 to different mini-batching strategies, such as partition or nice samplings, we derive explicit expressions for the stepsize as a function of the mini-batch size and, consequently, convergence rates depending linearly on the mini-batch sizes . Hence, Theorem 4.4 shows that a mini-batch variant of the stochastic subgradient projection scheme is more beneficial than its non-mini-batch counterpart.