Mini-batch stochastic subgradient for functional constrained optimization

Nitesh Kumar Singh^a, Ion Necoara^a,b, Vyacheslav Kungurtsev^c. Corresponding author: Ion Necoara, email: [email protected]. ^aAutomatic Control and Systems Engineering Department, University Politehnica Bucharest, Spl. Independentei 313, 060042 Bucharest, Romania; ^bGheorghe Mihoc-Caius Iacob Institute of Mathematical Statistics and Applied Mathematics of the Romanian Academy, 050711 Bucharest, Romania; ^cComputer Science Department, Czech Technical University, Karlovo Namesti 13, 12135 Prague, Czech Republic.

Abstract

In this paper we consider finite sum composite convex optimization problems with many functional constraints. The objective function is expressed as a finite sum of two terms, one of which admits easy computation of (sub)gradients while the other is amenable to proximal evaluations. We assume a generalized bounded gradient condition on the objective which allows us to simultaneously tackle both smooth and nonsmooth problems. We treat both the case where the objective satisfies a strong convexity property and the case where it does not. Further, we assume that each constraint set is given as the level set of a convex but not necessarily differentiable function. We reformulate the constrained finite sum problem into a stochastic optimization problem for which the stochastic subgradient projection method from [17] specializes to a collection of mini-batch variants, with different mini-batch sizes for the objective function and functional constraints, respectively. More specifically, at each iteration, our algorithm takes a mini-batch stochastic proximal subgradient step aimed at minimizing the objective function and then a subsequent mini-batch subgradient projection step minimizing the feasibility violation. By specializing to different mini-batching strategies, we derive exact expressions for the stepsizes as a function of the mini-batch size and in some cases we also derive insightful stepsize-switching rules which describe when one should switch from a constant to a decreasing stepsize regime. We also prove sublinear convergence rates for the mini-batch subgradient projection algorithm which depend explicitly on the mini-batch sizes and on the properties of the objective function. Numerical results show that our mini-batch scheme performs better than its single-batch counterpart.

keywords:

Finite sum convex optimization, functional constraints, stochastic subgradient method, mini-batching, convergence rates.

1 Introduction

In this work we consider the following composite convex optimization problem with many functional constraints:

\begin{array}{rl}F^{*}=\min_{x\in\mathcal{Y}\subseteq\mathbb{R}^{n}}&F(x)\quad\left(:=\frac{1}{N}\sum_{i=1}^{N}(f_{i}(x)+g_{i}(x))\right)\\ \text{subject to }&h_{j}(x)\leq 0\;\;\;\forall j=1:m,\end{array}

(1)

where $f_{i},\;g_{i}$ and $h_{j}$ are proper lower semi-continuous convex functions, $\mathcal{Y}$ is a closed convex set and the number of sum-additive objective function components, $N$, and/or the number of constraints, $m$, are assumed to be large. This model is very general and covers many practical optimization applications, including machine learning and statistics [34, 2], distributed control [16], signal processing [18, 33], operations research and finance [28]. More commonly, one sees a single $g$ representing the regularizer on the parameters. However, we are interested in the more general problem, as there are also applications where one encounters several $g$’s, e.g., in Lasso problems with mixed $\ell_{1}-\ell_{2}$ regularizers and in the case of regularizers with overlapping groups [11]. Multiple functional constraints can arise from multistage stochastic programming with equity constraints [36], robust classification [2], and fairness constraints in machine learning [37]. In the aforementioned applications the corresponding problems are becoming increasingly large in terms of both the number of variables and the size of training data. The use of regularizers and constraints in a composite objective structure makes proximal gradient methods particularly natural for these classes of problems, see, e.g., [18, 31]. Moreover, when the composite objective function is expressed as a large finite sum of functions, then by practical computational necessity we may have access only to stochastic estimates, via samples, of the (sub)gradients, proximal operators or projections. In this setting, the most popular stochastic methods are the stochastic gradient descent (SGD) [30, 19, 8, 18] and the stochastic proximal point (SPP) algorithms [15, 22, 18, 26, 31]. However, in practice it has been noticed that these stochastic methods converge slowly.
To improve the convergence speed, one can use techniques such as mini-batching [1, 21, 23, 29, 3, 32], averaging [22, 25, 35] or variance reduction strategies [12, 14, 7]. In this work we consider a versatile mini-batching framework for a stochastic subgradient projection method for solving the constrained finite sum problem (1), and demonstrate, theoretically and experimentally, its favorable convergence properties.

The papers most related to our work are [21, 29, 17]. However, the optimization problem, the algorithm and consequently the convergence analysis in these papers differ from those in the present paper. In particular, [21] considers the optimization problem (1) with a single nonsmooth convex function $f$ and $g_{i}\equiv 0$ for all $i=1:N$. Additionally, the objective function $f$ is assumed to be strongly convex. Under this setting, [21] proposes a stochastic subgradient scheme with mini-batching for the constraints and derives a sublinear convergence rate for it, whose proof relies heavily on the strong convexity of $f$, on the boundedness of the subgradients of $f$, and on the uniqueness of the optimal solution. In this work, these conditions need not hold, as we consider more general assumptions (i.e., smooth/nonsmooth functions $f_{i}$’s, objective function $F$ convex or satisfying a strong convexity condition) and a more general optimization problem (i.e., finite sum composite objective). Moreover, our mini-batch subgradient method differs from the one in [21]: we consider mini-batching to handle both the objective function and the constraints, while [21] considers mini-batching only for the constraints; furthermore, the data selection rules used to form the mini-batches are also different in the two papers. Due to these distinctions, our convergence analysis and rates are not the same as the ones in [21]. In [29] an unconstrained finite sum problem is considered, i.e., in problem (1) $g_{i}\equiv 0$ for all $i=1:N$, and $h_{j}\equiv 0$ for all $j=1:m$, and reformulated as a stochastic optimization problem. This reformulation is then solved using SGD. In this paper we extend the stochastic reformulation from [29] to the finite sum composite objective function in (1) and add a new stochastic reformulation for the constraints.
Then, we use the stochastic subgradient projection method from [17] to solve the reformulated problem, leading to an array of mini-batch variants depending on the data selection rule used to form mini-batches. This is the first time such an analysis is performed on the general problem (1), and most of our mini-batch variants of the stochastic subgradient projection method are new.

Contributions. In this paper we propose mini-batch variants of the stochastic subgradient projection algorithm for solving the constrained finite sum composite convex problem (1). The main advantage of our formulation is that the theoretical convergence guarantees of the corresponding numerical scheme require only very basic properties of the problem functions (convexity, bounded gradient type conditions) and access only to stochastic (sub)gradients and proximal operators. The main contributions are:

Stochastic reformulation: we propose an equivalent stochastic reformulation for the finite sum composite objective function and for the constraints of problem (1) using arbitrary sampling rules. We also extend the assumptions considered for the original problem to the new stochastic reformulation and derive explicit bounds for the corresponding constants appearing in the assumptions, which depend on the random variables that define the stochastic problem. By specializing our bounds to different mini-batching strategies, such as partition sampling and nice sampling, we derive exact expressions for these constants.

Convergence rates: the stochastic problem is then solved with the stochastic subgradient projection method from [17], which specializes to a range of possible mini-batch schemes with different batch sizes for the objective function and functional constraints. Based on the constants defining the assumptions, we derive exact expressions for the stepsize as a function of the mini-batch size. Moreover, when the objective function satisfies a strong convexity condition we also derive informative stepsize-switching rules which describe when one should switch from a constant to a decreasing stepsize regime. At each iteration, the algorithm takes a mini-batch stochastic proximal subgradient step aimed at minimizing the objective function, followed by a feasibility step for minimizing the feasibility violation of the observed mini-batch of random constraints. We prove sublinear convergence rates for weighted averages of the iterates in terms of expected distance to the constraint set, as well as for expected optimality of the function values/distance to the optimal set. Our rates depend explicitly on the mini-batch sizes and on the properties of the problem functions. This work is the first analysis of a mini-batch stochastic subgradient projection method on the general problem (1), and most of our mini-batch variants were never explicitly considered in the literature before.

Content. In Section 2 we introduce some basic notation and present the main assumptions. In Section 3 we provide a stochastic reformulation for the original problem, present several sampling strategies and derive some relevant bounds. In Section 4 we present a mini-batch stochastic subgradient projection algorithm and analyze its convergence. Finally, in Section 5, the performance on numerical simulations is presented, providing support for the effectiveness of our method.

2 Notations and assumptions

For the finite sum problem (1) we assume that $\mathcal{Y}$ is a simple convex set, i.e., it is easy to evaluate the projection onto $\mathcal{Y}$. Moreover, we assume that the interior of $\mathcal{Y}$ is contained in the effective domains of the functions $f_{i},\;g_{i}$ and $h_{j}$. Additionally, all the functions $g_{i}$ share a common domain, $\text{dom}\,g$. We make no assumptions on the differentiability of $f_{i}$ and use, with some abuse of notation, the same expression for the gradient or the subgradient of $f_{i}$ at $x$, that is $\nabla f_{i}(x)\in\partial f_{i}(x)$, where the subdifferential $\partial f_{i}(x)$ is either a singleton or a nonempty set for any $i=1:N$. The same convention is used for the $g_{i}$’s. Throughout the paper, the subgradient of $h(x,\xi)$ w.r.t. $x$, $\nabla_{x}h(x,\xi)$, is denoted simply by $\nabla h(x,\xi)$. Let us denote $F_{i}(x)=f_{i}(x)+g_{i}(x)$. Since the $g_{i}$’s are convex functions, from basic calculus rules we have $\nabla F_{i}(x)=\nabla f_{i}(x)+\nabla g_{i}(x)$. Further, for a given $x\in\mathbb{R}^{n}$, $\|x\|$ denotes its Euclidean norm, and for a scalar $x$ we write $(x)_{+}=\max\{0,x\}$. The feasible set of (1) is denoted by:

\mathcal{X}=\left\{x\in\mathcal{Y}:\;h_{j}(x)\leq 0\;\;\forall j=1:m\right\}.

We assume the optimal value $F^{*}>-\infty$ and $\mathcal{X}^{*}\neq\emptyset$ denotes the optimal set, i.e.:

F^{*}=\min_{x\in\mathcal{X}}F(x):=\frac{1}{N}\sum_{i=1}^{N}F_{i}(x),\quad\mathcal{X}^{*}=\{x\in\mathcal{X}\mid F(x)=F^{*}\}.

For any $x\in\mathbb{R}^{n}$ we denote its projection onto the optimal set $\mathcal{X}^{*}$ by $\bar{x}$ , that is:

\bar{x}=\Pi_{\mathcal{X}^{*}}(x).

We consider additionally the following assumptions. First, we assume that the objective function satisfies some bounded gradient condition.

Assumption 2.1.

The (sub)gradients of $F$ satisfy the following bounded gradient condition: there exist nonnegative constants $L\geq 0$ and $B\geq 0$ such that:

B^{2}+L(F(x)-F^{*})\geq\frac{1}{N}\sum_{i=1}^{N}\|\nabla F_{i}(x)\|^{2}\quad\forall x\in\mathcal{Y}.

(2)

To the best of our knowledge this assumption was first introduced in [18] and further studied in [17, 29]. We present two examples of functions satisfying this assumption below (see [17] for proofs).

Example 1 [Non-smooth (Lipschitz) functions satisfy Assumption 2.1]: Assume that the convex functions $f_{i}$ and $g_{i}$ have bounded (sub)gradients:

\|\nabla f_{i}(x)\|\leq B_{f_{i}}\quad\text{and}\quad\|\nabla g_{i}(x)\|\leq B_{g_{i}}\quad\forall x\in\mathcal{Y}.

Then, Assumption 2.1 holds with $L=0\;\;\text{and}\;\;B^{2}=\frac{2}{N}\sum_{i=1}^{N}(B_{f_{i}}^{2}+B_{g_{i}}^{2}).$
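As a quick numerical illustration of Example 1, the sketch below checks condition (2) with $L=0$ on a hypothetical scalar instance that is not taken from the paper: $f_{i}(x)=|a_{i}x-b_{i}|$ and $g_{i}(x)=\lambda|x|$, for which $B_{f_{i}}=|a_{i}|$ and $B_{g_{i}}=\lambda$. The data, the test points and the helper names are assumptions made only for this check.

```python
import random

def sign(t):
    # One valid subgradient of |t|; 0 is chosen at the kink t = 0.
    return (t > 0) - (t < 0)

def bounded_gradient_holds(a, b, lam, xs):
    """Check (2) with L = 0 for f_i(x) = |a_i x - b_i| and g_i(x) = lam |x|."""
    N = len(a)
    # B_{f_i} = |a_i| and B_{g_i} = lam bound the subgradients of f_i and g_i.
    B2 = 2.0 / N * sum(ai ** 2 + lam ** 2 for ai in a)
    for x in xs:
        # Subgradient of F_i = f_i + g_i at x.
        grads = [ai * sign(ai * x - bi) + lam * sign(x) for ai, bi in zip(a, b)]
        if sum(g ** 2 for g in grads) / N > B2 + 1e-12:
            return False
    return True

rng = random.Random(0)  # fixed seed, illustrative data only
a = [rng.uniform(-2.0, 2.0) for _ in range(20)]
b = [rng.uniform(-1.0, 1.0) for _ in range(20)]
xs = [rng.uniform(-5.0, 5.0) for _ in range(200)]
holds = bounded_gradient_holds(a, b, 0.5, xs)
```

The check succeeds because $(u+v)^{2}\leq 2u^{2}+2v^{2}$, which is where the factor $2$ in $B^{2}$ comes from.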

Example 2 [Smooth (Lipschitz gradient) functions satisfy Assumption 2.1]: Condition (2) contains the class of convex functions formed as a sum of two convex terms, one having Lipschitz continuous gradients with constants $L_{f_{i}}$ ’s and the other having bounded subgradients over bounded set $\mathcal{Y}$ with constant $B_{g}$ . Then, Assumption 2.1 holds with (here $D$ denotes the diameter of $\mathcal{Y}$ ):

L=4\max_{i=1:N}L_{f_{i}}\;\text{and}\;B^{2}\!=\!\frac{4}{N}\sum_{i=1}^{N}B_{g}^{2}+4\max_{\bar{x}\in\mathcal{X}^{*}}\left(\!\frac{1}{N}\sum_{i=1}^{N}\|\nabla f_{i}(\bar{x})\|^{2}+D\max_{i=1:N}L_{f_{i}}\|\nabla F(\bar{x})\|\!\right)\!.

In our analysis below we also assume $F$ to satisfy a (strong) convexity condition:

Assumption 2.2.

The function $F$ satisfies a (strong) convexity condition on $\mathcal{Y}$, i.e., there exists a nonnegative constant $\mu\geq 0$ such that:

F(y)\geq F(x)+\langle\nabla F(x),y-x\rangle+\frac{\mu}{2}\|y-x\|^{2}\quad\forall x,y\in\mathcal{Y}.

(3)

Note that when $\mu=0$ relation (3) states that $F$ is convex on $\mathcal{Y}$ . Additionally, we assume the following bound for the functional constraints:

Assumption 2.3.

The functional constraints $h_{j}$ have bounded subgradients on $\mathrm{dom}\,g$, i.e., there exists $B_{j}>0$ such that:

\|\nabla h_{j}(x)\|\leq B_{j}\quad\forall\,\nabla h_{j}(x)\in\partial h_{j}(x),\;x\in\mathrm{dom}\;g,\;\;j=1:m.

(4)

Note that this assumption implies that the functional constraints $h_{j}$ are Lipschitz continuous. Additionally, we assume a Hölderian growth condition for the constraints.

Assumption 2.4.

The functional constraints satisfy additionally the following Hölderian growth condition for some constants $\bar{c}>0$ and $q\geq 1$ :

\mathrm{dist}^{2q}(y,\mathcal{X})\leq\bar{c}\left(\max_{j=1:m}(h_{j}(y))\right)_{+}^{2}\;\;\;\forall y\in\mathrm{dom}\;g.

(5)

Note that this assumption has been used in [32] in the context of convex feasibility problems and in [17] for $q=1$ in the context of stochastic optimization problems. It holds e.g., when the feasible set $\mathcal{X}$ has an interior point, see e.g. [13], or when the feasible set is polyhedral. However, Assumption 2.4 holds for more general sets, e.g., when a strengthened Slater condition holds for the collection of functional constraints, such as the generalized Robinson condition, as detailed in [13] Corollary 3.

3 Stochastic reformulation

In this section we reformulate the deterministic problem (1) into a stochastic one wherein the objective function is expressed in the form of an expectation. We analyze its main properties and then use the machinery of stochastic sampling to devise efficient mini-batch schemes. For this we use an arbitrary sampling paradigm. More precisely, let $(\Omega_{1},\mathcal{F}_{1},\mathbb{P}_{1})$ be a finite probability space with $\Omega_{1}=\{1,...,N\}$ and a random vector $\zeta\in\mathbb{R}^{N}$ drawn from some probability distribution $\mathbb{P}_{1}$ having the property $\mathbb{E}[\zeta^{i}]=1\;\text{for all}\;i=1:N$ . Then, let us define the following functions:

\displaystyle f(x,\zeta)=\frac{1}{N}\sum_{i=1}^{N}\zeta^{i}f_{i}(x)\;\;\text{and}\;\;g(x,\zeta)=\frac{1}{N}\sum_{i=1}^{N}\zeta^{i}g_{i}(x).

(6)

Note that if $f_{i}$ and $g_{i}$, with $i=1:N$, are convex functions and $\zeta^{i}\geq 0$, then $f(\cdot,\zeta)$ and $g(\cdot,\zeta)$ are also convex functions. Also consider a probability space $(\Omega_{2},\mathcal{F}_{2},\mathbb{P}_{2})$, with $\Omega_{2}=\{1,...,m\}$ and a random vector $\xi\in\mathbb{R}^{m}$ drawn from some probability distribution $\mathbb{P}_{2}$ having the property $\mathbb{E}[\xi^{j}]>0$ and $0\leq\xi^{j}\leq\bar{\xi}$, for all $j=1:m$ and some $\bar{\xi}<\infty$. Then, let us define the functional constraints:

\displaystyle h(x,\xi)=\max_{j=1:m}(\xi^{j}h_{j}(x)).

(7)

Since $\xi^{j}\geq 0$ , then $h(\cdot,\xi)$ is a convex function provided that $h_{j},\;\text{with}\;j=1:m$ , are convex functions. Then, we can define a stochastic reformulation of the original optimization problem (1):

\begin{array}{rl}F^{*}=&\min\limits_{x\in\mathcal{Y}\subseteq\mathbb{R}^{n}}\;\mathbb{E}[f(x,\zeta)+g(x,\zeta)]\\ &\text{subject to }\;h(x,\xi)\leq 0\;\;\forall\xi\in\mathcal{F}_{2}.\end{array}

(8)

Note that $F(x,\zeta)=f(x,\zeta)+g(x,\zeta)$ and $\nabla F(x,\zeta)=\nabla f(x,\zeta)+\nabla g(x,\zeta)$ are unbiased estimators of $F(x)$ and $\nabla F(x)$ , respectively. Indeed:

\displaystyle\mathbb{E}[\nabla F(x,\zeta)]\overset{\eqref{eq:reformulation_f}}{=}\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}[\zeta^{i}](\nabla f_{i}(x)+\nabla g_{i}(x))\overset{\mathbb{E}[\zeta^{i}]=1}{=}\nabla F(x).

In the following lemma we prove that under some basic conditions on the random vectors, the deterministic problem (1) is equivalent to the stochastic problem (8).

Lemma 3.1.

Let the random vectors $\zeta$ and $\xi$ satisfy $\mathbb{E}[\zeta^{i}]=1\;\;\text{for all}\;\;i=1:N$ and $\xi\geq 0$, with $\mathbb{E}[\xi^{j}]>0\;\;\text{for all}\;\;j=1:m$. Then, the deterministic problem (1) is equivalent to the stochastic problem (8).

Proof.

For the objective function in problem (8), we have:

\begin{aligned}
\mathbb{E}[f(x,\zeta)+g(x,\zeta)]&=\mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N}\zeta^{i}(f_{i}(x)+g_{i}(x))\right]\\
&=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}[\zeta^{i}](f_{i}(x)+g_{i}(x))=\frac{1}{N}\sum_{i=1}^{N}(f_{i}(x)+g_{i}(x))\\
&=F(x).
\end{aligned}

For the functional constraints, if $x$ is feasible for the stochastic problem (8), i.e., $h(x,\xi)\leq 0$, then we have:

\displaystyle h(x,\xi)=\max_{j=1:m}(\xi^{j}h_{j}(x))\leq 0\implies\xi^{j}h_{j}(x)\leq 0\;\;\;\forall j=1:m.

Taking expectation on both sides, we get:

\displaystyle\mathbb{E}[\xi^{j}h_{j}(x)]\leq 0\overset{\mathbb{E}[\xi^{j}]>0}{\implies}h_{j}(x)\leq 0\;\;\;\forall j=1:m,

i.e., $x$ is feasible for the original problem (1). Conversely, if $h_{j}(x)\leq 0$ for all $j=1:m$, then using $\xi^{j}\geq 0$, we get:

\displaystyle\xi^{j}h_{j}(x)\leq 0\implies\max_{j=1:m}(\xi^{j}h_{j}(x))\leq 0\implies h(x,\xi)\leq 0.

This concludes our proof. ∎
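The equivalence in Lemma 3.1 can be illustrated on a small instance with exact rational arithmetic. The sketch below assumes the simplest admissible random vectors, namely $\zeta=Ne_{i}$ with probability $1/N$ (so that $\mathbb{E}[\zeta^{i}]=1$) and $\xi=e_{j}$ with probability $1/m$. The function values and all helper names are hypothetical and serve only this illustration.

```python
from fractions import Fraction

def stochastic_objective(F_vals, zeta):
    # F(x, zeta) = (1/N) * sum_i zeta^i * F_i(x); F_vals holds F_i(x) at a fixed x.
    return sum(z * Fi for z, Fi in zip(zeta, F_vals)) / len(F_vals)

def stochastic_constraint(h_vals, xi):
    # h(x, xi) = max_j xi^j * h_j(x); h_vals holds h_j(x) at a fixed x.
    return max(w * hj for w, hj in zip(xi, h_vals))

def scaled_unit(n, i, scale):
    # Vector with `scale` in position i and zeros elsewhere.
    return [scale if k == i else 0 for k in range(n)]

# Hypothetical values F_i(x) at some fixed point x.
F_vals = [Fraction(3), Fraction(-1), Fraction(5), Fraction(2)]
N = len(F_vals)
# Exact expectation over zeta = N * e_i, each with probability 1/N.
expected_F = sum(Fraction(1, N) * stochastic_objective(F_vals, scaled_unit(N, i, N))
                 for i in range(N))
true_F = sum(F_vals) / N

def feasible_stochastic(h_vals):
    # Feasibility must hold for every realization xi = e_j.
    m = len(h_vals)
    return all(stochastic_constraint(h_vals, scaled_unit(m, j, 1)) <= 0
               for j in range(m))

def feasible_deterministic(h_vals):
    return all(hj <= 0 for hj in h_vals)
```

For these choices the expected stochastic objective equals $F(x)$ exactly, and the two feasibility tests agree on both a feasible and an infeasible vector of constraint values.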

3.1 Properties of stochastic problem

In this section we prove that the assumptions valid for the original problem (1) can be extended to the stochastic reformulation (8). Moreover, we derive explicit bounds for the corresponding assumptions’ constants depending on the random variables that define the stochastic problem. Let $\nabla\hat{F}(x)$ be the matrix of dimension $n\times N$ obtained by arranging $\nabla F_{i}(x)$ ’s as its columns. In the next lemma we prove that a stochastic bounded gradient type condition holds for the objective function of problem (8).

Lemma 3.2.

Let Assumption 2.1 hold and consider the random vector $\zeta$ satisfying $\mathbb{E}[\zeta^{i}]=1$ . Then, the (sub)gradients of $F(\cdot,\zeta)$ from the problem (8) satisfy a stochastic bounded gradient condition:

\mathcal{B}^{2}+\mathcal{L}(F(x)-F^{*})\geq\mathbb{E}_{\zeta}[\|\nabla F(x,\zeta)\|^{2}]\quad\forall x\in\mathcal{Y},

(9)

with the parameters $\mathcal{B}^{2}=\frac{\mathbb{E}[\|\zeta\|^{2}]}{N}B^{2}$ and $\mathcal{L}=\frac{\mathbb{E}[\|\zeta\|^{2}]}{N}L$ .

Proof.

Using the definition of $F(x,\zeta)$ , we get:

\begin{aligned}
\|\nabla F(x,\zeta)\|^{2}&=\left\|\frac{1}{N}\sum_{i=1}^{N}\zeta^{i}\nabla F_{i}(x)\right\|^{2}=\frac{1}{N^{2}}\|\nabla\hat{F}(x)\zeta\|^{2}\leq\frac{1}{N^{2}}\|\nabla\hat{F}(x)\|^{2}\|\zeta\|^{2}\\
&\leq\frac{1}{N^{2}}\|\nabla\hat{F}(x)\|_{F}^{2}\|\zeta\|^{2}=\frac{\|\zeta\|^{2}}{N}\left(\frac{1}{N}\sum_{i=1}^{N}\|\nabla F_{i}(x)\|^{2}\right)\\
&\overset{\eqref{as:main1_spg}}{\leq}\frac{\|\zeta\|^{2}}{N}B^{2}+\frac{\|\zeta\|^{2}}{N}L(F(x)-F(\bar{x})),
\end{aligned}

where the second inequality follows from the fact that the Frobenius norm is larger than the 2-norm of a matrix. Then, the statement follows after taking expectation with respect to $\zeta$ . ∎
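The key inequality in this proof, $\|\nabla F(x,\zeta)\|^{2}\leq\frac{\|\zeta\|^{2}}{N}\big(\frac{1}{N}\sum_{i=1}^{N}\|\nabla F_{i}(x)\|^{2}\big)$, can be checked numerically. The sketch below uses randomly generated vectors in place of the (sub)gradients $\nabla F_{i}(x)$ and a hypothetical sampling vector with entries in $\{0,2.5\}$; these are assumptions for illustration only.

```python
import random

def lemma32_gap(grads, zeta):
    """Right-hand side minus left-hand side of the inequality above."""
    N, n = len(grads), len(grads[0])
    # grad F(x, zeta) = (1/N) * sum_i zeta^i * grad F_i(x)
    mixed = [sum(zeta[i] * grads[i][k] for i in range(N)) / N for k in range(n)]
    lhs = sum(c * c for c in mixed)
    # (1/N) * sum_i ||grad F_i(x)||^2
    mean_sq = sum(sum(c * c for c in g) for g in grads) / N
    rhs = sum(z * z for z in zeta) / N * mean_sq
    return rhs - lhs

rng = random.Random(1)  # fixed seed, illustrative data only
gaps = []
for trial in range(100):
    N, n = 5, 3
    grads = [[rng.gauss(0.0, 1.0) for col in range(n)] for row in range(N)]
    zeta = [rng.choice([0.0, 2.5]) for row in range(N)]
    gaps.append(lemma32_gap(grads, zeta))
```

Every gap is nonnegative up to rounding, which is simply the Cauchy-Schwarz step $\|\nabla\hat{F}(x)\zeta\|\leq\|\nabla\hat{F}(x)\|_{F}\|\zeta\|$ used in the proof.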

From Jensen’s inequality, taking $x=x^{*}\in\mathcal{X}^{*}$ in (2), we get:

B^{2}\geq\mathbb{E}_{\zeta}[\|\nabla F(x^{*},\zeta)\|^{2}]\geq\|\mathbb{E}_{\zeta}[\nabla F(x^{*},\zeta)]\|^{2}=\|\nabla F(x^{*})\|^{2}\quad\forall x^{*}\in\mathcal{X}^{*}.

(10)

Since $F(x,\zeta)$ is an unbiased estimator of $F(x)$ , it also follows that if Assumption 2.2 holds for the original objective function, then the same condition is valid for the objective function of the stochastic problem (8) with the same constant $\mu$ . Further, for a given $x$ let us define the set of active constraints by $J^{*}(x)=\{j=1:m\;|\;h(x,\xi)=\xi^{j}h_{j}(x)\}$ . In the next lemma we provide a bounded subgradient condition for the functional constraints of the stochastic problem (8).

Lemma 3.3.

Let Assumption 2.3 hold and consider the random vector $\xi\geq 0$ satisfying $\mathbb{E}[\xi^{j}]>0$ and $\xi^{j}\leq\bar{\xi}$, for all $j=1:m$ and some $\bar{\xi}<\infty$. Then, the functional constraints $h(\cdot,\xi)$ of the problem (8) have bounded subgradients on $\mathrm{dom}\,g$, i.e.:

\displaystyle\|\nabla h(x,\xi)\|\leq\mathcal{B}_{h}\quad\forall x\in{\mathrm{dom}}\,g\;\;\text{and }\;\;\xi\in\mathcal{F}_{2},

(11)

where $\nabla h(x,\xi)\in\partial h(x,\xi)$ and $\mathcal{B}_{h}=\bar{\xi}\max_{j=1:m}B_{j}.$

Proof.

Let $x\in\text{dom}\;g$ and $\nabla h(x,\xi)\in\partial h(x,\xi)$ . Then, from the definition of $h(\cdot,\xi)$ and of the index set $J^{*}(x)$ , we have:

h(x,\xi)=\max_{j=1:m}(\xi^{j}h_{j}(x))=\xi^{j^{*}}\cdot h_{j^{*}}(x)\quad\forall j^{*}\in J^{*}(x).

Then, we further have (see Lemma 3.1.13 in [20]):

\begin{aligned}
&\nabla h(x,\xi)\in\text{Conv}\{\xi^{j^{*}}\nabla h_{j^{*}}(x)\mid j^{*}\in J^{*}(x)\}\\
\implies\;&\|\nabla h(x,\xi)\|\leq\max_{\theta_{j^{*}}\geq 0,\;\sum_{j^{*}\in J^{*}(x)}\theta_{j^{*}}=1}\left\|\sum_{j^{*}\in J^{*}(x)}\theta_{j^{*}}\xi^{j^{*}}\cdot\nabla h_{j^{*}}(x)\right\|\\
&\leq\max_{\theta_{j^{*}}\geq 0,\;\sum_{j^{*}\in J^{*}(x)}\theta_{j^{*}}=1}\sum_{j^{*}\in J^{*}(x)}\theta_{j^{*}}\xi^{j^{*}}\cdot\|\nabla h_{j^{*}}(x)\|\\
&\leq\max_{\theta_{j^{*}}\geq 0,\;\sum_{j^{*}\in J^{*}(x)}\theta_{j^{*}}=1}\sum_{j^{*}\in J^{*}(x)}\theta_{j^{*}}\bar{\xi}\cdot\|\nabla h_{j^{*}}(x)\|\\
&=\bar{\xi}\max_{j^{*}\in J^{*}(x)}\|\nabla h_{j^{*}}(x)\|\overset{\eqref{ass:3}}{\leq}\bar{\xi}\max_{j=1:m}B_{j}=\mathcal{B}_{h},
\end{aligned}

(12)

which proves our statement. ∎

In the next lemma we provide a Hölderian growth type condition for the functional constraints of the stochastic problem (8).

Lemma 3.4.

Let Assumption 2.4 hold and consider the random vector $\xi$ satisfying $\xi\geq 0$ and $\mathbb{E}[\xi^{j}]>0$ for all $j=1:m$ . Then, the functional constraints of the problem (8) satisfy the following Hölderian growth type condition:

\mathrm{dist}^{2q}(y,\mathcal{X})\leq c\cdot\mathbb{E}\left[(h(y,\xi))_{+}^{2}\right]\;\;\forall y\in{\mathrm{dom}}\,g,

(13)

with the parameter $c=\left(\frac{\bar{c}}{\min_{j=1:m}\mathbb{E}[\xi^{j}]}\right)$ .

Proof.

Let $y\in\text{dom}\;g$. Using the definition of $h(\cdot,\xi)$ and Jensen’s inequality, we have:

\begin{aligned}
\mathbb{E}\left[(h(y,\xi))_{+}^{2}\right]&=\mathbb{E}\left[\left(\max_{j=1:m}(\xi^{j}h_{j}(y))\right)_{+}^{2}\right]\geq\left(\max_{j=1:m}(\mathbb{E}[\xi^{j}]h_{j}(y))\right)_{+}^{2}\\
&\overset{\mathbb{E}[\xi^{j}]>0}{\geq}\min_{j=1:m}\mathbb{E}[\xi^{j}]\left(\max_{j=1:m}(h_{j}(y))\right)_{+}^{2}\overset{\eqref{eq:constrainterrbound}}{\geq}\left(\min_{j=1:m}\mathbb{E}[\xi^{j}]\right)\frac{1}{\bar{c}}\text{dist}^{2q}(y,\mathcal{X}).
\end{aligned}

Thus, we have:

\displaystyle\text{dist}^{2q}(y,\mathcal{X})\leq\frac{\bar{c}}{\min\limits_{j=1:m}\mathbb{E}[\xi^{j}]}\mathbb{E}\left[(h(y,\xi))_{+}^{2}\right],

(14)

which proves our statement. ∎
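To make the growth condition (13) concrete, the sketch below evaluates both sides on a hypothetical one-dimensional instance that is not from the paper: $\mathcal{X}=[-1,1]$ described by $h_{1}(y)=y-1$ and $h_{2}(y)=-y-1$, so that $\mathrm{dist}(y,\mathcal{X})=(\max_{j}h_{j}(y))_{+}$ and Assumption 2.4 holds with $\bar{c}=1$, $q=1$. The random vector is assumed to be $\xi=e_{j}$ with probability $1/2$ each.

```python
def h_values(y):
    # Two constraints describing X = [-1, 1].
    return [y - 1.0, -y - 1.0]

def dist_to_X(y):
    # Euclidean distance from y to the interval [-1, 1].
    return max(abs(y) - 1.0, 0.0)

def expected_violation_sq(y, probs):
    # With xi = e_j, h(y, xi) = max(h_j(y), 0), hence (h(y, xi))_+ = (h_j(y))_+.
    return sum(pj * max(hj, 0.0) ** 2 for pj, hj in zip(probs, h_values(y)))

probs = [0.5, 0.5]        # E[xi^j] = P[xi = e_j]
c_bar, q = 1.0, 1         # constants of Assumption 2.4 for this instance
c = c_bar / min(probs)    # constant of Lemma 3.4
ys = [-3.0 + 0.25 * k for k in range(25)]   # grid on [-3, 3]
growth_holds = all(
    dist_to_X(y) ** (2 * q) <= c * expected_violation_sq(y, probs) + 1e-12
    for y in ys
)
```

On this instance the bound is attained with equality outside $\mathcal{X}$, so the factor $1/\min_{j}\mathbb{E}[\xi^{j}]$ cannot be improved here.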

3.2 Choices for random vectors $\zeta$ and $\xi$

In this section we provide several choices for the two random vectors $\zeta$ and $\xi$. Let $\mathcal{I}\subseteq[1:N]$ and let $e_{\mathcal{I}}=\sum_{i\in\mathcal{I}}e_{i}$, where $\{e_{1},...,e_{N}\}$ is the standard basis of $\mathbb{R}^{N}$. These subsets will be selected using a random set-valued map, i.e., a sampling $S$. A sampling $S$ is uniquely characterized by choosing the probabilities $p_{\mathcal{I}}\geq 0$ for all subsets $\mathcal{I}$:

\mathbb{P}[S=\mathcal{I}]=p_{\mathcal{I}}\;\;\forall\mathcal{I}\subset[1:N],

such that $\sum_{\mathcal{I}\subset[1:N]}p_{\mathcal{I}}=1$. A sampling $S$ is called proper if $p_{i}=\mathbb{P}[i\in S]=\mathbb{E}[\mathbf{1}_{i\in S}]=\sum_{\mathcal{I}:i\in\mathcal{I}}p_{\mathcal{I}}$ is positive for all $i=1:N$, see also [29, 27]. We now define some practical sampling vectors $\zeta=\zeta(S)$. For example, let $S$ be a proper sampling and let $\hat{\mathbb{P}}=\text{Diag}(p_{1},...,p_{N})$. Then, we can consider the sampling vector:

\zeta=\hat{\mathbb{P}}^{-1}e_{S}\implies\zeta^{i}=\frac{\mathbf{1}_{i\in S}}{p_{i}}.

(15)

Note that $\mathbb{E}[\zeta^{i}]=\frac{\mathbb{E}[\mathbf{1}_{i\in S}]}{p_{i}}=1$ and since $\zeta^{T}\zeta=\sum_{i=1}^{N}(\zeta^{i})^{2}=\sum_{i=1}^{N}\mathbf{1}_{i\in S}/p_{i}^{2}$, then $\mathbb{E}[\|\zeta\|^{2}]=\sum_{i=1}^{N}1/p_{i}$. For constraints, if we let $\mathcal{I}^{\prime}\subseteq[1:m]$ and define $e_{\mathcal{I}^{\prime}}=\sum_{j\in\mathcal{I}^{\prime}}e_{j}$, then a sampling $S^{\prime}$ is uniquely characterized by choosing probabilities $p_{\mathcal{I}^{\prime}}\geq 0$ for all subsets $\mathcal{I}^{\prime}$ of $[1:m]$. Let $S^{\prime}$ be a proper sampling; then we can define the practical sampling vector $\xi=\xi(S^{\prime})$ as:

\xi=e_{S^{\prime}}\implies\xi^{j}=\mathbf{1}_{j\in S^{\prime}}.

(16)

Note that $\mathbb{E}[\xi^{j}]=\mathbb{E}[\mathbf{1}_{j\in S^{\prime}}]=p_{j}>0$ and $\xi^{j}\leq\bar{\xi}=1$. Furthermore, each pair of samplings $S$ and $S^{\prime}$ gives rise to particular sampling vectors $\zeta=\zeta(S)$ and $\xi=\xi(S^{\prime})$. Below we provide some sampling examples.

Partition sampling: A partition $\mathcal{P}$ of $[1:N]$ is a set consisting of subsets of $[1:N]$ such that $\cup_{\mathcal{I}\in\mathcal{P}}\mathcal{I}=[1:N]$ and $\mathcal{I}_{i}\cap\mathcal{I}_{l}=\emptyset$ for any $\mathcal{I}_{i},\mathcal{I}_{l}\in\mathcal{P}$ with $i\neq l$. A partition sampling $S$ is a sampling such that $p_{\mathcal{I}}=\mathbb{P}[S=\mathcal{I}]>0$ for all $\mathcal{I}\in\mathcal{P}$ and $\sum_{\mathcal{I}\in\mathcal{P}}p_{\mathcal{I}}=1$.

$\tau$ -nice sampling: We say that $S$ is $\tau$ –nice if $S$ samples from all subsets of $[1:N]$ of cardinality $\tau$ uniformly at random. In this case we have that $p_{i}=\frac{\tau}{N}\;\text{for all}\;i=1:N$ . Then, $p_{\mathcal{I}}=\mathbb{P}\left[S=\mathcal{I}\right]=1/{N\choose\tau}$ for all subsets $\mathcal{I}\subset\{1,...,N\}$ with $\tau$ elements.
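The two samplings above, together with the sampling vectors (15) and (16), can be drawn in a few lines. The following is a minimal sketch using zero-based indices and Python's standard `random` module; the function names, the sizes and the particular partition are hypothetical choices for illustration.

```python
import random

def tau_nice_sample(N, tau, rng):
    # Uniformly random subset of {0, ..., N-1} with exactly tau elements.
    return set(rng.sample(range(N), tau))

def partition_sample(blocks, rng):
    # Pick one block of a fixed partition uniformly at random.
    return set(rng.choice(blocks))

def zeta_vector(S, p):
    # Sampling vector (15): zeta^i = 1_{i in S} / p_i.
    return [1.0 / p[i] if i in S else 0.0 for i in range(len(p))]

def xi_vector(S_prime, m):
    # Sampling vector (16): xi^j = 1_{j in S'}.
    return [1.0 if j in S_prime else 0.0 for j in range(m)]

rng = random.Random(0)  # fixed seed for reproducibility
N, tau = 6, 2
p = [tau / N] * N                      # p_i = tau / N for tau-nice sampling
S = tau_nice_sample(N, tau, rng)
zeta = zeta_vector(S, p)

m = 6
blocks = [[0, 1], [2, 3], [4, 5]]      # a partition of {0, ..., m-1}
S_part = partition_sample(blocks, rng)
xi = xi_vector(S_part, m)
```

Here `zeta` has exactly `tau` nonzero entries, each equal to `N / tau` up to rounding, while `xi` is the indicator vector of the selected block.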

The reader can also consider other examples for sampling, see, e.g., [29] for more details. Let the cardinality of samples $S$ and $S^{\prime}$ be $\tau_{1}$ and $\tau_{2}$, respectively. In the next theorem, using Lemmas 3.2, 3.3 and 3.4, we derive explicit expressions, which depend on the mini-batch sizes $\tau_{1}$ and $\tau_{2}$, for the assumptions’ constants $\mathcal{B},\mathcal{L},\mathcal{B}_{h}$ and $c$ for the two samplings given previously.

Theorem 3.5.

Let Assumptions 2.1, 2.3 and 2.4 hold. Let also $S$ and $S^{\prime}$ be uniform partition samplings whose blocks all have cardinality $\tau_{1}$ and $\tau_{2}$, respectively, or alternatively $\tau_{1}$-nice and $\tau_{2}$-nice samplings, respectively. Then, the constants $\mathcal{B},\mathcal{L},\mathcal{B}_{h}$ and $c$ are:

\mathcal{B}^{2}=\frac{N}{\tau_{1}}B^{2},\;\mathcal{L}=\frac{N}{\tau_{1}}L,\;\mathcal{B}_{h}=\max_{j=1:m}B_{j}\;\text{and}\;c=\left(\frac{\bar{c}m}{\tau_{2}}\right).

Proof.

From Lemma 3.2, for the parameters $\mathcal{B}\;\text{and}\;\mathcal{L}$ , we have:

\begin{aligned}
&\mathcal{B}^{2}=\frac{\mathbb{E}[\|\zeta\|^{2}]}{N}B^{2}\overset{\eqref{samplingVector1}}{=}\frac{1}{N}\sum_{S}p_{S}\sum_{i\in S}\frac{1}{p_{i}^{2}}B^{2},\\
&\mathcal{L}=\frac{\mathbb{E}[\|\zeta\|^{2}]}{N}L\overset{\eqref{samplingVector1}}{=}\frac{1}{N}\sum_{S}p_{S}\sum_{i\in S}\frac{1}{p_{i}^{2}}L.
\end{aligned}

(17)

For partition sampling given the realization $S=\mathcal{I}$, we have $p_{i}=p_{\mathcal{I}}$ if $i\in\mathcal{I}$. Since the cardinality of each $\mathcal{I}$ is $\tau_{1}$ and the sampling $S\in\{\mathcal{I}_{1},...,\mathcal{I}_{\ell}\}$ is chosen uniformly at random, then $p_{i}=p_{\mathcal{I}}=\frac{1}{\ell}=\frac{\tau_{1}}{N}$. Thus, using (17) we have:

\displaystyle\mathcal{B}^{2}=\frac{1}{N}\sum_{\mathcal{I}\in\mathcal{P}}p_{\mathcal{I}}\sum_{i\in\mathcal{I}}\frac{1}{p_{i}^{2}}B^{2}=\frac{1}{N}\sum_{\mathcal{I}\in\mathcal{P}}\frac{\tau_{1}}{N}\tau_{1}\frac{N^{2}}{\tau_{1}^{2}}B^{2}=\sum_{\mathcal{I}\in\mathcal{P}}B^{2}=\ell B^{2}=\frac{N}{\tau_{1}}B^{2}.

Similarly, we can prove that $\mathcal{L}=\frac{N}{\tau_{1}}L$. For $\tau_{1}$-nice sampling given the realization $S=\mathcal{I}$, we have $p_{i}=\frac{\tau_{1}}{N}$ for all $i$ and $p_{\mathcal{I}}=1/{N\choose\tau_{1}}$. Using (17), we get:

\mathcal{B}^{2}=\frac{B^{2}}{N}\sum_{\mathcal{I}}p_{\mathcal{I}}\sum_{i\in\mathcal{I}}\frac{1}{p_{i}^{2}}=\frac{B^{2}}{N}\sum_{\mathcal{I}}\frac{1}{{N\choose\tau_{1}}}\tau_{1}\frac{N^{2}}{\tau_{1}^{2}}=\frac{N}{\tau_{1}}B^{2}\sum_{\mathcal{I}}\frac{1}{{N\choose\tau_{1}}}=\frac{N}{\tau_{1}}B^{2}.

Similarly, we obtain the value of the other parameter, i.e., $\mathcal{L}=\frac{N}{\tau_{1}}L$. By Lemma 3.3, for the parameter $\mathcal{B}_{h}$, we have:

\mathcal{B}_{h}=\bar{\xi}\max_{j=1:m}B_{j}.

Using the definition of $\xi^{j}$ from (16), i.e., $\xi^{j}\leq\bar{\xi}=1$ , we get:

\mathcal{B}_{h}=\max_{j=1:m}B_{j}.

Note that this bound holds for both types of sampling. Finally, from Lemma 3.4, for the parameter $c$ , we have:

c=\frac{\bar{c}}{\min_{j=1:m}\mathbb{E}[\xi^{j}]}\overset{\eqref{samplingVector2}}{=}\frac{\bar{c}}{\min_{j=1:m}p_{j}}.

Here we use the fact that $\mathbb{E}[\xi^{j}]=\mathbb{E}[\mathbf{1}_{j\in S^{\prime}}]=p_{j}$. Now for the given realization $S^{\prime}=\mathcal{I}^{\prime}$, we have $p_{j}=p_{\mathcal{I}^{\prime}}=\frac{\tau_{2}}{m}$ for partition sampling and $p_{j}=\frac{\tau_{2}}{m}$ for $\tau_{2}$-nice sampling, respectively. Therefore, $c=\left(\frac{\bar{c}m}{\tau_{2}}\right)$. This proves our statements. ∎
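The constants of Theorem 3.5 can be double-checked by exact enumeration on small instances. The sketch below computes $\mathbb{E}[\|\zeta\|^{2}]/N$ for the sampling vector (15) and $1/\min_{j}p_{j}$ for the sampling vector (16) with rational arithmetic. The sizes $N=6$, $\tau_{1}=2$, $m=8$, $\tau_{2}=4$ and all helper names are hypothetical, and the partition helper assumes that $\tau$ divides $N$.

```python
from fractions import Fraction
from itertools import combinations

def second_moment_over_N(sampling, N):
    """E[||zeta||^2] / N for zeta^i = 1_{i in S} / p_i, by exact enumeration.

    `sampling` is a list of (subset, probability) pairs.
    """
    # Marginal probabilities p_i = P[i in S].
    p = [sum(pr for S, pr in sampling if i in S) for i in range(N)]
    return sum(pr * sum(1 / p[i] ** 2 for i in S) for S, pr in sampling) / N

def nice_sampling(N, tau):
    # All subsets of cardinality tau, each with the same probability.
    subsets = list(combinations(range(N), tau))
    return [(S, Fraction(1, len(subsets))) for S in subsets]

def uniform_partition(N, tau):
    # Consecutive blocks of size tau (assumes tau divides N), chosen uniformly.
    blocks = [tuple(range(s, s + tau)) for s in range(0, N, tau)]
    return [(B, Fraction(1, len(blocks))) for B in blocks]

N, tau1 = 6, 2
factor_nice = second_moment_over_N(nice_sampling(N, tau1), N)       # multiplies B^2 and L
factor_part = second_moment_over_N(uniform_partition(N, tau1), N)

m, tau2 = 8, 4
constraint_sampling = nice_sampling(m, tau2)
p_min = min(sum(pr for S, pr in constraint_sampling if j in S) for j in range(m))
c_factor = 1 / p_min    # c = c_bar * c_factor
```

For these sizes both objective samplings give the factor $N/\tau_{1}=3$ and the constraint sampling gives $m/\tau_{2}=2$, matching the expressions in the theorem.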

4 Mini-batch stochastic subgradient projection algorithm

For solving the stochastic reformulation (8) of the optimization problem (1) we adapt the stochastic subgradient projection method from [17]. We refer to this algorithm as the Mini-batch Stochastic Subgradient Projection method (Mini-batch SSP).

Algorithm 1 (Mini-batch SSP):

Choose $x_{0}\in\mathcal{Y}$ and stepsizes $\alpha_{k}>0$, $\beta\in(0,2)$.

For $k\geq 0$ repeat:

Draw sample vectors $\zeta_{k}\sim\mathbb{P}_{1}$ and $\xi_{k}\sim\mathbb{P}_{2}$ independently. (18)

$v_{k}=\text{prox}_{\alpha_{k}g(\cdot,\zeta_{k})}\left(x_{k}-\alpha_{k}\nabla f(x_{k},\zeta_{k})\right)$ (19)

Compute $h(v_{k},\xi_{k})=\max(\xi^{1}_{k}h_{1}(v_{k}),...,\xi^{m}_{k}h_{m}(v_{k}))$

$z_{k}=v_{k}-\beta\frac{(h(v_{k},\xi_{k}))_{+}}{\|\nabla h(v_{k},\xi_{k})\|^{2}}\nabla h(v_{k},\xi_{k})$ (20)

$x_{k+1}=\Pi_{\mathcal{Y}}(z_{k}).$

Using the sampling paradigm in Section 3, the Mini-batch SSP algorithm can incorporate a diverse array of mini-batch variants, each of which is associated with a specific probability law governing the data selection rule used to form mini-batches. Most of our variants of Mini-batch SSP, with different mini-batch sizes for the objective function and functional constraints, were never explicitly considered in the literature before, e.g., the variants corresponding to partition and nice samplings. Note that at each iteration our algorithm takes a mini-batch stochastic proximal subgradient step aimed at minimizing the objective function (see (19)) and then a subsequent mini-batch subgradient projection step minimizing the feasibility violation (see (20)). More precisely, if the random vector $\zeta_{k}$ has $\zeta_{k}^{i}=1$ for all $i\in\mathcal{I}_{k}$ and $\zeta_{k}^{i}=0$ for all $i\in\{1,\cdots,N\}\setminus\mathcal{I}_{k}$ , then step (19) is a mini-batch proximal subgradient iteration:

v_{k}=\text{prox}_{\alpha_{k}\sum_{i\in\mathcal{I}_{k}}g_{i}}\left(x_{k}-% \alpha_{k}\sum_{i\in\mathcal{I}_{k}}\nabla f_{i}(x_{k})\right).

Similarly, if the random vector $\xi_{k}$ has $\xi_{k}^{i}=1$ for all $i\in\mathcal{I}_{k}^{\prime}$ and $\xi_{k}^{i}=0$ for all $i\in\{1,\cdots,m\}\setminus\mathcal{I}_{k}^{\prime}$, then step (20) minimizes the feasibility violation of the observed mini-batch of constraints, i.e., we choose from the mini-batch the constraint that is violated the most, $h(v_{k},\xi_{k})=\max_{j\in\mathcal{I}_{k}^{\prime}}h_{j}(v_{k})=h_{j_{k}^{*}}(v_{k})$ for some index $j_{k}^{*}\in\mathcal{I}_{k}^{\prime}$, and then perform a Polyak-type subgradient update on it [24]:

z_{k}=v_{k}-\beta\frac{(h(v_{k},\xi_{k}))_{+}}{\|\nabla h(v_{k},\xi_{k})\|^{2}% }\nabla h(v_{k},\xi_{k})=v_{k}-\beta\frac{(h_{j_{k}^{*}}(v_{k}))_{+}}{\|\nabla h% _{j_{k}^{*}}(v_{k})\|^{2}}\nabla h_{j_{k}^{*}}(v_{k}).

Consider any arbitrary nonzero $s_{h}\in\mathbb{R}^{n}$ . Disregarding the abuse of notation, we compute the vector $\nabla h(v_{k},\xi_{k})=\nabla h_{j_{k}^{*}}(v_{k})$ by:

\nabla h_{j_{k}^{*}}(v_{k})=\begin{cases}\nabla h_{j_{k}^{*}}(v_{k})\in% \partial h_{j_{k}^{*}}(v_{k})&\mbox{if }\;h_{j_{k}^{*}}(v_{k})>0\\ s_{h}\neq 0&\mbox{if }\;h_{j_{k}^{*}}(v_{k})\leq 0.\end{cases}

When $(h(v_{k},\xi_{k}))_{+}=(h_{j_{k}^{*}}(v_{k}))_{+}=0$, we have $z_{k}=v_{k}$ for any choice of $s_{h}\neq 0$. Note that in the Mini-batch SSP algorithm $\alpha_{k}>0$ and $\beta>0$ are deterministic stepsizes. Moreover, when $\beta=1$, $z_{k}$ is the projection of $v_{k}$ onto the halfspace given by the linearization of the functional constraint that is violated the most in the observed mini-batch of constraints given by the index set $\mathcal{I}_{k}^{\prime}$:

\mathcal{H}_{v_{k},\xi_{k}}=\{z:h(v_{k},\xi_{k})+\nabla h(v_{k},\xi_{k})^{T}(z% -v_{k})\!\leq\!0\}\!=\!\{z:h_{j_{k}^{*}}(v_{k})+\nabla h_{j_{k}^{*}}(v_{k})^{T% }(z-v_{k})\!\leq\!0\},

that is, we have $z_{k}=\Pi_{\mathcal{H}_{v_{k},\xi_{k}}}(v_{k})$ when we choose $\beta=1$ . In the next sections we analyse the convergence behaviour of Mini-batch SSP algorithm and derive rates depending explicitly on the mini-batch sizes and on the properties of the objective function.
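One iteration of the scheme described above can be sketched as follows. This is only an illustration on an assumed toy instance, not the authors' implementation: we take $f_{i}(x)=\frac{1}{2}(a_{i}^{T}x-b_{i})^{2}$, $g_{i}(x)=\lambda\|x\|_{1}$, linear constraints $h_{j}(x)=H_{j}x-d_{j}\leq 0$ with nonzero rows $H_{j}$, and $\mathcal{Y}=\mathbb{R}^{n}$. The names `ssp_iteration`, `soft_threshold`, `H`, `d` and `lam` are hypothetical.

```python
import numpy as np

def soft_threshold(u, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def ssp_iteration(x, A, b, lam, H, d, batch_f, batch_h, alpha, beta=1.0):
    """One Mini-batch SSP iteration on an assumed toy instance.

    Assumptions (not from the paper): f_i(x) = 0.5*(a_i^T x - b_i)^2,
    g_i(x) = lam*||x||_1, h_j(x) = H[j] @ x - d[j] <= 0, and Y = R^n.
    """
    # Step (19): sum of sampled gradients, then prox of the sum of sampled g_i.
    Ab, bb = A[batch_f], b[batch_f]
    grad = Ab.T @ (Ab @ x - bb)
    v = soft_threshold(x - alpha * grad, alpha * lam * len(batch_f))

    # Pick the most violated constraint within the sampled mini-batch.
    vals = H[batch_h] @ v - d[batch_h]
    j_star = batch_h[int(np.argmax(vals))]
    h_plus = max(vals.max(), 0.0)

    # Step (20): Polyak-type subgradient step; no move when h_plus == 0.
    s = H[j_star]
    z = v - beta * h_plus / (s @ s) * s if h_plus > 0 else v

    # Projection onto Y = R^n is the identity.
    return z
```

With $\beta=1$ and a linear constraint, the returned point lies on the boundary of the sampled most-violated halfspace, which matches the projection interpretation given in the text.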

4.1 Convergence analysis: convex objective function

In this section we consider that the functions $f_{i},g_{i}$ and $h_{j}$ in problem (1) are convex and the random vectors $\zeta$ and $\xi$ are non-negative. Let us define the filtration as the sigma algebra generated by the history of the random vectors $\zeta$ and $\xi$ :

\mathcal{F}_{[k]}=\sigma(\{\zeta_{t},\xi_{t}:\;0\leq t\leq k\}).

The next lemma, whose proof is similar to that of Lemma 5 in [17], provides a key descent property for the sequence $v_{k}$ (recall that $\bar{v}_{k}=\Pi_{\mathcal{X}^{*}}(v_{k})$ and $\bar{x}_{k}=\Pi_{\mathcal{X}^{*}}(x_{k})$).

Lemma 4.1.

Let $f_{i}$ and $g_{i}$ , with $i=1:N$ , be convex functions and $\zeta\geq 0$ . Additionally, let the bounded gradient condition from Assumption 2.1 hold. Then, for any $k\geq 0$ and stepsize $\alpha_{k}>0$ , we have the following recursion:

\displaystyle\mathbb{E}[\|v_{k}-\bar{v}_{k}\|^{2}]\leq\mathbb{E}[\|x_{k}-\bar{% x}_{k}\|^{2}]-\alpha_{k}(2-\alpha_{k}\mathcal{L})\,\mathbb{E}[F(x_{k})-F(\bar{% x}_{k})]+\alpha_{k}^{2}\mathcal{B}^{2},

(21)

with $\mathcal{B}$ and $\mathcal{L}$ given in Lemma 3.2.

The following lemma establishes a relation between $x_{k}$ and $v_{k-1}$ . The proof is similar to Lemma 6 in [17].

Lemma 4.2.

Let $h_{j}$ , with $j=1:m$ , be convex functions and $\xi\geq 0$ . Additionally, assume that the bounded subgradient condition from Assumption 2.3 holds. Then, for any $y\in\mathcal{Y}$ such that $(h(y,\xi_{k-1}))_{+}=0$ , the following relation holds:

\displaystyle\|x_{k}-y\|^{2}\leq\|v_{k-1}-y\|^{2}-\beta(2-\beta)\left[\frac{(h% (v_{k-1},\xi_{k-1}))_{+}^{2}}{\mathcal{B}^{2}_{h}}\right],

(22)

with $\mathcal{B}_{h}$ given in Lemma 3.3.

Taking now $y=\Pi_{\mathcal{X}}(v_{k-1})\in\mathcal{X}\subseteq\mathcal{Y}$, we have $(h(\Pi_{\mathcal{X}}(v_{k-1}),\xi_{k-1}))_{+}=0$ and hence

\begin{aligned}
\text{dist}^{2}(x_{k},\mathcal{X})&=\|x_{k}-\Pi_{\mathcal{X}}(x_{k})\|^{2}\leq\|x_{k}-\Pi_{\mathcal{X}}(v_{k-1})\|^{2}\\
&\overset{(22)}{\leq}\text{dist}^{2}(v_{k-1},\mathcal{X})-\beta(2-\beta)\frac{(h(v_{k-1},\xi_{k-1}))_{+}^{2}}{\mathcal{B}^{2}_{h}}\\
&\leq\text{dist}^{2}(v_{k-1},\mathcal{X}).
\end{aligned}

Thus for any $q\geq 1$ , we have:

\displaystyle\text{dist}^{2q}(x_{k},\mathcal{X})\leq\text{dist}^{2q}(v_{k-1},% \mathcal{X}).

(23)

Lemma 4.3.

Let Assumptions 2.3 and 2.4 hold and the random vectors $\xi$ and $\zeta$ be nonnegative. Then, the following relation is valid:

\mathbb{E}[\|x_{k}-\bar{x}_{k}\|^{2}]\leq\mathbb{E}[\|v_{k-1}-\bar{v}_{k-1}\|^% {2}]-\frac{\beta(2-\beta)}{c\mathcal{B}^{2}_{h}}\mathbb{E}\left[\emph{dist}^{2% q}(x_{k},\mathcal{X})\right],

with $\mathcal{B}_{h}$ and $c$ given in Lemmas 3.3 and 3.4, respectively.

Proof.

Note that for $\bar{v}_{k-1}\in\mathcal{X}^{*}\subseteq\mathcal{X}\subseteq\mathcal{Y}$ we have $(h(\bar{v}_{k-1},\xi_{k-1}))_{+}=0$ and using Lemma 4.2 with $y=\bar{v}_{k-1}$ , we get:

\displaystyle\|x_{k}-\bar{x}_{k}\|^{2}\leq\|x_{k}-\bar{v}_{k-1}\|^{2}\leq\|v_{% k-1}-\bar{v}_{k-1}\|^{2}-\beta(2-\beta)\left[\frac{(h(v_{k-1},\xi_{k-1}))_{+}^% {2}}{\mathcal{B}^{2}_{h}}\right].

Taking conditional expectation on $\xi_{k-1}$ given $\mathcal{F}_{[k-2]}$ , we get:

\begin{aligned}
\mathbb{E}_{\xi_{k-1}}[\|x_{k}-\bar{x}_{k}\|^{2}\mid\mathcal{F}_{[k-2]}]&\leq\|v_{k-1}-\bar{v}_{k-1}\|^{2}-\beta(2-\beta)\mathbb{E}_{\xi_{k-1}}\left[\frac{(h(v_{k-1},\xi_{k-1}))_{+}^{2}}{\mathcal{B}^{2}_{h}}\mid\mathcal{F}_{[k-2]}\right]\\
&\overset{\eqref{qreg}}{\leq}\|v_{k-1}-\bar{v}_{k-1}\|^{2}-\frac{\beta(2-\beta)}{c\mathcal{B}^{2}_{h}}\text{dist}^{2q}(v_{k-1},\mathcal{X})\\
&\overset{(23)}{\leq}\|v_{k-1}-\bar{v}_{k-1}\|^{2}-\frac{\beta(2-\beta)}{c\mathcal{B}^{2}_{h}}\text{dist}^{2q}(x_{k},\mathcal{X}).
\end{aligned}

Taking now the full expectation, we obtain our statement. ∎

For simplicity of the exposition let us introduce the following constant:

\displaystyle C_{\beta,c,\mathcal{B}_{h}}:=\frac{\beta(2-\beta)}{c\mathcal{B}^% {2}_{h}}>0.

(24)

We impose the following conditions on the stepsize $\alpha_{k}$ :

\displaystyle 0<\alpha_{k}\leq\alpha_{k}(2-\alpha_{k}\mathcal{L})<1\;\;\iff\;% \;\alpha_{k}\in\begin{cases}\left(0,\frac{1}{2}\right)\;\;\text{if}\;\mathcal{% L}=0\\ \left(0,\frac{1-\sqrt{(1-\mathcal{L})_{+}}}{\mathcal{L}}\right)\;\;\text{if}\;% \mathcal{L}>0.\end{cases}

(25)
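The interval in (25) can be checked numerically. The sketch below, with hypothetical function names, computes the right endpoint of the admissible interval and tests the original two-sided inequality directly for a candidate stepsize; it is meant as an illustration only.

```python
import math

def alpha_upper_bound(L_cal):
    """Right endpoint of the open interval for alpha_k in condition (25)."""
    if L_cal == 0:
        return 0.5
    return (1.0 - math.sqrt(max(1.0 - L_cal, 0.0))) / L_cal

def satisfies_condition(alpha, L_cal):
    """Check 0 < alpha <= alpha*(2 - alpha*L_cal) < 1 directly."""
    w = alpha * (2.0 - alpha * L_cal)
    return 0.0 < alpha <= w < 1.0
```

For example, with $\mathcal{L}=4$ the endpoint is $1/4$, so a stepsize slightly below $0.25$ passes the check while one slightly above fails it.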

Then, we can define the following average sequence generated by the Mini-batch SSP algorithm:

\hat{x}_{k}=\frac{\sum_{j=1}^{k}\alpha_{j}{\color[rgb]{0,0,0}(2-\alpha_{j}% \mathcal{L})}x_{j}}{S_{k}},\quad\text{where}\;S_{k}=\sum_{j=1}^{k}\alpha_{j}{% \color[rgb]{0,0,0}(2-\alpha_{j}\mathcal{L})}.

Note that this type of average sequence is also considered in [5] for unconstrained stochastic optimization problems. The next theorem derives sublinear convergence rates for the average sequence $\hat{x}_{k}$.

Theorem 4.4.

Let $f_{i}$, $g_{i}$, with $i=1:N$, and $h_{j}$, with $j=1:m$, be convex functions. Additionally, let Assumptions 2.1, 2.3 and 2.4 hold and let the random vectors $\zeta,\;\xi$ be nonnegative. Further, consider a nonincreasing positive stepsize sequence $\alpha_{k}$ as in (25), satisfying $\sum_{k\geq 0}\alpha_{k}=\infty$ and $\sum_{k\geq 0}\alpha_{k}^{2}<\infty$, and stepsize $\beta\in(0,2)$. Then, we have the following convergence rates for the average sequence $\hat{x}_{k}$ in terms of optimality and feasibility violation for problem (1):

\begin{aligned}
&\mathbb{E}\left[F(\hat{x}_{k})-F^{*}\right]\leq\frac{\|v_{0}-\bar{v}_{0}\|^{2}}{S_{k}}+\frac{\mathcal{B}^{2}\sum_{t=1}^{k}\alpha_{t}^{2}}{S_{k}},\\
&\mathbb{E}\left[\text{dist}^{2}(\hat{x}_{k},\mathcal{X})\right]\leq\left(\frac{1}{C_{\beta,c,\mathcal{B}_{h}}\cdot S_{k}}\right)^{\frac{1}{q}}\left[\|v_{0}-\bar{v}_{0}\|^{\frac{2}{q}}+\mathcal{B}^{\frac{2}{q}}\sum_{t=1}^{k}\alpha_{t}^{\frac{2}{q}}\right].
\end{aligned}

Proof.

Combining Lemma 4.3 with Lemma 4.1, we have:

\begin{aligned}
&\mathbb{E}\left[\|v_{k}-\bar{v}_{k}\|^{2}\right]+\frac{\beta(2-\beta)}{c\mathcal{B}^{2}_{h}}\mathbb{E}[\text{dist}^{2q}(x_{k},\mathcal{X})]+\alpha_{k}(2-\alpha_{k}\mathcal{L})\mathbb{E}\left[F(x_{k})-F(\bar{x}_{k})\right]\\
&\quad\leq\mathbb{E}[\|v_{k-1}-\bar{v}_{k-1}\|^{2}]+\alpha_{k}^{2}\mathcal{B}^{2}.
\end{aligned}

Together with the fact that $\alpha_{k}(2-\alpha_{k}\mathcal{L})<1$, this yields:

\begin{aligned}
&\mathbb{E}\left[\|v_{k}-\bar{v}_{k}\|^{2}\right]+C_{\beta,c,\mathcal{B}_{h}}\alpha_{k}(2-\alpha_{k}\mathcal{L})\mathbb{E}[\text{dist}^{2q}(x_{k},\mathcal{X})]+\alpha_{k}(2-\alpha_{k}\mathcal{L})\mathbb{E}\left[F(x_{k})-F(\bar{x}_{k})\right]\\
&\quad\leq\mathbb{E}[\|v_{k-1}-\bar{v}_{k-1}\|^{2}]+\alpha_{k}^{2}\mathcal{B}^{2}.
\end{aligned}

Summing this relation over the indices $1:k$ and using $F(\bar{x}_{t})=F^{*}$, we get:

\begin{aligned}
&\mathbb{E}\left[\|v_{k}-\bar{v}_{k}\|^{2}\right]+C_{\beta,c,\mathcal{B}_{h}}\sum_{t=1}^{k}\alpha_{t}(2-\alpha_{t}\mathcal{L})\mathbb{E}\left[\text{dist}^{2q}(x_{t},\mathcal{X})\right]\\
&\quad+\sum_{t=1}^{k}\alpha_{t}(2-\alpha_{t}\mathcal{L})\mathbb{E}\left[F(x_{t})-F^{*}\right]\leq\|v_{0}-\bar{v}_{0}\|^{2}+\mathcal{B}^{2}\sum_{t=1}^{k}\alpha_{t}^{2}.
\end{aligned}

From the definition of the average sequence $\hat{x}_{k}$ and the convexity of $F$ and of $\text{dist}^{2}(\cdot,\mathcal{X})$, we get a sublinear rate in expectation for the average sequence in terms of optimality:

\displaystyle\mathbb{E}\left[F(\hat{x}_{k})-F^{*}\right]\leq\sum_{t=1}^{k}% \frac{{\color[rgb]{0,0,0}\alpha_{t}(2-\alpha_{t}\mathcal{L})}}{S_{k}}\mathbb{E% }\left[F(x_{t})-F^{*}\right]\leq\frac{\|v_{0}-\bar{v}_{0}\|^{2}}{S_{k}}+% \mathcal{B}^{2}\frac{\sum_{t=1}^{k}\alpha_{t}^{2}}{S_{k}}.

Also by using Jensen’s inequality and $q\geq 1$ , we have:

\begin{aligned}
C_{\beta,c,\mathcal{B}_{h}}\left(\mathbb{E}\left[\text{dist}^{2}(\hat{x}_{k},\mathcal{X})\right]\right)^{q}&\leq C_{\beta,c,\mathcal{B}_{h}}\mathbb{E}\left[\text{dist}^{2q}(\hat{x}_{k},\mathcal{X})\right]\\
&\leq C_{\beta,c,\mathcal{B}_{h}}\sum_{t=1}^{k}\frac{\alpha_{t}(2-\alpha_{t}\mathcal{L})}{S_{k}}\mathbb{E}\left[\text{dist}^{2q}(x_{t},\mathcal{X})\right]\leq\frac{\|v_{0}-\bar{v}_{0}\|^{2}}{S_{k}}+\mathcal{B}^{2}\frac{\sum_{t=1}^{k}\alpha_{t}^{2}}{S_{k}}.
\end{aligned}

These conclude our statements. ∎

For the stepsize $\alpha_{k}=\frac{\alpha_{0}}{(k+1)^{\gamma}}$, with $\gamma\in[1/2,1)$ and $\alpha_{0}$ satisfying (25), we have:

\frac{1}{\alpha_{0}}S_{k}\overset{(25)}{\geq}\frac{1}{\alpha_{0}}\sum_{t=1}^{k}\alpha_{t}\geq{\cal O}(k^{1-\gamma})\quad\text{and}\quad\frac{1}{\alpha_{0}^{2}}\sum_{t=1}^{k}\alpha_{t}^{2}\leq\begin{cases}{\cal O}(1)&\text{if}\;\gamma>1/2\\ {\cal O}(\ln(k))&\text{if}\;\gamma=1/2.\end{cases}

Consequently, for $\gamma\in(1/2,1)$ we obtain from Theorem 4.4 the following sublinear convergence rates:

\begin{aligned}
&\mathbb{E}\left[(F(\hat{x}_{k})-F^{*})\right]\leq\frac{\|v_{0}-\bar{v}_{0}\|^{2}}{\alpha_{0}{\cal O}(k^{1-\gamma})}+\frac{\alpha_{0}\mathcal{B}^{2}{\cal O}(1)}{{\cal O}(k^{1-\gamma})},\\
&\mathbb{E}\left[\text{dist}^{2}(\hat{x}_{k},\mathcal{X})\right]\leq\left(\frac{1}{C_{\beta,c,\mathcal{B}_{h}}\cdot\alpha_{0}{\cal O}(k^{1-\gamma})}\right)^{\frac{1}{q}}\left[\|v_{0}-\bar{v}_{0}\|^{\frac{2}{q}}+(\alpha_{0}^{2}\mathcal{B}^{2}{\cal O}(1))^{\frac{1}{q}}\right].
\end{aligned}

(26)

For the particular choice $\gamma=1/2$ we can perform the same analysis as before and obtain similar convergence bounds (by replacing ${\cal O}(1)$ with ${\cal O}(\ln(k))$ ). Now, if we neglect the logarithmic terms, we get exactly the same rates as in (26), but replacing $k^{1-\gamma}$ with $k^{1/2}$ . Hence, we omit the details for this case.
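The two sums that drive the rates above can be evaluated numerically for the polynomial stepsize. The following sketch (the function name `schedule_sums` and the test values are assumptions for illustration) computes $S_{k}$ and $\sum_{t=1}^{k}\alpha_{t}^{2}$ for $\alpha_{t}=\alpha_{0}/(t+1)^{\gamma}$, so that one can observe $S_{k}$ growing like $k^{1-\gamma}$ while the sum of squares stays bounded when $\gamma>1/2$.

```python
def schedule_sums(alpha0, gamma, L_cal, k):
    """Return (S_k, sum_t alpha_t^2) for alpha_t = alpha0/(t+1)^gamma, t = 1..k.

    S_k uses the weights alpha_t * (2 - alpha_t * L_cal) from the
    definition of the average sequence.
    """
    S_k, sq_sum = 0.0, 0.0
    for t in range(1, k + 1):
        a = alpha0 / (t + 1) ** gamma
        S_k += a * (2.0 - a * L_cal)
        sq_sum += a * a
    return S_k, sq_sum
```

Comparing the output with the integral bounds $\alpha_{0}\frac{(k+2)^{1-\gamma}-2^{1-\gamma}}{1-\gamma}$ (from below, for $S_{k}$) and $\frac{\alpha_{0}^{2}}{2\gamma-1}$ (from above, for the sum of squares) gives a quick consistency check.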

Minimizing the right hand side of the bound for optimality in (26) w.r.t. $\alpha_{0}$ , we get an optimal choice for the initial stepsize, i.e., $\alpha_{0}^{*}=\frac{\|v_{0}-\bar{v}_{0}\|}{\mathcal{B}}$ . Since $\alpha_{0}$ must be in $\left(0,\min\left(\frac{1}{2},\frac{1-\sqrt{(1-\mathcal{L})_{+}}}{\mathcal{L}}% \right)\right)$ , then we consider $\alpha^{*}_{0}=\min\left(\frac{\|v_{0}-\bar{v}_{0}\|}{\mathcal{B}},{\color[rgb% ]{0,0,0}\min\left(\frac{1}{2},\frac{1-\sqrt{(1-\mathcal{L})_{+}}}{\mathcal{L}}% \right)}-\delta\right)$ for some $\delta\in(0,\frac{1}{2})$ . We distinguish two cases:

Case 1: If $\alpha_{0}^{*}=\frac{\mathcal{R}_{0}}{\mathcal{B}}\leq{\color[rgb]{0,0,0}\min% \left(\frac{1}{2},\frac{1-\sqrt{(1-\mathcal{L})_{+}}}{\mathcal{L}}\right)}-\delta$ , where $\mathcal{R}_{0}$ is an estimate of $\|v_{0}-\bar{v}_{0}\|$ , then the expressions for the rates from (26) are (after ignoring ${\cal O}(1)/{\cal O}(\ln(k))$ terms):

\begin{aligned}
&\mathbb{E}\left[(F(\hat{x}_{k})-F^{*})\right]\leq\frac{\mathcal{B}\|v_{0}-\bar{v}_{0}\|^{2}}{\mathcal{R}_{0}{\cal O}(k^{1-\gamma})}+\frac{\mathcal{R}_{0}\mathcal{B}}{{\cal O}(k^{1-\gamma})},\\
&\mathbb{E}\left[\text{dist}^{2}(\hat{x}_{k},\mathcal{X})\right]\leq\left(\frac{\mathcal{B}}{C_{\beta,c,\mathcal{B}_{h}}\mathcal{R}_{0}\cdot\mathcal{O}(k^{1-\gamma})}\right)^{\frac{1}{q}}\left[\|v_{0}-\bar{v}_{0}\|^{\frac{2}{q}}+(\mathcal{R}_{0})^{\frac{2}{q}}\right].
\end{aligned}

Using the definition of $C_{\beta,c,\mathcal{B}_{h}}$ and replacing the values for $\mathcal{L}$ , $\mathcal{B}$ , $\mathcal{B}_{h}$ and $c$ from Theorem 3.5 for both types of samplings, i.e., partition or $\tau_{1}$ -, $\tau_{2}$ -nice samplings, we get:

\begin{aligned}
&\mathbb{E}\left[(F(\hat{x}_{k})-F^{*})\right]\leq\sqrt{\frac{N}{\tau_{1}}}\frac{B}{\mathcal{O}(k^{1-\gamma})}\left(\frac{\|v_{0}-\bar{v}_{0}\|^{2}}{\mathcal{R}_{0}}+\mathcal{R}_{0}\right),\\
&\mathbb{E}\left[\text{dist}^{2}(\hat{x}_{k},\mathcal{X})\right]\leq\left(\sqrt{\frac{N}{\tau_{1}}}\frac{Bm\bar{c}\max_{j=1:m}^{2}B_{j}}{\tau_{2}\cdot\beta(2-\beta)\mathcal{R}_{0}\cdot\mathcal{O}(k^{1-\gamma})}\right)^{\frac{1}{q}}\left[\|v_{0}-\bar{v}_{0}\|^{\frac{2}{q}}+(\mathcal{R}_{0})^{\frac{2}{q}}\right].
\end{aligned}

Case 2: If $\alpha_{0}^{*}=\min\left(\frac{1}{2},\frac{1-\sqrt{(1-\mathcal{L})_{+}}}{\mathcal{L}}\right)-\delta<\frac{\|v_{0}-\bar{v}_{0}\|}{\mathcal{B}}$, for some $\delta\in(0,1/2)$, then the expressions for the rates from (26) are (after ignoring ${\cal O}(1)/{\cal O}(\ln(k))$ terms):

\mathbb{E}\left[(F(\hat{x}_{k})-F^{*})\right]\leq\frac{\|v_{0}-\bar{v}_{0}\|^{2}}{\alpha_{0}^{*}\cdot{\cal O}(k^{1-\gamma})}+\frac{\alpha_{0}^{*}\cdot\mathcal{B}^{2}}{{\cal O}(k^{1-\gamma})}\leq\frac{2\|v_{0}-\bar{v}_{0}\|^{2}}{\alpha_{0}^{*}\cdot{\cal O}(k^{1-\gamma})},

(27)

\begin{aligned}
\mathbb{E}\left[\text{dist}^{2}(\hat{x}_{k},\mathcal{X})\right]&\leq\left(\frac{1}{C_{\beta,c,\mathcal{B}_{h}}\alpha_{0}^{*}\cdot{\cal O}(k^{1-\gamma})}\right)^{\frac{1}{q}}\left[\|v_{0}-\bar{v}_{0}\|^{\frac{2}{q}}+\left((\alpha_{0}^{*})^{2}\mathcal{B}^{2}\right)^{\frac{1}{q}}\right]\\
&\leq\left(\frac{1}{C_{\beta,c,\mathcal{B}_{h}}\alpha_{0}^{*}\cdot{\cal O}(k^{1-\gamma})}\right)^{\frac{1}{q}}\left[2\|v_{0}-\bar{v}_{0}\|^{\frac{2}{q}}\right].
\end{aligned}

(28)

Consider first the case $\alpha^{*}_{0}=\frac{1}{2}-\delta$. From (27) and (28), we have:

\begin{aligned}
&\mathbb{E}\left[(F(\hat{x}_{k})-F^{*})\right]\leq\frac{4\|v_{0}-\bar{v}_{0}\|^{2}}{(1-2\delta){\cal O}(k^{1-\gamma})},\\
&\mathbb{E}\left[\text{dist}^{2}(\hat{x}_{k},\mathcal{X})\right]\leq\left(\frac{2}{C_{\beta,c,\mathcal{B}_{h}}(1-2\delta){\cal O}(k^{1-\gamma})}\right)^{\frac{1}{q}}\left[2\|v_{0}-\bar{v}_{0}\|^{\frac{2}{q}}\right].
\end{aligned}

Using the definition of $C_{\beta,c,\mathcal{B}_{h}}$ and the expressions for $\mathcal{B}_{h}$ and $c$ from Theorem 3.5 for the partition or $\tau_{1}$ -, $\tau_{2}$ -nice samplings, we get:

\begin{aligned}
&\mathbb{E}\left[(F(\hat{x}_{k})-F^{*})\right]\leq\frac{4\|v_{0}-\bar{v}_{0}\|^{2}}{(1-2\delta){\cal O}(k^{1-\gamma})},\\
&\mathbb{E}\left[\text{dist}^{2}(\hat{x}_{k},\mathcal{X})\right]\leq\left(\frac{2m\bar{c}\max_{j=1:m}^{2}B_{j}}{\tau_{2}\cdot\beta(2-\beta)\cdot(1-2\delta){\cal O}(k^{1-\gamma})}\right)^{\frac{1}{q}}\left[2\|v_{0}-\bar{v}_{0}\|^{\frac{2}{q}}\right].
\end{aligned}

When $\alpha^{*}_{0}=\frac{1-\sqrt{(1-\mathcal{L})_{+}}}{\mathcal{L}}-\delta$, from (27) and (28), we have:

\begin{aligned}
&\mathbb{E}\left[(F(\hat{x}_{k})-F^{*})\right]\leq\frac{2\mathcal{L}\|v_{0}-\bar{v}_{0}\|^{2}}{(1-\sqrt{(1-\mathcal{L})_{+}}-\delta\mathcal{L}){\cal O}(k^{1-\gamma})},\\
&\mathbb{E}\left[\text{dist}^{2}(\hat{x}_{k},\mathcal{X})\right]\leq\left(\frac{\mathcal{L}}{C_{\beta,c,\mathcal{B}_{h}}(1-\sqrt{(1-\mathcal{L})_{+}}-\delta\mathcal{L}){\cal O}(k^{1-\gamma})}\right)^{\frac{1}{q}}\left[2\|v_{0}-\bar{v}_{0}\|^{\frac{2}{q}}\right].
\end{aligned}

Using the definition of $C_{\beta,c,\mathcal{B}_{h}}$ and the expressions for $\mathcal{L}$ , $\mathcal{B}$ , $\mathcal{B}_{h}$ and $c$ from Theorem 3.5 for the partition or $\tau_{1}$ -, $\tau_{2}$ -nice samplings, we get:

\begin{aligned}
&\mathbb{E}\left[(F(\hat{x}_{k})-F^{*})\right]\leq\frac{N}{\tau_{1}}\frac{2L\|v_{0}-\bar{v}_{0}\|^{2}}{\left(1-\sqrt{(1-\frac{N}{\tau_{1}}L)_{+}}-\delta\frac{N}{\tau_{1}}L\right)\cdot{\cal O}(k^{1-\gamma})},\\
&\mathbb{E}\left[\text{dist}^{2}(\hat{x}_{k},\mathcal{X})\right]\leq\left(\frac{N}{\tau_{1}}\frac{m}{\tau_{2}}\frac{L\bar{c}\max_{j=1:m}^{2}B_{j}}{\beta(2-\beta)\left(1-\sqrt{(1-\frac{N}{\tau_{1}}L)_{+}}-\delta\frac{N}{\tau_{1}}L\right)\cdot{\cal O}(k^{1-\gamma})}\right)^{\frac{1}{q}}\left[2\|v_{0}-\bar{v}_{0}\|^{\frac{2}{q}}\right].
\end{aligned}

Note that for the initial stepsize choices $\alpha_{0}^{*}=\frac{{\color[rgb]{0,0,0}\mathcal{R}_{0}}}{\mathcal{B}}\;\text{% or}\;\alpha_{0}^{*}={\color[rgb]{0,0,0}\frac{1-\sqrt{(1-\mathcal{L})_{+}}}{% \mathcal{L}}}-\delta$ and for the two particular choices of the sampling (partition or nice samplings), we obtain convergence rates depending explicitly on mini-batch sizes $\tau_{1}$ and $\tau_{2}$ , namely $\left(\sqrt{\frac{N}{\tau_{1}}},\left({\color[rgb]{0,0,0}\sqrt{\frac{N}{\tau_{% 1}}}}\frac{m}{\tau_{2}}\right)^{1/q}\right)$ or $\left(\frac{N}{\tau_{1}},\left({\color[rgb]{0,0,0}\frac{N}{\tau_{1}}}\frac{m}{% \tau_{2}}\right)^{1/q}\right)$ , respectively. Hence, in these settings we have linear dependence on the mini-batch sizes $(\tau_{1},\tau_{2})$ for algorithm Mini-batch SSP.

Furthermore, since in the convex case we can consider a stepsize sequence $\alpha_{k}=\frac{\alpha_{0}}{(k+1)^{\gamma}}$ , then for $\alpha_{0}=\frac{{\color[rgb]{0,0,0}\mathcal{R}_{0}}}{\mathcal{B}}\;\text{or}% \;\alpha_{0}={\color[rgb]{0,0,0}\frac{1-\sqrt{(1-\mathcal{L})_{+}}}{\mathcal{L% }}}-\delta$ one can notice immediately that our stepsize sequence $\alpha_{k}$ also depends linearly on the mini-batch size $\tau_{1}$ for the two particular choices of sampling (partition or nice samplings), i.e., $\alpha_{k}=\mathcal{O}\left(\frac{\tau_{1}}{N(k+1)^{\gamma}}\right)$ .

Finally, one can notice that when $B=0$ , from Theorem 4.4 improved rates can be derived for Mini-batch SSP in the convex case. For example, for stepsize $\alpha_{k}=\frac{\alpha_{0}}{(k+1)^{\gamma}}$ , with $\gamma\in[0,1)$ and $\alpha_{0}={\color[rgb]{0,0,0}\min\left(\frac{1}{2},\frac{1-\sqrt{(1-\mathcal{% L})_{+}}}{\mathcal{L}}\right)}-\delta$ , we obtain convergence rates for $\hat{x}_{k}$ in optimality and feasibility violation of order ${\cal O}\left(\frac{N}{\tau_{1}k^{1-\gamma}}\right)$ and ${\cal O}\left(\frac{{\color[rgb]{0,0,0}N}m}{{\color[rgb]{0,0,0}\tau_{1}}\tau_{% 2}k^{1-\gamma}}\right)^{\frac{1}{q}}$ , respectively. In particular, for $\gamma=0$ these rates become of order ${\cal O}\left(\frac{N}{\tau_{1}k}\right)$ and ${\cal O}\left(\frac{{\color[rgb]{0,0,0}N}m}{{\color[rgb]{0,0,0}\tau_{1}}\tau_{% 2}k}\right)^{\frac{1}{q}}$ .

In conclusion, by specializing our Theorem 4.4 to different mini-batching strategies, such as partition or nice samplings, we derive explicit expressions for the stepsize $\alpha_{k}$ as a function of the mini-batch size and, consequently, convergence rates depending linearly on the mini-batch sizes $(\tau_{1},\tau_{2})$ . Hence, Theorem 4.4 shows that a mini-batch variant of the stochastic subgradient projection scheme is more beneficial than the nonmini-batch variant.

4.2 Convergence analysis: strongly convex objective function

In this section, we additionally assume the inequality from Assumption 2.2 holds. The next lemma derives an improved recurrence for the sequence $v_{k}$ under the strongly convex assumption. The proof is similar to Lemma 8 in [17].

Lemma 4.5.

Let $f_{i},\;g_{i}$ , with $i=1:N$ and $h_{j}$ , with $j=1:m$ , be convex functions. Additionally, Assumptions 2.1–2.4 hold, with $\mu>0$ , and the random vectors $\zeta,\;\xi$ are nonnegative. Define $k_{0}=\lceil\frac{8\mathcal{L}}{\mu}\rceil$ , $\beta\in\left(0,2\right)$ , $\theta_{\mathcal{L},\mu}\!=\!1\!-\!\mu/(4\mathcal{L})$ and $\alpha_{k}\!=\!\frac{4}{\mu}\gamma_{k}$ , where $\gamma_{k}$ is given by:

\gamma_{k}=\left\{\begin{array}[]{ll}\frac{\mu}{4\mathcal{L}}&\text{\emph{if}}% \;\;k\leq k_{0}\\ \frac{2}{k+1}&\text{\emph{if}}\;\;k>k_{0}.\end{array}\right.

Then, the iterates of Algorithm Mini-batch SSP satisfy the following recurrence:

\mathbb{E}[\|v_{k_{0}}-x^{*}\|^{2}]\leq\left\{\begin{array}{ll}\frac{\mathcal{B}^{2}}{\mathcal{L}^{2}}&\text{if}\;\;\theta_{\mathcal{L},\mu}\leq 0\\ \theta_{\mathcal{L},\mu}^{k_{0}}\|v_{0}-x^{*}\|^{2}+\frac{1-\theta_{\mathcal{L},\mu}^{k_{0}}}{1-\theta_{\mathcal{L},\mu}}\left(1+\frac{2}{C_{\beta,c,\mathcal{B}_{h}}\theta_{\mathcal{L},\mu}}\right)\frac{\mathcal{B}^{2}}{\mathcal{L}^{2}}&\text{if}\;\;\theta_{\mathcal{L},\mu}>0,\end{array}\right.

\begin{aligned}
&\mathbb{E}[\|v_{k}-x^{*}\|^{2}]+\gamma_{k}\mathbb{E}[\|x_{k}-x^{*}\|^{2}]+\frac{1}{6}C_{\beta,c,\mathcal{B}_{h}}\mathbb{E}[\text{dist}^{2q}(x_{k},\mathcal{X})]\\
&\quad\leq\left(1-\gamma_{k}\right)\mathbb{E}[\|v_{k-1}-x^{*}\|^{2}]+\left(1+\frac{6}{C_{\beta,c,\mathcal{B}_{h}}}\right)\frac{16}{\mu^{2}}\gamma_{k}^{2}\mathcal{B}^{2}\quad\forall k>k_{0}.
\end{aligned}

Let us define for $k\geq k_{0}+1$ the sum:

\displaystyle S_{k}=\sum_{t=k_{0}+1}^{k}(t+1)^{2}\sim\mathcal{O}(k^{3}+k_{0}^{% 2}k+k^{2}k_{0})

and the corresponding average sequences:

\displaystyle\hat{x}_{k}=\frac{\sum_{t=k_{0}+1}^{k}(t+1)^{2}x_{t}}{S_{k}},% \quad\text{and}\quad\hat{w}_{k}=\frac{\sum_{t=k_{0}+1}^{k}(t+1)^{2}\Pi_{% \mathcal{X}}(x_{t})}{S_{k}}\in\mathcal{X}.
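The two weighted averages defined above can be computed as in the following sketch. The helper name `weighted_averages` and the callable `project_X` (a user-supplied projection onto the convex set $\mathcal{X}$) are assumptions made for illustration.

```python
import numpy as np

def weighted_averages(iterates, k0, project_X):
    """Compute x_hat_k and w_hat_k with weights (t+1)^2 for t = k0+1..k.

    `iterates[t]` holds x_t, k = len(iterates) - 1, and `project_X`
    is a hypothetical projection operator onto the feasible set X.
    """
    ts = range(k0 + 1, len(iterates))
    weights = np.array([(t + 1) ** 2 for t in ts], dtype=float)
    xs = np.array([iterates[t] for t in ts])
    ws = np.array([project_X(iterates[t]) for t in ts])
    S_k = weights.sum()
    # Convex combinations: w_hat stays in X whenever X is convex.
    return weights @ xs / S_k, weights @ ws / S_k
```

Since $\hat{w}_{k}$ is a convex combination of points of $\mathcal{X}$, the quantity $\|\hat{w}_{k}-\hat{x}_{k}\|$ gives an upper bound on $\text{dist}(\hat{x}_{k},\mathcal{X})$.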

Theorem 4.6.

Let $f_{i},\;g_{i}$, with $i=1:N$, and $h_{j}$, with $j=1:m$, be convex functions. Additionally, let Assumptions 2.1–2.4 hold and let the random vectors $\zeta,\;\xi$ be nonnegative. Further, consider the stepsize-switching rule $\alpha_{k}=\min\left(\frac{1}{\mathcal{L}},\frac{8}{\mu(k+1)}\right)$, $\beta\in\left(0,2\right)$ and $k_{0}=\lceil{\frac{8\mathcal{L}}{\mu}}\rceil$. Then, for $k>k_{0}$ we have the following sublinear convergence rates for the average sequence $\hat{x}_{k}$ in terms of optimality and feasibility violation for problem (1) (keeping only the dominant terms):

\begin{aligned}
&\mathbb{E}\left[\|\hat{x}_{k}-x^{*}\|^{2}\right]\leq\mathcal{O}\left(\frac{\mathcal{B}^{2}}{\mu^{2}C_{\beta,c,\mathcal{B}_{h}}\,(k+1)}\right),\\
&\mathbb{E}\left[\text{dist}^{2}(\hat{x}_{k},\mathcal{X})\right]\leq\mathcal{O}\left(\frac{\mathcal{B}^{2/q}}{\mu^{2/q}C_{\beta,c,\mathcal{B}_{h}}^{2/q}(k+1)^{2/q}}\right).
\end{aligned}

Proof.

Using Lemma 4.5, we get the recurrence:

\begin{aligned}
&(k+1)^{2}\mathbb{E}[\|v_{k}-x^{*}\|^{2}]+2(k+1)\mathbb{E}[\|x_{k}-x^{*}\|^{2}]+\frac{C_{\beta,c,\mathcal{B}_{h}}}{6}(k+1)^{2}\mathbb{E}[\text{dist}^{2q}(x_{k},\mathcal{X})]\\
&\quad\leq k^{2}\mathbb{E}[\|v_{k-1}-x^{*}\|^{2}]+\left(1+\frac{6}{C_{\beta,c,\mathcal{B}_{h}}}\right)\frac{64}{\mu^{2}}\mathcal{B}^{2}\quad\forall k>k_{0}.
\end{aligned}

Summing this inequality from $k_{0}+1$ to $k$ and using linearity of the expectation operator and convexity of the norm, we get:

\begin{aligned}
&(k+1)^{2}\mathbb{E}[\|v_{k}-x^{*}\|^{2}]+\frac{2S_{k}}{(k+1)}\mathbb{E}[\|\hat{x}_{k}-x^{*}\|^{2}]+\frac{S_{k}C_{\beta,c,\mathcal{B}_{h}}}{6}\mathbb{E}[\|\hat{w}_{k}-\hat{x}_{k}\|^{2q}]\\
&\quad\leq(k_{0}+1)^{2}\mathbb{E}[\|v_{k_{0}}-x^{*}\|^{2}]+\left(1+\frac{6}{C_{\beta,c,\mathcal{B}_{h}}}\right)\frac{64}{\mu^{2}}\mathcal{B}^{2}(k-k_{0}).
\end{aligned}

After simple calculations and keeping only the dominant terms, we get the following convergence rate for the average sequence $\hat{x}_{k}$ in terms of optimality, together with a bound on $\|\hat{w}_{k}-\hat{x}_{k}\|$:

\begin{aligned}
&\mathbb{E}[\|\hat{x}_{k}-x^{*}\|^{2}]\leq\mathcal{O}\left(\frac{\mathcal{B}^{2}}{\mu^{2}C_{\beta,c,\mathcal{B}_{h}}\,(k+1)}\right),\\
&\left(\mathbb{E}[\|\hat{w}_{k}-\hat{x}_{k}\|^{2}]\right)^{q}\leq\mathbb{E}[\|\hat{w}_{k}-\hat{x}_{k}\|^{2q}]\leq\mathcal{O}\left(\frac{\mathcal{B}^{2}}{\mu^{2}C^{2}_{\beta,c,\mathcal{B}_{h}}\,(k+1)^{2}}\right).
\end{aligned}

Since $\hat{w}_{k}\in\mathcal{X}$ , we get the following convergence rate for the average sequence $\hat{x}_{k}$ in terms of feasibility violation:

\mathbb{E}[\text{dist}^{2}(\hat{x}_{k},\mathcal{X})]\leq\mathbb{E}[\|\hat{w}_{k}-\hat{x}_{k}\|^{2}]\leq\mathcal{O}\left(\frac{\mathcal{B}^{2}}{\mu^{2}C^{2}_{\beta,c,\mathcal{B}_{h}}\,(k+1)^{2}}\right)^{\frac{1}{q}}.

These prove our statements. ∎

Note that our previous theoretical convergence analysis naturally imposes a stepsize-switching rule which describes when one should switch from a constant regime (depending on mini-batch size $\tau_{1}$ ) to a decreasing stepsize regime, i.e., $\alpha_{k}=\min\left(\frac{1}{\mathcal{L}},\frac{8}{\mu(k+1)}\right)$ . For the particular choice of the stepsize $\beta=1$ , we have (see (24)):

\displaystyle C_{1,c,\mathcal{B}_{h}}=\left(\frac{1}{c\mathcal{B}_{h}^{2}}% \right)>0,

since we can always choose $c$ such that $c\mathcal{B}_{h}^{2}>1$ . Using this expression in the convergence rates of Theorem 4.6, we obtain:

\begin{aligned}
&\mathbb{E}\left[\|\hat{x}_{k}-x^{*}\|^{2}\right]\leq\mathcal{O}\left(\frac{\mathcal{B}^{2}(c\mathcal{B}_{h}^{2})}{\mu^{2}(k+1)}\right),\\
&\mathbb{E}\left[\text{dist}^{2}(\hat{x}_{k},\mathcal{X})\right]\leq\mathcal{O}\left(\frac{\mathcal{B}^{2}(c\mathcal{B}_{h}^{2})^{2}}{\mu^{2}(k+1)^{2}}\right)^{1/q}.
\end{aligned}

By replacing the values for $\mathcal{L}$ , $\mathcal{B}$ , $\mathcal{B}_{h}$ and $c$ from Theorem 3.5 for both types of sampling, i.e., partition or $\tau_{1}$ , $\tau_{2}$ -nice samplings, we get:

\begin{aligned}
&\mathbb{E}\left[\|\hat{x}_{k}-x^{*}\|^{2}\right]\leq\mathcal{O}\left(\frac{m}{\tau_{2}}\frac{N}{\tau_{1}}\cdot\frac{B^{2}\bar{c}\max_{j=1:m}^{2}B_{j}}{\mu^{2}(k+1)}\right),\\
&\mathbb{E}\left[\text{dist}^{2}(\hat{x}_{k},\mathcal{X})\right]\leq\mathcal{O}\left(\left(\frac{m}{\tau_{2}}\right)^{2}\frac{N}{\tau_{1}}\frac{B^{2}\bar{c}^{2}\max_{j=1:m}^{4}B_{j}}{\mu^{2}(k+1)^{2}}\right)^{1/q}.
\end{aligned}

One can easily see that also in this case the obtained rates have linear dependence on the mini-batch sizes $(\tau_{1},\tau_{2})$ . Therefore, Theorem 4.6 also proves that in the quadratic growth convex case a mini-batch variant of the stochastic subgradient projection scheme with a stepsize-switching rule brings benefits over the nonmini-batch variant.
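The stepsize-switching rule of Theorem 4.6 is simple to implement. The sketch below uses hypothetical function names and evaluates $\alpha_{k}=\min\left(\frac{1}{\mathcal{L}},\frac{8}{\mu(k+1)}\right)$ together with the index $k_{0}=\lceil 8\mathcal{L}/\mu\rceil$; the numerical values in the usage note are arbitrary.

```python
import math

def switching_stepsize(k, L_cal, mu):
    """alpha_k = min(1/L_cal, 8/(mu*(k+1))), the rule stated in Theorem 4.6."""
    return min(1.0 / L_cal, 8.0 / (mu * (k + 1)))

def switch_index(L_cal, mu):
    """k_0 = ceil(8*L_cal/mu), after which the decreasing regime is used."""
    return math.ceil(8.0 * L_cal / mu)
```

For partition or nice sampling one would pass `L_cal = N / tau1 * L`, so a larger mini-batch size $\tau_{1}$ gives a larger constant stepsize $1/\mathcal{L}$ and a smaller switching index $k_{0}$.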

5 Numerical simulations

In this section, we consider a general quadratic program with quadratic constraints:

\begin{array}[]{rl}\min_{x\in\mathbb{R}^{n}}&\frac{1}{2}\|Ax-b\|^{2}+\|\Delta x% \|_{1}\\ \text{subject to }&Cx+d\geq 0,\;\;c_{i}^{T}x+d_{i}\geq\|Q_{i}^{-1/2}x\|\quad% \forall i=1:m,\end{array}

(29)

with the matrices $A\in\mathbb{R}^{N\times n}$ , $\Delta\in\mathbb{R}^{N\times n}$ , $C\in\mathbb{R}^{m\times n}$ , $Q_{i}\in\mathbb{R}^{m_{i}\times n}$ and $c_{i}\in\mathbb{R}^{n}$ , with $i=1:m$ . One can notice that this problem fits into our general modeling framework (1) (e.g., define $f_{i}(x)=1/2(a_{i}^{T}x-b_{i})^{2}$ , with $a_{i}$ the $i$ th row of matrix $A$ , $g_{i}(x)=\|\delta_{i}^{T}x\|_{1}$ , with $\delta_{i}$ the $i$ th row of matrix $\Delta$ , for all $i=1:N$ , and $h_{j}$ are either linear or quadratic constraints, for all $j=1:2m$ ). Moreover, (29) is a general constrained Lasso problem which appears in many applications from machine learning, signal processing and statistics, see [2, 17, 10, 4, 9]. In particular if one considers appropriate matrices $A$ , $\Delta$ , $C$ and $Q_{i}$ , one can recast the robust (sparse) SVM problem from [2, 17] as problem (29). Indeed, the robust (sparse) SVM problem is defined as [2, 17]:

\begin{aligned}
&\min_{w,d,u}\;\frac{\lambda}{2}\|w\|^{2}+\delta\sum_{i=1}^{m}u_{i}+\|w\|_{1}\\
&\text{subject to:}\;u\geq 0,\;y_{i}(w^{T}\bar{z}_{i}+d)\geq 1-u_{i},\\
&\qquad\qquad\;\;\;y_{i}(w^{T}\bar{z}_{i}+d)\geq\|Q_{i}^{-1/2}w\|+1-u_{i}\;\;\;\forall i=1:m,
\end{aligned}

where $(\bar{z}_{i})_{i=1}^{m}$ is the training dataset, $(y_{i})_{i=1}^{m}\in\{-1,1\}$ are the corresponding labels, $Q_{i}$ ’s are diagonal matrices with positive entries, $\delta>0$ and $(w,d)\in\mathbb{R}^{n}\times\mathbb{R}$ are the parameters of the hyperplane to separate the data.

In the numerical experiments we consider random matrices $A$ and $C$ and diagonal matrices $\Delta$ and $Q_{i}$, all generated from normal distributions. We define an epoch as $\max\left(\frac{N}{\tau_{1}},\frac{m}{\tau_{2}}\right)$ iterations of the Mini-batch SSP algorithm, and our stopping criteria are $\|\max(0,h(x))\|_{2}\leq 10^{-2}$ and $F(x)-F^{*}\leq 10^{-2}$ (we use the CVX solution [6] to compute $F^{*}$, when CVX finishes in a reasonable time). The codes are written in Matlab and run on a PC with an i7 CPU at 2.1 GHz and 16 GB of RAM.
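The objective, the constraint functions and the stopping test for problem (29) can be sketched in a few lines. This is a Python illustration and not the authors' Matlab code. It assumes that each $Q_{i}$ is diagonal with positive entries stored row-wise in `q_diag`, that $c_{i}$ and $d_{i}$ are the $i$th row of $C$ and the $i$th entry of $d$, and that a non-integer epoch length is rounded up; all function names are hypothetical.

```python
import math
import numpy as np

def objective(x, A, b, Delta):
    """F(x) = 0.5*||Ax - b||^2 + ||Delta x||_1."""
    return 0.5 * np.sum((A @ x - b) ** 2) + np.sum(np.abs(Delta @ x))

def constraint_values(x, C, d, q_diag):
    """Stack the 2m constraints of (29), each written as h_j(x) <= 0.

    Assumption: Q_i is diagonal with positive entries q_diag[i].
    """
    lin = C @ x + d
    linear = -lin                                              # Cx + d >= 0
    conic = np.linalg.norm(x / np.sqrt(q_diag), axis=1) - lin  # SOC-type rows
    return np.concatenate([linear, conic])

def stopping_test(x, A, b, Delta, C, d, q_diag, F_star, tol=1e-2):
    """Both criteria from the text: feasibility and optimality gaps <= tol."""
    viol = np.linalg.norm(np.maximum(constraint_values(x, C, d, q_diag), 0.0))
    return viol <= tol and objective(x, A, b, Delta) - F_star <= tol

def iterations_per_epoch(N, m, tau1, tau2):
    """max(N/tau1, m/tau2), rounded up (the rounding is an assumption)."""
    return math.ceil(max(N / tau1, m / tau2))
```

For example, with $N=120$, $m=240$ and mini-batch sizes $(20,80)$, one epoch corresponds to $\max(6,3)=6$ iterations.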

Figure 1 shows the convergence behaviour of the Mini-batch SSP algorithm along epochs with four different choices for mini-batch sizes $(\tau_{1},\tau_{2})$, namely $(1,1),\;(20,80),\;(60,160)$ and $(N=120,m=240)$, in terms of optimality (left) and feasibility (right) for solving the constrained Lasso problem (29) with $N=120,n=110,m=240$. As we can see from this figure, increasing the mini-batch sizes $(\tau_{1},\tau_{2})$ leads to better convergence than the nonmini-batch counterpart, as our theory also predicted.

Figure 1: Behaviour of the Mini-batch SSP algorithm in terms of optimality (left) and feasibility (right) for $N=120,n=110,m=240$ and different mini-batch sizes $(\tau_{1},\tau_{2})$.

Finally, in Table 1 we compare the Mini-batch SSP algorithm with CVX in terms of cpu time (in seconds) for solving problem (29) over different problem dimensions, ranging from several hundreds to thousands of functions ($N$) and constraints ($m$), respectively (note that if $N<n$, then the objective function $F$ is convex, otherwise $F$ is strongly convex). For the Mini-batch SSP algorithm we consider four different choices of the mini-batch sizes, and in the table we also report the number of epochs. The results presented in the table are averages over $10$ runs on the same problem. From the table we observe that for some choices of the mini-batch sizes the Mini-batch SSP algorithm is even $10$ times faster than CVX ("*" means that CVX has not finished after 3 hours). Moreover, Mini-batch SSP is much faster than its non-mini-batch counterpart.

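\begin{tabular}{|c|c|c|c|c|}
\hline
Sizes & \multicolumn{3}{c|}{Mini-batch SSP} & CVX \\
\hline
 & sizes $(\tau_{1},\tau_{2})$ & epochs & cpu time (s) & cpu time (s) \\
\hline
$N=120$, $m=240$, $n=110$ & $(1,1)$ & 655 & 0.17 & 1.44 \\
 & $(20,80)$ & 148 & 0.06 & \\
 & $(60,160)$ & 131 & 0.07 & \\
 & $(N,m)$ & 166 & 0.11 & \\
\hline
$N=100$, $m=240$, $n=110$ & $(1,1)$ & 1023 & 0.25 & 1.51 \\
 & $(20,80)$ & 202 & 0.08 & \\
 & $(60,160)$ & 175 & 0.08 & \\
 & $(N,m)$ & 357 & 0.21 & \\
\hline
$N=1200$, $m=2400$, $n=1100$ & $(1,1)$ & 8131 & 51.94 & 177.08 \\
 & $(200,800)$ & 958 & 9.38 & \\
 & $(600,1600)$ & 713 & 9.59 & \\
 & $(N,m)$ & 2327 & 48.81 & \\
\hline
$N=1000$, $m=2400$, $n=1100$ & $(1,1)$ & 13115 & 66.15 & 179.67 \\
 & $(200,800)$ & 1983 & 14.70 & \\
 & $(600,1600)$ & 1158 & 12.07 & \\
 & $(N,m)$ & 5771 & 61.33 & \\
\hline
$N=3600$, $m=7200$, $n=3300$ & $(1,1)$ & 19491 & 2008.60 & * \\
 & $(600,2400)$ & 298 & 52.94 & \\
 & $(1800,4800)$ & 1432 & 387.91 & \\
 & $(N,m)$ & 1200 & 464.79 & \\
\hline
$N=3000$, $m=7200$, $n=3300$ & $(1,1)$ & 40168 & 3618.37 & * \\
 & $(600,2400)$ & 2990 & 457.99 & \\
 & $(1800,4800)$ & 2130 & 471.87 & \\
 & $(N,m)$ & 24903 & 7260.44 & \\
\hline
\end{tabular}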

Table 1: Comparison between Mini-batch SSP and CVX for different dimensions and mini-batch sizes.
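The speed-ups implied by Table 1 can be recomputed directly from the reported cpu times. The short Python sketch below does this for each problem size; the dictionary `table1` and the function `speedups` are hypothetical names introduced for illustration, the four times per problem are listed in the order $(1,1)$, the two intermediate mini-batch choices, and $(N,m)$, and `None` marks the two largest problems, for which no CVX time is reported.

```python
# Reported cpu times (s): ([SSP times for the four mini-batch choices], CVX time).
table1 = {
    "N=120, m=240, n=110": ([0.17, 0.06, 0.07, 0.11], 1.44),
    "N=100, m=240, n=110": ([0.25, 0.08, 0.08, 0.21], 1.51),
    "N=1200, m=2400, n=1100": ([51.94, 9.38, 9.59, 48.81], 177.08),
    "N=1000, m=2400, n=1100": ([66.15, 14.70, 12.07, 61.33], 179.67),
    "N=3600, m=7200, n=3300": ([2008.60, 52.94, 387.91, 464.79], None),
    "N=3000, m=7200, n=3300": ([3618.37, 457.99, 471.87, 7260.44], None),
}


def speedups(ssp_times, cvx_time):
    """Speed-up of the fastest mini-batch choice over (1,1) and over CVX."""
    best = min(ssp_times)
    over_single = ssp_times[0] / best  # first entry is the (1,1) run
    over_cvx = None if cvx_time is None else cvx_time / best
    return over_single, over_cvx


if __name__ == "__main__":
    for label, (ssp_times, cvx_time) in table1.items():
        over_single, over_cvx = speedups(ssp_times, cvx_time)
        cvx_text = "n/a" if over_cvx is None else f"{over_cvx:.1f}x"
        print(f"{label}: {over_single:.1f}x over (1,1), {cvx_text} over CVX")
```

In every case the fastest Mini-batch SSP run uses one of the intermediate mini-batch sizes rather than $(1,1)$ or the full batch $(N,m)$.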

6 Conclusions

In this paper we have considered a deterministic general finite sum composite optimization problem with many functional constraints. We have reformulated this problem into a stochastic problem for which the stochastic subgradient projection method from [17] specializes to an infinite array of mini-batch variants, each of which is associated with a specific probability law governing the data selection rule used to form mini-batches. By specializing different mini-batching strategies, we have derived exact expressions for the stepsizes as a function of the mini-batch size and in some cases we have derived stepsize-switching rules which describe when one should switch from a constant to a decreasing stepsize regime. We have also proved sublinear convergence rates for the mini-batch subgradient projection algorithm which depend explicitly on the mini-batch sizes and on the properties of the objective function. Preliminary numerical results support the effectiveness of our method in practice.

Funding

The research leading to these results has received funding from: the NO Grants 2014–2021 RO-NO-2019-0184, under project ELO-Hyp, contract no. 24/2020, and UEFISCDI PN-III-P4-PCE-2021-0720, under project L2O-MOC, nr. 70/2022, for N.K. Singh and I. Necoara; and the OP VVV project CZ.02.1.01/0.0/0.0/16_019/0000765, Research Center for Informatics, for V. Kungurtsev.

References

[1] H. Asi, K. Chadha, G. Cheng and J. Duchi, Minibatch stochastic approximate proximal point methods, Advances in Neural Information Processing Systems Conference, 2020.
[2] C. Bhattacharyya, L.R. Grate, M.I. Jordan, L. El Ghaoui and S. Mian, Robust sparse hyperplane classifiers: Application to uncertain molecular profiling data, Journal of Computational Biology, 11(6): 1073–1089, 2004.
[3] J. Duchi and Y. Singer, Efficient online and batch learning using forward backward splitting, Journal of Machine Learning Research, 10: 2899–2934, 2009.
[4] B.R. Gaines, J. Kim and H. Zhou, Algorithms for fitting the constrained Lasso, J. Comput. Graph. Stat., 27(4): 861–871, 2018.
[5] G. Garrigos and R.M. Gower, Handbook of convergence theorems for (stochastic) gradient methods, arXiv:2301.11235v2, 2023.
[6] M. Grant and S. Boyd, CVX: Matlab software for disciplined convex programming, version 2.0 beta, http://cvxr.com/cvx, 2013.
[7] E. Gorbunov, F. Hanzely and P. Richtarik, A unified theory of SGD: variance reduction, sampling, quantization and coordinate descent, Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, vol. 108, 2020.
[8] M. Hardt, B. Recht and Y. Singer, Train faster, generalize better: stability of stochastic gradient descent, International Conference on Machine Learning, 2016.
[9] Q. Hu, P. Zeng and L. Lin, The dual and degrees of freedom of linearly constrained generalized lasso, Comput. Stat. Data Anal., 86:13–26, 2015.
[10] G.M. James, C. Paulson and P. Rusmevichientong, Penalized and constrained optimization: an application to high-dimensional website advertising, SIAM Journal on Optimization, 30(4): 3230–3251, 2019.
[11] L. Jacob, G. Obozinski and J.P. Vert, Group lasso with overlap and graph lasso, International Conference on Machine Learning, 433–440, 2009.
[12] R. Johnson and T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, Advances in Neural Information Processing Systems, 315–323, 2013.
[13] A. Lewis and J.S. Pang, Error bounds for convex inequality systems, Generalized Convexity, Generalized Monotonicity (J.-P. Crouzeix, J.-E.Martinez-Legaz, and M. Volle, eds.), 75–110, Cambridge University Press, 1998.
[14] H. Lin, J. Mairal and Z. Harchaoui, A universal catalyst for first-order optimization, Advances in Neural Information Processing Systems Conference, 2015.
[15] E. Moulines and F. Bach, Non-asymptotic analysis of stochastic approximation algorithms for machine learning, Advances in Neural Information Processing Systems Conf., 2011.
[16] V. Nedelcu, I. Necoara and Q. Tran Dinh, Computational complexity of inexact gradient augmented Lagrangian methods: application to constrained MPC, SIAM Journal on Control and Optimization, 52(5): 3109–3134, 2014.
[17] I. Necoara and N.K. Singh, Stochastic subgradient projection methods for composite optimization with functional constraints, arXiv preprint: 2204.08204, 2022.
[18] I. Necoara, General convergence analysis of stochastic first order methods for composite optimization, Journal of Optimization Theory and Applications, 189: 66–95, 2021.
[19] A. Nemirovski and D.B. Yudin, Problem complexity and method efficiency in optimization, Wiley Interscience, 1983.
[20] Yu. Nesterov, Lectures on Convex Optimization, Springer Optimization and Its Applications, 137, 2018.
[21] A. Nedich and I. Necoara, Random minibatch subgradient algorithms for convex problems with functional constraints, Applied Mathematics and Optimization, 80(3): 801–833, 2019.
[22] A. Nemirovski, A. Juditsky, G. Lan and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journ. Optimization, 19(4): 1574–1609, 2009.
[23] X. Peng, L. Li and F. Wang, Accelerating minibatch stochastic gradient descent using typicality sampling, IEEE Trans. Neural Networks Learn. Syst., 2019.
[24] B.T. Polyak, Minimization of unsmooth functionals, USSR Computational Mathematics and Mathematical Physics, 9(3): 14–29, 1969.
[25] B.T. Polyak and A.B. Juditsky, Acceleration of stochastic approximation by averaging, SIAM Journal on Control and Optimization, 30(4): 838–855, 1992.
[26] A. Patrascu and I. Necoara, Nonasymptotic convergence of stochastic proximal point algorithms for constrained convex optimization, Journal of Machine Learning Research, 18(198): 1–42, 2018.
[27] P. Richtarik and M. Takac, On optimal probabilities in stochastic coordinate descent methods, Optimization Letters, 10(6): 1233–1243, 2016.
[28] R.T. Rockafellar and S.P. Uryasev, Optimization of conditional value-at-risk, Journal of Risk, 2: 21–41, 2000.
[29] R.M. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin and P. Richtarik, SGD: General analysis and improved rates, International Conference on Machine Learning, 2019.
[30] H. Robbins and S. Monro, A Stochastic Approximation Method, The Annals of Mathematical Statistics, 22(3): 400–407, 1951.
[31] L. Rosasco, S. Villa and B.C. Vu, Convergence of stochastic proximal gradient algorithm, Applied Mathematics and Optimization, 82: 891–917, 2020.
[32] J. Renegar and S. Zhou, A different perspective on the stochastic convex feasibility problem, arXiv preprint: 2108.12029v1, 2021.
[33] R. Tibshirani, The solution path of the generalized lasso, PhD Thesis, Stanford Univ., 2011.
[34] V. Vapnik, Statistical learning theory, John Wiley, 1998.
[35] T. Yang and Q. Lin, RSG: Beating subgradient method without smoothness and strong convexity, Journal of Machine Learning Research, 19(6): 1–33, 2018.
[36] X. Yin and İ Büyüktahtakın, A multi-stage stochastic programming approach to epidemic resource allocation with equity considerations, Health Care Management Science, 24(3): 597–622, 2021.
[37] M. Zafar, I. Valera, M. Gomez-Rodriguez and K. Gummadi, Fairness constraints: A flexible approach for fair classification, Journal of Machine Learning Research, 20(1): 2737–2778, 2019.

	$\displaystyle\|\nabla F(x,\zeta)\|^{2}$	$\displaystyle=\left\|\frac{1}{N}\sum_{i=1}^{N}\zeta^{i}\nabla F_{i}(x)\right\|^{2}=\frac{1}{N^{2}}\|\nabla\hat{F}(x)\zeta\|^{2}\leq\frac{1}{N^{2}}\|\nabla\hat{F}(x)\|^{2}\|\zeta\|^{2}$
		$\displaystyle\leq\frac{1}{N^{2}}\|\nabla\hat{F}(x)\|_{F}^{2}\|\zeta\|^{2}=\frac{\|\zeta\|^{2}}{N}\left(\frac{1}{N}\sum_{i=1}^{N}\|\nabla F_{i}(x)\|^{2}\right)$
		$\displaystyle\overset{\eqref{as:main1_spg}}{\leq}\frac{\|\zeta\|^{2}}{N}B^{2}+\frac{\|\zeta\|^{2}}{N}L(F(x)-F(\bar{x})),$

	$\displaystyle\nabla h(x,\xi)=\text{Conv}\{\xi^{j^{*}}\nabla h_{j^{*}}(x)\;|\;j^{*}\in J^{*}(x)\}$
$\displaystyle\implies$	$\displaystyle\|\nabla h(x,\xi)\|\leq\max_{\theta_{j^{*}}\geq 0,\;\sum_{j^{*}\in J^{*}(x)}\theta_{j^{*}}=1}\left\|\sum_{j^{*}\in J^{*}(x)}\theta_{j^{*}}\xi^{j^{*}}\cdot\nabla h_{j^{*}}(x)\right\|$
	$\displaystyle\leq\max_{\theta_{j^{*}}\geq 0,\;\sum_{j^{*}\in J^{*}(x)}\theta_{j^{*}}=1}\sum_{j^{*}\in J^{*}(x)}\theta_{j^{*}}\xi^{j^{*}}\cdot\|\nabla h_{j^{*}}(x)\|$
	$\displaystyle\leq\max_{\theta_{j^{*}}\geq 0,\;\sum_{j^{*}\in J^{*}(x)}\theta_{j^{*}}=1}\sum_{j^{*}\in J^{*}(x)}\theta_{j^{*}}\bar{\xi}\cdot\|\nabla h_{j^{*}}(x)\|$
	$\displaystyle=\bar{\xi}\max_{j^{*}\in J^{*}(x)}\|\nabla h_{j^{*}}(x)\|\overset{\eqref{ass:3}}{\leq}\bar{\xi}\max_{j=1:m}B_{j}=\mathcal{B}_{h},$	(12)

	$\displaystyle\mathbb{E}_{\xi_{k-1}}[\|x_{k}-\bar{x}_{k}\|^{2}\;|\;\mathcal{F}_{[k-2]}]$	$\displaystyle\leq\|v_{k-1}-\bar{v}_{k-1}\|^{2}-\beta(2-\beta)\mathbb{E}_{\xi_{k-1}}\left[\frac{(h(v_{k-1},\xi_{k-1}))_{+}^{2}}{\mathcal{B}^{2}_{h}}\;\Big|\;\mathcal{F}_{[k-2]}\right]$
		$\displaystyle\overset{\eqref{qreg}}{\leq}\|v_{k-1}-\bar{v}_{k-1}\|^{2}-\frac{\beta(2-\beta)}{c\mathcal{B}^{2}_{h}}\text{dist}^{2q}(v_{k-1},\mathcal{X})$
		$\displaystyle\overset{\eqref{eq:distvdistx}}{\leq}\|v_{k-1}-\bar{v}_{k-1}\|^{2}-\frac{\beta(2-\beta)}{c\mathcal{B}^{2}_{h}}\text{dist}^{2q}(x_{k},\mathcal{X}).$

	$\displaystyle\mathbb{E}\left[(F(\hat{x}_{k})-F^{*})\right]\leq\frac{\|v_{0}-\bar{v}_{0}\|^{2}}{\alpha_{0}^{*}\cdot{\cal O}(k^{1-\gamma})}+\frac{\alpha_{0}^{*}\cdot\mathcal{B}^{2}}{{\cal O}(k^{1-\gamma})}\leq\frac{2\|v_{0}-\bar{v}_{0}\|^{2}}{\alpha_{0}^{*}\cdot{\cal O}(k^{1-\gamma})},$		(27)
	$\displaystyle\mathbb{E}\left[\text{dist}^{2}(\hat{x}_{k},\mathcal{X})\right]\leq\left(\frac{1}{C_{\beta,c,\mathcal{B}_{h}}\alpha_{0}^{*}\cdot{\cal O}(k^{1-\gamma})}\right)^{\frac{1}{q}}\left[\|v_{0}-\bar{v}_{0}\|^{\frac{2}{q}}+\left((\alpha_{0}^{*})^{2}\mathcal{B}^{2}\right)^{\frac{1}{q}}\right]$
	$\displaystyle\qquad\qquad\qquad\quad\leq\left(\frac{1}{C_{\beta,c,\mathcal{B}_{h}}\alpha_{0}^{*}\cdot{\cal O}(k^{1-\gamma})}\right)^{\frac{1}{q}}\left[2\|v_{0}-\bar{v}_{0}\|^{\frac{2}{q}}\right].$		(28)
