License: arXiv.org perpetual non-exclusive license
arXiv:2402.15771v1 [math.OC] 24 Feb 2024

Inertial Accelerated Stochastic Mirror Descent for Large-Scale Generalized Tensor CP Decomposition

Zehui Liu LMIB, School of Mathematical Sciences, Beihang University, Beijing 100191, China. Email: [email protected]   Qingsong Wang School of Mathematics and Computational Science, Xiangtan University, Xiangtan 411105, China. Email: [email protected]   Chunfeng Cui LMIB, School of Mathematical Sciences, Beihang University, Beijing 100191, China. Email: [email protected]   Yong Xia LMIB, School of Mathematical Sciences, Beihang University, Beijing 100191, China. Email: [email protected]
Abstract

The majority of classic tensor CP decomposition models are designed for the squared loss and employ the Euclidean distance as a local proximal term. However, the Euclidean distance is unsuitable for the generalized loss functions applicable to various types of real-world data, such as integer and binary data. Consequently, algorithms developed under the squared loss are not easily adaptable to these generalized losses, partially due to the lack of gradient Lipschitz continuity. This paper considers the generalized tensor CP decomposition. We use the Bregman distance as the proximal term and propose an inertial accelerated block randomized stochastic mirror descent algorithm (iTableSMD). Within a broader multi-block variance reduction and inertial acceleration framework, we demonstrate the sublinear convergence rate for the subsequential sequence produced by the iTableSMD algorithm. We further show that iTableSMD requires at most $\mathcal{O}(\varepsilon^{-2})$ iterations in expectation to attain an $\varepsilon$-stationary point and establish the global convergence of the sequence. Numerical experiments on real datasets demonstrate that the proposed algorithm is efficient and achieves better performance than existing state-of-the-art methods.

Keywords:  Generalized tensor CP decomposition, Inertial acceleration, Stochastic mirror descent, Bregman divergence, Non-Lipschitz gradient continuity, Variance reduction.

1 Introduction

A fundamental generic optimization model that encompasses a wide range of multi-block models arising in various applications is the well-known composite minimization problem. It can be formally defined as:

$$\min_{\{x_t\}_{t=1}^{s}} \Phi(x_1,\ldots,x_s) \equiv f(x_1,\ldots,x_s) + \sum_{t=1}^{s} h_t(x_t), \tag{1}$$

where the variable $x$ can be decomposed into $s$ blocks $x_1,\ldots,x_s$, and $f$ is assumed to be a continuously differentiable nonconvex function of $x=(x_1,\ldots,x_s)$ that may be convex in a single block $x_t$ when all the other blocks are fixed. It can also admit a finite-sum structure $f(x)=\frac{1}{n}\sum_{i=1}^{n}f_i(x)$ indexed by $i$. The usual restrictive condition of gradient Lipschitz continuity of $f_i$ is not required. The functions $h_t$, $t=1,\ldots,s$, are extended-valued weakly convex functions that act as structure-promoting regularizers capturing prior information about $x_t$, such as column-wise orthogonality [32], Tikhonov regularization [44], iterated Tikhonov regularization [40], and non-negativity [36].

The majority of classic models and algorithms developed for the multi-block problem (1) are least squares (LS) problems using a Euclidean distance-based fitting criterion [22, 6]. Efficient methods in this line include block coordinate descent (BCD) [62], the alternating direction method of multipliers (ADMM) [21], and the proximal alternating linearized minimization (PALM) algorithm [4]. Pock and Sabach [47] then introduced an inertial variant of PALM (iPALM). Huang et al. [28] introduced a primal-dual algorithm, AO-ADMM, which is a hybrid between alternating optimization and ADMM. There are also other first-order algorithms [46, 12] and (quasi-)second-order methods [52, 45]. However, the Euclidean distance is unsuitable for measuring the proximity between various real-world data types, including nonnegative, integer, and binary data. Using data geometry-aware divergences as fitting criteria has the potential to substantially improve both performance and robustness in real-world applications [26, 31, 56, 55]. For instance, the proximity between two probability distributions is measured using an appropriate divergence, such as the generalized Kullback-Leibler (KL) divergence [27, 29, 19, 9] or the Itakura-Saito divergence [17]. From a statistical perspective, numerous non-Euclidean divergences are closely connected with maximum likelihood estimators (MLEs) under reasonable data distribution assumptions. For example, the generalized KL divergence [10] and the logistic loss arise from the MLEs for integer data with Poisson distributions and binary data with Bernoulli distributions, respectively. Nevertheless, methods under non-Euclidean divergences are much more challenging than those under the Euclidean loss, especially when the data size becomes huge.
Algorithms designed for the LS loss are not easily adaptable to complex loss functions, primarily due to the absence of gradient Lipschitz continuity, even under relatively mild conditions.

Problem (1) covers tensor CP decomposition with regularization, which can be viewed as an extension of matrix factorization [53]. The idea of the canonical polyadic (CP) decomposition dates back to Hitchcock [24, 25] in 1927, who expressed a tensor as a sum of a finite number of rank-1 tensors. Subsequently, Cattell [7, 8] proposed ideas for parallel proportional analysis and multiple axes for analysis. More recently, Hong et al. developed in 2020 the generalized canonical polyadic (GCP) low-rank tensor decomposition, which allows loss functions other than the squared error. Below, we briefly review existing developments for multi-block models with non-Euclidean loss functions.

Many existing non-Euclidean approaches employ the block coordinate descent (BCD) [62] for updating block variables. Convergence of the BCD method typically requires the uniqueness of the minimizer at each step or the quasi-convexity of the objective function [54]. Unfortunately, these requirements can be restrictive in some important practical problems such as the tensor decomposition problem [30]. Cichocki and Phan [11] proposed a hierarchical alternating optimization algorithm for CP decomposition with $\alpha$- and $\beta$-divergence. In [10], the generalized KL-divergence loss was explored, leading to the development of a block majorization-minimization (MM) algorithm. The work in [29] presented the exponential gradient algorithm for handling the KL-divergence. Additionally, alternative optimization frameworks like Gauss-Newton based methods [55] and quasi-Newton methods [26] have been devised for non-Euclidean models.

It is worth noting that most of the algorithms mentioned above use the entire dataset for each update, which is time-consuming. In contrast, stochastic algorithms reduce the computational and memory requirements per iteration. A recent stochastic gradient-based algorithm [31] was introduced for tensor CP decomposition. However, it randomly samples tensor entries for updates, neglecting the potential gains in computational efficiency offered by the multilinear algebraic properties of low-rank tensors. More importantly, this update strategy loses the opportunity to incorporate regularization terms on the entire latent factors, because the sampled entries only provide partial information about them. To address this, Battaglino et al. proposed an algorithm [1] that samples tensor fibers containing information about complete latent factors. However, these stochastic algorithms lack convergence guarantees. Pu et al. [49] developed a block-randomized stochastic mirror descent (SMD) [13, 3] algorithmic framework for large-scale CP decomposition under various non-Euclidean losses, also referred to as generalized CP decomposition. Specifically, at each iteration, one block factor is randomly chosen for an update while all other factors are kept fixed. Then, instead of solving the subproblem directly, the method updates the unknown factor by one SMD step. This work also incorporated a fiber sampling strategy to assist in designing the SMD updates, which makes the computational cost much smaller. However, pure SMD still converges slowly. Wang et al. proposed mBrasCPD [57] and iBrasCPD [60], which speed up the SGD scheme by the heavy ball method [48] and inertial acceleration, respectively. Both algorithms are designed for Euclidean loss functions and are not directly suitable for generalized non-Euclidean loss functions. Additionally, these algorithms only consider unbiased stochastic gradients, which can only induce weak convergence properties.
Recently, [58] and [61] introduced the Bregman proximal stochastic gradient (BPSG) method and BPSG with extrapolation (BPSGE), respectively, and established subsequential and global convergence of the generated sequence under a general variance reduction framework. However, both BPSG and BPSGE are designed for single-block problems, and their potential for computing generalized tensor CP decompositions remains untapped.

An overview of several state-of-the-art algorithms for multi-block problem (1) is presented in Table 1.

Table 1: Summary of the properties of iTableSMD (Algorithm 1) and several state-of-the-art stochastic methods. “subseq.” and “seq.” denote subsequential and sequential convergence, respectively. “Complexity” means the complexity (in expectation) to obtain an $\varepsilon$-stationary point (Definition 3) of $\Phi$, and “-” means not given.

| Algorithm | Loss | $h(x)$ | Acceleration | Convergence | Complexity | $\nabla f_i$-Lip | $\tilde{\nabla} f_i$ |
|---|---|---|---|---|---|---|---|
| BrasCPD [18] | LS | convex | no | subseq. | - | yes | unbiased |
| mBrasCPD [57] | LS | convex | heavy ball | subseq. | - | yes | unbiased |
| iBrasCPD [60] | LS | convex | inertial | subseq. | - | yes | unbiased |
| SPRING [16] | general | nonconvex | no | subseq./seq. | - | yes | biased |
| iSPALM [23] | general | nonconvex | inertial | subseq./seq. | - | yes | biased |
| SmartCPD [49] | general | convex | no | subseq. | - | no | unbiased |
| iTableSMD (Alg. 1) | general | weakly-convex | inertial | subseq./seq. | $\mathcal{O}(\varepsilon^{-2})$ | no | biased |

In this paper, inspired by inertial acceleration techniques and the variance reduction framework for stochastic algorithms, we propose an inertial accelerated stochastic mirror descent method to solve the nonconvex and nonsmooth optimization problem (1), which can be applied to the block-wise subproblems of generalized tensor CP decomposition under non-Euclidean loss functions. Our main contributions are as follows:

  • (1)

    We introduce an inertial accelerated block-randomized SMD algorithm, denoted as iTableSMD, designed to address the GCP decomposition problem. Within a broader multi-block variance reduction framework, we demonstrate the sublinear convergence rate for the subsequential sequence produced by the iTableSMD algorithm.

  • (2)

    Within a broader multi-block variance reduction and inertial acceleration framework, we establish the sublinear convergence rate for the subsequential sequence produced by the iTableSMD algorithm. Furthermore, we introduce a novel Lyapunov function and prove that it requires at most $\mathcal{O}(\varepsilon^{-2})$ iterations in expectation to attain an $\varepsilon$-stationary point. Additionally, we establish the global convergence of the sequence generated by iTableSMD.

  • (3)

    We conduct extensive experiments on three synthetic datasets and two real-world datasets under several distributions to demonstrate the effectiveness of the proposed algorithm iTableSMD. The numerical experiments show that the proposed method achieves better convergence than the compared baselines.

The rest of this paper is organized as follows. Section 2 outlines necessary definitions and preliminary results of existing models and algorithms. Section 3 details the formulation of the iTableSMD algorithm, while Section 4 is dedicated to proving its convergence and analyzing its rate of convergence. In Section 5, we compare the performance of the iTableSMD algorithm with several baselines using synthetic and real-world datasets. We conclude the paper in Section 6.

2 Preliminaries

Definition 1.

(Bregman Divergence): Given a strongly convex function $\psi(\cdot):\operatorname{dom}\psi\,(\subseteq\mathbb{R}^{n})\rightarrow\mathbb{R}$, the Bregman distance between $x\in\operatorname{dom}\psi$ and $y\in\operatorname{int}\operatorname{dom}\psi$ is

$$D_{\psi}(x,y):=\psi(x)-\psi(y)-\langle\nabla\psi(y),x-y\rangle. \tag{2}$$

It measures the proximity of $x$ and $y$. Indeed, $\psi$ is convex if and only if $D_{\psi}(x,y)\geq 0$ for any $x\in\operatorname{dom}\psi$, $y\in\operatorname{int}\operatorname{dom}\psi$ due to the gradient inequality.

Remark 1.

The Bregman divergence was originally defined in [63] using a Legendre function $\psi$. Here, we consider the case where $\psi$ is a strongly convex function; see more details in [2]. It should be noted that

$$D_{\psi}(x,y)\geq\frac{\sigma}{2}\|x-y\|^{2},\quad\forall\,x\in\operatorname{dom}\psi,\ y\in\operatorname{int}\operatorname{dom}\psi, \tag{3}$$

where $\sigma$ is the strong convexity parameter of $\psi$. Consequently, $D_{\psi}(x,y)=0$ if and only if $x=y$. In addition, $D_{\psi}(x,y)$ can be defined in a coordinate-wise form as $D_{\psi}(x,y)=\sum_{i=1}^{I_n}\sum_{j=1}^{R}\left[\psi(x_{ij})-\psi(y_{ij})-\langle\nabla\psi(y_{ij}),x_{ij}-y_{ij}\rangle\right]$.
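As a small illustration of (2), the following Python sketch evaluates the Bregman distance for two kernels. The kernel choices and the names `bregman`, `sq`, and `ent` are assumptions made for this example only and are not taken from the paper. With $\psi(x)=\frac{1}{2}\|x\|^2$ the distance reduces to $\frac{1}{2}\|x-y\|^2$, and with the Boltzmann-Shannon entropy $\psi(x)=\sum_{ij}x_{ij}\log x_{ij}$ it reduces to the generalized KL divergence. Note that the entropy kernel is strongly convex only on bounded subsets of the positive orthant, so (3) is not claimed for it globally.

```python
import numpy as np

def bregman(psi, grad_psi, x, y):
    """D_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>, as in (2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return psi(x) - psi(y) - np.sum(grad_psi(y) * (x - y))

# Kernel 1 (assumed for illustration): psi(x) = 0.5 * ||x||^2, sigma = 1.
sq = lambda x: 0.5 * np.sum(x ** 2)
sq_grad = lambda x: x

# Kernel 2 (assumed for illustration): Boltzmann-Shannon entropy on x > 0.
ent = lambda x: np.sum(x * np.log(x))
ent_grad = lambda x: np.log(x) + 1.0

# Two small positive matrices playing the role of factor iterates.
x = np.array([[0.2, 1.5], [3.0, 0.7]])
y = np.array([[0.4, 1.0], [2.0, 0.9]])

d_sq = bregman(sq, sq_grad, x, y)      # equals 0.5 * ||x - y||^2
d_ent = bregman(ent, ent_grad, x, y)   # equals the generalized KL divergence
gen_kl = np.sum(x * np.log(x / y) - x + y)

print(d_sq, d_ent, gen_kl)
```

Because both kernels are separable, the same function also realizes the coordinate-wise form above: the sum over entries is taken inside `psi` and in the inner product.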

Definition 2.

([39] $(\bar{L},\underline{L})$-smooth adaptable) Given $\psi$, let $f:\mathcal{X}\rightarrow(-\infty,+\infty]$ be a proper and lower semi-continuous function with $\operatorname{dom}\psi\subset\operatorname{dom}f$, which is continuously differentiable. We say $(f,\psi)$ is $(\bar{L},\underline{L})$-smooth adaptable on $C$ if there exist $\bar{L}>0$ and $\underline{L}\geq 0$ such that for any $x,y$,

$$f(x)-f(y)-\langle\nabla f(y),x-y\rangle\leq\bar{L}D_{\psi}(x,y), \tag{4}$$

and

$$-\underline{L}D_{\psi}(x,y)\leq f(x)-f(y)-\langle\nabla f(y),x-y\rangle. \tag{5}$$

If $\underline{L}=\bar{L}$, it recovers [5, Definition 2.2]. If $f$ is convex and $\underline{L}=0$, this definition recovers [2, Lemma 1] and [38, Definition 1.1].
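To make inequalities (4) and (5) concrete, the sketch below checks them numerically for a scalar Poisson-type loss $f(m)=m-x\log m$ paired with the Burg entropy kernel $\psi(m)=-\log m$. Both the pairing and the value of `x_obs` are assumptions chosen for illustration and are not necessarily the kernel used in the paper; the Burg kernel is strongly convex only on bounded subsets of $(0,\infty)$. For this pair, $f(a)-f(b)-f'(b)(a-b)=x\,D_{\psi}(a,b)$ holds exactly, so (4) holds with $\bar{L}=x$ and (5) holds with $\underline{L}=0$, even though $f''(m)=x/m^{2}$ is unbounded near zero and $f'$ is therefore not Lipschitz on $(0,\infty)$.

```python
import numpy as np

x_obs = 3.0  # hypothetical observed count, chosen for illustration

f = lambda m: m - x_obs * np.log(m)   # Poisson-type loss in the model value m
df = lambda m: 1.0 - x_obs / m
psi = lambda m: -np.log(m)            # Burg entropy kernel (assumed)
dpsi = lambda m: -1.0 / m

def linearization_gap(g, dg, a, b):
    """g(a) - g(b) - g'(b) * (a - b); equals D_g(a, b) when g is convex."""
    return g(a) - g(b) - dg(b) * (a - b)

rng = np.random.default_rng(0)
a = rng.uniform(0.01, 10.0, size=1000)
b = rng.uniform(0.01, 10.0, size=1000)

gap_f = linearization_gap(f, df, a, b)
gap_psi = linearization_gap(psi, dpsi, a, b)

L_bar, L_under = x_obs, 0.0
tol = 1e-8 * (1.0 + np.abs(gap_psi))  # slack for floating-point rounding
upper_ok = np.all(gap_f <= L_bar * gap_psi + tol)     # inequality (4)
lower_ok = np.all(-L_under * gap_psi <= gap_f + tol)  # inequality (5)

# The curvature x / m^2 grows without bound as m -> 0.
curvature_near_zero = x_obs / 1e-3 ** 2
curvature_at_one = x_obs / 1.0 ** 2

print(upper_ok, lower_ok, curvature_near_zero, curvature_at_one)
```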

Definition 3.

([33] $\epsilon$-stationary point) Given $\epsilon>0$, a solution $\{x_1^{*},\dots,x_s^{*}\}$ is said to be an $\epsilon$-stationary point of the function $\Phi(x_1,\dots,x_s)$ if

$$\operatorname{dist}(0,\partial\Phi(x_1^{*},\dots,x_s^{*}))\leq\epsilon.$$

2.1 Generalized CP decomposition

Consider an $N$-th order tensor $\mathcal{X}\in\mathbb{R}^{I_1\times I_2\times\dots\times I_N}$, where $I_n$ represents the size of the $n$-th mode of $\mathcal{X}$. Such multi-way arrays arise in many applications. $\mathcal{X}(i_1,\dots,i_N)$ is an entry of $\mathcal{X}$. The entries of the data tensor $\mathcal{X}$ could be of various types, such as continuous numbers, non-negative integers, and binary values. A general problem of tensor CP decomposition is to approximate $\mathcal{X}$ using a low-rank tensor $\mathcal{M}$, defined by

$$\mathcal{M}=\sum_{r=1}^{R}A_1(:,r)\circ\dots\circ A_N(:,r), \tag{6}$$

where $\circ$ denotes the outer product of vectors and $A_n\in\mathbb{R}^{I_n\times R}$ is the unknown mode-$n$ latent factor matrix. $R$ is the smallest positive integer for which equation (6) is satisfied, and it is also known as the rank of $\mathcal{M}$.
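The construction in (6) can be sketched in a few lines of NumPy. The helper name `cp_to_tensor`, the tensor sizes, and the random factors are assumptions made for this illustration. The sketch builds $\mathcal{M}$ as a sum of $R$ rank-1 outer products and compares one entry against the entrywise formula $\sum_{r}\prod_{n}A_n(i_n,r)$.

```python
import numpy as np

def cp_to_tensor(factors):
    """Sum of R rank-1 tensors A_1(:, r) o ... o A_N(:, r), as in (6)."""
    R = factors[0].shape[1]
    shape = tuple(A.shape[0] for A in factors)
    M = np.zeros(shape)
    for r in range(R):
        # Build the r-th rank-1 tensor by successive outer products.
        rank_one = factors[0][:, r]
        for A in factors[1:]:
            rank_one = np.multiply.outer(rank_one, A[:, r])
        M += rank_one
    return M

rng = np.random.default_rng(0)
dims, R = (4, 3, 5), 2                      # hypothetical sizes I_1, I_2, I_3
factors = [rng.random((I_n, R)) for I_n in dims]
M = cp_to_tensor(factors)

# Entrywise form: M(i_1, ..., i_N) = sum_r prod_n A_n(i_n, r).
i = (1, 2, 3)
entry = sum(np.prod([A[i_n, r] for A, i_n in zip(factors, i)]) for r in range(R))

print(M.shape, M[i], entry)
```

For large tensors one would avoid forming $\mathcal{M}$ explicitly; the dense construction here is only meant to make the definition tangible.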

Let $I=(I_1I_2\dots I_N)^{\frac{1}{N}}$ be the geometric mean of the dimensions. An $N$-dimensional integer vector $i$ is used to represent the entry coordinate, i.e.,

$$i\in\mathcal{I}\triangleq\left\{(i_1,i_2,\ldots,i_N)\mid i_n=1,2,\ldots,I_n,\ \forall n\right\}.$$

The generalized CP (GCP) decomposition problem can be formulated as the following optimization task, where the primary objective is to minimize a data-adaptive loss function denoted by $f(\cdot,\cdot):\mathbb{R}\times\mathbb{R}\mapsto\mathbb{R}$,

$$\begin{aligned}\min_{A_1,A_2,\ldots,A_N}&\quad\frac{1}{I^{N}}\sum_{i\in\mathcal{I}}f\left(\underline{\mathcal{M}}_i;\underline{\mathcal{X}}_i\right)+\sum_{n=1}^{N}h_n\left(A_n\right)\\ \text{s.t.}&\quad\underline{\mathcal{M}}_i=\sum_{r=1}^{R}\prod_{n=1}^{N}A_n\left(i_n,r\right),\quad\forall\,i\in\mathcal{I},\end{aligned} \tag{7}$$

where the entries $\underline{\mathcal{X}}_i$ and $\underline{\mathcal{M}}_i$ correspond to the elements of $\mathcal{X}$ and $\mathcal{M}$ indexed by $i$, respectively. $h_n(A_n)$ is a structure-promoting regularizer that captures the prior information about the latent factors $A_n$, such as column-wise orthogonality [32], Tikhonov regularization [44], iterated Tikhonov regularization [40], and nonnegativity [36]. Such regularization can result in well-posed problems. For instance, if the constraint $A_n\in\mathbb{R}^{I_n\times R}_{+}:=\{A_n\mid A_n\geq 0\}$ is imposed, we can write $h_n(\cdot)$ as the indicator function:

$$h_n(A)=\mathcal{I}_{\mathbb{R}_{+}^{I_n\times R}}(A)=\begin{cases}0,&A\geq 0,\\ \infty,&\text{otherwise.}\end{cases}$$

Lim and Comon [36] showed that the nonnegative CP decomposition always has optimal solutions. By selecting appropriate loss functions $f$, problem (7) becomes adaptable to handle diverse data types, including continuous, count, and binary data. To illustrate, we present several representative motivating examples.
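As a rough sketch of how the objective in (7) could be evaluated, the Python snippet below handles a third-order tensor with the nonnegativity indicator as $h_n$. The function names, the restriction to $N=3$, the dense evaluation, and the squared loss are all assumptions made for this illustration; they are not the paper's implementation. Since $I^{N}=I_1I_2\cdots I_N$ is the number of entries, the factor $\frac{1}{I^{N}}\sum_{i\in\mathcal{I}}$ is simply the mean of the entrywise losses.

```python
import numpy as np

def nonneg_indicator(A):
    """Indicator of the nonnegative orthant: 0 if A >= 0 entrywise, +inf otherwise."""
    return 0.0 if np.all(A >= 0) else np.inf

def gcp_objective(factors, X, loss, regs):
    """Objective of (7) for a third-order tensor (illustrative, dense evaluation)."""
    A1, A2, A3 = factors
    # M_i = sum_r prod_n A_n(i_n, r) for every index i = (i_1, i_2, i_3).
    M = np.einsum('ir,jr,kr->ijk', A1, A2, A3)
    fit = loss(M, X).mean()  # (1 / I^N) * sum over all entries
    penalty = sum(h(A) for h, A in zip(regs, factors))
    return fit + penalty

squared_loss = lambda m, x: 0.5 * (x - m) ** 2  # one possible choice of f

rng = np.random.default_rng(1)
true_factors = [rng.random((I_n, 2)) for I_n in (4, 3, 5)]
X = np.einsum('ir,jr,kr->ijk', *true_factors)
regs = [nonneg_indicator] * 3

obj_true = gcp_objective(true_factors, X, squared_loss, regs)  # exact fit
perturbed = [A + 0.1 for A in true_factors]
obj_pert = gcp_objective(perturbed, X, squared_loss, regs)     # feasible, worse fit
infeasible = [-true_factors[0]] + true_factors[1:]
obj_inf = gcp_objective(infeasible, X, squared_loss, regs)     # violates A_1 >= 0

print(obj_true, obj_pert, obj_inf)
```

Swapping `squared_loss` for another entrywise function changes the data-fitting term without touching the rest of the objective, which is the flexibility that distinguishes GCP from the conventional CP formulation.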

The difference between GCP and the conventional CP formulation lies in the flexibility in the selection of loss functions. In this section, we provide alternative loss functions by examining the statistical likelihood of a model for a specific data tensor. We assume a parameterized probability density function (PDF) or probability mass function (PMF) is available, offering the likelihood estimation for each entry, i.e.,

xip(xiθi), where (θi)=mi.formulae-sequencesimilar-tosubscript𝑥𝑖𝑝conditionalsubscript𝑥𝑖subscript𝜃𝑖 where subscript𝜃𝑖subscript𝑚𝑖x_{i}\sim p\left(x_{i}\mid\theta_{i}\right),\quad\text{ where }\quad\ell\left(% \theta_{i}\right)=m_{i}.italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∼ italic_p ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , where roman_ℓ ( italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT .

Here xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT represents an observation of a random variable, while ()\ell(\cdot)roman_ℓ ( ⋅ ) denotes an invertible link function connecting the model parameter misubscript𝑚𝑖m_{i}italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT with the natural parameter of the distribution, θisubscript𝜃𝑖\theta_{i}italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. The link function is commonly assumed to be the identity function or related to the expectation of distribution.

We aim to find the model \mathcal{M}caligraphic_M that is the maximum likelihood estimate (MLE) across all entries. By assuming conditional independence of observations, the overall likelihood is simplified to the product of individual likelihoods. Hence, the MLE is the solution of the following optimization problem:

maxL(;𝒳)iΩp(xi1(mi)),iΩ.formulae-sequencesubscript𝐿𝒳subscriptproduct𝑖Ω𝑝conditionalsubscript𝑥𝑖superscript1subscript𝑚𝑖for-all𝑖Ω\max_{\mathcal{M}}\ L(\mathcal{M};\mathcal{X})\equiv\prod_{i\in\Omega}p\left(x% _{i}\mid\ell^{-1}(m_{i})\right),\quad\forall\,i\in\Omega.roman_max start_POSTSUBSCRIPT caligraphic_M end_POSTSUBSCRIPT italic_L ( caligraphic_M ; caligraphic_X ) ≡ ∏ start_POSTSUBSCRIPT italic_i ∈ roman_Ω end_POSTSUBSCRIPT italic_p ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ roman_ℓ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) , ∀ italic_i ∈ roman_Ω .

Here, ={mi}iΩsubscriptsubscript𝑚𝑖𝑖Ω\mathcal{M}=\{m_{i}\}_{i\in\Omega}caligraphic_M = { italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i ∈ roman_Ω end_POSTSUBSCRIPT. Then, we employ the negative logarithm to transform the product into a summation and convert it to a minimization problem. As the logarithm is a monotonic function, it preserves the maximizer.

$$\min_{\mathcal{M}}\ F(\mathcal{M};\mathcal{X}) \equiv \sum_{i\in\Omega} f_0\left(m_i; x_i\right), \quad \text{where} \quad f_0(m_i; x_i) \equiv -\log p\left(x_i \mid \ell^{-1}(m_i)\right). \tag{8}$$
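As a worked instance of (8), consider count data under the assumption that $p$ is the standard Poisson mass function with mean $m$ and identity link, $\ell^{-1}(m) = m$. This is only an illustrative derivation; the symbols follow the text above.

```latex
-\log p(x \mid m)
  = -\log\!\left(\frac{e^{-m} m^{x}}{x!}\right)
  = m - x\log(m) + \log(x!).
```

The term $\log(x!)$ does not depend on $m$, so it can be dropped without changing the minimizer. What remains is $m - x\log(m)$, which agrees with the Poisson entry of Table 2.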
Table 2: Generalized loss function for different data types.

| Data type | Distribution | Link function | Loss function | Constraints |
|---|---|---|---|---|
| Continuous: $x \in \mathbb{R}$ | Gaussian | $\ell^{-1}(m) = m$ | $\frac{1}{2}(x-m)^2$ | $A_n \in \mathbb{R}^{I_n \times R}$ |
| | Gamma | $\ell^{-1}(m) = m$ | $\frac{x}{m} + \log(m)$ | $A_n \in \mathbb{R}^{I_n \times R}_{+}$ |
| Count: $x \in \mathbb{N}$ | Poisson | $\ell^{-1}(m) = m$ | $m - x\log(m)$ | $A_n \in \mathbb{R}^{I_n \times R}_{+}$ |
| | | $\ell^{-1}(m) = e^{m}$ | $e^{m} - xm$ | $A_n \in \mathbb{R}^{I_n \times R}$ |
| Binary: $x \in \{0,1\}$ | Bernoulli | $\ell^{-1}(m) = \frac{m}{1+m}$ | $\log(m+1) - x\log(m)$ | $A_n \in \mathbb{R}^{I_n \times R}_{+}$ |
| | | $\ell^{-1}(m) = \frac{e^{m}}{1+e^{m}}$ | $\log\left(1+e^{m}\right) - xm$ | $A_n \in \mathbb{R}^{I_n \times R}$ |

Table 2 presents commonly utilized generalized loss functions $f(x,m)$, associated link functions, and various distributions.
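The elementwise losses of Table 2 can be written down directly. The following is a minimal sketch, assuming NumPy arrays for the data and the model; the dictionary keys and the helper `total_loss` are hypothetical names introduced only for illustration.

```python
import numpy as np

# Elementwise losses f(x, m) from Table 2 (keys are illustrative names).
LOSSES = {
    "gaussian": lambda x, m: 0.5 * (x - m) ** 2,
    "gamma": lambda x, m: x / m + np.log(m),                 # needs m > 0
    "poisson": lambda x, m: m - x * np.log(m),               # needs m > 0
    "poisson_log": lambda x, m: np.exp(m) - x * m,           # m unconstrained
    "bernoulli_odds": lambda x, m: np.log(m + 1.0) - x * np.log(m),  # m > 0
    "bernoulli_logit": lambda x, m: np.logaddexp(0.0, m) - x * m,
}


def total_loss(name, X, M):
    """Sum the chosen elementwise loss over all entries of X and M."""
    return float(np.sum(LOSSES[name](X, M)))
```

The two Poisson rows (and likewise the two Bernoulli rows) describe the same loss under a change of variable: evaluating the log-link form at $\log(m)$ gives the same value as the identity-link form at $m$. The log-link forms trade the nonnegativity constraint on the factors for an unconstrained parameterization.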

2.2 Stochastic methods for GCP decomposition

We can modify the element-wise regularized problem (7) to a block-wise regularized problem

$$\min_{\{A_n\}_{n=1}^{N}}\ \Phi(A_1,\dots,A_N) := f(A_1,\dots,A_N) + \sum_{n=1}^{N} h_n(A_n), \tag{9}$$

where $f = f_0([\![A_1, A_2, \cdots, A_N]\!]_i; x_i)$, $m_i = \sum_{r=1}^{R}\prod_{n=1}^{N} A_n(i_n, r)$, $\forall\, i \in \mathcal{I}$, and $f: \prod_{n=1}^{N}\mathbb{R}^{I_n\times R} \rightarrow \mathbb{R}$ is finite-valued and differentiable. For brevity, let $\mathcal{V}(\Phi) = \min_{A_n}\Phi(A_1,\dots,A_N)$.

We present the stochastic gradient with respect to the factor $A_n$ ($n = 1,\dots,N$), denoted by $\tilde{\nabla}_{A_n} f(A_1,\dots,A_N)$. Suppose a mode index $n$ is sampled from $1$ to $N$. As an example, we take the squared error loss function, namely

$$f(A_1,\dots,A_N) = \frac{1}{2I^{N}}\left\|[\![A_1, A_2, \cdots, A_N]\!] - \mathcal{X}\right\|_F^2 = \frac{1}{2I^{N}}\left\|H_n A_n^{\top} - X_{(n)}\right\|_F^2,$$

where $H_n = A_N \odot \cdots \odot A_{n+1} \odot A_{n-1} \odot \cdots \odot A_1 \in \mathbb{R}^{J_n\times R}$, $\odot$ denotes the Khatri-Rao product (the column-wise Kronecker product), $X_{(n)} \in \mathbb{R}^{J_n\times I_n}$ is the mode-$n$ matricization, and $J_n = I^{N}/I_n$.

Then, we rewrite the full gradient as follows

$$\begin{aligned} \nabla_{A_n} f(A_1,\dots,A_N) &= \frac{1}{I^{N}}\sum_{i=1}^{J_n}\left(A_n H_n(i,:)^{\top}H_n(i,:) - X_{(n)}(i,:)^{\top}H_n(i,:)\right) \\ &= \frac{1}{I_n}\left(A_n\,\mathbb{E}_i\left[H_n(i,:)^{\top}H_n(i,:)\right] - \mathbb{E}_i\left[X_{(n)}(i,:)^{\top}H_n(i,:)\right]\right). \end{aligned} \tag{10}$$

Here, $H(i,:)$ denotes the $i$-th row of $H$, and $\mathbb{E}_i$ is the expectation over the index $i$. To alleviate the burden of computing the full gradient (10), we randomly sample a set of mode-$n$ fibers indexed by $\mathcal{F}_n \subset \{1,\dots,J_n\}$ with $|\mathcal{F}_n| = B$. Note that a mode-$n$ fiber of $\mathcal{X}$ is a row of the mode-$n$ unfolding $X_{(n)}$. Compared with the fiber sampling-based method in [1], our requirement on the batch size $B$ is much lower. Hence, it admits lower per-iteration memory and computational complexities, especially when the rank is high.

Let $\tilde{\nabla}_{A_n} f(A_1,\dots,A_N) \in \mathbb{R}^{I_n\times R}$ be the stochastic gradient of $f(A_1,\dots,A_N)$ with respect to $A_n$. We have

$$\tilde{\nabla}_{A_n} f(A_1,\dots,A_N) = \frac{1}{I_n\left|\mathcal{F}_n\right|}\left(A_n H_n^{\top}\left(\mathcal{F}_n\right)H_n\left(\mathcal{F}_n\right) - X_n^{\top}\left(\mathcal{F}_n\right)H_n\left(\mathcal{F}_n\right)\right), \tag{11}$$

where

$$X_n\left(\mathcal{F}_n\right) = X_n\left(\mathcal{F}_n, :\right), \quad H_n\left(\mathcal{F}_n\right) = H_n\left(\mathcal{F}_n, :\right).$$
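The fiber-sampled estimator (11) can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, $I^N$ is taken to be the total number of tensor entries, and the unfolding uses NumPy's row-major ordering, so the Khatri-Rao factors are multiplied in increasing mode order rather than in the reversed order written above. For readability the sketch forms all of $H_n$ and then slices it; a practical implementation would build only the sampled rows.

```python
import numpy as np


def khatri_rao(mats):
    """Column-wise Kronecker product of matrices that share R columns."""
    out = mats[0]
    for A in mats[1:]:
        # (J, R) and (I, R) -> (J * I, R); earlier factors vary slowest.
        out = (out[:, None, :] * A[None, :, :]).reshape(-1, A.shape[1])
    return out


def unfold(X, n):
    """Mode-n unfolding of shape (J_n, I_n); each row is one mode-n fiber."""
    return np.moveaxis(X, n, -1).reshape(-1, X.shape[n])


def full_grad(X, factors, n):
    """Gradient of 1/(2 I^N) * ||H_n A_n^T - X_(n)||_F^2 w.r.t. A_n."""
    H = khatri_rao([A for m, A in enumerate(factors) if m != n])
    Xn = unfold(X, n)
    return (factors[n] @ (H.T @ H) - Xn.T @ H) / X.size


def stochastic_grad(X, factors, n, fibers):
    """Estimate in the form of (11), using only the rows listed in fibers."""
    H = khatri_rao([A for m, A in enumerate(factors) if m != n])[fibers]
    Xn = unfold(X, n)[fibers]
    return (factors[n] @ (H.T @ H) - Xn.T @ H) / (X.shape[n] * len(fibers))
```

When `fibers` lists every row index, the scaling $1/(I_n |\mathcal{F}_n|)$ equals $1/I^N$ and the estimate coincides with the full gradient, which is a convenient sanity check for the normalization.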

2.3 Stochastic mirror descent

In this subsection, we introduce some basics of the optimization methods used in this paper, namely stochastic mirror descent (SMD) and the inertial framework.

Consider the special case of the optimization problem (1) with a single block, $s = 1$:

$$\min_{x\in\mathbb{R}^{d}}\ \Phi(x) = f(x) + h(x),$$

where the component function $f$ is a continuously differentiable nonconvex function, and $h$ is an extended-valued function that is bounded from below. The update of SMD [37, 41, 64] is given as follows,

$$x^{k+1} \in \underset{x}{\arg\min}\ \ h(x) + \langle\tilde{\nabla} f(x^{k}), x - x^{k}\rangle + \frac{1}{\eta^{k}} D_{\psi}(x, x^{k}),$$

where the stochastic gradient can be chosen as the mini-batch version $\tilde{\nabla} f(x^{k}) = \frac{1}{|B^{k}|}\sum_{i\in B^{k}}\nabla f_i(x^{k})$. Here, the mini-batch $B^{k}$ is chosen uniformly at random from all subsets of $\{1,\dots,T\}$, with the batch size $|B^{k}| = B$ being considerably smaller than $T$. Compared with SGD, SMD replaces the quadratic term $\left\|A_n - A_n^{k}\right\|_F^2$ with the Bregman distance $D_{\psi}(A_n, A_n^{k})$. When $\psi$ is properly designed, SMD can exploit the geometry of the problem and achieve significant efficiency enhancements compared to SGD, particularly when utilizing generalized loss functions. An extensive literature on MD and SMD in optimization is available [5, 34, 14, 64, 35].
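To make the role of $\psi$ concrete, the sketch below performs one mirror step with $h = 0$ and the kernel $\psi(x) = \sum_j (x_j \log x_j - x_j)$ on the positive orthant. This kernel is an assumption chosen only because its update has a closed form; it is not claimed to be the kernel used in this paper. The function names are hypothetical.

```python
import numpy as np


def smd_step_entropy(x, grad, eta):
    """One mirror step with psi(x) = sum(x*log(x) - x) and h = 0.

    The optimality condition grad + (log(x_new) - log(x)) / eta = 0
    yields the multiplicative update below, which keeps x positive.
    """
    return x * np.exp(-eta * grad)


def minibatch_grad(grad_i, x, batch):
    """Average the per-sample gradients grad_i(i, x) over the batch."""
    return np.mean([grad_i(i, x) for i in batch], axis=0)
```

With the Euclidean kernel $\psi(x) = \frac{1}{2}\|x\|^2$ the same optimality condition gives $x^{k+1} = x^{k} - \eta^{k}\tilde{\nabla} f(x^{k})$, that is, the plain SGD step. The entropy-type kernel instead produces a multiplicative update, so positivity of the iterates is preserved without a projection.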

As usual in the analysis of Bregman-based schemes, the following simple but remarkable three-point identity for $D_{\psi}$, which follows from elementary algebra, is very useful. Given any $x \in \operatorname{dom}\psi$ and $y, z \in \operatorname{int}\operatorname{dom}\psi$, the three-point identity is

$$D_{\psi}(x, z) = D_{\psi}(x, y) + D_{\psi}(y, z) + \langle\nabla\psi(y) - \nabla\psi(z), x - y\rangle. \tag{12}$$
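Identity (12) is easy to check numerically. The sketch below does so for one illustrative kernel, $\psi(x) = \sum_j (x_j\log x_j - x_j)$ on positive vectors; the kernel choice and all function names are assumptions made for this example only.

```python
import numpy as np


def psi(x):
    """Illustrative kernel psi(x) = sum(x*log(x) - x), defined for x > 0."""
    return float(np.sum(x * np.log(x) - x))


def grad_psi(x):
    """Gradient of the illustrative kernel."""
    return np.log(x)


def bregman(x, y):
    """D_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>."""
    return psi(x) - psi(y) - float(grad_psi(y) @ (x - y))


def three_point_gap(x, y, z):
    """Left side minus right side of (12); zero up to rounding error."""
    cross = float((grad_psi(y) - grad_psi(z)) @ (x - y))
    return bregman(x, z) - (bregman(x, y) + bregman(y, z) + cross)
```

The identity holds for any differentiable $\psi$ because the $\psi(y)$ terms cancel and the remaining inner products regroup; the numerical gap is therefore only floating-point rounding.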

For the multi-block problem, Pu et al. [49] develop a unified stochastic mirror descent algorithmic framework (SmartCPD) for large-scale CPD under various non-Euclidean losses. CPD is a special case of the multi-block problem, and SmartCPD updates the factor variables by

$$\begin{aligned} A_n^{k+1} &= \arg\min_{A_n}\ \ h_n\left(A_n\right) + \langle\tilde{\nabla}_{A_n} f(A_1^{k},\dots,A_N^{k}), A_n - A_n^{k}\rangle + \frac{1}{\eta^{k}} D_{\psi}(A_n, A_n^{k}), \\ A_{n'}^{k+1} &= A_{n'}^{k}, \quad n' \neq n. \end{aligned} \tag{13}$$

However, directly employing stochastic mirror descent for the GCP problem may not yield the most effective results. In this paper, we study stochastic gradients under variance-reduced stochastic gradient estimators, such as SAGA [15] and SARAH [43]. Furthermore, we apply the inertial acceleration framework, which is given by

$$\begin{aligned} \bar{x}^{k} &= x^{k} + \alpha^{k}(x^{k} - x^{k-1}), \\ \hat{x}^{k} &= x^{k} + \beta^{k}(x^{k} - x^{k-1}), \\ x^{k+1} &= \hat{x}^{k} - \eta^{k}\nabla f(\bar{x}^{k}), \end{aligned}$$

where $\alpha^{k}, \beta^{k} \in [0,1]$ are two inertial parameters. For example, if $\alpha^{k} = \beta^{k} = 0$, the scheme reduces to the gradient descent method; if $\alpha^{k} = 0$, it reduces to the heavy-ball method [48]; if $\alpha^{k} = \beta^{k}$, it reduces to the Nesterov accelerated gradient method [42].
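The three-line inertial scheme can be sketched as a short loop. This is a minimal illustration with constant parameters and an exact gradient oracle, both of which are simplifying assumptions; the function name `inertial_gd` is hypothetical.

```python
import numpy as np


def inertial_gd(grad, x0, eta, alpha, beta, iters):
    """Inertial gradient scheme with constant eta, alpha and beta."""
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        d = x - x_prev                 # momentum direction x^k - x^{k-1}
        x_bar = x + alpha * d          # point where the gradient is taken
        x_hat = x + beta * d           # point from which the step is taken
        x_prev, x = x, x_hat - eta * grad(x_bar)
    return x
```

Calling it with `alpha=0, beta=0` gives gradient descent, `alpha=0, beta>0` gives the heavy-ball method, and `alpha=beta` gives a Nesterov-type step, matching the three special cases listed above. Separating the two extrapolation points is what lets a single routine cover all of them.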

3 Inertial accelerated block-randomized SMD

In this section, we propose an inertial accelerated block-randomized stochastic mirror descent algorithm (iTableSMD) for GCP decomposition (9). Before presenting the algorithm framework of iTableSMD, we make the following assumptions throughout the paper.

Assumption 1.

We assume that the following conditions hold:

  • (i) $h_n: \mathbb{R}^{I_n\times R} \rightarrow \mathbb{R}\cup\{+\infty\}$ ($n = 1, 2, \dots, N$) are proper lower semi-continuous (l.s.c.) functions that are bounded from below. There exists $\alpha \in \mathbb{R}_{+}$ such that $h(\cdot) + \frac{\alpha}{2}\|\cdot\|^{2}$ is convex.

  • (ii) $\psi: \mathbb{R}^{I_n\times R} \rightarrow \mathbb{R}$ is continuously differentiable and $\sigma$-strongly convex. Let $\sigma = 1$ for simplicity.

  • (iii) $\nabla\psi$ is Lipschitz continuous with modulus $M_2 > 0$. For any two points $A_i, \hat{A}_i \in \mathbb{R}^{I_n\times R}$, it holds that

    $$\|\nabla\psi(A_i) - \nabla\psi(\hat{A}_i)\| \leq M_2\|A_i - \hat{A}_i\|, \quad i = 1, 2, \dots, N.$$

  • (iv) $f: \prod_{n=1}^{N}\mathbb{R}^{I_n\times R} \rightarrow \mathbb{R}$ is a proper and lower semi-continuous function with $\operatorname{dom}\psi \subset \operatorname{dom} f$.

  • (v) The pair of functions $(f, \psi)$ is $(\bar{L}, \underline{L})$-smooth adaptable.

  • (vi) The function $\Phi$ is bounded from below, i.e., there exists a finite optimal objective value $\mathcal{V}(\Phi)$.

Algorithm 1 iTableSMD: inertial accelerated block-randomized stochastic mirror descent for the optimization problem (9)

Input: an $N$-way tensor $\mathcal{X} \in \mathbb{R}^{I_1\times\dots\times I_N}$; the rank $R$; the sample size $B$; initialization $\{A_n^{-1}\}_{n=1}^{N}, \{A_n^{0}\}_{n=1}^{N}$; stepsize $\{\eta_k\}_{k\geq 0}$; inertial parameters $\{\alpha^{k}\}_{k\geq 0}, \{\beta^{k}\}_{k\geq 0} \in [0,1]$; two constants $\delta, \epsilon$ with $1 > \delta > \epsilon > 0$.

1: $k\leftarrow 0$;
2: repeat
3:     sample $n$ uniformly from $\{1,\dots,N\}$.
4:     sample $\mathcal{F}_n$ uniformly from $\{1,\dots,J_n\}$ with $|\mathcal{F}_n|=B$.
5:     compute $\tilde{A}_n^k = A_n^k+\alpha^k(A_n^k-A_n^{k-1})\in\mathrm{int}\,\mathrm{dom}\,\psi$, where $\alpha^k\in[0,1)$.
6:     compute an extrapolation parameter $\beta^k$ such that
$$D_\psi(A_n^k,\underline{A}_n^k)\leq\frac{\delta-\epsilon}{1+\underline{L}\eta_{k-1}}D_\psi(A_n^{k-1},A_n^k), \tag{14}$$
         where $\underline{A}_n^k = A_n^k+\beta^k(A_n^k-A_n^{k-1})\in\mathrm{int}\,\mathrm{dom}\,\psi$.
7:     compute the stochastic gradient $\tilde{\nabla}_{A_n}f(A_1^k\cdots\underline{A}_n^k\cdots A_N^k)$ with the batch size of fibers $B$.
8:     set $\eta_k\leq\min\{\eta_{k-1},\bar{L}^{-1}\}$ and update $A_n^{k+1}$ and $A_{n'}^{k+1}$:
$$A_n^{k+1} = \arg\min_{A_n}\ h_n(A_n)+\langle\tilde{\nabla}_{A_n}f(A_1^k\cdots\underline{A}_n^k\cdots A_N^k),A_n-\tilde{A}_n^k\rangle+\frac{1}{\eta_k}D_\psi(A_n,\tilde{A}_n^k), \tag{15}$$
$$A_{n'}^{k+1} = A_{n'}^{k},\quad\forall n'\neq n.$$
9:     $k\leftarrow k+1$;
10: until some stopping criterion is reached;

Output: $\{A_n^k\}_{n=1}^{N}$.
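
Steps 5 to 8 of Algorithm 1 can be illustrated for a single sampled block with the following minimal sketch. It is not the authors' implementation and rests on several assumptions that are not in the text above: the kernel is the entropy $\psi(X)=\sum_{ij}(X_{ij}\log X_{ij}-X_{ij})$ on the positive orthant, the regularizer is $h_n=0$ (so that (15) has a closed form), $\beta^k$ is found by simple backtracking, and `grad_fn`, `shrink`, `max_tries` and all default values are hypothetical placeholders.

```python
import numpy as np

def bregman_entropy(X, Y):
    """Bregman distance D_psi(X, Y) for psi(X) = sum(X*log(X) - X), with X, Y > 0."""
    return float(np.sum(X * np.log(X / Y) - X + Y))

def choose_beta(A_curr, A_prev, beta0, delta, eps, L_low, eta_prev,
                shrink=0.5, max_tries=50):
    """Shrink beta until the extrapolated point is positive and condition (14) holds."""
    bound = (delta - eps) / (1.0 + L_low * eta_prev) * bregman_entropy(A_prev, A_curr)
    beta = beta0
    for _ in range(max_tries):
        A_under = A_curr + beta * (A_curr - A_prev)
        # positivity is checked first so the logarithm is never taken of a non-positive entry
        if np.all(A_under > 0) and bregman_entropy(A_curr, A_under) <= bound:
            return beta, A_under
        beta *= shrink
    return 0.0, A_curr.copy()  # beta = 0 satisfies (14) because D_psi(A, A) = 0

def mirror_step(A_tilde, grad, eta):
    """Closed form of (15) for the entropy kernel when h_n = 0 (assumption)."""
    return A_tilde * np.exp(-eta * grad)

def block_update(A_curr, A_prev, grad_fn, alpha, beta0, eta, eta_prev,
                 delta=0.9, eps=0.1, L_low=1.0):
    """Illustrative version of steps 5-8 for the sampled block."""
    A_tilde = A_curr + alpha * (A_curr - A_prev)
    if not np.all(A_tilde > 0):  # assumed fallback to keep A_tilde in int dom psi
        A_tilde = A_curr.copy()
    beta, A_under = choose_beta(A_curr, A_prev, beta0, delta, eps, L_low, eta_prev)
    grad = grad_fn(A_under)      # stand-in for the stochastic gradient of step 7
    return mirror_step(A_tilde, grad, eta), beta
```

With this kernel the first-order condition of (15) reads $g+\frac{1}{\eta_k}(\log A_n^{k+1}-\log\tilde{A}_n^k)=0$, which is what `mirror_step` solves entrywise. Other kernels or a nonzero $h_n$ would generally require a different subproblem solver.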

Let $\xi^k$ and $\zeta^k$ be the stochastic parameters for the block index and the stochastic gradient, respectively. Denote $\mathbb{E}_k[\cdot]=\mathbb{E}[\cdot\,|\,\xi^k,\zeta^k]$ and $\mathbb{E}[\cdot]=\mathbb{E}[\cdot\,|\,\xi^0,\zeta^0,\dots]$.

Definition 4.

(Variance reduced stochastic gradient) We say a gradient estimator $\tilde{\nabla}_{A_n}f$ with $n=1,2,\dots,N$ is variance-reduced with constants $V_1,V_2,V_\Gamma\geq 0$ and $\tau\in(0,1]$ if it satisfies the following conditions:

  • (i)

    (MSE Bound): there exists a sequence of random variables $\{\Gamma_k\}_{k\geq 1}$ such that

    $$\begin{aligned}&\mathbb{E}_k\big[\|\tilde{\nabla}_{A_{\xi^k}}f(A_1^k,\cdots,A_{\xi^k-1}^k,\underline{A}_{\xi^k}^k,A_{\xi^k+1}^k,\cdots,A_N^k)-\nabla_{A_{\xi^k}}f(A_1^k,\cdots,A_{\xi^k-1}^k,\underline{A}_{\xi^k}^k,A_{\xi^k+1}^k,\cdots,A_N^k)\|_*^2\big]\\ \leq\ &\Gamma_k+V_1\big(\|A^k-A^{k-1}\|^2+\|A^{k-1}-A^{k-2}\|^2\big),\end{aligned} \tag{16}$$

    and random variables $\{\Upsilon_k\}_{k\geq 1}$ such that

    $$\begin{aligned}&\mathbb{E}_k\big[\|\tilde{\nabla}_{A_{\xi^k}}f(A_1^k,\cdots,A_{\xi^k-1}^k,\underline{A}_{\xi^k}^k,A_{\xi^k+1}^k,\cdots,A_N^k)-\nabla_{A_{\xi^k}}f(A_1^k,\cdots,A_{\xi^k-1}^k,\underline{A}_{\xi^k}^k,A_{\xi^k+1}^k,\cdots,A_N^k)\|_*\big]\\ \leq\ &\Upsilon_k+V_2\big(\|A^k-A^{k-1}\|+\|A^{k-1}-A^{k-2}\|\big).\end{aligned} \tag{17}$$
  • (ii)

    (Geometric Decay): The sequence $\{\Gamma_k\}_{k\geq 1}$ satisfies the following inequality in expectation:

    $$\mathbb{E}_k[\Gamma_{k+1}]\leq(1-\tau)\Gamma_k+V_\Gamma\big(\|A^k-A^{k-1}\|^2+\|A^{k-1}-A^{k-2}\|^2\big). \tag{18}$$
  • (iii)

    (Convergence of Estimator): For every sequence $\{A^k\}_{k=0}^{\infty}$ satisfying $\lim_{k\rightarrow\infty}\mathbb{E}\|A^k-A^{k-1}\|^2=0$, it follows that $\mathbb{E}\Gamma_k\rightarrow 0$ and $\mathbb{E}\Upsilon_k\rightarrow 0$.

In Proposition 1, we show that both the SAGA and SARAH estimators are variance-reduced stochastic gradients in the sense of Definition 4.
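
To make the idea of a variance-reduced estimator concrete, the following sketch implements the classical mini-batch SAGA estimator for a generic finite sum $f(x)=\frac{1}{n}\sum_{i=1}^{n}f_i(x)$. It is only an illustration: it is not the fiber-sampled estimator analyzed in Proposition 1, and the class name `SagaEstimator`, the callback `grad_i`, and the requirement that the batch contains no repeated indices are assumptions made for this example.

```python
import numpy as np

class SagaEstimator:
    """Mini-batch SAGA gradient estimator for f(x) = (1/n) * sum_i f_i(x)."""

    def __init__(self, grad_i, n, x0):
        self.grad_i = grad_i  # grad_i(i, x) returns the gradient of f_i at x
        # table of the most recently evaluated gradient of every component
        self.table = np.stack([grad_i(i, x0) for i in range(n)])
        self.table_mean = self.table.mean(axis=0)

    def estimate(self, x, batch):
        """Return the SAGA estimate at x, then refresh the stored gradients.

        `batch` must contain distinct indices (sampling without replacement).
        """
        batch = np.asarray(batch)
        fresh = np.stack([self.grad_i(i, x) for i in batch])
        diff = fresh - self.table[batch]
        g = diff.mean(axis=0) + self.table_mean
        # keep the running mean consistent with the overwritten table rows
        self.table_mean = self.table_mean + diff.sum(axis=0) / len(self.table)
        self.table[batch] = fresh
        return g
```

Averaged over a uniformly sampled index, the correction term cancels the stored mean, so the estimate is unbiased for the full gradient. The stored table is what lets the estimation error shrink as consecutive iterates get closer, which is the behaviour that conditions (16) to (18) quantify.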

4 Convergence analysis

This section establishes the convergence properties of the iTableSMD algorithm. We prove a sublinear convergence rate in the subsequential sense and further show that iTableSMD requires at most $\mathcal{O}(\varepsilon^{-2})$ iterations in expectation to attain an $\varepsilon$-stationary point. Additionally, we establish the global convergence of the generated sequence.

4.1 Subsequential convergence analysis

Next, we bound the expected decrease of $\Phi(A_1^{k+1},\cdots,A_N^{k+1})$ in the following lemma.

Lemma 1.

Suppose Assumption 1 is satisfied and $\tilde{\nabla}_{A_n}f$ with $n=1,2,\dots,N$ is variance-reduced in the sense of Definition 4. Let $\{A_n^k\}_{k>0}$ with $n\in\{1,\dots,N\}$ be the sequence generated by Algorithm 1. Then the following inequality holds for any $k>0$:

$$\begin{aligned}&\mathbb{E}_k\big[\Phi(A_1^{k+1},\cdots,A_N^{k+1})\big]+\frac{1}{2\bar{\gamma}\tau}\mathbb{E}_k[\Gamma_{k+1}]+\left(\frac{1}{\eta_k}-\alpha-\bar{\gamma}-\frac{\gamma_k}{\eta_k}\right)\mathbb{E}_k\big[D^N_\psi(A^k,A^{k+1})\big]\\ \leq\ &\Phi(A_1^k,\cdots,A_N^k)+\frac{1}{2\bar{\gamma}\tau}\Gamma_k+\left(\frac{\delta-\epsilon}{\eta_k}+\frac{\bar{\gamma}}{2}+\frac{M_2^2(\alpha_k-\beta_k)^2}{\eta_k\gamma_k}\right)D^N_\psi(A^{k-1},A^k)+\frac{\bar{\gamma}}{2}D^N_\psi(A^{k-2},A^{k-1}).\end{aligned}$$

Here, $\bar{\gamma}=\sqrt{2(V_\Gamma/\tau+V_1)}$, $\alpha$ is the weak convexity parameter in Assumption 1 (i), $\delta$ and $\epsilon$ are introduced in (14), and $V_1,V_2,V_\Gamma\geq 0$, $\tau\in(0,1]$ are the parameters in Definition 4.

Proof.

From the convexity of $h_n(\cdot)+\frac{\alpha}{2}\|\cdot\|^2$, we obtain the following inequality

$$h_n(A_n^{k+1})+\frac{\alpha}{2}\left\|A_n^{k+1}\right\|^2+\left\langle\xi_{k+1}+\alpha A_n^{k+1},A_n^k-A_n^{k+1}\right\rangle\leq h_n(A_n^k)+\frac{\alpha}{2}\left\|A_n^k\right\|^2, \tag{19}$$

where $\xi_{k+1}\in\partial h_n(A_n^{k+1})$. The optimality condition of (15) gives

$$\xi_{k+1}+\tilde{\nabla}_{A_n}f(A_1^k\cdots\underline{A}_n^k\cdots A_N^k)+\frac{1}{\eta_k}\big(\nabla\psi(A_n^{k+1})-\nabla\psi(\tilde{A}_n^k)\big)=0,$$

which, combined with (19), yields

$$\begin{aligned}h_n(A_n^{k+1})\leq\ &h_n(A_n^k)+\frac{\alpha}{2}\left\|A_n^{k+1}-A_n^k\right\|^2+\langle\tilde{\nabla}_{A_n}f(A_1^k\cdots\underline{A}_n^k\cdots A_N^k),A_n^k-A_n^{k+1}\rangle\\ &-\frac{1}{\eta_k}D_\psi(A_n^k,A_n^{k+1})-\frac{1}{\eta_k}D_\psi(A_n^{k+1},\tilde{A}_n^k)+\frac{1}{\eta_k}D_\psi(A_n^k,\tilde{A}_n^k).\end{aligned}$$
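
The passage from (19) to the last inequality uses two standard identities that are not spelled out in the text. The following worked step is a sketch of one way to fill them in; the symbols $x,y,z$ are introduced here only for illustration.

```latex
% Three-point identity for the Bregman distance
% (x in dom psi; y, z in int dom psi):
\[
D_\psi(x,z) = D_\psi(x,y) + D_\psi(y,z)
  + \langle \nabla\psi(y)-\nabla\psi(z),\, x-y \rangle .
\]
% Choosing x = A_n^k, y = A_n^{k+1}, z = \tilde{A}_n^k gives
\[
\langle \nabla\psi(A_n^{k+1})-\nabla\psi(\tilde{A}_n^k),\, A_n^k-A_n^{k+1} \rangle
 = D_\psi(A_n^k,\tilde{A}_n^k) - D_\psi(A_n^k,A_n^{k+1}) - D_\psi(A_n^{k+1},\tilde{A}_n^k).
\]
% The quadratic terms of (19) combine into a single squared norm:
\[
\frac{\alpha}{2}\|A_n^k\|^2 - \frac{\alpha}{2}\|A_n^{k+1}\|^2
 - \alpha\langle A_n^{k+1},\, A_n^k-A_n^{k+1} \rangle
 = \frac{\alpha}{2}\|A_n^{k+1}-A_n^k\|^2 .
\]
```

Substituting $\xi_{k+1}=-\tilde{\nabla}_{A_n}f(A_1^k\cdots\underline{A}_n^k\cdots A_N^k)-\frac{1}{\eta_k}\big(\nabla\psi(A_n^{k+1})-\nabla\psi(\tilde{A}_n^k)\big)$ into (19), moving the quadratic and inner-product terms to the right-hand side, and applying the two identities above gives the stated bound on $h_n(A_n^{k+1})$.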

Furthermore, since $f$ is an $(\bar{L},\underline{L})$-relatively smooth function with respect to $\psi$, we have

\[
\begin{aligned}
f(A_1^{k+1} \cdots A_n^{k+1} \cdots A_N^{k+1}) \leq{} & f(A_1^k \cdots \underline{A}_n^k \cdots A_N^k) \\
& + \langle \nabla_{A_n} f(A_1^k \cdots \underline{A}_n^k \cdots A_N^k), A_n^{k+1} - \underline{A}_n^k\rangle + \bar{L} D_\psi(A_n^{k+1}, \underline{A}_n^k),
\end{aligned}
\]

and

\[
f(A_1^k \cdots \underline{A}_n^k \cdots A_N^k) + \langle \nabla_{A_n} f(A_1^k \cdots \underline{A}_n^k \cdots A_N^k), A_n^k - \underline{A}_n^k\rangle \leq f(A_1^k \cdots A_n^k \cdots A_N^k) + \underline{L} D_\psi(A_n^k, \underline{A}_n^k).
\]
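To make the two-sided relative smoothness bounds concrete, the following minimal sketch checks them numerically on a one-dimensional toy pair. The functions $f(x)=x^4/4$ and $\psi(x)=x^4/4+x^2/2$, the constants `L_up = 1` and `L_low = 0`, and all helper names are assumptions chosen for illustration; they are not the loss or kernel used in this paper. The gradient of this $f$ is not globally Lipschitz, yet $-\underline{L}D_\psi(x,y) \leq f(x)-f(y)-\langle\nabla f(y),x-y\rangle \leq \bar{L}D_\psi(x,y)$ holds.

```python
import random

def f(x):
    # Toy objective (assumption): its gradient x^3 is not globally Lipschitz.
    return 0.25 * x ** 4

def df(x):
    return x ** 3

def psi(x):
    # Toy kernel (assumption): psi - f = x^2 / 2 is convex.
    return 0.25 * x ** 4 + 0.5 * x ** 2

def dpsi(x):
    return x ** 3 + x

def bregman(g, dg, x, y):
    # D_g(x, y) = g(x) - g(y) - g'(y) * (x - y)
    return g(x) - g(y) - dg(y) * (x - y)

def relative_smoothness_holds(x, y, L_up=1.0, L_low=0.0, tol=1e-9):
    # Checks -L_low * D_psi <= D_f <= L_up * D_psi up to a small tolerance.
    d_f = bregman(f, df, x, y)
    d_psi = bregman(psi, dpsi, x, y)
    return -L_low * d_psi - tol <= d_f <= L_up * d_psi + tol

rng = random.Random(0)
pairs = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(1000)]
print(all(relative_smoothness_holds(x, y) for x, y in pairs))
```

In the proof, the upper bound is applied with $x=A_n^{k+1}$, $y=\underline{A}_n^k$ and the lower bound with $x=A_n^k$, $y=\underline{A}_n^k$.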

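Adding the two relative smoothness inequalities above cancels $f(A_1^k \cdots \underline{A}_n^k \cdots A_N^k)$ and merges the two inner products into one taken against $A_n^{k+1}-A_n^k$, which gives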

\[
\begin{aligned}
f(A_1^{k+1} \cdots A_n^{k+1} \cdots A_N^{k+1}) \leq{} & f(A_1^k \cdots A_n^k \cdots A_N^k) + \langle \nabla_{A_n} f(A_1^k \cdots \underline{A}_n^k \cdots A_N^k), A_n^{k+1} - A_n^k\rangle \\
& + \bar{L} D_\psi(A_n^{k+1}, \underline{A}_n^k) + \underline{L} D_\psi(A_n^k, \underline{A}_n^k).
\end{aligned}
\tag{20}
\]

Adding (20) to the above bound on $h_n(A_n^{k+1})$, we obtain

\[
\begin{aligned}
\Phi\left(A_1^{k+1},\cdots,A_N^{k+1}\right) \leq{} & \Phi\left(A_1^k,\cdots,A_N^k\right) + \frac{\alpha}{2}\|A_n^{k+1}-A_n^k\|^2 \\
& + \langle \nabla_{A_n} f(A_1^k \cdots \underline{A}_n^k \cdots A_N^k) - \tilde{\nabla}_{A_n} f(A_1^k \cdots \underline{A}_n^k \cdots A_N^k), A_n^{k+1}-A_n^k\rangle \\
& + \underline{L} D_\psi(A_n^k,\underline{A}_n^k) + \frac{1}{\eta_k} D_\psi(A_n^k,\tilde{A}_n^k) - \frac{1}{\eta_k} D_\psi(A_n^k,A_n^{k+1}) \\
& + \bar{L} D_\psi(A_n^{k+1},\underline{A}_n^k) - \frac{1}{\eta_k} D_\psi(A_n^{k+1},\tilde{A}_n^k) \\
\leq{} & \Phi\left(A_1^k,\cdots,A_N^k\right) + \frac{\alpha+\bar{\gamma}_k}{2}\|A_n^{k+1}-A_n^k\|^2 \\
& + \frac{1}{2\bar{\gamma}_k}\|\nabla_{A_n} f(A_1^k \cdots \underline{A}_n^k \cdots A_N^k) - \tilde{\nabla}_{A_n} f(A_1^k \cdots \underline{A}_n^k \cdots A_N^k)\|_*^2 \\
& + \underline{L} D_\psi(A_n^k,\underline{A}_n^k) + \frac{1}{\eta_k} D_\psi(A_n^k,\tilde{A}_n^k) - \frac{1}{\eta_k} D_\psi(A_n^k,A_n^{k+1}) \\
& + \bar{L} D_\psi(A_n^{k+1},\underline{A}_n^k) - \frac{1}{\eta_k} D_\psi(A_n^{k+1},\tilde{A}_n^k),
\end{aligned}
\tag{21}
\]

where the last inequality follows from $\langle a,b\rangle \leq \frac{\gamma}{2}\|a\|^2 + \frac{1}{2\gamma}\|b\|_*^2$ for any $\gamma>0$, applied with $\gamma=\bar{\gamma}_k$. We also assume $\eta_k \leq \bar{L}^{-1}$.
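As a quick sanity check of this inequality, the sketch below evaluates the gap between its two sides on random vectors. It assumes the Euclidean norm, which is its own dual, so $\|\cdot\|_*=\|\cdot\|$; the helper names are hypothetical. In that case the gap equals $\frac{1}{2}\|\sqrt{\gamma}\,a - b/\sqrt{\gamma}\|^2$, which is nonnegative and vanishes exactly when $b=\gamma a$.

```python
import random

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def young_gap(a, b, gamma):
    # Right-hand side minus left-hand side of
    # <a, b> <= gamma/2 * ||a||^2 + 1/(2*gamma) * ||b||^2  (Euclidean norm assumed).
    return 0.5 * gamma * dot(a, a) + dot(b, b) / (2.0 * gamma) - dot(a, b)

rng = random.Random(1)
worst = float("inf")
for _ in range(1000):
    a = [rng.uniform(-3, 3) for _ in range(4)]
    b = [rng.uniform(-3, 3) for _ in range(4)]
    gamma = rng.uniform(0.1, 10.0)
    worst = min(worst, young_gap(a, b, gamma))

# The smallest observed gap should be nonnegative up to rounding error.
print(worst >= -1e-12)
```

The free parameter $\gamma$ trades the weight on $\|a\|^2$ against the weight on $\|b\|_*^2$, which is why the proof keeps it as a tunable quantity.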

By (12), we know

\[
\begin{aligned}
D_\psi(A_n^{k+1}, \underline{A}_n^k) ={} & D_\psi(A_n^{k+1}, A_n^k) + D_\psi(A_n^k, \underline{A}_n^k) + \langle \nabla\psi(A_n^k) - \nabla\psi(\underline{A}_n^k), A_n^{k+1} - A_n^k\rangle, \\
D_\psi(A_n^{k+1}, \tilde{A}_n^k) ={} & D_\psi(A_n^{k+1}, A_n^k) + D_\psi(A_n^k, \tilde{A}_n^k) + \langle \nabla\psi(A_n^k) - \nabla\psi(\tilde{A}_n^k), A_n^{k+1} - A_n^k\rangle.
\end{aligned}
\]
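Both identities are instances of the three-point identity $D_\psi(x,z) = D_\psi(x,y) + D_\psi(y,z) + \langle \nabla\psi(y)-\nabla\psi(z), x-y\rangle$ with $x=A_n^{k+1}$, $y=A_n^k$, and $z$ equal to $\underline{A}_n^k$ or $\tilde{A}_n^k$. The following minimal sketch verifies it numerically; the kernel $\psi(x)=\frac{1}{4}\|x\|^4+\frac{1}{2}\|x\|^2$ and the helper names are assumptions made for illustration only, and the identity itself holds for any differentiable $\psi$.

```python
import random

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def psi(x):
    # Example kernel (assumption): psi(x) = 1/4 * ||x||^4 + 1/2 * ||x||^2.
    s = dot(x, x)
    return 0.25 * s * s + 0.5 * s

def grad_psi(x):
    # Gradient of the example kernel: (||x||^2 + 1) * x.
    s = dot(x, x)
    return [(s + 1.0) * v for v in x]

def bregman(x, y):
    # D_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>
    return psi(x) - psi(y) - dot(grad_psi(y), [u - v for u, v in zip(x, y)])

def three_point_gap(x, y, z):
    # Left-hand side minus right-hand side of the three-point identity.
    gy, gz = grad_psi(y), grad_psi(z)
    inner = dot([p - q for p, q in zip(gy, gz)], [u - v for u, v in zip(x, y)])
    return bregman(x, z) - (bregman(x, y) + bregman(y, z) + inner)

rng = random.Random(0)
x, y, z = ([rng.uniform(-1, 1) for _ in range(5)] for _ in range(3))
# The gap should vanish up to floating-point rounding.
print(abs(three_point_gap(x, y, z)) < 1e-9)
```

Subtracting the two identities removes the common term $D_\psi(A_n^{k+1},A_n^k)$, which is what makes the difference of Bregman distances tractable in the next step.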

For the last two terms on the right-hand side of (21), we have

\[
\begin{aligned}
& \bar{L} D_\psi(A_n^{k+1}, \underline{A}_n^k) - \frac{1}{\eta_k} D_\psi(A_n^{k+1}, \tilde{A}_n^k) \\
\leq{} & \frac{1}{\eta_k}\left[D_\psi(A_n^{k+1}, \underline{A}_n^k) - D_\psi(A_n^{k+1}, \tilde{A}_n^k)\right] \\
={} & \frac{1}{\eta_k} D_\psi(A_n^k, \underline{A}_n^k) - \frac{1}{\eta_k} D_\psi(A_n^k, \tilde{A}_n^k) + \frac{1}{\eta_k}\langle \nabla\psi(\tilde{A}_n^k) - \nabla\psi(\underline{A}_n^k), A_n^{k+1} - A_n^k\rangle \\
\leq{} & \frac{1}{\eta_k} D_\psi(A_n^k, \underline{A}_n^k) - \frac{1}{\eta_k} D_\psi(A_n^k, \tilde{A}_n^k) + \frac{1}{2\eta_k\gamma_k}\|\nabla\psi(\tilde{A}_n^k) - \nabla\psi(\underline{A}_n^k)\|^2 + \frac{\gamma_k}{2\eta_k}\|A_n^{k+1} - A_n^k\|^2 \\
\leq{} & \frac{1}{\eta_k} D_\psi(A_n^k, \underline{A}_n^k) - \frac{1}{\eta_k} D_\psi(A_n^k, \tilde{A}_n^k) + \frac{M_2^2}{2\eta_k\gamma_k}\|\tilde{A}_n^k - \underline{A}_n^k\|^2 + \frac{\gamma_k}{2\eta_k}\|A_n^{k+1} - A_n^k\|^2 \\
\leq{} & \frac{1}{\eta_k} D_\psi(A_n^k, \underline{A}_n^k) - \frac{1}{\eta_k} D_\psi(A_n^k, \tilde{A}_n^k) + \frac{M_2^2(\alpha_k-\beta_k)^2}{2\eta_k\gamma_k}\|A_n^k - A_n^{k-1}\|^2 + \frac{\gamma_k}{2\eta_k}\|A_n^{k+1} - A_n^k\|^2.
\end{aligned}
\]
end_CELL end_ROW (22)
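The equality step in (22) is the Bregman three-point identity, and the inequality that follows it is Young's inequality with parameter $\gamma_k$. The sketch below checks both numerically on random vectors. It is only an illustration: the kernel $\psi(x)=\frac12\|x\|^2+\frac14\|x\|^4$, the dimension, and all function names are assumptions made for this example and are not taken from the paper.

```python
import random

def psi(x):
    # Assumed kernel for illustration: psi(x) = 0.5*||x||^2 + 0.25*||x||^4
    s = sum(v * v for v in x)
    return 0.5 * s + 0.25 * s * s

def grad_psi(x):
    # Gradient of the assumed kernel: (1 + ||x||^2) * x
    s = sum(v * v for v in x)
    return [(1.0 + s) * v for v in x]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def sub(a, b):
    return [p - q for p, q in zip(a, b)]

def bregman(x, y):
    # D_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>
    return psi(x) - psi(y) - dot(grad_psi(y), sub(x, y))

def three_point_gap(x_next, x_cur, u, t):
    # Left side minus right side of the equality row in (22), without the 1/eta_k factor
    lhs = bregman(x_next, u) - bregman(x_next, t)
    rhs = (bregman(x_cur, u) - bregman(x_cur, t)
           + dot(sub(grad_psi(t), grad_psi(u)), sub(x_next, x_cur)))
    return lhs - rhs

def young_slack(a, b, gamma):
    # ||a||^2 / (2*gamma) + gamma * ||b||^2 / 2 - <a, b>, nonnegative for gamma > 0
    return dot(a, a) / (2.0 * gamma) + gamma * dot(b, b) / 2.0 - dot(a, b)

rng = random.Random(0)

def rand_vec(d=5):
    return [rng.uniform(-1.0, 1.0) for _ in range(d)]

# x_next ~ A_n^{k+1}, x_cur ~ A_n^k, u ~ underlined A_n^k, t ~ tilde A_n^k
x_next, x_cur, u, t = rand_vec(), rand_vec(), rand_vec(), rand_vec()
gap = three_point_gap(x_next, x_cur, u, t)
slack = young_slack(sub(grad_psi(t), grad_psi(u)), sub(x_next, x_cur), gamma=0.7)
print("three-point gap:", gap)
print("Young slack:", slack)
```

The gap should vanish up to floating-point error for any differentiable kernel, while the Young slack equals $\frac{1}{2\gamma}\|a-\gamma b\|^2$ and is therefore nonnegative for every $\gamma>0$.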

Suppose $n=\xi^{k}$ at the $k$-th iteration. Applying the conditional expectation operator $\mathbb{E}_{k}$ to inequality (21) and bounding the MSE term by (16) in Definition 4, we have

\[
\begin{aligned}
&\mathbb{E}_{k}[\Phi\left(A_{1}^{k+1},\cdots,A_{N}^{k+1}\right)]\\
\leq&\ \Phi\left(A_{1}^{k},\cdots,A_{N}^{k}\right)+\left(\frac{\alpha+\bar{\gamma}_{k}}{2}+\frac{\gamma_{k}}{2\eta_{k}}\right)\mathbb{E}_{k}[\|A_{\xi^{k}}^{k+1}-A_{\xi^{k}}^{k}\|^{2}]\\
&+\frac{1}{2\bar{\gamma}_{k}}\mathbb{E}_{k}[\|\nabla_{A_{\xi}^{k}}f(A_{1}^{k}\cdots\underline{A}_{\xi^{k}}^{k}\cdots A_{N}^{k})-\tilde{\nabla}_{A_{\xi}^{k}}f(A_{1}^{k}\cdots\underline{A}_{\xi^{k}}^{k}\cdots A_{N}^{k})\|_{*}^{2}]\\
&+\left(\frac{1}{\eta_{k}}+\underline{L}\right)\mathbb{E}_{k}[D_{\psi}({A}_{\xi^{k}}^{k},\underline{A}_{\xi^{k}}^{k})]+\frac{M_{2}^{2}(\alpha_{k}-\beta_{k})^{2}}{2\eta_{k}\gamma_{k}}\mathbb{E}_{k}[\|A_{\xi^{k}}^{k}-A_{\xi^{k}}^{k-1}\|^{2}]-\frac{1}{\eta_{k}}\mathbb{E}_{k}[D_{\psi}({A}_{\xi^{k}}^{k},A_{\xi^{k}}^{k+1})]\\
\leq&\ \Phi\left(A_{1}^{k},\cdots,A_{N}^{k}\right)+\left(\frac{\alpha+\bar{\gamma}_{k}}{2}+\frac{\gamma_{k}}{2\eta_{k}}\right)\mathbb{E}_{k}[\|A^{k+1}-A^{k}\|^{2}]\\
&+\frac{1}{2\bar{\gamma}_{k}}\left(\Gamma_{k}+V_{1}\left(\left\|A^{k}-A^{k-1}\right\|^{2}+\left\|A^{k-1}-A^{k-2}\right\|^{2}\right)\right)+\frac{M_{2}^{2}(\alpha_{k}-\beta_{k})^{2}}{2\eta_{k}\gamma_{k}}\|A^{k}-A^{k-1}\|^{2}\\
&+\left(\frac{1}{\eta_{k}}+\underline{L}\right)\left[D_{\psi}\left({A}_{1}^{k},\underline{A}_{1}^{k}\right)+D_{\psi}\left({A}_{2}^{k},\underline{A}_{2}^{k}\right)+\cdots+D_{\psi}\left({A}_{N}^{k},\underline{A}_{N}^{k}\right)\right]\\
&-\frac{1}{\eta_{k}}\mathbb{E}_{k}[D_{\psi}\left({A}_{1}^{k},A_{1}^{k+1}\right)+D_{\psi}\left({A}_{2}^{k},A_{2}^{k+1}\right)+\cdots+D_{\psi}\left({A}_{N}^{k},A_{N}^{k+1}\right)]\\
\leq&\ \Phi\left(A_{1}^{k},\cdots,A_{N}^{k}\right)+\left(\frac{\alpha+\bar{\gamma}_{k}}{2}+\frac{\gamma_{k}}{2\eta_{k}}\right)\mathbb{E}_{k}[\|A^{k+1}-A^{k}\|^{2}]\\
&+\frac{1}{2\bar{\gamma}_{k}\tau}(\Gamma_{k}-\mathbb{E}_{k}[\Gamma_{k+1}])+\frac{V_{\Gamma}}{2\bar{\gamma}_{k}\tau}\left(\left\|A^{k}-A^{k-1}\right\|^{2}+\left\|A^{k-1}-A^{k-2}\right\|^{2}\right)\\
&+\frac{V_{1}}{2\bar{\gamma}_{k}}\left(\left\|A^{k}-A^{k-1}\right\|^{2}+\left\|A^{k-1}-A^{k-2}\right\|^{2}\right)+\frac{M_{2}^{2}(\alpha_{k}-\beta_{k})^{2}}{2\eta_{k}\gamma_{k}}\|A^{k}-A^{k-1}\|^{2}\\
&+\frac{\delta-\epsilon}{\eta_{k}}D^{N}_{\psi}\left({A}^{k-1},A^{k}\right)-\frac{1}{\eta_{k}}\mathbb{E}_{k}[D^{N}_{\psi}\left({A}^{k},A^{k+1}\right)],
\end{aligned}
\tag{23}
\]

where the last inequality follows from (18) in Definition 4. From (14) and $\eta_{k}\leq\min\{\eta_{k-1},\bar{L}^{-1}\}$, it follows that

\[
\left(\frac{1}{\eta_{k}}+\underline{L}\right)D_{\psi}({A}_{n}^{k},\underline{A}_{n}^{k})\leq\frac{\underline{L}\eta_{k}+1}{\eta_{k}}\,\frac{\delta-\epsilon}{1+\underline{L}\eta_{k-1}}D_{\psi}({A}_{n}^{k-1},{A}_{n}^{k})\leq\frac{\delta-\epsilon}{\eta_{k}}D_{\psi}({A}_{n}^{k-1},{A}_{n}^{k}),
\]

and we also use the notation $D^{N}_{\psi}(A,B):=D_{\psi}(A_{1},B_{1})+\cdots+D_{\psi}(A_{N},B_{N})$ for simplicity. Then we obtain

\[
\begin{aligned}
&\mathbb{E}_{k}[\Phi\left(A_{1}^{k+1},\cdots,A_{N}^{k+1}\right)]\\
\leq&\ \Phi\left(A_{1}^{k},\cdots,A_{N}^{k}\right)-\left(\frac{1}{\eta_{k}}-\alpha-\bar{\gamma}_{k}-\frac{\gamma_{k}}{\eta_{k}}\right)\mathbb{E}_{k}[D^{N}_{\psi}(A^{k},A^{k+1})]\\
&+\frac{1}{2\bar{\gamma}_{k}\tau}(\Gamma_{k}-\mathbb{E}_{k}[\Gamma_{k+1}])+\left(\frac{\delta-\epsilon}{\eta_{k}}+\frac{V_{\Gamma}}{\bar{\gamma}_{k}\tau}+\frac{V_{1}}{\bar{\gamma}_{k}}+\frac{M_{2}^{2}(\alpha_{k}-\beta_{k})^{2}}{\eta_{k}\gamma_{k}}\right)D^{N}_{\psi}(A^{k-1},A^{k})\\
&+\left(\frac{V_{\Gamma}}{\bar{\gamma}_{k}\tau}+\frac{V_{1}}{\bar{\gamma}_{k}}\right)D^{N}_{\psi}(A^{k-2},A^{k-1}).
\end{aligned}
\]

Therefore, the result follows by rearranging the above terms with $\bar{\gamma}_{k}=\sqrt{2(V_{\Gamma}/\tau+V_{1})}$. This completes the proof. ∎
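As a reading aid, the two elementary facts behind the last steps of the proof can be written out explicitly. This is a sketch under the assumption that $\psi$ is $1$-strongly convex with respect to $\|\cdot\|$, which is how the constants above appear to be normalized; the paper's standing assumptions are stated earlier and are not repeated here.

```latex
% Assumption for this sketch: \psi is 1-strongly convex, so D_\psi(x,y) \ge \tfrac12\|x-y\|^2.
\begin{aligned}
\|A^{k+1}-A^{k}\|^{2}
  =\sum_{n=1}^{N}\|A_{n}^{k+1}-A_{n}^{k}\|^{2}
  &\le 2\sum_{n=1}^{N}D_{\psi}(A_{n}^{k},A_{n}^{k+1})
  =2\,D^{N}_{\psi}(A^{k},A^{k+1}),\\
\left(\frac{\alpha+\bar{\gamma}_{k}}{2}+\frac{\gamma_{k}}{2\eta_{k}}\right)\|A^{k+1}-A^{k}\|^{2}
  -\frac{1}{\eta_{k}}D^{N}_{\psi}(A^{k},A^{k+1})
  &\le-\left(\frac{1}{\eta_{k}}-\alpha-\bar{\gamma}_{k}-\frac{\gamma_{k}}{\eta_{k}}\right)D^{N}_{\psi}(A^{k},A^{k+1}),\\
\frac{1+\underline{L}\eta_{k}}{1+\underline{L}\eta_{k-1}}
  &\le 1\qquad\text{whenever }\eta_{k}\le\eta_{k-1}.
\end{aligned}
```

The first two lines convert the squared-norm terms of (23) into Bregman distances (the same bound applies to $\|A^{k}-A^{k-1}\|^{2}$ and $\|A^{k-1}-A^{k-2}\|^{2}$, and taking $\mathbb{E}_{k}$ preserves the inequalities), and the third line is what makes the nonincreasing stepsize sufficient for the second inequality in the bound on $D_{\psi}(A_{n}^{k},\underline{A}_{n}^{k})$.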

Next, we introduce a new Lyapunov function and show it is monotonically decreasing in expectation. For simplicity, we denote

\[
\Phi^{k}=\Phi\left(A_{1}^{k},\cdots,A_{N}^{k}\right).
\]
Lemma 2.

Suppose the same conditions as in Lemma 1 hold, and the stepsize satisfies

\[
\eta_{k}\leq\min\left\{\eta_{k-1},\frac{1}{\bar{L}},\frac{1-\delta-2|\alpha_{k}-\beta_{k}|M_{2}}{\alpha+2\bar{\gamma}}\right\},\quad\forall\,k>0.
\tag{24}
\]

Let $\{A_{n}^{k}\}_{k>0}$ with $n\in\{1,\dots,N\}$ be a sequence generated by iTableSMD (Algorithm 1) and define the following Lyapunov sequence

$$\begin{aligned}
\Psi_{k+1}:={}&\eta_{k}\left(\Phi^{k+1}-\mathcal{V}(\Phi)\right)+\left(1-\eta_{k}\alpha-\eta_{k}\bar{\gamma}-\gamma_{k}-\frac{\epsilon}{3}\right)D^{N}_{\psi}(A^{k},A^{k+1})\\
&+\eta_{k}\left(\frac{\bar{\gamma}}{2}+\frac{\epsilon}{3\eta_{k}}\right)D^{N}_{\psi}(A^{k-1},A^{k})+\frac{\eta_{k}}{2\tau\bar{\gamma}}\Gamma_{k+1},
\end{aligned} \tag{25}$$

where $\gamma_{k}=|\alpha_{k}-\beta_{k}|M_{2}$. Then, for all $k\in\mathbb{N}$, we have

$$\mathbb{E}_{k}[\Psi_{k+1}]\leq\Psi_{k}-\frac{\epsilon}{3}\left(\mathbb{E}_{k}[D^{N}_{\psi}(A^{k},A^{k+1})]+D^{N}_{\psi}(A^{k-1},A^{k})+D^{N}_{\psi}(A^{k-2},A^{k-1})\right). \tag{26}$$
Proof.

Lemma 1 shows that

$$\begin{aligned}
&\eta_{k}\left(\Phi^{k}-\mathcal{V}(\Phi)\right)\\
\geq{}&\eta_{k}\left(\mathbb{E}_{k}[\Phi^{k+1}]-\mathcal{V}(\Phi)\right)+\left(1-\eta_{k}\alpha-\eta_{k}\bar{\gamma}-\gamma_{k}\right)\mathbb{E}_{k}[D^{N}_{\psi}(A^{k},A^{k+1})]\\
&+\frac{\eta_{k}}{2\bar{\gamma}\tau}\left(\mathbb{E}_{k}[\Gamma_{k+1}]-\Gamma_{k}\right)-\left(\delta-\epsilon+\frac{\bar{\gamma}\eta_{k}}{2}+\frac{M_{2}^{2}(\alpha_{k}-\beta_{k})^{2}}{\gamma_{k}}\right)D^{N}_{\psi}(A^{k-1},A^{k})-\frac{\bar{\gamma}\eta_{k}}{2}D^{N}_{\psi}(A^{k-2},A^{k-1}).
\end{aligned} \tag{27}$$

Combining (25) with $\eta_{k}\leq\eta_{k-1}$, we have

$$\begin{aligned}
&\Psi_{k}-\mathbb{E}_{k}[\Psi_{k+1}]\\
={}&\eta_{k-1}\left(\Phi^{k}-\mathcal{V}(\Phi)\right)+\left(1-\eta_{k-1}\alpha-\eta_{k-1}\bar{\gamma}-\gamma_{k-1}-\frac{\epsilon}{3}\right)D^{N}_{\psi}(A^{k-1},A^{k})-\frac{\eta_{k}}{2\tau\bar{\gamma}}\mathbb{E}_{k}[\Gamma_{k+1}]\\
&+\frac{\eta_{k-1}}{2\tau\bar{\gamma}}\Gamma_{k}+\eta_{k-1}\left(\frac{\bar{\gamma}}{2}+\frac{\epsilon}{3\eta_{k-1}}\right)D^{N}_{\psi}(A^{k-2},A^{k-1})-\eta_{k}\left(\mathbb{E}_{k}[\Phi^{k+1}]-\mathcal{V}(\Phi)\right)\\
&-\eta_{k}\left(\frac{\bar{\gamma}}{2}+\frac{\epsilon}{3\eta_{k}}\right)D^{N}_{\psi}(A^{k-1},A^{k})-\left(1-\eta_{k}\alpha-\eta_{k}\bar{\gamma}-\gamma_{k}-\frac{\epsilon}{3}\right)\mathbb{E}_{k}[D^{N}_{\psi}(A^{k},A^{k+1})]\\
\geq{}&\eta_{k}\left(\Phi^{k}-\mathcal{V}(\Phi)\right)+\left(1-\eta_{k-1}\alpha-\eta_{k-1}\bar{\gamma}-\gamma_{k-1}-\frac{\epsilon}{3}\right)D^{N}_{\psi}(A^{k-1},A^{k})-\frac{\eta_{k}}{2\tau\bar{\gamma}}\mathbb{E}_{k}[\Gamma_{k+1}]\\
&+\frac{\eta_{k}}{2\tau\bar{\gamma}}\Gamma_{k}+\eta_{k-1}\left(\frac{\bar{\gamma}}{2}+\frac{\epsilon}{3\eta_{k-1}}\right)D^{N}_{\psi}(A^{k-2},A^{k-1})-\eta_{k}\left(\mathbb{E}_{k}[\Phi^{k+1}]-\mathcal{V}(\Phi)\right)\\
&-\eta_{k}\left(\frac{\bar{\gamma}}{2}+\frac{\epsilon}{3\eta_{k}}\right)D^{N}_{\psi}(A^{k-1},A^{k})-\left(1-\eta_{k}\alpha-\eta_{k}\bar{\gamma}-\gamma_{k}-\frac{\epsilon}{3}\right)\mathbb{E}_{k}[D^{N}_{\psi}(A^{k},A^{k+1})]\\
\geq{}&\left(1-\delta-\eta_{k-1}\alpha-(\eta_{k-1}+\eta_{k})\bar{\gamma}-\gamma_{k-1}-\frac{M_{2}^{2}(\alpha_{k}-\beta_{k})^{2}}{\gamma_{k}}\right)D^{N}_{\psi}(A^{k-1},A^{k})\\
&+\frac{\epsilon}{3}\left(\mathbb{E}_{k}[D^{N}_{\psi}(A^{k},A^{k+1})]+D^{N}_{\psi}(A^{k-1},A^{k})+D^{N}_{\psi}(A^{k-2},A^{k-1})\right).
\end{aligned}$$

Let $\gamma_{k}=|\alpha_{k}-\beta_{k}|M_{2}$ and assume $\gamma_{k}\geq\gamma_{k-1}$ (footnote 1: in the numerical experiments of [47, 59], $\alpha_{k}=c_{1}\frac{k-1}{k+2}$ and $\beta_{k}=c_{2}\frac{k-1}{k+2}$, so this inequality holds). Then we have

$$\begin{aligned}
&\Psi_{k}-\mathbb{E}_{k}[\Psi_{k+1}]\\
\geq{}&\left(1-\delta-\eta_{k-1}\alpha-2\eta_{k-1}\bar{\gamma}-\gamma_{k-1}-\gamma_{k}\right)D^{N}_{\psi}(A^{k-1},A^{k})\\
&+\frac{\epsilon}{3}\left(\mathbb{E}_{k}[D^{N}_{\psi}(A^{k},A^{k+1})]+D^{N}_{\psi}(A^{k-1},A^{k})+D^{N}_{\psi}(A^{k-2},A^{k-1})\right)\\
\geq{}&\left(1-\delta-\eta_{k-1}\alpha-2\eta_{k-1}\bar{\gamma}-2\gamma_{k}\right)D^{N}_{\psi}(A^{k-1},A^{k})\\
&+\frac{\epsilon}{3}\left(\mathbb{E}_{k}[D^{N}_{\psi}(A^{k},A^{k+1})]+D^{N}_{\psi}(A^{k-1},A^{k})+D^{N}_{\psi}(A^{k-2},A^{k-1})\right)\\
\geq{}&\frac{\epsilon}{3}\left(\mathbb{E}_{k}[D^{N}_{\psi}(A^{k},A^{k+1})]+D^{N}_{\psi}(A^{k-1},A^{k})+D^{N}_{\psi}(A^{k-2},A^{k-1})\right),
\end{aligned}$$

where the second and the last inequalities follow from (27) and (24), respectively. This completes the proof. ∎
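As a rough numerical illustration of the stepsize rule (24) and of the monotonicity assumption $\gamma_{k}\geq\gamma_{k-1}$ used in the proof, the following Python sketch builds a nonincreasing stepsize sequence. Only the extrapolation schedule $\alpha_{k}=c_{1}\frac{k-1}{k+2}$, $\beta_{k}=c_{2}\frac{k-1}{k+2}$ comes from the footnote; the function names and all constant values (`c1`, `c2`, `M2`, `L_bar`, `alpha_const`, `gamma_bar`, `delta`) are hypothetical placeholders and not values reported in the paper.

```python
def extrapolation(k, c1, c2):
    """Extrapolation weights alpha_k, beta_k from the footnote schedule."""
    w = (k - 1) / (k + 2)
    return c1 * w, c2 * w


def stepsize_cap(k, c1, c2, M2, L_bar, alpha_const, gamma_bar, delta):
    """Upper bound on eta_k from (24), excluding the eta_{k-1} term.

    Returns the cap together with gamma_k = |alpha_k - beta_k| * M2.
    """
    a_k, b_k = extrapolation(k, c1, c2)
    gamma_k = abs(a_k - b_k) * M2
    third = (1.0 - delta - 2.0 * gamma_k) / (alpha_const + 2.0 * gamma_bar)
    return min(1.0 / L_bar, third), gamma_k


def stepsizes(K, eta0, **consts):
    """Largest stepsizes allowed by (24) for k = 1..K, starting from eta0."""
    etas, gammas = [], []
    eta_prev = eta0
    for k in range(1, K + 1):
        cap, gamma_k = stepsize_cap(k, **consts)
        if cap <= 0.0:
            raise ValueError("constants leave no positive stepsize in (24)")
        eta_prev = min(eta_prev, cap)  # enforces eta_k <= eta_{k-1}
        etas.append(eta_prev)
        gammas.append(gamma_k)
    return etas, gammas


# Hypothetical constants, chosen only so that the bound in (24) is positive.
consts = dict(c1=0.4, c2=0.3, M2=1.0, L_bar=2.0,
              alpha_const=1.0, gamma_bar=0.5, delta=0.1)
etas, gammas = stepsizes(50, 1.0, **consts)
```

With this schedule $\gamma_{k}=|c_{1}-c_{2}|M_{2}\frac{k-1}{k+2}$ is nondecreasing in $k$, so the third term of (24) shrinks with $k$ and the running minimum keeps the stepsizes nonincreasing.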

Theorem 1.

Let $\{A_{n}^{k}\}_{k>0}$ with $n\in\{1,\dots,N\}$ be a sequence generated by the iTableSMD algorithm. Then, the following statements hold.

  • (i)

    The sequence $\{\mathbb{E}[\Psi_{k}]\}_{k\in\mathbb{N}}$ is nonincreasing.

  • (ii)

    $\sum\limits_{k=1}^{+\infty}\mathbb{E}[D^{N}_{\psi}(A^{k-1},A^{k})]<+\infty$, and the sequence $\{\mathbb{E}[D^{N}_{\psi}(A^{k-1},A^{k})]\}$ converges to zero.

  • (iii)

    $\min\limits_{1\leq k\leq K}\mathbb{E}[D^{N}_{\psi}(A^{k-1},A^{k})]\leq\frac{3\Psi_{1}}{\epsilon K}$.

Proof.
  • (i)

    This statement follows directly from Lemma 2 and $\epsilon>0$.

  • (ii)

    By taking the total expectation in (26) and summing from $k=1$ to a positive integer $K$, we have

    $$\sum_{k=1}^{K}\mathbb{E}[D^{N}_{\psi}(A^{k-1},A^{k})]\leq\frac{3}{\epsilon}\mathbb{E}[\Psi_{1}-\Psi_{K+1}]\leq\frac{3}{\epsilon}\Psi_{1},$$

    where the last inequality follows from $\Psi_{k}\geq 0$ for any $k>0$ due to (24). Taking the limit as $K\rightarrow+\infty$, we have $\sum_{k=1}^{+\infty}\mathbb{E}[D^{N}_{\psi}(A^{k-1},A^{k})]<+\infty$. Consequently, the sequence $\{\mathbb{E}[D^{N}_{\psi}(A^{k-1},A^{k})]\}$ converges to zero.

  • (iii)

    We have

    $$K\min_{1\leq k\leq K}\mathbb{E}[D^{N}_{\psi}(A^{k-1},A^{k})]\leq\sum_{k=1}^{K}\mathbb{E}[D^{N}_{\psi}(A^{k-1},A^{k})]\leq\frac{3}{\epsilon}\Psi_{1},$$

    which yields the desired result.

This completes the proof. ∎
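The counting argument in item (iii) can be illustrated numerically. The following is a minimal Python sketch under stated assumptions: the list `values` is a hypothetical nonnegative sequence standing in for $\mathbb{E}[D^{N}_{\psi}(A^{k-1},A^{k})]$, and `total_bound` is a hypothetical constant standing in for $\frac{3}{\epsilon}\Psi_{1}$. Neither is produced by iTableSMD here.

```python
def best_iterate_bound(values, total_bound):
    """Return (min_k values[k], total_bound / K) for a nonnegative sequence.

    If sum(values) <= total_bound, then K * min(values) <= sum(values),
    so the smallest term is at most total_bound / K.
    """
    K = len(values)
    return min(values), total_bound / K


# Hypothetical stand-in sequence v_k = 1 / k^2, whose partial sums stay below 2.
values = [1.0 / k**2 for k in range(1, 101)]
smallest, bound = best_iterate_bound(values, total_bound=2.0)
print(smallest <= bound)
```

The sketch only shows why a summable sequence forces its smallest term among the first $K$ to decay at least like $1/K$; it says nothing about which index attains the minimum.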

4.2 Global convergence analysis

In this subsection, we analyze the iTableSMD algorithm, bounding the expected squared distance of the subgradient and establishing global convergence. In addition, we impose a stronger assumption on the function $f$.

Assumption 2.

The partial gradient $\nabla_{A_{i}}f$ is Lipschitz continuous with modulus $M_{1}$ on bounded sets of $\Pi_{n=1}^{N}\mathbb{R}^{I_{n}\times R}$. Namely, for any two points $A$ and $\hat{A}$, where $A:=(A_{1},\dots,A_{i},\dots,A_{N})$ and $\hat{A}:=(A_{1},\dots,A_{i-1},\hat{A}_{i},A_{i+1},\dots,A_{N})\in\Pi_{n=1}^{N}\mathbb{R}^{I_{n}\times R}$, it holds that

$$\|\nabla_{A_{i}}f(A)-\nabla_{A_{i}}f(\hat{A})\|\leq M_{1}\|A_{i}-\hat{A}_{i}\|,\quad i=1,2,\dots,N.$$
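To make Assumption 2 concrete, the following Python sketch checks the block Lipschitz inequality on a toy two-factor squared loss $f(A_{1},A_{2})=\frac{1}{2}\|A_{1}A_{2}^{\top}-X\|_{F}^{2}$. This toy loss, the function names `grad_A1` and `block_lipschitz_modulus`, and the random data are all illustrative assumptions; they are not the generalized CP loss studied in the paper. For this loss the modulus in the first block is the spectral norm of $A_{2}^{\top}A_{2}$, which is bounded whenever $A_{2}$ stays in a bounded set.

```python
import numpy as np


def grad_A1(A1, A2, X):
    # Partial gradient of f(A1, A2) = 0.5 * ||A1 @ A2.T - X||_F^2 with respect to A1
    return (A1 @ A2.T - X) @ A2


def block_lipschitz_modulus(A2):
    # Spectral norm of A2.T @ A2; it plays the role of M_1 for the A1 block
    return np.linalg.norm(A2.T @ A2, 2)


rng = np.random.default_rng(0)
A1, A1_hat = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
A2, X = rng.normal(size=(4, 3)), rng.normal(size=(5, 4))

# Only the first block differs between the two points, as in Assumption 2.
lhs = np.linalg.norm(grad_A1(A1, A2, X) - grad_A1(A1_hat, A2, X))
rhs = block_lipschitz_modulus(A2) * np.linalg.norm(A1 - A1_hat)
print(lhs <= rhs + 1e-12)
```

The modulus depends on the other block, which is why the assumption is stated on bounded sets rather than globally.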

Under Definition 4 and the definitions of SAGA [15] and SARAH [43], we have the following proposition.

Proposition 1.

Under Assumption 2, the following two statements hold.

  • (i)

    The SAGA gradient estimator [15] is defined as

    $$\tilde{\nabla}^{SAGA}_{n}f(\underline{A}^{k}):=\frac{1}{I_{n}\left|\mathcal{F}_{n}^{k}\right|}\Big(\sum_{j\in\mathcal{F}_{n}^{k}}\nabla_{A_{n}}f_{j}(\underline{A}^{k})-\nabla_{A_{n}}f_{j}((\phi^{k})^{j})\Big)+\frac{1}{J_{n}}\sum_{i=1}^{J_{n}}\nabla_{A_{n}}f_{i}((\phi^{k})^{i}),\tag{28}$$

    where $\underline{A}^{k}:=(A_{1}^{k},\dots,A_{n-1}^{k},\underline{A}_{n}^{k},A_{n+1}^{k},\dots,A_{N}^{k})$, and the variables $(\phi^{k})^{i}$ follow the update rules $(\phi^{k})^{i}=\underline{A}^{k-1}$ if $i\in\mathcal{F}_{n}^{k}$ and $(\phi^{k})^{i}=(\phi^{k-1})^{i}$ otherwise. The set of sampled mode-$n$ fibers is indexed by $\mathcal{F}_{n}^{k}\subset\{1,\dots,J_{n}\}$ with $|\mathcal{F}_{n}^{k}|=B$. Then it is variance reduced with

    $$\Gamma_{k+1}:=\frac{1}{BJ_{n}}\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k})-\nabla_{A_{n}}f_{i}((\phi^{k})^{i})\|_{*}^{2},$$

    $$\Upsilon_{k+1}:=\frac{1}{\sqrt{BJ_{n}}}\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k})-\nabla_{A_{n}}f_{i}((\phi^{k})^{i})\|_{*}.$$

    The constants are $\tau=\frac{B}{2J_{n}}$, $V_{\Gamma}=2J_{n}+\frac{4J_{n}^{2}}{B}M_{1}^{2}$, $V_{1}=M_{1}^{2}$, $V_{2}=M_{1}$.

  • (ii)

    The SARAH gradient estimator [43] is defined as

    $$\tilde{\nabla}^{SARAH}_{n}f(\underline{A}^{k})=\left\{\begin{array}{ll}\nabla_{A_{n}}f(\underline{A}^{k}),&\mbox{w.p.}\,\,\frac{1}{p},\\ \frac{1}{B}\Big(\sum\limits_{j\in\mathcal{F}_{n}^{k}}\nabla_{A_{n}}f_{j}(\underline{A}^{k})-\nabla_{A_{n}}f_{j}(\underline{A}^{k-1})\Big)+\tilde{\nabla}^{SARAH}_{n}f(\underline{A}^{k-1}),&\mbox{otherwise.}\end{array}\right.$$

    Here “w.p. $\frac{1}{p}$” means with probability $\frac{1}{p}\in(0,1]$. Then it is variance reduced with

    $$\Gamma_{k+1}=\|\tilde{\nabla}^{SARAH}_{n}f(\underline{A}^{k})-\nabla_{A_{n}}f(\underline{A}^{k})\|_{*}^{2},\quad\Upsilon_{k+1}=\|\tilde{\nabla}^{SARAH}_{n}f(\underline{A}^{k})-\nabla_{A_{n}}f(\underline{A}^{k})\|_{*},$$

    and constants $\tau=\frac{1}{p}$, $V_{1}=V_{\Gamma}=2M_{1}^{2}$, $V_{2}=2M_{1}$.
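The two estimators in Proposition 1 can be sketched in Python on a toy finite sum. This is an illustration under stated assumptions, not the paper's implementation: it uses a least-squares objective $f(x)=\frac{1}{J}\sum_{i=1}^{J}\frac{1}{2}(a_{i}^{\top}x-b_{i})^{2}$ in place of the tensor loss, the standard minibatch forms of SAGA and SARAH with a $\frac{1}{B}$ scaling (so the $\frac{1}{I_{n}}$ factor of (28) and the fiber sampling are not reproduced), and hypothetical names such as `saga_estimate`, `sarah_estimate`, and `table`. The SAGA table stores past component gradients rather than the past points $(\phi^{k})^{i}$.

```python
import numpy as np


def grad_i(x, a_i, b_i):
    # Gradient of the component f_i(x) = 0.5 * (a_i @ x - b_i)**2
    return (a_i @ x - b_i) * a_i


def full_grad(x, A, b):
    # Gradient of f(x) = (1/J) * sum_i f_i(x)
    return A.T @ (A @ x - b) / A.shape[0]


def saga_estimate(x, A, b, table, batch):
    # table[i] holds the gradient of f_i at the point where index i was last sampled
    correction = np.mean([grad_i(x, A[j], b[j]) - table[j] for j in batch], axis=0)
    return correction + table.mean(axis=0)


def saga_update_table(x, A, b, table, batch):
    # Refresh only the sampled entries; all other entries keep their old gradients
    for j in batch:
        table[j] = grad_i(x, A[j], b[j])


def sarah_estimate(x, x_prev, g_prev, A, b, batch, use_full):
    # With use_full=True (probability 1/p in the text) fall back to the full gradient;
    # otherwise apply the recursive correction to the previous estimate g_prev
    if use_full:
        return full_grad(x, A, b)
    diff = np.mean(
        [grad_i(x, A[j], b[j]) - grad_i(x_prev, A[j], b[j]) for j in batch], axis=0
    )
    return diff + g_prev


rng = np.random.default_rng(0)
J, d = 6, 3
A, b = rng.normal(size=(J, d)), rng.normal(size=J)
x_prev, x = rng.normal(size=d), rng.normal(size=d)
table = np.array([grad_i(x_prev, A[i], b[i]) for i in range(J)])

# Averaging the SAGA estimate over every single-index batch recovers the full gradient.
avg = np.mean([saga_estimate(x, A, b, table, [j]) for j in range(J)], axis=0)
print(np.allclose(avg, full_grad(x, A, b)))
```

The final check reflects that this SAGA form is unbiased under uniform sampling. SARAH is biased in general, which is why the proposition tracks its error through $\Gamma_{k+1}$ and $\Upsilon_{k+1}$ directly.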

Proof.

From the definition of the SAGA stochastic gradient estimator $\tilde{\nabla}^{SAGA}f(\underline{A}^{k})$ and the Lipschitz continuity of $\nabla_{A_{i}}f(\cdot)$, it follows that

$$\begin{aligned}
&\mathbb{E}_{k}\|\tilde{\nabla}^{SAGA}f(\underline{A}^{k})-\nabla f(\underline{A}^{k})\|_{*}^{2}\\
={}&\mathbb{E}_{k}\Big\|\frac{1}{I_{n}B}\Big(\sum_{j\in\mathcal{F}_{n}^{k}}\nabla_{A_{n}}f_{j}(\underline{A}^{k})-\nabla_{A_{n}}f_{j}((\phi^{k})^{j})\Big)+\frac{1}{J_{n}}\sum_{i=1}^{J_{n}}\nabla_{A_{n}}f_{i}((\phi^{k})^{i})-\nabla f(\underline{A}^{k})\Big\|_{*}^{2}\\
\leq{}&\frac{1}{B^{2}I^{2}_{n}}\mathbb{E}_{k}\sum_{j\in\mathcal{F}_{n}^{k}}\|\nabla_{A_{n}}f_{j}(\underline{A}^{k})-\nabla_{A_{n}}f_{j}((\phi^{k})^{j})\|_{*}^{2}\\
={}&\frac{1}{BI^{2}_{n}J_{n}}\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k})-\nabla_{A_{n}}f_{i}((\phi^{k})^{i})\|_{*}^{2}\\
\leq{}&\frac{1}{BJ_{n}}\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k})-\nabla_{A_{n}}f_{i}((\phi^{k})^{i})\|_{*}^{2},
\end{aligned}$$

where the last inequality follows from the fact that $\mathbb{E}_{k}\|y_{1}+\cdots+y_{t}\|_{*}^{2}=\mathbb{E}_{k}\|y_{1}\|_{*}^{2}+\cdots+\mathbb{E}_{k}\|y_{t}\|_{*}^{2}$ for any independent random variables $y_{i}$ $(i=1,\dots,t)$ with $\mathbb{E}_{k}[y_{i}]=0$ for all $i$. Combined with Jensen’s inequality, we can get

$$\begin{aligned}
&\mathbb{E}_{k}\|\tilde{\nabla}^{SAGA}_{n}f(\underline{A}^{k})-\nabla f(\underline{A}^{k})\|_{*}\\
\leq{}&\sqrt{\mathbb{E}_{k}\|\tilde{\nabla}^{SAGA}_{n}f(\underline{A}^{k})-\nabla f(\underline{A}^{k})\|_{*}^{2}}\\
\leq{}&\frac{1}{\sqrt{BJ_{n}}}\sqrt{\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k})-\nabla_{A_{n}}f_{i}((\phi^{k})^{i})\|_{*}^{2}}\\
\leq{}&\frac{1}{\sqrt{BJ_{n}}}\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k})-\nabla_{A_{n}}f_{i}((\phi^{k})^{i})\|_{*}.
\end{aligned}$$
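The two elementary facts used in the bounds above can be checked exactly on small discrete distributions. The Python sketch below is illustrative only and rests on stated assumptions: the norm is taken to be Euclidean (the identity for sums of independent zero-mean variables relies on an inner-product norm), each $y_{i}$ is uniform on a hypothetical finite support, and the helper names are invented for this example.

```python
import itertools
import numpy as np


def centered(points):
    # Shift a finite support so the uniform distribution on its rows has zero mean
    return points - points.mean(axis=0)


def second_moment_of_sum(supports):
    # Exact E||y_1 + ... + y_t||^2 for independent y_i, each uniform on supports[i]
    combos = itertools.product(*supports)
    return np.mean([np.sum(np.sum(np.stack(c), axis=0) ** 2) for c in combos])


def sum_of_second_moments(supports):
    # E||y_1||^2 + ... + E||y_t||^2
    return sum(np.mean(np.sum(s ** 2, axis=1)) for s in supports)


def jensen_pair(points):
    # Returns (E||y||, sqrt(E||y||^2)) for y uniform on the rows of points
    norms = np.linalg.norm(points, axis=1)
    return norms.mean(), np.sqrt(np.mean(norms ** 2))


rng = np.random.default_rng(0)
supports = [centered(rng.normal(size=(m, 3))) for m in (4, 5, 3)]

# Cross terms vanish for independent zero-mean variables.
print(np.isclose(second_moment_of_sum(supports), sum_of_second_moments(supports)))

# Jensen: E||y|| <= sqrt(E||y||^2).
mean_norm, root_mean_sq = jensen_pair(supports[0])
print(mean_norm <= root_mean_sq + 1e-12)

# Last step of the chain: sqrt(sum a_i^2) <= sum a_i for nonnegative a_i.
a = np.abs(rng.normal(size=6))
print(np.sqrt(np.sum(a ** 2)) <= np.sum(a) + 1e-12)
```

The enumeration over all support combinations is exact but grows multiplicatively with the support sizes, so it is meant only for small sanity checks.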

We bound the MSE of the stochastic gradient estimator $\tilde{\nabla}^{SAGA}f(\cdot)$ as follows,

$$\begin{aligned}
&\frac{1}{BJ_{n}}\sum_{i=1}^{J_{n}}\mathbb{E}_{k}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k})-\nabla_{A_{n}}f_{i}((\phi^{k})^{i})\|_{*}^{2}\\
\leq{}&\frac{1+\delta}{BJ_{n}}\mathbb{E}_{k}\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k-1})-\nabla_{A_{n}}f_{i}((\phi^{k})^{i})\|_{*}^{2}+\frac{1+\delta^{-1}}{BJ_{n}}\mathbb{E}_{k}\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k})-\nabla_{A_{n}}f_{i}(\underline{A}^{k-1})\|_{*}^{2}
\end{aligned}$$
\displaystyle\leq 1+δBJn(1BJn)i=1JnAnfi(A¯k1)Anfi((ϕk1)i)*2+1+δ1BM12𝔼kA¯nkA¯nk121𝛿𝐵subscript𝐽𝑛1𝐵subscript𝐽𝑛superscriptsubscript𝑖1subscript𝐽𝑛superscriptsubscriptnormsubscriptsubscript𝐴𝑛subscript𝑓𝑖superscript¯𝐴𝑘1subscriptsubscript𝐴𝑛subscript𝑓𝑖superscriptsuperscriptitalic-ϕ𝑘1𝑖21superscript𝛿1𝐵superscriptsubscript𝑀12subscript𝔼𝑘superscriptnormsuperscriptsubscript¯𝐴𝑛𝑘superscriptsubscript¯𝐴𝑛𝑘12\displaystyle\frac{1+\delta}{BJ_{n}}(1-\frac{B}{J_{n}})\sum_{i=1}^{J_{n}}\|% \nabla_{A_{n}}f_{i}(\underline{A}^{k-1})-\nabla_{A_{n}}f_{i}((\phi^{k-1})^{i})% \|_{*}^{2}+\frac{1+\delta^{-1}}{B}M_{1}^{2}\mathbb{E}_{k}\|\underline{A}_{n}^{% k}-\underline{A}_{n}^{k-1}\|^{2}divide start_ARG 1 + italic_δ end_ARG start_ARG italic_B italic_J start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_ARG ( 1 - divide start_ARG italic_B end_ARG start_ARG italic_J start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_ARG ) ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_J start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ ∇ start_POSTSUBSCRIPT italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( under¯ start_ARG italic_A end_ARG start_POSTSUPERSCRIPT italic_k - 1 end_POSTSUPERSCRIPT ) - ∇ start_POSTSUBSCRIPT italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( ( italic_ϕ start_POSTSUPERSCRIPT italic_k - 1 end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) ∥ start_POSTSUBSCRIPT * end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + divide start_ARG 1 + italic_δ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT end_ARG start_ARG italic_B end_ARG italic_M start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ under¯ start_ARG italic_A end_ARG start_POSTSUBSCRIPT italic_n 
end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT - under¯ start_ARG italic_A end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k - 1 end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
\displaystyle\leq 1+δBJn(1BJn)i=1JnAnfi(A¯k1)Anfi((ϕk1)i)*2+1+δ1BM12𝔼k[(1+αk2)AnkAnk12\displaystyle\frac{1+\delta}{BJ_{n}}(1-\frac{B}{J_{n}})\sum_{i=1}^{J_{n}}\|% \nabla_{A_{n}}f_{i}(\underline{A}^{k-1})-\nabla_{A_{n}}f_{i}((\phi^{k-1})^{i})% \|_{*}^{2}+\frac{1+\delta^{-1}}{B}M_{1}^{2}\mathbb{E}_{k}[(1+\alpha_{k}^{2})\|% A_{n}^{k}-A_{n}^{k-1}\|^{2}divide start_ARG 1 + italic_δ end_ARG start_ARG italic_B italic_J start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_ARG ( 1 - divide start_ARG italic_B end_ARG start_ARG italic_J start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_ARG ) ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_J start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ ∇ start_POSTSUBSCRIPT italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( under¯ start_ARG italic_A end_ARG start_POSTSUPERSCRIPT italic_k - 1 end_POSTSUPERSCRIPT ) - ∇ start_POSTSUBSCRIPT italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( ( italic_ϕ start_POSTSUPERSCRIPT italic_k - 1 end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) ∥ start_POSTSUBSCRIPT * end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + divide start_ARG 1 + italic_δ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT end_ARG start_ARG italic_B end_ARG italic_M start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT [ ( 1 + italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) ∥ italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT - italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k - 1 end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
+αk12Ank1Ank22]\displaystyle+\alpha_{k-1}^{2}\|A_{n}^{k-1}-A_{n}^{k-2}\|^{2}]+ italic_α start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k - 1 end_POSTSUPERSCRIPT - italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k - 2 end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ]
\displaystyle\leq 1+δBJn(1BJn)i=1JnAnfi(A¯k1)Anfi((ϕk1)i)*2+2+2δ1BNM12[AkAk12+Ak1Ak22],1𝛿𝐵subscript𝐽𝑛1𝐵subscript𝐽𝑛superscriptsubscript𝑖1subscript𝐽𝑛superscriptsubscriptnormsubscriptsubscript𝐴𝑛subscript𝑓𝑖superscript¯𝐴𝑘1subscriptsubscript𝐴𝑛subscript𝑓𝑖superscriptsuperscriptitalic-ϕ𝑘1𝑖222superscript𝛿1𝐵𝑁superscriptsubscript𝑀12delimited-[]superscriptnormsuperscript𝐴𝑘superscript𝐴𝑘12superscriptnormsuperscript𝐴𝑘1superscript𝐴𝑘22\displaystyle\frac{1+\delta}{BJ_{n}}(1-\frac{B}{J_{n}})\sum_{i=1}^{J_{n}}\|% \nabla_{A_{n}}f_{i}(\underline{A}^{k-1})-\nabla_{A_{n}}f_{i}((\phi^{k-1})^{i})% \|_{*}^{2}+\frac{2+2\delta^{-1}}{BN}M_{1}^{2}[\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-% A^{k-2}\|^{2}],divide start_ARG 1 + italic_δ end_ARG start_ARG italic_B italic_J start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_ARG ( 1 - divide start_ARG italic_B end_ARG start_ARG italic_J start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_ARG ) ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_J start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ ∇ start_POSTSUBSCRIPT italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( under¯ start_ARG italic_A end_ARG start_POSTSUPERSCRIPT italic_k - 1 end_POSTSUPERSCRIPT ) - ∇ start_POSTSUBSCRIPT italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( ( italic_ϕ start_POSTSUPERSCRIPT italic_k - 1 end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) ∥ start_POSTSUBSCRIPT * end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + divide start_ARG 2 + 2 italic_δ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT end_ARG start_ARG italic_B italic_N end_ARG italic_M start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT [ ∥ italic_A start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT - italic_A start_POSTSUPERSCRIPT italic_k - 1 
end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + ∥ italic_A start_POSTSUPERSCRIPT italic_k - 1 end_POSTSUPERSCRIPT - italic_A start_POSTSUPERSCRIPT italic_k - 2 end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] ,

where the first inequality follows from $\|x-z\|_{*}^{2}\leq(1+\delta)\|x-y\|_{*}^{2}+(1+\delta^{-1})\|y-z\|_{*}^{2}$.
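The inequality used in the first step can be sanity-checked numerically. The following is a minimal sketch, assuming the Euclidean norm (which is its own dual) in place of the general dual norm $\|\cdot\|_{*}$; the helper name `young_gap` and the sampled values of $\delta$ are illustrative choices and do not come from the paper.

```python
import numpy as np

def young_gap(x, y, z, delta):
    """Return RHS - LHS of ||x-z||^2 <= (1+delta)||x-y||^2 + (1+1/delta)||y-z||^2."""
    lhs = np.sum((x - z) ** 2)
    rhs = (1 + delta) * np.sum((x - y) ** 2) + (1 + 1 / delta) * np.sum((y - z) ** 2)
    return rhs - lhs

# Sample random triples (x, y, z) for several values of delta (illustrative values).
rng = np.random.default_rng(0)
gaps = [
    young_gap(rng.normal(size=5), rng.normal(size=5), rng.normal(size=5), delta)
    for delta in (0.01, 0.5, 1.0, 10.0)
    for _ in range(1000)
]
min_gap = min(gaps)  # should be nonnegative up to rounding error
```

In the Euclidean case the gap equals $\|\sqrt{\delta}\,(x-y)-(y-z)/\sqrt{\delta}\|^{2}$, which explains why it is never negative.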

Let $\Gamma_{k+1}:=\frac{1}{BJ_{n}}\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k})-\nabla_{A_{n}}f_{i}((\phi^{k})^{i})\|_{*}^{2}$ and $\delta=\frac{B}{2J_{n}}$. Then

\begin{align*}
\mathbb{E}_{k}\Gamma_{k+1}&\leq \Big(1+\frac{B}{2J_{n}}\Big)\Big(1-\frac{B}{J_{n}}\Big)\Gamma_{k}+\Big(2J_{n}+\frac{4J_{n}^{2}}{B}\Big)\frac{M_{1}^{2}}{N}\big[\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}\big]\\
&\leq \Big(1-\frac{B}{2J_{n}}\Big)\Gamma_{k}+\Big(2J_{n}+\frac{4J_{n}^{2}}{B}\Big)\frac{M_{1}^{2}}{N}\big[\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}\big].
\end{align*}

This proves the geometric decay of $\Gamma_{k}$ in expectation. As in Appendix B of [16], the third condition in Definition 4 also holds. For the SARAH stochastic gradient estimator, the corresponding results follow directly, as in Lemma 5 of [58]. This completes the proof of Proposition 1 (2). ∎
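The contraction step in the proof above relies on $(1+\frac{B}{2J_{n}})(1-\frac{B}{J_{n}})\leq 1-\frac{B}{2J_{n}}$, which holds because the product expands to $1-\frac{B}{2J_{n}}-\frac{B^{2}}{2J_{n}^{2}}$. The sketch below checks this for every batch size and iterates the resulting recursion with a zero drift term. The function names, the choice $J_{n}=50$, $B=5$, and the zero drift are assumptions made only for illustration.

```python
def contraction_factor(B, J):
    """(1 + delta) * (1 - B/J) with delta = B / (2J), as chosen in the proof."""
    delta = B / (2 * J)
    return (1 + delta) * (1 - B / J)

def run_recursion(B, J, gamma0, drift, steps):
    """Iterate gamma <- (1 - B/(2J)) * gamma + drift[k] and return all values."""
    rho = 1 - B / (2 * J)
    gamma = gamma0
    history = [gamma]
    for k in range(steps):
        gamma = rho * gamma + drift[k]
        history.append(gamma)
    return history

# Check the factor bound for every batch size 1 <= B <= J (J = 50 is illustrative).
J = 50
factor_bound_holds = all(
    contraction_factor(B, J) <= 1 - B / (2 * J) for B in range(1, J + 1)
)

# With zero drift (iterates no longer move) the recursion decays geometrically.
hist = run_recursion(B=5, J=50, gamma0=1.0, drift=[0.0] * 200, steps=200)
```

With a nonzero drift the sequence instead settles near a level proportional to the size of the iterate differences, which is why the bound is combined with the descent estimate on $\|A^{k}-A^{k-1}\|^{2}$.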

Corollary 1.

If $\psi:=\frac{1}{2}\|\cdot\|^{2}$, the inequality from Lemma 2 becomes

\[
\mathbb{E}_{k}[\Psi_{k+1}]\leq\Psi_{k}-\frac{\epsilon}{6}\left(\mathbb{E}_{k}[\|A^{k+1}-A^{k}\|^{2}]+\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}\right).
\]
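The reason the Euclidean kernel yields squared distances is that the Bregman distance generated by $\psi=\frac{1}{2}\|\cdot\|^{2}$ is exactly $\frac{1}{2}\|x-y\|^{2}$. A minimal sketch of this fact follows, using the standard definition $D_{\psi}(x,y)=\psi(x)-\psi(y)-\langle\nabla\psi(y),x-y\rangle$; the function names and test vectors are illustrative, and the paper's own notation for the Bregman distance may differ.

```python
import numpy as np

def bregman(psi, grad_psi, x, y):
    """Bregman distance D_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>."""
    return psi(x) - psi(y) - np.dot(grad_psi(y), x - y)

def psi_euclid(x):
    """Euclidean kernel psi(x) = 0.5 * ||x||^2."""
    return 0.5 * np.dot(x, x)

def grad_psi_euclid(x):
    """Gradient of the Euclidean kernel is the identity map."""
    return x

# Illustrative vectors, not taken from the paper.
x = np.array([1.0, -2.0, 0.5])
y = np.array([0.0, 1.0, 2.0])

d = bregman(psi_euclid, grad_psi_euclid, x, y)
half_sq = 0.5 * np.sum((x - y) ** 2)  # the two quantities coincide
```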

We can now prove the following result, which shows that the subgradient of $\Phi(A^{k})$ is bounded.

Lemma 3.

Suppose that Assumptions 1-2 hold and the stepsize $\eta_{k}$ satisfies $0<\eta\leq\eta_{k}$ and (24). The sequence $\{A_{1}^{k},\dots,A_{N}^{k}\}$ generated by iTableSMD is bounded for all $k$. Define

\[
P_{n}^{k+1}:=\nabla_{A_{n}}f(A^{k+1})-\tilde{\nabla}_{A_{n}}f(\underline{A}^{k})+\frac{1}{\eta_{k}}\big(\nabla\psi(\phi_{n}^{k})-\nabla\psi(A_{n}^{k+1})\big),
\]

where $P_{n}^{k+1}\in\partial_{n}\Phi(A^{k+1})$ and $P^{k+1}=(P_{1}^{k+1},P_{2}^{k+1},\dots,P_{N}^{k+1})$, implying that $P^{k+1}\in\partial\Phi(A^{k+1})$. Then, we can obtain

\[
\mathbb{E}_{k}\|P^{k+1}\|\leq w\big(\mathbb{E}_{k}\|A^{k+1}-A^{k}\|+\|A^{k}-A^{k-1}\|+\|A^{k-1}-A^{k-2}\|\big)+\Upsilon_{k},
\]

where $w=\max\left\{M_{1}+\frac{M_{2}}{\eta},\,V_{2}+\beta^{k}M_{1}+\frac{\alpha^{k}M_{2}}{\eta},\,V_{2}\right\}$.

Proof.

From the implicit definition of the proximal operator (15) in the iTableSMD algorithm, we have

\[
0\in\partial h_{n}(A_{n}^{k+1})+\tilde{\nabla}_{A_{n}}f(\underline{A}^{k})+\frac{1}{\eta_{k}}\big(\nabla\psi(A_{n}^{k+1})-\nabla\psi(\tilde{A}_{n}^{k})\big),
\]

where $\underline{A}^{k}:=(A_{1}^{k},\dots,A_{n-1}^{k},\underline{A}_{n}^{k},A_{n+1}^{k},\dots,A_{N}^{k})$. Combining it with $\partial_{n}\Phi(A^{k+1})\equiv\nabla_{A_{n}}f(A^{k+1})+\partial h_{n}(A_{n}^{k+1})$, we have $P_{n}^{k+1}\in\partial_{n}\Phi(A^{k+1})$. Furthermore, in Problem (9), with $h(A^{k+1})=\sum_{n=1}^{N}h_{n}(A_{n})$, we have $P^{k+1}=(P_{1}^{k+1},P_{2}^{k+1},\dots,P_{N}^{k+1})$, and it follows that $P^{k+1}\in\partial\Phi(A^{k+1})$, where $\partial\Phi(A^{k+1})\equiv\nabla f(A^{k+1})+\partial h(A^{k+1})$.
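The optimality condition above can be illustrated in a special case. The sketch below assumes the Euclidean kernel $\psi=\frac{1}{2}\|\cdot\|^{2}$ and an $\ell_{1}$ regularizer $h_{n}=\lambda\|\cdot\|_{1}$, for which the update reduces to soft-thresholding. Neither choice is claimed to be the paper's setting; the vectors `a_tilde` and `grad` are random stand-ins for $\tilde{A}_{n}^{k}$ and $\tilde{\nabla}_{A_{n}}f(\underline{A}^{k})$, and all function names are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_step(a_tilde, grad, eta, lam):
    """argmin_a lam*||a||_1 + <grad, a> + ||a - a_tilde||^2 / (2*eta)."""
    return soft_threshold(a_tilde - eta * grad, eta * lam)

def in_l1_subdifferential(s, a, lam, tol=1e-10):
    """Check s in lam * d||a||_1: s_i = lam*sign(a_i) if a_i != 0, else |s_i| <= lam."""
    nonzero = a != 0
    ok_nonzero = np.allclose(s[nonzero], lam * np.sign(a[nonzero]), atol=tol)
    ok_zero = np.all(np.abs(s[~nonzero]) <= lam + tol)
    return bool(ok_nonzero and ok_zero)

rng = np.random.default_rng(1)
a_tilde = rng.normal(size=8)   # stand-in for the extrapolated block
grad = rng.normal(size=8)      # stand-in for the stochastic gradient estimate
eta, lam = 0.3, 0.5            # illustrative stepsize and regularization weight

a_next = prox_step(a_tilde, grad, eta, lam)
# The inclusion 0 in dh(a_next) + grad + (a_next - a_tilde)/eta is equivalent to
# s := -(grad + (a_next - a_tilde)/eta) being an element of dh(a_next).
s = -(grad + (a_next - a_tilde) / eta)
condition_holds = in_l1_subdifferential(s, a_next, lam)
```

Adding $\nabla_{A_{n}}f(A^{k+1})$ to both sides of the inclusion is what turns the element `s` into a member of $\partial_{n}\Phi(A^{k+1})$, which is how $P_{n}^{k+1}$ is constructed.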

All that remains is to bound the norm of $P^{k+1}$. Suppose $n=\xi^{k}$ at the $k$-th iteration. Then

𝔼kPk+1subscript𝔼𝑘normsuperscript𝑃𝑘1\displaystyle\mathbb{E}_{k}\|P^{k+1}\|blackboard_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ italic_P start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT ∥
=\displaystyle== 𝔼kPξkk+1subscript𝔼𝑘normsuperscriptsubscript𝑃superscript𝜉𝑘𝑘1\displaystyle\mathbb{E}_{k}\|P_{\xi^{k}}^{k+1}\|blackboard_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ italic_P start_POSTSUBSCRIPT italic_ξ start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT ∥
\displaystyle\leq 𝔼kAξkf(Ak+1)~Aξkf(A¯k)+1ηk(ψ(A~ξkk)ψ(Aξkk+1))subscript𝔼𝑘normsubscriptsubscript𝐴superscript𝜉𝑘𝑓superscript𝐴𝑘1subscript~subscript𝐴superscript𝜉𝑘𝑓superscript¯𝐴𝑘1subscript𝜂𝑘𝜓superscriptsubscript~𝐴superscript𝜉𝑘𝑘𝜓superscriptsubscript𝐴superscript𝜉𝑘𝑘1\displaystyle\mathbb{E}_{k}\|\nabla_{A_{\xi^{k}}}f(A^{k+1})-\tilde{\nabla}_{A_% {\xi^{k}}}f(\underline{A}^{k})+\frac{1}{\eta_{k}}(\nabla\psi(\tilde{A}_{\xi^{k% }}^{k})-\nabla\psi(A_{\xi^{k}}^{k+1}))\|blackboard_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ ∇ start_POSTSUBSCRIPT italic_A start_POSTSUBSCRIPT italic_ξ start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_f ( italic_A start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT ) - over~ start_ARG ∇ end_ARG start_POSTSUBSCRIPT italic_A start_POSTSUBSCRIPT italic_ξ start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_f ( under¯ start_ARG italic_A end_ARG start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) + divide start_ARG 1 end_ARG start_ARG italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG ( ∇ italic_ψ ( over~ start_ARG italic_A end_ARG start_POSTSUBSCRIPT italic_ξ start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) - ∇ italic_ψ ( italic_A start_POSTSUBSCRIPT italic_ξ start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT ) ) ∥
\displaystyle\leq 𝔼kAξkf(Ak+1)~Aξkf(A¯k)+1ηk𝔼kψ(A~ξkk)ψ(Aξkk+1)subscript𝔼𝑘normsubscriptsubscript𝐴superscript𝜉𝑘𝑓superscript𝐴𝑘1subscript~subscript𝐴superscript𝜉𝑘𝑓superscript¯𝐴𝑘1subscript𝜂𝑘subscript𝔼𝑘norm𝜓superscriptsubscript~𝐴superscript𝜉𝑘𝑘𝜓superscriptsubscript𝐴superscript𝜉𝑘𝑘1\displaystyle\mathbb{E}_{k}\|\nabla_{A_{\xi^{k}}}f(A^{k+1})-\tilde{\nabla}_{A_% {\xi^{k}}}f(\underline{A}^{k})\|+\frac{1}{\eta_{k}}\mathbb{E}_{k}\|\nabla\psi(% \tilde{A}_{\xi^{k}}^{k})-\nabla\psi(A_{\xi^{k}}^{k+1})\|blackboard_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ ∇ start_POSTSUBSCRIPT italic_A start_POSTSUBSCRIPT italic_ξ start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_f ( italic_A start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT ) - over~ start_ARG ∇ end_ARG start_POSTSUBSCRIPT italic_A start_POSTSUBSCRIPT italic_ξ start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_f ( under¯ start_ARG italic_A end_ARG start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) ∥ + divide start_ARG 1 end_ARG start_ARG italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG blackboard_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ ∇ italic_ψ ( over~ start_ARG italic_A end_ARG start_POSTSUBSCRIPT italic_ξ start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) - ∇ italic_ψ ( italic_A start_POSTSUBSCRIPT italic_ξ start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT ) ∥
\displaystyle\leq 𝔼kAξkf(Ak+1)Aξkf(A¯k)+𝔼kAξkf(A¯k)~Aξkf(A¯k)+1ηk𝔼kψ(A~ξkk)ψ(Aξkk+1)subscript𝔼𝑘normsubscriptsubscript𝐴superscript𝜉𝑘𝑓superscript𝐴𝑘1subscriptsubscript𝐴superscript𝜉𝑘𝑓superscript¯𝐴𝑘subscript𝔼𝑘normsubscriptsubscript𝐴superscript𝜉𝑘𝑓superscript¯𝐴𝑘subscript~subscript𝐴superscript𝜉𝑘𝑓superscript¯𝐴𝑘1subscript𝜂𝑘subscript𝔼𝑘norm𝜓superscriptsubscript~𝐴superscript𝜉𝑘𝑘𝜓superscriptsubscript𝐴superscript𝜉𝑘𝑘1\displaystyle\mathbb{E}_{k}\|\nabla_{A_{\xi^{k}}}f(A^{k+1})-\nabla_{A_{\xi^{k}% }}f(\underline{A}^{k})\|+\mathbb{E}_{k}\|\nabla_{A_{\xi^{k}}}f(\underline{A}^{% k})-\tilde{\nabla}_{A_{\xi^{k}}}f(\underline{A}^{k})\|+\frac{1}{\eta_{k}}% \mathbb{E}_{k}\|\nabla\psi(\tilde{A}_{\xi^{k}}^{k})-\nabla\psi(A_{\xi^{k}}^{k+% 1})\|blackboard_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ ∇ start_POSTSUBSCRIPT italic_A start_POSTSUBSCRIPT italic_ξ start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_f ( italic_A start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT ) - ∇ start_POSTSUBSCRIPT italic_A start_POSTSUBSCRIPT italic_ξ start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_f ( under¯ start_ARG italic_A end_ARG start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) ∥ + blackboard_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ ∇ start_POSTSUBSCRIPT italic_A start_POSTSUBSCRIPT italic_ξ start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_f ( under¯ start_ARG italic_A end_ARG start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) - over~ start_ARG ∇ end_ARG start_POSTSUBSCRIPT italic_A start_POSTSUBSCRIPT italic_ξ start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_f ( under¯ start_ARG italic_A end_ARG start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) ∥ + divide start_ARG 1 end_ARG start_ARG italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG blackboard_E start_POSTSUBSCRIPT italic_k 
end_POSTSUBSCRIPT ∥ ∇ italic_ψ ( over~ start_ARG italic_A end_ARG start_POSTSUBSCRIPT italic_ξ start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) - ∇ italic_ψ ( italic_A start_POSTSUBSCRIPT italic_ξ start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT ) ∥
\begin{align*}
\leq{}& M_{1}\mathbb{E}_{k}\|A_{\xi^{k}}^{k+1}-\underline{A}_{\xi^{k}}^{k}\|+\Upsilon_{k}+V_{2}\|A^{k}-A^{k-1}\|+V_{2}\|A^{k-1}-A^{k-2}\|+\frac{M_{2}}{\eta_{k}}\mathbb{E}_{k}\|A_{\xi^{k}}^{k+1}-\tilde{A}_{\xi^{k}}^{k}\| \\
\leq{}& M_{1}\mathbb{E}_{k}\|A^{k+1}-\underline{A}^{k}\|+\Upsilon_{k}+V_{2}\|A^{k}-A^{k-1}\|+V_{2}\|A^{k-1}-A^{k-2}\|+\frac{M_{2}}{\eta_{k}}\mathbb{E}_{k}\|A^{k+1}-\tilde{A}^{k}\| \\
\leq{}& \left(M_{1}+\frac{M_{2}}{\eta_{k}}\right)\mathbb{E}_{k}\|A^{k+1}-A^{k}\|+\left(V_{2}+\beta^{k}M_{1}+\frac{\alpha^{k}M_{2}}{\eta_{k}}\right)\|A^{k}-A^{k-1}\|+V_{2}\|A^{k-1}-A^{k-2}\|+\Upsilon_{k} \\
\leq{}& \left(M_{1}+\frac{M_{2}}{\eta}\right)\mathbb{E}_{k}\|A^{k+1}-A^{k}\|+\left(V_{2}+\beta^{k}M_{1}+\frac{\alpha^{k}M_{2}}{\eta}\right)\|A^{k}-A^{k-1}\|+V_{2}\|A^{k-1}-A^{k-2}\|+\Upsilon_{k} \\
\leq{}& w\left(\mathbb{E}_{k}\|A^{k+1}-A^{k}\|+\|A^{k}-A^{k-1}\|+\|A^{k-1}-A^{k-2}\|\right)+\Upsilon_{k},
\end{align*}

where $w=\max\left\{M_{1}+\frac{M_{2}}{\eta},\;V_{2}+\beta^{k}M_{1}+\frac{\alpha^{k}M_{2}}{\eta},\;V_{2}\right\}$. This completes the proof. ∎

Lemma 4.

Under the same conditions as in Lemma 3, there exists a constant $\bar{w}>0$ such that

\[
\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A^{k+1}))^{2}]\leq\bar{w}\left(\mathbb{E}_{k}[\|A^{k+1}-A^{k}\|^{2}]+\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}\right)+3\mathbb{E}\Gamma_{k}.
\]
Proof.

By Lemma 3, we have

\begin{align*}
\mathbb{E}_{k}\|P^{k+1}\|^{2}
\leq{}& 3\mathbb{E}_{k}\|\nabla_{A_{\xi^{k}}}f(A^{k+1})-\nabla_{A_{\xi^{k}}}f(\underline{A}^{k})\|^{2}+3\mathbb{E}_{k}\|\nabla_{A_{\xi^{k}}}f(\underline{A}^{k})-\tilde{\nabla}_{A_{\xi^{k}}}f(\underline{A}^{k})\|^{2} \\
& +\frac{3}{\eta_{k}}\mathbb{E}_{k}\|\nabla\psi(\tilde{A}_{\xi^{k}}^{k})-\nabla\psi(A_{\xi^{k}}^{k+1})\|^{2} \\
\leq{}& 3M_{1}^{2}\mathbb{E}_{k}\|A^{k+1}-\underline{A}^{k}\|^{2}+3\Gamma_{k}+3V_{1}\|A^{k}-A^{k-1}\|^{2}+3V_{1}\|A^{k-1}-A^{k-2}\|^{2}+\frac{3M_{2}^{2}}{\eta_{k}}\mathbb{E}_{k}\|A^{k+1}-\tilde{A}^{k}\|^{2} \\
\leq{}& \left(6M_{1}^{2}+\frac{6M_{2}^{2}}{\eta_{k}}\right)\mathbb{E}_{k}\|A^{k+1}-A^{k}\|^{2}+\left(3V_{1}+6\beta_{k}^{2}M_{1}^{2}+\frac{6\alpha_{k}^{2}M_{2}^{2}}{\eta_{k}}\right)\|A^{k}-A^{k-1}\|^{2} \\
& +3V_{1}\|A^{k-1}-A^{k-2}\|^{2}+3\Gamma_{k} \\
\leq{}& \left(6M_{1}^{2}+\frac{6M_{2}^{2}}{\eta}\right)\mathbb{E}_{k}\|A^{k+1}-A^{k}\|^{2}+\left(3V_{1}+6\beta_{k}^{2}M_{1}^{2}+\frac{6\alpha_{k}^{2}M_{2}^{2}}{\eta}\right)\|A^{k}-A^{k-1}\|^{2} \\
& +3V_{1}\|A^{k-1}-A^{k-2}\|^{2}+3\Gamma_{k} \\
\leq{}& \bar{w}\left(\mathbb{E}_{k}[\|A^{k+1}-A^{k}\|^{2}]+\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}\right)+3\mathbb{E}\Gamma_{k},
\end{align*}

where $\bar{w}:=\max\left\{6M_{1}^{2}+\frac{6M_{2}^{2}}{\eta},\;3V_{1}+6\beta_{k}^{2}M_{1}^{2}+\frac{6\alpha_{k}^{2}M_{2}^{2}}{\eta},\;3V_{1}\right\}$. Since $\mathrm{dist}\left(0,\partial\Phi\left(A^{k+1}\right)\right)^{2}\leq\|P^{k+1}\|^{2}$, taking the full expectation on both sides yields

\[
\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A^{k+1}))^{2}]\leq\bar{w}\left(\mathbb{E}_{k}[\|A^{k+1}-A^{k}\|^{2}]+\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}\right)+3\mathbb{E}\Gamma_{k}.
\]

This completes the proof. ∎
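The factors 3 and 6 in the proof of Lemma 4 come from two elementary norm inequalities: $\|a+b+c\|^{2}\leq 3(\|a\|^{2}+\|b\|^{2}+\|c\|^{2})$ and $\|x-y-\beta(y-z)\|^{2}\leq 2\|x-y\|^{2}+2\beta^{2}\|y-z\|^{2}$. The following is a minimal numerical sanity check of both, not part of the proof. It assumes the extrapolated point has the usual inertial form $\underline{A}^{k}=A^{k}+\beta_{k}(A^{k}-A^{k-1})$ (and likewise $\tilde{A}^{k}$ with $\alpha_{k}$), which is consistent with the coefficients $6\beta_{k}^{2}M_{1}^{2}$ and $6\alpha_{k}^{2}M_{2}^{2}/\eta_{k}$ above but is not restated in this section. All function and variable names are illustrative.

```python
import random


def sq_norm(v):
    """Squared Euclidean norm of a vector stored as a list of floats."""
    return sum(x * x for x in v)


def check_inequalities(rng, dim=6, tol=1e-12):
    """Check both inequalities on one random draw; True if both hold."""
    def rand_vec():
        return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

    # (i) ||a + b + c||^2 <= 3 (||a||^2 + ||b||^2 + ||c||^2)
    a, b, c = rand_vec(), rand_vec(), rand_vec()
    lhs1 = sq_norm([x + y + z for x, y, z in zip(a, b, c)])
    rhs1 = 3.0 * (sq_norm(a) + sq_norm(b) + sq_norm(c))

    # (ii) ASSUMED extrapolation: A_under = A_k + beta * (A_k - A_prev)
    A_next, A_k, A_prev = rand_vec(), rand_vec(), rand_vec()
    beta = rng.uniform(0.0, 1.0)
    A_under = [y + beta * (y - z) for y, z in zip(A_k, A_prev)]
    lhs2 = sq_norm([x - u for x, u in zip(A_next, A_under)])
    rhs2 = (2.0 * sq_norm([x - y for x, y in zip(A_next, A_k)])
            + 2.0 * beta ** 2 * sq_norm([y - z for y, z in zip(A_k, A_prev)]))

    # tol absorbs floating-point rounding only
    return lhs1 <= rhs1 + tol and lhs2 <= rhs2 + tol


# Repeat over many seeds; both inequalities should hold on every draw.
all_hold = all(check_inequalities(random.Random(seed)) for seed in range(200))
```

Applying (i) to the three-term splitting of $P^{k+1}$ gives the leading factors of 3, and applying (ii) to $\|A^{k+1}-\underline{A}^{k}\|^{2}$ and $\|A^{k+1}-\tilde{A}^{k}\|^{2}$ doubles them to 6.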

Using Lemma 4, we establish the convergence rate of the expected squared distance from $0$ to the subdifferential $\partial\Phi$.

Theorem 2.

Assume that Assumptions 1-2 hold, and the stepsize satisfies $0<\eta\leq\eta_{k}$ and (24). Let the sequence $\{A^{k}\}_{k\in\mathbb{N}}$ generated by iTableSMD be bounded for all $k$. Then there exists $0<\sigma<\epsilon/6$ such that

\[
\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A^{\hat{k}}))^{2}]\leq\frac{\bar{w}}{(\epsilon/6-\sigma)K}\left(\mathbb{E}\Psi_{1}+\frac{\epsilon/2-3\sigma}{\tau\bar{w}}\mathbb{E}\Gamma_{1}\right)=\mathcal{O}(1/K),
\]

where $\hat{k}$ is drawn from $\{2,\dots,K+1\}$. In other words, it takes at most $\mathcal{O}(\epsilon^{-2})$ iterations in expectation to obtain an $\epsilon$-stationary point (see Definition 3) of $\Phi$.
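To make the step from the $\mathcal{O}(1/K)$ bound to the iteration count concrete, here is a small sketch. It assumes that an $\epsilon$-stationary point in Definition 3 means $\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A))]\leq\epsilon$ (the definition is not restated here), so that Jensen's inequality gives $\mathbb{E}[\mathrm{dist}]\leq\sqrt{C/K}$ whenever $\mathbb{E}[\mathrm{dist}^{2}]\leq C/K$. The constant `C` stands for the whole prefactor of $1/K$ in the bound, and `tol` for the target accuracy; both names, and the function itself, are hypothetical.

```python
import math


def iterations_for_tolerance(C: float, tol: float) -> int:
    """Smallest integer K with sqrt(C / K) <= tol, i.e. K >= C / tol**2.

    C   : prefactor in the bound E[dist^2] <= C / K (assumed known)
    tol : target accuracy for E[dist] (assumed reading of Definition 3)
    """
    if C <= 0 or tol <= 0:
        raise ValueError("C and tol must be positive")
    return math.ceil(C / tol ** 2)


# Halving the tolerance quadruples the iteration budget.
K_coarse = iterations_for_tolerance(8.0, 0.5)
K_fine = iterations_for_tolerance(8.0, 0.25)
```

The quadratic growth of `K` as `tol` shrinks is exactly the $\mathcal{O}(\epsilon^{-2})$ dependence stated in the theorem, under the assumed reading of Definition 3.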

Proof.

By Corollary 1 and Lemma 4, we have

\begin{align*}
\mathbb{E}[\Psi_{k}-\Psi_{k+1}]
\geq{}& \frac{\epsilon}{6}\mathbb{E}[\|A^{k+1}-A^{k}\|^{2}+\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}] \\
\geq{}& \sigma\mathbb{E}[\|A^{k+1}-A^{k}\|^{2}+\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}]+\frac{\epsilon/6-\sigma}{\bar{w}}\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A^{k+1}))^{2}] \\
& -\frac{\epsilon/2-3\sigma}{\bar{w}}\mathbb{E}\Gamma_{k} \\
\geq{}& \sigma\mathbb{E}[\|A^{k+1}-A^{k}\|^{2}+\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}]+\frac{\epsilon/6-\sigma}{\bar{w}}\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A^{k+1}))^{2}] \\
& +\frac{\epsilon/2-3\sigma}{\tau\bar{w}}\mathbb{E}[\Gamma_{k+1}-\Gamma_{k}]-\frac{(\epsilon/2-3\sigma)V_{\Gamma}}{\tau\bar{w}}\mathbb{E}[\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}] \\
\geq{}& \sigma\mathbb{E}[\|A^{k+1}-A^{k}\|^{2}+\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}]+\frac{\epsilon/6-\sigma}{\bar{w}}\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A^{k+1}))^{2}] \\
& +\frac{\epsilon/2-3\sigma}{\tau\bar{w}}\mathbb{E}[\Gamma_{k+1}-\Gamma_{k}]-\frac{(\epsilon/2-3\sigma)V_{\Gamma}}{\tau\bar{w}}\mathbb{E}[\|A^{k+1}-A^{k}\|^{2}+\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}],
\end{align*}

where the third inequality follows from (18) in Definition 4, and the fourth holds because $\|A^{k+1}-A^{k}\|^{2}\geq 0$. If we let $\sigma=\frac{(\epsilon/2-3\sigma)V_{\Gamma}}{\tau\bar{w}}$, i.e., $\sigma=\frac{\epsilon V_{\Gamma}}{2(3V_{\Gamma}+\tau\bar{w})}$, then the first and last terms cancel and we obtain

\[
\mathbb{E}[\Psi_{k}-\Psi_{k+1}]\geq\frac{\epsilon/6-\sigma}{\bar{w}}\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A^{k+1}))^{2}]+\frac{\epsilon/2-3\sigma}{\tau\bar{w}}\mathbb{E}[\Gamma_{k+1}-\Gamma_{k}].
\]
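The choice of $\sigma$ is defined implicitly, so it is worth confirming that the closed form $\sigma=\frac{\epsilon V_{\Gamma}}{2(3V_{\Gamma}+\tau\bar{w})}$ really solves $\sigma=\frac{(\epsilon/2-3\sigma)V_{\Gamma}}{\tau\bar{w}}$ and satisfies $0<\sigma<\epsilon/6$, as the theorem requires. The sketch below does this in exact rational arithmetic. The numerical values are arbitrary positive placeholders chosen for illustration, not constants from the paper.

```python
from fractions import Fraction


def sigma_closed_form(eps, V_gamma, tau, w_bar):
    """Closed form sigma = eps * V_Gamma / (2 * (3 * V_Gamma + tau * w_bar))."""
    return eps * V_gamma / (2 * (3 * V_gamma + tau * w_bar))


# Arbitrary positive test values (hypothetical, for illustration only).
eps = Fraction(3, 10)
V_gamma = Fraction(5)
tau = Fraction(1, 4)
w_bar = Fraction(7)

sigma = sigma_closed_form(eps, V_gamma, tau, w_bar)

# Right-hand side of the implicit equation, evaluated at the closed form.
rhs = (eps / 2 - 3 * sigma) * V_gamma / (tau * w_bar)

fixed_point_holds = (sigma == rhs)          # exact equality with Fractions
in_range = (0 < sigma < eps / 6)            # needed so that eps/6 - sigma > 0
```

The bound $\sigma<\epsilon/6$ holds for any positive $V_{\Gamma}$, $\tau$, $\bar{w}$, since $6V_{\Gamma}<6V_{\Gamma}+2\tau\bar{w}$; this keeps the coefficient $(\epsilon/6-\sigma)/\bar{w}$ of the distance term strictly positive.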

Summing from $k=1$ to $K$, we have

$$\mathbb{E}[\Psi_{1}-\Psi_{K+1}]\geq\frac{\epsilon/6-\sigma}{\bar{w}}\sum_{k=1}^{K}\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A^{k+1}))^{2}]+\frac{\epsilon/2-3\sigma}{\tau\bar{w}}\mathbb{E}[\Gamma_{K+1}-\Gamma_{1}],$$

which means there exists a $\hat{k}\in\{2,\dots,K+1\}$ such that

$$\begin{aligned}
\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A^{\hat{k}}))^{2}]\leq{}&\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A^{k+1}))^{2}]\\
\leq{}&\frac{\bar{w}}{(\epsilon/6-\sigma)K}\left(\mathbb{E}[\Psi_{1}-\Psi_{K+1}]+\frac{\epsilon/2-3\sigma}{\tau\bar{w}}\mathbb{E}[\Gamma_{1}-\Gamma_{K+1}]\right)\\
\leq{}&\frac{\bar{w}}{(\epsilon/6-\sigma)K}\left(\mathbb{E}\Psi_{1}+\frac{\epsilon/2-3\sigma}{\tau\bar{w}}\mathbb{E}\Gamma_{1}\right).
\end{aligned}$$

This completes the proof. ∎
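As a quick sanity check of the choice of $\sigma$ in the proof above, the following sketch verifies numerically that the closed form $\sigma=\frac{\epsilon V_{\Gamma}}{2(3V_{\Gamma}+\tau\bar{w})}$ solves the defining equation $\sigma=\frac{(\epsilon/2-3\sigma)V_{\Gamma}}{\tau\bar{w}}$. The function names and the numeric values of $\epsilon$, $V_{\Gamma}$, $\tau$, and $\bar{w}$ are placeholders assumed for illustration only; they are not taken from the paper.

```python
def sigma_closed_form(eps, V_gamma, tau, w_bar):
    """Closed form: sigma = eps * V_Gamma / (2 * (3 * V_Gamma + tau * w_bar))."""
    return eps * V_gamma / (2.0 * (3.0 * V_gamma + tau * w_bar))


def sigma_fixed_point_rhs(sigma, eps, V_gamma, tau, w_bar):
    """Right-hand side of sigma = (eps/2 - 3*sigma) * V_Gamma / (tau * w_bar)."""
    return (eps / 2.0 - 3.0 * sigma) * V_gamma / (tau * w_bar)


# Hypothetical constants, chosen only to exercise the formulas.
eps, V_gamma, tau, w_bar = 0.9, 2.0, 5.0, 1.5
sigma = sigma_closed_form(eps, V_gamma, tau, w_bar)
residual = abs(sigma - sigma_fixed_point_rhs(sigma, eps, V_gamma, tau, w_bar))
print(sigma, residual)  # the residual should vanish up to rounding
```

With this choice, a short calculation gives $\epsilon/6-\sigma=\frac{\epsilon\tau\bar{w}}{6(3V_{\Gamma}+\tau\bar{w})}$, which is positive whenever $\epsilon$, $\tau$, and $\bar{w}$ are positive, so the leading coefficient in the final bound is well defined.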

We define the set of cluster points of $\{A^{k}\}_{k\in\mathbb{N}}$ as

$$\Omega(A^{0}):=\{A^{*}:\exists\text{ an increasing sequence of integers }\{k_{l}\}_{l\in\mathbb{N}}\text{ such that }A^{k_{l}}\rightarrow A^{*}\text{ as }l\rightarrow+\infty\}.\tag{30}$$

Lemma 5.

Suppose that Assumptions 1 and 2 hold and that the step size $\eta_{k}$ satisfies $0<\eta\leq\eta_{k}$ and (24). Then the following statements hold.

  • (1)

    $\sum_{k=0}^{\infty}\|A^{k+1}-A^{k}\|^{2}<+\infty$ a.s., and $\lim_{k\rightarrow+\infty}\|A^{k+1}-A^{k}\|=0$ a.s.

  • (2)

    $\mathbb{E}[\Phi(A^{k})]\rightarrow\Phi^{*}$, where $\Phi^{*}\in[\mathcal{V}(\Phi),+\infty)$ with $\mathcal{V}(\Phi):=\inf_{A}\Phi(A)$, and $\mathbb{E}\Phi(A^{*})=\Phi^{*}$ for all $A^{*}\in\Omega(A^{0})$.

  • (3)

    $\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A^{k}))]\rightarrow 0$. Moreover, the set $\Omega(A^{0})$ is nonempty, and $\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A^{*}))]=0$ for all $A^{*}\in\Omega(A^{0})$.

  • (4)

    $\mathrm{dist}(A^{k},\Omega(A^{0}))\rightarrow 0$ a.s., and $\Omega(A^{0})$ is a.s. compact and connected.

Proof.

The proof of the above statements is similar to that of Lemma 9 in [58], so we omit the details here for simplicity. ∎

The following lemma, taken from [16], is analogous to the uniformized KŁ property of [4] and allows us to apply the KŁ inequality.

Lemma 6.

Assume that $\{A^{k}\}_{k\in\mathbb{N}}$ is a bounded sequence of iterates generated by the iTableSMD algorithm using a variance-reduced gradient estimator (see Definition 4). Let $\Phi$ be a semialgebraic function satisfying the KŁ property [4] with exponent $\theta$. Then there exist an index $\bar{k}$ and a desingularizing function $\phi(r)=ar^{1-\theta}$ with $a>0$, $\theta\in[0,1)$ such that the following bound holds almost surely (a.s.):

$$\phi^{\prime}(\mathbb{E}[\Phi(A^{k})-\Phi_{k}^{*}])\,\mathbb{E}\,\mathrm{dist}(0,\partial\Phi(A^{k}))\geq 1,\quad\forall k>\bar{k},\tag{31}$$

where $\Phi_{k}^{*}$ is a nondecreasing sequence converging to $\mathbb{E}\Phi(A^{*})$ for some $A^{*}\in\Omega(A^{0})$.
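To make the role of the desingularizing function concrete, the sketch below evaluates the deterministic analogue of the KŁ bound, $\phi^{\prime}(\Phi(x)-\Phi^{*})\,|\Phi^{\prime}(x)|\geq 1$, for the toy objective $\Phi(x)=x^{2}$ with $\Phi^{*}=0$. The toy objective and the constants $a=2$, $\theta=1/2$ are assumptions chosen for illustration; they are not the objective or constants of the paper, and no expectation is involved.

```python
def phi_prime(r, a, theta):
    """Derivative of the desingularizing function phi(r) = a * r**(1 - theta)."""
    return a * (1.0 - theta) * r ** (-theta)


def kl_product(x, a=2.0, theta=0.5):
    """phi'(Phi(x) - Phi*) * |Phi'(x)| for the toy objective Phi(x) = x**2, Phi* = 0."""
    gap = x * x               # Phi(x) - Phi*
    grad_norm = abs(2.0 * x)  # dist(0, dPhi(x)) reduces to |Phi'(x)| for smooth Phi
    return phi_prime(gap, a, theta) * grad_norm


# Evaluate the product at several points away from the minimizer x = 0.
products = [kl_product(x) for x in (1e-3, 0.1, 1.0, 10.0)]
print(products)
```

For this toy problem the product equals $a$ at every $x\neq 0$, so the inequality holds for any $a\geq 1$; the exponent $\theta=1/2$ reflects the quadratic growth of $\Phi$ around its minimizer.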

We now state the global convergence result for the iTableSMD algorithm. It follows from the lemmas above, so we omit the proof; see [16, 58] for details.

Theorem 3.

Suppose that Assumptions 1 and 2 hold and that the step size $\eta_{k}$ satisfies $0<\eta\leq\eta_{k}$ and (24). Let $\{A^{k}\}_{k\in\mathbb{N}}$ be the sequence generated by the iTableSMD algorithm, which is assumed to be bounded. If the objective function $\Phi$ is a semialgebraic function that satisfies the KŁ property with exponent $\theta\in[0,1)$ (see Lemma 6), then either the point $A^{k}$ is a critical point after a finite number of iterations or the sequence $\{A^{k}\}_{k\in\mathbb{N}}$ almost surely satisfies the finite length property in expectation, namely,

$$\sum_{k=0}^{+\infty}\mathbb{E}\|A^{k+1}-A^{k}\|<+\infty.$$

5 Numerical experiments

In this section, we evaluate the proposed iTableSMD (Algorithm 1) on synthetic datasets as well as multiple real-world datasets. We aim to demonstrate its efficiency through comparisons with two state-of-the-art algorithms, which serve as our main baselines. The first is GCP-OPT, an entry-sampling-based stochastic non-Euclidean CP decomposition algorithm proposed in [31, 26]. The GCP-OPT method is implemented in Tensor Toolbox, and “Adam” is selected as the optimization solver. The sampling rule of GCP-OPT is the default “uniform” setting for dense tensors unless specified otherwise for particular examples. The second is SmartCPD [49], a flexible stochastic mirror descent framework based on tensor fiber sampling.

We perform experiments involving low-rank GCP decomposition with nonnegative constraints $A_{n}\geq 0$ for $n=1,\dots,N$. We consider three synthetic data distributions: Gamma, Poisson, and Bernoulli. We also use several real datasets, including the Enron emails dataset [50] and six months of Uber pickup data in New York City [51], both of which are characterized by integer counts following the Poisson distribution. Each entry of the Enron emails tensor is indexed by sender, receiver, and word, and its value is a word count. Each entry of the Uber pickup tensor is indexed by the date, latitude, and longitude of the pickup, and its value is a pickup count. We also test the tags from the Flickr dataset [20], where non-zero values are binary, indicating user tagging of images on a given day.

In the synthetic experiments, we generate third-order tensors with different sizes and ranks, and we do not require the dimensions of a tensor to be equal. In each iteration, the fiber-sampling algorithms iTableSMD and SmartCPD sample $2R$ fibers, while the entry-sampling algorithm GCP-OPT samples $2\frac{\sum_{n=1}^{N}I_{n}}{N}R$ entries. The generating function of the Bregman distance is chosen as $\psi(a)=a\log a$ in the updates of iTableSMD and SmartCPD.

Performance is measured by the cost function value (denoted by “NRE”) and the mean squared error (MSE) of the latent matrices, which is defined as

$$\mathrm{MSE}=\min_{\pi(r)\in[R]}\frac{1}{R}\sum_{r=1}^{R}\left\|\frac{\boldsymbol{A}_{n}(:,\pi(r))}{\left\|\boldsymbol{A}_{n}(:,\pi(r))\right\|_{2}}-\frac{\bar{\boldsymbol{A}}_{n}(:,r)}{\left\|\bar{\boldsymbol{A}}_{n}(:,r)\right\|_{2}}\right\|^{2},$$

where $\bar{\boldsymbol{A}}_{n}$ denotes the estimate of the original matrix $\boldsymbol{A}_{n}$ and $\{\pi(1),\ldots,\pi(R)\}$ represents a permutation of the set $[R]=\{1,\ldots,R\}$, which is used to resolve the intrinsic column permutation ambiguity in CP decomposition.
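The MSE metric above can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: matrices are stored as lists of rows, all columns are nonzero, and the minimum over permutations is found by brute force, which is practical only for small $R$. The function names `normalized_columns` and `factor_mse` are hypothetical and do not come from the paper or from any toolbox.

```python
import itertools
import math


def normalized_columns(matrix):
    """Return the columns of a row-major matrix, each scaled to unit 2-norm."""
    cols = []
    for col in zip(*matrix):
        norm = math.sqrt(sum(v * v for v in col))
        cols.append([v / norm for v in col])
    return cols


def factor_mse(A_true, A_est):
    """Minimum over column permutations of the mean squared column distance."""
    true_cols = normalized_columns(A_true)
    est_cols = normalized_columns(A_est)
    R = len(true_cols)
    best = float("inf")
    for perm in itertools.permutations(range(R)):
        total = 0.0
        for r in range(R):
            # Compare true column perm[r] with estimated column r.
            total += sum(
                (t - e) ** 2 for t, e in zip(true_cols[perm[r]], est_cols[r])
            )
        best = min(best, total / R)
    return best


# Toy example: the estimate equals the truth with columns swapped and rescaled.
A_true = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]]
A_est = [[0.0, 2.0], [0.5, 4.0], [1.5, 0.0]]
print(factor_mse(A_true, A_est))  # essentially zero: scaling and order are ignored
```

Because each column is normalized and the minimum is taken over permutations, the metric is insensitive to the column scaling and ordering ambiguities of the CP model. For larger $R$, an assignment solver such as SciPy's `linear_sum_assignment` would typically replace the brute-force loop.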

5.1 Synthetic data experiments

5.1.1 Gamma distribution

In this subsection, we compute the GCP decomposition of two artificial three-way tensors of sizes $150\times 100\times 150$ and $300\times 400\times 300$ with different ranks using the gamma loss function $x/m+\log(m)$. In practice, we impose the constraint $m\geq 0$ and replace $m$ with $m+\epsilon$ (e.g., $\epsilon=10^{-9}$) in the loss function to prevent function values or gradients from becoming $\pm\infty$; namely, $f(m\,;x)=x/(m+\epsilon)+\log(m+\epsilon)$. With nonnegative constraints on the factor matrices, the entries of the latent factors $A_{1}$, $A_{2}$, and $A_{3}$ are drawn i.i.d. from the uniform distribution between 0 and $A_{max}$, where $A_{max}=0.5$. The observed nonnegative data tensor $\mathcal{X}$ is generated following the gamma distribution, i.e., $\underline{\mathcal{X}}_{i}\sim\mathrm{Gamma}(\underline{\mathcal{M}}_{i})$. Namely, we focus on

$$\begin{aligned}
\min_{A_{1},A_{2},A_{3}}&\quad\frac{1}{I^{N}}\sum_{i\in\mathcal{I}}\left[\underline{\mathcal{X}}_{i}/(\underline{\mathcal{M}}_{i}+\epsilon)+\log(\underline{\mathcal{M}}_{i}+\epsilon)\right]+\sum_{n=1}^{3}h_{n}\left(A_{n}\right)\\
\text{s.t.}&\quad\underline{\mathcal{M}}_{i}=\sum_{r=1}^{R}\prod_{n=1}^{3}A_{n}\left(i_{n},r\right),\quad\forall\,i\in\mathcal{I}.
\end{aligned}$$
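The effect of the $\epsilon$-smoothing in the gamma loss can be illustrated with a short sketch. The function names below are hypothetical, and the derivative is obtained by differentiating $f(m\,;x)=x/(m+\epsilon)+\log(m+\epsilon)$ with respect to $m$; only the elementwise loss is shown, without the regularizers $h_{n}$.

```python
import math

EPS = 1e-9  # smoothing constant, matching the example value in the text


def gamma_loss(m, x, eps=EPS):
    """Smoothed gamma loss f(m; x) = x / (m + eps) + log(m + eps)."""
    return x / (m + eps) + math.log(m + eps)


def gamma_loss_grad(m, x, eps=EPS):
    """Derivative of the smoothed gamma loss with respect to m."""
    return -x / (m + eps) ** 2 + 1.0 / (m + eps)


# Without smoothing, log(0) is undefined; with it, the loss at m = 0 is finite.
print(gamma_loss(0.0, 0.0))
# The derivative nearly vanishes at m = x, where the elementwise loss is minimized.
print(gamma_loss_grad(2.0, 2.0))
```

Setting the derivative to zero gives $m=x-\epsilon$, so for each entry the smoothed loss is minimized when the model value essentially matches the observation.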

We set the inertial parameters as $\alpha^{k}=\frac{3(k-1)}{5(k+2)}$ and $\beta^{k}=\frac{4(k-1)}{5(k+2)}$ for simplicity, and set the stepsize as $\eta^{k}=0.1$, to compare SmartCPD with SGD and SAGA, GCP-OPT, and iTableSMD with SGD and SAGA. Theoretically, the inequality (14) is required. However, checking this inequality in the numerical experiments is time-consuming, so we directly set $\beta^{k}=\frac{4(k-1)}{5(k+2)}$; our numerical experiments show that iTableSMD always converges with this $\beta^{k}$. The numerical results for the two synthetic datasets are presented in Figure 1.
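The inertial parameter schedule above is easy to tabulate. The sketch below only evaluates the two formulas from the text, using exact fractions to avoid rounding; the function name `inertial_params` is hypothetical, and the code does not reproduce the iTableSMD update itself.

```python
from fractions import Fraction


def inertial_params(k):
    """Return (alpha^k, beta^k) = (3(k-1)/(5(k+2)), 4(k-1)/(5(k+2))) as exact fractions."""
    base = Fraction(k - 1, 5 * (k + 2))
    return 3 * base, 4 * base


# Print the schedule at a few iteration counts.
for k in (1, 2, 10, 100, 1000):
    alpha, beta = inertial_params(k)
    print(k, float(alpha), float(beta))
```

Both sequences are zero at $k=1$, increase monotonically with $k$, and approach $3/5$ and $4/5$ respectively, so the extrapolation weights stay bounded away from 1.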

The first synthetic experiment, shown in the top row of Figure 1, captures the algorithmic performance on a tensor of dimensions $150\times 100\times 150$, evaluated at varying ranks. Notably, iTableSMD, especially when coupled with SAGA, achieves a rapid improvement in MSE. For instance, in Figure 1(a), iTableSMD-SAGA reduces the MSE to below $10^{-6}$ within an average time of fewer than 3 seconds, outpacing SmartCPD, which requires a minimum of 6 seconds, and GCP-OPT, which exceeds 10 seconds to attain comparable MSE reductions.

The second row in Figure 1 evaluates the algorithms on a tensor with increased dimensions $300\times 400\times 300$. As the tensor size increases, the iTableSMD method attains a lower MSE within the same time compared to SmartCPD and GCP-OPT. The noticeable improvement over the SmartCPD method indicates that the inertial acceleration framework of iTableSMD has a significant impact on its performance, helping to achieve faster convergence.

Figure 1: Numerical experiments for Gamma distribution on synthetic datasets. Panels: (a) $150\times 100\times 150$, $R=15$; (b) $150\times 100\times 150$, $R=20$; (c) $150\times 100\times 150$, $R=30$; (d) $300\times 400\times 300$, $R=15$; (e) $300\times 400\times 300$, $R=20$; (f) $300\times 400\times 300$, $R=30$.

5.1.2 Poisson distribution

We next evaluate the performance on two synthetic count data tensors of sizes $150\times 51\times 152$ and $300\times 300\times 300$. For simplicity, we set the inertial parameters $\alpha^{k}=\frac{3(k-1)}{5(k+2)}$ and $\beta^{k}=\frac{4(k-1)}{5(k+2)}$ as functions of the iteration $k$ and choose the stepsize $\eta^{k}=0.2$. The loss function is a modification of the standard Poisson log-likelihood and is defined as $f(m\,;x)=m-x\log(m+\epsilon)$. We initialize the factor matrices $A_{1}$, $A_{2}$, and $A_{3}$ with values uniformly distributed between 0 and a maximum $A_{max}$, here chosen as 0.5. The observed count data tensor $\mathcal{X}$ is generated following the Poisson distribution, i.e., $\underline{\mathcal{X}}_{i}\sim\mathrm{Poisson}(\underline{\mathcal{M}}_{i})$. Namely, we focus on

$$\begin{aligned}
\min_{A_{1},A_{2},A_{3}}&\quad\frac{1}{I^{N}}\sum_{i\in\mathcal{I}}\left[\underline{\mathcal{M}}_{i}-\underline{\mathcal{X}}_{i}\log(\underline{\mathcal{M}}_{i}+\epsilon)\right]+\sum_{n=1}^{3}h_{n}\left(A_{n}\right)\\
\text{s.t.}&\quad\underline{\mathcal{M}}_{i}=\sum_{r=1}^{R}\prod_{n=1}^{3}A_{n}\left(i_{n},r\right),\quad\forall\,i\in\mathcal{I}.
\end{aligned}$$
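To connect the constraint $\underline{\mathcal{M}}_{i}=\sum_{r}\prod_{n}A_{n}(i_{n},r)$ with the Poisson loss $f(m\,;x)=m-x\log(m+\epsilon)$, the sketch below evaluates both on a tiny tensor. Several choices here are assumptions made for illustration: factor matrices are lists of rows, the data tensor is a dictionary keyed by index tuples, the loss is averaged over the number of entries, and the regularizers $h_{n}$ are omitted. The function names are hypothetical.

```python
import itertools
import math

EPS = 1e-9  # smoothing constant


def cp_entry(factors, index):
    """M_i = sum_r prod_n A_n(i_n, r) for factor matrices stored as lists of rows."""
    R = len(factors[0][0])
    return sum(
        math.prod(A[i_n][r] for A, i_n in zip(factors, index))
        for r in range(R)
    )


def poisson_objective(factors, X, eps=EPS):
    """Average of m - x * log(m + eps) over all entries (regularizers omitted)."""
    dims = [len(A) for A in factors]
    total = 0.0
    count = 0
    for index in itertools.product(*(range(d) for d in dims)):
        m = cp_entry(factors, index)
        total += m - X[index] * math.log(m + eps)
        count += 1
    return total / count


# Toy rank-1 factors of a 2 x 2 x 2 tensor (illustrative values only).
factors = [[[1.0], [2.0]], [[1.0], [3.0]], [[1.0], [2.0]]]
# Noise-free "counts": X equals the model tensor exactly.
X = {idx: cp_entry(factors, idx) for idx in itertools.product(range(2), repeat=3)}
scaled = [[[1.5], [3.0]], factors[1], factors[2]]  # first factor scaled by 1.5
print(poisson_objective(factors, X), poisson_objective(scaled, X))
```

Since $m-x\log m$ is minimized at $m=x$ for each entry, the generating factors give a smaller objective than the rescaled ones in this noise-free toy case.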

Figure 2 shows the results of numerical experiments on synthetic datasets generated from the Poisson distribution for tensors of two sizes and varying ranks. For the smaller tensor ($150\times 51\times 152$), as the rank increases from $R=10$ to $R=20$, iTableSMD-SAGA maintains a lower MSE than the other methods, indicating more efficient performance. In particular, for $R=20$ in Figure 2(c), iTableSMD-SAGA reduces the MSE significantly faster than the other methods within 10 seconds. For the larger tensor ($300\times 300\times 300$), in Figures 2(d)-(f), the MSE decreases more slowly for all methods as the rank grows. However, iTableSMD-SAGA still shows a consistent advantage, reaching lower MSEs more quickly than the competing algorithms, which becomes more notable as the rank moves to 30 and beyond. Comparing Figures 2(b)-(e) with Figure 1, we see that SmartCPD does not always perform better than GCP-OPT, since the type of data distribution may affect its effectiveness. In contrast, the iTableSMD variants continue to show superior performance in terms of iteration speed, regardless of the change in data distribution.

Figure 2: Numerical experiments for Poisson distribution on synthetic datasets. Panels: (a) $150\times 51\times 152$, $R=10$; (b) $150\times 51\times 152$, $R=15$; (c) $150\times 51\times 152$, $R=20$; (d) $300\times 300\times 300$, $R=20$; (e) $300\times 300\times 300$, $R=30$; (f) $300\times 300\times 300$, $R=40$.

5.1.3 Bernoulli distribution

We next report numerical experiments on binary tensors of sizes $100\times 80\times 100$ and $50\times 100\times 200$. We set the inertial parameters $\alpha^{k}=\frac{3(k-1)}{5(k+2)}$ and $\beta^{k}=\frac{4(k-1)}{5(k+2)}$ as functions of the iteration $k$ and choose the stepsize $\eta^{k}=0.2$. The loss function is $f(m\,;x)=\log(m+1)-x\log(m+\epsilon)$, and each entry of the binary tensor is generated from the Bernoulli distribution, i.e., $\underline{\mathcal{X}}_{i}=1$ with probability $\underline{\mathcal{M}}_{i}/(1+\underline{\mathcal{M}}_{i})$. Then, we focus on

$$
\begin{aligned}
\min_{A_{1},A_{2},A_{3}} &\quad \frac{1}{I^{N}}\sum_{i\in\mathcal{I}}\log(\underline{\mathcal{M}}_{i}+1)-\underline{\mathcal{X}}_{i}\log(\underline{\mathcal{M}}_{i}+\epsilon)+\sum_{n=1}^{3}h_{n}\left(A_{n}\right)\\
\text{s.t.} &\quad \underline{\mathcal{M}}_{i}=\sum_{r=1}^{R}\prod_{n=1}^{3}A_{n}\left(i_{n},r\right),\quad\forall\, i\in\mathcal{I}.
\end{aligned}
$$
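The synthetic setup above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the helper names `cp_tensor` and `bernoulli_objective`, the uniform ground-truth factors, the random seed, and the value of `eps` are assumptions made for the example, and the regularizers $h_n$ are omitted. The average is taken over all tensor entries, which plays the role of the $1/I^{N}$ factor.

```python
import numpy as np

def cp_tensor(factors):
    """Build M with M[i,j,k] = sum_r A1[i,r] * A2[j,r] * A3[k,r] (third-order CP model)."""
    A1, A2, A3 = factors
    return np.einsum("ir,jr,kr->ijk", A1, A2, A3)

def bernoulli_objective(X, M, eps=1e-10):
    """Average of log(M + 1) - X * log(M + eps) over all entries (h_n omitted)."""
    return np.mean(np.log1p(M) - X * np.log(M + eps))

rng = np.random.default_rng(0)
shape, R = (100, 80, 100), 5

# Assumption: nonnegative ground-truth factors with entries uniform on [0, 1).
true_factors = [rng.random((dim, R)) for dim in shape]
M_true = cp_tensor(true_factors)

# Each entry equals 1 with probability M / (1 + M), so M acts as the odds.
X = (rng.random(shape) < M_true / (1.0 + M_true)).astype(np.float64)

# Objective at the generating factors and at a fresh random guess.
loss_at_truth = bernoulli_objective(X, M_true)
random_factors = [rng.random((dim, R)) for dim in shape]
loss_at_random = bernoulli_objective(X, cp_tensor(random_factors))
```

Comparing `loss_at_truth` with `loss_at_random` gives a quick sanity check that the generated data and the objective are consistent with each other.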

In Figure 3(a), with tensor size $100\times 80\times 100$ and rank $R=5$, all algorithms quickly reduce the MSE within the first few seconds. As the rank increases to $R=7$ in Figures 3(b) and 3(e), and to $R=10$ in Figures 3(c) and 3(f), the MSE decreases more slowly, with higher ranks leading to a slight slowdown in convergence. These four subfigures further confirm that SmartCPD does not consistently outperform GCP-OPT and may at times be less effective. Nonetheless, iTableSMD-SAGA consistently shows robust performance, achieving low MSEs faster than the other methods. For the larger tensor size, in Figures 3(d)-(f), a similar pattern is evident, with all algorithms becoming slower as the size and rank increase. Yet the relative efficiency of iTableSMD-SAGA remains apparent, suggesting an advantage on larger and more complex datasets.

Overall, iTableSMD-SGD and iTableSMD-SAGA reliably achieve lower MSEs more quickly across the synthetic tensor sizes and ranks tested, underlining their efficiency in different scenarios.

[Figure 3 panels: (a) $100\times 80\times 100$, $R=5$; (b) $100\times 80\times 100$, $R=7$; (c) $100\times 80\times 100$, $R=10$; (d) $50\times 100\times 200$, $R=5$; (e) $50\times 100\times 200$, $R=7$; (f) $50\times 100\times 200$, $R=10$]
Figure 3: Numerical experiments for Bernoulli distribution on synthetic datasets.

5.2 Real data experiments

5.2.1 Enron emails dataset

We apply the algorithms to the Enron emails dataset. This dataset comprises a large collection of email messages exchanged by employees of the Enron Corporation and was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). We use the extracted version in [50] and select a subset involving 142 senders, 147 receivers, and 148 unique words, excluding any additional dimensions. The data form a third-order tensor (sender $\times$ receiver $\times$ word) with integer entries representing word counts. The size of the tensor is $142\times 147\times 148$, and it has 6581 ($\approx 0.2\%$) nonzero entries. We choose the loss function corresponding to the Poisson distribution, i.e., $f(x,m)=m-x\log(m+\epsilon)$, and impose non-negativity constraints on the latent matrices. In every iteration, $2R$ fibers are sampled by iTableSMD and SmartCPD, and $2R\times\frac{142+147+148}{3}$ entries are sampled by GCP-OPT. We set the inertial parameter $\alpha^{k}=\frac{3(k-1)}{5(k+2)}$. All algorithms under test are stopped when the relative change in the loss function is less than $10^{-10}$.
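The quantities in this setup are easy to check numerically. The sketch below is illustrative only: the function names, the choice $R=5$ (one of the ranks tested), the rounding of the GCP-OPT sample count to an integer, the value of `eps`, and the particular definition of "relative change" are assumptions, since the text does not spell them out.

```python
import numpy as np

def poisson_loss(x, m, eps=1e-10):
    """Elementwise Poisson loss f(x, m) = m - x * log(m + eps)."""
    return m - x * np.log(m + eps)

def alpha(k):
    """Inertial parameter alpha^k = 3(k - 1) / (5(k + 2))."""
    return 3.0 * (k - 1) / (5.0 * (k + 2))

def relative_change(prev_loss, curr_loss):
    """One possible definition of the relative change between consecutive losses."""
    return abs(curr_loss - prev_loss) / max(abs(prev_loss), np.finfo(float).tiny)

dims = (142, 147, 148)
R = 5  # assumption: one of the ranks used in the experiments

# Per-iteration sample sizes described in the text.
fibers_per_iter = 2 * R                            # iTableSMD and SmartCPD
entries_per_iter = round(2 * R * sum(dims) / 3)    # GCP-OPT (rounding is an assumption)

# Fraction of nonzero entries in the 142 x 147 x 148 tensor.
density = 6581 / np.prod(dims)

# Stopping rule: terminate once the relative change drops below the tolerance.
tol = 1e-10
should_stop = relative_change(1.0, 1.0 - 1e-12) < tol
```

Note that `alpha(k)` starts at 0 for $k=1$ and increases toward, but never reaches, $3/5$ as $k$ grows, so the extrapolation weight stays bounded away from 1.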

Figure 4 presents numerical experiments on the Enron dataset with the Poisson loss, comparing the algorithms at tensor ranks $R=3$, $R=5$, and $R=7$. Each algorithm is run for 5 trials, and in each trial the factor matrices are initialized by sampling their entries from the uniform distribution between 0 and 1. iTableSMD stands out for its quick cost reduction, reaching a low cost within approximately 6 seconds, noticeably faster than the baseline methods, which require more time and still do not reach as low a cost. This advantage is most notable at the higher rank $R=7$, where iTableSMD quickly lowers the cost while the other algorithms struggle to converge within the same time.

[Figure 4 panels: (a) $142\times 147\times 148$, $R=3$; (b) $142\times 147\times 148$, $R=5$; (c) $142\times 147\times 148$, $R=7$]
Figure 4: Numerical experiments for Poisson distribution on Enron emails dataset.

5.2.2 The Flickr dataset

We evaluate the algorithms on the Flickr dataset of Görlitz et al. [20]. The dataset consists of tags representing whether a user has labeled an image on a particular day, with non-zero values marked as binary indicators. We form a third-order binary tensor of size $520\times 520\times 520$. The chosen loss function is tailored to the Bernoulli distribution, i.e., $f(x,m)=\log(m+1)-x\log(m+\epsilon)$. Other settings and parameters are as before. Figure 5 shows the cost value against time in seconds for different values of $R$. As on the previous datasets, the proposed iTableSMD shows considerable runtime advantages over GCP-OPT. The results reinforce the capability of iTableSMD to handle complex, real-world datasets effectively.
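For readers wondering where this loss comes from, the following short derivation is a sketch under the assumption, as in the synthetic experiments, that $m\geq 0$ is the odds of observing a one, so that $P(x=1\mid m)=m/(1+m)$. The negative log-likelihood of a single binary entry $x\in\{0,1\}$ is then

```latex
\begin{aligned}
-\log P(x \mid m)
  &= -x\log\frac{m}{1+m} - (1-x)\log\frac{1}{1+m} \\
  &= -x\log m + x\log(1+m) + (1-x)\log(1+m) \\
  &= \log(1+m) - x\log m .
\end{aligned}
```

Replacing $\log m$ by $\log(m+\epsilon)$ with a small $\epsilon>0$ yields $f(x,m)$ as stated; the shift presumably keeps the logarithm finite when $m=0$.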

[Figure 5 panels: (a) $520\times 520\times 520$, $R=5$; (b) $520\times 520\times 520$, $R=10$; (c) $520\times 520\times 520$, $R=15$]
Figure 5: Numerical experiments for Bernoulli distribution on the Flickr dataset.

6 Conclusion

In this paper, we proposed an inertial accelerated block randomized stochastic mirror descent algorithm (iTableSMD) for nonconvex multi-block objective functions beyond global Lipschitz gradient continuity. The algorithm is tailored to large-scale Generalized Tensor CP (GCP) decomposition under non-Euclidean losses. By integrating a broader version of multi-block variance reduction, we established the sublinear convergence rate for the subsequential sequence produced by the iTableSMD algorithm and proved that it requires at most $\mathcal{O}(\varepsilon^{-2})$ iterations in expectation to attain an $\varepsilon$-stationary point. Additionally, we verified the global convergence of the sequence generated by iTableSMD. We tested the algorithm on various types of simulated and real data against several baselines, and the results indicate significant computational efficiency improvements over existing state-of-the-art methods. These results highlight the advantages and effectiveness of incorporating an inertial accelerated stochastic approach in the algorithmic framework for GCP tensor decomposition.

Declarations

Funding: This research is supported by the R&D project of Pazhou Lab (Huangpu) (Grant no. 2023K0603), the National Natural Science Foundation of China (NSFC) grant 12171021 and the Fundamental Research Funds for the Central Universities (Grant No. YWF-22-T-204).

Competing interests: The authors have no competing interests to declare that are relevant to the content of this article.

Data Availability Statement: Data will be made available on reasonable request.

References

  • [1] C. Battaglino, G. Ballard, and T. G. Kolda. A practical randomized CP tensor decomposition. SIAM J. Matrix Anal. Appl., 39(2):876–901, 2018.
  • [2] H. H. Bauschke, J. Bolte, and M. Teboulle. A descent lemma beyond Lipschitz gradient continuity: First-order methods revisited and applications. Math. Oper. Res., 42(2):330–348, 2017.
  • [3] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Res. Lett., 31(3):167–175, 2003.
  • [4] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program., 146(1-2):459–494, 2014.
  • [5] J. Bolte, S. Sabach, M. Teboulle, and Y. Vaisbourd. First order methods beyond convexity and Lipschitz gradient continuity with applications to quadratic inverse problems. SIAM J. Optim., 28(3):2131–2151, 2018.
  • [6] J. D. Carroll and J. J. Chang. Analysis of individual differences in multidimensional scaling via an $n$-way generalization of “Eckart-Young” decomposition. Psychometrika, 35(3):283–319, 1970.
  • [7] R. Cattell. “Parallel proportional profiles” and other principles for determining the choice of factors by rotation. Psychometrika, 9(4):267–283, 1944.
  • [8] R. B. Cattell. The three basic factor-analytic research designs-their interrelations and derivatives. Psychol. Bull., 49(5):499–520, 1952.
  • [9] L. Cheng, X. Tong, S. Wang, Y.-C. Wu, and H. V. Poor. Learning nonnegative factors from tensor data: Probabilistic modeling and inference algorithm. IEEE Trans. Signal Process., 68:1792–1806, 2020.
  • [10] E. C. Chi and T. G. Kolda. On tensors, sparsity, and nonnegative factorizations. SIAM J. Matrix Anal. Appl., 33(4):1272–1299, 2012.
  • [11] A. Cichocki and A. Phan. Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE Trans. Fundam. Electron. Commun. Comput. Sci., 92-A:708–721, 2009.
  • [12] P. Comon, X. Luciani, and A. L. F. de Almeida. Tensor decompositions, alternating least squares and other tales. J. Chemom., 23, 2009.
  • [13] C. D. Dang and G. Lan. Stochastic block mirror descent methods for nonsmooth and stochastic optimization. SIAM J. Optim., 25(2):856–881, 2015.
  • [14] D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions, 2018.
  • [15] A. Defazio, F. R. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27, pages 1646–1654, 2014.
  • [16] D. Driggs, J. Tang, J. Liang, M. E. Davies, and C. Schönlieb. A stochastic proximal alternating minimization for nonsmooth and nonconvex optimization. SIAM J. Imaging Sci., 14(4):1932–1970, 2021.
  • [17] B. Ermiş, E. Acar, and A. T. Cemgil. Link prediction in heterogeneous data via generalized coupled tensor factorization. Data Min. Knowl. Discov., 29(1):203–236, 2015.
  • [18] X. Fu, S. Ibrahim, H. Wai, C. Gao, and K. Huang. Block-randomized stochastic proximal gradient for low-rank tensor factorization. IEEE Trans. Signal Process., 68:2170–2185, 2020.
  • [19] X. Fu, E. Seo, J. Clarke, and R. A. Hutchinson. Link prediction under imperfect detection: Collaborative filtering for ecological networks. IEEE Transactions on Knowledge and Data Engineering, 33(8):3117–3128, 2021.
  • [20] O. Görlitz, S. Sizov, and S. Staab. Pints: peer-to-peer infrastructure for tagging systems. In IPTPS, page 19, 2008.
  • [21] D. Han. A survey on some recent developments of alternating direction method of multipliers. J. Oper. Res. Soc. China, 10:1–52, 2022.
  • [22] R. A. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multi-model factor analysis. 1970.
  • [23] J. Hertrich and G. Steidl. Inertial stochastic PALM and applications in machine learning. Sampl. Theory Signal Process. Data Anal., 20(1), 2022.
  • [24] F. L. Hitchcock. The expression of a tensor or a polyadic as a sum of products. J. Math. Phys., 6(1-4):164–189, 1927.
  • [25] F. L. Hitchcock. Multiple invariants and generalized rank of a p-way matrix or tensor. J. Math. Phys., 7(1-4):39–79, 1928.
  • [26] D. Hong, T. G. Kolda, and J. A. Duersch. Generalized canonical polyadic tensor decomposition. SIAM Rev., 62(1):133–163, 2020.
  • [27] K. Huang and N. D. Sidiropoulos. Kullback-Leibler principal component for tensors is not NP-hard. In 2017 51st Asilomar Conference on Signals, Systems, and Computers, pages 693–697, 2017.
  • [28] K. Huang, N. D. Sidiropoulos, and A. P. Liavas. A flexible and efficient algorithmic framework for constrained matrix and tensor factorization. IEEE Trans. Signal Process., 64(19):5052–5065, 2016.
  • [29] N. Kargas and N. Sidiropoulos. Learning mixtures of smooth product distributions: Identifiability and algorithm. In Proc. 22nd Int. Conf. Artif. Intell. Statist., pages 388–396, 2019.
  • [30] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Rev., 51(3):455–500, 2009.
  • [31] T. G. Kolda and D. Hong. Stochastic gradients for large-scale tensor decomposition. SIAM J. Math. Data Sci, abs/1906.01687, 2019.
  • [32] W. P. Krijnen, T. K. Dijkstra, and A. Stegeman. On the non-existence of optimal solutions and the occurrence of “degeneracy” in the CANDECOMP/PARAFAC model. Psychometrika, 73(3):431–439, 2008.
  • [33] G. Lan. First-Order and Stochastic Optimization Methods for Machine Learning. Springer, 2020.
  • [34] P. Latafat, A. Themelis, M. Ahookhosh, and P. Patrinos. Bregman Finito/MISO for nonconvex regularized finite sum minimization without Lipschitz gradient continuity. SIAM J. Optim., 32(3):2230–2262, 2022.
  • [35] Q. Li, Z. Zhu, G. Tang, and M. B. Wakin. Provable Bregman-divergence based methods for nonconvex and non-Lipschitz problems, 2019.
  • [36] L.-H. Lim and P. Comon. Nonnegative approximations of nonnegative tensors. J. Chemom., 23:432–441, 2009.
  • [37] H. Lu. “Relative-continuity” for non-Lipschitz non-smooth convex optimization using stochastic (or deterministic) mirror descent, 2018.
  • [38] H. Lu, R. M. Freund, and Y. E. Nesterov. Relatively smooth convex optimization by first-order methods, and applications. SIAM J. Optim., 28(1):333–354, 2018.
  • [39] M. C. Mukkamala, P. Ochs, T. Pock, and S. Sabach. Convex-concave backtracking for inertial Bregman proximal gradient algorithms in nonconvex optimization. SIAM J. Math. Data Sci., 2(3):658–682, 2020.
  • [40] C. Navasca, L. De Lathauwer, and S. Kindermann. Swamp reducing technique for tensor decomposition. In 16th European Signal Processing Conference, pages 1–5, 2008.
  • [41] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19(4):1574–1609, 2009.
  • [42] Y. E. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^{2})$. Soviet Math. Dokl., 27(2):372–376, 1983.
  • [43] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, pages 2613–2621, 2017.
  • [44] P. Paatero. Construction and analysis of degenerate PARAFAC models. J. Chemom., 14(3):285–299, 2000.
  • [45] A.-H. Phan, P. Tichavský, and A. Cichocki. Low complexity damped Gauss-Newton algorithms for CANDECOMP/PARAFAC. SIAM J. Matrix Anal. Appl., 34(1):126–147, 2013.
  • [46] A.-H. Phan, P. Tichavský, and A. Cichocki. Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations. IEEE Trans. Signal Process., 61(19):4834–4846, 2013.
  • [47] T. Pock and S. Sabach. Inertial proximal alternating linearized minimization (iPALM) for nonconvex and nonsmooth problems. SIAM J. Imaging Sci., 9(4):1756–1787, 2016.
  • [48] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys., 4(5):1–17, 1964.
  • [49] W. Pu, S. Ibrahim, X. Fu, and M. Hong. Stochastic mirror descent for low-rank tensor decomposition under non-Euclidean losses. IEEE Trans. Signal Process., 70:1803–1818, 2022.
  • [50] J. Shetty and J. Adibi. The Enron email dataset database schema and brief statistical report. Information sciences institute technical report, University of Southern California, 4, 2004.
  • [51] S. Smith, J. W. Choi, J. Li, R. Vuduc, J. Park, X. Liu, and G. Karypis. FROSTT: The formidable repository of open sparse tensors and tools, 2017.
  • [52] L. Sorber, M. Van Barel, and L. De Lathauwer. Optimization-based algorithms for tensor decompositions: Canonical polyadic decomposition, decomposition in rank-$(l_{r},l_{r},1)$ terms, and a new generalization. SIAM J. Optim., 23(2):695–720, 2013.
  • [53] M. Teboulle and Y. Vaisbourd. Novel proximal gradient methods for nonnegative matrix factorization with sparsity constraints. SIAM J. Imaging Sci., 13(1):381–421, 2020.
  • [54] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl., 109:475–494, 2001.
  • [55] M. Vandecappelle, N. Vervliet, and L. D. Lathauwer. A second-order method for fitting the canonical polyadic decomposition with non-least-squares cost. IEEE Transactions on Signal Processing, 68:4454–4465, 2020.
  • [56] M. Wang and L. Li. Learning from binary multiway data: Probabilistic tensor decomposition and its statistical optimality. J. Mach. Learn. Res., 21(1), 2020.
  • [57] Q. Wang, C. Cui, and D. Han. A momentum block-randomized stochastic algorithm for low-rank tensor CP decomposition. Pac. J. Optim., 17(3):433–452, 2021.
  • [58] Q. Wang and D. Han. A Bregman stochastic method for nonconvex nonsmooth problem beyond global Lipschitz gradient continuity. Optim. Methods. Softw., Online, 2023.
  • [59] Q. Wang and D. Han. A generalized inertial proximal alternating linearized minimization method for nonconvex nonsmooth problems. Appl. Numer. Math., 189:66–87, 2023.
  • [60] Q. Wang, Z. Liu, C. Cui, and D. Han. Inertial accelerated SGD algorithms for solving large-scale lower-rank tensor CP decomposition problems. J. Comput. Appl. Math., 423:114948, 2023.
  • [61] Q. Wang, Z. Liu, C. Cui, and D. Han. A Bregman proximal stochastic gradient method with extrapolation for nonconvex nonsmooth problems. In Association for the Advancement of Artificial Intelligence (AAAI), 2024.
  • [62] Y. Xu and W. Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM J. Imag. Sci., 6(3):1758–1789, 2013.
  • [63] A. Yeredor and M. Haardt. Maximum likelihood estimation of a low-rank probability mass tensor from partial observations. IEEE Signal Process. Lett., 26(10):1551–1555, 2019.
  • [64] S. Zhang and N. He. On the convergence rate of stochastic mirror descent for nonsmooth nonconvex optimization, 2018.