Inertial Accelerated Stochastic Mirror Descent for Large-Scale Generalized Tensor CP Decomposition

Zehui Liu LMIB, School of Mathematical Sciences, Beihang University, Beijing 100191, China. Email: [email protected]
Qingsong Wang School of Mathematics and Computational Science, Xiangtan University, Xiangtan 411105, China. Email: [email protected]
Chunfeng Cui LMIB, School of Mathematical Sciences, Beihang University, Beijing 100191, China. Email: [email protected]
Yong Xia LMIB, School of Mathematical Sciences, Beihang University, Beijing 100191, China. Email: [email protected]

Abstract

The majority of classic tensor CP decomposition models are designed for the squared loss and employ the Euclidean distance as a local proximal term. However, the Euclidean distance is unsuitable for the generalized loss functions applicable to various types of real-world data, such as integer and binary data. Consequently, algorithms developed under the squared loss are not easily adapted to these generalized losses, partly because the gradient is no longer Lipschitz continuous. This paper considers the generalized tensor CP decomposition. We use the Bregman distance as the proximal term and propose an inertial accelerated block-randomized stochastic mirror descent algorithm (iTableSMD). Within a broader multi-block variance reduction and inertial acceleration framework, we establish a sublinear rate of subsequential convergence for the sequence produced by the iTableSMD algorithm. We further show that iTableSMD requires at most $\mathcal{O}(\varepsilon^{-2})$ iterations in expectation to attain an $\varepsilon$-stationary point and establish the global convergence of the sequence. Numerical experiments on real datasets demonstrate that the proposed algorithm is efficient and achieves better performance than existing state-of-the-art methods.

Keywords: Generalized tensor CP decomposition, Inertial acceleration, Stochastic mirror descent, Bregman divergence, Non-Lipschitz gradient continuity, Variance reduction.

1 Introduction

A fundamental generic optimization model that encompasses a wide range of multi-block models arising in various applications is the well-known composite minimization problem. It can be formally defined as:

\min_{\{x_{t}\}_{t=1}^{s}}\Phi\left({x}_{1},\ldots,{x}_{s}\right)\equiv f\left({x}_{1},\ldots,{x}_{s}\right)+\sum_{t=1}^{s}h_{t}\left({x}_{t}\right),

(1)

where the variable $x$ can be decomposed into $s$ blocks ${x}_{1},\ldots,{x}_{s}$, and $f$ is a continuously differentiable nonconvex function of $x=\left({x}_{1},\ldots,{x}_{s}\right)$ that may be convex in a single block ${x}_{t}$ when all other blocks are fixed. It can also admit a finite-sum structure $f(x)=\frac{1}{n}\sum_{i=1}^{n}f_{i}(x)$ indexed by $i$. The usual restrictive assumption that the gradients of $f_{i}$ are Lipschitz continuous is not required. The functions $h_{t}$, $t=1,\ldots,s$, are extended-valued weakly convex functions; they act as structure-promoting regularizers that capture prior information about ${x}_{t}$, such as column-wise orthogonality [32], Tikhonov regularization [44], iterated Tikhonov regularization [40], and non-negativity [36].

The majority of classic models and algorithms developed for the multi-block problem (1) are least squares (LS) problems using the Euclidean distance-based fitting criterion [22, 6]. Efficient methods in this line include the block coordinate descent (BCD) [62], the alternating direction method of multipliers (ADMM) [21], and the proximal alternating linearized minimization (PALM) [4] algorithm. Pock and Sabach [47] then introduced an inertial variant of PALM (iPALM). Huang et al. [28] introduced a primal-dual algorithm, AO-ADMM, which is a hybrid between alternating optimization and ADMM. There are also other first-order algorithms [46, 12] and (quasi-)second-order methods [52, 45]. However, the Euclidean distance is unsuitable for measuring the proximity between various real-world data types, including nonnegative, integer, and binary data. Essentially, utilizing data geometry-aware divergences as fitting criteria has the potential to substantially improve both performance and robustness in real-world applications [26, 31, 56, 55]. For instance, the proximity between two probability distributions is measured using an appropriate divergence, such as the generalized Kullback-Leibler (KL) divergence [27, 29, 19, 9] or the Itakura-Saito divergence [17]. From a statistical perspective, numerous non-Euclidean divergences share a close connection with maximum likelihood estimators (MLEs) under reasonable data distribution assumptions. For example, the generalized KL divergence [10] and the logistic loss can be obtained as MLEs for integer data with Poisson distributions and binary data with Bernoulli distributions, respectively. Nevertheless, methods under non-Euclidean divergences are much more challenging than those under the Euclidean loss, especially when the data size becomes huge. Algorithms designed for the LS loss are not easily adapted to handle complex loss functions, primarily due to the absence of gradient Lipschitz continuity, even under relatively mild conditions.

Problem (1) covers tensor CP decomposition with regularization, which can be viewed as an extension of matrix factorization [53]. The idea of canonical polyadic (CP) decomposition originates from Hitchcock [24, 25] in 1927, who expressed a tensor as a sum of a finite number of rank-1 tensors. Subsequently, Cattell [7, 8] proposed ideas for parallel proportional analysis and multiple axes for analysis. Furthermore, Hong et al. proposed in 2020 a generalized canonical polyadic (GCP) low-rank tensor decomposition that allows loss functions other than the squared error. Below, we briefly review existing developments for multi-block models with non-Euclidean loss functions.

Many existing non-Euclidean approaches employ the block coordinate descent (BCD) [62] for updating block variables. Convergence of the BCD method typically requires the uniqueness of the minimizer at each step or the quasi-convexity of the objective function [54]. Unfortunately, these requirements can be restrictive in some important practical problems such as the tensor decomposition problem [30]. Cichocki and Phan [11] proposed a hierarchical alternating optimization algorithm for CP decomposition with $\alpha$ - and $\beta$ -divergence. In [10], the generalized KL-divergence loss was explored, leading to the development of a block majorization-minimization (MM) algorithm. The work in [29] presented the exponential gradient algorithm for handling the KL-divergence. Additionally, alternative optimization frameworks like Gauss-Newton based methods [55] and quasi-Newton methods [26] have been devised for non-Euclidean models.

It is worth noting that most of the algorithms mentioned above use the entire dataset for each update, which is time-consuming. In contrast, stochastic algorithms reduce the computational and memory requirements per iteration. A recent stochastic gradient-based algorithm [31] was introduced for tensor CP decomposition. However, it randomly samples tensor entries for updates, neglecting the potential computational efficiency gains offered by the multilinear algebraic properties of low-rank tensors. More importantly, this update strategy loses the opportunity to incorporate regularization terms on the entire latent factors because the sampled entries only provide partial information about them. To address this, Battaglino et al. proposed an algorithm [1] that samples tensor fibers containing information about complete latent factors. However, these stochastic algorithms lack convergence guarantees. Pu et al. [49] developed a block-randomized stochastic mirror descent (SMD) [13, 3] algorithmic framework for large-scale CP decomposition under various non-Euclidean losses, also referred to as generalized CP decomposition. Specifically, at each iteration, one block factor is randomly chosen for an update while keeping all other factors fixed. Then, instead of solving the subproblem directly, it updates the unknown factor by one SMD step. This work also incorporated a fiber sampling strategy to assist in designing SMD updates. In this way, the computational cost is much smaller. However, pure SMD still converges slowly. Wang et al. proposed mBrasCPD [57] and iBrasCPD [60], which speed up the SGD scheme by the heavy ball method [48] and inertial acceleration, respectively. Both of these algorithms are designed for Euclidean loss functions and are not directly suitable for generalized non-Euclidean loss functions. Additionally, these algorithms only assume that the stochastic gradient is unbiased, which yields only weak convergence properties. Recently, [58] and [61] introduced the Bregman proximal stochastic gradient (BPSG) method and BPSG with extrapolation (BPSGE), respectively, and established subsequential and global convergence of the generated sequence under a general variance reduction framework. However, both BPSG and BPSGE are designed for single-block problems, and their potential for computing generalized tensor CP decomposition remains untapped.

An overview of several state-of-the-art algorithms for multi-block problem (1) is presented in Table 1.

Table 1: Summary of the properties of iTableSMD (Algorithm 1) and several state-of-the-art stochastic methods. “subseq.” and “seq.” denote subsequential and sequential convergence, respectively. “Complexity” means the complexity (in expectation) to obtain an $\varepsilon$-stationary point (Definition 3) of $\Phi$, and “-” means not given.

Algorithm	Loss	$h(x)$	Acceleration	Convergence	Complexity	$\nabla f_{i}$ -Lip	$\tilde{\nabla}f_{i}$
BrasCPD [18]	LS	convex	no	subseq.	-	yes	unbiased
mBrasCPD [57]	LS	convex	heavy ball	subseq.	-	yes	unbiased
iBrasCPD [60]	LS	convex	inertial	subseq.	-	yes	unbiased
SPRING [16]	general	nonconvex	no	subseq./seq.	-	yes	biased
iSPALM [23]	general	nonconvex	inertial	subseq./seq.	-	yes	biased
SmartCPD [49]	general	convex	no	subseq.	-	no	unbiased
iTableSMD (Alg. 1)	general	weakly-convex	inertial	subseq./seq.	$\mathcal{O}(\varepsilon^{-2})$	no	biased

In this paper, inspired by inertial acceleration techniques and the variance reduction framework for stochastic algorithms, we propose an inertial accelerated stochastic mirror descent method to solve the nonconvex and nonsmooth optimization problem (1), which can be applied to the block-wise subproblems of generalized tensor CP decomposition under non-Euclidean loss functions. Our main contributions are as follows:

(1)

We introduce an inertial accelerated block-randomized SMD algorithm, denoted as iTableSMD, designed to address the GCP decomposition problem.
(2)

Within a broader multi-block variance reduction and inertial acceleration framework, we establish a sublinear rate of subsequential convergence for the sequence produced by the iTableSMD algorithm. Furthermore, we introduce a novel Lyapunov function and prove that iTableSMD requires at most $\mathcal{O}(\varepsilon^{-2})$ iterations in expectation to attain an $\varepsilon$-stationary point. Additionally, we establish the global convergence of the sequence generated by iTableSMD.
(3)

We conduct extensive experiments on three synthetic datasets and two real-world datasets under several distributions to demonstrate the effectiveness of the proposed iTableSMD algorithm. The numerical results show that the proposed method achieves better convergence than the baselines.

The rest of this paper is organized as follows. Section 2 outlines necessary definitions and preliminary results of existing models and algorithms. Section 3 details the formulation of the iTableSMD algorithm, while Section 4 is dedicated to proving its convergence and analyzing its rate of convergence. In Section 5, we compare the performance of the iTableSMD algorithm with several baselines using synthetic and real-world datasets. We conclude the paper in Section 6.

2 Preliminaries

Definition 1.

(Bregman Divergence): Given a strongly convex function $\psi(\cdot):\operatorname{dom}\psi\left(\subseteq\mathbb{R}^{n}\right)\rightarrow\mathbb{R}$, the Bregman distance between $x\in\operatorname{dom}\psi$ and $y\in\operatorname{int}\operatorname{dom}\psi$ is

D_{\psi}(x,y):=\psi(x)-\psi(y)-\langle\nabla\psi(y),x-y\rangle.

(2)

It measures the proximity of $x$ and $y$ . Indeed, $\psi$ is convex if and only if $D_{\psi}(x,y)\geq 0$ for any $x\in\text{dom }\psi,y\in\text{int dom }\psi$ due to the gradient inequality.
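As a minimal illustration of Definition 1, the following sketch evaluates $D_{\psi}$ for two kernels. The helper name `bregman` and both kernels are illustrative choices, not taken from the paper. Note that the entropy-type kernel is strictly convex on the positive orthant but strongly convex only on bounded subsets of it, so it is shown here purely to illustrate formula (2).

```python
import numpy as np

def bregman(psi, grad_psi, x, y):
    """D_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>, as in (2)."""
    return psi(x) - psi(y) - np.sum(grad_psi(y) * (x - y))

# Kernel 1: psi(x) = 0.5 * ||x||^2 gives half the squared Euclidean distance.
sq = lambda x: 0.5 * np.sum(x ** 2)
sq_grad = lambda x: x

# Kernel 2 (illustrative): psi(x) = sum(x log x - x) on the positive orthant
# gives the generalized KL divergence sum(x log(x / y) - x + y).
ent = lambda x: np.sum(x * np.log(x) - x)
ent_grad = lambda x: np.log(x)

x = np.array([0.5, 1.5, 2.0])
y = np.array([1.0, 1.0, 3.0])
d_euclid = bregman(sq, sq_grad, x, y)
d_kl = bregman(ent, ent_grad, x, y)
```

With the quadratic kernel the Bregman distance is symmetric, whereas the entropy-type kernel gives an asymmetric divergence, which is why the order of the arguments in $D_{\psi}(x,y)$ matters.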

Remark 1.

The Bregman divergence was originally defined in [63] using a Legendre function $\psi$ . Here, we consider the case where $\psi$ is a strongly convex function; see more details in [2]. It should be noted that

D_{\psi}(x,y)\geq\frac{\sigma}{2}\|x-y\|^{2},\quad\forall\,x\in\text{dom }\psi,y\in\text{int dom }\psi,

(3)

where $\sigma$ is the strong convexity parameter of $\psi$. In particular, $D_{\psi}\left(x,y\right)=0$ if and only if $x=y$. In addition, for matrix arguments in $\mathbb{R}^{I_{n}\times R}$ and a separable kernel, $D_{\psi}(x,y)$ can be written in a coordinate-wise form as $D_{\psi}\left(x,y\right)=\sum_{i=1}^{I_{n}}\sum_{j=1}^{R}\left[\psi\left(x_{ij}\right)-\psi\left(y_{ij}\right)-\left\langle\nabla\psi\left(y_{ij}\right),x_{ij}-y_{ij}\right\rangle\right]$.

Definition 2.

([39] $(\bar{L},\underline{L})$ -smooth adaptable) Given $\psi$ , let $f:\mathcal{X}\rightarrow(-\infty,+\infty]$ be a proper and lower semi-continuous function with $\mathrm{dom}\,\psi\subset\mathrm{dom}\,f$ , which is continuously differentiable. We say $(f,\psi)$ is $(\bar{L},\underline{L})$ - smooth adaptable on $C$ if there exist $\bar{L}>0$ and $\underline{L}\geq 0$ such that for any $x,y$ ,

\displaystyle f(x)-f(y)-\langle\nabla f(y),x-y\rangle\leq\bar{L}D_{\psi}(x,y),

(4)

and

\displaystyle-\underline{L}D_{\psi}(x,y)\leq f(x)-f(y)-\langle\nabla f(y),x-y\rangle.

(5)

If $\underline{L}=\bar{L}$ , it recovers [5, Definition 2.2]. Suppose $f$ is convex. If $\underline{L}=0$ , this definition recovers [2, Lemma 1] and [38, Definition 1.1].

Definition 3.

([33] $\epsilon$ -stationary point) Given $\epsilon>0$ , a solution $\{x_{1}^{*},\dots,x_{s}^{*}\}$ is said to be an $\epsilon$ -stationary point of function $\Phi(x_{1},\dots,x_{s})$ if

\mbox{dist}(0,\partial\Phi(x_{1}^{*},\dots,x_{s}^{*}))\leq\epsilon.

2.1 Generalized CP decomposition

Consider an $N$-th order tensor $\mathcal{X}\in\mathbb{R}^{I_{1}\times I_{2}\times\dots\times I_{N}}$, where $I_{n}$ represents the size of the $n$-th mode of $\mathcal{X}$. Such multi-way arrays arise in many applications. $\mathcal{X}(i_{1},\dots,i_{N})$ denotes an entry of $\mathcal{X}$. The entries of the data tensor $\mathcal{X}$ may be of various types, such as continuous numbers, non-negative integers, and binary values. A general problem of tensor CP decomposition is to approximate $\mathcal{X}$ using a low-rank tensor $\mathcal{M}$, defined by

\mathcal{M}=\sum_{r=1}^{R}A_{1}(:,r)\circ\dots\circ A_{N}(:,r),

(6)

where $\circ$ denotes the outer product of vectors and $A_{n}\in\mathbb{R}^{I_{n}\times R}$ is the unknown mode-$n$ latent factor matrix. $R$ is the smallest positive integer for which equation (6) is satisfied, and it is also known as the rank of $\mathcal{M}$.
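The CP model (6) can be sketched in a few lines. The function name `cp_to_tensor` and the small random factors below are hypothetical and serve only to show how a low-rank tensor is assembled from its factor matrices.

```python
import numpy as np

def cp_to_tensor(factors):
    """Sum of R outer products of the r-th columns of the factors, as in (6)."""
    rank = factors[0].shape[1]
    shape = tuple(A.shape[0] for A in factors)
    M = np.zeros(shape)
    for r in range(rank):
        comp = factors[0][:, r]
        for A in factors[1:]:
            # outer product with the r-th column of the next factor
            comp = np.multiply.outer(comp, A[:, r])
        M += comp
    return M

rng = np.random.default_rng(0)
A1, A2, A3 = rng.random((4, 2)), rng.random((3, 2)), rng.random((5, 2))
M = cp_to_tensor([A1, A2, A3])
```

Each entry of `M` equals $\sum_{r}A_{1}(i_{1},r)A_{2}(i_{2},r)A_{3}(i_{3},r)$, which is the entry-wise form used in the constraint of the GCP problem.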

Let $I=(I_{1}I_{2}\dots I_{N})^{\frac{1}{N}}$ be the geometric mean of the dimensions. An $N$ -dimensional integer vector $i$ is used to represent the entry coordinate, i.e.,

i\in\mathcal{I}\triangleq\left\{\left(i_{1},i_{2},\ldots,i_{N}\right)\mid i_{n}=1,2,\ldots,I_{n},\forall n\right\}.

The generalized CP (GCP) decomposition problem can be formulated as the following optimization task, where the primary objective is to minimize a data-adaptive loss function denoted by $f(\cdot,\cdot)$ : $\mathbb{R}\times\mathbb{R}\mapsto\mathbb{R}$ ,

\displaystyle\begin{aligned} \min_{A_{1},A_{2},\ldots,A_{N}}&\quad\frac{1}{I^{N}}\sum_{i\in\mathcal{I}}f\left(\underline{\mathcal{M}}_{i};\underline{\mathcal{X}}_{i}\right)+\sum_{n=1}^{N}h_{n}\left(A_{n}\right)\\ \text{ s.t. }&\quad\underline{\mathcal{M}}_{i}=\sum_{r=1}^{R}\prod_{n=1}^{N}{A}_{n}\left(i_{n},r\right),\forall\,i\in\mathcal{I},\end{aligned}

(7)

where the entries $\underline{\mathcal{X}}_{i}$ and $\underline{\mathcal{M}}_{i}$ correspond to the elements of $\mathcal{X}$ and $\mathcal{M}$ indexed by $i$ , respectively. $h_{n}(A_{n})$ is a structure-promoting regularizer that captures the prior information about the latent factors ${A}_{n}$ , such as column-wise orthogonality [32], Tikhonov regularization [44], iterated Tikhonov regularization [40], nonnegativity [36]. Those regularizations can result in well-posed problems. For instance, if $A_{n}\in\mathbb{R}^{I_{n}\times R}_{+}:=\{A_{n}|A_{n}\geq 0\}$ is applied, we can write $h_{n}(\cdot)$ as the indicator function:

h_{n}(A)=\mathcal{I}_{\mathbb{R}_{+}^{I_{n}\times R}}\left(A\right)=\left\{\begin{array}{ll}0,&A\geq 0,\\ \infty,&\text{ otherwise. }\end{array}\right.

Lim and Comon [36] showed that the nonnegative CP decomposition always has optimal solutions. By selecting appropriate loss functions $f$ , problem (7) becomes adaptable to handle diverse data types, including continuous, count, and binary data. To illustrate, we present several representative motivating examples.

The difference between GCP and the conventional CP formulation lies in the flexibility in the selection of loss functions. In this section, we provide alternative loss functions by examining the statistical likelihood of a model for a specific data tensor. We assume a parameterized probability density function (PDF) or probability mass function (PMF) is available, offering the likelihood estimation for each entry, i.e.,

x_{i}\sim p\left(x_{i}\mid\theta_{i}\right),\quad\text{ where }\quad\ell\left(\theta_{i}\right)=m_{i}.

Here $x_{i}$ represents an observation of a random variable, while $\ell(\cdot)$ denotes an invertible link function connecting the model parameter $m_{i}$ with the natural parameter of the distribution, $\theta_{i}$ . The link function is commonly assumed to be the identity function or related to the expectation of distribution.

We aim to find the model $\mathcal{M}$ that is the maximum likelihood estimate (MLE) across all entries. By assuming conditional independence of observations, the overall likelihood is simplified to the product of individual likelihoods. Hence, the MLE is the solution of the following optimization problem:

\max_{\mathcal{M}}\ L(\mathcal{M};\mathcal{X})\equiv\prod_{i\in\Omega}p\left(x_{i}\mid\ell^{-1}(m_{i})\right),\quad\forall\,i\in\Omega.

Here, $\mathcal{M}=\{m_{i}\}_{i\in\Omega}$ . Then, we employ the negative logarithm to transform the product into a summation and convert it to a minimization problem. As the logarithm is a monotonic function, it preserves the maximizer.

\min_{\mathcal{M}}\ F(\mathcal{M};\mathcal{X})\equiv\sum_{i\in\Omega}f_{0}\left(m_{i};x_{i}\right),\quad\text{ where}\quad f_{0}(m_{i};x_{i})\equiv-\log p\left(x_{i}\mid\ell^{-1}(m_{i})\right).

(8)
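As a worked instance of (8), consider count data with a Poisson PMF and the identity link, so that $\theta_{i}=m_{i}$. The negative log-likelihood of a single entry is then:

```latex
p(x_i \mid \theta_i) = \frac{e^{-\theta_i}\,\theta_i^{x_i}}{x_i!},
\qquad \theta_i = m_i
\quad\Longrightarrow\quad
-\log p(x_i \mid m_i) = m_i - x_i \log m_i + \log(x_i!).
```

The term $\log(x_{i}!)$ does not depend on $m_{i}$ and can be dropped without changing the minimizer, which leaves the elementwise loss $f_{0}(m_{i};x_{i})=m_{i}-x_{i}\log m_{i}$ and requires $m_{i}>0$.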

Table 2: Generalized loss function for different data types.

Data Type	Distribution	Link Function	Loss function	Constraints
Continuous: $x\in\mathbb{R}$	Gaussian	$\ell^{-1}(m)=m$	$\frac{1}{2}(x-m)^{2}$	$A_{n}\in\mathbb{R}^{I_{n}\times R}$
Continuous: $x\in\mathbb{R}$	Gamma	$\ell^{-1}(m)=m$	$\frac{x}{m}+\log(m)$	$A_{n}\in\mathbb{R}^{I_{n}\times R}_{+}$
Count: $x\in\mathbb{N}$	Poisson	$\ell^{-1}(m)=m$	$m-x\log(m)$	$A_{n}\in\mathbb{R}^{I_{n}\times R}_{+}$
Count: $x\in\mathbb{N}$	Poisson	$\ell^{-1}(m)=e^{m}$	$e^{m}-xm$	$A_{n}\in\mathbb{R}^{I_{n}\times R}$
Binary: $x\in\{0,1\}$	Bernoulli	$\ell^{-1}(m)=\frac{m}{1+m}$	$\log(m+1)-x\log(m)$	$A_{n}\in\mathbb{R}^{I_{n}\times R}_{+}$
Binary: $x\in\{0,1\}$	Bernoulli	$\ell^{-1}(m)=\frac{e^{m}}{1+e^{m}}$	$\log\left(1+e^{m}\right)-xm$	$A_{n}\in\mathbb{R}^{I_{n}\times R}$

Table 2 presents commonly utilized generalized loss functions $f(x,m)$ , associated link functions, and various distributions.
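The losses in Table 2 are simple elementwise functions of the model value $m$ and the observation $x$. The sketch below collects them together with their derivatives in $m$, which are the quantities a gradient-based method needs. The dictionary `LOSSES`, its keys, and the helper `numeric_derivative` are illustrative names, not part of the paper.

```python
import numpy as np

# Elementwise losses f(m; x) from Table 2, paired with d f / d m.
LOSSES = {
    "gaussian": (lambda m, x: 0.5 * (x - m) ** 2,
                 lambda m, x: m - x),
    "gamma": (lambda m, x: x / m + np.log(m),
              lambda m, x: -x / m ** 2 + 1.0 / m),
    "poisson": (lambda m, x: m - x * np.log(m),
                lambda m, x: 1.0 - x / m),
    "poisson_log": (lambda m, x: np.exp(m) - x * m,
                    lambda m, x: np.exp(m) - x),
    "bernoulli_odds": (lambda m, x: np.log(m + 1.0) - x * np.log(m),
                       lambda m, x: 1.0 / (m + 1.0) - x / m),
    "bernoulli_logit": (lambda m, x: np.log1p(np.exp(m)) - x * m,
                        lambda m, x: 1.0 / (1.0 + np.exp(-m)) - x),
}

def numeric_derivative(f, m, x, h=1e-6):
    """Central finite difference in m, used only as a sanity check."""
    return (f(m + h, x) - f(m - h, x)) / (2.0 * h)
```

Several of these derivatives, for example $1-x/m$ for the Poisson loss, are unbounded as $m\to 0$, which illustrates why a global Lipschitz gradient assumption fails for such losses.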

2.2 Stochastic methods for GCP decomposition

We can modify the element-wise regularized problem (7) to a block-wise regularized problem

\displaystyle\min_{\{A_{n}\}_{n=1}^{N}}\,\,\Phi(A_{1},\dots,A_{N}):=f(A_{1},\dots,A_{N})+\sum_{n=1}^{N}h_{n}(A_{n}),

(9)

where $f(A_{1},\dots,A_{N})=\frac{1}{I^{N}}\sum_{i\in\mathcal{I}}f_{0}([\![A_{1},A_{2},\cdots,A_{N}]\!]_{i};x_{i})$ with $[\![A_{1},A_{2},\cdots,A_{N}]\!]_{i}=m_{i}=\sum_{r=1}^{R}\prod_{n=1}^{N}{A}_{n}\left(i_{n},r\right)$ for all $i\in\mathcal{I}$, and $f:\Pi_{n=1}^{N}\mathbb{R}^{I_{n}\times R}\rightarrow\mathbb{R}$ is finite-valued and differentiable. For brevity, let $\mathcal{V}(\Phi)=\min_{\{A_{n}\}_{n=1}^{N}}\Phi(A_{1},\dots,A_{N})$.

We now present the stochastic gradient with respect to the factor $A_{n}$ $(n=1,\dots,N)$, denoted by $\tilde{\nabla}_{A_{n}}f(A_{1},\dots,A_{N})$. Suppose a mode index $n$ is sampled from $\{1,\dots,N\}$. We take the squared error loss as an example, namely,

\displaystyle f(A_{1},\dots,A_{N})=\frac{1}{2I^{N}}\left\|[\![A_{1},A_{2},\cdots,A_{N}]\!]-\mathcal{X}\right\|_{F}^{2}=\frac{1}{2I^{N}}\left\|H_{n}A_{n}^{\top}-X_{(n)}\right\|_{F}^{2},

where $H_{n}=A_{N}\odot\dots\odot A_{n+1}\odot A_{n-1}\odot\dots\odot A_{1}\in\mathbb{R}^{J_{n}\times R}$, $\odot$ denotes the Khatri-Rao (column-wise Kronecker) product, $X_{(n)}\in\mathbb{R}^{J_{n}\times I_{n}}$ is the mode-$n$ matricization of $\mathcal{X}$, and $J_{n}=I^{N}/I_{n}$.

Then, we rewrite the full gradient as follows

\displaystyle\begin{aligned} \nabla_{A_{n}}f(A_{1},\dots,A_{N})=&\frac{1}{I^{N}}\sum_{i=1}^{J_{n}}\left(A_{n}H_{n}(i,:)^{\top}H_{n}(i,:)-X_{(n)}(i,:)^{\top}H_{n}(i,:)\right)\\ =&\frac{1}{I_{n}}\left(A_{n}\mathbb{E}_{i}[H_{n}(i,:)^{\top}H_{n}(i,:)]-\mathbb{E}_{i}[X_{(n)}(i,:)^{\top}H_{n}(i,:)]\right).\end{aligned}

(10)

Here, $H_{n}(i,:)$ denotes the $i$-th row of $H_{n}$, and $\mathbb{E}_{i}$ is the expectation over the index $i$ drawn uniformly from $\{1,\dots,J_{n}\}$. To alleviate the burden of computing the full gradient (10), we randomly sample a set of mode-$n$ fibers that is indexed by $\mathcal{F}_{n}\subset\{1,\dots,J_{n}\}$ with $|\mathcal{F}_{n}|=B$. Note that a mode-$n$ fiber of $\mathcal{X}$ is a row of the mode-$n$ unfolding $X_{(n)}$. Compared with the fiber sampling-based method in [1], our requirement on the batchsize $B$ is much lower. Hence, it admits lower per-iteration memory and computational complexities, especially when the rank is high.

Let $\tilde{\nabla}_{A_{n}}f(A_{1},\dots,A_{N})\in\mathbb{R}^{I_{n}\times R}$ be the stochastic gradient of $f(A_{1},\dots,A_{N})$ with respect to $A_{n}$. Then

\displaystyle\tilde{\nabla}_{A_{n}}f(A_{1},\dots,A_{N})=\frac{1}{{I_{n}}\left|\mathcal{F}_{n}\right|}\left(A_{n}H_{n}^{\top}\left(\mathcal{F}_{n}\right)H_{n}\left(\mathcal{F}_{n}\right)-X_{(n)}^{\top}\left(\mathcal{F}_{n}\right)H_{n}\left(\mathcal{F}_{n}\right)\right),

(11)

where

X_{(n)}\left(\mathcal{F}_{n}\right)=X_{(n)}\left(\mathcal{F}_{n},:\right),\quad H_{n}\left(\mathcal{F}_{n}\right)=H_{n}\left(\mathcal{F}_{n},:\right).
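The fiber-sampled gradient (11) for the squared loss can be sketched as follows. This is a minimal illustration under assumptions: the helper names are hypothetical, and the rows of the unfolding are ordered in NumPy's C order rather than in the ordering implied by $H_{n}=A_{N}\odot\dots\odot A_{1}$. Any ordering works as long as the rows of $X_{(n)}$ and $H_{n}$ are indexed consistently.

```python
import numpy as np

def unfold(X, n):
    """Mode-n unfolding with shape (J_n, I_n): each row is one mode-n fiber."""
    return np.moveaxis(X, n, -1).reshape(-1, X.shape[n])

def sampled_khatri_rao(factors, n, rows):
    """Rows of H_n for the given fiber indices, without forming all of H_n."""
    other = [m for m in range(len(factors)) if m != n]
    dims = tuple(factors[m].shape[0] for m in other)
    idx = np.unravel_index(rows, dims)  # C order, matching unfold()
    H = np.ones((len(rows), factors[n].shape[1]))
    for m, i_m in zip(other, idx):
        H *= factors[m][i_m, :]
    return H

def fiber_gradient(X, factors, n, rows):
    """Least-squares stochastic gradient (11) from the fibers indexed by rows."""
    I_n = X.shape[n]
    H = sampled_khatri_rao(factors, n, rows)
    X_rows = unfold(X, n)[rows, :]
    return (factors[n] @ (H.T @ H) - X_rows.T @ H) / (I_n * len(rows))

rng = np.random.default_rng(0)
shape, R, B = (6, 5, 4), 3, 8
factors = [rng.random((I, R)) for I in shape]
X = rng.random(shape)
n = 1
J_n = X.size // shape[n]
batch = rng.choice(J_n, size=B, replace=False)
g_stoch = fiber_gradient(X, factors, n, batch)
g_full = fiber_gradient(X, factors, n, np.arange(J_n))
```

Sampling all $J_{n}$ fibers recovers the full gradient (10), and only $B$ rows of $H_{n}$ are ever formed for a mini-batch, which is where the savings in memory and computation come from.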

2.3 Stochastic mirror descent

In this subsection, we introduce some basics for optimization methods used in this paper, such as stochastic mirror descent (SMD) and inertial framework.

Consider the special case in optimization problem (1) with the block $s=1$

\min_{x\in\mathbb{R}^{d}}\Phi\left(x\right)=f(x)+h\left(x\right),

where $f$ is a continuously differentiable nonconvex function and $h$ is an extended-valued function that is bounded from below. The update of SMD [37, 41, 64] is given as follows,

x^{k+1}\in\underset{x}{\arg\min}\,\,h(x)+\langle\tilde{\nabla}f(x^{k}),x-x^{k}\rangle+\frac{1}{\eta^{k}}D_{\psi}(x,x^{k}),

where the stochastic gradient can be chosen as the mini-batch version $\tilde{\nabla}f(x^{k})=\frac{1}{|B^{k}|}\sum_{i\in B^{k}}\nabla f_{i}(x^{k})$. Here, the mini-batch $B^{k}$ is chosen uniformly at random from all subsets of $\{1,\dots,T\}$ of size $B$, with the batchsize $|B^{k}|=B$ being considerably smaller than $T$. Compared with SGD, SMD replaces the quadratic proximal term $\frac{1}{2}\left\|x-x^{k}\right\|^{2}$ with the Bregman distance $D_{\psi}(x,x^{k})$. When $\psi$ is properly designed, SMD can exploit the geometry of the problem and achieve significant efficiency enhancements compared to SGD, particularly when utilizing generalized loss functions. An extensive literature on MD and SMD in optimization is available [5, 34, 14, 64, 35].
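To make the role of the kernel concrete, the following sketch shows one SMD step in the special case $h=0$, where the subproblem has the closed form $\nabla\psi(x^{k+1})=\nabla\psi(x^{k})-\eta^{k}\tilde{\nabla}f(x^{k})$. The entropy-type kernel and the small Poisson regression problem are assumptions made for illustration only. They are not the setting analyzed in the paper, whose assumptions require a strongly convex kernel.

```python
import numpy as np

def smd_step_entropy(x, grad, eta):
    """SMD step with h = 0 and psi(x) = sum(x log x - x): multiplicative update."""
    return x * np.exp(-eta * grad)

def smd_step_euclid(x, grad, eta):
    """Same step with psi(x) = 0.5 * ||x||^2, which is plain SGD."""
    return x - eta * grad

rng = np.random.default_rng(0)
A = rng.random((50, 4)) + 0.1            # positive design rows a_i
x_true = np.array([1.0, 2.0, 0.5, 1.5])
b = rng.poisson(A @ x_true)              # count observations

def minibatch_grad(x, batch):
    """Gradient of mean_i [a_i^T x - b_i log(a_i^T x)] over the batch."""
    Ab = A[batch]
    return Ab.T @ (1.0 - b[batch] / (Ab @ x)) / len(batch)

x = np.ones(4)
batch = rng.choice(50, size=10, replace=False)
g = minibatch_grad(x, batch)
x_new = smd_step_entropy(x, g, eta=0.1)
```

The multiplicative update keeps every coordinate positive without a projection, whereas the Euclidean step may leave the positive orthant, where the Poisson loss is undefined.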

As usual in the analysis of Bregman-based schemes, the following simple but remarkable three-point identity for $D_{\psi}$, which follows from elementary algebra, is very useful. Given any $x\in\operatorname{dom}\psi$ and $y,z\in\operatorname{int}\operatorname{dom}\psi$, the three-point identity is

D_{\psi}(x,z)=D_{\psi}(x,y)+D_{\psi}(y,z)+\langle\nabla\psi(y)-\nabla\psi(z),x-y\rangle.

(12)
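For completeness, identity (12) can be verified directly from definition (2) by expanding the two divergences on the right-hand side and comparing with the left-hand side:

```latex
\begin{aligned}
D_{\psi}(x,y)+D_{\psi}(y,z)
  &= \psi(x)-\psi(z)-\langle\nabla\psi(y),x-y\rangle-\langle\nabla\psi(z),y-z\rangle,\\
D_{\psi}(x,z)-D_{\psi}(x,y)-D_{\psi}(y,z)
  &= \langle\nabla\psi(y),x-y\rangle-\langle\nabla\psi(z),x-z\rangle+\langle\nabla\psi(z),y-z\rangle\\
  &= \langle\nabla\psi(y)-\nabla\psi(z),x-y\rangle.
\end{aligned}
```

The last step uses $-(x-z)+(y-z)=-(x-y)$, so the two inner products involving $\nabla\psi(z)$ combine into a single term.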

For the multi-block problem, Pu et al. [49] developed a unified stochastic mirror descent algorithmic framework (SmartCPD) for large-scale CPD under various non-Euclidean losses, which is a special case of the multi-block problem and updates the factor variables by

\displaystyle\begin{aligned} A_{n}^{k+1}&=\arg\min_{A_{n}}\,\,h_{n}\left(A_{n}\right)+\langle\tilde{\nabla}_{A_{n}}f(A_{1}^{k},\dots,A_{N}^{k}),A_{n}-A_{n}^{k}\rangle+\frac{1}{\eta^{k}}D_{\psi}(A_{n},A_{n}^{k}),\\ A_{n^{\prime}}^{k+1}&=A_{n^{\prime}}^{k},\quad n^{\prime}\neq n.\end{aligned}

(13)

However, directly employing stochastic mirror descent for the GCP problem may not yield the most effective results. In this paper, we study stochastic gradients under the variance-reduced stochastic gradient estimators, such as SAGA [15] and SARAH [43]. Furthermore, the inertial acceleration framework is applied, which can be given by

\displaystyle\begin{aligned} \bar{x}^{k}&=x^{k}+\alpha^{k}(x^{k}-x^{k-1}),\\ \hat{x}^{k}&=x^{k}+\beta^{k}(x^{k}-x^{k-1}),\\ x^{k+1}&=\hat{x}^{k}-\eta^{k}\nabla f(\bar{x}^{k}),\end{aligned}

where $\alpha^{k},\beta^{k}\in[0,1]$ are two inertial parameters. For example, if $\alpha^{k}=\beta^{k}=0$, the scheme reduces to the gradient descent method; if $\alpha^{k}=0$, it reduces to the heavy-ball method [48]; and if $\alpha^{k}=\beta^{k}$, it reduces to the Nesterov accelerated gradient method [42].
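The three special cases of the inertial scheme can be checked on a toy problem. The sketch below uses constant parameters and a hypothetical quadratic objective, chosen only to show how $\alpha$ and $\beta$ enter the update. It is not the stochastic Bregman version analyzed later.

```python
import numpy as np

def inertial_step(x, x_prev, grad_f, eta, alpha, beta):
    """One step of the two-parameter inertial gradient scheme."""
    x_bar = x + alpha * (x - x_prev)   # point where the gradient is evaluated
    x_hat = x + beta * (x - x_prev)    # point from which the step is taken
    return x_hat - eta * grad_f(x_bar)

def run(grad_f, x0, eta, alpha, beta, iters=200):
    """Iterate the scheme from x^{-1} = x^0 = x0 with constant parameters."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        x, x_prev = inertial_step(x, x_prev, grad_f, eta, alpha, beta), x
    return x

# Toy objective f(x) = 0.5 * x^T Q x with minimizer x = 0.
Q = np.diag([1.0, 10.0])
grad_f = lambda x: Q @ x
x0 = np.array([1.0, 1.0])

x_gd = run(grad_f, x0, eta=0.09, alpha=0.0, beta=0.0)    # gradient descent
x_hb = run(grad_f, x0, eta=0.09, alpha=0.0, beta=0.5)    # heavy ball
x_nag = run(grad_f, x0, eta=0.09, alpha=0.5, beta=0.5)   # Nesterov-type
```

Keeping $\alpha$ and $\beta$ separate lets the gradient evaluation point and the base point of the step be extrapolated independently, which is the extra freedom the inertial framework adds over each special case.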

3 Inertial accelerated block-randomized SMD

In this section, we propose an inertial accelerated block-randomized stochastic mirror descent algorithm (iTableSMD) for GCP decomposition (9). Before presenting the algorithm framework of iTableSMD, we make the following assumptions throughout the paper.

Assumption 1.

We assume that the following conditions hold:

(i)

$h_{n}:\mathbb{R}^{I_{n}\times R}\rightarrow\mathbb{R}\cup\{+\infty\}$ $(n=1,2,\dots,N)$ are proper lower semi-continuous (l.s.c.) functions that are bounded from below. There exists $\alpha\in\mathbb{R}_{+}$ such that $h_{n}(\cdot)+\frac{\alpha}{2}\|\cdot\|^{2}$ is convex for every $n$.
(ii)

$\psi:\mathbb{R}^{I_{n}\times R}\rightarrow\mathbb{R}$ is continuously differentiable and $\sigma$ -strongly convex. Let $\sigma=1$ for simplicity.

(iii)

$\nabla\psi$ is Lipschitz continuous with modulus $M_{2}>0$, that is, for any two points $A_{i}$, $\hat{A}_{i}\in\mathbb{R}^{I_{n}\times R}$, it holds that

\|\nabla\psi(A_{i})-\nabla\psi(\hat{A}_{i})\|\leq M_{2}\|A_{i}-\hat{A}_{i}\|,\quad i=1,2,\dots,N.

(iv)

$f:\Pi_{n=1}^{N}\mathbb{R}^{I_{n}\times R}\rightarrow\mathbb{R}$ is a proper and lower semi-continuous function with $\text{dom }\psi\subset\text{dom }f$ .
(v)

The couple of functions $(f,\psi)$ is $(\bar{L},\underline{L})$ -smooth adaptable.
(vi)

The function $\Phi$ is bounded from below, i.e., there exists a finite optimal objective value $\mathcal{V}(\Phi)$ .

Algorithm 1 iTableSMD: inertial accelerated block-randomized stochastic mirror descent for the optimization problem (9)

Input: an $N$-way tensor $\mathcal{X}\in\mathbb{R}^{I_{1}\times\dots\times I_{N}}$; the rank $R$; the sample size $B$; initialization $\{A_{n}^{-1}\}_{n=1}^{N},\{A_{n}^{0}\}_{n=1}^{N}$; stepsize $\{\eta_{k}\}_{k\geq 0}$; inertial parameters $\{\alpha^{k}\}_{k\geq 0},\{\beta^{k}\}_{k\geq 0}\in[0,1]$; two constants $\delta,\epsilon$ with $1>\delta>\epsilon>0$.

1: $k\leftarrow 0$;
2: repeat
3:   sample $n$ uniformly from $\{1,\dots,N\}$
4:   sample $\mathcal{F}_{n}$ uniformly from $\{1,\dots,J_{n}\}$ with $|\mathcal{F}_{n}|=B$
5:   compute $\tilde{A}_{n}^{k}={A}_{n}^{k}+\alpha^{k}(A_{n}^{k}-A_{n}^{k-1})\in\mathrm{int}\,\mathrm{dom}\,\psi$, where $\alpha^{k}\in[0,1)$
6:   compute an extrapolation parameter $\beta^{k}$ such that

\displaystyle D_{\psi}(A_{n}^{k},\underline{A}_{n}^{k})\leq\frac{\delta-\epsilon}{1+\underline{L}\eta_{k-1}}D_{\psi}(A_{n}^{k-1},A_{n}^{k}),

(14)

     where $\underline{A}_{n}^{k}=A_{n}^{k}+\beta^{k}(A_{n}^{k}-A_{n}^{k-1})\in\mathrm{int}\,\mathrm{dom}\,\psi$
7:   compute the stochastic gradient $\tilde{\nabla}_{A_{n}}f(A_{1}^{k}\cdots\underline{A}_{n}^{k}\cdots A_{N}^{k})$ with the batchsize of fibers $B$
8:   set $\eta_{k}\leq\min\{\eta_{k-1},\bar{L}^{-1}\}$ and update $A_{n}^{k+1}$ and $A_{n^{\prime}}^{k+1}$ by

\displaystyle\begin{aligned} A_{n}^{k+1}&=\arg\min_{A_{n}}\,\,h_{n}\left(A_{n}\right)+\langle\tilde{\nabla}_{A_{n}}f(A_{1}^{k}\cdots\underline{A}_{n}^{k}\cdots A_{N}^{k}),A_{n}-\tilde{A}_{n}^{k}\rangle+\frac{1}{\eta_{k}}D_{\psi}(A_{n},\tilde{A}_{n}^{k}),\\ A_{n^{\prime}}^{k+1}&=A_{n^{\prime}}^{k},\quad\forall n^{\prime}\neq n.\end{aligned}

(15)

9:   $k\leftarrow k+1$;
10: until some stopping criterion is reached;

Output: $\{A_{n}^{k}\}_{n=1}^{N}$.
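Step 6 of Algorithm 1 only requires some $\beta^{k}$ satisfying (14) and does not prescribe how to find it. One simple possibility, shown here as an assumption rather than as the authors' procedure, is to backtrack from an initial guess. The function name `find_beta` and its default parameters are hypothetical, and the sketch assumes $\operatorname{dom}\psi$ is the whole space so that the extrapolated point is always admissible.

```python
import numpy as np

def bregman(psi, grad_psi, x, y):
    """D_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>."""
    return psi(x) - psi(y) - np.sum(grad_psi(y) * (x - y))

def find_beta(psi, grad_psi, A, A_prev, eta_prev, L_low, delta, eps,
              beta0=1.0, shrink=0.5, max_tries=50):
    """Shrink beta until the extrapolation condition (14) holds."""
    bound = (delta - eps) / (1.0 + L_low * eta_prev) * bregman(psi, grad_psi, A_prev, A)
    beta = beta0
    for _ in range(max_tries):
        A_ext = A + beta * (A - A_prev)          # candidate extrapolated point
        if bregman(psi, grad_psi, A, A_ext) <= bound:
            return beta, A_ext
        beta *= shrink
    return 0.0, A.copy()                         # beta = 0 always satisfies (14)

# Quadratic kernel psi(A) = 0.5 * ||A||_F^2 as a simple test case.
psi = lambda A: 0.5 * np.sum(A ** 2)
grad_psi = lambda A: A
A_prev, A = np.zeros((2, 2)), np.ones((2, 2))
beta, A_ext = find_beta(psi, grad_psi, A, A_prev, eta_prev=1.0,
                        L_low=0.0, delta=0.9, eps=0.1)
```

For the quadratic kernel, condition (14) reduces to $(\beta^{k})^{2}\leq(\delta-\epsilon)/(1+\underline{L}\eta_{k-1})$, so a closed-form choice is available there and backtracking is only needed for general kernels.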

Let $\xi^{k}$ and $\zeta^{k}$ be the stochastic parameters for the block index and the stochastic gradient, respectively. Denote $\mathbb{E}_{k}[\cdot]=\mathbb{E}[\cdot|\xi^{k},\zeta^{k}]$ and $\mathbb{E}[\cdot]=\mathbb{E}[\cdot|\xi^{0},\zeta^{0},\dots]$ .

Definition 4.

(Variance reduced stochastic gradient) We say a gradient estimator $\tilde{\nabla}_{A_{n}}f$ with $n=1,2\dots,N$ , is variance-reduced with constants $V_{1},V_{2},V_{\Gamma}\geq 0$ , and $\tau\in(0,1]$ if it satisfies the following conditions:

(i)

(MSE Bound): there exists a sequence of random variables $\{\Gamma_{k}\}_{k\geq 1}$ such that

\displaystyle\begin{aligned} &\mathbb{E}_{k}[\|\tilde{\nabla}_{A_{\xi^{k}}}f(A_{1}^{k},\cdots,A_{\xi^{k}-1}^{k},\underline{A}_{\xi^{k}}^{k},A_{\xi^{k}+1}^{k},\cdots,A_{N}^{k})-\nabla_{A_{\xi^{k}}}f(A_{1}^{k},\cdots,A_{\xi^{k}-1}^{k},\underline{A}_{\xi^{k}}^{k},A_{\xi^{k}+1}^{k},\cdots,A_{N}^{k})\|_{*}^{2}]\\ \leq&\Gamma_{k}+V_{1}(\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}),\end{aligned}

(16)

and random variables $\{\Upsilon_{k}\}_{k\geq 1}$ such that

\displaystyle\begin{aligned} &\mathbb{E}_{k}[\|\tilde{\nabla}_{A_{\xi^{k}}}f(A_{1}^{k},\cdots,A_{\xi^{k}-1}^{k},\underline{A}_{\xi^{k}}^{k},A_{\xi^{k}+1}^{k},\cdots,A_{N}^{k})-\nabla_{A_{\xi^{k}}}f(A_{1}^{k},\cdots,A_{\xi^{k}-1}^{k},\underline{A}_{\xi^{k}}^{k},A_{\xi^{k}+1}^{k},\cdots,A_{N}^{k})\|_{*}]\\ \leq&\Upsilon_{k}+V_{2}(\|A^{k}-A^{k-1}\|+\|A^{k-1}-A^{k-2}\|).\end{aligned}

(17)

(ii)

(Geometric Decay): The sequence $\{\Gamma_{k}\}_{k\geq 1}$ satisfies the following inequality in expectation:

\displaystyle\mathbb{E}_{k}[\Gamma_{k+1}]\leq(1-\tau)\Gamma_{k}+V_{\Gamma}(\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}).

(18)

(iii)

(Convergence of Estimator): For all sequences $\{A^{k}\}_{k=0}^{\infty}$, if $\lim_{k\rightarrow\infty}\mathbb{E}\|A^{k}-A^{k-1}\|^{2}=0$, then it follows that $\mathbb{E}\Gamma_{k}\rightarrow 0$ and $\mathbb{E}\Upsilon_{k}\rightarrow 0$.

In Proposition 1, we show both SAGA and SARAH are variance reduced stochastic gradients.
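To give a feel for what a variance-reduced estimator looks like, the sketch below implements the standard single-block SARAH recursion on a toy least-squares finite sum. This is a generic textbook form under stated assumptions. The block-wise, fiber-sampled estimators used in the paper may differ in their details, and all names here are illustrative.

```python
import numpy as np

def sarah_estimator(grad_i, x, x_prev, g_prev, batch):
    """g^k = g^{k-1} + mean_{i in batch} [grad f_i(x^k) - grad f_i(x^{k-1})]."""
    diff = np.mean([grad_i(i, x) - grad_i(i, x_prev) for i in batch], axis=0)
    return g_prev + diff

# Toy finite sum: f_i(x) = 0.5 * (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))
b = rng.standard_normal(20)
grad_i = lambda i, x: A[i] * (A[i] @ x - b[i])
full_grad = lambda x: A.T @ (A @ x - b) / len(b)

x_prev = np.zeros(3)
g_prev = full_grad(x_prev)               # the recursion starts from a full gradient
x = x_prev - 0.1 * g_prev
batch = rng.choice(len(b), size=5, replace=False)
g = sarah_estimator(grad_i, x, x_prev, g_prev, batch)
g_all = sarah_estimator(grad_i, x, x_prev, g_prev, np.arange(len(b)))
```

Because the estimator reuses `g_prev`, its conditional expectation generally differs from the true gradient once `g_prev` is itself inexact, which is why such estimators are treated as biased and controlled through an MSE bound rather than an unbiasedness assumption.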

4 Convergence analysis

This section establishes the convergence properties of the iTableSMD algorithm. We prove its sublinear convergence rate for the subsequential sequence and further show that iTableSMD requires at most $\mathcal{O}(\varepsilon^{-2})$ iterations in expectation to attain an $\varepsilon$ -stationary point. Additionally, we confirm the global convergence of the generated sequence.

4.1 Subsequential convergence analysis

Next, we show the descent amount of $\Phi\left(A_{1}^{k+1},\cdots,A_{N}^{k+1}\right)$ under expectation in the following lemma.

Lemma 1.

Suppose Assumption 1 is satisfied and $\tilde{\nabla}_{A_{n}}f$ , with $n=1,2,\dots,N$ , is variance-reduced in the sense of Definition 4. Let $\{A_{n}^{k}\}_{k>0}$ with $n\in\{1,\dots,N\}$ be the sequence generated by Algorithm 1. Then the following inequality holds for any $k>0$ ,

		$\displaystyle\mathbb{E}_{k}[\Phi\left(A_{1}^{k+1},\cdots,A_{N}^{k+1}\right)]+% \frac{1}{2\bar{\gamma}\tau}\mathbb{E}_{k}[\Gamma_{k+1}]+\left(\frac{1}{\eta_{k% }}-\alpha-\bar{\gamma}-\frac{\gamma_{k}}{\eta_{k}}\right)\mathbb{E}_{k}[D^{N}_% {\psi}(A^{k},A^{k+1})]$
	$\displaystyle\leq$	$\displaystyle\Phi\left(A_{1}^{k},\cdots,A_{N}^{k}\right)+\frac{1}{2\bar{\gamma% }\tau}\Gamma_{k}+\left(\frac{\delta-\epsilon}{\eta_{k}}+\frac{\bar{\gamma}}{2}% +\frac{M_{2}^{2}(\alpha_{k}-\beta_{k})^{2}}{\eta_{k}\gamma_{k}}\right)D^{N}_{% \psi}(A^{k-1},A^{k})+\frac{\bar{\gamma}}{2}D^{N}_{\psi}(A^{k-2},A^{k-1}).$

Here, $\bar{\gamma}=\sqrt{2(V_{\Gamma}/\tau+V_{1})}$ , $\alpha$ is the weak convexity parameter in Assumption 1 (i), $\delta$ and $\epsilon$ are introduced in (14), and $V_{1},V_{2},V_{\Gamma}\geq 0$ , $\tau\in(0,1]$ are the parameters in Definition 4.

Proof.

From the convexity of $h_{n}(\cdot)+\frac{\alpha}{2}\|\cdot\|^{2}$ , we obtain the following inequality

h_{n}(A_{n}^{k+1})+\frac{\alpha}{2}\left\|A_{n}^{k+1}\right\|^{2}+\left\langle% \xi_{k+1}+\alpha A_{n}^{k+1},A_{n}^{k}-A_{n}^{k+1}\right\rangle\leq h_{n}(A_{n% }^{k})+\frac{\alpha}{2}\left\|A_{n}^{k}\right\|^{2},

(19)

where $\xi_{k+1}\in\partial h_{n}(A_{n}^{k+1})$ . From the optimality condition of (15), we have

\xi_{k+1}+\tilde{\nabla}_{A_{n}}f(A_{1}^{k}\cdots\underline{A}_{n}^{k}\cdots A% _{N}^{k})+\frac{1}{\eta_{k}}(\nabla\psi(A_{n}^{k+1})-\nabla\psi(\tilde{A}_{n}^% {k}))=0,

which combined with (19) yields that

	$\displaystyle h_{n}(A_{n}^{k+1})$	$\displaystyle\leq h_{n}(A_{n}^{k})+\frac{\alpha}{2}\left\|A_{n}^{k+1}-A_{n}^{k}\right\|^{2}+\langle\tilde{\nabla}_{A_{n}}f(A_{1}^{k}\cdots\underline{A}_{n}^{k}\cdots A_{N}^{k}),A_{n}^{k}-{A}_{n}^{k+1}\rangle$
		$\displaystyle-\frac{1}{\eta_{k}}D_{\psi}(A_{n}^{k},A_{n}^{k+1})-\frac{1}{\eta_% {k}}D_{\psi}(A_{n}^{k+1},\tilde{A}_{n}^{k})+\frac{1}{\eta_{k}}D_{\psi}(A_{n}^{% k},\tilde{A}_{n}^{k}).$

Furthermore, since $f$ is an $(\bar{L},\underline{L})$ -relative smooth function with respect to $\psi$ , we have

	$\displaystyle f(A_{1}^{k+1}\cdots A_{n}^{k+1}\cdots A_{N}^{k+1})$	$\displaystyle\leq f(A_{1}^{k}\cdots\underline{A}_{n}^{k}\cdots A_{N}^{k})$
		$\displaystyle+\langle\nabla_{A_{n}}f(A_{1}^{k}\cdots\underline{A}_{n}^{k}% \cdots A_{N}^{k}),A_{n}^{k+1}-\underline{A}_{n}^{k}\rangle+\bar{L}D_{\psi}(A_{% n}^{k+1},\underline{A}_{n}^{k}),$

and

\displaystyle f(A_{1}^{k}\cdots\underline{A}_{n}^{k}\cdots A_{N}^{k})+\langle% \nabla_{A_{n}}f(A_{1}^{k}\cdots\underline{A}_{n}^{k}\cdots A_{N}^{k}),A_{n}^{k% }-\underline{A}_{n}^{k}\rangle\leq f(A_{1}^{k},\cdots A_{n}^{k}\cdots A_{N}^{k% })+\underline{L}D_{\psi}(A_{n}^{k},\underline{A}_{n}^{k}).

Combining the two inequalities, we get

\displaystyle\begin{aligned} f(A_{1}^{k+1}\cdots A_{n}^{k+1}\cdots A_{N}^{k+1}% )&\leq f(A_{1}^{k}\cdots A_{n}^{k}\cdots A_{N}^{k})+\langle\nabla_{A_{n}}f(A_{% 1}^{k}\cdots\underline{A}_{n}^{k}\cdots A_{N}^{k}),A_{n}^{k+1}-A_{n}^{k}% \rangle\\ &+\bar{L}D_{\psi}(A_{n}^{k+1},\underline{A}_{n}^{k})+\underline{L}D_{\psi}(A_{% n}^{k},\underline{A}_{n}^{k}).\end{aligned}

(20)

Adding (20) to the preceding bound on $h_{n}(A_{n}^{k+1})$ , we obtain

\displaystyle\begin{aligned} \Phi\left(A_{1}^{k+1},\cdots,A_{N}^{k+1}\right)&% \leq\Phi\left(A_{1}^{k},\cdots,A_{N}^{k}\right)+\frac{\alpha}{2}\|A_{n}^{k+1}-% A_{n}^{k}\|^{2}\\ &+\langle\nabla_{A_{n}}f(A_{1}^{k}\cdots\underline{A}_{n}^{k}\cdots A_{N}^{k})% -\tilde{\nabla}_{A_{n}}f(A_{1}^{k}\cdots\underline{A}_{n}^{k}\cdots A_{N}^{k})% ,A_{n}^{k+1}-A_{n}^{k}\rangle\\ &+\underline{L}D_{\psi}(A_{n}^{k},\underline{A}_{n}^{k})+\frac{1}{\eta_{k}}D_{% \psi}(A_{n}^{k},\tilde{A}_{n}^{k})-\frac{1}{\eta_{k}}D_{\psi}(A_{n}^{k},A_{n}^% {k+1})\\ &+\bar{L}D_{\psi}(A_{n}^{k+1},\underline{A}_{n}^{k})-\frac{1}{\eta_{k}}D_{\psi% }(A_{n}^{k+1},\tilde{A}_{n}^{k})\\ &\leq\Phi\left(A_{1}^{k},\cdots,A_{N}^{k}\right)+\frac{\alpha+\bar{\gamma}_{k}% }{2}\|A_{n}^{k+1}-A_{n}^{k}\|^{2}\\ &+\frac{1}{2\bar{\gamma}_{k}}\|\nabla_{A_{n}}f(A_{1}^{k}\cdots\underline{A}_{n% }^{k}\cdots A_{N}^{k})-\tilde{\nabla}_{A_{n}}f(A_{1}^{k}\cdots\underline{A}_{n% }^{k}\cdots A_{N}^{k})\|_{*}^{2}\\ &+\underline{L}D_{\psi}(A_{n}^{k},\underline{A}_{n}^{k})+\frac{1}{\eta_{k}}D_{% \psi}(A_{n}^{k},\tilde{A}_{n}^{k})-\frac{1}{\eta_{k}}D_{\psi}(A_{n}^{k},A_{n}^% {k+1})\\ &+\bar{L}D_{\psi}(A_{n}^{k+1},\underline{A}_{n}^{k})-\frac{1}{\eta_{k}}D_{\psi% }(A_{n}^{k+1},\tilde{A}_{n}^{k}),\end{aligned}

(21)

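where the last inequality follows from Young's inequality $\langle a,b\rangle\leq\frac{\gamma}{2}\|a\|^{2}+\frac{1}{2\gamma}\|b\|_{*}^{2}$ , which holds for any $\gamma>0$ and is applied here with $\gamma=\bar{\gamma}_{k}$ . The condition $\eta_{k}\leq\bar{L}^{-1}$ is used in (22) below.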

By (12), we know

	$\displaystyle D_{\psi}(A_{n}^{k+1},\underline{A}_{n}^{k})=$	$\displaystyle D_{\psi}(A_{n}^{k+1},A_{n}^{k})+D_{\psi}(A_{n}^{k},\underline{A}% _{n}^{k})+\langle\nabla\psi(A_{n}^{k})-\nabla\psi(\underline{A}_{n}^{k}),A_{n}% ^{k+1}-A_{n}^{k}\rangle,$
	$\displaystyle D_{\psi}(A_{n}^{k+1},\tilde{A}_{n}^{k})=$	$\displaystyle D_{\psi}(A_{n}^{k+1},A_{n}^{k})+D_{\psi}(A_{n}^{k},\tilde{A}_{n}% ^{k})+\langle\nabla\psi(A_{n}^{k})-\nabla\psi(\tilde{A}_{n}^{k}),A_{n}^{k+1}-A% _{n}^{k}\rangle.$
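The two identities above are instances of the three-point identity for Bregman distances. As a sanity check, the following minimal sketch verifies it numerically. The entropy kernel `psi` and the test vectors are illustrative assumptions, not the kernel or data used in the paper.

```python
import math

def psi(x):
    # Boltzmann-Shannon entropy kernel (an illustrative choice of psi).
    return sum(xi * math.log(xi) - xi for xi in x)

def grad_psi(x):
    # Gradient of the entropy kernel, taken componentwise.
    return [math.log(xi) for xi in x]

def bregman(x, y):
    # D_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>
    g = grad_psi(y)
    return psi(x) - psi(y) - sum(gi * (xi - yi) for gi, xi, yi in zip(g, x, y))

def three_point_gap(x, y, z):
    # Difference between the two sides of
    # D(x, z) = D(x, y) + D(y, z) + <grad psi(y) - grad psi(z), x - y>.
    gy, gz = grad_psi(y), grad_psi(z)
    inner = sum((p - q) * (xi - yi) for p, q, xi, yi in zip(gy, gz, x, y))
    return bregman(x, z) - (bregman(x, y) + bregman(y, z) + inner)

# Hypothetical positive vectors playing the roles of
# A_n^{k+1} (x), A_n^k (y) and the extrapolated point (z).
x = [0.5, 1.5, 2.0]
y = [1.0, 0.7, 3.0]
z = [2.0, 1.2, 0.4]
gap = three_point_gap(x, y, z)
print(abs(gap) < 1e-12)
```

The gap vanishes up to rounding for any differentiable kernel, because the identity is purely algebraic; convexity of $\psi$ is only needed for the nonnegativity of $D_{\psi}$.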

For the last two terms on the right-hand side of (21), we have

\displaystyle\begin{aligned} &\bar{L}D_{\psi}(A_{n}^{k+1},\underline{A}_{n}^{k% })-\frac{1}{\eta_{k}}D_{\psi}(A_{n}^{k+1},\tilde{A}_{n}^{k})\\ \leq&\frac{1}{\eta_{k}}\left[D_{\psi}(A_{n}^{k+1},\underline{A}_{n}^{k})-D_{% \psi}(A_{n}^{k+1},\tilde{A}_{n}^{k})\right]\\ =&\frac{1}{\eta_{k}}D_{\psi}(A_{n}^{k},\underline{A}_{n}^{k})-\frac{1}{\eta_{k% }}D_{\psi}(A_{n}^{k},\tilde{A}_{n}^{k})+\frac{1}{\eta_{k}}\langle\nabla\psi(% \tilde{A}_{n}^{k})-\nabla\psi(\underline{A}_{n}^{k}),A_{n}^{k+1}-A_{n}^{k}% \rangle\\ \leq&\frac{1}{\eta_{k}}D_{\psi}(A_{n}^{k},\underline{A}_{n}^{k})-\frac{1}{\eta% _{k}}D_{\psi}(A_{n}^{k},\tilde{A}_{n}^{k})+\frac{1}{2\eta_{k}\gamma_{k}}\|% \nabla\psi(\tilde{A}_{n}^{k})-\nabla\psi(\underline{A}_{n}^{k})\|^{2}+\frac{% \gamma_{k}}{2\eta_{k}}\|A_{n}^{k+1}-A_{n}^{k}\|^{2}\\ \leq&\frac{1}{\eta_{k}}D_{\psi}(A_{n}^{k},\underline{A}_{n}^{k})-\frac{1}{\eta% _{k}}D_{\psi}(A_{n}^{k},\tilde{A}_{n}^{k})+\frac{M_{2}^{2}}{2\eta_{k}\gamma_{k% }}\|\tilde{A}_{n}^{k}-\underline{A}_{n}^{k}\|^{2}+\frac{\gamma_{k}}{2\eta_{k}}% \|A_{n}^{k+1}-A_{n}^{k}\|^{2}\\ \leq&\frac{1}{\eta_{k}}D_{\psi}(A_{n}^{k},\underline{A}_{n}^{k})-\frac{1}{\eta% _{k}}D_{\psi}(A_{n}^{k},\tilde{A}_{n}^{k})+\frac{M_{2}^{2}(\alpha_{k}-\beta_{k% })^{2}}{2\eta_{k}\gamma_{k}}\|A_{n}^{k}-A_{n}^{k-1}\|^{2}+\frac{\gamma_{k}}{2% \eta_{k}}\|A_{n}^{k+1}-A_{n}^{k}\|^{2}.\end{aligned}

(22)

Suppose $n=\xi^{k}$ at the $k$ -th iteration. Applying the conditional expectation operator $\mathbb{E}_{k}$ to inequality (21), using (22), and bounding the MSE term by (16) in Definition 4, we have

\displaystyle\begin{aligned} &\mathbb{E}_{k}[\Phi\left(A_{1}^{k+1},\cdots,A_{N% }^{k+1}\right)]\\ \leq&\Phi\left(A_{1}^{k},\cdots,A_{N}^{k}\right)+\left(\frac{\alpha+\bar{% \gamma}_{k}}{2}+\frac{\gamma_{k}}{2\eta_{k}}\right)\mathbb{E}_{k}[\|A_{\xi^{k}% }^{k+1}-A_{\xi^{k}}^{k}\|^{2}]\\ &+\frac{1}{2\bar{\gamma}_{k}}\mathbb{E}_{k}[\|\nabla_{A_{\xi}^{k}}f(A_{1}^{k}% \cdots\underline{A}_{\xi^{k}}^{k}\cdots A_{N}^{k})-\tilde{\nabla}_{A_{\xi}^{k}% }f(A_{1}^{k}\cdots\underline{A}_{\xi^{k}}^{k}\cdots A_{N}^{k})\|_{*}^{2}]\\ &+\left(\frac{1}{\eta_{k}}+\underline{L}\right)\mathbb{E}_{k}[D_{\psi}({A}_{% \xi^{k}}^{k},\underline{A}_{\xi^{k}}^{k})]+\frac{M_{2}^{2}(\alpha_{k}-\beta_{k% })^{2}}{2\eta_{k}\gamma_{k}}\mathbb{E}_{k}[\|A_{\xi^{k}}^{k}-A_{\xi^{k}}^{k-1}% \|^{2}]-\frac{1}{\eta_{k}}\mathbb{E}_{k}[D_{\psi}({A}_{\xi^{k}}^{k},A_{\xi^{k}% }^{k+1})]\\ \leq&\Phi\left(A_{1}^{k},\cdots,A_{N}^{k}\right)+\left(\frac{\alpha+\bar{% \gamma}_{k}}{2}+\frac{\gamma_{k}}{2\eta_{k}}\right)\mathbb{E}_{k}[\|A^{k+1}-A^% {k}\|^{2}]\\ &+\frac{1}{2\bar{\gamma}_{k}}(\Gamma_{k}+V_{1}(\left\|A^{k}-A^{k-1}\right\|^{2% }+\left\|A^{k-1}-A^{k-2}\right\|^{2}))+\frac{M_{2}^{2}(\alpha_{k}-\beta_{k})^{% 2}}{2\eta_{k}\gamma_{k}}\|A^{k}-A^{k-1}\|^{2}\\ &+\left(\frac{1}{\eta_{k}}+\underline{L}\right)\left[D_{\psi}\left({A}_{1}^{k}% ,\underline{A}_{1}^{k}\right)+D_{\psi}\left({A}_{2}^{k},\underline{A}_{2}^{k}% \right)+\cdots+D_{\psi}\left({A}_{N}^{k},\underline{A}_{N}^{k}\right)\right]\\ &-\frac{1}{\eta_{k}}\mathbb{E}_{k}[D_{\psi}\left({A}_{1}^{k},A_{1}^{k+1}\right% )+D_{\psi}\left({A}_{2}^{k},A_{2}^{k+1}\right)+\cdots+D_{\psi}\left({A}_{N}^{k% },A_{N}^{k+1}\right)]\\ \leq&\Phi\left(A_{1}^{k},\cdots,A_{N}^{k}\right)+\left(\frac{\alpha+\bar{% \gamma}_{k}}{2}+\frac{\gamma_{k}}{2\eta_{k}}\right)\mathbb{E}_{k}[\|A^{k+1}-A^% {k}\|^{2}]\\ &+\frac{1}{2\bar{\gamma}_{k}\tau}(\Gamma_{k}-\mathbb{E}_{k}[\Gamma_{k+1}])+% \frac{V_{\Gamma}}{2\bar{\gamma}_{k}\tau}(\left\|A^{k}-A^{k-1}\right\|^{2}+% 
\left\|A^{k-1}-A^{k-2}\right\|^{2})\\ &+\frac{V_{1}}{2\bar{\gamma}_{k}}\left(\left\|A^{k}-A^{k-1}\right\|^{2}+\left% \|A^{k-1}-A^{k-2}\right\|^{2}\right)+\frac{M_{2}^{2}(\alpha_{k}-\beta_{k})^{2}% }{2\eta_{k}\gamma_{k}}\|A^{k}-A^{k-1}\|^{2}\\ &+\frac{\delta-\epsilon}{\eta_{k}}D^{N}_{\psi}\left({A}^{k-1},A^{k}\right)-% \frac{1}{\eta_{k}}\mathbb{E}_{k}[D^{N}_{\psi}\left({A}^{k},A^{k+1}\right)],% \end{aligned}

(23)

where the last inequality follows from (18) in Definition 4. From (14) and $\eta_{k}\leq\min\{\eta_{k-1},\bar{L}^{-1}\}$ , we obtain

\displaystyle\begin{aligned} \left(\frac{1}{\eta_{k}}+\underline{L}\right)D_{% \psi}({A}_{n}^{k},\underline{A}_{n}^{k})\leq\frac{\underline{L}\eta_{k}+1}{% \eta_{k}}\frac{\delta-\epsilon}{1+\underline{L}\eta_{k-1}}D_{\psi}({A}_{n}^{k-% 1},{A}_{n}^{k})\leq\frac{\delta-\epsilon}{\eta_{k}}D_{\psi}({A}_{n}^{k-1},{A}_% {n}^{k}),\end{aligned}

where $D^{N}_{\psi}(A,B):=D_{\psi}(A_{1},B_{1})+\cdots+D_{\psi}(A_{N},B_{N})$ is used as shorthand. Then we get

		$\displaystyle\mathbb{E}_{k}[\Phi\left(A_{1}^{k+1},\cdots,A_{N}^{k+1}\right)]$
	$\displaystyle\leq$	$\displaystyle\Phi\left(A_{1}^{k},\cdots,A_{N}^{k}\right)-\left(\frac{1}{\eta_{% k}}-\alpha-\bar{\gamma}_{k}-\frac{\gamma_{k}}{\eta_{k}}\right)\mathbb{E}_{k}[D% ^{N}_{\psi}(A^{k},A^{k+1})]$
		$\displaystyle+\frac{1}{2\bar{\gamma}_{k}\tau}(\Gamma_{k}-\mathbb{E}_{k}[\Gamma% _{k+1}])+\left(\frac{\delta-\epsilon}{\eta_{k}}+\frac{V_{\Gamma}}{\bar{\gamma}% _{k}\tau}+\frac{V_{1}}{\bar{\gamma}_{k}}+\frac{M_{2}^{2}(\alpha_{k}-\beta_{k})% ^{2}}{\eta_{k}\gamma_{k}}\right)D^{N}_{\psi}(A^{k-1},A^{k})$
		$\displaystyle+\left(\frac{V_{\Gamma}}{\bar{\gamma}_{k}\tau}+\frac{V_{1}}{\bar{\gamma}_{k}}\right)D^{N}_{\psi}(A^{k-2},A^{k-1}).$

Since $\bar{\gamma}_{k}=\sqrt{2(V_{\Gamma}/\tau+V_{1})}$ gives $\frac{V_{\Gamma}}{\bar{\gamma}_{k}\tau}+\frac{V_{1}}{\bar{\gamma}_{k}}=\frac{\bar{\gamma}_{k}}{2}$ , the result follows by rearranging the above terms. This completes the proof. ∎

Next, we introduce a new Lyapunov function and show it is monotonically decreasing in expectation. For simplicity, we denote

\Phi^{k}=\Phi\left(A_{1}^{k},\cdots,A_{N}^{k}\right).

Lemma 2.

Suppose the conditions of Lemma 1 hold, and the stepsize satisfies

\displaystyle\eta_{k}\leq\min\left\{\eta_{k-1},\frac{1}{\bar{L}},\frac{1-% \delta-2|\alpha_{k}-\beta_{k}|M_{2}}{\alpha+2\bar{\gamma}}\right\},\quad% \forall\,k>0.

(24)

Let $\{A_{n}^{k}\}_{k>0}$ with $n\in\{1,\dots,N\}$ be a sequence generated by iTableSMD (Algorithm 1) and define the following Lyapunov sequence

\displaystyle\begin{aligned} \Psi_{k+1}:=&\eta_{k}\left(\Phi^{k+1}-\mathcal{V}% (\Phi)\right)+\left(1-\eta_{k}\alpha-\eta_{k}\bar{\gamma}-\gamma_{k}-\frac{% \epsilon}{3}\right)D^{N}_{\psi}(A^{k},A^{k+1})\\ &+\eta_{k}\left(\frac{\bar{\gamma}}{2}+\frac{\epsilon}{3\eta_{k}}\right)D^{N}_% {\psi}(A^{k-1},A^{k})+\frac{\eta_{k}}{2\tau\bar{\gamma}}\Gamma_{k+1},\end{aligned}

(25)

where $\gamma_{k}=|\alpha_{k}-\beta_{k}|M_{2}$ . Then, for all $k\in\mathbb{N}$ , we have

\displaystyle\begin{aligned} \mathbb{E}_{k}[\Psi_{k+1}]\leq\Psi_{k}-\frac{% \epsilon}{3}(\mathbb{E}_{k}[D^{N}_{\psi}(A^{k},A^{k+1})]+D^{N}_{\psi}(A^{k-1},% A^{k})+D^{N}_{\psi}(A^{k-2},A^{k-1})).\end{aligned}

(26)
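The stepsize rule (24) can be evaluated directly once the constants are known. The following is a minimal sketch under that assumption; the function names and all numerical values are hypothetical and serve only to illustrate how the three caps in (24) interact.

```python
import math

def gamma_bar(V_gamma, V1, tau):
    # bar{gamma} = sqrt(2 (V_Gamma / tau + V_1)), as in Lemma 1.
    return math.sqrt(2.0 * (V_gamma / tau + V1))

def stepsize(eta_prev, L_bar, delta, alpha_k, beta_k, M2, alpha, V_gamma, V1, tau):
    # Largest eta_k allowed by (24): the minimum of eta_{k-1}, 1 / L_bar and
    # (1 - delta - 2 |alpha_k - beta_k| M_2) / (alpha + 2 bar{gamma}).
    gb = gamma_bar(V_gamma, V1, tau)
    cap = (1.0 - delta - 2.0 * abs(alpha_k - beta_k) * M2) / (alpha + 2.0 * gb)
    if cap <= 0.0:
        raise ValueError("inertial weights too far apart for a positive stepsize")
    return min(eta_prev, 1.0 / L_bar, cap)

# Hypothetical constants, chosen only for illustration.
eta = stepsize(eta_prev=1.0, L_bar=2.0, delta=0.5, alpha_k=0.3, beta_k=0.2,
               M2=1.0, alpha=0.1, V_gamma=1.0, V1=1.0, tau=0.5)
print(eta)
```

With these values the third cap is the binding one, which shows that the variance constants $V_{\Gamma}$, $V_{1}$ and $\tau$ can limit the stepsize more strongly than the relative smoothness constant $\bar{L}$.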

Proof.

From Lemma 1, we have

\displaystyle\begin{aligned} &\eta_{k}(\Phi^{k}-\mathcal{V}(\Phi))\\ \geq&\eta_{k}(\mathbb{E}_{k}[\Phi^{k+1}]-\mathcal{V}(\Phi))+(1-\eta_{k}\alpha-\eta_{k}\bar{\gamma}-\gamma_{k})\mathbb{E}_{k}[D^{N}_{\psi}(A^{k},A^{k+1})]\\ &+\frac{\eta_{k}}{2\bar{\gamma}\tau}(\mathbb{E}_{k}[\Gamma_{k+1}]-\Gamma_{k})-\left(\delta-\epsilon+\frac{\bar{\gamma}\eta_{k}}{2}+\frac{M_{2}^{2}(\alpha_{k}-\beta_{k})^{2}}{\gamma_{k}}\right)D^{N}_{\psi}(A^{k-1},A^{k})-\frac{\bar{\gamma}\eta_{k}}{2}D^{N}_{\psi}(A^{k-2},A^{k-1}).\end{aligned}

(27)

Combining (25) with $\eta_{k}\leq\eta_{k-1}$ , we have

		$\displaystyle\Psi_{k}-\mathbb{E}_{k}[\Psi_{k+1}]$
	$\displaystyle=$	$\displaystyle\eta_{k-1}(\Phi^{k}-\mathcal{V}(\Phi))+\left(1-\eta_{k-1}\alpha-% \eta_{k-1}\bar{\gamma}-\gamma_{k-1}-\frac{\epsilon}{3}\right)D^{N}_{\psi}(A^{k% -1},A^{k})-\frac{\eta_{k}}{2\tau\bar{\gamma}}\mathbb{E}_{k}[\Gamma_{k+1}]$
		$\displaystyle+\frac{\eta_{k-1}}{2\tau\bar{\gamma}}\Gamma_{k}+\eta_{k-1}\left(% \frac{\bar{\gamma}}{2}+\frac{\epsilon}{3\eta_{k-1}}\right)D^{N}_{\psi}(A^{k-2}% ,A^{k-1})-\eta_{k}(\mathbb{E}_{k}[\Phi^{k+1}]-\mathcal{V}(\Phi))$
		$\displaystyle-\eta_{k}\left(\frac{\bar{\gamma}}{2}+\frac{\epsilon}{3\eta_{k}}% \right)D^{N}_{\psi}(A^{k-1},A^{k})-(1-\eta_{k}\alpha-\eta_{k}\bar{\gamma}-% \gamma_{k}-\frac{\epsilon}{3})\mathbb{E}_{k}[D^{N}_{\psi}(A^{k},A^{k+1})]$
	$\displaystyle\geq$	$\displaystyle\eta_{k}(\Phi^{k}-\mathcal{V}(\Phi))+\left(1-\eta_{k-1}\alpha-\eta_{k-1}\bar{\gamma}-\gamma_{k-1}-\frac{\epsilon}{3}\right)D^{N}_{\psi}(A^{k-1},A^{k})-\frac{\eta_{k}}{2\tau\bar{\gamma}}\mathbb{E}_{k}[\Gamma_{k+1}]$
		$\displaystyle+\frac{\eta_{k}}{2\tau\bar{\gamma}}\Gamma_{k}+\eta_{k-1}\left(% \frac{\bar{\gamma}}{2}+\frac{\epsilon}{3\eta_{k-1}}\right)D^{N}_{\psi}(A^{k-2}% ,A^{k-1})-\eta_{k}(\mathbb{E}_{k}[\Phi^{k+1}]-\mathcal{V}(\Phi))$
		$\displaystyle-\eta_{k}\left(\frac{\bar{\gamma}}{2}+\frac{\epsilon}{3\eta_{k}}% \right)D^{N}_{\psi}(A^{k-1},A^{k})-\left(1-\eta_{k}\alpha-\eta_{k}\bar{\gamma}% -\gamma_{k}-\frac{\epsilon}{3}\right)\mathbb{E}_{k}[D^{N}_{\psi}(A^{k},A^{k+1})]$
	$\displaystyle\geq$	$\displaystyle\left(1-\delta-\eta_{k-1}\alpha-(\eta_{k-1}+\eta_{k})\bar{\gamma}% -\gamma_{k-1}-\frac{M_{2}^{2}(\alpha_{k}-\beta_{k})^{2}}{\gamma_{k}}\right)D^{% N}_{\psi}(A^{k-1},A^{k})$
		$\displaystyle+\frac{\epsilon}{3}(\mathbb{E}_{k}[D^{N}_{\psi}(A^{k},A^{k+1})]+D% ^{N}_{\psi}(A^{k-1},A^{k})+D^{N}_{\psi}(A^{k-2},A^{k-1})).$

Let $\gamma_{k}=|\alpha_{k}-\beta_{k}|M_{2}$ , so that $\frac{M_{2}^{2}(\alpha_{k}-\beta_{k})^{2}}{\gamma_{k}}=\gamma_{k}$ , and assume $\gamma_{k}\geq\gamma_{k-1}$ (in the numerical experiments in [47, 59], $\alpha_{k}=c_{1}\frac{k-1}{k+2}$ and $\beta_{k}=c_{2}\frac{k-1}{k+2}$ , so this inequality holds). Then we have

		$\displaystyle\Psi_{k}-\mathbb{E}_{k}[\Psi_{k+1}]$
	$\displaystyle\geq$	$\displaystyle\left(1-\delta-\eta_{k-1}\alpha-2\eta_{k-1}\bar{\gamma}-\gamma_{k% -1}-\gamma_{k}\right)D^{N}_{\psi}(A^{k-1},A^{k})$
		$\displaystyle+\frac{\epsilon}{3}(\mathbb{E}_{k}[D^{N}_{\psi}(A^{k},A^{k+1})]+D% ^{N}_{\psi}(A^{k-1},A^{k})+D^{N}_{\psi}(A^{k-2},A^{k-1}))$
	$\displaystyle\geq$	$\displaystyle\left(1-\delta-\eta_{k-1}\alpha-2\eta_{k-1}\bar{\gamma}-2\gamma_{% k}\right)D^{N}_{\psi}(A^{k-1},A^{k})$
		$\displaystyle+\frac{\epsilon}{3}(\mathbb{E}_{k}[D^{N}_{\psi}(A^{k},A^{k+1})]+D% ^{N}_{\psi}(A^{k-1},A^{k})+D^{N}_{\psi}(A^{k-2},A^{k-1}))$
	$\displaystyle\geq$	$\displaystyle\frac{\epsilon}{3}(\mathbb{E}_{k}[D^{N}_{\psi}(A^{k},A^{k+1})]+D^% {N}_{\psi}(A^{k-1},A^{k})+D^{N}_{\psi}(A^{k-2},A^{k-1})),$

where the second and the last inequality follow from (27) and (24), respectively. This completes the proof.

∎
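The monotonicity assumption $\gamma_{k}\geq\gamma_{k-1}$ used in the proof can be checked numerically for the inertial weights $\alpha_{k}=c_{1}\frac{k-1}{k+2}$ and $\beta_{k}=c_{2}\frac{k-1}{k+2}$. The sketch below does so; the constants `c1`, `c2` and `M2` are hypothetical values chosen for illustration.

```python
def inertial_weights(k, c1, c2):
    # alpha_k = c1 (k-1)/(k+2) and beta_k = c2 (k-1)/(k+2).
    t = (k - 1) / (k + 2)
    return c1 * t, c2 * t

def gamma(k, c1, c2, M2):
    # gamma_k = |alpha_k - beta_k| * M_2, as in Lemma 2.
    a, b = inertial_weights(k, c1, c2)
    return abs(a - b) * M2

# Hypothetical constants for illustration only.
c1, c2, M2 = 0.4, 0.2, 1.5
gammas = [gamma(k, c1, c2, M2) for k in range(1, 200)]

# (k-1)/(k+2) = 1 - 3/(k+2) increases with k, so gamma_k is nondecreasing
# and stays below |c1 - c2| * M2.
nondecreasing = all(g2 >= g1 for g1, g2 in zip(gammas, gammas[1:]))
print(nondecreasing, gammas[0], max(gammas))
```

Because $\gamma_{k}$ is bounded above by $|c_{1}-c_{2}|M_{2}$, the numerator $1-\delta-2\gamma_{k}$ in (24) stays positive for all $k$ whenever $2|c_{1}-c_{2}|M_{2}<1-\delta$.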

Theorem 1.

Let $\{A_{n}^{k}\}_{k>0}$ with $n\in\{1,\dots,N\}$ be a sequence generated by the iTableSMD algorithm. Then, the following statements hold.

(i)

The sequence $\{\mathbb{E}[\Psi_{k}]\}_{k\in\mathbb{N}}$ is nonincreasing.
(ii)

$\sum\limits_{k=1}^{+\infty}\mathbb{E}[D^{N}_{\psi}(A^{k-1},A^{k})]<+\infty$ , and the sequence $\{\mathbb{E}[D^{N}_{\psi}(A^{k-1},A^{k})]\}$ converges to zero.
(iii)

$\min\limits_{1\leq k\leq K}\mathbb{E}[D^{N}_{\psi}(A^{k-1},A^{k})]\leq\frac{3% \Psi_{1}}{\epsilon K}$ .

Proof.

(i)

This statement follows directly from Lemma 2 and $\epsilon>0$ .

(ii)

Taking the full expectation in (26) and summing from $k=1$ to a positive integer $K$ , we have

\sum_{k=1}^{K}\mathbb{E}[D^{N}_{\psi}(A^{k-1},A^{k})]\leq\frac{3}{\epsilon}% \mathbb{E}[\Psi_{1}-\Psi_{K+1}]\leq\frac{3}{\epsilon}\Psi_{1},

where the last inequality follows from $\Psi_{k}\geq 0$ for any $k>0$ due to (24). Taking the limit as $K\rightarrow+\infty$ , we have $\sum_{k=1}^{+\infty}\mathbb{E}[D^{N}_{\psi}(A^{k-1},A^{k})]<+\infty$ . Then we may deduce that the sequence $\{\mathbb{E}[D^{N}_{\psi}(A^{k-1},A^{k})]\}$ converges to zero.

(iii)

We have

K\min_{1\leq k\leq K}\mathbb{E}[D^{N}_{\psi}(A^{k-1},A^{k})]\leq\sum_{k=1}^{K}% \mathbb{E}[D^{N}_{\psi}(A^{k-1},A^{k})]\leq\frac{3}{\epsilon}\Psi_{1},

which yields the desired result.

This completes the proof. ∎

4.2 Global convergence analysis

In this subsection, we bound the expected squared distance of the subdifferential to zero and analyze the global convergence of the iTableSMD algorithm. To this end, we impose a stronger assumption on the function $f$ .

Assumption 2.

The partial gradient $\nabla_{A_{i}}f$ is Lipschitz continuous with modulus $M_{1}$ on bounded sets of $\Pi_{n=1}^{N}\mathbb{R}^{I_{n}\times R}$ . Namely, for any two points $A$ and $\hat{A}$ , where $A:=(A_{1},\dots,A_{i},\dots,A_{N})$ , $\hat{A}:=(A_{1},\dots,A_{i-1},\hat{A}_{i},A_{i+1},\dots,A_{N})$ $\in\Pi_{n=1}^{N}\mathbb{R}^{I_{n}\times R}$ , we have

\|\nabla_{A_{i}}f(A)-\nabla_{A_{i}}f(\hat{A})\|\leq M_{1}\|A_{i}-\hat{A}_{i}\|% ,\quad i=1,2,\dots,N.

Under Definition 4 and the definition of SAGA [15] and SARAH [43], we have the following proposition.

Proposition 1.

Under Assumption 2, the following two statements hold.

(i)

The SAGA gradient estimator [15] is defined as

\displaystyle\tilde{\nabla}^{SAGA}_{n}f(\underline{A}^{k}):=\frac{1}{{I_{n}}% \left|\mathcal{F}_{n}^{k}\right|}(\sum_{j\in\mathcal{F}_{n}^{k}}\nabla_{A_{n}}% f_{j}(\underline{A}^{k})-\nabla_{A_{n}}f_{j}((\phi^{k})^{j}))+\frac{1}{J_{n}}% \sum_{i=1}^{J_{n}}\nabla_{A_{n}}f_{i}((\phi^{k})^{i}),

(28)

where $\underline{A}^{k}:=(A_{1}^{k},\dots,A_{n-1}^{k},\underline{A}_{n}^{k},A_{n+1}^{k},\dots,A_{N}^{k})$ , and the variables $(\phi^{k})^{i}$ follow the update rule $(\phi^{k})^{i}=\underline{A}^{k-1}$ if $i\in\mathcal{F}_{n}^{k}$ and $(\phi^{k})^{i}=(\phi^{k-1})^{i}$ otherwise. The set of sampled mode- $n$ fibers is indexed by $\mathcal{F}_{n}^{k}\subset\{1,\dots,J_{n}\}$ with $|\mathcal{F}_{n}^{k}|=B$ . Then the estimator is variance-reduced with

\Gamma_{k+1}:=\frac{1}{BJ_{n}}\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(% \underline{A}^{k})-\nabla_{A_{n}}f_{i}((\phi^{k})^{i})\|_{*}^{2},

\Upsilon_{k+1}:=\frac{1}{\sqrt{BJ_{n}}}\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k})-\nabla_{A_{n}}f_{i}((\phi^{k})^{i})\|_{*}.

The constants are $\tau=\frac{B}{2J_{n}}$ , $V_{\Gamma}=2J_{n}+\frac{4J_{n}^{2}}{B}M_{1}^{2}$ , $V_{1}=M_{1}^{2}$ , and $V_{2}=M_{1}$ .

(ii)

The SARAH gradient estimator [43] is defined as

			$\displaystyle\tilde{\nabla}^{SARAH}_{n}f(\underline{A}^{k})$
		$\displaystyle=$	$\displaystyle\left\{\begin{array}[]{ll}\nabla_{A_{n}}f(\underline{A}^{k}),&% \mbox{w.p.}\,\,\frac{1}{p},\\ \frac{1}{B}(\underset{j\in\mathcal{F}_{n}^{k}}{\sum}\nabla_{A_{n}}f_{j}(% \underline{A}^{k})-\nabla_{A_{n}}f_{j}(\underline{A}^{k-1}))+\tilde{\nabla}^{% SARAH}_{n}f(\underline{A}^{k-1}),&\mbox{otherwise.}\end{array}\right.$

Here “w.p. $\frac{1}{p}$ ” means with probability $\frac{1}{p}\in(0,1]$ . Then it is variance reduced with

\displaystyle\Gamma_{k+1}=\|\tilde{\nabla}^{SARAH}_{n}f(\underline{A}^{k})-% \nabla_{A_{n}}f(\underline{A}^{k})\|_{*}^{2},\quad\Upsilon_{k+1}=\|\tilde{% \nabla}^{SARAH}_{n}f(\underline{A}^{k})-\nabla_{A_{n}}f(\underline{A}^{k})\|_{% *},

and constants $\tau=\frac{1}{p}$ , $V_{1}=V_{\Gamma}=2M_{1}^{2}$ , $V_{2}=2M_{1}$ .
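To make the two estimators concrete, the following minimal sketch implements them for a scalar finite-sum least-squares toy problem rather than for the tensor factor blocks of the paper. Everything here is an illustrative assumption: the component functions $f_{i}(x)=\frac{1}{2}(a_{i}x-b_{i})^{2}$, the helper names, the gradient table used as SAGA memory, and the standard minibatch normalisation $1/B$ in the SAGA correction term.

```python
import random

def grad_i(a, b, x, i):
    # Gradient of the toy component f_i(x) = 0.5 * (a[i] * x - b[i])**2.
    return a[i] * (a[i] * x - b[i])

def full_grad(a, b, x):
    # Gradient of the average (1/J) * sum_i f_i(x).
    return sum(grad_i(a, b, x, i) for i in range(len(a))) / len(a)

def saga_estimate(a, b, x, table, batch):
    # table[i] holds the most recently evaluated gradient of component i.
    correction = sum(grad_i(a, b, x, j) - table[j] for j in batch) / len(batch)
    estimate = correction + sum(table) / len(table)
    for j in batch:
        table[j] = grad_i(a, b, x, j)  # refresh memory for sampled indices
    return estimate

def sarah_estimate(a, b, x, x_prev, prev_estimate, batch, p, rng):
    # With probability 1/p return the full gradient; otherwise update the
    # previous estimate recursively with minibatch gradient differences.
    if rng.random() < 1.0 / p:
        return full_grad(a, b, x)
    diff = sum(grad_i(a, b, x, j) - grad_i(a, b, x_prev, j) for j in batch)
    return diff / len(batch) + prev_estimate

# Hypothetical toy data.
a = [1.0, 2.0, 3.0]
b = [1.0, 0.0, -1.0]
x_old, x_new = 0.0, 0.5
stored = [grad_i(a, b, x_old, i) for i in range(3)]

# Averaging the SAGA estimate over all single-index batches recovers the
# full gradient, which illustrates unbiasedness.
saga_values = [saga_estimate(a, b, x_new, list(stored), [j]) for j in range(3)]
saga_mean = sum(saga_values) / 3

# p = 1 always takes the full-gradient branch of SARAH.
sarah_full = sarah_estimate(a, b, x_new, x_old, full_grad(a, b, x_old),
                            [0, 1, 2], 1.0, random.Random(0))
# With p = 10 and this seed the recursive branch is taken.
sarah_rec = sarah_estimate(a, b, x_new, x_old, full_grad(a, b, x_old),
                           [0, 1, 2], 10.0, random.Random(0))
print(saga_mean, sarah_full, sarah_rec)
```

SAGA keeps one stored gradient per component and is unbiased, whereas SARAH needs no table but is biased in its recursive branch, which is why its error terms $\Gamma_{k+1}$ and $\Upsilon_{k+1}$ are defined directly through the estimator error.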

Proof.

From the definition of the SAGA stochastic gradient estimator $\tilde{\nabla}^{SAGA}f(\underline{A}^{k})$ and the Lipschitz continuity of $\nabla_{A_{i}}f(\cdot)$ , we have

		$\displaystyle\mathbb{E}_{k}\|\tilde{\nabla}^{SAGA}f(\underline{A}^{k})-\nabla f(\underline{A}^{k})\|_{*}^{2}$
	$\displaystyle=$	$\displaystyle\mathbb{E}_{k}\|\frac{1}{I_{n}B}(\sum_{j\in\mathcal{F}_{n}^{k}}\nabla_{A_{n}}f_{j}(\underline{A}^{k})-\nabla_{A_{n}}f_{j}((\phi^{k})^{j}))+\frac{1}{J_{n}}\sum_{i=1}^{J_{n}}\nabla_{A_{n}}f_{i}((\phi^{k})^{i})-\nabla f(\underline{A}^{k})\|_{*}^{2}$
	$\displaystyle\leq$	$\displaystyle\frac{1}{B^{2}I^{2}_{n}}\mathbb{E}_{k}\sum_{j\in\mathcal{F}_{n}^{k}}\|\nabla_{A_{n}}f_{j}(\underline{A}^{k})-\nabla_{A_{n}}f_{j}((\phi^{k})^{j})\|_{*}^{2}$
	$\displaystyle=$	$\displaystyle\frac{1}{BI^{2}_{n}J_{n}}\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k})-\nabla_{A_{n}}f_{i}((\phi^{k})^{i})\|_{*}^{2}$
	$\displaystyle\leq$	$\displaystyle\frac{1}{BJ_{n}}\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k})-\nabla_{A_{n}}f_{i}((\phi^{k})^{i})\|_{*}^{2}\,,$

where the first inequality follows from the fact that $\mathbb{E}_{k}\|y_{1}+\cdots+y_{t}\|_{*}^{2}=\mathbb{E}_{k}\|y_{1}\|_{*}^{2}+\cdots+\mathbb{E}_{k}\|y_{t}\|_{*}^{2}$ for any independent random variables $y_{i}$ $(i=1,\dots,t)$ with $\mathbb{E}_{k}[y_{i}]=0$ for all $i$ , and the last inequality follows from $I_{n}\geq 1$ . Combined with Jensen’s inequality, this gives

		$\displaystyle\mathbb{E}_{k}\|\tilde{\nabla}^{SAGA}_{n}f(\underline{A}^{k})-\nabla f(\underline{A}^{k})\|_{*}$
	$\displaystyle\leq$	$\displaystyle\sqrt{\mathbb{E}_{k}\|\tilde{\nabla}^{SAGA}_{n}f(\underline{A}^{k})-\nabla f(\underline{A}^{k})\|_{*}^{2}}$
	$\displaystyle\leq$	$\displaystyle\frac{1}{\sqrt{BJ_{n}}}\sqrt{\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k})-\nabla_{A_{n}}f_{i}((\phi^{k})^{i})\|_{*}^{2}}$
	$\displaystyle\leq$	$\displaystyle\frac{1}{\sqrt{BJ_{n}}}\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k})-\nabla_{A_{n}}f_{i}((\phi^{k})^{i})\|_{*}.$

We bound the MSE of the stochastic gradient estimator $\tilde{\nabla}^{SAGA}f(\cdot)$ as follows,

		$\displaystyle\frac{1}{BJ_{n}}\sum_{i=1}^{J_{n}}\mathbb{E}_{k}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k})-\nabla_{A_{n}}f_{i}((\phi^{k})^{i})\|_{*}^{2}$
	$\displaystyle\leq$	$\displaystyle\frac{1+\delta}{BJ_{n}}\mathbb{E}_{k}\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k-1})-\nabla_{A_{n}}f_{i}((\phi^{k})^{i})\|_{*}^{2}+\frac{1+\delta^{-1}}{BJ_{n}}\mathbb{E}_{k}\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k})-\nabla_{A_{n}}f_{i}(\underline{A}^{k-1})\|_{*}^{2}$
	$\displaystyle\leq$	$\displaystyle\frac{1+\delta}{BJ_{n}}(1-\frac{B}{J_{n}})\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k-1})-\nabla_{A_{n}}f_{i}((\phi^{k-1})^{i})\|_{*}^{2}+\frac{1+\delta^{-1}}{B}M_{1}^{2}\mathbb{E}_{k}\|\underline{A}_{n}^{k}-\underline{A}_{n}^{k-1}\|^{2}$
	$\displaystyle\leq$	$\displaystyle\frac{1+\delta}{BJ_{n}}(1-\frac{B}{J_{n}})\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k-1})-\nabla_{A_{n}}f_{i}((\phi^{k-1})^{i})\|_{*}^{2}+\frac{1+\delta^{-1}}{B}M_{1}^{2}\mathbb{E}_{k}[(1+\alpha_{k}^{2})\|A_{n}^{k}-A_{n}^{k-1}\|^{2}$
		$\displaystyle+\alpha_{k-1}^{2}\|A_{n}^{k-1}-A_{n}^{k-2}\|^{2}]$
	$\displaystyle\leq$	$\displaystyle\frac{1+\delta}{BJ_{n}}(1-\frac{B}{J_{n}})\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k-1})-\nabla_{A_{n}}f_{i}((\phi^{k-1})^{i})\|_{*}^{2}+\frac{2+2\delta^{-1}}{BN}M_{1}^{2}[\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}],$

where the first inequality follows from $\|x-z\|_{*}^{2}\leq(1+\delta)\|x-y\|_{*}^{2}+(1+\delta^{-1})\|y-z\|_{*}^{2}$ .

Letting $\Gamma_{k+1}:=\frac{1}{BJ_{n}}\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k})-\nabla_{A_{n}}f_{i}((\phi^{k})^{i})\|_{*}^{2}$ and $\delta=\frac{B}{2J_{n}}$ , we have

	$\displaystyle\mathbb{E}_{k}\Gamma_{k+1}\leq$	$\displaystyle(1+\frac{B}{2J_{n}})(1-\frac{B}{J_{n}})\Gamma_{k}+(2J_{n}+\frac{4J_{n}^{2}}{B})\frac{M_{1}^{2}}{N}[\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}]$
	$\displaystyle\leq$	$\displaystyle(1-\frac{B}{2J_{n}})\Gamma_{k}+(2J_{n}+\frac{4J_{n}^{2}}{B})\frac{M_{1}^{2}}{N}[\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}].$

This proves the geometric decay of $\Gamma_{k}$ in expectation. Similarly to Appendix B in [16], the third condition in Definition 4 also holds. For the SARAH stochastic gradient estimator, the results follow directly, as in Lemma 5 of [58], which proves Proposition 1 (ii). This completes the proof. ∎

Corollary 1.

If $\psi:=\frac{1}{2}\|\cdot\|^{2}$ , the inequality from Lemma 2 becomes

\mathbb{E}_{k}[\Psi_{k+1}]\leq\Psi_{k}-\frac{\epsilon}{6}\left(\mathbb{E}_{k}[% \|A^{k+1}-A^{k}\|^{2}]+\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}\right).

Now we can prove the following result, which bounds the norm of a subgradient of $\Phi$ at $A^{k+1}$ .

Lemma 3.

Suppose that Assumptions 1-2 hold, the stepsize $\eta_{k}$ satisfies $0<\eta\leq\eta_{k}$ and (24), and the sequence $\{A_{1}^{k},\dots,A_{N}^{k}\}$ generated by iTableSMD is bounded for all $k$ . Define

P_{n}^{k+1}:=\nabla_{A_{n}}f(A^{k+1})-\tilde{\nabla}_{A_{n}}f(\underline{A}^{k})+\frac{1}{\eta_{k}}(\nabla\psi(\tilde{A}_{n}^{k})-\nabla\psi(A_{n}^{k+1})),

where $P_{n}^{k+1}\in\partial_{n}\Phi\left(A^{k+1}\right)$ and $P^{k+1}=\left(P_{1}^{k+1},P_{2}^{k+1},\dots,P_{N}^{k+1}\right)$ , implying that $P^{k+1}\in\partial\Phi\left(A^{k+1}\right)$ . Then, we can obtain

\mathbb{E}_{k}\|P^{k+1}\|\leq w(\mathbb{E}_{k}\|A^{k+1}-A^{k}\|+\|A^{k}-A^{k-1% }\|+\|A^{k-1}-A^{k-2}\|)+\Upsilon_{k},

where $w=\max\left\{M_{1}+\frac{M_{2}}{\eta},V_{2}+\beta^{k}M_{1}+\frac{\alpha^{k}M_{% 2}}{\eta},V_{2}\right\}$ .

Proof.

From the implicit definition of the proximal operator (15) in the iTableSMD algorithm, we have

0\in\partial h_{n}(A_{n}^{k+1})+\tilde{\nabla}_{A_{n}}f(\underline{A}^{k})+% \frac{1}{\eta_{k}}(\nabla\psi(A_{n}^{k+1})-\nabla\psi(\tilde{A}_{n}^{k})),

where $\underline{A}^{k}:=(A_{1}^{k},\dots,A_{n-1}^{k},\underline{A}_{n}^{k},A_{n+1}^% {k},\dots,A_{N}^{k})$ . Combining it with $\partial_{n}\Phi\left(A^{k+1}\right)\equiv\nabla_{A_{n}}f\left(A^{k+1}\right)+% \partial h_{n}(A_{n}^{k+1})$ , we have $P_{n}^{k+1}\in\partial_{n}\Phi(A^{k+1})$ . Furthermore, in Problem (9), with $h(A^{k+1})=\sum_{n=1}^{N}h_{n}(A_{n})$ , we have $P^{k+1}=(P_{1}^{k+1},P_{2}^{k+1},\dots,P_{N}^{k+1})$ , and it follows that $P^{k+1}\in\partial\Phi(A^{k+1})$ , where $\partial\Phi(A^{k+1})\equiv\nabla f\left(A^{k+1}\right)+\partial h(A^{k+1})$ .

All that remains is to bound the norm of $P^{k+1}$ . Suppose $n=\xi^{k}$ at the $k$ -th iteration. We have

		$\displaystyle\mathbb{E}_{k}\|P^{k+1}\|$
	$\displaystyle=$	$\displaystyle\mathbb{E}_{k}\|P_{\xi^{k}}^{k+1}\|$
	$\displaystyle\leq$	$\displaystyle\mathbb{E}_{k}\|\nabla_{A_{\xi^{k}}}f(A^{k+1})-\tilde{\nabla}_{A_{\xi^{k}}}f(\underline{A}^{k})+\frac{1}{\eta_{k}}(\nabla\psi(\tilde{A}_{\xi^{k}}^{k})-\nabla\psi(A_{\xi^{k}}^{k+1}))\|$
	$\displaystyle\leq$	$\displaystyle\mathbb{E}_{k}\|\nabla_{A_{\xi^{k}}}f(A^{k+1})-\tilde{\nabla}_{A_{\xi^{k}}}f(\underline{A}^{k})\|+\frac{1}{\eta_{k}}\mathbb{E}_{k}\|\nabla\psi(\tilde{A}_{\xi^{k}}^{k})-\nabla\psi(A_{\xi^{k}}^{k+1})\|$
	$\displaystyle\leq$	$\displaystyle\mathbb{E}_{k}\|\nabla_{A_{\xi^{k}}}f(A^{k+1})-\nabla_{A_{\xi^{k}}}f(\underline{A}^{k})\|+\mathbb{E}_{k}\|\nabla_{A_{\xi^{k}}}f(\underline{A}^{k})-\tilde{\nabla}_{A_{\xi^{k}}}f(\underline{A}^{k})\|+\frac{1}{\eta_{k}}\mathbb{E}_{k}\|\nabla\psi(\tilde{A}_{\xi^{k}}^{k})-\nabla\psi(A_{\xi^{k}}^{k+1})\|$
	$\displaystyle\leq$	$\displaystyle M_{1}\mathbb{E}_{k}\|A_{\xi^{k}}^{k+1}-\underline{A}_{\xi^{k}}^{k}\|+\Upsilon_{k}+V_{2}\|A^{k}-A^{k-1}\|+V_{2}\|A^{k-1}-A^{k-2}\|+\frac{M_{2}}{\eta_{k}}\mathbb{E}_{k}\|A_{\xi^{k}}^{k+1}-\tilde{A}_{\xi^{k}}^{k}\|$
	$\displaystyle\leq$	$\displaystyle M_{1}\mathbb{E}_{k}\|A^{k+1}-\underline{A}^{k}\|+\Upsilon_{k}+V_{2}\|A^{k}-A^{k-1}\|+V_{2}\|A^{k-1}-A^{k-2}\|+\frac{M_{2}}{\eta_{k}}\mathbb{E}_{k}\|A^{k+1}-\tilde{A}^{k}\|$
	$\displaystyle\leq$	$\displaystyle\left(M_{1}+\frac{M_{2}}{\eta_{k}}\right)\mathbb{E}_{k}\|A^{k+1}-A^{k}\|+\left(V_{2}+\beta^{k}M_{1}+\frac{\alpha^{k}M_{2}}{\eta_{k}}\right)\|A^{k}-A^{k-1}\|+V_{2}\|A^{k-1}-A^{k-2}\|+\Upsilon_{k}$
	$\displaystyle\leq$	$\displaystyle\left(M_{1}+\frac{M_{2}}{\eta}\right)\mathbb{E}_{k}\|A^{k+1}-A^{k}\|+\left(V_{2}+\beta^{k}M_{1}+\frac{\alpha^{k}M_{2}}{\eta}\right)\|A^{k}-A^{k-1}\|+V_{2}\|A^{k-1}-A^{k-2}\|+\Upsilon_{k}$
	$\displaystyle\leq$	$\displaystyle w(\mathbb{E}_{k}\|A^{k+1}-A^{k}\|+\|A^{k}-A^{k-1}\|+\|A^{k-1}-A^{k-2}\|)+\Upsilon_{k},$

where $w=\max\left\{M_{1}+\frac{M_{2}}{\eta},V_{2}+\beta^{k}M_{1}+\frac{\alpha^{k}M_{% 2}}{\eta},V_{2}\right\}$ . This completes the proof. ∎

Lemma 4.

Under the same conditions as in Lemma 3, there exists a constant $\bar{w}>0$ such that

\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A^{k+1}))^{2}]\leq\bar{w}\left(\mathbb% {E}_{k}[\|A^{k+1}-A^{k}\|^{2}]+\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}% \right)+3\mathbb{E}\Gamma_{k}.

Proof.

From Lemma 3, we have

		$\displaystyle\mathbb{E}_{k}\|P^{k+1}\|^{2}$
	$\displaystyle\leq$	$\displaystyle 3\mathbb{E}_{k}\|\nabla_{A_{\xi^{k}}}f(A^{k+1})-\nabla_{A_{\xi^{k}}}f(\underline{A}^{k})\|^{2}+3\mathbb{E}_{k}\|\nabla_{A_{\xi^{k}}}f(\underline{A}^{k})-\tilde{\nabla}_{A_{\xi^{k}}}f(\underline{A}^{k})\|^{2}+\frac{3}{\eta_{k}}\mathbb{E}_{k}\|\nabla\psi(\tilde{A}_{\xi^{k}}^{k})-\nabla\psi(A_{\xi^{k}}^{k+1})\|^{2}$
	$\displaystyle\leq$	$\displaystyle 3M_{1}^{2}\mathbb{E}_{k}\|A^{k+1}-\underline{A}^{k}\|^{2}+3\Gamma_{k}+3V_{1}\|A^{k}-A^{k-1}\|^{2}+3V_{1}\|A^{k-1}-A^{k-2}\|^{2}+\frac{3M_{2}^{2}}{\eta_{k}}\mathbb{E}_{k}\|A^{k+1}-\tilde{A}^{k}\|^{2}$
	$\displaystyle\leq$	$\displaystyle\left(6M_{1}^{2}+\frac{6M_{2}^{2}}{\eta_{k}}\right)\mathbb{E}_{k}\|A^{k+1}-{A}^{k}\|^{2}+\left(3V_{1}+6\beta_{k}^{2}M_{1}^{2}+\frac{6\alpha_{k}^{2}M_{2}^{2}}{\eta_{k}}\right)\|A^{k}-A^{k-1}\|^{2}$
		$\displaystyle+3V_{1}\|A^{k-1}-A^{k-2}\|^{2}+3\Gamma_{k}$
	$\displaystyle\leq$	$\displaystyle\left(6M_{1}^{2}+\frac{6M_{2}^{2}}{\eta}\right)\mathbb{E}_{k}\|A^{k+1}-{A}^{k}\|^{2}+\left(3V_{1}+6\beta_{k}^{2}M_{1}^{2}+\frac{6\alpha_{k}^{2}M_{2}^{2}}{\eta}\right)\|A^{k}-A^{k-1}\|^{2}$
		$\displaystyle+3V_{1}\|A^{k-1}-A^{k-2}\|^{2}+3\Gamma_{k}$
	$\displaystyle\leq$	$\displaystyle\bar{w}\left(\mathbb{E}_{k}[\|A^{k+1}-A^{k}\|^{2}]+\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}\right)+3\mathbb{E}\Gamma_{k},$

where $\bar{w}:=\max\left\{6M_{1}^{2}+\frac{6M_{2}^{2}}{\eta},3V_{1}+6\beta_{k}^{2}M_{1}^{2}+\frac{6\alpha_{k}^{2}M_{2}^{2}}{\eta},3V_{1}\right\}$ . Using $\mathrm{dist}\left(0,\partial\Phi\left(A^{k+1}\right)\right)^{2}\leq\|P^{k+1}\|^{2}$ and taking the full expectation on both sides, we obtain

\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A^{k+1}))^{2}]\leq\bar{w}\left(\mathbb% {E}_{k}[\|A^{k+1}-A^{k}\|^{2}]+\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}% \right)+3\mathbb{E}\Gamma_{k}.

This completes the proof. ∎

Using Lemma 4, we can show the convergence rate of the expected squared distance of the subgradient to $0$ .

Theorem 2.

Assume that Assumptions 1-2 hold, and the stepsize satisfies $0<\eta\leq\eta_{k}$ and (24). Let $\{A^{k}\}_{k\in\mathbb{N}}$ generated by iTableSMD be bounded for all $k$ . Then there exists $0<\sigma<\epsilon/6$ such that

\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A^{\hat{k}}))^{2}]\leq\frac{\bar{w}}{(% \epsilon/6-\sigma)K}(\mathbb{E}\Psi_{1}+\frac{\epsilon/2-3\sigma}{\tau\bar{w}}% \mathbb{E}\Gamma_{1})=\mathcal{O}(1/K),

where $\hat{k}$ is drawn from $\{2,\dots,K+1\}$ . In other words, it takes at most $\mathcal{O}(\epsilon^{-2})$ iterations in expectation to obtain an $\epsilon$ -stationary point (see Definition 3) of $\Phi$ .

Proof.

From Corollary 1 and Lemma 4, we have

		$\displaystyle\mathbb{E}[\Psi_{k}-\Psi_{k+1}]$
	$\displaystyle\geq$	$\displaystyle\frac{\epsilon}{6}\mathbb{E}[\|A^{k+1}-A^{k}\|^{2}+\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}]$
	$\displaystyle\geq$	$\displaystyle\sigma\mathbb{E}[\|A^{k+1}-A^{k}\|^{2}+\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}]+\frac{\epsilon/6-\sigma}{\bar{w}}\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A^{k+1}))^{2}]$
		$\displaystyle-\frac{\epsilon/2-3\sigma}{\bar{w}}\mathbb{E}\Gamma_{k}$
	$\displaystyle\geq$	$\displaystyle\sigma\mathbb{E}[\|A^{k+1}-A^{k}\|^{2}+\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}]+\frac{\epsilon/6-\sigma}{\bar{w}}\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A^{k+1}))^{2}]$
		$\displaystyle+\frac{\epsilon/2-3\sigma}{\tau\bar{w}}\mathbb{E}[\Gamma_{k+1}-\Gamma_{k}]-\frac{(\epsilon/2-3\sigma)V_{\Gamma}}{\tau\bar{w}}\mathbb{E}[\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}]$
	$\displaystyle\geq$	$\displaystyle\sigma\mathbb{E}[\|A^{k+1}-A^{k}\|^{2}+\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}]+\frac{\epsilon/6-\sigma}{\bar{w}}\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A^{k+1}))^{2}]$
		$\displaystyle+\frac{\epsilon/2-3\sigma}{\tau\bar{w}}\mathbb{E}[\Gamma_{k+1}-\Gamma_{k}]-\frac{(\epsilon/2-3\sigma)V_{\Gamma}}{\tau\bar{w}}\mathbb{E}[\|A^{k+1}-A^{k}\|^{2}+\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}],$

where the third inequality follows from (18) in Definition 4. Choosing $\sigma$ such that $\sigma=\frac{(\epsilon/2-3\sigma)V_{\Gamma}}{\tau\bar{w}}$ , i.e., $\sigma=\frac{\epsilon V_{\Gamma}}{2(3V_{\Gamma}+\tau\bar{w})}$ , which satisfies $0<\sigma<\epsilon/6$ , the first and last terms cancel and we obtain

\mathbb{E}[\Psi_{k}-\Psi_{k+1}]\geq\frac{\epsilon/6-\sigma}{\bar{w}}\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A^{k+1}))^{2}]+\frac{\epsilon/2-3\sigma}{\tau\bar{w}}\mathbb{E}[\Gamma_{k+1}-\Gamma_{k}].

Summing over $k=1,\dots,K$ , we have

\mathbb{E}[\Psi_{1}-\Psi_{K+1}]\geq\frac{\epsilon/6-\sigma}{\bar{w}}\sum_{k=1}^{K}\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A^{k+1}))^{2}]+\frac{\epsilon/2-3\sigma}{\tau\bar{w}}\mathbb{E}[\Gamma_{K+1}-\Gamma_{1}],

which means there exists a $\hat{k}\in\{2,\dots,K+1\}$ such that

	$\displaystyle\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A^{\hat{k}}))^{2}]\leq$	$\displaystyle\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A^{k+1}))^{2}]$
	$\displaystyle\leq$	$\displaystyle\frac{\bar{w}}{(\epsilon/6-\sigma)K}(\mathbb{E}[\Psi_{1}-\Psi_{K+1}]+\frac{\epsilon/2-3\sigma}{\tau\bar{w}}\mathbb{E}[\Gamma_{1}-\Gamma_{K+1}])$
	$\displaystyle\leq$	$\displaystyle\frac{\bar{w}}{(\epsilon/6-\sigma)K}(\mathbb{E}\Psi_{1}+\frac{\epsilon/2-3\sigma}{\tau\bar{w}}\mathbb{E}\Gamma_{1}).$

This completes the proof. ∎
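The passage from the $\mathcal{O}(1/K)$ bound to the iteration count can be made explicit. The following is a sketch under the assumption that Definition 3 measures stationarity by $\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A))]\leq\varepsilon$ ; the constant $C$ is introduced only for this illustration, and the tolerance is written $\varepsilon$ to distinguish it from the constant $\epsilon$ appearing in the bound. By Jensen's inequality,

\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A^{\hat{k}}))]\leq\sqrt{\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A^{\hat{k}}))^{2}]}\leq\sqrt{\frac{C}{K}},\qquad C:=\frac{\bar{w}}{\epsilon/6-\sigma}\left(\mathbb{E}\Psi_{1}+\frac{\epsilon/2-3\sigma}{\tau\bar{w}}\mathbb{E}\Gamma_{1}\right),

so the right-hand side is at most $\varepsilon$ as soon as $K\geq C\varepsilon^{-2}$ .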

We define the set of cluster points of $\{A^{k}\}_{k\in\mathbb{N}}$ as

\displaystyle\begin{aligned} \Omega(A^{0}):=&\{A^{*}:\exists\text{ an increasing sequence of integers }\{k_{l}\}_{l\in\mathbb{N}}\text{ such that }A^{k_{l}}\rightarrow A^{*}\text{ as }l\rightarrow+\infty\}.\end{aligned}

(30)

Lemma 5.

Suppose that Assumptions 1-2 hold and the stepsize $\eta_{k}$ satisfies $0<\eta\leq\eta_{k}$ and (24). Then the following statements hold.

(1)

$\sum_{k=0}^{\infty}\|A^{k+1}-A^{k}\|^{2}<+\infty$ a.s., and $\lim_{k\rightarrow+\infty}\|A^{k+1}-A^{k}\|=0$ a.s.
(2)

$\mathbb{E}[\Phi(A^{k})]\rightarrow\Phi^{*}$ , where $\Phi^{*}\in[\mathcal{V}(\Phi),+\infty)$ with $\mathcal{V}(\Phi):=\inf_{A}\Phi(A)$ , and $\mathbb{E}\Phi(A^{*})=\Phi^{*}$ for all $A^{*}\in\Omega(A^{0})$ .
(3)

$\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A^{k}))]\rightarrow 0$ . Moreover, the set $\Omega(A^{0})$ is nonempty, and $\mathbb{E}[\mathrm{dist}(0,\partial\Phi(A^{*}))]=0$ for all $A^{*}\in\Omega(A^{0})$ .
(4)

$\mathrm{dist}(A^{k},\Omega(A^{0}))\rightarrow 0$ a.s., and $\Omega(A^{0})$ is a.s. compact and connected.

Proof.

The proof of the above statements is similar to that of Lemma 9 in [58], so we omit the details here for simplicity. ∎

The following lemma is from [16], which is analogous to the Uniformized KŁ property of [4] and allows us to apply the KŁ inequality.

Lemma 6.

Assume that $\{A^{k}\}_{k\in\mathbb{N}}$ is a bounded sequence of iterates generated by the iTableSMD algorithm using a variance-reduced gradient estimator (see Definition 4). Let $\Phi$ be a semialgebraic function satisfying the KŁ property [4] with exponent $\theta$ . Then there exist an index $\bar{k}$ and a desingularizing function $\phi(r)=ar^{1-\theta}$ with $a>0$ , $\theta\in[0,1)$ such that the following bound holds almost surely (a.s.),

\displaystyle\phi^{\prime}(\mathbb{E}[\Phi(A^{k})-\Phi_{k}^{*}])\,\mathbb{E}\,\mathrm{dist}(0,\partial\Phi(A^{k}))\geq 1,\,\,\forall k>\bar{k},

(31)

where $\Phi_{k}^{*}$ is a nondecreasing sequence converging to $\mathbb{E}\Phi(A^{*})$ for some $A^{*}\in\Omega(A^{0})$ .
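For the desingularizing function in Lemma 6, the bound (31) can be written out explicitly. Since $\phi^{\prime}(r)=a(1-\theta)r^{-\theta}$ for $r>0$ , and assuming $\mathbb{E}[\Phi(A^{k})-\Phi_{k}^{*}]>0$ , inequality (31) is equivalent to

a(1-\theta)\,\mathbb{E}\,\mathrm{dist}(0,\partial\Phi(A^{k}))\geq\left(\mathbb{E}[\Phi(A^{k})-\Phi_{k}^{*}]\right)^{\theta},\quad\forall k>\bar{k},

that is, the expected subgradient distance controls a power of the expected objective gap.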

We now state the global convergence result for the iTableSMD algorithm. It follows from the above lemmas, so we omit the proof; see [16, 58] for details.

Theorem 3.

Suppose that Assumptions 1-2 hold and the stepsize $\eta_{k}$ satisfies $0<\eta\leq\eta_{k}$ and (24). Let $\{A^{k}\}_{k\in\mathbb{N}}$ be the sequence generated by the iTableSMD algorithm, which is assumed to be bounded. If the objective function $\Phi$ is a semialgebraic function that satisfies the KŁ property with exponent $\theta\in[0,1)$ (see Lemma 6), then either the point $A^{k}$ is a critical point after a finite number of iterations or the sequence $\{A^{k}\}_{k\in\mathbb{N}}$ almost surely satisfies the finite length property in expectation, namely,

\sum_{k=0}^{+\infty}\mathbb{E}\|A^{k+1}-A^{k}\|<+\infty.

5 Numerical experiments

In this section, we evaluate the proposed iTableSMD (Algorithm 1) on synthetic datasets as well as multiple real-world datasets. We aim to demonstrate its efficiency through comparisons with two state-of-the-art algorithms as baselines. The first one is an entry-sampling based stochastic non-Euclidean CP decomposition algorithm, namely GCP-OPT, proposed in [31, 26]. The GCP-OPT method is implemented in Tensor Toolbox, and “Adam” is selected as the optimization solver. The sampling rule of GCP-OPT is the default “uniform” setting for dense tensors unless specified otherwise for particular examples. The second one is a tensor fiber-sampling based flexible stochastic mirror descent framework denoted by SmartCPD [49].

We perform experiments involving low-rank GCP decomposition with nonnegative constraints $A_{n}\geq 0$ for $n=1,\dots,N$ . We consider three synthetic data distributions: Gamma, Poisson, and Bernoulli. Additionally, we use several real datasets, including the Enron emails dataset [50] and six months of Uber pickup data [51] in New York City, both of which consist of integer counts modeled by the Poisson distribution. Each element in the Enron emails dataset is indexed by sender-receiver-word, and the values are counts of words. Each element in the Uber pickup data is indexed by the date-latitude-longitude of pickup, and the values are counts of pickups. We also test the tags from the Flickr dataset [20], where non-zero values are binary, indicating user tagging of images on a given day. In synthetic experiments, we generate third-order tensors with different sizes and ranks, and we do not require the dimensions of the tensor to be equal. For the fiber-sampling based algorithms, iTableSMD and SmartCPD sample $2R$ fibers in each iteration, while the entry-sampling algorithm GCP-OPT samples $2\frac{\sum_{n=1}^{N}I_{n}}{N}R$ entries. The generating function of the Bregman distance is chosen as $\psi(a)=a\log a$ in the update of iTableSMD and SmartCPD. Performance is measured by the cost function value (denoted by “NRE”) and the mean squared error (MSE) of the latent matrices, which is defined as

\mathrm{MSE}=\min_{\pi(r)\in[R]}\frac{1}{R}\sum_{r=1}^{R}\left\|\frac{\boldsymbol{A}_{n}(:,\pi(r))}{\left\|\boldsymbol{A}_{n}(:,\pi(r))\right\|_{2}}-\frac{\bar{\boldsymbol{A}}_{n}(:,r)}{\left\|\bar{\boldsymbol{A}}_{n}(:,r)\right\|_{2}}\right\|^{2},

where $\bar{\boldsymbol{A}}_{n}$ denotes the estimate of the original matrix $\boldsymbol{A}_{n}$ and $\{\pi(1),\ldots,\pi(R)\}$ represents a permutation of the set $[R]=\{1,\ldots,R\}$ , which is used to resolve the intrinsic column permutation ambiguity in CP decomposition.
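The MSE above can be evaluated directly for small ranks. The following is a minimal sketch, not the authors' evaluation code: the function names `normalize` and `factor_mse` are hypothetical, each factor matrix is assumed to be stored as a list of its $R$ columns, and the minimum over permutations is found by brute-force enumeration, which is only practical for small $R$.

```python
import itertools
import math


def normalize(col):
    """Scale a column (list of floats) to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in col))
    return [v / norm for v in col]


def factor_mse(A, A_bar):
    """MSE between two factor matrices given as lists of R columns.

    Columns are normalized first, so the metric ignores column scaling;
    the minimum over permutations removes the column-ordering ambiguity.
    """
    R = len(A)
    A_unit = [normalize(c) for c in A]
    B_unit = [normalize(c) for c in A_bar]
    # cost[p][r]: squared distance between column p of A and column r of A_bar
    cost = [
        [sum((x - y) ** 2 for x, y in zip(A_unit[p], B_unit[r])) for r in range(R)]
        for p in range(R)
    ]
    # Brute force over all R! permutations pi, pairing A[:, pi(r)] with A_bar[:, r].
    best = min(
        sum(cost[pi[r]][r] for r in range(R))
        for pi in itertools.permutations(range(R))
    )
    return best / R
```

For larger $R$, the same cost matrix can be passed to an assignment solver (for example `scipy.optimize.linear_sum_assignment`) to avoid the factorial enumeration.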

5.1 Synthetic data experiments

5.1.1 Gamma distribution

In this subsection, we compute the GCP decomposition on two artificial three-way tensors of size $150\times 100\times 150$ and $300\times 400\times 300$ with different ranks using the gamma loss function $x/m+\log(m)$ . In practice, we use the constraint $m\geq 0$ and replace $m$ with $m+\epsilon$ (e.g., $\epsilon=10^{-9}$ ) in the loss function to prevent function values or gradients from becoming $\pm\infty$ , that is, $f(m\,;x)=x/(m+\epsilon)+\log(m+\epsilon)$ . With nonnegative constraints on the factor matrices, the entries of the latent factors $A_{1}$ , $A_{2}$ , and $A_{3}$ are drawn i.i.d. from the uniform distribution between 0 and $A_{max}$ , where $A_{max}=0.5$ . The observed nonnegative data tensor $\mathcal{X}$ is generated following the gamma distribution, i.e., $\underline{\mathcal{X}}_{i}\sim Gamma(\underline{\mathcal{M}}_{i})$ . Namely, we focus on

\displaystyle\begin{aligned} \min_{A_{1},A_{2},A_{3}}&\quad\frac{1}{I^{N}}\sum_{i\in\mathcal{I}}\left[\underline{\mathcal{X}}_{i}/(\underline{\mathcal{M}}_{i}+\epsilon)+\log(\underline{\mathcal{M}}_{i}+\epsilon)\right]+\sum_{n=1}^{3}h_{n}\left(A_{n}\right)\\ \text{ s.t. }&\quad\underline{\mathcal{M}}_{i}=\sum_{r=1}^{R}\prod_{n=1}^{3}{A}_{n}\left(i_{n},r\right),\forall\,i\in\mathcal{I}.\end{aligned}
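To make the structure of this problem concrete, the sketch below evaluates one CP model entry $\underline{\mathcal{M}}_{i}$ and the smoothed gamma loss on a set of observed entries. It is an illustration under stated assumptions rather than the paper's implementation: the names `cp_entry`, `gamma_loss`, `gamma_loss_grad`, and `gamma_objective` are hypothetical, factor matrices are stored as lists of rows, the regularizers $h_{n}$ are omitted, and the average is taken over the entries supplied in `data`.

```python
import math


def cp_entry(factors, index):
    """M_i = sum_r prod_n A_n(i_n, r); factors[n] is A_n stored as a list of rows."""
    rank = len(factors[0][0])
    return sum(
        math.prod(A[i_n][r] for A, i_n in zip(factors, index))
        for r in range(rank)
    )


def gamma_loss(m, x, eps=1e-9):
    """Smoothed gamma loss f(m; x) = x / (m + eps) + log(m + eps)."""
    return x / (m + eps) + math.log(m + eps)


def gamma_loss_grad(m, x, eps=1e-9):
    """Derivative of the smoothed gamma loss with respect to m."""
    return -x / (m + eps) ** 2 + 1.0 / (m + eps)


def gamma_objective(factors, data, eps=1e-9):
    """Average smoothed gamma loss over the entries in data (index tuple -> value)."""
    total = sum(gamma_loss(cp_entry(factors, idx), x, eps) for idx, x in data.items())
    return total / len(data)
```

The smoothing keeps the loss finite at $m=0$, and the derivative vanishes (up to the effect of $\epsilon$) when the model entry equals the observed value.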

We set the inertial parameters as $\alpha^{k}=\frac{3(k-1)}{5(k+2)}$ , $\beta^{k}=\frac{4(k-1)}{5(k+2)}$ for simplicity. (Theoretically, the inequality (14) is required. However, it is time-consuming to check this inequality in the numerical experiments, so we directly set $\beta^{k}=\frac{4(k-1)}{5(k+2)}$ ; our numerical experiments show that iTableSMD always converges with this $\beta^{k}$ .) We set the stepsize as $\eta^{k}=0.1$ and compare SmartCPD with SGD and SAGA, GCP-OPT, and iTableSMD with SGD and SAGA. Our numerical results for the two synthetic datasets are presented in Figure 1.
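The behaviour of this parameter choice can be checked with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, the extrapolated points are assumed to take the form $A^{k}+\alpha^{k}(A^{k}-A^{k-1})$ and $A^{k}+\beta^{k}(A^{k}-A^{k-1})$ (consistent with the bounds used in the proof of Lemma 4), and the entrywise step is the standard closed-form mirror update for the kernel $\psi(a)=a\log a$, which assumes the extrapolated entry stays positive. It is not claimed to reproduce Algorithm 1 in full.

```python
import math
from fractions import Fraction


def alpha(k):
    """Inertial parameter alpha^k = 3(k-1) / (5(k+2)), kept exact as a Fraction."""
    return Fraction(3 * (k - 1), 5 * (k + 2))


def beta(k):
    """Inertial parameter beta^k = 4(k-1) / (5(k+2)), kept exact as a Fraction."""
    return Fraction(4 * (k - 1), 5 * (k + 2))


def extrapolate(a_curr, a_prev, weight):
    """Assumed inertial point: a_curr + weight * (a_curr - a_prev)."""
    return a_curr + float(weight) * (a_curr - a_prev)


def entropic_mirror_step(a_tilde, grad, eta):
    """Minimizer over a > 0 of grad * a + (1 / eta) * D_psi(a, a_tilde), psi(a) = a log a.

    Setting the derivative grad + (log a - log a_tilde) / eta to zero gives a
    multiplicative update that keeps the entry positive.
    """
    return a_tilde * math.exp(-eta * grad)
```

Both parameters start at zero for $k=1$ and increase monotonically toward $3/5$ and $4/5$, so the first iteration is a plain (non-inertial) step and the amount of extrapolation grows gradually.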

The first synthetic experiment, shown in the top row of Figure 1, reports the performance on a tensor of dimensions $150\times 100\times 150$ , evaluated at varying ranks. Notably, iTableSMD, especially when coupled with SAGA, achieves a rapid improvement in MSE. For instance, in Figure 1(a), iTableSMD-SAGA reduces the MSE to below $10^{-6}$ within an average time of fewer than 3 seconds, outpacing SmartCPD, which requires at least 6 seconds, and GCP-OPT, which needs more than 10 seconds to attain comparable MSE reductions.

The second row in Figure 1 evaluates the algorithms on a tensor with increased dimensions $300\times 400\times 300$ . As the tensor size increases, the iTableSMD method attains a lower MSE within the same time compared to SmartCPD and GCP-OPT. The noticeable improvement over the SmartCPD method indicates that the inertial acceleration framework of iTableSMD has a significant impact on its performance, helping to achieve faster convergence.

Figure 1: Numerical experiments for Gamma distribution on synthetic datasets.

5.1.2 Poisson distribution

We next evaluate the performance on two synthetic count data tensors of size $150\times 51\times 152$ and $300\times 300\times 300$ . For simplicity, we set the inertial parameters $\alpha^{k}=\frac{3(k-1)}{5(k+2)}$ and $\beta^{k}=\frac{4(k-1)}{5(k+2)}$ , and choose the stepsize $\eta^{k}=0.2$ . The loss function is a modification of the standard Poisson log-likelihood and is defined as $f(m\,;x)=m-x\log(m+\epsilon)$ . We initialize the factor matrices $A_{1}$ , $A_{2}$ , and $A_{3}$ with values uniformly distributed between 0 and a set maximum $A_{max}$ , here chosen as 0.5. The observed count data tensor $\mathcal{X}$ is generated following the Poisson distribution, i.e., $\underline{\mathcal{X}}_{i}\sim Poisson(\underline{\mathcal{M}}_{i})$ . Namely, we focus on

\displaystyle\begin{aligned} \min_{A_{1},A_{2},A_{3}}&\quad\frac{1}{I^{N}}\sum_{i\in\mathcal{I}}\left[\underline{\mathcal{M}}_{i}-\underline{\mathcal{X}}_{i}\log(\underline{\mathcal{M}}_{i}+\epsilon)\right]+\sum_{n=1}^{3}h_{n}\left(A_{n}\right)\\ \text{ s.t. }&\quad\underline{\mathcal{M}}_{i}=\sum_{r=1}^{R}\prod_{n=1}^{3}{A}_{n}\left(i_{n},r\right),\forall\,i\in\mathcal{I}.\end{aligned}
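The elementwise Poisson loss $f(m\,;x)=m-x\log(m+\epsilon)$ also illustrates why a Euclidean proximal term is a poor fit here. The sketch below (function names are hypothetical, chosen for this illustration) evaluates the loss, its first derivative, and its second derivative; the curvature $x/(m+\epsilon)^{2}$ grows like $x/\epsilon^{2}$ as $m\to 0$, so any global Lipschitz constant of the gradient on $m\geq 0$ is enormous.

```python
import math


def poisson_loss(m, x, eps=1e-9):
    """Smoothed Poisson loss f(m; x) = m - x * log(m + eps)."""
    return m - x * math.log(m + eps)


def poisson_loss_grad(m, x, eps=1e-9):
    """First derivative in m: 1 - x / (m + eps)."""
    return 1.0 - x / (m + eps)


def poisson_loss_curvature(m, x, eps=1e-9):
    """Second derivative in m: x / (m + eps)^2, unbounded as m -> 0 and eps -> 0."""
    return x / (m + eps) ** 2
```

With $\epsilon=10^{-9}$ and $x=1$, the curvature at $m=0$ is of order $10^{18}$, while at $m=1$ it is about $1$; a stepsize tuned to the worst case would be impractically small.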

Figure 2 shows the results of numerical experiments on synthetic datasets generated from the Poisson distribution for tensors of two sizes, with varying tensor ranks. For the smaller tensor ( $150\times 51\times 152$ ), as the rank increases from $R=10$ to $R=20$ , iTableSMD-SAGA maintains a lower MSE than the other methods. In particular, for $R=20$ in Figure 2(c), iTableSMD-SAGA reduces the MSE significantly faster than the other methods within 10 seconds. For the larger tensor ( $300\times 300\times 300$ ), in Figures 2(d)-(f), as the rank grows, the MSE tends to decrease at a slower rate for all methods. However, iTableSMD-SAGA still shows a consistent advantage, reaching lower MSEs more quickly than the competing algorithms, which becomes more notable as the rank moves to 30 and beyond. Comparing Figures 2(b)-(e) with Figure 1, we can see that SmartCPD does not always perform better than GCP-OPT, since the type of data distribution may affect its effectiveness. However, the iTableSMD variants continue to show superior performance in terms of convergence speed, regardless of the change in data distribution.

5.1.3 Bernoulli distribution

We next evaluate the performance on binary tensors of sizes $100\times 80\times 100$ and $50\times 100\times 200$ . We set the inertial parameters $\alpha^{k}=\frac{3(k-1)}{5(k+2)}$ and $\beta^{k}=\frac{4(k-1)}{5(k+2)}$ , and choose the stepsize $\eta^{k}=0.2$ . The loss function is $f(m\,;x)=\log(m+1)-x\log(m+\epsilon)$ and each entry of the binary tensor is generated from the Bernoulli distribution, i.e., $\underline{\mathcal{X}}_{i}=1$ with probability $\underline{\mathcal{M}}_{i}/(1+\underline{\mathcal{M}}_{i})$ . Then, we focus on

\displaystyle\begin{aligned} \min_{A_{1},A_{2},A_{3}}&\quad\frac{1}{I^{N}}\sum_{i\in\mathcal{I}}\left[\log(\underline{\mathcal{M}}_{i}+1)-\underline{\mathcal{X}}_{i}\log(\underline{\mathcal{M}}_{i}+\epsilon)\right]+\sum_{n=1}^{3}h_{n}\left(A_{n}\right)\\ \text{ s.t. }&\quad\underline{\mathcal{M}}_{i}=\sum_{r=1}^{R}\prod_{n=1}^{3}{A}_{n}\left(i_{n},r\right),\forall\,i\in\mathcal{I}.\end{aligned}
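In this Bernoulli model, the entry $\underline{\mathcal{M}}_{i}$ plays the role of the odds, so the success probability is $m/(1+m)$. The sketch below shows the loss, its derivative, and the data-generating rule, and lets one check that the expected loss is (up to the effect of $\epsilon$) stationary at the true odds. The function names are hypothetical and the code is an illustration, not the authors' data generator.

```python
import math
import random


def bernoulli_loss(m, x, eps=1e-9):
    """Smoothed Bernoulli (odds-link) loss f(m; x) = log(m + 1) - x * log(m + eps)."""
    return math.log(m + 1.0) - x * math.log(m + eps)


def bernoulli_loss_grad(m, x, eps=1e-9):
    """Derivative in m: 1 / (m + 1) - x / (m + eps)."""
    return 1.0 / (m + 1.0) - x / (m + eps)


def success_probability(m):
    """Probability that an entry equals 1 when its odds are m >= 0."""
    return m / (1.0 + m)


def sample_entry(m, rng):
    """Draw one binary entry with P(x = 1) = m / (1 + m), using the given random.Random."""
    return 1 if rng.random() < success_probability(m) else 0
```

For example, with odds $m=0.5$ the success probability is $1/3$, and averaging the derivative over $x\in\{0,1\}$ with these weights gives a value of order $\epsilon$ at $m=0.5$.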

In Figure 3(a), with tensor size $100\times 80\times 100$ and rank $R=5$ , all algorithms quickly reduce the MSE within the first few seconds. As the rank increases to $R=7$ in Figures 3(b) and (e), and to $R=10$ in Figures 3(c) and (f), the speed at which the MSE decreases changes noticeably, with higher ranks leading to a slight slowdown in convergence. These four subfigures further confirm that SmartCPD does not consistently outperform GCP-OPT and may at times be less effective. Nonetheless, iTableSMD-SAGA consistently shows robust performance, achieving low MSEs faster than the other methods. For the larger tensor, in Figures 3(d)-(f), a similar pattern is evident, with all algorithms performing more slowly as the size and rank increase. Yet the relative efficiency of iTableSMD-SAGA remains apparent, suggesting its advantage in dealing with larger and more complex datasets.

Overall, iTableSMD-SGD and iTableSMD-SAGA reliably achieve lower MSEs more quickly across various synthetic tensor sizes and ranks, underlining their efficiency in different scenarios.

5.2 Real data experiments

5.2.1 Enron emails dataset

We apply the algorithms to the Enron emails dataset. This dataset comprises a large collection of email messages exchanged by the employees of the Enron Corporation, which was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). We use the extractive version in [50] and select a subset involving 142 senders, 147 receivers, and 148 unique words, excluding any additional dimensions. The data is in the form of a third-order tensor ( $\text{sender}\times\text{receiver}\times\text{word}$ ) with integer entries representing the number of words. The size of the tensor is $142\times 147\times 148$ . It has 6581 ( $\approx 0.2\%$ ) nonzero entries. We choose the loss function corresponding to the Poisson distribution, i.e., $f(m\,;x)=m-x\log(m+\epsilon)$ , and impose non-negativity constraints on the latent matrices. In every iteration, $2R$ fibers are sampled by iTableSMD and SmartCPD and $2R\times\frac{142+147+148}{3}$ entries are sampled for GCP-OPT. We set inertial parameters $\alpha^{k}=\frac{3(k-1)}{5(k+2)}$ . All algorithms under test are stopped when the relative change in the loss function is less than $10^{-10}$ .
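The figures quoted above can be reproduced with simple arithmetic. The sketch below uses hypothetical helper names; the reading that the entry budget for GCP-OPT matches the average number of entries contained in $2R$ fibers (a mode-$n$ fiber has $I_{n}$ entries) is an interpretation for illustration, not a statement from the source.

```python
DIMS = (142, 147, 148)  # sender x receiver x word
NNZ = 6581              # number of nonzero entries reported for the subset


def density(dims, nnz):
    """Fraction of tensor entries that are nonzero."""
    total = 1
    for d in dims:
        total *= d
    return nnz / total


def fibers_per_iteration(rank):
    """Fibers sampled per iteration by the fiber-sampling methods: 2R."""
    return 2 * rank


def entries_per_iteration(dims, rank):
    """Entries sampled per iteration by the entry-sampling method: 2R * mean(I_n)."""
    return 2 * rank * sum(dims) / len(dims)
```

For this tensor the density is about 0.21%, and for $R=3$ the two budgets are 6 fibers and 874 entries per iteration.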

Figure 4 presents a series of numerical experiments conducted on the Enron dataset with the Poisson loss, examining the performance of the algorithms for tensor ranks $R=3$ , $R=5$ , and $R=7$ , respectively. Each algorithm is run for 5 trials, and in each trial the factor matrices are initialized by sampling their entries from the uniform distribution between 0 and 1. iTableSMD stands out for its quick cost reduction, reaching a low cost within approximately 6 seconds, noticeably faster than the baseline methods, which require more time and still do not achieve as low a cost. This advantage is most notable at the higher rank of $R=7$ , where iTableSMD lowers the cost quickly while the other algorithms struggle to converge within the same time.

5.2.2 The Flickr dataset

We evaluate the algorithms on the Flickr dataset of Görlitz et al. [20]. The dataset consists of tags representing whether a user has labeled an image on a particular day, with non-zero values marked as binary indicators. We form a third-order binary tensor of size $520\times 520\times 520$ . The chosen loss function corresponds to the Bernoulli distribution, i.e., $f(m\,;x)=\log(m+1)-x\log(m+\epsilon)$ . Other settings and parameters are as before. Figure 5 shows the cost value against time in seconds for different values of $R$ . As with the previous datasets, the proposed iTableSMD shows considerable runtime advantages over GCP-OPT. The results reinforce the capability of iTableSMD to handle complex, real-world datasets effectively.

6 Conclusion

In this paper, we proposed an inertial accelerated block randomized stochastic mirror descent algorithm (iTableSMD) for nonconvex multi-block objective functions beyond global Lipschitz gradient continuity. This algorithm is particularly tailored for large-scale generalized tensor CP (GCP) decomposition under non-Euclidean losses. Within a broader multi-block variance reduction framework, we established the sublinear convergence rate for the subsequential sequence produced by the iTableSMD algorithm and proved that it requires at most $\mathcal{O}(\varepsilon^{-2})$ iterations in expectation to attain an $\varepsilon$ -stationary point. Additionally, we verified the global convergence of the sequence generated by iTableSMD. We tested the algorithm on various types of simulated and real data against several baselines, and the results indicate significant computational efficiency improvements over existing state-of-the-art methods. These results highlight the advantages and effectiveness of incorporating an inertial accelerated stochastic approach in the algorithmic framework for GCP tensor decomposition.

Declarations

Funding: This research is supported by the R&D project of Pazhou Lab (Huangpu) (Grant no. 2023K0603), the National Natural Science Foundation of China (NSFC) grant 12171021 and the Fundamental Research Funds for the Central Universities (Grant No. YWF-22-T-204).

Competing interests: The authors have no competing interests to declare that are relevant to the content of this article.

Data Availability Statement: Data will be made available on reasonable request.

References

[1] C. Battaglino, G. Ballard, and T. G. Kolda. A practical randomized CP tensor decomposition. SIAM J. Matrix Anal. Appl., 39(2):876–901, 2018.
[2] H. H. Bauschke, J. Bolte, and M. Teboulle. A descent lemma beyond Lipschitz gradient continuity: First-order methods revisited and applications. Math. Oper. Res., 42(2):330–348, 2017.
[3] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Res. Lett., 31(3):167–175, 2003.
[4] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program., 146(1-2):459–494, 2014.
[5] J. Bolte, S. Sabach, M. Teboulle, and Y. Vaisbourd. First order methods beyond convexity and Lipschitz gradient continuity with applications to quadratic inverse problems. SIAM J. Optim., 28(3):2131–2151, 2018.
[6] J. D. Carroll and J. J. Chang. Analysis of individual differences in multidimensional scaling via an $n$ -way generalization of “Eckart-Young” decomposition. Psychometrika, 35(3):283–319, 1970.
[7] R. Cattell. “Parallel proportional profiles” and other principles for determining the choice of factors by rotation. Psychometrika, 9(4):267–283, 1944.
[8] R. B. Cattell. The three basic factor-analytic research designs-their interrelations and derivatives. Psychol. Bull., 49(5):499–520, 1952.
[9] L. Cheng, X. Tong, S. Wang, Y.-C. Wu, and H. V. Poor. Learning nonnegative factors from tensor data: Probabilistic modeling and inference algorithm. IEEE Trans. Signal Process., 68:1792–1806, 2020.
[10] E. C. Chi and T. G. Kolda. On tensors, sparsity, and nonnegative factorizations. SIAM J. Matrix Anal. Appl., 33(4):1272–1299, 2012.
[11] A. Cichocki and A. Phan. Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE Trans. Fundam. Electron. Commun. Comput. Sci., 92-A:708–721, 2009.
[12] P. Comon, X. Luciani, and A. L. F. de Almeida. Tensor decompositions, alternating least squares and other tales. J. Chemom., 23, 2009.
[13] C. D. Dang and G. Lan. Stochastic block mirror descent methods for nonsmooth and stochastic optimization. SIAM J. Optim., 25(2):856–881, 2015.
[14] D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions, 2018.
[15] A. Defazio, F. R. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27, pages 1646–1654, 2014.
[16] D. Driggs, J. Tang, J. Liang, M. E. Davies, and C. Schönlieb. A stochastic proximal alternating minimization for nonsmooth and nonconvex optimization. SIAM J. Imaging Sci., 14(4):1932–1970, 2021.
[17] B. Ermiş, E. Acar, and A. T. Cemgil. Link prediction in heterogeneous data via generalized coupled tensor factorization. Data Min. Knowl. Discov., 29(1):203–236, 2015.
[18] X. Fu, S. Ibrahim, H. Wai, C. Gao, and K. Huang. Block-randomized stochastic proximal gradient for low-rank tensor factorization. IEEE Trans. Signal Process., 68:2170–2185, 2020.
[19] X. Fu, E. Seo, J. Clarke, and R. A. Hutchinson. Link prediction under imperfect detection: Collaborative filtering for ecological networks. IEEE Transactions on Knowledge and Data Engineering, 33(8):3117–3128, 2021.
[20] O. Görlitz, S. Sizov, and S. Staab. Pints: peer-to-peer infrastructure for tagging systems. In IPTPS, page 19, 2008.
[21] D. Han. A survey on some recent developments of alternating direction method of multipliers. J. Oper. Res. Soc. China, 10:1–52, 2022.
[22] R. A. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multi-model factor analysis. 1970.
[23] J. Hertrich and G. Steidl. Inertial stochastic PALM and applications in machine learning. Sampl. Theory Signal Process. Data Anal., 20(1), 2022.
[24] F. L. Hitchcock. The expression of a tensor or a polyadic as a sum of products. J. Math. Phys., 6(1-4):164–189, 1927.
[25] F. L. Hitchcock. Multiple invariants and generalized rank of a p-way matrix or tensor. J. Math. Phys., 7(1-4):39–79, 1928.
[26] D. Hong, T. G. Kolda, and J. A. Duersch. Generalized canonical polyadic tensor decomposition. SIAM Rev., 62(1):133–163, 2020.
[27] K. Huang and N. D. Sidiropoulos. Kullback-Leibler principal component for tensors is not NP-hard. In 2017 51st Asilomar Conference on Signals, Systems, and Computers, pages 693–697, 2017.
[28] K. Huang, N. D. Sidiropoulos, and A. P. Liavas. A flexible and efficient algorithmic framework for constrained matrix and tensor factorization. IEEE Trans. Signal Process., 64(19):5052–5065, 2016.
[29] N. Kargas and N. Sidiropoulos. Learning mixtures of smooth product distributions: Identifiability and algorithm. In Proc. 22nd Int. Conf. Artif. Intell. Statist., pages 388–396, 2019.
[30] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Rev., 51(3):455–500, 2009.
[31] T. G. Kolda and D. Hong. Stochastic gradients for large-scale tensor decomposition. SIAM J. Math. Data Sci, abs/1906.01687, 2019.
[32] W. P. Krijnen, T. K. Dijkstra, and A. Stegeman. On the non-existence of optimal solutions and the occurrence of “degeneracy” in the CANDECOMP/PARAFAC model. Psychometrika, 73(3):431–439, 2008.
[33] G. Lan. First-Order and Stochastic Optimization Methods for Machine Learning. Springer, 2020.
[34] P. Latafat, A. Themelis, M. Ahookhosh, and P. Patrinos. Bregman Finito/MISO for nonconvex regularized finite sum minimization without Lipschitz gradient continuity. SIAM J. Optim., 32(3):2230–2262, 2022.
[35] Q. Li, Z. Zhu, G. Tang, and M. B. Wakin. Provable Bregman-divergence based methods for nonconvex and non-Lipschitz problems, 2019.
[36] L.-H. Lim and P. Comon. Nonnegative approximations of nonnegative tensors. J. Chemom., 23:432–441, 2009.
[37] H. Lu. “Relative-continuity” for non-Lipschitz non-smooth convex optimization using stochastic (or deterministic) mirror descent, 2018.
[38] H. Lu, R. M. Freund, and Y. E. Nesterov. Relatively smooth convex optimization by first-order methods, and applications. SIAM J. Optim., 28(1):333–354, 2018.
[39] M. C. Mukkamala, P. Ochs, T. Pock, and S. Sabach. Convex-concave backtracking for inertial Bregman proximal gradient algorithms in nonconvex optimization. SIAM J. Math. Data Sci., 2(3):658–682, 2020.
[40] C. Navasca, L. De Lathauwer, and S. Kindermann. Swamp reducing technique for tensor decomposition. In 16th European Signal Processing Conference, pages 1–5, 2008.
[41] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19(4):1574–1609, 2009.
[42] Y. E. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence ${O}(1/k^{2})$ . Soviet Math. Dokl., 27(2):372–376, 1983.
[43] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takác. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, pages 2613–2621, 2017.
[44] P. Paatero. Construction and analysis of degenerate PARAFAC models. J. Chemom., 14(3):285–299, 2000.
[45] A.-H. Phan, P. Tichavský, and A. Cichocki. Low complexity damped Gauss-Newton algorithms for CANDECOMP/PARAFAC. SIAM J. Matrix Anal. Appl., 34(1):126–147, 2013.
[46] A.-H. Phan, P. Tichavský, and A. Cichocki. Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations. IEEE Trans. Signal Process., 61(19):4834–4846, 2013.
[47] T. Pock and S. Sabach. Inertial proximal alternating linearized minimization (iPALM) for nonconvex and nonsmooth problems. SIAM J. Imaging Sci., 9(4):1756–1787, 2016.
[48] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys., 4(5):1–17, 1964.
[49] W. Pu, S. Ibrahim, X. Fu, and M. Hong. Stochastic mirror descent for low-rank tensor decomposition under non-Euclidean losses. IEEE Transactions on Signal Processing, 70:1803–1818, 2022.
[50] J. Shetty and J. Adibi. The Enron email dataset database schema and brief statistical report. Information Sciences Institute technical report, University of Southern California, 4, 2004.
[51] S. Smith, J. W. Choi, J. Li, R. Vuduc, J. Park, X. Liu, and G. Karypis. FROSTT: The formidable repository of open sparse tensors and tools, 2017.
[52] L. Sorber, M. Van Barel, and L. De Lathauwer. Optimization-based algorithms for tensor decompositions: Canonical polyadic decomposition, decomposition in rank- $(l_{r},l_{r},1)$ terms, and a new generalization. SIAM J. Optim., 23(2):695–720, 2013.
[53] M. Teboulle and Y. Vaisbourd. Novel proximal gradient methods for nonnegative matrix factorization with sparsity constraints. SIAM J. Imaging Sci., 13(1):381–421, 2020.
[54] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl., 109:475–494, 2001.
[55] M. Vandecappelle, N. Vervliet, and L. D. Lathauwer. A second-order method for fitting the canonical polyadic decomposition with non-least-squares cost. IEEE Transactions on Signal Processing, 68:4454–4465, 2020.
[56] M. Wang and L. Li. Learning from binary multiway data: Probabilistic tensor decomposition and its statistical optimality. J. Mach. Learn. Res., 21(1), 2020.
[57] Q. Wang, C. Cui, and D. Han. A momentum block-randomized stochastic algorithm for low-rank tensor CP decomposition. Pac. J. Optim., 17(3):433–452, 2021.
[58] Q. Wang and D. Han. A Bregman stochastic method for nonconvex nonsmooth problem beyond global Lipschitz gradient continuity. Optim. Methods. Softw., Online, 2023.
[59] Q. Wang and D. Han. A generalized inertial proximal alternating linearized minimization method for nonconvex nonsmooth problems. Appl. Numer. Math., 189:66–87, 2023.
[60] Q. Wang, Z. Liu, C. Cui, and D. Han. Inertial accelerated SGD algorithms for solving large-scale lower-rank tensor CP decomposition problems. J. Comput. Appl. Math., 423:114948, 2023.
[61] Q. Wang, Z. Liu, C. Cui, and D. Han. A Bregman proximal stochastic gradient method with extrapolation for nonconvex nonsmooth problems. In Association for the Advancement of Artificial Intelligence (AAAI), 2024.
[62] Y. Xu and W. Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM J. Imag. Sci., 6(3):1758–1789, 2013.
[63] A. Yeredor and M. Haardt. Maximum likelihood estimation of a low-rank probability mass tensor from partial observations. IEEE Signal Process. Lett., 26(10):1551–1555, 2019.
[64] S. Zhang and N. He. On the convergence rate of stochastic mirror descent for nonsmooth nonconvex optimization, 2018.

$$
\begin{aligned}
&\mathbb{E}_{k}\|\tilde{\nabla}^{SAGA}f(\underline{A}^{k})-\nabla f(\underline{A}^{k})\|_{*}^{2}\\
&\quad= \mathbb{E}_{k}\Bigl\|\frac{1}{I_{n}B}\Bigl(\sum_{j\in\mathcal{F}_{n}^{k}}\nabla_{A_{n}}f_{j}(\underline{A}^{k})-\nabla_{A_{n}}f_{j}((\phi^{k})^{j})\Bigr)+\frac{1}{J_{n}}\sum_{i=1}^{J_{n}}\nabla_{A_{n}}f_{i}((\phi^{k})^{i})-\nabla f(\underline{A}^{k})\Bigr\|_{*}^{2}\\
&\quad\leq \frac{1}{B^{2}I_{n}^{2}}\,\mathbb{E}_{k}\sum_{j\in\mathcal{F}_{n}^{k}}\|\nabla_{A_{n}}f_{j}(\underline{A}^{k})-\nabla_{A_{n}}f_{j}((\phi^{k})^{j})\|_{*}^{2}\\
&\quad= \frac{1}{BI_{n}^{2}J_{n}}\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k})-\nabla_{A_{n}}f_{i}((\phi^{k})^{i})\|_{*}^{2}\\
&\quad\leq \frac{1}{BJ_{n}}\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k})-\nabla_{A_{n}}f_{i}((\phi^{k})^{i})\|_{*}^{2}\,,
\end{aligned}
$$
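The bound above controls the SAGA estimation error by the gap between component gradients at the current point and at the stored points $(\phi^{k})^{i}$. The following is a minimal sketch of a mini-batch SAGA estimator on a toy scalar least-squares problem, not the paper's tensor implementation. The names `grad_i`, `saga_estimate`, and `refresh_table` are hypothetical, the components $f_i(x)=\tfrac12(a_i x-b_i)^2$ are an assumption made for illustration, and the normalization is simplified to $1/B$ for the sampled correction and $1/J$ for the stored average.

```python
import random

def grad_i(x, a, b):
    """Gradient of the toy component f_i(x) = 0.5 * (a * x - b) ** 2."""
    return a * (a * x - b)

def saga_estimate(x, data, table, batch):
    """Mini-batch SAGA estimate: sampled correction plus the mean stored gradient."""
    correction = sum(grad_i(x, *data[j]) - table[j] for j in batch) / len(batch)
    return correction + sum(table) / len(table)

def refresh_table(x, data, table, batch):
    """Overwrite the stored gradients of the sampled components, in place."""
    for j in batch:
        table[j] = grad_i(x, *data[j])

rng = random.Random(0)
data = [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)) for _ in range(8)]
x_old, x_new = 0.3, 0.5
table = [grad_i(x_old, a, b) for a, b in data]  # gradients stored at x_old
full_grad = sum(grad_i(x_new, a, b) for a, b in data) / len(data)

# Averaging over all singleton batches recovers the full gradient (unbiasedness).
mean_estimate = sum(
    saga_estimate(x_new, data, table, [j]) for j in range(len(data))
) / len(data)

# Once every stored gradient is refreshed at x_new, the estimation error vanishes.
refresh_table(x_new, data, table, range(len(data)))
fresh_estimate = saga_estimate(x_new, data, table, [0, 3])
```

In this toy setting the estimate is unbiased, and its error is zero whenever the stored gradients coincide with the gradients at the current point. This agrees with the right-hand side of the bound, which is a sum of exactly those gradient differences.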

$$
\begin{aligned}
\mathbb{E}_{k}\|\tilde{\nabla}^{SAGA}_{n}f(\underline{A}^{k})-\nabla f(\underline{A}^{k})\|_{*}
&\leq \sqrt{\mathbb{E}_{k}\|\tilde{\nabla}^{SAGA}_{n}f(\underline{A}^{k})-\nabla f(\underline{A}^{k})\|_{*}^{2}}\\
&\leq \frac{1}{\sqrt{BJ_{n}}}\sqrt{\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k})-\nabla_{A_{n}}f_{i}((\phi^{k})^{i})\|_{*}^{2}}\\
&\leq \frac{1}{\sqrt{BJ_{n}}}\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k})-\nabla_{A_{n}}f_{i}((\phi^{k})^{i})\|_{*}.
\end{aligned}
$$

$$
\begin{aligned}
&\frac{1}{BJ_{n}}\sum_{i=1}^{J_{n}}\mathbb{E}_{k}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k})-\nabla_{A_{n}}f_{i}((\phi^{k})^{i})\|_{*}^{2}\\
&\quad\leq \frac{1+\delta}{BJ_{n}}\mathbb{E}_{k}\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k-1})-\nabla_{A_{n}}f_{i}((\phi^{k})^{i})\|_{*}^{2}
+\frac{1+\delta^{-1}}{BJ_{n}}\mathbb{E}_{k}\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k})-\nabla_{A_{n}}f_{i}(\underline{A}^{k-1})\|_{*}^{2}\\
&\quad\leq \frac{1+\delta}{BJ_{n}}\Bigl(1-\frac{B}{J_{n}}\Bigr)\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k-1})-\nabla_{A_{n}}f_{i}((\phi^{k-1})^{i})\|_{*}^{2}
+\frac{1+\delta^{-1}}{B}M_{1}^{2}\,\mathbb{E}_{k}\|\underline{A}_{n}^{k}-\underline{A}_{n}^{k-1}\|^{2}\\
&\quad\leq \frac{1+\delta}{BJ_{n}}\Bigl(1-\frac{B}{J_{n}}\Bigr)\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k-1})-\nabla_{A_{n}}f_{i}((\phi^{k-1})^{i})\|_{*}^{2}\\
&\qquad+\frac{1+\delta^{-1}}{B}M_{1}^{2}\,\mathbb{E}_{k}\bigl[(1+\alpha_{k}^{2})\|A_{n}^{k}-A_{n}^{k-1}\|^{2}+\alpha_{k-1}^{2}\|A_{n}^{k-1}-A_{n}^{k-2}\|^{2}\bigr]\\
&\quad\leq \frac{1+\delta}{BJ_{n}}\Bigl(1-\frac{B}{J_{n}}\Bigr)\sum_{i=1}^{J_{n}}\|\nabla_{A_{n}}f_{i}(\underline{A}^{k-1})-\nabla_{A_{n}}f_{i}((\phi^{k-1})^{i})\|_{*}^{2}
+\frac{2+2\delta^{-1}}{BN}M_{1}^{2}\bigl[\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}\bigr],
\end{aligned}
$$
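The first inequality in the chain above is a Young-type split, which can be spelled out as follows. The shorthand $a_{i}$ and $b_{i}$ is introduced here only for illustration and is not the paper's notation: let $a_{i}=\nabla_{A_{n}}f_{i}(\underline{A}^{k-1})-\nabla_{A_{n}}f_{i}((\phi^{k})^{i})$ and $b_{i}=\nabla_{A_{n}}f_{i}(\underline{A}^{k})-\nabla_{A_{n}}f_{i}(\underline{A}^{k-1})$, so that $a_{i}+b_{i}=\nabla_{A_{n}}f_{i}(\underline{A}^{k})-\nabla_{A_{n}}f_{i}((\phi^{k})^{i})$. For any $\delta>0$, the triangle inequality and $2xy\leq\delta x^{2}+\delta^{-1}y^{2}$ give

$$
\|a_{i}+b_{i}\|_{*}^{2}\leq\bigl(\|a_{i}\|_{*}+\|b_{i}\|_{*}\bigr)^{2}
=\|a_{i}\|_{*}^{2}+2\|a_{i}\|_{*}\|b_{i}\|_{*}+\|b_{i}\|_{*}^{2}
\leq(1+\delta)\|a_{i}\|_{*}^{2}+(1+\delta^{-1})\|b_{i}\|_{*}^{2}.
$$

Summing over $i$ and dividing by $BJ_{n}$ yields the two terms with coefficients $\frac{1+\delta}{BJ_{n}}$ and $\frac{1+\delta^{-1}}{BJ_{n}}$. The argument uses only the triangle inequality, so it holds for any norm, including the dual norm.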

$$
\begin{aligned}
\mathbb{E}_{k}\|P^{k+1}\|
&= \mathbb{E}_{k}\|P_{\xi^{k}}^{k+1}\|\\
&\leq \mathbb{E}_{k}\Bigl\|\nabla_{A_{\xi^{k}}}f(A^{k+1})-\tilde{\nabla}_{A_{\xi^{k}}}f(\underline{A}^{k})+\frac{1}{\eta_{k}}\bigl(\nabla\psi(\tilde{A}_{\xi^{k}}^{k})-\nabla\psi(A_{\xi^{k}}^{k+1})\bigr)\Bigr\|\\
&\leq \mathbb{E}_{k}\|\nabla_{A_{\xi^{k}}}f(A^{k+1})-\tilde{\nabla}_{A_{\xi^{k}}}f(\underline{A}^{k})\|+\frac{1}{\eta_{k}}\mathbb{E}_{k}\|\nabla\psi(\tilde{A}_{\xi^{k}}^{k})-\nabla\psi(A_{\xi^{k}}^{k+1})\|\\
&\leq \mathbb{E}_{k}\|\nabla_{A_{\xi^{k}}}f(A^{k+1})-\nabla_{A_{\xi^{k}}}f(\underline{A}^{k})\|+\mathbb{E}_{k}\|\nabla_{A_{\xi^{k}}}f(\underline{A}^{k})-\tilde{\nabla}_{A_{\xi^{k}}}f(\underline{A}^{k})\|\\
&\quad+\frac{1}{\eta_{k}}\mathbb{E}_{k}\|\nabla\psi(\tilde{A}_{\xi^{k}}^{k})-\nabla\psi(A_{\xi^{k}}^{k+1})\|\\
&\leq M_{1}\mathbb{E}_{k}\|A_{\xi^{k}}^{k+1}-\underline{A}_{\xi^{k}}^{k}\|+\Upsilon_{k}+V_{2}\|A^{k}-A^{k-1}\|+V_{2}\|A^{k-1}-A^{k-2}\|+\frac{M_{2}}{\eta_{k}}\mathbb{E}_{k}\|A_{\xi^{k}}^{k+1}-\tilde{A}_{\xi^{k}}^{k}\|\\
&\leq M_{1}\mathbb{E}_{k}\|A^{k+1}-\underline{A}^{k}\|+\Upsilon_{k}+V_{2}\|A^{k}-A^{k-1}\|+V_{2}\|A^{k-1}-A^{k-2}\|+\frac{M_{2}}{\eta_{k}}\mathbb{E}_{k}\|A^{k+1}-\tilde{A}^{k}\|\\
&\leq \Bigl(M_{1}+\frac{M_{2}}{\eta_{k}}\Bigr)\mathbb{E}_{k}\|A^{k+1}-A^{k}\|+\Bigl(V_{2}+\beta_{k}M_{1}+\frac{\alpha_{k}M_{2}}{\eta_{k}}\Bigr)\|A^{k}-A^{k-1}\|+V_{2}\|A^{k-1}-A^{k-2}\|+\Upsilon_{k}\\
&\leq \Bigl(M_{1}+\frac{M_{2}}{\eta}\Bigr)\mathbb{E}_{k}\|A^{k+1}-A^{k}\|+\Bigl(V_{2}+\beta_{k}M_{1}+\frac{\alpha_{k}M_{2}}{\eta}\Bigr)\|A^{k}-A^{k-1}\|+V_{2}\|A^{k-1}-A^{k-2}\|+\Upsilon_{k}\\
&\leq w\bigl(\mathbb{E}_{k}\|A^{k+1}-A^{k}\|+\|A^{k}-A^{k-1}\|+\|A^{k-1}-A^{k-2}\|\bigr)+\Upsilon_{k},
\end{aligned}
$$

$$
\begin{aligned}
\mathbb{E}_{k}\|P^{k+1}\|^{2}
&\leq 3\mathbb{E}_{k}\|\nabla_{A_{\xi^{k}}}f(A^{k+1})-\nabla_{A_{\xi^{k}}}f(\underline{A}^{k})\|^{2}+3\mathbb{E}_{k}\|\nabla_{A_{\xi^{k}}}f(\underline{A}^{k})-\tilde{\nabla}_{A_{\xi^{k}}}f(\underline{A}^{k})\|^{2}\\
&\quad+\frac{3}{\eta_{k}}\mathbb{E}_{k}\|\nabla\psi(\tilde{A}_{\xi^{k}}^{k})-\nabla\psi(A_{\xi^{k}}^{k+1})\|^{2}\\
&\leq 3M_{1}^{2}\mathbb{E}_{k}\|A^{k+1}-\underline{A}^{k}\|^{2}+3\Gamma_{k}+3V_{1}\|A^{k}-A^{k-1}\|^{2}+3V_{1}\|A^{k-1}-A^{k-2}\|^{2}+\frac{3M_{2}^{2}}{\eta_{k}}\mathbb{E}_{k}\|A^{k+1}-\tilde{A}^{k}\|^{2}\\
&\leq \Bigl(6M_{1}^{2}+\frac{6M_{2}^{2}}{\eta_{k}}\Bigr)\mathbb{E}_{k}\|A^{k+1}-A^{k}\|^{2}+\Bigl(3V_{1}+6\beta_{k}^{2}M_{1}^{2}+\frac{6\alpha_{k}^{2}M_{2}^{2}}{\eta_{k}}\Bigr)\|A^{k}-A^{k-1}\|^{2}\\
&\quad+3V_{1}\|A^{k-1}-A^{k-2}\|^{2}+3\Gamma_{k}\\
&\leq \Bigl(6M_{1}^{2}+\frac{6M_{2}^{2}}{\eta}\Bigr)\mathbb{E}_{k}\|A^{k+1}-A^{k}\|^{2}+\Bigl(3V_{1}+6\beta_{k}^{2}M_{1}^{2}+\frac{6\alpha_{k}^{2}M_{2}^{2}}{\eta}\Bigr)\|A^{k}-A^{k-1}\|^{2}\\
&\quad+3V_{1}\|A^{k-1}-A^{k-2}\|^{2}+3\Gamma_{k}\\
&\leq \bar{w}\bigl(\mathbb{E}_{k}[\|A^{k+1}-A^{k}\|^{2}]+\|A^{k}-A^{k-1}\|^{2}+\|A^{k-1}-A^{k-2}\|^{2}\bigr)+3\mathbb{E}\Gamma_{k},
\end{aligned}
$$
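The factors $3$ and $6$ in the chain above come from two elementary inequalities. The sketch below assumes that the extrapolated points take the usual inertial form $\underline{A}^{k}=A^{k}+\beta_{k}(A^{k}-A^{k-1})$ and $\tilde{A}^{k}=A^{k}+\alpha_{k}(A^{k}-A^{k-1})$. This form is inferred from the coefficients in the bound and is not restated in this part of the text. For any vectors $u,v,w$,

$$
\|u+v+w\|^{2}\leq 3\bigl(\|u\|^{2}+\|v\|^{2}+\|w\|^{2}\bigr),
$$

which produces the three terms with factor $3$ in the first step. Under the assumed extrapolation,

$$
\|A^{k+1}-\underline{A}^{k}\|^{2}
=\|(A^{k+1}-A^{k})-\beta_{k}(A^{k}-A^{k-1})\|^{2}
\leq 2\|A^{k+1}-A^{k}\|^{2}+2\beta_{k}^{2}\|A^{k}-A^{k-1}\|^{2},
$$

so multiplying by $3M_{1}^{2}$ gives the coefficients $6M_{1}^{2}$ and $6\beta_{k}^{2}M_{1}^{2}$. The same split applied to $\|A^{k+1}-\tilde{A}^{k}\|^{2}$ with $\alpha_{k}$ in place of $\beta_{k}$ gives $\frac{6M_{2}^{2}}{\eta_{k}}$ and $\frac{6\alpha_{k}^{2}M_{2}^{2}}{\eta_{k}}$.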


Figure panels: (a)-(c) $150\times 100\times 150$ with $R=15, 20, 30$; (d)-(f) $300\times 400\times 300$ with $R=15, 20, 30$.

Figure panels: (a)-(c) $150\times 51\times 152$ with $R=10, 15, 20$; (d)-(f) $300\times 300\times 300$ with $R=20, 30, 40$.

Figure panels: (a)-(c) $100\times 80\times 100$ with $R=5, 7, 10$; (d)-(f) $50\times 100\times 200$ with $R=5, 7, 10$.

Figure panels: (a)-(c) $142\times 147\times 148$ with $R=3, 5, 7$.

Figure panels: (a)-(c) $520\times 520\times 520$ with $R=5, 10, 15$.