Inertial Accelerated Stochastic Mirror Descent for Large-Scale Generalized Tensor CP Decomposition
Abstract
The majority of classic tensor CP decomposition models are designed for squared loss, employing Euclidean distance as a local proximal term. However, the Euclidean distance is unsuitable for the generalized loss functions applicable to various types of real-world data, such as integer and binary data. Consequently, algorithms developed under the squared loss are not easily adaptable to handle these generalized losses, partly due to the lack of gradient Lipschitz continuity. This paper considers the generalized tensor CP decomposition. We use the Bregman distance as the proximal term and propose an inertial accelerated block randomized stochastic mirror descent algorithm (iTableSMD). Within a broader multi-block variance reduction and inertial acceleration framework, we demonstrate the sublinear convergence rate for the subsequential sequence produced by the iTableSMD algorithm. We further show that iTableSMD requires at most iterations in expectation to attain an -stationary point and establish the global convergence of the sequence. Numerical experiments on real datasets demonstrate that our proposed algorithm is efficient and achieves better performance than existing state-of-the-art methods.
Keywords: Generalized tensor CP decomposition, Inertial acceleration, Stochastic mirror descent, Bregman divergence, Non-Lipschitz gradient continuity, Variance reduction.
1 Introduction
A fundamental generic optimization model that encompasses a wide range of multi-block models arising in various applications is the well-known composite minimization problem. It can be formally defined as:
(1) |
where variable can be decomposed into blocks , is assumed to be a continuously differentiable nonconvex function over , and can be convex in the manner of a block , while all the other blocks are fixed. It can also admit a finite-sum structure form with different indexes. The usual restrictive condition of the gradient Lipschitz continuity of is not required, and , are extended-value weakly convex functions, which is a structure-promoting regularizer that captures the prior information about , such as column-wise orthogonality [32], Tikhonov regularization [44], iterated Tikhonov regularization [40], non-negativity [36].
The majority of classic models and algorithms developed for multi-block problem (1) are least squares (LS) problems using the Euclidean distance-based fitting criterion [22, 6]. Efficient methods in this line include the block coordinate descent (BCD) [62], alternating direction method of multipliers (ADMM) [21], and proximal alternating linearized minimization (PALM) [4] algorithms. Pock and Sabach [47] then introduced an inertial variant of PALM (iPALM). Huang et al. [28] introduced a primal-dual algorithm AO-ADMM, which is a hybrid between alternating optimization and ADMM. There are also some other first-order type algorithms [46, 12] and (quasi-)second-order methods [52, 45]. However, the Euclidean distance is unsuitable for measuring the proximity between various real-world data types, including nonnegative, integer, and binary data. Essentially, utilizing data geometry-aware divergences as fitting criteria has the potential to substantially improve both performance and robustness in real-world applications [26, 31, 56, 55]. For instance, the proximity between two probability distributions is measured using an appropriate divergence, such as the generalized Kullback-Leibler (KL) divergence [27, 29, 19, 9] or the Itakura-Saito divergence [17]. From a statistical perspective, numerous non-Euclidean divergences share a close connection with maximum likelihood estimators (MLEs) under reasonable data distribution assumptions. For example, the generalized KL divergence [10] and logistic loss can be obtained as MLEs for integer data with Poisson distributions and binary data with Bernoulli distributions, respectively. Nevertheless, problems under non-Euclidean divergences are much more challenging than those under the Euclidean loss, especially when the data size becomes huge.
Algorithms designed for the LS loss are not easily adaptable to handle complex loss functions, primarily due to the absence of gradient Lipschitz continuity, even under relatively mild conditions.
The form of problem (1) can be applied to tensor CP decomposition with regularization, which can be viewed as an extension of matrix factorization [53]. The idea of canonical polyadic (CP) decomposition dates back to Hitchcock [24, 25] in 1927, who expressed a tensor as a sum of a finite number of rank-1 tensors. Subsequently, Cattell [7, 8] proposed ideas for parallel proportional analysis and multiple axes for analysis. More recently, Hong et al. introduced in 2020 the generalized canonical polyadic (GCP) low-rank tensor decomposition, which allows loss functions other than the squared error. Below, we briefly review existing developments for multi-block models with non-Euclidean loss functions.
Many existing non-Euclidean approaches employ the block coordinate descent (BCD) [62] for updating block variables. Convergence of the BCD method typically requires the uniqueness of the minimizer at each step or the quasi-convexity of the objective function [54]. Unfortunately, these requirements can be restrictive in some important practical problems such as the tensor decomposition problem [30]. Cichocki and Phan [11] proposed a hierarchical alternating optimization algorithm for CP decomposition with $\alpha$- and $\beta$-divergences. In [10], the generalized KL-divergence loss was explored, leading to the development of a block majorization-minimization (MM) algorithm. The work in [29] presented the exponential gradient algorithm for handling the KL-divergence. Additionally, alternative optimization frameworks such as Gauss-Newton based methods [55] and quasi-Newton methods [26] have been devised for non-Euclidean models.
It is worth noting that most of the mentioned algorithms use the entire dataset for each update, which is time-consuming. In contrast, stochastic algorithms reduce computational and memory requirements per iteration. A recent stochastic gradient-based algorithm [31] was introduced for tensor CP decomposition. However, it randomly samples tensor entries for updates, neglecting the potential computational efficiency enhancements through multilinear algebraic properties of low-rank tensors. More importantly, this update strategy loses the opportunity to incorporate regularization terms on the entire latent factors because the sampled entries only provide partial information about them. To address this, Battaglino et al. proposed an algorithm [1] that samples tensor fibers containing information about complete latent factors. However, these stochastic algorithms lack convergence guarantees. Pu et al. [49] developed a block-randomized stochastic mirror descent (SMD) [13, 3] algorithmic framework for large-scale CP decomposition under various non-Euclidean losses, also referred to as generalized CP decomposition. Specifically, at each iteration, one block factor is randomly chosen for an update while keeping all other factors fixed. Then, instead of solving the subproblem directly, it updates the unknown factor by one SMD step. This work also incorporated a fiber sampling strategy to assist in designing SMD updates. In this way, the computational cost is much smaller. However, pure SMD is still slow to converge. Wang et al. proposed mBrasCPD [57] and iBrasCPD [60], which speed up the SGD scheme by the heavy ball method [48] and inertial acceleration. Both of these algorithms are designed for scenarios involving Euclidean loss functions and are not directly suitable for generalized non-Euclidean loss functions. Additionally, these algorithms only assume that the stochastic gradient is unbiased, which can only induce weak convergence properties.
Recently, [58] and [61] introduced the Bregman proximal stochastic gradient (BPSG) method and BPSG with extrapolation (BPSGE), respectively, and established subsequential and global convergence of the generated sequence under a general variance reduction framework. However, both BPSG and BPSGE are designed for single-block problems, and their potential for computing the generalized tensor CP decomposition remains untapped.
An overview of several state-of-the-art algorithms for multi-block problem (1) is presented in Table 1.
Algorithm | Loss | Regularizer | Acceleration | Convergence | Complexity | $L$-Lip | Stochastic gradient |
---|---|---|---|---|---|---|---|
BrasCPD [18] | LS | convex | no | subseq. | - | yes | unbiased |
mBrasCPD [57] | LS | convex | heavy ball | subseq. | - | yes | unbiased |
iBrasCPD [60] | LS | convex | inertial | subseq. | - | yes | unbiased |
SPRING [16] | general | nonconvex | no | subseq./seq. | - | yes | biased |
iSPALM [23] | general | nonconvex | inertial | subseq./seq. | - | yes | biased |
SmartCPD [49] | general | convex | no | subseq. | - | no | unbiased |
iTableSMD (Alg. 1) | general | weakly-convex | inertial | subseq./seq. | | no | biased |
In this paper, inspired by inertial acceleration techniques and the variance reduction framework for stochastic algorithms, we propose an inertial accelerated stochastic mirror descent method to solve the nonconvex and nonsmooth optimization problem (1), which can be applied to the block-wise subproblems of generalized tensor CP decomposition under non-Euclidean loss functions. Our main contributions are as follows:
(1) We introduce an inertial accelerated block-randomized SMD algorithm, denoted as iTableSMD, designed to address the GCP decomposition problem.
(2) Within a broader multi-block variance reduction and inertial acceleration framework, we establish the sublinear convergence rate for the subsequential sequence produced by the iTableSMD algorithm. Furthermore, we introduce a novel Lyapunov function and prove that iTableSMD requires at most iterations in expectation to attain an -stationary point. Additionally, we establish the global convergence of the sequence generated by iTableSMD.
(3) We conduct extensive experiments, including three synthetic datasets and two real-world datasets in several distributions, to demonstrate the effectiveness of the proposed iTableSMD algorithm. The numerical results show that it converges faster than the compared baselines.
The rest of this paper is organized as follows. Section 2 outlines necessary definitions and preliminary results of existing models and algorithms. Section 3 details the formulation of the iTableSMD algorithm, while Section 4 is dedicated to proving its convergence and analyzing its rate of convergence. In Section 5, we compare the performance of the iTableSMD algorithm with several baselines using synthetic and real-world datasets. We conclude the paper in Section 6.
2 Preliminaries
Definition 1.
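(Bregman Divergence): Given a strongly convex function $\phi$, the Bregman distance between $x \in \operatorname{dom}\phi$ and $y \in \operatorname{int\,dom}\phi$ is
(2) | $D_{\phi}(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y), x - y \rangle.$
It measures the proximity of $x$ and $y$. Indeed, $\phi$ is convex if and only if $D_{\phi}(x, y) \geq 0$ for any $x \in \operatorname{dom}\phi$ and $y \in \operatorname{int\,dom}\phi$, due to the gradient inequality.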
Remark 1.
The Bregman divergence was originally defined in [63] using a Legendre function . Here, we consider the case where is a strongly convex function; see more details in [2]. It should be noted that
(3) |
where is the strongly convex parameter of . if and only if . In addition, can be defined in a coordinate-wise form as .
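As a concrete illustration of Definition 1, the following minimal sketch evaluates the standard Bregman distance, $\phi(x) - \phi(y) - \langle \nabla\phi(y), x - y\rangle$, for two common kernels. The function names and the two kernels are illustrative assumptions and are not necessarily the kernels used in the experiments; note also that the entropy kernel is strongly convex only on bounded subsets of the positive orthant.

```python
import numpy as np

def bregman_distance(phi, grad_phi, x, y):
    """Standard Bregman distance: phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - np.sum(grad_phi(y) * (x - y))

# Illustrative kernel 1: phi(x) = 0.5 * ||x||^2, giving 0.5 * ||x - y||^2.
sq = lambda x: 0.5 * np.sum(x ** 2)
sq_grad = lambda x: x

# Illustrative kernel 2: phi(x) = sum x log x on x > 0, giving the
# generalized KL divergence sum(x log(x / y) - x + y).
ent = lambda x: np.sum(x * np.log(x))
ent_grad = lambda x: np.log(x) + 1.0

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.0, 1.0])
d_euclid = bregman_distance(sq, sq_grad, x, y)
d_kl = bregman_distance(ent, ent_grad, x, y)
```

With the quadratic kernel the Bregman proximal term reduces to the usual Euclidean one, which is why Euclidean methods appear as a special case of the Bregman setting.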
Definition 2.
([39] -smooth adaptable) Given , let be a proper and lower semi-continuous function with , which is continuously differentiable. We say is - smooth adaptable on if there exist and such that for any ,
(4) |
and
(5) |
If , it recovers [5, Definition 2.2]. Suppose is convex. If , this definition recovers [2, Lemma 1] and [38, Definition 1.1].
Definition 3.
([33] -stationary point) Given , a solution is said to be an -stationary point of function if
2.1 Generalized CP decomposition
Consider an -th order tensor , where represents the size of the -th mode of . Such multi-way arrays arise in many applications. is an entry of . The entries of the data tensor can be of various types, such as continuous numbers, non-negative integers, and binary values. A general problem of tensor CP decomposition is to approximate using a low-rank tensor , defined by
(6) |
where denotes the outer product of vectors, , are the unknown mode- latent factor matrix. is the smallest positive integer for which equation (6) is satisfied, and it is also known as the rank of .
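To make the rank-one construction concrete, the sketch below assembles a third-order low-rank tensor from three factor matrices in two equivalent ways: with a single `einsum` contraction and as an explicit sum of outer products of the factor columns. The sizes, the rank, and the helper name `cp_reconstruct` are hypothetical and chosen only for illustration.

```python
import numpy as np

def cp_reconstruct(factors):
    """Sum of R rank-1 tensors built from the columns of three factor matrices."""
    A1, A2, A3 = factors
    return np.einsum("ir,jr,kr->ijk", A1, A2, A3)

rng = np.random.default_rng(0)
R = 2  # hypothetical rank
factors = [rng.random((4, R)), rng.random((3, R)), rng.random((5, R))]
M = cp_reconstruct(factors)

# The same tensor assembled explicitly from outer products of the columns.
M_outer = sum(
    np.multiply.outer(
        np.multiply.outer(factors[0][:, r], factors[1][:, r]), factors[2][:, r]
    )
    for r in range(R)
)
```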
Let be the geometric mean of the dimensions. An -dimensional integer vector is used to represent the entry coordinate, i.e.,
The generalized CP (GCP) decomposition problem can be formulated as the following optimization task, where the primary objective is to minimize a data-adaptive loss function denoted by : ,
(7) |
where the entries and correspond to the elements of and indexed by , respectively. is a structure-promoting regularizer that captures the prior information about the latent factors , such as column-wise orthogonality [32], Tikhonov regularization [44], iterated Tikhonov regularization [40], nonnegativity [36]. Those regularizations can result in well-posed problems. For instance, if is applied, we can write as the indicator function:
Lim and Comon [36] showed that the nonnegative CP decomposition always has optimal solutions. By selecting appropriate loss functions , problem (7) becomes adaptable to handle diverse data types, including continuous, count, and binary data. To illustrate, we present several representative motivating examples.
The difference between GCP and the conventional CP formulation lies in the flexibility in the selection of loss functions. In this section, we provide alternative loss functions by examining the statistical likelihood of a model for a specific data tensor. We assume a parameterized probability density function (PDF) or probability mass function (PMF) is available, offering the likelihood estimation for each entry, i.e.,
Here represents an observation of a random variable, while denotes an invertible link function connecting the model parameter with the natural parameter of the distribution, . The link function is commonly assumed to be the identity function or related to the expectation of distribution.
We aim to find the model that is the maximum likelihood estimate (MLE) across all entries. By assuming conditional independence of observations, the overall likelihood is simplified to the product of individual likelihoods. Hence, the MLE is the solution of the following optimization problem:
Here, . Then, we employ the negative logarithm to transform the product into a summation and convert it to a minimization problem. As the logarithm is a monotonic function, it preserves the maximizer.
(8) |
Data Type | Distribution | Link Function | Loss function | Constraints |
---|---|---|---|---|
Continuous: | Gaussian | |||
Gamma | ||||
Count: | Poisson | |||
Binary: | Bernoulli | |||
Table 2 presents commonly utilized generalized loss functions , associated link functions, and various distributions.
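For readers who prefer code, the sketch below writes down elementwise losses that are commonly paired with these distributions in the GCP literature: identity link for the Gaussian, Gamma, and Poisson cases, and an odds link for the Bernoulli case. The exact forms, the normalization of the objective, and the small constant `EPS` are assumptions made for illustration and should be checked against Table 2.

```python
import numpy as np

EPS = 1e-10  # hypothetical safeguard against log(0) and division by zero

def gaussian_loss(x, m):
    return (x - m) ** 2

def gamma_loss(x, m):
    return x / (m + EPS) + np.log(m + EPS)

def poisson_loss(x, m):
    return m - x * np.log(m + EPS)

def bernoulli_odds_loss(x, m):
    return np.log(m + 1.0) - x * np.log(m + EPS)

def gcp_objective(loss, X, M):
    """Average elementwise loss over all entries (normalization is an assumption)."""
    return np.mean(loss(X, M))
```

For the Gamma and Poisson losses with identity link, the minimizer over the model entry coincides with the observed entry, which is the maximum likelihood estimate for a single observation.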
2.2 Stochastic methods for GCP decomposition
We can modify the element-wise regularized problem (7) to a block-wise regularized problem
(9) |
where , , and is finite-valued and differentiable. For brevity, let .
We present the stochastic gradient with respect to the factor , denoted by . Suppose a mode index is sampled from to . For instance, we take the squared error loss function as an example. Namely, consider
where , denotes the Khatri-Rao product (the column-wise Kronecker product), is the mode- matricization and .
Then, we rewrite the full gradient as follows
(10) |
Here, denotes the -th column of , and is the expectation over the index . To alleviate the burden of computing the full gradient (10), we randomly sample a set of mode- fibers that is indexed by with . Note that a mode- fiber of is a row of the mode- unfolding . Compared with the fiber sampling-based method in [1], our requirement on the batchsize is much lower. Hence, it admits lower per-iteration memory and computational complexities, especially when the rank is high.
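The fiber-sampling idea for the squared loss can be sketched as follows. The code uses the transposed unfolding so that each row is one mode-n fiber, forms the Khatri-Rao product of the remaining factors, and averages the gradient over the sampled fibers only. The helper names, the C-order unfolding convention, and the 1/|F| scaling are assumptions of this sketch and are not taken from the paper.

```python
import numpy as np

def khatri_rao(mats):
    """Column-wise Kronecker product of a list of matrices with R columns each."""
    out = mats[0]
    for B in mats[1:]:
        out = (out[:, None, :] * B[None, :, :]).reshape(-1, B.shape[1])
    return out

def unfold_T(X, n):
    """Transposed mode-n unfolding: each row is one mode-n fiber of X."""
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1).T

def fiber_sampled_grad(X, factors, n, fibers):
    """Gradient of (1/(2|F|)) * ||X_s - H_s A_n^T||_F^2 with respect to A_n."""
    H = khatri_rao([A for k, A in enumerate(factors) if k != n])  # (J_n, R)
    Xn_T = unfold_T(X, n)                                          # (J_n, I_n)
    H_s, X_s = H[fibers], Xn_T[fibers]                             # sampled rows
    residual = H_s @ factors[n].T - X_s                            # (|F|, I_n)
    return residual.T @ H_s / len(fibers)                          # (I_n, R)

rng = np.random.default_rng(0)
factors = [rng.random((4, 3)), rng.random((5, 3)), rng.random((6, 3))]
X = np.einsum("ir,jr,kr->ijk", *factors)        # noiseless rank-3 tensor
n = 1                                           # mode being updated
J_n = X.size // X.shape[n]                      # number of mode-n fibers
fibers = rng.choice(J_n, size=8, replace=False)
G = fiber_sampled_grad(X, factors, n, fibers)   # zero here, since the fit is exact
```

Only the sampled rows of the Khatri-Rao product and of the unfolding enter the computation, so the per-iteration cost scales with the number of sampled fibers rather than with the full tensor. A practical implementation would build only those sampled rows instead of forming the full product as done here for clarity.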
Let be the stochastic gradient of for , we have
(11) |
where
2.3 Stochastic mirror descent
In this subsection, we introduce some basics for optimization methods used in this paper, such as stochastic mirror descent (SMD) and inertial framework.
Consider the special case in optimization problem (1) with the block
where the component function is a continuously differentiable nonconvex function, and is an extended-valued function that is bounded from below. The update of SMD [37, 41, 64] is given as follows,
where the stochastic gradient can be chosen as the mini-batch version . Here, the mini-batch is chosen uniformly at random from all subsets of , with the batch size being considerably smaller than . Compared with SGD, SMD replaces the quadratic term . When is properly designed, SMD can exploit the geometry of the problem and achieve significant efficiency gains over SGD, particularly with generalized loss functions. An extensive literature on MD and SMD in optimization is available [5, 34, 14, 64, 35].
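The contrast between SGD and SMD can be seen in closed form when no regularizer is present. Under that simplifying assumption, a mirror step solves $\nabla\phi(x^{+}) = \nabla\phi(x) - \eta g$: the quadratic kernel gives the usual gradient step, while the entropy kernel gives a multiplicative update that keeps the iterate positive. The kernels, the stepsize symbol `eta`, and the helper names are illustrative assumptions.

```python
import numpy as np

def smd_step_euclidean(x, g, eta):
    """Mirror step with phi(x) = 0.5 * ||x||^2: identical to a plain (S)GD step."""
    return x - eta * g

def smd_step_entropy(x, g, eta):
    """Mirror step with phi(x) = sum x log x: multiplicative update, keeps x > 0."""
    return x * np.exp(-eta * g)

def minibatch_grad(grad_i, x, batch):
    """Average of the component gradients grad_i(x, i) over a sampled mini-batch."""
    return np.mean([grad_i(x, i) for i in batch], axis=0)

x = np.array([0.5, 1.0, 2.0])
g = np.array([1.0, -2.0, 0.3])
eta = 0.1
x_gd = smd_step_euclidean(x, g, eta)
x_md = smd_step_entropy(x, g, eta)
```

The entropy step never leaves the positive orthant, which is convenient for losses such as the Poisson or Gamma loss whose gradients blow up near zero.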
As usual in the analysis of Bregman-based schemes, the following simple but remarkable three-point identity for is very useful; it follows from elementary algebra. Given any and int dom , the three-point identity is
(12) |
For the multi-block problem, Pu et al. [49] developed a unified stochastic mirror descent algorithmic framework (SmartCPD) for large-scale CPD under various non-Euclidean losses, which is a special case of the multi-block problem and updates the factor variables by
(13) |
However, directly employing stochastic mirror descent for the GCP problem may not yield the most effective results. In this paper, we study stochastic gradients under the variance-reduced stochastic gradient estimators, such as SAGA [15] and SARAH [43]. Furthermore, the inertial acceleration framework is applied, which can be given by
where are two inertial parameters. For example, if , it reduces to the gradient descent method; if , it reduces to the heavy-ball method [48]; if , it reduces to the Nesterov accelerated gradient method [42].
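A minimal sketch of such a two-parameter inertial scheme is given below, assuming the common form in which one extrapolated point is used as the base of the step and a second one as the point where the gradient is evaluated. The names `alpha`, `beta`, and `eta`, and the quadratic test problem, are assumptions of this sketch.

```python
import numpy as np

def inertial_gradient_step(x, x_prev, grad, eta, alpha, beta):
    """One inertial step: extrapolate twice, then take a gradient step.

    alpha = beta = 0 -> gradient descent
    beta = 0         -> heavy-ball
    alpha = beta     -> Nesterov-type acceleration
    """
    y = x + alpha * (x - x_prev)   # point the step is taken from
    z = x + beta * (x - x_prev)    # point where the gradient is evaluated
    return y - eta * grad(z)

# Hypothetical test problem: f(x) = 0.5 * x^T Q x with minimizer 0.
Q = np.diag([1.0, 10.0])
grad = lambda v: Q @ v
x_prev = x = np.array([1.0, 1.0])
for _ in range(200):
    x, x_prev = inertial_gradient_step(x, x_prev, grad, eta=0.1, alpha=0.5, beta=0.5), x
```

Separating the two extrapolation points is what lets a single update rule cover gradient descent, heavy-ball, and Nesterov-type schemes by changing only the two parameters.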
3 Inertial accelerated block-randomized SMD
In this section, we propose an inertial accelerated block-randomized stochastic mirror descent algorithm (iTableSMD) for GCP decomposition (9). Before presenting the algorithm framework of iTableSMD, we make the following assumptions throughout the paper.
Assumption 1.
We assume that the following conditions hold:
(i) are proper lower semi-continuous (l.s.c.) functions that are bounded from below. There exists such that is convex.
(ii) is continuously differentiable and -strongly convex. Let for simplicity.
(iii) is Lipschitz continuous with modulus . For any two points , , it holds that
(iv) is a proper and lower semi-continuous function with .
(v) The pair of functions is -smooth adaptable.
(vi) The function is bounded from below, i.e., there exists a finite optimal objective value .
Input: an -way tensor ; the rank ; the sample size ; initialization ; stepsize ; inertial parameters ; two constants with .
(14) |
(15) | ||||
Output: .
Let and be the stochastic parameters for the block index and the stochastic gradient, respectively. Denote and .
Definition 4.
(Variance reduced stochastic gradient) We say a gradient estimator with , is variance-reduced with constants , and if it satisfies the following conditions:
(i) (MSE Bound): there exists a sequence of random variables such that
(16) |
and random variables such that
(17) |
(ii) (Geometric Decay): The sequence satisfies the following inequality in expectation:
(18) |
(iii) (Convergence of Estimator): For all sequences , if they satisfy , then it follows that and .
In Proposition 1, we show both SAGA and SARAH are variance reduced stochastic gradients.
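To illustrate how a variance-reduced estimator of the SAGA type operates on a finite sum, the sketch below keeps a table of the most recent component gradients, corrects the mini-batch gradient with the table entries, and then refreshes only the sampled entries. The class name, its interface, and the toy quadratic components are hypothetical; the batch is assumed to contain distinct indices.

```python
import numpy as np

class SagaEstimator:
    """SAGA-style variance-reduced gradient for f(x) = (1/n) * sum_i f_i(x)."""

    def __init__(self, grad_i, x0, n):
        self.grad_i = grad_i
        self.n = n
        # Table of the most recently evaluated gradient of every component.
        self.table = np.stack([grad_i(x0, i) for i in range(n)])
        self.table_mean = self.table.mean(axis=0)

    def estimate(self, x, batch):
        fresh = np.stack([self.grad_i(x, i) for i in batch])
        diff = fresh - self.table[batch]
        g = diff.mean(axis=0) + self.table_mean
        # Refresh the stored gradients of the sampled components only.
        self.table_mean = self.table_mean + diff.sum(axis=0) / self.n
        self.table[batch] = fresh
        return g

# Toy components f_i(x) = 0.5 * ||x - c_i||^2 (illustrative only).
rng = np.random.default_rng(0)
C = rng.random((6, 3))
est = SagaEstimator(lambda x, i: x - C[i], np.zeros(3), n=6)
x1 = np.ones(3)
g_partial = est.estimate(x1, np.array([0, 2]))
```

The estimator costs one mini-batch of fresh gradients per call, at the price of storing one gradient per component.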
4 Convergence analysis
This section establishes the convergence properties of the iTableSMD algorithm. We prove its sublinear convergence rate for the subsequential sequence and further show that iTableSMD requires at most iterations in expectation to attain an -stationary point. Additionally, we confirm the global convergence of the generated sequence.
4.1 Subsequential convergence analysis
Next, we show the descent amount of under expectation in the following lemma.
Lemma 1.
Suppose Assumption 1 is satisfied and with , is variance-reduced in the sense of Definition 4. Let with be the sequence generated by Algorithm 1. Then the following inequality holds for any ,
Here, , is the weakly convex parameter in Assumption 1 (i), and are introduced in (14), and , are parameters in Definition 4.
Proof.
From the convexity of , we can obtain the following inequality
(19) |
where . From the optimality condition of (15), it shows that
which combined with (19) yields that
Furthermore, since is an -relative smooth function with respect to , we have
and
Combining two inequalities, we can get
(20) |
By summing the two inequalities together, we obtain
(21) |
where the last inequality follows from for any and .
Suppose at the -th iteration. Applying the conditional expectation operator to inequality (21) and bounding the MSE term by (16) in Definition 4, we have
(23) |
where the last inequality follows from (18) in Definition 4. From (14) and , it follows that
and we also use notation for simplicity. Then we can get
Therefore, the results can be obtained by rearranging the above terms with . This completes the proof. ∎
Next, we introduce a new Lyapunov function and show it is monotonically decreasing in expectation. For simplicity, we denote
Lemma 2.
Proof.
From Lemma 1, it shows that
(27) |
Combining (25) with , we have
Let , and assume (in the numerical experiments in [47, 59], and , so this inequality holds); then we have
where the second and the last inequality follow from (27) and (24), respectively. This completes the proof.
∎
Theorem 1.
Let with be a sequence generated by iTableSMD algorithm. Then, the following statements hold.
-
(i)
The sequence is nonincreasing.
-
(ii)
, and the sequence converges to zero.
-
(iii)
.
Proof.
-
(i)
This statement follows directly from Lemma 2 and .
- (i)
-
(iii)
We have
which yields the desired result.
This completes the proof. ∎
4.2 Global convergence analysis
In this subsection, we analyze the expected squared distance of the subgradient to zero and the global convergence of the iTableSMD algorithm. In addition, we impose another, stronger assumption on function .
Assumption 2.
The partial gradient is Lipschitz continuous with modulus on bounded sets of . Namely, for any two points and , where , , it shows that
Under Definition 4 and the definition of SAGA [15] and SARAH [43], we have the following proposition.
Proposition 1.
Under Assumption 2, the following two statements hold.
-
(i)
The SAGA gradient estimator [15] is defined as
(28) where , and the variable follow the update rules if and otherwise. A set of sampled mode- fibers is indexed by with . Then it is variance reduced with
The constants , , .
-
(ii)
The SARAH gradient estimator [43] which is defined as
Here “w.p. ” means with probability . Then it is variance reduced with
and constants , , .
Proof.
From the definition of SAGA stochastic gradient estimator and the Lipschitz continuity of , it shows that
where the last inequality follows from the fact that for any independent random variables with for all . Combined with Jensen’s inequality, we can get
We bound the MSE of the stochastic gradient estimator as follows,
where the first inequality follows from .
Let and , it shows that
This proves the geometric decay of in expectation. Similar to Appendix B in [16], the third condition in Definition 4 also holds. For the SARAH stochastic gradient estimator, statement (ii) follows by an argument similar to Lemma 5 in [58]. This completes the proof. ∎
Corollary 1.
If , the inequality from Lemma 2 becomes
Now we can prove the following result, which means that the subgradient of is bounded.
Lemma 3.
Suppose that Assumptions 1-2 hold and the stepsize satisfies and (24). The sequence generated by iTableSMD is bounded for all . Define
where and , implying that . Then, we can obtain
where .
Proof.
From the implicit definition of the proximal operator (15) in the iTableSMD algorithm, we have
where . Combining it with , we have . Furthermore, in Problem (9), with , we have , and it follows that , where .
All that remains is to bound the norm of . Suppose at the -th iteration. It shows that
where . This completes the proof. ∎
Lemma 4.
Under the same conditions in Lemma 3, there exists a constant such that
Proof.
From Lemma 3, it shows that
where . Through and taking full expectation on both sides, it shows that
This completes the proof. ∎
Using Lemma 4, we can show the convergence rate of the expected squared distance of the subgradient to .
Theorem 2.
Proof.
We define the set of cluster points of as
(30) |
Lemma 5.
Proof.
The proof of the above statements is similar to that of Lemma 9 in [58], so we omit the details here for simplicity. ∎
The following lemma is from [16], which is analogous to the Uniformized KŁ property of [4] and allows us to apply the KŁ inequality.
Lemma 6.
Assume that is a bounded sequence of iterates for all generated by the iTableSMD algorithm using a variance-reduced gradient estimator (see Definition 4). Let be a semialgebraic function satisfying the KŁ property [4] with exponent . Then there exists an index and a desingularizing function with , so that the following bound holds almost surely (a.s.),
(31) |
where is a nondecreasing sequence converging to for some .
Now we give the global convergence result of the iTableSMD algorithm in the following theorem which can be proved by the above lemmas and we omit the proof details here. See [16, 58] for details.
Theorem 3.
Suppose that Assumptions 1-2 hold, the step satisfies and (24). Let be the sequence generated by the iTableSMD algorithm which is assumed to be bounded. If the optimization function is a semialgebraic function that satisfies the KŁ property with exponent (see Lemma 6), then either the point is a critical point after a finite number of iterations or the sequence almost surely satisfies the finite length property in expectation, namely,
5 Numerical experiments
In this section, we evaluate the proposed iTableSMD (Algorithm 1) on synthetic datasets as well as multiple real-world datasets. We aim to demonstrate its efficiency through comparisons with two state-of-the-art baselines. The first is an entry-sampling based stochastic non-Euclidean CP decomposition algorithm, namely GCP-OPT, proposed in [31, 26]. The GCP-OPT method is implemented in Tensor Toolbox, and “Adam” is selected as the optimization solver. The sampling rule of GCP-OPT is the default “uniform” setting for dense tensors unless specified otherwise for particular examples. The second is a tensor fiber-sampling based flexible stochastic mirror descent framework, denoted by SmartCPD [49].
We perform experiments involving low-rank GCP decomposition with nonnegative constraints for . Our focus extends to three distinct synthetic data distributions: Gamma, Poisson, and Bernoulli. Additionally, we incorporate several real datasets into our analysis, including the Enron emails dataset [50] and six months of Uber pickup data [51] in New York City, both of which are characterized by integer counts following the Poisson distribution. Each element in the Enron emails dataset is indexed by sender, receiver, and word, and the values are word counts. Each element in the Uber pickup data is indexed by the date, latitude, and longitude of the pickup, and the values are pickup counts. We also test the tags from the Flickr dataset [20], where non-zero values are binary, indicating user tagging of images on a given day. In synthetic experiments, we generate third-order tensors with different sizes and ranks, and we do not require the dimensions of the tensor to be equal. For the fiber-sampling based algorithms, in each iteration, iTableSMD and SmartCPD sample fibers, while the entry-sampling algorithm GCP-OPT samples entries. The generating function of Bregman distance is chosen as in the update of iTableSMD and SmartCPD. Performance is measured by the cost function value (denoted by “NRE”) and the mean squared error (MSE). The MSE of the latent matrices is defined as
where denotes the estimate of original matrix and represents a permutation of the set , which is used to fix the intrinsic column permutation in CP decomposition.
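A permutation-matched factor error of this kind can be computed as in the sketch below. It normalizes the columns of both matrices, tries every column permutation of the true factor, and keeps the smallest mean squared column error. The column normalization and the brute-force search, which is practical only for small ranks, are assumptions of this sketch rather than a description of the implementation used in the experiments.

```python
import itertools
import numpy as np

def factor_mse(A_true, A_est):
    """Smallest mean squared column error over all column permutations.

    Columns are scaled to unit norm first (an assumption) to remove the
    scaling ambiguity; brute force over permutations suits small R only.
    """
    A = A_true / np.linalg.norm(A_true, axis=0)
    B = A_est / np.linalg.norm(A_est, axis=0)
    R = A.shape[1]
    best = np.inf
    for perm in itertools.permutations(range(R)):
        err = np.sum((A[:, list(perm)] - B) ** 2) / R
        best = min(best, err)
    return best

rng = np.random.default_rng(0)
A_true = rng.random((6, 3))
# An estimate that differs only by column order and column scaling.
A_est = A_true[:, [2, 0, 1]] * np.array([2.0, 0.5, 3.0])
mse_exact = factor_mse(A_true, A_est)
mse_noisy = factor_mse(A_true, A_est + 0.1 * rng.random((6, 3)))
```

For larger ranks, the search over permutations could be replaced by a linear assignment solver applied to the matrix of pairwise column errors.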
5.1 Synthetic data experiments
5.1.1 Gamma distribution
In this subsection, we compute the GCP decomposition on two artificial three-way tensors of size and with different ranks using the gamma loss function: . In practice, we use the constraint and replace with (e.g., ) in the loss function to prevent function values or gradients from becoming . Namely, . With nonnegative constraints on the factor matrices, the latent factors , , and are drawn from an i.i.d. uniform distribution between 0 and , where is a positive constant. The observed nonnegative data tensor is generated following the gamma distribution, i.e. . Namely, we focus on
We set the inertial parameters as , for simplicity and set the stepsize as to compare SmartCPD with SGD and SAGA, GCP-OPT, and iTableSMD with SGD and SAGA. Theoretically, the inequality (14) is required. However, it is time-consuming to check this inequality in the numerical experiments, so we directly set . Our numerical experiments show that iTableSMD always converges with this . Our numerical results for the two synthetic datasets are presented in Figure 1.
The first synthetic experiment, visualized in the top row of the figure, captures the algorithmic performance across a tensor of dimensions , evaluated at varying ranks. Notably, the iTableSMD, especially when coupled with the SAGA, achieves a rapid improvement in MSE. For instance, in Figure 1(a), iTableSMD-SAGA reduces the MSE to below within an average time of fewer than 3 seconds, outpacing SmartCPD, which requires a minimum of 6 seconds, and GCP-OPT, which exceeds 10 seconds to attain comparable MSE reductions.
The second row in Figure 1 evaluates the efficacy of various algorithms on a tensor with increased dimensions . As the tensor size increases, the iTableSMD method attains a lower MSE within the same time compared to SmartCPD and GCP-OPT. The noticeable improvement over the SmartCPD method indicates that the inertial acceleration framework of iTableSMD has a significant impact on its performance, helping to achieve faster convergence.
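Figure 1: MSE of the compared algorithms on synthetic tensors under the gamma loss. Panels (a)-(c): the first tensor size at varying ranks; panels (d)-(f): the larger tensor size at varying ranks.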
5.1.2 Poisson distribution
We next evaluate the performance on two synthetic count data tensors with the size of and . For simplicity in our experiments, we set inertial parameters and using a formula based on the iteration , and chose a stepsize as . The loss function used is a modification of the standard Poisson log-likelihood and is defined as . We initialized the factor matrices , , and with values uniformly distributed between 0 and a set maximum , here chosen as 0.5. The observed count data tensor is generated following the Poisson distribution, i.e. .
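The data generation described above can be sketched as follows: nonnegative factors are drawn uniformly from [0, 0.5], the low-rank parameter tensor is formed from them, and each count is drawn from a Poisson distribution with the corresponding entry as its mean (identity link). The tensor size, the rank, the random seed, and the form of the safeguarded loss are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dims, R = (20, 15, 10), 5          # hypothetical sizes and rank
a_max = 0.5                        # upper bound of the uniform distribution

factors = [rng.uniform(0.0, a_max, size=(I, R)) for I in dims]
M = np.einsum("ir,jr,kr->ijk", *factors)   # low-rank Poisson parameter tensor
X = rng.poisson(M)                          # count observations

def poisson_objective(X, factors, eps=1e-10):
    """Average safeguarded Poisson loss m - x * log(m + eps) (assumed form)."""
    M_hat = np.einsum("ir,jr,kr->ijk", *factors)
    return np.mean(M_hat - X * np.log(M_hat + eps))

loss_at_truth = poisson_objective(X, factors)
```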
Figure 2 shows the results of numerical experiments on synthetic datasets modeled with the Poisson distribution for tensors of two sizes, with varying tensor ranks. For the smaller tensor (), as the rank increases from to , iTableSMD-SAGA maintains a lower MSE than the other methods, indicating more efficient performance. In particular, for R = 20 in Figure 2(c), iTableSMD-SAGA reduces the MSE significantly faster than the other methods within 10 seconds. For the larger tensor (), in Figures 2(d)-(f), as the rank grows, the MSE tends to decrease at a slower rate for all methods. However, iTableSMD-SAGA still shows a consistent advantage, reaching lower MSEs more quickly than the competing algorithms, which becomes more notable as the rank moves to 30 and beyond. Comparing Figures 2(b)-(e) with Figure 1, we can see that SmartCPD does not always perform better than GCP-OPT, since the type of data distribution may affect its effectiveness. However, the iTableSMD variants continue to show superior performance in terms of iteration speed, regardless of the change in data distribution.
Figure 2: panels (a)-(c) for the smaller tensor and (d)-(f) for the larger tensor.
5.1.3 Bernoulli distribution
We next run numerical experiments on binary tensors of sizes and . We set the inertial parameters and using a formula based on the iteration , and choose a stepsize as . The loss function is and each entry of the binary tensor is generated from the Bernoulli distribution, i.e., with probability . Then, we focus on
In Figure 3(a), with tensor size and rank , all algorithms quickly reduce the MSE within the first few seconds. As the rank increases to in Figures 3(b) and (e), and in Figures 3(c) and (f), respectively, the MSE decreases noticeably more slowly, with higher ranks leading to a slight slowdown in convergence. These four subfigures further confirm that SmartCPD does not consistently outperform GCP-OPT and may at times be less effective. Nonetheless, iTableSMD-SAGA consistently shows robust performance, achieving low MSEs faster than the other methods. For the larger tensor sizes in Figures 3(d)-(f), a similar pattern is evident, with all algorithms running slower as the size and rank increase. Yet the relative efficiency of iTableSMD-SAGA remains apparent, suggesting its advantage in dealing with larger and more complex data sets.
Overall, iTableSMD-SGD and iTableSMD-SAGA reliably achieve lower MSEs more quickly across various synthetic tensor sizes and ranks, underlining their efficiency in different scenarios.
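For the binary-data setup of this subsection, a minimal sketch of the data generation and loss evaluation is given below. It assumes the odds-link convention of the GCP framework in [26], where a nonnegative model entry m is read as odds, the success probability is m/(1+m), and the loss is log(1 + m) - x log(m + eps). The sizes, rank, and factor distribution are hypothetical.

```python
import numpy as np

def bernoulli_odds_loss(X, M, eps=1e-10):
    """Average of log(1 + m) - x * log(m + eps), with m >= 0 read as odds."""
    return float(np.mean(np.log1p(M) - X * np.log(M + eps)))

rng = np.random.default_rng(1)
I, J, K, R = 30, 30, 30, 5   # hypothetical sizes, not the ones used in the paper

# Nonnegative factors give a nonnegative low-rank odds tensor M.
A, B, C = (rng.uniform(0.0, 1.0, size=(n, R)) for n in (I, J, K))
M = np.einsum("ir,jr,kr->ijk", A, B, C)

# Odds link (assumption): entry equals 1 with probability m / (1 + m).
P = M / (1.0 + M)
X = (rng.random(M.shape) < P).astype(np.int64)

loss = bernoulli_odds_loss(X, M)
print(X.mean(), P.mean(), loss)
```

The empirical fraction of ones should be close to the average success probability, which gives a simple check that the generator matches the assumed link.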
Figure 3: panels (a)-(c) for the smaller tensor and (d)-(f) for the larger tensor.
5.2 Real data experiments
5.2.1 Enron emails dataset
We apply the algorithms to the Enron emails dataset. This dataset comprises a large collection of email messages exchanged by the employees of the Enron Corporation, which was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). We use the extractive version in [50] and select a subset involving 142 senders, 147 receivers, and 148 unique words, excluding any additional dimensions. The data is in the form of a third-order tensor () with integer entries representing the number of words. The size of the tensor is . It has 6581 () nonzero entries. We choose the loss function corresponding to the Poisson distribution, i.e., and non-negativity constraints are considered for the latent matrices. In every iteration, fibers are sampled by iTableSMD and SmartCPD and entries are sampled for GCP-OPT. We set inertial parameters . All algorithms under test are stopped when the relative change in the loss function is less than .
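The two mechanical ingredients mentioned above, fiber sampling and the relative-change stopping rule, can be sketched as follows. The helper names `sample_fibers` and `relative_change_small` are hypothetical, sampling without replacement is an assumption, and the toy tensor is only for illustration; the actual batch sizes and tolerance are those stated in the text.

```python
import numpy as np

def sample_fibers(X, mode, num_fibers, rng):
    """Draw `num_fibers` mode-`mode` fibers of X uniformly without replacement.

    Returns the fibers as columns of a matrix, together with the sampled
    column indices of the mode-`mode` unfolding.
    """
    # Mode-n unfolding: the chosen mode indexes rows, all other modes the columns.
    unfolded = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)
    cols = rng.choice(unfolded.shape[1], size=num_fibers, replace=False)
    return unfolded[:, cols], cols

def relative_change_small(prev_loss, curr_loss, tol):
    """True when |prev - curr| / |prev| falls below the tolerance."""
    denom = max(abs(prev_loss), np.finfo(float).tiny)
    return abs(prev_loss - curr_loss) / denom < tol

# Toy third-order tensor, only for illustration.
X = np.arange(24).reshape(2, 3, 4)
rng = np.random.default_rng(0)
fibers, cols = sample_fibers(X, mode=1, num_fibers=5, rng=rng)
print(fibers.shape)
print(relative_change_small(100.0, 99.99, 1e-3))
```

Each column of `fibers` is one mode-1 fiber `X[i, :, k]`, so a stochastic update for that mode only touches the sampled fibers rather than the full tensor.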
Figure 4 presents a series of numerical experiments conducted on the Enron dataset with a Poisson distribution, examining the performance of the algorithms with different tensor ranks under , , and , respectively. Each algorithm is run for 5 trials, and in each trial the factor matrices are initialized by sampling their entries from the uniform distribution between 0 and 1. iTableSMD stands out for its quick cost reduction, reaching a low cost within approximately 6 seconds, noticeably faster than the baseline methods, which require more time and yet do not achieve as low a cost. This advantage is most notable at the higher rank of , where iTableSMD lowers the cost quickly and surpasses the other algorithms, which struggle to converge within the same time.
Figure 4: panels (a)-(c) for the three tensor ranks.
5.2.2 The Flickr dataset
We evaluate the algorithms on the Flickr dataset, as referenced by Görlitz et al. [20]. The dataset consists of tags representing whether a user has labeled an image on a particular day, with non-zero values marked as binary indicators. We form a third-order binary tensor of size . The chosen loss function is tailored for the Bernoulli distribution, i.e., . Other settings and parameters are as before. Figure 5 shows the cost value against time in seconds for different values of . As with the previous datasets, the proposed iTableSMD shows considerable runtime advantages over GCP-OPT. The results reinforce the capability of iTableSMD to handle complex, real-world datasets effectively.
Figure 5: panels (a)-(c).
6 Conclusion
In this paper, we proposed an inertial accelerated block randomized stochastic mirror descent algorithm (iTableSMD) for nonconvex multi-block objective functions beyond global Lipschitz gradient continuity. This algorithm is particularly tailored for large-scale Generalized Tensor CP (GCP) decomposition under non-Euclidean losses. By integrating a broader version of multi-block variance reduction, we established the sublinear convergence rate for the subsequential sequence produced by the iTableSMD algorithm and proved that it requires at most iterations in expectation to attain an -stationary point. Additionally, we verified the global convergence of the sequence generated by iTableSMD. We tested the algorithm on various types of simulated and real data against several baselines, and the results indicate significant computational efficiency improvements over existing state-of-the-art methods. These results highlight the advantages and effectiveness of incorporating an inertial accelerated stochastic approach in the algorithmic framework for GCP tensor decomposition.
Declarations
Funding: This research is supported by the R&D project of Pazhou Lab (Huangpu) (Grant No. 2023K0603), the National Natural Science Foundation of China (NSFC) (Grant No. 12171021), and the Fundamental Research Funds for the Central Universities (Grant No. YWF-22-T-204).
Competing interests: The authors have no competing interests to declare that are relevant to the content of this article.
Data Availability Statement: Data will be made available on reasonable request.
References
- [1] C. Battaglino, G. Ballard, and T. G. Kolda. A practical randomized CP tensor decomposition. SIAM J. Matrix Anal. Appl., 39(2):876–901, 2018.
- [2] H. H. Bauschke, J. Bolte, and M. Teboulle. A descent lemma beyond Lipschitz gradient continuity: First-order methods revisited and applications. Math. Oper. Res., 42(2):330–348, 2017.
- [3] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Res. Lett., 31(3):167–175, 2003.
- [4] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program., 146(1-2):459–494, 2014.
- [5] J. Bolte, S. Sabach, M. Teboulle, and Y. Vaisbourd. First order methods beyond convexity and Lipschitz gradient continuity with applications to quadratic inverse problems. SIAM J. Optim., 28(3):2131–2151, 2018.
- [6] J. D. Carroll and J. J. Chang. Analysis of individual differences in multidimensional scaling via an N-way generalization of “Eckart-Young” decomposition. Psychometrika, 35(3):283–319, 1970.
- [7] R. Cattell. “Parallel proportional profiles” and other principles for determining the choice of factors by rotation. Psychometrika, 9(4):267–283, 1944.
- [8] R. B. Cattell. The three basic factor-analytic research designs-their interrelations and derivatives. Psychol. Bull., 49(5):499–520, 1952.
- [9] L. Cheng, X. Tong, S. Wang, Y.-C. Wu, and H. V. Poor. Learning nonnegative factors from tensor data: Probabilistic modeling and inference algorithm. IEEE Trans. Signal Process., 68:1792–1806, 2020.
- [10] E. C. Chi and T. G. Kolda. On tensors, sparsity, and nonnegative factorizations. SIAM J. Matrix Anal. Appl., 33(4):1272–1299, 2012.
- [11] A. Cichocki and A. Phan. Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE Trans. Fundam. Electron. Commun. Comput. Sci., 92-A:708–721, 2009.
- [12] P. Comon, X. Luciani, and A. L. F. de Almeida. Tensor decompositions, alternating least squares and other tales. J. Chemom., 23, 2009.
- [13] C. D. Dang and G. Lan. Stochastic block mirror descent methods for nonsmooth and stochastic optimization. SIAM J. Optim., 25(2):856–881, 2015.
- [14] D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions, 2018.
- [15] A. Defazio, F. R. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27, pages 1646–1654, 2014.
- [16] D. Driggs, J. Tang, J. Liang, M. E. Davies, and C. Schönlieb. A stochastic proximal alternating minimization for nonsmooth and nonconvex optimization. SIAM J. Imaging Sci., 14(4):1932–1970, 2021.
- [17] B. Ermiş, E. Acar, and A. T. Cemgil. Link prediction in heterogeneous data via generalized coupled tensor factorization. Data Min. Knowl. Discov., 29(1):203–236, 2015.
- [18] X. Fu, S. Ibrahim, H. Wai, C. Gao, and K. Huang. Block-randomized stochastic proximal gradient for low-rank tensor factorization. IEEE Trans. Signal Process., 68:2170–2185, 2020.
- [19] X. Fu, E. Seo, J. Clarke, and R. A. Hutchinson. Link prediction under imperfect detection: Collaborative filtering for ecological networks. IEEE Transactions on Knowledge and Data Engineering, 33(8):3117–3128, 2021.
- [20] O. Görlitz, S. Sizov, and S. Staab. Pints: peer-to-peer infrastructure for tagging systems. In IPTPS, page 19, 2008.
- [21] D. Han. A survey on some recent developments of alternating direction method of multipliers. J. Oper. Res. Soc. China, 10:1–52, 2022.
- [22] R. A. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multi-modal factor analysis. 1970.
- [23] J. Hertrich and G. Steidl. Inertial stochastic PALM and applications in machine learning. Sampl. Theory Signal Process. Data Anal., 20(1), 2022.
- [24] F. L. Hitchcock. The expression of a tensor or a polyadic as a sum of products. J. Math. Phys., 6(1-4):164–189, 1927.
- [25] F. L. Hitchcock. Multiple invariants and generalized rank of a p-way matrix or tensor. J. Math. Phys., 7(1-4):39–79, 1928.
- [26] D. Hong, T. G. Kolda, and J. A. Duersch. Generalized canonical polyadic tensor decomposition. SIAM Rev., 62(1):133–163, 2020.
- [27] K. Huang and N. D. Sidiropoulos. Kullback-Leibler principal component for tensors is not NP-hard. In 2017 51st Asilomar Conference on Signals, Systems, and Computers, pages 693–697, 2017.
- [28] K. Huang, N. D. Sidiropoulos, and A. P. Liavas. A flexible and efficient algorithmic framework for constrained matrix and tensor factorization. IEEE Trans. Signal Process., 64(19):5052–5065, 2016.
- [29] N. Kargas and N. Sidiropoulos. Learning mixtures of smooth product distributions: Identifiability and algorithm. In Proc. 22nd Int. Conf. Artif. Intell. Statist., pages 388–396, 2019.
- [30] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Rev., 51(3):455–500, 2009.
- [31] T. G. Kolda and D. Hong. Stochastic gradients for large-scale tensor decomposition. SIAM J. Math. Data Sci, abs/1906.01687, 2019.
- [32] W. P. Krijnen, T. K. Dijkstra, and A. Stegeman. On the non-existence of optimal solutions and the occurrence of “degeneracy” in the CANDECOMP/PARAFAC model. Psychometrika, 73(3):431–439, 2008.
- [33] G. Lan. First-Order and Stochastic Optimization Methods for Machine Learning. Springer, 2020.
- [34] P. Latafat, A. Themelis, M. Ahookhosh, and P. Patrinos. Bregman Finito/MISO for nonconvex regularized finite sum minimization without Lipschitz gradient continuity. SIAM J. Optim., 32(3):2230–2262, 2022.
- [35] Q. Li, Z. Zhu, G. Tang, and M. B. Wakin. Provable Bregman-divergence based methods for nonconvex and non-Lipschitz problems, 2019.
- [36] L.-H. Lim and P. Comon. Nonnegative approximations of nonnegative tensors. J. Chemom., 23:432–441, 2009.
- [37] H. Lu. “Relative-continuity” for non-Lipschitz non-smooth convex optimization using stochastic (or deterministic) mirror descent, 2018.
- [38] H. Lu, R. M. Freund, and Y. E. Nesterov. Relatively smooth convex optimization by first-order methods, and applications. SIAM J. Optim., 28(1):333–354, 2018.
- [39] M. C. Mukkamala, P. Ochs, T. Pock, and S. Sabach. Convex-concave backtracking for inertial Bregman proximal gradient algorithms in nonconvex optimization. SIAM J. Math. Data Sci., 2(3):658–682, 2020.
- [40] C. Navasca, L. De Lathauwer, and S. Kindermann. Swamp reducing technique for tensor decomposition. In 16th European Signal Processing Conference, pages 1–5, 2008.
- [41] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19(4):1574–1609, 2009.
- [42] Y. E. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Soviet Math. Dokl., 27(2):372–376, 1983.
- [43] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, pages 2613–2621, 2017.
- [44] P. Paatero. Construction and analysis of degenerate PARAFAC models. J. Chemom., 14(3):285–299, 2000.
- [45] A.-H. Phan, P. Tichavský, and A. Cichocki. Low complexity damped Gauss-Newton algorithms for CANDECOMP/PARAFAC. SIAM J. Matrix Anal. Appl., 34(1):126–147, 2013.
- [46] A.-H. Phan, P. Tichavský, and A. Cichocki. Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations. IEEE Trans. Signal Process., 61(19):4834–4846, 2013.
- [47] T. Pock and S. Sabach. Inertial proximal alternating linearized minimization (iPALM) for nonconvex and nonsmooth problems. SIAM J. Imaging Sci., 9(4):1756–1787, 2016.
- [48] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys., 4(5):1–17, 1964.
- [49] W. Pu, S. Ibrahim, X. Fu, and M. Hong. Stochastic mirror descent for low-rank tensor decomposition under non-Euclidean losses. IEEE Trans. Signal Process., 70:1803–1818, 2022.
- [50] J. Shetty and J. Adibi. The Enron email dataset database schema and brief statistical report. Information Sciences Institute Technical Report, University of Southern California, 4, 2004.
- [51] S. Smith, J. W. Choi, J. Li, R. Vuduc, J. Park, X. Liu, and G. Karypis. FROSTT: The formidable repository of open sparse tensors and tools, 2017.
- [52] L. Sorber, M. Van Barel, and L. De Lathauwer. Optimization-based algorithms for tensor decompositions: Canonical polyadic decomposition, decomposition in rank-(L_r, L_r, 1) terms, and a new generalization. SIAM J. Optim., 23(2):695–720, 2013.
- [53] M. Teboulle and Y. Vaisbourd. Novel proximal gradient methods for nonnegative matrix factorization with sparsity constraints. SIAM J. Imaging Sci., 13(1):381–421, 2020.
- [54] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl., 109:475–494, 2001.
- [55] M. Vandecappelle, N. Vervliet, and L. De Lathauwer. A second-order method for fitting the canonical polyadic decomposition with non-least-squares cost. IEEE Trans. Signal Process., 68:4454–4465, 2020.
- [56] M. Wang and L. Li. Learning from binary multiway data: Probabilistic tensor decomposition and its statistical optimality. J. Mach. Learn. Res., 21(1), 2020.
- [57] Q. Wang, C. Cui, and D. Han. A momentum block-randomized stochastic algorithm for low-rank tensor CP decomposition. Pac. J. Optim., 17(3):433–452, 2021.
- [58] Q. Wang and D. Han. A Bregman stochastic method for nonconvex nonsmooth problem beyond global Lipschitz gradient continuity. Optim. Methods. Softw., Online, 2023.
- [59] Q. Wang and D. Han. A generalized inertial proximal alternating linearized minimization method for nonconvex nonsmooth problems. Appl. Numer. Math., 189:66–87, 2023.
- [60] Q. Wang, Z. Liu, C. Cui, and D. Han. Inertial accelerated SGD algorithms for solving large-scale lower-rank tensor CP decomposition problems. J. Comput. Appl. Math., 423:114948, 2023.
- [61] Q. Wang, Z. Liu, C. Cui, and D. Han. A Bregman proximal stochastic gradient method with extrapolation for nonconvex nonsmooth problems. In Association for the Advancement of Artificial Intelligence (AAAI), 2024.
- [62] Y. Xu and W. Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM J. Imaging Sci., 6(3):1758–1789, 2013.
- [63] A. Yeredor and M. Haardt. Maximum likelihood estimation of a low-rank probability mass tensor from partial observations. IEEE Signal Process. Lett., 26(10):1551–1555, 2019.
- [64] S. Zhang and N. He. On the convergence rate of stochastic mirror descent for nonsmooth nonconvex optimization, 2018.