ScaleBiO: Scalable Bilevel Optimization for LLM Data Reweighting

Rui Pan1,*,   Jipeng Zhang1,*,   Xingyuan Pan2,   Renjie Pi1,   Xiaoyu Wang1,*,   Tong Zhang2
1The Hong Kong University of Science and Technology
2University of Illinois Urbana-Champaign
{rpan,jzhanggr,maxywang,rpi}@ust.hk   [email protected]
[email protected]
* Equal Contribution. Code is available at https://github.com/2003pro/ScaleBiO. Corresponding author is Xiaoyu Wang <[email protected]>.
Abstract

Bilevel optimization has shown its utility across various machine learning settings, yet most algorithms in practice require second-order information, making it challenging to scale them up. Only recently, a paradigm of first-order algorithms emerged, capable of effectively addressing bilevel optimization problems. Nevertheless, the practical efficiency of this paradigm remains unverified, particularly in the context of large language models (LLMs). This paper introduces the first scalable instantiation of this paradigm called ScaleBiO, focusing on bilevel optimization for large-scale LLM data reweighting. By combining with a recently proposed memory-efficient training technique called LISA, our novel algorithm allows the paradigm to scale to 34-billion-parameter LLMs on eight A40 GPUs, marking the first successful application of bilevel optimization under practical scenarios for large-sized LLMs. Empirically, extensive experiments on data reweighting verify the effectiveness of ScaleBiO for different-scaled models, including GPT-2, LLaMA-3-8B, GPT-NeoX-20B, and Yi-34B, where bilevel optimization succeeds in filtering irrelevant data samples and selecting informative samples. Theoretically, ScaleBiO ensures the optimality of the learned data weights, along with a convergence guarantee matching the conventional first-order bilevel optimization paradigm on smooth and strongly convex objectives.

Preprint. Work in Progress.

1 Introduction

Bilevel optimization (BO) has emerged as a prominent area of research, drawing substantial attention due to its effectiveness in numerous machine learning applications, such as hyperparameter optimization (Domke, 2012; Maclaurin et al., 2015; Franceschi et al., 2017; Lorraine et al., 2020), meta-learning (Andrychowicz et al., 2016; Franceschi et al., 2018; Rajeswaran et al., 2019) and reinforcement learning (Konda and Tsitsiklis, 1999; Hong et al., 2020). In its standard formulation, bilevel optimization involves a two-level hierarchical structure, where the solution to the outer-level problem depends on the solution to the inner-level problem,

$$
\begin{aligned}
\min_{\lambda\in\Lambda}\quad & \mathcal{L}(\lambda)=L_{1}(\lambda,w_{\ast}(\lambda)) \\
\text{s.t.}\quad & w_{\ast}(\lambda)=\arg\min_{w}L_{2}(\lambda,w).
\end{aligned}
\tag{1}
$$

For example, in data reweighting tasks, $\lambda$ denotes the weights of the different data sources, $w$ the trainable model parameters, and $w_{\ast}(\lambda)$ the optimal parameters obtained by training on the correspondingly weighted dataset, while the outer function $L_{1}$ and the inner function $L_{2}$ stand for the validation and training losses, respectively.

Despite the inherent flexibility and applicability of bilevel optimization across a wide range of problems, its use in large-scale problems has been relatively limited thus far. The primary obstacle to scaling bilevel optimization is the interdependence between the upper-level and lower-level problems. This mutual dependency introduces significant computational challenges, particularly for large-scale problem instances. A natural gradient-based approach to solving Problem (1) is to compute (or estimate) the hyper-gradient

$$
\frac{\partial\mathcal{L}(\lambda)}{\partial\lambda}=\frac{\partial L_{1}(w_{\ast}(\lambda),\lambda)}{\partial\lambda}+\frac{\partial L_{1}(w_{\ast},\lambda)}{\partial w_{\ast}}\frac{\partial w_{\ast}(\lambda)}{\partial\lambda}.
\tag{2}
$$

Under certain smoothness assumptions, the derivative $\frac{\partial w_{\ast}(\lambda)}{\partial\lambda}$ exists and can be derived by applying the implicit function theorem to the optimality condition $\frac{\partial L_{2}}{\partial w}(w_{\ast}(\lambda),\lambda)=0$, via

$$
\frac{\partial^{2}L_{2}(w_{\ast}(\lambda),\lambda)}{\partial w^{2}}\frac{\partial w_{\ast}(\lambda)}{\partial\lambda}+\frac{\partial^{2}L_{2}(w_{\ast}(\lambda),\lambda)}{\partial\lambda\,\partial w}=0.
\tag{3}
$$
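As a sketch of what a direct computation would involve, suppose the Hessian of $L_{2}$ with respect to $w$ is invertible at $w_{\ast}(\lambda)$ (an assumption made here for illustration; it holds, for instance, when $L_{2}$ is strongly convex in $w$). Solving (3) for the derivative and substituting into (2) then gives:

```latex
\frac{\partial w_{\ast}(\lambda)}{\partial\lambda}
  = -\left[\frac{\partial^{2}L_{2}(w_{\ast}(\lambda),\lambda)}{\partial w^{2}}\right]^{-1}
     \frac{\partial^{2}L_{2}(w_{\ast}(\lambda),\lambda)}{\partial\lambda\,\partial w},
\qquad
\frac{\partial\mathcal{L}(\lambda)}{\partial\lambda}
  = \frac{\partial L_{1}(w_{\ast}(\lambda),\lambda)}{\partial\lambda}
  - \frac{\partial L_{1}(w_{\ast},\lambda)}{\partial w_{\ast}}
    \left[\frac{\partial^{2}L_{2}(w_{\ast}(\lambda),\lambda)}{\partial w^{2}}\right]^{-1}
    \frac{\partial^{2}L_{2}(w_{\ast}(\lambda),\lambda)}{\partial\lambda\,\partial w}.
```

For $w\in\mathbb{R}^{d}$ and $\lambda\in\mathbb{R}^{m}$, the bracketed Hessian is a $d\times d$ matrix and the mixed derivative is $d\times m$.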

The main challenge lies in computing or approximating the derivative $\partial w_{\ast}(\lambda)/\partial\lambda$ in (2): a direct computation through (3) involves the Jacobian and Hessian of the inner function $L_{2}$ and may even require the Hessian inverse. A line of research (Domke, 2012; Pedregosa, 2016; Grazzi et al., 2020; Lorraine et al., 2020; Franceschi et al., 2017; Shaban et al., 2019; Ghadimi and Wang, 2018; Hong et al., 2020; Yang et al., 2021; Ji et al., 2021; Chen et al., 2022) has attempted to address this challenge. However, all of these works require computing Hessians, Jacobians, or their products with vectors, which can be computationally expensive and memory-intensive for large-scale problems. Recently, Kwon et al. (2023) proposed a fully first-order method for stochastic bilevel optimization that uses only a first-order gradient oracle. This approach avoids second-order computations and offers promising potential for stochastic bilevel optimization.

Despite these groundbreaking advancements in algorithms and theory, the practical performance of bilevel optimization in large-scale real-world settings has yet to be thoroughly investigated. Aiming to close this gap, this paper considers a practical scenario in which Large Language Models (LLMs) are fine-tuned on datasets from multiple sources. We identify a significant challenge in determining the optimal sampling weights for each data source. For instance, Wang et al. (2024) have demonstrated that LLMs’ task-specific performance degrades in the presence of certain training datasets. However, with proper sampling weights, including and combining various datasets should intuitively enhance the models’ overall performance. This data-task misalignment poses a primary challenge in training LLMs with multiple data sources:

How to balance each data source in the training dataset to obtain optimal performance?

Various methods have been proposed to address this challenge. However, they either rely on intuitive presets (Zhou et al., 2024; Muennighoff et al., 2022; Du et al., 2022a; Almazrouei et al., 2023) or lack theoretical guarantees (Xia et al., 2024a; Xie et al., 2024; Xia et al., 2024b), leading to suboptimal sampling weights. To this end, we apply bilevel optimization to data reweighting for LLMs, aiming to overcome the limitations of existing methods. We formulate the task as a bilevel optimization problem in which the sampling weights act as the outer learnable variables. Extensive experimental results in data denoising, multi-lingual training, and instruction-following fine-tuning demonstrate the scalability and effectiveness of our approach. The primary contributions are summarized as follows:

  • We propose the first scalable instantiation of bilevel optimization for large-scale LLM training problems, which is capable of scaling to models with 34 billion parameters.

  • We successfully bridge the gap between algorithmic advancements in bilevel optimization and their application in data reweighting, allowing the optimal data weights to be learnable for large-scale LLMs.

  • We provide both experimental and theoretical results to demonstrate the efficiency of ScaleBiO. Empirically, ScaleBiO is capable of filtering irrelevant data samples and selecting informative samples. Theoretically, ScaleBiO’s convergence guarantee matches the results of Kwon et al. (2023) on smooth and strongly convex objectives.

2 Related Work

| Method | Description | Task | Model | Size |
|---|---|---|---|---|
| RMD (Bengio, 2000) | 2nd-order, deterministic | hyperparameter optimization | Linear | <1M |
| CG (Grazzi et al., 2020) | 2nd-order, deterministic | equilibrium models | CNN | <1M |
| stocBiO (Ji et al., 2021) | 2nd-order, stochastic | meta learning | CNN | <1M |
| FdeHBO (Yang et al., 2023) | 1st-order, stochastic | hyper-representation | LeNet | <1M |
| BOME (Liu et al., 2022) | 1st-order, stochastic | data hyper-cleaning | Linear | <1M |
| SOBA (Dagréou et al., 2022) | 2nd-order, stochastic | data reweighting | Transformers | 7M |
| PZOBO (Sow et al., 2022) | 1st-order, stochastic | few-shot meta-learning | ResNet | 12M |
| BFTSS (Somayajula et al., 2023) | 1st-order, stochastic | task-dependent structure learning | BERT | 336M |
| ScaleBiO (Ours) | 1st-order, stochastic | data reweighting | Yi | 34B |

Table 1: Comparison of the largest model size used in each method's original paper, where 'M' stands for million and 'B' stands for billion. The Description column summarizes each method, and the Task column reports the task on which it was tested.

2.1 Bilevel Optimization

Traditional bilevel optimization algorithms mainly fall into two classes: 1) approximate implicit differentiation (AID) methods (Domke, 2012; Pedregosa, 2016; Grazzi et al., 2020; Lorraine et al., 2020), and 2) iterative differentiation (ITD) methods (Domke, 2012; Maclaurin et al., 2015; Franceschi et al., 2017; Shaban et al., 2019; Grazzi et al., 2020; Zhou et al., 2022a; Gao et al., 2023; Zhou et al., 2022b; Lin et al., 2023). Both approaches follow a two-loop scheme and incur a huge computational cost on large-scale problems. To reduce this cost, stochastic bilevel optimization methods have been developed (Ghadimi and Wang, 2018; Hong et al., 2020; Ji et al., 2021; Chen et al., 2022; Khanduri et al., 2021). These significantly improve the efficiency of traditional methods but remain impractical for large-scale settings because they require second-order information, such as Jacobian- and Hessian-vector products, to estimate the hyper-gradient. Sow et al. (2022) and Yang et al. (2023) approximate the Jacobian matrix $\nabla y^{\ast}(x)$ (that is, $\partial w_{\ast}(\lambda)/\partial\lambda$ in (2)) by finite differences, but the finite-difference estimate can be sensitive to the choice of the smoothing constant and may suffer from numerical issues in practice (Jorge and Stephen, 2006).

Recently, a new paradigm of fully first-order penalty-based methods has been introduced, which reformulates the inner-level problem as an optimality constraint (Liu et al., 2022; Kwon et al., 2023; Chen et al., 2023). Liu et al. (2022) were the first to obtain a hypergradient involving only first-order information, although their method applies only to deterministic functions. Kwon et al. (2023) introduced a first-order gradient-based approach that avoids estimating the Hessian or Jacobian and is easily adapted and extended to stochastic bilevel optimization settings. Chen et al. (2023) established a near-optimal sample complexity, improving the theoretical result of Kwon et al. (2023) for deterministic bilevel optimization. These results verify the effectiveness of this paradigm in theory, yet its practical application in large-scale LLM settings remains unexplored.

On the practical side, bilevel optimization has been explored in various NLP tasks. Somayajula et al. (2023) use bilevel optimization to learn a task-dependent similarity structure. Although their approach is effective on BERT models (Devlin et al., 2018), the finite-difference approximation suffers from high error and therefore does not scale to LLMs with billions of parameters. Grangier et al. (2024) adopt SOBA (Dagréou et al., 2022) to modify the training data distribution for language modeling under domain shift. However, the algorithm still requires gradient approximation and Hessian-vector products, posing challenges to scalability and engineering for large-scale problems. We summarize typical bilevel algorithms and their model sizes in Table 1; to the best of our knowledge, no prior approach listed there has been successfully applied to LLMs with more than 1B parameters.

2.2 Data Reweighting

The proportion of each training data source significantly affects the performance of large language models (Du et al., 2022b; Xie et al., 2023). To this end, various methods have been proposed to reweight data sources and obtain an optimal training data mixture. For instance, Mindermann et al. (2022) use the loss gap between a trained model and a base model to identify learnable data samples, assigning them higher weights on the fly. Thakkar et al. (2023) propose using a self-influence score to guide mini-batch reweighting during pretraining. Xia et al. (2024b) leverage reference losses on validation sets and adjust the weights dynamically, adding minimal overhead to standard training. DoReMi (Xie et al., 2024) applies distributionally robust optimization (DRO) to tune the domain weights without knowledge of downstream tasks, showing effectiveness in improving perplexity and downstream accuracy. Nevertheless, none of the aforementioned methods ensures the optimality of the learned data weights, and none has been evaluated on models with more than 30B parameters.

3 Methods

In this section, we elaborate on our ScaleBiO method for finding the optimal sampling weights in training large-scale LLMs. We first formulate this problem as a bilevel optimization problem in Section 3.1 and then develop an efficient training method for our formulation in Section 3.2.

3.1 Problem Formulation

Suppose that $m$ data sources are available for training, e.g., Alpaca (Taori et al., 2023), FLAN (Wei et al., 2021), and ShareGPT (Chiang et al., 2023), where each source $S_{i}$ is a set of $n_{i}$ examples $S_{i}=\{a_{1}^{i},a_{2}^{i},\dots,a_{n_{i}}^{i}\}$. We balance multiple training data sources by assigning each data source $S_{i}$ a sampling weight $p_{i}$ that satisfies $\sum_{i=1}^{m}p_{i}=1$.

Accordingly, each data source $S_{i}$ contributes $p_{i}|\mathcal{D}_{\text{trn}}|$ samples to the training dataset $\mathcal{D}_{\text{trn}}$. We then optimize the sampling weights to minimize the model’s loss on the validation set $\mathcal{D}_{\text{val}}$. This leads to the following bilevel optimization problem:

$$
\begin{aligned}
\min_{p\in\Lambda}\quad & L_{\text{val}}(w^{*}(p)) \\
\mathrm{s.t.}\quad & w^{*}(p)=\arg\min_{w}\sum_{i=1}^{m}\frac{p_{i}}{n_{i}}\sum_{j=1}^{n_{i}}L_{\text{trn}}(w,a_{j}^{i})
\end{aligned}
$$

where $w$ denotes the parameters of the LLM, $\{p_{i}\}$ is the probability distribution over the $m$ data sources, and $L_{\text{val}}$ and $L_{\text{trn}}$ denote the language modeling loss on $\mathcal{D}_{\text{val}}$ and $\mathcal{D}_{\text{trn}}$, respectively. In addition, we introduce a trainable variable $\lambda\in\mathbb{R}^{m}$ to represent $p_{i}=e^{\lambda_{i}}/\sum_{j=1}^{m}e^{\lambda_{j}}$, so that $\sum_{i=1}^{m}p_{i}=1$ is satisfied.
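The softmax parameterization and the resulting sampling scheme can be sketched in a few lines of NumPy. This is a minimal illustration, not the released implementation: the function names `source_weights` and `sample_batch` and the toy sources are hypothetical. Drawing a source with probability $p_{i}$ and then an example uniformly from $S_{i}$ yields a batch whose expected loss equals the weighted inner objective above.

```python
import numpy as np

def source_weights(lam):
    """Softmax over the logits lam: returns sampling weights p with sum(p) == 1."""
    z = lam - np.max(lam)          # shift for numerical stability; p is unchanged
    e = np.exp(z)
    return e / e.sum()

def sample_batch(sources, lam, batch_size, rng):
    """Pick source i with probability p_i, then one example uniformly from S_i."""
    p = source_weights(lam)
    picks = rng.choice(len(sources), size=batch_size, p=p)
    return [sources[i][rng.integers(len(sources[i]))] for i in picks]

# Hypothetical toy sources standing in for real datasets.
sources = [["a1", "a2"], ["b1", "b2", "b3"], ["c1"]]
lam = np.array([0.0, 1.0, -1.0])   # unconstrained trainable logits
p = source_weights(lam)
rng = np.random.default_rng(0)
batch = sample_batch(sources, lam, batch_size=8, rng=rng)
print(p, batch)
```

Because $\lambda$ is unconstrained, plain gradient steps on $\lambda$ never leave the probability simplex in terms of $p$.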

Algorithm 1 ScaleBiO for high-dimensional and large-scale minimax problems
1:  Input: step sizes $\{\eta_{u},\eta_{w},\eta_{\lambda}\}$, penalty $\alpha$, and initialization $\lambda_{0}$, $u_{0}$, $w_{0}$
2:  for $k=0:K-1$ do
3:     Uniformly and independently select two block coordinates $j_{k}$ and $r_{k}$ from $\{1,2,\cdots,J\}$
4:     Generate i.i.d. samples $\{D_{\text{tr}}^{k},D_{\text{val}}^{k}\}$ from the training dataset $D_{\text{tr}}$ and the validation dataset $D_{\text{val}}$
5:     $u_{k+1}^{j_{k}}=u_{k}^{j_{k}}-\alpha\eta_{u}\nabla_{j_{k}}L_{2}(\lambda_{k},u_{k};D_{\text{tr}}^{k})$
6:     $u_{k+1}=u_{k}+U_{j_{k}}(u_{k+1}^{j_{k}}-u_{k}^{j_{k}})$
7:     $w_{k+1}^{r_{k}}=w_{k}^{r_{k}}-\eta_{w}\left(\nabla_{r_{k}}L_{1}(\lambda_{k},w_{k};D_{\text{val}}^{k})+\alpha\nabla_{r_{k}}L_{2}(\lambda_{k},w_{k};D_{\text{tr}}^{k})\right)$
8:     $w_{k+1}=w_{k}+W_{r_{k}}(w_{k+1}^{r_{k}}-w_{k}^{r_{k}})$
9:     $\lambda_{k+1}=\lambda_{k}-\eta_{\lambda}\left(\nabla_{\lambda}L_{1}(\lambda_{k},w_{k};D_{\text{val}}^{k})+\alpha\bigl(\nabla_{\lambda}L_{2}(\lambda_{k},w_{k};D_{\text{tr}}^{k})-\nabla_{\lambda}L_{2}(\lambda_{k},u_{k};D_{\text{tr}}^{k})\bigr)\right)$
10:  end for
11:  Output: $(\lambda_{K},w_{K},u_{K})$
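The update rules of Algorithm 1 can be traced on a toy problem. The sketch below is an illustration under stated assumptions, not the authors' implementation: the model is a 4-dimensional least-squares regressor rather than an LLM, there are two synthetic data sources (one matching the validation data, one with flipped labels), and all hyperparameter values and helper names (`make_set`, `minibatch`, `inner`) are hypothetical choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, J = 4, 200, 2                        # parameter dim, examples per set, number of blocks
blocks = np.split(np.arange(d), J)         # J equal-sized coordinate blocks
w_true = np.ones(d)

def make_set(w_gen):
    X = rng.normal(size=(n, d))
    return X, X @ w_gen

sources = [make_set(w_true), make_set(-w_true)]   # source 0 is clean, source 1 is misleading
X_val, y_val = make_set(w_true)                   # validation data agrees with source 0

def softmax(lam):
    e = np.exp(lam - lam.max())
    return e / e.sum()

def minibatch(X, y, size):
    idx = rng.integers(len(y), size=size)
    return X[idx], y[idx]

def mse_and_grad(w, X, y):
    r = X @ w - y
    return 0.5 * np.mean(r ** 2), X.T @ r / len(y)

def inner(lam, w, batches):
    """Gradients of the weighted training loss L2 with respect to w and to lam."""
    p = softmax(lam)
    out = [mse_and_grad(w, X, y) for X, y in batches]
    losses = np.array([loss for loss, _ in out])
    grad_w = sum(p_i * g for p_i, (_, g) in zip(p, out))
    grad_lam = p * (losses - p @ losses)          # chain rule through the softmax
    return grad_w, grad_lam

alpha, eta_u, eta_w, eta_lam, K, bs = 4.0, 0.05, 0.05, 0.01, 500, 64
lam, u, w = np.zeros(2), np.zeros(d), np.zeros(d)

for k in range(K):
    j, r = rng.integers(J), rng.integers(J)                 # line 3: two random blocks
    tr = [minibatch(X, y, bs) for X, y in sources]          # line 4: training samples
    Xv, yv = minibatch(X_val, y_val, bs)                    # line 4: validation samples
    # All gradients are evaluated at the current iterate (lam_k, w_k, u_k).
    gu_w, gu_lam = inner(lam, u, tr)
    gw_w, gw_lam = inner(lam, w, tr)
    _, g1_w = mse_and_grad(w, Xv, yv)
    u[blocks[j]] -= alpha * eta_u * gu_w[blocks[j]]                        # lines 5-6
    w[blocks[r]] -= eta_w * (g1_w[blocks[r]] + alpha * gw_w[blocks[r]])    # lines 7-8
    lam -= eta_lam * alpha * (gw_lam - gu_lam)   # line 9; L1 has no direct lam term here

p = softmax(lam)
print("learned source weights:", p)
```

In this toy setting the weight of the clean source is expected to grow toward 1, since shifting mass to it lowers the validation loss of the inner minimizer. Each iteration uses only first-order gradients and touches a single block of `u` and a single block of `w`.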

3.2 Fully First-order Hypergradient Method

Note that the inner-level problem of Problem (1) can be viewed as a constraint of the outer-level problem, and thus it is natural to reformulate Problem (1) as

$$
\begin{aligned}
\min_{\lambda\in\Lambda,w}\quad & L_{1}(\lambda,w) \\
\text{s.t.}\quad & L_{2}(\lambda,w)-\min_{u}L_{2}(\lambda,u)=0.
\end{aligned}
\tag{4}
$$

Here an auxiliary variable $u$ is introduced to transform the inner problem $w_{\ast}(\lambda)=\arg\min_{w}L_{2}(\lambda,w)$ into the constraint $L_{2}(\lambda,w)-\min_{u}L_{2}(\lambda,u)=0$, where $u$ serves as a proxy for $w_{\ast}(\lambda)$. By penalizing the constraint of (4) with multiplier $\alpha>0$ and adding it to the outer-level objective, we obtain the minimax formulation (Kwon et al., 2023; Lu and Mei, 2023):

$$
\min_{\lambda\in\Lambda,w}\max_{u}\,\,\mathcal{L}^{\alpha}(\lambda,w,u)
\tag{5}
$$

where

$$
\mathcal{L}^{\alpha}(\lambda,w,u)=L_{1}(\lambda,w)+\alpha\left(L_{2}(\lambda,w)-L_{2}(\lambda,u)\right).
$$

In this way, approximations of both the inner constraint and the outer optimum are obtained within the same optimization process, and $\alpha$ controls the priority between them. When $\alpha\to\infty$, the bilevel problem (1) is equivalent to the minimax problem (5) under certain smoothness assumptions.
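A one-dimensional quadratic example, chosen here purely for illustration, makes the role of $\alpha$ concrete. Take $L_{1}(\lambda,w)=\tfrac{1}{2}(w-1)^{2}$ and $L_{2}(\lambda,w)=\tfrac{1}{2}(w-\lambda)^{2}$ with scalar $\lambda$ and $w$. Then the bilevel and penalized quantities can be computed in closed form:

```latex
\begin{aligned}
w_{\ast}(\lambda) &= \lambda, \qquad \mathcal{L}(\lambda)=\tfrac{1}{2}(\lambda-1)^{2}, \\
\max_{u}\,\mathcal{L}^{\alpha}(\lambda,w,u) &= \tfrac{1}{2}(w-1)^{2}+\tfrac{\alpha}{2}(w-\lambda)^{2}
  \quad (\text{attained at } u=\lambda), \\
\arg\min_{w}\max_{u}\,\mathcal{L}^{\alpha}(\lambda,w,u) &= \frac{1+\alpha\lambda}{1+\alpha}, \\
\min_{w}\max_{u}\,\mathcal{L}^{\alpha}(\lambda,w,u) &= \frac{\alpha}{1+\alpha}\cdot\tfrac{1}{2}(\lambda-1)^{2}.
\end{aligned}
```

In this example the penalized minimizer differs from $w_{\ast}(\lambda)=\lambda$ by $(1-\lambda)/(1+\alpha)$, and the penalized objective equals $\alpha/(1+\alpha)$ times $\mathcal{L}(\lambda)$, so both approach their bilevel counterparts at rate $O(1/\alpha)$ and share the minimizer $\lambda=1$ for every $\alpha>0$.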

To precisely describe the optimality of the minimax problem with the stationarity of the bilevel problem, we overload the notation in (5) and define

$$
\begin{aligned}
\Phi^{\alpha}(\lambda,w) &:=\max_{u}\mathcal{L}^{\alpha}(\lambda,w,u); && (6) \\
u_{\ast}(\lambda) &:=\arg\max_{u}\mathcal{L}^{\alpha}(\lambda,w,u); && (7) \\
\Gamma^{\alpha}(\lambda) &:=\min_{w}\Phi^{\alpha}(\lambda,w); && (8) \\
w_{\ast}^{\alpha}(\lambda) &:=\arg\min_{w}\Phi^{\alpha}(\lambda,w). && (9)
\end{aligned}
$$

We make the following assumptions for the proposed minimax problem throughout this paper.

Assumption 1.

Suppose that

  • (1) $L_{1}(\lambda,w)$ is twice continuously differentiable, $\ell_{10}$-Lipschitz continuous in $w$, and $\ell_{11}$-gradient Lipschitz.

  • (2) $L_{2}(\lambda,w)$ is $\ell_{21}$-gradient Lipschitz, $\ell_{22}$-Hessian Lipschitz, and $\mu_{2}$-strongly convex in $w$.

Lemma 1.

Under Assumption 1, if $\alpha>2\ell_{11}/\mu_{2}$, we have

\begin{align}
|\mathcal{L}(\lambda)-\Gamma^{\alpha}(\lambda)| &\leq \mathcal{O}\left(\frac{1}{\alpha}\right) \tag{10}\\
\left\|\nabla\mathcal{L}(\lambda)-\nabla\Gamma^{\alpha}(\lambda)\right\| &\leq \mathcal{O}\left(\frac{1}{\alpha}\right) \tag{11}\\
\left\|\nabla^{2}\Gamma^{\alpha}(\lambda)\right\| &\leq \mathcal{O}(\kappa^{3}) \tag{12}
\end{align}

where the condition number $\kappa$ is defined by $\max\left\{\ell_{10},\ell_{11},\ell_{21},\ell_{22}\right\}/\mu_{2}$.

Under Assumption 1, as indicated by Lemma 1, if $\alpha$ goes to infinity, the stationary point of the minimax problem (5) is also a stationary point of the bilevel problem (1).
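As an illustration of the rates in Lemma 1, consider a scalar toy instance. It is chosen here only for intuition and is not taken from the paper; the quadratic $L_1$ is Lipschitz in $w$ only on a bounded domain, so the example should be read with $\lambda$ and $w$ restricted to a bounded set. All three quantities can be computed in closed form:

```latex
% Toy instance (an assumption for illustration):
%   L_1(\lambda, w) = (w - 1)^2 / 2,   L_2(\lambda, w) = (w - \lambda)^2 / 2,
% so that \mu_2 = 1, w_*(\lambda) = \lambda and \mathcal{L}(\lambda) = (\lambda - 1)^2 / 2.
\begin{align*}
\Phi^{\alpha}(\lambda,w) &= \tfrac{1}{2}(w-1)^{2} + \tfrac{\alpha}{2}(w-\lambda)^{2},
  \qquad \text{since } \min_{u} L_{2}(\lambda,u) = 0 \text{ at } u=\lambda, \\
w_{\ast}^{\alpha}(\lambda) &= \frac{1+\alpha\lambda}{1+\alpha},
  \qquad
  \Gamma^{\alpha}(\lambda) = \frac{\alpha}{1+\alpha}\cdot\tfrac{1}{2}(\lambda-1)^{2}, \\
\mathcal{L}(\lambda)-\Gamma^{\alpha}(\lambda) &= \frac{(\lambda-1)^{2}}{2(1+\alpha)},
  \qquad
  \nabla\mathcal{L}(\lambda)-\nabla\Gamma^{\alpha}(\lambda) = \frac{\lambda-1}{1+\alpha},
  \qquad
  \nabla^{2}\Gamma^{\alpha}(\lambda) = \frac{\alpha}{1+\alpha} \le 1 .
\end{align*}
```

Both gaps decay as $1/\alpha$, matching (10) and (11), while the curvature of $\Gamma^{\alpha}$ stays bounded independently of $\alpha$, in line with (12).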

3.3 Proposed Algorithm

By adopting the approach in Section 3.2, the outer-inner dependency can be naturally eliminated from the original bilevel problem, which is reformulated as an equivalent min-max problem. We propose a single-loop framework, referred to as ScaleBiO, to solve the large-scale min-max problem, outlined in Algorithm 1. In many large-scale machine-learning scenarios, the dimensionality of the training variables can be exceptionally high. For bilevel problems, we employ the randomized block coordinate method (Nesterov, 2012; Pan et al., 2024) to update the inner variables $u$ and $w$, which improves memory efficiency. The outer variable $\lambda$ is updated with stochastic gradient descent. In addition, popular optimizers such as SGD with momentum and Adam (Kingma and Ba, 2015) can be seamlessly integrated into Algorithm 1 to update each of the variables $u$, $w$, and $\lambda$. The multiplier $\alpha$ is predefined as a sufficiently large constant so that the min-max solution is a good approximation of the solution to the original bilevel problem.
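The single-loop structure described above can be sketched on a toy reweighting problem. This is a hedged illustration under several assumptions that are not taken from Algorithm 1: the source weights are parameterized as a softmax of $\lambda$, each source has a quadratic loss around a hypothetical "center", exact gradients replace stochastic ones, and $u$ and $w$ share the same randomly sampled block in each iteration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sq_loss(w, center):
    """0.5 * ||w - center||^2, the toy per-source loss (an assumption)."""
    return 0.5 * sum((wi - ci) ** 2 for wi, ci in zip(w, center))


def toy_single_loop(steps=2000, alpha=100.0, eta_u=0.5, eta_lam=0.1, seed=0):
    rng = random.Random(seed)
    centers = [[1.0] * 4, [-1.0] * 4]  # source 0 is "clean", source 1 is "noisy"
    val_center = [1.0] * 4             # validation data agrees with source 0
    blocks = [(0, 1), (2, 3)]          # J = 2 coordinate blocks of w and u
    eta_w = 0.5 / (1.0 + alpha)        # keeps the w-step stable for large alpha
    lam, w, u = [0.0, 0.0], [0.0] * 4, [0.0] * 4

    for _ in range(steps):
        p = softmax(lam)
        # Gradient of L2(lam, v) = sum_i p_i * sq_loss(v, center_i) is v - mean.
        mean = [sum(p[i] * centers[i][j] for i in range(2)) for j in range(4)]
        # Randomized block coordinate step: only one block of u and w is touched.
        for j in blocks[rng.randrange(len(blocks))]:
            u[j] -= eta_u * (u[j] - mean[j])  # ascent on L^alpha in u
            w[j] -= eta_w * ((w[j] - val_center[j]) + alpha * (w[j] - mean[j]))
        # Gradient step on lambda: alpha * (grad_lam L2(lam, w) - grad_lam L2(lam, u)).
        loss_w = [sq_loss(w, c) for c in centers]
        loss_u = [sq_loss(u, c) for c in centers]
        avg_w = sum(pi * li for pi, li in zip(p, loss_w))
        avg_u = sum(pi * li for pi, li in zip(p, loss_u))
        grad = [alpha * p[k] * ((loss_w[k] - avg_w) - (loss_u[k] - avg_u))
                for k in range(2)]
        lam = [lam[k] - eta_lam * grad[k] for k in range(2)]
    return softmax(lam)


weights = toy_single_loop()
print("learned source weights (clean, noisy):", weights)
```

In this toy the weight of the source that agrees with the validation data grows over the run, which mirrors the qualitative behavior the paper reports for data denoising. The step sizes here are hand-picked for the toy and do not follow the schedule of Theorem 1.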

3.4 Theoretical Results

In this part, we provide a convergence analysis of Algorithm 1. Before presenting the theoretical results, we introduce the notation for partitions. Let $\left\{x^{1},x^{2},\cdots,x^{J}\right\}$ with $x^{j}\in\mathbb{R}^{d_{j}\times 1}$ be $J$ non-overlapping blocks of $x$. Let the matrix $U_{j}\in\mathbb{R}^{d\times d_{j}}$ be the $d_{j}$ columns of a $d\times d$ permutation matrix $U$ corresponding to the $j$-th block of coordinates in $x$. For any partition of $x$ and $U$,

$$x=\sum_{j=1}^{J}U_{j}x^{j},\quad x^{j}=U_{j}^{T}x. \tag{13}$$
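The partition notation can be checked on a small vector. The sketch below assumes the simplest case in which $U$ is the identity matrix, so each $U_j$ collects the standard basis vectors of block $j$; the helper names are illustrative.

```python
def selector(d, indices):
    """Return U_j: the d x len(indices) matrix whose columns are e_i for i in indices."""
    return [[1.0 if row == i else 0.0 for i in indices] for row in range(d)]


def matvec(matrix, vec):
    """Compute matrix @ vec for a matrix stored as a list of rows."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def transpose(matrix):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*matrix)]


x = [3.0, -1.0, 4.0, 1.0, 5.0]
partition = [[0, 1], [2, 3, 4]]                    # J = 2 non-overlapping blocks
U = [selector(len(x), idx) for idx in partition]

blocks = [matvec(transpose(U_j), x) for U_j in U]  # block j is U_j^T x
reconstructed = [0.0] * len(x)
for U_j, x_j in zip(U, blocks):                    # x is the sum over j of U_j x^j
    for row, value in enumerate(matvec(U_j, x_j)):
        reconstructed[row] += value

print(blocks, reconstructed)
```

Updating a single block $j$ then amounts to changing only the coordinates selected by $U_j$, which is what keeps the per-step memory footprint of a block coordinate method small.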

We present the essential lemmas in Appendix C to show the theoretical properties of the minimax objective $\mathcal{L}^{\alpha}$ in (5), as well as its optimizers $u_{\ast}$ and $w_{\ast}^{\alpha}$. Lemma 1 shows that $\Gamma^{\alpha}(\lambda)$ is smooth with parameter $\ell_{\Gamma}=\mathcal{O}(\kappa^{3})$, which is independent of the multiplier $\alpha$.

Theorem 1.

Suppose that Assumption 1 holds and the parameter $\alpha$ and step-sizes $\eta_{u},\eta_{w},\eta_{\lambda}$ are properly chosen such that

$$\alpha=K^{1/7},\quad \eta_{u}=\eta_{w}=\frac{\eta_{0}}{K^{4/7}},\quad \eta_{\lambda}=\frac{\eta_{0}^{\lambda}}{K^{5/7}}.$$

Consider Algorithm 1. If $\alpha\geq\ell_{11}/\mu_{2}$, $\eta_{0}^{\lambda}\leq 1/(8\ell_{\Gamma})$, $\eta_{0}\leq 8J/\mu_{2}$, and $\eta_{0}/\eta_{0}^{\lambda}\geq 6\sqrt{2}\kappa^{2}J$, then

$$\mathbb{E}\left[\left\|\nabla\mathcal{L}(\tilde{\lambda})\right\|^{2}\right]\leq\mathcal{O}\left(\frac{1}{K^{2/7}}\right) \tag{14}$$

where $\tilde{\lambda}$ is uniformly chosen from $\left\{\lambda_{k}\right\}_{k=1}^{K}$.
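The parameter schedule of Theorem 1 is easy to tabulate for a given horizon $K$. The sketch below is only a calculator for the stated formulas; the base step sizes passed in are placeholders and are not checked against the theorem's conditions on $\eta_0$ and $\eta_0^{\lambda}$.

```python
def theorem1_schedule(K, eta0, eta0_lam):
    """Return (alpha, eta_u, eta_w, eta_lambda) for a horizon of K iterations."""
    alpha = K ** (1.0 / 7.0)
    eta_inner = eta0 / K ** (4.0 / 7.0)    # shared by u and w
    eta_lam = eta0_lam / K ** (5.0 / 7.0)  # lambda moves on a slower timescale
    return alpha, eta_inner, eta_inner, eta_lam


def rate_bound(K):
    """K^(-2/7), the right-hand side of (14) up to the hidden constant."""
    return K ** (-2.0 / 7.0)


# K = 2**7 makes every power of K a power of two (placeholder base step sizes).
alpha, eta_u, eta_w, eta_lam = theorem1_schedule(K=128, eta0=1.0, eta0_lam=1.0)
print(alpha, eta_u, eta_w, eta_lam, rate_bound(128))
```

The ratio $\eta_{\lambda}/\eta_{w}$ shrinks like $K^{-1/7}$, so the outer variable is updated more conservatively than the inner variables as the horizon grows.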

With batch size $B=\mathcal{O}(1)$, Algorithm 1 finds an $\epsilon$-stationary point with complexity $\mathcal{O}(\epsilon^{-7})$, which matches that of Kwon et al. (2023). The proof of Theorem 1 is provided in Appendix D.
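The $\mathcal{O}(\epsilon^{-7})$ figure follows from (14) by a short calculation. The derivation below assumes that $\epsilon$-stationarity means $\mathbb{E}\big[\|\nabla\mathcal{L}(\tilde{\lambda})\|^{2}\big]\le\epsilon^{2}$ and writes $C>0$ for the constant hidden in the $\mathcal{O}(\cdot)$ of (14); both are conventions adopted here for illustration.

```latex
% Assumptions: epsilon-stationarity is E||grad L||^2 <= epsilon^2,
% and C > 0 is the constant hidden in the O(.) of (14).
\begin{align*}
\frac{C}{K^{2/7}} \le \epsilon^{2}
\quad\Longleftrightarrow\quad
K \ge \left(\frac{C}{\epsilon^{2}}\right)^{7/2} = C^{7/2}\,\epsilon^{-7},
\qquad\text{and then}\qquad
\alpha = K^{1/7} = \mathcal{O}\!\left(\epsilon^{-1}\right).
\end{align*}
```

Since each iteration uses $\mathcal{O}(1)$ samples, the number of iterations $K$ and the sample complexity are of the same order.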

4 Experiments

Our experiments are two-fold: we conduct (1) Small Scale Experiments in Section 4.1 to build an intuitive understanding of our method ScaleBiO and (2) Real-World Application Experiments in Section 4.2 to validate its effectiveness and scalability. We start by detailing our main findings on the basic capabilities of our approach. Then we scale up to real-world scenarios to address the challenge of finding the optimal sampling weights when fine-tuning an LLM with multiple data sources.

Figure 1: Data denoising with GPT-2: weights for noisy data and clean data.

4.1 Small Scale Experiments

To verify the effectiveness of ScaleBiO on small-scale LLMs, we conduct experiments with GPT-2 (124M) on three tasks with synthetic datasets: data denoising, multilingual training, and instruction-following fine-tuning. Full details are available in Appendix B.1.

4.1.1 Data Denoising

This experiment aims to verify ScaleBiO’s effectiveness on data-denoising tasks. Our validation dataset, denoted as $\mathcal{D}_{\text{val}}$, comprises 1000 clean samples randomly selected from the Alpaca dataset (Taori et al., 2023). The training dataset, $\mathcal{D}_{\text{trn}}$, is derived from two distinct sources: the first includes 1000 clean samples also from Alpaca, while the second incorporates 9000 samples from Alpaca that have been artificially corrupted with synthetic noise by replacing their outputs with ".".

Figure 1 demonstrates that our approach has a robust capability to mitigate the influence of harmful data sources via automatic data denoising, where ScaleBiO assigns minimal weight to noisy data sources, effectively filtering the irrelevant samples.
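The construction of the two training sources can be sketched as follows. The record fields (`instruction`, `input`, `output`) follow the usual Alpaca layout, the pool below is a synthetic stand-in rather than the real dataset, and drawing the clean and noisy samples without overlap is an assumption, since the text does not say whether they are disjoint.

```python
import random


def corrupt(sample):
    """Return a copy of the sample whose output is replaced by the noise token "."."""
    noisy = dict(sample)
    noisy["output"] = "."
    return noisy


def build_denoising_sources(pool, n_clean=1000, n_noisy=9000, seed=0):
    """Split a random draw from the pool into a clean source and a corrupted source."""
    rng = random.Random(seed)
    picked = rng.sample(pool, n_clean + n_noisy)  # sampled without replacement
    clean = picked[:n_clean]
    noisy = [corrupt(s) for s in picked[n_clean:]]
    return {"clean": clean, "noisy": noisy}


# Synthetic stand-in for an Alpaca-style pool (hypothetical records).
pool = [
    {"instruction": f"instruction {i}", "input": "", "output": f"answer {i}"}
    for i in range(20000)
]
sources = build_denoising_sources(pool)
print({name: len(items) for name, items in sources.items()})
```

Each source then receives one learnable weight, and a successful run should drive the weight of the `noisy` source toward zero, as reported for Figure 1.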

4.1.2 Multilingual Training

It is also intriguing to check if ScaleBiO can recover optimal sampling weights for more general distributions. To this end, we design multilingual training experiments, where our validation data $\mathcal{D}_{\text{val}}$ comprises 600 random samples from Alpaca-GPT4-ZH (Peng et al., 2023) and 400 random samples from Alpaca-GPT4-EN (Peng et al., 2023). Hence, the underlying optimal weight ratio of Chinese to English is 6:4. In contrast, the training set $\mathcal{D}_{\text{trn}}$ has a 1:1 mix ratio, consisting of 40,000 random examples each from Alpaca-GPT4-EN and Alpaca-GPT4-ZH.

As shown in Figure 2, ScaleBiO nearly replicates the optimal 6:4 ratio after reweighting the training data. This provides further evidence that ScaleBiO can adapt training data weights to downstream validation datasets.

Figure 2: Multilingual reweighting with GPT-2: weights of Chinese and English. Training set: 1:1; Validation set: 6:4.

4.1.3 Instruction Following

In instruction-following fine-tuning tasks, there is a fundamental tradeoff between diversity and quality. To verify whether ScaleBiO can deduce the implicit weights of low- and high-quality datasets, we conduct experiments on instruction-following tasks with GPT-2, where Alpaca and Alpaca-GPT4 (Peng et al., 2023) are employed. Here Alpaca-GPT4 shares the same instructions and inputs as Alpaca; its higher quality comes from its outputs, which are generated by the more sophisticated GPT-4 model (Achiam et al., 2023). Our validation data for bilevel optimization, $\mathcal{D}_{\text{val}}$, consists of 1000 random samples from Alpaca-GPT4, while our training data $\mathcal{D}_{\text{trn}}$ consists of two separate parts: 1000 random samples from Alpaca-GPT4 and 9000 random samples from Alpaca.

As shown in Figure 3, although Alpaca-GPT4 accounts for only a small proportion of the training data (10%), it is highlighted by our method, revealing that our method effectively up-weights the high-quality data source, leading to improved model outcomes.

Figure 3: Instruction Following with GPT-2: weights for Alpaca-GPT4 and Alpaca.

4.2 Real-World Application Experiments

In this section, we take a step forward to land bilevel optimization in real-world data reweighting applications, where we aim to balance multiple training data sources for boosting LLMs’ overall performance in downstream tasks.

4.2.1 Large-Scale Data Reweighting

| LLaMA-3-8B source | weight | LLaMA-3-13B source | weight | GPT-NeoX-20B source | weight | Yi-34B source | weight |
|---|---|---|---|---|---|---|---|
| WildChat | 0.711 | WildChat | 0.711 | Airoboros | 0.986 | ShareGPT4 | 0.627 |
| Airoboros | 0.154 | ShareGPT4 | 0.137 | ShareGPT4 | 0.005 | Airoboros | 0.111 |
| ChatAlpaca | 0.119 | ChatAlpaca | 0.021 | ChatAlpaca | 0.003 | WildChat | 0.105 |
| Total | 0.984 | Total | 0.869 | Total | 0.994 | Total | 0.843 |

Table 2: Data sources with top-3 weights for LLaMA-3-8B, LLaMA-3-13B, GPT-NeoX-20B and Yi-34B.
Figure 4: Learned weights in large-scale data reweighting experiments for Yi-34B. We focus on the top 5 data sources and denote the remaining as Others.

To demonstrate ScaleBiO’s effectiveness in large-scale settings, we employ practical choices of models and datasets under the data reweighting scenario. Specifically, we collect popular instruction-following and chat-completion data to construct a diverse training dataset of 18 distinct sources. The training set is formed by 9 high-quality datasets that are mostly generated by GPT-4 and 9 multilingual datasets serving as less relevant data sources. For our reference dataset $\mathcal{D}_{\text{val}}$, we prompt GPT-4 to generate 1200 two-turn instructions and responses that correspond to the 8 categories in MT-Bench. Our detailed data description and preprocessing can be found in Appendix B.2.

We test our approach on the LLaMA-3 model series (https://ai.meta.com/blog/meta-llama-3/) with parameter sizes of 8B and 13B (https://huggingface.co/Replete-AI/Llama-3-13B). We also conduct experiments on GPT-NeoX-20B (Black et al., 2022) and Yi-34B (01.AI et al., 2024). To evaluate our approach, we first sample 3,000 examples from each source for reweighting. Then we sample 10,000 examples according to the weights at the end of bilevel optimization to train our model (denoted as Ours). For a fair baseline comparison, we sample 10,000 examples uniformly and add the reference data $\mathcal{D}_{\text{val}}$ to them to train the baseline model (denoted as Baseline).
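Turning learned weights into a fixed-size training set requires integer per-source counts. The sketch below uses largest-remainder rounding, which is one reasonable choice and not necessarily the procedure used in the paper. The three named weights are the LLaMA-3-8B values from Table 2; lumping the remaining sources into a single hypothetical `Others` entry with weight 0.016 assumes that all weights sum to one.

```python
import math


def allocate_counts(weights, total):
    """Split `total` draws across sources in proportion to `weights`.

    Uses largest-remainder rounding so the counts sum exactly to `total`.
    """
    norm = sum(weights.values())
    raw = {name: total * w / norm for name, w in weights.items()}
    counts = {name: math.floor(r) for name, r in raw.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining draws to the sources with the largest fractional parts.
    by_remainder = sorted(raw, key=lambda name: raw[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Top-3 weights for LLaMA-3-8B from Table 2; "Others" is an illustrative lump.
learned = {"WildChat": 0.711, "Airoboros": 0.154, "ChatAlpaca": 0.119, "Others": 0.016}
counts = allocate_counts(learned, 10_000)
print(counts)
```

The uniform baseline corresponds to calling the same helper with equal weights for every source.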

As shown in Table 2 and Figure 4, for models ranging from 8B to 34B parameters, ScaleBiO filters out the less relevant multilingual datasets and emphasizes GPT-4-generated data sources. Strikingly, the top three sources for every model fall within a small subset of our data sources (WildChat, Airoboros, ChatAlpaca, and ShareGPT4), which suggests that the importance attributed to these sources is largely consistent across models. Moreover, the top three weights account for over 80% of the total weight in every model. This concentration indicates that our method consistently identifies and emphasizes the most influential data sources, regardless of the model architecture.

4.2.2 Large Scale Instruction Following

To confirm the gain of ScaleBiO, we further apply learned sampling weights to fine-tune LLaMA-3-8B and LLaMA-3-70B models.

We evaluate LLMs’ instruction-following ability using MT-Bench (Zheng et al., 2023a) with single-answer grading. This benchmark prompts chat assistants with challenging multi-turn open-ended questions and uses "LLM-as-a-judge" for evaluation. MT-Bench stands out as a comprehensive benchmark that aligns well with human preference. It comprises 80 questions, evenly distributed across 8 categories: Writing, Roleplay, Extraction, Reasoning, Math, Coding, Knowledge I (STEM), and Knowledge II (humanities/social science). To demonstrate this, we compare the performance of LLaMA-3-8B trained with our reweighted data versus the original data. Specifically, we first reweight data sources on LLaMA-3-8B. Then we fine-tune the LLM using the reweighted data.

The results, as shown in Table 3 and Table 4, indicate that our method consistently outperforms the baseline across different categories by a large margin, highlighting the effectiveness of our reweighting approach.

On top of that, the same weights learned from LLaMA-3-8B can be directly transferred to fine-tuning the LLaMA-3-70B model. As shown in Table 3, our model achieves a noticeable improvement on average compared with the baseline (8.05 vs. 7.85), indicating that the learned data weights transfer within the same family of models.

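| Model | Method | Average |
|---|---|---|
| LLaMA-3-8B | Baseline | 6.11 |
| LLaMA-3-8B | Ours | 7.12 |
| LLaMA-3-8B → LLaMA-3-70B | Baseline | 7.85 |
| LLaMA-3-8B → LLaMA-3-70B | Ours | 8.05 |

Table 3: MT-Bench results of LLaMA-3-8B and of LLaMA-3-70B trained with data weights transferred from LLaMA-3-8B.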

4.2.3 Implementation Details

Our data reweighting process lasts for 3 epochs with $\alpha=100$ and initial learning rates of $10^{-5}$ for models and $10^{-2}$ for weights. For all our fine-tuning processes, we train the LLM for 1 epoch with an initial learning rate of $8\times 10^{-6}$ and a global batch size of 64. Throughout our experiments, we adopt randomized coordinate descent with AdamW (Pan et al., 2024) and bfloat16 precision for efficient training and inference. Our experiments are conducted on 8 NVIDIA A100 80GB GPUs.
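For reference, the hyperparameters above can be collected into a small configuration object. The class and field names below are hypothetical and do not correspond to the released code; the per-GPU batch size is derived under the assumption that the global batch is split evenly across GPUs, with a gradient accumulation factor that the text does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReweightingConfig:
    """Bilevel data reweighting stage (values from the text, names hypothetical)."""
    epochs: int = 3
    alpha: float = 100.0
    model_lr: float = 1e-5
    weight_lr: float = 1e-2


@dataclass(frozen=True)
class FinetuneConfig:
    """Fine-tuning stage (values from the text, names hypothetical)."""
    epochs: int = 1
    lr: float = 8e-6
    global_batch_size: int = 64
    num_gpus: int = 8
    precision: str = "bfloat16"

    def per_gpu_batch(self, grad_accum_steps: int = 1) -> int:
        """Per-GPU micro-batch size, assuming an even split across GPUs."""
        size, rem = divmod(self.global_batch_size, self.num_gpus * grad_accum_steps)
        if rem:
            raise ValueError("global batch size is not evenly divisible")
        return size


reweight_cfg = ReweightingConfig()
finetune_cfg = FinetuneConfig()
print(reweight_cfg, finetune_cfg, finetune_cfg.per_gpu_batch())
```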

5 Conclusion

In this paper, we propose ScaleBiO, the first bilevel optimization instantiation that is capable of scaling to 34B LLMs on data reweighting tasks. Theoretically, ScaleBiO ensures optimality of the learned data weights and enjoys the same convergence guarantees as conventional first-order penalty-based bilevel optimization algorithms on smooth and strongly convex objectives. Empirically, ScaleBiO enables data reweighting on models with $\geq 7$B parameters, bringing forth an efficient data filtering and selection pipeline for improving model performance on various downstream tasks. On top of that, the sampling weights learned on LLaMA-3-8B can be seamlessly transferred to the larger LLaMA-3-70B while still obtaining non-trivial performance improvements.

Limitations

ScaleBiO has yet to be verified in large-scale pre-training settings, where such experiments require a huge amount of computational resources. We hope the success of ScaleBiO in large-scale fine-tuning settings can be a first step in this direction.

The potential risks of ScaleBiO are the same as other data reweighting techniques, where optimizing the sampling weights on a single loss metric may lead to models that neglect other aspects, such as safety or ethics. In that case, multi-objective losses and post-training alignments are highly recommended to compensate for this deficiency.

The positive aspect of ScaleBiO is that it helps reweight data more effectively, thus allowing the training cost of large language models to be further reduced.

Ethical Considerations

In conducting our experiments on a diverse set of datasets for instruction following, we have given careful consideration to ethical concerns that may arise. Our work involves datasets such as ShareGPT, OpenOrca, WildChat, AlpacaChat, LMSYS-Chat, Airoboros, etc. We list the license for each dataset in the Appendix and ensure compliance with the licensing agreements for each dataset. Furthermore, all these data sources are publicly available and do not involve privacy issues.

References

  • Domke [2012] Justin Domke. Generic methods for optimization-based modeling. In Neil D. Lawrence and Mark Girolami, editors, Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of Proceedings of Machine Learning Research, pages 318–326, La Palma, Canary Islands, 21–23 Apr 2012. PMLR.
  • Maclaurin et al. [2015] Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2113–2122, Lille, France, 07–09 Jul 2015. PMLR.
  • Franceschi et al. [2017] Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1165–1173. PMLR, 2017.
  • Lorraine et al. [2020] Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. In International Conference on Artificial Intelligence and Statistics, pages 1540–1552. PMLR, 2020.
  • Andrychowicz et al. [2016] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. Advances in neural information processing systems, 29, 2016.
  • Franceschi et al. [2018] Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, pages 1568–1577. PMLR, 2018.
  • Rajeswaran et al. [2019] Aravind Rajeswaran, Chelsea Finn, Sham M Kakade, and Sergey Levine. Meta-learning with implicit gradients. Advances in neural information processing systems, 32, 2019.
  • Konda and Tsitsiklis [1999] Vijay Konda and John Tsitsiklis. Actor-critic algorithms. Advances in neural information processing systems, 12, 1999.
  • Hong et al. [2020] Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A two-timescale framework for bilevel optimization: Complexity analysis and application to actor-critic. arXiv preprint arXiv:2007.05170, 2020.
  • Pedregosa [2016] Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 737–746, New York, New York, USA, 20–22 Jun 2016. PMLR.
  • Grazzi et al. [2020] Riccardo Grazzi, Luca Franceschi, Massimiliano Pontil, and Saverio Salzo. On the iteration complexity of hypergradient computation. In International Conference on Machine Learning, pages 3748–3758. PMLR, 2020.
  • Shaban et al. [2019] Amirreza Shaban, Ching-An Cheng, Nathan Hatch, and Byron Boots. Truncated back-propagation for bilevel optimization. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1723–1732. PMLR, 2019.
  • Ghadimi and Wang [2018] Saeed Ghadimi and Mengdi Wang. Approximation methods for bilevel programming. arXiv preprint arXiv:1802.02246, 2018.
  • Yang et al. [2021] Junjie Yang, Kaiyi Ji, and Yingbin Liang. Provably faster algorithms for bilevel optimization. Advances in Neural Information Processing Systems, 34:13670–13682, 2021.
  • Ji et al. [2021] Kaiyi Ji, Junjie Yang, and Yingbin Liang. Bilevel optimization: Convergence analysis and enhanced design. In International conference on machine learning, pages 4882–4892. PMLR, 2021.
  • Chen et al. [2022] Tianyi Chen, Yuejiao Sun, Quan Xiao, and Wotao Yin. A single-timescale method for stochastic bilevel optimization. In International Conference on Artificial Intelligence and Statistics, pages 2466–2488. PMLR, 2022.
  • Kwon et al. [2023] Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, and Robert D Nowak. A fully first-order method for stochastic bilevel optimization. In International Conference on Machine Learning, pages 18083–18113. PMLR, 2023.
  • Wang et al. [2024] Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Chandu, David Wadden, Kelsey MacMillan, Noah A Smith, Iz Beltagy, et al. How far can camels go? exploring the state of instruction tuning on open resources. Advances in Neural Information Processing Systems, 36, 2024.
  • Zhou et al. [2024] Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36, 2024.
  • Muennighoff et al. [2022] Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786, 2022.
  • Du et al. [2022a] Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335, 2022a.
  • Almazrouei et al. [2023] Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, et al. The falcon series of open language models. arXiv preprint arXiv:2311.16867, 2023.
  • Xia et al. [2024a] Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen. Less: Selecting influential data for targeted instruction tuning. arXiv preprint arXiv:2402.04333, 2024a.
  • Xie et al. [2024] Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy S Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. Doremi: Optimizing data mixtures speeds up language model pretraining. Advances in Neural Information Processing Systems, 36, 2024.
  • Xia et al. [2024b] Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared LLaMA: Accelerating language model pre-training via structured pruning. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=09iOdaeOzp.
  • Bengio [2000] Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural computation, 12(8):1889–1900, 2000.
  • Yang et al. [2023] Yifan Yang, Peiyao Xiao, and Kaiyi Ji. Achieving $\mathcal{O}(\epsilon^{-1.5})$ complexity in Hessian/Jacobian-free stochastic bilevel optimization. Advances in Neural Information Processing Systems, 36, 2023.
  • Liu et al. [2022] Bo Liu, Mao Ye, Stephen Wright, Peter Stone, and Qiang Liu. Bome! bilevel optimization made easy: A simple first-order approach. In Advances in neural information processing systems, volume 35, pages 17248–17262, 2022.
  • Dagréou et al. [2022] Mathieu Dagréou, Pierre Ablin, Samuel Vaiter, and Thomas Moreau. A framework for bilevel optimization that enables stochastic and global variance reduction algorithms. Advances in Neural Information Processing Systems, 35:26698–26710, 2022.
  • Sow et al. [2022] Daouda Sow, Kaiyi Ji, and Yingbin Liang. On the convergence theory for hessian-free bilevel algorithms. In Advances in Neural Information Processing Systems, volume 35, pages 4136–4149, 2022.
  • Somayajula et al. [2023] Sai Ashish Somayajula, Lifeng Jin, Linfeng Song, Haitao Mi, and Dong Yu. Bi-level finetuning with task-dependent similarity structure for low-resource training. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8569–8588, 2023.
  • Zhou et al. [2022a] Xiao Zhou, Yong Lin, Renjie Pi, Weizhong Zhang, Renzhe Xu, Peng Cui, and Tong Zhang. Model agnostic sample reweighting for out-of-distribution learning. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 27203–27221. PMLR, 17–23 Jul 2022a. URL https://proceedings.mlr.press/v162/zhou22d.html.
  • Gao et al. [2023] Jiahui Gao, Renjie Pi, Yong Lin, Hang Xu, Jiacheng Ye, Zhiyong Wu, Weizhong Zhang, Xiaodan Liang, Zhenguo Li, and Lingpeng Kong. Self-guided noise-free data generation for efficient zero-shot learning, 2023. URL https://arxiv.longhoe.net/abs/2205.12679.
  • Zhou et al. [2022b] Xiao Zhou, Renjie Pi, Weizhong Zhang, Yong Lin, Zonghao Chen, and Tong Zhang. Probabilistic bilevel coreset selection. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 27287–27302. PMLR, 17–23 Jul 2022b. URL https://proceedings.mlr.press/v162/zhou22h.html.
  • Lin et al. [2023] Yong Lin, Renjie Pi, Weizhong Zhang, Xiaobo Xia, Jiahui Gao, Xiao Zhou, Tongliang Liu, and Bo Han. A holistic view of label noise transition matrix in deep learning and beyond. In International Conference on Learning Representations, 2023. URL https://api.semanticscholar.org/CorpusID:259298577.
  • Khanduri et al. [2021] Prashant Khanduri, Siliang Zeng, Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A near-optimal algorithm for stochastic bilevel optimization via double-momentum. Advances in neural information processing systems, 34:30271–30283, 2021.
  • Jorge and Stephen [2006] Jorge Nocedal and Stephen J. Wright. Numerical optimization, 2006.
  • Chen et al. [2023] Lesi Chen, Yaohua Ma, and Jingzhao Zhang. Near-optimal nonconvex-strongly-convex bilevel optimization with fully first-order oracles. arXiv preprint arXiv:2306.14853, 2023.
  • Devlin et al. [2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Grangier et al. [2024] David Grangier, Pierre Ablin, and Awni Hannun. Bilevel optimization to learn training distributions for language modeling under domain shift. In NeurIPS 2023 Workshop on Distribution Shifts: New Frontiers with Foundation Models, 2024. URL https://openreview.net/forum?id=D67r01BYYP.
  • Du et al. [2022b] Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. Glam: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, pages 5547–5569. PMLR, 2022b.
  • Xie et al. [2023] Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy S Liang. Data selection for language models via importance resampling. Advances in Neural Information Processing Systems, 36:34201–34227, 2023.
  • Mindermann et al. [2022] Sören Mindermann, Jan M Brauner, Muhammed T Razzak, Mrinank Sharma, Andreas Kirsch, Winnie Xu, Benedikt Höltgen, Aidan N Gomez, Adrien Morisot, Sebastian Farquhar, et al. Prioritized training on points that are learnable, worth learning, and not yet learnt. In International Conference on Machine Learning, pages 15630–15649. PMLR, 2022.
  • Thakkar et al. [2023] Megh Thakkar, Tolga Bolukbasi, Sriram Ganapathy, Shikhar Vashishth, Sarath Chandar, and Partha Talukdar. Self-influence guided data reweighting for language model pre-training. arXiv preprint arXiv:2311.00913, 2023.
  • Taori et al. [2023] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  • Wei et al. [2021] Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
  • Chiang et al. [2023] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. Blog post, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
  • Lu and Mei [2023] Zhaosong Lu and Sanyou Mei. First-order penalty methods for bilevel optimization. arXiv preprint arXiv:2301.01716, 2023.
  • Nesterov [2012] Yu Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
  • Pan et al. [2024] Rui Pan, Xiang Liu, Shizhe Diao, Renjie Pi, Jipeng Zhang, Chi Han, and Tong Zhang. Lisa: Layerwise importance sampling for memory-efficient large language model fine-tuning. arXiv preprint arXiv:2403.17919, 2024.
  • Kingma and Ba [2015] Diederik P Kingma and Jimmy Lei Ba. ADAM: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • Peng et al. [2023] Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023.
  • Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Black et al. [2022] Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. GPT-NeoX-20B: An open-source autoregressive language model. In Proceedings of the ACL Workshop on Challenges & Perspectives in Creating Large Language Models, 2022. URL https://arxiv.longhoe.net/abs/2204.06745.
  • 01.AI et al. [2024] 01.AI, Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, and Zonghong Dai. Yi: Open foundation models by 01.ai, 2024.
  • Zheng et al. [2023a] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric. P Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023a.
  • Radford et al. [2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • Loshchilov and Hutter [2017] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • Lian et al. [2023] Wing Lian, Guan Wang, Bleys Goodson, Eugene Pentland, Austin Cook, Chanvichet Vong, and "Teknium". Slimorca: An open dataset of gpt-4 augmented flan reasoning traces, with verification, 2023. URL https://huggingface.co/Open-Orca/SlimOrca.
  • Bian et al. [2023] Ning Bian, Hongyu Lin, Yaojie Lu, Xianpei Han, Le Sun, and Ben He. Chatalpaca: A multi-turn dialogue corpus based on alpaca instructions. https://github.com/cascip/ChatAlpaca, 2023.
  • Mukherjee et al. [2023] Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4, 2023.
  • Zhao et al. [2024] Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. Wildchat: 1m chatGPT interaction logs in the wild. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Bl8u7ZRlbM.
  • Zheng et al. [2023b] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric. P Xing, Joseph E. Gonzalez, Ion Stoica, and Hao Zhang. Lmsys-chat-1m: A large-scale real-world llm conversation dataset, 2023b.
  • "Teknium" [2023] "Teknium". Gpteacher general-instruct. https://huggingface.co/datasets/teknium/GPTeacher-General-Instruct, 2023.
  • Durbin [2023] Jon Durbin. Airoboros: using large language models to fine-tune large language models. https://huggingface.co/datasets/jondurbin/airoboros-3.2, 2023.

Appendix A Additional Experiments

A.1 Detailed MT-Bench Scores for LLaMA

The detailed MT-Bench scores for LLaMA-3-8B in Section B.2 are provided in Table 4.

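| Model | Method | Coding | Extraction | Humanities | Math | Reasoning | Roleplay | STEM | Writing | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-3-8B | Baseline | 4.65 | 6.20 | 7.50 | 3.40 | 5.25 | 7.70 | 7.00 | 7.20 | 6.11 |
| LLaMA-3-8B | Ours | 5.40 | 8.55 | 9.68 | 3.45 | 5.30 | 8.20 | 8.73 | 7.63 | 7.12 |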

Table 4: Comparison of our approach and the baseline on MT-Bench.

Appendix B Experimental Details

B.1 Small Scale Experiments

Throughout our small-scale experiments, we use GPT-2 [Radford et al., 2019] with 124 million parameters as the backbone model. For bilevel optimization hyperparameters, we set the learning rate to $10^{-2}$ for the sampling weights and $10^{-5}$ for the model. We run our algorithm for 3 epochs with a batch size of 64 and $\alpha=10$, adopting AdamW [Loshchilov and Hutter, 2017] for optimization.
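For reference, the hyperparameters above can be collected in a small configuration object. The following is a minimal sketch: the class name and field names are hypothetical and are not taken from the released code; only the values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmallScaleConfig:
    """Hyperparameters reported for the GPT-2 experiments (names are hypothetical)."""

    backbone: str = "GPT-2 (124M)"       # backbone model
    lr_sampling_weights: float = 1e-2    # learning rate for the sampling weights
    lr_model: float = 1e-5               # learning rate for the model
    num_epochs: int = 3
    batch_size: int = 64
    alpha: float = 10.0                  # coefficient alpha in the penalized objective
    optimizer: str = "AdamW"


config = SmallScaleConfig()
```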

B.2 Large Scale Experiments

| Dataset | Size | Kind | License |
|---|---|---|---|
| AlpacaGPT4 [Peng et al., 2023] | 52K | Instruction | Apache-2.0 |
| ShareGPT4 [Chiang et al., 2023] | 6K | Conversation | Apache-2.0 |
| SlimOrca [Lian et al., 2023] | 518K | Instruction | MIT |
| AlpacaChat [Bian et al., 2023] | 20K | Conversation | Apache-2.0 |
| OpenOrcaGPT4 [Mukherjee et al., 2023] | 1M | Instruction | MIT |
| WildChat [Zhao et al., 2024] | 1M | Conversation | AI2 ImpACT |
| LMSYS-Chat [Zheng et al., 2023b] | 1M | Conversation | LMSYS-Chat-1M |
| GPTeacher ["Teknium", 2023] | 89K | Instruction | MIT |
| Airoboros [Durbin, 2023] | 59K | Conversation | CC-BY-4.0 |
| Alpaca-es (https://huggingface.co/datasets/bertin-project/alpaca-spanish) | 52K | Instruction | CC-BY-4.0 |
| Alpaca-de (https://huggingface.co/datasets/mayflowergmbh/alpaca-gpt4_de) | 50K | Instruction | Apache-2.0 |
| Alpaca-ja (https://huggingface.co/datasets/fujiki/japanese_alpaca_data) | 52K | Instruction | CC-BY-NC-SA-4.0 |
| Alpaca-ko (https://huggingface.co/datasets/Bingsu/ko_alpaca_data) | 50K | Instruction | CC-BY-NC-4.0 |
| Alpaca-ru (https://huggingface.co/datasets/IlyaGusev/ru_turbo_alpaca) | 30K | Instruction | CC-BY-4.0 |
| Alpaca-it (https://huggingface.co/datasets/mchl-labs/stambecco_data_it) | 52K | Instruction | CC-BY-NC-SA-4.0 |
| Alpaca-fr (https://huggingface.co/datasets/jpacifico/French-Alpaca-dataset-Instruct-55K) | 55K | Instruction | Apache-2.0 |
| Alpaca-zh (https://huggingface.co/datasets/llm-wizard/alpaca-gpt4-data-zh) | 49K | Instruction | CC-BY-4.0 |
| Alpaca-pt (https://huggingface.co/datasets/dominguesm/alpaca-data-pt-br) | 52K | Instruction | CC-BY-NC-4.0 |

Table 5: Summary of the datasets used. The first nine rows are the high-quality datasets; the last nine rows are multilingual Alpaca datasets, with the language code in their names.

Our training data consists of 18 distinct sources, as detailed in Table 5. We collect 9 high-quality datasets and 9 multilingual Alpaca datasets, which serve as irrelevant data sources. For each data source, we preprocess by filtering out conversations/instructions that exceed the max length (1024 tokens in our experiments). For our reference dataset $\mathcal{D}_{\text{val}}$, we prompt GPT4 using the prompt "Help me generate 3 sets of 2-turn instructions to evaluate the {category} ability of LLMs. The instructions for the second turn need to be highly relevant to the first turn. The following is an example.\n\n\n EXAMPLE:{example}\n TURN1:{turn1}\n TURN2:{turn2}\n", where {category} corresponds to one of the 8 categories in MT-Bench and {example} is one example from MT-Bench. In this way, we obtain a reference dataset that is highly similar to MT-Bench.
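The length filtering and prompt construction described above can be sketched as follows. This is an illustration only: `count_tokens` is a hypothetical whitespace-based stand-in for the model tokenizer, the function names and sample records are assumptions, and how {example}, {turn1}, and {turn2} are populated is not specified beyond the description above, so placeholder strings are used.

```python
MAX_LENGTH = 1024  # max length in tokens, as stated in the text

# Prompt template quoted from the text; placeholders are filled per request.
PROMPT_TEMPLATE = (
    "Help me generate 3 sets of 2-turn instructions to evaluate the {category} "
    "ability of LLMs. The instructions for the second turn need to be highly "
    "relevant to the first turn. The following is an example.\n\n\n"
    " EXAMPLE:{example}\n TURN1:{turn1}\n TURN2:{turn2}\n"
)


def count_tokens(text):
    # Hypothetical stand-in: counts whitespace-separated tokens, not model tokens.
    return len(text.split())


def filter_by_length(samples, max_length=MAX_LENGTH):
    # Keep only samples whose token count does not exceed max_length.
    return [s for s in samples if count_tokens(s) <= max_length]


def build_prompt(category, example, turn1, turn2):
    # Fill the template; all four arguments are passed through unchanged.
    return PROMPT_TEMPLATE.format(
        category=category, example=example, turn1=turn1, turn2=turn2
    )


samples = ["short instruction", "word " * 2000]  # toy records (assumptions)
kept = filter_by_length(samples)
prompt = build_prompt("writing", "<example>", "<turn 1>", "<turn 2>")
```

In practice the token count would come from the tokenizer of the model being trained, so the whitespace count here only approximates which samples are dropped.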

Appendix C Important Lemmas

Suppose Assumption 1 holds. Then the functions $\mathcal{L}^{\alpha}(\lambda,w,u)$ and $\Gamma^{\alpha}(\lambda)$ satisfy the following properties.

Lemma 2.

Under Assumption 1, the following hold:

  • (i) $\mathcal{L}^{\alpha}(\lambda,w,u)$ is $\mu_{2}\alpha$-strongly concave w.r.t. $u$;

  • (ii) $\mathcal{L}^{\alpha}(\lambda,w,u)$ is $\mu_{2}\alpha/2$-strongly convex w.r.t. $w$ if $\alpha>2\ell_{11}/\mu_{2}$.
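A brief sketch of why both properties are plausible may help. It assumes, as the notation suggests, that Assumption 1 makes $L_{2}(\lambda,\cdot)$ $\mu_{2}$-strongly convex and $L_{1}(\lambda,\cdot)$ $\ell_{11}$-smooth, and that $\mathcal{L}^{\alpha}(\lambda,w,u)=L_{1}(\lambda,w)+\alpha\left(L_{2}(\lambda,w)-L_{2}(\lambda,u)\right)$, matching the mini-batch form in Eq. (15). The Hessian notation additionally presumes twice differentiability, which is an assumption made here for illustration only.

```latex
\begin{align*}
\text{(i)}\quad & \nabla_{u}^{2}\mathcal{L}^{\alpha}(\lambda,w,u)
  = -\alpha\,\nabla_{u}^{2}L_{2}(\lambda,u) \preceq -\alpha\mu_{2} I, \\
\text{(ii)}\quad & \nabla_{w}^{2}\mathcal{L}^{\alpha}(\lambda,w,u)
  = \nabla_{w}^{2}L_{1}(\lambda,w) + \alpha\,\nabla_{w}^{2}L_{2}(\lambda,w)
  \succeq (\alpha\mu_{2}-\ell_{11}) I
  \succ \frac{\alpha\mu_{2}}{2} I
  \quad \text{when } \alpha > 2\ell_{11}/\mu_{2}.
\end{align*}
```

The last step uses $\ell_{11}<\alpha\mu_{2}/2$, which is exactly the condition $\alpha>2\ell_{11}/\mu_{2}$.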

The results of Lemma 2 can be found in [Kwon et al., 2023] and Lemma B.1 of [Chen et al., 2023]. From Lemma B.7 in [Chen et al., 2023], the following result holds for $\Gamma^{\alpha}(\lambda)$:

Lemma 3.

Under Assumption 1, if $\alpha>2\ell_{11}/\mu_{2}$, then $\Gamma^{\alpha}(\lambda)$ is $\ell_{\Gamma}$-smooth, where $\ell_{\Gamma}=\mathcal{O}(\kappa^{3})$ is a constant that is independent of $\alpha$.

Moreover, the functions $w_{\ast}^{\alpha}(\lambda)$ and $u_{\ast}(\lambda)$ satisfy the following properties.

Lemma 4.

Under Assumption 1, we have

$$\left\|w_{\ast}^{\alpha}(\lambda)-w_{\ast}(\lambda)\right\|\leq\frac{C_{0}}{\alpha},$$

where $C_{0}=\ell_{10}/\mu_{2}$.

The result in Lemma 4 follows from Lemma B.2 of Chen et al. [2023].
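As a sanity check on Lemma 4, consider a hypothetical one-dimensional instance chosen purely for illustration: $L_{1}(\lambda,w)=\ell w$ and $L_{2}(\lambda,w)=\frac{\mu}{2}(w-\lambda)^{2}$, so that $\ell_{10}=\ell$ and $\mu_{2}=\mu$, under the assumption that $\ell_{10}$ bounds $\|\nabla_{w}L_{1}\|$. The sketch also assumes that $w_{\ast}^{\alpha}(\lambda)$ minimizes $L_{1}+\alpha L_{2}$ over $w$, consistent with the penalty form in Eq. (15). In this instance the gap equals $C_{0}/\alpha$ exactly, so the bound cannot be improved in general.

```python
def w_star(lam):
    # Minimizer of L2(lam, w) = mu/2 * (w - lam)^2 over w.
    return lam


def w_star_alpha(lam, alpha, ell, mu):
    # Minimizer of ell * w + alpha * mu/2 * (w - lam)^2 over w,
    # obtained by setting the derivative ell + alpha * mu * (w - lam) to zero.
    return lam - ell / (alpha * mu)


ell, mu, lam = 2.0, 0.5, 1.0  # toy constants (assumptions)
c0 = ell / mu                 # C_0 = l_10 / mu_2 in this instance

for alpha in (1.0, 10.0, 100.0):
    gap = abs(w_star_alpha(lam, alpha, ell, mu) - w_star(lam))
    print(f"alpha={alpha:6.1f}  gap={gap:.4f}  C0/alpha={c0 / alpha:.4f}")
```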

Lemma 5.

Under Assumption 1, if $\alpha>2\ell_{11}/\mu_{2}$, then we have

  • (i) $u_{\ast}(\lambda)$ is $\kappa$-Lipschitz continuous;

  • (ii) $w_{\ast}^{\alpha}(\lambda)$ is $\ell_{w_{\ast},0}$-Lipschitz continuous, where $\ell_{w_{\ast},0}=3\kappa$.

Here the condition number is $\kappa=\max\left\{\ell_{10},\ell_{11},\ell_{21},\ell_{22}\right\}/\mu_{2}$.

Claim (i) in Lemma 5 can be found in Lemma 2.2 of [Ghadimi and Wang, 2018], and Claim (ii) follows from Lemma 3.2 (setting $\lambda_{1}=\lambda_{2}$) of [Kwon et al., 2023].

Lemma 6.

Under Assumption 1, if $\alpha>2\ell_{11}/\mu_{2}$, then $u_{\ast}(\lambda)$ is $\ell_{\nabla u_{\ast}}$-smooth, where $\ell_{\nabla u_{\ast}}=\mathcal{O}\left(\frac{\kappa^{2}}{\mu_{2}}\left(\ell_{21}+1\right)\right)$ and the condition number is $\kappa=\max\left\{\ell_{10},\ell_{11},\ell_{21},\ell_{22}\right\}/\mu_{2}$.

Following Lemma A.3 of [Kwon et al., 2023] and recalling the Lipschitz continuity of $u_{\ast}(\lambda)$ from Lemma 5, the claim follows.

Appendix D Proofs of Theorem 1

Proof.

We approximate the function $\mathcal{L}^{\alpha}$ by the following mini-batch estimate $\mathcal{L}_{D_{k}}^{\alpha}$ at each iteration:

$$\mathcal{L}_{D_{k}}^{\alpha}(\lambda,w,u):=L_{1}(\lambda,w;D_{\text{val}}^{k})+\alpha\left(L_{2}(\lambda,w;D_{\text{tr}}^{k})-L_{2}(\lambda,u;D_{\text{tr}}^{k})\right) \tag{15}$$

where $D_{k}=\left\{D_{\text{tr}}^{k},D_{\text{val}}^{k}\right\}$ and $D_{\text{tr}}^{k},D_{\text{val}}^{k}$ are i.i.d. mini-batch samples from the training and validation datasets $D_{\text{tr}},D_{\text{val}}$, respectively. We use $\mathcal{F}_{k}$ to denote the random information before the iteration $(\lambda_{k},w_{k},u_{k})$, that is, $\mathcal{F}_{k}:=\sigma\left(\left\{(\lambda_{k},\omega_{k},u_{k}),D_{k-1},\cdots,D_{1}\right\}\right)$. We use $\mathcal{C}_{k}=\sigma\left(\left\{j_{1},j_{2},\cdots,j_{t-1};r_{1},r_{2},\cdots,r_{t-1}\right\}\right)$ to denote the random information of variables $u,w$ for the randomized block coordinates before the iteration $k$.

We recall the update rule for $\lambda$ in the stochastic version of the minimax algorithm: $\lambda_{k+1}-\lambda_{k}=-\eta_{\lambda}\nabla_{\lambda}\mathcal{L}_{D_{k}}^{\alpha}(\lambda_{k},w_{k},u_{k})$. At each iteration,

$$\mathbb{E}[\nabla\mathcal{L}_{D_{k}}^{\alpha}(\lambda_{k},\omega_{k},u_{k})\mid\mathcal{F}_{k}]=\nabla\mathcal{L}^{\alpha}(\lambda_{k},\omega_{k},u_{k}). \tag{16}$$
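The stochastic update for $\lambda$ and the unbiasedness property in Eq. (16) can be illustrated on a toy problem. The sketch below is an assumption-laden example, not the paper's implementation: it uses a hypothetical per-sample loss $L_{2}(\lambda,v;x)=\frac{1}{2}(v-\lambda x)^{2}$, takes $L_{1}$ to be independent of $\lambda$, and enumerates every size-2 batch drawn i.i.d. with replacement to compare the average mini-batch gradient with the full-data gradient.

```python
from itertools import product


def grad_lambda(lam, w, u, train_batch, alpha):
    # d/d(lambda) of alpha * (L2(lam, w; batch) - L2(lam, u; batch)) for the toy loss
    # L2(lam, v; x) = 0.5 * (v - lam * x)^2; L1 does not depend on lambda here.
    per_sample = [-x * (w - lam * x) + x * (u - lam * x) for x in train_batch]
    return alpha * sum(per_sample) / len(train_batch)


def lambda_step(lam, w, u, train_batch, alpha, eta_lambda):
    # Stochastic update: lambda_{k+1} = lambda_k - eta_lambda * mini-batch gradient.
    return lam - eta_lambda * grad_lambda(lam, w, u, train_batch, alpha)


train = [1.0, 2.0, 4.0]                    # toy training set (assumption)
lam, w, u, alpha = 0.3, 1.5, 0.7, 10.0     # toy iterate and coefficient (assumptions)

full = grad_lambda(lam, w, u, train, alpha)
# Every ordered pair is equally likely under i.i.d. uniform sampling of size-2 batches.
batches = list(product(train, repeat=2))
expected = sum(grad_lambda(lam, w, u, b, alpha) for b in batches) / len(batches)
next_lam = lambda_step(lam, w, u, batches[0], alpha, eta_lambda=0.01)
```

Here `expected` agrees with `full` up to floating-point error, which is the toy analogue of the mini-batch gradient being an unbiased estimate of the full gradient.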

By the smoothness of $\Gamma^{\alpha}$ (see Lemma 3), we have

$$
\begin{aligned}
\Gamma^{\alpha}(\lambda_{k+1}) &\leq \Gamma^{\alpha}(\lambda_{k})+\left\langle\nabla\Gamma^{\alpha}(\lambda_{k}),\lambda_{k+1}-\lambda_{k}\right\rangle+\frac{\ell_{\Gamma}}{2}\left\|\lambda_{k+1}-\lambda_{k}\right\|^{2} \\
&= \Gamma^{\alpha}(\lambda_{k})-\eta_{\lambda}\left\langle\nabla\Gamma^{\alpha}(\lambda_{k}),\nabla_{\lambda}\mathcal{L}_{D_{k}}^{\alpha}(\lambda_{k},\omega_{k},u_{k})\right\rangle+\frac{\ell_{\Gamma}\eta_{\lambda}^{2}}{2}\left\|\nabla_{\lambda}\mathcal{L}_{D_{k}}^{\alpha}(\lambda_{k},\omega_{k},u_{k})\right\|^{2}.
\end{aligned}
\tag{17}
$$

Taking conditional expectation w.r.t. $\mathcal{F}_{k},\mathcal{C}_{k}$ on the above inequality, we have

$$
\begin{aligned}
&\mathbb{E}[\Gamma^{\alpha}(\lambda_{k+1})\mid\mathcal{F}_{k},\mathcal{C}_{k}] \\
&\leq \Gamma^{\alpha}(\lambda_{k})-\eta_{\lambda}\left\langle\nabla\Gamma^{\alpha}(\lambda_{k}),\mathbb{E}[\nabla_{\lambda}\mathcal{L}_{D_{k}}^{\alpha}(\lambda_{k},\omega_{k},u_{k})\mid\mathcal{F}_{k},\mathcal{C}_{k}]\right\rangle+\frac{\ell_{\Gamma}\eta_{\lambda}^{2}}{2}\mathbb{E}\left[\left\|\nabla_{\lambda}\mathcal{L}_{D_{k}}^{\alpha}(\lambda_{k},\omega_{k},u_{k})\right\|^{2}\mid\mathcal{F}_{k},\mathcal{C}_{k}\right] \\
&\leq \Gamma^{\alpha}(\lambda_{k})-\eta_{\lambda}\left\langle\nabla\Gamma^{\alpha}(\lambda_{k}),\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},\omega_{k},u_{k})\right\rangle+\frac{\ell_{\Gamma}\eta_{\lambda}^{2}}{2}\mathbb{E}\left[\left\|\nabla_{\lambda}\mathcal{L}_{D_{k}}^{\alpha}(\lambda_{k},\omega_{k},u_{k})\right\|^{2}\mid\mathcal{F}_{k},\mathcal{C}_{k}\right]
\end{aligned}
\tag{18}
$$

where the inequality follows from the fact that $\mathcal{L}_{D_k}^{\alpha}$ is an unbiased estimator of $\mathcal{L}^{\alpha}$, and

$$
\begin{aligned}
&\mathbb{E}\left[\left\|\nabla_{\lambda}\mathcal{L}_{D_k}^{\alpha}(\lambda_k,\omega_k,u_k)\right\|^{2} \mid \mathcal{F}_k,\mathcal{C}_k\right] \\
&= \mathbb{E}\left[\left\|\nabla_{\lambda}\mathcal{L}_{D_k}^{\alpha}(\lambda_k,\omega_k,u_k) - \nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,\omega_k,u_k) + \nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,\omega_k,u_k)\right\|^{2} \mid \mathcal{F}_k\right] \\
&\leq \mathbb{E}\left[\left\|\nabla_{\lambda}\mathcal{L}_{D_k}^{\alpha}(\lambda_k,\omega_k,u_k) - \nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,\omega_k,u_k)\right\|^{2} \mid \mathcal{F}_k\right] + \left\|\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,\omega_k,u_k)\right\|^{2} \\
&\leq \frac{\sigma_1^{2} + 2\alpha^{2}\sigma_2^{2}}{B} + \left\|\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,\omega_k,u_k)\right\|^{2}
\end{aligned} \tag{19}
$$

where the variance of the minibatch stochastic gradients (with batch size $B$) is bounded

$$
\mathbb{E}\left[\left\|\nabla L_1(\lambda,w;D_{\text{val}}^{k}) - \nabla L_1(\lambda,w)\right\|^{2}\right] \leq \frac{\sigma_1^{2}}{B}, \quad \mathbb{E}\left[\left\|\nabla L_2(\lambda,w;D_{\text{tr}}^{k}) - \nabla L_2(\lambda,w)\right\|^{2}\right] \leq \frac{\sigma_2^{2}}{B}, \tag{20}
$$

then

$$
\begin{aligned}
&\mathbb{E}\left[\left\|\nabla_{\lambda}\mathcal{L}_{D_k}^{\alpha}(\lambda_k,\omega_k,u_k) - \nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,\omega_k,u_k)\right\|^{2} \mid \mathcal{F}_k\right] \\
&= \mathbb{E}\left[\left\|\nabla_{\lambda}L_1(\lambda_k,w_k;D_{\text{val}}^{k}) - \nabla_{\lambda}L_1(\lambda_k,w_k)\right\|^{2} + \alpha^{2}\left\|\nabla_{\lambda}L_2(\lambda_k,w_k;D_{\text{tr}}^{k}) - \nabla_{\lambda}L_2(\lambda_k,w_k)\right\|^{2} \mid \mathcal{F}_k\right] \\
&\quad + \alpha^{2}\,\mathbb{E}\left[\left\|\nabla_{\lambda}L_2(\lambda_k,u_k;D_{\text{tr}}^{k}) - \nabla_{\lambda}L_2(\lambda_k,u_k)\right\|^{2} \mid \mathcal{F}_k\right] \\
&\leq \frac{\sigma_1^{2} + 2\alpha^{2}\sigma_2^{2}}{B}.
\end{aligned} \tag{21}
$$
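The $1/B$ scaling of the minibatch variance assumed in (20) can be checked exactly on a toy problem. The sketch below is illustrative only and is not part of the paper's method: the per-sample gradients are hypothetical scalars, the helper name `minibatch_variance` is invented here, and minibatches are assumed to be drawn i.i.d. uniformly with replacement. Under those assumptions the expectation is computed by enumerating every possible minibatch, so no random sampling is involved.

```python
from itertools import product


def minibatch_variance(per_sample_grads, batch_size):
    """Exact E[(minibatch mean - full mean)^2] for i.i.d. uniform sampling
    with replacement, computed by enumerating all ordered minibatches."""
    n = len(per_sample_grads)
    full_mean = sum(per_sample_grads) / n
    total = 0.0
    count = 0
    # Every ordered tuple of length batch_size is equally likely.
    for batch in product(per_sample_grads, repeat=batch_size):
        batch_mean = sum(batch) / batch_size
        total += (batch_mean - full_mean) ** 2
        count += 1
    return total / count


# Hypothetical per-sample scalar gradients (assumption for illustration).
grads = [1.0, -2.0, 0.5, 3.0]
sigma_sq = minibatch_variance(grads, 1)  # single-sample variance

for B in (1, 2, 3):
    # The two printed values agree up to floating-point error.
    print(B, minibatch_variance(grads, B), sigma_sq / B)
```

With a batch of size $B$, the computed variance matches the single-sample variance divided by $B$, which is the form of the bounds in (20).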

Applying the above results, we have

$$
\begin{aligned}
\mathbb{E}\left[\Gamma^{\alpha}(\lambda_{k+1}) \mid \mathcal{F}_k,\mathcal{C}_k\right] &\leq \Gamma^{\alpha}(\lambda_k) - \eta_{\lambda}\left\langle \nabla\Gamma^{\alpha}(\lambda_k), \nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,\omega_k,u_k)\right\rangle + \frac{\ell_{\Gamma}\eta_{\lambda}^{2}}{2}\left\|\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,\omega_k,u_k)\right\|^{2} \\
&\quad + \frac{\ell_{\Gamma}\eta_{\lambda}^{2}}{2B}\left(\sigma_1^{2} + 2\alpha^{2}\sigma_2^{2}\right).
\end{aligned} \tag{22}
$$

Let $\delta_k = \left\|u_k - u_{\ast}(\lambda_k)\right\|^{2}$ and $r_k = \left\|w_k - w_{\ast}^{\alpha}(\lambda_k)\right\|^{2}$. The inner product term on the RHS of (22) is bounded as follows:

$$
\begin{aligned}
&-\left\langle \nabla\Gamma^{\alpha}(\lambda_k), \nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,\omega_k,u_k)\right\rangle \\
&= -\left\langle \nabla\Gamma^{\alpha}(\lambda_k), \nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,\omega_k,u_k) - \nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,w_k,u_{\ast}(\lambda_k))\right\rangle \\
&\quad - \left\langle \nabla\Gamma^{\alpha}(\lambda_k), \nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,w_k,u_{\ast}(\lambda_k)) - \nabla_{\lambda}\Phi^{\alpha}(w_{\ast}^{\alpha}(\lambda_k),\lambda_k) + \nabla_{\lambda}\Phi^{\alpha}(w_{\ast}^{\alpha}(\lambda_k),\lambda_k)\right\rangle \\
&\overset{(a)}{=} -\left\langle \nabla\Gamma^{\alpha}(\lambda_k), \nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,w_k,u_k) - \nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,w_k,u_{\ast}(\lambda_k))\right\rangle \\
&\quad - \left\langle \nabla\Gamma^{\alpha}(\lambda_k), \nabla_{\lambda}\Phi^{\alpha}(w_k,\lambda_k) - \nabla_{\lambda}\Phi^{\alpha}(w_{\ast}^{\alpha}(\lambda_k),\lambda_k) + \nabla\Gamma^{\alpha}(\lambda_k)\right\rangle \\
&= -\left\|\nabla\Gamma^{\alpha}(\lambda_k)\right\|^{2} - \left\langle \nabla\Gamma^{\alpha}(\lambda_k), \nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,\omega_k,u_k) - \nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,w_k,u_{\ast}(\lambda_k))\right\rangle \\
&\quad - \left\langle \nabla\Gamma^{\alpha}(\lambda_k), \nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,w_k,u_{\ast}(\lambda_k)) - \nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,w_{\ast}^{\alpha}(\lambda_k),u_{\ast}(\lambda_k))\right\rangle \\
&\overset{(b)}{\leq} -\frac{1}{2}\left\|\nabla\Gamma^{\alpha}(\lambda_k)\right\|^{2} + \left\|\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,w_k,u_k) - \nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,w_k,u_{\ast}(\lambda_k))\right\|^{2} \\
&\quad + \left\|\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,w_k,u_{\ast}(\lambda_k)) - \nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,w_{\ast}^{\alpha}(\lambda_k),u_{\ast}(\lambda_k))\right\|^{2} \\
&\overset{(c)}{\leq} -\frac{1}{2}\left\|\nabla\Gamma^{\alpha}(\lambda_k)\right\|^{2} + \alpha^{2}\ell_{21}^{2}\left\|u_k - u_{\ast}(\lambda_k)\right\|^{2} + 2\left(\ell_{11}^{2} + \alpha^{2}\ell_{21}^{2}\right)\left\|\omega_k - \omega_{\ast}^{\alpha}(\lambda_k)\right\|^{2} \\
&= -\frac{1}{2}\left\|\nabla\Gamma^{\alpha}(\lambda_k)\right\|^{2} + \alpha^{2}\ell_{21}^{2}\delta_k + 2\left(\ell_{11}^{2} + \alpha^{2}\ell_{21}^{2}\right)r_k
\end{aligned} \tag{23}
$$

where $(a)$ uses the optimality of $\Phi$ over $w$, namely $\nabla_{\lambda}\Phi^{\alpha}(w_{\ast}^{\alpha}(\lambda_k),\lambda_k) = \nabla\Gamma^{\alpha}(\lambda_k) = \nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,w_{\ast}^{\alpha}(\lambda_k),u_{\ast}(\lambda_k))$; $(b)$ follows from the Cauchy-Schwarz inequality; and $(c)$ uses the smoothness of $L_1$ and $L_2$.
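Steps $(b)$ and $(c)$ can be unpacked as follows. This is a sketch under stated assumptions rather than the authors' own derivation. The shorthand $a$, $b_1$, $b_2$ is introduced here for illustration only. The form $\mathcal{L}^{\alpha}(\lambda,w,u) = L_1(\lambda,w) + \alpha\left(L_2(\lambda,w) - L_2(\lambda,u)\right)$ is inferred from the three-term decomposition in (21), and $\ell_{11}$, $\ell_{21}$ are assumed to be Lipschitz constants of $\nabla_{\lambda}L_1$ and $\nabla_{\lambda}L_2$ with respect to their second argument.

```latex
% Shorthand used only in this sketch (not notation from the paper):
%   a   = \nabla\Gamma^{\alpha}(\lambda_k)
%   b_1 = \nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,w_k,u_k)
%         - \nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,w_k,u_{\ast}(\lambda_k))
%   b_2 = \nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,w_k,u_{\ast}(\lambda_k))
%         - \nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_k,w_{\ast}^{\alpha}(\lambda_k),u_{\ast}(\lambda_k))
\begin{align*}
% Step (b): Cauchy-Schwarz followed by Young's inequality, applied to each inner product.
-\langle a, b_i\rangle &\le \|a\|\,\|b_i\| \le \tfrac{1}{4}\|a\|^{2} + \|b_i\|^{2}, \qquad i = 1, 2, \\
-\|a\|^{2} - \langle a, b_1\rangle - \langle a, b_2\rangle
  &\le -\tfrac{1}{2}\|a\|^{2} + \|b_1\|^{2} + \|b_2\|^{2}. \\
% Step (c): only the L_2(\lambda,u) term depends on u; both L_1 and L_2 depend on w.
\|b_1\|^{2} &= \alpha^{2}\left\|\nabla_{\lambda}L_2(\lambda_k,u_k) - \nabla_{\lambda}L_2(\lambda_k,u_{\ast}(\lambda_k))\right\|^{2}
  \le \alpha^{2}\ell_{21}^{2}\,\delta_k, \\
\|b_2\|^{2} &\le 2\left\|\nabla_{\lambda}L_1(\lambda_k,w_k) - \nabla_{\lambda}L_1(\lambda_k,w_{\ast}^{\alpha}(\lambda_k))\right\|^{2}
  + 2\alpha^{2}\left\|\nabla_{\lambda}L_2(\lambda_k,w_k) - \nabla_{\lambda}L_2(\lambda_k,w_{\ast}^{\alpha}(\lambda_k))\right\|^{2} \\
  &\le 2\left(\ell_{11}^{2} + \alpha^{2}\ell_{21}^{2}\right) r_k.
\end{align*}
```

Under these assumptions, the two quarter-norm terms from Young's inequality account for the coefficient $-\tfrac{1}{2}$ on $\left\|\nabla\Gamma^{\alpha}(\lambda_k)\right\|^{2}$, and the factor $2$ in front of $\left(\ell_{11}^{2} + \alpha^{2}\ell_{21}^{2}\right)$ comes from $\|x + y\|^{2} \le 2\|x\|^{2} + 2\|y\|^{2}$.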
Next we turn to estimate the norm of gradient λα(λk,wk,uk)subscript𝜆superscript𝛼subscript𝜆𝑘subscript𝑤𝑘subscript𝑢𝑘\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{k})∇ start_POSTSUBSCRIPT italic_λ end_POSTSUBSCRIPT caligraphic_L start_POSTSUPERSCRIPT italic_α end_POSTSUPERSCRIPT ( italic_λ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_w start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) as follows

$$
\begin{aligned}
\left\|\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{k})\right\|^{2}
&=\left\|\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{k})-\nabla\Gamma^{\alpha}(\lambda_{k})+\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2} \\
&\leq 2\left(\left\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2}+\left\|\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{k})-\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2}\right) \\
&\leq 2\left\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2}+4\left\|\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{k})-\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{\ast}(\lambda_{k}))\right\|^{2} \\
&\quad+4\left\|\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{\ast}(\lambda_{k}))-\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{\ast}^{\alpha}(\lambda_{k}),u_{\ast}(\lambda_{k}))\right\|^{2} \\
&\overset{(a)}{\leq} 2\left\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2}+4\alpha^{2}\ell_{11}^{2}\left\|u_{k}-u_{\ast}(\lambda_{k})\right\|^{2}+8\left(\ell_{11}^{2}+\alpha^{2}\ell_{21}^{2}\right)\left\|w_{k}-w_{\ast}^{\alpha}(\lambda_{k})\right\|^{2} \\
&=2\left\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2}+4\alpha^{2}\ell_{11}^{2}\delta_{k}+8\left(\ell_{11}^{2}+\alpha^{2}\ell_{21}^{2}\right)r_{k}
\end{aligned}
\tag{24}
$$

where $(a)$ uses the smoothness of the objectives $L_{1}$ and $L_{2}$. Substituting the two bounds above, (23) and (24), into (D) gives

$$
\begin{aligned}
\mathbb{E}\left[\Gamma^{\alpha}(\lambda_{k+1})\mid\mathcal{F}_{k},\mathcal{C}_{k}\right]
&\leq\Gamma^{\alpha}(\lambda_{k})-\frac{\eta_{\lambda}}{2}\left\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2}+\frac{\ell_{\Gamma}\eta_{\lambda}^{2}}{2}\left(2\left\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2}+4\alpha^{2}\ell_{11}^{2}\delta_{k}+8\left(\ell_{11}^{2}+\alpha^{2}\ell_{21}^{2}\right)r_{k}\right) \\
&\quad+\eta_{\lambda}\left(\alpha^{2}\ell_{11}^{2}\delta_{k}+2\left(\ell_{11}^{2}+\alpha^{2}\ell_{21}^{2}\right)r_{k}\right)+\frac{\ell_{\Gamma}\eta_{\lambda}^{2}}{2}\left(\sigma_{1}^{2}+2\alpha^{2}\sigma_{2}^{2}\right).
\end{aligned}
\tag{25}
$$
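As a reading aid, the right-hand side of (25) can be regrouped by quantity. The first inequality below is pure algebra on (25). The second uses the step-size condition $\eta_{\lambda}\leq 1/(4\ell_{\Gamma})$, which is an illustrative assumption made here for concreteness and is not a condition taken from this section.

```latex
\begin{aligned}
\mathbb{E}\left[\Gamma^{\alpha}(\lambda_{k+1})\mid\mathcal{F}_{k},\mathcal{C}_{k}\right]
&\leq \Gamma^{\alpha}(\lambda_{k})
  -\frac{\eta_{\lambda}}{2}\left(1-2\ell_{\Gamma}\eta_{\lambda}\right)
   \left\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2} \\
&\quad+\eta_{\lambda}\left(1+2\ell_{\Gamma}\eta_{\lambda}\right)
   \left(\alpha^{2}\ell_{11}^{2}\delta_{k}
   +2\left(\ell_{11}^{2}+\alpha^{2}\ell_{21}^{2}\right)r_{k}\right)
  +\frac{\ell_{\Gamma}\eta_{\lambda}^{2}}{2}
   \left(\sigma_{1}^{2}+2\alpha^{2}\sigma_{2}^{2}\right) \\
% assumed step size: \eta_{\lambda} \le 1/(4\ell_{\Gamma}), so 2\ell_{\Gamma}\eta_{\lambda} \le 1/2
&\leq \Gamma^{\alpha}(\lambda_{k})
  -\frac{\eta_{\lambda}}{4}\left\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2}
  +\frac{3\eta_{\lambda}}{2}
   \left(\alpha^{2}\ell_{11}^{2}\delta_{k}
   +2\left(\ell_{11}^{2}+\alpha^{2}\ell_{21}^{2}\right)r_{k}\right)
  +\frac{\ell_{\Gamma}\eta_{\lambda}^{2}}{2}
   \left(\sigma_{1}^{2}+2\alpha^{2}\sigma_{2}^{2}\right)
\end{aligned}
```

In this form the descent in $\Gamma^{\alpha}$ is limited only by the inner-loop errors $\delta_{k}$ and $r_{k}$ and by the gradient noise, which is why those two errors are bounded next.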

Next, we estimate $\delta_{k}$ and $r_{k}$. The inner variables $u$ and $w$ are updated by a randomized block coordinate method with $J$ blocks in total, where each block is chosen uniformly at random. By the strong concavity of $\mathcal{L}^{\alpha}$ with respect to $u$, we first obtain the following estimate for $\delta_{k}$:

$$
\begin{aligned}
&\mathbb{E}\left[\left\|u_{k+1}-u_{\ast}(\lambda_{k})\right\|^{2}\mid\mathcal{F}_{k},\mathcal{C}_{k}\right]
=\mathbb{E}\left[\left\|u_{k}-\alpha\eta_{u}U_{j_{t}}\nabla_{u}L_{2}\left(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}}\right)-u_{\ast}(\lambda_{k})\right\|^{2}\mid\mathcal{F}_{k},\mathcal{C}_{k}\right] \\
&=\left\|u_{\ast}(\lambda_{k})-u_{k}\right\|^{2}-2\alpha\eta_{u}\mathbb{E}\left[\left\langle u_{k}-u_{\ast}(\lambda_{k}),\nabla_{u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})\right\rangle_{j_{t}}\mid\mathcal{F}_{k},\mathcal{C}_{k}\right] \\
&\quad+\alpha^{2}\eta_{u}^{2}\mathbb{E}\left[\left\|U_{j_{t}}\nabla_{u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})\right\|^{2}\mid\mathcal{F}_{k},\mathcal{C}_{k}\right] \\
&\overset{(a)}{=}\left\|u_{\ast}(\lambda_{k})-u_{k}\right\|^{2}-\frac{2\alpha\eta_{u}}{J}\left\langle u_{k}-u_{\ast}(\lambda_{k}),\nabla_{u}L_{2}(\lambda_{k},u_{k})\right\rangle+\frac{\alpha^{2}\eta_{u}^{2}}{J}\mathbb{E}\left[\left\|\nabla_{u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})\right\|^{2}\mid\mathcal{F}_{k}\right] \\
&\overset{(b)}{\leq}\left\|u_{\ast}(\lambda_{k})-u_{k}\right\|^{2}-\frac{2\eta_{u}\alpha}{J}\left(L_{2}(\lambda_{k},u_{k})-L_{2}(\lambda_{k},u_{\ast}(\lambda_{k}))+\frac{\mu_{2}}{2}\left\|u_{\ast}(\lambda_{k})-u_{k}\right\|^{2}\right) \\
&\quad+\frac{\alpha^{2}\eta_{u}^{2}}{J}\mathbb{E}\left[\left\|\nabla_{u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})\right\|^{2}\mid\mathcal{F}_{k}\right] \\
&\overset{(c)}{=}\left\|u_{\ast}(\lambda_{k})-u_{k}\right\|^{2}-\frac{2\eta_{u}\alpha}{J}\left(L_{2}(\lambda_{k},u_{k})-L_{2}(\lambda_{k},u_{\ast}(\lambda_{k}))+\frac{\mu_{2}}{2}\left\|u_{\ast}(\lambda_{k})-u_{k}\right\|^{2}\right) \\
&\quad+\frac{\alpha^{2}\eta_{u}^{2}}{J}\mathbb{E}\left[\left\|\nabla_{u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})-\nabla_{u}L_{2}(\lambda_{k},u_{k})\right\|^{2}\mid\mathcal{F}_{k}\right]+\frac{\alpha^{2}\eta_{u}^{2}}{J}\left\|\nabla_{u}L_{2}(\lambda_{k},u_{k})\right\|^{2} \\
&\overset{(d)}{\leq}\left\|u_{\ast}(\lambda_{k})-u_{k}\right\|^{2}-\frac{2\eta_{u}\alpha}{J}\left(L_{2}(\lambda_{k},u_{k})-L_{2}(\lambda_{k},u_{\ast}(\lambda_{k}))+\frac{\mu_{2}}{2}\left\|u_{\ast}(\lambda_{k})-u_{k}\right\|^{2}\right) \\
&\quad+\frac{\alpha^{2}\eta_{u}^{2}}{J}\mathbb{E}\left[\left\|\nabla_{u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})-\nabla_{u}L_{2}(\lambda_{k},u_{k})\right\|^{2}\mid\mathcal{F}_{k}\right]+\frac{2\ell_{21}\eta_{u}^{2}\alpha^{2}}{J}\left(L_{2}(\lambda_{k},u_{k})-L_{2}(\lambda_{k},u_{\ast}(\lambda_{k}))\right) \\
&\overset{(e)}{\leq}\left(1-\frac{\alpha\mu_{2}\eta_{u}}{J}\right)\left\|u_{\ast}(\lambda_{k})-u_{k}\right\|^{2}+\frac{\alpha^{2}\eta_{u}^{2}\sigma_{2}^{2}}{JB}.
\end{aligned}
\tag{26}
$$

where $(a)$ uses the fact that, since the block index $j_{t}$ is chosen uniformly from $\left\{1,2,\cdots,J\right\}$, we have

$$
\begin{aligned}
\mathbb{E}\left[\left\langle u_{k}-u_{\ast}(\lambda_{k}),\nabla_{u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})\right\rangle_{j_{t}}\mid\mathcal{F}_{k},\mathcal{C}_{k}\right]
&=\frac{1}{J}\mathbb{E}\left[\left\langle u_{k}-u_{\ast}(\lambda_{k}),\nabla_{u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})\right\rangle\mid\mathcal{F}_{k}\right] \\
&=\frac{1}{J}\left\langle u_{k}-u_{\ast}(\lambda_{k}),\nabla_{u}L_{2}(\lambda_{k},u_{k})\right\rangle
\end{aligned}
\tag{27}
$$

and

$$
\mathbb{E}\left[\left\|U_{j_t}\nabla_u L_2(u_k, \lambda_k; \mathcal{D}_k^{\text{tr}})\right\|^2 \mid \mathcal{F}_k, \mathcal{C}_k\right] = \frac{1}{J}\mathbb{E}\left[\left\|\nabla_u L_2(\lambda_k, u_k; \mathcal{D}_k^{\text{tr}})\right\|^2 \mid \mathcal{F}_k\right]
\tag{28}
$$

$(b)$ follows from the strong convexity of $L_2$ w.r.t. $u$, which implies that

$$
L_2(\lambda_k, u_\ast(\lambda_k)) \geq L_2(\lambda_k, u_k) + \left\langle \nabla_u L_2(\lambda_k, u_k), u_\ast(\lambda_k) - u_k\right\rangle + \frac{\mu_2}{2}\left\|u_k - u_\ast(\lambda_k)\right\|^2,
$$

$(c)$ uses the unbiasedness relation $\mathbb{E}\left[\nabla_u L_2(\lambda_k, u_k; \mathcal{D}_k^{\text{tr}}) \mid \mathcal{F}_k\right] = \nabla_u L_2(\lambda_k, u_k)$, which implies that

$$
\mathbb{E}\left[\left\|\nabla_u L_2(\lambda_k, u_k; \mathcal{D}_k^{\text{str}})\right\|^2 \mid \mathcal{F}_k\right] = \mathbb{E}\left[\left\|\nabla_u L_2(\lambda_k, u_k; \mathcal{D}_k^{\text{str}}) - \nabla_u L_2(\lambda_k, u_k)\right\|^2 \mid \mathcal{F}_k\right] + \left\|\nabla_u L_2(\lambda_k, u_k)\right\|^2
\tag{29}
$$

and $(d)$ uses the optimality of $u_\ast(\lambda)$ and the smoothness of $L_2$, such that

$$
\begin{aligned}
L_2(\lambda_k, u_\ast(\lambda_k)) - L_2(\lambda_k, u_k)
&\leq L_2(\lambda_k, \tilde{u}) - L_2(\lambda_k, u_k) \\
&\leq L_2(\lambda_k, u_k) + \left\langle \nabla_u L_2(\lambda_k, u_k), \tilde{u} - u_k\right\rangle + \frac{\ell_{21}}{2}\left\|\tilde{u} - u_k\right\|^2 - L_2(\lambda_k, u_k) \\
&= -\frac{1}{2\ell_{21}}\left\|\nabla_u L_2(\lambda_k, u_k)\right\|^2
\end{aligned}
\tag{30}
$$

where $\tilde{u} = u_k - \frac{1}{\ell_{21}}\nabla_u L_2(\lambda_k, u_k)$, and $(e)$ uses

$$
\mathbb{E}\left[\left\|\nabla_u L_2(\lambda_k, u_k; \mathcal{D}_k^{\text{str}}) - \nabla_u L_2(\lambda_k, u_k)\right\|^2 \mid \mathcal{F}_k\right] \leq \frac{\sigma_2^2}{B}.
\tag{31}
$$

and the step-size condition $\eta_u \leq 1/(\alpha\ell_{21})$. We then derive the following recursive estimate for $\delta_k$:

$$
\begin{aligned}
\delta_{k+1} ={} & \left\|u_\ast(\lambda_{k+1}) - u_{k+1}\right\|^2 = \left\|u_\ast(\lambda_{k+1}) - u_\ast(\lambda_k) + u_\ast(\lambda_k) - u_{k+1}\right\|^2 \\
\overset{(a)}{\leq}{} & (1+\gamma_1)\left\|u_\ast(\lambda_{k+1}) - u_\ast(\lambda_k)\right\|^2 + (1+1/\gamma_1)\left\|u_\ast(\lambda_k) - u_{k+1}\right\|^2 \\
\overset{(b)}{\leq}{} & (1+\gamma_1)\kappa^2\left\|\lambda_{k+1} - \lambda_k\right\|^2 + (1+1/\gamma_1)\left\|u_\ast(\lambda_k) - u_{k+1}\right\|^2 \\
\overset{(c)}{\leq}{} & (1+\gamma_1)\kappa^2\left\|\lambda_{k+1} - \lambda_k\right\|^2 + (1+1/\gamma_1)\left(\left(1 - \frac{\alpha\mu_2\eta_u}{J}\right)\delta_k + \frac{\alpha^2\eta_u^2\sigma_2^2}{JB}\right) \\
\overset{(d)}{\leq}{} & (1+\gamma_1)\kappa^2\eta_\lambda^2\left\|\nabla_\lambda \mathcal{L}^\alpha(u_k, \omega_k, \lambda_k)\right\|^2 + (1+1/\gamma_1)\left(1 - \frac{\alpha\mu_2\eta_u}{J}\right)\delta_k + (1+1/\gamma_1)\frac{\alpha^2\eta_u^2\sigma_2^2}{JB} \\
\overset{(e)}{\leq}{} & (1+\gamma_1)\kappa^2\eta_\lambda^2\left(2\left\|\nabla\Gamma^\alpha(\lambda_k)\right\|^2 + 4\alpha^2\ell_{21}^2\delta_k + 8\left(\ell_{11}^2 + \alpha^2\ell_{21}^2\right)r_k\right) + (1+1/\gamma_1)\left(1 - \frac{\alpha\mu_2\eta_u}{J}\right)\delta_k \\
& + (1+1/\gamma_1)\frac{\alpha^2\eta_u^2\sigma_2^2}{JB} \\
={} & \left(4\alpha^2(1+\gamma_1)\kappa^2\eta_\lambda^2\ell_{21}^2 + (1+1/\gamma_1)\left(1 - \frac{\alpha\mu_2\eta_u}{J}\right)\right)\delta_k + 8(1+\gamma_1)\kappa^2\eta_\lambda^2\left(\ell_{11}^2 + \alpha^2\ell_{21}^2\right)r_k \\
& + 2(1+\gamma_1)\kappa^2\eta_\lambda^2\left\|\nabla\Gamma^\alpha(\lambda_k)\right\|^2 + (1+1/\gamma_1)\frac{\alpha^2\eta_u^2\sigma_2^2}{JB}.
\end{aligned}
\tag{32}
$$

where $(a)$ follows from the Cauchy-Schwarz inequality with $\gamma_1 > 0$; $(b)$ uses the Lipschitz continuity of $u_\ast$ from Lemma 5; $(c)$ follows from the inequality (D); $(d)$ uses the update rule for $\lambda_{k+1}$; and $(e)$ follows from the inequality (D).
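As a sanity check on the sampling facts used above, the following minimal Python sketch verifies the block-coordinate identities (27) and (28), the variance decomposition (29), and the $1/B$ mini-batch scaling in (31) by exact enumeration. The toy vectors, the two-block partition, and the helper names (`mask_block`, `sq_norm`) are illustrative assumptions, not part of the paper; the mini-batch is assumed to be drawn i.i.d. uniformly with replacement.

```python
from fractions import Fraction
from itertools import product

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_norm(a):
    return dot(a, a)

def mask_block(g, block):
    """U_j g: keep the coordinates in `block`, zero out the rest."""
    return [x if i in block else Fraction(0) for i, x in enumerate(g)]

# Hypothetical toy data: 4 coordinates split into J = 2 blocks.
blocks = [{0, 1}, {2, 3}]
J = len(blocks)
a = [Fraction(1), Fraction(-2), Fraction(3), Fraction(1, 2)]  # stands in for u_k - u_*(lambda_k)
g = [Fraction(2), Fraction(1), Fraction(-1), Fraction(4)]     # stands in for a gradient

# (27): E_j <a, U_j g> = <a, g> / J for a uniformly drawn block j.
lhs_inner = sum(Fraction(1, J) * dot(a, mask_block(g, b)) for b in blocks)
# (28): E_j ||U_j g||^2 = ||g||^2 / J.
lhs_norm = sum(Fraction(1, J) * sq_norm(mask_block(g, b)) for b in blocks)

# (29) and (31): mini-batch of size B drawn i.i.d. with replacement (assumption).
samples = [[Fraction(1), Fraction(0)],
           [Fraction(3), Fraction(2)],
           [Fraction(-1), Fraction(4)]]  # hypothetical per-sample gradients
n, B, dim = len(samples), 2, 2
mean = [sum(s[i] for s in samples) / n for i in range(dim)]
sigma2 = sum(sq_norm([s[i] - mean[i] for i in range(dim)]) for s in samples) / n

second_moment = Fraction(0)
variance = Fraction(0)
for idx in product(range(n), repeat=B):  # every ordered mini-batch, equally likely
    batch_grad = [sum(samples[j][i] for j in idx) / B for i in range(dim)]
    p = Fraction(1, n ** B)
    second_moment += p * sq_norm(batch_grad)
    variance += p * sq_norm([batch_grad[i] - mean[i] for i in range(dim)])

print(lhs_inner == dot(a, g) / J, lhs_norm == sq_norm(g) / J)
print(second_moment == variance + sq_norm(mean), variance == sigma2 / B)
```

Exact rational arithmetic is used so that the identities hold with equality rather than up to floating-point error.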

Note that $L_1 + \alpha L_2$ is strongly convex with respect to $\omega$ with parameter $\alpha\mu_2/2$ if $\alpha \geq 2\ell_{21}/\mu_2$. Hence, similarly to $\delta_k$, we obtain the following result for $r_k$:

$$
\mathbb{E}\left[\left\|\omega_\ast^\alpha(\lambda_k) - \omega_{k+1}\right\|^2 \mid \mathcal{F}_k, \mathcal{C}_k\right] \leq \left(1 - \frac{\alpha\mu_2\eta_w}{2J}\right)r_k + \frac{\eta_w^2\left(\sigma_1^2 + \alpha^2\sigma_2^2\right)}{JB}
\tag{33}
$$
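The $(1 - c\,\eta/J)$ contraction factor in bounds of this type can be illustrated on a toy problem. The sketch below is a simplified, noise-free analogue and not the paper's setting: it assumes a diagonal quadratic $L(u) = \frac{1}{2}\sum_i a_i u_i^2$ (so the minimizer is $0$), exact gradients, and one uniformly drawn block per step. It computes the expected squared distance to the minimizer after one block step exactly and compares it with $(1 - \eta\mu/J)$ times the current squared distance, where $\mu$ and $\ell$ play the roles of the strong-convexity and smoothness constants. All names and values are hypothetical.

```python
from fractions import Fraction

def expected_sq_dist_after_step(u, curv, blocks, eta):
    """Exact E_j ||u_next||^2 for L(u) = 0.5 * sum_i curv[i] * u[i]^2 (minimizer 0),
    when one block j, drawn uniformly, takes the gradient step
    u_i <- u_i - eta * curv[i] * u_i and all other coordinates stay fixed."""
    J = len(blocks)
    total = Fraction(0)
    for block in blocks:
        u_next = [(1 - eta * c) * x if i in block else x
                  for i, (x, c) in enumerate(zip(u, curv))]
        total += Fraction(1, J) * sum(x * x for x in u_next)
    return total

# Hypothetical toy instance: curvatures give mu = 1 and ell = 4.
curv = [Fraction(1), Fraction(2), Fraction(3), Fraction(4)]
blocks = [{0, 1}, {2, 3}]
u = [Fraction(1), Fraction(-1), Fraction(2), Fraction(1, 2)]

mu, ell, J = min(curv), max(curv), len(blocks)
eta = 1 / ell                       # step size satisfying eta <= 1 / ell
before = sum(x * x for x in u)      # current squared distance to the minimizer
after = expected_sq_dist_after_step(u, curv, blocks, eta)
bound = (1 - eta * mu / J) * before

print(after, bound, after <= bound)
```

In this toy case each coordinate contracts by $1 - \frac{1}{J}\left(2\eta a_i - \eta^2 a_i^2\right)$ in expectation, and $\eta a_i \leq 1$ gives $2\eta a_i - \eta^2 a_i^2 \geq \eta\mu$, which is why the bound holds; the stochastic terms of order $\eta^2\sigma^2/(JB)$ in (33) do not appear because the gradients here are exact.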

Following the same procedure as in (D), we estimate the recursion for $r_k$ as follows:

$$
\begin{aligned}
r_{k+1} &\leq (1+\gamma_2)\left\|\omega_\ast^\alpha(\lambda_{k+1}) - \omega_\ast^\alpha(\lambda_k)\right\|^2 + (1+\gamma_2^{-1})\left\|\omega_\ast^\alpha(\lambda_k) - \omega_{k+1}\right\|^2 \\
&\leq (1+\gamma_2)\kappa^2\left\|\lambda_{k+1} - \lambda_k\right\|^2 + (1+\gamma_2^{-1})\left(\left(1 - \frac{\alpha\mu_2\eta_w}{2J}\right)r_k + \frac{\eta_w^2\left(\sigma_1^2 + \alpha^2\sigma_2^2\right)}{JB}\right) \\
&\leq \left(4(1+\gamma_2)\kappa^2\eta_\lambda^2\left(\ell_{11}^2 + \alpha^2\ell_{21}^2\right) + (1+1/\gamma_2)\left(1 - \frac{\alpha\mu_2\eta_w}{2J}\right)\right)r_k + 8(1+\gamma_2)\kappa^2\eta_\lambda^2\alpha^2\ell_{21}^2\delta_k \\
&\quad + 2(1+\gamma_2)\kappa^2\eta_\lambda^2\left\|\nabla\Gamma^\alpha(\lambda_k)\right\|^2 + (1+1/\gamma_2)\frac{\eta_w^2\left(\sigma_1^2 + \alpha^2\sigma_2^2\right)}{JB}
\end{aligned}
\tag{34}
$$

where $\gamma_{2}>0$.

We define the Lyapunov function

\begin{align*}
R_{k}=\Gamma^{\alpha}(\lambda_{k})-\Gamma_{\min}^{\alpha}+\xi_{1,k}\delta_{k}+\xi_{2,k}r_{k} \tag{35}
\end{align*}

where $\xi_{1,k},\xi_{2,k}>0$ are non-increasing sequences and $\Gamma_{\min}^{\alpha}$ is the minimum of $\Gamma^{\alpha}$, so that $R_{k}\geq 0$. Combining the three preceding bounds (on $\Gamma^{\alpha}(\lambda_{k+1})$, $\delta_{k+1}$, and $r_{k+1}$) gives

\begin{align*}
\mathbb{E}[R_{k+1}\mid\mathcal{F}_{k},\mathcal{C}_{k}]
&\leq R_{k}-\frac{\eta_{\lambda}}{2}\left\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2}+\frac{\ell_{\Gamma}\eta_{\lambda}^{2}}{2}\left(2\left\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2}+4\alpha^{2}\ell_{21}^{2}\delta_{k}+8\left(\ell_{11}^{2}+\alpha^{2}\ell_{21}^{2}\right)r_{k}\right) \\
&\quad +\eta_{\lambda}\left(\alpha^{2}\ell_{21}^{2}\delta_{k}+2\left(\ell_{11}^{2}+\alpha^{2}\ell_{21}^{2}\right)r_{k}\right)+\frac{\ell_{\Gamma}\eta_{\lambda}^{2}}{2}\left(\sigma_{1}^{2}+2\alpha^{2}\sigma_{2}^{2}\right) \\
&\quad +\left(\xi_{1,k+1}\delta_{k+1}-\xi_{1,k}\delta_{k}\right)+\left(\xi_{2,k+1}r_{k+1}-\xi_{2,k}r_{k}\right) \\
&\leq R_{k}-\left(\frac{\eta_{\lambda}}{2}-\ell_{\Gamma}\eta_{\lambda}^{2}-2\xi_{1,k+1}(1+\gamma_{1})\kappa^{2}\eta_{\lambda}^{2}-2\xi_{2,k+1}(1+\gamma_{2})\kappa^{2}\eta_{\lambda}^{2}\right)\left\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2}+\phi_{1}\delta_{k}+\phi_{2}r_{k} \\
&\quad +\frac{\ell_{\Gamma}\eta_{\lambda}^{2}}{2}\left(\sigma_{1}^{2}+2\alpha^{2}\sigma_{2}^{2}\right)+\xi_{1,k+1}(1+\gamma_{1}^{-1})\frac{\alpha^{2}\eta_{u}^{2}\sigma_{2}^{2}}{JB}+\xi_{2,k+1}(1+\gamma_{2}^{-1})\frac{\eta_{w}^{2}\left(\sigma_{1}^{2}+\alpha^{2}\sigma_{2}^{2}\right)}{JB} \tag{36}
\end{align*}

where

\begin{align*}
\phi_{1} &=\xi_{1,k+1}\left(4\alpha^{2}(1+\gamma_{1})\kappa^{2}\eta_{\lambda}^{2}\ell_{21}^{2}+(1+1/\gamma_{1})\left(1-\frac{\alpha\mu_{2}\eta_{u}}{J}\right)\right)-\xi_{1,k}+2\ell_{\Gamma}\eta_{\lambda}^{2}\alpha^{2}\ell_{21}^{2}+\eta_{\lambda}\alpha^{2}\ell_{21}^{2} \\
&\quad +8\xi_{2,k+1}(1+\gamma_{2})\kappa^{2}\eta_{\lambda}^{2}\left(\ell_{11}^{2}+\alpha^{2}\ell_{21}^{2}\right) \\
\phi_{2} &=\xi_{2,k+1}\left(4(1+\gamma_{2})\kappa^{2}\eta_{\lambda}^{2}\left(\ell_{11}^{2}+\alpha^{2}\ell_{21}^{2}\right)+(1+1/\gamma_{2})\left(1-\frac{\alpha\mu_{2}\eta_{w}}{2J}\right)\right)-\xi_{2,k}+4\ell_{\Gamma}\eta_{\lambda}^{2}\left(\ell_{11}^{2}+\alpha^{2}\ell_{21}^{2}\right) \\
&\quad +2\eta_{\lambda}\left(\ell_{11}^{2}+\alpha^{2}\ell_{21}^{2}\right)+8\xi_{1,k+1}(1+\gamma_{1})\kappa^{2}\eta_{\lambda}^{2}\alpha^{2}\ell_{21}^{2}. \tag{37}
\end{align*}

Let $\eta_{u}=\eta_{w}=\eta_{0}/K^{a}$, $\eta_{\lambda}=\eta_{\lambda}^{0}/K^{b}$, and $\alpha=K^{c}$, where $0\leq a\leq b$ and $c>0$, and let $\ell=\max\left\{\ell_{11},\ell_{21}\right\}$. Then $\phi_{1}$ and $\phi_{2}$ can be rewritten as:

\begin{align*}
\phi_{1} &=\xi_{1,k+1}\left(\frac{4(1+\gamma_{1})\kappa^{2}(\eta_{\lambda}^{0})^{2}}{K^{2(b-c)}}\ell^{2}+(1+1/\gamma_{1})\left(1-\frac{\mu_{2}\eta_{0}}{JK^{(a-c)}}\right)\right)-\xi_{1,k}+\frac{2\ell_{\Gamma}\ell^{2}(\eta_{\lambda}^{0})^{2}}{K^{2(b-c)}}+\frac{\ell^{2}\eta_{\lambda}^{0}}{K^{(b-2c)}} \\
&\quad +\frac{8\xi_{2,k+1}(1+\gamma_{2})\kappa^{2}\ell^{2}(\eta_{\lambda}^{0})^{2}}{K^{2(b-c)}} \\
\phi_{2} &=\xi_{2,k+1}\left(\frac{4(1+\gamma_{2})\kappa^{2}(\eta_{\lambda}^{0})^{2}}{K^{2(b-c)}}\ell^{2}+(1+1/\gamma_{2})\left(1-\frac{\mu_{2}\eta_{0}}{2JK^{(a-c)}}\right)\right)-\xi_{2,k}+\frac{4\ell_{\Gamma}\ell^{2}(\eta_{\lambda}^{0})^{2}}{K^{2(b-c)}}+\frac{2\ell^{2}\eta_{\lambda}^{0}}{K^{(b-2c)}} \\
&\quad +\frac{8\xi_{1,k+1}(1+\gamma_{1})\kappa^{2}\ell^{2}(\eta_{\lambda}^{0})^{2}}{K^{2(b-c)}}. \tag{38}
\end{align*}
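The powers of $K$ in (38) come from three substitution identities: $\alpha^{2}\eta_{\lambda}^{2}=(\eta_{\lambda}^{0})^{2}/K^{2(b-c)}$, $\alpha\eta_{u}=\eta_{0}/K^{(a-c)}$, and $\alpha^{2}\eta_{\lambda}=\eta_{\lambda}^{0}/K^{(b-2c)}$. The following is a minimal sketch that checks them with exact rational arithmetic. The helper `schedule` and the numeric values are hypothetical and chosen only for illustration; they are not taken from the paper's experiments.

```python
from fractions import Fraction


def schedule(K, a, b, c, eta0, eta_lam0):
    """Return (eta_u, eta_lam, alpha) for eta_u = eta0/K^a, eta_lam = eta_lam0/K^b, alpha = K^c."""
    K = Fraction(K)
    return eta0 / K**a, eta_lam0 / K**b, K**c


# Hypothetical values; integer exponents keep the arithmetic exact.
K, a, b, c = 10, 1, 3, 2
eta0, eta_lam0 = Fraction(1, 2), Fraction(1, 8)
eta_u, eta_lam, alpha = schedule(K, a, b, c, eta0, eta_lam0)

Kf = Fraction(K)
identities = {
    "alpha^2 * eta_lam^2 == eta_lam0^2 / K^(2(b-c))":
        alpha**2 * eta_lam**2 == eta_lam0**2 / Kf**(2 * (b - c)),
    "alpha * eta_u == eta0 / K^(a-c)":
        alpha * eta_u == eta0 / Kf**(a - c),
    "alpha^2 * eta_lam == eta_lam0 / K^(b-2c)":
        alpha**2 * eta_lam == eta_lam0 / Kf**(b - 2 * c),
}

for name, holds in identities.items():
    print(name, holds)
```

Each identity is a pure exponent computation, so the check does not depend on the particular values of $K$, $a$, $b$, $c$ used here.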

To achieve $\phi_{1}\leq 0$ and $\phi_{2}\leq 0$, we may set $\gamma_{1}=\gamma_{2}=4JK^{(a-c)}/(\mu_{2}\eta_{0})-1$; then

\begin{align*}
(1+1/\gamma_{1})\left(1-\frac{\mu_{2}\eta_{0}}{JK^{(a-c)}}\right) &\leq 1-\frac{3\mu_{2}\eta_{0}}{4JK^{(a-c)}} \\
(1+1/\gamma_{2})\left(1-\frac{\mu_{2}\eta_{0}}{2JK^{(a-c)}}\right) &\leq 1-\frac{\mu_{2}\eta_{0}}{4JK^{(a-c)}}. \tag{39}
\end{align*}
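To see why this choice of $\gamma_{1},\gamma_{2}$ yields (39), write $x=\mu_{2}\eta_{0}/(4JK^{(a-c)})$, so that $\gamma_{1}=\gamma_{2}=1/x-1$ and $1+1/\gamma=1/(1-x)$. The two inequalities then read $(1-4x)/(1-x)\leq 1-3x$ and $(1-2x)/(1-x)\leq 1-x$, which hold because $(1-3x)(1-x)=1-4x+3x^{2}$ and $(1-x)^{2}=1-2x+x^{2}$. The sketch below is a numerical sanity check of this reasoning on a grid; the variable `x` and the range $(0,1/2]$ are assumptions introduced here for illustration.

```python
from fractions import Fraction


def gamma_for(x):
    """gamma = 1/x - 1, where x stands for mu_2 * eta_0 / (4 * J * K^(a-c))."""
    return 1 / x - 1


def check_39(x):
    """Return True if both inequalities of (39) hold for this x."""
    g = gamma_for(x)
    lhs_u = (1 + 1 / g) * (1 - 4 * x)  # first line of (39)
    lhs_w = (1 + 1 / g) * (1 - 2 * x)  # second line of (39)
    return lhs_u <= 1 - 3 * x and lhs_w <= 1 - x


# Exact rational grid over x in (0, 1/2].
xs = [Fraction(i, 200) for i in range(1, 101)]
all_ok = all(check_39(x) for x in xs)
print(all_ok)
```

Exact fractions are used so that the boundary case $x=1/2$ (where $\gamma=1$) is not affected by floating-point rounding.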

For $\eta_{0}\leq\frac{2J}{\mu_{2}}$, we have $\frac{\mu_{2}\eta_{0}}{4J}\leq\frac{1}{2}$. Since $\xi_{1,k}$ and $\xi_{2,k}$ are non-increasing sequences, $\xi_{1,k}\geq\xi_{1,k+1}$ and $\xi_{2,k}\geq\xi_{2,k+1}$, and therefore

\begin{align*}
\phi_{1} &\leq \xi_{1,k}\left(1+\frac{(\eta_{\lambda}^{0})^{2}\ell^{2}\kappa^{2}J}{\mu_{2}\eta_{0}K^{(2b-c-a)}}-\frac{3\mu_{2}\eta_{0}}{4JK^{(a-c)}}\right)-\xi_{1,k}+\frac{2\ell_{\Gamma}\ell^{2}(\eta_{\lambda}^{0})^{2}}{K^{2(b-c)}}+\frac{\ell^{2}\eta_{\lambda}^{0}}{K^{(b-2c)}}+\xi_{2,k}\frac{8J(\eta_{\lambda}^{0})^{2}\ell^{2}\kappa^{2}}{\mu_{2}\eta_{0}K^{(2b-c-a)}}\leq 0 \\
\phi_{2} &\leq \xi_{2,k}\left(1+\frac{(\eta_{\lambda}^{0})^{2}\ell^{2}\kappa^{2}J}{\mu_{2}\eta_{0}K^{(2b-c-a)}}-\frac{\mu_{2}\eta_{0}}{4JK^{(a-c)}}\right)-\xi_{2,k}+\frac{4\ell_{\Gamma}\ell^{2}(\eta_{\lambda}^{0})^{2}}{K^{2(b-c)}}+\frac{2\ell^{2}\eta_{\lambda}^{0}}{K^{(b-2c)}}+\xi_{1,k}\frac{8J(\eta_{\lambda}^{0})^{2}\ell^{2}\kappa^{2}}{\mu_{2}\eta_{0}K^{(2b-c-a)}}\leq 0
\end{align*}

If $\eta_{\lambda}^{0}\leq 1/(2\ell_{\Gamma})$ and $\eta_{0}/\eta_{\lambda}^{0}\geq 6\sqrt{2}\kappa^{2}J$, then for $b\geq a$ and $k>1$ we have

\begin{align*}
\frac{2\ell_{\Gamma}\ell^{2}(\eta_{\lambda}^{0})^{2}}{K^{2(b-c)}}\leq\frac{\ell^{2}\eta_{\lambda}^{0}}{K^{(b-2c)}},\qquad \frac{9(\eta_{\lambda}^{0})^{2}\ell^{2}\kappa^{2}J}{\mu_{2}\eta_{0}}\leq\frac{\mu_{2}\eta_{0}}{8J}.
\end{align*}
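
As a short sanity check, both inequalities follow directly from the two step-size conditions. The sketch below assumes that $\kappa=\ell/\mu_{2}$ denotes the condition number and that $K\geq 1$ and $b\geq 0$; these conventions are assumptions made for illustration and should be checked against the definitions used earlier in the paper.

\begin{align*}
\frac{2\ell_{\Gamma}\ell^{2}(\eta_{\lambda}^{0})^{2}}{K^{2(b-c)}}\Big/\frac{\ell^{2}\eta_{\lambda}^{0}}{K^{(b-2c)}}
&=\frac{2\ell_{\Gamma}\eta_{\lambda}^{0}}{K^{b}}\leq\frac{1}{K^{b}}\leq 1
&&\text{(using } \eta_{\lambda}^{0}\leq 1/(2\ell_{\Gamma})\text{)},\\
\frac{9(\eta_{\lambda}^{0})^{2}\ell^{2}\kappa^{2}J}{\mu_{2}\eta_{0}}\leq\frac{\mu_{2}\eta_{0}}{8J}
&\iff\left(\frac{\eta_{0}}{\eta_{\lambda}^{0}}\right)^{2}\geq\frac{72\,\ell^{2}\kappa^{2}J^{2}}{\mu_{2}^{2}}=72\,\kappa^{4}J^{2}
&&\text{(assuming } \kappa=\ell/\mu_{2}\text{)},\\
&\iff\frac{\eta_{0}}{\eta_{\lambda}^{0}}\geq 6\sqrt{2}\,\kappa^{2}J.
\end{align*}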

The inequalities for $\phi_{1},\phi_{2}$ can then be simplified to

\begin{align}
\phi_{1} &\leq\xi_{1,k}\left(1-\frac{53\mu_{2}\eta_{0}}{72JK^{(a-c)}}\right)-\xi_{1,k}+\frac{\ell^{2}\eta_{\lambda}^{0}}{K^{(b-2c)}}+\xi_{2,k}\frac{\mu_{2}\eta_{0}}{9JK^{(a-c)}}\leq 0 \tag{40}\\
\phi_{2} &\leq\xi_{2,k}\left(1-\frac{17\mu_{2}\eta_{0}}{72JK^{(a-c)}}\right)-\xi_{2,k}+\frac{2\ell^{2}\eta_{\lambda}^{0}}{K^{(b-2c)}}+\xi_{1,k}\frac{\mu_{2}\eta_{0}}{9JK^{(a-c)}}\leq 0 \tag{41}
\end{align}

Solving the above inequalities, we can set

\begin{align*}
\xi_{1,k} &=\frac{-\frac{53\mu_{2}\eta_{0}}{72JK^{(a-c)}}\frac{2\ell^{2}\eta_{\lambda}^{0}}{K^{(b-2c)}}-\frac{\ell^{2}\eta_{\lambda}^{0}}{K^{(b-2c)}}\frac{\mu_{2}\eta_{0}}{9JK^{(a-c)}}}{\frac{\mu_{2}\eta_{0}}{9JK^{(a-c)}}\frac{\mu_{2}\eta_{0}}{9JK^{(a-c)}}-\frac{17\mu_{2}\eta_{0}}{72JK^{(a-c)}}\frac{53\mu_{2}\eta_{0}}{72JK^{(a-c)}}}=\frac{\frac{114\ell^{2}\eta_{\lambda}^{0}}{K^{(b-2c)}}}{\frac{837\mu_{2}\eta_{0}}{72JK^{(a-c)}}}=\frac{10\ell^{2}\eta_{\lambda}^{0}}{\mu_{2}\eta_{0}}\frac{J}{K^{(b-a-c)}}\\
\xi_{2,k} &=\frac{-\frac{17\mu_{2}\eta_{0}}{72JK^{(a-c)}}\frac{\ell^{2}\eta_{\lambda}^{0}}{K^{(b-2c)}}-\frac{2\ell^{2}\eta_{\lambda}^{0}}{K^{(b-2c)}}\frac{\mu_{2}\eta_{0}}{9JK^{(a-c)}}}{\frac{\mu_{2}\eta_{0}}{9JK^{(a-c)}}\frac{\mu_{2}\eta_{0}}{9JK^{(a-c)}}-\frac{17\mu_{2}\eta_{0}}{72JK^{(a-c)}}\frac{53\mu_{2}\eta_{0}}{72JK^{(a-c)}}}=\frac{3\ell^{2}\eta_{\lambda}^{0}}{\mu_{2}\eta_{0}}\frac{J}{K^{(b-a-c)}}
\end{align*}
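
The arithmetic behind the final constants can be spelled out as follows. This is only an evaluation of the quotients displayed above, using the shorthand $A=\mu_{2}\eta_{0}/(JK^{(a-c)})$ and $c_{0}=\ell^{2}\eta_{\lambda}^{0}/K^{(b-2c)}$, both of which are introduced here for illustration and do not appear in the original text. The closing equalities in the display should be read as rounding the numerical constants up to $10$ and $3$.

\begin{align*}
\text{common denominator:}\quad &\left(\frac{1}{81}-\frac{17\cdot 53}{72^{2}}\right)A^{2}=\frac{64-901}{5184}A^{2}=-\frac{837}{5184}A^{2},\\
\text{first quotient:}\quad &\frac{-\left(\frac{106}{72}+\frac{8}{72}\right)A\,c_{0}}{-\frac{837}{5184}A^{2}}=\frac{114\cdot 72}{837}\,\frac{c_{0}}{A}\approx 9.81\,\frac{c_{0}}{A}\leq 10\,\frac{c_{0}}{A},\\
\text{second quotient:}\quad &\frac{-\left(\frac{17}{72}+\frac{16}{72}\right)A\,c_{0}}{-\frac{837}{5184}A^{2}}=\frac{33\cdot 72}{837}\,\frac{c_{0}}{A}\approx 2.84\,\frac{c_{0}}{A}\leq 3\,\frac{c_{0}}{A},\\
\text{power of } K:\quad &\frac{c_{0}}{A}=\frac{\ell^{2}\eta_{\lambda}^{0}}{K^{(b-2c)}}\cdot\frac{JK^{(a-c)}}{\mu_{2}\eta_{0}}=\frac{\ell^{2}\eta_{\lambda}^{0}}{\mu_{2}\eta_{0}}\frac{J}{K^{(b-a-c)}}.
\end{align*}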

to guarantee that $\phi_{1}\leq 0$ and $\phi_{2}\leq 0$. The main inequality (D) can then be bounded as

\begin{align*}
\mathbb{E}[R_{k+1}\mid\mathcal{F}_{k},\mathcal{C}_{k}] &\leq R_{k}-\left(\frac{\eta_{\lambda}}{2}-\ell_{\Gamma}\eta_{\lambda}^{2}-2\xi_{1,k+1}(1+\gamma_{1})\kappa^{2}\eta_{\lambda}^{2}-2\xi_{2,k+1}(1+\gamma_{2})\kappa^{2}\eta_{\lambda}^{2}\right)\left\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2}\\
&\quad+\frac{\ell_{\Gamma}\eta_{\lambda}^{2}}{2}\left(\sigma_{1}^{2}+2\alpha^{2}\sigma_{2}^{2}\right)+\xi_{1,k+1}(1+\gamma_{1}^{-1})\frac{\alpha^{2}\eta_{u}^{2}\sigma_{2}^{2}}{JB}+\xi_{2,k+1}(1+\gamma_{2}^{-1})\frac{\eta_{w}^{2}\left(\sigma_{1}^{2}+\alpha^{2}\sigma_{2}^{2}\right)}{JB}.
\end{align*}

If we set $\eta_{\lambda}^{0}\leq 1/(8\ell_{\Gamma})$, then $\ell_{\Gamma}\eta_{\lambda}^{2}\leq\frac{\eta_{\lambda}}{8}$. For $b\geq a$ and $k\geq 1$, if we additionally set $\eta_{0}/\eta_{\lambda}^{0}\geq 8\sqrt{3}\kappa^{2}J$, then

\begin{align*}
\xi_{1,k}(1+\gamma_{1})\kappa^{2}\eta_{\lambda} &\leq\frac{40(\eta_{\lambda}^{0})^{2}K^{(a-c)}\ell^{2}\kappa^{2}J^{2}}{\mu_{2}^{2}\eta_{0}^{2}K^{b}K^{(b-a-c)}}=\frac{40(\eta_{\lambda}^{0})^{2}\ell^{2}\kappa^{2}J^{2}}{\mu_{2}^{2}\eta_{0}^{2}K^{(2b-2a)}}\leq\frac{1}{16}\\
\xi_{2,k}(1+\gamma_{2})\kappa^{2}\eta_{\lambda} &\leq\frac{12(\eta_{\lambda}^{0})^{2}K^{(a-c)}\ell^{2}\kappa^{2}J^{2}}{\mu_{2}^{2}\eta_{0}^{2}K^{b}K^{(b-a-c)}}=\frac{12(\eta_{\lambda}^{0})^{2}\ell^{2}\kappa^{2}J^{2}}{\mu_{2}^{2}\eta_{0}^{2}K^{(2b-2a)}}\leq\frac{1}{16}.
\end{align*}

Then

\begin{align*}
\mathbb{E}[R_{k+1}\mid\mathcal{F}_{k},\mathcal{C}_{k}] &\leq R_{k}-\frac{\eta_{\lambda}}{4}\left\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2}+\frac{\ell_{\Gamma}\eta_{\lambda}^{2}}{2}\left(\sigma_{1}^{2}+2\alpha^{2}\sigma_{2}^{2}\right)+\frac{8\xi_{1,k+1}}{7}\frac{\alpha^{2}\eta_{u}^{2}\sigma_{2}^{2}}{JB}+\frac{3\xi_{2,k+1}}{4}\frac{\eta_{w}^{2}\left(\sigma_{1}^{2}+\alpha^{2}\sigma_{2}^{2}\right)}{JB}.
\end{align*}
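
To make the role of this one-step bound explicit, it can be rearranged so that the squared gradient norm is isolated. In the sketch below, $N_{k}$ is a shorthand (introduced here, not in the original) for the sum of the three noise terms on the right-hand side, and $R_{K+1}\geq 0$ is assumed so that the last term of the sum can be dropped.

\begin{align*}
\frac{\eta_{\lambda}}{4}\left\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2}
&\leq R_{k}-\mathbb{E}[R_{k+1}\mid\mathcal{F}_{k},\mathcal{C}_{k}]+N_{k},\\
\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\left[\left\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2}\right]
&\leq\frac{4}{K\eta_{\lambda}}\left(\mathbb{E}[R_{1}]-\mathbb{E}[R_{K+1}]\right)+\frac{4}{K\eta_{\lambda}}\sum_{k=1}^{K}N_{k}
\leq\frac{4\,\mathbb{E}[R_{1}]}{K\eta_{\lambda}}+\frac{4}{K\eta_{\lambda}}\sum_{k=1}^{K}N_{k}.
\end{align*}

The second line takes total expectations and sums over $k=1,\dots,K$; consecutive $R_{k}$ terms cancel, leaving only the first and last.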

Telescoping the above inequality gives

\begin{align*}
\mathbb{E}\left[\left\|\nabla\Gamma^{\alpha}(\tilde{\lambda})\right\|^{2}\right] &=\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\left[\left\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2}\right]\\
&\leq\frac{4}{K\eta_{\lambda}}\sum_{k=1}^{K}\left(\mathbb{E}[R_{k}\mid\mathcal{F}_{k-1},\mathcal{C}_{k-1}]-\mathbb{E}[R_{k+1}\mid\mathcal{F}_{k},\mathcal{C}_{k}]\right)\\
&\quad+\frac{4}{K\eta_{\lambda}}\sum_{k=1}^{K}\left(\frac{\ell_{\Gamma}\eta_{\lambda}^{2}}{2}\left(\sigma_{1}^{2}+2\alpha^{2}\sigma_{2}^{2}\right)+\frac{8\xi_{1,k+1}}{7}\frac{\alpha^{2}\eta_{u}^{2}\sigma_{2}^{2}}{JB}+\frac{3\xi_{2,k+1}}{4}\frac{\eta_{w}^{2}\left(\sigma_{1}^{2}+\alpha^{2}\sigma_{2}^{2}\right)}{JB}\right)\\
&\leq\frac{4\mathbb{E}[R_{1}]K^{b}}{\eta_{\lambda}^{0}K}+\frac{4K^{b}}{\eta_{\lambda}^{0}}\frac{\ell_{\Gamma}(\eta_{\lambda}^{0})^{2}\left(\sigma_{1}^{2}+K^{2c}\sigma_{2}^{2}\right)}{2K^{2b}}\\
&\quad+\frac{4K^{b}}{\eta_{\lambda}^{0}}\left(\frac{80\ell^{2}\eta_{0}\eta_{\lambda}^{0}K^{2c}K^{-2a}\sigma_{2}^{2}}{7\mu_{2}BK^{(b-a-c)}}+\frac{9\ell^{2}\eta_{0}\eta_{\lambda}^{0}K^{-2a}\left(\sigma_{1}^{2}+K^{2c}\sigma_{2}^{2}\right)}{4\mu_{2}BK^{(b-a-c)}}\right).
\end{align*}
2 end_POSTSUPERSCRIPT ) end_ARG start_ARG 4 italic_μ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_B italic_K start_POSTSUPERSCRIPT ( italic_b - italic_a - italic_c ) end_POSTSUPERSCRIPT end_ARG ) .

Recalling Lemma 1, which relates the stationarity of the minimax problem to that of the original bilevel problem, we have

\begin{align*}
\mathbb{E}\left[\left\|\nabla\mathcal{L}(\tilde{\lambda})\right\|^{2}\right]
&= \frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\left[\left\|\nabla\mathcal{L}(\lambda_{k})\right\|^{2}\right] \\
&\leq \frac{2}{K}\sum_{k=1}^{K}\left(\mathbb{E}\left[\left\|\nabla\mathcal{L}(\lambda_{k})-\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2}\right]+\mathbb{E}\left[\left\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2}\right]\right) \\
&\leq \frac{2}{\alpha^{2}}+\frac{2}{K}\sum_{k=1}^{K}\mathbb{E}\left[\left\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2}\right] \\
&\leq \frac{2}{K^{2c}}+\frac{8\mathbb{E}[R_{1}]K^{b}}{\eta_{\lambda}^{0}K}+\frac{8K^{b}}{\eta_{\lambda}^{0}}\frac{\ell_{\Gamma}(\eta_{\lambda}^{0})^{2}\left(\sigma_{1}^{2}+K^{2c}\sigma_{2}^{2}\right)}{2K^{2b}} \\
&\quad +\frac{8K^{b}}{\eta_{\lambda}^{0}}\left(\frac{80\ell^{2}\eta_{0}\eta_{\lambda}^{0}K^{2c}K^{-2a}\sigma_{2}^{2}}{7\mu_{2}BK^{(b-a-c)}}+\frac{9\ell^{2}\eta_{0}\eta_{\lambda}^{0}K^{-2a}\left(\sigma_{1}^{2}+K^{2c}\sigma_{2}^{2}\right)}{4\mu_{2}BK^{(b-a-c)}}\right).
\end{align*}
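
For reference, the first inequality in the chain above is an instance of the elementary bound $\|x+y\|^{2}\leq 2\|x\|^{2}+2\|y\|^{2}$. The shorthand $x_{k}$, $y_{k}$ below is introduced only for this remark and is not notation from the original proof:

\begin{align*}
x_{k} &:= \nabla\mathcal{L}(\lambda_{k})-\nabla\Gamma^{\alpha}(\lambda_{k}), \qquad y_{k} := \nabla\Gamma^{\alpha}(\lambda_{k}), \\
\left\|\nabla\mathcal{L}(\lambda_{k})\right\|^{2} &= \left\|x_{k}+y_{k}\right\|^{2} \leq 2\left\|x_{k}\right\|^{2}+2\left\|y_{k}\right\|^{2}.
\end{align*}

Taking expectations and averaging over $k=1,\dots,K$ gives the second line. The third line then appears to bound each $\mathbb{E}\left[\|x_{k}\|^{2}\right]$ by $1/\alpha^{2}$ through Lemma 1, and replacing $2/\alpha^{2}$ with $2/K^{2c}$ in the fourth line reads as the choice $\alpha=K^{c}$.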

Setting $c=1/7$, $a=4/7$, and $b=5/7$, we obtain

\begin{align*}
\mathbb{E}\left[\left\|\nabla\mathcal{L}(\tilde{\lambda})\right\|^{2}\right]
&= \frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\left[\left\|\nabla\mathcal{L}(\lambda_{k})\right\|^{2}\right] \\
&\leq \mathcal{O}\left(\frac{1}{K^{2/7}}\right)+\mathcal{O}\left(\frac{\mathbb{E}[R_{1}]}{\eta_{\lambda}^{0}K^{2/7}}\right)+\mathcal{O}\left(\frac{(1+\ell\kappa\eta_{0})\sigma_{1}^{2}}{BK^{4/7}}\right)+\mathcal{O}\left(\frac{(1+\ell\kappa\eta_{0})\sigma_{2}^{2}}{BK^{2/7}}\right).
\end{align*}

Note that the initial state $R_{1}$ can be bounded by a constant that is independent of $\alpha$:

\begin{align*}
R_{1} &= \Gamma^{\alpha}(\lambda_{1})-\Gamma_{\min}^{\alpha}+\xi_{1,1}\delta_{1}+\xi_{2,1}r_{1} \\
&= \Gamma^{\alpha}(\lambda_{1})-\Gamma_{\min}^{\alpha}+\mathcal{O}\left(J\kappa\,\eta_{0}^{\lambda}/\eta_{0}\left(\left\|w_{1}-w_{\ast}^{\alpha}(\lambda_{1})\right\|^{2}+\left\|u_{1}-u_{\ast}(\lambda_{1})\right\|^{2}\right)\right), \tag{42}
\end{align*}

where

\begin{align*}
\Gamma^{\alpha}(\lambda_{1})-\Gamma_{\min}^{\alpha}
&\leq \mathcal{L}^{\alpha}(\lambda_{1},w_{\ast}^{\alpha}(\lambda_{1}),u_{\ast}(\lambda_{1}))-\mathcal{L}^{\alpha}(\lambda_{\ast},w_{\ast}^{\alpha}(\lambda_{\ast}),u_{\ast}(\lambda_{\ast})) \\
&= L_{1}(\lambda_{1},w_{\ast}^{\alpha}(\lambda_{1}))-L_{1}(\lambda^{\ast},w_{\ast}^{\alpha}(\lambda^{\ast}))+\alpha\left(L_{2}(\lambda_{1},w_{\ast}^{\alpha}(\lambda_{1}))-L_{2}(\lambda_{1},u_{\ast}(\lambda_{1}))\right) \\
&\quad +\alpha\left(L_{2}(\lambda^{\ast},w_{\ast}^{\alpha}(\lambda^{\ast}))-L_{2}(\lambda^{\ast},u_{\ast}(\lambda^{\ast}))\right) \\
&= L_{1}(\lambda_{1},w_{\ast}(\lambda_{1}))-L_{1}(\lambda^{\ast},w_{\ast}(\lambda^{\ast}))+L_{1}(\lambda_{1},w_{\ast}^{\alpha}(\lambda_{1}))-L_{1}(\lambda_{1},w_{\ast}(\lambda_{1})) \\
&\quad +L_{1}(\lambda^{\ast},w_{\ast}(\lambda^{\ast}))-L_{1}(\lambda^{\ast},w_{\ast}^{\alpha}(\lambda^{\ast}))+\alpha\left(L_{2}(\lambda_{1},w_{\ast}^{\alpha}(\lambda_{1}))-L_{2}(\lambda_{1},u_{\ast}(\lambda_{1}))\right) \\
&\quad +\alpha\left(L_{2}(\lambda^{\ast},w_{\ast}^{\alpha}(\lambda^{\ast}))-L_{2}(\lambda^{\ast},u_{\ast}(\lambda^{\ast}))\right) \\
&\leq \mathcal{L}(\lambda_{1})-\mathcal{L}(\lambda^{\ast})+\ell_{10}\left\|w_{\ast}^{\alpha}(\lambda_{1})-w_{\ast}(\lambda_{1})\right\|+\ell_{10}\left\|w_{\ast}^{\alpha}(\lambda^{\ast})-w_{\ast}(\lambda^{\ast})\right\| \\
&\quad +\alpha\frac{\ell_{21}}{2}\left\|w_{\ast}^{\alpha}(\lambda_{1})-u_{\ast}(\lambda_{1})\right\|^{2}+\alpha\frac{\ell_{21}}{2}\left\|w_{\ast}^{\alpha}(\lambda^{\ast})-u_{\ast}(\lambda^{\ast})\right\|^{2} \\
&\leq \mathcal{L}(\lambda_{1})-\mathcal{L}(\lambda^{\ast})+\frac{2\ell_{10}C_{0}}{\alpha}+2\alpha\frac{\ell_{21}}{2}\frac{C_{0}^{2}}{\alpha^{2}} \\
&\leq \mathcal{L}(\lambda_{1})-\mathcal{L}(\lambda^{\ast})+\frac{2\ell_{10}C_{0}\mu_{2}}{\ell_{11}}+\frac{\ell_{21}C_{0}^{2}\mu_{2}}{\ell_{11}}=\mathcal{L}(\lambda_{1})-\mathcal{L}(\lambda^{\ast})+\mathcal{O}\left(\kappa^{2}\ell_{21}\right), \tag{43}
\end{align*}

where, by definition, $w_{\ast}(\lambda)=u_{\ast}(\lambda)$; the first inequality follows from the gradient-Lipschitz continuity of $L_{2}$ and the Lipschitz continuity of $L_{1}$ in $w$, and the second inequality uses Lemma 4. The proof is complete. ∎
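
As a rough numerical illustration of the rate derived above (a sketch, not part of the original analysis): the substitutions in the bound read as $\alpha=K^{c}$ and $\eta_{\lambda}=\eta_{\lambda}^{0}K^{-b}$, and the slowest-decaying terms scale as $K^{-2/7}$. The helper names `schedule` and `iterations_for_target`, and the lumped constant `const` standing in for all problem-dependent factors, are hypothetical and introduced only for this example.

```python
import math


def schedule(K, eta_lambda0=1.0, c=1 / 7, b=5 / 7):
    """Return (alpha, eta_lambda) for a horizon of K outer iterations.

    Assumes alpha = K^c and eta_lambda = eta_lambda0 * K^(-b), as read off
    from the substitutions in the bound; eta_lambda0 = 1.0 is a placeholder.
    """
    alpha = K ** c
    eta_lambda = eta_lambda0 * K ** (-b)
    return alpha, eta_lambda


def iterations_for_target(eps_sq, const=1.0, rate=2 / 7):
    """Smallest integer K with const * K^(-rate) <= eps_sq.

    Treats the bound as a single term const / K^rate; `const` is a
    hypothetical stand-in for the constants hidden in the O(.) notation.
    """
    return math.ceil((const / eps_sq) ** (1.0 / rate))


if __name__ == "__main__":
    K = iterations_for_target(eps_sq=0.1)
    alpha, eta_lambda = schedule(K)
    print(f"K = {K}, alpha = {alpha:.3f}, eta_lambda = {eta_lambda:.5f}")
```

Under these assumptions, driving the leading term below $\epsilon^{2}$ takes on the order of $\epsilon^{-7}$ outer iterations: with `const = 1`, a target of $0.1$ on the expected squared gradient norm already needs $K=3163$.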