ScaleBiO: Scalable Bilevel Optimization for LLM Data Reweighting

Rui Pan¹∗, Jipeng Zhang¹∗, Xingyuan Pan²∗, Renjie Pi¹, Xiaoyu Wang¹∗, Tong Zhang²
¹The Hong Kong University of Science and Technology
²University of Illinois Urbana-Champaign
{rpan,jzhanggr,maxywang,rpi}@ust.hk [email protected]
[email protected]
∗Equal Contribution. Code is available at https://github.com/2003pro/ScaleBiO. Corresponding author is Xiaoyu Wang <[email protected]>.

Abstract

Bilevel optimization has shown its utility across various machine learning settings, yet most algorithms in practice require second-order information, making them challenging to scale up. Only recently has a paradigm of first-order algorithms emerged that can effectively address bilevel optimization problems. Nevertheless, the practical efficiency of this paradigm remains unverified, particularly in the context of large language models (LLMs). This paper introduces ScaleBiO, the first scalable instantiation of this paradigm, focusing on bilevel optimization for large-scale LLM data reweighting. Combined with LISA, a recently proposed memory-efficient training technique, our algorithm allows the paradigm to scale to 34-billion-parameter LLMs on eight A40 GPUs, marking the first successful application of bilevel optimization under practical scenarios for large LLMs. Empirically, extensive experiments on data reweighting verify the effectiveness of ScaleBiO for models of different scales, including GPT-2, LLaMA-3-8B, GPT-NeoX-20B, and Yi-34B, where bilevel optimization succeeds in filtering irrelevant data samples and selecting informative samples. Theoretically, ScaleBiO ensures the optimality of the learned data weights, along with a convergence guarantee matching the conventional first-order bilevel optimization paradigm on smooth and strongly convex objectives.

Preprint. Work in Progress.

1 Introduction

Bilevel optimization (BO) has emerged as a prominent area of research, drawing substantial attention due to its effectiveness in numerous machine learning applications, such as hyperparameter optimization (Domke, 2012; Maclaurin et al., 2015; Franceschi et al., 2017; Lorraine et al., 2020), meta-learning (Andrychowicz et al., 2016; Franceschi et al., 2018; Rajeswaran et al., 2019) and reinforcement learning (Konda and Tsitsiklis, 1999; Hong et al., 2020). In its standard formulation, bilevel optimization involves a two-level hierarchical structure, where the solution to the outer-level problem depends on the solution to the inner-level problem,

$$\min_{\lambda\in\Lambda}\ \mathcal{L}(\lambda)=L_{1}(\lambda,w_{\ast}(\lambda))\quad\text{s.t.}\quad w_{\ast}(\lambda)=\arg\min_{w}L_{2}(\lambda,w).\tag{1}$$

For example, in data reweighting tasks, $\lambda$ denotes the weights of different data sources, $w$ represents the trainable model parameters, and $w_{\ast}(\lambda)$ is the optimal set of parameters trained on the weighted dataset, while the outer function $L_{1}$ and inner function $L_{2}$ stand for the validation and training losses, respectively.

Despite the inherent flexibility and applicability of bilevel optimization across a wide range of problems, its use in large-scale problems has been relatively limited thus far. The primary obstacle to scalability arises from the interdependence between the upper-level and lower-level problems. This mutual dependency introduces significant computational challenges, particularly for large-scale problem instances. A natural gradient-based iterative approach to solving Problem (1) is to compute (or estimate) the hyper-gradient

$$\frac{\partial\mathcal{L}(\lambda)}{\partial\lambda}=\frac{\partial L_{1}(\lambda,w_{\ast}(\lambda))}{\partial\lambda}+\frac{\partial L_{1}(\lambda,w_{\ast}(\lambda))}{\partial w}\frac{\partial w_{\ast}(\lambda)}{\partial\lambda}.\tag{2}$$

Under certain smoothness assumptions, the derivative $\frac{\partial w_{\ast}(\lambda)}{\partial\lambda}$ exists and can be obtained by applying the implicit function theorem to the optimality condition $\frac{\partial L_{2}}{\partial w}(\lambda,w_{\ast}(\lambda))=0$, which gives

$$\frac{\partial^{2}L_{2}(\lambda,w_{\ast}(\lambda))}{\partial w^{2}}\frac{\partial w_{\ast}(\lambda)}{\partial\lambda}+\frac{\partial^{2}L_{2}(\lambda,w_{\ast}(\lambda))}{\partial\lambda\,\partial w}=0.\tag{3}$$

The main challenge lies in computing or approximating the derivative $\partial w_{\ast}(\lambda)/\partial\lambda$ in (2), where a direct computation through (3) involves the Jacobian and Hessian of the inner function $L_{2}$ and may even require the Hessian inverse. A line of research (Domke, 2012; Pedregosa, 2016; Grazzi et al., 2020; Lorraine et al., 2020; Franceschi et al., 2017; Shaban et al., 2019; Ghadimi and Wang, 2018; Hong et al., 2020; Yang et al., 2021; Ji et al., 2021; Chen et al., 2022) has attempted to address this challenge. However, all of these works require computing the Hessian, the Jacobian, or their products with vectors, which can be computationally expensive and memory-intensive for large-scale problems. Recently, Kwon et al. (2023) proposed a fully first-order method for stochastic bilevel optimization that uses only a first-order gradient oracle. This approach avoids the challenges associated with second-order computations and offers promising potential for stochastic bilevel optimization.
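To make the cost of (2) and (3) concrete, the following minimal sketch evaluates the hyper-gradient on a toy ridge-regression problem and checks it against a finite difference. The problem itself is an assumption chosen for illustration (a scalar $\lambda$, an inner loss $L_{2}(\lambda,w)=\frac{1}{2}\|Xw-y\|^{2}+\frac{1}{2}e^{\lambda}\|w\|^{2}$, and an outer validation loss); it is not taken from the paper. Note that even here the computation requires solving a linear system with the Hessian of $L_{2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 5)), rng.normal(size=20)      # toy training data (assumed)
Xv, yv = rng.normal(size=(10, 5)), rng.normal(size=10)    # toy validation data (assumed)

def inner_solution(lam):
    """Return w_*(lam) and the Hessian of L2 with respect to w."""
    H = X.T @ X + np.exp(lam) * np.eye(5)
    return np.linalg.solve(H, X.T @ y), H

def outer_loss(lam):
    """L(lam) = L1(lam, w_*(lam)), here a validation least-squares loss."""
    w, _ = inner_solution(lam)
    return 0.5 * np.sum((Xv @ w - yv) ** 2)

def hypergradient(lam):
    w, H = inner_solution(lam)
    grad_w_L1 = Xv.T @ (Xv @ w - yv)        # dL1/dw at w_*(lam)
    cross = np.exp(lam) * w                 # d/dlam of grad_w L2
    dw_dlam = -np.linalg.solve(H, cross)    # Eq. (3): needs a Hessian solve
    return grad_w_L1 @ dw_dlam              # Eq. (2); dL1/dlam is zero here

lam, eps = 0.3, 1e-5
finite_diff = (outer_loss(lam + eps) - outer_loss(lam - eps)) / (2 * eps)
print(hypergradient(lam), finite_diff)
```

For a model with billions of parameters, the `np.linalg.solve` step has no affordable counterpart, which is the motivation for the first-order methods discussed next.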

Despite these groundbreaking advancements in algorithms and theory, the practical performance of bilevel optimization in large-scale real-world settings has yet to be thoroughly investigated. Aiming to close this gap, this paper considers a practical scenario where Large Language Models (LLMs) are fine-tuned on datasets from different sources. We identify a significant challenge in determining the optimal sampling weights for each data source. For instance, Wang et al. (2024) have demonstrated that LLMs’ task-specific performance degrades in the presence of certain training datasets. However, with proper sampling weights, including and combining various datasets should intuitively enhance the models’ overall performance. This data-task misalignment poses a primary challenge in training LLMs with multiple data sources:

How to balance each data source in the training dataset to obtain optimal performance?

Various methods have been proposed to address this challenge. However, they either rely on intuitive presets (Zhou et al., 2024; Muennighoff et al., 2022; Du et al., 2022a; Almazrouei et al., 2023) or lack theoretical guarantees (Xia et al., 2024a; Xie et al., 2024; Xia et al., 2024b), leading to suboptimal sampling weights. To this end, we apply bilevel optimization to data reweighting tasks for LLMs, aiming to overcome the limitations of existing methods. We formulate the task as a bilevel optimization problem in which the sampling weights act as the outer learnable variables. Extensive experimental results in data denoising, multilingual training, and instruction-following fine-tuning demonstrate the scalability and effectiveness of our approach. The primary contributions are summarized as follows:

• We propose the first scalable instantiation of bilevel optimization for large LLM training problems, which is capable of scaling to models with 34 billion parameters.
• We bridge the gap between algorithmic advancements in bilevel optimization and their application in data reweighting, allowing the optimal data weights to be learned for large-scale LLMs.
• We provide both experimental and theoretical results to demonstrate the efficiency of ScaleBiO. Empirically, ScaleBiO is capable of filtering irrelevant data samples and selecting informative samples. Theoretically, ScaleBiO’s convergence guarantee matches the results of Kwon et al. (2023) on smooth and strongly convex objectives.

2 Related Work

Method	Description	Task	Model	Size
RMD (Bengio, 2000)	2nd-order, deterministic	hyperparameter optimization	Linear	$<$ 1M
CG (Grazzi et al., 2020)	2nd-order, deterministic	equilibrium models	CNN	$<$ 1M
stocBiO (Ji et al., 2021)	2nd-order, stochastic	meta learning	CNN	$<$ 1M
FdeHBO (Yang et al., 2023)	1st-order, stochastic	hyper-representation	LeNet	$<$ 1M
BOME (Liu et al., 2022)	1st-order, stochastic	data hyper-cleaning	Linear	$<$ 1M
SOBA (Dagréou et al., 2022)	2nd-order, stochastic	data reweighting	Transformers	7M
PZOBO (Sow et al., 2022)	1st-order, stochastic	few-shot meta-learning	ResNet	12M
BFTSS (Somayajula et al., 2023)	1st-order, stochastic	task-dependent structure learning	BERT	336M
ScaleBiO (Ours)	1st-order, stochastic	data reweighting	Yi	34B

Table 1: Comparison of the maximal model size used in the original paper of each method, where ’M’ stands for million and ’B’ stands for billion. The Description column summarizes each method, and the Task column reports the task it was tested on.

2.1 Bilevel Optimization

Traditional bilevel optimization algorithms mainly fall into two classes: 1) approximate implicit differentiation (AID) methods (Domke, 2012; Pedregosa, 2016; Grazzi et al., 2020; Lorraine et al., 2020), and 2) iterative differentiation (ITD) methods (Domke, 2012; Maclaurin et al., 2015; Franceschi et al., 2017; Shaban et al., 2019; Grazzi et al., 2020; Zhou et al., 2022a; Gao et al., 2023; Zhou et al., 2022b; Lin et al., 2023). Both approaches follow a two-loop scheme and incur a huge computational cost for large-scale problems. To reduce the cost, attempts at stochastic bilevel optimization have been made (Ghadimi and Wang, 2018; Hong et al., 2020; Ji et al., 2021; Chen et al., 2022; Khanduri et al., 2021), which significantly improve the efficiency of traditional methods but still lack practicality in large-scale settings because they require second-order information, such as Jacobian- and Hessian-vector products, to estimate the hyper-gradient. Sow et al. (2022) and Yang et al. (2023) attempt to approximate the Jacobian matrix $\partial w_{\ast}(\lambda)/\partial\lambda$ in (2) by finite differences, but the finite-difference estimate can be sensitive to the choice of the smoothing constant and may suffer from numerical issues in practice (Jorge and Stephen, 2006).

Recently, a new paradigm of fully first-order penalty-based methods has been introduced, which reformulates the inner-level problem as an optimality constraint (Liu et al., 2022; Kwon et al., 2023; Chen et al., 2023). Liu et al. (2022) were the first to obtain a hypergradient involving only first-order information, but their method applies only to deterministic functions. Kwon et al. (2023) introduced a first-order gradient-based approach that avoids estimating the Hessian or Jacobian and is easily extended to stochastic bilevel optimization settings. Chen et al. (2023) provided a near-optimal sample complexity, which improves the theoretical result of Kwon et al. (2023) for deterministic bilevel optimization. These results verify the effectiveness of the paradigm in theory, yet its practical applications in large-scale LLM settings remain unexplored.

On the practical side, bilevel optimization has been explored in various NLP tasks. Somayajula et al. (2023) use bilevel optimization to learn a task-dependent similarity structure. Although their approach is effective on BERT models (Devlin et al., 2018), the finite-difference approximation suffers from high error and therefore does not scale to LLMs with billions of parameters. Grangier et al. (2024) adopt SOBA (Dagréou et al., 2022) to modify the training data distribution for language modeling under domain shift. However, the algorithm still requires gradient approximation and Hessian-vector products, posing challenges to scalability and engineering for large-scale problems. We summarize typical bilevel algorithms and their model sizes in Table 1; to the best of our knowledge, none of the prior approaches listed there has been successfully applied to LLMs with more than 1B parameters.

2.2 Data Reweighting

The proportion of training data sources significantly affects the performance of large language models (Du et al., 2022b; Xie et al., 2023). To this end, various methods have been proposed to reweight data sources for an optimal training data mixture. For instance, Mindermann et al. (2022) use the loss gap between a trained model and a base model to identify learnable data samples, assigning them higher weights on the fly. Thakkar et al. (2023) propose using a self-influence score to guide reweighting within mini-batches during pretraining. Xia et al. (2024b) leverage reference losses on validation sets and adjust the weights dynamically, adding minimal overhead to standard training. DoReMi (Xie et al., 2024) applies distributionally robust optimization (DRO) to tune the domain weights without knowledge of downstream tasks, showing effectiveness in improving perplexity and downstream accuracy. Nevertheless, none of the aforementioned methods ensures the optimality of the learned data weights, and none has been demonstrated on models with more than 30B parameters.

3 Methods

In this section, we elaborate on our ScaleBiO method for finding the optimal sampling weights when training large-scale LLMs. We first formulate this problem as a bilevel optimization problem in Section 3.1 and then develop an efficient training method for our formulation in Section 3.2.

3.1 Problem Formulation

Suppose that $m$ data sources are available for training, e.g. Alpaca (Taori et al., 2023), FLAN (Wei et al., 2021), and ShareGPT (Chiang et al., 2023), where each source $S_{i}$ is a set of $n_{i}$ examples $S_{i}=\{a_{1}^{i},a_{2}^{i},\dots,a_{n_{i}}^{i}\}$. We balance multiple training data sources by assigning each data source $S_{i}$ a sampling weight $p_{i}$ that satisfies $\sum_{i=1}^{m}p_{i}=1$.

Accordingly, each data source $S_{i}$ contributes $p_{i}|\mathcal{D}_{\text{trn}}|$ samples to the training dataset $\mathcal{D}_{\text{trn}}$. We then optimize the sampling weights to minimize the model’s loss on the validation set $\mathcal{D}_{\text{val}}$. This leads to the following bilevel optimization problem:

$$\min_{p\in\Lambda}\ L_{\text{val}}(w_{\ast}(p))\quad\text{s.t.}\quad w_{\ast}(p)=\arg\min_{w}\sum_{i=1}^{m}\frac{p_{i}}{n_{i}}\sum_{j=1}^{n_{i}}L_{\text{trn}}(w,a_{j}^{i})$$

where $w$ denotes the parameters of the LLM, $\{p_{i}\}$ is the probability distribution over the $m$ data sources, and $L_{\text{val}}$ and $L_{\text{trn}}$ denote the language modeling loss on $\mathcal{D}_{\text{val}}$ and $\mathcal{D}_{\text{trn}}$, respectively. In addition, we introduce a trainable variable $\lambda\in\mathbb{R}^{m}$ and set $p_{i}=e^{\lambda_{i}}/\sum_{j=1}^{m}e^{\lambda_{j}}$, so that $\sum_{i=1}^{m}p_{i}=1$ is satisfied by construction.
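The softmax parameterization and the weighted inner objective can be sketched in a few lines. This is an illustrative sketch only: the function names `source_weights` and `weighted_training_loss` are hypothetical, and the per-example losses are made-up numbers standing in for $L_{\text{trn}}(w,a_{j}^{i})$.

```python
import numpy as np

def source_weights(lam):
    """p_i = exp(lam_i) / sum_j exp(lam_j), computed in a numerically stable way."""
    e = np.exp(lam - np.max(lam))
    return e / e.sum()

def weighted_training_loss(lam, per_source_losses):
    """sum_i (p_i / n_i) * sum_j L_trn(w, a_j^i), given per-example losses per source."""
    p = source_weights(lam)
    return sum(p_i * np.mean(losses) for p_i, losses in zip(p, per_source_losses))

# Three hypothetical sources of different sizes with made-up per-example losses.
lam = np.zeros(3)
losses = [np.array([1.0, 3.0]), np.array([2.0]), np.array([0.5, 0.5, 0.5, 0.5])]
p = source_weights(lam)
loss = weighted_training_loss(lam, losses)
print(p, loss)
```

Because each source is averaged over its own $n_{i}$ examples before weighting, a large source does not dominate the objective merely through its size; only $p_{i}$ controls its contribution.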

Algorithm 1 ScaleBiO for high-dimensional and large-scale minimax problems

1: Input: step sizes $\{\eta_{u},\eta_{w},\eta_{\lambda}\}$, penalty $\alpha$, and initialization $\lambda_{0}$, $u_{0}$, $w_{0}$
2: for $k=0:K-1$
3: Uniformly and independently select two block coordinates $j_{k},r_{k}$ from $\{1,2,\cdots,J\}$
4: Generate i.i.d. samples $\{\mathcal{D}_{\text{trn}}^{k},\mathcal{D}_{\text{val}}^{k}\}$ from the training dataset $\mathcal{D}_{\text{trn}}$ and the validation dataset $\mathcal{D}_{\text{val}}$
5: $u_{k+1}^{j_{k}}=u_{k}^{j_{k}}-\alpha\eta_{u}\nabla_{j_{k}}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{\text{trn}}^{k})$
6: $u_{k+1}=u_{k}+U_{j_{k}}(u_{k+1}^{j_{k}}-u_{k}^{j_{k}})$
7: $w_{k+1}^{r_{k}}=w_{k}^{r_{k}}-\eta_{w}\left(\nabla_{r_{k}}L_{1}(\lambda_{k},w_{k};\mathcal{D}_{\text{val}}^{k})+\alpha\nabla_{r_{k}}L_{2}(\lambda_{k},w_{k};\mathcal{D}_{\text{trn}}^{k})\right)$
8: $w_{k+1}=w_{k}+W_{r_{k}}(w_{k+1}^{r_{k}}-w_{k}^{r_{k}})$
9: $\lambda_{k+1}=\lambda_{k}-\eta_{\lambda}\left(\nabla_{\lambda}L_{1}(\lambda_{k},w_{k};\mathcal{D}_{\text{val}}^{k})+\alpha\bigl(\nabla_{\lambda}L_{2}(\lambda_{k},w_{k};\mathcal{D}_{\text{trn}}^{k})-\nabla_{\lambda}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{\text{trn}}^{k})\bigr)\right)$
10: end for
11: Output: $(\lambda_{K},w_{K},u_{K})$

3.2 Fully First-order Hypergradient Method

Note that the inner-level problem of Problem (1) can be viewed as a constraint of the outer-level problem, and thus it is tempting to reformulate Problem (1) as

$$\min_{\lambda\in\Lambda,w}\ L_{1}(\lambda,w)\quad\text{s.t.}\quad L_{2}(\lambda,w)-\min_{u}L_{2}(\lambda,u)=0.\tag{4}$$

Here an auxiliary variable $u$ is introduced to transform the inner problem $w_{\ast}(\lambda)=\arg\min_{w}L_{2}(\lambda,w)$ into the constraint $L_{2}(\lambda,w)-\min_{u}L_{2}(\lambda,u)=0$, where $u$ serves as a proxy for $w_{\ast}(\lambda)$. By penalizing the constraint of (4) with multiplier $\alpha>0$ in the outer-level objective, we obtain the minimax formulation (Kwon et al., 2023; Lu and Mei, 2023):

$$\min_{\lambda\in\Lambda,w}\max_{u}\ \mathcal{L}^{\alpha}(\lambda,w,u)\tag{5}$$

where

$$\mathcal{L}^{\alpha}(\lambda,w,u)=L_{1}(\lambda,w)+\alpha\left(L_{2}(\lambda,w)-L_{2}(\lambda,u)\right).$$

In this way, approximations of both the inner constraint and the outer optimum are obtained during the same optimization process, and $\alpha$ controls the priority between them. When $\alpha\to\infty$, the bilevel problem (1) is equivalent to the minimax problem (5) under certain smoothness assumptions.

To precisely describe the optimality of the minimax problem with the stationarity of the bilevel problem, we overload the notation in (5) and define

$$\Phi^{\alpha}(\lambda,w):=\max_{u}\mathcal{L}^{\alpha}(\lambda,w,u);\tag{6}$$
$$u_{\ast}(\lambda):=\arg\max_{u}\mathcal{L}^{\alpha}(\lambda,w,u);\tag{7}$$
$$\Gamma^{\alpha}(\lambda):=\min_{w}\Phi^{\alpha}(\lambda,w);\tag{8}$$
$$w_{\ast}^{\alpha}(\lambda):=\arg\min_{w}\Phi^{\alpha}(\lambda,w).\tag{9}$$

We make the following assumptions for the proposed minimax problem throughout this paper.

Assumption 1.

Suppose that

(1) $L_{1}(\lambda,w)$ is twice continuously differentiable, $\ell_{10}$-Lipschitz continuous in $w$, and $\ell_{11}$-gradient Lipschitz.
(2) $L_{2}(\lambda,w)$ is $\ell_{21}$-gradient Lipschitz, $\ell_{22}$-Hessian Lipschitz, and $\mu_{2}$-strongly convex in $w$.

Lemma 1.

Under Assumption 1, if $\alpha>2\ell_{11}/\mu_{2}$, we have

$$\left\|\mathcal{L}(\lambda)-\Gamma^{\alpha}(\lambda)\right\|\leq\mathcal{O}\left(\frac{1}{\alpha}\right)\tag{10}$$
$$\left\|\nabla\mathcal{L}(\lambda)-\nabla\Gamma^{\alpha}(\lambda)\right\|\leq\mathcal{O}\left(\frac{1}{\alpha}\right)\tag{11}$$
$$\left\|\nabla^{2}\Gamma^{\alpha}(\lambda)\right\|\leq\mathcal{O}(\kappa^{3})\tag{12}$$

where the condition number $\kappa$ is defined by $\max\left\{\ell_{10},\ell_{11},\ell_{21},\ell_{22}\right\}/\mu_{2}$.

Under Assumption 1, as indicated by Lemma 1, if $\alpha$ goes to infinity, the stationary point of the minimax problem (5) is also a stationary point of the bilevel problem (1).
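A scalar worked example makes the $\mathcal{O}(1/\alpha)$ rates of Lemma 1 tangible. The quadratic losses below are an assumption chosen so that everything has a closed form; they are not from the paper, and since this $L_{1}$ is Lipschitz in $w$ only on bounded sets, the example should be read as holding for $\lambda$ and $w$ in a bounded region. Here $\ell_{11}=\mu_{2}=1$, so the condition of Lemma 1 reads $\alpha>2$.

```latex
% Toy instance (assumed for illustration):
L_{1}(\lambda,w)=\tfrac{1}{2}(w-1)^{2},\qquad
L_{2}(\lambda,w)=\tfrac{1}{2}(w-\lambda)^{2}.

% Bilevel problem (1): the inner minimizer and the outer objective.
w_{\ast}(\lambda)=\lambda,\qquad
\mathcal{L}(\lambda)=\tfrac{1}{2}(\lambda-1)^{2}.

% Penalized problem: \min_{u}L_{2}(\lambda,u)=0, so by (6) and (9)
\Phi^{\alpha}(\lambda,w)=\tfrac{1}{2}(w-1)^{2}+\tfrac{\alpha}{2}(w-\lambda)^{2},\qquad
w_{\ast}^{\alpha}(\lambda)=\frac{1+\alpha\lambda}{1+\alpha}.

% Substituting w_{\ast}^{\alpha}(\lambda) into \Phi^{\alpha} gives (8):
\Gamma^{\alpha}(\lambda)=\frac{\alpha}{1+\alpha}\cdot\tfrac{1}{2}(\lambda-1)^{2}.

% Gaps corresponding to (10)-(12):
\mathcal{L}(\lambda)-\Gamma^{\alpha}(\lambda)=\frac{(\lambda-1)^{2}}{2(1+\alpha)},\qquad
\nabla\mathcal{L}(\lambda)-\nabla\Gamma^{\alpha}(\lambda)=\frac{\lambda-1}{1+\alpha},\qquad
\nabla^{2}\Gamma^{\alpha}(\lambda)=\frac{\alpha}{1+\alpha}\leq 1.
```

Both gaps shrink like $1/\alpha$, while the curvature of $\Gamma^{\alpha}$ stays bounded independently of $\alpha$, which mirrors the three statements of the lemma in this special case.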

3.3 Proposed Algorithm

By adopting the approach in Section 3.2, the outer-inner dependency is eliminated from the original bilevel problem, which is reformulated as a min-max problem that approximates it for sufficiently large $\alpha$. We propose a single-loop framework, referred to as ScaleBiO, to solve the large-scale min-max problem, as outlined in Algorithm 1. In many large-scale machine-learning scenarios, the dimensionality of the training variables can be exceptionally high. We therefore employ the randomized block coordinate method (Nesterov, 2012; Pan et al., 2024) to update the inner variables $u,w$, which improves memory efficiency. The outer variable $\lambda$ is updated with stochastic gradient descent. In addition, popular optimizers such as SGD with momentum and Adam (Kingma and Ba, 2015) can be seamlessly integrated into Algorithm 1 to update each of the variables $u$, $w$, and $\lambda$. The multiplier $\alpha$ is fixed in advance to a value large enough that the min-max solution is a good approximation of the solution to the original bilevel problem.
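The single-loop updates of Algorithm 1 can be sketched on a toy reweighting problem. Everything concrete below is an assumption made for illustration and not the authors' implementation: a linear regression model stands in for the LLM, there are two sources (one clean, one with pure-noise labels), full-batch gradients replace the i.i.d. mini-batches of step 4, plain gradient steps are used, and the hyperparameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, J = 4, 200, 2                    # parameter dim, samples per source, blocks
w_true = np.ones(d)

# Source 0 is clean; source 1 has labels replaced by pure noise.
X = [rng.normal(size=(n, d)) for _ in range(2)]
y = [X[0] @ w_true, rng.normal(size=n)]
X_val = rng.normal(size=(n, d))
y_val = X_val @ w_true

def softmax(lam):
    e = np.exp(lam - lam.max())
    return e / e.sum()

def mse(w, A, b):
    return np.mean((A @ w - b) ** 2)

def mse_grad(w, A, b):
    return 2.0 * A.T @ (A @ w - b) / len(b)

def grad_w_L2(lam, w):
    """Gradient of the weighted training loss L2 with respect to the model."""
    p = softmax(lam)
    return sum(p[i] * mse_grad(w, X[i], y[i]) for i in range(2))

def grad_lam_L2(lam, w):
    """Gradient of L2 with respect to lam (softmax Jacobian times source losses)."""
    p = softmax(lam)
    losses = np.array([mse(w, X[i], y[i]) for i in range(2)])
    return p * (losses - p @ losses)

alpha, eta_u, eta_w, eta_lam, K = 10.0, 0.01, 0.01, 0.05, 5000
blocks = np.array_split(np.arange(d), J)
lam, u, w = np.zeros(2), np.zeros(d), np.zeros(d)

for k in range(K):
    j, r = rng.integers(J), rng.integers(J)      # step 3: independent blocks
    # All gradients are evaluated at the current iterate (lam_k, u_k, w_k).
    g_u = grad_w_L2(lam, u)
    g_w = mse_grad(w, X_val, y_val) + alpha * grad_w_L2(lam, w)
    # L1 (validation loss) does not depend on lam here, so only the penalty remains.
    g_lam = alpha * (grad_lam_L2(lam, w) - grad_lam_L2(lam, u))
    u[blocks[j]] -= alpha * eta_u * g_u[blocks[j]]   # steps 5-6
    w[blocks[r]] -= eta_w * g_w[blocks[r]]           # steps 7-8
    lam -= eta_lam * g_lam                           # step 9

p = softmax(lam)
print("learned source weights (clean, noisy):", p)
```

In this toy setting the weight of the clean source should grow toward one, because `w` is pulled toward the validation optimum while `u` tracks the pure training optimum, and the difference of their $\lambda$-gradients points toward the source on which `w` fits better.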

3.4 Theoretical Results

In this part, we provide a convergence analysis of Algorithm 1. Before presenting the theoretical results, we introduce the notation for partitions. Let $\left\{x^{1},x^{2},\cdots,x^{J}\right\}$ with $x^{j}\in\mathbb{R}^{d_{j}\times 1}$ be $J$ non-overlapping blocks of $x$. Let the matrix $U_{j}\in\mathbb{R}^{d\times d_{j}}$ consist of the $d_{j}$ columns of a $d\times d$ permutation matrix $U$ that correspond to the $j$-th block of coordinates in $x$. For any partition of $x$ and $U$,

$$x=\sum_{j=1}^{J}U_{j}x^{j},\quad x^{j}=U_{j}^{T}x.\tag{13}$$
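The partition notation in (13) can be checked numerically. The sketch below is illustrative and assumes the identity matrix as the permutation matrix $U$ and equal-sized blocks; neither choice is required by the analysis.

```python
import numpy as np

d, J = 6, 3
x = np.arange(1.0, d + 1)              # x = [1, 2, 3, 4, 5, 6]
U = np.eye(d)                          # identity used as the permutation matrix (assumption)
U_blocks = np.split(U, J, axis=1)      # each U_j has shape (d, d_j) with d_j = 2

x_blocks = [U_j.T @ x for U_j in U_blocks]                           # x^j = U_j^T x
x_rebuilt = sum(U_j @ x_j for U_j, x_j in zip(U_blocks, x_blocks))   # x = sum_j U_j x^j

# A block update in the style of Algorithm 1: only block j changes.
j = 1
new_block = x_blocks[j] - 0.5
x_updated = x + U_blocks[j] @ (new_block - x_blocks[j])
print(x_rebuilt, x_updated)
```

Only the coordinates of the selected block are touched by the update, which is why gradients and optimizer states need to be kept for one block at a time.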

We present the essential lemmas in Appendix C to show the theoretical properties of the minimax objective $\mathcal{L}^{\alpha}$ in (5), as well as its optimizers $u_{\ast}$ and $w_{\ast}^{\alpha}$. Lemma 1 shows that $\Gamma^{\alpha}(\lambda)$ is smooth with parameter $\ell_{\Gamma}=\mathcal{O}(\kappa^{3})$, which is independent of the multiplier $\alpha$.

Theorem 1.

Suppose that Assumption 1 holds and that the parameter $\alpha$ and step sizes $\eta_{u},\eta_{w},\eta_{\lambda}$ are chosen as

$$\alpha=K^{1/7},\quad\eta_{u}=\eta_{w}=\frac{\eta_{0}}{K^{4/7}},\quad\eta_{\lambda}=\frac{\eta_{0}^{\lambda}}{K^{5/7}}.$$

If $\alpha\geq\ell_{11}/\mu_{2}$, $\eta_{0}^{\lambda}\leq 1/(8\ell_{\Gamma})$, $\eta_{0}\leq 8J/\mu_{2}$, and $\eta_{0}/\eta_{0}^{\lambda}\geq 6\sqrt{2}\kappa^{2}J$, then the iterates of Algorithm 1 satisfy

$$\mathbb{E}\left[\left\|\nabla\mathcal{L}(\tilde{\lambda})\right\|^{2}\right]\leq\mathcal{O}\left(\frac{1}{K^{2/7}}\right)\tag{14}$$

where $\tilde{\lambda}$ is chosen uniformly from $\left\{\lambda_{k}\right\}_{k=1}^{K}$.

With batch size $B=\mathcal{O}(1)$, the complexity for Algorithm 1 to find an $\epsilon$-stationary point is $\mathcal{O}(\epsilon^{-7})$, which matches that of Kwon et al. (2023). The proof of Theorem 1 is provided in Appendix D.
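The $\mathcal{O}(\epsilon^{-7})$ figure follows from (14) by a short calculation, sketched here under the assumption that an $\epsilon$-stationary point means $\mathbb{E}[\|\nabla\mathcal{L}(\tilde{\lambda})\|^{2}]\leq\epsilon^{2}$ and with $C$ denoting the constant hidden in the $\mathcal{O}(\cdot)$ of (14). The numerical values of $K$ are illustrative only.

```latex
% Require the bound (14) to be at most \epsilon^{2}:
\frac{C}{K^{2/7}}\leq\epsilon^{2}
\quad\Longleftrightarrow\quad
K\geq\left(\frac{C}{\epsilon^{2}}\right)^{7/2}=C^{7/2}\,\epsilon^{-7}.

% With B=\mathcal{O}(1) samples per iteration, the sample complexity is
% B\cdot K=\mathcal{O}(\epsilon^{-7}).

% Illustrative schedule for K=10^{7} iterations:
\alpha=K^{1/7}=10,\qquad
\eta_{u}=\eta_{w}=\frac{\eta_{0}}{K^{4/7}}=10^{-4}\,\eta_{0},\qquad
\eta_{\lambda}=\frac{\eta_{0}^{\lambda}}{K^{5/7}}=10^{-5}\,\eta_{0}^{\lambda},\qquad
\frac{1}{K^{2/7}}=10^{-2}.
```

The schedule also shows the time-scale separation built into the theorem: $\lambda$ moves a factor $K^{1/7}$ more slowly than $u$ and $w$, so the inner variables can track their optima while the weights change.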

4 Experiments

Our experiments are two-fold: we conduct (1) Small Scale Experiments in Section 4.1 for an intuitive understanding of our method ScaleBiO and (2) Real-World Application Experiments in Section 4.2 to validate its effectiveness and scalability. We start by detailing our main findings on the basic capabilities of our approach. We then scale up to real-world scenarios to address the challenge of finding the optimal sampling weights when fine-tuning an LLM with multiple data sources.

Figure 1: Data denoising with GPT-2: weights for noisy data and clean data.

4.1 Small Scale Experiments

To verify the effectiveness of ScaleBiO in small-scale LLMs, we conducted experiments with GPT-2 (124M) on three tasks with synthetic datasets: data denoising, multilingual training, and instruction-following fine-tuning. Full details are available in Appendix B.1.

4.1.1 Data Denoising

This experiment aims to verify ScaleBiO’s effectiveness in data-denoising tasks. Our validation dataset, denoted as $\mathcal{D}_{\text{val}}$, comprises 1000 clean samples randomly selected from the Alpaca dataset (Taori et al., 2023). The training dataset, $\mathcal{D}_{\text{trn}}$, is derived from two distinct sources: the first includes 1000 clean samples also from Alpaca, while the second contains 9000 samples from Alpaca that have been artificially corrupted by replacing their outputs with ".".
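The construction of the two training sources can be sketched as follows. This is a hypothetical illustration: the helper names, the toy record pool, and the choice to draw the clean and noisy subsets without overlap are assumptions, and the records only mimic the instruction/input/output layout of Alpaca-style data.

```python
import random

def corrupt_outputs(examples, noise_output="."):
    """Return copies of the examples with every output replaced by the noise string."""
    return [{**ex, "output": noise_output} for ex in examples]

def build_denoising_sources(pool, n_clean, n_noisy, seed=0):
    """Draw disjoint clean and noisy subsets from a pool of examples (assumed disjoint)."""
    rng = random.Random(seed)
    picked = rng.sample(pool, n_clean + n_noisy)
    clean = picked[:n_clean]
    noisy = corrupt_outputs(picked[n_clean:])
    return clean, noisy

# Toy pool standing in for the real dataset; sizes are scaled down for illustration.
pool = [{"instruction": f"q{i}", "input": "", "output": f"a{i}"} for i in range(50)]
clean, noisy = build_denoising_sources(pool, n_clean=5, n_noisy=45)
print(len(clean), len(noisy), noisy[0])
```

With the real data, the same 1:9 ratio corresponds to 1000 clean and 9000 corrupted samples, and each of the two lists acts as one data source with its own sampling weight.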

Figure 1 demonstrates that our approach has a robust capability to mitigate the influence of harmful data sources via automatic data denoising, where ScaleBiO assigns minimal weight to noisy data sources, effectively filtering the irrelevant samples.

4.1.2 Multilingual Training

It is also intriguing to check whether ScaleBiO can recover optimal sampling weights for more general distributions. To this end, we design multilingual training experiments, where our validation data $\mathcal{D}_{\text{val}}$ comprises 600 random samples from Alpaca-GPT4-ZH (Peng et al., 2023) and 400 random samples from Alpaca-GPT4-EN (Peng et al., 2023). Hence, the underlying optimal ZH:EN weight ratio is 6:4. In contrast, the training set $\mathcal{D}_{\text{trn}}$ has a 1:1 mix ratio, consisting of 40,000 random examples from Alpaca-GPT4-EN and 40,000 from Alpaca-GPT4-ZH.

As shown in Figure 2, ScaleBiO nearly replicates the optimal 6:4 ratio after reweighting the training data. This provides further evidence that ScaleBiO can adapt training data weights to downstream validation datasets.

4.1.3 Instruction Following

In instruction-following fine-tuning tasks, there is a fundamental tradeoff between diversity and quality. To verify whether ScaleBiO can deduce the implicit weights of low- and high-quality datasets, we conduct experiments on instruction-following tasks with GPT-2, where Alpaca and Alpaca-GPT4 (Peng et al., 2023) are employed. Alpaca-GPT4 shares the same instructions and inputs as Alpaca, but its outputs are generated by a more sophisticated model, GPT-4 (Achiam et al., 2023), and are therefore of higher quality. Our validation data for bilevel optimization $\mathcal{D}_{\text{val}}$ consists of 1000 random samples from Alpaca-GPT4, while our training data $\mathcal{D}_{\text{trn}}$ consists of two separate parts: 1000 random samples from Alpaca-GPT4 and 9000 random samples from Alpaca.

As shown in Figure 3, although Alpaca-GPT4 accounts for only a small proportion of the training data (10%), it is highlighted by our method, revealing that our method effectively up-weights the high-quality data source, leading to improved model outcomes.

4.2 Real-World Application Experiments

In this section, we go a step further and apply bilevel optimization to real-world data reweighting applications, where we aim to balance multiple training data sources to boost LLMs’ overall performance on downstream tasks.

4.2.1 Large-Scale Data Reweighting

LLaMA-3-8B		LLaMA-3-13B		GPT-NeoX-20B		Yi-34B
source	weight	source	weight	source	weight	source	weight
WildChat	0.711	WildChat	0.711	Airoboros	0.986	ShareGPT4	0.627
Airoboros	0.154	ShareGPT4	0.137	ShareGPT4	0.005	Airoboros	0.111
ChatAlpaca	0.119	ChatAlpaca	0.021	ChatAlpaca	0.003	WildChat	0.105
Total	0.984	Total	0.869	Total	0.994	Total	0.843

Table 2: Data sources with top-3 weights for LLaMA-3-8B, LLaMA-3-13B, GPT-NeoX-20B and Yi-34B.

To demonstrate ScaleBiO’s effectiveness in large-scale settings, practical choices of model and datasets are employed under the data reweighting scenarios. Specifically, we collect popular instruction following and chat completion data to construct a diverse training dataset of 18 distinct sources. The training set is formed by 9 high-quality datasets that are mostly from GPT-4 and 9 multilingual datasets serving as less relevant data sources. For our reference dataset $\mathcal{D}_{\text{val}}$ , we prompt GPT-4 to generate 1200 two-turn instructions and responses that correspond to the 8 categories in MT-Bench. Our detailed data description and preprocessing can be found in Appendix B.2.

We test our approach on the LLaMA-3 model series (https://ai.meta.com/blog/meta-llama-3/) with parameter sizes of 8B and 13B (https://huggingface.co/Replete-AI/Llama-3-13B). We also conduct experiments on GPT-NeoX-20B (Black et al., 2022) and Yi-34B (01.AI et al., 2024). To evaluate our approach, we first sample 3,000 examples from each source for reweighting. Then we sample 10,000 examples according to the weights at the end of bilevel optimization to train our model (denoted as Ours). For a fair baseline comparison, we sample 10,000 examples uniformly and add the reference data $\mathcal{D}_{\text{val}}$ to them to train the baseline model (denoted as Baseline).

As shown in Table 2 and Figure 4, for models ranging from 8B to 34B parameters, ScaleBiO is capable of filtering out most of the irrelevant multilingual datasets and emphasizing GPT-4 generated data sources. Strikingly, the top three weights in all models fall within a small subset of the data sources (WildChat, Airoboros, ChatAlpaca, and ShareGPT4), and together they account for over 80% of the total weight. This concentration suggests that our method consistently identifies and emphasizes a few influential data sources, regardless of the model architecture.
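Turning learned weights into a training set amounts to deciding how many examples to draw from each source. The sketch below is one possible way to do this and is an assumption, since the text does not specify the exact sampling procedure: it draws per-source counts from a multinomial distribution. The function name `allocate_samples` is hypothetical, the first three weights are the LLaMA-3-8B top-3 values from Table 2, and the remaining sources are lumped into a single "other" entry holding the leftover 0.016.

```python
import numpy as np

def allocate_samples(weights, total, seed=0):
    """Draw per-source example counts that sum to `total`, with probabilities `weights`."""
    rng = np.random.default_rng(seed)
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()                      # guard against rounding in the reported weights
    return rng.multinomial(total, p)

sources = ["WildChat", "Airoboros", "ChatAlpaca", "other"]
weights = [0.711, 0.154, 0.119, 0.016]
counts = allocate_samples(weights, total=10_000)
per_source = dict(zip(sources, counts.tolist()))
print(per_source)
```

A deterministic alternative would round `weights * total` and fix up the remainder; the multinomial draw is used here only because it matches the reading of $p_{i}$ as a sampling probability.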

4.2.2 Large Scale Instruction Following

To confirm the gain of ScaleBiO, we further apply learned sampling weights to fine-tune LLaMA-3-8B and LLaMA-3-70B models.

We evaluate LLMs’ instruction-following ability using MT-Bench (Zheng et al., 2023a) with single-answer grading. This benchmark prompts chat assistants with challenging multi-turn open-ended questions and uses "LLM-as-a-judge" for evaluation. MT-Bench stands out as a comprehensive benchmark that aligns well with human preference. It comprises 80 questions, evenly distributed across 8 categories: Writing, Roleplay, Extraction, Reasoning, Math, Coding, Knowledge I (STEM), and Knowledge II (humanities/social science). We compare the performance of LLaMA-3-8B trained with our reweighted data against the same model trained with the original data. Specifically, we first reweight the data sources on LLaMA-3-8B and then fine-tune the LLM using the reweighted data.

The results, as shown in Table 3 and Table 4, indicate that our method consistently outperforms the baseline across different categories by a large margin, highlighting the effectiveness of our reweighting approach.

In addition, the weights learned on LLaMA-3-8B can be transferred directly to fine-tuning the LLaMA-3-70B model. As shown in Table 3, our model achieves a noticeable improvement on average over the baseline, supporting the transferability of the learned data weights within the same family of models.

Model	Method	Average
LLaMA-3-8B	Baseline	6.11
LLaMA-3-8B	Ours	7.12
LLaMA-3-8B $\rightarrow$ LLaMA-3-70B	Baseline	7.85
LLaMA-3-8B $\rightarrow$ LLaMA-3-70B	Ours	8.05

Table 3: MT-Bench results for LLaMA-3-8B and for LLaMA-3-70B trained with data weights transferred from LLaMA-3-8B.

4.2.3 Implementation Details

Our data reweighting process lasts for 3 epochs with $\alpha=100$ and an initial learning rate of $10^{-5}$ for the models and $10^{-2}$ for the weights. For all our fine-tuning processes, we train the LLM for 1 epoch with an initial learning rate of $8\times 10^{-6}$ and a global batch size of 64. Throughout our experiments, we adopt randomized coordinate descent with AdamW (Pan et al., 2024) and bfloat16 precision for efficient training and inference. Our experiments are conducted on 8 NVIDIA A100 80GB GPUs.

5 Conclusion

In this paper, we propose ScaleBiO, the first bilevel optimization instantiation that is capable of scaling to 34B LLMs on data reweighting tasks. Theoretically, ScaleBiO ensures optimality of the learned data weights and enjoys the same convergence guarantees as conventional first-order penalty-based bilevel optimization algorithms on smooth and strongly convex objectives. Empirically, ScaleBiO enables data reweighting on models with $\geq$ 7B parameters, yielding an efficient data filtering and selection pipeline for improving model performance on various downstream tasks. In addition, the sampling weights learned on LLaMA-3-8B can be transferred directly to the larger LLaMA-3-70B while still obtaining non-trivial performance improvements.

Limitations

The proposed algorithm of ScaleBiO has yet to be verified in large-scale pre-training settings, where a huge amount of computation resources are required for conducting such experiments. We hope the success of ScaleBiO in large-scale fine-tuning settings can be the first step towards this direction.

The potential risks of ScaleBiO are the same as other data reweighting techniques, where optimizing the sampling weights on a single loss metric may lead to models that neglect other aspects, such as safety or ethics. In that case, multi-objective losses and post-training alignments are highly recommended to compensate for this deficiency.

The positive aspect of ScaleBiO is that it helps reweight data more effectively, thus allowing the training cost of large language models to be further reduced.

Ethical Considerations

In conducting our experiments on a diverse set of datasets for instruction following, we have given careful consideration to ethical concerns that may arise. Our work involves datasets such as ShareGPT, OpenOrca, WildChat, AlpacaChat, LMSYS-Chat, Airoboros, etc. We list the license for each dataset in the Appendix and ensure compliance with the licensing agreements for each dataset. Furthermore, all these data sources are publicly available and do not involve privacy issues.

References

Domke [2012] Justin Domke. Generic methods for optimization-based modeling. In Neil D. Lawrence and Mark Girolami, editors, Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of Proceedings of Machine Learning Research, pages 318–326, La Palma, Canary Islands, 21–23 Apr 2012. PMLR.
Maclaurin et al. [2015] Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2113–2122, Lille, France, 07–09 Jul 2015. PMLR.
Franceschi et al. [2017] Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1165–1173. PMLR, 2017.
Lorraine et al. [2020] Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. In International Conference on Artificial Intelligence and Statistics, pages 1540–1552. PMLR, 2020.
Andrychowicz et al. [2016] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. Advances in neural information processing systems, 29, 2016.
Franceschi et al. [2018] Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, pages 1568–1577. PMLR, 2018.
Rajeswaran et al. [2019] Aravind Rajeswaran, Chelsea Finn, Sham M Kakade, and Sergey Levine. Meta-learning with implicit gradients. Advances in neural information processing systems, 32, 2019.
Konda and Tsitsiklis [1999] Vijay Konda and John Tsitsiklis. Actor-critic algorithms. Advances in neural information processing systems, 12, 1999.
Hong et al. [2020] Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A two-timescale framework for bilevel optimization: Complexity analysis and application to actor-critic. arXiv preprint arXiv:2007.05170, 2020.
Pedregosa [2016] Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 737–746, New York, New York, USA, 20–22 Jun 2016. PMLR.
Grazzi et al. [2020] Riccardo Grazzi, Luca Franceschi, Massimiliano Pontil, and Saverio Salzo. On the iteration complexity of hypergradient computation. In International Conference on Machine Learning, pages 3748–3758. PMLR, 2020.
Shaban et al. [2019] Amirreza Shaban, Ching-An Cheng, Nathan Hatch, and Byron Boots. Truncated back-propagation for bilevel optimization. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1723–1732. PMLR, 2019.
Ghadimi and Wang [2018] Saeed Ghadimi and Mengdi Wang. Approximation methods for bilevel programming. arXiv preprint arXiv:1802.02246, 2018.
Yang et al. [2021] Junjie Yang, Kaiyi Ji, and Yingbin Liang. Provably faster algorithms for bilevel optimization. Advances in Neural Information Processing Systems, 34:13670–13682, 2021.
Ji et al. [2021] Kaiyi Ji, Junjie Yang, and Yingbin Liang. Bilevel optimization: Convergence analysis and enhanced design. In International conference on machine learning, pages 4882–4892. PMLR, 2021.
Chen et al. [2022] Tianyi Chen, Yuejiao Sun, Quan Xiao, and Wotao Yin. A single-timescale method for stochastic bilevel optimization. In International Conference on Artificial Intelligence and Statistics, pages 2466–2488. PMLR, 2022.
Kwon et al. [2023] Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, and Robert D Nowak. A fully first-order method for stochastic bilevel optimization. In International Conference on Machine Learning, pages 18083–18113. PMLR, 2023.
Wang et al. [2024] Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Chandu, David Wadden, Kelsey MacMillan, Noah A Smith, Iz Beltagy, et al. How far can camels go? exploring the state of instruction tuning on open resources. Advances in Neural Information Processing Systems, 36, 2024.
Zhou et al. [2024] Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36, 2024.
Muennighoff et al. [2022] Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786, 2022.
Du et al. [2022a] Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335, 2022a.
Almazrouei et al. [2023] Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, et al. The falcon series of open language models. arXiv preprint arXiv:2311.16867, 2023.
Xia et al. [2024a] Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen. Less: Selecting influential data for targeted instruction tuning. arXiv preprint arXiv:2402.04333, 2024a.
Xie et al. [2024] Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy S Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. Doremi: Optimizing data mixtures speeds up language model pretraining. Advances in Neural Information Processing Systems, 36, 2024.
Xia et al. [2024b] Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared LLaMA: Accelerating language model pre-training via structured pruning. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=09iOdaeOzp.
Bengio [2000] Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural computation, 12(8):1889–1900, 2000.
Yang et al. [2023] Yifan Yang, Peiyao Xiao, and Kaiyi Ji. Achieving $\mathcal{O}(\epsilon^{-1.5})$ complexity in Hessian/Jacobian-free stochastic bilevel optimization. Advances in Neural Information Processing Systems, 36, 2023.
Liu et al. [2022] Bo Liu, Mao Ye, Stephen Wright, Peter Stone, and Qiang Liu. Bome! bilevel optimization made easy: A simple first-order approach. In Advances in neural information processing systems, volume 35, pages 17248–17262, 2022.
Dagréou et al. [2022] Mathieu Dagréou, Pierre Ablin, Samuel Vaiter, and Thomas Moreau. A framework for bilevel optimization that enables stochastic and global variance reduction algorithms. Advances in Neural Information Processing Systems, 35:26698–26710, 2022.
Sow et al. [2022] Daouda Sow, Kaiyi Ji, and Yingbin Liang. On the convergence theory for hessian-free bilevel algorithms. In Advances in Neural Information Processing Systems, volume 35, pages 4136–4149, 2022.
Somayajula et al. [2023] Sai Ashish Somayajula, Lifeng Jin, Linfeng Song, Haitao Mi, and Dong Yu. Bi-level finetuning with task-dependent similarity structure for low-resource training. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8569–8588, 2023.
Zhou et al. [2022a] Xiao Zhou, Yong Lin, Renjie Pi, Weizhong Zhang, Renzhe Xu, Peng Cui, and Tong Zhang. Model agnostic sample reweighting for out-of-distribution learning. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 27203–27221. PMLR, 17–23 Jul 2022a. URL https://proceedings.mlr.press/v162/zhou22d.html.
Gao et al. [2023] Jiahui Gao, Renjie Pi, Yong Lin, Hang Xu, Jiacheng Ye, Zhiyong Wu, Weizhong Zhang, Xiaodan Liang, Zhenguo Li, and Lingpeng Kong. Self-guided noise-free data generation for efficient zero-shot learning, 2023. URL https://arxiv.org/abs/2205.12679.
Zhou et al. [2022b] Xiao Zhou, Renjie Pi, Weizhong Zhang, Yong Lin, Zonghao Chen, and Tong Zhang. Probabilistic bilevel coreset selection. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 27287–27302. PMLR, 17–23 Jul 2022b. URL https://proceedings.mlr.press/v162/zhou22h.html.
Lin et al. [2023] Yong Lin, Renjie Pi, Weizhong Zhang, Xiaobo Xia, Jiahui Gao, Xiao Zhou, Tongliang Liu, and Bo Han. A holistic view of label noise transition matrix in deep learning and beyond. In International Conference on Learning Representations, 2023. URL https://api.semanticscholar.org/CorpusID:259298577.
Khanduri et al. [2021] Prashant Khanduri, Siliang Zeng, Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A near-optimal algorithm for stochastic bilevel optimization via double-momentum. Advances in neural information processing systems, 34:30271–30283, 2021.
Jorge and Stephen [2006] Jorge Nocedal and Stephen J. Wright. Numerical optimization, 2006.
Chen et al. [2023] Lesi Chen, Yaohua Ma, and Jingzhao Zhang. Near-optimal nonconvex-strongly-convex bilevel optimization with fully first-order oracles. arXiv preprint arXiv:2306.14853, 2023.
Devlin et al. [2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Grangier et al. [2024] David Grangier, Pierre Ablin, and Awni Hannun. Bilevel optimization to learn training distributions for language modeling under domain shift. In NeurIPS 2023 Workshop on Distribution Shifts: New Frontiers with Foundation Models, 2024. URL https://openreview.net/forum?id=D67r01BYYP.
Du et al. [2022b] Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. Glam: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, pages 5547–5569. PMLR, 2022b.
Xie et al. [2023] Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy S Liang. Data selection for language models via importance resampling. Advances in Neural Information Processing Systems, 36:34201–34227, 2023.
Mindermann et al. [2022] Sören Mindermann, Jan M Brauner, Muhammed T Razzak, Mrinank Sharma, Andreas Kirsch, Winnie Xu, Benedikt Höltgen, Aidan N Gomez, Adrien Morisot, Sebastian Farquhar, et al. Prioritized training on points that are learnable, worth learning, and not yet learnt. In International Conference on Machine Learning, pages 15630–15649. PMLR, 2022.
Thakkar et al. [2023] Megh Thakkar, Tolga Bolukbasi, Sriram Ganapathy, Shikhar Vashishth, Sarath Chandar, and Partha Talukdar. Self-influence guided data reweighting for language model pre-training. arXiv preprint arXiv:2311.00913, 2023.
Taori et al. [2023] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
Wei et al. [2021] Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
Chiang et al. [2023] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. Blog post, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
Lu and Mei [2023] Zhaosong Lu and Sanyou Mei. First-order penalty methods for bilevel optimization. arXiv preprint arXiv:2301.01716, 2023.
Nesterov [2012] Yu Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
Pan et al. [2024] Rui Pan, Xiang Liu, Shizhe Diao, Renjie Pi, Jipeng Zhang, Chi Han, and Tong Zhang. Lisa: Layerwise importance sampling for memory-efficient large language model fine-tuning. arXiv preprint arXiv:2403.17919, 2024.
Kingma and Ba [2015] Diederik P Kingma and Jimmy Lei Ba. ADAM: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
Peng et al. [2023] Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023.
Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Black et al. [2022] Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. GPT-NeoX-20B: An open-source autoregressive language model. In Proceedings of the ACL Workshop on Challenges & Perspectives in Creating Large Language Models, 2022. URL https://arxiv.org/abs/2204.06745.
01.AI et al. [2024] 01.AI, Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, and Zonghong Dai. Yi: Open foundation models by 01.ai, 2024.
Zheng et al. [2023a] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric. P Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023a.
Radford et al. [2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Loshchilov and Hutter [2017] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
Lian et al. [2023] Wing Lian, Guan Wang, Bleys Goodson, Eugene Pentland, Austin Cook, Chanvichet Vong, and "Teknium". Slimorca: An open dataset of gpt-4 augmented flan reasoning traces, with verification, 2023. URL https://huggingface.co/Open-Orca/SlimOrca.
Bian et al. [2023] Ning Bian, Hongyu Lin, Yaojie Lu, Xianpei Han, Le Sun, and Ben He. Chatalpaca: A multi-turn dialogue corpus based on alpaca instructions. https://github.com/cascip/ChatAlpaca, 2023.
Mukherjee et al. [2023] Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4, 2023.
Zhao et al. [2024] Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. Wildchat: 1m chatGPT interaction logs in the wild. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Bl8u7ZRlbM.
Zheng et al. [2023b] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric. P Xing, Joseph E. Gonzalez, Ion Stoica, and Hao Zhang. Lmsys-chat-1m: A large-scale real-world llm conversation dataset, 2023b.
"Teknium" [2023] "Teknium". Gpteacher general-instruct. https://huggingface.co/datasets/teknium/GPTeacher-General-Instruct, 2023.
Durbin [2023] Jon Durbin. Airoboros: using large language models to fine-tune large language models. https://huggingface.co/datasets/jondurbin/airoboros-3.2, 2023.

Appendix A Additional Experiments

A.1 Detailed MT-Bench Scores for LLaMA

The detailed MT-Bench scores for LLaMA-3-8B in Section B.2 are provided in Table 4.

Model	Method	Coding	Extraction	Humanities	Math	Reasoning	Roleplay	STEM	Writing	Average
LLaMA-3-8B	Baseline	4.65	6.20	7.50	3.40	5.25	7.70	7.00	7.20	6.11
LLaMA-3-8B	Ours	5.40	8.55	9.68	3.45	5.30	8.20	8.73	7.63	7.12

Table 4: Comparison of our approach and the baseline on MT-Bench.
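As a quick arithmetic check on Table 4, the reported averages can be recomputed from the eight category scores. The sketch below only restates the numbers in the table; the variable and function names are illustrative. The unrounded means come out at approximately 6.1125 and 7.1175, consistent with the reported 6.11 and 7.12.

```python
CATEGORIES = ["Coding", "Extraction", "Humanities", "Math",
              "Reasoning", "Roleplay", "STEM", "Writing"]
BASELINE = [4.65, 6.20, 7.50, 3.40, 5.25, 7.70, 7.00, 7.20]
OURS = [5.40, 8.55, 9.68, 3.45, 5.30, 8.20, 8.73, 7.63]


def mean(xs):
    """Arithmetic mean of a list of scores."""
    return sum(xs) / len(xs)


def per_category_gain(baseline, ours):
    """Score difference (ours minus baseline) for each MT-Bench category."""
    return {c: o - b for c, b, o in zip(CATEGORIES, baseline, ours)}


if __name__ == "__main__":
    print(f"baseline average: {mean(BASELINE):.4f}")
    print(f"ours average:     {mean(OURS):.4f}")
    gains = per_category_gain(BASELINE, OURS)
    # List categories from largest to smallest gain.
    for cat, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
        print(f"{cat:<11s} {gain:+.2f}")
```

In this recomputation every category improves, with the largest gains in Extraction, Humanities, and STEM and the smallest in Math and Reasoning.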

Appendix B Experimental Details

B.1 Small Scale Experiments

Throughout our small-scale experiments, we use GPT-2 [Radford et al., 2019] with 124 million parameters as the backbone model. For the bilevel optimization hyperparameters, we set the learning rate to $10^{-2}$ for the sampling weights and $10^{-5}$ for the model. We run our algorithm for 3 epochs with a batch size of 64 and $\alpha=10$, adopting AdamW [Loshchilov and Hutter, 2017] for optimization.

B.2 Large Scale Experiments

Datasets	Size	Kind	License
AlpacaGPT4 [Peng et al., 2023]	52K	Instruction	Apache-2.0
ShareGPT4 [Chiang et al., 2023]	6K	Conversation	Apache-2.0
SlimOrca [Lian et al., 2023]	518K	Instruction	MIT
AlpacaChat [Bian et al., 2023]	20K	Conversation	Apache-2.0
OpenOrcaGPT4 [Mukherjee et al., 2023]	1M	Instruction	MIT
WildChat [Zhao et al., 2024]	1M	Conversation	AI2 ImpACT
LMSYS-Chat [Zheng et al., 2023b]	1M	Conversation	LMSYS-Chat-1M
GPTeacher ["Teknium", 2023]	89K	Instruction	MIT
Airoboros [Durbin, 2023]	59K	Conversation	CC-BY-4.0
Alpaca-es (https://huggingface.co/datasets/bertin-project/alpaca-spanish)	52K	Instruction	CC-BY-4.0
Alpaca-de (https://huggingface.co/datasets/mayflowergmbh/alpaca-gpt4_de)	50K	Instruction	Apache-2.0
Alpaca-ja (https://huggingface.co/datasets/fujiki/japanese_alpaca_data)	52K	Instruction	CC-BY-NC-SA-4.0
Alpaca-ko (https://huggingface.co/datasets/Bingsu/ko_alpaca_data)	50K	Instruction	CC-BY-NC-4.0
Alpaca-ru (https://huggingface.co/datasets/IlyaGusev/ru_turbo_alpaca)	30K	Instruction	CC-BY-4.0
Alpaca-it (https://huggingface.co/datasets/mchl-labs/stambecco_data_it)	52K	Instruction	CC-BY-NC-SA-4.0
Alpaca-fr (https://huggingface.co/datasets/jpacifico/French-Alpaca-dataset-Instruct-55K)	55K	Instruction	Apache-2.0
Alpaca-zh (https://huggingface.co/datasets/llm-wizard/alpaca-gpt4-data-zh)	49K	Instruction	CC-BY-4.0
Alpaca-pt (https://huggingface.co/datasets/dominguesm/alpaca-data-pt-br)	52K	Instruction	CC-BY-NC-4.0

Table 5: Summary of the datasets used. The first nine rows (AlpacaGPT4 through Airoboros) are the high-quality datasets; the remaining nine rows are multilingual Alpaca datasets, with the language code in their names.

Our training data consists of 18 distinct sources as detailed in Table 5. We collect 9 high-quality datasets and 9 multilingual Alpaca datasets which serve as irrelevant data sources. For each data source, we preprocess by filtering out conversations/instructions that exceed the max length (1024 tokens in our experiments). For our reference dataset $\mathcal{D}_{\text{val}}$ , we prompt GPT4 using the prompt "Help me generate 3 sets of 2-turn instructions to evaluate the {category} ability of LLMs. The instructions for the second turn need to be highly relevant to the first turn. The following is an example.\n\n\n EXAMPLE:{example}\n TURN1:{turn1}\n TURN2:{turn2}\n", where {category} corresponds to one of the 8 categories in MT-Bench and {example} is one example from MT-Bench. In this way, we obtain a reference dataset that is highly similar to MT-Bench.
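The preprocessing and prompt construction described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the whitespace tokenizer is a stand-in (the tokenizer used for the 1024-token limit is not specified here), the function names are hypothetical, and the category list is taken from the column names of Table 4.

```python
MAX_TOKENS = 1024  # max length quoted in the text

# The 8 MT-Bench categories, as listed in Table 4.
MT_BENCH_CATEGORIES = ["coding", "extraction", "humanities", "math",
                       "reasoning", "roleplay", "stem", "writing"]

# Prompt quoted from the text, with its placeholders kept as format fields.
PROMPT_TEMPLATE = (
    "Help me generate 3 sets of 2-turn instructions to evaluate the {category} "
    "ability of LLMs. The instructions for the second turn need to be highly "
    "relevant to the first turn. The following is an example.\n\n\n"
    " EXAMPLE:{example}\n TURN1:{turn1}\n TURN2:{turn2}\n"
)


def count_tokens(text):
    """Stand-in token counter (whitespace split); an assumption, not the real tokenizer."""
    return len(text.split())


def filter_by_length(samples, max_tokens=MAX_TOKENS, counter=count_tokens):
    """Keep only the samples that do not exceed the maximum length."""
    return [s for s in samples if counter(s) <= max_tokens]


def build_prompt(category, example, turn1, turn2):
    """Fill the reference-data generation prompt for one MT-Bench category."""
    return PROMPT_TEMPLATE.format(
        category=category, example=example, turn1=turn1, turn2=turn2
    )


if __name__ == "__main__":
    kept = filter_by_length(["a short instruction", "word " * 2000])
    print(len(kept))
    print(build_prompt(MT_BENCH_CATEGORIES[3], "<example>", "<turn 1>", "<turn 2>"))
```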

Appendix C Important Lemmas

Suppose Assumption 1 holds. Then the functions $\mathcal{L}^{\alpha}(\lambda,w,u)$ and $\Gamma^{\alpha}(\lambda)$ satisfy the following properties.

Lemma 2.

Under Assumption 1, the following hold:

(i)

$\mathcal{L}^{\alpha}(\lambda,w,u)$ is $\mu_{2}\alpha$ -strongly concave w.r.t. $u$ ;
(ii)

$\mathcal{L}^{\alpha}(\lambda,w,u)$ is $\mu_{2}\alpha/2$ -strongly convex w.r.t. $w$ if $\alpha>2\ell_{11}/\mu_{2}$ .

The results of Lemma 2 can be found in [Kwon et al., 2023] and Lemma B.1 of [Chen et al., 2023]. From Lemma B.7 in [Chen et al., 2023], the following result holds for $\Gamma^{\alpha}(\lambda)$ :

Lemma 3.

Under Assumption 1, if $\alpha>2\ell_{11}/\mu_{2}$ , then $\Gamma^{\alpha}(\lambda)$ is $\ell_{\Gamma}$ -smooth, where $\ell_{\Gamma}=\mathcal{O}(\kappa^{3})$ is a constant that is independent of $\alpha$ .

Moreover, the functions $w_{\ast}^{\alpha}(\lambda)$ and $u_{\ast}(\lambda)$ satisfy the following properties.

Lemma 4.

Under Assumption 1, we have

$\displaystyle\left\|w_{\ast}^{\alpha}(\lambda)-w_{\ast}(\lambda)\right\|\leq\frac{C_{0}}{\alpha},$

where $C_{0}=\ell_{10}/\mu_{2}$ .

The result in Lemma 4 follows from Lemma B.2 of Chen et al. [2023].
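The $\mathcal{O}(1/\alpha)$ gap in Lemma 4 can be illustrated on a one-dimensional toy problem. The sketch below is an assumption-laden example rather than part of the analysis: it takes $L_{1}(w)=\frac{1}{2}(w-a)^{2}$ and $L_{2}(\lambda,w)=\frac{\mu}{2}(w-\lambda)^{2}$, so that $w_{\ast}(\lambda)=\lambda$ and the minimizer of $L_{1}+\alpha L_{2}$ is available in closed form, and it uses $|a-\lambda|$ as a local stand-in for $\ell_{10}$, since it bounds $|\partial L_{1}/\partial w|$ on the segment between $\lambda$ and $a$.

```python
def w_star(lam):
    """Minimizer of the toy L2(lam, w) = 0.5 * mu * (w - lam)**2 over w."""
    return lam


def w_star_alpha(lam, a, mu, alpha):
    """Minimizer over w of L1 + alpha * L2 with toy L1(w) = 0.5 * (w - a)**2.

    Setting the derivative (w - a) + alpha * mu * (w - lam) to zero gives
    the closed form below.
    """
    return (a + alpha * mu * lam) / (1.0 + alpha * mu)


def gap_and_bound(lam, a, mu, alpha):
    """Return (|w*_alpha - w*|, C0 / alpha) for the toy problem."""
    gap = abs(w_star_alpha(lam, a, mu, alpha) - w_star(lam))
    ell_10 = abs(a - lam)  # local bound on |dL1/dw| (toy assumption)
    return gap, ell_10 / (mu * alpha)


if __name__ == "__main__":
    for alpha in (1.0, 10.0, 100.0, 1000.0):
        gap, bound = gap_and_bound(lam=0.3, a=2.0, mu=0.5, alpha=alpha)
        print(f"alpha={alpha:7.1f}  gap={gap:.6f}  C0/alpha={bound:.6f}")
```

In this toy case the gap equals $|a-\lambda|/(1+\alpha\mu)$, which stays below $|a-\lambda|/(\alpha\mu)$ for every $\alpha>0$.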

Lemma 5.

Under Assumption 1, if $\alpha>2\ell_{11}/\mu_{2}$ , then we have

(i)

$u_{\ast}(\lambda)$ is $\kappa$ -Lipschitz continuous;
(ii)

$w_{\ast}^{\alpha}(\lambda)$ is $\ell_{w_{\ast},0}$ -Lipschitz continuous where $\ell_{w_{\ast},0}=3\kappa$ .

Here the condition number is $\kappa=\max\left\{\ell_{10},\ell_{11},\ell_{21},\ell_{22}\right\}/\mu_{2}$ .

Claim (i) in Lemma 5 can be found in Lemma 2.2 of [Ghadimi and Wang, 2018], and Claim (ii) follows from Lemma 3.2 (setting $\lambda_{1}=\lambda_{2}$ ) of [Kwon et al., 2023].

Lemma 6.

Under Assumption 1, if $\alpha>2\ell_{11}/\mu_{2}$ , then $u_{\ast}(\lambda)$ is $\ell_{\nabla u_{\ast}}$ -smooth, where $\ell_{\nabla u_{\ast}}=\mathcal{O}\left(\frac{\kappa^{2}}{\mu_{2}}\left(\ell_{21}+1\right)\right)$ and the condition number is $\kappa=\max\left\{\ell_{10},\ell_{11},\ell_{21},\ell_{22}\right\}/\mu_{2}$ .

This claim follows from Lemma A.3 of [Kwon et al., 2023] together with the Lipschitz continuity of $u_{\ast}(\lambda)$ from Lemma 5.

Appendix D Proofs of Theorem 1

Proof.

We sample the function $\mathcal{L}^{\alpha}$ by the following mini-batch approximation $\mathcal{L}_{D_{k}}^{\alpha}$ per iteration:

$\displaystyle\mathcal{L}_{D_{k}}^{\alpha}(\lambda,w,u):=L_{1}(\lambda,w;D_{\text{val}}^{k})+\alpha\left(L_{2}(\lambda,w;D_{\text{tr}}^{k})-L_{2}(\lambda,u;D_{\text{tr}}^{k})\right)$	(15)

where $D_{k}=\left\{D_{\text{tr}}^{k},D_{\text{val}}^{k}\right\}$ and $D_{\text{tr}}^{k},D_{\text{val}}^{k}$ are i.i.d. mini-batch samples from the training and validation datasets $D_{\text{tr}},D_{\text{val}}$ , respectively. We use $\mathcal{F}_{k}$ to denote the random information before the iteration $(\lambda_{k},w_{k},u_{k})$ , that is, $\mathcal{F}_{k}:=\sigma\left(\left\{(\lambda_{k},w_{k},u_{k}),D_{k-1},\cdots,D_{1}\right\}\right)$ . We use $\mathcal{C}_{k}=\sigma\left(\left\{j_{1},j_{2},\cdots,j_{k-1};r_{1},r_{2},\cdots,r_{k-1}\right\}\right)$ to denote the random information of the variables $u,w$ for the randomized block coordinates before iteration $k$ .
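The mini-batch objective in (15) can be sketched on a toy problem. Everything concrete below is an illustrative assumption rather than the paper's setup: the model is a scalar $y\approx wx$, the toy $L_{1}$ ignores $\lambda$, the toy $L_{2}$ weights each training sample by the entry of $\lambda$ for its data source, and the function names are hypothetical.

```python
import random


def sample_minibatch(dataset, batch_size, rng):
    """Draw an i.i.d. mini-batch (sampling with replacement)."""
    return [rng.choice(dataset) for _ in range(batch_size)]


def val_loss(w, batch):
    """Toy L1: mean squared error of the scalar model y ~ w * x."""
    return sum((w * x - y) ** 2 for x, y in batch) / len(batch)


def train_loss(lam, w, batch):
    """Toy L2: squared error weighted by the per-source weight lam[src]."""
    return sum(lam[src] * (w * x - y) ** 2 for x, y, src in batch) / len(batch)


def penalty_objective(lam, w, u, val_batch, tr_batch, alpha):
    """Mini-batch penalty objective: L1(w) + alpha * (L2(lam, w) - L2(lam, u))."""
    return val_loss(w, val_batch) + alpha * (
        train_loss(lam, w, tr_batch) - train_loss(lam, u, tr_batch)
    )


if __name__ == "__main__":
    rng = random.Random(0)
    d_val = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
    # Training samples are (x, y, source_index); source 1 is mislabeled on purpose.
    d_tr = [(1.0, 2.0, 0), (2.0, 4.0, 0), (1.0, -1.0, 1), (2.0, -2.0, 1)]
    lam = [0.5, 0.5]
    val_batch = sample_minibatch(d_val, 2, rng)
    tr_batch = sample_minibatch(d_tr, 2, rng)
    print(penalty_objective(lam, 1.5, 1.0, val_batch, tr_batch, alpha=100.0))
```

Note that the penalty term vanishes whenever $w=u$, so the objective then reduces to the validation loss alone.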

We recall the update rule for $\lambda$ in the stochastic version of the minimax algorithm: $\lambda_{k+1}-\lambda_{k}=-\eta_{\lambda}\nabla_{\lambda}\mathcal{L}_{D_{k}}^{\alpha}(\lambda_{k},w_{k},u_{k})$ . At each iteration,

$\displaystyle\mathbb{E}[\nabla\mathcal{L}_{D_{k}}^{\alpha}(\lambda_{k},w_{k},u_{k})\mid\mathcal{F}_{k}]=\nabla\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{k}).$	(16)
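The stochastic update of $\lambda$ recalled above can be sketched for a toy loss in which the gradient is available in closed form. The assumptions are illustrative and not the paper's: $L_{1}$ does not depend on $\lambda$, and $L_{2}(\lambda,w;D)$ is the mean of $\lambda_{s}(wx-y)^{2}$ over samples $(x,y,s)$ with source index $s$, so that the penalty objective is linear in $\lambda$. Function names are hypothetical.

```python
def grad_lambda(lam, w, u, tr_batch, alpha):
    """Gradient of the mini-batch penalty objective w.r.t. lam (toy losses).

    Because the toy L2 is linear in lam, the component for source s is
    alpha times the batch mean of [loss(w) - loss(u)] over samples from s.
    """
    grad = [0.0] * len(lam)
    for x, y, src in tr_batch:
        diff = (w * x - y) ** 2 - (u * x - y) ** 2
        grad[src] += alpha * diff / len(tr_batch)
    return grad


def lambda_step(lam, grad, eta_lambda):
    """One update lam_{k+1} = lam_k - eta_lambda * grad."""
    return [l - eta_lambda * g for l, g in zip(lam, grad)]


if __name__ == "__main__":
    tr_batch = [(1.0, 1.0, 0), (1.0, 0.0, 1)]  # (x, y, source_index)
    lam = [0.5, 0.5]
    g = grad_lambda(lam, w=1.0, u=0.0, tr_batch=tr_batch, alpha=2.0)
    print(g)
    print(lambda_step(lam, g, eta_lambda=0.1))
```

In this toy setting, a source on which $w$ attains a lower loss than $u$ receives a negative gradient component, so its weight increases after the step.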

By the smoothness of $\Gamma^{\alpha}$ (see Lemma 3), we have

$\displaystyle\begin{aligned}
\Gamma^{\alpha}(\lambda_{k+1})&\leq\Gamma^{\alpha}(\lambda_{k})+\left\langle\nabla\Gamma^{\alpha}(\lambda_{k}),\lambda_{k+1}-\lambda_{k}\right\rangle+\frac{\ell_{\Gamma}}{2}\left\|\lambda_{k+1}-\lambda_{k}\right\|^{2}\\
&=\Gamma^{\alpha}(\lambda_{k})-\eta_{\lambda}\left\langle\nabla\Gamma^{\alpha}(\lambda_{k}),\nabla_{\lambda}\mathcal{L}_{D_{k}}^{\alpha}(\lambda_{k},w_{k},u_{k})\right\rangle+\frac{\ell_{\Gamma}\eta_{\lambda}^{2}}{2}\left\|\nabla_{\lambda}\mathcal{L}_{D_{k}}^{\alpha}(\lambda_{k},w_{k},u_{k})\right\|^{2}.
\end{aligned}$	(17)

Taking conditional expectation w.r.t. $\mathcal{F}_{k},\mathcal{C}_{k}$ on the above inequality, we have

$\displaystyle\begin{aligned}
&\mathbb{E}[\Gamma^{\alpha}(\lambda_{k+1})\mid\mathcal{F}_{k},\mathcal{C}_{k}]\\
&\leq\Gamma^{\alpha}(\lambda_{k})-\eta_{\lambda}\left\langle\nabla\Gamma^{\alpha}(\lambda_{k}),\mathbb{E}[\nabla_{\lambda}\mathcal{L}_{D_{k}}^{\alpha}(\lambda_{k},w_{k},u_{k})\mid\mathcal{F}_{k},\mathcal{C}_{k}]\right\rangle+\frac{\ell_{\Gamma}\eta_{\lambda}^{2}}{2}\mathbb{E}\left[\left\|\nabla_{\lambda}\mathcal{L}_{D_{k}}^{\alpha}(\lambda_{k},w_{k},u_{k})\right\|^{2}\mid\mathcal{F}_{k},\mathcal{C}_{k}\right]\\
&\leq\Gamma^{\alpha}(\lambda_{k})-\eta_{\lambda}\left\langle\nabla\Gamma^{\alpha}(\lambda_{k}),\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{k})\right\rangle+\frac{\ell_{\Gamma}\eta_{\lambda}^{2}}{2}\mathbb{E}\left[\left\|\nabla_{\lambda}\mathcal{L}_{D_{k}}^{\alpha}(\lambda_{k},w_{k},u_{k})\right\|^{2}\mid\mathcal{F}_{k},\mathcal{C}_{k}\right]
\end{aligned}$	(18)

where the last inequality follows from the fact that $\mathcal{L}_{D_{k}}^{\alpha}$ is an unbiased estimator of $\mathcal{L}^{\alpha}$ , and

$\displaystyle\begin{aligned}
&\mathbb{E}\left[\left\|\nabla_{\lambda}\mathcal{L}_{D_{k}}^{\alpha}(\lambda_{k},w_{k},u_{k})\right\|^{2}\mid\mathcal{F}_{k},\mathcal{C}_{k}\right]\\
&=\mathbb{E}\left[\left\|\nabla_{\lambda}\mathcal{L}_{D_{k}}^{\alpha}(\lambda_{k},w_{k},u_{k})-\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{k})+\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{k})\right\|^{2}\mid\mathcal{F}_{k}\right]\\
&\leq\mathbb{E}\left[\left\|\nabla_{\lambda}\mathcal{L}_{D_{k}}^{\alpha}(\lambda_{k},w_{k},u_{k})-\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{k})\right\|^{2}\mid\mathcal{F}_{k}\right]+\left\|\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{k})\right\|^{2}\\
&\leq\frac{\sigma_{1}^{2}+2\alpha^{2}\sigma_{2}^{2}}{B}+\left\|\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{k})\right\|^{2}.
\end{aligned}$	(19)

Here the last step holds because the variance of the mini-batch stochastic gradients (with batch size $B$ ) is bounded as

$\displaystyle\mathbb{E}\left[\left\|\nabla L_{1}(\lambda,w;D_{\text{val}}^{k})-\nabla L_{1}(\lambda,w)\right\|^{2}\right]\leq\frac{\sigma_{1}^{2}}{B},\quad\mathbb{E}\left[\left\|\nabla L_{2}(\lambda,w;D_{\text{tr}}^{k})-\nabla L_{2}(\lambda,w)\right\|^{2}\right]\leq\frac{\sigma_{2}^{2}}{B},$	(20)

so that

$\displaystyle\begin{aligned}
&\mathbb{E}\left[\left\|\nabla_{\lambda}\mathcal{L}_{D_{k}}^{\alpha}(\lambda_{k},w_{k},u_{k})-\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{k})\right\|^{2}\mid\mathcal{F}_{k}\right]\\
&=\mathbb{E}\left[\left\|\nabla_{\lambda}L_{1}(\lambda_{k},w_{k};D_{\text{val}}^{k})-\nabla_{\lambda}L_{1}(\lambda_{k},w_{k})\right\|^{2}+\alpha^{2}\left\|\nabla_{\lambda}L_{2}(\lambda_{k},w_{k};D_{\text{tr}}^{k})-\nabla_{\lambda}L_{2}(\lambda_{k},w_{k})\right\|^{2}\mid\mathcal{F}_{k}\right]\\
&\quad+\alpha^{2}\mathbb{E}\left[\left\|\nabla_{\lambda}L_{2}(\lambda_{k},u_{k};D_{\text{tr}}^{k})-\nabla_{\lambda}L_{2}(\lambda_{k},u_{k})\right\|^{2}\mid\mathcal{F}_{k}\right]\\
&\leq\frac{\sigma_{1}^{2}+2\alpha^{2}\sigma_{2}^{2}}{B}.
\end{aligned}$	(21)
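The $1/B$ scaling of the variance bounds in (20) and (21) can be checked numerically. The sketch below uses scalar Gaussian noise as a stand-in for per-sample stochastic gradients, which is an illustrative assumption rather than the setting of the paper; the function name is hypothetical.

```python
import random
import statistics


def minibatch_mean_variance(batch_size, trials, rng, sigma=1.0):
    """Empirical variance of the mean of `batch_size` i.i.d. noise draws."""
    means = []
    for _ in range(trials):
        batch = [rng.gauss(0.0, sigma) for _ in range(batch_size)]
        means.append(sum(batch) / batch_size)
    return statistics.pvariance(means)


if __name__ == "__main__":
    rng = random.Random(0)
    sigma = 1.0
    for batch_size in (1, 4, 16, 64):
        empirical = minibatch_mean_variance(batch_size, 2000, rng, sigma)
        predicted = sigma ** 2 / batch_size
        print(f"B={batch_size:3d}  empirical={empirical:.4f}  sigma^2/B={predicted:.4f}")
```

Up to sampling error, the empirical variance tracks $\sigma^{2}/B$, which is the dependence on the batch size that enters the last term of the descent inequality.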

Applying the above results, we have

$\displaystyle\begin{aligned}
\mathbb{E}[\Gamma^{\alpha}(\lambda_{k+1})\mid\mathcal{F}_{k},\mathcal{C}_{k}]&\leq\Gamma^{\alpha}(\lambda_{k})-\eta_{\lambda}\left\langle\nabla\Gamma^{\alpha}(\lambda_{k}),\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{k})\right\rangle+\frac{\ell_{\Gamma}\eta_{\lambda}^{2}}{2}\left\|\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{k})\right\|^{2}\\
&\quad+\frac{\ell_{\Gamma}\eta_{\lambda}^{2}}{2B}\left(\sigma_{1}^{2}+2\alpha^{2}\sigma_{2}^{2}\right).
\end{aligned}$	(22)

Let $\delta_{k}=\left\|u_{k}-u_{\ast}(\lambda_{k})\right\|^{2}$ and $r_{k}=\left\|w_{k}-w_{\ast}^{\alpha}(\lambda_{k})\right\|^{2}$ . The inner-product term on the right-hand side of (22) is bounded as follows:

$\displaystyle\begin{aligned}
&-\left\langle\nabla\Gamma^{\alpha}(\lambda_{k}),\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{k})\right\rangle\\
&=-\left\langle\nabla\Gamma^{\alpha}(\lambda_{k}),\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{k})-\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{\ast}(\lambda_{k}))\right\rangle\\
&\quad-\left\langle\nabla\Gamma^{\alpha}(\lambda_{k}),\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{\ast}(\lambda_{k}))-\nabla_{\lambda}\Phi^{\alpha}(\lambda_{k},w_{\ast}^{\alpha}(\lambda_{k}))+\nabla_{\lambda}\Phi^{\alpha}(\lambda_{k},w_{\ast}^{\alpha}(\lambda_{k}))\right\rangle\\
&\mathop{=}^{(a)}-\left\langle\nabla\Gamma^{\alpha}(\lambda_{k}),\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{k})-\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{\ast}(\lambda_{k}))\right\rangle\\
&\quad-\left\langle\nabla\Gamma^{\alpha}(\lambda_{k}),\nabla_{\lambda}\Phi^{\alpha}(\lambda_{k},w_{k})-\nabla_{\lambda}\Phi^{\alpha}(\lambda_{k},w_{\ast}^{\alpha}(\lambda_{k}))+\nabla\Gamma^{\alpha}(\lambda_{k})\right\rangle\\
&=-\left\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2}-\left\langle\nabla\Gamma^{\alpha}(\lambda_{k}),\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{k})-\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{\ast}(\lambda_{k}))\right\rangle\\
&\quad-\left\langle\nabla\Gamma^{\alpha}(\lambda_{k}),\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{\ast}(\lambda_{k}))-\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{\ast}^{\alpha}(\lambda_{k}),u_{\ast}(\lambda_{k}))\right\rangle\\
&\mathop{\leq}^{(b)}-\frac{1}{2}\left\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2}+\left\|\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{k})-\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{\ast}(\lambda_{k}))\right\|^{2}\\
&\quad+\left\|\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{\ast}(\lambda_{k}))-\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{\ast}^{\alpha}(\lambda_{k}),u_{\ast}(\lambda_{k}))\right\|^{2}\\
&\mathop{\leq}^{(c)}-\frac{1}{2}\left\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2}+\alpha^{2}\ell_{21}^{2}\left\|u_{k}-u_{\ast}(\lambda_{k})\right\|^{2}+2\left(\ell_{11}^{2}+\alpha^{2}\ell_{21}^{2}\right)\left\|w_{k}-w_{\ast}^{\alpha}(\lambda_{k})\right\|^{2}\\
&=-\frac{1}{2}\left\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2}+\alpha^{2}\ell_{21}^{2}\delta_{k}+2\left(\ell_{11}^{2}+\alpha^{2}\ell_{21}^{2}\right)r_{k}
\end{aligned}$	(23)

where $(a)$ uses the optimality of $\Phi^{\alpha}$ over $w$, which gives $\nabla_{\lambda}\Phi^{\alpha}(\lambda_{k},w_{\ast}^{\alpha}(\lambda_{k}))=\nabla\Gamma^{\alpha}(\lambda_{k})=\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{\ast}^{\alpha}(\lambda_{k}),u_{\ast}(\lambda_{k}))$; $(b)$ follows from the Cauchy-Schwarz and Young inequalities; and $(c)$ uses the smoothness of $L_{1}$ and $L_{2}$. Next, we bound the norm of the gradient $\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{k})$ as follows:

$\displaystyle\left\\|\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{% k})\right\\|^{2}$	$\displaystyle=\left\\|\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_% {k})-\nabla\Gamma^{\alpha}(\lambda_{k})+\nabla\Gamma^{\alpha}(\lambda_{k})% \right\\|^{2}$
	$\displaystyle\leq 2\left(\left\\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\\|^{2}% +\left\\|\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{k})-\nabla% \Gamma^{\alpha}(\lambda_{k})\right\\|^{2}\right)$
	$\displaystyle\leq 2\left\\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\\|^{2}+4% \left\\|\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{k})-\nabla_{% \lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{\ast}(\lambda_{k}))\right\\|^% {2}$
	$\displaystyle\quad+4\left\\|\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_% {k},u_{\ast}(\lambda_{k}))-\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_% {\ast}^{\alpha}(\lambda_{k}),u_{\ast}(\lambda_{k}))\right\\|^{2}$
	$\displaystyle\mathop{\leq}^{(a)}2\left\\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\\|^{2}+4\alpha^{2}\ell_{21}^{2}\left\\|u_{k}-u_{\ast}(\lambda_{k})\right\\|^{2}+8\left(\ell_{11}^{2}+\alpha^{2}\ell_{21}^{2}\right)\left\\|\omega_{k}-\omega_{\ast}^{\alpha}(\lambda_{k})\right\\|^{2}$
	$\displaystyle=2\left\\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\\|^{2}+4\alpha^{2}\ell_{21}^{2}\delta_{k}+8\left(\ell_{11}^{2}+\alpha^{2}\ell_{21}^{2}\right)r_{k}$	(24)

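where $(a)$ uses the smoothness of the objectives $L_{1}$ and $L_{2}$. Substituting the bounds (23) and (24) into the earlier descent bound on $\mathbb{E}[\Gamma^{\alpha}(\lambda_{k+1})\mid\mathcal{F}_{k},\mathcal{C}_{k}]$ gives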

	$\displaystyle\mathbb{E}[\Gamma^{\alpha}(\lambda_{k+1})\mid\mathcal{F}_{k},\mathcal{C}_{k}]$	$\displaystyle\leq\Gamma^{\alpha}(\lambda_{k})-\frac{\eta_{\lambda}}{2}\left\\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\\|^{2}+\frac{\ell_{\Gamma}\eta_{\lambda}^{2}}{2}\left(2\left\\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\\|^{2}+4\alpha^{2}\ell_{21}^{2}\delta_{k}+8\left(\ell_{11}^{2}+\alpha^{2}\ell_{21}^{2}\right)r_{k}\right)$
		$\displaystyle+\eta_{\lambda}\left(\alpha^{2}\ell_{21}^{2}\delta_{k}+2\left(\ell_{11}^{2}+\alpha^{2}\ell_{21}^{2}\right)r_{k}\right)+\frac{\ell_{\Gamma}\eta_{\lambda}^{2}}{2}\left(\sigma_{1}^{2}+2\alpha^{2}\sigma_{2}^{2}\right).$		(25)
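
The two elementary vector inequalities behind (23) and (24) can be sanity-checked numerically. The sketch below is only an illustration, not part of the proof: `g` stands in for $\nabla\Gamma^{\alpha}(\lambda_{k})$ and `e1`, `e2` stand in for the two gradient-difference terms, all drawn at random; the helper names are hypothetical.

```python
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def sq_norm(x):
    return dot(x, x)

def check_young_bounds(dim=5, trials=1000, seed=0):
    """Return True if both elementary bounds hold on every random draw."""
    rng = random.Random(seed)
    tol = 1e-9  # slack for floating-point rounding
    for _ in range(trials):
        # g plays the role of grad Gamma; e1, e2 play the two error terms (assumed random here).
        g = [rng.gauss(0, 1) for _ in range(dim)]
        e1 = [rng.gauss(0, 1) for _ in range(dim)]
        e2 = [rng.gauss(0, 1) for _ in range(dim)]
        # Step (b) of (23): -<g, g+e1+e2> <= -||g||^2/2 + ||e1||^2 + ||e2||^2
        lhs_b = -sq_norm(g) - dot(g, e1) - dot(g, e2)
        rhs_b = -0.5 * sq_norm(g) + sq_norm(e1) + sq_norm(e2)
        # Chain leading to (24): ||g+e1+e2||^2 <= 2||g||^2 + 4||e1||^2 + 4||e2||^2
        total = [a + b + c for a, b, c in zip(g, e1, e2)]
        lhs_24 = sq_norm(total)
        rhs_24 = 2 * sq_norm(g) + 4 * sq_norm(e1) + 4 * sq_norm(e2)
        if lhs_b > rhs_b + tol or lhs_24 > rhs_24 + tol:
            return False
    return True

ok = check_young_bounds()
print(ok)
```

Both bounds are deterministic consequences of $-\langle x,y\rangle\leq\frac{1}{4}\|x\|^{2}+\|y\|^{2}$ and $\|x+y\|^{2}\leq 2\|x\|^{2}+2\|y\|^{2}$, so the check is expected to pass for any seed.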

Next, we estimate $\delta_{k}$ and $r_{k}$. The inner variables $u$ and $w$ are updated by a randomized block coordinate method with $J$ blocks in total, where the updated block is chosen uniformly at random. By the strong concavity of $\mathcal{L}^{\alpha}$ with respect to $u$, we first obtain the following estimate for $\delta_{k}$:

		$\displaystyle\mathbb{E}\left[\left\\|u_{k+1}-u_{\ast}(\lambda_{k})\right\\|^{2}% \mid\mathcal{F}_{k},\mathcal{C}_{k}\right]=\mathbb{E}\left[\left\\|u_{k}-\alpha% \eta_{u}U_{j_{t}}\nabla_{u}L_{2}\left(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text% {tr}}\right)-u_{\ast}(\lambda_{k})\right\\|^{2}\mid\mathcal{F}_{k},\mathcal{C}_% {k}\right]$
		$\displaystyle=\left\\|u_{\ast}(\lambda_{k})-u_{k}\right\\|^{2}-2\alpha\eta_{u}% \mathbb{E}\left[\left\langle u_{k}-u_{\ast}(\lambda_{k}),\nabla_{u}L_{2}(% \lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})\right\rangle_{j_{t}}\mid% \mathcal{F}_{k},\mathcal{C}_{k}\right]$
		$\displaystyle\quad+\alpha^{2}\eta_{u}^{2}\mathbb{E}\left[\left\\|U_{j_{t}}\nabla_{u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})\right\\|^{2}\mid\mathcal{F}_{k},\mathcal{C}_{k}\right]$
		$\displaystyle\mathop{=}^{(a)}\left\\|u_{\ast}(\lambda_{k})-u_{k}\right\\|^{2}-% \frac{2\alpha\eta_{u}}{J}\left\langle u_{k}-u_{\ast}(\lambda_{k}),\nabla_{u}L_% {2}(\lambda_{k},u_{k})\right\rangle+\frac{\alpha^{2}\eta_{u}^{2}}{J}\mathbb{E}% \left[\left\\|\nabla_{u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})% \right\\|^{2}\mid\mathcal{F}_{k}\right]$
		$\displaystyle\mathop{\leq}^{(b)}\left\\|u_{\ast}(\lambda_{k})-u_{k}\right\\|^{2}% -\frac{2\eta_{u}\alpha}{J}\left(L_{2}(\lambda_{k},u_{k})-L_{2}(\lambda_{k},u_{% \ast}(\lambda_{k}))+\frac{\mu_{2}}{2}\left\\|u_{\ast}(\lambda_{k})-u_{k}\right% \\|^{2}\right)$
		$\displaystyle\quad+\frac{\alpha^{2}\eta_{u}^{2}}{J}\mathbb{E}\left[\left\\|% \nabla_{u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})\right\\|^{2}\mid% \mathcal{F}_{k}\right]$
		$\displaystyle\mathop{=}^{(c)}\left\\|u_{\ast}(\lambda_{k})-u_{k}\right\\|^{2}-% \frac{2\eta_{u}\alpha}{J}\left(L_{2}(\lambda_{k},u_{k})-L_{2}(\lambda_{k},u_{% \ast}(\lambda_{k}))+\frac{\mu_{2}}{2}\left\\|u_{\ast}(\lambda_{k})-u_{k}\right% \\|^{2}\right)$
		$\displaystyle\quad+\frac{\alpha^{2}\eta_{u}^{2}}{J}\mathbb{E}\left[\left\\|% \nabla_{u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})-\nabla_{u}L_{2}% (\lambda_{k},u_{k})\right\\|^{2}\mid\mathcal{F}_{k}\right]+\frac{\alpha^{2}\eta% _{u}^{2}}{J}\left\\|\nabla_{u}L_{2}(\lambda_{k},u_{k})\right\\|^{2}$
		$\displaystyle\mathop{\leq}^{(d)}\left\\|u_{\ast}(\lambda_{k})-u_{k}\right\\|^{2}% -\frac{2\eta_{u}\alpha}{J}\left(L_{2}(\lambda_{k},u_{k})-L_{2}(\lambda_{k},u_{% \ast}(\lambda_{k}))+\frac{\mu_{2}}{2}\left\\|u_{\ast}(\lambda_{k})-u_{k}\right% \\|^{2}\right)$
		$\displaystyle\quad+\frac{\alpha^{2}\eta_{u}^{2}}{J}\mathbb{E}\left[\left\\|% \nabla_{u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})-\nabla_{u}L_{2}% (\lambda_{k},u_{k})\right\\|^{2}\mid\mathcal{F}_{k}\right]+\frac{2\ell_{21}\eta% _{u}^{2}\alpha^{2}}{J}\left(L_{2}(\lambda_{k},u_{k})-L_{2}(\lambda_{k},u_{\ast% }(\lambda_{k}))\right)$
		$\displaystyle\mathop{\leq}^{(e)}\left(1-\frac{\alpha\mu_{2}\eta_{u}}{J}\right)% \left\\|u_{\ast}(\lambda_{k})-u_{k}\right\\|^{2}+\frac{\alpha^{2}\eta_{u}^{2}% \sigma_{2}^{2}}{JB}.$		(26)

where $(a)$ uses the fact that, since the block index $j_{t}$ is chosen uniformly from $\left\{1,2,\cdots,J\right\}$ , we have

	$\displaystyle\mathbb{E}\left[\left\langle u_{k}-u_{\ast}(\lambda_{k}),\nabla_{% u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})\right\rangle_{j_{t}}% \mid\mathcal{F}_{k},\mathcal{C}_{k}\right]$	$\displaystyle=\frac{1}{J}\mathbb{E}\left[\left\langle u_{k}-u_{\ast}(\lambda_{% k}),\nabla_{u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})\right% \rangle\mid\mathcal{F}_{k}\right]$
		$\displaystyle=\frac{1}{J}\left\langle u_{k}-u_{\ast}(\lambda_{k}),\nabla_{u}L_% {2}(\lambda_{k},u_{k})\right\rangle$		(27)

and

	$\displaystyle\mathbb{E}\left[\left\\|U_{j_{t}}\nabla_{u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})\right\\|^{2}\mid\mathcal{F}_{k},\mathcal{C}_{k}\right]=\frac{1}{J}\mathbb{E}\left[\left\\|\nabla_{u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})\right\\|^{2}\mid\mathcal{F}_{k}\right]$	(28)

$(b)$ follows from the strong convexity of $L_{2}$ w.r.t. $u$ which implies that

	$\displaystyle L_{2}(\lambda_{k},u_{\ast}(\lambda_{k}))\geq L_{2}(\lambda_{k},u_{k})+\left\langle\nabla_{u}L_{2}(\lambda_{k},u_{k}),u_{\ast}(\lambda_{k})-u_{k}\right\rangle+\frac{\mu_{2}}{2}\left\\|u_{k}-u_{\ast}(\lambda_{k})\right\\|^{2},$

$(c)$ uses the unbiasedness relation $\mathbb{E}\left[\nabla_{u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})\mid\mathcal{F}_{k}\right]=\nabla_{u}L_{2}(\lambda_{k},u_{k})$ , which implies that

	$\displaystyle\mathbb{E}\left[\left\\|\nabla_{u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})\right\\|^{2}\mid\mathcal{F}_{k}\right]=\mathbb{E}\left[\left\\|\nabla_{u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})-\nabla_{u}L_{2}(\lambda_{k},u_{k})\right\\|^{2}\mid\mathcal{F}_{k}\right]+\left\\|\nabla_{u}L_{2}(\lambda_{k},u_{k})\right\\|^{2}$	(29)

and $(d)$ uses the optimality of $u_{\ast}(\lambda)$ and the smoothness of $L_{2}$ such that

$\displaystyle L_{2}(\lambda_{k},u_{\ast}(\lambda_{k}))-L_{2}(\lambda_{k},u_{k})$	$\displaystyle\leq L_{2}(\lambda_{k},\tilde{u})-L_{2}(\lambda_{k},u_{k})$
	$\displaystyle\leq L_{2}(\lambda_{k},u_{k})+\left\langle\nabla_{u}L_{2}(\lambda% _{k},u_{k}),\tilde{u}-u_{k}\right\rangle+\frac{\ell_{21}}{2}\left\\|\tilde{u}-u% _{k}\right\\|^{2}-L_{2}(\lambda_{k},u_{k})$
	$\displaystyle=-\frac{1}{2\ell_{21}}\left\\|\nabla_{u}L_{2}(\lambda_{k},u_{k})% \right\\|^{2}$	(30)

where $\tilde{u}=u_{k}-\frac{1}{\ell_{21}}\nabla_{u}L_{2}(\lambda_{k},u_{k})$ and $(e)$ uses

	$\displaystyle\mathbb{E}\left[\left\\|\nabla_{u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})-\nabla_{u}L_{2}(\lambda_{k},u_{k})\right\\|^{2}\mid\mathcal{F}_{k}\right]\leq\frac{\sigma_{2}^{2}}{B}$	(31)

and the step-size condition $\eta_{u}\leq 1/(\alpha\ell_{21})$ . We then derive the following recursive estimate for $\delta_{k}$ :

$\displaystyle\delta_{k+1}=$	$\displaystyle\left\\|u_{\ast}(\lambda_{k+1})-u_{k+1}\right\\|^{2}=\left\\|u_{\ast% }(\lambda_{k+1})-u_{\ast}(\lambda_{k})+u_{\ast}(\lambda_{k})-u_{k+1}\right\\|^{2}$
$\displaystyle\mathop{\leq}^{(a)}$	$\displaystyle(1+\gamma_{1})\left\\|u_{\ast}(\lambda_{k+1})-u_{\ast}(\lambda_{k}% )\right\\|^{2}+(1+1/\gamma_{1})\left\\|u_{\ast}(\lambda_{k})-u_{k+1}\right\\|^{2}$
$\displaystyle\mathop{\leq}^{(b)}$	$\displaystyle(1+\gamma_{1})\kappa^{2}\left\\|\lambda_{k+1}-\lambda_{k}\right\\|^% {2}+(1+1/\gamma_{1})\left\\|u_{\ast}(\lambda_{k})-u_{k+1}\right\\|^{2}$
$\displaystyle\mathop{\leq}^{(c)}$	$\displaystyle(1+\gamma_{1})\kappa^{2}\left\\|\lambda_{k+1}-\lambda_{k}\right\\|^% {2}+(1+1/\gamma_{1})\left(\left(1-\frac{\alpha\mu_{2}\eta_{u}}{J}\right)\delta% _{k}+\frac{\alpha^{2}\eta_{u}^{2}\sigma_{2}^{2}}{JB}\right)$
$\displaystyle\mathop{\leq}^{(d)}$	$\displaystyle(1+\gamma_{1})\kappa^{2}\eta_{\lambda}^{2}\left\\|\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{k})\right\\|^{2}+(1+1/\gamma_{1})\left(1-\frac{\alpha\mu_{2}\eta_{u}}{J}\right)\delta_{k}+(1+1/\gamma_{1})\frac{\alpha^{2}\eta_{u}^{2}\sigma_{2}^{2}}{JB}$
$\displaystyle\mathop{\leq}^{(e)}$	$\displaystyle(1+\gamma_{1})\kappa^{2}\eta_{\lambda}^{2}\left(2\left\\|\nabla% \Gamma^{\alpha}(\lambda_{k})\right\\|^{2}+4\alpha^{2}\ell_{21}^{2}\delta_{k}+8% \left(\ell_{11}^{2}+\alpha^{2}\ell_{21}^{2}\right)r_{k}\right)+(1+1/\gamma_{1}% )\left(1-\frac{\alpha\mu_{2}\eta_{u}}{J}\right)\delta_{k}$
	$\displaystyle+(1+1/\gamma_{1})\frac{\alpha^{2}\eta_{u}^{2}\sigma_{2}^{2}}{JB}$
$\displaystyle=$	$\displaystyle\left(4\alpha^{2}(1+\gamma_{1})\kappa^{2}\eta_{\lambda}^{2}\ell_{% 21}^{2}+(1+1/\gamma_{1})\left(1-\frac{\alpha\mu_{2}\eta_{u}}{J}\right)\right)% \delta_{k}+8(1+\gamma_{1})\kappa^{2}\eta_{\lambda}^{2}\left(\ell_{11}^{2}+% \alpha^{2}\ell_{21}^{2}\right)r_{k}$
	$\displaystyle+2(1+\gamma_{1})\kappa^{2}\eta_{\lambda}^{2}\left\\|\nabla\Gamma^{% \alpha}(\lambda_{k})\right\\|^{2}+(1+1/\gamma_{1})\frac{\alpha^{2}\eta_{u}^{2}% \sigma_{2}^{2}}{JB}.$	(32)

where $(a)$ follows from the Cauchy-Schwarz and Young inequalities with $\gamma_{1}>0$ ; $(b)$ uses the Lipschitz continuity of $u_{\ast}$ from Lemma 5; $(c)$ follows from inequality (26); $(d)$ uses the update rule for $\lambda_{k+1}$ ; and $(e)$ follows from inequality (24).
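
Step $(a)$ splits a squared norm with a free parameter $\gamma_{1}>0$. The following sketch, with hypothetical names and randomly drawn vectors, illustrates that $\|x+y\|^{2}\leq(1+\gamma)\|x\|^{2}+(1+1/\gamma)\|y\|^{2}$ holds for any $\gamma>0$ and is tight when $y=\gamma x$; it is a numerical illustration under these assumptions, not a proof.

```python
import random

def young_gap(a, b, gamma):
    """Return RHS - LHS of ||a+b||^2 <= (1+gamma)||a||^2 + (1+1/gamma)||b||^2."""
    sq = lambda v: sum(x * x for x in v)
    lhs = sq([x + y for x, y in zip(a, b)])
    rhs = (1 + gamma) * sq(a) + (1 + 1 / gamma) * sq(b)
    return rhs - lhs

rng = random.Random(0)
min_gap = float("inf")
for _ in range(2000):
    a = [rng.gauss(0, 1) for _ in range(4)]
    b = [rng.gauss(0, 1) for _ in range(4)]
    gamma = rng.uniform(0.01, 100.0)
    min_gap = min(min_gap, young_gap(a, b, gamma))

# Tight case: the gap vanishes when b = gamma * a.
a = [1.0, -2.0, 0.5]
gamma = 3.0
tight_gap = young_gap(a, [gamma * x for x in a], gamma)
print(min_gap >= -1e-9, abs(tight_gap) < 1e-9)
```

A small $\gamma$ keeps the coefficient of the first term near one at the cost of a large factor $1+1/\gamma$ on the second, which is the trade-off the proof later resolves through its choice of $\gamma_{1}$ and $\gamma_{2}$.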

Since $L_{1}+\alpha L_{2}$ is strongly convex with respect to $w$ with parameter $\alpha\mu_{2}/2$ whenever $\alpha\geq 2\ell_{21}/\mu_{2}$ , an argument similar to the one for $\delta_{k}$ gives the following bound for $r_{k}$ :

	$\displaystyle\mathbb{E}\left[\left\\|\omega_{\ast}^{\alpha}(\lambda_{k})-\omega_{k+1}\right\\|^{2}\mid\mathcal{F}_{k},\mathcal{C}_{k}\right]\leq\left(1-\frac{\alpha\mu_{2}\eta_{w}}{2J}\right)r_{k}+\frac{\eta_{w}^{2}\left(\sigma_{1}^{2}+\alpha^{2}\sigma_{2}^{2}\right)}{JB}.$	(33)

Following the same procedure as in (32), we obtain the recursion for $r_{k}$ :

$\displaystyle r_{k+1}$	$\displaystyle\leq(1+\gamma_{2})\left\\|\omega_{\ast}^{\alpha}(\lambda_{k+1})-% \omega_{\ast}^{\alpha}(\lambda_{k})\right\\|^{2}+(1+\gamma_{2}^{-1})\left\\|% \omega_{\ast}^{\alpha}(\lambda_{k})-\omega_{k+1}\right\\|^{2}$
	$\displaystyle\leq(1+\gamma_{2})\kappa^{2}\left\\|\lambda_{k+1}-\lambda_{k}% \right\\|^{2}+(1+\gamma_{2}^{-1})\left(\left(1-\frac{\alpha\mu_{2}\eta_{w}}{2J}% \right)r_{k}+\frac{\eta_{w}^{2}\left(\sigma_{1}^{2}+\alpha^{2}\sigma_{2}^{2}% \right)}{JB}\right)$
	$\displaystyle\leq\left(4(1+\gamma_{2})\kappa^{2}\eta_{\lambda}^{2}\left(\ell_{% 11}^{2}+\alpha^{2}\ell_{21}^{2}\right)+(1+1/\gamma_{2})\left(1-\frac{\alpha\mu% _{2}\eta_{w}}{2J}\right)\right)r_{k}+8(1+\gamma_{2})\kappa^{2}\eta_{\lambda}^{% 2}\alpha^{2}\ell_{21}^{2}\delta_{k}$
	$\displaystyle+2(1+\gamma_{2})\kappa^{2}\eta_{\lambda}^{2}\left\\|\nabla\Gamma^{% \alpha}(\lambda_{k})\right\\|^{2}+(1+1/\gamma_{2})\frac{\eta_{w}^{2}\left(% \sigma_{1}^{2}+\alpha^{2}\sigma_{2}^{2}\right)}{JB}$	(34)

where $\gamma_{2}>0$ .
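
The one-step contraction (26) that feeds these recursions can be illustrated on a toy problem. The sketch below makes several assumptions that are not from the paper: a noise-free quadratic $L_{2}(u)=\frac{1}{2}u^{\top}Au$ with $A=\mathrm{tridiag}(-1,2+c,-1)$ (so $u_{\ast}=0$, and Gershgorin's theorem gives $\mu_{2}\geq c$ and $\ell_{21}\leq 4+c$), equally sized contiguous blocks, and hypothetical function names. It computes the expectation over the uniformly chosen block exactly by enumerating all $J$ blocks.

```python
import random

def tridiag_matvec(u, c):
    """Multiply by A = tridiag(-1, 2 + c, -1); Gershgorin puts its spectrum in [c, 4 + c]."""
    n = len(u)
    out = []
    for i in range(n):
        v = (2.0 + c) * u[i]
        if i > 0:
            v -= u[i - 1]
        if i < n - 1:
            v -= u[i + 1]
        out.append(v)
    return out

def expected_sq_dist_after_step(u, c, alpha, eta, num_blocks):
    """Exact E||u_next - u_star||^2 over a uniformly chosen block (u_star = 0, no gradient noise)."""
    n = len(u)
    grad = tridiag_matvec(u, c)           # gradient of the toy L2(u) = 0.5 * u^T A u
    block_size = n // num_blocks          # assumes num_blocks divides n
    total = 0.0
    for j in range(num_blocks):
        lo, hi = j * block_size, (j + 1) * block_size
        nxt = list(u)
        for i in range(lo, hi):
            nxt[i] -= alpha * eta * grad[i]   # only block j is updated
        total += sum(x * x for x in nxt)
    return total / num_blocks

rng = random.Random(0)
c, alpha, J, n = 0.5, 3.0, 4, 12
mu, ell = c, 4.0 + c                      # bounds on strong convexity / smoothness
eta = 1.0 / (alpha * ell)                 # step-size condition eta_u <= 1/(alpha * ell_21)
contraction_holds = True
for _ in range(500):
    u = [rng.gauss(0, 1) for _ in range(n)]
    lhs = expected_sq_dist_after_step(u, c, alpha, eta, J)
    rhs = (1.0 - alpha * mu * eta / J) * sum(x * x for x in u)
    contraction_holds = contraction_holds and lhs <= rhs + 1e-12
print(contraction_holds)
```

With gradient noise switched off, the additive term $\alpha^{2}\eta_{u}^{2}\sigma_{2}^{2}/(JB)$ in (26) vanishes, so only the factor $1-\alpha\mu_{2}\eta_{u}/J$ is being checked here.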

We define the Lyapunov function

	$\displaystyle R_{k}=\Gamma^{\alpha}(\lambda_{k})-\Gamma_{\min}^{\alpha}+\xi_{1,k}\delta_{k}+\xi_{2,k}r_{k}$	(35)

where $\xi_{1,k},\xi_{2,k}>0$ are non-increasing sequences and $\Gamma_{\min}^{\alpha}$ is the minimum of $\Gamma^{\alpha}$ , so that $R_{k}\geq 0$ . Combining the results of (25), (32), and (34) gives

		$\displaystyle\mathbb{E}[R_{k+1}\mid\mathcal{F}_{k},\mathcal{C}_{k}]$
		$\displaystyle\leq R_{k}-\frac{\eta_{\lambda}}{2}\left\\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\\|^{2}+\frac{\ell_{\Gamma}\eta_{\lambda}^{2}}{2}\left(2\left\\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\\|^{2}+4\alpha^{2}\ell_{21}^{2}\delta_{k}+8\left(\ell_{11}^{2}+\alpha^{2}\ell_{21}^{2}\right)r_{k}\right)$
		$\displaystyle+\eta_{\lambda}\left(\alpha^{2}\ell_{21}^{2}\delta_{k}+2\left(% \ell_{11}^{2}+\alpha^{2}\ell_{21}^{2}\right)r_{k}\right)+\frac{\ell_{\Gamma}% \eta_{\lambda}^{2}}{2}\left(\sigma_{1}^{2}+2\alpha^{2}\sigma_{2}^{2}\right)+% \left(\xi_{1,k+1}\delta_{k+1}-\xi_{1,k}\delta_{k}\right)+\left(\xi_{2,k+1}r_{k% +1}-\xi_{2,k}r_{k}\right)$
		$\displaystyle\leq R_{k}-\left(\frac{\eta_{\lambda}}{2}-\ell_{\Gamma}\eta_{% \lambda}^{2}-2\xi_{1,k+1}(1+\gamma_{1})\kappa^{2}\eta_{\lambda}^{2}-2\xi_{2,k+% 1}(1+\gamma_{2})\kappa^{2}\eta_{\lambda}^{2}\right)\left\\|\nabla\Gamma^{\alpha% }(\lambda_{k})\right\\|^{2}+\phi_{1}\delta_{k}+\phi_{2}r_{k}$
		$\displaystyle+\frac{\ell_{\Gamma}\eta_{\lambda}^{2}}{2}\left(\sigma_{1}^{2}+2% \alpha^{2}\sigma_{2}^{2}\right)+\xi_{1,k+1}(1+\gamma_{1}^{-1})\frac{\alpha^{2}% \eta_{u}^{2}\sigma_{2}^{2}}{JB}+\xi_{2,k+1}(1+\gamma_{2}^{-1})\frac{\eta_{w}^{% 2}\left(\sigma_{1}^{2}+\alpha^{2}\sigma_{2}^{2}\right)}{JB}$		(36)

where

$\displaystyle\phi_{1}$	$\displaystyle=\xi_{1,k+1}\left(4\alpha^{2}(1+\gamma_{1})\kappa^{2}\eta_{% \lambda}^{2}\ell_{21}^{2}+(1+1/\gamma_{1})\left(1-\frac{\alpha\mu_{2}\eta_{u}}% {J}\right)\right)-\xi_{1,k}+2\ell_{\Gamma}\eta_{\lambda}^{2}\alpha^{2}\ell_{21% }^{2}+\eta_{\lambda}\alpha^{2}\ell_{21}^{2}$
	$\displaystyle+8\xi_{2,k+1}(1+\gamma_{2})\kappa^{2}\eta_{\lambda}^{2}\left(\ell% _{11}^{2}+\alpha^{2}\ell_{21}^{2}\right)$
$\displaystyle\phi_{2}$	$\displaystyle=\xi_{2,k+1}\left(4(1+\gamma_{2})\kappa^{2}\eta_{\lambda}^{2}% \left(\ell_{11}^{2}+\alpha^{2}\ell_{21}^{2}\right)+(1+1/\gamma_{2})\left(1-% \frac{\alpha\mu_{2}\eta_{w}}{2J}\right)\right)-\xi_{2,k}+4\ell_{\Gamma}\eta_{% \lambda}^{2}\left(\ell_{11}^{2}+\alpha^{2}\ell_{21}^{2}\right)$
	$\displaystyle+2\eta_{\lambda}\left(\ell_{11}^{2}+\alpha^{2}\ell_{21}^{2}\right% )+8\xi_{1,k+1}(1+\gamma_{1})\kappa^{2}\eta_{\lambda}^{2}\alpha^{2}\ell_{21}^{2}.$	(37)

Let $\eta_{u}=\eta_{\omega}=\eta_{0}/K^{a}$ , $\eta_{\lambda}=\eta_{\lambda}^{0}/K^{b}$ , and $\alpha=K^{c}$ , where $0\leq a\leq b$ and $c>0$ , and let $\ell=\max\left\{\ell_{11},\ell_{21}\right\}$ . Then $\phi_{1}$ and $\phi_{2}$ can be rewritten as:

$\displaystyle\phi_{1}$	$\displaystyle=\xi_{1,k+1}\left(\frac{4(1+\gamma_{1})\kappa^{2}(\eta_{\lambda}^% {0})^{2}}{K^{2(b-c)}}\ell^{2}+(1+1/\gamma_{1})\left(1-\frac{\mu_{2}\eta_{0}}{% JK^{(a-c)}}\right)\right)-\xi_{1,k}+\frac{2\ell_{\Gamma}\ell^{2}(\eta_{\lambda% }^{0})^{2}}{K^{2(b-c)}}+\frac{\ell^{2}\eta_{\lambda}^{0}}{K^{(b-2c)}}$
	$\displaystyle\quad+\frac{8\xi_{2,k+1}(1+\gamma_{2})\kappa^{2}\ell^{2}(\eta_{% \lambda}^{0})^{2}}{K^{2(b-c)}}$
$\displaystyle\phi_{2}$	$\displaystyle=\xi_{2,k+1}\left(\frac{4(1+\gamma_{2})\kappa^{2}(\eta_{\lambda}^% {0})^{2}}{K^{2(b-c)}}\ell^{2}+(1+1/\gamma_{2})\left(1-\frac{\mu_{2}\eta_{0}}{2% JK^{(a-c)}}\right)\right)-\xi_{2,k}+\frac{4\ell_{\Gamma}\ell^{2}(\eta_{\lambda% }^{0})^{2}}{K^{2{(b-c)}}}+\frac{2\ell^{2}\eta_{\lambda}^{0}}{K^{(b-2c)}}$
	$\displaystyle\quad+\frac{8\xi_{1,k+1}(1+\gamma_{1})\kappa^{2}\ell^{2}(\eta_{% \lambda}^{0})^{2}}{K^{2{(b-c)}}}.$	(38)

To ensure $\phi_{1}\leq 0$ and $\phi_{2}\leq 0$ , we may choose $\gamma_{1}=\gamma_{2}=4JK^{(a-c)}/(\mu_{2}\eta_{0})-1$ , so that

	$\displaystyle(1+1/\gamma_{1})\left(1-\frac{\mu_{2}\eta_{0}}{JK^{(a-c)}}\right)$	$\displaystyle\leq 1-\frac{3\mu_{2}\eta_{0}}{4JK^{(a-c)}}$
	$\displaystyle(1+1/\gamma_{2})\left(1-\frac{\mu_{2}\eta_{0}}{2JK^{(a-c)}}\right)$	$\displaystyle\leq 1-\frac{\mu_{2}\eta_{0}}{4JK^{(a-c)}}.$		(39)
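
The two bounds in (39) depend on the problem constants only through the ratio $x=\mu_{2}\eta_{0}/(JK^{(a-c)})$, since the choice above reads $\gamma_{1}=\gamma_{2}=4/x-1$. As a quick numerical illustration (the grid over $x\in(0,1]$ and the function names are assumptions made for this sketch), one can check both inequalities directly:

```python
def gamma_choice(x):
    """gamma = 4/x - 1 with x = mu_2 * eta_0 / (J * K**(a - c)); requires 0 < x < 4."""
    return 4.0 / x - 1.0

def check_39(x):
    """Check both inequalities of (39) at a given ratio x, with a small rounding slack."""
    g = gamma_choice(x)
    first = (1 + 1 / g) * (1 - x) <= 1 - 0.75 * x + 1e-12
    second = (1 + 1 / g) * (1 - 0.5 * x) <= 1 - 0.25 * x + 1e-12
    return first and second

grid = [i / 1000 for i in range(1, 1001)]   # x in (0, 1]
all_hold = all(check_39(x) for x in grid)
print(all_hold)
```

Algebraically, $1+1/\gamma=1/(1-x/4)$, so the first bound reduces to $1-x\leq(1-\tfrac{3x}{4})(1-\tfrac{x}{4})$ and the second to $1-\tfrac{x}{2}\leq(1-\tfrac{x}{4})^{2}$, both of which hold because the right-hand sides exceed the left by a nonnegative multiple of $x^{2}$.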

For $\eta_{0}\leq\frac{8J}{\mu_{2}}$ , we have $\frac{\mu_{2}\eta_{0}}{4J}\leq\frac{1}{2}$ . Since $\xi_{1,k}$ and $\xi_{2,k}$ are non-increasing sequences, i.e., $\xi_{1,k}\geq\xi_{1,k+1}$ and $\xi_{2,k}\geq\xi_{2,k+1}$ , we have

	$\displaystyle\phi_{1}$	$\displaystyle\leq\xi_{1,k}\left(1+\frac{(\eta_{\lambda}^{0})^{2}\ell^{2}\kappa% ^{2}J}{\mu_{2}\eta_{0}K^{(2b-c-a)}}-\frac{3\mu_{2}\eta_{0}}{4JK^{(a-c)}}\right% )-\xi_{1,k}+\frac{2\ell_{\Gamma}\ell^{2}(\eta_{\lambda}^{0})^{2}}{K^{2(b-c)}}+% \frac{\ell^{2}\eta_{\lambda}^{0}}{K^{(b-2c)}}+\xi_{2,k}\frac{8J(\eta_{\lambda}% ^{0})^{2}\ell^{2}\kappa^{2}}{\mu_{2}\eta_{0}K^{(2b-c-a)}}\leq 0$
	$\displaystyle\phi_{2}$	$\displaystyle\leq\xi_{2,k}\left(1+\frac{(\eta_{\lambda}^{0})^{2}\ell^{2}\kappa% ^{2}J}{\mu_{2}\eta_{0}K^{(2b-c-a)}}-\frac{\mu_{2}\eta_{0}}{4JK^{(a-c)}}\right)% -\xi_{2,k}+\frac{4\ell_{\Gamma}\ell^{2}(\eta_{\lambda}^{0})^{2}}{K^{2(b-c)}}+% \frac{2\ell^{2}\eta_{\lambda}^{0}}{K^{(b-2c)}}+\xi_{1,k}\frac{8J(\eta_{\lambda% }^{0})^{2}\ell^{2}\kappa^{2}}{\mu_{2}\eta_{0}K^{(2b-c-a)}}\leq 0$

If $\eta_{\lambda}^{0}\leq 1/(2\ell_{\Gamma})$ and $\eta_{0}/\eta_{\lambda}^{0}\geq 6\sqrt{2}\kappa^{2}J$ , for $b\geq a$ and $k>1$ , then

	$\displaystyle\frac{2\ell_{\Gamma}\ell^{2}(\eta_{\lambda}^{0})^{2}}{K^{2(b-c)}}\leq\frac{\ell^{2}\eta_{\lambda}^{0}}{K^{(b-2c)}},\qquad\frac{9(\eta_{\lambda}^{0})^{2}\ell^{2}\kappa^{2}J}{\mu_{2}\eta_{0}}\leq\frac{\mu_{2}\eta_{0}}{8J}.$

The inequalities of $\phi_{1},\phi_{2}$ can be simplified as

	$\displaystyle\phi_{1}$	$\displaystyle\leq\xi_{1,k}\left(1-\frac{53\mu_{2}\eta_{0}}{72JK^{(a-c)}}\right% )-\xi_{1,k}+\frac{\ell^{2}\eta_{\lambda}^{0}}{K^{(b-2c)}}+\xi_{2,k}\frac{\mu_{% 2}\eta_{0}}{9JK^{(a-c)}}\leq 0$		(40)
	$\displaystyle\phi_{2}$	$\displaystyle\leq\xi_{2,k}\left(1-\frac{17\mu_{2}\eta_{0}}{72JK^{(a-c)}}\right% )-\xi_{2,k}+\frac{2\ell^{2}\eta_{\lambda}^{0}}{K^{(b-2c)}}+\xi_{1,k}\frac{\mu_% {2}\eta_{0}}{9JK^{(a-c)}}\leq 0$		(41)

Solving the above inequalities, we may set

	$\displaystyle\xi_{1,k}$	$\displaystyle=\frac{-\frac{53\mu_{2}\eta_{0}}{72JK^{(a-c)}}\frac{2\ell^{2}\eta% _{\lambda}^{0}}{K^{(b-2c)}}-\frac{\ell^{2}\eta_{\lambda}^{0}}{K^{(b-2c)}}\frac% {\mu_{2}\eta_{0}}{9JK^{(a-c)}}}{\frac{\mu_{2}\eta_{0}}{9JK^{(a-c)}}\frac{\mu_{% 2}\eta_{0}}{9JK^{(a-c)}}-\frac{17\mu_{2}\eta_{0}}{72JK^{(a-c)}}\frac{53\mu_{2}% \eta_{0}}{72JK^{(a-c)}}}=\frac{\frac{114\ell^{2}\eta_{\lambda}^{0}}{K^{(b-2c)}% }}{\frac{837\mu_{2}\eta_{0}}{72JK^{(a-c)}}}=\frac{10\ell^{2}\eta_{\lambda}^{0}% }{\mu_{2}\eta_{0}}\frac{J}{K^{(b-a-c)}}$
	$\displaystyle\xi_{2,k}$	$\displaystyle=\frac{-\frac{17\mu_{2}\eta_{0}}{72JK^{(a-c)}}\frac{\ell^{2}\eta_% {\lambda}^{0}}{K^{(b-2c)}}-\frac{2\ell^{2}\eta_{\lambda}^{0}}{K^{(b-2c)}}\frac% {\mu_{2}\eta_{0}}{9JK^{(a-c)}}}{\frac{\mu_{2}\eta_{0}}{9JK^{(a-c)}}\frac{\mu_{% 2}\eta_{0}}{9JK^{(a-c)}}-\frac{17\mu_{2}\eta_{0}}{72JK^{(a-c)}}\frac{53\mu_{2}% \eta_{0}}{72JK^{(a-c)}}}=\frac{3\ell^{2}\eta_{\lambda}^{0}}{\mu_{2}\eta_{0}}% \frac{J}{K^{(b-a-c)}}$

to guarantee that $\phi_{1}\leq 0$ and $\phi_{2}\leq 0$ . The main inequality (36) can then be bounded as

	$\displaystyle\mathbb{E}[R_{k+1}\mid\mathcal{F}_{k},\mathcal{C}_{k}]$	$\displaystyle\leq R_{k}-\left(\frac{\eta_{\lambda}}{2}-\ell_{\Gamma}\eta_{% \lambda}^{2}-2\xi_{1,k+1}(1+\gamma_{1})\kappa^{2}\eta_{\lambda}^{2}-2\xi_{2,k+% 1}(1+\gamma_{2})\kappa^{2}\eta_{\lambda}^{2}\right)\left\\|\nabla\Gamma^{\alpha% }(\lambda_{k})\right\\|^{2}$
		$\displaystyle+\frac{\ell_{\Gamma}\eta_{\lambda}^{2}}{2}\left(\sigma_{1}^{2}+2% \alpha^{2}\sigma_{2}^{2}\right)+\xi_{1,k+1}(1+\gamma_{1}^{-1})\frac{\alpha^{2}% \eta_{u}^{2}\sigma_{2}^{2}}{JB}+\xi_{2,k+1}(1+\gamma_{2}^{-1})\frac{\eta_{w}^{% 2}\left(\sigma_{1}^{2}+\alpha^{2}\sigma_{2}^{2}\right)}{JB}.$

If we set $\eta_{\lambda}^{0}\leq 1/(8\ell_{\Gamma})$ , then $\ell_{\Gamma}\eta_{\lambda}^{2}\leq\frac{\eta_{\lambda}}{8}$ . Moreover, for $b\geq a$ and $k\geq 1$ , if we set $\eta_{0}/\eta_{\lambda}^{0}\geq 8\sqrt{3}\kappa^{2}J$ , then

	$\displaystyle\xi_{1,k}(1+\gamma_{1})\kappa^{2}\eta_{\lambda}$	$\displaystyle\leq\frac{40(\eta_{\lambda}^{0})^{2}K^{(a-c)}\ell^{2}\kappa^{2}J^% {2}}{\mu_{2}^{2}\eta_{0}^{2}K^{b}K^{(b-a-c)}}=\frac{40(\eta_{\lambda}^{0})^{2}% \ell^{2}\kappa^{2}J^{2}}{\mu_{2}^{2}\eta_{0}^{2}K^{(2b-2a)}}\leq\frac{1}{16}$
	$\displaystyle\xi_{2,k}(1+\gamma_{2})\kappa^{2}\eta_{\lambda}$	$\displaystyle\leq\frac{12(\eta_{\lambda}^{0})^{2}K^{(a-c)}\ell^{2}\kappa^{2}J^% {2}}{\mu_{2}^{2}\eta_{0}^{2}K^{b}K^{(b-a-c)}}=\frac{12(\eta_{\lambda}^{0})^{2}% \ell^{2}\kappa^{2}J^{2}}{\mu_{2}^{2}\eta_{0}^{2}K^{(2b-2a)}}\leq\frac{1}{16}.$

Then

	$\displaystyle\mathbb{E}[R_{k+1}\mid\mathcal{F}_{k},\mathcal{C}_{k}]$	$\displaystyle\leq R_{k}-\frac{\eta_{\lambda}}{4}\left\\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\\|^{2}+\frac{\ell_{\Gamma}\eta_{\lambda}^{2}}{2}\left(\sigma_{1}^{2}+2\alpha^{2}\sigma_{2}^{2}\right)+\frac{8\xi_{1,k+1}}{7}\frac{\alpha^{2}\eta_{u}^{2}\sigma_{2}^{2}}{JB}+\frac{3\xi_{2,k+1}}{4}\frac{\eta_{w}^{2}\left(\sigma_{1}^{2}+\alpha^{2}\sigma_{2}^{2}\right)}{JB}.$

Telescoping the above inequality gives

	$\displaystyle\mathbb{E}\left[\left\\|\nabla\Gamma^{\alpha}(\tilde{\lambda})% \right\\|^{2}\right]$	$\displaystyle=\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\left[\left\\|\nabla\Gamma^{% \alpha}(\lambda_{k})\right\\|^{2}\right]$
		$\displaystyle\leq\frac{4}{K\eta_{\lambda}}\sum_{k=1}^{K}\left(\mathbb{E}[R_{k}\mid\mathcal{F}_{k-1},\mathcal{C}_{k-1}]-\mathbb{E}[R_{k+1}\mid\mathcal{F}_{k},\mathcal{C}_{k}]\right)$
		$\displaystyle+\frac{4}{K\eta_{\lambda}}\sum_{k=1}^{K}\left(\frac{\ell_{\Gamma}% \eta_{\lambda}^{2}}{2}\left(\sigma_{1}^{2}+2\alpha^{2}\sigma_{2}^{2}\right)+% \frac{8\xi_{1,k+1}}{7}\frac{\alpha^{2}\eta_{u}^{2}\sigma_{2}^{2}}{JB}+\frac{3% \xi_{2,k+1}}{4}\frac{\eta_{w}^{2}\left(\sigma_{1}^{2}+\alpha^{2}\sigma_{2}^{2}% \right)}{JB}\right)$
		$\displaystyle\leq\frac{4\mathbb{E}[R_{1}]K^{b}}{\eta_{\lambda}^{0}K}+\frac{4K^% {b}}{\eta_{\lambda}^{0}}\frac{\ell_{\Gamma}(\eta_{\lambda}^{0})^{2}\left(% \sigma_{1}^{2}+K^{2c}\sigma_{2}^{2}\right)}{2K^{2b}}$
		$\displaystyle+\frac{4K^{b}}{\eta_{\lambda}^{0}}\left(\frac{80\ell^{2}\eta_{0}% \eta_{\lambda}^{0}K^{2c}K^{-2a}\sigma_{2}^{2}}{7\mu_{2}BK^{(b-a-c)}}+\frac{9% \ell^{2}\eta_{0}\eta_{\lambda}^{0}K^{-2a}\left(\sigma_{1}^{2}+K^{2c}\sigma_{2}% ^{2}\right)}{4\mu_{2}BK^{(b-a-c)}}\right).$

Recalling that Lemma 1 relates the stationarity of the minimax problem to that of the original bilevel problem, we have

	$\displaystyle\mathbb{E}\left[\left\\|\nabla\mathcal{L}(\tilde{\lambda})\right\\|% ^{2}\right]=\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\left[\left\\|\nabla\mathcal{L}(% \lambda_{k})\right\\|^{2}\right]$
	$\displaystyle\leq\frac{2}{K}\sum_{k=1}^{K}\left(\mathbb{E}\left[\left\\|\nabla% \mathcal{L}(\lambda_{k})-\nabla\Gamma^{\alpha}(\lambda_{k})\right\\|^{2}\right]% +\mathbb{E}\left[\left\\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\\|^{2}\right]\right)$
	$\displaystyle\leq\frac{2}{\alpha^{2}}+\frac{2}{K}\sum_{k=1}^{K}\mathbb{E}\left% [\left\\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\\|^{2}\right]$
	$\displaystyle\leq\frac{2}{K^{2c}}+\frac{8\mathbb{E}[R_{1}]K^{b}}{\eta_{\lambda% }^{0}K}+\frac{8K^{b}}{\eta_{\lambda}^{0}}\frac{\ell_{\Gamma}(\eta_{\lambda}^{0% })^{2}\left(\sigma_{1}^{2}+K^{2c}\sigma_{2}^{2}\right)}{2K^{2b}}$
	$\displaystyle+\frac{8K^{b}}{\eta_{\lambda}^{0}}\left(\frac{80\ell^{2}\eta_{0}% \eta_{\lambda}^{0}K^{2c}K^{-2a}\sigma_{2}^{2}}{7\mu_{2}BK^{(b-a-c)}}+\frac{9% \ell^{2}\eta_{0}\eta_{\lambda}^{0}K^{-2a}\left(\sigma_{1}^{2}+K^{2c}\sigma_{2}% ^{2}\right)}{4\mu_{2}BK^{(b-a-c)}}\right).$

Setting $c=1/7$ , $a=4/7$ , and $b=5/7$ , we obtain

	$\displaystyle\mathbb{E}\left[\left\\|\nabla\mathcal{L}(\tilde{\lambda})\right\\|% ^{2}\right]$	$\displaystyle=\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\left[\left\\|\nabla\mathcal{L% }(\lambda_{k})\right\\|^{2}\right]$
		$\displaystyle\leq\mathcal{O}\left(\frac{1}{K^{2/7}}\right)+\mathcal{O}\left(% \frac{\mathbb{E}[R_{1}]}{\eta_{\lambda}^{0}K^{2/7}}\right)+\mathcal{O}\left(% \frac{(1+\ell\kappa\eta_{0})\sigma_{1}^{2}}{BK^{4/7}}\right)+\mathcal{O}\left(% \frac{(1+\ell\kappa\eta_{0})\sigma_{2}^{2}}{BK^{2/7}}\right).$

Note that the initial state $R_{1}$ can be bounded by a constant that is independent of $\alpha$ :

	$\displaystyle R_{1}$	$\displaystyle=\Gamma^{\alpha}(\lambda_{1})-\Gamma_{\min}^{\alpha}+\xi_{1,1}% \delta_{1}+\xi_{2,1}r_{1}$
		$\displaystyle=\Gamma^{\alpha}(\lambda_{1})-\Gamma_{\min}^{\alpha}+\mathcal{O}\left(J\kappa\eta_{\lambda}^{0}/\eta_{0}\left(\left\\|w_{1}-w_{\ast}^{\alpha}(\lambda_{1})\right\\|^{2}+\left\\|u_{1}-u_{\ast}(\lambda_{1})\right\\|^{2}\right)\right)$		(42)

where

$\displaystyle\Gamma^{\alpha}(\lambda_{1})-\Gamma_{\min}^{\alpha}$	$\displaystyle\leq\mathcal{L}^{\alpha}(\lambda_{1},w_{\ast}^{\alpha}(\lambda_{1}),u_{\ast}(\lambda_{1}))-\mathcal{L}^{\alpha}(\lambda^{\ast},w_{\ast}^{\alpha}(\lambda^{\ast}),u_{\ast}(\lambda^{\ast}))$
	$\displaystyle=L_{1}(\lambda_{1},w_{\ast}^{\alpha}(\lambda_{1}))-L_{1}(\lambda^% {\ast},w_{\ast}^{\alpha}(\lambda^{\ast}))+\alpha\left(L_{2}(\lambda_{1},w_{% \ast}^{\alpha}(\lambda_{1}))-L_{2}(\lambda_{1},u_{\ast}(\lambda_{1}))\right)$
	$\displaystyle-\alpha\left(L_{2}(\lambda^{\ast},w_{\ast}^{\alpha}(\lambda^{\ast}))-L_{2}(\lambda^{\ast},u_{\ast}(\lambda^{\ast}))\right)$
	$\displaystyle=L_{1}(\lambda_{1},w_{\ast}(\lambda_{1}))-L_{1}(\lambda^{\ast},w_{\ast}(\lambda^{\ast}))+L_{1}(\lambda_{1},w_{\ast}^{\alpha}(\lambda_{1}))-L_{1}(\lambda_{1},w_{\ast}(\lambda_{1}))$
	$\displaystyle+L_{1}(\lambda^{\ast},w_{\ast}(\lambda^{\ast}))-L_{1}(\lambda^{\ast},w_{\ast}^{\alpha}(\lambda^{\ast}))+\alpha\left(L_{2}(\lambda_{1},w_{\ast}^{\alpha}(\lambda_{1}))-L_{2}(\lambda_{1},u_{\ast}(\lambda_{1}))\right)$
	$\displaystyle-\alpha\left(L_{2}(\lambda^{\ast},w_{\ast}^{\alpha}(\lambda^{\ast}))-L_{2}(\lambda^{\ast},u_{\ast}(\lambda^{\ast}))\right)$
$\displaystyle\leq$	$\displaystyle\mathcal{L}(\lambda_{1})-\mathcal{L}(\lambda^{\ast})+\ell_{10}% \left\\|w_{\ast}^{\alpha}(\lambda_{1})-w_{\ast}(\lambda_{1})\right\\|+\ell_{10}% \left\\|w_{\ast}^{\alpha}(\lambda^{\ast})-w_{\ast}(\lambda^{\ast})\right\\|$
	$\displaystyle+\alpha\frac{\ell_{21}}{2}\left\\|w_{\ast}^{\alpha}(\lambda_{1})-u% _{\ast}(\lambda_{1})\right\\|^{2}+\alpha\frac{\ell_{21}}{2}\left\\|w_{\ast}^{% \alpha}(\lambda^{\ast})-u_{\ast}(\lambda^{\ast})\right\\|^{2}$
	$\displaystyle\leq\mathcal{L}(\lambda_{1})-\mathcal{L}(\lambda^{\ast})+\frac{2% \ell_{10}C_{0}}{\alpha}+2\alpha\frac{\ell_{21}}{2}\frac{C_{0}^{2}}{\alpha^{2}}$
	$\displaystyle\leq\mathcal{L}(\lambda_{1})-\mathcal{L}(\lambda^{\ast})+\frac{2% \ell_{10}C_{0}\mu_{2}}{\ell_{11}}+\frac{\ell_{21}C_{0}^{2}\mu_{2}}{\ell_{11}}=% \mathcal{L}(\lambda_{1})-\mathcal{L}(\lambda^{\ast})+\mathcal{O}\left(\kappa^{% 2}\ell_{21}\right),$	(43)

where, by definition, $w_{\ast}(\lambda)=u_{\ast}(\lambda)$ ; the first inequality follows from the gradient-Lipschitz continuity of $L_{2}$ and the Lipschitz continuity of $L_{1}$ in $w$ , and the second inequality uses Lemma 4. The proof is complete. ∎

	$\displaystyle\mathbb{E}\left[\left\\|\nabla_{\lambda}\mathcal{L}_{D_{k}}^{% \alpha}(\lambda_{k},\omega_{k},u_{k})\right\\|^{2}\mid\mathcal{F}_{k},\mathcal{% C}_{k}\right]$
	$\displaystyle=\mathbb{E}\left[\left\\|\nabla_{\lambda}\mathcal{L}_{D_{k}}^{% \alpha}(\lambda_{k},\omega_{k},u_{k})-\nabla_{\lambda}\mathcal{L}^{\alpha}(% \lambda_{k},\omega_{k},u_{k})+\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k}% ,\omega_{k},u_{k})\right\\|^{2}\mid\mathcal{F}_{k}\right]$
	$\displaystyle\leq\mathbb{E}\left[\left\\|\nabla_{\lambda}\mathcal{L}_{D_{k}}^{% \alpha}(\lambda_{k},\omega_{k},u_{k})-\nabla_{\lambda}\mathcal{L}^{\alpha}(% \lambda_{k},\omega_{k},u_{k})\right\\|^{2}\mid\mathcal{F}_{k}\right]+\left\\|% \nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},\omega_{k},u_{k})\right\\|^{2}$
	$\displaystyle\leq\frac{\sigma_{1}^{2}+2\alpha^{2}\sigma_{2}^{2}}{B}+\left\\|% \nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},\omega_{k},u_{k})\right\\|^{2}$		(19)

	$\displaystyle\mathbb{E}\left[\left\\|\nabla_{\lambda}\mathcal{L}_{D_{k}}^{% \alpha}(\lambda_{k},\omega_{k},u_{k})-\nabla_{\lambda}\mathcal{L}^{\alpha}(% \lambda_{k},\omega_{k},u_{k})\right\\|^{2}\mid\mathcal{F}_{k}\right]$
	$\displaystyle=\mathbb{E}\left[\left\\|\nabla_{\lambda}L_{1}(\lambda_{k},w_{k};D% _{\text{val}}^{k})-\nabla_{\lambda}L_{1}(\lambda_{k},w_{k})\right\\|^{2}+\alpha% ^{2}\left\\|\nabla_{\lambda}L_{2}(\lambda_{k},w_{k};D_{\text{tr}}^{k})-\nabla_{% \lambda}L_{2}(\lambda_{k},w_{k})\right\\|^{2}\mid\mathcal{F}_{k}\right]$
	$\displaystyle\quad+\alpha^{2}\mathbb{E}\left[\left\\|\nabla_{\lambda}L_{2}(% \lambda_{k},u_{k};D_{\text{tr}}^{k})-\nabla_{\lambda}L_{2}(\lambda_{k},u_{k})% \right\\|^{2}\mid\mathcal{F}_{k}\right]$
	$\displaystyle\leq\frac{\sigma_{1}^{2}+2\alpha^{2}\sigma_{2}^{2}}{B}.$		(21)

$\displaystyle\left\\|\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{% k})\right\\|^{2}$	$\displaystyle=\left\\|\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_% {k})-\nabla\Gamma^{\alpha}(\lambda_{k})+\nabla\Gamma^{\alpha}(\lambda_{k})% \right\\|^{2}$
	$\displaystyle\leq 2\left(\left\\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\\|^{2}% +\left\\|\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{k})-\nabla% \Gamma^{\alpha}(\lambda_{k})\right\\|^{2}\right)$
	$\displaystyle\leq 2\left\\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\\|^{2}+4% \left\\|\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{k})-\nabla_{% \lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_{k},u_{\ast}(\lambda_{k}))\right\\|^% {2}$
	$\displaystyle\quad+4\left\\|\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_% {k},u_{\ast}(\lambda_{k}))-\nabla_{\lambda}\mathcal{L}^{\alpha}(\lambda_{k},w_% {\ast}^{\alpha}(\lambda_{k}),u_{\ast}(\lambda_{k}))\right\\|^{2}$
	$\displaystyle\mathop{\leq}^{(a)}2\left\\|\nabla\Gamma^{\alpha}(\lambda_{k})% \right\\|^{2}+4\alpha^{2}\ell_{11}^{2}\left\\|u_{k}-u_{\ast}(\lambda_{k})\right% \\|^{2}+8\left(\ell_{11}^{2}+\alpha^{2}\ell_{21}^{2}\right)\left\\|\omega_{k}-% \omega_{\ast}^{\alpha}(\lambda_{k})\right\\|^{2}$
	$\displaystyle=2\left\\|\nabla\Gamma(\lambda_{k})\right\\|^{2}+4\alpha^{2}\ell_{1% 1}^{2}\delta_{k}+8\left(\ell_{11}^{2}+\alpha^{2}\ell_{21}^{2}\right)r_{k}$	(24)

$$
\begin{aligned}
&\mathbb{E}\left[\left\|u_{k+1}-u_{\ast}(\lambda_{k})\right\|^{2}\mid\mathcal{F}_{k},\mathcal{C}_{k}\right]=\mathbb{E}\left[\left\|u_{k}-\alpha\eta_{u}U_{j_{t}}\nabla_{u}L_{2}\left(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}}\right)-u_{\ast}(\lambda_{k})\right\|^{2}\mid\mathcal{F}_{k},\mathcal{C}_{k}\right]\\
&=\left\|u_{\ast}(\lambda_{k})-u_{k}\right\|^{2}-2\alpha\eta_{u}\mathbb{E}\left[\left\langle u_{k}-u_{\ast}(\lambda_{k}),\nabla_{u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})\right\rangle_{j_{t}}\mid\mathcal{F}_{k},\mathcal{C}_{k}\right]\\
&\quad+\alpha^{2}\eta_{u}^{2}\mathbb{E}\left[\left\|U_{j_{t}}\nabla_{u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})\right\|^{2}\mid\mathcal{F}_{k},\mathcal{C}_{k}\right]\\
&\mathop{=}^{(a)}\left\|u_{\ast}(\lambda_{k})-u_{k}\right\|^{2}-\frac{2\alpha\eta_{u}}{J}\left\langle u_{k}-u_{\ast}(\lambda_{k}),\nabla_{u}L_{2}(\lambda_{k},u_{k})\right\rangle+\frac{\alpha^{2}\eta_{u}^{2}}{J}\mathbb{E}\left[\left\|\nabla_{u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})\right\|^{2}\mid\mathcal{F}_{k}\right]\\
&\mathop{\leq}^{(b)}\left\|u_{\ast}(\lambda_{k})-u_{k}\right\|^{2}-\frac{2\eta_{u}\alpha}{J}\left(L_{2}(\lambda_{k},u_{k})-L_{2}(\lambda_{k},u_{\ast}(\lambda_{k}))+\frac{\mu_{2}}{2}\left\|u_{\ast}(\lambda_{k})-u_{k}\right\|^{2}\right)\\
&\quad+\frac{\alpha^{2}\eta_{u}^{2}}{J}\mathbb{E}\left[\left\|\nabla_{u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})\right\|^{2}\mid\mathcal{F}_{k}\right]\\
&\mathop{=}^{(c)}\left\|u_{\ast}(\lambda_{k})-u_{k}\right\|^{2}-\frac{2\eta_{u}\alpha}{J}\left(L_{2}(\lambda_{k},u_{k})-L_{2}(\lambda_{k},u_{\ast}(\lambda_{k}))+\frac{\mu_{2}}{2}\left\|u_{\ast}(\lambda_{k})-u_{k}\right\|^{2}\right)\\
&\quad+\frac{\alpha^{2}\eta_{u}^{2}}{J}\mathbb{E}\left[\left\|\nabla_{u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})-\nabla_{u}L_{2}(\lambda_{k},u_{k})\right\|^{2}\mid\mathcal{F}_{k}\right]+\frac{\alpha^{2}\eta_{u}^{2}}{J}\left\|\nabla_{u}L_{2}(\lambda_{k},u_{k})\right\|^{2}\\
&\mathop{\leq}^{(d)}\left\|u_{\ast}(\lambda_{k})-u_{k}\right\|^{2}-\frac{2\eta_{u}\alpha}{J}\left(L_{2}(\lambda_{k},u_{k})-L_{2}(\lambda_{k},u_{\ast}(\lambda_{k}))+\frac{\mu_{2}}{2}\left\|u_{\ast}(\lambda_{k})-u_{k}\right\|^{2}\right)\\
&\quad+\frac{\alpha^{2}\eta_{u}^{2}}{J}\mathbb{E}\left[\left\|\nabla_{u}L_{2}(\lambda_{k},u_{k};\mathcal{D}_{k}^{\text{tr}})-\nabla_{u}L_{2}(\lambda_{k},u_{k})\right\|^{2}\mid\mathcal{F}_{k}\right]+\frac{2\ell_{21}\eta_{u}^{2}\alpha^{2}}{J}\left(L_{2}(\lambda_{k},u_{k})-L_{2}(\lambda_{k},u_{\ast}(\lambda_{k}))\right)\\
&\mathop{\leq}^{(e)}\left(1-\frac{\alpha\mu_{2}\eta_{u}}{J}\right)\left\|u_{\ast}(\lambda_{k})-u_{k}\right\|^{2}+\frac{\alpha^{2}\eta_{u}^{2}\sigma_{2}^{2}}{JB}.
\end{aligned}
\tag{26}
$$
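
The contraction factor $1-\alpha\mu_{2}\eta_{u}/J$ in (26) can be sanity-checked on a toy problem. The sketch below is only an illustration under stated assumptions, not the paper's setup: the objective is a diagonal quadratic, each coordinate is its own block (so $J$ equals the dimension), the gradient is noise-free ($\sigma_{2}=0$), and `step` stands in for the product $\alpha\eta_{u}$. It also assumes `step` is at most $1/\ell_{21}$, which appears to be what step (e) needs for the function-value terms to cancel. All function and variable names are hypothetical.

```python
def expected_sq_dist_after_block_step(u, h, step):
    """Exact E||u_next - u_star||^2 over one uniformly sampled block.

    Toy objective (an assumption for illustration):
        L(u) = 0.5 * sum_i h[i] * u[i]**2,
    so u_star = 0 and the i-th gradient entry is h[i] * u[i].
    Each coordinate is treated as one block, hence J = len(u).
    """
    J = len(u)
    base = sum(x * x for x in u)
    total = 0.0
    for j in range(J):
        new_uj = u[j] - step * h[j] * u[j]  # only block j is updated
        total += base - u[j] ** 2 + new_uj ** 2
    return total / J  # average over the J equally likely blocks


def contraction_bound(u, mu, step):
    """Right-hand side of the bound with zero gradient noise."""
    J = len(u)
    return (1.0 - step * mu / J) * sum(x * x for x in u)


u = [1.0, -2.0, 0.5, 3.0]
h = [0.5, 1.0, 2.0, 4.0]        # curvatures: mu = min(h), ell = max(h)
mu, ell = min(h), max(h)
step = 1.0 / ell                # plays the role of alpha * eta_u

lhs = expected_sq_dist_after_block_step(u, h, step)
rhs = contraction_bound(u, mu, step)
print(lhs <= rhs)
```

Averaging over blocks is done exactly rather than by sampling, so the check is deterministic. The factor $1/J$ in the contraction reflects that only one of the $J$ blocks moves per step.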

$$
\begin{aligned}
\mathbb{E}\left[\left\|\nabla\mathcal{L}(\tilde{\lambda})\right\|^{2}\right]
&=\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\left[\left\|\nabla\mathcal{L}(\lambda_{k})\right\|^{2}\right]\\
&\leq\frac{2}{K}\sum_{k=1}^{K}\left(\mathbb{E}\left[\left\|\nabla\mathcal{L}(\lambda_{k})-\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2}\right]+\mathbb{E}\left[\left\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2}\right]\right)\\
&\leq\frac{2}{\alpha^{2}}+\frac{2}{K}\sum_{k=1}^{K}\mathbb{E}\left[\left\|\nabla\Gamma^{\alpha}(\lambda_{k})\right\|^{2}\right]\\
&\leq\frac{2}{K^{2c}}+\frac{8\mathbb{E}[R_{1}]K^{b}}{\eta_{\lambda}^{0}K}+\frac{8K^{b}}{\eta_{\lambda}^{0}}\frac{\ell_{\Gamma}(\eta_{\lambda}^{0})^{2}\left(\sigma_{1}^{2}+K^{2c}\sigma_{2}^{2}\right)}{2K^{2b}}\\
&\quad+\frac{8K^{b}}{\eta_{\lambda}^{0}}\left(\frac{80\ell^{2}\eta_{0}\eta_{\lambda}^{0}K^{2c}K^{-2a}\sigma_{2}^{2}}{7\mu_{2}BK^{(b-a-c)}}+\frac{9\ell^{2}\eta_{0}\eta_{\lambda}^{0}K^{-2a}\left(\sigma_{1}^{2}+K^{2c}\sigma_{2}^{2}\right)}{4\mu_{2}BK^{(b-a-c)}}\right).
\end{aligned}
$$
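
To see how each term of the final bound scales with $K$, the exponents can be collected mechanically. The helper below is a hypothetical sketch that only reads off the powers of $K$ from the display above, treating every other quantity as a constant. The values of $a$, $b$, $c$ in the example are illustrative choices that happen to balance the leading exponents; they are not taken from the paper.

```python
from fractions import Fraction


def k_exponents(a, b, c):
    """Power of K carried by each term of the final bound (constants dropped)."""
    return {
        "2 / K^{2c}": -2 * c,
        "E[R_1] term": b - 1,
        "ell_Gamma term, sigma_1^2 part": -b,             # K^b * K^{-2b}
        "ell_Gamma term, sigma_2^2 part": 2 * c - b,      # extra K^{2c}
        "first bracket term (sigma_2^2)": 3 * c - a,      # b + 2c - 2a - (b - a - c)
        "second bracket term, sigma_1^2 part": c - a,     # b - 2a - (b - a - c)
        "second bracket term, sigma_2^2 part": 3 * c - a, # extra K^{2c}
    }


# Hypothetical exponents, chosen only for illustration.
a, b, c = Fraction(5, 6), Fraction(2, 3), Fraction(1, 6)
exps = k_exponents(a, b, c)
rate = max(exps.values())  # the slowest-decaying term dominates the bound
print(rate)
```

A negative value of `rate` means every term vanishes as $K$ grows. Any exponent that is non-negative flags a term that does not decay under the chosen $a$, $b$, $c$.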