Proof.
We sample the function by the following mini-batch approximation per iteration:
|
|
|
(15) |
where and are i.i.d. mini-batch samples from the training and validation datasets , respectively. We use to denote the random information before the iteration , that is . We use to denote the random information of variables for the randomized block coordinates before the iteration .
We recall the iterating formula of in the stochastic version of the minimax algorithm that . At each iteration,
|
|
|
(16) |
By the smoothness of (see Lemma 3), we have
|
|
|
|
|
|
|
|
(17) |
Taking conditional expectation w.r.t. on the above inequality, we have
|
|
|
|
|
|
|
|
|
(18) |
where the inequality follows from the fact that is an unbiased estimator of and
|
|
|
|
|
|
|
|
|
|
|
|
(19) |
where the variance of the minibatch stochastic gradients (with batch size ) is bounded
|
|
|
(20) |
then
|
|
|
|
|
|
|
|
|
|
|
|
(21) |
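For reference, the standard argument behind a mini-batch variance bound of this kind can be sketched as follows. The symbols are assumptions introduced only for illustration and need not coincide with the notation of (20): $g_i$ denotes the $i$-th per-sample stochastic gradient, $\bar g$ its common mean, $\sigma^2$ a bound on the per-sample variance, and $B$ the batch size.

```latex
% Assumed notation (not from the source): g_1,\dots,g_B are i.i.d.,
% \mathbb{E}[g_i] = \bar g and \mathbb{E}\|g_i - \bar g\|^2 \le \sigma^2.
\begin{align*}
\mathbb{E}\Big\| \frac{1}{B}\sum_{i=1}^{B} g_i - \bar g \Big\|^2
&= \frac{1}{B^2}\sum_{i=1}^{B}\mathbb{E}\|g_i-\bar g\|^2
 + \frac{1}{B^2}\sum_{i\neq j}\mathbb{E}\langle g_i-\bar g,\; g_j-\bar g\rangle \\
&= \frac{1}{B^2}\sum_{i=1}^{B}\mathbb{E}\|g_i-\bar g\|^2
 \;\le\; \frac{\sigma^2}{B}.
\end{align*}
% The cross terms vanish because the samples are independent and centered.
```

Under these assumptions, averaging over a batch reduces the variance of the gradient estimate by a factor of the batch size.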
Applying the above results, we have
|
|
|
|
|
|
|
|
(22) |
Let and . The inner product term of RHS of (D) is estimated as follows:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
(23) |
where uses the optimality of over that , follows from the Cauchy-Schwarz inequality and uses the smoothness of and .
Next we turn to estimating the norm of the gradient as follows
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
(24) |
where uses the smoothness of objectives .
Incorporating the above inequalities (D) and (D) into (D) gives
|
|
|
|
|
|
|
|
(25) |
Then, we focus on estimating and . For the inner variables , we use the randomized block coordinate method with total blocks, where each block is chosen uniformly. By the strong concavity of with respect to , we first obtain the following estimate for :
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
(26) |
where uses the fact that, since the block coordinate is uniformly chosen from , we have
|
|
|
|
|
|
|
|
(27) |
and
|
|
|
(28) |
follows from the strong convexity of w.r.t. which implies that
|
|
|
uses the relationship which implies that
|
|
|
(29) |
and uses the optimality of and the smoothness of such that
|
|
|
|
|
|
|
|
|
|
|
|
(30) |
where
and uses
|
|
|
(31) |
and .
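The uniform block sampling facts used above can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function name `block_expectations`, the example gradient, and the equal-sized contiguous blocks are all assumptions made for illustration. It computes the exact expectation over a uniformly chosen block and confirms that the expected block update equals the full gradient divided by the number of blocks, and likewise for the squared norm.

```python
import math

def block_expectations(grad, num_blocks):
    """Exact expectation over one uniformly chosen block (no sampling).

    Assumes equal-sized contiguous blocks; this layout is hypothetical.
    """
    n = len(grad)
    if n % num_blocks != 0:
        raise ValueError("num_blocks must divide the number of coordinates")
    size = n // num_blocks
    mean_update = [0.0] * n   # E[gradient restricted to the chosen block]
    mean_sq_norm = 0.0        # E[squared norm of that restricted gradient]
    for j in range(num_blocks):          # each block has probability 1/num_blocks
        for i in range(j * size, (j + 1) * size):
            mean_update[i] += grad[i] / num_blocks
            mean_sq_norm += grad[i] ** 2 / num_blocks
    return mean_update, mean_sq_norm

grad = [float(i) for i in range(1, 13)]   # hypothetical full gradient, 12 coordinates
num_blocks = 4
mean_update, mean_sq_norm = block_expectations(grad, num_blocks)
full_sq_norm = sum(g * g for g in grad)
```

Because each coordinate belongs to exactly one block, both expectations pick up the factor `1 / num_blocks`, which is the source of the block-count dependence in the recursion.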
Then we make the following recursive estimation for :
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
(32) |
where follows from the Cauchy-Schwarz inequality with ; (b) uses the Lipschitz continuity of from Lemma 5; follows from the inequality (D); uses the iterating formula of ; follows from the inequality (D).
Note that is strongly convex with respect to with parameter if . Similarly to , we can obtain the following result for
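The parametrized inequality invoked in the first step is, presumably, the usual Young-type consequence of the Cauchy-Schwarz inequality. A sketch under that assumption, with generic vectors $a$, $b$ and a free parameter $\epsilon > 0$ that are illustrative and not taken from the source:

```latex
% Assumed generic form; a, b and \epsilon > 0 are illustrative symbols.
\begin{align*}
\|a+b\|^2
&= \|a\|^2 + 2\langle a, b\rangle + \|b\|^2 \\
&\le \|a\|^2 + 2\|a\|\,\|b\| + \|b\|^2
   && \text{(Cauchy-Schwarz)} \\
&\le (1+\epsilon)\,\|a\|^2 + \Big(1+\frac{1}{\epsilon}\Big)\|b\|^2
   && \big(2xy \le \epsilon x^2 + \epsilon^{-1} y^2\big).
\end{align*}
```

In recursions of this type, $\epsilon$ is typically chosen small enough that the factor $(1+\epsilon)$ is absorbed by the contraction obtained from strong convexity.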
|
|
|
(33) |
Following the same procedure as in (D), we estimate the recursion as below
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
(34) |
where .
We define the Lyapunov function
|
|
|
(35) |
where are non-increasing sequences and is the minimum of . We must have . Incorporating the results of (D), (D), (D) gives
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
(36) |
where
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
(37) |
Let and , and where and , and then and can be re-written as:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
(38) |
In order to achieve and , we might let , then
|
|
|
|
|
|
|
|
(39) |
For , we have . Consider that and are non-increasing sequences, then and , we have
|
|
|
|
|
|
|
|
If and , for and , then
|
|
|
The inequalities of can be simplified as
|
|
|
|
(40) |
|
|
|
|
(41) |
We might solve the above inequalities and properly set
|
|
|
|
|
|
|
|
to guarantee that and . Then the main inequality (D) can be estimated as
|
|
|
|
|
|
|
|
If we set , then . For and , if we set
|
|
|
|
|
|
|
|
Then
|
|
|
|
Telescoping the above inequality gives
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
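The telescoping step follows a generic pattern, which can be sketched as below. All symbols are assumptions for illustration only: $V_k$ stands for a Lyapunov value bounded below by $V^\ast$, $G_k$ for a nonnegative stationarity measure, $\eta_k$ for a step size, and $c, C > 0$ for constants; none of them is claimed to match the exact quantities in the proof.

```latex
% Assumed one-step descent: E[V_{k+1}] \le E[V_k] - c\,\eta_k E[G_k] + C\,\eta_k^2.
\begin{align*}
c\sum_{k=0}^{K-1}\eta_k\,\mathbb{E}[G_k]
&\le \sum_{k=0}^{K-1}\big(\mathbb{E}[V_k]-\mathbb{E}[V_{k+1}]\big)
   + C\sum_{k=0}^{K-1}\eta_k^2 \\
&= V_0-\mathbb{E}[V_K] + C\sum_{k=0}^{K-1}\eta_k^2
 \;\le\; V_0 - V^\ast + C\sum_{k=0}^{K-1}\eta_k^2 .
\end{align*}
```

Dividing by $c\sum_k \eta_k$ then bounds a weighted average of the stationarity measure in terms of the initial gap and the accumulated noise.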
Recalling that Lemma 1 states the relation between the stationarity of the minimax problem and that of the original bilevel problem, we have
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Let , , and , we have
|
|
|
|
|
|
|
|
Note that the initial state can be controlled by a constant which is independent of :
|
|
|
|
|
|
|
|
(42) |
where
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
(43) |
where by definition we know and the first inequality follows from the gradient-Lipschitz property of and the Lipschitz continuity of in , and the second inequality uses Lemma 4. The proof is complete.
∎