Theorem 13.
At any two points and , if
|
|
|
then the loss function satisfies the following smoothness property: for any we have
|
|
|
(14) |
Proof.
Note that
|
|
|
Therefore can be bounded as
|
|
|
(15) |
Similarly, we have
|
|
|
(16) |
Recall that by Lemma 9 we have
so
|
|
|
(17) |
where the last inequality follows from and from applying (15) and (16).
The remaining task is to bound . Since , we can use Lemma 18 to bound it as
|
|
|
(18) |
where we used at the last line. Plugging this back into (17), we get
|
|
|
(19) |
∎
Proof.
Case 1. There exists such that . Then by Lemma 20 and Lemma 11 we have
|
|
|
Case 2. For , . Then by Lemma 21 we have (since ).
∎
Proof.
For any , by the Cauchy–Schwarz inequality we have
|
|
|
(20) |
where the last inequality is because .
Note that (20) implies , so for we have
|
|
|
(21) |
where we used at the last inequality.
∎
Proof.
The key idea is to consider the gradient of , which can be calculated as
|
|
|
(22) |
where we used (8) in the second identity.
By the Cauchy–Schwarz inequality, we have , which implies
|
|
|
(23) |
where we used at the second-to-last identity. Careful readers might notice that the term is not well-defined when , but we can still calculate its expectation over the whole probability space, since the integrand is singular only on a measure-zero set.
For each , by (22) we have
|
|
|
So
|
|
|
(24) |
where we used Lemma 19 in the fourth line and the Cauchy–Schwarz inequality in the last line.
The last step is to lower bound . Since is sampled from , which is spherically symmetric, we know that the two random variables are independent. Therefore
|
|
|
(25) |
For the first term in (25), we have since the distribution of is spherically symmetric. By the norm-concentration inequality for Gaussians [Dasgupta and Schulman, 2000] we know that . The second term in (25) can therefore be lower bounded as
|
|
|
(26) |
Plugging (26) into (25), we get
|
|
|
(27) |
Now we can plug (27) into (24) and get
|
|
|
(28) |
where we used the inequality at the second-to-last line.
∎
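The norm-concentration step in the proof above can be checked numerically. The following is a minimal sketch, not part of the original argument: it draws standard Gaussian vectors and reports the mean and standard deviation of their Euclidean norms, which should be close to the square root of the dimension and of constant order, respectively. The function name `gaussian_norm_stats` and the sample size are illustrative assumptions.

```python
import numpy as np

def gaussian_norm_stats(d, m=10000, seed=0):
    """Empirical mean and standard deviation of ||x|| for x ~ N(0, I_d).

    d: dimension; m: number of samples (illustrative choice).
    """
    rng = np.random.default_rng(seed)
    # Each row is one d-dimensional standard Gaussian sample.
    norms = np.linalg.norm(rng.standard_normal((m, d)), axis=1)
    return norms.mean(), norms.std()

if __name__ == "__main__":
    for d in (10, 100, 1000):
        mean, std = gaussian_norm_stats(d)
        print(f"d={d}: mean norm {mean:.3f}, sqrt(d) {np.sqrt(d):.3f}, std {std:.3f}")
```

The spread of the norm stays bounded as the dimension grows while its mean scales like the square root of the dimension, which is the qualitative behaviour the concentration inequality quantifies.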
Theorem 2.
Consider training a student -component GMM initialized from to learn a single-component ground truth GMM with the population gradient EM algorithm. If the step size satisfies , then gradient EM converges globally with rate
|
|
|
where constant , .
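To make the setting of the theorem concrete, the following is a minimal simulation sketch under stated assumptions, not the authors' implementation. It assumes an equal-weight student GMM with identity covariances, a ground truth N(0, I), and replaces the population expectation by an average over a large fixed sample. The names `gradient_em_step` and `run_gradient_em`, and all default values (number of components, dimension, step size, number of steps), are hypothetical choices for illustration and are not tied to the step-size condition of the theorem.

```python
import numpy as np

def gradient_em_step(mu, x, eta):
    """One gradient EM step for an equal-weight, identity-covariance GMM.

    mu: (n, d) student means; x: (m, d) samples standing in for the population.
    """
    # Squared distances between every sample and every mean: shape (m, n).
    sq_dist = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    # Posterior responsibilities: softmax over components, computed stably.
    logits = -0.5 * sq_dist
    logits -= logits.max(axis=1, keepdims=True)
    psi = np.exp(logits)
    psi /= psi.sum(axis=1, keepdims=True)
    # Sample estimate of E[psi_i(x) (x - mu_i)] for each component i: shape (n, d).
    direction = (psi[:, :, None] * (x[:, None, :] - mu[None, :, :])).mean(axis=0)
    return mu + eta * direction

def run_gradient_em(n=3, d=2, m=20000, eta=0.5, steps=200, seed=0):
    """Run gradient EM and track the largest student-mean norm per iteration."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((m, d))         # assumed ground truth: N(0, I)
    mu = 2.0 * rng.standard_normal((n, d))  # arbitrary initialization
    history = [np.linalg.norm(mu, axis=1).max()]
    for _ in range(steps):
        mu = gradient_em_step(mu, x, eta)
        history.append(np.linalg.norm(mu, axis=1).max())
    return mu, history

if __name__ == "__main__":
    _, history = run_gradient_em()
    print("max ||mu_i|| at start:", history[0], "and at end:", history[-1])
```

Because the ground truth is centred at the origin in this sketch, the largest mean norm serves as a simple proxy for distance to the ground truth. With a finite sample the iterates approach a stationary point of the empirical objective rather than the exact population optimum.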
Proof.
We use mathematical induction to prove Theorem 2, by proving the following two conditions inductively:
|
|
|
(29) |
|
|
|
(30) |
Note that (30) directly implies the theorem, so now we just need to prove (29) and (30) together.
The induction base for is trivial. Now suppose the conditions hold for time step , and consider . By the induction hypothesis (29) we have .
Proof of (30).
Since , we can apply classical analysis of gradient descent [Nesterov et al., 2018] as
|
|
|
(31) |
Note that the gradient norm can be upper bounded as
|
|
|
Then for any , we have .
So we can apply Theorem 13 and get
|
|
|
Therefore for ,
|
|
|
(32) |
Plugging (32) into (31), since we have
|
|
|
(33) |
By Lemma 12 we can lower bound the gradient norm as
|
|
|
(34) |
Combining (34) and (33), we have
|
|
|
(35) |
Note that the above inequality implies , therefore
|
|
|
On the other hand, by the induction hypothesis we have . Combined with the above inequality, this gives , which finishes the proof of (30).
Proof of (29).
The dynamics of the potential function can be calculated as
|
|
|
(36) |
The first term can be bounded by Lemma 12 as
|
|
|
(37) |
The second term is a perturbation term that can be upper bounded by Lemma 9 as
|
|
|
(38) |
where we use the triangle inequality in the second and third lines, and the Cauchy–Schwarz inequality in the fourth and fifth lines.
Putting (38), (37) and (36) together, we get
|
|
|
a). If , then
|
|
|
note that we used .
b). If , then .
Since (29) holds in both cases, our proof is done.
∎
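The proof of (30) above relies on the classical descent argument for smooth functions. As a generic illustration, which is not the loss analysed here and may differ in form from (31), the sketch below checks the standard descent inequality f(x - eta g) <= f(x) - eta (1 - L eta / 2) ||g||^2 on a quadratic f(x) = 0.5 x^T A x, where g is the gradient at x and L is the largest eigenvalue of the symmetric matrix A. The function name `descent_gap` is a hypothetical helper.

```python
import numpy as np

def descent_gap(A, x, eta):
    """Slack in the descent inequality for f(x) = 0.5 * x^T A x.

    Returns f(x) - eta * (1 - L * eta / 2) * ||g||^2 - f(x - eta * g),
    which is nonnegative when A is symmetric and L is its largest eigenvalue.
    """
    def f(z):
        return 0.5 * z @ A @ z

    g = A @ x                          # gradient of f at x
    L = np.linalg.eigvalsh(A).max()    # smoothness constant of the quadratic
    return f(x) - eta * (1 - L * eta / 2) * (g @ g) - f(x - eta * g)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B = rng.standard_normal((5, 5))
    A = B.T @ B                        # symmetric positive semidefinite matrix
    x = rng.standard_normal(5)
    for eta in (0.01, 0.1, 1.0):
        print(f"eta={eta}: gap {descent_gap(A, x, eta):.6f}")
```

In the theorem the smoothness constant is only available locally, through Theorem 13, which is why the proof first controls how far one gradient step can move the iterate before applying this type of bound.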