Toward Global Convergence of Gradient EM for Over-Parameterized Gaussian Mixture Models

Weihang Xu
University of Washington
[email protected]

Maryam Fazel
University of Washington
[email protected]

Simon S. Du
University of Washington
[email protected]

Abstract

We study the gradient Expectation-Maximization (EM) algorithm for Gaussian Mixture Models (GMM) in the over-parameterized setting, where a general GMM with $n>1$ components learns from data that are generated by a single ground truth Gaussian distribution. While results for the special case of 2-Gaussian mixtures are well-known, a general global convergence analysis for arbitrary $n$ remains unresolved and faces several new technical barriers since the convergence becomes sub-linear and non-monotonic. To address these challenges, we construct a novel likelihood-based convergence analysis framework and rigorously prove that gradient EM converges globally with a sublinear rate $O(1/\sqrt{t})$ . This is the first global convergence result for Gaussian mixtures with more than $2$ components. The sublinear convergence rate is due to the algorithmic nature of learning over-parameterized GMM with gradient EM. We also identify a new emerging technical challenge for learning general over-parameterized GMM: the existence of bad local regions that can trap gradient EM for an exponential number of steps.

1 Introduction

Learning Gaussian Mixture Models (GMM) is a fundamental problem in machine learning with broad applications. In this problem, data generated from a mixture of $n\geq 2$ ground truth Gaussians are observed without their labels (the index of the component Gaussian that each data point is sampled from), and the goal is to retrieve the maximum likelihood estimate of the Gaussian components. The Expectation Maximization (EM) algorithm is arguably the most widely used algorithm for this problem. Each iteration of the EM algorithm consists of two steps. In the expectation (E) step, it computes the posterior probability of the unobserved mixture membership label under the current model parameters. In the maximization (M) step, it computes the maximizer of the $Q$ function, which is the expected complete-data log-likelihood with respect to the posterior over the hidden label computed in the E step.

Gradient EM, a popular variant of EM, is often used in practice when the maximization step of EM is costly or even intractable. It replaces the M step of EM with a single gradient step on the $Q$ function. Learning Gaussian Mixture Models with EM/gradient EM is an important and widely studied problem. Starting from the seminal work of Balakrishnan et al. [2014], a flurry of work [Daskalakis et al., 2017, Xu et al., 2016, Dwivedi et al., 2018a, Kwon and Caramanis, 2020, Dwivedi et al., 2019] has studied convergence guarantees for EM/gradient EM in various settings. However, these works either prove only local convergence or consider the special case of $2$-Gaussian mixtures. A general global convergence analysis of EM/gradient EM on $n$-Gaussian mixtures remains unresolved. ** et al. [2016] is a notable negative result in this regard: the authors show that on GMM with $n\geq 3$ components, randomly initialized EM gets trapped in a spurious local minimum with high probability.

Over-parameterized Gaussian Mixture Models. Motivated by the negative results, a line of work considers the over-parameterized setting where the model uses more Gaussian components than the ground truth GMM, in the hope that this helps the global convergence of EM and bypasses the negative result. In this over-parameterized regime, the best known result is due to Dwivedi et al. [2018b], who prove global convergence of EM for 2-Gaussian mixtures on a single Gaussian ground truth. The authors also show that EM converges at a sub-linear rate in this over-parameterized setting (compared with the linear convergence rate in the exact-parameterized setting [Balakrishnan et al., 2014]). This motivates the following natural open question:

Can we prove global convergence of the EM/gradient EM algorithm on general $n$ -Gaussian mixtures in the over-parameterized regime?

In this paper, we take a significant step towards answering this question. Our main contributions can be summarized as follows:

• We prove global convergence of the gradient EM algorithm for learning a general $n$-component GMM on a single ground truth Gaussian distribution. This is, to the best of our knowledge, the first global convergence proof for general $n$-component GMM. Our convergence rate is sub-linear, reflecting an inherent property of over-parameterized GMM (see Remark 3 for details).

• We propose a new analysis framework that utilizes the likelihood function for proving convergence of gradient EM. Our new framework tackles several emerging technical barriers for global analysis of general GMM.

• We also identify a new geometric property of gradient EM for learning general $n$-component GMM: there exist bad initialization regions that trap gradient EM for exponentially long, resulting in an unavoidable exponential factor in the convergence rate of gradient EM.

1.1 Gaussian Mixture Model (GMM)

We consider the canonical Gaussian Mixture Models with weights $\bm{\pi}=(\pi_{1},\ldots,\pi_{n})$ ( $\sum_{i=1}^{n}\pi_{i}=1$ ), means $\bm{\mu}=(\mu_{1}^{\top},\ldots,\mu_{n}^{\top})^{\top}$ and unit covariance matrices $I_{d}$ in $d$ -dimensional space. Following a widely-studied setting [Balakrishnan et al., 2014, Yan et al., , Daskalakis et al., 2017], we set the weights $\bm{\pi}$ and covariances $I_{d}$ in student GMM as fixed, and the means $\bm{\mu}=(\mu_{1}^{\top},\ldots,\mu_{n}^{\top})^{\top}$ as trainable parameters. We use $\text{GMM}(\bm{\mu})$ to denote the GMM model parameterized by $\bm{\mu}$ , which can be described with probability density function (PDF) $p_{\bm{\mu}}:\mathbf{R}^{d}\to\mathbf{R}_{\geq 0}$ as

p_{\bm{\mu}}(x)=\sum_{i\in[n]}\pi_{i}\phi(x|\mu_{i},I_{d})=\sum_{i\in[n]}\pi_{i}(2\pi)^{-d/2}\exp\left(-\frac{\|x-\mu_{i}\|^{2}}{2}\right),

(1)

where $\phi(\cdot|\mu,\Sigma)$ is the PDF of $\mathcal{N}(\mu,\Sigma)$ , $\pi_{1}+\cdots+\pi_{n}=1,\pi_{i}>0,\forall i\in[n]$ .
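As a concrete reading of (1), the sketch below evaluates $\log p_{\bm{\mu}}(x)$ for a batch of points with NumPy. The function name `gmm_log_density` and the array layout (means stacked as rows) are choices made for this illustration, not notation from the paper; the log-sum-exp form is used only for numerical stability.

```python
import numpy as np

def gmm_log_density(x, mus, weights):
    """Log of p_mu(x) in (1) for points x of shape (N, d).

    mus has shape (n, d) (one mean per row), weights has shape (n,).
    All covariances are the identity, as in the text.
    """
    x = np.atleast_2d(x)
    d = x.shape[1]
    # Squared distances ||x - mu_i||^2, shape (N, n).
    sq_dist = ((x[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
    log_terms = np.log(weights)[None, :] - 0.5 * sq_dist - 0.5 * d * np.log(2 * np.pi)
    # Log-sum-exp over the n components.
    m = log_terms.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_terms - m).sum(axis=1, keepdims=True)))[:, 0]
```

When all means coincide at the origin, the mixture collapses to $\mathcal{N}(0,I_{d})$ regardless of the weights, which gives a quick sanity check for the function.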

1.2 Gradient EM algorithm

The EM algorithm is one of the most popular algorithms for retrieving the maximum likelihood estimator (MLE) in latent variable models. In general, EM and gradient EM address the following problem: given a joint distribution $p_{\bm{\mu}^{*}}(x,y)$ of random variables $x,y$ parameterized by $\bm{\mu}^{*}$, where only the distribution of $x$ is observed and $y$ is a latent variable, retrieve the maximum likelihood estimator

\hat{\bm{\mu}}_{\text{MLE}}\in\arg\max_{\bm{\mu}}\log p_{\bm{\mu}}(x).

The focus of this paper is the non-convex optimization analysis, so we consider using the population gradient EM algorithm to learn GMM (1), where the observed variable is $x\in\mathbf{R}^{d}$ and the latent variable is the index of the membership Gaussian in the GMM. We follow the standard teacher-student setting where a student model $\text{GMM}(\bm{\mu})$ with $n\geq 2$ Gaussian components learns from data generated from a ground truth teacher model $\text{GMM}(\bm{\mu}^{*})$. We consider the over-parameterized setting where the ground truth model $\text{GMM}(\bm{\mu}^{*})$ is a single Gaussian distribution $\mathcal{N}(0,I_{d})$, namely $\bm{\mu}^{*}=({\mu_{1}^{*}}^{\top},\ldots,{\mu_{n}^{*}}^{\top})^{\top}=({0}^{\top},\ldots,{0}^{\top})^{\top}$. Our problem can be seen as a strict generalization of Dwivedi et al. [2018b], who studied a mixture model of two Gaussians with symmetric means (they impose the constraint $\mu_{2}=-\mu_{1}$) learning a single Gaussian.

At time step $t=0,1,2,\ldots$, given parameters $\bm{\mu}(t)=(\mu_{1}(t)^{\top},\ldots,\mu_{n}(t)^{\top})^{\top}$, population gradient EM updates $\bm{\mu}$ via the following two steps:

• E step: for each $i\in[n]$, compute the membership weight function $\psi_{i}:\mathbf{R}^{d}\to\mathbf{R}$ defined as

\psi_{i}(x|\bm{\mu}(t))=\Pr[i|x]=\frac{\pi_{i}\exp\left(-\frac{\|x-\mu_{i}(t)\|^{2}}{2}\right)}{\sum_{k\in[n]}\pi_{k}\exp\left(-\frac{\|x-\mu_{k}(t)\|^{2}}{2}\right)}.

(2)

• M step: define $Q(\cdot|\bm{\mu}(t))$ as

Q(\bm{\mu}|\bm{\mu}(t))=\mathbf{E}_{x\sim\mathcal{N}(0,I_{d})}\left[\sum_{i=1}^{n}-\psi_{i}(x|\bm{\mu}(t))\frac{\|x-\mu_{i}\|^{2}}{2}\right],

Gradient EM with step size $\eta>0$ performs the following update:

\mu_{i}(t+1)=\mu_{i}(t)+\eta\nabla_{\mu_{i}}Q(\bm{\mu}(t)|\bm{\mu}(t))=\mu_{i}(t)-\eta\mathbf{E}_{x\sim\mathcal{N}(0,I_{d})}\left[\psi_{i}(x|\bm{\mu}(t))(\mu_{i}(t)-x)\right].

(3)

The membership weight function $x\to\psi_{i}(x|\bm{\mu})$ represents the posterior probability of data point $x$ being sampled from the $i^{\text{th}}$ Gaussian of $\text{GMM}(\bm{\mu})$ . For ease of notation, we sometimes simply write $\psi_{i}(x|\bm{\mu})$ as $\psi_{i}(x)$ when the choice of $\bm{\mu}$ is obvious.
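The two steps above can be sketched in a few lines of NumPy. This is only an illustration: the population expectation $\mathbf{E}_{x\sim\mathcal{N}(0,I_{d})}$ is replaced by an average over samples `x` supplied by the caller, and the function names `membership_weights` and `gradient_em_step` are hypothetical.

```python
import numpy as np

def membership_weights(x, mus, weights):
    """psi_i(x | mu) from (2); returns shape (N, n), each row sums to 1."""
    sq_dist = ((x[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
    logits = np.log(weights)[None, :] - 0.5 * sq_dist
    logits -= logits.max(axis=1, keepdims=True)   # stabilise the softmax
    psi = np.exp(logits)
    return psi / psi.sum(axis=1, keepdims=True)

def gradient_em_step(mus, weights, eta, x):
    """One update of the means, with E_x approximated by the sample mean over x.

    Each mean moves by -eta * mean_x[ psi_i(x) * (mu_i - x) ].
    """
    psi = membership_weights(x, mus, weights)             # E step, shape (N, n)
    # mean_x[psi_i(x) (mu_i - x)] = mean_x[psi_i] mu_i - mean_x[psi_i x]
    grad = psi.mean(axis=0)[:, None] * mus - psi.T @ x / x.shape[0]
    return mus - eta * grad                                # gradient step
```

With all means at the origin the membership weights reduce to $\pi_{i}$, so the step moves each $\mu_{i}$ by $\eta\pi_{i}$ times the sample mean of `x`; in the population limit that sample mean is zero and the origin is a fixed point.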

1.3 Loss function of gradient EM

Since the task of gradient EM is to find the MLE over ground truth distribution $p_{\bm{\mu}^{*}}$ , we can define the MLE loss function for gradient EM as

\mathcal{L}(\bm{\mu})=D_{\text{KL}}(p_{\bm{\mu}^{*}}||p_{\bm{\mu}})=-\mathbf{E}_{x\sim p_{\bm{\mu}^{*}}}\left[\log\left(\frac{p_{\bm{\mu}}(x)}{p_{\bm{\mu}^{*}}(x)}\right)\right].

(4)

The loss $\mathcal{L}$ is the Kullback–Leibler (KL) divergence between the ground truth GMM and the student GMM. Since finding the MLE is equivalent to minimizing the KL divergence between the model and the ground truth, the goal of gradient EM is equivalent to finding the global minimum of $\mathcal{L}$. In other words, proving that gradient EM finds the MLE is equivalent to proving the convergence of $\mathcal{L}$ to $0$. The loss $\mathcal{L}$ is important for a second reason as well: it is closely related to the dynamics of gradient EM.

Gradient EM is gradient descent on $\mathcal{L}$. We present the following important observation. The proof is deferred to the appendix.

Fact 1.

For any $\bm{\mu}$, $\nabla Q(\bm{\mu}|\bm{\mu})=-\nabla\mathcal{L}(\bm{\mu})$.

Fact 1 states that the gradient of the $Q$ function that gradient EM uses in each iteration coincides, up to sign, with the gradient of the loss function $\mathcal{L}$: a gradient ascent step on $Q$ is a gradient descent step on $\mathcal{L}$. This observation is very useful since it implies that gradient EM is equivalent to the gradient descent (GD) algorithm on $\mathcal{L}$. It is not a new discovery of ours but widespread folklore (see [** et al., 2016]). Our contribution is to observe that Fact 1 is very helpful for analyzing gradient EM, and to construct a new convergence analysis framework for gradient EM based on it.
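For readers who want to check the relation between the two gradients directly, the following short derivation differentiates (4) and the $Q$ function of the M step with respect to a single mean $\mu_{i}$. It is a sketch written for this overview rather than the appendix proof, and it assumes that differentiation and expectation can be exchanged.

```latex
\begin{align*}
\nabla_{\mu_i}\log p_{\bm{\mu}}(x)
  &= \frac{\pi_i\,\phi(x|\mu_i,I_d)\,(x-\mu_i)}{p_{\bm{\mu}}(x)}
   = \psi_i(x|\bm{\mu})\,(x-\mu_i), \\
\nabla_{\mu_i}\mathcal{L}(\bm{\mu})
  &= -\mathbf{E}_{x}\left[\nabla_{\mu_i}\log p_{\bm{\mu}}(x)\right]
   = \mathbf{E}_{x}\left[\psi_i(x|\bm{\mu})\,(\mu_i-x)\right], \\
\nabla_{\mu_i}Q(\bm{\mu}'|\bm{\mu})\Big|_{\bm{\mu}'=\bm{\mu}}
  &= \mathbf{E}_{x}\left[-\psi_i(x|\bm{\mu})\,(\mu_i-x)\right]
   = -\nabla_{\mu_i}\mathcal{L}(\bm{\mu}).
\end{align*}
```

So, with $Q$ written as an expected log-likelihood to be maximized, a step of size $\eta$ along $\nabla Q$ is a gradient descent step of size $\eta$ on $\mathcal{L}$: it subtracts $\eta\,\mathbf{E}_{x}[\psi_{i}(x|\bm{\mu})(\mu_{i}-x)]$ from $\mu_{i}$.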

1.4 Notations

In this paper, we adopt the following notational conventions. We denote $\{1,2,\ldots,n\}$ by $[n]$. $\bm{\mu}=(\mu_{1}^{\top},\ldots,\mu_{n}^{\top})^{\top}\in\mathbf{R}^{nd}$ denotes the parameter vector of the GMM obtained by concatenating the Gaussian mean vectors $\mu_{1},\ldots,\mu_{n}$. For any vector $\mu$, $\mu(t)$ denotes its value at time step $t$; we sometimes omit the iteration number $t$ when it is clear from context and abbreviate $\mu(t)$ as $\mu$. We write $\mathbf{E}_{x}[\cdot]$ as shorthand for the expectation over the ground truth distribution, $\mathbf{E}_{x\sim\mathcal{N}(0,I_{d})}[\cdot]$. For any vector $v\neq 0$, we use $\overline{v}\coloneqq v/\|v\|$ to denote the normalization of $v$. We define (with a slight abuse of notation) ${i_{\max}}\coloneqq\arg\max_{i\in[n]}\{\|\mu_{i}\|\}$ as the index of the $\mu_{i}$ with the maximum norm, and $\mu_{\max}\coloneqq\|\mu_{{i_{\max}}}\|=\max_{i\in[n]}\{\|\mu_{i}\|\}$ as the maximum norm of the $\mu_{i}$. In particular, $\mu_{\max}(t)=\max\{\|\mu_{1}(t)\|,\ldots,\|\mu_{n}(t)\|\}$. Similarly, $\pi_{\min}\coloneqq\min_{i\in[n]}\pi_{i}$ and ${\pi_{\max}}\coloneqq\max_{i\in[n]}\pi_{i}$ denote the minimal and maximal $\pi_{i}$, respectively. We use $\nabla_{\mu_{i}}\mathcal{L}$ to denote the gradient of $\mathcal{L}$ with respect to $\mu_{i}$, and $\nabla\mathcal{L}=(\nabla_{\mu_{1}}\mathcal{L}^{\top},\ldots,\nabla_{\mu_{n}}\mathcal{L}^{\top})^{\top}$ denotes the collection of all gradients.

1.5 Technical overview

Here we provide a brief summary of the major technical barriers for our global convergence analysis and our techniques for overcoming them.

New likelihood-based analysis framework. The traditional convergence analysis for EM/gradient EM in previous works Balakrishnan et al. [2014], Yan et al. , Kwon and Caramanis [2020] proceeds by showing the distance between the model and the ground truth GMM in the parameter space contracts linearly in every iteration. This type of approach meets new challenges in the over-parameterized $n$ -Gaussian mixture setting since the convergence is both sub-linear and non-monotonic. To address these problems, we propose a new likelihood-based convergence analysis framework: instead of proving the convergence of parameters, our analysis proceeds by showing the likelihood loss function $\mathcal{L}$ converges to $0$ . The new analysis framework is more flexible and allows us to overcome the aforementioned technical barriers.

Gradient lower bound. The first step of our global convergence analysis constructs a gradient lower bound. Using some algebraic transformation techniques, we convert the gradient projection $\left\langle\nabla\mathcal{L}(\bm{\mu}),\bm{\mu}\right\rangle$ into the expected squared norm of a random vector $\tilde{\bm{\psi}}(x)$ (see Section 4 for the full definition). Although lower bounding the expectation of $\|\tilde{\bm{\psi}}\|^{2}$ directly is very challenging, our key idea is that the gradient of $\tilde{\bm{\psi}}$ has very nice properties and can be easily lower bounded, allowing us to establish the gradient lower bound.

Local smoothness and regularity condition. After obtaining the gradient lower bound, the missing component of the proof is a smoothness condition of the loss function $\mathcal{L}$ . Since proving the smoothness of $\mathcal{L}$ is hard in general, we define and prove a weaker notion of local smoothness, which suffices to prove our result. In addition, we design and use an auxiliary function $U$ to show that gradient EM trajectory satisfies the locality required by our smoothness lemma.

2 Related work

2.1 2-Gaussian mixtures

There is a vast literature studying the convergence of EM/gradient EM on $2$-component GMM. The initial batch of results proves convergence within an infinitesimally small local region [Xu and Jordan, 1996, Ma et al., 2000]. Balakrishnan et al. [2014] prove for the first time convergence of EM and gradient EM within a non-infinitesimal local region. Among the later works on the same problem, Klusowski and Brinda [2016] improve the basin-of-convergence guarantee, and Daskalakis et al. [2017] and Xu et al. [2016] prove global convergence for $2$-Gaussian mixtures. These works focus on the exact-parameterization scenario where the number of student mixture components is the same as that of the ground truth. More recently, Wu and Zhou [2019] prove global convergence for $2$-component GMM without any separation condition. Their result can be viewed as a convergence result in the over-parameterized setting where the student model has two Gaussians and the ground truth is a single Gaussian. On the other hand, their setting is more restricted than ours because they require the means of the two Gaussians in the student model to be symmetric around the ground truth mean. Weinberger and Bresler [2021] extend the convergence guarantee to the case of unbalanced weights. Another line of work [Dwivedi et al., 2018a, 2018b, 2019] studies the over-parameterized setting of using a $2$-Gaussian mixture to learn a single Gaussian and proves global convergence of EM. Our result extends this type of analysis to the general case of $n$-Gaussian mixtures, which requires significantly different techniques. Beyond Gaussian mixture models, there are also works studying EM algorithms for other mixture models, such as mixtures of linear regressions [Kwon et al., 2019].

2.2 N-Gaussian mixtures

Another line of results focuses on the general case of $n$-Gaussian mixtures. ** et al. [2016] provide a counter-example showing that EM does not converge globally for $n>2$ (in the exact-parameterized case). Dasgupta and Schulman [2000] prove that a variant of EM converges to the MLE in two rounds for $n$-GMM. Their result relies on a modification of the EM algorithm and is not comparable with ours. Chen et al. [2023] analyze the structure of local minima in the likelihood function of GMM. However, their result is purely geometric and does not provide any convergence guarantee.

A series of papers Yan et al. , Zhao et al. , Kwon and Caramanis [2020], Segol and Nadler follow the framework proposed by Balakrishnan et al. [2014] to prove local convergence of EM for $n$-GMM. While their results apply to the more general setting of an $n$-Gaussian mixture ground truth, their framework only provides local convergence guarantees and cannot be directly applied to our setting.

2.3 Slowdown due to over-parameterization

This paper gives an $O\left(1/\sqrt{t}\right)$ bound for fitting over-parameterized Gaussian mixture models to a single Gaussian. Recall that to learn a single Gaussian, if one’s student model is also a single Gaussian, then one can obtain an $\exp(-\Omega(t))$ rate because the loss is strongly convex. This slowdown effect due to over-parameterization has been observed for Gaussian mixtures in Dwivedi et al. [2018a], Wu and Zhou [2019], but has also been observed in other learning problems, such as learning a two-layer neural network Xu and Du [2023], Richert et al. [2022] and matrix sensing problems [Xiong et al., 2023, Zhang et al., 2021, Zhuo et al., 2021].

3 Main results

In this section, we present our main theoretical results, which consist of two parts. In Section 3.1 we present our global convergence analysis of gradient EM. In Section 3.2 we prove that the exponentially small factor in our convergence bound is unavoidable. All omitted proofs are deferred to the appendix.

3.1 Global convergence of gradient EM

We first present our main result, which states that gradient EM converges to MLE globally.

Theorem 2 (Main result).

Consider training a student $n$-component GMM initialized from $\bm{\mu}(0)=(\mu_{1}(0)^{\top},\ldots,\mu_{n}(0)^{\top})^{\top}$ to learn a single-component ground truth GMM $\mathcal{N}(0,I_{d})$ with population gradient EM. If the step size satisfies $\eta\leq O\left(\frac{\exp\left(-8n\mu_{\max}^{2}(0)\right)\pi_{\min}^{2}}{n^{2}d^{2}(\frac{1}{\mu_{\max}(0)}+\mu_{\max}(0))^{2}}\right)$, then gradient EM converges globally with rate

\mathcal{L}(\bm{\mu}(t))\leq\frac{1}{\sqrt{\gamma t}},

where the constant $\gamma=\Omega\left(\frac{\eta\exp\left(-16n\mu_{\max}^{2}(0)\right)\pi_{\min}^{4}}{n^{2}d^{2}(1+\mu_{\max}(0){\sqrt{dn}})^{4}}\right)\in\mathbf{R}^{+}$ and $\mu_{\max}(0)=\max\{\|\mu_{1}(0)\|,\ldots,\|\mu_{n}(0)\|\}$.

Remark 3.

Without over-parameterization, for learning a single Gaussian, one can obtain linear convergence $\exp(-\Omega\left(t\right))$. We would like to note that the sub-linear convergence rate of gradient EM stated in Theorem 2 ($\mathcal{L}(\bm{\mu}(t))\leq O(1/\sqrt{t})$) is due to the inherent nature of the algorithm. Dwivedi et al. [2018b] studied the special case of using 2-Gaussian mixtures with symmetric means to learn a single Gaussian and proved that EM has a sublinear convergence rate when the weights $\pi_{i}$ are equal. Since Theorem 2 covers the more general case of $n$-Gaussian mixtures, this type of sublinear convergence rate is the best that we can hope for.
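To make the gap between the two rates concrete, here is a small worked comparison of how many iterations each upper bound needs before it drops below a target accuracy $\varepsilon$. The constants $C,c>0$ in the linear rate are placeholders introduced for illustration; only the $1/\sqrt{\gamma t}$ bound comes from Theorem 2.

```latex
\begin{align*}
\frac{1}{\sqrt{\gamma t}}\le\varepsilon
  &\iff t\ge\frac{1}{\gamma\varepsilon^{2}}
  &&\text{(bound of Theorem 2)},\\
C\exp(-ct)\le\varepsilon
  &\iff t\ge\frac{1}{c}\log\frac{C}{\varepsilon}
  &&\text{(a linear rate)}.
\end{align*}
```

For example, $\varepsilon=10^{-3}$ requires $t\geq 10^{6}/\gamma$ under the first bound, but only $t\geq(\log C+\log 10^{3})/c\approx(\log C+6.9)/c$ under the second.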

Remark 4.

The convergence rate in Theorem 2 has a factor that is exponentially small in the initialization scale ($\gamma\propto\exp(-16n\mu_{\max}^{2}(0))$). We would like to stress that this is again due to the algorithmic nature of the problem rather than a limitation of the analysis. In Section 3.2, we prove that there exist bad regions with exponentially small gradients, so that when initialized from such a region, gradient EM gets trapped locally for $\exp(\Omega(\mu_{\max}^{2}(0)))$ steps. Therefore, a convergence guarantee with a factor exponentially small in the square of the initialization scale is unavoidable.

Remark 5.

Theorem 2 is fundamentally different from convergence analysis for EM/gradient EM in previous works Yan et al. , Dwivedi et al. [2019], Balakrishnan et al. [2014] which proved monotonic linear contraction of parameter distance $\|\bm{\mu}(t)-\bm{\mu}^{*}\|$ . But our result also implies global convergence since loss function $\mathcal{L}$ converging to $0$ is equivalent to convergence of gradient EM to MLE.

Remark 6.

The convergence result in Theorem 2 is for population gradient EM, but it also implies global convergence for sample-based gradient EM as the sample size tends to infinity. For a similar reduction from population EM to sample EM, see Section 2.2 of [Xu et al., 2016].

3.2 Necessity of exponentially small factor in convergence rate

In this section we prove that a factor of $\exp(-\Theta(\mu_{\max}^{2}(0)))$ is inevitable in the global convergence rate guarantee of gradient EM. To demonstrate this, we show that there exists some bad region such that initializing from this region will trap gradient EM for an exponentially long time before it converges to the global minimum. Our result is the following theorem.

Theorem 7 (Existence of bad initialization region).

For any $n=2l+1$, consider gradient EM initialized at the point $\mu_{1}(0)=0$, $\mu_{2}(0)=\cdots=\mu_{l+1}(0)=12\sqrt{d}e_{1}$, $\mu_{l+2}(0)=\cdots=\mu_{2l+1}(0)=-12\sqrt{d}e_{1}$, where $e_{1}=(1,0,\ldots,0)^{\top}$ is a standard unit vector. Then population gradient EM will be trapped in a bad local region around $\bm{\mu}(0)$ for exponentially many time steps $T=\frac{1}{15n\eta}e^{d}=\frac{1}{15n\eta}\exp(\Theta(\mu_{\max}^{2}(0)))$. More precisely, for any $0\leq t\leq T$, we have

\|\mu_{i}(t)\|\geq 10\sqrt{d},\forall i\neq 1.

Theorem 7 states that, when initialized from some bad points $\bm{\mu}(0)$ , after $\exp(\Theta(\mu_{\max}^{2}(0)))$ number of time steps, gradient EM will still stay in this local region and remain $10\sqrt{d}$ distance away from the global minimum $\bm{\mu}=0$ . Therefore an exponentially small factor in convergence rate is inevitable.

Remark 8.

Theorem 7 eliminates the possibility of proving any polynomial convergence rate of gradient EM from arbitrary initialization. However, it is still possible to prove that, with some specific smart initialization schemes, gradient EM avoids the bad regions stated in Theorem 7 and enjoys a polynomial convergence rate. We leave this as an interesting open question for future analysis.

4 Proof overview

In this section, we provide a technical overview of the proofs of our main results (Theorem 2 and Theorem 7).

4.1 Difficulties of a global convergence proof and our new analysis framework

Proving the global convergence of gradient EM for general $n$-Gaussian mixtures is highly nontrivial. While there have been many previous works [Balakrishnan et al., 2014, Yan et al., , Dwivedi et al., 2018b] studying either local convergence or the special case of $2$-Gaussian mixtures, they all focus on showing the contraction of the parametric error. Namely, their proofs proceed by showing that the distance between the model parameter and the ground truth contracts, usually by a fixed linear ratio, in each iteration of the algorithm. However, this kind of approach faces various challenges for our general problem, where the convergence is both sublinear and non-monotonic. Since the convergence rate is sublinear (see Remark 3), showing a linear contraction per iteration is no longer possible. Since the convergence is non-monotonic (to see this, consider $n=2,\mu_{1}=0,\mu_{2}=(1,0,\ldots,0)^{\top}$; then the norm of $\mu_{1}$ strictly increases after one iteration), we also cannot show a strictly decreasing parametric distance.

To address these challenges, we propose a new convergence analysis framework for gradient EM by proving the convergence of the likelihood loss $\mathcal{L}$ instead of the convergence of the parameters $\bm{\mu}$. There are several benefits to considering convergence from the perspective of the MLE loss $\mathcal{L}$. Firstly, it naturally addresses the problem of non-monotonic and sub-linear convergence, since we only need to show that $\mathcal{L}$ decreases as the algorithm updates. Also, since gradient EM is equivalent to running gradient descent on the loss function $\mathcal{L}$ (see Section 1.3), we can apply techniques from the optimization theory of gradient descent to facilitate our analysis.

4.2 Proof ideas for Theorem 2

We first briefly outline our proof of Theorem 2.

Proof roadmap. Our proof of Theorem 2 consists of three steps. Firstly, we prove a gradient lower bound for $\mathcal{L}$ (Lemma 12). Then we prove that the MLE loss $\mathcal{L}$ is locally smooth (Theorem 13). Finally, we combine the gradient lower bound and the smoothness condition to prove the global convergence of $\mathcal{L}$ by mathematical induction.

Step 1: Gradient lower bound.

Our first step aims to show that the gradient norm of $\mathcal{L}(\bm{\mu})$ is lower bounded by the distance of $\bm{\mu}$ to the ground truth. To do this, we need a few preliminary results. Inspired by Chen et al. [2023], we use Stein’s identity [Stein, 1981] to perform an algebraic transformation of the gradient. Recalling the definition of $\psi_{i}$ in (2), we have the following lemma.

Lemma 9.

For any $\text{GMM}(\bm{\mu}),i\in[n]$ , the gradient of $Q$ satisfies

\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu})=-\nabla_{\mu_{i}}Q(\bm{\mu}|\bm{\mu})=\mathbf{E}_{x}\left[\psi_{i}(x)\sum_{k\in[n]}\psi_{k}(x)\mu_{k}\right].

The gradient expression above is equivalent to the form in (3), but is easier to manipulate. Using the transformed gradient in Lemma 9, we have the following corollary.

Corollary 10.

Define the vector $\tilde{\bm{\psi}}_{\bm{\mu}}(x)\coloneqq\sum_{i\in[n]}\psi_{i}(x)\mu_{i}$. For any $\text{GMM}(\bm{\mu})$, the projection of the gradient $\nabla\mathcal{L}(\bm{\mu})$ onto $\bm{\mu}$ satisfies

\left\langle\nabla\mathcal{L}(\bm{\mu}),\bm{\mu}\right\rangle=-\left\langle\nabla_{\bm{\mu}}Q(\bm{\mu}|\bm{\mu}),\bm{\mu}\right\rangle=-\sum_{i\in[n]}\left\langle\nabla_{\mu_{i}}Q(\bm{\mu}|\bm{\mu}),\mu_{i}\right\rangle=\mathbf{E}_{x}\left[\left\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\right\|^{2}\right].
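The following derivation sketches how Stein's identity leads from the plain gradient of $\mathcal{L}$ to the product form in Lemma 9 and then to Corollary 10. It is a reconstruction for the reader's convenience, not the appendix proof; it uses Stein's identity in the form $\mathbf{E}_{x}[x\,g(x)]=\mathbf{E}_{x}[\nabla g(x)]$ for $x\sim\mathcal{N}(0,I_{d})$ and a smooth, suitably integrable scalar function $g$.

```latex
\begin{align*}
\nabla_{\mu_i}\mathcal{L}(\bm{\mu})
  &= \mathbf{E}_{x}\left[\psi_i(x)(\mu_i-x)\right]
   = \mathbf{E}_{x}\left[\psi_i(x)\right]\mu_i-\mathbf{E}_{x}\left[x\,\psi_i(x)\right], \\
\nabla_{x}\psi_i(x)
  &= \psi_i(x)\Big(\mu_i-\sum_{k\in[n]}\psi_k(x)\mu_k\Big), \\
\mathbf{E}_{x}\left[x\,\psi_i(x)\right]
  &= \mathbf{E}_{x}\left[\nabla_{x}\psi_i(x)\right]
   = \mathbf{E}_{x}\left[\psi_i(x)\right]\mu_i
     -\mathbf{E}_{x}\Big[\psi_i(x)\sum_{k\in[n]}\psi_k(x)\mu_k\Big], \\
\nabla_{\mu_i}\mathcal{L}(\bm{\mu})
  &= \mathbf{E}_{x}\Big[\psi_i(x)\sum_{k\in[n]}\psi_k(x)\mu_k\Big]
   = \mathbf{E}_{x}\left[\psi_i(x)\,\tilde{\bm{\psi}}_{\bm{\mu}}(x)\right], \\
\left\langle\nabla\mathcal{L}(\bm{\mu}),\bm{\mu}\right\rangle
  &= \sum_{i\in[n]}\mathbf{E}_{x}\left[\psi_i(x)\left\langle\tilde{\bm{\psi}}_{\bm{\mu}}(x),\mu_i\right\rangle\right]
   = \mathbf{E}_{x}\left[\left\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\right\|^{2}\right].
\end{align*}
```

The last equality only uses $\sum_{i\in[n]}\psi_{i}(x)\mu_{i}=\tilde{\bm{\psi}}_{\bm{\mu}}(x)$.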

Corollary 10 is important since it converts the projection of the gradient $\nabla\mathcal{L}(\bm{\mu})$ onto $\bm{\mu}$ into the expected squared norm of a vector $\tilde{\bm{\psi}}_{\bm{\mu}}$. Since a lower bound on the gradient projection implies a lower bound on the gradient, we only need to construct a lower bound for $\left\langle\nabla\mathcal{L}(\bm{\mu}),\bm{\mu}\right\rangle=\mathbf{E}_{x}\left[\left\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\right\|^{2}\right]$. Since $\left\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\right\|^{2}$ is always non-negative, we already know that the gradient projection is non-negative. But lower bounding $\mathbf{E}_{x}\left[\left\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\right\|^{2}\right]$ is still highly nontrivial since the expression of $\tilde{\bm{\psi}}$ is complicated and hard to handle. However, our key observation is that, although $\tilde{\bm{\psi}}$ itself is hard to bound, its gradient has nice properties and can be handled gracefully:

\nabla_{x}\tilde{\bm{\psi}}_{\bm{\mu}}(x)=\frac{1}{2}\sum_{i,j\in[n]}\psi_{i}(x)\psi_{j}(x)(\mu_{i}-\mu_{j})(\mu_{i}-\mu_{j})^{\top}.

(5)
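Identity (5) can be checked numerically at a single point. The sketch below compares the closed form with a central finite-difference Jacobian of $\tilde{\bm{\psi}}_{\bm{\mu}}$; the helper names and the step size `h` are choices made for this illustration.

```python
import numpy as np

def psi_tilde(x, mus, weights):
    """Return (psi_tilde_mu(x), psi(x)) for a single point x of shape (d,)."""
    logits = np.log(weights) - 0.5 * ((x - mus) ** 2).sum(axis=1)
    psi = np.exp(logits - logits.max())
    psi /= psi.sum()
    return psi @ mus, psi

def jacobian_formula(x, mus, weights):
    """Right-hand side of (5): 0.5 * sum_ij psi_i psi_j (mu_i - mu_j)(mu_i - mu_j)^T."""
    _, psi = psi_tilde(x, mus, weights)
    diffs = mus[:, None, :] - mus[None, :, :]                 # (n, n, d)
    outer = diffs[..., :, None] * diffs[..., None, :]         # (n, n, d, d)
    return 0.5 * np.einsum("i,j,ijab->ab", psi, psi, outer)

def jacobian_finite_diff(x, mus, weights, h=1e-6):
    """Central finite-difference Jacobian of psi_tilde with respect to x."""
    d = x.shape[0]
    jac = np.zeros((d, d))
    for b in range(d):
        e = np.zeros(d)
        e[b] = h
        plus, _ = psi_tilde(x + e, mus, weights)
        minus, _ = psi_tilde(x - e, mus, weights)
        jac[:, b] = (plus - minus) / (2 * h)
    return jac
```

Because the closed form is a non-negative combination of rank-one matrices $(\mu_{i}-\mu_{j})(\mu_{i}-\mu_{j})^{\top}$, its eigenvalues (for instance via `np.linalg.eigvalsh`) should be non-negative up to rounding.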

The gradient (5) is nicely-behaved. One can see immediately from (5) that the matrix $\nabla_{x}\tilde{\bm{\psi}}_{\bm{\mu}}(x)$ is positive-semi-definite, and its eigenvalues can be directly bounded. To utilize these properties, we use the following algebraic trick to convert the task of lower bounding $\tilde{\bm{\psi}}$ itself into the task of lower bounding its gradient.

\mathbf{E}_{x}\left[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^{2}\right]\geq\frac{1}{4}\mathbf{E}_{x}\left[\left(\int_{t=-1}^{1}\|x\|\cdot\overline{x}^{\top}\nabla\tilde{\bm{\psi}}_{\bm{\mu}}(tx)\overline{x}\mathrm{d}t\right)^{2}\right].

(6)
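One way to see where the right-hand side of (6) comes from is the chain of steps below, which shows that it is at most $\mathbf{E}_{x}[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^{2}]$. This is a reconstruction from the definitions above and may differ from the paper's own derivation; it uses the symmetry of $\mathcal{N}(0,I_{d})$, the inequality $\|a-b\|^{2}\leq 2\|a\|^{2}+2\|b\|^{2}$, and $x=\|x\|\overline{x}$.

```latex
\begin{align*}
\mathbf{E}_{x}\left[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^{2}\right]
 &= \tfrac{1}{2}\,\mathbf{E}_{x}\left[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^{2}
    +\|\tilde{\bm{\psi}}_{\bm{\mu}}(-x)\|^{2}\right]
 && (x \text{ and } -x \text{ have the same law})\\
 &\ge \tfrac{1}{4}\,\mathbf{E}_{x}\left[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)
    -\tilde{\bm{\psi}}_{\bm{\mu}}(-x)\|^{2}\right]
 && (\|a-b\|^{2}\le 2\|a\|^{2}+2\|b\|^{2})\\
 &\ge \tfrac{1}{4}\,\mathbf{E}_{x}\left[\left(\overline{x}^{\top}
    \big(\tilde{\bm{\psi}}_{\bm{\mu}}(x)-\tilde{\bm{\psi}}_{\bm{\mu}}(-x)\big)\right)^{2}\right]
 && (\|\overline{x}\|=1)\\
 &= \tfrac{1}{4}\,\mathbf{E}_{x}\left[\left(\int_{-1}^{1}\|x\|\cdot\overline{x}^{\top}
    \nabla\tilde{\bm{\psi}}_{\bm{\mu}}(tx)\,\overline{x}\,\mathrm{d}t\right)^{2}\right]
 && \left(\tfrac{\mathrm{d}}{\mathrm{d}t}\tilde{\bm{\psi}}_{\bm{\mu}}(tx)
    =\nabla\tilde{\bm{\psi}}_{\bm{\mu}}(tx)\,x\right).
\end{align*}
```

Since $\nabla\tilde{\bm{\psi}}_{\bm{\mu}}$ is positive semi-definite by (5), the integrand is non-negative, so a pointwise lower bound on it turns into a lower bound on the whole expression.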

Recall that $\bar{x}=\frac{x}{\|x\|}$ . See detailed derivation in (23). Using (5), combined with the properties of $\nabla_{x}\tilde{\bm{\psi}}_{\bm{\mu}}(x)$ , we can obtain the following lemma.

Lemma 11.

For any $\text{GMM}(\bm{\mu})$ we have

\mathbf{E}_{x}\left[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^{2}\right]\geq\frac{\exp\left(-8\mu_{\max}^{2}\right)}{40000d(1+2\mu_{\max}{\sqrt{d}})^{2}}\left(\sum_{i,j\in[n]}\pi_{i}\pi_{j}\|\mu_{i}-\mu_{j}\|^{2}\right)^{2}.

On top of Lemma 11, we can easily lower bound the gradient projection in the following lemma, finishing the first step of our proof.

Lemma 12 (Gradient projection lower bound).

For any $\text{GMM}(\bm{\mu})$ we have

\left\langle\nabla\mathcal{L}(\bm{\mu}),\bm{\mu}\right\rangle=-\left\langle\nabla_{\bm{\mu}}Q(\bm{\mu}|\bm{\mu}),\bm{\mu}\right\rangle=\mathbf{E}_{x}[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^{2}]=\Omega\left(\frac{\exp\left(-8\mu_{\max}^{2}\right)\pi_{\min}^{2}}{d(1+\mu_{\max}{\sqrt{d}})^{2}}\mu_{\max}^{4}\right).

Step 2: Local smoothness.

To construct a global convergence analysis for gradient-based methods, after obtaining a gradient lower bound, we still need to prove the smoothness of the loss $\mathcal{L}$. (Recall that global smoothness of a function $f$ means that there exists a constant $C$ such that $\|\nabla f(x_{1})-\nabla f(x_{2})\|\leq C\|x_{1}-x_{2}\|,\forall x_{1},x_{2}$.) However, proving smoothness of $\mathcal{L}$ in general is very challenging since the membership function $\psi_{i}$ cannot be controlled when $\bm{\mu}$ is unbounded. To address this issue, we prove that $\mathcal{L}$ is locally smooth, i.e., the smoothness condition between two points $\bm{\mu}$ and $\bm{\mu}^{\prime}$ holds if both $\|\bm{\mu}\|$ and $\|\bm{\mu}-\bm{\mu}^{\prime}\|$ are upper bounded. Our result is the following theorem.

Theorem 13 (Local smoothness of loss function).

At any two points $\bm{\mu}=(\mu_{1}^{\top},\ldots,\mu_{n}^{\top})^{\top}$ and $\bm{\mu}+\bm{\delta}=((\mu_{1}+\delta_{1})^{\top},\ldots,(\mu_{n}+\delta_{n})^{\top})^{\top}$, if

\|\delta_{i}\|\leq\frac{1}{\max\left\{6d,2\|\mu_{i}\|\right\}},\forall i\in[n],

then the loss function $\mathcal{L}$ satisfies the following smoothness property: for any $i\in[n]$ we have

\left\|\nabla_{\mu_{i}+\delta_{i}}\mathcal{L}(\bm{\mu}+\bm{\delta})-\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu})\right\|\leq n\mu_{\max}(30\sqrt{d}+4\mu_{\max})\|\delta_{i}\|+\sum_{k\in[n]}\|\delta_{k}\|.

(7)

Step 3: putting everything together.

Given the gradient lower bound and the smoothness condition, we still need to resolve two remaining problems. The first one is that the gradient lower bound in Lemma 12 is given in terms of $\bm{\mu}$ , which we need to convert to a lower bound in terms of $\mathcal{L}(\bm{\mu})$ . For this we need the following upper bound of $\mathcal{L}$ .

Theorem 14 (Loss function upper bound).

The loss function can be upper bounded as

\mathcal{L}(\bm{\mu})\leq\sum_{i\in[n]}\frac{\pi_{i}}{2}\|\mu_{i}\|^{2}\leq\frac{\mu_{\max}^{2}}{2}.
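The bound in Theorem 14 can be recovered with a short Jensen argument, sketched below as a reconstruction from definitions (1) and (4) (the paper's own proof is in the appendix). It uses the concavity of the logarithm, $\sum_{i\in[n]}\pi_{i}=1$, and $\mathbf{E}_{x}[x]=0$.

```latex
\begin{align*}
\mathcal{L}(\bm{\mu})
 &= -\mathbf{E}_{x}\left[\log\sum_{i\in[n]}\pi_i
    \exp\left(-\frac{\|x-\mu_i\|^{2}}{2}+\frac{\|x\|^{2}}{2}\right)\right]
  = -\mathbf{E}_{x}\left[\log\sum_{i\in[n]}\pi_i
    \exp\left(x^{\top}\mu_i-\frac{\|\mu_i\|^{2}}{2}\right)\right] \\
 &\le -\mathbf{E}_{x}\left[\sum_{i\in[n]}\pi_i
    \left(x^{\top}\mu_i-\frac{\|\mu_i\|^{2}}{2}\right)\right]
  = \sum_{i\in[n]}\frac{\pi_i}{2}\|\mu_i\|^{2}
  \le \frac{\mu_{\max}^{2}}{2}.
\end{align*}
```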

The second problem is that our local smoothness theorem requires $\bm{\mu}$ to be bounded, so we need a regularity condition showing that each $\mu_{i}(t)$ stays in a bounded region during gradient EM updates. This is not easy to prove for each individual $\mu_{i}$ due to the same non-monotonicity issue mentioned in Section 4.1. To establish such a regularity condition, we introduce the following potential function.

Definition 15.

Define potential function $U:\mathbf{R}^{nd}\to\mathbf{R}\;$ for $\text{GMM}(\bm{\mu})$ as

U(\bm{\mu})=\sum_{i\in[n]}\|\mu_{i}\|^{2}.

We prove that the potential function $U$ remains bounded, implying each $\bm{\mu}_{i}$ remains well-behaved. With this regularity condition, combined with the previous two steps, we finish the proof of Theorem 2 via mathematical induction.

4.3 Proof ideas for Theorem 7

Proving Theorem 7 is much simpler. The idea is natural: there exist bad regions where the gradient of $\mathcal{L}$ is exponentially small, as characterized by the following lemma.

Lemma 16 (Gradient norm upper bound).

For any $\bm{\mu}$ satisfying $\|\mu_{1}\|\leq\sqrt{d},\|\mu_{2}\|,\|\mu_{3}\|,\ldots,\|\mu_{n}\|\geq 10\sqrt{d}$ , the gradient of $\mathcal{L}$ at $\bm{\mu}$ can be upper bounded as

\|\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu})\|\leq 2\|\mu_{1}\|+2\exp(-d)\sum_{k\neq 1}\|\mu_{k}\|,\forall i\in[n].

Using Lemma 16, we can prove Theorem 7 by showing that gradient EM initialized in these bad regions stays trapped there for exponentially many steps, since the gradient norm is exponentially small. The full proof can be found in Appendix B.2.

5 Experiments

In this section we report simulation results of gradient EM that support our theoretical results. We use the experimental setting $d=5,\eta=0.7$ , and fit GMMs with $n=2,5,10$ components, respectively, to data generated from a single ground truth Gaussian distribution $\mathcal{N}(\mu^{*},I_{d})$ . Since a closed-form expression of the population gradient is intractable, we approximate each gradient step by the Monte Carlo method with sample size $3.5\times 10^{5}$ . The mixing weights of the student GMM are randomly sampled from a standard Dirichlet distribution and kept fixed during the gradient EM updates. The covariances of all component Gaussians are set to the identity matrix. We record the likelihood loss $\mathcal{L}$ (also estimated by the Monte Carlo method, on fresh samples at each iteration) and the parametric distance $\sum_{i\in[n]}\pi_{i}\|\mu_{i}-\mu^{*}\|^{2}$ along the gradient EM trajectory. The results are reported in Figure 1 (left and middle panels). Both the likelihood loss $\mathcal{L}$ and the parametric distance converge sublinearly, consistent with our theoretical results.
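The simulation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for Figure 1: it takes $\mu^{*}=0$, estimates the gradient with the form given in Lemma 9, initializes the means from a standard normal distribution, and uses a much smaller sample size. The function names `memberships` and `gradient_em` are hypothetical.

```python
import numpy as np


def memberships(x, mu, pi):
    """Posterior psi[s, i] = Pr[i | x_s] for unit-covariance components."""
    sq_dist = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    logits = np.log(pi)[None, :] - 0.5 * sq_dist
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)


def gradient_em(n=5, d=5, eta=0.7, steps=50, n_samples=20000, seed=0):
    """Monte Carlo gradient EM with ground truth N(0, I_d) (assumption: mu* = 0)."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(n))        # fixed mixing weights
    mu = rng.standard_normal((n, d))      # assumed initialization
    potential = [float((mu ** 2).sum())]  # U(mu) = sum_i ||mu_i||^2
    for _ in range(steps):
        x = rng.standard_normal((n_samples, d))  # fresh samples each step
        psi = memberships(x, mu, pi)             # shape (n_samples, n)
        psi_tilde = psi @ mu                     # sum_k psi_k(x) mu_k
        grad = psi.T @ psi_tilde / n_samples     # estimate of E[psi_i * psi_tilde]
        mu = mu - eta * grad
        potential.append(float((mu ** 2).sum()))
    return pi, mu, potential


if __name__ == "__main__":
    pi, mu, potential = gradient_em()
    print(potential[0], potential[-1])
```

With this particular gradient estimator, the potential $U$ of Definition 15 cannot increase in any step when $0<\eta<2$, which makes it a convenient quantity to monitor in such a sketch.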

To verify our negative result, Theorem 7, we consider the bad initialization point $\bm{\mu}(0)$ described in Theorem 7 (to prevent numerical underflow, we change the constant $12$ in $\bm{\mu}(0)$ to $2$ ) and plot the gradient norm at $\bm{\mu}(0)$ for different dimensions $d$ in Figure 1 (right panel). The gradient norm $\|\nabla\mathcal{L}(\bm{\mu}(0))\|$ at $\bm{\mu}(0)$ decreases exponentially in the dimension $d$ , verifying the existence of bad initialization regions.

Figure 1: Left: Sublinear convergence of the likelihood loss $\mathcal{L}$ . Middle: Sublinear convergence of the parametric distance $\sum_{i\in[n]}\pi_{i}\|\mu_{i}-\mu^{*}\|^{2}$ between student GMM and the ground truth. Right: Gradient norm $\|\nabla\mathcal{L}(\bm{\mu}(0))\|$ in the counter-example in Theorem 7 decreases exponentially fast w.r.t. dimension $d$ .

6 Conclusion

This paper gives the first global convergence result for gradient EM on over-parameterized Gaussian mixture models when the ground truth is a single Gaussian. The rate is sublinear, which is exponentially slower than the rate in the exact-parameterization case. One fundamental open problem is to determine when global convergence of EM or gradient EM can be obtained for Gaussian mixture models whose ground truth has multiple components. The likelihood-based convergence framework proposed in this paper might be a helpful tool for solving this general problem.

Acknowledgements

This work was supported in part by the following grants: NSF TRIPODS II-DMS 20231660, NSF CCF 2212261, NSF CCF 2007036, NSF AF 2312775, NSF IIS 2110170, NSF DMS 2134106, NSF IIS 2143493, and NSF IIS 2229881.

References

Balakrishnan et al. [2014] Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guarantees for the em algorithm: From population to sample-based analysis, 2014.
Daskalakis et al. [2017] Constantinos Daskalakis, Christos Tzamos, and Manolis Zampetakis. Ten steps of em suffice for mixtures of two gaussians. In Satyen Kale and Ohad Shamir, editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 704–710. PMLR, 07–10 Jul 2017. URL https://proceedings.mlr.press/v65/daskalakis17b.html.
Xu et al. [2016] Ji Xu, Daniel J. Hsu, and Arian Maleki. Global analysis of expectation maximization for mixtures of two gaussians. In Neural Information Processing Systems, 2016. URL https://api.semanticscholar.org/CorpusID:6310792.
Dwivedi et al. [2018a] Raaz Dwivedi, Nhat Ho, Koulik Khamaru, Martin J. Wainwright, and Michael I. Jordan. Theoretical guarantees for em under misspecified gaussian mixture models. In Neural Information Processing Systems, 2018a. URL https://api.semanticscholar.org/CorpusID:54062377.
Kwon and Caramanis [2020] Jeongyeol Kwon and Constantine Caramanis. The em algorithm gives sample-optimality for learning mixtures of well-separated gaussians. In Conference on Learning Theory, pages 2425–2487. PMLR, 2020.
Dwivedi et al. [2019] Raaz Dwivedi, Nhat Ho, Koulik Khamaru, Martin J. Wainwright, Michael I. Jordan, and Bin Yu. Sharp analysis of expectation-maximization for weakly identifiable models. In International Conference on Artificial Intelligence and Statistics, 2019. URL https://api.semanticscholar.org/CorpusID:216036378.
Jin et al. [2016] Chi Jin, Yuchen Zhang, Sivaraman Balakrishnan, Martin J. Wainwright, and Michael I. Jordan. Local maxima in the likelihood of gaussian mixture models: Structural results and algorithmic consequences. In Neural Information Processing Systems, 2016. URL https://api.semanticscholar.org/CorpusID:3200184.
Dwivedi et al. [2018b] Raaz Dwivedi, Nhat Ho, Koulik Khamaru, Michael I. Jordan, Martin J. Wainwright, and Bin Yu. Singularity, misspecification and the convergence rate of em. The Annals of Statistics, 2018b. URL https://api.semanticscholar.org/CorpusID:88517736.
[9] Bowei Yan, Mingzhang Yin, and Purnamrita Sarkar. Convergence analysis of gradient EM for multi-component gaussian mixture. URL http://arxiv.longhoe.net/abs/1705.08530.
Xu and Jordan [1996] Lei Xu and Michael I Jordan. On convergence properties of the em algorithm for gaussian mixtures. Neural computation, 8(1):129–151, 1996.
Ma et al. [2000] Jinwen Ma, Lei Xu, and Michael I Jordan. Asymptotic convergence rate of the em algorithm for gaussian mixtures. Neural Computation, 12(12):2881–2907, 2000.
Klusowski and Brinda [2016] Jason M. Klusowski and W. D. Brinda. Statistical guarantees for estimating the centers of a two-component gaussian mixture by em. arXiv: Machine Learning, 2016. URL https://api.semanticscholar.org/CorpusID:88514434.
Wu and Zhou [2019] Yihong Wu and Harrison H. Zhou. Randomly initialized em algorithm for two-component gaussian mixture achieves near optimality in $O(\sqrt{n})$ iterations, 2019.
Weinberger and Bresler [2021] Nir Weinberger and Guy Bresler. The em algorithm is adaptively-optimal for unbalanced symmetric gaussian mixtures. J. Mach. Learn. Res., 23:103:1–103:79, 2021. URL https://api.semanticscholar.org/CorpusID:232404093.
Kwon et al. [2019] Jeongyeol Kwon, Wei Qian, Constantine Caramanis, Yudong Chen, and Damek Davis. Global convergence of the em algorithm for mixtures of two component linear regression. In Conference on Learning Theory, pages 2055–2110. PMLR, 2019.
Dasgupta and Schulman [2000] Sanjoy Dasgupta and Leonard J. Schulman. A two-round variant of em for gaussian mixtures. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, UAI ’00, page 152–159, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc. ISBN 1558607099.
Chen et al. [2023] Yudong Chen, Dogyoon Song, Xumei Xi, and Yuqian Zhang. Local minima structures in gaussian mixture models, 2023.
[18] Ruofei Zhao, Yuanzhi Li, and Yuekai Sun. Statistical convergence of the EM algorithm on gaussian mixture models. URL http://arxiv.longhoe.net/abs/1810.04090.
[19] Nimrod Segol and Boaz Nadler. Improved convergence guarantees for learning gaussian mixture models by EM and gradient EM. URL http://arxiv.longhoe.net/abs/2101.00575.
Xu and Du [2023] Weihang Xu and Simon Du. Over-parameterization exponentially slows down gradient descent for learning a single neuron. In The Thirty Sixth Annual Conference on Learning Theory, pages 1155–1198. PMLR, 2023.
Richert et al. [2022] Frederieke Richert, Roman Worschech, and Bernd Rosenow. Soft mode in the dynamics of over-realizable online learning for soft committee machines. Physical Review E, 105(5):L052302, 2022.
Xiong et al. [2023] Nuoya Xiong, Lijun Ding, and Simon S Du. How over-parameterization slows down gradient descent in matrix sensing: The curses of symmetry and initialization. arXiv preprint arXiv:2310.01769, 2023.
Zhang et al. [2021] Jialun Zhang, Salar Fattahi, and Richard Y Zhang. Preconditioned gradient descent for over-parameterized nonconvex matrix factorization. Advances in Neural Information Processing Systems, 34:5985–5996, 2021.
Zhuo et al. [2021] Jiacheng Zhuo, Jeongyeol Kwon, Nhat Ho, and Constantine Caramanis. On the computational and statistical complexity of over-parameterized matrix sensing. arXiv preprint arXiv:2102.02756, 2021.
Stein [1981] Charles M. Stein. Estimation of the Mean of a Multivariate Normal Distribution. The Annals of Statistics, 9(6):1135 – 1151, 1981. doi: 10.1214/aos/1176345632. URL https://doi.org/10.1214/aos/1176345632.
Nesterov et al. [2018] Yurii Nesterov et al. Lectures on convex optimization, volume 137. Springer, 2018.

Appendix A Missing Proofs and Auxiliary lemmas

Proof of Fact 1.

It is well known that (see Section 1 of Wu and Zhou [2019])

Q(\bm{\mu}^{\prime}|\bm{\mu})=\mathbf{E}_{x\sim p_{\bm{\mu}^{*}}}\left[\log(p_% {\bm{\mu}^{\prime}}(x))-D_{\text{KL}}(p_{\bm{\mu}}(\cdot|x)||p_{\bm{\mu}^{% \prime}}(\cdot|x))-H(p_{\bm{\mu}}(\cdot|x))\right],

where $p_{\bm{\mu}}(\cdot|x)$ denotes the distribution of hidden variable $y$ (in our case of GMM the index of Gaussian component) conditioned on $x$ , and $H$ denotes information entropy.

Since $\bm{\mu}^{\prime}=\bm{\mu}$ is a global minimum of $D_{\text{KL}}(p_{\bm{\mu}}(\cdot|x)||p_{\bm{\mu}^{\prime}}(\cdot|x))$ as a function of $\bm{\mu}^{\prime}$ , its gradient with respect to $\bm{\mu}^{\prime}$ vanishes at $\bm{\mu}^{\prime}=\bm{\mu}$ . Also $\nabla_{\bm{\mu}^{\prime}}H(p_{\bm{\mu}}(\cdot|x))=0$ since $H(p_{\bm{\mu}}(\cdot|x))$ does not depend on $\bm{\mu}^{\prime}$ . Therefore

\nabla Q(\bm{\mu}|\bm{\mu})=\mathbf{E}_{x\sim p_{\bm{\mu}^{*}}}\left[\nabla% \log(p_{\bm{\mu}}(x))\right]=\nabla\mathcal{L}(\bm{\mu}).

∎

The proof of Lemma 9 uses ideas from Theorem 1 of Chen et al. [2023] and relies on Stein’s identity, which is given by the following lemma.

Lemma 17 (Stein [1981]).

For $x\sim\mathcal{N}(\mu,\sigma^{2}I_{d})$ and differentiable function $g:\mathbf{R}^{d}\to\mathbf{R}$ we have

\mathbf{E}[g(x)(x-\mu)]=\sigma^{2}\mathbf{E}[\nabla_{x}g(x)],

if the two expectations in the above identity exist.
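As a quick numerical illustration of Stein's identity, the sketch below compares Monte Carlo estimates of both sides for the test function $g(x)=\sin(\langle a,x\rangle)$, whose gradient is $a\cos(\langle a,x\rangle)$. The test function, the vector `a`, the sample size, and the name `stein_check` are choices made for this example only.

```python
import numpy as np


def stein_check(d=3, sigma=0.5, n_samples=200_000, seed=0):
    """Estimate both sides of E[g(x)(x - mu)] = sigma^2 E[grad g(x)]."""
    rng = np.random.default_rng(seed)
    mu = rng.standard_normal(d)
    a = rng.standard_normal(d)  # direction defining g(x) = sin(<a, x>)
    x = mu + sigma * rng.standard_normal((n_samples, d))
    s = x @ a
    lhs = (np.sin(s)[:, None] * (x - mu)).mean(axis=0)
    rhs = sigma ** 2 * (np.cos(s)[:, None] * a[None, :]).mean(axis=0)
    return lhs, rhs


if __name__ == "__main__":
    lhs, rhs = stein_check()
    print(lhs, rhs)
```

The two estimates should agree up to Monte Carlo error, which shrinks at the usual $1/\sqrt{N}$ rate in the sample size $N$.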

Now we are ready to prove Lemma 9.

Lemma 9.

For any $\text{GMM}(\bm{\mu}),i\in[n]$ , the gradient of $Q$ satisfies

\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu})=\nabla_{\mu_{i}}Q(\bm{\mu}|\bm{\mu})=% \mathbf{E}_{x}\left[\psi_{i}(x)\sum_{k\in[n]}\psi_{k}(x)\mu_{k}\right].

Proof.

Applying Stein’s identity (Lemma 17), for each $i\in[n]$ we have

\begin{split}\nabla_{\mu_{i}}Q(\bm{\mu}|\bm{\mu})&=\mathbf{E}_{x\sim\mathcal{N}(0,I_{d})}\left[\psi_{i}(x)(\mu_{i}-x)\right]\\ &=\mathbf{E}_{x\sim\mathcal{N}(0,I_{d})}\left[\psi_{i}(x)\right]\mu_{i}-\mathbf{E}_{x\sim\mathcal{N}(0,I_{d})}\left[\psi_{i}(x)x\right]\\ &=\mathbf{E}_{x\sim\mathcal{N}(0,I_{d})}\left[\psi_{i}(x)\right]\mu_{i}-\mathbf{E}_{x\sim\mathcal{N}(0,I_{d})}[\nabla_{x}\psi_{i}(x)].\end{split}

Recall that

\psi_{i}(x)=\Pr[i|x]=\frac{\pi_{i}\exp\left(-\frac{\|x-\mu_{i}\|^{2}}{2}\right% )}{\sum_{k\in[n]}\pi_{k}\exp\left(-\frac{\|x-\mu_{k}\|^{2}}{2}\right)}.

The gradient $\nabla_{x}\psi_{i}(x)$ could be calculated as

\begin{split}&\nabla_{x}\psi_{i}(x)\\ ={}&\frac{1}{\left(\sum_{k\in[n]}\pi_{k}\exp\left(-\frac{\|x-\mu_{k}\|^{2}}{2}% \right)\right)^{2}}\Bigg{[}\left(\sum_{k\in[n]}\pi_{k}\exp\left(-\frac{\|x-\mu% _{k}\|^{2}}{2}\right)\right)\pi_{i}\exp\left(-\frac{\|x-\mu_{i}\|^{2}}{2}% \right)(\mu_{i}-x)\\ &-\pi_{i}\exp\left(-\frac{\|x-\mu_{i}\|^{2}}{2}\right)\left(\sum_{k\in[n]}\pi_% {k}\exp\left(-\frac{\|x-\mu_{k}\|^{2}}{2}\right)(\mu_{k}-x)\right)\Bigg{]}\\ ={}&\psi_{i}(x)(\mu_{i}-x)-\psi_{i}(x)\sum_{k\in[n]}\psi_{k}(x)(\mu_{k}-x)\\ ={}&\psi_{i}(x)(\mu_{i}-x)+\psi_{i}(x)x-\sum_{k\in[n]}\psi_{i}(x)\psi_{k}(x)% \mu_{k}\\ ={}&\psi_{i}(x)\left(\mu_{i}-\sum_{k\in[n]}\psi_{k}(x)\mu_{k}\right),\end{split}

(8)

where we used $\sum_{k\in[n]}\psi_{k}(x)=1$ .

Then we have

\begin{split}\nabla_{\mu_{i}}Q(\bm{\mu}|\bm{\mu})&=\mathbf{E}_{x}\left[\psi_{i}(x)\right]\mu_{i}-\mathbf{E}_{x}[\nabla_{x}\psi_{i}(x)]\\ &=\mathbf{E}_{x}\left[\psi_{i}(x)\right]\mu_{i}-\mathbf{E}_{x}\left[\psi_{i}(x)\left(\mu_{i}-\sum_{k\in[n]}\psi_{k}(x)\mu_{k}\right)\right]=\mathbf{E}_{x}\left[\psi_{i}(x)\sum_{k\in[n]}\psi_{k}(x)\mu_{k}\right].\end{split}

∎
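The identity of Lemma 9 can be checked numerically. The sketch below estimates the gradient in two ways on the same samples $x\sim\mathcal{N}(0,I_{d})$: the direct form $\mathbf{E}_{x}[\psi_{i}(x)(\mu_{i}-x)]$ and the form $\mathbf{E}_{x}[\psi_{i}(x)\sum_{k}\psi_{k}(x)\mu_{k}]$ obtained via Stein's identity. The helper names and the sample size are assumptions made for illustration.

```python
import numpy as np


def memberships(x, mu, pi):
    """Posterior psi[s, i] = Pr[i | x_s] for unit-covariance components."""
    sq_dist = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    logits = np.log(pi)[None, :] - 0.5 * sq_dist
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)


def two_gradient_estimates(mu, pi, n_samples=200_000, seed=0):
    """Return (direct, stein), each of shape (n, d), estimated on shared samples."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, mu.shape[1]))
    psi = memberships(x, mu, pi)
    # Direct form: E[psi_i(x)] mu_i - E[psi_i(x) x]
    direct = psi.mean(axis=0)[:, None] * mu - psi.T @ x / n_samples
    # Lemma 9 form: E[psi_i(x) sum_k psi_k(x) mu_k]
    stein = psi.T @ (psi @ mu) / n_samples
    return direct, stein


if __name__ == "__main__":
    mu = np.array([[1.0, 0.0], [-0.5, 0.8], [0.3, -1.0]])
    pi = np.array([0.5, 0.3, 0.2])
    direct, stein = two_gradient_estimates(mu, pi)
    print(np.abs(direct - stein).max())
```

The second form involves only bounded quantities ($\psi_{i}\in[0,1]$ and convex combinations of the $\mu_{k}$), which is one practical reason to prefer it in simulations.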

Proof of Corollary 10.

\begin{split}&\left\langle\nabla_{\bm{\mu}}Q(\bm{\mu}|\bm{\mu}),\bm{\mu}\right% \rangle=\sum_{i\in[n]}\left\langle\nabla_{\mu_{i}}Q(\bm{\mu}|\bm{\mu}),\mu_{i}% \right\rangle=\sum_{i\in[n]}\left\langle\mathbf{E}_{x}\left[\psi_{i}(x)\sum_{k% \in[n]}\psi_{k}(x)\mu_{k}\right],\mu_{i}\right\rangle\\ &=\sum_{i\in[n]}\sum_{k\in[n]}\mathbf{E}_{x}\left\langle\psi_{i}(x)\psi_{k}(x)% \mu_{k},\mu_{i}\right\rangle=\mathbf{E}_{x}\left[\left\|\sum_{i\in[n]}\psi_{i}% (x)\mu_{i}\right\|^{2}\right]=\mathbf{E}_{x}\left[\left\|\tilde{\bm{\psi}}_{% \bm{\mu}}(x)\right\|^{2}\right].\end{split}

∎

Lemma 18.

For any constant $c$ satisfying $0<c\leq\frac{1}{3d}$ , we have

\mathbf{E}_{x\sim\mathcal{N}(0,I_{d})}\left[\exp\left(c\|x\|\right)\right]\leq 1% +5\sqrt{d}c.

Proof.

Note that $\mathbf{E}_{x\sim\mathcal{N}(0,I_{d})}\left[\exp\left(c\|x\|\right)\right]=\mathcal{M}_{\|x\|}(c)$ is the moment-generating function of $\|x\|$ . To upper bound the value of a moment-generating function at $c$ , we use Lagrange’s Mean Value Theorem:

\mathcal{M}_{\|x\|}(c)=\mathcal{M}_{\|x\|}(0)+\mathcal{M}_{\|x\|}^{\prime}(\xi% )c,

(9)

where $\xi\in[0,c]$ . Note that $\mathcal{M}_{\|x\|}(0)=1$ , so the remaining task is to bound $\mathcal{M}_{\|x\|}^{\prime}(\xi)$ . We bound this expectation by truncation:

\begin{split}\mathcal{M}_{\|x\|}^{\prime}(\xi)&=\mathbf{E}_{x}\left[\|x\|\exp(\xi\|x\|)\right]\leq\mathbf{E}_{x}\left[\|x\|\exp(c\|x\|)\right]\\ &=\int_{x\in\mathbf{R}^{d}}\|x\|\exp(c\|x\|)(2\pi)^{-d/2}\exp\left(-\frac{\|x\|^{2}}{2}\right)\mathrm{d}x\\ &=\int_{\|x\|\leq 1}\|x\|\exp(c\|x\|)(2\pi)^{-d/2}\exp\left(-\frac{\|x\|^{2}}{2}\right)\mathrm{d}x\\ &\quad+\int_{\|x\|\geq 1}\|x\|\exp(c\|x\|)(2\pi)^{-d/2}\exp\left(-\frac{\|x\|^{2}}{2}\right)\mathrm{d}x\\ &\leq\exp(c)(2\pi)^{-d/2}V_{d}+\int_{\|x\|\geq 1}\|x\|(2\pi)^{-d/2}\exp\left(c\|x\|-\frac{\|x\|^{2}}{2}\right)\mathrm{d}x,\end{split}

(10)

where $V_{d}=\frac{\pi^{d/2}}{\Gamma(d/2+1)}$ is the volume of the $d$ -dimensional unit ball.

Since $\|x\|\geq 1\Rightarrow c\|x\|-\frac{\|x\|^{2}}{2}\leq\frac{1}{3d}\|x\|-\frac{\|x\|^{2}}{2}\leq-\frac{\|(1-1/(2d))x\|^{2}}{2}$ , we have

\begin{split}&\quad\int_{\|x\|\geq 1}\|x\|(2\pi)^{-d/2}\exp\left(c\|x\|-\frac{% \|x\|^{2}}{2}\right)\mathrm{d}x\\ &\leq\int_{\|x\|\geq 1}\|x\|(2\pi)^{-d/2}\exp\left(-\frac{\|\frac{2d-1}{2d}x\|% ^{2}}{2}\right)\mathrm{d}x\\ &=\int_{\|y\|\geq\frac{2d-1}{2d}}\frac{2d}{2d-1}\|y\|(2\pi)^{-d/2}\exp\left(-% \frac{\|y\|^{2}}{2}\right)\left(\frac{2d}{2d-1}\right)^{d}\mathrm{d}y\\ &\leq\left(\frac{2d}{2d-1}\right)^{d+1}\mathbf{E}_{y\sim\mathcal{N}(0,I_{d})}% \left[\|y\|\right]\\ &=\left(\frac{2d}{2d-1}\right)^{d+1}\frac{\sqrt{2}\Gamma\left(\frac{d+1}{2}% \right)}{\Gamma\left(\frac{d}{2}\right)}\\ &\leq 4\sqrt{d},\end{split}

where we used $\left(\frac{2d}{2d-1}\right)^{d+1}\leq 4$ and the log convexity of Gamma function at the last line. Plugging this back to (10), we get

\begin{split}\mathcal{M}_{\|x\|}^{\prime}(\xi)&\leq\exp(c)(2\pi)^{-d/2}V_{d}+% \int_{\|x\|\geq 1}\|x\|(2\pi)^{-d/2}\exp\left(c\|x\|-\frac{\|x\|^{2}}{2}\right% )\mathrm{d}x\\ &\leq\exp(1/(3d))(2\pi)^{-d/2}+4\sqrt{d}\\ &\leq 5\sqrt{d}.\end{split}

(11)

Plugging (11) into (9), we obtain the final bound

\mathbf{E}_{x\sim\mathcal{N}(0,I_{d})}\left[\exp\left(c\|x\|\right)\right]=\mathcal{M}_{\|x\|}(c)=\mathcal{M}_{\|x\|}(0)+\mathcal{M}_{\|x\|}^{\prime}(\xi)c\leq 1+5\sqrt{d}c.

∎
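The bound of Lemma 18 is easy to probe numerically. The following sketch estimates $\mathbf{E}_{x\sim\mathcal{N}(0,I_{d})}[\exp(c\|x\|)]$ by Monte Carlo at the largest admissible value $c=\frac{1}{3d}$ and prints it next to $1+5\sqrt{d}c$. The function name `mgf_norm`, the dimensions tried, and the sample size are illustrative assumptions.

```python
import numpy as np


def mgf_norm(c, d, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[exp(c * ||x||)] for x ~ N(0, I_d)."""
    rng = np.random.default_rng(seed)
    r = np.linalg.norm(rng.standard_normal((n_samples, d)), axis=1)
    return float(np.exp(c * r).mean())


if __name__ == "__main__":
    for d in (1, 3, 10):
        c = 1.0 / (3 * d)
        print(d, mgf_norm(c, d), 1 + 5 * np.sqrt(d) * c)
```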

Lemma 19.

For any fixed $x\in\mathbf{R}^{d},x\neq 0$ and any $\bm{\mu}$ we have

\int_{t=-1}^{1}\psi_{i}(tx|\bm{\mu})\psi_{j}(tx|\bm{\mu})\mathrm{d}t\geq\frac{% 1}{2\mu_{\max}\|x\|}\pi_{i}\pi_{j}\exp\left(-4\mu_{\max}^{2}\right)\left(1-% \exp\left(-4\mu_{\max}\|x\|\right)\right).

Proof.

\begin{split}\psi_{i}(tx)&=\frac{\pi_{i}\exp\left(-\frac{\|tx-\mu_{i}\|^{2}}{2}\right)}{\sum_{k\in[n]}\pi_{k}\exp\left(-\frac{\|tx-\mu_{k}\|^{2}}{2}\right)}\\ &=\frac{\pi_{i}}{\sum_{k\in[n]}\pi_{k}\exp\left(\frac{1}{2}(\|tx-\mu_{i}\|^{2}-\|tx-\mu_{k}\|^{2})\right)}\\ &=\frac{\pi_{i}}{\sum_{k\in[n]}\pi_{k}\exp\left(\frac{1}{2}\left\langle 2tx-\mu_{i}-\mu_{k},\mu_{k}-\mu_{i}\right\rangle\right)}\\ &\geq\frac{\pi_{i}}{\sum_{k\in[n]}\pi_{k}\exp\left(\frac{1}{2}(2\|tx\|+2\mu_{\max})\cdot 2\mu_{\max}\right)}\\ &=\pi_{i}\exp\left(-2\mu_{\max}(\|tx\|+\mu_{\max})\right).\end{split}

(12)

Therefore

\begin{split}\int_{t=-1}^{1}\psi_{i}(tx)\psi_{j}(tx)\mathrm{d}t&\geq\int_{t=-1% }^{1}\pi_{i}\pi_{j}\exp\left(-4\mu_{\max}(\|tx\|+\mu_{\max})\right)\mathrm{d}t% \\ &=\pi_{i}\pi_{j}\exp\left(-4\mu_{\max}^{2}\right)\cdot 2\int_{t=0}^{1}\exp% \left(-4\mu_{\max}\|x\|t\right)\mathrm{d}t\\ &=\frac{1}{2\mu_{\max}\|x\|}\pi_{i}\pi_{j}\exp\left(-4\mu_{\max}^{2}\right)% \left(1-\exp\left(-4\mu_{\max}\|x\|\right)\right).\end{split}

(13)

∎
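To see how loose or tight Lemma 19 is on a concrete instance, one can evaluate both sides numerically. The sketch below approximates the integral on the left-hand side with a hand-written trapezoid rule on a uniform grid over $[-1,1]$ and evaluates the closed-form right-hand side. The grid size, the helper names, and the example parameters are assumptions for illustration.

```python
import numpy as np


def memberships(x, mu, pi):
    """Posterior psi[s, i] = Pr[i | x_s] for unit-covariance components."""
    sq_dist = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    logits = np.log(pi)[None, :] - 0.5 * sq_dist
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)


def lemma19_sides(x, mu, pi, i, j, grid=2001):
    """Return (lhs, rhs) of Lemma 19 for a fixed nonzero x."""
    t = np.linspace(-1.0, 1.0, grid)
    psi = memberships(t[:, None] * x[None, :], mu, pi)  # psi along the segment t*x
    integrand = psi[:, i] * psi[:, j]
    dt = t[1] - t[0]
    # Trapezoid rule: full weight inside, half weight at the two endpoints.
    lhs = float(dt * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1])))
    mu_max = np.linalg.norm(mu, axis=1).max()
    r = np.linalg.norm(x)
    rhs = float(pi[i] * pi[j] * np.exp(-4 * mu_max ** 2)
                * (1 - np.exp(-4 * mu_max * r)) / (2 * mu_max * r))
    return lhs, rhs


if __name__ == "__main__":
    mu = np.array([[1.0, 0.0], [-0.5, 0.8], [0.3, -1.0]])
    pi = np.array([0.5, 0.3, 0.2])
    print(lemma19_sides(np.array([0.7, -1.2]), mu, pi, 0, 1))
```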

Appendix B Proofs for Sections 3 and 4

B.1 Proofs for global convergence analysis

Theorem 13.

At any two points $\bm{\mu}=(\mu_{1}^{\top},\ldots,\mu_{n}^{\top})^{\top}$ and $\bm{\mu}+\bm{\delta}=((\mu_{1}+\delta_{1})^{\top},\ldots,(\mu_{n}+\delta_{n})^{\top})^{\top}$ , if

\|\delta_{i}\|\leq\frac{1}{\max\left\{6d,2\|\mu_{i}\|\right\}},\forall i\in[n],

then the loss function $\mathcal{L}$ satisfies the following smoothness property: for any $i\in[n]$ we have

\left\|\nabla_{\mu_{i}+\delta_{i}}\mathcal{L}(\bm{\mu}+\bm{\delta})-\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu})\right\|\leq n\mu_{\max}(30\sqrt{d}+4\mu_{\max})\|\delta_{i}\|+\sum_{k\in[n]}\|\delta_{k}\|.

(14)

Proof.

Note that

\begin{split}&\quad\exp\left(-\|\delta_{i}\|(\|x\|+\|\mu_{i}\|)\right)\exp% \left(-\frac{\|\delta_{i}\|^{2}}{2}\right)\leq\frac{\exp\left(-\frac{\|x-(\mu_% {i}+\delta_{i})\|^{2}}{2}\right)}{\exp\left(-\frac{\|x-\mu_{i}\|^{2}}{2}\right% )}=\exp\left(\left\langle x-\mu_{i},\delta_{i}\right\rangle-\frac{\|\delta_{i}% \|^{2}}{2}\right)\\ &\leq\exp\left(\|\delta_{i}\|(\|x\|+\|\mu_{i}\|)\right)\exp\left(-\frac{\|% \delta_{i}\|^{2}}{2}\right).\end{split}

Therefore $\psi_{i}(x|\bm{\mu}+\bm{\delta})$ can be bounded as

\begin{split}&\quad\psi_{i}(x|\bm{\mu}+\bm{\delta})=\frac{\pi_{i}\exp\left(-% \frac{\|x-(\mu_{i}+\delta_{i})\|^{2}}{2}\right)}{\sum_{k\in[n]}\pi_{k}\exp% \left(-\frac{\|x-(\mu_{k}+\delta_{k})\|^{2}}{2}\right)}\\ &\leq\frac{\pi_{i}\exp\left(-\frac{\|x-\mu_{i}\|^{2}}{2}\right)\exp\left(\|% \delta_{i}\|(\|x\|+\|\mu_{i}\|)\right)\exp\left(-\frac{\|\delta_{i}\|^{2}}{2}% \right)}{\sum_{k\in[n]}\pi_{k}\exp\left(-\frac{\|x-\mu_{k}\|^{2}}{2}\right)% \exp\left(-\|\delta_{i}\|(\|x\|+\|\mu_{i}\|)\right)\exp\left(-\frac{\|\delta_{% i}\|^{2}}{2}\right)}\leq\exp\left(2\|\delta_{i}\|(\|x\|+\|\mu_{i}\|)\right)% \psi_{i}(x|\bm{\mu}).\end{split}

(15)

Similarly, we have

\begin{split}&\quad\psi_{i}(x|\bm{\mu}+\bm{\delta})=\frac{\pi_{i}\exp\left(-% \frac{\|x-(\mu_{i}+\delta_{i})\|^{2}}{2}\right)}{\sum_{k\in[n]}\pi_{k}\exp% \left(-\frac{\|x-(\mu_{k}+\delta_{k})\|^{2}}{2}\right)}\\ &\geq\frac{\pi_{i}\exp\left(-\frac{\|x-\mu_{i}\|^{2}}{2}\right)\exp\left(-\|% \delta_{i}\|(\|x\|+\|\mu_{i}\|)\right)\exp\left(-\frac{\|\delta_{i}\|^{2}}{2}% \right)}{\sum_{k\in[n]}\pi_{k}\exp\left(-\frac{\|x-\mu_{k}\|^{2}}{2}\right)% \exp\left(\|\delta_{i}\|(\|x\|+\|\mu_{i}\|)\right)\exp\left(-\frac{\|\delta_{i% }\|^{2}}{2}\right)}\geq\exp\left(-2\|\delta_{i}\|(\|x\|+\|\mu_{i}\|)\right)% \psi_{i}(x|\bm{\mu}).\end{split}

(16)

Recall that by Lemma 9 we have $\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu})=\mathbf{E}_{x}\left[\psi_{i}(x|\bm{\mu})% \sum_{k\in[n]}\psi_{k}(x|\bm{\mu})\mu_{k}\right],$ so

\begin{split}&\quad\left\|\nabla_{\mu_{i}+\delta_{i}}\mathcal{L}(\bm{\mu}+\bm{% \delta})-\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu})\right\|\\ &=\left\|\mathbf{E}_{x}\left[\psi_{i}(x|\bm{\mu}+\bm{\delta})\sum_{k\in[n]}% \psi_{k}(x|\bm{\mu}+\bm{\delta})(\mu_{k}+\delta_{k})\right]-\mathbf{E}_{x}% \left[\psi_{i}(x|\bm{\mu})\sum_{k\in[n]}\psi_{k}(x|\bm{\mu})\mu_{k}\right]% \right\|\\ &=\Bigg{\|}\mathbf{E}_{x}\left[\sum_{k\in[n]}\psi_{i}(x|\bm{\mu}+\bm{\delta})% \psi_{k}(x|\bm{\mu}+\bm{\delta})\delta_{k}\right]\\ &\quad+\mathbf{E}_{x}\left[\sum_{k\in[n]}(\psi_{i}(x|\bm{\mu}+\bm{\delta})\psi% _{k}(x|\bm{\mu}+\bm{\delta})-\psi_{i}(x|\bm{\mu})\psi_{k}(x|\bm{\mu}))\mu_{k}% \right]\Bigg{\|}\\ &\leq\mathbf{E}_{x}\left[\sum_{k\in[n]}\psi_{i}(x|\bm{\mu}+\bm{\delta})\psi_{k% }(x|\bm{\mu}+\bm{\delta})\|\delta_{k}\|\right]\\ &\quad+\mathbf{E}_{x}\left[\sum_{k\in[n]}|\psi_{i}(x|\bm{\mu}+\bm{\delta})\psi% _{k}(x|\bm{\mu}+\bm{\delta})-\psi_{i}(x|\bm{\mu})\psi_{k}(x|\bm{\mu})|\cdot\|% \mu_{k}\|\right]\\ &\leq\sum_{k\in[n]}\|\delta_{k}\|+\sum_{k\in[n]}\mathbf{E}_{x}\left[|\psi_{i}(% x|\bm{\mu}+\bm{\delta})\psi_{k}(x|\bm{\mu}+\bm{\delta})-\psi_{i}(x|\bm{\mu})% \psi_{k}(x|\bm{\mu})|\right]\|\mu_{k}\|\\ &\leq\sum_{k\in[n]}\|\delta_{k}\|+\sum_{k\in[n]}\mathbf{E}_{x}\left[\exp\left(% 2\|\delta_{i}\|(\|x\|+\|\mu_{i}\|)\right)-1\right]\|\mu_{k}\|,\end{split}

(17)

where the last inequality is because $\psi_{i},\psi_{k}\leq 1$ and applying (15) and (16).

The remaining task is to bound $\mathbf{E}_{x}\left[\exp\left(2\|\delta_{i}\|(\|x\|+\|\mu_{i}\|)\right)-1\right]$ . Since $2\|\delta_{i}\|\leq\frac{1}{3d}$ , we can use Lemma 18 to bound it as

\begin{split}&\quad\mathbf{E}_{x}\left[\exp\left(2\|\delta_{i}\|(\|x\|+\|\mu_{i}\|)\right)-1\right]=\exp(2\|\delta_{i}\|\|\mu_{i}\|)\mathbf{E}_{x}\left[\exp\left(2\|\delta_{i}\|\cdot\|x\|\right)\right]-1\\ &\leq\exp(2\|\delta_{i}\|\|\mu_{i}\|)(1+10\sqrt{d}\|\delta_{i}\|)-1=\exp(2\|\delta_{i}\|\|\mu_{i}\|)-1+10\sqrt{d}\|\delta_{i}\|\exp(2\|\delta_{i}\|\|\mu_{i}\|)\\ &\leq 4\|\delta_{i}\|\|\mu_{i}\|+10\sqrt{d}\|\delta_{i}\|\exp(1)\leq(30\sqrt{d}+4\|\mu_{i}\|)\|\delta_{i}\|,\end{split}

(18)

where we used $\exp(x)\leq 1+2x,\forall x\in[0,1]$ together with $2\|\delta_{i}\|\|\mu_{i}\|\leq 1$ at the last line. Plugging this back into (17), we get

\begin{split}&\quad\left\|\nabla_{\mu_{i}+\delta_{i}}\mathcal{L}(\bm{\mu}+\bm{% \delta})-\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu})\right\|\\ &\leq\sum_{k\in[n]}\|\delta_{k}\|+\sum_{k\in[n]}\mathbf{E}_{x}\left[\exp\left(% 2\|\delta_{i}\|(\|x\|+\|\mu_{i}\|)\right)-1\right]\|\mu_{k}\|\\ &\leq\sum_{k\in[n]}\|\delta_{k}\|+\sum_{k\in[n]}(30\sqrt{d}+4\|\mu_{i}\|)\|% \delta_{i}\|\|\mu_{k}\|\\ &\leq n\mu_{\max}(30\sqrt{d}+4\mu_{\max})\|\delta_{i}\|+\sum_{k\in[n]}\|\delta% _{k}\|.\end{split}

(19)

∎

Theorem 14.

The loss function can be upper bounded as

\mathcal{L}(\bm{\mu})\leq\sum_{i\in[n]}\frac{\pi_{i}}{2}\|\mu_{i}\|^{2}\leq\frac{\mu_{\max}^{2}}{2}.

Proof.

Since the logarithm function is concave, by Jensen’s inequality we have

\begin{split}\mathcal{L}(\bm{\mu})&=D_{\text{KL}}(p_{\bm{\mu}^{*}}||p_{\bm{\mu% }})=-\mathbf{E}_{x}\left[\log\left(\frac{p_{\bm{\mu}}(x)}{p_{\bm{\mu}^{*}}(x)}% \right)\right]\\ &=-\mathbf{E}_{x}\left[\log\left(\frac{\sum_{i}\pi_{i}\exp\left(-\frac{\|x-\mu% _{i}\|^{2}}{2}\right)}{\exp\left(-\frac{\|x\|^{2}}{2}\right)}\right)\right]\\ &\leq-\mathbf{E}_{x}\left[\sum_{i}\pi_{i}\log\left(\frac{\exp\left(-\frac{\|x-% \mu_{i}\|^{2}}{2}\right)}{\exp\left(-\frac{\|x\|^{2}}{2}\right)}\right)\right]% \\ &=-\sum_{i}\pi_{i}\mathbf{E}_{x}\left[\left\langle x,\mu_{i}\right\rangle-% \frac{\|\mu_{i}\|^{2}}{2}\right]\\ &=\sum_{i\in[n]}\frac{\pi_{i}}{2}\|\mu_{i}\|^{2}\leq\frac{\mu_{\max}^{2}}{2}.% \end{split}

∎
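The upper bound of Theorem 14 can be illustrated numerically. The sketch below estimates $\mathcal{L}(\bm{\mu})=D_{\text{KL}}(p_{\bm{\mu}^{*}}||p_{\bm{\mu}})$ by Monte Carlo with $\bm{\mu}^{*}=0$, using the log-ratio $\log\sum_{i}\pi_{i}\exp(\langle x,\mu_{i}\rangle-\|\mu_{i}\|^{2}/2)$ from the proof, and compares it with $\sum_{i}\frac{\pi_{i}}{2}\|\mu_{i}\|^{2}$. The function name, the example means, and the sample size are assumptions.

```python
import numpy as np


def kl_loss_and_bound(mu, pi, n_samples=200_000, seed=0):
    """Monte Carlo estimate of L(mu) with ground truth N(0, I_d), and its upper bound."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, mu.shape[1]))
    # z[s, i] = log pi_i + <x_s, mu_i> - ||mu_i||^2 / 2
    z = x @ mu.T - 0.5 * (mu ** 2).sum(axis=1)[None, :] + np.log(pi)[None, :]
    m = z.max(axis=1, keepdims=True)
    log_ratio = m[:, 0] + np.log(np.exp(z - m).sum(axis=1))  # stable log-sum-exp
    loss = float(-log_ratio.mean())
    bound = float(0.5 * (pi * (mu ** 2).sum(axis=1)).sum())
    return loss, bound


if __name__ == "__main__":
    mu = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])
    pi = np.array([1 / 3, 1 / 3, 1 / 3])
    print(kl_loss_and_bound(mu, pi))
```

The bound is tight when all $\mu_{i}$ coincide, since Jensen's inequality then holds with equality; for well-spread means the estimated loss is typically much smaller than the bound.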

Lemma 12.

For any $\text{GMM}(\bm{\mu})$ we have

\left\langle\nabla_{\bm{\mu}}Q(\bm{\mu}|\bm{\mu}),\bm{\mu}\right\rangle=% \mathbf{E}_{x}[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^{2}]\geq\Omega\left(\frac{% \exp\left(-8\mu_{\max}^{2}\right)\pi_{\min}^{2}}{d(1+\mu_{\max}{\sqrt{d}})^{2}% }\mu_{\max}^{4}\right).

Proof.

Consider two cases:

Case 1. There exists $k\in[n]$ such that $\|\mu_{k}-\mu_{{i_{\max}}}\|\geq\frac{\mu_{\max}}{2}$ . Then by Lemma 20 and Lemma 11 we have

\begin{split}\mathbf{E}_{x}\left[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^{2}\right% ]&\geq\frac{\exp\left(-8\mu_{\max}^{2}\right)}{40000d(1+2\mu_{\max}{\sqrt{d}})% ^{2}}\left(\sum_{i,j\in[n]}\pi_{i}\pi_{j}\|\mu_{i}-\mu_{j}\|^{2}\right)^{2}\\ &\geq\frac{\exp\left(-8\mu_{\max}^{2}\right)}{40000d(1+2\mu_{\max}{\sqrt{d}})^% {2}}\left(\frac{\pi_{\min}}{8}\mu_{\max}^{2}\right)^{2}\\ &=\frac{\exp\left(-8\mu_{\max}^{2}\right)\pi_{\min}^{2}}{2560000d(1+2\mu_{\max% }{\sqrt{d}})^{2}}\mu_{\max}^{4}.\end{split}

Case 2. For all $k\in[n]$ , $\|\mu_{{i_{\max}}}-\mu_{k}\|<\frac{\mu_{\max}}{2}$ . Then by Lemma 21 we have $\mathbf{E}_{x}\left[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^{2}\right]\geq\frac{1}{4}\mu_{\max}^{2}\geq\Omega(\exp(-8\mu_{\max}^{2})\mu_{\max}^{4})\geq\Omega\left(\frac{\exp\left(-8\mu_{\max}^{2}\right)\pi_{\min}^{2}}{d(1+\mu_{\max}{\sqrt{d}})^{2}}\mu_{\max}^{4}\right)$ (since $e^{-x}x\leq 1,\forall x$ ). ∎

Lemma 20.

For any $\text{GMM}(\bm{\mu})$ , if there exists $k\in[n]$ such that $\|\mu_{k}-\mu_{{i_{\max}}}\|\geq\frac{\mu_{\max}}{2}$ , then we have

\sum_{i,j\in[n]}\pi_{i}\pi_{j}\|\mu_{i}-\mu_{j}\|^{2}\geq\frac{\pi_{\min}}{8}% \mu_{\max}^{2}.

Proof.

By Cauchy–Schwarz inequality, we have $\|a\|^{2}+\|b\|^{2}\geq\frac{1}{2}\|a-b\|^{2}$ , so for $\forall i\in[n]$ we have

\begin{split}\sum_{j\in[n]}\pi_{j}\|\mu_{i}-\mu_{j}\|^{2}&\geq\pi_{{i_{\max}}}% \|\mu_{i}-\mu_{{i_{\max}}}\|^{2}+\pi_{k}\|\mu_{i}-\mu_{k}\|^{2}\\ &\geq\frac{\pi_{\min}}{2}\|(\mu_{i}-\mu_{{i_{\max}}})-(\mu_{i}-\mu_{k})\|^{2}=% \frac{\pi_{\min}}{2}\|\mu_{k}-\mu_{{i_{\max}}}\|^{2}.\end{split}

Therefore

\begin{split}\sum_{i,j\in[n]}\pi_{i}\pi_{j}\|\mu_{i}-\mu_{j}\|^{2}=\sum_{i\in[% n]}\pi_{i}\sum_{j\in[n]}\pi_{j}\|\mu_{i}-\mu_{j}\|^{2}\geq\sum_{i\in[n]}\pi_{i% }\frac{\pi_{\min}}{2}\|\mu_{k}-\mu_{{i_{\max}}}\|^{2}\geq\frac{\pi_{\min}}{8}% \mu_{\max}^{2},\end{split}

where the last inequality is because $\|\mu_{k}-\mu_{{i_{\max}}}\|\geq\frac{\mu_{\max}}{2}$ and $\sum_{i}\pi_{i}=1$ . ∎

Lemma 21.

For any $\text{GMM}(\bm{\mu})$ , if for $\forall k\in[n]$ we have $\|\mu_{{i_{\max}}}-\mu_{k}\|<\frac{\mu_{\max}}{2}$ , then

\mathbf{E}_{x}\left[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^{2}\right]\geq\frac{1}% {4}\mu_{\max}^{2}.

Proof.

For any $k\in[n]$ , by Cauchy–Schwarz inequality we have

\begin{split}\langle\mu_{k},\mu_{{i_{\max}}}\rangle&=\langle\mu_{{i_{\max}}}-(% \mu_{{i_{\max}}}-\mu_{k}),\mu_{{i_{\max}}}\rangle=\|\mu_{{i_{\max}}}\|^{2}-% \left\langle\mu_{{i_{\max}}}-\mu_{k},\mu_{{i_{\max}}}\right\rangle\\ &\geq\mu_{\max}^{2}-\|\mu_{{i_{\max}}}-\mu_{k}\|\mu_{\max}>\frac{1}{2}\mu_{% \max}^{2},\end{split}

(20)

where the last inequality is because $\|\mu_{{i_{\max}}}-\mu_{k}\|<\frac{\mu_{\max}}{2}$ .

Note that (20) implies $\langle\mu_{k},\overline{\mu_{{i_{\max}}}}\rangle>\frac{1}{2}\mu_{\max}$ , so for $\forall x\in\mathbf{R}^{d}$ we have

\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|=\left\|\sum_{k\in[n]}\psi_{k}(x)\mu_{k}% \right\|\geq\left\langle\sum_{k\in[n]}\psi_{k}(x)\mu_{k},\overline{\mu_{{i_{% \max}}}}\right\rangle=\sum_{k\in[n]}\psi_{k}(x)\left\langle\mu_{k},\overline{% \mu_{{i_{\max}}}}\right\rangle>\frac{1}{2}\mu_{\max},

(21)

where we used $\sum_{k\in[n]}\psi_{k}(x)=1$ at the last inequality. Squaring (21) and taking the expectation over $x$ gives the claimed bound. ∎

Lemma 11.

For any $\text{GMM}(\bm{\mu})$ we have

\mathbf{E}_{x}\left[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^{2}\right]\geq\frac{% \exp\left(-8\mu_{\max}^{2}\right)}{40000d(1+2\mu_{\max}{\sqrt{d}})^{2}}\left(% \sum_{i,j\in[n]}\pi_{i}\pi_{j}\|\mu_{i}-\mu_{j}\|^{2}\right)^{2}.

Proof.

The key idea is to consider the gradient of $\tilde{\bm{\psi}}_{\bm{\mu}}$ , which can be calculated as

\begin{split}\nabla_{x}\tilde{\bm{\psi}}_{\bm{\mu}}(x)&=\sum_{i}\mu_{i}\left(% \frac{\partial\psi_{i}(x)}{\partial x}\right)^{\top}\\ &=\sum_{i}\psi_{i}(x)\mu_{i}\mu_{i}^{\top}-\sum_{i,j}\psi_{i}(x)\psi_{j}(x)\mu% _{i}\mu_{j}^{\top}\\ &=\sum_{i,j\in[n]}\psi_{i}(x)\psi_{j}(x)\mu_{i}\mu_{i}^{\top}-\sum_{i,j}\psi_{% i}(x)\psi_{j}(x)\mu_{i}\mu_{j}^{\top}\\ &=\sum_{i,j\in[n]}\psi_{i}(x)\psi_{j}(x)\mu_{i}(\mu_{i}-\mu_{j})^{\top}\\ &=\sum_{i,j\in[n]}\psi_{i}(x)\psi_{j}(x)\frac{1}{2}\left(\mu_{i}(\mu_{i}-\mu_{% j})^{\top}+\mu_{j}(\mu_{j}-\mu_{i})^{\top}\right)\\ &=\frac{1}{2}\sum_{i,j\in[n]}\psi_{i}(x)\psi_{j}(x)(\mu_{i}-\mu_{j})(\mu_{i}-% \mu_{j})^{\top},\end{split}

(22)

where we used (8) in the second identity.
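The closed form (22) for the Jacobian of $\tilde{\bm{\psi}}_{\bm{\mu}}$ can be verified against central finite differences. The sketch below does this at a single point; the helper names, the step size `h`, and the example parameters are assumptions made for illustration.

```python
import numpy as np


def memberships(x, mu, pi):
    """Posterior psi[s, i] = Pr[i | x_s] for unit-covariance components."""
    sq_dist = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    logits = np.log(pi)[None, :] - 0.5 * sq_dist
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)


def psi_tilde(x, mu, pi):
    """psi_tilde(x) = sum_k psi_k(x) mu_k for a single point x of shape (d,)."""
    return memberships(x[None, :], mu, pi)[0] @ mu


def jacobian_closed_form(x, mu, pi):
    """(1/2) sum_{i,j} psi_i psi_j (mu_i - mu_j)(mu_i - mu_j)^T, as in (22)."""
    psi = memberships(x[None, :], mu, pi)[0]
    diff = mu[:, None, :] - mu[None, :, :]  # shape (n, n, d)
    w = psi[:, None] * psi[None, :]         # shape (n, n)
    return 0.5 * np.einsum("ij,ija,ijb->ab", w, diff, diff)


def jacobian_finite_diff(x, mu, pi, h=1e-5):
    """Central finite differences; column b is d psi_tilde / d x_b."""
    d = x.size
    jac = np.zeros((d, d))
    for b in range(d):
        e = np.zeros(d)
        e[b] = h
        jac[:, b] = (psi_tilde(x + e, mu, pi) - psi_tilde(x - e, mu, pi)) / (2 * h)
    return jac


if __name__ == "__main__":
    mu = np.array([[1.0, 0.0, 0.5], [-0.5, 0.8, 0.0], [0.3, -1.0, 0.2]])
    pi = np.array([0.5, 0.3, 0.2])
    x = np.array([0.4, -0.7, 1.1])
    print(np.abs(jacobian_closed_form(x, mu, pi) - jacobian_finite_diff(x, mu, pi)).max())
```

The closed form is a weighted sum of outer products with nonnegative weights, so the Jacobian is symmetric positive semidefinite, which is what makes the quadratic form $\overline{x}^{\top}\nabla\tilde{\bm{\psi}}_{\bm{\mu}}(tx)\overline{x}$ nonnegative in the rest of the proof.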

By Cauchy-Schwarz inequality, we have $\|a\|^{2}+\|b\|^{2}\geq\frac{1}{2}\|a-b\|^{2}$ , which implies

\begin{split}\mathbf{E}_{x}\left[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^{2}\right% ]&=\frac{1}{2}\mathbf{E}_{x}\left[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^{2}+\|% \tilde{\bm{\psi}}_{\bm{\mu}}(-x)\|^{2}\right]\\ &\geq\frac{1}{4}\mathbf{E}_{x}\left[\left\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)-% \tilde{\bm{\psi}}_{\bm{\mu}}(-x)\right\|^{2}\right]\\ &\geq\frac{1}{4}\mathbf{E}_{x}\left[\left\langle\tilde{\bm{\psi}}_{\bm{\mu}}(x% )-\tilde{\bm{\psi}}_{\bm{\mu}}(-x),\overline{x}\right\rangle^{2}\right]\\ &=\frac{1}{4}\mathbf{E}_{x}\left[\left(\int_{t=-1}^{1}\frac{\partial}{\partial t% }\langle\tilde{\bm{\psi}}_{\bm{\mu}}(tx),\overline{x}\rangle\mathrm{d}t\right)% ^{2}\right]\\ &=\frac{1}{4}\mathbf{E}_{x}\left[\left(\int_{t=-1}^{1}{x}^{\top}\nabla\tilde{% \bm{\psi}}_{\bm{\mu}}(tx)\overline{x}\mathrm{d}t\right)^{2}\right]\\ &=\frac{1}{4}\mathbf{E}_{x}\left[\left(\int_{t=-1}^{1}\|x\|\cdot\overline{x}^{% \top}\nabla\tilde{\bm{\psi}}_{\bm{\mu}}(tx)\overline{x}\mathrm{d}t\right)^{2}% \right],\end{split}

(23)

where we used $\frac{\partial}{\partial t}\tilde{\bm{\psi}}_{\bm{\mu}}(tx)=\nabla\tilde{\bm{% \psi}}_{\bm{\mu}}(tx)x$ at the second to last identity. Careful readers might notice that the term $\left(\int_{t=-1}^{1}\|x\|\cdot\overline{x}^{\top}\nabla\tilde{\bm{\psi}}_{\bm% {\mu}}(tx)\overline{x}\mathrm{d}t\right)^{2}$ is not well-defined when $x=0$ , but we can still calculate its expectation over the whole probability space since the integration is only singular on a zero-measure set.

For each $x\neq 0$ , by (22) we have

\overline{x}^{\top}\nabla\tilde{\bm{\psi}}_{\bm{\mu}}(tx)\overline{x}=\frac{1}% {2}\sum_{i,j\in[n]}\psi_{i}(tx)\psi_{j}(tx)\langle\mu_{i}-\mu_{j},\overline{x}% \rangle^{2}.

Substituting this identity into (23), we obtain

\begin{split}&\quad\mathbf{E}_{x}\left[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^{2}% \right]\\ &\geq\frac{1}{16}\mathbf{E}_{x}\left[\left(\int_{t=-1}^{1}\|x\|\sum_{i,j\in[n]% }\psi_{i}(tx)\psi_{j}(tx)\langle\mu_{i}-\mu_{j},\overline{x}\rangle^{2}\mathrm% {d}t\right)^{2}\right]\\ &=\frac{1}{16}\mathbf{E}_{x}\left[\left(\|x\|\sum_{i,j\in[n]}\langle\mu_{i}-% \mu_{j},\overline{x}\rangle^{2}\int_{t=-1}^{1}\psi_{i}(tx)\psi_{j}(tx)\mathrm{% d}t\right)^{2}\right]\\ &\geq\frac{1}{16}\mathbf{E}_{x}\left[\left(\|x\|\sum_{i,j\in[n]}\langle\mu_{i}% -\mu_{j},\overline{x}\rangle^{2}\frac{1}{2\mu_{\max}\|x\|}\pi_{i}\pi_{j}\exp% \left(-4\mu_{\max}^{2}\right)\left(1-\exp\left(-4\mu_{\max}\|x\|\right)\right)% \right)^{2}\right]\\ &=\frac{\exp\left(-8\mu_{\max}^{2}\right)}{64}\mathbf{E}_{x}\left[\left(\sum_{% i,j\in[n]}\pi_{i}\pi_{j}\langle\mu_{i}-\mu_{j},\overline{x}\rangle^{2}\frac{1-% \exp\left(-4\mu_{\max}\|x\|\right)}{\mu_{\max}}\right)^{2}\right]\\ &\geq\frac{\exp\left(-8\mu_{\max}^{2}\right)}{64}\left(\sum_{i,j\in[n]}\pi_{i}% \pi_{j}\mathbf{E}_{x}\left[\langle\mu_{i}-\mu_{j},\overline{x}\rangle^{2}\frac% {1-\exp\left(-4\mu_{\max}\|x\|\right)}{\mu_{\max}}\right]\right)^{2}\end{split}

(24)

where we used Lemma 19 at the fourth line and Cauchy-Schwarz inequality at the last line.

The last step is to lower bound $\mathbf{E}_{x}\left[\langle\mu_{i}-\mu_{j},\overline{x}\rangle^{2}\left(1-\exp% \left(-4\mu_{\max}\|x\|\right)\right)/\mu_{\max}\right]$ . Since $x$ is sampled from $\mathcal{N}(0,I_{d})$ , which is spherically symmetric, we know that the two random variables $\{\overline{x},\|x\|\}$ are independent. Therefore

\mathbf{E}_{x}\left[\langle\mu_{i}-\mu_{j},\overline{x}\rangle^{2}\frac{1-\exp% \left(-4\mu_{\max}\|x\|\right)}{\mu_{\max}}\right]=\mathbf{E}_{x}\left[\langle% \mu_{i}-\mu_{j},\overline{x}\rangle^{2}\right]\mathbf{E}_{x}\left[\frac{1-\exp% \left(-4\mu_{\max}\|x\|\right)}{\mu_{\max}}\right].

(25)

For the first term in (25), we have $\mathbf{E}_{x}\left[\langle\mu_{i}-\mu_{j},\overline{x}\rangle^{2}\right]=\|\mu_{i}-\mu_{j}\|^{2}/d$ since the distribution of $\overline{x}$ is spherically symmetric. By the norm-concentration inequality for Gaussian vectors [Dasgupta and Schulman, 2000], we know that $\Pr\left[\|x\|\geq\frac{\sqrt{d}}{2}\right]\geq 1/50$ for all $d$. The second term in (25) can therefore be lower bounded as

\begin{split}\mathbf{E}_{x}\left[\frac{1-\exp\left(-4\mu_{\max}\|x\|\right)}{% \mu_{\max}}\right]\geq\Pr\left[\|x\|\geq\frac{\sqrt{d}}{2}\right]\frac{1-\exp% \left(-4\mu_{\max}\cdot\frac{\sqrt{d}}{2}\right)}{\mu_{\max}}\geq\frac{1-\exp% \left(-2\mu_{\max}{\sqrt{d}}\right)}{50\mu_{\max}}.\end{split}

(26)

Plugging (26) into (25), we get

\mathbf{E}_{x}\left[\langle\mu_{i}-\mu_{j},\overline{x}\rangle^{2}\frac{1-\exp% \left(-4\mu_{\max}\|x\|\right)}{\mu_{\max}}\right]\geq\frac{1-\exp\left(-2\mu_% {\max}{\sqrt{d}}\right)}{50d\mu_{\max}}\|\mu_{i}-\mu_{j}\|^{2}.

(27)
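The two probabilistic facts used for (27) can be sanity-checked numerically. The following is a minimal Monte Carlo sketch, not part of the proof: the function name `direction_statistics`, the test vector, the sample size and the seed are all illustrative assumptions. It compares an empirical estimate of $\mathbf{E}_{x}\left[\langle v,\overline{x}\rangle^{2}\right]$ with the closed form $\|v\|^{2}/d$, and records how often $\|x\|\geq\sqrt{d}/2$ occurs.

```python
import math
import random

def direction_statistics(v, num_samples=40000, seed=0):
    """Monte Carlo estimates for x ~ N(0, I_d), with d = len(v).

    Returns (estimate of E[<v, x_bar>^2], closed form ||v||^2 / d,
    empirical frequency of the event ||x|| >= sqrt(d) / 2).
    """
    rng = random.Random(seed)
    d = len(v)
    second_moment = 0.0
    tail_hits = 0
    for _ in range(num_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in x))
        inner = sum(a * c for a, c in zip(v, x)) / norm  # <v, x / ||x||>
        second_moment += inner * inner
        if norm >= math.sqrt(d) / 2.0:
            tail_hits += 1
    exact = sum(a * a for a in v) / d
    return second_moment / num_samples, exact, tail_hits / num_samples

# v plays the role of mu_i - mu_j; this particular vector is arbitrary.
estimate, exact, tail_freq = direction_statistics([1.0, -2.0, 0.5])
```

For this low-dimensional example the empirical tail frequency is far above $1/50$, which is consistent with the constant being a dimension-free worst-case bound rather than a tight one.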

Now we can plug (27) into (24) and get

\begin{split}\mathbf{E}_{x}\left[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^{2}\right% ]&\geq\frac{\exp\left(-8\mu_{\max}^{2}\right)}{64}\left(\sum_{i,j\in[n]}\pi_{i% }\pi_{j}\mathbf{E}_{x}\left[\langle\mu_{i}-\mu_{j},\overline{x}\rangle^{2}% \frac{1-\exp\left(-4\mu_{\max}\|x\|\right)}{\mu_{\max}}\right]\right)^{2}\\ &\geq\frac{\exp\left(-8\mu_{\max}^{2}\right)}{64}\left(\sum_{i,j\in[n]}\pi_{i}% \pi_{j}\frac{1-\exp\left(-2\mu_{\max}{\sqrt{d}}\right)}{50d\mu_{\max}}\|\mu_{i% }-\mu_{j}\|^{2}\right)^{2}\\ &\geq\frac{\exp\left(-8\mu_{\max}^{2}\right)}{64}\left(\sum_{i,j\in[n]}\pi_{i}% \pi_{j}\frac{1-\frac{1}{1+2\mu_{\max}{\sqrt{d}}}}{50d\mu_{\max}}\|\mu_{i}-\mu_% {j}\|^{2}\right)^{2}\\ &=\frac{\exp\left(-8\mu_{\max}^{2}\right)}{40000d(1+2\mu_{\max}{\sqrt{d}})^{2}% }\left(\sum_{i,j\in[n]}\pi_{i}\pi_{j}\|\mu_{i}-\mu_{j}\|^{2}\right)^{2}\end{split}

(28)

where we used the inequality $\forall t\geq 0,e^{-t}\leq\frac{1}{1+t}$ at the second to last line. ∎
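The constant $40000$ in the last line of (28) comes from $64\cdot 25^{2}$, since $\frac{1-\frac{1}{1+2\mu_{\max}\sqrt{d}}}{50d\mu_{\max}}=\frac{1}{25\sqrt{d}(1+2\mu_{\max}\sqrt{d})}$. As a quick arithmetic check, the sketch below evaluates the prefactors of the third and fourth lines of (28) and the elementary inequality $e^{-t}\leq\frac{1}{1+t}$ at a few sample points. The helper names are hypothetical and introduced only for this illustration.

```python
import math

def third_line_prefactor(mu_max, d):
    """(1/64) * ((1 - 1/(1 + 2 mu sqrt d)) / (50 d mu))^2, without the sum over i, j."""
    s = 2.0 * mu_max * math.sqrt(d)
    return ((1.0 - 1.0 / (1.0 + s)) / (50.0 * d * mu_max)) ** 2 / 64.0

def fourth_line_prefactor(mu_max, d):
    """1 / (40000 d (1 + 2 mu sqrt d)^2), without the sum over i, j."""
    s = 2.0 * mu_max * math.sqrt(d)
    return 1.0 / (40000.0 * d * (1.0 + s) ** 2)

def exp_bound_holds(t):
    """Check e^{-t} <= 1 / (1 + t) at a single point t >= 0."""
    return math.exp(-t) <= 1.0 / (1.0 + t)

# Arbitrary sample values of (mu_max, d) for the comparison.
pairs = [(0.5, 1), (1.0, 4), (3.0, 10), (2.0, 100)]
ratios = [third_line_prefactor(m, d) / fourth_line_prefactor(m, d) for m, d in pairs]
```

Each entry of `ratios` should equal one up to floating-point rounding, reflecting that the last step of (28) is an identity.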

Theorem 2.

Consider training a student $n$-component GMM initialized from $\bm{\mu}(0)=(\mu_{1}(0)^{\top},\ldots,\mu_{n}(0)^{\top})^{\top}$ to learn a single-component ground truth GMM $\mathcal{N}(0,I_{d})$ with the population gradient EM algorithm. If the step size satisfies $\eta\leq O\left(\frac{\exp\left(-8n\mu_{\max}^{2}(0)\right)\pi_{\min}^{2}}{n^{2}d^{2}(\frac{1}{\mu_{\max}(0)}+\mu_{\max}(0))^{2}}\right)$, then gradient EM converges globally with rate

\mathcal{L}(\bm{\mu}(t))\leq\frac{1}{\sqrt{\gamma t}},

where $\gamma=\Omega\left(\frac{\eta\exp\left(-16n\mu_{\max}^{2}(0)\right)\pi_{\min}^{4}}{n^{2}d^{2}(1+\mu_{\max}(0){\sqrt{dn}})^{4}}\right)$ is the constant that appears in the proof below.

Proof.

We use mathematical induction to prove Theorem 2, by proving the following two conditions inductively:

U(t)\leq n\mu_{\max}^{2}(0),\forall t.

(29)

\frac{1}{\mathcal{L}^{2}(\bm{\mu}(t))}\geq\gamma t+\frac{1}{\mathcal{L}^{2}(% \bm{\mu}(0))},\forall t.

(30)

Note that (30) directly implies the theorem, so now we just need to prove (29) and (30) together.

The base case $t=0$ is trivial. Now suppose both conditions hold at time step $t$, and consider step $t+1$. By the induction hypothesis (29), we have $\|\mu_{i}(t)\|\leq\mu_{\max}(t)\leq\sqrt{U(t)}\leq\sqrt{n}\mu_{\max}(0)$ for all $i\in[n]$.

Proof of (30). Since $\nabla_{\bm{\mu}}Q(\bm{\mu}|\bm{\mu})=\nabla_{\bm{\mu}}\mathcal{L}(\bm{\mu})$ , we can apply classical analysis of gradient descent [Nesterov et al., 2018] as

\begin{split}&\quad\mathcal{L}(\bm{\mu}(t+1))-\mathcal{L}(\bm{\mu}(t))\\ &=\mathcal{L}(\bm{\mu}(t)-\eta\nabla\mathcal{L}(\bm{\mu}(t)))-\mathcal{L}(\bm{% \mu}(t))\\ &=-\int_{s=0}^{1}\left\langle\nabla\mathcal{L}(\bm{\mu}(t)-s\eta\nabla\mathcal% {L}(\bm{\mu}(t))),\eta\nabla\mathcal{L}(\bm{\mu}(t))\right\rangle\mathrm{d}s\\ &=-\int_{s=0}^{1}\left\langle\nabla\mathcal{L}(\bm{\mu}(t)),\eta\nabla\mathcal% {L}(\bm{\mu}(t))\right\rangle\mathrm{d}s+\int_{s=0}^{1}\left\langle\nabla% \mathcal{L}(\bm{\mu}(t))-\nabla\mathcal{L}(\bm{\mu}(t)-s\eta\nabla\mathcal{L}(% \bm{\mu}(t))),\eta\nabla\mathcal{L}(\bm{\mu}(t))\right\rangle\mathrm{d}s\\ &=-\eta\|\nabla\mathcal{L}(\bm{\mu}(t))\|^{2}+\eta\int_{s=0}^{1}\left\langle% \nabla\mathcal{L}(\bm{\mu}(t))-\nabla\mathcal{L}(\bm{\mu}(t)-s\eta\nabla% \mathcal{L}(\bm{\mu}(t))),\nabla\mathcal{L}(\bm{\mu}(t))\right\rangle\mathrm{d% }s\\ \end{split}

(31)

Note that the gradient norm can be upper bounded as

\begin{split}\|\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t))\|&=\left\|\mathbf{E}_{% x}\left[\psi_{i}(x)\sum_{k\in[n]}\psi_{k}(x)\mu_{k}(t)\right]\right\|\leq% \mathbf{E}_{x}\left[\psi_{i}(x)\sum_{k\in[n]}\psi_{k}(x)\left\|\mu_{k}(t)% \right\|\right]\\ &\leq\sum_{k}\|\mu_{k}(t)\|\leq\sqrt{nU(t)}\leq n\mu_{\max}(0).\end{split}

Then for any $s\in[0,1]$ , we have $\|s\eta\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t))\|\leq\eta n\mu_{\max}(0)\leq% \frac{1}{\max\left\{6d,2\|\mu_{i}(t)\|\right\}}$ . So we can apply Theorem 13 and get

\begin{split}&\quad\|\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t))-\nabla_{\mu_{i}}% \mathcal{L}(\bm{\mu}(t)-s\eta\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t)))\|\\ &\leq n\mu_{\max}(t)(30\sqrt{d}+4\mu_{\max}(t))\|s\eta\nabla_{\mu_{i}}\mathcal% {L}(\bm{\mu}(t))\|+\sum_{k\in[n]}\|s\eta\nabla_{\mu_{k}}\mathcal{L}(\bm{\mu}(t% ))\|.\end{split}

Therefore for $\forall s\in[0,1]$ ,

\begin{split}&\quad\left\langle\nabla\mathcal{L}(\bm{\mu}(t))-\nabla\mathcal{L% }(\bm{\mu}(t)-s\eta\nabla\mathcal{L}(\bm{\mu}(t))),\nabla\mathcal{L}(\bm{\mu}(% t))\right\rangle\\ &\leq\sum_{i\in[n]}\|\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t))-\nabla_{\mu_{i}}% \mathcal{L}(\bm{\mu}(t)-s\eta\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t)))\|\cdot% \|\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t))\|\\ &\leq\sum_{i\in[n]}\left(n\mu_{\max}(t)(30\sqrt{d}+4\mu_{\max}(t))\|s\eta% \nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t))\|+\sum_{k\in[n]}\|s\eta\nabla_{\mu_{k% }}\mathcal{L}(\bm{\mu}(t))\|\right)\|\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t))% \|\\ &\leq\eta\left(n\mu_{\max}(t)(30\sqrt{d}+4\mu_{\max}(t))+n^{2}\right)\|\nabla% \mathcal{L}(\bm{\mu}(t))\|^{2}\\ &\leq\eta\left(4n^{2}\mu_{\max}(0)^{2}+30\sqrt{d}n^{3/2}\mu_{\max}(0)+n^{2}% \right)\|\nabla\mathcal{L}(\bm{\mu}(t))\|^{2}\\ &\leq 20\eta\sqrt{d}n^{2}(\mu_{\max}^{2}(0)+1)\|\nabla\mathcal{L}(\bm{\mu}(t))% \|^{2}.\\ \end{split}

(32)

Plugging (32) into (31), since $\eta\leq O\left(\frac{1}{\sqrt{d}n^{2}(\mu_{\max}^{2}(0)+1)}\right)$ we have

\mathcal{L}(\bm{\mu}(t+1))-\mathcal{L}(\bm{\mu}(t))\leq-\eta\|\nabla\mathcal{L}(\bm{\mu}(t))\|^{2}+20\eta^{2}\sqrt{d}n^{2}(\mu_{\max}^{2}(0)+1)\|\nabla\mathcal{L}(\bm{\mu}(t))\|^{2}\leq-\frac{\eta}{2}\|\nabla\mathcal{L}(\bm{\mu}(t))\|^{2}.

(33)

By Lemma 12 we can lower bound the gradient norm as

\begin{split}&\|\nabla\mathcal{L}(\bm{\mu}(t))\|\geq\frac{\left\langle\nabla\mathcal{L}(\bm{\mu}(t)),\bm{\mu}(t)\right\rangle}{\|\bm{\mu}(t)\|}\geq\frac{\left\langle\nabla\mathcal{L}(\bm{\mu}(t)),\bm{\mu}(t)\right\rangle}{n\mu_{\max}(t)}\geq\Omega\left(\frac{\exp\left(-8\mu_{\max}^{2}(t)\right)\pi_{\min}^{2}}{nd(1+\mu_{\max}(t){\sqrt{d}})^{2}}\right)\mu_{\max}^{3}(t)\\ &\overset{\text{Theorem (loss function upper bound)}}{\geq}\Omega\left(\frac{\exp\left(-8\mu_{\max}^{2}(t)\right)\pi_{\min}^{2}}{nd(1+\mu_{\max}(t){\sqrt{d}})^{2}}\right)(2\mathcal{L}(\bm{\mu}(t)))^{3/2}\geq\Omega\left(\frac{\exp\left(-8n\mu_{\max}^{2}(0)\right)\pi_{\min}^{2}}{nd(1+\mu_{\max}(0){\sqrt{dn}})^{2}}\right)\mathcal{L}^{3/2}(\bm{\mu}(t)).\end{split}

(34)

Combining (34) and (33), we have

\mathcal{L}(\bm{\mu}(t+1))\leq\mathcal{L}(\bm{\mu}(t))-\frac{\eta}{2}\|\nabla% \mathcal{L}(\bm{\mu}(t))\|^{2}\leq\mathcal{L}(\bm{\mu}(t))-\Omega\left(\frac{% \eta\exp\left(-16n\mu_{\max}^{2}(0)\right)\pi_{\min}^{4}}{n^{2}d^{2}(1+\mu_{% \max}(0){\sqrt{dn}})^{4}}\right)\mathcal{L}^{3}(\bm{\mu}(t)).

(35)

Note that the above inequality implies $\mathcal{L}(\bm{\mu}(t+1))\leq\mathcal{L}(\bm{\mu}(t))$ , therefore

\begin{split}&\quad\frac{1}{\mathcal{L}^{2}(\bm{\mu}(t+1))}-\frac{1}{\mathcal{L}^{2}(\bm{\mu}(t))}=\frac{(\mathcal{L}(\bm{\mu}(t))-\mathcal{L}(\bm{\mu}(t+1)))(\mathcal{L}(\bm{\mu}(t))+\mathcal{L}(\bm{\mu}(t+1)))}{\mathcal{L}^{2}(\bm{\mu}(t))\mathcal{L}^{2}(\bm{\mu}(t+1))}\\ &\geq\frac{(\mathcal{L}(\bm{\mu}(t))-\mathcal{L}(\bm{\mu}(t+1)))\mathcal{L}(\bm{\mu}(t))}{\mathcal{L}^{4}(\bm{\mu}(t))}\overset{(35)}{\geq}\Omega\left(\frac{\eta\exp\left(-16n\mu_{\max}^{2}(0)\right)\pi_{\min}^{4}}{n^{2}d^{2}(1+\mu_{\max}(0){\sqrt{dn}})^{4}}\right)=\gamma.\end{split}

On the other hand, the induction hypothesis gives $\frac{1}{\mathcal{L}^{2}(\bm{\mu}(t))}\geq\gamma t+\frac{1}{\mathcal{L}^{2}(\bm{\mu}(0))}$. Combining this with the above inequality, we get $\frac{1}{\mathcal{L}^{2}(\bm{\mu}(t+1))}\geq\frac{1}{\mathcal{L}^{2}(\bm{\mu}(t))}+\gamma\geq\gamma(t+1)+\frac{1}{\mathcal{L}^{2}(\bm{\mu}(0))}$, which finishes the proof of (30).
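The mechanism behind (30) can be seen on a scalar recursion. The sketch below iterates the extremal case $\mathcal{L}_{t+1}=\mathcal{L}_{t}-\gamma\mathcal{L}_{t}^{3}$ of the per-step decrease and records the losses, so that one can check $1/\mathcal{L}_{t}^{2}\geq\gamma t+1/\mathcal{L}_{0}^{2}$ and hence $\mathcal{L}_{t}\leq 1/\sqrt{\gamma t}$. The values of `loss0` and `gamma` are arbitrary assumptions chosen so that $\gamma\mathcal{L}_{0}^{2}<1$; they are not the constants of Theorem 2.

```python
import math

def iterate_cubic_recursion(loss0, gamma, num_steps):
    """Iterate L_{t+1} = L_t - gamma * L_t^3 and return [L_0, ..., L_T]."""
    losses = [loss0]
    for _ in range(num_steps):
        current = losses[-1]
        losses.append(current - gamma * current ** 3)
    return losses

def satisfies_rate(losses, gamma):
    """Check 1 / L_t^2 >= gamma * t + 1 / L_0^2 for every recorded step."""
    base = 1.0 / losses[0] ** 2
    return all(1.0 / loss ** 2 >= gamma * t + base for t, loss in enumerate(losses))

# Illustrative constants only.
losses = iterate_cubic_recursion(loss0=1.0, gamma=0.05, num_steps=2000)
rate_ok = satisfies_rate(losses, gamma=0.05)
```

A decrease proportional to $\mathcal{L}^{3}$ rather than $\mathcal{L}$ is what makes the rate sublinear: the inverse squared loss grows only linearly in $t$.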

Proof of (29). The dynamics of potential function $U$ can be calculated as

\begin{split}&\quad U(\bm{\mu}(t+1))=\sum_{i\in[n]}\left\|\mu_{i}(t+1)\right\|^{2}\\ &=\sum_{i\in[n]}\left\|\mu_{i}(t)-\eta\nabla_{\mu_{i}}Q(\bm{\mu}(t)|\bm{\mu}(t))\right\|^{2}\\ &=U(\bm{\mu}(t))-{2\eta\sum_{i\in[n]}\left\langle\mu_{i}(t),\nabla_{\mu_{i}}Q(\bm{\mu}(t)|\bm{\mu}(t))\right\rangle}+{\eta^{2}\sum_{i\in[n]}\|\nabla_{\mu_{i}}Q(\bm{\mu}(t)|\bm{\mu}(t))\|^{2}}\\ &\overset{\text{Corollary (gradient projection)}}{=}U(\bm{\mu}(t))-\underbrace{2\eta\mathbf{E}_{x}\left[\|\tilde{\bm{\psi}}_{\bm{\mu}(t)}(x)\|^{2}\right]}_{I_{1}}+\underbrace{\eta^{2}\sum_{i\in[n]}\|\nabla_{\mu_{i}}Q(\bm{\mu}(t)|\bm{\mu}(t))\|^{2}}_{I_{2}}.\end{split}

(36)

The first term $I_{1}$ can be bounded by Lemma 12 as

I_{1}\geq\eta\Omega\left(\frac{\exp\left(-8\mu_{\max}^{2}(t)\right)\pi_{\min}^% {2}}{d(1+\mu_{\max}(t){\sqrt{d}})^{2}}\right)\mu_{\max}^{4}(t)\geq\eta\Omega% \left(\frac{\exp\left(-8n\mu_{\max}^{2}(0)\right)\pi_{\min}^{2}}{n^{2}d(1+\mu_% {\max}(0){\sqrt{nd}})^{2}}\right)U^{2}(\bm{\mu}(t)).

(37)

The second term $I_{2}$ is a perturbation term that can be upper bounded by Lemma 9 as

\begin{split}I_{2}&={\eta^{2}\sum_{i\in[n]}\|\nabla_{\mu_{i}}Q(\bm{\mu}(t)|\bm% {\mu}(t))\|^{2}}=\eta^{2}\sum_{i\in[n]}\left\|\mathbf{E}_{x}\left[\psi_{i}(x)% \sum_{k\in[n]}\psi_{k}(x)\mu_{k}(t)\right]\right\|^{2}\\ &\leq\eta^{2}\sum_{i\in[n]}\mathbf{E}_{x}\left[\left\|\psi_{i}(x)\sum_{k\in[n]% }\psi_{k}(x)\mu_{k}(t)\right\|\right]^{2}\\ &\leq\eta^{2}\sum_{i\in[n]}\mathbf{E}_{x}\left[\psi_{i}(x)\sum_{k\in[n]}\psi_{% k}(x)\left\|\mu_{k}(t)\right\|\right]^{2}\\ &\leq\eta^{2}\sum_{i\in[n]}\mathbf{E}_{x}\left[\sqrt{\left(\sum_{k\in[n]}\psi_% {i}^{2}(x)\psi_{k}^{2}(x)\right)\left(\sum_{k\in[n]}\|\mu_{k}(t)\|^{2}\right)}% \right]^{2}\\ &\leq\eta^{2}\sum_{i\in[n]}\mathbf{E}_{x}\left[\sum_{k\in[n]}\psi_{i}^{2}(x)% \psi_{k}^{2}(x)\right]\mathbf{E}_{x}\left[\sum_{k\in[n]}\|\mu_{k}(t)\|^{2}% \right]\\ &=\eta^{2}U(\bm{\mu}(t))\mathbf{E}_{x}\left[\sum_{i\in[n]}\sum_{k\in[n]}\psi_{% i}^{2}(x)\psi_{k}^{2}(x)\right]\\ &\leq\eta^{2}U(\bm{\mu}(t))\mathbf{E}_{x}\left[\left(\sum_{i\in[n]}\psi_{i}(x)% \right)\left(\sum_{k\in[n]}\psi_{k}(x)\right)\right]\\ &=\eta^{2}U(\bm{\mu}(t)).\end{split}

(38)

where we use triangle inequality twice at the second and third line, and Cauchy-Schwarz inequality twice at the fourth and fifth line.

Putting (38), (37) and (36) together, we get

U(\bm{\mu}(t+1))\leq U(\bm{\mu}(t))-\eta\Omega\left(\frac{\exp\left(-8n\mu_{% \max}^{2}(0)\right)\pi_{\min}^{2}}{n^{2}d(1+\mu_{\max}(0){\sqrt{nd}})^{2}}% \right)U^{2}(\bm{\mu}(t))+\eta^{2}U(\bm{\mu}(t)).

Consider two cases:

a). If $\frac{n}{2}\mu_{\max}^{2}(0)\leq U(\bm{\mu}(t))\leq n\mu_{\max}^{2}(0)$ , then

\begin{split}&\quad U(\bm{\mu}(t+1))\leq U(\bm{\mu}(t))-\eta U(\bm{\mu}(t))% \left(\Omega\left(\frac{\exp\left(-8n\mu_{\max}^{2}(0)\right)\pi_{\min}^{2}}{n% ^{2}d(1+\mu_{\max}(0){\sqrt{nd}})^{2}}\right)U(\bm{\mu}(t))-\eta\right)\\ &\leq U(\bm{\mu}(t))-\eta U(\bm{\mu}(t))\left(\Omega\left(\frac{\exp\left(-8n% \mu_{\max}^{2}(0)\right)\pi_{\min}^{2}}{n^{2}d(1+\mu_{\max}(0){\sqrt{nd}})^{2}% }\right)\frac{n}{2}\mu_{\max}^{2}(0)-\eta\right)\leq U(\bm{\mu}(t))\leq n\mu_{% \max}^{2}(0),\end{split}

where we used $\eta\leq O\left(\frac{\exp\left(-8n\mu_{\max}^{2}(0)\right)\pi_{\min}^{2}}{n^{2}d(1+\mu_{\max}(0){\sqrt{nd}})^{2}}\right)\frac{n}{2}\mu_{\max}^{2}(0)$.

b). If $\frac{n}{2}\mu_{\max}^{2}(0)>U(\bm{\mu}(t))$ , then $U(\bm{\mu}(t+1))\leq(1+\eta^{2})U(\bm{\mu}(t))\leq 2U(\bm{\mu}(t))\leq n\mu_{% \max}^{2}(0)$ .

Since (29) holds in both cases, our proof is done. ∎
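To illustrate condition (29), the following sketch runs population gradient EM in a toy setting and tracks the potential $U$. Several simplifying assumptions are made that are not in the theorem: $d=1$, equal mixing weights, the expectation over $x\sim\mathcal{N}(0,1)$ replaced by trapezoid quadrature on a truncated grid, and a step size $\eta=0.1$ chosen for speed rather than from the bound in Theorem 2. The gradient uses the form $\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu})=\mathbf{E}_{x}\left[\psi_{i}(x)\sum_{k\in[n]}\psi_{k}(x)\mu_{k}\right]$, and all function names are hypothetical.

```python
import math

def make_grid(half_width=8.0, step=1.0 / 32):
    """Symmetric grid and trapezoid weights approximating E under N(0, 1)."""
    m = int(round(half_width / step))
    xs = [k * step for k in range(-m, m + 1)]
    ws = [step * math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi) for x in xs]
    ws[0] *= 0.5
    ws[-1] *= 0.5
    return xs, ws

def posteriors(x, mus):
    """psi_i(x) for equal mixing weights and unit variance (assumed here)."""
    logits = [-(x - mu) ** 2 / 2.0 for mu in mus]
    top = max(logits)  # subtract the maximum for numerical stability
    unnorm = [math.exp(l - top) for l in logits]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def gradient(mus, xs, ws):
    """grad_i = E_x[psi_i(x) * sum_k psi_k(x) mu_k], evaluated by quadrature."""
    grads = [0.0] * len(mus)
    for x, w in zip(xs, ws):
        psi = posteriors(x, mus)
        mean = sum(p * mu for p, mu in zip(psi, mus))
        for i, p in enumerate(psi):
            grads[i] += w * p * mean
    return grads

def run_gradient_em(mus0, eta, num_steps):
    """Iterate mu <- mu - eta * grad and record U = sum_i mu_i^2 at each step."""
    xs, ws = make_grid()
    mus = list(mus0)
    potentials = [sum(mu * mu for mu in mus)]
    for _ in range(num_steps):
        grads = gradient(mus, xs, ws)
        mus = [mu - eta * g for mu, g in zip(mus, grads)]
        potentials.append(sum(mu * mu for mu in mus))
    return mus, potentials

# Arbitrary initialization with n = 3 and mu_max(0) = 2.
final_mus, potentials = run_gradient_em([-1.0, 0.5, 2.0], eta=0.1, num_steps=200)
```

In this toy run the recorded potential never exceeds $n\mu_{\max}^{2}(0)=12$, in line with (29). This is only an illustration in one dimension and does not replace the inductive argument.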

B.2 Proofs for Section 3.2

Lemma 16.

For any $\bm{\mu}$ satisfying $\|\mu_{1}\|\leq\sqrt{d},\|\mu_{2}\|,\|\mu_{3}\|,\ldots,\|\mu_{n}\|\geq 10\sqrt% {d}$ , the gradient of $\mathcal{L}$ at $\bm{\mu}$ can be upper bounded as

\|\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu})\|\leq 2\|\mu_{1}\|+2\exp(-d)\sum_{k\neq 1}\|\mu_{k}\|,\forall i\in[n].

Proof.

Recall that the gradient has the form $\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu})=\mathbf{E}_{x}\left[\psi_{i}(x)\sum_{k% \in[n]}\psi_{k}(x)\mu_{k}\right],$ hence its norm can be upper bounded as

\begin{split}&\quad\|\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu})\|\leq\mathbf{E}_{x}% \left[\psi_{i}(x)\sum_{k\in[n]}\psi_{k}(x)\|\mu_{k}\|\right]\\ &\leq\mathbf{E}_{x}\left[\sum_{k\in[n]}\psi_{k}(x)\|\mu_{k}\|\Bigg{|}\|x\|\leq 2% \sqrt{d}\right]+\mathbf{E}_{x}\left[\sum_{k\in[n]}\psi_{k}(x)\|\mu_{k}\|\Bigg{% |}\|x\|>2\sqrt{d}\right]\Pr\left[\|x\|>2\sqrt{d}\right].\end{split}

(39)

For any $\|x\|\leq 2\sqrt{d}$ , we have $\exp(-\|x-\mu_{1}\|^{2}/2)\geq\exp(-(\|x\|+\|\mu_{1}\|)^{2}/2)\geq\exp(-9d/2)$ , while for $\forall i\neq 1$ , $\exp(-\|x-\mu_{i}\|^{2}/2)\leq\exp(-(\|\mu_{i}\|-\|x\|)^{2}/2)\leq\exp(-(10% \sqrt{d}-2\sqrt{d})^{2}/2)=\exp(-32d)$ . Since $\psi_{i}(x)\propto\exp(-\|x-\mu_{i}\|^{2}/2)$ we have

\|x\|\leq 2\sqrt{d}\Rightarrow\psi_{i}(x)\leq\frac{\exp(-\|x-\mu_{i}\|^{2}/2)}% {\exp(-\|x-\mu_{1}\|^{2}/2)}\leq\frac{\exp(-32d)}{\exp(-9d/2)}\leq\exp(-25d),% \forall i\neq 1.

Therefore the first term in (39) can be bounded as $\mathbf{E}_{x}\left[\sum_{k\in[n]}\psi_{k}(x)\|\mu_{k}\|\Bigg|\|x\|\leq 2\sqrt{d}\right]\leq\|\mu_{1}\|+\exp(-25d)\sum_{k\neq 1}\|\mu_{k}\|.$

On the other hand, by tail bound of the norm of Gaussian vectors (see Lemma 8 of [Yan et al., ]) we have $\Pr\left[\|x\|>2\sqrt{d}\right]\leq\exp(-d)$ . Putting everything together, (39) can be further bounded as

\begin{split}\|\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu})\|\leq\|\mu_{1}\|+\exp(-25d)\sum_{k\neq 1}\|\mu_{k}\|+\exp(-d)\sum_{k\in[n]}\|\mu_{k}\|\leq 2\|\mu_{1}\|+2\exp(-d)\sum_{k\neq 1}\|\mu_{k}\|.\end{split}

∎
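The key step of the proof is that, on the ball $\|x\|\leq 2\sqrt{d}$, the posterior weight of every far component is at most $\exp(-25d)$. The sketch below checks this numerically in a toy instance with $d=1$ and equal mixing weights, both of which are assumptions of the example. The function names and the specific means are hypothetical; index `0` in the code corresponds to $\mu_{1}$ in the text.

```python
import math

def posterior(i, x, mus):
    """psi_i(x) in one dimension with equal mixing weights (assumed here)."""
    logits = [-(x - mu) ** 2 / 2.0 for mu in mus]
    top = max(logits)  # subtract the maximum for numerical stability
    unnorm = [math.exp(l - top) for l in logits]
    return unnorm[i] / sum(unnorm)

def worst_far_posterior(mus, radius=2.0, num_points=401):
    """Largest psi_i(x) over i != 0 and a uniform grid of |x| <= radius."""
    worst = 0.0
    for k in range(num_points):
        x = -radius + 2.0 * radius * k / (num_points - 1)
        for i in range(1, len(mus)):
            worst = max(worst, posterior(i, x, mus))
    return worst

# |mu_1| <= sqrt(d) = 1 and |mu_k| >= 10 sqrt(d) = 10 for the other components.
worst = worst_far_posterior([0.9, 10.0, -10.5])
```

The resulting value of `worst` is expected to lie well below $\exp(-25)$, since the bound in the proof discards a factor of $\exp(-2.5d)$ and uses worst-case positions for all means.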

Theorem 7.

For any $n=2l+1$ , consider gradient EM initialized at point $\mu_{1}(0)=0,\mu_{2}(0)=\cdots=\mu_{l+1}(0)=12\sqrt{d}e_{1},\mu_{l+2}(0)=% \cdots=\mu_{2l+1}(0)=-12\sqrt{d}e_{1}$ , where $e_{1}=(1,0,\ldots,0)^{\top}$ is a standard unit vector. Then population gradient EM will be trapped in a bad local region around $\bm{\mu}(0)$ for exponentially long time $T=\frac{1}{15n\eta}e^{d}=\frac{1}{15n\eta}\exp(\Theta(\mu_{\max}^{2}(0)))$ . More rigorously, for any $0\leq t\leq T$ , we have

\|\mu_{i}(t)\|\geq 10\sqrt{d},\;\;\forall\;i\neq 1.

Proof.

We prove the following statement inductively $\forall\;0\leq t\leq T$ :

\mu_{1}(t)=0,\mu_{2}(t)=\ldots=\mu_{l+1}(t)=-\mu_{l+2}(t)=\ldots=-\mu_{2l+1}(t)

(40)

\forall i,\;\|\mu_{i}(t)-\mu_{i}(0)\|\leq\eta t(30\sqrt{d}ne^{-d}).

(41)

(40) states that during the gradient EM update, $\mu_{1}$ remains stationary at $0$, while the symmetry between $\mu_{2},\ldots,\mu_{n}$ is preserved.

The base case is trivial. Now suppose (40) and (41) hold for $0,1,\ldots,t$; we prove them for $t+1$.

Proof of (40). By the induction hypothesis, a direct calculation shows that $\forall x,\psi_{1}(x|\bm{\mu}(t))=\psi_{1}(-x|\bm{\mu}(t)),\psi_{2}(x|\bm{\mu}(t))=\cdots=\psi_{l+1}(x|\bm{\mu}(t))=\psi_{l+2}(-x|\bm{\mu}(t))=\cdots=\psi_{2l+1}(-x|\bm{\mu}(t))$. For ease of notation, we denote $\psi_{+}(x)\coloneqq\psi_{2}(x|\bm{\mu}(t))=\cdots=\psi_{l+1}(x|\bm{\mu}(t)),\psi_{-}(x)\coloneqq\psi_{l+2}(x|\bm{\mu}(t))=\cdots=\psi_{2l+1}(x|\bm{\mu}(t)),\mu^{+}\coloneqq\mu_{2}(t)=\ldots=\mu_{l+1}(t),\mu^{-}\coloneqq\mu_{l+2}(t)=\ldots=\mu_{2l+1}(t).$ Then $\mu^{+}=-\mu^{-},\psi_{+}(x)=\psi_{-}(-x)$. Consequently

\begin{split}&\quad\nabla_{\mu_{1}}\mathcal{L}(\bm{\mu}(t))=\mathbf{E}_{x}% \left[\psi_{1}(x|\bm{\mu}(t))\sum_{k\in[n]}\psi_{k}(x|\bm{\mu}(t))\mu_{k}(t)% \right]=\mathbf{E}_{x}\left[\psi_{1}(x)(l\psi_{+}(x)\mu^{+}+l\psi_{-}(x)\mu^{-% })\right]\\ &=\frac{1}{2}\mathbf{E}_{x}\left[\psi_{1}(x)(l\psi_{+}(x)\mu^{+}+l\psi_{-}(x)% \mu^{-})+\psi_{1}(-x)(l\psi_{+}(-x)\mu^{+}+l\psi_{-}(-x)\mu^{-})\right]\\ &=\frac{1}{2}\mathbf{E}_{x}\left[\psi_{1}(x)(l\psi_{+}(x)(\mu^{+}+\mu^{-})+l% \psi_{-}(x)(\mu^{+}+\mu^{-}))\right]=0\Rightarrow\mu_{1}(t+1)=\mu_{1}(t)=0.% \end{split}

Similarly, $\forall 2\leq i\leq l+1,l+2\leq j\leq 2l+1$ we have

\begin{split}&\quad\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t))=\mathbf{E}_{x}% \left[\psi_{i}(x|\bm{\mu}(t))\sum_{k\in[n]}\psi_{k}(x|\bm{\mu}(t))\mu_{k}(t)% \right]=\mathbf{E}_{x}\left[\psi_{+}(x)(l\psi_{+}(x)\mu^{+}+l\psi_{-}(x)\mu^{-% })\right]\\ &=\mathbf{E}_{x}\left[\psi_{+}(-x)(l\psi_{+}(-x)\mu^{+}+l\psi_{-}(-x)\mu^{-})% \right]=-\mathbf{E}_{x}\left[\psi_{-}(x)(l\psi_{+}(x)\mu^{+}+l\psi_{-}(x)\mu^{% -})\right]=-\nabla_{\mu_{j}}\mathcal{L}(\bm{\mu}(t)).\end{split}

This shows that $\nabla_{\mu_{2}}\mathcal{L}(\bm{\mu}(t))=\cdots=\nabla_{\mu_{l+1}}\mathcal{L}(\bm{\mu}(t))=-\nabla_{\mu_{l+2}}\mathcal{L}(\bm{\mu}(t))=\cdots=-\nabla_{\mu_{2l+1}}\mathcal{L}(\bm{\mu}(t))$. Combined with the induction hypothesis, this gives $\mu_{2}(t+1)=\ldots=\mu_{l+1}(t+1)=-\mu_{l+2}(t+1)=\ldots=-\mu_{2l+1}(t+1)$, which proves (40).
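The symmetry argument can be checked numerically at the initialization of Theorem 7. The sketch below is a toy instance under stated assumptions: $d=1$, $l=1$ (so $n=3$), equal mixing weights, and the expectation over $x\sim\mathcal{N}(0,1)$ approximated by trapezoid quadrature on a symmetric grid. The function name `gradients_1d` is hypothetical.

```python
import math

def gradients_1d(mus, half_width=10.0, step=1.0 / 64):
    """grad_i = E_x[psi_i(x) sum_k psi_k(x) mu_k] for x ~ N(0, 1), by quadrature."""
    m = int(round(half_width / step))
    grads = [0.0] * len(mus)
    for k in range(-m, m + 1):
        x = k * step  # grid is exactly symmetric about zero
        w = step * math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
        if abs(k) == m:
            w *= 0.5  # trapezoid end-point weights
        logits = [-(x - mu) ** 2 / 2.0 for mu in mus]
        top = max(logits)
        unnorm = [math.exp(l - top) for l in logits]
        z = sum(unnorm)
        psi = [u / z for u in unnorm]
        mean = sum(p * mu for p, mu in zip(psi, mus))
        for i, p in enumerate(psi):
            grads[i] += w * p * mean
    return grads

# Initialization of Theorem 7 with d = 1: mu_1 = 0, mu_2 = 12, mu_3 = -12.
g_center, g_plus, g_minus = gradients_1d([0.0, 12.0, -12.0])
```

Up to rounding error, `g_center` vanishes and `g_plus` equals `-g_minus`, which is the one-step version of the invariant (40).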

Proof of (41).

By the induction hypothesis, we have $\forall i,\;\|\mu_{i}(t)-\mu_{i}(0)\|\leq\eta t\cdot(30\sqrt{d}ne^{-d})\leq\eta T\cdot(30\sqrt{d}ne^{-d})\leq 2\sqrt{d}.$ So $\forall i\neq 1$, $10\sqrt{d}=\|\mu_{i}(0)\|-2\sqrt{d}\leq\|\mu_{i}(t)\|\leq\|\mu_{i}(0)\|+2\sqrt{d}<15\sqrt{d}$. Together with $\mu_{1}(t)=0$, this shows that $\bm{\mu}(t)$ satisfies the conditions of Lemma 16, so $\forall i\in[n]$ we have

\|\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t))\|\leq 2\|\mu_{1}(t)\|+2\exp(-d)\sum_{k\neq 1}\|\mu_{k}(t)\|\leq 2n\exp(-d)\cdot 15\sqrt{d}=30\sqrt{d}ne^{-d},

note that here we used $\mu_{1}(t)=0$ . Therefore by the induction hypothesis we have $\|\mu_{i}(t+1)-\mu_{i}(0)\|\leq\eta t\cdot(30\sqrt{d}ne^{-d})+\eta\|\nabla_{% \mu_{i}}\mathcal{L}(\bm{\mu}(t))\|\leq\eta(t+1)\cdot(30\sqrt{d}ne^{-d})$ , (41) is proven.

By (41), $\forall i\neq 1,0\leq t\leq T$ we have $\|\mu_{i}(t)\|\geq\|\mu_{i}(0)\|-\|\mu_{i}(t)-\mu_{i}(0)\|\geq 12\sqrt{d}-\eta T% (30\sqrt{d}ne^{-d})\geq 12\sqrt{d}-2\sqrt{d}=10\sqrt{d}.$ Our proof is done. ∎
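The choice $T=\frac{1}{15n\eta}e^{d}$ is exactly what makes the accumulated drift in (41) equal to $2\sqrt{d}$. The sketch below verifies this arithmetic for a few parameter values and shows how the trapping time scales with $d$. The helper names and the sample values of $d$, $n$, $\eta$ are illustrative assumptions.

```python
import math

def trap_time(d, n, eta):
    """T = e^d / (15 n eta), the trapping time stated in Theorem 7."""
    return math.exp(d) / (15.0 * n * eta)

def max_drift(d, n, eta, t):
    """Right-hand side of (41): eta * t * 30 sqrt(d) n e^{-d}."""
    return eta * t * 30.0 * math.sqrt(d) * n * math.exp(-d)

def norm_lower_bound(d, n, eta, t):
    """Lower bound on ||mu_i(t)|| for i != 1 via the triangle inequality."""
    return 12.0 * math.sqrt(d) - max_drift(d, n, eta, t)

# Arbitrary sample settings (d, n, eta) with n odd, as in the theorem.
settings = [(1, 3, 0.1), (5, 7, 0.01), (20, 11, 0.001)]
drifts = [max_drift(d, n, eta, trap_time(d, n, eta)) for d, n, eta in settings]
growth = trap_time(20, 3, 0.01) / trap_time(10, 3, 0.01)
```

Here `growth` equals $e^{10}$ up to rounding: increasing the dimension by $10$ multiplies the guaranteed trapping time by roughly $2.2\times 10^{4}$.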