Toward Global Convergence of Gradient EM for Over-Parameterized Gaussian Mixture Models

Weihang Xu
University of Washington
[email protected]

Maryam Fazel
University of Washington
[email protected]

Simon S. Du
University of Washington
[email protected]
Abstract

We study the gradient Expectation-Maximization (EM) algorithm for Gaussian Mixture Models (GMM) in the over-parameterized setting, where a general GMM with $n>1$ components learns from data that are generated by a single ground truth Gaussian distribution. While results for the special case of 2-Gaussian mixtures are well-known, a general global convergence analysis for arbitrary $n$ remains unresolved and faces several new technical barriers since the convergence becomes sub-linear and non-monotonic. To address these challenges, we construct a novel likelihood-based convergence analysis framework and rigorously prove that gradient EM converges globally with a sublinear rate $O(1/\sqrt{t})$. This is the first global convergence result for Gaussian mixtures with more than $2$ components. The sublinear convergence rate is due to the algorithmic nature of learning over-parameterized GMM with gradient EM. We also identify a new emerging technical challenge for learning general over-parameterized GMM: the existence of bad local regions that can trap gradient EM for an exponential number of steps.

1 Introduction

Learning Gaussian Mixture Models (GMM) is a fundamental problem in machine learning with broad applications. In this problem, data generated from a mixture of $n\geq 2$ ground truth Gaussians are observed without the label (the index of the component Gaussian that each data point is sampled from), and the goal is to retrieve the maximum likelihood estimate of the Gaussian components. The Expectation Maximization (EM) algorithm is arguably the most widely-used algorithm for this problem. Each iteration of the EM algorithm consists of two steps. In the expectation (E) step, it computes the posterior probability of the unobserved mixture membership label according to the current parameterized model. In the maximization (M) step, it computes the maximizer of the $Q$ function, which is the likelihood with respect to the posterior estimate of the hidden label computed in the E step.

Gradient EM, a popular variant of EM, is often used in practice when the maximization step of EM is costly or even intractable. It replaces the M step of EM with one gradient step on the $Q$ function. Learning Gaussian Mixture Models with EM/gradient EM is an important and widely-studied problem. Starting from the seminal work [Balakrishnan et al., 2014], a flurry of works Daskalakis et al. [2017], Xu et al. [2016], Dwivedi et al. [2018a], Kwon and Caramanis [2020], Dwivedi et al. [2019] has studied convergence guarantees for EM/gradient EM in various settings. However, these works either only prove local convergence, or consider the special case of $2$-Gaussian mixtures. A general global convergence analysis of EM/gradient EM on $n$-Gaussian mixtures still remains unresolved. ** et al. [2016] is a notable negative result in this regard, where the authors show that on GMM with $n\geq 3$ components, randomly initialized EM will get trapped in a spurious local minimum with high probability.

Over-parameterized Gaussian Mixture Models. Motivated by the negative results, a line of work considers the over-parameterized setting where the model uses more Gaussian components than the ground truth GMM, in the hope that it might help the global convergence of EM and bypass the negative result. In this over-parameterized regime, the best known result is due to Dwivedi et al. [2018b], who prove global convergence of 2-Gaussian mixtures on one single Gaussian ground truth. The authors also show that EM has a unique sub-linear convergence rate in this over-parameterized setting (compared with the linear convergence rate in the exact-parameterized setting [Balakrishnan et al., 2014]). This motivates the following natural open question:

Can we prove global convergence of the EM/gradient EM algorithm on general $n$-Gaussian mixtures in the over-parameterized regime?

In this paper, we take a significant step towards answering this question. Our main contributions can be summarized as follows:

  • We prove global convergence of the gradient EM algorithm for learning general $n$-component GMM on one single ground truth Gaussian distribution. This is, to the best of our knowledge, the first global convergence proof for general $n$-component GMM. Our convergence rate is sub-linear, reflecting an inherent nature of over-parameterized GMM (see Remark 3 for details).

  • We propose a new analysis framework that utilizes the likelihood function for proving convergence of gradient EM. Our new framework tackles several emerging technical barriers for global analysis of general GMM.

  • We also identify a new geometric property of gradient EM for learning general $n$-component GMM: there exist bad initialization regions that trap gradient EM for exponentially long, resulting in an inevitable exponential factor in the convergence rate of gradient EM.

1.1 Gaussian Mixture Model (GMM)

We consider the canonical Gaussian Mixture Models with weights $\bm{\pi}=(\pi_1,\ldots,\pi_n)$ ($\sum_{i=1}^{n}\pi_i=1$), means $\bm{\mu}=(\mu_1^\top,\ldots,\mu_n^\top)^\top$ and unit covariance matrices $I_d$ in $d$-dimensional space. Following a widely-studied setting [Balakrishnan et al., 2014, Yan et al., Daskalakis et al., 2017], we set the weights $\bm{\pi}$ and covariances $I_d$ in the student GMM as fixed, and the means $\bm{\mu}=(\mu_1^\top,\ldots,\mu_n^\top)^\top$ as trainable parameters. We use $\text{GMM}(\bm{\mu})$ to denote the GMM model parameterized by $\bm{\mu}$, which can be described with probability density function (PDF) $p_{\bm{\mu}}:\mathbf{R}^d\to\mathbf{R}_{\geq 0}$ as

$$p_{\bm{\mu}}(x)=\sum_{i\in[n]}\pi_i\,\phi(x|\mu_i,I_d)=\sum_{i\in[n]}\pi_i(2\pi)^{-d/2}\exp\left(-\frac{\|x-\mu_i\|^2}{2}\right), \tag{1}$$

where $\phi(\cdot|\mu,\Sigma)$ is the PDF of $\mathcal{N}(\mu,\Sigma)$, $\pi_1+\cdots+\pi_n=1$, and $\pi_i>0$ for all $i\in[n]$.
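The density in (1) is straightforward to evaluate numerically. The following is a minimal sketch, assuming NumPy; the function name `gmm_density` and the array layout (one row per point or per mean) are illustrative choices, not part of the paper.

```python
import numpy as np

def gmm_density(x, mus, pis):
    """Evaluate p_mu(x) from (1) for points x of shape (m, d).

    mus has shape (n, d); pis has length n. Covariances are fixed to I_d.
    """
    x = np.atleast_2d(x)                       # (m, d)
    mus = np.atleast_2d(mus)                   # (n, d)
    d = x.shape[1]
    # Squared distances ||x - mu_i||^2 for every point/component pair, shape (m, n).
    sq_dist = ((x[:, None, :] - mus[None, :, :]) ** 2).sum(axis=-1)
    components = (2.0 * np.pi) ** (-d / 2.0) * np.exp(-0.5 * sq_dist)
    # Weighted sum over components gives the mixture density, shape (m,).
    return components @ np.asarray(pis, dtype=float)
```

When all means coincide at the origin, the mixture reduces to the single Gaussian $\mathcal{N}(0,I_d)$ regardless of the weights, which is the over-parameterized target studied here.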

1.2 Gradient EM algorithm

The EM algorithm is one of the most popular algorithms for retrieving the maximum likelihood estimator (MLE) on latent variable models. In general, EM and gradient EM address the following problem: given a joint distribution $p_{\bm{\mu}^*}(x,y)$ of random variables $x,y$ parameterized by $\bm{\mu}^*$, observing only the distribution of $x$, but not the latent variable $y$, the goal of EM and gradient EM is to retrieve the maximum likelihood estimator

$$\hat{\bm{\mu}}_{\text{MLE}}\in\arg\max_{\bm{\mu}}\log p_{\bm{\mu}}(x).$$

The focus of this paper is the non-convex optimization analysis, so we consider using the population gradient EM algorithm to learn GMM (1), where the observed variable is $x\in\mathbf{R}^d$ and the latent variable is the index of the membership Gaussian in the GMM. We follow the standard teacher-student setting where a student model $\text{GMM}(\bm{\mu})$ with $n\geq 2$ Gaussian components learns from data generated from a ground truth teacher model $\text{GMM}(\bm{\mu}^*)$. We consider the over-parameterized setting where the ground truth model $\text{GMM}(\bm{\mu}^*)$ is a single Gaussian distribution $\mathcal{N}(0,I_d)$, namely $\bm{\mu}^*=({\mu_1^*}^\top,\ldots,{\mu_n^*}^\top)^\top=(0^\top,\ldots,0^\top)^\top$. Our problem can be seen as a strict generalization of Dwivedi et al. [2018b], who studied using a mixture model of two Gaussians with symmetric means (under the constraint $\mu_2=-\mu_1$) to learn one single Gaussian.

At time step $t=0,1,2,\ldots$, given parameters $\bm{\mu}(t)=(\mu_1(t)^\top,\ldots,\mu_n(t)^\top)^\top$, population gradient EM updates $\bm{\mu}$ via the following two steps:

  • E step: for each $i\in[n]$, compute the membership weight function $\psi_i:\mathbf{R}^d\to\mathbf{R}$ defined as

    $$\psi_i(x|\bm{\mu}(t))=\Pr[i|x]=\frac{\pi_i\exp\left(-\frac{\|x-\mu_i(t)\|^2}{2}\right)}{\sum_{k\in[n]}\pi_k\exp\left(-\frac{\|x-\mu_k(t)\|^2}{2}\right)}. \tag{2}$$

  • M step: Define $Q(\cdot|\bm{\mu}(t))$ as

    $$Q(\bm{\mu}|\bm{\mu}(t))=\mathbf{E}_{x\sim\mathcal{N}(0,I_d)}\left[\sum_{i=1}^{n}-\psi_i(x|\bm{\mu}(t))\frac{\|x-\mu_i\|^2}{2}\right],$$

    Gradient EM with step size $\eta>0$ performs the following update:

    $$\mu_i(t+1)=\mu_i(t)-\eta\nabla_{\mu_i}Q(\bm{\mu}(t)|\bm{\mu}(t))=\mu_i(t)-\eta\,\mathbf{E}_{x\sim\mathcal{N}(0,I_d)}\left[\psi_i(x|\bm{\mu}(t))(\mu_i(t)-x)\right]. \tag{3}$$

The membership weight function $x\to\psi_i(x|\bm{\mu})$ represents the posterior probability of data point $x$ being sampled from the $i^{\text{th}}$ Gaussian of $\text{GMM}(\bm{\mu})$. For ease of notation, we sometimes simply write $\psi_i(x|\bm{\mu})$ as $\psi_i(x)$ when the choice of $\bm{\mu}$ is obvious.
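The two steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the population expectation over $x\sim\mathcal{N}(0,I_d)$ is replaced by an average over a finite sample, the update implements the right-most expression in (3), and the names `responsibilities` and `gradient_em_step`, the sample size, and the step size are hypothetical choices rather than anything prescribed by the paper.

```python
import numpy as np

def responsibilities(x, mus, pis):
    """Membership weights psi_i(x | mu) from (2); returns shape (m, n)."""
    sq_dist = ((x[:, None, :] - mus[None, :, :]) ** 2).sum(axis=-1)
    logits = np.log(pis)[None, :] - 0.5 * sq_dist
    logits -= logits.max(axis=1, keepdims=True)    # stabilise the softmax
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def gradient_em_step(mus, pis, x, eta):
    """One update mu_i <- mu_i - eta * mean_x[psi_i(x) (mu_i - x)]."""
    psi = responsibilities(x, mus, pis)             # E step, shape (m, n)
    # Sample average of psi_i(x) * (mu_i - x), shape (n, d).
    grad = psi.mean(axis=0)[:, None] * mus - psi.T @ x / x.shape[0]
    return mus - eta * grad                         # gradient step

# Illustrative run: n = 3 student components, teacher N(0, I_2).
rng = np.random.default_rng(0)
d, n = 2, 3
x = rng.standard_normal((20000, d))                 # samples from N(0, I_d)
pis = np.full(n, 1.0 / n)
mus = rng.standard_normal((n, d))
for _ in range(200):
    mus = gradient_em_step(mus, pis, x, eta=0.1)
```

Working with log-weights before normalizing avoids underflow when some mean is far from a sample; this is a numerical convenience and does not change the quantity in (2).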

1.3 Loss function of gradient EM

Since the task of gradient EM is to find the MLE over the ground truth distribution $p_{\bm{\mu}^*}$, we can define the MLE loss function for gradient EM as

$$\mathcal{L}(\bm{\mu})=D_{\text{KL}}(p_{\bm{\mu}^*}\,\|\,p_{\bm{\mu}})=-\mathbf{E}_{x\sim p_{\bm{\mu}^*}}\left[\log\left(\frac{p_{\bm{\mu}}(x)}{p_{\bm{\mu}^*}(x)}\right)\right]. \tag{4}$$

The loss $\mathcal{L}$ is the Kullback–Leibler (KL) divergence between the ground truth GMM and the student GMM. Since finding the MLE is equivalent to minimizing the KL divergence between the model and the ground truth, the goal of gradient EM is equivalent to finding the global minimum of the loss $\mathcal{L}$. In other words, proving that gradient EM finds the MLE is equivalent to proving the convergence of $\mathcal{L}$ to $0$. The loss function $\mathcal{L}$ is important for a second reason: it is also closely related to the dynamics of gradient EM.
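For intuition, the loss in (4) can be approximated by averaging the log-density ratio over samples from the teacher $\mathcal{N}(0,I_d)$. The sketch below assumes NumPy and uses a hypothetical helper name `kl_loss_estimate`; it is a Monte Carlo estimate, not the population quantity analyzed in the paper.

```python
import numpy as np

def kl_loss_estimate(mus, pis, x):
    """Sample estimate of L(mu) = E_x[log p_{mu*}(x) - log p_mu(x)], x ~ N(0, I_d)."""
    sq_dist = ((x[:, None, :] - mus[None, :, :]) ** 2).sum(axis=-1)   # (m, n)
    a = np.log(pis)[None, :] - 0.5 * sq_dist
    a_max = a.max(axis=1, keepdims=True)
    # log sum_i pi_i exp(-||x - mu_i||^2 / 2), computed stably.
    # The (2 pi)^{-d/2} factors of teacher and student cancel in the ratio.
    log_student = a_max[:, 0] + np.log(np.exp(a - a_max).sum(axis=1))
    log_teacher = -0.5 * (x ** 2).sum(axis=1)
    return float((log_teacher - log_student).mean())
```

At $\bm{\mu}=\bm{\mu}^*=0$ the two log-densities coincide for every sample, so the estimate vanishes up to floating-point error, matching the statement that the global minimum of $\mathcal{L}$ is $0$.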

Gradient EM is gradient descent on $\mathcal{L}$. We present the following important observation. The proof is deferred to the appendix.

Fact 1.

For any $\bm{\mu}$, $\nabla Q(\bm{\mu}|\bm{\mu})=\nabla\mathcal{L}(\bm{\mu})$.

Fact 1 states that the gradient of the $Q$ function that gradient EM optimizes in each iteration is identical to the gradient of the loss function $\mathcal{L}$. This observation is very useful since it implies that gradient EM is equivalent to the gradient descent (GD) algorithm on $\mathcal{L}$. The observation is not new but is widespread folklore (see [** et al., 2016]). Our contribution is to observe that Fact 1 is very helpful for analyzing gradient EM, and to construct a new convergence analysis framework for gradient EM based on it.
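The gradient of $\mathcal{L}$ can be worked out directly from (1), (2), and (4). The short calculation below is a sketch rather than the appendix proof; it assumes that the gradient and the expectation can be exchanged, and it uses the fact that $p_{\bm{\mu}^*}$ does not depend on $\bm{\mu}$.

```latex
\begin{align*}
\nabla_{\mu_i}\log p_{\bm{\mu}}(x)
  &= \frac{\pi_i\,\phi(x|\mu_i,I_d)\,(x-\mu_i)}{p_{\bm{\mu}}(x)}
   = \psi_i(x|\bm{\mu})\,(x-\mu_i), \\
\nabla_{\mu_i}\mathcal{L}(\bm{\mu})
  &= -\mathbf{E}_{x\sim\mathcal{N}(0,I_d)}\left[\nabla_{\mu_i}\log p_{\bm{\mu}}(x)\right]
   = \mathbf{E}_{x\sim\mathcal{N}(0,I_d)}\left[\psi_i(x|\bm{\mu})\,(\mu_i-x)\right].
\end{align*}
```

The last expression is the expectation appearing on the right-hand side of (3), so each gradient EM update moves $\mu_i$ by $-\eta\nabla_{\mu_i}\mathcal{L}(\bm{\mu}(t))$.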

1.4 Notations

In this paper, we adopt the following notational conventions. We denote $\{1,2,\ldots,n\}$ by $[n]$. $\bm{\mu}=(\mu_1^\top,\ldots,\mu_n^\top)^\top\in\mathbf{R}^{nd}$ denotes the parameter vector of the GMM obtained by concatenating the Gaussian mean vectors $\mu_1,\ldots,\mu_n$. For any vector $\mu$, $\mu(t)$ denotes its value at time step $t$; we sometimes omit the iteration number $t$ when its choice is clear and simply abbreviate $\mu(t)$ as $\mu$. We write $\mathbf{E}_x[\cdot]$ as shorthand for the expectation over the ground truth GMM, $\mathbf{E}_{x\sim\mathcal{N}(0,I_d)}[\cdot]$. For any vector $v\neq 0$, we use $\overline{v}\coloneqq v/\|v\|$ to denote the normalization of $v$. We define (with a slight abuse of notation) $i_{\max}\coloneqq\arg\max_{i\in[n]}\{\|\mu_i\|\}$ as the index of $\mu_i$ with the maximum norm, and $\mu_{\max}\coloneqq\|\mu_{i_{\max}}\|=\max_{i\in[n]}\{\|\mu_i\|\}$ as the maximum norm of $\mu_i$. In particular, $\mu_{\max}(t)=\max\{\|\mu_1(t)\|,\ldots,\|\mu_n(t)\|\}$. Similarly, $\pi_{\min}\coloneqq\min_{i\in[n]}\pi_i$ and $\pi_{\max}\coloneqq\max_{i\in[n]}\pi_i$ denote the minimal and maximal $\pi_i$, respectively. We use $\nabla_{\mu_i}\mathcal{L}$ to denote the gradient of $\mathcal{L}$ with respect to $\mu_i$, and $\nabla\mathcal{L}=(\nabla_{\mu_1}\mathcal{L}^\top,\ldots,\nabla_{\mu_n}\mathcal{L}^\top)^\top$ denotes the collection of all gradients.

1.5 Technical overview

Here we provide a brief summary of the major technical barriers for our global convergence analysis and our techniques for overcoming them.

New likelihood-based analysis framework. The traditional convergence analysis for EM/gradient EM in previous works Balakrishnan et al. [2014], Yan et al., Kwon and Caramanis [2020] proceeds by showing that the distance between the model and the ground truth GMM in the parameter space contracts linearly in every iteration. This type of approach meets new challenges in the over-parameterized $n$-Gaussian mixture setting since the convergence is both sub-linear and non-monotonic. To address these problems, we propose a new likelihood-based convergence analysis framework: instead of proving the convergence of parameters, our analysis proceeds by showing that the likelihood loss function $\mathcal{L}$ converges to $0$. The new analysis framework is more flexible and allows us to overcome the aforementioned technical barriers.

Gradient lower bound. The first step of our global convergence analysis constructs a gradient lower bound. Using some algebraic transformation techniques, we convert the gradient projection $\left\langle\nabla\mathcal{L}(\bm{\mu}),\bm{\mu}\right\rangle$ into the expected squared norm of a random vector $\tilde{\bm{\psi}}(x)$ (see Section 4 for the full definition). Although lower bounding the expectation of $\tilde{\bm{\psi}}$ is very challenging, our key idea is that the gradient of $\tilde{\bm{\psi}}$ has very nice properties and can be easily lower bounded, allowing us to establish the gradient lower bound.

Local smoothness and regularity condition. After obtaining the gradient lower bound, the missing component of the proof is a smoothness condition of the loss function $\mathcal{L}$. Since proving the smoothness of $\mathcal{L}$ is hard in general, we define and prove a weaker notion of local smoothness, which suffices to prove our result. In addition, we design and use an auxiliary function $U$ to show that the gradient EM trajectory satisfies the locality required by our smoothness lemma.

2 Related work

2.1 2-Gaussian mixtures

There is a vast literature studying the convergence of EM/gradient EM on $2$-component GMM. The initial batch of results proves convergence within an infinitesimally small local region [Xu and Jordan, 1996, Ma et al., 2000]. Balakrishnan et al. [2014] prove for the first time convergence of EM and gradient EM within a non-infinitesimal local region. Among the later works on the same problem, Klusowski and Brinda [2016] improve the basin of convergence guarantee, and Daskalakis et al. [2017] and Xu et al. [2016] prove global convergence for $2$-Gaussian mixtures. These works focus on the exact-parameterization scenario where the number of student mixtures is the same as that of the ground truth. More recently, Wu and Zhou [2019] prove global convergence of $2$-component GMM without any separation condition. Their result can be viewed as a convergence result in the over-parameterized setting where the student model has two Gaussians and the ground truth is a single Gaussian. On the other hand, their setting is more restricted than ours because they require the means of the two Gaussians in the student model to be symmetric around the ground truth mean. Weinberger and Bresler [2021] extend the convergence guarantee to the case of unbalanced weights. Another line of work Dwivedi et al. [2018b, 2019, a] studies the over-parameterized setting of using a $2$-Gaussian mixture to learn a single Gaussian and proves global convergence of EM. Our result extends this type of analysis to the general case of $n$-Gaussian mixtures, which requires significantly different techniques. We note that, going beyond Gaussian mixture models, there are also works studying EM algorithms for other mixture models such as mixtures of linear regressions Kwon et al. [2019].

2.2 N-Gaussian mixtures

Another line of results focuses on the general case of $n$-Gaussian mixtures. ** et al. [2016] provides a counter-example showing that EM does not converge globally for $n>2$ (in the exact-parameterized case). Dasgupta and Schulman [2000] prove that a variant of EM converges to the MLE in two rounds for $n$-GMM. Their result relies on a modification of the EM algorithm and is not comparable with ours. Chen et al. [2023] analyze the structure of local minima in the likelihood function of GMM. However, their result is purely geometric and does not provide any convergence guarantee.

A series of papers Yan et al., Zhao et al., Kwon and Caramanis [2020], Segol and Nadler follows the framework proposed by Balakrishnan et al. [2014] to prove the local convergence of EM for $n$-GMM. While their results apply to the more general $n$-Gaussian mixture ground truth setting, their framework only provides local convergence guarantees and cannot be directly applied to our setting.

2.3 Slowdown due to over-parameterization

This paper gives an $O\left(1/\sqrt{t}\right)$ bound for fitting over-parameterized Gaussian mixture models to a single Gaussian. Recall that to learn a single Gaussian, if one’s student model is also a single Gaussian, then one can obtain an $\exp(-\Omega(t))$ rate because the loss is strongly convex. This slowdown effect due to over-parameterization has been observed for Gaussian mixtures in Dwivedi et al. [2018a], Wu and Zhou [2019], but has also been observed in other learning problems, such as learning a two-layer neural network Xu and Du [2023], Richert et al. [2022] and matrix sensing problems [Xiong et al., 2023, Zhang et al., 2021, Zhuo et al., 2021].

3 Main results

In this section, we present our main theoretical result, which consists of two parts: in Section 3.1 we present our global convergence analysis of gradient EM, and in Section 3.2 we prove that an exponentially small factor in our convergence bound is inevitable and cannot be removed. All omitted proofs are deferred to the appendix.

3.1 Global convergence of gradient EM

We first present our main result, which states that gradient EM converges to MLE globally.

Theorem 2 (Main result).

Consider training a student $n$-component GMM initialized from $\bm{\mu}(0)=(\mu_{1}(0)^{\top},\ldots,\mu_{n}(0)^{\top})^{\top}$ to learn a single-component ground truth GMM $\mathcal{N}(0,I_{d})$ with population gradient EM. If the step size satisfies $\eta\leq O\left(\frac{\exp\left(-8n\mu_{\max}^{2}(0)\right)\pi_{\min}^{2}}{n^{2}d^{2}(\frac{1}{\mu_{\max}(0)}+\mu_{\max}(0))^{2}}\right)$, then gradient EM converges globally with rate

$$\mathcal{L}(\bm{\mu}(t))\leq\frac{1}{\sqrt{\gamma t}},$$

where the constant $\gamma=\Omega\left(\frac{\eta\exp\left(-16n\mu_{\max}^{2}(0)\right)\pi_{\min}^{4}}{n^{2}d^{2}(1+\mu_{\max}(0)\sqrt{dn})^{4}}\right)\in\mathbf{R}^{+}$ and $\mu_{\max}(0)=\max\{\|\mu_{1}(0)\|,\ldots,\|\mu_{n}(0)\|\}$.
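The setting of Theorem 2 can be simulated with a short sketch. It assumes the gradient EM update takes the standard form $\mu_i \leftarrow \mu_i + \eta\,\mathbf{E}_x[\psi_i(x)(x-\mu_i)]$ with unit covariances and fixed mixing weights, and it replaces the population expectation by a sample mean over fresh draws from $\mathcal{N}(0,I_d)$. The function names and the numerical settings are illustrative choices, not taken from the paper.

```python
import numpy as np

def responsibilities(x, mu, pi):
    """Posterior membership weights psi_i(x) for unit-covariance components.

    x: (m, d) samples, mu: (n, d) means, pi: (n,) fixed mixing weights.
    """
    sq_dist = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # (m, n)
    logits = np.log(pi)[None, :] - 0.5 * sq_dist
    logits -= logits.max(axis=1, keepdims=True)  # avoid underflow in exp
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def gradient_em_step(mu, pi, eta, x):
    """One gradient EM step with the expectation replaced by a sample mean."""
    psi = responsibilities(x, mu, pi)
    # Sample estimate of E[psi_i(x) (x - mu_i)] for every component i.
    ascent = psi.T @ x / len(x) - psi.mean(axis=0)[:, None] * mu
    return mu + eta * ascent

def sum_sq_norms(mu):
    """Sum over components of ||mu_i||^2, a simple progress measure."""
    return float((mu ** 2).sum())

# Hypothetical settings, chosen only to keep the demo fast.
rng = np.random.default_rng(0)
d, n, eta, steps, m = 5, 3, 0.5, 200, 20000
pi = np.full(n, 1.0 / n)
mu = rng.normal(size=(n, d))
u_start = sum_sq_norms(mu)
for _ in range(steps):
    x = rng.standard_normal((m, d))  # fresh draws from N(0, I_d)
    mu = gradient_em_step(mu, pi, eta, x)
u_end = sum_sq_norms(mu)
print(f"start: {u_start:.3f}, after {steps} steps: {u_end:.3f}")
```

Because the ground truth is $\mathcal{N}(0,I_d)$, all student means should drift toward the origin; the sketch only tracks that drift and makes no claim about constants in the rate.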

Remark 3.

Without over-parameterization, for learning a single Gaussian, one can obtain linear convergence $\exp(-\Omega(t))$. We note that the sublinear convergence rate of gradient EM stated in Theorem 2, $\mathcal{L}(\bm{\mu}(t))\leq O(1/\sqrt{t})$, is due to the inherent nature of the algorithm. Dwivedi et al. [2018b] studied the special case of using a mixture of 2 Gaussians with symmetric means to learn a single Gaussian and proved that EM has a sublinear convergence rate when the weights $\pi_{i}$ are equal. Since Theorem 2 studies the more general case of $n$-component Gaussian mixtures, this type of subexponential convergence rate is the best that we can hope for.

Remark 4.

The convergence rate in Theorem 2 has a factor exponentially small in the initialization scale ($\gamma\propto\exp(-16n\mu_{\max}^{2}(0))$). We stress that this is again due to the algorithmic nature of the problem rather than a limitation of the analysis. In Section 3.2, we prove that there exist bad regions with exponentially small gradients, so that when initialized from such a region, gradient EM gets trapped locally for $\exp(\Omega(\mu_{\max}^{2}(0)))$ steps. Therefore, a convergence guarantee with a factor exponentially small in the square of the initialization scale is inevitable and cannot be improved.

Remark 5.

Theorem 2 is fundamentally different from the convergence analyses for EM/gradient EM in previous works (Yan et al., Dwivedi et al. [2019], Balakrishnan et al. [2014]), which proved monotonic linear contraction of the parameter distance $\|\bm{\mu}(t)-\bm{\mu}^{*}\|$. Nevertheless, our result also implies global convergence, since the loss function $\mathcal{L}$ converging to $0$ is equivalent to convergence of gradient EM to the MLE.

Remark 6.

The convergence result in Theorem 2 is for population gradient EM, but it also implies global convergence for sample-based gradient EM as the sample size tends to infinity. For a similar reduction from population EM to sample EM, see Section 2.2 of [Xu et al., 2016].

3.2 Necessity of exponentially small factor in convergence rate

In this section we prove that a factor of $\exp(-\Theta(\mu_{\max}^{2}(0)))$ is inevitable in the global convergence rate guarantee of gradient EM. To demonstrate this, we show that there exists a bad region such that initializing from this region will trap gradient EM for an exponentially long time before it converges to the global minimum. Our result is the following theorem.

Theorem 7 (Existence of bad initialization region).

For any $n=2l+1$, consider gradient EM initialized at the point $\mu_{1}(0)=0$, $\mu_{2}(0)=\cdots=\mu_{l+1}(0)=12\sqrt{d}e_{1}$, $\mu_{l+2}(0)=\cdots=\mu_{2l+1}(0)=-12\sqrt{d}e_{1}$, where $e_{1}=(1,0,\ldots,0)^{\top}$ is a standard unit vector. Then population gradient EM will be trapped in a bad local region around $\bm{\mu}(0)$ for exponentially many time steps $T=\frac{1}{15n\eta}e^{d}=\frac{1}{15n\eta}\exp(\Theta(\mu_{\max}^{2}(0)))$. More rigorously, for any $0\leq t\leq T$, we have

$$\|\mu_{i}(t)\|\geq 10\sqrt{d},\quad\forall i\neq 1.$$

Theorem 7 states that, when initialized from some bad points $\bm{\mu}(0)$, after $\exp(\Theta(\mu_{\max}^{2}(0)))$ time steps, gradient EM will still stay in this local region and remain at distance at least $10\sqrt{d}$ from the global minimum $\bm{\mu}=0$. Therefore an exponentially small factor in the convergence rate is inevitable.
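The smallness of the gradient at such an initialization can be probed numerically. The sketch below estimates $\|\nabla\mathcal{L}(\bm{\mu}(0))\|$ by Monte Carlo, using the expression $\mathbf{E}_x[\psi_i(x)\sum_k\psi_k(x)\mu_k]$ for the gradient. Several choices are assumptions made for illustration: uniform mixing weights (Theorem 7 does not fix them here), $n=3$, a scale constant of $2$ instead of $12$ to keep the quantities away from floating-point underflow, and the helper names themselves.

```python
import numpy as np

def responsibilities(x, mu, pi):
    """Posterior membership weights psi_i(x) for unit-covariance components."""
    sq_dist = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # (m, n)
    logits = np.log(pi)[None, :] - 0.5 * sq_dist
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def grad_norm_at_bad_init(d, scale=2.0, n=3, m=100000, seed=0):
    """Monte Carlo estimate of the gradient norm at a Theorem 7 style point."""
    rng = np.random.default_rng(seed)
    half = (n - 1) // 2
    mu = np.zeros((n, d))                      # mu_1 = 0
    mu[1:half + 1, 0] = scale * np.sqrt(d)     # +scale * sqrt(d) * e_1
    mu[half + 1:, 0] = -scale * np.sqrt(d)     # -scale * sqrt(d) * e_1
    pi = np.full(n, 1.0 / n)                   # assumption: uniform weights
    x = rng.standard_normal((m, d))            # samples from N(0, I_d)
    psi = responsibilities(x, mu, pi)          # (m, n)
    psi_tilde = psi @ mu                       # (m, d), sum_k psi_k(x) mu_k
    grad = psi.T @ psi_tilde / m               # (n, d), E[psi_i(x) psi_tilde(x)]
    return float(np.linalg.norm(grad))

dims = [2, 4, 6, 8]
norms = [grad_norm_at_bad_init(d) for d in dims]
for d, g in zip(dims, norms):
    print(f"d = {d}: estimated gradient norm = {g:.5f}")
```

For larger $d$ the relevant tail events become too rare for this sample size, so the estimate is only meaningful for small dimensions.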

Remark 8.

Theorem 7 eliminates the possibility of proving any polynomial convergence rate of gradient EM from arbitrary initialization. However, it is still possible to prove that, with some specific smart initialization schemes, gradient EM avoids the bad regions stated in Theorem 7 and enjoys a polynomial convergence rate. We leave this as an interesting open question for future analysis.

4 Proof overview

In this section, we provide a technical overview of the proofs of our main results (Theorem 2 and Theorem 7).

4.1 Difficulties of a global convergence proof and our new analysis framework

Proving the global convergence of gradient EM for general $n$-component Gaussian mixtures is highly nontrivial. While there have been many previous works [Balakrishnan et al., 2014; Yan et al.; Dwivedi et al., 2018b] studying either local convergence or the special case of $2$-Gaussian mixtures, they all focus on showing the contraction of the parametric error. Namely, their proofs proceed by showing that the distance between the model parameter and the ground truth contracts, usually by a fixed linear ratio, in each iteration of the algorithm. However, this kind of approach faces various challenges for our general problem, where the convergence is both sublinear and non-monotonic. Since the convergence rate is sublinear (see Remark 3), showing a linear contraction per iteration is no longer possible. Since the convergence is non-monotonic, we also cannot show a strictly decreasing parametric distance. (To see the non-monotonicity, consider $n=2$, $\mu_{1}=0$, $\mu_{2}=(1,0,\ldots,0)^{\top}$; then the norm of $\mu_{1}$ strictly increases after one iteration.)

To address these challenges, we propose a new convergence analysis framework for gradient EM by proving the convergence of the likelihood loss $\mathcal{L}$ instead of the convergence of the parameters $\bm{\mu}$. There are several benefits to considering convergence from the perspective of the MLE loss $\mathcal{L}$. Firstly, it naturally addresses the problem of non-monotonic and sublinear convergence, since we only need to show that $\mathcal{L}$ decreases as the algorithm updates. Also, since gradient EM is equivalent to running gradient descent on the loss function $\mathcal{L}$ (see Section 1.3), we can apply techniques from the optimization theory of gradient descent to facilitate our analysis.
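The equivalence between gradient EM and gradient descent on the likelihood loss can be checked on a fixed sample. The sketch assumes unit covariances and fixed weights, takes the gradient EM direction to be the sample mean of $\psi_i(x)(x-\mu_i)$, and compares it with a finite-difference gradient of the sample negative log-likelihood, which differs from $\mathcal{L}$ only by a term independent of $\bm{\mu}$. All function names and settings are illustrative.

```python
import numpy as np

def log_weights(mu, pi, x):
    """log(pi_i) - ||x - mu_i||^2 / 2 for every sample and component."""
    sq_dist = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    return np.log(pi)[None, :] - 0.5 * sq_dist

def sample_loss(mu, pi, x):
    """Sample negative log-likelihood, dropping mu-independent constants."""
    a = log_weights(mu, pi, x)
    a_max = a.max(axis=1, keepdims=True)
    log_mix = a_max[:, 0] + np.log(np.exp(a - a_max).sum(axis=1))
    return float(-log_mix.mean())

def gradient_em_direction(mu, pi, x):
    """Sample mean of psi_i(x) (x - mu_i), the assumed gradient EM direction."""
    a = log_weights(mu, pi, x)
    a -= a.max(axis=1, keepdims=True)
    psi = np.exp(a)
    psi /= psi.sum(axis=1, keepdims=True)
    return psi.T @ x / len(x) - psi.mean(axis=0)[:, None] * mu

def finite_difference_gradient(f, mu, h=1e-5):
    """Central finite differences of a scalar function of the means."""
    grad = np.zeros_like(mu)
    for idx in np.ndindex(*mu.shape):
        step = np.zeros_like(mu)
        step[idx] = h
        grad[idx] = (f(mu + step) - f(mu - step)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
n, d = 3, 2
pi = np.array([0.5, 0.3, 0.2])
mu = rng.normal(size=(n, d))
x = rng.standard_normal((5000, d))
fd = finite_difference_gradient(lambda p: sample_loss(p, pi, x), mu)
# The gradient EM direction should equal minus the gradient of the loss.
max_gap = float(np.abs(fd + gradient_em_direction(mu, pi, x)).max())
print(f"max |grad loss + EM direction| = {max_gap:.2e}")
```

On a fixed sample the two quantities agree up to finite-difference error, so a gradient EM step with step size $\eta$ is a gradient descent step on the sample loss.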

4.2 Proof ideas for Theorem 2

We first briefly outline our proof of Theorem 2.

Proof roadmap. Our proof of Theorem 2 consists of three steps. Firstly, we prove a gradient lower bound for $\mathcal{L}$ (Lemma 12). Then we prove that the MLE loss $\mathcal{L}$ is locally smooth (Theorem 13). Finally, we combine the gradient lower bound and the smoothness condition to prove the global convergence of $\mathcal{L}$ by mathematical induction.

Step 1: Gradient lower bound.

Our first step aims to show that the gradient norm of $\mathcal{L}(\bm{\mu})$ is lower bounded by the distance of $\bm{\mu}$ to the ground truth. To do this, we need a few preliminary results. Inspired by Chen et al. [2023], we use Stein's identity [Stein, 1981] to perform an algebraic transformation of the gradient. Recalling the definition of $\psi_{i}$ in (2), we have the following lemma.

Lemma 9.

For any $\text{GMM}(\bm{\mu})$ and $i\in[n]$, the gradient of $Q$ satisfies

$$\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu})=\nabla_{\mu_{i}}Q(\bm{\mu}|\bm{\mu})=\mathbf{E}_{x}\left[\psi_{i}(x)\sum_{k\in[n]}\psi_{k}(x)\mu_{k}\right].$$

The gradient expression above is equivalent to the form in (3), but is easier to manipulate. Using the transformed gradient in Lemma 9, we have the following corollary.
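Before turning to the corollary, the identity in Lemma 9 can be sanity-checked by Monte Carlo. The sketch below assumes that the untransformed gradient is $\mathbf{E}_x[\psi_i(x)(\mu_i-x)]$ with $x\sim\mathcal{N}(0,I_d)$ (the standard form for unit-covariance mixtures with fixed weights) and compares it with the Lemma 9 expression. It also checks that the inner product of the Lemma 9 gradient with $\bm{\mu}$ equals the sample mean of $\|\sum_i\psi_i(x)\mu_i\|^2$, which holds exactly on any sample. The means, weights and sample size are arbitrary illustrative values.

```python
import numpy as np

def responsibilities(x, mu, pi):
    """Posterior membership weights psi_i(x) for unit-covariance components."""
    sq_dist = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # (m, n)
    logits = np.log(pi)[None, :] - 0.5 * sq_dist
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
m = 200000
pi = np.array([0.5, 0.3, 0.2])
mu = np.array([[1.0, 0.0], [-0.5, 0.8], [0.2, -1.0]])
x = rng.standard_normal((m, mu.shape[1]))   # ground truth N(0, I_d)

psi = responsibilities(x, mu, pi)           # (m, n)
psi_tilde = psi @ mu                        # (m, d), sum_k psi_k(x) mu_k

# Lemma 9 form: E[psi_i(x) * sum_k psi_k(x) mu_k]
stein_form = psi.T @ psi_tilde / m
# Assumed untransformed form: E[psi_i(x) (mu_i - x)]
direct_form = psi.mean(axis=0)[:, None] * mu - psi.T @ x / m

gap = float(np.abs(stein_form - direct_form).max())
projection = float((stein_form * mu).sum())
mean_sq_norm = float((psi_tilde ** 2).sum(axis=1).mean())
print(f"max gap between the two forms: {gap:.4f}")
print(f"projection: {projection:.6f}, mean squared norm: {mean_sq_norm:.6f}")
```

The two gradient forms agree only in expectation, so the first gap shrinks with the sample size, while the projection identity is algebraic and holds to rounding error.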

Corollary 10.

Define the vector $\tilde{\bm{\psi}}_{\bm{\mu}}(x)\coloneqq\sum_{i\in[n]}\psi_{i}(x)\mu_{i}$. For any $\text{GMM}(\bm{\mu})$, the projection of the gradient $\nabla\mathcal{L}(\bm{\mu})$ onto $\bm{\mu}$ satisfies

$$\left\langle\nabla\mathcal{L}(\bm{\mu}),\bm{\mu}\right\rangle=\left\langle\nabla_{\bm{\mu}}Q(\bm{\mu}|\bm{\mu}),\bm{\mu}\right\rangle=\sum_{i\in[n]}\left\langle\nabla_{\mu_{i}}Q(\bm{\mu}|\bm{\mu}),\mu_{i}\right\rangle=\mathbf{E}_{x}\left[\left\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\right\|^{2}\right].$$

Corollary 10 is important since it converts the projection of the gradient $\nabla\mathcal{L}(\bm{\mu})$ onto $\bm{\mu}$ into the expected squared norm of the vector $\tilde{\bm{\psi}}_{\bm{\mu}}$. Since a lower bound on the gradient projection implies a lower bound on the gradient, we only need to construct a lower bound for $\left\langle\nabla\mathcal{L}(\bm{\mu}),\bm{\mu}\right\rangle=\mathbf{E}_{x}\left[\left\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\right\|^{2}\right]$. Since $\left\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\right\|^{2}$ is always non-negative, we already know that the gradient projection is non-negative. But lower bounding $\mathbf{E}_{x}\left[\left\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\right\|^{2}\right]$ is still highly nontrivial since the expression of $\tilde{\bm{\psi}}$ is complicated and hard to handle. However, our key observation is that, although $\tilde{\bm{\psi}}$ itself is hard to bound, its gradient has nice properties and can be handled gracefully:

$$\nabla_{x}\tilde{\bm{\psi}}_{\bm{\mu}}(x)=\frac{1}{2}\sum_{i,j\in[n]}\psi_{i}(x)\psi_{j}(x)(\mu_{i}-\mu_{j})(\mu_{i}-\mu_{j})^{\top}.\tag{5}$$

The gradient (5) is well-behaved. One can see immediately from (5) that the matrix $\nabla_{x}\tilde{\bm{\psi}}_{\bm{\mu}}(x)$ is positive semi-definite, and its eigenvalues can be directly bounded. To utilize these properties, we use the following algebraic trick to convert the task of lower bounding $\tilde{\bm{\psi}}$ itself into the task of lower bounding its gradient.

$$\mathbf{E}_{x}\left[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^{2}\right]\geq\frac{1}{4}\mathbf{E}_{x}\left[\left(\int_{t=-1}^{1}\|x\|\cdot\overline{x}^{\top}\nabla\tilde{\bm{\psi}}_{\bm{\mu}}(tx)\overline{x}\,\mathrm{d}t\right)^{2}\right].\tag{6}$$

Recall that $\overline{x}=\frac{x}{\|x\|}$. See the detailed derivation in (23). Using (5), combined with the properties of $\nabla_{x}\tilde{\bm{\psi}}_{\bm{\mu}}(x)$, we can obtain the following lemma.
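The closed form (5) and its positive semi-definiteness can be verified numerically before the lemma is stated. The sketch below evaluates the right-hand side of (5) at a random point and compares it with a central finite-difference Jacobian of $\tilde{\bm{\psi}}_{\bm{\mu}}$. It assumes that $\psi_i$ is the usual posterior weight of a unit-covariance component; the function names and test values are illustrative.

```python
import numpy as np

def psi_tilde(x, mu, pi):
    """Return sum_i psi_i(x) mu_i and the weights psi at a single point x."""
    logits = np.log(pi) - 0.5 * ((x[None, :] - mu) ** 2).sum(axis=1)
    w = np.exp(logits - logits.max())
    psi = w / w.sum()
    return psi @ mu, psi

def jacobian_closed_form(x, mu, pi):
    """Right-hand side of (5): weighted sum of outer products of mean gaps."""
    _, psi = psi_tilde(x, mu, pi)
    diff = mu[:, None, :] - mu[None, :, :]                 # (n, n, d)
    outer = diff[..., :, None] * diff[..., None, :]        # (n, n, d, d)
    weights = psi[:, None] * psi[None, :]                  # (n, n)
    return 0.5 * (weights[..., None, None] * outer).sum(axis=(0, 1))

def jacobian_finite_diff(x, mu, pi, h=1e-5):
    """Central finite-difference Jacobian of psi_tilde with respect to x."""
    d = x.size
    jac = np.zeros((d, d))
    for k in range(d):
        step = np.zeros(d)
        step[k] = h
        plus = psi_tilde(x + step, mu, pi)[0]
        minus = psi_tilde(x - step, mu, pi)[0]
        jac[:, k] = (plus - minus) / (2 * h)
    return jac

rng = np.random.default_rng(0)
n, d = 4, 3
pi = np.array([0.4, 0.3, 0.2, 0.1])
mu = rng.normal(size=(n, d))
x = rng.normal(size=d)

closed = jacobian_closed_form(x, mu, pi)
numeric = jacobian_finite_diff(x, mu, pi)
min_eig = float(np.linalg.eigvalsh(closed).min())
print("max abs difference:", float(np.abs(closed - numeric).max()))
print("smallest eigenvalue:", min_eig)
```

The closed form is a non-negatively weighted sum of rank-one matrices $(\mu_i-\mu_j)(\mu_i-\mu_j)^{\top}$, which is why its smallest eigenvalue is non-negative up to rounding.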

Lemma 11.

For any $\text{GMM}(\bm{\mu})$ we have

$$\mathbf{E}_{x}\left[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^{2}\right]\geq\frac{\exp\left(-8\mu_{\max}^{2}\right)}{40000d(1+2\mu_{\max}\sqrt{d})^{2}}\left(\sum_{i,j\in[n]}\pi_{i}\pi_{j}\|\mu_{i}-\mu_{j}\|^{2}\right)^{2}.$$

On top of Lemma 11, we can easily lower bound the gradient projection in the following lemma, finishing the first step of our proof.

Lemma 12 (Gradient projection lower bound).

For any $\text{GMM}(\bm{\mu})$ we have

$$\left\langle\nabla_{\bm{\mu}}Q(\bm{\mu}|\bm{\mu}),\bm{\mu}\right\rangle=\mathbf{E}_{x}\left[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^{2}\right]=\Omega\left(\frac{\exp\left(-8\mu_{\max}^{2}\right)\pi_{\min}^{2}}{d(1+\mu_{\max}\sqrt{d})^{2}}\mu_{\max}^{4}\right).$$

Step 2: Local smoothness.

To construct a global convergence analysis for gradient-based methods, after obtaining a gradient lower bound, we still need to prove the smoothness of the loss $\mathcal{L}$. (Recall that global smoothness of a function $f$ means that there exists a constant $C$ such that $\|\nabla f(x_{1})-\nabla f(x_{2})\|\leq C\|x_{1}-x_{2}\|$ for all $x_{1},x_{2}$.) However, proving smoothness for $\mathcal{L}$ in general is very challenging since the membership function $\psi_{i}$ cannot be bounded when $\bm{\mu}$ is unbounded. To address this issue, we prove that $\mathcal{L}$ is locally smooth, i.e., the smoothness between two points $\bm{\mu}$ and $\bm{\mu}^{\prime}$ is satisfied if both $\|\bm{\mu}\|$ and $\|\bm{\mu}-\bm{\mu}^{\prime}\|$ are upper bounded. Our result is the following theorem.

Theorem 13 (Local smoothness of loss function).

At any two points $\bm{\mu}=(\mu_{1}^{\top},\ldots,\mu_{n}^{\top})^{\top}$ and $\bm{\mu}+\bm{\delta}=((\mu_{1}+\delta_{1})^{\top},\ldots,(\mu_{n}+\delta_{n})^{\top})^{\top}$, if

$$\|\delta_{i}\|\leq\frac{1}{\max\left\{6d,2\|\mu_{i}\|\right\}},\quad\forall i\in[n],$$

then the loss function $\mathcal{L}$ satisfies the following smoothness property: for any $i\in[n]$ we have

$$\left\|\nabla_{\mu_{i}+\delta_{i}}\mathcal{L}(\bm{\mu}+\bm{\delta})-\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu})\right\|\leq n\mu_{\max}(30\sqrt{d}+4\mu_{\max})\|\delta_{i}\|+\sum_{k\in[n]}\|\delta_{k}\|.\tag{7}$$

Step 3: putting everything together.

Given the gradient lower bound and the smoothness condition, we still need to resolve two remaining problems. The first one is that the gradient lower bound in Lemma 12 is given in terms of $\bm{\mu}$, which we need to convert to a lower bound in terms of $\mathcal{L}(\bm{\mu})$. For this we need the following upper bound of $\mathcal{L}$.

Theorem 14 (Loss function upper bound).

The loss function can be upper bounded as

$$\mathcal{L}(\bm{\mu})\leq\sum_{i\in[n]}\frac{\pi_{i}}{2}\|\mu_{i}\|^{2}\leq\frac{\mu_{\max}^{2}}{2}.$$

The second problem is that our local smoothness theorem requires $\bm{\mu}$ to be bounded, therefore we need to show a regularity condition that for each $i$, $\mu_{i}(t)$ stays in a bounded region during gradient EM updates. This is not easy to prove for each individual $\mu_{i}$ due to the same non-monotonic issue mentioned in Section 4.1. To establish such a regularity condition, we introduce the following potential function.

Definition 15.

Define the potential function $U:\mathbf{R}^{nd}\to\mathbf{R}$ for $\text{GMM}(\bm{\mu})$ as

$$U(\bm{\mu})=\sum_{i\in[n]}\|\mu_{i}\|^{2}.$$

We prove that the potential function $U$ remains bounded, implying each $\mu_{i}$ remains well-behaved. With this regularity condition, combined with the previous two steps, we finish the proof of Theorem 2 via mathematical induction.
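To make the role of each ingredient concrete, the following worked sketch shows how a $1/\sqrt{t}$ rate can arise from them. It is a heuristic reconstruction and not the paper's proof. It assumes that $\mathcal{L}$ is the KL divergence from $\mathcal{N}(0,I_d)$ to $\text{GMM}(\bm{\mu})$ with $\phi$ the standard Gaussian density, that $c_1>0$ is a placeholder for the factor in Lemma 12 (taken to be bounded below along the trajectory because $U$ stays bounded), and that the step size is small enough for the usual descent inequality to hold.

```latex
% (i) Theorem 14 via Jensen's inequality (concavity of \log), using E[x] = 0:
\begin{align*}
\mathcal{L}(\bm{\mu})
  &= \mathbf{E}_x\Big[\log\phi(x) - \log\sum_{i\in[n]}\pi_i\,\phi(x-\mu_i)\Big] \\
  &\le \sum_{i\in[n]}\pi_i\,\mathbf{E}_x\Big[\tfrac12\|x-\mu_i\|^2 - \tfrac12\|x\|^2\Big]
   = \sum_{i\in[n]}\frac{\pi_i}{2}\|\mu_i\|^2 \le \frac{\mu_{\max}^2}{2}.
\end{align*}

% (ii) From the projection bound (Lemma 12) to a bound in terms of the loss,
%      using \|\bm{\mu}\| \le \sqrt{n}\,\mu_{\max} and \mu_{\max}^2 \ge 2\mathcal{L}(\bm{\mu}):
\begin{align*}
\|\nabla\mathcal{L}(\bm{\mu})\|
  \ge \frac{\langle\nabla\mathcal{L}(\bm{\mu}),\bm{\mu}\rangle}{\|\bm{\mu}\|}
  \ge \frac{c_1\,\mu_{\max}^{4}}{\sqrt{n}\,\mu_{\max}}
  \quad\Longrightarrow\quad
  \|\nabla\mathcal{L}(\bm{\mu})\|^2 \ge \frac{c_1^2}{n}\,\mu_{\max}^{6}
  \ge \frac{8c_1^2}{n}\,\mathcal{L}(\bm{\mu})^3 .
\end{align*}

% (iii) Assumed descent inequality from local smoothness, with c = 4\eta c_1^2 / n:
\begin{align*}
\mathcal{L}_{t+1} \le \mathcal{L}_t - \frac{\eta}{2}\|\nabla\mathcal{L}(\bm{\mu}(t))\|^2
  \le \mathcal{L}_t\,(1 - c\,\mathcal{L}_t^2).
\end{align*}

% (iv) Since 1/(1-u)^2 \ge 1 + 2u for u \in [0,1), and 0 < \mathcal{L}_{t+1} forces c\mathcal{L}_t^2 < 1:
\begin{align*}
\frac{1}{\mathcal{L}_{t+1}^2} \ge \frac{1}{\mathcal{L}_t^2\,(1-c\,\mathcal{L}_t^2)^2}
  \ge \frac{1}{\mathcal{L}_t^2} + 2c
  \quad\Longrightarrow\quad
  \mathcal{L}_t \le \frac{1}{\sqrt{2ct}} .
\end{align*}
```

The cubic dependence of the squared gradient norm on the loss in step (ii) is what turns the usual linear rate into a sublinear one. The placeholder $c$ plays the role of $\gamma$ in Theorem 2 only up to the constants omitted here.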

4.3 Proof ideas for Theorem 7

Proving Theorem 7 is much simpler. The idea is natural: we find that there exist bad regions where the gradient of $\mathcal{L}$ is exponentially small, as characterized by the following lemma.

Lemma 16 (Gradient norm upper bound).

For any $\bm{\mu}$ satisfying $\|\mu_{1}\|\leq\sqrt{d}$ and $\|\mu_{2}\|,\|\mu_{3}\|,\ldots,\|\mu_{n}\|\geq 10\sqrt{d}$, the gradient of $\mathcal{L}$ at $\bm{\mu}$ can be upper bounded as

$$\|\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu})\|\leq 2\|\mu_{1}\|+2\exp(-d)\sum_{k\neq 1}\|\mu_{k}\|,\quad\forall i\in[n].$$

Utilizing Lemma 16, we can prove Theorem 7 by showing that gradient EM initialized in these bad regions stays trapped in them for exponentially long, since the gradient norm is exponentially small. The full proof can be found in Appendix B.2.

5 Experiments

In this section we include a few simulation results of gradient EM verifying our theoretical arguments. We choose the experimental setting $d=5$, $\eta=0.7$. We use $n=2,5,10$ Gaussian mixtures, respectively, to learn data generated from a single ground truth Gaussian distribution $\mathcal{N}(\mu^{*},I_{d})$. Since a closed-form expression of the population gradient is intractable, we approximate the gradient step via the Monte Carlo method, with sample size $3.5\times 10^{5}$. The mixing weights of the student GMM are randomly sampled from a standard Dirichlet distribution and kept fixed during gradient EM updates. The covariances of all component Gaussians are set to the identity matrix. We record the convergence of the likelihood function $\mathcal{L}$ (also estimated by the Monte Carlo method, on fresh samples each iteration) and the parametric distance $\sum_{i\in[n]}\pi_{i}\|\mu_{i}-\mu^{*}\|^{2}$ along the gradient EM trajectory. The results are reported in Figure 1 (left and middle panels). The simulations indicate that both the likelihood loss $\mathcal{L}$ and the parametric distance converge sublinearly, consistent with our theoretical arguments.

To verify our negative result Theorem 7, we consider the bad initialization point $\bm{\mu}(0)$ described in Theorem 7 (footnote 2: to prevent numerical underflow issues, we change the constant $12$ in $\bm{\mu}(0)$ to $2$) and plot the gradient norm at $\bm{\mu}(0)$ for different dimensions $d$ in Figure 1 (right panel). The simulation shows that the gradient norm $\|\nabla\mathcal{L}(\bm{\mu}(0))\|$ at $\bm{\mu}(0)$ decreases exponentially in the dimension $d$, supporting the existence of bad initialization regions.

Figure 1: Left: Sublinear convergence of the likelihood loss $\mathcal{L}$. Middle: Sublinear convergence of the parametric distance $\sum_{i\in[n]}\pi_{i}\|\mu_{i}-\mu^{*}\|^{2}$ between student GMM and the ground truth. Right: Gradient norm $\|\nabla\mathcal{L}(\bm{\mu}(0))\|$ in the counter-example in Theorem 7 decreases exponentially fast w.r.t. dimension $d$.

6 Conclusion

This paper gives the first global convergence result for gradient EM on over-parameterized Gaussian mixture models when the ground truth is a single Gaussian. The rate is sublinear, which is exponentially slower than the rate in the exact-parameterization case. One fundamental open problem is to determine when global convergence of EM or gradient EM can be obtained for Gaussian mixture models whose ground truth has multiple components. The likelihood-based convergence framework proposed in this paper might be a helpful tool for solving this general problem.

Acknowledgements

This work was supported in part by the following grants: NSF TRIPODS II-DMS 20231660, NSF CCF 2212261, NSF CCF 2007036, NSF AF 2312775, NSF IIS 2110170, NSF DMS 2134106, NSF IIS 2143493, and NSF IIS 2229881.

References

  • Balakrishnan et al. [2014] Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guarantees for the em algorithm: From population to sample-based analysis, 2014.
  • Daskalakis et al. [2017] Constantinos Daskalakis, Christos Tzamos, and Manolis Zampetakis. Ten steps of em suffice for mixtures of two gaussians. In Satyen Kale and Ohad Shamir, editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 704–710. PMLR, 07–10 Jul 2017. URL https://proceedings.mlr.press/v65/daskalakis17b.html.
  • Xu et al. [2016] Ji Xu, Daniel J. Hsu, and Arian Maleki. Global analysis of expectation maximization for mixtures of two gaussians. In Neural Information Processing Systems, 2016. URL https://api.semanticscholar.org/CorpusID:6310792.
  • Dwivedi et al. [2018a] Raaz Dwivedi, Nhat Ho, Koulik Khamaru, Martin J. Wainwright, and Michael I. Jordan. Theoretical guarantees for em under misspecified gaussian mixture models. In Neural Information Processing Systems, 2018a. URL https://api.semanticscholar.org/CorpusID:54062377.
  • Kwon and Caramanis [2020] Jeongyeol Kwon and Constantine Caramanis. The em algorithm gives sample-optimality for learning mixtures of well-separated gaussians. In Conference on Learning Theory, pages 2425–2487. PMLR, 2020.
  • Dwivedi et al. [2019] Raaz Dwivedi, Nhat Ho, Koulik Khamaru, Martin J. Wainwright, Michael I. Jordan, and Bin Yu. Sharp analysis of expectation-maximization for weakly identifiable models. In International Conference on Artificial Intelligence and Statistics, 2019. URL https://api.semanticscholar.org/CorpusID:216036378.
  • Jin et al. [2016] Chi Jin, Yuchen Zhang, Sivaraman Balakrishnan, Martin J. Wainwright, and Michael I. Jordan. Local maxima in the likelihood of gaussian mixture models: Structural results and algorithmic consequences. In Neural Information Processing Systems, 2016. URL https://api.semanticscholar.org/CorpusID:3200184.
  • Dwivedi et al. [2018b] Raaz Dwivedi, Nhat Ho, Koulik Khamaru, Michael I. Jordan, Martin J. Wainwright, and Bin Yu. Singularity, misspecification and the convergence rate of em. The Annals of Statistics, 2018b. URL https://api.semanticscholar.org/CorpusID:88517736.
  • [9] Bowei Yan, Mingzhang Yin, and Purnamrita Sarkar. Convergence analysis of gradient EM for multi-component gaussian mixture. URL http://arxiv.longhoe.net/abs/1705.08530.
  • Xu and Jordan [1996] Lei Xu and Michael I Jordan. On convergence properties of the em algorithm for gaussian mixtures. Neural computation, 8(1):129–151, 1996.
  • Ma et al. [2000] Jinwen Ma, Lei Xu, and Michael I Jordan. Asymptotic convergence rate of the em algorithm for gaussian mixtures. Neural Computation, 12(12):2881–2907, 2000.
  • Klusowski and Brinda [2016] Jason M. Klusowski and W. D. Brinda. Statistical guarantees for estimating the centers of a two-component gaussian mixture by em. arXiv: Machine Learning, 2016. URL https://api.semanticscholar.org/CorpusID:88514434.
  • Wu and Zhou [2019] Yihong Wu and Harrison H. Zhou. Randomly initialized em algorithm for two-component gaussian mixture achieves near optimality in $O(\sqrt{n})$ iterations, 2019.
  • Weinberger and Bresler [2021] Nir Weinberger and Guy Bresler. The em algorithm is adaptively-optimal for unbalanced symmetric gaussian mixtures. J. Mach. Learn. Res., 23:103:1–103:79, 2021. URL https://api.semanticscholar.org/CorpusID:232404093.
  • Kwon et al. [2019] Jeongyeol Kwon, Wei Qian, Constantine Caramanis, Yudong Chen, and Damek Davis. Global convergence of the em algorithm for mixtures of two component linear regression. In Conference on Learning Theory, pages 2055–2110. PMLR, 2019.
  • Dasgupta and Schulman [2000] Sanjoy Dasgupta and Leonard J. Schulman. A two-round variant of em for gaussian mixtures. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, UAI ’00, page 152–159, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc. ISBN 1558607099.
  • Chen et al. [2023] Yudong Chen, Dogyoon Song, Xumei Xi, and Yuqian Zhang. Local minima structures in gaussian mixture models, 2023.
  • [18] Ruofei Zhao, Yuanzhi Li, and Yuekai Sun. Statistical convergence of the EM algorithm on gaussian mixture models. URL http://arxiv.longhoe.net/abs/1810.04090.
  • [19] Nimrod Segol and Boaz Nadler. Improved convergence guarantees for learning gaussian mixture models by EM and gradient EM. URL http://arxiv.longhoe.net/abs/2101.00575.
  • Xu and Du [2023] Weihang Xu and Simon Du. Over-parameterization exponentially slows down gradient descent for learning a single neuron. In The Thirty Sixth Annual Conference on Learning Theory, pages 1155–1198. PMLR, 2023.
  • Richert et al. [2022] Frederieke Richert, Roman Worschech, and Bernd Rosenow. Soft mode in the dynamics of over-realizable online learning for soft committee machines. Physical Review E, 105(5):L052302, 2022.
  • Xiong et al. [2023] Nuoya Xiong, Lijun Ding, and Simon S Du. How over-parameterization slows down gradient descent in matrix sensing: The curses of symmetry and initialization. arXiv preprint arXiv:2310.01769, 2023.
  • Zhang et al. [2021] Jialun Zhang, Salar Fattahi, and Richard Y Zhang. Preconditioned gradient descent for over-parameterized nonconvex matrix factorization. Advances in Neural Information Processing Systems, 34:5985–5996, 2021.
  • Zhuo et al. [2021] Jiacheng Zhuo, Jeongyeol Kwon, Nhat Ho, and Constantine Caramanis. On the computational and statistical complexity of over-parameterized matrix sensing. arXiv preprint arXiv:2102.02756, 2021.
  • Stein [1981] Charles M. Stein. Estimation of the Mean of a Multivariate Normal Distribution. The Annals of Statistics, 9(6):1135 – 1151, 1981. doi: 10.1214/aos/1176345632. URL https://doi.org/10.1214/aos/1176345632.
  • Nesterov et al. [2018] Yurii Nesterov et al. Lectures on convex optimization, volume 137. Springer, 2018.

Appendix A Missing Proofs and Auxiliary lemmas

Proof of Fact 1.

It is well known that (see Section 1 of Wu and Zhou [2019])

$$Q(\bm{\mu}^{\prime}|\bm{\mu})=\mathbf{E}_{x\sim p_{\bm{\mu}^{*}}}\left[\log(p_{\bm{\mu}^{\prime}}(x))-D_{\text{KL}}(p_{\bm{\mu}}(\cdot|x)\,\|\,p_{\bm{\mu}^{\prime}}(\cdot|x))-H(p_{\bm{\mu}}(\cdot|x))\right],$$

where $p_{\bm{\mu}}(\cdot|x)$ denotes the distribution of the hidden variable $y$ (in our case of GMM, the index of the Gaussian component) conditioned on $x$, and $H$ denotes information entropy.

Since $\bm{\mu}^{\prime}=\bm{\mu}$ is a global minimum of $D_{\text{KL}}(p_{\bm{\mu}}(\cdot|x)\,\|\,p_{\bm{\mu}^{\prime}}(\cdot|x))$ as a function of $\bm{\mu}^{\prime}$, its gradient with respect to $\bm{\mu}^{\prime}$ vanishes there: $\nabla D_{\text{KL}}(p_{\bm{\mu}}(\cdot|x)\,\|\,p_{\bm{\mu}}(\cdot|x))=0$. Also $\nabla H(p_{\bm{\mu}}(\cdot|x))=0$, since $H(p_{\bm{\mu}}(\cdot|x))$ does not depend on $\bm{\mu}^{\prime}$. Therefore

$$\nabla Q(\bm{\mu}|\bm{\mu})=\mathbf{E}_{x\sim p_{\bm{\mu}^{*}}}\left[\nabla\log(p_{\bm{\mu}}(x))\right]=\nabla\mathcal{L}(\bm{\mu}).$$

The proof of Lemma 9 uses ideas from Theorem 1 of Chen et al. [2023] and relies on Stein’s identity, which is given by the following lemma.

Lemma 17 (Stein [1981]).

For $x\sim\mathcal{N}(\mu,\sigma^{2}I_{d})$ and differentiable function $g:\mathbf{R}^{d}\to\mathbf{R}$ we have

$$\mathbf{E}[g(x)(x-\mu)]=\sigma^{2}\mathbf{E}[\nabla_{x}g(x)],$$

if the two expectations in the above identity exist.
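Stein's identity is easy to sanity-check numerically. The sketch below is an illustration only: the test function $g(x)=\tanh(\langle a,x\rangle)$, the vector $a$, and the values of $\mu$, $\sigma$ and the sample size are arbitrary choices made for this example, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, m = 3, 1.5, 400_000
mu = np.array([0.5, -1.0, 2.0])
a = np.array([0.3, -0.2, 0.1])            # hypothetical direction defining g

x = mu + sigma * rng.normal(size=(m, d))  # x ~ N(mu, sigma^2 I_d)
g = np.tanh(x @ a)                        # g(x) = tanh(<a, x>)
grad_g = (1.0 - g ** 2)[:, None] * a      # grad_x g(x) = (1 - tanh^2(<a, x>)) a

lhs = (g[:, None] * (x - mu)).mean(axis=0)    # estimate of E[g(x) (x - mu)]
rhs = sigma ** 2 * grad_g.mean(axis=0)        # estimate of sigma^2 E[grad_x g(x)]
stein_gap = float(np.abs(lhs - rhs).max())    # should be of the order of Monte Carlo error
```

Both sides are Monte Carlo estimates, so `stein_gap` is expected to shrink roughly like $1/\sqrt{m}$ rather than to vanish exactly.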

Now we are ready to prove Lemma 9.

Lemma 9.

For any $\text{GMM}(\bm{\mu})$ and $i\in[n]$, the gradient of $Q$ satisfies

$$\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu})=\nabla_{\mu_{i}}Q(\bm{\mu}|\bm{\mu})=\mathbf{E}_{x}\left[\psi_{i}(x)\sum_{k\in[n]}\psi_{k}(x)\mu_{k}\right].$$
Proof.

Applying Stein’s identity (Lemma 17), for each $i\in[n]$ we have

$$\begin{split}
\nabla_{\mu_{i}}Q(\bm{\mu}|\bm{\mu})&=\mathbf{E}_{x\sim\mathcal{N}(0,I_{d})}\left[\psi_{i}(x)(\mu_{i}-x)\right]\\
&=\mathbf{E}_{x\sim\mathcal{N}(0,I_{d})}\left[\psi_{i}(x)\right]\mu_{i}-\mathbf{E}_{x\sim\mathcal{N}(0,I_{d})}\left[\psi_{i}(x)x\right]\\
&=\mathbf{E}_{x\sim\mathcal{N}(0,I_{d})}\left[\psi_{i}(x)\right]\mu_{i}-\mathbf{E}_{x\sim\mathcal{N}(0,I_{d})}\left[\nabla_{x}\psi_{i}(x)\right].
\end{split}$$

Recall that

$$\psi_{i}(x)=\Pr[i|x]=\frac{\pi_{i}\exp\left(-\frac{\|x-\mu_{i}\|^{2}}{2}\right)}{\sum_{k\in[n]}\pi_{k}\exp\left(-\frac{\|x-\mu_{k}\|^{2}}{2}\right)}.$$

The gradient $\nabla_{x}\psi_{i}(x)$ could be calculated as

$$\begin{split}
&\nabla_{x}\psi_{i}(x)\\
={}&\frac{1}{\left(\sum_{k\in[n]}\pi_{k}\exp\left(-\frac{\|x-\mu_{k}\|^{2}}{2}\right)\right)^{2}}\Bigg[\left(\sum_{k\in[n]}\pi_{k}\exp\left(-\frac{\|x-\mu_{k}\|^{2}}{2}\right)\right)\pi_{i}\exp\left(-\frac{\|x-\mu_{i}\|^{2}}{2}\right)(\mu_{i}-x)\\
&-\pi_{i}\exp\left(-\frac{\|x-\mu_{i}\|^{2}}{2}\right)\left(\sum_{k\in[n]}\pi_{k}\exp\left(-\frac{\|x-\mu_{k}\|^{2}}{2}\right)(\mu_{k}-x)\right)\Bigg]\\
={}&\psi_{i}(x)(\mu_{i}-x)-\psi_{i}(x)\sum_{k\in[n]}\psi_{k}(x)(\mu_{k}-x)\\
={}&\psi_{i}(x)(\mu_{i}-x)+\psi_{i}(x)x-\sum_{k\in[n]}\psi_{i}(x)\psi_{k}(x)\mu_{k}\\
={}&\psi_{i}(x)\left(\mu_{i}-\sum_{k\in[n]}\psi_{k}(x)\mu_{k}\right),
\end{split}\tag{8}$$

where we used $\sum_{k\in[n]}\psi_{k}(x)=1$.
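The closed form in (8) can be checked against central finite differences. The snippet below is a small sketch with randomly drawn, purely illustrative values of $\bm{\mu}$, $\pi$ and $x$; the helper `psi` is a hypothetical name for the posterior vector $(\psi_{1}(x),\dots,\psi_{n}(x))$.

```python
import numpy as np

def psi(x, mu, pi):
    """Posterior vector (psi_1(x), ..., psi_n(x)) for identity covariances."""
    logits = np.log(pi) - 0.5 * ((x - mu) ** 2).sum(axis=1)
    w = np.exp(logits - logits.max())          # shift for numerical stability
    return w / w.sum()

rng = np.random.default_rng(1)
n, d, i, h = 4, 3, 2, 1e-6
mu = rng.normal(size=(n, d))
pi = rng.dirichlet(np.ones(n))
x = rng.normal(size=d)

p = psi(x, mu, pi)
closed_form = p[i] * (mu[i] - p @ mu)          # psi_i(x) (mu_i - sum_k psi_k(x) mu_k)

# Central finite differences of psi_i along each coordinate direction of x.
numeric = np.array([
    (psi(x + h * e, mu, pi)[i] - psi(x - h * e, mu, pi)[i]) / (2 * h)
    for e in np.eye(d)
])
grad_gap = float(np.abs(closed_form - numeric).max())
```

Because this check is deterministic, `grad_gap` should be limited only by the finite-difference truncation and rounding error.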

Then we have

$$\begin{split}
\nabla_{\mu_{i}}Q(\bm{\mu}|\bm{\mu})&=\mathbf{E}_{x}\left[\psi_{i}(x)\right]\mu_{i}-\mathbf{E}_{x}[\nabla_{x}\psi_{i}(x)]\\
&=\mathbf{E}_{x}\left[\psi_{i}(x)\right]\mu_{i}-\mathbf{E}_{x}\left[\psi_{i}(x)\left(\mu_{i}-\sum_{k\in[n]}\psi_{k}(x)\mu_{k}\right)\right]=\mathbf{E}_{x}\left[\psi_{i}(x)\sum_{k\in[n]}\psi_{k}(x)\mu_{k}\right].
\end{split}$$
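The two expressions for the gradient in Lemma 9 can also be compared by Monte Carlo. The following sketch assumes the ground truth $\mathcal{N}(0,I_{d})$ used in the proof; the values of $n$, $d$, the sample size, and the randomly drawn $\bm{\mu}$ and $\pi$ are illustrative choices only.

```python
import numpy as np

def posteriors(x, mu, pi):
    """Row-wise posteriors psi_i(x) for a batch x of shape (m, d)."""
    sq = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    logits = np.log(pi) - 0.5 * sq
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
n, d, m = 3, 2, 400_000
mu = rng.normal(size=(n, d))
pi = rng.dirichlet(np.ones(n))
x = rng.normal(size=(m, d))                   # samples from N(0, I_d)

psi = posteriors(x, mu, pi)                   # (m, n)
mix_mean = psi @ mu                           # (m, d): sum_k psi_k(x) mu_k

direct = psi.mean(axis=0)[:, None] * mu - psi.T @ x / m   # E[psi_i(x) (mu_i - x)]
lemma9 = psi.T @ mix_mean / m                             # E[psi_i(x) sum_k psi_k(x) mu_k]
lemma9_gap = float(np.abs(direct - lemma9).max())
```

The two estimates agree only up to sampling error, since the equality relies on Stein's identity holding in expectation.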

Proof of Corollary 10.
$$\begin{split}
&\left\langle\nabla_{\bm{\mu}}Q(\bm{\mu}|\bm{\mu}),\bm{\mu}\right\rangle=\sum_{i\in[n]}\left\langle\nabla_{\mu_{i}}Q(\bm{\mu}|\bm{\mu}),\mu_{i}\right\rangle=\sum_{i\in[n]}\left\langle\mathbf{E}_{x}\left[\psi_{i}(x)\sum_{k\in[n]}\psi_{k}(x)\mu_{k}\right],\mu_{i}\right\rangle\\
&=\sum_{i\in[n]}\sum_{k\in[n]}\mathbf{E}_{x}\left\langle\psi_{i}(x)\psi_{k}(x)\mu_{k},\mu_{i}\right\rangle=\mathbf{E}_{x}\left[\left\|\sum_{i\in[n]}\psi_{i}(x)\mu_{i}\right\|^{2}\right]=\mathbf{E}_{x}\left[\left\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\right\|^{2}\right].
\end{split}$$
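The chain of equalities above is purely algebraic, so it holds exactly for any empirical average over samples, not just in expectation. A short sketch illustrating this, with arbitrary illustrative values of $n$, $d$ and the sample size, and with $\tilde{\bm{\psi}}_{\bm{\mu}}(x)$ taken to be $\sum_{i\in[n]}\psi_{i}(x)\mu_{i}$ as in the last equality:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, m = 4, 3, 1000
mu = rng.normal(size=(n, d))
pi = rng.dirichlet(np.ones(n))
x = rng.normal(size=(m, d))

# Posteriors psi_i(x) for every sample (rows) and component (columns).
sq = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
logits = np.log(pi) - 0.5 * sq
w = np.exp(logits - logits.max(axis=1, keepdims=True))
psi = w / w.sum(axis=1, keepdims=True)                  # (m, n)

psi_tilde = psi @ mu                                    # (m, d): sum_i psi_i(x) mu_i
grad = psi.T @ psi_tilde / m                            # (n, d): empirical Lemma 9 gradient
inner = float((grad * mu).sum())                        # <grad_mu Q, mu>
mean_sq_norm = float((psi_tilde ** 2).sum(axis=1).mean())   # average of ||psi_tilde(x)||^2
```

Here `inner` and `mean_sq_norm` coincide up to floating-point rounding.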

Lemma 18.

For any constant $c$ satisfying $0<c\leq\frac{1}{3d}$, we have

$$\mathbf{E}_{x\sim\mathcal{N}(0,I_{d})}\left[\exp\left(c\|x\|\right)\right]\leq 1+5\sqrt{d}c.$$
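As a quick numerical sanity check of the statement (not a substitute for the proof), one can estimate the left-hand side by Monte Carlo at the largest admissible value $c=\frac{1}{3d}$. The dimensions and sample size below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
m = 200_000
results = {}
for d in (1, 2, 5, 20):
    c = 1.0 / (3 * d)                                   # largest c allowed by the lemma
    norms = np.linalg.norm(rng.normal(size=(m, d)), axis=1)
    estimate = float(np.exp(c * norms).mean())          # Monte Carlo E[exp(c ||x||)]
    bound = 1.0 + 5.0 * np.sqrt(d) * c                  # right-hand side of the lemma
    results[d] = (estimate, bound)
```

In each case the estimate is expected to fall below the bound, which suggests the constant $5$ is not tight for these values of $d$.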
Proof.

Note that $\mathbf{E}_{x\sim\mathcal{N}(0,I_{d})}\left[\exp\left(c\|x\|\right)\right]=\mathcal{M}_{\|x\|}(c)$ is the moment-generating function of $\|x\|$. To upper bound the value of the moment-generating function at $c$, we use Lagrange’s Mean Value Theorem:

$$\mathcal{M}_{\|x\|}(c)=\mathcal{M}_{\|x\|}(0)+\mathcal{M}_{\|x\|}^{\prime}(\xi)c,\tag{9}$$

where $\xi\in[0,c]$. Note that $\mathcal{M}_{\|x\|}(0)=1$, so the remaining task is to bound $\mathcal{M}_{\|x\|}^{\prime}(\xi)$. We bound this expectation using the truncation method:

x(ξ)=𝐄x[xexp(ξx)]𝐄x[xexp(cx)]=x𝐑dxexp(cx)(2π)d/2exp(x22)dx=x1xexp(cx)(2π)d/2exp(x22)dx+x1xexp(cx)(2π)d/2exp(x22)dxexp(c)(2π)d/2Vd+x1x(2π)d/2exp(cxx22)dxexp(c)(2π)d/2Vd+x1x(2π)d/2exp(cxx22)dx,superscriptsubscriptnorm𝑥𝜉subscript𝐄𝑥delimited-[]delimited-∥∥𝑥𝜉delimited-∥∥𝑥subscript𝐄𝑥delimited-[]delimited-∥∥𝑥𝑐delimited-∥∥𝑥subscript𝑥superscript𝐑𝑑delimited-∥∥𝑥𝑐delimited-∥∥𝑥superscript2𝜋𝑑2superscriptnorm𝑥22differential-d𝑥subscriptnorm𝑥1delimited-∥∥𝑥𝑐delimited-∥∥𝑥superscript2𝜋𝑑2superscriptnorm𝑥22differential-d𝑥subscriptnorm𝑥1delimited-∥∥𝑥𝑐delimited-∥∥𝑥superscript2𝜋𝑑2superscriptnorm𝑥22differential-d𝑥𝑐superscript2𝜋𝑑2subscript𝑉𝑑subscriptnorm𝑥1delimited-∥∥𝑥superscript2𝜋𝑑2𝑐delimited-∥∥𝑥superscriptnorm𝑥22differential-d𝑥𝑐superscript2𝜋𝑑2subscript𝑉𝑑subscriptnorm𝑥1delimited-∥∥𝑥superscript2𝜋𝑑2𝑐delimited-∥∥𝑥superscriptnorm𝑥22differential-d𝑥\begin{split}\mathcal{M}_{\|x\|}^{\prime}(\xi)&=\mathbf{E}_{x}\left[\|x\|\exp(% \xi\|x\|)\right]\leq\mathbf{E}_{x}\left[\|x\|\exp(c\|x\|)\right]\\ &=\int_{x\in\mathbf{R}^{d}}\|x\|\exp(c\|x\|)(2\pi)^{-d/2}\exp\left(-\frac{\|x% \|^{2}}{2}\right)\mathrm{d}x\\ &=\int_{\|x\|\leq 1}\|x\|\exp(c\|x\|)(2\pi)^{-d/2}\exp\left(-\frac{\|x\|^{2}}{% 2}\right)\mathrm{d}x\\ &\quad+\int_{\|x\|\geq 1}\|x\|\exp(c\|x\|)(2\pi)^{-d/2}\exp\left(-\frac{\|x\|^% {2}}{2}\right)\mathrm{d}x\\ &\leq\exp(c)(2\pi)^{-d/2}V_{d}+\int_{\|x\|\geq 1}\|x\|(2\pi)^{-d/2}\exp\left(c% \|x\|-\frac{\|x\|^{2}}{2}\right)\mathrm{d}x\\ &\leq\exp(c)(2\pi)^{-d/2}V_{d}+\int_{\|x\|\geq 1}\|x\|(2\pi)^{-d/2}\exp\left(c% \|x\|-\frac{\|x\|^{2}}{2}\right)\mathrm{d}x,\end{split}start_ROW start_CELL caligraphic_M start_POSTSUBSCRIPT ∥ italic_x ∥ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_ξ ) end_CELL start_CELL = bold_E start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT [ ∥ italic_x ∥ roman_exp ( italic_ξ ∥ italic_x ∥ ) ] ≤ bold_E start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT [ ∥ italic_x ∥ roman_exp ( italic_c ∥ italic_x ∥ ) ] end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = 
∫ start_POSTSUBSCRIPT italic_x ∈ bold_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ∥ italic_x ∥ roman_exp ( italic_c ∥ italic_x ∥ ) ( 2 italic_π ) start_POSTSUPERSCRIPT - italic_d / 2 end_POSTSUPERSCRIPT roman_exp ( - divide start_ARG ∥ italic_x ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 end_ARG ) roman_d italic_x end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = ∫ start_POSTSUBSCRIPT ∥ italic_x ∥ ≤ 1 end_POSTSUBSCRIPT ∥ italic_x ∥ roman_exp ( italic_c ∥ italic_x ∥ ) ( 2 italic_π ) start_POSTSUPERSCRIPT - italic_d / 2 end_POSTSUPERSCRIPT roman_exp ( - divide start_ARG ∥ italic_x ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 end_ARG ) roman_d italic_x end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + ∫ start_POSTSUBSCRIPT ∥ italic_x ∥ ≥ 1 end_POSTSUBSCRIPT ∥ italic_x ∥ roman_exp ( italic_c ∥ italic_x ∥ ) ( 2 italic_π ) start_POSTSUPERSCRIPT - italic_d / 2 end_POSTSUPERSCRIPT roman_exp ( - divide start_ARG ∥ italic_x ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 end_ARG ) roman_d italic_x end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL ≤ roman_exp ( italic_c ) ( 2 italic_π ) start_POSTSUPERSCRIPT - italic_d / 2 end_POSTSUPERSCRIPT italic_V start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT + ∫ start_POSTSUBSCRIPT ∥ italic_x ∥ ≥ 1 end_POSTSUBSCRIPT ∥ italic_x ∥ ( 2 italic_π ) start_POSTSUPERSCRIPT - italic_d / 2 end_POSTSUPERSCRIPT roman_exp ( italic_c ∥ italic_x ∥ - divide start_ARG ∥ italic_x ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 end_ARG ) roman_d italic_x end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL ≤ roman_exp ( italic_c ) ( 2 italic_π ) start_POSTSUPERSCRIPT - italic_d / 2 end_POSTSUPERSCRIPT italic_V start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT + ∫ start_POSTSUBSCRIPT ∥ italic_x ∥ ≥ 1 end_POSTSUBSCRIPT ∥ italic_x ∥ ( 2 italic_π ) start_POSTSUPERSCRIPT - italic_d / 2 end_POSTSUPERSCRIPT roman_exp ( italic_c 
∥ italic_x ∥ - divide start_ARG ∥ italic_x ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 end_ARG ) roman_d italic_x , end_CELL end_ROW (10)

where $V_{d}=\frac{\pi^{d/2}}{\Gamma(d/2+1)}$ is the volume of the $d$-dimensional unit ball.

Since $\|x\|\geq 1\Rightarrow c\|x\|-\frac{\|x\|^{2}}{2}\leq\frac{1}{3d}\|x\|-\frac{\|x\|^{2}}{2}\leq-\frac{\|(1-1/(2d))x\|^{2}}{2}$, we have

$$\begin{split}&\quad\int_{\|x\|\geq 1}\|x\|(2\pi)^{-d/2}\exp\left(c\|x\|-\frac{\|x\|^{2}}{2}\right)\mathrm{d}x\\ &\leq\int_{\|x\|\geq 1}\|x\|(2\pi)^{-d/2}\exp\left(-\frac{\|\frac{2d-1}{2d}x\|^{2}}{2}\right)\mathrm{d}x\\ &=\int_{\|y\|\geq\frac{2d-1}{2d}}\frac{2d}{2d-1}\|y\|(2\pi)^{-d/2}\exp\left(-\frac{\|y\|^{2}}{2}\right)\left(\frac{2d}{2d-1}\right)^{d}\mathrm{d}y\\ &\leq\left(\frac{2d}{2d-1}\right)^{d+1}\mathbf{E}_{y\sim\mathcal{N}(0,I_{d})}\left[\|y\|\right]\\ &=\left(\frac{2d}{2d-1}\right)^{d+1}\frac{\sqrt{2}\,\Gamma\left(\frac{d+1}{2}\right)}{\Gamma\left(\frac{d}{2}\right)}\\ &\leq 4\sqrt{d},\end{split}$$

where we used $\left(\frac{2d}{2d-1}\right)^{d+1}\leq 4$ and the log-convexity of the Gamma function in the last line. Plugging this back into (10), we get

$$\begin{split}\mathcal{M}_{\|x\|}^{\prime}(\xi)&\leq\exp(c)(2\pi)^{-d/2}V_{d}+\int_{\|x\|\geq 1}\|x\|(2\pi)^{-d/2}\exp\left(c\|x\|-\frac{\|x\|^{2}}{2}\right)\mathrm{d}x\\ &\leq\exp(1/(3d))(2\pi)^{-d/2}+4\sqrt{d}\\ &\leq 5\sqrt{d}.\end{split}\tag{11}$$
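As a numerical sanity check (not part of the proof), the three elementary facts used above can be evaluated directly. The sketch below assumes $c=1/(3d)$, the largest value permitted by the middle inequality, and tests the exponent inequality only on a finite grid of radii $\|x\|\geq 1$; all function names are illustrative.

```python
import math

def tail_exponent_gap(d, r):
    """Right side minus left side of
    c*r - r^2/2 <= -((1 - 1/(2d)) * r)^2 / 2, evaluated at c = 1/(3d)."""
    c = 1.0 / (3 * d)  # assumed value of c (largest allowed)
    lhs = c * r - r * r / 2
    rhs = -((1 - 1 / (2 * d)) * r) ** 2 / 2
    return rhs - lhs

def scale_factor(d):
    """(2d / (2d - 1))^(d + 1), the factor produced by the change of variables."""
    return (2 * d / (2 * d - 1)) ** (d + 1)

def mean_norm(d):
    """E||y|| for y ~ N(0, I_d): sqrt(2) * Gamma((d+1)/2) / Gamma(d/2)."""
    return math.sqrt(2) * math.exp(math.lgamma((d + 1) / 2) - math.lgamma(d / 2))

radii = [1 + 0.25 * k for k in range(200)]  # grid on ||x|| >= 1
for d in (1, 2, 5, 50, 500):
    worst_gap = min(tail_exponent_gap(d, r) for r in radii)
    print(d, worst_gap >= 0, scale_factor(d) <= 4, mean_norm(d) <= math.sqrt(d))
```

The Gamma ratio is computed through `math.lgamma` so that large $d$ does not overflow.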

Plugging (11) into (9), we obtain the final bound

$$\mathbf{E}_{x}\left[\exp\left(2\|\delta_{i}\|(\|x\|+\|\mu_{i}\|)\right)-1\right]=\mathcal{M}_{\|x\|}(c)=\mathcal{M}_{\|x\|}(0)+\mathcal{M}_{\|x\|}^{\prime}(\xi)c\leq 1+5\sqrt{d}\,c.$$
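The size of this bound can be checked numerically. The sketch below assumes that $\mathcal{M}_{\|x\|}(c)=\mathbf{E}_{x\sim\mathcal{N}(0,I_{d})}[\exp(c\|x\|)]$ is the moment generating function of $\|x\|$ and that $0\leq c\leq 1/(3d)$; both are read off from the integrals above rather than restated here. It integrates against the chi density with a plain trapezoidal rule, and the truncation radius and step count are arbitrary choices.

```python
import math

def chi_mgf(d, c, r_max=30.0, steps=20000):
    """Trapezoidal estimate of E[exp(c * ||x||)] for x ~ N(0, I_d), using the
    chi density r^(d-1) * exp(-r^2/2) / (2^(d/2 - 1) * Gamma(d/2))."""
    log_norm = (d / 2 - 1) * math.log(2) + math.lgamma(d / 2)

    def integrand(r):
        if r == 0.0:
            # r^(d-1) equals 1 at r = 0 when d = 1 and 0 otherwise.
            return math.exp(-log_norm) if d == 1 else 0.0
        return math.exp((d - 1) * math.log(r) - r * r / 2 + c * r - log_norm)

    h = r_max / steps
    total = 0.5 * (integrand(0.0) + integrand(r_max))
    for k in range(1, steps):
        total += integrand(k * h)
    return total * h

for d in (1, 2, 5, 20):
    c = 1.0 / (3 * d)  # largest c allowed under the stated assumption
    print(d, chi_mgf(d, c), 1 + 5 * math.sqrt(d) * c)
```

In these runs the estimated value stays well below $1+5\sqrt{d}\,c$, which suggests that the constant $5$ is not tight.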

Lemma 19.

For any fixed $x\in\mathbf{R}^{d}$, $x\neq 0$, and any $\bm{\mu}$, we have

$$\int_{t=-1}^{1}\psi_{i}(tx|\bm{\mu})\psi_{j}(tx|\bm{\mu})\mathrm{d}t\geq\frac{1}{2\mu_{\max}\|x\|}\pi_{i}\pi_{j}\exp\left(-4\mu_{\max}^{2}\right)\left(1-\exp\left(-4\mu_{\max}\|x\|\right)\right).$$
Proof.
$$\begin{split}\psi_{i}(tx)&=\frac{\pi_{i}\exp\left(-\frac{\|tx-\mu_{i}\|^{2}}{2}\right)}{\sum_{k\in[n]}\pi_{k}\exp\left(-\frac{\|tx-\mu_{k}\|^{2}}{2}\right)}\\ &=\frac{\pi_{i}}{\sum_{k\in[n]}\pi_{k}\exp\left(\frac{1}{2}(\|tx-\mu_{i}\|^{2}-\|tx-\mu_{k}\|^{2})\right)}\\ &=\frac{\pi_{i}}{\sum_{k\in[n]}\pi_{k}\exp\left(\frac{1}{2}\left\langle 2tx-\mu_{i}-\mu_{k},\mu_{k}-\mu_{i}\right\rangle\right)}\\ &\geq\frac{\pi_{i}}{\sum_{k\in[n]}\pi_{k}\exp\left(\frac{1}{2}(2\|tx\|+2\mu_{\max})\cdot 2\mu_{\max}\right)}\\ &=\pi_{i}\exp\left(-2\mu_{\max}(\|tx\|+\mu_{\max})\right).\end{split}\tag{12}$$

Therefore

$$\begin{split}\int_{t=-1}^{1}\psi_{i}(tx)\psi_{j}(tx)\mathrm{d}t&\geq\int_{t=-1}^{1}\pi_{i}\pi_{j}\exp\left(-4\mu_{\max}(\|tx\|+\mu_{\max})\right)\mathrm{d}t\\ &=\pi_{i}\pi_{j}\exp\left(-4\mu_{\max}^{2}\right)\cdot 2\int_{t=0}^{1}\exp\left(-4\mu_{\max}\|x\|t\right)\mathrm{d}t\\ &=\frac{1}{2\mu_{\max}\|x\|}\pi_{i}\pi_{j}\exp\left(-4\mu_{\max}^{2}\right)\left(1-\exp\left(-4\mu_{\max}\|x\|\right)\right).\end{split}\tag{13}$$
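Lemma 19 can be illustrated on a small example. The sketch below assumes unit-covariance components with posterior $\psi_{k}(x|\bm{\mu})\propto\pi_{k}\exp(-\|x-\mu_{k}\|^{2}/2)$, as in (12), and takes $\mu_{\max}=\max_{k}\|\mu_{k}\|$, which is what the Cauchy-Schwarz step in (12) requires. The means, weights, and test point are arbitrary illustrative values, and the integral is approximated by a trapezoidal rule.

```python
import math

def posterior(x, mus, pis):
    """psi_k(x | mu) for all k: softmax of log(pi_k) - ||x - mu_k||^2 / 2."""
    logits = [math.log(p) - sum((a - b) ** 2 for a, b in zip(x, m)) / 2
              for p, m in zip(pis, mus)]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logits]
    z = sum(weights)
    return [w / z for w in weights]

def line_integral(x, mus, pis, i, j, steps=4000):
    """Trapezoidal estimate of the integral of psi_i(tx) * psi_j(tx), t in [-1, 1]."""
    h = 2.0 / steps
    total = 0.0
    for k in range(steps + 1):
        t = -1.0 + k * h
        psi = posterior([t * a for a in x], mus, pis)
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * psi[i] * psi[j]
    return total * h

def lemma19_bound(x, mus, pis, i, j):
    """Right-hand side of Lemma 19 with mu_max = max_k ||mu_k|| (assumed)."""
    mu_max = max(math.sqrt(sum(a * a for a in m)) for m in mus)
    norm_x = math.sqrt(sum(a * a for a in x))
    return (pis[i] * pis[j] * math.exp(-4 * mu_max ** 2)
            * (1 - math.exp(-4 * mu_max * norm_x)) / (2 * mu_max * norm_x))

mus = [(0.5, 0.0), (-0.3, 0.4), (0.1, -0.6)]  # illustrative means
pis = [0.2, 0.3, 0.5]                         # illustrative mixing weights
x = (1.5, -2.0)                               # illustrative direction
print(line_integral(x, mus, pis, 0, 1), lemma19_bound(x, mus, pis, 0, 1))
```

On this example the computed integral exceeds the lower bound by a wide margin, as expected from the worst-case estimate in (12).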

Appendix B Proofs for Sections 3 and 4

B.1 Proofs for global convergence analysis

Theorem 13.

At any two points $\bm{\mu}=(\mu_{1}^{\top},\ldots,\mu_{n}^{\top})^{\top}$ and $\bm{\mu}+\bm{\delta}=((\mu_{1}+\delta_{1})^{\top},\ldots,(\mu_{n}+\delta_{n})^{\top})^{\top}$, if

$$\|\delta_{i}\|\leq\frac{1}{\max\left\{6d,2\|\mu_{i}\|\right\}},\quad\forall i\in[n],$$

then the loss function $\mathcal{L}$ satisfies the following smoothness property: for any $i\in[n]$ we have

$$\left\|\nabla_{\mu_{i}+\delta_{i}}\mathcal{L}(\bm{\mu}+\bm{\delta})-\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu})\right\|\leq n\mu_{\max}(30\sqrt{d}+4\mu_{\max})\|\delta_{i}\|+\sum_{k\in[n]}\|\delta_{k}\|.\tag{14}$$
Proof.

Note that

$$\begin{split}&\quad\exp\left(-\|\delta_{i}\|(\|x\|+\|\mu_{i}\|)\right)\exp\left(-\frac{\|\delta_{i}\|^{2}}{2}\right)\leq\frac{\exp\left(-\frac{\|x-(\mu_{i}+\delta_{i})\|^{2}}{2}\right)}{\exp\left(-\frac{\|x-\mu_{i}\|^{2}}{2}\right)}=\exp\left(\left\langle x-\mu_{i},\delta_{i}\right\rangle-\frac{\|\delta_{i}\|^{2}}{2}\right)\\ &\leq\exp\left(\|\delta_{i}\|(\|x\|+\|\mu_{i}\|)\right)\exp\left(-\frac{\|\delta_{i}\|^{2}}{2}\right).\end{split}$$
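The two-sided bound on the density ratio follows from the Cauchy-Schwarz and triangle inequalities applied to $\langle x-\mu_{i},\delta_{i}\rangle$. A small randomized check, working in log space to avoid overflow, might look as follows; the dimension, sampling scales, and tolerance are arbitrary assumptions made only for illustration.

```python
import math
import random

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in v))

def log_density_ratio(x, mu, delta):
    """log of exp(-||x - (mu + delta)||^2 / 2) / exp(-||x - mu||^2 / 2)."""
    shifted = sum((a - m - e) ** 2 for a, m, e in zip(x, mu, delta))
    base = sum((a - m) ** 2 for a, m in zip(x, mu))
    return (base - shifted) / 2.0

def sandwich_holds(x, mu, delta, tol=1e-9):
    """Check -env - ||delta||^2/2 <= log ratio <= env - ||delta||^2/2,
    where env = ||delta|| * (||x|| + ||mu||)."""
    envelope = norm(delta) * (norm(x) + norm(mu))
    half_sq = norm(delta) ** 2 / 2.0
    value = log_density_ratio(x, mu, delta)
    return -envelope - half_sq - tol <= value <= envelope - half_sq + tol

rng = random.Random(0)  # fixed seed for reproducibility
d = 4                   # illustrative dimension
checks = []
for _ in range(1000):
    x = [rng.gauss(0, 3) for _ in range(d)]
    mu = [rng.gauss(0, 2) for _ in range(d)]
    delta = [rng.gauss(0, 0.1) for _ in range(d)]
    checks.append(sandwich_holds(x, mu, delta))
print(all(checks))
```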

Therefore $\psi_{i}(x|\bm{\mu}+\bm{\delta})$ can be bounded as

$$\begin{split}&\quad\psi_{i}(x|\bm{\mu}+\bm{\delta})=\frac{\pi_{i}\exp\left(-\frac{\|x-(\mu_{i}+\delta_{i})\|^{2}}{2}\right)}{\sum_{k\in[n]}\pi_{k}\exp\left(-\frac{\|x-(\mu_{k}+\delta_{k})\|^{2}}{2}\right)}\\ &\leq\frac{\pi_{i}\exp\left(-\frac{\|x-\mu_{i}\|^{2}}{2}\right)\exp\left(\|\delta_{i}\|(\|x\|+\|\mu_{i}\|)\right)\exp\left(-\frac{\|\delta_{i}\|^{2}}{2}\right)}{\sum_{k\in[n]}\pi_{k}\exp\left(-\frac{\|x-\mu_{k}\|^{2}}{2}\right)\exp\left(-\|\delta_{i}\|(\|x\|+\|\mu_{i}\|)\right)\exp\left(-\frac{\|\delta_{i}\|^{2}}{2}\right)}\leq\exp\left(2\|\delta_{i}\|(\|x\|+\|\mu_{i}\|)\right)\psi_{i}(x|\bm{\mu}).\end{split}\tag{15}$$

Similarly, we have

$$\begin{split}&\quad\psi_{i}(x|\bm{\mu}+\bm{\delta})=\frac{\pi_{i}\exp\left(-\frac{\|x-(\mu_{i}+\delta_{i})\|^{2}}{2}\right)}{\sum_{k\in[n]}\pi_{k}\exp\left(-\frac{\|x-(\mu_{k}+\delta_{k})\|^{2}}{2}\right)}\\ &\geq\frac{\pi_{i}\exp\left(-\frac{\|x-\mu_{i}\|^{2}}{2}\right)\exp\left(-\|\delta_{i}\|(\|x\|+\|\mu_{i}\|)\right)\exp\left(-\frac{\|\delta_{i}\|^{2}}{2}\right)}{\sum_{k\in[n]}\pi_{k}\exp\left(-\frac{\|x-\mu_{k}\|^{2}}{2}\right)\exp\left(\|\delta_{i}\|(\|x\|+\|\mu_{i}\|)\right)\exp\left(-\frac{\|\delta_{i}\|^{2}}{2}\right)}\geq\exp\left(-2\|\delta_{i}\|(\|x\|+\|\mu_{i}\|)\right)\psi_{i}(x|\bm{\mu}).\end{split}\tag{16}$$

Recall that by Lemma 9 we have $\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu})=\mathbf{E}_{x}\left[\psi_{i}(x|\bm{\mu})\sum_{k\in[n]}\psi_{k}(x|\bm{\mu})\mu_{k}\right]$, so

$$\begin{split}&\quad\left\|\nabla_{\mu_{i}+\delta_{i}}\mathcal{L}(\bm{\mu}+\bm{\delta})-\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu})\right\|\\ &=\left\|\mathbf{E}_{x}\left[\psi_{i}(x|\bm{\mu}+\bm{\delta})\sum_{k\in[n]}\psi_{k}(x|\bm{\mu}+\bm{\delta})(\mu_{k}+\delta_{k})\right]-\mathbf{E}_{x}\left[\psi_{i}(x|\bm{\mu})\sum_{k\in[n]}\psi_{k}(x|\bm{\mu})\mu_{k}\right]\right\|\\ &=\Bigg\|\mathbf{E}_{x}\left[\sum_{k\in[n]}\psi_{i}(x|\bm{\mu}+\bm{\delta})\psi_{k}(x|\bm{\mu}+\bm{\delta})\delta_{k}\right]\\ &\quad+\mathbf{E}_{x}\left[\sum_{k\in[n]}(\psi_{i}(x|\bm{\mu}+\bm{\delta})\psi_{k}(x|\bm{\mu}+\bm{\delta})-\psi_{i}(x|\bm{\mu})\psi_{k}(x|\bm{\mu}))\mu_{k}\right]\Bigg\|\\ &\leq\mathbf{E}_{x}\left[\sum_{k\in[n]}\psi_{i}(x|\bm{\mu}+\bm{\delta})\psi_{k}(x|\bm{\mu}+\bm{\delta})\|\delta_{k}\|\right]\\ &\quad+\mathbf{E}_{x}\left[\sum_{k\in[n]}|\psi_{i}(x|\bm{\mu}+\bm{\delta})\psi_{k}(x|\bm{\mu}+\bm{\delta})-\psi_{i}(x|\bm{\mu})\psi_{k}(x|\bm{\mu})|\cdot\|\mu_{k}\|\right]\\ &\leq\sum_{k\in[n]}\|\delta_{k}\|+\sum_{k\in[n]}\mathbf{E}_{x}\left[|\psi_{i}(x|\bm{\mu}+\bm{\delta})\psi_{k}(x|\bm{\mu}+\bm{\delta})-\psi_{i}(x|\bm{\mu})\psi_{k}(x|\bm{\mu})|\right]\|\mu_{k}\|\\ &\leq\sum_{k\in[n]}\|\delta_{k}\|+\sum_{k\in[n]}\mathbf{E}_{x}\left[\exp\left(2\|\delta_{i}\|(\|x\|+\|\mu_{i}\|)\right)-1\right]\|\mu_{k}\|,\end{split}\tag{17}$$

where the last two inequalities follow from $\psi_i,\psi_k\leq 1$ and from applying (15) and (16).

The remaining task is to bound $\mathbf{E}_x\left[\exp\left(2\|\delta_i\|(\|x\|+\|\mu_i\|)\right)-1\right]$. Since $2\|\delta_i\|\leq\frac{1}{3d}$, we can use Lemma 18 to bound it as

\begin{equation}
\begin{split}
&\quad\mathbf{E}_x\left[\exp\left(2\|\delta_i\|(\|x\|+\|\mu_i\|)\right)-1\right]=\exp(2\|\delta_i\|\|\mu_i\|)\,\mathbf{E}_x\left[\exp\left(2\|\delta_i\|\cdot\|x\|\right)\right]-1\\
&\leq\exp(2\|\delta_i\|\|\mu_i\|)(1+10\sqrt{d}\|\delta_i\|)-1=\exp(2\|\delta_i\|\|\mu_i\|)-1+10\sqrt{d}\|\delta_i\|\exp(2\|\delta_i\|\|\mu_i\|)\\
&\leq 4\|\delta_i\|\|\mu_i\|+10\sqrt{d}\|\delta_i\|\exp(1)\leq(30\sqrt{d}+4\|\mu_i\|)\|\delta_i\|,
\end{split}
\tag{18}
\end{equation}

where we used $\exp(x)\leq 1+2x$ for all $x\in[0,1]$, applied with $x=2\|\delta_i\|\|\mu_i\|$, in the last line. Plugging this back into (17), we get

\begin{equation}
\begin{split}
&\quad\left\|\nabla_{\mu_i+\delta_i}\mathcal{L}(\bm{\mu}+\bm{\delta})-\nabla_{\mu_i}\mathcal{L}(\bm{\mu})\right\|\\
&\leq\sum_{k\in[n]}\|\delta_k\|+\sum_{k\in[n]}\mathbf{E}_x\left[\exp\left(2\|\delta_i\|(\|x\|+\|\mu_i\|)\right)-1\right]\|\mu_k\|\\
&\leq\sum_{k\in[n]}\|\delta_k\|+\sum_{k\in[n]}(30\sqrt{d}+4\|\mu_i\|)\|\delta_i\|\|\mu_k\|\\
&\leq n\mu_{\max}(30\sqrt{d}+4\mu_{\max})\|\delta_i\|+\sum_{k\in[n]}\|\delta_k\|.
\end{split}
\tag{19}
\end{equation}
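The bound (18) can be sanity-checked numerically in the simplest case. The sketch below is not part of the original argument: it specializes to $d=1$, where $\mathbf{E}_x[\exp(t|x|)]=2e^{t^2/2}\Phi(t)$ for $x\sim\mathcal{N}(0,1)$, and it assumes $2\|\delta_i\|\leq\frac{1}{3}$ and $2\|\delta_i\|\|\mu_i\|\leq 1$, which is what the last line of (18) appears to require. All function names are hypothetical.

```python
import math


def std_normal_cdf(t: float) -> float:
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def lhs_18_one_dim(delta: float, mu: float) -> float:
    """Left-hand side of (18) for d = 1 and x ~ N(0, 1).

    Uses the closed form E[exp(t|x|)] = 2 exp(t^2 / 2) Phi(t) with t = 2 * delta.
    """
    t = 2.0 * delta
    mgf_abs = 2.0 * math.exp(t * t / 2.0) * std_normal_cdf(t)
    return math.exp(t * abs(mu)) * mgf_abs - 1.0


def rhs_18_one_dim(delta: float, mu: float) -> float:
    """Right-hand side of (18) for d = 1: (30 sqrt(d) + 4|mu|) * delta."""
    return (30.0 + 4.0 * abs(mu)) * delta


def check_elementary_inequality(steps: int = 1000) -> bool:
    """Check exp(x) <= 1 + 2x on a uniform grid over [0, 1]."""
    return all(
        math.exp(k / steps) <= 1.0 + 2.0 * k / steps for k in range(steps + 1)
    )


if __name__ == "__main__":
    print("exp(x) <= 1 + 2x on [0, 1]:", check_elementary_inequality())
    for delta in (0.01, 0.05, 1.0 / 6.0):
        print(delta, lhs_18_one_dim(delta, 1.0), rhs_18_one_dim(delta, 1.0))
```

In this one-dimensional setting the left-hand side stays well below the right-hand side, which suggests the constants in (18) are loose rather than tight.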

Theorem 14.

The loss function can be upper bounded as

\[
\mathcal{L}(\bm{\mu})\leq\sum_{i\in[n]}\frac{\pi_i}{2}\|\mu_i\|^2\leq\frac{\mu_{\max}^2}{2}.
\]
Proof.

Since the logarithm function is concave, by Jensen’s inequality we have

\[
\begin{split}
\mathcal{L}(\bm{\mu})&=D_{\text{KL}}(p_{\bm{\mu}^*}\,\|\,p_{\bm{\mu}})=-\mathbf{E}_x\left[\log\left(\frac{p_{\bm{\mu}}(x)}{p_{\bm{\mu}^*}(x)}\right)\right]\\
&=-\mathbf{E}_x\left[\log\left(\frac{\sum_i\pi_i\exp\left(-\frac{\|x-\mu_i\|^2}{2}\right)}{\exp\left(-\frac{\|x\|^2}{2}\right)}\right)\right]\\
&\leq-\mathbf{E}_x\left[\sum_i\pi_i\log\left(\frac{\exp\left(-\frac{\|x-\mu_i\|^2}{2}\right)}{\exp\left(-\frac{\|x\|^2}{2}\right)}\right)\right]\\
&=-\sum_i\pi_i\mathbf{E}_x\left[\left\langle x,\mu_i\right\rangle-\frac{\|\mu_i\|^2}{2}\right]\\
&=\sum_{i\in[n]}\frac{\pi_i}{2}\|\mu_i\|^2\leq\frac{\mu_{\max}^2}{2}.
\end{split}
\]
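As an illustration of the bound just derived, the following sketch evaluates $\mathcal{L}(\bm{\mu})$ in dimension $d=1$ by trapezoidal quadrature and compares it with $\sum_i\frac{\pi_i}{2}\|\mu_i\|^2$. It assumes, as in the derivation, a ground truth $\mathcal{N}(0,1)$ and unit-variance components; the function names, grid width, and step count are hypothetical choices made for this example only.

```python
import math


def kl_loss_one_dim(mus, pis, half_width=10.0, steps=20000):
    """Approximate L(mu) = KL(N(0,1) || sum_i pi_i N(mu_i, 1)) for d = 1.

    Integrates -log(p_mu(x) / p_{mu*}(x)) against the N(0,1) density with the
    trapezoidal rule on [-half_width, half_width].
    """
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        x = -half_width + k * h
        density = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        # p_mu(x) / p_{mu*}(x) = sum_i pi_i exp(x * mu_i - mu_i^2 / 2)
        ratio = sum(p * math.exp(x * m - 0.5 * m * m) for p, m in zip(pis, mus))
        weight = 0.5 if k in (0, steps) else 1.0  # trapezoid end-point weights
        total += weight * (-math.log(ratio)) * density * h
    return total


def theorem_14_bound(mus, pis):
    """Upper bound sum_i (pi_i / 2) * mu_i^2 from Theorem 14 (d = 1)."""
    return sum(0.5 * p * m * m for p, m in zip(pis, mus))


if __name__ == "__main__":
    mus, pis = [1.0, -0.5, 2.0], [0.5, 0.3, 0.2]
    print("loss :", kl_loss_one_dim(mus, pis))
    print("bound:", theorem_14_bound(mus, pis))
```

The gap between the two printed values is the slack introduced by Jensen's inequality; it closes only when all $\mu_i$ coincide.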

Lemma 12.

For any $\text{GMM}(\bm{\mu})$ we have

\[
\left\langle\nabla_{\bm{\mu}}Q(\bm{\mu}|\bm{\mu}),\bm{\mu}\right\rangle=\mathbf{E}_x\left[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^2\right]\geq\Omega\left(\frac{\exp\left(-8\mu_{\max}^2\right)\pi_{\min}^2}{d(1+\mu_{\max}\sqrt{d})^2}\mu_{\max}^4\right).
\]
Proof.

Consider two cases:

Case 1. There exists $k\in[n]$ such that $\|\mu_k-\mu_{i_{\max}}\|\geq\frac{\mu_{\max}}{2}$. Then by Lemma 20 and Lemma 11 we have

\[
\begin{split}
\mathbf{E}_x\left[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^2\right]&\geq\frac{\exp\left(-8\mu_{\max}^2\right)}{40000d(1+2\mu_{\max}\sqrt{d})^2}\left(\sum_{i,j\in[n]}\pi_i\pi_j\|\mu_i-\mu_j\|^2\right)^2\\
&\geq\frac{\exp\left(-8\mu_{\max}^2\right)}{40000d(1+2\mu_{\max}\sqrt{d})^2}\left(\frac{\pi_{\min}}{8}\mu_{\max}^2\right)^2\\
&=\frac{\exp\left(-8\mu_{\max}^2\right)\pi_{\min}^2}{2560000d(1+2\mu_{\max}\sqrt{d})^2}\mu_{\max}^4.
\end{split}
\]

Case 2. For all $k\in[n]$, $\|\mu_{i_{\max}}-\mu_k\|<\frac{\mu_{\max}}{2}$. Then by Lemma 21 we have
\[
\mathbf{E}_x\left[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^2\right]\geq\frac{1}{4}\mu_{\max}^2\geq\Omega\left(\exp(-8\mu_{\max}^2)\mu_{\max}^4\right)\geq\Omega\left(\frac{\exp\left(-8\mu_{\max}^2\right)\pi_{\min}^2}{d(1+\mu_{\max}\sqrt{d})^2}\mu_{\max}^4\right)
\]
(since $e^{-x}x\leq 1$ for all $x$). ∎

Lemma 20.

For any $\text{GMM}(\bm{\mu})$, if there exists $k\in[n]$ such that $\|\mu_k-\mu_{i_{\max}}\|\geq\frac{\mu_{\max}}{2}$, then we have

\[
\sum_{i,j\in[n]}\pi_i\pi_j\|\mu_i-\mu_j\|^2\geq\frac{\pi_{\min}}{8}\mu_{\max}^2.
\]
Proof.

By the Cauchy–Schwarz inequality, we have $\|a\|^2+\|b\|^2\geq\frac{1}{2}\|a-b\|^2$, so for every $i\in[n]$ we have

\[
\begin{split}
\sum_{j\in[n]}\pi_j\|\mu_i-\mu_j\|^2&\geq\pi_{i_{\max}}\|\mu_i-\mu_{i_{\max}}\|^2+\pi_k\|\mu_i-\mu_k\|^2\\
&\geq\frac{\pi_{\min}}{2}\|(\mu_i-\mu_{i_{\max}})-(\mu_i-\mu_k)\|^2=\frac{\pi_{\min}}{2}\|\mu_k-\mu_{i_{\max}}\|^2.
\end{split}
\]

Therefore

\[
\sum_{i,j\in[n]}\pi_i\pi_j\|\mu_i-\mu_j\|^2=\sum_{i\in[n]}\pi_i\sum_{j\in[n]}\pi_j\|\mu_i-\mu_j\|^2\geq\sum_{i\in[n]}\pi_i\frac{\pi_{\min}}{2}\|\mu_k-\mu_{i_{\max}}\|^2\geq\frac{\pi_{\min}}{8}\mu_{\max}^2,
\]

where the last inequality holds because $\|\mu_k-\mu_{i_{\max}}\|\geq\frac{\mu_{\max}}{2}$ and $\sum_i\pi_i=1$. ∎
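The inequality of Lemma 20 is easy to test on random instances. The sketch below is only an illustration, with hypothetical helper names: it draws Gaussian means and positive mixing weights, skips instances that violate the hypothesis, and checks the conclusion up to a small floating-point tolerance.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors given as lists."""
    return sum((s - t) ** 2 for s, t in zip(a, b))


def pairwise_spread(mus, pis):
    """Compute sum_{i,j} pi_i pi_j ||mu_i - mu_j||^2."""
    return sum(
        pi * pj * sq_dist(mi, mj)
        for pi, mi in zip(pis, mus)
        for pj, mj in zip(pis, mus)
    )


def lemma_20_holds(mus, pis, tol=1e-12):
    """Return None if the hypothesis of Lemma 20 fails, else whether the
    conclusion sum_{i,j} pi_i pi_j ||mu_i - mu_j||^2 >= pi_min mu_max^2 / 8 holds."""
    origin = [0.0] * len(mus[0])
    norms_sq = [sq_dist(m, origin) for m in mus]
    i_max = max(range(len(mus)), key=lambda i: norms_sq[i])
    mu_max_sq = norms_sq[i_max]
    # Hypothesis: some k with ||mu_k - mu_{i_max}|| >= mu_max / 2 (compared squared).
    if not any(sq_dist(m, mus[i_max]) >= mu_max_sq / 4.0 for m in mus):
        return None
    return pairwise_spread(mus, pis) >= min(pis) * mu_max_sq / 8.0 - tol


def random_instance(rng, n, d):
    """Random means in R^d and strictly positive mixing weights summing to 1."""
    mus = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    raw = [rng.uniform(0.1, 1.0) for _ in range(n)]
    total = sum(raw)
    return mus, [r / total for r in raw]


if __name__ == "__main__":
    rng = random.Random(0)
    results = [lemma_20_holds(*random_instance(rng, 5, 3)) for _ in range(1000)]
    print("hypothesis satisfied:", sum(r is not None for r in results))
    print("violations found    :", sum(r is False for r in results))
```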

Lemma 21.

For any $\text{GMM}(\bm{\mu})$, if for all $k\in[n]$ we have $\|\mu_{i_{\max}}-\mu_k\|<\frac{\mu_{\max}}{2}$, then

\[
\mathbf{E}_x\left[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^2\right]\geq\frac{1}{4}\mu_{\max}^2.
\]
Proof.

For any $k\in[n]$, by the Cauchy–Schwarz inequality we have

\begin{equation}
\begin{split}
\langle\mu_k,\mu_{i_{\max}}\rangle&=\langle\mu_{i_{\max}}-(\mu_{i_{\max}}-\mu_k),\mu_{i_{\max}}\rangle=\|\mu_{i_{\max}}\|^2-\left\langle\mu_{i_{\max}}-\mu_k,\mu_{i_{\max}}\right\rangle\\
&\geq\mu_{\max}^2-\|\mu_{i_{\max}}-\mu_k\|\mu_{\max}>\frac{1}{2}\mu_{\max}^2,
\end{split}
\tag{20}
\end{equation}

where the last inequality holds because $\|\mu_{i_{\max}}-\mu_k\|<\frac{\mu_{\max}}{2}$.

Note that (20) implies $\langle\mu_k,\overline{\mu_{i_{\max}}}\rangle>\frac{1}{2}\mu_{\max}$, so for all $x\in\mathbf{R}^d$ we have

\begin{equation}
\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|=\left\|\sum_{k\in[n]}\psi_k(x)\mu_k\right\|\geq\left\langle\sum_{k\in[n]}\psi_k(x)\mu_k,\overline{\mu_{i_{\max}}}\right\rangle=\sum_{k\in[n]}\psi_k(x)\left\langle\mu_k,\overline{\mu_{i_{\max}}}\right\rangle>\frac{1}{2}\mu_{\max},
\tag{21}
\end{equation}

where we used k[n]ψk(x)=1subscript𝑘delimited-[]𝑛subscript𝜓𝑘𝑥1\sum_{k\in[n]}\psi_{k}(x)=1∑ start_POSTSUBSCRIPT italic_k ∈ [ italic_n ] end_POSTSUBSCRIPT italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_x ) = 1 at the last inequality. ∎

Lemma 11.

For any $\text{GMM}(\bm{\mu})$ we have

$$\mathbf{E}_{x}\left[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^{2}\right]\geq\frac{\exp\left(-8\mu_{\max}^{2}\right)}{40000d(1+2\mu_{\max}\sqrt{d})^{2}}\left(\sum_{i,j\in[n]}\pi_{i}\pi_{j}\|\mu_{i}-\mu_{j}\|^{2}\right)^{2}.$$
Proof.

The key idea is to consider the gradient of $\tilde{\bm{\psi}}_{\bm{\mu}}$, which can be calculated as

$$\begin{split}\nabla_{x}\tilde{\bm{\psi}}_{\bm{\mu}}(x)&=\sum_{i}\mu_{i}\left(\frac{\partial\psi_{i}(x)}{\partial x}\right)^{\top}\\
&=\sum_{i}\psi_{i}(x)\mu_{i}\mu_{i}^{\top}-\sum_{i,j}\psi_{i}(x)\psi_{j}(x)\mu_{i}\mu_{j}^{\top}\\
&=\sum_{i,j\in[n]}\psi_{i}(x)\psi_{j}(x)\mu_{i}\mu_{i}^{\top}-\sum_{i,j}\psi_{i}(x)\psi_{j}(x)\mu_{i}\mu_{j}^{\top}\\
&=\sum_{i,j\in[n]}\psi_{i}(x)\psi_{j}(x)\mu_{i}(\mu_{i}-\mu_{j})^{\top}\\
&=\sum_{i,j\in[n]}\psi_{i}(x)\psi_{j}(x)\frac{1}{2}\left(\mu_{i}(\mu_{i}-\mu_{j})^{\top}+\mu_{j}(\mu_{j}-\mu_{i})^{\top}\right)\\
&=\frac{1}{2}\sum_{i,j\in[n]}\psi_{i}(x)\psi_{j}(x)(\mu_{i}-\mu_{j})(\mu_{i}-\mu_{j})^{\top},\end{split}\tag{22}$$

where we used (8) in the second identity.
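The closed form in (22) can be sanity-checked numerically. The sketch below is only an illustration, not part of the proof. It assumes the posterior weights take the unit-covariance form $\psi_{k}(x)\propto\pi_{k}\exp(-\|x-\mu_{k}\|^{2}/2)$, and all function names, the random instance, and the step size `h` are hypothetical choices made for this example. It compares a central finite-difference Jacobian of $\tilde{\bm{\psi}}_{\bm{\mu}}$ with the right-hand side of (22).

```python
import numpy as np

def posteriors(x, mus, pis):
    """Assumed posterior weights psi_k(x) for unit-covariance components."""
    logits = np.log(pis) - 0.5 * np.sum((x - mus) ** 2, axis=1)
    logits -= logits.max()          # stabilise the softmax
    w = np.exp(logits)
    return w / w.sum()

def psi_tilde(x, mus, pis):
    """psi_tilde(x) = sum_k psi_k(x) mu_k, a vector in R^d."""
    return posteriors(x, mus, pis) @ mus

def jacobian_closed_form(x, mus, pis):
    """Right-hand side of (22): 0.5 * sum_ij psi_i psi_j (mu_i-mu_j)(mu_i-mu_j)^T."""
    psi = posteriors(x, mus, pis)
    diff = mus[:, None, :] - mus[None, :, :]          # shape (n, n, d)
    weights = psi[:, None] * psi[None, :]             # shape (n, n)
    return 0.5 * np.einsum("ij,ija,ijb->ab", weights, diff, diff)

def jacobian_finite_diff(x, mus, pis, h=1e-6):
    """Central finite differences; column b holds d psi_tilde / d x_b."""
    d = x.shape[0]
    jac = np.zeros((d, d))
    for b in range(d):
        e = np.zeros(d)
        e[b] = h
        jac[:, b] = (psi_tilde(x + e, mus, pis) - psi_tilde(x - e, mus, pis)) / (2 * h)
    return jac

# Hypothetical random instance with n = 4 components in d = 3 dimensions.
rng = np.random.default_rng(0)
n, d = 4, 3
mus = rng.normal(size=(n, d))
pis = rng.dirichlet(np.ones(n))
x = rng.normal(size=d)

J_formula = jacobian_closed_form(x, mus, pis)
J_numeric = jacobian_finite_diff(x, mus, pis)
max_gap = np.max(np.abs(J_formula - J_numeric))
print(max_gap)
```

The closed form is a nonnegative combination of rank-one matrices $(\mu_{i}-\mu_{j})(\mu_{i}-\mu_{j})^{\top}$, so it is symmetric and positive semidefinite, which is the property exploited in the rest of the proof.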

By the Cauchy-Schwarz inequality, we have $\|a\|^{2}+\|b\|^{2}\geq\frac{1}{2}\|a-b\|^{2}$, which implies

$$\begin{split}\mathbf{E}_{x}\left[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^{2}\right]&=\frac{1}{2}\mathbf{E}_{x}\left[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^{2}+\|\tilde{\bm{\psi}}_{\bm{\mu}}(-x)\|^{2}\right]\\
&\geq\frac{1}{4}\mathbf{E}_{x}\left[\left\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)-\tilde{\bm{\psi}}_{\bm{\mu}}(-x)\right\|^{2}\right]\\
&\geq\frac{1}{4}\mathbf{E}_{x}\left[\left\langle\tilde{\bm{\psi}}_{\bm{\mu}}(x)-\tilde{\bm{\psi}}_{\bm{\mu}}(-x),\overline{x}\right\rangle^{2}\right]\\
&=\frac{1}{4}\mathbf{E}_{x}\left[\left(\int_{t=-1}^{1}\frac{\partial}{\partial t}\langle\tilde{\bm{\psi}}_{\bm{\mu}}(tx),\overline{x}\rangle\,\mathrm{d}t\right)^{2}\right]\\
&=\frac{1}{4}\mathbf{E}_{x}\left[\left(\int_{t=-1}^{1}x^{\top}\nabla\tilde{\bm{\psi}}_{\bm{\mu}}(tx)\overline{x}\,\mathrm{d}t\right)^{2}\right]\\
&=\frac{1}{4}\mathbf{E}_{x}\left[\left(\int_{t=-1}^{1}\|x\|\cdot\overline{x}^{\top}\nabla\tilde{\bm{\psi}}_{\bm{\mu}}(tx)\overline{x}\,\mathrm{d}t\right)^{2}\right],\end{split}\tag{23}$$

where the first identity holds because $x$ and $-x$ have the same distribution, and we used $\frac{\partial}{\partial t}\tilde{\bm{\psi}}_{\bm{\mu}}(tx)=\nabla\tilde{\bm{\psi}}_{\bm{\mu}}(tx)x$ in the second to last identity. Careful readers might notice that the term $\left(\int_{t=-1}^{1}\|x\|\cdot\overline{x}^{\top}\nabla\tilde{\bm{\psi}}_{\bm{\mu}}(tx)\overline{x}\,\mathrm{d}t\right)^{2}$ is not well-defined when $x=0$, but its expectation over the whole probability space is still well-defined since the integrand is singular only on a set of measure zero.

For each $x\neq 0$, by (22) we have

$$\overline{x}^{\top}\nabla\tilde{\bm{\psi}}_{\bm{\mu}}(tx)\overline{x}=\frac{1}{2}\sum_{i,j\in[n]}\psi_{i}(tx)\psi_{j}(tx)\langle\mu_{i}-\mu_{j},\overline{x}\rangle^{2}.$$

So

$$\begin{split}&\quad\mathbf{E}_{x}\left[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^{2}\right]\\
&\geq\frac{1}{16}\mathbf{E}_{x}\left[\left(\int_{t=-1}^{1}\|x\|\sum_{i,j\in[n]}\psi_{i}(tx)\psi_{j}(tx)\langle\mu_{i}-\mu_{j},\overline{x}\rangle^{2}\,\mathrm{d}t\right)^{2}\right]\\
&=\frac{1}{16}\mathbf{E}_{x}\left[\left(\|x\|\sum_{i,j\in[n]}\langle\mu_{i}-\mu_{j},\overline{x}\rangle^{2}\int_{t=-1}^{1}\psi_{i}(tx)\psi_{j}(tx)\,\mathrm{d}t\right)^{2}\right]\\
&\geq\frac{1}{16}\mathbf{E}_{x}\left[\left(\|x\|\sum_{i,j\in[n]}\langle\mu_{i}-\mu_{j},\overline{x}\rangle^{2}\frac{1}{2\mu_{\max}\|x\|}\pi_{i}\pi_{j}\exp\left(-4\mu_{\max}^{2}\right)\left(1-\exp\left(-4\mu_{\max}\|x\|\right)\right)\right)^{2}\right]\\
&=\frac{\exp\left(-8\mu_{\max}^{2}\right)}{64}\mathbf{E}_{x}\left[\left(\sum_{i,j\in[n]}\pi_{i}\pi_{j}\langle\mu_{i}-\mu_{j},\overline{x}\rangle^{2}\frac{1-\exp\left(-4\mu_{\max}\|x\|\right)}{\mu_{\max}}\right)^{2}\right]\\
&\geq\frac{\exp\left(-8\mu_{\max}^{2}\right)}{64}\left(\sum_{i,j\in[n]}\pi_{i}\pi_{j}\mathbf{E}_{x}\left[\langle\mu_{i}-\mu_{j},\overline{x}\rangle^{2}\frac{1-\exp\left(-4\mu_{\max}\|x\|\right)}{\mu_{\max}}\right]\right)^{2},\end{split}\tag{24}$$

where we used Lemma 19 in the fourth line and the Cauchy-Schwarz inequality in the last line.

The last step is to lower bound $\mathbf{E}_{x}\left[\langle\mu_{i}-\mu_{j},\overline{x}\rangle^{2}\left(1-\exp\left(-4\mu_{\max}\|x\|\right)\right)/\mu_{\max}\right]$. Since $x$ is sampled from $\mathcal{N}(0,I_{d})$, which is spherically symmetric, we know that the two random variables $\{\overline{x},\|x\|\}$ are independent. Therefore

$$\mathbf{E}_{x}\left[\langle\mu_{i}-\mu_{j},\overline{x}\rangle^{2}\frac{1-\exp\left(-4\mu_{\max}\|x\|\right)}{\mu_{\max}}\right]=\mathbf{E}_{x}\left[\langle\mu_{i}-\mu_{j},\overline{x}\rangle^{2}\right]\mathbf{E}_{x}\left[\frac{1-\exp\left(-4\mu_{\max}\|x\|\right)}{\mu_{\max}}\right].\tag{25}$$

For the first term in (25), we have $\mathbf{E}_{x}\left[\langle\mu_{i}-\mu_{j},\overline{x}\rangle^{2}\right]=\|\mu_{i}-\mu_{j}\|^{2}/d$ since $\overline{x}$ is spherically symmetrically distributed. By the norm-concentration inequality for Gaussians [Dasgupta and Schulman, 2000] we know that $\Pr\left[\|x\|\geq\frac{\sqrt{d}}{2}\right]\geq 1/50$ for all $d$. The second term in (25) can therefore be lower bounded as

$$\mathbf{E}_{x}\left[\frac{1-\exp\left(-4\mu_{\max}\|x\|\right)}{\mu_{\max}}\right]\geq\Pr\left[\|x\|\geq\frac{\sqrt{d}}{2}\right]\frac{1-\exp\left(-4\mu_{\max}\cdot\frac{\sqrt{d}}{2}\right)}{\mu_{\max}}\geq\frac{1-\exp\left(-2\mu_{\max}\sqrt{d}\right)}{50\mu_{\max}}.\tag{26}$$

Plugging (26) into (25), we get

$$\mathbf{E}_{x}\left[\langle\mu_{i}-\mu_{j},\overline{x}\rangle^{2}\frac{1-\exp\left(-4\mu_{\max}\|x\|\right)}{\mu_{\max}}\right]\geq\frac{1-\exp\left(-2\mu_{\max}\sqrt{d}\right)}{50d\mu_{\max}}\|\mu_{i}-\mu_{j}\|^{2}.\tag{27}$$
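The three ingredients behind (27) can be illustrated with a small Monte Carlo experiment. This is a hedged sketch rather than a verification of the proof: the dimension `d`, the value `mu_max`, the sample size, and the vector `v` (standing in for $\mu_{i}-\mu_{j}$) are hypothetical choices, and $\overline{x}$ is taken to be $x/\|x\|$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, num_samples = 5, 200_000
mu_max = 0.7                       # hypothetical value of mu_max
v = rng.normal(size=d)             # stands in for mu_i - mu_j

x = rng.normal(size=(num_samples, d))       # x ~ N(0, I_d)
norms = np.linalg.norm(x, axis=1)
x_bar = x / norms[:, None]                  # unit direction of x

# First factor in (25): E[<v, x_bar>^2] should be close to ||v||^2 / d.
directional = (x_bar @ v) ** 2
first_factor = directional.mean()
first_factor_exact = v @ v / d

# Tail probability used in (26): Pr[||x|| >= sqrt(d)/2] >= 1/50.
tail_prob = np.mean(norms >= np.sqrt(d) / 2)

# Monte Carlo estimate of the left-hand side of (27) and its stated lower bound.
radial = (1.0 - np.exp(-4.0 * mu_max * norms)) / mu_max
lhs_27 = np.mean(directional * radial)
rhs_27 = (1.0 - np.exp(-2.0 * mu_max * np.sqrt(d))) / (50 * d * mu_max) * (v @ v)

print(first_factor, first_factor_exact, tail_prob, lhs_27, rhs_27)
```

In this setup the empirical tail probability is far above $1/50$, which suggests that the constant in (26) is chosen for convenience across all $d$ rather than for tightness.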

Now we can plug (27) into (24) and get

\[
\begin{split}
\mathbf{E}_{x}\left[\|\tilde{\bm{\psi}}_{\bm{\mu}}(x)\|^{2}\right]
&\geq\frac{\exp\left(-8\mu_{\max}^{2}\right)}{64}\left(\sum_{i,j\in[n]}\pi_{i}\pi_{j}\mathbf{E}_{x}\left[\langle\mu_{i}-\mu_{j},\overline{x}\rangle^{2}\frac{1-\exp\left(-4\mu_{\max}\|x\|\right)}{\mu_{\max}}\right]\right)^{2}\\
&\geq\frac{\exp\left(-8\mu_{\max}^{2}\right)}{64}\left(\sum_{i,j\in[n]}\pi_{i}\pi_{j}\frac{1-\exp\left(-2\mu_{\max}\sqrt{d}\right)}{50d\mu_{\max}}\|\mu_{i}-\mu_{j}\|^{2}\right)^{2}\\
&\geq\frac{\exp\left(-8\mu_{\max}^{2}\right)}{64}\left(\sum_{i,j\in[n]}\pi_{i}\pi_{j}\frac{1-\frac{1}{1+2\mu_{\max}\sqrt{d}}}{50d\mu_{\max}}\|\mu_{i}-\mu_{j}\|^{2}\right)^{2}\\
&=\frac{\exp\left(-8\mu_{\max}^{2}\right)}{40000d(1+2\mu_{\max}\sqrt{d})^{2}}\left(\sum_{i,j\in[n]}\pi_{i}\pi_{j}\|\mu_{i}-\mu_{j}\|^{2}\right)^{2}
\end{split}
\tag{28}
\]

where we used the inequality $e^{-t}\leq\frac{1}{1+t}$ for all $t\geq 0$ in the second-to-last line. ∎
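As a quick sanity check on the last two steps of (28), the following sketch verifies numerically that $e^{-t}\leq\frac{1}{1+t}$ on a grid of $t\geq 0$ and that the prefactor of the third line simplifies to the constant $\frac{1}{40000d(1+2\mu_{\max}\sqrt{d})^{2}}$ of the fourth line. The function names and the sample values of $\mu_{\max}$ and $d$ are illustrative assumptions and do not come from the paper; the common factor $\exp(-8\mu_{\max}^{2})$ is omitted because it appears on both sides.

```python
import math


def exp_bound_holds(t: float) -> bool:
    """Check e^{-t} <= 1 / (1 + t) at a single point t >= 0."""
    return math.exp(-t) <= 1.0 / (1.0 + t)


def third_line_factor(mu_max: float, d: int) -> float:
    """Prefactor of the third line of (28): (1/64) * ((1 - 1/(1+r)) / (50 d mu_max))^2."""
    r = 2.0 * mu_max * math.sqrt(d)
    inner = (1.0 - 1.0 / (1.0 + r)) / (50.0 * d * mu_max)
    return inner ** 2 / 64.0


def fourth_line_factor(mu_max: float, d: int) -> float:
    """Prefactor of the fourth line of (28): 1 / (40000 d (1 + r)^2)."""
    r = 2.0 * mu_max * math.sqrt(d)
    return 1.0 / (40000.0 * d * (1.0 + r) ** 2)
```

The two prefactors agree because $\frac{1-\frac{1}{1+r}}{50d\mu_{\max}}=\frac{1}{25\sqrt{d}(1+r)}$ with $r=2\mu_{\max}\sqrt{d}$, and $64\cdot 625=40000$.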

Theorem 2.

Consider training a student $n$-component GMM initialized from $\bm{\mu}(0)=(\mu_{1}(0)^{\top},\ldots,\mu_{n}(0)^{\top})^{\top}$ to learn a single-component ground truth GMM $\mathcal{N}(0,I_{d})$ with the population gradient EM algorithm. If the step size satisfies $\eta\leq O\left(\frac{\exp\left(-8n\mu_{\max}^{2}(0)\right)\pi_{\min}^{2}}{n^{2}d^{2}\left(\frac{1}{\mu_{\max}(0)}+\mu_{\max}(0)\right)^{2}}\right)$, then gradient EM converges globally with rate

\[
\mathcal{L}(\bm{\mu}(t))\leq\frac{1}{\sqrt{\gamma t}},
\]

where the constant $\gamma=\Omega\left(\frac{\eta\exp\left(-16n\mu_{\max}^{2}(0)\right)\pi_{\min}^{4}}{n^{2}d^{2}(1+\mu_{\max}(0)\sqrt{dn})^{4}}\right)\in\mathbf{R}^{+}$ and $\mu_{\max}(0)=\max\{\|\mu_{1}(0)\|,\ldots,\|\mu_{n}(0)\|\}$.
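To make the setting of the theorem concrete, here is a minimal sketch of gradient EM in the over-parameterized regime. It rests on several assumptions that are not taken from the theorem itself: $\psi_{i}(x)$ is taken to be the posterior membership weight $\pi_{i}e^{-\|x-\mu_{i}\|^{2}/2}/\sum_{k}\pi_{k}e^{-\|x-\mu_{k}\|^{2}/2}$, the per-component gradient is taken to be $\mathbf{E}_{x}\left[\psi_{i}(x)\sum_{k}\psi_{k}(x)\mu_{k}\right]$ (the expression used in the proof), and the population expectation is replaced by an average over a fixed sample from $\mathcal{N}(0,I_{d})$. All function names are hypothetical.

```python
import math
import random


def responsibilities(x, mus, pis):
    """Assumed psi_i(x): posterior weight of component i, computed stably."""
    logits = [
        math.log(p) - 0.5 * sum((a - b) ** 2 for a, b in zip(x, m))
        for m, p in zip(mus, pis)
    ]
    top = max(logits)  # shift by the max before exponentiating to avoid underflow
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def gradient_em_step(mus, pis, samples, eta):
    """One step mu_i <- mu_i - eta * mean_x[psi_i(x) * sum_k psi_k(x) mu_k]."""
    n, d = len(mus), len(mus[0])
    grads = [[0.0] * d for _ in range(n)]
    for x in samples:
        psi = responsibilities(x, mus, pis)
        # mixed = sum_k psi_k(x) mu_k, the posterior-weighted mean vector
        mixed = [sum(psi[k] * mus[k][c] for k in range(n)) for c in range(d)]
        for i in range(n):
            for c in range(d):
                grads[i][c] += psi[i] * mixed[c] / len(samples)
    return [[mus[i][c] - eta * grads[i][c] for c in range(d)] for i in range(n)]


def potential(mus):
    """U(mu) = sum_i ||mu_i||^2, the potential tracked in the proof."""
    return sum(c * c for m in mus for c in m)
```

Under these assumptions the sample-based potential $U$ is non-increasing for any step size $\eta\leq 2$, which mirrors condition (29); the sketch says nothing about the constants in the theorem.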

Proof.

We use mathematical induction to prove Theorem 2, by proving the following two conditions inductively:

\[
U(t)\leq n\mu_{\max}^{2}(0),\quad\forall t.
\tag{29}
\]
\[
\frac{1}{\mathcal{L}^{2}(\bm{\mu}(t))}\geq\gamma t+\frac{1}{\mathcal{L}^{2}(\bm{\mu}(0))},\quad\forall t.
\tag{30}
\]

Note that (30) directly implies the theorem, so now we just need to prove (29) and (30) together.

The induction base for $t=0$ is trivial. Now suppose the conditions hold for time step $t$, and consider $t+1$. By the induction hypothesis (29), we have $\|\mu_{i}(t)\|\leq\mu_{\max}(t)\leq\sqrt{n}\mu_{\max}(0),\ \forall t$.

Proof of (30). Since $\nabla_{\bm{\mu}}Q(\bm{\mu}|\bm{\mu})=\nabla_{\bm{\mu}}\mathcal{L}(\bm{\mu})$, we can apply the classical analysis of gradient descent [Nesterov et al., 2018] as

\[
\begin{split}
&\quad\mathcal{L}(\bm{\mu}(t+1))-\mathcal{L}(\bm{\mu}(t))\\
&=\mathcal{L}(\bm{\mu}(t)-\eta\nabla\mathcal{L}(\bm{\mu}(t)))-\mathcal{L}(\bm{\mu}(t))\\
&=-\int_{s=0}^{1}\left\langle\nabla\mathcal{L}(\bm{\mu}(t)-s\eta\nabla\mathcal{L}(\bm{\mu}(t))),\eta\nabla\mathcal{L}(\bm{\mu}(t))\right\rangle\mathrm{d}s\\
&=-\int_{s=0}^{1}\left\langle\nabla\mathcal{L}(\bm{\mu}(t)),\eta\nabla\mathcal{L}(\bm{\mu}(t))\right\rangle\mathrm{d}s+\int_{s=0}^{1}\left\langle\nabla\mathcal{L}(\bm{\mu}(t))-\nabla\mathcal{L}(\bm{\mu}(t)-s\eta\nabla\mathcal{L}(\bm{\mu}(t))),\eta\nabla\mathcal{L}(\bm{\mu}(t))\right\rangle\mathrm{d}s\\
&=-\eta\|\nabla\mathcal{L}(\bm{\mu}(t))\|^{2}+\eta\int_{s=0}^{1}\left\langle\nabla\mathcal{L}(\bm{\mu}(t))-\nabla\mathcal{L}(\bm{\mu}(t)-s\eta\nabla\mathcal{L}(\bm{\mu}(t))),\nabla\mathcal{L}(\bm{\mu}(t))\right\rangle\mathrm{d}s
\end{split}
\tag{31}
\]

Note that the gradient norm can be upper bounded as

\[
\begin{split}
\|\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t))\|
&=\left\|\mathbf{E}_{x}\left[\psi_{i}(x)\sum_{k\in[n]}\psi_{k}(x)\mu_{k}(t)\right]\right\|\leq\mathbf{E}_{x}\left[\psi_{i}(x)\sum_{k\in[n]}\psi_{k}(x)\left\|\mu_{k}(t)\right\|\right]\\
&\leq\sum_{k}\|\mu_{k}(t)\|\leq\sqrt{nU(t)}\leq n\mu_{\max}(0).
\end{split}
\]
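The middle step of this chain is the Cauchy-Schwarz inequality $\sum_{k}\|\mu_{k}\|\leq\sqrt{n\sum_{k}\|\mu_{k}\|^{2}}=\sqrt{nU}$. The sketch below evaluates the three quantities for an arbitrary list of vectors so the ordering can be checked numerically. It is illustrative only: the helper names are hypothetical, and the last quantity uses the largest norm of the given vectors, whereas the proof bounds $U(t)$ through the induction hypothesis (29) to obtain $n\mu_{\max}(0)$.

```python
import math


def norm(vector):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(c * c for c in vector))


def gradient_bound_chain(mus):
    """Return (sum_k ||mu_k||, sqrt(n * U), n * max_k ||mu_k||) for the given means."""
    n = len(mus)
    norms = [norm(m) for m in mus]
    sum_of_norms = sum(norms)
    potential = sum(a * a for a in norms)  # U = sum_k ||mu_k||^2
    return sum_of_norms, math.sqrt(n * potential), n * max(norms)
```

For example, `gradient_bound_chain([[3.0, 4.0], [0.0, 0.0]])` gives a sum of norms of 5, followed by $\sqrt{2\cdot 25}\approx 7.07$ and $2\cdot 5=10$.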

Then for any $s\in[0,1]$, we have $\|s\eta\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t))\|\leq\eta n\mu_{\max}(0)\leq\frac{1}{\max\left\{6d,2\|\mu_{i}(t)\|\right\}}$. So we can apply Theorem 13 and get

\[
\begin{split}
&\quad\|\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t))-\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t)-s\eta\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t)))\|\\
&\leq n\mu_{\max}(t)(30\sqrt{d}+4\mu_{\max}(t))\|s\eta\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t))\|+\sum_{k\in[n]}\|s\eta\nabla_{\mu_{k}}\mathcal{L}(\bm{\mu}(t))\|.
\end{split}
\]

Therefore, for all $s\in[0,1]$,

\[
\begin{split}
&\quad\left\langle\nabla\mathcal{L}(\bm{\mu}(t))-\nabla\mathcal{L}(\bm{\mu}(t)-s\eta\nabla\mathcal{L}(\bm{\mu}(t))),\nabla\mathcal{L}(\bm{\mu}(t))\right\rangle\\
&\leq\sum_{i\in[n]}\|\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t))-\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t)-s\eta\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t)))\|\cdot\|\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t))\|\\
&\leq\sum_{i\in[n]}\left(n\mu_{\max}(t)(30\sqrt{d}+4\mu_{\max}(t))\|s\eta\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t))\|+\sum_{k\in[n]}\|s\eta\nabla_{\mu_{k}}\mathcal{L}(\bm{\mu}(t))\|\right)\|\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t))\|\\
&\leq\eta\left(n\mu_{\max}(t)(30\sqrt{d}+4\mu_{\max}(t))+n^{2}\right)\|\nabla\mathcal{L}(\bm{\mu}(t))\|^{2}\\
&\leq\eta\left(4n^{2}\mu_{\max}(0)^{2}+30\sqrt{d}n^{3/2}\mu_{\max}(0)+n^{2}\right)\|\nabla\mathcal{L}(\bm{\mu}(t))\|^{2}\\
&\leq 20\eta\sqrt{d}n^{2}(\mu_{\max}^{2}(0)+1)\|\nabla\mathcal{L}(\bm{\mu}(t))\|^{2}.
\end{split}
\tag{32}
\]
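The last step of (32) absorbs three terms into a single constant. One way to see it: $2\mu_{\max}(0)\leq\mu_{\max}^{2}(0)+1$ and $n^{3/2}\leq n^{2}$ give $30\sqrt{d}n^{3/2}\mu_{\max}(0)\leq 15\sqrt{d}n^{2}(\mu_{\max}^{2}(0)+1)$, while $4n^{2}\mu_{\max}^{2}(0)+n^{2}\leq 4\sqrt{d}n^{2}(\mu_{\max}^{2}(0)+1)$, for a total of at most $19\sqrt{d}n^{2}(\mu_{\max}^{2}(0)+1)$. The sketch below compares the two constants numerically; the function names are hypothetical and the argument `mu0` stands for $\mu_{\max}(0)$.

```python
import math


def middle_constant(n: int, d: int, mu0: float) -> float:
    """4 n^2 mu0^2 + 30 sqrt(d) n^{3/2} mu0 + n^2, from the fifth line of (32)."""
    return 4.0 * n ** 2 * mu0 ** 2 + 30.0 * math.sqrt(d) * n ** 1.5 * mu0 + n ** 2


def final_constant(n: int, d: int, mu0: float) -> float:
    """20 sqrt(d) n^2 (mu0^2 + 1), from the last line of (32)."""
    return 20.0 * math.sqrt(d) * n ** 2 * (mu0 ** 2 + 1.0)


def constant_step_holds(n: int, d: int, mu0: float) -> bool:
    """True when the fifth-line constant is at most the last-line constant."""
    return middle_constant(n, d, mu0) <= final_constant(n, d, mu0)
```

The comparison is tightest at $n=d=1$ and $\mu_{\max}(0)=1$, where the two constants are 35 and 40.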

Plugging (32) into (31), since $\eta\leq O\left(\frac{1}{\sqrt{d}n^{2}(\mu_{\max}^{2}(0)+1)}\right)$ we have

\[
\mathcal{L}(\bm{\mu}(t+1))-\mathcal{L}(\bm{\mu}(t))\leq-\eta\|\nabla\mathcal{L}(\bm{\mu}(t))\|^{2}+20\eta^{2}\sqrt{d}n^{2}(\mu_{\max}^{2}(0)+1)\|\nabla\mathcal{L}(\bm{\mu}(t))\|^{2}\leq-\frac{\eta}{2}\|\nabla\mathcal{L}(\bm{\mu}(t))\|^{2}.
\tag{33}
\]

By Lemma 12 we can lower bound the gradient norm as

\[
\begin{split}
\|\nabla\mathcal{L}(\bm{\mu}(t))\|
&\geq\frac{\left\langle\nabla\mathcal{L}(\bm{\mu}(t)),\bm{\mu}(t)\right\rangle}{\|\bm{\mu}(t)\|}\geq\frac{\left\langle\nabla\mathcal{L}(\bm{\mu}(t)),\bm{\mu}(t)\right\rangle}{n\mu_{\max}(t)}\geq\Omega\left(\frac{\exp\left(-8\mu_{\max}^{2}(t)\right)\pi_{\min}^{2}}{nd(1+\mu_{\max}(t)\sqrt{d})^{2}}\right)\mu_{\max}^{3}(t)\\
&\overset{\text{Theorem 14}}{\geq}\Omega\left(\frac{\exp\left(-8\mu_{\max}^{2}(t)\right)\pi_{\min}^{2}}{nd(1+\mu_{\max}(t)\sqrt{d})^{2}}\right)\left(2\mathcal{L}(\bm{\mu}(t))\right)^{3/2}\geq\Omega\left(\frac{\exp\left(-8n\mu_{\max}^{2}(0)\right)\pi_{\min}^{2}}{nd(1+\mu_{\max}(0)\sqrt{dn})^{2}}\right)\mathcal{L}^{3/2}(\bm{\mu}(t)).
\end{split}
\tag{34}
\]

Combining (34) and (33), we have

\[
\mathcal{L}(\bm{\mu}(t+1))\leq\mathcal{L}(\bm{\mu}(t))-\frac{\eta}{2}\|\nabla\mathcal{L}(\bm{\mu}(t))\|^{2}\leq\mathcal{L}(\bm{\mu}(t))-\Omega\left(\frac{\eta\exp\left(-16n\mu_{\max}^{2}(0)\right)\pi_{\min}^{4}}{n^{2}d^{2}(1+\mu_{\max}(0)\sqrt{dn})^{4}}\right)\mathcal{L}^{3}(\bm{\mu}(t)).
\tag{35}
\]

Note that the above inequality implies $\mathcal{L}(\bm{\mu}(t+1))\leq\mathcal{L}(\bm{\mu}(t))$; therefore

\[
\begin{split}
&\quad\frac{1}{\mathcal{L}^{2}(\bm{\mu}(t+1))}-\frac{1}{\mathcal{L}^{2}(\bm{\mu}(t))}=\frac{(\mathcal{L}(\bm{\mu}(t))-\mathcal{L}(\bm{\mu}(t+1)))(\mathcal{L}(\bm{\mu}(t))+\mathcal{L}(\bm{\mu}(t+1)))}{\mathcal{L}^{2}(\bm{\mu}(t))\mathcal{L}^{2}(\bm{\mu}(t+1))}\\
&\geq\frac{(\mathcal{L}(\bm{\mu}(t))-\mathcal{L}(\bm{\mu}(t+1)))\mathcal{L}(\bm{\mu}(t))}{\mathcal{L}^{4}(\bm{\mu}(t))}\overset{(35)}{\geq}\Omega\left(\frac{\eta\exp\left(-16n\mu_{\max}^{2}(0)\right)\pi_{\min}^{4}}{n^{2}d^{2}(1+\mu_{\max}(0)\sqrt{dn})^{4}}\right)=\gamma.
\end{split}
\]

On the other hand, the induction hypothesis gives $\frac{1}{\mathcal{L}^{2}(\bm{\mu}(t))}\geq\gamma t+\frac{1}{\mathcal{L}^{2}(\bm{\mu}(0))}$. Combining this with the above inequality, we have $\frac{1}{\mathcal{L}^{2}(\bm{\mu}(t+1))}\geq\frac{1}{\mathcal{L}^{2}(\bm{\mu}(t))}+\gamma\geq\gamma(t+1)+\frac{1}{\mathcal{L}^{2}(\bm{\mu}(0))}$, which finishes the proof of (30).
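The mechanism behind the $O(1/\sqrt{t})$ rate can be illustrated on a scalar sequence. The sketch below iterates the extremal case of (35), $\mathcal{L}_{t+1}=\mathcal{L}_{t}-\gamma\mathcal{L}_{t}^{3}$, and checks both the invariant (30) and the resulting bound $\mathcal{L}_{t}\leq\frac{1}{\sqrt{\gamma t}}$. The function names, the starting value and the value of $\gamma$ are illustrative assumptions (chosen so that $\gamma\mathcal{L}_{0}^{2}<1$ and the sequence stays positive); they are unrelated to the actual constants of the theorem.

```python
import math


def simulate_cubic_recursion(loss0, gamma, steps):
    """Iterate L_{t+1} = L_t - gamma * L_t**3 and return [L_0, ..., L_steps]."""
    losses = [loss0]
    for _ in range(steps):
        current = losses[-1]
        losses.append(current - gamma * current ** 3)
    return losses


def rates_hold(losses, gamma):
    """Check 1/L_t^2 >= gamma*t + 1/L_0^2 for all t, and L_t <= 1/sqrt(gamma*t) for t >= 1."""
    inverse_square_start = 1.0 / losses[0] ** 2
    for t, loss in enumerate(losses):
        if 1.0 / loss ** 2 < gamma * t + inverse_square_start - 1e-9:
            return False
        if t >= 1 and loss > 1.0 / math.sqrt(gamma * t):
            return False
    return True
```

A cubic decrease per step is what makes $1/\mathcal{L}^{2}$ grow at least linearly in $t$; a decrease proportional to $\mathcal{L}$ itself would instead give a linear (geometric) rate.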

Proof of (29). The dynamics of the potential function $U$ can be calculated as

\[
\begin{split}
&\quad U(\bm{\mu}(t+1))=\sum_{i\in[n]}\left\|\mu_{i}(t+1)\right\|^{2}\\
&=\sum_{i\in[n]}\left\|\mu_{i}(t)-\eta\nabla_{\mu_{i}}Q(\bm{\mu}(t)|\bm{\mu}(t))\right\|^{2}\\
&=U(\bm{\mu}(t))-2\eta\sum_{i\in[n]}\left\langle\mu_{i}(t),\nabla_{\mu_{i}}Q(\bm{\mu}(t)|\bm{\mu}(t))\right\rangle+\eta^{2}\sum_{i\in[n]}\|\nabla_{\mu_{i}}Q(\bm{\mu}(t)|\bm{\mu}(t))\|^{2}\\
&\overset{\text{Corollary 10}}{=}U(\bm{\mu}(t))-\underbrace{2\eta\mathbf{E}_{x}\left[\|\tilde{\bm{\psi}}_{\bm{\mu}(t)}(x)\|^{2}\right]}_{I_{1}}+\underbrace{\eta^{2}\sum_{i\in[n]}\|\nabla_{\mu_{i}}Q(\bm{\mu}(t)|\bm{\mu}(t))\|^{2}}_{I_{2}}.
\end{split}
\tag{36}
\]

The first term $I_1$ can be bounded by Lemma 12 as

$$
I_1 \geq \eta\,\Omega\left(\frac{\exp\left(-8\mu_{\max}^2(t)\right)\pi_{\min}^2}{d\left(1+\mu_{\max}(t)\sqrt{d}\right)^2}\right)\mu_{\max}^4(t) \geq \eta\,\Omega\left(\frac{\exp\left(-8n\mu_{\max}^2(0)\right)\pi_{\min}^2}{n^2 d\left(1+\mu_{\max}(0)\sqrt{nd}\right)^2}\right) U^2(\bm{\mu}(t)).
\tag{37}
$$

The second term $I_2$ is a perturbation term that can be upper bounded by Lemma 9 as

$$
\begin{split}
I_2 &= \eta^2 \sum_{i\in[n]} \|\nabla_{\mu_i} Q(\bm{\mu}(t)|\bm{\mu}(t))\|^2 = \eta^2 \sum_{i\in[n]} \left\|\mathbf{E}_x\left[\psi_i(x)\sum_{k\in[n]}\psi_k(x)\mu_k(t)\right]\right\|^2 \\
&\leq \eta^2 \sum_{i\in[n]} \mathbf{E}_x\left[\left\|\psi_i(x)\sum_{k\in[n]}\psi_k(x)\mu_k(t)\right\|\right]^2 \\
&\leq \eta^2 \sum_{i\in[n]} \mathbf{E}_x\left[\psi_i(x)\sum_{k\in[n]}\psi_k(x)\left\|\mu_k(t)\right\|\right]^2 \\
&\leq \eta^2 \sum_{i\in[n]} \mathbf{E}_x\left[\sqrt{\left(\sum_{k\in[n]}\psi_i^2(x)\psi_k^2(x)\right)\left(\sum_{k\in[n]}\|\mu_k(t)\|^2\right)}\right]^2 \\
&\leq \eta^2 \sum_{i\in[n]} \mathbf{E}_x\left[\sum_{k\in[n]}\psi_i^2(x)\psi_k^2(x)\right]\mathbf{E}_x\left[\sum_{k\in[n]}\|\mu_k(t)\|^2\right] \\
&= \eta^2 U(\bm{\mu}(t))\,\mathbf{E}_x\left[\sum_{i\in[n]}\sum_{k\in[n]}\psi_i^2(x)\psi_k^2(x)\right] \\
&\leq \eta^2 U(\bm{\mu}(t))\,\mathbf{E}_x\left[\left(\sum_{i\in[n]}\psi_i(x)\right)\left(\sum_{k\in[n]}\psi_k(x)\right)\right] \\
&= \eta^2 U(\bm{\mu}(t)).
\end{split}
\tag{38}
$$

where we use the triangle inequality in the second and third lines, and the Cauchy-Schwarz inequality in the fourth and fifth lines. The last two steps use that the weights $\psi_i(x)$ are nonnegative and sum to one.
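The final steps of (38) depend only on the posterior weights being nonnegative and summing to one, which gives $\sum_i\sum_k \psi_i^2\psi_k^2 = (\sum_i \psi_i^2)^2 \le (\sum_i \psi_i)^2 = 1$. A minimal numerical sketch of this fact follows. The helper names `responsibilities` and `double_sum_of_squares`, the equal mixing weights, and the one-dimensional means are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def responsibilities(x, mus):
    """Posterior weights psi_i(x) for unit-variance components in 1-D,
    assuming equal mixing weights (an illustrative simplification)."""
    logits = [-0.5 * (x - m) ** 2 for m in mus]
    top = max(logits)  # subtract the max for numerical stability
    w = [math.exp(v - top) for v in logits]
    total = sum(w)
    return [v / total for v in w]


def double_sum_of_squares(psi):
    """Compute sum_i sum_k psi_i^2 * psi_k^2."""
    return sum(a * a * b * b for a in psi for b in psi)


# Sample random configurations and record the largest value of the double sum.
rng = random.Random(0)
worst = 0.0
for _ in range(1000):
    mus = [rng.uniform(-5.0, 5.0) for _ in range(4)]
    x = rng.gauss(0.0, 1.0)
    worst = max(worst, double_sum_of_squares(responsibilities(x, mus)))

# The double sum should never exceed (sum_i psi_i)^2 = 1, up to rounding.
print(worst <= 1.0 + 1e-12)
```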

Putting (38), (37) and (36) together, we get

$$
U(\bm{\mu}(t+1)) \leq U(\bm{\mu}(t)) - \eta\,\Omega\left(\frac{\exp\left(-8n\mu_{\max}^2(0)\right)\pi_{\min}^2}{n^2 d\left(1+\mu_{\max}(0)\sqrt{nd}\right)^2}\right) U^2(\bm{\mu}(t)) + \eta^2 U(\bm{\mu}(t)).
$$

Consider two cases:

a). If $\frac{n}{2}\mu_{\max}^2(0) \leq U(\bm{\mu}(t)) \leq n\mu_{\max}^2(0)$, then

$$
\begin{split}
U(\bm{\mu}(t+1)) &\leq U(\bm{\mu}(t)) - \eta U(\bm{\mu}(t))\left(\Omega\left(\frac{\exp\left(-8n\mu_{\max}^2(0)\right)\pi_{\min}^2}{n^2 d\left(1+\mu_{\max}(0)\sqrt{nd}\right)^2}\right)U(\bm{\mu}(t)) - \eta\right) \\
&\leq U(\bm{\mu}(t)) - \eta U(\bm{\mu}(t))\left(\Omega\left(\frac{\exp\left(-8n\mu_{\max}^2(0)\right)\pi_{\min}^2}{n^2 d\left(1+\mu_{\max}(0)\sqrt{nd}\right)^2}\right)\frac{n}{2}\mu_{\max}^2(0) - \eta\right) \leq U(\bm{\mu}(t)) \leq n\mu_{\max}^2(0),
\end{split}
$$

note that we used $\eta \leq O\left(\frac{\exp\left(-8n\mu_{\max}^2(0)\right)\pi_{\min}^2}{n^2 d\left(1+\mu_{\max}(0)\sqrt{nd}\right)^2}\right)\frac{n}{2}\mu_{\max}^2(0)$.

b). If $\frac{n}{2}\mu_{\max}^2(0) > U(\bm{\mu}(t))$, then $U(\bm{\mu}(t+1)) \leq (1+\eta^2)U(\bm{\mu}(t)) \leq 2U(\bm{\mu}(t)) \leq n\mu_{\max}^2(0)$.

Since (29) holds in both cases, our proof is done. ∎
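The two-case argument can be illustrated by iterating the worst case of the one-step inequality, $u_{t+1} = u_t - \eta c\,u_t^2 + \eta^2 u_t$, where $c$ stands in for the $\Omega(\cdot)$ factor. The sketch below is a toy under stated assumptions: the function name `iterate_potential`, the values of `c` and `bound` (the latter playing the role of $n\mu_{\max}^2(0)$), and the step size $\eta = \min(1, c \cdot \texttt{bound}/2)$ are all hypothetical choices made to mirror the condition on $\eta$ used above.

```python
def iterate_potential(u0, c, bound, steps):
    """Iterate u <- u - eta*c*u^2 + eta^2*u with eta = min(1, c*bound/2).

    c is a stand-in for the Omega(.) constant and bound for n*mu_max(0)^2;
    both are hypothetical values chosen only for illustration.
    """
    eta = min(1.0, c * bound / 2.0)
    traj = [u0]
    u = u0
    for _ in range(steps):
        u = u - eta * c * u * u + eta * eta * u
        traj.append(u)
    return eta, traj


# Start at the upper end of case a) and well inside case b).
eta, traj = iterate_potential(u0=4.0, c=0.1, bound=4.0, steps=200)
eta_small, traj_small = iterate_potential(u0=0.5, c=0.1, bound=4.0, steps=200)
print(eta, max(traj), max(traj_small))
```

In this toy run neither trajectory exceeds `bound`. The recursion on its own only illustrates boundedness of the potential; it says nothing about decay to zero.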

B.2 Proofs for Section 3.2

Lemma 16.

For any $\bm{\mu}$ satisfying $\|\mu_1\| \leq \sqrt{d}$ and $\|\mu_2\|, \|\mu_3\|, \ldots, \|\mu_n\| \geq 10\sqrt{d}$, the gradient of $\mathcal{L}$ at $\bm{\mu}$ can be upper bounded as

$$
\|\nabla_{\mu_i}\mathcal{L}(\bm{\mu})\| \leq 2\|\mu_1\| + 2\exp(-d)\sum_{i\neq 1}\|\mu_i\|, \quad \forall i\in[n].
$$
Proof.

Recall that the gradient has the form $\nabla_{\mu_i}\mathcal{L}(\bm{\mu}) = \mathbf{E}_x\left[\psi_i(x)\sum_{k\in[n]}\psi_k(x)\mu_k\right]$, hence its norm can be upper bounded as

$$
\begin{split}
\|\nabla_{\mu_i}\mathcal{L}(\bm{\mu})\| &\leq \mathbf{E}_x\left[\psi_i(x)\sum_{k\in[n]}\psi_k(x)\|\mu_k\|\right] \\
&\leq \mathbf{E}_x\left[\sum_{k\in[n]}\psi_k(x)\|\mu_k\| \,\Bigg|\, \|x\|\leq 2\sqrt{d}\right] + \mathbf{E}_x\left[\sum_{k\in[n]}\psi_k(x)\|\mu_k\| \,\Bigg|\, \|x\| > 2\sqrt{d}\right]\Pr\left[\|x\| > 2\sqrt{d}\right].
\end{split}
\tag{39}
$$

For any $\|x\| \leq 2\sqrt{d}$, we have $\exp(-\|x-\mu_1\|^2/2) \geq \exp(-(\|x\|+\|\mu_1\|)^2/2) \geq \exp(-9d/2)$, while for all $i \neq 1$, $\exp(-\|x-\mu_i\|^2/2) \leq \exp(-(\|\mu_i\|-\|x\|)^2/2) \leq \exp(-(10\sqrt{d}-2\sqrt{d})^2/2) = \exp(-32d)$. Since $\psi_i(x) \propto \exp(-\|x-\mu_i\|^2/2)$, we have

$$
\|x\| \leq 2\sqrt{d} \;\Rightarrow\; \psi_i(x) \leq \frac{\exp(-\|x-\mu_i\|^2/2)}{\exp(-\|x-\mu_1\|^2/2)} \leq \frac{\exp(-32d)}{\exp(-9d/2)} \leq \exp(-25d), \quad \forall i\neq 1.
$$

Therefore the first term in (39) can be bounded as $\mathbf{E}_x\left[\sum_{k\in[n]}\psi_k(x)\|\mu_k\| \,\Big|\, \|x\|\leq 2\sqrt{d}\right] \leq \|\mu_1\| + \exp(-25d)\sum_{i\neq 1}\|\mu_i\|$, where we also used $\psi_1(x) \leq 1$.

On the other hand, by the tail bound on the norm of Gaussian vectors (see Lemma 8 of [Yan et al., ]) we have $\Pr\left[\|x\| > 2\sqrt{d}\right] \leq \exp(-d)$. Putting everything together, (39) can be further bounded as

$$
\|\nabla_{\mu_i}\mathcal{L}(\bm{\mu})\| \leq \|\mu_1\| + \exp(-25d)\sum_{i\neq 1}\|\mu_i\| + \exp(-d)\sum_{i\in[n]}\|\mu_i\| \leq 2\|\mu_1\| + 2\exp(-d)\sum_{i\neq 1}\|\mu_i\|.
$$
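The key constant in this proof is the bound $\psi_i(x) \leq \exp(-25d)$ for far components when $\|x\| \leq 2\sqrt{d}$. The following sketch checks it on a grid in the simplest case $d = 1$, with one near component and two far ones. It assumes equal mixing weights; the function names and the grid resolution are illustrative choices, not part of the paper.

```python
import math


def responsibilities(x, mus):
    """Posterior weights psi_i(x) for unit-variance components in 1-D,
    assuming equal mixing weights (an illustrative simplification)."""
    logits = [-0.5 * (x - m) ** 2 for m in mus]
    top = max(logits)  # subtract the max for numerical stability
    w = [math.exp(v - top) for v in logits]
    total = sum(w)
    return [v / total for v in w]


def max_far_responsibility(grid=401):
    """Largest psi_i(x), i != 1, over a grid of |x| <= 2 and |mu_1| <= 1
    (the case d = 1), with far components placed at +10 and -10."""
    worst = 0.0
    for a in range(grid):
        x = -2.0 + 4.0 * a / (grid - 1)
        for b in range(21):
            mu1 = -1.0 + 2.0 * b / 20
            psi = responsibilities(x, [mu1, 10.0, -10.0])
            worst = max(worst, psi[1], psi[2])
    return worst


worst_far = max_far_responsibility()
# Compare against the claimed bound exp(-25 d) with d = 1.
print(worst_far <= math.exp(-25))
```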

Theorem 7.

For any $n = 2l+1$, consider gradient EM initialized at the point $\mu_1(0) = 0$, $\mu_2(0) = \cdots = \mu_{l+1}(0) = 12\sqrt{d}\,e_1$, $\mu_{l+2}(0) = \cdots = \mu_{2l+1}(0) = -12\sqrt{d}\,e_1$, where $e_1 = (1, 0, \ldots, 0)^{\top}$ is a standard unit vector. Then population gradient EM will be trapped in a bad local region around $\bm{\mu}(0)$ for an exponentially long time $T = \frac{1}{15n\eta}e^{d} = \frac{1}{15n\eta}\exp(\Theta(\mu_{\max}^2(0)))$. More precisely, for any $0 \leq t \leq T$, we have

$$
\|\mu_i(t)\| \geq 10\sqrt{d}, \quad \forall\, i \neq 1.
$$
Proof.

We prove the following statements inductively for all $0 \leq t \leq T$:

$$
\mu_1(t) = 0, \quad \mu_2(t) = \ldots = \mu_{l+1}(t) = -\mu_{l+2}(t) = \ldots = -\mu_{2l+1}(t),
\tag{40}
$$

$$
\forall i, \;\; \|\mu_i(t) - \mu_i(0)\| \leq \eta t\left(30\sqrt{d}\,n e^{-d}\right).
\tag{41}
$$

(40) states that during the gradient EM updates, $\mu_1$ remains stationary at $0$, while the symmetry between $\mu_2, \ldots, \mu_n$ is preserved.
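As a quick check on the constants (a worked step spelled out here for convenience, using only the definition of $T$ and the initialization in the theorem statement), (41) implies the claimed bound: for any $i \neq 1$ and $0 \leq t \leq T$,

$$
\|\mu_i(t)\| \geq \|\mu_i(0)\| - \|\mu_i(t) - \mu_i(0)\| \geq 12\sqrt{d} - \eta T \cdot 30\sqrt{d}\,n e^{-d} = 12\sqrt{d} - \frac{e^{d}}{15 n \eta}\cdot 30\,\eta\sqrt{d}\,n e^{-d} = 12\sqrt{d} - 2\sqrt{d} = 10\sqrt{d}.
$$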

The induction base is trivial. Now suppose (40) and (41) hold for $0, 1, \ldots, t$; we prove the case for $t+1$.

Proof of (40). By the induction hypothesis, one can see from direct calculation that for all $x$, $\psi_1(x|\bm{\mu}(t)) = \psi_1(-x|\bm{\mu}(t))$ and $\psi_2(x|\bm{\mu}(t)) = \cdots = \psi_{l+1}(x|\bm{\mu}(t)) = \psi_{l+2}(-x|\bm{\mu}(t)) = \cdots = \psi_{2l+1}(-x|\bm{\mu}(t))$. For ease of notation, we denote $\psi_1(x) \coloneqq \psi_1(x|\bm{\mu}(t))$, $\psi_+(x) \coloneqq \psi_2(x|\bm{\mu}(t)) = \cdots = \psi_{l+1}(x|\bm{\mu}(t))$, $\psi_-(x) \coloneqq \psi_{l+2}(x|\bm{\mu}(t)) = \cdots = \psi_{2l+1}(x|\bm{\mu}(t))$, $\mu^+ \coloneqq \mu_2(t) = \ldots = \mu_{l+1}(t)$, and $\mu^- \coloneqq \mu_{l+2}(t) = \ldots = \mu_{2l+1}(t)$. Then $\mu^+ = -\mu^-$ and $\psi_+(x) = \psi_-(-x)$. Consequently

$$
\begin{split}
\nabla_{\mu_1}\mathcal{L}(\bm{\mu}(t)) &= \mathbf{E}_x\left[\psi_1(x|\bm{\mu}(t))\sum_{k\in[n]}\psi_k(x|\bm{\mu}(t))\mu_k(t)\right] = \mathbf{E}_x\left[\psi_1(x)\left(l\psi_+(x)\mu^+ + l\psi_-(x)\mu^-\right)\right] \\
&= \frac{1}{2}\mathbf{E}_x\left[\psi_1(x)\left(l\psi_+(x)\mu^+ + l\psi_-(x)\mu^-\right) + \psi_1(-x)\left(l\psi_+(-x)\mu^+ + l\psi_-(-x)\mu^-\right)\right] \\
&= \frac{1}{2}\mathbf{E}_x\left[\psi_1(x)\left(l\psi_+(x)(\mu^+ + \mu^-) + l\psi_-(x)(\mu^+ + \mu^-)\right)\right] = 0 \;\Rightarrow\; \mu_1(t+1) = \mu_1(t) = 0.
\end{split}
$$

Similarly, for all $2\leq i\leq l+1$ and $l+2\leq j\leq 2l+1$ we have

$$\begin{split}
&\quad\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t))=\mathbf{E}_{x}\left[\psi_{i}(x|\bm{\mu}(t))\sum_{k\in[n]}\psi_{k}(x|\bm{\mu}(t))\mu_{k}(t)\right]=\mathbf{E}_{x}\left[\psi_{+}(x)(l\psi_{+}(x)\mu^{+}+l\psi_{-}(x)\mu^{-})\right]\\
&=\mathbf{E}_{x}\left[\psi_{+}(-x)(l\psi_{+}(-x)\mu^{+}+l\psi_{-}(-x)\mu^{-})\right]=-\mathbf{E}_{x}\left[\psi_{-}(x)(l\psi_{+}(x)\mu^{+}+l\psi_{-}(x)\mu^{-})\right]=-\nabla_{\mu_{j}}\mathcal{L}(\bm{\mu}(t)).
\end{split}$$

This indicates that $\nabla_{\mu_{2}}\mathcal{L}(\bm{\mu}(t))=\cdots=\nabla_{\mu_{l+1}}\mathcal{L}(\bm{\mu}(t))=-\nabla_{\mu_{l+2}}\mathcal{L}(\bm{\mu}(t))=\cdots=-\nabla_{\mu_{2l+1}}\mathcal{L}(\bm{\mu}(t))$. Combined with the induction hypothesis, this gives $\mu_{2}(t+1)=\cdots=\mu_{l+1}(t+1)=-\mu_{l+2}(t+1)=\cdots=-\mu_{2l+1}(t+1)$, which proves (40).
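The symmetry argument behind (40) can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes $x\sim\mathcal{N}(0,I_d)$, uniform mixing weights, and identity covariances, so that $\psi_k$ is a softmax of $-\|x-\mu_k\|^2/2$. The function names `posteriors` and `grad_estimate` are hypothetical. The sample is made symmetric under $x\mapsto -x$ so that the cancellations hold up to floating-point error rather than Monte Carlo error.

```python
import numpy as np

def posteriors(x, mu):
    """psi_k(x | mu) for each sample: softmax over k of -||x - mu_k||^2 / 2.

    Assumes uniform mixing weights and identity covariances (illustrative only).
    x has shape (m, d), mu has shape (n, d); the result has shape (m, n).
    """
    sq = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    logits = -0.5 * sq
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def grad_estimate(x, mu):
    """Sample average of psi_i(x) * sum_k psi_k(x) mu_k, one row per component i."""
    psi = posteriors(x, mu)           # (m, n)
    mix = psi @ mu                    # (m, d): sum_k psi_k(x) mu_k
    return psi.T @ mix / x.shape[0]   # (n, d)

rng = np.random.default_rng(0)
d, l = 3, 2
v = rng.normal(size=d)
# Symmetric configuration: mu_1 = 0, l copies of v, l copies of -v (n = 2l + 1).
mu = np.vstack([np.zeros(d)] + [v] * l + [-v] * l)

z = rng.normal(size=(2000, d))
x = np.vstack([z, -z])                # sample symmetric under x -> -x

g = grad_estimate(x, mu)
print(np.abs(g[0]).max())             # gradient for mu_1: numerically zero
print(np.abs(g[1] + g[1 + l]).max())  # plus-group and minus-group gradients cancel
```

Because the gradients within each group coincide and the two groups have opposite gradients, one step of the same size applied to every component keeps the configuration symmetric, which is the content of (40).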

Proof of (41).

By the induction hypothesis, we have $\forall i,\;\|\mu_{i}(t)-\mu_{i}(0)\|\leq\eta t\cdot(30\sqrt{d}ne^{-d})\leq\eta T\cdot(30\sqrt{d}ne^{-d})\leq 2\sqrt{d}.$ So $\forall i\neq 1$, $\|\mu_{i}(t)\|\leq\|\mu_{i}(0)\|+2\sqrt{d}<15\sqrt{d}$. Then by Lemma 16, $\forall i\in[n]$ we have

$$\|\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t))\|\leq 2\|\mu_{1}(t)\|+2\exp(-d)\sum_{i\neq 1}\|\mu_{i}(t)\|\leq 2n\exp(-d)\cdot 15\sqrt{d}=30\sqrt{d}ne^{-d},$$

where we used $\mu_{1}(t)=0$. Therefore, by the induction hypothesis, we have $\|\mu_{i}(t+1)-\mu_{i}(0)\|\leq\eta t\cdot(30\sqrt{d}ne^{-d})+\eta\|\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t))\|\leq\eta(t+1)\cdot(30\sqrt{d}ne^{-d})$, which proves (41).
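For readers who want the induction step for (41) spelled out, the following worked chain is a sketch under one assumption: each gradient EM step moves $\mu_i$ by exactly $\eta\nabla_{\mu_i}\mathcal{L}(\bm{\mu}(t))$ in norm, whichever sign convention the update uses. The first inequality is the triangle inequality, and the second combines the induction hypothesis with the gradient bound from Lemma 16.

$$\begin{split}
\|\mu_{i}(t+1)-\mu_{i}(0)\|&\leq\|\mu_{i}(t)-\mu_{i}(0)\|+\|\mu_{i}(t+1)-\mu_{i}(t)\|\\
&=\|\mu_{i}(t)-\mu_{i}(0)\|+\eta\|\nabla_{\mu_{i}}\mathcal{L}(\bm{\mu}(t))\|\\
&\leq\eta t\cdot(30\sqrt{d}ne^{-d})+\eta\cdot(30\sqrt{d}ne^{-d})=\eta(t+1)\cdot(30\sqrt{d}ne^{-d}).
\end{split}$$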

By (41), $\forall i\neq 1$, $0\leq t\leq T$ we have $\|\mu_{i}(t)\|\geq\|\mu_{i}(0)\|-\|\mu_{i}(t)-\mu_{i}(0)\|\geq 12\sqrt{d}-\eta T(30\sqrt{d}ne^{-d})\geq 12\sqrt{d}-2\sqrt{d}=10\sqrt{d}.$ This completes the proof. ∎
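The arithmetic in the last two bounds can be illustrated with a short sketch. The inequality $\eta T\cdot(30\sqrt{d}ne^{-d})\leq 2\sqrt{d}$ used above is equivalent to $T\leq e^{d}/(15n\eta)$, so the argument tolerates a horizon that is exponential in $d$. The actual value of $T$ is fixed in the theorem statement, which lies outside this section; the helper names `max_steps` and `drift_bound` below are hypothetical and only reproduce the inequality chain.

```python
import math

def max_steps(d, n, eta):
    """Largest T with eta * T * 30 * sqrt(d) * n * exp(-d) <= 2 * sqrt(d).

    Solving the inequality for T gives T <= exp(d) / (15 * n * eta).
    """
    return math.exp(d) / (15.0 * n * eta)

def drift_bound(d, n, eta, t):
    """Upper bound eta * t * 30 * sqrt(d) * n * exp(-d) on ||mu_i(t) - mu_i(0)|| from (41)."""
    return eta * t * 30.0 * math.sqrt(d) * n * math.exp(-d)

# Illustrative values only; they are not taken from the paper.
d, n, eta = 20, 5, 0.1
T = max_steps(d, n, eta)
drift = drift_bound(d, n, eta, T)

print(f"T_max ~ {T:.3e}")
print(f"drift bound at T_max: {drift:.4f}, allowed: {2 * math.sqrt(d):.4f}")
print(f"lower bound on ||mu_i(T)||: {12 * math.sqrt(d) - drift:.4f}")
```

With these values the drift bound meets the allowance $2\sqrt{d}$ exactly at `T_max`, and any component that starts at norm at least $12\sqrt{d}$ stays at norm at least $10\sqrt{d}$ throughout.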