Learning time-scales in two-layers neural networks

Raphaël Berthier (EPFL), Andrea Montanari (Department of Electrical Engineering and Department of Statistics, Stanford University), Kangjie Zhou (Department of Statistics, Stanford University)
(May 1, 2024)
Abstract

Gradient-based learning in multi-layer neural networks displays a number of striking features. In particular, the decrease rate of empirical risk is non-monotone even after averaging over large batches. Long plateaus in which one observes barely any progress alternate with intervals of rapid decrease. These successive phases of learning often take place on very different time scales. Finally, models learnt in an early phase are typically ‘simpler’ or ‘easier to learn’ although in a way that is difficult to formalize.

Although theoretical explanations of these phenomena have been put forward, each of them captures at best certain specific regimes. In this paper, we study the gradient flow dynamics of a wide two-layer neural network in high-dimension, when data are distributed according to a single-index model (i.e., the target function depends on a one-dimensional projection of the covariates). Based on a mixture of new rigorous results, non-rigorous mathematical derivations, and numerical simulations, we propose a scenario for the learning dynamics in this setting. In particular, the proposed evolution exhibits separation of timescales and intermittency. These behaviors arise naturally because the population gradient flow can be recast as a singularly perturbed dynamical system.

Keywords: Deep learning, Neural network, Gradient flow, Dynamical system, Non-convex optimization, Incremental learning

Mathematics Subject Classification: 34E15, 37N40, 68T07

Communicated by Joan Bruna

1 Introduction

It is a recurring empirical observation that the training dynamics of neural networks exhibits a whole range of surprising behaviors:

  1. Plateaus. Plotting the training and test error as a function of SGD steps, using either small stepsize or large batches to average out stochasticity, reveals striking patterns. These error curves display long plateaus where barely anything seems to be happening, which are followed by rapid drops [41, 48, 39].

  2. Time-scales separation. The time window for this rapid descent is much shorter than the time spent in the plateaus. Additionally, subsequent phases of learning take increasingly longer times [18, 8].

  3. Incremental learning. Models learnt in the first phases of learning appear to be simpler than in later phases. Among others, [5] demonstrated that easier examples in a dataset are learned earlier; [28] showed that models learnt in the first phase of training correlate well with linear models; [22] showed that, in many simplified models, the dynamics of gradient descent explores the solution space in an incremental order of complexity; [39] demonstrated that, in certain settings, a function that approximates well the target is only learnt past the point of overfitting.

Understanding these phenomena is not a matter of intellectual curiosity. In particular, incremental learning plays a key role in our understanding of generalization in deep learning. Indeed, in this scenario, stopping the learning at a certain time $t$ amounts to controlling the complexity of the model learnt. The notion of complexity corresponds to the order in which the space of models is explored.

While a number of groups have developed models to explain these phenomena, it is fair to say that a complete picture is still lacking. An exhaustive overview of these works is out of place here. We will outline three possible explanations that have been developed in the past, and provide more pointers in Section 3.

Theory #1: Dynamics near singular points.

Several early works [41, 17, 44] pointed out that the parametrization of multi-layer neural networks presents symmetries and degeneracies. For instance, the function represented by a multi-layer perceptron is invariant under permutations of the neurons in the same layer. As a consequence, the population risk has multiple local minima connected through saddles or other singular sub-manifolds. Dynamics near these sub-manifolds naturally exhibits plateaus. Further, random or agnostic initializations typically place the network close to such submanifolds.

Theory #2: Linear networks.

Following the pioneering work of [7], a number of authors, most notably [43, 30], studied the behavior of deep neural networks with linear activations. While such networks can only represent linear functions, the training dynamics is highly non-linear. As demonstrated in [43], learning happens through stages that correspond to the singular value decomposition of the input-output covariance. Time scales are determined by the singular values.

Theory #3: Kernel regime.

Following an initial insight of [26], a number of groups proved that, for certain initializations, the training dynamics and the model learnt by overparametrized neural networks are well approximated by certain linearly parametrized models. In the limit of very wide networks, the training dynamics of these models converges in turn to the training dynamics of kernel ridge(less) regression (KRR) with respect to a deterministic kernel (independent of the random initialization). We refer to [9] for an overview and pointers to this literature. Recently, [21] showed that, in high dimension, the learning dynamics of KRR also exhibits plateaus and waterfalls, and learns functions of increasing complexity over a diverging sequence of timescales.

While each of these theories offers useful insights, it is important to realize that they do not agree on the basic mechanism that explains plateaus, time-scales separation, and incremental learning. In theory #1, plateaus are associated with singular manifolds and high-dimensional saddles, while in theories #2 and #3 they are related to a hierarchy of singular values of a certain matrix. In #2, the relevant singular values are the ones of the input-output covariance, and the fact that these singular values are well separated is postulated to be a property of the data distribution. In contrast, in #3 the relevant singular values are the eigenvalues of the kernel operator, and hence completely independent of the output (the target function). In this case, well-separated eigenvalues are proved to exist under natural high-dimensional distributions.

Not only do these theories propose different explanations, but they are also motivated by very different simplified models. Theory #1 has been developed only for networks with a small number of hidden units. Theory #2 only applies to networks with multiple output units, because otherwise the input-output covariance is a $d\times 1$ matrix and hence has only one non-trivial singular value. Finally, theory #3 applies under the conditions of the linear (a.k.a. lazy) regime, namely large overparametrization and suitable initialization (see, e.g., [9]).

In order to better understand the origin of plateaus, time-scales separation, and incremental learning, we attempt a detailed analysis of gradient flow for two-layer neural networks. We consider a simple data-generation model, and propose a precise scenario for the behavior of learning dynamics. We do not assume any of the simplifying features of the theories described above: activations are non-linear; the number of hidden neurons is large; we place ourselves outside the linear (lazy) regime.

Our analysis is based on methods from dynamical systems theory: singular perturbation theory and matched asymptotic expansions. Unfortunately, we fall short of providing a general rigorous proof of the proposed scenario, but we can nevertheless prove it in several special cases and provide a heuristic argument supporting its generality.

The rest of the paper is organized as follows. Section 2 describes our data distribution, learning model, and the proposed scenario for the learning dynamics. We review further related work in Section 3. Section 4 describes the reduction of the gradient flow to a ‘mean field’ dynamics that will be the starting point of our analysis. Section 5 presents numerical evidence of the proposed learning scenario. Finally, Sections 6 and 7 present our analysis of the learning dynamics.

Notations.

In this paper, we use the classical asymptotic notations. The notations $f(\varepsilon)=o(g(\varepsilon))$ or $g(\varepsilon)=\omega(f(\varepsilon))$ as $\varepsilon\to 0$ both denote that $|f(\varepsilon)|/|g(\varepsilon)|\to 0$ in the limit $\varepsilon\to 0$. The notations $f(\varepsilon)=O(g(\varepsilon))$ or $g(\varepsilon)=\Omega(f(\varepsilon))$ both denote that the ratio $|f(\varepsilon)|/|g(\varepsilon)|$ remains upper bounded in the limit. The notations $f(\varepsilon)=\Theta(g(\varepsilon))$ or $f(\varepsilon)\asymp g(\varepsilon)$ denote that $f(\varepsilon)=O(g(\varepsilon))$ and $g(\varepsilon)=O(f(\varepsilon))$ both hold. Finally, $f(\varepsilon)\sim g(\varepsilon)$ denotes that $f(\varepsilon)/g(\varepsilon)\to 1$ in the limit.

2 Setting and canonical learning order

We are given pairs $\{(x_i,y_i)\}_{i\leq n}$, where $x_i\in\mathbb{R}^d$ is a feature vector and $y_i\in\mathbb{R}$ is a response variable. We are interested in cases in which the feature vector is high-dimensional and does not contain strong structure, while the response depends on a low-dimensional projection of the data. We assume the simplest model of this type, the so-called single-index model:

$$y_i=\varphi(\langle u_*,x_i\rangle)\,,\qquad x_i\sim\mathsf{N}(0,I_d),\; u_*\in\mathbb{S}^{d-1}, \tag{1}$$

where $\varphi:\mathbb{R}\to\mathbb{R}$ is a link function, $\mathsf{N}(0,I_d)$ denotes the standard multivariate Gaussian distribution in dimension $d$, and $\mathbb{S}^{d-1}:=\{v\in\mathbb{R}^d:\,\|v\|_2=1\}$. We study the ability to learn model (1) using a two-layer neural network with $m$ hidden neurons:

$$f(x;a,u)=\frac{1}{m}\sum_{i=1}^{m}a_i\sigma(\langle u_i,x\rangle),\qquad a_1,\cdots,a_m\in\mathbb{R},\ u_1,\cdots,u_m\in\mathbb{S}^{d-1}, \tag{2}$$

where $(a,u):=(a_1,\cdots,a_m,u_1,\cdots,u_m)$ collectively denotes all the model's parameters and $\sigma:\mathbb{R}\to\mathbb{R}$ is the activation function of the neural network. The factor $1/m$ in the definition is relevant for the initialization and learning rate. We anticipate that we will initialize the $a_i$'s to be of order one, which results in second-layer coefficients $a_i/m=\Theta(1/m)$.
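As an illustration of Eqs. (1) and (2), the following minimal NumPy sketch samples data from a single-index model and evaluates the network with the $1/m$ scaling. The function names, the choice $\varphi=\sigma=\tanh$, the dimensions, and the direction $u_*=e_1$ are assumptions made for the example only and are not taken from the paper.

```python
import numpy as np

def sample_single_index(n, u_star, phi, rng):
    """Draw n pairs (x_i, y_i) with x_i ~ N(0, I_d) and y_i = phi(<u_star, x_i>)."""
    d = u_star.shape[0]
    x = rng.standard_normal((n, d))
    y = phi(x @ u_star)
    return x, y

def two_layer(x, a, u, sigma):
    """f(x; a, u) = (1/m) * sum_i a_i * sigma(<u_i, x>); rows of u lie on the unit sphere."""
    return sigma(x @ u.T) @ a / a.shape[0]

# Illustrative (hypothetical) sizes and functions.
rng = np.random.default_rng(0)
d, m, n = 50, 200, 1000
u_star = np.zeros(d)
u_star[0] = 1.0
u = rng.standard_normal((m, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)   # project rows onto the sphere
a = np.ones(m)                                   # order-one second-layer weights
x, y = sample_single_index(n, u_star, np.tanh, rng)
pred = two_layer(x, a, u, np.tanh)
```

With $a_i$ of order one, each neuron contributes $a_i/m=\Theta(1/m)$ to the output, which is the scaling discussed above.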

Remark 2.1.

Standard initializations in deep learning frameworks yield second-layer coefficients $a_i/m=\Theta(1/\sqrt{m})$ [29, 23, 24]. However, it is increasingly clear that this initialization presents fundamental limitations for large $m$. Notably, two-layer networks with this initialization converge to kernel methods [37], and the latter cannot learn ridge functions from polynomially many samples [20, 47].

It is well understood that, in order to drive the learning process outside the kernel regime (for $m\to\infty$), it is necessary to set $a_i/m=\Theta(1/m)$. This is often referred to as the ‘mean-field initialization’ [32, 13, 19, 1]. We notice that suitable generalizations of the mean-field initialization are currently used in state-of-the-art implementations [45, 46].

The bulk of our work will be devoted to the analysis of projected gradient flow in $(a_i,u_i)_{1\leqslant i\leqslant m}$ on the population risk

\begin{align}
\mathscr{R}(a,u) &= \frac{1}{2}\mathbb{E}\big\{\big(y-f(x;a,u)\big)^{2}\big\} \tag{3}\\
&= \frac{1}{2}\mathbb{E}\Big\{\Big(\varphi(\langle u_*,x\rangle)-\frac{1}{m}\sum_{i=1}^{m}a_i\sigma(\langle u_i,x\rangle)\Big)^{2}\Big\}\,. \tag{4}
\end{align}

In Section 7, we will bound the distance between stochastic gradient descent (SGD) and gradient flow in population risk. As a consequence, we will establish finite sample generalization guarantees for SGD learning.

Projected gradient flow with respect to the risk $\mathscr{R}(a,u)$ is defined by the following ordinary differential equations (ODEs):

\begin{align}
\partial_t(\varepsilon a_i) &= -m\,\partial_{a_i}\mathscr{R}(a,u)\,, \tag{5}\\
\partial_t u_i &= -m(I_d-u_iu_i^{\top})\nabla_{u_i}\mathscr{R}(a,u)\,. \tag{6}
\end{align}

Here, $\varepsilon$ can be viewed as the relative step size, namely the ratio between the first and second-layer step sizes. It is useful to make a few remarks about the definition of gradient flow:

  • The projection $I_d-u_iu_i^{\top}$ ensures that $u_i$ remains on the unit sphere $\mathbb{S}^{d-1}$.

  • The overall scaling of time is arbitrary, and the matching to SGD steps will be carried out in Section 7. The factors $m$ on the right-hand side are introduced for mathematical convenience, since the partial derivatives are of order $1/m$.

  • As mentioned above, the factor $\varepsilon$ introduced in the gradient flow of the $a_i$'s plays the role of the relative step size. Throughout the paper, we will keep $\varepsilon$ as a free parameter independent of $m$, and study the evolution of gradient flow for small $\varepsilon$. This corresponds to a setting in which the second-layer coefficients are learned much faster than the first-layer weights. We emphasize however that the small $\varepsilon$ limit is taken after the large $m,d$ limits. Thus, although the second-layer weights are learnt faster, the evolution of the first-layer weights will be crucial, and will lead to true feature learning.
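The ODEs (5)-(6) can be explored numerically with a simple explicit Euler discretization. The sketch below is only an illustration under stated assumptions: the population expectation is replaced by an average over a fixed batch of samples, the step `dt`, the value of `eps`, the choice $\varphi=\sigma=\tanh$, and the explicit renormalization of $u_i$ after each step (a numerical safeguard on top of the tangent-space projection) are all choices made for the example, and `euler_step` is a hypothetical helper name.

```python
import numpy as np

def euler_step(a, u, x, y, sigma, dsigma, eps, dt):
    """One explicit Euler step of Eqs. (5)-(6), with E{.} replaced by a batch average."""
    m = a.shape[0]
    pre = x @ u.T                         # (n, m): inner products <u_i, x_j>
    resid = y - sigma(pre) @ a / m        # (n,): y_j - f(x_j; a, u)
    # -m * dR/da_i = E[(y - f) * sigma(<u_i, x>)]
    grad_a = (resid[:, None] * sigma(pre)).mean(axis=0)
    # -m * grad_{u_i} R = a_i * E[(y - f) * sigma'(<u_i, x>) * x]
    grad_u = a[:, None] * ((resid[:, None] * dsigma(pre)).T @ x) / x.shape[0]
    # Apply the projection I_d - u_i u_i^T (tangent space of the sphere at u_i).
    grad_u -= np.sum(grad_u * u, axis=1, keepdims=True) * u
    a_new = a + (dt / eps) * grad_a       # second layer moves on the faster time scale
    u_new = u + dt * grad_u
    u_new /= np.linalg.norm(u_new, axis=1, keepdims=True)  # keep u_i on the sphere
    return a_new, u_new

# Illustrative (hypothetical) setup.
rng = np.random.default_rng(0)
d, m, n = 20, 50, 4000
u_star = np.zeros(d)
u_star[0] = 1.0
x = rng.standard_normal((n, d))
y = np.tanh(x @ u_star)
a = rng.standard_normal(m)
u = rng.standard_normal((m, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)
sigma = np.tanh
dsigma = lambda z: 1.0 - np.tanh(z) ** 2

def batch_risk(a, u):
    return 0.5 * np.mean((y - sigma(x @ u.T) @ a / m) ** 2)

risk_before = batch_risk(a, u)
for _ in range(100):
    a, u = euler_step(a, u, x, y, sigma, dsigma, eps=0.1, dt=1e-3)
risk_after = batch_risk(a, u)
```

The factor `dt / eps` in the update of `a` makes explicit why a small $\varepsilon$ corresponds to second-layer coefficients that are learned faster than the first-layer weights.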

We assume the initialization to be random with i.i.d. components $(a_{i,\rm init},u_{i,\rm init})$:

$$(a_{i,\rm init},u_{i,\rm init})\sim{\rm P}_A\otimes\mathrm{Unif}(\mathbb{S}^{d-1})\,, \tag{7}$$

where ${\rm P}_A$ is a probability measure on $\mathbb{R}$. The unique solution of the gradient flow ODEs with this initialization will be denoted by $(a(t),u(t))$. We will be interested in the case of large networks ($m\to\infty$) in high dimension ($d\to\infty$). As shown below, the two limits commute (over fixed time horizons).
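A sample from the initialization (7) can be drawn as in the following sketch. Uniform points on the sphere are obtained by normalizing standard Gaussian vectors; the choice of a standard normal ${\rm P}_A$, the sizes, and the helper name `init_params` are assumptions made for illustration.

```python
import numpy as np

def init_params(m, d, sample_a, rng):
    """Draw (a_i, u_i) i.i.d. from P_A x Unif(S^{d-1}); sample_a encodes the choice of P_A."""
    a = sample_a(m, rng)
    g = rng.standard_normal((m, d))
    u = g / np.linalg.norm(g, axis=1, keepdims=True)  # normalized Gaussians are uniform on the sphere
    return a, u

rng = np.random.default_rng(0)
m, d = 1000, 200
# Hypothetical choice: P_A is the standard normal distribution.
a_init, u_init = init_params(m, d, lambda k, r: r.standard_normal(k), rng)

u_star = np.zeros(d)
u_star[0] = 1.0
overlaps = u_init @ u_star   # initial overlaps <u_i, u_*>, of typical size 1/sqrt(d)
```

In high dimension the initial overlaps $\langle u_i,u_*\rangle$ are of order $1/\sqrt{d}$, so at initialization the neurons are nearly orthogonal to the target direction.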

Our main finding is that, in a number of cases, $\varphi$ is learnt incrementally. Namely, the function $f(x;a(t),u(t))$ evolves over time according to a sequence of polynomial approximations of $\varphi(\langle u_*,x\rangle)$. These polynomial approximations are given by the decomposition of $\varphi$ in $L^2(\mathbb{R},\phi(x)\mathrm{d}x)$, where $\phi(x)$ is the standard normal density: $\phi(x)=\exp(-x^2/2)/\sqrt{2\pi}$. (For notational simplicity, we will use the shorthand $L^2$ instead of $L^2(\mathbb{R},\phi(x)\mathrm{d}x)$ in the sequel.)

In order to describe the polynomial approximations learnt during the training more explicitly, we decompose $\varphi$ and $\sigma$ into normalized Hermite polynomials:

$$\varphi(z)=\sum_{k=0}^{\infty}\varphi_k\mathrm{He}_k(z)\,,\qquad \sigma(z)=\sum_{k=0}^{\infty}\sigma_k\mathrm{He}_k(z)\,. \tag{8}$$

Here, $\mathrm{He}_k$ denotes the $k$-th Hermite polynomial, normalized so that $\|\mathrm{He}_k\|_{L^2(\mathbb{R},\phi(x)\mathrm{d}x)}=1$.
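The coefficients in Eq. (8) can be approximated numerically. The sketch below uses Gauss-Hermite quadrature from `numpy.polynomial.hermite_e`, whose polynomials are the (unnormalized) probabilists' Hermite polynomials; dividing by $\sqrt{k!}$ gives the normalization used here. The helper name `hermite_coeffs`, the quadrature degree, and the test function $z\mapsto z^2$ are assumptions made for the example.

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He

def hermite_coeffs(fn, k_max, quad_deg=60):
    """Approximate coefficients of fn in the orthonormal Hermite basis of L^2(R, phi(x) dx)."""
    nodes, weights = He.hermegauss(quad_deg)        # quadrature for the weight exp(-x^2 / 2)
    weights = weights / math.sqrt(2.0 * math.pi)    # rescale to the standard normal density
    vals = fn(nodes)
    coeffs = np.empty(k_max + 1)
    for k in range(k_max + 1):
        basis = np.zeros(k + 1)
        basis[k] = 1.0                              # selects the k-th probabilists' polynomial
        he_k = He.hermeval(nodes, basis) / math.sqrt(math.factorial(k))
        coeffs[k] = np.sum(weights * vals * he_k)   # E[fn(G) * He_k(G)], G ~ N(0, 1)
    return coeffs

# z^2 = He_0(z) + sqrt(2) * He_2(z) in the normalized basis.
phi_k = hermite_coeffs(lambda z: z ** 2, k_max=4)
```

For a polynomial such as $z^2$ the quadrature is exact up to rounding; for a general $\varphi$ or $\sigma$ the accuracy depends on the quadrature degree and on the smoothness of the function.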

[Figure: population risk $\mathscr{R}$ versus time $t$, starting from $\mathscr{R}_{\rm init}$. Labels on the time axis: $O(\varepsilon)$, $\frac{1}{4|\sigma_1\varphi_1|}\varepsilon^{1/2}\log\frac{1}{\varepsilon}$, $c_2\varepsilon^{1/4}$, $c_3\varepsilon^{1/6}$. Transition widths: $O(\varepsilon^{1/2})$, $O(\varepsilon^{1/3})$, $O(\varepsilon^{1/4})$. Drop heights: $\frac{1}{2}\varphi_1^2$, $\frac{1}{2}\varphi_2^2$, $\frac{1}{2}\varphi_3^2$.]

Figure 1: Cartoon illustration of the evolution of the population risk within the canonical learning order of Definition 1.

As we will see, the incremental learning behavior arises for small $\varepsilon$. By the law of large numbers (see below), the following almost sure limit exists (provided ${\rm P}_A$ is square integrable):

$$\mathscr{R}_{\rm init}:=\lim_{m\to\infty}\lim_{d\to\infty}\mathscr{R}(a_{\rm init},u_{\rm init})=\frac{1}{2}\left(\varphi_0-\sigma_0\int a\,{\rm P}_A(\mathrm{d}a)\right)^{2}+\frac{1}{2}\sum_{k\geqslant 1}\varphi_k^{2}. \tag{9}$$

We are now in position to describe the scenario that we will study in the rest of the paper.

Definition 1.

We say that the canonical learning order holds up to level $L$ for a certain target function $\varphi$, activation $\sigma$, and distribution ${\rm P}_A$, if the following hold:

  1. The limit below exists:

     $$\mathscr{R}_{\infty}(t,\varepsilon)=\lim_{m\to\infty}\lim_{d\to\infty}\mathscr{R}(a(t),u(t)). \tag{10}$$

  2. There exist constants $c_2,\dots,c_{L+1}>0$ such that the following asymptotics hold as $\varepsilon\to 0$, $t\to 0$:

     $$\mathscr{R}_{\infty}(t,\varepsilon)\xrightarrow[\varepsilon\to 0,\,t\to 0]{}\begin{cases}
     \mathscr{R}_{\rm init} & \text{if } t=o(\varepsilon)\,,\\
     \frac{1}{2}\sum_{k\geqslant 1}\varphi_k^{2} & \text{if } t=\omega(\varepsilon)\text{ and } t=\frac{1}{4|\sigma_1\varphi_1|}\varepsilon^{1/2}\log\frac{1}{\varepsilon}-\omega(\varepsilon^{1/2})\,,\\
     \frac{1}{2}\sum_{k\geqslant 2}\varphi_k^{2} & \text{if } t=\frac{1}{4|\sigma_1\varphi_1|}\varepsilon^{1/2}\log\frac{1}{\varepsilon}+\omega(\varepsilon^{1/2})\text{ and } t=c_2\varepsilon^{1/4}-\omega(\varepsilon^{1/3})\,,\\
     \frac{1}{2}\sum_{k\geqslant l}\varphi_k^{2} & \text{if } t=c_{l-1}\varepsilon^{1/(2(l-1))}+\omega(\varepsilon^{1/l})\text{ and } t=c_l\varepsilon^{1/(2l)}-\omega(\varepsilon^{1/(l+1)})\,,\\
     & \qquad\text{ for all } 3\leqslant l\leqslant L+1.
     \end{cases}$$

Figure 1 provides a cartoon illustration of the canonical learning order.
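The plateau heights in the display above depend only on the Hermite coefficients of $\varphi$. As a minimal numerical sketch (not the authors' code), the Python snippet below computes the coefficients by Gauss-Hermite quadrature, assuming the orthonormal convention $\varphi_k=\mathbb{E}\{\varphi(G)\mathrm{He}_k(G)\}/\sqrt{k!}$, and then evaluates the levels $\frac{1}{2}\sum_{k\geqslant l}\varphi_k^2$ and the leading-order end of the first plateau, $\frac{1}{4|\sigma_1\varphi_1|}\varepsilon^{1/2}\log(1/\varepsilon)$. The function names, the truncation degree `K`, and the example target are illustrative assumptions; the constants $c_l$ are not computed.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval


def hermite_coeffs(f, K, n_quad=60):
    """Coefficients c_k = E[f(G) He_k(G)] / sqrt(k!) for k = 0..K, G ~ N(0, 1).

    Hypothetical helper; assumes the orthonormal Hermite convention.
    """
    x, w = hermegauss(n_quad)            # nodes/weights for the weight exp(-x^2 / 2)
    w = w / math.sqrt(2.0 * math.pi)     # normalize to the N(0, 1) density
    fx = f(x)
    coeffs = np.empty(K + 1)
    for k in range(K + 1):
        e_k = np.zeros(k + 1)
        e_k[k] = 1.0                     # coefficient vector that selects He_k
        coeffs[k] = np.sum(w * fx * hermeval(x, e_k)) / math.sqrt(math.factorial(k))
    return coeffs


def plateau_levels(phi):
    """levels[l] = 0.5 * sum_{k >= l} phi_k^2, with the tail truncated at len(phi) - 1."""
    phi = np.asarray(phi, dtype=float)
    return 0.5 * np.cumsum((phi ** 2)[::-1])[::-1]


def first_escape_time(eps, sigma1, phi1):
    """Leading-order end of the first plateau: eps^(1/2) log(1/eps) / (4 |sigma1 phi1|)."""
    return math.sqrt(eps) * math.log(1.0 / eps) / (4.0 * abs(sigma1 * phi1))


# Illustrative target: phi = h_1 + 0.5 * h_2, where h_2(x) = (x^2 - 1) / sqrt(2).
def phi_example(x):
    return x + 0.5 * (x ** 2 - 1.0) / math.sqrt(2.0)


phi_k = hermite_coeffs(phi_example, K=4)
levels = plateau_levels(phi_k)
print("Hermite coefficients:", phi_k)
print("plateau levels:", levels)
```

In this sketch `levels[1]` is the risk on the plateau reached for $t=\omega(\varepsilon)$ and `levels[2]` the risk on the next one; any mass of $\varphi$ beyond degree `K` is ignored.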

At first sight, the setting of Eq. (2) is overly restrictive because we require $\|u_i\|_2=1$ and we do not have offsets in the activations. Therefore, it might seem that $s=1$ and $\varphi=\sigma$ are required in order to approximate the target function arbitrarily well. In contrast, the next proposition shows that the network (2) enjoys universal approximation properties.

Proposition 1.

Assume that $\sigma$ is Lipschitz continuous and generic in the following sense: the decomposition of $\sigma$ into Hermite polynomials does not have any coefficient equal to $0$. For any Lipschitz function $\varphi:\mathbb{R}\to\mathbb{R}$, $\|u_*\|_2=1$, and $x\sim\mathcal{N}(0,I_d)$ such that $\mathbb{E}\{\varphi(\langle u_*,x\rangle)^2\}<\infty$, there exists a sequence $m\to\infty$ and $a^{(m)},u^{(m)}$ with $\|u_i^{(m)}\|_2=1$ such that

$$\lim_{d\to\infty}\lim_{m\to\infty}\mathbb{E}\big\{\big(\varphi(\langle u_*,x\rangle)-f(x;a^{(m)},u^{(m)})\big)^2\big\}=0\,.$$

This result is not surprising in view of the arguments in the next sections, which suggest that indeed gradient flow constructs such an approximation for a broad class of functions of the form $f_*(x)=\varphi(\langle u_*,x\rangle)$. We nevertheless give an independent proof in Appendix A.

A specific realization of our general setup is determined by the triple $(\sigma,\varphi,{\rm P}_A)$. In the rest of the paper, we will provide evidence showing that the canonical learning order holds in a number of cases. Nevertheless, we can also construct examples in which it does not hold:

  • If one or more of the Hermite coefficients of the activation vanish, then the canonical learning order does not hold for general $\varphi$. Specifically, if $\sigma_k=0$, then for any $t$ the function $f(x;a(t),u(t))$ remains orthogonal to $\mathrm{He}_k(\langle u_*,x\rangle)$. In particular, if $\varphi_k\neq 0$ then the risk remains bounded away from zero for every $t$. We refer to Appendix E.1 for a formal statement.

  • If the first $k+1$ Hermite coefficients of $\varphi$ vanish, $\varphi_0=\dots=\varphi_k=0$, $k\geq 1$, then the canonical learning order does not hold. (See Appendix E.2 for the proof.)

  • In fact, we expect that the canonical learning order might fail whenever one or more of the coefficients $\varphi_k$ vanish, for $k\geq 1$. Appendix E.3 provides some heuristic justification for this failure.

Remark 2.2.

We can compare the canonical learning order described here to the ones in the earlier literature, described as theories #1, #2, #3 in the introduction. There appear to be points of contact, but also important differences, with both theory #1 and theory #3:

  • As in theory #1, the plateaus and separation of time scales arise because the trajectory of gradient flow is approximated by a sequence of motions along submanifolds in the space of parameters $(a,u)$. Along the $l$-th such submanifold, $f(x;a,u)$ is well-approximated by a degree-$l$ polynomial. Escaping each submanifold takes an increasingly longer time.

    This is reminiscent of the motion between saddles investigated in earlier work [41, 17, 44]. However, unlike in earlier work, we will see that this applies to networks with a large (possibly diverging) number of hidden neurons. Also, we identify the subsequent phases of learning with the polynomial decomposition of Eq. (8).

  • As in theory #3, subsequent phases of learning correspond to increasingly accurate polynomial approximations of the target function $\varphi(\langle u_*,x\rangle)$. However, the underlying mechanism and time scales are completely different. In the linear regime, the different time scales emerge because of increasingly small eigenvalues of the neural tangent kernel. In that case, the time required to learn degree-$l$ polynomials is of order $d^l$ [21].

    In contrast, in the canonical learning order, polynomials of degree $l$ are learnt on a time scale of order one in $d$ (and only depending on the learning rate $\varepsilon$). This of course has important implications when approximating gradient flow by SGD. Within the linear regime, the sample size required to learn a polynomial of order $l$ scales like $d^l$ [21], while in the canonical learning order, it is only of order $d$ (see Section 7).

3 Further related work

As we mentioned in the introduction, plateaus and time scales in the learning dynamics of kernel models were analyzed by [21]. A sharp analysis for the related random features model was developed by [12].

Our analysis builds upon the mean-field description of learning in two-layer neural networks, which was developed in a sequence of works, see, e.g., [32, 40, 13, 33]. In particular, we leverage the fact that, for the data distribution (1), the population risk function is invariant under rotations around the axis $u_*$, and this allows for a dimensionality reduction in the mean-field description. Similar symmetry arguments were used by [32] and, more recently, by [1].

The single-index model can be learnt using simpler methods than large two-layer networks. Limiting ourselves to the case of gradient descent algorithms, [31] proved that gradient descent with respect to the non-convex empirical risk $\widehat{R}_n(u):=n^{-1}\sum_{i=1}^{n}(y_i-\varphi(u^{\top}x_i))^2$ converges to a near global optimum, provided $\varphi$ is strictly increasing. [4] considered online SGD under more challenging learning scenarios and characterized the time (sample size) for $|\langle u,u_*\rangle|$ to become significantly larger than for a random unit vector $u$.

Learning in overparametrized two-layer networks under model (1) (or its variations) has been studied recently by several groups. In particular, [6] considers a training procedure which runs a single step of gradient descent, followed by freezing the first layer and performing ridge regression with respect to the second layer. This scheme is amenable to a precise characterization of the generalization error. [11] consider a similar scheme in which a first phase of gradient descent is run to achieve positive correlation with the unknown direction $u_*$. [14] also consider a two-phase scheme, and prove consistency and excess risk bounds for a more general class of target functions whereby the first equation in (1) is replaced by

$$y_i=\varphi(U_*^{\top}x_i)+\varepsilon_i\,,\;\;\;U_*\in\mathbb{R}^{d\times k}\,,\;\varphi:\mathbb{R}^{k}\to\mathbb{R}\,,\tag{11}$$

with $k\ll d$. In particular, near optimal error bounds are obtained under a non-degeneracy condition on $\nabla^2\varphi$.

[1] consider a similar model whereby $x\sim\mathrm{Unif}(\{+1,-1\}^d)$, and $y=\varphi(x_S)$ where $S\subseteq[d]$, and $x_S=(x_i)_{i\in S}$ (i.e., $x_S$ contains the coordinates of $x$ indexed by entries of $S$). Under a structural assumption on $\varphi$ (the ‘merged staircase property’), and for $|S|$ fixed, they prove that the two-stage algorithm learns the target function with sample complexity of order $d$. This paper is technically related to ours in that it uses mean-field theory to obtain a characterization of learning in terms of a PDE in a reduced $(k+2)$-dimensional space.

A similar model was studied by [8], who bound the sample complexity by $d^{O(k)}$ for learning parities on $k$ bits using gradient descent with large batches (if $k=O(1)$, [8] require $O(1)$ steps with batch size $d^{O(k)}$).

Let us emphasize that our objective is quite different from these works. We implement a simple online SGD algorithm with additional projection steps, and try to derive a precise picture of the successive phases of learning (in particular, we do not consider two-stage schemes or layer-by-layer learning). On the other hand, we focus on a relatively simple model.

To clarify the difference, it is perhaps useful to rephrase our claims in terms of sample complexity. While previous works show that the target function can be learnt with $O(d)$ samples, we claim that it is learnt by online SGD with test error $r$ from about $C(r,\varepsilon)d$ samples, and we characterize the dependence of $C(r,\varepsilon)$ on $r$ for small $\varepsilon$. (We fall short of a proof in the general case.)

After posting an initial version of this paper, we became aware that [3] independently derived equations similar to (15)-(19), (26), (130). There are technical differences, and hence we cannot apply their results directly. However, Section 4.3 and Appendix B.4 are analogous to their work.

4 The large-network, high-dimensional limit

The first step of our analysis is a reduction of the system of ODEs (5), (6), with dimension $m(d+1)$, to a system of ODEs in $2m$ dimensions. We will achieve this reduction in two steps:

  • $(i)$ First we reduce to a system in $m(m+3)/2$ dimensions for the variables $a_i$, $\langle u_i,u_j\rangle$, $\langle u_i,u_*\rangle$. This reduction is exact and is quite standard. It is done in Section 4.1.

  • $(ii)$ We then show that the products $\langle u_i,u_j\rangle$ can be eliminated, with an error $O(1/m)$. This is done in Section 4.2. As further discussed below, the resulting dynamics could also be derived from the mean field theory of [32, 40, 13, 33] (with the required modifications for the constraints $\|u_i\|=1$).

In order to define formally the reduced system, we define the functions $U,V:[-1,1]\to\mathbb{R}$ via:

\begin{align}
V(s)&:=\mathbb{E}\{\varphi(G)\,\sigma(G_s)\}=\sum_{k\geqslant 0}\varphi_k\sigma_k s^k\,,\;\;\;\;\;(G,G_s)\sim\mathcal{N}\left(0,\left[\begin{matrix}1&s\\ s&1\end{matrix}\right]\right)\,, \tag{12}\\
U(s)&:=\mathbb{E}\{\sigma(G)\,\sigma(G_s)\}=\sum_{k\geqslant 0}\sigma_k^2 s^k\,. \tag{13}
\end{align}
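The identities in Eqs. (12)-(13) can be checked numerically. The sketch below assumes the orthonormal Hermite convention $h_k=\mathrm{He}_k/\sqrt{k!}$ and uses illustrative (hypothetical) coefficient vectors. It compares the power series for $V$ and $U$ with a tensor Gauss-Hermite evaluation of the Gaussian expectations, writing $G_s=sG+\sqrt{1-s^2}\,Z$ with $Z$ an independent standard normal. All function names are our own.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval


def from_hermite(c):
    """Return f(x) = sum_k c_k He_k(x) / sqrt(k!) (orthonormal convention, assumed)."""
    scaled = np.array([ck / math.sqrt(math.factorial(k)) for k, ck in enumerate(c)])
    return lambda x: hermeval(x, scaled)


def V_series(s, phi_c, sigma_c):
    """Power series of Eq. (12), truncated to the coefficients that are supplied."""
    return sum(p * q * s ** k for k, (p, q) in enumerate(zip(phi_c, sigma_c)))


def U_series(s, sigma_c):
    """Power series of Eq. (13), truncated to the coefficients that are supplied."""
    return sum(q * q * s ** k for k, q in enumerate(sigma_c))


def pair_expectation(f, g, s, n_quad=40):
    """E[f(G) g(G_s)] for standard normals with correlation s, by tensor quadrature."""
    x, w = hermegauss(n_quad)            # nodes/weights for the weight exp(-x^2 / 2)
    w = w / math.sqrt(2.0 * math.pi)     # normalize to the N(0, 1) density
    G = x[:, None]
    Z = x[None, :]
    G_s = s * G + math.sqrt(1.0 - s * s) * Z
    return float(np.sum(w[:, None] * w[None, :] * f(G) * g(G_s)))


phi_c = [0.2, 1.0, 0.5]                  # illustrative Hermite coefficients of phi
sigma_c = [0.3, 0.8, -0.4, 0.1]          # illustrative Hermite coefficients of sigma
phi, sigma = from_hermite(phi_c), from_hermite(sigma_c)

for s in (-0.7, 0.0, 0.4, 1.0):
    print(s, V_series(s, phi_c, sigma_c), pair_expectation(phi, sigma, s))
```

Because the example functions are polynomials, the quadrature is exact up to rounding; for a non-polynomial activation both the series truncation and the quadrature order would introduce errors.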

Note that the above identities follow from [36, Proposition 11.31]. Throughout this section, we will make the following assumptions.

A1.

The distribution of weights at initialization, ${\rm P}_A$, is supported on $[-M_1,M_1]$.

A2.

The activation function is bounded: $\|\sigma\|_{\infty}\leq M_2$. Additionally, the functions $V$ and $U$ are bounded and of class $C^2$, with uniformly bounded first and second derivatives over $s\in[-1,1]$. A sufficient condition for this is

$$\sup\left\{\|\sigma'\|_{L^2},\,\|\sigma''\|_{L^2}\right\}\leq M_2,\quad\ \sup\left\{\|\varphi\|_{L^2},\,\|\varphi'\|_{L^2},\,\|\varphi''\|_{L^2}\right\}\leq M_2.$$

A3.

Responses are bounded, i.e., $\|\varphi\|_{\infty}\leq M_3$.

Remark 4.1.

We briefly explain why the $L^2$-boundedness of the derivatives of $\sigma$ and $\varphi$ is sufficient, as claimed in Assumption A2. Suppose for example that $\|\sigma'\|_{L^2}\leq M_2$ and $\|\varphi'\|_{L^2}\leq M_2$. Then we have

$$\sup_{s\in[-1,1]}\left|V'(s)\right|\stackrel{(a)}{=}\sup_{s\in[-1,1]}\left|\mathbb{E}\{\varphi'(G)\,\sigma'(G_s)\}\right|\stackrel{(b)}{\leq}\|\varphi'\|_{L^2}\|\sigma'\|_{L^2}\leq M_2^2,\tag{14}$$

where $(a)$ follows from Gaussian integration by parts and $(b)$ follows from the Cauchy-Schwarz inequality.

4.1 Reduction to $d$-independent flow

Our first statement establishes reduction $(i)$ mentioned above. The proof of this fact is presented in Appendix B.1.

Proposition 2 (Reduction to $d$-independent flow).

Define $s_i=\langle u_i,u_*\rangle$, $r_{ij}=\langle u_i,u_j\rangle$ for $i,j=1,\dots,m$. Then, letting $R=(r_{ij})_{i,j\leq m}$, we have

$$\mathscr{R}(a,u)=\mathscr{R}_{\rm red}(a,s,R):=\frac{1}{2}\|\varphi\|^2_{L^2}-\frac{1}{m}\sum_{i=1}^{m}a_iV(s_i)+\frac{1}{2m^2}\sum_{i,j=1}^{m}a_ia_jU(r_{ij})\,.\tag{15}$$

If $(a(t),u(t))$ solves the gradient flow ODEs (5)-(6), then $(a(t),s(t),R(t))$ is the unique solution of the following set of ODEs (note that $r_{ii}=1$ identically)

\begin{align}
\varepsilon\,\partial_t a_i&=V(s_i)-\frac{1}{m}\sum_{j=1}^{m}a_jU(r_{ij})\,, \tag{16}\\
\partial_t s_i&=a_i\left(V'(s_i)(1-s_i^2)-\frac{1}{m}\sum_{j=1}^{m}a_jU'(r_{ij})(s_j-r_{ij}s_i)\right)\,, \tag{17}\\
\partial_t r_{ij}&=a_i\left(V'(s_i)(s_j-s_ir_{ij})-\frac{1}{m}\sum_{p=1}^{m}a_pU'(r_{ip})(r_{jp}-r_{ip}r_{ij})\right) \tag{18}\\
&\quad+a_j\left(V'(s_j)(s_i-s_jr_{ij})-\frac{1}{m}\sum_{p=1}^{m}a_pU'(r_{jp})(r_{ip}-r_{jp}r_{ij})\right)\,. \tag{19}
\end{align}
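To make the structure of Eqs. (16)-(19) concrete, here is a minimal NumPy sketch of their right-hand sides, together with the reduced risk of Eq. (15) and a forward-Euler integrator. The function names, the Euler scheme, the polynomial choices of $V$ and $U$, and all numerical values are illustrative assumptions of ours, not the authors' simulation code.

```python
import numpy as np


def reduced_rhs(a, s, R, V, dV, U, dU, eps):
    """Right-hand sides of Eqs. (16)-(19); returns (da/dt, ds/dt, dR/dt)."""
    m = a.shape[0]
    W = dU(R) * a[None, :]                 # W[i, j] = a_j U'(r_ij)
    w_row = (W * R).sum(axis=1)            # sum_p a_p U'(r_ip) r_ip
    da = (V(s) - U(R) @ a / m) / eps       # Eq. (16), divided by eps
    ds = a * (dV(s) * (1.0 - s ** 2) - (W @ s - w_row * s) / m)   # Eq. (17)
    # T[i, j] is the a_i-bracket of Eq. (18); its transpose is the a_j-bracket of Eq. (19).
    T = a[:, None] * (
        dV(s)[:, None] * (s[None, :] - s[:, None] * R)
        - (W @ R.T - w_row[:, None] * R) / m
    )
    return da, ds, T + T.T


def reduced_risk(a, s, R, V, U, phi_norm_sq):
    """Reduced risk of Eq. (15)."""
    m = a.shape[0]
    return 0.5 * phi_norm_sq - a @ V(s) / m + a @ U(R) @ a / (2.0 * m ** 2)


def euler_flow(a, s, R, V, dV, U, dU, eps, dt, n_steps):
    """Forward-Euler integration (an illustrative discretization choice)."""
    a, s, R = a.copy(), s.copy(), R.copy()
    for _ in range(n_steps):
        da, ds, dR = reduced_rhs(a, s, R, V, dV, U, dU, eps)
        a, s, R = a + dt * da, s + dt * ds, R + dt * dR
    return a, s, R


# Hypothetical example with Hermite coefficients phi = (0, 1, 0.5), sigma = (0.5, 1, 0.5).
V = lambda x: x + 0.25 * x ** 2            # sum_k phi_k sigma_k x^k
dV = lambda x: 1.0 + 0.5 * x
U = lambda x: 0.25 + x + 0.25 * x ** 2     # sum_k sigma_k^2 x^k
dU = lambda x: 1.0 + 0.5 * x
phi_norm_sq = 1.25                         # sum_k phi_k^2

m, eps = 20, 0.1
rng = np.random.default_rng(0)
a0 = rng.uniform(-1.0, 1.0, size=m)
s0, R0 = np.zeros(m), np.eye(m)            # idealized start: s_i = 0 and r_ij = 0 for i != j

a1, s1, R1 = euler_flow(a0, s0, R0, V, dV, U, dU, eps, dt=1e-3, n_steps=500)
risk_start = reduced_risk(a0, s0, R0, V, U, phi_norm_sq)
risk_end = reduced_risk(a1, s1, R1, V, U, phi_norm_sq)
print("risk at t = 0:", risk_start, " risk at t = 0.5:", risk_end)
```

The diagonal of the returned `dR/dt` vanishes, in agreement with $r_{ii}=1$ identically. A stiff or adaptive integrator would be a more natural choice for small $\varepsilon$, since $a$ then evolves on a much faster time scale than $s$ and $R$.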

The input dimension $d$ does not appear in the reduced ODEs, Eqs. (16) to (19), and only plays a role in the initialization of the $s_i$'s and the $r_{ij}$'s. Namely, since $u_{i,\rm init}\sim\mathrm{Unif}(\mathbb{S}^{d-1})$, we can represent $u_{i,\rm init}=g_i/\|g_i\|_2$ with $g_i\sim\mathsf{N}(0,I_d/d)$. By concentration of $\|g_i\|_2$, this implies that, for $1\leq i<j\leq m$, $s_i$, $r_{ij}$ are approximately $\mathsf{N}(0,1/d)$.
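The scaling of the initial overlaps can be checked by direct sampling. The following sketch draws $m$ unit vectors through the Gaussian representation described above and reports $d$ times the empirical variance of the $s_i$ and of the $r_{ij}$ with $i<j$, which should both be close to $1$. The sizes, the seed, and the choice of $u_*$ as the first coordinate vector are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 100, 1000                                  # illustrative sizes

g = rng.standard_normal((m, d)) / np.sqrt(d)      # g_i ~ N(0, I_d / d)
u = g / np.linalg.norm(g, axis=1, keepdims=True)  # u_i uniform on the unit sphere
u_star = np.zeros(d)
u_star[0] = 1.0                                   # any fixed unit vector, by rotation invariance

s = u @ u_star                                    # s_i = <u_i, u_*>
R = u @ u.T                                       # r_ij = <u_i, u_j>
r_off = R[np.triu_indices(m, k=1)]                # entries with i < j

print("d * var(s_i)  =", d * s.var())
print("d * var(r_ij) =", d * r_off.var())
```

Both overlaps are of order $d^{-1/2}$, which is why they can be replaced by $0$ at initialization when $d$ is large.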

This discussion immediately yields the following consequence.

Corollary 1.

Let $(a(t),u(t))$ be the solution of the gradient flow ODEs (5), (6) with initialization (7), and let $(a^{0}(t),s^{0}(t),R^{0}(t))$ be the unique solution of Eqs. (16) to (19), with initialization $a^{0}_{i}(0)=a_{i}(0)$, $s^{0}_{i}(0)=0$, $r^{0}_{ij}(0)=0$ for $i\neq j$. Then, for any fixed $T$ (possibly dependent on $m$ but not on $d$), the following holds with probability at least $1-\exp(-C'm)$ over the i.i.d. initialization $(a_{i}(0),u_{i}(0))_{i\in[m]}$:

\begin{align}
\sup_{t\in[0,T]}\big|\mathscr{R}(a(t),u(t))-\mathscr{R}_{\mathrm{red}}(a^{0}(t),s^{0}(t),R^{0}(t))\big| &\leq\frac{CM}{\sqrt{d}}\exp\left(MT(1+T)^{2}/\varepsilon^{2}\right)\,, \tag{20}\\
\max\left(\sup_{t\in[0,T]}\frac{1}{\sqrt{m}}\|a(t)-a^{0}(t)\|_{2},\ \frac{1}{\sqrt{m}}\sup_{t\in[0,T]}\|s(t)-s^{0}(t)\|_{2}\right) &\leq\frac{1}{\sqrt{d}}\cdot C\exp\left(MT(1+T)^{2}/\varepsilon^{2}\right)\,, \tag{21}\\
\sup_{t\in[0,T]}\frac{1}{m}\|R(t)-R^{0}(t)\|_{\mathrm{F}} &\leq\frac{1}{\sqrt{d}}\cdot C\exp\left(MT(1+T)^{2}/\varepsilon^{2}\right)\,. \tag{22}
\end{align}

Here $C,C'$ are absolute constants and $M$ only depends on the $M_{i}$'s in Assumptions A1-A3.

The proof of Corollary 1 is deferred to Appendix B.2. From now on, we will assume the initialization $s^{0}_{i}(0)=0$, $r^{0}_{ij}(0)=0$ for $i\neq j$, but drop the superscript $0$ for notational simplicity. We notice in passing that the right-hand sides of Eqs. (20) to (22) are independent of $m$: this approximation step holds uniformly over $m$. (Note that the left-hand sides are normalized by $m$ so as to yield the root mean square error per entry.)

4.2 Elimination of the products $\langle u_{i},u_{j}\rangle$

In order to state the reduction $(ii)$ outlined above, we define the mean field risk as

$$\mathscr{R}_{\mathrm{mf}}(a,s):=\mathscr{R}_{\mathrm{red}}(a,s,R=ss^{\top})=\frac{1}{2}\|\varphi\|^{2}_{L^{2}}-\frac{1}{m}\sum_{i=1}^{m}a_{i}V(s_{i})+\frac{1}{2m^{2}}\sum_{i,j=1}^{m}a_{i}a_{j}U(s_{i}s_{j})\,. \tag{23}$$
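As an illustration, Eq. (23) can be evaluated directly from its definition. The sketch below is hedged: the functions `V` and `U` and the value `PHI_NORM_SQ` are hypothetical placeholders chosen for this example (the actual $V$, $U$ and $\|\varphi\|_{L^{2}}^{2}$ are the ones defined earlier in the paper), and `mean_field_risk` is an illustrative name.

```python
def mean_field_risk(a, s, V, U, phi_norm_sq):
    """Evaluate R_mf(a, s) of Eq. (23) for lists a, s of equal length m."""
    m = len(a)
    # Linear term: (1/m) * sum_i a_i V(s_i)
    linear = sum(ai * V(si) for ai, si in zip(a, s)) / m
    # Quadratic term: (1/(2 m^2)) * sum_{i,j} a_i a_j U(s_i s_j)
    quad = sum(
        a[i] * a[j] * U(s[i] * s[j]) for i in range(m) for j in range(m)
    ) / (2.0 * m * m)
    return 0.5 * phi_norm_sq - linear + quad


# Hypothetical choices, for illustration only (not taken from the paper).
def V(s):
    return 1.0 + s


def U(r):
    return 1.0 + r


PHI_NORM_SQ = 2.0
```

With these placeholder choices, a single neuron with $a_{1}=1$, $s_{1}=1$ gives zero risk, while setting all $a_{i}=0$ returns $\frac{1}{2}\|\varphi\|_{L^{2}}^{2}$.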

Further, we denote by $\{a^{\mathrm{mf}}_{i}(t),s^{\mathrm{mf}}_{i}(t)\}_{i=1}^{m}$ the solution to the following ODEs:

$$\begin{split}\varepsilon\,\partial_{t}a_{i}=\,&V(s_{i})-\frac{1}{m}\sum_{j=1}^{m}a_{j}U(s_{i}s_{j})\,,\\ \partial_{t}s_{i}=\,&a_{i}\left(1-s_{i}^{2}\right)\left(V'(s_{i})-\frac{1}{m}\sum_{j=1}^{m}a_{j}U'(s_{i}s_{j})s_{j}\right)\,.\end{split} \tag{24}$$

Note that (24) would be identical to (16)-(17) if we had $r_{ij}=s_{i}s_{j}$. A priori, this is not the case. However, the two systems of equations are close to each other for large $m$, as made precise by our next proposition, which formalizes reduction $(ii)$.
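For numerical experiments, the system (24) can be integrated with any standard ODE scheme. The following is a minimal sketch using forward Euler, which is a choice made here for simplicity and not the scheme used by the authors. The functions `V`, `dV`, `U`, `dU` are hypothetical placeholders for $V$, $V'$, $U$, $U'$, and the names `mean_field_rhs` and `euler_integrate` are illustrative.

```python
def mean_field_rhs(a, s, V, dV, U, dU, eps):
    """Right-hand sides (da/dt, ds/dt) of the ODEs (24)."""
    m = len(a)
    da, ds = [], []
    for i in range(m):
        # (1/m) sum_j a_j U(s_i s_j)
        fit = sum(a[j] * U(s[i] * s[j]) for j in range(m)) / m
        # (1/m) sum_j a_j U'(s_i s_j) s_j
        dfit = sum(a[j] * dU(s[i] * s[j]) * s[j] for j in range(m)) / m
        da.append((V(s[i]) - fit) / eps)
        ds.append(a[i] * (1.0 - s[i] ** 2) * (dV(s[i]) - dfit))
    return da, ds


def euler_integrate(a0, s0, V, dV, U, dU, eps, dt, n_steps):
    """Forward Euler for (24); dt should be small compared to eps."""
    a, s = list(a0), list(s0)
    for _ in range(n_steps):
        da, ds = mean_field_rhs(a, s, V, dV, U, dU, eps)
        a = [ai + dt * dai for ai, dai in zip(a, da)]
        s = [si + dt * dsi for si, dsi in zip(s, ds)]
    return a, s


# Hypothetical choices, for illustration only (not taken from the paper).
def V(s):
    return 1.0 + s


def dV(s):
    return 1.0


def U(r):
    return 1.0 + r


def dU(r):
    return 1.0
```

Because the first equation carries a factor $1/\varepsilon$, an explicit scheme of this kind needs a step size well below $\varepsilon$ to remain stable; a stiff solver may be preferable when $\varepsilon$ is very small.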

The intuitive explanation for the approximation $r_{ij}\approx s_{i}s_{j}$ is quite interesting. For large $m$, due to 'propagation of chaos', the neuron weights $\{(u_{i},a_{i})\}_{i\leq m}$ are approximately independent. Further, because of the symmetry of the problem under rotations that keep $u_{*}$ fixed, the weights $(u_{i})_{i\leq m}$ are approximately uniformly distributed conditional on $s_{i}=\langle u_{i},u_{*}\rangle$. As a consequence, decomposing $u_{i}=s_{i}u_{*}+u_{i}^{\perp}$, we have $r_{ij}=s_{i}s_{j}+\langle u_{i}^{\perp},u_{j}^{\perp}\rangle$, with $u_{i}^{\perp}$, $u_{j}^{\perp}$ approximately uniform on $\operatorname{span}\{u_{*}\}^{\perp}$ and independent. The inner product of two independent vectors of norm at most one with uniformly distributed directions in a $(d-1)$-dimensional subspace is of order $1/\sqrt{d}$. Therefore, in high dimensions we have $r_{ij}\approx s_{i}s_{j}$.
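The decomposition $r_{ij}=s_{i}s_{j}+\langle u_{i}^{\perp},u_{j}^{\perp}\rangle$ can be illustrated with a small simulation. This is a sketch under the idealized assumption that $u_{*}=e_{1}$ and that the perpendicular parts are exactly uniform and independent (in the actual dynamics this holds only approximately); the names `unit_vector_with_overlap` and `overlap_residual` are illustrative.

```python
import math
import random


def unit_vector_with_overlap(s, d, rng):
    """Return u = s * e_1 + u_perp with ||u||_2 = 1 and u_perp uniform in direction.

    Assumption for this sketch: u_* = e_1, so that <u, u_*> = s.
    """
    g = [rng.gauss(0.0, 1.0) for _ in range(d - 1)]
    norm = math.sqrt(sum(x * x for x in g))
    scale = math.sqrt(1.0 - s * s) / norm  # ||u_perp||_2 = sqrt(1 - s^2)
    return [s] + [scale * x for x in g]


def overlap_residual(s_i, s_j, d, seed=0):
    """Return r_ij - s_i * s_j = <u_i_perp, u_j_perp> for independent draws."""
    rng = random.Random(seed)
    u_i = unit_vector_with_overlap(s_i, d, rng)
    u_j = unit_vector_with_overlap(s_j, d, rng)
    r_ij = sum(x * y for x, y in zip(u_i, u_j))
    return r_ij - s_i * s_j
```

For fixed $s_{i}$, $s_{j}$, the residual returned by `overlap_residual` has typical size of order $1/\sqrt{d}$, so it becomes negligible as $d$ grows.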

Proposition 3 (Reduction to flow in $\mathbb{R}^{2m}$).

Let $(a_{i}(t),s_{i}(t),r_{ij}(t))_{1\leq i<j\leq m}$ be the unique solution of the ODEs (16)-(19) with initialization $s_{i}(0)=0$, $r_{ij}(0)=0$ for all $1\leq i\neq j\leq m$. Let $(a^{\mathrm{mf}}_{i}(t),s^{\mathrm{mf}}_{i}(t))_{i\leq m}$ be the unique solution of the ODEs (24) with initialization $s^{\mathrm{mf}}_{i}(0)=0$, $a^{\mathrm{mf}}_{i}(0)=a_{i}(0)$ for all $i\leq m$.

If Assumptions A1-A3 hold, then for any $T<\infty$ there exists a constant

$$C(T)=M\exp\left(MT(1+T)^{2}/\varepsilon^{2}\right) \tag{25}$$

(with $M$ depending only on the constants $\{M_{i}\}_{1\leq i\leq 3}$ appearing in Assumptions A1-A3) such that:

$$\sup_{t\in[0,T]}\frac{1}{m}\sum_{i=1}^{m}\big\|(a_{i}(t),s_{i}(t))-(a^{\mathrm{mf}}_{i}(t),s^{\mathrm{mf}}_{i}(t))\big\|_{2}^{2}\leq\frac{C(T)}{m}\,.$$

Consequently,

$$\sup_{t\in[0,T]}\left|\mathscr{R}_{\mathrm{red}}\left(a(t),s(t),R(t)\right)-\mathscr{R}_{\mathrm{mf}}\left(a^{\mathrm{mf}}(t),s^{\mathrm{mf}}(t)\right)\right|\leq\frac{C(T)}{\sqrt{m}}\,.$$

The proof of this proposition is deferred to Appendix B.3. Now, combining the propositions and corollaries in this section, we deduce that with high probability over the i.i.d. initialization,

$$\sup_{t\in[0,T]}\left|\mathscr{R}(a(t),u(t))-\mathscr{R}_{\mathrm{mf}}\left(a^{\mathrm{mf}}(t),s^{\mathrm{mf}}(t)\right)\right|\leq\left(\frac{1}{\sqrt{d}}+\frac{1}{\sqrt{m}}\right)CM\exp\left(MT(1+T)^{2}/\varepsilon^{2}\right). \tag{26}$$
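For completeness, here is a sketch of how Eq. (26) follows from Corollary 1 and Proposition 3 by the triangle inequality. The shorthand $E(T):=\exp(MT(1+T)^{2}/\varepsilon^{2})$ is introduced here only for brevity. We assume that the same $M$ can be used in both results (otherwise take the larger of the two) and that the absolute constant satisfies $C\geq 1$ (otherwise enlarge it).

```latex
\begin{align*}
\big|\mathscr{R}(a(t),u(t)) - \mathscr{R}_{\mathrm{mf}}(a^{\mathrm{mf}}(t),s^{\mathrm{mf}}(t))\big|
 &\leq \big|\mathscr{R}(a(t),u(t)) - \mathscr{R}_{\mathrm{red}}(a^{0}(t),s^{0}(t),R^{0}(t))\big| \\
 &\quad + \big|\mathscr{R}_{\mathrm{red}}(a^{0}(t),s^{0}(t),R^{0}(t)) - \mathscr{R}_{\mathrm{mf}}(a^{\mathrm{mf}}(t),s^{\mathrm{mf}}(t))\big| \\
 &\leq \frac{CM}{\sqrt{d}}\,E(T) + \frac{M}{\sqrt{m}}\,E(T)
   && \text{(Eq.~(20) and Proposition 3)} \\
 &\leq \left(\frac{1}{\sqrt{d}} + \frac{1}{\sqrt{m}}\right) CM\,E(T)
   && \text{(using } C \geq 1\text{)}.
\end{align*}
```

The first term is controlled on the high-probability event of Corollary 1, while the second bound is deterministic given the initialization; taking the supremum over $t\in[0,T]$ gives the stated inequality.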

4.3 Connection with mean field theory

Consider the empirical distributions of the neurons:

\begin{align}
\widehat{\rho}_{t} &:=\frac{1}{m}\sum_{i=1}^{m}\delta_{(a_{i}(t),s_{i}(t))}\,, \tag{27}\\
\rho_{t} &:=\frac{1}{m}\sum_{i=1}^{m}\delta_{(a^{\mathrm{mf}}_{i}(t),s^{\mathrm{mf}}_{i}(t))}\,, \tag{28}
\end{align}

with $(a_{i}(t),s_{i}(t))_{i\leq m}$, $(a^{\mathrm{mf}}_{i}(t),s^{\mathrm{mf}}_{i}(t))_{i\leq m}$ as in the statement of Proposition 3, i.e., solving (respectively) Eqs. (16)-(19) and Eq. (24) with initial conditions as given there.

Then, it is immediate to show that $\rho_{t}$ solves (in the weak sense) the following continuity partial differential equation (PDE); we refer to [2, 42] for the definition of weak solutions and basic properties, and to Appendix B.4 for a short derivation:

\begin{align}
\partial_{t}\rho_{t}(a,s) &=-\nabla\cdot\left(\rho_{t}\Psi\left(a,s;\rho_{t}\right)\right) \tag{29}\\
&:=-\left(\partial_{a}\left(\rho_{t}\Psi_{a}\left(a,s;\rho_{t}\right)\right)+\partial_{s}\left(\rho_{t}\Psi_{s}\left(a,s;\rho_{t}\right)\right)\right), \tag{30}
\end{align}

where $\Psi=(\Psi_{a},\Psi_{s})$ is given by

\begin{align}
\Psi_{a}(a,s;\rho) &=\varepsilon^{-1}\cdot\left(V(s)-\int_{\mathbb{R}^{2}}a_{1}U(ss_{1})\,\rho(\mathrm{d}a_{1},\mathrm{d}s_{1})\right), \tag{31}\\
\Psi_{s}(a,s;\rho) &=a(1-s^{2})\cdot\left(V'(s)-\int_{\mathbb{R}^{2}}a_{1}s_{1}U'(ss_{1})\,\rho(\mathrm{d}a_{1},\mathrm{d}s_{1})\right). \tag{32}
\end{align}
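The claim that the empirical measure $\rho_{t}$ of Eq. (28) is a weak solution can be sketched as follows; this is the standard argument and may differ in presentation from the derivation in Appendix B.4. The test function $f$ (smooth and compactly supported on $\mathbb{R}^{2}$) is notation introduced here.

```latex
\begin{align*}
\frac{\mathrm{d}}{\mathrm{d}t}\int f\,\mathrm{d}\rho_{t}
 &= \frac{1}{m}\sum_{i=1}^{m}\frac{\mathrm{d}}{\mathrm{d}t}
    f\big(a^{\mathrm{mf}}_{i}(t),s^{\mathrm{mf}}_{i}(t)\big) \\
 &= \frac{1}{m}\sum_{i=1}^{m}\Big(\partial_{a}f\,\partial_{t}a^{\mathrm{mf}}_{i}
    + \partial_{s}f\,\partial_{t}s^{\mathrm{mf}}_{i}\Big) \\
 &= \frac{1}{m}\sum_{i=1}^{m}\nabla f\big(a^{\mathrm{mf}}_{i},s^{\mathrm{mf}}_{i}\big)
    \cdot\Psi\big(a^{\mathrm{mf}}_{i},s^{\mathrm{mf}}_{i};\rho_{t}\big) \\
 &= \int \nabla f(a,s)\cdot\Psi(a,s;\rho_{t})\,\rho_{t}(\mathrm{d}a,\mathrm{d}s).
\end{align*}
```

The third equality uses Eq. (24) together with the identity $\frac{1}{m}\sum_{j}a^{\mathrm{mf}}_{j}U(s\,s^{\mathrm{mf}}_{j})=\int a_{1}U(ss_{1})\,\rho_{t}(\mathrm{d}a_{1},\mathrm{d}s_{1})$ and its analogue for $U'$, so that $\partial_{t}a^{\mathrm{mf}}_{i}=\Psi_{a}$ and $\partial_{t}s^{\mathrm{mf}}_{i}=\Psi_{s}$ evaluated at neuron $i$. The resulting identity is the weak formulation of the continuity PDE (29).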

This equation can be extended to a flow in the whole space $(\mathscr{P}(\mathbb{R}^{2}),W_{2})$ (all probability measures on $\mathbb{R}^{2}$ equipped with the second Wasserstein distance), and interpreted as a gradient flow with respect to this metric of the following risk:

$$\mathscr{R}_{\mathrm{mf},*}(\rho):=\frac{1}{2}\|\varphi\|^{2}_{L^{2}}-\int aV(s)\,\rho(\mathrm{d}a,\mathrm{d}s)+\frac{1}{2}\int a_{1}a_{2}U(s_{1}s_{2})\,\rho(\mathrm{d}a_{1},\mathrm{d}s_{1})\,\rho(\mathrm{d}a_{2},\mathrm{d}s_{2})\,, \tag{33}$$

which is the obvious extension of $\mathscr{R}_{\mathrm{mf}}(a,s)$ of Eq. (23) to general probability distributions. Proposition 3 implies that for any $T<\infty$, and under the above initial conditions,

$$\sup_{t\in[0,T]}W_{2}(\rho_{t},\widehat{\rho}_{t})\leq\sqrt{\frac{M\exp(MT(1+T)^{2}/\varepsilon^{2})}{m}}\,. \tag{34}$$
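The step from Proposition 3 to Eq. (34) uses the fact that pairing atom $i$ of $\widehat{\rho}_{t}$ with atom $i$ of $\rho_{t}$ is a valid coupling, so that $W_{2}^{2}(\rho_{t},\widehat{\rho}_{t})\leq\frac{1}{m}\sum_{i}\|(a_{i},s_{i})-(a^{\mathrm{mf}}_{i},s^{\mathrm{mf}}_{i})\|_{2}^{2}$. A minimal sketch of this upper bound is given below; the function name `w2_index_coupling_bound` is illustrative, and the function returns the cost of this particular coupling, not the exact Wasserstein distance.

```python
import math


def w2_index_coupling_bound(points_p, points_q):
    """Upper bound on W_2 between two empirical measures with m atoms each.

    points_p and points_q are lists of (a, s) pairs of equal length. The bound
    is the cost of the coupling that matches atom i with atom i; the true W_2
    distance is the infimum over all couplings and can only be smaller.
    """
    if len(points_p) != len(points_q):
        raise ValueError("both measures must have the same number of atoms")
    m = len(points_p)
    total = sum(
        (ap - aq) ** 2 + (sp - sq) ** 2
        for (ap, sp), (aq, sq) in zip(points_p, points_q)
    )
    return math.sqrt(total / m)
```

Applied to the trajectories of Eqs. (16)-(19) and of Eq. (24) at a common time $t$, this quantity is exactly the square root of the left-hand side in Proposition 3, which is why the bound $C(T)/m$ translates into Eq. (34).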

If we further denote by $\rho_{t}^{d}$ the empirical distribution of $(a_{i}(t),s_{i}(t))$, $i\leq m$, when $s_{i}(0)=\langle u_{i}(0),u_{*}\rangle$, $u_{i}(0)\sim\mathrm{Unif}(\mathbb{S}^{d-1})$, a further application of Corollary 1 yields

$$\sup_{t\in[0,T]}W_{2}(\rho^{d}_{t},\rho_{t})\leq\sqrt{\frac{M\exp(MT(1+T)^{2}/\varepsilon^{2})}{m\wedge d}}\,. \tag{35}$$

Starting with [32, 13, 40], several authors used continuity PDEs of the form (29) to study the learning dynamics of two-layer neural networks. Following the physics tradition, this is referred to as the ‘mean-field theory’ of two-layer neural networks. Appendix B.5 sketches an alternative approach to prove bounds of the form (26), (35) using the results of [32, 33]. The present derivation has the advantages of yielding a sharper bound and of being self-contained.

4.4 A general formulation

As mentioned above, the system of ODEs in Eq. (24) is a special case of the Wasserstein gradient flow of Eq. (29) whereby we set $\rho_{0}=m^{-1}\sum_{i=1}^{m}\delta_{(a_{i}^{\mathrm{mf}}(0),s_{i}^{\mathrm{mf}}(0))}$. In order to study the solutions of Eq. (29) (hence Eq. (24)) we adopt the following framework. Let $(\Omega,\rho)$ denote a probability space. Let $a=a(\omega,t)$ and $s=s(\omega,t)$ ($\omega\in\Omega$, $t\geqslant 0$) be two measurable functions satisfying (dropping dependencies in $t$ below)

$$\begin{aligned}
\varepsilon\partial_t a(\omega) &= V(s(\omega))-\int a(\nu)\,U(s(\omega)s(\nu))\,\mathrm{d}\rho(\nu)\,,\\
\partial_t s(\omega) &= a(\omega)\left(1-s(\omega)^2\right)\left(V'(s(\omega))-\int a(\nu)\,U'(s(\omega)s(\nu))\,s(\nu)\,\mathrm{d}\rho(\nu)\right)\,.
\end{aligned} \tag{36}$$

If $\omega=i\in\Omega=\{1,\dots,m\}$ endowed with the uniform measure, we obtain the equations (24). In general, the push-forward $\rho_t$ of the measure $\rho$ through the map $\omega\in\Omega\mapsto(a(\omega,t),s(\omega,t))\in\mathbb{R}^2$ satisfies the mean-field equation (29). As a consequence, the dynamics (36) can be viewed as a gradient flow on the risk

$$\mathscr{R}_{\mathrm{mf},*}(\rho)=\frac{1}{2}\|\varphi\|^2-\int a(\omega)V(s(\omega))\,\mathrm{d}\rho(\omega)+\frac{1}{2}\int a(\omega_1)a(\omega_2)\,U(s(\omega_1)s(\omega_2))\,\mathrm{d}\rho(\omega_1)\,\mathrm{d}\rho(\omega_2)\,. \tag{37}$$

We next characterize the landscape of the risk function $\mathscr{R}_{\mathrm{mf},*}(\rho)$. In particular, we establish that under certain conditions, the global infimum of $\mathscr{R}_{\mathrm{mf},*}(\rho)$ is $0$.

Proposition 4.

The risk function $\mathscr{R}_{\mathrm{mf},*}(\rho)$ can be expressed as

$$\mathscr{R}_{\mathrm{mf},*}(\rho)=\frac{1}{2}\sum_{k=0}^{\infty}\left(\varphi_k-\sigma_k\int a(\omega)s(\omega)^k\,\mathrm{d}\rho(\omega)\right)^2. \tag{38}$$

Assume that $\sigma_k\neq 0$ for all $k\geq 0$, and that

$$\sum_{k=0}^{\infty}\sigma_k^2<\infty,\quad \sum_{k=0}^{\infty}\varphi_k^2<\infty.$$

Then, for any $\delta>0$, there exists a triple $(a,s,\rho)$ such that $a\in L^2(\rho)$, $s\in[-1,1]$, and $\mathscr{R}_{\mathrm{mf},*}(\rho)\leq\delta^2$.

This proposition is proved in Appendix B.6.
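The identity between (37) and (38) can be checked numerically. The sketch below does so for the uniform measure on $m$ atoms. It assumes the truncated power-series forms $V(s)=\sum_k\varphi_k\sigma_k s^k$ and $U(r)=\sum_k\sigma_k^2 r^k$ (which is what matching (37) against (38) term by term requires) together with $\|\varphi\|^2=\sum_k\varphi_k^2$. The coefficient values and the function names are hypothetical placeholders chosen for illustration only.

```python
import numpy as np
from numpy.polynomial.polynomial import polyval

def risk_pairwise(a, s, phi, sigma):
    """Eq. (37) for the uniform measure on m atoms (truncated series)."""
    V = polyval(s, phi * sigma)             # V(s_i) = sum_k phi_k sigma_k s_i^k
    U = polyval(np.outer(s, s), sigma**2)   # U(s_i s_j) = sum_k sigma_k^2 (s_i s_j)^k
    # Single and double integrals against rho become means over atoms / pairs.
    return 0.5 * np.sum(phi**2) - np.mean(a * V) + 0.5 * np.mean(np.outer(a, a) * U)

def risk_hermite(a, s, phi, sigma):
    """Eq. (38): half the sum of squared residuals, degree by degree."""
    k = np.arange(len(phi))
    moments = np.mean(a[:, None] * s[:, None] ** k, axis=0)  # (1/m) sum_i a_i s_i^k
    return 0.5 * np.sum((phi - sigma * moments) ** 2)

# Placeholder coefficients for degrees 0..3 (assumed values, not from the paper).
phi = np.array([1.0, -1.0, -2.0 / 3.0, 0.0])
sigma = np.array([0.4, 0.5, 0.3, 0.1])
rng = np.random.default_rng(0)
a = rng.normal(size=8)
s = rng.uniform(-1.0, 1.0, size=8)
gap = abs(risk_pairwise(a, s, phi, sigma) - risk_hermite(a, s, phi, sigma))
```

Under these assumptions the two expressions agree up to floating-point error, which is the finite-$m$, finite-degree analogue of the first claim of Proposition 4.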

Remark 4.2.

Proposition 4 complements Proposition 1, which establishes approximability of the target function $f_*(x)=\varphi(\langle u_*,x\rangle)$ using the networks (2) (Proposition 4 can be seen as an $m=d=\infty$ version of the latter). We note that the proofs of these propositions also provide insight into the structure of approximators. Namely, we can take the weights $u_i$ to be i.i.d. with distribution $\rho(u)\mathrm{d}u$ that is symmetric under rotations around $u_*$, and $a_i=\alpha(u_i)$, where $a(u)=\alpha(u)\rho(u)$ is concentrated close to $\langle u_*,u\rangle=0$ (on a scale that can depend on the desired approximation error).

Indeed, the analysis of gradient flow in Section 6 reveals that the solutions found by gradient flow are of this nature. Namely, neurons develop a small but strictly positive alignment with $u_*$. The distribution and size of the alignment evolve over time.

Remark 4.3.

The results in this section can be generalized to multi-index models: $y=\varphi(U_*^{\top}x)$ where $U_*\in O(d,k)$, the space of $d\times k$ orthogonal matrices. Further, the corresponding limiting dynamics become

$$\begin{aligned}
\varepsilon\partial_t a(\omega) &= V(s(\omega))-\int a(\nu)\,U\left(s(\omega)^{\top}s(\nu)\right)\mathrm{d}\rho(\nu)\,,\\
\partial_t s(\omega) &= a(\omega)\left(I_k-s(\omega)s(\omega)^{\top}\right)\left(\nabla V(s(\omega))-\int a(\nu)\,U'\left(s(\omega)^{\top}s(\nu)\right)s(\nu)\,\mathrm{d}\rho(\nu)\right)\,.
\end{aligned}$$

Here, $s(\omega)\in\mathbb{R}^k$ represents $U_*^{\top}u(\omega)$, and for $s\in\mathbb{R}^k$, $\|s\|_2\leq 1$:

$$V(s)=\mathbb{E}\left[\varphi(G)\sigma(G_s)\right],\quad (G,G_s)\sim\mathcal{N}\left(0,\begin{bmatrix} I_k & s\\ s^{\top} & 1\end{bmatrix}\right).$$

The definition of $U$ is the same as before.

5 Numerical solution

Figure 2 (panels: (a) $\varepsilon=10^{-3}$; (b) $\varepsilon=10^{-6}$): Simulation of the simplified neuron dynamics of Eqs. (24), with the target function of Eq. (39) and ReLU activations. We use learning rate ratios $\varepsilon=10^{-3}$ (left) and $\varepsilon=10^{-6}$ (right) and we use $m=10$ neurons. First two rows: evolution of the risk $\mathscr{R}_{\mathrm{mf}}$ of Eq. (23), in linear and log-scales. Third row: evolution of the first three terms of the sum of (40).

Figure 3: Same simulation as in Figure 2 (b). In these plots, we show the evolution of the $a_i$ and the $s_i$ for $i\in\{1,\dots,m\}$ following a discretization of Eqs. (24).

Figure 4: Comparison between the simplified neuron dynamics (24) (MF) and projected gradient descent for the two-layer neural network (2) (NN), with the same target function and activation as the simulations in Figure 2. We use four different combinations of (learning rate ratio, network width): $(\varepsilon,m)=(1,10)$ (first row), $(\varepsilon,m)=(1,50)$ (second row), $(\varepsilon,m)=(10^{-3},10)$ (third row), and $(\varepsilon,m)=(10^{-3},50)$ (fourth row). Left panel: evolution of the risk $\mathscr{R}_{\mathrm{mf}}$ for NN and MF on a logarithmic scale. Right panel: evolution of the first three components of $\mathscr{R}_{\mathrm{mf}}$ (constant, linear, and quadratic) for NN and MF.

In Figure 2, we present the result of an Euler discretization of Eqs. (24) where $\varphi$ is a degree-$2$ polynomial and $\sigma$ is the ReLU activation: $\sigma(s)=\max(s,0)$,

$$\begin{aligned}
\varphi(s) &= \mathrm{He}_0(s)-\mathrm{He}_1(s)-\frac{2}{3}\mathrm{He}_2(s)\\
&= \left(1-\frac{2\sqrt{2}}{6}\right)-s-\frac{2\sqrt{2}}{6}s^2\,.
\end{aligned} \tag{39}$$

These plots clearly display two of the features emphasized in the introduction: (i) plateaus separated by periods of rapid improvement of the risk; (ii) increasingly long timescales (notice the logarithmic time axis in the second and third row).

In order to examine the incremental learning structure, we rewrite the risk $\mathscr{R}_{\mathrm{mf}}$ of Eq. (23) by decomposing $\varphi$ and $\sigma$ in the basis of Hermite polynomials

$$\mathscr{R}_{\mathrm{mf}}(a,s)=\frac{1}{2}\sum_{k\geqslant 0}\left(\varphi_k-\frac{\sigma_k}{m}\sum_{i=1}^{m}a_i s_i^k\right)^2\,. \tag{40}$$

We observe that, for small $\varepsilon$, the Hermite coefficients of $\varphi$ are learned sequentially, in the order of their degree. When $\varepsilon$ is sufficiently small (right plots), this incremental learning happens in well separated phases. The plateaus and waterfalls in the plots of $\mathscr{R}_{\mathrm{mf}}$ correspond to the network learning increasingly higher degree polynomials.
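A minimal sketch of such an Euler discretization is given below. It is written in the Hermite form of the dynamics (the finite-$m$ case of the system with $\Omega=\{1,\dots,m\}$ and uniform measure) and records the per-degree terms of (40) along the trajectory. Several choices are assumptions made for illustration: the series is truncated at degree 2, the values of $\sigma_k$ are placeholders rather than the exact ReLU coefficients, the initialization of the $a_i$ is arbitrary, and the step size, horizon, and function names are hypothetical. The coefficients $\varphi_0,\varphi_1,\varphi_2$ are those of the Hermite expansion in (39).

```python
import numpy as np

def residuals(a, s, phi, sigma):
    """Delta_k = phi_k - (sigma_k / m) * sum_i a_i s_i^k, the k-th term inside (40)."""
    k = np.arange(len(phi))
    return phi - sigma * np.mean(a[:, None] * s[:, None] ** k, axis=0)

def euler_step(a, s, phi, sigma, eps, dt):
    """One explicit Euler step of the finite-m dynamics in Hermite form."""
    k = np.arange(len(phi))
    w = sigma * residuals(a, s, phi, sigma)     # sigma_k * Delta_k
    powers = s[:, None] ** k                    # s_i^k (NumPy evaluates 0**0 as 1)
    dpowers = np.zeros_like(powers)
    dpowers[:, 1:] = k[1:] * powers[:, :-1]     # k * s_i^(k-1); the k = 0 column stays 0
    a_new = a + (dt / eps) * (powers @ w)
    s_new = s + dt * a * (1.0 - s**2) * (dpowers @ w)
    return a_new, np.clip(s_new, -1.0, 1.0)     # guard against Euler overshoot of |s| <= 1

# Hypothetical setup: phi_k read off (39); sigma_k are placeholder values.
phi = np.array([1.0, -1.0, -2.0 / 3.0])
sigma = np.array([0.4, 0.5, 0.3])
m, eps, dt = 10, 1e-2, 1e-3
a, s = np.linspace(-1.0, 1.0, m), np.zeros(m)   # s_i(0) = 0, arbitrary a_i(0)

history = [0.5 * residuals(a, s, phi, sigma) ** 2]   # per-degree terms of (40)
for _ in range(200):
    a, s = euler_step(a, s, phi, sigma, eps, dt)
    history.append(0.5 * residuals(a, s, phi, sigma) ** 2)
history = np.array(history)                     # shape (steps + 1, number of degrees)
total_risk = history.sum(axis=1)
```

Plotting the columns of `history` against time on a logarithmic axis is one way to look for the sequential pattern described above. Explicit Euler needs a step `dt` well below `eps` here because the equation for the $a_i$ is stiff; a much smaller `eps` would call for a correspondingly smaller step or an implicit scheme.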

In Figure 3 we plot the evolution of the values of the $a_i$ and $s_i$, for $i\in\{1,\dots,m\}$. We observe that the overall order of magnitude of the $a_i$’s and the $s_i$’s increases when passing through the different phases of the incremental learning process. Meanwhile, some of the $a_i$’s and $s_i$’s undergo a sign change during the learning process, which is characterized by a sudden decrease and subsequent rapid increase in their magnitude.

Altogether, the results of Figures 2 and 3 are consistent with the canonical learning order up to level $L=2$ as per Definition 1. While we conjecture that incremental learning also occurs for higher-order polynomials, we found this hard to observe in numerical simulations: we would need to take $\varepsilon$ much smaller than in Figure 2, resulting in prohibitively large simulation costs.

First, as predicted in Definition 1, the times at which the components are learned are closer on a logarithmic scale as the degree increases. It is therefore increasingly difficult to observe time scales corresponding to higher degrees.

Second, we expect there to be a choice of the initialization $(a_{i,\mathrm{init}},u_{i,\mathrm{init}})_{i\in[m]}$, activation and target function, for which not all the components of $\varphi$ are actually learnt. We observed empirically that this happens easily for small $m$.

To conclude this section, in Figure 4 we compare the simplified neuron dynamics (MF) of Eq. (24) and the evolution of projected gradient descent for the original two-layer neural network (NN). From the plots we observe two remarkable phenomena: (1) the risk curves of NN and MF stay close to each other during the entire learning process for both large ($\varepsilon=1$) and small ($\varepsilon=10^{-3}$) learning rate ratios, and they have the same qualitative behavior even if $m$ is small ($m=10$); (2) as we increase the value of $m$ from $10$ to $50$, the agreement between the learning curves of NN and MF improves significantly. These observations support our argument in Section 4.2 that the inter-neuron correlations $r_{ij}$ are well approximated by $s_is_j$ for wide networks.
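One way to see why $r_{ij}\approx s_is_j$ is plausible in high dimension is the following sketch. It is an illustration under a simplifying assumption, not the argument of Section 4.2: we assume each neuron decomposes as $u_i=s_iu_*+\sqrt{1-s_i^2}\,w_i$ with the $w_i$ independent random unit vectors orthogonal to $u_*$. The orthogonal parts then contribute only $O(d^{-1/2})$ to each off-diagonal correlation. The helper name and all numerical values are hypothetical.

```python
import numpy as np

def neurons_with_alignment(s, u_star, rng):
    """Unit vectors u_i = s_i u_* + sqrt(1 - s_i^2) w_i, with w_i random and orthogonal to u_*."""
    d = u_star.shape[0]
    w = rng.normal(size=(len(s), d))
    w -= np.outer(w @ u_star, u_star)               # remove the component along u_*
    w /= np.linalg.norm(w, axis=1, keepdims=True)   # normalize each w_i to unit length
    return s[:, None] * u_star + np.sqrt(1.0 - s**2)[:, None] * w

rng = np.random.default_rng(0)
d, m = 10_000, 10
u_star = np.zeros(d)
u_star[0] = 1.0
s = rng.uniform(-0.9, 0.9, size=m)                  # alignments s_i = <u_i, u_*>
u = neurons_with_alignment(s, u_star, rng)

r = u @ u.T                                         # inter-neuron correlations r_ij
off_diag = ~np.eye(m, dtype=bool)
max_gap = np.max(np.abs(r - np.outer(s, s))[off_diag])
```

In this toy model `max_gap` shrinks as `d` grows, since the inner products of the $w_i$ are of order $d^{-1/2}$.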

6 Timescales hierarchy in the gradient flow dynamics

We are interested in the behavior of the solution of the ODEs (36), initialized from $s(\omega,0)=0$ for all $\omega$ (as per Proposition 3). The canonical learning order of Definition 1 concerns the behavior of solutions for $\varepsilon\to 0$. This type of question can be addressed within the theory of dynamical systems using singular perturbation theory [25]. Here, ‘singular’ refers to the fact that $\varepsilon$ multiplies one of the highest-order derivatives. In Eq. (36), $\varepsilon$ multiplies the differential term $\partial_t a(\omega)$, so that the ODE system becomes singular in the limit $\varepsilon\to 0$. In particular, it degenerates to the following system of differential-algebraic equations:

$$\begin{aligned}
V(s(\omega)) &= \int a(\nu)\,U(s(\omega)s(\nu))\,\mathrm{d}\rho(\nu)\,,\\
\partial_t s(\omega) &= a(\omega)\left(1-s(\omega)^2\right)\left(V'(s(\omega))-\int a(\nu)\,U'(s(\omega)s(\nu))\,s(\nu)\,\mathrm{d}\rho(\nu)\right)\,.
\end{aligned} \tag{41}$$

Owing to this singularity, the qualitative behavior of the above system is dramatically different from that of Eq. (36) with $\varepsilon$ small but non-zero. This is in stark contrast to regular perturbation problems, for which the limiting dynamics is still a system of differential equations with the same order and similar qualitative behavior as the perturbed system.

As a side remark, we note that the system (36) can be seen as a slow-fast dynamical system, where the $a(\omega)$’s are the fast variables and the $s(\omega)$’s are the slow variables [10]. Formally, the right-hand side governing $\partial_t a(\omega)$ carries a factor $1/\varepsilon$. From a dynamical systems perspective, the present case is made complicated because of a bifurcation when the $s(\omega)$’s become non-zero.

The canonical learning order provides a detailed description of this bifurcation. We will motivate this scenario using a classical, but non-rigorous, technique of singular perturbation theory, called the matched asymptotic expansion [25, Chapter 2]. This technique decomposes the approximation of the solution into several time scales, on each of which a regular approximation holds. These time scales are traditionally called layers in the literature; however, we avoid this terminology due to the potential confusion with the layers of the neural network.

We will work mainly using the Hermite representation of the dynamical ODEs (36), which we write down for the reader’s convenience:

$$\begin{aligned}
\varepsilon\partial_t a(\omega) &= \sum_{k=0}^{\infty}\sigma_k s(\omega)^k\left(\varphi_k-\sigma_k\int a(\nu)s(\nu)^k\,\mathrm{d}\rho(\nu)\right)\,,\\
\partial_t s(\omega) &= a(\omega)\left(1-s(\omega)^2\right)\sum_{k=1}^{\infty}k\sigma_k s(\omega)^{k-1}\left(\varphi_k-\sigma_k\int a(\nu)s(\nu)^k\,\mathrm{d}\rho(\nu)\right)\,.
\end{aligned} \tag{42}$$

The rest of this section is organized as follows. We first give a brief overview of the method of matched asymptotic expansions and a summary of our main results regarding the learning timescales in Section 6.1. Sections 6.2-6.4 respectively describe the first three time scales of the matched asymptotic expansion of (42). This gives, for each time scale, an approximation of the $a(\omega)$, $s(\omega)$. In Appendix C.2, we detail how these sections induce an evolution of the risk alternating plateaus and rapid decreases, and support the standing learning scenario of Definition 1. Finally, in Section 6.5, we conjecture the behavior on longer time scales.

Notations.

We denote by $\mathds{1}$ the constant function $\mathds{1}:\omega\in\Omega\mapsto 1\in\mathbb{R}$. We denote by $\langle\cdot,\cdot\rangle_{L^2(\rho)}$ the dot product on $L^2(\rho)$ and by $\|\cdot\|_{L^2(\rho)}$ the associated norm. For $x\in L^2(\rho)$, we denote by $x_{\perp}$ the orthogonal projection of $x$ on the hyperplane $\mathds{1}^{\perp}$ of $L^2(\rho)$ of functions orthogonal to $\mathds{1}$:

$$x_{\perp}(\omega)=x(\omega)-\int x(\nu)\,\mathrm{d}\rho(\nu)\,.$$

We denote $a_{\mathrm{init}}(\omega)=a(\omega,0)$ and thus $a_{\perp,\mathrm{init}}$ is the orthogonal projection of $a_{\mathrm{init}}$ on $\mathds{1}^{\perp}$.
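For a discrete measure $\rho$, the projection onto $\mathds{1}^{\perp}$ simply subtracts the $\rho$-mean. A minimal sketch follows; the function name `project_perp`, its optional `weights` argument, and the sample values are hypothetical and introduced only for illustration.

```python
import numpy as np

def project_perp(x, weights=None):
    """Orthogonal projection of x onto 1^perp in L^2(rho), for a discrete measure rho."""
    x = np.asarray(x, dtype=float)
    if weights is None:                      # default: uniform measure on the atoms
        weights = np.full(x.shape, 1.0 / x.size)
    return x - np.sum(weights * x)           # subtract the rho-mean of x

a_init = np.array([0.3, -1.2, 0.8, 2.1])     # arbitrary illustrative values
a_perp_init = project_perp(a_init)
```

Applying `project_perp` twice gives the same result as applying it once, as expected of an orthogonal projection.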

6.1 Matched asymptotic expansions

The method of matched asymptotic expansions is a common approach to finding approximate solutions of perturbed differential equations. In the present paper, we are mainly interested in applying this technique to approximate the solution to the specific singularly perturbed ODE system of Eq. (42). (Although we keep calling this an ODE system, it is important to keep in mind that it takes place in an infinite-dimensional space.) Denoting by $t$ the independent variable and by $\varepsilon$ the perturbation parameter, the method of matched asymptotic expansions consists of the following three steps: (1) Divide the domain of $t$ (generally a subinterval of $\mathbb{R}$) into several subdomains, which may overlap each other and depend on the perturbation parameter $\varepsilon$; (2) Within each subdomain, find an accurate approximation to the perturbed system. This is usually achieved by expanding the perturbed system in powers of $\varepsilon$, and keeping only terms that are relevant to the current domain; (3) The approximate solutions obtained in Step (2) might not be valid in the overlap of two adjacent subdomains. To resolve this issue, these approximate solutions are then combined together through a process called “matching” to produce an approximation that is valid on the entire domain.

In our setting, the singularly perturbed system (42) takes the form of

$$\begin{aligned}
\varepsilon\partial_t a(\omega) &= f(a(\omega),s(\omega)),\\
\partial_t s(\omega) &= g(a(\omega),s(\omega)).
\end{aligned}$$
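To make the role of $\varepsilon$ explicit, it can help to write the same system in the fast time $t_1=t/\varepsilon$ used in Section 6.2. With $f$ and $g$ as in the display above, the chain rule $\partial_{t_1}=\varepsilon\,\partial_t$ gives the following form (a standard rewriting, stated here as an illustration):

```latex
\begin{aligned}
\partial_{t_1} a(\omega) &= f(a(\omega), s(\omega)), \\
\partial_{t_1} s(\omega) &= \varepsilon\, g(a(\omega), s(\omega)).
\end{aligned}
```

Formally setting $\varepsilon=0$ in this form freezes $s$ and leaves only the fast relaxation of $a$. Setting $\varepsilon=0$ in the original time instead yields the constraint $f=0$ coupled with $\partial_t s=g$, as in (41). Neither limit alone describes the whole trajectory, which is why approximations on several time scales have to be matched.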

We will carry out explicit calculations for the first three time scales in Sections 6.2-6.4, respectively. Here is a summary of our main findings:

  • In Section 6.2 we explore the learning of the constant component of the target function, which happens at the timescale $t=\Theta(\varepsilon)$. At the end of this phase, the mean-field risk (see (37)) evolves to

    $$\mathscr{R}_{\mathrm{mf},*}=\frac{1}{2}\sum_{k\geqslant 1}\varphi_k^2+O(\varepsilon)\,. \tag{43}$$

    In other words, during this phase, gradient flow learns the constant term in $\varphi$. At the end of this time scale we have $a(\omega)=\Theta(1)$ and $s(\omega)=\Theta(\varepsilon)$.

  • Then, in Section 6.3 we investigate the second time scale $t=t_2\varepsilon^{1/2}$, $t_2\leq c\log(1/\varepsilon)$, during which the $a(\omega)$’s and $s(\omega)$’s increase to a different order in $\varepsilon$. The result of this time scale is mainly technical and needed to understand the transition to the time scale of Section 6.4. We also perform the matching procedure to combine the approximate solution within this time scale with the one obtained in Section 6.2. At the end of this time scale we have $a(\omega)=\Theta(e^{c't_2})$ and $s(\omega)=\Theta(e^{c't_2}\varepsilon^{1/2})$.

  • To understand the evolution of the risk relevant to learning the linear component of $\varphi$, we introduce a new time scale $t=\frac{1}{4|\varphi_1\sigma_1|}\varepsilon^{1/2}\log\frac{1}{\varepsilon}+\Theta(\varepsilon^{1/2})$ in Section 6.4, and show that the linear component can be learned within this time scale. To be more accurate, at the end of this time scale we have

    \begin{equation}
    \mathscr{R}_{\mathrm{mf},*} = \frac{1}{2}\sum_{k\geqslant 2}\varphi_k^2 + O(\varepsilon^{1/4})\,, \tag{44}
    \end{equation}

    and $a(\omega)=\Theta(\varepsilon^{-1/4})$ and $s(\omega)=\Theta(\varepsilon^{1/4})$.

Finally, in Section 6.5, we conjecture the behavior of the approximate solutions and induced risks for longer time scales.

6.2 First time scale: constant component

We define a “fast” time variable $t_1 = t/\varepsilon$ and substitute it into Eq. (42). We expand the solutions $a(\omega)$ and $s(\omega)$ in powers of $\varepsilon$:

\begin{align}
a(\omega) &= a^{(0)}(\omega) + \varepsilon a^{(1)}(\omega) + \varepsilon^{2} a^{(2)}(\omega) + \dots\,, \tag{45}\\
s(\omega) &= s^{(0)}(\omega) + \varepsilon s^{(1)}(\omega) + \varepsilon^{2} s^{(2)}(\omega) + \dots\,, \tag{46}
\end{align}

where $a^{(0)}(\omega), a^{(1)}(\omega), a^{(2)}(\omega), \dots, s^{(0)}(\omega), s^{(1)}(\omega), s^{(2)}(\omega), \dots$ are implicitly functions of $t_1$. They are initialized at

\begin{align}
a^{(0)}(\omega, t_1=0) &= a_{\rm init}(\omega)\,, & a^{(1)}(\omega, t_1=0) &= 0\,, & a^{(2)}(\omega, t_1=0) &= 0\,, & &\dots \tag{47}\\
s^{(0)}(\omega, t_1=0) &= 0\,, & s^{(1)}(\omega, t_1=0) &= 0\,, & s^{(2)}(\omega, t_1=0) &= 0\,, & &\dots \tag{48}
\end{align}

to be consistent with the initial conditions $a(\omega, t_1=0) = a(\omega, t=0) = a_{\rm init}(\omega)$ and $s(\omega, t_1=0) = s(\omega, t=0) = 0$.

We substitute the expansion in (42):

\begin{align}
&\partial_{t_1} a^{(0)}(\omega) + \varepsilon\,\partial_{t_1} a^{(1)}(\omega) + \dots \tag{49}\\
&\quad = \sum_{k=0}^{\infty} \sigma_k \left(s^{(0)}(\omega) + \varepsilon s^{(1)}(\omega) + \dots\right)^{k} \tag{50}\\
&\quad\qquad \times \left(\varphi_k - \sigma_k \int \left(a^{(0)}(\nu) + \varepsilon a^{(1)}(\nu) + \dots\right)\left(s^{(0)}(\nu) + \varepsilon s^{(1)}(\nu) + \dots\right)^{k} \mathrm{d}\rho(\nu)\right)\,, \tag{51}\\
&\partial_{t_1} s^{(0)}(\omega) + \varepsilon\,\partial_{t_1} s^{(1)}(\omega) + \dots \tag{52}\\
&\quad = \varepsilon \left(a^{(0)}(\omega) + \varepsilon a^{(1)}(\omega) + \dots\right)\left(1 - \left(s^{(0)}(\omega) + \varepsilon s^{(1)}(\omega) + \dots\right)^{2}\right) \tag{53}\\
&\quad\qquad \times \sum_{k=1}^{\infty} k \sigma_k \left(s^{(0)}(\omega) + \varepsilon s^{(1)}(\omega) + \dots\right)^{k-1} \tag{54}\\
&\quad\qquad \times \left(\varphi_k - \sigma_k \int \left(a^{(0)}(\nu) + \varepsilon a^{(1)}(\nu) + \dots\right)\left(s^{(0)}(\nu) + \varepsilon s^{(1)}(\nu) + \dots\right)^{k} \mathrm{d}\rho(\nu)\right)\,. \tag{55}
\end{align}

The basic assumption of matched asymptotic expansions is that terms of the same order in $\varepsilon$ can be identified (with some limitations that we develop below). For now, let us identify terms of order $1 = \varepsilon^{0}$:

\begin{align}
\partial_{t_1} a^{(0)}(\omega) &= \sum_{k=0}^{\infty} \sigma_k \left(s^{(0)}(\omega)\right)^{k} \left(\varphi_k - \sigma_k \int a^{(0)}(\nu) \left(s^{(0)}(\nu)\right)^{k} \mathrm{d}\rho(\nu)\right)\,, \tag{56}\\
\partial_{t_1} s^{(0)}(\omega) &= 0\,. \tag{57}
\end{align}

From (57) and (48), we have $s^{(0)}(\omega) = 0$: time $t_1 = O(1) \Leftrightarrow t = O(\varepsilon)$ is too short for the $s(\omega)$ to be of order $1$.

Substituting $s^{(0)}(\omega) = 0$ in (56), we obtain

\begin{equation}
\partial_{t_1} a^{(0)}(\omega) = \sigma_0 \left(\varphi_0 - \sigma_0 \int a^{(0)}(\nu)\, \mathrm{d}\rho(\nu)\right)\,. \tag{58}
\end{equation}

Recall that $\langle \cdot, \cdot \rangle_{L^2(\rho)}$ is the dot product on $L^2(\rho)$, $\mathds{1}$ denotes the constant function $\mathds{1} : \omega \in \Omega \mapsto 1 \in \mathbb{R}$ and $a_\perp$ is the orthogonal projection of $a$ on $\mathds{1}^\perp$. Equation (58) can be rewritten as

\begin{align*}
\partial_{t_1} \langle a^{(0)}, \mathds{1} \rangle_{L^2(\rho)} &= \sigma_0 \left(\varphi_0 - \sigma_0 \langle a^{(0)}, \mathds{1} \rangle_{L^2(\rho)}\right)\,,\\
\partial_{t_1} a_\perp^{(0)} &= 0\,,
\end{align*}

which gives after integration (using (47)):

\begin{align}
\langle a^{(0)}, \mathds{1} \rangle_{L^2(\rho)} &= e^{-\sigma_0^2 t_1} \langle a_{\rm init}, \mathds{1} \rangle_{L^2(\rho)} + \left(1 - e^{-\sigma_0^2 t_1}\right) \frac{\varphi_0}{\sigma_0}\,, \tag{59}\\
a_\perp^{(0)} &= a_{\perp,{\rm init}}\,. \notag
\end{align}
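The closed form (59) can be checked numerically. The following Python sketch integrates the scalar linear ODE satisfied by $m(t_1) = \langle a^{(0)}, \mathds{1}\rangle_{L^2(\rho)}$ with a classical Runge-Kutta scheme and compares the result with (59). The function names and the numerical values chosen for $\sigma_0$, $\varphi_0$ and the initial mean are illustrative assumptions, not quantities taken from the paper.

```python
import math


def mean_a_closed_form(t1, m_init, sigma0, phi0):
    """Closed form (59) for m(t1) = <a^(0), 1>_{L^2(rho)}; requires sigma0 != 0."""
    decay = math.exp(-sigma0 ** 2 * t1)
    return decay * m_init + (1.0 - decay) * phi0 / sigma0


def mean_a_rk4(t1, m_init, sigma0, phi0, n_steps=2000):
    """Integrate dm/dt1 = sigma0 * (phi0 - sigma0 * m) from 0 to t1 with RK4."""
    def rhs(m):
        return sigma0 * (phi0 - sigma0 * m)

    h = t1 / n_steps
    m = m_init
    for _ in range(n_steps):
        k1 = rhs(m)
        k2 = rhs(m + 0.5 * h * k1)
        k3 = rhs(m + 0.5 * h * k2)
        k4 = rhs(m + h * k3)
        m += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return m


if __name__ == "__main__":
    # Hypothetical values, chosen only for illustration.
    sigma0, phi0, m_init = 0.8, 0.5, -0.3
    for t1 in (0.5, 1.0, 3.0, 10.0):
        exact = mean_a_closed_form(t1, m_init, sigma0, phi0)
        numeric = mean_a_rk4(t1, m_init, sigma0, phi0)
        print(t1, exact, numeric)
```

For large $t_1$ both values approach $\varphi_0/\sigma_0$, which is the sense in which the constant component is learned on this time scale.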

At this point, we have determined $a^{(0)}(\omega)$ and $s^{(0)}(\omega)$, and thus $a(\omega) = a^{(0)}(\omega) + O(\varepsilon)$ and $s(\omega) = s^{(0)}(\omega) + O(\varepsilon)$, that is, the solution up to $O(\varepsilon)$ precision. This is sufficient to obtain an $o(1)$-approximation of the risk $\mathscr{R}_{\mathrm{mf},*}$ (see Section C.2). However, note that we could obtain more precise estimates by identifying higher-order terms in (49)-(55). For instance, identifying the $O(\varepsilon)$ terms in (52)-(55), we obtain $\partial_{t_1} s^{(1)}(\omega) = a^{(0)}(\omega) \sigma_1 \varphi_1$. This shows that the $s(\omega)$ become non-zero, though only of order $\varepsilon$ on the time scale $t_1 \asymp 1$; the inner-layer weights develop an infinitesimal correlation with the true direction $u_*$ thanks to the linear component of $\sigma$ and $\varphi$.
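As a worked illustration of this last identification, the equation for $s^{(1)}$ can be integrated explicitly using (48) and (59). The computation below is a sketch under two assumptions that are not restated in this section: $\rho$ is a probability measure, so that $a^{(0)} = a^{(0)}_\perp + \langle a^{(0)}, \mathds{1}\rangle_{L^2(\rho)} \mathds{1}$, and $\sigma_0 \neq 0$, which is already implicit in (59). The shorthand $m_{\rm init}$ is introduced here only for readability.

```latex
\begin{align*}
m_{\rm init} &:= \langle a_{\rm init}, \mathds{1} \rangle_{L^2(\rho)}\,, \\
a^{(0)}(\omega, t_1) &= a_{\perp,{\rm init}}(\omega) + \frac{\varphi_0}{\sigma_0}
  + \Big(m_{\rm init} - \frac{\varphi_0}{\sigma_0}\Big) e^{-\sigma_0^2 t_1}\,, \\
s^{(1)}(\omega, t_1) &= \sigma_1 \varphi_1 \int_0^{t_1} a^{(0)}(\omega, u)\, \mathrm{d}u \\
&= \sigma_1 \varphi_1 \left[ \Big(a_{\perp,{\rm init}}(\omega) + \frac{\varphi_0}{\sigma_0}\Big) t_1
  + \Big(m_{\rm init} - \frac{\varphi_0}{\sigma_0}\Big) \frac{1 - e^{-\sigma_0^2 t_1}}{\sigma_0^2} \right]\,.
\end{align*}
```

Under these assumptions, $s^{(1)}$ stays bounded for $t_1 = O(1)$, which is consistent with $s(\omega) = \Theta(\varepsilon)$ at the end of the first time scale.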

The approximation constructed above should be considered as valid on the time scale $t_1 \asymp 1 \Leftrightarrow t \asymp \varepsilon$. As $\varepsilon \to 0$, we obtain the following approximation of the risk (see Eq. (37) for definition, and Appendix C.2 for a detailed derivation):

\begin{equation*}
\mathscr{R}_{\mathrm{mf},*} = \frac{1}{2} e^{-2\sigma_0^2 t_1} \left(\varphi_0 - \sigma_0 \left\langle a_{\rm init}, \mathds{1} \right\rangle_{L^2(\rho)}\right)^{2} + \frac{1}{2} \sum_{k \geqslant 1} \varphi_k^2 + O(\varepsilon)\,.
\end{equation*}

This approximation breaks down when we reach a new time scale, at which the $s(\omega)$ are large enough for the $a(\omega)$ to be affected (at leading order) by the linear part of the functions. We detail the new time scale and its resolution in the next section.
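The first-time-scale picture can be illustrated on a small example. The Python sketch below integrates, with explicit Euler steps in $t_1$, a finite-neuron and finitely truncated version of the system read off from (49)-(55), and compares the resulting risk with the leading-order approximation above. Several ingredients are assumptions made for illustration only: $\rho$ is replaced by a uniform measure on a few neurons, the coefficients $\sigma_k, \varphi_k$ are set to zero beyond a small degree, their numerical values are arbitrary, and the risk is taken to be $\frac{1}{2}\sum_k \big(\varphi_k - \sigma_k \int a\, s^k \mathrm{d}\rho\big)^2$, which is consistent with the formulas of this section but is not quoted from Eq. (37).

```python
import math
import random


def residuals(sigma, phi, a, s):
    """r_k = phi_k - sigma_k * mean_j(a_j * s_j**k) for each retained degree k."""
    n = len(a)
    return [phi[k] - sigma[k] * sum(aj * sj ** k for aj, sj in zip(a, s)) / n
            for k in range(len(sigma))]


def simulate_first_time_scale(sigma, phi, a_init, eps, t1_end, n_steps):
    """Explicit Euler in t1 = t/eps, starting from s = 0 as in (48)."""
    a = list(a_init)
    s = [0.0] * len(a_init)
    h = t1_end / n_steps
    degrees = range(len(sigma))
    for _ in range(n_steps):
        r = residuals(sigma, phi, a, s)
        # Right-hand side of the a-equation, cf. (50)-(51).
        da = [sum(sigma[k] * si ** k * r[k] for k in degrees) for si in s]
        # Right-hand side of the s-equation, cf. (53)-(55); note the eps prefactor.
        ds = [eps * ai * (1.0 - si ** 2)
              * sum(k * sigma[k] * si ** (k - 1) * r[k] for k in degrees if k >= 1)
              for ai, si in zip(a, s)]
        a = [ai + h * d for ai, d in zip(a, da)]
        s = [si + h * d for si, d in zip(s, ds)]
    return a, s


def risk(sigma, phi, a, s):
    """Assumed form of the (truncated) risk: half the sum of squared residuals."""
    return 0.5 * sum(rk ** 2 for rk in residuals(sigma, phi, a, s))


def risk_leading_order(sigma, phi, a_init, t1):
    """Leading-order risk on the first time scale (transient plus plateau)."""
    m_init = sum(a_init) / len(a_init)
    transient = 0.5 * math.exp(-2.0 * sigma[0] ** 2 * t1) * (phi[0] - sigma[0] * m_init) ** 2
    plateau = 0.5 * sum(p ** 2 for p in phi[1:])
    return transient + plateau


if __name__ == "__main__":
    random.seed(0)
    sigma = [0.8, 0.6, 0.4, 0.2]   # hypothetical coefficients of the activation
    phi = [0.5, 0.7, 0.3, 0.1]     # hypothetical coefficients of the target
    a_init = [random.uniform(-1.0, 1.0) for _ in range(20)]
    a, s = simulate_first_time_scale(sigma, phi, a_init, 1e-4, 6.0, 3000)
    print("simulated risk     :", risk(sigma, phi, a, s))
    print("leading-order risk :", risk_leading_order(sigma, phi, a_init, 6.0))
    print("max |s|            :", max(abs(x) for x in s))
```

In this toy setting the simulated risk settles near the plateau $\frac{1}{2}\sum_{k \geqslant 1} \varphi_k^2$ while the $s$ values remain of order $\varepsilon$, which is the behavior described for the first time scale.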

6.3 Second time scale: linear component I

In this section, we seek a second, slower time scale, for which the behavior of the asymptotic expansion is different.

Identification of the scale.

Consider $t_2 = \frac{t}{\varepsilon^{\gamma}}$, where $\gamma < 1$ is to be determined. We rewrite the system (42) using $t_2$, and expand the solutions $a(\omega)$ and $s(\omega)$:

\begin{align}
a(\omega) &= a^{(0)}(\omega) + \varepsilon^{\delta} a^{(1)}(\omega) + \varepsilon^{2\delta} a^{(2)}(\omega) + \dots\,, \tag{60}\\
s(\omega) &= \varepsilon^{\delta} s^{(1)}(\omega) + \varepsilon^{2\delta} s^{(2)}(\omega) + \dots\,, \tag{61}
\end{align}

where the exponent $\delta$ is also to be determined. (Since within the previous time scale we obtained $s(\omega) = O(\varepsilon)$, it is natural to assume $s^{(0)}(\omega) = 0$.)

Let us pause to comment on our method.

Similarly to what has been done in the previous time scale, we will substitute the expansions (60)-(61) in the equations (42) in order to compute the different terms in the expansion. However, this step also allows us to compute the exponents $\gamma$ and $\delta$, which give respectively the new time scale and the size of the $s(\omega)$'s.

Note that we should have proceeded similarly for the first time scale, by introducing a first time variable $t_1 = \frac{t}{\varepsilon^{\gamma'}}$, expanding $a(\omega), s(\omega)$ in powers $1, \varepsilon^{\delta'}, \varepsilon^{2\delta'}, \dots$, and determining $\gamma'$ and $\delta'$ a posteriori. This would have led, indeed, to $\gamma' = 1$ and $\delta' = 1$. However, for simplicity, we preferred to fix these values that are natural a priori.

Finally, note that the expansions (45)-(46) and (60)-(61) are different, because they are valid on different time scales. In fact, the only coherence condition that we require below is that the expansions match in a joint asymptotic regime where $t_1 = \frac{t}{\varepsilon} \to \infty$ and $t_2 = \frac{t}{\varepsilon^{\gamma}} \to 0$. We thus build different approximations for each one of the time scales, with some matching conditions; this justifies the name of matched asymptotic expansion.

We now return to our computations and substitute (60)-(61) in (42):

\begin{align*}
\varepsilon^{1-\gamma} \partial_{t_2} a^{(0)}(\omega) + \dots &= \sum_{k=0}^{\infty} \sigma_k \left(\varepsilon^{\delta} s^{(1)}(\omega) + \dots\right)^{k}\\
&\qquad\qquad \times \left(\varphi_k - \sigma_k \int \left(a^{(0)}(\nu) + \dots\right)\left(\varepsilon^{\delta} s^{(1)}(\nu) + \dots\right)^{k} \mathrm{d}\rho(\nu)\right)\,,\\
\varepsilon^{\delta} \partial_{t_2} s^{(1)}(\omega) + \dots &= \varepsilon^{\gamma} \left(a^{(0)}(\omega) + \dots\right)\left(1 - \left(\varepsilon^{\delta} s^{(1)}(\omega) + \dots\right)^{2}\right) \sum_{k=1}^{\infty} k \sigma_k \left(\varepsilon^{\delta} s^{(1)}(\omega) + \dots\right)^{k-1}\\
&\qquad\qquad \times \left(\varphi_k - \sigma_k \int \left(a^{(0)}(\nu) + \dots\right)\left(\varepsilon^{\delta} s^{(1)}(\nu) + \dots\right)^{k} \mathrm{d}\rho(\nu)\right)\,,
\end{align*}

and thus

\begin{align}
\varepsilon^{1-\gamma} \partial_{t_2} a^{(0)}(\omega) + O(\varepsilon^{1-\gamma+\delta}) &= \sigma_0 \left(\varphi_0 - \sigma_0 \int a^{(0)}(\nu)\, \mathrm{d}\rho(\nu)\right) \tag{62}\\
&\qquad - \varepsilon^{\delta} \sigma_0^2 \int a^{(1)}(\nu)\, \mathrm{d}\rho(\nu) + \varepsilon^{\delta} \sigma_1 \varphi_1 s^{(1)}(\omega) + O(\varepsilon^{2\delta})\,, \tag{63}\\
\varepsilon^{\delta} \partial_{t_2} s^{(1)}(\omega) + O(\varepsilon^{2\delta}) &= \varepsilon^{\gamma} \sigma_1 \varphi_1 a^{(0)}(\omega) + O(\varepsilon^{\gamma+\delta})\,. \tag{64}
\end{align}

For the first time scale, we chose $\gamma=\delta=1$, so that the terms of order $\varepsilon^{\delta}$ were negligible compared to $\varepsilon^{1-\gamma}\partial_{t_{2}}a^{(0)}(\omega)$ in (62). This means that the linear components $\sigma_{1},\varphi_{1}$ of the functions had no effect on $a(\omega)$ at leading order. We are now interested in a new time scale where $\varepsilon^{1-\gamma}\partial_{t_{2}}a^{(0)}(\omega)$ and $\varepsilon^{\delta}\sigma_{1}\varphi_{1}s^{(1)}(\omega)$ are of the same order, i.e., $1-\gamma=\delta$; then the linear components play a role in the dynamics.

Further, for $s^{(1)}(\omega)$ to be non-zero, we need both sides of (64) to be of the same order, thus $\delta=\gamma$. Putting these two conditions together gives $\gamma=\delta=1/2$.
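The two balance conditions above form a small linear system in the exponents. The following minimal sketch solves it numerically; the matrix `A` and right-hand side `b` are merely an illustrative encoding of $1-\gamma=\delta$ and $\delta=\gamma$, not notation used elsewhere in the paper.

```python
import numpy as np

# Unknowns: (gamma, delta). The balance conditions read
#   1 - gamma = delta   (a-equation (62): time derivative vs. linear term)
#   delta     = gamma   (s-equation (64): both sides of the same order)
A = np.array([[1.0, 1.0],     #  gamma + delta = 1
              [-1.0, 1.0]])   # -gamma + delta = 0
b = np.array([1.0, 0.0])

gamma, delta = np.linalg.solve(A, b)
print(gamma, delta)
```

The same dominant-balance bookkeeping can be repeated for other candidate scalings by changing the rows of `A`.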

Derivation of the ODEs for this time scale.

Let us summarize the equations. For $t_{2}=\frac{t}{\varepsilon^{1/2}}$ and

\begin{align*}
a(\omega) &= a^{(0)}(\omega)+\varepsilon^{1/2}a^{(1)}(\omega)+\dots\,,\\
s(\omega) &= \varepsilon^{1/2}s^{(1)}(\omega)+\dots\,,
\end{align*}

we have from (62)-(64):

\begin{align}
\varepsilon^{1/2}\partial_{t_{2}}a^{(0)}(\omega) &= \sigma_{0}\left(\varphi_{0}-\sigma_{0}\int a^{(0)}(\nu)\,\mathrm{d}\rho(\nu)\right) \tag{65}\\
&\qquad -\varepsilon^{1/2}\sigma_{0}^{2}\int a^{(1)}(\nu)\,\mathrm{d}\rho(\nu)+\varepsilon^{1/2}\sigma_{1}\varphi_{1}s^{(1)}(\omega)+O(\varepsilon)\,, \tag{66}\\
\varepsilon^{1/2}\partial_{t_{2}}s^{(1)}(\omega) &= \varepsilon^{1/2}\sigma_{1}\varphi_{1}a^{(0)}(\omega)+O(\varepsilon)\,. \tag{67}
\end{align}

First, we identify the terms of order $1=\varepsilon^{0}$:

\begin{align}
0=\sigma_{0}\left(\varphi_{0}-\sigma_{0}\int a^{(0)}(\nu)\,\mathrm{d}\rho(\nu)\right)\,. \tag{68}
\end{align}

This means that the trajectory remains in the affine hyperplane defined by $\varphi_{0}=\sigma_{0}\int a^{(0)}(\nu)\,\mathrm{d}\rho(\nu)$. Intuitively, the constant component of $\varphi$ remains fitted by the neural network in this second time scale.

Second, we identify the terms of order $\varepsilon^{1/2}$ in (65)-(67):

\begin{align}
\partial_{t_{2}}a^{(0)}(\omega) &= -\sigma_{0}^{2}\int a^{(1)}(\nu)\,\mathrm{d}\rho(\nu)+\sigma_{1}\varphi_{1}s^{(1)}(\omega)\,, \tag{69}\\
\partial_{t_{2}}s^{(1)}(\omega) &= \sigma_{1}\varphi_{1}a^{(0)}(\omega)\,. \tag{70}
\end{align}

Note that, in Eqs. (69)–(70), the time derivative of $a^{(1)}$ does not appear, and therefore the evolution of $a^{(1)}$ is not determined by these equations. In fact, $a^{(1)}$ is best interpreted as the Lagrange multiplier associated with the constraint (68). Namely, it is a free term that can be adjusted so that the solution of the system (69)–(70) satisfies the constraint (68). We can check that the unknown term in (69) leaves exactly the degree of freedom needed for this to hold: we have

$$0=\partial_{t_{2}}\left(\frac{\varphi_{0}}{\sigma_{0}}\right)\underset{(68)}{=}\partial_{t_{2}}\left(\int a^{(0)}(\omega)\,\mathrm{d}\rho(\omega)\right)\underset{(69)}{=}-\sigma_{0}^{2}\int a^{(1)}(\nu)\,\mathrm{d}\rho(\nu)+\sigma_{1}\varphi_{1}\int s^{(1)}(\omega)\,\mathrm{d}\rho(\omega)\,.$$

In this last expression, the first unknown term can always compensate the second term so that the constraint is satisfied. The entire evolution of $a^{(1)}$ is determined by higher orders in the expansion.
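As an illustration of this compensation, the sketch below replaces $\rho$ by a uniform measure on `n` atoms (a discretization assumed here purely for illustration) and checks that, once the free scalar $\sigma_{0}^{2}\int a^{(1)}\mathrm{d}\rho$ is set to $\sigma_{1}\varphi_{1}\int s^{(1)}\mathrm{d}\rho$, the right-hand side of (69) has zero mean. All numerical values and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000                                  # atoms of the discretized rho (assumption)
sigma0, sigma1, phi1 = 1.3, 0.7, -0.9     # illustrative values, not from the paper
c = sigma1 * phi1

s1 = rng.normal(size=n)                   # arbitrary profile s^{(1)}(omega_i)

# Free scalar sigma_0^2 * int a^{(1)} d rho, chosen so the constraint (68) is preserved.
multiplier = c * s1.mean()
mean_a1 = multiplier / sigma0**2          # implied value of int a^{(1)} d rho

rhs_a0 = -multiplier + c * s1             # right-hand side of (69) at each atom
mean_drift = rhs_a0.mean()                # time derivative of int a^{(0)} d rho
print(mean_drift)
```

With this choice the right-hand side of (69) reduces to $\sigma_{1}\varphi_{1}$ times the centered profile of $s^{(1)}$, which is the content of the projected equation derived next.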

To eliminate this Lagrange multiplier, we use again the compact notations:

\begin{align}
\partial_{t_{2}}a^{(0)} &= -\sigma_{0}^{2}\langle a^{(1)},\mathds{1}\rangle_{L^{2}(\rho)}\mathds{1}+\sigma_{1}\varphi_{1}s^{(1)}\,, \tag{71}\\
\partial_{t_{2}}s^{(1)} &= \sigma_{1}\varphi_{1}a^{(0)}\,, \tag{72}
\end{align}

and thus

\begin{align}
\partial_{t_{2}}a^{(0)}_{\perp} &= \sigma_{1}\varphi_{1}s^{(1)}_{\perp}\,, \tag{73}\\
\partial_{t_{2}}s^{(1)}_{\perp} &= \sigma_{1}\varphi_{1}a^{(0)}_{\perp}\,. \tag{74}
\end{align}

Matching.

The initialization of the ODEs (71)-(72) for the second time scale is determined by a classical procedure that matches with the previous time scale. In this paragraph, we denote by $\underline{a},\underline{s}$ the approximation obtained in the first time scale (Section 6.2), and by $\overline{a},\overline{s}$ the approximation in the second time scale, described above.

Consider an intermediate time scale $\widetilde{t}=\frac{t}{\varepsilon^{\alpha}}$, $1/2<\alpha<1$, and assume $\widetilde{t}\asymp 1$ so that

$$t_{1}=\frac{t}{\varepsilon}=\frac{\widetilde{t}}{\varepsilon^{1-\alpha}}\to\infty\,,\qquad t_{2}=\frac{t}{\varepsilon^{1/2}}=\varepsilon^{\alpha-1/2}\,\widetilde{t}\to 0\,.$$

In this intermediate regime, we want the approximations provided on the first and the second time scales to match: $\underline{a}(\widetilde{t})$ and $\overline{a}(\widetilde{t})$ (resp. $\underline{s}(\widetilde{t})$ and $\overline{s}(\widetilde{t})$) should agree to leading order.

From the first time scale approximation,

\begin{align}
\underline{a} &= \underline{a}^{(0)}+O(\varepsilon) \tag{75}\\
&= \langle\underline{a}^{(0)},\mathds{1}\rangle_{L^{2}(\rho)}\mathds{1}+\underline{a}^{(0)}_{\perp}+O(\varepsilon) \tag{76}\\
&= \left[e^{-\sigma_{0}^{2}t_{1}}\langle a_{\mathrm{init}},\mathds{1}\rangle_{L^{2}(\rho)}+\left(1-e^{-\sigma_{0}^{2}t_{1}}\right)\frac{\varphi_{0}}{\sigma_{0}}\right]\mathds{1}+a_{\perp,\mathrm{init}}+O(\varepsilon) \tag{77}\\
&= \left[e^{-\sigma_{0}^{2}\widetilde{t}/\varepsilon^{1-\alpha}}\langle a_{\mathrm{init}},\mathds{1}\rangle_{L^{2}(\rho)}+\left(1-e^{-\sigma_{0}^{2}\widetilde{t}/\varepsilon^{1-\alpha}}\right)\frac{\varphi_{0}}{\sigma_{0}}\right]\mathds{1}+a_{\perp,\mathrm{init}}+O(\varepsilon) \tag{78}\\
&= \frac{\varphi_{0}}{\sigma_{0}}\mathds{1}+a_{\perp,\mathrm{init}}+o(1)\,. \tag{79}
\end{align}

From the second time scale approximation,

\begin{align}
\overline{a} &= \overline{a}^{(0)}(t_{2})+O(\varepsilon^{1/2})=\overline{a}^{(0)}(\varepsilon^{\alpha-1/2}\,\widetilde{t})+O(\varepsilon^{1/2}) \tag{80}\\
&= \overline{a}^{(0)}(0)+o(1)\,. \tag{81}
\end{align}

By matching, Equations (79) and (81) must agree. Thus the ODE for the second time scale should be initialized from $\overline{a}^{(0)}(0)=\frac{\varphi_{0}}{\sigma_{0}}\mathds{1}+a_{\perp,\mathrm{init}}$.

Similarly, the matching procedure gives that the ODE for the second time scale should be initialized from $\overline{s}^{(1)}(0)=0$.
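To make the limit from (78) to (79) concrete, the following minimal sketch evaluates the coefficient of $\mathds{1}$ in (78) at fixed $\widetilde{t}$ for decreasing $\varepsilon$. The function name and all numerical values are hypothetical and chosen only for illustration; any $1/2<\alpha<1$ could be used.

```python
import numpy as np

def mean_coefficient_first_scale(t_tilde, eps, alpha, sigma0, phi0, a_init_mean):
    """Coefficient of the constant function in (78), at t_1 = t_tilde / eps**(1 - alpha)."""
    t1 = t_tilde / eps ** (1.0 - alpha)
    decay = np.exp(-sigma0**2 * t1)
    return decay * a_init_mean + (1.0 - decay) * phi0 / sigma0

sigma0, phi0, a_init_mean = 1.0, 0.5, 2.0   # illustrative values, not from the paper
alpha, t_tilde = 0.75, 1.0                  # intermediate scale, 1/2 < alpha < 1

for eps in [1e-1, 1e-2, 1e-4, 1e-6]:
    value = mean_coefficient_first_scale(t_tilde, eps, alpha, sigma0, phi0, a_init_mean)
    # The gap to phi0 / sigma0 shrinks as eps decreases, as in the step (78) -> (79).
    print(eps, value, abs(value - phi0 / sigma0))
```

The gap decays like $e^{-\sigma_{0}^{2}\widetilde{t}/\varepsilon^{1-\alpha}}$, which is why the memory of $\langle a_{\mathrm{init}},\mathds{1}\rangle_{L^{2}(\rho)}$ is lost before the second time scale starts.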

Solution.

As we are done with the matching procedure, we now consider the solution in the second time scale only, which we denote again by $a$, $s$ as in (71), (72). The matching procedure motivates us to consider the solution of (73)-(74) initialized at $a_{\perp}^{(0)}(0)=a_{\perp,\mathrm{init}}$, $s_{\perp}^{(1)}(0)=0$. This gives

$$a_{\perp}^{(0)}=\cosh\left(\varphi_{1}\sigma_{1}t_{2}\right)a_{\perp,\mathrm{init}}\,,\qquad s_{\perp}^{(1)}=\sinh\left(\varphi_{1}\sigma_{1}t_{2}\right)a_{\perp,\mathrm{init}}\,.$$

To conclude, we note that $\langle a^{(0)},\mathds{1}\rangle_{L^{2}(\rho)}=\frac{\varphi_{0}}{\sigma_{0}}$ is constrained by (68). Further, from (70),

$$\partial_{t_{2}}\langle s^{(1)},\mathds{1}\rangle_{L^{2}(\rho)}=\sigma_{1}\varphi_{1}\langle a^{(0)},\mathds{1}\rangle_{L^{2}(\rho)}=\sigma_{1}\varphi_{1}\frac{\varphi_{0}}{\sigma_{0}}\,,$$

thus, since $s^{(1)}$ is initialized at zero, $\langle s^{(1)},\mathds{1}\rangle_{L^{2}(\rho)}=\sigma_{1}\varphi_{1}\frac{\varphi_{0}}{\sigma_{0}}t_{2}$.

Putting these equations together gives:

\begin{align}
a^{(0)}=\frac{\varphi_{0}}{\sigma_{0}}\mathds{1}+\cosh\left(\varphi_{1}\sigma_{1}t_{2}\right)a_{\perp,\mathrm{init}}\,,\qquad s^{(1)}=\sigma_{1}\varphi_{1}\frac{\varphi_{0}}{\sigma_{0}}t_{2}\mathds{1}+\sinh\left(\varphi_{1}\sigma_{1}t_{2}\right)a_{\perp,\mathrm{init}}\,. \tag{82}
\end{align}
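The closed form (82) can be sanity-checked numerically. The sketch below is only an illustration under stated assumptions: $\rho$ is replaced by a uniform measure on `n` atoms, the Lagrange multiplier is eliminated by centering $s^{(1)}$ in the equation for $a^{(0)}$, and a hand-written fourth-order Runge-Kutta scheme integrates the resulting system. All parameter values and helper names (`rhs`, `rk4`) are hypothetical.

```python
import numpy as np

def rhs(a, s, c):
    """Leading-order dynamics: (69) with the multiplier eliminated, and (70)."""
    return c * (s - s.mean()), c * a

def rk4(a, s, c, t_end, n_steps):
    """Classical fourth-order Runge-Kutta integration from time 0 to t_end."""
    h = t_end / n_steps
    for _ in range(n_steps):
        k1a, k1s = rhs(a, s, c)
        k2a, k2s = rhs(a + 0.5 * h * k1a, s + 0.5 * h * k1s, c)
        k3a, k3s = rhs(a + 0.5 * h * k2a, s + 0.5 * h * k2s, c)
        k4a, k4s = rhs(a + h * k3a, s + h * k3s, c)
        a = a + h / 6.0 * (k1a + 2 * k2a + 2 * k3a + k4a)
        s = s + h / 6.0 * (k1s + 2 * k2s + 2 * k3s + k4s)
    return a, s

rng = np.random.default_rng(0)
n = 50                                              # atoms of the discretized rho (assumption)
sigma0, phi0, sigma1, phi1 = 1.0, 0.5, 0.8, 1.2     # illustrative values
c = sigma1 * phi1

a_perp_init = rng.normal(size=n)
a_perp_init -= a_perp_init.mean()                   # orthogonal to the constant function
a0 = phi0 / sigma0 + a_perp_init                    # initialization obtained by matching
s0 = np.zeros(n)

t2 = 2.0
a_num, s_num = rk4(a0, s0, c, t2, 2000)

# Closed form (82)
a_closed = phi0 / sigma0 + np.cosh(c * t2) * a_perp_init
s_closed = c * phi0 / sigma0 * t2 + np.sinh(c * t2) * a_perp_init

err_a = np.max(np.abs(a_num - a_closed))
err_s = np.max(np.abs(s_num - s_closed))
print(err_a, err_s)
```

Both errors should be at the level of the integrator's accuracy, which supports the formula at leading order only; the sketch says nothing about the higher-order terms of the expansion.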

We observe that $a^{(0)}$ and $s^{(1)}$ diverge as $t_{2}\to\infty$. This implies that our approximation on the second time scale must break down at a certain point. Indeed, we analyzed this time scale under the assumption that both $a^{(0)}$ and $s^{(1)}$ are of order $1$. However, since $a^{(0)}$ and $s^{(1)}$ diverge exponentially as $t_{2}\to\infty$, as per Eq. (82), this assumption breaks down when $t_{2}\asymp\log(1/\varepsilon)$.

More precisely, in (65) (resp. (67)), the $O(\varepsilon)$ term includes a term of the form

$$-\varepsilon s^{(1)}(\omega)\sigma_{1}^{2}\int a^{(0)}(\nu)s^{(1)}(\nu)\,\mathrm{d}\rho(\nu)\qquad\left(\text{resp. }-\varepsilon a^{(0)}(\omega)\sigma_{1}^{2}\int a^{(0)}(\nu)s^{(1)}(\nu)\,\mathrm{d}\rho(\nu)\right)\,.$$

When $a^{(0)}$ and $s^{(1)}$ become of order $\varepsilon^{-\nicefrac{1}{4}}$, this term becomes of order $\varepsilon^{\nicefrac{1}{4}}$, which is the same order as the term $\varepsilon^{\nicefrac{1}{2}}\sigma_1\varphi_1 s^{(1)}(\omega)$ in (65) (resp. the term $\varepsilon^{\nicefrac{1}{2}}\sigma_1\varphi_1 a^{(0)}(\omega)$ in (67)). At this point, these terms can no longer be neglected. From (82), we have

$$
a^{(0)} \sim \frac{e^{|\varphi_1\sigma_1| t_2}}{2}\, a_{\perp,\mathrm{init}}\,, \qquad
s^{(1)} \sim \operatorname{sign}(\varphi_1\sigma_1)\,\frac{e^{|\varphi_1\sigma_1| t_2}}{2}\, a_{\perp,\mathrm{init}}\,, \qquad
t_2 \to \infty\,.
$$

Therefore, $a^{(0)}$ and $s^{(1)}$ become of order $\varepsilon^{-\nicefrac{1}{4}}$ at the time $t_2 \sim \frac{1}{4|\sigma_1\varphi_1|}\log\frac{1}{\varepsilon}$, at which the approximation on the second time scale breaks down. We thus introduce a new time scale centered at this critical point.
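The scale balance above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original analysis: the function names and the numerical values of $\varepsilon$, $\sigma_1$, $\varphi_1$ are illustrative assumptions. It computes the breakdown time $\frac{1}{4|\sigma_1\varphi_1|}\log\frac{1}{\varepsilon}$ and confirms that, once the amplitudes reach $\varepsilon^{-\nicefrac{1}{4}}$, the previously neglected cubic term and the retained linear term are both of order $\varepsilon^{\nicefrac{1}{4}}$ (constants and the factor $\nicefrac{1}{2}$ are ignored).

```python
import math


def breakdown_time(eps, sigma1, phi1):
    """Time t_2 at which exp(|phi1*sigma1| * t_2) reaches eps**(-1/4)."""
    rate = abs(sigma1 * phi1)
    return math.log(1.0 / eps) / (4.0 * rate)


def term_orders(eps):
    """Orders of magnitude when a^(0) and s^(1) are both ~ eps**(-1/4).

    Returns (neglected, retained), where
      neglected ~ eps * s^(1) * <a^(0), s^(1)>   (cubic in the amplitude)
      retained  ~ eps**(1/2) * s^(1)             (linear in the amplitude)
    Constants such as sigma_1 and phi_1 are dropped.
    """
    amp = eps ** (-0.25)
    neglected = eps * amp ** 3
    retained = math.sqrt(eps) * amp
    return neglected, retained


if __name__ == "__main__":
    # Illustrative values only (not taken from the paper).
    eps, sigma1, phi1 = 1e-8, 0.7, -1.3
    t_star = breakdown_time(eps, sigma1, phi1)
    print("breakdown time t_2:", t_star)
    print("amplitude at t_2:", math.exp(abs(sigma1 * phi1) * t_star))
    print("eps**(-1/4):", eps ** (-0.25))
    print("neglected vs retained:", term_orders(eps))
```

Both entries returned by `term_orders` coincide with $\varepsilon^{\nicefrac{1}{4}}$, which is why the expansion of the second time scale has to be replaced at this point.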

6.4 Third time scale: linear component II

We now introduce the time $t_3 = t_2 - \frac{1}{4|\varphi_1\sigma_1|}\log\frac{1}{\varepsilon}$. As $t_3$ is only a translation of $t_2$, the ODEs in terms of $t_3$ are the same as the ones in terms of $t_2$. However, in this time scale, $a$ and $\varepsilon^{\nicefrac{1}{2}} s$ have diverged. Consistent with the discussion above, we seek expansions of the form

$$
a = \varepsilon^{-\nicefrac{1}{4}} a^{(-1)} + a^{(0)} + \varepsilon^{\nicefrac{1}{4}} a^{(1)} + \dots\,, \tag{83}
$$
$$
s = \varepsilon^{\nicefrac{1}{4}} s^{(1)} + \varepsilon^{\nicefrac{1}{2}} s^{(2)} + \dots\,. \tag{84}
$$

Similarly to the second time scale, we substitute (83)-(84) in (42) and obtain

$$
\begin{aligned}
\varepsilon^{\nicefrac{1}{4}} \partial_{t_3} a^{(-1)}(\omega)
&= -\varepsilon^{-\nicefrac{1}{4}} \sigma_0^2 \int a^{(-1)}(\nu)\,\mathrm{d}\rho(\nu)
 + \sigma_0\left(\varphi_0 - \sigma_0 \int a^{(0)}(\nu)\,\mathrm{d}\rho(\nu)\right) \\
&\quad - \varepsilon^{\nicefrac{1}{4}} \sigma_0^2 \int a^{(1)}(\nu)\,\mathrm{d}\rho(\nu)
 + \varepsilon^{\nicefrac{1}{4}} \sigma_1\left(\varphi_1 - \sigma_1 \int a^{(-1)}(\nu) s^{(1)}(\nu)\,\mathrm{d}\rho(\nu)\right) s^{(1)}(\omega)
 + O(\varepsilon^{\nicefrac{1}{2}})\,, \\
\varepsilon^{\nicefrac{1}{4}} \partial_{t_3} s^{(1)}(\omega)
&= \varepsilon^{\nicefrac{1}{4}} \sigma_1\left(\varphi_1 - \sigma_1 \int a^{(-1)}(\nu) s^{(1)}(\nu)\,\mathrm{d}\rho(\nu)\right) a^{(-1)}(\omega)
 + O(\varepsilon^{\nicefrac{1}{2}})\,.
\end{aligned}
$$

First, we identify the terms of order $\varepsilon^{-\nicefrac{1}{4}}$:

$$
0 = -\sigma_0^2 \int a^{(-1)}(\nu)\,\mathrm{d}\rho(\nu) = -\sigma_0^2 \left\langle a^{(-1)}, \mathds{1} \right\rangle_{L^2(\rho)}\,. \tag{85}
$$

This means that $a$ has no component diverging in $\varepsilon$ in the direction of $\mathds{1}$.

Second, we identify the terms of order $1 = \varepsilon^{0}$:

$$
0 = \sigma_0\left(\varphi_0 - \sigma_0 \int a^{(0)}(\nu)\,\mathrm{d}\rho(\nu)\right) = \sigma_0\left(\varphi_0 - \sigma_0 \left\langle a^{(0)}, \mathds{1} \right\rangle_{L^2(\rho)}\right)\,. \tag{86}
$$

Put together with (85), this equation ensures that the constant component of $\varphi$ remains learned on this third time scale.

Third, we identify the terms of order $\varepsilon^{\nicefrac{1}{4}}$:

$$
\begin{aligned}
\partial_{t_3} a^{(-1)}(\omega) &= -\sigma_0^2 \int a^{(1)}(\nu)\,\mathrm{d}\rho(\nu) + \sigma_1\left(\varphi_1 - \sigma_1 \int a^{(-1)}(\nu) s^{(1)}(\nu)\,\mathrm{d}\rho(\nu)\right) s^{(1)}(\omega)\,, \\
\partial_{t_3} s^{(1)}(\omega) &= \sigma_1\left(\varphi_1 - \sigma_1 \int a^{(-1)}(\nu) s^{(1)}(\nu)\,\mathrm{d}\rho(\nu)\right) a^{(-1)}(\omega)\,.
\end{aligned} \tag{87}
$$

Again, the term $-\sigma_0^2 \int a^{(1)}(\nu)\,\mathrm{d}\rho(\nu)$ is best interpreted as the Lagrange multiplier associated with the constraints (85), (86). Using the compact notation,

$$
\begin{aligned}
\int a^{(-1)}(\nu) s^{(1)}(\nu)\,\mathrm{d}\rho(\nu)
&= \left\langle a^{(-1)}, s^{(1)} \right\rangle_{L^2(\rho)}
 = \left\langle a^{(-1)}, \mathds{1} \right\rangle_{L^2(\rho)} \left\langle \mathds{1}, s^{(1)} \right\rangle_{L^2(\rho)}
 + \left\langle a^{(-1)}_{\perp}, s^{(1)}_{\perp} \right\rangle_{L^2(\rho)} \\
&= \left\langle a^{(-1)}_{\perp}, s^{(1)}_{\perp} \right\rangle_{L^2(\rho)}\,,
\end{aligned}
$$

where in the last equality we use (85). Thus we can rewrite (87) as

$$
\begin{aligned}
\partial_{t_3} a^{(-1)} &= -\sigma_0^2 \left\langle a^{(1)}, \mathds{1} \right\rangle_{L^2(\rho)} \mathds{1} + \sigma_1\left(\varphi_1 - \sigma_1 \left\langle a^{(-1)}_{\perp}, s^{(1)}_{\perp} \right\rangle_{L^2(\rho)}\right) s^{(1)}\,, \\
\partial_{t_3} s^{(1)} &= \sigma_1\left(\varphi_1 - \sigma_1 \left\langle a^{(-1)}_{\perp}, s^{(1)}_{\perp} \right\rangle_{L^2(\rho)}\right) a^{(-1)}\,,
\end{aligned} \tag{88}
$$

and thus

$$
\begin{aligned}
\partial_{t_3} a^{(-1)}_{\perp} &= \sigma_1\left(\varphi_1 - \sigma_1 \left\langle a^{(-1)}_{\perp}, s^{(1)}_{\perp} \right\rangle_{L^2(\rho)}\right) s^{(1)}_{\perp}\,, \\
\partial_{t_3} s^{(1)}_{\perp} &= \sigma_1\left(\varphi_1 - \sigma_1 \left\langle a^{(-1)}_{\perp}, s^{(1)}_{\perp} \right\rangle_{L^2(\rho)}\right) a^{(-1)}_{\perp}\,.
\end{aligned} \tag{89}
$$

In Appendix C.1, we solve this system of ODEs and determine the initial condition by matching with the previous layer. The result is that

$$
\begin{aligned}
a^{(-1)} &= a^{(-1)}_{\perp} = \lambda\, a_{\perp,\mathrm{init}}\,, \\
s^{(1)} &= s^{(1)}_{\perp} = \operatorname{sign}(\sigma_1\varphi_1)\, \lambda\, a_{\perp,\mathrm{init}}\,,
\end{aligned} \tag{90}
$$

where $\lambda = \lambda(t_3)$ is the function

$$
\lambda(t_3) = \frac{|\varphi_1|^{\nicefrac{1}{2}}}{\left(|\sigma_1| \left\| a_{\perp,\mathrm{init}} \right\|_{L^2(\rho)}^2 + 4|\varphi_1| e^{-2|\sigma_1\varphi_1| t_3}\right)^{\nicefrac{1}{2}}}\,. \tag{91}
$$
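As a sanity check on (91), one can substitute the form (90) into (89). Writing $N = \|a_{\perp,\mathrm{init}}\|_{L^2(\rho)}^2$, this gives the scalar equation $\partial_{t_3}\lambda = \left(|\sigma_1\varphi_1| - \sigma_1^2 N \lambda^2\right)\lambda$, a logistic equation for $\lambda^2$. The sketch below is only an illustration under these assumptions: the reduction to a scalar ODE is our own restatement, the parameter values are arbitrary, and the helper names are hypothetical. It integrates the scalar equation with a hand-written classical Runge-Kutta scheme, started on the closed-form curve at a negative time, and compares the result with (91).

```python
import math


def lam_closed_form(t3, sigma1, phi1, norm_sq):
    """Eq. (91). norm_sq stands for ||a_perp,init||^2 in L^2(rho)."""
    rate = abs(sigma1 * phi1)
    denom = abs(sigma1) * norm_sq + 4.0 * abs(phi1) * math.exp(-2.0 * rate * t3)
    return math.sqrt(abs(phi1) / denom)


def lam_rhs(lam, sigma1, phi1, norm_sq):
    """Right-hand side of the scalar ODE obtained by inserting (90) into (89)."""
    return (abs(sigma1 * phi1) - sigma1 ** 2 * norm_sq * lam ** 2) * lam


def integrate_rk4(t_start, t_end, n_steps, sigma1, phi1, norm_sq):
    """Classical RK4 for the autonomous scalar ODE, started on the curve (91)."""
    h = (t_end - t_start) / n_steps
    lam = lam_closed_form(t_start, sigma1, phi1, norm_sq)
    for _ in range(n_steps):
        k1 = lam_rhs(lam, sigma1, phi1, norm_sq)
        k2 = lam_rhs(lam + 0.5 * h * k1, sigma1, phi1, norm_sq)
        k3 = lam_rhs(lam + 0.5 * h * k2, sigma1, phi1, norm_sq)
        k4 = lam_rhs(lam + h * k3, sigma1, phi1, norm_sq)
        lam += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return lam


if __name__ == "__main__":
    # Illustrative values only (not taken from the paper).
    sigma1, phi1, norm_sq = 0.7, -1.3, 2.0
    numeric = integrate_rk4(-5.0, 5.0, 4000, sigma1, phi1, norm_sq)
    exact = lam_closed_form(5.0, sigma1, phi1, norm_sq)
    print("RK4:", numeric, "closed form:", exact)
```

The same closed form also reproduces the two limits used in the text: for $t_3 \to -\infty$ it behaves like $e^{|\sigma_1\varphi_1| t_3}/2$, which matches the divergence found on the second time scale, and for $t_3 \to \infty$ its square tends to $|\varphi_1| / (|\sigma_1| N)$.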

This solution completes the description of how the linear part of the function $\varphi$ is learned. Plugging it into the equations for $a^{(-1)}$ and $s^{(1)}$, we get

$$
\sigma_1 \int a^{(-1)}(\nu) s^{(1)}(\nu)\,\mathrm{d}\rho(\nu)
= \sigma_1 \left\langle a^{(-1)}_{\perp}, s^{(1)}_{\perp} \right\rangle_{L^2(\rho)}
= \frac{\varphi_1 |\sigma_1| \left\| a_{\perp,\mathrm{init}} \right\|_{L^2(\rho)}^2}{|\sigma_1| \left\| a_{\perp,\mathrm{init}} \right\|_{L^2(\rho)}^2 + 4|\varphi_1| e^{-2|\sigma_1\varphi_1| t_3}}\,,
$$

which converges to $\varphi_1$ as $t_3 \to \infty$. Consequently, we obtain the following approximation for $\mathscrsfs{R}_{\mathrm{mf},*}$ within this time scale (again, see Appendix C.2 for details):

$$
\mathscrsfs{R}_{\mathrm{mf},*} = \frac{1}{2}\varphi_1^2 \left(1 - \frac{1}{1 + \frac{4|\varphi_1|}{|\sigma_1| \left\| a_{\perp,\mathrm{init}} \right\|_{L^2(\rho)}^2} e^{-2|\sigma_1\varphi_1| t_3}}\right)^2 + \frac{1}{2}\sum_{k \geqslant 2} \varphi_k^2 + O(\varepsilon^{\nicefrac{1}{4}})\,, \qquad \varepsilon \to 0\,.
$$
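To visualize the transition described by this formula, the leading-order risk can be evaluated directly. The snippet below is a minimal sketch with illustrative parameter values: the function names are hypothetical, and the tail $\frac{1}{2}\sum_{k \geqslant 2}\varphi_k^2$ is passed in as a single number chosen by the caller. It evaluates the overlap $\sigma_1\langle a^{(-1)}_{\perp}, s^{(1)}_{\perp}\rangle_{L^2(\rho)}$ and the risk without the $O(\varepsilon^{\nicefrac{1}{4}})$ correction.

```python
import math


def linear_overlap(t3, sigma1, phi1, norm_sq):
    """sigma_1 <a_perp^(-1), s_perp^(1)>: the part of phi_1 learned at time t3.

    norm_sq stands for ||a_perp,init||^2 in L^2(rho).
    """
    rate = abs(sigma1 * phi1)
    num = phi1 * abs(sigma1) * norm_sq
    den = abs(sigma1) * norm_sq + 4.0 * abs(phi1) * math.exp(-2.0 * rate * t3)
    return num / den


def risk_third_scale(t3, sigma1, phi1, norm_sq, tail):
    """Leading-order risk on the third time scale.

    tail is (1/2) * sum_{k >= 2} phi_k^2, supplied by the caller.
    """
    return 0.5 * (phi1 - linear_overlap(t3, sigma1, phi1, norm_sq)) ** 2 + tail


if __name__ == "__main__":
    # Illustrative values only (not taken from the paper).
    sigma1, phi1, norm_sq, tail = 0.7, -1.3, 2.0, 0.25
    for t3 in (-20.0, -2.0, 0.0, 2.0, 20.0):
        print(t3, risk_third_scale(t3, sigma1, phi1, norm_sq, tail))
```

For very negative $t_3$ the risk sits on the plateau $\frac{1}{2}\varphi_1^2 + \frac{1}{2}\sum_{k \geqslant 2}\varphi_k^2$, and for large positive $t_3$ it drops to $\frac{1}{2}\sum_{k \geqslant 2}\varphi_k^2$, with the decrease concentrated in a window of width $O(1)$ in $t_3$.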

6.5 Conjectured behavior for larger time scales

The analysis of the previous sections naturally suggests the existence of a sequence of cutoffs. At each time scale, a new polynomial component of $\varphi$ is learned within a window that is much shorter than the time elapsed before that phase started. Along this sequence, we expect $s$ and $a$ to grow to increasingly larger scales in $\varepsilon$ (but $s$ remains $o(1)$ while $a$ diverges).

More precisely, we assume that during the $l$-th phase, the network learns the degree-$l$ component $\varphi_l$, and various quantities satisfy the following scaling behavior:

$$
a = O(\varepsilon^{-\omega_l})\,, \qquad s = O(\varepsilon^{\beta_l})\,, \qquad t = O(\varepsilon^{\mu_l})\,, \tag{92}
$$

where $\omega_{l}>0$ is an increasing sequence and $\beta_{l},\mu_{l}>0$ are decreasing sequences. Further, while learning of this component takes place when $t=O(\varepsilon^{\mu_{l}})$, the actual evolution of the risk (and of the neural network) takes place on a much shorter scale, namely:

$$\Delta t=O(\varepsilon^{\nu_{l}})\,,\tag{93}$$

where $\nu_{l}$ is also decreasing, with $\nu_{l}>\mu_{l}$. The goal of this section is to provide heuristic arguments to conjecture the values of $\omega_{l}$, $\beta_{l}$, $\mu_{l}$ and $\nu_{l}$. We will base this conjecture on a rigorous analysis of a simplified model.

The simplified model is motivated by the expectation (supported by the heuristics and simulations in the previous sections) that learning each component happens independently from the details of the evolution on previous time scales. In the simplified model, the activation function $\sigma(x)$ is proportional to the $l$-th Hermite polynomial, namely $\sigma(x)=\sigma_{l}\mathrm{He}_{l}(x)$. This is the component of $\sigma$ that we expect to be relevant on the $l$-th time scale. The gradient flow equations (42) then read:

$$\begin{aligned}\varepsilon\partial_{t}a(\omega)&=\sigma_{l}s(\omega)^{l}\left(\varphi_{l}-\sigma_{l}\int a(\nu)s(\nu)^{l}\,\mathrm{d}\rho(\nu)\right),\\ \partial_{t}s(\omega)&=a(\omega)\left(1-s(\omega)^{2}\right)l\sigma_{l}s(\omega)^{l-1}\left(\varphi_{l}-\sigma_{l}\int a(\nu)s(\nu)^{l}\,\mathrm{d}\rho(\nu)\right),\end{aligned}\tag{94}$$

with corresponding risk component

$$\mathscr{R}_{l}=\frac{1}{2}\left(\varphi_{l}-\sigma_{l}\int a(\nu)s(\nu)^{l}\,\mathrm{d}\rho(\nu)\right)^{2}.$$

We capture the effect of learning dynamics on the previous time scales by the overall magnitude of the $a(\omega)$'s and $s(\omega)$'s at initialization. Namely, we choose the scale of initialization of the simplified model to be given by the end of the $(l-1)$-th time scale, i.e., $a(\omega)\asymp\varepsilon^{-\omega_{l-1}}$ and $s(\omega)\asymp\varepsilon^{\beta_{l-1}}$. Further, in order for the $(l-1)$-th component to be learned, namely

$$\int a(\nu)s(\nu)^{l-1}\,\mathrm{d}\rho(\nu)\approx\frac{\varphi_{l-1}}{\sigma_{l-1}},\tag{95}$$

we require $\omega_{l-1}=(l-1)\beta_{l-1}$ so that $\int a(\nu)s(\nu)^{l-1}\,\mathrm{d}\rho(\nu)=\Theta(1)$. Analogously, we assume $\omega_{l}=l\beta_{l}$.

Based on this consideration, we introduce the rescaled variables

$$\widetilde{a}(\omega)=\varepsilon^{\omega_{l}}a(\omega),\quad\widetilde{s}(\omega)=\varepsilon^{-\beta_{l}}s(\omega),\quad\text{where}\quad\widetilde{a}(\omega,0)\asymp\varepsilon^{\omega_{l}-\omega_{l-1}},\quad\widetilde{s}(\omega,0)\asymp\varepsilon^{\beta_{l-1}-\beta_{l}}.$$

Rewriting Eq. (94) in terms of the $\widetilde{a}(\omega)$'s and $\widetilde{s}(\omega)$'s, and using $\omega_{l}=l\beta_{l}$, we get that

$$\begin{aligned}\varepsilon^{1-2l\beta_{l}}\partial_{t}\widetilde{a}(\omega)&=\sigma_{l}\widetilde{s}(\omega)^{l}\left(\varphi_{l}-\sigma_{l}\int\widetilde{a}(\nu)\widetilde{s}(\nu)^{l}\,\mathrm{d}\rho(\nu)\right)\\ \varepsilon^{2\beta_{l}}\partial_{t}\widetilde{s}(\omega)&=l\sigma_{l}\widetilde{a}(\omega)\widetilde{s}(\omega)^{l-1}\left(1-\varepsilon^{2\beta_{l}}\widetilde{s}(\omega)^{2}\right)\left(\varphi_{l}-\sigma_{l}\int\widetilde{a}(\nu)\widetilde{s}(\nu)^{l}\,\mathrm{d}\rho(\nu)\right).\end{aligned}\tag{96}$$

In order for the $\widetilde{a}(\omega)$'s and $\widetilde{s}(\omega)$'s to be learned simultaneously, we need $1-2l\beta_{l}=2\beta_{l}$, which implies $\beta_{l}=1/(2(l+1))$. Making a further change of the time variable $t=\varepsilon^{\nu_{l}}\tau$, where $\nu_{l}=2\beta_{l}=1/(l+1)$, it follows that

$$\begin{aligned}\partial_{\tau}\widetilde{a}(\omega)&=\sigma_{l}\widetilde{s}(\omega)^{l}\left(\varphi_{l}-\sigma_{l}\int\widetilde{a}(\nu)\widetilde{s}(\nu)^{l}\,\mathrm{d}\rho(\nu)\right)\\ \partial_{\tau}\widetilde{s}(\omega)&=l\sigma_{l}\widetilde{a}(\omega)\widetilde{s}(\omega)^{l-1}\left(1-\varepsilon^{2\beta_{l}}\widetilde{s}(\omega)^{2}\right)\left(\varphi_{l}-\sigma_{l}\int\widetilde{a}(\nu)\widetilde{s}(\nu)^{l}\,\mathrm{d}\rho(\nu)\right).\end{aligned}\tag{97}$$

Moreover, rewriting the risk in terms of the rescaled variables $\widetilde{a},\widetilde{s}$, we find that $\mathscr{R}_{l}(\tau)=\mathscr{R}_{l}(\widetilde{a}(\tau),\widetilde{s}(\tau))$ satisfies the ODE:

$$\partial_{\tau}\mathscr{R}_{l}=-2\sigma_{l}^{2}\mathscr{R}_{l}\cdot\int\widetilde{s}(\omega)^{2(l-1)}\left(l^{2}\widetilde{a}(\omega)^{2}\left(1-\varepsilon^{2\beta_{l}}\widetilde{s}(\omega)^{2}\right)+\widetilde{s}(\omega)^{2}\right)\mathrm{d}\rho(\omega).\tag{98}$$
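As a purely illustrative sketch (not the authors' code), the rescaled system (97) can be integrated numerically when $\rho$ is a uniform measure on $m$ atoms. The function name `simulate_simplified_flow`, the forward-Euler scheme, the step size, the positive random initialization at scale $\varepsilon^{\omega_{l}-\omega_{l-1}}=\varepsilon^{1/(2l(l+1))}$, and the choices $\sigma_{l}=\varphi_{l}=1$ are all assumptions made here for the sake of the example.

```python
import numpy as np

def simulate_simplified_flow(l=2, eps=1e-4, m=8, sigma_l=1.0, phi_l=1.0,
                             dt=2e-3, n_steps=20000, seed=0):
    """Forward-Euler integration of the rescaled system (97) for a discrete
    rho with m equally weighted atoms. Returns the risk R_l at every step."""
    rng = np.random.default_rng(seed)
    beta_l = 1.0 / (2 * (l + 1))
    # Assumed initialization: both variables at scale eps^{1/(2l(l+1))},
    # with positive signs so that sigma_l * phi_l * a * s^l > 0.
    scale = eps ** (1.0 / (2 * l * (l + 1)))
    a = scale * rng.uniform(0.5, 1.5, size=m)
    s = scale * rng.uniform(0.5, 1.5, size=m)
    risks = np.empty(n_steps + 1)
    for k in range(n_steps + 1):
        # Residual phi_l - sigma_l * int a s^l d rho (the integral is a mean).
        resid = phi_l - sigma_l * np.mean(a * s**l)
        risks[k] = 0.5 * resid**2
        da = sigma_l * s**l * resid
        ds = (l * sigma_l * a * s**(l - 1)
              * (1.0 - eps**(2 * beta_l) * s**2) * resid)
        a, s = a + dt * da, s + dt * ds
    return risks

risks = simulate_simplified_flow()
print("initial risk:", risks[0], "final risk:", risks[-1])
```

Recording the first index at which `risks` falls below a fixed threshold, for several values of `eps`, gives a rough numerical handle on how the escape time depends on $\varepsilon$; the risk along the computed trajectory should also be non-increasing, in line with Eq. (98).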

Note that with our choice of $\beta_{l}$ and $\omega_{l}$, we have $\omega_{l}-\omega_{l-1}=\beta_{l-1}-\beta_{l}=1/(2l(l+1))$. This means that the $\widetilde{a}(\omega)$'s and $\widetilde{s}(\omega)$'s are initialized at the same scale, namely

$$\widetilde{a}(\omega,0),\ \widetilde{s}(\omega,0)=\Theta(\varepsilon^{1/(2l(l+1))})\,.\tag{99}$$

The theorem below describes quantitatively the dynamics of the simplified model for small $\varepsilon$, and determines the value of $\mu_{l}$ (recall that $\nu_{l}=1/(l+1)$):

Theorem 1 (Evolution of the simplified gradient flow).

Assume $l\geq 2$ and let $(\widetilde{a}(\omega,\tau),\widetilde{s}(\omega,\tau))_{\tau\geq 0}$ be the unique solution of the ODE system (97), initialized as per Eq. (99) (note in particular that $\sigma_{l}\varphi_{l}\widetilde{a}(\omega,0)\widetilde{s}(\omega,0)^{l}\asymp\varepsilon^{1/(2l)}$). Then the following hold:

  • $(a)$

    Let us denote

    $$A=\left\{\omega:\ \sigma_{l}\varphi_{l}\liminf_{\varepsilon\to 0}\varepsilon^{-1/(2l)}\widetilde{a}(\omega,0)\widetilde{s}(\omega,0)^{l}>0\right\}\tag{100}$$

    and assume $\rho(A)>0$. For $\Delta\in(0,\varphi_{l}^{2}/2)$, define

    $$\tau(\Delta)=\inf\left\{\tau\geq 0:\ \mathscr{R}_{l}(\widetilde{a}(\tau),\widetilde{s}(\tau))\leq\Delta\right\}.\tag{101}$$

    Then, for any fixed $\Delta$, we have $\tau(\Delta)=\Theta(\varepsilon^{-(l-1)/(2l(l+1))})$ as $\varepsilon\to 0$. Further, if $\rho$ is a discrete probability measure, then there exists $\tau_{*}(\varepsilon)=\Theta(\varepsilon^{-(l-1)/(2l(l+1))})$ and, for any $\Delta>0$, a constant $c_{*}(\Delta)>0$ independent of $\varepsilon$ such that

    $$\tau\leq\tau_{*}(\varepsilon)-c_{*}(\Delta)\ \Rightarrow\ \liminf_{\varepsilon\to 0}\mathscr{R}_{l}(\widetilde{a}(\tau),\widetilde{s}(\tau))\geq\frac{1}{2}\varphi_{l}^{2}-\Delta\,,\tag{102}$$
    $$\tau\geq\tau_{*}(\varepsilon)+c_{*}(\Delta)\ \Rightarrow\ \limsup_{\varepsilon\to 0}\mathscr{R}_{l}(\widetilde{a}(\tau),\widetilde{s}(\tau))\leq\Delta\,,\tag{103}$$

    namely the $l$-th component is learnt in an $O(1)$ time window around $\tau_{*}(\varepsilon)=\Theta(\varepsilon^{-(l-1)/(2l(l+1))})$.

  • $(b)$

    Similarly, we denote

    $$B=\left\{\omega:\ \sigma_{l}\varphi_{l}\limsup_{\varepsilon\to 0}\varepsilon^{-1/(2l)}\widetilde{a}(\omega,0)\widetilde{s}(\omega,0)^{l}<0,\ \text{and}\ \liminf_{\varepsilon\to 0}\left(\widetilde{s}(\omega,0)^{2}/\widetilde{a}(\omega,0)^{2}\right)>l\right\}.\tag{104}$$

    If $\rho(B)>0$, then the same claims as in $(a)$ hold.

  • $(c)$

    If neither of the conditions at points $(a)$, $(b)$ holds, and

    $$\sigma_{l}\varphi_{l}\limsup_{\varepsilon\to 0}\varepsilon^{-1/(2l)}\widetilde{a}(\omega,0)\widetilde{s}(\omega,0)^{l}<0,\quad\limsup_{\varepsilon\to 0}\left(\widetilde{s}(\omega,0)^{2}/\widetilde{a}(\omega,0)^{2}\right)<l\tag{105}$$

    for almost every $\omega\in\Omega$, then, for such $\omega\in\Omega$ and each $\Delta>0$, there exists a constant $C_{*}(\omega,\Delta)>0$ such that

    $$\tau\geq C_{*}(\omega,\Delta)\varepsilon^{-(l-1)/(2l(l+1))}\ \Rightarrow\ |\widetilde{s}(\omega,\tau)|\leq\Delta\varepsilon^{1/(2l(l+1))}\,,\tag{106}$$

    meaning that $\widetilde{s}(\omega,\tau)$ converges to $0$ eventually.

We further note that $\tau=\Theta(\varepsilon^{-(l-1)/(2l(l+1))})\Longleftrightarrow t=\Theta(\varepsilon^{\mu_{l}})$ with $\mu_{l}=1/(2l)$, and $\tau=O(1)\Longleftrightarrow t=O(\varepsilon^{\nu_{l}})$ with $\nu_{l}=1/(l+1)$.
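The exponent bookkeeping above is easy to get wrong by hand, so the following small check, written in exact rational arithmetic, may be useful. It is only a sanity check of the algebra relating $\beta_{l}$, $\omega_{l}$, $\nu_{l}$ and $\mu_{l}$; the helper name `exponents` is introduced here for illustration and is not part of the paper.

```python
from fractions import Fraction

def exponents(l):
    """Conjectured exponents (beta_l, omega_l, nu_l, mu_l) for phase l >= 1."""
    beta = Fraction(1, 2 * (l + 1))   # solves 1 - 2*l*beta = 2*beta
    omega = l * beta                  # omega_l = l * beta_l
    nu = 2 * beta                     # window width: Delta t ~ eps^nu
    mu = Fraction(1, 2 * l)           # window location: t ~ eps^mu
    return beta, omega, nu, mu

for l in range(2, 8):
    beta, omega, nu, mu = exponents(l)
    beta_prev, omega_prev, _, _ = exponents(l - 1)
    gap = Fraction(1, 2 * l * (l + 1))
    # Balance condition that makes both equations in (96) evolve together.
    assert 1 - 2 * l * beta == 2 * beta
    # Rescaled a and s start at the same scale eps^gap, as in Eq. (99).
    assert omega - omega_prev == beta_prev - beta == gap
    # tau ~ eps^{-(l-1)*gap} and t = eps^nu * tau together give t ~ eps^mu.
    assert nu - (l - 1) * gap == mu
    # The learning window is shorter than the waiting time (eps < 1).
    assert nu > mu
    print(l, beta, omega, nu, mu)
```

Using `Fraction` rather than floating point keeps every identity exact, so a failed assertion would indicate a genuine algebra mistake rather than rounding.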

The proof of Theorem 1 is deferred to Appendix C.3.

Remark 6.1.

Under the conditions of cases $(a)$ and $(b)$, we see that the degree-$l$ component of the target function is learnt within an $O(\varepsilon^{1/(l+1)})$ time window around $t_{*}(l,\varepsilon)\asymp\varepsilon^{1/(2l)}$, which is consistent with the timescales conjectured in Definition 1.

Remark 6.2.

Case $(c)$ corresponds to $s(\omega)/s(\omega,0)$ becoming close to $0$ in time $t=O(\varepsilon^{\mu_{l}})$, and staying at $0$. In other words, the neurons become orthogonal to the target direction and play no role in learning higher-degree components any longer.

Informally, case $(c)$ couples the learning of different polynomial components. It can happen that the learning phase $l-1$ induces an effective initialization $(\widetilde{a}(\omega,0),\ \widetilde{s}(\omega,0))$ within the domain of case $(c)$.

We expect this not to be the case for suitable choices of initialization (or equivalently $\mathrm{P}_{A}$), $\varphi$, and $\sigma$. Establishing this would amount to establishing that the canonical learning order holds.

7 Stochastic gradient descent and finite sample size

So far we focused on analyzing the projected gradient flow (GF) dynamics with respect to the population risk, as defined in Eqs. (5)-(6). In this section, we extract the implications of our analysis of GF on online projected stochastic gradient descent, which is a projected version of the SGD dynamics (162).

For simplicity of notation, we denote by $z=(y,x)\in\mathbb{R}\times\mathbb{R}^{d}$ a datapoint and by $\theta_{i}=(a_{i},u_{i})\in\mathbb{R}\times\mathbb{S}^{d-1}$ the parameters of neuron $i$. For $z=(y,x)$ and $\rho^{(m)}=(1/m)\sum_{i=1}^{m}\delta_{\theta_{i}}=(1/m)\sum_{i=1}^{m}\delta_{(a_{i},u_{i})}$, we define

$$\begin{aligned}\widehat{F}_{i}(\rho^{(m)};z)&=\left(y-\frac{1}{m}\sum_{j=1}^{m}a_{j}\sigma(\langle u_{j},x\rangle)\right)\sigma(\langle u_{i},x\rangle),\\ \widehat{G}_{i}(\rho^{(m)};z)&=a_{i}\left(y-\frac{1}{m}\sum_{j=1}^{m}a_{j}\sigma(\langle u_{j},x\rangle)\right)\sigma^{\prime}(\langle u_{i},x\rangle)\,x.\end{aligned}$$

The projected SGD dynamics is specified as follows:

$$
\begin{split}
\overline{a}_{i}(k+1)=\,&\overline{a}_{i}(k)+\varepsilon^{-1}\eta\,\widehat{F}_{i}(\overline{\rho}^{(m)}(k);z_{k+1}),\\
\overline{u}_{i}(k+1)=\,&\operatorname{Proj}_{\mathbb{S}^{d-1}}\left(\overline{u}_{i}(k)+\eta\,\widehat{G}_{i}(\overline{\rho}^{(m)}(k);z_{k+1})\right),
\end{split}
\tag{107}
$$

where for $u\in\mathbb{R}^{d}$ and compact $S\subset\mathbb{R}^{d}$, $\operatorname{Proj}_{S}(u):=\operatorname*{argmin}_{s\in S}\|s-u\|_{2}$, and $\overline{\rho}^{(m)}:=(1/m)\sum_{i=1}^{m}\delta_{\overline{\theta}_{i}}$. Note that the $(\overline{a}_{i},\overline{u}_{i})$'s here are different from the $(\overline{a},\overline{s})$'s in Section 6.
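As an illustration, a single step of Eq. (107) can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `projected_sgd_step`, the array layout (`a` of shape `(m,)`, `u` of shape `(m, d)`), and the default choice of `tanh` as activation are all hypothetical. The projection onto $\mathbb{S}^{d-1}$ is implemented as $u\mapsto u/\|u\|_{2}$, which is the Euclidean projection whenever $u\neq 0$.

```python
import numpy as np


def projected_sgd_step(a, u, x, y, eta, eps,
                       sigma=np.tanh,
                       sigma_prime=lambda t: 1.0 - np.tanh(t) ** 2):
    """One projected SGD step in the spirit of Eq. (107).

    a: (m,) second-layer weights; u: (m, d) unit-norm first-layer weights;
    x: (d,) covariates; y: scalar response. The activation is an assumption.
    """
    pre = u @ x                                   # <u_i, x> for every neuron
    residual = y - np.mean(a * sigma(pre))        # y - (1/m) sum_j a_j sigma(<u_j, x>)
    F = residual * sigma(pre)                     # \hat F_i, shape (m,)
    G = (a * residual * sigma_prime(pre))[:, None] * x[None, :]  # \hat G_i, shape (m, d)

    a_new = a + (eta / eps) * F                   # second layer moves at rate eta / eps
    u_new = u + eta * G                           # unconstrained first-layer update
    # Projection onto the unit sphere (assumes no row is exactly zero).
    u_new = u_new / np.linalg.norm(u_new, axis=1, keepdims=True)
    return a_new, u_new
```

Each call consumes one fresh sample $(x,y)$, so running it for $n$ steps uses $n$ samples, in line with the online setting described in the text.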

We prove that, for small $\eta$, the projected SGD of Eq. (107) is close to the gradient flow of Eqs. (5)-(6). Throughout this section, we make the following assumptions, similar to those of Section 4:

A1.

$\rho_{0}$ is supported on $[-M_{1},M_{1}]\times\mathbb{S}^{d-1}$. Hence, $|a_{i}(0)|\leq M_{1}$ for all $i\in[m]$.

A2.

The activation function is bounded: $\|\sigma\|_{\infty}\leq M_{2}$. Additionally, define for $u,u^{\prime}\in\mathbb{R}^{d}$:

\begin{align}
V(\langle u_{*},u\rangle;\|u_{*}\|_{2},\|u\|_{2})=\,&\mathbb{E}\left[\varphi(\langle u_{*},x\rangle)\sigma(\langle u,x\rangle)\right], \tag{108}\\
U(\langle u,u^{\prime}\rangle;\|u\|_{2},\|u^{\prime}\|_{2})=\,&\mathbb{E}\left[\sigma(\langle u,x\rangle)\sigma(\langle u^{\prime},x\rangle)\right]. \tag{109}
\end{align}

We then require the functions $V$ and $U$ to be bounded and differentiable, with uniformly bounded and Lipschitz continuous gradients for all $\|u\|_{2},\|u^{\prime}\|_{2}\leq 2$:

\begin{align}
&\|\nabla_{u}V\|_{2}\leq M_{2},\ \ \|\nabla_{u}V-\nabla_{u^{\prime}}V\|_{2}\leq M_{2}\|u-u^{\prime}\|_{2}, \tag{110}\\
&\|\nabla_{(u,u^{\prime})}U\|_{2}\leq M_{2},\ \ \|\nabla_{(u,u^{\prime})}U-\nabla_{(u_{1},u_{1}^{\prime})}U\|_{2}\leq M_{2}\left(\|u-u_{1}\|_{2}+\|u^{\prime}-u_{1}^{\prime}\|_{2}\right). \tag{111}
\end{align}

Similar to Remark 4.1, we can show that a sufficient condition for Eqs. (110) and (111) is

$$
\sup\left\{\|\sigma^{\prime}\|_{L^{2}},\,\|\sigma^{\prime\prime}\|_{L^{2}}\right\}\leq M_{2}^{\prime},\qquad \sup\left\{\|\varphi\|_{L^{2}},\,\|\varphi^{\prime}\|_{L^{2}},\,\|\varphi^{\prime\prime}\|_{L^{2}}\right\}\leq M_{2}^{\prime},
$$

where the constant $M_{2}^{\prime}$ depends only on $M_{2}$.

A3.

Let $(x,y)\sim\mathbb{P}$. We require that $y\in[-M_{3},M_{3}]$ almost surely. Moreover, we assume that for all $\|u\|_{2}\leq 2$, both $\sigma(\langle u,x\rangle)$ and $\sigma^{\prime}(\langle u,x\rangle)(x-\langle u,x\rangle u)$ are $M_{3}$-sub-Gaussian.

The following theorem upper bounds the distance between gradient flow and projected stochastic gradient descent dynamics.

Theorem 2 (Difference between GF and Projected SGD).

Let $\theta_{i}(t)=(a_{i}(t),u_{i}(t))$ be the solution of the GF ordinary differential equations (5)-(6). There exists a constant $M$ that only depends on the $M_{i}$'s from Assumptions A1-A3, such that for any $T,z\geq 0$ and

$$
\eta\leq\frac{1}{(d+\log m+z^{2})\,M\exp\left((1+1/\varepsilon)MT(1+T/\varepsilon)^{2}\right)},
$$

the following holds with probability at least $1-\exp(-z^{2})$:

\begin{align}
\sup_{k\in[0,T/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left|\overline{a}_{i}(k)\right|\leq\,&M(1+T/\varepsilon), \tag{112}\\
\sup_{k\in[0,T/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left\|\theta_{i}(k\eta)-\overline{\theta}_{i}(k)\right\|_{2}\leq\,&\left(\sqrt{d+\log m}+z\right) \tag{113}\\
&\ \ \times M\exp\left(\left(1+\frac{1}{\varepsilon}\right)MT\left(1+\frac{T}{\varepsilon}\right)^{2}\right)\sqrt{\eta}, \tag{114}\\
\sup_{k\in[0,T/\eta]\cap\mathbb{N}}\left|\mathscr{R}(\overline{a}(k),\overline{u}(k))-\mathscr{R}(a(k\eta),u(k\eta))\right|\leq\,&\left(\sqrt{d+\log m}+z\right) \tag{115}\\
&\ \ \times M\exp\left(\left(1+\frac{1}{\varepsilon}\right)MT\left(1+\frac{T}{\varepsilon}\right)^{2}\right)\sqrt{\eta}. \tag{116}
\end{align}

The proof is presented in Appendix D and follows the same scheme as the proof of Theorem 1 part (B) in [33]. The main difference with respect to that theorem is that here we are interested in projected SGD (and GF) instead of plain SGD (and GF); hence an additional approximation step is required, and the $a_{i}$'s and $u_{i}$'s need to be treated separately. We next draw implications of the last result for learning by online SGD within the canonical learning order.

Theorem 3.

Fix any $\delta>0$. Assume that $\varphi,\sigma$ and the initialization ${\rm P}_{A}$ are such that the canonical learning order of Definition 1 holds up to level $L$ for some $L\geq 2$, and that

$$
\sum_{k\geq L+1}\varphi_{k}^{2}\leq\frac{\delta}{2}. \tag{117}
$$

Then, there exist constants $\varepsilon_{*}=\varepsilon_{*}(\delta)$, $T_{0}=T_{0}(\delta)$, $T=T(\varepsilon,\delta)=T_{0}(\delta)\varepsilon^{1/(2L)}$ and $M=M(\varepsilon,\delta)$ that depend on $\varepsilon,\delta$ (together with $\varphi,\sigma$ and ${\rm P}_{A}$) such that the following holds. Assume $\varepsilon\leq\varepsilon_{*}(\delta)$ and $m,d,z$ are such that $d\geq M$, $m\geq\max(M,z)$, and the step size $\eta$ and number of samples (equivalently, number of steps) $n$ satisfy

\begin{align}
\eta&=\frac{1}{M(d+\log m+z)}\,, \tag{118}\\
n&=MT(d+\log m+z)\,. \tag{119}
\end{align}

Then, with probability at least $1-e^{-z}$, the projected SGD algorithm of Eq. (107) achieves population risk smaller than $\delta$:

$$
\mathbb{P}\Big(\mathscr{R}(\overline{a}(n),\overline{u}(n))\leq\delta\Big)\geq 1-e^{-z}\,. \tag{120}
$$

The proof of Theorem 3 is deferred to Appendix D.4.
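As a brief consistency check on the parameter choices in Theorem 3 (an added remark that uses only Eqs. (118) and (119)), multiplying the step size by the number of steps gives the total continuous time covered by the SGD trajectory:

```latex
\begin{align*}
n\,\eta \;=\; MT\,(d+\log m+z)\cdot\frac{1}{M\,(d+\log m+z)} \;=\; T .
\end{align*}
```

In other words, the $n$ steps of projected SGD correspond to the time horizon $T$, which matches the range $k\in[0,T/\eta]\cap\mathbb{N}$ over which Theorem 2 compares SGD with gradient flow.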

Remark 7.1.

Within the lazy or neural tangent regime, learning the projection of the target function $\varphi(\langle u_{*},x\rangle)$ onto polynomials of degree $\ell$ requires $n\gg d^{\ell}$ samples, and $m\gg d^{\ell-1}$ neurons [20, 34, 35].

In contrast, Theorem 3 shows that, within the canonical learning order, $O(d)$ samples and $O(1)$ neurons are sufficient. Further, as per Theorem 2, the learning dynamics is accurately described by the GF analyzed in the previous sections.

8 Discussion

We conclude by discussing some of our findings as well as potential extensions of our work. As mentioned in the introduction, our initial motivation was to understand certain ubiquitous phenomena in the learning dynamics of multi-layer neural networks. A particularly striking phenomenon that we could reproduce in the present mathematical setting is the coexistence of plateaus in which the risk barely changes and sudden drops.

In the next paragraphs, we will briefly emphasize results or future directions that were not anticipated at the beginning of this work.

Implicit bias in function space.

We provided evidence towards the canonical learning order of Definition 1. According to this scenario, the target function $\varphi$ is learnt according to its decomposition into Hermite polynomials, with lower degree components learnt first. This theory applies to online SGD via Theorem 2 and Theorem 3. In this setting, the number of SGD steps corresponds to the number of samples. Therefore, at a small sample size, SGD will fit a low degree polynomial approximation of the target function, with the degree increasing with the number of samples.
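To make the Hermite decomposition concrete, the following sketch computes the coefficients $\varphi_{k}$ of a one-dimensional target numerically and evaluates the tail $\sum_{k\geq L+1}\varphi_{k}^{2}$ that appears in Eq. (117). It is an illustration only: the function names `hermite_coefficients` and `tail_energy` are hypothetical, and the code assumes that $\varphi_{k}$ denotes the coefficient with respect to Hermite polynomials normalized to have unit variance under the standard Gaussian measure.

```python
import math

import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval


def hermite_coefficients(phi, max_degree, n_nodes=60):
    """Coefficients phi_k = E[phi(G) He_k(G)] / sqrt(k!) for k = 0..max_degree,
    with G ~ N(0, 1), computed by Gauss-Hermite quadrature."""
    nodes, weights = hermegauss(n_nodes)
    weights = weights / np.sqrt(2.0 * np.pi)      # normalize to a probability measure
    values = phi(nodes)
    coeffs = np.empty(max_degree + 1)
    for k in range(max_degree + 1):
        basis = np.zeros(k + 1)
        basis[k] = 1.0                            # selects the k-th probabilists' Hermite polynomial
        he_k = hermeval(nodes, basis) / math.sqrt(math.factorial(k))
        coeffs[k] = np.sum(weights * values * he_k)
    return coeffs


def tail_energy(phi, level, n_nodes=60):
    """E[phi(G)^2] minus the squared coefficients up to `level`,
    i.e. sum_{k >= level + 1} phi_k^2 by Parseval's identity."""
    nodes, weights = hermegauss(n_nodes)
    weights = weights / np.sqrt(2.0 * np.pi)
    total = np.sum(weights * phi(nodes) ** 2)
    coeffs = hermite_coefficients(phi, level, n_nodes)
    return total - np.sum(coeffs ** 2)
```

For a given target and tolerance $\delta$, one can increase `level` until `tail_energy` drops below $\delta/2$; under the normalization assumed here, this is the level $L$ required by Eq. (117).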

A similar phenomenon is observed with (rotationally invariant) kernel methods [34], with one important difference. Here the number of samples always scales linearly in the degree, while for kernel methods, different polynomial degrees correspond to different scalings with the dimension.

Implicit bias in parameter space.

Our analysis tracks the evolution of the weights as well. As explained in Section 6, in order for the degree-$k$ component of the target function to be well approximated (in the $d,m\to\infty$ limit), it is sufficient that $\sigma_{k}\int a(\nu)s(\nu)^{k}\rho(\mathrm{d}\nu)=\varphi_{k}$. Here $\nu$ is an abstract neuron index, $a(\nu)$ is the second-layer weight and $s(\nu)$ is the projection of the first layer weight along the target direction $u_{*}$.

Naively, one would expect that, in order for learning to take place, first layer weights should be well aligned with $u_{*}$, i.e. $s(\nu)$ should concentrate close to one. However, this is not the only way to satisfy the constraints $\sigma_{k}\int a(\nu)s(\nu)^{k}\rho(\mathrm{d}\nu)=\varphi_{k}$. Indeed, our analysis in Section 6 indicates that gradient flow satisfies this constraint with $s=\Theta(\varepsilon^{\beta_{k}})$ and $a=\Theta(\varepsilon^{-\omega_{k}})$, where $\beta_{k}=1/(2(k+1))$ and $\omega_{k}=k/(2(k+1))$ (so that $\sigma_{k}\int a(\nu)s(\nu)^{k}\rho(\mathrm{d}\nu)$ will be of order one) as $\varepsilon\to 0$. In other words, the alignment is small, and second layer weights are large. (In general, weights on multiple scales coexist.)
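The claim that the constraint stays of order one can be checked directly from the two exponents, reading $1/2(k+1)$ as $1/(2(k+1))$. The following short computation is an added verification rather than part of the original argument:

```latex
\begin{align*}
a\,s^{k}
\;=\;\Theta\!\left(\varepsilon^{-\omega_{k}}\right)\cdot\Theta\!\left(\varepsilon^{k\beta_{k}}\right)
\;=\;\Theta\!\left(\varepsilon^{-\frac{k}{2(k+1)}+\frac{k}{2(k+1)}}\right)
\;=\;\Theta(1),
\qquad \varepsilon\to 0 .
\end{align*}
```

For instance, with $k=1$ one gets $s=\Theta(\varepsilon^{1/4})$ and $a=\Theta(\varepsilon^{-1/4})$, so the product $a\,s$ remains of order one while the alignment $s$ vanishes.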

The role of the learning rate $\varepsilon$.

The initialization of parameters and relative step-sizes play a key role in modern (non-convex) machine learning. The combination of the two scalings (initialization and relative stepsize) affects the learning dynamics. In order to clarify this point, we can consider a general parametrization (we keep $\|u_{i}\|_{2}=1$)

$$
f(x;a,u)=\frac{1}{m^{\gamma}}\sum_{i=1}^{m}a_{i}\sigma(\langle u_{i},x\rangle)=:\sum_{i=1}^{m}c_{i}\sigma(\langle u_{i},x\rangle),
$$

and gradient flow dynamics

\begin{align*}
\varepsilon\,\partial_{t}a_{i}&=-m\,\partial_{a_{i}}\mathscr{R}(a,u)\,,\\
\partial_{t}u_{i}&=-m\,(I_{d}-u_{i}u_{i}^{\top})\nabla_{u_{i}}\mathscr{R}(a,u)\,.
\end{align*}

(Note that the learning rate in the second equation can be set to $1$ without loss of generality, by rescaling the time axis.) Rewriting this in terms of the coefficients $c_{i}$, so that the function representation is kept fixed, we have

$$
s\,\partial_{t}c_{i}=-m\,\partial_{c_{i}}\overline{\mathscr{R}}(c,u)\,,\qquad s=\varepsilon m^{2\gamma}\,,
$$

while the second equation remains unchanged. This parametrization allows us to compare various scalings in a uniform fashion.
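The expression for $s$ follows from the chain rule. The short derivation below is an added step; it assumes that $\overline{\mathscr{R}}(c,u)$ denotes the same risk written in the new coordinates, i.e. $\overline{\mathscr{R}}(c,u)=\mathscr{R}(m^{\gamma}c,u)$ with $c_{i}=m^{-\gamma}a_{i}$:

```latex
\begin{align*}
\partial_{a_{i}}\mathscr{R}(a,u) &= m^{-\gamma}\,\partial_{c_{i}}\overline{\mathscr{R}}(c,u),
\qquad
\partial_{t}a_{i} = m^{\gamma}\,\partial_{t}c_{i},\\
\varepsilon\,m^{\gamma}\,\partial_{t}c_{i} &= -m\cdot m^{-\gamma}\,\partial_{c_{i}}\overline{\mathscr{R}}(c,u)
\quad\Longrightarrow\quad
\varepsilon\,m^{2\gamma}\,\partial_{t}c_{i} = -m\,\partial_{c_{i}}\overline{\mathscr{R}}(c,u).
\end{align*}
```

Under this reading, $\gamma=1$ gives $s=\varepsilon m^{2}$, which is the scaling used in this paper.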

  • Mean field scaling [31, 13]: $s=\Theta(m^{2})$, $|c_{i}(0)|=\Theta(m^{-1})$.

  • In this paper: $s=\varepsilon m^{2}$, $|c_{i}(0)|=\Theta(m^{-1})$, $\varepsilon\to 0$ after $m\to\infty$.

  • Classical scaling [29, 24]: $s=\Theta(1)$, $|c_{i}(0)|=\Theta(m^{-1/2})$.

As mentioned already, mean field scaling can exhibit better feature learning properties. In particular, the class of functions studied in the present paper can require a much larger sample size to learn under the classical scaling [37, 20, 47]. The choice of initialization in this paper is the same as in the mean field literature, with the difference that the relative learning rate $s$ is a factor $\varepsilon$ smaller, hence making it, in a sense, slightly closer to the classical scaling. It would be interesting to explore other scalings as well.

We also note that, while the limit of small $\varepsilon$ is interesting, directly setting $\varepsilon=0$ leads to a singular behavior (no matter how we rescale time, in this case learning takes place instantly, up to a certain critical degree). Formally, setting $\varepsilon=0$ corresponds to keeping second layer weights equal to their optimal values: a correct analysis of this case requires accounting for the role of the stepsize, rather than relying only on the gradient flow approximation.

More complex network models.

The choice of the neural network model in this paper was mainly dictated by the desire to avoid inessential technicalities. It would be important to move towards more realistic models.

First, we used projected gradient descent to constrain the weights' norms $\|u_i\|=1$. While this is a common theoretical device in studying single-index models [4, 11], we believe that techniques developed here can be extended to the more general case. Analogously, we could add biases to the network architecture and hence replace Eq. (2) by

$$f(x;a,u,b)=\frac{1}{m}\sum_{i=1}^{m}a_i\,\sigma(\langle u_i,x\rangle+b_i),\qquad a_1,b_1,\dots,a_m,b_m\in\mathbb{R},\ \ u_1,\dots,u_m\in\mathbb{R}^d. \tag{121}$$

With this change, the limiting mean-field dynamics will be an autonomous ODE system of $(a_i(t),b_i(t),s_i(t),r_i(t))_{i=1}^{m}$ where $r_i(t)=\|u_i(t)\|_2$. We expect that its evolution will be qualitatively similar to that of the simplified dynamics considered in the paper.
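For concreteness, the architecture with biases can be sketched in a few lines of Python. This is only an illustration of the formula above and not code used in the paper: the function names are hypothetical, and the default activation `np.tanh` is an arbitrary placeholder for $\sigma$.

```python
import numpy as np

def two_layer_with_bias(x, a, u, b, sigma=np.tanh):
    """Evaluate f(x; a, u, b) = (1/m) * sum_i a_i * sigma(<u_i, x> + b_i).

    x: input vector, shape (d,)
    a: second-layer weights, shape (m,)
    u: first-layer weights, one row per neuron, shape (m, d)
    b: biases, shape (m,)
    sigma: activation applied entrywise (np.tanh is a placeholder assumption)
    """
    pre_activations = u @ x + b          # <u_i, x> + b_i for each neuron i
    return np.mean(a * sigma(pre_activations))

def first_layer_norms(u):
    """Return r_i = ||u_i||_2, which is no longer fixed to 1 without projection."""
    return np.linalg.norm(u, axis=1)
```

Since the norms $r_i$ are free in this variant, they enter the reduced dynamics as additional state variables alongside $a_i$, $b_i$ and $s_i$.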

Second, the single-index model studied here is a simple example of a target function that requires feature learning. An obvious generalization is to consider multi-index models, as already discussed in Remark 4.3.

Finally, it would be interesting to generalize our analysis to classification losses.

Acknowledgments

This work was supported by the NSF through award DMS-2031883, the Simons Foundation through Award 814639 for the Collaboration on the Theoretical Foundations of Deep Learning, the NSF grant CCF-2006489 and the ONR grant N00014-18-1-2729, and a grant from Eric and Wendy Schmidt at the Institute for Advanced Studies. Part of this work was carried out while Andrea Montanari was on partial leave from Stanford and a Chief Scientist at Ndata Inc dba Project N. The present research is unrelated to AM’s activity while on leave.

References

  • Abbe et al. [2022] Emmanuel Abbe, Enric Boix Adsera, and Theodor Misiakiewicz. The merged-staircase property: a necessary and nearly sufficient condition for sgd learning of sparse functions on two-layer neural networks. In Conference on Learning Theory, pages 4782–4887. PMLR, 2022.
  • Ambrosio et al. [2005] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2005.
  • Arnaboldi et al. [2023] Luca Arnaboldi, Ludovic Stephan, Florent Krzakala, and Bruno Loureiro. From high-dimensional & mean-field dynamics to dimensionless odes: A unifying approach to sgd in two-layers networks. arXiv preprint arXiv:2302.05882, 2023.
  • Arous et al. [2021] Gerard Ben Arous, Reza Gheissari, and Aukosh Jagannath. Online stochastic gradient descent on non-convex losses from high-dimensional inference. The Journal of Machine Learning Research, 22(1):4788–4838, 2021.
  • Arpit et al. [2017] Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. In International conference on machine learning, pages 233–242. PMLR, 2017.
  • Ba et al. [2022] Jimmy Ba, Murat A Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, and Greg Yang. High-dimensional asymptotics of feature learning: How one gradient step improves the representation. In Advances in Neural Information Processing Systems, 2022.
  • Baldi and Hornik [1989] Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural networks, 2(1):53–58, 1989.
  • Barak et al. [2022] Boaz Barak, Benjamin L Edelman, Surbhi Goel, Sham Kakade, Eran Malach, and Cyril Zhang. Hidden progress in deep learning: Sgd learns parities near the computational limit. arXiv:2207.08799, 2022.
  • Bartlett et al. [2021] Peter L Bartlett, Andrea Montanari, and Alexander Rakhlin. Deep learning: a statistical viewpoint. Acta numerica, 30:87–201, 2021.
  • Berglund [2001] Nils Berglund. Perturbation theory of dynamical systems. arXiv preprint math/0111178, 2001.
  • Bietti et al. [2022] Alberto Bietti, Joan Bruna, Clayton Sanford, and Min Jae Song. Learning single-index models with shallow neural networks. Advances in Neural Information Processing Systems, 35:9768–9783, 2022.
  • Bodin and Macris [2021] Antoine Bodin and Nicolas Macris. Model, sample, and epoch-wise descents: exact solution of gradient flow in the random feature model. Advances in Neural Information Processing Systems, 34:21605–21617, 2021.
  • Chizat and Bach [2018] Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in neural information processing systems, 31, 2018.
  • Damian et al. [2022] Alexandru Damian, Jason Lee, and Mahdi Soltanolkotabi. Neural networks can learn representations with gradient descent. In Conference on Learning Theory, pages 5413–5452. PMLR, 2022.
  • [15] Encyclopedia of Mathematics. Bernoulli equation. http://encyclopediaofmath.org/index.php?title=Bernoulli_equation&oldid=40764.
  • Frye and Efthimiou [2012] Christopher Frye and Costas J Efthimiou. Spherical harmonics in p dimensions. arXiv preprint arXiv:1205.3548, 2012.
  • Fukumizu and Amari [2000] Kenji Fukumizu and Shun-ichi Amari. Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural networks, 13(3):317–327, 2000.
  • Ghorbani et al. [2020a] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Discussion of:“nonparametric regression using deep neural networks with relu activation function”. The Annals of Statistics, 48(4), 2020a.
  • Ghorbani et al. [2020b] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. When do neural networks outperform kernel methods? Advances in Neural Information Processing Systems, 33:14820–14830, 2020b.
  • Ghorbani et al. [2021] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-layers neural networks in high dimension. The Annals of Statistics, 49(2):1029–1054, 2021.
  • Ghosh et al. [2021] Nikhil Ghosh, Song Mei, and Bin Yu. The three stages of learning dynamics in high-dimensional kernel methods. In International Conference on Learning Representations, 2021.
  • Gissin et al. [2019] Daniel Gissin, Shai Shalev-Shwartz, and Amit Daniely. The implicit bias of depth: How incremental learning drives generalization. arXiv preprint arXiv:1909.12051, 2019.
  • Glorot and Bengio [2010] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010.
  • He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
  • Holmes [2013] Mark Holmes. Introduction to Perturbation Methods. Springer Texts in Applied Mathematics, 2013.
  • Jacot et al. [2018] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
  • Jin et al. [2019] Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade, and Michael I Jordan. A short note on concentration inequalities for random vectors with subgaussian norm. arXiv preprint arXiv:1902.03736, 2019.
  • Kalimeris et al. [2019] Dimitris Kalimeris, Gal Kaplun, Preetum Nakkiran, Benjamin Edelman, Tristan Yang, Boaz Barak, and Haofeng Zhang. Sgd on neural networks learns functions of increasing complexity. Advances in neural information processing systems, 32, 2019.
  • LeCun et al. [2002] Yann LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural networks: Tricks of the trade, pages 9–50. Springer, 2002.
  • Li et al. [2020] Zhiyuan Li, Yuping Luo, and Kaifeng Lyu. Towards resolving the implicit bias of gradient descent for matrix factorization: Greedy low-rank learning. arXiv preprint arXiv:2012.09839, 2020.
  • Mei et al. [2018a] Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for nonconvex losses. The Annals of Statistics, 46(6A):2747–2774, 2018a.
  • Mei et al. [2018b] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018b.
  • Mei et al. [2019] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. In Conference on Learning Theory, pages 2388–2464. PMLR, 2019.
  • Mei et al. [2022] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Generalization error of random feature and kernel methods: hypercontractivity and kernel matrix concentration. Applied and Computational Harmonic Analysis, 59:3–84, 2022.
  • Montanari and Zhong [2022] Andrea Montanari and Yiqiao Zhong. The interpolation phase transition in neural networks: Memorization and generalization under lazy training. The Annals of Statistics, 50(5):2816–2847, 2022.
  • O’Donnell [2014] Ryan O’Donnell. Analysis of boolean functions. Cambridge University Press, 2014.
  • Oymak and Soltanolkotabi [2020] Samet Oymak and Mahdi Soltanolkotabi. Toward moderate overparameterization: Global convergence guarantees for training shallow neural networks. IEEE Journal on Selected Areas in Information Theory, 1(1):84–105, 2020.
  • Pinkus [1999] Allan Pinkus. Approximation theory of the mlp model in neural networks. Acta numerica, 8:143–195, 1999.
  • Power et al. [2022] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv:2201.02177, 2022.
  • Rotskoff and Vanden-Eijnden [2018] Grant Rotskoff and Eric Vanden-Eijnden. Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks. Advances in neural information processing systems, 31, 2018.
  • Saad and Solla [1995] David Saad and Sara A Solla. On-line learning in soft committee machines. Physical Review E, 52(4):4225, 1995.
  • Santambrogio [2015] Filippo Santambrogio. Optimal transport for applied mathematicians. Birkhäuser, NY, 55(58-63):94, 2015.
  • Saxe et al. [2013] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
  • Wei et al. [2008] Haikun Wei, Jun Zhang, Florent Cousseau, Tomoko Ozeki, and Shun-ichi Amari. Dynamics of learning near singularities in layered networks. Neural computation, 20(3):813–843, 2008.
  • Yang and Hu [2020] Greg Yang and Edward J Hu. Feature learning in infinite-width neural networks. arXiv:2011.14522, 2020.
  • Yang and Hu [2021] Greg Yang and Edward J Hu. Tensor programs iv: Feature learning in infinite-width neural networks. In International Conference on Machine Learning, pages 11727–11737. PMLR, 2021.
  • Yehudai and Shamir [2019] Gilad Yehudai and Ohad Shamir. On the power and limitations of random features for understanding neural networks. Advances in Neural Information Processing Systems, 32, 2019.
  • Yoshida and Okada [2019] Yuki Yoshida and Masato Okada. Data-dependence of plateau phenomenon in learning with neural network—statistical mechanical analysis. Advances in Neural Information Processing Systems, 32, 2019.

Appendix A Proof of Proposition 1

By standard approximation theory arguments [38], it is sufficient to show that there exists an integrable function $a_d\in L^1(\mathbb{S}^{d-1},\mu_0)$ such that

$$\lim_{d\to\infty}\mathbb{E}\Big\{\Big(\int a_d(u)\,\sigma(\langle u,x\rangle)\,\mu_0(\mathrm{d}u)-\varphi(\langle u_*,x\rangle)\Big)^2\Big\}=0\,. \tag{122}$$

(We denote by $\mu_0$ the uniform probability measure over $\mathbb{S}^{d-1}$.)

Denote by $P_{d,k}$ the Gegenbauer polynomial of order $d$ and degree $k$ (see, e.g., [34]). Namely, $(P_{d,k}:k\ge 0)$ form an orthogonal system with respect to the measure with density $\propto(1-t^2)^{(d-3)/2}$, $t\in[-1,1]$. Recall that for fixed $v,w$ of norm $1$, the polynomials $P_{d,j}(\langle v,u\rangle)$, $P_{d,k}(\langle w,u\rangle)$ are spherical harmonics satisfying

$$\int P_{d,j}(\langle v,u\rangle)\,P_{d,k}(\langle w,u\rangle)\,\mu_0(\mathrm{d}u)=\delta_{kj}\,P_{d,k}(\langle v,w\rangle)\,. \tag{123}$$

Also, $P_{d,k}(1)=B_{d,k}$ is the dimension of the space of spherical harmonics of degree $k$, whence $(P_{d,k}(\cdot)/B_{d,k}^{1/2}:\,k\ge 0)$ form an orthonormal set. We will denote by $c_{d,k}(\sigma)$ the $k$-th coefficient of the expansion of $\sigma(\,\cdot\,\sqrt{d})$ in this basis, and similarly for $\varphi(\,\cdot\,\sqrt{d})$, with coefficients $c_{d,k}(\varphi)$, namely

\begin{align*}
\sigma(t\sqrt{d})&=\sum_{k=0}^{\infty}\frac{c_{d,k}(\sigma)}{B_{d,k}^{1/2}}\,P_{d,k}(t)\,,\\
\varphi(t\sqrt{d})&=\sum_{k=0}^{\infty}\frac{c_{d,k}(\varphi)}{B_{d,k}^{1/2}}\,P_{d,k}(t)\,.
\end{align*}

As shown for instance in [34], $\lim_{d\to\infty}c_{d,k}(\sigma)=c_k(\sigma)$ is the $k$-th Hermite coefficient of $\sigma$, and similarly for $c_{d,k}(\varphi)$. In particular, $c_{d,k}(\sigma)\neq 0$ for all $d$ large enough. For $N$ a large integer let

$$a_d(u)=\sum_{k=0}^{N}\frac{c_{d,k}(\varphi)}{c_{d,k}(\sigma)}\,P_{d,k}(\langle u,u_*\rangle)\,.$$

By Eq. (123), we have, for $\|z\|=\sqrt{d}$,

\begin{align*}
\int a_d(u)\,\sigma(\langle u,z\rangle)\,\mu_0(\mathrm{d}u)&=\sum_{k=0}^{N}\frac{c_{d,k}(\varphi)}{c_{d,k}(\sigma)}\,\frac{c_{d,k}(\sigma)}{B_{d,k}^{1/2}}\,P_{d,k}(\langle u_*,z\rangle/\sqrt{d})\\
&=\sum_{k=0}^{N}\frac{c_{d,k}(\varphi)}{B_{d,k}^{1/2}}\,P_{d,k}(\langle u_*,z\rangle/\sqrt{d})\,.
\end{align*}

Denoting by $z$ a uniform random vector on the sphere of radius $\sqrt{d}$, and $r=\|x\|_2/\sqrt{d}$, we have

\begin{align*}
\mathbb{E}\Big\{\Big(\int a_d(u)\,\sigma(\langle u,x\rangle)\,\mu_0(\mathrm{d}u)-\varphi(\langle u_*,x\rangle)\Big)^2\Big\}
&=\mathbb{E}\Big\{\Big(\int a_d(u)\,\sigma(r\langle u,z\rangle)\,\mu_0(\mathrm{d}u)-\varphi(r\langle u_*,z\rangle)\Big)^2\Big\}\\
&\stackrel{(*)}{\le}\mathbb{E}\Big\{\Big(\int a_d(u)\,\sigma(\langle u,z\rangle)\,\mu_0(\mathrm{d}u)-\varphi(\langle u_*,z\rangle)\Big)^2\Big\}+\frac{C_N L^2}{d}\\
&\le\mathbb{E}\big\{\varphi_{>N}(\langle u_*,z\rangle)^2\big\}+\frac{C_N L^2}{d}\,,
\end{align*}

where in $(*)$ we used concentration of $\chi$-squared random variables and Lipschitzness of $\sigma$ and $\varphi$, and where $\varphi_{>N}(t\sqrt{d})$ denotes the projection of $\varphi(t\sqrt{d})$ orthogonal to polynomials of degree at most $N$ (with respect to the measure with density proportional to $(1-t^2)^{(d-3)/2}$ on $[-1,1]$). Therefore

$$\limsup_{d\to\infty}\mathbb{E}\Big\{\Big(\int a_d(u)\,\sigma(\langle u,x\rangle)\,\mu_0(\mathrm{d}u)-\varphi(\langle u_*,x\rangle)\Big)^2\Big\}\le\sum_{k=N+1}^{\infty}c_k(\varphi)^2\,.$$

The claim (122) follows by taking $N\to\infty$.
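The last step relies on the fact that the Hermite tail $\sum_{k>N}c_k(\varphi)^2$ vanishes as $N\to\infty$ for square-integrable $\varphi$. The following Python sketch, which is not part of the proof, computes Hermite coefficients by Gauss-Hermite quadrature and evaluates this tail. The function names, the number of quadrature nodes, and the test function $\varphi(t)=t^3$ are illustrative assumptions.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

def hermite_coefficients(phi, k_max, n_nodes=40):
    """Return (c_0, ..., c_{k_max}) and ||phi||^2 under the standard Gaussian.

    c_k = E[phi(G) h_k(G)], where h_k = He_k / sqrt(k!) are the orthonormal
    (probabilists') Hermite polynomials. Expectations are approximated by
    Gauss-Hermite quadrature, which is exact for low-degree polynomials.
    """
    nodes, weights = hermegauss(n_nodes)
    weights = weights / math.sqrt(2.0 * math.pi)   # normalize to N(0, 1)
    values = phi(nodes)
    coeffs = []
    for k in range(k_max + 1):
        unit = np.zeros(k + 1)
        unit[k] = 1.0                               # selects He_k in hermeval
        h_k = hermeval(nodes, unit) / math.sqrt(math.factorial(k))
        coeffs.append(np.sum(weights * values * h_k))
    return np.array(coeffs), np.sum(weights * values**2)

def tail_energy(phi, N, n_nodes=40):
    """Return sum_{k > N} c_k(phi)^2, computed as ||phi||^2 - sum_{k <= N} c_k^2."""
    coeffs, norm_sq = hermite_coefficients(phi, N, n_nodes)
    return norm_sq - np.sum(coeffs**2)

if __name__ == "__main__":
    cube = lambda t: t**3                           # illustrative target
    for N in range(5):
        print(N, tail_energy(cube, N))
```

For $\varphi(t)=t^3$ one has $c_1(\varphi)=3$ and $c_3(\varphi)=\sqrt{6}$, so the tail equals $6$ for $N\in\{1,2\}$ and vanishes for $N\ge 3$, up to floating-point error.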

Appendix B Appendix to Section 4

B.1 Proof of Proposition 2

When $x\sim\mathsf{N}(0,I_d)$ and $u,u'\in\mathbb{S}^{d-1}$, $\begin{pmatrix}\langle u,x\rangle\\ \langle u',x\rangle\end{pmatrix}\sim\mathsf{N}\left(0,\begin{pmatrix}1&\langle u,u'\rangle\\ \langle u,u'\rangle&1\end{pmatrix}\right)$. Thus

\begin{align*}
\mathscr{R}(a,u)&=\frac{1}{2}\,\mathbb{E}\Big(\varphi(\langle u_*,x\rangle)-\frac{1}{m}\sum_{i=1}^{m}a_i\,\sigma(\langle u_i,x\rangle)\Big)^2\\
&=\frac{1}{2}\,\mathbb{E}\big[\varphi(\langle u_*,x\rangle)^2\big]-\frac{1}{m}\sum_{i=1}^{m}a_i\,\mathbb{E}\big[\varphi(\langle u_*,x\rangle)\,\sigma(\langle u_i,x\rangle)\big]+\frac{1}{2}\frac{1}{m^2}\sum_{i,j=1}^{m}a_ia_j\,\mathbb{E}\big[\sigma(\langle u_i,x\rangle)\,\sigma(\langle u_j,x\rangle)\big]\\
&=\frac{1}{2}\|\varphi\|^2_{L^2}-\frac{1}{m}\sum_{i=1}^{m}a_i\,V(\langle u_*,u_i\rangle)+\frac{1}{2}\frac{1}{m^2}\sum_{i,j=1}^{m}a_ia_j\,U(\langle u_i,u_j\rangle) \tag{124}\\
&=\frac{1}{2}\|\varphi\|^2_{L^2}-\frac{1}{m}\sum_{i=1}^{m}a_i\,V(s_i)+\frac{1}{2}\frac{1}{m^2}\sum_{i,j=1}^{m}a_ia_j\,U(r_{ij})\,.
\end{align*}

This proves (15). Equation (16) follows directly:

$$\varepsilon\,\partial_{t}a_{i} = -m\,\partial_{a_{i}}\mathscr{R}(a,u) = V(s_{i}) - \frac{1}{m}\sum_{j=1}^{m} a_{j} U(r_{ij})\,.$$
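The identity above can be checked numerically. The following is a minimal sketch, not the authors' code: the polynomial forms chosen for `V` and `U`, the dimensions, and the helper names `risk` and `a_rhs` are all assumptions made for illustration. It compares $-m\,\partial_{a_i}\mathscr{R}$, computed by central finite differences, with the closed-form right-hand side.

```python
import numpy as np

# Hypothetical smooth stand-ins for the paper's V and U (illustration only).
def V(s):
    return s + 0.5 * s**2

def U(r):
    return r + 0.25 * r**2

def risk(a, u, u_star, phi_norm_sq=1.0):
    """Risk written in terms of s_i = <u_*, u_i> and r_ij = <u_i, u_j>."""
    m = len(a)
    s = u @ u_star          # shape (m,)
    r = u @ u.T             # shape (m, m), symmetric
    return 0.5 * phi_norm_sq - (a @ V(s)) / m + 0.5 * (a @ U(r) @ a) / m**2

def a_rhs(a, u, u_star):
    """Closed-form right-hand side V(s_i) - (1/m) sum_j a_j U(r_ij)."""
    m = len(a)
    return V(u @ u_star) - (U(u @ u.T) @ a) / m

rng = np.random.default_rng(0)
m, d = 5, 8                                   # small sizes, chosen arbitrarily
u = rng.normal(size=(m, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)  # rows are unit vectors
u_star = np.eye(d)[0]                          # unit target direction
a = rng.normal(size=m)

# Central finite differences of the risk with respect to each a_i.
h = 1e-5
grad_fd = np.array([
    (risk(a + h * e, u, u_star) - risk(a - h * e, u, u_star)) / (2 * h)
    for e in np.eye(m)
])
max_err = np.max(np.abs(-m * grad_fd - a_rhs(a, u, u_star)))
print(max_err)
```

Because the risk is quadratic in $a$, the central difference is exact up to rounding, so `max_err` should be at the level of floating-point noise.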

To obtain equations (17)-(19), we now take gradients in (124):

$$
\begin{aligned}
\partial_{t}u_{i} &= -m\,(I_{d}-u_{i}u_{i}^{\top})\nabla_{u_{i}}\mathscr{R}(a,u)\\
&= a_{i}\left(I_{d}-u_{i}u_{i}^{\top}\right)\left(V'(\langle u_{*},u_{i}\rangle)\,u_{*} - \frac{1}{m}\sum_{j=1}^{m} a_{j} U'(\langle u_{i},u_{j}\rangle)\,u_{j}\right)\\
&= a_{i}\left(V'(\langle u_{*},u_{i}\rangle)\,(u_{*}-u_{i}u_{i}^{\top}u_{*}) - \frac{1}{m}\sum_{j=1}^{m} a_{j} U'(\langle u_{i},u_{j}\rangle)\,(u_{j}-u_{i}u_{i}^{\top}u_{j})\right)\\
&= a_{i}\left(V'(s_{i})\,(u_{*}-s_{i}u_{i}) - \frac{1}{m}\sum_{j=1}^{m} a_{j} U'(r_{ij})\,(u_{j}-r_{ij}u_{i})\right)\,.
\end{aligned}
$$

Thus

$$
\begin{aligned}
\partial_{t}s_{i} &= \langle u_{*},\partial_{t}u_{i}\rangle\\
&= a_{i}\left(V'(s_{i})\,(\langle u_{*},u_{*}\rangle - s_{i}\langle u_{*},u_{i}\rangle) - \frac{1}{m}\sum_{j=1}^{m} a_{j} U'(r_{ij})\,(\langle u_{*},u_{j}\rangle - r_{ij}\langle u_{*},u_{i}\rangle)\right)\\
&= a_{i}\left(V'(s_{i})\,(1-s_{i}^{2}) - \frac{1}{m}\sum_{j=1}^{m} a_{j} U'(r_{ij})\,(s_{j}-r_{ij}s_{i})\right)\,.
\end{aligned}
$$
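As a sanity check on this step, the sketch below evaluates the velocity $\partial_t u_i$ from its final closed form and confirms two things: that it is tangent to the sphere at $u_i$, and that $\langle u_*, \partial_t u_i\rangle$ matches the scalar formula for $\partial_t s_i$. The derivatives `dV` and `dU` are hypothetical stand-ins (the paper's $V'$ and $U'$ depend on the activation and target), and the function names are illustrative only.

```python
import numpy as np

# Hypothetical stand-ins for V' and U' (illustration only).
def dV(s):
    return 1.0 + s

def dU(r):
    return 1.0 + 0.5 * r

def u_velocity(a, u, u_star):
    """a_i * ( V'(s_i)(u_* - s_i u_i) - (1/m) sum_j a_j U'(r_ij)(u_j - r_ij u_i) )."""
    m = len(a)
    s = u @ u_star
    r = u @ u.T
    w = dU(r) * a[None, :]                     # w[i, j] = a_j U'(r_ij)
    star = dV(s)[:, None] * (u_star[None, :] - s[:, None] * u)
    pair = (w @ u - (w * r).sum(axis=1)[:, None] * u) / m
    return a[:, None] * (star - pair)

def s_velocity(a, s, r):
    """a_i * ( V'(s_i)(1 - s_i^2) - (1/m) sum_j a_j U'(r_ij)(s_j - r_ij s_i) )."""
    m = len(a)
    w = dU(r) * a[None, :]
    pair = (w * (s[None, :] - r * s[:, None])).sum(axis=1) / m
    return a * (dV(s) * (1.0 - s**2) - pair)

rng = np.random.default_rng(1)
m, d = 6, 10                                   # arbitrary small sizes
u = rng.normal(size=(m, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)  # unit-norm neurons
u_star = np.eye(d)[0]                          # unit-norm target direction
a = rng.normal(size=m)

du = u_velocity(a, u, u_star)
tangent_err = np.max(np.abs((du * u).sum(axis=1)))   # <u_i, d/dt u_i> should vanish
s_err = np.max(np.abs(du @ u_star - s_velocity(a, u @ u_star, u @ u.T)))
print(tangent_err, s_err)
```

Both quantities should vanish up to rounding error; the check relies on $\|u_i\|_2 = \|u_*\|_2 = 1$, which is what makes the reduction to $(a, s, r)$ close.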

This gives (17). Finally, a similar computation yields $\partial_{t}r_{ij} = \langle\partial_{t}u_{i},u_{j}\rangle + \langle u_{i},\partial_{t}u_{j}\rangle$. We compute only the first term, since the second is obtained by exchanging $i$ and $j$:

$$
\begin{aligned}
\langle\partial_{t}u_{i},u_{j}\rangle &= a_{i}\left(V'(s_{i})\,(\langle u_{j},u_{*}\rangle - s_{i}\langle u_{j},u_{i}\rangle) - \frac{1}{m}\sum_{p=1}^{m} a_{p} U'(r_{ip})\,(\langle u_{j},u_{p}\rangle - r_{ip}\langle u_{j},u_{i}\rangle)\right)\\
&= a_{i}\left(V'(s_{i})\,(s_{j}-s_{i}r_{ij}) - \frac{1}{m}\sum_{p=1}^{m} a_{p} U'(r_{ip})\,(r_{jp}-r_{ip}r_{ij})\right)\,.
\end{aligned}
$$

Adding the symmetric term $\langle u_{i},\partial_{t}u_{j}\rangle$, we obtain (18)-(19).
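The overlap dynamics can be checked in the same spirit. In the sketch below (illustrative names, hypothetical `dV` and `dU`, arbitrary sizes), $\partial_t u_i$ is computed from the projected form $a_i (I_d - u_i u_i^{\top})(\cdots)$, and $\langle\partial_t u_i, u_j\rangle + \langle u_i, \partial_t u_j\rangle$ is compared with the symmetrized closed form written only in terms of $(a, s, r)$.

```python
import numpy as np

# Hypothetical stand-ins for V' and U' (illustration only).
def dV(s):
    return 1.0 + s

def dU(r):
    return 1.0 + 0.5 * r

def u_velocity_projected(a, u, u_star):
    """a_i (I - u_i u_i^T) ( V'(s_i) u_* - (1/m) sum_j a_j U'(r_ij) u_j )."""
    m = len(a)
    s = u @ u_star
    r = u @ u.T
    g = dV(s)[:, None] * u_star[None, :] - ((dU(r) * a[None, :]) @ u) / m
    g_tan = g - (g * u).sum(axis=1, keepdims=True) * u   # apply I - u_i u_i^T row-wise
    return a[:, None] * g_tan

def half_r_velocity(a, s, r):
    """F[i, j] = a_i ( V'(s_i)(s_j - s_i r_ij) - (1/m) sum_p a_p U'(r_ip)(r_jp - r_ip r_ij) )."""
    m = len(a)
    w = dU(r) * a[None, :]                               # w[i, p] = a_p U'(r_ip)
    star = dV(s)[:, None] * (s[None, :] - s[:, None] * r)
    pair = (w @ r.T - (w * r).sum(axis=1)[:, None] * r) / m
    return a[:, None] * (star - pair)

rng = np.random.default_rng(2)
m, d = 6, 10                                             # arbitrary small sizes
u = rng.normal(size=(m, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)
u_star = np.eye(d)[0]
a = rng.normal(size=m)

du = u_velocity_projected(a, u, u_star)
dr_direct = du @ u.T + u @ du.T                          # <du_i, u_j> + <u_i, du_j>
F = half_r_velocity(a, u @ u_star, u @ u.T)
dr_formula = F + F.T                                     # add the term with i and j exchanged
r_err = np.max(np.abs(dr_direct - dr_formula))
diag_err = np.max(np.abs(np.diag(dr_formula)))           # r_ii = 1 must be preserved
print(r_err, diag_err)
```

The vanishing diagonal reflects that the flow keeps each $u_i$ on the unit sphere, so $r_{ii} = 1$ for all times.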

B.2 Proof of Corollary 1

First, note that the proof of Lemma 1 yields the following a priori estimate on the magnitude of the $a_{i}^{0}$'s:

$$\sup_{1\leq i\leq m}\left|a_{i}^{0}(t)\right| \leq M\left(1+\frac{t}{\varepsilon}\right),\quad \forall t\geq 0, \tag{125}$$

where $M$ depends only on the $M_{i}$'s in Assumptions A1-A3. Arguing as in the proof of Proposition 3, we obtain that for any $t\in[0,T]$ and $i\in[m]$,

$$
\begin{aligned}
\left|\partial_{t}(a_{i}-a_{i}^{0})\right| \leq\ & \frac{M}{\varepsilon}\left(\left|s_{i}-s_{i}^{0}\right| + \frac{1}{m}\sum_{j=1}^{m}\left|a_{j}-a_{j}^{0}\right|\right) + \frac{M(1+t/\varepsilon)}{\varepsilon}\cdot\frac{1}{m}\sum_{j=1}^{m}\left|r_{ij}-r_{ij}^{0}\right|,\\
\left|\partial_{t}(s_{i}-s_{i}^{0})\right| \leq\ & M(1+t/\varepsilon)\cdot\left(\left|a_{i}-a_{i}^{0}\right| + \frac{1}{m}\sum_{j=1}^{m}\left|a_{j}-a_{j}^{0}\right|\right)\\
& + M(1+t/\varepsilon)^{2}\cdot\left(\left|s_{i}-s_{i}^{0}\right| + \frac{1}{m}\sum_{j=1}^{m}\left(\left|s_{j}-s_{j}^{0}\right| + \left|r_{ij}-r_{ij}^{0}\right|\right)\right),
\end{aligned}
$$

and for $1\leq i\neq j\leq m$,

$$
\begin{aligned}
\left|\partial_{t}(r_{ij}-r_{ij}^{0})\right| \leq\ & M(1+t/\varepsilon)\left(\left|a_{i}-a_{i}^{0}\right| + \left|a_{j}-a_{j}^{0}\right| + \left|s_{i}-s_{i}^{0}\right| + \left|s_{j}-s_{j}^{0}\right| + \frac{1}{m}\sum_{p=1}^{m}\left|a_{p}-a_{p}^{0}\right|\right)\\
& + M(1+t/\varepsilon)^{2}\cdot\left(\left|r_{ij}-r_{ij}^{0}\right| + \frac{1}{m}\sum_{p=1}^{m}\left(\left|r_{ip}-r_{ip}^{0}\right| + \left|r_{jp}-r_{jp}^{0}\right|\right)\right).
\end{aligned}
$$

Therefore, we deduce that

$$
\begin{aligned}
\partial_{t}\sum_{i=1}^{m}(a_{i}-a_{i}^{0})^{2} \leq\ & \frac{M}{\varepsilon}\sum_{i=1}^{m}(s_{i}-s_{i}^{0})^{2} + \frac{M(1+t/\varepsilon)}{\varepsilon}\left(\sum_{i=1}^{m}(a_{i}-a_{i}^{0})^{2} + \frac{1}{m}\sum_{i,j=1}^{m}(r_{ij}-r_{ij}^{0})^{2}\right),\\
\partial_{t}\sum_{i=1}^{m}(s_{i}-s_{i}^{0})^{2} \leq\ & M(1+t/\varepsilon)\sum_{i=1}^{m}(a_{i}-a_{i}^{0})^{2} + M(1+t/\varepsilon)^{2}\left(\sum_{i=1}^{m}(s_{i}-s_{i}^{0})^{2} + \frac{1}{m}\sum_{i,j=1}^{m}(r_{ij}-r_{ij}^{0})^{2}\right),\\
\partial_{t}\left(\frac{1}{m}\sum_{i,j=1}^{m}(r_{ij}-r_{ij}^{0})^{2}\right) \leq\ & M(1+t/\varepsilon)\left(\sum_{i=1}^{m}(a_{i}-a_{i}^{0})^{2} + \sum_{i=1}^{m}(s_{i}-s_{i}^{0})^{2}\right) + M(1+t/\varepsilon)^{2}\cdot\frac{1}{m}\sum_{i,j=1}^{m}(r_{ij}-r_{ij}^{0})^{2}.
\end{aligned}
$$

Defining

$$G(t) = \sum_{i=1}^{m}\left(a_{i}(t)-a_{i}^{0}(t)\right)^{2} + \sum_{i=1}^{m}\left(s_{i}(t)-s_{i}^{0}(t)\right)^{2} + \frac{1}{m}\sum_{i,j=1}^{m}\left(r_{ij}(t)-r_{ij}^{0}(t)\right)^{2},$$

we obtain $G'(t)\leq (M(1+t)^{2}/\varepsilon^{2})\,G(t)$. Applying Grönwall's inequality yields

$$G(t) \leq G(0)\exp\left(\int_{0}^{t}\left(M(1+s)^{2}/\varepsilon^{2}\right)\mathrm{d}s\right) \leq G(0)\exp\left(Mt(1+t)^{2}/\varepsilon^{2}\right),\quad \forall t\in[0,T].$$
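For completeness, the two elementary steps behind the differential inequality for $G$ and the Grönwall bound can be spelled out as follows. This is a sketch under the assumption $\varepsilon\in(0,1]$, which is not stated in this part of the text, and with the constant $M$ allowed to change by an absolute factor.

$$
\begin{aligned}
& 1+\frac{t}{\varepsilon} \leq \frac{1+t}{\varepsilon}
\ \Longrightarrow\
\max\left\{\frac{M}{\varepsilon},\ \frac{M(1+t/\varepsilon)}{\varepsilon},\ M(1+t/\varepsilon),\ M(1+t/\varepsilon)^{2}\right\} \leq \frac{M(1+t)^{2}}{\varepsilon^{2}},\\
& \int_{0}^{t}(1+s)^{2}\,\mathrm{d}s = \frac{(1+t)^{3}-1}{3} = \frac{t\left((1+t)^{2}+(1+t)+1\right)}{3} \leq t(1+t)^{2}.
\end{aligned}
$$

Under this assumption, summing the three differential inequalities bounds $G'(t)$ by $3M(1+t)^{2}G(t)/\varepsilon^{2}$, and the factor $3$ is absorbed into $M$. The second line gives the last inequality in the Grönwall estimate.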

Recall that $\{\langle u_{i}(0),u_{*}\rangle\}_{i\in[m]}\sim_{\mathrm{i.i.d.}}\mathcal{N}(0,1/d)$ and that, for any $i\in[m]$, $\{\langle u_{i}(0),u_{j}(0)\rangle\}_{j\neq i}\sim_{\mathrm{i.i.d.}}\mathcal{N}(0,1/d)$. Using standard concentration inequalities, we know that

G(0)=i=1mui(0),u2+1mijui(0),uj(0)2Cmd𝐺0superscriptsubscript𝑖1𝑚superscriptsubscript𝑢𝑖0subscript𝑢21𝑚subscript𝑖𝑗superscriptsubscript𝑢𝑖0subscript𝑢𝑗02𝐶𝑚𝑑G(0)=\sum_{i=1}^{m}\langle u_{i}(0),u_{*}\rangle^{2}+\frac{1}{m}\sum_{i\neq j}% \langle u_{i}(0),u_{j}(0)\rangle^{2}\leq C\frac{m}{d}italic_G ( 0 ) = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ⟨ italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( 0 ) , italic_u start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ⟩ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + divide start_ARG 1 end_ARG start_ARG italic_m end_ARG ∑ start_POSTSUBSCRIPT italic_i ≠ italic_j end_POSTSUBSCRIPT ⟨ italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( 0 ) , italic_u start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( 0 ) ⟩ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ≤ italic_C divide start_ARG italic_m end_ARG start_ARG italic_d end_ARG (126)

with probability at least $1 - \exp(-C'm)$, where $C$ and $C'$ are both absolute constants. Therefore,

\begin{align}
\sup_{t\in[0,T]} \|a(t) - a^0(t)\|_2 \le\, & C\sqrt{\frac{m}{d}}\, \exp\left(MT(1+T)^2/\varepsilon^2\right), \tag{127}\\
\sup_{t\in[0,T]} \|s(t) - s^0(t)\|_2 \le\, & C\sqrt{\frac{m}{d}}\, \exp\left(MT(1+T)^2/\varepsilon^2\right), \tag{128}\\
\sup_{t\in[0,T]} \|R(t) - R^0(t)\|_{\rm F} \le\, & C\,\frac{m}{\sqrt{d}}\, \exp\left(MT(1+T)^2/\varepsilon^2\right). \tag{129}
\end{align}
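The $m/d$ scaling of $G(0)$ in Eq. (126) can be illustrated by simulation. The sketch below assumes, purely for illustration, that the first-layer weights are initialized with i.i.d. $\mathcal{N}(0,1/d)$ coordinates (the paper's initialization may differ, e.g. uniform on the sphere) and takes $u_*$ to be the first canonical basis vector; all function names are hypothetical.

```python
import math
import random


def sample_gaussian_direction(d, rng):
    """Draw a vector with i.i.d. N(0, 1/d) coordinates (assumed initialization)."""
    scale = 1.0 / math.sqrt(d)
    return [rng.gauss(0.0, scale) for _ in range(d)]


def dot(x, y):
    """Euclidean inner product of two lists of floats."""
    return sum(xi * yi for xi, yi in zip(x, y))


def initial_overlap_energy(m, d, seed=0):
    """Compute G(0) = sum_i <u_i, u_*>^2 + (1/m) sum_{i != j} <u_i, u_j>^2."""
    rng = random.Random(seed)
    # Assumption: u_* is the first basis vector; by rotation invariance of the
    # Gaussian initialization the choice of unit vector does not matter.
    u_star = [1.0] + [0.0] * (d - 1)
    us = [sample_gaussian_direction(d, rng) for _ in range(m)]
    signal = sum(dot(u, u_star) ** 2 for u in us)
    cross = sum(
        dot(us[i], us[j]) ** 2 for i in range(m) for j in range(m) if i != j
    )
    return signal + cross / m


if __name__ == "__main__":
    m, d = 30, 100
    g0 = initial_overlap_energy(m, d)
    print("G(0) =", g0, " ratio to m/d =", g0 / (m / d))
```

Under this assumed initialization the expectation of $G(0)$ is $(2m-1)/d$, so the printed ratio should be of order one, consistent with a bound of the form $Cm/d$.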

Next we upper bound the risk difference, by direct calculation,

$$
\begin{aligned}
& \big| \mathscr{R}(a(t),u(t)) - \mathscr{R}_{\rm red}(a^0(t),s^0(t),R^0(t)) \big| \\
=\, & \big| \mathscr{R}_{\rm red}(a(t),s(t),R(t)) - \mathscr{R}_{\rm red}(a^0(t),s^0(t),R^0(t)) \big| \\
\le\, & \frac{1}{m}\left| a^\top V(s) - (a^0)^\top V(s^0) \right| + \frac{1}{2m^2}\left| a^\top U(R)\, a - (a^0)^\top U(R^0)\, a^0 \right| \\
\le\, & \frac{1}{m}\left( \sqrt{m}\, \|V\|_\infty \|a - a^0\|_2 + \|a^0\|_2 \|V'\|_\infty \|s - s^0\|_2 \right) \\
& + \frac{1}{2m^2}\left( \|U(R) - U(R^0)\|_{\rm op} \|a^0\|_2^2 + 2\|U(R)\|_{\rm op} \|a^0\|_2 \|a - a^0\|_2 \right) \\
\le\, & \frac{M(1+t/\varepsilon)}{\sqrt{m}}\left( \|a - a^0\|_2 + \|s - s^0\|_2 \right) + \frac{M(1+t/\varepsilon)^2}{2m} \|R - R^0\|_{\rm F} \\
\le\, & \frac{M}{\sqrt{d}} \exp\left(Mt(1+t)^2/\varepsilon^2\right)
\end{aligned}
$$

with probability at least $1 - \exp(-C'm)$, where the constant $M$ only depends on the $M_i$'s from Assumptions A1-A3. The conclusion now follows from taking the supremum over all $t \in [0,T]$. This completes the proof of Corollary 1.

B.3 Proof of Proposition 3

We consider $r_{ij}^\perp = r_{ij} - s_i s_j = \langle u_i, u_j\rangle - \langle u_i, u_*\rangle \langle u_*, u_j\rangle$, the part of the dot product between $u_i$ and $u_j$ that lies outside the relevant subspace spanned by $u_*$. We show that these variables satisfy the ODEs

$$
\begin{split}
\partial_t r_{ij}^\perp =\, & -a_i\left( V'(s_i)\cdot s_i r_{ij}^\perp + \frac{1}{m}\sum_{p=1}^m a_p U'(r_{ip}) \big(r_{jp}^\perp - r_{ip} r_{ij}^\perp\big) \right) \\
& -a_j\left( V'(s_j)\cdot s_j r_{ij}^\perp + \frac{1}{m}\sum_{p=1}^m a_p U'(r_{jp}) \big(r_{ip}^\perp - r_{jp} r_{ij}^\perp\big) \right).
\end{split} \tag{130}
$$

By definition of $r_{ij}^\perp$, we readily see that

$$
\partial_t r_{ij}^\perp = \partial_t r_{ij} - s_i\, \partial_t s_j - s_j\, \partial_t s_i.
$$

Plugging in Eqs. (17) to (19) gives

$$
\begin{aligned}
\partial_t r_{ij}^\perp =\, & a_i\left( V'(s_i)\big(s_j s_i^2 - s_i r_{ij}\big) - \frac{1}{m}\sum_{p=1}^m a_p U'(r_{ip}) \big(r_{jp} - s_j s_p - r_{ip} r_{ij} + s_i s_j r_{ip}\big) \right) \\
& + a_j\left( V'(s_j)\big(s_i s_j^2 - s_j r_{ij}\big) - \frac{1}{m}\sum_{p=1}^m a_p U'(r_{jp}) \big(r_{ip} - s_i s_p - r_{jp} r_{ij} + s_i s_j r_{jp}\big) \right) \\
=\, & -a_i\left( V'(s_i)\, s_i r_{ij}^\perp + \frac{1}{m}\sum_{p=1}^m a_p U'(r_{ip}) \big(r_{jp}^\perp - r_{ip} r_{ij}^\perp\big) \right) \\
& -a_j\left( V'(s_j)\, s_j r_{ij}^\perp + \frac{1}{m}\sum_{p=1}^m a_p U'(r_{jp}) \big(r_{ip}^\perp - r_{jp} r_{ij}^\perp\big) \right).
\end{aligned}
$$

This proves Eq. (130).
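The last equality in the derivation rests on two polynomial identities in the overlaps: $s_j s_i^2 - s_i r_{ij} = -s_i r_{ij}^\perp$ and $r_{jp} - s_j s_p - r_{ip} r_{ij} + s_i s_j r_{ip} = r_{jp}^\perp - r_{ip} r_{ij}^\perp$. A small sketch that checks them on random inputs is given below; the helper names are hypothetical, and the overlaps are sampled as arbitrary scalars since the identities are purely algebraic.

```python
import random


def perp(r, s_a, s_b):
    """Overlap with the u_* component removed: r - s_a * s_b."""
    return r - s_a * s_b


def raw_terms(si, sj, sp, rij, rip, rjp):
    """Factors multiplying V'(s_i) and U'(r_ip) before simplification."""
    v_term = sj * si ** 2 - si * rij
    u_term = rjp - sj * sp - rip * rij + si * sj * rip
    return v_term, u_term


def simplified_terms(si, sj, sp, rij, rip, rjp):
    """The same factors written in terms of the perpendicular overlaps."""
    rij_perp = perp(rij, si, sj)
    rjp_perp = perp(rjp, sj, sp)
    v_term = -si * rij_perp
    u_term = rjp_perp - rip * rij_perp
    return v_term, u_term


def max_discrepancy(trials=1000, seed=0):
    """Largest absolute gap between the two forms over random scalar inputs."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        args = [rng.uniform(-1.0, 1.0) for _ in range(6)]
        raw = raw_terms(*args)
        simple = simplified_terms(*args)
        worst = max(worst, abs(raw[0] - simple[0]), abs(raw[1] - simple[1]))
    return worst


if __name__ == "__main__":
    print(max_discrepancy())
```

The terms with $i$ and $j$ exchanged follow from the same identities by symmetry, so only the $a_i$ part is checked here.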

Lemma 1.

If Assumptions A1-A3 hold, then we have for any fixed $T > 0$:

$$
\sup_{t\in[0,T]} \sum_{i,j=1}^m r_{ij}^\perp(t)^2 \le m \exp\left(MT(1+T)^2/\varepsilon^2\right).
$$
Proof.

To begin with, using Eq. (130), we obtain that

$$
\begin{aligned}
\partial_t\left( \sum_{i,j=1}^m r_{ij}^\perp(t)^2 \right) &= 2\sum_{i,j=1}^m r_{ij}^\perp \times \partial_t r_{ij}^\perp \\
&= -4\sum_{i,j=1}^m a_i r_{ij}^\perp \left( V'(s_i)\cdot s_i r_{ij}^\perp + \frac{1}{m}\sum_{p=1}^m a_p U'(r_{ip}) \big(r_{jp}^\perp - r_{ip} r_{ij}^\perp\big) \right).
\end{aligned}
$$

Using the ODEs for the $a_i$'s, we obtain that

$$
\begin{aligned}
\left| \partial_t a_i \right| =\, & \frac{1}{\varepsilon}\left| V(s_i) - \frac{1}{m}\sum_{j=1}^m a_j U(r_{ij}) \right| = \frac{1}{\varepsilon}\left| \mathbb{E}\left[ \varphi(\langle u_*, x\rangle)\, \sigma(\langle u_i, x\rangle) \right] - \frac{1}{m}\sum_{j=1}^m a_j\, \mathbb{E}\left[ \sigma(\langle u_i, x\rangle)\, \sigma(\langle u_j, x\rangle) \right] \right| \\
=\, & \frac{1}{\varepsilon}\left| \mathbb{E}\left[ \sigma(\langle u_i, x\rangle) \left( y - f(x;a,u) \right) \right] \right| \le \frac{1}{\varepsilon}\, \mathbb{E}\left[ \sigma(\langle u_i, x\rangle)^2 \right]^{1/2} \mathbb{E}\left[ \left( y - f(x;a,u) \right)^2 \right]^{1/2} \\
\stackrel{(i)}{\le}\, & \frac{1}{\varepsilon} M_2 \sqrt{2\mathscr{R}(a(0),u(0))} \le \frac{M}{\varepsilon},
\end{aligned}
$$

where $(i)$ follows from our assumptions and the fact that $\mathscr{R}(a(t),u(t)) \le \mathscr{R}(a(0),u(0))$, since $\partial_t \mathscr{R}(a,u) \le 0$ by gradient flow equations. Moreover, the constant $M$ only depends on the $M_i$'s. Since $|a_i(0)| \le M_1$ for all $i \in [m]$, we know that $|a_i(t)| \le M(1+t/\varepsilon)$ for all $t \ge 0$, thus leading to the following estimate:

\begin{align*}
\partial_{t}\left(\sum_{i,j=1}^{m}r_{ij}^{\perp}(t)^{2}\right)\leq\, & 4\sum_{i,j=1}^{m}M\left(1+\frac{t}{\varepsilon}\right)\left|r_{ij}^{\perp}\right|\left(\left\|V'\right\|_{\infty}\left|r_{ij}^{\perp}\right|+M\left(1+\frac{t}{\varepsilon}\right)\left\|U'\right\|_{\infty}\left(\frac{1}{m}\sum_{p=1}^{m}\left|r_{jp}^{\perp}\right|+\left|r_{ij}^{\perp}\right|\right)\right)\\
\leq\, & M\left(1+\frac{t}{\varepsilon}\right)^{2}\left(\sum_{i,j=1}^{m}r_{ij}^{\perp}(t)^{2}+\frac{1}{m}\sum_{i,j,p=1}^{m}\left|r_{ij}^{\perp}(t)\right|\cdot\left|r_{jp}^{\perp}(t)\right|\right)\\
\leq\, & M\left(1+\frac{t}{\varepsilon}\right)^{2}\left(\sum_{i,j=1}^{m}r_{ij}^{\perp}(t)^{2}+\frac{1}{2m}\sum_{i,j,p=1}^{m}\left(r_{ij}^{\perp}(t)^{2}+r_{jp}^{\perp}(t)^{2}\right)\right)\\
\leq\, & M\left(1+\frac{t}{\varepsilon}\right)^{2}\sum_{i,j=1}^{m}r_{ij}^{\perp}(t)^{2},
\end{align*}

where the constant $M$ only depends on the $M_{i}$'s in our assumptions. At initialization, we know that $\sum_{i,j=1}^{m}r_{ij}^{\perp}(0)^{2}=m$. Applying Grönwall's inequality yields that

\[
\sum_{i,j=1}^{m}r_{ij}^{\perp}(t)^{2}\leq m\exp\left(\int_{0}^{t}M\left(1+\frac{s}{\varepsilon}\right)^{2}\mathrm{d}s\right),\quad\forall t\in[0,T],
\]

which further implies that

\[
\sup_{t\in[0,T]}\sum_{i,j=1}^{m}r_{ij}^{\perp}(t)^{2}\leq m\exp\left(\int_{0}^{T}M\left(1+\frac{t}{\varepsilon}\right)^{2}\mathrm{d}t\right)\leq m\exp\left(MT\left(1+T\right)^{2}/\varepsilon^{2}\right).
\]
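
The last inequality can be checked by a direct computation. The following is only a sketch, under the additional assumption $\varepsilon\in(0,1]$, which is not stated explicitly at this point of the proof. Since $(1+t/\varepsilon)^{2}=(\varepsilon+t)^{2}/\varepsilon^{2}$ and $\varepsilon+t\leq 1+T$ for all $t\in[0,T]$,

\[
\int_{0}^{T}M\left(1+\frac{t}{\varepsilon}\right)^{2}\mathrm{d}t=\frac{M}{\varepsilon^{2}}\int_{0}^{T}(\varepsilon+t)^{2}\,\mathrm{d}t\leq\frac{M}{\varepsilon^{2}}\cdot T\left(1+T\right)^{2}.
\]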

This completes the proof. ∎
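
For reference, two elementary facts used in the argument above can be spelled out as follows. This is a sketch, and the symbols $y$, $\beta$ and $x_{ij}$ are introduced here for illustration only. First, the differential form of Grönwall's inequality states that if $y$ is differentiable on $[0,T]$ and $y'(t)\leq\beta(t)\,y(t)$ for a continuous function $\beta$, then

\[
y(t)\leq y(0)\exp\left(\int_{0}^{t}\beta(s)\,\mathrm{d}s\right),\quad\forall t\in[0,T].
\]

It is applied here with $y(t)=\sum_{i,j=1}^{m}r_{ij}^{\perp}(t)^{2}$, $y(0)=m$, and $\beta(t)=M(1+t/\varepsilon)^{2}$. Second, the triple sum is absorbed into the double sum because the remaining index is summed freely: writing $x_{ij}=r_{ij}^{\perp}(t)$,

\[
\frac{1}{2m}\sum_{i,j,p=1}^{m}\left(x_{ij}^{2}+x_{jp}^{2}\right)=\frac{1}{2m}\left(m\sum_{i,j=1}^{m}x_{ij}^{2}+m\sum_{j,p=1}^{m}x_{jp}^{2}\right)=\sum_{i,j=1}^{m}x_{ij}^{2}.
\]

The resulting factor of $2$ is absorbed into $M$, reading $M$ as a constant that may change from line to line, as the displays above suggest.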

We show that

\[
\sup_{t\in[0,T]}\sum_{i=1}^{m}\left(\left(a_{i}(t)-a_{i}^{\mbox{\tiny\rm mf}}(t)\right)^{2}+\left(s_{i}(t)-s_{i}^{\mbox{\tiny\rm mf}}(t)\right)^{2}\right)\leq C(T).\tag{131}
\]

To this end, we define $S(t)=\sum_{i=1}^{m}\left(\left(a_{i}(t)-a_{i}^{\mbox{\tiny\rm mf}}(t)\right)^{2}+\left(s_{i}(t)-s_{i}^{\mbox{\tiny\rm mf}}(t)\right)^{2}\right)$. By our assumption, $S(0)=0$. Moreover, using the same technique as in the proof of Lemma 1, we know that $|a_{i}^{\mbox{\tiny\rm mf}}(t)|\leq M(1+t/\varepsilon)$ for all $i\in[m]$. According to Eq.s (16)-(19) and Eq. (24), we deduce that

\begin{align*}
\left|\partial_{t}(a_{i}-a_{i}^{\mbox{\tiny\rm mf}})\right|=\, & \frac{1}{\varepsilon}\cdot\left|V(s_{i})-V(s_{i}^{\mbox{\tiny\rm mf}})-\frac{1}{m}\sum_{j=1}^{m}\left(a_{j}U(r_{ij})-a_{j}^{\mbox{\tiny\rm mf}}U(s_{i}^{\mbox{\tiny\rm mf}}s_{j}^{\mbox{\tiny\rm mf}})\right)\right|\\
\leq\, & \frac{1}{\varepsilon}\cdot\left(\left|V(s_{i})-V(s_{i}^{\mbox{\tiny\rm mf}})\right|+\frac{1}{m}\sum_{j=1}^{m}\left|a_{j}U(r_{ij})-a_{j}U(s_{i}s_{j})\right|+\frac{1}{m}\sum_{j=1}^{m}\left|a_{j}U(s_{i}s_{j})-a_{j}^{\mbox{\tiny\rm mf}}U(s_{i}^{\mbox{\tiny\rm mf}}s_{j}^{\mbox{\tiny\rm mf}})\right|\right)\\
\leq\, & \frac{1}{\varepsilon}\cdot\left(\left\|V'\right\|_{\infty}|s_{i}-s_{i}^{\mbox{\tiny\rm mf}}|+\frac{M(1+t/\varepsilon)}{m}\left\|U'\right\|_{\infty}\sum_{j=1}^{m}\left|r_{ij}^{\perp}\right|\right)\\
 & +\frac{1}{m\varepsilon}\sum_{j=1}^{m}\left(\left\|U\right\|_{\infty}|a_{j}-a_{j}^{\mbox{\tiny\rm mf}}|+M\left(1+\frac{t}{\varepsilon}\right)\left\|U'\right\|_{\infty}\left(|s_{i}-s_{i}^{\mbox{\tiny\rm mf}}|+|s_{j}-s_{j}^{\mbox{\tiny\rm mf}}|\right)\right)\\
\leq\, & \frac{M}{\varepsilon}\left(1+\frac{t}{\varepsilon}\right)\cdot\left(|s_{i}-s_{i}^{\mbox{\tiny\rm mf}}|+\frac{1}{m}\sum_{j=1}^{m}\left(|s_{j}-s_{j}^{\mbox{\tiny\rm mf}}|+|a_{j}-a_{j}^{\mbox{\tiny\rm mf}}|+\left|r_{ij}^{\perp}\right|\right)\right),
\end{align*}

thus leading to the following estimate:

\begin{align*}
\sum_{i=1}^{m}(a_{i}-a_{i}^{\mbox{\tiny\rm mf}})\cdot\partial_{t}(a_{i}-a_{i}^{\mbox{\tiny\rm mf}})\leq\, & \frac{M}{\varepsilon}\left(1+\frac{t}{\varepsilon}\right)\cdot\frac{1}{m}\sum_{i=1}^{m}|a_{i}-a_{i}^{\mbox{\tiny\rm mf}}|\cdot\sum_{j=1}^{m}\left(|a_{j}-a_{j}^{\mbox{\tiny\rm mf}}|+|s_{j}-s_{j}^{\mbox{\tiny\rm mf}}|\right)\\
 & +\frac{M}{\varepsilon}\left(1+\frac{t}{\varepsilon}\right)\cdot\left(\sum_{i=1}^{m}|a_{i}-a_{i}^{\mbox{\tiny\rm mf}}||s_{i}-s_{i}^{\mbox{\tiny\rm mf}}|+\frac{1}{m}\sum_{i,j=1}^{m}|a_{i}-a_{i}^{\mbox{\tiny\rm mf}}|\left|r_{ij}^{\perp}\right|\right)\\
\stackrel{(i)}{\leq}\, & \frac{M}{\varepsilon}\left(1+\frac{t}{\varepsilon}\right)\cdot\left(\sum_{i=1}^{m}(a_{i}-a_{i}^{\mbox{\tiny\rm mf}})^{2}+\sum_{i=1}^{m}(s_{i}-s_{i}^{\mbox{\tiny\rm mf}})^{2}+\frac{1}{m}\sum_{i,j=1}^{m}\left(r_{ij}^{\perp}\right)^{2}\right)\\
\stackrel{(ii)}{\leq}\, & \frac{M}{\varepsilon}\left(1+\frac{t}{\varepsilon}\right)\cdot\left(\sum_{i=1}^{m}(a_{i}-a_{i}^{\mbox{\tiny\rm mf}})^{2}+\sum_{i=1}^{m}(s_{i}-s_{i}^{\mbox{\tiny\rm mf}})^{2}+\exp\left(Mt(1+t)^{2}/\varepsilon^{2}\right)\right),
\end{align*}

where in $(i)$ we use the Cauchy-Schwarz inequality and the inequality of arithmetic and geometric means, and $(ii)$ follows from the conclusion of Lemma 1. Similarly, we obtain that

\begin{align*}
\left|\partial_{t}(s_{i}-s_{i}^{\mbox{\tiny\rm mf}})\right|\leq\, & \left\|V'\right\|_{\infty}|a_{i}-a_{i}^{\mbox{\tiny\rm mf}}|+M\left(1+\frac{t}{\varepsilon}\right)\left(\left\|V''\right\|_{\infty}+2\left\|V'\right\|_{\infty}\right)|s_{i}-s_{i}^{\mbox{\tiny\rm mf}}|\\
 & +\frac{1}{m}\left|a_{i}\sum_{j=1}^{m}a_{j}U'(r_{ij})(s_{j}-r_{ij}s_{i})-a_{i}\sum_{j=1}^{m}a_{j}U'(s_{i}s_{j})(1-s_{i}^{2})s_{j}\right|\\
 & +\frac{1}{m}\left|a_{i}\sum_{j=1}^{m}a_{j}U'(s_{i}s_{j})(1-s_{i}^{2})s_{j}-a_{i}^{\mbox{\tiny\rm mf}}\sum_{j=1}^{m}a_{j}^{\mbox{\tiny\rm mf}}U'(s_{i}^{\mbox{\tiny\rm mf}}s_{j}^{\mbox{\tiny\rm mf}})(1-(s_{i}^{\mbox{\tiny\rm mf}})^{2})s_{j}^{\mbox{\tiny\rm mf}}\right|\\
\leq\, & M\left(1+\frac{t}{\varepsilon}\right)\cdot\left(|a_{i}-a_{i}^{\mbox{\tiny\rm mf}}|+|s_{i}-s_{i}^{\mbox{\tiny\rm mf}}|\right)+\frac{2}{m}\cdot M\left(1+\frac{t}{\varepsilon}\right)\left(\left\|U'\right\|_{\infty}+\left\|U''\right\|_{\infty}\right)\sum_{j=1}^{m}|a_{j}|\left|r_{ij}^{\perp}\right|\\
 & +M\left(1+\frac{t}{\varepsilon}\right)^{2}\cdot\frac{1}{m}\sum_{j=1}^{m}\left(|a_{j}-a_{j}^{\mbox{\tiny\rm mf}}|+|s_{j}-s_{j}^{\mbox{\tiny\rm mf}}|\right)\\
\leq\, & M\left(1+\frac{t}{\varepsilon}\right)^{2}\cdot\left(|a_{i}-a_{i}^{\mbox{\tiny\rm mf}}|+|s_{i}-s_{i}^{\mbox{\tiny\rm mf}}|+\frac{1}{m}\sum_{j=1}^{m}\left(|s_{j}-s_{j}^{\mbox{\tiny\rm mf}}|+|a_{j}-a_{j}^{\mbox{\tiny\rm mf}}|+\left|r_{ij}^{\perp}\right|\right)\right),
\end{align*}

which further implies that

$$\sum_{i=1}^{m}(s_{i}-s_{i}^{\mbox{\tiny\rm mf}})\cdot\partial_{t}(s_{i}-s_{i}^{\mbox{\tiny\rm mf}})\leq M\left(1+\frac{t}{\varepsilon}\right)^{2}\cdot\left(\sum_{i=1}^{m}(a_{i}-a_{i}^{\mbox{\tiny\rm mf}})^{2}+\sum_{i=1}^{m}(s_{i}-s_{i}^{\mbox{\tiny\rm mf}})^{2}+\exp\left(Mt(1+t)^{2}/\varepsilon^{2}\right)\right).$$

Combining the above estimates, we finally deduce that

\begin{align*}
S^{\prime}(t)=\, & 2\sum_{i=1}^{m}\left((a_{i}-a_{i}^{\mbox{\tiny\rm mf}})\cdot\partial_{t}(a_{i}-a_{i}^{\mbox{\tiny\rm mf}})+(s_{i}-s_{i}^{\mbox{\tiny\rm mf}})\cdot\partial_{t}(s_{i}-s_{i}^{\mbox{\tiny\rm mf}})\right) \\
\leq\, & \frac{M(1+t)(1+t/\varepsilon)}{\varepsilon}\cdot\left(\sum_{i=1}^{m}(a_{i}-a_{i}^{\mbox{\tiny\rm mf}})^{2}+\sum_{i=1}^{m}(s_{i}-s_{i}^{\mbox{\tiny\rm mf}})^{2}+\exp\left(Mt(1+t)^{2}/\varepsilon^{2}\right)\right) \\
=\, & \frac{M(1+t)(1+t/\varepsilon)}{\varepsilon}\cdot\left(S(t)+\exp\left(Mt(1+t)^{2}/\varepsilon^{2}\right)\right).
\end{align*}

Applying Grönwall’s inequality immediately implies

$$S(t)\leq\exp\left(Mt(1+t)^{2}/\varepsilon^{2}+\int_{0}^{t}\frac{M(1+s)(1+s/\varepsilon)}{\varepsilon}\mathrm{d}s\right)\leq\exp\left(Mt(1+t)^{2}/\varepsilon^{2}\right), \tag{132}$$

which further leads to Eq. (131) and concludes the proof of Proposition 3. The “consequently” part can be shown via direct calculation, but we include it here for the sake of completeness. By definition, for any $t\in[0,T]$ we have

\begin{align*}
& \left|\mathscrsfs{R}_{\mbox{\tiny\rm red}}\left(a(t),s(t),R(t)\right)-\mathscrsfs{R}_{\mbox{\tiny\rm mf}}\left(a^{\mbox{\tiny\rm mf}}(t),s^{\mbox{\tiny\rm mf}}(t)\right)\right| \\
\leq\, & \frac{1}{m}\left|\sum_{i=1}^{m}a_{i}V(s_{i})-\sum_{i=1}^{m}a_{i}^{\mbox{\tiny\rm mf}}V(s_{i}^{\mbox{\tiny\rm mf}})\right|+\frac{1}{2m^{2}}\left|\sum_{i,j=1}^{m}a_{i}a_{j}U(r_{ij})-\sum_{i,j=1}^{m}a_{i}^{\mbox{\tiny\rm mf}}a_{j}^{\mbox{\tiny\rm mf}}U(s_{i}^{\mbox{\tiny\rm mf}}s_{j}^{\mbox{\tiny\rm mf}})\right| \\
\leq\, & \frac{1}{m}\sum_{i=1}^{m}\left(\left\|{V}\right\|_{\infty}\left|a_{i}-a_{i}^{\mbox{\tiny\rm mf}}\right|+M(1+t/\varepsilon)\left\|{V^{\prime}}\right\|_{\infty}\left|s_{i}-s_{i}^{\mbox{\tiny\rm mf}}\right|\right)+\frac{1}{2m^{2}}M(1+t/\varepsilon)^{2}\left\|{U^{\prime}}\right\|_{\infty}\sum_{i,j=1}^{m}\left|r_{ij}^{\perp}\right| \\
& +\frac{1}{2m^{2}}\sum_{i,j=1}^{m}\left(M(1+t/\varepsilon)\left\|{U}\right\|_{\infty}\left(\left|a_{i}-a_{i}^{\mbox{\tiny\rm mf}}\right|+\left|a_{j}-a_{j}^{\mbox{\tiny\rm mf}}\right|\right)+M(1+t/\varepsilon)^{2}\left\|{U^{\prime}}\right\|_{\infty}\left(\left|s_{i}-s_{i}^{\mbox{\tiny\rm mf}}\right|+\left|s_{j}-s_{j}^{\mbox{\tiny\rm mf}}\right|\right)\right) \\
\leq\, & M(1+t/\varepsilon)\frac{1}{m}\sum_{i=1}^{m}\left|a_{i}-a_{i}^{\mbox{\tiny\rm mf}}\right|+M(1+t/\varepsilon)^{2}\frac{1}{m}\sum_{i=1}^{m}\left|s_{i}-s_{i}^{\mbox{\tiny\rm mf}}\right|+M(1+t/\varepsilon)^{2}\frac{1}{m^{2}}\sum_{i,j=1}^{m}\left|r_{ij}^{\perp}\right| \\
\leq\, & M(1+t/\varepsilon)^{2}\cdot\left(\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(\left(a_{i}(t)-a_{i}^{\mbox{\tiny\rm mf}}(t)\right)^{2}+\left(s_{i}(t)-s_{i}^{\mbox{\tiny\rm mf}}(t)\right)^{2}\right)}+\sqrt{\frac{1}{m^{2}}\sum_{i,j=1}^{m}\left|r_{ij}^{\perp}(t)\right|^{2}}\right) \\
\leq\, & \frac{1}{\sqrt{m}}M(1+t/\varepsilon)^{2}\exp\left(Mt(1+t)^{2}/\varepsilon^{2}\right).
\end{align*}

Therefore,

$$\sup_{t\in[0,T]}\left|\mathscrsfs{R}_{\mbox{\tiny\rm red}}\left(a(t),s(t),R(t)\right)-\mathscrsfs{R}_{\mbox{\tiny\rm mf}}\left(a^{\mbox{\tiny\rm mf}}(t),s^{\mbox{\tiny\rm mf}}(t)\right)\right|\leq\frac{M\exp\left(MT(1+T)^{2}/\varepsilon^{2}\right)}{\sqrt{m}}, \tag{133}$$

as desired.
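
For reference, the Grönwall step behind Eq. (132) can be spelled out as follows. This is a sketch under assumptions that are not stated in this passage: that the two systems share their initialization, so that $S(0)=0$; that $\varepsilon\leq 1$; and that the constant $M$ is allowed to change from line to line. The shorthands $\beta$, $g$ and $B$ are introduced here for illustration only.

\begin{align*}
& \beta(t):=\frac{M(1+t)(1+t/\varepsilon)}{\varepsilon},\qquad g(t):=\exp\left(Mt(1+t)^{2}/\varepsilon^{2}\right),\qquad B(t):=\int_{0}^{t}\beta(s)\,\mathrm{d}s, \\
& S^{\prime}(t)\leq\beta(t)\left(S(t)+g(t)\right)\;\Longrightarrow\;\frac{\mathrm{d}}{\mathrm{d}t}\left(e^{-B(t)}S(t)\right)\leq\beta(t)g(t)e^{-B(t)}, \\
& e^{-B(t)}S(t)\leq\int_{0}^{t}\beta(s)g(s)e^{-B(s)}\,\mathrm{d}s\leq g(t)\left(1-e^{-B(t)}\right)\quad\text{(since $g$ is nondecreasing)}, \\
& S(t)\leq g(t)\left(e^{B(t)}-1\right)\leq\exp\left(Mt(1+t)^{2}/\varepsilon^{2}+B(t)\right), \\
& B(t)\leq t\,\beta(t)=\frac{Mt(1+t)(1+t/\varepsilon)}{\varepsilon}\leq\frac{Mt(1+t)^{2}}{\varepsilon^{2}}\quad\text{(using $1+t/\varepsilon\leq(1+t)/\varepsilon$ for $\varepsilon\leq 1$)}.
\end{align*}

The exponent is therefore at most $2Mt(1+t)^{2}/\varepsilon^{2}$, and renaming $2M$ as $M$ gives the right-hand side of Eq. (132).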

B.4 Derivation of the mean field dynamics (29)

For any bounded continuous $f\in C_{b}(\mathbb{R}^{2})$, we have

\begin{align*}
& \int_{\mathbb{R}^{2}}f(a,s)\partial_{t}\rho_{t}(\mathrm{d}a,\mathrm{d}s)=\partial_{t}\left(\int_{\mathbb{R}^{2}}f(a,s)\rho_{t}(\mathrm{d}a,\mathrm{d}s)\right)=\partial_{t}\left(\frac{1}{m}\sum_{i=1}^{m}f(a_{i}^{\mbox{\tiny\rm mf}}(t),s_{i}^{\mbox{\tiny\rm mf}}(t))\right) \\
=\, & \frac{1}{m}\sum_{i=1}^{m}\left(\partial_{a}f(a_{i}^{\mbox{\tiny\rm mf}}(t),s_{i}^{\mbox{\tiny\rm mf}}(t))\cdot\partial_{t}a_{i}^{\mbox{\tiny\rm mf}}(t)+\partial_{s}f(a_{i}^{\mbox{\tiny\rm mf}}(t),s_{i}^{\mbox{\tiny\rm mf}}(t))\cdot\partial_{t}s_{i}^{\mbox{\tiny\rm mf}}(t)\right) \\
\stackrel{(i)}{=}\, & \frac{1}{m}\sum_{i=1}^{m}\left(\partial_{a}f(a_{i}^{\mbox{\tiny\rm mf}}(t),s_{i}^{\mbox{\tiny\rm mf}}(t))\cdot\Psi_{a}\left(a_{i}^{\mbox{\tiny\rm mf}}(t),s_{i}^{\mbox{\tiny\rm mf}}(t);\rho_{t}\right)+\partial_{s}f(a_{i}^{\mbox{\tiny\rm mf}}(t),s_{i}^{\mbox{\tiny\rm mf}}(t))\cdot\Psi_{s}\left(a_{i}^{\mbox{\tiny\rm mf}}(t),s_{i}^{\mbox{\tiny\rm mf}}(t);\rho_{t}\right)\right) \\
=\, & \int_{\mathbb{R}^{2}}\left(\partial_{a}f(a,s)\cdot\Psi_{a}(a,s;\rho_{t})+\partial_{s}f(a,s)\cdot\Psi_{s}(a,s;\rho_{t})\right)\rho_{t}(\mathrm{d}a,\mathrm{d}s) \\
\stackrel{(ii)}{=}\, & -\int_{\mathbb{R}^{2}}f(a,s)\cdot\left(\partial_{a}\left(\rho_{t}\Psi_{a}(a,s;\rho_{t})\right)+\partial_{s}\left(\rho_{t}\Psi_{s}(a,s;\rho_{t})\right)\right)(\mathrm{d}a,\mathrm{d}s),
\end{align*}

where $(i)$ follows from the ODE satisfied by the $(a_{i}^{\mbox{\tiny\rm mf}}(t),s_{i}^{\mbox{\tiny\rm mf}}(t))$’s, and in $(ii)$ we use integration by parts. We thus obtain that

$$\partial_{t}\rho_{t}=-\left(\partial_{a}\left(\rho_{t}\Psi_{a}(a,s;\rho_{t})\right)+\partial_{s}\left(\rho_{t}\Psi_{s}(a,s;\rho_{t})\right)\right)=-\nabla\cdot\left(\rho_{t}\Psi(a,s;\rho_{t})\right),$$

which recovers Eq. (29).
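
The identity at the heart of this derivation, namely that the time derivative of the empirical average of a test function equals the average of $\nabla f\cdot\Psi$, can be checked numerically. The sketch below is illustrative only: the velocity field `psi` is a toy stand-in and not the paper's $(\Psi_a,\Psi_s)$, the test function `f` is a hypothetical smooth bounded choice, and the time derivative is approximated by a single small explicit Euler step.

```python
import numpy as np


def psi(a, s):
    """Toy velocity field (an assumption for illustration, not the paper's Psi).

    It depends on each particle (a_i, s_i) and on the empirical measure
    through particle averages, mimicking a mean-field interaction.
    """
    psi_a = np.tanh(s) - np.mean(a * np.cos(s))
    psi_s = a * (1.0 - s**2) * (1.0 - np.mean(a * s))
    return psi_a, psi_s


def f(a, s):
    """Hypothetical smooth bounded test function."""
    return np.sin(a) * np.cos(s)


def grad_f(a, s):
    """Partial derivatives (d/da, d/ds) of the test function above."""
    return np.cos(a) * np.cos(s), -np.sin(a) * np.sin(s)


rng = np.random.default_rng(1)
m, dt = 200, 1e-6
a = rng.normal(size=m)
s = rng.uniform(-1.0, 1.0, size=m)

pa, ps = psi(a, s)

# Left-hand side: d/dt of the empirical average of f, via one small Euler step.
lhs = (np.mean(f(a + dt * pa, s + dt * ps)) - np.mean(f(a, s))) / dt

# Right-hand side: average of grad f . Psi under the empirical measure.
fa, fs = grad_f(a, s)
rhs = np.mean(fa * pa + fs * ps)

print(f"lhs = {lhs:.6f}, rhs = {rhs:.6f}, gap = {abs(lhs - rhs):.2e}")
```

The two sides agree up to the $O(\mathrm{d}t)$ error of the finite difference. Integration by parts then moves the derivatives from $f$ onto $\rho_t\Psi$, which is step $(ii)$ above.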

B.5 Details of the alternative mean field approach

Let

$$\overline{\rho}_{t}=\frac{1}{m}\sum_{i=1}^{m}\delta_{(a_{i}(t),u_{i}(t))}\,, \tag{134}$$

where $(a_{i}(t),u_{i}(t))_{1\leqslant i\leqslant m}$ is the solution of (5)–(6). Then $\overline{\rho}_{t}$ is a measure on $\mathbb{R}\times\mathbb{S}^{d-1}$ solving the continuity PDE

$$\begin{aligned}\partial_{t}\overline{\rho}_{t}(a,u)&=-\nabla\cdot\left(\overline{\rho}_{t}\overline{\Psi}\left(a,u;\overline{\rho}_{t}\right)\right)\\ &=-\left(\partial_{a}\left(\overline{\rho}_{t}\overline{\Psi}_{a}\left(a,u;\overline{\rho}_{t}\right)\right)+\partial_{u}\left(\overline{\rho}_{t}\overline{\Psi}_{u}\left(a,u;\overline{\rho}_{t}\right)\right)\right)\,,\end{aligned} \tag{135}$$

where $\overline{\Psi}=(\overline{\Psi}_{a},\overline{\Psi}_{u})$ is given by

\begin{align*}
\overline{\Psi}_{a}\left(a,u;\overline{\rho}\right)&=\varepsilon^{-1}\left(V(\langle u,u_{*}\rangle)-\int_{\mathbb{R}\times\mathbb{S}^{d-1}}a_{1}U(\langle u,u_{1}\rangle)\overline{\rho}(\mathrm{d}a_{1},\mathrm{d}u_{1})\right)\,, \\
\overline{\Psi}_{u}\left(a,u;\overline{\rho}\right)&=a\left(I_{d}-uu^{\top}\right)\left(V^{\prime}(\langle u,u_{*}\rangle)u_{*}-\int_{\mathbb{R}\times\mathbb{S}^{d-1}}a_{1}U^{\prime}(\langle u,u_{1}\rangle)u_{1}\overline{\rho}(\mathrm{d}a_{1},\mathrm{d}u_{1})\right)\,.
\end{align*}

A remarkable property of equation (135) is that it preserves invariance to rotations orthogonal to $u_{*}$. Indeed, assume that $\overline{\rho}$ is invariant to rotations orthogonal to $u_{*}$. In this case, we show that $\overline{\Psi}_{a}\left(a,u;\overline{\rho}\right)$ and $\langle u_{*},\overline{\Psi}_{u}\left(a,u;\overline{\rho}\right)\rangle$ depend only on $s:=\langle u,u_{*}\rangle$ and $s_{1}:=\langle u_{1},u_{*}\rangle$. Let $u^{\perp}$ (resp. $u_{1}^{\perp}$) denote the component of $u$ (resp. $u_{1}$) orthogonal to $u_{*}$. Let $R$ denote a random uniform rotation orthogonal to $u_{*}$. By the rotation invariance of $\overline{\rho}$,

$$
\begin{aligned}
\overline{\Psi}_{a}\left(a,u;\overline{\rho}\right) &= \varepsilon^{-1}\left(V(\langle u,u_{*}\rangle)-\int_{\mathbb{R}\times\mathbb{S}^{d-1}}a_{1}\mathbb{E}_{R}\left[U(\langle u,Ru_{1}\rangle)\right]\overline{\rho}(\mathrm{d}a_{1},\mathrm{d}u_{1})\right)\\
&= \varepsilon^{-1}\left(V(s)-\int_{\mathbb{R}\times\mathbb{S}^{d-1}}a_{1}\mathbb{E}_{R}\left[U(ss_{1}+\langle u^{\perp},Ru_{1}^{\perp}\rangle)\right]\overline{\rho}(\mathrm{d}a_{1},\mathrm{d}u_{1})\right)\,.
\end{aligned}
$$

The random variable $B^{(d)}=\left\langle\frac{u^{\perp}}{\|u^{\perp}\|},R\frac{u_{1}^{\perp}}{\|u_{1}^{\perp}\|}\right\rangle$ is a one-dimensional projection of a random variable uniform on the unit sphere of the hyperplane orthogonal to $u_{*}$; thus it has the density $p_{B^{(d)}}(b)\propto(1-b^{2})^{d/2-2}$ on $[-1,1]$ (see, e.g., [16, Lemma 4.17]). Denote

$$
U^{(d)}(s,s_{1})=\mathbb{E}_{B^{(d)}}\left[U\left(ss_{1}+(1-s^{2})^{1/2}(1-s_{1}^{2})^{1/2}B^{(d)}\right)\right]\,,
$$

then we have

$$
\overline{\Psi}_{a}\left(a,u;\overline{\rho}\right)=\varepsilon^{-1}\left(V(s)-\int_{\mathbb{R}\times\mathbb{S}^{d-1}}a_{1}U^{(d)}(s,s_{1})\,\overline{\rho}(\mathrm{d}a_{1},\mathrm{d}u_{1})\right)\,.
\tag{136}
$$
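The density claimed for $B^{(d)}$ can be checked numerically. The following is a minimal sketch, not part of the original argument: it assumes $d\geq 4$, represents $B^{(d)}$ as the first coordinate of a uniform point on the unit sphere of $\mathbb{R}^{d-1}$, and compares the Monte Carlo second moment with the value obtained from the density $(1-b^{2})^{d/2-2}$, which equals $1/(d-1)$. The function names `sample_B` and `second_moment_from_density`, the dimension, and the sample size are illustrative choices.

```python
import math
import random


def sample_B(d, rng):
    """Draw B^(d): first coordinate of a uniform point on the unit sphere of R^(d-1)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d - 1)]
    norm = math.sqrt(sum(x * x for x in g))
    return g[0] / norm


def second_moment_from_density(d, n_grid=20000):
    """E[B^2] under p(b) proportional to (1 - b^2)^(d/2 - 2), by midpoint quadrature."""
    h = 2.0 / n_grid
    num = 0.0
    den = 0.0
    for i in range(n_grid):
        b = -1.0 + (i + 0.5) * h
        w = (1.0 - b * b) ** (d / 2 - 2)
        num += b * b * w
        den += w
    return num / den


d = 10                      # illustrative dimension (assumption)
rng = random.Random(0)      # fixed seed for reproducibility
n_samples = 20000

mc = sum(sample_B(d, rng) ** 2 for _ in range(n_samples)) / n_samples
quad = second_moment_from_density(d)
print(mc, quad, 1.0 / (d - 1))
```

The three printed numbers should agree up to Monte Carlo error, which is consistent with $B^{(d)}$ having fluctuations of order $d^{-1/2}$.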

Further, we compute

$$
\begin{aligned}
&\left\langle u_{*},\overline{\Psi}_{u}\left(a,u;\overline{\rho}\right)\right\rangle=\Bigg\langle u_{*},a\left(I_{d}-uu^{\top}\right)\\
&\qquad\qquad\left(V^{\prime}(s)u_{*}-\int_{\mathbb{R}\times\mathbb{S}^{d-1}}a_{1}\mathbb{E}_{R}\left[U^{\prime}(ss_{1}+\langle u^{\perp},Ru_{1}^{\perp}\rangle)\left(s_{1}u_{*}+Ru_{1}^{\perp}\right)\right]\overline{\rho}(\mathrm{d}a_{1},\mathrm{d}u_{1})\right)\Bigg\rangle\\
&=a\left[(1-s^{2})V^{\prime}(s)-\int_{\mathbb{R}\times\mathbb{S}^{d-1}}a_{1}\mathbb{E}_{R}\left[U^{\prime}(ss_{1}+\langle u^{\perp},Ru_{1}^{\perp}\rangle)\langle u_{*},(I_{d}-uu^{\top})(s_{1}u_{*}+Ru_{1}^{\perp})\rangle\right]\overline{\rho}(\mathrm{d}a_{1},\mathrm{d}u_{1})\right]\,.
\end{aligned}
$$

In the equation above, we have $\langle u_{*},(I_{d}-uu^{\top})s_{1}u_{*}\rangle=s_{1}(1-s^{2})$ and, as $\langle u_{*},Ru_{1}^{\perp}\rangle=0$ a.s., we have

$$
\left\langle u_{*},(I_{d}-uu^{\top})Ru_{1}^{\perp}\right\rangle=-\langle u,u_{*}\rangle\langle u,Ru_{1}^{\perp}\rangle=-s\langle u^{\perp},Ru_{1}^{\perp}\rangle\,.
$$
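Both projection identities used here are elementary and can be verified numerically. The sketch below is an illustration under stated assumptions: $u_{*}$ is taken to be the first basis vector, `u` is a random unit vector, and `w` is an arbitrary vector orthogonal to $u_{*}$ standing in for $Ru_{1}^{\perp}$ (the identities only use $\langle u_{*},w\rangle=0$). All variable names are hypothetical.

```python
import math
import random

rng = random.Random(1)
d = 6  # illustrative dimension (assumption)


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def unit(x):
    n = math.sqrt(dot(x, x))
    return [a / n for a in x]


u_star = [1.0] + [0.0] * (d - 1)
u = unit([rng.gauss(0.0, 1.0) for _ in range(d)])
# w plays the role of R u_1^perp: any vector orthogonal to u_star.
w = [0.0] + [rng.gauss(0.0, 1.0) for _ in range(d - 1)]
s1 = 0.3  # arbitrary value of <u_1, u_star>

s = dot(u, u_star)
u_perp = [a - s * b for a, b in zip(u, u_star)]


def project(v):
    """Apply (I_d - u u^T) to v."""
    c = dot(u, v)
    return [vi - c * ui for vi, ui in zip(v, u)]


# First identity: <u_star, (I - u u^T) s1 u_star> = s1 (1 - s^2).
lhs1 = dot(u_star, project([s1 * x for x in u_star]))
rhs1 = s1 * (1.0 - s * s)

# Second identity: <u_star, (I - u u^T) w> = -s <u_perp, w>.
lhs2 = dot(u_star, project(w))
rhs2 = -s * dot(u_perp, w)

print(lhs1 - rhs1, lhs2 - rhs2)
```

Both printed differences should be zero up to floating-point rounding.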

Thus we obtain

$$
\begin{aligned}
&\left\langle u_{*},\overline{\Psi}_{u}\left(a,u;\overline{\rho}\right)\right\rangle\\
&=a(1-s^{2})\left[V^{\prime}(s)-\int_{\mathbb{R}\times\mathbb{S}^{d-1}}a_{1}\mathbb{E}_{R}\left[U^{\prime}(ss_{1}+\langle u^{\perp},Ru_{1}^{\perp}\rangle)\left(s_{1}-\frac{s}{1-s^{2}}\langle u^{\perp},Ru_{1}^{\perp}\rangle\right)\right]\overline{\rho}(\mathrm{d}a_{1},\mathrm{d}u_{1})\right]\,.
\end{aligned}
$$

Note that

$$
\partial_{s}U^{(d)}(s,s_{1})=\mathbb{E}_{B^{(d)}}\left[U^{\prime}\left(ss_{1}+(1-s^{2})^{1/2}(1-s_{1}^{2})^{1/2}B^{(d)}\right)\left(s_{1}-\frac{s}{(1-s^{2})^{1/2}}(1-s_{1}^{2})^{1/2}B^{(d)}\right)\right]
$$

and thus we have

$$
\left\langle u_{*},\overline{\Psi}_{u}\left(a,u;\overline{\rho}\right)\right\rangle=a(1-s^{2})\left[V^{\prime}(s)-\int_{\mathbb{R}\times\mathbb{S}^{d-1}}a_{1}\left(\partial_{s}U^{(d)}\right)(s,s_{1})\,\overline{\rho}(\mathrm{d}a_{1},\mathrm{d}u_{1})\right]\,.
\tag{137}
$$
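The expression for $\partial_{s}U^{(d)}$ that leads to (137) can be sanity-checked against a finite difference. This is a sketch under assumptions that are not in the text: the choice $U=\exp$ is purely illustrative (its Taylor coefficients are nonnegative), the expectation over $B^{(d)}$ is computed by midpoint quadrature against the weight $(1-b^{2})^{d/2-2}$, and the helper names `expect_B`, `U_d`, and `dsU_d` are hypothetical.

```python
import math


def expect_B(f, d, n_grid=4000):
    """E[f(B)] for B with density proportional to (1 - b^2)^(d/2 - 2) on [-1, 1]."""
    h = 2.0 / n_grid
    num = 0.0
    den = 0.0
    for i in range(n_grid):
        b = -1.0 + (i + 0.5) * h
        w = (1.0 - b * b) ** (d / 2 - 2)
        num += f(b) * w
        den += w
    return num / den


U = math.exp        # illustrative choice of U (assumption)
U_prime = math.exp  # its derivative


def U_d(s, s1, d):
    c = math.sqrt((1 - s * s) * (1 - s1 * s1))
    return expect_B(lambda b: U(s * s1 + c * b), d)


def dsU_d(s, s1, d):
    """Closed-form derivative in s, following the displayed formula."""
    c = math.sqrt((1 - s * s) * (1 - s1 * s1))
    r = s * math.sqrt(1 - s1 * s1) / math.sqrt(1 - s * s)
    return expect_B(lambda b: U_prime(s * s1 + c * b) * (s1 - r * b), d)


s, s1, d, h = 0.4, -0.3, 8, 1e-5
finite_diff = (U_d(s + h, s1, d) - U_d(s - h, s1, d)) / (2 * h)
closed_form = dsU_d(s, s1, d)
print(finite_diff, closed_form)
```

The central difference and the closed-form expression should agree to many digits, since both are evaluated on the same quadrature grid.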

Of course, a discrete measure of the form (134) cannot be invariant to rotations orthogonal to $u_{*}$. However, if the $u_{i}$ are initialized uniformly on the unit sphere, then the measure $\overline{\rho}_{0}$ converges to a rotation-invariant measure as $m\to\infty$. One can then apply the results of [33] to control the deviations from this limit. Let us thus assume that $\overline{\rho}_{0}$ satisfies the rotation invariance. Define the map $\varphi(a,u)=(a,\langle u,u_{*}\rangle)$. Then, from (136) and (137), the push-forward $\rho_{t}$ of $\overline{\rho}_{t}$ through the map $\varphi$ satisfies the continuity equation

$$
\begin{aligned}
\partial_{t}\rho_{t}(a,s) &= -\nabla\cdot\left(\rho_{t}\Psi^{(d)}\left(a,s;\rho_{t}\right)\right)\\
&= -\left(\partial_{a}\left(\rho_{t}\Psi^{(d)}_{a}\left(a,s;\rho_{t}\right)\right)+\partial_{s}\left(\rho_{t}\Psi^{(d)}_{s}\left(a,s;\rho_{t}\right)\right)\right),
\end{aligned}
$$

where $\Psi^{(d)}=(\Psi^{(d)}_{a},\Psi^{(d)}_{s})$ is given by

$$
\begin{aligned}
\Psi^{(d)}_{a}(a,s;\rho) &= \varepsilon^{-1}\cdot\left(V(s)-\int_{\mathbb{R}^{2}}a_{1}U^{(d)}(s,s_{1})\,\rho(\mathrm{d}a_{1},\mathrm{d}s_{1})\right)\,,\\
\Psi^{(d)}_{s}(a,s;\rho) &= a(1-s^{2})\cdot\left(V^{\prime}(s)-\int_{\mathbb{R}^{2}}a_{1}\partial_{s}U^{(d)}(s,s_{1})\,\rho(\mathrm{d}a_{1},\mathrm{d}s_{1})\right)\,.
\end{aligned}
$$

When $d\to\infty$, $p_{B^{(d)}}(b)\,\mathrm{d}b\propto(1-b^{2})^{d/2-2}\,\mathrm{d}b$ converges weakly to the Dirac mass $\delta_{0}(\mathrm{d}b)$. As a consequence,

$$
U^{(d)}(s,s_{1})\xrightarrow[d\to\infty]{}U(ss_{1})\,,\qquad\partial_{s}U^{(d)}(s,s_{1})\xrightarrow[d\to\infty]{}U^{\prime}(ss_{1})s_{1}\,.
$$

Therefore, in the limit $d\to\infty$, we recover the equations (29)–(32). Moreover, if $\overline{\rho}_{0}={\rm P}_{A}\otimes\mathrm{Unif}(\mathbb{S}^{d-1})$, then $\rho_{0}$ converges weakly to ${\rm P}_{A}\otimes\delta_{0}(\mathrm{d}s)$ as $d\to\infty$.
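The convergence $U^{(d)}(s,s_{1})\to U(ss_{1})$ can be illustrated numerically. The sketch below again assumes the illustrative choice $U=\exp$ and a quadrature approximation of the expectation over $B^{(d)}$; the helper `expect_B`, the evaluation point, and the list of dimensions are hypothetical choices, not taken from the text.

```python
import math


def expect_B(f, d, n_grid=20000):
    """E[f(B)] for B with density proportional to (1 - b^2)^(d/2 - 2) on [-1, 1]."""
    h = 2.0 / n_grid
    num = 0.0
    den = 0.0
    for i in range(n_grid):
        b = -1.0 + (i + 0.5) * h
        w = (1.0 - b * b) ** (d / 2 - 2)
        num += f(b) * w
        den += w
    return num / den


U = math.exp  # illustrative choice of U (assumption)
s, s1 = 0.5, 0.3
c = math.sqrt((1 - s * s) * (1 - s1 * s1))

gaps = []
for d in (10, 100, 1000):
    U_d = expect_B(lambda b: U(s * s1 + c * b), d)
    gaps.append(abs(U_d - U(s * s1)))
    print(d, gaps[-1])
```

Since $\mathbb{E}[B^{(d)}]=0$ and $\mathbb{E}[(B^{(d)})^{2}]=1/(d-1)$, a second-order Taylor expansion suggests that the gap decays roughly like $1/d$ for smooth $U$, which is what the printed values should display.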

B.6 Proof of Proposition 4

First, note that the potential functions $U$ and $V$ admit the following expansions:

$$
U(s)=\sum_{k=0}^{\infty}\sigma_{k}^{2}s^{k},\qquad V(s)=\sum_{k=0}^{\infty}\varphi_{k}\sigma_{k}s^{k}.
$$

As a consequence, we deduce that

$$
\begin{aligned}
\mathscr{R}_{\mathrm{mf},*}(\rho) &= \frac{1}{2}\sum_{k=0}^{\infty}\varphi_{k}^{2}-\sum_{k=0}^{\infty}\varphi_{k}\sigma_{k}\int a(\omega)s(\omega)^{k}\,\mathrm{d}\rho(\omega)\\
&\quad+\frac{1}{2}\sum_{k=0}^{\infty}\sigma_{k}^{2}\int a(\omega_{1})a(\omega_{2})s(\omega_{1})^{k}s(\omega_{2})^{k}\,\mathrm{d}\rho(\omega_{1})\,\mathrm{d}\rho(\omega_{2})\\
&= \frac{1}{2}\sum_{k=0}^{\infty}\left(\varphi_{k}^{2}-2\varphi_{k}\sigma_{k}\int a(\omega)s(\omega)^{k}\,\mathrm{d}\rho(\omega)+\sigma_{k}^{2}\left(\int a(\omega)s(\omega)^{k}\,\mathrm{d}\rho(\omega)\right)^{2}\right)\\
&= \frac{1}{2}\sum_{k=0}^{\infty}\left(\varphi_{k}-\sigma_{k}\int a(\omega)s(\omega)^{k}\,\mathrm{d}\rho(\omega)\right)^{2}.
\end{aligned}
$$

Now we show that the above risk can be made arbitrarily small. We will choose $\rho$ to be the Lebesgue measure on $[-1,1]$ and $a\in L^2[-1,1]$, so that $\int a(\omega)s(\omega)^k\,\mathrm{d}\rho(\omega)=\int_{-1}^{1}a(s)s^k\,\mathrm{d}s$. We then define the following set of sequences:

$$
W=\left\{\left(\sigma_k\int_{-1}^{1}a(s)s^k\,\mathrm{d}s\right)_{k\geq 0}\,\middle|\,a\in L^2[-1,1]\right\}.
$$

Since $a\in L^2[-1,1]$ and $(\sigma_k)_{k\geq 0}\in\ell^2$, we know that $W\subset\ell^2$, and $W$ is a linear subspace of $\ell^2$. It now suffices to show that $W$ is dense in $\ell^2$, which is equivalent to $W^{\perp}=\{0\}$, namely

$$
v\in\ell^2,\ v\perp W\implies v=0.
$$

Fix any such $v$ and take $\mu\in W$ such that, for all $k$, $\mu_k=\sigma_k\int_{-1}^{1}a(s)s^k\,\mathrm{d}s$ for some $a\in L^2[-1,1]$. We then have

$$
0=\langle v,\mu\rangle=\sum_{k=0}^{\infty}v_k\mu_k=\sum_{k=0}^{\infty}v_k\sigma_k\int_{-1}^{1}a(s)s^k\,\mathrm{d}s=\int_{-1}^{1}a(s)\cdot\left(\sum_{k=0}^{\infty}v_k\sigma_k s^k\right)\mathrm{d}s,
$$

where the last step follows from the dominated convergence theorem. Indeed, by Hölder’s inequality,

$$
\sum_{k=0}^{\infty}|v_k\sigma_k|\leq\left(\sum_{k=0}^{\infty}v_k^2\right)^{1/2}\left(\sum_{k=0}^{\infty}\sigma_k^2\right)^{1/2}<\infty.
$$

As a consequence, the partial sums $\sum_{k=0}^{n}\sigma_k v_k s^k$ converge absolutely and uniformly on $[-1,1]$ to the continuous function $f(s)=\sum_{k=0}^{\infty}\sigma_k v_k s^k$. The above argument then implies that $\int_{-1}^{1}a(s)f(s)\,\mathrm{d}s=0$ for any $a\in L^2[-1,1]$, which further implies that $f(s)\equiv 0$ on $[-1,1]$. Since a power series that vanishes identically on an interval has all its coefficients equal to zero, $\sigma_k v_k=0$ for all $k\geq 0$. Since $\sigma_k\neq 0$ for all $k\geq 0$, we must have $v_k=0$ for all $k\geq 0$, i.e., $v=0$.
This completes the proof of the density of $W$ in $\ell^2$, and thus the proof of the Proposition.

Appendix C Calculations for the analysis of mean-field gradient flow

C.1 Solution of Eq. (89)

In order to solve the system (89), we start from an associated one-dimensional ODE.

Lemma 2.

The solution $\lambda=\lambda(t_3)$ of the ODE

$$
\partial_{t_3}\lambda=|\sigma_1|\left(|\varphi_1|-|\sigma_1|\left\|a_{\perp,\mathrm{init}}\right\|_{L^2(\rho)}^2\lambda^2\right)\lambda \tag{138}
$$

with initial condition $\lambda(0)$ is

$$
\lambda=\frac{|\varphi_1|^{1/2}}{\left(|\sigma_1|\left\|a_{\perp,\mathrm{init}}\right\|_{L^2(\rho)}^2+\left(|\varphi_1|\lambda(0)^{-2}-|\sigma_1|\left\|a_{\perp,\mathrm{init}}\right\|_{L^2(\rho)}^2\right)e^{-2|\sigma_1\varphi_1|t_3}\right)^{1/2}}\,. \tag{139}
$$
Proof.

For simplicity, denote $\alpha=|\sigma_1|$, $\beta=|\varphi_1|$ and $\gamma=|\sigma_1|\left\|a_{\perp,\mathrm{init}}\right\|_{L^2(\rho)}^2$. Then

$$
\partial_{t_3}\lambda=\alpha\left(\beta-\gamma\lambda^2\right)\lambda\,.
$$

This is a Bernoulli differential equation (see, e.g., [15]). In this situation, the classical trick is to reduce the problem to a linear inhomogeneous first-order equation by considering

$$
\begin{aligned}
\partial_{t_3}\left(\lambda^{-2}\right) &= -2\left(\partial_{t_3}\lambda\right)\lambda^{-3} = -2\alpha\left(\beta-\gamma\lambda^2\right)\lambda^{-2} \\
&= 2\alpha\left(\gamma-\beta\lambda^{-2}\right).
\end{aligned}
$$

Solving this linear inhomogeneous first-order equation gives

$$
\lambda^{-2}=\frac{\gamma}{\beta}+\left(\lambda(0)^{-2}-\frac{\gamma}{\beta}\right)e^{-2\alpha\beta t_3}\,,
$$

and thus

$$
\lambda=\frac{\beta^{1/2}}{\left(\gamma+\left(\beta\lambda(0)^{-2}-\gamma\right)e^{-2\alpha\beta t_3}\right)^{1/2}}\,,
$$

which is the claimed result. ∎
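As a sanity check on Lemma 2, the following minimal sketch compares the closed form (139), written in the shorthand $\alpha,\beta,\gamma$ of the proof, with a direct numerical integration of (138). The numerical values of `alpha`, `beta`, `gamma`, `lam0` and the choice of a classical Runge-Kutta integrator are illustrative assumptions, not taken from the paper.

```python
import math


def lambda_closed_form(t, lam0, alpha, beta, gamma):
    """Closed form (139) in the shorthand alpha, beta, gamma of the proof."""
    denom = gamma + (beta * lam0 ** -2 - gamma) * math.exp(-2.0 * alpha * beta * t)
    return math.sqrt(beta / denom)


def lambda_rk4(t_end, lam0, alpha, beta, gamma, n_steps=5000):
    """Integrate d(lambda)/dt = alpha * (beta - gamma * lambda^2) * lambda with RK4."""

    def rhs(lam):
        return alpha * (beta - gamma * lam * lam) * lam

    h = t_end / n_steps
    lam = lam0
    for _ in range(n_steps):
        k1 = rhs(lam)
        k2 = rhs(lam + 0.5 * h * k1)
        k3 = rhs(lam + 0.5 * h * k2)
        k4 = rhs(lam + h * k3)
        lam += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return lam


# Illustrative (assumed) parameter values; any positive values would do.
alpha, beta, gamma, lam0 = 0.7, 1.3, 0.9, 0.05

for t in (0.5, 2.0, 8.0):
    exact = lambda_closed_form(t, lam0, alpha, beta, gamma)
    numeric = lambda_rk4(t, lam0, alpha, beta, gamma)
    print(f"t3 = {t:4.1f}   closed form = {exact:.10f}   RK4 = {numeric:.10f}")
```

For large $t_3$ both values approach the stable fixed point $(\beta/\gamma)^{1/2}$ of the ODE, which is the plateau reached at the end of the third time scale.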

Let $\lambda=\lambda(t_3)$ be a solution of (138) and consider

$$
\begin{split}
&a^{(-1)}=a^{(-1)}_{\perp}=\lambda a_{\perp,\mathrm{init}}\,,\\
&s^{(1)}=s^{(1)}_{\perp}=\operatorname{sign}(\sigma_1\varphi_1)\lambda a_{\perp,\mathrm{init}}\,.
\end{split} \tag{140}
$$

Then $a^{(-1)},s^{(1)}$ are solutions of the constrained ODE system (85), (88). Indeed,

$$
\left\langle a^{(-1)},\mathds{1}\right\rangle_{L^2(\rho)}=\lambda\left\langle a_{\perp,\mathrm{init}},\mathds{1}\right\rangle_{L^2(\rho)}=0\,,
$$

thus the constraint (85) is satisfied. Further

$$
\begin{aligned}
\partial_{t_3}a_{\perp}^{(-1)} &= \left(\partial_{t_3}\lambda\right)a_{\perp,\mathrm{init}} \\
&= |\sigma_1|\left(|\varphi_1|-|\sigma_1|\left\|a_{\perp,\mathrm{init}}\right\|_{L^2(\rho)}^2\lambda^2\right)\lambda a_{\perp,\mathrm{init}} \\
&= \operatorname{sign}(\sigma_1\varphi_1)\sigma_1\left(\varphi_1-\operatorname{sign}(\sigma_1\varphi_1)\sigma_1\left\|a_{\perp,\mathrm{init}}\right\|_{L^2(\rho)}^2\lambda^2\right)\lambda a_{\perp,\mathrm{init}} \\
&= \sigma_1\left(\varphi_1-\sigma_1\left\langle a^{(-1)}_{\perp},s^{(1)}_{\perp}\right\rangle_{L^2(\rho)}\right)s^{(1)}_{\perp}\,.
\end{aligned}
$$

A similar computation shows that the differential equation for $s^{(1)}$ is also satisfied. We conclude that (140) is a valid candidate solution on the third time scale.
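The verification above can also be checked numerically. The sketch below is only an illustration under stated assumptions: $\rho$ is replaced by a uniform measure on `n` atoms, `a_init` is a hypothetical mean-zero profile, and the values of `sigma1`, `phi1` and `lam0` are arbitrary. It evaluates the ansatz (140) with $\lambda$ given by (139) and compares a finite-difference time derivative of $a^{(-1)}_{\perp}$ with the right-hand side derived in the last display.

```python
import math

# Illustrative (assumed) coefficients with opposite signs, so sign(sigma1*phi1) = -1.
sigma1, phi1, lam0 = -0.8, 1.5, 0.1

# Hypothetical discretization: rho is uniform on n atoms, a_init has mean zero.
n = 200
a_init = [math.cos(2.0 * math.pi * (i + 0.5) / n) for i in range(n)]


def inner(u, v):
    """Discrete stand-in for the L^2(rho) inner product."""
    return sum(x * y for x, y in zip(u, v)) / n


norm2 = inner(a_init, a_init)
alpha, beta, gamma = abs(sigma1), abs(phi1), abs(sigma1) * norm2
sgn = math.copysign(1.0, sigma1 * phi1)


def lam(t):
    """Closed form (139) with the shorthand alpha, beta, gamma."""
    denom = gamma + (beta * lam0 ** -2 - gamma) * math.exp(-2.0 * alpha * beta * t)
    return math.sqrt(beta / denom)


def a_of(t):
    return [lam(t) * x for x in a_init]


def s_of(t):
    return [sgn * lam(t) * x for x in a_init]


t, h = 0.7, 1e-5
# Central finite difference for the time derivative of a^(-1).
lhs = [(p - m) / (2.0 * h) for p, m in zip(a_of(t + h), a_of(t - h))]
a, s = a_of(t), s_of(t)
coef = sigma1 * (phi1 - sigma1 * inner(a, s))
rhs = [coef * x for x in s]

residual = max(abs(l - r) for l, r in zip(lhs, rhs))
constraint = inner(a, [1.0] * n)
print("max residual of the a-equation:", residual)
print("orthogonality to the constant function:", constraint)
```

Both printed quantities should be zero up to discretization and rounding error; a negative `sigma1 * phi1` is used on purpose to exercise the sign factor in (140).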

Matching.

To determine the value of the initialization $\lambda(0)$, we perform a matching procedure with the previous time scale. In this paragraph, we denote by $\underline{a},\underline{s}$ the approximation obtained in the second time scale (Section 6.3), and by $\overline{a},\overline{s}$ the approximation in the third time scale (Section 6.4 and above).

Consider an intermediate time scale $\widetilde{t}=t_2-c\log\frac{1}{\varepsilon}$ with $0<c<\frac{1}{4|\sigma_1\varphi_1|}$. Assume $\widetilde{t}\asymp 1$. Then

$$
\begin{aligned}
t_2 &= \widetilde{t}+c\log\frac{1}{\varepsilon}\xrightarrow[\varepsilon\to 0]{}+\infty\,,\\
t_3 &= \widetilde{t}-\left(\frac{1}{4|\sigma_1\varphi_1|}-c\right)\log\frac{1}{\varepsilon}\xrightarrow[\varepsilon\to 0]{}-\infty\,.
\end{aligned}
$$

From the approximation (82) on the second time scale,

$$
\begin{aligned}
\underline{a} &= \underline{a}^{(0)}+O(\varepsilon^{1/2}) \\
&= \frac{\varphi_0}{\sigma_0}\mathds{1}+\cosh\left(\varphi_1\sigma_1 t_2\right)a_{\perp,\mathrm{init}}+O(\varepsilon^{1/2}) \\
&= \cosh\left(\varphi_1\sigma_1\left(\widetilde{t}+c\log\frac{1}{\varepsilon}\right)\right)a_{\perp,\mathrm{init}}+O(1) \\
&= \frac{1}{2}e^{|\varphi_1\sigma_1|\widetilde{t}}\,\varepsilon^{-c|\varphi_1\sigma_1|}a_{\perp,\mathrm{init}}+O(1)\,.
\end{aligned} \tag{141}
$$

From the approximation on the third time scale,

$$
\begin{aligned}
\overline{a} &= \varepsilon^{-1/4}\,\overline{a}^{(-1)}+O(1) \\
&= \varepsilon^{-1/4}\lambda a_{\perp,\mathrm{init}}+O(1)\,.
\end{aligned}
$$

Note that as $t_3\to-\infty$, from (139),

$$
\lambda\sim\frac{|\varphi_1|^{1/2}}{\left(|\varphi_1|\lambda(0)^{-2}-|\sigma_1|\left\|a_{\perp,\mathrm{init}}\right\|_{L^2(\rho)}^2\right)^{1/2}}\,e^{|\sigma_1\varphi_1|t_3}\,.
$$

Thus

$$
\begin{aligned}
\overline{a} &\sim \varepsilon^{-1/4}\frac{|\varphi_1|^{1/2}}{\left(|\varphi_1|\lambda(0)^{-2}-|\sigma_1|\left\|a_{\perp,\mathrm{init}}\right\|_{L^2(\rho)}^2\right)^{1/2}}\,e^{|\sigma_1\varphi_1|\left(\widetilde{t}-\left(\frac{1}{4|\sigma_1\varphi_1|}-c\right)\log\frac{1}{\varepsilon}\right)}a_{\perp,\mathrm{init}} \\
&\sim \frac{|\varphi_1|^{1/2}}{\left(|\varphi_1|\lambda(0)^{-2}-|\sigma_1|\left\|a_{\perp,\mathrm{init}}\right\|_{L^2(\rho)}^2\right)^{1/2}}\,e^{|\sigma_1\varphi_1|\widetilde{t}}\,\varepsilon^{-c|\sigma_1\varphi_1|}a_{\perp,\mathrm{init}}\,.
\end{aligned} \tag{142}
$$

For the matching to hold, Equations (141) and (142) must agree. This gives

$$
\frac{1}{2}=\frac{|\varphi_{1}|^{\nicefrac{1}{2}}}{\left(|\varphi_{1}|\lambda(0)^{-2}-|\sigma_{1}|\left\|a_{\perp,\rm{init}}\right\|_{L^{2}(\rho)}^{2}\right)^{\nicefrac{1}{2}}}\,,
$$

and thus

$$
\lambda(t_{3})=\frac{|\varphi_{1}|^{\nicefrac{1}{2}}}{\left(|\sigma_{1}|\left\|a_{\perp,\rm{init}}\right\|_{L^{2}(\rho)}^{2}+4|\varphi_{1}|e^{-2|\sigma_{1}\varphi_{1}|t_{3}}\right)^{\nicefrac{1}{2}}}\,.
\tag{143}
$$

One could check similarly that $s^{(1)}$ also satisfies the matching conditions under the same constraint, and thus that (140) are indeed the solutions of the third time scale.
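As a quick numerical sanity check of the matching step, the following Python sketch evaluates $\lambda(t_3)$ from (143) and confirms that, at $t_3 = 0$, the prefactor appearing in (142) equals $\nicefrac{1}{2}$. The function names and the sample parameter values are illustrative assumptions, not taken from the paper; `a_perp_norm_sq` stands for $\|a_{\perp,\rm{init}}\|_{L^2(\rho)}^2$.

```python
import math


def lam(t3, phi1, sigma1, a_perp_norm_sq):
    """Evaluate lambda(t3) as given by Eq. (143)."""
    rate = abs(sigma1 * phi1)
    denom = abs(sigma1) * a_perp_norm_sq + 4.0 * abs(phi1) * math.exp(-2.0 * rate * t3)
    return math.sqrt(abs(phi1) / denom)


def matching_prefactor(phi1, sigma1, a_perp_norm_sq):
    """Prefactor of Eq. (142), with lambda(0) taken from Eq. (143) at t3 = 0."""
    lam0 = lam(0.0, phi1, sigma1, a_perp_norm_sq)
    inner = abs(phi1) * lam0 ** -2 - abs(sigma1) * a_perp_norm_sq
    return math.sqrt(abs(phi1) / inner)


if __name__ == "__main__":
    # Hypothetical sample values, chosen only for illustration.
    print(matching_prefactor(phi1=0.7, sigma1=-1.3, a_perp_norm_sq=0.4))
```

For large $t_3$ the exponential in (143) vanishes, so `lam` tends to $\left(|\varphi_1| / (|\sigma_1| \|a_{\perp,\rm{init}}\|_{L^2(\rho)}^2)\right)^{\nicefrac{1}{2}}$.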

C.2 Induced approximation of the risk

In this section, we show that the behavior of $a$ and $s$ derived in Sections 6.2–6.4 leads to an evolution of the risk that alternates between plateaus and rapid decreases, in agreement with the canonical learning order of Definition 1. For the convenience of the reader, we recall the expression (37) of the risk

$$
\begin{aligned}
\mathscrsfs{R}_{\mbox{\tiny\rm mf},*} &= \frac{1}{2}\|\varphi\|^{2}-\int a(\omega)V(s(\omega))\mathrm{d}\rho(\omega)+\frac{1}{2}\int a(\omega_{1})a(\omega_{2})U(s(\omega_{1})s(\omega_{2}))\mathrm{d}\rho(\omega_{1})\mathrm{d}\rho(\omega_{2}) \\
&= \frac{1}{2}\sum_{k\geqslant 0}\left(\varphi_{k}-\sigma_{k}\int a(\omega)s(\omega)^{k}\mathrm{d}\rho(\omega)\right)^{2}\,.
\end{aligned}
$$
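The sum-of-squares form of the risk is straightforward to evaluate numerically. The sketch below does so under two assumptions that are not part of the paper: the series is truncated to the finitely many coefficients supplied, and $\rho$ is replaced by the uniform measure on finitely many atoms. The function name `mean_field_risk` is hypothetical.

```python
def mean_field_risk(phi, sigma, a, s):
    """Truncated risk 0.5 * sum_k (phi_k - sigma_k * mean(a * s**k))**2.

    phi, sigma: coefficient lists (phi_0, phi_1, ...) and (sigma_0, sigma_1, ...).
    a, s: values a(omega_i), s(omega_i) on equally weighted atoms omega_i
          (a discretization of rho, assumed here for illustration).
    """
    m = len(a)
    total = 0.0
    for k, (phi_k, sigma_k) in enumerate(zip(phi, sigma)):
        # Empirical version of the integral of a(omega) * s(omega)^k against rho.
        moment = sum(a_i * s_i ** k for a_i, s_i in zip(a, s)) / m
        total += (phi_k - sigma_k * moment) ** 2
    return 0.5 * total


if __name__ == "__main__":
    # With s = 0 only the k = 0 term can be fitted; the rest contributes phi_k^2 / 2.
    print(mean_field_risk([1.0, 2.0, 0.5], [1.0, 1.0, 1.0], [1.0, 1.0], [0.0, 0.0]))
```

With $s \equiv 0$ every moment with $k \geqslant 1$ vanishes, which is the mechanism behind the plateaus discussed in this section.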

First time scale $t_{1}=\frac{t}{\varepsilon}$ (Section 6.2).

On this time scale, we have $a=O(1)$ and $s=O(\varepsilon)$. Thus for all $k\geqslant 1$, $\int a(\omega)s(\omega)^{k}\mathrm{d}\rho(\omega)=O(\varepsilon)$, whence $\left(\varphi_{k}-\sigma_{k}\int a(\omega)s(\omega)^{k}\mathrm{d}\rho(\omega)\right)^{2}=\varphi_{k}^{2}+O(\varepsilon)$.

Further, using (59),

$$
\begin{aligned}
\left(\varphi_{0}-\sigma_{0}\int a(\omega)\mathrm{d}\rho(\omega)\right)^{2} &= \left(\varphi_{0}-\sigma_{0}\left\langle a,\mathds{1}\right\rangle_{L^{2}(\rho)}\right)^{2}=\left(\varphi_{0}-\sigma_{0}\left\langle a^{(0)},\mathds{1}\right\rangle_{L^{2}(\rho)}+O(\varepsilon)\right)^{2} \\
&= e^{-2\sigma_{0}^{2}t_{1}}\left(\varphi_{0}-\sigma_{0}\left\langle a_{\rm{init}},\mathds{1}\right\rangle_{L^{2}(\rho)}\right)^{2}+O(\varepsilon)\,.
\end{aligned}
$$

Thus as $\varepsilon\to 0$,

$$
\mathscrsfs{R}_{\mbox{\tiny\rm mf},*}=\frac{1}{2}e^{-2\sigma_{0}^{2}t_{1}}\left(\varphi_{0}-\sigma_{0}\left\langle a_{\rm{init}},\mathds{1}\right\rangle_{L^{2}(\rho)}\right)^{2}+\frac{1}{2}\sum_{k\geqslant 1}\varphi_{k}^{2}+O(\varepsilon)\,.
$$

This describes, in a more detailed form, the first transition in Definition 1.
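To make the first transition concrete, the following Python sketch tabulates the leading-order risk on the first time scale. It drops the $O(\varepsilon)$ remainder and truncates the sum over $k \geqslant 1$ to the coefficients supplied; both simplifications, as well as the function name and the sample values, are assumptions made for illustration.

```python
import math


def risk_first_scale(t1, phi, sigma0, a_init_mean):
    """Leading-order risk on the first time scale (O(eps) remainder dropped).

    phi: truncated list (phi_0, phi_1, ...); a_init_mean: <a_init, 1>_{L^2(rho)}.
    """
    gap0 = phi[0] - sigma0 * a_init_mean          # initial error on the k = 0 component
    plateau = 0.5 * sum(p * p for p in phi[1:])   # components with k >= 1 are not yet learned
    return 0.5 * math.exp(-2.0 * sigma0 ** 2 * t1) * gap0 ** 2 + plateau


if __name__ == "__main__":
    for t1 in (0.0, 0.5, 1.0, 5.0):
        print(t1, risk_first_scale(t1, phi=[1.0, 2.0], sigma0=1.0, a_init_mean=0.0))
```

The printed values decay exponentially at rate $2\sigma_0^2$ from the initial risk towards the plateau $\frac{1}{2}\sum_{k\geqslant 1}\varphi_k^2$.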

Second time scale $t_{2}=\frac{t}{\varepsilon^{\nicefrac{1}{2}}}$ (Section 6.3).

On this time scale, we have $a=O(1)$ and $s=O(\varepsilon^{\nicefrac{1}{2}})$. Thus for all $k\geqslant 1$, $\left(\varphi_{k}-\sigma_{k}\int a(\omega)s(\omega)^{k}\mathrm{d}\rho(\omega)\right)^{2}=\varphi_{k}^{2}+O(\varepsilon^{\nicefrac{1}{2}})$.

Further, using (68),

$$
\left(\varphi_{0}-\sigma_{0}\int a(\omega)\mathrm{d}\rho(\omega)\right)^{2}=\left(\varphi_{0}-\sigma_{0}\int a^{(0)}(\omega)\mathrm{d}\rho(\omega)+O(\varepsilon^{\nicefrac{1}{2}})\right)^{2}=O(\varepsilon^{\nicefrac{1}{2}})\,.
$$

Thus as $\varepsilon\to 0$,

$$
\mathscrsfs{R}_{\mbox{\tiny\rm mf},*}=\frac{1}{2}\sum_{k\geqslant 1}\varphi_{k}^{2}+O(\varepsilon^{\nicefrac{1}{2}})\,.
$$

This second time scale does not induce any transition of the risk $\mathscrsfs{R}_{\mbox{\tiny\rm mf},*}$ (but was necessary to understand the divergence of $a$ and $\varepsilon^{-\nicefrac{1}{2}}s$).

Third time scale $t_{3}=\frac{t}{\varepsilon^{\nicefrac{1}{2}}}-\frac{1}{4|\sigma_{1}\varphi_{1}|}\log\frac{1}{\varepsilon}$ (Section 6.4).

On this time scale, we have $a=O(\varepsilon^{-\nicefrac{1}{4}})$ and $s=O(\varepsilon^{\nicefrac{1}{4}})$. Thus for all $k\geqslant 2$, $\left(\varphi_{k}-\sigma_{k}\int a(\omega)s(\omega)^{k}\mathrm{d}\rho(\omega)\right)^{2}=\varphi_{k}^{2}+O(\varepsilon^{\nicefrac{1}{4}})$.

Further, using (85), (86),

$$
\begin{aligned}
\left(\varphi_{0}-\sigma_{0}\int a(\omega)\mathrm{d}\rho(\omega)\right)^{2} &= \left(\varphi_{0}-\sigma_{0}\varepsilon^{-\nicefrac{1}{4}}\int a^{(-1)}(\omega)\mathrm{d}\rho(\omega)-\sigma_{0}\int a^{(0)}(\omega)\mathrm{d}\rho(\omega)+O(\varepsilon^{\nicefrac{1}{4}})\right)^{2} \\
&= O(\varepsilon^{\nicefrac{1}{4}})\,.
\end{aligned}
$$

Finally, using (90), (91),

$$
\begin{aligned}
\left(\varphi_{1}-\sigma_{1}\int a(\omega)s(\omega)\mathrm{d}\rho(\omega)\right)^{2} &= \left(\varphi_{1}-\sigma_{1}\left\langle a,s\right\rangle_{L^{2}(\rho)}\right)^{2}=\left(\varphi_{1}-\sigma_{1}\left\langle a^{(-1)},s^{(1)}\right\rangle_{L^{2}(\rho)}+O(\varepsilon^{\nicefrac{1}{4}})\right)^{2} \\
&\stackrel{(a)}{=} \left(\varphi_{1}-\sigma_{1}\operatorname{sign}(\sigma_{1}\varphi_{1})\lambda^{2}\|a_{\perp,\rm{init}}\|^{2}_{L^{2}(\rho)}\right)^{2}+O(\varepsilon^{\nicefrac{1}{4}}) \\
&\stackrel{(b)}{=} \left(\varphi_{1}-\frac{\sigma_{1}\operatorname{sign}(\sigma_{1}\varphi_{1})|\varphi_{1}|\|a_{\perp,\rm{init}}\|_{L^{2}(\rho)}^{2}}{\left(|\sigma_{1}|\left\|a_{\perp,\rm{init}}\right\|_{L^{2}(\rho)}^{2}+4|\varphi_{1}|e^{-2|\sigma_{1}\varphi_{1}|t_{3}}\right)}\right)^{2}+O(\varepsilon^{\nicefrac{1}{4}}) \\
&= \varphi_{1}^{2}\left(1-\frac{1}{1+\frac{4|\varphi_{1}|}{|\sigma_{1}|\left\|a_{\perp,\rm{init}}\right\|_{L^{2}(\rho)}^{2}}e^{-2|\sigma_{1}\varphi_{1}|t_{3}}}\right)^{2}+O(\varepsilon^{\nicefrac{1}{4}})\,,
\end{aligned}
$$

where in $(a)$ we used (90) and in $(b)$ (91). Thus as $\varepsilon\to 0$,

$$
\mathscrsfs{R}_{\mbox{\tiny\rm mf},*}=\frac{1}{2}\varphi_{1}^{2}\left(1-\frac{1}{1+\frac{4|\varphi_{1}|}{|\sigma_{1}|\left\|a_{\perp,\rm{init}}\right\|_{L^{2}(\rho)}^{2}}e^{-2|\sigma_{1}\varphi_{1}|t_{3}}}\right)^{2}+\frac{1}{2}\sum_{k\geqslant 2}\varphi_{k}^{2}+O(\varepsilon^{\nicefrac{1}{4}})\,.
$$

This describes, in a more detailed form, the second transition in Definition 1.
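The sigmoidal shape of this second transition can be seen by evaluating the leading-order expression on the third time scale. The Python sketch below drops the $O(\varepsilon^{\nicefrac{1}{4}})$ remainder and truncates the sum over $k \geqslant 2$; the function name, the argument `a_perp_norm_sq` (standing for $\|a_{\perp,\rm{init}}\|_{L^2(\rho)}^2$), and the sample values are assumptions made for illustration.

```python
import math


def risk_third_scale(t3, phi, sigma1, a_perp_norm_sq):
    """Leading-order risk on the third time scale (O(eps^{1/4}) remainder dropped).

    phi: truncated list (phi_0, phi_1, phi_2, ...); phi_0 is already learned here.
    Note: math.exp may overflow for very negative t3; this sketch does not guard against it.
    """
    rate = abs(sigma1 * phi[1])
    ratio = 4.0 * abs(phi[1]) / (abs(sigma1) * a_perp_norm_sq)
    # Fraction of phi_1 that is already fitted at time t3.
    learned = 1.0 / (1.0 + ratio * math.exp(-2.0 * rate * t3))
    tail = 0.5 * sum(p * p for p in phi[2:])      # components with k >= 2 are not yet learned
    return 0.5 * phi[1] ** 2 * (1.0 - learned) ** 2 + tail


if __name__ == "__main__":
    for t3 in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(t3, risk_third_scale(t3, phi=[0.0, 1.0, 2.0], sigma1=1.0, a_perp_norm_sq=1.0))
```

As $t_3 \to -\infty$ the output approaches $\frac{1}{2}\sum_{k\geqslant 1}\varphi_k^2$, the plateau of the second time scale, and as $t_3 \to +\infty$ it approaches $\frac{1}{2}\sum_{k\geqslant 2}\varphi_k^2$.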

C.3 Proof of Theorem 1

Throughout the proof, we use the shorthand $\mathscrsfs{R}_{l}(\tau)$ for $\mathscrsfs{R}_{l}(\widetilde{a}(\tau),\widetilde{s}(\tau))$. First, note that by the ODE satisfied by $\mathscrsfs{R}_{l}$ (Eq. (98)), $\mathscrsfs{R}_{l}$ must be non-increasing. Thus, for small enough $\varepsilon>0$,

$$
\begin{aligned}
\left|\varphi_{l}-\sigma_{l}\int\widetilde{a}(\nu,\tau)\widetilde{s}(\nu,\tau)^{l}\mathrm{d}\rho(\nu)\right| \leq\, & \left|\varphi_{l}-\sigma_{l}\int\widetilde{a}(\nu,0)\widetilde{s}(\nu,0)^{l}\mathrm{d}\rho(\nu)\right| \\
\leq\, & \left|\varphi_{l}\right|+O(\varepsilon^{1/2l})\leq 2\left|\varphi_{l}\right|,\ \forall\tau\geq 0.
\end{aligned}
$$

Hence, we obtain the estimates:

$$
\partial_{\tau}\left|\widetilde{a}(\omega)\right|\leq\left|\partial_{\tau}\widetilde{a}(\omega)\right|\leq 2|\sigma_{l}||\varphi_{l}|\left|\widetilde{s}(\omega)\right|^{l},\qquad \partial_{\tau}\left|\widetilde{s}(\omega)\right|\leq\left|\partial_{\tau}\widetilde{s}(\omega)\right|\leq 2l|\sigma_{l}||\varphi_{l}|\left|\widetilde{a}(\omega)\right|\left|\widetilde{s}(\omega)\right|^{l-1}.
$$

According to the comparison theorem for systems of ODEs, we know that $|\widetilde{a}(\omega,\tau)|\leq\widehat{a}(\omega,\tau)$ and $|\widetilde{s}(\omega,\tau)|\leq\widehat{s}(\omega,\tau)$ for all $\tau\geq 0$, where

$$
\widehat{a}(\omega,0)=\max\left\{|\widetilde{a}(\omega,0)|,\ |\widetilde{s}(\omega,0)|\right\},\qquad \widehat{s}(\omega,0)=l^{1/2}\widehat{a}(\omega,0)=l^{1/2}\max\left\{|\widetilde{a}(\omega,0)|,\ |\widetilde{s}(\omega,0)|\right\},
$$

and

$$
\partial_{\tau}\widehat{a}(\omega)=2|\sigma_{l}||\varphi_{l}|\widehat{s}(\omega)^{l},\qquad \partial_{\tau}\widehat{s}(\omega)=2l|\sigma_{l}||\varphi_{l}|\widehat{a}(\omega)\widehat{s}(\omega)^{l-1}.
\tag{144}
$$

The above system of ODEs can be solved analytically via integration. First, we note that

$$
\partial_{\tau}(\widehat{s}(\omega)^{2})=4l|\sigma_{l}||\varphi_{l}|\widehat{a}(\omega)\widehat{s}(\omega)^{l}=l\partial_{\tau}(\widehat{a}(\omega)^{2}),
$$

which implies that (noting further that $\widehat{s}(\omega,0)^{2}=l\,\widehat{a}(\omega,0)^{2}$)

$$\widehat{s}(\omega,\tau)=l^{1/2}\,\widehat{a}(\omega,\tau),\quad \forall\tau\geq 0. \tag{145}$$

The ODE system then reduces to $\partial_{\tau}\widehat{a}(\omega)=2l^{l/2}|\sigma_{l}||\varphi_{l}|\,\widehat{a}(\omega)^{l}$, which admits the solution

$$\widehat{a}(\omega,\tau)=\left(\widehat{a}(\omega,0)^{-l+1}-2l^{l/2}(l-1)|\sigma_{l}||\varphi_{l}|\,\tau\right)^{-1/(l-1)}. \tag{146}$$
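As a quick numerical sanity check of the closed form in Eq. (146), the following minimal Python sketch integrates the reduced ODE with a hand-written Runge-Kutta scheme and compares the result with the formula. The values of $l$, $\sigma_l$, $\varphi_l$ and the initial condition are illustrative assumptions that do not come from the paper, and the helper names (`closed_form`, `blow_up_time`, `rk4`) are hypothetical.

```python
# Sketch: compare the closed form (146) with a numerical integration of
#   d a / d tau = c * a**l,   c = 2 * l**(l/2) * |sigma_l| * |phi_l|.
# All numerical values below are illustrative assumptions.

def closed_form(a0, l, c, tau):
    """Closed-form solution (146) of da/dtau = c * a**l, valid before blow-up."""
    return (a0 ** (-(l - 1)) - (l - 1) * c * tau) ** (-1.0 / (l - 1))

def blow_up_time(a0, l, c):
    """Time at which the bracket in (146) vanishes and the solution diverges."""
    return a0 ** (-(l - 1)) / ((l - 1) * c)

def rk4(a0, l, c, tau_end, n_steps=20000):
    """Classical fourth-order Runge-Kutta integration of da/dtau = c * a**l."""
    f = lambda a: c * a ** l
    h = tau_end / n_steps
    a = a0
    for _ in range(n_steps):
        k1 = f(a)
        k2 = f(a + 0.5 * h * k1)
        k3 = f(a + 0.5 * h * k2)
        k4 = f(a + h * k3)
        a += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return a

l = 3                         # hypothetical index of the component
sigma_l, phi_l = 0.5, 1.0     # hypothetical coefficients
c = 2 * l ** (l / 2) * abs(sigma_l) * abs(phi_l)
a0 = 0.05                     # small initialization, standing in for Theta(eps^{1/2l(l+1)})

tau_star = blow_up_time(a0, l, c)
tau_half = 0.5 * tau_star     # stay well before the blow-up time
numeric = rk4(a0, l, c, tau_half)
exact = closed_form(a0, l, c, tau_half)
rel_err = abs(numeric - exact) / exact
print("blow-up time:", tau_star, "relative error at tau*/2:", rel_err)
```

Halving `a0` multiplies `blow_up_time` by $2^{l-1}$, which is the scaling $\widehat{a}(\omega,0)^{-(l-1)}$ behind the $\Theta(\varepsilon^{-(l-1)/2l(l+1)})$ time scale discussed next.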

Since $\widehat{a}(\omega,0)=\Theta(\varepsilon^{1/2l(l+1)})$, we know that $\widehat{a}(\omega,\tau),\widehat{s}(\omega,\tau)=o(1)$ until $\tau=\Theta(\varepsilon^{-(l-1)/2l(l+1)})-O(1)=\Theta(\varepsilon^{-(l-1)/2l(l+1)})$, which means that $\widetilde{a}(\omega,\tau),\widetilde{s}(\omega,\tau)=o(1)$ until $\tau=\Omega(\varepsilon^{-(l-1)/2l(l+1)})$. As a consequence,

$$\mathscr{R}_{l}(\tau)=\frac{1}{2}\left(\varphi_{l}-\sigma_{l}\int\widetilde{a}(\nu,\tau)\widetilde{s}(\nu,\tau)^{l}\,\mathrm{d}\rho(\nu)\right)^{2}=\frac{1}{2}\varphi_{l}^{2}-o(1)$$

until $\tau=\Omega(\varepsilon^{-(l-1)/2l(l+1)})$. This means that the learning of the $l$-th component will not begin until $\tau=\Omega(\varepsilon^{-(l-1)/2l(l+1)})$, namely $\tau(\Delta)=\Omega(\varepsilon^{-(l-1)/2l(l+1)})$ for any fixed $\Delta>0$. Note that the above argument applies to all of the settings in the theorem statement.

Next, we show that for any fixed $\Delta>0$, $\tau(\Delta)=O(\varepsilon^{-(l-1)/2l(l+1)})$, which means that the $l$-th component can be learnt in $O(\varepsilon^{-(l-1)/2l(l+1)})$ time. To prove our claim by contradiction, assume that there exist $\Delta>0$ and a sequence $\varepsilon_{k}\downarrow 0$ such that

$$\lim_{k\to\infty}\frac{\tau(\Delta)}{\varepsilon_{k}^{-(l-1)/2l(l+1)}}=+\infty. \tag{147}$$

By definition of $\tau(\Delta)$, we know that $\forall\tau\leq\tau(\Delta)$,

$$\left|\varphi_{l}-\sigma_{l}\int\widetilde{a}(\nu,\tau)\widetilde{s}(\nu,\tau)^{l}\,\mathrm{d}\rho(\nu)\right|\geq\sqrt{2\Delta}.$$

Now, assume the condition of setting (a) holds and denote

$$A_{\varepsilon_{0},\eta}=\left\{\omega:\forall\varepsilon<\varepsilon_{0},\ \min(|\widetilde{a}(\omega,0)|,|\widetilde{s}(\omega,0)|)>\eta\,\varepsilon^{1/2l(l+1)},\ \sigma_{l}\varphi_{l}\widetilde{a}(\omega,0)\widetilde{s}(\omega,0)^{l}>0\right\}. \tag{148}$$

Then by definition and our assumption that $\widetilde{a}(\omega,0)$ is of the same order as $\widetilde{s}(\omega,0)$, we know that $A=\cup_{\varepsilon_{0}>0,\eta>0}A_{\varepsilon_{0},\eta}$. Since $\rho(A)>0$, there exist $\varepsilon_{0},\eta>0$ such that $\rho(A_{\varepsilon_{0},\eta})>0$. Note that here we can choose $\varepsilon_{0}$ and $\eta$ to be arbitrarily small since the set $A_{\varepsilon_{0},\eta}$ is non-increasing in $\varepsilon_{0}$ and $\eta$. For $\omega\in A_{\varepsilon_{0},\eta}$ and $\tau\leq\tau(\Delta)$, we have

$$\begin{split}\partial_{\tau}(\widetilde{a}(\omega)^{2})=\,&2\sigma_{l}\widetilde{a}(\omega)\widetilde{s}(\omega)^{l}\left(\varphi_{l}-\sigma_{l}\int\widetilde{a}(\nu)\widetilde{s}(\nu)^{l}\,\mathrm{d}\rho(\nu)\right)\geq 2\sqrt{2\Delta}\left|\sigma_{l}\widetilde{a}(\omega)\widetilde{s}(\omega)^{l}\right|,\\ \partial_{\tau}(\widetilde{s}(\omega)^{2})=\,&2l\sigma_{l}\widetilde{a}(\omega)\widetilde{s}(\omega)^{l}\left(1-\varepsilon^{2\beta_{l}}\widetilde{s}(\omega)^{2}\right)\left(\varphi_{l}-\sigma_{l}\int\widetilde{a}(\nu)\widetilde{s}(\nu)^{l}\,\mathrm{d}\rho(\nu)\right)\\ \geq\,&2l\sqrt{2\Delta}\left|\sigma_{l}\widetilde{a}(\omega)\widetilde{s}(\omega)^{l}\right|\left(1-\varepsilon^{2\beta_{l}}\widetilde{s}(\omega)^{2}\right).\end{split}$$

Moreover, we know that at initialization, $|\widetilde{a}(\omega,0)|,|\widetilde{s}(\omega,0)|>\eta\,\varepsilon^{1/2l(l+1)}$. Using the ODE comparison theorem and an argument similar to the one used to prove $\tau(\Delta)=\Omega(\varepsilon^{-(l-1)/2l(l+1)})$, we deduce that for sufficiently large $k$ such that $\varepsilon=\varepsilon_{k}<\varepsilon_{0}$, there exist constants $C,C^{\prime}>0$ that do not depend on $\varepsilon$ and satisfy the following: for all $\omega\in A_{\varepsilon_{0},\eta}$ and $\tau\geq C\varepsilon^{-(l-1)/2l(l+1)}$,

$$\min\left\{|\widetilde{a}(\omega,\tau)|,\ |\widetilde{s}(\omega,\tau)|\right\}\geq C^{\prime},\qquad \sigma_{l}\varphi_{l}\widetilde{a}(\omega,\tau)\widetilde{s}(\omega,\tau)^{l}>0.$$

This further implies that at time $\tau$,

$$\int\widetilde{s}(\omega)^{2(l-1)}\left(l^{2}\widetilde{a}(\omega)^{2}\left(1-\varepsilon^{2\beta_{l}}\widetilde{s}(\omega)^{2}\right)+\widetilde{s}(\omega)^{2}\right)\mathrm{d}\rho(\omega)\geq C^{\prime 2l}\rho(A_{\varepsilon_{0},\eta})>0. \tag{149}$$

According to Eq. (98), we know that $\mathscr{R}_{l}$ will decrease to $0$ exponentially fast in an $O(1)$ time window after $\tau=C\varepsilon^{-(l-1)/2l(l+1)}$, which contradicts our assumption (147). This proves that $\tau(\Delta)=O(\varepsilon^{-(l-1)/2l(l+1)})$ under setting (a). Next, we show that setting (b) can be reduced to setting (a). Under setting (b), let us denote

$$B_{\varepsilon_{0},\eta}=\left\{\omega:\forall\varepsilon<\varepsilon_{0},\ \sigma_{l}\varphi_{l}\widetilde{a}(\omega,0)\widetilde{s}(\omega,0)^{l}<0,\ \text{and}\ \widetilde{s}(\omega,0)^{2}>(l+\eta)\,\widetilde{a}(\omega,0)^{2}\right\}.$$

Then, similarly to the previous argument, there exist $\varepsilon_{0},\eta>0$ such that $\rho(B_{\varepsilon_{0},\eta})>0$, and furthermore we can choose $\varepsilon_{0}$ and $\eta$ to be arbitrarily small. For $\omega\in B_{\varepsilon_{0},\eta}$, we have

$$\partial_{\tau}(\widetilde{a}(\omega)^{2})<0,\quad \partial_{\tau}(\widetilde{s}(\omega)^{2})<0\quad \text{at}\ \tau=0.$$

Hence, both $\widetilde{a}(\omega)^{2}$ and $\widetilde{s}(\omega)^{2}$ will decrease at initialization. Moreover, Eq. (97) implies that

$$\partial_{\tau}(\widetilde{a}(\omega)^{2})=\frac{\partial_{\tau}(\widetilde{s}(\omega)^{2})}{l\left(1-\varepsilon^{2\beta_{l}}\widetilde{s}(\omega)^{2}\right)}=-\frac{\varepsilon^{-2\beta_{l}}}{l}\,\partial_{\tau}\log\left(1-\varepsilon^{2\beta_{l}}\widetilde{s}(\omega)^{2}\right).$$

Integrating both sides of the above equation, we obtain that

$$\widetilde{a}(\omega,0)^{2}-\widetilde{a}(\omega,\tau)^{2}=-\frac{\varepsilon^{-2\beta_{l}}}{l}\left(\log\left(1-\varepsilon^{2\beta_{l}}\widetilde{s}(\omega,0)^{2}\right)-\log\left(1-\varepsilon^{2\beta_{l}}\widetilde{s}(\omega,\tau)^{2}\right)\right), \tag{150}$$

which is close to $\left(\widetilde{s}(\omega,0)^{2}-\widetilde{s}(\omega,\tau)^{2}\right)/l$ as long as $\widetilde{s}(\omega,\tau)=O(1)$. To be precise, let us define

$$\tau_{a,\omega}=\inf\{\tau\geq 0:\widetilde{a}(\omega,\tau)=0\},$$

then we know that $\widetilde{s}(\omega,\tau_{a,\omega})=\Omega(\varepsilon^{1/2l(l+1)})$ and $\tau_{a,\omega}=O(\varepsilon^{-(l-1)/2l(l+1)})$ under the assumption (147), where the latter claim can be proved via the change of variables $\widetilde{a}^{\prime}(\omega)=\varepsilon^{-1/2l(l+1)}\widetilde{a}(\omega)$ and $\widetilde{s}^{\prime}(\omega)=\varepsilon^{-1/2l(l+1)}\widetilde{s}(\omega)$. Note that after the time point $\tau_{a,\omega}$, the sign of $\widetilde{a}(\omega)$ changes. Hence, $\varphi_{l}\sigma_{l}\widetilde{a}(\omega)\widetilde{s}(\omega)^{l}>0$, and $\widetilde{a}(\omega,\tau)^{2}$ and $\widetilde{s}(\omega,\tau)^{2}$ will begin to increase for $\tau\geq\tau_{a,\omega}$. Similarly, we can show that in $O(\varepsilon^{-(l-1)/2l(l+1)})$ time after $\tau_{a,\omega}$, both $\widetilde{a}(\omega)$ and $\widetilde{s}(\omega)$ become of order $\varepsilon^{1/2l(l+1)}$, and we still have $\varphi_{l}\sigma_{l}\widetilde{a}(\omega)\widetilde{s}(\omega)^{l}>0$. This reduces case (b) to case (a).
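The sign change of $\widetilde{a}(\omega)$ and the conservation law behind Eq. (150) can be illustrated numerically. The sketch below is a toy experiment under strong simplifying assumptions: $\rho$ is taken to be a single point mass (one neuron), the per-neuron dynamics are read off from the displayed derivatives of $\widetilde{a}(\omega)^{2}$ and $\widetilde{s}(\omega)^{2}$ above, and `kappa` stands in for $\varepsilon^{2\beta_{l}}$. All numerical values and function names are hypothetical.

```python
import math

# Toy check of the conservation law behind Eq. (150) for a single neuron.
# Assumptions (not from the paper): rho is one point mass, kappa plays the
# role of eps**(2*beta_l), and all numerical values are illustrative.

def rhs(a, s, l, sigma, phi, kappa):
    """Time derivatives of (a, s) implied by the displayed d(a^2), d(s^2)."""
    d = phi - sigma * a * s ** l                      # residual phi_l - sigma_l * a * s^l
    da = sigma * s ** l * d
    ds = l * sigma * a * s ** (l - 1) * (1.0 - kappa * s * s) * d
    return da, ds

def invariant(a, s, l, kappa):
    """a^2 + log(1 - kappa s^2) / (l kappa), constant along the flow by (150)."""
    return a * a + math.log(1.0 - kappa * s * s) / (l * kappa)

def rk4(a, s, l, sigma, phi, kappa, tau_end, n_steps):
    """Classical fourth-order Runge-Kutta integration of the two-dimensional ODE."""
    h = tau_end / n_steps
    for _ in range(n_steps):
        k1a, k1s = rhs(a, s, l, sigma, phi, kappa)
        k2a, k2s = rhs(a + 0.5 * h * k1a, s + 0.5 * h * k1s, l, sigma, phi, kappa)
        k3a, k3s = rhs(a + 0.5 * h * k2a, s + 0.5 * h * k2s, l, sigma, phi, kappa)
        k4a, k4s = rhs(a + h * k3a, s + h * k3s, l, sigma, phi, kappa)
        a += h * (k1a + 2 * k2a + 2 * k3a + k4a) / 6.0
        s += h * (k1s + 2 * k2s + 2 * k3s + k4s) / 6.0
    return a, s

l, sigma, phi, kappa = 3, 1.0, 1.0, 1e-2
a0, s0 = -0.1, 0.3        # setting (b): sigma*phi*a0*s0**l < 0 and s0**2 > l * a0**2

a_end, s_end = rk4(a0, s0, l, sigma, phi, kappa, tau_end=50.0, n_steps=50000)
drift = abs(invariant(a_end, s_end, l, kappa) - invariant(a0, s0, l, kappa))
risk_end = 0.5 * (phi - sigma * a_end * s_end ** l) ** 2
print("a_end:", a_end, "s_end:", s_end, "invariant drift:", drift, "risk:", risk_end)
```

In this toy run the initial sign of $\widetilde{a}$ is wrong, $\widetilde{a}$ crosses zero, and afterwards both coordinates grow with $\varphi_{l}\sigma_{l}\widetilde{a}\widetilde{s}^{l}>0$, which mirrors the reduction from case (b) to case (a).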

We have proven that under settings (a) and (b), $\tau(\Delta)=\Theta(\varepsilon^{-(l-1)/2l(l+1)})$ for any fixed $\Delta\in(0,\varphi_{l}^{2}/2)$. This means that some of the neurons $(\widetilde{a}(\omega),\widetilde{s}(\omega))$ become of order $\Omega(1)$ and the $l$-th component of the target function is learnt at a timescale of order $\varepsilon^{-(l-1)/2l(l+1)}$. Next, we show that if the probability measure $\rho$ is discrete, then the evolution of $\mathscr{R}_{l}$ actually happens in an $O(1)$ time window. It suffices to prove that, for any small constant $\Delta\in(0,\varphi_{l}^{2}/4)$,

$$\tau(\Delta)-\tau\left(\frac{\varphi_{l}^{2}}{2}-\Delta\right)=O(1) \tag{151}$$

as $\varepsilon\to 0$. Note that by continuity and monotonicity of $\mathscr{R}_{l}$, we have

$$\mathscr{R}_{l}(\tau(\Delta))=\Delta,\quad \mathscr{R}_{l}\left(\tau\left(\frac{\varphi_{l}^{2}}{2}-\Delta\right)\right)=\frac{\varphi_{l}^{2}}{2}-\Delta,\quad \text{and}\quad \tau(\Delta)\geq\tau\left(\frac{\varphi_{l}^{2}}{2}-\Delta\right).$$

By definition of $\mathscr{R}_{l}$, we know that $\forall\tau\geq\tau(\varphi_{l}^{2}/2-\Delta)$,

$$\left|\int\widetilde{a}(\nu,\tau)\widetilde{s}(\nu,\tau)^{l}\,\mathrm{d}\rho(\nu)\right|\geq\frac{1}{|\sigma_{l}|}\left(|\varphi_{l}|-\sqrt{\varphi_{l}^{2}-2\Delta}\right):=r_{l}(\Delta)>0.$$
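One way to see this bound is the following short derivation, which relies only on the monotonicity of $\mathscr{R}_{l}$ stated above and on the reverse triangle inequality. The shorthand $I(\tau)$ for the integral is introduced here purely for readability and is not notation from the paper.

```latex
% Shorthand (ours): I(\tau) = \int \widetilde{a}(\nu,\tau)\,\widetilde{s}(\nu,\tau)^{l}\,\mathrm{d}\rho(\nu).
\begin{align*}
\tau \ge \tau\!\left(\tfrac{\varphi_l^2}{2}-\Delta\right)
  &\;\Longrightarrow\; \mathscr{R}_l(\tau)
     = \tfrac{1}{2}\bigl(\varphi_l-\sigma_l I(\tau)\bigr)^2
     \le \tfrac{\varphi_l^2}{2}-\Delta \\
  &\;\Longrightarrow\; \bigl|\varphi_l-\sigma_l I(\tau)\bigr|
     \le \sqrt{\varphi_l^2-2\Delta} \\
  &\;\Longrightarrow\; |\sigma_l|\,|I(\tau)|
     \ge |\varphi_l|-\bigl|\varphi_l-\sigma_l I(\tau)\bigr|
     \ge |\varphi_l|-\sqrt{\varphi_l^2-2\Delta}.
\end{align*}
```

Dividing by $|\sigma_{l}|$ gives the stated lower bound, and $r_{l}(\Delta)>0$ because $\Delta>0$.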

Denote by $\{(\widetilde{a}_i,\widetilde{s}_i)\}_{i\in[m]}$ the realizations of $\{(\widetilde{a}(\omega),\widetilde{s}(\omega))\}_{\omega\in\Omega}$ under the discrete measure $\rho$, and by $\{p_i\}_{i\in[m]}$ the point masses of $\rho$. Then, we know that

$$
r_l(\Delta)\leq\left|\int\widetilde{a}(\nu,\tau)\widetilde{s}(\nu,\tau)^l\,\mathrm{d}\rho(\nu)\right|=\left|\sum_{j=1}^{m}p_j\widetilde{a}_j(\tau)\widetilde{s}_j(\tau)^l\right|\leq\sum_{j=1}^{m}p_j\left|\widetilde{a}_j(\tau)\widetilde{s}_j(\tau)^l\right|, \tag{152}
$$

which implies that there exists $j\in[m]$ such that $\left|\widetilde{a}_j(\tau)\widetilde{s}_j(\tau)^l\right|\geq r_l(\Delta)$. Applying Lemma 3 yields

\begin{align}
& \int\widetilde{s}(\omega)^{2(l-1)}\left(l^2\widetilde{a}(\omega)^2\left(1-\varepsilon^{2\beta_l}\widetilde{s}(\omega)^2\right)+\widetilde{s}(\omega)^2\right)\mathrm{d}\rho(\omega) \tag{153} \\
\geq\, & p_j\widetilde{s}_j^{2(l-1)}\left(l^2\widetilde{a}_j^2\left(1-\varepsilon^{2\beta_l}\widetilde{s}_j^2\right)+\widetilde{s}_j^2\right)\geq\min_{j\in[m]}p_j\cdot c(l,r_l(\Delta))>0. \tag{154}
\end{align}

It then follows from Eq. (98) that $\mathscr{R}_l$ will decrease to $0$ exponentially fast, and Eq. (151) holds consequently. This completes the proof for settings (a) and (b).

We now turn to case (c). By our assumption, for almost every $\omega$ there exists $\eta>0$ (which may depend on $\omega$) such that

$$
\sigma_l\varphi_l\widetilde{a}(\omega,0)\widetilde{s}(\omega,0)^l<-\eta\varepsilon^{1/2l},\quad \widetilde{s}(\omega,0)^2<(l-\eta)\widetilde{a}(\omega,0)^2
$$

for sufficiently small $\varepsilon$. Therefore, $\widetilde{s}(\omega,\tau)^2$ and $\widetilde{a}(\omega,\tau)^2$ will keep decreasing until one of them reaches $0$, which means that

$$
\int\widetilde{a}(\nu)\widetilde{s}(\nu)^l\,\mathrm{d}\rho(\nu)=o(1)\implies\left|\varphi_l-\sigma_l\int\widetilde{a}(\nu)\widetilde{s}(\nu)^l\,\mathrm{d}\rho(\nu)\right|\geq\frac{|\varphi_l|}{2}. \tag{155}
$$

According to Eq. (150) and the inequality $\widetilde{s}(\omega,0)^2<(l-\eta)\widetilde{a}(\omega,0)^2$, $\widetilde{a}(\omega,\tau)^2$ will not reach $0$ until $\widetilde{s}(\omega,\tau)^2$ reaches $0$. Furthermore, for any $\tau\geq 0$,

$$
\begin{aligned}
\widetilde{a}(\omega,\tau)^2=\, & \widetilde{a}(\omega,0)^2+\frac{\varepsilon^{-2\beta_l}}{l}\left(\log\left(1-\varepsilon^{2\beta_l}\widetilde{s}(\omega,0)^2\right)-\log\left(1-\varepsilon^{2\beta_l}\widetilde{s}(\omega,\tau)^2\right)\right) \\
\geq\, & \widetilde{a}(\omega,0)^2+\frac{\varepsilon^{-2\beta_l}}{l}\log\left(1-\varepsilon^{2\beta_l}\widetilde{s}(\omega,0)^2\right) \\
\geq\, & \frac{1}{l-\eta}\widetilde{s}(\omega,0)^2-\frac{1}{l-\eta/2}\widetilde{s}(\omega,0)^2\geq c(\eta,l,\omega)\varepsilon^{1/l(l+1)},
\end{aligned}
$$

thus leading to

\begin{align}
\partial_\tau(\widetilde{s}(\omega)^2)=\, & 2l\sigma_l\widetilde{a}(\omega)\widetilde{s}(\omega)^l\left(1-\varepsilon^{2\beta_l}\widetilde{s}(\omega)^2\right)\left(\varphi_l-\sigma_l\int\widetilde{a}(\nu)\widetilde{s}(\nu)^l\,\mathrm{d}\rho(\nu)\right) \tag{156} \\
\leq\, & -c(\eta,l,\omega)\varepsilon^{1/2l(l+1)}(\widetilde{s}(\omega)^2)^{l/2}. \tag{157}
\end{align}

Using again the comparison theorem for ODEs, we get that

$$
|\widetilde{s}(\omega,\tau)|\leq\left(|\widetilde{s}(\omega,0)|^{-l+2}+c(\eta,l,\omega)\varepsilon^{1/2l(l+1)}\tau\right)^{-1/(l-2)}. \tag{158}
$$

Since $\widetilde{s}(\omega,0)\asymp\varepsilon^{1/2l(l+1)}$, it follows immediately that for any $\Delta>0$, there exists a constant $C_*(\omega,\Delta)>0$ such that

$$
\tau\geq C_*(\omega,\Delta)\varepsilon^{-(l-1)/2l(l+1)}\implies|\widetilde{s}(\omega,\tau)|\leq\Delta\varepsilon^{1/2l(l+1)}. \tag{159}
$$

This completes the discussion for case (c), thus concluding the proof of Theorem 1.
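The comparison step behind Eq. (158) can be checked numerically. The sketch below is only an illustration under stated assumptions: it treats the extremal case of Eq. (157), in which $\partial_\tau(\widetilde{s}^2)=-K(\widetilde{s}^2)^{l/2}$ holds with equality, where the hypothetical constant `K` stands in for $c(\eta,l,\omega)\varepsilon^{1/2l(l+1)}$, and it assumes $l\geq 3$ so that the exponent $-1/(l-2)$ is well defined. For $\widetilde{s}>0$ this ODE reads $\partial_\tau\widetilde{s}=-(K/2)\widetilde{s}^{\,l-1}$, whose solution is $\widetilde{s}(\tau)=(\widetilde{s}(0)^{-(l-2)}+\tfrac{(l-2)K}{2}\tau)^{-1/(l-2)}$; the factor $(l-2)/2$ is absorbed into the constant in Eq. (158). The function names are illustrative and do not come from the paper.

```python
def closed_form_decay(s0: float, K: float, l: int, tau: float) -> float:
    """Exact solution of ds/dtau = -(K/2) * s**(l-1) with s(0) = s0 > 0, l >= 3."""
    return (s0 ** (-(l - 2)) + 0.5 * (l - 2) * K * tau) ** (-1.0 / (l - 2))


def rk4_decay(s0: float, K: float, l: int, tau_end: float, n_steps: int) -> float:
    """Integrate ds/dtau = -(K/2) * s**(l-1) with the classical RK4 scheme."""

    def rhs(s: float) -> float:
        return -0.5 * K * s ** (l - 1)

    h = tau_end / n_steps
    s = s0
    for _ in range(n_steps):
        k1 = rhs(s)
        k2 = rhs(s + 0.5 * h * k1)
        k3 = rhs(s + 0.5 * h * k2)
        k4 = rhs(s + h * k3)
        s += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return s


if __name__ == "__main__":
    s0, K, l, tau_end = 0.5, 1.0, 3, 10.0
    exact = closed_form_decay(s0, K, l, tau_end)
    numeric = rk4_decay(s0, K, l, tau_end, 2000)
    # A trajectory with a larger decay constant stays below the extremal one,
    # which is the content of the comparison argument.
    faster = rk4_decay(s0, 2.0 * K, l, tau_end, 2000)
    print(exact, numeric, faster)
```

With `s0 = 0.5`, `K = 1`, `l = 3` and `tau = 10`, the closed form evaluates to $1/7$, and the trajectory driven by the larger constant `2 * K` stays below it, as the comparison argument predicts.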

Lemma 3.

Let $r>0$ be a constant that does not depend on $\varepsilon$. Then there exists a constant $c=c(l,r)>0$ that only depends on $l$ and $r$ such that the following holds: For any $a>0$, $s>0$ satisfying $as^l\geq r$ and $\varepsilon^{2\beta_l}s^2\leq 1$, we have

$$
s^{2(l-1)}\left(l^2a^2\left(1-\varepsilon^{2\beta_l}s^2\right)+s^2\right)\geq c. \tag{160}
$$
Proof.

If $s\geq 1$, then we immediately get

$$
s^{2(l-1)}\left(l^2a^2\left(1-\varepsilon^{2\beta_l}s^2\right)+s^2\right)\geq s^{2l}\geq 1.
$$

Otherwise, $1-\varepsilon^{2\beta_l}s^2\geq 1/2$, and consequently

$$
\begin{aligned}
s^{2(l-1)}\left(l^2a^2\left(1-\varepsilon^{2\beta_l}s^2\right)+s^2\right)\geq\, & s^{2(l-1)}\left(\frac{l^2a^2}{2}+s^2\right)\geq s^{2(l-1)}\left(\frac{l^2r^2}{2s^{2l}}+s^2\right) \\
=\, & \frac{l^2r^2}{2s^2}+s^{2l}\geq c(l,r),
\end{aligned}
$$

where the last line follows from the AM-GM inequality. This completes the proof. ∎
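The final bound can be made concrete. The following is a minimal sketch, not the authors' construction: it computes the exact minimum over $s>0$ of $f(s)=\frac{l^2r^2}{2s^2}+s^{2l}$ by solving $f'(s)=0$, which gives $s_*^{2l+2}=lr^2/2$, and checks the bound on a grid. Taking $c(l,r)=\min\{1,f(s_*)\}$ is one admissible choice covering both cases of the proof; the function names are hypothetical.

```python
import math


def f(s: float, l: int, r: float) -> float:
    """The quantity l^2 r^2 / (2 s^2) + s^(2l) from the last line of the proof."""
    return l * l * r * r / (2.0 * s * s) + s ** (2 * l)


def global_min(l: int, r: float) -> float:
    """Minimum of f over s > 0; f is convex, so the critical point is the minimizer."""
    # f'(s) = -l^2 r^2 / s^3 + 2 l s^(2l-1) = 0  <=>  s^(2l+2) = l r^2 / 2
    s_star = (l * r * r / 2.0) ** (1.0 / (2 * l + 2))
    return f(s_star, l, r)


def lemma3_constant(l: int, r: float) -> float:
    """One admissible c(l, r): the case s >= 1 gives 1, the case s < 1 gives f(s) >= min f."""
    return min(1.0, global_min(l, r))


def bound_holds_on_grid(l: int, r: float, n: int = 2000) -> bool:
    """Check f(s) >= min f on a uniform grid of s in [0.001, 3.001]."""
    lower = global_min(l, r)
    grid = [0.001 + 3.0 * k / n for k in range(n + 1)]
    return all(f(s, l, r) >= lower - 1e-9 for s in grid)


if __name__ == "__main__":
    for l, r in [(1, math.sqrt(2.0)), (2, 0.5), (3, 1.0)]:
        print(l, r, lemma3_constant(l, r), bound_holds_on_grid(l, r))
```

For instance, with $l=1$ and $r=\sqrt{2}$ the minimizer is $s_*=1$ and the minimum value is $2$, so the constant returned by `lemma3_constant` is $1$.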

Appendix D Proofs of Theorems 2 and 3: learning with projected SGD

We prove Theorem 2, which bounds the distance between GF and projected SGD, in Sections D.1 through D.3; Section D.4 is devoted to the proof of Theorem 3. Throughout this section, $M$ denotes any constant that depends only on the $M_i$'s from Assumptions A1-A3, and its value may change from line to line. We start with an elementary lemma that establishes the Lipschitz continuity of the gradient flow trajectory:

Lemma 4 (A priori estimate).

There exists a constant $M$ that only depends on the $M_i$'s, such that for all $t\geq 0$, $\rho_t$ is supported on $[-M(1+t/\varepsilon),M(1+t/\varepsilon)]\times\mathbb{S}^{d-1}$, namely $|a_i(t)|\leq M(1+t/\varepsilon)$ for all $i\in[m]$. Moreover, for any $0\leq s\leq t$, we have

$$
\begin{aligned}
\sup_{j\in[m]}\left|a_j(s)-a_j(t)\right|\leq\, & \varepsilon^{-1}M(t-s), \\
\sup_{j\in[m]}\left\|u_j(s)-u_j(t)\right\|_2\leq\, & M(1+t/\varepsilon)^2(t-s).
\end{aligned}
$$
Proof.

First, notice that along the trajectory of gradient flow, the risk must be non-increasing. In fact, we have

$$
\partial_t\mathscr{R}=-m\sum_{i=1}^{m}\left(\varepsilon^{-2}(\partial_{a_i}\mathscr{R})^2+(\nabla_{u_i}\mathscr{R})^\top(I_d-u_iu_i^\top)(\nabla_{u_i}\mathscr{R})\right)\leq 0.
$$

Therefore, we obtain that

$$
\begin{aligned}
\left|\partial_ta_i\right|=\, & \varepsilon^{-1}\left|\mathbb{E}[y\sigma(\langle u_i,x\rangle)]-\frac{1}{m}\sum_{j=1}^{m}a_j\mathbb{E}[\sigma(\langle u_i,x\rangle)\sigma(\langle u_j,x\rangle)]\right|=\varepsilon^{-1}\left|\mathbb{E}\left[(y-\widehat{y})\sigma(\langle u_i,x\rangle)\right]\right| \\
\leq\, & \varepsilon^{-1}\mathbb{E}\left[(y-\widehat{y})^2\right]^{1/2}\mathbb{E}\left[\sigma(\langle u_i,x\rangle)^2\right]^{1/2}\leq\varepsilon^{-1}\sqrt{2\mathscr{R}(0)}\left\|\sigma\right\|_{L^2}\leq\varepsilon^{-1}M,
\end{aligned}
$$

where the first inequality in the last line is Cauchy-Schwarz, the second uses that the risk is non-increasing along the flow, and the last follows from our assumption. Since $|a_i(0)|\leq M$, we know that $|a_i(t)|\leq M(1+t/\varepsilon)$, and $|a_i(t)-a_i(s)|\leq\varepsilon^{-1}M(t-s)$. Moreover, according to Eq. (6), we have

$$
\left\|\partial_tu_i\right\|_2\leq|a_i|\left(\left\|V'\right\|_\infty+\left\|U'\right\|_\infty\sup_{j\in[m]}|a_j|\right)\leq M(1+t/\varepsilon)^2,
$$

thus leading to

$$
\left\|u_i(s)-u_i(t)\right\|_2\leq M(1+t/\varepsilon)^2(t-s),\quad\forall i\in[m].
$$

This completes the proof. ∎

In what follows, we define two discretized versions of Eqs. (5) and (6), namely the gradient descent (GD) and stochastic gradient descent (SGD) dynamics. They will serve as important intermediate objects for our proof.

  • Gradient descent: Let $\eta>0$ be the step size, and let the initialization be the same as for gradient flow: $(\tilde{a}_i(0),\tilde{u}_i(0))=(a_i(0),u_i(0))$ for all $i\in[m]$. For $k\in\mathbb{N}$, we have

    $$
    \begin{split}
    \tilde{a}_i(k+1)-\tilde{a}_i(k)=\, & -m\varepsilon^{-1}\eta\,\partial_{\tilde{a}_i(k)}\mathscr{R} \\
    =\, & \varepsilon^{-1}\eta\Bigg(V(\langle u_*,\tilde{u}_i(k)\rangle;\left\|u_*\right\|_2,\left\|\tilde{u}_i(k)\right\|_2) \\
    & \qquad\qquad-\frac{1}{m}\sum_{j=1}^{m}\tilde{a}_j(k)U(\langle\tilde{u}_i(k),\tilde{u}_j(k)\rangle;\left\|\tilde{u}_i(k)\right\|_2,\left\|\tilde{u}_j(k)\right\|_2)\Bigg) \\
    \tilde{u}_i(k+1)-\tilde{u}_i(k)=\, & -m\eta\left(I_d-\tilde{u}_i(k)\tilde{u}_i(k)^\top\right)\nabla_{\tilde{u}_i(k)}\mathscr{R} \\
    =\, & \eta\tilde{a}_i(k)\left(I_d-\tilde{u}_i(k)\tilde{u}_i(k)^\top\right)\Bigg(\nabla_{\tilde{u}_i(k)}V(\langle u_*,\tilde{u}_i(k)\rangle;\left\|u_*\right\|_2,\left\|\tilde{u}_i(k)\right\|_2) \\
    & \qquad-\frac{1}{m}\sum_{j=1}^{m}\tilde{a}_j(k)\nabla_{\tilde{u}_i(k)}U(\langle\tilde{u}_i(k),\tilde{u}_j(k)\rangle;\left\|\tilde{u}_i(k)\right\|_2,\left\|\tilde{u}_j(k)\right\|_2)\Bigg),
    \end{split}
    $$
start_ARG italic_u end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_k ) , over~ start_ARG italic_u end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_k ) ⟩ ; ∥ over~ start_ARG italic_u end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_k ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , ∥ over~ start_ARG italic_u end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_k ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ) , end_CELL end_ROW (161)

    where we recall from Eq.s (108) and (109):

    \begin{align*}
    V(\langle u_*, \tilde{u}_i(k) \rangle; \|u_*\|_2, \|\tilde{u}_i(k)\|_2) =\, & \mathbb{E}\left[ \varphi(\langle u_*, x \rangle)\, \sigma(\langle \tilde{u}_i(k), x \rangle) \right] = \mathbb{E}\left[ y\, \sigma(\langle \tilde{u}_i(k), x \rangle) \right], \\
    \nabla_{\tilde{u}_i(k)} V(\langle u_*, \tilde{u}_i(k) \rangle; \|u_*\|_2, \|\tilde{u}_i(k)\|_2) =\, & \mathbb{E}\left[ \varphi(\langle u_*, x \rangle)\, \sigma'(\langle \tilde{u}_i(k), x \rangle)\, x \right] = \mathbb{E}\left[ y\, \sigma'(\langle \tilde{u}_i(k), x \rangle)\, x \right], \\
    U(\langle \tilde{u}_i(k), \tilde{u}_j(k) \rangle; \|\tilde{u}_i(k)\|_2, \|\tilde{u}_j(k)\|_2) =\, & \mathbb{E}\left[ \sigma(\langle \tilde{u}_i(k), x \rangle)\, \sigma(\langle \tilde{u}_j(k), x \rangle) \right], \\
    \nabla_{\tilde{u}_i(k)} U(\langle \tilde{u}_i(k), \tilde{u}_j(k) \rangle; \|\tilde{u}_i(k)\|_2, \|\tilde{u}_j(k)\|_2) =\, & \mathbb{E}\left[ x\, \sigma'(\langle \tilde{u}_i(k), x \rangle)\, \sigma(\langle \tilde{u}_j(k), x \rangle) \right].
    \end{align*}

    By convention, we have $V(s) = V(s; 1, 1)$ and $U(s) = U(s; 1, 1)$ for $s \in [-1, 1]$.

  • One-pass stochastic gradient descent: Under the same choice of step size and initialization, let $\{(x_k, y_k)\}_{k \in \mathbb{N}^*}$ be i.i.d. samples from $\mathrm{P} \in \mathscr{P}(\mathbb{R}^d \times \mathbb{R})$, where

    \[
    \mathrm{P} = \operatorname{Law}(x, y), \quad x \sim \mathsf{N}(0, I_d), \ y = \varphi(\langle u_*, x \rangle).
    \]

    The iteration equations for one-pass SGD read:

    \[
    \begin{split}
    \underline{a}_i(k+1) - \underline{a}_i(k) =\, & \varepsilon^{-1} \eta \left( y_{k+1} - \frac{1}{m} \sum_{j=1}^{m} \underline{a}_j(k)\, \sigma(\langle \underline{u}_j(k), x_{k+1} \rangle) \right) \sigma(\langle \underline{u}_i(k), x_{k+1} \rangle), \\
    \underline{u}_i(k+1) - \underline{u}_i(k) =\, & \eta\, \underline{a}_i(k) \left( I_d - \underline{u}_i(k) \underline{u}_i(k)^{\top} \right) \left( y_{k+1} - \frac{1}{m} \sum_{j=1}^{m} \underline{a}_j(k)\, \sigma(\langle \underline{u}_j(k), x_{k+1} \rangle) \right) \\
    & \times \sigma'(\langle \underline{u}_i(k), x_{k+1} \rangle)\, x_{k+1}.
    \end{split}
    \tag{162}
    \]

    Note that Eq. (162) can also be written as:

    \begin{align*}
    \underline{a}_i(k+1) =\, & \underline{a}_i(k) + \varepsilon^{-1} \eta\, \widehat{F}_i(\underline{\rho}^{(m)}(k); z_{k+1}), \\
    \underline{u}_i(k+1) =\, & \underline{u}_i(k) + \eta \left( I_d - \underline{u}_i(k) \underline{u}_i(k)^{\top} \right) \widehat{G}_i(\underline{\rho}^{(m)}(k); z_{k+1}).
    \end{align*}
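The one-pass SGD recursion (162) can be sketched in a few lines of code. The following is a minimal illustration under stated assumptions, not the authors' implementation: the function names `sgd_step`, `random_unit_vector` and `run_sgd` are hypothetical, the activation $\sigma$ and the link function $\varphi$ are both taken to be `tanh` purely for concreteness, and the initialization (zero second-layer weights, first-layer weights uniform on the unit sphere) is assumed here rather than taken from the text.

```python
import math
import random


def sgd_step(a, u, x, y, eta, eps, sigma=math.tanh,
             dsigma=lambda t: 1.0 - math.tanh(t) ** 2):
    """One step of recursion (162) on a fresh sample (x, y).

    a: list of m second-layer weights; u: list of m vectors in R^d.
    sigma / dsigma default to tanh and its derivative (an assumption).
    """
    m = len(a)
    # Pre-activations <u_j, x>, all evaluated at the current iterate k.
    pre = [sum(ul * xl for ul, xl in zip(uj, x)) for uj in u]
    # Residual y - (1/m) * sum_j a_j * sigma(<u_j, x>).
    resid = y - sum(aj * sigma(p) for aj, p in zip(a, pre)) / m
    new_a, new_u = [], []
    for ai, ui, p in zip(a, u, pre):
        # Second-layer weights move on the fast time scale eta / eps.
        new_a.append(ai + (eta / eps) * resid * sigma(p))
        # (I_d - u u^T) x = x - <u, x> u; the old a_i(k) is used, as in (162).
        coef = eta * ai * resid * dsigma(p)
        new_u.append([ul + coef * (xl - p * ul) for ul, xl in zip(ui, x)])
    return new_a, new_u


def random_unit_vector(d, rng):
    """Uniform direction on the unit sphere, via a normalized Gaussian."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def run_sgd(d=5, m=4, steps=200, eta=0.01, eps=0.1, phi=math.tanh, seed=0):
    """Run one-pass SGD, drawing one fresh sample x ~ N(0, I_d) per step."""
    rng = random.Random(seed)
    u_star = random_unit_vector(d, rng)
    a = [0.0] * m                                        # assumed initialization
    u = [random_unit_vector(d, rng) for _ in range(m)]   # assumed initialization
    for _ in range(steps):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        y = phi(sum(s * xl for s, xl in zip(u_star, x)))  # y = phi(<u_*, x>)
        a, u = sgd_step(a, u, x, y, eta, eps)
    return a, u
```

The sketch follows Eq. (162) literally: the increment of each $\underline{u}_i$ is multiplied by $I_d - \underline{u}_i \underline{u}_i^{\top}$, but the iterate is not renormalized afterwards, so $\|\underline{u}_i(k)\|_2$ need not remain exactly equal to $1$.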

D.1 Difference between GF and GD

For notational simplicity, we denote $\theta_i(t) = (a_i(t), u_i(t))$ for $i \in [m]$ and $t \geq 0$, and

\[
\rho^{(m)}(t) = \frac{1}{m} \sum_{i=1}^{m} \delta_{\theta_i(t)} = \frac{1}{m} \sum_{i=1}^{m} \delta_{(a_i(t), u_i(t))}.
\]

Similarly, $\tilde{\theta}_i(k) = (\tilde{a}_i(k), \tilde{u}_i(k))$, and

\[
\tilde{\rho}^{(m)}(k) = \frac{1}{m} \sum_{i=1}^{m} \delta_{\tilde{\theta}_i(k)} = \frac{1}{m} \sum_{i=1}^{m} \delta_{(\tilde{a}_i(k), \tilde{u}_i(k))}.
\]

Moreover, for $\theta = (a, u)$ and $\rho \in \mathscr{P}(\mathbb{R} \times \mathbb{R}^d)$, we define the following two functionals:

\begin{align*}
F(\theta, \rho) =\, & V(\langle u_*, u \rangle; \|u_*\|_2, \|u\|_2) - \int_{\mathbb{R} \times \mathbb{R}^d} a'\, U(\langle u, u' \rangle; \|u\|_2, \|u'\|_2)\, \rho(\mathrm{d}a', \mathrm{d}u'), \\
G(\theta, \rho) =\, & a \left( I_d - u u^{\top} \right) \left( \nabla_u V(\langle u_*, u \rangle; \|u_*\|_2, \|u\|_2) - \int_{\mathbb{R} \times \mathbb{R}^d} a'\, \nabla_u U(\langle u, u' \rangle; \|u\|_2, \|u'\|_2)\, \rho(\mathrm{d}a', \mathrm{d}u') \right),
\end{align*}

and $H_{\varepsilon}(\theta, \rho) = (\varepsilon^{-1} F(\theta, \rho), G(\theta, \rho))$. Then, Eq.s (5) and (6) and Eq. (161) can be rewritten as

\[
\frac{\mathrm{d}}{\mathrm{d}t} \theta_i(t) = H_{\varepsilon}(\theta_i(t), \rho^{(m)}(t)), \quad \tilde{\theta}_i(k+1) - \tilde{\theta}_i(k) = \eta H_{\varepsilon}(\tilde{\theta}_i(k), \tilde{\rho}^{(m)}(k)),
\]

respectively. The lemma below will be used several times in the proof.
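
To make the identification with Eq. (161) explicit, note that integrating against an empirical measure reduces to averaging over neurons. The following short check uses only the definitions of $F$, $G$ and $\rho^{(m)}$ given above:

```latex
\begin{align*}
F(\theta_i, \rho^{(m)}) =\, & V(\langle u_*, u_i \rangle; \|u_*\|_2, \|u_i\|_2)
  - \frac{1}{m} \sum_{j=1}^{m} a_j\, U(\langle u_i, u_j \rangle; \|u_i\|_2, \|u_j\|_2), \\
G(\theta_i, \rho^{(m)}) =\, & a_i \left( I_d - u_i u_i^{\top} \right)
  \left( \nabla_{u_i} V(\langle u_*, u_i \rangle; \|u_*\|_2, \|u_i\|_2)
  - \frac{1}{m} \sum_{j=1}^{m} a_j\, \nabla_{u_i} U(\langle u_i, u_j \rangle; \|u_i\|_2, \|u_j\|_2) \right).
\end{align*}
```

Replacing $\theta_i$ by $\tilde{\theta}_i(k)$ and $\rho^{(m)}$ by $\tilde{\rho}^{(m)}(k)$, the two components of $\eta H_{\varepsilon} = (\varepsilon^{-1} \eta F, \eta G)$ are exactly the right-hand sides of the two updates in Eq. (161).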

Lemma 5.

Let $\rho^{(m)} = (1/m) \sum_{i=1}^{m} \delta_{\theta_i}$ and $\rho^{\prime(m)} = (1/m) \sum_{i=1}^{m} \delta_{\theta'_i}$. If $\|u_i\|_2 \leq C$ and $\|u'_i\|_2 \leq C$ for all $i \in [m]$ (where $C$ is any fixed absolute constant; for example, we can take $C = 2$), then we have

\begin{align}
\left| F(\theta_i, \rho^{(m)}) - F(\theta'_i, \rho^{\prime(m)}) \right| \leq\, & M \left( \left( 1 + \sup_{j \in [m]} |a_j| \right) \cdot \sup_{j \in [m]} \|u_j - u'_j\|_2 + \sup_{j \in [m]} |a_j - a'_j| \right), \tag{163} \\
\left\| G(\theta_i, \rho^{(m)}) - G(\theta'_i, \rho^{\prime(m)}) \right\|_2 \leq\, & M \cdot \left( 1 + \sup_{j \in [m]} |a_j| \right)^2 \cdot \sup_{j \in [m]} \|u_j - u'_j\|_2 \tag{164} \\
& + M \cdot \sup_{j \in [m]} |a_j - a'_j| \cdot \left( 1 + \sup_{j \in [m]} |a_j| + \sup_{j \in [m]} |a_j - a'_j| \right), \tag{165}
\end{align}

where the constant $M$ only depends on the $M_i$'s. As a consequence, we obtain that

\begin{align*}
& \left\| H_{\varepsilon}(\theta_i, \rho^{(m)}) - H_{\varepsilon}(\theta'_i, \rho^{\prime(m)}) \right\|_2 \leq \varepsilon^{-1} \left| F(\theta_i, \rho^{(m)}) - F(\theta'_i, \rho^{\prime(m)}) \right| + \left\| G(\theta_i, \rho^{(m)}) - G(\theta'_i, \rho^{\prime(m)}) \right\|_2 \\
\leq\, & (\varepsilon^{-1} + 1) M \cdot \left( \left( 1 + \sup_{j \in [m]} |a_j| \right)^2 \cdot \sup_{j \in [m]} \|u_j - u'_j\|_2 + \sup_{j \in [m]} |a_j - a'_j| \cdot \left( 1 + \sup_{j \in [m]} |a_j| + \sup_{j \in [m]} |a_j - a'_j| \right) \right) \\
\leq\, & (\varepsilon^{-1} + 1) M \cdot \left( \left( 1 + \sup_{j \in [m]} |a_j| \right)^2 + \sup_{j \in [m]} \|\theta_j - \theta'_j\|_2 \right) \cdot \sup_{j \in [m]} \|\theta_j - \theta'_j\|_2.
\end{align*}
Proof.

First, by the triangle inequality, we have

\begin{align*}
& \left| F(\theta_i, \rho^{(m)}) - F(\theta'_i, \rho^{\prime(m)}) \right| \leq \left| V(\langle u_*, u_i \rangle; \left\| u_* \right\|_2, \left\| u_i \right\|_2) - V(\langle u_*, u'_i \rangle; \left\| u_* \right\|_2, \left\| u'_i \right\|_2) \right| \\
& \qquad + \frac{1}{m} \sum_{j=1}^{m} \left| a_j U(\langle u_i, u_j \rangle; \left\| u_i \right\|_2, \left\| u_j \right\|_2) - a'_j U(\langle u'_i, u'_j \rangle; \left\| u'_i \right\|_2, \left\| u'_j \right\|_2) \right| \\
\leq\, & \left\| \nabla V \right\|_{\infty} \left\| u_i - u'_i \right\|_2 + \frac{\left\| U \right\|_{\infty}}{m} \sum_{j=1}^{m} \left| a_j - a'_j \right| + \frac{\left\| \nabla U \right\|_{\infty}}{m} \sum_{j=1}^{m} |a_j| \cdot \left( \left\| u_i - u'_i \right\|_2 + \left\| u_j - u'_j \right\|_2 \right) \\
\leq\, & M \left( \left( 1 + \sup_{j \in [m]} |a_j| \right) \left\| u_i - u'_i \right\|_2 + \sup_{j \in [m]} \left| a_j - a'_j \right| + \sup_{j \in [m]} |a_j| \cdot \sup_{j \in [m]} \left\| u_j - u'_j \right\|_2 \right) \\
\leq\, & M \left( \left( 1 + \sup_{j \in [m]} |a_j| \right) \cdot \sup_{j \in [m]} \left\| u_j - u'_j \right\|_2 + \sup_{j \in [m]} \left| a_j - a'_j \right| \right).
\end{align*}
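
The second inequality in the display above rests on an add-and-subtract step for each summand. A sketch of that step is given below; the shorthand $U_{ij}$, $U'_{ij}$ is introduced here only for readability and is not the paper's notation.

```latex
% Shorthand (an assumption of this sketch, not the paper's notation):
%   U_{ij}  = U(<u_i, u_j>;  ||u_i||_2,  ||u_j||_2),
%   U'_{ij} = U(<u'_i, u'_j>; ||u'_i||_2, ||u'_j||_2).
\begin{align*}
\left| a_j U_{ij} - a'_j U'_{ij} \right|
&= \left| (a_j - a'_j)\, U'_{ij} + a_j \left( U_{ij} - U'_{ij} \right) \right| \\
&\leq \left\| U \right\|_{\infty} \left| a_j - a'_j \right|
   + |a_j| \cdot \left| U_{ij} - U'_{ij} \right| .
\end{align*}
```

The remaining difference $|U_{ij} - U'_{ij}|$ is then controlled through $\left\| \nabla U \right\|_{\infty}$ and the displacements $\left\| u_i - u'_i \right\|_2 + \left\| u_j - u'_j \right\|_2$, with any norm-dependent factors presumably absorbed into the generic constant $M$ in the following lines.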

Second, using the triangle inequality again, we deduce that

\begin{align*}
& \left\| G(\theta_i, \rho^{(m)}) - G(\theta'_i, \rho^{\prime(m)}) \right\|_2 \\
\stackrel{(i)}{\leq}\, & 2C \left| a_i \right| \left\| u_i - u'_i \right\|_2 \left( \left\| \nabla V \right\|_{\infty} + \frac{\left\| \nabla U \right\|_{\infty}}{m} \sum_{j=1}^{m} \left| a_j \right| \right) + C \left| a_i \right| \cdot \frac{\left\| \nabla U \right\|_{\infty}}{m} \sum_{j=1}^{m} \left| a_j \right| \cdot \left\| u_j - u'_j \right\|_2 \\
& + C \left| a_i \right| \cdot \left( \left\| \nabla^2 V \right\|_{\infty} \left\| u_i - u'_i \right\|_2 + \frac{\left\| \nabla^2 U \right\|_{\infty}}{m} \sum_{j=1}^{m} \left| a_j \right| \left( \left\| u_i - u'_i \right\|_2 + \left\| u_j - u'_j \right\|_2 \right) \right) \\
& + C \left| a_i - a'_i \right| \left( \left\| \nabla V \right\|_{\infty} + \frac{\left\| \nabla U \right\|_{\infty}}{m} \sum_{j=1}^{m} \left| a_j \right| \right) + C \left( \left| a_i \right| + \left| a'_i - a_i \right| \right) \cdot \frac{\left\| \nabla U \right\|_{\infty}}{m} \sum_{j=1}^{m} \left| a_j - a'_j \right| \\
\leq\, & 5M \cdot \left( 1 + \sup_{j \in [m]} \left| a_j \right| \right)^2 \cdot \sup_{j \in [m]} \left\| u_j - u'_j \right\|_2 + M \cdot \sup_{j \in [m]} \left| a_j - a'_j \right| \cdot \left( 1 + 2 \sup_{j \in [m]} \left| a_j \right| + \sup_{j \in [m]} \left| a_j - a'_j \right| \right) \\
\leq\, & M \cdot \left( 1 + \sup_{j \in [m]} \left| a_j \right| \right)^2 \cdot \sup_{j \in [m]} \left\| u_j - u'_j \right\|_2 + M \cdot \sup_{j \in [m]} \left| a_j - a'_j \right| \cdot \left( 1 + \sup_{j \in [m]} \left| a_j \right| + \sup_{j \in [m]} \left| a_j - a'_j \right| \right),
\end{align*}

where $(i)$ follows from the inequality $\left\| u_i u_i^{\top} - (u'_i)(u'_i)^{\top} \right\|_{\mathrm{op}} \leq 2C \left\| u_i - u'_i \right\|_2$, which is a result of the following direct calculation:

\begin{align*}
\left\| u_i u_i^{\top} - (u'_i)(u'_i)^{\top} \right\|_{\mathrm{op}} = \sup_{\left\| x \right\|_2 = 1} \left| \langle x, u_i \rangle^2 - \langle x, u'_i \rangle^2 \right| \leq 2C \sup_{\left\| x \right\|_2 = 1} \left| \langle x, u_i \rangle - \langle x, u'_i \rangle \right| = 2C \left\| u_i - u'_i \right\|_2 .
\end{align*}

This completes the proof of Lemma 5, since the “as a consequence” part follows naturally from the upper bounds obtained earlier. ∎
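
The rank-one inequality used in step $(i)$ can be sanity-checked numerically. The sketch below is not part of the paper: it assumes, as the factor $2C$ suggests, that $C$ is an upper bound on $\left\| u_i \right\|_2$ and $\left\| u'_i \right\|_2$, and all function names are hypothetical. It uses the closed form of the two nonzero eigenvalues of $u u^{\top} - v v^{\top}$ and reports the largest observed ratio between the two sides over random draws.

```python
import math
import random


def dot(x, y):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(x, y))


def op_norm_rank_one_diff(u, v):
    """Operator norm of u u^T - v v^T from its two nonzero eigenvalues."""
    nu, nv, uv = dot(u, u), dot(v, v), dot(u, v)
    # The nonzero eigenvalues solve lam^2 - (nu - nv) lam - (nu*nv - uv^2) = 0,
    # whose discriminant simplifies to (nu + nv)^2 - 4 uv^2.
    disc = max((nu + nv) ** 2 - 4.0 * uv ** 2, 0.0)
    return 0.5 * (abs(nu - nv) + math.sqrt(disc))


def random_vector_in_ball(dim, radius, rng):
    """Gaussian direction rescaled to a random norm in [0, radius]."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    scale = radius * rng.random() / math.sqrt(dot(g, g))
    return [scale * gi for gi in g]


def max_ratio(dim=5, C=2.0, trials=2000, seed=0):
    """Largest observed ||uu^T - vv^T||_op / (2 C ||u - v||_2), with ||u||, ||v|| <= C."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        u = random_vector_in_ball(dim, C, rng)
        v = random_vector_in_ball(dim, C, rng)
        diff = [a - b for a, b in zip(u, v)]
        dist = math.sqrt(dot(diff, diff))
        if dist > 1e-12:
            worst = max(worst, op_norm_rank_one_diff(u, v) / (2.0 * C * dist))
    return worst


if __name__ == "__main__":
    # A value of at most 1 is consistent with the inequality.
    print(max_ratio())
```

A ratio above 1 would contradict the bound; random sampling of this kind can only fail to find a counterexample and is no substitute for the calculation above.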

Lemma 6.

Following the notation and assumption of Lemma 5, we have

\begin{align*}
\left| \mathscr{R}(\rho^{(m)}) - \mathscr{R}(\rho^{\prime(m)}) \right| \leq M \cdot \left( \left( 1 + \sup_{j \in [m]} \left| a_j \right| \right)^2 + \sup_{j \in [m]} \left\| \theta_j - \theta'_j \right\|_2 \right) \cdot \sup_{j \in [m]} \left\| \theta_j - \theta'_j \right\|_2 .
\end{align*}
Proof.

By the definition of the risk function and the triangle inequality, we deduce that

\begin{align*}
\left| \mathscr{R}(\rho^{(m)}) - \mathscr{R}(\rho^{\prime(m)}) \right| \leq\, & \frac{1}{m} \sum_{i=1}^{m} \left| a_i V(\langle u_*, u_i \rangle; \left\| u_* \right\|_2, \left\| u_i \right\|_2) - a'_i V(\langle u_*, u'_i \rangle; \left\| u_* \right\|_2, \left\| u'_i \right\|_2) \right| \\
& + \frac{1}{m^2} \sum_{i,j=1}^{m} \left| a_i a_j U(\langle u_i, u_j \rangle; \left\| u_i \right\|_2, \left\| u_j \right\|_2) - a'_i a'_j U(\langle u'_i, u'_j \rangle; \left\| u'_i \right\|_2, \left\| u'_j \right\|_2) \right| \\
\leq\, & \frac{\left\| \nabla V \right\|_{\infty}}{m} \sum_{i=1}^{m} \left| a_i \right| \left\| u_i - u'_i \right\|_2 + \frac{\left\| V \right\|_{\infty}}{m} \sum_{i=1}^{m} \left| a_i - a'_i \right| \\
& + \frac{\left\| U \right\|_{\infty}}{m^2} \sum_{i,j=1}^{m} \left( \left| a_i - a'_i \right| \left| a'_j \right| + \left| a_i \right| \left| a_j - a'_j \right| \right) \\
& + \frac{\left\| \nabla U \right\|_{\infty}}{m^2} \sum_{i,j=1}^{m} \left| a_i \right| \left| a_j \right| \cdot \left( \left\| u_i - u'_i \right\|_2 + \left\| u_j - u'_j \right\|_2 \right) \\
\leq\, & M \cdot \left( 1 + \sup_{j \in [m]} \left| a_j \right| \right)^2 \cdot \sup_{j \in [m]} \left\| u_j - u'_j \right\|_2 \\
& + M \cdot \sup_{j \in [m]} \left| a_j - a'_j \right| \cdot \left( 1 + \sup_{j \in [m]} \left| a_j \right| + \sup_{j \in [m]} \left| a_j - a'_j \right| \right) \\
\leq\, & M \cdot \left( \left( 1 + \sup_{j \in [m]} \left| a_j \right| \right)^2 + \sup_{j \in [m]} \left\| \theta_j - \theta'_j \right\|_2 \right) \cdot \sup_{j \in [m]} \left\| \theta_j - \theta'_j \right\|_2 .
\end{align*}

This concludes the proof. ∎

First, let us define the error function

$$
\Delta(t)=\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left\|\tilde{\theta}_{i}(k)-\theta_{i}(k\eta)\right\|_{2},
$$

and the stopping time $T_{\Delta}=\inf\{t\geq 0:\Delta(t)\geq 1\}$. For $k\in\mathbb{N}$ and $t=k\eta\leq T_{\Delta}$, we have the following estimate:

$$
\begin{aligned}
\left\|\theta_{i}(t)-\tilde{\theta}_{i}(k)\right\|_{2}\leq\,&\int_{0}^{t}\left\|H_{\varepsilon}\left(\theta_{i}(s),\rho^{(m)}(s)\right)-H_{\varepsilon}\left(\tilde{\theta}_{i}(\lfloor s/\eta\rfloor),\tilde{\rho}^{(m)}(\lfloor s/\eta\rfloor)\right)\right\|_{2}\mathrm{d}s\\
\leq\,&\int_{0}^{t}\left\|H_{\varepsilon}\left(\theta_{i}(s),\rho^{(m)}(s)\right)-H_{\varepsilon}\left(\theta_{i}(\eta\lfloor s/\eta\rfloor),\rho^{(m)}(\eta\lfloor s/\eta\rfloor)\right)\right\|_{2}\mathrm{d}s\\
&+\int_{0}^{t}\left\|H_{\varepsilon}\left(\theta_{i}(\eta\lfloor s/\eta\rfloor),\rho^{(m)}(\eta\lfloor s/\eta\rfloor)\right)-H_{\varepsilon}\left(\tilde{\theta}_{i}(\lfloor s/\eta\rfloor),\tilde{\rho}^{(m)}(\lfloor s/\eta\rfloor)\right)\right\|_{2}\mathrm{d}s.
\end{aligned}
$$

For any $s\in[0,t]$, by Lemma 4 and 5 we have (denote $[s]=\eta\lfloor s/\eta\rfloor$, and notice that we can take $C=2$ since $t\leq T_{\Delta}$)

$$
\begin{aligned}
&\left\|H_{\varepsilon}\left(\theta_{i}(s),\rho^{(m)}(s)\right)-H_{\varepsilon}\left(\theta_{i}([s]),\rho^{(m)}([s])\right)\right\|_{2}\\
\leq\,&(\varepsilon^{-1}+1)M(1+t/\varepsilon)^{4}(s-[s])\leq(\varepsilon^{-1}+1)M(1+t/\varepsilon)^{4}\eta.
\end{aligned}
$$

Using again Lemma 4 and 5, we obtain that

$$
\begin{aligned}
&\left\|H_{\varepsilon}\left(\theta_{i}(\eta\lfloor s/\eta\rfloor),\rho^{(m)}(\eta\lfloor s/\eta\rfloor)\right)-H_{\varepsilon}\left(\tilde{\theta}_{i}(\lfloor s/\eta\rfloor),\tilde{\rho}^{(m)}(\lfloor s/\eta\rfloor)\right)\right\|_{2}\\
\leq\,&(\varepsilon^{-1}+1)M(1+\varepsilon^{-1}s)^{2}\Delta(s)+(\varepsilon^{-1}+1)M\Delta(s)^{2},
\end{aligned}
$$

thus leading to

$$
\begin{aligned}
\Delta(t)\leq\,&(\varepsilon^{-1}+1)Mt(1+t/\varepsilon)^{4}\eta+(\varepsilon^{-1}+1)\int_{0}^{t}\left(M(1+\varepsilon^{-1}s)^{2}\Delta(s)+M\Delta(s)^{2}\right)\mathrm{d}s\\
\leq\,&(\varepsilon^{-1}+1)Mt(1+t/\varepsilon)^{4}\eta+(\varepsilon^{-1}+1)\int_{0}^{t}M(1+\varepsilon^{-1}s)^{2}\cdot\max\left(\Delta(s),\Delta(s)^{2}\right)\mathrm{d}s.
\end{aligned}
$$

For $s\leq t\leq T_{\Delta}$, we have $\Delta(s)^{2}\leq\Delta(s)$. Hence,

$$
\Delta(t)\leq(\varepsilon^{-1}+1)Mt(1+t/\varepsilon)^{4}\eta+(\varepsilon^{-1}+1)\int_{0}^{t}M(1+\varepsilon^{-1}s)^{2}\Delta(s)\,\mathrm{d}s.
$$

Applying Grönwall’s inequality yields

$$
\Delta(t)\leq(\varepsilon^{-1}+1)Mt(1+t/\varepsilon)^{4}\eta\cdot\exp\left((\varepsilon^{-1}+1)Mt(1+t/\varepsilon)^{2}\right)\leq M\exp\left((\varepsilon^{-1}+1)Mt(1+t/\varepsilon)^{2}\right)\eta.
$$
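For completeness, here is one way to fill in the two steps behind this display, offered as a sketch rather than as the authors' own argument: the integral form of Grönwall's inequality, followed by absorbing the polynomial prefactor into the exponential. The shorthand $\alpha$, $\beta$, $A$ and the enlarged constant $M'$ are notation introduced only for this sketch; the text appears to let $M$ change from one occurrence to the next, and $M'$ makes one such change explicit.

```latex
% Integral Gronwall: if \Delta(t) \le \alpha(t) + \int_0^t \beta(s)\Delta(s)\,ds,
% with \alpha nondecreasing and \beta \ge 0, then
% \Delta(t) \le \alpha(t)\exp\big(\int_0^t \beta(s)\,ds\big).
\begin{align*}
\alpha(t) &= (\varepsilon^{-1}+1)Mt(1+t/\varepsilon)^{4}\eta,
\qquad
\beta(s) = (\varepsilon^{-1}+1)M(1+\varepsilon^{-1}s)^{2},\\
\int_{0}^{t}\beta(s)\,\mathrm{d}s
&\le (\varepsilon^{-1}+1)Mt(1+t/\varepsilon)^{2} = MA,
\qquad
A := (\varepsilon^{-1}+1)\,t\,(1+t/\varepsilon)^{2}.\\
% Prefactor: t/\varepsilon \le A, hence (1+t/\varepsilon)^2 \le (1+A)^2 \le e^{2A}, and A \le e^{A}.
\alpha(t)/\eta &= MA(1+t/\varepsilon)^{2} \le M e^{A} e^{2A} = M e^{3A},\\
\Delta(t) &\le M e^{3A} e^{MA}\,\eta = M\exp\big((M+3)A\big)\eta
\le M'\exp\big((\varepsilon^{-1}+1)M't(1+t/\varepsilon)^{2}\big)\eta,
\qquad M' := M+3.
\end{align*}
```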

Therefore, for all $T\geq 0$ and $\eta\leq 1/\left(M\exp\left((\varepsilon^{-1}+1)MT(1+\varepsilon^{-1}T)^{2}\right)\right)$, we have

$$
\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left\|\tilde{\theta}_{i}(k)-\theta_{i}(k\eta)\right\|_{2}\leq M\exp\left((\varepsilon^{-1}+1)Mt(1+t/\varepsilon)^{2}\right)\eta\leq 1,\quad\forall t\leq\min(T,T_{\Delta}).
$$

This proves $T\leq T_{\Delta}$, and consequently

$$
\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left\|\tilde{\theta}_{i}(k)-\theta_{i}(k\eta)\right\|_{2}\leq M\exp\left((\varepsilon^{-1}+1)Mt(1+t/\varepsilon)^{2}\right)\eta,\quad\forall t\in[0,T],
$$

which immediately implies that

$$
\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left|\tilde{a}_{i}(k)\right|\leq M(1+t/\varepsilon)+1\leq M(1+t/\varepsilon).
$$

Finally, with the aid of Lemma 6, we get the following upper bound on the difference between the risk of gradient flow and gradient descent:

$$
\begin{aligned}
\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\left|\mathscr{R}(\rho^{(m)}(k\eta))-\mathscr{R}(\tilde{\rho}^{(m)}(k))\right|\leq\,& M\left(M^{2}(1+t/\varepsilon)^{2}+1\right)M\exp\left((\varepsilon^{-1}+1)Mt(1+t/\varepsilon)^{2}\right)\eta\\
\leq\,& M\exp\left((\varepsilon^{-1}+1)Mt(1+t/\varepsilon)^{2}\right)\eta.
\end{aligned}
$$

To summarize, we have the following:

Theorem 4 (Difference between GF and GD).

There exists a constant $M$ that only depends on the $M_{i}$'s, such that for any $T\geq 0$ and

$$
\eta\leq\frac{1}{M\exp\left((\varepsilon^{-1}+1)MT(1+\varepsilon^{-1}T)^{2}\right)},
$$

the following holds for all $t\in[0,T]$:

\begin{align}
\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left|\tilde{a}_{i}(k)\right|\leq\,& M(1+t/\varepsilon), \tag{166}\\
\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left\|\tilde{\theta}_{i}(k)-\theta_{i}(k\eta)\right\|_{2}\leq\,& M\exp\left((\varepsilon^{-1}+1)Mt(1+\varepsilon^{-1}T)^{2}\right)\eta, \tag{167}\\
\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\left|\mathscr{R}(\rho^{(m)}(k\eta))-\mathscr{R}(\tilde{\rho}^{(m)}(k))\right|\leq\,& M\exp\left((\varepsilon^{-1}+1)Mt(1+\varepsilon^{-1}T)^{2}\right)\eta. \tag{168}
\end{align}
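As a rough numerical illustration of the linear-in-$\eta$ scaling in (167), the following sketch compares explicit Euler steps (playing the role of GD) with the exact flow on a toy scalar ODE, $\dot{\theta}=-\theta$ with $\theta(0)=1$. The ODE, the horizon of 1, and the helper name `max_euler_gap` are illustrative assumptions and not the two-layer dynamics analyzed in this appendix.

```python
import math


def max_euler_gap(eta: float, horizon: float = 1.0) -> float:
    """Largest gap between Euler iterates and the exact flow of theta' = -theta.

    Toy stand-in (an assumption for illustration): the exact flow
    theta(t) = exp(-t) plays the role of gradient flow, and the Euler
    iterates play the role of gradient descent with step size eta.
    """
    num_steps = round(horizon / eta)
    theta_gd = 1.0  # discrete iterate at step k
    gap = 0.0
    for k in range(1, num_steps + 1):
        theta_gd -= eta * theta_gd        # one Euler / GD step
        theta_gf = math.exp(-k * eta)     # exact flow at time k * eta
        gap = max(gap, abs(theta_gd - theta_gf))
    return gap


if __name__ == "__main__":
    for eta in (0.1, 0.05, 0.025):
        print(f"eta={eta:<6} sup-gap={max_euler_gap(eta):.6f}")
```

Halving `eta` roughly halves the reported gap, which is what a bound of the form (horizon-dependent constant) times $\eta$ predicts.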

D.2 Difference between GD and SGD

The proof for this section is almost identical to Appendix C.5 in [33]. The only difference is that here we need to verify that $(I_{d}-uu^{\top})\sigma'(\langle u,x\rangle)x$ is an $M_{3}$-sub-Gaussian random vector. This follows from the identity $(I_{d}-uu^{\top})x=x-\langle u,x\rangle u$ and Assumption A3. We thus obtain the following interpolation bound between GD and SGD:

Theorem 5 (Difference between GD and SGD).

There exists a constant $M$ that only depends on the $M_{i}$'s, such that for any $T,z\geq 0$ and

$$
\eta\leq\frac{1}{(d+\log m+z^{2})M\exp\left((\varepsilon^{-1}+1)MT(1+\varepsilon^{-1}T)^{2}\right)},
$$

the following happens with probability at least $1-\exp(-z^{2})$: For all $t\in[0,T]$, we have

\begin{align}
\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left|\underline{a}_{i}(k)\right|\leq\,& M(1+t/\varepsilon), \tag{169}\\
\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left\|\tilde{\theta}_{i}(k)-\underline{\theta}_{i}(k)\right\|_{2}\leq\,& M\exp\left((\varepsilon^{-1}+1)Mt(1+\varepsilon^{-1}T)^{2}\right)\sqrt{\eta}\left(\sqrt{d+\log m}+z\right), \tag{170}\\
\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\left|\mathscr{R}(\underline{\rho}^{(m)}(k))-\mathscr{R}(\tilde{\rho}^{(m)}(k))\right|\leq\,& M\exp\left((\varepsilon^{-1}+1)Mt(1+\varepsilon^{-1}T)^{2}\right)\sqrt{\eta}\left(\sqrt{d+\log m}+z\right). \tag{171}
\end{align}
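The $\sqrt{\eta}$ rate in (170), as opposed to the rate $\eta$ in (167), can be seen on a toy example. The sketch below runs GD and SGD with additive Gaussian gradient noise on the scalar quadratic risk $\theta^{2}/2$ and estimates the root-mean-square gap at the final step. The quadratic risk, the noise model, and the name `rms_sgd_gd_gap` are assumptions chosen for illustration; this is not the setting of Theorem 5.

```python
import math
import random


def rms_sgd_gd_gap(eta: float, horizon: float = 1.0,
                   runs: int = 2000, seed: int = 0) -> float:
    """Root-mean-square gap between SGD and GD iterates at step horizon / eta.

    Toy model (an assumption): the risk is theta^2 / 2, so the exact gradient
    is theta; the stochastic gradient is theta + xi with xi ~ N(0, 1) drawn
    independently at every step.
    """
    rng = random.Random(seed)
    num_steps = round(horizon / eta)
    total_sq = 0.0
    for _run in range(runs):
        theta_gd = 1.0
        theta_sgd = 1.0
        for _step in range(num_steps):
            theta_gd -= eta * theta_gd
            theta_sgd -= eta * (theta_sgd + rng.gauss(0.0, 1.0))
        total_sq += (theta_sgd - theta_gd) ** 2
    return math.sqrt(total_sq / runs)


if __name__ == "__main__":
    for eta in (0.01, 0.0025):
        print(f"eta={eta:<7} rms-gap={rms_sgd_gd_gap(eta):.4f}")
```

Dividing `eta` by 4 should roughly halve the returned value, which is the $\sqrt{\eta}$ scaling. The factor $\sqrt{d+\log m}+z$ in (170) has no counterpart in this scalar toy.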

D.3 Difference between SGD and projected SGD

The aim of this section is to prove a coupling bound between the trajectory of SGD and that of projected SGD, thus finally leading to an upper bound on the difference between the risk of projected gradient flow and projected SGD. To begin with, let us fix $T,z\geq 0$ and choose

$$
\eta\leq\frac{1}{(d+\log m+z^{2})M\exp\left((\varepsilon^{-1}+1)MT(1+\varepsilon^{-1}T)^{2}\right)}
$$

as in Theorem 2, where $M$ is a large enough constant (to be determined later). Define

$$
T_{\theta}=\inf\left\{t\geq 0:\max_{k\in[0,t/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left|\overline{a}_{i}(k)\right|\geq 2M(1+t/\varepsilon),\ \text{or}\ \max_{k\in[0,t/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left\|\overline{u}_{i}(k)\right\|_{2}\geq 2\right\},
$$

then for $k\leq\min(T,T_{\theta})/\eta$ and $i\in[m]$, we have (note that here $t=k\eta$)

$$
\begin{aligned}
\left\|\widehat{G}_{i}(\overline{\rho}^{(m)}(k);z_{k+1})\right\|_{2}\leq\,& M\left|\overline{a}_{i}(k)\right|\left(1+\max_{i\in[m]}\left|\overline{a}_{i}(k)\right|\right)\left\|\sigma'(\langle\overline{u}_{i}(k),x_{k+1}\rangle)x_{k+1}\right\|_{2}\\
\leq\,& M(1+t/\varepsilon)^{2}\left\|\sigma'(\langle\overline{u}_{i}(k),x_{k+1}\rangle)x_{k+1}\right\|_{2}.
\end{aligned}
$$

Denoting k=σ(θ¯(0),z1,,zk)subscript𝑘𝜎¯𝜃0subscript𝑧1subscript𝑧𝑘\mathcal{F}_{k}=\sigma(\bar{\theta}(0),z_{1},\cdots,z_{k})caligraphic_F start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = italic_σ ( over¯ start_ARG italic_θ end_ARG ( 0 ) , italic_z start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , ⋯ , italic_z start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ), we know from Assumption A3 that, conditioning on ksubscript𝑘\mathcal{F}_{k}caligraphic_F start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, σ(u¯i(k),xk+1)xk+1superscript𝜎subscript¯𝑢𝑖𝑘subscript𝑥𝑘1subscript𝑥𝑘1\sigma^{\prime}(\langle\bar{u}_{i}(k),x_{k+1}\rangle)x_{k+1}italic_σ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( ⟨ over¯ start_ARG italic_u end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_k ) , italic_x start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT ⟩ ) italic_x start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT is an M3subscript𝑀3M_{3}italic_M start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT-sub-Gaussian random vector. By well-known results on Euclidean norm of sub-Gaussian random vectors (see, e.g., [27]), we know that there exists a constant M𝑀Mitalic_M satisfying

\[
\mathds{P}\left(\left\|\sigma^{\prime}(\langle\bar{u}_{i}(k),x_{k+1}\rangle)x_{k+1}\right\|_{2}\geq M\left(\sqrt{d}+\sqrt{\log(1/\delta)}\right)\right)\leq\delta.
\]

Choosing $\delta=\eta\exp(-z^{2})/(mT)$ and applying a union bound gives

\[
\mathds{P}\left(\max_{k\in[0,\min(T,T_{\theta})/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left\|\sigma^{\prime}(\langle\bar{u}_{i}(k),x_{k+1}\rangle)x_{k+1}\right\|_{2}\leq M\left(\sqrt{d+\log m}+z+T^{2}\right)\right)\geq 1-\exp(-z^{2}).
\]
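The bookkeeping behind this union bound can be sketched as follows. This is an illustrative expansion under simplifying assumptions, not part of the original argument: constants are not tracked, the number of pairs $(k,i)$ is treated as approximately $mT/\eta$, and we assume $T\geq\eta$ so that $\log(T/\eta)\geq 0$. With $\delta=\eta\exp(-z^{2})/(mT)$,

\begin{align*}
\#\{(k,i)\}\cdot\delta\approx\, & \frac{mT}{\eta}\cdot\frac{\eta\exp(-z^{2})}{mT}=\exp(-z^{2}),\\
\sqrt{\log(1/\delta)}=\, & \sqrt{\log m+\log(T/\eta)+z^{2}}\leq\sqrt{\log m}+\sqrt{\log(T/\eta)}+z,\\
\sqrt{d}+\sqrt{\log m}\leq\, & \sqrt{2}\,\sqrt{d+\log m}.
\end{align*}

The leftover term $\sqrt{\log(T/\eta)}$ is presumably what the $T^{2}$ term controls under the choice of $\eta$ fixed earlier in the proof; that step is not verified in this sketch.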

Therefore, with probability at least $1-\exp(-z^{2})$, for all $k\leq\min(T,T_{\theta})/\eta$ and $i\in[m]$, we have

\[
\left\|\widehat{G}_{i}(\overline{\rho}^{(m)}(k);z_{k+1})\right\|_{2}\leq M(1+t/\varepsilon)^{2}\left(\sqrt{d+\log m}+z+T^{2}\right).
\]

The above bound also holds along the trajectory of SGD, namely after replacing $\overline{\rho}^{(m)}(k)$ with $\underline{\rho}^{(m)}(k)$. Now, let us define the approximation error $\Delta_{i}(k)=\underline{u}_{i}(k)-\overline{u}_{i}(k)$ for $i\in[m]$ and $k\in\mathbb{N}$. We then get the following decomposition:

\[
\Delta_{i}(l)=\sum_{k=0}^{l-1}\left(\Delta_{i}(k+1)-\Delta_{i}(k)\right)=\sum_{k=0}^{l-1}\left(\mathbb{E}\left[\Delta_{i}(k+1)-\Delta_{i}(k)|\mathcal{F}_{k}\right]+Z_{i}(k+1)\right),
\]

where $Z_{i}(k+1)=\Delta_{i}(k+1)-\Delta_{i}(k)-\mathbb{E}\left[\Delta_{i}(k+1)-\Delta_{i}(k)|\mathcal{F}_{k}\right]$ has zero conditional mean given $\mathcal{F}_{k}$. With our choice of $\eta$, one can verify that as long as $\max(d,m,z)\to\infty$, Lemma 7 is applicable to

\[
u_{1}=\underline{u}_{i}(k),\ g_{1}=\widehat{G}_{i}(\underline{\rho}^{(m)}(k);z_{k+1}),\ u_{2}=\overline{u}_{i}(k),\ g_{2}=\widehat{G}_{i}(\overline{\rho}^{(m)}(k);z_{k+1}).
\]

Hence, we deduce from the definition of $\Delta_{i}(k)$ that

\begin{align*}
\Delta_{i}(k+1)-\Delta_{i}(k)=\, & \left(\underline{u}_{i}(k+1)-\underline{u}_{i}(k)\right)-\left(\overline{u}_{i}(k+1)-\overline{u}_{i}(k)\right)=(v_{1}-u_{1})-(v_{2}-u_{2})\\
=\, & \eta\left(\left(I_{d}-u_{1}u_{1}^{\top}\right)g_{1}-\left(I_{d}-u_{2}u_{2}^{\top}\right)g_{2}\right)+O\left(\eta^{2}\left\|g_{2}\right\|_{2}^{2}\right),
\end{align*}

thus leading to the following estimate:

\begin{align*}
\left\|\mathbb{E}\left[\Delta_{i}(k+1)-\Delta_{i}(k)|\mathcal{F}_{k}\right]\right\|_{2}\leq\, & \eta\left\|\mathbb{E}\left[\left(I_{d}-u_{2}u_{2}^{\top}\right)(g_{1}-g_{2})\Big|\mathcal{F}_{k}\right]\right\|_{2}+\eta\left\|\mathbb{E}\left[\left(u_{2}u_{2}^{\top}-u_{1}u_{1}^{\top}\right)g_{1}\Big|\mathcal{F}_{k}\right]\right\|_{2}\\
& +C\eta^{2}\mathbb{E}\left[\left\|g_{2}\right\|_{2}^{2}\Big|\mathcal{F}_{k}\right]\\
\stackrel{(i)}{\leq}\, & \eta\left\|\mathbb{E}\left[(g_{1}-g_{2})|\mathcal{F}_{k}\right]\right\|_{2}+C\eta\left\|u_{1}-u_{2}\right\|_{2}\left\|\mathbb{E}\left[g_{1}|\mathcal{F}_{k}\right]\right\|_{2}+C\eta^{2}\mathbb{E}\left[\left\|g_{2}\right\|_{2}^{2}\Big|\mathcal{F}_{k}\right]\\
=\, & \eta\left\|\mathbb{E}\left[\left(\widehat{G}_{i}(\underline{\rho}^{(m)}(k);z_{k+1})-\widehat{G}_{i}(\overline{\rho}^{(m)}(k);z_{k+1})\right)\Big|\mathcal{F}_{k}\right]\right\|_{2}\\
& +C\eta\left\|\overline{u}_{i}(k)-\underline{u}_{i}(k)\right\|_{2}\cdot\left\|\mathbb{E}\left[\widehat{G}_{i}(\underline{\rho}^{(m)}(k);z_{k+1})\Big|\mathcal{F}_{k}\right]\right\|_{2}\\
& +C\eta^{2}\mathbb{E}\left[\left\|\widehat{G}_{i}(\overline{\rho}^{(m)}(k);z_{k+1})\right\|_{2}^{2}\Big|\mathcal{F}_{k}\right],
\end{align*}

where $(i)$ is due to the fact that $u_{1},u_{2}\in\sigma(\mathcal{F}_{k})$, and $\left\|u_{1}u_{1}^{\top}-u_{2}u_{2}^{\top}\right\|_{\mathrm{op}}\leq C\left\|u_{1}-u_{2}\right\|_{2}$. According to the definition of $\widehat{G}_{i}$, we obtain that

\begin{align*}
& \mathbb{E}\left[\left(\widehat{G}_{i}(\underline{\rho}^{(m)}(k);z_{k+1})-\widehat{G}_{i}(\overline{\rho}^{(m)}(k);z_{k+1})\right)\Big|\mathcal{F}_{k}\right]\\
=\, & \underline{a}_{i}(k)\Big(\nabla_{\underline{u}_{i}(k)}V\left(\langle u_{*},\underline{u}_{i}(k)\rangle;\left\|u_{*}\right\|_{2},\left\|\underline{u}_{i}(k)\right\|_{2}\right)\\
& -\frac{1}{m}\sum_{j=1}^{m}\underline{a}_{j}(k)\nabla_{\underline{u}_{i}(k)}U\left(\langle\underline{u}_{i}(k),\underline{u}_{j}(k)\rangle;\left\|\underline{u}_{i}(k)\right\|_{2},\left\|\underline{u}_{j}(k)\right\|_{2}\right)\Big)\\
& -\overline{a}_{i}(k)\Big(\nabla_{\overline{u}_{i}(k)}V\left(\langle u_{*},\overline{u}_{i}(k)\rangle;\left\|u_{*}\right\|_{2},\left\|\overline{u}_{i}(k)\right\|_{2}\right)\\
& -\frac{1}{m}\sum_{j=1}^{m}\overline{a}_{j}(k)\nabla_{\overline{u}_{i}(k)}U\left(\langle\overline{u}_{i}(k),\overline{u}_{j}(k)\rangle;\left\|\overline{u}_{i}(k)\right\|_{2},\left\|\overline{u}_{j}(k)\right\|_{2}\right)\Big),
\end{align*}

thus leading to (using the same argument as in the proof of Lemma 5)

\begin{align*}
& \left\|\mathbb{E}\left[\left(\widehat{G}_{i}(\underline{\rho}^{(m)}(k);z_{k+1})-\widehat{G}_{i}(\overline{\rho}^{(m)}(k);z_{k+1})\right)\Big|\mathcal{F}_{k}\right]\right\|_{2}\leq M\left(1+\varepsilon^{-1}T\right)^{2}\cdot\sup_{j\in[m]}\left\|\overline{u}_{j}(k)-\underline{u}_{j}(k)\right\|_{2}\\
& \qquad+M\left(1+\varepsilon^{-1}T+\sup_{j\in[m]}\left|\overline{a}_{j}(k)-\underline{a}_{j}(k)\right|\right)\cdot\sup_{j\in[m]}\left|\overline{a}_{j}(k)-\underline{a}_{j}(k)\right|,
\end{align*}

and

\[
\left\|\mathbb{E}\left[\widehat{G}_{i}(\underline{\rho}^{(m)}(k);z_{k+1})\Big|\mathcal{F}_{k}\right]\right\|_{2}\leq M(1+\varepsilon^{-1}T)^{2}.
\]

Moreover, by (conditional) sub-Gaussianity of the $\widehat{G}_{i}$’s, we know that

\[
\mathbb{E}\left[\left\|\widehat{G}_{i}(\overline{\rho}^{(m)}(k);z_{k+1})\right\|_{2}^{2}\Big|\mathcal{F}_{k}\right]\leq M^{2}(1+\varepsilon^{-1}T)^{4}\,\mathbb{E}\left[\left\|\sigma^{\prime}(\langle\bar{u}_{i}(k),x_{k+1}\rangle)x_{k+1}\right\|_{2}^{2}|\mathcal{F}_{k}\right]\leq M(1+\varepsilon^{-1}T)^{4}d.
\]
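The order-$d$ scaling of the second moment can be checked numerically in a toy case. The sketch below is only an illustration under assumptions that are not taken from the paper: it uses standard Gaussian covariates, $\sigma=\tanh$ (so that $|\sigma^{\prime}|\leq 1$), and a fixed unit vector $u=e_{1}$; the function name `second_moment_estimate` is hypothetical.

```python
import math
import random


def second_moment_estimate(d, n_samples, seed=0):
    """Monte Carlo estimates of E||sigma'(<u,x>) x||^2 and E||x||^2.

    Assumptions (illustrative only): x ~ N(0, I_d), sigma = tanh, u = e_1.
    """
    rng = random.Random(seed)
    total_weighted = 0.0
    total_plain = 0.0
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        sq_norm = sum(xi * xi for xi in x)
        # <u, x> with u = e_1 is just the first coordinate.
        s = x[0]
        # sigma'(s) = 1 - tanh(s)^2 lies in [0, 1].
        dsigma = 1.0 - math.tanh(s) ** 2
        total_weighted += (dsigma ** 2) * sq_norm
        total_plain += sq_norm
    return total_weighted / n_samples, total_plain / n_samples


weighted, plain = second_moment_estimate(d=20, n_samples=2000)
print("E||sigma'(<u,x>) x||^2 approx:", weighted)
print("E||x||^2 approx (should be near d):", plain)
```

In this toy setting the weighted second moment is dominated pointwise by $\left\|x\right\|_{2}^{2}$, whose mean is $d$, which is the mechanism behind the factor $d$ in the bound above.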

Combining the above estimates, it then follows that

\begin{align*}
\left\|\mathbb{E}\left[\Delta_{i}(k+1)-\Delta_{i}(k)|\mathcal{F}_{k}\right]\right\|_{2}\leq\, & \eta M\left(1+\varepsilon^{-1}T\right)^{2}\cdot\sup_{j\in[m]}\left\|\overline{u}_{j}(k)-\underline{u}_{j}(k)\right\|_{2}+\eta^{2}M(1+\varepsilon^{-1}T)^{4}d\\
& +\eta M\left(1+\varepsilon^{-1}T+\sup_{j\in[m]}\left|\overline{a}_{j}(k)-\underline{a}_{j}(k)\right|\right)\cdot\sup_{j\in[m]}\left|\overline{a}_{j}(k)-\underline{a}_{j}(k)\right|.
\end{align*}
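One ingredient of this combined estimate is the operator-norm bound $\left\|u_{1}u_{1}^{\top}-u_{2}u_{2}^{\top}\right\|_{\mathrm{op}}\leq C\left\|u_{1}-u_{2}\right\|_{2}$ used in step $(i)$. Assuming $u_{1},u_{2}$ are unit vectors (an assumption made here for illustration, consistent with the projector $I_{d}-uu^{\top}$), the identity $u_{1}u_{1}^{\top}-u_{2}u_{2}^{\top}=u_{1}(u_{1}-u_{2})^{\top}+(u_{1}-u_{2})u_{2}^{\top}$ shows that $C=2$ suffices. The following sketch checks this on random inputs; all function names are hypothetical.

```python
import math
import random


def random_unit_vector(d, rng):
    """Sample a uniformly random direction on the unit sphere in R^d."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def projector_gap_ratio(u1, u2, x):
    """Return ||(u1 u1^T - u2 u2^T) x||_2 / (||u1 - u2||_2 * ||x||_2)."""
    a = sum(p * q for p, q in zip(u1, x))  # <u1, x>
    b = sum(p * q for p, q in zip(u2, x))  # <u2, x>
    image = [a * p - b * q for p, q in zip(u1, u2)]
    num = math.sqrt(sum(c * c for c in image))
    gap = math.sqrt(sum((p - q) ** 2 for p, q in zip(u1, u2)))
    x_norm = math.sqrt(sum(c * c for c in x))
    return num / (gap * x_norm)


def worst_ratio(d=10, trials=1000, seed=0):
    """Largest observed ratio over random unit vectors u1, u2 and test vectors x."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        u1 = random_unit_vector(d, rng)
        u2 = random_unit_vector(d, rng)
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        worst = max(worst, projector_gap_ratio(u1, u2, x))
    return worst


print("largest observed ratio:", worst_ratio())
```

Random test vectors only give a lower bound on the operator norm, so this is a sanity check of the inequality rather than a proof of it.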

Using the same proof technique as in Appendix C.5 of [33], we conclude that

\[
\mathds{P}\left(\max_{i\in[m]}\max_{l\in[0,\min(T,T_{\theta})/\eta]\cap\mathbb{N}}\left\|\sum_{k=0}^{l-1}Z_{i}(k+1)\right\|_{2}\geq M(1+\varepsilon^{-1}T)^{2}\left(\sqrt{d+\log m}+z+T^{2}\right)\sqrt{T\eta}\right)\leq\exp(-z^{2}).
\]
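A heuristic way to see where the factor $\sqrt{T\eta}$ comes from is the following back-of-the-envelope computation, which is not the argument of [33]. Write $B$ (a shorthand introduced here) for $M(1+\varepsilon^{-1}T)^{2}\left(\sqrt{d+\log m}+z+T^{2}\right)$, and suppose that each increment satisfies $\left\|Z_{i}(k+1)\right\|_{2}\lesssim\eta B$, as the gradient bound above suggests. Since martingale differences are orthogonal in $L^{2}$, for $l\leq T/\eta$,

\[
\mathbb{E}\left\|\sum_{k=0}^{l-1}Z_{i}(k+1)\right\|_{2}^{2}=\sum_{k=0}^{l-1}\mathbb{E}\left\|Z_{i}(k+1)\right\|_{2}^{2}\lesssim\frac{T}{\eta}\cdot\eta^{2}B^{2}=T\eta B^{2},
\]

so the typical size of the martingale term is of order $B\sqrt{T\eta}$. Turning this into a high-probability bound that is uniform over $i$ and $l$ requires the concentration argument cited above.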

As in the proof of Theorem 4, we define

\[
\Delta(t)=\max_{l\in[0,t/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left\|\overline{\theta}_{i}(l)-\underline{\theta}_{i}(l)\right\|_{2},\quad T_{\Delta}=\inf\{t\geq 0:\Delta(t)\geq 1\}.
\]

Then, for $l\leq\min(T,T_{\theta},T_{\Delta})/\eta$, we have

\[
\begin{aligned}
\sup_{i\in[m]}\left\|\overline{u}_{i}(l)-\underline{u}_{i}(l)\right\|_{2}=\, & \sup_{i\in[m]}\left\|\Delta_{i}(l)\right\|_{2}\leq\sup_{i\in[m]}\left\{\sum_{k=0}^{l-1}\left\|\mathbb{E}\left[\Delta_{i}(k+1)-\Delta_{i}(k)\,|\,\mathcal{F}_{k}\right]\right\|_{2}+\left\|\sum_{k=0}^{l-1}Z_{i}(k+1)\right\|_{2}\right\}\\
\leq\, & \eta M(1+\varepsilon^{-1}T)^{2}\sum_{k=0}^{l-1}\Delta(k\eta)+l\eta^{2}M(1+\varepsilon^{-1}T)^{4}d\\
& +M(1+\varepsilon^{-1}T)^{2}\left(\sqrt{d+\log m}+z+T^{2}\right)\sqrt{T\eta}.
\end{aligned}
\]

Proceeding with the same argument, it follows that

\[
\sup_{i\in[m]}\left|\overline{a}_{i}(l)-\underline{a}_{i}(l)\right|\leq\varepsilon^{-1}\eta M(1+\varepsilon^{-1}T)^{2}\sum_{k=0}^{l-1}\Delta(k\eta)+\varepsilon^{-1}M(1+\varepsilon^{-1}T)^{2}\left(\sqrt{d+\log m}+z+T^{2}\right)\sqrt{T\eta}.
\]

Therefore, we finally conclude that

\[
\begin{aligned}
\Delta(l\eta)\leq\, & (\varepsilon^{-1}+1)\eta M(1+\varepsilon^{-1}T)^{2}\sum_{k=0}^{l-1}\Delta(k\eta)+l\eta^{2}M(1+\varepsilon^{-1}T)^{4}d\\
& +(\varepsilon^{-1}+1)M(1+\varepsilon^{-1}T)^{2}\left(\sqrt{d+\log m}+z+T^{2}\right)\sqrt{T\eta}\\
\leq\, & (\varepsilon^{-1}+1)\eta M(1+\varepsilon^{-1}T)^{2}\sum_{k=0}^{l-1}\Delta(k\eta)+(\varepsilon^{-1}+1)M(1+\varepsilon^{-1}T)^{4}\left(\sqrt{d+\log m}+z+T^{2}\right)\sqrt{T\eta}.
\end{aligned}
\]

Applying Grönwall’s inequality (discrete version) yields that

\[
\begin{aligned}
\Delta(l\eta)\leq\, & (\varepsilon^{-1}+1)M(1+\varepsilon^{-1}T)^{4}\left(\sqrt{d+\log m}+z+T^{2}\right)\sqrt{T\eta}\\
& \times\left(1+(\varepsilon^{-1}+1)l\eta M(1+\varepsilon^{-1}T)^{2}\exp\left((\varepsilon^{-1}+1)l\eta M(1+\varepsilon^{-1}T)^{2}\right)\right)\\
\leq\, & M\exp\left((\varepsilon^{-1}+1)MT(1+\varepsilon^{-1}T)^{2}\right)\left(\sqrt{d+\log m}+z+T^{2}\right)\sqrt{\eta}\\
\leq\, & M\exp\left((\varepsilon^{-1}+1)MT(1+\varepsilon^{-1}T)^{2}\right)\left(\sqrt{d+\log m}+z\right)\sqrt{\eta},
\end{aligned}
\]

as long as $\max(d,m,z)\to\infty$ with $T=O(1)$. Note that the above inequality holds for all $l\in[0,\min(T,T_{\theta},T_{\Delta})/\eta]\cap\mathbb{N}$ with probability at least $1-\exp(-z^{2})$, which further implies that $T_{\theta},T_{\Delta}\geq T$, and consequently

\[
\sup_{k\in[0,T/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left|\overline{a}_{i}(k)\right|\leq 2M(1+\varepsilon^{-1}T).
\]
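The discrete Grönwall step used above can be sanity-checked numerically. The following Python sketch is an illustration only and not part of the proof: it iterates the extremal recursion $x_l = a\sum_{k<l}x_k + b$, whose solution is $x_l = b(1+a)^l$, and compares it with the bound $b\,(1 + a l \exp(a l))$ in the form used in the text. The constants `a` and `b` are hypothetical stand-ins for $(\varepsilon^{-1}+1)\eta M(1+\varepsilon^{-1}T)^{2}$ and the additive noise term, respectively.

```python
import math


def gronwall_extremal(a, b, steps):
    """Iterate x_l = a * sum_{k<l} x_k + b, the equality case of the hypothesis."""
    xs, running_sum = [], 0.0
    for _ in range(steps + 1):
        x = a * running_sum + b
        xs.append(x)
        running_sum += x
    return xs


def gronwall_bound(a, b, l):
    """Right-hand side b * (1 + a*l*exp(a*l)), the form of the bound in the text."""
    return b * (1.0 + a * l * math.exp(a * l))


# Hypothetical illustrative constants (not taken from the paper).
a, b, steps = 0.01, 0.5, 200
xs = gronwall_extremal(a, b, steps)

# The small relative slack only guards against floating-point rounding at l = 0,
# where both sides equal b.
bound_holds = all(
    x <= gronwall_bound(a, b, l) * (1 + 1e-12) for l, x in enumerate(xs)
)
print(bound_holds)
```

Any nonnegative sequence satisfying the inequality version of the recursion is dominated by this extremal sequence, so checking the equality case is enough for a sanity check.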

Applying again Lemma 6, we deduce that

\[
\sup_{k\in[0,T/\eta]\cap\mathbb{N}}\left|\mathscr{R}(\underline{\rho}^{(m)}(k))-\mathscr{R}(\overline{\rho}^{(m)}(k))\right|\leq\left(\sqrt{d+\log m}+z\right)M\exp\left((\varepsilon^{-1}+1)MT(1+\varepsilon^{-1}T)^{2}\right)\sqrt{\eta}.
\]

Combining the above estimates gives the following:

Theorem 6 (Difference between SGD and projected SGD).

There exists a constant $M$ that only depends on the $M_{i}$’s, such that for any $T,z\geq 0$ and

\[
\eta\leq\frac{1}{(d+\log m+z^{2})M\exp\left((\varepsilon^{-1}+1)MT(1+\varepsilon^{-1}T)^{2}\right)},
\]

the following happens with probability at least $1-\exp(-z^{2})$: For all $t\in[0,T]$, we have

\begin{align}
\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left|\overline{a}_{i}(k)\right|\leq\, & M(1+t/\varepsilon), \tag{172}\\
\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left\|\overline{\theta}_{i}(k)-\underline{\theta}_{i}(k)\right\|_{2}\leq\, & M\exp\left((\varepsilon^{-1}+1)Mt(1+\varepsilon^{-1}T)^{2}\right)\sqrt{\eta}\left(\sqrt{d+\log m}+z\right), \tag{173}\\
\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\left|\mathscr{R}(\underline{\rho}^{(m)}(k))-\mathscr{R}(\overline{\rho}^{(m)}(k))\right|\leq\, & M\exp\left((\varepsilon^{-1}+1)Mt(1+\varepsilon^{-1}T)^{2}\right)\sqrt{\eta}\left(\sqrt{d+\log m}+z\right). \tag{174}
\end{align}

Theorem 2 then follows as a result of combining Theorem 4, Theorem 5, and Theorem 6.

Lemma 7.

Let $v_{1}=u_{1}+\eta(I_{d}-u_{1}u_{1}^{\top})g_{1}$, $v_{2}=\operatorname{Proj}_{\mathbb{S}^{d-1}}(u_{2}+\eta g_{2})$, where $\left\|u_{2}\right\|_{2}=1$ and $\eta\left\|g_{2}\right\|_{2}\leq 1/2$. Then we have

\[
(v_{1}-u_{1})-(v_{2}-u_{2})=\,\eta\left(\left(I_{d}-u_{1}u_{1}^{\top}\right)g_{1}-\left(I_{d}-u_{2}u_{2}^{\top}\right)g_{2}\right)+O\left(\eta^{2}\left\|g_{2}\right\|_{2}^{2}\right).
\]
Proof.

Using Taylor expansion, we know that

\[
\begin{aligned}
v_{2}=\operatorname{Proj}_{\mathbb{S}^{d-1}}(u_{2}+\eta g_{2})=\, & (u_{2}+\eta g_{2})\left(1+2\eta\langle u_{2},g_{2}\rangle+\eta^{2}\left\|g_{2}\right\|_{2}^{2}\right)^{-1/2}\\
=\, & (u_{2}+\eta g_{2})\left(1-\eta\langle u_{2},g_{2}\rangle+O(\eta^{2}\left\|g_{2}\right\|_{2}^{2})\right)\\
=\, & \left(1-\eta\langle u_{2},g_{2}\rangle\right)u_{2}+\eta g_{2}+O(\eta^{2}\left\|g_{2}\right\|_{2}^{2})\\
=\, & u_{2}+\eta(I_{d}-u_{2}u_{2}^{\top})g_{2}+O(\eta^{2}\left\|g_{2}\right\|_{2}^{2}),
\end{aligned}
\]

which implies

\[
v_{2}-u_{2}=\eta(I_{d}-u_{2}u_{2}^{\top})g_{2}+O(\eta^{2}\left\|g_{2}\right\|_{2}^{2}).
\]

The proof is completed by noting that

\[
v_{1}-u_{1}=\eta(I_{d}-u_{1}u_{1}^{\top})g_{1}.
\]
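The second-order claim of Lemma 7 can be illustrated numerically. The sketch below is not part of the proof: it compares the projected step $\operatorname{Proj}_{\mathbb{S}^{d-1}}(u+\eta g)$ with the tangent step $u+\eta(I_d-uu^{\top})g$ for a random unit vector $u$ and a random $g$, and reports the gap divided by $\eta^{2}\|g\|_{2}^{2}$. The dimension, the seed, and the threshold 6 (a loose constant chosen for this check, valid when $\eta\|g\|_{2}\leq 1/2$) are assumptions of the illustration, not quantities from the paper.

```python
import math
import random


def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


def norm(x):
    return math.sqrt(dot(x, x))


def projected_step(u, g, eta):
    """Proj_{S^{d-1}}(u + eta * g): take the step, then renormalize to unit length."""
    w = [ui + eta * gi for ui, gi in zip(u, g)]
    n = norm(w)
    return [wi / n for wi in w]


def tangent_step(u, g, eta):
    """u + eta * (I - u u^T) g: step along the component of g orthogonal to u."""
    ug = dot(u, g)
    return [ui + eta * (gi - ug * ui) for ui, gi in zip(u, g)]


def second_order_ratio(u, g, eta):
    """||projected - tangent||_2 divided by eta^2 * ||g||_2^2."""
    p, t = projected_step(u, g, eta), tangent_step(u, g, eta)
    gap = norm([pi - ti for pi, ti in zip(p, t)])
    return gap / (eta ** 2 * dot(g, g))


rng = random.Random(0)  # fixed seed, arbitrary choice
d = 20                  # hypothetical dimension
u_raw = [rng.gauss(0.0, 1.0) for _ in range(d)]
u = [x / norm(u_raw) for x in u_raw]
g = [rng.gauss(0.0, 1.0) for _ in range(d)]

# Step sizes satisfying eta * ||g||_2 <= 1/2, halved repeatedly.
etas = [0.5 / norm(g) / 2 ** j for j in range(6)]
ratios = [second_order_ratio(u, g, eta) for eta in etas]
print(all(r <= 6.0 for r in ratios))
```

The ratio staying bounded as $\eta$ is halved is what the $O(\eta^{2}\|g_{2}\|_{2}^{2})$ remainder in the lemma expresses.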

D.4 Proof of Theorem 3

By our assumption, we know that the canonical learning order holds up to level $L$, and that

\[
\frac{1}{2}\sum_{k\geq L+1}\varphi_{k}^{2}\leq\frac{\delta}{4}.
\]

Then, according to Definition 1, there exists $\varepsilon_{*}=\varepsilon_{*}(\delta)$, $T_{0}=T_{0}(\delta)$ such that for all $\varepsilon\leq\varepsilon_{*}$ and $T=T(\varepsilon,\delta)=T_{0}(\delta)\varepsilon^{1/2L}$, one has

\[
\mathscr{R}_{\infty}(T,\varepsilon)=\lim_{m\to\infty}\lim_{d\to\infty}\mathscr{R}(a(T),u(T))\leq\frac{\delta}{3}.
\]

Moreover, from Section 4 we know that with probability at least $1-e^{-C^{\prime}m}$ over the i.i.d. initialization,

\[
\sup_{t\in[0,T]}\left|\mathscr{R}(a(t),u(t))-\mathscr{R}_{\infty}(t,\varepsilon)\right|\leq\left(\frac{1}{\sqrt{d}}+\frac{1}{\sqrt{m}}\right)CM^{\prime}\exp\left(M^{\prime}T(1+T)^{2}/\varepsilon^{2}\right), \tag{175}
\]

where $M^{\prime}$ only depends on $(\sigma,\varphi,{\rm P}_{A})$. Now we choose $\varepsilon\leq\varepsilon_{*}$ and $T=T(\varepsilon,\delta)=T_{0}(\delta)\varepsilon^{1/2L}$. It then follows that

\[
\mathscr{R}(a(T),u(T))\leq\frac{\delta}{3}+\left(\frac{1}{\sqrt{d}}+\frac{1}{\sqrt{m}}\right)CM^{\prime}\exp\left(M^{\prime}T^{3}/\varepsilon^{2}\right). \tag{176}
\]

According to Theorem 2, we know that with probability at least $1-\exp(-z)$,

\[
\left|\mathscr{R}(\overline{a}(n),\overline{u}(n))-\mathscr{R}(a(T),u(T))\right|\leq\sqrt{\eta(d+\log m+z)}\,M^{\prime}\exp\left(M^{\prime}T^{3}/\varepsilon^{3}\right) \tag{177}
\]

with $n=T/\eta=T(\varepsilon,\delta)/\eta$. We now take

\[
M=M(\varepsilon,\delta)=\max\left\{\frac{9M^{\prime 2}\exp\left(2M^{\prime}T(\varepsilon,\delta)^{3}/\varepsilon^{3}\right)}{\delta^{2}},\ \frac{36C^{2}M^{\prime 2}\exp\left(2M^{\prime}T(\varepsilon,\delta)^{3}/\varepsilon^{2}\right)}{\delta^{2}}\right\}.
\]

Then, by our choice of $m$ and $d$, we know that $\mathscr{R}(a(T),u(T))\leq 2\delta/3$. Further, taking

\[
\eta=\frac{1}{M(d+\log m+z)},\quad n=MT(d+\log m+z), \tag{178}
\]

we obtain that

\[
\mathscr{R}(\overline{a}(n),\overline{u}(n))\leq\mathscr{R}(a(T),u(T))+\frac{\delta}{3}\leq\delta. \tag{179}
\]

The above holds with probability at least $1-\exp(-C^{\prime}m)-\exp(-z)$. Hence, our conclusion follows from the assumption $m\geq z$.

Appendix E Counterexamples to the canonical learning order

E.1 Case 1: $\sigma_{k}=0$ for some $k\in\mathbb{N}$

For any fixed $(a,u)=(a_{i},u_{i})_{1\leq i\leq m}$, we have

\begin{align*}
\mathbb{E}\left[f(x;a,u)\mathrm{He}_{k}(\langle u_{*},x\rangle)\right]=\,&\frac{1}{m}\sum_{i=1}^{m}a_{i}\mathbb{E}\left[\sigma\left(\langle u_{i},x\rangle\right)\mathrm{He}_{k}(\langle u_{*},x\rangle)\right]\\
=\,&\frac{1}{m}\sum_{i=1}^{m}a_{i}\sigma_{k}\langle u_{i},u_{*}\rangle^{k}=0.
\end{align*}

Moreover, the risk is always lower bounded by

\begin{align*}
\mathscrsfs{R}(a,u)=\,&\frac{1}{2}\mathbb{E}\left[\left(\varphi(\langle u_{*},x\rangle)-f(x;a,u)\right)^{2}\right]\\
=\,&\frac{1}{2}\mathbb{E}\left[\left(\varphi_{k}\mathrm{He}_{k}(\langle u_{*},x\rangle)+\left(\varphi(\langle u_{*},x\rangle)-\varphi_{k}\mathrm{He}_{k}(\langle u_{*},x\rangle)-f(x;a,u)\right)\right)^{2}\right]\\
\stackrel{(i)}{=}\,&\frac{1}{2}\varphi_{k}^{2}+\frac{1}{2}\mathbb{E}\left[\left(\varphi(\langle u_{*},x\rangle)-\varphi_{k}\mathrm{He}_{k}(\langle u_{*},x\rangle)-f(x;a,u)\right)^{2}\right]\geq\frac{1}{2}\varphi_{k}^{2},
\end{align*}

where $(i)$ follows from orthogonality between $\mathrm{He}_{k}(\langle u_{*},x\rangle)$ and $f(x;a,u)$.
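The identity used in Case 1 can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the proof: it takes $\mathrm{He}_k$ to be the normalized Hermite polynomials (consistent with the term $\frac{1}{2}\varphi_k^2$ above), uses a hypothetical polynomial activation given by its Hermite coefficients `sigma_coef` with $\sigma_2=0$, and evaluates $\mathbb{E}[\sigma(G_1)\mathrm{He}_k(G_2)]$ for two standard Gaussians with correlation $s=\langle u_i,u_*\rangle$ by Gauss-Hermite quadrature. The function name `hermite_overlap` is introduced here for illustration only.

```python
import numpy as np
from math import factorial, sqrt, pi
from numpy.polynomial.hermite_e import hermegauss, hermeval


def he(k, x):
    """Normalized probabilists' Hermite polynomial, so that E[he_k(G)^2] = 1."""
    c = np.zeros(k + 1)
    c[k] = 1.0
    return hermeval(x, c) / sqrt(factorial(k))


def hermite_overlap(sigma_coef, k, s, deg=30):
    """Quadrature estimate of E[sigma(G1) he_k(G2)] with Corr(G1, G2) = s.

    sigma_coef[j] is the j-th Hermite coefficient of a hypothetical
    polynomial activation sigma = sum_j sigma_coef[j] * he_j.
    """
    z, w = hermegauss(deg)           # nodes/weights for the weight exp(-x^2/2)
    w = w / sqrt(2 * pi)             # rescale so that the weights sum to 1
    z1, z2 = np.meshgrid(z, z, indexing="ij")
    g1 = z1                          # plays the role of <u_i, x>
    g2 = s * z1 + sqrt(1 - s**2) * z2  # plays the role of <u_*, x>
    act = sum(c * he(j, g1) for j, c in enumerate(sigma_coef))
    return float(np.sum(np.outer(w, w) * act * he(k, g2)))


# Hypothetical activation with sigma_2 = 0 and sigma_3 = 2.
sigma_coef = [1.0, 0.5, 0.0, 2.0]
overlap_k2 = hermite_overlap(sigma_coef, 2, 0.6)  # should vanish since sigma_2 = 0
overlap_k3 = hermite_overlap(sigma_coef, 3, 0.6)  # should match sigma_3 * s^3
```

Because the overlap with $\mathrm{He}_2$ vanishes for every correlation $s$, no choice of $(a,u)$ can reduce the part of the risk carried by $\varphi_2$, which is the content of the lower bound above.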

E.2 Case 2: $\varphi_{0}=\cdots=\varphi_{k}=0$ for some $k\geq 1$

We consider the reduced mean-field equations (24):

$$\begin{split}\varepsilon\partial_{t}a_{i}=\,&V(s_{i})-\frac{1}{m}\sum_{j=1}^{m}a_{j}U(s_{i}s_{j})\,,\\ \partial_{t}s_{i}=\,&a_{i}\left(1-s_{i}^{2}\right)\left(V^{\prime}(s_{i})-\frac{1}{m}\sum_{j=1}^{m}a_{j}U^{\prime}(s_{i}s_{j})s_{j}\right)\,.\end{split}$$

Note that if $\varphi_{0}=\varphi_{1}=0$, then $V^{\prime}(s)=s\cdot v(s)$ for some continuous function $v$. Denoting $a=(a_{1},\cdots,a_{m})$ and $s=(s_{1},\cdots,s_{m})^{\top}$, the above equation for the evolution of the $s_{i}$'s can be written as

$$s^{\prime}(t)=A(a(t),s(t))s(t),$$

where A(a,s)𝐴𝑎𝑠A(a,s)italic_A ( italic_a , italic_s ) is a matrix-valued function satisfying

$$A_{ij}(a,s)=a_{i}(1-s_{i}^{2})\left(v(s_{i})\mathbf{1}_{i=j}-\frac{a_{j}}{m}U^{\prime}(s_{i}s_{j})\right),\ \forall i,j\in[m].$$

Using an a priori estimate similar to the one in the proof of Lemma 1, we can show that

$$\sup_{t\in[0,T]}\left\|A(a(t),s(t))\right\|_{\mathrm{op}}\leq C(T)<\infty$$

for any finite time $T$. For a trajectory initialized at $s(0)=0$, Grönwall's inequality applied to $s^{\prime}=As$ then implies that $s(t)\equiv 0$ for $t\in[0,T]$. Therefore, we cannot learn any component of $\varphi$ with degree $\geq 1$.
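The stalling at $s\equiv 0$ can be illustrated with a forward Euler integration of the reduced mean-field equations. This is only a sketch under assumptions that are not restated in this appendix: it takes $V(s)=\sum_k\varphi_k\sigma_k s^k$ and $U(s)=\sum_k\sigma_k^2 s^k$ truncated at degree 3, starts from $s(0)=0$ with random $a(0)$, and uses hypothetical coefficient values. The function name `simulate_reduced_flow` and all numerical parameters are chosen for illustration.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def simulate_reduced_flow(phi, sigma, a0, s0, eps=0.1, dt=1e-3, steps=1000):
    """Forward Euler scheme for the reduced mean-field equations.

    Assumes V(s) = sum_k phi_k sigma_k s^k and U(s) = sum_k sigma_k^2 s^k,
    with phi and sigma given as truncated lists of Hermite coefficients.
    """
    phi = np.asarray(phi, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    v_coef = phi * sigma              # power-series coefficients of V
    u_coef = sigma**2                 # power-series coefficients of U
    dv_coef = P.polyder(v_coef)       # coefficients of V'
    du_coef = P.polyder(u_coef)       # coefficients of U'
    a = np.array(a0, dtype=float)
    s = np.array(s0, dtype=float)
    m = len(a)
    for _ in range(steps):
        prod = np.outer(s, s)         # matrix of products s_i s_j
        da = (P.polyval(s, v_coef) - P.polyval(prod, u_coef) @ a / m) / eps
        ds = a * (1 - s**2) * (
            P.polyval(s, dv_coef) - P.polyval(prod, du_coef) @ (a * s) / m
        )
        a, s = a + dt * da, s + dt * ds
    return a, s


rng = np.random.default_rng(0)
a0 = rng.normal(size=50)
s0 = np.zeros(50)                     # initialization at s(0) = 0
sigma = [1.0, 1.0, 0.5, 0.25]         # hypothetical activation coefficients

# phi_0 = phi_1 = 0: the overlaps s_i are expected to stay at zero.
_, s_flat = simulate_reduced_flow([0.0, 0.0, 1.0, 0.5], sigma, a0, s0)
# phi_1 != 0: the overlaps s_i are expected to leave zero.
_, s_moves = simulate_reduced_flow([0.0, 1.0, 1.0, 0.5], sigma, a0, s0)
```

In the first run both $V^{\prime}(0)$ and the interaction term vanish at $s=0$, so every Euler step leaves $s$ unchanged; in the second run $V^{\prime}(0)=\varphi_1\sigma_1\neq 0$ pushes the overlaps away from zero.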

E.3 Case 3: $\varphi_{k}=0$ for some $k\geq 1$

We may assume $\sigma_{k}\neq 0$, and analyze the simplified ODE system (97), which reduces to

$$\begin{split}\partial_{\tau}\widetilde{a}(\omega)=\,&-\sigma_{k}^{2}\widetilde{s}(\omega)^{k}\int\widetilde{a}(\nu)\widetilde{s}(\nu)^{k}\mathrm{d}\rho(\nu)\\ \partial_{\tau}\widetilde{s}(\omega)=\,&-k\sigma_{k}^{2}\widetilde{a}(\omega)\widetilde{s}(\omega)^{k-1}\left(1-\varepsilon^{2\beta_{k}}\widetilde{s}(\omega)^{2}\right)\int\widetilde{a}(\nu)\widetilde{s}(\nu)^{k}\mathrm{d}\rho(\nu).\end{split} \tag{180}$$

We thus obtain the following equations:

$$\begin{split}\partial_{\tau}\int\widetilde{a}(\omega)^{2}\mathrm{d}\rho(\omega)=\,&-2\sigma_{k}^{2}\left(\int\widetilde{a}(\omega)\widetilde{s}(\omega)^{k}\mathrm{d}\rho(\omega)\right)^{2}\leq 0,\\ \partial_{\tau}\int\widetilde{s}(\omega)^{2}\mathrm{d}\rho(\omega)=\,&-2k\sigma_{k}^{2}\int\widetilde{a}(\omega)\widetilde{s}(\omega)^{k}\left(1-\varepsilon^{2\beta_{k}}\widetilde{s}(\omega)^{2}\right)\mathrm{d}\rho(\omega)\cdot\int\widetilde{a}(\omega)\widetilde{s}(\omega)^{k}\mathrm{d}\rho(\omega)\\ \leq\,&2k\sigma_{k}^{2}\varepsilon^{2\beta_{k}}\int\widetilde{a}(\omega)\widetilde{s}(\omega)^{k+2}\mathrm{d}\rho(\omega)\cdot\int\widetilde{a}(\omega)\widetilde{s}(\omega)^{k}\mathrm{d}\rho(\omega),\end{split} \tag{181}$$

which means that for any $\tau\geq 0$,

$$\int\widetilde{a}(\omega,\tau)^{2}\mathrm{d}\rho(\omega)\leq\int\widetilde{a}(\omega,0)^{2}\mathrm{d}\rho(\omega)=O(\varepsilon^{1/k(k+1)})=o(1). \tag{182}$$

Therefore, most of the neurons cannot grow to magnitude $\Omega(1)$ while learning the $k$-th component, and hence fail to provide an effective initialization for learning the next component $\varphi_{k+1}$.
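The monotonicity in the first line of (181) can be observed in a small simulation of the system (180). The sketch below is an illustration under assumptions, not a reproduction of the authors' experiments: $\rho$ is replaced by a uniform measure on $m$ atoms, the factor $\varepsilon^{2\beta_k}$ is passed as a hypothetical constant `eps_pow`, the initial values of $\widetilde{a}$ and $\widetilde{s}$ are arbitrary, and the flow is discretized with forward Euler. The function name `simulate_stalled_component` is introduced here for illustration.

```python
import numpy as np


def simulate_stalled_component(k=2, sigma_k=1.0, eps_pow=1e-2, m=200,
                               dt=1e-3, steps=3000, seed=0):
    """Forward Euler scheme for system (180) with rho uniform on m atoms.

    Returns the trajectory of the second moment of a~, i.e. the discrete
    analogue of the integral of a~(omega)^2 with respect to rho.
    eps_pow stands in for epsilon^(2 beta_k) and is a hypothetical value.
    """
    rng = np.random.default_rng(seed)
    a = 0.1 * rng.normal(size=m)          # small initial second-layer weights
    s = rng.uniform(-1.0, 1.0, size=m)    # rescaled overlaps of order one
    energy = np.empty(steps + 1)
    energy[0] = np.mean(a**2)
    for n in range(steps):
        overlap = np.mean(a * s**k)       # discrete integral of a~ s~^k d rho
        da = -sigma_k**2 * s**k * overlap
        ds = (-k * sigma_k**2 * a * s**(k - 1)
              * (1 - eps_pow * s**2) * overlap)
        a, s = a + dt * da, s + dt * ds
        energy[n + 1] = np.mean(a**2)
    return energy


energy = simulate_stalled_component()
```

Under these choices the recorded second moment of $\widetilde{a}$ should never increase along the trajectory, in line with (182): the second-layer weights stay at their initial small scale instead of growing to order one.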