Learning time-scales in two-layers neural networks

Raphaël Berthier (EPFL), Andrea Montanari (Department of Electrical Engineering and Department of Statistics, Stanford University), Kangjie Zhou (Department of Statistics, Stanford University)

(May 1, 2024)

Abstract

Gradient-based learning in multi-layer neural networks displays a number of striking features. In particular, the decrease rate of empirical risk is non-monotone even after averaging over large batches. Long plateaus in which one observes barely any progress alternate with intervals of rapid decrease. These successive phases of learning often take place on very different time scales. Finally, models learnt in an early phase are typically ‘simpler’ or ‘easier to learn’ although in a way that is difficult to formalize.

Although theoretical explanations of these phenomena have been put forward, each of them captures at best certain specific regimes. In this paper, we study the gradient flow dynamics of a wide two-layer neural network in high-dimension, when data are distributed according to a single-index model (i.e., the target function depends on a one-dimensional projection of the covariates). Based on a mixture of new rigorous results, non-rigorous mathematical derivations, and numerical simulations, we propose a scenario for the learning dynamics in this setting. In particular, the proposed evolution exhibits separation of timescales and intermittency. These behaviors arise naturally because the population gradient flow can be recast as a singularly perturbed dynamical system.

Keywords: Deep learning, Neural network, Gradient flow, Dynamical system, Non-convex optimization, Incremental learning

Mathematics Subject Classification: 34E15, 37N40, 68T07

Communicated by Joan Bruna

1 Introduction

It is a recurring empirical observation that the training dynamics of neural networks exhibits a whole range of surprising behaviors:

1. Plateaus. Plotting the training and test error as a function of SGD steps, using either small stepsize or large batches to average out stochasticity, reveals striking patterns. These error curves display long plateaus where barely anything seems to be happening, which are followed by rapid drops [41, 48, 39].

2. Time-scales separation. The time window for this rapid descent is much shorter than the time spent in the plateaus. Additionally, subsequent phases of learning take increasingly longer times [18, 8].

3. Incremental learning. Models learnt in the first phases of learning appear to be simpler than in later phases. Among others, [5] demonstrated that easier examples in a dataset are learned earlier; [28] showed that models learnt in the first phase of training correlate well with linear models; [22] showed that, in many simplified models, the dynamics of gradient descent explores the solution space in an incremental order of complexity; [39] demonstrated that, in certain settings, a function that approximates well the target is only learnt past the point of overfitting.

Understanding these phenomena is not merely a matter of intellectual curiosity. In particular, incremental learning plays a key role in our understanding of generalization in deep learning. Indeed, in this scenario, stopping the learning at a certain time $t$ amounts to controlling the complexity of the model learnt. The notion of complexity corresponds to the order in which the space of models is explored.

While a number of groups have developed models to explain these phenomena, it is fair to say that a complete picture is still lacking. An exhaustive overview of these works is out of place here. We will outline three possible explanations that have been developed in the past, and provide more pointers in Section 3.

Theory $\#1$ : Dynamics near singular points.

Several early works [41, 17, 44] pointed out that the parametrization of multi-layer neural networks presents symmetries and degeneracies. For instance, the function represented by a multi-layer perceptron is invariant under permutations of the neurons in the same layer. As a consequence, the population risk has multiple local minima connected through saddles or other singular sub-manifolds. Dynamics near these sub-manifolds naturally exhibits plateaus. Further, random or agnostic initializations typically place the network close to such submanifolds.

Theory $\#2$ : Linear networks.

Following the pioneering work of [7], a number of authors, most notably [43, 30], studied the behavior of deep neural networks with linear activations. While such networks can only represent linear functions, the training dynamics is highly non-linear. As demonstrated in [43], learning happens through stages that correspond to the singular value decomposition of the input-output covariance. Time scales are determined by the singular values.

Theory $\#3$ : Kernel regime.

Following an initial insight of [26], a number of groups proved that, for certain initializations, the training dynamics and the model learnt by overparametrized neural networks are well approximated by certain linearly parametrized models. In the limit of very wide networks, the training dynamics of these models converges in turn to the training dynamics of kernel ridge(less) regression (KRR) with respect to a deterministic kernel (independent of the random initialization). We refer to [9] for an overview and pointers to this literature. Recently, [21] showed that, in high dimension, the learning dynamics of KRR also exhibits plateaus and waterfalls, and learns functions of increasing complexity over a diverging sequence of timescales.

While each of these theories offers useful insights, it is important to realize that they do not agree on the basic mechanism that explains plateaus, time-scales separation, and incremental learning. In theory $\#1$ , plateaus are associated to singular manifolds and high-dimensional saddles, while in theories $\#2$ and $\#3$ they are related to a hierarchy of singular values of a certain matrix. In $\#2$ , the relevant singular values are the ones of the input-output covariance, and the fact that these singular values are well separated is postulated to be a property of the data distribution. In contrast, in $\#3$ the relevant singular values are the eigenvalues of the kernel operator, and hence completely independent of the output (the target function). In this case, eigenvalues which are very different are proved to exist under natural high-dimensional distributions.

Not only do these theories propose different explanations, but they are also motivated by very different simplified models. Theory $\#1$ has been developed only for networks with a small number of hidden units. Theory $\#2$ only applies to networks with multiple output units, because otherwise the input-output covariance is a $d\times 1$ matrix and hence has only one non-trivial singular value. Finally, theory $\#3$ applies under the conditions of the linear (a.k.a. lazy) regime, namely large overparametrization and suitable initialization (see, e.g., [9]).

In order to better understand the origin of plateaus, time-scales separation, and incremental learning, we attempt a detailed analysis of gradient flow for two-layer neural networks. We consider a simple data-generation model, and propose a precise scenario for the behavior of learning dynamics. We do not assume any of the simplifying features of the theories described above: activations are non-linear; the number of hidden neurons is large; we place ourselves outside the linear (lazy) regime.

Our analysis is based on methods from dynamical systems theory: singular perturbation theory and matched asymptotic expansions. Unfortunately, we fall short of providing a general rigorous proof of the proposed scenario, but we can nevertheless prove it in several special cases and provide a heuristic argument supporting its generality.

The rest of the paper is organized as follows. Section 2 describes our data distribution, learning model, and the proposed scenario for the learning dynamics. We review further related work in Section 3. Section 4 describes the reduction of the gradient flow to a ‘mean field’ dynamics that will be the starting point of our analysis. Section 5 presents numerical evidence of the proposed learning scenario. Finally, Sections 6 and 7 present our analysis of the learning dynamics.

Notations.

In this paper, we use the classical asymptotic notations. The notations $f(\varepsilon)=o(g(\varepsilon))$ and $g(\varepsilon)=\omega(f(\varepsilon))$ as $\varepsilon\to 0$ both denote that $|f(\varepsilon)|/|g(\varepsilon)|\to 0$ in the limit $\varepsilon\to 0$ . The notations $f(\varepsilon)=O(g(\varepsilon))$ and $g(\varepsilon)=\Omega(f(\varepsilon))$ both denote that the ratio $|f(\varepsilon)|/|g(\varepsilon)|$ remains upper bounded in the limit. The notations $f(\varepsilon)=\Theta(g(\varepsilon))$ and $f(\varepsilon)\asymp g(\varepsilon)$ both denote that $f(\varepsilon)=O(g(\varepsilon))$ and $g(\varepsilon)=O(f(\varepsilon))$ hold simultaneously. Finally, $f(\varepsilon)\sim g(\varepsilon)$ denotes that $f(\varepsilon)/g(\varepsilon)\to 1$ in the limit.

2 Setting and canonical learning order

We are given pairs $\{(x_{i},y_{i})\}_{i\leq n}$ , where $x_{i}\in\mathbb{R}^{d}$ is a feature vector and $y_{i}\in\mathbb{R}$ is a response variable. We are interested in cases in which the feature vector is high-dimensional and does not contain strong structure, while the response depends on a low-dimensional projection of the data. We assume the simplest model of this type, the so-called single-index model:

$$y_{i}=\varphi(\langle u_{*},x_{i}\rangle)\,,\qquad x_{i}\sim\mathsf{N}(0,I_{d}),\quad u_{*}\in\mathbb{S}^{d-1}, \tag{1}$$

where $\varphi:\mathbb{R}\to\mathbb{R}$ is a link function, $\mathsf{N}(0,I_{d})$ denotes the standard multivariate Gaussian distribution in dimension $d$ , and $\mathbb{S}^{d-1}:=\{v\in\mathbb{R}^{d}:\,\|v\|_{2}=1\}$ . We study the ability to learn model (1) using a two-layers neural network with $m$ hidden neurons:

$$f(x;a,u)=\frac{1}{m}\sum_{i=1}^{m}a_{i}\sigma(\langle u_{i},x\rangle),\qquad a_{1},\cdots,a_{m}\in\mathbb{R},\quad u_{1},\cdots,u_{m}\in\mathbb{S}^{d-1}, \tag{2}$$

where $(a,u):=(a_{1},\cdots,a_{m},u_{1},\cdots,u_{m})$ collectively denotes all the model’s parameters and $\sigma:\mathbb{R}\to\mathbb{R}$ is the activation function of the neural network. The factor $1/m$ in the definition is relevant for the initialization and learning rate. We anticipate that we will initialize the $a_{i}$ ’s to be of order one, which results in second-layer coefficients $a_{i}/m=\Theta(1/m)$ .
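The data model (1) and the network (2) can be written down in a few lines. The following is a minimal NumPy sketch, not the authors' code: the function names, the choice $\varphi=\sigma=\tanh$ , the uniform distribution for the $a_{i}$ 's, and the sizes `d`, `m`, `n` are all illustrative assumptions.

```python
import numpy as np

def sample_single_index(n, d, u_star, phi, rng):
    """Draw n pairs from model (1): x ~ N(0, I_d), y = phi(<u_star, x>)."""
    X = rng.standard_normal((n, d))
    y = phi(X @ u_star)
    return X, y

def random_unit_rows(m, d, rng):
    """Return m independent rows, each uniform on the unit sphere S^{d-1}."""
    G = rng.standard_normal((m, d))
    return G / np.linalg.norm(G, axis=1, keepdims=True)

def two_layer(X, a, U, sigma):
    """Evaluate f(x; a, u) = (1/m) sum_i a_i sigma(<u_i, x>) on each row of X."""
    return sigma(X @ U.T) @ a / a.size

# Illustrative sizes and choices (assumptions, not taken from the paper).
rng = np.random.default_rng(0)
d, m, n = 50, 200, 1000
u_star = random_unit_rows(1, d, rng)[0]
U = random_unit_rows(m, d, rng)
a = rng.uniform(-1.0, 1.0, size=m)      # a_i of order one, so a_i / m = Theta(1/m)
X, y = sample_single_index(n, d, u_star, np.tanh, rng)
pred = two_layer(X, a, U, np.tanh)
```

Because of the $1/m$ prefactor, the output is an average over neurons, so it stays of order one as `m` grows with order-one `a`.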

Remark 2.1.

Standard initializations in deep learning frameworks yield second-layer coefficients $a_{i}/m=\Theta(1/\sqrt{m})$ [29, 23, 24]. However, it is increasingly clear that this initialization presents fundamental limitations for large $m$ . Notably, two-layer networks with this initialization converge to kernel methods [37], and the latter cannot learn ridge functions from polynomially many samples [20, 47].

It is well understood that, in order to drive the learning process outside the kernel regime (for $m\to\infty$ ), it is necessary to set $a_{i}/m=\Theta(1/m)$ . This is often referred to as the ‘mean-field initialization’ [32, 13, 19, 1]. We notice that suitable generalizations of the mean-field initialization are currently used in state-of-the-art implementations [45, 46].

The bulk of our work will be devoted to the analysis of projected gradient flow in $(a_{i},u_{i})_{1\leqslant i\leqslant m}$ on the population risk

\begin{align}
\mathscrsfs{R}(a,u)&=\frac{1}{2}\mathbb{E}\big\{\big(y-f(x;a,u)\big)^{2}\big\} \tag{3}\\
&=\frac{1}{2}\mathbb{E}\Big\{\Big(\varphi(\langle u_{*},x\rangle)-\frac{1}{m}\sum_{i=1}^{m}a_{i}\sigma(\langle u_{i},x\rangle)\Big)^{2}\Big\}\,. \tag{4}
\end{align}

In Section 7, we will bound the distance between stochastic gradient descent (SGD) and gradient flow in population risk. As a consequence, we will establish finite sample generalization guarantees for SGD learning.

Projected gradient flow with respect to the risk $\mathscrsfs{R}(a,u)$ is defined by the following ordinary differential equations (ODEs):

\begin{align}
\partial_{t}(\varepsilon a_{i})&=-m\,\partial_{a_{i}}\mathscrsfs{R}(a,u)\,, \tag{5}\\
\partial_{t}u_{i}&=-m(I_{d}-u_{i}u_{i}^{\top})\nabla_{u_{i}}\mathscrsfs{R}(a,u)\,. \tag{6}
\end{align}

Here, $\varepsilon$ can be viewed as the relative step size, namely the ratio between the first and second-layer step sizes. It is useful to make a few remarks about the definition of gradient flow:

• The projection $I_{d}-u_{i}u_{i}^{\top}$ ensures that $u_{i}$ remains on the unit sphere $\mathbb{S}^{d-1}$ .

• The overall scaling of time is arbitrary, and the matching to SGD steps will be carried out in Section 7. The factors $m$ on the right-hand side are introduced for mathematical convenience, since the partial derivatives are of order $1/m$ .

• As mentioned above, the factor $\varepsilon$ introduced in the gradient flow of the $a_{i}$ ’s plays the role of the relative step size. Throughout the paper, we will keep $\varepsilon$ as a free parameter independent of $m$ , and study the evolution of gradient flow for small $\varepsilon$ . This corresponds to a setting in which the second-layer coefficients are learned much faster than the first-layer weights. We emphasize however that the small $\varepsilon$ limit is taken after the large $m,d$ limits. Thus, although the second-layer weights are learnt faster, the evolution of the first-layer weights will be crucial and will lead to true feature learning.
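To make the roles of the projection and of $\varepsilon$ concrete, here is a hedged sketch of one Euler step of Eqs. (5)-(6), with the population expectation replaced by a batch average. It is only an illustration: the function name `projected_sgd_step`, the step size `dt`, the $\tanh$ activation and target, and the renormalization after the step (a discrete-time correction for the drift off the sphere) are assumptions of this sketch, not the scheme analyzed in Section 7.

```python
import numpy as np

def projected_sgd_step(a, U, X, y, eps, dt, sigma, dsigma):
    """One Euler step of Eqs. (5)-(6), with expectations replaced by batch averages."""
    batch = X.shape[0]
    Z = X @ U.T                                   # Z[b, i] = <u_i, x_b>
    resid = y - sigma(Z) @ a / a.size             # y_b - f(x_b; a, u)
    # Batch estimates of -m * dR/da_i and of -m * grad_{u_i} R.
    drift_a = (resid[:, None] * sigma(Z)).mean(axis=0)
    drift_U = (resid[:, None] * dsigma(Z) * a[None, :]).T @ X / batch
    # Apply (I - u_i u_i^T) row by row: remove the component along u_i.
    drift_U = drift_U - U * np.sum(U * drift_U, axis=1, keepdims=True)
    a_new = a + (dt / eps) * drift_a              # second layer moves 1/eps times faster
    U_new = U + dt * drift_U
    # Renormalize to undo the O(dt^2) drift off the sphere (assumption of this sketch).
    U_new = U_new / np.linalg.norm(U_new, axis=1, keepdims=True)
    return a_new, U_new

# Small illustrative run with hypothetical sizes.
rng = np.random.default_rng(0)
d, m, batch = 20, 10, 256
u_star = np.zeros(d)
u_star[0] = 1.0
U0 = rng.standard_normal((m, d))
U0 /= np.linalg.norm(U0, axis=1, keepdims=True)
a0 = rng.uniform(-1.0, 1.0, size=m)
X = rng.standard_normal((batch, d))
y = np.tanh(X @ u_star)
a1, U1 = projected_sgd_step(a0, U0, X, y, eps=0.1, dt=0.01,
                            sigma=np.tanh, dsigma=lambda z: 1.0 - np.tanh(z) ** 2)
```

With `eps=0.1`, the effective step on the `a` coordinates is ten times larger than on the `U` coordinates, which is the discrete counterpart of the time-scale separation described above.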

We assume the initialization to be random with i.i.d. components $(a_{i,\rm{init}},u_{i,\rm{init}})$ :

$$(a_{i,\rm{init}},u_{i,\rm{init}})\sim{\rm P}_{A}\otimes\mathrm{Unif}(\mathbb{S}^{d-1})\,, \tag{7}$$

where ${\rm P}_{A}$ is a probability measure on $\mathbb{R}$ . The unique solution of the gradient flow ODEs with this initialization will be denoted by $(a(t),u(t))$ . We will be interested in the case of large networks ( $m\to\infty$ ) in high dimension ( $d\to\infty$ ). As shown below, the two limits commute (over fixed time horizons).

Our main finding is that, in a number of cases, $\varphi$ is learnt incrementally. Namely, the function $f(x;a(t),u(t))$ evolves over time according to a sequence of polynomial approximations of $\varphi(\langle u_{*},x\rangle)$ . These polynomial approximations are given by the decomposition of $\varphi$ in $L^{2}(\mathbb{R},\phi(x)\mathrm{d}x)$ , where $\phi(x)$ is the standard normal density: $\phi(x)=\exp(-x^{2}/2)/\sqrt{2\pi}$ . (For notational simplicity, we will use the shorthand $L^{2}$ instead of $L^{2}(\mathbb{R},\phi(x)\mathrm{d}x)$ in the sequel.)

In order to describe the polynomial approximations learnt during the training more explicitly, we decompose $\varphi$ and $\sigma$ into normalized Hermite polynomials:

$$\varphi(z)=\sum_{k=0}^{\infty}\varphi_{k}\mathrm{He}_{k}(z)\,,\qquad\sigma(z)=\sum_{k=0}^{\infty}\sigma_{k}\mathrm{He}_{k}(z)\,. \tag{8}$$

Here, $\mathrm{He}_{k}$ denotes the $k$ -th Hermite polynomial, normalized so that $\left\|{\mathrm{He}_{k}}\right\|_{L^{2}(\mathbb{R},\phi(x)\mathrm{d}x)}=1$ .
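The coefficients in Eq. (8) are Gaussian expectations, $\varphi_{k}=\mathbb{E}\{\varphi(G)\mathrm{He}_{k}(G)\}$ with $G\sim\mathsf{N}(0,1)$ , and can be approximated by Gauss-Hermite quadrature. The sketch below uses NumPy's probabilists' Hermite module; the helper name `hermite_coeffs` and its parameters `kmax` and `quad_points` are hypothetical, and the normalization divides NumPy's $\mathrm{He}_{k}$ (which has squared norm $k!$ ) by $\sqrt{k!}$ .

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

def hermite_coeffs(f, kmax, quad_points=60):
    """Approximate f_k = E[f(G) He_k(G)], k = 0..kmax, with He_k of unit L^2 norm."""
    x, w = hermegauss(quad_points)           # nodes/weights for the weight exp(-x^2 / 2)
    w = w / math.sqrt(2.0 * math.pi)         # rescale so the weights sum to 1
    fx = f(x)
    coeffs = []
    for k in range(kmax + 1):
        basis = np.zeros(k + 1)
        basis[k] = 1.0                       # selects NumPy's unnormalized He_k
        he_k = hermeval(x, basis) / math.sqrt(math.factorial(k))
        coeffs.append(np.sum(w * fx * he_k))
    return np.array(coeffs)

# Worked check: z^2 = 1 * He_0(z) + sqrt(2) * He_2(z) in the normalized basis.
phi_coeffs = hermite_coeffs(lambda z: z ** 2, kmax=4)
```

For non-polynomial functions the quadrature is only approximate, and the number of nodes may need to be increased for high-order coefficients.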

Figure 1: Cartoon illustration of the evolution of the population risk within the canonical learning order of Definition 1.

As we will see, the incremental learning behavior arises for small $\varepsilon$ . By the law of large numbers (see below), the following almost sure limit exists (provided ${\rm P}_{A}$ is square integrable)

$$\mathscrsfs{R}_{\rm{init}}:=\lim_{m\to\infty}\lim_{d\to\infty}\mathscrsfs{R}(a_{\rm{init}},u_{\rm{init}})=\frac{1}{2}\left(\varphi_{0}-\sigma_{0}\int\!a\,{\rm P}_{A}(\mathrm{d}a)\right)^{2}+\frac{1}{2}\sum_{k\geqslant 1}\varphi_{k}^{2}. \tag{9}$$

We are now in position to describe the scenario that we will study in the rest of the paper.

Definition 1.

We say that the canonical learning order holds up to level $L$ for a certain target function $\varphi$ , activation $\sigma$ , and distribution ${\rm P}_{A}$ , if the following hold:

The limit below exists:

$$\mathscrsfs{R}_{\infty}(t,\varepsilon)=\lim_{m\to\infty}\lim_{d\to\infty}\mathscrsfs{R}(a(t),u(t)). \tag{10}$$

There exist constants $c_{2},\dots,c_{L+1}>0$ such that the following asymptotic holds as $\varepsilon\to 0$ , $t\to 0$ :

$$\mathscrsfs{R}_{\infty}(t,\varepsilon)\xrightarrow[\varepsilon\to 0,\,t\to 0]{}\begin{cases}
\mathscrsfs{R}_{\rm{init}}&\text{if }t=o(\varepsilon)\,,\\
\frac{1}{2}\sum_{k\geqslant 1}\varphi_{k}^{2}&\text{if }t=\omega(\varepsilon)\text{ and }t=\frac{1}{4|\sigma_{1}\varphi_{1}|}\varepsilon^{1/2}\log\frac{1}{\varepsilon}-\omega(\varepsilon^{1/2})\,,\\
\frac{1}{2}\sum_{k\geqslant 2}\varphi_{k}^{2}&\text{if }t=\frac{1}{4|\sigma_{1}\varphi_{1}|}\varepsilon^{1/2}\log\frac{1}{\varepsilon}+\omega(\varepsilon^{1/2})\text{ and }t=c_{2}\varepsilon^{1/4}-\omega(\varepsilon^{1/3})\,,\\
\frac{1}{2}\sum_{k\geqslant l}\varphi_{k}^{2}&\text{if }t=c_{l-1}\varepsilon^{1/(2(l-1))}+\omega(\varepsilon^{1/l})\text{ and }t=c_{l}\varepsilon^{1/(2l)}-\omega(\varepsilon^{1/(l+1)})\,,\\
&\qquad\text{ for all }3\leqslant l\leqslant L+1.
\end{cases}$$

Figure 1 provides a cartoon illustration of the canonical learning order.
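The plateau heights in Definition 1 depend only on the Hermite coefficients of $\varphi$ , so they are easy to tabulate once a truncated coefficient vector is available. The sketch below computes $\mathscrsfs{R}_{\rm{init}}$ from Eq. (9), the levels $\frac{1}{2}\sum_{k\geqslant l}\varphi_{k}^{2}$ for $l=1,\dots,L+1$ , and the leading-order time of the first drop, $\frac{1}{4|\sigma_{1}\varphi_{1}|}\varepsilon^{1/2}\log\frac{1}{\varepsilon}$ . The function names and the truncation of the series are assumptions of this sketch, and the constants $c_{l}$ are not computed since the text does not specify them here.

```python
import numpy as np

def plateau_risks(phi_coeffs, sigma0, mean_a, L):
    """Return (R_init, [levels for l = 1..L+1]) from truncated Hermite coefficients."""
    phi = np.asarray(phi_coeffs, dtype=float)
    tail = np.cumsum((phi ** 2)[::-1])[::-1]        # tail[k] = sum_{j >= k} phi_j^2
    r_init = 0.5 * (phi[0] - sigma0 * mean_a) ** 2 + 0.5 * tail[1]
    # Beyond the truncation order the tail sum is zero.
    levels = [0.5 * tail[l] if l < phi.size else 0.0 for l in range(1, L + 2)]
    return r_init, levels

def first_drop_time(eps, sigma1, phi1):
    """Leading-order time of the first drop: eps^(1/2) log(1/eps) / (4 |sigma_1 phi_1|)."""
    return np.sqrt(eps) * np.log(1.0 / eps) / (4.0 * abs(sigma1 * phi1))

# Hypothetical coefficients (phi_0, ..., phi_3) = (1, 2, 2, 1).
r_init, levels = plateau_risks([1.0, 2.0, 2.0, 1.0], sigma0=0.5, mean_a=2.0, L=2)
t1 = first_drop_time(eps=0.01, sigma1=0.5, phi1=0.5)
```

In this example $\varphi_{0}=\sigma_{0}\int a\,{\rm P}_{A}(\mathrm{d}a)$ , so the initial risk coincides with the first plateau, and the successive levels are $4.5$ , $2.5$ , $0.5$ .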

At first sight, the setting of Eq. (2) is overly restrictive because we require $\|u_{i}\|_{2}=1$ and we do not have offsets in the activations. Therefore, it might seem that $s=1$ and $\varphi=\sigma$ are required in order to approximate the target function arbitrarily well. In contrast, the next proposition shows that the network (2) enjoys universal approximation properties.

Proposition 1.

Assume that $\sigma$ is Lipschitz continuous and generic in the following sense: the decomposition of $\sigma$ into Hermite polynomials does not have any coefficient equal to $0$ . For any Lipschitz function $\varphi:{\mathbb{R}}\to{\mathbb{R}}$ , $\|u_{*}\|_{2}=1$ , and $x\sim{\mathcal{N}}(0,I_{d})$ such that $\mathbb{E}\{\varphi(\langle u_{*},x\rangle)^{2}\}<\infty$ , there exists a sequence $m\to\infty$ and $a^{(m)},u^{(m)}$ with $\|u_{i}^{(m)}\|_{2}=1$ such that

$$\lim_{d\to\infty}\lim_{m\to\infty}\mathbb{E}\big\{\big(\varphi(\langle u_{*},x\rangle)-f(x;a^{(m)},u^{(m)})\big)^{2}\big\}=0\,.$$

This result is not surprising in view of the arguments in the next sections, which suggest that indeed gradient flow constructs such an approximation for a broad class of functions of the form $f_{*}(x)=\varphi(\langle u_{*},x\rangle)$ . We nevertheless give an independent proof in Appendix A.

A specific realization of our general setup is determined by the triple $(\sigma,\varphi,{\rm P}_{A})$ . In the rest of the paper, we will provide evidence showing that the canonical learning order holds in a number of cases. Nevertheless, we can also construct examples in which it does not hold:

• If one or more of the Hermite coefficients of the activation vanish, then the canonical learning order does not hold for general $\varphi$ . Specifically, if $\sigma_{k}=0$ , then for any $t$ the function $f(x;a(t),u(t))$ remains orthogonal to $\mathrm{He}_{k}(\langle u_{*},x\rangle)$ . In particular, if $\varphi_{k}\neq 0$ then the risk remains bounded away from zero for every $t$ . We refer to Appendix E.1 for a formal statement.

• If the first $k+1$ Hermite coefficients of $\varphi$ vanish, $\varphi_{0}=\dots=\varphi_{k}=0$ , $k\geq 1$ , then the canonical learning order does not hold. (See Appendix E.2 for the proof.)

• In fact, we expect the canonical learning order might fail every time one or more of the coefficients $\varphi_{k}$ vanish, for $k\geq 1$ . Appendix E.3 provides some heuristic justification for this failure.

Remark 2.2.

We can compare the canonical learning order described here to the ones in earlier literature, described as theories $\#1$ , $\#2$ , $\#3$ in the introduction. There appear to be points of contact, but also important differences, with both theory $\#1$ and theory $\#3$ :

• As in theory $\#1$ , the plateaus and separation of time scales arise because the trajectory of gradient flow is approximated by a sequence of motions along submanifolds in the space of parameters $(a,u)$ . Along the $l$ -th such submanifold $f(x;a,u)$ is well-approximated by a degree- $l$ polynomial. Escaping each submanifold takes an increasingly longer time.

This is reminiscent of the motion between saddles investigated in earlier work [41, 17, 44]. However, unlike in earlier work, we will see that this applies to networks with a large (possibly diverging) number of hidden neurons. Also, we identify the subsequent phases of learning with the polynomial decomposition of Eq. (8).
• As in theory $\#3$ , subsequent phases of learning correspond to increasingly accurate polynomial approximations of the target function $\varphi(\langle u_{*},x\rangle)$ . However, the underlying mechanism and time scales are completely different. In the linear regime, the different time scales emerge because of increasingly small eigenvalues of the neural tangent kernel. In that case, the time required to learn degree- $l$ polynomials is of order $d^{l}$ [21].

In contrast, in the canonical learning order, polynomials of degree $l$ are learnt on a time scale of order one in $d$ (and only depending on the learning rate $\varepsilon$ ). This of course has important implications when approximating gradient flow by SGD. Within the linear regime, the sample size required to learn a polynomial of order $l$ scales like $d^{l}$ [21], while in the canonical learning order, it is only of order $d$ (see Section 7).

3 Further related work

As we mentioned in the introduction, plateaus and time scales in the learning dynamics of kernel models were analyzed by [21]. A sharp analysis for the related random features model was developed by [12].

Our analysis builds upon the mean-field description of learning in two-layer neural networks, which was developed in a sequence of works, see, e.g., [32, 40, 13, 33]. In particular, we leverage the fact that, for the data distribution (1), the population risk function is invariant under rotations around the axis $u_{*}$ , and this allows for a dimensionality reduction in the mean field description. Similar symmetry arguments were used by [32] and, more recently, by [1].

The single-index model can be learnt using simpler methods than large two-layer networks. Limiting ourselves to the case of gradient descent algorithms, [31] proved that gradient descent with respect to the non-convex empirical risk $\widehat{R}_{n}(u):=n^{-1}\sum_{i=1}^{n}(y_{i}-\varphi(u^{\top}x_{i}))^{2}$ converges to a near global optimum, provided $\varphi$ is strictly increasing. [4] considered online SGD under more challenging learning scenarios and characterized the time (sample size) for $|\langle u,u_{*}\rangle|$ to become significantly larger than for a random unit vector $u$ .

Learning in overparametrized two-layer networks under model (1) (or its variations) has been studied recently by several groups. In particular, [6] considers a training procedure which runs a single step of gradient descent, followed by freezing the first layer and performing ridge regression with respect to the second layer. This scheme is amenable to a precise characterization of the generalization error. [11] consider a similar scheme in which a first phase of gradient descent is run to achieve positive correlation with the unknown direction $u_{*}$ . [14] also consider a two-phase scheme, and prove consistency and excess risk bounds for a more general class of target functions whereby the first equation in (1) is replaced by

$$y_{i}=\varphi(U_{*}^{\top}x_{i})+\varepsilon_{i}\,,\qquad U_{*}\in{\mathbb{R}}^{d\times k}\,,\quad\varphi:{\mathbb{R}}^{k}\to{\mathbb{R}}\,, \tag{11}$$

with $k\ll d$ . In particular, near optimal error bounds are obtained under a non-degeneracy condition on $\nabla^{2}\varphi$ .

[1] consider a similar model whereby $x\sim\mathrm{Unif}(\{+1,-1\}^{d})$ , and $y=\varphi(x_{S})$ where $S\subseteq[d]$ , and $x_{S}=(x_{i})_{i\in S}$ (i.e., $x_{S}$ contains the coordinates of $x$ indexed by entries of $S$ ). Under a structural assumption on $\varphi$ (the ‘merged staircase property’), and for $|S|$ fixed, they prove that a two-stage algorithm learns the target function with sample complexity of order $d$ . This paper is technically related to ours in that it uses mean-field theory to obtain a characterization of learning in terms of a PDE in a reduced $(k+2)$ -dimensional space.

A similar model was studied by [8], which bounds the sample complexity by $d^{O(k)}$ for learning parities on $k$ bits using gradient descent with large batches (if $k=O(1)$ , [8] require $O(1)$ steps with batch size $d^{O(k)}$ ).

Let us emphasize that our objective is quite different from these works. We implement a simple online SGD algorithm with additional projection steps, and try to derive a precise picture of the successive phases of learning (in particular, we do not consider two-stage schemes or layer-by-layer learning). On the other hand, we focus on a relatively simple model.

To clarify the difference, it is perhaps useful to rephrase our claims in terms of sample complexity. While previous works show that the target function can be learnt with $O(d)$ samples, we claim that it is learnt by online SGD with test error $r$ from about $C(r,\varepsilon)d$ samples and characterize the dependence of $C(r,\varepsilon)$ on $r$ for small $\varepsilon$ . (Falling short of a proof in the general case.)

After posting an initial version of this paper, we became aware that [3] independently derived equations similar to (15)-(19), (26), (130). There are technical differences, and hence we cannot apply their results directly. However, Section 4.3 and Appendix B.4 are analogous to their work.

4 The large-network, high-dimensional limit

The first step of our analysis is a reduction of the system of ODEs (5), (6), with dimension $m(d+1)$ to a system of ODEs in $2m$ dimensions. We will achieve this reduction in two steps:

$(i)$ First we reduce to a system in $m(m+3)/2$ dimensions for the variables $a_{i}$ , $\langle u_{i},u_{j}\rangle$ , $\langle u_{i},u_{*}\rangle$ . This reduction is exact and is quite standard. It is done in Section 4.1.

$(ii)$ We then show that the products $\langle u_{i},u_{j}\rangle$ can be eliminated, with an error $O(1/m)$ . This is done in Section 4.2. As further discussed below, the resulting dynamics could also be derived from the mean field theory of [32, 40, 13, 33] (with the required modifications for the constraints $\|u_{i}\|=1$ ).

In order to define formally the reduced system, we define the functions $U,V:[-1,1]\to\mathbb{R}$ via:

\begin{align}
V(s)&:=\mathbb{E}\{\varphi(G)\,\sigma(G_{s})\}=\sum_{k\geqslant 0}\varphi_{k}\sigma_{k}s^{k}\,,\qquad(G,G_{s})\sim{\mathcal{N}}\left(0,\begin{bmatrix}1&s\\ s&1\end{bmatrix}\right)\,, \tag{12}\\
U(s)&:=\mathbb{E}\{\sigma(G)\,\sigma(G_{s})\}=\sum_{k\geqslant 0}\sigma_{k}^{2}s^{k}\,. \tag{13}
\end{align}

Note that the above identities follow from [36, Proposition 11.31]. Throughout this section, we will make the following assumptions.
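The power-series forms in Eqs. (12)-(13) make $V$ , $U$ and their derivatives cheap to evaluate once (truncated) Hermite coefficients are known. The following sketch builds them with NumPy's polynomial module and compares $V(s)$ against a Monte Carlo estimate of $\mathbb{E}\{\varphi(G)\sigma(G_{s})\}$ . The helper name `make_V_U`, the truncation, and the test case $\varphi(z)=\sigma(z)=z^{2}$ (normalized coefficients $(1,0,\sqrt{2})$ , so $V(s)=U(s)=1+2s^{2}$ ) are choices made for illustration only.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def make_V_U(phi_coeffs, sigma_coeffs):
    """Return callables V, U, V', U' built from truncated Hermite coefficients."""
    phi = np.asarray(phi_coeffs, dtype=float)
    sigma = np.asarray(sigma_coeffs, dtype=float)
    k = min(phi.size, sigma.size)
    v = phi[:k] * sigma[:k]            # V(s) = sum_k phi_k sigma_k s^k
    u = sigma ** 2                     # U(s) = sum_k sigma_k^2 s^k
    V = lambda s: P.polyval(s, v)
    U = lambda s: P.polyval(s, u)
    dV = lambda s: P.polyval(s, P.polyder(v))
    dU = lambda s: P.polyval(s, P.polyder(u))
    return V, U, dV, dU

coeffs = [1.0, 0.0, np.sqrt(2.0)]      # normalized Hermite coefficients of z^2
V, U, dV, dU = make_V_U(coeffs, coeffs)

# Monte Carlo estimate of E[phi(G) sigma(G_s)] with correlation s.
rng = np.random.default_rng(0)
s = 0.5
G = rng.standard_normal(200_000)
G_s = s * G + np.sqrt(1.0 - s ** 2) * rng.standard_normal(200_000)
mc_estimate = np.mean(G ** 2 * G_s ** 2)
```

Here `V(0.5)` equals $1.5$ exactly, and the Monte Carlo estimate should agree up to sampling error.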

A1.

The distribution of weights at initialization, ${\rm P}_{A}$ is supported on $[-M_{1},M_{1}]$ .

A2.

The activation function is bounded: $\left\|{\sigma}\right\|_{\infty}\leq M_{2}$ . Additionally, the functions $V$ and $U$ are bounded and of class $C^{2}$ , with uniformly bounded first and second derivatives over $s\in[-1,1]$ . A sufficient condition for this is

$$\sup\left\{\left\|\sigma^{\prime}\right\|_{L^{2}},\,\left\|\sigma^{\prime\prime}\right\|_{L^{2}}\right\}\leq M_{2},\qquad\sup\left\{\left\|\varphi\right\|_{L^{2}},\,\left\|\varphi^{\prime}\right\|_{L^{2}},\,\left\|\varphi^{\prime\prime}\right\|_{L^{2}}\right\}\leq M_{2}.$$

A3.

Responses are bounded, i.e., $\|\varphi\|_{\infty}\leq M_{3}$ .

Remark 4.1.

We hereby briefly explain the sufficiency of $L^{2}$ -boundedness of derivatives of $\sigma$ and $\varphi$ as claimed in Assumption A2. Suppose for example that $\left\|{\sigma^{\prime}}\right\|_{L^{2}}\leq M_{2}$ and $\left\|{\varphi^{\prime}}\right\|_{L^{2}}\leq M_{2}$ , then we have

$$\sup_{s\in[-1,1]}\left|V^{\prime}(s)\right|\stackrel{(a)}{=}\sup_{s\in[-1,1]}\left|\mathbb{E}\{\varphi^{\prime}(G)\,\sigma^{\prime}(G_{s})\}\right|\stackrel{(b)}{\leq}\left\|\varphi^{\prime}\right\|_{L^{2}}\left\|\sigma^{\prime}\right\|_{L^{2}}\leq M_{2}^{2}, \tag{14}$$

where $(a)$ follows from Gaussian integration by parts and $(b)$ follows from Cauchy-Schwarz inequality.

4.1 Reduction to $d$ -independent flow

Our first statement establishes reduction $(i)$ mentioned above. The proof of this fact is presented in Appendix B.1.

Proposition 2 (Reduction to $d$ -independent flow).

Define $s_{i}=\langle u_{i},u_{*}\rangle$ , $r_{ij}=\langle u_{i},u_{j}\rangle$ for $i,j=1,\dots,m$ . Then, letting $R=(r_{ij})_{i,j\leq m}$ , we have

$$\mathscrsfs{R}(a,u)=\mathscrsfs{R}_{\mbox{\tiny\rm red}}(a,s,R):=\frac{1}{2}\|\varphi\|^{2}_{L^{2}}-\frac{1}{m}\sum_{i=1}^{m}a_{i}V(s_{i})+\frac{1}{2m^{2}}\sum_{i,j=1}^{m}a_{i}a_{j}U(r_{ij})\,. \tag{15}$$

If $(a(t),u(t))$ solve the gradient flow ODEs (5)-(6) then $(a(t),s(t),R(t))$ are the unique solution of the following set of ODEs (note that $r_{ii}=1$ identically)

\begin{align}
\varepsilon\partial_{t}a_{i}&=V(s_{i})-\frac{1}{m}\sum_{j=1}^{m}a_{j}U(r_{ij})\,, \tag{16}\\
\partial_{t}s_{i}&=a_{i}\left(V^{\prime}(s_{i})(1-s_{i}^{2})-\frac{1}{m}\sum_{j=1}^{m}a_{j}U^{\prime}(r_{ij})(s_{j}-r_{ij}s_{i})\right)\,, \tag{17}\\
\partial_{t}r_{ij}&=a_{i}\left(V^{\prime}(s_{i})(s_{j}-s_{i}r_{ij})-\frac{1}{m}\sum_{p=1}^{m}a_{p}U^{\prime}(r_{ip})(r_{jp}-r_{ip}r_{ij})\right) \tag{18}\\
&\quad+a_{j}\left(V^{\prime}(s_{j})(s_{i}-s_{j}r_{ij})-\frac{1}{m}\sum_{p=1}^{m}a_{p}U^{\prime}(r_{jp})(r_{ip}-r_{jp}r_{ij})\right)\,. \tag{19}
\end{align}

The input dimension $d$ does not appear in the reduced ODEs, Eqs. (16) to (19), and only plays a role in the initialization of the $s_{i}$ ’s and the $r_{ij}$ ’s. Namely, since $u_{i,\rm{init}}\sim\mathrm{Unif}(\mathbb{S}^{d-1})$ , we can represent $u_{i,\rm{init}}=g_{i}/\|g_{i}\|_{2}$ with $g_{i}\sim\mathsf{N}(0,I_{d}/d)$ . By concentration of $\|g_{i}\|_{2}$ , this implies that, for $1\leq i<j\leq m$ , $s_{i}$ , $r_{ij}$ are approximately $\mathsf{N}(0,1/d)$ .
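The claim that $s_{i}$ and $r_{ij}$ are approximately $\mathsf{N}(0,1/d)$ at initialization is easy to check numerically. The sketch below samples $u_{i,\rm{init}}=g_{i}/\|g_{i}\|_{2}$ as in the text and rescales the empirical standard deviations by $\sqrt{d}$ . Taking $u_{*}$ to be the first coordinate vector is an assumption made for convenience (harmless by rotation invariance), and the sizes `m`, `d` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 400, 500

# u_i = g_i / ||g_i||_2 with g_i ~ N(0, I_d / d), as in the text.
G = rng.standard_normal((m, d)) / np.sqrt(d)
U = G / np.linalg.norm(G, axis=1, keepdims=True)

u_star = np.zeros(d)
u_star[0] = 1.0                          # assumed direction (any unit vector works)

s = U @ u_star                           # s_i = <u_i, u_*>
R = U @ U.T                              # r_ij = <u_i, u_j>, with r_ii = 1
off_diag = R[np.triu_indices(m, k=1)]    # entries with i < j

# Both rescaled standard deviations should be close to 1.
s_scaled_std = np.sqrt(d) * s.std()
r_scaled_std = np.sqrt(d) * off_diag.std()
```

Both overlaps are of order $1/\sqrt{d}$ , which is why the initialization $s_{i}(0)=0$ , $r_{ij}(0)=0$ becomes accurate as $d\to\infty$ .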

This discussion immediately yields the following consequence.

Corollary 1.

Let $(a(t),u(t))$ be the solution of the gradient flow ODEs (5), (6) with initialization (7), and let $(a^{0}(t),s^{0}(t),R^{0}(t))$ be the unique solution of Eqs. (16) to (19), with initialization $a^{0}_{i}(0)=a_{i}(0)$ , $s^{0}_{i}(0)=0$ , $r^{0}_{ij}(0)=0$ for $i\neq j$ . Then, for any fixed $T$ (possibly dependent on $m$ but not on $d$ ), the following hold with probability at least $1-\exp(-C^{\prime}m)$ over the i.i.d. initialization $(a_{i}(0),u_{i}(0))_{i\in[m]}$ :

\begin{align}
&\sup_{t\in[0,T]}\big\|\mathscrsfs{R}(a(t),u(t))-\mathscrsfs{R}_{\mbox{\tiny\rm red}}(a^{0}(t),s^{0}(t),R^{0}(t))\big\|\leq\frac{CM}{\sqrt{d}}\exp\left(MT(1+T)^{2}/\varepsilon^{2}\right)\,, \tag{20}\\
&\max\left(\sup_{t\in[0,T]}\frac{1}{\sqrt{m}}\|a(t)-a^{0}(t)\|_{2},\,\frac{1}{\sqrt{m}}\sup_{t\in[0,T]}\|s(t)-s^{0}(t)\|_{2}\right)\leq\frac{1}{\sqrt{d}}\cdot C\exp\left(MT(1+T)^{2}/\varepsilon^{2}\right)\,, \tag{21}\\
&\sup_{t\in[0,T]}\frac{1}{m}\|R(t)-R^{0}(t)\|_{\rm F}\leq\frac{1}{\sqrt{d}}\cdot C\exp\left(MT(1+T)^{2}/\varepsilon^{2}\right)\,. \tag{22}
\end{align}

Here $C,C^{\prime}$ are absolute constants and $M$ only depends on the $M_{i}$ ’s in Assumptions A1-A3.

The proof of Corollary 1 is deferred to Appendix B.2. From now on, we will assume the initialization $s^{0}_{i}(0)=0$ , $r^{0}_{ij}(0)=0$ for $i\neq j$ , but drop the superscript $0$ for notational simplicity. We notice in passing that the right-hand sides of Eqs. (20) to (22) are independent of $m$ : this approximation step holds uniformly over $m$ . (Note that the left-hand sides of Eqs. (21) and (22) are normalized by $\sqrt{m}$ and $m$ , respectively, so as to yield the root mean square error per entry.)

4.2 Elimination of the products $\langle u_{i},u_{j}\rangle$

In order to state the reduction $(ii)$ outlined above, we define the mean field risk as

\displaystyle\mathscrsfs{R}_{\mbox{\tiny\rm mf}}(a,s):=\mathscrsfs{R}_{\mbox{% \tiny\rm red}}(a,s,R=ss^{\top})=\frac{1}{2}\|\varphi\|^{2}_{L^{2}}-\frac{1}{m}% \sum_{i=1}^{m}a_{i}V(s_{i})+\frac{1}{2m^{2}}\sum_{i,j=1}^{m}a_{i}a_{j}U(s_{i}s% _{j})\,.

(23)

Further, we denote by $\{a^{\mbox{\tiny\rm mf}}_{i}(t),s^{\mbox{\tiny\rm mf}}_{i}(t)\}_{i=1}^{m}$ the solution to the following ODEs:

\begin{split}\varepsilon\partial_{t}a_{i}=\,&V(s_{i})-\frac{1}{m}\sum_{j=1}^{m% }a_{j}U(s_{i}s_{j})\,,\\ \partial_{t}s_{i}=\,&a_{i}\left(1-s_{i}^{2}\right)\left(V^{\prime}(s_{i})-% \frac{1}{m}\sum_{j=1}^{m}a_{j}U^{\prime}(s_{i}s_{j})s_{j}\right)\,.\end{split}

(24)

Note that (24) would be identical to (16)-(17) if we had $r_{ij}=s_{i}s_{j}$ . A priori, this is not the case. However, the two systems of equations are close to each other for large $m$ as made precise by our next proposition, which formalizes reduction $(ii)$ .

The intuitive explanation for the approximation $r_{ij}\approx s_{i}s_{j}$ is quite interesting. For large $m$ , due to ‘propagation of chaos’, the neuron weights $\{(u_{i},a_{i})\}_{i\leq m}$ are approximately independent. Further, because of the symmetry of the problem under rotations that keep $u_{*}$ fixed, weights $(u_{i})_{i\leq m}$ are approximately uniformly distributed conditional on $s_{i}=\langle u_{i},u_{*}\rangle$ . As a consequence, decomposing $u_{i}=s_{i}u_{*}+u_{i}^{\perp}$ , we have $r_{ij}=s_{i}s_{j}+\langle u_{i}^{\perp},u_{j}^{\perp}\rangle$ , with $u_{i}^{\perp}$ , $u_{j}^{\perp}$ approximately uniform on $\operatorname{span}\{u_{*}\}^{\perp}$ and independent. Therefore, in high dimensions we have $r_{ij}\approx s_{i}s_{j}$ .
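This heuristic can be illustrated with a small numerical experiment. The sketch below is an idealization, not the actual gradient flow: we assume $u_{*}=e_{1}$, prescribe arbitrary alignments $s_{i}$, and draw the orthogonal parts as $u_{i}^{\perp}=\sqrt{1-s_{i}^{2}}\,v_{i}$ with $v_{i}$ exactly independent and uniform on the unit sphere of $\operatorname{span}\{u_{*}\}^{\perp}$. All names and sizes are hypothetical.

```python
import numpy as np

def overlaps_given_alignment(s, d, rng):
    """Build unit vectors u_i with <u_i, e_1> = s_i and independent, uniformly
    distributed orthogonal directions; return the Gram matrix (r_ij)."""
    m = s.shape[0]
    v = rng.standard_normal((m, d - 1))
    v /= np.linalg.norm(v, axis=1, keepdims=True)  # uniform on the unit sphere of e_1^perp
    u = np.concatenate([s[:, None], np.sqrt(1.0 - s[:, None] ** 2) * v], axis=1)
    return u @ u.T

rng = np.random.default_rng(1)
m = 100
s = rng.uniform(-0.5, 0.5, size=m)  # arbitrary alignments, for illustration only
iu = np.triu_indices(m, k=1)

errs = {}
for d in (100, 10_000):
    R = overlaps_given_alignment(s, d, rng)
    diff = R - np.outer(s, s)                  # r_ij - s_i s_j = <u_i^perp, u_j^perp>
    errs[d] = np.sqrt(np.mean(diff[iu] ** 2))  # root mean square error over pairs i < j
```

Under these assumptions the root mean square of $r_{ij}-s_{i}s_{j}$ should shrink roughly like $1/\sqrt{d}$ as $d$ grows.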

Proposition 3 (Reduction to flow in $\mathbb{R}^{2m}$ ).

Let $(a_{i}(t),s_{i}(t),r_{ij}(t))_{1\leq i<j\leq m}$ be the unique solution of the ODEs (16)-(19) with initialization $s_{i}(0)=0$ , $r_{ij}(0)=0$ for all $1\leq i\neq j\leq m$ . Let $(a^{\mbox{\tiny\rm mf}}_{i}(t),s^{\mbox{\tiny\rm mf}}_{i}(t))_{i\leq m}$ be the unique solution of the ODEs (24) with initialization $s^{\mbox{\tiny\rm mf}}_{i}(0)=0$ , $a^{\mbox{\tiny\rm mf}}_{i}(0)=a_{i}(0)$ for all $i\leq m$ .

If assumptions A1-A3 hold, then for any $T<\infty$ there exists a constant

C(T)=M\exp(MT(1+T)^{2}/\varepsilon^{2})

(25)

(with $M$ depending on the constants $\{M_{i}\}_{1\leq i\leq 3}$ appearing in Assumptions A1-A3 only) such that:

\sup_{t\in[0,T]}\frac{1}{m}\sum_{i=1}^{m}\big{\|}(a_{i}(t),s_{i}(t))-(a^{\mbox% {\tiny\rm mf}}_{i}(t),s^{\mbox{\tiny\rm mf}}_{i}(t))\big{\|}_{2}^{2}\leq\frac{% C(T)}{m}\,.

Consequently,

\sup_{t\in[0,T]}\left|\mathscrsfs{R}_{\mbox{\tiny\rm red}}\left(a(t),s(t),R(t)% \right)-\mathscrsfs{R}_{\mbox{\tiny\rm mf}}\left(a^{\mbox{\tiny\rm mf}}(t),s^{% \mbox{\tiny\rm mf}}(t)\right)\right|\leq\frac{C(T)}{\sqrt{m}}\,.

The proof of this proposition is deferred to Appendix B.3. Now, combining the propositions and corollaries in this section, we deduce that with high probability over the i.i.d. initialization,

\sup_{t\in[0,T]}\left|\mathscrsfs{R}(a(t),u(t))-\mathscrsfs{R}_{\mbox{\tiny\rm mf% }}\left(a^{\mbox{\tiny\rm mf}}(t),s^{\mbox{\tiny\rm mf}}(t)\right)\right|\leq% \left(\frac{1}{\sqrt{d}}+\frac{1}{\sqrt{m}}\right)CM\exp(MT(1+T)^{2}/% \varepsilon^{2}).

(26)

4.3 Connection with mean field theory

Consider the empirical distributions of the neurons:

	$\displaystyle\widehat{\rho}_{t}$	$\displaystyle:=\frac{1}{m}\sum_{i=1}^{m}\delta_{(a_{i}(t),s_{i}(t))}\,,$		(27)
	$\displaystyle\rho_{t}$	$\displaystyle:=\frac{1}{m}\sum_{i=1}^{m}\delta_{(a^{\mbox{\tiny\rm mf}}_{i}(t)% ,s^{\mbox{\tiny\rm mf}}_{i}(t))}\,,$		(28)

with $(a_{i}(t),s_{i}(t))_{i\leq m}$ , $(a^{\mbox{\tiny\rm mf}}_{i}(t),s^{\mbox{\tiny\rm mf}}_{i}(t))_{i\leq m}$ as in the statement of Proposition 3, i.e., solving (respectively) Eqs. (16)-(19) and Eq. (24) with initial conditions as given there.

Then, it is immediate to show that $\rho_{t}$ solves (in the weak sense) the following continuity partial differential equation (PDE); we refer to [2, 42] for the definition of weak solutions and basic properties, and to Appendix B.4 for a short derivation.

	$\displaystyle\partial_{t}\rho_{t}(a,s)$	$\displaystyle=-\nabla\cdot\left(\rho_{t}\Psi\left(a,s;\rho_{t}\right)\right)$		(29)
		$\displaystyle:=-\left(\partial_{a}\left(\rho_{t}\Psi_{a}\left(a,s;\rho_{t}% \right)\right)+\partial_{s}\left(\rho_{t}\Psi_{s}\left(a,s;\rho_{t}\right)% \right)\right),$		(30)

where $\Psi=(\Psi_{a},\Psi_{s})$ is given by

	$\displaystyle\Psi_{a}(a,s;\rho)=\,$	$\displaystyle\varepsilon^{-1}\cdot\left(V(s)-\int_{\mathbb{R}^{2}}a_{1}U(ss_{1% })\rho(\mathrm{d}a_{1},\mathrm{d}s_{1})\right),$		(31)
	$\displaystyle\Psi_{s}(a,s;\rho)=\,$	$\displaystyle a(1-s^{2})\cdot\left(V^{\prime}(s)-\int_{\mathbb{R}^{2}}a_{1}s_{% 1}U^{\prime}(ss_{1})\rho(\mathrm{d}a_{1},\mathrm{d}s_{1})\right).$		(32)

This equation can be extended to a flow in the whole space $(\mathscr{P}(\mathbb{R}^{2}),W_{2})$ (all probability measures on $\mathbb{R}^{2}$ equipped with the second Wasserstein distance), and interpreted as a gradient flow, with respect to this metric, of the following risk:

\displaystyle\mathscrsfs{R}_{\mbox{\tiny\rm mf},*}(\rho):=\frac{1}{2}\|\varphi% \|^{2}_{L^{2}}-\int\!aV(s)\,\rho(\mathrm{d}a,\mathrm{d}s)+\frac{1}{2}\int\!a_{% 1}a_{2}U(s_{1}s_{2})\,\rho(\mathrm{d}a_{1},\mathrm{d}s_{1})\,\rho(\mathrm{d}a_% {2},\mathrm{d}s_{2})\,\,,

(33)

which is the obvious extension of $\mathscrsfs{R}_{\mbox{\tiny\rm mf}}(a,s)$ of Eq. (23) to general probability distributions. Proposition 3 implies that for any $T<\infty$ , and under the above initial conditions,

\displaystyle\sup_{t\in[0,T]}W_{2}(\rho_{t},\widehat{\rho}_{t})\leq\sqrt{\frac% {M\exp(MT(1+T)^{2}/\varepsilon^{2})}{m}}\,.

(34)
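The step from Proposition 3 to Eq. (34) uses the fact that, for two empirical measures with the same number of atoms, pairing atom $i$ with atom $i$ is an admissible coupling, so that $W_{2}^{2}$ is at most the mean squared distance between paired atoms. A minimal sketch of this bound follows; the function name is hypothetical and the rows of the arrays stand for the pairs $(a_{i},s_{i})$.

```python
import numpy as np

def w2_identity_coupling_bound(x, y):
    """Upper bound on W_2 between the empirical measures of the rows of x and y,
    obtained by coupling row i of x with row i of y."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    assert x.shape == y.shape
    return np.sqrt(np.mean(np.sum((x - y) ** 2, axis=1)))

# Two atoms each, shifted by one unit in the second coordinate.
x = np.array([[0.0, 0.0], [1.0, 0.0]])
y = np.array([[0.0, 1.0], [1.0, 1.0]])
bound = w2_identity_coupling_bound(x, y)
```

In general this is only an upper bound, since a different pairing of the atoms may be cheaper.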

If we further denote by $\rho_{t}^{d}$ the empirical distribution of $(a_{i}(t),s_{i}(t))$ , $i\leq m$ , when $s_{i}(0)=\langle u_{i}(0),u_{*}\rangle$ , $u_{i}(0)\sim\mathrm{Unif}(\mathbb{S}^{d-1})$ , a further application of Corollary 1 yields

\displaystyle\sup_{t\in[0,T]}W_{2}(\rho^{d}_{t},\rho_{t})\leq\sqrt{\frac{M\exp% (MT(1+T)^{2}/\varepsilon^{2})}{m\wedge d}}\,.

(35)

Starting with [32, 13, 40], several authors used continuity PDEs of the form (29) to study the learning dynamics of two-layer neural networks. Following the physics tradition, this is referred to as the ‘mean-field theory’ of two-layer neural networks. Appendix B.5 sketches an alternative approach to prove bounds of the form (26), (35) using the results of [32, 33]. The present derivation has the advantages of yielding a sharper bound and of being self-contained.

4.4 A general formulation

As mentioned above, the system of ODEs in Eq. (24) is a special case of the Wasserstein gradient flow of Eq. (29) whereby we set $\rho_{0}=m^{-1}\sum_{i=1}^{m}\delta_{(a_{i}^{\mbox{\tiny\rm mf}}(0),s_{i}^{\mbox{\tiny\rm mf}}(0))}$ . In order to study the solutions of Eq. (29) (hence Eq. (24)) we adopt the following framework. Let $(\Omega,\rho)$ denote a probability space. Let $a=a(\omega,t)$ and $s=s(\omega,t)$ ( $\omega\in\Omega$ , $t\geqslant 0$ ) be two measurable functions satisfying (dropping dependencies in $t$ below)

\begin{split}\varepsilon\partial_{t}a(\omega)=\,&V(s(\omega))-\int a(\nu)U(s(% \omega)s(\nu))\mathrm{d}\rho(\nu)\,,\\ \partial_{t}s(\omega)=\,&a(\omega)\left(1-s(\omega)^{2}\right)\left(V^{\prime}% (s(\omega))-\int a(\nu)U^{\prime}(s(\omega)s(\nu))s(\nu)\mathrm{d}\rho(\nu)% \right)\,.\end{split}

(36)

If $\omega=i\in\Omega=\{1,\dots,m\}$ endowed with the uniform measure, we obtain the equations (24). In general, the push-forward $\rho_{t}$ of the measure $\rho$ through the map $\omega\in\Omega\mapsto(a(\omega,t),s(\omega,t))\in\mathbb{R}^{2}$ satisfies the mean-field equation (29). As a consequence, the dynamics (36) can be viewed as a gradient flow on the risk

\mathscrsfs{R}_{\mbox{\tiny\rm mf},*}(\rho)=\frac{1}{2}\|\varphi\|^{2}-\int a(% \omega)V(s(\omega))\mathrm{d}\rho(\omega)+\frac{1}{2}\int a(\omega_{1})a(% \omega_{2})U(s(\omega_{1})s(\omega_{2}))\mathrm{d}\rho(\omega_{1})\mathrm{d}% \rho(\omega_{2})\,.

(37)

We next characterize the landscape of the risk function $\mathscrsfs{R}_{\mbox{\tiny\rm mf},*}(\rho)$ . In particular, we establish that under certain conditions, the global infimum of $\mathscrsfs{R}_{\mbox{\tiny\rm mf},*}(\rho)$ is $0$ .

Proposition 4.

The risk function $\mathscrsfs{R}_{\mbox{\tiny\rm mf},*}(\rho)$ can be expressed as

\mathscrsfs{R}_{\mbox{\tiny\rm mf},*}(\rho)=\,\frac{1}{2}\sum_{k=0}^{\infty}% \left(\varphi_{k}-\sigma_{k}\int a(\omega)s(\omega)^{k}\mathrm{d}\rho(\omega)% \right)^{2}.

(38)

Assume that $\sigma_{k}\neq 0$ for all $k\geq 0$ , and that

\sum_{k=0}^{\infty}\sigma_{k}^{2}<\infty,\quad\sum_{k=0}^{\infty}\varphi_{k}^{% 2}<\infty.

Then, for any $\delta>0$ , there exists a triple $(a,s,\rho)$ such that $a\in L^{2}(\rho)$ , $s\in[-1,1]$ , and $\mathscrsfs{R}_{\mbox{\tiny\rm mf},*}(\rho)\leq\delta^{2}$ .

This proposition is proved in Appendix B.6.

Remark 4.2.

Proposition 4 complements Proposition 1, which establishes approximability of the target function $f_{*}(x)=\varphi(\langle u_{*},x\rangle)$ using the networks (2) (Proposition 4 can be seen as an $m=d=\infty$ version of the latter). We note that the proofs of these propositions also provide insight into the structure of approximators. Namely, we can take the weights $u_{i}$ to be i.i.d. with distribution $\rho(u)\mathrm{d}u$ that is symmetric under rotations around $u_{*}$ , and $a_{i}=\alpha(u_{i})$ , where $a(u)=\alpha(u)\rho(u)$ is concentrated close to $\langle u_{*},u\rangle=0$ (on a scale that can depend on the desired approximation error).

Indeed, the analysis of gradient flow in Section 6 reveals that the solutions found by gradient flow are of this nature. Namely, neurons develop a small but strictly positive alignment with $u_{*}$ . The distribution and size of the alignment evolve over time.

Remark 4.3.

The results in this section can be generalized to multi-index models: $y=\varphi(U_{*}^{\top}x)$ where $U_{*}\in O(d,k)$ , the space of $d\times k$ orthogonal matrices. Further, the corresponding limiting dynamics become

\begin{split}\varepsilon\partial_{t}a(\omega)=\,&V(s(\omega))-\int a(\nu)U% \left(s(\omega)^{\top}s(\nu)\right)\mathrm{d}\rho(\nu)\,,\\ \partial_{t}s(\omega)=\,&a(\omega)\left(I_{k}-s(\omega)s(\omega)^{\top}\right)% \left(\nabla V(s(\omega))-\int a(\nu)U^{\prime}\left(s(\omega)^{\top}s(\nu)% \right)s(\nu)\mathrm{d}\rho(\nu)\right)\,.\end{split}

Here, $s(\omega)\in\mathbb{R}^{k}$ represents $U_{*}^{\top}u(\omega)$ , and for $s\in\mathbb{R}^{k}$ , $\left\|{s}\right\|_{2}\leq 1$ :

V(s)=\mathbb{E}\left[\varphi(G)\sigma(G_{s})\right],\quad(G,G_{s})\sim{% \mathcal{N}}\left(0,\left[\begin{matrix}I_{k}&s\\ s^{\top}&1\end{matrix}\right]\right).

The definition of $U$ is the same as before.

5 Numerical solution

[Figure panel (a): $\varepsilon=10^{-3}$ ]

In Figure 2, we present the result of an Euler discretization of Eq. (24) where $\varphi$ is a degree- $2$ polynomial and $\sigma$ is the ReLU activation: $\sigma(s)=\max(s,0)$ ,

\displaystyle\begin{split}&\varphi(s)=\mathrm{He}_{0}(s)-\mathrm{He}_{1}(s)-\frac{2}{3}\mathrm{He}_{2}(s)\,\\ &\hskip 19.91692pt=\left(1+\frac{2\sqrt{2}}{6}\right)-s-\frac{2\sqrt{2}}{6}s^{2}\,.\end{split}

(39)

These plots clearly display two of the features emphasized in the introduction: $(i)$ plateaus separated by periods of rapid improvement of the risk; $(ii)$ increasingly long timescales (notice the logarithmic time axis in the second and third rows).
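For concreteness, an explicit Euler discretization of Eq. (24) can be sketched as follows. This is an illustration under stated assumptions, not the code used to produce the figures. We assume the expansions $V(s)=\sum_{k}\varphi_{k}\sigma_{k}s^{k}$ and $U(r)=\sum_{k}\sigma_{k}^{2}r^{k}$, truncate the ReLU coefficients at degree $2$, draw $a_{i}(0)$ from a standard Gaussian, and clip $s_{i}$ to $[-1,1]$ after each step; the step size, horizon and all function names are hypothetical.

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Hermite coefficients, degrees 0..2 (truncation is an assumption of this sketch).
PHI = np.array([1.0, -1.0, -2.0 / 3.0])  # target coefficients, first line of Eq. (39)
SIGMA = np.array([1 / np.sqrt(2 * np.pi), 0.5, 1 / (2 * np.sqrt(np.pi))])  # ReLU

V_COEF, U_COEF = PHI * SIGMA, SIGMA ** 2  # V(s) = sum_k phi_k sigma_k s^k, U(r) = sum_k sigma_k^2 r^k
DV_COEF, DU_COEF = P.polyder(V_COEF), P.polyder(U_COEF)

def risk_mf(a, s):
    """Mean field risk of Eq. (23), with ||phi||^2 = sum_k phi_k^2."""
    m = a.size
    quad = a @ P.polyval(np.outer(s, s), U_COEF) @ a
    return 0.5 * np.sum(PHI ** 2) - np.mean(a * P.polyval(s, V_COEF)) + 0.5 * quad / m ** 2

def euler_step(a, s, eps, dt):
    """One explicit Euler step of Eq. (24)."""
    m = a.size
    ss = np.outer(s, s)  # matrix of products s_i s_j
    da = (P.polyval(s, V_COEF) - P.polyval(ss, U_COEF) @ a / m) / eps
    ds = a * (1 - s ** 2) * (
        P.polyval(s, DV_COEF) - (P.polyval(ss, DU_COEF) * s[None, :]) @ a / m
    )
    # Clipping guards against discretization overshoot beyond |s| = 1.
    return a + dt * da, np.clip(s + dt * ds, -1.0, 1.0)

def simulate(m=50, eps=1e-2, dt=1e-3, n_steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    a, s = rng.standard_normal(m), np.zeros(m)  # assumed init: a_i ~ N(0, 1), s_i = 0
    risks = [risk_mf(a, s)]
    for _ in range(n_steps):
        a, s = euler_step(a, s, eps, dt)
        risks.append(risk_mf(a, s))
    return a, s, np.array(risks)

a, s, risks = simulate()
```

The step size must resolve the fast variables, whose rates scale like $1/\varepsilon$; smaller values of $\varepsilon$ therefore require proportionally smaller steps.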

In order to examine the incremental learning structure, we rewrite the risk $\mathscrsfs{R}_{\mbox{\tiny\rm mf}}$ of Eq. (23) by decomposing $\varphi$ and $\sigma$ in the basis of Hermite polynomials

\displaystyle\mathscrsfs{R}_{\mbox{\tiny\rm mf}}(a,s)=\frac{1}{2}\sum_{k\geqslant 0}\left(\varphi_{k}-\frac{\sigma_{k}}{m}\sum_{i=1}^{m}a_{i}s_{i}^{k}\right)^{2}\,.

(40)
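The agreement between Eq. (23) and Eq. (40) can be verified numerically once $V$ and $U$ are written in terms of Hermite coefficients. In the sketch below we assume $V(s)=\sum_{k}\varphi_{k}\sigma_{k}s^{k}$, $U(r)=\sum_{k}\sigma_{k}^{2}r^{k}$ and $\|\varphi\|_{L^{2}}^{2}=\sum_{k}\varphi_{k}^{2}$, truncated at degree $2$; the coefficient values and function names are hypothetical. The per-degree terms returned by `risk_hermite_terms` are the quantities one would track to monitor incremental learning.

```python
import numpy as np

PHI = np.array([1.0, -1.0, -2.0 / 3.0])  # illustrative phi_k, k = 0, 1, 2
SIGMA = np.array([0.4, 0.5, 0.28])       # illustrative sigma_k (hypothetical values)
K = np.arange(PHI.size)                  # degrees 0, 1, 2

def moments(a, s):
    """Vector of (1/m) sum_i a_i s_i^k for each degree k."""
    return np.mean(a[:, None] * s[:, None] ** K[None, :], axis=0)

def risk_hermite_terms(a, s):
    """Per-degree contributions 0.5 * (phi_k - sigma_k * moment_k)^2 of Eq. (40)."""
    return 0.5 * (PHI - SIGMA * moments(a, s)) ** 2

def risk_vu(a, s):
    """Eq. (23), with V and U given by their (assumed) power series."""
    m = a.size
    V = (s[:, None] ** K) @ (PHI * SIGMA)              # V(s_i)
    U = (np.outer(s, s)[:, :, None] ** K) @ SIGMA ** 2  # U(s_i s_j)
    return 0.5 * np.sum(PHI ** 2) - np.mean(a * V) + 0.5 * (a @ U @ a) / m ** 2

rng = np.random.default_rng(0)
a, s = rng.standard_normal(20), rng.uniform(-1, 1, 20)
terms = risk_hermite_terms(a, s)
```

Summing `terms` should reproduce `risk_vu(a, s)` up to floating-point error, since expanding the square in Eq. (40) gives back the three terms of Eq. (23).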

We observe that, for small $\varepsilon$ , the Hermite coefficients of $\varphi$ are learned sequentially, in the order of their degree. When $\varepsilon$ is sufficiently small (right plots), this incremental learning happens in well separated phases. The plateaus and waterfalls in the plots of $\mathscrsfs{R}_{\mbox{\tiny\rm mf}}$ correspond to the network learning increasingly higher degree polynomials.

In Figure 3 we plot the evolution of the values of the $a_{i}$ and $s_{i}$ , for $i\in\{1,\dots,m\}$ . We observe that the overall order of magnitude of the $a_{i}$ ’s and the $s_{i}$ ’s increases when passing through the different phases of the incremental learning process. Meanwhile, some of the $a_{i}$ ’s and $s_{i}$ ’s undergo a sign change during the learning process, which is characterized by a sudden decrease and subsequent rapid increase in their magnitude.

Altogether, the results of Figures 2 and 3 are consistent with the canonical learning order up to level $L=2$ as per Definition 1. While we conjecture that incremental learning also occurs for higher-order polynomials, we found this hard to observe in numerical simulations: we would need to take $\varepsilon$ much smaller than in Figure 2, resulting in prohibitively large simulation costs.

First, as predicted in Definition 1, the times at which the components are learned are closer on a logarithmic scale as the degree increases. It is therefore increasingly difficult to observe time scales corresponding to higher degrees.

Second, we expect there to be a choice of the initialization $(a_{i,\rm{init}},u_{i,\rm{init}})_{i\in[m]}$ , activation and target function, for which not all the components of $\varphi$ are actually learnt. We observed empirically that this happens easily for small $m$ .

To conclude this section, in Figure 4 we compare the simplified neuron dynamics (MF) of Eq. (24) and the evolution of projected gradient descent for the original two-layer neural network (NN). From the plots we observe two remarkable phenomena: (1) the evolution of the risk for NN and MF are close to each other during the entire learning process for both large learning rate ratio ( $\varepsilon=1$ ) and small ( $\varepsilon=10^{-3}$ ), and their risk curves have the same qualitative behavior even if $m$ is small ( $m=10$ ); (2) as we increase the value of $m$ from $10$ to $50$ , the alignment between the learning curves of NN and MF improves significantly. These observations justify our argument in Section 4.2 that the inter-neuron correlations $r_{ij}$ are well approximated by $s_{i}s_{j}$ for wide networks.

6 Timescales hierarchy in the gradient flow dynamics

We are interested in the behavior of the solution of the ODEs (36), initialized from $s(\omega,0)=0$ for all $\omega$ (as per Proposition 3). The canonical learning order of Definition 1 concerns the behavior of solutions for $\varepsilon\to 0$ . Questions of this type can be addressed within the theory of dynamical systems using singular perturbation theory [25]. Here, ‘singular’ refers to the fact that $\varepsilon$ multiplies one of the highest-order derivatives. In Eq. (36), $\varepsilon$ multiplies the differential term $\partial_{t}a(\omega)$ , so that the ODE system becomes singular in the limit $\varepsilon\to 0$ . In particular, it degenerates to the following system of differential-algebraic equations:

\begin{split}V(s(\omega))=\,&\int a(\nu)U(s(\omega)s(\nu))\mathrm{d}\rho(\nu)% \,,\\ \partial_{t}s(\omega)=\,&a(\omega)\left(1-s(\omega)^{2}\right)\left(V^{\prime}% (s(\omega))-\int a(\nu)U^{\prime}(s(\omega)s(\nu))s(\nu)\mathrm{d}\rho(\nu)% \right)\,.\end{split}

(41)

Due to singularity, the qualitative behavior of the above system is dramatically different from that of Eq. (36) with $\varepsilon$ small but non-zero. This is in stark contrast to regular perturbation problems, for which the limiting dynamics will still be a system of differential equations with the same order and similar qualitative behavior as the perturbed system.

As a side remark, we note that the system (36) can be seen as a slow-fast dynamical system, where the $a(\omega)$ ’s are the fast variables and the $s(\omega)$ ’s are the slow variables [10]. Formally, the time derivative of the $a(\omega)$ ’s is multiplied by a factor $(1/\varepsilon)$ . From a dynamical systems perspective, the present case is made complicated because of a bifurcation when the $s(\omega)$ ’s become non-zero.

The canonical learning order provides a detailed description of this bifurcation. We will motivate this scenario using a classical, but non-rigorous, technique of singular perturbation theory, called the matched asymptotic expansion [25, Chapter 2]. This technique decomposes the approximation of the solution in several time scales on which a regular approximation holds. These time scales are traditionally called layers in the literature; however, we avoid this terminology due to the potential confusion with the layers of the neural network.

We will work mainly using the Hermite representation of the dynamical ODEs (36), which we write down for the reader’s convenience:

\begin{split}\varepsilon\partial_{t}a(\omega)&=\,\sum_{k=0}^{\infty}\sigma_{k}% s(\omega)^{k}\left(\varphi_{k}-\sigma_{k}\int a(\nu)s(\nu)^{k}\mathrm{d}\rho(% \nu)\right)\,,\\ \partial_{t}s(\omega)&=\,a(\omega)\left(1-s(\omega)^{2}\right)\sum_{k=1}^{% \infty}k\sigma_{k}s(\omega)^{k-1}\left(\varphi_{k}-\sigma_{k}\int a(\nu)s(\nu)% ^{k}\mathrm{d}\rho(\nu)\right)\,.\end{split}

(42)

The rest of this section is organized as follows. We first give a brief overview of the method of matched asymptotic expansions and a summary of our main results regarding the learning timescales in Section 6.1. Sections 6.2-6.4 respectively describe the first three time scales of the matched asymptotic expansion of (42). This gives, for each time scale, an approximation of the $a(\omega)$ , $s(\omega)$ . In Appendix C.2, we detail how these sections induce an evolution of the risk alternating plateaus and rapid decreases, and support the canonical learning order of Definition 1. Finally, in Section 6.5, we conjecture the behavior on longer time scales.

Notations.

We denote $\mathds{1}$ the constant function $\mathds{1}:\omega\in\Omega\mapsto 1\in\mathbb{R}$ . Denote $\langle.,.\rangle_{L^{2}(\rho)}$ the dot product on $L^{2}(\rho)$ and $\|.\|_{L^{2}(\rho)}$ the associated norm. For $x\in L^{2}(\rho)$ , we denote $x_{\perp}$ the orthogonal projection of $x$ on the hyperplane $\mathds{1}^{\perp}$ of $L^{2}(\rho)$ of functions orthogonal to $\mathds{1}$ :

x_{\perp}(\omega)=x(\omega)-\int x(\nu)\mathrm{d}\rho(\nu)\,.

We denote $a_{\text{init}}(\omega)=a(\omega,0)$ and thus $a_{\perp,\text{\rm{init}}}$ is the orthogonal projection of $a_{\text{init}}$ on $\mathds{1}^{\perp}$ .

6.1 Matched asymptotic expansions

The method of matched asymptotic expansions is a common approach to finding approximate solutions of perturbed differential equations. In the present paper, we are mainly interested in applying this technique to approximate the solution to the specific singularly perturbed ODE system of Eq. (42). (Although we keep calling this an ODE system, it is important to keep in mind that it takes place in an infinite-dimensional space.) Denoting by $t$ the independent variable and by $\varepsilon$ the perturbation parameter, the method of matched asymptotic expansions consists of the following three steps: (1) Divide the domain of $t$ (generally a subinterval of $\mathbb{R}$ ) into several subdomains, which may overlap each other and depend on the perturbation parameter $\varepsilon$ ; (2) Within each subdomain, find an accurate approximation to the perturbed system. This is usually achieved by expanding the perturbed system in powers of $\varepsilon$ , and keeping only terms that are relevant to the current domain; (3) The approximate solutions obtained in Step (2) might not be valid in the overlap of two adjacent subdomains. To resolve this issue, these approximate solutions are then combined together through a process called “matching” to produce an approximation that is valid on the entire domain.

In our setting, the singularly perturbed system (42) takes the form of

	$\displaystyle\varepsilon\partial_{t}a(\omega)=\,$	$\displaystyle f(a(\omega),s(\omega)),$
	$\displaystyle\partial_{t}s(\omega)=\,$	$\displaystyle g(a(\omega),s(\omega)).$

We will carry out explicit calculations for the first three time scales in Sections 6.2-6.4, respectively. Here is a summary of our main findings:

•

In Section 6.2 we explore the learning of the constant component of the target function, which happens at the timescale $t=\Theta(\varepsilon)$ . At the end of this phase, the mean-field risk (see (37)) evolves to

\mathscrsfs{R}_{\mbox{\tiny\rm mf},*}=\frac{1}{2}\sum_{k\geqslant 1}\varphi_{k% }^{2}+O(\varepsilon)\,.

(43)

In other words, during this phase, gradient flow learns the constant term in $\varphi$ . At the end of this time scale we have $a(\omega)=\Theta(1)$ and $s(\omega)=\Theta(\varepsilon)$ .

•

Then, in Section 6.3 we investigate the second time scale $t=t_{2}\varepsilon^{1/2}$ , $t_{2}\leq c\log(1/\varepsilon)$ , during which the $a(\omega)$ ’s and $s(\omega)$ ’s increase to a different order in $\varepsilon$ . The result of this time scale is mainly technical and needed to understand the transition to the time scale of Section 6.4. We also perform the matching procedure to combine the approximate solution within this time scale with the one obtained in Section 6.2. At the end of this time scale we have $a(\omega)=\Theta(e^{c^{\prime}t_{2}})$ and $s(\omega)=\Theta(e^{c^{\prime}t_{2}}\varepsilon^{1/2})$ .

•

To understand the evolution of the risk relevant to learning the linear component of $\varphi$ , we introduce a new time scale $t=\frac{1}{4|\varphi_{1}\sigma_{1}|}\varepsilon^{1/2}\log\frac{1}{\varepsilon}% +\Theta(\varepsilon^{1/2})$ in Section 6.4, and show that the linear component can be learned within this time scale. To be more accurate, at the end of this time scale we have

\mathscrsfs{R}_{\mbox{\tiny\rm mf},*}=\frac{1}{2}\sum_{k\geqslant 2}\varphi_{k% }^{2}+O(\varepsilon^{\nicefrac{{1}}{{4}}})\,,

(44)

and $a(\omega)=\Theta(\varepsilon^{-1/4})$ and $s(\omega)=\Theta(\varepsilon^{1/4})$ .

Finally, in Section 6.5, we conjecture the behavior of the approximate solutions and induced risks for longer time scales.
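To get a feeling for the separation between the three time scales summarized above, one can evaluate them for a given $\varepsilon$. The helper below is a hypothetical illustration: it returns only the leading-order expressions $\varepsilon$, $\varepsilon^{1/2}$ and $\frac{1}{4|\varphi_{1}\sigma_{1}|}\varepsilon^{1/2}\log\frac{1}{\varepsilon}$, ignoring the unspecified constants hidden in the $\Theta(\cdot)$ terms.

```python
import math

def learning_time_scales(eps, phi1, sigma1):
    """Leading-order time scales of Sections 6.2-6.4 (constants in Theta(.) dropped)."""
    t_constant = eps                   # constant component: t = Theta(eps)
    t_linear_onset = math.sqrt(eps)    # second time scale: t = t_2 * eps^{1/2}
    t_linear = math.sqrt(eps) * math.log(1.0 / eps) / (4.0 * abs(phi1 * sigma1))
    return t_constant, t_linear_onset, t_linear

# Example with phi_1 = -1 (as in Eq. (39)) and an assumed sigma_1 = 1/2.
t_const, t_onset, t_lin = learning_time_scales(1e-4, phi1=-1.0, sigma1=0.5)
```

For $\varepsilon=10^{-4}$ the three values are ordered as $t_{\rm const}<t_{\rm onset}<t_{\rm lin}$, with the last two differing only by a logarithmic factor.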

6.2 First time scale: constant component

We define a “fast” time variable $t_{1}=t/\varepsilon$ and replace it in Eq. (42). We expand the solutions $a(\omega)$ and $s(\omega)$ in powers of $\varepsilon$ :

	$\displaystyle a(\omega)$	$\displaystyle=a^{(0)}(\omega)+\varepsilon a^{(1)}(\omega)+\varepsilon^{2}a^{(2% )}(\omega)+\dots\,,$		(45)
	$\displaystyle s(\omega)$	$\displaystyle=s^{(0)}(\omega)+\varepsilon s^{(1)}(\omega)+\varepsilon^{2}s^{(2% )}(\omega)+\dots\,,$		(46)

where $a^{(0)}(\omega),a^{(1)}(\omega),a^{(2)}(\omega),\dots,s^{(0)}(\omega),s^{(1)}(% \omega),s^{(2)}(\omega),\dots$ are implicitly functions of $t_{1}$ . They are initialized at

	$\displaystyle a^{(0)}(\omega,t_{1}=0)=a_{\rm{init}}(\omega)\,,$	$\displaystyle a^{(1)}(\omega,t_{1}=0)=0\,,$	$\displaystyle a^{(2)}(\omega,t_{1}=0)=0\,,$	$\displaystyle\dots$		(47)
	$\displaystyle s^{(0)}(\omega,t_{1}=0)=0\,,$	$\displaystyle s^{(1)}(\omega,t_{1}=0)=0\,,$	$\displaystyle s^{(2)}(\omega,t_{1}=0)=0\,,$	$\displaystyle\dots$		(48)

to be consistent with the initial condition $a(\omega,t_{1}=0)=a(\omega,t=0)=a_{\rm{init}}(\omega)$ and $s(\omega,t_{1}=0)=s(\omega,t=0)=0$ .

We substitute the expansion in (42):

	$\displaystyle\partial_{t_{1}}a^{(0)}(\omega)+\varepsilon\partial_{t_{1}}a^{(1)% }(\omega)+\dots$		(49)
	$\displaystyle\quad=\sum_{k=0}^{\infty}\sigma_{k}\left(s^{(0)}(\omega)+% \varepsilon s^{(1)}(\omega)+\dots\right)^{k}$		(50)
	$\displaystyle\quad\qquad\times\left(\varphi_{k}-{\sigma_{k}}\int\left(a^{(0)}(% \nu)+\varepsilon a^{(1)}(\nu)+\dots\right)\left(s^{(0)}(\nu)+\varepsilon s^{(1% )}(\nu)+\dots\right)^{k}\mathrm{d}\rho(\nu)\right)\,,$		(51)
	$\displaystyle\partial_{t_{1}}s^{(0)}(\omega)+\varepsilon\partial_{t_{1}}s^{(1)% }(\omega)+\dots$		(52)
	$\displaystyle\quad=\varepsilon\left(a^{(0)}(\omega)+\varepsilon a^{(1)}(\omega% )+\dots\right)\left(1-\left(s^{(0)}(\omega)+\varepsilon s^{(1)}(\omega)+\dots% \right)^{2}\right)$		(53)
	$\displaystyle\quad\qquad\times\sum_{k=1}^{\infty}k\sigma_{k}\left(s^{(0)}(% \omega)+\varepsilon s^{(1)}(\omega)+\dots\right)^{k-1}$		(54)
	$\displaystyle\quad\qquad\times\left(\varphi_{k}-{\sigma_{k}}\int\left(a^{(0)}(% \nu)+\varepsilon a^{(1)}(\nu)+\dots\right)\left(s^{(0)}(\nu)+\varepsilon s^{(1% )}(\nu)+\dots\right)^{k}\mathrm{d}\rho(\nu)\right)\,.$		(55)

The basic assumption of matched asymptotic expansions is that terms of the same order in $\varepsilon$ can be identified (with some limitations that we develop below). For now, let us identify terms of order $1=\varepsilon^{0}$ :

	$\displaystyle\partial_{t_{1}}a^{(0)}(\omega)$	$\displaystyle=\sum_{k=0}^{\infty}\sigma_{k}\left(s^{(0)}(\omega)\right)^{k}% \left(\varphi_{k}-\sigma_{k}\int a^{(0)}(\nu)\left(s^{(0)}(\nu)\right)^{k}% \mathrm{d}\rho(\nu)\right)\,,$		(56)
	$\displaystyle\partial_{t_{1}}s^{(0)}(\omega)$	$\displaystyle=0\,.$		(57)

From (57) and (48), we have $s^{(0)}(\omega)=0$ : time $t_{1}=O(1)\Leftrightarrow t=O(\varepsilon)$ is too short for the $s(\omega)$ to be of order $1$ .

Substituting $s^{(0)}(\omega)=0$ in (56), we obtain

\partial_{t_{1}}a^{(0)}(\omega)=\sigma_{0}\left(\varphi_{0}-{\sigma_{0}}\int a% ^{(0)}(\nu)\mathrm{d}\rho(\nu)\right)\,.

(58)

Recall that $\langle.,.\rangle_{L^{2}(\rho)}$ is the dot product on $L^{2}(\rho)$ , $\mathds{1}$ denotes the constant function $\mathds{1}:\omega\in\Omega\mapsto 1\in\mathbb{R}$ and $a_{\perp}$ is the orthogonal projection of $a$ on $\mathds{1}^{\perp}$ . Equation (58) can be rewritten as

	$\displaystyle\partial_{t_{1}}\langle a^{(0)},\mathds{1}\rangle_{L^{2}(\rho)}$	$\displaystyle=\sigma_{0}\left(\varphi_{0}-{\sigma_{0}}\langle a^{(0)},\mathds{% 1}\rangle_{L^{2}(\rho)}\right)\,,$
	$\displaystyle\partial_{t_{1}}a_{\perp}^{(0)}$	$\displaystyle=0\,,$

which gives after integration (using (47)):

	$\displaystyle\langle a^{(0)},\mathds{1}\rangle_{L^{2}(\rho)}$	$\displaystyle=e^{-\sigma_{0}^{2}t_{1}}\langle a_{\rm{init}},\mathds{1}\rangle_% {L^{2}(\rho)}+\left(1-e^{-\sigma_{0}^{2}t_{1}}\right)\frac{\varphi_{0}}{\sigma% _{0}}\,,$		(59)
	$\displaystyle a_{\perp}^{(0)}$	$\displaystyle=a_{\perp,\rm{init}}\,.$

At this point, we have determined $a^{(0)}(\omega)$ and $s^{(0)}(\omega)$ , and thus $a(\omega)=a^{(0)}(\omega)+O(\varepsilon)$ and $s(\omega)=s^{(0)}(\omega)+O(\varepsilon)$ up to an $O(\varepsilon)$ precision, which is sufficient to obtain an $o(1)$ -approximation of the risk $\mathscrsfs{R}_{\mbox{\tiny\rm mf},*}$ (see Appendix C.2). However, note that we could obtain more precise estimates by identifying higher-order terms in (49)-(55). For instance, identifying the $O(\varepsilon)$ terms in (52)-(55), we obtain $\partial_{t_{1}}s^{(1)}(\omega)=a^{(0)}(\omega)\sigma_{1}\varphi_{1}$ . This shows that the $s(\omega)$ become non-zero, though only of order $\varepsilon$ on the time scale $t_{1}\asymp 1$ ; the inner-layer weights develop an infinitesimal correlation with the true direction $u_{*}$ thanks to the linear component of $\sigma$ and $\varphi$ .

The approximation constructed above should be considered as valid on the time scale $t_{1}\asymp 1\Leftrightarrow t\asymp\varepsilon$ . As $\varepsilon\to 0$ , we obtain the following approximation of the risk (see Eq. (37) for definition, and Appendix C.2 for a detailed derivation):

\displaystyle\mathscrsfs{R}_{\mbox{\tiny\rm mf},*}=\frac{1}{2}e^{-2\sigma_{0}^% {2}t_{1}}\left(\varphi_{0}-{\sigma_{0}}\left\langle a_{\rm{init}},\mathds{1}% \right\rangle_{L^{2}(\rho)}\right)^{2}+\frac{1}{2}\sum_{k\geqslant 1}\varphi_{% k}^{2}+O(\varepsilon)\,.

This approximation breaks down when we reach a new time scale, at which the $s(\omega)$ are large enough for the $a(\omega)$ to be affected (at leading order) by the linear part of the functions. We detail the new time scale and its resolution in the next section.
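The first-time-scale formulas above can be cross-checked numerically. The following sketch integrates the averaged version of Eq. (58) with forward Euler and compares the result with the closed form of Eq. (59); it also evaluates the leading-order risk, dropping the $O(\varepsilon)$ remainder. The numerical values of $\varphi_{k}$, $\sigma_{0}$ and of the initial mean of $a$ are assumptions chosen for illustration, as are the function names.

```python
import numpy as np

def mean_a_leading_order(t1, mean_a_init, phi0, sigma0):
    """Closed form of Eq. (59) for <a^(0), 1> as a function of t_1."""
    decay = np.exp(-sigma0 ** 2 * t1)
    return decay * mean_a_init + (1.0 - decay) * phi0 / sigma0

def risk_first_time_scale(t1, mean_a_init, phi, sigma0):
    """Leading-order risk on the first time scale (O(eps) remainder dropped)."""
    phi = np.asarray(phi, dtype=float)
    transient = 0.5 * np.exp(-2.0 * sigma0 ** 2 * t1) * (phi[0] - sigma0 * mean_a_init) ** 2
    return transient + 0.5 * np.sum(phi[1:] ** 2)

# Illustrative values (assumptions): phi_k for k = 0, 1, 2, sigma_0, and the initial mean of a.
phi, sigma0, mean_a_init = [1.0, -1.0, -2.0 / 3.0], 0.4, 0.3

# Forward Euler on d/dt_1 <a, 1> = sigma_0 * (phi_0 - sigma_0 * <a, 1>).
dt, n_steps = 1e-3, 5000
x = mean_a_init
for _ in range(n_steps):
    x += dt * sigma0 * (phi[0] - sigma0 * x)

closed_form = mean_a_leading_order(dt * n_steps, mean_a_init, phi[0], sigma0)
plateau = risk_first_time_scale(np.inf, mean_a_init, phi, sigma0)
```

As $t_{1}\to\infty$ the transient term vanishes and the risk settles on the plateau $\frac{1}{2}\sum_{k\geqslant 1}\varphi_{k}^{2}$, in agreement with Eq. (43).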

6.3 Second time scale: linear component I

In this section, we seek a second, slower time scale, for which the behavior of the asymptotic expansion is different.

Identification of the scale.

Consider $t_{2}=\frac{t}{\varepsilon^{\gamma}}$ , where $\gamma<1$ is to be determined. We rewrite the system (42) using $t_{2}$ , and expand the solutions $a(\omega)$ and $s(\omega)$ :

	$\displaystyle a(\omega)$	$\displaystyle=a^{(0)}(\omega)+\varepsilon^{\delta}a^{(1)}(\omega)+\varepsilon^% {2\delta}a^{(2)}(\omega)+\dots\,,$		(60)
	$\displaystyle s(\omega)$	$\displaystyle=\varepsilon^{\delta}s^{(1)}(\omega)+\varepsilon^{2\delta}s^{(2)}% (\omega)+\dots\,,$		(61)

where the exponent $\delta$ is also to be determined. (Since within the previous time scale we obtained $s(\omega)=O(\varepsilon)$ , it is natural to assume $s^{(0)}(\omega)=0$ .)

Let us pause to comment on our method.

Similarly to what has been done in the previous time scale, we will substitute the expansions (60)-(61) in the equations (42) in order to compute the different terms in the expansion. However, this step also allows us to compute the exponents $\gamma$ and $\delta$ , that give respectively the new time scale and the size of the $s(\omega)$ ’s.

Note that we should have proceeded similarly for the first time scale, by introducing a first time variable $t_{1}=\frac{t}{\varepsilon^{\gamma^{\prime}}}$ , expanding $a(\omega),s(\omega)$ in powers $1,\varepsilon^{\delta^{\prime}},\varepsilon^{2\delta^{\prime}},\dots$ , and determining $\gamma^{\prime}$ and $\delta^{\prime}$ a posteriori. This would have led, indeed, to $\gamma^{\prime}=1$ and $\delta^{\prime}=1$ . However, for simplicity, we preferred to fix these values that are natural a priori.

Finally, note that the expansions (45)-(46) and (60)-(61) are different, because they are valid on different time scales. In fact, the only coherence condition that we require below is that the expansions match in a joint asymptotic where $t_{1}=\frac{t}{\varepsilon}\to\infty$ and $t_{2}=\frac{t}{\varepsilon^{\gamma}}\to 0$ . We thus build different approximations for each one of the time scales, with some matching conditions; this justifies the name of matched asymptotic expansion.

We now return to our computations and substitute (60)-(61) in (42):

	$\displaystyle\varepsilon^{1-\gamma}\partial_{t_{2}}a^{(0)}(\omega)+\dots$	$\displaystyle=\sum_{k=0}^{\infty}\sigma_{k}\left(\varepsilon^{\delta}s^{(1)}(% \omega)+\dots\right)^{k}$
		$\displaystyle\qquad\qquad\times\left(\varphi_{k}-{\sigma_{k}}\int\left(a^{(0)}% (\nu)+\dots\right)\left(\varepsilon^{\delta}s^{(1)}(\nu)+\dots\right)^{k}% \mathrm{d}\rho(\nu)\right)\,,$
	$\displaystyle\varepsilon^{\delta}\partial_{t_{2}}s^{(1)}(\omega)+\dots$	$\displaystyle=\varepsilon^{\gamma}\left(a^{(0)}(\omega)+\dots\right)\left(1-% \left(\varepsilon^{\delta}s^{(1)}(\omega)+\dots\right)^{2}\right)\sum_{k=1}^{% \infty}k\sigma_{k}\left(\varepsilon^{\delta}s^{(1)}(\omega)+\dots\right)^{k-1}$
		$\displaystyle\qquad\qquad\times\left(\varphi_{k}-{\sigma_{k}}\int\left(a^{(0)}% (\nu)+\dots\right)\left(\varepsilon^{\delta}s^{(1)}(\nu)+\dots\right)^{k}% \mathrm{d}\rho(\nu)\right)\,,$

and thus

$\displaystyle\varepsilon^{1-\gamma}\partial_{t_{2}}a^{(0)}(\omega)+O(% \varepsilon^{1-\gamma+\delta})$	$\displaystyle=\sigma_{0}\left(\varphi_{0}-\sigma_{0}\int a^{(0)}(\nu)\mathrm{d% }\rho(\nu)\right)$	(62)
	$\displaystyle\qquad-\varepsilon^{\delta}{\sigma_{0}^{2}}\int a^{(1)}(\nu)% \mathrm{d}\rho(\nu)+\varepsilon^{\delta}\sigma_{1}\varphi_{1}s^{(1)}(\omega)+O% (\varepsilon^{2\delta})\,,$	(63)
$\displaystyle\varepsilon^{\delta}\partial_{t_{2}}s^{(1)}(\omega)+O(\varepsilon% ^{2\delta})$	$\displaystyle=\varepsilon^{\gamma}\sigma_{1}\varphi_{1}a^{(0)}(\omega)+O(% \varepsilon^{\gamma+\delta})\,.$	(64)

For the first time scale, we chose $\gamma=\delta=1$ , so that the terms of order $\varepsilon^{\delta}$ were negligible compared to $\varepsilon^{1-\gamma}\partial_{t_{2}}a^{(0)}(\omega)$ in (62). This means that the linear components $\sigma_{1},\varphi_{1}$ of the functions had no effect on the $a(\omega)$ at leading order. We are now interested in a new time scale where $\varepsilon^{1-\gamma}\partial_{t_{2}}a^{(0)}(\omega)$ and $\varepsilon^{\delta}\sigma_{1}\varphi_{1}s^{(1)}(\omega)$ are of the same order, i.e., $1-\gamma=\delta$ ; then the linear components play a role in the dynamics.

Further, for $s^{(1)}(\omega)$ to be non-zero, we need both sides of (64) to be of the same order, thus $\delta=\gamma$ . Together with $1-\gamma=\delta$ , this gives $\gamma=\delta=1/2$ .

Derivation of the ODEs for this time scale.

Let us summarize the equations. For $t_{2}=\frac{t}{\varepsilon^{\nicefrac{{1}}{{2}}}}$ and

	$\displaystyle a(\omega)$	$\displaystyle=a^{(0)}(\omega)+\varepsilon^{\nicefrac{{1}}{{2}}}a^{(1)}(\omega)% +\dots\,,$
	$\displaystyle s(\omega)$	$\displaystyle=\varepsilon^{\nicefrac{{1}}{{2}}}s^{(1)}(\omega)+\dots\,,$

we have from (62)-(64):

$\displaystyle\varepsilon^{\nicefrac{{1}}{{2}}}\partial_{t_{2}}a^{(0)}(\omega)$	$\displaystyle=\sigma_{0}\left(\varphi_{0}-{\sigma_{0}}\int a^{(0)}(\nu)\mathrm% {d}\rho(\nu)\right)$	(65)
	$\displaystyle\qquad-\varepsilon^{\nicefrac{{1}}{{2}}}{\sigma_{0}^{2}}\int a^{(% 1)}(\nu)\mathrm{d}\rho(\nu)+\varepsilon^{\nicefrac{{1}}{{2}}}\sigma_{1}\varphi% _{1}s^{(1)}(\omega)+O(\varepsilon)\,,$	(66)
$\displaystyle\varepsilon^{\nicefrac{{1}}{{2}}}\partial_{t_{2}}s^{(1)}(\omega)$	$\displaystyle=\varepsilon^{\nicefrac{{1}}{{2}}}\sigma_{1}\varphi_{1}a^{(0)}(% \omega)+O(\varepsilon)\,.$	(67)

First, we identify the terms of order $1=\varepsilon^{0}$ :

0=\sigma_{0}\left(\varphi_{0}-{\sigma_{0}}\int a^{(0)}(\nu)\mathrm{d}\rho(\nu)% \right)\,.

(68)

This means that the trajectory remains in the affine hyperplane defined by $\varphi_{0}={\sigma_{0}}\int a^{(0)}(\nu)\mathrm{d}\rho(\nu)$ . Intuitively, the constant component of $\varphi$ remains fitted by the neural network in this second time scale.

Second, we identify the terms of order $\varepsilon^{\nicefrac{{1}}{{2}}}$ in (65)-(67):

	$\displaystyle\partial_{t_{2}}a^{(0)}(\omega)$	$\displaystyle=-{\sigma_{0}^{2}}\int a^{(1)}(\nu)\mathrm{d}\rho(\nu)+\sigma_{1}% \varphi_{1}s^{(1)}(\omega)\,,$		(69)
	$\displaystyle\partial_{t_{2}}s^{(1)}(\omega)$	$\displaystyle=\sigma_{1}\varphi_{1}a^{(0)}(\omega)\,.$		(70)

Note that, in Eqs. (69)–(70), the time derivative of $a^{(1)}$ does not appear, and therefore the evolution of $a^{(1)}$ is not determined by these equations. In fact, $a^{(1)}$ is best interpreted as the Lagrange multiplier associated to the constraint (68). Namely, this is a free term that can be adjusted so that the solution of the system (69)–(70) satisfies the constraint (68). We can check that the unknown term in (69) leaves exactly the degree of freedom needed for this: we have

0=\partial_{t_{2}}\left(\frac{\varphi_{0}}{\sigma_{0}}\right)\underset{(68)}{=}\partial_{t_{2}}\left(\int a^{(0)}(\omega)\mathrm{d}\rho(\omega)\right)\underset{(69)}{=}-{\sigma_{0}^{2}}\int a^{(1)}(\nu)\mathrm{d}\rho(\nu)+\sigma_{1}\varphi_{1}\int s^{(1)}(\omega)\mathrm{d}\rho(\omega)\,.

In this last expression, the first unknown term can always compensate the second term so that the constraint is satisfied. The entire evolution of $a^{(1)}$ is determined by higher orders in the expansion.

To eliminate this Lagrange multiplier, we use again the compact notations:

	$\displaystyle\partial_{t_{2}}a^{(0)}$	$\displaystyle=-\sigma_{0}^{2}\langle a^{(1)},\mathds{1}\rangle_{L^{2}(\rho)}% \mathds{1}+\sigma_{1}\varphi_{1}s^{(1)}\,,$		(71)
	$\displaystyle\partial_{t_{2}}s^{(1)}$	$\displaystyle=\sigma_{1}\varphi_{1}a^{(0)}\,,$		(72)

and thus

	$\displaystyle\partial_{t_{2}}a^{(0)}_{\perp}$	$\displaystyle=\sigma_{1}\varphi_{1}s^{(1)}_{\perp}\,,$		(73)
	$\displaystyle\partial_{t_{2}}s^{(1)}_{\perp}$	$\displaystyle=\sigma_{1}\varphi_{1}a^{(0)}_{\perp}\,.$		(74)

Matching.

The initialization of the ODEs (71)-(72) for the second time scale is determined by a classical procedure that matches with the previous time scale. In this paragraph, we denote by $\underline{a},\underline{s}$ the approximation obtained in the first time scale (Section 6.2), and by $\overline{a},\overline{s}$ the approximation in the second time scale, described above.

Consider an intermediate time scale $\widetilde{t}=\frac{t}{\varepsilon^{\alpha}}$ , $\nicefrac{{1}}{{2}}<\alpha<1$ , and assume $\widetilde{t}\asymp 1$ so that

\displaystyle t_{1}=\frac{t}{\varepsilon}=\frac{\widetilde{t}}{\varepsilon^{1-% \alpha}}\to\infty\,,

\displaystyle t_{2}=\frac{t}{\varepsilon^{\nicefrac{{1}}{{2}}}}={\varepsilon^{% \alpha-\nicefrac{{1}}{{2}}}}{\widetilde{t}}\to 0\,.

In this intermediate regime, we want the approximations provided on the first and the second time scales to match: $\underline{a}(\widetilde{t})$ and $\overline{a}(\widetilde{t})$ (resp. $\underline{s}(\widetilde{t})$ and $\overline{s}(\widetilde{t})$ ) should match to leading order.

From the first time scale approximation,

$\displaystyle\underline{a}$	$\displaystyle=\underline{a}^{(0)}+O(\varepsilon)$	(75)
	$\displaystyle=\langle\underline{a}^{(0)},\mathds{1}\rangle_{L^{2}(\rho)}% \mathds{1}+\underline{a}^{(0)}_{\perp}+O(\varepsilon)$	(76)
	$\displaystyle=\left[e^{-\sigma_{0}^{2}t_{1}}\langle a_{\rm{init}},\mathds{1}% \rangle_{L^{2}(\rho)}+\left(1-e^{-\sigma_{0}^{2}t_{1}}\right)\frac{\varphi_{0}% }{\sigma_{0}}\right]\mathds{1}+a_{\perp,\rm{init}}+O(\varepsilon)$	(77)
	$\displaystyle=\left[e^{-\sigma_{0}^{2}\nicefrac{{\widetilde{t}}}{{\varepsilon^% {1-\alpha}}}}\langle a_{\rm{init}},\mathds{1}\rangle_{L^{2}(\rho)}+\left(1-e^{% -\sigma_{0}^{2}\nicefrac{{\widetilde{t}}}{{\varepsilon^{1-\alpha}}}}\right)% \frac{\varphi_{0}}{\sigma_{0}}\right]\mathds{1}+a_{\perp,\rm{init}}+O(\varepsilon)$	(78)
	$\displaystyle=\frac{\varphi_{0}}{\sigma_{0}}\mathds{1}+a_{\perp,\rm{init}}+o(1% )\,.$	(79)

From the second time scale approximation,

	$\displaystyle\overline{a}$	$\displaystyle=\overline{a}^{(0)}(t_{2})+O(\varepsilon^{\nicefrac{{1}}{{2}}})=% \overline{a}^{(0)}({\varepsilon^{\alpha-\nicefrac{{1}}{{2}}}}{\widetilde{t}})+% O(\varepsilon^{\nicefrac{{1}}{{2}}})$		(80)
		$\displaystyle=\overline{a}^{(0)}(0)+o(1)\,.$		(81)

By matching, Equations (79) and (81) should be coherent. Thus the ODE for the second time scale should be initialized from $\overline{a}^{(0)}(0)=\frac{\varphi_{0}}{\sigma_{0}}\mathds{1}+a_{\perp,\rm{% init}}$ .

Similarly, the matching procedure gives that the ODE for the second time scale should be initialized from $\overline{s}^{(1)}(0)=0$ .
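The matching regime can be checked numerically. The following Python sketch evaluates the coefficient of $\mathds{1}$ in Eq. (77) at the intermediate time $t=\varepsilon^{\alpha}\widetilde{t}$ for decreasing $\varepsilon$. The numerical values of `sigma0`, `phi0`, `a_init_mean`, `alpha` and `t_tilde` are illustrative assumptions and do not come from the paper.

```python
import math

def first_scale_mean(t1, sigma0, phi0, a_init_mean):
    """Coefficient of the constant function in Eq. (77) at first-scale time t1."""
    decay = math.exp(-sigma0**2 * t1)
    return decay * a_init_mean + (1.0 - decay) * phi0 / sigma0

def intermediate_times(eps, alpha, t_tilde=1.0):
    """Return (t1, t2) for t = eps**alpha * t_tilde, with 1/2 < alpha < 1."""
    t = eps**alpha * t_tilde
    return t / eps, t / math.sqrt(eps)

# Illustrative (assumed) values, not taken from the paper.
sigma0, phi0, a_init_mean, alpha = 1.0, 0.7, -0.4, 0.75

for eps in (1e-2, 1e-4, 1e-8):
    t1, t2 = intermediate_times(eps, alpha)
    mean = first_scale_mean(t1, sigma0, phi0, a_init_mean)
    # t1 grows, t2 shrinks, and the mean approaches phi0 / sigma0.
    print(f"eps={eps:.0e}  t1={t1:.3g}  t2={t2:.3g}  mean={mean:.6f}")
```

As $\varepsilon$ decreases, `t1` grows and `t2` shrinks, while the printed mean approaches $\varphi_{0}/\sigma_{0}$, which is the value used to initialize the second time scale.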

Solution.

As we are done with the matching procedure, we now consider the solution in the second time scale only, which we denote again by $a$ , $s$ as in (71), (72). The matching procedure motivates us to consider the solution of (73)-(74) initialized at $a_{\perp}^{(0)}(0)=a_{\perp,\rm{init}}$ , $s_{\perp}^{(1)}(0)=0$ . This gives

\displaystyle a_{\perp}^{(0)}=\cosh\left(\varphi_{1}\sigma_{1}t_{2}\right)a_{% \perp,\rm{init}}\,,

\displaystyle s_{\perp}^{(1)}=\sinh\left(\varphi_{1}\sigma_{1}t_{2}\right)a_{% \perp,\rm{init}}\,.

To conclude, we note that $\langle a^{(0)},\mathds{1}\rangle_{L^{2}(\rho)}=\frac{\varphi_{0}}{\sigma_{0}}$ is constrained by (68). Further, from (70),

\partial_{t_{2}}\langle s^{(1)},\mathds{1}\rangle_{L^{2}(\rho)}=\sigma_{1}% \varphi_{1}\langle a^{(0)},\mathds{1}\rangle_{L^{2}(\rho)}=\sigma_{1}\varphi_{% 1}\frac{\varphi_{0}}{\sigma_{0}},

thus $\langle s^{(1)},\mathds{1}\rangle_{L^{2}(\rho)}=\sigma_{1}\varphi_{1}\frac{% \varphi_{0}}{\sigma_{0}}t_{2}$ .

Putting together, these equations give:

\displaystyle a^{(0)}=\frac{\varphi_{0}}{\sigma_{0}}\mathds{1}+\cosh\left(% \varphi_{1}\sigma_{1}t_{2}\right)a_{\perp,\rm{init}}\,,

\displaystyle s^{(1)}=\sigma_{1}\varphi_{1}\frac{\varphi_{0}}{\sigma_{0}}t_{2}% \mathds{1}+\sinh\left(\varphi_{1}\sigma_{1}t_{2}\right)a_{\perp,\rm{init}}\,.

(82)
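As a sanity check of the closed-form solution, the following Python sketch integrates the linear system (73)-(74) at a single point $\omega$ with a classical fourth-order Runge-Kutta scheme and compares the result with the $\cosh$ and $\sinh$ expressions. The constant `c` stands for $\sigma_{1}\varphi_{1}$; its value, the initial value `a0` and the horizon `t_end` are assumptions chosen for illustration.

```python
import math

def integrate_linear_phase(c, a0, t_end, n_steps=2000):
    """RK4 integration of a' = c*s, s' = c*a with a(0) = a0, s(0) = 0.

    c plays the role of sigma_1 * phi_1, and (a, s) the role of
    (a_perp^(0), s_perp^(1)) evaluated at one point omega.
    """
    def rhs(a, s):
        return c * s, c * a

    a, s = a0, 0.0
    h = t_end / n_steps
    for _ in range(n_steps):
        k1a, k1s = rhs(a, s)
        k2a, k2s = rhs(a + 0.5 * h * k1a, s + 0.5 * h * k1s)
        k3a, k3s = rhs(a + 0.5 * h * k2a, s + 0.5 * h * k2s)
        k4a, k4s = rhs(a + h * k3a, s + h * k3s)
        a += h * (k1a + 2 * k2a + 2 * k3a + k4a) / 6.0
        s += h * (k1s + 2 * k2s + 2 * k3s + k4s) / 6.0
    return a, s

c, a0, t_end = -0.8, 0.5, 3.0  # assumed illustrative values
a_num, s_num = integrate_linear_phase(c, a0, t_end)
a_exact = math.cosh(c * t_end) * a0
s_exact = math.sinh(c * t_end) * a0
print(abs(a_num - a_exact), abs(s_num - s_exact))
```

Both printed errors should be at the level of the integrator accuracy. A negative `c` flips the sign of $s_{\perp}^{(1)}$ but not its exponential growth, in line with the $\operatorname{sign}(\varphi_{1}\sigma_{1})$ factor appearing in the asymptotics.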

We observe that $a^{(0)}$ and $s^{(1)}$ diverge as $t_{2}\to\infty$ . This implies that our approximation on the second time scale must break down at a certain point. Indeed, we analyzed this time scale under the assumption that both $a^{(0)}$ and $s^{(1)}$ are of order $1$ . However, since $a^{(0)}$ and $s^{(1)}$ diverge exponentially as $t_{2}\to\infty$ , as per Eq. (82), this assumption breaks down when $t_{2}\asymp\log(1/\varepsilon)$ .

More precisely, in (65) (resp. (67)), the $O(\varepsilon)$ term includes a term of the form

\displaystyle-\varepsilon s^{(1)}(\omega)\sigma_{1}^{2}\int a^{(0)}(\nu)s^{(1)% }(\nu)\mathrm{d}\rho(\nu)

\displaystyle\left(\text{resp.~{}}-\varepsilon a^{(0)}(\omega)\sigma_{1}^{2}% \int a^{(0)}(\nu)s^{(1)}(\nu)\mathrm{d}\rho(\nu)\right)\,.

When $a^{(0)}$ and $s^{(1)}$ become of order $\varepsilon^{-\nicefrac{{1}}{{4}}}$ , this term becomes of order $\varepsilon^{\nicefrac{{1}}{{4}}}$ , which is then of the same order as the term $\varepsilon^{\nicefrac{{1}}{{2}}}\sigma_{1}\varphi_{1}s^{(1)}(\omega)$ in (65) (resp. the term $\varepsilon^{\nicefrac{{1}}{{2}}}\sigma_{1}\varphi_{1}a^{(0)}(\omega)$ in (67)). At this point, these terms cannot be neglected anymore. From (82), we have

\displaystyle a^{(0)}\sim\frac{e^{|\varphi_{1}\sigma_{1}|t_{2}}}{2}a_{\perp,% \rm{init}}\,,

\displaystyle s^{(1)}\sim\operatorname{sign}(\varphi_{1}\sigma_{1})\frac{e^{|% \varphi_{1}\sigma_{1}|t_{2}}}{2}a_{\perp,\rm{init}}\,,

\displaystyle t_{2}\to\infty\,.

Therefore, $a^{(0)}$ and $s^{(1)}$ become of order $\varepsilon^{-\nicefrac{{1}}{{4}}}$ at the time $t_{2}\sim\frac{1}{4|\sigma_{1}\varphi_{1}|}\log\frac{1}{\varepsilon}$ , at which the approximation on the second time scale breaks down. We thus introduce a new time scale centered at this critical point.
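The breakdown time can be illustrated with a short computation. The Python sketch below evaluates $t_{2}=\frac{1}{4|\sigma_{1}\varphi_{1}|}\log\frac{1}{\varepsilon}$, the size $\cosh(|\sigma_{1}\varphi_{1}|t_{2})$ reached by the solution at that time, and an order-of-magnitude ratio between the neglected cubic term and the retained linear term. All multiplicative constants (such as $\sigma_{1}$ and $\|a_{\perp,\rm{init}}\|$) are dropped in this ratio, and the values of `sigma1` and `phi1` are assumptions.

```python
import math

def breakdown_time(eps, sigma1, phi1):
    """Time t_2 at which exp(|sigma1*phi1| * t_2) equals eps**(-1/4)."""
    return math.log(1.0 / eps) / (4.0 * abs(sigma1 * phi1))

def term_ratio(eps, sigma1, phi1):
    """Size of the solution at the breakdown time, and the ratio
    (neglected eps * size^3 term) / (retained eps^(1/2) * size term)."""
    t2_star = breakdown_time(eps, sigma1, phi1)
    size = math.cosh(abs(sigma1 * phi1) * t2_star)
    cubic = eps * size**3
    linear = math.sqrt(eps) * size
    return size, cubic / linear

sigma1, phi1 = 0.6, -1.5  # assumed illustrative values
for eps in (1e-4, 1e-8, 1e-12):
    size, ratio = term_ratio(eps, sigma1, phi1)
    print(f"eps={eps:.0e}  size*eps^(1/4)={size * eps**0.25:.4f}  ratio={ratio:.4f}")
```

The rescaled size stays close to $1/2$ and the ratio stays of order one as $\varepsilon$ decreases, which is the sense in which the two terms become comparable at this time.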

6.4 Third time scale: linear component II

We now introduce the time $t_{3}=t_{2}-\frac{1}{4|\varphi_{1}\sigma_{1}|}\log\frac{1}{\varepsilon}$ . As $t_{3}$ is only a translation of $t_{2}$ , the ODEs in terms of $t_{3}$ are the same as the ones in terms of $t_{2}$ . However, in this time scale, $a$ and $\varepsilon^{-\nicefrac{{1}}{{2}}}s$ have diverged. In coherence with the discussion above, we seek expansions of the form

	$\displaystyle a$	$\displaystyle=\varepsilon^{-\nicefrac{{1}}{{4}}}a^{(-1)}+a^{(0)}+\varepsilon^{% \nicefrac{{1}}{{4}}}a^{(1)}+\dots\,,$		(83)
	$\displaystyle s$	$\displaystyle=\varepsilon^{\nicefrac{{1}}{{4}}}s^{(1)}+\varepsilon^{\nicefrac{% {1}}{{2}}}s^{(2)}+\dots\,.$		(84)

Similarly to the second time scale, we substitute (83)-(84) in (42) and obtain

	$\displaystyle\varepsilon^{\nicefrac{{1}}{{4}}}\partial_{t_{3}}a^{(-1)}(\omega)$	$\displaystyle=-\varepsilon^{-\nicefrac{{1}}{{4}}}{\sigma_{0}^{2}}\int a^{(-1)}% (\nu)\mathrm{d}\rho(\nu)+\sigma_{0}\left(\varphi_{0}-{\sigma_{0}}\int a^{(0)}(% \nu)\mathrm{d}\rho(\nu)\right)$
		$\displaystyle\hskip 14.22636pt-\varepsilon^{\nicefrac{{1}}{{4}}}{\sigma_{0}^{2% }}\int a^{(1)}(\nu)\mathrm{d}\rho(\nu)+\varepsilon^{\nicefrac{{1}}{{4}}}\sigma% _{1}\left(\varphi_{1}-{\sigma_{1}}\int a^{(-1)}(\nu)s^{(1)}(\nu)\mathrm{d}\rho% (\nu)\right)s^{(1)}(\omega)+O(\varepsilon^{\nicefrac{{1}}{{2}}})\,,$
	$\displaystyle\varepsilon^{\nicefrac{{1}}{{4}}}\partial_{t_{3}}s^{(1)}(\omega)$	$\displaystyle=\varepsilon^{\nicefrac{{1}}{{4}}}\sigma_{1}\left(\varphi_{1}-{% \sigma_{1}}\int a^{(-1)}(\nu)s^{(1)}(\nu)\mathrm{d}\rho(\nu)\right)a^{(-1)}(% \omega)+O(\varepsilon^{\nicefrac{{1}}{{2}}})\,.$

First, we identify the terms of order $\varepsilon^{-\nicefrac{{1}}{{4}}}$ :

0=-{\sigma_{0}^{2}}\int a^{(-1)}(\nu)\mathrm{d}\rho(\nu)=-{\sigma_{0}^{2}}% \left\langle a^{(-1)},\mathds{1}\right\rangle_{L^{2}(\rho)}\,.

(85)

This means that $a$ has no component diverging in $\varepsilon$ in the direction of $\mathds{1}$ .

Second, we identify the terms of order $1=\varepsilon^{0}$ :

0=\sigma_{0}\left(\varphi_{0}-{\sigma_{0}}\int a^{(0)}(\nu)\mathrm{d}\rho(\nu)% \right)=\sigma_{0}\left(\varphi_{0}-\sigma_{0}\left\langle a^{(0)},\mathds{1}% \right\rangle_{L^{2}(\rho)}\right)\,.

(86)

Put together with (85), this equation ensures that the constant component of $\varphi$ remains learned on this third time scale.

Third, we identify the terms of order $\varepsilon^{\nicefrac{{1}}{{4}}}$ :

\displaystyle\begin{split}\partial_{t_{3}}a^{(-1)}(\omega)&=-{\sigma_{0}^{2}}% \int a^{(1)}(\nu)\mathrm{d}\rho(\nu)+\sigma_{1}\left(\varphi_{1}-{\sigma_{1}}% \int a^{(-1)}(\nu)s^{(1)}(\nu)\mathrm{d}\rho(\nu)\right)s^{(1)}(\omega)\,,\\ \partial_{t_{3}}s^{(1)}(\omega)&=\sigma_{1}\left(\varphi_{1}-{\sigma_{1}}\int a% ^{(-1)}(\nu)s^{(1)}(\nu)\mathrm{d}\rho(\nu)\right)a^{(-1)}(\omega)\,.\end{split}

(87)

Again, the term $-{\sigma_{0}^{2}}\int a^{(1)}(\nu)\mathrm{d}\rho(\nu)$ is best interpreted as the Lagrange multiplier associated to the constraints (85), (86). Using the compact notations,

	$\displaystyle\int a^{(-1)}(\nu)s^{(1)}(\nu)\mathrm{d}\rho(\nu)$	$\displaystyle=\left\langle a^{(-1)},s^{(1)}\right\rangle_{L^{2}(\rho)}={\left% \langle a^{(-1)},\mathds{1}\right\rangle_{L^{2}(\rho)}\left\langle\mathds{1},s% ^{(1)}\right\rangle_{L^{2}(\rho)}}+\left\langle a^{(-1)}_{\perp},s^{(1)}_{% \perp}\right\rangle_{L^{2}(\rho)}$
		$\displaystyle=\left\langle a^{(-1)}_{\perp},s^{(1)}_{\perp}\right\rangle_{L^{2% }(\rho)}\,,$

where in the last equality we use (85). Thus we can rewrite (87) as

\displaystyle\begin{split}\partial_{t_{3}}a^{(-1)}&=-{\sigma_{0}^{2}}\langle a% ^{(1)},\mathds{1}\rangle_{L^{2}(\rho)}\mathds{1}+\sigma_{1}\left(\varphi_{1}-{% \sigma_{1}}\left\langle a^{(-1)}_{\perp},s^{(1)}_{\perp}\right\rangle_{L^{2}(% \rho)}\right)s^{(1)}\,,\\ \partial_{t_{3}}s^{(1)}&=\sigma_{1}\left(\varphi_{1}-{\sigma_{1}}\left\langle a% ^{(-1)}_{\perp},s^{(1)}_{\perp}\right\rangle_{L^{2}(\rho)}\right)a^{(-1)}\,,% \end{split}

(88)

and thus

\displaystyle\begin{split}\partial_{t_{3}}a^{(-1)}_{\perp}&=\sigma_{1}\left(% \varphi_{1}-{\sigma_{1}}\left\langle a^{(-1)}_{\perp},s^{(1)}_{\perp}\right% \rangle_{L^{2}(\rho)}\right)s^{(1)}_{\perp}\,,\\ \partial_{t_{3}}s^{(1)}_{\perp}&=\sigma_{1}\left(\varphi_{1}-{\sigma_{1}}\left% \langle a^{(-1)}_{\perp},s^{(1)}_{\perp}\right\rangle_{L^{2}(\rho)}\right)a^{(% -1)}_{\perp}\,.\end{split}

(89)

In Appendix C.1, we solve this system of ODEs and determine the initial condition by matching with the previous layer. The result is that

\displaystyle\begin{split}&a^{(-1)}=a^{(-1)}_{\perp}=\lambda a_{\perp,\rm{init% }}\,,\\ &s^{(1)}=s^{(1)}_{\perp}=\operatorname{sign}(\sigma_{1}\varphi_{1})\lambda a_{% \perp,\rm{init}}\,,\end{split}

(90)

where $\lambda=\lambda(t_{3})$ is the function

\lambda(t_{3})=\frac{|\varphi_{1}|^{\nicefrac{{1}}{{2}}}}{\left({|\sigma_{1}|}% \left\|a_{\perp,\rm{init}}\right\|_{L^{2}(\rho)}^{2}+4|\varphi_{1}|e^{-2|% \sigma_{1}\varphi_{1}|t_{3}}\right)^{\nicefrac{{1}}{{2}}}}\,.

(91)

This solution completes the description of how the linear part of the function $\varphi$ is learned. Plugging it into the equations for $a^{(-1)}$ and $s^{(1)}$ , we get

\displaystyle\sigma_{1}\int a^{(-1)}(\nu)s^{(1)}(\nu)\mathrm{d}\rho(\nu)=% \sigma_{1}\left\langle a^{(-1)}_{\perp},s^{(1)}_{\perp}\right\rangle_{L^{2}(% \rho)}=\frac{\varphi_{1}|\sigma_{1}|\left\|a_{\perp,\rm{init}}\right\|_{L^{2}(% \rho)}^{2}}{{|\sigma_{1}|}\left\|a_{\perp,\rm{init}}\right\|_{L^{2}(\rho)}^{2}% +4|\varphi_{1}|e^{-2|\sigma_{1}\varphi_{1}|t_{3}}},

which converges to $\varphi_{1}$ as $t_{3}\to\infty$ . Consequently, we obtain the following approximation for $\mathscrsfs{R}_{\mbox{\tiny\rm mf},*}$ within this time scale (again, see Appendix C.2 for details):

\displaystyle\mathscrsfs{R}_{\mbox{\tiny\rm mf},*}=\frac{1}{2}\varphi_{1}^{2}% \left(1-\frac{1}{1+\frac{4|\varphi_{1}|}{|\sigma_{1}|\left\|a_{\perp,\rm{init}% }\right\|_{L^{2}(\rho)}^{2}}e^{-2|\sigma_{1}\varphi_{1}|t_{3}}}\right)^{2}+% \frac{1}{2}\sum_{k\geqslant 2}\varphi_{k}^{2}+O(\varepsilon^{\nicefrac{{1}}{{4% }}})\,,\quad\varepsilon\to 0.
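The formula (91) can be checked against the ODEs (89). Substituting the ansatz (90) into (89) reduces the system to the scalar equation $\partial_{t_{3}}\lambda=\left(|\sigma_{1}\varphi_{1}|-\sigma_{1}^{2}\|a_{\perp,\rm{init}}\|_{L^{2}(\rho)}^{2}\lambda^{2}\right)\lambda$; this reduction is a short computation made here, not a formula quoted from Appendix C.1. The Python sketch below evaluates $\lambda$, the residual of this scalar ODE by central differences, and the degree-1 contribution to the risk. The values of `sigma1`, `phi1` and `norm2` are assumptions.

```python
import math

def lam(t3, sigma1, phi1, norm2):
    """lambda(t_3) of Eq. (91); norm2 is ||a_perp,init||^2 in L^2(rho)."""
    c = abs(sigma1 * phi1)
    denom = abs(sigma1) * norm2 + 4.0 * abs(phi1) * math.exp(-2.0 * c * t3)
    return math.sqrt(abs(phi1) / denom)

def linear_risk(t3, sigma1, phi1, norm2):
    """Degree-1 contribution to the risk on the third time scale."""
    c = abs(sigma1 * phi1)
    q = 4.0 * abs(phi1) / (abs(sigma1) * norm2) * math.exp(-2.0 * c * t3)
    return 0.5 * phi1**2 * (1.0 - 1.0 / (1.0 + q)) ** 2

def ode_residual(t3, sigma1, phi1, norm2, h=1e-5):
    """Residual of lambda' = (|sigma1*phi1| - sigma1^2 * norm2 * lambda^2) * lambda."""
    deriv = (lam(t3 + h, sigma1, phi1, norm2) - lam(t3 - h, sigma1, phi1, norm2)) / (2 * h)
    l0 = lam(t3, sigma1, phi1, norm2)
    return deriv - (abs(sigma1 * phi1) - sigma1**2 * norm2 * l0**2) * l0

sigma1, phi1, norm2 = 0.6, -1.5, 2.0  # assumed illustrative values
for t3 in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(t3, lam(t3, sigma1, phi1, norm2),
          linear_risk(t3, sigma1, phi1, norm2),
          ode_residual(t3, sigma1, phi1, norm2))
```

For very negative $t_{3}$ the function behaves like $\frac{1}{2}e^{|\sigma_{1}\varphi_{1}|t_{3}}$, which matches the divergence of the second time scale, and the risk contribution decreases from $\frac{1}{2}\varphi_{1}^{2}$ to $0$ across the window.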

6.5 Conjectured behavior for larger time scales

The analysis of the previous sections naturally suggests the existence of a sequence of cutoffs. At each time scale, a new polynomial component of $\varphi$ is learned within a window that is much shorter than the time elapsed before that phase started. Along this sequence, we expect $s$ and $a$ to grow to increasingly larger scales in $\varepsilon$ (but $s$ remains $o(1)$ while $a$ diverges).

More precisely, we assume that during the $l$ -th phase, the network learns the degree- $l$ component $\varphi_{l}$ , and various quantities satisfy the following scaling behavior:

\displaystyle a=O(\varepsilon^{-\omega_{l}}),\;\;\;s=O(\varepsilon^{\beta_{l}}% ),\;\;\;t=O(\varepsilon^{\mu_{l}})\,,

(92)

where $\omega_{l}>0$ is an increasing sequence and $\beta_{l},\mu_{l}>0$ are decreasing sequences. Further, while learning of this component takes place when $t=O(\varepsilon^{\mu_{l}})$ , the actual evolution of the risk (and of the neural network) takes place on much shorter scales, namely:

\displaystyle\Delta t=O(\varepsilon^{\nu_{l}})\,,

(93)

where $\nu_{l}$ is also decreasing, with $\nu_{l}>\mu_{l}$ . The goal of this section is to provide heuristic arguments to conjecture the values of $\omega_{l}$ , $\beta_{l}$ , $\mu_{l}$ and $\nu_{l}$ . We will base this conjecture on a rigorous analysis of a simplified model.

The simplified model is motivated by the expectation (supported by the heuristics and simulations in the previous sections) that learning each component happens independently from the details of the evolution on previous time scales. In the simplified model, the activation function $\sigma(x)$ is proportional to the $l$ -th Hermite polynomial, namely $\sigma(x)=\sigma_{l}\mathrm{He}_{l}(x)$ . This is the component of $\sigma$ that we expect to be relevant on the $l$ -th time scale. The gradient flow equations (42) then read:

\begin{split}\varepsilon\partial_{t}a(\omega)&=\,\sigma_{l}s(\omega)^{l}\left(% \varphi_{l}-\sigma_{l}\int a(\nu)s(\nu)^{l}\mathrm{d}\rho(\nu)\right)\,,\\ \partial_{t}s(\omega)&=\,a(\omega)\left(1-s(\omega)^{2}\right)l\sigma_{l}s(% \omega)^{l-1}\left(\varphi_{l}-\sigma_{l}\int a(\nu)s(\nu)^{l}\mathrm{d}\rho(% \nu)\right)\,.\end{split}

(94)

with corresponding risk component

\mathscrsfs{R}_{l}=\frac{1}{2}\left(\varphi_{l}-\sigma_{l}\int a(\nu)s(\nu)^{l% }\mathrm{d}\rho(\nu)\right)^{2}.

We capture the effect of learning dynamics on the previous time scales by the overall magnitude of the $a(\omega)$ ’s and $s(\omega)$ ’s at initialization. Namely, we choose the scale of initialization of the simplified model to be given by the end of the $(l-1)$ -th time scale, i.e., $a(\omega)\asymp\varepsilon^{-\omega_{l-1}}$ and $s(\omega)\asymp\varepsilon^{\beta_{l-1}}$ . Further, in order for the $(l-1)$ -th component to be learned, namely

\int a(\nu)s(\nu)^{l-1}\mathrm{d}\rho(\nu)\approx\frac{\varphi_{l-1}}{\sigma_{% l-1}},

(95)

we require $\omega_{l-1}=(l-1)\beta_{l-1}$ so that $\int a(\nu)s(\nu)^{l-1}\mathrm{d}\rho(\nu)=\Theta(1)$ . Analogously, we assume $\omega_{l}=l\beta_{l}$ .

Based on this consideration, we introduce the rescaled variables

\widetilde{a}(\omega)=\varepsilon^{\omega_{l}}a(\omega),\ \widetilde{s}(\omega% )=\varepsilon^{-\beta_{l}}s(\omega),\ \text{where}\ \widetilde{a}(\omega,0)% \asymp\varepsilon^{\omega_{l}-\omega_{l-1}},\widetilde{s}(\omega,0)\asymp% \varepsilon^{\beta_{l-1}-\beta_{l}}.

Rewriting Eq. (94) in terms of $\widetilde{a}(\omega)$ ’s and $\widetilde{s}(\omega)$ ’s, and using $\omega_{l}=l\beta_{l}$ , we get that

\begin{split}\varepsilon^{1-2l\beta_{l}}\partial_{t}\widetilde{a}(\omega)=\,&% \sigma_{l}\widetilde{s}(\omega)^{l}\left(\varphi_{l}-\sigma_{l}\int\widetilde{% a}(\nu)\widetilde{s}(\nu)^{l}\mathrm{d}\rho(\nu)\right)\\ \varepsilon^{2\beta_{l}}\partial_{t}\widetilde{s}(\omega)=\,&l\sigma_{l}% \widetilde{a}(\omega)\widetilde{s}(\omega)^{l-1}\left(1-\varepsilon^{2\beta_{l% }}\widetilde{s}(\omega)^{2}\right)\left(\varphi_{l}-\sigma_{l}\int\widetilde{a% }(\nu)\widetilde{s}(\nu)^{l}\mathrm{d}\rho(\nu)\right).\end{split}

(96)

In order for the $\widetilde{a}(\omega)$ ’s and $\widetilde{s}(\omega)$ ’s to be learned simultaneously, we need $1-2l\beta_{l}=2\beta_{l}$ , which implies $\beta_{l}=1/2(l+1)$ . Making a further change of the time variable $t=\varepsilon^{\nu_{l}}\tau$ , where $\nu_{l}=2\beta_{l}=1/(l+1)$ , it follows that

\begin{split}\partial_{\tau}\widetilde{a}(\omega)=\,&\sigma_{l}\widetilde{s}(% \omega)^{l}\left(\varphi_{l}-\sigma_{l}\int\widetilde{a}(\nu)\widetilde{s}(\nu% )^{l}\mathrm{d}\rho(\nu)\right)\\ \partial_{\tau}\widetilde{s}(\omega)=\,&l\sigma_{l}\widetilde{a}(\omega)% \widetilde{s}(\omega)^{l-1}\left(1-\varepsilon^{2\beta_{l}}\widetilde{s}(% \omega)^{2}\right)\left(\varphi_{l}-\sigma_{l}\int\widetilde{a}(\nu)\widetilde% {s}(\nu)^{l}\mathrm{d}\rho(\nu)\right).\end{split}

(97)

Moreover, rewriting the risk in terms of the rescaled variables $\widetilde{a},\widetilde{s}$ , $\mathscrsfs{R}_{l}(\tau)=\mathscrsfs{R}_{l}(\widetilde{a}(\tau),\widetilde{s}(% \tau))$ satisfies the ODE:

\partial_{\tau}\mathscrsfs{R}_{l}=-2\sigma_{l}^{2}\mathscrsfs{R}_{l}\cdot\int% \widetilde{s}(\omega)^{2(l-1)}\left(l^{2}\widetilde{a}(\omega)^{2}\left(1-% \varepsilon^{2\beta_{l}}\widetilde{s}(\omega)^{2}\right)+\widetilde{s}(\omega)% ^{2}\right)\mathrm{d}\rho(\omega).

(98)
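The rescaled system (97) is easy to simulate when $\rho$ is discrete. The Python sketch below uses forward Euler on two atoms of equal mass with $l=2$ and records the risk $\mathscrsfs{R}_{l}$ along the trajectory. The choices $\sigma_{l}=\varphi_{l}=1$, $\varepsilon=10^{-6}$, the two-atom initialization and the step size are assumptions made for illustration. Forward Euler is a convenience here and not a scheme used in the paper.

```python
def simulate_simplified_flow(a0, s0, weights, l, sigma_l, phi_l, eps, tau_max, dt=1e-3):
    """Forward-Euler integration of the rescaled system (97) for a discrete rho.

    a0, s0  : initial values of a-tilde and s-tilde at each atom
    weights : rho-mass of each atom (should sum to 1)
    Returns (risks, a, s): the risk R_l at every step and the final state.
    """
    beta = 1.0 / (2 * (l + 1))
    a, s = list(a0), list(s0)
    risks = []
    for _ in range(int(round(tau_max / dt)) + 1):
        overlap = sum(w * ai * si**l for w, ai, si in zip(weights, a, s))
        err = phi_l - sigma_l * overlap
        risks.append(0.5 * err**2)
        da = [sigma_l * si**l * err for si in s]
        ds = [l * sigma_l * ai * si ** (l - 1) * (1.0 - eps ** (2 * beta) * si**2) * err
              for ai, si in zip(a, s)]
        a = [ai + dt * d for ai, d in zip(a, da)]
        s = [si + dt * d for si, d in zip(s, ds)]
    return risks, a, s

# Assumed illustrative setup: l = 2, one atom with a*s^l > 0 and one with a*s^l < 0.
l, eps = 2, 1e-6
scale = eps ** (1.0 / (2 * l * (l + 1)))
risks, a_fin, s_fin = simulate_simplified_flow(
    a0=[scale, -scale], s0=[scale, scale], weights=[0.5, 0.5],
    l=l, sigma_l=1.0, phi_l=1.0, eps=eps, tau_max=10.0)
print(risks[0], risks[-1], a_fin, s_fin)
```

In this run the risk starts at $\frac{1}{2}\varphi_{l}^{2}$ and decreases monotonically, consistent with the sign of the right-hand side of the risk ODE. The atom with $\widetilde{a}\,\widetilde{s}^{l}>0$ drives the learning, while $|\widetilde{s}|$ shrinks on the other atom.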

Note that with our choice of $\beta_{l}$ and $\omega_{l}$ , we have $\omega_{l}-\omega_{l-1}=\beta_{l-1}-\beta_{l}=1/2l(l+1)$ . This means that the $\widetilde{a}(\omega)$ ’s and $\widetilde{s}(\omega)$ ’s are initialized at the same scale, namely

\displaystyle\widetilde{a}(\omega,0),\widetilde{s}(\omega,0)=\Theta(% \varepsilon^{1/2l(l+1)})\,.

(99)

The theorem below describes quantitatively the dynamics of the simplified model for small $\varepsilon$ , and determines the value of $\mu_{l}$ (recall that $\nu_{l}=1/(l+1)$ ):

Theorem 1 (Evolution of the simplified gradient flow).

Assume $l\geq 2$ and let $(\widetilde{a}(\omega,\tau),\widetilde{s}(\omega,\tau))_{\tau\geq 0}$ be the unique solution of the ODE system (97), initialized as per Eq. (99) (note in particular that $\sigma_{l}\varphi_{l}\widetilde{a}(\omega,0)\widetilde{s}(\omega,0)^{l}\asymp\varepsilon^{1/2l}$ ). Then the following hold:

(a)

Let us denote

A=\left\{\omega:\sigma_{l}\varphi_{l}\liminf_{\varepsilon\to 0}\varepsilon^{-1% /2l}\widetilde{a}(\omega,0)\widetilde{s}(\omega,0)^{l}>0\right\}

(100)

and assume $\rho(A)>0$ . For $\Delta\in(0,\varphi_{l}^{2}/2)$ , define

\tau(\Delta)=\inf\{\tau\geq 0:\mathscrsfs{R}_{l}(\widetilde{a}(\tau),% \widetilde{s}(\tau))\leq\Delta\}.

(101)

Then, for any fixed $\Delta$ we have $\tau(\Delta)=\Theta(\varepsilon^{-(l-1)/2l(l+1)})$ as $\varepsilon\to 0$ . Further, if $\rho$ is a discrete probability measure, then there exists $\tau_{*}(\varepsilon)=\Theta(\varepsilon^{-(l-1)/2l(l+1)})$ and, for any $\Delta>0$ a constant $c_{*}(\Delta)>0$ independent of $\varepsilon$ such that

	$\displaystyle\tau\leq\tau_{*}(\varepsilon)-c_{*}(\Delta)$	$\displaystyle\Rightarrow\;\;\liminf_{\varepsilon\to 0}\mathscrsfs{R}_{l}(\widetilde{a}(\tau),\widetilde{s}(\tau))\geq\frac{1}{2}\varphi_{l}^{2}-\Delta\,,$		(102)
	$\displaystyle\tau\geq\tau_{*}(\varepsilon)+c_{*}(\Delta)$	$\displaystyle\Rightarrow\;\;\limsup_{\varepsilon\to 0}\mathscrsfs{R}_{l}(\widetilde{a}(\tau),\widetilde{s}(\tau))\leq\Delta\,,$		(103)

namely the $l$ -th component is learnt in an $O(1)$ time window around $\tau_{*}(\varepsilon)=\Theta(\varepsilon^{-(l-1)/2l(l+1)})$ .

(b)

Similarly, we denote

B=\left\{\omega:\sigma_{l}\varphi_{l}\limsup_{\varepsilon\to 0}\varepsilon^{-1% /2l}\widetilde{a}(\omega,0)\widetilde{s}(\omega,0)^{l}<0,\ \text{and}\ \liminf% _{\varepsilon\to 0}(\widetilde{s}(\omega,0)^{2}/\widetilde{a}(\omega,0)^{2})>l% \right\}.

(104)

If $\rho(B)>0$ , then the same claims as in $(a)$ hold.

(c)

If neither of the conditions at points $(a)$ , $(b)$ holds, and

\sigma_{l}\varphi_{l}\limsup_{\varepsilon\to 0}\varepsilon^{-1/2l}\widetilde{a% }(\omega,0)\widetilde{s}(\omega,0)^{l}<0,\quad\limsup_{\varepsilon\to 0}(% \widetilde{s}(\omega,0)^{2}/\widetilde{a}(\omega,0)^{2})<l

(105)

for almost every $\omega\in\Omega$ , then for such $\omega\in\Omega$ and each $\Delta>0$ , there exists a constant $C_{*}(\omega,\Delta)>0$ such that

\displaystyle\tau\geq C_{*}(\omega,\Delta)\varepsilon^{-(l-1)/2l(l+1)}\;\;% \Rightarrow\;\;|\widetilde{s}(\omega,\tau)|\leq\Delta\varepsilon^{1/2l(l+1)}\,,

(106)

meaning that $\widetilde{s}(\omega,\tau)$ converges to $0$ eventually.

We further note that $\tau=\Theta(\varepsilon^{-(l-1)/2l(l+1)})\Longleftrightarrow t=\Theta(% \varepsilon^{\mu_{l}})$ with $\mu_{l}=1/2l$ , and $\tau=O(1)\Longleftrightarrow t=O(\varepsilon^{\nu_{l}})$ with $\nu_{l}=1/(l+1)$ .

The proof of Theorem 1 is deferred to Appendix C.3.
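The exponent bookkeeping of this section can be verified in exact arithmetic. The following Python sketch collects $\beta_{l}=\frac{1}{2(l+1)}$, $\omega_{l}=l\beta_{l}$, $\nu_{l}=\frac{1}{l+1}$ and $\mu_{l}=\frac{1}{2l}$, and checks the identities used above. The function name `conjectured_exponents` is introduced here only for illustration.

```python
from fractions import Fraction

def conjectured_exponents(l):
    """Exponents of Eqs. (92)-(93) for the l-th phase, as derived in this section."""
    beta = Fraction(1, 2 * (l + 1))  # s = O(eps^beta)
    omega = l * beta                 # a = O(eps^-omega), from omega_l = l * beta_l
    nu = 2 * beta                    # window width: Delta t = O(eps^nu)
    mu = Fraction(1, 2 * l)          # time of the phase: t = O(eps^mu)
    return beta, omega, mu, nu

for l in range(2, 6):
    beta, omega, mu, nu = conjectured_exponents(l)
    beta_prev, omega_prev, _, _ = conjectured_exponents(l - 1)
    gap = Fraction(1, 2 * l * (l + 1))
    # Both rescaled variables start at scale eps^gap, as in Eq. (99).
    assert omega - omega_prev == gap == beta_prev - beta
    # t = eps^nu * tau with tau = Theta(eps^{-(l-1)/(2l(l+1))}) gives t = Theta(eps^mu).
    assert nu - Fraction(l - 1, 2 * l * (l + 1)) == mu
    # The window is shorter than the time elapsed before the phase.
    assert nu > mu
    print(l, beta, omega, mu, nu)
```

For $l=1$ the same formulas give $\beta_{1}=\omega_{1}=1/4$, which agrees with the scales $a\asymp\varepsilon^{-\nicefrac{1}{4}}$ and $s\asymp\varepsilon^{\nicefrac{1}{4}}$ found on the third time scale.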

Remark 6.1.

Under the conditions of cases $(a)$ and $(b)$ , we see that the degree- $l$ component of the target function is learnt within an $O(\varepsilon^{1/(l+1)})$ time window around $t_{*}(l,\varepsilon)\asymp\varepsilon^{1/2l}$ , which is consistent with the timescales conjectured in Definition 1.

Remark 6.2.

Case $(c)$ corresponds to $s(\omega)/s(\omega,0)$ becoming close to $0$ in time $t=O(\varepsilon^{\mu_{l}})$ , and staying at $0$ . In other words, the neurons become orthogonal to the target direction and play no role in learning higher-degree components any longer.

Informally, case $(c)$ couples the learning of different polynomial components. It can happen that the learning phase $l-1$ induces an effective initialization $(\widetilde{a}(\omega,0),\ \widetilde{s}(\omega,0))$ within the domain of case $(c)$ .

We expect this not to be the case for suitable choices of initialization (or equivalently ${\rm P}_{A}$ ), $\varphi$ , and $\sigma$ . Establishing this would amount to establishing that the canonical learning order holds.

7 Stochastic gradient descent and finite sample size

So far we focused on analyzing the projected gradient flow (GF) dynamics with respect to the population risk, as defined in Eqs. (5)-(6). In this section, we extract the implications of our analysis of GF on online projected stochastic gradient descent, which is a projected version of the SGD dynamics (162).

For simplicity of notation, we denote by $z=(y,x)\in{\mathbb{R}}\times{\mathbb{R}}^{d}$ a datapoint and by $\theta_{i}=(a_{i},u_{i})\in{\mathbb{R}}\times\mathbb{S}^{d-1}$ the parameters of neuron $i$ . For $z=(y,x)$ and $\rho^{(m)}=(1/m)\sum_{i=1}^{m}\delta_{\theta_{i}}=(1/m)\sum_{i=1}^{m}\delta_{(% a_{i},u_{i})}$ , we define

	$\displaystyle\widehat{F}_{i}(\rho^{(m)};z)=\,$	$\displaystyle\left(y-\frac{1}{m}\sum_{j=1}^{m}a_{j}\sigma(\langle u_{j},x% \rangle)\right)\sigma(\langle u_{i},x\rangle),$
	$\displaystyle\widehat{G}_{i}(\rho^{(m)};z)=\,$	$\displaystyle a_{i}\left(y-\frac{1}{m}\sum_{j=1}^{m}a_{j}\sigma(\langle u_{j},% x\rangle)\right)\sigma^{\prime}(\langle u_{i},x\rangle)x.$

The projected SGD dynamics is specified as follows:

\begin{split}\overline{a}_{i}(k+1)=\,&\overline{a}_{i}(k)+\varepsilon^{-1}\eta% \widehat{F}_{i}(\overline{\rho}^{(m)}(k);z_{k+1})\\ \overline{u}_{i}(k+1)=\,&\operatorname{Proj}_{\mathbb{S}^{d-1}}\left(\overline% {u}_{i}(k)+\eta\widehat{G}_{i}(\overline{\rho}^{(m)}(k);z_{k+1})\right),\end{split}

(107)

where for $u\in\mathbb{R}^{d}$ and compact $S\subset\mathbb{R}^{d}$ , $\operatorname{Proj}_{S}(u):=\operatorname*{argmin}_{s\in S}\left\|{s-u}\right% \|_{2}$ , and $\overline{\rho}^{(m)}:=(1/m)\sum_{i=1}^{m}\delta_{\overline{\theta}_{i}}$ . Note that the $(\overline{a}_{i},\overline{u}_{i})$ ’s here are different from the $(\overline{a},\overline{s})$ ’s in Section 6.
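A single step of the update (107) can be sketched in plain Python as follows. The $\tanh$ activation, the function names and the list-based representation of vectors are assumptions made for illustration; the projection onto $\mathbb{S}^{d-1}$ is implemented as normalization, which coincides with the Euclidean projection for any nonzero vector.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def project_to_sphere(u):
    """Euclidean projection of a nonzero vector onto the unit sphere."""
    norm = math.sqrt(dot(u, u))
    return [ui / norm for ui in u]

def projected_sgd_step(a, u, x, y, eta, eps, sigma=math.tanh,
                       dsigma=lambda v: 1.0 - math.tanh(v) ** 2):
    """One step of Eq. (107) on a single sample z = (y, x).

    a : list of m second-layer weights
    u : list of m unit vectors in R^d (first-layer weights)
    All neurons are updated from the same (old) parameters.
    """
    m = len(a)
    pre = [dot(ui, x) for ui in u]
    residual = y - sum(aj * sigma(pj) for aj, pj in zip(a, pre)) / m
    # a_i <- a_i + (eta / eps) * F_i
    new_a = [ai + (eta / eps) * residual * sigma(p) for ai, p in zip(a, pre)]
    new_u = []
    for ai, ui, p in zip(a, u, pre):
        g = ai * residual * dsigma(p)  # scalar factor multiplying x in G_i
        new_u.append(project_to_sphere([uk + eta * g * xk for uk, xk in zip(ui, x)]))
    return new_a, new_u

# Assumed toy example with m = 1 neuron in dimension d = 2.
new_a, new_u = projected_sgd_step(a=[0.5], u=[[1.0, 0.0]], x=[2.0, 1.0], y=1.0,
                                  eta=0.1, eps=0.5)
print(new_a, new_u)
```

The factor $\varepsilon^{-1}$ appears only in the update of the $\overline{a}_{i}$ ’s, so the second-layer weights move on a faster time scale than the directions $\overline{u}_{i}$ when $\varepsilon$ is small.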

We prove that, for small $\eta$ , the projected SGD of Eq. (107) is close to the gradient flow of Eqs. (5)-(6). Throughout this section, we make the following assumptions similar to those assumed in Section 4:

A1.

$\rho_{0}$ is supported on $[-M_{1},M_{1}]\times\mathbb{S}^{d-1}$ . Hence, $|a_{i}(0)|\leq M_{1}$ for all $i\in[m]$ .

A2.

The activation function is bounded: $\left\|{\sigma}\right\|_{\infty}\leq M_{2}$ . Additionally, define for $u,u^{\prime}\in\mathbb{R}^{d}$ :

	$\displaystyle V(\langle u_{*},u\rangle;\left\|{u_{*}}\right\|_{2},\left\|{u}\right\|_{2})=\,$	$\displaystyle\mathbb{E}\left[\varphi(\langle u_{*},x\rangle)\sigma(\langle u,x\rangle)\right],$		(108)
	$\displaystyle U(\langle u,u^{\prime}\rangle;\left\|{u}\right\|_{2},\left\|{u^{\prime}}\right\|_{2})=\,$	$\displaystyle\mathbb{E}\left[\sigma(\langle u,x\rangle)\sigma(\langle u^{\prime},x\rangle)\right].$		(109)

We then require the functions $V$ and $U$ to be bounded and differentiable, with uniformly bounded and Lipschitz continuous gradients for all $\left\|{u}\right\|_{2},\left\|{u^{\prime}}\right\|_{2}\leq 2$ :

		$\displaystyle\left\|{\nabla_{u}V}\right\|_{2}\leq M_{2},\ \left\|{\nabla_{u}V-\nabla_{u^{\prime}}V}\right\|_{2}\leq M_{2}\left\|{u-u^{\prime}}\right\|_{2},$		(110)
		$\displaystyle\left\|{\nabla_{(u,u^{\prime})}U}\right\|_{2}\leq M_{2},\ \left\|{\nabla_{(u,u^{\prime})}U-\nabla_{(u_{1},u_{1}^{\prime})}U}\right\|_{2}\leq M_{2}\left(\left\|{u-u_{1}}\right\|_{2}+\left\|{u^{\prime}-u_{1}^{\prime}}\right\|_{2}\right).$		(111)

Similarly to Remark 4.1, we can show that a sufficient condition for Eqs. (110) and (111) is

\sup\left\{\left\|{\sigma^{\prime}}\right\|_{L^{2}},\,\left\|{\sigma^{\prime% \prime}}\right\|_{L^{2}}\right\}\leq M_{2}^{\prime},\quad\ \sup\left\{\left\|{% \varphi}\right\|_{L^{2}},\,\left\|{\varphi^{\prime}}\right\|_{L^{2}},\,\left\|% {\varphi^{\prime\prime}}\right\|_{L^{2}}\right\}\leq M_{2}^{\prime},

where the constant $M_{2}^{\prime}$ depends only on $M_{2}$ .

A3.

Let $(x,y)\sim\mathds{P}$ ; we require that $y\in[-M_{3},M_{3}]$ almost surely. Moreover, we assume that for all $\left\|{u}\right\|_{2}\leq 2$ , both $\sigma(\langle u,x\rangle)$ and $\sigma^{\prime}(\langle u,x\rangle)(x-\langle u,x\rangle u)$ are $M_{3}$ -sub-Gaussian.

The following theorem upper bounds the distance between gradient flow and projected stochastic gradient descent dynamics.

Theorem 2 (Difference between GF and Projected SGD).

Let $\theta_{i}(t)=(a_{i}(t),u_{i}(t))$ be the solution of the GF ordinary differential equations (5)-(6). There exists a constant $M$ that only depends on the $M_{i}$ ’s from Assumptions A1-A3, such that for any $T,z\geq 0$ and

\eta\leq\frac{1}{(d+\log m+z^{2})M\exp((1+1/\varepsilon)MT(1+T/\varepsilon)^{2% })},

the following holds with probability at least $1-\exp(-z^{2})$ :

$\displaystyle\sup_{k\in[0,T/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left\|\overline{% a}_{i}(k)\right\|\leq\,$	$\displaystyle M(1+T/\varepsilon),$	(112)
$\displaystyle\sup_{k\in[0,T/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left\|{\theta_{i}(k\eta)-\overline{\theta}_{i}(k)}\right\|_{2}\leq\,$	$\displaystyle\left(\sqrt{d+\log m}+z\right)$	(113)
	$\displaystyle\ \ \times M\exp\left(\left(1+\frac{1}{\varepsilon}\right)MT\left% (1+\frac{T}{\varepsilon}\right)^{2}\right)\sqrt{\eta},$	(114)
$\displaystyle\sup_{k\in[0,T/\eta]\cap\mathbb{N}}\left\|\mathscrsfs{R}(\overline% {a}(k),\overline{u}(k))-\mathscrsfs{R}(a(k\eta),u(k\eta))\right\|\leq\,$	$\displaystyle\left(\sqrt{d+\log m}+z\right)$	(115)
	$\displaystyle\ \ \times M\exp\left(\left(1+\frac{1}{\varepsilon}\right)MT\left% (1+\frac{T}{\varepsilon}\right)^{2}\right)\sqrt{\eta}.$	(116)

The proof is presented in Appendix D and follows the same scheme as the proof of Theorem 1 part (B) in [33]. The main difference with respect to that theorem is that here we are interested in projected SGD (and GF) instead of plain SGD (and GF); hence an additional approximation step is required, and the $a_{i}$ ’s and $u_{i}$ ’s need to be treated separately. We next draw implications of the last result for learning by online SGD within the canonical learning order.
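To make the object compared with GF in Theorem 2 concrete, the following is a minimal sketch of a projected online SGD iteration. Eq. (107) is not reproduced in this section, so the update rule below is an assumption inferred from the GF equations (step size $\eta/\varepsilon$ for the $a_{i}$ 's, step size $\eta$ and tangent-space projection followed by renormalization for the $u_{i}$ 's). The function name `projected_sgd`, the choice $\sigma=\varphi=\tanh$ , the target direction and the initialization of the $a_{i}$ 's are all hypothetical and chosen only for illustration.

```python
import numpy as np

def projected_sgd(d=20, m=8, eps=0.5, eta=1e-3, n_steps=2000, seed=0):
    """Hypothetical projected online SGD for f(x) = (1/m) sum_i a_i sigma(<u_i, x>)."""
    rng = np.random.default_rng(seed)
    sigma = np.tanh                                   # assumed activation
    dsigma = lambda t: 1.0 - np.tanh(t) ** 2          # its derivative
    phi = np.tanh                                     # assumed target link function
    u_star = np.zeros(d)
    u_star[0] = 1.0                                   # assumed target direction
    a = rng.uniform(-1.0, 1.0, size=m)                # assumed initialization of the a_i's
    u = rng.standard_normal((m, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)     # u_i uniform on the unit sphere
    for _ in range(n_steps):
        x = rng.standard_normal(d)                    # one fresh sample per step (online SGD)
        y = phi(u_star @ x)
        pre = u @ x                                   # <u_i, x>
        resid = y - np.mean(a * sigma(pre))           # y - f(x; a, u)
        # Per-sample gradients of (1/2)(y - f)^2, multiplied by m as in the GF equations.
        grad_a = -resid * sigma(pre)
        # Gradient in u_i projected on the tangent space: (I - u_i u_i^T) x = x - <u_i, x> u_i.
        grad_u = -resid * (a * dsigma(pre))[:, None] * (x[None, :] - pre[:, None] * u)
        a = a - (eta / eps) * grad_a                  # second layer moves on the faster time scale
        u = u - eta * grad_u
        u /= np.linalg.norm(u, axis=1, keepdims=True)  # project back onto the unit sphere
    return a, u
```

In this sketch, the iterate after $k$ steps plays the role of $\overline{\theta}_{i}(k)$ , to be compared with the GF solution at time $k\eta$ .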

Theorem 3.

Fix any $\delta>0$ . Assume that $\varphi,\sigma$ and the initialization ${\rm P}_{A}$ are such that the canonical learning order of Definition 1 holds up to level $L$ for some $L\geq 2$ , and that

\sum_{k\geq L+1}\varphi_{k}^{2}\leq\frac{\delta}{2}.

(117)

Then, there exist constants $\varepsilon_{*}=\varepsilon_{*}(\delta)$ , $T_{0}=T_{0}(\delta)$ , $T=T(\varepsilon,\delta)=T_{0}(\delta)\varepsilon^{1/(2L)}$ and $M=M(\varepsilon,\delta)$ that depend on $\varepsilon,\delta$ (together with $\varphi,\sigma$ and ${\rm P}_{A}$ ) such that the following happens. Assume $\varepsilon\leq\varepsilon_{*}(\delta)$ and $m,d,z$ are such that $d\geq M$ , $m\geq\max(M,z)$ , and the step size $\eta$ and number of samples (equivalently, number of steps) $n$ satisfy

	$\displaystyle\eta$	$\displaystyle=\frac{1}{M(d+\log m+z)}\,,$		(118)
	$\displaystyle n$	$\displaystyle=MT(d+\log m+z)\,.$		(119)

Then, with probability at least $1-e^{-z}$ , the projected gradient descent algorithm of Eq. (107) achieves population risk smaller than $\delta$ :

\displaystyle\mathds{P}\Big{(}\mathscrsfs{R}(\overline{a}(n),\overline{u}(n))\leq\delta\Big{)}\geq 1-e^{-z}\,.

(120)

The proof of Theorem 3 is deferred to Appendix D.4.

Remark 7.1.

Within the lazy or neural tangent regime, learning the projection of the target function $\varphi(\langle u_{*},x\rangle)$ onto polynomials of degree $\ell$ requires $n\gg d^{\ell}$ samples, and $m\gg d^{\ell-1}$ neurons [20, 34, 35].

In contrast, Theorem 3 shows that, within the canonical learning order, $O(d)$ samples and $O(1)$ neurons are sufficient. Further, as per Theorem 2, the learning dynamics are accurately described by the GF analyzed in the previous sections.

8 Discussion

We conclude by discussing some of our findings as well as potential extensions of our work. As mentioned in the introduction, our initial motivation was to understand certain ubiquitous phenomena in the learning dynamics of multi-layer neural networks. A particularly striking phenomenon that we could reproduce in the present mathematical setting is the coexistence of plateaus in which the risk barely changes and sudden drops.

In the next paragraphs, we will briefly emphasize results or future directions that were not anticipated at the beginning of this work.

Implicit bias in function space.

We provided evidence towards the canonical learning order of Definition 1. According to this scenario, the target function $\varphi$ is learnt according to its decomposition into Hermite polynomials, with lower degree components learnt first. This theory applies to online SGD via Theorem 2 and Theorem 3. In this setting, the number of SGD steps corresponds to the number of samples. Therefore, at a small sample size, SGD will fit a low-degree polynomial approximation of the target function, with the degree increasing with the number of samples.

A similar phenomenon is observed with (rotationally invariant) kernel methods [34], with one important difference. Here the number of samples always scales linearly in the degree, while for kernel methods, different polynomial degrees correspond to different scalings with the dimension.

Implicit bias in parameter space.

Our analysis tracks the evolution of the weights as well. As explained in Section 6, in order for the degree- $k$ component of the target function to be well approximated (in the $d,m\to\infty$ limit), it is sufficient that $\sigma_{k}\int a(\nu)s(\nu)^{k}\rho(\mathrm{d}\nu)=\varphi_{k}$ . Here $\nu$ is an abstract neuron index, $a(\nu)$ is the second-layer weight and $s(\nu)$ is the projection of the first layer weight along the target direction $u_{*}$ .

Naively, one would expect that, in order for learning to take place, first layer weights should be well aligned with $u_{*}$ , i.e. $s(\nu)$ should concentrate close to one. However, this is not the only way to satisfy the constraints $\sigma_{k}\int a(\nu)s(\nu)^{k}\rho(\mathrm{d}\nu)=\varphi_{k}$ . Indeed, our analysis in Section 6 indicates that gradient flow satisfies this constraint with $s=\Theta(\varepsilon^{\beta_{k}})$ and $a=\Theta(\varepsilon^{-\omega_{k}})$ with $\beta_{k}=1/(2(k+1))$ , $\omega_{k}=k/(2(k+1))$ (so that $\sigma_{k}\int a(\nu)s(\nu)^{k}\rho(\mathrm{d}\nu)$ is of order one) as $\varepsilon\to 0$ . In other words, the alignment is small, and second layer weights are large. (In general, weights on multiple scales coexist.)
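As a quick sanity check of the exponents quoted above (reading them as $\beta_{k}=1/(2(k+1))$ and $\omega_{k}=k/(2(k+1))$ , and treating a single scale of neurons, which is a simplification), the powers of $\varepsilon$ cancel exactly in the product $a\,s^{k}$ :

```latex
a\,s^{k}
=\Theta\!\left(\varepsilon^{-\omega_{k}}\right)\cdot\Theta\!\left(\varepsilon^{k\beta_{k}}\right)
=\Theta\!\left(\varepsilon^{\frac{k}{2(k+1)}-\frac{k}{2(k+1)}}\right)
=\Theta(1),
\qquad
\text{e.g. } k=1:\ \ s=\Theta(\varepsilon^{1/4}),\ \ a=\Theta(\varepsilon^{-1/4}).
```

For larger $k$ the alignment exponent $\beta_{k}$ decreases towards $0$ while $\omega_{k}$ increases towards $1/2$ .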

The role of the learning rate $\varepsilon$ .

The initialization of parameters and relative step-sizes play a key role in modern (non-convex) machine learning. The combination of the two scalings (initialization and relative stepsize) affects the learning dynamics. In order to clarify this point, we can consider a general parametrization (we keep $\|u_{i}\|_{2}=1$ )

f(x;a,u)=\frac{1}{m^{\gamma}}\sum_{i=1}^{m}a_{i}\sigma(\langle u_{i},x\rangle)=:\sum_{i=1}^{m}c_{i}\sigma(\langle u_{i},x\rangle),

and gradient flow dynamics

	$\displaystyle\varepsilon\partial_{t}a_{i}$	$\displaystyle=-m\partial_{a_{i}}\mathscrsfs{R}(a,u)\,,$
	$\displaystyle\partial_{t}u_{i}$	$\displaystyle=-m(I_{d}-u_{i}u_{i}^{\top})\nabla_{u_{i}}\mathscrsfs{R}(a,u)\,.$

(Note that the learning rate in the second equation can be set to $1$ without loss of generality, by rescaling the time axis.) Rewriting this in terms of the coefficients $c_{i}$ , so that the function representation is kept fixed, we have

\displaystyle s\partial_{t}c_{i}=-m\partial_{c_{i}}\overline{\mathscrsfs{R}}(c,u)\,,\;\;s=\varepsilon m^{2\gamma}\,,

while the second equation remains unchanged. This parametrization allows us to compare various scalings in a uniform fashion.

•

Mean field scaling [31, 13]: $s=\Theta(m^{2})$ , $|c_{i}(0)|=\Theta(m^{-1})$ .
•

In this paper: $s=\varepsilon m^{2}$ , $|c_{i}(0)|=\Theta(m^{-1})$ , $\varepsilon\to 0$ after $m\to\infty$ .
•

Classical scaling [29, 24]: $s=\Theta(1)$ , $|c_{i}(0)|=\Theta(m^{-1/2})$ .
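The value $s=\varepsilon m^{2\gamma}$ used in this comparison follows from the chain rule. A short derivation, under our reading that $c_{i}=m^{-\gamma}a_{i}$ and $\overline{\mathscrsfs{R}}(c,u)=\mathscrsfs{R}(m^{\gamma}c,u)$ (the text does not spell out the definition of $\overline{\mathscrsfs{R}}$ ):

```latex
\begin{aligned}
\partial_{a_{i}}\mathscrsfs{R}(a,u)&=m^{-\gamma}\,\partial_{c_{i}}\overline{\mathscrsfs{R}}(c,u),
\qquad
\partial_{t}a_{i}=m^{\gamma}\,\partial_{t}c_{i},\\
\varepsilon\,\partial_{t}a_{i}=-m\,\partial_{a_{i}}\mathscrsfs{R}(a,u)
\;&\Longleftrightarrow\;
\varepsilon m^{\gamma}\,\partial_{t}c_{i}=-m\cdot m^{-\gamma}\,\partial_{c_{i}}\overline{\mathscrsfs{R}}(c,u)
\;\Longleftrightarrow\;
\varepsilon m^{2\gamma}\,\partial_{t}c_{i}=-m\,\partial_{c_{i}}\overline{\mathscrsfs{R}}(c,u).
\end{aligned}
```

With $\gamma=1$ this gives $s=\varepsilon m^{2}$ , which is the scaling listed for the present paper.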

As mentioned already, mean field scaling can exhibit better feature learning properties. In particular, the class of functions studied in the present paper can require a much larger sample size to learn under the classical scaling [37, 20, 47]. The choice of initialization in this paper is the same as in the mean field literature, with the difference that the relative learning rate $s$ is a factor $\varepsilon$ smaller, hence making it, in a sense, slightly closer to the classical scaling. It would be interesting to explore other scalings as well.

We also note that, while the limit of small $\varepsilon$ is interesting, setting directly $\varepsilon=0$ leads to a singular behavior (no matter how we rescale time, in this case learning takes place instantly, up to a certain critical degree). Formally, setting $\varepsilon=0$ corresponds to keeping second layer weights equal to their optimal values: a correct analysis of this case requires accounting for the role of the stepsize and not just using the gradient flow approximation.

More complex network models.

The choice of the neural network model in this paper was mainly dictated by the desire to avoid inessential technicalities. It would be important to move towards more realistic models.

First, we used projected gradient descent to constrain the weights’ norms $\|u_{i}\|=1$ . While this is a common theoretical device in studying single-index models [4, 11], we believe that techniques developed here can be extended to the more general case. Analogously, we could add biases to the network architecture and hence replace Eq. (2) by

f(x;a,u,b)=\frac{1}{m}\sum_{i=1}^{m}a_{i}\sigma(\langle u_{i},x\rangle+b_{i}),\qquad\ a_{1},b_{1},\cdots,a_{m},b_{m}\in\mathbb{R},\ u_{1},\cdots,u_{m}\in{\mathbb{R}}^{d}.

(121)

With this change, the limiting mean-field dynamics will be an autonomous ODE system of $(a_{i}(t),b_{i}(t),s_{i}(t),r_{i}(t))_{i=1}^{m}$ where $r_{i}(t)=\left\|{u_{i}(t)}\right\|_{2}$ . We expect that its evolution will be qualitatively similar to that of the simplified dynamics considered in the paper.

Second, the single-index model studied here is a simple example of target function which requires feature learning. An obvious generalization is to consider multi-index models, as already discussed in Remark 4.3.

Finally, it would be interesting to generalize our analysis to classification losses.

Acknowledgments

This work was supported by the NSF through award DMS-2031883, the Simons Foundation through Award 814639 for the Collaboration on the Theoretical Foundations of Deep Learning, the NSF grant CCF-2006489 and the ONR grant N00014-18-1-2729, and a grant from Eric and Wendy Schmidt at the Institute for Advanced Studies. Part of this work was carried out while Andrea Montanari was on partial leave from Stanford and a Chief Scientist at Ndata Inc dba Project N. The present research is unrelated to AM’s activity while on leave.

References

Abbe et al. [2022] Emmanuel Abbe, Enric Boix Adsera, and Theodor Misiakiewicz. The merged-staircase property: a necessary and nearly sufficient condition for sgd learning of sparse functions on two-layer neural networks. In Conference on Learning Theory, pages 4782–4887. PMLR, 2022.
Ambrosio et al. [2005] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2005.
Arnaboldi et al. [2023] Luca Arnaboldi, Ludovic Stephan, Florent Krzakala, and Bruno Loureiro. From high-dimensional & mean-field dynamics to dimensionless odes: A unifying approach to sgd in two-layers networks. arXiv preprint arXiv:2302.05882, 2023.
Arous et al. [2021] Gerard Ben Arous, Reza Gheissari, and Aukosh Jagannath. Online stochastic gradient descent on non-convex losses from high-dimensional inference. The Journal of Machine Learning Research, 22(1):4788–4838, 2021.
Arpit et al. [2017] Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. In International conference on machine learning, pages 233–242. PMLR, 2017.
Ba et al. [2022] Jimmy Ba, Murat A Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, and Greg Yang. High-dimensional asymptotics of feature learning: How one gradient step improves the representation. In Advances in Neural Information Processing Systems, 2022.
Baldi and Hornik [1989] Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural networks, 2(1):53–58, 1989.
Barak et al. [2022] Boaz Barak, Benjamin L Edelman, Surbhi Goel, Sham Kakade, Eran Malach, and Cyril Zhang. Hidden progress in deep learning: Sgd learns parities near the computational limit. arXiv:2207.08799, 2022.
Bartlett et al. [2021] Peter L Bartlett, Andrea Montanari, and Alexander Rakhlin. Deep learning: a statistical viewpoint. Acta numerica, 30:87–201, 2021.
Berglund [2001] Nils Berglund. Perturbation theory of dynamical systems. arXiv preprint math/0111178, 2001.
Bietti et al. [2022] Alberto Bietti, Joan Bruna, Clayton Sanford, and Min Jae Song. Learning single-index models with shallow neural networks. Advances in Neural Information Processing Systems, 35:9768–9783, 2022.
Bodin and Macris [2021] Antoine Bodin and Nicolas Macris. Model, sample, and epoch-wise descents: exact solution of gradient flow in the random feature model. Advances in Neural Information Processing Systems, 34:21605–21617, 2021.
Chizat and Bach [2018] Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in neural information processing systems, 31, 2018.
Damian et al. [2022] Alexandru Damian, Jason Lee, and Mahdi Soltanolkotabi. Neural networks can learn representations with gradient descent. In Conference on Learning Theory, pages 5413–5452. PMLR, 2022.
[15] Encyclopedia of Mathematics. Bernoulli equation. http://encyclopediaofmath.org/index.php?title=Bernoulli_equation&oldid=40764.
Frye and Efthimiou [2012] Christopher Frye and Costas J Efthimiou. Spherical harmonics in p dimensions. arXiv preprint arXiv:1205.3548, 2012.
Fukumizu and Amari [2000] Kenji Fukumizu and Shun-ichi Amari. Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural networks, 13(3):317–327, 2000.
Ghorbani et al. [2020a] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Discussion of:“nonparametric regression using deep neural networks with relu activation function”. The Annals of Statistics, 48(4), 2020a.
Ghorbani et al. [2020b] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. When do neural networks outperform kernel methods? Advances in Neural Information Processing Systems, 33:14820–14830, 2020b.
Ghorbani et al. [2021] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-layers neural networks in high dimension. The Annals of Statistics, 49(2):1029–1054, 2021.
Ghosh et al. [2021] Nikhil Ghosh, Song Mei, and Bin Yu. The three stages of learning dynamics in high-dimensional kernel methods. In International Conference on Learning Representations, 2021.
Gissin et al. [2019] Daniel Gissin, Shai Shalev-Shwartz, and Amit Daniely. The implicit bias of depth: How incremental learning drives generalization. arXiv preprint arXiv:1909.12051, 2019.
Glorot and Bengio [2010] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010.
He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
Holmes [2013] Mark Holmes. Introduction to Perturbation Methods. Springer Texts in Applied Mathematics, 2013.
Jacot et al. [2018] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
Jin et al. [2019] Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade, and Michael I Jordan. A short note on concentration inequalities for random vectors with subgaussian norm. arXiv preprint arXiv:1902.03736, 2019.
Kalimeris et al. [2019] Dimitris Kalimeris, Gal Kaplun, Preetum Nakkiran, Benjamin Edelman, Tristan Yang, Boaz Barak, and Haofeng Zhang. Sgd on neural networks learns functions of increasing complexity. Advances in neural information processing systems, 32, 2019.
LeCun et al. [2002] Yann LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural networks: Tricks of the trade, pages 9–50. Springer, 2002.
Li et al. [2020] Zhiyuan Li, Yuping Luo, and Kaifeng Lyu. Towards resolving the implicit bias of gradient descent for matrix factorization: Greedy low-rank learning. arXiv preprint arXiv:2012.09839, 2020.
Mei et al. [2018a] Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for nonconvex losses. The Annals of Statistics, 46(6A):2747–2774, 2018a.
Mei et al. [2018b] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018b.
Mei et al. [2019] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. In Conference on Learning Theory, pages 2388–2464. PMLR, 2019.
Mei et al. [2022] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Generalization error of random feature and kernel methods: hypercontractivity and kernel matrix concentration. Applied and Computational Harmonic Analysis, 59:3–84, 2022.
Montanari and Zhong [2022] Andrea Montanari and Yiqiao Zhong. The interpolation phase transition in neural networks: Memorization and generalization under lazy training. The Annals of Statistics, 50(5):2816–2847, 2022.
O’Donnell [2014] Ryan O’Donnell. Analysis of boolean functions. Cambridge University Press, 2014.
Oymak and Soltanolkotabi [2020] Samet Oymak and Mahdi Soltanolkotabi. Toward moderate overparameterization: Global convergence guarantees for training shallow neural networks. IEEE Journal on Selected Areas in Information Theory, 1(1):84–105, 2020.
Pinkus [1999] Allan Pinkus. Approximation theory of the mlp model in neural networks. Acta numerica, 8:143–195, 1999.
Power et al. [2022] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv:2201.02177, 2022.
Rotskoff and Vanden-Eijnden [2018] Grant Rotskoff and Eric Vanden-Eijnden. Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks. Advances in neural information processing systems, 31, 2018.
Saad and Solla [1995] David Saad and Sara A Solla. On-line learning in soft committee machines. Physical Review E, 52(4):4225, 1995.
Santambrogio [2015] Filippo Santambrogio. Optimal transport for applied mathematicians. Birkhäuser, NY, 55(58-63):94, 2015.
Saxe et al. [2013] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
Wei et al. [2008] Haikun Wei, Jun Zhang, Florent Cousseau, Tomoko Ozeki, and Shun-ichi Amari. Dynamics of learning near singularities in layered networks. Neural computation, 20(3):813–843, 2008.
Yang and Hu [2020] Greg Yang and Edward J Hu. Feature learning in infinite-width neural networks. arXiv:2011.14522, 2020.
Yang and Hu [2021] Greg Yang and Edward J Hu. Tensor programs iv: Feature learning in infinite-width neural networks. In International Conference on Machine Learning, pages 11727–11737. PMLR, 2021.
Yehudai and Shamir [2019] Gilad Yehudai and Ohad Shamir. On the power and limitations of random features for understanding neural networks. Advances in Neural Information Processing Systems, 32, 2019.
Yoshida and Okada [2019] Yuki Yoshida and Masato Okada. Data-dependence of plateau phenomenon in learning with neural network—statistical mechanical analysis. Advances in Neural Information Processing Systems, 32, 2019.

Appendix A Proof of Proposition 1

By standard approximation theory arguments [38], it is sufficient to show that there exists an integrable function $a_{d}\in L^{1}(\mathbb{S}^{d-1},\mu_{0})$ such that

\displaystyle\lim_{d\to\infty}\mathbb{E}\big{\{}\big{(}\int a_{d}(u)\,\sigma(\langle u,x\rangle)\,\mu_{0}(\mathrm{d}u)-\varphi(\langle u_{*},x\rangle)\big{)}^{2}\big{\}}=0\,.

(122)

(We denote by $\mu_{0}$ the uniform probability measure over $\mathbb{S}^{d-1}$ .)

Denote by $P_{d,k}$ the Gegenbauer polynomial of order $d$ and degree $k$ (see, e.g., [34]). Namely, $(P_{d,k}:k\geq 0)$ form an orthogonal system with respect to the measure with density $\propto(1-t^{2})^{(d-3)/2}$ , $t\in[-1,1]$ . Recall that for fixed $v,w$ of norm $1$ , the polynomials $P_{d,j}(\langle v,u\rangle),P_{d,k}(\langle w,u\rangle)$ are spherical harmonics satisfying

\displaystyle\int P_{d,j}(\langle v,u\rangle)P_{d,k}(\langle w,u\rangle)\,\mu_{0}(\mathrm{d}u)=\delta_{kj}P_{d,k}(\langle v,w\rangle)\,.

(123)

Also, $P_{d,k}(1)=B_{d,k}$ is the dimension of the space of spherical harmonics of degree $k$ , whence $(P_{d,k}(\cdot)/B_{d,k}^{1/2}:\,k\geq 0)$ form an orthonormal set. We will denote by $c_{d,k}(\sigma)$ the $k$ -th coefficient of the expansion of $\sigma(\,.\,\sqrt{d})$ in this basis, and similarly for $\varphi(\,.\,\sqrt{d})$ , with coefficients $c_{d,k}(\varphi)$ , namely

	$\displaystyle\sigma(t\sqrt{d})$	$\displaystyle=\sum_{k=0}^{\infty}\frac{c_{d,k}(\sigma)}{B_{d,k}^{1/2}}\,P_{d,k}(t)\,,$
	$\displaystyle\varphi(t\sqrt{d})$	$\displaystyle=\sum_{k=0}^{\infty}\frac{c_{d,k}(\varphi)}{B_{d,k}^{1/2}}\,P_{d,k}(t)\,.$

As shown for instance in [34], $\lim_{d\to\infty}c_{d,k}(\sigma)=c_{k}(\sigma)$ is the $k$ -th Hermite coefficient of $\sigma$ and similarly for $c_{d,k}(\varphi)$ . In particular, $c_{d,k}(\sigma)\neq 0$ for all $d$ large enough. For $N$ a large integer let

\displaystyle a_{d}(u)=\sum_{k=0}^{N}\frac{c_{d,k}(\varphi)}{c_{d,k}(\sigma)}P_{d,k}(\langle u,u_{*}\rangle)\,.

By Eq. (123), we have, for $\|z\|=\sqrt{d}$ ,

	$\displaystyle\int a_{d}(u)\,\sigma(\langle u,z\rangle)\,\mu_{0}(\mathrm{d}u)$	$\displaystyle=\sum_{k=0}^{N}\frac{c_{d,k}(\varphi)}{c_{d,k}(\sigma)}\frac{c_{d,k}(\sigma)}{B_{d,k}^{1/2}}P_{d,k}(\langle u_{*},z\rangle/\sqrt{d})$
		$\displaystyle=\sum_{k=0}^{N}\frac{c_{d,k}(\varphi)}{B_{d,k}^{1/2}}P_{d,k}(\langle u_{*},z\rangle/\sqrt{d})\,.$

Denoting by $z$ a uniform random vector on the sphere of radius $\sqrt{d}$ , and $r=\|x\|_{2}/\sqrt{d}$ , we have

	$\displaystyle\mathbb{E}\big{\{}\big{(}\int a_{d}(u)\,\sigma(\langle u,x\rangle)\,\mu_{0}(\mathrm{d}u)-$	$\displaystyle\varphi(\langle u_{*},x\rangle)\big{)}^{2}\big{\}}=\mathbb{E}\big{\{}\big{(}\int a_{d}(u)\,\sigma(r\langle u,z\rangle)\,\mu_{0}(\mathrm{d}u)-\varphi(r\langle u_{*},z\rangle)\big{)}^{2}\big{\}}$
		$\displaystyle\stackrel{{\scriptstyle(*)}}{{\leq}}\mathbb{E}\big{\{}\big{(}\int a_{d}(u)\,\sigma(\langle u,z\rangle)\,\mu_{0}(\mathrm{d}u)-\varphi(\langle u_{*},z\rangle)\big{)}^{2}\big{\}}+\frac{C_{N}L^{2}}{d}$
		$\displaystyle\leq\mathbb{E}\big{\{}\varphi_{>N}(\langle u_{*},z\rangle)^{2}\big{\}}+\frac{C_{N}L^{2}}{d}\,,$

where in $(*)$ we used concentration of $\chi$ -squared random variables, Lipschitzness of $\sigma$ and $\varphi$ , and that $\varphi_{>N}(t\sqrt{d})$ is the projection of $\varphi(t\sqrt{d})$ orthogonal to polynomials of degree at most $N$ (with respect to the measure with density proportional to $(1-t^{2})^{(d-3)/2}$ on $[-1,1]$ ). Therefore

\displaystyle\limsup_{d\to\infty}\mathbb{E}\big{\{}\big{(}\int a_{d}(u)\,\sigma(\langle u,x\rangle)\,\mu_{0}(\mathrm{d}u)-\varphi(\langle u_{*},x\rangle)\big{)}^{2}\big{\}}\leq\sum_{k=N+1}^{\infty}c_{k}(\varphi)^{2}\,.

The claim (122) follows by taking $N\to\infty$ .
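The tail bound $\sum_{k>N}c_{k}(\varphi)^{2}$ appearing in this proof can be explored numerically. The sketch below computes Hermite coefficients by Gauss-Hermite quadrature, assuming the normalization $c_{k}(f)=\mathbb{E}[f(G)\mathrm{He}_{k}(G)]/\sqrt{k!}$ with $G\sim\mathsf{N}(0,1)$ (the paper's convention may differ by a factor). The helper names `hermite_coefficients` and `tail_energy` are ours, not the paper's.

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He

def hermite_coefficients(f, n_coef, n_quad=80):
    """Return c_k(f) = E[f(G) He_k(G)] / sqrt(k!) for k < n_coef, with G ~ N(0, 1)."""
    nodes, weights = He.hermegauss(n_quad)       # quadrature for the weight exp(-t^2 / 2)
    weights = weights / np.sqrt(2.0 * np.pi)     # normalize to the N(0, 1) density
    fvals = f(nodes)
    coefs = []
    for k in range(n_coef):
        basis = np.zeros(k + 1)
        basis[k] = 1.0
        he_k = He.hermeval(nodes, basis)         # probabilists' Hermite polynomial He_k
        coefs.append(np.sum(weights * fvals * he_k) / math.sqrt(math.factorial(k)))
    return np.array(coefs)

def tail_energy(f, N, n_quad=80):
    """Approximate sum_{k > N} c_k(f)^2 as E[f(G)^2] - sum_{k <= N} c_k(f)^2."""
    nodes, weights = He.hermegauss(n_quad)
    weights = weights / np.sqrt(2.0 * np.pi)
    total = np.sum(weights * f(nodes) ** 2)
    c = hermite_coefficients(f, N + 1, n_quad)
    return total - np.sum(c ** 2)
```

For instance, `tail_energy(np.tanh, N)` is non-increasing in `N`, which mirrors the last step of the proof, where the bound vanishes as $N\to\infty$ .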

Appendix B Appendix to Section 4

B.1 Proof of Proposition 2

When $x\sim\mathsf{N}(0,I_{d})$ and $u,u^{\prime}\in\mathbb{S}^{d-1}$ , $\begin{pmatrix}\langle u,x\rangle\\ \langle u^{\prime},x\rangle\end{pmatrix}\sim\mathsf{N}\left(0,\begin{pmatrix}1&\langle u,u^{\prime}\rangle\\ \langle u,u^{\prime}\rangle&1\end{pmatrix}\right)$ . Thus

$\displaystyle\mathscrsfs{R}(a,u)$	$\displaystyle=\frac{1}{2}\mathbb{E}\left(\varphi(\langle u_{*},x\rangle)-\frac{1}{m}\sum_{i=1}^{m}a_{i}\sigma(\langle u_{i},x\rangle)\right)^{2}$
	$\displaystyle=\frac{1}{2}\mathbb{E}\left[\varphi(\langle u_{*},x\rangle)^{2}\right]-\frac{1}{m}\sum_{i=1}^{m}a_{i}\mathbb{E}\left[\varphi(\langle u_{*},x\rangle)\sigma(\langle u_{i},x\rangle)\right]+\frac{1}{2}\frac{1}{m^{2}}\sum_{i,j=1}^{m}a_{i}a_{j}\mathbb{E}\left[\sigma(\langle u_{i},x\rangle)\sigma(\langle u_{j},x\rangle)\right]$
	$\displaystyle=\frac{1}{2}\|\varphi\|^{2}_{L^{2}}-\frac{1}{m}\sum_{i=1}^{m}a_{i}V(\langle u_{*},u_{i}\rangle)+\frac{1}{2}\frac{1}{m^{2}}\sum_{i,j=1}^{m}a_{i}a_{j}U(\langle u_{i},u_{j}\rangle)$	(124)
	$\displaystyle=\frac{1}{2}\|\varphi\|^{2}_{L^{2}}-\frac{1}{m}\sum_{i=1}^{m}a_{i}V(s_{i})+\frac{1}{2}\frac{1}{m^{2}}\sum_{i,j=1}^{m}a_{i}a_{j}U(r_{ij})\,.$

This proves (15). Equation (16) follows directly:

\displaystyle\varepsilon\partial_{t}a_{i}=-m\partial_{a_{i}}\mathscrsfs{R}(a,u)=V(s_{i})-\frac{1}{m}\sum_{j=1}^{m}a_{j}U(r_{ij})\,.

To obtain equations (17)-(19), we now take gradients in (124):

	$\displaystyle\partial_{t}u_{i}$	$\displaystyle=-m(I_{d}-u_{i}u_{i}^{\top})\nabla_{u_{i}}\mathscrsfs{R}(a,u)$
		$\displaystyle=a_{i}\left(I_{d}-u_{i}u_{i}^{\top}\right)\left(V^{\prime}(\langle u_{*},u_{i}\rangle)u_{*}-\frac{1}{m}\sum_{j=1}^{m}a_{j}U^{\prime}(\langle u_{i},u_{j}\rangle)u_{j}\right)$
		$\displaystyle=a_{i}\left(V^{\prime}(\langle u_{*},u_{i}\rangle)(u_{*}-u_{i}u_{i}^{\top}u_{*})-\frac{1}{m}\sum_{j=1}^{m}a_{j}U^{\prime}(\langle u_{i},u_{j}\rangle)(u_{j}-u_{i}u_{i}^{\top}u_{j})\right)$
		$\displaystyle=a_{i}\left(V^{\prime}(s_{i})(u_{*}-s_{i}u_{i})-\frac{1}{m}\sum_{j=1}^{m}a_{j}U^{\prime}(r_{ij})(u_{j}-r_{ij}u_{i})\right)\,.$

Thus

	$\displaystyle\partial_{t}s_{i}$	$\displaystyle=\langle u_{*},\partial_{t}u_{i}\rangle$
		$\displaystyle=a_{i}\left(V^{\prime}(s_{i})(\langle u_{*},u_{*}\rangle-s_{i}\langle u_{*},u_{i}\rangle)-\frac{1}{m}\sum_{j=1}^{m}a_{j}U^{\prime}(r_{ij})(\langle u_{*},u_{j}\rangle-r_{ij}\langle u_{*},u_{i}\rangle)\right)$
		$\displaystyle=a_{i}\left(V^{\prime}(s_{i})(1-s_{i}^{2})-\frac{1}{m}\sum_{j=1}^{m}a_{j}U^{\prime}(r_{ij})(s_{j}-r_{ij}s_{i})\right)\,.$

This gives (17). Finally, a similar computation yields $\partial_{t}r_{ij}=\langle\partial_{t}u_{i},u_{j}\rangle+\langle u_{i},\partial_{t}u_{j}\rangle$ . We compute only the first term, as the second term can be obtained by exchanging $i$ and $j$ :

	$\displaystyle\langle\partial_{t}u_{i},u_{j}\rangle$	$\displaystyle=a_{i}\left(V^{\prime}(s_{i})(\langle u_{j},u_{*}\rangle-s_{i}\langle u_{j},u_{i}\rangle)-\frac{1}{m}\sum_{p=1}^{m}a_{p}U^{\prime}(r_{ip})(\langle u_{j},u_{p}\rangle-r_{ip}\langle u_{j},u_{i}\rangle)\right)$
		$\displaystyle=a_{i}\left(V^{\prime}(s_{i})(s_{j}-s_{i}r_{ij})-\frac{1}{m}\sum_{p=1}^{m}a_{p}U^{\prime}(r_{ip})(r_{jp}-r_{ip}r_{ij})\right)\,.$

Adding the symmetric term $\langle u_{i},\partial_{t}u_{j}\rangle$ , we obtain (18)-(19).
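The projections carried out in this proof can be checked numerically. The sketch below evaluates the vector field for $\partial_{t}u_{i}$ on random unit vectors and compares its inner products with $u_{*}$ and $u_{j}$ against the reduced expressions for $\partial_{t}s_{i}$ and $\langle\partial_{t}u_{i},u_{j}\rangle$ . The functions `dV` and `dU` are arbitrary smooth stand-ins for $V^{\prime}$ and $U^{\prime}$ , chosen only for illustration; they are not the functions defined in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 7
dV = lambda s: 1.0 + 2.0 * s           # toy stand-in for V'(s), illustrative only
dU = lambda r: 0.5 + r ** 2            # toy stand-in for U'(r), illustrative only

u = rng.standard_normal((m, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)
u_star = rng.standard_normal(d)
u_star /= np.linalg.norm(u_star)
a = rng.standard_normal(m)
s, R = u @ u_star, u @ u.T             # s_i = <u_*, u_i>, r_ij = <u_i, u_j>

W = a[None, :] * dU(R)                 # W[i, j] = a_j U'(r_ij)
row = (W * R).sum(axis=1)              # sum_j a_j U'(r_ij) r_ij

# d/dt u_i = a_i ( V'(s_i)(u_* - s_i u_i) - (1/m) sum_j a_j U'(r_ij)(u_j - r_ij u_i) )
du = a[:, None] * (dV(s)[:, None] * (u_star[None, :] - s[:, None] * u)
                   - (W @ u - row[:, None] * u) / m)

# Reduced expressions for d/dt s_i and for <d/dt u_i, u_j>
ds = a * (dV(s) * (1.0 - s ** 2) - (W @ s - row * s) / m)
half = a[:, None] * (dV(s)[:, None] * (s[None, :] - s[:, None] * R)
                     - (W @ R - row[:, None] * R) / m)

print(np.allclose(du @ u_star, ds), np.allclose(du @ u.T, half))
```

The same vector field is tangent to the sphere, i.e. $\langle u_{i},\partial_{t}u_{i}\rangle=0$ , which is why the constraint $\|u_{i}\|_{2}=1$ is preserved along the flow.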

B.2 Proof of Corollary 1

First, note that in the proof of Lemma 1, we obtain the following a priori estimate on the magnitude of the $a_{i}^{0}$ ’s:

\sup_{1\leq i\leq m}\left|a_{i}^{0}(t)\right|\leq M\left(1+\frac{t}{\varepsilon}\right),\ \forall t\geq 0,

(125)

where $M$ only depends on the $M_{i}$ ’s in Assumptions A1-A3. Using a similar argument as that in the proof of Proposition 3, we obtain that for any $t\in[0,T]$ and $i\in[m]$ ,

	$\displaystyle\left|\partial_{t}(a_{i}-a_{i}^{0})\right|\leq\,$	$\displaystyle\frac{M}{\varepsilon}\left(\left|s_{i}-s_{i}^{0}\right|+\frac{1}{m}\sum_{j=1}^{m}\left|a_{j}-a_{j}^{0}\right|\right)+\frac{M(1+t/\varepsilon)}{\varepsilon}\cdot\frac{1}{m}\sum_{j=1}^{m}\left|r_{ij}-r_{ij}^{0}\right|,$
	$\displaystyle\left|\partial_{t}(s_{i}-s_{i}^{0})\right|\leq\,$	$\displaystyle M(1+t/\varepsilon)\cdot\left(\left|a_{i}-a_{i}^{0}\right|+\frac{1}{m}\sum_{j=1}^{m}\left|a_{j}-a_{j}^{0}\right|\right)$
		$\displaystyle+M(1+t/\varepsilon)^{2}\cdot\left(\left|s_{i}-s_{i}^{0}\right|+\frac{1}{m}\sum_{j=1}^{m}\left(\left|s_{j}-s_{j}^{0}\right|+\left|r_{ij}-r_{ij}^{0}\right|\right)\right),$

and for $1\leq i\neq j\leq m$ ,

	$\displaystyle\left|\partial_{t}(r_{ij}-r_{ij}^{0})\right|\leq\,$	$\displaystyle M(1+t/\varepsilon)\left(\left|a_{i}-a_{i}^{0}\right|+\left|a_{j}-a_{j}^{0}\right|+\left|s_{i}-s_{i}^{0}\right|+\left|s_{j}-s_{j}^{0}\right|+\frac{1}{m}\sum_{p=1}^{m}\left|a_{p}-a_{p}^{0}\right|\right)$
		$\displaystyle+M(1+t/\varepsilon)^{2}\cdot\left(\left|r_{ij}-r_{ij}^{0}\right|+\frac{1}{m}\sum_{p=1}^{m}\left(\left|r_{ip}-r_{ip}^{0}\right|+\left|r_{jp}-r_{jp}^{0}\right|\right)\right).$

Therefore, we deduce that

	$\displaystyle\partial_{t}\sum_{i=1}^{m}(a_{i}-a_{i}^{0})^{2}\leq\,$	$\displaystyle\frac{M}{\varepsilon}\sum_{i=1}^{m}(s_{i}-s_{i}^{0})^{2}+\frac{M(1+t/\varepsilon)}{\varepsilon}\left(\sum_{i=1}^{m}(a_{i}-a_{i}^{0})^{2}+\frac{1}{m}\sum_{i,j=1}^{m}(r_{ij}-r_{ij}^{0})^{2}\right),$
	$\displaystyle\partial_{t}\sum_{i=1}^{m}(s_{i}-s_{i}^{0})^{2}\leq\,$	$\displaystyle M(1+t/\varepsilon)\sum_{i=1}^{m}(a_{i}-a_{i}^{0})^{2}+M(1+t/\varepsilon)^{2}\left(\sum_{i=1}^{m}(s_{i}-s_{i}^{0})^{2}+\frac{1}{m}\sum_{i,j=1}^{m}(r_{ij}-r_{ij}^{0})^{2}\right),$
	$\displaystyle\partial_{t}\left(\frac{1}{m}\sum_{i,j=1}^{m}(r_{ij}-r_{ij}^{0})^{2}\right)\leq\,$	$\displaystyle M(1+t/\varepsilon)\left(\sum_{i=1}^{m}(a_{i}-a_{i}^{0})^{2}+\sum_{i=1}^{m}(s_{i}-s_{i}^{0})^{2}\right)+M(1+t/\varepsilon)^{2}\cdot\frac{1}{m}\sum_{i,j=1}^{m}(r_{ij}-r_{ij}^{0})^{2}.$

Defining

G(t)=\sum_{i=1}^{m}(a_{i}(t)-a_{i}^{0}(t))^{2}+\sum_{i=1}^{m}(s_{i}(t)-s_{i}^{0}(t))^{2}+\frac{1}{m}\sum_{i,j=1}^{m}(r_{ij}(t)-r_{ij}^{0}(t))^{2},

we obtain that $G^{\prime}(t)\leq(M(1+t)^{2}/\varepsilon^{2})G(t)$ . Applying Grönwall’s inequality yields

G(t)\leq G(0)\exp\left(\int_{0}^{t}(M(1+s)^{2}/\varepsilon^{2}){\rm d}s\right)\leq G(0)\exp\left(Mt(1+t)^{2}/\varepsilon^{2}\right),\ \forall t\in[0,T].
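For completeness, the two elementary steps behind this bound can be written out as follows. The reduction of the coefficients $(1+t/\varepsilon)^{2}$ and $(1+t/\varepsilon)/\varepsilon$ to $(1+t)^{2}/\varepsilon^{2}$ appears to use $\varepsilon\leq 1$ ; this is our reading rather than something stated explicitly at this point of the proof.

```latex
\begin{aligned}
\left(1+\frac{t}{\varepsilon}\right)^{2}&\leq\frac{(1+t)^{2}}{\varepsilon^{2}},
\qquad
\frac{1+t/\varepsilon}{\varepsilon}\leq\frac{(1+t)^{2}}{\varepsilon^{2}}
\qquad(\text{assuming }0<\varepsilon\leq 1,\ t\geq 0),\\
\int_{0}^{t}(1+s)^{2}\,\mathrm{d}s&=\frac{(1+t)^{3}-1}{3}=t\left(1+t+\frac{t^{2}}{3}\right)\leq t(1+t)^{2}.
\end{aligned}
```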

Recall that $\{\langle u_{i}(0),u_{*}\rangle\}_{i\in[m]}\sim_{\mathrm{i.i.d.}}{\mathcal{N}}(0,1/d)$ and, for any $i\in[m]$ , $\{\langle u_{i}(0),u_{j}(0)\rangle\}_{j\neq i}\sim_{\mathrm{i.i.d.}}{\mathcal{N}}(0,1/d)$ . Using standard concentration inequalities, we know that

G(0)=\sum_{i=1}^{m}\langle u_{i}(0),u_{*}\rangle^{2}+\frac{1}{m}\sum_{i\neq j}\langle u_{i}(0),u_{j}(0)\rangle^{2}\leq C\frac{m}{d}

(126)

with probability at least $1-\exp(-C^{\prime}m)$ , where $C$ and $C^{\prime}$ are both absolute constants. Therefore,

$\displaystyle\sup_{t\in[0,T]}\|a(t)-a^{0}(t)\|_{2}\leq\,$	$\displaystyle C\sqrt{\frac{m}{d}}\exp\left(MT(1+T)^{2}/\varepsilon^{2}\right),$	(127)
$\displaystyle\sup_{t\in[0,T]}\|s(t)-s^{0}(t)\|_{2}\leq\,$	$\displaystyle C\sqrt{\frac{m}{d}}\exp\left(MT(1+T)^{2}/\varepsilon^{2}\right),$	(128)
$\displaystyle\sup_{t\in[0,T]}\|R(t)-R^{0}(t)\|_{\rm F}\leq\,$	$\displaystyle C\frac{m}{\sqrt{d}}\exp\left(MT(1+T)^{2}/\varepsilon^{2}\right).$	(129)
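The $m/d$ scaling of $G(0)$ in Eq. (126), which drives the bounds (127)-(129), is easy to observe in simulation. The sketch below draws the $u_{i}(0)$ uniformly on the unit sphere and evaluates $G(0)$ directly. The function name `initial_gap` and the choice of $u_{*}$ as the first basis vector are ours; since inner products of uniform unit vectors are only approximately $\mathcal{N}(0,1/d)$ , this is a heuristic illustration and not a substitute for the concentration argument.

```python
import numpy as np

def initial_gap(m, d, seed=0):
    """G(0) = sum_i <u_i(0), u_*>^2 + (1/m) sum_{i != j} <u_i(0), u_j(0)>^2."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((m, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)   # u_i(0) uniform on the unit sphere
    u_star = np.zeros(d)
    u_star[0] = 1.0                                 # by rotational invariance, any unit vector works
    s0 = u @ u_star
    R0 = u @ u.T
    off_diag = R0 - np.diag(np.diag(R0))            # drop the diagonal terms r_ii = 1
    return np.sum(s0 ** 2) + np.sum(off_diag ** 2) / m

m, d = 200, 500
print(initial_gap(m, d) / (m / d))   # stays of order one as m and d vary
```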

Next, we upper bound the risk difference. By direct calculation,

		$\displaystyle\big{|}\mathscrsfs{R}(a(t),u(t))-\mathscrsfs{R}_{\mbox{\tiny\rm red}}(a^{0}(t),s^{0}(t),R^{0}(t))\big{|}$
	$\displaystyle=\,$	$\displaystyle\big{|}\mathscrsfs{R}_{\mbox{\tiny\rm red}}(a(t),s(t),R(t))-\mathscrsfs{R}_{\mbox{\tiny\rm red}}(a^{0}(t),s^{0}(t),R^{0}(t))\big{|}$
	$\displaystyle\leq\,$	$\displaystyle\frac{1}{m}\left|a^{\top}V(s)-(a^{0})^{\top}V(s^{0})\right|+\frac{1}{2m^{2}}\left|a^{\top}U(R)a-(a^{0})^{\top}U(R^{0})a^{0}\right|$
	$\displaystyle\leq\,$	$\displaystyle\frac{1}{m}\left(\sqrt{m}\left\|{V}\right\|_{\infty}\left\|{a-a^{0}}\right\|_{2}+\left\|{a^{0}}\right\|_{2}\left\|{V^{\prime}}\right\|_{\infty}\left\|{s-s^{0}}\right\|_{2}\right)$
		$\displaystyle+\frac{1}{2m^{2}}\left(\left\|{U(R)-U(R^{0})}\right\|_{\mathrm{op}}\left\|{a^{0}}\right\|_{2}^{2}+2\left\|{U(R)}\right\|_{\mathrm{op}}\left\|{a^{0}}\right\|_{2}\left\|{a-a^{0}}\right\|_{2}\right)$
	$\displaystyle\leq\,$	$\displaystyle\frac{M(1+t/\varepsilon)}{\sqrt{m}}\left(\left\|{a-a^{0}}\right\|_{2}+\left\|{s-s^{0}}\right\|_{2}\right)+\frac{M(1+t/\varepsilon)^{2}}{2m}\left\|{R-R^{0}}\right\|_{\rm F}$
	$\displaystyle\leq\,$	$\displaystyle\frac{M}{\sqrt{d}}\exp\left(Mt(1+t)^{2}/\varepsilon^{2}\right)$

with probability at least $1-\exp(-C^{\prime}m)$ , where the constant $M$ only depends on the $M_{i}$ ’s from Assumptions A1-A3. The conclusion now follows from taking the supremum over all $t\in[0,T]$ . This completes the proof of Corollary 1.

B.3 Proof of Proposition 3

We consider $r_{ij}^{\perp}=r_{ij}-s_{i}s_{j}=\langle u_{i},u_{j}\rangle-\langle u_{i},u_{*}\rangle\langle u_{*},u_{j}\rangle$ , the part of the inner product between $u_{i}$ and $u_{j}$ that lies outside the relevant subspace spanned by $u_{*}$ . We show that these variables satisfy the ODEs

\begin{split}\partial_{t}r_{ij}^{\perp}=\,&-a_{i}\left(V^{\prime}(s_{i})\cdot s% _{i}r_{ij}^{\perp}+\frac{1}{m}\sum_{p=1}^{m}a_{p}U^{\prime}(r_{ip})(r_{jp}^{% \perp}-r_{ip}r_{ij}^{\perp})\right)\\ \,&-a_{j}\left(V^{\prime}(s_{j})\cdot s_{j}r_{ij}^{\perp}+\frac{1}{m}\sum_{p=1% }^{m}a_{p}U^{\prime}(r_{jp})(r_{ip}^{\perp}-r_{jp}r_{ij}^{\perp})\right)\,.% \end{split}

(130)

By definition of $r_{ij}^{\perp}$ , we readily see that

\partial_{t}r_{ij}^{\perp}=\partial_{t}r_{ij}-s_{i}\partial_{t}s_{j}-s_{j}% \partial_{t}s_{i}.

Plugging in Eq.s (17) to (19) gives that

	$\displaystyle\partial_{t}r_{ij}^{\perp}=\,$	$\displaystyle a_{i}\left(V^{\prime}(s_{i})(s_{j}s_{i}^{2}-s_{i}r_{ij})-\frac{1% }{m}\sum_{p=1}^{m}a_{p}U^{\prime}(r_{ip})(r_{jp}-s_{j}s_{p}-r_{ip}r_{ij}+s_{i}% s_{j}r_{ip})\right)$
		$\displaystyle+a_{j}\left(V^{\prime}(s_{j})(s_{i}s_{j}^{2}-s_{j}r_{ij})-\frac{1% }{m}\sum_{p=1}^{m}a_{p}U^{\prime}(r_{jp})(r_{ip}-s_{i}s_{p}-r_{jp}r_{ij}+s_{i}% s_{j}r_{jp})\right)$
	$\displaystyle=\,$	$\displaystyle-a_{i}\left(V^{\prime}(s_{i})s_{i}r_{ij}^{\perp}+\frac{1}{m}\sum_% {p=1}^{m}a_{p}U^{\prime}(r_{ip})(r_{jp}^{\perp}-r_{ip}r_{ij}^{\perp})\right)$
		$\displaystyle-a_{j}\left(V^{\prime}(s_{j})s_{j}r_{ij}^{\perp}+\frac{1}{m}\sum_% {p=1}^{m}a_{p}U^{\prime}(r_{jp})(r_{ip}^{\perp}-r_{jp}r_{ij}^{\perp})\right).$

This proves Eq. (130).
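The last equality above is purely algebraic, so it can be sanity-checked numerically. The following is a minimal sketch, assuming random test data and placeholder derivatives `dV`, `dU` (these choices are ours and are not taken from the paper); it compares the expression obtained after plugging in the ODEs with the right-hand side of Eq. (130).

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 7

# Random unit vectors u_1..u_m and u_* (hypothetical test data).
u = rng.normal(size=(m, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)
u_star = rng.normal(size=d)
u_star /= np.linalg.norm(u_star)
a = rng.normal(size=m)

s = u @ u_star               # s_i = <u_i, u_*>
r = u @ u.T                  # r_ij = <u_i, u_j>
r_perp = r - np.outer(s, s)  # r_ij^perp = r_ij - s_i s_j

# Placeholder derivatives V', U' (assumptions; the identity holds for any choice).
dV = np.cos
dU = np.tanh

def half_before(i, j):
    """a_i-term of the expression obtained after plugging in the ODEs."""
    total = sum(
        a[p] * dU(r[i, p])
        * (r[j, p] - s[j] * s[p] - r[i, p] * r[i, j] + s[i] * s[j] * r[i, p])
        for p in range(m)
    )
    return a[i] * (dV(s[i]) * (s[j] * s[i] ** 2 - s[i] * r[i, j]) - total / m)

def half_after(i, j):
    """a_i-term of the right-hand side of Eq. (130)."""
    total = sum(
        a[p] * dU(r[i, p]) * (r_perp[j, p] - r[i, p] * r_perp[i, j])
        for p in range(m)
    )
    return -a[i] * (dV(s[i]) * s[i] * r_perp[i, j] + total / m)

# The a_j-term is the a_i-term with the roles of i and j exchanged.
before = np.array([[half_before(i, j) + half_before(j, i) for j in range(m)] for i in range(m)])
after = np.array([[half_after(i, j) + half_after(j, i) for j in range(m)] for i in range(m)])
max_gap = float(np.max(np.abs(before - after)))
```

Here `max_gap` should be at the level of floating-point round-off, reflecting that the two expressions agree term by term.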

Lemma 1.

If Assumptions A1-A3 hold, then we have for any fixed $T>0$ :

\sup_{t\in[0,T]}\sum_{i,j=1}^{m}r_{ij}^{\perp}(t)^{2}\leq m\exp\left(MT\left(1% +T\right)^{2}/\varepsilon^{2}\right).

Proof.

To begin with, using Eq. (130), we obtain that

	$\displaystyle\partial_{t}\left(\sum_{i,j=1}^{m}r_{ij}^{\perp}(t)^{2}\right)$	$\displaystyle=2\sum_{i,j=1}^{m}r_{ij}^{\perp}\times\partial_{t}r_{ij}^{\perp}$
		$\displaystyle=-4\sum_{i,j=1}^{m}a_{i}r_{ij}^{\perp}\left(V^{\prime}(s_{i})% \cdot s_{i}r_{ij}^{\perp}+\frac{1}{m}\sum_{p=1}^{m}a_{p}U^{\prime}(r_{ip})(r_{% jp}^{\perp}-r_{ip}r_{ij}^{\perp})\right).$

Using the ODEs for the $a_{i}$ ’s, we obtain that

	$\displaystyle\left|\partial_{t}a_{i}\right|=\,$	$\displaystyle\frac{1}{\varepsilon}\left|V(s_{i})-\frac{1}{m}\sum_{j=1}^{m}a_{j}U(r_{ij})\right|=\frac{1}{\varepsilon}\left|\mathbb{E}\left[\varphi(\langle u_{*},x\rangle)\sigma(\langle u_{i},x\rangle)\right]-\frac{1}{m}\sum_{j=1}^{m}a_{j}\mathbb{E}\left[\sigma(\langle u_{i},x\rangle)\sigma(\langle u_{j},x\rangle)\right]\right|$
	$\displaystyle=\,$	$\displaystyle\frac{1}{\varepsilon}\left|\mathbb{E}\left[\sigma(\langle u_{i},x\rangle)\left(y-f(x;a,u)\right)\right]\right|\leq\frac{1}{\varepsilon}\mathbb{E}\left[\sigma(\langle u_{i},x\rangle)^{2}\right]^{1/2}\mathbb{E}\left[\left(y-f(x;a,u)\right)^{2}\right]^{1/2}$
	$\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}\,$	$\displaystyle\frac{1}{\varepsilon}M_{2}\sqrt{2\mathscrsfs{R}(a(0),u(0))}\leq% \frac{M}{\varepsilon},$

where $(i)$ follows from our assumptions and the fact that $\mathscrsfs{R}(a(t),u(t))\leq\mathscrsfs{R}(a(0),u(0))$ , since $\partial_{t}\mathscrsfs{R}(a,u)\leq 0$ by gradient flow equations. Moreover, the constant $M$ only depends on the $M_{i}$ ’s. Since $|a_{i}(0)|\leq M_{1}$ for all $i\in[m]$ , we know that $|a_{i}(t)|\leq M(1+t/\varepsilon)$ for all $t\geq 0$ , thus leading to the following estimate:

	$\displaystyle\partial_{t}\left(\sum_{i,j=1}^{m}r_{ij}^{\perp}(t)^{2}\right)\leq\,$	$\displaystyle 4\sum_{i,j=1}^{m}M\left(1+\frac{t}{\varepsilon}\right)\left|r_{ij}^{\perp}\right|\left(\left\|{V^{\prime}}\right\|_{\infty}\left|r_{ij}^{\perp}\right|+M\left(1+\frac{t}{\varepsilon}\right)\left\|{U^{\prime}}\right\|_{\infty}\left(\frac{1}{m}\sum_{p=1}^{m}\left|r_{jp}^{\perp}\right|+\left|r_{ij}^{\perp}\right|\right)\right)$
	$\displaystyle\leq\,$	$\displaystyle M\left(1+\frac{t}{\varepsilon}\right)^{2}\left(\sum_{i,j=1}^{m}r_{ij}^{\perp}(t)^{2}+\frac{1}{m}\sum_{i,j,p=1}^{m}\left|r_{ij}^{\perp}(t)\right|\cdot\left|r_{jp}^{\perp}(t)\right|\right)$
	$\displaystyle\leq\,$	$\displaystyle M\left(1+\frac{t}{\varepsilon}\right)^{2}\left(\sum_{i,j=1}^{m}r% _{ij}^{\perp}(t)^{2}+\frac{1}{2m}\sum_{i,j,p=1}^{m}\left(r_{ij}^{\perp}(t)^{2}% +r_{jp}^{\perp}(t)^{2}\right)\right)$
	$\displaystyle\leq\,$	$\displaystyle M\left(1+\frac{t}{\varepsilon}\right)^{2}\sum_{i,j=1}^{m}r_{ij}^% {\perp}(t)^{2},$

where the constant $M$ only depends on the $M_{i}$ ’s in our assumptions. At initialization, we know that $\sum_{i,j=1}^{m}r_{ij}^{\perp}(0)^{2}=m$ . Applying Grönwall’s inequality yields that

\sum_{i,j=1}^{m}r_{ij}^{\perp}(t)^{2}\leq m\exp\left(\int_{0}^{t}M\left(1+% \frac{s}{\varepsilon}\right)^{2}\mathrm{d}s\right),\quad\forall t\in[0,T],

which further implies that

\sup_{t\in[0,T]}\sum_{i,j=1}^{m}r_{ij}^{\perp}(t)^{2}\leq m\exp\left(\int_{0}^% {T}M\left(1+\frac{t}{\varepsilon}\right)^{2}\mathrm{d}t\right)\leq m\exp\left(% MT\left(1+T\right)^{2}/\varepsilon^{2}\right).

This completes the proof. ∎
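The last inequality in the proof of Lemma 1 replaces the exact Grönwall exponent by a simpler expression. A minimal numerical sketch of this step is given below, under the assumption (ours, made so that $1+t/\varepsilon\leq(1+t)/\varepsilon$ holds pointwise) that $0<\varepsilon\leq 1$; the function names and the test grid are hypothetical.

```python
def gronwall_exponent(T, eps, M=1.0):
    """Exact value of the integral of M * (1 + t/eps)^2 over [0, T]."""
    return M * eps / 3.0 * ((1.0 + T / eps) ** 3 - 1.0)

def stated_exponent(T, eps, M=1.0):
    """Simplified upper bound M * T * (1 + T)^2 / eps^2 used in Lemma 1."""
    return M * T * (1.0 + T) ** 2 / eps ** 2

# Hypothetical grid of horizons T and scales eps in (0, 1].
checks = [
    (T, eps, gronwall_exponent(T, eps), stated_exponent(T, eps))
    for T in (0.1, 1.0, 5.0)
    for eps in (1.0, 0.3, 0.01)
]
all_ok = all(exact <= bound for _, _, exact, bound in checks)
```

On this grid the exact exponent never exceeds the simplified one, which is all that the final bound of the lemma requires.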

We now return to the proof of Proposition 3 and show that

\sup_{t\in[0,T]}\sum_{i=1}^{m}\left(\left(a_{i}(t)-a_{i}^{\mbox{\tiny\rm mf}}(% t)\right)^{2}+\left(s_{i}(t)-s_{i}^{\mbox{\tiny\rm mf}}(t)\right)^{2}\right)% \leq C(T).

(131)

To this end, we define $S(t)=\sum_{i=1}^{m}\left(\left(a_{i}(t)-a_{i}^{\mbox{\tiny\rm mf}}(t)\right)^{% 2}+\left(s_{i}(t)-s_{i}^{\mbox{\tiny\rm mf}}(t)\right)^{2}\right)$ . By our assumption, $S(0)=0$ . Moreover, using the same technique as in the proof of Lemma 1, we know that $|a_{i}^{\mbox{\tiny\rm mf}}(t)|\leq M(1+t/\varepsilon)$ for all $i\in[m]$ . According to Eq.s (16)-(19) and Eq. (24), we deduce that

		$\displaystyle\left|\partial_{t}(a_{i}-a_{i}^{\mbox{\tiny\rm mf}})\right|=\,\frac{1}{\varepsilon}\cdot\left|V(s_{i})-V(s_{i}^{\mbox{\tiny\rm mf}})-\frac{1}{m}\sum_{j=1}^{m}\left(a_{j}U(r_{ij})-a_{j}^{\mbox{\tiny\rm mf}}U(s_{i}^{\mbox{\tiny\rm mf}}s_{j}^{\mbox{\tiny\rm mf}})\right)\right|$
	$\displaystyle\leq\,$	$\displaystyle\frac{1}{\varepsilon}\cdot\left(\left|V(s_{i})-V(s_{i}^{\mbox{\tiny\rm mf}})\right|+\frac{1}{m}\sum_{j=1}^{m}\left|a_{j}U(r_{ij})-a_{j}U(s_{i}s_{j})\right|+\frac{1}{m}\sum_{j=1}^{m}\left|a_{j}U(s_{i}s_{j})-a_{j}^{\mbox{\tiny\rm mf}}U(s_{i}^{\mbox{\tiny\rm mf}}s_{j}^{\mbox{\tiny\rm mf}})\right|\right)$
	$\displaystyle\leq\,$	$\displaystyle\frac{1}{\varepsilon}\cdot\left(\left\|{V^{\prime}}\right\|_{\infty}|s_{i}-s_{i}^{\mbox{\tiny\rm mf}}|+\frac{M(1+t/\varepsilon)}{m}\left\|{U^{\prime}}\right\|_{\infty}\sum_{j=1}^{m}\left|r_{ij}^{\perp}\right|\right)$
		$\displaystyle+\frac{1}{m\varepsilon}\sum_{j=1}^{m}\left(\left\|{U}\right\|_{\infty}|a_{j}-a_{j}^{\mbox{\tiny\rm mf}}|+M\left(1+\frac{t}{\varepsilon}\right)\left\|{U^{\prime}}\right\|_{\infty}\left(|s_{i}-s_{i}^{\mbox{\tiny\rm mf}}|+|s_{j}-s_{j}^{\mbox{\tiny\rm mf}}|\right)\right)$
	$\displaystyle\leq\,$	$\displaystyle\frac{M}{\varepsilon}\left(1+\frac{t}{\varepsilon}\right)\cdot\left(|s_{i}-s_{i}^{\mbox{\tiny\rm mf}}|+\frac{1}{m}\sum_{j=1}^{m}\left(|s_{j}-s_{j}^{\mbox{\tiny\rm mf}}|+|a_{j}-a_{j}^{\mbox{\tiny\rm mf}}|+\left|r_{ij}^{\perp}\right|\right)\right),$

thus leading to the following estimate:

	$\displaystyle\sum_{i=1}^{m}(a_{i}-a_{i}^{\mbox{\tiny\rm mf}})\cdot\partial_{t}(a_{i}-a_{i}^{\mbox{\tiny\rm mf}})\leq\,$	$\displaystyle\frac{M}{\varepsilon}\left(1+\frac{t}{\varepsilon}\right)\cdot\frac{1}{m}\sum_{i=1}^{m}|a_{i}-a_{i}^{\mbox{\tiny\rm mf}}|\cdot\sum_{j=1}^{m}\left(|a_{j}-a_{j}^{\mbox{\tiny\rm mf}}|+|s_{j}-s_{j}^{\mbox{\tiny\rm mf}}|\right)$
		$\displaystyle+\frac{M}{\varepsilon}\left(1+\frac{t}{\varepsilon}\right)\cdot\left(\sum_{i=1}^{m}|a_{i}-a_{i}^{\mbox{\tiny\rm mf}}|\,|s_{i}-s_{i}^{\mbox{\tiny\rm mf}}|+\frac{1}{m}\sum_{i,j=1}^{m}|a_{i}-a_{i}^{\mbox{\tiny\rm mf}}|\left|r_{ij}^{\perp}\right|\right)$
	$\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}\,$	$\displaystyle\frac{M}{\varepsilon}\left(1+\frac{t}{\varepsilon}\right)\cdot% \left(\sum_{i=1}^{m}(a_{i}-a_{i}^{\mbox{\tiny\rm mf}})^{2}+\sum_{i=1}^{m}(s_{i% }-s_{i}^{\mbox{\tiny\rm mf}})^{2}+\frac{1}{m}\sum_{i,j=1}^{m}\left(r_{ij}^{% \perp}\right)^{2}\right)$
	$\displaystyle\stackrel{{\scriptstyle(ii)}}{{\leq}}\,$	$\displaystyle\frac{M}{\varepsilon}\left(1+\frac{t}{\varepsilon}\right)\cdot% \left(\sum_{i=1}^{m}(a_{i}-a_{i}^{\mbox{\tiny\rm mf}})^{2}+\sum_{i=1}^{m}(s_{i% }-s_{i}^{\mbox{\tiny\rm mf}})^{2}+\exp\left(Mt(1+t)^{2}/\varepsilon^{2}\right)% \right),$

where in $(i)$ we use the Cauchy-Schwarz inequality and the inequality of arithmetic and geometric means, and $(ii)$ follows from the conclusion of Lemma 1. Similarly, we obtain that

	$\displaystyle\left|\partial_{t}(s_{i}-s_{i}^{\mbox{\tiny\rm mf}})\right|\leq\,$	$\displaystyle\left\|{V^{\prime}}\right\|_{\infty}|a_{i}-a_{i}^{\mbox{\tiny\rm mf}}|+M\left(1+\frac{t}{\varepsilon}\right)\left(\left\|{V^{\prime\prime}}\right\|_{\infty}+2\left\|{V^{\prime}}\right\|_{\infty}\right)|s_{i}-s_{i}^{\mbox{\tiny\rm mf}}|$
		$\displaystyle+\frac{1}{m}\left|a_{i}\sum_{j=1}^{m}a_{j}U^{\prime}(r_{ij})(s_{j}-r_{ij}s_{i})-a_{i}\sum_{j=1}^{m}a_{j}U^{\prime}(s_{i}s_{j})(1-s_{i}^{2})s_{j}\right|$
		$\displaystyle+\frac{1}{m}\left|a_{i}\sum_{j=1}^{m}a_{j}U^{\prime}(s_{i}s_{j})(1-s_{i}^{2})s_{j}-a_{i}^{\mbox{\tiny\rm mf}}\sum_{j=1}^{m}a_{j}^{\mbox{\tiny\rm mf}}U^{\prime}(s_{i}^{\mbox{\tiny\rm mf}}s_{j}^{\mbox{\tiny\rm mf}})(1-(s_{i}^{\mbox{\tiny\rm mf}})^{2})s_{j}^{\mbox{\tiny\rm mf}}\right|$
	$\displaystyle\leq\,$	$\displaystyle M\left(1+\frac{t}{\varepsilon}\right)\cdot\left(|a_{i}-a_{i}^{\mbox{\tiny\rm mf}}|+|s_{i}-s_{i}^{\mbox{\tiny\rm mf}}|\right)+\frac{2}{m}\cdot M\left(1+\frac{t}{\varepsilon}\right)\left(\left\|{U^{\prime}}\right\|_{\infty}+\left\|{U^{\prime\prime}}\right\|_{\infty}\right)\sum_{j=1}^{m}|a_{j}|\left|r_{ij}^{\perp}\right|$
		$\displaystyle+M\left(1+\frac{t}{\varepsilon}\right)^{2}\cdot\frac{1}{m}\sum_{j=1}^{m}\left(|a_{j}-a_{j}^{\mbox{\tiny\rm mf}}|+|s_{j}-s_{j}^{\mbox{\tiny\rm mf}}|\right)$
	$\displaystyle\leq\,$	$\displaystyle M\left(1+\frac{t}{\varepsilon}\right)^{2}\cdot\left(|a_{i}-a_{i}^{\mbox{\tiny\rm mf}}|+|s_{i}-s_{i}^{\mbox{\tiny\rm mf}}|+\frac{1}{m}\sum_{j=1}^{m}\left(|s_{j}-s_{j}^{\mbox{\tiny\rm mf}}|+|a_{j}-a_{j}^{\mbox{\tiny\rm mf}}|+\left|r_{ij}^{\perp}\right|\right)\right),$

which further implies that

\sum_{i=1}^{m}(s_{i}-s_{i}^{\mbox{\tiny\rm mf}})\cdot\partial_{t}(s_{i}-s_{i}^% {\mbox{\tiny\rm mf}})\leq M\left(1+\frac{t}{\varepsilon}\right)^{2}\cdot\left(% \sum_{i=1}^{m}(a_{i}-a_{i}^{\mbox{\tiny\rm mf}})^{2}+\sum_{i=1}^{m}(s_{i}-s_{i% }^{\mbox{\tiny\rm mf}})^{2}+\exp\left(Mt(1+t)^{2}/\varepsilon^{2}\right)\right).

Combining the above estimates, we finally deduce that

	$\displaystyle S^{\prime}(t)=\,$	$\displaystyle 2\sum_{i=1}^{m}\left((a_{i}-a_{i}^{\mbox{\tiny\rm mf}})\cdot% \partial_{t}(a_{i}-a_{i}^{\mbox{\tiny\rm mf}})+(s_{i}-s_{i}^{\mbox{\tiny\rm mf% }})\cdot\partial_{t}(s_{i}-s_{i}^{\mbox{\tiny\rm mf}})\right)$
	$\displaystyle\leq\,$	$\displaystyle\frac{M(1+t)(1+t/\varepsilon)}{\varepsilon}\cdot\left(\sum_{i=1}^% {m}(a_{i}-a_{i}^{\mbox{\tiny\rm mf}})^{2}+\sum_{i=1}^{m}(s_{i}-s_{i}^{\mbox{% \tiny\rm mf}})^{2}+\exp\left(Mt(1+t)^{2}/\varepsilon^{2}\right)\right)$
	$\displaystyle=\,$	$\displaystyle\frac{M(1+t)(1+t/\varepsilon)}{\varepsilon}\cdot\left(S(t)+\exp% \left(Mt(1+t)^{2}/\varepsilon^{2}\right)\right).$

Applying Grönwall’s inequality immediately implies

S(t)\leq\exp\left(Mt(1+t)^{2}/\varepsilon^{2}+\int_{0}^{t}\frac{M(1+s)(1+s/% \varepsilon)}{\varepsilon}\mathrm{d}s\right)\leq\exp\left(Mt(1+t)^{2}/% \varepsilon^{2}\right),

(132)

which further leads to Eq. (131) and concludes the proof of Proposition 3. The “consequently” part can be shown via direct calculation, but we include it here for the sake of completeness. By definition, for any $t\in[0,T]$ we have

		$\displaystyle\left|\mathscrsfs{R}_{\mbox{\tiny\rm red}}\left(a(t),s(t),R(t)\right)-\mathscrsfs{R}_{\mbox{\tiny\rm mf}}\left(a^{\mbox{\tiny\rm mf}}(t),s^{\mbox{\tiny\rm mf}}(t)\right)\right|$
	$\displaystyle\leq\,$	$\displaystyle\frac{1}{m}\left|\sum_{i=1}^{m}a_{i}V(s_{i})-\sum_{i=1}^{m}a_{i}^{\mbox{\tiny\rm mf}}V(s_{i}^{\mbox{\tiny\rm mf}})\right|+\frac{1}{2m^{2}}\left|\sum_{i,j=1}^{m}a_{i}a_{j}U(r_{ij})-\sum_{i,j=1}^{m}a_{i}^{\mbox{\tiny\rm mf}}a_{j}^{\mbox{\tiny\rm mf}}U(s_{i}^{\mbox{\tiny\rm mf}}s_{j}^{\mbox{\tiny\rm mf}})\right|$
	$\displaystyle\leq\,$	$\displaystyle\frac{1}{m}\sum_{i=1}^{m}\left(\left\|{V}\right\|_{\infty}\left|a_{i}-a_{i}^{\mbox{\tiny\rm mf}}\right|+M(1+t/\varepsilon)\left\|{V^{\prime}}\right\|_{\infty}\left|s_{i}-s_{i}^{\mbox{\tiny\rm mf}}\right|\right)+\frac{1}{2m^{2}}M(1+t/\varepsilon)^{2}\left\|{U^{\prime}}\right\|_{\infty}\sum_{i,j=1}^{m}\left|r_{ij}^{\perp}\right|$
		$\displaystyle+\frac{1}{2m^{2}}\sum_{i,j=1}^{m}\left(M(1+t/\varepsilon)\left\|{U}\right\|_{\infty}\left(\left|a_{i}-a_{i}^{\mbox{\tiny\rm mf}}\right|+\left|a_{j}-a_{j}^{\mbox{\tiny\rm mf}}\right|\right)+M(1+t/\varepsilon)^{2}\left\|{U^{\prime}}\right\|_{\infty}\left(\left|s_{i}-s_{i}^{\mbox{\tiny\rm mf}}\right|+\left|s_{j}-s_{j}^{\mbox{\tiny\rm mf}}\right|\right)\right)$
	$\displaystyle\leq\,$	$\displaystyle M(1+t/\varepsilon)\frac{1}{m}\sum_{i=1}^{m}\left|a_{i}-a_{i}^{\mbox{\tiny\rm mf}}\right|+M(1+t/\varepsilon)^{2}\frac{1}{m}\sum_{i=1}^{m}\left|s_{i}-s_{i}^{\mbox{\tiny\rm mf}}\right|+M(1+t/\varepsilon)^{2}\frac{1}{m^{2}}\sum_{i,j=1}^{m}\left|r_{ij}^{\perp}\right|$
	$\displaystyle\leq\,$	$\displaystyle M(1+t/\varepsilon)^{2}\cdot\left(\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(\left(a_{i}(t)-a_{i}^{\mbox{\tiny\rm mf}}(t)\right)^{2}+\left(s_{i}(t)-s_{i}^{\mbox{\tiny\rm mf}}(t)\right)^{2}\right)}+\sqrt{\frac{1}{m^{2}}\sum_{i,j=1}^{m}\left|r_{ij}^{\perp}(t)\right|^{2}}\right)$
	$\displaystyle\leq\,$	$\displaystyle\frac{1}{\sqrt{m}}M(1+t/\varepsilon)^{2}\exp\left(Mt(1+t)^{2}/% \varepsilon^{2}\right).$

Therefore,

\sup_{t\in[0,T]}\left|\mathscrsfs{R}_{\mbox{\tiny\rm red}}\left(a(t),s(t),R(t)% \right)-\mathscrsfs{R}_{\mbox{\tiny\rm mf}}\left(a^{\mbox{\tiny\rm mf}}(t),s^{% \mbox{\tiny\rm mf}}(t)\right)\right|\leq\frac{M\exp\left(MT(1+T)^{2}/% \varepsilon^{2}\right)}{\sqrt{m}},

(133)

as desired.

B.4 Derivation of the mean field dynamics (29)

For any bounded continuous $f\in C_{b}(\mathbb{R}^{2})$ , we have

		$\displaystyle\int_{\mathbb{R}^{2}}f(a,s)\partial_{t}\rho_{t}(\mathrm{d}a,% \mathrm{d}s)=\partial_{t}\left(\int_{\mathbb{R}^{2}}f(a,s)\rho_{t}(\mathrm{d}a% ,\mathrm{d}s)\right)=\partial_{t}\left(\frac{1}{m}\sum_{i=1}^{m}f(a_{i}^{\mbox% {\tiny\rm mf}}(t),s_{i}^{\mbox{\tiny\rm mf}}(t))\right)$
	$\displaystyle=\,$	$\displaystyle\frac{1}{m}\sum_{i=1}^{m}\left(\partial_{a}f(a_{i}^{\mbox{\tiny% \rm mf}}(t),s_{i}^{\mbox{\tiny\rm mf}}(t))\cdot\partial_{t}a_{i}^{\mbox{\tiny% \rm mf}}(t)+\partial_{s}f(a_{i}^{\mbox{\tiny\rm mf}}(t),s_{i}^{\mbox{\tiny\rm mf% }}(t))\cdot\partial_{t}s_{i}^{\mbox{\tiny\rm mf}}(t)\right)$
	$\displaystyle\stackrel{{\scriptstyle(i)}}{{=}}\,$	$\displaystyle\frac{1}{m}\sum_{i=1}^{m}\left(\partial_{a}f(a_{i}^{\mbox{\tiny% \rm mf}}(t),s_{i}^{\mbox{\tiny\rm mf}}(t))\cdot\Psi_{a}\left(a_{i}^{\mbox{% \tiny\rm mf}}(t),s_{i}^{\mbox{\tiny\rm mf}}(t);\rho_{t}\right)+\partial_{s}f(a% _{i}^{\mbox{\tiny\rm mf}}(t),s_{i}^{\mbox{\tiny\rm mf}}(t))\cdot\Psi_{s}\left(% a_{i}^{\mbox{\tiny\rm mf}}(t),s_{i}^{\mbox{\tiny\rm mf}}(t);\rho_{t}\right)\right)$
	$\displaystyle=\,$	$\displaystyle\int_{\mathbb{R}^{2}}\left(\partial_{a}f(a,s)\cdot\Psi_{a}(a,s;% \rho_{t})+\partial_{s}f(a,s)\cdot\Psi_{s}(a,s;\rho_{t})\right)\rho_{t}(\mathrm% {d}a,\mathrm{d}s)$
	$\displaystyle\stackrel{{\scriptstyle(ii)}}{{=}}\,$	$\displaystyle-\int_{\mathbb{R}^{2}}f(a,s)\cdot\left(\partial_{a}\left(\rho_{t}% \Psi_{a}(a,s;\rho_{t})\right)+\partial_{s}\left(\rho_{t}\Psi_{s}(a,s;\rho_{t})% \right)\right)(\mathrm{d}a,\mathrm{d}s),$

where $(i)$ follows from the ODE satisfied by the $(a_{i}^{\mbox{\tiny\rm mf}}(t),s_{i}^{\mbox{\tiny\rm mf}}(t))$ ’s, and in $(ii)$ we use integration by parts. We thus obtain that

\partial_{t}\rho_{t}=-\left(\partial_{a}\left(\rho_{t}\Psi_{a}(a,s;\rho_{t})% \right)+\partial_{s}\left(\rho_{t}\Psi_{s}(a,s;\rho_{t})\right)\right)=-\nabla% \cdot\left(\rho_{t}\Psi(a,s;\rho_{t})\right),

which recovers Eq. (29).

B.5 Details of the alternative mean field approach

Let

\overline{\rho}_{t}=\frac{1}{m}\sum_{i=1}^{m}\delta_{(a_{i}(t),u_{i}(t))}\,,

(134)

where $(a_{i}(t),u_{i}(t))_{1\leqslant i\leqslant m}$ is the solution of (5)–(6). Then $\overline{\rho}_{t}$ is a measure on $\mathbb{R}\times\mathbb{S}^{d-1}$ that solves the continuity PDE

\displaystyle\begin{split}\partial_{t}\overline{\rho}_{t}(a,u)&=-\nabla\cdot% \left(\overline{\rho}_{t}\overline{\Psi}\left(a,u;\overline{\rho}_{t}\right)% \right)\\ &=-\left(\partial_{a}\left(\overline{\rho}_{t}\overline{\Psi}_{a}\left(a,u;% \overline{\rho}_{t}\right)\right)+\partial_{u}\left(\overline{\rho}_{t}% \overline{\Psi}_{u}\left(a,u;\overline{\rho}_{t}\right)\right)\right)\,,\end{split}

(135)

where $\overline{\Psi}=(\overline{\Psi}_{a},\overline{\Psi}_{u})$ is given by

	$\displaystyle\overline{\Psi}_{a}\left(a,u;\overline{\rho}\right)$	$\displaystyle=\varepsilon^{-1}\left(V(\langle u,u_{*}\rangle)-\int_{\mathbb{R}% \times\mathbb{S}^{d-1}}a_{1}U(\langle u,u_{1}\rangle)\overline{\rho}(\mathrm{d% }a_{1},\mathrm{d}u_{1})\right)\,,$
	$\displaystyle\overline{\Psi}_{u}\left(a,u;\overline{\rho}\right)$	$\displaystyle=a\left(I_{d}-uu^{\top}\right)\left(V^{\prime}(\langle u,u_{*}\rangle)u_{*}-\int_{\mathbb{R}\times\mathbb{S}^{d-1}}a_{1}U^{\prime}(\langle u,u_{1}\rangle)u_{1}\overline{\rho}(\mathrm{d}a_{1},\mathrm{d}u_{1})\right)\,.$

A remarkable property of the equation (135) is that it preserves invariance to rotations orthogonal to $u_{*}$ . Indeed, assume that $\overline{\rho}$ is invariant to rotations orthogonal to $u_{*}$ . In this case, we show that $\overline{\Psi}_{a}\left(a,u;\overline{\rho}\right)$ and $\langle u_{*},\overline{\Psi}_{u}\left(a,u;\overline{\rho}\right)\rangle$ depend only on $s:=\langle u,u_{*}\rangle$ and $s_{1}:=\langle u_{1},u_{*}\rangle$ . Let $u^{\perp}$ (resp. $u_{1}^{\perp}$ ) denote the component of $u$ (resp. $u_{1}$ ) orthogonal to $u_{*}$ . Let $R$ denote a random uniform rotation orthogonal to $u_{*}$ . By the rotation invariance of $\overline{\rho}$ ,

	$\displaystyle\overline{\Psi}_{a}\left(a,u;\overline{\rho}\right)$	$\displaystyle=\varepsilon^{-1}\left(V(\langle u,u_{*}\rangle)-\int_{\mathbb{R}% \times\mathbb{S}^{d-1}}a_{1}\mathbb{E}_{R}\left[U(\langle u,Ru_{1}\rangle)% \right]\overline{\rho}(\mathrm{d}a_{1},\mathrm{d}u_{1})\right)$
		$\displaystyle=\varepsilon^{-1}\left(V(s)-\int_{\mathbb{R}\times\mathbb{S}^{d-1% }}a_{1}\mathbb{E}_{R}\left[U(ss_{1}+\langle u^{\perp},Ru_{1}^{\perp}\rangle)% \right]\overline{\rho}(\mathrm{d}a_{1},\mathrm{d}u_{1})\right)\,.$

The random variable $B^{(d)}=\left\langle\frac{u^{\perp}}{\|u^{\perp}\|},R\frac{u_{1}^{\perp}}{\|u_{1}^{\perp}\|}\right\rangle$ is a one-dimensional projection of a random vector uniformly distributed on the unit sphere of the hyperplane orthogonal to $u_{*}$ ; thus it has the density $p_{B^{(d)}}(b)\propto(1-b^{2})^{d/2-2}$ (see, e.g., [16, Lemma 4.17]). Denote

U^{(d)}(s,s_{1})=\mathbb{E}_{B^{(d)}}\left[U\left(ss_{1}+(1-s^{2})^{1/2}(1-s_{% 1}^{2})^{1/2}B^{(d)}\right)\right]\,,

then we have

\overline{\Psi}_{a}\left(a,u;\overline{\rho}\right)=\varepsilon^{-1}\left(V(s)% -\int_{\mathbb{R}\times\mathbb{S}^{d-1}}a_{1}U^{(d)}(s,s_{1})\overline{\rho}(% \mathrm{d}a_{1},\mathrm{d}u_{1})\right)\,.

(136)

Further, we compute

	$\displaystyle\left\langle u_{*},\overline{\Psi}_{u}\left(a,u;\overline{\rho}\right)\right\rangle=\Bigg{\langle}u_{*},a\left(I_{d}-uu^{\top}\right)$
	$\displaystyle\hskip 85.35826pt\left(V^{\prime}(s)u_{*}-\int_{\mathbb{R}\times\mathbb{S}^{d-1}}a_{1}\mathbb{E}_{R}\left[U^{\prime}(ss_{1}+\langle u^{\perp},Ru_{1}^{\perp}\rangle)\left(s_{1}u_{*}+Ru_{1}^{\perp}\right)\right]\overline{\rho}(\mathrm{d}a_{1},\mathrm{d}u_{1})\right)\Bigg{\rangle}$
	$\displaystyle=a\left[(1-s^{2})V^{\prime}(s)-\int_{\mathbb{R}\times\mathbb{S}^{d-1}}a_{1}\mathbb{E}_{R}\left[U^{\prime}(ss_{1}+\langle u^{\perp},Ru_{1}^{\perp}\rangle)\langle u_{*},(I_{d}-uu^{\top})(s_{1}u_{*}+Ru_{1}^{\perp})\rangle\right]\overline{\rho}(\mathrm{d}a_{1},\mathrm{d}u_{1})\right]\,.$

In the equation above, we have $\langle u_{*},(I_{d}-uu^{\top})s_{1}u_{*}\rangle=s_{1}(1-s^{2})$ and as $\langle u_{*},Ru_{1}^{\perp}\rangle=0$ a.s., we have

\left\langle u_{*},(I_{d}-uu^{\top})Ru_{1}^{\perp}\right\rangle=-\langle u,u_{% *}\rangle\langle u,Ru_{1}^{\perp}\rangle=-s\langle u^{\perp},Ru_{1}^{\perp}% \rangle\,.

Thus we obtain

	$\displaystyle\left\langle u_{*},\overline{\Psi}_{u}\left(a,u;\overline{\rho}% \right)\right\rangle$
	$\displaystyle=a(1-s^{2})\left[V^{\prime}(s)-\int_{\mathbb{R}\times\mathbb{S}^{% d-1}}a_{1}\mathbb{E}_{R}\left[U^{\prime}(ss_{1}+\langle u^{\perp},Ru_{1}^{% \perp}\rangle)\left(s_{1}-\frac{s}{1-s^{2}}\langle u^{\perp},Ru_{1}^{\perp}% \rangle\right)\right]\overline{\rho}(\mathrm{d}a_{1},\mathrm{d}u_{1})\right]\,.$
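The two inner-product identities used in this computation can be checked numerically. The sketch below is illustrative only: it assumes $u_{*}=e_{1}$ without loss of generality and draws one random orthogonal map fixing $u_{*}$ (which may have determinant $-1$; the identities do not depend on this). All variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6

# Take u_* = e_1 (assumption made only to build the orthogonal map easily).
u_star = np.zeros(d)
u_star[0] = 1.0

def random_unit(dim):
    """Random point on the unit sphere of R^dim."""
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

u, u1 = random_unit(d), random_unit(d)
s, s1 = u @ u_star, u1 @ u_star
u_perp = u - s * u_star       # component of u orthogonal to u_*
u1_perp = u1 - s1 * u_star    # component of u_1 orthogonal to u_*

# Orthogonal map fixing u_*: identity on span(u_*), random orthogonal block elsewhere.
Q, _ = np.linalg.qr(rng.normal(size=(d - 1, d - 1)))
R = np.eye(d)
R[1:, 1:] = Q

P = np.eye(d) - np.outer(u, u)   # the projector I_d - u u^T

# <u_*, (I - u u^T) s_1 u_*> = s_1 (1 - s^2)
lhs_1 = u_star @ P @ (s1 * u_star)
rhs_1 = s1 * (1.0 - s ** 2)

# <u_*, (I - u u^T) R u_1^perp> = -s <u^perp, R u_1^perp>
lhs_2 = u_star @ P @ (R @ u1_perp)
rhs_2 = -s * (u_perp @ (R @ u1_perp))
```

Both pairs `(lhs_1, rhs_1)` and `(lhs_2, rhs_2)` should coincide up to round-off.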

Note that

\partial_{s}U^{(d)}(s,s_{1})=\mathbb{E}_{B^{(d)}}\left[U^{\prime}\left(ss_{1}+% (1-s^{2})^{1/2}(1-s_{1}^{2})^{1/2}B^{(d)}\right)\left(s_{1}-\frac{s}{(1-s^{2})% ^{1/2}}(1-s_{1}^{2})^{1/2}B^{(d)}\right)\right]

and thus we have

\displaystyle\left\langle u_{*},\overline{\Psi}_{u}\left(a,u;\overline{\rho}% \right)\right\rangle=a(1-s^{2})\left[V^{\prime}(s)-\int_{\mathbb{R}\times% \mathbb{S}^{d-1}}a_{1}\left(\partial_{s}U^{(d)}\right)(s,s_{1})\overline{\rho}% (\mathrm{d}a_{1},\mathrm{d}u_{1})\right]\,.

(137)

Of course, a discrete measure of the form (134) cannot be invariant to rotations orthogonal to $u_{*}$ . However, if the $u_{i}$ are initialized uniformly on the unit sphere, then the measure $\overline{\rho}_{0}$ converges to a measure with the rotation invariance as $m\to\infty$ . One can then apply the results of [33] to control the deviations from this limit. Let us thus assume that $\overline{\rho}_{0}$ satisfies the rotation invariance. Define the map $\varphi(a,u)=(a,\langle u,u_{*}\rangle)$ . Then, from (136), (137), the push-forward $\rho_{t}$ of $\overline{\rho}_{t}$ through the map $\varphi$ satisfies the continuity equation

	$\displaystyle\partial_{t}\rho_{t}(a,s)$	$\displaystyle=-\nabla\cdot\left(\rho_{t}\Psi^{(d)}\left(a,s;\rho_{t}\right)\right)$
		$\displaystyle=-\left(\partial_{a}\left(\rho_{t}\Psi^{(d)}_{a}\left(a,s;\rho_{t% }\right)\right)+\partial_{s}\left(\rho_{t}\Psi^{(d)}_{s}\left(a,s;\rho_{t}% \right)\right)\right),$

where $\Psi^{(d)}=(\Psi^{(d)}_{a},\Psi^{(d)}_{s})$ is given by

	$\displaystyle\Psi^{(d)}_{a}(a,s;\rho)=\,$	$\displaystyle\varepsilon^{-1}\cdot\left(V(s)-\int_{\mathbb{R}^{2}}a_{1}U^{(d)}% (s,s_{1})\rho(\mathrm{d}a_{1},\mathrm{d}s_{1})\right)\,,$
	$\displaystyle\Psi^{(d)}_{s}(a,s;\rho)=\,$	$\displaystyle a(1-s^{2})\cdot\left(V^{\prime}(s)-\int_{\mathbb{R}^{2}}a_{1}% \partial_{s}U^{(d)}(s,s_{1})\rho(\mathrm{d}a_{1},\mathrm{d}s_{1})\right)\,.$

When $d\to\infty$ , $p_{B^{(d)}}(b)\mathrm{d}b\propto(1-b^{2})^{d/2-2}\mathrm{d}b$ converges weakly to the Dirac mass $\delta_{0}(\mathrm{d}b)$ . As a consequence,

\displaystyle U^{(d)}(s,s_{1})\xrightarrow[d\to\infty]{}U(ss_{1})\,,

\displaystyle\partial_{s}U^{(d)}(s,s_{1})\xrightarrow[d\to\infty]{}U^{\prime}(% ss_{1})s_{1}\,.

Therefore, in the limit $d\to\infty$ , we recover the equations (29)–(32). Moreover, if $\overline{\rho}_{0}={\rm P}_{A}\otimes\mathrm{Unif}(\mathbb{S}^{d-1})$ , then $\rho_{0}$ converges weakly to ${\rm P}_{A}\otimes\delta_{0}(\mathrm{d}s)$ as $d\to\infty$ .
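The concentration of $p_{B^{(d)}}$ at $0$ and the resulting convergence $U^{(d)}(s,s_{1})\to U(ss_{1})$ can be illustrated by quadrature. The sketch below is not part of the paper's argument: the choice `U = np.exp`, the values of `s`, `s1`, and the midpoint rule are assumptions made purely for illustration. A standard Beta-integral computation gives $\mathbb{E}[(B^{(d)})^{2}]=1/(d-1)$, which the code also checks.

```python
import numpy as np

def projection_quadrature(d, n=200_001):
    """Midpoint nodes and normalized weights for p_B(b) proportional to (1 - b^2)^(d/2 - 2)."""
    edges = np.linspace(-1.0, 1.0, n + 1)
    b = 0.5 * (edges[:-1] + edges[1:])
    w = (1.0 - b ** 2) ** (d / 2.0 - 2.0)
    return b, w / w.sum()

def U_d(s, s1, d, U):
    """Quadrature value of U^(d)(s, s1) = E[U(s s1 + sqrt((1-s^2)(1-s1^2)) B^(d))]."""
    b, w = projection_quadrature(d)
    scale = np.sqrt((1.0 - s ** 2) * (1.0 - s1 ** 2))
    return float(np.sum(w * U(s * s1 + scale * b)))

def second_moment(d):
    """Quadrature value of E[(B^(d))^2]; the exact value is 1 / (d - 1)."""
    b, w = projection_quadrature(d)
    return float(np.sum(w * b ** 2))

U = np.exp            # hypothetical smooth U, chosen only for illustration
s, s1 = 0.3, 0.5      # hypothetical evaluation point

# Gap |U^(d)(s, s1) - U(s s1)| for increasing dimension d.
gaps = {d: abs(U_d(s, s1, d, U) - U(s * s1)) for d in (20, 200, 2000)}
```

In this example the gap shrinks as $d$ grows, roughly in proportion to the variance $1/(d-1)$ of $B^{(d)}$.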

B.6 Proof of Proposition 4

First, note that the potential functions $U$ and $V$ admit the following expansion:

\displaystyle U(s)=\,\sum_{k=0}^{\infty}\sigma_{k}^{2}s^{k},\ V(s)=\,\sum_{k=0% }^{\infty}\varphi_{k}\sigma_{k}s^{k}.

As a consequence, we deduce that

	$\displaystyle\mathscrsfs{R}_{\mbox{\tiny\rm mf},*}(\rho)=\,$	$\displaystyle\frac{1}{2}\sum_{k=0}^{\infty}\varphi_{k}^{2}-\sum_{k=0}^{\infty}% \varphi_{k}\sigma_{k}\int a(\omega)s(\omega)^{k}\mathrm{d}\rho(\omega)+\frac{1% }{2}\sum_{k=0}^{\infty}\sigma_{k}^{2}\int a(\omega_{1})a(\omega_{2})s(\omega_{% 1})^{k}s(\omega_{2})^{k}\mathrm{d}\rho(\omega_{1})\mathrm{d}\rho(\omega_{2})$
	$\displaystyle=\,$	$\displaystyle\frac{1}{2}\sum_{k=0}^{\infty}\left(\varphi_{k}^{2}-2\varphi_{k}% \sigma_{k}\int a(\omega)s(\omega)^{k}\mathrm{d}\rho(\omega)+\sigma_{k}^{2}% \left(\int a(\omega)s(\omega)^{k}\mathrm{d}\rho(\omega)\right)^{2}\right)$
	$\displaystyle=\,$	$\displaystyle\frac{1}{2}\sum_{k=0}^{\infty}\left(\varphi_{k}-\sigma_{k}\int a(% \omega)s(\omega)^{k}\mathrm{d}\rho(\omega)\right)^{2}.$

Now we show that the above risk can be arbitrarily small. We will choose $\rho$ to be the Lebesgue measure on $[-1,1]$ and $a\in L^{2}[-1,1]$ so that $\int a(\omega)s(\omega)^{k}\mathrm{d}\rho(\omega)=\int_{-1}^{1}a(s)s^{k}% \mathrm{d}s$ . Now, we define the following set of sequences

W=\left\{\left(\sigma_{k}\int_{-1}^{1}a(s)s^{k}\mathrm{d}s\right)_{k\geq 0}\,% \middle|\,a\in L^{2}[-1,1]\right\}\,.

Since $a\in L^{2}[-1,1]$ and $(\sigma_{k})_{k\geq 0}\in\ell^{2}$ , we know that $W\subset\ell^{2}$ , i.e., $W$ is a linear subspace of $\ell^{2}$ . Now it suffices to show that $W$ is dense in $\ell^{2}$ , which is equivalent to $W^{\perp}=\{0\}$ , namely

v\in\ell^{2},\ v\perp W\implies v=0.

Fix any such $v$ and take $\mu\in W$ such that for all $k$ , $\mu_{k}=\sigma_{k}\int_{-1}^{1}a(s)s^{k}\mathrm{d}s$ for some $a\in L^{2}[-1,1]$ . We then have

\displaystyle 0=\langle v,\mu\rangle=\sum_{k=0}^{\infty}v_{k}\mu_{k}=\sum_{k=0% }^{\infty}v_{k}\sigma_{k}\int_{-1}^{1}a(s)s^{k}\mathrm{d}s=\int_{-1}^{1}a(s)% \cdot\left(\sum_{k=0}^{\infty}v_{k}\sigma_{k}s^{k}\right)\mathrm{d}s,

where the last step follows from the dominated convergence theorem. Indeed, by Hölder’s inequality,

\sum_{k=0}^{\infty}|v_{k}\sigma_{k}|\leq\left(\sum_{k=0}^{\infty}v_{k}^{2}% \right)^{1/2}\left(\sum_{k=0}^{\infty}\sigma_{k}^{2}\right)^{1/2}<\infty.

As a consequence, the partial sums $\sum_{k=0}^{n}\sigma_{k}v_{k}s^{k}$ converge absolutely and uniformly on $[-1,1]$ to the continuous function $f(s)=\sum_{k=0}^{\infty}\sigma_{k}v_{k}s^{k}$ . The above argument then implies that for any $a\in L^{2}[-1,1]$ , $\int_{-1}^{1}a(s)f(s)\mathrm{d}s=0$ ; taking $a=f$ gives $f(s)\equiv 0$ . By uniqueness of power series coefficients, $\sigma_{k}v_{k}=0$ for all $k\geq 0$ . Since $\sigma_{k}\neq 0$ for all $k\geq 0$ , we must have $v_{k}=0$ for all $k\geq 0$ , i.e., $v=0$ . This completes the proof of the density of $W$ in $\ell^{2}$ , and thus the proof of the Proposition.
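As a finite-dimensional illustration of why the risk can be made small, one can match the first $K$ terms exactly with a polynomial $a$. This is only a sketch of the truncated problem, not the density argument used in the proof, and the coefficient values `phi`, `sigma` below are hypothetical.

```python
import numpy as np

K = 4
# Hypothetical coefficients with sigma_k nonzero (illustration only).
phi = np.array([0.5, -1.0, 0.25, 0.1])
sigma = np.array([1.0, 0.8, -0.5, 0.3])

def monomial_moment(p):
    """Integral of s^p over [-1, 1]."""
    return 2.0 / (p + 1) if p % 2 == 0 else 0.0

# Gram matrix of the monomials 1, s, ..., s^(K-1) in L^2[-1, 1].
G = np.array([[monomial_moment(k + j) for j in range(K)] for k in range(K)])

# Coefficients c of a(s) = sum_j c_j s^j such that
# sigma_k * (integral of a(s) s^k over [-1, 1]) = phi_k for all k < K.
c = np.linalg.solve(G, phi / sigma)

moments = G @ c   # integral of a(s) s^k over [-1, 1], for k < K
truncated_risk = 0.5 * np.sum((phi - sigma * moments) ** 2)
```

The truncated risk vanishes up to round-off; the terms with $k\geq K$ are not controlled by this construction and are handled in the proof by the density of $W$ in $\ell^{2}$.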

Appendix C Calculations for the analysis of mean-field gradient flow

C.1 Solution of Eq. (89)

In order to solve the system (89), we start from an associated one-dimensional ODE.

Lemma 2.

The solution $\lambda=\lambda(t_{3})$ of the ODE

\partial_{t_{3}}\lambda=|\sigma_{1}|\left(|\varphi_{1}|-{|\sigma_{1}|}\left\|a% _{\perp,\rm{init}}\right\|_{L^{2}(\rho)}^{2}\lambda^{2}\right)\lambda

(138)

with initial condition $\lambda(0)$ is

\lambda=\frac{|\varphi_{1}|^{\nicefrac{{1}}{{2}}}}{\left({|\sigma_{1}|}\left\|% a_{\perp,\rm{init}}\right\|_{L^{2}(\rho)}^{2}+\left(|\varphi_{1}|\lambda(0)^{-% 2}-{|\sigma_{1}|}\left\|a_{\perp,\rm{init}}\right\|_{L^{2}(\rho)}^{2}\right)e^% {-2|\sigma_{1}\varphi_{1}|t_{3}}\right)^{\nicefrac{{1}}{{2}}}}\,.

(139)

Proof.

For simplicity, denote $\alpha=|\sigma_{1}|$ , $\beta=|\varphi_{1}|$ and $\gamma={|\sigma_{1}|}\left\|a_{\perp,\rm{init}}\right\|_{L^{2}(\rho)}^{2}$ . Then

\partial_{t_{3}}\lambda=\alpha\left(\beta-\gamma\lambda^{2}\right)\lambda\,.

This is a Bernoulli differential equation (see, e.g., [15]). In this situation, the classical trick is to reduce the problem to a linear inhomogeneous first-order equation by considering

	$\displaystyle\partial_{t_{3}}\left(\lambda^{-2}\right)$	$\displaystyle=-2\left(\partial_{t_{3}}\lambda\right)\lambda^{-3}=-2\alpha\left% (\beta-\gamma\lambda^{2}\right)\lambda^{-2}$
		$\displaystyle=2\alpha(\gamma-\beta\lambda^{-2})\,.$

Solving this linear inhomogeneous first-order equation gives

\lambda^{-2}=\frac{\gamma}{\beta}+\left(\lambda(0)^{-2}-\frac{\gamma}{\beta}% \right)\,e^{-2\alpha\beta t_{3}}\,,

and thus

\lambda=\frac{\beta^{\nicefrac{{1}}{{2}}}}{\left(\gamma+\left(\beta\lambda(0)^% {-2}-\gamma\right)e^{-2\alpha\beta t_{3}}\right)^{\nicefrac{{1}}{{2}}}}\,,

which is the claimed result. ∎
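The closed form of Lemma 2 can be compared against a direct numerical integration of the ODE. The sketch below uses a hand-written classical Runge-Kutta scheme; `alpha`, `beta`, `gamma` follow the notation of the proof, while their numerical values and the initial condition `lam0` are hypothetical.

```python
import numpy as np

alpha, beta, gamma = 0.7, 1.3, 0.4   # hypothetical positive constants
lam0 = 0.05                          # hypothetical initial condition lambda(0)

def rhs(lam):
    """Right-hand side alpha * (beta - gamma * lam^2) * lam of the ODE."""
    return alpha * (beta - gamma * lam ** 2) * lam

def closed_form(t):
    """Closed-form solution stated in Lemma 2, in the (alpha, beta, gamma) notation."""
    denom = gamma + (beta * lam0 ** -2 - gamma) * np.exp(-2.0 * alpha * beta * t)
    return np.sqrt(beta / denom)

def rk4(t_end, n_steps=20_000):
    """Classical fourth-order Runge-Kutta integration from t = 0 to t_end."""
    h = t_end / n_steps
    lam = lam0
    for _ in range(n_steps):
        k1 = rhs(lam)
        k2 = rhs(lam + 0.5 * h * k1)
        k3 = rhs(lam + 0.5 * h * k2)
        k4 = rhs(lam + h * k3)
        lam += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return lam

t_end = 10.0
error = abs(rk4(t_end) - closed_form(t_end))
limit = np.sqrt(beta / gamma)        # large-time limit of the closed form
```

The numerical and closed-form solutions should agree to high accuracy, and the closed form saturates at $(\beta/\gamma)^{1/2}$ for large $t_{3}$.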

Let $\lambda=\lambda(t_{3})$ be a solution of (138) and consider

\displaystyle\begin{split}&a^{(-1)}=a^{(-1)}_{\perp}=\lambda a_{\perp,\rm{init% }}\,,\\ &s^{(1)}=s^{(1)}_{\perp}=\operatorname{sign}(\sigma_{1}\varphi_{1})\lambda a_{% \perp,\rm{init}}\,.\end{split}

(140)

Then $a^{(-1)},s^{(1)}$ are solutions of the constrained ODE system (85), (88). Indeed,

\left\langle a^{(-1)},\mathds{1}\right\rangle_{L^{2}(\rho)}=\lambda\left% \langle a_{\perp,\rm{init}},\mathds{1}\right\rangle_{L^{2}(\rho)}=0\,,

thus the constraint (85) is satisfied. Further

	$\displaystyle\partial_{t_{3}}a_{\perp}^{(-1)}$	$\displaystyle=\left(\partial_{t_{3}}\lambda\right)a_{\perp,\rm{init}}$
		$\displaystyle=|\sigma_{1}|\left(|\varphi_{1}|-{|\sigma_{1}|}\|a_{\perp,\rm{init}}\|_{L^{2}(\rho)}^{2}\lambda^{2}\right)\lambda a_{\perp,\rm{init}}$
		$\displaystyle=\operatorname{sign}(\sigma_{1}\varphi_{1})\sigma_{1}\left(\varphi_{1}-\operatorname{sign}(\sigma_{1}\varphi_{1}){\sigma_{1}}\|a_{\perp,\rm{init}}\|_{L^{2}(\rho)}^{2}\lambda^{2}\right)\lambda a_{\perp,\rm{init}}$
		$\displaystyle=\sigma_{1}\left(\varphi_{1}-{\sigma_{1}}\left\langle a^{(-1)}_{% \perp},s^{(1)}_{\perp}\right\rangle_{L^{2}(\rho)}\right)s^{(1)}_{\perp}\,.$

A similar computation shows that the differential equation for $s^{(1)}$ is also satisfied. This shows that (140) is a valid candidate solution on the third time scale.

Matching.

To determine the value of the initialization $\lambda(0)$ we perform a matching procedure with the previous time scale. In this paragraph, we denote by $\underline{a},\underline{s}$ the approximation obtained in the second time scale (Section 6.3), and by $\overline{a},\overline{s}$ the approximation in the third time scale (Section 6.4 and above).

Consider an intermediate time scale $\widetilde{t}=t_{2}-c\log\frac{1}{\varepsilon}$ with $0<c<\frac{1}{4|\sigma_{1}\varphi_{1}|}$ . Assume $\widetilde{t}\asymp 1$ . Then

	$\displaystyle t_{2}$	$\displaystyle=\widetilde{t}+c\log\frac{1}{\varepsilon}\xrightarrow[\varepsilon% \to 0]{}+\infty\,,$
	$\displaystyle t_{3}$	$\displaystyle=\widetilde{t}-\left(\frac{1}{4|\sigma_{1}\varphi_{1}|}-c\right)\log\frac{1}{\varepsilon}\xrightarrow[\varepsilon\to 0]{}-\infty\,.$

From the approximation (82) on the second time scale,

$\displaystyle\underline{a}$	$\displaystyle=\underline{a}^{(0)}+O(\varepsilon^{\nicefrac{{1}}{{2}}})$
	$\displaystyle=\frac{\varphi_{0}}{\sigma_{0}}\mathds{1}+\cosh\left(\varphi_{1}% \sigma_{1}t_{2}\right){a}_{\perp,\rm{init}}+O(\varepsilon^{\nicefrac{{1}}{{2}}})$
	$\displaystyle=\cosh\left(\varphi_{1}\sigma_{1}\left(\widetilde{t}+c\log\frac{1% }{\varepsilon}\right)\right){a}_{\perp,\rm{init}}+O(1)$
	$\displaystyle=\frac{1}{2}e^{|\varphi_{1}\sigma_{1}|\widetilde{t}}\varepsilon^{-c|\varphi_{1}\sigma_{1}|}{a}_{\perp,\rm{init}}+O(1)\,.$	(141)

From the approximation on the third time scale,

	$\displaystyle\overline{a}$	$\displaystyle=\varepsilon^{-\nicefrac{{1}}{{4}}}\overline{a}^{(-1)}+O(1)$
		$\displaystyle=\varepsilon^{-\nicefrac{{1}}{{4}}}\lambda{a}_{\perp,\rm{init}}+O% (1)\,.$

Note that as $t_{3}\to-\infty$ , from (139),

\lambda\sim\frac{|\varphi_{1}|^{\nicefrac{{1}}{{2}}}}{\left(|\varphi_{1}|% \lambda(0)^{-2}-{|\sigma_{1}|}\left\|a_{\perp,\rm{init}}\right\|_{L^{2}(\rho)}% ^{2}\right)^{\nicefrac{{1}}{{2}}}}e^{|\sigma_{1}\varphi_{1}|t_{3}}\,.

Thus

	$\displaystyle\overline{a}$	$\displaystyle\sim\varepsilon^{-\nicefrac{{1}}{{4}}}\frac{|\varphi_{1}|^{\nicefrac{{1}}{{2}}}}{\left(|\varphi_{1}|\lambda(0)^{-2}-{|\sigma_{1}|}\left\|a_{\perp,\rm{init}}\right\|_{L^{2}(\rho)}^{2}\right)^{\nicefrac{{1}}{{2}}}}e^{|\sigma_{1}\varphi_{1}|\left(\widetilde{t}-\left(\frac{1}{4|\sigma_{1}\varphi_{1}|}-c\right)\log\frac{1}{\varepsilon}\right)}a_{\perp,\rm{init}}$
		$\displaystyle\sim\frac{|\varphi_{1}|^{\nicefrac{{1}}{{2}}}}{\left(|\varphi_{1}|\lambda(0)^{-2}-{|\sigma_{1}|}\left\|a_{\perp,\rm{init}}\right\|_{L^{2}(\rho)}^{2}\right)^{\nicefrac{{1}}{{2}}}}e^{|\sigma_{1}\varphi_{1}|\widetilde{t}}\varepsilon^{-c|\sigma_{1}\varphi_{1}|}a_{\perp,\rm{init}}\,.$		(142)

By matching, Equations (141) and (142) should be coherent. This gives

\frac{1}{2}=\frac{|\varphi_{1}|^{\nicefrac{{1}}{{2}}}}{\left(|\varphi_{1}|% \lambda(0)^{-2}-{|\sigma_{1}|}\left\|a_{\perp,\rm{init}}\right\|_{L^{2}(\rho)}% ^{2}\right)^{\nicefrac{{1}}{{2}}}}\,,

and thus

\lambda(t_{3})=\frac{|\varphi_{1}|^{\nicefrac{{1}}{{2}}}}{\left({|\sigma_{1}|}% \left\|a_{\perp,\rm{init}}\right\|_{L^{2}(\rho)}^{2}+4|\varphi_{1}|e^{-2|% \sigma_{1}\varphi_{1}|t_{3}}\right)^{\nicefrac{{1}}{{2}}}}\,.

(143)

One can check similarly that $s^{(1)}$ also satisfies the matching conditions under the same constraint, and thus that the expressions in (140) are indeed the solutions on the third time scale.
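As a numerical sanity check of this matching, the following Python sketch evaluates (143) and verifies that $\lambda(t_{3})e^{-|\sigma_{1}\varphi_{1}|t_{3}}\to\frac{1}{2}$ as $t_{3}\to-\infty$, in agreement with the prefactor in (141). The function name `lam` and the parameter values are illustrative assumptions and are not taken from the paper.

```python
import math

def lam(t3, sigma1, phi1, a_norm_sq):
    """Eq. (143): lambda(t3), with a_norm_sq = ||a_perp_init||^2 in L^2(rho)."""
    rate = abs(sigma1 * phi1)
    num = math.sqrt(abs(phi1))
    den = math.sqrt(abs(sigma1) * a_norm_sq
                    + 4.0 * abs(phi1) * math.exp(-2.0 * rate * t3))
    return num / den

# Illustrative values (assumptions, not from the paper)
sigma1, phi1, a_norm_sq = 0.7, -1.3, 2.0
rate = abs(sigma1 * phi1)

# As t3 -> -infinity, lambda(t3) * exp(-rate * t3) should approach 1/2
ratios = [lam(t3, sigma1, phi1, a_norm_sq) * math.exp(-rate * t3)
          for t3 in (-5.0, -10.0, -20.0)]

# As t3 -> +infinity, lambda saturates at sqrt(|phi1| / (|sigma1| * ||a||^2))
lam_inf = math.sqrt(abs(phi1) / (abs(sigma1) * a_norm_sq))

print(ratios, lam(50.0, sigma1, phi1, a_norm_sq), lam_inf)
```

The saturation value is what makes $\sigma_{1}\langle a,s\rangle_{L^{2}(\rho)}$ converge to $\varphi_{1}$ at the end of the third time scale.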

C.2 Induced approximation of the risk

In this section, we show that the behavior of $a$ and $s$ derived in Sections 6.2–6.4 leads to a risk that alternates between plateaus and rapid decreases, in agreement with the canonical learning order of Definition 1. For the convenience of the reader, we recall the expression (37) of the risk

	$\displaystyle\mathscrsfs{R}_{\mbox{\tiny\rm mf},*}$	$\displaystyle=\frac{1}{2}\|\varphi\|^{2}-\int a(\omega)V(s(\omega))\mathrm{d}\rho(\omega)+\frac{1}{2}\int a(\omega_{1})a(\omega_{2})U(s(\omega_{1})s(\omega_{2}))\mathrm{d}\rho(\omega_{1})\mathrm{d}\rho(\omega_{2})$
		$\displaystyle=\frac{1}{2}\sum_{k\geqslant 0}\left(\varphi_{k}-\sigma_{k}\int a% (\omega)s(\omega)^{k}\mathrm{d}\rho(\omega)\right)^{2}\,.$

First time scale $t_{1}=\frac{t}{\varepsilon}$ (Section 6.2).

On this time scale, we have $a=O(1)$ and $s=O(\varepsilon)$ . Thus for all $k\geqslant 1$ , $\int a(\omega)s(\omega)^{k}\mathrm{d}\rho(\omega)=O(\varepsilon)$ whence $\left(\varphi_{k}-\sigma_{k}\int a(\omega)s(\omega)^{k}\mathrm{d}\rho(\omega)% \right)^{2}=\varphi_{k}^{2}+O(\varepsilon)$ .

Further, using (59),

	$\displaystyle\left(\varphi_{0}-\sigma_{0}\int a(\omega)\mathrm{d}\rho(\omega)% \right)^{2}$	$\displaystyle=\left(\varphi_{0}-{\sigma_{0}}\left\langle a,\mathds{1}\right% \rangle_{L^{2}(\rho)}\right)^{2}=\left(\varphi_{0}-{\sigma_{0}}\left\langle a^% {(0)},\mathds{1}\right\rangle_{L^{2}(\rho)}+O(\varepsilon)\right)^{2}$
		$\displaystyle=e^{-2\sigma_{0}^{2}t_{1}}\left(\varphi_{0}-{\sigma_{0}}\left% \langle a_{\rm{init}},\mathds{1}\right\rangle_{L^{2}(\rho)}\right)^{2}+O(% \varepsilon)\,.$

Thus as $\varepsilon\to 0$ ,

\displaystyle\mathscrsfs{R}_{\mbox{\tiny\rm mf},*}=\frac{1}{2}e^{-2\sigma_{0}^% {2}t_{1}}\left(\varphi_{0}-{\sigma_{0}}\left\langle a_{\rm{init}},\mathds{1}% \right\rangle_{L^{2}(\rho)}\right)^{2}+\frac{1}{2}\sum_{k\geqslant 1}\varphi_{% k}^{2}+O(\varepsilon)\,.

This describes, in a more detailed form, the first transition in Definition 1.

Second time scale $t_{2}=\frac{t}{\varepsilon^{\nicefrac{{1}}{{2}}}}$ (Section 6.3).

On this time scale, we have $a=O(1)$ and $s=O(\varepsilon^{\nicefrac{{1}}{{2}}})$ . Thus for all $k\geqslant 1$ , $\left(\varphi_{k}-\sigma_{k}\int a(\omega)s(\omega)^{k}\mathrm{d}\rho(\omega)% \right)^{2}=\varphi_{k}^{2}+O(\varepsilon^{\nicefrac{{1}}{{2}}})$ .

Further, using (68),

\displaystyle\left(\varphi_{0}-\sigma_{0}\int a(\omega)\mathrm{d}\rho(\omega)% \right)^{2}=\left(\varphi_{0}-\sigma_{0}\int a^{(0)}(\omega)\mathrm{d}\rho(% \omega)+O(\varepsilon^{\nicefrac{{1}}{{2}}})\right)^{2}=O(\varepsilon^{% \nicefrac{{1}}{{2}}})\,.

Thus as $\varepsilon\to 0$ ,

\displaystyle\mathscrsfs{R}_{\mbox{\tiny\rm mf},*}=\frac{1}{2}\sum_{k\geqslant 1% }\varphi_{k}^{2}+O(\varepsilon^{\nicefrac{{1}}{{2}}})\,.

This second time scale does not induce any transition of the risk $\mathscrsfs{R}_{\mbox{\tiny\rm mf},*}$ (but was necessary to understand the divergence of $a$ and $\varepsilon^{-\nicefrac{{1}}{{2}}}s$ ).

Third time scale $t_{3}=\frac{t}{\varepsilon^{\nicefrac{{1}}{{2}}}}-\frac{1}{4|\sigma_{1}\varphi% _{1}|}\log\frac{1}{\varepsilon}$ (Section 6.4).

On this time scale, we have $a=O(\varepsilon^{-\nicefrac{{1}}{{4}}})$ and $s=O(\varepsilon^{\nicefrac{{1}}{{4}}})$ . Thus for all $k\geqslant 2$ , $\left(\varphi_{k}-\sigma_{k}\int a(\omega)s(\omega)^{k}\mathrm{d}\rho(\omega)% \right)^{2}=\varphi_{k}^{2}+O(\varepsilon^{\nicefrac{{1}}{{4}}})$ .

Further, using (85), (86),

	$\displaystyle\left(\varphi_{0}-\sigma_{0}\int a(\omega)\mathrm{d}\rho(\omega)% \right)^{2}$	$\displaystyle=\left(\varphi_{0}-{\sigma_{0}}\varepsilon^{-\nicefrac{{1}}{{4}}}% \int a^{(-1)}(\omega)\mathrm{d}\rho(\omega)-{\sigma_{0}}\int a^{(0)}(\omega)% \mathrm{d}\rho(\omega)+O(\varepsilon^{\nicefrac{{1}}{{4}}})\right)^{2}$
		$\displaystyle=O(\varepsilon^{\nicefrac{{1}}{{4}}})\,.$

Finally, using (90), (91),

	$\displaystyle\left(\varphi_{1}-{\sigma_{1}}\int a(\omega)s(\omega)\mathrm{d}% \rho(\omega)\right)^{2}$	$\displaystyle=\left(\varphi_{1}-{\sigma_{1}}\left\langle a,s\right\rangle_{L^{% 2}(\rho)}\right)^{2}=\left(\varphi_{1}-{\sigma_{1}}\left\langle a^{(-1)},s^{(1% )}\right\rangle_{L^{2}(\rho)}+O(\varepsilon^{\nicefrac{{1}}{{4}}})\right)^{2}$
		$\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}\left(\varphi_{1}-{\sigma_{1}}\operatorname{sign}(\sigma_{1}\varphi_{1})\lambda^{2}\|a_{\perp,\rm{init}}\|^{2}_{L^{2}(\rho)}\right)^{2}+O(\varepsilon^{\nicefrac{{1}}{{4}}})$
		$\displaystyle\stackrel{{\scriptstyle(b)}}{{=}}\left(\varphi_{1}-\frac{\sigma_{1}\operatorname{sign}(\sigma_{1}\varphi_{1})|\varphi_{1}|\|a_{\perp,\rm{init}}\|_{L^{2}(\rho)}^{2}}{\left({|\sigma_{1}|}\left\|a_{\perp,\rm{init}}\right\|_{L^{2}(\rho)}^{2}+4|\varphi_{1}|e^{-2|\sigma_{1}\varphi_{1}|t_{3}}\right)}\right)^{2}+O(\varepsilon^{\nicefrac{{1}}{{4}}})$
		$\displaystyle=\varphi_{1}^{2}\left(1-\frac{1}{1+\frac{4|\varphi_{1}|}{|\sigma_{1}|\left\|a_{\perp,\rm{init}}\right\|_{L^{2}(\rho)}^{2}}e^{-2|\sigma_{1}\varphi_{1}|t_{3}}}\right)^{2}+O(\varepsilon^{\nicefrac{{1}}{{4}}})\,,$

where in $(a)$ we used (90) and in $(b)$ (91). Thus as $\varepsilon\to 0$ ,

\displaystyle\mathscrsfs{R}_{\mbox{\tiny\rm mf},*}=\frac{1}{2}\varphi_{1}^{2}% \left(1-\frac{1}{1+\frac{4|\varphi_{1}|}{|\sigma_{1}|\left\|a_{\perp,\rm{init}% }\right\|_{L^{2}(\rho)}^{2}}e^{-2|\sigma_{1}\varphi_{1}|t_{3}}}\right)^{2}+% \frac{1}{2}\sum_{k\geqslant 2}\varphi_{k}^{2}+O(\varepsilon^{\nicefrac{{1}}{{4% }}})\,.

This describes, in a more detailed form, the second transition in Definition 1.
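To visualize this second transition, the following Python sketch evaluates the leading-order contribution of the $k=1$ term to the risk on the third time scale and checks its two plateaus. The function name `risk_component_1` and the parameter values are illustrative assumptions. The midpoint `t_half`, at which half of $\varphi_{1}$ is fitted, is obtained here by solving $\frac{4|\varphi_{1}|}{|\sigma_{1}|\|a_{\perp,\rm{init}}\|^{2}_{L^{2}(\rho)}}e^{-2|\sigma_{1}\varphi_{1}|t_{3}}=1$; it is our own remark and is not stated in the text.

```python
import math

def risk_component_1(t3, sigma1, phi1, a_norm_sq):
    """Leading-order k = 1 term of the risk on the third time scale."""
    k_const = 4.0 * abs(phi1) / (abs(sigma1) * a_norm_sq)
    learned = 1.0 / (1.0 + k_const * math.exp(-2.0 * abs(sigma1 * phi1) * t3))
    return 0.5 * phi1 ** 2 * (1.0 - learned) ** 2

# Illustrative values (assumptions, not from the paper)
sigma1, phi1, a_norm_sq = 0.7, -1.3, 2.0

grid = [-20.0 + 0.5 * i for i in range(81)]  # t3 from -20 to 20
values = [risk_component_1(t, sigma1, phi1, a_norm_sq) for t in grid]

# Time at which exactly half of phi_1 has been fitted
k_const = 4.0 * abs(phi1) / (abs(sigma1) * a_norm_sq)
t_half = math.log(k_const) / (2.0 * abs(sigma1 * phi1))

print(values[0], values[-1], risk_component_1(t_half, sigma1, phi1, a_norm_sq))
```

The curve starts at the plateau $\frac{1}{2}\varphi_{1}^{2}$, decreases monotonically, and reaches $0$ within an $O(1)$ window in $t_{3}$ around `t_half`.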

C.3 Proof of Theorem 1

Throughout the proof, we will use the shorthand $\mathscrsfs{R}_{l}(\tau)$ to represent $\mathscrsfs{R}_{l}(\widetilde{a}(\tau),\widetilde{s}(\tau))$ . First, note that according to the ODE satisfied by $\mathscrsfs{R}_{l}$ (Eq. (98)), we know that $\mathscrsfs{R}_{l}$ must be non-increasing, thus for small enough $\varepsilon>0$ ,

	$\displaystyle\left|\varphi_{l}-\sigma_{l}\int\widetilde{a}(\nu,\tau)\widetilde{s}(\nu,\tau)^{l}\mathrm{d}\rho(\nu)\right|\leq\,$	$\displaystyle\left|\varphi_{l}-\sigma_{l}\int\widetilde{a}(\nu,0)\widetilde{s}(\nu,0)^{l}\mathrm{d}\rho(\nu)\right|$
	$\displaystyle\leq\,$	$\displaystyle\left|\varphi_{l}\right|+O(\varepsilon^{1/2l})\leq 2\left|\varphi_{l}\right|,\ \forall\tau\geq 0.$

Hence, we obtain the estimates:

\displaystyle\partial_{\tau}\left|\widetilde{a}(\omega)\right|\leq\left|% \partial_{\tau}\tilde{a}(\omega)\right|\leq 2|\sigma_{l}||\varphi_{l}|\left|% \widetilde{s}(\omega)\right|^{l},\ \partial_{\tau}\left|\widetilde{s}(\omega)% \right|\leq\left|\partial_{\tau}\widetilde{s}(\omega)\right|\leq 2l|\sigma_{l}% ||\varphi_{l}|\left|\widetilde{a}(\omega)\right|\left|\widetilde{s}(\omega)% \right|^{l-1}.

According to the comparison theorem for systems of ODEs, we know that $|\widetilde{a}(\omega,\tau)|\leq\widehat{a}(\omega,\tau)$ and $|\widetilde{s}(\omega,\tau)|\leq\widehat{s}(\omega,\tau)$ for all $\tau\geq 0$ , where

\widehat{a}(\omega,0)=\max\left\{|\widetilde{a}(\omega,0)|,\ |\widetilde{s}(% \omega,0)|\right\},\ \widehat{s}(\omega,0)=l^{1/2}\widehat{a}(\omega,0)=l^{1/2% }\max\left\{|\widetilde{a}(\omega,0)|,\ |\widetilde{s}(\omega,0)|\right\},

and

\partial_{\tau}\widehat{a}(\omega)=2|\sigma_{l}||\varphi_{l}|\widehat{s}(% \omega)^{l},\ \partial_{\tau}\widehat{s}(\omega)=2l|\sigma_{l}||\varphi_{l}|% \widehat{a}(\omega)\widehat{s}(\omega)^{l-1}.

(144)

The above system of ODEs can be solved analytically via integration. First, we note that

\partial_{\tau}(\widehat{s}(\omega)^{2})=4l|\sigma_{l}||\varphi_{l}|\widehat{a% }(\omega)\widehat{s}(\omega)^{l}=l\partial_{\tau}(\widehat{a}(\omega)^{2}),

which implies that (further note $\widehat{s}(\omega,0)^{2}=l\widehat{a}(\omega,0)^{2}$ )

\widehat{s}(\omega,\tau)=l^{1/2}\widehat{a}(\omega,\tau),\ \forall\tau\geq 0.

(145)

The ODE system then reduces to $\partial_{\tau}\widehat{a}(\omega)=2l^{l/2}|\sigma_{l}||\varphi_{l}|\widehat{a% }(\omega)^{l}$ , which admits the solution

\widehat{a}(\omega,\tau)=\left(\widehat{a}(\omega,0)^{-l+1}-2l^{l/2}(l-1)|% \sigma_{l}||\varphi_{l}|\tau\right)^{-1/(l-1)}.

(146)
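The closed form (146) can be checked numerically. The sketch below integrates the comparison system (144) with a classical fourth-order Runge-Kutta scheme and compares the result with (146). It also computes the blow-up time of (146), which scales as $\widehat{a}(\omega,0)^{-(l-1)}$. The helper names (`rhs`, `rk4`, `closed_form`, `blow_up_time`), the shorthand `c` for $|\sigma_{l}||\varphi_{l}|$, and the numerical values are assumptions made for illustration.

```python
import math

def rhs(a, s, l, c):
    """Right-hand side of (144), with c = |sigma_l| * |phi_l|."""
    return 2.0 * c * s ** l, 2.0 * l * c * a * s ** (l - 1)

def rk4(a, s, l, c, tau_end, n_steps):
    """Integrate (144) from 0 to tau_end with the classical RK4 scheme."""
    h = tau_end / n_steps
    for _ in range(n_steps):
        k1a, k1s = rhs(a, s, l, c)
        k2a, k2s = rhs(a + 0.5 * h * k1a, s + 0.5 * h * k1s, l, c)
        k3a, k3s = rhs(a + 0.5 * h * k2a, s + 0.5 * h * k2s, l, c)
        k4a, k4s = rhs(a + h * k3a, s + h * k3s, l, c)
        a += h * (k1a + 2.0 * k2a + 2.0 * k3a + k4a) / 6.0
        s += h * (k1s + 2.0 * k2s + 2.0 * k3s + k4s) / 6.0
    return a, s

def closed_form(a0, l, c, tau):
    """Eq. (146), valid for l >= 2 and tau below the blow-up time."""
    return (a0 ** (-(l - 1))
            - 2.0 * l ** (l / 2) * (l - 1) * c * tau) ** (-1.0 / (l - 1))

def blow_up_time(a0, l, c):
    """Time at which the bracket in (146) vanishes."""
    return a0 ** (-(l - 1)) / (2.0 * l ** (l / 2) * (l - 1) * c)

# Illustrative values (assumptions, not from the paper)
l, c, a0 = 3, 1.0, 0.1
tau = 0.5 * blow_up_time(a0, l, c)
a_num, s_num = rk4(a0, math.sqrt(l) * a0, l, c, tau, 2000)
a_exact = closed_form(a0, l, c, tau)

print(a_num, a_exact, s_num / a_num)
```

Halving $\widehat{a}(\omega,0)$ multiplies the blow-up time by $2^{l-1}$, which is the mechanism behind the $\varepsilon^{-(l-1)/2l(l+1)}$ time scale when $\widehat{a}(\omega,0)=\Theta(\varepsilon^{1/2l(l+1)})$.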

Since $\widehat{a}(\omega,0)=\Theta(\varepsilon^{1/2l(l+1)})$ , we know that $\widehat{a}(\omega,\tau),\widehat{s}(\omega,\tau)=o(1)$ until $\tau=\Theta(\varepsilon^{-(l-1)/2l(l+1)})-O(1)=\Theta(\varepsilon^{-(l-1)/2l(l% +1)})$ , which means that $\widetilde{a}(\omega,\tau),\widetilde{s}(\omega,\tau)=o(1)$ until $\tau=\Omega(\varepsilon^{-(l-1)/2l(l+1)})$ . As a consequence,

\mathscrsfs{R}_{l}(\tau)=\frac{1}{2}\left(\varphi_{l}-\sigma_{l}\int\widetilde% {a}(\nu,\tau)\widetilde{s}(\nu,\tau)^{l}\mathrm{d}\rho(\nu)\right)^{2}=\frac{1% }{2}\varphi_{l}^{2}-o(1)

until $\tau=\Omega(\varepsilon^{-(l-1)/2l(l+1)})$ . This means that the learning of the $l$ -th component will not begin until $\tau=\Omega(\varepsilon^{-(l-1)/2l(l+1)})$ , namely $\tau(\Delta)=\Omega(\varepsilon^{-(l-1)/2l(l+1)})$ for any fixed $\Delta>0$ . Note that the above argument applies to all of the settings in the theorem statement.

Next, we show that for any fixed $\Delta>0$ , $\tau(\Delta)=O(\varepsilon^{-(l-1)/2l(l+1)})$ , which means that the $l$ -th component can be learnt in $O(\varepsilon^{-(l-1)/2l(l+1)})$ time. To prove this claim by contradiction, assume that there exist $\Delta>0$ and a sequence $\varepsilon_{k}\downarrow 0$ such that

\lim_{k\to\infty}\frac{\tau(\Delta)}{\varepsilon_{k}^{-(l-1)/2l(l+1)}}=+\infty.

(147)

By definition of $\tau(\Delta)$ , we know that $\forall\tau\leq\tau(\Delta)$ ,

\left|\varphi_{l}-\sigma_{l}\int\widetilde{a}(\nu,\tau)\widetilde{s}(\nu,\tau)% ^{l}\mathrm{d}\rho(\nu)\right|\geq\,\sqrt{2\Delta}.

Now, assume the condition of setting (a) holds and denote

A_{\varepsilon_{0},\eta}=\left\{\omega:\forall\varepsilon<\varepsilon_{0},\ % \min(|\widetilde{a}(\omega,0)|,|\widetilde{s}(\omega,0)|)>\eta\varepsilon^{1/2% l(l+1)},\ \sigma_{l}\varphi_{l}\widetilde{a}(\omega,0)\widetilde{s}(\omega,0)^% {l}>0\right\}.

(148)

Then, by definition and by our assumption that $\widetilde{a}(\omega,0)$ is of the same order as $\widetilde{s}(\omega,0)$ , we know that $A=\cup_{\varepsilon_{0}>0,\eta>0}A_{\varepsilon_{0},\eta}$ . Since $\rho(A)>0$ , there exist $\varepsilon_{0},\eta>0$ such that $\rho(A_{\varepsilon_{0},\eta})>0$ . Note that we can choose $\varepsilon_{0}$ and $\eta$ to be arbitrarily small, since the set $A_{\varepsilon_{0},\eta}$ is non-increasing in $\varepsilon_{0}$ and $\eta$ . For $\omega\in A_{\varepsilon_{0},\eta}$ and $\tau\leq\tau(\Delta)$ , we have

\begin{split}\partial_{\tau}(\widetilde{a}(\omega)^{2})=\,&2\sigma_{l}% \widetilde{a}(\omega)\widetilde{s}(\omega)^{l}\left(\varphi_{l}-\sigma_{l}\int% \widetilde{a}(\nu)\widetilde{s}(\nu)^{l}\mathrm{d}\rho(\nu)\right)\geq 2\sqrt{% 2\Delta}\left|\sigma_{l}\widetilde{a}(\omega)\widetilde{s}(\omega)^{l}\right|% \\ \partial_{\tau}(\widetilde{s}(\omega)^{2})=\,&2l\sigma_{l}\widetilde{a}(\omega% )\widetilde{s}(\omega)^{l}\left(1-\varepsilon^{2\beta_{l}}\widetilde{s}(\omega% )^{2}\right)\left(\varphi_{l}-\sigma_{l}\int\widetilde{a}(\nu)\widetilde{s}(% \nu)^{l}\mathrm{d}\rho(\nu)\right)\\ \geq\,&2l\sqrt{2\Delta}\left|\sigma_{l}\widetilde{a}(\omega)\widetilde{s}(% \omega)^{l}\right|\left(1-\varepsilon^{2\beta_{l}}\widetilde{s}(\omega)^{2}% \right).\end{split}

Moreover, we know that at initialization, $|\widetilde{a}(\omega,0)|,|\widetilde{s}(\omega,0)|>\eta\varepsilon^{1/2l(l+1)}$ . Using the ODE comparison theorem and an argument similar to the one used to prove $\tau(\Delta)=\Omega(\varepsilon^{-(l-1)/2l(l+1)})$ , we deduce that for sufficiently large $k$ such that $\varepsilon=\varepsilon_{k}<\varepsilon_{0}$ , there exist constants $C,C^{\prime}>0$ that do not depend on $\varepsilon$ and satisfy the following: for all $\omega\in A_{\varepsilon_{0},\eta}$ and $\tau\geq C\varepsilon^{-(l-1)/2l(l+1)}$ ,

\min\left\{|\widetilde{a}(\omega,\tau)|,\ |\widetilde{s}(\omega,\tau)|\right\}% \geq C^{\prime},\ \sigma_{l}\varphi_{l}\widetilde{a}(\omega,\tau)\widetilde{s}% (\omega,\tau)^{l}>0.

This further implies that at time $\tau$ ,

\int\widetilde{s}(\omega)^{2(l-1)}\left(l^{2}\widetilde{a}(\omega)^{2}\left(1-% \varepsilon^{2\beta_{l}}\widetilde{s}(\omega)^{2}\right)+\widetilde{s}(\omega)% ^{2}\right)\mathrm{d}\rho(\omega)\geq C^{\prime 2l}\rho(A_{\varepsilon_{0},% \eta})>0.

(149)

According to Eq. (98), we know that $\mathscrsfs{R}_{l}$ will decrease to $0$ exponentially fast in an $O(1)$ time window after $\tau=C\varepsilon^{-(l-1)/2l(l+1)}$ , which contradicts our assumption (147). This proves that $\tau(\Delta)=O(\varepsilon^{-(l-1)/2l(l+1)})$ under setting (a). Next, we show that setting (b) can be reduced to setting (a). Under setting (b), let us denote

B_{\varepsilon_{0},\eta}=\left\{\omega:\forall\varepsilon<\varepsilon_{0},\ % \sigma_{l}\varphi_{l}\widetilde{a}(\omega,0)\widetilde{s}(\omega,0)^{l}<0,\ % \text{and}\ \widetilde{s}(\omega,0)^{2}>(l+\eta)\widetilde{a}(\omega,0)^{2}% \right\}.

Then, as in the previous argument, there exist $\varepsilon_{0},\eta>0$ such that $\rho(B_{\varepsilon_{0},\eta})>0$ , and we can further choose $\varepsilon_{0}$ and $\eta$ to be arbitrarily small. For $\omega\in B_{\varepsilon_{0},\eta}$ , we have

\partial_{\tau}(\widetilde{a}(\omega)^{2})<0,\ \partial_{\tau}(\widetilde{s}(% \omega)^{2})<0\ \mbox{at}\ \tau=0.

Hence, both $\widetilde{a}(\omega)^{2}$ and $\widetilde{s}(\omega)^{2}$ will decrease at initialization. Moreover, Eq. (97) implies that

\partial_{\tau}(\widetilde{a}(\omega)^{2})=\frac{\partial_{\tau}(\widetilde{s}% (\omega)^{2})}{l\left(1-\varepsilon^{2\beta_{l}}\widetilde{s}(\omega)^{2}% \right)}=-\frac{\varepsilon^{-2\beta_{l}}}{l}\partial_{\tau}\log\left(1-% \varepsilon^{2\beta_{l}}\widetilde{s}(\omega)^{2}\right).

Integrating both sides of the above equation, we obtain that

\widetilde{a}(\omega,0)^{2}-\widetilde{a}(\omega,\tau)^{2}=-\frac{\varepsilon^% {-2\beta_{l}}}{l}\left(\log\left(1-\varepsilon^{2\beta_{l}}\widetilde{s}(% \omega,0)^{2}\right)-\log\left(1-\varepsilon^{2\beta_{l}}\widetilde{s}(\omega,% \tau)^{2}\right)\right),

(150)

which is close to $\left(\widetilde{s}(\omega,0)^{2}-\widetilde{s}(\omega,\tau)^{2}\right)/l$ as long as $\widetilde{s}(\omega,\tau)=O(1)$ . To be precise, let us define

\tau_{a,\omega}=\inf\{\tau\geq 0:\widetilde{a}(\omega,\tau)=0\},

then we know that $\widetilde{s}(\omega,\tau_{a,\omega})=\Omega(\varepsilon^{1/2l(l+1)})$ and $\tau_{a,\omega}=O(\varepsilon^{-(l-1)/2l(l+1)})$ under the assumption (147), where the latter claim can be proved via the change of variables $\widetilde{a}^{\prime}(\omega)=\varepsilon^{-1/2l(l+1)}\widetilde{a}(\omega)$ and $\widetilde{s}^{\prime}(\omega)=\varepsilon^{-1/2l(l+1)}\widetilde{s}(\omega)$ . Note that after time $\tau_{a,\omega}$ , the sign of $\widetilde{a}(\omega)$ changes. Hence, $\varphi_{l}\sigma_{l}\widetilde{a}(\omega)\widetilde{s}(\omega)^{l}>0$ , and $\widetilde{a}(\omega,\tau)^{2}$ and $\widetilde{s}(\omega,\tau)^{2}$ will begin to increase for $\tau\geq\tau_{a,\omega}$ . Similarly, we can show that in $O(\varepsilon^{-(l-1)/2l(l+1)})$ time after $\tau_{a,\omega}$ , both $\widetilde{a}(\omega)$ and $\widetilde{s}(\omega)$ become of order $\varepsilon^{1/2l(l+1)}$ , and we still have $\varphi_{l}\sigma_{l}\widetilde{a}(\omega)\widetilde{s}(\omega)^{l}>0$ . This reduces case $(b)$ to case $(a)$ .

We have proven that under settings (a) and (b), $\tau(\Delta)=\Theta(\varepsilon^{-(l-1)/2l(l+1)})$ for any fixed $\Delta\in(0,\varphi_{l}^{2}/2)$ . This means that some of the neurons $(\widetilde{a}(\omega),\widetilde{s}(\omega))$ become of order $\Omega(1)$ and the $l$ -th component of the target function is learnt at a timescale of order $\varepsilon^{-(l-1)/2l(l+1)}$ . Next, we show that if the probability measure $\rho$ is discrete, then the evolution of $\mathscrsfs{R}_{l}$ actually happens in an $O(1)$ time window. It suffices to prove that, for any small constant $\Delta\in(0,\varphi_{l}^{2}/4)$ ,

\tau(\Delta)-\tau\left(\frac{\varphi_{l}^{2}}{2}-\Delta\right)=O(1)

(151)

as $\varepsilon\to 0$ . Note that by continuity and monotonicity of $\mathscrsfs{R}_{l}$ , we have

\mathscrsfs{R}_{l}(\tau(\Delta))=\Delta,\ \mathscrsfs{R}_{l}\left(\tau\left(% \frac{\varphi_{l}^{2}}{2}-\Delta\right)\right)=\frac{\varphi_{l}^{2}}{2}-% \Delta,\ \mbox{and \ }\tau(\Delta)\geq\tau\left(\frac{\varphi_{l}^{2}}{2}-% \Delta\right).

By definition of $\mathscrsfs{R}_{l}$ , we know that $\forall\tau\geq\tau(\varphi_{l}^{2}/2-\Delta)$ ,

\left|\int\widetilde{a}(\nu,\tau)\widetilde{s}(\nu,\tau)^{l}\mathrm{d}\rho(\nu% )\right|\geq\frac{1}{|\sigma_{l}|}\left(|\varphi_{l}|-\sqrt{\varphi_{l}^{2}-2% \Delta}\right):=r_{l}(\Delta)>0.

Denote by $\{(\widetilde{a}_{i},\widetilde{s}_{i})\}_{i\in[m]}$ the realizations of $\{(\widetilde{a}(\omega),\widetilde{s}(\omega))\}_{\omega\in\Omega}$ under the discrete measure $\rho$ , and by $\{p_{i}\}_{i\in[m]}$ the point masses of $\rho$ . Then, we know that

\displaystyle r_{l}(\Delta)\leq\left|\int\widetilde{a}(\nu,\tau)\widetilde{s}(% \nu,\tau)^{l}\mathrm{d}\rho(\nu)\right|=\left|\sum_{j=1}^{m}p_{j}\widetilde{a}% _{j}(\tau)\widetilde{s}_{j}(\tau)^{l}\right|\leq\sum_{j=1}^{m}p_{j}\left|% \widetilde{a}_{j}(\tau)\widetilde{s}_{j}(\tau)^{l}\right|,

(152)

which implies that $\exists j\in[m]$ , s.t. $\left|\widetilde{a}_{j}(\tau)\widetilde{s}_{j}(\tau)^{l}\right|\geq r_{l}(\Delta)$ . Applying Lemma 3 yields

		$\displaystyle\int\widetilde{s}(\omega)^{2(l-1)}\left(l^{2}\widetilde{a}(\omega% )^{2}\left(1-\varepsilon^{2\beta_{l}}\widetilde{s}(\omega)^{2}\right)+% \widetilde{s}(\omega)^{2}\right)\mathrm{d}\rho(\omega)$		(153)
	$\displaystyle\geq\,$	$\displaystyle p_{j}\widetilde{s}_{j}^{2(l-1)}\left(l^{2}\widetilde{a}_{j}^{2}% \left(1-\varepsilon^{2\beta_{l}}\widetilde{s}_{j}^{2}\right)+\widetilde{s}_{j}% ^{2}\right)\geq\min_{j\in[m]}p_{j}\cdot c(l,r_{l}(\Delta))>0.$		(154)

It then follows from Eq. (98) that $\mathscrsfs{R}_{l}$ will decrease to $0$ exponentially fast, and Eq. (151) holds consequently. This completes the proof for settings (a) and (b).

We now turn to case (c). By assumption, for almost every $\omega$ there exists $\eta>0$ (which may depend on $\omega$ ) such that

\sigma_{l}\varphi_{l}\widetilde{a}(\omega,0)\widetilde{s}(\omega,0)^{l}<-\eta% \varepsilon^{1/2l},\ \widetilde{s}(\omega,0)^{2}<(l-\eta)\widetilde{a}(\omega,% 0)^{2}

for sufficiently small $\varepsilon$ . Therefore, $\widetilde{s}(\omega,\tau)^{2}$ and $\widetilde{a}(\omega,\tau)^{2}$ will keep decreasing until one of them reaches $0$ , which means that

\int\widetilde{a}(\nu)\widetilde{s}(\nu)^{l}\mathrm{d}\rho(\nu)=o(1)\implies% \left|\varphi_{l}-\sigma_{l}\int\widetilde{a}(\nu)\widetilde{s}(\nu)^{l}% \mathrm{d}\rho(\nu)\right|\geq\frac{|\varphi_{l}|}{2}.

(155)

According to Eq. (150) and the inequality $\widetilde{s}(\omega,0)^{2}<(l-\eta)\widetilde{a}(\omega,0)^{2}$ , $\widetilde{a}(\omega,\tau)^{2}$ will not reach $0$ until $\widetilde{s}(\omega,\tau)^{2}$ reaches $0$ . Furthermore, for any $\tau\geq 0$ ,

	$\displaystyle\widetilde{a}(\omega,\tau)^{2}=\,$	$\displaystyle\widetilde{a}(\omega,0)^{2}+\frac{\varepsilon^{-2\beta_{l}}}{l}% \left(\log\left(1-\varepsilon^{2\beta_{l}}\widetilde{s}(\omega,0)^{2}\right)-% \log\left(1-\varepsilon^{2\beta_{l}}\widetilde{s}(\omega,\tau)^{2}\right)\right)$
	$\displaystyle\geq\,$	$\displaystyle\widetilde{a}(\omega,0)^{2}+\frac{\varepsilon^{-2\beta_{l}}}{l}% \log\left(1-\varepsilon^{2\beta_{l}}\widetilde{s}(\omega,0)^{2}\right)$
	$\displaystyle\geq\,$	$\displaystyle\frac{1}{l-\eta}\widetilde{s}(\omega,0)^{2}-\frac{1}{l-\eta/2}% \widetilde{s}(\omega,0)^{2}\geq\,c(\eta,l,\omega)\varepsilon^{1/l(l+1)},$

thus leading to

	$\displaystyle\partial_{\tau}(\widetilde{s}(\omega)^{2})=\,$	$\displaystyle 2l\sigma_{l}\widetilde{a}(\omega)\widetilde{s}(\omega)^{l}\left(% 1-\varepsilon^{2\beta_{l}}\widetilde{s}(\omega)^{2}\right)\left(\varphi_{l}-% \sigma_{l}\int\widetilde{a}(\nu)\widetilde{s}(\nu)^{l}\mathrm{d}\rho(\nu)\right)$		(156)
	$\displaystyle\leq\,$	$\displaystyle-c(\eta,l,\omega)\varepsilon^{1/2l(l+1)}(\widetilde{s}(\omega)^{2% })^{l/2}.$		(157)

Using again the comparison theorem for ODE, we get that

|\widetilde{s}(\omega,\tau)|\leq\left(|\widetilde{s}(\omega,0)|^{-l+2}+c(\eta,% l,\omega)\varepsilon^{1/2l(l+1)}\tau\right)^{-1/(l-2)}.

(158)

Since $\widetilde{s}(\omega,0)\asymp\varepsilon^{1/2l(l+1)}$ , it follows immediately that for any $\Delta>0$ , there exists a constant $C_{*}(\omega,\Delta)>0$ such that

\tau\geq C_{*}(\omega,\Delta)\varepsilon^{-(l-1)/2l(l+1)}\implies|\widetilde{s% }(\omega,\tau)|\leq\Delta\varepsilon^{1/2l(l+1)}.

(159)

This completes the discussion for case (c), thus concluding the proof of Theorem 1.

Lemma 3.

Let $r>0$ be a constant that does not depend on $\varepsilon$ . Then there exists a constant $c=c(l,r)>0$ that only depends on $l$ and $r$ such that the following holds: For any $a>0$ , $s>0$ satisfying $as^{l}\geq r$ and $\varepsilon^{2\beta_{l}}s^{2}\leq 1$ , we have

s^{2(l-1)}\left(l^{2}a^{2}\left(1-\varepsilon^{2\beta_{l}}s^{2}\right)+s^{2}% \right)\geq c.

(160)

Proof.

If $s\geq 1$ , then we immediately get

s^{2(l-1)}\left(l^{2}a^{2}\left(1-\varepsilon^{2\beta_{l}}s^{2}\right)+s^{2}% \right)\geq s^{2l}\geq 1.

Otherwise $s<1$ , so that $1-\varepsilon^{2\beta_{l}}s^{2}\geq 1-\varepsilon^{2\beta_{l}}\geq 1/2$ for sufficiently small $\varepsilon$ , and consequently

	$\displaystyle s^{2(l-1)}\left(l^{2}a^{2}\left(1-\varepsilon^{2\beta_{l}}s^{2}% \right)+s^{2}\right)\geq\,$	$\displaystyle s^{2(l-1)}\left(\frac{l^{2}a^{2}}{2}+s^{2}\right)\geq s^{2(l-1)}% \left(\frac{l^{2}r^{2}}{2s^{2l}}+s^{2}\right)$
	$\displaystyle=\,$	$\displaystyle\frac{l^{2}r^{2}}{2s^{2}}+s^{2l}\geq c(l,r),$

where the last line follows from the AM-GM inequality. This completes the proof. ∎
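The constant in the last step can be made explicit. The sketch below minimizes $g(s)=\frac{l^{2}r^{2}}{2s^{2}}+s^{2l}$ over $s>0$ using elementary calculus instead of the AM-GM inequality (setting $g^{\prime}(s)=0$ gives $s_{*}^{2l+2}=lr^{2}/2$), and then evaluates the left-hand side of (160). The names `g`, `g_min`, `lemma3_constant`, `lhs_160`, and `eps_pow` (standing for $\varepsilon^{2\beta_{l}}$) are hypothetical and introduced here only for illustration.

```python
def g(s, l, r):
    """Lower bound used in the proof when s < 1: l^2 r^2 / (2 s^2) + s^(2l)."""
    return l ** 2 * r ** 2 / (2.0 * s ** 2) + s ** (2 * l)

def g_min(l, r):
    """Minimizer and minimum of g over s > 0, from g'(s) = 0."""
    s_star = (l * r ** 2 / 2.0) ** (1.0 / (2 * l + 2))
    return s_star, g(s_star, l, r)

def lemma3_constant(l, r):
    """One admissible choice of c(l, r): the smaller of the two case bounds."""
    return min(1.0, g_min(l, r)[1])

def lhs_160(a, s, l, eps_pow):
    """Left-hand side of (160), with eps_pow standing for epsilon^(2 beta_l)."""
    return s ** (2 * (l - 1)) * (l ** 2 * a ** 2 * (1.0 - eps_pow * s ** 2) + s ** 2)

# Illustrative values (assumptions, not from the paper)
l, r, eps_pow = 2, 0.5, 0.01
s_star, c_min = g_min(l, r)
c_lr = lemma3_constant(l, r)

print(s_star, c_min, c_lr)
```

Any smaller positive constant also works, so the value returned by `lemma3_constant` is one valid choice rather than the constant intended by the proof.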

Appendix D Proofs of Theorems 2 and 3: learning with projected SGD

We prove Theorem 2, which bounds the distance between gradient flow (GF) and projected SGD, in Subsections D.1 through D.3; Subsection D.4 is devoted to the proof of Theorem 3. Throughout this section, $M$ denotes any constant that depends only on the $M_{i}$ ’s from Assumptions A1-A3, and its value may change from line to line. We start with an elementary lemma that establishes the Lipschitz continuity of the gradient flow trajectory:

Lemma 4 (A priori estimate).

There exists a constant $M$ that only depends on the $M_{i}$ ’s, such that for all $t\geq 0$ , $\rho_{t}$ is supported on $[-M(1+t/\varepsilon),M(1+t/\varepsilon)]\times\mathbb{S}^{d-1}$ , namely $|a_{i}(t)|\leq M(1+t/\varepsilon)$ for all $i\in[m]$ . Moreover, for any $0\leq s\leq t$ , we have

	$\displaystyle\sup_{j\in[m]}\left|a_{j}(s)-a_{j}(t)\right|\leq\,$	$\displaystyle\varepsilon^{-1}M(t-s),$
	$\displaystyle\sup_{j\in[m]}\left\|{u_{j}(s)-u_{j}(t)}\right\|_{2}\leq\,$	$\displaystyle M(1+t/\varepsilon)^{2}(t-s).$

Proof.

First, notice that along the trajectory of gradient flow, the risk must be non-increasing. In fact, we have

\partial_{t}\mathscrsfs{R}=-m\sum_{i=1}^{m}\left(\varepsilon^{-2}(\partial_{a_% {i}}\mathscrsfs{R})^{2}+(\nabla_{u_{i}}\mathscrsfs{R})^{\top}(I_{d}-u_{i}u_{i}% ^{\top})(\nabla_{u_{i}}\mathscrsfs{R})\right)\leq 0.

Therefore, we obtain that

	$\displaystyle\left|\partial_{t}a_{i}\right|=\,$	$\displaystyle\varepsilon^{-1}\left|\mathbb{E}[y\sigma(\langle u_{i},x\rangle)]-\frac{1}{m}\sum_{j=1}^{m}a_{j}\mathbb{E}[\sigma(\langle u_{i},x\rangle)\sigma(\langle u_{j},x\rangle)]\right|=\varepsilon^{-1}\left|\mathbb{E}\left[(y-\widehat{y})\sigma(\langle u_{i},x\rangle)\right]\right|$
	$\displaystyle\leq\,$	$\displaystyle\varepsilon^{-1}\mathbb{E}\left[(y-\widehat{y})^{2}\right]^{1/2}\mathbb{E}\left[\sigma(\langle u_{i},x\rangle)^{2}\right]^{1/2}\leq\varepsilon^{-1}\sqrt{2\mathscrsfs{R}(0)}\left\|{\sigma}\right\|_{L^{2}}\leq\varepsilon^{-1}M,$

where the last line follows from our assumption. Since $|a_{i}(0)|\leq M$ , we know that $|a_{i}(t)|\leq M(1+t/\varepsilon)$ , and $|a_{i}(t)-a_{i}(s)|\leq\varepsilon^{-1}M(t-s)$ . Moreover, according to Eq. (6), we have

\left\|{\partial_{t}u_{i}}\right\|_{2}\leq|a_{i}|\left(\left\|{V^{\prime}}% \right\|_{\infty}+\left\|{U^{\prime}}\right\|_{\infty}\sup_{j\in[m]}|a_{j}|% \right)\leq M(1+t/\varepsilon)^{2},

thus leading to

\left\|{u_{i}(s)-u_{i}(t)}\right\|_{2}\leq M(1+t/\varepsilon)^{2}(t-s),\ % \forall i\in[m].

This completes the proof. ∎

In what follows we define two discretized versions of Eq.s (5) and (6), namely the gradient descent (GD) and stochastic gradient descent (SGD) dynamics. They will serve as important intermediate objects for our proof.

•

Gradient descent: Let $\eta>0$ be the step size, and let the initialization be the same as gradient flow: $(\tilde{a}_{i}(0),\tilde{u}_{i}(0))=(a_{i}(0),u_{i}(0))$ for all $i\in[m]$ . We have for $k\in\mathbb{N}$ ,

\begin{split}\tilde{a}_{i}(k+1)-\tilde{a}_{i}(k)=\,&-m\varepsilon^{-1}\eta% \partial_{\tilde{a}_{i}(k)}\mathscrsfs{R}\\ &=\varepsilon^{-1}\eta\Bigg{(}V(\langle u_{*},\tilde{u}_{i}(k)\rangle;\left\|{% u_{*}}\right\|_{2},\left\|{\tilde{u}_{i}(k)}\right\|_{2})\\ &\qquad\qquad-\frac{1}{m}\sum_{j=1}^{m}\tilde{a}_{j}(k)U(\langle\tilde{u}_{i}(% k),\tilde{u}_{j}(k)\rangle;\left\|{\tilde{u}_{i}(k)}\right\|_{2},\left\|{% \tilde{u}_{j}(k)}\right\|_{2})\Bigg{)}\\ \tilde{u}_{i}(k+1)-\tilde{u}_{i}(k)=\,&-m\eta\left(I_{d}-\tilde{u}_{i}(k)% \tilde{u}_{i}(k)^{\top}\right)\nabla_{\tilde{u}_{i}(k)}\mathscrsfs{R}\\ =\,&\eta\tilde{a}_{i}(k)\left(I_{d}-\tilde{u}_{i}(k)\tilde{u}_{i}(k)^{\top}% \right)\Bigg{(}\nabla_{\tilde{u}_{i}(k)}V(\langle u_{*},\tilde{u}_{i}(k)% \rangle;\left\|{u_{*}}\right\|_{2},\left\|{\tilde{u}_{i}(k)}\right\|_{2})\\ &\qquad-\frac{1}{m}\sum_{j=1}^{m}\tilde{a}_{j}(k)\nabla_{\tilde{u}_{i}(k)}U(% \langle\tilde{u}_{i}(k),\tilde{u}_{j}(k)\rangle;\left\|{\tilde{u}_{i}(k)}% \right\|_{2},\left\|{\tilde{u}_{j}(k)}\right\|_{2})\Bigg{)},\end{split}

(161)

where we recall from Eq.s (108) and (109):

	$\displaystyle V(\langle u_{*},\tilde{u}_{i}(k)\rangle;\left\|{u_{*}}\right\|_{2},\left\|{\tilde{u}_{i}(k)}\right\|_{2})=\,$	$\displaystyle\mathbb{E}\left[\varphi(\langle u_{*},x\rangle)\sigma(\langle\tilde{u}_{i}(k),x\rangle)\right]=\mathbb{E}\left[y\sigma(\langle\tilde{u}_{i}(k),x\rangle)\right],$
	$\displaystyle\nabla_{\tilde{u}_{i}(k)}V(\langle u_{*},\tilde{u}_{i}(k)\rangle;\left\|{u_{*}}\right\|_{2},\left\|{\tilde{u}_{i}(k)}\right\|_{2})=\,$	$\displaystyle\mathbb{E}\left[\varphi(\langle u_{*},x\rangle)\sigma^{\prime}(\langle\tilde{u}_{i}(k),x\rangle)x\right]=\mathbb{E}\left[y\sigma^{\prime}(\langle\tilde{u}_{i}(k),x\rangle)x\right],$
	$\displaystyle U(\langle\tilde{u}_{i}(k),\tilde{u}_{j}(k)\rangle;\left\|{\tilde{u}_{i}(k)}\right\|_{2},\left\|{\tilde{u}_{j}(k)}\right\|_{2})=\,$	$\displaystyle\mathbb{E}\left[\sigma(\langle\tilde{u}_{i}(k),x\rangle)\sigma(\langle\tilde{u}_{j}(k),x\rangle)\right],$
	$\displaystyle\nabla_{\tilde{u}_{i}(k)}U(\langle\tilde{u}_{i}(k),\tilde{u}_{j}(k)\rangle;\left\|{\tilde{u}_{i}(k)}\right\|_{2},\left\|{\tilde{u}_{j}(k)}\right\|_{2})=\,$	$\displaystyle\mathbb{E}\left[x\sigma^{\prime}(\langle\tilde{u}_{i}(k),x\rangle)\sigma(\langle\tilde{u}_{j}(k),x\rangle)\right].$

By convention, we have $V(s)=V(s;1,1)$ and $U(s)=U(s;1,1)$ for $s\in[-1,1]$ .

•

One-pass stochastic gradient descent: With the same step size and initialization as above, let $\{(x_{k},y_{k})\}_{k\in\mathbb{N}^{*}}$ be i.i.d. samples from $\mathrm{P}\in\mathscr{P}(\mathbb{R}^{d}\times\mathbb{R})$ , where

\mathrm{P}=\operatorname{Law}(x,y),\quad x\sim\mathsf{N}(0,I_{d}),\ y=\varphi(% \langle u_{*},x\rangle).

The iteration equations for one-pass SGD read:

\begin{split}\underline{a}_{i}(k+1)-\underline{a}_{i}(k)=\,&\varepsilon^{-1}% \eta\left(y_{k+1}-\frac{1}{m}\sum_{j=1}^{m}\underline{a}_{j}(k)\sigma(\langle% \underline{u}_{j}(k),x_{k+1}\rangle)\right)\sigma(\langle\underline{u}_{i}(k),% x_{k+1}\rangle)\\ \underline{u}_{i}(k+1)-\underline{u}_{i}(k)=\,&\eta\underline{a}_{i}(k)\left(I% _{d}-\underline{u}_{i}(k)\underline{u}_{i}(k)^{\top}\right)\left(y_{k+1}-\frac% {1}{m}\sum_{j=1}^{m}\underline{a}_{j}(k)\sigma(\langle\underline{u}_{j}(k),x_{% k+1}\rangle)\right)\\ &\times\sigma^{\prime}(\langle\underline{u}_{i}(k),x_{k+1}\rangle)x_{k+1}.\end% {split}

(162)

Note that Eq. (162) can also be written as:

	$\displaystyle\underline{a}_{i}(k+1)=\,$	$\displaystyle\underline{a}_{i}(k)+\varepsilon^{-1}\eta\widehat{F}_{i}(% \underline{\rho}^{(m)}(k);z_{k+1})$
	$\displaystyle\underline{u}_{i}(k+1)=\,$	$\displaystyle\underline{u}_{i}(k)+\eta\left(I_{d}-\underline{u}_{i}(k)% \underline{u}_{i}(k)^{\top}\right)\widehat{G}_{i}(\underline{\rho}^{(m)}(k);z_% {k+1}).$
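For concreteness, the one-pass SGD iteration (162) can be sketched in plain Python as follows. The sketch follows (162) literally as displayed, so the first-layer update moves $\underline{u}_{i}$ along the tangential direction $(I_{d}-\underline{u}_{i}\underline{u}_{i}^{\top})x_{k+1}$ and does not renormalize $\underline{u}_{i}$ afterwards. The choices $\sigma=\varphi=\tanh$, the Gaussian initialization of the second-layer weights, the direction $u_{*}=e_{1}$, and all function names and numerical values are assumptions made only for this illustration.

```python
import math
import random

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def sgd_step(a, u, x, y, eta, eps, sigma, dsigma):
    """One iteration of Eq. (162) on a fresh sample (x, y).

    a: list of m second-layer weights; u: list of m first-layer vectors.
    sigma, dsigma: activation and its derivative, supplied by the caller.
    """
    m = len(a)
    pre = [dot(u_i, x) for u_i in u]  # <u_i, x>
    y_hat = sum(a_j * sigma(p) for a_j, p in zip(a, pre)) / m
    resid = y - y_hat
    new_a, new_u = [], []
    for a_i, u_i, p in zip(a, u, pre):
        new_a.append(a_i + (eta / eps) * resid * sigma(p))
        # (I - u_i u_i^T) x = x - <u_i, x> u_i
        tangent = [x_c - p * u_c for x_c, u_c in zip(x, u_i)]
        scale = eta * a_i * resid * dsigma(p)
        new_u.append([u_c + scale * t_c for u_c, t_c in zip(u_i, tangent)])
    return new_a, new_u

def run(n_steps=200, d=5, m=4, eta=1e-3, eps=0.1, seed=0):
    """Run one-pass SGD with fresh samples x ~ N(0, I_d), y = phi(<u_*, x>)."""
    rng = random.Random(seed)
    sigma = math.tanh                          # assumed activation
    dsigma = lambda z: 1.0 - math.tanh(z) ** 2
    phi = math.tanh                            # assumed target link function
    u_star = [1.0] + [0.0] * (d - 1)           # assumed target direction
    u = []
    for _ in range(m):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(dot(g, g))
        u.append([g_c / norm for g_c in g])    # initialize on the unit sphere
    a = [rng.gauss(0.0, 1.0) for _ in range(m)]
    for _ in range(n_steps):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        y = phi(dot(u_star, x))
        a, u = sgd_step(a, u, x, y, eta, eps, sigma, dsigma)
    return a, u

a_final, u_final = run()
print(a_final)
```

Each sample is used exactly once, and the second-layer step is larger than the first-layer step by the factor $\varepsilon^{-1}$, mirroring the time-scale separation of the gradient flow.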

D.1 Difference between GF and GD

For notational simplicity, we denote $\theta_{i}(t)=(a_{i}(t),u_{i}(t))$ for $i\in[m]$ and $t\geq 0$ , and

\rho^{(m)}(t)=\frac{1}{m}\sum_{i=1}^{m}\delta_{\theta_{i}(t)}=\frac{1}{m}\sum_% {i=1}^{m}\delta_{(a_{i}(t),u_{i}(t))}.

Similarly, $\tilde{\theta}_{i}(k)=(\tilde{a}_{i}(k),\tilde{u}_{i}(k))$ , and

\tilde{\rho}^{(m)}(k)=\frac{1}{m}\sum_{i=1}^{m}\delta_{\tilde{\theta}_{i}(k)}=% \frac{1}{m}\sum_{i=1}^{m}\delta_{(\tilde{a}_{i}(k),\tilde{u}_{i}(k))}.

Moreover, for $\theta=(a,u)$ and $\rho\in\mathscr{P}(\mathbb{R}\times\mathbb{R}^{d})$ , we define the following two functionals:

	$\displaystyle F(\theta,\rho)=\,$	$\displaystyle V(\langle u_{*},u\rangle;\left\|{u_{*}}\right\|_{2},\left\|{u}\right\|_{2})-\int_{\mathbb{R}\times\mathbb{R}^{d}}a^{\prime}U(\langle u,u^{\prime}\rangle;\left\|{u}\right\|_{2},\left\|{u^{\prime}}\right\|_{2})\rho(\mathrm{d}a^{\prime},\mathrm{d}u^{\prime}),$
	$\displaystyle G(\theta,\rho)=\,$	$\displaystyle a\left(I_{d}-uu^{\top}\right)\left(\nabla_{u}V(\langle u_{*},u\rangle;\left\|{u_{*}}\right\|_{2},\left\|{u}\right\|_{2})-\int_{\mathbb{R}\times\mathbb{R}^{d}}a^{\prime}\nabla_{u}U(\langle u,u^{\prime}\rangle;\left\|{u}\right\|_{2},\left\|{u^{\prime}}\right\|_{2})\rho(\mathrm{d}a^{\prime},\mathrm{d}u^{\prime})\right),$

and $H_{\varepsilon}(\theta,\rho)=(\varepsilon^{-1}F(\theta,\rho),G(\theta,\rho))$ . Then, Eq.s (5) and (6) and Eq. (161) can be rewritten as

\frac{\mathrm{d}}{\mathrm{d}t}\theta_{i}(t)=H_{\varepsilon}(\theta_{i}(t),\rho% ^{(m)}(t)),\quad\tilde{\theta}_{i}(k+1)-\tilde{\theta}_{i}(k)=\eta H_{% \varepsilon}(\tilde{\theta}_{i}(k),\tilde{\rho}^{(m)}(k)),

respectively. The lemma below will be used several times in the proof.

Lemma 5.

Denote $\rho^{(m)}=(1/m)\sum_{i=1}^{m}\delta_{\theta_{i}}$ and $\rho^{\prime(m)}=(1/m)\sum_{i=1}^{m}\delta_{\theta^{\prime}_{i}}$ . If $\left\|{u_{i}}\right\|_{2}\leq C$ and $\left\|{u^{\prime}_{i}}\right\|_{2}\leq C$ for all $i\in[m]$ , where $C$ is any fixed absolute constant (for example, $C=2$ ), then we have

$\displaystyle\left|F(\theta_{i},\rho^{(m)})-F(\theta^{\prime}_{i},\rho^{\prime(m)})\right|\leq\,$	$\displaystyle M\left(\left(1+\sup_{j\in[m]}|a_{j}|\right)\cdot\sup_{j\in[m]}\left\|{u_{j}-u^{\prime}_{j}}\right\|_{2}+\sup_{j\in[m]}\left|a_{j}-a^{\prime}_{j}\right|\right),$	(163)
$\displaystyle\left\|{G(\theta_{i},\rho^{(m)})-G(\theta^{\prime}_{i},\rho^{\prime(m)})}\right\|_{2}\leq\,$	$\displaystyle M\cdot\left(1+\sup_{j\in[m]}\left|a_{j}\right|\right)^{2}\cdot\sup_{j\in[m]}\left\|{u_{j}-u^{\prime}_{j}}\right\|_{2}$	(164)
	$\displaystyle+M\cdot\sup_{j\in[m]}\left|a_{j}-a^{\prime}_{j}\right|\cdot\left(1+\sup_{j\in[m]}\left|a_{j}\right|+\sup_{j\in[m]}\left|a_{j}-a^{\prime}_{j}\right|\right),$	(165)

where the constant $M$ only depends on the $M_{i}$ ’s. As a consequence, we obtain that

		$\displaystyle\left\|{H_{\varepsilon}(\theta_{i},\rho^{(m)})-H_{\varepsilon}(\theta^{\prime}_{i},\rho^{\prime(m)})}\right\|_{2}\leq\varepsilon^{-1}\left|F(\theta_{i},\rho^{(m)})-F(\theta^{\prime}_{i},\rho^{\prime(m)})\right|+\left\|{G(\theta_{i},\rho^{(m)})-G(\theta^{\prime}_{i},\rho^{\prime(m)})}\right\|_{2}$
	$\displaystyle\leq\,$	$\displaystyle(\varepsilon^{-1}+1)M\cdot\left(\left(1+\sup_{j\in[m]}\left|a_{j}\right|\right)^{2}\cdot\sup_{j\in[m]}\left\|{u_{j}-u^{\prime}_{j}}\right\|_{2}+\sup_{j\in[m]}\left|a_{j}-a^{\prime}_{j}\right|\cdot\left(1+\sup_{j\in[m]}\left|a_{j}\right|+\sup_{j\in[m]}\left|a_{j}-a^{\prime}_{j}\right|\right)\right)$
	$\displaystyle\leq\,$	$\displaystyle(\varepsilon^{-1}+1)M\cdot\left(\left(1+\sup_{j\in[m]}\left|a_{j}\right|\right)^{2}+\sup_{j\in[m]}\left\|{\theta_{j}-\theta^{\prime}_{j}}\right\|_{2}\right)\cdot\sup_{j\in[m]}\left\|{\theta_{j}-\theta^{\prime}_{j}}\right\|_{2}.$

Proof.

First, by triangle inequality, we have

		$\displaystyle\left|F(\theta_{i},\rho^{(m)})-F(\theta^{\prime}_{i},\rho^{\prime(m)})\right|\leq\left|V(\langle u_{*},u_{i}\rangle;\left\|{u_{*}}\right\|_{2},\left\|{u_{i}}\right\|_{2})-V(\langle u_{*},u^{\prime}_{i}\rangle;\left\|{u_{*}}\right\|_{2},\left\|{u^{\prime}_{i}}\right\|_{2})\right|$
		$\displaystyle+\frac{1}{m}\sum_{j=1}^{m}\left|a_{j}U(\langle u_{i},u_{j}\rangle;\left\|{u_{i}}\right\|_{2},\left\|{u_{j}}\right\|_{2})-a^{\prime}_{j}U(\langle u^{\prime}_{i},u^{\prime}_{j}\rangle;\left\|{u^{\prime}_{i}}\right\|_{2},\left\|{u^{\prime}_{j}}\right\|_{2})\right|$
	$\displaystyle\leq\,$	$\displaystyle\left\|{\nabla V}\right\|_{\infty}\left\|{u_{i}-u^{\prime}_{i}}\right\|_{2}+\frac{\left\|{U}\right\|_{\infty}}{m}\sum_{j=1}^{m}\left|a_{j}-a^{\prime}_{j}\right|+\frac{\left\|{\nabla U}\right\|_{\infty}}{m}\sum_{j=1}^{m}|a_{j}|\cdot\left(\left\|{u_{i}-u^{\prime}_{i}}\right\|_{2}+\left\|{u_{j}-u^{\prime}_{j}}\right\|_{2}\right)$
	$\displaystyle\leq\,$	$\displaystyle M\left(\left(1+\sup_{j\in[m]}|a_{j}|\right)\left\|{u_{i}-u^{\prime}_{i}}\right\|_{2}+\sup_{j\in[m]}\left|a_{j}-a^{\prime}_{j}\right|+\sup_{j\in[m]}|a_{j}|\cdot\sup_{j\in[m]}\left\|{u_{j}-u^{\prime}_{j}}\right\|_{2}\right)$
	$\displaystyle\leq\,$	$\displaystyle M\left(\left(1+\sup_{j\in[m]}|a_{j}|\right)\cdot\sup_{j\in[m]}\left\|{u_{j}-u^{\prime}_{j}}\right\|_{2}+\sup_{j\in[m]}\left|a_{j}-a^{\prime}_{j}\right|\right).$

Second, using again triangle inequality, we deduce that

		$\displaystyle\left\|{G(\theta_{i},\rho^{(m)})-G(\theta^{\prime}_{i},\rho^{\prime(m)})}\right\|_{2}$
	$\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}\,$	$\displaystyle 2C\left|a_{i}\right|\left\|{u_{i}-u^{\prime}_{i}}\right\|_{2}\left(\left\|{\nabla V}\right\|_{\infty}+\frac{\left\|{\nabla U}\right\|_{\infty}}{m}\sum_{j=1}^{m}\left|a_{j}\right|\right)+C\left|a_{i}\right|\cdot\frac{\left\|{\nabla U}\right\|_{\infty}}{m}\sum_{j=1}^{m}\left|a_{j}\right|\cdot\left\|{u_{j}-u^{\prime}_{j}}\right\|_{2}$
		$\displaystyle+C\left|a_{i}\right|\cdot\left(\left\|{\nabla^{2}V}\right\|_{\infty}\left\|{u_{i}-u^{\prime}_{i}}\right\|_{2}+\frac{\left\|{\nabla^{2}U}\right\|_{\infty}}{m}\sum_{j=1}^{m}\left|a_{j}\right|\left(\left\|{u_{i}-u^{\prime}_{i}}\right\|_{2}+\left\|{u_{j}-u^{\prime}_{j}}\right\|_{2}\right)\right)$
		$\displaystyle+C\left|a_{i}-a^{\prime}_{i}\right|\left(\left\|{\nabla V}\right\|_{\infty}+\frac{\left\|{\nabla U}\right\|_{\infty}}{m}\sum_{j=1}^{m}\left|a_{j}\right|\right)+C\left(\left|a_{i}\right|+\left|a^{\prime}_{i}-a_{i}\right|\right)\cdot\frac{\left\|{\nabla U}\right\|_{\infty}}{m}\sum_{j=1}^{m}\left|a_{j}-a^{\prime}_{j}\right|$
	$\displaystyle\leq\,$	$\displaystyle 5M\cdot\left(1+\sup_{j\in[m]}\left|a_{j}\right|\right)^{2}\cdot\sup_{j\in[m]}\left\|{u_{j}-u^{\prime}_{j}}\right\|_{2}+M\cdot\sup_{j\in[m]}\left|a_{j}-a^{\prime}_{j}\right|\cdot\left(1+2\sup_{j\in[m]}\left|a_{j}\right|+\sup_{j\in[m]}\left|a_{j}-a^{\prime}_{j}\right|\right)$
	$\displaystyle\leq\,$	$\displaystyle M\cdot\left(1+\sup_{j\in[m]}\left|a_{j}\right|\right)^{2}\cdot\sup_{j\in[m]}\left\|{u_{j}-u^{\prime}_{j}}\right\|_{2}+M\cdot\sup_{j\in[m]}\left|a_{j}-a^{\prime}_{j}\right|\cdot\left(1+\sup_{j\in[m]}\left|a_{j}\right|+\sup_{j\in[m]}\left|a_{j}-a^{\prime}_{j}\right|\right),$

where $(i)$ follows from the inequality $\left\|{u_{i}u_{i}^{\top}-(u^{\prime}_{i})(u^{\prime}_{i})^{\top}}\right\|_{% \mathrm{op}}\leq 2C\left\|{u_{i}-u^{\prime}_{i}}\right\|_{2}$ , which is a result of the following direct calculation:

\displaystyle\left\|{u_{i}u_{i}^{\top}-(u^{\prime}_{i})(u^{\prime}_{i})^{\top}% }\right\|_{\mathrm{op}}=\sup_{\left\|{x}\right\|_{2}=1}\left|\langle x,u_{i}% \rangle^{2}-\langle x,u^{\prime}_{i}\rangle^{2}\right|\leq 2C\sup_{\left\|{x}% \right\|_{2}=1}\left|\langle x,u_{i}\rangle-\langle x,u^{\prime}_{i}\rangle% \right|=2C\left\|{u_{i}-u^{\prime}_{i}}\right\|_{2}.

This completes the proof of Lemma 5, since the “as a consequence” part follows naturally from the upper bounds obtained earlier. ∎
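The operator-norm inequality used in step $(i)$ can be checked numerically. The sketch below is only an illustration under our own assumptions: it uses the closed form for the two nonzero eigenvalues of the rank-two matrix $uu^{\top}-vv^{\top}$, and the helper names (`rank_two_op_norm`, `random_in_ball`, `worst_ratio`) as well as the dimension and sample size are hypothetical choices, not part of the proof.

```python
import math
import random

def rank_two_op_norm(u, v):
    """Operator norm of u u^T - v v^T via its two nonzero eigenvalues."""
    a = sum(x * x for x in u)             # ||u||^2
    b = sum(x * x for x in v)             # ||v||^2
    c = sum(x * y for x, y in zip(u, v))  # <u, v>
    # Nonzero eigenvalues: ((a - b) +/- sqrt((a + b)^2 - 4 c^2)) / 2.
    disc = math.sqrt(max((a + b) ** 2 - 4.0 * c * c, 0.0))
    return max(abs((a - b) + disc), abs((a - b) - disc)) / 2.0

def random_in_ball(dim, radius, rng):
    """Random vector with Euclidean norm at most `radius`."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in g))
    scale = radius * rng.random() / norm
    return [scale * x for x in g]

def worst_ratio(trials=2000, dim=5, C=2.0, seed=0):
    """Largest observed ||uu^T - vv^T||_op / (2 C ||u - v||_2)."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        u = random_in_ball(dim, C, rng)
        v = random_in_ball(dim, C, rng)
        dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
        worst = max(worst, rank_two_op_norm(u, v) / (2.0 * C * dist))
    return worst
```

If the inequality holds, `worst_ratio()` should never exceed one, up to floating-point rounding.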

Lemma 6.

Following the notation and assumption of Lemma 5, we have

\left|\mathscrsfs{R}(\rho^{(m)})-\mathscrsfs{R}(\rho^{\prime(m)})\right|\leq M% \cdot\left(\left(1+\sup_{j\in[m]}\left|a_{j}\right|\right)^{2}+\sup_{j\in[m]}% \left\|{\theta_{j}-\theta^{\prime}_{j}}\right\|_{2}\right)\cdot\sup_{j\in[m]}% \left\|{\theta_{j}-\theta^{\prime}_{j}}\right\|_{2}.

Proof.

By definition of the risk function and triangle inequality, we deduce that

	$\displaystyle\left|\mathscrsfs{R}(\rho^{(m)})-\mathscrsfs{R}(\rho^{\prime(m)})\right|\leq\,$	$\displaystyle\frac{1}{m}\sum_{i=1}^{m}\left|a_{i}V(\langle u_{*},u_{i}\rangle;\left\|{u_{*}}\right\|_{2},\left\|{u_{i}}\right\|_{2})-a^{\prime}_{i}V(\langle u_{*},u^{\prime}_{i}\rangle;\left\|{u_{*}}\right\|_{2},\left\|{u^{\prime}_{i}}\right\|_{2})\right|$
		$\displaystyle+\frac{1}{m^{2}}\sum_{i,j=1}^{m}\left|a_{i}a_{j}U(\langle u_{i},u_{j}\rangle;\left\|{u_{i}}\right\|_{2},\left\|{u_{j}}\right\|_{2})-a^{\prime}_{i}a^{\prime}_{j}U(\langle u^{\prime}_{i},u^{\prime}_{j}\rangle;\left\|{u^{\prime}_{i}}\right\|_{2},\left\|{u^{\prime}_{j}}\right\|_{2})\right|$
	$\displaystyle\leq\,$	$\displaystyle\frac{\left\|{\nabla V}\right\|_{\infty}}{m}\sum_{i=1}^{m}\left|a_{i}\right|\left\|{u_{i}-u^{\prime}_{i}}\right\|_{2}+\frac{\left\|{V}\right\|_{\infty}}{m}\sum_{i=1}^{m}\left|a_{i}-a^{\prime}_{i}\right|$
		$\displaystyle+\frac{\left\|{U}\right\|_{\infty}}{m^{2}}\sum_{i,j=1}^{m}\left(\left|a_{i}-a^{\prime}_{i}\right|\left|a^{\prime}_{j}\right|+\left|a_{i}\right|\left|a_{j}-a^{\prime}_{j}\right|\right)$
		$\displaystyle+\frac{\left\|{\nabla U}\right\|_{\infty}}{m^{2}}\sum_{i,j=1}^{m}\left|a_{i}\right|\left|a_{j}\right|\cdot\left(\left\|{u_{i}-u^{\prime}_{i}}\right\|_{2}+\left\|{u_{j}-u^{\prime}_{j}}\right\|_{2}\right)$
	$\displaystyle\leq\,$	$\displaystyle M\cdot\left(1+\sup_{j\in[m]}\left|a_{j}\right|\right)^{2}\cdot\sup_{j\in[m]}\left\|{u_{j}-u^{\prime}_{j}}\right\|_{2}$
		$\displaystyle+M\cdot\sup_{j\in[m]}\left|a_{j}-a^{\prime}_{j}\right|\cdot\left(1+\sup_{j\in[m]}\left|a_{j}\right|+\sup_{j\in[m]}\left|a_{j}-a^{\prime}_{j}\right|\right)$
	$\displaystyle\leq\,$	$\displaystyle M\cdot\left(\left(1+\sup_{j\in[m]}\left|a_{j}\right|\right)^{2}+\sup_{j\in[m]}\left\|{\theta_{j}-\theta^{\prime}_{j}}\right\|_{2}\right)\cdot\sup_{j\in[m]}\left\|{\theta_{j}-\theta^{\prime}_{j}}\right\|_{2}.$

This concludes the proof. ∎

First, let us define the error function

\Delta(t)=\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left\|{\tilde{% \theta}_{i}(k)-\theta_{i}(k\eta)}\right\|_{2},

and the stopping time $T_{\Delta}=\inf\{t\geq 0:\Delta(t)\geq 1\}$. For $k\in\mathbb{N}$ and $t=k\eta\leq T_{\Delta}$, we have the following estimate:

	$\displaystyle\left\|{\theta_{i}(t)-\tilde{\theta}_{i}(k)}\right\|_{2}\leq\,$	$\displaystyle\int_{0}^{t}\left\|{H_{\varepsilon}\left(\theta_{i}(s),\rho^{(m)}(s)\right)-H_{\varepsilon}\left(\tilde{\theta}_{i}(\lfloor s/\eta\rfloor),\tilde{\rho}^{(m)}(\lfloor s/\eta\rfloor)\right)}\right\|_{2}\mathrm{d}s$
	$\displaystyle\leq\,$	$\displaystyle\int_{0}^{t}\left\|{H_{\varepsilon}\left(\theta_{i}(s),\rho^{(m)}(s)\right)-H_{\varepsilon}\left(\theta_{i}(\eta\lfloor s/\eta\rfloor),\rho^{(m)}(\eta\lfloor s/\eta\rfloor)\right)}\right\|_{2}\mathrm{d}s$
		$\displaystyle+\int_{0}^{t}\left\|{H_{\varepsilon}\left(\theta_{i}(\eta\lfloor s/\eta\rfloor),\rho^{(m)}(\eta\lfloor s/\eta\rfloor)\right)-H_{\varepsilon}\left(\tilde{\theta}_{i}(\lfloor s/\eta\rfloor),\tilde{\rho}^{(m)}(\lfloor s/\eta\rfloor)\right)}\right\|_{2}\mathrm{d}s.$

For any $s\in[0,t]$, by Lemmas 4 and 5 we have (denoting $[s]=\eta\lfloor s/\eta\rfloor$, and noticing that we can take $C=2$ since $t\leq T_{\Delta}$)

		$\displaystyle\left\|{H_{\varepsilon}\left(\theta_{i}(s),\rho^{(m)}(s)\right)-H_{\varepsilon}\left(\theta_{i}([s]),\rho^{(m)}([s])\right)}\right\|_{2}$
	$\displaystyle\leq\,$	$\displaystyle(\varepsilon^{-1}+1)M(1+t/\varepsilon)^{4}(s-[s])\leq(\varepsilon^{-1}+1)M(1+t/\varepsilon)^{4}\eta.$

Using Lemmas 4 and 5 again, we obtain that

		$\displaystyle\left\|{H_{\varepsilon}\left(\theta_{i}(\eta\lfloor s/\eta\rfloor),\rho^{(m)}(\eta\lfloor s/\eta\rfloor)\right)-H_{\varepsilon}\left(\tilde{\theta}_{i}(\lfloor s/\eta\rfloor),\tilde{\rho}^{(m)}(\lfloor s/\eta\rfloor)\right)}\right\|_{2}$
	$\displaystyle\leq\,$	$\displaystyle(\varepsilon^{-1}+1)M(1+\varepsilon^{-1}s)^{2}\Delta(s)+(\varepsilon^{-1}+1)M\Delta(s)^{2},$

thus leading to

	$\displaystyle\Delta(t)\leq\,$	$\displaystyle(\varepsilon^{-1}+1)Mt(1+t/\varepsilon)^{4}\eta+(\varepsilon^{-1}% +1)\int_{0}^{t}\left(M(1+\varepsilon^{-1}s)^{2}\Delta(s)+M\Delta(s)^{2}\right)% \mathrm{d}s$
	$\displaystyle\leq\,$	$\displaystyle(\varepsilon^{-1}+1)Mt(1+t/\varepsilon)^{4}\eta+(\varepsilon^{-1}% +1)\int_{0}^{t}M(1+\varepsilon^{-1}s)^{2}\cdot\max\left(\Delta(s),\Delta(s)^{2% }\right)\mathrm{d}s.$

For $s\leq t\leq T_{\Delta}$ , we have $\Delta(s)^{2}\leq\Delta(s)$ . Hence,

\Delta(t)\leq(\varepsilon^{-1}+1)Mt(1+t/\varepsilon)^{4}\eta+(\varepsilon^{-1}% +1)\int_{0}^{t}M(1+\varepsilon^{-1}s)^{2}\Delta(s)\mathrm{d}s.

Applying Grönwall’s inequality yields

\Delta(t)\leq(\varepsilon^{-1}+1)Mt(1+t/\varepsilon)^{4}\eta\cdot\exp\left((% \varepsilon^{-1}+1)Mt(1+t/\varepsilon)^{2}\right)\leq M\exp((\varepsilon^{-1}+% 1)Mt(1+t/\varepsilon)^{2})\eta.

Therefore, for all $T\geq 0$ and $\eta\leq 1/(M\exp((\varepsilon^{-1}+1)MT(1+\varepsilon^{-1}T)^{2}))$ , we have

\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left\|{\tilde{\theta}_{i}(k)% -\theta_{i}(k\eta)}\right\|_{2}\leq M\exp((\varepsilon^{-1}+1)Mt(1+t/% \varepsilon)^{2})\eta\leq 1,\ \forall t\leq\min(T,T_{\Delta}).

This proves $T\leq T_{\Delta}$ , and consequently

\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left\|{\tilde{\theta}_{i}(k)% -\theta_{i}(k\eta)}\right\|_{2}\leq M\exp((\varepsilon^{-1}+1)Mt(1+t/% \varepsilon)^{2})\eta,\ \forall t\in[0,T],

which immediately implies that

\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left|\tilde{a}_{i}(k)\right|% \leq M(1+t/\varepsilon)+1\leq M(1+t/\varepsilon).

Finally, with the aid of Lemma 6, we get the following upper bound on the difference between the risk of gradient flow and gradient descent:

	$\displaystyle\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\left|\mathscrsfs{R}(\rho^{(m)}(k\eta))-\mathscrsfs{R}(\tilde{\rho}^{(m)}(k))\right|\leq\,$	$\displaystyle M(M^{2}(1+t/\varepsilon)^{2}+1)M\exp((\varepsilon^{-1}+1)Mt(1+t/\varepsilon)^{2})\eta$
	$\displaystyle\leq\,$	$\displaystyle M\exp((\varepsilon^{-1}+1)Mt(1+t/\varepsilon)^{2})\eta.$

To summarize, we have the following:

Theorem 4 (Difference between GF and GD).

There exists a constant $M$ that only depends on the $M_{i}$ ’s, such that for any $T\geq 0$ and

\eta\leq\frac{1}{M\exp((\varepsilon^{-1}+1)MT(1+\varepsilon^{-1}T)^{2})},

the following holds for all $t\in[0,T]$ :

$\displaystyle\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left|\tilde{a}_{i}(k)\right|\leq\,$	$\displaystyle M(1+t/\varepsilon),$	(166)
$\displaystyle\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left\|{\tilde{\theta}_{i}(k)-\theta_{i}(k\eta)}\right\|_{2}\leq\,$	$\displaystyle M\exp((\varepsilon^{-1}+1)Mt(1+\varepsilon^{-1}T)^{2})\eta,$	(167)
$\displaystyle\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\left|\mathscrsfs{R}(\rho^{(m)}(k\eta))-\mathscrsfs{R}(\tilde{\rho}^{(m)}(k))\right|\leq\,$	$\displaystyle M\exp((\varepsilon^{-1}+1)Mt(1+\varepsilon^{-1}T)^{2})\eta.$	(168)
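The linear dependence on $\eta$ in the bound (167) can be seen on a toy problem. The sketch below is not the dynamics of this paper: it assumes the scalar flow $\mathrm{d}\theta/\mathrm{d}t=-\theta$ with $\theta(0)=1$, whose solution $e^{-t}$ is known in closed form, and compares it with the explicit Euler (gradient descent) iterates. The function name `max_gf_gd_gap` is hypothetical.

```python
import math

def max_gf_gd_gap(eta, T=1.0):
    """sup_k |theta_tilde(k) - theta(k*eta)| for d(theta)/dt = -theta, theta(0) = 1."""
    steps = int(round(T / eta))
    theta_gd = 1.0
    gap = 0.0
    for k in range(steps + 1):
        # Compare the Euler iterate with the exact flow at time k*eta.
        gap = max(gap, abs(theta_gd - math.exp(-k * eta)))
        theta_gd += eta * (-theta_gd)   # explicit Euler (gradient descent) step
    return gap

gap_coarse = max_gf_gd_gap(0.01)
gap_fine = max_gf_gd_gap(0.005)
```

Halving the step size roughly halves the gap, which is the first-order behavior that Theorem 4 quantifies for the full system.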

D.2 Difference between GD and SGD

The proof for this section is almost identical to that in Appendix C.5 of [33]. The only difference is that here we need to verify that $(I_{d}-uu^{\top})\sigma^{\prime}(\langle u,x\rangle)x$ is an $M_{3}$-sub-Gaussian random vector. This follows from the identity $(I_{d}-uu^{\top})x=x-\langle u,x\rangle u$ and Assumption A3. We thus obtain the following interpolation bound between GD and SGD:

Theorem 5 (Difference between GD and SGD).

There exists a constant $M$ that only depends on the $M_{i}$ ’s, such that for any $T,z\geq 0$ and

\eta\leq\frac{1}{(d+\log m+z^{2})M\exp((\varepsilon^{-1}+1)MT(1+\varepsilon^{-% 1}T)^{2})},

the following happens with probability at least $1-\exp(-z^{2})$ : For all $t\in[0,T]$ , we have

$\displaystyle\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left|\underline{a}_{i}(k)\right|\leq\,$	$\displaystyle M(1+t/\varepsilon),$	(169)
$\displaystyle\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left\|{\tilde{\theta}_{i}(k)-\underline{\theta}_{i}(k)}\right\|_{2}\leq\,$	$\displaystyle M\exp((\varepsilon^{-1}+1)Mt(1+\varepsilon^{-1}T)^{2})\sqrt{\eta}\left(\sqrt{d+\log m}+z\right),$	(170)
$\displaystyle\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\left|\mathscrsfs{R}(\underline{\rho}^{(m)}(k))-\mathscrsfs{R}(\tilde{\rho}^{(m)}(k))\right|\leq\,$	$\displaystyle M\exp((\varepsilon^{-1}+1)Mt(1+\varepsilon^{-1}T)^{2})\sqrt{\eta}\left(\sqrt{d+\log m}+z\right).$	(171)

D.3 Difference between SGD and projected SGD

The aim of this section is to prove a coupling bound between the trajectory of SGD and that of projected SGD, thus finally leading to an upper bound on the difference between the risk of projected gradient flow and projected SGD. To begin with, let us fix $T,z\geq 0$ and choose

\eta\leq\frac{1}{(d+\log m+z^{2})M\exp((\varepsilon^{-1}+1)MT(1+\varepsilon^{-% 1}T)^{2})}

as in Theorem 2, where $M$ is a large enough constant (to be determined later). Define

T_{\theta}=\inf\left\{t\geq 0:\max_{k\in[0,t/\eta]\cap\mathbb{N}}\max_{i\in[m]% }\left|\overline{a}_{i}(k)\right|\geq 2M(1+t/\varepsilon),\ \text{or}\ \max_{k% \in[0,t/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left\|{\overline{u}_{i}(k)}\right\|% _{2}\geq 2\right\},

then for $k\leq\min(T,T_{\theta})/\eta$ and $i\in[m]$ , we have (note that here $t=k\eta$ )

	$\displaystyle\left\|{\widehat{G}_{i}(\overline{\rho}^{(m)}(k);z_{k+1})}\right\|_{2}\leq\,$	$\displaystyle M\left|\bar{a}_{i}(k)\right|\left(1+\max_{i\in[m]}\left|\overline{a}_{i}(k)\right|\right)\left\|{\sigma^{\prime}(\langle\bar{u}_{i}(k),x_{k+1}\rangle)x_{k+1}}\right\|_{2}$
	$\displaystyle\leq\,$	$\displaystyle M(1+t/\varepsilon)^{2}\left\|{\sigma^{\prime}(\langle\bar{u}_{i}(k),x_{k+1}\rangle)x_{k+1}}\right\|_{2}.$

Denoting $\mathcal{F}_{k}=\sigma(\bar{\theta}(0),z_{1},\cdots,z_{k})$ , we know from Assumption A3 that, conditioning on $\mathcal{F}_{k}$ , $\sigma^{\prime}(\langle\bar{u}_{i}(k),x_{k+1}\rangle)x_{k+1}$ is an $M_{3}$ -sub-Gaussian random vector. By well-known results on Euclidean norm of sub-Gaussian random vectors (see, e.g., [27]), we know that there exists a constant $M$ satisfying

\mathds{P}\left(\left\|{\sigma^{\prime}(\langle\bar{u}_{i}(k),x_{k+1}\rangle)x% _{k+1}}\right\|_{2}\geq M\left(\sqrt{d}+\sqrt{\log(1/\delta)}\right)\right)% \leq\delta.

Choosing $\delta=\eta\exp(-z^{2})/(mT)$ and applying a union bound gives

\mathds{P}\left(\max_{k\in[0,\min(T,T_{\theta})/\eta]\cap\mathbb{N}}\max_{i\in% [m]}\left\|{\sigma^{\prime}(\langle\bar{u}_{i}(k),x_{k+1}\rangle)x_{k+1}}% \right\|_{2}\leq M\left(\sqrt{d+\log m}+z+T^{2}\right)\right)\geq 1-\exp(-z^{2% }).

Therefore, with probability at least $1-\exp(-z^{2})$ , for all $k\leq\min(T,T_{\theta})/\eta$ and $i\in[m]$ , we have

\left\|{\widehat{G}_{i}(\overline{\rho}^{(m)}(k);z_{k+1})}\right\|_{2}\leq M(1% +t/\varepsilon)^{2}\left(\sqrt{d+\log m}+z+T^{2}\right).

The above bound also holds for the trajectory of SGD, namely after replacing $\overline{\rho}^{(m)}(k)$ with $\underline{\rho}^{(m)}(k)$ . Now, let us define the approximation error $\Delta_{i}(k)=\underline{u}_{i}(k)-\overline{u}_{i}(k)$ for $i\in[m]$ and $k\in\mathbb{N}$ , then we get the following decomposition:

\Delta_{i}(l)=\sum_{k=0}^{l-1}\left(\Delta_{i}(k+1)-\Delta_{i}(k)\right)=\sum_% {k=0}^{l-1}\left(\mathbb{E}\left[\Delta_{i}(k+1)-\Delta_{i}(k)|\mathcal{F}_{k}% \right]+Z_{i}(k+1)\right),

where $Z_{i}(k+1)=\Delta_{i}(k+1)-\Delta_{i}(k)-\mathbb{E}\left[\Delta_{i}(k+1)-% \Delta_{i}(k)|\mathcal{F}_{k}\right]$ has zero mean. With our choice of $\eta$ , one can verify that as long as $\max(d,m,z)\to\infty$ , Lemma 7 is applicable to

u_{1}=\underline{u}_{i}(k),\ g_{1}=\widehat{G}_{i}(\underline{\rho}^{(m)}(k);z% _{k+1}),\ u_{2}=\overline{u}_{i}(k),\ g_{2}=\widehat{G}_{i}(\overline{\rho}^{(% m)}(k);z_{k+1}).

Hence, we deduce from the definition of $\Delta_{i}(k)$ that

	$\displaystyle\Delta_{i}(k+1)-\Delta_{i}(k)=\,$	$\displaystyle\left(\underline{u}_{i}(k+1)-\underline{u}_{i}(k)\right)-\left(\overline{u}_{i}(k+1)-\overline{u}_{i}(k)\right)=(v_{1}-u_{1})-(v_{2}-u_{2})$
	$\displaystyle=\,$	$\displaystyle\eta\left(\left(I_{d}-u_{1}u_{1}^{\top}\right)g_{1}-\left(I_{d}-u_{2}u_{2}^{\top}\right)g_{2}\right)+O\left(\eta^{2}\left\|{g_{2}}\right\|_{2}^{2}\right),$

thus leading to the following estimate:

	$\displaystyle\left\|{\mathbb{E}\left[\Delta_{i}(k+1)-\Delta_{i}(k)|\mathcal{F}_{k}\right]}\right\|_{2}\leq\,$	$\displaystyle\eta\left\|{\mathbb{E}\left[\left(I_{d}-u_{2}u_{2}^{\top}\right)(g_{1}-g_{2})\Big{|}\mathcal{F}_{k}\right]}\right\|_{2}+\eta\left\|{\mathbb{E}\left[\left(u_{2}u_{2}^{\top}-u_{1}u_{1}^{\top}\right)g_{1}\Big{|}\mathcal{F}_{k}\right]}\right\|_{2}$
		$\displaystyle+C\eta^{2}\mathbb{E}\left[\left\|{g_{2}}\right\|_{2}^{2}\Big{|}\mathcal{F}_{k}\right]$
	$\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}\,$	$\displaystyle\eta\left\|{\mathbb{E}\left[(g_{1}-g_{2})|\mathcal{F}_{k}\right]}\right\|_{2}+C\eta\left\|{u_{1}-u_{2}}\right\|_{2}\left\|{\mathbb{E}\left[g_{1}|\mathcal{F}_{k}\right]}\right\|_{2}+C\eta^{2}\mathbb{E}\left[\left\|{g_{2}}\right\|_{2}^{2}\Big{|}\mathcal{F}_{k}\right]$
	$\displaystyle=\,$	$\displaystyle\eta\left\|{\mathbb{E}\left[\left(\widehat{G}_{i}(\underline{\rho}^{(m)}(k);z_{k+1})-\widehat{G}_{i}(\overline{\rho}^{(m)}(k);z_{k+1})\right)\Big{|}\mathcal{F}_{k}\right]}\right\|_{2}$
		$\displaystyle+C\eta\left\|{\overline{u}_{i}(k)-\underline{u}_{i}(k)}\right\|_{2}\cdot\left\|{\mathbb{E}\left[\widehat{G}_{i}(\underline{\rho}^{(m)}(k);z_{k+1})\Big{|}\mathcal{F}_{k}\right]}\right\|_{2}$
		$\displaystyle+C\eta^{2}\mathbb{E}\left[\left\|{\widehat{G}_{i}(\overline{\rho}^{(m)}(k);z_{k+1})}\right\|_{2}^{2}\Big{|}\mathcal{F}_{k}\right],$

where $(i)$ is due to the fact that $u_{1},u_{2}$ are $\mathcal{F}_{k}$-measurable, and $\left\|{u_{1}u_{1}^{\top}-u_{2}u_{2}^{\top}}\right\|_{\mathrm{op}}\leq C\left\|{u_{1}-u_{2}}\right\|_{2}$. According to the definition of $\widehat{G}_{i}$, we obtain that

		$\displaystyle\mathbb{E}\left[\left(\widehat{G}_{i}(\underline{\rho}^{(m)}(k);z_{k+1})-\widehat{G}_{i}(\overline{\rho}^{(m)}(k);z_{k+1})\right)\Big{|}\mathcal{F}_{k}\right]$
	$\displaystyle=\,$	$\displaystyle\underline{a}_{i}(k)\Big{(}\nabla_{\underline{u}_{i}(k)}V\left(\langle u_{*},\underline{u}_{i}(k)\rangle;\left\|{u_{*}}\right\|_{2},\left\|{\underline{u}_{i}(k)}\right\|_{2}\right)$
		$\displaystyle-\frac{1}{m}\sum_{j=1}^{m}\underline{a}_{j}(k)\nabla_{\underline{u}_{i}(k)}U\left(\langle\underline{u}_{i}(k),\underline{u}_{j}(k)\rangle;\left\|{\underline{u}_{i}(k)}\right\|_{2},\left\|{\underline{u}_{j}(k)}\right\|_{2}\right)\Big{)}$
		$\displaystyle-\overline{a}_{i}(k)\Big{(}\nabla_{\overline{u}_{i}(k)}V\left(\langle u_{*},\overline{u}_{i}(k)\rangle;\left\|{u_{*}}\right\|_{2},\left\|{\overline{u}_{i}(k)}\right\|_{2}\right)$
		$\displaystyle-\frac{1}{m}\sum_{j=1}^{m}\overline{a}_{j}(k)\nabla_{\overline{u}_{i}(k)}U\left(\langle\overline{u}_{i}(k),\overline{u}_{j}(k)\rangle;\left\|{\overline{u}_{i}(k)}\right\|_{2},\left\|{\overline{u}_{j}(k)}\right\|_{2}\right)\Big{)},$

thus leading to (using the same argument as in the proof of Lemma 5)

	$\displaystyle\left\|{\mathbb{E}\left[\left(\widehat{G}_{i}(\underline{\rho}^{(m)}(k);z_{k+1})-\widehat{G}_{i}(\overline{\rho}^{(m)}(k);z_{k+1})\right)\Big{|}\mathcal{F}_{k}\right]}\right\|_{2}\leq M\left(1+\varepsilon^{-1}T\right)^{2}\cdot\sup_{j\in[m]}\left\|{\overline{u}_{j}(k)-\underline{u}_{j}(k)}\right\|_{2}$
	$\displaystyle+M\left(1+\varepsilon^{-1}T+\sup_{j\in[m]}\left|\overline{a}_{j}(k)-\underline{a}_{j}(k)\right|\right)\cdot\sup_{j\in[m]}\left|\overline{a}_{j}(k)-\underline{a}_{j}(k)\right|,$

and

\displaystyle\left\|{\mathbb{E}\left[\widehat{G}_{i}(\underline{\rho}^{(m)}(k)% ;z_{k+1})\Big{|}\mathcal{F}_{k}\right]}\right\|_{2}\leq M(1+\varepsilon^{-1}T)% ^{2}.

Moreover, by (conditional) sub-Gaussianity of the $\widehat{G}_{i}$ ’s, we know that

\mathbb{E}\left[\left\|{\widehat{G}_{i}(\overline{\rho}^{(m)}(k);z_{k+1})}% \right\|_{2}^{2}\Big{|}\mathcal{F}_{k}\right]\leq M^{2}(1+\varepsilon^{-1}T)^{% 4}\mathbb{E}\left[\left\|{\sigma^{\prime}(\langle\bar{u}_{i}(k),x_{k+1}\rangle% )x_{k+1}}\right\|_{2}^{2}|\mathcal{F}_{k}\right]\leq M(1+\varepsilon^{-1}T)^{4% }d.

Combining the above estimates, it then follows that

	$\displaystyle\left\|{\mathbb{E}\left[\Delta_{i}(k+1)-\Delta_{i}(k)|\mathcal{F}_{k}\right]}\right\|_{2}\leq\,$	$\displaystyle\eta M\left(1+\varepsilon^{-1}T\right)^{2}\cdot\sup_{j\in[m]}\left\|{\overline{u}_{j}(k)-\underline{u}_{j}(k)}\right\|_{2}+\eta^{2}M(1+\varepsilon^{-1}T)^{4}d$
		$\displaystyle+\eta M\left(1+\varepsilon^{-1}T+\sup_{j\in[m]}\left|\overline{a}_{j}(k)-\underline{a}_{j}(k)\right|\right)\cdot\sup_{j\in[m]}\left|\overline{a}_{j}(k)-\underline{a}_{j}(k)\right|.$

Using the same proof technique as in Appendix C.5 of [33], we conclude that

\mathds{P}\left(\max_{i\in[m]}\max_{l\in[0,\min(T,T_{\theta})/\eta]\cap\mathbb% {N}}\left\|{\sum_{k=0}^{l-1}Z_{i}(k+1)}\right\|_{2}\geq M(1+\varepsilon^{-1}T)% ^{2}\left(\sqrt{d+\log m}+z+T^{2}\right)\sqrt{T\eta}\right)\leq\exp(-z^{2}).

Similarly to the proof of Theorem 4, we define

\Delta(t)=\max_{l\in[0,t/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left\|{\overline{% \theta}_{i}(l)-\underline{\theta}_{i}(l)}\right\|_{2},\quad T_{\Delta}=\inf\{t% \geq 0:\Delta(t)\geq 1\}.

Then, for $l\leq\min(T,T_{\theta},T_{\Delta})/\eta$ , we have

	$\displaystyle\sup_{i\in[m]}\left\|{\overline{u}_{i}(l)-\underline{u}_{i}(l)}\right\|_{2}=\,$	$\displaystyle\sup_{i\in[m]}\left\|{\Delta_{i}(l)}\right\|_{2}\leq\sup_{i\in[m]}\left\{\sum_{k=0}^{l-1}\left\|{\mathbb{E}\left[\Delta_{i}(k+1)-\Delta_{i}(k)|\mathcal{F}_{k}\right]}\right\|_{2}+\left\|{\sum_{k=0}^{l-1}Z_{i}(k+1)}\right\|_{2}\right\}$
	$\displaystyle\leq\,$	$\displaystyle\eta M(1+\varepsilon^{-1}T)^{2}\sum_{k=0}^{l-1}\Delta(k\eta)+l\eta^{2}M(1+\varepsilon^{-1}T)^{4}d$
		$\displaystyle+M(1+\varepsilon^{-1}T)^{2}\left(\sqrt{d+\log m}+z+T^{2}\right)\sqrt{T\eta}.$

Proceeding with the same argument, it follows that

\displaystyle\sup_{i\in[m]}\left|\overline{a}_{i}(l)-\underline{a}_{i}(l)% \right|\leq\varepsilon^{-1}\eta M(1+\varepsilon^{-1}T)^{2}\sum_{k=0}^{l-1}% \Delta(k\eta)+\varepsilon^{-1}M(1+\varepsilon^{-1}T)^{2}\left(\sqrt{d+\log m}+% z+T^{2}\right)\sqrt{T\eta}.

Therefore, we finally conclude that

	$\displaystyle\Delta(l\eta)\leq\,$	$\displaystyle(\varepsilon^{-1}+1)\eta M(1+\varepsilon^{-1}T)^{2}\sum_{k=0}^{l-% 1}\Delta(k\eta)+l\eta^{2}M(1+\varepsilon^{-1}T)^{4}d$
		$\displaystyle+(\varepsilon^{-1}+1)M(1+\varepsilon^{-1}T)^{2}\left(\sqrt{d+\log m% }+z+T^{2}\right)\sqrt{T\eta}$
	$\displaystyle\leq\,$	$\displaystyle(\varepsilon^{-1}+1)\eta M(1+\varepsilon^{-1}T)^{2}\sum_{k=0}^{l-% 1}\Delta(k\eta)+(\varepsilon^{-1}+1)M(1+\varepsilon^{-1}T)^{4}\left(\sqrt{d+% \log m}+z+T^{2}\right)\sqrt{T\eta}.$

Applying Grönwall’s inequality (discrete version) yields that

	$\displaystyle\Delta(l\eta)\leq\,$	$\displaystyle(\varepsilon^{-1}+1)M(1+\varepsilon^{-1}T)^{4}\left(\sqrt{d+\log m% }+z+T^{2}\right)\sqrt{T\eta}$
		$\displaystyle\times\left(1+(\varepsilon^{-1}+1)l\eta M(1+\varepsilon^{-1}T)^{2% }\exp\left((\varepsilon^{-1}+1)l\eta M(1+\varepsilon^{-1}T)^{2}\right)\right)$
	$\displaystyle\leq\,$	$\displaystyle M\exp\left((\varepsilon^{-1}+1)MT(1+\varepsilon^{-1}T)^{2}\right% )\left(\sqrt{d+\log m}+z+T^{2}\right)\sqrt{\eta}$
	$\displaystyle\leq\,$	$\displaystyle M\exp\left((\varepsilon^{-1}+1)MT(1+\varepsilon^{-1}T)^{2}\right% )\left(\sqrt{d+\log m}+z\right)\sqrt{\eta},$

as long as $\max(d,m,z)\to\infty$ with $T=O(1)$ . Note that the above inequality holds for all $l\in[0,\min(T,T_{\theta},T_{\Delta})/\eta]\cap\mathbb{N}$ with probability at least $1-\exp(-z^{2})$ , which further implies that $T_{\theta},T_{\Delta}\geq T$ , and consequently

\sup_{k\in[0,T/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left|\overline{a}_{i}(k)% \right|\leq 2M(1+\varepsilon^{-1}T).
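The discrete Grönwall step used above can be illustrated on the extremal sequence that saturates the recursive inequality. In the sketch below, `b` and `c` are placeholders for the additive term and the prefactor of the sum in the recursion for $\Delta(l\eta)$; their numerical values, and the function names, are assumptions made only for illustration.

```python
import math

def extremal_sequence(b, c, eta, n):
    """Largest sequence with D[l] = c*eta*sum(D[:l]) + b for l = 0..n."""
    seq, running_sum = [], 0.0
    for _ in range(n + 1):
        value = c * eta * running_sum + b
        seq.append(value)
        running_sum += value
    return seq

def gronwall_bound(b, c, eta, l):
    """Bound b * (1 + c*l*eta * exp(c*l*eta)), the form used in the text."""
    x = c * l * eta
    return b * (1.0 + x * math.exp(x))

# Hypothetical constants: b = 0.3, c = 5, eta = 0.01, horizon l*eta <= 2.
seq = extremal_sequence(0.3, 5.0, 0.01, 200)
bounds = [gronwall_bound(0.3, 5.0, 0.01, l) for l in range(201)]
```

The extremal sequence equals $b(1+c\eta)^{l}$, which is at most $b\,e^{cl\eta}$ and hence at most the bound above, since $e^{x}\leq 1+xe^{x}$ for $x\geq 0$.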

Applying again Lemma 6, we deduce that

\sup_{k\in[0,T/\eta]\cap\mathbb{N}}\left|\mathscrsfs{R}(\underline{\rho}^{(m)}% (k))-\mathscrsfs{R}(\overline{\rho}^{(m)}(k))\right|\leq\left(\sqrt{d+\log m}+% z\right)M\exp((\varepsilon^{-1}+1)MT(1+\varepsilon^{-1}T)^{2})\sqrt{\eta}.

Combining the above estimates gives the following:

Theorem 6 (Difference between SGD and projected SGD).

There exists a constant $M$ that only depends on the $M_{i}$ ’s, such that for any $T,z\geq 0$ and

\eta\leq\frac{1}{(d+\log m+z^{2})M\exp((\varepsilon^{-1}+1)MT(1+\varepsilon^{-% 1}T)^{2})},

the following happens with probability at least $1-\exp(-z^{2})$ : For all $t\in[0,T]$ , we have

$\displaystyle\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left|\overline{a}_{i}(k)\right|\leq\,$	$\displaystyle M(1+t/\varepsilon),$	(172)
$\displaystyle\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\max_{i\in[m]}\left\|{\overline{\theta}_{i}(k)-\underline{\theta}_{i}(k)}\right\|_{2}\leq\,$	$\displaystyle M\exp((\varepsilon^{-1}+1)Mt(1+\varepsilon^{-1}T)^{2})\sqrt{\eta}\left(\sqrt{d+\log m}+z\right),$	(173)
$\displaystyle\sup_{k\in[0,t/\eta]\cap\mathbb{N}}\left|\mathscrsfs{R}(\underline{\rho}^{(m)}(k))-\mathscrsfs{R}(\overline{\rho}^{(m)}(k))\right|\leq\,$	$\displaystyle M\exp((\varepsilon^{-1}+1)Mt(1+\varepsilon^{-1}T)^{2})\sqrt{\eta}\left(\sqrt{d+\log m}+z\right).$	(174)

Theorem 2 then follows as a result of combining Theorem 4, Theorem 5, and Theorem 6.

Lemma 7.

Let $v_{1}=u_{1}+\eta(I_{d}-u_{1}u_{1}^{\top})g_{1}$ , $v_{2}=\operatorname{Proj}_{\mathbb{S}^{d-1}}(u_{2}+\eta g_{2})$ , where $\left\|{u_{2}}\right\|_{2}=1$ and $\eta\left\|{g_{2}}\right\|_{2}\leq 1/2$ . Then we have

(v_{1}-u_{1})-(v_{2}-u_{2})=\,\eta\left(\left(I_{d}-u_{1}u_{1}^{\top}\right)g_% {1}-\left(I_{d}-u_{2}u_{2}^{\top}\right)g_{2}\right)+O\left(\eta^{2}\left\|{g_% {2}}\right\|_{2}^{2}\right).

Proof.

Using Taylor expansion, we know that

	$\displaystyle v_{2}=\operatorname{Proj}_{\mathbb{S}^{d-1}}(u_{2}+\eta g_{2})=\,$	$\displaystyle(u_{2}+\eta g_{2})\left(1+2\eta\langle u_{2},g_{2}\rangle+\eta^{2}\left\|{g_{2}}\right\|_{2}^{2}\right)^{-1/2}$
	$\displaystyle=\,$	$\displaystyle(u_{2}+\eta g_{2})\left(1-\eta\langle u_{2},g_{2}\rangle+O(\eta^{2}\left\|{g_{2}}\right\|_{2}^{2})\right)$
	$\displaystyle=\,$	$\displaystyle\left(1-\eta\langle u_{2},g_{2}\rangle\right)u_{2}+\eta g_{2}+O(\eta^{2}\left\|{g_{2}}\right\|_{2}^{2})$
	$\displaystyle=\,$	$\displaystyle u_{2}+\eta(I_{d}-u_{2}u_{2}^{\top})g_{2}+O(\eta^{2}\left\|{g_{2}}\right\|_{2}^{2}),$

which implies

v_{2}-u_{2}=\eta(I_{d}-u_{2}u_{2}^{\top})g_{2}+O(\eta^{2}\left\|{g_{2}}\right% \|_{2}^{2}).

The proof is completed by noting that

v_{1}-u_{1}=\eta(I_{d}-u_{1}u_{1}^{\top})g_{1}.

∎
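The second-order remainder in Lemma 7 can be probed numerically. The following sketch, with hypothetical helper names and arbitrary choices of dimension and sample size, draws random unit vectors $u$ and random $g$ with $\eta\left\|{g}\right\|_{2}\leq 1/2$, and reports the largest observed ratio between the remainder $\operatorname{Proj}_{\mathbb{S}^{d-1}}(u+\eta g)-u-\eta(I_{d}-uu^{\top})g$ and $\eta^{2}\left\|{g}\right\|_{2}^{2}$.

```python
import math
import random

def norm(x):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(t * t for t in x))

def projection_remainder_ratio(u, g, eta):
    """||Proj(u + eta g) - u - eta (I - u u^T) g|| / (eta ||g||)^2 for unit u."""
    w = [ui + eta * gi for ui, gi in zip(u, g)]
    nw = norm(w)
    v = [t / nw for t in w]                            # projection onto the sphere
    ug = sum(ui * gi for ui, gi in zip(u, g))          # <u, g>
    tangent = [gi - ug * ui for ui, gi in zip(u, g)]   # (I - u u^T) g
    rem = [vi - ui - eta * ti for vi, ui, ti in zip(v, u, tangent)]
    return norm(rem) / (eta * norm(g)) ** 2

def worst_remainder_ratio(trials=2000, dim=6, seed=1):
    """Largest remainder ratio over random draws with 0.05 <= eta*||g|| <= 0.5."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        nu = norm(u)
        u = [t / nu for t in u]
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        eta = (0.05 + 0.45 * rng.random()) / norm(g)   # keeps eta*||g|| in [0.05, 0.5]
        worst = max(worst, projection_remainder_ratio(u, g, eta))
    return worst
```

A ratio that stays bounded over all draws is consistent with the $O(\eta^{2}\left\|{g_{2}}\right\|_{2}^{2})$ term in the lemma. The lower cutoff on $\eta\left\|{g}\right\|_{2}$ is there only to keep floating-point noise from dominating the ratio.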

D.4 Proof of Theorem 3

By our assumption, we know that the canonical learning order holds up to level $L$ , and that

\frac{1}{2}\sum_{k\geq L+1}\varphi_{k}^{2}\leq\frac{\delta}{4}.

Then, according to Definition 1, there exists $\varepsilon_{*}=\varepsilon_{*}(\delta)$ , $T_{0}=T_{0}(\delta)$ such that for all $\varepsilon\leq\varepsilon_{*}$ and $T=T(\varepsilon,\delta)=T_{0}(\delta)\varepsilon^{1/2L}$ , one has

\mathscrsfs{R}_{\infty}(T,\varepsilon)=\lim_{m\to\infty}\lim_{d\to\infty}% \mathscrsfs{R}(a(T),u(T))\leq\frac{\delta}{3}.

Moreover, from Section 4 we know that with probability at least $1-e^{-C^{\prime}m}$ over the i.i.d. initialization,

\sup_{t\in[0,T]}\left|\mathscrsfs{R}(a(t),u(t))-\mathscrsfs{R}_{\infty}(t,% \varepsilon)\right|\leq\left(\frac{1}{\sqrt{d}}+\frac{1}{\sqrt{m}}\right)CM^{% \prime}\exp(M^{\prime}T(1+T)^{2}/\varepsilon^{2}),

(175)

where $M^{\prime}$ only depends on $(\sigma,\varphi,{\rm P}_{A})$ . Now we choose $\varepsilon\leq\varepsilon_{*}$ and $T=T(\varepsilon,\delta)=T_{0}(\delta)\varepsilon^{1/2L}$ . It then follows that

\mathscrsfs{R}(a(T),u(T))\leq\frac{\delta}{3}+\left(\frac{1}{\sqrt{d}}+\frac{1% }{\sqrt{m}}\right)CM^{\prime}\exp(M^{\prime}T^{3}/\varepsilon^{2}).

(176)

According to Theorem 2, we know that with probability at least $1-\exp(-z)$ ,

\left|\mathscrsfs{R}(\overline{a}(n),\overline{u}(n))-\mathscrsfs{R}(a(T),u(T)% )\right|\leq\sqrt{\eta(d+\log m+z)}M^{\prime}\exp\left(M^{\prime}T^{3}/% \varepsilon^{3}\right)

(177)

with $n=T/\eta=T(\varepsilon,\delta)/\eta$ . We now take

M=M(\varepsilon,\delta)=\max\left\{\frac{9M^{\prime 2}\exp(2M^{\prime}T(% \varepsilon,\delta)^{3}/\varepsilon^{3})}{\delta^{2}},\ \frac{36C^{2}M^{\prime 2% }\exp(2M^{\prime}T(\varepsilon,\delta)^{3}/\varepsilon^{2})}{\delta^{2}}\right\}.

Then, by our choice of $m$ and $d$ , we know that $\mathscrsfs{R}(a(T),u(T))\leq 2\delta/3$ . Further, taking

\eta=\frac{1}{M(d+\log m+z)},\ n=MT(d+\log m+z),

(178)

we obtain that

\mathscrsfs{R}(\overline{a}(n),\overline{u}(n))\leq\mathscrsfs{R}(a(T),u(T))+% \frac{\delta}{3}\leq\delta.

(179)

The above happens with probability $1-\exp(-C^{\prime}m)-\exp(-z)$ . Hence, our conclusion follows naturally from the assumption $m\geq z$ .
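The arithmetic behind the choice of $M$, $\eta$ and $n$ in Eq. (178) can be sketched as follows. The constants $M^{\prime}$ and $C$ are not specified numerically in the text, so the defaults `M_prime=1.0` and `C=1.0` below are placeholders, and the function name `schedule` is hypothetical.

```python
import math

def schedule(eps, delta, T, d, m, z, M_prime=1.0, C=1.0):
    """Return (M, eta, n) following the displayed choice of M and Eq. (178)."""
    # First entry of the max: controls the SGD error term of Eq. (177).
    term_sgd = 9.0 * M_prime ** 2 * math.exp(2.0 * M_prime * T ** 3 / eps ** 3) / delta ** 2
    # Second entry of the max: controls the finite-(m, d) term of Eq. (176).
    term_mf = 36.0 * C ** 2 * M_prime ** 2 * math.exp(2.0 * M_prime * T ** 3 / eps ** 2) / delta ** 2
    M = max(term_sgd, term_mf)
    scale = d + math.log(m) + z
    eta = 1.0 / (M * scale)
    n = M * T * scale
    return M, eta, n

# Hypothetical inputs, for illustration only.
eps, delta, T, d, m, z = 0.5, 0.1, 0.5, 100, 50, 3.0
M, eta, n = schedule(eps, delta, T, d, m, z)
sgd_error = math.sqrt(eta * (d + math.log(m) + z)) * math.exp(T ** 3 / eps ** 3)
```

With this choice, $n\eta=T$ and the right-hand side of Eq. (177) is at most $\delta/3$, which is the step used to pass to Eq. (179).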

Appendix E Counterexamples to the canonical learning order

E.1 Case 1: $\sigma_{k}=0$ for some $k\in\mathbb{N}$

For any fixed $(a,u)=(a_{i},u_{i})_{1\leq i\leq m}$ , we have

	$\displaystyle\mathbb{E}\left[f(x;a,u)\mathrm{He}_{k}(\langle u_{*},x\rangle)% \right]=\,$	$\displaystyle\frac{1}{m}\sum_{i=1}^{m}a_{i}\mathbb{E}\left[\sigma\left(\langle u% _{i},x\rangle\right)\mathrm{He}_{k}(\langle u_{*},x\rangle)\right]$
	$\displaystyle=\,$	$\displaystyle\frac{1}{m}\sum_{i=1}^{m}a_{i}\sigma_{k}\langle u_{i},u_{*}% \rangle^{k}=0.$

Moreover, the risk is always lower bounded by

		$\displaystyle\mathscrsfs{R}(a,u)=\frac{1}{2}\mathbb{E}\left[\left(\varphi(\langle u_{*},x\rangle)-f(x;a,u)\right)^{2}\right]$
	$\displaystyle=\,$	$\displaystyle\frac{1}{2}\mathbb{E}\left[\left(\varphi_{k}\mathrm{He}_{k}(\langle u_{*},x\rangle)+\left(\varphi(\langle u_{*},x\rangle)-\varphi_{k}\mathrm{He}_{k}(\langle u_{*},x\rangle)-f(x;a,u)\right)\right)^{2}\right]$
	$\displaystyle\stackrel{{\scriptstyle(i)}}{{=}}\,$	$\displaystyle\frac{1}{2}\varphi_{k}^{2}+\frac{1}{2}\mathbb{E}\left[\left(\varphi(\langle u_{*},x\rangle)-\varphi_{k}\mathrm{He}_{k}(\langle u_{*},x\rangle)-f(x;a,u)\right)^{2}\right]\geq\frac{1}{2}\varphi_{k}^{2},$

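where $(i)$ holds because $\mathrm{He}_{k}(\langle u_{*},x\rangle)$ has unit second moment and is orthogonal both to $f(x;a,u)$ (by the previous display) and to $\varphi(\langle u_{*},x\rangle)-\varphi_{k}\mathrm{He}_{k}(\langle u_{*},x\rangle)$ (by definition of the Hermite coefficient $\varphi_{k}$ ). Hence the risk cannot drop below $\varphi_{k}^{2}/2$ when $\sigma_{k}=0$ .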
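The identity $\mathbb{E}[\sigma(\langle u_{i},x\rangle)\mathrm{He}_{k}(\langle u_{*},x\rangle)]=\sigma_{k}\langle u_{i},u_{*}\rangle^{k}$ used in this case can be checked numerically. The sketch below is a minimal illustration, assuming normalized probabilists' Hermite polynomials and a hypothetical polynomial activation with $\sigma_{2}=0$ ; the helper names are not from the paper. It reduces the expectation to a pair of correlated standard Gaussians with correlation $s=\langle u_{i},u_{*}\rangle$ and evaluates it by Gauss-Hermite quadrature.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

def he(k, z):
    """Normalized probabilists' Hermite polynomial, so that E[He_k(G)^2] = 1."""
    coeffs = np.zeros(k + 1)
    coeffs[k] = 1.0
    return hermeval(z, coeffs) / math.sqrt(math.factorial(k))

def correlated_expectation(g, s, n_nodes=40):
    """E[g(G1, G2)] for standard Gaussians (G1, G2) with correlation s."""
    x, w = hermegauss(n_nodes)
    w = w / math.sqrt(2.0 * math.pi)          # weights of the N(0, 1) law
    X, Y = np.meshgrid(x, x, indexing="ij")
    G1 = X                                    # plays the role of <u_i, x>
    G2 = s * X + math.sqrt(1.0 - s * s) * Y   # plays the role of <u_*, x>
    return float(np.sum(np.outer(w, w) * g(G1, G2)))

def sigma(z):
    # Hypothetical activation: sigma_1 = 1, sigma_2 = 0, sigma_3 = 0.5.
    return he(1, z) + 0.5 * he(3, z)

s = 0.6
for k in (1, 2, 3):
    value = correlated_expectation(lambda g1, g2: sigma(g1) * he(k, g2), s)
    print(k, value)
```

Under these assumptions the $k=2$ value should vanish up to quadrature error for every $s$ , so no choice of $(a,u)$ can correlate the network with $\mathrm{He}_{2}(\langle u_{*},x\rangle)$ .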

E.2 Case 2: $\varphi_{0}=\cdots=\varphi_{k}=0$ for some $k\geq 1$

We consider the reduced mean-field equations (24):

\begin{split}\varepsilon\partial_{t}a_{i}=\,&V(s_{i})-\frac{1}{m}\sum_{j=1}^{m% }a_{j}U(s_{i}s_{j})\,,\\ \partial_{t}s_{i}=\,&a_{i}\left(1-s_{i}^{2}\right)\left(V^{\prime}(s_{i})-% \frac{1}{m}\sum_{j=1}^{m}a_{j}U^{\prime}(s_{i}s_{j})s_{j}\right)\,.\end{split}

Note that if $\varphi_{0}=\varphi_{1}=0$ , then $V^{\prime}(s)=s\cdot v(s)$ for some continuous function $v$ . Denoting $a=(a_{1},\cdots,a_{m})$ and $s=(s_{1},\cdots,s_{m})^{\top}$ , the above equation regarding the evolution of the $s_{i}$ ’s can be written as

s^{\prime}(t)=A(a(t),s(t))s(t),

where $A(a,s)$ is a matrix-valued function satisfying

A_{ij}(a,s)=a_{i}(1-s_{i}^{2})\left(v(s_{i})\mathbf{1}_{i=j}-\frac{a_{j}}{m}U^{\prime}(s_{i}s_{j})\right),\ \forall i,j\in[m].

Using an a priori estimate similar to the one in the proof of Lemma 1, we can show that

\sup_{t\in[0,T]}\left\|{A(a(t),s(t))}\right\|_{\mathrm{op}}\leq C(T)<\infty

for any finite time $T$ . For the initialization $s(0)=0$ , Grönwall's inequality applied to this linear system then implies that $s(t)\equiv 0$ for $t\in[0,T]$ . Therefore, no component of $\varphi$ of degree $\geq 1$ can be learned.
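The fixed-point behavior in this case can be illustrated with a small simulation of the reduced equations (24). The sketch below is only a toy under stated assumptions: the polynomials $V(s)=s^{2}$ and $U(s)=1+s+s^{2}$ are hypothetical choices made so that $V^{\prime}(s)=s\cdot v(s)$ with $v\equiv 2$ , and the explicit Euler scheme, step size and function name are illustrative rather than the numerical method of Section 5.

```python
import numpy as np

def simulate_reduced_flow(a0, s0, eps=0.1, dt=1e-3, steps=2000):
    """Explicit Euler integration of the reduced mean-field equations (24).

    Toy choice (an assumption): V(s) = s^2 and U(s) = 1 + s + s^2,
    so that V'(s) = s * v(s) with v(s) = 2.
    """
    a = np.array(a0, dtype=float)
    s = np.array(s0, dtype=float)
    V = lambda x: x**2
    dV = lambda x: 2.0 * x
    U = lambda x: 1.0 + x + x**2
    dU = lambda x: 1.0 + 2.0 * x
    for _ in range(steps):
        P = np.outer(s, s)  # P[i, j] = s_i * s_j
        # eps * da_i/dt = V(s_i) - (1/m) sum_j a_j U(s_i s_j)
        da = (V(s) - (U(P) * a).mean(axis=1)) / eps
        # ds_i/dt = a_i (1 - s_i^2) (V'(s_i) - (1/m) sum_j a_j U'(s_i s_j) s_j)
        ds = a * (1.0 - s**2) * (dV(s) - (dU(P) * (a * s)).mean(axis=1))
        a, s = a + dt * da, s + dt * ds
    return a, s

a_final, s_final = simulate_reduced_flow(np.linspace(-1.0, 2.0, 8), np.zeros(8))
print(a_final.mean(), s_final)
```

Started from $s=0$ , the overlaps never move in this toy, while the average of the $a_{i}$ relaxes on the fast time scale $\varepsilon$ .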

E.3 Case 3: $\varphi_{k}=0$ for some $k\geq 1$

We may assume $\sigma_{k}\neq 0$ , and analyze the simplified ODE system (97), which reduces to

\begin{split}\partial_{\tau}\widetilde{a}(\omega)=\,&-\sigma_{k}^{2}\widetilde% {s}(\omega)^{k}\int\widetilde{a}(\nu)\widetilde{s}(\nu)^{k}\mathrm{d}\rho(\nu)% \\ \partial_{\tau}\widetilde{s}(\omega)=\,&-k\sigma_{k}^{2}\widetilde{a}(\omega)% \widetilde{s}(\omega)^{k-1}\left(1-\varepsilon^{2\beta_{k}}\widetilde{s}(% \omega)^{2}\right)\int\widetilde{a}(\nu)\widetilde{s}(\nu)^{k}\mathrm{d}\rho(% \nu).\end{split}

(180)

We thus obtain the following equations:

\begin{split}\partial_{\tau}\int\widetilde{a}(\omega)^{2}\mathrm{d}\rho(\omega% )=\,&-2\sigma_{k}^{2}\left(\int\widetilde{a}(\omega)\widetilde{s}(\omega)^{k}% \mathrm{d}\rho(\omega)\right)^{2}\leq 0,\\ \partial_{\tau}\int\widetilde{s}(\omega)^{2}\mathrm{d}\rho(\omega)=\,&-2k% \sigma_{k}^{2}\int\widetilde{a}(\omega)\widetilde{s}(\omega)^{k}\left(1-% \varepsilon^{2\beta_{k}}\widetilde{s}(\omega)^{2}\right)\mathrm{d}\rho(\omega)% \cdot\int\widetilde{a}(\omega)\widetilde{s}(\omega)^{k}\mathrm{d}\rho(\omega)% \\ \leq\,&2k\sigma_{k}^{2}\varepsilon^{2\beta_{k}}\int\widetilde{a}(\omega)% \widetilde{s}(\omega)^{k+2}\mathrm{d}\rho(\omega)\cdot\int\widetilde{a}(\omega% )\widetilde{s}(\omega)^{k}\mathrm{d}\rho(\omega),\end{split}

(181)

which means that for any $\tau\geq 0$ ,

\int\widetilde{a}(\omega,\tau)^{2}\mathrm{d}\rho(\omega)\leq\int\widetilde{a}(% \omega,0)^{2}\mathrm{d}\rho(\omega)=O(\varepsilon^{1/k(k+1)})=o(1).

(182)

Therefore, most of the neurons cannot grow to magnitude $\Omega(1)$ while the $k$-th component is being learned, and hence fail to provide an effective initialization for learning the next component $\varphi_{k+1}$ .
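The monotonicity in the first line of (181) can be observed on a discretized version of (180). The following is a minimal sketch, assuming that $\rho$ is replaced by an empirical average over finitely many neurons and that explicit Euler steps with a small step size are accurate enough; the function name, the parameter `eps_pow` (standing for $\varepsilon^{2\beta_{k}}$ ) and the initial values are hypothetical.

```python
import numpy as np

def simulate_180(a0, s0, k=2, sigma_k=1.0, eps_pow=1e-2, dtau=1e-3, steps=2000):
    """Explicit Euler steps for system (180) with rho an empirical average.

    Returns the history of mean(a^2), the discrete analogue of
    the integral of a^2 with respect to rho.
    """
    a = np.array(a0, dtype=float)
    s = np.array(s0, dtype=float)
    history = [float(np.mean(a**2))]
    for _ in range(steps):
        overlap = np.mean(a * s**k)  # discrete integral of a * s^k d rho
        da = -sigma_k**2 * s**k * overlap
        ds = -k * sigma_k**2 * a * s**(k - 1) * (1.0 - eps_pow * s**2) * overlap
        a, s = a + dtau * da, s + dtau * ds
        history.append(float(np.mean(a**2)))
    return np.array(history)

idx = np.arange(50)
history = simulate_180(0.1 * np.cos(idx), 1.0 + 0.5 * np.sin(idx))
print(history[0], history[-1])
```

In this toy run the second moment of $\widetilde{a}$ should never increase, in line with (182): the second-layer weights stay at the scale of their initialization.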

	$\displaystyle\left\|\partial_{t}(a_{i}-a_{i}^{0})\right\|\leq\,$	$\displaystyle\frac{M}{\varepsilon}\left(\left\|s_{i}-s_{i}^{0}\right\|+\frac{1}{% m}\sum_{j=1}^{m}\left\|a_{j}-a_{j}^{0}\right\|\right)+\frac{M(1+t/\varepsilon)}{% \varepsilon}\cdot\frac{1}{m}\sum_{j=1}^{m}\left\|r_{ij}-r_{ij}^{0}\right\|,$
	$\displaystyle\left\|\partial_{t}(s_{i}-s_{i}^{0})\right\|\leq\,$	$\displaystyle M(1+t/\varepsilon)\cdot\left(\left\|a_{i}-a_{i}^{0}\right\|+\frac{% 1}{m}\sum_{j=1}^{m}\left\|a_{j}-a_{j}^{0}\right\|\right)$
		$\displaystyle+M(1+t/\varepsilon)^{2}\cdot\left(\left\|s_{i}-s_{i}^{0}\right\|+% \frac{1}{m}\sum_{j=1}^{m}\left(\left\|s_{j}-s_{j}^{0}\right\|+\left\|r_{ij}-r_{ij% }^{0}\right\|\right)\right),$

		$\displaystyle\left\|\partial_{t}(a_{i}-a_{i}^{\mbox{\tiny\rm mf}})\right\|=\,% \frac{1}{\varepsilon}\cdot\left\|V(s_{i})-V(s_{i}^{\mbox{\tiny\rm mf}})-\frac{1% }{m}\sum_{j=1}^{m}\left(a_{j}U(r_{ij})-a_{j}^{\mbox{\tiny\rm mf}}U(s_{i}^{% \mbox{\tiny\rm mf}}s_{j}^{\mbox{\tiny\rm mf}})\right)\right\|$
	$\displaystyle\leq\,$	$\displaystyle\frac{1}{\varepsilon}\cdot\left(\left\|V(s_{i})-V(s_{i}^{\mbox{% \tiny\rm mf}})\right\|+\frac{1}{m}\sum_{j=1}^{m}\left\|a_{j}U(r_{ij})-a_{j}U(s_{% i}s_{j})\right\|+\frac{1}{m}\sum_{j=1}^{m}\left\|a_{j}U(s_{i}s_{j})-a_{j}^{\mbox% {\tiny\rm mf}}U(s_{i}^{\mbox{\tiny\rm mf}}s_{j}^{\mbox{\tiny\rm mf}})\right\|\right)$
	$\displaystyle\leq\,$	$\displaystyle\frac{1}{\varepsilon}\cdot\left(\left\|V^{\prime}\right\|_{\infty}\|s_{i}-s_{i}^{\mbox{\tiny\rm mf}}\|+\frac{M(1+t/\varepsilon)}{m}\left\|U^{\prime}\right\|_{\infty}\sum_{j=1}^{m}\left\|r_{ij}^{\perp}\right\|\right)$
		$\displaystyle+\frac{1}{m\varepsilon}\sum_{j=1}^{m}\left(\left\|U\right\|_{\infty}\|a_{j}-a_{j}^{\mbox{\tiny\rm mf}}\|+M\left(1+\frac{t}{\varepsilon}\right)\left\|U^{\prime}\right\|_{\infty}\left(\|s_{i}-s_{i}^{\mbox{\tiny\rm mf}}\|+\|s_{j}-s_{j}^{\mbox{\tiny\rm mf}}\|\right)\right)$
	$\displaystyle\leq\,$	$\displaystyle\frac{M}{\varepsilon}\left(1+\frac{t}{\varepsilon}\right)\cdot% \left(\|s_{i}-s_{i}^{\mbox{\tiny\rm mf}}\|+\frac{1}{m}\sum_{j=1}^{m}\left(\|s_{j}% -s_{j}^{\mbox{\tiny\rm mf}}\|+\|a_{j}-a_{j}^{\mbox{\tiny\rm mf}}\|+\left\|r_{ij}^{% \perp}\right\|\right)\right),$

	$\displaystyle\left\|\partial_{t}(s_{i}-s_{i}^{\mbox{\tiny\rm mf}})\right\|\leq\,$	$\displaystyle\left\|V^{\prime}\right\|_{\infty}\|a_{i}-a_{i}^{\mbox{\tiny\rm mf}}\|+M\left(1+\frac{t}{\varepsilon}\right)\left(\left\|V^{\prime\prime}\right\|_{\infty}+2\left\|V^{\prime}\right\|_{\infty}\right)\|s_{i}-s_{i}^{\mbox{\tiny\rm mf}}\|$
		$\displaystyle+\frac{1}{m}\left\|a_{i}\sum_{j=1}^{m}a_{j}U^{\prime}(r_{ij})(s_{j% }-r_{ij}s_{i})-a_{i}\sum_{j=1}^{m}a_{j}U^{\prime}(s_{i}s_{j})(1-s_{i}^{2})s_{j% }\right\|$
		$\displaystyle+\frac{1}{m}\left\|a_{i}\sum_{j=1}^{m}a_{j}U^{\prime}(s_{i}s_{j})(% 1-s_{i}^{2})s_{j}-a_{i}^{\mbox{\tiny\rm mf}}\sum_{j=1}^{m}a_{j}^{\mbox{\tiny% \rm mf}}U^{\prime}(s_{i}^{\mbox{\tiny\rm mf}}s_{j}^{\mbox{\tiny\rm mf}})(1-(s_% {i}^{\mbox{\tiny\rm mf}})^{2})s_{j}^{\mbox{\tiny\rm mf}}\right\|$
	$\displaystyle\leq\,$	$\displaystyle M\left(1+\frac{t}{\varepsilon}\right)\cdot\left(\|a_{i}-a_{i}^{\mbox{\tiny\rm mf}}\|+\|s_{i}-s_{i}^{\mbox{\tiny\rm mf}}\|\right)+\frac{2}{m}\cdot M\left(1+\frac{t}{\varepsilon}\right)\left(\left\|U^{\prime}\right\|_{\infty}+\left\|U^{\prime\prime}\right\|_{\infty}\right)\sum_{j=1}^{m}\|a_{j}\|\left\|r_{ij}^{\perp}\right\|$
		$\displaystyle+M\left(1+\frac{t}{\varepsilon}\right)^{2}\cdot\frac{1}{m}\sum_{j% =1}^{m}\left(\|a_{j}-a_{j}^{\mbox{\tiny\rm mf}}\|+\|s_{j}-s_{j}^{\mbox{\tiny\rm mf% }}\|\right)$
	$\displaystyle\leq\,$	$\displaystyle M\left(1+\frac{t}{\varepsilon}\right)^{2}\cdot\left(\|a_{i}-a_{i}% ^{\mbox{\tiny\rm mf}}\|+\|s_{i}-s_{i}^{\mbox{\tiny\rm mf}}\|+\frac{1}{m}\sum_{j=1% }^{m}\left(\|s_{j}-s_{j}^{\mbox{\tiny\rm mf}}\|+\|a_{j}-a_{j}^{\mbox{\tiny\rm mf}% }\|+\left\|r_{ij}^{\perp}\right\|\right)\right),$

