Kernel vs. Kernel: Exploring How the Data Structure Affects Neural Collapse

Vignesh Kothapalli
LinkedIn Inc.
[email protected]

Tom Tirer
Bar-Ilan University, Israel
[email protected]

Abstract

Recently, a vast amount of literature has focused on the “Neural Collapse” (NC) phenomenon, which emerges when training neural network (NN) classifiers beyond the zero training error point. The core component of NC is the decrease in the within-class variability of the network’s deepest features, dubbed as NC1. The theoretical works that study NC are typically based on simplified unconstrained features models (UFMs) that mask any effect of the data on the extent of collapse. In this paper, we provide a kernel-based analysis that does not suffer from this limitation. First, given a kernel function, we establish expressions for the traces of the within- and between-class covariance matrices of the samples’ features (and consequently an NC1 metric). Then, we turn to focus on kernels associated with shallow NNs. First, we consider the NN Gaussian Process kernel (NNGP), associated with the network at initialization, and the complement Neural Tangent Kernel (NTK), associated with its training in the “lazy regime”. Interestingly, we show that the NTK does not represent more collapsed features than the NNGP for prototypical data models. As NC emerges from training, we then consider an alternative to NTK: the recently proposed adaptive kernel, which generalizes NNGP to model the feature mapping learned from the training data. Contrasting our NC1 analysis for these two kernels enables gaining insights into the effect of data distribution on the extent of collapse, which are empirically aligned with the behavior observed with practical training of NNs.

1 Introduction

Deep Neural Network classifiers are often trained beyond the zero training error point [1, 2]. In this regime, a phenomenon dubbed “Neural Collapse” (NC) emerges [3]. NC is typically described by the following components: (NC1) the networks’ deepest features exhibit a significant decrease in the variability of within-class samples, (NC2) the mean features of different classes approach a certain symmetric structure, and (NC3) the last layer’s weights become more aligned with the penultimate layer features’ means. This behavior has been observed both when using the cross-entropy (CE) loss [3] and the mean squared error (MSE) loss [4].

Recently, a vast amount of literature has been dedicated to exploring NC (as surveyed in [5]), studying the effect of imbalanced data [6, 7], depthwise evolution [8, 9, 10, 11], fine-grained structures [12, 13, 14], and implications [15, 16, 17]. Note that without a sufficient decrease in the features’ within-class variability around the class means, measured by NC1 metrics, one may not gain valuable insights from the structure of the means of different classes. Therefore, oftentimes, NC papers focus specifically on NC1 rather than on other components of NC [12, 13, 14, 16, 11, 18]. Notably, most of the works that attempt to theoretically analyze the NC behavior [19, 6, 15, 4, 20, 8, 21, 7, 22, 12, 23, 10, 24] are based on variants of the unconstrained features model (UFM) [19], which treats the deepest features of the training samples as free optimization variables. Therefore, these analyses cannot predict the effect of the data structure on the extent of collapse.

Since theoretically analyzing the behavior of (deep) NNs is challenging, simplifying approaches that are based on kernel methods [25] have gained massive popularity [26, 27, 28, 29, 30, 31]. Prominent examples include the NN Gaussian Process kernel (NNGP) [26, 27, 32], associated with the infinitely wide NNs at initialization, and the Neural Tangent Kernel (NTK) [28], associated with their training in the “lazy regime”, where the learning rate is sufficiently small. These approaches, and in particular NTK, were used to provide mathematical reasoning for deep learning phenomena such as achieving zero training loss [28, 30, 31], faster learning of lower frequencies [33], benefits of ResNets over fully connected networks [34, 35, 36], usefulness of positional encoding in coordinated-based NNs [37], and more. More recently, finite width variants of the aforementioned kernels have been studied [38, 39, 40], aiming to mitigate the gap between practical NNs behavior and infinite width analyses [41, 42, 43, 44, 45].

In this paper, we provide a kernel-based analysis for NC1, the core component of NC, which does not suffer from the limitations of UFM-based analysis. Thus, it allows us to explore how the data structure affects the collapse. Since a kernel provides a fixed feature mapping, we propose a “kernel vs. kernel” analysis, that is, gaining insights by comparing the NC1 properties across NN-related kernels.

Our main contributions can be summarized as follows:

• Given an arbitrary kernel function, we establish expressions for the traces of the within- and between-class covariance matrices of the samples’ features, and consequently, an NC1 metric that depends on the features only through the kernel function.

• We specialize our kernel-based NC1 to kernels associated with shallow NNs. We analyze it for NNGP (NN at initialization), and NTK (NN trained in the “lazy regime”) and show that, perhaps surprisingly, the NTK does not represent more collapsed features than the NNGP for prototypical data models.

• As NC emerges from training, we consider an alternative to NTK: the recently proposed adaptive kernel [40], which generalizes NNGP to model the feature mapping learned from the training data. Contrasting our NC1 analysis for these two kernels enables gaining deeper insights into the effect of data distribution on the extent of collapse, which are empirically aligned with the behavior observed with practical training of NNs.

2 Related Work

As mentioned in the previous section, most of the works that attempt to analyze the NC behavior theoretically [19, 6, 15, 4, 20, 8, 21, 7, 22, 23, 10] are based on variants of the UFM [19]. The work in [12] attempts to generalize the UFM by adding a penalty term to the loss that ensures that the features matrix is in the vicinity of a predefined matrix. Yet, the model still lacks an explicit connection between the features and the data. In [46], the authors avoid optimizing the features directly but assume that the model is linear and that the data is nearly orthogonal, which are restrictive assumptions. In [47], the authors claim that having an exact class-wise block structure in the Gram matrix of the empirical NTK on training samples implies NC. Yet, they do not provide reasoning for reaching this collapsed Gram matrix and the analysis is still disconnected from the data. Here, however, we fully depart from the UFM approach and provide analysis that explicitly depends on the data. Furthermore, unlike most works, our analysis is applicable to the less studied case where the data is class imbalanced [6, 17, 23, 7]. Our kernel-based analysis utilizes results on NNGP and NTK from [48, 49, 27, 28, 31]. To simplify the analysis, we theoretically analyze kernels associated with shallow fully connected NNs. Focusing on shallow networks is justified by recent works demonstrating monotonic depthwise evolution of NC1 [8, 9, 10, 11, 12]. Specifically, a data structure that promotes a larger reduction in NC1 for shallow NNs is expected to be more collapsed when using deep NNs. Since NC is related to training, and we show the limitation of NTK to capture it when compared to NNGP, we also utilize the generalization of NNGP that has been proposed in [40]. In this adaptive kernel model, there is an explicit kernel function that depends on the training data. 
Recently, this richer model has been used to study phase transition behaviors, such as grokking [50], that cannot be captured with the data-independent kernels.

3 Problem Setup

In this section, we outline the notations and the setup.

$\displaystyle\bullet$ Data: We consider a dataset $\displaystyle\mathbf{X}\in\mathbb{R}^{d_{0}\times N}$ , comprising $\displaystyle N$ data points of dimension $\displaystyle d_{0}$ belonging to $\displaystyle C$ classes. Each class has size $\displaystyle n_{c},c\in[C]$ , where $\displaystyle[C]:=\{1,2,\cdots,C\}$ and $\displaystyle\sum\nolimits_{c}{n_{c}}=N$ . The dataset is represented in an “organized” matrix form, with samples grouped by class, as $\displaystyle\mathbf{X}=\begin{bmatrix}\mathbf{x}^{1,1}&\cdots&\mathbf{x}^{1,n_{1}}&\mathbf{x}^{2,1}&\cdots&\mathbf{x}^{C,n_{C}}\end{bmatrix}\in\mathbb{R}^{d_{0}\times N}$ , where $\displaystyle\mathbf{x}^{c,i}\in\mathbb{R}^{d_{0}}$ represents the $\displaystyle i^{th}$ data point of the $\displaystyle c^{th}$ class. Specific assumptions on the data distribution are stated later in the paper, together with the related theory or experiments.

$\displaystyle\bullet$ Neural Network: Unless stated otherwise, we consider a 2-layer fully connected neural network (2L-FCN) $\displaystyle\psi:\mathbb{R}^{d_{0}}\to\mathbb{R}^{d_{2}}$ with $\displaystyle l^{th}$ layer width $\displaystyle d_{l},l\in\{1,2\}$ , and point-wise activation function $\displaystyle\phi(\cdot):\mathbb{R}\to\mathbb{R}$ . Let $\displaystyle\mathbf{W}^{(l)}\in\mathbb{R}^{d_{l}\times d_{l-1}},\mathbf{b}^{(l)}\in\mathbb{R}^{d_{l}}$ denote the weight and bias parameters of the $\displaystyle l^{th}$ layer. At initialization, the entries $\displaystyle W_{ij}^{(l)},b_{i}^{(l)}$ are drawn i.i.d. from Gaussian distributions with mean $\displaystyle 0$ and variances $\displaystyle\sigma_{w}^{2}/d_{l-1},\sigma_{b}^{2}$ , respectively. For an input $\displaystyle\mathbf{x}\in\mathbb{R}^{d_{0}}$ to the network $\displaystyle\psi(\cdot)$ , we denote the $\displaystyle i^{th}$ component of the output vector $\displaystyle\hat{y}_{i}(\mathbf{x})\in\mathbb{R}$ as follows:

\hat{y}_{i}(\mathbf{x})=b_{i}^{(2)}+\sum_{j=1}^{d_{1}}W_{ij}^{(2)}\phi\left(z_{j}(\mathbf{x})\right),\qquad z_{j}(\mathbf{x})=b_{j}^{(1)}+\sum_{k=1}^{d_{0}}W_{jk}^{(1)}x_{k}.

(1)

$\displaystyle\bullet$ Task: We train the network $\displaystyle\psi(\cdot)$ to classify the data points $\displaystyle\mathbf{x}^{c,i},c\in[C],i\in[n_{c}]$ to their respective classes. Let $\displaystyle\hat{\mathbf{Y}},\mathbf{Y}\in\mathbb{R}^{d_{2}\times N}$ denote the prediction and ground truth label matrices respectively:

\hat{\mathbf{Y}}=\begin{bmatrix}\psi(\mathbf{x}^{1,1})&\cdots&\psi(\mathbf{x}^{C,n_{C}})\end{bmatrix}=\begin{bmatrix}\hat{\mathbf{y}}^{1,1}&\cdots&\hat{\mathbf{y}}^{C,n_{C}}\end{bmatrix},\qquad\mathbf{Y}=\begin{bmatrix}\mathbf{y}^{1,1}&\cdots&\mathbf{y}^{C,n_{C}}\end{bmatrix}

(2)

We aim to minimize the Mean Squared Error (MSE) between $\displaystyle\hat{\mathbf{Y}},\mathbf{Y}$ using the following objective:

\mathcal{R}(\psi,\mathbf{X},\mathbf{Y})=\frac{1}{N}\left\|\hat{\mathbf{Y}}-\mathbf{Y}\right\|_{F}^{2}+\lambda\sum_{l=1}^{2}\left(\left\|\mathbf{W}^{(l)}\right\|_{F}^{2}+\left\|\mathbf{b}^{(l)}\right\|_{F}^{2}\right).

(3)

Note that training deep classifiers with MSE loss has been shown to be a useful strategy [51, 4], which is commonly considered in NC analyses [4, 8, 21].
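To make the setup in (1)-(3) concrete, the following is a minimal NumPy sketch of the 2L-FCN forward pass and the regularized MSE objective. The function names (`init_params`, `forward`, `objective`), the default hyperparameter values, and the choice of ReLU as the default activation are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def init_params(d0, d1, d2, sigma_w=1.0, sigma_b=0.1, seed=0):
    """Gaussian init: weights ~ N(0, sigma_w^2 / d_{l-1}), biases ~ N(0, sigma_b^2)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, sigma_w / np.sqrt(d0), size=(d1, d0)),
        "b1": rng.normal(0.0, sigma_b, size=d1),
        "W2": rng.normal(0.0, sigma_w / np.sqrt(d1), size=(d2, d1)),
        "b2": rng.normal(0.0, sigma_b, size=d2),
    }

def forward(params, X, phi=lambda z: np.maximum(z, 0.0)):
    """Eq. (1) applied column-wise to X of shape (d0, N); returns (d2, N)."""
    Z = params["W1"] @ X + params["b1"][:, None]          # pre-activations, (d1, N)
    return params["W2"] @ phi(Z) + params["b2"][:, None]  # network outputs, (d2, N)

def objective(params, X, Y, lam=1e-3):
    """Eq. (3): MSE averaged over N samples plus an L2 penalty on all parameters."""
    N = X.shape[1]
    fit = np.sum((forward(params, X) - Y) ** 2) / N
    reg = sum(np.sum(p ** 2) for p in params.values())
    return fit + lam * reg
```

Passing a different `phi` (for example an Erf implementation) changes only the activation; the objective is otherwise unchanged.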

$\displaystyle\bullet$ Pre- and Post-activation Kernels: For any two inputs $\displaystyle\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j}\in\mathbb{R}^{d_{0}}$ , we denote their corresponding pre- and post-activation features as $\displaystyle\mathbf{z}^{c,i},\mathbf{z}^{c^{\prime},j}\in\mathbb{R}^{d_{1}}$ and $\displaystyle\phi(\mathbf{z}^{c,i}),\phi(\mathbf{z}^{c^{\prime},j})\in\mathbb{R}^{d_{1}}$ , respectively. The pre- and post-activation kernels corresponding to layer $\displaystyle l=1$ are given by:

\begin{split}K^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})&=\mathbf{z}^{c,i\top}\mathbf{z}^{c^{\prime},j}=\left(\mathbf{b}^{(1)}+\mathbf{W}^{(1)}\mathbf{x}^{c,i}\right)^{\top}\left(\mathbf{b}^{(1)}+\mathbf{W}^{(1)}\mathbf{x}^{c^{\prime},j}\right),\\ Q^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})&=\phi(\mathbf{z}^{c,i})^{\top}\phi(\mathbf{z}^{c^{\prime},j}).\end{split}

(4)

4 Within-Class Variability Metric (NC1) for Kernels

In this section, we derive an NC1 metric that depends on the features only through the kernel function. Let $\displaystyle\mathbf{H}$ be a matrix encapsulating arbitrary feature vectors associated with samples of the $\displaystyle C$ classes. We define the within-class covariance $\displaystyle\boldsymbol{\Sigma}_{W}(\mathbf{H})$ and between-class covariance $\displaystyle\boldsymbol{\Sigma}_{B}(\mathbf{H})$ matrices of the features as follows:

\boldsymbol{\Sigma}_{W}(\mathbf{H})=\frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{n_{c}}\left(\mathbf{h}^{c,i}-\overline{\mathbf{h}}^{c}\right)\left(\mathbf{h}^{c,i}-\overline{\mathbf{h}}^{c}\right)^{\top};\qquad\boldsymbol{\Sigma}_{B}(\mathbf{H})=\frac{1}{C}\sum_{c=1}^{C}\left(\overline{\mathbf{h}}^{c}-\overline{\mathbf{h}}^{G}\right)\left(\overline{\mathbf{h}}^{c}-\overline{\mathbf{h}}^{G}\right)^{\top},

(5)

where $\displaystyle\overline{\mathbf{h}}^{c}=\frac{1}{n_{c}}\sum\nolimits_{i=1}^{n_{c}}\mathbf{h}^{c,i},\forall c\in[C]$ and $\displaystyle\overline{\mathbf{h}}^{G}=\frac{1}{N}\sum\nolimits_{c=1}^{C}\sum\nolimits_{i=1}^{n_{c}}\mathbf{h}^{c,i}$ represent the class mean vectors and the global mean vector, respectively. Additionally, we consider the total covariance $\displaystyle\widetilde{\boldsymbol{\Sigma}}_{T}(\mathbf{H})$ and non-centered between-class covariance $\displaystyle\widetilde{\boldsymbol{\Sigma}}_{B}(\mathbf{H})$ matrices as follows:

\widetilde{\boldsymbol{\Sigma}}_{T}(\mathbf{H})=\frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{n_{c}}\mathbf{h}^{c,i}\mathbf{h}^{c,i\top},\qquad\widetilde{\boldsymbol{\Sigma}}_{B}(\mathbf{H})=\frac{1}{C}\sum_{c=1}^{C}\overline{\mathbf{h}}^{c}\overline{\mathbf{h}}^{c\top}.

(6)

Based on these formulations, we define the variability metric $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ , introduced in [12] and used also in [14, 46, 52], as:

\mathcal{N}\mathcal{C}_{1}(\mathbf{H}):=\frac{\mathrm{tr}(\boldsymbol{\Sigma}_{W}(\mathbf{H}))}{\mathrm{tr}(\boldsymbol{\Sigma}_{B}(\mathbf{H}))}.

(7)

In the following theorem, we formulate the traces $\displaystyle\mathrm{tr}(\boldsymbol{\Sigma}_{W}(\mathbf{H}))$ and $\displaystyle\mathrm{tr}(\boldsymbol{\Sigma}_{B}(\mathbf{H}))$ using an arbitrary kernel function $\displaystyle Q:\mathbb{R}^{d_{0}}\times\mathbb{R}^{d_{0}}\to\mathbb{R}$ that expresses inner product of data samples in feature space.

Theorem 4.1.

For any two data points $\displaystyle\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j}$ , let the inner-product of their associated features $\displaystyle\mathbf{h}^{c,i},\mathbf{h}^{c^{\prime},j}$ be given by a kernel $\displaystyle Q:\mathbb{R}^{d_{0}}\times\mathbb{R}^{d_{0}}\to\mathbb{R}$ as $\displaystyle Q(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})=\mathbf{h}^{c,i\top}\mathbf{h}^{c^{\prime},j}$ . The traces of covariance matrices $\displaystyle\mathrm{tr}(\boldsymbol{\Sigma}_{W}(\mathbf{H}))$ and $\displaystyle\mathrm{tr}(\boldsymbol{\Sigma}_{B}(\mathbf{H}))$ can now be formulated as:

\mathrm{tr}(\boldsymbol{\Sigma}_{W}(\mathbf{H}))=\frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{n_{c}}Q(\mathbf{x}^{c,i},\mathbf{x}^{c,i})-\frac{1}{C}\sum_{c=1}^{C}\frac{1}{n_{c}^{2}}\sum_{i=1}^{n_{c}}\sum_{j=1}^{n_{c}}Q(\mathbf{x}^{c,i},\mathbf{x}^{c,j}),

(8)

\mathrm{tr}(\boldsymbol{\Sigma}_{B}(\mathbf{H}))=\frac{1}{C}\sum_{c=1}^{C}\frac{1}{n_{c}^{2}}\sum_{i=1}^{n_{c}}\sum_{j=1}^{n_{c}}Q(\mathbf{x}^{c,i},\mathbf{x}^{c,j})-\frac{1}{N^{2}}\sum_{c=1}^{C}\sum_{c^{\prime}=1}^{C}\sum_{i=1}^{n_{c}}\sum_{j=1}^{n_{c^{\prime}}}Q(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j}).

(9)

The proof (in Appendix A) leverages matrix trace properties and direct expansions of the covariance matrices $\displaystyle\widetilde{\boldsymbol{\Sigma}}_{T}(\mathbf{H}),\widetilde{\boldsymbol{\Sigma}}_{B}(\mathbf{H}),\boldsymbol{\Sigma}_{W}(\mathbf{H})$ in terms of vector outer-products to arrive at the results.

Observe that Theorem 4.1 allows us to replace $\displaystyle Q(\cdot,\cdot)$ with any suitable kernel formulation corresponding to the features. In the following sections, we leverage this flexibility to analyze and compare NC1 for kernels that model the behavior of neural networks.
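As an illustration of how Theorem 4.1 can be used in practice, the sketch below evaluates (8), (9), and the ratio (7) directly from a Gram matrix whose rows and columns are ordered by class, as in the “organized” form of Section 3. The function name `kernel_nc1` and its signature are hypothetical; the code simply transcribes the three averages that appear in the theorem (diagonal entries, same-class blocks, and all entries).

```python
import numpy as np

def kernel_nc1(Q, class_sizes):
    """Return (tr Sigma_W, tr Sigma_B, NC1) from an N x N Gram matrix Q.

    Q must be ordered by class: the first n_1 rows/columns belong to class 1,
    the next n_2 to class 2, and so on.
    """
    Q = np.asarray(Q, dtype=float)
    sizes = np.asarray(class_sizes, dtype=int)
    N, C = int(sizes.sum()), len(sizes)
    starts = np.concatenate(([0], np.cumsum(sizes)[:-1]))

    # (1/N) * sum of Q(x, x) over all samples.
    diag_term = np.trace(Q) / N
    # (1/C) * sum over classes of the mean entry of the same-class block.
    block_term = sum(
        Q[s:s + n, s:s + n].sum() / n**2 for s, n in zip(starts, sizes)
    ) / C
    # (1/N^2) * sum of all entries of Q.
    global_term = Q.sum() / N**2

    tr_w = diag_term - block_term    # eq. (8)
    tr_b = block_term - global_term  # eq. (9)
    return tr_w, tr_b, tr_w / tr_b   # eq. (7)
```

For a linear kernel $\displaystyle Q=\mathbf{H}^{\top}\mathbf{H}$ on a balanced toy dataset, the returned traces can be checked against the covariance definitions in (5) computed directly from $\displaystyle\mathbf{H}$ .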

5 Activation Variability in the Lazy Learning Regime

In the UFM-based analysis of NC, the assumption is that the deepest features $\displaystyle\mathbf{H}$ associated with the training samples $\displaystyle\mathbf{X}$ are free optimization variables, thus losing any ability to analyze the effect of the training data, apart from the balancedness of its labels $\displaystyle\mathbf{Y}$ . In this section, we address these shortcomings by analyzing the role of data distributions on $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ in the infinite width regime, where the NN behavior can be well modeled by NNGP and NTK. In particular, we focus on the case where data is sampled from a mixture of Gaussians and understand the limits of $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ reduction.

5.1 Limiting NNGP Kernel

Under the NN model and initialization stated in Section 3, as the hidden layer width $\displaystyle d_{1}\to\infty$ , we can characterize the pre-activation kernel $\displaystyle K^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})$ in terms of the GP limit [26] (commonly referred to as the NNGP limit [27]) as follows:

K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})=\sigma_{b}^{2}+\frac{\sigma_{w}^{2}}{d_{0}}\mathbf{x}^{c,i\top}\mathbf{x}^{c^{\prime},j}.

(10)

In this limit, the post-activation kernel $\displaystyle Q^{(1)}_{GP}(\cdot,\cdot)$ can have a closed form representation depending on the choice of activation function $\displaystyle\phi(\cdot)$ [27, 31]. The expression for Erf activation [48] is:

Q_{GP-Erf}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})=\frac{2}{\pi}\arcsin\left(\frac{2K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})}{\sqrt{1+2K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c,i})}\sqrt{1+2K_{GP}^{(1)}(\mathbf{x}^{c^{\prime},j},\mathbf{x}^{c^{\prime},j})}}\right).

(11)

The formulation for the ReLU-based kernel $\displaystyle Q_{GP-ReLU}^{(1)}(\cdot,\cdot)$ [49] is presented in Appendix B. Given a kernel function $\displaystyle Q(\cdot,\cdot)$ and samples $\displaystyle\mathbf{X}$ , we can formulate the kernel Gram matrix $\displaystyle\mathbf{Q}\in\mathbb{R}^{N\times N}$ as:

\mathbf{Q}=\begin{bmatrix}\mathbf{Q}_{1,1}&\cdots&\mathbf{Q}_{1,C}\\ \vdots&\ddots&\vdots\\ \mathbf{Q}_{C,1}&\cdots&\mathbf{Q}_{C,C}\end{bmatrix}_{N\times N},\qquad\mathbf{Q}_{c,c^{\prime}}=\begin{bmatrix}Q(\mathbf{x}^{c,1},\mathbf{x}^{c^{\prime},1})&\cdots&Q(\mathbf{x}^{c,1},\mathbf{x}^{c^{\prime},n_{c^{\prime}}})\\ \vdots&\ddots&\vdots\\ Q(\mathbf{x}^{c,n_{c}},\mathbf{x}^{c^{\prime},1})&\cdots&Q(\mathbf{x}^{c,n_{c}},\mathbf{x}^{c^{\prime},n_{c^{\prime}}})\end{bmatrix}_{n_{c}\times n_{c^{\prime}}}.

(12)

Considering the NNGP kernel in (11), we illustrate an example $\displaystyle\mathbf{Q}$ matrix in Figure 1 to visualize the sub-matrices based on the imbalance/balance of class sizes.
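The Gram matrix in (12) for the Erf NNGP kernel can be assembled in a few lines. The following is a minimal sketch, assuming the samples are stored column-wise in `X` as in Section 3; the function names and default values of `sigma_w` and `sigma_b` are illustrative choices, and the clipping step is only a numerical guard against rounding, not part of (11).

```python
import numpy as np

def nngp_pre_activation(X, sigma_w=1.0, sigma_b=0.0):
    """Eq. (10): K_GP^(1) for all sample pairs; X has shape (d0, N)."""
    d0 = X.shape[0]
    return sigma_b**2 + (sigma_w**2 / d0) * (X.T @ X)

def nngp_erf(K):
    """Eq. (11): Erf post-activation kernel from the pre-activation Gram matrix K."""
    norm = np.sqrt(1.0 + 2.0 * np.diag(K))
    ratio = 2.0 * K / np.outer(norm, norm)
    # Mathematically |ratio| < 1; clipping only protects arcsin from round-off.
    return (2.0 / np.pi) * np.arcsin(np.clip(ratio, -1.0, 1.0))
```

With class-ordered columns in `X`, the returned matrix has the block structure of (12), so it can be passed directly to a routine that evaluates the traces in Theorem 4.1.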

5.2 Limiting NTK

In the infinite width limit, we also analyze the NTK [28] to understand the effect of optimization on the NN’s features in the “lazy regime”. Specifically, a well-known result is that in the infinite width limit (with initialization as per Section 3), the deepest feature mapping of the NN is fixed during gradient descent optimization with a small enough learning rate, and is characterized by the NTK.

Formally, the recursive relationship between the NTK and NNGP [28, 31] can be given as follows:

\Theta^{(2)}_{NTK}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})=K_{GP}^{(2)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})+K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})\dot{Q}^{(1)}_{GP}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j}).

(13)

Here, $\displaystyle K_{GP}^{(2)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})$ can be defined using the recursive formulation [27, 28]:

K_{GP}^{(2)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})=\sigma_{b}^{2}+\sigma_{w}^{2}Q_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j}).

(14)

Similar to the activation function specific formulations of $\displaystyle Q^{(1)}_{GP}(\cdot,\cdot)$ in (11), we define the Erf based derivative kernel $\displaystyle\dot{Q}^{(1)}_{GP-Erf}(\cdot,\cdot)$ as follows:

\dot{Q}_{GP-Erf}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})=\frac{4}{\pi}\det\left(\begin{bmatrix}1+2K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c,i})&2K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})\\ 2K_{GP}^{(1)}(\mathbf{x}^{c^{\prime},j},\mathbf{x}^{c,i})&1+2K_{GP}^{(1)}(\mathbf{x}^{c^{\prime},j},\mathbf{x}^{c^{\prime},j})\end{bmatrix}\right)^{-1/2}.

(15)

The formulation for the ReLU-based derivative kernel $\displaystyle\dot{Q}_{GP-ReLU}^{(1)}(\cdot,\cdot)$ is presented in Appendix B.
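Combining (10), (11), (13), (14), and (15), the Erf NTK Gram matrix for the shallow network can be sketched as follows. The function name `ntk_erf` and the default parameter values are assumptions made for illustration; the 2x2 determinant in (15) is expanded by hand as $\displaystyle(1+2K_{ii})(1+2K_{jj})-4K_{ij}^{2}$ , which is positive whenever the pre-activation Gram matrix is positive semi-definite.

```python
import numpy as np

def ntk_erf(X, sigma_w=1.0, sigma_b=0.0):
    """Erf NTK Gram matrix Theta^(2) for samples stored column-wise in X (d0, N)."""
    d0 = X.shape[0]
    K1 = sigma_b**2 + (sigma_w**2 / d0) * (X.T @ X)  # eq. (10)
    diag = 1.0 + 2.0 * np.diag(K1)
    outer = np.outer(diag, diag)                     # (1 + 2 K_ii)(1 + 2 K_jj)

    # Eq. (11); the clip is a round-off guard only.
    Q1 = (2.0 / np.pi) * np.arcsin(np.clip(2.0 * K1 / np.sqrt(outer), -1.0, 1.0))
    K2 = sigma_b**2 + sigma_w**2 * Q1                # eq. (14)

    # Eq. (15): (4 / pi) * det([[1 + 2K_ii, 2K_ij], [2K_ij, 1 + 2K_jj]])^(-1/2).
    Q1_dot = (4.0 / np.pi) / np.sqrt(outer - 4.0 * K1**2)

    return K2 + K1 * Q1_dot                          # eq. (13)
```

Because the result is again an $\displaystyle N\times N$ Gram matrix, the NNGP and NTK can be compared on the same data by evaluating the same kernel-based NC1 metric on each.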

Remark on the limiting NTK. Denote by $\displaystyle w,\phi$ the parameters and the activation function of an $\displaystyle L$ -layer NN $\displaystyle\psi(\cdot)$ . The limiting NNGP is defined directly on the product of neurons, $\displaystyle\mathbb{E}_{w}\langle\phi(z_{i}(\mathbf{x})),\phi(z_{i}(\tilde{\mathbf{x}}))\rangle$ , and provides kernel expressions for the inner product of features at different layers of the network (via recursion similar to (11)). On the other hand, the limiting NTK is defined as the inner product of the output’s gradients $\displaystyle\mathbb{E}_{w}\langle\nabla_{w}\psi(\mathbf{x}),\nabla_{w}\psi(\tilde{\mathbf{x}})\rangle$ . The NTK theory shows that this can model the inner product only of the deepest features (i.e., the output of the penultimate layer). (Footnote 1: Note that we denote NNGP ( $\displaystyle Q_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})$ ) and NTK ( $\displaystyle\Theta^{(2)}_{NTK}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})$ ) with different superscripts to be consistent with the literature. Yet both are associated with the output of the single hidden layer.)

5.3 1-D Gaussian Data with 2 Classes

Notice that even for shallow NNs, it is challenging to theoretically analyze NNGP and NTK for general data. Therefore, we consider a simplified setting to analyze the NC1 properties of these kernels. Formally, consider a $\displaystyle 1$ -dimensional Gaussian dataset (i.e., $\displaystyle d_{0}=1$ ) with $\displaystyle C=2$ classes. The data points $\displaystyle\{x^{1,i}\},\forall i\in[n_{1}]$ belonging to class $\displaystyle c=1$ are independently sampled from $\displaystyle\mathcal{N}(\mu_{1},\sigma^{2}_{1})$ and have the labels $\displaystyle y^{1,i}=-1,\forall i\in[n_{1}]$ . Similarly, the data points $\displaystyle\{x^{2,j}\},\forall j\in[n_{2}]$ belonging to class $\displaystyle c=2$ are independently sampled from $\displaystyle\mathcal{N}(\mu_{2},\sigma^{2}_{2})$ and have the labels $\displaystyle y^{2,j}=1,\forall j\in[n_{2}]$ .

$\displaystyle\bullet$ Assumption 1: For $\displaystyle\mu_{1}<0,\mu_{2}>0$ , let $\displaystyle\sigma_{1},\sigma_{2}>0$ be small enough such that $\displaystyle|\mu_{1}|\gg\sigma_{1}$ , $\displaystyle|\mu_{2}|\gg\sigma_{2}$ and $\displaystyle\forall i\in[n_{1}],j\in[n_{2}],x^{1,i}x^{2,j}<0$ almost surely.

$\displaystyle\bullet$ Assumption 2: The dataset $\displaystyle\mathbf{X}\in\mathbb{R}^{1\times N}$ contains sufficiently many samples per class, i.e., $\displaystyle n_{1},n_{2}\gg 1$ .

$\displaystyle\bullet$ Assumption 3: The 2L-FCN $\displaystyle\psi(\cdot)$ has output layer dimension $\displaystyle d_{2}=1$ and $\displaystyle\sigma_{b}\to 0$ .

These assumptions present a scenario where the samples of the two classes are sufficiently far from the origin in opposite directions. Thus, the simplest prediction rule pertains to the sign of a sample.

Theorem 5.1 (ReLU Activation).

Under Assumptions 1-3, let $\displaystyle\phi(\cdot)$ be the ReLU activation. Denote by $\displaystyle\mathbf{H}_{GP},\mathbf{H}_{NTK}$ the features associated with NNGP $\displaystyle Q_{GP}^{(1)}$ and NTK $\displaystyle\Theta_{NTK}^{(2)}$ , respectively. Then:

\mathbb{E}\left[\mathcal{N}\mathcal{C}_{1}(\mathbf{H}_{GP})\right]=\mathbb{E}\left[\mathcal{N}\mathcal{C}_{1}(\mathbf{H}_{NTK})\right]=\frac{\sum_{c=1}^{2}\frac{n_{c}\mu_{c}^{2}+n_{c}\sigma_{c}^{2}}{N}-\frac{\mu_{c}^{2}}{2}}{\left(\sum_{c=1}^{2}\frac{\mu_{c}^{2}}{2}-\frac{n_{c}^{2}\mu_{c}^{2}}{N^{2}}\right)-\frac{2}{N^{2}}\prod_{c=1}^{2}n_{c}\mu_{c}}+\Delta_{h.o.t}

(16)

where $\displaystyle\Delta_{h.o.t}$ is a term that vanishes as $\displaystyle\{n_{c}\}$ increase.

Appendix D presents the proof by calculating the expected values of $\displaystyle Q_{GP-ReLU}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})$ and employing Theorem 4.1. For a better understanding of the result, consider the balanced class scenario with $\displaystyle n_{1}=n_{2}=N/2$ . This gives us: $\displaystyle\mathbb{E}\left[\mathcal{N}\mathcal{C}_{1}(\mathbf{H}_{GP/NTK})\right]=2(\sigma_{1}^{2}+\sigma_{2}^{2})/(\mu_{1}-\mu_{2})^{2}+\Delta_{h.o.t}$ , which intuitively captures the sum of the within-class variance of $\displaystyle\mathbf{X}$ in the numerator and the between-class variance in the denominator of the first term.
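The balanced expression follows from substituting $\displaystyle n_{c}=N/2$ into the first term of (16). A short worked version, reading each sum over $\displaystyle c$ in (16) as spanning both terms that follow it, is:

```latex
\begin{aligned}
\text{numerator:}\quad
  & \sum_{c=1}^{2}\left(\frac{\mu_c^2+\sigma_c^2}{2}-\frac{\mu_c^2}{2}\right)
    =\frac{\sigma_1^2+\sigma_2^2}{2},\\
\text{denominator:}\quad
  & \sum_{c=1}^{2}\left(\frac{\mu_c^2}{2}-\frac{\mu_c^2}{4}\right)
    -\frac{2}{N^2}\cdot\frac{N}{2}\cdot\frac{N}{2}\,\mu_1\mu_2
    =\frac{\mu_1^2+\mu_2^2-2\mu_1\mu_2}{4}
    =\frac{(\mu_1-\mu_2)^2}{4},\\
\text{ratio:}\quad
  & \frac{(\sigma_1^2+\sigma_2^2)/2}{(\mu_1-\mu_2)^2/4}
    =\frac{2(\sigma_1^2+\sigma_2^2)}{(\mu_1-\mu_2)^2}.
\end{aligned}
```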

$\displaystyle\bullet$ Erf Activation. We present a similar analysis of the NNGP and NTK with Erf activation in Appendix E (as the terms involved in the formulation are more complex than in the ReLU case). In the case of ReLU with balanced data, observe that the numerator (corresponding to $\displaystyle\boldsymbol{\Sigma}_{W}(\mathbf{H}_{GP/NTK})$ ) depends solely on $\displaystyle\sigma_{c}^{2}$ . For Erf activation, however, our analysis shows a dependence on terms $\displaystyle\propto\sigma_{c}^{2}\mu_{c}^{-6}$ , i.e., it also depends on inverse higher powers of the class means $\displaystyle\mu_{c}$ (see (92) in Appendix E). A similar analysis for $\displaystyle\boldsymbol{\Sigma}_{B}(\mathbf{H}_{GP/NTK})$ with Erf in (100) shows a dependence on terms $\displaystyle\propto\sigma_{c}^{2}\mu_{c}^{-4}$ . Importantly, under Assumptions 1-3, we show similar values of the expected NC1 metric for NNGP and NTK even for the Erf activation, and these values are smaller than in the ReLU case.

Remark. These results reveal the effect of the activation function on the NC1 metric when $\displaystyle d_{1}\to\infty$ . Especially under Assumptions 1-3, the Erf-based kernels reflect a larger extent of ‘variability collapse’ (NC1) of the hidden layer post-activations of our 2L-FCN (both at initialization via NNGP and during training via NTK). Additionally, they indicate that the expected $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ of the NTK closely approximates the NNGP counterparts on our 1-D Gaussian dataset. Perhaps surprisingly, this shows that NTK does not represent more collapsed features than NNGP, despite being associated with NN gradient-based optimization. Namely, we have established another result that shows that training in the lazy regime provably deviates from the practical feature learning of NNs [41, 42, 43, 44, 45].

5.4 Experiments with High-Dimensional Gaussian Data

Setup: We conduct experiments on datasets with varying sample sizes and input dimensions to verify our theoretical results and show that insights generalize (e.g., beyond $\displaystyle d_{0}=1$ ). For $\displaystyle C=2$ , a dataset size $\displaystyle N$ chosen from $\displaystyle\{128,256,512,1024\}$ , and input dimension $\displaystyle d_{0}$ chosen from $\displaystyle\{1,2,8,32,128\}$ , we create the data vector and label pairs as follows:

\begin{split}\mathcal{D}_{1}(N,d_{0})&=\left\{\left(\mathbf{x}^{1,i}\sim\mathcal{N}(-2\cdot\mathbf{1}_{d_{0}},0.25\cdot\mathbf{I}_{d_{0}}),\,y^{1,i}=-1\right),\forall i\in[N/2]\right\}\\ &\hskip 15.0pt\cup\left\{\left(\mathbf{x}^{2,j}\sim\mathcal{N}(2\cdot\mathbf{1}_{d_{0}},0.25\cdot\mathbf{I}_{d_{0}}),\,y^{2,j}=1\right),\forall j\in[N/2]\right\}.\end{split}

(17)

The vectors and labels from the dataset can then be arranged into the matrix form (as described in the setup) for analysis. The sampling procedure is repeated $\displaystyle 10$ times for each $\displaystyle(N,d_{0})$ . (Footnote 2: The code is available at: https://github.com/kvignesh1420/shallow_nc1)
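The sampling in (17) can be sketched as below. This is an illustrative NumPy version and not the code from the linked repository; the function name `make_d1` and its `seed` argument are assumptions. Note that the covariance $\displaystyle 0.25\cdot\mathbf{I}_{d_{0}}$ corresponds to a per-coordinate standard deviation of $\displaystyle 0.5$ .

```python
import numpy as np

def make_d1(N, d0, mu=2.0, var=0.25, seed=0):
    """Sample the balanced two-class Gaussian dataset D_1(N, d0) of eq. (17).

    Returns X of shape (d0, N), ordered by class, and labels y in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    half = N // 2
    std = np.sqrt(var)                           # isotropic covariance var * I
    X1 = rng.normal(-mu, std, size=(d0, half))   # class 1: mean -2 * ones
    X2 = rng.normal(+mu, std, size=(d0, half))   # class 2: mean +2 * ones
    X = np.concatenate([X1, X2], axis=1)
    y = np.concatenate([-np.ones(half), np.ones(half)])
    return X, y
```

Repeating the call with different seeds for each $\displaystyle(N,d_{0})$ pair would mirror the repeated sampling described above.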

Observations: Figure 2 illustrates the mean and standard deviation (std) of $\displaystyle\log_{10}(\mathcal{N}\mathcal{C}_{1}(\mathbf{H}))$ for the post-activation NNGP kernel $\displaystyle Q^{(1)}_{GP}$ and NTK $\displaystyle\Theta^{(2)}_{NTK}$ for Erf and ReLU activations. For the low-dimensional case of $\displaystyle d_{0}=1$ and Erf activation, observe from Figure 2(a) that $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ has a small value of $\displaystyle\approx 10^{-2.1}$ . On the contrary, Figure 2(c) illustrates that for ReLU, $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ is more than an order of magnitude larger ( $\displaystyle\approx 10^{-0.95}$ ) than the former. Furthermore, Figures 2(b) and 2(d) corresponding to NTK (with Erf and ReLU respectively) do not exhibit significantly different values from the NNGP counterparts. These observations empirically verify our theoretical results.

As $\displaystyle d_{0}$ increases, $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ increases at similar rates for $\displaystyle Q_{GP-Erf}$ and $\displaystyle\Theta_{NTK-Erf}$ . With ReLU activation, $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ remains almost constant for $\displaystyle Q_{GP-ReLU}$ and exhibits an increasing trend for $\displaystyle\Theta_{NTK-ReLU}$ , implying less collapse for NTK. All these observations corroborate our theory on the limitation of analyzing NC with NTK. We present additional experimental results for imbalanced datasets in Appendix H and show that for a given $\displaystyle N$ , the trends of $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ for increasing $\displaystyle d_{0}$ can vary based on the imbalance ratio of classes (i.e., $\displaystyle n_{1}/n_{2}$ ). Nevertheless, the trends in $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ for NNGP with increasing $\displaystyle d_{0}$ resemble those of a trained 2L-FCN with a large hidden layer width $\displaystyle d_{1}=2000$ (see Figure 3(b) corresponding to the Erf activation). These observations provide zero-order reasoning for NN behavior where the feature mapping is learned based on data properties such as dimension.

6 Activation Variability in the Feature Learning Regime

The explicit kernel formulations in the infinite width limit $\displaystyle(d_{1}\to\infty)$ have allowed us to go beyond the unconstrained features assumption and preserve a link between the features and the data. Yet, the NC phenomenon relates to NN training, while NNGP relates to NN at initialization and NTK has been found unsuitable for NC analysis. Thus, we wish to contrast NNGP with a kernel that is an alternative to NTK and takes into account both optimization and data. To this end, we transition to a large but finite width ( $\displaystyle d_{1}\gg 1$ ) and large sample ( $\displaystyle N\gg 1$ ) setting and analyze the recently introduced ‘adaptive kernels’ approach by Seroussi et al. [40] for fully connected networks.

6.1 Equations of State (EoS)

A transition from the infinite to the finite width regime can introduce various corrections to the pre- and post-activations of an $\displaystyle L$ -layer FCN. In this context, Seroussi et al. [40] have observed the following dominant corrections: (1) the mean and covariance of the pre-activations deviate from those of a random FCN, and (2) the collective effect of activations from the $\displaystyle(l+1)^{th}$ and $\displaystyle(l-1)^{th}$ layers determines the covariance of activations in the $\displaystyle l^{th}$ layer. Based on these observations, they employ a Variational Gaussian Approximation (VGA) approach to propose a system of equations for the pre- and post-activation kernels $\displaystyle K^{(l)}(\cdot,\cdot),Q^{(l)}(\cdot,\cdot),l\in[L]$ , respectively. We formally define the EoS for a 2-layer FCN as follows (based on specializing the generic $\displaystyle L$ -layer formulation in equation 5 in [40] to $\displaystyle L=2$ , as done in equation 95 in their arXiv extended version):

Definition 6.1.

The “Equations of State” (EoS) for pre and post-activation kernels of a $\displaystyle 2$ -layer FCN with Erf activation, no bias, and $\displaystyle d_{2}=1$ are given by:

\displaystyle\begin{split}\overline{\mathbf{f}}&=\mathbf{Q}^{(1)}[\sigma^{2}\mathbf{I}+\mathbf{Q}^{(1)}]^{-1}\mathbf{y}\\ {}[\mathbf{Q}^{(1)}]_{ij}&=\sigma_{a}^{2}\frac{2}{\pi}\arcsin\left(2K^{(1)}_{ij}\cdot\left(\sqrt{1+2K^{(1)}_{ii}}\sqrt{1+2K^{(1)}_{jj}}\right)^{-1}\right)\\ {}[\mathbf{C}^{-1}]_{ij}&=\frac{d_{0}}{\sigma_{w}^{2}}\delta_{ij}+\frac{1}{d_{1}}\mathrm{tr}\left\{\mathbf{A}^{(1)}\partial_{C_{ij}}\mathbf{Q}^{(1)}\right\}\\ \mathbf{A}^{(1)}&=-(\mathbf{y}-\overline{\mathbf{f}})(\mathbf{y}-\overline{\mathbf{f}})^{\top}\sigma^{-4}+[\mathbf{Q}^{(1)}+\sigma^{2}\mathbf{I}]^{-1}\end{split}

(18)

Here, $\displaystyle\mathbf{C}\in\mathbb{R}^{d_{0}\times d_{0}}$ models the statistical covariance of a row of $\displaystyle\mathbf{W}^{(1)}$ and is initialized with $\displaystyle(\sigma_{w}^{2}/d_{0})\mathbf{I}$; $\displaystyle\mathbf{K}^{(1)}=\mathbf{X}^{\top}\mathbf{C}\mathbf{X}\in\mathbb{R}^{N\times N}$ is the pre-activation kernel matrix; $\displaystyle\sigma>0$ is the regularization parameter; and $\displaystyle\overline{\mathbf{f}}\in\mathbb{R}^{N}$ corresponds to the prediction of the 2-layer FCN (governed by the EoS). Additionally, $\displaystyle\mathbf{K}^{(1)},\mathbf{Q}^{(1)}\in\mathbb{R}^{N\times N}$ are the kernel matrices associated with kernel functions $\displaystyle K^{(1)}(\cdot,\cdot),Q^{(1)}(\cdot,\cdot)$.

$\displaystyle\bullet$ Relationship with NNGP: At initialization, we set $\displaystyle\mathbf{C}=(\sigma_{w}^{2}/d_{0})\mathbf{I}$, which implies that $\displaystyle\mathbf{K}^{(1)}=(\sigma_{w}^{2}/d_{0})\mathbf{X}^{\top}\mathbf{X}$. The resulting $\displaystyle\mathbf{K}^{(1)}$ exactly matches the kernel matrix for the pre-activation GP kernel (10), $\displaystyle K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})=(\sigma_{w}^{2}/d_{0})\mathbf{x}^{c,i\top}\mathbf{x}^{c^{\prime},j}$, as $\displaystyle\sigma_{b}\to 0$. Similarly, the matrix $\displaystyle\mathbf{Q}^{(1)}$ corresponds to the $\displaystyle Q_{GP-Erf}(\cdot,\cdot)$ kernel function defined in (11). The EoS provides a mechanism for transitioning from NNGP kernels to finite-width-based kernels that adapt to the data. Intuitively, observe that the predictions $\displaystyle\overline{\mathbf{f}}$ are formulated based on kernel ridge regression with $\displaystyle\mathbf{Q}^{(1)}$ and $\displaystyle\mathbf{y}$. The $\displaystyle\mathbf{Q}^{(1)}$ matrix, along with $\displaystyle\mathbf{y}$ and the initial predictions $\displaystyle\overline{\mathbf{f}}$, is then used to update the weight covariance matrix $\displaystyle\mathbf{C}$. Notice that every entry $\displaystyle[\mathbf{C}^{-1}]_{ij}$ involves a trace operation on a matrix product, resulting in a weighted sum across entries of $\displaystyle\partial_{C_{ij}}\mathbf{Q}^{(1)}$ (i.e., over the $\displaystyle N^{2}$ pairs of data samples).
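To make the initial state concrete, the following is a minimal NumPy sketch of the first two lines of (18) evaluated at $\displaystyle\mathbf{C}=(\sigma_{w}^{2}/d_{0})\mathbf{I}$. It is not the solver of [40]: the update of $\displaystyle\mathbf{C}$ through the last two lines of (18) and the annealing procedure are omitted. The function names, the column-wise layout of the data matrix, and the default values of `sigma_a2` and `sigma2` are assumptions made only for illustration.

```python
import numpy as np


def erf_post_activation_kernel(K, sigma_a2=1.0):
    """Post-activation matrix Q^(1) from the pre-activation matrix K^(1)
    (second line of the EoS, Erf activation, no bias)."""
    scale = np.sqrt(1.0 + 2.0 * np.diag(K))
    ratio = 2.0 * K / np.outer(scale, scale)
    # Clip to guard against round-off pushing the argument outside [-1, 1].
    ratio = np.clip(ratio, -1.0, 1.0)
    return sigma_a2 * (2.0 / np.pi) * np.arcsin(ratio)


def eos_initial_state(X, y, sigma_w2=1.0, sigma_a2=1.0, sigma2=1e-2):
    """Evaluate C, K^(1), Q^(1) and the prediction f_bar at initialization.

    X has shape (d0, N) with samples as columns (an assumed layout);
    y has shape (N,). sigma2 plays the role of sigma^2 in (18).
    """
    d0, N = X.shape
    C = (sigma_w2 / d0) * np.eye(d0)          # initial weight covariance
    K = X.T @ C @ X                           # K^(1) = X^T C X
    Q = erf_post_activation_kernel(K, sigma_a2)
    # Kernel ridge regression: f_bar = Q (sigma^2 I + Q)^{-1} y
    f_bar = Q @ np.linalg.solve(sigma2 * np.eye(N) + Q, y)
    return C, K, Q, f_bar
```

Calling `eos_initial_state(X, y)` on a small dataset returns the NNGP-state quantities from which an iterative EoS solver would start.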

$\displaystyle\bullet$ Finite width corrections of EoS on NC1: We solve the EoS (initialized with $\displaystyle\mathbf{C}=(\sigma_{w}^{2}/d_{0})\mathbf{I}_{d_{0}}$) and obtain the stable state using the Newton-Krylov method with an annealing schedule, as originally proposed by [40] (see Appendix G for details). At initialization, $\displaystyle\mathbf{Q}^{(1)}$ is exactly described by the limiting NNGP kernel matrix, which has been analyzed in the previous section. Now, by solving the EoS with the final annealing factors as $\displaystyle 2000$ and $\displaystyle 500$ (which correspond to a 2L-FCN with hidden layer widths $\displaystyle d_{1}=2000,500$ respectively), we illustrate the NC1 metrics of $\displaystyle\mathbf{Q}^{(1)}$ in Figure 3 for the running example of a balanced 2 class dataset $\displaystyle\mathcal{D}_{1}(N,d_{0})$ (17). Notice that for $\displaystyle d_{1}=2000$, the $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ values for $\displaystyle\mathbf{Q}^{(1)}$ in Figure 3(a) closely resemble the plots for the limiting NNGP kernel $\displaystyle Q_{GP-Erf}$ in Figure 2(a) and the NTK in Figure 2(b). However, we observe noticeable changes in the metrics when $\displaystyle d_{1}=500$. In particular, for $\displaystyle d_{0}\geq 8$ and $\displaystyle N\geq 512$, the $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ values for the EoS in Figure 3(c) exhibit a noticeable reduction compared to $\displaystyle Q_{GP-Erf}$ in Figure 2(a) and $\displaystyle\Theta_{NTK-Erf}$ in Figure 2(b). Based on this ‘kernel vs. kernel’ analysis, the reduction in $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ values for EoS reflects the departure of $\displaystyle\mathbf{Q}^{(1)}$ from the initial NNGP state to a feature learning state (based on finite widths).

6.2 Activation Variability with Adaptive Kernels (EoS) and 2L-FCN

Setup: We train a 2L-FCN with $\displaystyle d_{1}=500,\sigma_{w}=1,\sigma_{b}=0$ and Erf activation using (vanilla) Gradient Descent with a learning rate of $\displaystyle 10^{-3}$ and weight-decay $\displaystyle 10^{-6}$ for $\displaystyle 1000$ epochs on datasets described below. For EoS, we employ the same setup described above, with the final annealing factor of $\displaystyle d_{1}=500$ and $\displaystyle\sigma_{a}^{2}=1/128$ (as per the critical scaling value [40]). Similar to the formulation of $\displaystyle\mathcal{D}_{1}(N,d_{0})$ for $\displaystyle C=2$ , we formulate $\displaystyle\mathcal{D}_{2}(N,d_{0})$ for $\displaystyle C=4$ as follows:

\displaystyle\begin{split}\mathcal{D}_{2}(N,d_{0})&=\left\{(\mathbf{x}^{1,i}\sim\mathcal{N}(-6*\mathbf{1}_{d_{0}},0.25*\mathbf{I}_{d_{0}}),y^{1,i}=-3),\forall i\in[N/4]\right\}\\ &\hskip 10.0pt\cup\left\{(\mathbf{x}^{2,j}\sim\mathcal{N}(-2*\mathbf{1}_{d_{0}},0.25*\mathbf{I}_{d_{0}}),y^{2,j}=-1),\forall j\in[N/4]\right\}\\ &\hskip 10.0pt\cup\left\{(\mathbf{x}^{3,k}\sim\mathcal{N}(2*\mathbf{1}_{d_{0}},0.25*\mathbf{I}_{d_{0}}),y^{3,k}=1),\forall k\in[N/4]\right\}\\ &\hskip 10.0pt\cup\left\{(\mathbf{x}^{4,l}\sim\mathcal{N}(6*\mathbf{1}_{d_{0}},0.25*\mathbf{I}_{d_{0}}),y^{4,l}=3),\forall l\in[N/4]\right\}.\end{split}

(19)
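A minimal sketch of how $\displaystyle\mathcal{D}_{2}(N,d_{0})$ in (19) could be sampled is given below. The function name `sample_d2`, the `seed` argument, and the column-wise layout (shape $\displaystyle d_{0}\times N$) are illustrative assumptions; the class means, the covariance $\displaystyle 0.25*\mathbf{I}_{d_{0}}$ (i.e., a per-coordinate standard deviation of $\displaystyle 0.5$), and the labels follow (19).

```python
import numpy as np


def sample_d2(N, d0, seed=0):
    """Draw the balanced 4-class Gaussian dataset of (19).

    Returns X with shape (d0, N) (samples as columns, an assumed layout)
    and the label vector y with shape (N,).
    """
    if N % 4 != 0:
        raise ValueError("N must be divisible by 4 for a balanced dataset")
    rng = np.random.default_rng(seed)
    means = (-6.0, -2.0, 2.0, 6.0)      # class means are mu * 1_{d0}
    labels = (-3.0, -1.0, 1.0, 3.0)
    n_c = N // 4
    X_parts, y_parts = [], []
    for mu, label in zip(means, labels):
        # Covariance 0.25 * I corresponds to a standard deviation of 0.5.
        X_parts.append(mu + 0.5 * rng.standard_normal((d0, n_c)))
        y_parts.append(np.full(n_c, label))
    return np.concatenate(X_parts, axis=1), np.concatenate(y_parts)
```

For instance, `X, y = sample_d2(N=1024, d0=8)` yields 256 samples per class.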

$\displaystyle\bullet$ $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ of EoS provides a good approximation of FCN. Let us consider the running example with the balanced 2 class dataset $\displaystyle\mathcal{D}_{1}(N,d_{0})$ for varying $\displaystyle N,d_{0}$ as per (17). The EoS primarily aims to capture the finite-width corrections (as discussed above) depending on the scaling of $\displaystyle N,d_{0}$, and $\displaystyle d_{1}$. To this end, observe from Figure 3(c) that for $\displaystyle d_{1}=500$, the trends of $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ for EoS vary depending on the scale of $\displaystyle N,d_{0}$. Figure 3(d) illustrates the actual 2L-FCN behavior. For $\displaystyle d_{0}=\{1,2\}$, although $\displaystyle N$ is scaled to larger values, the $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ of EoS takes similar values and resembles the 2L-FCN case. However, as the input dimension increases to $\displaystyle d_{0}=\{8,32\}$, the EoS at the larger value of $\displaystyle N=\{1024\}$ tends to deviate from the 2L-FCN behavior. Similar deviations were observed even after choosing higher/lower learning rates for 2L-FCN (i.e., $\displaystyle 5*10^{-3},2*10^{-3},5*10^{-4},10^{-4}$) and weight-decays (e.g., $\displaystyle 10^{-5},10^{-4}$). Nonetheless, for the higher dimension of $\displaystyle d_{0}=128$, notice that the larger values of $\displaystyle N=\{512,1024\}$ are required for the EoS to match the 2L-FCN. Additionally, by training deeper FCNs with $\displaystyle L=\{3,4,5,6\}$ layers and hidden layer widths $\displaystyle 500$ on $\displaystyle\mathcal{D}_{1}(N,d_{0})$, we observed that the EoS trends (which correspond to 2L-FCN) can also be used to estimate the trends of $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ reduction in the penultimate layers of these networks (see Figure 14 in Appendix H).

The trends in $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ for EoS and 2L-FCN hold even for datasets $\displaystyle\mathcal{D}_{2}(N,d_{0})$ with $\displaystyle C>2$. Observe from Figure 4 that the $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ values for the kernels $\displaystyle Q_{GP-Erf},\Theta_{NTK-Erf}$, EoS and for 2L-FCN are consistently lower than in the $\displaystyle C=2$ case shown in Figures 2(a), 2(b), 3(c), 3(d), even when comparing them at $\displaystyle d_{0}=1$. We believe this is because $\displaystyle\mathcal{D}_{2}(N,d_{0})$ is constructed by adding $\displaystyle 2$ new classes to $\displaystyle\mathcal{D}_{1}(N,d_{0})$ whose distance between the means is much larger than the within-class covariances.

$\displaystyle\bullet$ On the effects of class imbalances: One of the key conditions for the EoS to approximate the $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ behavior of a 2L-FCN is the presence of sufficient data samples [40] (depending on the scale of the input dimension $\displaystyle d_{0}$ as shown above). Thus, extreme class imbalances can lead to biased $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ trends in EoS. Consider a collection of imbalanced datasets based on $\displaystyle\mathcal{D}_{1}(N,d_{0})$ where $\displaystyle N=2048$ is split into two classes as follows: Case 1: $\displaystyle(768,1280)$ , Case 2: $\displaystyle(512,1536)$ , Case 3: $\displaystyle(256,1792)$ . We observe that for Case 3 where the imbalance ratio is relatively large, the $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ values in EoS are slightly larger than 2L-FCN (Figure 9 in Appendix H). On the other hand, the imbalance ratios in Case 1 and Case 2 lead to good approximations with 2L-FCN. A similar observation can be made for the dataset $\displaystyle\mathcal{D}_{2}(N,d_{0})$ with $\displaystyle C=4$ and $\displaystyle N=1024$ from Figure 10.

$\displaystyle\bullet$ Implications: Our analysis showcases the dependence of $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ on the activation functions and indicates that an increase in the data complexity by increasing the dimension of the data typically leads to larger $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ values. Furthermore, the $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ values depend heavily on the relative positions of the data points. In particular, when the means $\displaystyle\{\mu_{c}\}$ are well separated with smaller $\displaystyle\{\sigma_{c}\}$, even a large number of classes can lead to smaller $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ values. These observations explain the empirical results by Papyan et al. [3] (Figure 6) where extensive experiments on complex datasets (e.g., ImageNet) led to relatively less collapse (i.e., larger $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$) than simpler datasets such as MNIST. We present a broader discussion on the limitations of our work and future efforts in Appendix J, and a discussion on formulating NC1 metrics that consider the structure of the data in Appendix F.

7 Conclusion

In this paper, we presented a kernel-based approach to understanding the role of data in the emergence of the Neural Collapse (NC) phenomenon. By considering a general kernel function, we first formulated the trace expressions for the variability collapse (NC1) of the features of the data samples. By leveraging these results, we provided theoretical and empirical results to showcase that the NTK does not represent more collapsed features than the NNGP for various Gaussian datasets. Next, to capture the feature-learning aspects of finite-width neural networks, we switched to an ‘adaptive kernel’ approach whose state equations (EoS) facilitate the transition of the post-activation kernel beyond the GP limit. Through this “kernel vs. kernel” approach for limiting NNGP, NTK, and adaptive kernels, we showcased a promising direction to analyze the properties of data for which the NC1 behavior of an actual FCN can be understood. This addresses the limitations of the unconstrained-features-based analysis of NC and explains the empirical observations of [3] on datasets of varying complexity. We believe that future work on analyzing the EoS for multi-layer FCN and convolutional networks [40] can provide further insights into the depthwise reduction of NC1 for deeper networks and provide a framework for analyzing datasets with more complex distributions.

Acknowledgments and Disclosure of Funding

The authors would like to thank Zhengdao Chen for helpful discussions during the preparation of this manuscript. The work of Tom Tirer is supported by the ISF grant No. 1940/23.

References

Hoffer et al. [2017] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. Advances in neural information processing systems, 30, 2017.
Ma et al. [2018] Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of sgd in modern over-parametrized learning. In International Conference on Machine Learning, pages 3325–3334. PMLR, 2018.
Papyan et al. [2020] Vardan Papyan, XY Han, and David L Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020.
Han et al. [2022] XY Han, Vardan Papyan, and David L Donoho. Neural collapse under mse loss: Proximity to and dynamics on the central path. In International Conference on Learning Representations, 2022.
Kothapalli [2023] Vignesh Kothapalli. Neural collapse: A review on modelling principles and generalization. Transactions on Machine Learning Research, 2023.
Fang et al. [2021] Cong Fang, Hangfeng He, Qi Long, and Weijie J Su. Exploring deep neural networks via layer-peeled model: Minority collapse in imbalanced training. Proceedings of the National Academy of Sciences, 118(43):e2103091118, 2021.
Thrampoulidis et al. [2022] Christos Thrampoulidis, Ganesh Ramachandra Kini, Vala Vakilian, and Tina Behnia. Imbalance trouble: Revisiting neural-collapse geometry. Advances in Neural Information Processing Systems, 35:27225–27238, 2022.
Tirer and Bruna [2022] Tom Tirer and Joan Bruna. Extended unconstrained features model for exploring deep neural collapse. In International Conference on Machine Learning, pages 21478–21505. PMLR, 2022.
Rangamani et al. [2023] Akshay Rangamani, Marius Lindegaard, Tomer Galanti, and Tomaso A Poggio. Feature learning in deep classifiers through intermediate neural collapse. In International Conference on Machine Learning, pages 28729–28745. PMLR, 2023.
Súkeník et al. [2023] Peter Súkeník, Marco Mondelli, and Christoph H Lampert. Deep neural collapse is provably optimal for the deep unconstrained features model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
He and Su [2023] Hangfeng He and Weijie J Su. A law of data separation in deep learning. Proceedings of the National Academy of Sciences, 120(36):e2221704120, 2023.
Tirer et al. [2023] Tom Tirer, Haoxiang Huang, and Jonathan Niles-Weed. Perturbation analysis of neural collapse. In International Conference on Machine Learning, pages 34301–34329. PMLR, 2023.
Yang et al. [2023] Yongyi Yang, Jacob Steinhardt, and Wei Hu. Are neurons actually collapsed? on the fine-grained structure in neural representations. In International Conference on Machine Learning, pages 39453–39487. PMLR, 2023.
Kothapalli et al. [2023] Vignesh Kothapalli, Tom Tirer, and Joan Bruna. A neural collapse perspective on feature evolution in graph neural networks. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
Zhu et al. [2021] Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam, and Qing Qu. A geometric analysis of neural collapse with unconstrained features. Advances in Neural Information Processing Systems, 34:29820–29834, 2021.
Galanti et al. [2022] Tomer Galanti, András György, and Marcus Hutter. On the role of neural collapse in transfer learning. In International Conference on Learning Representations, 2022.
Yang et al. [2022] Yibo Yang, Shixiang Chen, Xiangtai Li, Liang Xie, Zhouchen Lin, and Dacheng Tao. Inducing neural collapse in imbalanced learning: Do we really need a learnable classifier at the end of deep neural network? Advances in Neural Information Processing Systems, 35:37991–38002, 2022.
Xu and Liu [2023] Jing Xu and Haoxiong Liu. Quantifying the variability collapse of neural networks. In International Conference on Machine Learning, pages 38535–38550. PMLR, 2023.
Mixon et al. [2020] Dustin G Mixon, Hans Parshall, and Jianzong Pi. Neural collapse with unconstrained features. arXiv preprint arXiv:2011.11619, 2020.
Ji et al. [2022] Wenlong Ji, Yiping Lu, Yiliang Zhang, Zhun Deng, and Weijie J Su. An unconstrained layer-peeled perspective on neural collapse. In International Conference on Learning Representations, 2022.
Zhou et al. [2022] Jinxin Zhou, Xiao Li, Tianyu Ding, Chong You, Qing Qu, and Zhihui Zhu. On the optimization landscape of neural collapse under mse loss: Global optimality with unconstrained features. In International Conference on Machine Learning, pages 27179–27202. PMLR, 2022.
Yaras et al. [2022] Can Yaras, Peng Wang, Zhihui Zhu, Laura Balzano, and Qing Qu. Neural collapse with normalized features: A geometric analysis over the riemannian manifold. Advances in neural information processing systems, 35:11547–11560, 2022.
Dang et al. [2023] Hien Dang, Tan Minh Nguyen, Tho Tran, Hung The Tran, Hung Tran, and Nhat Ho. Neural collapse in deep linear networks: From balanced to imbalanced data. In International Conference on Machine Learning, 2023.
Wojtowytsch et al. [2020] Stephan Wojtowytsch et al. On the emergence of simplex symmetry in the final and penultimate layers of neural network classifiers. arXiv preprint arXiv:2012.05420, 2020.
Schölkopf et al. [2002] Bernhard Schölkopf, Alexander J Smola, Francis Bach, et al. Learning with kernels: support vector machines, regularization, optimization, and beyond. 2002.
Neal [1995] Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 1995.
Lee et al. [2018] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as gaussian processes. In International Conference on Learning Representations, 2018.
Jacot et al. [2018] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
Chizat et al. [2019] Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. Advances in neural information processing systems, 32, 2019.
Arora et al. [2019] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pages 8141–8150, 2019.
Lee et al. [2019] Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. Advances in neural information processing systems, 32, 2019.
Matthews et al. [2018] Alexander G de G Matthews, Jiri Hron, Mark Rowland, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018.
Ronen et al. [2019] Basri Ronen, David Jacobs, Yoni Kasten, and Shira Kritchman. The convergence rate of neural networks for learned functions of different frequencies. Advances in Neural Information Processing Systems, 32, 2019.
Huang et al. [2020] Kaixuan Huang, Yuqing Wang, Molei Tao, and Tuo Zhao. Why do deep residual networks generalize better than deep feedforward networks?—a neural tangent kernel perspective. Advances in neural information processing systems, 33:2698–2709, 2020.
Tirer et al. [2022] Tom Tirer, Joan Bruna, and Raja Giryes. Kernel-based smoothness analysis of residual networks. In Mathematical and Scientific Machine Learning, pages 921–954. PMLR, 2022.
Barzilai et al. [2022] Daniel Barzilai, Amnon Geifman, Meirav Galun, and Ronen Basri. A kernel perspective of skip connections in convolutional networks. arXiv preprint arXiv:2211.14810, 2022.
Tancik et al. [2020] Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. Advances in neural information processing systems, 33:7537–7547, 2020.
Hanin and Nica [2019] Boris Hanin and Mihai Nica. Finite depth and width corrections to the neural tangent kernel. arXiv preprint arXiv:1909.05989, 2019.
Lee et al. [2020] Jaehoon Lee, Samuel Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak, and Jascha Sohl-Dickstein. Finite versus infinite neural networks: an empirical study. Advances in Neural Information Processing Systems, 33:15156–15172, 2020.
Seroussi et al. [2023] Inbar Seroussi, Gadi Naveh, and Zohar Ringel. Separation of scales and a thermodynamic description of feature learning in some cnns. Nature Communications, 14(1):908, 2023.
Woodworth et al. [2020] Blake Woodworth, Suriya Gunasekar, Jason D Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro. Kernel and rich regimes in overparametrized models. In Conference on Learning Theory, pages 3635–3673. PMLR, 2020.
Ghorbani et al. [2020] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. When do neural networks outperform kernel methods? Advances in Neural Information Processing Systems, 33:14820–14830, 2020.
Wei et al. [2019] Colin Wei, Jason D Lee, Qiang Liu, and Tengyu Ma. Regularization matters: Generalization and optimization of neural nets vs their induced kernel. Advances in Neural Information Processing Systems, 32, 2019.
Yehudai and Shamir [2019] Gilad Yehudai and Ohad Shamir. On the power and limitations of random features for understanding neural networks. Advances in Neural Information Processing Systems, 32, 2019.
Li et al. [2020] Yuanzhi Li, Tengyu Ma, and Hongyang R Zhang. Learning over-parametrized two-layer neural networks beyond ntk. In Conference on learning theory, pages 2613–2682. PMLR, 2020.
Wang et al. [2023] Peng Wang, Xiao Li, Can Yaras, Zhihui Zhu, Laura Balzano, Wei Hu, and Qing Qu. Understanding deep representation learning via layerwise feature compression and discrimination. arXiv preprint arXiv:2311.02960, 2023.
Seleznova et al. [2023] Mariia Seleznova, Dana Weitzner, Raja Giryes, Gitta Kutyniok, and Hung-Hsu Chou. Neural (tangent kernel) collapse. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
Williams [1996] Christopher Williams. Computing with infinite networks. Advances in neural information processing systems, 9, 1996.
Cho and Saul [2009] Youngmin Cho and Lawrence Saul. Kernel methods for deep learning. Advances in neural information processing systems, 22, 2009.
Rubin et al. [2024] Noa Rubin, Inbar Seroussi, and Zohar Ringel. Grokking as a first order phase transition in two layer networks. In The Twelfth International Conference on Learning Representations, 2024.
Hui and Belkin [2021] Like Hui and Mikhail Belkin. Evaluation of neural architectures trained with square loss vs cross-entropy in classification tasks. In The Ninth International Conference on Learning Representations (ICLR), 2021.
Yaras et al. [2023] Can Yaras, Peng Wang, Wei Hu, Zhihui Zhu, Laura Balzano, and Qing Qu. The law of parsimony in gradient descent for learning deep linear networks. arXiv preprint arXiv:2306.01154, 2023.
Seltman [2012] Howard Seltman. Approximations for mean and variance of a ratio. unpublished note, 2012.
Vershynin [2012] Roman Vershynin. How close is the sample covariance matrix to the actual covariance matrix? Journal of Theoretical Probability, 25(3):655–686, 2012.

Appendix A Proof of Theorem 4.1

To obtain the NC1 formulation corresponding to an arbitrary feature matrix $\displaystyle\mathbf{H}$, we start with a simple relationship between $\displaystyle\widetilde{\boldsymbol{\Sigma}}_{T}(\mathbf{H}),\widetilde{\boldsymbol{\Sigma}}_{B}(\mathbf{H}),\boldsymbol{\Sigma}_{W}(\mathbf{H})$ as follows:

\displaystyle\begin{split}\widetilde{\boldsymbol{\Sigma}}_{T}(\mathbf{H})&=\boldsymbol{\Sigma}_{W}(\mathbf{H})+\widetilde{\boldsymbol{\Sigma}}_{B}(\mathbf{H})\\ \implies\mathrm{tr}\left(\boldsymbol{\Sigma}_{W}(\mathbf{H})\right)&=\mathrm{tr}\left(\widetilde{\boldsymbol{\Sigma}}_{T}(\mathbf{H})\right)-\mathrm{tr}\left(\widetilde{\boldsymbol{\Sigma}}_{B}(\mathbf{H})\right).\end{split}

(20)

Similarly, by considering $\displaystyle\boldsymbol{\Sigma}_{G}(\mathbf{H})=\overline{\mathbf{h}}^{G}\overline{\mathbf{h}}^{G\top}$, we get:

\displaystyle\begin{split}\boldsymbol{\Sigma}_{B}(\mathbf{H})&=\widetilde{\boldsymbol{\Sigma}}_{B}(\mathbf{H})-\boldsymbol{\Sigma}_{G}(\mathbf{H})\\ \implies\mathrm{tr}\left(\boldsymbol{\Sigma}_{B}(\mathbf{H})\right)&=\mathrm{tr}\left(\widetilde{\boldsymbol{\Sigma}}_{B}(\mathbf{H})\right)-\mathrm{tr}\left(\boldsymbol{\Sigma}_{G}(\mathbf{H})\right).\end{split}

(21)

$\displaystyle\bullet$ Formulating $\displaystyle\mathrm{tr}\left(\widetilde{\boldsymbol{\Sigma}}_{T}(\mathbf{H})\right)$ : Expanding $\displaystyle\widetilde{\boldsymbol{\Sigma}}_{T}(\mathbf{H})$ into individual outer-products of vectors and leveraging the trace properties leads to the following:

\displaystyle\begin{split}\mathrm{tr}\left(\widetilde{\boldsymbol{\Sigma}}_{T}(\mathbf{H})\right)&=\mathrm{tr}\left(\frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{n_{c}}\mathbf{h}^{c,i}\mathbf{h}^{c,i\top}\right)=\frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{n_{c}}\mathrm{tr}\left(\mathbf{h}^{c,i}\mathbf{h}^{c,i\top}\right)\\ &=\frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{n_{c}}\mathrm{tr}\left(\mathbf{h}^{c,i\top}\mathbf{h}^{c,i}\right)\\ &=\frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{n_{c}}Q(\mathbf{x}^{c,i},\mathbf{x}^{c,i}).\end{split}

$\displaystyle\bullet$ Formulating $\displaystyle\mathrm{tr}\left(\widetilde{\boldsymbol{\Sigma}}_{B}(\mathbf{H})\right)$ : Similar to the above analysis, we can reformulate the trace of non-centered between-class covariance matrix $\displaystyle\widetilde{\boldsymbol{\Sigma}}_{B}(\mathbf{H})$ as:

\displaystyle\begin{split}\mathrm{tr}(\widetilde{\boldsymbol{\Sigma}}_{B})&=\mathrm{tr}\left(\frac{1}{C}\sum_{c=1}^{C}\overline{\mathbf{h}}^{c}\overline{\mathbf{h}}^{c\top}\right)=\frac{1}{C}\sum_{c=1}^{C}\mathrm{tr}\left(\overline{\mathbf{h}}^{c}\overline{\mathbf{h}}^{c\top}\right)=\frac{1}{C}\sum_{c=1}^{C}\mathrm{tr}\left(\overline{\mathbf{h}}^{c\top}\overline{\mathbf{h}}^{c}\right)\\ &=\frac{1}{C}\sum_{c=1}^{C}\mathrm{tr}\left(\left[\frac{1}{n_{c}}\sum_{i=1}^{n_{c}}\mathbf{h}^{c,i}\right]^{\top}\left[\frac{1}{n_{c}}\sum_{i=1}^{n_{c}}\mathbf{h}^{c,i}\right]\right)\\ &=\frac{1}{C}\sum_{c=1}^{C}\frac{1}{n_{c}^{2}}\mathrm{tr}\left(\sum_{i=1}^{n_{c}}\sum_{j=1}^{n_{c}}\mathbf{h}^{c,i\top}\mathbf{h}^{c,j}\right)=\frac{1}{C}\sum_{c=1}^{C}\frac{1}{n_{c}^{2}}\sum_{i=1}^{n_{c}}\sum_{j=1}^{n_{c}}\mathrm{tr}\left(\mathbf{h}^{c,i\top}\mathbf{h}^{c,j}\right)\\ &=\frac{1}{C}\sum_{c=1}^{C}\frac{1}{n_{c}^{2}}\sum_{i=1}^{n_{c}}\sum_{j=1}^{n_{c}}Q(\mathbf{x}^{c,i},\mathbf{x}^{c,j}).\end{split}

$\displaystyle\bullet$ Formulating $\displaystyle\mathrm{tr}\left(\boldsymbol{\Sigma}_{G}(\mathbf{H})\right)$ : Reformulation of $\displaystyle\mathrm{tr}\left(\boldsymbol{\Sigma}_{G}(\mathbf{H})\right)$ can be approached along the same lines:

\displaystyle\begin{split}\mathrm{tr}\left(\boldsymbol{\Sigma}_{G}(\mathbf{H})\right)&=\mathrm{tr}\left(\overline{\mathbf{h}}^{G}\overline{\mathbf{h}}^{G\top}\right)=\mathrm{tr}\left(\overline{\mathbf{h}}^{G\top}\overline{\mathbf{h}}^{G}\right)\\ &=\mathrm{tr}\left(\left[\frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{n_{c}}\mathbf{h}^{c,i}\right]^{\top}\left[\frac{1}{N}\sum_{c=1}^{C}\sum_{j=1}^{n_{c}}\mathbf{h}^{c,j}\right]\right)\\ &=\frac{1}{N^{2}}\mathrm{tr}\left(\sum_{c=1}^{C}\sum_{i=1}^{n_{c}}\sum_{c^{\prime}=1}^{C}\sum_{j=1}^{n_{c^{\prime}}}\mathbf{h}^{c,i\top}\mathbf{h}^{c^{\prime},j}\right)=\frac{1}{N^{2}}\sum_{c=1}^{C}\sum_{i=1}^{n_{c}}\sum_{c^{\prime}=1}^{C}\sum_{j=1}^{n_{c^{\prime}}}\mathrm{tr}\left(\mathbf{h}^{c,i\top}\mathbf{h}^{c^{\prime},j}\right)\\ &=\frac{1}{N^{2}}\sum_{c=1}^{C}\sum_{c^{\prime}=1}^{C}\sum_{i=1}^{n_{c}}\sum_{j=1}^{n_{c^{\prime}}}Q(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j}).\end{split}

By using these intermediate results, we can formulate $\displaystyle\mathrm{tr}\left(\boldsymbol{\Sigma}_{W}(\mathbf{H})\right),\mathrm{tr}\left(\boldsymbol{\Sigma}_{B}(\mathbf{H})\right)$ as:

\displaystyle\begin{split}\mathrm{tr}(\boldsymbol{\Sigma}_{W}(\mathbf{H}))&=\mathrm{tr}(\widetilde{\boldsymbol{\Sigma}}_{T}(\mathbf{H}))-\mathrm{tr}(\widetilde{\boldsymbol{\Sigma}}_{B}(\mathbf{H}))\\ &=\frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{n_{c}}Q(\mathbf{x}^{c,i},\mathbf{x}^{c,i})-\frac{1}{C}\sum_{c=1}^{C}\frac{1}{n_{c}^{2}}\sum_{i=1}^{n_{c}}\sum_{j=1}^{n_{c}}Q(\mathbf{x}^{c,i},\mathbf{x}^{c,j}),\\ \mathrm{tr}(\boldsymbol{\Sigma}_{B}(\mathbf{H}))&=\mathrm{tr}(\widetilde{\boldsymbol{\Sigma}}_{B}(\mathbf{H}))-\mathrm{tr}(\boldsymbol{\Sigma}_{G}(\mathbf{H}))\\ &=\frac{1}{C}\sum_{c=1}^{C}\frac{1}{n_{c}^{2}}\sum_{i=1}^{n_{c}}\sum_{j=1}^{n_{c}}Q(\mathbf{x}^{c,i},\mathbf{x}^{c,j})-\frac{1}{N^{2}}\sum_{c=1}^{C}\sum_{c^{\prime}=1}^{C}\sum_{i=1}^{n_{c}}\sum_{j=1}^{n_{c^{\prime}}}Q(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j}).\end{split}

This proves the theorem.
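Since the trace expressions above depend on the features only through kernel evaluations, they can be computed directly from an $\displaystyle N\times N$ kernel matrix. The NumPy sketch below illustrates this. The function names and the integer label vector are assumptions for illustration, and `nc1_from_kernel` assumes that the NC1 metric is the ratio $\displaystyle\mathrm{tr}(\boldsymbol{\Sigma}_{W}(\mathbf{H}))/\mathrm{tr}(\boldsymbol{\Sigma}_{B}(\mathbf{H}))$.

```python
import numpy as np


def nc1_traces_from_kernel(Q, labels):
    """Return (tr(Sigma_W), tr(Sigma_B)) from a kernel matrix.

    Q[a, b] holds the kernel value for samples a and b, and labels[a]
    is the class of sample a (both are assumed input conventions).
    """
    labels = np.asarray(labels)
    classes = np.unique(labels)
    N, C = Q.shape[0], len(classes)

    # tr(Sigma_T tilde): average of the diagonal kernel values.
    tr_T = np.trace(Q) / N

    # tr(Sigma_B tilde): per-class mean of the within-class kernel block,
    # averaged over the C classes.
    tr_B_tilde = 0.0
    for c in classes:
        idx = np.flatnonzero(labels == c)
        tr_B_tilde += Q[np.ix_(idx, idx)].sum() / (len(idx) ** 2)
    tr_B_tilde /= C

    # tr(Sigma_G): mean over all N^2 kernel values.
    tr_G = Q.sum() / N**2

    return tr_T - tr_B_tilde, tr_B_tilde - tr_G


def nc1_from_kernel(Q, labels):
    """Ratio tr(Sigma_W) / tr(Sigma_B) (assumed form of the NC1 metric)."""
    tr_W, tr_B = nc1_traces_from_kernel(Q, labels)
    return tr_W / tr_B
```

For a linear kernel $\displaystyle Q=\mathbf{H}^{\top}\mathbf{H}$ on balanced classes, the returned traces coincide with those computed explicitly from the centered features, which gives a simple sanity check.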

Appendix B Limiting NNGP and NTK for ReLU

Consider the GP limit characterization of the pre-activation kernel $\displaystyle K^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})$ as follows:

\displaystyle K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})=\sigma_{b}^{2}+\frac{\sigma_{w}^{2}}{d_{0}}\mathbf{x}^{c,i\top}\mathbf{x}^{c^{\prime},j}.

(22)

Observe that $\displaystyle K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})$ is independent of the activation function. Now, the closed form representation of the post-activation NNGP kernel $\displaystyle Q^{(1)}_{GP}(\cdot,\cdot)$ for the ReLU activation is given by:

\displaystyle\begin{split}Q_{GP-ReLU}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})&=\frac{\tau(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})}{2\pi}\sqrt{K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c,i})K_{GP}^{(1)}(\mathbf{x}^{c^{\prime},j},\mathbf{x}^{c^{\prime},j})},\\ \tau(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})&=\sin\theta_{c,i}^{c^{\prime},j}+\left(\pi-\theta_{c,i}^{c^{\prime},j}\right)\cos\theta_{c,i}^{c^{\prime},j},\\ \theta_{c,i}^{c^{\prime},j}&=\arccos\left(\frac{K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})}{\sqrt{K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c,i})K_{GP}^{(1)}(\mathbf{x}^{c^{\prime},j},\mathbf{x}^{c^{\prime},j})}}\right).\end{split}

(23)

Next, we define the ReLU based derivative kernel $\displaystyle\dot{Q}^{(1)}_{GP-ReLU}(\cdot,\cdot)$ as follows:

\displaystyle\dot{Q}_{GP-ReLU}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})=\frac{1}{2\pi}\left(\pi-\theta_{c,i}^{c^{\prime},j}\right).

(24)

Finally, the NTK can be formulated as follows:

\Theta^{(2)}_{NTK-ReLU}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})=K_{GP-ReLU% }^{(2)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})+K_{GP}^{(1)}(\mathbf{x}^{c% ,i},\mathbf{x}^{c^{\prime},j})\dot{Q}^{(1)}_{GP-ReLU}(\mathbf{x}^{c,i},\mathbf% {x}^{c^{\prime},j}).

(25)

Here, $\displaystyle K_{GP-ReLU}^{(2)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})$ can be defined using the recursive formulation:

\displaystyle\displaystyle\begin{split}K_{GP-ReLU}^{(2)}(\mathbf{x}^{c,i},% \mathbf{x}^{c^{\prime},j})=\sigma_{b}^{2}+\sigma_{w}^{2}Q_{GP-ReLU}^{(1)}(% \mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j}).\end{split}

(26)
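
The kernels in (22)-(26) can be evaluated numerically for a pair of inputs. The sketch below is an illustrative transcription in plain Python under stated assumptions: the function names are hypothetical, inputs are lists of floats of length $d_0$, and the inputs are assumed to give strictly positive diagonal kernel values so that the normalisation in (23) is well defined.

```python
import math


def k_gp(x, y, sigma_w=1.0, sigma_b=0.0):
    """Pre-activation kernel (22); x and y are lists of floats of length d0."""
    d0 = len(x)
    return sigma_b ** 2 + sigma_w ** 2 / d0 * sum(a * b for a, b in zip(x, y))


def theta(x, y, **kw):
    """Angle from (23); assumes k_gp(x, x) > 0 and k_gp(y, y) > 0."""
    denom = math.sqrt(k_gp(x, x, **kw) * k_gp(y, y, **kw))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    return math.acos(max(-1.0, min(1.0, k_gp(x, y, **kw) / denom)))


def q_relu(x, y, **kw):
    """Post-activation NNGP kernel for ReLU, as in (23)."""
    t = theta(x, y, **kw)
    tau = math.sin(t) + (math.pi - t) * math.cos(t)
    return tau / (2 * math.pi) * math.sqrt(k_gp(x, x, **kw) * k_gp(y, y, **kw))


def q_dot_relu(x, y, **kw):
    """Derivative kernel (24)."""
    return (math.pi - theta(x, y, **kw)) / (2 * math.pi)


def ntk_relu(x, y, sigma_w=1.0, sigma_b=0.0):
    """NTK (25), using the recursion (26) for the second-layer kernel."""
    kw = dict(sigma_w=sigma_w, sigma_b=sigma_b)
    k2 = sigma_b ** 2 + sigma_w ** 2 * q_relu(x, y, **kw)
    return k2 + k_gp(x, y, **kw) * q_dot_relu(x, y, **kw)


same_sign = (q_relu([2.0], [3.0]), ntk_relu([2.0], [3.0]))
opposite_sign = (q_relu([2.0], [-3.0]), ntk_relu([2.0], [-3.0]))
```

With $\sigma_{w}=1$, $\sigma_{b}=0$ and scalar inputs, a same-sign pair gives $\theta=0$ and an opposite-sign pair gives $\theta=\pi$, so both kernels vanish (up to round-off) for the opposite-sign pair.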

Appendix C General Results for NC1 with Kernels

In this section, we present general results for calculating $\displaystyle\mathbb{E}\left[\mathcal{N}\mathcal{C}_{1}(\mathbf{H})\right]$ for any given kernel function $\displaystyle Q(\cdot,\cdot)$ that is associated with the features $\displaystyle\mathbf{H}$. To begin with, we consider a generic formulation of the three cases for $\displaystyle\mathbb{E}\left[Q(x^{c,i},x^{c^{\prime},j})\right]$:

\displaystyle\displaystyle\begin{split}\mathbb{E}\left[Q(x^{c,i},x^{c^{\prime}% ,j})\right]=\begin{cases}V^{(1)}(c)&\text{if }c=c^{\prime},i=j\\ V^{(2)}(c)&\text{if }c=c^{\prime},i\neq j\\ V^{(3)}(c,c^{\prime})&\text{if }c\neq c^{\prime}\\ \end{cases}.\end{split}

(27)

Lemma C.1.

Given the cases for the expected values of a kernel function $\displaystyle Q(\cdot,\cdot)$ as per (27) and $\displaystyle C=2$ classes, $\displaystyle\mathbb{E}\left[\mathrm{tr}(\boldsymbol{\Sigma}_{W}(\mathbf{H}))\right]$ is given by (the sum over $\displaystyle c$ applies to both terms):

\displaystyle\displaystyle\mathbb{E}\left[\mathrm{tr}(\boldsymbol{\Sigma}_{W}(% \mathbf{H}))\right]=\sum_{c=1}^{2}\frac{n_{c}}{N}V^{(1)}(c)-\frac{1}{2n_{c}^{2% }}\left(n_{c}(n_{c}-1)V^{(2)}(c)+n_{c}V^{(1)}(c)\right)

(28)

Proof.

By leveraging Theorem 4.1, we can compute the expected value of $\displaystyle\mathrm{tr}(\boldsymbol{\Sigma}_{W}(\mathbf{H}))$ as follows:

\displaystyle\displaystyle\begin{split}\mathbb{E}\left[\mathrm{tr}(\boldsymbol% {\Sigma}_{W}(\mathbf{H}))\right]&=\mathbb{E}\left[\frac{1}{N}\sum_{c=1}^{C}% \sum_{i=1}^{n_{c}}Q(x^{c,i},x^{c,i})\right]-\mathbb{E}\left[\frac{1}{C}\sum_{c% =1}^{C}\frac{1}{n_{c}^{2}}\sum_{i=1}^{n_{c}}\sum_{j=1}^{n_{c}}Q(x^{c,i},x^{c,j% })\right]\\ &=\frac{1}{N}\sum_{c=1}^{2}\sum_{i=1}^{n_{c}}\mathbb{E}\left[Q(x^{c,i},x^{c,i}% )\right]-\frac{1}{2}\sum_{c=1}^{2}\frac{1}{n_{c}^{2}}\sum_{i=1}^{n_{c}}\sum_{j% =1}^{n_{c}}\mathbb{E}\left[Q(x^{c,i},x^{c,j})\right]\\ &=\frac{1}{N}\sum_{c=1}^{2}\sum_{i=1}^{n_{c}}V^{(1)}(c)-\frac{1}{2}\sum_{c=1}^% {2}\frac{1}{n_{c}^{2}}\left(n_{c}(n_{c}-1)V^{(2)}(c)+n_{c}V^{(1)}(c)\right)\\ &=\sum_{c=1}^{2}\frac{n_{c}}{N}V^{(1)}(c)-\frac{1}{2n_{c}^{2}}\left(n_{c}(n_{c% }-1)V^{(2)}(c)+n_{c}V^{(1)}(c)\right).\end{split}

(29)

∎

Lemma C.2.

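Given the cases for the expected values of a kernel function $\displaystyle Q(\cdot,\cdot)$ as per (27) and $\displaystyle C=2$ classes, $\displaystyle\mathbb{E}\left[\mathrm{tr}(\boldsymbol{\Sigma}_{B}(\mathbf{H}))\right]$ is given by:

\displaystyle\mathbb{E}\left[\mathrm{tr}(\boldsymbol{\Sigma}_{B}(\mathbf{H}))\right]=\left[\sum_{c=1}^{2}\left(\frac{1}{2n_{c}^{2}}-\frac{1}{N^{2}}\right)\left(n_{c}(n_{c}-1)V^{(2)}(c)+n_{c}V^{(1)}(c)\right)\right]-\frac{2n_{1}n_{2}}{N^{2}}V^{(3)}(1,2)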

(30)

Proof.

The expected value of $\displaystyle\mathrm{tr}(\boldsymbol{\Sigma}_{B}(\mathbf{H}))$ can be computed using Theorem 4.1 as:

\displaystyle\begin{split}\mathbb{E}\left[\mathrm{tr}(\boldsymbol{\Sigma}_{B}(\mathbf{H}))\right]&=\mathbb{E}\left[\frac{1}{C}\sum_{c=1}^{C}\frac{1}{n_{c}^{2}}\sum_{i=1}^{n_{c}}\sum_{j=1}^{n_{c}}Q(x^{c,i},x^{c,j})\right]-\mathbb{E}\left[\frac{1}{N^{2}}\sum_{c=1}^{C}\sum_{c^{\prime}=1}^{C}\sum_{i=1}^{n_{c}}\sum_{j=1}^{n_{c^{\prime}}}Q(x^{c,i},x^{c^{\prime},j})\right]\\ &=\left[\frac{1}{2}\sum_{c=1}^{2}\frac{1}{n_{c}^{2}}\left(n_{c}(n_{c}-1)V^{(2)}(c)+n_{c}V^{(1)}(c)\right)\right]\\ &\hskip 20.0pt-\frac{1}{N^{2}}\left[\sum_{c=1}^{2}\left(n_{c}(n_{c}-1)V^{(2)}(c)+n_{c}V^{(1)}(c)\right)\right]\\ &\hskip 20.0pt-\frac{1}{N^{2}}\left[2\sum_{i=1}^{n_{1}}\sum_{j=1}^{n_{2}}V^{(3)}(1,2)\right]\\ &=\left[\sum_{c=1}^{2}\left(\frac{1}{2n_{c}^{2}}-\frac{1}{N^{2}}\right)\left(n_{c}(n_{c}-1)V^{(2)}(c)+n_{c}V^{(1)}(c)\right)\right]-\frac{2n_{1}n_{2}}{N^{2}}V^{(3)}(1,2).\end{split}

(31)

∎

Lemma C.3.

Given the cases for the expected values of a kernel function $\displaystyle Q(\cdot,\cdot)$ as per (27), the $\displaystyle\mathbb{E}\left[\mathcal{N}\mathcal{C}_{1}(\mathbf{H})\right]$ is given by:

\displaystyle\displaystyle\mathbb{E}\left[\mathcal{N}\mathcal{C}_{1}(\mathbf{H% })\right]=\frac{\sum\limits_{c=1}^{2}\frac{n_{c}V^{(1)}(c)}{N}-\frac{\left(n_{% c}(n_{c}-1)V^{(2)}(c)+n_{c}V^{(1)}(c)\right)}{2n_{c}^{2}}}{\left[\sum\limits_{% c=1}^{2}\left(\frac{1}{2n_{c}^{2}}-\frac{1}{N^{2}}\right)\left(n_{c}(n_{c}-1)V% ^{(2)}(c)+n_{c}V^{(1)}(c)\right)\right]-\frac{2n_{1}n_{2}V^{(3)}(1,2)}{N^{2}}}% +\Delta_{h.o.t}

(32)

Proof.

Note that the expectation of the ratio of two random variables can be expanded around the ratio of their expectations:

	$\displaystyle\displaystyle\mathbb{E}\left[\mathcal{N}\mathcal{C}_{1}(\mathbf{H% })\right]=\frac{\mathbb{E}\left[\mathrm{tr}(\Sigma_{W}(\mathbf{H}))\right]}{% \mathbb{E}\left[\mathrm{tr}(\Sigma_{B}(\mathbf{H}))\right]}+\Delta_{h.o.t}$		(33)
	$\displaystyle\displaystyle=\frac{\sum\limits_{c=1}^{2}\frac{n_{c}V^{(1)}(c)}{N% }-\frac{\left(n_{c}(n_{c}-1)V^{(2)}(c)+n_{c}V^{(1)}(c)\right)}{2n_{c}^{2}}}{% \left[\sum\limits_{c=1}^{2}\left(\frac{1}{2n_{c}^{2}}-\frac{1}{N^{2}}\right)% \left(n_{c}(n_{c}-1)V^{(2)}(c)+n_{c}V^{(1)}(c)\right)\right]-\frac{2n_{1}n_{2}% V^{(3)}(1,2)}{N^{2}}}+\Delta_{h.o.t}$		(34)

Here, $\displaystyle\Delta_{h.o.t}$ corresponds to higher order terms given by [53]:

\displaystyle\displaystyle\Delta_{h.o.t}=\frac{Var(\mathrm{tr}(\boldsymbol{% \Sigma}_{B}(\mathbf{H})))\mathbb{E}\left[\mathrm{tr}(\Sigma_{W}(\mathbf{H}))% \right]}{\mathbb{E}\left[\mathrm{tr}(\Sigma_{B}(\mathbf{H}))\right]^{3}}-\frac% {Cov(\mathrm{tr}(\boldsymbol{\Sigma}_{W}(\mathbf{H})),\mathrm{tr}(\boldsymbol{% \Sigma}_{B}(\mathbf{H})))}{\mathbb{E}\left[\mathrm{tr}(\Sigma_{B}(\mathbf{H}))% \right]^{2}},

(35)

where, based on the well-studied concentration of sample covariance matrices around the statistical covariance [54], $\displaystyle\Delta_{h.o.t}$ tends to $\displaystyle 0$ for large $\displaystyle n_{c}$ values.

∎
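
Lemmas C.1 and C.2 reduce the expected traces to arithmetic on the three values in (27), which is easy to check numerically. The sketch below is a hedged illustration for the two-class case: the function name `expected_traces` and the input values are assumptions, the sum over $c$ in (28) is taken to cover both terms (as in the proof), and the correction $\Delta_{h.o.t}$ of Lemma C.3 is not included.

```python
def expected_traces(n, V1, V2, V3_12):
    """n, V1, V2: length-2 sequences indexed by class; V3_12: cross-class value.

    Hypothetical helper returning (E[tr Sigma_W], E[tr Sigma_B]) per (28) and (30).
    """
    N = n[0] + n[1]
    # Shared block: n_c (n_c - 1) V2(c) + n_c V1(c)
    block = [n[c] * (n[c] - 1) * V2[c] + n[c] * V1[c] for c in range(2)]
    # Lemma C.1, equation (28)
    e_tr_w = sum(n[c] / N * V1[c] - block[c] / (2 * n[c] ** 2) for c in range(2))
    # Lemma C.2, equation (30)
    e_tr_b = sum((1 / (2 * n[c] ** 2) - 1 / N ** 2) * block[c] for c in range(2))
    e_tr_b -= 2 * n[0] * n[1] / N ** 2 * V3_12
    return e_tr_w, e_tr_b


# Assumed toy values: two samples per class, V1 = 2, V2 = 1, V3 = 0.
e_tr_w, e_tr_b = expected_traces([2, 2], [2.0, 2.0], [1.0, 1.0], 0.0)
ratio = e_tr_w / e_tr_b  # leading term of Lemma C.3
```

For these toy values the expected traces are 0.5 and 0.75, so the leading term of (32) is 2/3.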

Lemma C.4.

For a random variable $\displaystyle x^{c,i}\sim\mathcal{N}(\mu_{c},\sigma_{c}^{2})$ which represents the $\displaystyle i^{th}$ sample of class $\displaystyle c$ (as per notation in Section 5.3), the expected value $\displaystyle\mathbb{E}\left[\frac{1}{(x^{c,i})^{2}}\right]$ is given by:

\displaystyle\displaystyle T(c)=\mathbb{E}\left[\frac{1}{(x^{c,i})^{2}}\right]% =\frac{1}{(\mu_{c}^{2}+\sigma_{c}^{2})}+\frac{2\sigma_{c}^{4}+4\sigma_{c}^{2}% \mu_{c}^{2}}{(\mu_{c}^{2}+\sigma_{c}^{2})^{3}}

(36)

Proof.

Based on the standard second-order approximation of the expectation of a ratio [53], we get:

	$\displaystyle\displaystyle\mathbb{E}\left[\frac{1}{(x^{c,i})^{2}}\right]$	$\displaystyle\displaystyle=\frac{1}{\mathbb{E}\left[(x^{c,i})^{2}\right]}+% \frac{Var((x^{c,i})^{2})}{\mathbb{E}\left[(x^{c,i})^{2}\right]^{3}}$		(37)
		$\displaystyle\displaystyle=\frac{1}{(\mu_{c}^{2}+\sigma_{c}^{2})}+\frac{% \mathbb{E}[(x^{c,i})^{4}]-(\mu_{c}^{2}+\sigma_{c}^{2})^{2}}{(\mu_{c}^{2}+% \sigma_{c}^{2})^{3}}$		(38)

Based on the results from the moment-generating function, we know that:

\displaystyle\displaystyle\mathbb{E}[(x^{c,i})^{4}]=3\sigma_{c}^{4}+6\sigma_{c% }^{2}\mu_{c}^{2}+\mu_{c}^{4},

(39)

which gives us:

	$\displaystyle\displaystyle\mathbb{E}\left[\frac{1}{(x^{c,i})^{2}}\right]$	$\displaystyle\displaystyle=\frac{1}{(\mu_{c}^{2}+\sigma_{c}^{2})}+\frac{3% \sigma_{c}^{4}+6\sigma_{c}^{2}\mu_{c}^{2}+\mu_{c}^{4}-(\mu_{c}^{2}+\sigma_{c}^% {2})^{2}}{(\mu_{c}^{2}+\sigma_{c}^{2})^{3}}$		(40)
		$\displaystyle\displaystyle=\frac{1}{(\mu_{c}^{2}+\sigma_{c}^{2})}+\frac{2% \sigma_{c}^{4}+4\sigma_{c}^{2}\mu_{c}^{2}}{(\mu_{c}^{2}+\sigma_{c}^{2})^{3}}.$		(41)

This proves the lemma.

∎
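
The algebra between (38) and (41) can be sanity-checked numerically. The sketch below, with hypothetical function names and arbitrarily chosen test values, evaluates the expression of Lemma C.4 once through the moments in (37)-(39) and once through the closed form (36); it only checks the simplification step, not the underlying ratio approximation.

```python
def fourth_moment(mu, sigma):
    """Fourth raw moment of N(mu, sigma^2), as in (39)."""
    return 3 * sigma ** 4 + 6 * sigma ** 2 * mu ** 2 + mu ** 4


def T_from_moments(mu, sigma):
    """Right-hand side of (37)-(38), built from the second and fourth moments."""
    m2 = mu ** 2 + sigma ** 2
    var_x2 = fourth_moment(mu, sigma) - m2 ** 2
    return 1 / m2 + var_x2 / m2 ** 3


def T_closed_form(mu, sigma):
    """Closed form (36)."""
    m2 = mu ** 2 + sigma ** 2
    return 1 / m2 + (2 * sigma ** 4 + 4 * sigma ** 2 * mu ** 2) / m2 ** 3


# Assumed test values, chosen only for illustration.
gap = abs(T_from_moments(-5.0, 0.5) - T_closed_form(-5.0, 0.5))
```

The two evaluations agree up to floating-point round-off.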

Appendix D Proof of Theorem 5.1

D.1 NC1 of limiting NNGP with ReLU activation

In the limit $\displaystyle d_{1}\to\infty$ , we leverage the kernels in the GP limit as per (10), (23). Observe that for any two data points $\displaystyle x^{c,i},x^{c^{\prime},j}\in\mathbb{R}$ , the value of $\displaystyle\theta_{c,i}^{c^{\prime},j}$ can be given as:

	$\displaystyle\displaystyle\theta_{c,i}^{c^{\prime},j}$	$\displaystyle\displaystyle=\arccos\left(\frac{K_{GP}^{(1)}(x^{c,i},x^{c^{% \prime},j})}{\sqrt{K_{GP}^{(1)}(x^{c,i},x^{c,i})K_{GP}^{(1)}(x^{c^{\prime},j},% x^{c^{\prime},j})}}\right)$
		$\displaystyle\displaystyle=\arccos\left(\frac{\sigma_{b}^{2}+\frac{\sigma_{w}^% {2}}{d_{0}}x^{c,i}x^{c^{\prime},j}}{\sqrt{\left(\sigma_{b}^{2}+\frac{\sigma_{w% }^{2}}{d_{0}}x^{c,i}x^{c,i}\right)\left(\sigma_{b}^{2}+\frac{\sigma_{w}^{2}}{d% _{0}}x^{c^{\prime},j}x^{c^{\prime},j}\right)}}\right).$

Since $\displaystyle\sigma_{b}\to 0$ , the value of $\displaystyle\theta_{c,i}^{c^{\prime},j}$ simplifies to:

\displaystyle\displaystyle\theta_{c,i}^{c^{\prime},j}=\begin{cases}0&\text{if % }c=c^{\prime}\\ \pi&\text{if }c\neq c^{\prime}\\ \end{cases},

(42)

which follows from $\displaystyle\frac{x^{c,i}x^{c^{\prime},j}}{\sqrt{x^{c,i}x^{c,i}}\sqrt{x^{c^{% \prime},j}x^{c^{\prime},j}}}=\operatorname{sign}(x^{c,i})\operatorname{sign}(x% ^{c^{\prime},j})$ and $\displaystyle x^{1,i}<0,x^{2,j}>0$ almost surely. This leads to:

	$\displaystyle\displaystyle Q_{GP-ReLU}^{(1)}(x^{c,i},x^{c^{\prime},j})$	$\displaystyle\displaystyle=\frac{1}{2\pi}\sqrt{\sigma_{w}^{4}(x^{c,i})^{2}(x^{% c^{\prime},j})^{2}}\bigg{(}\sin\theta_{c,i}^{c^{\prime},j}+\left(\pi-\theta_{c% ,i}^{c^{\prime},j}\right)\cos\theta_{c,i}^{c^{\prime},j}\bigg{)}$		(43)
	$\displaystyle\implies Q_{GP-ReLU}^{(1)}(x^{c,i},x^{c^{\prime},j})$	$\displaystyle=\begin{cases}\frac{\sigma_{w}^{2}}{2}\left\|x^{c,i}\right\|\left\|x^{c^{\prime},j}\right\|&\text{if }c=c^{\prime}\\ 0&\text{if }c\neq c^{\prime}\end{cases}$		(44)

For the $\displaystyle c=c^{\prime}$ case, the kernel reduces to a scaled product of the norms of two random variables drawn from the same distribution (independent when $\displaystyle i\neq j$). Since we assume $\displaystyle x^{c,i}x^{c^{\prime},j}>0$ if $\displaystyle c=c^{\prime}$, (44) can be rewritten as:

\displaystyle Q_{GP-ReLU}^{(1)}(x^{c,i},x^{c^{\prime},j})=\begin{cases}\frac{\sigma_{w}^{2}}{2}x^{c,i}x^{c^{\prime},j}&\text{if }c=c^{\prime}\\ 0&\text{if }c\neq c^{\prime}\end{cases}

(45)

Additionally, since $\displaystyle x^{c,i}$ are random variables, the expected value of the kernel can be formulated as:

\displaystyle\displaystyle\mathbb{E}\left[Q_{GP-ReLU}^{(1)}(x^{c,i},x^{c^{% \prime},j})\right]=\begin{cases}\frac{\sigma_{w}^{2}}{2}\left(\sigma_{c}^{2}+% \mu_{c}^{2}\right)&\text{if }c=c^{\prime},i=j\\ \frac{\sigma_{w}^{2}}{2}\mu_{c}^{2}&\text{if }c=c^{\prime},i\neq j\\ 0&\text{if }c\neq c^{\prime}\end{cases}

(46)

Thus, based on our generic formulation of cases in (27) in Appendix C, we get:

\displaystyle\displaystyle V^{(1)}(c)=\frac{\sigma_{w}^{2}}{2}\left(\sigma_{c}% ^{2}+\mu_{c}^{2}\right);\hskip 10.0ptV^{(2)}(c)=\frac{\sigma_{w}^{2}}{2}\mu_{c% }^{2};\hskip 10.0ptV^{(3)}(c,c^{\prime})=0.

(47)

As $\displaystyle N\gg 1$ and $\displaystyle n_{c}\gg 1,\forall c\in\{1,2\}$ , Lemma C.3 gives us:

\displaystyle\displaystyle\begin{split}\mathbb{E}[\mathcal{N}\mathcal{C}_{1}(% \mathbf{H}_{GP})]&=\frac{\sum_{c=1}^{2}\frac{n_{c}V^{(1)}(c)}{N}-\frac{V^{(2)}% (c)}{2}}{\left[\sum_{c=1}^{2}\left(\frac{1}{2n_{c}^{2}}-\frac{1}{N^{2}}\right)% \left(n_{c}^{2}V^{(2)}(c)\right)\right]-\frac{2n_{1}n_{2}}{N^{2}}V^{(3)}(1,2)}% +\Delta_{h.o.t}\\ \implies\mathbb{E}[\mathcal{N}\mathcal{C}_{1}(\mathbf{H}_{GP})]&=\frac{\sum_{c% =1}^{2}\frac{n_{c}\mu_{c}^{2}+n_{c}\sigma_{c}^{2}}{N}-\frac{\mu_{c}^{2}}{2}}{% \left(\sum_{c=1}^{2}\frac{\mu_{c}^{2}}{2}-\frac{n_{c}^{2}\mu_{c}^{2}}{N^{2}}% \right)}+\Delta_{h.o.t}.\end{split}

(48)
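
To get a feel for how (48) depends on the data statistics, its leading term can be evaluated for concrete class sizes, means, and standard deviations. The sketch below is illustrative only: the function name `nc1_relu_limit` and the parameter values are assumptions, and $\Delta_{h.o.t}$ is omitted.

```python
def nc1_relu_limit(n, mu, sigma):
    """Leading term of (48) for two classes; n, mu, sigma are length-2 sequences."""
    N = n[0] + n[1]
    num = sum(
        n[c] * (mu[c] ** 2 + sigma[c] ** 2) / N - mu[c] ** 2 / 2 for c in range(2)
    )
    den = sum(mu[c] ** 2 / 2 - n[c] ** 2 * mu[c] ** 2 / N ** 2 for c in range(2))
    return num / den


# Assumed balanced setting with means of opposite sign.
small_var = nc1_relu_limit([500, 500], [-2.0, 2.0], [0.5, 0.5])
large_var = nc1_relu_limit([500, 500], [-2.0, 2.0], [1.0, 1.0])
```

In this balanced example the value is 0.125 for $\sigma_{c}=0.5$ and 0.5 for $\sigma_{c}=1$, reflecting that the leading term reduces to $2\sum_{c}\sigma_{c}^{2}/\sum_{c}\mu_{c}^{2}$ when $n_{1}=n_{2}$.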

D.2 NC1 of limiting NTK with ReLU activation

Recall from (13) that the recursive relationship between the NTK and NNGP [31, 35] is:

\Theta^{(2)}_{NTK-ReLU}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})=K_{GP-ReLU% }^{(2)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})+K_{GP}^{(1)}(\mathbf{x}^{c% ,i},\mathbf{x}^{c^{\prime},j})\dot{Q}^{(1)}_{GP-ReLU}(\mathbf{x}^{c,i},\mathbf% {x}^{c^{\prime},j})

(49)

Here, $\displaystyle K_{GP-ReLU}^{(2)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})$ can be defined using the following recursive formulation:

\displaystyle\displaystyle\begin{split}K_{GP-ReLU}^{(2)}(\mathbf{x}^{c,i},% \mathbf{x}^{c^{\prime},j})=\sigma_{b}^{2}+\sigma_{w}^{2}Q_{GP-ReLU}^{(1)}(% \mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j}).\end{split}

(50)

Based on (24), the derivative $\displaystyle\dot{Q}^{(1)}_{GP-ReLU}$ can be given as follows:

\displaystyle\displaystyle\begin{split}\dot{Q}_{GP-ReLU}^{(1)}(\mathbf{x}^{c,i% },\mathbf{x}^{c^{\prime},j})&=\frac{1}{2\pi}\left(\pi-\theta_{c,i}^{c^{\prime}% ,j}\right)\\ \theta_{c,i}^{c^{\prime},j}&=\arccos\left(\frac{K_{GP}^{(1)}(\mathbf{x}^{c,i},% \mathbf{x}^{c^{\prime},j})}{\sqrt{K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c,% i})K_{GP}^{(1)}(\mathbf{x}^{c^{\prime},j},\mathbf{x}^{c^{\prime},j})}}\right).% \end{split}

(51)

We build on the results from the NNGP analysis (with $\displaystyle Q_{GP-ReLU}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})$ ) for computing the variability collapse with the limiting NTK. First, note that:

\displaystyle\displaystyle\theta_{c,i}^{c^{\prime},j}=\begin{cases}0&\text{if % }c=c^{\prime}\\ \pi&\text{if }c\neq c^{\prime}\\ \end{cases}.

(52)

When $\displaystyle\sigma_{b}\to 0$ , we get:

\displaystyle\displaystyle\Theta^{(2)}_{NTK-ReLU}(\mathbf{x}^{c,i},\mathbf{x}^% {c^{\prime},j})=\sigma_{w}^{2}Q_{GP-ReLU}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c% ^{\prime},j})+K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})\dot{Q}^% {(1)}_{GP-ReLU}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j}).

(53)

From (45), we know that:

\displaystyle Q_{GP-ReLU}^{(1)}(x^{c,i},x^{c^{\prime},j})=\begin{cases}\frac{\sigma_{w}^{2}}{2}x^{c,i}x^{c^{\prime},j}&\text{if }c=c^{\prime}\\ 0&\text{if }c\neq c^{\prime}\end{cases}

(54)

	$\displaystyle\displaystyle\implies\Theta_{NTK-ReLU}^{(2)}(x^{c,i},x^{c^{\prime% },j})$	$\displaystyle\displaystyle=\begin{cases}\frac{\sigma_{w}^{4}}{2}x^{c,i}x^{c,j}% +\frac{\sigma_{w}^{2}}{2}x^{c,i}x^{c,j}&\text{if }c=c^{\prime}\\ 0&\text{if }c\neq c^{\prime}\end{cases},$		(55)
		$\displaystyle\displaystyle=\begin{cases}\left(\frac{\sigma_{w}^{4}}{2}+\frac{% \sigma_{w}^{2}}{2}\right)x^{c,i}x^{c,j}&\text{if }c=c^{\prime}\\ 0&\text{if }c\neq c^{\prime}\end{cases}.$		(56)

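Notice that $\displaystyle\Theta_{NTK-ReLU}^{(2)}(x^{c,i},x^{c^{\prime},j})$ is a scaled version of $\displaystyle Q_{GP-ReLU}^{(1)}(x^{c,i},x^{c^{\prime},j})$ (as per (45)). Since the traces in Theorem 4.1 are linear in the kernel, this scaling cancels in the ratio that defines $\displaystyle\mathcal{N}\mathcal{C}_{1}$, and we end up with the same result as (48):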

\displaystyle\displaystyle\begin{split}\mathbb{E}\left[\mathcal{N}\mathcal{C}_% {1}(\mathbf{H}_{NTK})\right]=\frac{\sum_{c=1}^{2}\frac{n_{c}\mu_{c}^{2}+n_{c}% \sigma_{c}^{2}}{N}-\frac{\mu_{c}^{2}}{2}}{\left(\sum_{c=1}^{2}\frac{\mu_{c}^{2% }}{2}-\frac{n_{c}^{2}\mu_{c}^{2}}{N^{2}}\right)}+\Delta_{h.o.t}.\end{split}

(57)

Appendix E Results for NC1 with Erf activation

E.1 NC1 of Limiting NNGP with Erf activation

Under the Assumptions described in Section 5.3 with $\displaystyle d_{0}=1,d_{1}\to\infty$ , observe that (11) gives us:

	$\displaystyle\displaystyle Q_{GP-Erf}^{(1)}(x^{c,i},x^{c^{\prime},j})$	$\displaystyle\displaystyle=\frac{2}{\pi}\arcsin\left(\frac{2K_{GP}^{(1)}(x^{c,% i},x^{c^{\prime},j})}{\sqrt{1+2K_{GP}^{(1)}(x^{c,i},x^{c,i})}\sqrt{1+2K_{GP}^{% (1)}(x^{c^{\prime},j},x^{c^{\prime},j})}}\right)$		(58)
		$\displaystyle\displaystyle=\frac{2}{\pi}\arcsin\left(\frac{2\sigma_{b}^{2}+2% \sigma_{w}^{2}x^{c,i}x^{c^{\prime},j}}{\sqrt{1+2\sigma_{b}^{2}+2\sigma_{w}^{2}% (x^{c,i})^{2}}\sqrt{1+2\sigma_{b}^{2}+2\sigma_{w}^{2}(x^{c^{\prime},j})^{2}}}\right)$		(59)

Considering $\displaystyle\sigma_{b}\to 0$ :

	$\displaystyle\displaystyle Q_{GP-Erf}^{(1)}(x^{c,i},x^{c^{\prime},j})$	$\displaystyle\displaystyle=\frac{2}{\pi}\arcsin\left(\frac{2\sigma_{w}^{2}x^{c% ,i}x^{c^{\prime},j}}{\sqrt{1+2\sigma_{w}^{2}(x^{c,i})^{2}}\sqrt{1+2\sigma_{w}^% {2}(x^{c^{\prime},j})^{2}}}\right)$		(60)
		$\displaystyle\displaystyle=\frac{2}{\pi}\arcsin\left(\frac{\operatorname{sign}% (x^{c,i})\operatorname{sign}(x^{c^{\prime},j})}{\sqrt{1+\frac{1}{2\sigma_{w}^{% 2}(x^{c,i})^{2}}}\sqrt{1+\frac{1}{2\sigma_{w}^{2}(x^{c^{\prime},j})^{2}}}}% \right),$		(61)

where the last equality comes from:

\displaystyle\displaystyle\frac{x^{c,i}x^{c^{\prime},j}}{|x^{c,i}|\sqrt{1+% \frac{1}{2\sigma_{w}^{2}(x^{c,i})^{2}}}\cdot|x^{c^{\prime},j}|\sqrt{1+\frac{1}% {2\sigma_{w}^{2}(x^{c^{\prime},j})^{2}}}}=\frac{\operatorname{sign}(x^{c,i})% \operatorname{sign}(x^{c^{\prime},j})}{\sqrt{1+\frac{1}{2\sigma_{w}^{2}(x^{c,i% })^{2}}}\sqrt{1+\frac{1}{2\sigma_{w}^{2}(x^{c^{\prime},j})^{2}}}}.

(62)

For notational simplicity, consider:

\displaystyle\displaystyle\rho(x^{c,i},x^{c^{\prime},j})=\sqrt{1+\frac{1}{2% \sigma_{w}^{2}(x^{c,i})^{2}}}\sqrt{1+\frac{1}{2\sigma_{w}^{2}(x^{c^{\prime},j}% )^{2}}},

(63)

and represent $\displaystyle Q_{GP-Erf}^{(1)}(x^{c,i},x^{c^{\prime},j})$ as:

\displaystyle\displaystyle Q_{GP-Erf}^{(1)}(x^{c,i},x^{c^{\prime},j})=\frac{2}% {\pi}\arcsin\left(\frac{\operatorname{sign}(x^{c,i})\operatorname{sign}(x^{c^{% \prime},j})}{\rho(x^{c,i},x^{c^{\prime},j})}\right).

(64)

Based on Assumption 1, we know that $\displaystyle x^{1,i}<0,x^{2,j}>0$ almost surely. This leads to:

\displaystyle\displaystyle Q_{GP-Erf}^{(1)}(x^{c,i},x^{c^{\prime},j})=\begin{% cases}\frac{2}{\pi}\arcsin\left(\frac{1}{\rho(x^{c,i},x^{c,j})}\right)&\text{% if }c=c^{\prime}\\ -\frac{2}{\pi}\arcsin\left(\frac{1}{\rho(x^{c,i},x^{c^{\prime},j})}\right)&% \text{if }c\neq c^{\prime}\\ \end{cases}.

(65)
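
The rewriting of (60) into the sign form (64) can be verified numerically for scalar inputs. The following sketch uses hypothetical function names and arbitrarily chosen nonzero inputs, with $\sigma_{b}\to 0$ and $d_{0}=1$ as in the text.

```python
import math


def q_erf_direct(x, y, sigma_w=1.0):
    """Equation (60): Erf NNGP kernel for scalar inputs with sigma_b -> 0."""
    num = 2 * sigma_w ** 2 * x * y
    den = math.sqrt(1 + 2 * sigma_w ** 2 * x ** 2) * math.sqrt(
        1 + 2 * sigma_w ** 2 * y ** 2
    )
    return 2 / math.pi * math.asin(num / den)


def rho(x, y, sigma_w=1.0):
    """Equation (63); requires nonzero x and y."""
    return math.sqrt(1 + 1 / (2 * sigma_w ** 2 * x ** 2)) * math.sqrt(
        1 + 1 / (2 * sigma_w ** 2 * y ** 2)
    )


def q_erf_sign_form(x, y, sigma_w=1.0):
    """Equation (64): the same kernel written with sign(x) sign(y) / rho."""
    s = math.copysign(1.0, x) * math.copysign(1.0, y)
    return 2 / math.pi * math.asin(s / rho(x, y, sigma_w))


# Assumed inputs: one same-class pair (same sign), one cross-class pair.
same_class = (q_erf_direct(-2.0, -3.0), q_erf_sign_form(-2.0, -3.0))
cross_class = (q_erf_direct(-2.0, 3.0), q_erf_sign_form(-2.0, 3.0))
```

Both forms agree, and the kernel is positive for the same-sign pair and negative for the opposite-sign pair, matching the two cases in (65).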

E.1.1 Calculating $\displaystyle\mathbb{E}\left[Q_{GP-Erf}^{(1)}(x^{c,i},x^{c^{\prime},j})\right]$

For $\displaystyle|u|\leq 1$ , we consider the expansion of $\displaystyle\arcsin(u)=u+\frac{u^{3}}{6}+\cdots$ to obtain:

\displaystyle\displaystyle\mathbb{E}\left[\arcsin\left(\frac{1}{\rho(x^{c,i},x% ^{c^{\prime},j})}\right)\right]=\mathbb{E}\left[\frac{1}{\rho(x^{c,i},x^{c^{% \prime},j})}\right]+\mathbb{E}\left[\frac{1}{6\rho(x^{c,i},x^{c^{\prime},j})^{% 3}}\right]+\cdots.

(66)

Based on Assumption 1 of large enough $\displaystyle|\mu_{1}|,|\mu_{2}|$, we approximate the expectation with only the first term and let $\displaystyle\xi_{h.o.t}$ capture the effects of the higher order terms. Notice that since $\displaystyle\rho(x^{c,i},x^{c^{\prime},j})>1$ for finite $\displaystyle(x^{c,i},x^{c^{\prime},j})$, the effects of $\displaystyle\xi_{h.o.t}$ are finite but decay rapidly compared to the first term. Thus, we get:

\displaystyle\displaystyle\mathbb{E}\left[\arcsin\left(\frac{1}{\rho(x^{c,i},x% ^{c^{\prime},j})}\right)\right]=\mathbb{E}\left[\frac{1}{\rho(x^{c,i},x^{c^{% \prime},j})}\right]+\xi_{h.o.t}

(67)

Calculating the expectation $\displaystyle\mathbb{E}\left[\frac{1}{\rho(x^{c,i},x^{c^{\prime},j})}\right]$ can now be split based on $\displaystyle c,c^{\prime}$ .

$\displaystyle\bullet$ Case $\displaystyle c=c^{\prime},i=j$ :

$\displaystyle\displaystyle\rho(x^{c,i},x^{c,i})$	$\displaystyle\displaystyle=1+\frac{1}{2\sigma_{w}^{2}(x^{c,i})^{2}}$	(68)
$\displaystyle\displaystyle\implies\mathbb{E}\left[\rho(x^{c,i},x^{c,i})\right]$	$\displaystyle\displaystyle=1+\frac{1}{2\sigma_{w}^{2}}\mathbb{E}\left[\frac{1}% {(x^{c,i})^{2}}\right]$	(69)
	$\displaystyle\displaystyle=1+\frac{T(c)}{2\sigma_{w}^{2}}.$	(70)

The last equality is based on Lemma C.4 which gives the expanded version of $\displaystyle T(c)$ .

Finally, the value of $\displaystyle\mathbb{E}\left[\frac{1}{\rho(x^{c,i},x^{c,i})}\right]$ can be given as:

	$\displaystyle\displaystyle\mathbb{E}\left[\frac{1}{\rho(x^{c,i},x^{c,i})}\right]$	$\displaystyle\displaystyle=\frac{1}{\mathbb{E}\left[\rho(x^{c,i},x^{c,i})% \right]}+\frac{Var(\rho(x^{c,i},x^{c,i}))}{\mathbb{E}\left[\rho(x^{c,i},x^{c,i% })\right]^{3}}$		(71)
		$\displaystyle\displaystyle=\frac{1}{1+\frac{T(c)}{2\sigma_{w}^{2}}}+\delta_{h.% o.t}(\rho(x^{c,i},x^{c,i}))$		(72)

Notice that even in this simple case, the expressions are non-trivial to fully expand. Nonetheless, along with Assumption 1, we consider large enough $\displaystyle|\mu_{1}|,|\mu_{2}|$ such that:

\displaystyle\displaystyle\frac{T(c)}{2\sigma_{w}^{2}}=\frac{1}{2\sigma_{w}^{2% }}\left[\frac{1}{(\mu_{c}^{2}+\sigma_{c}^{2})}+\frac{2\sigma_{c}^{4}+4\sigma_{% c}^{2}\mu_{c}^{2}}{(\mu_{c}^{2}+\sigma_{c}^{2})^{3}}\right]<1.

(73)

Thus, based on the expansion $\displaystyle(1+u)^{-1}=1-u+u^{2}-u^{3}+\cdots$, we obtain the following cleaner approximation:

\displaystyle\displaystyle\mathbb{E}\left[\frac{1}{\rho(x^{c,i},x^{c,i})}% \right]=1-\frac{T(c)}{2\sigma_{w}^{2}}+\Delta_{h.o.t}^{(1)}(c).

(74)

Here $\displaystyle\Delta_{h.o.t}^{(1)}(c)$ captures all the higher order terms corresponding to $\displaystyle\left(\frac{T(c)}{2\sigma_{w}^{2}}\right)^{2}-\left(\frac{T(c)}{2% \sigma_{w}^{2}}\right)^{3}+\cdots$ and $\displaystyle\delta_{h.o.t}(\rho(x^{c,i},x^{c,i}))$ as denoted above.

$\displaystyle\bullet$ Case $\displaystyle c=c^{\prime},i\neq j$ :

In the case of $\displaystyle c=c^{\prime},i\neq j$, the expectations of the square roots do not have a closed form. To this end, we leverage Assumption 1 to obtain the following approximation:

$\displaystyle\displaystyle\rho(x^{c,i},x^{c,j})$	$\displaystyle\displaystyle=\sqrt{1+\frac{1}{2\sigma_{w}^{2}(x^{c,i})^{2}}}% \sqrt{1+\frac{1}{2\sigma_{w}^{2}(x^{c,j})^{2}}}$	(75)
	$\displaystyle\displaystyle=\left(1+\frac{1}{4\sigma_{w}^{2}(x^{c,i})^{2}}+h.o.% t\right)\left(1+\frac{1}{4\sigma_{w}^{2}(x^{c,j})^{2}}+h.o.t\right)$	(76)
$\displaystyle\displaystyle\implies\mathbb{E}\left[\rho(x^{c,i},x^{c,j})\right]$	$\displaystyle\displaystyle=\mathbb{E}\left[1+\frac{1}{4\sigma_{w}^{2}(x^{c,i})% ^{2}}+h.o.t\right]\mathbb{E}\left[1+\frac{1}{4\sigma_{w}^{2}(x^{c,j})^{2}}+h.o% .t\right]$	(77)

Observe that the inner terms in the expectations are scaled versions of the above case. To this end, we approximate $\displaystyle\mathbb{E}\left[\frac{1}{\rho(x^{c,i},x^{c,j})}\right]$ as:

	$\displaystyle\displaystyle\mathbb{E}\left[\frac{1}{\rho(x^{c,i},x^{c,j})}\right]$	$\displaystyle\displaystyle\approx\frac{1}{\left(1+\frac{T(c)}{4\sigma_{w}^{2}}% \right)^{2}}+\delta_{h.o.t}(\rho(x^{c,i},x^{c,j}))$		(78)
		$\displaystyle\displaystyle=\frac{1}{1+\frac{T(c)}{2\sigma_{w}^{2}}+\frac{T(c)^% {2}}{16\sigma_{w}^{4}}}+\delta_{h.o.t}(\rho(x^{c,i},x^{c,j}))$		(79)

Similar to the assumption that led to (74), we get:

\displaystyle\displaystyle\begin{split}\mathbb{E}\left[\frac{1}{\rho(x^{c,i},x% ^{c,j})}\right]&\approx 1-\frac{T(c)}{2\sigma_{w}^{2}}-\frac{T(c)^{2}}{16% \sigma_{w}^{4}}+\Delta_{h.o.t}^{(2)}(c).\end{split}

(80)

$\displaystyle\bullet$ Case $\displaystyle c\neq c^{\prime}$

A similar analysis as above applies in this case:

$\displaystyle\displaystyle\rho(x^{c,i},x^{c^{\prime},j})$	$\displaystyle\displaystyle=\sqrt{1+\frac{1}{2\sigma_{w}^{2}(x^{c,i})^{2}}}% \sqrt{1+\frac{1}{2\sigma_{w}^{2}(x^{c^{\prime},j})^{2}}}$	(81)
	$\displaystyle\displaystyle=\left(1+\frac{1}{4\sigma_{w}^{2}(x^{c,i})^{2}}+h.o.% t\right)\left(1+\frac{1}{4\sigma_{w}^{2}(x^{c^{\prime},j})^{2}}+h.o.t\right)$	(82)
$\displaystyle\displaystyle\implies\mathbb{E}\left[\rho(x^{c,i},x^{c^{\prime},j% })\right]$	$\displaystyle\displaystyle=\mathbb{E}\left[1+\frac{1}{4\sigma_{w}^{2}(x^{c,i})% ^{2}}+h.o.t\right]\mathbb{E}\left[1+\frac{1}{4\sigma_{w}^{2}(x^{c^{\prime},j})% ^{2}}+h.o.t\right]$	(83)

Observe that the inner terms in the expectations are similar to the above case. To this end, we approximate $\displaystyle\mathbb{E}\left[\frac{1}{\rho(x^{c,i},x^{c^{\prime},j})}\right]$ as:

	$\displaystyle\displaystyle\mathbb{E}\left[\frac{1}{\rho(x^{c,i},x^{c^{\prime},% j})}\right]$	$\displaystyle\displaystyle\approx\frac{1}{\left(1+\frac{T(c)}{4\sigma_{w}^{2}}% \right)\left(1+\frac{T(c^{\prime})}{4\sigma_{w}^{2}}\right)}+\delta_{h.o.t}(% \rho(x^{c,i},x^{c^{\prime},j}))$		(84)
		$\displaystyle\displaystyle=\frac{1}{1+\frac{T(c)+T(c^{\prime})}{4\sigma_{w}^{2% }}+\frac{T(c)T(c^{\prime})}{16\sigma_{w}^{4}}}+\delta_{h.o.t}(\rho(x^{c,i},x^{% c^{\prime},j}))$		(85)

Similar to the assumption that led to (74), we get:

\displaystyle\displaystyle\begin{split}\mathbb{E}\left[\frac{1}{\rho(x^{c,i},x% ^{c^{\prime},j})}\right]&\approx 1-\frac{T(c)+T(c^{\prime})}{4\sigma_{w}^{2}}-% \frac{T(c)T(c^{\prime})}{16\sigma_{w}^{4}}+\Delta^{(3)}_{h.o.t}(c,c^{\prime}).% \end{split}

(86)

Finally, based on (74), (80), (86), we obtain the following result for $\displaystyle\mathbb{E}[Q_{GP-Erf}^{(1)}(x^{c,i},x^{c^{\prime},j})]$:

\displaystyle\displaystyle\begin{split}&\mathbb{E}\left[Q_{GP-Erf}^{(1)}(x^{c,% i},x^{c^{\prime},j})\right]\\ &\hskip 40.0pt\approx\begin{cases}1-\frac{T(c)}{2\sigma_{w}^{2}}+\Delta_{h.o.t% }^{(1)}(c)&\text{if }c=c^{\prime},i=j\\ 1-\frac{T(c)}{2\sigma_{w}^{2}}-\frac{T(c)^{2}}{16\sigma_{w}^{4}}+\Delta_{h.o.t% }^{(2)}(c)&\text{if }c=c^{\prime},i\neq j\\ 1-\frac{T(c)+T(c^{\prime})}{4\sigma_{w}^{2}}-\frac{T(c)T(c^{\prime})}{16\sigma% _{w}^{4}}+\Delta_{h.o.t}^{(3)}(c,c^{\prime})&\text{if }c\neq c^{\prime}\\ \end{cases}.\end{split}

(87)

Here $\displaystyle\Delta_{h.o.t}^{(1)}(c),\Delta_{h.o.t}^{(2)}(c),\Delta_{h.o.t}^{(3)}(c,c^{\prime})$ are the collective higher order terms that tend to $\displaystyle 0$ as $\displaystyle|\mu_{c}|$ increases relative to smaller values of $\displaystyle\sigma_{c}$. These cases can now be plugged into our generic formulation of expected values of a kernel function (i.e., $\displaystyle V^{(1)}(c),V^{(2)}(c),V^{(3)}(c,c^{\prime})$) as per (27) in Appendix C. Thus, based on Lemma C.3, for sufficiently large $\displaystyle\{n_{c}\}$ we get:

\displaystyle\displaystyle\mathbb{E}[\mathcal{N}\mathcal{C}_{1}(\mathbf{H})]=% \frac{\sum_{c=1}^{2}\frac{n_{c}V^{(1)}(c)}{N}-\frac{V^{(2)}(c)}{2}}{\left[\sum% _{c=1}^{2}\left(\frac{1}{2n_{c}^{2}}-\frac{1}{N^{2}}\right)\left(n_{c}^{2}V^{(% 2)}(c)\right)\right]-\frac{2n_{1}n_{2}}{N^{2}}V^{(3)}(1,2)}+\Delta_{h.o.t}

(88)

$\displaystyle\bullet$ Numerator in the balanced class setting.

To better understand the result, let’s consider the balanced class scenario with $\displaystyle n_{1}=n_{2}=N/2$ , for which the numerator simplifies to:

	$\displaystyle\displaystyle\sum_{c=1}^{2}\frac{n_{c}V^{(1)}(c)}{N}-\frac{V^{(2)% }(c)}{2}$	$\displaystyle\displaystyle=\sum_{c=1}^{2}\frac{V^{(1)}(c)-V^{(2)}(c)}{2}$		(89)
		$\displaystyle\displaystyle=\sum_{c=1}^{2}\frac{\frac{T(c)^{2}}{16\sigma_{w}^{4% }}+\Delta^{(1)}_{h.o.t}(c)-\Delta^{(2)}_{h.o.t}(c)}{2}.$		(90)

If we were to ignore the effects of the higher order terms, then observe that the numerator primarily depends on $\displaystyle T(c)^{2}$ , which can be given based on Lemma C.4 as:

\displaystyle\displaystyle T(c)^{2}=\left[\frac{1}{(\mu_{c}^{2}+\sigma_{c}^{2}% )}+\frac{2\sigma_{c}^{4}+4\sigma_{c}^{2}\mu_{c}^{2}}{(\mu_{c}^{2}+\sigma_{c}^{% 2})^{3}}\right]^{2}

(91)

This showcases the role of $\displaystyle\mu_{c},\sigma_{c}$ in determining the extent of collapse. For sufficiently large $\displaystyle|\mu_{c}|\gg\sigma_{c}$, we can approximate this value as:

\displaystyle\displaystyle T(c)^{2}\approx\left[\frac{1}{\mu_{c}^{2}}+\frac{4% \sigma_{c}^{2}}{\mu_{c}^{4}}\right]^{2}=\frac{1}{\mu_{c}^{4}}\left[1+\frac{4% \sigma_{c}^{2}}{\mu_{c}^{2}}\right]^{2}=\frac{1}{\mu_{c}^{4}}\left[1+\frac{8% \sigma_{c}^{2}}{\mu_{c}^{2}}+\frac{16\sigma_{c}^{4}}{\mu_{c}^{4}}\right]

(92)
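
How quickly the large-mean approximation inside (92) becomes accurate can be gauged numerically. The sketch below compares the closed form of $T(c)$ from Lemma C.4 with $\frac{1}{\mu_{c}^{2}}+\frac{4\sigma_{c}^{2}}{\mu_{c}^{4}}$; the function names and the chosen values of $\mu_{c},\sigma_{c}$ are assumptions made for illustration.

```python
def T_exact(mu, sigma):
    """Closed form of T(c) from Lemma C.4, equation (36)."""
    m2 = mu ** 2 + sigma ** 2
    return 1 / m2 + (2 * sigma ** 4 + 4 * sigma ** 2 * mu ** 2) / m2 ** 3


def T_large_mean(mu, sigma):
    """Approximation used inside (92), intended for |mu| >> sigma."""
    return 1 / mu ** 2 + 4 * sigma ** 2 / mu ** 4


def relative_error(mu, sigma):
    exact = T_exact(mu, sigma)
    return abs(T_large_mean(mu, sigma) - exact) / exact


# Assumed settings: unit standard deviation, increasingly large mean.
err_mu_10 = relative_error(10.0, 1.0)
err_mu_100 = relative_error(100.0, 1.0)
```

With $\sigma_{c}=1$, the relative error is roughly one percent at $\mu_{c}=10$ and shrinks further at $\mu_{c}=100$.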

$\displaystyle\bullet$ Denominator in the balanced class setting.

Similar to the numerator analysis, observe that when $\displaystyle n_{1}=n_{2}=N/2$ , the denominator can be given as:

	$\displaystyle\displaystyle\left[\sum_{c=1}^{2}\left(\frac{1}{2n_{c}^{2}}-\frac% {1}{N^{2}}\right)\left(n_{c}^{2}V^{(2)}(c)\right)\right]-\frac{2n_{1}n_{2}}{N^% {2}}V^{(3)}(1,2)$		(93)
	$\displaystyle\displaystyle\hskip 10.0pt=\frac{V^{(2)}(1)+V^{(2)}(2)-2V^{(3)}(1% ,2)}{4}$		(94)
	$\displaystyle\displaystyle\hskip 10.0pt=\frac{-\frac{T(1)}{2\sigma_{w}^{2}}-% \frac{T(1)^{2}}{16\sigma_{w}^{4}}+\Delta_{h.o.t}^{(2)}(1)-\frac{T(2)}{2\sigma_% {w}^{2}}-\frac{T(2)^{2}}{16\sigma_{w}^{4}}+\Delta_{h.o.t}^{(2)}(2)}{4}$		(95)
	$\displaystyle\displaystyle\hskip 30.0pt+\frac{2\frac{T(1)+T(2)}{4\sigma_{w}^{2% }}+2\frac{T(1)T(2)}{16\sigma_{w}^{4}}-2\Delta_{h.o.t}^{(3)}(1,2)}{4}$		(96)
	$\displaystyle\displaystyle=\frac{-\frac{T(1)^{2}}{16\sigma_{w}^{4}}-\frac{T(2)% ^{2}}{16\sigma_{w}^{4}}+2\frac{T(1)T(2)}{16\sigma_{w}^{4}}+\Delta_{h.o.t}^{(2)% }(1)+\Delta_{h.o.t}^{(2)}(2)-2\Delta_{h.o.t}^{(3)}(1,2)}{4}$		(97)
	$\displaystyle\displaystyle=\frac{-\left(\frac{T(1)-T(2)}{4\sigma_{w}^{2}}% \right)^{2}+\Delta_{h.o.t}^{(2)}(1)+\Delta_{h.o.t}^{(2)}(2)-2\Delta_{h.o.t}^{(% 3)}(1,2)}{4}.$		(98)

Observe that the term $\displaystyle T(1)-T(2)$ represents:

\displaystyle\displaystyle T(1)-T(2)=\left[\frac{1}{(\mu_{1}^{2}+\sigma_{1}^{2% })}+\frac{2\sigma_{1}^{4}+4\sigma_{1}^{2}\mu_{1}^{2}}{(\mu_{1}^{2}+\sigma_{1}^% {2})^{3}}\right]-\left[\frac{1}{(\mu_{2}^{2}+\sigma_{2}^{2})}+\frac{2\sigma_{2% }^{4}+4\sigma_{2}^{2}\mu_{2}^{2}}{(\mu_{2}^{2}+\sigma_{2}^{2})^{3}}\right]

(99)

which, for sufficiently large $\displaystyle|\mu_{c}|\gg\sigma_{c}$, is approximately:

\displaystyle\displaystyle T(1)-T(2)\approx\frac{1}{\mu_{1}^{2}}+\frac{4\sigma% _{1}^{2}}{\mu_{1}^{4}}-\frac{1}{\mu_{2}^{2}}-\frac{4\sigma_{2}^{2}}{\mu_{2}^{4% }}.

(100)

E.2 NC1 of Limiting NTK with Erf activation

Recall from (13) that the recursive relationship between the NTK and NNGP can be given as follows:

\Theta_{NTK-Erf}^{(2)}(x^{c,i},x^{c^{\prime},j})=K_{GP-Erf}^{(2)}(x^{c,i},x^{c% ^{\prime},j})+K_{GP}^{(1)}(x^{c,i},x^{c^{\prime},j})\dot{Q}^{(1)}_{GP-Erf}(x^{% c,i},x^{c^{\prime},j}),

(101)

where:

$\displaystyle\displaystyle K_{GP-Erf}^{(2)}(x^{c,i},x^{c^{\prime},j})$	$\displaystyle\displaystyle=\sigma_{b}^{2}+\sigma_{w}^{2}Q_{GP-Erf}^{(1)}(x^{c,% i},x^{c^{\prime},j})$	(102)
$\displaystyle\displaystyle Q_{GP-Erf}^{(1)}(x^{c,i},x^{c^{\prime},j})$	$\displaystyle\displaystyle=\frac{2}{\pi}\arcsin\left(\frac{2K_{GP}^{(1)}(x^{c,% i},x^{c^{\prime},j})}{\sqrt{1+2K_{GP}^{(1)}(x^{c,i},x^{c,i})}\sqrt{1+2K_{GP}^{% (1)}(x^{c^{\prime},j},x^{c^{\prime},j})}}\right)$	(103)
$\displaystyle\displaystyle\begin{split}\dot{Q}_{GP-Erf}^{(1)}(x^{c,i},x^{c^{% \prime},j})&=\frac{4}{\pi}\left[\left(1+2K_{GP}^{(1)}(x^{c,i},x^{c,i})\right)% \left(1+2K_{GP}^{(1)}(x^{c^{\prime},j},x^{c^{\prime},j})\right)-\right.\\ &\left.\hskip 20.0pt\left(2K_{GP}^{(1)}(x^{c,i},x^{c^{\prime},j})\right)^{2}% \right]^{-1/2}\end{split}$		(104)

Considering $\displaystyle\sigma_{b}\to 0$ , $\displaystyle d_{0}=1$ (as per the setting and assumptions), we get:

\displaystyle\displaystyle K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime% },j})=\sigma_{w}^{2}x^{c,i}x^{c^{\prime},j}

(105)

\displaystyle\displaystyle Q_{GP-Erf}^{(1)}(x^{c,i},x^{c^{\prime},j})=\frac{2}% {\pi}\arcsin\left(\frac{2\sigma_{w}^{2}x^{c,i}x^{c^{\prime},j}}{\sqrt{1+2% \sigma_{w}^{2}(x^{c,i})^{2}}\sqrt{1+2\sigma_{w}^{2}(x^{c^{\prime},j})^{2}}}% \right).

(106)

\displaystyle\displaystyle\begin{split}\dot{Q}_{GP-Erf}^{(1)}(x^{c,i},x^{c^{% \prime},j})&=\frac{4}{\pi}\left(\left(1+2\sigma_{w}^{2}x^{c,i}x^{c,i}\right)% \left(1+2\sigma_{w}^{2}x^{c^{\prime},j}x^{c^{\prime},j}\right)-\left(2\sigma_{% w}^{2}x^{c,i}x^{c^{\prime},j}\right)^{2}\right)^{-1/2}\\ &=\frac{4}{\pi\sqrt{1+2\sigma_{w}^{2}\cdot(x^{c,i})^{2}+2\sigma_{w}^{2}\cdot(x% ^{c^{\prime},j})^{2}}}.\end{split}

(107)

This gives us:

$\displaystyle\displaystyle K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime% },j})\dot{Q}_{GP-Erf}^{(1)}(x^{c,i},x^{c^{\prime},j})$	$\displaystyle\displaystyle=\frac{4\sigma_{w}^{2}x^{c,i}x^{c^{\prime},j}}{\pi% \sqrt{1+2\sigma_{w}^{2}\cdot(x^{c,i})^{2}+2\sigma_{w}^{2}\cdot(x^{c^{\prime},j% })^{2}}}$	(108)
	$\displaystyle\displaystyle=\frac{4\sigma_{w}^{2}x^{c,i}x^{c^{\prime},j}}{\pi% \sigma_{w}\|x^{c,i}\|\|x^{c^{\prime},j}\|\sqrt{\frac{1}{\sigma_{w}^{2}(x^{c,i})^{2% }(x^{c^{\prime},j})^{2}}+\frac{2}{(x^{c^{\prime},j})^{2}}+\frac{2}{(x^{c,i})^{% 2}}}}$	(109)
	$\displaystyle\displaystyle=\frac{4\sigma_{w}\operatorname{sign}(x^{c,i})% \operatorname{sign}(x^{c^{\prime},j})}{\pi\sqrt{\frac{1}{\sigma_{w}^{2}(x^{c,i% })^{2}(x^{c^{\prime},j})^{2}}+\frac{2}{(x^{c^{\prime},j})^{2}}+\frac{2}{(x^{c,% i})^{2}}}}$	(110)

For notational simplicity, consider:

\displaystyle\displaystyle\kappa(x^{c,i},x^{c^{\prime},j})=\sqrt{\frac{1}{% \sigma_{w}^{2}(x^{c,i})^{2}(x^{c^{\prime},j})^{2}}+\frac{2}{(x^{c^{\prime},j})% ^{2}}+\frac{2}{(x^{c,i})^{2}}}

(111)

which simplifies the kernel formulation to:

\displaystyle\displaystyle\Theta_{NTK-Erf}^{(2)}(x^{c,i},x^{c^{\prime},j})=% \sigma_{w}^{2}Q_{GP-Erf}^{(1)}(x^{c,i},x^{c^{\prime},j})+\frac{4\sigma_{w}% \operatorname{sign}(x^{c,i})\operatorname{sign}(x^{c^{\prime},j})}{\pi\kappa(x% ^{c,i},x^{c^{\prime},j})}

(112)
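The simplification in Eqs. 108-112 can be checked numerically. The following is a minimal sketch, assuming $\displaystyle d_{0}=1$ , $\displaystyle\sigma_{b}\to 0$ , $\displaystyle\sigma_{w}>0$ and non-zero scalar inputs; the function names are illustrative and do not come from the paper's code.

```python
import math

def k_gp(x, y, sigma_w):
    """First-layer GP kernel for d_0 = 1 and sigma_b -> 0 (Eq. 105)."""
    return sigma_w**2 * x * y

def q_gp_erf(x, y, sigma_w):
    """NNGP kernel for the Erf activation (Eq. 106)."""
    num = 2 * k_gp(x, y, sigma_w)
    den = math.sqrt(1 + 2 * k_gp(x, x, sigma_w)) * math.sqrt(1 + 2 * k_gp(y, y, sigma_w))
    return (2 / math.pi) * math.asin(num / den)

def q_dot_erf(x, y, sigma_w):
    """Derivative kernel, first line of Eq. 107."""
    kxx, kyy, kxy = k_gp(x, x, sigma_w), k_gp(y, y, sigma_w), k_gp(x, y, sigma_w)
    return (4 / math.pi) / math.sqrt((1 + 2 * kxx) * (1 + 2 * kyy) - (2 * kxy) ** 2)

def kappa(x, y, sigma_w):
    """Helper term of Eq. 111 (requires x != 0 and y != 0)."""
    return math.sqrt(1 / (sigma_w**2 * x**2 * y**2) + 2 / y**2 + 2 / x**2)

def second_term_via_kappa(x, y, sigma_w):
    """Right-hand side of Eq. 110: 4 sigma_w sign(x) sign(y) / (pi kappa)."""
    sign = math.copysign(1.0, x) * math.copysign(1.0, y)
    return 4 * sigma_w * sign / (math.pi * kappa(x, y, sigma_w))

def ntk_erf(x, y, sigma_w):
    """Two-layer NTK for Erf in the form of Eq. 112."""
    return sigma_w**2 * q_gp_erf(x, y, sigma_w) + second_term_via_kappa(x, y, sigma_w)
```

Comparing `k_gp(x, y, s) * q_dot_erf(x, y, s)` with `second_term_via_kappa(x, y, s)` for a few inputs of either sign is a quick way to confirm that the sign factors in Eq. 110 are handled correctly.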

E.2.1 Calculating $\displaystyle\mathbb{E}\left[\Theta_{NTK-Erf}^{(2)}(x^{c,i},x^{c^{\prime},j})\right]$

Similar to the NNGP analysis, we break down the calculation of $\displaystyle\mathbb{E}\left[\kappa(x^{c,i},x^{c^{\prime},j})\right]$ into three cases.

$\displaystyle\bullet$ Case $\displaystyle c=c^{\prime},i=j$

$\displaystyle\displaystyle\kappa(x^{c,i},x^{c,i})$	$\displaystyle\displaystyle=\sqrt{\frac{1}{\sigma_{w}^{2}(x^{c,i})^{4}}+\frac{4% }{(x^{c,i})^{2}}}=\sqrt{1-\left(1-\frac{1}{\sigma_{w}^{2}(x^{c,i})^{4}}-\frac{% 4}{(x^{c,i})^{2}}\right)}$	(113)
	$\displaystyle\displaystyle=1-\frac{1}{2}\left(1-\frac{1}{\sigma_{w}^{2}(x^{c,i% })^{4}}-\frac{4}{(x^{c,i})^{2}}\right)+\xi_{h.o.t}$	(114)
	$\displaystyle\displaystyle=\frac{1}{2}+\frac{1}{2\sigma_{w}^{2}(x^{c,i})^{4}}+% \frac{2}{(x^{c,i})^{2}}+\xi_{h.o.t}$	(115)

This gives us:

\displaystyle\displaystyle\begin{split}\mathbb{E}\left[\kappa(x^{c,i},x^{c,i})% \right]&=\frac{1}{2}+\mathbb{E}\left[\frac{1}{2\sigma_{w}^{2}(x^{c,i})^{4}}% \right]+\mathbb{E}\left[\frac{2}{(x^{c,i})^{2}}\right]+\mathbb{E}[\xi_{h.o.t}]% \\ &=\frac{1}{2}+\frac{1}{2\sigma_{w}^{2}}\left[\frac{1}{\mathbb{E}\left[(x^{c,i}% )^{4}\right]}+\frac{Var((x^{c,i})^{4})}{\mathbb{E}\left[(x^{c,i})^{4}\right]^{% 3}}\right]+2\left[\frac{1}{\mathbb{E}\left[(x^{c,i})^{2}\right]}+\frac{Var((x^% {c,i})^{2})}{\mathbb{E}\left[(x^{c,i})^{2}\right]^{3}}\right]\\ &\hskip 20.0pt+\mathbb{E}[\xi_{h.o.t}]\end{split}

(116)
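The bracketed terms in the second line of Eq. 116 have the form of a second-order Taylor (delta-method) approximation of the expectation of a reciprocal. A brief sketch of where that form comes from, assuming a positive random variable concentrated around its mean (the symbols $\displaystyle X$ and $\displaystyle m=\mathbb{E}[X]$ are introduced here only for illustration):

```latex
\frac{1}{X} = \frac{1}{m} - \frac{X-m}{m^{2}} + \frac{(X-m)^{2}}{m^{3}} - \cdots
\quad\Longrightarrow\quad
\mathbb{E}\left[\frac{1}{X}\right] \approx \frac{1}{m} + \frac{Var(X)}{m^{3}}
```

Setting $\displaystyle X=(x^{c,i})^{4}$ and $\displaystyle X=(x^{c,i})^{2}$ gives the two bracketed terms. The truncated remainder is small only when $\displaystyle Var(X)\ll m^{2}$ , which is the regime one would expect for large $\displaystyle|\mu_{c}|$ relative to $\displaystyle\sigma_{c}$ .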

Based on the results from the moment-generating function, we know that:

\displaystyle\displaystyle\mathbb{E}[(x^{c,i})^{4}]=3\sigma_{c}^{4}+6\sigma_{c% }^{2}\mu_{c}^{2}+\mu_{c}^{4},

(117)

which can be used along with Lemma C.4 to obtain:

\displaystyle\displaystyle\begin{split}\mathbb{E}\left[\kappa(x^{c,i},x^{c,i})% \right]&=\frac{1}{2}+2T(c)+\mathbb{E}[\xi_{h.o.t}]\end{split}

(118)
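The fourth moment in Eq. 117 can be reproduced without the moment-generating function by using the standard recursion for raw Gaussian moments. The sketch below is illustrative only; the function name is an assumption and not part of the paper.

```python
def gaussian_raw_moment(n, mu, sigma):
    """E[x^n] for x ~ N(mu, sigma^2), n >= 0.

    Uses the recursion E[x^k] = mu * E[x^(k-1)] + (k-1) * sigma^2 * E[x^(k-2)].
    """
    prev, curr = 1.0, mu  # E[x^0] and E[x^1]
    if n == 0:
        return prev
    for k in range(2, n + 1):
        prev, curr = curr, mu * curr + (k - 1) * sigma**2 * prev
    return curr

# Example inputs (arbitrary values chosen for illustration).
mu_c, sigma_c = 2.0, 0.5
fourth_moment = gaussian_raw_moment(4, mu_c, sigma_c)
second_moment = gaussian_raw_moment(2, mu_c, sigma_c)
# Variance of the squared sample, as needed in Eq. 116.
var_of_square = fourth_moment - second_moment**2
```

For these inputs, `fourth_moment` agrees with $\displaystyle 3\sigma_{c}^{4}+6\sigma_{c}^{2}\mu_{c}^{2}+\mu_{c}^{4}$ and `var_of_square` agrees with $\displaystyle 2\sigma_{c}^{4}+4\sigma_{c}^{2}\mu_{c}^{2}$ .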

For notational simplicity, we define a helper function as follows:

\displaystyle\displaystyle\begin{split}S(\mu_{c},\sigma_{c})&=-\frac{1}{2}+2T(% c)+\mathbb{E}[\xi_{h.o.t}],\end{split}

(119)

which gives us:

\displaystyle\displaystyle\mathbb{E}\left[\kappa(x^{c,i},x^{c,i})\right]=1+S(% \mu_{c},\sigma_{c})

(120)

Finally, the value of $\displaystyle\mathbb{E}\left[\frac{1}{\kappa(x^{c,i},x^{c,i})}\right]$ can be given as:

	$\displaystyle\displaystyle\mathbb{E}\left[\frac{1}{\kappa(x^{c,i},x^{c,i})}\right]$	$\displaystyle\displaystyle=\frac{1}{\mathbb{E}\left[\kappa(x^{c,i},x^{c,i})% \right]}+\frac{Var(\kappa(x^{c,i},x^{c,i}))}{\mathbb{E}\left[\kappa(x^{c,i},x^% {c,i})\right]^{3}}$		(121)
		$\displaystyle\displaystyle=\frac{1}{1+S(\mu_{c},\sigma_{c})}+\delta_{h.o.t}(% \kappa(x^{c,i},x^{c,i}))$		(122)

Notice that even in this simple case, the expressions are non-trivial to fully expand. Nonetheless, along with Assumption 1, we consider large enough $\displaystyle|\mu_{1}|,|\mu_{2}|$ such that:

\displaystyle\displaystyle S(\mu_{c},\sigma_{c})<1.

(123)

Thus, based on the expansion $\displaystyle(1+u)^{-1}=1-u+u^{2}-u^{3}+\cdots$ (valid for $\displaystyle|u|<1$ ), we obtain the following cleaner approximation:

\displaystyle\displaystyle\mathbb{E}\left[\frac{1}{\kappa(x^{c,i},x^{c,i})}% \right]=1-S(\mu_{c},\sigma_{c})+\widetilde{\delta}_{h.o.t}(\kappa(x^{c,i},x^{c% ,i}))

(124)

$\displaystyle\bullet$ Case $\displaystyle c=c^{\prime},i\neq j$ :

$\displaystyle\displaystyle\kappa(x^{c,i},x^{c,j})$	$\displaystyle\displaystyle=\sqrt{\frac{1}{\sigma_{w}^{2}(x^{c,i})^{2}(x^{c,j})% ^{2}}+\frac{2}{(x^{c,j})^{2}}+\frac{2}{(x^{c,i})^{2}}}$	(125)
	$\displaystyle\displaystyle=\sqrt{1-\left(1-\frac{1}{\sigma_{w}^{2}(x^{c,i})^{2% }(x^{c,j})^{2}}-\frac{2}{(x^{c,j})^{2}}-\frac{2}{(x^{c,i})^{2}}\right)}$	(126)
	$\displaystyle\displaystyle=1-\frac{1}{2}\left(1-\frac{1}{\sigma_{w}^{2}(x^{c,i% })^{2}(x^{c,j})^{2}}-\frac{2}{(x^{c,j})^{2}}-\frac{2}{(x^{c,i})^{2}}\right)+% \xi^{\prime}_{h.o.t}$	(127)
	$\displaystyle\displaystyle=\frac{1}{2}+\frac{1}{2\sigma_{w}^{2}(x^{c,i})^{2}(x% ^{c,j})^{2}}+\frac{1}{(x^{c,j})^{2}}+\frac{1}{(x^{c,i})^{2}}+\xi^{\prime}_{h.o% .t}$	(128)

Thus, based on Lemma C.4, we get:

$\displaystyle\displaystyle\mathbb{E}\left[\kappa(x^{c,i},x^{c,j})\right]$	$\displaystyle\displaystyle=\frac{1}{2}+\mathbb{E}\left[\frac{1}{2\sigma_{w}^{2% }(x^{c,i})^{2}(x^{c,j})^{2}}\right]+\mathbb{E}\left[\frac{1}{(x^{c,j})^{2}}% \right]+\mathbb{E}\left[\frac{1}{(x^{c,i})^{2}}\right]+\mathbb{E}[\xi^{\prime}% _{h.o.t}]$	(129)
	$\displaystyle\displaystyle=\frac{1}{2}+\frac{1}{2\sigma_{w}^{2}}\mathbb{E}% \left[\frac{1}{(x^{c,i})^{2}}\right]\mathbb{E}\left[\frac{1}{(x^{c,j})^{2}}% \right]+\mathbb{E}\left[\frac{1}{(x^{c,j})^{2}}\right]+\mathbb{E}\left[\frac{1% }{(x^{c,i})^{2}}\right]+\mathbb{E}[\xi^{\prime}_{h.o.t}]$	(130)
	$\displaystyle\displaystyle=\frac{1}{2}+\frac{T(c)^{2}}{2\sigma_{w}^{2}}+2T(c)+% \mathbb{E}[\xi^{\prime}_{h.o.t}]$	(131)

This leads to:

$\displaystyle\displaystyle\mathbb{E}\left[\frac{1}{\kappa(x^{c,i},x^{c,j})}\right]$	$\displaystyle\displaystyle=\mathbb{E}\left[\frac{1}{1+\left(-\frac{1}{2}+\frac% {T(c)^{2}}{2\sigma_{w}^{2}}+2T(c)+\mathbb{E}[\xi^{\prime}_{h.o.t}]\right)}\right]$	(132)
	$\displaystyle\displaystyle=1-\left(-\frac{1}{2}+\frac{T(c)^{2}}{2\sigma_{w}^{2% }}+2T(c)+\mathbb{E}[\xi^{\prime}_{h.o.t}]\right)+\delta_{h.o.t}^{\prime}(% \kappa(x^{c,i},x^{c,j}))$	(133)
	$\displaystyle\displaystyle=\frac{3}{2}-\frac{T(c)^{2}}{2\sigma_{w}^{2}}-2T(c)+% \widetilde{\delta}_{h.o.t}(\kappa(x^{c,i},x^{c,j}))$	(134)

$\displaystyle\bullet$ Case $\displaystyle c\neq c^{\prime}$ :

	$\displaystyle\displaystyle\kappa(x^{c,i},x^{c^{\prime},j})$	$\displaystyle\displaystyle=\sqrt{\frac{1}{\sigma_{w}^{2}(x^{c,i})^{2}(x^{c^{% \prime},j})^{2}}+\frac{2}{(x^{c^{\prime},j})^{2}}+\frac{2}{(x^{c,i})^{2}}}$		(135)
		$\displaystyle\displaystyle=\frac{1}{2}+\frac{1}{2\sigma_{w}^{2}(x^{c,i})^{2}(x% ^{c^{\prime},j})^{2}}+\frac{1}{(x^{c^{\prime},j})^{2}}+\frac{1}{(x^{c,i})^{2}}% +\xi^{\prime\prime}_{h.o.t}$		(136)

Thus, based on Lemma C.4, we get:

$\displaystyle\displaystyle\mathbb{E}\left[\kappa(x^{c,i},x^{c^{\prime},j})\right]$	$\displaystyle\displaystyle=\frac{1}{2}+\mathbb{E}\left[\frac{1}{2\sigma_{w}^{2% }(x^{c,i})^{2}(x^{c^{\prime},j})^{2}}\right]+\mathbb{E}\left[\frac{1}{(x^{c^{% \prime},j})^{2}}\right]+\mathbb{E}\left[\frac{1}{(x^{c,i})^{2}}\right]+\mathbb% {E}[\xi^{\prime\prime}_{h.o.t}]$	(137)
	$\displaystyle\displaystyle=\frac{1}{2}+\frac{1}{2\sigma_{w}^{2}}\mathbb{E}% \left[\frac{1}{(x^{c,i})^{2}}\right]\mathbb{E}\left[\frac{1}{(x^{c^{\prime},j}% )^{2}}\right]+\mathbb{E}\left[\frac{1}{(x^{c^{\prime},j})^{2}}\right]+\mathbb{% E}\left[\frac{1}{(x^{c,i})^{2}}\right]+\mathbb{E}[\xi^{\prime\prime}_{h.o.t}]$	(138)
	$\displaystyle\displaystyle=\frac{1}{2}+\frac{T(c)T(c^{\prime})}{2\sigma_{w}^{2% }}+T(c^{\prime})+T(c)+\mathbb{E}[\xi^{\prime\prime}_{h.o.t}].$	(139)

This gives us:

$\displaystyle\displaystyle\mathbb{E}\left[\frac{1}{\kappa(x^{c,i},x^{c^{\prime% },j})}\right]$	$\displaystyle\displaystyle=\mathbb{E}\left[\frac{1}{1+\left(-\frac{1}{2}+\frac% {T(c)T(c^{\prime})}{2\sigma_{w}^{2}}+T(c^{\prime})+T(c)+\mathbb{E}[\xi^{\prime% \prime}_{h.o.t}]\right)}\right]$	(140)
	$\displaystyle\displaystyle=1-\left(-\frac{1}{2}+\frac{T(c)T(c^{\prime})}{2\sigma_{w}^{2}}+T(c^{\prime})+T(c)+\mathbb{E}[\xi^{\prime\prime}_{h.o.t}]\right)+\delta_{h.o.t}^{\prime}(\kappa(x^{c,i},x^{c^{\prime},j}))$	(141)
	$\displaystyle\displaystyle=\frac{3}{2}-\frac{T(c)T(c^{\prime})}{2\sigma_{w}^{2% }}-T(c)-T(c^{\prime})+\widetilde{\delta}_{h.o.t}(\kappa(x^{c,i},x^{c^{\prime},% j}))$	(142)

Finally, the cases for the expected value of the kernel can be given as:

\displaystyle\mathbb{E}\left[\Theta_{NTK-Erf}^{(2)}(x^{c,i},x^{c^{\prime},j})\right]=\begin{cases}\mathbb{E}\left[\sigma_{w}^{2}Q_{GP-Erf}^{(1)}(x^{c,i},x^{c,j})\right]+\mathbb{E}\left[\frac{4\sigma_{w}}{\pi\kappa(x^{c,i},x^{c,j})}\right]&c=c^{\prime}\\ \mathbb{E}\left[\sigma_{w}^{2}Q_{GP-Erf}^{(1)}(x^{c,i},x^{c^{\prime},j})\right]-\mathbb{E}\left[\frac{4\sigma_{w}}{\pi\kappa(x^{c,i},x^{c^{\prime},j})}\right]&c\neq c^{\prime}\end{cases}.

(143)

From (87), we know that:

\displaystyle\displaystyle\begin{split}&\mathbb{E}\left[Q_{GP-Erf}^{(1)}(x^{c,% i},x^{c^{\prime},j})\right]\\ &\hskip 40.0pt\approx\begin{cases}1-\frac{T(c)}{2\sigma_{w}^{2}}+\Delta_{h.o.t% }^{(1)}(c)&\text{if }c=c^{\prime},i=j\\ 1-\frac{T(c)}{2\sigma_{w}^{2}}-\frac{T(c)^{2}}{16\sigma_{w}^{4}}+\Delta_{h.o.t% }^{(2)}(c)&\text{if }c=c^{\prime},i\neq j\\ 1-\frac{T(c)+T(c^{\prime})}{4\sigma_{w}^{2}}-\frac{T(c)T(c^{\prime})}{16\sigma% _{w}^{4}}+\Delta_{h.o.t}^{(3)}(c,c^{\prime})&\text{if }c\neq c^{\prime}\\ \end{cases}.\end{split}

(144)

To simplify the presentation, we can ignore the higher-order terms and obtain:

	$\displaystyle\displaystyle\mathbb{E}\left[\Theta_{NTK-Erf}^{(2)}(x^{c,i},x^{c^% {\prime},j})\right]$		(145)
	$\displaystyle\displaystyle\hskip 10.0pt\approx\begin{cases}\sigma_{w}^{2}\left(1-\frac{T(c)}{2\sigma_{w}^{2}}\right)+\frac{4\sigma_{w}}{\pi}\left(\frac{3}{2}-2T(c)\right),&c=c^{\prime},i=j\\ \sigma_{w}^{2}\left(1-\frac{T(c)}{2\sigma_{w}^{2}}-\frac{T(c)^{2}}{16\sigma_{w}^{4}}\right)+\frac{4\sigma_{w}}{\pi}\left(\frac{3}{2}-\frac{T(c)^{2}}{2\sigma_{w}^{2}}-2T(c)\right),&c=c^{\prime},i\neq j\\ \sigma_{w}^{2}\left(1-\frac{T(c)+T(c^{\prime})}{4\sigma_{w}^{2}}-\frac{T(c)T(c^{\prime})}{16\sigma_{w}^{4}}\right)-\frac{4\sigma_{w}}{\pi}\left(\frac{3}{2}-\frac{T(c)T(c^{\prime})}{2\sigma_{w}^{2}}-T(c)-T(c^{\prime})\right),&c\neq c^{\prime}\\ \end{cases}$		(146)

Observe that the order of the $\displaystyle T(c)$ terms involved here resembles that of the NNGP scenario in (87). Thus, we can draw similar conclusions regarding the role of the order of $\displaystyle\mu_{c},\sigma_{c}$ in determining the value of $\displaystyle\mathbb{E}\left[\mathcal{N}\mathcal{C}_{1}(\mathbf{H})\right]$ .

Appendix F Activation Variability Relative to Data

In this section, we introduce a relative measure of activation variability collapse with respect to the data. We begin by defining the within-class and between-class data covariance matrices $\displaystyle\boldsymbol{\Sigma}_{W}(\mathbf{X}),\boldsymbol{\Sigma}_{B}(\mathbf{X})\in\mathbb{R}^{d_{0}\times d_{0}}$ for the data samples as:

\displaystyle\boldsymbol{\Sigma}_{W}(\mathbf{X})=\frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{n_{c}}\left(\mathbf{x}^{c,i}-\overline{\mathbf{x}}^{c}\right)\left(\mathbf{x}^{c,i}-\overline{\mathbf{x}}^{c}\right)^{\top};\hskip 10.0pt\boldsymbol{\Sigma}_{B}(\mathbf{X})=\frac{1}{C}\sum_{c=1}^{C}\left(\overline{\mathbf{x}}^{c}-\overline{\mathbf{x}}^{G}\right)\left(\overline{\mathbf{x}}^{c}-\overline{\mathbf{x}}^{G}\right)^{\top},

(147)

where $\displaystyle\overline{\mathbf{x}}^{c}=\frac{1}{n_{c}}\sum\nolimits_{i=1}^{n_{% c}}\mathbf{x}^{c,i},\forall c\in[C]$ and $\displaystyle\overline{\mathbf{x}}^{G}=\frac{1}{N}\sum\nolimits_{c=1}^{C}\sum% \nolimits_{i=1}^{n_{c}}\mathbf{x}^{c,i}$ represent the data class mean vectors and the data global mean vector respectively.

Definition F.1.

Set a small $\displaystyle\tau>0$ . The variability collapse relative to the data is given by:

\displaystyle\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H}|\mathbf{X}):=% \frac{\mathcal{N}\mathcal{C}_{1}(\mathbf{H})}{\mathcal{N}\mathcal{C}_{1}(% \mathbf{X})+\tau},\hskip 10.0pt\textit{where }\mathcal{N}\mathcal{C}_{1}(% \mathbf{X}):=\frac{\mathrm{tr}(\boldsymbol{\Sigma}_{W}(\mathbf{X}))}{\mathrm{% tr}(\boldsymbol{\Sigma}_{B}(\mathbf{X}))}

(148)

The constant $\displaystyle\tau$ prevents numerical instabilities. Through this approach, we capture the extent of variability collapse of the activation features relative to the variability collapse of the data samples themselves.
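Definition F.1 can be sketched in a few lines of plain Python. This is a minimal illustration, assuming that $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ is the same trace ratio applied to the feature vectors; the function names and the nested-list data layout are assumptions made for this example.

```python
def nc1_traces(classes):
    """Return (tr Sigma_W, tr Sigma_B) as in Eq. 147.

    `classes` is a list with one entry per class; each entry is a list of
    equal-length vectors (lists of floats).
    """
    dim = len(classes[0][0])
    n_total = sum(len(samples) for samples in classes)
    class_means = [
        [sum(x[k] for x in samples) / len(samples) for k in range(dim)]
        for samples in classes
    ]
    global_mean = [
        sum(x[k] for samples in classes for x in samples) / n_total
        for k in range(dim)
    ]
    # Trace of the within-class covariance: mean squared distance to class mean.
    tr_w = sum(
        sum((x[k] - mean[k]) ** 2 for k in range(dim))
        for samples, mean in zip(classes, class_means)
        for x in samples
    ) / n_total
    # Trace of the between-class covariance: class means around the global mean.
    tr_b = sum(
        sum((mean[k] - global_mean[k]) ** 2 for k in range(dim))
        for mean in class_means
    ) / len(classes)
    return tr_w, tr_b

def nc1(classes):
    """Trace ratio tr(Sigma_W) / tr(Sigma_B)."""
    tr_w, tr_b = nc1_traces(classes)
    return tr_w / tr_b

def relative_nc1(features, data, tau=1e-8):
    """NC1(H | X) from Eq. 148; tau guards against a vanishing NC1(X)."""
    return nc1(features) / (nc1(data) + tau)
```

When the features are identical to the data and $\displaystyle\tau=0$ , `relative_nc1` returns exactly 1, so values above 1 indicate that the raw data is more collapsed than the activations.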

Corollary F.2.

Under Assumptions 1-3 (as per Section 5.3), let $\displaystyle\phi(\cdot)$ be the ReLU activation, and the limiting NNGP kernel be $\displaystyle Q_{GP-ReLU}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})=% \mathbf{h}^{c,i\top}\mathbf{h}^{c^{\prime},j}$ , then:

\displaystyle\displaystyle\frac{\mathbb{E}\left[\mathcal{N}\mathcal{C}_{1}(% \mathbf{H})\right]}{\mathbb{E}\left[\mathcal{N}\mathcal{C}_{1}(\mathbf{X})% \right]}\approx 1-\frac{\frac{2}{N^{2}}\prod_{c=1}^{2}n_{c}\mu_{c}}{\left(\sum% _{c=1}^{2}\frac{\mu_{c}^{2}}{2}-\frac{n_{c}^{2}\mu_{c}^{2}}{N^{2}}\right)}

(149)

Proof.

To keep the derivation consistent with the kernel formulation in Equation 45, we consider a simplified kernel on $\displaystyle\mathbf{X}$ (identity feature map):

\displaystyle\displaystyle K_{data}(x^{c,i},x^{c^{\prime},j})=x^{c,i}x^{c^{% \prime},j}.

(150)

Additionally, since $\displaystyle\mathbf{x}^{c,i}$ are 1-d random variables, the expected value of the kernel is given by:

\displaystyle\displaystyle\mathbb{E}\left[K_{data}(x^{c,i},x^{c^{\prime},j})% \right]=\begin{cases}\sigma_{c}^{2}+\mu_{c}^{2}&\text{if }c=c^{\prime},i=j\\ \mu_{c}^{2}&\text{if }c=c^{\prime},i\neq j\\ \mu_{c}\mu_{c^{\prime}}&\text{if }c\neq c^{\prime}\\ \end{cases}

(151)

We use Lemma C.3 with cases $\displaystyle V^{(1)}(c)=\sigma_{c}^{2}+\mu_{c}^{2}$ , $\displaystyle V^{(2)}(c)=\mu_{c}^{2}$ and $\displaystyle V^{(3)}(c,c^{\prime})=\mu_{c}\mu_{c^{\prime}}$ to obtain:

\displaystyle\displaystyle\begin{split}\mathbb{E}\left[\mathcal{N}\mathcal{C}_% {1}(\mathbf{X})\right]&=\frac{\mathbb{E}\left[\mathrm{tr}(\Sigma_{W}(\mathbf{X% }))\right]}{\mathbb{E}\left[\mathrm{tr}(\Sigma_{B}(\mathbf{X}))\right]}=\frac{% \sum_{c=1}^{2}\frac{n_{c}\mu_{c}^{2}+n_{c}\sigma_{c}^{2}}{N}-\frac{n_{c}^{2}% \mu_{c}^{2}+n_{c}\sigma_{c}^{2}}{2n_{c}^{2}}}{\left(\sum_{c=1}^{2}\frac{n_{c}^% {2}\mu_{c}^{2}+n_{c}\sigma_{c}^{2}}{2n_{c}^{2}}-\frac{n_{c}^{2}\mu_{c}^{2}+n_{% c}\sigma_{c}^{2}}{N^{2}}\right)-\frac{2}{N^{2}}\prod_{c=1}^{2}n_{c}\mu_{c}}+% \Delta^{X}_{h.o.t}\\ \end{split}

(152)

Finally, the ratio $\displaystyle\frac{\mathbb{E}\left[\mathcal{N}\mathcal{C}_{1}(\mathbf{H})% \right]}{\mathbb{E}\left[\mathcal{N}\mathcal{C}_{1}(\mathbf{X})\right]}$ for ReLU (Theorem 5.1) with large enough $\displaystyle n_{c}\gg 1$ is given by:

$\displaystyle\displaystyle\frac{\mathbb{E}\left[\mathcal{N}\mathcal{C}_{1}(\mathbf{H})\right]}{\mathbb{E}\left[\mathcal{N}\mathcal{C}_{1}(\mathbf{X})\right]}$	$\displaystyle\displaystyle=\frac{\sum_{c=1}^{2}\frac{n_{c}\mu_{c}^{2}+n_{c}\sigma_{c}^{2}}{N}-\frac{\mu_{c}^{2}}{2}}{\left(\sum_{c=1}^{2}\frac{\mu_{c}^{2}}{2}-\frac{n_{c}^{2}\mu_{c}^{2}}{N^{2}}\right)}\cdot\frac{\left(\sum_{c=1}^{2}\frac{\mu_{c}^{2}}{2}-\frac{n_{c}^{2}\mu_{c}^{2}}{N^{2}}\right)-\frac{2}{N^{2}}\prod_{c=1}^{2}n_{c}\mu_{c}}{\sum_{c=1}^{2}\frac{n_{c}\mu_{c}^{2}+n_{c}\sigma_{c}^{2}}{N}-\frac{\mu_{c}^{2}}{2}}+\Delta^{\prime}_{h.o.t}$	(153)
	$\displaystyle\displaystyle=\frac{\left(\sum_{c=1}^{2}\frac{\mu_{c}^{2}}{2}-% \frac{n_{c}^{2}\mu_{c}^{2}}{N^{2}}\right)-\frac{2}{N^{2}}\prod_{c=1}^{2}n_{c}% \mu_{c}}{\left(\sum_{c=1}^{2}\frac{\mu_{c}^{2}}{2}-\frac{n_{c}^{2}\mu_{c}^{2}}% {N^{2}}\right)}+\Delta^{\prime}_{h.o.t}$	(154)
	$\displaystyle\displaystyle=1-\frac{\frac{2}{N^{2}}\prod_{c=1}^{2}n_{c}\mu_{c}}% {\left(\sum_{c=1}^{2}\frac{\mu_{c}^{2}}{2}-\frac{n_{c}^{2}\mu_{c}^{2}}{N^{2}}% \right)}+\Delta^{\prime}_{h.o.t}$	(155)

∎

To better understand the result, let us consider the balanced class scenario where $\displaystyle n_{1}=n_{2}=n=N/2$ . This results in a ratio of $\displaystyle\approx 1-(2\mu_{1}\mu_{2})/(\mu_{1}^{2}+\mu_{2}^{2})$ . Furthermore, if $\displaystyle|\mu_{1}|=|\mu_{2}|$ with opposite signs (i.e., $\displaystyle\mu_{1}=-\mu_{2}$ ), then the ratio is $\displaystyle\approx 2$ . This highlights how class balance and the expected class means jointly determine the relative variability collapse.
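The leading-order ratio of Eq. 149 is easy to evaluate for arbitrary class sizes and means. The sketch below ignores the higher-order term $\displaystyle\Delta^{\prime}_{h.o.t}$ and reads the sum in the denominator as running over both terms inside the parentheses; the function name is illustrative.

```python
def relative_collapse_ratio(n, mu):
    """Leading-order E[NC1(H)] / E[NC1(X)] from Eq. 149 for two classes.

    n  = (n_1, n_2)   class sizes
    mu = (mu_1, mu_2) class means of the 1-d inputs
    """
    total = n[0] + n[1]
    # (2 / N^2) * prod_c n_c * mu_c
    cross = 2.0 * (n[0] * mu[0]) * (n[1] * mu[1]) / total**2
    # sum_c ( mu_c^2 / 2 - n_c^2 * mu_c^2 / N^2 )
    denom = sum(m**2 / 2 - (k**2 * m**2) / total**2 for k, m in zip(n, mu))
    return 1.0 - cross / denom

# Balanced classes with symmetric means of opposite sign.
balanced_symmetric = relative_collapse_ratio((64, 64), (-3.0, 3.0))
# Balanced classes with same-sign means.
balanced_same_sign = relative_collapse_ratio((64, 64), (2.0, 4.0))
# Imbalanced classes with the same symmetric means, for comparison.
imbalanced = relative_collapse_ratio((96, 32), (-3.0, 3.0))
```

In the balanced case the function reduces to $\displaystyle 1-(2\mu_{1}\mu_{2})/(\mu_{1}^{2}+\mu_{2}^{2})$ , so `balanced_symmetric` evaluates to 2, while the imbalanced call shows how the ratio shifts once $\displaystyle n_{1}\neq n_{2}$ .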

$\displaystyle\bullet$ Addressing misleading $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ values: Consider the case where $\displaystyle\sigma_{1},\sigma_{2}\to 0$ . Then Theorem 5.1 for $\displaystyle Q_{GP-ReLU}$ indicates that $\displaystyle\mathbb{E}\left[\mathcal{N}\mathcal{C}_{1}(\mathbf{H})\right]\to 0$ (up to small fluctuations from $\displaystyle\Delta_{h.o.t}$ ) in the balanced class setting. Such an observation can be misleading if one ignores $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{X})$ . For instance, when training deep neural networks, a small $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ alone cannot distinguish between a network that learned meaningful features to classify a complex dataset and one that simply leveraged already collapsed data vectors. This applies to the Erf activation as well. We justify this argument with the following experiment. For a sample size $\displaystyle N$ chosen from $\displaystyle\{128,256,512,1024\}$ and an input dimension $\displaystyle d_{0}$ chosen from $\displaystyle\{1,2,8,32,128\}$ , we sample the vectors $\displaystyle\mathbf{x}^{1,i}\sim\mathcal{N}(-10*\mathbf{1}_{d_{0}},\mathbf{I}_{d_{0}}),i\in[N/2]$ for class $\displaystyle 1$ and $\displaystyle\mathbf{x}^{2,j}\sim\mathcal{N}(10*\mathbf{1}_{d_{0}},\mathbf{I}_{d_{0}}),j\in[N/2]$ for class $\displaystyle 2$ as our dataset. From Figures 5(a) and 5(b), observe that the $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H}|\mathbf{X})$ values for $\displaystyle Q_{GP-Erf}$ can be orders of magnitude larger than $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ , and that $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H}|\mathbf{X})>1$ in high dimensions. Essentially, the raw data is ‘more’ collapsed than the activations in these settings. Similar observations can be made for the NTK $\displaystyle\Theta_{NTK-Erf}$ in Figures 5(c) and 5(d).

Appendix G Numerical solutions of EoS

We solve the EoS using the Newton-Krylov method with an annealing schedule (as originally proposed by [40]) using the scipy.optimize.newton_krylov Python API. We initialize $\displaystyle\mathbf{C}$ with the GP limit value of $\displaystyle(\sigma_{w}^{2}/d_{0})\mathbf{I}_{d_{0}}$ and choose a large annealing factor (e.g., $\displaystyle 10^{5}$ ) as the value for $\displaystyle d_{1}$ . The result of optimizing with newton_krylov is a new $\displaystyle\mathbf{C}$ , which, together with a lower annealing factor, is used as the input for the next newton_krylov function call. This loop is repeated until the end of the annealing schedule. For instance, to analyze the EoS corresponding to $\displaystyle d_{1}=500$ , we choose the following list of step-wise annealing factors:

\displaystyle\displaystyle\texttt{factors}=[\underbrace{10^{5},9*10^{4},\cdots% ,2*10^{4}}_{\texttt{step}=-10^{4}},\underbrace{10^{4},9*10^{3},\cdots,2*10^{3}% }_{\texttt{step}=-10^{3}},\underbrace{10^{3},\cdots,500}_{\texttt{step}=-10^{2% }}].

(156)

Similarly, for a choice of $\displaystyle d_{1}=2000$ , we select the slice of the above list up to $\displaystyle 2000$ . Selecting the schedule is a manual operation and can be treated as a hyper-parameter. In our experiments, we observed that this schedule is sufficient to obtain insights on the NC1 metrics of $\displaystyle\mathbf{Q}^{(1)}$ . Thus, we leave the exploration of various annealing strategies as future work.
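The schedule of Eq. 156 and the warm-started loop described above can be sketched as follows. This is a simplified illustration rather than the code used for the experiments: `solve` is a hypothetical callable standing in for one root-finding call on the EoS residual, which is not reproduced here.

```python
def annealing_factors(d1):
    """Step-wise schedule of Eq. 156, truncated at the target width d1."""
    full = (
        list(range(100_000, 10_000, -10_000))  # 1e5, 9e4, ..., 2e4
        + list(range(10_000, 1_000, -1_000))   # 1e4, 9e3, ..., 2e3
        + list(range(1_000, 499, -100))        # 1e3, 9e2, ..., 500
    )
    if d1 not in full:
        raise ValueError("d1 must be one of the schedule values")
    return full[: full.index(d1) + 1]

def anneal(solve, c_init, factors):
    """Run the annealing loop, warm-starting each solve from the last result.

    `solve(c, factor)` is a hypothetical callable: it takes the current
    estimate of C and an annealing factor and returns the updated C.
    """
    c = c_init
    for factor in factors:
        c = solve(c, factor)
    return c

factors_500 = annealing_factors(500)    # full list of Eq. 156
factors_2000 = annealing_factors(2000)  # slice of the list up to 2000
```

In practice, `solve` could wrap a call such as `scipy.optimize.newton_krylov(lambda z: residual(z, factor), c)`, where `residual` is an assumed user-supplied implementation of the EoS, and `c_init` would be the GP limit value of $\displaystyle\mathbf{C}$ .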

$\displaystyle\bullet$ Comparing the spectrum of weight covariance matrices: Since $\displaystyle\mathbf{C}$ is subject to change while obtaining the stable state of the EoS, we analyze its initial and final (normalized) spectra for two different datasets with dimension $\displaystyle d_{0}=32$ and sample size $\displaystyle N=1024$ . Dataset 1: $\displaystyle\mathbf{x}^{1,i}\sim\mathcal{N}(-2*\mathbf{1}_{d_{0}},0.25*\mathbf{I}_{d_{0}}),i\in[N/2]$ , $\displaystyle\mathbf{x}^{2,j}\sim\mathcal{N}(2*\mathbf{1}_{d_{0}},0.25*\mathbf{I}_{d_{0}}),j\in[N/2]$ . Dataset 2: $\displaystyle\mathbf{x}^{c,i}\sim\mathcal{N}(\mathbf{0}_{d_{0}},4*\mathbf{I}_{d_{0}}),i\in[N/2],c\in[2]$ . The first dataset is our running example, and the second is pure random noise. Surprisingly, we observed that the EoS solution captures correlations in the data for both datasets, which is reflected in its final spectrum. In particular, the singular values shift from being constant at initialization to exhibiting a decay (Figure 7). Such a shift does not exactly match the case of the 2L-FCN because (1) the dynamics of GD differ from those of Newton-Krylov with annealing, and (2) we start with a GP-based initial value for $\displaystyle\mathbf{C}$ in the EoS. A rigorous analysis of the EoS dynamics is an open research direction (as also highlighted by Seroussi et al. [40]). Nonetheless, the EoS offers a richer, data-dependent setup than the UFM for analyzing the activations and weights.

Appendix H Additional Experiments

Compute Resources: All the experiments in this paper were executed on a machine with $\displaystyle 16$ GB of host memory and $\displaystyle 8$ CPU cores. Experiments with the EoS on datasets of varying dimensions and sample sizes took the longest, at $\displaystyle\approx 1$ hour each.

Appendix I Impact Statement

This paper aims to address the limitations of the unconstrained features model in understanding the role of data in the Neural Collapse phenomenon. Our work does not have any direct negative societal impact. On the contrary, it takes a step forward in understanding the characteristics of datasets that govern the performance of deep neural network classifiers, thereby laying the groundwork for theoretically informed training practices for wide societal use.

Appendix J Limitations and Future Work

In certain cases, we have observed that none of the kernel methods approximate the 2L-FCN reasonably well. One such instance is the following, where we sample $\displaystyle\mathbf{x}^{1,i}\sim\mathcal{N}(-2*\mathbf{1}_{d_{0}},4*\mathbf{I}_{d_{0}}),y^{1,i}=-1,i\in[N/2]$ for class $\displaystyle 1$ and $\displaystyle\mathbf{x}^{2,j}\sim\mathcal{N}(2*\mathbf{1}_{d_{0}},4*\mathbf{I}_{d_{0}}),y^{2,j}=1,j\in[N/2]$ for class $\displaystyle 2$ of our dataset. Essentially, these are scenarios where there is a significant overlap between samples of the two classes. First, we note that we had to increase the learning rate of our 2L-FCN from $\displaystyle 10^{-3}$ to $\displaystyle 5\cdot 10^{-3}$ and run GD for $\displaystyle 2000$ epochs for convergence. For dimensions $\displaystyle d_{0}\in\{8,16,32\}$ , the EoS reasonably approximates the 2L-FCN, but for $\displaystyle d_{0}\in\{64,128\}$ , the $\displaystyle\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ values for the 2L-FCN turned out to be almost twice as large as those of the EoS (see Figure 13). To this end, we leave modifications to the EoS for handling such noisy data cases and different activation functions as future work.

Additionally, we highlight the difficulties in the theoretical/empirical analysis of NC1 with the EoS. The primary bottleneck is the lack of a rigorous study on the existence and uniqueness of solutions (as also highlighted by [40]). Since we deviate from the lazy regime and deal with kernels in the feature learning setup, we cannot expect simple closed-form solutions for the EoS like those of the limiting NNGP/NTK. Moreover, numerical solutions to the EoS can sometimes be time-consuming and require a manual selection of the annealing schedule. This is a tradeoff that can be improved with future research. Furthermore, the role of scaling $\displaystyle N,d_{0},d_{1}$ on NC1 is yet to be fully understood, and we hope that our analysis lays the groundwork for such efforts.

Finally, we point the reader to Appendix F for a discussion on a relative NC1 metric that explicitly incorporates the variability collapse of the data vectors. In particular, we aim to differentiate between settings where the neural network learned meaningful features to classify complex datasets and settings where it simply leveraged already collapsed data vectors. Our results show that in higher dimensions, the data vectors are ‘more’ collapsed than the activations themselves, which exposes a limitation of the current NC1 metrics and encourages the exploration of richer variants.